<img src="https://s8.hostingkartinok.com/uploads/images/2018/08/308b49fcfbc619d629fe4604bceb67ac.jpg" width=500 height=450>
<h3 style="text-align: center;"><b>Phystech School of Applied Mathematics and Informatics (PSAMI) MIPT</b></h3>

<h2 style="text-align: center;"><b>Pandas</b></h2>

<img align=left src="https://cdn.fedoramagazine.org/wp-content/uploads/2015/11/Python_logo.png" width=450>

<img align=center src="https://i1.wp.com/www.datapluspeople.com/wp-content/uploads/2018/04/pandas_logo-1080x675.jpg?resize=1080%2C675&ssl=1" width=320>

Let's take a look at pandas.

The `pandas` library is widely used in modern data science for working with data that can be represented as tables (and that covers a very large share of real-world data).

This library ships with the `Anaconda` distribution, but if you don’t have it for some reason, you can install it by uncommenting the following command:

In [None]:
#!pip install pandas

In [None]:
import numpy as np
import pandas as pd

Now we need some data. Have you heard of the App Store? :)

### Download the data: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

We are interested in the file `AppleStore.csv`.

`.csv` (Comma-Separated Values) is probably the most common data format in modern data science. Like `.xls` and `.xlsx`, it stores a table, but as plain text: one row per line, with values separated by commas.

So, let's read the data file:

In [None]:
data = pd.read_csv('./AppleStore.csv')
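
To make the format concrete, here is a minimal sketch that parses a tiny CSV held in a string. The three rows are invented for illustration and are not taken from the App Store dataset; only the column names `id`, `track_name` and `price` are borrowed from it.

```python
import io

import pandas as pd

# Hypothetical miniature CSV; the values are invented for illustration.
csv_text = """id,track_name,price
1,Alpha,0.0
2,Beta,1.99
3,Gamma,4.99
"""

# read_csv accepts a path or any file-like object, so StringIO works too.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.dtypes)  # column types are inferred from the text
```

The first line of the text becomes the column names, and every following line becomes one row.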

Let’s see what the `data` variable is:

In [None]:
data.head()  # head() shows the first five rows by default

In [None]:
type(data)

The type is `pandas.core.frame.DataFrame`, usually called simply a data frame: a table with labeled rows and columns.

The `info()` method prints a summary of the data frame (column names, non-null counts, data types, memory usage):

In [None]:
data.info()

The `Unnamed: 0` column carries no information about the applications (it appears to be a row index that was saved into the file), so we drop it:

In [None]:
data.drop(columns=['Unnamed: 0'], inplace=True)
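
A column named `Unnamed: 0` typically shows up when a data frame was saved together with its row index. Whether that is how this particular file was produced is an assumption; the sketch below only reproduces the effect on an invented two-row frame and shows `index_col=0` as an alternative to dropping the column afterwards.

```python
import io

import pandas as pd

# Hypothetical two-row frame, invented for illustration.
toy = pd.DataFrame({"track_name": ["Alpha", "Beta"], "price": [0.0, 1.99]})

# to_csv writes the row index by default, under an empty header.
csv_text = toy.to_csv()

# Reading the text back turns that nameless column into 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(with_extra.columns.tolist())

# index_col=0 uses the first column as the index instead of keeping it.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```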

So, our data describes ***applications in the AppStore***. Let's see what it contains:

In [None]:
# names of all columns in the data frame
data.columns

Summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns:

In [None]:
data.describe()

All values of the data frame:

In [None]:
# each row describes one object, in this case an application from the AppStore
data.values

In [None]:
data.values[0]

In [None]:
data.shape

-- that is, `(rows, columns)`. The downloaded file has 7197 rows and 17 columns; once `Unnamed: 0` has been dropped, 16 columns remain.

It is important to be able to access a specific row or column of the data frame. Indexing is similar to NumPy, but there are some subtleties:

* Get an entire column (in this case the column 'track_name'):

In [None]:
data['track_name'] 

In [None]:
type(data['track_name'])

The `track_name` of the first object:

In [None]:
data['track_name'][0]

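A single column such as `data['track_name']` has type `pandas.core.series.Series` -- a one-dimensional labeled array, that is, one column of the data frame.

* Get a specific row:

A slice such as `data[10:11]` has type DataFrame, because it is just a slice of the original data frame, while `data.iloc[10]` returns the same row as a Series: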

In [None]:
data[10:11]

In [None]:
type(data[10:11])

In [None]:
data.iloc[10]

In [None]:
type(data.iloc[10])
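
The difference between the two results can be checked on a small frame. This is a minimal sketch with invented rows; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical three-row frame, invented for illustration.
toy = pd.DataFrame(
    {"track_name": ["Alpha", "Beta", "Gamma"], "price": [0.0, 1.99, 4.99]}
)

one_row_frame = toy[1:2]      # slicing keeps the table shape: DataFrame
one_row_series = toy.iloc[1]  # a single position collapses to a Series

print(type(one_row_frame).__name__, one_row_frame.shape)
print(type(one_row_series).__name__, one_row_series.shape)

# In the Series, the former column names become the index labels.
print(one_row_series["track_name"])
```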

`iloc`

* access rows and columns by integer position

-- `data.iloc[i, j]`, where `i` is the position of the row and `j` is the position of the column (both counted from 0)

In [None]:
data.iloc[50, 1]

In [None]:
data.iloc[[50, 100], [1, 2]]

In [None]:
data.iloc[50:60, 1:5]
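
The three forms of `iloc` used above (single cell, lists of positions, slices) can be sketched on an invented four-row frame; as in NumPy, slices exclude their end position.

```python
import pandas as pd

# Hypothetical four-row frame, invented for illustration.
toy = pd.DataFrame(
    {
        "id": [11, 12, 13, 14],
        "track_name": ["Alpha", "Beta", "Gamma", "Delta"],
        "price": [0.0, 1.99, 4.99, 0.0],
    }
)

cell = toy.iloc[2, 1]              # row 2, column 1: a single value
picked = toy.iloc[[0, 3], [1, 2]]  # lists pick arbitrary rows and columns
block = toy.iloc[1:3, 0:2]         # slices exclude the end: rows 1-2, columns 0-1

print(cell)
print(picked)
print(block)
```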

`loc`

* access rows and columns by label

-- `data.loc[name_r, name_c]`, where `name_r` is the label of the row (its index value) and `name_c` is the name of the column

In [None]:
data.loc[0, 'track_name']

In [None]:
data.loc[[0, 1, 2], ['track_name', 'id']]
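
In `data`, the row labels happen to coincide with the row positions, which hides the difference between `loc` and `iloc`. The sketch below uses an invented frame with string labels to make it visible; note that a label slice in `loc` includes its end, unlike a position slice in `iloc`.

```python
import pandas as pd

# Hypothetical frame with string row labels, invented for illustration.
toy = pd.DataFrame(
    {"track_name": ["Alpha", "Beta", "Gamma", "Delta"], "price": [0.0, 1.99, 4.99, 0.0]},
    index=["a", "b", "c", "d"],
)

by_label = toy.loc["b", "track_name"]  # row labeled "b"
by_position = toy.iloc[1, 0]           # same cell, addressed by position

label_slice = toy.loc["a":"c"]   # label slices include the end: 3 rows
position_slice = toy.iloc[0:2]   # position slices exclude the end: 2 rows

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```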

But usually you need to be able to answer more meaningful questions, for example:

* How many applications have a user rating of at least 4?

In [None]:
len(data[data['user_rating'] >= 4])

* Which currencies appear in the `currency` column?

In [None]:
np.unique(data['currency'])
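
The same question can also be answered with pandas' own Series methods. A minimal sketch on an invented Series (the values are not taken from the dataset): `unique()` lists the distinct values and `value_counts()` additionally counts them.

```python
import numpy as np
import pandas as pd

# Hypothetical currency column, invented for illustration.
currency = pd.Series(["USD", "EUR", "USD", "USD", "EUR"])

print(np.unique(currency))      # distinct values, sorted
print(currency.unique())        # distinct values, in order of first appearance
print(currency.value_counts())  # how many times each value occurs
```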

* How many games are there? How many games are there with a rating of at least 4?

In [None]:
# 1
len(data[data['prime_genre'] == 'Games'])

In [None]:
# 2
len(data[(data['user_rating'] >= 4) & (data['prime_genre'] == 'Games')])

The "OR" operator works the same way as "AND", but is written as `|` instead of `&`. Each condition must be wrapped in parentheses:

In [None]:
len(data[(data['user_rating'] >= 4) | (data['prime_genre'] == 'Games')])
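
Each comparison above produces a boolean Series (a mask) with one `True`/`False` per row, and `&`, `|` and `~` combine masks element-wise. The sketch below uses four invented rows. The parentheses matter because `&` and `|` bind more tightly than `>=` and `==` in Python.

```python
import pandas as pd

# Hypothetical four-row frame, invented for illustration.
toy = pd.DataFrame(
    {
        "prime_genre": ["Games", "Games", "Music", "Games"],
        "user_rating": [4.5, 3.0, 4.0, 4.0],
    }
)

high = toy["user_rating"] >= 4         # boolean mask, one value per row
games = toy["prime_genre"] == "Games"  # another mask

n_and = (high & games).sum()  # True counts as 1, so sum() counts matching rows
n_or = (high | games).sum()
n_not_games = (~games).sum()  # ~ negates a mask

print(n_and, n_or, n_not_games)
```

Summing a mask gives the same number as `len(toy[mask])`, without building the filtered frame.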

A column can also be converted to a NumPy array:

In [None]:
np.array(data['size_bytes'])

In [None]:
data['size_bytes'].values
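
Besides `np.array(...)` and `.values`, a Series has a `to_numpy()` method, which the pandas documentation recommends over `.values`. A minimal sketch on an invented Series showing that all three give the same array:

```python
import numpy as np
import pandas as pd

# Hypothetical sizes, invented for illustration.
sizes = pd.Series([100, 250, 400], name="size_bytes")

via_np_array = np.array(sizes)
via_values = sizes.values
via_to_numpy = sizes.to_numpy()

print(type(via_to_numpy).__name__)
print(via_to_numpy.mean())  # ordinary NumPy operations now apply
```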