# Data Analysis with Pandas

***Note***: this notebook contains cells with ***a*** solution. Remember, there is not only one solution to a problem!  
You will recognise these cells as they start with **# %**.  
If you would like to see the solution, you will have to remove the **#** (which can be done by using **Ctrl** and **?**) and run the cell. If you want to run the solution code, you will have to run the cell again.

## Data Analysis Packages  
Data Scientists use a wide variety of libraries in Python that make working with data significantly easier. Those libraries primarily consist of:

| Package | Description |
| -- | -- |
| `NumPy` | Numerical calculations - does all the heavy lifting by passing out to C subroutines. This means you get _both_ the productivity of Python, _and_ the computational power of C. Best of both worlds! |
| `SciPy` | Scientific computing, statistical tests, and much more! |
| `pandas` | Your data manipulation swiss army knife. You'll likely see pandas used in any PyData demo! pandas is built on top of NumPy, so it's **fast**. |
| `matplotlib` | An old but powerful data visualisation package, inspired by Matlab. |
| `Seaborn` | A newer and easy-to-use but limited data visualisation package, built on top of matplotlib. |
| `scikit-learn` | Your one-stop machine learning shop! Classification, regression, clustering, dimensionality reduction and more. |
| `nltk` and `spacy` | nltk = natural language processing toolkit; spacy is a newer package for natural language processing but very easy to use. |
| `statsmodels` | Statistical tests, time series forecasting and more. The "model formula" interface will be familiar to R users. |
| `requests` and `Beautiful Soup` | `requests` + `Beautiful Soup` = great combination for building web scrapers. |
| `Jupyter` | Jupyter itself is a package too. See the latest version at https://pypi.org/project/jupyter/, and upgrade with e.g. `conda install jupyter==1.0.0` |

Countless other packages are available.

Today, we'll focus on the library that makes up 99% of our work: `pandas`. It is built on top of NumPy and inherits its speed and power.

___
## Imports

In [0]:
import pandas as pd

>Import numpy using the convention seen at the end of the first notebook.

In [0]:
# %load ../solutions/02_01.py

___
## Loading the data

To see a method's documentation, you can use the help function. In a Jupyter notebook, you can also put a question mark before the method.

In [0]:
?pd.read_csv

To load the dataframe we are using in this notebook, we will provide the path to the file: ../data/Iris/Iris_data.csv
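
As a minimal sketch of how `pd.read_csv` behaves, the example below reads a small in-memory CSV instead of a file on disk. The CSV text and the name `toy` are made up for illustration; only the column names mirror the ones used in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "Id,SepalLengthCm,Species\n1,5.1,Iris-setosa\n2,7.0,Iris-versicolor\n"

# read_csv accepts a path or any file-like object and returns a DataFrame.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

The first line of the CSV becomes the column names, and each following line becomes a row.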

>Load the file, read it into a pandas DataFrame and assign it to df

In [0]:
# %load ../solutions/02_02.py

**To have a look at the first 5 rows of df, we can use the *head* method.**

In [0]:
df.head()

>Have a look at the last 3 rows of df using the tail method

In [0]:
# %load ../solutions/02_03.py

___
## General information about the dataset

**To get the size of the dataset, we can use the *shape* attribute.**  
The first number is the number of rows, the second one is the number of columns.
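
A small sketch on a made-up DataFrame (the name `toy` and its values are assumptions) shows that `shape` is a plain tuple that can be unpacked:

```python
import pandas as pd

# Toy data: 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is an attribute (no brackets) holding (rows, columns).
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # → 3 2
```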

>Show the shape of df (do not put brackets at the end)

In [0]:
# %load ../solutions/02_04.py

>Get the names of the columns and info about them (number of non null and type) using the info method.

In [0]:
# %load ../solutions/02_05.py

>Get the columns of the dataframe using the columns attribute.

In [0]:
# %load ../solutions/02_06.py

### Display settings

We can check the display options of the notebook.

In [0]:
pd.options.display.max_rows
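
Display options can also be read and changed by name. The sketch below uses `pd.get_option` and `pd.option_context` to change `display.max_rows` temporarily; the value 10 is an arbitrary choice for illustration.

```python
import pandas as pd

# Read the current setting by its option name.
default_rows = pd.get_option("display.max_rows")

# option_context changes a setting only inside the with-block.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")

# Outside the block, the previous value is restored.
after = pd.get_option("display.max_rows")
print(default_rows, inside, after)
```

A temporary change like this is handy when only one cell needs a different setting.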

>Force pandas to display 25 rows by changing the value of the above.

In [0]:
# %load ../solutions/02_07.py

___
## _Subsetting_
We can subset a dataframe by label, by position or a combination of both.  
There are different ways to do it, using .loc, .iloc and also [].  
See the [documentation](https://pandas.pydata.org/pandas-docs/stable/indexing.html).
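
The three access styles can be sketched on a made-up DataFrame with string row labels, so that labels and positions are easy to tell apart. The name `toy` and its columns are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"length": [5.1, 4.9, 6.3], "width": [3.5, 3.0, 3.3]},
    index=["a", "b", "c"],
)

col = toy["length"]               # [] with a column name returns a Series
by_label = toy.loc["b", "width"]  # .loc takes a row label and a column label
by_pos = toy.iloc[1, 1]           # .iloc takes a row position and a column position
print(by_label, by_pos)  # → 3.0 3.0
```

Here row label `"b"` sits at position 1, so both lookups reach the same cell.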

>Display the 'SepalLengthCm' column

In [0]:
# %load ../solutions/02_08.py

*Note:* We could also use `df.SepalLengthCm`, but this is not a great idea because a column name could clash with a DataFrame method or attribute.

>Have a look at the 12th observation:

In [0]:
# using .iloc (uses positions, "i" stands for integer)


In [0]:
# %load ../solutions/02_09.py

In [0]:
# using .loc (uses indexes and labels)


In [0]:
# %load ../solutions/02_10.py

>Display the ***SepalLengthCm*** of the last three observations.

In [0]:
# using .iloc


In [0]:
# %load ../solutions/02_11.py

In [0]:
# using .loc


In [0]:
# %load ../solutions/02_12.py

**And finally look at the PetalLengthCm and PetalWidthCm of the 146th, the 8th and the 1st observations:**

In [0]:
# using .iloc


In [0]:
# %load ../solutions/02_13.py

In [0]:
# using .loc


In [0]:
# %load ../solutions/02_14.py

**!!WARNING!!**  Unlike Python and ``.iloc``, the end value in a range specified by ``.loc`` **includes** the last index specified. 

In [0]:
df.iloc[5:10]

In [0]:
df.loc[5:10]
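
The difference is easy to check by counting rows. This sketch uses a made-up DataFrame with the default integer index, so labels and positions happen to coincide.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(20)})  # default index: 0 to 19

by_position = toy.iloc[5:10]  # positions 5, 6, 7, 8, 9 (end excluded)
by_label = toy.loc[5:10]      # labels 5, 6, 7, 8, 9 and 10 (end included)
print(len(by_position), len(by_label))  # → 5 6
```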

___
## Filtering

**We can also use condition(s) to filter.**  
We want to display the rows of df where **PetalWidthCm** is greater than 2. We will start by creating a mask with this condition.

In [0]:
mask_PW = df['PetalWidthCm'] > 2
mask_PW

Note that this returns booleans. If we pass this mask to our dataframe, it will display only the rows where the mask is True.

In [0]:
df[mask_PW]
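
Masks can be combined with `&` (and), `|` (or) and `~` (not). The sketch below uses a made-up DataFrame; the column names and thresholds are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"width": [0.2, 2.1, 2.5, 1.8], "length": [1.4, 5.0, 6.0, 5.8]})

wide = toy["width"] > 2       # boolean Series
short = toy["length"] < 5.5   # boolean Series

# Element-wise AND of two masks.
both = toy[wide & short]

# Inline comparisons need parentheses because & and | bind tighter than > and <.
either = toy[(toy["width"] > 2) | (toy["length"] < 5.5)]
print(both)
print(either)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the operators are used instead.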

>Display the rows of df where ***PetalWidthCm*** is greater than 2 and ***PetalLengthCm*** is less than 5.5.

In [0]:
# %load ../solutions/02_15.py

___
## Values

**We can get the number of unique values from a certain column by using the *nunique* method.**  
For example, we can get the number of unique values from the Species column:

In [0]:
df['Species'].nunique()

**We can also get the list of unique values from a certain column by using the *unique* method.**

>Return the list of unique values from the Species column

In [0]:
# %load ../solutions/02_16.py

**To get the count of the different values of a column, we can use the *value_counts* method.**  
For example, for the Species column:

In [0]:
df['Species'].value_counts()

**If we want to know the count of NaN values, we have to pass the value *False* to the parameter *dropna* (set to *True* by default).**

>Return the count for each species, including NaN values

In [0]:
# %load ../solutions/02_17.py

**To get the proportion instead of the count of these values, we have to pass the value *True* to the parameter *normalize*.**
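
The effect of the `dropna` and `normalize` parameters can be sketched on a small made-up Series (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan])

counts = s.value_counts()                 # NaN is ignored by default
with_nan = s.value_counts(dropna=False)   # NaN gets its own entry
shares = s.value_counts(normalize=True)   # fractions of the non-null values
print(counts, with_nan, shares, sep="\n\n")
```

With `normalize=True` the results sum to 1, so `"a"` accounts for two thirds of the non-null values here.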

>Return the proportion for each species

In [0]:
# %load ../solutions/02_18.py

### NaN

**We can use the *isnull* method to know if a value is null or not. It returns booleans.**

In [0]:
df['PetalLengthCm'].isnull()

**We can apply different methods one after the other.**  
For example, we could apply the method *sum* after the method *isnull* to know the number of null observations in the PetalLengthCm column.
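
Chaining works because `isnull` returns a boolean Series and `True` counts as 1 when summed. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.4, np.nan, 5.1, np.nan])

n_missing = s.isnull().sum()      # each True counts as 1
missing_at = s[s.isnull()].index  # index labels of the missing values
print(n_missing, list(missing_at))  # → 2 [1, 3]
```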

>Get the number of null values for ***PetalLengthCm***.

In [0]:
# %load ../solutions/02_19.py

>Using the index attribute, get the indexes of the observation without PetalLengthCm

In [0]:
# %load ../solutions/02_20.py

**Use the dropna method to remove the row which only has nan values.**
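
The `how` parameter of `dropna` controls which rows are removed. The sketch below compares `how="all"` with the default `how="any"` on a made-up DataFrame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 3.0, np.nan]})

# how="all" drops a row only if every value in it is NaN (row 2 here).
only_empty_dropped = toy.dropna(how="all")

# The default, how="any", drops a row as soon as one value is NaN (rows 1 and 2).
any_nan_dropped = toy.dropna()
print(len(only_empty_dropped), len(any_nan_dropped))  # → 2 1
```

Neither call modifies `toy`; each returns a new DataFrame.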

>Get the help for the dropna method.

In [0]:
# %load ../solutions/02_21.py

>Use the dropna method to remove the row of df which only has nan values, and assign it to df_2.

In [0]:
# %load ../solutions/02_22.py

We can use an f-string to format a string. We have to write an ***f*** before the opening quotation mark, and write the expression we want to format between curly brackets.

In [0]:
print(f'shape of df: {df.shape}')
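
Any expression can go between the curly brackets, and a format specifier can follow a colon. The values below are made up for illustration.

```python
rows, cols = 150, 6  # hypothetical values
ratio = 2 / 3

# {ratio:.2f} rounds the number to two decimal places.
message = f"{rows} rows, {cols} columns, ratio {ratio:.2f}"
print(message)  # → 150 rows, 6 columns, ratio 0.67
```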

>Print the number of rows of df_2 using an f-string

In [0]:
# %load ../solutions/02_23.py

>Use the [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method to remove the rows of df_2 which only have nan values, and assign the result to df_3

In [0]:
# %load ../solutions/02_24.py

>Print the number of rows of df_3 using an f-string.

In [0]:
# %load ../solutions/02_25.py

### Duplicates

>Remove the duplicate rows from df_3, and assign the new dataframe to df_4

In [0]:
# %load ../solutions/02_26.py

In [0]:
# checking the shape of df_4
df_4.shape
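
As a sketch of how duplicate detection behaves, the example below uses `duplicated` and `drop_duplicates` on a made-up DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

is_dup = toy.duplicated()        # True for a row identical to an earlier one
deduped = toy.drop_duplicates()  # keeps the first occurrence by default
print(list(is_dup), deduped.shape)  # → [False, True, False] (2, 2)
```

A row counts as a duplicate only if all of its columns match an earlier row.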

___
## _Some stats_

>Use the describe method to see how the data is distributed (numerical features only!)

In [0]:
# %load ../solutions/02_27.py

We can convert the **Id** column to string:

In [0]:
df_4['Id'] = df_4['Id'].astype('str')

In [0]:
df_4.describe()

We can also convert the **Species** column to the category type to save memory.

In [0]:
df_4['Species'] = df_4['Species'].astype('category')
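
The memory saving can be checked with `memory_usage`. This sketch uses a made-up Series of repeated labels; the exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Three distinct labels repeated many times.
labels = pd.Series(["setosa", "versicolor", "virginica"] * 1000)
as_category = labels.astype("category")

# deep=True also counts the memory used by the string contents.
before = labels.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(before, after)
```

A category column stores each distinct label once plus a small integer code per row, which is why it helps most when few values repeat often.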

>Using the dtypes attribute, check the types of the columns of df_4

In [0]:
# %load ../solutions/02_28.py

We can also use the methods count(), mean(), sum(), median(), std(), min() and max() separately if we are only interested in one of those statistics.

>Get the minimum for each numerical column of df_4

In [0]:
# %load ../solutions/02_29.py

>Calculate the maximum of the ***PetalLengthCm***

In [0]:
# %load ../solutions/02_30.py

**We can also get information for each type of flower using the groupby method.**
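
A minimal sketch of a single statistic and a grouped statistic, on a made-up DataFrame (the column names `species` and `length` are assumptions):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "species": ["a", "a", "a", "b", "b", "b"],
        "length": [1.0, 2.0, 6.0, 4.0, 5.0, 9.0],
    }
)

overall_max = toy["length"].max()                    # one value for the whole column
medians = toy.groupby("species")["length"].median()  # one value per group
print(overall_max)  # → 9.0
print(medians)
```

The result of the grouped call is a Series indexed by the group labels.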

>Get the median for each ***Species***.

In [0]:
# %load ../solutions/02_31.py

### Saving the dataframe as a csv file

>Save df_4 using this path: '../data/my_data/my_iris.csv'

In [0]:
# %load ../solutions/02_32.py
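
Saving and reloading can be sketched with a temporary directory, so nothing is left on disk. The file name `toy.csv` and the data are made up for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")  # hypothetical file name
    toy.to_csv(path, index=False)        # index=False leaves out the row labels
    restored = pd.read_csv(path)

print(restored)
```

Without `index=False`, the row labels are written as an extra unnamed column, which then shows up when the file is read back.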