# Data Analysis with Pandas

***Note***: this notebook contains cells with ***a*** solution. Remember there is not only one solution to a problem!  
You will recognise these cells as they start with **# %**.  
If you would like to see the solution, you will have to remove the **#** (which can be done by using **Ctrl** and **?**) and run the cell. If you want to run the solution code, you will have to run the cell again.

## Data Analysis Packages  
Data Scientists use a wide variety of libraries in Python that make working with data significantly easier. Those libraries primarily consist of:

| Package | Description |
| -- | -- |
| `NumPy` | Numerical calculations - does all the heavy lifting by passing out to C subroutines. This means you get _both_ the productivity of Python, _and_ the computational power of C. Best of both worlds! |
| `SciPy` | Scientific computing, statistical tests, and much more! |
| `pandas` | Your data manipulation Swiss Army knife. You'll likely see pandas used in any PyData demo! pandas is built on top of NumPy, so it's **fast**. |
| `matplotlib` | An old but powerful data visualisation package, inspired by Matlab. |
| `Seaborn` | A newer and easy-to-use but limited data visualisation package, built on top of matplotlib. |
| `scikit-learn` | Your one-stop machine learning shop! Classification, regression, clustering, dimensionality reduction and more. |
| `nltk` and `spacy` | nltk = natural language processing toolkit; spacy is a newer package for natural language processing but very easy to use. |
| `statsmodels` | Statistical tests, time series forecasting and more. The "model formula" interface will be familiar to R users. |
| `requests` and `Beautiful Soup` | `requests` + `Beautiful Soup` = great combination for building web scrapers. |
| `Jupyter` | Jupyter itself is a package too. See the latest version at https://pypi.org/project/jupyter/, and upgrade with e.g. `conda install jupyter==1.0.0` |

There are countless others available, though.

For today, we'll primarily focus on the library that does 99% of our work: `pandas`. But first, pandas is built on top of the speed and power of NumPy, so let's dig into that briefly.
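
As a rough illustration of what "built on top of NumPy" buys us, the sketch below doubles a few made-up numbers twice: once with a plain Python loop and once with a single NumPy expression. The variable names are only for illustration.

```python
import numpy as np

# A plain Python list and the equivalent NumPy array (made-up values)
values = [1.0, 2.0, 3.0, 4.0]
arr = np.array(values)

# Pure Python: an explicit loop over every element
doubled_list = [v * 2 for v in values]

# NumPy: one vectorised expression, the loop runs in compiled code
doubled_arr = arr * 2

print(doubled_list)
print(doubled_arr)
print(arr.mean(), arr.dtype)
```

pandas columns are backed by arrays of this kind, which is why expressions on whole columns are usually preferred over Python loops.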

___
## Imports

In [1]:
import pandas as pd

In [0]:
# Import numpy using the convention seen at the end of the first notebook.


In [0]:
         # %load ../solutions/02_.py
import numpy as np

___
## Loading the data

**There are different ways to get help. In Jupyter, putting a `?` in front of a name displays its documentation:**

In [4]:
?pd.read_csv
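
The `?` syntax only works in IPython and Jupyter. As a small sketch of two alternatives that work in any Python session, the built-in `help` function and the `__doc__` attribute give access to the same docstring:

```python
import pandas as pd

# Every documented function carries its docstring in the __doc__ attribute
doc = pd.read_csv.__doc__

# Print only the first line of the docstring to keep the output short
print(doc.strip().splitlines()[0])

# help() is plain Python and prints the full documentation (the output is long)
# help(pd.read_csv)
```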



To load the dataframe we are using in this notebook, we will provide the path to the file: `../data/Iris/Iris_data.csv`

In [0]:
# Load the file into a pandas DataFrame and assign it to df


In [3]:
         # %load ../solutions/02_.py
df = pd.read_csv('../data/Iris/Iris_data.csv')

**To have a look at the first 5 rows of df, we can use the *head* method.**

In [0]:
df.head()

In [0]:
# Have a look at the last 3 rows of df using the tail method


In [0]:
         # %load ../solutions/02_.py
df.tail(3)

___
## General information about the dataset

**To get the size of the dataset, we can use the *shape* attribute.**  
The first number is the number of rows, the second one is the number of columns.

In [0]:
# Show the shape of df (do not put brackets at the end)


In [0]:
         # %load ../solutions/02_.py
df.shape

**Get the names of the columns and info about them (number of non-null values and type):**

In [0]:
df.info()

**We can also get the columns of the dataframe:**

In [0]:
df.columns

**To get a list of the column names:**

In [0]:
df.columns.tolist()
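
To see these attributes on something small enough to check by eye, here is a minimal sketch on a made-up three-row dataframe (the name `demo` and its values are hypothetical, not taken from the Iris file):

```python
import pandas as pd

# Hypothetical toy data, small enough to verify by hand
demo = pd.DataFrame({"length": [5.1, 4.9, 6.3], "species": ["a", "a", "b"]})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# columns is an Index object; tolist() turns it into a plain Python list
column_names = demo.columns.tolist()
print(column_names)

# dtypes shows the type pandas inferred for each column
print(demo.dtypes)
```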

### Display settings

In [0]:
pd.options.display.max_rows

In [0]:
# Force pandas to display 25 rows


In [0]:
         # %load ../solutions/02_.py
pd.options.display.max_rows = 25
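
Assigning to `pd.options.display.max_rows` changes the setting for the rest of the session. If the change is only needed for one cell, `pd.option_context` restores the previous value automatically. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical dataframe with more rows than we want to display
demo = pd.DataFrame({"x": range(100)})

before = pd.get_option("display.max_rows")

# Inside the with block the option is temporarily set to 6
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    print(demo)

# After the block the previous value is back
after = pd.get_option("display.max_rows")
print(before, inside, after)
```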

___
## _Subsetting_
We can subset a dataframe by label, by position or a combination of both.  
There are different ways to do it, using `.loc`, `.iloc` and also `[]`. See the documentation:  
https://pandas.pydata.org/pandas-docs/stable/indexing.html

In [0]:
#  Display the 'SepalLengthCm' column


In [0]:
         # %load ../solutions/02_.py
df['SepalLengthCm']

*Note:* We could also use `df.SepalLengthCm`, but this is not a great idea because a column name can clash with the name of an existing DataFrame method or attribute.
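
A minimal sketch of the problem with attribute access, using a made-up dataframe that happens to have a column called `count` (which is also the name of a DataFrame method):

```python
import pandas as pd

# Hypothetical data: 'count' clashes with the DataFrame.count method
demo = pd.DataFrame({"count": [3, 1, 2], "width": [0.2, 0.4, 0.3]})

# Bracket access always returns the column
print(demo["count"])

# Attribute access works when there is no clash...
print(demo.width)

# ...but here it returns the method, not the column
print(callable(demo.count))
```

Bracket access also works for column names that contain spaces, which attribute access cannot handle.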

**Then look at the 12th observation:**

In [0]:
df.iloc[11]  # .iloc uses positions ("i" stands for integer)

In [0]:
df.loc[11]   # .loc uses index and column labels

In [0]:
df.iloc[120:125]

**Look at the 'SepalLengthCm' of the last three observations:**

In [0]:
df.iloc[-3:, 1]

In [0]:
df.loc[151:, 'SepalLengthCm']

**And finally look at the PetalLengthCm and PetalWidthCm of the 146th, the 8th and the 1st observations:**

In [0]:
df.iloc[[145, 7, 0], [3, -2]]

In [0]:
df.loc[[145, 7, 0], ['PetalLengthCm', 'PetalWidthCm']]

**!!WARNING!!**  Unlike Python slicing and ``.iloc``, a range specified with ``.loc`` **includes** the last index specified. 

In [0]:
df.iloc[5:10]

In [0]:
df.loc[5:10]
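
The difference is easier to see when the index labels are not the same as the positions. The sketch below uses a made-up dataframe whose index starts at 100 (names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: labels 100..104 sit at positions 0..4
demo = pd.DataFrame({"value": [10, 20, 30, 40, 50]},
                    index=[100, 101, 102, 103, 104])

# .iloc slices by position, end excluded: positions 1 and 2
by_position = demo.iloc[1:3]

# .loc slices by label, end included: labels 101, 102 and 103
by_label = demo.loc[101:103]

print(by_position)
print(by_label)
```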

___
## Filtering

**We can also use condition(s) to filter.**  
We want to display the rows of df where **PetalWidthCm** is greater than 2. We will start by creating a mask with this condition.

In [6]:
mask_PW = df['PetalWidthCm'] > 2
mask_PW

0      False
1      False
2      False
3      False
4      False
       ...  
149     True
150    False
151    False
152     True
153    False
Name: PetalWidthCm, Length: 154, dtype: bool

Note that this returns booleans. If we pass this mask to our dataframe, it will display only the rows where the mask is True.

In [7]:
df[mask_PW]

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
| -- | -- | -- | -- | -- | -- | -- |
| 102 | 102.0 | 6.3 | 3.3 | 6.0 | 2.5 | Iris-virginica |
| 104 | 104.0 | 7.1 | 3.0 | 5.9 | 2.1 | Iris-virginica |
| 106 | 106.0 | 6.5 | 3.0 | 5.8 | 2.2 | Iris-virginica |
| 107 | 107.0 | 7.6 | 3.0 | 6.6 | 2.1 | Iris-virginica |
| 111 | 111.0 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
| 115 | 115.0 | 6.8 | 3.0 | 5.5 | 2.1 | Iris-virginica |
| 117 | 117.0 | 5.8 | 2.8 | 5.1 | 2.4 | Iris-virginica |
| 118 | 118.0 | 6.4 | 3.2 | 5.3 | 2.3 | Iris-virginica |
| 120 | 120.0 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
| 121 | 121.0 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |


In [0]:
# Display the rows of df where PetalWidthCm is greater than 2 and PetalLengthCm is less than 5.5.


In [0]:
         # %load ../solutions/02_.py
mask_PW_PL = (df['PetalWidthCm'] > 2) & (df['PetalLengthCm'] < 5.5)
df[mask_PW_PL]
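
Masks are combined with `&` (and), `|` (or) and `~` (not), and each condition needs its own parentheses when written inline. A minimal sketch on made-up measurements (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical measurements, four rows only
demo = pd.DataFrame({"width": [0.2, 2.1, 2.5, 1.8],
                     "length": [1.4, 5.1, 6.0, 4.8]})

wide = demo["width"] > 2       # True for rows 1 and 2
short = demo["length"] < 5.5   # True for rows 0, 1 and 3

both = demo[wide & short]      # rows satisfying both conditions
either = demo[wide | short]    # rows satisfying at least one condition
not_wide = demo[~wide]         # rows where the first condition is False

print(both)
print(either)
print(not_wide)
```

The Python keywords `and`, `or` and `not` do not work element by element on a Series, which is why the symbols are used instead.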

___
## Values

**We can get the number of unique values from a certain column by using the *nunique* method.**  
For example, we can get the number of unique values from the Species column:

In [0]:
df['Species'].nunique()

**We can also get the list of unique values from a certain column by using the *unique* method.**

In [0]:
# Return the list of unique values from the Species column


In [0]:
         # %load ../solutions/02_.py
df['Species'].unique()

**To get the count of the different values of a column, we can use the *value_counts* method.**  
For example, for the Species column:

In [0]:
df['Species'].value_counts()

**If we want to know the count of NaN values, we have to pass the value *False* to the parameter *dropna* (set to *True* by default).**

In [0]:
# Return the count for each species, including the NaN values


In [0]:
         # %load ../solutions/02_.py
df['Species'].value_counts(dropna=False)

**To get the proportion instead of the count of these values, we have to pass the value *True* to the parameter *normalize*.**

In [0]:
# Return the proportion for each species


In [0]:
         # %load ../solutions/02_.py
df['Species'].value_counts(normalize=True)
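
The effect of the two parameters can be checked by hand on a tiny made-up Series with one missing value (the values below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical labels: two 'setosa', one 'virginica' and one missing value
species = pd.Series(["setosa", "setosa", "virginica", np.nan])

# Default: NaN is ignored
counts = species.value_counts()

# dropna=False: NaN gets its own line in the result
counts_with_nan = species.value_counts(dropna=False)

# normalize=True: proportions of the non-null values, summing to 1
proportions = species.value_counts(normalize=True)

print(counts)
print(counts_with_nan)
print(proportions)
```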

### NaN

**We can use the *isnull* method to know if a value is null or not. It returns booleans.**

In [0]:
df['PetalLengthCm'].isnull()

**We can apply different methods one after the other.**  
For example, we could apply the method *sum* after the method *isnull* to know the number of null observations in the PetalLengthCm column.

In [0]:
# Get the number of null values for PetalLengthCm


In [0]:
         # %load ../solutions/02_.py
df['PetalLengthCm'].isnull().sum()

In [0]:
# Using the index attribute, get the indexes of the observations without PetalLengthCm


In [0]:
         # %load ../solutions/02_.py
df[df['PetalLengthCm'].isnull()].index
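
Summing a boolean Series works because `True` counts as 1 and `False` as 0. A small sketch of the same chain on made-up data (`demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values, at index 1 and 3
demo = pd.DataFrame({"length": [1.4, np.nan, 5.1, np.nan]})

# Boolean Series: True where the value is missing
is_missing = demo["length"].isnull()

# True counts as 1, so the sum is the number of missing values
n_missing = is_missing.sum()

# Using the mask as a filter, then reading the index of what is left
missing_index = demo[is_missing].index.tolist()

print(n_missing)
print(missing_index)
```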

**Use the dropna method to remove the rows which only have NaN values.**

In [0]:
# Get the help for the dropna method


In [0]:
         # %load ../solutions/02_.py
?pd.DataFrame.dropna

In [0]:
# Use the dropna method to remove the rows of df which only have NaN values, and assign the result to df_2


In [0]:
         # %load ../solutions/02_.py
df_2 = df.dropna(how='all')

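**f-string**  
An f-string is a string prefixed with `f`: any expression written between curly braces is evaluated and its value is inserted into the text.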

In [0]:
print(f'shape of df: {df.shape}')

In [0]:
# print the number of rows of df_2 using an f-string


In [0]:
         # %load ../solutions/02_.py
print(f'number of rows of df_2: {df_2.shape[0]}')
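
The braces of an f-string can hold any expression, and an optional format specification after a colon controls how the value is displayed. A short sketch with made-up numbers:

```python
# Hypothetical values, not taken from the dataset
n_rows, n_cols = 3, 2

# Variables and expressions are evaluated inside the braces
message = f"{n_rows} rows and {n_cols} columns, {n_rows * n_cols} cells"

# A format specification after the colon: two decimal places
ratio = f"{2 / 3:.2f}"

print(message)
print(ratio)
```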

**dropna method -> link to **

In [0]:
# Use the dropna method to remove the rows of df_2 which have at least one NaN value, and assign the result to df_3


In [0]:
         # %load ../solutions/02_.py
df_3 = df_2.dropna(how='any')

In [0]:
# print the number of rows of df_3 using an f-string
print(f'number of rows of df_3: {df_3.shape[0]}')
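
The two values of the `how` parameter behave quite differently, which is easy to verify on a made-up dataframe with one complete row, one partially empty row and one completely empty row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 has one NaN, row 2 is all NaN
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

# how='all' drops a row only if every value in it is NaN
no_empty_rows = demo.dropna(how="all")

# how='any' (the default) drops a row as soon as one value is NaN
complete_rows = demo.dropna(how="any")

print(no_empty_rows)
print(complete_rows)
```

In both cases `dropna` returns a new dataframe and leaves the original untouched.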

### Duplicates

**The *drop_duplicates* method returns a new dataframe without the duplicated rows.**

In [0]:
# Remove the duplicate rows from df_3, and assign the new dataframe to df_4


In [0]:
         # %load ../solutions/02_.py
df_4 = df_3.drop_duplicates()

In [0]:
# checking the shape of df_4
df_4.shape
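
Before dropping anything, `duplicated` can be used to count how many rows would be removed. The sketch below also shows the `subset` and `keep` parameters on a made-up dataframe (names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical
demo = pd.DataFrame({"id": [1, 1, 2, 2],
                     "species": ["a", "a", "b", "c"]})

# duplicated() flags every repeat of an earlier row
n_duplicates = demo.duplicated().sum()

# By default whole rows are compared and the first occurrence is kept
deduplicated = demo.drop_duplicates()

# subset restricts the comparison to some columns, keep picks which row survives
one_per_id = demo.drop_duplicates(subset="id", keep="last")

print(n_duplicates)
print(deduplicated)
print(one_per_id)
```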

___
## _Some stats_

In [0]:
# Use the describe method to see how the data is distributed (numerical features only!)


In [0]:
         # %load ../solutions/02_.py
df_4.describe()

In [0]:
df_4

We can convert the **Id** column to string:

In [0]:
df_4['Id'] = df_4['Id'].astype('str')

In [0]:
df_4.describe()

We can also convert the **Species** column to the category type to save memory space.

In [0]:
df_4['Species'] = df_4['Species'].astype('category')

In [0]:
# Using the dtypes attribute, check the types of the columns of df_4


In [0]:
         # %load ../solutions/02_.py
df_4.dtypes
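
To get a feel for the memory saving, `memory_usage(deep=True)` can be compared before and after the conversion. The sketch below uses a made-up Series of 3000 repeated labels. The exact byte counts depend on the pandas version, so only the comparison matters:

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated 1000 times each
labels = pd.Series(["Iris-setosa", "Iris-versicolor", "Iris-virginica"] * 1000)

# A category column stores each distinct label once, plus small integer codes
as_category = labels.astype("category")

# deep=True also counts the memory used by the strings themselves
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
print(as_category.cat.categories.tolist())
```

The saving is largest when a column has few distinct values repeated many times, as is the case for Species.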

**We can also use the methods count(), mean(), sum(), median(), std(), min() and max() separately if we are only interested in one of those.**

In [0]:
# Get the minimum for each numerical column of df_4


In [0]:
         # %load ../solutions/02_.py
df_4.min(numeric_only=True)  # skip the string and category columns

In [0]:
# Calculate the maximum of the PetalLengthCm


In [0]:
         # %load ../solutions/02_.py
df_4['PetalLengthCm'].max()

**We can also get information for each type of flower using the groupby method.**  

We'll get the median for each species.

In [0]:
df_4.groupby('Species').median(numeric_only=True)  # Id is now a string, so restrict to the numerical columns
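
`groupby` splits the rows by group, applies a calculation to each group and combines the results. A minimal sketch on made-up data, selecting one column and then asking for several statistics at once with `agg`:

```python
import pandas as pd

# Hypothetical data: two rows of species 'a', three rows of species 'b'
demo = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "length": [1.0, 3.0, 4.0, 5.0, 9.0],
})

# One statistic per group
medians = demo.groupby("species")["length"].median()

# Several statistics per group in one call
summary = demo.groupby("species")["length"].agg(["count", "mean", "max"])

print(medians)
print(summary)
```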

### Correlation between the numerical features

In [0]:
df_4.corr(numeric_only=True)  # correlations are only computed for the numerical columns
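
By default `corr` computes the Pearson correlation between every pair of columns, giving values between -1 and 1. A small sketch on made-up columns where the expected result is obvious:

```python
import pandas as pd

# Hypothetical data: y grows with x, z shrinks as x grows
demo = pd.DataFrame({"x": [1, 2, 3, 4],
                     "y": [2, 4, 6, 8],
                     "z": [4, 3, 2, 1]})

corr = demo.corr()
print(corr)

# A single pair can be read from the matrix with .loc
x_y = corr.loc["x", "y"]   # perfectly correlated, close to 1
x_z = corr.loc["x", "z"]   # perfectly anti-correlated, close to -1
```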

### Saving the dataframe as a csv file

In [0]:
df_4.to_csv('../data/my_data/my_iris.csv')
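
By default `to_csv` also writes the index as a first, unnamed column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference with `index=False`, writing to an in-memory buffer instead of a real file (the data and names are hypothetical):

```python
import io

import pandas as pd

# Hypothetical data
demo = pd.DataFrame({"length": [5.1, 4.9], "species": ["a", "b"]})

# Default: the index is written as an extra column without a name
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the actual columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded.columns.tolist())
```

Whether to keep the index depends on whether it carries information. A default 0, 1, 2, ... index usually does not.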