# Session 1 - Demo Implementation

This notebook walks through some code examples for data exploration.

## Import libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn 

## Load data set

In [None]:
dataset = pd.read_csv('iris.csv')

*This small dataset from 1936 is often used for testing out machine learning algorithms and visualizations. Each row of the table represents an iris flower, including its species and dimensions of its botanical parts, sepal and petal, in centimeters.*

## Pandas DataFrames: Manipulating the dataset

### **The dataset is contained in a DataFrame**
According to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html):
> A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data.

The important term here is _tabular_. `DataFrame`s are the Excel sheets of Python! In other words, a `DataFrame` is a 2D indexed array.
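To make the terms in the quoted definition concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The name `toy` and its values are made up for illustration and are not taken from `iris.csv`.

```python
import pandas as pd

# Hypothetical measurements, made up for illustration (not read from iris.csv)
toy = pd.DataFrame({
    'sepal_length': [5.1, 7.0, 6.3],
    'species': ['setosa', 'versicolor', 'virginica'],
})

print(toy.shape)   # (number of rows, number of columns): the "two-dimensional" part
print(toy.dtypes)  # one data type per column: the "heterogeneous" part

# Adding a column changes the shape in place: the "size-mutable" part
toy['sepal_width'] = [3.5, 3.2, 3.3]
print(toy)
```

Each column keeps its own data type, which is what distinguishes a DataFrame from a plain 2D NumPy array.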

In [None]:
type(dataset)

When the last thing you write in a cell is a variable that refers to a DataFrame, Jupyter gives you an overview of the DataFrame.

In [None]:
dataset

**Display several lines of the data set:**

In [None]:
dataset.head()

In [None]:
dataset.tail(6) # specify how many lines to display

In [None]:
dataset.sample(n=5)

**Info on the dimensions of the data set:**

In [None]:
dataset.shape

**Pandas DataFrames are built on top of NumPy arrays**

In [None]:
dataset.values

In [None]:
type(dataset.values)

**Info about the columns**

In [None]:
dataset.info()

Notice the **Dtype** column on the right. It stands for **data type**. Pandas has its own data types. Have a look at [the documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes).

**List of column names:**

In [None]:
dataset.columns.values

## Pandas Series: Manipulating rows or columns

### Each column is held in a Pandas Series  

According to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) for `Series`:
> A Series is a one-dimensional ndarray with axis labels.

Notice how pandas leverages NumPy `ndarray`s as data structures. Also observe that a `Series` object wraps the array with "axis labels", also called an _index_.

In fact, each column in a DataFrame is just a named `Series`.

In [None]:
first_col = dataset['sepal_length']
first_col

In [None]:
type(first_col)

In [None]:
first_col.values
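To see the "array plus axis labels" idea in isolation, the following sketch builds a `Series` from a NumPy array with hand-picked labels. The values and the labels `'a'`, `'b'`, `'c'` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values and labels, for illustration only
values = np.array([5.1, 4.9, 4.7])
s = pd.Series(values, index=['a', 'b', 'c'], name='sepal_length')

print(s.index)    # the axis labels
print(s.values)   # the underlying NumPy array
print(s['b'])     # access a value by its label
print(s.iloc[1])  # access the same value by its position
```

With the default index used in this notebook, labels and positions happen to coincide, which is why the distinction is easy to miss.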

### Slicing with Series

In [None]:
first_col[1:4]

In [None]:
first_col[:4]

In [None]:
first_col[140:]

In [None]:
first_col[50:100:5]

In [None]:
first_col[10::-1]

### Rows can also be represented as Series

The following Series holds the attributes of the flower at `index 0`.

In [None]:
dataset.loc[0]

## Data Selection 

### Index

All DataFrames have an index.

In [None]:
dataset.index

This index is shared by all columns.

In [None]:
dataset['sepal_length'].index

### Selecting data with `.loc[]`
One of the most useful pandas data access methods is `.loc[]`. It is a _label_-based selection method. It can be used with one or two arguments:

    df.loc[row_label]
    df.loc[row_label, column_label]
You can write `:` in either position if you want to select _all labels_.

ℹ️ Use this notation when you want to **assign** new values to specific cells in the DataFrame. Chained indexing such as `df[column_label][row_label] = value` may not modify the original DataFrame.

In [None]:
dataset.loc[0]

In [None]:
dataset.loc[0, 'petal_width']

**We can also use it for slicing (unlike regular Python slices, the end label is included):**

In [None]:
dataset.loc[0:8]

**We can also slice with labels**

In [None]:
dataset.loc[:,'sepal_length':'petal_length']
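The note above about assignment can be sketched as follows. The DataFrame `toy` and the replacement values are hypothetical and only serve to show the two common patterns: assigning to a single cell, and assigning to every row that matches a condition.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    'petal_width': [0.2, 1.4, 2.5],
    'species': ['setosa', 'versicolor', 'virginica'],
})

# Assign to a single cell: row label 0, column label 'petal_width'
toy.loc[0, 'petal_width'] = 0.3

# Assign to all rows matching a boolean condition, in one column
toy.loc[toy['petal_width'] > 2, 'species'] = 'large virginica'

print(toy)
```

Both assignments pass the row selector and the column label to a single `.loc[]` call, so pandas writes directly into `toy`.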

### Selecting data with `.iloc[]`
`.iloc[]` works just like `.loc[]`, except with _indices_ instead of _labels_. That is, it accepts row and column positions as opposed to their names. Careful, it can get confusing when your index labels are numbers! It can be used with one or two arguments:

    df.iloc[row_index]
    df.iloc[row_index, column_index]

Here the row labels are simply the integer positions, so `.loc[0]` and `.iloc[0]` return the same row. Slices still differ: `.loc[0:8]` includes the row labeled 8, while `.iloc[0:8]` stops before position 8. Notice also how `.iloc[]` lets us select columns by their positions instead of their labels.

In [None]:
dataset.iloc[0:8,3:]
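The difference between labels and positions is easier to see when the two do not coincide. The sketch below uses a hypothetical DataFrame `toy` whose index labels (10, 20, 30, 40) are deliberately different from the row positions (0, 1, 2, 3).

```python
import pandas as pd

# Made-up data with integer labels that differ from the row positions
toy = pd.DataFrame({'sepal_length': [5.1, 4.9, 4.7, 4.6]},
                   index=[10, 20, 30, 40])

by_label = toy.loc[10:30]     # label slice: the end label 30 is included
by_position = toy.iloc[0:3]   # position slice: position 3 is excluded
print(by_label.equals(by_position))

# With a default integer index, the same slice gives a different number of rows
default = toy.reset_index(drop=True)
print(len(default.loc[0:2]))   # rows labeled 0, 1 and 2
print(len(default.iloc[0:2]))  # rows at positions 0 and 1
```

In `toy`, `toy.loc[0]` would raise a `KeyError` because no row is labeled 0, whereas `toy.iloc[0]` returns the first row.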

## Statistics

**Count number of null elements in the columns of the data set:**

In [None]:
dataset.isna().sum()

**Count number of 'setosa' in the 'species' column:**

In [None]:
nb_setosa = (dataset['species']=='setosa').sum()
print(nb_setosa)

**Compute the proportion of 'setosa' in the 'species' column:**

In [None]:
ratio = (dataset['species']=='setosa').sum() / len(dataset)
print(ratio)
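Both computations above rely on the fact that a comparison returns a boolean `Series`, and that `True` counts as 1 when summed. The sketch below shows this on a small made-up `species` Series, along with two equivalent shortcuts (`mean()` on the mask, and `value_counts(normalize=True)`).

```python
import pandas as pd

# Made-up labels, for illustration only
species = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica'])

mask = species == 'setosa'  # boolean Series: True where the label matches
print(mask.sum())           # number of True values
print(mask.mean())          # mean of the 0/1 values, i.e. the proportion

# Proportion of every distinct value at once
proportions = species.value_counts(normalize=True)
print(proportions)
```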

**Table of statistics on each feature:**

In [None]:
dataset.describe()

## Plot

### Simple plots

In [None]:
plt.plot(dataset['sepal_width'])
plt.show()

In [None]:
plt.scatter(dataset['sepal_length'], dataset['sepal_width'])
plt.show()

### Histograms/distributions

In [None]:
plt.hist(dataset['sepal_width'])

plt.xlabel('sepal width')
plt.ylabel('Count')
plt.xlim((0, 6))
plt.show()

In [None]:
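# kind='density' draws a kernel density estimate (requires SciPy)
dataset[['sepal_width', 'sepal_length']].plot(kind='density')

plt.xlabel('Size (cm)')  # both sepal width and sepal length are plotted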
plt.show()

In [None]:
dataset[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].boxplot()
plt.show()

### Scatter plot feature_1 vs. feature_2

In [None]:
plt.scatter(dataset['sepal_length'], dataset['sepal_width'])
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.show()

In [None]:
dataset['species'].unique()

In [None]:
unique_species = dataset['species'].unique()


for variety in unique_species:
    subset = dataset[dataset['species'] == variety]
    
    plt.scatter(subset['sepal_length'], subset['sepal_width'], label=variety)
    
plt.title('Sepal characteristics')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend()
plt.show()

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(dataset.iloc[:,:4], figsize = [8,8])
plt.show()

### Bars

Display the count of each value within the 'species' column:

In [None]:
df_plot = dataset.groupby(['species']).size()
df_plot.plot(kind='bar')
plt.ylabel('Count')
plt.show()

Now let's add a new column to the data set, holding a feature engineered from an existing one. Its values are booleans telling whether the sepal is long or not (i.e. above a length threshold or not).

In [None]:
dataset['long_sepal'] = dataset['sepal_length'] > 6
dataset.sample(n=7)
# dataset = dataset.drop(columns='long_sepal')

Display the distribution of long sepal vs. short sepal for each species:

In [None]:
# Count rows per (species, long_sepal) pair, then pivot long_sepal into columns
df_plot = dataset.groupby(['species', 'long_sepal']).size().unstack()
df_plot.plot(kind='bar', stacked=True, color=['orange', 'skyblue'], width=0.8)
plt.ylabel('Count')
plt.show()
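The table behind the stacked bar chart can be inspected without plotting. The sketch below reproduces the `groupby(...).size().unstack()` chain on a small made-up DataFrame; the `fill_value=0` argument is an addition here, so that combinations that never occur show up as 0 instead of `NaN`.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica', 'virginica'],
    'sepal_length': [5.0, 5.4, 6.5, 5.8, 7.1],
})
toy['long_sepal'] = toy['sepal_length'] > 6

# One count per (species, long_sepal) pair, stored in a Series with a two-level index
counts = toy.groupby(['species', 'long_sepal']).size()
print(counts)

# Move the 'long_sepal' level into the columns: one row per species
table = counts.unstack(fill_value=0)
print(table)
```

Each row of `table` becomes one bar in the chart, and each column becomes one colored segment of that bar.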

### Correlation matrix

In [None]:
correlation = dataset.iloc[:,:4].corr()
correlation
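By default, `DataFrame.corr()` computes the Pearson correlation coefficient for each pair of columns. As a sanity check, the sketch below computes that coefficient by hand on two made-up columns `x` and `y` and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                    'y': [2.0, 4.1, 5.9, 8.2]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are the deviations from each column's mean
dx = toy['x'] - toy['x'].mean()
dy = toy['y'] - toy['y'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy.corr().loc['x', 'y']
print(r_manual, r_pandas)
```

The diagonal of a correlation matrix is always 1, since every column is perfectly correlated with itself.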

**Short version with Seaborn:**

In [None]:
import seaborn as sns
sns.heatmap(correlation, annot=True)
plt.show()

**Long version with Matplotlib:**

In [None]:
fig = plt.figure(figsize=(6, 5), dpi=80)
ax = fig.add_subplot(1, 1, 1)
# Draw the correlation matrix as a colored grid and attach a colorbar
cax = ax.matshow(correlation, cmap=plt.cm.magma)
fig.colorbar(cax)
# Put one tick per feature and label the ticks with the column names
ticks = np.arange(0, 4)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
names = dataset.iloc[:, :4].columns
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

Colormaps: https://matplotlib.org/tutorials/colors/colormaps.html