# 012-02 - ML Basics - Solution Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

## About 

Context : 

You need to know a few core concepts of 'tabular' machine learning before moving on to NLP.

Data  : 

**You can find the dataset [here](https://gist.githubusercontent.com/AlexandreGazagnes/cb63600b7a6a71b5f7b714bfe7540137/raw/cc8563822e19b196aebe0a52c9f74598888d9c29/iris_exam.csv).**

## Preliminaries

### System

These commands let you check and change the working directory:

Uncomment these lines if needed. 

In [None]:
# pwd

In [None]:
# cd ..

In [None]:
# ls

These commands will install the required packages:

**Please note that if you are using Google Colab, everything you need is already installed.**

In [None]:
# !pip install pandas matplotlib seaborn plotly scikit-learn

Or copy the file requirements.txt and run: 

In [None]:
#! pip install -r requirements.txt

⚠️ Try to use a virtual environment with venv, virtualenv or pipenv

In [None]:
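# Run these in a terminal rather than in a notebook cell:
# each "!" command runs in its own subshell, so "source" would not persist.
# python3 -m venv .venv # create the .venv folder
# source .venv/bin/activate # activate the virtual env
# pip install -r requirements.txt # install the requirements.txt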

Please uncomment and run the following lines if needed (to download the dataset) 

In [None]:
# !wget https://gist.githubusercontent.com/AlexandreGazagnes/cb63600b7a6a71b5f7b714bfe7540137/raw/cc8563822e19b196aebe0a52c9f74598888d9c29/iris_exam.csv

### Import 

Import data libraries:

In [None]:
import pandas as pd
import numpy as np

Import Graphical libraries:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Import Machine Learning libraries:

In [None]:
# must-have (mandatory)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import Normalizer, QuantileTransformer, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

Ignore the warnings : 

In [None]:
import warnings

warnings.filterwarnings("ignore")

If needed, we can enable a TEST_MODE that shrinks the data and the number of folds so that the notebook runs very fast: 

In [None]:
TEST_MODE = True

In [None]:
CV = 10  # number of folds for the  cross val
N_JOBS = 7  # number of cpu to use for computations
FRAC = 1.0  # we keep 100% of the dataframe
DISPLAY = True  # display complex viz
TEST_SIZE = 0.25  # Train vs Test %

if TEST_MODE:
    CV = 2
    N_JOBS = -1
    FRAC = 0.1
    DISPLAY = False
    TEST_SIZE = 0.5

### Get the data

1st option : Download the dataset from the web

In [None]:
url = "https://gist.githubusercontent.com/AlexandreGazagnes/cb63600b7a6a71b5f7b714bfe7540137/raw/cc8563822e19b196aebe0a52c9f74598888d9c29/iris_exam.csv"
df = pd.read_csv(url)
df.head()

If needed, let's keep only a fraction of the dataframe: 

In [None]:
if TEST_MODE:
    df = df.sample(frac=FRAC)

2nd Option : Read data from a file

In [None]:
# or

# fn = "my/super/file.csv"
# df = pd.read_csv(fn)
# df.head()

## First Tour

Print out the first rows of the dataset

In [None]:
df.head()

Print out the last rows of the dataset

In [None]:
df.tail()

Print out 10 random lines of the data set

In [None]:
df.sample(10)

Global information about the dataframe

In [None]:
df.info()

List of data types for each column

In [None]:
df.dtypes

The shape of our dataframe

In [None]:
df.shape

Compute all missing values for each column

In [None]:
df.isna().sum()

Do we have some missing values ? If so how many, and what should we do ?

Compute mean, std, median, min, max etc 

In [None]:
df.describe().round(2)

Compute the number of unique values for each column

In [None]:
df.nunique()

Keep in mind the shape of our data set

In [None]:
df.shape

Let's plot the correlation matrix

In [None]:
def make_corr_heatmap(df):
    corr = df.select_dtypes(include="number").corr()
    mask = np.triu(corr)
    sns.heatmap(
        corr, annot=True, cmap="coolwarm", fmt=".2f", vmin=-1, vmax=1, mask=mask
    )

In [None]:
if DISPLAY:
    make_corr_heatmap(df)

What is your conclusion?

Let's display the pair plot visualisation for numerical features

In [None]:
if DISPLAY:
    sns.pairplot(df, corner=True)

Without any statistical analysis, what can you say about the data?
Regarding the pair plot, how many clusters do we have ? 

Let's do the same but with the hue parameter (the true value of each flower's species)

In [None]:
if DISPLAY:
    sns.pairplot(df, hue="Species", corner=True)

## Cleaning and Preparation

### Cleaning

Keep in mind the number of missing values 

In [None]:
df.isna().sum()

We need to fill the missing values of the column "SepalLengthCm" (sepal length) with the median value.

Filling missing values with the mean could be another option, but the mean is more sensitive to extreme values than the median, so it is not the best choice here ;)
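To make the mean versus median point concrete, here is a minimal sketch using only the standard library. The measurements are made up for illustration (they are not taken from the dataset); the last value mimics a data-entry error.

```python
from statistics import mean, median

# Hypothetical measurements in cm (not from the dataset).
clean = [0.2, 0.2, 1.3, 1.5, 2.0]
# Same values plus one extreme value that mimics a data-entry error.
with_outlier = clean + [150.0]

mean_clean, median_clean = mean(clean), median(clean)
mean_dirty, median_dirty = mean(with_outlier), median(with_outlier)

# The mean jumps by more than an order of magnitude, the median barely moves.
print(f"mean:   {mean_clean:.2f} -> {mean_dirty:.2f}")
print(f"median: {median_clean:.2f} -> {median_dirty:.2f}")
```

A single extreme value is enough to drag the mean far from the bulk of the data, which is why the median is the safer fill value when outliers may be present.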


So, let's compute the median value:

In [None]:
_median = df.SepalLengthCm.median()
_median

We can then fill the missing values with the median value

In [None]:
df["SepalLengthCm"] = df["SepalLengthCm"].fillna(_median)

Let's check if the problem is solved

In [None]:
df.isna().sum()

Keep in mind our data numerical description : 

In [None]:
df.describe().round(2)

Do you think we have outliers in our data frame ? 
If so, what is the column concerned, and what value seems to be an outlier ?

Let's select the specific row

In [None]:
df.loc[df.PetalWidthCm > 10, :]

Compute the median of the column "PetalWidthCm" (petal width)

In [None]:
_median = df.PetalWidthCm.median()
_median

Let's change the outlier value 

In [None]:
df.loc[df.PetalWidthCm > 10, "PetalWidthCm"] = _median

The problem is solved, let's check it

In [None]:
df.describe().round(2)

Keep in mind our data types

In [None]:
df.dtypes

Keep in mind our number of unique values per column: 

In [None]:
df.nunique()

Do you think we have useless columns in our data frame ?
What are these columns and why ?

Even though "Species" is a special column (it is our target), we need to keep it. 

Please drop the useless columns

In [None]:
cols = ["Date", "Id"]
df = df.drop(columns=cols, errors="ignore")
df

Good, now we need to create our X matrix.

### X and y

We need to separate X (our features) from y (our target):

In [None]:
X = df.drop(columns="Species")
X

In [None]:
y = df.Species
y

## Modeling

### First Try

We need an estimator : 

In [None]:
estimator = LogisticRegression()
estimator

Let's fit this estimator : 

In [None]:
estimator.fit(X, y)
estimator

Let's predict : 

In [None]:
y_pred = estimator.predict(X)
y_pred[:10]

Our score : 

In [None]:
estimator.score(X, y)

The same score, computed with `accuracy_score`: 

In [None]:
accuracy_score(y_true=y, y_pred=y_pred)
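As a rough illustration of what this score measures, accuracy is simply the fraction of predictions that match the true labels. The sketch below recomputes it by hand; the `accuracy` helper and the toy labels are hypothetical and only serve the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels, not taken from the dataset.
y_true_toy = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred_toy = ["setosa", "setosa", "virginica", "virginica", "virginica"]

toy_accuracy = accuracy(y_true_toy, y_pred_toy)
print(toy_accuracy)  # 4 correct predictions out of 5
```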

Using the confusion matrix : 

In [None]:
y = pd.Series(y, name="y_true")
# reuse y's index so that pd.crosstab aligns both series row by row
y_pred = pd.Series(y_pred, index=y.index, name="y_pred")
pd.DataFrame(confusion_matrix(y, y_pred))

More readable output : 

In [None]:
pd.crosstab(y, y_pred)
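To see what these tables contain, here is a minimal sketch that counts (true, predicted) pairs with the standard library. The toy labels are hypothetical; rows stand for true classes and columns for predicted classes, which is the convention scikit-learn's `confusion_matrix` follows.

```python
from collections import Counter

# Hypothetical labels, not taken from the dataset.
y_true_toy = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred_toy = ["setosa", "setosa", "virginica", "virginica", "virginica"]

# Count how often each (true, predicted) pair occurs.
pair_counts = Counter(zip(y_true_toy, y_pred_toy))
labels = sorted(set(y_true_toy) | set(y_pred_toy))

# Rows: true class. Columns: predicted class.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

for label, row in zip(labels, matrix):
    print(f"{label:>10}: {row}")
```

The diagonal holds the correct predictions, so its sum divided by the total number of rows gives the accuracy.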

### Using Train and Test values 

Creating train and test values : 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    shuffle=True,
    random_state=42,
)

# other possible values : 0.25, 0.2
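Conceptually, a shuffled split boils down to shuffling row positions and cutting the list in two. The sketch below shows the idea with the standard library; `shuffle_split` is a hypothetical helper and does not reproduce the exact shuffling of `train_test_split`, only the principle.

```python
import math
import random


def shuffle_split(rows, test_size, seed):
    """Shuffle row positions with a fixed seed, then cut them into train and test."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(20))  # stand-in for 20 dataframe rows
train_rows, test_rows = shuffle_split(rows, test_size=0.25, seed=42)

print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=42` above: the split is random but identical on every run.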

X_train : 

In [None]:
X_train.shape

X_test : 

In [None]:
X_test.shape

Estimator : 

In [None]:
estimator = LogisticRegression()

Fit : 

In [None]:
estimator.fit(X_train, y_train)

Train score :  

In [None]:
estimator.score(X_train, y_train)

Test score : 

In [None]:
estimator.score(X_test, y_test)
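Why compare both scores? A model can look perfect on the rows it was trained on and still generalise poorly. The toy `Memorizer` below (a hypothetical classifier written only for this illustration, with made-up data) recalls training rows exactly and guesses the majority class otherwise.

```python
from collections import Counter


class Memorizer:
    """Toy classifier: recalls seen rows exactly, otherwise guesses the majority class."""

    def fit(self, X, y):
        self.lookup_ = dict(zip(map(tuple, X), y))
        self.default_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.lookup_.get(tuple(row), self.default_) for row in X]

    def score(self, X, y):
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


# Made-up data: two features per row, two classes.
X_train_toy = [(5.1, 1.4), (4.9, 1.4), (5.0, 1.5), (6.4, 4.5), (6.9, 4.9)]
y_train_toy = ["a", "a", "a", "b", "b"]
X_test_toy = [(5.2, 1.5), (6.7, 4.7), (6.0, 4.0), (4.8, 1.6)]
y_test_toy = ["a", "b", "b", "a"]

model = Memorizer().fit(X_train_toy, y_train_toy)
train_score = model.score(X_train_toy, y_train_toy)
test_score = model.score(X_test_toy, y_test_toy)
print(train_score, test_score)
```

The train score is perfect by construction, while the test score only reflects the majority-class guess. A large gap between the two scores is the usual sign of overfitting.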

### Using a Grid Search

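About the grid search: `GridSearchCV` evaluates every parameter combination in `param_grid` with cross-validation. With an empty `param_grid`, it simply cross-validates the estimator with its default parameters.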

In [None]:
grid = GridSearchCV(
    LogisticRegression(),
    param_grid={},
    cv=CV,
    return_train_score=True,
    refit=True,
    n_jobs=N_JOBS,
    verbose=2,
)
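The `cv` argument controls how many folds are used for cross-validation. As a rough sketch of the idea, the hypothetical `k_fold_indices` helper below builds plain, unstratified folds: each row lands in the validation set exactly once. Note that for a classifier and an integer `cv`, scikit-learn uses a stratified variant that also preserves class proportions, which this sketch does not do.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # The first (n_samples % n_splits) folds get one extra sample.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

The model is fitted once per fold on the training indices and scored on the validation indices; the mean of those scores is what appears as `mean_test_score` in the results.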

Fit : 

In [None]:
grid.fit(X_train, y_train)

Our results : 

In [None]:
grid.cv_results_

In a dataframe : 

In [None]:
pd.DataFrame(grid.cv_results_)

Let's create a function that keeps only the summary columns of the results, sorted by mean test score: 

In [None]:
def resultize(grid):

    res = grid.cv_results_
    res = pd.DataFrame(res)

    cols = [i for i in res.columns if "split" not in i]
    res = res.loc[:, cols]

    res = res.drop(columns=["mean_score_time", "std_score_time"])

    return res.round(2).sort_values("mean_test_score", ascending=False)

Resultize : 

In [None]:
resultize(grid)

### Using a pipeline

Our first pipeline : 

In [None]:
pipe = Pipeline(
    [
        ("imputer", KNNImputer()),
        ("scaler", StandardScaler()),
        ("estimator", LogisticRegression()),
    ]
)

In [None]:
pipe
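One benefit of a pipeline is that each preprocessing step is fitted on the training rows only and then applied unchanged to new rows. The sketch below illustrates this with a hypothetical `ToyStandardizer` and made-up values; it mimics the principle of standard scaling (subtract the mean, divide by the population standard deviation) and is not scikit-learn's implementation.

```python
from statistics import fmean, pstdev


class ToyStandardizer:
    """Learn mean and standard deviation in fit, reuse them in transform."""

    def fit(self, values):
        self.mean_ = fmean(values)
        # Fall back to 1.0 for a constant column to avoid dividing by zero.
        self.scale_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


# Made-up values for a single feature.
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [3.0, 6.0]

scaler = ToyStandardizer().fit(train_values)   # statistics come from train only
train_scaled = scaler.transform(train_values)
test_scaled = scaler.transform(test_values)    # test rows reuse the train statistics

print([round(v, 2) for v in train_scaled])
print([round(v, 2) for v in test_scaled])
```

When a pipeline is passed to a grid search, this fit-on-train, transform-on-validation logic is repeated inside every fold, so the validation rows never influence the preprocessing statistics.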

Using the pipeline : 

In [None]:
grid = GridSearchCV(
    pipe,
    param_grid={},
    cv=CV,
    return_train_score=True,
    refit=True,
    n_jobs=N_JOBS,
    verbose=2,
)

Fit : 

In [None]:
grid.fit(X_train, y_train)

Resultize : 

In [None]:
resultize(grid)

### Using a Param Grid

Keep in mind : 

In [None]:
pipe

In [None]:
grid

Writing a beautiful param grid : 

In [None]:
param_grid = {
    "imputer": [
        "passthrough",
        KNNImputer(n_neighbors=3),
        KNNImputer(n_neighbors=5),
        SimpleImputer(strategy="median"),
    ],
    "scaler": [
        # "passthrough",
        StandardScaler(),
        Normalizer(),
        QuantileTransformer(n_quantiles=10),
    ],
    "estimator": [
        DummyClassifier(),
        LogisticRegression(),
        KNeighborsClassifier(n_neighbors=5),
        RandomForestClassifier(),
    ],
}

In [None]:
param_grid
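A grid search tries every combination of the listed values, so the number of candidates grows multiplicatively. The sketch below counts them with `itertools.product`; the string labels in `toy_grid` are placeholders standing in for the objects above, and `n_folds = 10` is an assumption matching the default `CV` value of this notebook.

```python
from itertools import product

# Placeholder labels standing in for the transformers and estimators above.
toy_grid = {
    "imputer": ["passthrough", "knn_3", "knn_5", "median"],
    "scaler": ["standard", "normalizer", "quantile"],
    "estimator": ["dummy", "logistic", "knn", "random_forest"],
}

# Every combination of one imputer, one scaler and one estimator.
candidates = [dict(zip(toy_grid, combo)) for combo in product(*toy_grid.values())]

n_candidates = len(candidates)
n_folds = 10  # assumed number of cross-validation folds
n_fits = n_candidates * n_folds

print(n_candidates, "candidates,", n_fits, "fits")
```

With `refit=True`, one extra fit of the best candidate on the whole training set is added on top of these.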

Using the param grid : 

In [None]:
grid = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=CV,
    return_train_score=True,
    refit=True,
    n_jobs=N_JOBS,
    verbose=1,
)

Fit : 

In [None]:
grid.fit(X_train, y_train)

Resultize ! 

In [None]:
resultize(grid).head(10)