# 012-02 - ML Basics - Solution Notebook

* Written by Alexandre Gazagnes
* Last update: 2024-02-01

## About 

Context : 

You need to know a few core concepts of 'tabular' machine learning before moving on to NLP.

Data  : 

**You can find the dataset [here](https://gist.githubusercontent.com/AlexandreGazagnes/cb63600b7a6a71b5f7b714bfe7540137/raw/cc8563822e19b196aebe0a52c9f74598888d9c29/iris_exam.csv).**

## Preliminaries

### System

These commands display the current working directory and its contents.

Uncomment these lines if needed. 

In [36]:
# pwd

In [37]:
# cd ..

In [38]:
# ls

In [39]:
# cd ..

In [40]:
# ls

These commands will install the required packages:

**Please note that if you are using Google Colab, everything you need is already installed.**

In [41]:
# !pip install pandas matplotlib seaborn plotly scikit-learn

Please uncomment and run the following lines if needed (to download the dataset) 

In [42]:
# !wget https://gist.githubusercontent.com/AlexandreGazagnes/cb63600b7a6a71b5f7b714bfe7540137/raw/cc8563822e19b196aebe0a52c9f74598888d9c29/iris_exam.csv

### Import 

Import data libraries:

In [43]:
import pandas as pd
import numpy as np

Import Graphical libraries:

In [44]:
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

Import Machine Learning libraries:

In [45]:
# must-have imports (mandatory)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import *
from sklearn.model_selection import *
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.impute import *
from sklearn.preprocessing import *
from sklearn.ensemble import *
from sklearn.neighbors import *
from sklearn.dummy import *

Ignore the warnings : 

In [46]:
import warnings

warnings.filterwarnings("ignore")

If needed, we can set TEST_MODE to run the whole notebook very quickly on a small sample : 

In [96]:
TEST_MODE = True

In [97]:
CV = 10  # number of folds for the  cross val
N_JOBS = 7  # number of cpu to use for computations
FRAC = 1.0  # we keep 100% of the dataframe
DISPLAY = True  # display complex viz
TEST_SIZE = 0.25  # Train vs Test %

if TEST_MODE:
    CV = 2
    N_JOBS = -1
    FRAC = 0.1
    DISPLAY = False
    TEST_SIZE = 0.5

### Get the data

1st option : Download the dataset from the web

In [98]:
url = "https://gist.githubusercontent.com/AlexandreGazagnes/cb63600b7a6a71b5f7b714bfe7540137/raw/cc8563822e19b196aebe0a52c9f74598888d9c29/iris_exam.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Date
0,1,5.1,3.5,1.4,0.2,setosa,2024-03-03
1,2,4.9,3.0,1.4,0.2,setosa,2024-03-03
2,3,4.7,3.2,1.3,0.2,setosa,2024-03-03
3,4,4.6,3.1,1.5,0.2,setosa,2024-03-03
4,5,5.0,3.6,1.4,0.2,setosa,2024-03-03


If needed, let's keep only a fraction (FRAC) of the dataframe : 

In [99]:
if TEST_MODE:
    df = df.sample(frac=FRAC)
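
The sampling above draws different rows on every run, and with only a few rows a class can end up under-represented. A minimal sketch, on a hypothetical toy dataframe (`df_toy` and its values are assumptions, not the notebook's data), of a reproducible sample and of a per-class sample:

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe (hypothetical values)
df_toy = pd.DataFrame(
    {
        "PetalLengthCm": [1.4, 1.3, 1.5, 1.4, 4.7, 4.5, 4.9, 4.0, 6.0, 5.1, 5.9, 5.6],
        "Species": ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4,
    }
)

# A fixed random_state makes the sample identical from one run to the next
sample_a = df_toy.sample(frac=0.5, random_state=42)
sample_b = df_toy.sample(frac=0.5, random_state=42)
same_rows = sample_a.index.equals(sample_b.index)

# Sampling inside each class keeps the class proportions of the full data
stratified = df_toy.groupby("Species").sample(frac=0.5, random_state=42)
class_counts = stratified["Species"].value_counts().to_dict()

print(same_rows)
print(class_counts)
```

Whether a per-class sample is wanted depends on the exercise; the notebook itself uses a plain random sample.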

2nd Option : Read data from a file

In [100]:
# or

# fn = "my/super/file.csv"
# df = pd.read_csv(fn)
# df.head()

## First Tour

Print out the first rows of the dataset

In [101]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Date
107,108,7.3,2.9,6.3,1.8,virginica,2024-03-03
132,133,6.4,2.8,5.6,2.2,virginica,2024-03-03
57,58,4.9,2.4,3.3,1.0,versicolor,2024-03-03
93,94,5.0,2.3,3.3,1.0,versicolor,2024-03-03
21,22,5.1,3.7,1.5,0.4,setosa,2024-03-03


Print out the last rows of the dataset

In [102]:
df.tail()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Date
145,146,6.7,3.0,5.2,2.3,virginica,2024-03-03
42,43,4.4,3.2,1.3,0.2,setosa,2024-03-03
88,89,5.6,3.0,4.1,1.3,versicolor,2024-03-03
53,54,5.5,2.3,4.0,1.3,versicolor,2024-03-03
121,122,5.6,2.8,4.9,2.0,virginica,2024-03-03


Print out 10 random lines of the data set

In [103]:
df.sample(10)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Date
59,60,5.2,2.7,3.9,1.4,versicolor,2024-03-03
1,2,4.9,3.0,1.4,0.2,setosa,2024-03-03
21,22,5.1,3.7,1.5,0.4,setosa,2024-03-03
121,122,5.6,2.8,4.9,2.0,virginica,2024-03-03
42,43,4.4,3.2,1.3,0.2,setosa,2024-03-03
7,8,5.0,3.4,1.5,0.2,setosa,2024-03-03
145,146,6.7,3.0,5.2,2.3,virginica,2024-03-03
107,108,7.3,2.9,6.3,1.8,virginica,2024-03-03
50,51,7.0,3.2,4.7,1.4,versicolor,2024-03-03
57,58,4.9,2.4,3.3,1.0,versicolor,2024-03-03


Global information about the dataframe

In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15 entries, 107 to 121
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             15 non-null     int64  
 1   SepalLengthCm  15 non-null     float64
 2   SepalWidthCm   15 non-null     float64
 3   PetalLengthCm  15 non-null     float64
 4   PetalWidthCm   15 non-null     float64
 5   Species        15 non-null     object 
 6   Date           15 non-null     object 
dtypes: float64(4), int64(1), object(2)
memory usage: 960.0+ bytes


List of data types for each column

In [105]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
Date              object
dtype: object

The shape of our dataframe

In [106]:
df.shape

(15, 7)

Compute all missing values for each column

In [107]:
df.isna().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
Date             0
dtype: int64

Do we have any missing values? If so, how many, and what should we do about them? (The 15-row sample shown above happens to contain none.)
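
To answer "how many", the share of missing values per column is often easier to read than the raw count. A minimal sketch on a hypothetical toy dataframe (`df_toy` is an assumption):

```python
import numpy as np
import pandas as pd

# Toy dataframe with one missing sepal length (hypothetical values)
df_toy = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, np.nan, 4.7, 4.6],
        "PetalWidthCm": [0.2, 0.2, 0.2, 0.2],
    }
)

missing_count = df_toy.isna().sum()  # number of missing values per column
missing_share = df_toy.isna().mean()  # fraction of missing values per column

print(missing_count)
print(missing_share)
```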

Compute mean, std, median, min, max etc 

In [108]:
df.describe().round(2)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,15.0,15.0,15.0,15.0,15.0
mean,69.47,5.67,2.93,3.7,1.21
std,44.33,0.88,0.4,1.63,0.71
min,2.0,4.4,2.3,1.3,0.2
25%,47.0,5.0,2.75,2.4,0.7
50%,58.0,5.5,3.0,4.0,1.3
75%,101.0,6.4,3.2,4.8,1.65
max,146.0,7.3,3.7,6.3,2.3


Compute the number of unique values for each column

In [109]:
df.nunique()

Id               15
SepalLengthCm    11
SepalWidthCm      9
PetalLengthCm    13
PetalWidthCm     10
Species           3
Date              1
dtype: int64

Keep in mind the shape of our data set

In [110]:
df.shape

(15, 7)

Let's plot the correlation matrix

In [111]:
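def make_corr_heatmap(df):
    """Plot the lower triangle of the correlation matrix of the numeric columns."""
    corr = df.select_dtypes(include="number").corr()
    # boolean mask hiding the upper triangle and the diagonal
    mask = np.triu(np.ones_like(corr, dtype=bool))
    sns.heatmap(
        corr, annot=True, cmap="coolwarm", fmt=".2f", vmin=-1, vmax=1, mask=mask
    )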

In [113]:
if DISPLAY:
    make_corr_heatmap(df)

What is your conclusion?

Let's display the pair plot visualisation for numerical features

In [114]:
if DISPLAY:
    sns.pairplot(df, corner=True)

Without any statistical analysis, what can you say about the data?
Regarding the pair plot, how many clusters do we have ? 

Let's do the same but with the hue parameter (the true value of each flower's species)

In [115]:
if DISPLAY:
    sns.pairplot(df, hue="Species", corner=True)

## Cleaning and Preparation

### Cleaning

Keep in mind the number of missing values 

In [116]:
df.isna().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
Date             0
dtype: int64

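We need to fill the missing values of the column "SepalLengthCm" with the median value. (The 15-row sample shown here contains no missing value, so the next cells leave it unchanged.)

Filling missing values with the mean could be another option, but the mean is more sensitive to extreme values, so it is not the best one here ;)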
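
A minimal sketch of why the median is the safer fill value, using hypothetical sepal lengths with one extreme value (the numbers are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sepal lengths: one missing value and one extreme value
values = pd.Series([4.9, 5.0, np.nan, 5.1, 5.2, 50.0])

mean_value = values.mean()  # pulled up by the extreme value (about 14.04)
median_value = values.median()  # stays with the bulk of the data (5.1)

# Fill the gap with the median, as the notebook does
filled = values.fillna(median_value)

print(round(mean_value, 2), median_value)
print(filled.tolist())
```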


So, Let's compute the median value :

In [117]:
_median = df.SepalLengthCm.median()
_median

5.5

We can then fill the missing values with the median value

In [118]:
df["SepalLengthCm"] = df["SepalLengthCm"].fillna(_median)

Let's check if the problem is solved

In [119]:
df.isna().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
Date             0
dtype: int64

Keep in mind our data numerical description : 

In [120]:
df.describe().round(2)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,15.0,15.0,15.0,15.0,15.0
mean,69.47,5.67,2.93,3.7,1.21
std,44.33,0.88,0.4,1.63,0.71
min,2.0,4.4,2.3,1.3,0.2
25%,47.0,5.0,2.75,2.4,0.7
50%,58.0,5.5,3.0,4.0,1.3
75%,101.0,6.4,3.2,4.8,1.65
max,146.0,7.3,3.7,6.3,2.3


Do you think we have outliers in our data frame? 
If so, which column is concerned, and which value looks like an outlier?
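
The notebook spots the outlier by eye and uses a hand-picked threshold. One common alternative, swapped in here purely for illustration, is Tukey's IQR rule. A minimal sketch on hypothetical petal widths (the values are assumptions):

```python
import pandas as pd

# Hypothetical petal widths with one suspicious value
widths = pd.Series(
    [0.2, 0.2, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.2, 15.0], name="PetalWidthCm"
)

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = widths.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outlier candidates
outliers = widths[(widths < lower) | (widths > upper)]

print(outliers)
```

A flagged value is only a candidate: whether it is a typo or a genuine measurement is a judgment call.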

Let's select the rows with a petal width above 10 cm (there are none in this sample) : 

In [121]:
df.loc[df.PetalWidthCm > 10, :]

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species,Date


Compute the median of the column "petal width"

In [122]:
_median = df.PetalWidthCm.median()
_median

1.3

Let's change the outlier value 

In [123]:
df.loc[df.PetalWidthCm > 10, "PetalWidthCm"] = _median

The problem is solved, let's check it

In [124]:
df.describe().round(2)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,15.0,15.0,15.0,15.0,15.0
mean,69.47,5.67,2.93,3.7,1.21
std,44.33,0.88,0.4,1.63,0.71
min,2.0,4.4,2.3,1.3,0.2
25%,47.0,5.0,2.75,2.4,0.7
50%,58.0,5.5,3.0,4.0,1.3
75%,101.0,6.4,3.2,4.8,1.65
max,146.0,7.3,3.7,6.3,2.3


Keep in mind our data types

In [125]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
Date              object
dtype: object

Keep in mind our number of unique values per column: 

In [126]:
df.nunique()

Id               15
SepalLengthCm    11
SepalWidthCm      9
PetalLengthCm    13
PetalWidthCm     10
Species           3
Date              1
dtype: int64

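Do you think we have useless columns in our data frame?
Which columns are they, and why?

"Date" has a single unique value and "Id" is just a row identifier, so neither carries any information about the species. "Species" is a special column (it is our target), so we need to keep this one. 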
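
A minimal sketch of how `nunique()` can point at such columns, on a hypothetical toy dataframe (`df_toy` and the "identifier-like" heuristic are assumptions, not a general rule):

```python
import pandas as pd

# Toy dataframe mimicking the notebook's columns (hypothetical values)
df_toy = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],
        "SepalLengthCm": [5.1, 4.9, 5.1, 4.6],
        "Species": ["setosa", "setosa", "versicolor", "versicolor"],
        "Date": ["2024-03-03"] * 4,
    }
)

n_unique = df_toy.nunique()

# A column with a single value cannot help to tell rows apart
constant_cols = n_unique[n_unique == 1].index.tolist()

# Heuristic: a non-float column with one distinct value per row looks like an identifier
id_like_cols = [
    col
    for col in df_toy.columns
    if n_unique[col] == len(df_toy) and not pd.api.types.is_float_dtype(df_toy[col])
]

print(constant_cols, id_like_cols)
```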

Please drop the useless columns

In [127]:
cols = ["Date", "Id"]
df = df.drop(columns=cols, errors="ignore")
df

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
107,7.3,2.9,6.3,1.8,virginica
132,6.4,2.8,5.6,2.2,virginica
57,4.9,2.4,3.3,1.0,versicolor
93,5.0,2.3,3.3,1.0,versicolor
21,5.1,3.7,1.5,0.4,setosa
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
59,5.2,2.7,3.9,1.4,versicolor
1,4.9,3.0,1.4,0.2,setosa
7,5.0,3.4,1.5,0.2,setosa


Good, now we need to create our X matrix.

### X and y

We need to separate X (our features) from y (our target) :

In [128]:
X = df.drop(columns="Species")
X

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
107,7.3,2.9,6.3,1.8
132,6.4,2.8,5.6,2.2
57,4.9,2.4,3.3,1.0
93,5.0,2.3,3.3,1.0
21,5.1,3.7,1.5,0.4
50,7.0,3.2,4.7,1.4
51,6.4,3.2,4.5,1.5
59,5.2,2.7,3.9,1.4
1,4.9,3.0,1.4,0.2
7,5.0,3.4,1.5,0.2


In [129]:
y = df.Species
y

107     virginica
132     virginica
57     versicolor
93     versicolor
21         setosa
50     versicolor
51     versicolor
59     versicolor
1          setosa
7          setosa
145     virginica
42         setosa
88     versicolor
53     versicolor
121     virginica
Name: Species, dtype: object

## Modeling

### First Try

We need an estimator : 

In [130]:
estimator = LogisticRegression()
estimator

Let's fit this estimator : 

In [131]:
estimator.fit(X, y)
estimator

Let's predict : 

In [132]:
y_pred = estimator.predict(X)
y_pred[:10]

array(['virginica', 'virginica', 'versicolor', 'versicolor', 'setosa',
       'versicolor', 'versicolor', 'versicolor', 'setosa', 'setosa'],
      dtype=object)

Our score (the accuracy, computed here on the very data used for fitting, so it is optimistic) : 

In [133]:
estimator.score(X, y)

1.0

The same score, computed with accuracy_score : 

In [134]:
accuracy_score(y_true=y, y_pred=y_pred)

1.0
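
Accuracy is simply the share of predictions that match the true labels. A minimal sketch on hypothetical labels (the two arrays are assumptions) comparing a manual computation with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (4 correct out of 5)
y_true = np.array(["setosa", "versicolor", "virginica", "versicolor", "setosa"])
y_hat = np.array(["setosa", "versicolor", "versicolor", "versicolor", "setosa"])

manual_accuracy = (y_true == y_hat).mean()  # share of matching labels
sklearn_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy, sklearn_accuracy)
```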

Using the confusion matrix : 

In [135]:
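y = pd.Series(y, name="y_true")
# reuse the index of y so that both Series line up row by row
y_pred = pd.Series(y_pred, index=y.index, name="y_pred")
pd.DataFrame(confusion_matrix(y, y_pred))

Unnamed: 0,0,1,2
0,4,0,0
1,0,7,0
2,0,0,4


More readable output (rows are the true species, columns the predicted ones; 0, 1, 2 above stand for setosa, versicolor, virginica) : 

In [136]:
pd.crosstab(y, y_pred)

y_true / y_pred,setosa,versicolor,virginica
setosa,4,0,0
versicolor,0,7,0
virginica,0,0,4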
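
Two details matter when building these tables: `confusion_matrix` returns a bare array without class names, and `pd.crosstab` pairs two Series through their index, so predictions need the same index as the true labels. A minimal sketch on hypothetical labels (all values here are assumptions):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# True labels with a shuffled index, as after df.sample() (hypothetical values)
y_true = pd.Series(
    ["setosa", "versicolor", "virginica", "versicolor"],
    index=[107, 57, 21, 1],
    name="y_true",
)

# Raw predictions come back as a plain array: give them the index of y_true
raw_pred = ["setosa", "versicolor", "versicolor", "versicolor"]
y_hat = pd.Series(raw_pred, index=y_true.index, name="y_pred")

# Confusion matrix with the class names as row and column labels
labels = sorted(y_true.unique())
cm = pd.DataFrame(
    confusion_matrix(y_true, y_hat, labels=labels), index=labels, columns=labels
)

# Same information through crosstab, now aligned row by row
ct = pd.crosstab(y_true, y_hat)

print(cm)
print(ct)
```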


### Using Train and Test values 

Creating train and test values : 

In [137]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    shuffle=True,
    random_state=42,
)

# other possible values : 0.25, 0.2

X_train : 

In [138]:
X_train.shape

(7, 4)

X_test : 

In [139]:
X_test.shape

(8, 4)

Estimator : 

In [140]:
estimator = LogisticRegression()

Fit : 

In [141]:
estimator.fit(X_train, y_train)

Train score :  

In [142]:
estimator.score(X_train, y_train)

0.8571428571428571

Test score : 

In [143]:
estimator.score(X_test, y_test)

1.0
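
With so few rows, a plain random split can leave a class almost absent from the train or the test set, which makes the two scores above quite noisy. A minimal sketch of a stratified split; it assumes scikit-learn's bundled iris data (`load_iris`) instead of the notebook's CSV:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris data (150 rows, 50 per class), not the notebook's file
X_all, y_all = load_iris(return_X_y=True)

# stratify=y_all keeps the class proportions identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all,
    y_all,
    test_size=0.5,
    shuffle=True,
    stratify=y_all,
    random_state=42,
)

train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()

print(train_counts, test_counts)
```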

### Using a Grid Search

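About the grid search : GridSearchCV evaluates every combination listed in param_grid with cross-validation. With an empty param_grid, it simply cross-validates the estimator with its default parameters.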

In [144]:
grid = GridSearchCV(
    LogisticRegression(),
    param_grid={},
    cv=CV,
    return_train_score=True,
    refit=True,
    n_jobs=N_JOBS,
    verbose=2,
)

Fit : 

In [145]:
grid.fit(X_train, y_train)

Fitting 2 folds for each of 1 candidates, totalling 2 fits


[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s


Our results : 

In [146]:
grid.cv_results_

{'mean_fit_time': array([0.0126189]),
 'std_fit_time': array([0.00318229]),
 'mean_score_time': array([0.00245821]),
 'std_score_time': array([2.22921371e-05]),
 'params': [{}],
 'split0_test_score': array([0.5]),
 'split1_test_score': array([1.]),
 'mean_test_score': array([0.75]),
 'std_test_score': array([0.25]),
 'rank_test_score': array([1], dtype=int32),
 'split0_train_score': array([1.]),
 'split1_train_score': array([0.75]),
 'mean_train_score': array([0.875]),
 'std_train_score': array([0.125])}

In a dataframe : 

In [147]:
pd.DataFrame(grid.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,params,split0_test_score,split1_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,mean_train_score,std_train_score
0,0.012619,0.003182,0.002458,2.2e-05,{},0.5,1.0,0.75,0.25,1,1.0,0.75,0.875,0.125
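
The aggregated columns are plain statistics over the per-fold scores. A short worked check with the two test scores reported above:

```python
import numpy as np

# split0_test_score and split1_test_score from the results above
split_test_scores = np.array([0.5, 1.0])

mean_test_score = split_test_scores.mean()  # 0.75
std_test_score = split_test_scores.std()  # 0.25 (population standard deviation)

print(mean_test_score, std_test_score)
```

A standard deviation of 0.25 on two folds is a reminder that scores on such a small sample are very unstable.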


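Let's create a function to summarise the results : 

In [148]:
def resultize(grid):
    """Return a compact, sorted summary of a fitted GridSearchCV."""

    res = grid.cv_results_
    res = pd.DataFrame(res)

    # drop the per-split columns, keep only the aggregated ones
    cols = [i for i in res.columns if "split" not in i]
    res = res.loc[:, cols]

    # scoring times are not needed in the summary
    res = res.drop(columns=["mean_score_time", "std_score_time"])

    return res.round(2).sort_values("mean_test_score", ascending=False)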

Resultize : 

In [149]:
resultize(grid)

Unnamed: 0,mean_fit_time,std_fit_time,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
0,0.01,0.0,{},0.75,0.25,1,0.88,0.12
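
The grid above is empty, so only one candidate is evaluated. A minimal sketch of a grid that actually searches a hyperparameter, here the regularisation strength `C` of the logistic regression. The values of `C`, `cv=5`, `max_iter=1000` and the use of scikit-learn's bundled iris data are assumptions made for this illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumption: the bundled iris data, not the notebook's CSV
X_all, y_all = load_iris(return_X_y=True)

# Four candidate values for the inverse regularisation strength
c_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=c_grid,
    cv=5,
    return_train_score=True,
)
search.fit(X_all, y_all)

best_c = search.best_params_["C"]
n_candidates = len(search.cv_results_["params"])

print(best_c, round(search.best_score_, 2), n_candidates)
```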


### Using a pipeline

Our first pipeline : 

In [150]:
pipe = Pipeline(
    [
        ("imputer", KNNImputer()),
        ("scaler", StandardScaler()),
        ("estimator", LogisticRegression()),
    ]
)

In [151]:
pipe

Using the pipeline : 

In [152]:
grid = GridSearchCV(
    pipe,
    param_grid={},
    cv=CV,
    return_train_score=True,
    refit=True,
    n_jobs=N_JOBS,
    verbose=2,
)

Fit : 

In [153]:
grid.fit(X_train, y_train)

Fitting 2 folds for each of 1 candidates, totalling 2 fits
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.0s


Resultize : 

In [154]:
resultize(grid)

Unnamed: 0,mean_fit_time,std_fit_time,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
0,0.01,0.0,{},0.58,0.08,1,1.0,0.0
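
One reason to wrap preprocessing in a pipeline is that every step is fitted on the training part only, so nothing leaks from the data used for scoring. A minimal sketch checking this for the scaler; it assumes scikit-learn's bundled iris data and leaves out the imputer for brevity:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data, not the notebook's CSV
X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=42
)

demo_pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("estimator", LogisticRegression(max_iter=1000)),
    ]
)
demo_pipe.fit(X_tr, y_tr)

# The scaler only saw the training rows: its means are the training means
scaler_means = demo_pipe.named_steps["scaler"].mean_
learned_from_train_only = np.allclose(scaler_means, X_tr.mean(axis=0))

# At scoring time the test rows are transformed with those training statistics
test_score = demo_pipe.score(X_te, y_te)

print(learned_from_train_only, round(test_score, 2))
```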


### Using a Param Grid

Keep in mind : 

In [155]:
pipe

In [156]:
grid

Writing a beautiful param grid : 

In [157]:
param_grid = {
    "imputer": [
        "passthrough",
        KNNImputer(n_neighbors=3),
        KNNImputer(n_neighbors=5),
        SimpleImputer(strategy="median"),
    ],
    "scaler": [
        # "passthrough",
        StandardScaler(),
        Normalizer(),
        QuantileTransformer(n_quantiles=10),
    ],
    "estimator": [
        DummyClassifier(),
        LogisticRegression(),
        KNeighborsClassifier(n_neighbors=5),
        RandomForestClassifier(),
    ],
}

In [158]:
param_grid

{'imputer': ['passthrough',
  KNNImputer(n_neighbors=3),
  KNNImputer(),
  SimpleImputer(strategy='median')],
 'scaler': [StandardScaler(),
  Normalizer(),
  QuantileTransformer(n_quantiles=10)],
 'estimator': [DummyClassifier(),
  LogisticRegression(),
  KNeighborsClassifier(),
  RandomForestClassifier()]}
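
The number of candidates is the product of the number of options per step. A minimal sketch counting them with `ParameterGrid` on a copy of the grid above (the name `demo_grid` is an assumption):

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import Normalizer, QuantileTransformer, StandardScaler

# Copy of the notebook's grid: 4 imputers, 3 scalers, 4 estimators
demo_grid = {
    "imputer": [
        "passthrough",
        KNNImputer(n_neighbors=3),
        KNNImputer(n_neighbors=5),
        SimpleImputer(strategy="median"),
    ],
    "scaler": [StandardScaler(), Normalizer(), QuantileTransformer(n_quantiles=10)],
    "estimator": [
        DummyClassifier(),
        LogisticRegression(),
        KNeighborsClassifier(n_neighbors=5),
        RandomForestClassifier(),
    ],
}

n_candidates = len(ParameterGrid(demo_grid))  # 4 * 3 * 4 = 48
n_fits = n_candidates * 2  # with 2 folds: 96 fits

print(n_candidates, n_fits)
```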

Using the param grid : 

In [159]:
grid = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=CV,
    return_train_score=True,
    refit=True,
    n_jobs=N_JOBS,
    verbose=1,
)

Fit : 

In [160]:
grid.fit(X_train, y_train)

Fitting 2 folds for each of 48 candidates, totalling 96 fits


Traceback (most recent call last):
  File "/home/alex/dev/CentraleSupElec/CentraleSupElec-NLP-Public-Ressources/.venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 982, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "/home/alex/dev/CentraleSupElec/CentraleSupElec-NLP-Public-Ressources/.venv/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 415, in __call__
    return estimator.score(*args, **kwargs)
  File "/home/alex/dev/CentraleSupElec/CentraleSupElec-NLP-Public-Ressources/.venv/lib/python3.10/site-packages/sklearn/pipeline.py", line 997, in score
    return self.steps[-1][1].score(Xt, y, **score_params)
  File "/home/alex/dev/CentraleSupElec/CentraleSupElec-NLP-Public-Ressources/.venv/lib/python3.10/site-packages/sklearn/base.py", line 764, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "/home/alex/dev/CentraleSupElec/CentraleSupElec-NLP-Public-Ressources/.venv/lib/

(The traceback above is truncated in this export. It is raised while scoring a candidate, which suggests that some pipelines cannot predict on such tiny folds; a likely culprit is KNeighborsClassifier(n_neighbors=5) fitted on fewer than 5 rows. By default GridSearchCV records a NaN score for such a candidate and carries on with the search.)

Resultize ! 

In [161]:
resultize(grid).head(10)

Unnamed: 0,mean_fit_time,std_fit_time,param_estimator,param_imputer,param_scaler,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
39,0.28,0.01,RandomForestClassifier(),KNNImputer(n_neighbors=3),StandardScaler(),"{'estimator': RandomForestClassifier(), 'imput...",0.88,0.12,1,1.0,0.0
41,0.28,0.04,RandomForestClassifier(),KNNImputer(n_neighbors=3),QuantileTransformer(n_quantiles=10),"{'estimator': RandomForestClassifier(), 'imput...",0.88,0.12,1,1.0,0.0
36,0.36,0.04,RandomForestClassifier(),passthrough,StandardScaler(),"{'estimator': RandomForestClassifier(), 'imput...",0.88,0.12,1,1.0,0.0
45,0.21,0.03,RandomForestClassifier(),SimpleImputer(strategy='median'),StandardScaler(),"{'estimator': RandomForestClassifier(), 'imput...",0.88,0.12,1,1.0,0.0
44,0.21,0.04,RandomForestClassifier(),KNNImputer(),QuantileTransformer(n_quantiles=10),"{'estimator': RandomForestClassifier(), 'imput...",0.88,0.12,1,1.0,0.0
38,0.34,0.08,RandomForestClassifier(),passthrough,QuantileTransformer(n_quantiles=10),"{'estimator': RandomForestClassifier(), 'imput...",0.88,0.12,1,1.0,0.0
42,0.38,0.1,RandomForestClassifier(),KNNImputer(),StandardScaler(),"{'estimator': RandomForestClassifier(), 'imput...",0.88,0.12,1,1.0,0.0
47,0.19,0.06,RandomForestClassifier(),SimpleImputer(strategy='median'),QuantileTransformer(n_quantiles=10),"{'estimator': RandomForestClassifier(), 'imput...",0.75,0.25,8,1.0,0.0
40,0.27,0.0,RandomForestClassifier(),KNNImputer(n_neighbors=3),Normalizer(),"{'estimator': RandomForestClassifier(), 'imput...",0.71,0.04,9,1.0,0.0
43,0.27,0.01,RandomForestClassifier(),KNNImputer(),Normalizer(),"{'estimator': RandomForestClassifier(), 'imput...",0.71,0.04,9,1.0,0.0
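
A train score of 1.0 next to a lower cross-validated score is a sign of overfitting, so the selected model still has to be checked on data the search never saw. A minimal sketch of that last step; the bundled iris data, the reduced grid, `cv=5` and `max_iter=1000` are assumptions made to keep the example short:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data, not the notebook's CSV
X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=42
)

demo_pipe = Pipeline(
    [("scaler", StandardScaler()), ("estimator", LogisticRegression(max_iter=1000))]
)

# Reduced grid: a baseline and two real candidates
demo_grid = {
    "estimator": [
        DummyClassifier(),
        LogisticRegression(max_iter=1000),
        KNeighborsClassifier(n_neighbors=5),
    ]
}

search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=5, refit=True)
search.fit(X_tr, y_tr)

# refit=True retrains the best candidate on the whole training set
best_name = type(search.best_params_["estimator"]).__name__
test_score = search.score(X_te, y_te)  # accuracy on the held-out test set

print(best_name, round(search.best_score_, 2), round(test_score, 2))
```

The `DummyClassifier` baseline is worth keeping in the grid: any candidate that does not beat it has learned nothing useful.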
