# Introduction to Machine Learning with scikit-learn
### Notebook for participants to fill in with their own code

### About me

Stefanie Senger, Historian (PhD)

- Contributor to scikit-learn

- Data Science Teacher at Le Wagon

- Connect with me on LinkedIn: https://www.linkedin.com/in/stefaniesenger/


### Workshop outline (90 min.)

- Machine Learning 101

- What is scikit-learn?

- Practical Part

  - Predictive modeling pipeline

  - Evaluation of models

  - Hyperparameter tuning

<br>
<br>
Link to the full notebook without gaps, for participants:

https://github.com/StefanieSenger/Talks/blob/main/2023_EuroSciPy/2023_EuroSciPy_Intro_to_scikit-learn_full.ipynb

# Machine Learning 101

### Main Idea: 

### to learn from past data --> make predictions for future data

    We assume there is some structure in the data that is not purely coincidental.
    We further assume that this structure is going to reoccur in the future.

### Types of Machine Learning

![Regression-vs-Classification](images/Regression-vs-Classification.png)

Source: [Le Wagon Data Science Curriculum](https://www.lewagon.com/)

    - Regression: target is continuous (something to measure)
    - Classification: target is a class (bucket)
    - Unsupervised Learning: no target

### Scikit-learn's modelling workflow

In [None]:
%%script false --no-raise-error

from sklearn.some_module import SomeModel

model = SomeModel()
model.fit(X_train, y_train)
model.predict(X_new)
model.score(X_test, y_test)
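
The placeholder workflow above can be made concrete with a small sketch. It assumes synthetic data from `make_classification` and picks `LogisticRegression` as the model; the names `X_demo`, `y_demo`, `demo_model` and the split variables are hypothetical and chosen so they do not overwrite the workshop's own variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only, not the workshop dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

demo_model = LogisticRegression()
demo_model.fit(X_tr, y_tr)                   # learn parameters from the training data
demo_predictions = demo_model.predict(X_te)  # one predicted class label per row
demo_accuracy = demo_model.score(X_te, y_te) # mean accuracy, the default score for classifiers

print(demo_predictions[:10], demo_accuracy)
```

For regressors, `score` returns the R² coefficient instead of accuracy.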

### Scikit-learn's modelling workflow

![Model.fit](images/api_diagram-predictor.fit.svg)


Source: [Inria Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)

### Scikit-learn's modelling workflow

![Model.predict](images/api_diagram-predictor.predict.svg)

Source: [Inria Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)

### Scikit-learn's modelling workflow

![Model.score](images/api_diagram-predictor.score.svg)

Source: [Inria Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)

### Holdout Method
![train-test-split](images/Train-Test-Split.png)

Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Train-Test-Validation.png)

### Holdout Method in scikit-learn

In [None]:
%%script false --no-raise-error

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
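
As a rough illustration of the holdout split, the sketch below uses a tiny made-up array. The `random_state` and `stratify` arguments are optional additions that are not part of the cell above: the first makes the split repeatable, the second keeps the class proportions similar in both parts. All variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 2 features and an imbalanced target (70% / 30%)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 14 + [1] * 6)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)   # the test part holds roughly 30% of the rows
print(y_tr.mean(), y_te.mean()) # share of class 1 in each part
```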

### Cross Validation

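    - averages scores over several splits, so the result depends less on one particular split

    - each score is computed on a validation fold that was held out from training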

![GridSearchCV](images/grid_search_cv.png)

Source: [Le Wagon Data Science Curriculum](https://www.lewagon.com/)

### Cross Validation in scikit-learn

In [None]:
%%script false --no-raise-error

from sklearn.model_selection import cross_validate

# cross_validate returns a dict with the keys "fit_time", "score_time" and "test_score"
cv_results = cross_validate(SomeModel(), X_train, y_train, cv=5)
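
A minimal runnable sketch of cross-validation, assuming synthetic data and `LogisticRegression` as a stand-in for `SomeModel`. The names `X_demo`, `y_demo` and `demo_cv` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# One fit and one validation score per fold
demo_cv = cross_validate(LogisticRegression(), X_demo, y_demo, cv=5)

print(sorted(demo_cv))        # keys of the returned dict
print(demo_cv["test_score"])  # five scores, one per validation fold
print(demo_cv["test_score"].mean(), demo_cv["test_score"].std())
```

Reporting the mean together with the standard deviation gives a feeling for how much the score varies between folds.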

### Bias-Variance Tradeoff
![Bias-Variance-Tradeoff](images/bias-variance-tradeoff.png)

Source: [Le Wagon Data Science Curriculum](https://www.lewagon.com/)

### Reducing Test Error

![Model-complexity](images/model-complexity.png)

Source: [Inria Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)
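
To get a feeling for the relation between model complexity and error, one possible experiment is sketched below. It assumes noisy synthetic data and uses the depth of a `DecisionTreeClassifier` as the complexity knob; neither is part of the workshop material, and all names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

scores_by_depth = {}
for depth in [1, 3, 10, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores_by_depth[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores_by_depth[depth])  # (train accuracy, test accuracy)
```

The training accuracy keeps rising with depth, while the test accuracy typically stops improving at some point. That gap is what the diagram above describes.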

## Scikit-learn
    Scikit-learn (imported as sklearn) is a Machine Learning library that provides tools for data preprocessing, modeling, and model selection.

![Scikit-learn_logo](images/1164px-Scikit_learn_logo.svg.png)

Source: [Scikit-learn](https://github.com/scikit-learn/scikit-learn/blob/main/doc/logos/scikit-learn-logo.svg)

### Scikit-learn algorithm cheat sheet
![Scikit-learn algorithm cheat sheet](images/Scikit-learn_machine_learning_decision_tree.png)

Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Scikit-learn_machine_learning_decision_tree.png)

## Predictive modeling pipeline

In [None]:
from sklearn.datasets import fetch_openml
import pandas as pd

data = fetch_openml("adult-census", parser="pandas")
X = data['data']
X

In [None]:
y = data['target']
y

#### Our goal: classify people's income as "high" or "low".

### Data Preparation

![Preproc](images/preprocessing.png)

Source: [Le Wagon Data Science Curriculum](https://www.lewagon.com/)

In [None]:
# Cleaning column names
X.columns = X.columns.str.strip(":")

# Drop columns: 'education-num' contains the same information as 'education',
# and 'fnlwgt' ("final weight") is a sampling weight computed for the census,
# not a characteristic of the person
X = X.drop(columns=['education-num', 'fnlwgt'])
X

In [None]:
# Checking dtypes


In [None]:
# Checking for impossible values


In [None]:
# Inspecting if target is balanced


In [None]:
# Train-test-split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

### Data Preparation

      - Imputing missing values
      - Encoding categorical features
      - Scaling numerical features
  
</br>

#### Preprocessing steps depend on the final estimator choice.

### Imputing missing values

In [None]:
# Checking for missing values


    Observation: missing values occur only in categorical columns
    -> we will take this into account when we impute values

### Transformers

![Transformer.fit_transform](images/api_diagram-transformer.fit_transform.svg)

Source: [Inria Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)

In [None]:
# Imputing missing values 
from sklearn.impute import SimpleImputer

# Instantiate a SimpleImputer object with strategy of choice

# Fit imputer

# The imputed value is stored in the transformer's memory
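
A small sketch of how a transformer learns during `fit` and applies what it learned during `transform`. It assumes a made-up one-column DataFrame rather than the census data; `toy` and `toy_imputer` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up categorical column with one missing value
toy = pd.DataFrame(
    {"workclass": ["Private", np.nan, "Private", "State-gov"]}, dtype=object
)

toy_imputer = SimpleImputer(strategy="most_frequent")
toy_imputer.fit(toy)                    # learns the most frequent value per column
print(toy_imputer.statistics_)          # the learned fill values

toy_filled = toy_imputer.transform(toy) # a NumPy array unless pandas output is configured
print(toy_filled)
```

Because the fill value is learned in `fit`, the same value is reused when the imputer later transforms the test set.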

In [None]:
# Inspect filled up DataFrame

# use `sklearn.set_config(transform_output="pandas")` for scikit-learn version 1.2 and higher

### Encoding

    - transforming categorical data into an equivalent numerical form

![OHE](images/ohe_visualization.png)

Source: [Le Wagon Data Science Curriculum](https://www.lewagon.com/)

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Instantiate the OneHotEncoder

# Fit encoder

# Display the detected categories


In [None]:
# Display the new feature names


In [None]:
# Transforming learned categories into binary columns


### Pipeline

    - a chain of operations in a Machine Learning project (preprocessing, training, predicting)
    - can be used to put together building blocks of several transformers and a predictive model

In [None]:
from sklearn.pipeline import Pipeline

# Build the pipeline with the different steps


In [None]:
# Check what the OHE output looks like


### Scaling

    - transforming numerical features into a common, smaller range

#### MinMaxScaler

### $X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$

![MinMaxScaler](images/MinMaxScaler.png)

Source: [Le Wagon Data Science Curriculum](https://www.lewagon.com/)
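
The formula can be checked by hand on three made-up values. The sketch below compares the output of `MinMaxScaler` with the same computation written out in NumPy; `ages`, `toy_scaler` and the other names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One made-up numerical column
ages = np.array([[20.0], [35.0], [50.0]])

toy_scaler = MinMaxScaler()
toy_scaled = toy_scaler.fit_transform(ages)

# The same result computed directly from the formula above
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(toy_scaled.ravel())
print(np.allclose(toy_scaled, manual))
print(toy_scaler.data_min_, toy_scaler.data_max_, toy_scaler.data_range_)
```

The minimum maps to 0, the maximum to 1, and 35 lands halfway at about 0.5.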

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Instantiate scaler object

# Fit MinMaxScaler

# Inspect `scaler.data_range_` and other learned attributes

In [None]:
# Check what the MinMaxScaler output looks like


### ColumnTransformer

![ColumnTransformer](images/api_diagram-columntransformer.svg)

Source: [Inria Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector # "trick", but you can also manually select columns

num_transformer = MinMaxScaler()

# Parallelize "num_transformer" and "cat_transformer"
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, make_column_selector(dtype_exclude='category')),
    ('cat_transformer', cat_transformer, make_column_selector(dtype_include='category'))
])

preprocessor

In [None]:
# Fit and transform on the preprocessor


In [None]:
# We cannot predict yet because there is no predictor


### Pipelines

    - a chain of operations in a Machine Learning project (preprocessing, training, predicting)
    - can be used to put together building blocks of several transformers and a predictive model

![Pipeline](images/pipeline.png)

Source: [Le Wagon Data Science Curriculum](https://www.lewagon.com/)

    - output of transformer is input into predictor
    - call a pipeline the same way you would call the last added estimator

Pipelines are powerful because they:

    - make your workflow much easier to read and understand
    - enforce the implementation and order of steps in your project
    - make your work reproducible and deployable
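
A minimal sketch of these points, assuming six made-up rows and a two-step pipeline (scaler plus logistic regression). All names and values are hypothetical and unrelated to the census data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up training data
toy_X = pd.DataFrame(
    {"age": [20, 25, 30, 45, 50, 60], "hours": [20, 30, 25, 45, 50, 40]}
)
toy_y = ["low", "low", "low", "high", "high", "high"]

toy_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),        # transformer: its output feeds the next step
    ("logreg", LogisticRegression()),  # predictor: the last step
])

# fit() fits the scaler, transforms the data, then fits the predictor
toy_pipeline.fit(toy_X, toy_y)

# predict() applies the already fitted scaler before predicting
toy_new = pd.DataFrame({"age": [22, 58], "hours": [22, 48]})
toy_predictions = toy_pipeline.predict(toy_new)
print(toy_predictions)

# Fitted steps stay accessible by name
print(toy_pipeline.named_steps["scaler"].data_max_)
```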

### Pipelines in scikit-learn

In [None]:
from sklearn.linear_model import LogisticRegression

# Build the pipeline combining preprocessor and predictor


### Everything in one cell

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector
from sklearn.linear_model import LogisticRegression

# Transformers
num_transformer = MinMaxScaler()
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('ohe', OneHotEncoder(drop = "if_binary", sparse_output = False, handle_unknown='ignore'))
])

# Preprocessor
preprocessor = ColumnTransformer([
    ('num_transformer', num_transformer, make_column_selector(dtype_exclude='category')),
    ('cat_transformer', cat_transformer, make_column_selector(dtype_include='category'))
])

# Pipeline with predictor
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("logreg", LogisticRegression(max_iter=1000))
])


pipeline

## Evaluation of models

    The pipeline becomes an estimator.

    On it we can:

    pipeline.fit(X_train, y_train)

    pipeline.score(X_test, y_test)

    pipeline.predict(X_new)

    ... and tune hyperparameters

In [None]:
# Training and predicting untuned model


In [None]:
# Score on untuned model with default scorer


In [None]:
# Checking with DummyClassifier (makes predictions that ignore the input features)
from sklearn.dummy import DummyClassifier


In [None]:
# Comparing with class prevalence
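
Why a baseline matters can be sketched with a made-up, imbalanced target. The labels and the 80/20 split below are invented for illustration, and `strategy="most_frequent"` is one of several strategies `DummyClassifier` offers.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: the dummy ignores the features, so zeros are enough
X_demo = np.zeros((10, 1))
y_demo = np.array(["<=50K"] * 8 + [">50K"] * 2)

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_demo, y_demo)
baseline_accuracy = dummy.score(X_demo, y_demo)

# Class prevalence: share of each class in the target
labels, counts = np.unique(y_demo, return_counts=True)
prevalence = dict(zip(labels, counts / counts.sum()))

print(baseline_accuracy)
print(prevalence)
```

Always predicting the majority class reaches an accuracy equal to that class's prevalence, so a real model should be judged against this number rather than against zero.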


## Hyperparameter tuning

In [None]:
# Get tunable params
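
As a sketch of what the tunable parameters of a pipeline look like, the example below builds a small two-step pipeline (hypothetical, not the workshop's `pipeline`) and lists its parameter names.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("logreg", LogisticRegression()),
])

# Parameters of nested steps are named <step name>__<parameter name>
toy_param_names = sorted(toy_pipeline.get_params())
print([name for name in toy_param_names if "__" in name])
```

These double-underscore names are the keys a hyperparameter grid expects.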


### GridSearchCV

    - lets us check which combination of preprocessing and modeling hyperparameters works best

![ChangingStuff](images/changing_stuff.png)

Source: [Introduction to scikit-learn by Olivier Grisel](http://ogrisel.github.io/decks/2017_intro_sklearn/#1)

![GridSearchCV](images/grid_vs_random_search.svg)

Source: [Inria Scikit-learn MOOC](https://inria.github.io/scikit-learn-mooc/)

### GridSearchCV in scikit-learn

In [None]:
%%script false --no-raise-error

from sklearn.model_selection import GridSearchCV

# Hyperparameter Grid
grid = {
    'param1': [0.01, 0.1, 1], 
    'param2': [0.2, 0.5, 0.8]
}

# Instantiate Grid Search
grid_search = GridSearchCV(
    SomeModel(),
    grid, 
    scoring = 'scoring_metric',  # placeholder for one metric name, e.g. 'accuracy'
    cv = 5,
) 

# Fit data to Grid Search
grid_search.fit(X_train, y_train)

# Get best params
grid_search.best_params_
grid_search.best_estimator_
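
A runnable version of the template above might look like the following sketch. It assumes synthetic data and a small scaler-plus-logistic-regression pipeline; the grid values are arbitrary and all names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("logreg", LogisticRegression()),
])

# 3 values for C times 2 feature ranges = 6 candidate combinations
toy_grid = {
    "logreg__C": [0.01, 1, 100],
    "scaler__feature_range": [(0, 1), (0, 2)],
}

toy_search = GridSearchCV(toy_pipeline, toy_grid, scoring="accuracy", cv=5)
toy_search.fit(X_demo, y_demo)  # 6 candidates x 5 folds, plus one final refit

print(toy_search.best_params_)
print(toy_search.best_score_)   # mean cross-validated score of the best candidate
print(len(toy_search.cv_results_["params"]))
```

With the default `refit=True`, `best_estimator_` is refitted on all the data passed to `fit` and can be used directly for `predict` and `score`.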

In [None]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import GridSearchCV
import numpy as np

# Hyperparameter Grid
grid = {
    'preprocessor__num_transformer__feature_range': [(0,1), (0,2)], 
    'preprocessor__cat_transformer__ohe__min_frequency': [0.02, 0.05],
    'logreg__C': np.logspace(-3, 3, num=10)
}

# Instantiate Grid Search
grid_search = GridSearchCV(
    pipeline,
    grid, 
    scoring = 'accuracy',
    cv = 5,
) 

# Fit data to Grid Search
grid_search.fit(X_train, y_train)

# Get best params
grid_search.best_params_

In [None]:
# Scoring on best estimator


In [None]:
# Inspect grid_search
pd.DataFrame(grid_search.cv_results_)[['param_logreg__C', 'param_preprocessor__num_transformer__feature_range', 'param_preprocessor__cat_transformer__ohe__min_frequency', 'mean_test_score', 'rank_test_score']]

In [None]:
# Predict using best_estimator_


### Haven't had enough yet?

#### Have a look at our scikit-learn MOOC: https://inria.github.io/scikit-learn-mooc/index.html

## Thank you for your attention!