
<img src="https://datascientest.fr/train/assets/logo_datascientest.png" width="400">

<hr style="border-width:2px;border-color:#75DFC1">
<center><H1>Introduction to machine learning with Scikit-Learn</H1></center> 
<center><H2>Part II: Simple classification models</H2></center>
<hr style="border-width:2px;border-color:#75DFC1">

> In this second part of the introduction to the `scikit-learn` module, we turn to the second type of machine learning problem: the **classification problem**.
> 
> The objectives of this introduction are:
>> * To introduce the classification problem.
>>
>>
>> * To learn to use the `scikit-learn` module to build a classification model, also called a "classifier".
>>
>>
>> * To introduce metrics useful for evaluating model performance.

## Introduction to classification

### Objective of classification

> In supervised learning, the objective is to predict the value of a target variable from explanatory variables.
>> * In a **regression** problem, the target variable takes **continuous values**. These values are numerical: the price of a house, the quantity of oxygen in the air of a city, etc.<br> The target variable can therefore take an **infinite number of values**.
>>
>>
>> * In a **classification** problem, the target variable takes **discrete values**. These values can be numerical or textual, but in both cases the target variable takes a **finite number of values**.<br>
> The different values taken by the target variable are called **classes**.
>
> **The objective of classification is therefore to predict the class of an observation from its explanatory variables.**

### An example of classification

> Consider an example of **binary** classification, in other words one with **two** classes.<br>
> We seek to determine whether the water of a stream is potable or not, depending on its concentration of toxic substances and its mineral salt content.
>
> The two classes are therefore **'potable'** and **'non-potable'**.
>
><br>
><img src = 'https://assets-datascientest.s3-eu-west-1.amazonaws.com/train/sklearn_intro_classification_binaire.png' style = "height:400px">
><br>
>
> In the figure above, each point represents a stream whose position on the plane is defined by its concentration of toxic substances and its mineral salt content.
> 
> The objective is to build a **model capable of assigning one of the two classes** ('potable'/'non-potable') to a stream for which we know only these two variables.
>
> The figure suggests two zones that make the streams easy to classify:
>> * An area where the streams are potable (top left).
>>
>>
>> * An area where the streams are non-potable (bottom right).
>
> We would like to create a model capable of **separating the data into two parts** corresponding to these areas.
>
> A simple technique is to separate the two areas **with a line**.

* **(a)** Run the following cell to display the interactive figure.
> * The **orange** points are the **potable streams** and the **blue** points are the **non-potable streams**.
> 
> * The **red arrow** corresponds to a **vector** defined by $w = (w_1, w_2)$. The red line is the line orthogonal (i.e. perpendicular) to $w$. You can change the coordinates of the vector $w$ in two ways:
>> * by moving the sliders `w_1` and `w_2`.
>>
>>
>> * by clicking on the values to the right of the sliders and typing the desired value directly.


* **(b)** Try to find a vector $w$ such that **the line orthogonal to $w$ perfectly separates the two classes of streams**.


* **(c)** A possible solution is given by the vector $w = (-1.47, 0.84)$. Does the vector $w = (1.47, -0.84)$ also give a solution?



In [None]:
from classification_widgets import linear_classification

linear_classification()



> The classification we have just performed is **linear**, that is to say we used a straight line (a linear boundary) to separate our classes.
>
> Thus, the objective of linear classification models is to find the vector $w$ that best separates the different classes.<br>
> Each linear model has its own technique for finding this vector.
>
> There are also non-linear classification models, which we will see later.
>
><br>
><img src = 'https://assets-datascientest.s3-eu-west-1.amazonaws.com/train/sklearn_intro_classification_lin_non_lin.png' style = "height:400px">
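The idea of classifying with a vector $w$ can be sketched in a few lines: a point is assigned to one class or the other according to the sign of its dot product with $w$. This is only an illustration under stated assumptions: the four points below are invented, the helper `classify` is a hypothetical name, and the separating line is assumed to pass through the origin.

```python
import numpy as np

# Hypothetical streams: columns are (toxic concentration, mineral salt content).
# These values are made up for illustration only.
points = np.array([
    [0.2, 0.9],   # low toxicity, high mineral content
    [0.3, 0.7],
    [0.8, 0.2],   # high toxicity, low mineral content
    [0.9, 0.4],
])

def classify(points, w):
    """Return +1 or -1 depending on which side of the line orthogonal to w each point lies."""
    return np.where(points @ w > 0, 1, -1)

w = np.array([-1.47, 0.84])
side = classify(points, w)
side_flipped = classify(points, -w)

print(side)          # [ 1  1 -1 -1]
print(side_flipped)  # [-1 -1  1  1]
```

With these made-up points, $w$ and $-w$ define the same line; only the side labelled as positive is swapped.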

## 1. Using `scikit-learn` for classification

> We will now introduce the main tools of the `scikit-learn` module needed to solve a classification problem.
>
> In this exercise, we will use the [Congressional Voting Records](https://archive.ics.uci.edu/ml/datasets/congressional+voting+records) dataset, which contains a number of votes cast by members of the United States House of Representatives.
>
> The objective of our classification problem is to **predict the political party** ("democrat" or "republican") of the members of the House of Representatives from their votes on subjects such as education, health, the budget, etc.
>
> The explanatory variables are therefore the votes on different subjects, and the target variable is the political party, "democrat" or "republican".
>
> To solve this problem we will use a linear classification model: **logistic regression**.


### Data preparation

* **(a)** Run the following cell to import the `pandas` and `numpy` modules needed for the rest of the exercise.



In [None]:
import pandas as pd
import numpy as np
%matplotlib inline



* **(b)** Load the data contained in the file `'votes.csv'` into a `DataFrame` named `votes`.



In [None]:
# Insert your code here





In [None]:
votes = pd.read_csv('votes.csv')



To take a quick look at our data:

* **(c)** Display the number of rows and columns of `votes`.


* **(d)** Display the first 20 rows of `votes`.



In [None]:
# Insert your code here





In [None]:
# DataFrame dimensions
print('The DataFrame has', votes.shape[0], 'rows and', votes.shape[1], 'columns.')

# Display the first 20 rows
votes.head(20)



> * The first column **`'party'`** contains the name of the **political party** to which each member of the House of Representatives belongs.
>
>
> * The **following columns** contain the votes of each member on proposed bills:
>> * `'y'` indicates that the elected official voted **for** the bill.
>>
>>
>> * `'n'` indicates that the elected official voted **against** the bill.
>
> In order to use the data in a classification model, these columns must be transformed into binary **numerical** values, in other words 0 or 1.

* **(e)** For each of columns 1 to 16 (column 0 being our target variable), replace the values `'y'` with 1 and `'n'` with 0. To do this, we can use the **`replace`** method of the `DataFrame` class.


* **(f)** Display the first 10 rows of the modified `DataFrame`.



In [None]:
# Insert your code here





In [None]:
# Replace 'y' with 1 and 'n' with 0
votes = votes.replace(('y', 'n'), (1, 0))

# Display the first 10 rows of the DataFrame
votes.head(10)
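
As a complement to the `replace` approach, here is a possible variant that encodes only the vote columns with an explicit mapping, and then checks that no unexpected value was left behind. The small DataFrame `toy` and its column names are invented for illustration; they are not the real `votes.csv` columns.

```python
import pandas as pd

# Toy stand-in for the votes data; column names and values are invented.
toy = pd.DataFrame({
    "party": ["democrat", "republican", "democrat"],
    "bill_1": ["y", "n", "y"],
    "bill_2": ["n", "n", "y"],
})

vote_cols = toy.columns.drop("party")   # every column except the target
mapping = {"y": 1, "n": 0}

encoded = toy.copy()
encoded[vote_cols] = toy[vote_cols].apply(lambda col: col.map(mapping))

# Any value outside the mapping becomes NaN, so this check flags unexpected entries.
assert encoded[vote_cols].notna().all().all()
print(encoded)
```

Leaving the target column out of the mapping avoids accidentally altering the party labels.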



* **(g)** In a `DataFrame` named `X`, store the **explanatory** variables of the dataset (all columns except `'party'`). To do this, you can use the **`drop`** method of a `DataFrame`.


* **(h)** In a `Series` named `y`, store the **target variable** (`'party'`).



In [None]:
# Insert your code here





In [None]:
# Data separation

X = votes.drop(['party'], axis = 1)
y = votes['party']



> As with regression, we have to split the dataset into two parts: a **training set** and a **test set**. As a reminder:
>> * The training set is used to **train** the classification model, that is to say to find the model parameters that best separate the classes.
>>
>>
>> * The test set is used to **evaluate** the model on data it has never seen. This evaluation lets us judge the model's **ability to generalize**.

* **(i)** Import the function `train_test_split` from the submodule `sklearn.model_selection`. As a reminder, this function is used as follows:
> ```python
> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
> ```


* **(j)** Split the data into a training set `(X_train, y_train)` and a test set `(X_test, y_test)`, keeping 20% of the data for the test set.
> To remove the randomness of the function `train_test_split`, you can use the `random_state` parameter with an integer value (for example `random_state = 2`).<br>
> Thus, each time you call the function with the argument `random_state = 2`, the resulting datasets will be the same.



In [None]:
# Insert your code here





In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)
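
The effect of `random_state` can be checked on a small example. The sketch below uses invented toy arrays (`X_toy`, `y_toy`) rather than the votes data, and simply verifies that two calls with the same seed return the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations with 2 features each, and alternating labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Two calls with the same random_state give identical splits.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(X_train_a.shape, X_test_a.shape)  # (8, 2) (2, 2)
print("Identical splits:", same)
```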



> The **logistic regression** model is closely related to the **linear regression** seen in the previous notebook.
>
> **Do not confuse them**, since they do not solve the same type of problem:
>> * **Logistic** regression is used for classification (predicting classes).
>>
>>
>> * **Linear** regression is used for regression (predicting a quantitative variable).
>
> The linear regression model was defined by the following formula:
> $$ y \approx \beta_0 + \sum_{j=1}^p \beta_j x_j $$
>
> Logistic regression no longer estimates $y$ directly but the **probability** that $y$ is equal to 0 or 1.<br>
> Thus, the model is defined by the formula:
> $$ P(y = 1) = f\left(\beta_0 + \sum_{j=1}^p \beta_j x_j\right) $$
>
> where $$ f(x) = \frac{1}{1 + e^{-x}} $$
>
> The function $f$, often called the **sigmoid** or **logistic function**, transforms the linear combination $\beta_0 + \sum_{j=1}^p \beta_j x_j$ into a value between 0 and 1 that can be interpreted as a **probability**:
>> * If $\beta_0 + \sum_{j=1}^p \beta_j x_j$ is positive, then $P(y = 1) \gt 0.5$, so the predicted class of the observation will be 1.
>>
>>
>> * If $\beta_0 + \sum_{j=1}^p \beta_j x_j$ is negative, then $P(y = 1) \lt 0.5$, that is to say $P(y = 0) \gt 0.5$, so the predicted class of the observation will be 0.
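
The behaviour of the sigmoid described above can be checked numerically. In this minimal sketch, the function name `sigmoid` and the sample inputs `z` (standing for values of the linear combination) are chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Three example values of the linear combination: negative, zero, positive.
z = np.array([-3.0, 0.0, 3.0])
p = sigmoid(z)

# Class 1 is predicted when the probability exceeds 0.5.
predicted_class = (p > 0.5).astype(int)

print(p)
print(predicted_class)  # [0 0 1]
```

A negative input gives a probability below 0.5, a positive input gives a probability above 0.5, and an input of exactly 0 gives 0.5.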

* **(k)** Import the class `LogisticRegression` from the submodule `linear_model` of `scikit-learn`.


* **(l)** Instantiate a `LogisticRegression` model named **`logreg`** without specifying any constructor arguments.


* **(m)** Train the model on the training set using the `fit` method of the `LogisticRegression` class.


* **(n)** Make a prediction on the **test data**. Store these predictions in **`y_pred_test_logreg`** and display the first 10 predictions.



In [None]:
# Insert your code here





In [None]:
# Import the LogisticRegression class from the sklearn.linear_model submodule
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logreg = LogisticRegression()

# Train the model on the training set
logreg.fit(X_train, y_train)

# Prediction on test data
y_pred_test_logreg = logreg.predict(X_test)

# Display of the first 10 predictions
print(y_pred_test_logreg[:10])
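
To connect the fitted model with the formula seen earlier, the sketch below rebuilds the predicted probabilities by hand from the attributes `intercept_` and `coef_`, and compares them with `predict_proba`. It uses synthetic data generated by `make_classification` as a stand-in, not the votes dataset, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the votes (not the real dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression()
model.fit(X_demo, y_demo)

# Linear combination beta_0 + sum_j beta_j * x_j for each observation.
linear_part = model.intercept_ + X_demo @ model.coef_.ravel()

# Applying the sigmoid gives the probability of the second class in model.classes_.
proba_manual = 1.0 / (1.0 + np.exp(-linear_part))
proba_sklearn = model.predict_proba(X_demo)[:, 1]

print(np.allclose(proba_manual, proba_sklearn))
print(model.classes_)
```

With string labels such as `'democrat'` and `'republican'`, the order of `model.classes_` indicates which class the second column of `predict_proba` refers to.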



## 2. Evaluate the performance of a classification model

> There are different metrics for assessing the performance of classification models, such as:
>> * **Accuracy**.
>> 
>>
>> * **Precision and recall**.
>
> 
> Each metric assesses the performance of the model from a different angle.
>
> In order to explain these concepts, we will introduce 4 very important terms.
>
> **Arbitrarily**, we will choose **'republican' as the positive class** (1) and **'democrat' as the negative class** (0).
>
> Thus, we will call:
>> * **True positive (TP)** an observation classified as **positive** ('republican') by the model and which is actually **positive** ('republican').
>>
>>
>> * **False positive (FP)** an observation classified as **positive** ('republican') by the model but which is actually **negative** ('democrat').
>>
>>
>> * **True negative (TN)** an observation classified as **negative** ('democrat') by the model and which is actually **negative** ('democrat').
>>
>>
>> * **False negative (FN)** an observation classified as **negative** ('democrat') by the model but which is actually **positive** ('republican').
>
><br>
><img src = "https://assets-datascientest.s3-eu-west-1.amazonaws.com/train/sklearn_intro_positif_negatif.png" style = "height:300px">
><br>
>
> **Accuracy** is the most commonly used metric to assess a model.<br>
> It is simply the proportion of **correct predictions** made by the model.
>
> Suppose we have $n$ observations.<br>
> We write $\mathrm{TP}$ for the number of true positives and $\mathrm{TN}$ for the number of true negatives.<br>
> The accuracy is then given by:
> $$ \mathrm{accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{n} $$
> 
> **Precision** is a metric that answers the question: **Among all the positive predictions of the model, how many are true positives?**
>
> If we write $\mathrm{FP}$ for the number of false positives of the model, then the precision is given by:
> $$ \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $$
>
> A high precision score tells us that the model does not blindly classify all observations as positive.
> 
> **Recall** is a metric that quantifies the proportion of truly positive observations that the model correctly classified as positive.
>
> If we write $\mathrm{FN}$ for the number of false negatives, then the recall is given by:
> $$ \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $$
>
> A high recall score tells us that the model is able to detect the truly positive observations.
>
> The **confusion matrix** counts, for a dataset, the values of TP, TN, FP and FN, which lets us compute the three previous metrics:
>
> $$
\mathrm{Confusion\ Matrix} = \begin{bmatrix}
\mathrm{TN} & \mathrm{FP} \\
\mathrm{FN} & \mathrm{TP}
\end{bmatrix}
 $$
>
> The function **`confusion_matrix`** of the submodule `sklearn.metrics` generates the confusion matrix from the **predictions** of a model:
>
> ```python
> confusion_matrix(y_true, y_pred)
> ```
>
>> * **`y_true`** contains the **true** values of y.
>>
>>
>> * **`y_pred`** contains the values **predicted** by the model.
>
> The confusion matrix can also be displayed with the function **`pd.crosstab`**.
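
The two ways of building the confusion matrix can be compared on a small example. The labels below are invented for illustration. Passing `labels` to `confusion_matrix` is one way to make the row and column order explicit, with the negative class first.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented labels for illustration; 'democrat' is the negative class here.
y_true = pd.Series(["democrat", "democrat", "republican", "republican", "republican", "democrat"])
y_pred = pd.Series(["democrat", "republican", "republican", "democrat", "republican", "democrat"])

# Passing labels fixes the row/column order: negative class first, positive class second.
cm = confusion_matrix(y_true, y_pred, labels=["democrat", "republican"])
tn, fp, fn, tp = cm.ravel()

# Same table with labelled rows (actual) and columns (predicted).
table = pd.crosstab(y_true, y_pred, rownames=["Actual"], colnames=["Predicted"])

print(cm)
print(table)
```

In both outputs, rows correspond to the true classes and columns to the predicted classes.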


* **(a)** Import the functions **`accuracy_score`**, **`precision_score`** and **`recall_score`** from the submodule `sklearn.metrics`.


* **(b)** Display the confusion matrix of the predictions of the model **`logreg`** using **`pd.crosstab`**.


* **(c)** Compute the accuracy, precision and recall of the predictions of the model **`logreg`**. To use the metrics `precision_score` and `recall_score`, you need to pass the argument **`pos_label = 'republican'`** in order to specify that the class `'republican'` is the positive class.


In [None]:
# Insert your code here





In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Compute and display the confusion matrix
print(pd.crosstab(y_test, y_pred_test_logreg, rownames=['Actual'], colnames=['Predicted']))

# Compute the accuracy, precision and recall
print("\nLogReg Accuracy:", accuracy_score(y_test, y_pred_test_logreg))

print("\nLogReg Precision:", precision_score(y_test, y_pred_test_logreg, pos_label = 'republican'))

print("\nLogReg Recall:", recall_score(y_test, y_pred_test_logreg, pos_label = 'republican'))
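
To see how these scores relate to the four counts, the sketch below recomputes accuracy, precision and recall by hand and compares them with the `scikit-learn` functions. The labels are invented for illustration and are not taken from the votes data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels for illustration only.
y_true = ["republican", "republican", "republican", "democrat",
          "democrat", "democrat", "democrat", "republican"]
y_pred = ["republican", "republican", "democrat", "democrat",
          "democrat", "republican", "democrat", "republican"]

positive = "republican"
pairs = list(zip(y_true, y_pred))

# Count the four outcomes with 'republican' as the positive class.
tp = sum(t == positive and p == positive for t, p in pairs)
tn = sum(t != positive and p != positive for t, p in pairs)
fp = sum(t != positive and p == positive for t, p in pairs)
fn = sum(t == positive and p != positive for t, p in pairs)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# The hand-computed values agree with the scikit-learn functions.
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-12
assert abs(precision - precision_score(y_true, y_pred, pos_label=positive)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred, pos_label=positive)) < 1e-12

print(accuracy, precision, recall)  # 0.75 0.75 0.75
```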



# Recap

> `scikit-learn` offers many classification models such as **`LogisticRegression`**.
>
> These models are used in the same way for **all** `scikit-learn` models:
>> * **Instantiation** of the model.
>>
>> 
>> * **Training** of the model: **`model.fit(X_train, y_train)`**.
>>
>>
>> * **Prediction**: **`model.predict(X_test)`**.
>
> The prediction on the test set lets us **evaluate** the performance of the model using suitable **metrics**.
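
The instantiate, fit, predict pattern can be illustrated by running the same three steps on two different classifiers. `KNeighborsClassifier` is not covered in this notebook and is used here only as a second example of a `scikit-learn` classifier. The data is synthetic, and the names `X_tr`, `X_te`, `y_tr`, `y_te` are chosen to avoid overwriting the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data, used only to demonstrate the common API.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

predictions = {}
for model in (LogisticRegression(), KNeighborsClassifier()):  # instantiation
    model.fit(X_tr, y_tr)                                      # training
    predictions[type(model).__name__] = model.predict(X_te)    # prediction

for name, y_hat in predictions.items():
    print(name, y_hat[:10])
```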
>
> The metrics we have seen are used for **binary** classification and are computed from 4 values:
>> * True positives: prediction = **+** | reality = **+**
>>
>>
>> * True negatives: prediction = **-** | reality = **-**
>>
>>
>> * False positives: prediction = **+** | reality = **-**
>>
>>
>> * False negatives: prediction = **-** | reality = **+**
>
> All these values can be computed using the **confusion matrix** generated by the function **`confusion_matrix`** of the submodule `sklearn.metrics` or by the function **`pd.crosstab`**.
> 
> From these values, we can compute metrics such as:
>> * **Accuracy**: the proportion of correctly classified observations.
>>
>>
>> * **Precision**: the proportion of true positives among all the positive predictions of the model.
>>
>>
>> * **Recall**: the proportion of truly positive observations that the model correctly classified as positive.
>
> All these metrics can be obtained using the function **`classification_report`** of the submodule **`sklearn.metrics`**.
>
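
As a brief illustration of `classification_report`, the sketch below applies it to a handful of invented labels. The `output_dict=True` option returns the same figures as a dictionary, which can be convenient for reading a single value.

```python
from sklearn.metrics import classification_report

# Invented labels for illustration only.
y_true = ["republican", "republican", "republican", "democrat",
          "democrat", "democrat", "democrat", "republican"]
y_pred = ["republican", "republican", "democrat", "democrat",
          "democrat", "republican", "democrat", "republican"]

# Text summary: precision, recall and f1-score per class, plus overall accuracy.
print(classification_report(y_true, y_pred))

# Same figures as a nested dictionary keyed by class label.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["republican"]["precision"], report["republican"]["recall"])
```

The report lists each class in turn, so no `pos_label` argument is needed here.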
# Conclusion and resources

> This module presented the Python programming language and introduced its main libraries, which will be very useful later on (NumPy, pandas, scikit-learn). The pandas library in particular lets you work with data as easily manipulated DataFrames.
>
> **If you want to explore these methods a little further, in the continuity of this module, you can turn to the "105 data quality" module.**
>
> **If you want to apply the methods presented to other data, you can do so with the "sandbox" module. This module consists of a blank notebook in which data is available and in which you can code freely.**
