<center><H1> Introduction to Machine Learning with Scikit-learn </H1></center>
<center><H2> Part II: Simple classification models </H2></center>

<hr style="border-width:2px;border-color:#75DFC1">

> For this second part of the introduction to the `scikit-learn` module, we will focus on the second type of problem in Machine Learning: the **classification** problem.
>
> The objective of this introduction is:
>> * To introduce the classification problem.
>>
>>
>> * To learn how to use the `scikit-learn` module to build a classification model, also called a “classifier”.
>>
>>
>> * To introduce metrics useful for evaluating the performance of the model.

## Introduction to classification


### Objective of Classification

>In supervised learning, the aim is to predict the value of a target variable based on explanatory variables.
>
>- In a **regression** problem, the target variable consists of **continuous values**, such as the price of a house or the quantity of oxygen in the air. Such a variable can take **infinitely many values**.
>
>- In a **classification** problem, the target variable takes **discrete values**, which can be numeric or categorical. However, in both cases, the target variable has a **finite set of values**. The distinct values of the target variable are referred to as **classes**.
>
>
>**Hence, the goal of classification is to predict the class of an observation based on its explanatory variables.**
>
### Example of Classification
>
>Let's consider a binary classification problem, where there are **two** classes. We're tasked with determining whether the water in a stream is fit for consumption or not, based on its levels of toxic substances and mineral salt content.
>
>The two classes in this scenario are **'drinkable'** and **'non-drinkable'**.
>
> <br/>
>
> <img src = 'https://assets-datascientest.s3-eu-west-1.amazonaws.com/train/sklearn_intro_classification_binaire_en.png' style = "height:400px">
>
>
>
> In the figure above, each point represents a stream whose position on the plot is defined by its concentration of toxic substances and its content of mineral salts.
>
> The objective will be to build a **model capable of assigning one of the two classes** ('drinkable' / 'non-drinkable') to a stream of which only these two variables are known.
>
> The figure above suggests the existence of two zones allowing easy classification of streams:
>> * An area where the streams are drinkable (top left).
>>
>>
>> * An area where the streams are not drinkable (bottom right).
>
> We would like to create a model capable of **separating the dataset into two parts** corresponding to these areas.
>
> A simple technique would be to separate the two areas **using a line**.

* **(a)** Run the next cell to display the interactive figure.

> * The **orange** dots are the **drinkable** streams and the **blue** dots are the **non-drinkable** streams.
>
> * The **red arrow** corresponds to a **vector** defined by $w = (w_1, w_2)$. The red line corresponds to the orthogonal (i.e. perpendicular) plane to $w$. You can change the coordinates of the vector $w$ in two ways:
>> * By moving the sliders of `w_1` and `w_2`.
>>
>>
>> * By clicking on the values to the right of the sliders and typing the desired value.



* **(b)** Try to find a vector $w$ such that **the plane orthogonal to $w$ perfectly separates the two stream classes**.


* **(c)** A possible solution is given by the vector $w = (-1.47, 0.84)$. Does the vector $w = (1.47, -0.84)$ also give a solution?



In [3]:
from classification_widgets import linear_classification

linear_classification()



> The classification we just performed is **linear**: we used a flat boundary (a line in two dimensions, a plane or hyperplane in higher dimensions) to separate our classes.
>
> This plane was defined by the vector $w$. Consequently, **linear classification models aim to identify the vector $w$ that enables the most effective separation of the distinct classes**. Each linear model has its own methodology for determining this vector.
>
> There are also **non-linear** classification models, which we will see later.
>
> <br>
>
> <img src = 'https://assets-datascientest.s3-eu-west-1.amazonaws.com/train/sklearn_intro_classification_lin_non_lin_en.png' style = "height:400px">
>
> <br>
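
As a minimal sketch of this idea, a point can be classified by the sign of its dot product with $w$. The sketch assumes the separating line passes through the origin (the interactive figure may place it differently). The function name `classify` and the stream coordinates below are made up for illustration.

```python
import numpy as np

def classify(points, w):
    """Return +1 for points on the side w points toward, -1 for the other side."""
    return np.where(points @ w > 0, 1, -1)

# Hypothetical streams: (toxic substance level, mineral salt content), centred around 0
points = np.array([[-1.0, 2.0], [-2.0, 0.5], [1.5, -1.0], [2.0, 0.3]])

w = np.array([-1.47, 0.84])
labels = classify(points, w)     # sides assigned with w
flipped = classify(points, -w)   # sides assigned with the opposite vector

print(labels)
print(flipped)
```

Replacing $w$ with $-w$ leaves the separating line unchanged but swaps the side assigned to each class, so both vectors describe the same boundary.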

## 1. Using `scikit-learn` for classification

> We will now introduce the main tools of the `scikit-learn` module for solving a classification problem.
>
>
> In this exercise, we'll be working with the [Congressional Voting Records](https://archive.ics.uci.edu/ml/datasets/congressional+voting+records) dataset, which contains records of votes cast by members of the United States House of Representatives.
>
>Our classification task aims to **predict the political party** (either "Democrat" or "Republican") of House of Representatives members based on their votes on subjects like education, health, budget, etc.
>
>The explanatory variables will comprise the votes on various subjects, while the target variable will denote the political affiliation, i.e., "Democrat" or "Republican".
>
>To tackle this challenge, we'll employ a linear classification model: **Logistic Regression**.


### Data preparation

* **(a)** Run the following cell to import the `pandas` and `numpy` modules needed for the exercise.



In [4]:
import pandas as pd
import numpy as np
%matplotlib inline


* **(b)** Load the data contained in the file `'votes.csv'` into a `DataFrame` named `votes`.



In [5]:
df=pd.read_csv("votes.csv")
df.head(3)

Unnamed: 0,party,infants,water,budget,physician,salvador,religious,satellite,aid,missile,immigration,synfuels,education,superfund,crime,duty_free_exports,eaa_rsa
0,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n
2,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n,n



In order to briefly visualize our data:

* **(c)** Display the number of rows and columns of `votes`.


* **(d)** Show a preview of the first 20 rows of `votes`.



In [6]:
df.shape

(435, 17)

In [7]:
df.head(20)

Unnamed: 0,party,infants,water,budget,physician,salvador,religious,satellite,aid,missile,immigration,synfuels,education,superfund,crime,duty_free_exports,eaa_rsa
0,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n
2,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,n,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,n,y,y,y,y
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
6,democrat,n,y,n,y,y,y,n,n,n,n,n,n,n,y,y,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,n,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
9,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,n,n



> - The first column **`"party"`** contains the name of the **political party** to which each member of the House of Representatives belongs. This is the target variable.
>
>
> - The following **16** columns contain the votes of each member of Congress on legislative proposals:
>
>> - `'y'` indicates that the elected member voted **for** the bill.
>>
>>
>> - `'n'` indicates that the elected member voted **against** the bill.
>
> In order to use the data in a classification model, we must first transform these columns into binary **numeric** values, i.e. either 0 or 1.

* **(e)** For each of the columns 1 to 16 (column 0 being our target variable), replace the values `'y'` by 1 and `'n'` by 0. To do so, we can use the **`replace`** method from the `DataFrame` class.


* **(f)** Display the first 10 rows of the modified `DataFrame`.



In [65]:
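# Encode the votes as numbers: 'y' -> 1, 'n' -> 0 (the 'party' column contains neither value and is left unchanged)
df_r = df.replace({'y': 1, 'n': 0})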



In [66]:
df_r.head(10)

Unnamed: 0,party,infants,water,budget,physician,salvador,religious,satellite,aid,missile,immigration,synfuels,education,superfund,crime,duty_free_exports,eaa_rsa
0,republican,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,0
2,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1
5,democrat,0,1,1,0,1,1,0,0,0,0,0,0,1,1,1,1
6,democrat,0,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1
7,republican,0,1,0,1,1,1,0,0,0,0,0,0,1,1,0,1
8,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1
9,democrat,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0



* **(g)** In a `DataFrame` named `X`, store the **explanatory** variables of the dataset (all columns except `'party'`). For this, you can use the **`drop`** method of a `DataFrame`.


* **(h)** In a variable named `y` (a pandas `Series`), store the **target variable** (`'party'`).



In [67]:
X=df_r.drop('party', axis=1)
X.head(3)


Unnamed: 0,infants,water,budget,physician,salvador,religious,satellite,aid,missile,immigration,synfuels,education,superfund,crime,duty_free_exports,eaa_rsa
0,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,0
2,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0


In [68]:
y=df_r['party']
y.head(10)

0    republican
1    republican
2      democrat
3      democrat
4      democrat
5      democrat
6      democrat
7    republican
8    republican
9      democrat
Name: party, dtype: object

In [69]:
#y= y.replace({'republican': 1, 'democrat': 0})
#y.head(3)


> As with the regression problem, we must split the dataset into two sets: a **training set** and a **test set**. As a reminder:
>
>> - The training set is used to **train the classification** model, meaning to find the parameters of the model which best separate the classes.
>>
>>
>> - The test set is used to **evaluate** the model on data that it has never seen. This evaluation will allow us to judge the **generalizability** of the model.

* **(i)** Import the `train_test_split` function from the `sklearn.model_selection` submodule. Remember that this function is used as follows:
>
>```python
>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
>```

* **(j)** Split the data into a training set `(X_train, y_train)` and a test set `(X_test, y_test)` keeping 20% of the data for the test set.

>
>To **eliminate the randomness** of the `train_test_split` function, you can use the **`random_state`** parameter with an integer value (for example `random_state = 2`). Every time you call the function with the argument `random_state = 2`, the datasets produced will be the same.
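
The effect of `random_state` can be checked on a small example. This is a sketch on toy arrays (`X_toy` and `y_toy` are illustrative names, not part of the exercise): two calls with the same seed should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative: 10 observations with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Two independent calls using the same random_state
_, X_test_a, _, y_test_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2)
_, X_test_b, _, y_test_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2)

# Both calls select exactly the same test observations
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)
print(X_test_a.shape)  # 20% of 10 observations -> 2 test rows
```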



In [70]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (348, 16)
X_test shape: (87, 16)
y_train shape: (348,)
y_test shape: (87,)



> The logistic regression model is closely related to the **linear regression** model seen in the previous lesson.
>
> **The two should not be confused** since **they do not solve the same types of problems**:
>
>> * **Logistic** Regression is used for **classification** (predict classes).
>>
>>
>> * **Linear** regression is used for **regression** (predict a quantitative variable).
>
>
> The linear regression model was defined with the following formula:
>
> $$ y \approx \beta_0 + \sum_{j=1}^p \beta_j x_j $$
>
>
> Logistic regression no longer estimates $y$ directly but the **probability** that $y$ is equal to 0 or 1. Thus, the model is defined by the formula:
>
> $$P(y = 1) = f(\beta_0 + \sum_{j=1}^p \beta_j x_j)$$
>
>
> Where $$f(x) = \frac{1}{1 + e^{-x}}$$
>
>
> The $f$ function, often called **sigmoid** or **logistic function**, transforms the linear combination $\beta_0 + \sum_{j=1}^p \beta_j x_j$ into a value between 0 and 1 that can be interpreted as a **probability**:
>
>> * If $\beta_0 + \sum_{j=1}^p \beta_j x_j$ is **positive**, then $P(y = 1) \gt 0.5$, so the predicted class of the observation will be **1**.
>>
>>
>> * If $\beta_0 + \sum_{j=1}^p \beta_j x_j$ is **negative**, then $P(y = 1) \lt 0.5$, i.e. $P(y = 0) \gt 0.5$, so the predicted class of the observation will be **0**.
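
The behaviour of the sigmoid described above can be illustrated numerically. In this sketch, `z` stands for the linear combination $\beta_0 + \sum_{j=1}^p \beta_j x_j$. The three values of `z` are arbitrary, and assigning class 0 when the probability is exactly 0.5 is a convention chosen here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary values of the linear combination: negative, zero, positive
z = np.array([-3.0, 0.0, 3.0])

proba = sigmoid(z)                # P(y = 1) for each value of z
pred = (proba > 0.5).astype(int)  # predicted class: 1 if P(y = 1) > 0.5, else 0

print(proba.round(3))  # approximately 0.047, 0.5 and 0.953
print(pred)
```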

* **(k)** Import the `LogisticRegression` class from the `linear_model` submodule of `scikit-learn`.


* **(l)** Instantiate a `LogisticRegression` model named **`logreg`** without specifying constructor arguments.


* **(m)** Train the model on the training dataset.


* **(n)** Make a prediction on the **test** dataset. Store these predictions in **`y_pred_test_logreg`** and display the first 10 predictions.



In [73]:
from sklearn.linear_model import LogisticRegression

# Instantiate the model with its default parameters
logreg = LogisticRegression()


In [74]:
# Train the model on the training set
logreg.fit(X_train, y_train)

In [75]:
# Predict the classes of the test set
y_pred_test_logreg = logreg.predict(X_test)
print("First 10 predictions:", y_pred_test_logreg[:10])

First 10 predictions: ['democrat' 'republican' 'democrat' 'democrat' 'democrat' 'democrat'
 'democrat' 'democrat' 'republican' 'democrat']
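
To connect these predictions with the formula given earlier, the following sketch fits a model on a small made-up dataset (`X_toy`, `y_toy` and `model` are illustrative names, not the exercise data). For a binary problem, `decision_function` returns the linear combination, applying the sigmoid to it gives the probability of the second class in `classes_`, and its sign gives the predicted class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical votes of 8 members on 3 bills (1 = for, 0 = against)
X_toy = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 1, 0],
                  [0, 1, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1]])
y_toy = np.array(['democrat'] * 4 + ['republican'] * 4)

model = LogisticRegression().fit(X_toy, y_toy)

# Linear combination beta_0 + sum_j beta_j * x_j for each observation
z = model.decision_function(X_toy)

# The sigmoid of z is the probability of the second class in model.classes_
proba_second_class = 1.0 / (1.0 + np.exp(-z))
matches_proba = np.allclose(proba_second_class, model.predict_proba(X_toy)[:, 1])

# A positive z predicts classes_[1], otherwise classes_[0]
manual_pred = np.where(z > 0, model.classes_[1], model.classes_[0])
matches_pred = bool((manual_pred == model.predict(X_toy)).all())

print(model.classes_, matches_proba, matches_pred)
```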


## 2. Evaluating Classification Model Performance

> There are various metrics available for assessing the performance of classification models, including:
>
>> - **Accuracy**
>>
>> - **Precision and Recall**
>>
>
> Each metric provides a different perspective on the model's performance.
>
> To illustrate these concepts, we'll establish that the class **'republican' will be considered the positive class** (1), while **'democrat' will be the negative class** (0).
>
> With this in mind, we'll define:
>
>> - **True Positive (TP)**: An observation correctly classified as **positive** ('republican') by the model.
>>
>> - **False Positive (FP)**: An observation incorrectly classified as **positive** ('republican') by the model.
>>
>> - **True Negative (TN)**: An observation correctly classified as **negative** ('democrat') by the model.
>>
>> - **False Negative (FN)**: An observation incorrectly classified as **negative** ('democrat') by the model.
>
> <br/>
> <img src = "https://assets-datascientest.s3-eu-west-1.amazonaws.com/train/sklearn_intro_positif_negatif_en.png" style = "height:300px">
> <br/>
>
> The **accuracy** is the most common metric used to evaluate a model. It simply corresponds to the rate of **correct** predictions made by the model.
>
> We suppose that we have $n$ observations. We denote by $\mathrm{TP}$ the number of True Positives and $\mathrm{TN}$ the number of True Negatives. Then the accuracy is given by:
>
> $$\mathrm{accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{n}$$
>
>
> The **precision** is a metric which answers the question: **Among all the positive predictions of the model, how many are true positives?**
> If we denote by $\mathrm{FP}$ the number of False Positives of the model, then the precision is given by:
>
> $$\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
>
> A high precision score tells us the model does not blindly classify everyone as positive.
>
>
> The **recall** is a metric that quantifies the proportion of truly positive observations that were correctly classified as positive by the model.
>
> If we write $\mathrm{FN}$ for the number of False Negatives, then the recall is given by:
>
>
> $$\mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
>
> A high recall score tells us the model is able to properly detect the truly positive observations.
>
> The **confusion matrix** counts the values of TP, TN, FP and FN for a set of predictions, which allows us to calculate the three previous metrics:
>
> $$\text{Confusion Matrix} = \begin{bmatrix} \mathrm{TN} & \mathrm{FP} \\ \mathrm{FN} & \mathrm{TP} \end{bmatrix}$$
>
> The **`confusion_matrix`** function of the `sklearn.metrics` submodule generates the confusion matrix from the **predictions** of a model:
>
> ```python
> confusion_matrix(y_true, y_pred)
> ```
> As a reminder: 
>> * **`y_true`** contains the **true** values of y.
>>
>>
>> * **`y_pred`** contains the values of y **predicted** by the model.
>
> Displaying the confusion matrix can also be done with the **`pd.crosstab`** function.
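
As an illustration, here is a sketch of `confusion_matrix` on six made-up labels (not the exercise data). Passing `labels=['democrat', 'republican']` fixes the row and column order so that the negative class comes first, which matches the layout of the matrix shown above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted parties for six members
y_true = ['republican', 'democrat', 'democrat', 'republican', 'democrat', 'republican']
y_pred = ['republican', 'democrat', 'republican', 'democrat', 'democrat', 'republican']

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=['democrat', 'republican'])

# Flattening row by row yields TN, FP, FN, TP when the negative class is listed first
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```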
                              


* **(a)** Import the **`accuracy_score`**, **`precision_score`** and **`recall_score`** functions from the `sklearn.metrics` submodule.


* **(b)** Display the confusion matrix of the predictions made by the **`logreg`** model using **`pd.crosstab`**.


* **(c)** Calculate the accuracy, precision and recall of the predictions of the **`logreg`** model. To use the `precision_score` and `recall_score` metrics, you will need to pass the argument **`pos_label = 'republican'`** to specify that `'republican'` is the positive class.



In [76]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Rows: true classes (y_test), columns: predicted classes
conf_matrix = pd.crosstab(y_test, y_pred_test_logreg)
conf_matrix


party (true) / col_0 (predicted),democrat,republican
democrat,49,4
republican,2,32


In [78]:
# Accuracy
accuracy = accuracy_score(y_test, y_pred_test_logreg)

# Precision
precision = precision_score(y_test, y_pred_test_logreg, pos_label='republican')

# Recall
recall = recall_score(y_test, y_pred_test_logreg, pos_label='republican')

# Display the results
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

Accuracy: 0.93
Precision: 0.89
Recall: 0.94
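
The three scores above can be checked by hand from the counts in the confusion matrix displayed earlier, with 'republican' as the positive class: TN = 49, FP = 4, FN = 2, TP = 32. The snippet below applies the formulas directly.

```python
# Counts read from the crosstab above ('republican' is the positive class)
tn, fp, fn, tp = 49, 4, 2, 32
n = tn + fp + fn + tp  # 87 observations in the test set

accuracy = (tp + tn) / n    # share of correct predictions
precision = tp / (tp + fp)  # share of predicted republicans that are republicans
recall = tp / (tp + fn)     # share of actual republicans that were detected

print(f"{accuracy:.2f} {precision:.2f} {recall:.2f}")  # 0.93 0.89 0.94
```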



## Recap

> Scikit-learn offers many classification models such as **`LogisticRegression`**.
>
> The implementation of these models is done in the same way for **all** models of scikit-learn:
>
>> * **Instantiation** of the model.
>>
>>
>> * **Training** of the model: **`model.fit(X_train, y_train)`**.
>>
>>
>> * **Prediction**: **`model.predict(X_test)`**.
>
> The prediction on the test set allows us to **evaluate** the performance of the model thanks to suitable **metrics**.
>
> The metrics we have seen are used for **binary** classification and are calculated using 4 values:
>
>> * True Positives: Prediction = **+** | Reality = **+**
>>
>>
>> * True Negatives: Prediction = **-** | Reality = **-**
>>
>>
>> * False Positives: Prediction = **+** | Reality = **-**
>>
>>
>> * False Negatives: Prediction = **-** | Reality = **+**
>
> All these values can be calculated using the **confusion matrix** generated by the **`confusion_matrix`** function of the `sklearn.metrics` submodule or by the **`pd.crosstab`** function.
>
> Thanks to these values, we can calculate metrics like:
>
>
>> * **Accuracy**: The proportion of correctly classified observations.
>>
>>
>> * **Precision**: The proportion of true positives among all the positive predictions of the model.
>>
>>
>> * **Recall**: The proportion of truly positive observations that were correctly classified as positive by the model.
>
>
> All these metrics can be obtained using the **`classification_report`** function of the **`sklearn.metrics`** submodule.
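
A short sketch of `classification_report` on made-up labels (not the exercise data) is given below. The text report lists precision, recall and F1-score for each class. The `output_dict=True` option returns the same values as a dictionary, which is convenient for reading individual scores.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted parties for six members
y_true = ['republican', 'democrat', 'democrat', 'republican', 'democrat', 'republican']
y_pred = ['republican', 'democrat', 'republican', 'democrat', 'democrat', 'republican']

# Human-readable report with one row per class
report = classification_report(y_true, y_pred)
print(report)

# Same metrics as a nested dictionary
report_dict = classification_report(y_true, y_pred, output_dict=True)
print(report_dict['republican']['precision'])  # 2 true positives out of 3 positive predictions
```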