<a href="https://colab.research.google.com/github/Demosthene-OR/Student-AI-and-Data-Management/blob/main/59_sklearn_intro_01_regression_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<img src="https://prof.totalenergies.com/wp-content/uploads/2024/09/TotalEnergies_TPA_picto_DegradeRouge_RVB-1024x1024.png" height="150" width="150">

<hr style="border-width:2px;border-color:#75DFC1">
<center><h1>Introduction to machine learning with Scikit-Learn</h1></center>
<center><h2>Linear regression</h2></center>
<hr style="border-width:2px;border-color:#75DFC1">



## Introduction

> This notebook introduces the core concepts of machine learning, and of linear regression in particular, to show how well suited Python is to machine learning problems. All these concepts will be presented in more detail and put into practice in the modules dedicated to machine learning.
>
> Machine learning is a subfield of artificial intelligence that gives computers the ability to learn to perform tasks automatically from data. When the task is to predict a variable, we speak of supervised learning.
>
> Linear regression is one of the first predictive models of supervised learning to have been studied. It predicts a quantitative variable and, thanks to its simplicity, remains one of the most widely used models in practical applications.
>
> In the linear regression model, we have a quantitative variable to predict (called the target variable) and explanatory variables used to make the prediction.

### Univariate linear regression

> In the univariate linear model, we have two variables: $y$, called the *target* variable, and $x$, called the *explanatory* variable.<br>
> Linear regression consists in modeling the relationship between these two variables with an affine function. The univariate linear model is thus given by:
> $$ y \approx \beta_1 x + \beta_0 $$
> where:
>> * $y$ is the variable we want to predict.
>>
>>
>> * $x$ is the explanatory variable.
>>
>>
>> * $\beta_1$ and $\beta_0$ are the parameters of the affine function. $\beta_1$ defines its **slope** and $\beta_0$ defines its intercept (also called the **bias**).
>
> **The goal of linear regression is to estimate the parameters $\beta_0$ and $\beta_1$ that best predict the variable $y$ from a given value of $x$**.
>
> To build intuition for univariate linear regression, let's look at the interactive example below.

* **(a)** Run the following cell to display the interactive figure. The figure shows a simulated dataset.


* **(b)** Using the sliders in the `Regression` tab, try to find the parameters $\beta_0$ and $\beta_1$ that best fit all the points in the dataset.


* **(c)** What is the effect of each parameter on the regression function?



In [1]:
!wget -q https://raw.githubusercontent.com/Demosthene-OR/Student-AI-and-Data-Management/main/widgets.py
from widgets import regression_widget

regression_widget()


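
> The parameters explored by hand in the interactive figure can also be computed directly. As a minimal sketch, assuming simulated data (the slope 1.5, intercept 0.5, noise level and variable names below are illustrative choices, not the values used by the widget), the least-squares estimates of the univariate model are $\beta_1 = \mathrm{cov}(x, y) / \mathrm{var}(x)$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$:

```python
import numpy as np

# Simulated data (hypothetical values): y = 1.5 * x + 0.5 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.5 * x + 0.5 + rng.normal(scale=0.3, size=200)

# Least-squares estimates for the univariate model y ≈ beta_1 * x + beta_0
beta_1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # slope
beta_0 = y.mean() - beta_1 * x.mean()               # intercept

print(f"beta_1 (slope)     = {beta_1:.3f}")
print(f"beta_0 (intercept) = {beta_0:.3f}")
```

> The estimates should land close to the values used to simulate the data, and `np.polyfit(x, y, 1)` returns the same pair (slope first, then intercept).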


### Multiple linear regression

> Multiple linear regression consists in modeling the relationship between a target variable $y$ and **several explanatory variables** $x_1$, $x_2$, ..., $x_p$, often called *features*:
> $$
\begin{align}
y & \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \\
& \approx \beta_0 + \sum_{j=1}^{p} \beta_j x_j
\end{align}
$$
>
> There are now $p + 1$ parameters $\beta_j$ to estimate.
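>
> To make the formula concrete, here is a minimal sketch that estimates the intercept and the $p$ coefficients by ordinary least squares with NumPy. The data are simulated and the coefficient values are hypothetical; a column of ones is added to the features so that $\beta_0$ is estimated together with $\beta_1, \dots, \beta_p$.

```python
import numpy as np

# Simulated dataset with p = 3 features (all values are hypothetical)
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_beta_0 = 4.0
true_beta = np.array([2.0, -1.0, 0.5])  # beta_1, ..., beta_p
y = true_beta_0 + X @ true_beta + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that the first estimated parameter is beta_0
X_design = np.column_stack([np.ones(n), X])

# Ordinary least squares: minimizes the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated parameters (beta_0 first):", beta_hat.round(2))
```

> The vector `beta_hat` has $p + 1$ entries, one per parameter of the model.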


## Use of Scikit-Learn for linear regression

> We will now learn how to use the **`Scikit-Learn`** library to solve a machine learning problem. We will see in particular how useful this library is for preparing data and then building models.
>
> We place ourselves in the context of a project whose objective is to predict the **sale price of a car** from its **characteristics**. The variable to predict is quantitative, so this is a regression problem.

### Importing the dataset

> The dataset we will use below contains many characteristics of different cars from 1985.
>
> For simplicity, only the numerical variables have been kept and the rows containing missing values have been removed.
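>
> The cleaning described above could be reproduced along the following lines. This is only a sketch on a small hypothetical table (the column names and values are invented for illustration and are not taken from the actual file): it keeps the numerical columns with `select_dtypes` and drops the rows containing missing values with `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical raw table: one text column and two numerical columns with gaps
raw = pd.DataFrame({
    "make": ["audi", "bmw", "volvo", "honda"],
    "horsepower": [102.0, np.nan, 114.0, 76.0],
    "price": [13950.0, 16430.0, 16845.0, np.nan],
})

# Keep only the numerical columns
numeric = raw.select_dtypes(include="number")

# Drop every row that contains at least one missing value
clean = numeric.dropna().reset_index(drop=True)

print(clean)
```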

* **(a)** Import the `pandas` module under the alias `pd`.


* **(b)** Import the dataset `automobiles.csv` into a `DataFrame` named `df` using the `read_csv` function of `pandas`. This file is in the same folder as the execution environment of this notebook.


* **(c)** Display the first 5 rows of `df` to verify that the import went well.



In [None]:
# Insert your code here





In [None]:
import pandas as pd

df = pd.read_csv("automobiles.csv")
df.head(5)



> * The `symboling` variable corresponds to the degree of risk from the insurer's point of view (risk of accident, breakdown, etc.).
>
>
> * The `normalized_losses` variable is the average cost of insuring the vehicle per year. This value is normalized relative to cars of the same type (SUV, utility, sports, etc.).
>
>
> * The following 13 variables concern the technical characteristics of the cars, such as width, length, engine displacement, horsepower, etc.<br>
>
>
> * The last variable, `price`, corresponds to the sale price of the vehicle. This is the variable we will seek to predict.


### Separating the explanatory variables from the target variable

> We will now create two objects: a `DataFrame` containing the explanatory variables and a `Series` containing the target variable `price`.

* **(d)** In a `DataFrame` named `X`, make a copy of the explanatory variables of our dataset, that is, all the variables **except** `price`.


* **(e)** In a `Series` named `y`, make a copy of the target variable `price`.



In [None]:
# Insert your code here





In [None]:
# Explanatory variables
X = df.drop(['price'], axis = 1)

# Target variable
y = df['price']



### Splitting the data into training and test sets

> We will now split our dataset into two parts: a **training** set and a **test** set. This step is **extremely** important in data science.
>
> As their names indicate:
>> * The training set is used to "train" the model, that is, to find the optimal parameters $\beta_0$, ..., $\beta_p$ for this dataset.
>>
>>
>> * The test set is used to "test" the trained model by evaluating its ability to **generalize** its predictions to data that it has **never seen**.
>
> A very useful function for performing this operation is the `train_test_split` function of the `model_selection` submodule of **`Scikit-Learn`**.

* **(f)** Run the following cell to import the `train_test_split` function.



In [None]:
from sklearn.model_selection import train_test_split



> This function is used as follows:
>
> ```python
> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
> ```
>
>> * `X_train` and `y_train` are the explanatory and target variables of the **training dataset**.
>>
>>
>> * `X_test` and `y_test` are the explanatory and target variables of the **test dataset**.
>>
>>
>> * The `test_size` argument corresponds to the **proportion** of the dataset that we want to keep for the test set. In the example above, this proportion is 20% of the initial dataset.
>>
>>
>> * The `random_state` argument makes the split reproducible. Because the split is random, two successive calls will in general give two different results. As long as the value of `random_state` is the same (whatever that value is), the result of the `train_test_split` function will remain the same.
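
> The role of `test_size` and `random_state` can be checked on a toy example. The sketch below uses small hypothetical arrays (10 observations, 2 features) and calls `train_test_split` twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset (hypothetical): 10 observations, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two splits with the same random_state
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", len(X_tr_a))
print("Test set size:", len(X_te_a))
print("Identical test sets with the same seed:", np.array_equal(y_te_a, y_te_b))
```

> With `test_size=0.2`, 2 of the 10 observations go to the test set and the remaining 8 to the training set, and both calls return exactly the same split.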

* **(g)** Using the `train_test_split` function, split the dataset into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`) so that the test set contains **15% of the initial dataset**. Specify the parameter `random_state = 42`.



In [None]:
# Insert your code here





In [None]:
# Split the data into a training set (85%) and a test set (15%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state = 42)





### Creating the regression model

> To train a linear regression model on this dataset, we will use the **`LinearRegression`** class contained in the `linear_model` submodule of `Scikit-Learn`.

* **(h)** Run the following cell to import the `LinearRegression` class.



In [None]:
from sklearn.linear_model import LinearRegression



> The `Scikit-Learn` API makes it very easy to train and evaluate models. All `Scikit-Learn` model classes have the following two methods:
>> * **`fit`**: trains the model on a dataset.
>>
>>
>> * **`predict`**: makes a prediction from explanatory variables.
>
> Here is an example of using a model with `Scikit-Learn`:
>
> ```python
> # Instantiation of the model
> linreg = LinearRegression()
>
> # Training of the model on the training set
> linreg.fit(X_train, y_train)
>
> # Prediction of the target variable for the test set. These predictions are stored in y_pred
> y_pred = linreg.predict(X_test)
> ```

* **(i)** Instantiate a `LinearRegression` model named **`lr`**.


* **(j)** Train `lr` on the training set.


* **(k)** Make a prediction on the training data. Store these predictions in `y_pred_train`.


* **(l)** Make a prediction on the test data. Store these predictions in `y_pred_test`.



In [None]:
# Insert your code here





In [None]:
# Model instantiation
lr = LinearRegression()

# Model training
lr.fit(X_train, y_train)

# Prediction of the target variable on the training set
y_pred_train = lr.predict(X_train)

# Prediction of the target variable on the test set
y_pred_test = lr.predict(X_test)
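
> Once trained, a `LinearRegression` model exposes the estimated parameters: `intercept_` holds $\beta_0$ and `coef_` holds $\beta_1, \dots, \beta_p$. The sketch below checks this on simulated, noise-free data (the coefficient values and the names `X_demo`, `y_demo` and `model` are hypothetical) and shows that `predict` simply applies the affine formula.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated, noise-free data (hypothetical): y = 3 + 2 * x_1 - 1 * x_2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

print("beta_0 (intercept_):", model.intercept_)
print("beta_1, ..., beta_p (coef_):", model.coef_)

# predict() is equivalent to applying beta_0 + sum_j beta_j * x_j by hand
manual = model.intercept_ + X_demo @ model.coef_
print("Manual formula matches predict():", np.allclose(manual, model.predict(X_demo)))
```

> The same two attributes are available on `lr` after it has been fitted on the training set.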



### Evaluation of model performance

> To assess the **quality of the model's predictions** obtained with the parameters $\beta_0$, ..., $\beta_p$, the `Scikit-Learn` library provides several metrics.
>
> One of the most widely used metrics for regression is the **mean squared error** (MSE), available under the name `mean_squared_error`.
>
> This function takes the differences between the **true values** of the target variable and the **values predicted** by the regression function, squares them, and returns their average.
>
> The `mean_squared_error` function of `scikit-learn` is used as follows:
>
> ```python
> mean_squared_error(y_true, y_pred)
> ```
> where:
>> * `y_true` corresponds to the true values of the target variable.
>>
>>
>> * `y_pred` corresponds to the values predicted by our model.
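
> Written as a formula, with $\hat{y}_i$ the prediction for observation $i$ and $n$ the number of observations, the mean squared error is $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. As a small sketch on hypothetical prices (the three values below are invented for illustration), the formula can be computed by hand and compared with `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted prices
y_true = np.array([10000.0, 15000.0, 20000.0])
y_pred = np.array([11000.0, 14000.0, 23000.0])

# Manual computation: average of the squared differences
errors = y_true - y_pred
mse_manual = np.mean(errors ** 2)

# Same computation with scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print("MSE (manual): ", mse_manual)
print("MSE (sklearn):", mse_sklearn)
```

> Errors of a few thousand already give an MSE in the millions, because each error is squared before averaging.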

* **(o)** Import the **`mean_squared_error`** function from the `sklearn.metrics` submodule.


* **(p)** Evaluate the prediction quality of the model on **the training data**. Store the result in a variable named `mse_train`.


* **(q)** Evaluate the prediction quality of the model on **the test data**. Store the result in a variable named `mse_test`.


In [None]:
# Insert your code here





In [None]:
from sklearn.metrics import mean_squared_error

# MSE between the true target values of the training set and the predictions on X_train
mse_train = mean_squared_error(y_train, y_pred_train)

# MSE between the true target values of the test set and the predictions on X_test
mse_test = mean_squared_error(y_test, y_pred_test)

print("MSE train lr:", mse_train)
print("MSE test lr:", mse_test)


> The mean squared error you obtain should be several million on the test data, which can be difficult to interpret because it is expressed in squared units of the target.
>
> This is why we will use another metric, the **mean absolute error** (MAE), which directly averages the absolute differences between the true values of the target variable and the predicted values.
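>
> As a small sketch on hypothetical prices (the values are invented for illustration), the mean absolute error can be computed by hand and compared with `mean_absolute_error`. The square root of the MSE (often called RMSE) is shown as well, since it is another common way to get back to the units of the target.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true and predicted prices
y_true = np.array([10000.0, 15000.0, 20000.0])
y_pred = np.array([11000.0, 14000.0, 23000.0])

# Manual computation: average of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_pred))

# Same computation with scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

# Square root of the MSE, expressed in the same units as the target
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("MAE (manual): ", mae_manual)
print("MAE (sklearn):", mae_sklearn)
print("RMSE:         ", rmse)
```

> Both quantities are in the same units as the price, which makes them easier to compare with the average price than the MSE.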

* **(s)** Import the `mean_absolute_error` function from the `sklearn.metrics` submodule.


* **(t)** Evaluate the prediction quality on the test and training data using the mean absolute error.


* **(u)** From the `DataFrame` `df`, compute the average price over all vehicles. Do the model's predictions seem reliable to you?



In [None]:
# Insert your code here





In [None]:
from sklearn.metrics import mean_absolute_error

# MAE between the true target values of the training set and the predictions on X_train
mae_train = mean_absolute_error(y_train, y_pred_train)

# MAE between the true target values of the test set and the predictions on X_test
mae_test = mean_absolute_error(y_test, y_pred_test)

print("MAE train lr:", mae_train)
print("MAE test lr:", mae_test)

mean_price = df['price'].mean()

print("\nRelative error", mae_test / mean_price)

# The mean absolute error is around 14% of the average price, which is not optimal
# but is still a good baseline against which to test more advanced models.



## Conclusion and recap

> In this notebook, we have introduced how to solve a machine learning problem.
>
> The steps we have studied are the classic steps of any project:
>
> * Exploring the data with the `pandas` library
>
> * Preparing the data by separating the explanatory variables from the target variable
>
> * Splitting the dataset in two (a training set and a test set) using the `train_test_split` function of the `Scikit-Learn` library
>
> * Identifying the problem type: here, a regression
>
> * Instantiating a model such as `LinearRegression` with the `Scikit-Learn` library
>
> * Training the model on the training set using the `fit` method
>
> * Making predictions on the test data using the `predict` method
>
> * Evaluating model performance by computing the error between these predictions and the true values of the target variable on the test data. For a regression model, this evaluation is easily done with the `mean_squared_error` or `mean_absolute_error` functions of the `metrics` submodule of Scikit-Learn.
>
> In the next notebook, we will follow the same steps to solve a classification problem.
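>
> As a compact recap, the steps of this notebook can be chained in a few lines. The sketch below does not use the car dataset: it assumes synthetic data generated with `make_regression` (the sample size, number of features and noise level are arbitrary choices), so the printed errors are not comparable with the ones obtained above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real dataset (arbitrary settings)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Split into a training set (85%) and a test set (15%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)

# Instantiate and train the model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred_test = lr.predict(X_test)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)

print("MSE test:", mse_test)
print("MAE test:", mae_test)
```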
