<a href="https://colab.research.google.com/github/Demosthene-OR/Student-AI-and-Data-Management/blob/main/06_intro_linear_regression_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://prof.totalenergies.com/wp-content/uploads/2024/09/TotalEnergies_TPA_picto_DegradeRouge_RVB-1024x1024.png" height="150" width="150">
<hr style="border-width:2px;border-color:#75DFC1">
<center><h1> Introduction to Machine Learning with scikit-learn </h1></center>
<center><h2> Linear Regression </h2></center>
<hr style="border-width:2px;border-color:#75DFC1">



## Introduction

> This notebook introduces the concepts of Machine Learning, and more specifically linear regression, to show how Python is a programming language well suited to machine learning problems. All of these concepts will be presented in more detail and put into practice in the modules dedicated to Machine Learning.
>
> Machine learning is a subfield of artificial intelligence that enables computers to learn how to automatically perform tasks based on data. When the task to be performed is the prediction of a variable, we refer to this as supervised learning.
>
> Linear regression is one of the first predictive models of supervised learning to have been studied. This model allows us to predict a quantitative variable. Its simplicity makes it one of the most widely used models in practical applications today.
>
> In the linear regression model, we have $y$, the quantitative variable to be predicted (called the target variable), and explanatory variables that enable prediction.

### Univariate Linear Regression

> In the univariate linear model, we have two variables, $y$ called the target variable and $x$ called the explanatory variable. <br>
> Linear regression consists of modeling the relationship between these two variables using an affine function. Thus, the formula for the univariate linear model is given by:
> $$y \approx \beta_1 x + \beta_0 $$
> where:
>> * $y$ is the variable we want to predict.
>>
>>
>> * $x$ is the explanatory variable.
>>
>>
>> * $\beta_1$ and $\beta_0$ are the parameters of the linear function. $\beta_1$ will define its **slope** and $\beta_0$ will define its y-intercept (also called **bias**).
>
> **The goal of linear regression is to estimate the best parameters $\beta_0$ and $\beta_1$ to predict the variable $y$ from a given value of $x$**.
>
> To get a feel for univariate linear regression, let's look at the interactive example below.

* **(a)** Run the following cell to display the interactive figure. In this figure, we have simulated a dataset.


* **(b)** Use the sliders on the `Regression` tab to find the parameters $\beta_0$ and $\beta_1$ that best fit all the points in the dataset.


* **(c)** What is the effect of each parameter on the regression function? 


In [None]:
!wget -q https://raw.githubusercontent.com/Demosthene-OR/Student-AI-and-Data-Management/main/regression_widgets.py
from regression_widgets import regression_widget_linear

regression_widget_linear()
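
> The sliders let you search for $\beta_0$ and $\beta_1$ by eye. As a complement, here is a minimal sketch of how the same parameters can be computed directly, assuming that "best" means minimizing the sum of squared errors (ordinary least squares). The simulated dataset, its true slope and intercept, and the variable names are hypothetical and are not the data used by the widget.

```python
import numpy as np

# Hypothetical simulated dataset: y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Closed-form least-squares estimates for the univariate model
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print("slope beta_1:", beta_1)
print("intercept beta_0:", beta_0)
```

> The estimates should land close to the slope and intercept used to simulate the data, with a small gap due to the noise.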


### Multiple Linear Regression

> Multiple linear regression consists of modeling the relationship between a target variable $y$ and **several explanatory variables** $x_1$, $x_2$, ... ,$x_p$, often referred to as *features*:
> $$
\begin{align}
    y & \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \\
      & \approx \beta_0 + \sum_{j=1}^{p} \beta_j x_j
\end{align}
$$
>
> There are now $p + 1$ parameters $\beta_j$ to find.
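>
> To make the formula concrete, the sketch below evaluates it for $p = 3$ features, first as an explicit sum and then as a matrix product. The parameter values and the two observations are made up for illustration.

```python
import numpy as np

# Hypothetical parameters for p = 3 features (illustrative values only)
beta_0 = 1.0
beta = np.array([2.0, -1.0, 0.5])    # beta_1, beta_2, beta_3

# Two observations, one per row, with columns x_1, x_2, x_3
X = np.array([
    [1.0, 2.0, 4.0],
    [0.0, 1.0, 2.0],
])

# Explicit sum for the first observation
y_hat_first = beta_0 + sum(beta[j] * X[0, j] for j in range(3))

# Same computation for all observations at once
y_hat = beta_0 + X @ beta

print(y_hat_first)
print(y_hat)
```

> Writing the sum as `X @ beta` is what allows a prediction to be computed for every row of a dataset in a single operation.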


## Using scikit-learn for linear regression

> We will now learn how to use the **`scikit-learn`** library to solve a machine learning problem. In particular, we will see how useful this library is for preparing data and then implementing models.
>
> We are working on a project where the objective is to predict the **selling price of a car** based on its **characteristics**. The variable to be predicted is quantitative, so we are dealing with a regression problem.

### Importing the dataset

> The dataset we will use below contains many characteristics about different cars from 1985.
>
> For simplicity, only numerical variables have been kept and rows with missing values have been removed.

* **(a)** Import the `pandas` module under the alias `pd`.


* **(b)** In a `DataFrame` named `df`, import the `automobiles.csv` dataset using the `read_csv` function from `pandas`. This file is located in the same folder as the execution environment for this notebook.


* **(c)** Display the first 5 rows of `df` to verify that the import was successful.




In [None]:
# Insert your code here





In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/Demosthene-OR/Student-AI-and-Data-Management/main/data/"
df = pd.read_csv(url+"automobiles.csv")
df.head(5)



> * The variable `symboling` corresponds to the degree of risk to the insurer (risk of accident, breakdown, etc.).
>
>
> * The variable `normalized_losses` is the average relative annual cost of insuring the vehicle. This value is normalized in relation to cars of the same type (SUV, utility vehicle, sports car, etc.).
>
>
> * The following 13 variables relate to the technical characteristics of the cars, such as width, length, engine size, horsepower, etc.
>
>
> * The last variable, `price`, corresponds to the sale price of the vehicle. This is the variable we will be trying to predict.


### Separating the explanatory variables from the target variable

> We will now create two `DataFrames`, one containing the explanatory variables and another containing the target variable `price`.

* **(d)** In a `DataFrame` named `X`, make a copy of the explanatory variables in our dataset, i.e., all variables **except** `price`.


* **(e)** In a `Series` named `y`, make a copy of the target variable `price`.



In [None]:
# Insert your code here





In [None]:
# Explanatory variables
X = df.drop(['price'], axis = 1)

# Target variable
y = df['price']



### Separating data into training and test sets

> We will now separate our dataset into two parts: a **training** set and a **test** set. This step is **extremely** important in data science.
>
> As their names suggest: 
>> * The training part is used to “train” the model, i.e., find the optimal parameters $\beta_0$, ..., $\beta_p$ for this dataset.
>>
>>
>> * The test part is used to “test” the trained model by evaluating its ability to **generalize** its predictions on data it has **never seen** before.
>
> A very useful function for performing this operation is the `train_test_split` function from the `model_selection` submodule of **`scikit-learn`**.

* **(f)** Run the following cell to import the `train_test_split` function.



In [None]:
from sklearn.model_selection import train_test_split



> This function is used as follows:
>
> ```python
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
> ```
>
> * `X_train` and `y_train` are the explanatory and target variables of the **training** dataset.
>
>
> * `X_test` and `y_test` are the explanatory and target variables of the **test** dataset.
>>
>>
>> * The `test_size` argument corresponds to the **proportion** of the dataset that we want to keep for the test set. In the previous example, this proportion corresponds to 20% of the initial dataset.
>>
>>
>> * The `random_state` argument makes the data split reproducible. Since the split is random, two successive calls without a fixed `random_state` will generally give two different results. As long as the value of `random_state` is the same (regardless of what that value is), the result of the `train_test_split` function will remain the same.
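
> The behavior of `test_size` and `random_state` can be checked on a small example. The sketch below uses a hypothetical toy dataset of 10 observations (`X_toy`, `y_toy`), not the automobile data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 observations, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same random_state
_, X_test_a, _, y_test_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# 20% of 10 observations are kept for the test set
print(X_test_a.shape)

# Both calls select exactly the same observations
print(np.array_equal(y_test_a, y_test_b))
```

> Rows of `X_toy` and entries of `y_toy` are shuffled together, so each test observation keeps its own target value.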

* **(g)** Using the `train_test_split` function, split the dataset into a training set (`X_train`,`y_train`)  and a test set (`X_test`, `y_test`) so that the test set contains **15% of the initial dataset**. Specify the parameter `random_state = 42`.



In [None]:
# Insert your code here





In [None]:
# Split the data into a training set (85%) and a test set (15%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state = 42)





### Creating the regression model

> To train a linear regression model on this dataset, we will use the **`LinearRegression`** class contained in the `linear_model` submodule of `scikit-learn`.

* **(h)** Run the following cell to import the `LinearRegression` class.


In [None]:
from sklearn.linear_model import LinearRegression



> The `scikit-learn` API makes it very easy to train and evaluate models. All `scikit-learn` model classes provide the following two methods:
>> * **`fit`**: Trains the model on a dataset.
>>
>>
>> * **`predict`**: Makes predictions from explanatory variables.
>
> Here is an example of how to use a model with `scikit-learn`:
>
> ```python
> # Model instantiation
> linreg = LinearRegression()
>
> # Model training on the training set
> linreg.fit(X_train, y_train)
>
> # Prediction of the target variable for the test set. These predictions are stored in y_pred
> y_pred = linreg.predict(X_test)
> ```

* **(i)** Instantiate a `LinearRegression` model named **`lr`**.


* **(j)** Train `lr` on the training dataset.


* **(k)** Make predictions on the training data. Store these predictions in `y_pred_train`.


* **(l)** Make predictions on the test data. Store these predictions in `y_pred_test`.


In [None]:
# Insert your code here





In [None]:
# Model instantiation
lr = LinearRegression()

# Model training
lr.fit(X_train, y_train)

# Prediction of the target variable for train dataset
y_pred_train = lr.predict(X_train)

# Prediction of the target variable for the testing dataset
y_pred_test = lr.predict(X_test)
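
> Once a `LinearRegression` model has been trained, the estimated parameters can be read from its `intercept_` attribute ($\beta_0$) and its `coef_` attribute ($\beta_1$, ..., $\beta_p$). The sketch below illustrates this on hypothetical noiseless data generated from known parameters, so that the recovered values can be compared with the true ones. The names `X_demo`, `y_demo` and `model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noiseless data generated from y = 3 + 2*x_1 - x_2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression()
model.fit(X_demo, y_demo)

# Estimated parameters of the linear function
print("beta_0:", model.intercept_)
print("beta_1, beta_2:", model.coef_)
```

> On the automobile data, the same attributes of `lr` hold one coefficient per explanatory variable, in the order of the columns of `X_train`.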



### Evaluating model performance

> In order to evaluate the **quality of the model's predictions** obtained using the parameters $\beta_0$, ..., $\beta_p$, there are several metrics available in the `scikit-learn` library.
>
> One of the most commonly used metrics for regression is **Mean Squared Error**, which exists under the name `mean_squared_error` in the `metrics` submodule of `scikit-learn`.
>
> This function computes, for each observation, the difference between the **true value** of the target variable and the **value predicted** by the regression function. The mean squared error is simply the average of these differences squared.
>
> The `mean_squared_error` function in `scikit-learn` is used as follows:
>
> ```python
    mean_squared_error(y_true, y_pred)
> ```
> where:
>> * `y_true` corresponds to the true values of the target variable.
>>
>>
>> * `y_pred` corresponds to the values predicted by our model.
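
> To see what the function computes, the sketch below reproduces the mean squared error by hand with `numpy` and compares it with the result of `mean_squared_error`. The true and predicted values are made-up numbers.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values (illustrative numbers only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 12.0])

# Average of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```

> The differences here are 1, 0, -1 and -3, so the squared differences are 1, 0, 1 and 9, and their average is 2.75.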

* **(o)** Import the **`mean_squared_error`** function from the `sklearn.metrics` submodule.


* **(p)** Evaluate the prediction quality of the model on **the training data**. Store the result in a variable named `mse_train`.


* **(q)** Evaluate the model's prediction quality on **the test data**. Store the result in a variable named `mse_test`.


In [None]:
# Insert your code here





In [None]:
from sklearn.metrics import mean_squared_error

# Calculation of the MSE between the target variable values in the train dataset and the prediction on X_train
mse_train = mean_squared_error(y_train, y_pred_train)

# Calculation of the MSE between the target variable values in the test dataset and the prediction on X_test
mse_test = mean_squared_error(y_test, y_pred_test)

print("MSE train lr:", mse_train)
print("MSE test lr:", mse_test)


> The mean squared error you find should be several million on the test data, which can be difficult to interpret. 
>
> That's why we're going to use another metric, the **mean absolute error**, which averages the absolute differences between the true values of the target variable and the predicted values. It is therefore expressed in the same unit as the target variable (here, a price).
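>
> The sketch below compares the two metrics on three hypothetical prices (made-up numbers, not taken from the dataset), computing the mean absolute error by hand and with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true and predicted prices (illustrative numbers only)
prices_true = np.array([10000.0, 15000.0, 20000.0])
prices_pred = np.array([11000.0, 14000.0, 23000.0])

# Average of the absolute differences
mae_manual = np.mean(np.abs(prices_true - prices_pred))
mae_sklearn = mean_absolute_error(prices_true, prices_pred)

# Mean squared error on the same values, for comparison
mse_demo = mean_squared_error(prices_true, prices_pred)

print("MAE:", mae_manual, mae_sklearn)
print("MSE:", mse_demo)
```

> With errors of 1000, 1000 and 3000, the mean absolute error is about 1667, a value that reads directly as a price, whereas the mean squared error is in the millions.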

* **(s)** Import the `mean_absolute_error` function from the `sklearn.metrics` submodule.


* **(t)** Evaluate the prediction quality on the test and training data using the mean absolute error.


* **(u)** From the `DataFrame` `df`, calculate the average purchase price for all vehicles. Do the model's predictions seem reliable to you?


In [None]:
# Insert your code here





In [None]:
from sklearn.metrics import mean_absolute_error

# Calculation of the MAE between the true target values of the training set and the predictions on X_train
mae_train = mean_absolute_error(y_train, y_pred_train)

# Calculation of the MAE between the true target values of the test set and the predictions on X_test
mae_test = mean_absolute_error(y_test, y_pred_test)

print("MAE train lr:", mae_train)
print("MAE test lr:", mae_test)

mean_price = df['price'].mean()

print("\nRelative error", mae_test / mean_price)

# The average error is around 14% of the average price, which is not optimal
# but is still a good baseline for testing more advanced models.



## Conclusion and recap

> In this notebook, we introduced how to solve a machine learning problem. 
> 
> The different steps we studied are the classic steps of any project:
>
> * Data exploration with the `pandas` library
>
> * Data preparation by separating the explanatory variables from the target variable
>
> * Splitting the dataset into two (a training set and a test set) using the `train_test_split` function from the `scikit-learn` library
>
> * Identifying the type of problem: in this case, regression
>
> * Instantiating a model such as `LinearRegression` with the `scikit-learn` library
>
> * Training the model on the training dataset using the `fit` method.
>
> * Predicting the test data using the `predict` method.
>
> * Evaluating the model's performance by calculating the error between these predictions and the actual values of the target variable in the test data. Evaluation for a regression model is easily done using the `mean_squared_error` or `mean_absolute_error` functions from the `metrics` submodule of scikit-learn.
>
> In the next notebook, we will perform the same steps but for solving a machine learning classification problem.
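>
> As a compact reminder, the steps above can be chained in a few lines. The sketch below is one possible arrangement, run on a hypothetical synthetic dataset created with `make_regression` (a `scikit-learn` helper not used elsewhere in this notebook) instead of `automobiles.csv`. All variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic dataset standing in for automobiles.csv
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Split into a training set (85%) and a test set (15%)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.15, random_state=42)

# Instantiate and train the model, then predict on the test set
model = LinearRegression()
model.fit(X_tr, y_tr)
y_hat_te = model.predict(X_te)

# Evaluate the predictions on data the model has never seen
print("MSE test:", mean_squared_error(y_te, y_hat_te))
print("MAE test:", mean_absolute_error(y_te, y_hat_te))
```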
