# Multiple Linear Regression
## 50 Startups Dataset
The goal is to predict the profit of start-ups from their spending in different areas: R&D, administration, marketing, etc.
<br>
In other words, we want to find out which variables have the greatest impact on profit.

## Importing the libraries

In [1]:
import numpy as np 
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import matplotlib.pyplot as plt
import seaborn as sns

## Importing the dataset

In [2]:
dataset = pd.read_csv('50_Startups.csv')

In [3]:
dataset.head(8)

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
| 5 | 131876.9 | 99814.71 | 362861.36 | New York | 156991.12 |
| 6 | 134615.46 | 147198.87 | 127716.82 | California | 156122.51 |
| 7 | 130298.13 | 145530.06 | 323876.68 | Florida | 155752.6 |


## Encoding categorical data
Convert the categorical variables into numeric variables.

In [4]:
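from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

# Columns to standardize. Note that the target 'Profit' is scaled too,
# so all metrics below are expressed in standardized units.
numeric_features = ['R&D Spend', 'Administration', 'Marketing Spend', 'Profit']
# Column to one-hot encode (one 0/1 column per state)
categorical_features = ['State']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'
)

# Wrapped in a Pipeline for convenience; the preprocessor itself is applied below
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# The preprocessor is fitted on the full dataset, i.e. before the train/test split
transformed_data = preprocessor.fit_transform(dataset)
# Output columns are prefixed with the transformer name, e.g. 'num__Profit'
transformed_df = pd.DataFrame(transformed_data, columns=preprocessor.get_feature_names_out())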

In [5]:
preprocessor
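
To make the two transformations more concrete, here is a minimal NumPy sketch of what standardization and one-hot encoding compute, using the `R&D Spend` and `State` values of the eight rows shown earlier. It is an illustration rather than the notebook's actual preprocessing: the variable names (`rd_spend`, `states`, `rd_scaled`, `one_hot`) are hypothetical, and the statistics are computed on these eight rows only.

```python
import numpy as np

# First eight rows of the dataset preview (R&D Spend and State only).
rd_spend = np.array([165349.2, 162597.7, 153441.51, 144372.41,
                     142107.34, 131876.9, 134615.46, 130298.13])
states = np.array(["New York", "California", "Florida", "New York",
                   "Florida", "New York", "California", "Florida"])

# Standardization: subtract the mean and divide by the population
# standard deviation (ddof=0), the convention StandardScaler follows.
rd_scaled = (rd_spend - rd_spend.mean()) / rd_spend.std()

# One-hot encoding: one 0/1 column per category, with the categories
# sorted alphabetically.
categories = np.unique(states)
one_hot = (states[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot[:3])
print(rd_scaled.round(3))
```

Each row of `one_hot` contains exactly one 1, so the three state columns always sum to 1. When an intercept is fitted, one of them is therefore redundant.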

Prepare the training and test datasets.

In [6]:
X = transformed_df.drop(columns='num__Profit')
y = transformed_df['num__Profit']

In [7]:
## Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
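
As a rough illustration of what an 80/20 split does, the sketch below shuffles row positions with a seeded generator and holds out 20% of them. It assumes the dataset has 50 rows, as its name suggests, and it does not reproduce the exact shuffle performed by `train_test_split`; the names `n_rows`, `train_idx` and `test_idx` are hypothetical.

```python
import numpy as np

n_rows = 50        # assumption: the dataset contains 50 start-ups
test_size = 0.2    # same proportion as in the cell above

rng = np.random.default_rng(0)      # seeded so the split is reproducible
indices = rng.permutation(n_rows)   # shuffled row positions

n_test = int(np.ceil(test_size * n_rows))   # number of rows held out
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))
```

With these settings only 10 observations end up in the test set, so every test metric computed afterwards rests on a very small sample.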

Write a function that trains a regression model and returns the evaluation metrics (MAE, MSE, RMSE, R², adjusted R²).

In [8]:
# Train a linear regression on the selected columns and return its test-set metrics
def linearRegression(x):
    # x: list of column positions of X to use as features
    n = len(X_test)  # number of test observations

    ## Training the Multiple Linear Regression model on the Training set
    regressor = LinearRegression()
    regressor.fit(X_train.iloc[:, x], y_train)
    y_pred = regressor.predict(X_test.iloc[:, x])

    mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
    mse = mean_squared_error(y_true=y_test, y_pred=y_pred)
    # Square root of the MSE; avoids the `squared=False` argument,
    # which is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true=y_test, y_pred=y_pred)
    # Note: the penalty uses X.shape[1] (all columns of X), not len(x),
    # so every model is penalized as if it used all the features
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - X.shape[1] - 1)

    print('Coefficients : ', regressor.coef_)
    print('MAE : ', mae)
    print('MSE : ', mse)
    print('R² : ', r2)
    print('Adjusted R² : ', adj_r2)

    return [mae, mse, rmse, r2, adj_r2]
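
For reference, the five metrics can be written out directly from their definitions. The sketch below is a NumPy-only illustration with a hypothetical helper `regression_metrics` and made-up toy values; here the adjusted R² penalty uses the number of predictors actually passed in (`n_predictors`), which is the usual definition.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Return MAE, MSE, RMSE, R² and adjusted R² as a dict (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))      # mean absolute error
    mse = np.mean(residuals ** 2)         # mean squared error
    rmse = np.sqrt(mse)                   # same unit as the target

    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot

    # Penalize R² for the number of predictors used by the model
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}

# Toy values, made up for illustration only
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
metrics = regression_metrics(y_true, y_pred, n_predictors=2)
print(metrics)
```

The adjusted R² is only defined when the number of observations exceeds `n_predictors + 1`, which matters for a test set as small as the one used here.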

#### Model 1
Build a linear regression model using all the features of the training data.

In [9]:
# model with all features 
lm1 = linearRegression([0, 1, 2, 3, 4, 5])

Coefficients :  [ 0.88085473  0.02285888  0.11107605  0.00217133 -0.02187023  0.0196989 ]
MAE :  0.18832305113536235
MSE :  0.05244837153773104
R² :  0.9347068473282425
Adjusted R² :  0.8041205419847275


#### Model 2
Build a linear regression model with explanatory variables 0, 1, 3, 4 and 5.

In [10]:
# model with  features 0, 1, 3, 4, 5
lm2 = linearRegression([0, 1, 3, 4, 5])

Coefficients :  [ 0.96985089 -0.00117949 -0.00549633  0.00665204 -0.00115571]
MAE :  0.16921571696510615
MSE :  0.042011100958918565
R² :  0.9477002402858978
Adjusted R² :  0.8431007208576935


In [11]:
# model with  features 0, 3, 5
lm3 = linearRegression([0, 3, 5])

Coefficients :  [ 0.96946858 -0.01210896 -0.00761915]
MAE :  0.17017032202158194
MSE :  0.04243788902272595
R² :  0.9471689304016889
Adjusted R² :  0.8415067912050668


In [12]:
# model with  features 0, 3
lm4 = linearRegression([0, 3])

Coefficients :  [ 0.96925114 -0.00773828]
MAE :  0.17048978660920244
MSE :  0.04280452797334553
R² :  0.9467125003491181
Adjusted R² :  0.8401375010473544


In [13]:
lms = [lm1, lm2, lm3, lm4]
df = pd.DataFrame(
    lms,
    index=['Model 1', 'Model 2', 'Model 3', 'Model 4'],
    columns=['MAE', 'MSE', 'RMSE', 'R²', 'Adjusted R²'],
    dtype=float,
)
df

| | MAE | MSE | RMSE | R² | Adjusted R² |
|---|---|---|---|---|---|
| Model 1 | 0.188323 | 0.052448 | 0.229016 | 0.934707 | 0.804121 |
| Model 2 | 0.169216 | 0.042011 | 0.204966 | 0.9477 | 0.843101 |
| Model 3 | 0.17017 | 0.042438 | 0.206005 | 0.947169 | 0.841507 |
| Model 4 | 0.17049 | 0.042805 | 0.206893 | 0.946713 | 0.840138 |
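
The adjusted R² values in this table were all computed with the same penalty, based on the total number of columns in `X`. As a hedged sketch of an alternative, the snippet below recomputes them from the reported R² values using each model's own number of predictors (6, 5, 3 and 2). It assumes a test set of 10 observations (20% of 50 rows); `adjusted_r2` and `reported` are hypothetical names.

```python
# Reported test-set R² and the number of predictors used by each model
reported = {
    "Model 1": (0.9347068473282425, 6),
    "Model 2": (0.9477002402858978, 5),
    "Model 3": (0.9471689304016889, 3),
    "Model 4": (0.9467125003491181, 2),
}
n_test = 10  # assumption: 20% of a 50-row dataset

def adjusted_r2(r2, n, p):
    """Adjusted R² for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adjusted = {name: adjusted_r2(r2, n_test, p) for name, (r2, p) in reported.items()}

for name, value in adjusted.items():
    print(f"{name}: {value:.4f}")
```

With a per-model penalty, the models with fewer predictors are penalized less, so the differences between the four models become larger than in the table.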


### Conclusion