# Baseline model

The idea is to run a simple model so that we have a reference point against which to compare our future models. Which model to use as the baseline is still to be discussed.
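As one possible input to that discussion, here is a minimal sketch of an even simpler reference: a model that always predicts the mean of the training target. It uses scikit-learn's `DummyRegressor`; the toy values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical target values standing in for CO2 emissions (not real data).
y_toy = np.array([100.0, 120.0, 140.0, 160.0])
# DummyRegressor ignores the features, so a column of zeros is enough here.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the mean of the target seen during fit.
dummy = DummyRegressor(strategy="mean")
dummy.fit(X_toy, y_toy)

dummy_mae = mean_absolute_error(y_toy, dummy.predict(X_toy))
print("Dummy MAE:", dummy_mae)
```

Any candidate baseline, including the linear regression used in this notebook, should at least beat this kind of constant predictor on the same validation rows.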

## Import (package + Train data frame)

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
from src.preprocessing import display_missing_values
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)
sns.set_theme(style="ticks", palette="pastel")

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error


In [3]:
df = pd.read_csv("data/train.csv", sep=",", low_memory=False, dtype={"ID": str})

## Preprocessing Train 

In [4]:
col_categoricals = df.select_dtypes(include="object").columns.tolist()

# Then the numerical columns
col_numericals = [col for col in df.columns if col not in col_categoricals]

In [5]:
colonnes_a_drop = ["MMS","r","Enedc (g/km)","Ernedc (g/km)","Erwltp (g/km)","De","Vf"]
df_pp= df.drop(columns=colonnes_a_drop)
col_numericals_2 = [col for col in df_pp.columns if col not in col_categoricals]
df_pp= df_pp[col_numericals_2]

### NaN imputation

In [6]:
df_pp.drop_duplicates(inplace=True)

In [7]:
mean_imputer= SimpleImputer(missing_values=np.nan, strategy='mean')

Impute with the column mean wherever mean imputation makes sense.

In [8]:
colonnes_mean_impute=['m (kg)','W (mm)','ep (KW)','At1 (mm)','At2 (mm)','Mt','ec (cm3)']
df_pp[colonnes_mean_impute]= mean_imputer.fit_transform(df_pp[colonnes_mean_impute])
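To make the effect of mean imputation concrete, here is a small sketch on a made-up frame. The column names are borrowed from the dataset, but the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (values are invented).
toy = pd.DataFrame({
    "m (kg)": [1000.0, np.nan, 1400.0],
    "W (mm)": [2500.0, 2700.0, np.nan],
})
cols = ["m (kg)", "W (mm)"]

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# fit_transform learns each column mean, then replaces the NaN with it.
toy[cols] = toy_imputer.fit_transform(toy[cols])

print(toy)
print("Learned means:", toy_imputer.statistics_)
```

The learned means stay available in `statistics_`, so the same fitted imputer can later be applied to another frame.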

Impute with 0, because some variables describe properties that only exist on electric vehicles (e.g. the electric consumption `z (Wh/km)` has no meaning for a petrol car).

In [9]:
fill_values = {'Fuel consumption ':0, 'z (Wh/km)': 0,'Electric range (km)':0}
df_pp.fillna(fill_values,inplace=True)
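A quick sketch of how `fillna` behaves with a dictionary, on a made-up frame: only the columns named in the dictionary are filled, and every other column keeps its missing values. The values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 mimics a petrol car, row 1 an electric car (invented values).
toy_fill = pd.DataFrame({
    "z (Wh/km)": [np.nan, 150.0],
    "Electric range (km)": [np.nan, 400.0],
    "m (kg)": [1200.0, np.nan],
})

# Fill only the electric-specific columns with 0.
filled = toy_fill.fillna({"z (Wh/km)": 0, "Electric range (km)": 0})
print(filled)
```

Here `m (kg)` still contains a NaN after the call, which is why the mean imputation and the zero fill are handled as two separate steps.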

In [10]:
display_missing_values(df_pp)

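| | Colonne | pourcentage manquant | nombre |
|---|---|---|---|
| 0 | m (kg) | 0.0 | 0 |
| 1 | Mt | 0.0 | 0 |
| 2 | Ewltp (g/km) | 0.0 | 0 |
| 3 | W (mm) | 0.0 | 0 |
| 4 | At1 (mm) | 0.0 | 0 |
| 5 | At2 (mm) | 0.0 | 0 |
| 6 | ec (cm3) | 0.0 | 0 |
| 7 | ep (KW) | 0.0 | 0 |
| 8 | z (Wh/km) | 0.0 | 0 |
| 9 | Fuel consumption | 0.0 | 0 |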


### Train/test split (for model validation)

In [11]:
train, test = train_test_split(df_pp,test_size=0.33,random_state=42)

train.reset_index(drop=True, inplace=True)  # the split shuffles the rows, so rebuild a clean 0..n-1 index
test.reset_index(drop=True, inplace=True)

In [12]:
X_train, y_train = train.drop(columns=["Ewltp (g/km)"]), train["Ewltp (g/km)"]
X_test, y_test = test.drop(columns=["Ewltp (g/km)"]), test["Ewltp (g/km)"]
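In the cells above, the mean imputer is fitted on the whole frame before the split, so the validation rows contribute to the imputed means. A possible alternative, sketched here on synthetic data (the arrays and coefficients are invented for illustration), is to wrap the imputer and the regression in a scikit-learn `Pipeline` so the means are learned from the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic regression data with roughly 10% of the entries missing.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 10.0
X[rng.random(X.shape) < 0.1] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=42)

# The imputer inside the pipeline is fitted on X_tr only;
# predict() then reuses those training means on X_va.
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
pipe.fit(X_tr, y_tr)

val_mae = mean_absolute_error(y_va, pipe.predict(X_va))
print("Validation MAE:", val_mae)
```

For a baseline the difference is probably small, but the pipeline form keeps the validation score free of information from the held-out rows.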

## Model (fit) + Validation

In [13]:
model = LinearRegression()
model.fit(X_train, y_train)

### Model predict to check validation

In [14]:
# Predict on the held-out split
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Mean Absolute Error: 43.28420226877553
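For reference, the mean absolute error is the average of the absolute differences between true and predicted values, so it is expressed in the same unit as the target (here g/km). A small sketch with made-up numbers, checking a manual computation against `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented true values and predictions, for illustration only.
y_true = np.array([120.0, 150.0, 200.0])
y_hat = np.array([110.0, 155.0, 230.0])

# Absolute errors are 10, 5 and 30, so their mean is 15.
manual_mae = np.mean(np.abs(y_true - y_hat))
sklearn_mae = mean_absolute_error(y_true, y_hat)

print(manual_mae, sklearn_mae)
```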


## Fit again on the full train set (to predict on Test)

In [15]:
X_train, y_train =df_pp.drop(columns=["Ewltp (g/km)"]), df_pp["Ewltp (g/km)"]

In [16]:
model = LinearRegression()
model.fit(X_train, y_train)

## Preprocessing Test

In [17]:
df_test = pd.read_csv("data/test.csv", sep= ",", low_memory=False,dtype={"ID":str})

In [18]:
colonnes_a_drop = ["MMS","r","Enedc (g/km)","Ernedc (g/km)","Erwltp (g/km)","De","Vf"]
df_pp_test= df_test.drop(columns=colonnes_a_drop)
col_numericals_2 = [col for col in df_pp_test.columns if col not in col_categoricals]
df_pp_test= df_pp_test[col_numericals_2]

### NaN imputation

In [19]:
colonnes_mean_impute=['m (kg)','W (mm)','ep (KW)','At1 (mm)','At2 (mm)','Mt','ec (cm3)']
# Reuse the means learned on the train set instead of refitting the imputer on the test set
df_pp_test[colonnes_mean_impute] = mean_imputer.transform(df_pp_test[colonnes_mean_impute])
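scikit-learn imputers separate `fit` (learn the column means) from `transform` (apply them). The sketch below, on made-up toy frames, contrasts reusing the training means on a second frame with refitting the imputer on that frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values: the "train" mean is 1100, the "test" mean is 2000.
train_toy = pd.DataFrame({"m (kg)": [1000.0, 1200.0, np.nan]})
test_toy = pd.DataFrame({"m (kg)": [2000.0, np.nan]})

# Fit on train, then only transform the test frame.
imp = SimpleImputer(strategy="mean")
imp.fit(train_toy)
reused = imp.transform(test_toy)

# Refitting on the test frame uses the test mean instead.
refit = SimpleImputer(strategy="mean").fit_transform(test_toy)

print("Train mean reused:", reused[1, 0])
print("Refit on test:", refit[1, 0])
```

With `transform`, the missing test value is filled with the training mean, so train and test rows are preprocessed in the same way before reaching the model.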

In [20]:
fill_values = {'Fuel consumption ':0, 'z (Wh/km)': 0,'Electric range (km)':0}
df_pp_test.fillna(fill_values,inplace=True)

In [21]:
display_missing_values(df_pp_test)

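| | Colonne | pourcentage manquant | nombre |
|---|---|---|---|
| 0 | m (kg) | 0.0 | 0 |
| 1 | Mt | 0.0 | 0 |
| 2 | W (mm) | 0.0 | 0 |
| 3 | At1 (mm) | 0.0 | 0 |
| 4 | At2 (mm) | 0.0 | 0 |
| 5 | ec (cm3) | 0.0 | 0 |
| 6 | ep (KW) | 0.0 | 0 |
| 7 | z (Wh/km) | 0.0 | 0 |
| 8 | Fuel consumption | 0.0 | 0 |
| 9 | Electric range (km) | 0.0 | 0 |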


## Model Predict on Test

In [None]:
df_test["Ewltp (g/km)"] = model.predict(df_pp_test)

## Prepare for Submission

In [None]:
df_test["ID"] = df_test["ID"].astype(int)

In [None]:
df_test[["ID","Ewltp (g/km)"]].to_csv("data/reg_lin_bateau.csv", index=False)