<h1 style="text-align:center;">Step 3: Modeling</h1>

# 1. Importing packages

In [20]:
import pandas as pd

import numpy as np   # NumPy, for array manipulation.

from sklearn.model_selection import train_test_split  # Splits the data into training and test sets.
# Models
from sklearn.tree import DecisionTreeRegressor  # Decision tree regression model.
from sklearn.ensemble import RandomForestRegressor  # Random forest regression model.
from sklearn.ensemble import GradientBoostingRegressor  # Gradient boosting regression model.
from sklearn.linear_model import LinearRegression, Lasso  # Linear regression and Lasso regression models.
from sklearn.linear_model import ElasticNet

# Metrics
from sklearn.metrics import classification_report, confusion_matrix  # Metrics for evaluating model performance.
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# GridSearchCV searches for the best hyperparameters; ShuffleSplit generates random validation splits.
from sklearn.model_selection import GridSearchCV, ShuffleSplit

import matplotlib.pyplot as plt  # Matplotlib, for data visualization.
import seaborn as sns  # Seaborn, for Matplotlib-based data visualization.

from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
import mlflow
import datetime
import warnings

pd.options.display.max_columns = None
warnings.filterwarnings("ignore")

from sklearn.preprocessing import StandardScaler  # StandardScaler, for standardizing the features.

# 2. MLflow configuration

In [21]:
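version = "v1.0"
# Logged to MLflow as a parameter. Note that the data itself is read from "./Data_csv/data_cleaned.csv" when it is loaded.
data_url = "./Data csv/data_cleaned.csv"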

In [22]:
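import os
import getpass

# MLflow reads these environment variables to authenticate with the tracking server.
os.environ['MLFLOW_TRACKING_USERNAME'] = "rihemmanel54"
# Avoid storing the access token in the notebook: reuse it if already set, otherwise prompt for it.
if "MLFLOW_TRACKING_PASSWORD" not in os.environ:
    os.environ["MLFLOW_TRACKING_PASSWORD"] = getpass.getpass("MLflow tracking password: ")


# Set up MLflow
mlflow.set_tracking_uri('https://dagshub.com/rihemmanel54/CarPricePrediction_mlFlow.mlflow')
mlflow.set_experiment("CarPricePrediction_mlFlow-experiment")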

<Experiment: artifact_location='mlflow-artifacts:/071bed8221704aee9f4a1d0750eb5c7b', creation_time=1696849056715, experiment_id='0', last_update_time=1696849056715, lifecycle_stage='active', name='CarPricePrediction_mlFlow-experiment', tags={}>

# 2. Loading the data

In [24]:
# Load the data
data = pd.read_csv("./Data_csv/data_cleaned.csv")
# Preview the first rows
data.head()

Unnamed: 0,year,selling_price,km_driven,fuel,transmission,owner,engine,max_power
0,2014.0,450000.0,145500.0,1,2,0,1248.0,74.0
1,2014.0,370000.0,120000.0,1,2,1,1498.0,103.52
2,2010.0,225000.0,127000.0,1,2,0,1396.0,90.0
3,2017.0,440000.0,45000.0,2,2,0,1197.0,81.86
4,2011.0,350000.0,90000.0,1,2,0,1364.0,67.1


In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6475 entries, 0 to 6474
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   year           6475 non-null   float64
 1   selling_price  6475 non-null   float64
 2   km_driven      6475 non-null   float64
 3   fuel           6475 non-null   int64  
 4   transmission   6475 non-null   int64  
 5   owner          6475 non-null   int64  
 6   engine         6475 non-null   float64
 7   max_power      6475 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 404.8 KB


# 3. Explanation

To build a car price prediction model, follow these key steps:

1. Splitting the data: Divide the data into a training set (to fit the model) and a test set (to evaluate its performance). A common rule is to reserve about 70-80% of the data for training and the rest for testing.

2. Choosing the model: There are different types of predictive models, such as linear regression, neural networks, decision trees, etc. Select the model that best fits the problem, based on the nature of the data and the objectives.

3. Training the model: Use the training set to fit the model. The model learns from the data and adjusts its parameters to minimize the prediction error.

4. Validation and tuning: Evaluate the model's performance on the test set. If the performance is not satisfactory, adjust the model's hyperparameters or consider a different model. Ideally, tune on a validation split or with cross-validation so that the test set remains an unbiased final check.
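The four steps can be sketched end to end on a small example. This is a minimal illustration, not the notebook's actual pipeline: the data is synthetic, and the names `age`, `km`, `price`, `F_train`, `F_test`, `p_train`, `p_test` and `demo_model` are assumptions chosen so that they do not clash with the notebook's own variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the car data (hypothetical): price falls with age and mileage.
rng = np.random.default_rng(0)
age = rng.uniform(0, 15, size=500)
km = rng.uniform(5_000, 200_000, size=500)
price = 900_000 - 30_000 * age - 1.5 * km + rng.normal(0, 20_000, size=500)
features = np.column_stack([age, km])

# Step 1: hold out 20% of the rows for testing.
F_train, F_test, p_train, p_test = train_test_split(
    features, price, test_size=0.2, random_state=0
)

# Steps 2 and 3: choose a model and fit it on the training split only.
demo_model = LinearRegression().fit(F_train, p_train)

# Step 4: score the model on rows it has never seen.
demo_r2 = r2_score(p_test, demo_model.predict(F_test))
print(f"train rows: {len(F_train)}, test rows: {len(F_test)}, test R2: {demo_r2:.3f}")
```

Fitting only on the training split is what makes the test score a fair estimate of performance on new cars.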

# 4. Separating the data

In [26]:
# Features: every column except the target
X = data.drop(columns='selling_price')
# Target: the selling price
Y = data['selling_price']

In [27]:
X

Unnamed: 0,year,km_driven,fuel,transmission,owner,engine,max_power
0,2014.0,145500.0,1,2,0,1248.0,74.00
1,2014.0,120000.0,1,2,1,1498.0,103.52
2,2010.0,127000.0,1,2,0,1396.0,90.00
3,2017.0,45000.0,2,2,0,1197.0,81.86
4,2011.0,90000.0,1,2,0,1364.0,67.10
...,...,...,...,...,...,...,...
6470,2014.0,80000.0,1,2,1,1396.0,88.73
6471,2013.0,110000.0,2,2,0,1197.0,82.85
6472,2009.0,120000.0,1,2,0,1248.0,73.90
6473,2013.0,25000.0,1,2,0,1396.0,70.00


In [28]:
Y

0       450000.0
1       370000.0
2       225000.0
3       440000.0
4       350000.0
          ...   
6470    475000.0
6471    320000.0
6472    382000.0
6473    290000.0
6474    290000.0
Name: selling_price, Length: 6475, dtype: float64

### b. Splitting the data into Training data & Test Data

In [29]:
np.random.seed(42)

# Hold out 20% of the rows for testing; random_state alone makes this split reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

In [30]:
# Check the shapes of X_train and X_test
X_train.shape, X_test.shape

((5180, 7), (1295, 7))

In [31]:
# Check the shapes of Y_train and Y_test
Y_train.shape, Y_test.shape

((5180,), (1295,))
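When an integer `random_state` is passed to `train_test_split`, the split no longer depends on NumPy's global seed. The sketch below checks this on toy arrays; `rows`, `target`, `first`, `second` and `same_split` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20).reshape(10, 2)   # toy feature matrix (hypothetical data)
target = np.arange(10)                # toy target

np.random.seed(42)
first = train_test_split(rows, target, test_size=0.2, random_state=2)

np.random.seed(7)  # a different global seed
second = train_test_split(rows, target, test_size=0.2, random_state=2)

# Compare the four returned arrays pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)  # True
```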

# 5. Choosing the model

Now that the data is split into training and test sets, it is time to build a machine learning model.

We will train it (find the patterns) on the training set.

And we will test it (use the patterns) on the test set.

We will choose one of the following four machine learning models:

1. Linear Regression
2. Decision Tree Regressor
3. Random Forest Regressor
4. Gradient Boosting Regression
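One possible way to compare such candidates before touching the test set is cross-validation. The sketch below uses `ShuffleSplit` with `cross_val_score` on synthetic data. It is an assumption about how the comparison could be done, not the procedure this notebook follows, and `demo_X`, `demo_y`, `candidates` and `cv_scores` are illustrative names.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (hypothetical): one linear and one non-linear effect.
rng = np.random.default_rng(1)
demo_X = rng.uniform(0, 1, size=(300, 3))
demo_y = 5 * demo_X[:, 0] + np.sin(6 * demo_X[:, 1]) + rng.normal(0, 0.1, size=300)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting Regression": GradientBoostingRegressor(random_state=0),
}

# Five random 80/20 splits; each model is scored by its mean R2 across them.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cv_scores = {
    name: cross_val_score(est, demo_X, demo_y, cv=cv, scoring="r2").mean()
    for name, est in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R2 = {score:.3f}")
```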

In [32]:
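# Turn off scikit-learn autologging; parameters and metrics are logged explicitly in fit_and_score.
mlflow.sklearn.autolog(disable=True)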

In [33]:
# Put the models in a dictionary
models = {"Linear Regression": LinearRegression(),
         "Decision Tree Regressor": DecisionTreeRegressor(),
         "Random Forest Regressor": RandomForestRegressor(),
         "Gradient Boosting Regression": GradientBoostingRegressor(learning_rate=0.1, n_estimators=500, random_state=0)}


# Create a function to fit the models and log their scores to MLflow
def fit_and_score(models, X_train, X_test, Y_train, Y_test):
    """
    Fits the given machine learning models and logs their parameters,
    metrics and artifacts to MLflow (one run per model).
    models: a dict of different Scikit-Learn machine learning models
    X_train: training data (no labels)
    X_test: testing data (no labels)
    Y_train: training labels
    Y_test: test labels
    Relies on the notebook-level variables data, data_url and version.
    """
    # Set random seed
    np.random.seed(42)
    # Loop through the models, one MLflow run per model
    for name, model in models.items():
        with mlflow.start_run(run_name=name):
            # Describe the input data
            mlflow.log_param("data_url", data_url)
            mlflow.log_param("data_version", version)
            mlflow.log_param("input_rows", data.shape[0])
            mlflow.log_param("input_cols", data.shape[1])
            # Log the model's hyperparameters, then fit it
            mlflow.set_tag(key="model", value=name)
            mlflow.log_params(model.get_params())
            model.fit(X_train, Y_train)
            # Record which arguments were used for training
            mlflow.set_tag(key="train_features_name", value="X_train")
            mlflow.set_tag(key="train_label_name", value="Y_train")
            # MAE, MSE, RMSE and R2 are computed on the training set
            predicted = model.predict(X_train)
            mae = mean_absolute_error(Y_train, predicted)
            mse = mean_squared_error(Y_train, predicted)
            rmse = np.sqrt(mse)
            r2 = r2_score(Y_train, predicted)
            # For regressors, score() returns R2; here it is measured on the
            # test set and logged under the name "Accuracy"
            Accuracy = model.score(X_test, Y_test)
            mlflow.log_metric("Accuracy", Accuracy)
            mlflow.log_metric("MAE", mae)
            mlflow.log_metric("MSE", mse)
            mlflow.log_metric("RMSE", rmse)
            mlflow.log_metric("R2", r2)
            mlflow.sklearn.log_model(model, artifact_path="ML_models")

        

In [34]:
fit_and_score(models, X_train, X_test, Y_train, Y_test)
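In `fit_and_score`, MAE, MSE, RMSE and R2 are computed from predictions on `X_train`, while the value logged as "Accuracy" is the R2 returned by `model.score` on the test set. To obtain the error metrics on held-out data as well, a small helper along the following lines could be applied to `model.predict(X_test)`. The name `regression_report` and the toy values are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
    }


# Toy check (hypothetical values): absolute errors are 1, 0, 1, 0.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
report = regression_report(y_true, y_pred)
print(report)
```

For these toy values, MAE and MSE are both 0.5 and R2 is 0.9, since the residual sum of squares is 2 and the total sum of squares is 20.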

In [35]:
# Read the logged runs from MLflow into a pandas DataFrame
all_experiments = [exp.experiment_id for exp in mlflow.search_experiments()]
df_mlflow = mlflow.search_runs(experiment_ids=all_experiments,filter_string="metrics.Accuracy <1")
# Select the run with the highest test-set score (logged as "Accuracy")
run_id = df_mlflow.loc[df_mlflow['metrics.Accuracy'].idxmax()]['run_id']
print(run_id)

8841d39f8e11401f9b1b384904cb65c8


In [36]:
df_mlflow

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.MSE,metrics.MAE,metrics.RMSE,metrics.R2,metrics.Accuracy,params.tol,params.input_rows,params.min_impurity_decrease,params.n_iter_no_change,params.ccp_alpha,params.max_depth,params.data_url,params.min_samples_leaf,params.subsample,params.loss,params.warm_start,params.random_state,params.n_estimators,params.input_cols,params.criterion,params.min_weight_fraction_leaf,params.data_version,params.min_samples_split,params.verbose,params.learning_rate,params.alpha,params.max_leaf_nodes,params.init,params.validation_fraction,params.max_features,params.n_jobs,params.bootstrap,params.max_samples,params.oob_score,params.splitter,params.fit_intercept,params.positive,params.copy_X,tags.model,tags.train_label_name,tags.mlflow.user,tags.mlflow.source.name,tags.mlflow.log-model.history,tags.mlflow.source.type,tags.train_features_name,tags.mlflow.runName
0,8841d39f8e11401f9b1b384904cb65c8,0,FINISHED,mlflow-artifacts:/071bed8221704aee9f4a1d0750eb...,2023-10-15 20:49:58.087000+00:00,2023-10-15 20:50:55.091000+00:00,4640021000.0,49623.023132,68117.703049,0.909106,0.894397,0.0001,6475,0.0,,0.0,3.0,./Data csv/data_cleaned.csv,1.0,1.0,squared_error,False,0.0,500.0,8,friedman_mse,0.0,v1.0,2.0,0.0,0.1,0.9,,,0.1,,,,,,,,,,Gradient Boosting Regression,Y_train,rihemmanel54,C:\Users\MSI\anaconda3\envs\mlops\lib\site-pac...,"[{""run_id"": ""8841d39f8e11401f9b1b384904cb65c8""...",LOCAL,X_train,Gradient Boosting Regression
1,65e750cd2f8246acb6262619141fea74,0,FINISHED,mlflow-artifacts:/071bed8221704aee9f4a1d0750eb...,2023-10-15 20:45:47.166000+00:00,2023-10-15 20:49:55.747000+00:00,1198219000.0,23367.113732,34615.297654,0.976528,0.883039,,6475,0.0,,0.0,,./Data csv/data_cleaned.csv,1.0,,,False,,100.0,8,squared_error,0.0,v1.0,2.0,0.0,,,,,,1.0,,True,,False,,,,,Random Forest Regressor,Y_train,rihemmanel54,C:\Users\MSI\anaconda3\envs\mlops\lib\site-pac...,"[{""run_id"": ""65e750cd2f8246acb6262619141fea74""...",LOCAL,X_train,Random Forest Regressor
2,8ebed038cca5462fbcd2e2c4ce27b8f7,0,FINISHED,mlflow-artifacts:/071bed8221704aee9f4a1d0750eb...,2023-10-15 20:44:56.979000+00:00,2023-10-15 20:45:46.890000+00:00,463560600.0,6479.951804,21530.457438,0.990919,0.808349,,6475,0.0,,0.0,,./Data csv/data_cleaned.csv,1.0,,,,,,8,squared_error,0.0,v1.0,2.0,,,,,,,,,,,,best,,,,Decision Tree Regressor,Y_train,rihemmanel54,C:\Users\MSI\anaconda3\envs\mlops\lib\site-pac...,"[{""run_id"": ""8ebed038cca5462fbcd2e2c4ce27b8f7""...",LOCAL,X_train,Decision Tree Regressor
3,ff045d3e44d94b8f8d861fb06c16e61a,0,FINISHED,mlflow-artifacts:/071bed8221704aee9f4a1d0750eb...,2023-10-15 20:44:20.934000+00:00,2023-10-15 20:44:53.211000+00:00,14631440000.0,95538.933626,120960.509958,0.713383,0.730273,,6475,,,,,./Data csv/data_cleaned.csv,,,,,,,8,,,v1.0,,,,,,,,,,,,,,True,False,True,Linear Regression,Y_train,rihemmanel54,C:\Users\MSI\anaconda3\envs\mlops\lib\site-pac...,"[{""run_id"": ""ff045d3e44d94b8f8d861fb06c16e61a""...",LOCAL,X_train,Linear Regression
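The best-run selection relies on `idxmax`, which returns the index label of the largest value in a column. The sketch below reproduces that step offline on a small DataFrame. The run names and `metrics.Accuracy` values are copied from the four runs in the table above, while `runs`, `best_row` and `ranked` are illustrative names.

```python
import pandas as pd

# Values copied from the runs table; no MLflow connection is needed for this sketch.
runs = pd.DataFrame({
    "tags.mlflow.runName": [
        "Gradient Boosting Regression",
        "Random Forest Regressor",
        "Decision Tree Regressor",
        "Linear Regression",
    ],
    "metrics.Accuracy": [0.894397, 0.883039, 0.808349, 0.730273],
})

# idxmax gives the row label of the highest score; loc fetches that row.
best_row = runs.loc[runs["metrics.Accuracy"].idxmax()]
print(best_row["tags.mlflow.runName"])  # Gradient Boosting Regression

# Sorting gives the full ranking rather than only the winner.
ranked = runs.sort_values("metrics.Accuracy", ascending=False)
print(ranked.to_string(index=False))
```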


# 6. Feature Scaling

In [37]:
scaler = StandardScaler()
# Fit the scaler on the training set only, then apply the same transformation to the test set
x_train_sc = scaler.fit_transform(X_train)
x_test_sc = scaler.transform(X_test)
x_train_sc[0:10,:]

array([[ 1.16591877e+00, -1.04064318e+00,  9.35849042e-01,
        -3.76704338e+00, -6.43848689e-01, -3.89447874e-01,
        -5.76348029e-02],
       [ 1.16591877e+00, -1.16758995e+00, -9.24357419e-01,
         2.65460176e-01, -6.43848689e-01,  3.36856102e-01,
         7.08048762e-01],
       [-4.26849632e-01,  1.01877812e-01,  9.35849042e-01,
         2.65460176e-01,  9.59582180e-01, -3.84621934e-01,
        -1.67785772e-01],
       [ 5.28811408e-01, -1.19297931e+00,  9.35849042e-01,
         2.65460176e-01, -6.43848689e-01, -1.31361539e+00,
        -1.25138473e+00],
       [ 8.47365088e-01, -7.86749623e-01, -9.24357419e-01,
         2.65460176e-01,  2.56301305e+00,  5.39545584e-01,
         1.92776925e+00],
       [ 8.47365088e-01, -1.52015741e-01, -9.24357419e-01,
         2.65460176e-01, -6.43848689e-01, -2.66386403e-01,
         2.39683260e-01],
       [ 1.48447245e+00, -1.57381964e+00,  9.35849042e-01,
         2.65460176e-01, -6.43848689e-01, -3.89447874e-01,
        -5.7634802

In [38]:
fit_and_score(models, x_train_sc, x_test_sc, Y_train, Y_test)
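As a hedged illustration of what standardization does, the sketch below checks that the scaled training columns have mean 0 and standard deviation 1. It then wraps the scaler and a model in a scikit-learn `Pipeline` so the scaler is always fitted on training rows only. The data is synthetic, and `raw_X`, `raw_y`, `scaled_model` and `plain_model` are illustrative names. It also shows that ordinary least squares with an intercept gives the same predictions with or without scaling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (hypothetical): a year and a mileage.
rng = np.random.default_rng(2)
raw_X = np.column_stack([rng.uniform(2000, 2020, 200), rng.uniform(1_000, 200_000, 200)])
raw_y = 1_000 * (raw_X[:, 0] - 2000) - 0.5 * raw_X[:, 1] + rng.normal(0, 500, 200)

# After standardization, each training column has mean 0 and standard deviation 1.
scaled_train = StandardScaler().fit_transform(raw_X[:150])
print(scaled_train.mean(axis=0).round(6), scaled_train.std(axis=0).round(6))

# The pipeline fits the scaler on the training rows and reuses it at predict time.
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(raw_X[:150], raw_y[:150])
plain_model = LinearRegression().fit(raw_X[:150], raw_y[:150])

# Linear regression with an intercept is unaffected by rescaling the features.
same_predictions = np.allclose(
    scaled_model.predict(raw_X[150:]), plain_model.predict(raw_X[150:])
)
print(same_predictions)
```

Scaling matters more for models that are sensitive to feature magnitudes, such as regularized linear models like Lasso or ElasticNet.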

In [39]:
# Best model
all_experiments = [exp.experiment_id for exp in mlflow.search_experiments()]
df_mlflow = mlflow.search_runs(experiment_ids=all_experiments,filter_string="metrics.Accuracy <1")
run_id = df_mlflow.loc[df_mlflow['metrics.Accuracy'].idxmax()]['run_id']
print(run_id)

18494b6c486242baa36f9541cce13032


<h2 style="text-align:right;">Continue to Step 4 ...</h2>