
## Project Lifecycle Steps in the Notebook
## Orchestrated Experiments

1. [Importing the Libraries](#importing-the-libraries)
2. [Loading the Data](#loading-the-data)
3. [Preparing the Data](#preparing-the-data)
4. [Splitting the Data](#splitting-the-data)
5. [Transforming the Data](#transforming-the-data)
6. [Feature Selection](#feature-selection)
7. [Feature Filtering](#feature-filtering)
8. [Training, Evaluating, and Logging the Models](#training-evaluating-and-logging-the-models)
9. [Prediction](#prediction)

---

## Importing the Libraries

We start by importing all the libraries needed for data processing, modeling, and logging the results.


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from joblib import parallel_backend
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from mlflow.models import validate_serving_input, convert_input_example_to_serving_input
import math
import joblib

import warnings
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns",None)

## Loading the Data
We define a function that loads the data. The dataset is compiled from yearly CSV files covering 2010 to 2018.
It is filtered to keep only traffic between Canadian and US airports.

In [2]:
def load_data(data_path):
    # The first CSV column holds the saved row index
    data = pd.read_csv(data_path, index_col=0)
    # reset_index returns a new DataFrame, so the result must be assigned
    data = data.reset_index(drop=True)
    return data

In [3]:
data= load_data("../data/external/external_data.csv")
data.head()

Unnamed: 0,PASSENGERS,FREIGHT,MAIL,DISTANCE,UNIQUE_CARRIER,AIRLINE_ID,UNIQUE_CARRIER_NAME,UNIQUE_CARRIER_ENTITY,REGION,CARRIER,CARRIER_NAME,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_COUNTRY,ORIGIN_COUNTRY_NAME,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_COUNTRY,DEST_COUNTRY_NAME,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
0,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,16057,I,AMQ,Ameristar Air Cargo,1,4,16091,1609101,31295,YIP,"Detroit, MI",US,United States,43,16166,1616601,36166,YQG,"Windsor, Canada",CA,Canada,936,2010,2,6,1,P
1,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,16057,I,AMQ,Ameristar Air Cargo,1,4,16166,1616601,36166,YQG,"Windsor, Canada",CA,Canada,936,16091,1609101,31295,YIP,"Detroit, MI",US,United States,43,2010,1,3,1,P
2,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,16057,I,AMQ,Ameristar Air Cargo,1,4,16166,1616601,36166,YQG,"Windsor, Canada",CA,Canada,936,16091,1609101,31295,YIP,"Detroit, MI",US,United States,43,2010,2,6,1,P
3,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,16057,I,AMQ,Ameristar Air Cargo,1,4,16166,1616601,36166,YQG,"Windsor, Canada",CA,Canada,936,16091,1609101,31295,YIP,"Detroit, MI",US,United States,43,2010,3,8,1,P
4,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,16057,I,AMQ,Ameristar Air Cargo,1,4,16166,1616601,36166,YQG,"Windsor, Canada",CA,Canada,936,16091,1609101,31295,YIP,"Detroit, MI",US,United States,43,2010,3,9,1,P


In [4]:
data.tail()

Unnamed: 0,PASSENGERS,FREIGHT,MAIL,DISTANCE,UNIQUE_CARRIER,AIRLINE_ID,UNIQUE_CARRIER_NAME,UNIQUE_CARRIER_ENTITY,REGION,CARRIER,CARRIER_NAME,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_COUNTRY,ORIGIN_COUNTRY_NAME,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_COUNTRY,DEST_COUNTRY_NAME,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
61251,23203.0,61.0,0.0,800.0,UA,19977,United Air Lines Inc.,0A875,D,UA,United Air Lines Inc.,3,3,16229,1622902,31215,YVR,"Vancouver, Canada",CA,Canada,906,14771,1477104,32457,SFO,"San Francisco, CA",US,United States,91,2018,3,8,2,F
61252,23664.0,0.0,0.0,739.0,DL,19790,Delta Air Lines Inc.,01260,D,DL,Delta Air Lines Inc.,3,3,10397,1039707,30397,ATL,"Atlanta, GA",US,United States,34,16271,1627102,36106,YYZ,"Toronto, Canada",CA,Canada,936,2018,2,5,2,F
61253,23931.0,0.0,0.0,739.0,DL,19790,Delta Air Lines Inc.,01260,D,DL,Delta Air Lines Inc.,3,3,16271,1627102,36106,YYZ,"Toronto, Canada",CA,Canada,936,10397,1039707,30397,ATL,"Atlanta, GA",US,United States,34,2018,4,10,2,F
61254,24612.0,0.0,0.0,739.0,DL,19790,Delta Air Lines Inc.,01260,D,DL,Delta Air Lines Inc.,3,3,10397,1039707,30397,ATL,"Atlanta, GA",US,United States,34,16271,1627102,36106,YYZ,"Toronto, Canada",CA,Canada,936,2018,3,8,2,F
61255,25097.0,0.0,0.0,739.0,DL,19790,Delta Air Lines Inc.,01260,D,DL,Delta Air Lines Inc.,3,3,16271,1627102,36106,YYZ,"Toronto, Canada",CA,Canada,936,10397,1039707,30397,ATL,"Atlanta, GA",US,United States,34,2018,3,8,2,F


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 61256 entries, 0 to 61255
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   PASSENGERS             61256 non-null  float64
 1   FREIGHT                61256 non-null  float64
 2   MAIL                   61256 non-null  float64
 3   DISTANCE               61256 non-null  float64
 4   UNIQUE_CARRIER         61224 non-null  object 
 5   AIRLINE_ID             61256 non-null  int64  
 6   UNIQUE_CARRIER_NAME    61256 non-null  object 
 7   UNIQUE_CARRIER_ENTITY  61256 non-null  object 
 8   REGION                 61256 non-null  object 
 9   CARRIER                61224 non-null  object 
 10  CARRIER_NAME           61256 non-null  object 
 11  CARRIER_GROUP          61256 non-null  int64  
 12  CARRIER_GROUP_NEW      61256 non-null  int64  
 13  ORIGIN_AIRPORT_ID      61256 non-null  int64  
 14  ORIGIN_AIRPORT_SEQ_ID  61256 non-null  int64  
 15  ORIGIN_

## Preparing the Data
We prepare the data by cleaning it and adjusting the data types.

In [6]:
def prepare_data(data):
    # Note: the input DataFrame is modified in place and also returned
    data.columns = data.columns.str.strip()
    # Drop every identifier column (any column whose name contains 'ID')
    colonnes_avec_ID = [col for col in data.columns if 'ID' in col]
    data.drop(columns=colonnes_avec_ID, inplace=True)
    # Drop the descriptive name columns, which duplicate the coded columns
    data.drop(columns=['UNIQUE_CARRIER_NAME', 'UNIQUE_CARRIER_ENTITY', 'CARRIER_NAME', 'ORIGIN_CITY_NAME', 'ORIGIN_COUNTRY_NAME', 'DEST_CITY_NAME', 'DEST_COUNTRY_NAME'], inplace=True)

    # Treat the integer-coded columns as categorical by casting them to object
    cat_columns = data.select_dtypes(include='int64').columns
    data[cat_columns] = data[cat_columns].astype(object)

    # Drop the rows where PASSENGERS, FREIGHT, and MAIL are all equal to 0
    lignes_zero_valeurs = data[(data['PASSENGERS'] == 0) & (data['FREIGHT'] == 0) & (data['MAIL'] == 0)].index
    data.drop(lignes_zero_valeurs, inplace=True)

    # Drop the rows with missing values
    data.dropna(inplace=True)
    data.reset_index(drop=True, inplace=True)

    return data

In [7]:
prepared_data = prepare_data(data)
prepared_data.head()

Unnamed: 0,PASSENGERS,FREIGHT,MAIL,DISTANCE,UNIQUE_CARRIER,REGION,CARRIER,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN,ORIGIN_COUNTRY,ORIGIN_WAC,DEST,DEST_COUNTRY,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
0,0.0,31.0,0.0,1447.0,DL,D,DL,3,3,YEG,CA,916,ANC,US,1,2010,3,9,3,F
1,0.0,111.0,0.0,1376.0,GFQ,I,GFQ,1,4,LRD,US,74,YQG,CA,936,2010,1,3,3,P
2,0.0,125.0,0.0,409.0,U7,I,U7,1,1,YXU,CA,936,RFD,US,41,2010,3,9,1,P
3,0.0,129.0,0.0,133.0,U7,I,U7,1,1,PHN,US,43,YHM,CA,936,2010,1,1,1,P
4,0.0,188.0,0.0,284.0,U7,I,U7,1,1,GRR,US,43,YHM,CA,936,2010,3,7,1,P
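
To make the zero-row filter concrete, here is a minimal sketch on a toy DataFrame. The three rows are made up for illustration and do not come from the dataset. It uses a boolean mask instead of `drop`, which should give the same result.

```python
import pandas as pd

# Toy data (hypothetical values): only the second row is zero in all three measures
toy = pd.DataFrame({
    "PASSENGERS": [120.0, 0.0, 0.0],
    "FREIGHT": [0.0, 0.0, 31.0],
    "MAIL": [0.0, 0.0, 0.0],
})

# A row is removed only when PASSENGERS, FREIGHT, and MAIL are all 0
all_zero = (toy[["PASSENGERS", "FREIGHT", "MAIL"]] == 0).all(axis=1)
cleaned = toy[~all_zero].reset_index(drop=True)

print(cleaned)
```

Rows with at least one non-zero measure are kept, which is why the prepared data above still contains rows with `PASSENGERS` equal to 0.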


---

## Splitting the Data
We split the data into training and test sets.

In [8]:
def split_data(prepared_data):
    X = prepared_data.drop(columns=["PASSENGERS", "FREIGHT", "MAIL"])
    y = prepared_data[["PASSENGERS", "FREIGHT", "MAIL"]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(prepared_data)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(45532, 17)
(45532, 3)
(11384, 17)
(11384, 3)
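
The shapes printed above can be checked by hand. The sketch below assumes that `train_test_split` rounds the test set size up and gives the remaining rows to the training set when only `test_size` is provided.

```python
import math

n_samples = 45532 + 11384   # total rows after preparation, from the shapes above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test set size, rounded up
n_train = n_samples - n_test                # the training set takes the rest

print(n_samples, n_train, n_test)
```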


In [9]:
X_train.head()

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,REGION,CARRIER,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN,ORIGIN_COUNTRY,ORIGIN_WAC,DEST,DEST_COUNTRY,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
31466,366.0,RP,D,RP,2,2,YYZ,CA,936,JFK,US,22,2014,3,7,1,F
45374,1280.0,YV,D,YV,2,2,YYZ,CA,936,IAH,US,74,2016,2,4,3,F
19742,219.0,QX,D,QX,2,2,YLW,CA,906,SEA,US,93,2012,3,8,1,F
34203,2443.0,5V,D,5V,1,1,SDM,US,91,YMX,CA,941,2015,2,4,5,P
23326,1430.0,09Q,D,09Q,1,4,YYC,CA,916,STL,US,64,2013,4,12,3,L


In [10]:
X_test.head()

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,REGION,CARRIER,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN,ORIGIN_COUNTRY,ORIGIN_WAC,DEST,DEST_COUNTRY,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
1584,1229.0,9E,D,9E,2,2,TUL,US,73,YOW,CA,936,2010,2,4,3,F
34444,1640.0,5X,D,5X,3,3,YYC,CA,916,SDF,US,52,2015,2,4,4,G
26796,1114.0,UA,D,UA,3,3,DEN,US,82,YVR,CA,906,2013,2,6,3,F
6429,443.0,XE,D,XE,2,2,EWR,US,21,YQB,CA,941,2010,3,7,1,F
39528,127.0,OO,D,OO,3,3,YVR,CA,906,SEA,US,93,2015,4,12,1,F


In [11]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45532 entries, 31466 to 56422
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   DISTANCE           45532 non-null  float64
 1   UNIQUE_CARRIER     45532 non-null  object 
 2   REGION             45532 non-null  object 
 3   CARRIER            45532 non-null  object 
 4   CARRIER_GROUP      45532 non-null  object 
 5   CARRIER_GROUP_NEW  45532 non-null  object 
 6   ORIGIN             45532 non-null  object 
 7   ORIGIN_COUNTRY     45532 non-null  object 
 8   ORIGIN_WAC         45532 non-null  object 
 9   DEST               45532 non-null  object 
 10  DEST_COUNTRY       45532 non-null  object 
 11  DEST_WAC           45532 non-null  object 
 12  YEAR               45532 non-null  object 
 13  QUARTER            45532 non-null  object 
 14  MONTH              45532 non-null  object 
 15  DISTANCE_GROUP     45532 non-null  object 
 16  CLASS              4553

## Transforming the Data
We transform the data by encoding the categorical variables and applying a logarithmic transformation to the numeric data.

In [12]:
cat_columns = X_train.select_dtypes(include='object').columns.tolist()
num_columns = X_train.select_dtypes(exclude='object').columns.tolist()

In [13]:
cat_columns

['UNIQUE_CARRIER',
 'REGION',
 'CARRIER',
 'CARRIER_GROUP',
 'CARRIER_GROUP_NEW',
 'ORIGIN',
 'ORIGIN_COUNTRY',
 'ORIGIN_WAC',
 'DEST',
 'DEST_COUNTRY',
 'DEST_WAC',
 'YEAR',
 'QUARTER',
 'MONTH',
 'DISTANCE_GROUP',
 'CLASS']

In [14]:
num_columns

['DISTANCE']

In [15]:
def encode_categorical_columns(df, cat_columns):
    # Note: df is modified in place; each column gets its own LabelEncoder
    label_encoders = {}
    
    for col in cat_columns:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
        label_encoders[col] = le  # Keep the fitted LabelEncoder for each column
    
    return df, label_encoders

In [16]:
X_train_encoded, encoders = encode_categorical_columns(X_train, cat_columns)
X_train.head()

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,REGION,CARRIER,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN,ORIGIN_COUNTRY,ORIGIN_WAC,DEST,DEST_COUNTRY,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
31466,366.0,65,1,68,1,1,475,0,57,163,1,12,4,2,6,0,0
45374,1280.0,83,1,86,1,1,475,0,57,152,1,40,6,1,3,2,0
19742,219.0,64,1,67,1,1,417,0,51,284,1,51,2,2,7,0,0
34203,2443.0,12,1,15,0,0,334,1,48,363,0,59,5,1,3,4,3
23326,1430.0,1,1,1,0,3,470,0,53,297,1,33,3,3,11,2,2


In [48]:
with open('../models/label_encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)

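In [17]:
# Note: this call fits new encoders on X_test and rebinds `encoders`,
# so the integer codes below are not guaranteed to match those fitted on X_train
# (for example, ORIGIN 'YYC' is 470 in the training output and 322 here).
X_test_encoded, encoders = encode_categorical_columns(X_test, cat_columns)
X_test.head()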

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,REGION,CARRIER,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN,ORIGIN_COUNTRY,ORIGIN_WAC,DEST,DEST_COUNTRY,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
1584,1229.0,16,1,17,1,1,248,1,37,257,0,57,0,1,3,2,0
34444,1640.0,12,1,13,2,2,322,0,52,191,1,26,5,1,3,3,1
26796,1114.0,66,1,66,2,2,73,1,40,281,0,51,3,1,5,2,0
6429,443.0,75,1,75,1,1,87,1,9,263,0,58,0,2,6,0,0
39528,127.0,53,1,54,2,2,313,0,50,192,1,50,5,3,11,0,0
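
The cells above fit a separate set of encoders on each split, so the same label can receive different codes in training and test. One possible alternative, sketched here on toy data, is to fit a single encoder on the training set and reuse it on the test set. This swaps in scikit-learn's `OrdinalEncoder`, which is not what the notebook uses. The column values and the `-1` code for unseen labels are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy splits (hypothetical values); 'SEA' and 'G' never appear in the training data
train = pd.DataFrame({"ORIGIN": ["YYZ", "YVR", "ATL"], "CLASS": ["F", "P", "F"]})
test = pd.DataFrame({"ORIGIN": ["ATL", "YYZ", "SEA"], "CLASS": ["P", "G", "F"]})

# Fit on the training split only; labels unseen at fit time are mapped to -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train)
test_codes = encoder.transform(test)

print(train_codes)
print(test_codes)
```

With this approach a label such as `'YYZ'` keeps the same code in both splits, and the single fitted encoder is the object to persist for inference.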


In [18]:
### Log transformation of the numeric data:

In [19]:
def log_transform(x):
    return np.log1p(x)
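
As a quick illustration of why `np.log1p` suits count-like columns, the sketch below applies it to a few sample values and inverts it with `np.expm1`. The array is illustrative. Note that plain `np.exp` would return `x + 1` instead of `x`.

```python
import numpy as np

counts = np.array([0.0, 4.0, 23203.0])   # sample values; zeros are common in the targets

logged = np.log1p(counts)      # log(1 + x): defined at x = 0, where it returns 0
restored = np.expm1(logged)    # exp(x) - 1: the exact inverse of log1p

print(logged)
print(restored)
```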

In [20]:
X_train_encoded['DISTANCE'] = log_transform(X_train_encoded['DISTANCE'])
X_test_encoded['DISTANCE'] = log_transform(X_test_encoded['DISTANCE'])

In [21]:
X_train_encoded.head()

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,REGION,CARRIER,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN,ORIGIN_COUNTRY,ORIGIN_WAC,DEST,DEST_COUNTRY,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
31466,5.905362,65,1,68,1,1,475,0,57,163,1,12,4,2,6,0,0
45374,7.155396,83,1,86,1,1,475,0,57,152,1,40,6,1,3,2,0
19742,5.393628,64,1,67,1,1,417,0,51,284,1,51,2,2,7,0,0
34203,7.801391,12,1,15,0,0,334,1,48,363,0,59,5,1,3,4,3
23326,7.266129,1,1,1,0,3,470,0,53,297,1,33,3,3,11,2,2


In [22]:
X_test_encoded.head()

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,REGION,CARRIER,CARRIER_GROUP,CARRIER_GROUP_NEW,ORIGIN,ORIGIN_COUNTRY,ORIGIN_WAC,DEST,DEST_COUNTRY,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
1584,7.114769,16,1,17,1,1,248,1,37,257,0,57,0,1,3,2,0
34444,7.403061,12,1,13,2,2,322,0,52,191,1,26,5,1,3,3,1
26796,7.01661,66,1,66,2,2,73,1,40,281,0,51,3,1,5,2,0
6429,6.095825,75,1,75,1,1,87,1,9,263,0,58,0,2,6,0,0
39528,4.85203,53,1,54,2,2,313,0,50,192,1,50,5,3,11,0,0


In [23]:
for col in y_train.columns:
    y_train[col] = log_transform(y_train[col])

In [24]:
y_train.head()

Unnamed: 0,PASSENGERS,FREIGHT,MAIL
31466,7.241366,0.0,0.0
45374,8.137396,0.0,0.0
19742,8.038512,0.0,0.0
34203,0.0,7.774856,0.0
23326,3.850148,0.0,0.0


In [25]:
for col in y_test.columns:
    y_test[col] = log_transform(y_test[col])

In [26]:
y_test.head()

Unnamed: 0,PASSENGERS,FREIGHT,MAIL
1584,1.609438,0.0,0.0
34444,0.0,9.615205,0.0
26796,8.628019,9.322597,0.0
6429,8.320935,0.0,0.0
39528,8.102586,0.0,0.0


---

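## Feature Selection
We use mlxtend's SequentialFeatureSelector to select the best features. With `forward=False` and `floating=True`, it performs backward floating selection, and `k_features='best'` keeps the subset with the best cross-validation score.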

In [27]:
def select_features(X_train_encoded, y_train, k_features='best', n_jobs=-1, cv=3, scoring='neg_mean_squared_error', verbose=2):
    etr = ExtraTreesRegressor(n_jobs = n_jobs,random_state=42)
    sfs = SFS(
        etr,
        k_features=k_features,
        forward=False,
        floating=True,
        verbose=verbose,
        scoring=scoring,
        cv=cv,
        n_jobs=n_jobs
        )
    
    # Perform feature selection using joblib for parallel processing
    with parallel_backend('threading', n_jobs=n_jobs):
        sfs = sfs.fit(X_train_encoded, y_train)
    
    # Get the selected feature indices
    selected_feature_indices = sfs.k_feature_idx_
    
    if isinstance(X_train_encoded, pd.DataFrame):
        feature_names = X_train_encoded.columns[list(selected_feature_indices)]
        return list(selected_feature_indices), feature_names.tolist()
    else:
        return list(selected_feature_indices)

selected_indices, selected_feature_names = select_features(X_train_encoded, y_train)
print("Indices des caractéristiques sélectionnées :", selected_indices)
print("Noms des caractéristiques sélectionnées :", selected_feature_names)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  17 | elapsed:  1.4min remaining:  6.5min
[Parallel(n_jobs=-1)]: Done  12 out of  17 | elapsed:  1.5min remaining:   38.4s
[Parallel(n_jobs=-1)]: Done  17 out of  17 | elapsed:  2.0min finished

[2024-08-21 08:19:48] Features: 16/1 -- score: -0.7892925512262505[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of  16 | elapsed:  1.3min remaining:  9.3min
[Parallel(n_jobs=-1)]: Done  11 out of  16 | elapsed:  1.5min remaining:   40.1s
[Parallel(n_jobs=-1)]: Done  16 out of  16 | elapsed:  1.9min finished

[2024-08-21 08:21:43] Features: 15/1 -- score: -0.7879418833320101[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of  15 | elapsed:  1.5min remaining:  1.3min
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  1.8min finished
[P

Indices des caractéristiques sélectionnées : [0, 1, 6, 8, 9, 11, 12, 14, 16]
Noms des caractéristiques sélectionnées : ['DISTANCE', 'UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_WAC', 'DEST', 'DEST_WAC', 'YEAR', 'MONTH', 'CLASS']


[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   15.6s finished

[2024-08-21 08:44:16] Features: 1/1 -- score: -3.913760567435403
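
For readers without mlxtend, the idea of backward elimination can be sketched with scikit-learn's own `SequentialFeatureSelector`. This is a different implementation from the one used above: it has no floating step, and the example uses a linear model on synthetic data so that it runs in a moment. The data, the seed, and the choice of two features are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): only columns 0 and 2 drive the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Start from all 4 features and remove the least useful one at each step
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",
    scoring="neg_mean_squared_error",
    cv=3,
)
selector.fit(X, y)

# Indices of the columns that survived the elimination
selected = np.flatnonzero(selector.get_support()).tolist()
print("Selected column indices:", selected)
```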

---

## Feature Filtering
We filter the training and test data to keep only the selected features.

In [28]:
def filter_features(X_train_encoded, X_test_encoded, selected_indices):
    if isinstance(X_train_encoded, pd.DataFrame):
        return X_train_encoded.iloc[:, selected_indices], X_test_encoded.iloc[:, selected_indices]
    else:
        return X_train_encoded[:, selected_indices], X_test_encoded[:, selected_indices]

X_train, X_test= filter_features(X_train_encoded, X_test_encoded, selected_indices)
print("X_train avec les caractéristiques sélectionnées :", X_train.shape)
print("X_test avec les caractéristiques sélectionnées :", X_test.shape)
print("y_train :", y_train.shape)
print("y_test :", y_test.shape)

X_train avec les caractéristiques sélectionnées : (45532, 9)
X_test avec les caractéristiques sélectionnées : (11384, 9)
y_train : (45532, 3)
y_test : (11384, 3)


In [36]:
selected_features = X_train.columns
selected_features

Index(['DISTANCE', 'UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_WAC', 'DEST',
       'DEST_WAC', 'YEAR', 'MONTH', 'CLASS'],
      dtype='object')

## Saving the Features to a TXT File:
After the selection step, we save the names of the selected features to a text file.

### Rebuilding the DataFrame:
We keep only the features selected in the previous step.

In [37]:
def save_selected_features(file_path):
    # Write one feature name per line (reads the global `selected_features`)
    with open(file_path, 'w') as file:
        for feature in selected_features:
            file.write(f"{feature}\n")

In [38]:
save_selected_features('../selected_features/features.txt')

In [39]:
# Assumption: `load_selected_features` is called here but never defined in the notebook;
# this minimal version reads back one feature name per line.
def load_selected_features(file_path):
    with open(file_path) as file:
        return [line.strip() for line in file if line.strip()]

selected_features = load_selected_features('../selected_features/features.txt')
print("Caractéristiques sélectionnées:", selected_features)

Caractéristiques sélectionnées: ['DISTANCE', 'UNIQUE_CARRIER', 'ORIGIN', 'ORIGIN_WAC', 'DEST', 'DEST_WAC', 'YEAR', 'MONTH', 'CLASS']


In [40]:
X_train = X_train_encoded[selected_features]
X_test = X_test_encoded[selected_features]

In [41]:
X_train.head()

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,ORIGIN,ORIGIN_WAC,DEST,DEST_WAC,YEAR,MONTH,CLASS
31466,5.905362,65,475,57,163,12,4,6,0
45374,7.155396,83,475,57,152,40,6,3,0
19742,5.393628,64,417,51,284,51,2,7,0
34203,7.801391,12,334,48,363,59,5,3,3
23326,7.266129,1,470,53,297,33,3,11,2


In [42]:
X_test.head()

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,ORIGIN,ORIGIN_WAC,DEST,DEST_WAC,YEAR,MONTH,CLASS
1584,7.114769,16,248,37,257,57,0,3,0
34444,7.403061,12,322,52,191,26,5,3,1
26796,7.01661,66,73,40,281,51,3,5,0
6429,6.095825,75,87,9,263,58,0,6,0
39528,4.85203,53,313,50,192,50,5,11,0


In [43]:
y_train.head()

Unnamed: 0,PASSENGERS,FREIGHT,MAIL
31466,7.241366,0.0,0.0
45374,8.137396,0.0,0.0
19742,8.038512,0.0,0.0
34203,0.0,7.774856,0.0
23326,3.850148,0.0,0.0


In [44]:
y_test.head()

Unnamed: 0,PASSENGERS,FREIGHT,MAIL
1584,1.609438,0.0,0.0
34444,0.0,9.615205,0.0
26796,8.628019,9.322597,0.0
6429,8.320935,0.0,0.0
39528,8.102586,0.0,0.0


---

## Training, Evaluating, and Logging the Models
We train several models and log them with MLflow.

In [47]:
import numpy as np
import joblib  # Used to save the model locally
import mlflow
import os
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
import pickle  # Used to save the model locally

# Function to train and evaluate a model (not called by the loop below)
def train_and_evaluate_model(model_name, model, X_train, y_train, X_test, y_test):
    # Train the model
    model.fit(X_train, y_train)
    
    # Predictions
    predictions = model.predict(X_test)
    
    # Compute the mean squared error, its square root (RMSE), and R2
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, predictions)
    
    # Return the performance metrics
    return r2, rmse

# Function to log and track a model with MLflow
def log_model_with_mlflow(model_name, model, X_train, y_train, X_test, y_test, local_model_path):
    with mlflow.start_run(run_name=model_name):
        # Train the model
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        # Compute the metrics
        mse = mean_squared_error(y_test, predictions)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, predictions)

        # Log the parameters and the metrics
        mlflow.log_params(model.get_params())
        mlflow.log_metric("mean_squared_error", mse)
        mlflow.log_metric("root_mean_squared_error", rmse)
        mlflow.log_metric("R2", r2)

        # Log the model to MLflow
        mlflow.sklearn.log_model(model, model_name)
        
        # Save the model locally. Every model is written to the same path,
        # so the file ends up holding the last model trained.
        os.makedirs(os.path.dirname(local_model_path), exist_ok=True)
        joblib.dump(model, local_model_path)
        
        print(f"Modèle '{model_name}' enregistré localement sous '{local_model_path}'.")

    return r2, rmse

# Define the models to test
models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(),
    'KNeighborsRegressor': KNeighborsRegressor(),
    'ExtraTreesRegressor': ExtraTreesRegressor()
}

best_model_name = None
best_r2 = -np.inf
best_rmse = np.inf
local_model_path = "../models/best_model.pkl"

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment('experiment1')

# Train and evaluate each model
for model_name, model in models.items():
    r2, rmse = log_model_with_mlflow(model_name, model, X_train, y_train, X_test, y_test, local_model_path)
    print(f"Modèle: {model_name}, R2: {r2:.4f}, RMSE: {rmse:.4f}")

    # Update the best model: it must improve on both R2 and RMSE
    if r2 > best_r2 and rmse < best_rmse:
        best_r2 = r2
        best_rmse = rmse
        best_model_name = model_name

if best_model_name:
    print(f"\nLe meilleur modèle est '{best_model_name}' avec R2 = {best_r2:.4f} et RMSE = {best_rmse:.4f}")
else:
    print("Aucun modèle n'a été sélectionné.")


2024/08/21 09:00:49 INFO mlflow.tracking._tracking_service.client: 🏃 View run LinearRegression at: http://localhost:5000/#/experiments/439057172495272962/runs/1e420e6d79474199b61601b2df179a93.
2024/08/21 09:00:49 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/439057172495272962.


Modèle 'LinearRegression' enregistré localement sous '../models/best_model.pkl'.
Modèle: LinearRegression, R2: 0.1825, RMSE: 2.9813


2024/08/21 09:01:29 INFO mlflow.tracking._tracking_service.client: 🏃 View run RandomForestRegressor at: http://localhost:5000/#/experiments/439057172495272962/runs/48e6db22847f45c6b771fbea4c901a4b.
2024/08/21 09:01:29 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/439057172495272962.


Modèle 'RandomForestRegressor' enregistré localement sous '../models/best_model.pkl'.
Modèle: RandomForestRegressor, R2: 0.3166, RMSE: 2.4477


2024/08/21 09:01:35 INFO mlflow.tracking._tracking_service.client: 🏃 View run KNeighborsRegressor at: http://localhost:5000/#/experiments/439057172495272962/runs/e5ec1aed3edc4d868105c5140956e29a.
2024/08/21 09:01:35 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/439057172495272962.


Modèle 'KNeighborsRegressor' enregistré localement sous '../models/best_model.pkl'.
Modèle: KNeighborsRegressor, R2: -0.4718, RMSE: 4.0915


2024/08/21 09:02:12 INFO mlflow.tracking._tracking_service.client: 🏃 View run ExtraTreesRegressor at: http://localhost:5000/#/experiments/439057172495272962/runs/4e63f182f0504d1aa810523a0203c4ce.
2024/08/21 09:02:12 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/439057172495272962.


Modèle 'ExtraTreesRegressor' enregistré localement sous '../models/best_model.pkl'.
Modèle: ExtraTreesRegressor, R2: 0.4599, RMSE: 2.1970

Le meilleur modèle est 'ExtraTreesRegressor' avec R2 = 0.4599 et RMSE = 2.1970
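
The selection rule in the loop can be replayed on the metrics printed above without retraining anything. The sketch below copies the (R2, RMSE) pairs from the run output and applies the same joint condition, then compares it with a simpler single-metric rule. The single-metric rule is our suggestion, not part of the notebook.

```python
# (R2, RMSE) pairs copied from the run output above
results = {
    "LinearRegression": (0.1825, 2.9813),
    "RandomForestRegressor": (0.3166, 2.4477),
    "KNeighborsRegressor": (-0.4718, 4.0915),
    "ExtraTreesRegressor": (0.4599, 2.1970),
}

# Rule used in the loop: a model becomes the best only if it improves both metrics
best_name, best_r2, best_rmse = None, float("-inf"), float("inf")
for name, (r2, rmse) in results.items():
    if r2 > best_r2 and rmse < best_rmse:
        best_name, best_r2, best_rmse = name, r2, rmse

# Simpler alternative (assumption): rank on a single metric, here the lowest RMSE
best_by_rmse = min(results, key=lambda name: results[name][1])

print(best_name, best_by_rmse)
```

Both rules agree on this run. Ranking on one metric avoids a dependence on iteration order when a model improves one metric but not the other.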


---

## Prediction
We check that the saved model works correctly before deploying it.

In [54]:
import joblib
import pandas as pd
import numpy as np

def load_model(filepath):
    """Load the model from a file."""
    return joblib.load(filepath)


In [61]:
# Load the encoders
with open('../models/label_encoders.pkl', 'rb') as f:
    encoders = pickle.load(f)

# Load the model
model = load_model('../models/best_model.pkl')

# Define new data
new_data = {
    'DISTANCE': 739.00,
    'UNIQUE_CARRIER': 'DL',
    'ORIGIN': 'YYZ',
    'ORIGIN_WAC': 936,
    'DEST': 'ATL',
    'DEST_WAC': 34,
    'YEAR': 2018,
    'MONTH': 8,
    'CLASS': 'F'
}
new_data = pd.DataFrame([new_data])

# Encode the categorical columns with the saved encoders
for col in new_data.columns:
    if col in encoders:
        new_data[col] = encoders[col].transform(new_data[col])

# Apply the same log transformation as in training
new_data['DISTANCE'] = log_transform(new_data['DISTANCE'])
new_data.head()

Unnamed: 0,DISTANCE,UNIQUE_CARRIER,ORIGIN,ORIGIN_WAC,DEST,DEST_WAC,YEAR,MONTH,CLASS
0,6.60665,27,327,57,9,14,8,7,0
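
A fitted `LabelEncoder` raises a `ValueError` when asked to transform a label it did not see during fitting, which can happen with new airports or carriers at inference time. The sketch below shows one way to check labels against `classes_` first and report them clearly. The `safe_transform` helper and the three-label encoder are hypothetical and stand in for the saved encoders.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in encoder (assumption): fitted on three airport codes only
encoder = LabelEncoder().fit(["ATL", "YVR", "YYZ"])

def safe_transform(encoder, values):
    """Hypothetical helper: encode values, failing with a clear message on unseen labels."""
    known = set(encoder.classes_)
    unseen = sorted(set(values) - known)
    if unseen:
        raise ValueError(f"Labels not seen when the encoder was fitted: {unseen}")
    return encoder.transform(values)

print(safe_transform(encoder, ["YYZ", "ATL"]))

try:
    safe_transform(encoder, ["SEA"])
except ValueError as err:
    print(err)
```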


In [63]:
# Use the model to predict on the transformed new data
predictions = model.predict(new_data)
print("Prediction:", predictions)

Prediction: [[4.52503997 0.36149775 0.06272877]]


In [64]:
# The targets were transformed with np.log1p, so np.expm1 is the exact inverse
prediction_original = np.expm1(predictions)

# Extract predictions
passengers = prediction_original[0][0]
freight = prediction_original[0][1]
mail = prediction_original[0][2]

# Display the predictions
print(f"Le nombre de passagers prévu est : {math.floor(passengers)}")
print(f"La quantité de fret prévue : {math.floor(freight)}")
print(f"La quantité de courrier prévue est : {math.floor(mail)}")

Le nombre de passagers prévu est : 91
La quantité de fret prévue : 0
La quantité de courrier prévue est : 0
