
## Table of Contents

1. [Importing the Libraries](#importing-the-libraries)
2. [Loading the Data](#loading-the-data)
3. [Preparing the Data](#preparing-the-data)
4. [Transforming the Data](#transforming-the-data)
5. [Splitting the Data](#splitting-the-data)
6. [Feature Selection](#feature-selection)
7. [Filtering the Features](#filtering-the-features)
8. [Training and Logging the Models](#training-and-logging-the-models)
9. [Model Validation](#model-validation)

---

## Importing the Libraries

We start by importing the libraries needed for data processing, modeling, and logging the results.


In [1]:
import datetime

In [2]:
start_time = datetime.datetime.now()

print(start_time)

2024-08-07 11:42:13.263803


In [3]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from joblib import parallel_backend
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from mlflow.models import validate_serving_input, convert_input_example_to_serving_input
import math

import warnings
warnings.filterwarnings('ignore')

## Loading the Data
Let's define a function that loads the data from the CSV file.

In [4]:
data_path = "C:/Users/Asus_M/Desktop/data.csv"

def load_data(data_path):
    data = pd.read_csv(data_path, index_col=0)
    return data

data = load_data(data_path)
data.head()

Unnamed: 0,PASSENGERS,FREIGHT,MAIL,DISTANCE,UNIQUE_CARRIER,AIRLINE_ID,UNIQUE_CARRIER_NAME,REGION,CARRIER,CARRIER_NAME,...,DEST,DEST_CITY_NAME,DEST_COUNTRY,DEST_COUNTRY_NAME,DEST_WAC,YEAR,QUARTER,MONTH,DISTANCE_GROUP,CLASS
0,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,I,AMQ,Ameristar Air Cargo,...,YQG,"Windsor, Canada",CA,Canada,936,2010,2,6,1,P
1,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,I,AMQ,Ameristar Air Cargo,...,YIP,"Detroit, MI",US,United States,43,2010,1,3,1,P
2,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,I,AMQ,Ameristar Air Cargo,...,YIP,"Detroit, MI",US,United States,43,2010,2,6,1,P
3,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,I,AMQ,Ameristar Air Cargo,...,YIP,"Detroit, MI",US,United States,43,2010,3,8,1,P
4,0.0,0.0,0.0,29.0,AMQ,20201,Ameristar Air Cargo,I,AMQ,Ameristar Air Cargo,...,YIP,"Detroit, MI",US,United States,43,2010,3,9,1,P


---

## Preparing the Data
Let's prepare the data by cleaning it and adjusting the column types.

In [5]:
def prepare_data(data):
    data.columns = data.columns.str.replace(' ', '', regex=False)
    colonnes_avec_ID = [col for col in data.columns if 'ID' in col]
    data.drop(columns=colonnes_avec_ID, axis=1, inplace=True)
    
    # Convert the categorical columns to object dtype
    cat_columns = ['CARRIER_GROUP', 'CARRIER_GROUP_NEW', 'ORIGIN_WAC', 'DEST_WAC', 'YEAR', 'QUARTER', 'MONTH', 'DISTANCE_GROUP']
    data[cat_columns] = data[cat_columns].astype(object)
    
    # Drop the rows where PASSENGERS, FREIGHT, and MAIL are all equal to 0
    lignes_zero_valeurs = data[(data['PASSENGERS'] == 0) & (data['FREIGHT'] == 0) & (data['MAIL'] == 0)].index
    data.drop(lignes_zero_valeurs, inplace=True)
    
    # Drop the rows with missing values
    data.dropna(inplace=True)
    
    return data

prepared_data = prepare_data(data)
prepared_data.head()
prepared_data.shape

(54942, 26)
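
The row filter used in `prepare_data` can be checked on a small table. The sketch below is only an illustration: the three-row DataFrame is hypothetical (it is not taken from the real CSV), and unlike `prepare_data` it returns a new frame instead of modifying its input.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "PASSENGERS": [0.0, 120.0, 35.0],
    "FREIGHT": [0.0, 0.0, np.nan],
    "MAIL": [0.0, 15.0, 2.0],
})

# True for rows where the three traffic columns are all zero.
all_zero = (toy[["PASSENGERS", "FREIGHT", "MAIL"]] == 0).all(axis=1)

# Keep the other rows, then drop rows that still contain missing values.
cleaned = toy.loc[~all_zero].dropna()

print(cleaned.index.tolist())
```

Row 0 is removed by the all-zero filter and row 2 by `dropna`, so only row 1 remains.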

---

## Transforming the Data
Let's transform the data by label-encoding the categorical variables and applying a log(1 + x) transformation to the numeric variables.

In [6]:
def transform_data(prepared_data):
    label_encoder = LabelEncoder()
    # Copy the selections so the assignments below never write to a view of prepared_data
    cat_data = prepared_data.select_dtypes(include='object').copy()
    num_data = prepared_data.select_dtypes(exclude='object').copy()
    
    # Log transformation: log1p(x) = log(1 + x), which is defined for zero counts
    num_data['Log_PASSENGERS'] = np.log1p(num_data['PASSENGERS'])
    num_data['Log_FREIGHT'] = np.log1p(num_data['FREIGHT'])
    num_data['Log_MAIL'] = np.log1p(num_data['MAIL'])
    num_data['Log_DISTANCE'] = np.log1p(num_data['DISTANCE'])
    num_data.drop(columns=['PASSENGERS', 'FREIGHT', 'MAIL', 'DISTANCE'], inplace=True)
    
    # Label-encode the categorical variables (the encoder is refit on each column)
    for column in cat_data.columns:
        cat_data[column] = label_encoder.fit_transform(cat_data[column])
    
    data = pd.concat([num_data, cat_data], axis=1)
    return data

transformed_data = transform_data(prepared_data)
transformed_data.head()
transformed_data.shape
transformed_data.columns

Index(['Log_PASSENGERS', 'Log_FREIGHT', 'Log_MAIL', 'Log_DISTANCE',
       'UNIQUE_CARRIER', 'UNIQUE_CARRIER_NAME', 'REGION', 'CARRIER',
       'CARRIER_NAME', 'CARRIER_GROUP', 'CARRIER_GROUP_NEW', 'ORIGIN',
       'ORIGIN_CITY_NAME', 'ORIGIN_COUNTRY', 'ORIGIN_COUNTRY_NAME',
       'ORIGIN_WAC', 'DEST', 'DEST_CITY_NAME', 'DEST_COUNTRY',
       'DEST_COUNTRY_NAME', 'DEST_WAC', 'YEAR', 'QUARTER', 'MONTH',
       'DISTANCE_GROUP', 'CLASS'],
      dtype='object')
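
Because the numeric columns are transformed with `np.log1p`, values on the log scale are mapped back to the original scale with `np.expm1`, not `np.exp`. A minimal round-trip sketch on hypothetical counts (the array below is made up for illustration):

```python
import numpy as np

# Hypothetical raw counts, including a zero.
raw = np.array([0.0, 1.0, 49.0, 1000.0])

logged = np.log1p(raw)        # log(1 + x)
restored = np.expm1(logged)   # exp(y) - 1, the inverse of log1p

# expm1 recovers the original values; exp alone is off by exactly one.
print(np.allclose(restored, raw))
print(np.allclose(np.exp(logged), raw + 1))
```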

---

## Splitting the Data
Let's split the data into a training set (80%) and a test set (20%).

In [7]:
def split_data(transformed_data):
    X = transformed_data.drop(columns=["Log_PASSENGERS", "Log_FREIGHT", "Log_MAIL"])
    y = transformed_data[["Log_PASSENGERS", "Log_FREIGHT", "Log_MAIL"]]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(transformed_data)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(43953, 23)
(43953, 3)
(10989, 23)
(10989, 3)
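
The row counts above follow from simple arithmetic on the 54942 prepared rows. The sketch below assumes that the test share is rounded up and the training set takes the remainder, which is consistent with the printed shapes; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(54942))
```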


---

## Feature Selection
Let's use mlxtend's SequentialFeatureSelector in backward mode (`forward=False`) to select the best-scoring subset of features.

In [8]:
def select_features(X_train, y_train, k_features='best', n_jobs=-1, cv=3, scoring='neg_mean_squared_error', verbose=2):
    etr = ExtraTreesRegressor(n_jobs=n_jobs,random_state=42)
    sfs = SFS(
        etr,
        k_features=k_features,
        forward=False,
        floating=False,
        verbose=verbose,
        scoring=scoring,
        cv=cv,
        n_jobs=n_jobs
    )
    
    # Perform feature selection using joblib for parallel processing
    with parallel_backend('threading', n_jobs=n_jobs):
        sfs = sfs.fit(X_train, y_train)
    
    # Get the selected feature indices
    selected_feature_indices = sfs.k_feature_idx_
    
    if isinstance(X_train, pd.DataFrame):
        feature_names = X_train.columns[list(selected_feature_indices)]
        return list(selected_feature_indices), feature_names.tolist()
    else:
        return list(selected_feature_indices)

selected_indices, selected_feature_names = select_features(X_train, y_train)
print("Selected feature indices:", selected_indices)
print("Selected feature names:", selected_feature_names)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  23 | elapsed:  2.7min remaining:  2.5min
[Parallel(n_jobs=-1)]: Done  23 out of  23 | elapsed:  4.1min finished

[2024-08-07 11:46:37] Features: 22/1 -- score: -0.749452776786056
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  11 out of  22 | elapsed:  1.8min remaining:  1.8min
[Parallel(n_jobs=-1)]: Done  22 out of  22 | elapsed:  3.2min finished

[2024-08-07 11:49:49] Features: 21/1 -- score: -0.7465249047195583
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 out of  21 | elapsed:  1.8min remaining:  2.5min
[Parallel(n_jobs=-1)]: Done  21 out of  21 | elapsed:  3.0min finished

[2024-08-07 11:52:47] Features: 20/1 -- score: -0.7443903165484197
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[...]
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:    1.3s finished

[2024-08-07 12:13:25] Features: 1/1 -- score: -3.8728223161012885

Selected feature indices: [0, 3, 5, 7, 8, 9, 12, 13, 14, 18, 20, 21, 22]
Selected feature names: ['Log_DISTANCE', 'REGION', 'CARRIER_NAME', 'CARRIER_GROUP_NEW', 'ORIGIN', 'ORIGIN_CITY_NAME', 'ORIGIN_WAC', 'DEST', 'DEST_CITY_NAME', 'YEAR', 'MONTH', 'DISTANCE_GROUP', 'CLASS']
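
The log shows backward elimination at work: the selector starts from all 23 features, removes one per round, and finally reports the best-scoring subset it saw. The sketch below illustrates that idea only. It is not the mlxtend implementation: it swaps in ordinary least squares and a single hold-out split in place of `ExtraTreesRegressor` with 3-fold cross-validation, runs on synthetic data, and `neg_mse` and `backward_select` are hypothetical helper names.

```python
import numpy as np

def neg_mse(X_tr, y_tr, X_te, y_te):
    """Fit least squares on the training part and score it on the held-out part."""
    A_tr = np.column_stack([X_tr, np.ones(len(X_tr))])  # append an intercept column
    A_te = np.column_stack([X_te, np.ones(len(X_te))])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return -np.mean((A_te @ coef - y_te) ** 2)

def backward_select(X_tr, y_tr, X_te, y_te):
    """Greedy backward elimination; returns the best-scoring subset seen and its score."""
    current = list(range(X_tr.shape[1]))
    best_subset = current[:]
    best_score = neg_mse(X_tr, y_tr, X_te, y_te)
    while len(current) > 1:
        # Try removing each remaining feature and keep the least harmful removal.
        trials = []
        for f in current:
            subset = [c for c in current if c != f]
            score = neg_mse(X_tr[:, subset], y_tr, X_te[:, subset], y_te)
            trials.append((score, subset))
        score, current = max(trials, key=lambda t: t[0])
        if score > best_score:
            best_score, best_subset = score, current[:]
    return best_subset, best_score

# Synthetic data: only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

subset, score = backward_select(X[:150], y[:150], X[150:], y[150:])
print(subset, score)
```

Whatever the noise columns do, the two informative columns stay in the selected subset, because removing either of them makes the hold-out error jump.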

---

## Filtering the Features
Let's filter the training and test sets to keep only the selected features.

In [9]:
def filter_features(X_train, X_test, selected_indices):
    if isinstance(X_train, pd.DataFrame):
        return X_train.iloc[:, selected_indices], X_test.iloc[:, selected_indices]
    else:
        return X_train[:, selected_indices], X_test[:, selected_indices]

X_train_selected, X_test_selected = filter_features(X_train, X_test, selected_indices)
print("X_train with the selected features:", X_train_selected.shape)
print("X_test with the selected features:", X_test_selected.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

X_train with the selected features: (43953, 13)
X_test with the selected features: (10989, 13)
y_train: (43953, 3)
y_test: (10989, 3)


## Saving the Features to a TXT File
After the selection step, we save the names of the selected features to a text file, one name per line.

In [10]:
columns = X_train_selected.columns.tolist()

with open('features.txt', 'w') as file:
    for column in columns:
        file.write(f"{column}\n")

print("The column names have been saved to features.txt.")

The column names have been saved to features.txt.
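
The file is useful later only if the names can be read back in the same order. A minimal round-trip sketch, assuming the one-name-per-line layout used above; `write_features` and `read_features` are hypothetical helpers, and the example writes to a temporary directory instead of the working directory.

```python
import os
import tempfile

def write_features(path, names):
    """Write one feature name per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for name in names:
            fh.write(f"{name}\n")

def read_features(path):
    """Read the feature names back, skipping blank lines and keeping the order."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]

names = ["Log_DISTANCE", "REGION", "CARRIER_NAME"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.txt")
    write_features(path, names)
    restored = read_features(path)

print(restored == names)
```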


---

## Training and Logging the Models
Let's train several models and log them with MLflow.

In [11]:
# Set the MLflow tracking URI and experiment
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment('experience1')

def train_and_log_model(model_name, model, X_train_selected, y_train, X_test_selected, y_test):
    with mlflow.start_run(run_name=model_name):
        # Train the model
        model.fit(X_train_selected, y_train)
        
        # Predict on the test set
        predictions = model.predict(X_test_selected)
        
        # Compute MSE, RMSE, and R2 (each averaged over the three targets)
        mse = mean_squared_error(y_test, predictions)
        rmse = np.sqrt(mse)
        r2 = r2_score(y_test, predictions)
        
        # Log the parameters and the metrics
        mlflow.log_params(model.get_params())
        mlflow.log_metric("mean_squared_error", mse)
        mlflow.log_metric("root_mean_squared_error", rmse)
        mlflow.log_metric("R2", r2)
        
        # Log the model
        mlflow.sklearn.log_model(model, model_name)
        
        return r2, rmse

models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(),
    'KNeighborsRegressor': KNeighborsRegressor(),
    'ExtraTreesRegressor': ExtraTreesRegressor()
}

best_model_name = None
best_r2 = -np.inf
best_rmse = np.inf

for model_name, model in models.items():
    r2, rmse = train_and_log_model(model_name, model, X_train_selected, y_train, X_test_selected, y_test)
    print(f"Model: {model_name}, R2: {r2:.4f}, RMSE: {rmse:.4f}")
    
    # Update the best model; a candidate must improve both R2 and RMSE
    if r2 > best_r2 and rmse < best_rmse:
        best_r2 = r2
        best_rmse = rmse
        best_model_name = model_name

if best_model_name:
    print(f"\nThe best model is '{best_model_name}' with R2 = {best_r2:.4f} and RMSE = {best_rmse:.4f}")

    # Log the best model in a dedicated run
    with mlflow.start_run(run_name="Best_Model_Production") as best_run:
        best_model = models[best_model_name]
        mlflow.sklearn.log_model(best_model, "best_model")
        mlflow.log_metric("R2", best_r2)
        mlflow.log_metric("root_mean_squared_error", best_rmse)
        
        # Register the model in the MLflow Model Registry
        model_uri = f"runs:/{best_run.info.run_id}/best_model"
        mlflow.register_model(model_uri, "Best_Model")

        # Move the newly registered version to the Production stage
        client = MlflowClient()
        latest_version = client.get_latest_versions("Best_Model", stages=["None"])[0].version
        client.transition_model_version_stage(
            name="Best_Model",
            version=latest_version,
            stage="Production"
        )

    print(f"The model '{best_model_name}' has been registered and promoted to Production.")
else:
    print("No model was selected.")


2024/08/07 12:13:28 INFO mlflow.tracking.fluent: Experiment with name 'experience1' does not exist. Creating a new experiment.
2024/08/07 12:13:36 INFO mlflow.tracking._tracking_service.client: 🏃 View run LinearRegression at: http://localhost:5000/#/experiments/995444263188306398/runs/9b436f8930a743d3adfc4541593898b3.
2024/08/07 12:13:36 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/995444263188306398.


Model: LinearRegression, R2: 0.1832, RMSE: 2.9935


2024/08/07 12:14:07 INFO mlflow.tracking._tracking_service.client: 🏃 View run RandomForestRegressor at: http://localhost:5000/#/experiments/995444263188306398/runs/2ebdf0293b6e4e99a0ab0d715ffbf9b5.
2024/08/07 12:14:07 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/995444263188306398.


Model: RandomForestRegressor, R2: 0.9263, RMSE: 0.8188


2024/08/07 12:14:11 INFO mlflow.tracking._tracking_service.client: 🏃 View run KNeighborsRegressor at: http://localhost:5000/#/experiments/995444263188306398/runs/ec5dac052297454b8e9e599f66e7e199.
2024/08/07 12:14:11 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/995444263188306398.


Model: KNeighborsRegressor, R2: 0.8440, RMSE: 1.2745


2024/08/07 12:14:35 INFO mlflow.tracking._tracking_service.client: 🏃 View run ExtraTreesRegressor at: http://localhost:5000/#/experiments/995444263188306398/runs/7804aba9c89a4213ba39fa23753ea05f.
2024/08/07 12:14:35 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/995444263188306398.


Model: ExtraTreesRegressor, R2: 0.9262, RMSE: 0.8243

The best model is 'RandomForestRegressor' with R2 = 0.9263 and RMSE = 0.8188


Successfully registered model 'Best_Model'.
2024/08/07 12:14:44 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: Best_Model, version 1
Created version '1' of model 'Best_Model'.
2024/08/07 12:14:44 INFO mlflow.tracking._tracking_service.client: 🏃 View run Best_Model_Production at: http://localhost:5000/#/experiments/995444263188306398/runs/c54223f5e2cd4eeaafa34bdd9b77f31a.
2024/08/07 12:14:44 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: http://localhost:5000/#/experiments/995444263188306398.


The model 'RandomForestRegressor' has been registered and promoted to Production.
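
The selection loop updates the best model only when a candidate improves both R2 and RMSE at once. A candidate that is better on one metric and worse on the other is skipped, so in general the winner can depend on the order in which the models are tried. One possible alternative, sketched below as an assumption and not as the notebook's method, is to rank on a single metric and use the other only as a tie-breaker. The scores are copied from the output above; `pick_best` is a hypothetical helper.

```python
# Scores copied from the run output: model name -> (R2, RMSE).
results = {
    "LinearRegression": (0.1832, 2.9935),
    "RandomForestRegressor": (0.9263, 0.8188),
    "KNeighborsRegressor": (0.8440, 1.2745),
    "ExtraTreesRegressor": (0.9262, 0.8243),
}

def pick_best(results):
    """Return the model name with the lowest RMSE, using the highest R2 as a tie-breaker."""
    return min(results, key=lambda name: (results[name][1], -results[name][0]))

print(pick_best(results))
```

On these four scores the rule picks `RandomForestRegressor`, the same model the notebook promotes.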


---

## Model Validation
Let's check that the registered model works correctly before deploying it.

In [12]:
import json
import pandas as pd
import numpy as np
import mlflow
from mlflow.tracking import MlflowClient

# Define MLflow tracking URI
mlflow.set_tracking_uri("http://localhost:5000")

def get_model_uri_from_stage(model_name, stage="Production"):
    client = MlflowClient()
    # Get the latest version of the model in the specified stage
    model_versions = client.get_latest_versions(model_name, stages=[stage])
    if not model_versions:
        raise ValueError(f"No version of model '{model_name}' found in stage '{stage}'.")
    
    # Get the model URI
    latest_version = model_versions[0].version
    model_uri = f"models:/{model_name}/{latest_version}"
    return model_uri

# Note: the two helpers below shadow the functions of the same name imported from mlflow.models
def convert_input_example_to_serving_input(input_example):
    # Convert DataFrame to NumPy array (the column names are dropped, only the order is kept)
    if isinstance(input_example, pd.DataFrame):
        input_example = input_example.values
    
    # Return as a list of lists (2D array) which is often expected by models
    return input_example.tolist()

def validate_serving_input(model_uri, serving_payload):
    # Load the model
    model = mlflow.pyfunc.load_model(model_uri)
    
    # Make a prediction
    prediction = model.predict(serving_payload)
    
    # Map the prediction back from the log scale.
    # The targets were built with np.log1p, so np.expm1 would be the exact inverse;
    # np.exp, used here, returns a value one unit higher for each target.
    prediction_original = np.exp(prediction)
    passengers = prediction_original[0][0]
    freight = prediction_original[0][1]
    mail = prediction_original[0][2]

    print(f"Predicted number of passengers: {math.floor(passengers)}")
    print(f"Predicted freight quantity: {math.floor(freight)}")
    print(f"Predicted mail quantity: {math.floor(mail)}")


# Example input data
INPUT_EXAMPLE = {
    "Log_DISTANCE": 7.723120,
    "REGION": 1,
    "CARRIER_NAME": 7,
    "CARRIER_GROUP_NEW": 2,
    "ORIGIN": 326,
    "ORIGIN_WAC": 48,
    "DEST": 412,
    "DEST_CITY_NAME": 351,
    "DEST_WAC": 58,
    "YEAR": 2,
    "MONTH": 6,
    "DISTANCE_GROUP": 4,
    "CLASS": 0
}


# Convert input example to DataFrame
input_df = pd.DataFrame([INPUT_EXAMPLE])

# Obtain the model URI
model_name = "Best_Model"  # Model name
model_uri = get_model_uri_from_stage(model_name, stage="Production")
print(f"Model URI in production: {model_uri}")

# Convert the input example to a serving input format
serving_payload = convert_input_example_to_serving_input(input_df)

# Validate the model with the example input
validate_serving_input(model_uri, serving_payload)


Model URI in production: models:/Best_Model/1


Downloading artifacts: 100%|██████████| 5/5 [00:46<00:00,  9.35s/it]


Predicted number of passengers: 50
Predicted freight quantity: 1
Predicted mail quantity: 1
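
Because the payload is sent as a plain list of lists, the model receives values by position only, so the keys of the input dictionary need to appear in the same order as the training columns. The sketch below shows one way to enforce that; it is an assumption-laden illustration, `build_row` is a hypothetical helper, and the three-feature example is made up (a real call would use the names saved in `features.txt`).

```python
def build_row(example, feature_order):
    """Return the values of `example` in `feature_order`, failing loudly on any mismatch."""
    missing = [name for name in feature_order if name not in example]
    extra = [name for name in example if name not in feature_order]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return [example[name] for name in feature_order]

# Hypothetical three-feature example, with the keys deliberately out of order.
feature_order = ["Log_DISTANCE", "REGION", "CLASS"]
example = {"CLASS": 0, "Log_DISTANCE": 7.72312, "REGION": 1}

payload = [build_row(example, feature_order)]  # list of lists, one row
print(payload)
```

A check like this raises an error when a feature is missing or unexpected, where a positional payload would silently feed a value into the wrong column.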


In [13]:
end_time = datetime.datetime.now()

print(end_time)

2024-08-07 12:15:32.396282


In [14]:
temps_total_pour_execution = end_time - start_time

print(temps_total_pour_execution)

0:33:19.132479
