## Introduction
Before running the code below, each of the following notebooks must have been run independently:
- 01_importation_fusion
- 02_weather
- 03_featuring

You can then:
- Use the "exploration" or "dataviz" notebooks in the NOTEBOOKS folder to explore the data or to display the plots used in the report
- Use the 04_regression, 05_classification, or 06_deep_learning notebooks to train the models, with plots and tables that make the reasoning easier to follow.

You can also:
- Run the models from the main (this notebook), choosing which models to run.
- To change the preprocessing, edit the functions defined in the "preprocessing" notebook
- To change the model parameters, edit the functions defined in the "regression", "classification", or "deep_learning" notebooks
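
Choosing which models to run from this notebook can be reduced to a lookup table from a model name to its training function. The sketch below is only an illustration: `MODEL_REGISTRY`, `run_selected`, and the two stand-in trainers are hypothetical names, not part of the project. In the real notebook the registry values would be functions such as `regression_lineaire` or `xgb_model`.

```python
from typing import Callable, Dict, Iterable, Tuple

# Stand-in trainers (hypothetical): each mimics the convention used in this
# notebook of returning (model, r2, rmse, mae). The numbers are dummy values.
def train_dummy_a(X_train, y_train, X_test, y_test) -> Tuple[str, float, float, float]:
    return "model_a", 0.10, 2.0, 1.5

def train_dummy_b(X_train, y_train, X_test, y_test) -> Tuple[str, float, float, float]:
    return "model_b", 0.30, 1.0, 0.8

# Name -> training function. Assumption: every trainer takes the same four arguments.
MODEL_REGISTRY: Dict[str, Callable] = {
    "dummy_a": train_dummy_a,
    "dummy_b": train_dummy_b,
}

def run_selected(names: Iterable[str], data: tuple) -> Dict[str, tuple]:
    """Run only the requested trainers and collect their results by name."""
    results = {}
    for name in names:
        if name not in MODEL_REGISTRY:
            raise KeyError(f"Unknown model: {name!r}")
        results[name] = MODEL_REGISTRY[name](*data)
    return results

# Placeholder data; the notebook would pass the output of prepross_reg() instead.
data = ([[0.0]], [0.0], [[0.0]], [0.0])
selected = run_selected(["dummy_b"], data)
print(selected)
```

Keeping the selection in a list of names means a model can be switched on or off without commenting out cells.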

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Scaling and evaluation
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV

# Dimensionality reduction
from sklearn.feature_selection import VarianceThreshold, SelectFromModel
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Undersampling
from imblearn.under_sampling import RandomUnderSampler

# Model evaluation
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score  # regression models
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # classification models

# Miscellaneous: project modules
from utils import dataframe_info, racine_projet
from preprocessing import prepross_reg, prepross_class
from regression import regression_lineaire, ridge_model, lasso_model, elasticnet_model, xgb_model, xgb_gridsearch
from classification import knn_class, decision_tree_class, random_forest_class, xgb_class, random_forest_gridsearch, xgb_class_gridsearch
from deep_learning import deep_learning_dense, deep_learning_improved


2024-07-25 06:46:32.489614: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Regression

In [3]:
X_train, y_train, X_test, y_test = prepross_reg()

In [4]:
dataframe_info(pd.DataFrame(X_train))

|   | Column | Non-Null Count | NaN Count | NaN Percentage | Dtype | Example Value |
|---|---|---|---|---|---|---|
| 0 | 0 | 1890728 | 0 | 0.0 | float64 | 0.0 |
| 1 | 1 | 1890728 | 0 | 0.0 | float64 | -0.594006 |
| 2 | 2 | 1890728 | 0 | 0.0 | float64 | 0.278521 |
| 3 | 3 | 1890728 | 0 | 0.0 | float64 | -0.333663 |
| 4 | 4 | 1890728 | 0 | 0.0 | float64 | 0.352111 |
| 5 | 5 | 1890728 | 0 | 0.0 | float64 | 0.925126 |
| 6 | 6 | 1890728 | 0 | 0.0 | float64 | -0.207481 |
| 7 | 7 | 1890728 | 0 | 0.0 | float64 | -0.253338 |
| 8 | 8 | 1890728 | 0 | 0.0 | float64 | -0.470705 |
| 9 | 9 | 1890728 | 0 | 0.0 | float64 | -1.596613 |


In [5]:
# Train the linear regression model
lr, r2_lr, rmse_lr, mae_lr = regression_lineaire(X_train, y_train, X_test, y_test)

r^2: 0.16339186525241034
Root Mean Squared Error (RMSE): 140.1886675972663
Mean Absolute Error (MAE): 101.43856528846408
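
The three figures printed above follow standard definitions. As a minimal sketch (the helper name `regression_metrics` and the toy values are illustrative assumptions, not the project's code), they can be computed from the predictions like this:

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # Mean absolute error: average magnitude of the errors.
    mae = sum(abs(e) for e in errors) / n
    # Root mean squared error: square root of the average squared error.
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # r^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean target.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return r2, rmse, mae

# Toy values chosen for illustration only.
r2, rmse, mae = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(f"r^2: {r2:.4f}  RMSE: {rmse:.4f}  MAE: {mae:.4f}")
```

Under this definition, an r² of about 0.16 means the model accounts for roughly 16% of the variance of the target on the test set.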


In [6]:
# Train the Ridge model
ridge, r2_ridge, rmse_ridge, mae_ridge = ridge_model(X_train, y_train, X_test, y_test)

Ridge r^2: 0.16339185038426218
Ridge Root Mean Squared Error (RMSE): 140.18866884297847
Ridge Mean Absolute Error (MAE): 101.43856343659506


In [7]:
# Train the Lasso model
lasso, r2_lasso, rmse_lasso, mae_lasso = lasso_model(X_train, y_train, X_test, y_test)

Lasso r^2: 0.15595092553241696
Lasso Root Mean Squared Error (RMSE): 140.8107188194601
Lasso Mean Absolute Error (MAE): 101.76065437022264


In [8]:
# Train the ElasticNet model
elastic_net, r2_en, rmse_en, mae_en = elasticnet_model(X_train, y_train, X_test, y_test)

r^2: 0.11936812613980696
Root Mean Squared Error (RMSE): 143.82986297706728
Mean Absolute Error (MAE): 104.20704609731379


In [9]:
# Train the XGB Regressor model
xgb, r2_xgb, rmse_xgb, mae_xgb = xgb_model(X_train, y_train, X_test, y_test)

r^2: 0.329063355922699
Root Mean Squared Error (RMSE): 125.54303161587222
Mean Absolute Error (MAE): 86.95122787852921
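
To compare the regressors at a glance, the scores printed in the cells above can be collected and sorted. The sketch below hard-codes those printed values (rounded to four decimals) purely for illustration; in the notebook, the variables `r2_lr`, `rmse_lr`, and so on would be used instead. The display labels are arbitrary.

```python
# (label, r2, rmse, mae) copied from the outputs above, rounded to 4 decimals.
scores = [
    ("LinearRegression", 0.1634, 140.1887, 101.4386),
    ("Ridge",            0.1634, 140.1887, 101.4386),
    ("Lasso",            0.1560, 140.8107, 101.7607),
    ("ElasticNet",       0.1194, 143.8299, 104.2070),
    ("XGBRegressor",     0.3291, 125.5430,  86.9512),
]

# Rank by RMSE, lowest (best) first.
ranked = sorted(scores, key=lambda row: row[2])

print(f"{'model':<18}{'r2':>8}{'RMSE':>10}{'MAE':>10}")
for label, r2, rmse, mae in ranked:
    print(f"{label:<18}{r2:>8.4f}{rmse:>10.4f}{mae:>10.4f}")
```

On these figures the XGB regressor ranks first on all three metrics, and the linear and Ridge models are indistinguishable at this precision.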


In [None]:
# Grid search for the XGB Regressor model
best_xgb, best_params, r2_bestxgb, rmse_bestxgb, mae_bestxgb = xgb_gridsearch(X_train, y_train, X_test, y_test)
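
A grid search tries every combination of candidate hyperparameters and keeps the best-scoring one. The internals of the project's grid search function are not shown in this notebook, so the following is a generic, dependency-free sketch of the idea: `grid_search`, the parameter names, and the toy scoring function are all hypothetical, and no cross-validation is performed (scikit-learn's `GridSearchCV`, imported above, adds that).

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and return the best."""
    names = list(param_grid)
    found_params, found_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > found_score:  # higher score is better
            found_params, found_score = params, score
    return found_params, found_score

# Toy score (assumption): constructed to peak at max_depth=6, learning_rate=0.1.
def toy_score(max_depth, learning_rate):
    return -((max_depth - 6) ** 2) - 100 * (learning_rate - 0.1) ** 2

grid = {"max_depth": [3, 6, 9], "learning_rate": [0.01, 0.1, 0.3]}
found_params, found_score = grid_search(grid, toy_score)
print(found_params, found_score)
```

The cost grows with the product of the list lengths (here 3 × 3 = 9 evaluations), which is why such a cell can take a long time on a training set of this size.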

## Classification

In [11]:
X_train, y_train, X_test, y_test = prepross_class()

In [16]:
dataframe_info(pd.DataFrame(X_train))

|   | Column | Non-Null Count | NaN Count | NaN Percentage | Dtype | Example Value |
|---|---|---|---|---|---|---|
| 0 | DeployedFromLocation | 200000 | 0 | 0.0 | int64 | 0 |
| 1 | PumpOrder | 200000 | 0 | 0.0 | float64 | -0.595599 |
| 2 | Easting_rounded | 200000 | 0 | 0.0 | float64 | -1.887522 |
| 3 | Northing_rounded | 200000 | 0 | 0.0 | float64 | 1.175974 |
| 4 | NumStationsWithPumpsAttending | 200000 | 0 | 0.0 | float64 | 0.349341 |
| ... | ... | ... | ... | ... | ... | ... |
| 67 | IncidentType_Spills and Leaks (not RTC) | 200000 | 0 | 0.0 | bool | False |
| 68 | IncidentType_Stand By | 200000 | 0 | 0.0 | bool | False |
| 69 | IncidentType_Suicide/attempts | 200000 | 0 | 0.0 | bool | False |
| 70 | IncidentType_Use of Special Operations Room | 200000 | 0 | 0.0 | bool | False |


In [12]:
# Train the KNN model
knn, accuracy_knn, cl_rep_knn, cm_knn = knn_class(X_train, y_train, X_test, y_test)

Accuracy: 0.33918080404837914
              precision    recall  f1-score   support

           0       0.35      0.51      0.42    119701
           1       0.28      0.30      0.29    117113
           2       0.30      0.24      0.27    118583
           3       0.48      0.30      0.37    117286

    accuracy                           0.34    472683
   macro avg       0.35      0.34      0.33    472683
weighted avg       0.35      0.34      0.33    472683


Confusion Matrix:
[[61066 33884 16607  8144]
 [48631 35648 21439 11395]
 [37905 33090 28617 18971]
 [26421 26134 29737 34994]]
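
The accuracy and the per-class precision and recall in the report can be recovered directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes (the scikit-learn convention, assumed here because the row sums match the `support` column). The sketch below uses the KNN matrix printed above; the variable names are illustrative.

```python
# KNN confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [61066, 33884, 16607, 8144],
    [48631, 35648, 21439, 11395],
    [37905, 33090, 28617, 18971],
    [26421, 26134, 29737, 34994],
]
n_classes = len(cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall of class i: diagonal entry over the row sum (all true samples of class i).
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision of class j: diagonal entry over the column sum (all samples predicted as j).
precision = [cm[j][j] / sum(row[j] for row in cm) for j in range(n_classes)]

print(f"accuracy: {accuracy:.4f}")
for k in range(n_classes):
    print(f"class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f}")
```

The same computation applies to the confusion matrices of the other classifiers in this section.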


In [13]:
# Train the DecisionTree model
dt, accuracy_dt, cl_rep_dt, cm_dt = decision_tree_class(X_train, y_train, X_test, y_test)

Accuracy: 0.39032501697755156
              precision    recall  f1-score   support

           0       0.48      0.47      0.48    119701
           1       0.32      0.33      0.33    117113
           2       0.32      0.33      0.32    118583
           3       0.43      0.43      0.43    117286

    accuracy                           0.39    472683
   macro avg       0.39      0.39      0.39    472683
weighted avg       0.39      0.39      0.39    472683


Confusion Matrix:
[[56681 31029 18189 13802]
 [29485 38548 29382 19698]
 [17638 29949 38601 32395]
 [13115 20363 33138 50670]]


In [14]:
# Train the Random Forest model
rf, accuracy_rf, cl_rep_rf, cm_rf = random_forest_class(X_train, y_train, X_test, y_test)

Accuracy: 0.41147238212501824
              precision    recall  f1-score   support

           0       0.43      0.60      0.50    119701
           1       0.33      0.28      0.30    117113
           2       0.34      0.25      0.29    118583
           3       0.51      0.52      0.52    117286

    accuracy                           0.41    472683
   macro avg       0.40      0.41      0.40    472683
weighted avg       0.40      0.41      0.40    472683


Confusion Matrix:
[[71400 26859 12965  8477]
 [46530 32593 22211 15779]
 [30336 25468 29868 32911]
 [18930 15345 22376 60635]]


In [15]:
# Train the XGBoost model
# The result is stored in xgb_clf so that it does not shadow the imported xgb_class function.
xgb_clf, accuracy_xgb_class, cl_rep_xgb_class, cm_xgb_class = xgb_class(X_train, y_train, X_test, y_test)

Accuracy: 0.41194204149504
              precision    recall  f1-score   support

           0       0.40      0.72      0.52    119701
           1       0.33      0.17      0.22    117113
           2       0.35      0.22      0.27    118583
           3       0.51      0.54      0.52    117286

    accuracy                           0.41    472683
   macro avg       0.40      0.41      0.38    472683
weighted avg       0.40      0.41      0.38    472683


Confusion Matrix:
[[86212 14063 10233  9193]
 [60778 19405 20341 16589]
 [41170 16315 26094 35004]
 [26171  9802 18306 63007]]


In [None]:
# Grid search for the RandomForest model
rf, best_params_rf, accuracy_rf, cl_rep_rf, cm_rf = random_forest_gridsearch(X_train, y_train, X_test, y_test)

In [None]:
# Grid search for the XGBoost model
xgb_class_gs, best_params_xgb_class_gs, accuracy_xgb_class_gs, cl_rep_xgb_class_gs, cm_xgb_class_gs = xgb_class_gridsearch(X_train, y_train, X_test, y_test)

## Deep Learning

In [18]:
X_train, y_train, X_test, y_test = prepross_class()

In [19]:
# Train a fully connected deep learning model
dense1, dense1_history, dense1_loss, dense1_accuracy, dense1_cnf_matrix = deep_learning_dense(X_train, y_train, X_test, y_test)

Epoch 1/100
6250/6250 ━━━━━━━━━━━━━━━━━━━━ 14s 2ms/step - accuracy: 0.3687 - loss: 1.2903 - val_accuracy: 0.3994 - val_loss: 1.2456
Epoch 2/100
6250/6250 ━━━━━━━━━━━━━━━━━━━━ 13s 2ms/step - accuracy: 0.3957 - loss: 1.2459 - val_accuracy: 0.3993 - val_loss: 1.2402
Epoch 3/100
6250/6250 ━━━━━━━━━━━━━━━━━━━━ 12s 2ms/step - accuracy: 0.4012 - loss: 1.2374 - val_accuracy: 0.4053 - val_loss: 1.2340
Epoch 4/100
6250/6250 ━━━━━━━━━━━━━━━━━━━━ 13s 2ms/step - accuracy: 0.4041 - loss: 1.2329 - val_accuracy: 0.4066 - val_loss: 1.2308
Epoch 5/100
6250/6250 ━━━━━━━━━━━━━━━━━━━━ 13s 2ms/step - accuracy: 0.4076 - loss: 1.2279 - val_accuracy: 0.4073 - val_loss: 1.2298
Epoch 6/100
6250/6250 ━━━━━━━━━━━━━━━━━━━━ 13s 2ms/step - accuracy: 0.4101 - loss: 1.2248 - val_accuracy: 0.4075 - val_loss: 1.2284
Epoc
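
Keras exposes the per-epoch metrics through `History.history`, a dictionary mapping each metric name to a list with one value per epoch. As a sketch, the dictionary below is filled by hand with the six validation values visible in the log above (the log is truncated after that point); `best_epoch` is an illustrative helper, not part of the project.

```python
# Hand-copied from the training log above; a real run would read this from
# the History object returned by training (assumed to be what dense1_history holds).
history = {
    "val_accuracy": [0.3994, 0.3993, 0.4053, 0.4066, 0.4073, 0.4075],
    "val_loss": [1.2456, 1.2402, 1.2340, 1.2308, 1.2298, 1.2284],
}

def best_epoch(history, metric="val_loss", mode="min"):
    """Return (1-based epoch number, value) of the best epoch for a metric."""
    values = history[metric]
    pick = min if mode == "min" else max
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]

epoch, value = best_epoch(history)
print(f"lowest val_loss so far: {value} at epoch {epoch}")
```

Over these six epochs the validation loss is still decreasing at every step, so the best epoch is simply the last one shown.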

In [None]:
# Train an improved dense model with dropout and batch normalization
dense2, dense2_history, dense2_loss, dense2_accuracy, dense2_cnf_matrix = deep_learning_improved(X_train, y_train, X_test, y_test)