## 03. Modeling

This section evaluates how rebalancing with **SMOTE** (applied to the training set only) affects the performance of classification models.  

We compare two models, **logistic regression** and **Random Forest**, each trained without and with rebalancing of the training data.

With imbalanced classes, ROC-AUC alone is not always sufficient: we also examine the confusion matrices to study the trade-off between:
- FN: undetected lapses (potentially high cost),
- FP: false alerts (commercial / operational cost).
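
The FN/FP trade-off described above can be made concrete with a small sketch. The helper names `confusion_counts` and `total_cost`, the toy labels, and the unit costs (5 per missed lapse, 1 per false alert) are assumptions chosen for illustration only; they do not come from the dataset. For binary labels, scikit-learn's `confusion_matrix(y_true, y_pred).ravel()` returns the counts in the same order (TN, FP, FN, TP).

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return tn, fp, fn, tp


def total_cost(fn, fp, cost_fn=5.0, cost_fp=1.0):
    """Weighted error cost; the default unit costs are hypothetical."""
    return cost_fn * fn + cost_fp * fp


# Toy labels, for illustration only
toy_true = [0, 0, 0, 1, 1, 0]
toy_pred = [0, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(toy_true, toy_pred)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Total cost:", total_cost(fn, fp))
```

Weighting FN more heavily than FP, as in this sketch, reflects the idea that a missed lapse may cost more than a false alert; the actual ratio would have to come from the business context.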


### 1 - Importing data and packages

In [18]:
from pathlib import Path
import pandas as pd
import numpy as np
import joblib

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, roc_auc_score, confusion_matrix
)

from imblearn.over_sampling import SMOTE

In [19]:
IN_DIR = Path("..") / "data" / "processed"

data = np.load(IN_DIR / "eudirectlapse_num_scaled_split.npz", allow_pickle=True)

X_train_num_scaled = data["X_train"]
X_test_num_scaled = data["X_test"]
y_train = data["y_train"]
y_test = data["y_test"]
num_cols = data["num_cols"].tolist()

scaler = joblib.load(IN_DIR / "scaler_num.joblib")

print(X_train_num_scaled.shape, X_test_num_scaled.shape)

(17295, 9) (5765, 9)


In [20]:
print("Train target rate:", y_train.mean())
print("Test target rate:", y_test.mean())

Train target rate: 0.1281295172015033
Test target rate: 0.1280138768430182


### 2 - Logistic regression
Logistic regression is a simple, interpretable linear model.  
With imbalanced classes, it can favor the majority class, which results in few lapses being detected.

#### a - Baseline model: logistic regression (without SMOTE)

In [21]:
model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train_num_scaled, y_train)



The `max_iter` parameter is set to 1000 (the scikit-learn default is 100) to give the optimization algorithm enough iterations to converge.

In [22]:
y_pred_base = model1.predict(X_test_num_scaled)
y_proba_base = model1.predict_proba(X_test_num_scaled)[:, 1]

`predict()` applies a fixed decision threshold of 0.5 to the predicted probability. With imbalanced classes, this threshold is not necessarily appropriate; adjusting it controls the trade-off between false negatives and false positives.
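
As a minimal sketch of threshold adjustment, the snippet below turns probabilities into labels with a custom cut-off. The helper `predict_with_threshold`, the toy probabilities, and the value 0.15 are assumptions for illustration, not tuned values or part of the notebook.

```python
import numpy as np


def predict_with_threshold(y_proba, threshold=0.5):
    """Label an observation 1 when its predicted probability reaches the threshold.

    Uses >=, so behavior exactly at the threshold may differ from predict().
    """
    return (np.asarray(y_proba) >= threshold).astype(int)


# Toy probabilities of class 1, for illustration only
toy_proba = np.array([0.05, 0.12, 0.20, 0.35, 0.60])

default_pred = predict_with_threshold(toy_proba)               # threshold 0.5
lower_pred = predict_with_threshold(toy_proba, threshold=0.15)  # hypothetical lower threshold

print("threshold 0.50:", default_pred)
print("threshold 0.15:", lower_pred)
```

In the notebook, `y_proba_base` would take the place of `toy_proba`. Lowering the threshold flags more contracts as lapses, which reduces FN and increases FP. A threshold should preferably be chosen on a validation set rather than on the test set.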

In [23]:
roc_auc = roc_auc_score(y_test, y_proba_base)

# Summary table
metrics_summary = pd.DataFrame({
    "Metric": ["ROC-AUC"],
    "Value": [roc_auc]
})

print("=== Performance summary ===")
display(metrics_summary)

print("=== Confusion matrix ===")
display(pd.DataFrame(
    confusion_matrix(y_test, y_pred_base),
    index=["True 0", "True 1"],
    columns=["Pred 0", "Pred 1"]
))

print("=== Classification report ===")
print(classification_report(y_test, y_pred_base, digits=4))


=== Performance summary ===


Unnamed: 0,Metric,Value
0,ROC-AUC,0.585924


=== Confusion matrix ===


Unnamed: 0,Pred 0,Pred 1
True 0,5026,1
True 1,738,0


=== Classification report ===
              precision    recall  f1-score   support

           0     0.8720    0.9998    0.9315      5027
           1     0.0000    0.0000    0.0000       738

    accuracy                         0.8718      5765
   macro avg     0.4360    0.4999    0.4658      5765
weighted avg     0.7603    0.8718    0.8123      5765



With a ROC-AUC of 0.586, only modestly above the 0.5 of a random ranking, the model trained without rebalancing discriminates weakly overall. Precision, recall, F1-score, and the confusion matrix all show that it fails to detect lapses.

The model predicts the majority class almost exclusively: no lapse is correctly identified (TP = 0), and a single observation is wrongly predicted as a lapse (FP = 1). Precision and recall for class 1 are therefore both zero: the model cannot recognize contracts likely to lapse.

With imbalanced classes, the high accuracy (87.2%) is misleading, because it essentially reflects correct predictions on the majority class. These results show why other metrics must be examined when the target variable is imbalanced.
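
The arithmetic behind this point can be checked directly from the confusion matrix reported above. The sketch below uses only those four counts; the "always predict 0" baseline and balanced accuracy are added here as comparison points and are not computed in the notebook itself.

```python
# Counts taken from the confusion matrix of the baseline logistic regression
tn, fp, fn, tp = 5026, 1, 738, 0
n = tn + fp + fn + tp

accuracy = (tn + tp) / n
# Accuracy of a trivial rule that always predicts class 0
majority_baseline = (tn + fp) / n

recall_1 = tp / (tp + fn)        # share of lapses detected
specificity = tn / (tn + fp)     # share of non-lapses correctly kept
balanced_accuracy = (recall_1 + specificity) / 2

print(f"accuracy          : {accuracy:.4f}")
print(f"majority baseline : {majority_baseline:.4f}")
print(f"balanced accuracy : {balanced_accuracy:.4f}")
```

The model's accuracy is marginally below that of the trivial rule, and its balanced accuracy is close to 0.5, which is consistent with a model that almost never predicts class 1.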

#### b - Logistic regression with rebalancing (SMOTE)

In this section, we evaluate the impact of rebalancing the classes with SMOTE on the performance of a logistic regression.  
SMOTE is applied to the training set only, to avoid any information leakage into the test set.
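
To give an intuition of what SMOTE does, the sketch below generates synthetic minority points by interpolating between a minority observation and one of its nearest minority neighbors. This is a simplified illustration written for this note, not the `imblearn` implementation; the function `smote_like_samples`, its parameters, and the toy data are assumptions.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolation between minority neighbors.

    Simplified sketch: computes all pairwise distances, so it only suits small arrays.
    Requires at least two minority observations.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a point is never its own neighbor
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a base point, one of its k neighbors, and a random position between them
    base_idx = rng.integers(0, n, size=n_new)
    nb_idx = neighbors[base_idx, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))
    return X_minority[base_idx] + lam * (X_minority[nb_idx] - X_minority[base_idx])


# Toy minority class: four points in two dimensions
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(toy_minority, n_new=6, k=2)
print(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, the procedure only uses the data it is given. Applying it to the training set alone is what keeps the test set free of synthetic observations.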

In [24]:
smote = SMOTE(random_state=42)

X_train_smote, y_train_smote = smote.fit_resample(
    X_train_num_scaled,
    y_train
)

print("Distribution after SMOTE:")
print(pd.Series(y_train_smote).value_counts(normalize=True))

Distribution after SMOTE:
0    0.5
1    0.5
Name: proportion, dtype: float64


After applying SMOTE, the distribution of the *lapse* variable in the training set is balanced. The synthetic observations are generated from the training data only. We can now fit a logistic regression again on this new dataset.


In [25]:
model_smote = LogisticRegression(max_iter=1000)
model_smote.fit(X_train_smote, y_train_smote)



In [26]:
y_pred_smote = model_smote.predict(X_test_num_scaled)
y_proba_smote = model_smote.predict_proba(X_test_num_scaled)[:, 1]

In [27]:
roc_auc = roc_auc_score(y_test, y_proba_smote)

metrics_smote = pd.DataFrame({
    "Metric": ["ROC-AUC"],
    "Value": [roc_auc]
})

print("=== Logistic Regression + SMOTE — Performance summary ===")
display(metrics_smote)

print("=== Confusion matrix ===")
display(pd.DataFrame(
    confusion_matrix(y_test, y_pred_smote),
    index=["True 0", "True 1"],
    columns=["Pred 0", "Pred 1"]
))

print("=== Classification report ===")
print(classification_report(y_test, y_pred_smote, digits=4))

=== Logistic Regression + SMOTE — Performance summary ===


Unnamed: 0,Metric,Value
0,ROC-AUC,0.586548


=== Confusion matrix ===


Unnamed: 0,Pred 0,Pred 1
True 0,2735,2292
True 1,316,422


=== Classification report ===
              precision    recall  f1-score   support

           0     0.8964    0.5441    0.6771      5027
           1     0.1555    0.5718    0.2445       738

    accuracy                         0.5476      5765
   macro avg     0.5260    0.5579    0.4608      5765
weighted avg     0.8016    0.5476    0.6218      5765



Applying SMOTE changes the behavior of the logistic regression. The ROC-AUC stays at the same level as for the model without SMOTE (0.587 versus 0.586), so the model's ability to rank contracts is essentially unchanged. The usual metrics on the minority class, however, show improved detection of lapses.

The confusion matrix shows that the rebalanced model correctly identifies 422 lapses (TP), versus none for the model without SMOTE. Recall on the minority class thus reaches 57.2%, a large reduction in false negatives (316 versus 738). This improvement comes with a sharp increase in false positives (2,292 versus 1).

With imbalanced classes, the drop in overall accuracy (54.8%) is expected rather than alarming, because it reflects the shift of predictions toward the minority class. These results confirm that SMOTE makes the model sensitive to the class of interest, at the price of a trade-off between detecting lapses and raising more false alerts.
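
The class 1 metrics in the report above follow directly from the confusion matrix. The sketch below recomputes them from the four reported counts; the "alerts per detected lapse" ratio is an extra reading added here for illustration and does not appear in the notebook output.

```python
# Counts taken from the confusion matrix of the logistic regression + SMOTE
tn, fp, fn, tp = 2735, 2292, 316, 422

precision_1 = tp / (tp + fp)   # share of alerts that are real lapses
recall_1 = tp / (tp + fn)      # share of lapses that raise an alert
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Number of alerts raised for each lapse actually caught
alerts_per_detected_lapse = (tp + fp) / tp

print(f"precision (class 1): {precision_1:.4f}")
print(f"recall    (class 1): {recall_1:.4f}")
print(f"F1        (class 1): {f1_1:.4f}")
print(f"alerts per detected lapse: {alerts_per_detected_lapse:.2f}")
```

Roughly six to seven alerts are raised for each lapse caught, which quantifies the false-alert side of the trade-off.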

### 3 - Random Forest 
In this section, we evaluate the performance of a Random Forest model, first without class rebalancing, then with SMOTE applied to the training set.

Random Forest is a non-linear model built on decision trees. It is often more flexible, but it can also be affected by class imbalance.  

#### a - Baseline model: Random Forest without SMOTE



In [28]:
rf_base = RandomForestClassifier(
    n_estimators=500,
    random_state=42,
    n_jobs=-1
)

rf_base.fit(X_train_num_scaled, y_train)



In [29]:
y_pred_rf_base = rf_base.predict(X_test_num_scaled)
y_proba_rf_base = rf_base.predict_proba(X_test_num_scaled)[:, 1]

In [30]:
roc_auc = roc_auc_score(y_test, y_proba_rf_base)

metrics_rf_base = pd.DataFrame({
    "Metric": ["ROC-AUC"],
    "Value": [roc_auc]
})

print("=== Random Forest — Performance summary ===")
display(metrics_rf_base)

print("=== Confusion matrix ===")
display(pd.DataFrame(
    confusion_matrix(y_test, y_pred_rf_base),
    index=["True 0", "True 1"],
    columns=["Pred 0", "Pred 1"]
))

print("=== Classification report ===")
print(classification_report(y_test, y_pred_rf_base, digits=4))


=== Random Forest — Performance summary ===


Unnamed: 0,Metric,Value
0,ROC-AUC,0.548891


=== Confusion matrix ===


Unnamed: 0,Pred 0,Pred 1
True 0,5015,12
True 1,736,2


=== Classification report ===
              precision    recall  f1-score   support

           0     0.8720    0.9976    0.9306      5027
           1     0.1429    0.0027    0.0053       738

    accuracy                         0.8703      5765
   macro avg     0.5074    0.5002    0.4680      5765
weighted avg     0.7787    0.8703    0.8122      5765



Like the logistic regression, the Random Forest trained without rebalancing favors the majority class and struggles to detect lapses: only 2 of the 738 lapses are identified.

#### b - Random Forest with SMOTE

We apply SMOTE to the training data (standardized numerical variables), then train the Random Forest on the rebalanced dataset. 

In [31]:
smote = SMOTE(random_state=42)
X_train_rf_smote, y_train_rf_smote = smote.fit_resample(X_train_num_scaled, y_train)

rf_smote = RandomForestClassifier(
    n_estimators=500,
    random_state=42,
    n_jobs=-1
)

rf_smote.fit(X_train_rf_smote, y_train_rf_smote)

y_pred_rf_smote = rf_smote.predict(X_test_num_scaled)
y_proba_rf_smote = rf_smote.predict_proba(X_test_num_scaled)[:, 1]

In [32]:
roc_auc = roc_auc_score(y_test, y_proba_rf_smote)

metrics_rf_smote = pd.DataFrame({
    "Metric": ["ROC-AUC"],
    "Value": [roc_auc]
})

print("=== Random Forest + SMOTE — Performance summary ===")
display(metrics_rf_smote)

print("=== Confusion matrix ===")
display(pd.DataFrame(
    confusion_matrix(y_test, y_pred_rf_smote),
    index=["True 0", "True 1"],
    columns=["Pred 0", "Pred 1"]
))

print("=== Classification report ===")
print(classification_report(y_test, y_pred_rf_smote, digits=4))

=== Random Forest + SMOTE — Performance summary ===


Unnamed: 0,Metric,Value
0,ROC-AUC,0.545322


=== Confusion matrix ===


Unnamed: 0,Pred 0,Pred 1
True 0,4895,132
True 1,705,33


=== Classification report ===
              precision    recall  f1-score   support

           0     0.8741    0.9737    0.9212      5027
           1     0.2000    0.0447    0.0731       738

    accuracy                         0.8548      5765
   macro avg     0.5371    0.5092    0.4972      5765
weighted avg     0.7878    0.8548    0.8127      5765



Applying SMOTE to the training set improves the Random Forest's ability to detect the minority class, as shown by the increase in correctly identified lapses (TP: 33 versus 2). However, the gain is much smaller than for the logistic regression (recall of 4.5% versus 57.2%), it comes with more false positives (132 versus 12), and the ROC-AUC does not improve (0.545 versus 0.549).

These results suggest that, on this dataset, Random Forest benefits less from SMOTE rebalancing than logistic regression does.
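
The comparison can be summarized by recomputing class 1 recall for the four models from the confusion matrices reported in this notebook. The dictionary layout and the helper `recall_class_1` are assumptions made for this sketch; only the counts come from the outputs above.

```python
# (tn, fp, fn, tp) for each model, copied from the confusion matrices above
confusion_by_model = {
    "LogReg": (5026, 1, 738, 0),
    "LogReg + SMOTE": (2735, 2292, 316, 422),
    "RF": (5015, 12, 736, 2),
    "RF + SMOTE": (4895, 132, 705, 33),
}


def recall_class_1(counts):
    """Share of actual lapses that the model flags."""
    tn, fp, fn, tp = counts
    return tp / (tp + fn)


recalls = {name: recall_class_1(c) for name, c in confusion_by_model.items()}

# Recall gain attributable to SMOTE for each model family
gain_logreg = recalls["LogReg + SMOTE"] - recalls["LogReg"]
gain_rf = recalls["RF + SMOTE"] - recalls["RF"]

for name, value in recalls.items():
    print(f"{name:<15} recall(1) = {value:.4f}")
print(f"SMOTE gain, LogReg: {gain_logreg:.4f}")
print(f"SMOTE gain, RF    : {gain_rf:.4f}")
```

These figures are all computed at the default 0.5 threshold, so they describe the operating point of `predict()` rather than the ranking quality measured by ROC-AUC.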


All the results and analyses presented in this notebook will be summarized and discussed in depth in the final report, which will put the methodological choices, the observed limitations, and possible improvements into perspective.