# PREPROCESSING & FEATURE ENGINEERING - CHURN PREDICTION

### **Objective**: Prepare the data for modeling

### **Steps**:
#### 1. Data cleaning
#### 2. Encoding of categorical variables
#### 3. Feature engineering (creating new variables)
#### 4. Handling class imbalance
#### 5. Normalization/standardization
#### 6. Train/validation split

## Importing libraries

In [1]:
# ## 1. IMPORTS & LOADING

# Data manipulation
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Class imbalance handling
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utils
import warnings
warnings.filterwarnings('ignore')

In [2]:
import sys
sys.path.append('..')
from src.data_preprocessing import (
    create_cleaning_pipeline,
    create_feature_engineering_pipeline,
    create_encoding_pipeline,
    create_scaling_pipeline
)

In [3]:
## Class imbalance handling
# Note: imbalanced-learn is temporarily disabled because of compatibility problems
# Manual alternative techniques are used instead
IMBLEARN_AVAILABLE = False
print(" Using manual rebalancing techniques")
print(" (Alternative to imbalanced-learn, to avoid version conflicts)")

 Using manual rebalancing techniques
 (Alternative to imbalanced-learn, to avoid version conflicts)
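
The notebook does not show which manual rebalancing technique is meant. One common option is random oversampling of the minority class. The sketch below is a minimal illustration under that assumption; the helper name `random_oversample` and the toy data are hypothetical and do not come from the project code.

```python
import numpy as np
import pandas as pd


def random_oversample(X, y, random_state=42):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = np.random.default_rng(random_state)
    counts = y.value_counts()
    target = counts.max()
    parts = [np.arange(len(y))]  # keep every original row (positional indices)
    for label, count in counts.items():
        if count < target:
            positions = np.flatnonzero(y.to_numpy() == label)
            # Sample with replacement the number of rows this class is missing
            parts.append(rng.choice(positions, size=target - count, replace=True))
    idx = np.concatenate(parts)
    rng.shuffle(idx)
    return X.iloc[idx].reset_index(drop=True), y.iloc[idx].reset_index(drop=True)


# Toy data (hypothetical): 8 customers who stayed, 2 who churned
X_toy = pd.DataFrame({"Age": [30, 41, 25, 52, 38, 47, 29, 33, 61, 58]})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Exited")

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(y_bal.value_counts().to_dict())
```

Resampling of this kind should only be applied to the training split, after the train/validation split, so that duplicated rows never leak into validation.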


## Data loading

In [4]:

# Configuration
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("Libraries imported successfully!")


Libraries imported successfully!


In [5]:
# Load the data with robust path handling
from pathlib import Path

In [6]:
# Detect the base path
current_path = Path.cwd()
if current_path.name == 'notebooks':
    base_path = current_path.parent
else:
    base_path = current_path

data_path = base_path / 'data' / 'raw'

print(f" Base path: {base_path}")
print(f" Data folder: {data_path}")

# Load the data
train_df = pd.read_csv(data_path / 'train.csv')
test_df = pd.read_csv(data_path / 'test.csv')

print(f" Train set: {train_df.shape}")
print(f" Test set:  {test_df.shape}")


 Base path: /Users/Apple/Desktop/Projets/machine_learning/Projet-Machine-Learning-No2
 Data folder: /Users/Apple/Desktop/Projets/machine_learning/Projet-Machine-Learning-No2/data/raw
 Train set: (165034, 14)
 Test set:  (110023, 13)


In [7]:
# Data preview
print("\n Preview of the training data:")
display(train_df.head())

print("\n Column information:")
# info() prints its report and returns None, so it is not wrapped in display()
train_df.info()


 Preview of the training data:


Unnamed: 0,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,15674932,Okwudilichukwu,668,France,Male,33.0,3,0.0,2,1.0,0.0,181449.97,0
1,1,15749177,Okwudiliolisa,627,France,Male,33.0,1,0.0,2,1.0,1.0,49503.5,0
2,2,15694510,Hsueh,678,France,Male,40.0,10,0.0,2,1.0,0.0,184866.69,0
3,3,15741417,Kao,581,France,Male,34.0,2,148882.54,1,1.0,1.0,84560.88,0
4,4,15766172,Chiemenam,716,Spain,Male,33.0,5,0.0,2,1.0,1.0,15068.83,0



 Column information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165034 entries, 0 to 165033
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   id               165034 non-null  int64  
 1   CustomerId       165034 non-null  int64  
 2   Surname          165034 non-null  object 
 3   CreditScore      165034 non-null  int64  
 4   Geography        165034 non-null  object 
 5   Gender           165034 non-null  object 
 6   Age              165034 non-null  float64
 7   Tenure           165034 non-null  int64  
 8   Balance          165034 non-null  float64
 9   NumOfProducts    165034 non-null  int64  
 10  HasCrCard        165034 non-null  float64
 11  IsActiveMember   165034 non-null  float64
 12  EstimatedSalary  165034 non-null  float64
 13  Exited           165034 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 17.6+ MB

## Data processing

In [8]:
# Train/validation split (stratified on the target to preserve the churn rate)
train_df, val_df = train_test_split(
    train_df, 
    test_size=0.2, 
    random_state=42,
    stratify=train_df['Exited']
)

In [9]:
train_df.shape, val_df.shape, test_df.shape

((132027, 14), (33007, 14), (110023, 13))
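
Stratifying on `Exited` keeps the churn rate nearly identical in both splits. From the class counts printed later in this notebook, that is 27937 / 132027 and 6984 / 33007, about 21.2% in each. The short check below illustrates the effect on synthetic data; the toy frame and its 20% churn rate are assumptions made only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training frame (hypothetical values): 20% churners
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "CreditScore": rng.integers(350, 850, size=1000),
    "Exited": np.r_[np.ones(200, dtype=int), np.zeros(800, dtype=int)],
})

train_part, val_part = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Exited"]
)

# With stratification, both splits should show (almost) the same churn rate
train_rate = train_part["Exited"].mean()
val_rate = val_part["Exited"].mean()
print(f"train churn rate: {train_rate:.3f}, validation churn rate: {val_rate:.3f}")
```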

### Cleaning the data

In [10]:
pipeline_cleaning = create_cleaning_pipeline()
train_clean = pipeline_cleaning.fit_transform(train_df)
val_clean = pipeline_cleaning.transform(val_df)
test_clean = pipeline_cleaning.transform(test_df)

In [11]:
test_clean.shape, test_df.shape

((110023, 10), (110023, 13))
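
The implementation of `create_cleaning_pipeline` lives in `src.data_preprocessing` and is not shown here. The test set going from 13 to 10 columns would be consistent with dropping the three identifier columns (`id`, `CustomerId`, `Surname`), but that is an assumption. The sketch below shows how such a step could be written as a scikit-learn transformer; `ColumnDropper` and `make_cleaning_pipeline_sketch` are hypothetical names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColumnDropper(TransformerMixin, BaseEstimator):
    """Hypothetical cleaning step: drop identifier columns that carry no signal."""

    def __init__(self, columns=("id", "CustomerId", "Surname")):
        self.columns = columns

    def fit(self, X, y=None):
        # Remember which of the requested columns exist in the training data
        self.dropped_ = [c for c in self.columns if c in X.columns]
        return self

    def transform(self, X):
        # errors="ignore" keeps transform safe if a column is already absent
        return X.drop(columns=self.dropped_, errors="ignore")


def make_cleaning_pipeline_sketch():
    """Assumed stand-in for create_cleaning_pipeline(); not the project's code."""
    return Pipeline([("drop_ids", ColumnDropper())])


# One row shaped like the test set (13 columns, no target)
toy_row = pd.DataFrame([{
    "id": 0, "CustomerId": 15674932, "Surname": "Okwudilichukwu",
    "CreditScore": 668, "Geography": "France", "Gender": "Male", "Age": 33.0,
    "Tenure": 3, "Balance": 0.0, "NumOfProducts": 2, "HasCrCard": 1.0,
    "IsActiveMember": 0.0, "EstimatedSalary": 181449.97,
}])

cleaned = make_cleaning_pipeline_sketch().fit_transform(toy_row)
print(toy_row.shape, "->", cleaned.shape)
```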

In [12]:
# Export the cleaning pipeline
import joblib
joblib.dump(pipeline_cleaning, '../models/pipeline_cleaning.pkl')
print("\n Cleaning pipeline saved to models/")


 Cleaning pipeline saved to models/


In [13]:
# Keep the column names for reference
if 'Exited' in train_clean.columns:
    feature_columns = [col for col in train_clean.columns if col != 'Exited']
    target_column = 'Exited'
    print(f"\n Target variable: {target_column}")
    print(f" {len(feature_columns)} explanatory variables identified")


 Target variable: Exited
 10 explanatory variables identified


### Feature engineering

In [14]:
pipeline_features = create_feature_engineering_pipeline()
train_engineered = pipeline_features.fit_transform(train_clean)
val_engineered = pipeline_features.transform(val_clean)
test_engineered = pipeline_features.transform(test_clean)

In [15]:
test_engineered.shape, test_clean.shape

((110023, 19), (110023, 10))

In [16]:
# Export the feature engineering pipeline
joblib.dump(pipeline_features, '../models/pipeline_features.pkl')
print("\n Feature engineering pipeline saved to models/")


 Feature engineering pipeline saved to models/


In [17]:
# Inspect a few of the new features
new_features = ['BalanceSalaryRatio', 'IsZeroBalance', 'HasMultipleProducts', 'EngagementScore']

print("\n Preview of the new features:\n")
display(train_engineered[new_features].head(10))

print("\n Statistics of the new features:\n")
display(train_engineered[new_features].describe())


 Preview of the new features:



Unnamed: 0,BalanceSalaryRatio,IsZeroBalance,HasMultipleProducts,EngagementScore
112149,0.0,1,1,2.5
70095,0.869893,0,1,1.5
29247,0.0,1,0,0.25
161355,0.0,1,1,1.5
105992,0.912064,0,0,2.25
76772,2.83249,0,1,2.5
12234,0.0,1,1,1.5
83824,0.658205,0,1,1.5
143071,0.0,1,1,2.5
105103,3.380781,0,0,2.25



 Statistics of the new features:



Unnamed: 0,BalanceSalaryRatio,IsZeroBalance,HasMultipleProducts,EngagementScore
count,132027.0,132027.0,132027.0,132027.0
mean,1.99108,0.543245,0.530505,1.640187
std,79.949185,0.498128,0.49907,0.670583
min,0.0,0.0,0.0,0.25
25%,0.0,0.0,0.0,1.25
50%,0.0,1.0,1.0,1.5
75%,0.980423,1.0,1.0,2.25
max,10961.156598,1.0,1.0,3.0
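
The formulas behind these features are defined in `src.data_preprocessing` and are not shown in this notebook. Going by the names alone, three of them could plausibly be computed as in the sketch below. These definitions are assumptions, not the project's code. `EngagementScore` is left out because its formula cannot be inferred from the output. The toy rows reuse the first five training rows displayed earlier.

```python
import pandas as pd


def add_ratio_and_flag_features(df):
    """Hypothetical re-creation of three engineered features, inferred from their names."""
    out = df.copy()
    # 1 when the customer holds no balance at all
    out["IsZeroBalance"] = (out["Balance"] == 0).astype(int)
    # 1 when the customer holds more than one product
    out["HasMultipleProducts"] = (out["NumOfProducts"] > 1).astype(int)
    # Balance relative to salary; very small salaries produce very large ratios
    out["BalanceSalaryRatio"] = out["Balance"] / out["EstimatedSalary"]
    return out


toy = pd.DataFrame({
    "Balance": [0.0, 0.0, 0.0, 148882.54, 0.0],
    "NumOfProducts": [2, 2, 2, 1, 2],
    "EstimatedSalary": [181449.97, 49503.5, 184866.69, 84560.88, 15068.83],
})

features = add_ratio_and_flag_features(toy)
print(features[["BalanceSalaryRatio", "IsZeroBalance", "HasMultipleProducts"]])
```

A plain ratio like this is unbounded when the denominator is small, which would be consistent with the large maximum and standard deviation reported for `BalanceSalaryRatio` in the statistics above.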


In [18]:
# Analyze the impact of the new features on churn
from scipy import stats

if 'Exited' in train_engineered.columns:
    print("\n IMPACT OF THE NEW FEATURES ON CHURN\n")
    
    for feature in new_features:
        if feature in train_engineered.columns:
            print(f"\n {feature}:")
            
            # Mean per group
            churn_mean = train_engineered.groupby('Exited')[feature].mean()
            print(f"   Mean stayed (0): {churn_mean[0]:.4f}")
            print(f"   Mean churned (1): {churn_mean[1]:.4f}")
            
            # Statistical test: two-sample t-test (equal variances assumed by default)
            group_0 = train_engineered[train_engineered['Exited'] == 0][feature].dropna()
            group_1 = train_engineered[train_engineered['Exited'] == 1][feature].dropna()
            t_stat, p_value = stats.ttest_ind(group_0, group_1)
            
            print(f"   P-value: {p_value:.6f}", end="")
            if p_value < 0.001:
                print(" *** (Highly significant)")
            elif p_value < 0.01:
                print(" ** (Significant)")
            elif p_value < 0.05:
                print(" * (Marginally significant)")
            else:
                print(" (Not significant)")


 IMPACT OF THE NEW FEATURES ON CHURN


 BalanceSalaryRatio:
   Mean stayed (0): 2.0729
   Mean churned (1): 1.6861
   P-value: 0.472736 (Not significant)

 IsZeroBalance:
   Mean stayed (0): 0.5777
   Mean churned (1): 0.4147
   P-value: 0.000000 *** (Highly significant)

 HasMultipleProducts:
   Mean stayed (0): 0.6112
   Mean churned (1): 0.2300
   P-value: 0.000000 *** (Highly significant)

 EngagementScore:
   Mean stayed (0): 1.7147
   Mean churned (1): 1.3625
   P-value: 0.000000 *** (Highly significant)
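
`stats.ttest_ind` compares means and assumes equal variances by default. For a heavy-tailed feature such as `BalanceSalaryRatio` (standard deviation near 80 for a mean near 2), a few extreme values can dominate the result. Welch's t-test and the rank-based Mann-Whitney U test are two common cross-checks. The sketch below runs all three on synthetic log-normal samples; the data and its parameters are invented for illustration and say nothing about the real feature.

```python
import numpy as np
from scipy import stats

# Synthetic heavy-tailed samples (hypothetical), standing in for the two churn groups
rng = np.random.default_rng(0)
stayed = rng.lognormal(mean=0.0, sigma=2.0, size=5000)
churned = rng.lognormal(mean=0.3, sigma=2.0, size=1500)

# Student's t-test: equal variances assumed (what the cell above uses)
_, p_student = stats.ttest_ind(stayed, churned)
# Welch's t-test: no equal-variance assumption
_, p_welch = stats.ttest_ind(stayed, churned, equal_var=False)
# Mann-Whitney U: compares ranks, so extreme values have limited influence
_, p_mwu = stats.mannwhitneyu(stayed, churned, alternative="two-sided")

print(f"Student: {p_student:.4g}  Welch: {p_welch:.4g}  Mann-Whitney: {p_mwu:.4g}")
```

With more than 100,000 rows, even tiny differences reach significance, so the group means printed above remain the more informative part of the output.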


### Encoding the variables

In [19]:
# PIPELINE 3: Encoding
pipeline_encoding = create_encoding_pipeline()
train_encoded = pipeline_encoding.fit_transform(train_engineered)
val_encoded = pipeline_encoding.transform(val_engineered)
test_encoded = pipeline_encoding.transform(test_engineered)

In [20]:
test_encoded.shape, test_engineered.shape

((110023, 28), (110023, 19))
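
The column names produced further down (`Geography_Germany`, `Geography_Spain`, `Geography_UNKNOWN`, with no `Geography_France`) suggest one-hot encoding with a dropped reference level and an explicit bucket for unseen categories. The encoder itself is not shown, so the sketch below is only one way to obtain that layout with pandas; `encode_geography` and `GEOGRAPHY_LEVELS` are hypothetical names.

```python
import pandas as pd

# Assumed category list: the first level acts as the dropped reference
GEOGRAPHY_LEVELS = ["France", "Germany", "Spain", "UNKNOWN"]


def encode_geography(df, levels=GEOGRAPHY_LEVELS):
    """One-hot encode Geography with a fixed level list and an UNKNOWN bucket."""
    out = df.copy()
    # Any value outside the known levels (including NaN) is mapped to UNKNOWN
    geo = out["Geography"].where(out["Geography"].isin(levels), "UNKNOWN")
    geo = pd.Series(pd.Categorical(geo, categories=levels), index=out.index)
    # A fixed category list yields the same columns on train, validation and test
    dummies = pd.get_dummies(geo, prefix="Geography", drop_first=True, dtype=int)
    return pd.concat([out.drop(columns="Geography"), dummies], axis=1)


toy = pd.DataFrame({
    "Age": [33.0, 40.0, 34.0, 51.0],
    "Geography": ["France", "Spain", "Germany", "Italy"],  # "Italy" is unseen
})

encoded = encode_geography(toy)
print(encoded)
```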

In [21]:
# Export the encoding pipeline
joblib.dump(pipeline_encoding, '../models/pipeline_encoding.pkl')
print("\n Encoding pipeline saved to models/")


 Encoding pipeline saved to models/


### STANDARDIZATION

In [22]:
X_val = val_encoded.drop(columns=['Exited'], errors='ignore')
y_val = val_encoded['Exited']

X_train = train_encoded.drop(columns=['Exited'], errors='ignore')
y = train_encoded['Exited']

X_test = test_encoded.drop(columns=['Exited'], errors='ignore')

In [23]:
# PIPELINE 4: Scaling (fitted on train only, then applied to validation and test)
pipeline_scaling = create_scaling_pipeline(method='standard')
X_train_scaled = pipeline_scaling.fit_transform(X_train)
X_val_scaled = pipeline_scaling.transform(X_val)
X_test_scaled = pipeline_scaling.transform(X_test)
print("\n Scaling applied to train, validation and test")


 Scaling applied to train, validation and test


In [24]:
X_test_scaled.shape, X_test.shape

((110023, 28), (110023, 28))

In [25]:
# Export the scaling pipeline
joblib.dump(pipeline_scaling, '../models/pipeline_scaling.pkl')
print("\n Scaling pipeline saved to models/")


 Scaling pipeline saved to models/


In [26]:
# Preview of the encoded and scaled data
print("\n Preview of the encoded and scaled data:")
display(X_train_scaled.head())

print(f"\n Total number of features: {X_train_scaled.shape[1]}")


 Preview of the encoded and scaled data:


Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,BalanceSalaryRatio,IsZeroBalance,HasMultipleProducts,EngagementScore,Age_Balance_Interaction,TenureAgeRatio,Geography_Germany,Geography_Spain,Geography_UNKNOWN,AgeGroup_Adult,AgeGroup_Middle,AgeGroup_Senior,AgeGroup_UNKNOWN,TenureGroup_Regular,TenureGroup_Loyal,TenureGroup_UNKNOWN,CreditScoreGroup_Good,CreditScoreGroup_Excellent,CreditScoreGroup_UNKNOWN
112149,0.867996,-1.13772,-0.352117,-1.4324,-0.883215,0.815324,0.569983,1.006101,0.576076,-0.024904,0.916946,0.940742,1.282192,-0.842787,-1.292426,-0.514619,-0.531754,0.0,0.984062,-0.539716,-0.316989,0.0,-0.67429,-0.897784,0.0,-0.884025,1.52497,0.0
70095,-2.191754,-1.13772,-0.126383,1.063646,1.542353,0.815324,0.569983,-0.993936,1.244268,-0.014024,-1.090577,0.940742,-0.209054,1.366817,0.913114,-0.514619,-0.531754,0.0,0.984062,-0.539716,-0.316989,0.0,-0.67429,1.113854,0.0,-0.884025,-0.655751,0.0
29247,-0.917899,-1.13772,-0.352117,-0.006088,-0.883215,-1.011828,-1.754438,-0.993936,-0.196878,-0.024904,0.916946,-1.06299,-2.073111,-0.842787,0.050866,-0.514619,-0.531754,0.0,0.984062,-0.539716,-0.316989,0.0,1.483042,-0.897784,0.0,-0.884025,-0.655751,0.0
161355,-0.156083,-1.13772,-0.690718,0.707068,-0.883215,0.815324,0.569983,-0.993936,-0.686173,-0.024904,0.916946,0.940742,-0.209054,-0.842787,0.942896,-0.514619,-0.531754,0.0,0.984062,-0.539716,-0.316989,0.0,-0.67429,1.113854,0.0,1.13119,-0.655751,0.0
105992,0.655687,0.878951,-1.029319,-0.006088,1.164031,-1.011828,0.569983,1.006101,0.565633,-0.013496,-1.090577,-1.06299,0.909381,0.618943,0.398269,-0.514619,-0.531754,0.0,-1.016196,-0.539716,-0.316989,0.0,1.483042,-0.897784,0.0,-0.884025,1.52497,0.0



 Total number of features: 28


In [27]:
# Choose the dataset to use. No resampling (SMOTE or other) is applied in this run:
# the scaled training set keeps its original class distribution.
#X_train, y_train = balanced_datasets['original']
X_train, y_train = X_train_scaled, y

print("\n Selected dataset: original (no resampling)")
print(f"   Shape: {X_train.shape}")
print(f"   Distribution: {pd.Series(y_train).value_counts().to_dict()}")

print("="*60)
print("TRAIN / VALIDATION SPLIT")
print("="*60)
print(f"\n Train set: {X_train.shape}")
print(f" Validation set: {X_val.shape}")

print("\n Train distribution:")
print(pd.Series(y_train).value_counts())

print("\n Validation distribution:")
print(pd.Series(y_val).value_counts())


 Selected dataset: original (no resampling)
   Shape: (132027, 28)
   Distribution: {0: 104090, 1: 27937}
TRAIN / VALIDATION SPLIT

 Train set: (132027, 28)
 Validation set: (33007, 28)

 Train distribution:
Exited
0    104090
1     27937
Name: count, dtype: int64

 Validation distribution:
Exited
0    26023
1     6984
Name: count, dtype: int64
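
The training set stays imbalanced (104090 versus 27937), so one option that needs no resampling is to pass class weights to the model. The sketch below computes scikit-learn's "balanced" weights from the counts printed above. Whether the modeling notebook uses class weights is not stated here, so treat this as one possible approach.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the training distribution printed in this notebook
y_counts = {0: 104090, 1: 27937}

# Rebuild a label vector with the same class counts (row order does not matter here)
y_labels = np.repeat(list(y_counts.keys()), list(y_counts.values()))

# "balanced" gives each class the weight n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_labels)
class_weight = dict(zip([0, 1], weights))

print(class_weight)
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter, or the string `"balanced"` directly.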


### SAVING THE PREPROCESSED DATA

In [28]:
# Save the preprocessed data
import joblib

print("="*60)
print("SAVING THE DATA")
print("="*60)

SAVING THE DATA


In [29]:
# Save the datasets
X_train_scaled.to_csv('../data/processed/X_train.csv', index=False)
X_val_scaled.to_csv('../data/processed/X_val.csv', index=False)
X_test_scaled.to_csv('../data/processed/X_test.csv', index=False)

pd.Series(y_train).to_csv('../data/processed/y_train.csv', index=False, header=['Exited'])
pd.Series(y_val).to_csv('../data/processed/y_val.csv', index=False, header=['Exited'])

print("Datasets saved to data/processed/")

Datasets saved to data/processed/


### PREPROCESSING SUMMARY

In [30]:
print("="*80)
print("PREPROCESSING SUMMARY")
print("="*80)

print(f"""
FINAL DIMENSIONS:
   Training features:        {X_train_scaled.shape}
   Validation features:      {X_val_scaled.shape}
   Test features:            {X_test_scaled.shape}

FEATURES CREATED:
   Original features:        {len([col for col in train_clean.columns if col != 'Exited'])}
   Engineered features:      {train_engineered.shape[1] - train_clean.shape[1]}


SCALING:
   Method:                  StandardScaler
   Applied to:              Train, Validation, Test

SAVED FILES:
   data/processed/X_train.csv
   data/processed/X_val.csv
   data/processed/X_test.csv
   data/processed/y_train.csv
   data/processed/y_val.csv
   models/pipeline_cleaning.pkl
   models/pipeline_features.pkl
   models/pipeline_encoding.pkl
   models/pipeline_scaling.pkl

NEXT STEP:
   → Modeling (churn_03_modelisation.ipynb)
""")

print("="*80)
print("PREPROCESSING COMPLETED SUCCESSFULLY!")
print("="*80)

PREPROCESSING SUMMARY

FINAL DIMENSIONS:
   Training features:        (132027, 28)
   Validation features:      (33007, 28)
   Test features:            (110023, 28)

FEATURES CREATED:
   Original features:        10
   Engineered features:      9


SCALING:
   Method:                  StandardScaler
   Applied to:              Train, Validation, Test

SAVED FILES:
   data/processed/X_train.csv
   data/processed/X_val.csv
   data/processed/X_test.csv
   data/processed/y_train.csv
   data/processed/y_val.csv
   models/pipeline_cleaning.pkl
   models/pipeline_features.pkl
   models/pipeline_encoding.pkl
   models/pipeline_scaling.pkl

NEXT STEP:
   → Modeling (churn_03_modelisation.ipynb)

PREPROCESSING COMPLETED SUCCESSFULLY!
