# Attrition Rate Analysis - HumanForYou

## Artificial Intelligence Project

**Context**: The pharmaceutical company HumanForYou (based in India, ~4,000 employees) has an annual turnover rate of about 15%. Management wants to identify the factors driving this rate and to propose ways to improve employee retention.

**Objectives**:
1. Explore and analyze the employee data
2. Identify the key attrition factors
3. Build predictive models
4. Evaluate and compare model performance
5. Propose recommendations

---

## 1. Configuration and Imports

In [None]:
# Silence all warnings to keep the notebook output readable
# (this also hides deprecation notices)
import warnings
warnings.filterwarnings('ignore')

# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys

# Display configuration
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

# Add the parent of the working directory to sys.path so the `src` package can be imported
project_root = os.path.dirname(os.path.abspath('.'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

print("✅ Base configuration complete")

In [None]:
# Import the project modules
try:
    from src.data_loader import load_all_data, merge_datasets, display_dataset_summary
    from src.data_preprocessing import preprocess_pipeline, handle_missing_values, encode_target_variable
    from src.feature_engineering import feature_engineering_pipeline, create_derived_features
    from src.models import train_and_evaluate_all_models, get_feature_importance
    from src.visualization import (
        plot_target_distribution, plot_numeric_distributions, 
        plot_correlation_matrix, plot_correlation_with_target,
        plot_confusion_matrix, plot_roc_curves, 
        plot_feature_importance, plot_model_comparison
    )
    print("✅ Project modules imported successfully")
except ImportError as e:
    print(f"⚠️ Import error: {e}")
    print("Functions will be defined locally if needed")

In [None]:
# ML imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report, roc_curve, auc
)

# XGBoost (optional)
try:
    from xgboost import XGBClassifier
    XGBOOST_AVAILABLE = True
    print("✅ XGBoost available")
except ImportError:
    XGBOOST_AVAILABLE = False
    print("⚠️ XGBoost not available (install with: pip install xgboost)")

# SHAP for interpretability (optional)
try:
    import shap
    SHAP_AVAILABLE = True
    print("✅ SHAP available")
except ImportError:
    SHAP_AVAILABLE = False
    print("⚠️ SHAP not available (install with: pip install shap)")

---
## 2. Loading the Data

In [None]:
# Path to the data directory
DATA_PATH = '../data'

# List the available files
print("📁 Available data files:")
for f in os.listdir(DATA_PATH):
    filepath = os.path.join(DATA_PATH, f)
    if os.path.isfile(filepath):
        size = os.path.getsize(filepath)
        print(f"  • {f} ({size:,} bytes)")

In [None]:
# Load whichever data files are available
employee_survey = None
manager_survey = None
general_data = None

# Employee Survey Data
try:
    employee_survey = pd.read_csv(os.path.join(DATA_PATH, 'employee_survey_data.csv'), na_values=['NA', 'na', 'N/A', ''])
    print(f"✅ employee_survey_data.csv loaded: {employee_survey.shape}")
except FileNotFoundError:
    print("⚠️ employee_survey_data.csv not found")

# Manager Survey Data
try:
    manager_survey = pd.read_csv(os.path.join(DATA_PATH, 'manager_survey_data.csv'))
    print(f"✅ manager_survey_data.csv loaded: {manager_survey.shape}")
except FileNotFoundError:
    print("⚠️ manager_survey_data.csv not found")

# General Data (optional: the file is too large and must be added manually)
try:
    general_data = pd.read_csv(os.path.join(DATA_PATH, 'general_data.csv'))
    print(f"✅ general_data.csv loaded: {general_data.shape}")
    FULL_DATASET = True
except FileNotFoundError:
    print("⚠️ general_data.csv not found (add the file manually)")
    FULL_DATASET = False
    print("   ℹ️ The notebook will run with the partial data that is available")

In [None]:
# Preview of the loaded data
if employee_survey is not None:
    print("\n📊 Employee Survey Data:")
    display(employee_survey.head())
    print(f"\nColumns: {list(employee_survey.columns)}")
    print(f"Missing values:\n{employee_survey.isnull().sum()}")

In [None]:
if manager_survey is not None:
    print("\n📊 Manager Survey Data:")
    display(manager_survey.head())
    print(f"\nColumns: {list(manager_survey.columns)}")

In [None]:
if general_data is not None:
    print("\n📊 General Data:")
    display(general_data.head())
    print(f"\nColumns: {list(general_data.columns)}")
    print(f"\nTarget variable (Attrition):")
    print(general_data['Attrition'].value_counts())

### Merging the Datasets

In [None]:
# Merge the available data
if FULL_DATASET and general_data is not None:
    # Standardize the ID column name
    if 'EmployeeId' in general_data.columns:
        general_data.rename(columns={'EmployeeId': 'EmployeeID'}, inplace=True)
    
    df = general_data.copy()
    
    if employee_survey is not None:
        df = df.merge(employee_survey, on='EmployeeID', how='left')
        print("✅ Merged with employee_survey")
    
    if manager_survey is not None:
        df = df.merge(manager_survey, on='EmployeeID', how='left')
        print("✅ Merged with manager_survey")
    
    print(f"\n📊 Merged dataset: {df.shape[0]} rows × {df.shape[1]} columns")
else:
    # Work with the partial data
    if employee_survey is not None and manager_survey is not None:
        df = employee_survey.merge(manager_survey, on='EmployeeID', how='outer')
        print(f"\n📊 Partial dataset (surveys only): {df.shape[0]} rows × {df.shape[1]} columns")
        print("\n⚠️ Note: without general_data.csv, the target variable (Attrition) is not available.")
        print("   Add the general_data.csv file for a complete analysis.")
    else:
        df = None
        print("❌ Not enough data to build a dataset")

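The left joins above assume that `EmployeeID` identifies exactly one row in each file. A minimal sketch of how that assumption could be checked, using small toy frames whose values are invented for illustration: `validate='one_to_one'` asks pandas to raise a `MergeError` if a key is duplicated, and `indicator=True` adds a `_merge` column showing which employees had no matching survey row.

```python
import pandas as pd

# Toy stand-ins for general_data and employee_survey (illustrative values only)
general = pd.DataFrame({"EmployeeID": [1, 2, 3], "Age": [34, 41, 29]})
survey = pd.DataFrame({"EmployeeID": [1, 2], "JobSatisfaction": [4, 2]})

merged = general.merge(
    survey,
    on="EmployeeID",
    how="left",
    validate="one_to_one",  # raises pandas.errors.MergeError if a key is duplicated
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Employees without a survey row keep NaN in the survey columns
unmatched = merged.loc[merged["_merge"] == "left_only", "EmployeeID"].tolist()
print(merged)
print("Employees without survey data:", unmatched)
```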
---
## 3. Exploratory Data Analysis (EDA)

In [None]:
if df is not None:
    print("📋 General information about the dataset:")
    print(f"\nDimensions: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"\nData types:")
    print(df.dtypes.value_counts())
    print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

In [None]:
if df is not None:
    print("\n📊 Descriptive statistics (numeric variables):")
    display(df.describe().round(2))

In [None]:
if df is not None:
    print("\n🔍 Missing values:")
    missing = df.isnull().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    if len(missing) > 0:
        for col, count in missing.items():
            pct = count / len(df) * 100
            print(f"  • {col}: {count} ({pct:.1f}%)")
    else:
        print("  No missing values!")

### Distribution of the Target Variable

In [None]:
if df is not None and 'Attrition' in df.columns:
    print("\n📊 Attrition distribution:")
    print(df['Attrition'].value_counts())
    print(f"\nAttrition rate: {(df['Attrition'] == 'Yes').mean() * 100:.1f}%")
    
    # Visualization
    fig, ax = plt.subplots(figsize=(8, 5))
    colors = ['#2ecc71', '#e74c3c']
    counts = df['Attrition'].value_counts()
    bars = ax.bar(counts.index, counts.values, color=colors)
    
    for bar, count in zip(bars, counts.values):
        pct = count / len(df) * 100
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10,
                f'{pct:.1f}%\n({count})', ha='center', va='bottom', fontweight='bold')
    
    ax.set_title('Attrition Distribution', fontsize=14, fontweight='bold')
    ax.set_ylabel('Number of employees')
    plt.tight_layout()
    plt.show()
else:
    print("⚠️ Attrition variable not available")
    print("   Add the general_data.csv file to access this variable.")

### Distribution of the Numeric Variables

In [None]:
if df is not None:
    # Satisfaction variables (surveys)
    survey_cols = ['EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 
                   'JobInvolvement', 'PerformanceRating']
    available_cols = [c for c in survey_cols if c in df.columns]
    
    if available_cols:
        fig, axes = plt.subplots(1, len(available_cols), figsize=(15, 4))
        if len(available_cols) == 1:
            axes = [axes]
        
        for ax, col in zip(axes, available_cols):
            df[col].value_counts().sort_index().plot(kind='bar', ax=ax, color='steelblue')
            ax.set_title(col, fontsize=10)
            ax.set_xlabel('')
        
        plt.suptitle('Distribution of the Satisfaction Variables', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()

In [None]:
if df is not None and FULL_DATASET:
    # Demographic variables
    demo_cols = ['Age', 'MonthlyIncome', 'DistanceFromHome', 'TotalWorkingYears']
    available_demo = [c for c in demo_cols if c in df.columns]
    
    if available_demo:
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        axes = axes.flatten()
        
        for i, col in enumerate(available_demo):
            sns.histplot(df[col], ax=axes[i], kde=True, color='steelblue')
            axes[i].set_title(col)
        
        plt.suptitle('Distribution of the Demographic Variables', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()

### Correlation Matrix

In [None]:
if df is not None:
    # Correlation between the numeric variables
    numeric_df = df.select_dtypes(include=[np.number])
    
    if len(numeric_df.columns) > 1:
        plt.figure(figsize=(12, 10))
        corr = numeric_df.corr()
        mask = np.triu(np.ones_like(corr, dtype=bool))
        
        sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
                    center=0, square=True, linewidths=0.5)
        plt.title('Correlation Matrix', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()

---
## 4. Data Preprocessing

In [None]:
if df is not None:
    df_processed = df.copy()
    
    # 1. Handle missing values
    # Note: medians and modes are computed on the full dataset, before the train/test split
    print("🔧 Handling missing values...")
    
    for col in df_processed.select_dtypes(include=[np.number]).columns:
        if df_processed[col].isnull().sum() > 0:
            median_val = df_processed[col].median()
            # Assign the result back: in-place fillna on a column selection is deprecated in pandas 2.x
            df_processed[col] = df_processed[col].fillna(median_val)
            print(f"  • {col}: missing values replaced with the median ({median_val})")
    
    for col in df_processed.select_dtypes(include=['object']).columns:
        if df_processed[col].isnull().sum() > 0:
            mode_val = df_processed[col].mode()[0]
            df_processed[col] = df_processed[col].fillna(mode_val)
            print(f"  • {col}: missing values replaced with the mode ({mode_val})")
    
    print(f"\n✅ Remaining missing values: {df_processed.isnull().sum().sum()}")

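The imputation cell runs before the train/test split, so rows that later land in the test set contribute to the medians. One possible alternative, sketched here on invented toy data (the names `X_toy`, `y_toy` and `imputer` are illustrative and not part of the project code), is to fit a `SimpleImputer` on the training rows only and reuse its statistics on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy feature table with missing values (illustrative, not project data)
X_toy = pd.DataFrame({
    "JobSatisfaction": [1, 2, np.nan, 4, 3, np.nan, 2, 4],
    "WorkLifeBalance": [3, np.nan, 2, 3, 1, 4, 3, 2],
})
y_toy = pd.Series([0, 1, 0, 0, 1, 0, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Medians are learned from the training rows only...
imputer = SimpleImputer(strategy="median")
X_tr_imp = pd.DataFrame(imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)

# ...and then applied unchanged to the test rows
X_te_imp = pd.DataFrame(imputer.transform(X_te), columns=X_te.columns, index=X_te.index)

print("Training medians:", dict(zip(X_tr.columns, imputer.statistics_)))
```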
In [None]:
if df is not None and 'Attrition' in df_processed.columns:
    # 2. Encode the target variable
    print("\n🏷️ Encoding the target variable...")
    
    if df_processed['Attrition'].dtype == 'object':
        df_processed['Attrition'] = df_processed['Attrition'].map({'Yes': 1, 'No': 0})
        print(f"  • Attrition encoded: Yes=1, No=0")
    
    print(f"\n Distribution after encoding:")
    print(df_processed['Attrition'].value_counts())

In [None]:
if df is not None:
    # 3. Encode the categorical variables
    print("\n🏷️ Encoding the categorical variables...")
    
    categorical_cols = df_processed.select_dtypes(include=['object']).columns.tolist()
    # Exclude ID columns
    categorical_cols = [c for c in categorical_cols if 'ID' not in c.upper()]
    
    encoders = {}
    for col in categorical_cols:
        le = LabelEncoder()
        df_processed[col] = le.fit_transform(df_processed[col].astype(str))
        encoders[col] = le
        print(f"  • {col}: {len(le.classes_)} classes encoded")
    
    print(f"\n✅ {len(categorical_cols)} categorical columns encoded")

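`LabelEncoder` maps categories to integers in sorted order, which Logistic Regression and SVM may read as a ranking. A hedged alternative for nominal columns is one-hot encoding; the sketch below uses `pd.get_dummies` on a toy frame. The `Department` column and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with one nominal column (illustrative values only)
toy = pd.DataFrame({
    "Department": ["Sales", "Research & Development", "Human Resources", "Sales"],
    "Age": [30, 45, 28, 39],
})

# One indicator column per category; drop_first removes the first category
# (in sorted order) because it is implied by the others
encoded = pd.get_dummies(toy, columns=["Department"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
# ['Age', 'Department_Research & Development', 'Department_Sales']
```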
---
## 5. Feature Engineering

In [None]:
if df is not None and FULL_DATASET:
    print("🛠️ Creating derived features...")
    
    new_features = []
    
    # Tenure ratio: share of the career spent at the company
    if 'YearsAtCompany' in df_processed.columns and 'TotalWorkingYears' in df_processed.columns:
        df_processed['TenureRatio'] = df_processed['YearsAtCompany'] / (df_processed['TotalWorkingYears'] + 1)
        new_features.append('TenureRatio')
    
    # Promotion stagnation
    if 'YearsSinceLastPromotion' in df_processed.columns and 'YearsAtCompany' in df_processed.columns:
        df_processed['PromotionStagnation'] = df_processed['YearsSinceLastPromotion'] / (df_processed['YearsAtCompany'] + 1)
        new_features.append('PromotionStagnation')
    
    # Overall satisfaction score
    satisfaction_cols = ['EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance']
    available_sat = [c for c in satisfaction_cols if c in df_processed.columns]
    if available_sat:
        df_processed['OverallSatisfaction'] = df_processed[available_sat].mean(axis=1)
        new_features.append('OverallSatisfaction')
    
    # Income per year of experience
    if 'MonthlyIncome' in df_processed.columns and 'TotalWorkingYears' in df_processed.columns:
        df_processed['IncomePerYear'] = df_processed['MonthlyIncome'] / (df_processed['TotalWorkingYears'] + 1)
        new_features.append('IncomePerYear')
    
    print(f"\n✅ {len(new_features)} new features created:")
    for f in new_features:
        print(f"  • {f}")
else:
    print("⚠️ Feature engineering is limited without the full dataset")

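To make the ratios concrete, here is a worked example on two invented employees (the numbers are illustrative only). The `+ 1` in each denominator avoids division by zero for new hires; as a side effect, `TenureRatio` stays below 1 even for someone who has spent an entire career at the company.

```python
import pandas as pd

# Two invented employees: a recent hire and a long-serving one
toy = pd.DataFrame({
    "YearsAtCompany": [1, 9],
    "TotalWorkingYears": [1, 19],
    "YearsSinceLastPromotion": [0, 5],
    "MonthlyIncome": [20000, 100000],
})

toy["TenureRatio"] = toy["YearsAtCompany"] / (toy["TotalWorkingYears"] + 1)
toy["PromotionStagnation"] = toy["YearsSinceLastPromotion"] / (toy["YearsAtCompany"] + 1)
toy["IncomePerYear"] = toy["MonthlyIncome"] / (toy["TotalWorkingYears"] + 1)

print(toy[["TenureRatio", "PromotionStagnation", "IncomePerYear"]])
# TenureRatio:         1/2 = 0.50 and 9/20 = 0.45
# PromotionStagnation: 0/2 = 0.00 and 5/10 = 0.50
# IncomePerYear:       20000/2 = 10000 and 100000/20 = 5000
```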
---
## 6. Preparing the Data for Modeling

In [None]:
if df is not None and 'Attrition' in df_processed.columns:
    print("✂️ Preparing the train/test split...")
    
    # Separate features and target
    cols_to_drop = ['Attrition']
    if 'EmployeeID' in df_processed.columns:
        cols_to_drop.append('EmployeeID')
    if 'EmployeeId' in df_processed.columns:
        cols_to_drop.append('EmployeeId')
    
    X = df_processed.drop(columns=cols_to_drop)
    y = df_processed['Attrition']
    
    # Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    print(f"\n✅ Split complete:")
    print(f"  • Train: {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
    print(f"  • Test: {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")
    print(f"\n  Attrition rate (train): {y_train.mean()*100:.1f}%")
    print(f"  Attrition rate (test): {y_test.mean()*100:.1f}%")
    
    # Standardization (the scaler is fitted on the training set only)
    print("\n📐 Standardizing the features...")
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print("  ✅ Features standardized (StandardScaler)")
    
    DATA_READY = True
else:
    print("❌ Cannot prepare the data without the target variable (Attrition)")
    print("   Add the general_data.csv file to continue.")
    DATA_READY = False

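With `stratify=y`, each side of the split keeps the class proportions of the full dataset. A small sketch on a synthetic target with 15% positives (chosen to mirror the turnover rate quoted in the context, not taken from the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 85 "stay" (0) and 15 "leave" (1)
y_syn = np.array([0] * 85 + [1] * 15)
X_syn = np.arange(100).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Both subsets keep the 15% positive rate: 12 of 80 and 3 of 20
print(f"train: {y_tr.mean():.2f}  test: {y_te.mean():.2f}")
```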
---
## 7. Model Training and Evaluation

### Models used:
1. **Logistic Regression** - Baseline model, interpretable
2. **Random Forest** - Ensemble of decision trees, robust
3. **XGBoost** - Gradient boosting, strong performance
4. **SVM** - Support Vector Machine, effective in high-dimensional spaces

In [None]:
if DATA_READY:
    print("🚀 Training the models...\n")
    
    # Define the models
    models = {
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced'),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced', n_jobs=-1),
        'SVM': SVC(random_state=42, class_weight='balanced', probability=True)
    }
    
    if XGBOOST_AVAILABLE:
        # use_label_encoder is omitted: it is deprecated in recent XGBoost releases.
        # scale_pos_weight is set to the negative/positive ratio of the training set.
        models['XGBoost'] = XGBClassifier(
            random_state=42, eval_metric='logloss',
            scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum()
        )
    
    # Train and evaluate
    results = []
    trained_models = {}
    
    for name, model in models.items():
        print(f"üìä {name}")
        print("-" * 40)
        
        # Training
        model.fit(X_train_scaled, y_train)
        trained_models[name] = model
        
        # Predictions
        y_pred = model.predict(X_test_scaled)
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
        
        # Metrics
        metrics = {
            'Model': name,
            'Accuracy': accuracy_score(y_test, y_pred),
            'Precision': precision_score(y_test, y_pred, zero_division=0),
            'Recall': recall_score(y_test, y_pred, zero_division=0),
            'F1-Score': f1_score(y_test, y_pred, zero_division=0),
            'AUC-ROC': roc_auc_score(y_test, y_proba)
        }
        results.append(metrics)
        
        print(f"  Accuracy:  {metrics['Accuracy']:.4f}")
        print(f"  Precision: {metrics['Precision']:.4f}")
        print(f"  Recall:    {metrics['Recall']:.4f}")
        print(f"  F1-Score:  {metrics['F1-Score']:.4f}")
        print(f"  AUC-ROC:   {metrics['AUC-ROC']:.4f}")
        print()
    
    # Build the results DataFrame
    results_df = pd.DataFrame(results).set_index('Model')
    results_df = results_df.sort_values('F1-Score', ascending=False)
    
    print("\n" + "=" * 60)
    print("📈 MODEL COMPARISON (sorted by F1-Score)")
    print("=" * 60)
    display(results_df.round(4))

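Both `class_weight='balanced'` in scikit-learn and `scale_pos_weight` in XGBoost reweight the minority class. A sketch of what those weights come to for a 15% positive rate, on a synthetic label vector (the real values depend on `y_train`): scikit-learn uses `n_samples / (n_classes * count_per_class)`, and a common choice for `scale_pos_weight` is the negative-to-positive count ratio.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 85 negatives, 15 positives
y_syn = np.array([0] * 85 + [1] * 15)

# Weights that class_weight='balanced' would apply
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_syn)

# Negative/positive ratio, a common setting for XGBoost's scale_pos_weight
neg, pos = np.bincount(y_syn)
scale_pos_weight = neg / pos

print(f"balanced weights: stay={weights[0]:.3f}, leave={weights[1]:.3f}")
print(f"scale_pos_weight: {scale_pos_weight:.3f}")
```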
### Cross-Validation

In [None]:
if DATA_READY:
    print("\n🔄 Cross-validation (5-fold)...\n")
    
    cv_results = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    for name, model in models.items():
        scores = cross_val_score(model, X_train_scaled, y_train, cv=skf, scoring='f1')
        cv_results.append({
            'Model': name,
            'CV Mean F1': scores.mean(),
            'CV Std F1': scores.std()
        })
        print(f"{name}: F1 = {scores.mean():.4f} (+/- {scores.std():.4f})")
    
    cv_df = pd.DataFrame(cv_results).set_index('Model')
    print("\n")
    display(cv_df.round(4))

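The cross-validation above reuses `X_train_scaled`, whose scaler was fitted on the whole training set, so every validation fold has already influenced the scaling statistics. A possible refinement, sketched on synthetic data from `make_classification` (not the HumanForYou data), is to wrap the scaler and the classifier in a pipeline so that scaling is refitted inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the unscaled training set
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, weights=[0.85, 0.15], random_state=42
)

# The scaler is part of the estimator, so each fold fits it on its own training part
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_syn, y_syn, cv=skf, scoring="f1")
print(f"F1 = {scores.mean():.4f} (+/- {scores.std():.4f})")
```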
---
## 8. Performance Visualization

### Metric Comparison

In [None]:
if DATA_READY:
    # Comparison chart
    fig, ax = plt.subplots(figsize=(12, 6))
    
    metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC-ROC']
    x = np.arange(len(results_df))
    width = 0.15
    
    for i, metric in enumerate(metrics_to_plot):
        offset = (i - len(metrics_to_plot)/2 + 0.5) * width
        bars = ax.bar(x + offset, results_df[metric], width, label=metric)
    
    ax.set_ylabel('Score')
    ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
    ax.set_xticks(x)
    ax.set_xticklabels(results_df.index)
    ax.legend(loc='lower right')
    ax.set_ylim(0, 1.1)
    ax.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()

### ROC Curves

In [None]:
if DATA_READY:
    fig, ax = plt.subplots(figsize=(10, 8))
    
    colors = plt.cm.Set1(np.linspace(0, 1, len(trained_models)))
    
    for (name, model), color in zip(trained_models.items(), colors):
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        roc_auc = auc(fpr, tpr)
        
        ax.plot(fpr, tpr, color=color, lw=2, label=f'{name} (AUC = {roc_auc:.3f})')
    
    ax.plot([0, 1], [0, 1], 'k--', lw=1, label='Random (AUC = 0.500)')
    
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate (FPR)')
    ax.set_ylabel('True Positive Rate (TPR)')
    ax.set_title('ROC Curves - Model Comparison', fontsize=14, fontweight='bold')
    ax.legend(loc='lower right')
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

### Confusion Matrix

In [None]:
if DATA_READY:
    # Select the best model (highest F1-Score on the test set)
    best_model_name = results_df['F1-Score'].idxmax()
    best_model = trained_models[best_model_name]
    
    print(f"🏆 Best model: {best_model_name}")
    
    # Confusion matrix for the best model
    y_pred_best = best_model.predict(X_test_scaled)
    cm = confusion_matrix(y_test, y_pred_best)
    
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Stay', 'Leave'],
                yticklabels=['Stay', 'Leave'], ax=ax)
    ax.set_title(f'Confusion Matrix - {best_model_name}', fontsize=14, fontweight='bold')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    
    plt.tight_layout()
    plt.show()
    
    # Classification report
    print("\n📋 Classification Report:")
    print(classification_report(y_test, y_pred_best, target_names=['Stay', 'Leave']))

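The precision and recall shown in the classification report can be read directly off the confusion matrix. A small worked example on invented labels (1 = Leave, 0 = Stay), assuming scikit-learn's layout where rows are actual classes and columns are predicted classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented ground truth and predictions for ten employees
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted leavers, how many actually left
recall = tp / (tp + fn)     # of the actual leavers, how many were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")         # TN=4 FP=2 FN=1 TP=3
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```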
---
## 9. Feature Importance

In [None]:
if DATA_READY and 'Random Forest' in trained_models:
    rf_model = trained_models['Random Forest']
    
    # Feature importance
    importance_df = pd.DataFrame({
        'Feature': X.columns,
        'Importance': rf_model.feature_importances_
    }).sort_values('Importance', ascending=False)
    
    # Top 15 features
    top_n = min(15, len(importance_df))
    top_features = importance_df.head(top_n)
    
    fig, ax = plt.subplots(figsize=(10, 8))
    
    colors = plt.cm.viridis(np.linspace(0, 0.8, top_n))
    ax.barh(top_features['Feature'][::-1], top_features['Importance'][::-1], color=colors[::-1])
    
    ax.set_title(f'Top {top_n} Most Important Features - Random Forest', fontsize=14, fontweight='bold')
    ax.set_xlabel('Importance')
    
    plt.tight_layout()
    plt.show()
    
    print("\n📊 Feature importance:")
    display(importance_df.head(15).round(4))

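Impurity-based importances from a Random Forest are computed on the training data and can favor features with many distinct values. As a cross-check, permutation importance measures how much a held-out score drops when one feature is shuffled. A minimal sketch on synthetic data (the dataset and the variable names are illustrative, not the project's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 8 features, of which 3 are informative
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the drop in F1
result = permutation_importance(rf, X_te, y_te, scoring="f1", n_repeats=10, random_state=0)

# Feature indices ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f} (+/- {result.importances_std[idx]:.4f})")
```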
### Interpretability with SHAP (if available)

In [None]:
if DATA_READY and SHAP_AVAILABLE and 'Random Forest' in trained_models:
    print("🔍 SHAP analysis...")
    
    try:
        # Create the explainer
        explainer = shap.TreeExplainer(trained_models['Random Forest'])
        
        # Compute SHAP values on a sample of the scaled test set,
        # since the model was trained on scaled features
        sample_size = min(100, len(X_test))
        X_sample = pd.DataFrame(X_test_scaled[:sample_size], columns=X.columns)
        shap_values = explainer.shap_values(X_sample)
        
        # Older SHAP versions return one array per class; newer ones return a
        # single (samples, features, classes) array. Keep the positive class.
        if isinstance(shap_values, list):
            shap_values = shap_values[1]
        elif getattr(shap_values, 'ndim', 2) == 3:
            shap_values = shap_values[:, :, 1]
        
        # Summary plot
        plt.figure(figsize=(12, 8))
        shap.summary_plot(shap_values, X_sample, plot_type="bar", max_display=15, show=False)
        plt.title('SHAP Feature Importance', fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
        
    except Exception as e:
        print(f"⚠️ SHAP error: {e}")
elif not SHAP_AVAILABLE:
    print("⚠️ SHAP not available. Install with: pip install shap")

---
## 10. Benchmarks and Performance Summary

In [None]:
if DATA_READY:
    print("=" * 60)
    print("📊 BENCHMARK SUMMARY")
    print("=" * 60)
    
    print("\n📈 Model performance on the test set:")
    print("-" * 60)
    display(results_df.round(4))
    
    print(f"\n🏆 Best model: {best_model_name}")
    print(f"   • F1-Score: {results_df.loc[best_model_name, 'F1-Score']:.4f}")
    print(f"   • AUC-ROC: {results_df.loc[best_model_name, 'AUC-ROC']:.4f}")
    print(f"   • Recall: {results_df.loc[best_model_name, 'Recall']:.4f}")
    
    print("\n📊 Dataset statistics:")
    print("-" * 60)
    print(f"   • Total employees: {len(df)}")
    print(f"   • Train set: {len(X_train)}")
    print(f"   • Test set: {len(X_test)}")
    print(f"   • Features: {len(X.columns)}")
    print(f"   • Overall attrition rate: {y.mean()*100:.1f}%")

---
## 11. Conclusions and Recommendations

### Summary of Results

This analysis of employee attrition at HumanForYou yields several key findings:

#### Attrition Factors Identified

The main factors influencing employee attrition, based on feature importance, are:

1. **Job satisfaction** - Low satisfaction is strongly correlated with departure
2. **Work-life balance** - Employees dissatisfied with this balance are more likely to leave
3. **Tenure at the company** - Employees with short tenure are at higher risk
4. **Monthly income** - Lower pay increases the risk of attrition
5. **Distance from home** - Long commutes are a risk factor

### Recommendations for HumanForYou

#### Short-Term Actions
- 🎯 **Targeted retention program** for employees identified as at risk
- 💬 **Regular retention interviews** with managers
- 💰 **Salary review** for high-attrition positions

#### Medium-Term Actions
- 📈 **Personalized career plans** to improve satisfaction
- 🏠 **Remote-work policy** to reduce the impact of commuting
- 🎓 **Training programs** to build engagement

#### Long-Term Actions
- 🔄 **Monitoring system** to track attrition indicators in real time
- 🏢 **Company culture** centered on well-being and work-life balance
- 📊 **HR dashboards** to anticipate departures

### Limitations of the Analysis

- The models are trained on historical data and may not capture every attrition factor
- Some qualitative variables (satisfaction, performance) are subjective
- Class imbalance (about 15% attrition) can affect the predictions
- Badge (clock-in/clock-out) data was not integrated in this version

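To illustrate the class-imbalance point: with about 15% attrition, a model that always predicts "stay" reaches 85% accuracy while detecting no departures, which is why the notebook ranks models by F1-Score rather than accuracy. A tiny check on a synthetic label vector (not the project data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with a 15% attrition rate
y_true = np.array([0] * 85 + [1] * 15)

# Trivial baseline: predict "stay" (0) for everyone
y_always_stay = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_stay)
rec = recall_score(y_true, y_always_stay, zero_division=0)

print(f"accuracy={acc:.2f} recall={rec:.2f}")  # accuracy=0.85 recall=0.00
```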
### Next Steps

1. **Integrate the badge data** to enrich the features
2. **Refine the model** with more thorough hyperparameter tuning
3. **Deploy the model** to production for real-time predictions
4. **Set up monitoring** of the model's performance
5. **Train the HR teams** to interpret the results

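As an illustration of the hyperparameter-tuning step, here is a minimal `GridSearchCV` sketch on synthetic data. The grid values are arbitrary placeholders rather than tuned recommendations for this dataset, and in the notebook the search would run on the training set only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the training set
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Small illustrative grid (placeholder values)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # same metric the notebook uses to rank models
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```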
In [None]:
print("\n" + "=" * 60)
print("✅ ANALYSIS COMPLETE")
print("=" * 60)
print("\n📁 Available deliverables:")
print("  • livrables/01_ethique.md - Ethics document")
print("  • livrables/02_bibliographie.md - Bibliography")
print("  • livrables/03_presentation_notebook.ipynb - This notebook")
print("\n💻 Source code:")
print("  • src/data_loader.py - Data loading")
print("  • src/data_preprocessing.py - Preprocessing")
print("  • src/feature_engineering.py - Feature engineering")
print("  • src/models.py - ML models")
print("  • src/visualization.py - Visualizations")
print("\n🔗 For any questions, see README.md")