# Part 4 - Overfitting, Random Forests, and Boosting

This notebook analyzes overfitting and explores ensemble methods.

**Contents:**
- Overfitting as a function of tree depth
- Effect of min_samples_leaf
- Random forests (RandomForest)
- Boosting (AdaBoost)
- Comparison of the methods

**Author**: Data Mining Project - Decision Trees

## 1. Importing the libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Configuration
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

## 2. Loading and preparing the data

In [None]:
# Load the dataset from GitHub
base_url = 'https://raw.githubusercontent.com/NassimZahri/Data_Mining/main/data/'
df = pd.read_csv(base_url + 'credit_simple.csv')

# Preparation: one-hot encode the features and map the target to 0/1
X = pd.get_dummies(df.drop('defaut', axis=1))
y = df['defaut'].map({'oui': 1, 'non': 0})

# Train/test split (70%/30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(f"Training set: {len(X_train)} examples")
print(f"Test set: {len(X_test)} examples")
print("\nClass distribution (train):")
print(y_train.value_counts())

## 3. Overfitting analysis with max_depth

In [None]:
# Try several depths
depths = [1, 2, 3, 4, 5, 6, 7, 8, None]  # None = no limit

train_accuracies = []
test_accuracies = []
train_f1 = []
test_f1 = []

for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Metrics
    train_accuracies.append(accuracy_score(y_train, y_pred_train))
    test_accuracies.append(accuracy_score(y_test, y_pred_test))
    train_f1.append(f1_score(y_train, y_pred_train, zero_division=0))
    test_f1.append(f1_score(y_test, y_pred_test, zero_division=0))

# Display the results
depth_labels = [str(d) for d in depths]  # None is rendered as 'None'
results_depth = pd.DataFrame({
    'max_depth': depth_labels,
    'Train Accuracy': train_accuracies,
    'Test Accuracy': test_accuracies,
    'Train F1': train_f1,
    'Test F1': test_f1
})

print("Results by max_depth:")
results_depth

In [None]:
# Visualizing overfitting
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy
x_pos = range(len(depths))
axes[0].plot(x_pos, train_accuracies, 'o-', label='Train', color='#3498db', linewidth=2, markersize=8)
axes[0].plot(x_pos, test_accuracies, 's-', label='Test', color='#e74c3c', linewidth=2, markersize=8)
axes[0].set_xlabel('max_depth')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy vs Tree Depth')
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(depth_labels)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# F1-Score
axes[1].plot(x_pos, train_f1, 'o-', label='Train', color='#3498db', linewidth=2, markersize=8)
axes[1].plot(x_pos, test_f1, 's-', label='Test', color='#e74c3c', linewidth=2, markersize=8)
axes[1].set_xlabel('max_depth')
axes[1].set_ylabel('F1-Score')
axes[1].set_title('F1-Score vs Tree Depth')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(depth_labels)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservations:")
print("- As depth increases, training accuracy increases (memorization)")
print("- Test accuracy can plateau or drop (overfitting)")
print("- The gap between training and test accuracy indicates the degree of overfitting")

## 4. Effect of min_samples_leaf

In [None]:
# Try several values of min_samples_leaf
min_samples_values = [1, 2, 3, 5, 10]

train_acc_samples = []
test_acc_samples = []

for min_samples in min_samples_values:
    model = DecisionTreeClassifier(min_samples_leaf=min_samples, random_state=42)
    model.fit(X_train, y_train)
    
    train_acc_samples.append(model.score(X_train, y_train))
    test_acc_samples.append(model.score(X_test, y_test))

# Plot
plt.figure(figsize=(10, 5))
plt.plot(min_samples_values, train_acc_samples, 'o-', label='Train', color='#3498db', linewidth=2, markersize=8)
plt.plot(min_samples_values, test_acc_samples, 's-', label='Test', color='#e74c3c', linewidth=2, markersize=8)
plt.xlabel('min_samples_leaf')
plt.ylabel('Accuracy')
plt.title('Effect of min_samples_leaf on Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Observations:")
print("- A larger min_samples_leaf reduces overfitting, since each leaf must cover more examples")
print("- But a threshold that is too high can cause underfitting")

## 5. Random Forests

In [None]:
# Try several numbers of estimators
n_estimators_list = [10, 25, 50, 100, 200]

rf_train_acc = []
rf_test_acc = []

for n_est in n_estimators_list:
    rf = RandomForestClassifier(n_estimators=n_est, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    
    rf_train_acc.append(rf.score(X_train, y_train))
    rf_test_acc.append(rf.score(X_test, y_test))

# Results
rf_results = pd.DataFrame({
    'n_estimators': n_estimators_list,
    'Train Accuracy': rf_train_acc,
    'Test Accuracy': rf_test_acc
})

print("Random Forest - Results by n_estimators:")
rf_results

In [None]:
# Random Forest plot
plt.figure(figsize=(10, 5))
plt.plot(n_estimators_list, rf_train_acc, 'o-', label='Train', color='#27ae60', linewidth=2, markersize=8)
plt.plot(n_estimators_list, rf_test_acc, 's-', label='Test', color='#8e44ad', linewidth=2, markersize=8)
plt.xlabel('Number of trees (n_estimators)')
plt.ylabel('Accuracy')
plt.title('Random Forest - Performance by Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Reference Random Forest model (n_estimators=100, the scikit-learn default; not tuned further)
rf_optimal = RandomForestClassifier(n_estimators=100, random_state=42)
rf_optimal.fit(X_train, y_train)

y_pred_rf = rf_optimal.predict(X_test)

print("Random Forest (n_estimators=100) - Classification report:")
print("=" * 60)
print(classification_report(y_test, y_pred_rf, target_names=['No', 'Yes']))

In [None]:
# Feature importances for the Random Forest
rf_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_optimal.feature_importances_
}).sort_values('Importance', ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(rf_importance['Feature'], rf_importance['Importance'], color='#27ae60')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances - Random Forest')
plt.tight_layout()
plt.show()

## 6. Boosting with AdaBoost

In [None]:
# AdaBoost with several numbers of estimators
ada_estimators_list = [10, 25, 50, 100, 200]

ada_train_acc = []
ada_test_acc = []

for n_est in ada_estimators_list:
    ada = AdaBoostClassifier(n_estimators=n_est, random_state=42)
    ada.fit(X_train, y_train)
    
    ada_train_acc.append(ada.score(X_train, y_train))
    ada_test_acc.append(ada.score(X_test, y_test))

# Results
ada_results = pd.DataFrame({
    'n_estimators': ada_estimators_list,
    'Train Accuracy': ada_train_acc,
    'Test Accuracy': ada_test_acc
})

print("AdaBoost - Results by n_estimators:")
ada_results

In [None]:
# AdaBoost plot
plt.figure(figsize=(10, 5))
plt.plot(ada_estimators_list, ada_train_acc, 'o-', label='Train', color='#e67e22', linewidth=2, markersize=8)
plt.plot(ada_estimators_list, ada_test_acc, 's-', label='Test', color='#c0392b', linewidth=2, markersize=8)
plt.xlabel('Number of estimators (n_estimators)')
plt.ylabel('Accuracy')
plt.title('AdaBoost - Performance by Number of Estimators')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Reference AdaBoost model (n_estimators=50, the scikit-learn default; not tuned further)
ada_optimal = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_optimal.fit(X_train, y_train)

y_pred_ada = ada_optimal.predict(X_test)

print("AdaBoost (n_estimators=50) - Classification report:")
print("=" * 60)
print(classification_report(y_test, y_pred_ada, target_names=['No', 'Yes']))

## 7. Overall comparison of the methods

In [None]:
# Compare all the methods
models = {
    'Decision Tree (depth=3)': DecisionTreeClassifier(max_depth=3, random_state=42),
    'Decision Tree (no limit)': DecisionTreeClassifier(random_state=42),
    'Random Forest (100)': RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost (50)': AdaBoostClassifier(n_estimators=50, random_state=42)
}

comparison_results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    
    # 5-fold cross-validation on the full dataset (cross_val_score refits a clone on each fold)
    cv_scores = cross_val_score(model, X, y, cv=5)
    
    comparison_results.append({
        'Model': name,
        'Train Accuracy': f'{train_acc:.2%}',
        'Test Accuracy': f'{test_acc:.2%}',
        'F1-Score': f'{f1:.2%}',
        'CV Mean': f'{cv_scores.mean():.2%}',
        'CV Std': f'{cv_scores.std():.2%}'
    })

comparison_df = pd.DataFrame(comparison_results)
print("Comparison of the different methods:")
print("=" * 80)
comparison_df

In [None]:
# Comparison chart
fig, ax = plt.subplots(figsize=(12, 6))

model_names = list(models.keys())
x = np.arange(len(model_names))
width = 0.35

train_accs = [float(r['Train Accuracy'].strip('%'))/100 for r in comparison_results]
test_accs = [float(r['Test Accuracy'].strip('%'))/100 for r in comparison_results]

bars1 = ax.bar(x - width/2, train_accs, width, label='Train', color='#3498db')
bars2 = ax.bar(x + width/2, test_accs, width, label='Test', color='#e74c3c')

ax.set_xlabel('Model')
ax.set_ylabel('Accuracy')
ax.set_title('Performance Comparison of the Different Methods')
ax.set_xticks(x)
ax.set_xticklabels(model_names, rotation=15, ha='right')
ax.legend()
ax.set_ylim(0, 1.1)

# Add the values above the bars
for bar in bars1:
    height = bar.get_height()
    ax.annotate(f'{height:.1%}', xy=(bar.get_x() + bar.get_width()/2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom', fontsize=9)

for bar in bars2:
    height = bar.get_height()
    ax.annotate(f'{height:.1%}', xy=(bar.get_x() + bar.get_width()/2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

## 8. Analysis and Conclusions

### Overfitting:
- A decision tree with no depth limit tends to overfit
- The gap between training and test accuracy is an indicator of overfitting
- The max_depth and min_samples_leaf parameters help control it
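
To make the train/test gap concrete, here is a minimal sketch that measures it for a few depths. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the block runs without the credit dataset); the names `X_demo`, `y_demo`, and `results` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the credit dataset used in this notebook.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

results = {}  # max_depth -> (train accuracy, test accuracy)
for depth in [1, 3, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in results.items():
    # A larger gap between the two accuracies suggests stronger overfitting.
    print(f"max_depth={depth}: train={train_acc:.3f}, "
          f"test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```

With continuous features and no depth limit, the tree can isolate every training point, so its training accuracy reaches 100% regardless of label noise; the test accuracy is what reveals whether that fit generalizes.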

### Random Forests:
- Combine many trees, each trained on a bootstrap sample, to reduce variance
- More stable than individual trees
- The number of trees (n_estimators) has little effect on performance beyond a certain threshold
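
The idea of combining trees can be sketched by hand: train each tree on a bootstrap sample with a random subset of features per split, then take a majority vote. This is a simplified illustration on synthetic data (an assumption), not a reproduction of `RandomForestClassifier`, which averages predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the credit dataset used in this notebook.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote cannot tie
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for i in range(n_trees):
    # Bootstrap sample: draw len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    votes[i] = tree.predict(X_te)

# Majority vote across the trees (labels are 0/1).
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
single_tree_accs = (votes == y_te).mean(axis=1)
ensemble_acc = (ensemble_pred == y_te).mean()

print(f"Mean single-tree accuracy: {single_tree_accs.mean():.3f}")
print(f"Spread of single-tree accuracy (std): {single_tree_accs.std():.3f}")
print(f"Majority-vote accuracy: {ensemble_acc:.3f}")
```

Comparing the spread of the individual tree accuracies with the single majority-vote accuracy gives a rough feel for why the ensemble is more stable than any one tree.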

### AdaBoost:
- Boosting method that combines weak classifiers trained sequentially
- Reweights the training set so that each new classifier focuses on previously misclassified examples
- Can reach good performance but may overfit, particularly on noisy data
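
How the focus on misclassified examples arises can be seen in a single reweighting round. The sketch below follows the classic discrete AdaBoost update with a decision stump on synthetic data (an assumption); scikit-learn's `AdaBoostClassifier` differs in some details of the weight scaling, so treat this as an illustration of the principle only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the credit dataset used in this notebook.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

n = len(y_demo)
weights = np.full(n, 1.0 / n)  # start with uniform example weights

# Weak learner: a depth-1 tree (decision stump) fitted with the current weights.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_demo, y_demo, sample_weight=weights)
miss = stump.predict(X_demo) != y_demo

# Weighted error of the stump and its vote weight alpha.
err = weights[miss].sum()
alpha = 0.5 * np.log((1.0 - err) / err)

# Increase the weight of misclassified examples, decrease the others, renormalize.
weights = weights * np.exp(np.where(miss, alpha, -alpha))
weights /= weights.sum()

print(f"Weighted error: {err:.3f}, alpha: {alpha:.3f}")
print(f"Total weight now on misclassified examples: {weights[miss].sum():.3f}")
```

After this update the misclassified examples carry half of the total weight, so the next weak learner cannot do well by repeating the same mistakes.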

### Recommendations:
1. For interpretability: use a decision tree with limited depth
2. For performance: use Random Forest or AdaBoost
3. Always validate with cross-validation
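
One way to act on the third recommendation is to choose max_depth and min_samples_leaf by cross-validation on the training set only, keeping the test set for a final check. The sketch below uses `GridSearchCV` with stratified 5-fold splits on synthetic data; the data and the parameter grid are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): not the credit dataset used in this notebook.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Illustrative grid (assumption): adapt the candidate values to the dataset size.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X_tr, y_tr)  # the test set is never seen during the search

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy of the best setting: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.3f}")
```

Selecting hyperparameters this way avoids tuning them against the test set, which would make the reported test accuracy optimistic.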

In [None]:
print("End of notebook - Overfitting and ensemble methods")
print("The next notebook will present the business application.")