# 🏆 FINAL MODEL SELECTION
## Performance comparison of the trained models

This notebook compares the three models trained to predict **maize yield** (`yield`):
1. Ridge Regression (regularized linear regression)
2. Random Forest Regressor
3. Gradient Boosting Regressor

In [None]:
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

## 1. Data loading and preparation

In [None]:
# Load the data and keep maize records only
df = pd.read_csv('../data/hvstat_africa_data_v1.0.csv')
df_maize = df[df['product'].str.contains('maize|corn|maïs', case=False, na=False)].copy()

# Remove outliers (between() is False for missing values, so those rows are excluded too)
mask = (
    df_maize['area'].between(0.1, 50000) &
    df_maize['production'].between(1, 50000) &
    df_maize['yield'].between(0.1, 8)
)
df_clean = df_maize[mask].copy()
df_clean = df_clean.dropna(subset=['yield', 'area', 'production'])

# Standardize the production system labels.
# Rules are applied in order, so a later match overrides an earlier one
# (the substring 'all' also matches labels containing 'small').
df_clean['system_simplified'] = 'other'
sys_lower = df_clean['crop_production_system'].str.lower()
df_clean.loc[sys_lower.str.contains('irrigated|water|dam|riverine', na=False), 'system_simplified'] = 'irrigated'
df_clean.loc[sys_lower.str.contains('rainfed|dieri|recessional', na=False), 'system_simplified'] = 'rainfed'
df_clean.loc[sys_lower.str.contains('commercial|mechanized|large_scale|lscf', na=False), 'system_simplified'] = 'commercial_mechanized'
df_clean.loc[sys_lower.str.contains('traditional|communal|small|pastoral|sscf|a1|a2', na=False), 'system_simplified'] = 'traditional_small_scale'
# Raw string so that \( and \) reach the regex engine unchanged
df_clean.loc[sys_lower.str.contains(r'all|none|or \(ps\)', na=False), 'system_simplified'] = 'general_unknown'

print(f"✅ Cleaned dataset: {len(df_clean):,} observations")
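
The label rules above are plain substring matches applied in sequence, so a label that matches several patterns keeps the last assignment. The sketch below illustrates this on made-up labels (not taken from the dataset) and compares it with a word-boundary pattern. Whether such a pattern suits the real labels is an assumption to check against the data.

```python
import pandas as pd

# Made-up labels for illustration only; they are not taken from the dataset.
labels = pd.Series(["Small_scale", "All", "Irrigated", "all (PS)"]).str.lower()

# Plain substring match: "all" also matches inside "small_scale".
substring_hits = labels.str.contains(r"all", na=False)

# Word-boundary match: "all" must stand as a separate word.
word_hits = labels.str.contains(r"\ball\b", na=False)

summary = pd.DataFrame({"label": labels, "substring": substring_hits, "word": word_hits})
print(summary)
```

Only the first label differs between the two columns, which is the case where the ordering of the rules matters.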

In [None]:
# Feature preparation
features_cols = ['country_code', 'season_name', 'planting_month', 'harvest_month', 'area', 'system_simplified']
X = df_clean[features_cols]
y = df_clean['yield']

# One-hot encoding
X_encoded = pd.get_dummies(X, columns=['country_code', 'season_name', 'system_simplified'], drop_first=True)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Scaling for Ridge Regression (fitted on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Train: {X_train.shape[0]} samples, Test: {X_test.shape[0]} samples")
print(f"Features: {X_train.shape[1]} columns")
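
Fitting the scaler on the training split and reusing its statistics on the test split keeps test information out of the preprocessing. The sketch below reproduces that idea with plain NumPy on made-up numbers. It mirrors the default behavior of `StandardScaler`, which uses the population standard deviation (`ddof=0`).

```python
import numpy as np

# Made-up feature matrices standing in for X_train and X_test.
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[2.0, 40.0]])

# Statistics come from the training split only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # ddof=0, i.e. population standard deviation

X_train_std = (X_train_demo - mean) / std
X_test_std = (X_test_demo - mean) / std  # reuse the training statistics

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

The scaled test values need not have zero mean or unit variance, since they are expressed relative to the training distribution.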

## 2. Loading the trained models

In [None]:
# Load the models
ridge_model = joblib.load('../ml_models_pkg/ridge_regression_model.pkl')
rf_model = joblib.load('../ml_models_pkg/random_forest_model.pkl')
gb_model = joblib.load('../ml_models_pkg/gb_model.pkl')

print("✅ Models loaded successfully!")
print(f"  - Ridge Regression: alpha = {ridge_model.alpha}")
print(f"  - Random Forest: {rf_model.n_estimators} trees, max_depth = {rf_model.max_depth}")
print(f"  - Gradient Boosting: {gb_model.n_estimators} trees, learning_rate = {gb_model.learning_rate}")

## 3. Comparative evaluation of the models

In [None]:
def evaluate_model(model, X_test, y_test, model_name):
    """Evaluate a model on the test set and return its metrics and predictions."""
    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    # The keys below are looked up verbatim by the later cells.
    return {
        'Mod√®le': model_name,
        'MAE (t/ha)': mae,
        'RMSE (t/ha)': rmse,
        'R¬≤ Score': r2,
        'Predictions': y_pred
    }

# Evaluate the three models (Ridge receives the scaled features)
results = []
results.append(evaluate_model(ridge_model, X_test_scaled, y_test, 'Ridge Regression'))
results.append(evaluate_model(rf_model, X_test, y_test, 'Random Forest'))
results.append(evaluate_model(gb_model, X_test, y_test, 'Gradient Boosting'))

# Build the comparison table (predictions are left out)
comparison_df = pd.DataFrame([{k: v for k, v in r.items() if k != 'Predictions'} for r in results])
comparison_df = comparison_df.sort_values('R¬≤ Score', ascending=False)

print("\n" + "="*70)
print("📊 PERFORMANCE COMPARISON TABLE")
print("="*70)
print(comparison_df.to_string(index=False))
print("="*70)

In [None]:
# Performance visualization
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

colors = ['#3498db', '#2ecc71', '#e74c3c']
models = ['Ridge Regression', 'Random Forest', 'Gradient Boosting']

# R² Score
r2_scores = [r['R¬≤ Score'] for r in results]
bars1 = axes[0].bar(models, r2_scores, color=colors, edgecolor='black')
axes[0].set_ylabel('R² Score')
axes[0].set_title('R² Score (Higher = Better)', fontweight='bold')
axes[0].set_ylim(0, 1)
for bar, score in zip(bars1, r2_scores):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                 f'{score:.4f}', ha='center', fontweight='bold')

# MAE
mae_scores = [r['MAE (t/ha)'] for r in results]
bars2 = axes[1].bar(models, mae_scores, color=colors, edgecolor='black')
axes[1].set_ylabel('MAE (tonnes/ha)')
axes[1].set_title('MAE - Mean Absolute Error (Lower = Better)', fontweight='bold')
for bar, score in zip(bars2, mae_scores):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                 f'{score:.4f}', ha='center', fontweight='bold')

# RMSE
rmse_scores = [r['RMSE (t/ha)'] for r in results]
bars3 = axes[2].bar(models, rmse_scores, color=colors, edgecolor='black')
axes[2].set_ylabel('RMSE (tonnes/ha)')
axes[2].set_title('RMSE - Root Mean Squared Error (Lower = Better)', fontweight='bold')
for bar, score in zip(bars3, rmse_scores):
    axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                 f'{score:.4f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('../ml_models_pkg/model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
print("\n📈 Figure saved: ../ml_models_pkg/model_comparison.png")

## 4. Predicted vs. actual values

In [None]:
# Scatter plots of predicted vs. actual values
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for result, ax, color in zip(results, axes, colors):
    y_pred = result['Predictions']
    ax.scatter(y_test, y_pred, alpha=0.5, color=color, edgecolor='none', s=20)

    # Perfect-prediction line (y = x)
    min_val, max_val = min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())
    ax.plot([min_val, max_val], [min_val, max_val], 'k--', lw=2, label='Perfect prediction')

    ax.set_xlabel('Actual values (t/ha)')
    ax.set_ylabel('Predictions (t/ha)')
    ax.set_title(f"{result['Mod√®le']}\nR² = {result['R¬≤ Score']:.4f}", fontweight='bold')
    ax.legend(loc='upper left')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../ml_models_pkg/predictions_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

## 5. Error distribution

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for result, ax, color in zip(results, axes, colors):
    y_pred = result['Predictions']
    # Error = actual - predicted, so a positive value means the model under-predicts
    errors = y_test.values - y_pred

    ax.hist(errors, bins=50, color=color, edgecolor='black', alpha=0.7)
    ax.axvline(x=0, color='red', linestyle='--', linewidth=2, label='Zero error')
    ax.axvline(x=errors.mean(), color='blue', linestyle='-', linewidth=2, label=f'Mean: {errors.mean():.3f}')

    ax.set_xlabel('Error (t/ha)')
    ax.set_ylabel('Frequency')
    ax.set_title(f"Error distribution - {result['Mod√®le']}", fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../ml_models_pkg/error_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
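
The histograms can be complemented by a few summary numbers. The sketch below is a minimal example on made-up values (not the notebook's predictions). It uses the same sign convention (actual minus predicted) to compute the mean error, the median absolute error, and the 5th and 95th percentiles.

```python
import numpy as np

# Made-up values for illustration only.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_demo = np.array([1.5, 1.5, 3.0, 3.0, 6.0])

errors_demo = y_true_demo - y_pred_demo  # actual - predicted

bias = errors_demo.mean()                        # systematic over- or under-prediction
median_abs = np.median(np.abs(errors_demo))      # typical error size, robust to outliers
p5, p95 = np.percentile(errors_demo, [5, 95])    # range covering the central 90% of errors

print(f"bias={bias:.3f}, median |error|={median_abs:.3f}, p5={p5:.3f}, p95={p95:.3f}")
```

A mean error close to zero alongside wide percentiles indicates an unbiased but imprecise model.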

## 6. 🏆 FINAL MODEL SELECTION

In [None]:
# Pick the model with the highest R² on the test set
best_idx = np.argmax([r['R¬≤ Score'] for r in results])
best_model_name = results[best_idx]['Mod√®le']

print("\n" + "="*70)
print("🏆 FINAL MODEL SELECTED")
print("="*70)
print(f"\n✅ Selected model: {best_model_name}")
print("\n📊 Performance on the test set:")
print(f"   • R² Score: {results[best_idx]['R¬≤ Score']:.4f}")
print(f"   • MAE: {results[best_idx]['MAE (t/ha)']:.4f} t/ha")
print(f"   • RMSE: {results[best_idx]['RMSE (t/ha)']:.4f} t/ha")
print("\n" + "="*70)
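
Selecting on R² alone ignores the other two metrics. One possible variant, sketched below with made-up scores (not results from this notebook), ranks by R² and uses RMSE to break ties. The column names are illustrative.

```python
import pandas as pd

# Made-up scores for illustration only.
scores = pd.DataFrame({
    "model": ["A", "B", "C"],
    "r2": [0.80, 0.85, 0.85],
    "rmse": [0.60, 0.55, 0.50],
})

# Highest R² first; among equal R² values, lowest RMSE first.
ranked = scores.sort_values(["r2", "rmse"], ascending=[False, True]).reset_index(drop=True)
best = ranked.loc[0, "model"]

print(ranked)
print(f"Selected: {best}")
```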

In [None]:
# Rationale for the choice
print("\n📝 RATIONALE FOR THE MODEL CHOICE")
print("-" * 50)

# Comparative analysis
ridge_r2 = results[0]['R¬≤ Score']
rf_r2 = results[1]['R¬≤ Score']
gb_r2 = results[2]['R¬≤ Score']

print(f"""
1. **Ridge Regression** (R² = {ridge_r2:.4f})
   - ✅ Simple and interpretable
   - ✅ Fast to train
   - ❌ Limited performance on non-linear data

2. **Random Forest** (R² = {rf_r2:.4f})
   - ✅ Robust to outliers
   - ✅ Captures non-linear relationships
   - ✅ Feature importances available
   - ❌ Can be slow on large datasets

3. **Gradient Boosting** (R² = {gb_r2:.4f})
   - ✅ Often the most accurate
   - ✅ Iteratively corrects the remaining errors
   - ❌ More prone to overfitting
   - ❌ Slower to train
""")

print(f"\n🎯 CONCLUSION: {best_model_name} achieves the highest R² on the held-out test set")
print("   and is therefore retained as the final model.")
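
A single train/test split gives only one estimate of how a model generalizes. One way to get a more stable estimate is k-fold cross-validation with `cross_val_score`, which the notebook already imports. The sketch below runs it on synthetic data with a fresh `Ridge` instance as a stand-in. Neither the data nor the hyperparameters come from this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation with shuffling; each fold yields one R² value.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo, cv=cv, scoring="r2")

print(f"R² per fold: {np.round(scores, 4)}")
print(f"Mean R²: {scores.mean():.4f} (std: {scores.std():.4f})")
```

The spread across folds gives a rough sense of how much a single test-set R² could vary by chance.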

In [None]:
# Save the final model for deployment
# The final model will be used by the API
final_model_path = '../ml_models_pkg/final_model.pkl'

if best_model_name == 'Ridge Regression':
    joblib.dump(ridge_model, final_model_path)
    # Ridge needs the same scaling at inference time, so the scaler is saved too
    joblib.dump(scaler, '../ml_models_pkg/final_scaler.pkl')
    model_type = 'ridge'
elif best_model_name == 'Random Forest':
    joblib.dump(rf_model, final_model_path)
    model_type = 'random_forest'
else:
    joblib.dump(gb_model, final_model_path)
    model_type = 'gradient_boosting'

# Save the metadata
model_metadata = {
    'model_type': model_type,
    'model_name': best_model_name,
    'r2_score': results[best_idx]['R¬≤ Score'],
    'mae': results[best_idx]['MAE (t/ha)'],
    'rmse': results[best_idx]['RMSE (t/ha)'],
    'features': X_encoded.columns.tolist(),
    'target': 'yield',
    'training_samples': len(X_train),
    'test_samples': len(X_test)
}

joblib.dump(model_metadata, '../ml_models_pkg/model_metadata.pkl')

print(f"\n✅ Final model saved: {final_model_path}")
print("✅ Metadata saved: ../ml_models_pkg/model_metadata.pkl")
print("\n🚀 The model is ready for deployment!")
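
The `features` list stored in the metadata lets a consumer rebuild the column layout the model expects. The sketch below is a minimal version of that step, assuming new rows are one-hot encoded with `pd.get_dummies` as above. The helper name `align_features`, the shortened feature list, and the sample values are all hypothetical.

```python
import pandas as pd

CATEGORICAL = ["country_code", "season_name", "system_simplified"]

def align_features(raw_rows: pd.DataFrame, feature_names: list) -> pd.DataFrame:
    """One-hot encode new rows and match the column layout used at training time."""
    present = [c for c in CATEGORICAL if c in raw_rows.columns]
    encoded = pd.get_dummies(raw_rows, columns=present)
    # Dummy columns absent from the new rows are filled with 0;
    # columns the model never saw are dropped.
    return encoded.reindex(columns=feature_names, fill_value=0).astype(float)

# Hypothetical, shortened stand-in for model_metadata['features'].
feature_names = ["planting_month", "harvest_month", "area",
                 "country_code_KE", "country_code_ZM", "system_simplified_rainfed"]

# Hypothetical new observation (numeric months are an assumption).
new_row = pd.DataFrame([{
    "country_code": "ZM", "season_name": "Main", "planting_month": 11,
    "harvest_month": 4, "area": 120.0, "system_simplified": "irrigated",
}])

X_new = align_features(new_row, feature_names)
print(X_new)
```

The sketch omits `drop_first=True` because a single row holds only one category per column. Reindexing against the stored list restores the training layout instead.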

In [None]:
print(model_type)

## 7. Summary for the report

### Performance summary table

| Model | R² Score | MAE (t/ha) | RMSE (t/ha) |
|-------|----------|------------|-------------|
| Ridge Regression | see above | see above | see above |
| Random Forest | see above | see above | see above |
| Gradient Boosting | see above | see above | see above |

### Interpreting the metrics

- **R² Score**: Proportion of the variance explained by the model (1 = perfect)
- **MAE**: Mean absolute error (in tonnes per hectare)
- **RMSE**: Root mean squared error, in tonnes per hectare (penalizes large errors more heavily)
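
As a worked example of these definitions, the sketch below computes the three metrics by hand on made-up values (not results from this notebook), using only the standard library.

```python
import math

# Made-up yields in t/ha, for illustration only.
y_true = [2.0, 3.0, 4.0, 5.0]
y_pred = [2.5, 2.5, 4.0, 6.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average size of the errors.
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the average squared error.
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R²={r2:.4f}")
```

RMSE is never smaller than MAE. The gap between the two grows when a few errors are much larger than the rest.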