# FTS Cameroon: end-to-end ML project

Objective: predict the **funding rate** (or the **funding gap**) per *global cluster* for the following year.

This notebook covers: EDA → feature engineering → temporal split → training → walk-forward validation → interpretation → model export.
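As a quick illustration of the two candidate targets, the sketch below computes them for a single cluster-year. The figures are made up for the example, and the helper names `funding_rate` and `funding_gap` are hypothetical; the notebook itself computes the same quantities column-wise with pandas and NumPy.

```python
from typing import Optional


def funding_rate(requirements: float, funding: float) -> Optional[float]:
    """Share of the requirements that was funded.

    Returns None when requirements are not positive, mirroring the
    NaN that the notebook assigns in that case.
    """
    if requirements <= 0:
        return None
    return funding / requirements


def funding_gap(requirements: float, funding: float) -> float:
    """Unfunded amount, in the same unit as the inputs (USD here)."""
    return requirements - funding


# Made-up figures for one cluster-year (illustration only)
req, fund = 200.0, 50.0
rate = funding_rate(req, fund)
gap = funding_gap(req, fund)
print(rate, gap)
```

The rate is scale-free and comparable across clusters, whereas the gap keeps the dollar magnitude; which one to model depends on how the forecast will be used.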

In [ ]:
# 1) Core imports (no exotic dependencies)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# ML
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

plt.rcParams['figure.figsize'] = (9, 5)

In [ ]:
# 2) Loading
DATA_PATH = Path('/mnt/data/fts_requirements_funding_globalcluster_cmr.csv')  # adjust if needed
df = pd.read_csv(DATA_PATH)
df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
df = df[['year', 'clustercode', 'cluster', 'requirements', 'funding', 'percentfunded']].copy()

# Types (non-numeric values become NaN and are dropped or ignored downstream)
df['year'] = pd.to_numeric(df['year'], errors='coerce').astype('Int64')
for c in ['requirements', 'funding', 'percentfunded']:
    df[c] = pd.to_numeric(df[c], errors='coerce')
df = df.dropna(subset=['year', 'clustercode'])

# Useful targets
df['funding_rate'] = np.where(df['requirements'] > 0, df['funding'] / df['requirements'], np.nan)
df['gap'] = df['requirements'] - df['funding']
df.head()

In [ ]:
# 3) Quick EDA
annual = (df.groupby('year')[['requirements', 'funding']].sum()
            .reset_index().sort_values('year'))
annual['funding_rate'] = np.where(annual['requirements'] > 0, annual['funding'] / annual['requirements'], np.nan)
annual['gap'] = annual['requirements'] - annual['funding']
display(annual.head(10))

plt.figure()
plt.plot(annual['year'], annual['requirements'], marker='o', label='Requirements')
plt.plot(annual['year'], annual['funding'], marker='o', label='Funding')
plt.title('Requirements vs Funding (annual)')
plt.xlabel('Year')
plt.ylabel('USD')
plt.legend()
plt.tight_layout()
plt.show()

In [ ]:
# 4) Feature engineering (lags and dynamics) at the cluster-year level
def add_lags(g, cols=('requirements', 'funding', 'funding_rate', 'gap'), lags=(1, 2, 3)):
    g = g.sort_values('year').copy()
    for col in cols:
        for L in lags:
            g[f'{col}_lag{L}'] = g[col].shift(L)
    # Year-over-year changes built from past values only (lag 1 minus lag 2),
    # so the current year's funding does not leak into the features
    g['req_yoy'] = g['requirements'].shift(1) - g['requirements'].shift(2)
    g['fund_yoy'] = g['funding'].shift(1) - g['funding'].shift(2)
    # Recent slope (mean of the last two yearly changes), shifted by one year
    # for the same reason
    for col in ['requirements', 'funding', 'funding_rate']:
        g[f'{col}_slope2'] = g[col].diff().rolling(2).mean().shift(1)
    return g

# Apply per cluster; concatenating explicitly keeps the 'clustercode' column
Xy = pd.concat([add_lags(g) for _, g in df.groupby('clustercode')])
Xy = Xy.dropna(subset=['funding_rate'])

# Goal: predict the current year's funding_rate from its history (lags)
target_col = 'funding_rate'
feature_num = [c for c in Xy.columns if any(k in c for k in ['lag', 'yoy', 'slope'])]
feature_cat = ['clustercode']
Xy = Xy.dropna(subset=feature_num + [target_col]).copy()
X = Xy[feature_num + feature_cat + ['year']].copy()
y = Xy[target_col].copy()
X.head()

In [ ]:
# 5) Temporal split (train: earlier years, test: final year) + pipeline
min_year, max_year = int(X['year'].min()), int(X['year'].max())
cut_year = max_year - 1  # hold out the last year as an out-of-sample test
X_train = X[X['year'] <= cut_year].copy()
y_train = y.loc[X_train.index]
X_test = X[X['year'] > cut_year].copy()
y_test = y.loc[X_test.index]

pre = ColumnTransformer([
    ('onehot', OneHotEncoder(handle_unknown='ignore'), feature_cat)
], remainder='passthrough')

models = {
    'ridge': Ridge(alpha=1.0, random_state=42),
    'rf': RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
}

results = {}
for name, est in models.items():
    pipe = Pipeline([('pre', pre), ('est', est)])
    pipe.fit(X_train.drop(columns=['year']), y_train)
    pred = pipe.predict(X_test.drop(columns=['year']))
    mae = mean_absolute_error(y_test, pred)
    # Clipping avoids exact zeros, but MAPE stays very large when a true rate is close to zero
    mape = mean_absolute_percentage_error(y_test.clip(1e-6, None), np.clip(pred, 1e-6, None))
    results[name] = {'MAE': float(mae), 'MAPE': float(mape)}
    print(name, results[name])

best_name = min(results, key=lambda k: results[k]['MAE'])
print('\nBest by MAE:', best_name, results[best_name])
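MAPE divides each error by the true value, so a single cluster with a funding rate near zero can dominate the average even when targets are floored at a small positive value. The pure-Python sketch below illustrates this with made-up numbers; `mape` and `mae` here are hypothetical helpers written for the example, not the scikit-learn functions used above.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred, eps=1e-6):
    """Mean absolute percentage error with targets floored at eps."""
    terms = [abs(t - p) / max(abs(t), eps) for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


# Made-up funding rates; every prediction is off by 0.05 in both scenarios
y_typical = [0.50, 0.40, 0.25]
p_typical = [0.55, 0.45, 0.30]
y_with_zero = [0.50, 0.40, 0.0]   # one cluster received no funding
p_with_zero = [0.55, 0.45, 0.05]

mae_typical = mae(y_typical, p_typical)
mae_with_zero = mae(y_with_zero, p_with_zero)
mape_typical = mape(y_typical, p_typical)
mape_with_zero = mape(y_with_zero, p_with_zero)

print('MAE :', mae_typical, mae_with_zero)
print('MAPE:', mape_typical, mape_with_zero)
```

The two scenarios have the same MAE, yet the MAPE of the second is several orders of magnitude larger, which is one reason the model selection above relies on MAE.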

In [ ]:
# 6) Walk-forward validation (TimeSeriesSplit) for robustness
# TimeSeriesSplit splits by row position, so rows must be in chronological order
X_cv = X_train.sort_values('year', kind='mergesort')  # stable sort
y_cv = y_train.loc[X_cv.index]
X_cv = X_cv.drop(columns=['year'])

tscv = TimeSeriesSplit(n_splits=4)
est = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
pipe = Pipeline([('pre', pre), ('est', est)])
maes = []
for fold, (tr, va) in enumerate(tscv.split(X_cv), start=1):
    pipe.fit(X_cv.iloc[tr], y_cv.iloc[tr])
    pred = pipe.predict(X_cv.iloc[va])
    mae = mean_absolute_error(y_cv.iloc[va], pred)
    maes.append(mae)
    print(f'Fold {fold} MAE = {mae:.4f}')
print('Mean MAE (walk-forward):', float(np.mean(maes)))
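`TimeSeriesSplit` builds folds from row positions, so a fold boundary can fall in the middle of a year, leaving some clusters of that year in training and others in validation. One alternative is to build folds on whole years. The sketch below is a pure-Python illustration of that idea; `walk_forward_by_year` and `min_train_years` are hypothetical names, not part of scikit-learn, and the year list is made up.

```python
def walk_forward_by_year(years, min_train_years=3):
    """Yield (train_idx, val_idx, val_year) using whole years per fold.

    Each fold validates on a single year and trains on all rows from
    strictly earlier years. The first `min_train_years` distinct years
    are only ever used for training.
    """
    unique_years = sorted(set(years))
    for val_year in unique_years[min_train_years:]:
        train_idx = [i for i, yr in enumerate(years) if yr < val_year]
        val_idx = [i for i, yr in enumerate(years) if yr == val_year]
        yield train_idx, val_idx, val_year


# Made-up cluster-year rows, deliberately not in chronological order
years = [2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022, 2023]
folds = list(walk_forward_by_year(years, min_train_years=3))
for train_idx, val_idx, val_year in folds:
    print(val_year, train_idx, val_idx)
```

The positional indices can be passed to `.iloc` in the same way as those returned by `TimeSeriesSplit.split`, and the row order of the input no longer matters.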

In [ ]:
# 7) Simple global interpretation via permutation importance
from sklearn.inspection import permutation_importance

X_train_feat = X_train.drop(columns=['year'])
X_test_feat = X_test.drop(columns=['year'])
pipe.fit(X_train_feat, y_train)
r = permutation_importance(pipe, X_test_feat, y_test, n_repeats=10, random_state=0)

# permutation_importance permutes the pipeline's input columns, so the importances
# align with the raw columns ('clustercode' counts as one feature), not the one-hot outputs
imp = (pd.DataFrame({'feature': X_test_feat.columns, 'importance': r.importances_mean})
         .sort_values('importance', ascending=False))
imp.head(15)

In [ ]:
# 8) Export the fitted pipeline (preprocessing + model) with joblib
import joblib

joblib.dump(pipe, '/mnt/data/cmr_fts_model_pipeline.joblib')
print('Model exported -> /mnt/data/cmr_fts_model_pipeline.joblib')