
# Marketing Campaign Analysis: Step-by-Step Guided Notebook

**Learning objectives:**
1. Quantify the effectiveness of the marketing campaigns (KPIs, segments).
2. Understand and justify a customer segmentation (clustering + RFM).
3. Build a model that predicts the **response** (`Response`) to the most recent campaign.
4. Tailor the presentation to the business stakes (actionable recommendations).

**Tooling:** Python + Jupyter Notebook, with standard dependencies (pandas, numpy, scikit-learn, matplotlib).


## 1. Setup & Data Loading

In [None]:

# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Display options
pd.set_option("display.max_columns", 100)

# Data path
DATA_PATH = Path("/mnt/data/Camp_Market.csv")

# Load the data (sep=None lets the python engine sniff the delimiter)
df = pd.read_csv(DATA_PATH, sep=None, engine="python")
print(df.shape)
df.head(5)


## 2. Data structure (dtypes, missing values, cardinalities)

In [None]:

shape = df.shape
dtypes = df.dtypes.astype(str).to_frame("dtype")
missing = df.isna().sum().to_frame("missing")
missing["missing_%"] = (df.isna().mean()*100).round(2)
cardinality = df.nunique().to_frame("n_unique")

summary = dtypes.join(missing).join(cardinality).sort_values(by=["missing_%","n_unique"], ascending=[False, False])
summary.head(30)


In [None]:

print("First 20 rows of the summary (columns sorted by missing rate):")
summary.iloc[:20]


## 3. Minimal cleaning & useful conversions

In [None]:

# Working copy
data = df.copy()

# Normalize column names (strip surrounding spaces, replace inner spaces).
data.columns = [c.strip().replace(" ", "_") for c in data.columns]

# Parse the date column when present (day-first format assumed)
if "Dt_Customer" in data.columns:
    data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], errors="coerce", dayfirst=True)

# Cast the likely binary columns (AcceptedCmp*, Response) to nullable integers
for c in data.columns:
    if c.lower().startswith("acceptedcmp") or c.lower()=="response":
        data[c] = pd.to_numeric(data[c], errors="coerce").astype("Int64")

# Post-cleaning preview
data.head(3)


## 4. KPIs: Campaign effectiveness

In [None]:

accepted_cols = [c for c in data.columns if c.lower().startswith("acceptedcmp")]
response_col = "Response" if "Response" in data.columns else None

kpi = {}
kpi["n_clients"] = len(data)

# Revenue / customer value (if Mnt* columns are present)
mnt_cols = [c for c in data.columns if c.startswith("Mnt")]
if mnt_cols:
    data["TotalSpend"] = data[mnt_cols].sum(axis=1)
    kpi["CA_total_estime"] = float(data["TotalSpend"].sum())
    kpi["CA_median_client"] = float(data["TotalSpend"].median())

# Median income, if available
if "Income" in data.columns:
    kpi["Revenu_median"] = float(data["Income"].median())

# Acceptance rate per past campaign, plus conversion rate of the most recent campaign
for c in accepted_cols:
    ser = pd.to_numeric(data[c], errors="coerce")
    if ser.notna().any():
        kpi[f"Taux_acceptation_{c}"] = ser.mean()

if response_col:
    kpi["Taux_conversion_campagne_recente"] = data[response_col].mean()

pd.Series(kpi).to_frame("Valeur")


In [None]:

# Simple plot of the distribution of the most recent response (if available)
if response_col:
    counts = data[response_col].value_counts(dropna=False).sort_index()
    plt.figure()
    counts.plot(kind="bar")
    plt.title("Distribution of the most recent response (Response)")
    plt.xlabel("Response")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()


## 5. Feature Engineering: explanatory variables

In [None]:

from datetime import datetime

fe = data.copy()

# Age
if "Year_Birth" in fe.columns:
    fe["Age"] = datetime.now().year - fe["Year_Birth"]

# Children in the household
if set(["Kidhome","Teenhome"]).issubset(fe.columns):
    fe["KidsTotal"] = fe["Kidhome"] + fe["Teenhome"]
    fe["HasKids"] = (fe["KidsTotal"] > 0).astype(int)

# Customer seniority (if the date is available)
if "Dt_Customer" in fe.columns:
    fe["CustomerSeniority_days"] = (pd.Timestamp.now().normalize() - fe["Dt_Customer"]).dt.days

# Monetary aggregates
mnt_cols = [c for c in fe.columns if c.startswith("Mnt")]
if mnt_cols:
    fe["TotalSpend"] = fe[mnt_cols].sum(axis=1)

# Purchase frequency by channel
channel_cols = ["NumWebPurchases","NumCatalogPurchases","NumStorePurchases"]
if set(channel_cols).issubset(fe.columns):
    fe["TotalPurchases"] = fe[channel_cols].sum(axis=1)
    for c in channel_cols:
        fe[c.replace("Num","Share_")] = fe[c] / fe["TotalPurchases"].replace(0, np.nan)

# Average basket: spend per purchase (NaN when no purchase is recorded)
if {"TotalSpend","TotalPurchases"}.issubset(fe.columns):
    fe["AvgBasket"] = fe["TotalSpend"] / fe["TotalPurchases"].replace(0, np.nan)

# RFM proxy
if "Recency" in fe.columns:
    fe["R_recency"] = fe["Recency"]
if "TotalPurchases" in fe.columns:
    fe["F_frequency"] = fe["TotalPurchases"]
if "TotalSpend" in fe.columns:
    fe["M_monetary"] = fe["TotalSpend"]

fe.head(5)
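
The R/F/M proxies above are raw values. One common way to make them comparable is to bin each into a rank-based score. The sketch below assumes a toy frame that reuses the column names `R_recency`, `F_frequency` and `M_monetary`. The helper `rfm_scores` and the 1 to 5 scale are illustrative choices, not part of the notebook.

```python
import pandas as pd

def rfm_scores(frame: pd.DataFrame, n_bins: int = 5) -> pd.DataFrame:
    """Hypothetical helper: rank-based scores from 1 to n_bins for R, F and M."""
    out = pd.DataFrame(index=frame.index)
    # Rank first so that tied values cannot create duplicate bin edges in qcut.
    ranks = {
        # A low recency (recent purchase) should earn a high score.
        "R_score": frame["R_recency"].rank(method="first", ascending=False),
        "F_score": frame["F_frequency"].rank(method="first"),
        "M_score": frame["M_monetary"].rank(method="first"),
    }
    for name, r in ranks.items():
        out[name] = pd.qcut(r, n_bins, labels=False) + 1
    out["RFM_sum"] = out[["R_score", "F_score", "M_score"]].sum(axis=1)
    return out

# Toy data, made up for the illustration
toy = pd.DataFrame({
    "R_recency":   [5, 40, 90, 10, 60, 75, 20, 33, 99, 1],
    "F_frequency": [20, 4, 1, 15, 3, 2, 12, 8, 1, 25],
    "M_monetary":  [1500, 200, 30, 900, 120, 60, 700, 400, 15, 2100],
})
scores = rfm_scores(toy)
print(scores)
```

On the real data the same call would be `rfm_scores(fe)`, provided the three proxy columns exist and contain no missing values.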


## 6. Visual diagnostics (distributions, correlations)

In [None]:

# Correlations between numeric columns
num_cols = fe.select_dtypes(include=[np.number]).columns.tolist()
if len(num_cols) >= 2:
    corr = fe[num_cols].corr(numeric_only=True)
    plt.figure(figsize=(8,6))
    plt.imshow(corr, interpolation="nearest")
    plt.xticks(range(len(num_cols)), num_cols, rotation=90)
    plt.yticks(range(len(num_cols)), num_cols)
    plt.title("Correlation matrix (numeric columns)")
    plt.colorbar()
    plt.tight_layout()
    plt.show()


## 7. Customer segmentation: Clustering (KMeans)

In [None]:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Clustering features (adapt to the context)
cluster_features = []
for col in ["R_recency","F_frequency","M_monetary","Share_WebPurchases","Share_CatalogPurchases","Share_StorePurchases","Age","Income","CustomerSeniority_days"]:
    if col in fe.columns:
        cluster_features.append(col)

# Median imputation: filling with 0 would treat a missing Income as a zero income
X = fe[cluster_features].copy()
X = X.fillna(X.median())

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Heuristics for choosing k (inertia, silhouette)
inertias = []
sils = []
ks = list(range(2,8))
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil = silhouette_score(X_scaled, labels)
    sils.append(sil)

plt.figure()
plt.plot(ks, inertias, marker='o')
plt.title("Elbow method (inertia vs k)")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.tight_layout()
plt.show()

plt.figure()
plt.plot(ks, sils, marker='o')
plt.title("Silhouette vs k")
plt.xlabel("k")
plt.ylabel("Silhouette")
plt.tight_layout()
plt.show()

# Fix k (here: the value with the best silhouette score)
best_k = ks[int(np.argmax(sils))]
kmeans = KMeans(n_clusters=best_k, n_init=20, random_state=42)
fe["Cluster"] = kmeans.fit_predict(X_scaled)

fe.groupby("Cluster")[cluster_features].median().sort_index()


## 8. Campaign effectiveness by segment

In [None]:

# Acceptance rate by segment (past campaigns and the most recent one)
tab = {}
accepted_cols = [c for c in fe.columns if c.lower().startswith("acceptedcmp")]
for c in accepted_cols + (["Response"] if "Response" in fe.columns else []):
    if c in fe.columns:
        tab[c] = fe.groupby("Cluster")[c].mean()

camp_by_seg = pd.DataFrame(tab).sort_index()
camp_by_seg
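
Raw rates per cluster are easier to read next to the overall rate and the segment size. The sketch below uses made-up toy data with the same `Cluster` and `Response` column names. The `index_vs_overall` column is an illustrative name for the ratio of segment rate to overall rate.

```python
import pandas as pd

# Toy data, made up for the illustration
toy = pd.DataFrame({
    "Cluster":  [0, 0, 0, 0, 1, 1, 1, 1, 2, 2],
    "Response": [1, 1, 0, 0, 0, 0, 0, 1, 0, 0],
})

overall_rate = toy["Response"].mean()

# Segment size and response rate, then the rate relative to the overall rate
seg = toy.groupby("Cluster")["Response"].agg(n="size", rate="mean")
seg["index_vs_overall"] = seg["rate"] / overall_rate

print(seg.sort_values("index_vs_overall", ascending=False))
```

An index above 1 marks a segment that responds better than average. Small segments deserve caution, since their rate rests on few customers.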


## 9. Modeling: predicting the customer response (`Response`)

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

if "Response" in fe.columns:
    y = fe["Response"].fillna(0).astype(int)  # assumption: NaN -> no response
    # Feature selection (categorical + numeric); the target and the row identifier are left out
    cat_cols = [c for c in fe.select_dtypes(include=["object"]).columns if c not in ["Dt_Customer"]]
    num_cols = [c for c in fe.select_dtypes(include=[np.number]).columns if c not in ["Response", "ID"]]

    X = fe[cat_cols + num_cols].copy()
    # Nullable Int64 -> float so that pd.NA becomes a plain NaN for scikit-learn
    X = X.astype({c: float for c in num_cols})
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

    # Numeric branch: median imputation (LogisticRegression rejects NaN), then scaling
    num_pipe = Pipeline([("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())])
    preproc = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", num_pipe, num_cols)
    ])

    models = {
        "LogReg": LogisticRegression(max_iter=1000),
        "RandomForest": RandomForestClassifier(n_estimators=300, random_state=42)
    }

    results = {}
    for name, model in models.items():
        pipe = Pipeline([("pre", preproc), ("model", model)])
        pipe.fit(X_train, y_train)
        proba = pipe.predict_proba(X_test)[:,1]
        auc = metrics.roc_auc_score(y_test, proba)
        pr_auc = metrics.average_precision_score(y_test, proba)
        results[name] = {"ROC_AUC": auc, "PR_AUC": pr_auc, "pipeline": pipe, "proba": proba, "y_test": y_test}

    # Printed explicitly: Jupyter does not display a bare expression inside an if-block
    print(pd.DataFrame({k: {m: v[m] for m in ["ROC_AUC","PR_AUC"]} for k,v in results.items()}))
else:
    print("No 'Response' column found; propensity modeling is skipped.")


In [None]:

# ROC & PR curves for the best model (by PR_AUC)
if "Response" in fe.columns:
    best_name = max(results, key=lambda k: results[k]["PR_AUC"])
    best = results[best_name]
    proba = best["proba"]
    y_true = best["y_test"]

    fpr, tpr, _ = metrics.roc_curve(y_true, proba)
    prec, rec, _ = metrics.precision_recall_curve(y_true, proba)

    plt.figure()
    plt.plot(fpr, tpr)
    plt.plot([0,1], [0,1], linestyle="--")
    plt.title(f"ROC — {best_name}")
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.tight_layout()
    plt.show()

    plt.figure()
    plt.plot(rec, prec)
    plt.title(f"Precision-Recall — {best_name}")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.tight_layout()
    plt.show()

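    # Response rate and lift by score decile (decile 1 = highest predicted probabilities)
    deciles = pd.qcut(proba, 10, labels=False, duplicates="drop")
    rate = pd.DataFrame({"decile": deciles.max() - deciles + 1, "y": y_true}).groupby("decile")["y"].mean().sort_index()
    lift_table = pd.DataFrame({"response_rate": rate, "lift": rate / y_true.mean()})
    print(lift_table)  # printed explicitly: a bare expression inside an if-block is not displayed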


## 10. Explainability (variable importance)

In [None]:

# Permutation importance of the RandomForest pipeline, computed on the raw input columns
if "Response" in fe.columns:
    from sklearn.inspection import permutation_importance
    best_name = "RandomForest"
    if best_name in results:
        best_pipe = results[best_name]["pipeline"]
        y_test = results[best_name]["y_test"]
        # Select test rows by index label (.loc) and keep only the columns the pipeline was fitted on
        X_test = fe.loc[y_test.index, cat_cols + num_cols].astype({c: float for c in num_cols})
        r = permutation_importance(best_pipe, X_test, y_test, n_repeats=5, random_state=42, scoring="roc_auc")
        # Label each importance with its column name
        imp = pd.Series(r.importances_mean, index=X_test.columns).sort_values(ascending=False).head(20)
        plt.figure(figsize=(6,5))
        imp[::-1].plot(kind="barh")
        plt.title("Top 20 features by permutation importance")
        plt.tight_layout()
        plt.show()



## 11. Business recommendations (to be completed after reading the results)
- Priority targeting: segments/clusters with the highest response rate or the highest `M` value.
- Recommended activation channel: based on the purchase shares `Share_*` and the performance by segment.
- Marketing pressure: adjust according to `Recency` and `NumWebVisitsMonth` to avoid over-solicitation.
- Personalized offer: dominant `Mnt*` categories per segment.
- KPI tracking: acceptance rate, model **PR AUC**, **decile 1 lift**, ROI (if `Z_CostContact`/`Z_Revenue` are available).

> During the oral presentation: go from KPIs → segment insights → model → action plan.
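
The ROI item in the list above can be made concrete by comparing the profit at several targeting depths. This is a minimal sketch under two assumptions: a flat cost per contact and a flat revenue per positive response, which is one plausible reading of `Z_CostContact` and `Z_Revenue`. The helper `campaign_profit_by_depth`, the toy scores and the values 3 and 11 are placeholders for illustration.

```python
import numpy as np

def campaign_profit_by_depth(y_true, proba, cost_per_contact, revenue_per_response,
                             depths=(0.1, 0.2, 0.5, 1.0)):
    """Hypothetical helper: profit when only the top-scored share is contacted."""
    y = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(proba, dtype=float))  # highest score first
    profits = {}
    for depth in depths:
        n = max(1, int(round(depth * len(y))))   # number of customers contacted
        responders = y[order[:n]].sum()          # positive responses among them
        profits[depth] = float(responders * revenue_per_response - n * cost_per_contact)
    return profits

# Toy labels and scores, made up for the illustration
y_demo = np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
p_demo = np.array([0.9, 0.2, 0.1, 0.8, 0.3, 0.05, 0.4, 0.15, 0.6, 0.25])

profits = campaign_profit_by_depth(y_demo, p_demo, cost_per_contact=3, revenue_per_response=11)
print(profits)
```

With the notebook's objects, the inputs would be `best["y_test"]` and `best["proba"]`. The depth with the highest profit gives a candidate contact volume to discuss with the business.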


## 12. Appendix: utility functions

In [None]:

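def top_decile_lift(y_true, proba):
    """Response rate in the top 10% of scores divided by the overall response rate."""
    import numpy as np
    # Work on plain arrays so that Series, lists and ndarrays are all accepted
    y = np.asarray(y_true, dtype=float)
    order = np.argsort(-np.asarray(proba, dtype=float))  # highest score first
    n = max(1, int(0.1 * len(y)))
    rate_top = y[order[:n]].mean()
    rate_all = y.mean()
    return rate_top / rate_all if rate_all > 0 else np.nan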
