Capstone Project — Customer Churn Prediction (E-commerce)
by: Luthfi Muzammil

This notebook contains an end-to-end pipeline for predicting which customers will churn (stop being customers) and which will not (stay loyal), so that each group can be followed up appropriately.

Context: an e-commerce company wants to follow up with promotional activity based on whether a customer churns or not, so that it can decide which customers should receive a promo or promotional program.

Dataset: `data_ecommerce_customer_churn.csv`


Analytic Approach

The analysis proceeds in these steps:
1. Data Understanding → understand the dataset structure, missing values, and target distribution.
2. Data Cleaning → impute missing values, drop the ID column.
3. Feature Engineering → add a new feature (cashback_per_tenure).
4. Modeling → a Logistic Regression baseline, then RandomForest & GradientBoosting with hyperparameter tuning.
5. Evaluation → classification metrics (Accuracy, Precision, Recall, F1, ROC AUC, PR Curve, AP Score).
6. Feature Importance → interpretation with permutation importance.
7. Model Limitations → limitations of the approach used.
8. Conclusions & Recommendations → choose the best model and a retention strategy.


## 1. Load Data

0 = no churn
1 = churn

In [None]:
import pandas as pd
DATA_PATH = 'data_ecommerce_customer_churn.csv'
df = pd.read_csv(DATA_PATH)
df.head()

## 2. Data Understanding & Cleaning

In [None]:
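# Structure and dtypes of the dataset
df.info()
# Missing values per column (printed explicitly, since a notebook cell
# only displays the value of its last expression)
print(df.isnull().sum())
# Distribution of the target as proportions
print(df['Churn'].value_counts(normalize=True))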
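
As a small illustration of the checks above, the sketch below computes the share of missing values per column and the class balance on a toy DataFrame. The values are made up for illustration; only the column names `Tenure`, `CashbackAmount`, and `Churn` come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) reusing column names from this notebook.
toy = pd.DataFrame({
    "Tenure": [1.0, np.nan, 12.0, 0.0, 5.0],
    "CashbackAmount": [120.5, 150.0, np.nan, 99.9, 200.0],
    "Churn": [1, 0, 0, 0, 0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Class balance of the target as proportions.
class_balance = toy["Churn"].value_counts(normalize=True)

print(missing_share)
print(class_balance)
```

Looking at proportions rather than raw counts makes it easier to compare columns and to see at a glance how imbalanced the target is.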

## 3. Feature Engineering

In [None]:
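import numpy as np
# Separate the features from the target
X = df.drop(columns=['Churn'])
y = df['Churn']
# Cashback per unit of tenure; a tenure of 0 is replaced by NaN so the
# division yields NaN instead of inf (the NaN is imputed later in the pipeline)
X['cashback_per_tenure'] = X['CashbackAmount'] / X['Tenure'].replace({0: np.nan})
X.head()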
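
To show why the zero tenure is replaced before dividing, here is a minimal sketch on toy values (hypothetical numbers, same column names as the notebook). It compares the naive ratio with the guarded one.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical); the first customer has a tenure of 0.
toy = pd.DataFrame({
    "Tenure": [0.0, 2.0, 10.0],
    "CashbackAmount": [100.0, 100.0, 250.0],
})

# Naive division: a zero tenure produces inf, which scikit-learn
# imputers and scalers would typically reject.
naive = toy["CashbackAmount"] / toy["Tenure"]

# Guarded division: zero tenure becomes NaN, which an imputer can fill later.
safe = toy["CashbackAmount"] / toy["Tenure"].replace({0: np.nan})

print(naive.tolist())
print(safe.tolist())
```

The NaN produced by the guarded version is then handled by the numeric imputer in the modeling pipeline, at the cost of treating brand-new customers like an average customer for this feature.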

## 4. Modeling & Evaluation

In [None]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from scipy.stats import randint as sp_randint
import matplotlib.pyplot as plt

# Define preprocessing for numeric and categorical features
numeric_features = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

# Pipelines for numeric and categorical features with imputation
numeric_transformer = Pipeline([
	('imputer', SimpleImputer(strategy='mean')),
	('scaler', StandardScaler())
])
categorical_transformer = Pipeline([
	('imputer', SimpleImputer(strategy='most_frequent')),
	('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
	transformers=[
		('num', numeric_transformer, numeric_features),
		('cat', categorical_transformer, categorical_features)
	]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Logistic Regression baseline
logpipe = Pipeline([('preproc', preprocessor), ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))])
logpipe.fit(X_train, y_train)
y_pred = logpipe.predict(X_test)
y_proba = logpipe.predict_proba(X_test)[:,1]
print('Logistic Regression - Classification Report')
print(classification_report(y_test, y_pred))
print('ROC AUC:', roc_auc_score(y_test, y_proba))

# RandomForest
rf = RandomForestClassifier(random_state=42, class_weight='balanced')
rf_pipe = Pipeline([('preproc', preprocessor), ('clf', rf)])
param_dist = {'clf__n_estimators': sp_randint(50,200), 'clf__max_depth': sp_randint(3,15)}
rs = RandomizedSearchCV(rf_pipe, param_dist, n_iter=5, scoring='roc_auc', cv=3, random_state=42, n_jobs=-1)
rs.fit(X_train, y_train)
best_rf = rs.best_estimator_

# GradientBoosting
gb = GradientBoostingClassifier(random_state=42)
gb_pipe = Pipeline([('preproc', preprocessor), ('clf', gb)])
param_dist_gb = {'clf__n_estimators': sp_randint(50,150), 'clf__learning_rate': [0.01,0.05,0.1]}
rs_gb = RandomizedSearchCV(gb_pipe, param_dist_gb, n_iter=5, scoring='roc_auc', cv=3, random_state=42, n_jobs=-1)
rs_gb.fit(X_train, y_train)
best_gb = rs_gb.best_estimator_

# Save models for eval
models = {'Logistic': logpipe, 'RandomForest': best_rf, 'GradientBoosting': best_gb}
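
The cell above prints metrics only for the Logistic Regression baseline. One possible way to compare all fitted models side by side is a small summary table. The sketch below is an assumption-laden illustration: `summarize_models` is a hypothetical helper, and the data is synthetic (`make_classification`), not the churn dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def summarize_models(models, X_eval, y_eval):
    """Return one row per model with ROC AUC and average precision (hypothetical helper)."""
    rows = []
    for name, model in models.items():
        scores = model.predict_proba(X_eval)[:, 1]
        rows.append({
            "model": name,
            "roc_auc": roc_auc_score(y_eval, scores),
            "avg_precision": average_precision_score(y_eval, scores),
        })
    table = pd.DataFrame(rows).sort_values("roc_auc", ascending=False)
    return table.reset_index(drop=True)


# Synthetic stand-in data with roughly 82% negatives (not the churn dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.82], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_models = {
    "Logistic": LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr),
    "RandomForest": RandomForestClassifier(
        n_estimators=50, random_state=42, class_weight="balanced"
    ).fit(X_tr, y_tr),
}

summary = summarize_models(demo_models, X_te, y_te)
print(summary)
```

In the notebook itself, the same helper could be called as `summarize_models(models, X_test, y_test)`, since every entry in `models` exposes `predict_proba`.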

## 5. Evaluation Graphics

In [None]:
probas = {name: model.predict_proba(X_test)[:,1] for name, model in models.items()}

# ROC Curve
plt.figure(figsize=(8,5))
for name, prob in probas.items():
    fpr, tpr, _ = roc_curve(y_test, prob)
    plt.plot(fpr, tpr, label=f'{name}')
plt.plot([0,1],[0,1],'k--')
plt.legend(); plt.title('ROC Curve'); plt.show()

# Precision-Recall Curve
plt.figure(figsize=(8,5))
for name, prob in probas.items():
    prec, rec, _ = precision_recall_curve(y_test, prob)
    ap = average_precision_score(y_test, prob)
    plt.plot(rec, prec, label=f'{name} (AP={ap:.3f})')
plt.legend(); plt.title('Precision-Recall Curve'); plt.show()
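
The precision-recall curve can also be used to choose a decision threshold instead of the default 0.5. The sketch below picks the highest threshold that still reaches a target recall. `threshold_for_recall` is a hypothetical helper and the labels and scores are toy values, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, scores, target_recall):
    """Highest threshold whose recall is at least `target_recall` (hypothetical helper)."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # The last (precision=1, recall=0) point has no threshold, so it is skipped.
    candidates = np.where(recall[:-1] >= target_recall)[0]
    # Recall does not increase as the threshold grows, so the last
    # qualifying index corresponds to the highest qualifying threshold.
    best = candidates[-1]
    return thresholds[best], precision[best], recall[best]


# Toy labels and scores (hypothetical values).
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90])

thr, prec, rec = threshold_for_recall(y_true, scores, target_recall=0.75)
print(f"threshold={thr:.2f}, precision={prec:.2f}, recall={rec:.2f}")
```

A higher recall target lowers the threshold and flags more customers, which usually costs precision; the right trade-off depends on how expensive a promo is compared with losing a customer.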


## 6. Feature Importance

In [None]:
from sklearn.inspection import permutation_importance
import pandas as pd
import matplotlib.pyplot as plt

# Compute permutation importance on the full pipeline.
# The pipeline receives the raw columns of X_test, so one importance value
# is returned per raw input column (not per one-hot encoded column).
# With no `scoring` argument, the pipeline's default score (accuracy) is used.
ri = permutation_importance(best_rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)

# Feature names therefore come from the raw input columns
feature_names = X_test.columns.tolist()
assert len(feature_names) == len(ri.importances_mean)

# Build the feature importance DataFrame
feat_imp_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': ri.importances_mean
}).sort_values('Importance', ascending=False).reset_index(drop=True)

# Show the top 15
print("Top 15 Feature Importances (Permutation Importance):")
display(feat_imp_df.head(15))

# Plot bar chart
plt.figure(figsize=(10,6))
plt.barh(feat_imp_df['Feature'].head(15)[::-1], feat_imp_df['Importance'].head(15)[::-1])
plt.xlabel('Permutation Importance (mean decrease in score)')
plt.title('Top 15 Feature Importances — RandomForest (Clean Names)')
plt.tight_layout()
plt.show()

## 7. Limitasi Model

- The model uses only internal historical data and does not account for external factors (competitors, market trends, economic conditions).
- The model is exposed to **concept drift**, because customer behavior can change over time → periodic retraining is needed.
- **Class imbalance**: although it is handled with `class_weight`, the model can still be biased toward the majority class (the non-churn share is 0.82).
- The algorithms used are still classical and simple models (Logistic Regression, RandomForest, GradientBoosting).
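
To make the class imbalance point concrete, the sketch below builds 100 hypothetical labels with the 0.82 non-churn share mentioned above and scores a baseline that always predicts the majority class. It is an illustration on made-up labels, not a result from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 100 hypothetical customers: 82 non-churn (0) and 18 churn (1).
y_demo = np.array([0] * 82 + [1] * 18)
X_demo = np.zeros((100, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class (non-churn).
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

acc = accuracy_score(y_demo, y_hat)
rec = recall_score(y_demo, y_hat)
print(f"accuracy={acc:.2f}, churn recall={rec:.2f}")
# prints: accuracy=0.82, churn recall=0.00
```

This is why accuracy alone is a weak yardstick here, and why recall, ROC AUC, and average precision are more informative for the churn class.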


## 8. Kesimpulan & Rekomendasi

- Best model: RandomForest (highest ROC AUC).
- Recall is reasonably good → the model is effective at detecting customers at risk of churning.
- Business recommendations:
  - Focus on low-tenure customers (tenure has the highest importance).
  - Offer additional incentives to loyal customers (cashback has the second-highest importance).
  - Retrain the model periodically.
  - Add external data (competitors, market trends, and economic conditions).


## 10. Save & Load the Model with Pickle

In [None]:
import pickle

# Save the best model (here best_rf)
with open("customer_churn_model.pkl", "wb") as f:
    pickle.dump(best_rf, f)

print("Model saved as customer_churn_model.pkl")

In [None]:
# Load the saved model back
with open("customer_churn_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

# Test prediction with the loaded model
y_pred_loaded = loaded_model.predict(X_test)
print("Predictions (first 10):", y_pred_loaded[:10])
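
A quick way to gain confidence in a save/load cycle is to check that the restored model reproduces the original predictions. The sketch below does this in memory with `pickle.dumps` and `pickle.loads` on a small synthetic model; the data and model are stand-ins, not the notebook's `best_rf`.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (not the churn dataset or best_rf).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Serialize to bytes and restore, mirroring a save/load cycle without touching disk.
restored = pickle.loads(pickle.dumps(model))

# The restored model should give the same labels and probabilities.
same_labels = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
same_probas = np.allclose(model.predict_proba(X_demo), restored.predict_proba(X_demo))
print(same_labels, same_probas)
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that was used to save them.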

## 11. Prediction Worksheet from the Pickled Model

In [None]:
import pandas as pd

# Build the prediction worksheet from X_test
test_results = X_test.copy()

# Add the actual labels and the predictions from the pickled model
test_results["Actual_Churn"] = y_test.values
test_results["Predicted_Churn"] = loaded_model.predict(X_test)
test_results["Churn_Probability"] = loaded_model.predict_proba(X_test)[:,1]

# Show the first 10 rows
display(test_results.head(10))

# Save to CSV and Excel (writing .xlsx requires an Excel engine such as openpyxl)
test_results.to_csv("churn_predictions.csv", index=False)
test_results.to_excel("churn_predictions.xlsx", index=False)

print("Full prediction results saved to churn_predictions.csv and churn_predictions.xlsx")
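
Since the goal stated in the context is to decide which customers receive a promo, a worksheet like this could be turned into a target list by ranking on `Churn_Probability`. The sketch below is one possible approach on toy rows; `select_promo_targets` and the `CustomerRef` column are hypothetical, and the cut-offs are illustrative rather than recommended values.

```python
import pandas as pd

# Hypothetical scored customers; in the notebook this would be `test_results`.
scored = pd.DataFrame({
    "CustomerRef": ["a", "b", "c", "d", "e"],  # hypothetical identifier
    "Churn_Probability": [0.10, 0.85, 0.40, 0.92, 0.55],
})


def select_promo_targets(frame, top_n=None, min_probability=None):
    """Rank customers by churn probability and keep the riskiest ones (hypothetical helper)."""
    ranked = frame.sort_values("Churn_Probability", ascending=False)
    if min_probability is not None:
        # Keep only customers at or above the probability cut-off.
        ranked = ranked[ranked["Churn_Probability"] >= min_probability]
    if top_n is not None:
        # Keep only the top_n riskiest customers.
        ranked = ranked.head(top_n)
    return ranked.reset_index(drop=True)


targets = select_promo_targets(scored, top_n=2)
print(targets)
```

Using `top_n` fits a fixed promo budget, while `min_probability` fits a fixed risk cut-off; both can be combined.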