Authors: Me & Sila (https://github.com/silabou)
Context: Python for Data Science class @ École Polytechnique
This repository contains a complete Machine Learning pipeline designed to predict user churn for a music streaming service. Using raw user activity logs (listening history, page visits, errors, etc.), we engineered time-series features to classify whether a user is likely to cancel their subscription within a 10-day window.
The goal was to build a robust model capable of handling class imbalance and maximizing the F1-Score on the leaderboard.
- Advanced Feature Engineering:
- Multi-Scale Rolling Windows: Calculated 3d, 7d, 14d, and 30d moving stats (sums, averages) to capture short-term vs. long-term behavior.
- Velocity & Trend Features: Engineered ratio features (e.g., Activity last 3 days / Activity last 14 days) to explicitly model "slowing down" behavior.
- Interaction Ratios: Created "Frustration" (Errors per hour) and "Engagement" (Songs per session) indices.
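The feature families above can be sketched with pandas time-based rolling windows. Column names here (`songs_played`, `errors`, `hours_listened`) are illustrative, not the project's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical daily activity log for one user (column names are illustrative).
rng = np.random.default_rng(0)
log = pd.DataFrame({
    "date": pd.date_range("2023-10-01", periods=30, freq="D"),
    "songs_played": rng.integers(0, 50, 30),
    "errors": rng.integers(0, 3, 30),
    "hours_listened": rng.uniform(0.1, 5, 30),
}).set_index("date")

# Multi-scale rolling windows: short-term vs. long-term listening volume.
for w in (3, 7, 14, 30):
    log[f"songs_sum_{w}d"] = log["songs_played"].rolling(f"{w}D").sum()

# Velocity: recent activity relative to the longer baseline ("slowing down").
log["velocity_3_14"] = log["songs_sum_3d"] / (log["songs_sum_14d"] + 1e-6)

# Frustration index: errors per hour of listening.
log["frustration"] = log["errors"] / (log["hours_listened"] + 1e-6)
```

A velocity well below its usual level flags a user whose last 3 days are quiet compared to their 14-day baseline.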
- Data Augmentation via "Snapshot Stacking":
- Instead of a single row per user, we implemented a Sliding Window Snapshot strategy.
- We generated training samples every 2 days (from Oct 7 to Nov 1). This multiplied our training size and allowed the model to learn the evolution of a user's journey, effectively turning a static classification problem into a temporal one.
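A minimal sketch of the snapshot idea, with a toy two-user event table (columns and dates are illustrative): for each cutoff date, features come only from history strictly before the cutoff, and the label is whether a cancellation occurs in the following 10 days.

```python
import pandas as pd

# Toy event log: two users, same timestamps; user 1 cancels on Oct 28.
events = pd.DataFrame({
    "user_id": [1] * 6 + [2] * 6,
    "ts": pd.to_datetime(["2023-10-01", "2023-10-05", "2023-10-09",
                          "2023-10-15", "2023-10-20", "2023-10-28"] * 2),
    "cancelled": [0, 0, 0, 0, 0, 1,  0, 0, 0, 0, 0, 0],
})

snapshots = pd.date_range("2023-10-07", "2023-11-01", freq="2D")
rows = []
for cutoff in snapshots:
    for uid, grp in events.groupby("user_id"):
        hist = grp[grp["ts"] < cutoff]  # features: strictly before the cutoff
        future = grp[(grp["ts"] >= cutoff) &
                     (grp["ts"] < cutoff + pd.Timedelta(days=10))]
        if hist.empty:
            continue
        rows.append({
            "user_id": uid,
            "snapshot": cutoff,
            "n_events_hist": len(hist),            # stand-in for real features
            "churn_next_10d": int(future["cancelled"].any()),
        })
train = pd.DataFrame(rows)
```

The same user now contributes one row per snapshot, some labeled "safe" and some "at risk" depending on where the cutoff falls relative to the cancellation.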
- Robust Modeling Pipeline:
- Feature Selection: Used `SelectFromModel` with a base XGBoost estimator to prune noise and retain only the top predictive features.
- Algorithm: Single XGBoost Classifier (optimized with the `hist` tree method for speed).
- Grouped Cross-Validation: Trained a 5-fold ensemble (grouped by UserID to prevent leakage) and averaged the predictions to reduce variance.
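The selection + grouped-ensemble steps can be sketched as follows. This uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (to keep the sketch dependency-light), and synthetic data in place of the real snapshot matrix:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GroupKFold

# Synthetic snapshot matrix: 80 users x 5 snapshots, 20 features (2 useful).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)
groups = np.repeat(np.arange(80), 5)  # group label = user id

# Prune noisy features: keep only those above the mean importance.
selector = SelectFromModel(GradientBoostingClassifier(random_state=0)).fit(X, y)
X_sel = selector.transform(X)

# 5-fold ensemble grouped by user id; keep one model per fold.
fold_models = []
for tr, va in GroupKFold(n_splits=5).split(X_sel, y, groups):
    model = GradientBoostingClassifier(random_state=0).fit(X_sel[tr], y[tr])
    fold_models.append(model)

# Final prediction: average the 5 fold models' probabilities.
avg_proba = np.mean([m.predict_proba(X_sel)[:, 1] for m in fold_models], axis=0)
```

Averaging the fold models trades a little bias for a sizable variance reduction compared to a single refit on all data.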
- Automated Tuning: Used Optuna for Bayesian optimization of the XGBoost hyperparameters (learning rate, depth, L1/L2 regularization), optimizing specifically for AUC.
We approached this problem iteratively to improve performance and stability:
- Baseline & Feature Engineering:
- We started by aggregating logs per user. However, simple aggregates failed to capture the speed of churn.
- Pivot: We introduced "Velocity" features (ratios of short-term vs long-term windows) to detect sudden drops in activity.
- Addressing Data Scarcity (The "Snapshot" Shift):
- Problem: With a limited number of unique users, a simple "one row per user" model was overfitting and lacked sufficient training examples.
- Solution: We moved to a Stacked Snapshot Dataset. By taking a snapshot of every user every 2 days and predicting churn in the next 10 days, we increased our dataset size by ~13x. This helped the model distinguish between a "safe" period and a "risk" period for the same user.
- Leakage Prevention:
- Strictly separated feature calculation (history) from the target window.
- Used `GroupKFold` during validation to ensure that the same user (appearing in multiple snapshots) never appeared in both the Train and Validation sets simultaneously.
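The no-leakage guarantee is easy to check directly: with the user id as the group label, `GroupKFold` keeps all of a user's snapshots on the same side of every split. A minimal check on toy data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 10 users x 4 snapshots each: the group label is the user id.
groups = np.repeat(np.arange(10), 4)
X, y = np.zeros((40, 1)), np.zeros(40)

leaked = False
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # A user id appearing on both sides of a split would be leakage.
    if set(groups[train_idx]) & set(groups[val_idx]):
        leaked = True
```

A plain `KFold` on the same data would scatter a user's snapshots across folds, letting the model memorize users instead of learning churn behavior.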
- Final Calibration (Dynamic Thresholding):
- Instead of a standard 0.5 threshold, we implemented a Target-Rate Calibration.
- The final submission dynamically selects the probability threshold that forces the predicted churn rate to match the expected population churn (~40%), ensuring the predictions are aligned with the business reality.
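Target-rate calibration reduces to a quantile of the predicted probabilities: to flag 40% of users as churners, threshold at the 60th percentile. A minimal sketch with uniform dummy probabilities:

```python
import numpy as np

def calibrate_threshold(proba, target_rate=0.40):
    """Threshold such that `target_rate` of samples are flagged positive."""
    return np.quantile(proba, 1.0 - target_rate)

# Dummy probabilities standing in for the ensemble's output.
rng = np.random.default_rng(0)
proba = rng.uniform(size=1000)

thr = calibrate_threshold(proba)
preds = (proba >= thr).astype(int)  # predicted churn rate ~= 40% by construction
```

Unlike a fixed 0.5 cutoff, this stays aligned with the expected population churn rate even if the model's probabilities are systematically too high or too low.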
The pipeline generates comprehensive performance visualizations:
- ROC & Precision-Recall Curves: To evaluate the trade-off between True Positives and False Positives.
- Cumulative Gains Curve (The "Banana" Plot): Explicit visualization of how much better the model is compared to random guessing (e.g., "Top 40% of predictions capture X% of churners").
- Submission Output: A probability file (`submission_final_binary.csv`) generated by the 5-fold ensemble.
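The cumulative gains ("banana") curve above is just positives captured versus fraction of users targeted, after sorting by predicted probability. A small self-contained sketch on toy labels:

```python
import numpy as np

def cumulative_gains(y_true, proba):
    """Fraction of all churners captured within the top-k% of predictions."""
    order = np.argsort(-proba)                       # highest risk first
    gains = np.cumsum(y_true[order]) / y_true.sum()  # churners captured so far
    frac = np.arange(1, len(y_true) + 1) / len(y_true)
    return frac, gains

# Toy example: 3 churners among 10 users, all ranked near the top.
y = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
p = np.array([0.9, 0.2, 0.8, 0.1, 0.3, 0.7, 0.4, 0.05, 0.15, 0.25])
frac, gains = cumulative_gains(y, p)
```

Plotting `gains` against `frac` (with the diagonal as the random-guessing baseline) gives the curve; in this toy case the top 30% of predictions already capture all three churners.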
This project is part of the academic coursework at École Polytechnique.