Authors: Me & Sila (https://github.com/silabou)
Context: Python for Data Science class @ École Polytechnique
This repository contains a complete Machine Learning pipeline designed to predict user churn for a music streaming service. Using raw user activity logs (listening history, page visits, errors, etc.), we engineered time-series features to classify whether a user is likely to cancel their subscription within a 10-day window.
The goal was to build a robust model capable of handling class imbalance and maximizing the F1-Score on the leaderboard.
- Advanced Feature Engineering:
- Multi-Scale Rolling Windows: Calculated 3d, 7d, 14d, and 30d moving stats (sums, averages) to capture short-term vs. long-term behavior.
- Velocity & Trend Features: Engineered ratio features (e.g., Activity last 3 days / Activity last 14 days) to explicitly model "slowing down" behavior.
- Interaction Ratios: Created "Frustration" (Errors per hour) and "Engagement" (Songs per session) indices.
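The feature families above can be sketched with pandas time-based rolling windows. Column names here (`songs_played`, `errors`, `hours_listened`) are illustrative, not the project's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical daily activity log for one user (column names are illustrative).
rng = np.random.default_rng(0)
log = pd.DataFrame({
    "date": pd.date_range("2023-10-01", periods=30, freq="D"),
    "songs_played": rng.integers(0, 50, 30),
    "errors": rng.integers(0, 3, 30),
    "hours_listened": rng.uniform(0.1, 5, 30),
}).set_index("date")

# Multi-scale rolling windows: short-term vs. long-term listening volume.
for w in (3, 7, 14, 30):
    log[f"songs_sum_{w}d"] = log["songs_played"].rolling(f"{w}D").sum()

# Velocity: recent activity relative to the longer baseline ("slowing down").
log["velocity_3_14"] = log["songs_sum_3d"] / (log["songs_sum_14d"] + 1e-6)

# Frustration index: errors per hour of listening.
log["frustration"] = log["errors"] / (log["hours_listened"] + 1e-6)
```

A velocity well below its usual level flags a user whose last 3 days are quiet compared to their 14-day baseline.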
- Data Augmentation via "Snapshot Stacking":
- Instead of a single row per user, we implemented a Sliding Window Snapshot strategy.
- We generated training samples every 2 days (from Oct 7 to Nov 1). This multiplied our training size and allowed the model to learn the evolution of a user's journey, effectively turning a static classification problem into a temporal one.
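A minimal sketch of the snapshot idea, with a toy two-user event table (columns and dates are illustrative): for each cutoff date, features come only from history strictly before the cutoff, and the label is whether a cancellation occurs in the following 10 days.

```python
import pandas as pd

# Toy event log: two users, same timestamps; user 1 cancels on Oct 28.
events = pd.DataFrame({
    "user_id": [1] * 6 + [2] * 6,
    "ts": pd.to_datetime(["2023-10-01", "2023-10-05", "2023-10-09",
                          "2023-10-15", "2023-10-20", "2023-10-28"] * 2),
    "cancelled": [0, 0, 0, 0, 0, 1,  0, 0, 0, 0, 0, 0],
})

snapshots = pd.date_range("2023-10-07", "2023-11-01", freq="2D")
rows = []
for cutoff in snapshots:
    for uid, grp in events.groupby("user_id"):
        hist = grp[grp["ts"] < cutoff]  # features: strictly before the cutoff
        future = grp[(grp["ts"] >= cutoff) &
                     (grp["ts"] < cutoff + pd.Timedelta(days=10))]
        if hist.empty:
            continue
        rows.append({
            "user_id": uid,
            "snapshot": cutoff,
            "n_events_hist": len(hist),            # stand-in for real features
            "churn_next_10d": int(future["cancelled"].any()),
        })
train = pd.DataFrame(rows)
```

The same user now contributes one row per snapshot, some labeled "safe" and some "at risk" depending on where the cutoff falls relative to the cancellation.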
- Robust Modeling Pipeline:
- Feature Selection: Used `SelectFromModel` with a base XGBoost estimator to prune noise and retain only the top predictive features.
- Algorithm: Single XGBoost Classifier (optimized with the `hist` tree method for speed).
- Grouped Cross-Validation: Trained a 5-fold ensemble (grouped by UserID to prevent leakage) and averaged the predictions to reduce variance.
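The selection + grouped-ensemble steps can be sketched as follows. This uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (to keep the sketch dependency-light), and synthetic data in place of the real snapshot matrix:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GroupKFold

# Synthetic snapshot matrix: 80 users x 5 snapshots, 20 features (2 useful).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)
groups = np.repeat(np.arange(80), 5)  # group label = user id

# Prune noisy features: keep only those above the mean importance.
selector = SelectFromModel(GradientBoostingClassifier(random_state=0)).fit(X, y)
X_sel = selector.transform(X)

# 5-fold ensemble grouped by user id; keep one model per fold.
fold_models = []
for tr, va in GroupKFold(n_splits=5).split(X_sel, y, groups):
    model = GradientBoostingClassifier(random_state=0).fit(X_sel[tr], y[tr])
    fold_models.append(model)

# Final prediction: average the 5 fold models' probabilities.
avg_proba = np.mean([m.predict_proba(X_sel)[:, 1] for m in fold_models], axis=0)
```

Averaging the fold models trades a little bias for a sizable variance reduction compared to a single refit on all data.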
- Automated Tuning: Used Optuna for Bayesian optimization of the XGBoost hyperparameters (learning rate, depth, L1/L2 regularization), optimizing specifically for AUC.
We approached this problem iteratively to improve performance and stability:
- Baseline & Feature Engineering:
- We started by aggregating logs per user. However, simple aggregates failed to capture the speed of churn.
- Pivot: We introduced "Velocity" features (ratios of short-term vs long-term windows) to detect sudden drops in activity.
- Addressing Data Scarcity (The "Snapshot" Shift):
- Problem: With a limited number of unique users, a simple "one row per user" model was overfitting and lacked sufficient training examples.
- Solution: We moved to a Stacked Snapshot Dataset. By taking a snapshot of every user every 2 days and predicting churn in the next 10 days, we increased our dataset size by ~13x. This helped the model distinguish between a "safe" period and a "risk" period for the same user.
- Leakage Prevention:
- Strictly separated feature calculation (history) from the target window.
- Used `GroupKFold` during validation to ensure that the same user (appearing in multiple snapshots) never appeared in both the Train and Validation sets simultaneously.
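The no-leakage guarantee is easy to check directly: with the user id as the group label, `GroupKFold` keeps all of a user's snapshots on the same side of every split. A minimal check on toy data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# 10 users x 4 snapshots each: the group label is the user id.
groups = np.repeat(np.arange(10), 4)
X, y = np.zeros((40, 1)), np.zeros(40)

leaked = False
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # A user id appearing on both sides of a split would be leakage.
    if set(groups[train_idx]) & set(groups[val_idx]):
        leaked = True
```

A plain `KFold` on the same data would scatter a user's snapshots across folds, letting the model memorize users instead of learning churn behavior.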
- Final Calibration (Dynamic Thresholding):
- Instead of a standard 0.5 threshold, we implemented a Target-Rate Calibration.
- The final submission dynamically selects the probability threshold that forces the predicted churn rate to match the expected population churn (~40%), ensuring the predictions are aligned with the business reality.
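Target-rate calibration reduces to a quantile of the predicted probabilities: to flag 40% of users as churners, threshold at the 60th percentile. A minimal sketch with uniform dummy probabilities:

```python
import numpy as np

def calibrate_threshold(proba, target_rate=0.40):
    """Threshold such that `target_rate` of samples are flagged positive."""
    return np.quantile(proba, 1.0 - target_rate)

# Dummy probabilities standing in for the ensemble's output.
rng = np.random.default_rng(0)
proba = rng.uniform(size=1000)

thr = calibrate_threshold(proba)
preds = (proba >= thr).astype(int)  # predicted churn rate ~= 40% by construction
```

Unlike a fixed 0.5 cutoff, this stays aligned with the expected population churn rate even if the model's probabilities are systematically too high or too low.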
The pipeline generates comprehensive performance visualizations:
- ROC & Precision-Recall Curves: To evaluate the trade-off between True Positives and False Positives.
- Cumulative Gains Curve (The "Banana" Plot): Explicit visualization of how much better the model is compared to random guessing (e.g., "Top 40% of predictions capture X% of churners").
- Submission Output: A probability file (`submission_final_binary.csv`) generated by the 5-fold ensemble.
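The cumulative gains ("banana") curve above is just positives captured versus fraction of users targeted, after sorting by predicted probability. A small self-contained sketch on toy labels:

```python
import numpy as np

def cumulative_gains(y_true, proba):
    """Fraction of all churners captured within the top-k% of predictions."""
    order = np.argsort(-proba)                       # highest risk first
    gains = np.cumsum(y_true[order]) / y_true.sum()  # churners captured so far
    frac = np.arange(1, len(y_true) + 1) / len(y_true)
    return frac, gains

# Toy example: 3 churners among 10 users, all ranked near the top.
y = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
p = np.array([0.9, 0.2, 0.8, 0.1, 0.3, 0.7, 0.4, 0.05, 0.15, 0.25])
frac, gains = cumulative_gains(y, p)
```

Plotting `gains` against `frac` (with the diagonal as the random-guessing baseline) gives the curve; in this toy case the top 30% of predictions already capture all three churners.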
This project is part of the academic coursework at École Polytechnique.