## Predicting Teacher Churn in Yandex Uchebnik: A Short-Term Retention Forecasting System
*Preventing churn before it happens — by predicting which teachers are likely to leave the platform within 1 week or 1 month*

### Overview

#### - Problem definition

At Yandex Uchebnik (Yandex’s educational platform for K–12 teachers), retaining active users is critical to maintaining engagement and product value. Teachers who stop using the platform represent lost opportunities for impact, feedback, and long-term growth.

This project aims to build a short-term churn prediction system that forecasts whether a teacher will stop actively using the platform within:

- 1 week (`week_churn`): immediate risk
- 1 month (`month_churn`): medium-term risk

These predictions are based on historical behavioral data, transformed into numerical features (e.g., activity frequency, content interaction patterns, session duration aggregates, and behavioral embeddings).

The output is two independent probability scores per teacher, enabling targeted interventions such as personalized emails, feature recommendations, or support outreach, all before the user disengages.

#### - Data

All raw events (logins, lesson views, assignment submissions, etc.) and text interactions have been preprocessed into anonymized, tabular numeric features. For each teacher snapshot, we have:

- nid: Unique teacher ID (may repeat across dates)
- report_date: Date of the snapshot (when the feature vector was generated) 
- v_0 ... v_N: Numerical features derived from aggregated activity and behavioral embeddings
- week_churn: Binary target: 1 = teacher churned within 7 days after report_date, 0 = did not churn
- month_churn: Binary target: 1 = teacher churned within 30 days after report_date, 0 = did not churn

> Note: Only train.csv contains labels. test.csv contains only features and IDs — used for final prediction. 

Data is fully anonymized and provided in CSV format with comma delimiters.
The datasets can be accessed via the links below.
- Train dataset [train.csv]()
- Test dataset [test.csv]()

#### - Evaluation metric - AUC-ROC

Imagine you’re a doctor trying to predict which patients are at high risk of developing a disease.

You don’t just want to say “yes” or “no” — you want to assign a **risk score** (like 0.85 = very high risk, 0.12 = low risk). The goal is to rank patients correctly: those who actually get sick should have higher scores than those who don’t.

ROC-AUC measures **how well the model ranks positive cases (churners) above negative ones (non-churners)** — without forcing you to pick a specific cutoff.
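The ranking interpretation can be made concrete with a minimal sketch: ROC-AUC equals the fraction of (churner, non-churner) pairs in which the churner receives the higher score, with ties counted as half. The helper name `pairwise_auc` and the toy labels and scores are made up for illustration and are not part of the project code.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): 1 = churned, 0 = stayed
y_true = [1, 0, 1, 0, 0]
scores = [0.85, 0.12, 0.40, 0.55, 0.30]

auc = pairwise_auc(y_true, scores)
print(f"{auc:.4f}")  # 5 of 6 pairs are ordered correctly -> 0.8333
```

The quadratic pair loop is only meant to show the definition; for real data `sklearn.metrics.roc_auc_score` computes the same quantity efficiently.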

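It ranges from **0 to 1**; any model worth using lands between 0.5 and 1.0. Rough rules of thumb:
- **< 0.5** = Worse than random (the ranking is systematically inverted)
- **0.5** = Random guessing (like flipping a coin)
- **0.5–0.7** = Weak performance
- **0.7–0.8** = Fair performance
- **0.8–0.9** = Good performance
- **>0.9** = Excellent performance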

In our case, since we have **two separate targets** (`week_churn` and `month_churn`), we compute **ROC-AUC for each independently**, then take the **macro average** (the simple mean of the two scores), so both tasks count equally.
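A minimal sketch of this metric, assuming predictions and labels are kept in dictionaries keyed by target name. The helper `macro_roc_auc` and the toy values are hypothetical; only `roc_auc_score` comes from scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def macro_roc_auc(y_true_by_target, y_score_by_target):
    """Return (macro average, per-target ROC-AUC) over all targets."""
    per_target = {
        name: roc_auc_score(y_true_by_target[name], y_score_by_target[name])
        for name in y_true_by_target
    }
    return float(np.mean(list(per_target.values()))), per_target


# Toy labels and scores (illustrative only)
y_true = {
    "week_churn": [1, 0, 1, 0, 0, 1],
    "month_churn": [0, 0, 1, 0, 0, 0],
}
y_score = {
    "week_churn": [0.9, 0.2, 0.4, 0.5, 0.1, 0.7],
    "month_churn": [0.3, 0.1, 0.8, 0.6, 0.2, 0.4],
}

macro, per_target = macro_roc_auc(y_true, y_score)
print(per_target)  # week_churn: 8/9, month_churn: 1.0
print(f"macro ROC-AUC: {macro:.4f}")  # 0.9444
```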

**As an example**

| Score Range     | Interpretation                                                                 |
|------------------|--------------------------------------------------------------------------------|
| < 0.6            | Poor: close to random guessing (0.5); below 0.5 the ranking is worse than random |
| 0.6 – 0.7        | Weak: barely better than chance; needs major improvement                       |
| 0.7 – 0.8        | Moderate: useful for basic prioritization                                      |
| 0.8 – 0.9        | Strong: good for operational use (e.g., triggering alerts or campaigns)        |
| > 0.9            | Excellent: highly reliable for proactive retention strategies                  |

> **Goal**: Achieve **>0.85** macro-ROC-AUC — indicating the model can reliably distinguish between teachers who will churn soon vs. those who won’t.

#### - Theory

**Why ROC-AUC?**

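- **Class imbalance friendly**: In churn prediction, most users don’t churn, so accuracy is misleading. ROC-AUC is built from the true-positive and false-positive rates, each normalized within its own class, so a model that predicts “no churn” for everyone scores 0.5 instead of a deceptively high value.
- **Threshold agnostic**: You don’t need to decide which score means “churn”; the metric evaluates ranking quality across all possible thresholds.
- **Ranking-focused**: Well suited to systems that sort users by risk rather than just classify them. Only the order of the scores matters, not their calibration as probabilities.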
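The threshold and ranking points above can be checked on a small example. In this sketch (toy data, made up for illustration) a strictly increasing transform of the scores leaves ROC-AUC unchanged, while accuracy moves as soon as the cutoff moves.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy data (illustrative only): 3 churners among 10 teachers
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.65, 0.90])

# ROC-AUC depends only on the order of the scores
auc_raw = roc_auc_score(y_true, scores)
auc_cubed = roc_auc_score(y_true, scores ** 3)  # strictly increasing on [0, 1]

# Accuracy depends on the chosen cutoff
acc_at_05 = accuracy_score(y_true, (scores >= 0.5).astype(int))
acc_at_03 = accuracy_score(y_true, (scores >= 0.3).astype(int))

print(f"AUC raw: {auc_raw:.4f} | AUC cubed: {auc_cubed:.4f}")
print(f"Accuracy @0.5: {acc_at_05:.2f} | Accuracy @0.3: {acc_at_03:.2f}")
```

One consequence is that a high ROC-AUC says nothing about whether a score of 0.8 means an 80% chance of churn; if interventions are triggered by a fixed cutoff, calibration has to be checked separately.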

**Why Macro-Average?**

Because `week_churn` and `month_churn` are **separate business problems**:
- One informs **urgent interventions** (within 7 days)
- The other informs **medium-term strategy** (within 30 days)

We give equal weight to both — hence, **macro-average** (simple mean) instead of weighted average.


In [51]:
# import libraries
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier 
from xgboost import XGBClassifier 
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, cross_val_score, StratifiedKFold 
import matplotlib.pyplot as plt 
import optuna as opt
import pandas as pd 
import numpy as np 
import seaborn as sns 
%matplotlib inline

### Data Analysis

In [12]:
# import data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


In [13]:
train.sample(frac=.01)

Unnamed: 0,month_churn,nid,report_date,v_0,v_1,v_10,v_100,v_101,v_102,v_103,...,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99,week_churn
195015,0,11156,2024-12-29,46,90,40,33,3,6,0,...,0,0,51,95,0,0,33,33,16,1
233532,0,7718,2024-12-31,212,631,245,242,108,114,0,...,0,0,279,1622,16,19,245,245,218,0
71622,0,8876,2025-01-31,13,31,7,6,4,5,0,...,0,0,13,31,0,0,7,7,5,1
15956,0,9390,2024-02-18,95,116,355,196,100,138,0,...,0,0,160,837,19,36,197,197,178,0
12472,0,11520,2025-02-23,38,65,26,24,14,19,0,...,0,0,93,158,0,0,26,26,23,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230538,0,5604,2023-11-30,6,6,6,6,5,5,0,...,0,0,6,6,5,4,6,6,5,0
188435,0,4152,2024-10-20,70,149,291,100,91,92,0,...,0,0,106,646,38,52,102,102,93,0
157670,0,9697,2023-12-10,135,178,406,251,96,116,0,...,0,0,217,1346,4,6,254,254,225,0
39285,0,7890,2024-02-11,3,3,61,22,7,12,0,...,0,0,12,34,0,0,22,22,17,1


In [14]:
test.sample(frac=.1)

Unnamed: 0,nid,report_date,v_0,v_1,v_10,v_100,v_101,v_102,v_103,v_104,...,v_90,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99
25295,8578,2025-03-16,26,33,460,174,99,109,0,0,...,431,0,0,95,431,1,2,200,200,169
141,8882,2024-02-25,29,32,385,207,184,203,0,0,...,1072,0,0,196,895,15,35,208,208,204
2172,6542,2024-12-31,187,412,234,108,33,67,0,0,...,1195,0,0,273,1193,0,1,119,119,89
26360,2380,2025-03-16,29,31,119,66,38,41,0,0,...,109,0,0,60,109,2,3,66,66,54
18878,1334,2024-10-13,23,49,6,6,6,6,0,0,...,93,0,0,23,49,1,2,6,6,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14406,3947,2024-10-13,21,43,264,38,28,29,0,0,...,238,0,0,46,217,3,2,40,40,37
21072,8719,2024-12-15,6,6,15,15,1,3,0,0,...,97,0,0,52,69,0,1,15,15,12
5830,9912,2025-01-31,32,45,145,91,44,53,0,0,...,486,0,1,156,486,1,6,92,92,67
20450,11323,2025-01-12,93,98,942,290,255,279,0,0,...,1513,0,2,254,1513,50,96,297,297,288


In [23]:
# select only needed features
features = [col for col in train.columns if col.startswith('v_')]

# training sets
X_train = train[features].copy() 
y_week = train['week_churn'].copy() 
y_month = train['month_churn'].copy()

# test sets 
X_test = test[features].copy() 
test_nid = test['nid'].copy()


#### EDA (Exploratory Data Analysis)
> Explore relevant relationships that exist among the features

In [45]:
for label,content in train.items():
    if pd.isna(content).any():
        print(label)


In [18]:
# check the class balance of the weekly and monthly targets
y_week.value_counts(normalize=True)

week_churn
0    0.751181
1    0.248819
Name: proportion, dtype: float64

In [19]:
y_month.value_counts(normalize=True)

month_churn
0    0.870445
1    0.129555
Name: proportion, dtype: float64

In [20]:
train.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248180 entries, 0 to 248179
Columns: 279 entries, month_churn to week_churn
dtypes: int64(278), object(1)
memory usage: 528.3+ MB


In [21]:
test.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27437 entries, 0 to 27436
Columns: 277 entries, nid to v_99
dtypes: int64(276), object(1)
memory usage: 58.0+ MB


In [22]:
train.describe() 

Unnamed: 0,month_churn,nid,v_0,v_1,v_10,v_100,v_101,v_102,v_103,v_104,...,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99,week_churn
count,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,...,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0
mean,0.129555,5883.293863,53.080615,97.953369,257.293311,111.428342,53.258756,67.539254,0.0,0.02995,...,0.036586,1.383339,109.210911,512.872314,6.535567,11.213676,116.50249,116.50249,91.221662,0.248819
std,0.335814,3397.063277,58.427309,157.546228,311.641437,123.58078,68.650707,82.499486,0.0,0.247189,...,0.834675,7.416295,94.818224,615.401135,13.756437,20.132281,131.100894,131.100894,102.653687,0.43233
min,0.0,1.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2917.0,12.0,16.0,55.0,29.0,10.0,14.0,0.0,0.0,...,0.0,0.0,35.0,101.0,0.0,0.0,31.0,31.0,23.0,0.0
50%,0.0,5934.0,33.0,45.0,152.0,72.0,29.0,39.0,0.0,0.0,...,0.0,0.0,85.0,298.0,1.0,4.0,75.0,75.0,57.0,0.0
75%,0.0,8849.0,73.0,114.0,345.0,150.0,70.0,90.0,0.0,0.0,...,0.0,0.0,158.0,692.0,7.0,13.0,156.0,156.0,122.0,0.0
max,1.0,11735.0,909.0,4351.0,8157.0,3555.0,933.0,1100.0,0.0,20.0,...,85.0,718.0,984.0,9052.0,323.0,355.0,4185.0,4185.0,2396.0,1.0


### Modelling

In [None]:
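# evaluate a baseline model with 5-fold stratified CV on both targets
from sklearn.base import clone

rf = RandomForestClassifier(random_state=12, n_jobs=-1)
xgb = XGBClassifier(random_state=12, n_jobs=-1)
cat = CatBoostClassifier(random_state=12, thread_count=-1)

def baseline_train(model, n_rows=30000):
    """Return mean week, month, and macro ROC-AUC over 5 folds on the first n_rows rows."""

    CV = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
    
    week_auc_scores = []
    month_auc_scores = []
    
    # folds are stratified on week_churn only; indices point into the first n_rows rows
    for fold, (train_idx, val_idx) in enumerate(CV.split(X_train[:n_rows], y_week[:n_rows])): 
        print(f'Fold {fold + 1}')
        
        # split data 
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        yw_tr, yw_val = y_week.iloc[train_idx], y_week.iloc[val_idx]
        ym_tr, ym_val = y_month.iloc[train_idx], y_month.iloc[val_idx]

        # train models: clone so each target gets its own unfitted copy;
        # assigning `model` to both names would refit one shared object,
        # and the week predictions would come from the month model
        week_model = clone(model)
        month_model = clone(model)

        week_model.fit(X_tr, yw_tr)
        month_model.fit(X_tr, ym_tr)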

        # predict 
        pred_w = week_model.predict_proba(X_val)[:,1]
        pred_m = month_model.predict_proba(X_val)[:,1]

        week_auc_score = roc_auc_score(yw_val, pred_w)
        month_auc_score = roc_auc_score(ym_val, pred_m)
        macro_auc_avg = (week_auc_score + month_auc_score)/2

        week_auc_scores.append(week_auc_score)
        month_auc_scores.append(month_auc_score)

        print(f' Week AUC: {week_auc_score:.4f} | Month AUC: {month_auc_score:.4f} | Macro Avg: {macro_auc_avg:.4f}')

    # final aggregated scores 
    avg_week_auc = np.mean(week_auc_scores)
    avg_month_auc = np.mean(month_auc_scores)
    avg_macro_auc = (avg_week_auc + avg_month_auc)/2

    return {
        'Avg_week_AUC' : avg_week_auc, 
        'Avg_month_AUC' : avg_month_auc, 
        'Avg_macro_AUC': avg_macro_auc
    }
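Since `nid` can repeat across report dates, a plain stratified split may put snapshots of the same teacher in both the training and validation folds, which can make validation scores optimistic. Whether this matters here is an assumption that has not been tested on this dataset. A minimal sketch of a group-aware alternative, using scikit-learn's `GroupKFold` on synthetic data (the arrays below are random placeholders, not the real features):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(12)
n_rows = 200

# Synthetic stand-ins (hypothetical): 40 teachers with repeated snapshots
nid = rng.integers(1, 41, size=n_rows)
X = rng.normal(size=(n_rows, 4))
y = rng.integers(0, 2, size=n_rows)

cv = GroupKFold(n_splits=5)
overlaps = []
for train_idx, val_idx in cv.split(X, y, groups=nid):
    # Count teachers that appear on both sides of the split
    shared = set(nid[train_idx]) & set(nid[val_idx])
    overlaps.append(len(shared))

print(overlaps)  # every fold keeps each teacher on one side only
```

In `baseline_train`, the equivalent change would be to pass `groups=train['nid'][:n_rows]` to the splitter. `StratifiedGroupKFold` is an option when the class balance per fold also needs to be preserved.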


In [60]:
baseline_train(cat)

Fold 1
Learning rate set to 0.040021
0:	learn: 0.6697239	total: 93.2ms	remaining: 1m 33s
1:	learn: 0.6504020	total: 172ms	remaining: 1m 26s
2:	learn: 0.6323896	total: 273ms	remaining: 1m 30s
3:	learn: 0.6161251	total: 352ms	remaining: 1m 27s
4:	learn: 0.6014136	total: 446ms	remaining: 1m 28s
5:	learn: 0.5875292	total: 604ms	remaining: 1m 40s
6:	learn: 0.5748327	total: 736ms	remaining: 1m 44s
7:	learn: 0.5635872	total: 821ms	remaining: 1m 41s
8:	learn: 0.5533289	total: 910ms	remaining: 1m 40s
9:	learn: 0.5439808	total: 986ms	remaining: 1m 37s
10:	learn: 0.5358622	total: 1.07s	remaining: 1m 36s
11:	learn: 0.5279709	total: 1.16s	remaining: 1m 35s
12:	learn: 0.5201397	total: 1.25s	remaining: 1m 34s
13:	learn: 0.5136202	total: 1.33s	remaining: 1m 33s
14:	learn: 0.5079991	total: 1.41s	remaining: 1m 32s
15:	learn: 0.5025484	total: 1.5s	remaining: 1m 32s
16:	learn: 0.4972991	total: 1.64s	remaining: 1m 34s
17:	learn: 0.4921812	total: 1.72s	remaining: 1m 33s
18:	learn: 0.4874092	total: 1.81s	rem

(np.float64(0.694858441272802),
 np.float64(0.8510252166311691),
 np.float64(0.7729418289519856))


#### - Hyper-parameter tuning

#### - Fit best model

#### - Pre-process test data (optional)

#### - Feature importance

### Deployment

### Experiments (optional)

### Conclusion