## Predicting Teacher Churn in Yandex Uchebnik: A Short-Term Retention Forecasting System
*Preventing churn before it happens — by predicting which teachers are likely to leave the platform within 1 week or 1 month*

### Overview

#### - Problem definition

At Yandex Uchebnik (Yandex’s educational platform for K–12 teachers), retaining active users is critical to maintaining engagement and product value. Teachers who stop using the platform represent lost opportunities for impact, feedback, and long-term growth.

This project aims to build a short-term churn prediction system that forecasts whether a teacher will stop actively using the platform within:

- 1 week (`week_churn`): immediate risk
- 1 month (`month_churn`): medium-term risk

These predictions are based on historical behavioral data, transformed into numerical features (e.g., activity frequency, content interaction patterns, session duration aggregates, and behavioral embeddings).

The output is two independent probability scores per teacher, enabling targeted interventions such as personalized emails, feature recommendations, or support outreach, all before the user disengages.

#### - Data

All raw events (logins, lesson views, assignment submissions, etc.) and text interactions have been preprocessed into anonymized, tabular numeric features. For each teacher snapshot, we have:

- nid: Unique teacher ID (may repeat across dates)
- report_date: Date of the snapshot (when the feature vector was generated) 
- v_0 ... v_N: Numerical features derived from aggregated activity and behavioral embeddings
- week_churn: Binary target: 1 = teacher churned within 7 days after report_date, 0 = did not churn
- month_churn: Binary target: 1 = teacher churned within 30 days after report_date, 0 = did not churn

> Note: Only train.csv contains labels. test.csv contains only features and IDs — used for final prediction. 

Data is fully anonymized and provided in CSV format with comma delimiters.
The datasets can be accessed via the links below.
- Train dataset [train.csv]()
- Test dataset [test.csv]()

#### - Evaluation metric - ROC-AUC

Imagine you’re a doctor trying to predict which patients are at high risk of developing a disease.

You don’t just want to say “yes” or “no” — you want to assign a **risk score** (like 0.85 = very high risk, 0.12 = low risk). The goal is to rank patients correctly: those who actually get sick should have higher scores than those who don’t.

ROC-AUC measures **how well the model ranks positive cases (churners) above negative ones (non-churners)** — without forcing you to pick a specific cutoff.

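It ranges from **0 to 1**, although a useful model scores between 0.5 and 1.0:
- **< 0.5** = Worse than random (the ranking is inverted)
- **0.5** = Random guessing (like flipping a coin)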
- **0.7–0.8** = Fair performance
- **0.8–0.9** = Good performance
- **>0.9** = Excellent performance

In our case, since we have **two separate targets** (`week_churn` and `month_churn`), we compute **ROC-AUC for each independently**, then take the **macro average**, meaning we treat both tasks as equally important.
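
To make the metric concrete, here is a minimal pure-Python sketch of the macro-averaged ROC-AUC. It uses the pairwise definition of AUC: the fraction of (churner, non-churner) pairs in which the churner gets the higher score, with ties counted as half. The helper names `pairwise_auc` and `macro_auc` and the toy labels and scores are illustrative assumptions, not part of the notebook; in practice `sklearn.metrics.roc_auc_score` computes the same per-target value far more efficiently.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


def macro_auc(targets, predictions):
    """Unweighted mean of the per-target ROC-AUC values."""
    per_target = {name: pairwise_auc(targets[name], predictions[name])
                  for name in targets}
    return sum(per_target.values()) / len(per_target), per_target


# Toy data (hypothetical): five teachers, two targets
targets = {
    "week_churn":  [1, 0, 1, 0, 0],
    "month_churn": [1, 0, 0, 0, 0],
}
predictions = {
    "week_churn":  [0.9, 0.2, 0.4, 0.6, 0.1],
    "month_churn": [0.7, 0.3, 0.2, 0.5, 0.1],
}

macro, per_target = macro_auc(targets, predictions)
print(f"week:  {per_target['week_churn']:.4f}")   # 0.8333
print(f"month: {per_target['month_churn']:.4f}")  # 1.0000
print(f"macro: {macro:.4f}")                      # 0.9167
```

The pairwise loop is quadratic in the number of rows, so it is only meant for building intuition on small examples.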

**As a rough guide**

| Score Range     | Interpretation                                                                 |
|------------------|--------------------------------------------------------------------------------|
| < 0.6            | Poor: close to random guessing (0.5); below 0.5 is worse than random           |
| 0.6 – 0.7        | Weak: barely better than chance; needs major improvement                       |
| 0.7 – 0.8        | Moderate: useful for basic prioritization                                      |
| 0.8 – 0.9        | Strong: good for operational use (e.g., triggering alerts or campaigns)        |
| > 0.9            | Excellent: highly reliable for proactive retention strategies                  |

> **Goal**: Achieve **>0.85** macro-ROC-AUC — indicating the model can reliably distinguish between teachers who will churn soon vs. those who won’t.

#### - Theory

**Why ROC-AUC?**

- **Class imbalance friendly**: In churn prediction, most users don’t churn, so accuracy is misleading. ROC-AUC depends only on how churners are ranked relative to non-churners, so a large majority class does not inflate it.
- **Threshold agnostic**: You don’t need to decide “what score means ‘churn’?”; it evaluates ranking quality regardless of threshold.
- **Ranking-focused**: Perfect for systems where you want to sort users by risk, not just classify them.
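
The first point can be illustrated with a short sketch. It builds synthetic labels with a positive rate of about 13% (an assumption chosen to resemble the `month_churn` rate reported later in the notebook) and scores every teacher identically. The data and variable names are made up for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(12)

# Synthetic labels: roughly 13% positives (illustrative, not the real data)
y_true = (rng.random(10_000) < 0.13).astype(int)

# A useless "model": every teacher receives the same risk score
constant_scores = np.zeros(len(y_true))
constant_labels = (constant_scores >= 0.5).astype(int)  # always predicts "no churn"

acc = accuracy_score(y_true, constant_labels)
auc = roc_auc_score(y_true, constant_scores)

print(f"accuracy = {acc:.3f}")  # high, only because most labels are 0
print(f"ROC-AUC  = {auc:.3f}")  # 0.500, the model cannot rank anyone
```

Accuracy rewards the model for the class imbalance itself, while ROC-AUC stays at the random-guessing level because no churner is ranked above any non-churner.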

**Why Macro-Average?**

Because `week_churn` and `month_churn` are **independent business problems**:
- One informs **urgent interventions** (within 7 days)
- The other informs **medium-term strategy** (within 30 days)

We give equal weight to both — hence, **macro-average** (simple mean) instead of weighted average.


In [16]:
# import libraries
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier 
from xgboost import XGBClassifier 
from sklearn.base import clone 
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, cross_val_score, StratifiedKFold 
import matplotlib.pyplot as plt 
import optuna as opt
import pandas as pd 
import numpy as np 
import seaborn as sns 
%matplotlib inline

### Data Analysis

In [2]:
# import data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


In [3]:
train.sample(frac=.01)

Unnamed: 0,month_churn,nid,report_date,v_0,v_1,v_10,v_100,v_101,v_102,v_103,...,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99,week_churn
147428,1,6889,2024-12-31,5,6,31,31,12,15,0,...,0,0,132,286,0,0,31,31,26,1
140032,0,8027,2023-12-31,175,303,217,161,61,76,0,...,0,0,240,970,1,3,163,163,145,1
26759,0,7181,2024-11-24,36,40,241,141,66,73,0,...,0,2,108,644,1,3,142,142,127,0
169825,1,8425,2024-05-05,4,4,310,104,50,75,0,...,0,1,86,291,5,14,106,106,88,0
46528,1,4897,2024-05-05,21,22,234,50,14,18,0,...,0,0,41,97,4,4,53,53,38,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
201010,0,6021,2024-11-24,44,63,295,127,55,69,0,...,0,0,100,621,0,4,140,140,111,0
115902,0,1901,2023-12-31,63,245,165,142,81,114,0,...,0,0,63,510,9,16,152,152,126,1
33224,0,6088,2023-11-05,34,36,86,85,0,0,0,...,0,0,104,453,0,0,86,86,81,0
5392,0,1138,2023-09-24,134,163,123,115,21,28,0,...,0,6,199,534,1,5,123,123,80,0


In [4]:
test.sample(frac=.1)

Unnamed: 0,nid,report_date,v_0,v_1,v_10,v_100,v_101,v_102,v_103,v_104,...,v_90,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99
15328,7,2024-11-17,8,8,126,26,13,19,0,0,...,92,0,0,18,92,0,0,26,26,20
20607,9224,2023-10-29,14,14,28,28,22,23,0,1,...,99,0,0,21,67,4,9,28,28,23
18256,613,2025-02-23,51,58,993,381,101,147,0,0,...,980,0,0,156,980,11,31,403,403,186
22580,1336,2024-12-15,94,115,345,74,59,53,0,0,...,912,0,6,271,839,5,10,90,90,65
11466,10007,2024-03-03,14,14,278,0,0,0,0,0,...,16,0,0,14,14,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26817,4603,2024-11-10,3,3,319,92,70,76,0,0,...,1331,0,0,318,1256,4,18,92,92,89
11666,5662,2025-03-23,36,39,375,176,73,118,0,0,...,812,0,5,78,812,9,24,176,176,149
14562,412,2024-12-22,4,4,38,38,16,20,0,0,...,91,0,0,26,91,1,4,38,38,24
20521,7260,2024-03-17,24,26,18,13,8,13,0,0,...,100,0,0,26,99,0,1,17,17,10


In [5]:
# select only needed features
features = [col for col in train.columns if col.startswith('v_')]

# training sets
X_train = train[features].copy() 
y_week = train['week_churn'].copy() 
y_month = train['month_churn'].copy()

# test sets 
X_test = test[features].copy() 
test_nid = test['nid'].copy()


#### EDA (Exploratory Data Analysis)
> Explore the relationships that exist among the features

In [6]:
for label,content in train.items():
    if pd.isna(content).any():
        print(label)


In [7]:
# check distribution in week and month dataframes 
y_week.value_counts(normalize=True)

week_churn
0    0.751181
1    0.248819
Name: proportion, dtype: float64

In [9]:
y_month.value_counts(normalize=True)

month_churn
0    0.870445
1    0.129555
Name: proportion, dtype: float64

In [10]:
train.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248180 entries, 0 to 248179
Columns: 279 entries, month_churn to week_churn
dtypes: int64(278), object(1)
memory usage: 528.3+ MB


In [11]:
test.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27437 entries, 0 to 27436
Columns: 277 entries, nid to v_99
dtypes: int64(276), object(1)
memory usage: 58.0+ MB


In [12]:
train.describe() 

Unnamed: 0,month_churn,nid,v_0,v_1,v_10,v_100,v_101,v_102,v_103,v_104,...,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99,week_churn
count,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,...,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0
mean,0.129555,5883.293863,53.080615,97.953369,257.293311,111.428342,53.258756,67.539254,0.0,0.02995,...,0.036586,1.383339,109.210911,512.872314,6.535567,11.213676,116.50249,116.50249,91.221662,0.248819
std,0.335814,3397.063277,58.427309,157.546228,311.641437,123.58078,68.650707,82.499486,0.0,0.247189,...,0.834675,7.416295,94.818224,615.401135,13.756437,20.132281,131.100894,131.100894,102.653687,0.43233
min,0.0,1.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2917.0,12.0,16.0,55.0,29.0,10.0,14.0,0.0,0.0,...,0.0,0.0,35.0,101.0,0.0,0.0,31.0,31.0,23.0,0.0
50%,0.0,5934.0,33.0,45.0,152.0,72.0,29.0,39.0,0.0,0.0,...,0.0,0.0,85.0,298.0,1.0,4.0,75.0,75.0,57.0,0.0
75%,0.0,8849.0,73.0,114.0,345.0,150.0,70.0,90.0,0.0,0.0,...,0.0,0.0,158.0,692.0,7.0,13.0,156.0,156.0,122.0,0.0
max,1.0,11735.0,909.0,4351.0,8157.0,3555.0,933.0,1100.0,0.0,20.0,...,85.0,718.0,984.0,9052.0,323.0,355.0,4185.0,4185.0,2396.0,1.0
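
In the summary above, `v_103` has a minimum, maximum and standard deviation of 0, and `v_97` and `v_98` show identical statistics. That suggests, but does not prove, that some columns are constant or exact copies of each other. The sketch below is one way to check; the helper name `find_redundant_columns` and the toy frame are assumptions for illustration, and the transpose can be memory-hungry on a frame as large as `train`.

```python
import pandas as pd


def find_redundant_columns(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (constant columns, columns that exactly duplicate an earlier column)."""
    # A column with a single distinct value carries no signal
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

    # Transposing turns columns into rows, so duplicated() flags repeated columns;
    # keep="first" marks only the later copies
    dup_mask = df.T.duplicated(keep="first")
    duplicates = [col for col in df.columns
                  if dup_mask[col] and col not in constant]
    return constant, duplicates


# Toy frame (hypothetical values) mimicking the pattern seen in describe()
toy = pd.DataFrame({
    "v_0":   [3, 5, 8],
    "v_103": [0, 0, 0],
    "v_97":  [1, 4, 9],
    "v_98":  [1, 4, 9],
})

constant_cols, duplicate_cols = find_redundant_columns(toy)
print("constant:  ", constant_cols)    # ['v_103']
print("duplicates:", duplicate_cols)   # ['v_98']
```

On the real data this would be called as `find_redundant_columns(train[features])`. Whether to drop the flagged columns is a modelling choice, since tree ensembles usually tolerate redundant inputs.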


### Modelling

In [None]:
# create a function to test for baseline models 

rf = RandomForestClassifier(random_state=12, n_jobs=-1)
xgb = XGBClassifier(random_state=12, n_jobs=-1)
cat = CatBoostClassifier(random_state=12, thread_count=-1)

def baseline_train(model):

    # create KFold for cross validation
    CV = StratifiedKFold(n_splits=5, shuffle=True, random_state=12)
    
    # create list to save AUC scores
    week_auc_scores = []
    month_auc_scores = []
    
    # create loop to evaluate AUC score on each fold for train and validation set of the week and month churn 
    for fold, (train_idx, val_idx) in enumerate(CV.split(X_train[:30000], y_week[:30000])): 
        print(f'Fold {fold + 1}')
        
        # split data 
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        yw_tr, yw_val = y_week.iloc[train_idx], y_week.iloc[val_idx]
        ym_tr, ym_val = y_month.iloc[train_idx], y_month.iloc[val_idx]

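        # train models (clone so the two targets do not share one estimator object;
        # otherwise the monthly fit overwrites the weekly one before prediction)
        week_model = clone(model)
        month_model = clone(model)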

        week_model.fit(X_tr, yw_tr)
        month_model.fit(X_tr, ym_tr)

        # predict 
        pred_w = week_model.predict_proba(X_val)[:,1]
        pred_m = month_model.predict_proba(X_val)[:,1]

        week_auc_score = roc_auc_score(yw_val, pred_w)
        month_auc_score = roc_auc_score(ym_val, pred_m)
        macro_auc_avg = (week_auc_score + month_auc_score)/2

        week_auc_scores.append(week_auc_score)
        month_auc_scores.append(month_auc_score)

        print(f' Week AUC: {week_auc_score:.4f} | Month AUC: {month_auc_score:.4f} | Macro Avg: {macro_auc_avg:.4f}')

    # final aggregated scores 
    avg_week_auc = np.mean(week_auc_scores)
    avg_month_auc = np.mean(month_auc_scores)
    avg_macro_auc = (avg_week_auc + avg_month_auc)/2

    return {
        'Avg_week_AUC' : avg_week_auc, 
        'Avg_month_AUC' : avg_month_auc, 
        'Avg_macro_AUC': avg_macro_auc
    }



In [27]:
# create models 
rf = RandomForestClassifier(random_state=12, n_jobs=-1) 
cat = CatBoostClassifier(random_state=12, thread_count=-1)
xgb = XGBClassifier(n_jobs=-1, random_state=12)

# create function to run baseline training 
def baseline_trainer(model):
    # create stratified cross_validator 
    CV = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)

    # create lists to hold AUC scores 
    weekly_churn = []
    monthly_churn = [] 

    # create loop to run weekly churn training 
    for fold, (train_idx, val_idx) in enumerate(CV.split(X_train[:30000], y_week[:30000])):
        print(f'Running weekly fold {fold+1}')

        #split dataset for weekly churn training
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        yw_tr, yw_val = y_week.iloc[train_idx], y_week.iloc[val_idx]

        # create model 
        yw_model = clone(model) # clone to avoid carrying over settings/weights from another training
        yw_model.fit(X_tr,yw_tr)

        # make probabilistic predictions 
        pred_w = yw_model.predict_proba(X_val)[:,1]

        # evaluate 
        weekly_AUC = roc_auc_score(yw_val, pred_w)

        print(f'Fold {fold+1} weekly AUC score = {weekly_AUC}')

        weekly_churn.append(weekly_AUC)

    # create loop to run monthly churn training 
    for fold, (train_idx, val_idx) in enumerate(CV.split(X_train[:30000], y_month[:30000])):
        print(f'Running monthly fold {fold+1}')

        # split dataset 
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        ym_tr, ym_val = y_month.iloc[train_idx], y_month.iloc[val_idx]

        # create model 
        ym_model = clone(model)
        ym_model.fit(X_tr, ym_tr)

        # make probabilistic predictions
        pred_m = ym_model.predict_proba(X_val)[:,1]

        # evaluate 
        monthly_AUC = roc_auc_score(ym_val, pred_m)

        print(f'Fold {fold+1} monthly AUC score = {monthly_AUC}')

        monthly_churn.append(monthly_AUC)

    for i, (avg_w, avg_m) in enumerate(zip(weekly_churn, monthly_churn)):
        print(f'{i+1}. Weekly AUC = {avg_w:.4f}, Monthly AUC = {avg_m:.4f}, Average AUC = {(avg_w + avg_m)/2:.4f}')

    # average weekly_churn AUC 
    AVG_weekly = np.mean(weekly_churn)

    # average monthly_churn AUC
    AVG_monthly = np.mean(monthly_churn)

    # overall average, computed before rounding so rounding errors do not compound
    AVG_AUC = (AVG_weekly + AVG_monthly)/2

    # round only for reporting
    AVG_weekly = round(float(AVG_weekly), 2)
    AVG_monthly = round(float(AVG_monthly), 2)
    AVG_AUC = round(float(AVG_AUC), 2)
    
    # print overall score (average)
    print(f'Average weekly AUC score = {AVG_weekly} | average monthly AUC score = {AVG_monthly} | overall AVG score = {AVG_AUC}') 

    return {'weekly_auc': AVG_weekly,
            'monthly_auc': AVG_monthly,
            'AUC_score': AVG_AUC}

        

In [None]:
cat_base = baseline_trainer(cat)

Running weekly fold 1
Learning rate set to 0.040021
0:	learn: 0.6697239	total: 464ms	remaining: 7m 44s
1:	learn: 0.6504020	total: 842ms	remaining: 7m
2:	learn: 0.6323896	total: 1.29s	remaining: 7m 8s
3:	learn: 0.6161251	total: 1.72s	remaining: 7m 8s
4:	learn: 0.6014136	total: 2.14s	remaining: 7m 6s
5:	learn: 0.5875292	total: 2.53s	remaining: 6m 58s
6:	learn: 0.5748327	total: 2.91s	remaining: 6m 53s
7:	learn: 0.5635872	total: 3.32s	remaining: 6m 51s
8:	learn: 0.5533289	total: 3.67s	remaining: 6m 43s
9:	learn: 0.5439808	total: 4.05s	remaining: 6m 41s
10:	learn: 0.5358622	total: 4.52s	remaining: 6m 46s
11:	learn: 0.5279709	total: 5.01s	remaining: 6m 52s
12:	learn: 0.5201397	total: 5.52s	remaining: 6m 59s
13:	learn: 0.5136202	total: 6.05s	remaining: 7m 6s
14:	learn: 0.5079991	total: 6.45s	remaining: 7m 3s
15:	learn: 0.5025484	total: 6.87s	remaining: 7m 2s
16:	learn: 0.4972991	total: 7.21s	remaining: 6m 57s
17:	learn: 0.4921812	total: 7.55s	remaining: 6m 52s
18:	learn: 0.4874092	total: 7.92

#### - Hyper-parameter tuning

#### - Fit best model

#### - Pre-process test data (optional)

#### - Feature importance

### Deployment

### Experiments (optional)

### Conclusion