## Predicting Teacher Churn in Yandex Uchebnik: A Short-Term Retention Forecasting System
*Preventing churn before it happens — by predicting which teachers are likely to leave the platform within 1 week or 1 month*

### Overview

#### - Problem definition

At Yandex Uchebnik (Yandex’s educational platform for K–12 teachers), retaining active users is critical to maintaining engagement and product value. Teachers who stop using the platform represent lost opportunities for impact, feedback, and long-term growth.

This project aims to build a short-term churn prediction system that forecasts whether a teacher will stop actively using the platform within:

- 1 week (`week_churn`): immediate risk
- 1 month (`month_churn`): medium-term risk

These predictions are based on historical behavioral data, transformed into numerical features (e.g., activity frequency, content interaction patterns, session duration aggregates, and behavioral embeddings).

The output is two independent probability scores per teacher, enabling targeted interventions such as personalized emails, feature recommendations, or support outreach, all before the user disengages.

#### - Data

All raw events (logins, lesson views, assignment submissions, etc.) and text interactions have been preprocessed into anonymized, tabular numeric features. For each teacher snapshot, we have:

- nid: Unique teacher ID (may repeat across dates)
- report_date: Date of the snapshot (when the feature vector was generated) 
- v_0 ... v_N: Numerical features derived from aggregated activity and behavioral embeddings
- week_churn: Binary target: 1 = teacher churned within 7 days after report_date, 0 = did not churn
- month_churn: Binary target: 1 = teacher churned within 30 days after report_date, 0 = did not churn

> Note: Only train.csv contains labels. test.csv contains only features and IDs — used for final prediction. 

Data is fully anonymized and provided in CSV format with comma delimiters.
The datasets can be accessed via the links below.

- Train dataset [train.csv]()
- Test dataset [test.csv]()

#### - Evaluation metric - AUC-ROC

Imagine you’re a doctor trying to predict which patients are at high risk of developing a disease.

You don’t just want to say “yes” or “no” — you want to assign a **risk score** (like 0.85 = very high risk, 0.12 = low risk). The goal is to rank patients correctly: those who actually get sick should have higher scores than those who don’t.

ROC-AUC measures **how well the model ranks positive cases (churners) above negative ones (non-churners)** — without forcing you to pick a specific cutoff.
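
To make the ranking interpretation concrete, the sketch below computes ROC-AUC two ways on a small made-up example: by counting how often a churner is scored above a non-churner (ties count as half), and with `roc_auc_score`. The labels, scores, and the helper name `pairwise_auc` are illustrative assumptions, not taken from the dataset or the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data (illustrative only): 1 = churned, 0 = stayed.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.10, 0.35, 0.40, 0.20, 0.80, 0.30, 0.55, 0.90]

auc_pairs = pairwise_auc(y_true, y_score)
auc_sklearn = roc_auc_score(y_true, y_score)
print(auc_pairs, auc_sklearn)
```

Both numbers agree: 13 of the 16 churner/non-churner pairs are ordered correctly, so the AUC is 13/16.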

It ranges from **0 to 1**, where 0.5 corresponds to random ranking and values below 0.5 mean the model ranks worse than random:
- **0.5** = Random guessing (like flipping a coin)
- **0.7–0.8** = Fair performance
- **0.8–0.9** = Good performance
- **>0.9** = Excellent performance

In our case, since we have **two separate targets** (`week_churn` and `month_churn`), we compute **ROC-AUC for each independently**, then take the **macro average** — meaning we treat both tasks equally important.
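
A minimal sketch of that macro average, assuming the labels and predicted scores for each target are held in dictionaries keyed by target name. The helper `macro_roc_auc` and the toy values are hypothetical and only illustrate the calculation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def macro_roc_auc(y_true_by_target, y_score_by_target):
    """Unweighted mean of per-target ROC-AUC values."""
    per_target = {
        name: roc_auc_score(y_true_by_target[name], y_score_by_target[name])
        for name in y_true_by_target
    }
    return float(np.mean(list(per_target.values()))), per_target

# Illustrative labels and scores, not real platform data.
y_true = {
    "week_churn": [0, 1, 0, 1, 0, 1],
    "month_churn": [0, 0, 0, 1, 0, 1],
}
y_score = {
    "week_churn": [0.2, 0.7, 0.4, 0.3, 0.1, 0.9],
    "month_churn": [0.1, 0.3, 0.2, 0.8, 0.4, 0.6],
}

macro_auc, per_target_auc = macro_roc_auc(y_true, y_score)
print(per_target_auc, macro_auc)
```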

**Rule-of-thumb interpretation**

| Score Range     | Interpretation                                                                 |
|------------------|--------------------------------------------------------------------------------|
| < 0.6            | Poor: close to random guessing (0.5 is chance; below 0.5 is worse than chance) |
| 0.6 – 0.7        | Weak: barely better than chance; needs major improvement                       |
| 0.7 – 0.8        | Moderate: useful for basic prioritization                                      |
| 0.8 – 0.9        | Strong: good for operational use (e.g., triggering alerts or campaigns)        |
| > 0.9            | Excellent: highly reliable for proactive retention strategies                  |

> **Goal**: Achieve **>0.85** macro-ROC-AUC — indicating the model can reliably distinguish between teachers who will churn soon vs. those who won’t.

#### - Theory

**Why ROC-AUC?**

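- **Class imbalance friendly**: In churn prediction, most users don’t churn, so accuracy is misleading: a model that always predicts “no churn” would already be right about 87% of the time on `month_churn`. ROC-AUC is built from true-positive and false-positive rates, so predicting the majority class earns no credit.
- **Threshold agnostic**: You don’t need to decide “what score means ‘churn’?”; it evaluates ranking quality across all thresholds.
- **Ranking-focused**: Only the order of the scores matters, which suits systems where you want to sort users by risk, not just classify them.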
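
The points above can be checked on synthetic data. The sketch below assumes roughly 13% positives (similar to the `month_churn` rate) and made-up scores; it shows that a constant "never churns" prediction gets high accuracy but an AUC of 0.5, and that a strictly increasing transform of the scores leaves the AUC unchanged.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels with roughly 13% positives (illustrative only).
y = (rng.random(10_000) < 0.13).astype(int)

# A "model" that never predicts churn: high accuracy, no ranking ability.
always_stay = np.zeros_like(y)
acc_constant = accuracy_score(y, always_stay)
auc_constant = roc_auc_score(y, always_stay.astype(float))

# Informative but noisy scores, then a strictly increasing transform of them.
scores = y + rng.normal(0.0, 1.0, size=y.size)
auc_raw = roc_auc_score(y, scores)
auc_transformed = roc_auc_score(y, 1.0 / (1.0 + np.exp(-scores)))

print(f"constant model: accuracy={acc_constant:.3f}, AUC={auc_constant:.3f}")
print(f"AUC before/after monotone transform: {auc_raw:.4f} / {auc_transformed:.4f}")
```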

**Why Macro-Average?**

Because `week_churn` and `month_churn` address **separate business problems**:
- One informs **urgent interventions** (within 7 days)
- The other informs **medium-term strategy** (within 30 days)

We give equal weight to both — hence, **macro-average** (simple mean) instead of weighted average.


In [2]:
# import libraries
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier 
from xgboost import XGBClassifier 
from sklearn.base import clone 
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, cross_val_score, StratifiedKFold 
import matplotlib.pyplot as plt 
import optuna as opt
import pandas as pd 
import numpy as np 
import seaborn as sns 
%matplotlib inline

### Data Analysis

In [3]:
# import data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


In [4]:
train.sample(frac=.01)

Unnamed: 0,month_churn,nid,report_date,v_0,v_1,v_10,v_100,v_101,v_102,v_103,...,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99,week_churn
97865,0,11545,2024-12-22,47,75,267,51,9,22,0,...,0,0,76,385,0,0,51,51,47,0
165416,1,7456,2023-11-30,15,35,18,15,0,0,0,...,0,0,15,35,0,0,18,18,10,1
58844,0,6810,2025-02-28,116,578,1226,447,190,316,0,...,1,23,121,1669,26,66,477,477,302,0
173061,0,2968,2025-01-31,149,189,341,158,60,69,0,...,0,0,256,687,7,14,192,192,121,0
55748,0,7215,2024-03-24,119,143,645,276,135,224,0,...,0,0,354,1402,19,51,285,285,253,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
155104,0,7468,2025-02-09,49,65,449,101,47,50,0,...,0,0,79,268,2,5,101,101,94,0
42454,0,7600,2025-02-16,8,11,262,69,26,47,0,...,0,0,15,77,8,10,73,73,41,0
174550,0,934,2024-11-30,55,172,193,161,88,90,0,...,0,3,92,509,25,30,165,165,149,0
191346,0,10524,2025-01-31,24,70,65,49,16,17,0,...,0,0,25,151,0,0,49,49,48,0


In [5]:
test.sample(frac=.1)

Unnamed: 0,nid,report_date,v_0,v_1,v_10,v_100,v_101,v_102,v_103,v_104,...,v_90,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99
9575,10825,2025-02-09,65,95,156,102,62,90,0,0,...,224,0,0,87,224,8,14,102,102,93
7935,4071,2025-02-02,19,27,275,68,10,22,0,0,...,173,0,0,59,173,0,1,72,72,47
16693,6542,2023-11-26,27,27,37,33,9,21,0,0,...,283,0,0,123,270,1,1,37,37,25
21797,2580,2025-02-23,14,14,124,44,17,24,0,0,...,329,0,2,88,329,0,0,45,45,41
21115,3097,2025-01-12,37,94,715,167,55,60,0,0,...,514,0,0,136,511,5,5,179,179,156
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16937,5475,2024-03-24,22,41,170,151,96,127,0,0,...,349,0,0,111,333,1,1,160,160,115
13688,9301,2024-05-12,20,28,162,29,21,27,0,0,...,478,0,1,182,437,0,1,29,29,27
2962,3150,2024-01-31,62,138,152,145,95,123,0,0,...,524,0,1,120,379,32,35,145,145,135
20730,6022,2025-03-02,276,349,1264,434,358,392,0,0,...,2272,0,0,389,2272,51,73,434,434,402


In [6]:
# select only needed features
features = [col for col in train.columns if col.startswith('v_')]

# training sets
X_train = train[features].copy() 
y_week = train['week_churn'].copy() 
y_month = train['month_churn'].copy()

# test sets 
X_test = test[features].copy() 
test_nid = test['nid'].copy()


#### EDA (Exploratory Data Analysis)
> Explore relevant relationships that exist among the features

In [7]:
for label,content in train.items():
    if pd.isna(content).any():
        print(label)


In [8]:
# check the class balance of the weekly and monthly targets
y_week.value_counts(normalize=True)

week_churn
0    0.751181
1    0.248819
Name: proportion, dtype: float64

In [9]:
y_month.value_counts(normalize=True)

month_churn
0    0.870445
1    0.129555
Name: proportion, dtype: float64

In [10]:
train.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248180 entries, 0 to 248179
Columns: 279 entries, month_churn to week_churn
dtypes: int64(278), object(1)
memory usage: 528.3+ MB
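
Nearly all columns are stored as `int64`, which accounts for most of the memory reported above. One possible way to shrink the frame, sketched here on a toy frame, is to downcast integer columns with `pd.to_numeric(..., downcast="integer")`. The helper `downcast_integers` is hypothetical, and how much memory it would save on the real data is not measured here.

```python
import numpy as np
import pandas as pd

def downcast_integers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with integer columns stored in the smallest fitting dtype."""
    out = df.copy()
    int_cols = out.select_dtypes(include=np.integer).columns
    for col in int_cols:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out

# Toy frame with int64 columns and one string column (values are made up).
toy = pd.DataFrame({
    "nid": np.array([11545, 7456, 6810], dtype="int64"),
    "report_date": ["2024-12-22", "2023-11-30", "2025-02-28"],
    "v_0": np.array([47, 15, 116], dtype="int64"),
    "week_churn": np.array([0, 1, 0], dtype="int64"),
})

small = downcast_integers(toy)
print(small.dtypes)
```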


In [11]:
test.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27437 entries, 0 to 27436
Columns: 277 entries, nid to v_99
dtypes: int64(276), object(1)
memory usage: 58.0+ MB


In [12]:
train.describe() 

Unnamed: 0,month_churn,nid,v_0,v_1,v_10,v_100,v_101,v_102,v_103,v_104,...,v_91,v_92,v_93,v_94,v_95,v_96,v_97,v_98,v_99,week_churn
count,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,...,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0,248180.0
mean,0.129555,5883.293863,53.080615,97.953369,257.293311,111.428342,53.258756,67.539254,0.0,0.02995,...,0.036586,1.383339,109.210911,512.872314,6.535567,11.213676,116.50249,116.50249,91.221662,0.248819
std,0.335814,3397.063277,58.427309,157.546228,311.641437,123.58078,68.650707,82.499486,0.0,0.247189,...,0.834675,7.416295,94.818224,615.401135,13.756437,20.132281,131.100894,131.100894,102.653687,0.43233
min,0.0,1.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2917.0,12.0,16.0,55.0,29.0,10.0,14.0,0.0,0.0,...,0.0,0.0,35.0,101.0,0.0,0.0,31.0,31.0,23.0,0.0
50%,0.0,5934.0,33.0,45.0,152.0,72.0,29.0,39.0,0.0,0.0,...,0.0,0.0,85.0,298.0,1.0,4.0,75.0,75.0,57.0,0.0
75%,0.0,8849.0,73.0,114.0,345.0,150.0,70.0,90.0,0.0,0.0,...,0.0,0.0,158.0,692.0,7.0,13.0,156.0,156.0,122.0,0.0
max,1.0,11735.0,909.0,4351.0,8157.0,3555.0,933.0,1100.0,0.0,20.0,...,85.0,718.0,984.0,9052.0,323.0,355.0,4185.0,4185.0,2396.0,1.0
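
The summary above shows that `v_103` has zero variance and that `v_97` and `v_98` share identical summary statistics, which suggests (but does not prove) that some columns are redundant. A minimal sketch for flagging such columns, shown on a toy frame; `find_redundant_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def find_redundant_columns(df: pd.DataFrame):
    """Return (constant_columns, duplicate_columns) for a numeric frame."""
    # A column is constant when it holds a single distinct value.
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

    # A column is a duplicate when an earlier column has exactly the same values.
    duplicated_mask = df.T.duplicated(keep="first")
    duplicates = [
        col for col, is_dup in duplicated_mask.items()
        if is_dup and col not in constant
    ]
    return constant, duplicates

# Toy frame mimicking the pattern seen in the summary (values are made up).
toy = pd.DataFrame({
    "v_0": [3, 12, 33, 73],
    "v_97": [0, 31, 75, 156],
    "v_98": [0, 31, 75, 156],   # exact copy of v_97
    "v_103": [0, 0, 0, 0],      # constant
})

constant_cols, duplicate_cols = find_redundant_columns(toy)
print(constant_cols, duplicate_cols)
```

Transposing a frame with a few hundred thousand rows is memory-hungry, so on the full training set it may be more practical to run the duplicate check on a row sample first and confirm candidates with `Series.equals`.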


### Modelling

In [13]:
# create models 
rf = RandomForestClassifier(random_state=12, n_jobs=-1) 
cat = CatBoostClassifier(random_state=12, thread_count=-1)
xgb = XGBClassifier(n_jobs=-1, random_state=12)

# create function to run baseline training 
def baseline_trainer(model):
    # create stratified cross_validator 
    CV = StratifiedKFold(n_splits=5, random_state=12, shuffle=True)

    # create lists to hold AUC scores 
    weekly_churn = []
    monthly_churn = [] 

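    # create loop to run weekly churn training
    # NOTE: only the first 1,000 rows are split here to keep the baseline fast
    # (the monthly loop below does the same); pass the full X_train / y_week
    # to CV.split to evaluate on all rows
    for fold, (train_idx, val_idx) in enumerate(CV.split(X_train[:1000], y_week[:1000])):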
        print(f'Running weekly fold {fold+1}')

        #split dataset for weekly churn training
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        yw_tr, yw_val = y_week.iloc[train_idx], y_week.iloc[val_idx]

        # create model 
        yw_model = clone(model) # clone to avoid carrying over settings/weights from another training
        yw_model.fit(X_tr,yw_tr)

        # make probabilistic predictions 
        pred_w = yw_model.predict_proba(X_val)[:,1]

        # evaluate 
        weekly_AUC = roc_auc_score(yw_val, pred_w)

        print(f'Fold {fold+1} weekly AUC score = {weekly_AUC}')

        weekly_churn.append(weekly_AUC)

    # create loop to run monthly churn training 
    for fold, (train_idx, val_idx) in enumerate(CV.split(X_train[:1000], y_month[:1000])):
        print(f'Running monthly fold {fold+1}')

        # split dataset 
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        ym_tr, ym_val = y_month.iloc[train_idx], y_month.iloc[val_idx]  # positional indexing on both sides

        # create model 
        ym_model = clone(model)
        ym_model.fit(X_tr, ym_tr)

        # make probabilistic predictions
        pred_m = ym_model.predict_proba(X_val)[:,1]

        # evaluate 
        monthly_AUC = roc_auc_score(ym_val, pred_m)

        print(f'Fold {fold+1} monthly AUC score = {monthly_AUC}')

        monthly_churn.append(monthly_AUC)

    # per-fold summary: weekly AUC, monthly AUC, and their mean for each fold
    for i, (auc_w, auc_m) in enumerate(zip(weekly_churn, monthly_churn)):
        print(f'{i+1}. Weekly AUC = {auc_w:.4f}, Monthly AUC = {auc_m:.4f}, Average AUC = {(auc_w + auc_m)/2:.4f}')

    # average weekly_churn AUC across folds
    AVG_weekly = np.mean(weekly_churn)

    # average monthly_churn AUC across folds
    AVG_monthly = np.mean(monthly_churn)

    # overall (macro) average, computed from the unrounded means
    AVG_AUC = (AVG_weekly + AVG_monthly)/2

    # print overall score (average)
    print(f'Average weekly AUC score = {AVG_weekly:.4f} | average monthly AUC score = {AVG_monthly:.4f} | overall AVG score = {AVG_AUC:.4f}')

    # round only at the end so rounding errors do not compound
    return {'weekly_auc': round(float(AVG_weekly), 4),
            'monthly_auc': round(float(AVG_monthly), 4),
            'AUC_score': round(float(AVG_AUC), 4)}

        

In [14]:
# cat_base = baseline_trainer(cat)

In [15]:
# xgb_base = baseline_trainer(xgb)

In [16]:
# rf_base = baseline_trainer(rf)

From the baseline results, XGBoost and CatBoost look the most promising, CatBoost in particular, so the next step is to tune CatBoost's hyperparameters to get the highest possible score.

#### - Hyper-parameter tuning

*Hyper-tuning CatBoost*

<center>Parameter search using RandomizedSearchCV</center>

In [21]:
# create random search grid
cat_rf_grid = {
    'learning_rate':np.arange(.01,.3,.03),
    'iterations':np.arange(50,500,50),
    'depth':np.arange(4,10,2),
    'l2_leaf_reg':np.arange(3,10,.5), 
    # 'random_strength':np.arange(,10,.5), tune if overfitting
    # 'bagging_temperature':np.arange(1,2) tune for overfitting 
    'border_count':np.arange(50,250,50)
}

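# create randomsearch instance with cross validation
# scoring='roc_auc' makes the search select parameters on the evaluation metric
# (the default for classifiers is accuracy); random_state makes the sampled
# candidates reproducible
cat_rs = RandomizedSearchCV(CatBoostClassifier(thread_count=-1, random_state=12),
                            param_distributions=cat_rf_grid,
                            n_iter=2,
                            scoring='roc_auc',
                            random_state=12,
                            cv=5,
                            verbose=True)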


def RandomizedSeach_trainer(grid, model, train_data, yw_data, ym_data):
    # `model` is the search object that gets cloned for each target; `grid` is
    # kept for reference only, since the search space is already attached to `model`

    # case 1 (week): split data for training 
    Xw_train, Xw_test, yw_train, yw_test = train_test_split(train_data, yw_data, random_state=12, test_size=.2, stratify=yw_data)

    # train a fresh, unfitted copy of the search object
    yw_model = clone(model)
    yw_model.fit(Xw_train, yw_train)

    # get probabilities
    yw_proba = yw_model.predict_proba(Xw_test)[:,1]

    # evaluate
    week_AUC_score = roc_auc_score(yw_test, yw_proba)

    # get best parameters
    best_par_week = yw_model.best_params_

    print(f'Week AUC score = {week_AUC_score:.4f} \n Best parameter for week churn training = {best_par_week}')

    # case 2 (month): split data for training
    Xm_train, Xm_test, ym_train, ym_test = train_test_split(train_data, ym_data, random_state=12, test_size=.2, stratify=ym_data)

    # train a fresh, unfitted copy of the search object
    ym_model = clone(model)
    ym_model.fit(Xm_train, ym_train)

    # get probabilities 
    ym_proba = ym_model.predict_proba(Xm_test)[:,1]

    # evaluate 
    month_AUC_score = roc_auc_score(ym_test, ym_proba)

    # get best parameters 
    best_par_month = ym_model.best_params_

    print(f'Month AUC score = {month_AUC_score:.4f} \n Best parameter for month churn training = {best_par_month}')

    # macro average of the two hold-out AUCs
    Avg_AUC_score = (week_AUC_score + month_AUC_score)/2
    
    return Avg_AUC_score 
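
For a sense of how sparsely `n_iter=2` covers the search space, the sketch below counts the combinations in the same grid as `cat_rf_grid` and draws two candidates with `ParameterSampler`, the sampling utility scikit-learn provides for randomized search. The seed is an arbitrary choice for illustration.

```python
import math
import numpy as np
from sklearn.model_selection import ParameterSampler

# Same search space as cat_rf_grid in the cell above.
grid = {
    "learning_rate": np.arange(.01, .3, .03),
    "iterations": np.arange(50, 500, 50),
    "depth": np.arange(4, 10, 2),
    "l2_leaf_reg": np.arange(3, 10, .5),
    "border_count": np.arange(50, 250, 50),
}

# Total number of distinct parameter combinations in the grid.
n_combinations = math.prod(len(values) for values in grid.values())

# Draw two candidates from the grid, as a randomized search with n_iter=2 does.
candidates = list(ParameterSampler(grid, n_iter=2, random_state=12))

print(f"{n_combinations} combinations, {len(candidates)} sampled")
for params in candidates:
    print(params)
```

Two samples cover only a tiny fraction of the 15,120 combinations, so a larger `n_iter` (compute budget permitting) is likely needed before the reported best parameters can be treated as more than a starting point.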


In [20]:
RandomizedSeach_trainer(cat_rf_grid, cat_rs, X_train, y_week, y_month)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
0:	learn: 0.6178397	total: 232ms	remaining: 1m 32s
1:	learn: 0.5690344	total: 458ms	remaining: 1m 31s
2:	learn: 0.5342123	total: 662ms	remaining: 1m 27s
3:	learn: 0.5099420	total: 893ms	remaining: 1m 28s
4:	learn: 0.4934495	total: 1.15s	remaining: 1m 31s
5:	learn: 0.4816228	total: 1.37s	remaining: 1m 30s
6:	learn: 0.4729543	total: 1.58s	remaining: 1m 28s
7:	learn: 0.4656601	total: 1.89s	remaining: 1m 32s
8:	learn: 0.4609584	total: 2.1s	remaining: 1m 31s
9:	learn: 0.4561979	total: 2.33s	remaining: 1m 31s
10:	learn: 0.4526916	total: 2.56s	remaining: 1m 30s
11:	learn: 0.4504564	total: 2.81s	remaining: 1m 30s
12:	learn: 0.4489244	total: 3.1s	remaining: 1m 32s
13:	learn: 0.4468782	total: 3.35s	remaining: 1m 32s
14:	learn: 0.4449683	total: 3.53s	remaining: 1m 30s
15:	learn: 0.4432499	total: 3.67s	remaining: 1m 28s
16:	learn: 0.4415193	total: 3.88s	remaining: 1m 27s
17:	learn: 0.4406154	total: 4.14s	remaining: 1m 27s
18:	learn: 0.439

np.float64(0.8591503576415493)

#### - Fit best model
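
A minimal sketch of this step, assuming one model per target is refit on the full training set and then used to score the test set. `fit_and_predict` and `make_model` are hypothetical names; a `LogisticRegression` on synthetic data stands in for the tuned CatBoost models so that the example stays self-contained, and the output column layout is an assumption rather than a documented submission format.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_and_predict(make_model, X_train, targets, X_test, test_ids):
    """Fit one fresh model per target and return churn probabilities per id."""
    out = pd.DataFrame({"nid": np.asarray(test_ids)})
    for name, y in targets.items():
        model = make_model()                      # new, unfitted estimator
        model.fit(X_train, y)
        out[name] = model.predict_proba(X_test)[:, 1]
    return out

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(12)
X_tr = pd.DataFrame(rng.normal(size=(300, 3)), columns=["v_0", "v_1", "v_2"])
X_te = pd.DataFrame(rng.normal(size=(20, 3)), columns=["v_0", "v_1", "v_2"])
targets = {
    "week_churn": (X_tr["v_0"] + rng.normal(size=300) > 0.5).astype(int),
    "month_churn": (X_tr["v_1"] + rng.normal(size=300) > 1.0).astype(int),
}

predictions = fit_and_predict(
    lambda: LogisticRegression(max_iter=1000),
    X_tr, targets, X_te, test_ids=np.arange(20),
)
print(predictions.head())
```

With the notebook's objects this would correspond to passing `X_train`, `{'week_churn': y_week, 'month_churn': y_month}`, `X_test`, and `test_nid`, with `make_model` building a `CatBoostClassifier` from the best parameters found for each target.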

#### - Pre-process test data (optional)

#### - Feature importance

### Deployment

### Experiments (optional)

### Conclusion