
# EdTech — Student Churn / Dropout Risk (Ready-to-Demo)

**Goal:** Identify students likely to stop using the platform (churn) so the product/CS team can intervene with targeted actions (mentoring, reminders, offers).

This notebook generates **synthetic student engagement data** and walks through:
- Data generation & EDA
- Feature engineering for engagement signals
- Baseline models (Logistic Regression, Random Forest)
- Model evaluation (confusion matrix, ROC AUC)
- Client-friendly interpretation and recommended actions


In [None]:
# 0) Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_fscore_support
from sklearn.preprocessing import StandardScaler

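import warnings
# Silence library warnings to keep the demo output clean
warnings.filterwarnings('ignore')
# Seeds NumPy's legacy global RNG; the data generation below uses its own seeded Generator (rng)
np.random.seed(42)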

pd.set_option('display.max_columns', 100)


## 1) Generate synthetic student engagement data

In [None]:
# Create synthetic dataset
rng = np.random.default_rng(42)
n_students = 1600

students = pd.DataFrame({
    'student_id': np.arange(1, n_students+1),
    'signup_date': pd.to_datetime('2023-01-01') + pd.to_timedelta(rng.integers(0, 700, n_students), unit='D'),
    'country': rng.choice(['US','IN','UK','CA','AU'], size=n_students, p=[0.4,0.25,0.15,0.12,0.08]),
    'cohort': rng.choice(['A','B','C','D'], size=n_students)
})

# Engagement features (synthetic)
lessons_completed_week = np.clip(rng.normal(3.2, 1.6, n_students), 0, None)
days_since_last_login = np.clip(rng.normal(9, 7, n_students), 0, None).astype(int)
practice_streak_days = np.clip(rng.normal(12, 9, n_students), 0, None).astype(int)
avg_quiz_score = np.clip(rng.normal(70, 14, n_students), 0, 100)
received_mentor_msg = rng.binomial(1, 0.3, n_students)
time_spent_minutes_week = np.clip(rng.normal(180, 80, n_students), 10, None)

# Synthetic churn generation (logistic probability model):
# negative coefficients lower churn risk, the positive one (inactivity) raises it
logit = (
    -1.0
    - 0.25*lessons_completed_week
    + 0.06*days_since_last_login
    - 0.02*practice_streak_days
    - 0.015*avg_quiz_score
    - 0.6*received_mentor_msg
    - 0.001*time_spent_minutes_week
)
p = 1/(1+np.exp(-logit))
churn = rng.binomial(1, p)

df = pd.DataFrame({
    'student_id': students['student_id'],
    'signup_date': students['signup_date'],
    'country': students['country'],
    'cohort': students['cohort'],
    'lessons_completed_week': lessons_completed_week,
    'days_since_last_login': days_since_last_login,
    'practice_streak_days': practice_streak_days,
    'avg_quiz_score': avg_quiz_score,
    'received_mentor_msg': received_mentor_msg,
    'time_spent_minutes_week': time_spent_minutes_week,
    'churned_next_month': churn
})

df.head()

## 2) Quick EDA

In [None]:
print(df.shape)
df.describe().round(2)

In [None]:
# Plot churn distribution
plt.figure()
df['churned_next_month'].value_counts().sort_index().plot(kind='bar')
plt.title('Churned vs retained (counts)')
plt.xlabel('churned_next_month')
plt.ylabel('count')
plt.show()

In [None]:
# Plot distributions of a few numeric features
for col in ['lessons_completed_week','days_since_last_login','practice_streak_days','avg_quiz_score','time_spent_minutes_week']:
    plt.figure()
    df[col].hist(bins=30)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('count')
    plt.show()

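**Inference (EDA):**
- The histograms show the spread of each engagement signal. Students with few lessons completed, little time spent, and many days since last login are the ones to watch; in this synthetic data they are generated with a higher churn probability.
- Mentor messages are associated with retention here by construction (the data generator gives `received_mentor_msg` a negative churn coefficient). On real data, confirm the relationship before concluding that promoting mentor contact reduces churn.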
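
The two observations above can be checked directly with group-wise summaries rather than read off the histograms. The following is a minimal sketch: the helper names `churn_rate_by_flag` and `feature_means_by_churn` and the small `toy` frame are illustrative assumptions, not part of the notebook. In the notebook itself you would pass `df` instead of `toy`.

```python
import pandas as pd

def churn_rate_by_flag(data, flag_col, target_col="churned_next_month"):
    """Churn rate and group size for each value of a binary flag column."""
    return data.groupby(flag_col)[target_col].agg(churn_rate="mean", n="size")

def feature_means_by_churn(data, feature_cols, target_col="churned_next_month"):
    """Mean of each feature for retained (0) vs churned (1) students."""
    return data.groupby(target_col)[feature_cols].mean()

# Tiny hand-made frame for illustration only (hypothetical values);
# it reuses the notebook's column names so the helpers also work on `df`.
toy = pd.DataFrame({
    "received_mentor_msg":    [0, 0, 0, 0, 1, 1, 1, 1],
    "days_since_last_login":  [20, 3, 15, 18, 2, 4, 12, 1],
    "lessons_completed_week": [1.0, 4.0, 2.0, 1.5, 5.0, 3.5, 2.0, 4.5],
    "churned_next_month":     [1, 0, 1, 1, 0, 0, 1, 0],
})

mentor_table = churn_rate_by_flag(toy, "received_mentor_msg")
means_table = feature_means_by_churn(
    toy, ["days_since_last_login", "lessons_completed_week"]
)
print(mentor_table)
print(means_table)
```

A lower churn rate in the `received_mentor_msg == 1` row, and higher `days_since_last_login` in the churned row, would support the two bullets. These are associations only; they do not show that mentor messages cause retention.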


## 3) Feature engineering & Baseline Models

In [None]:
features = ['lessons_completed_week','days_since_last_login','practice_streak_days','avg_quiz_score','received_mentor_msg','time_spent_minutes_week']
X = df[features].copy()
y = df['churned_next_month'].copy()

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale numeric features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_scaled, y_train)
proba_lr = lr.predict_proba(X_test_scaled)[:,1]
pred_lr = (proba_lr >= 0.5).astype(int)

# Random Forest
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)
proba_rf = rf.predict_proba(X_test)[:,1]
pred_rf = (proba_rf >= 0.5).astype(int)

# zero_division=0 keeps the reports well-defined if a model predicts no churners at the 0.5 threshold
print('Logistic Regression:')
print(classification_report(y_test, pred_lr, zero_division=0))
print('ROC AUC (LR):', round(roc_auc_score(y_test, proba_lr),3))

print('\nRandom Forest:')
print(classification_report(y_test, pred_rf, zero_division=0))
print('ROC AUC (RF):', round(roc_auc_score(y_test, proba_rf),3))

In [None]:
# ROC curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, proba_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, proba_rf)

plt.figure()
plt.plot(fpr_lr, tpr_lr, label='Logistic Regression')
plt.plot(fpr_rf, tpr_rf, label='Random Forest')
plt.plot([0,1],[0,1],'--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.show()

In [None]:
# Confusion matrix for Random Forest
cm = confusion_matrix(y_test, pred_rf)
print('Confusion matrix (RF):\n', cm)

plt.figure()
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion Matrix (RF)')
plt.colorbar()
for (i, j), v in np.ndenumerate(cm):
    plt.text(j, i, str(v), ha='center', va='center')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

In [None]:
# Feature importance from Random Forest
importances = rf.feature_importances_
fi = pd.DataFrame({'feature': features, 'importance': importances}).sort_values('importance', ascending=False)
fi

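**Inference (Model Results):**
- Judge how well the models separate at-risk students by ROC AUC and the per-class precision/recall in the reports above, not by overall accuracy: churners are a minority class in this synthetic data, so a model can score high accuracy while flagging very few of them at the default 0.5 threshold.
- Important signals often include lessons completed, days since last login, time spent, and mentor contact. Random Forest impurity-based importances tend to favor continuous features over binary ones such as `received_mentor_msg`, so read the ranking as a guide rather than a causal statement.
- Use model scores to create 'early warning' lists for the student success team.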
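
One simple way to turn scores into an early-warning list is to rank students by predicted churn probability and keep the top N. The sketch below assumes a hypothetical helper `early_warning_list` and made-up demo IDs and scores; it is not the only reasonable design (a score threshold would also work).

```python
import numpy as np
import pandas as pd

def early_warning_list(student_ids, churn_scores, top_n=50):
    """Return the top_n students ranked by churn score, highest risk first."""
    ranked = pd.DataFrame({
        "student_id": np.asarray(student_ids),
        "churn_score": np.asarray(churn_scores, dtype=float),
    })
    # A stable sort keeps the original order among students with tied scores
    ranked = ranked.sort_values("churn_score", ascending=False, kind="mergesort")
    return ranked.head(top_n).reset_index(drop=True)

# Hypothetical IDs and scores, for illustration only
demo_ids = [101, 102, 103, 104, 105]
demo_scores = [0.08, 0.41, 0.02, 0.27, 0.33]

watchlist = early_warning_list(demo_ids, demo_scores, top_n=3)
print(watchlist)
```

In this notebook the equivalent call would be `early_warning_list(df.loc[X_test.index, 'student_id'], proba_rf, top_n=50)`, since `proba_rf` is in the same row order as `X_test`. Ranking by score sidesteps the choice of a fixed threshold, which matters when few students score above 0.5.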


In [None]:
# Threshold tuning table for the Random Forest scores (precision vs recall)
thresholds = [0.2,0.3,0.4,0.5,0.6,0.7]
rows = []
for t in thresholds:
    pred = (proba_rf >= t).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, pred, average='binary', zero_division=0)
    rows.append({'threshold': t, 'precision': round(precision,3), 'recall': round(recall,3), 'f1': round(f1,3)})
pd.DataFrame(rows)
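
The fixed grid above can be complemented by searching all candidate thresholds for the highest one that still meets a recall target. This is a sketch using scikit-learn's `precision_recall_curve`; the helper name `threshold_for_recall`, the 0.75 target, and the toy labels and scores are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, scores, target_recall=0.75):
    """Highest score threshold whose recall is still >= target_recall.

    Returns (threshold, precision, recall) at that operating point.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall have one more entry than thresholds;
    # the final point (recall = 0) has no threshold, so it is dropped here.
    meets_target = np.where(recall[:-1] >= target_recall)[0]
    if meets_target.size == 0:
        raise ValueError("No threshold reaches the requested recall")
    # Thresholds are ascending and recall is non-increasing,
    # so the last qualifying index is the highest usable threshold.
    best = meets_target[-1]
    return float(thresholds[best]), float(precision[best]), float(recall[best])

# Toy labels and scores, for illustration only
toy_y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
toy_scores = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.9])

chosen_t, chosen_precision, chosen_recall = threshold_for_recall(
    toy_y, toy_scores, target_recall=0.75
)
print(chosen_t, chosen_precision, chosen_recall)
```

With the notebook's variables, the call would be `threshold_for_recall(y_test, proba_rf, target_recall=0.75)`. The recall target is a business decision: it reflects how many at-risk students the team is willing to miss versus how many outreach contacts it can afford.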


## 4) Actionable Playbook (What to tell the client)

- **Weekly early-warning list:** Run the model weekly and send the top N at-risk students to Student Success for outreach.
- **Personalized interventions:** For students with low quiz scores, offer mini-tutorials; for those with low activity, send re-engagement emails or mentor sessions.  
- **Measure results:** Track whether interventions reduce subsequent churn (A/B test recommended).  
- **Data to add later for better accuracy:** course progress, assignment submissions, in-app events, message response times.
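
For the "measure results" step, one common way to read an A/B test is a two-proportion z-test on churn rates in the treated and control groups. The sketch below implements the pooled-variance version with only the standard library; the function name `two_proportion_ztest` and the counts in the example are hypothetical, not results from this notebook.

```python
import math

def two_proportion_ztest(churned_treated, n_treated, churned_control, n_control):
    """Pooled two-proportion z-test for a difference in churn rates.

    Returns (rate difference treated - control, z statistic, two-sided p-value).
    """
    rate_treated = churned_treated / n_treated
    rate_control = churned_control / n_control
    pooled = (churned_treated + churned_control) / (n_treated + n_control)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    if std_err == 0:
        raise ValueError("Churn rates are all 0 or all 1; the test is undefined")
    z = (rate_treated - rate_control) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_treated - rate_control, z, p_value

# Hypothetical counts: 30 of 200 contacted students churned vs 50 of 200 in control
diff, z_stat, p_value = two_proportion_ztest(30, 200, 50, 200)
print(f"difference={diff:.3f}, z={z_stat:.2f}, p={p_value:.4f}")
```

A negative difference means the intervention group churned less. The test assumes students were randomly assigned to the two groups; comparing students who happened to receive outreach against those who did not would confound the result.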

**Next steps to production:** Replace the synthetic data with real exports, retrain the model, and serve it via a simple API or integrate it with the CRM.
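
Before serving the model, it helps to bundle preprocessing and the classifier into one object so the scaler fitted at training time is always applied at scoring time. The following is a sketch under assumptions: it uses a scikit-learn pipeline persisted with `joblib`, random demo data in place of real exports, and a hypothetical file name `churn_pipeline.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random demo data standing in for real engagement features (illustration only)
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * demo_rng.normal(size=200) > 0).astype(int)

# Scaler and model travel together, so scoring code cannot skip the scaling step
churn_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
churn_pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "churn_pipeline.joblib")  # hypothetical file name
    joblib.dump(churn_pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original scores
scores_before = churn_pipeline.predict_proba(X_demo[:5])[:, 1]
scores_after = restored.predict_proba(X_demo[:5])[:, 1]
print(np.allclose(scores_before, scores_after))
```

An API endpoint or CRM job would then load the saved file once and call `predict_proba` on incoming feature rows. Load the file with the same scikit-learn version that saved it, since persisted models are not guaranteed to work across versions.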
