---
# Assignment Description
---

In this notebook, we build a machine learning model to predict whether a user will stop making payments within the next three months, given a specific reference date.

The analysis uses two datasets:

- A user table containing one row per user with five user-level features.

- A monthly payments table containing payment dates for each user.

Users make payments monthly until they stop, and once a user stops paying, they never resume.
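
The "never resume" rule can be checked directly on the payments table. The following is a minimal sketch, assuming the `user_id` and `date` columns used later in this notebook and at most one payment per user per calendar month. The helper `users_with_gaps` and the toy data are hypothetical and only serve as an illustration.

```python
import pandas as pd

def users_with_gaps(payments: pd.DataFrame) -> list:
    """Return user_ids whose payment months are not consecutive."""
    # Convert each date to a running month index (year * 12 + month).
    month_index = payments["date"].dt.year * 12 + payments["date"].dt.month
    flagged = []
    for user_id, months in month_index.groupby(payments["user_id"]):
        ordered = months.sort_values()
        # A step other than 1 means a skipped or duplicated month.
        if (ordered.diff().dropna() != 1).any():
            flagged.append(user_id)
    return flagged

# Toy data (hypothetical): user 1 pays January to March, user 2 skips February.
toy_payments = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2020-01-05", "2020-02-05", "2020-03-05", "2020-01-10", "2020-03-10"]
    ),
})
flagged_users = users_with_gaps(toy_payments)
print(flagged_users)
```

An empty result on the real `payments` table would be consistent with the stated assumption. A non-empty result would suggest that the labeling logic needs a closer look for those users.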

The notebook covers the full modeling workflow:

1- Loading and exploring the data

2- Constructing a dataset for a given reference date (features and target per user)

3- Splitting the data into training, validation, and test sets

4- Training and evaluating a predictive model

---
# 0. Load libraries
---

In [2]:
# =========================================
# 0. Load libraries
# =========================================
import pandas as pd
from matplotlib import pyplot
from pandas import read_csv
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


---
# 1. Load the two CSV files with the data
---


In [3]:
# =========================================
# 1. Load the two CSV files with the data
# =========================================

# 1.1 Load data
users = read_csv("user_table.csv")
payments = read_csv("monthly_payments.csv")

# 1.2 Clean data
payments["date"] = pd.to_datetime(payments["date"])
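# Reset the index after sorting so that row-wise results computed later
# align with the merged data when combined with pd.concat(axis=1)
payments = payments.sort_values(by=["date", "user_id"]).reset_index(drop=True)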
#if payments['payment'].value_counts()[1] == len(payments): del(payments['payment'])


---
# 2. Write a function that constructs a dataset 
---

The dataset consists of a feature vector and a target per user, constructed for a given date.


In [3]:
# =========================================================
# 2. Write a function that constructs a dataset 
# (corresponding to a feature vector and a target per user) 
# for a given date
# =========================================================

# 2.1 Preprocess payments data
snapshots = payments[["user_id", "date"]].copy()
snapshots.rename(columns={"date": "snapshot_date"}, inplace=True)

# 2.2 Labeling function
# snapshot_date = snapshots['snapshot_date'][0]
# user_id = snapshots['user_id'][0]
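def label_future_payments(payments, snapshot_date, user_id):
    """Return 1 if the user has fewer than 3 payments between
    snapshot_date + 1 month and snapshot_date + 3 months (inclusive), else 0."""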
    future_start = snapshot_date + pd.DateOffset(months=1)
    future_end = snapshot_date + pd.DateOffset(months=3)

    future_payments = payments[
        (payments.user_id == user_id) &
        (payments.date >= future_start) &
        (payments.date <= future_end)
    ]

    # Expect 3 monthly payments
    return int(len(future_payments) < 3)

snapshots["y"] = snapshots.apply(
    lambda row: label_future_payments(
        payments, row.snapshot_date, row.user_id
    ),
    axis=1
)

data = snapshots.merge(users, on="user_id", how="left")


# 2.3 Tenure and number of payments features
# snapshot_date = snapshots['snapshot_date'][0]
# user_id = snapshots['user_id'][0]
def build_behavioral_features(payments, snapshot_date, user_id):
    """Count the user's payments made strictly before snapshot_date."""
    history = payments[
        (payments.user_id == user_id) &
        (payments.date < snapshot_date)
    ].sort_values("date")

    if history.empty:
        return pd.Series({
            "num_payments": 0
        })

    # Tenure in months (30-day approximation); computed but not returned as a feature
    tenure = (snapshot_date - history.date.min()).days / 30

    return pd.Series({
        "num_payments": len(history)
    })

behavioral = snapshots.apply(
    lambda row: build_behavioral_features(
        payments, row.snapshot_date, row.user_id
    ),
    axis=1
)

data = pd.concat([data, behavioral], axis=1)
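
One consequence of the labeling rule above is worth making explicit. A snapshot taken less than three months before the last date in the data cannot contain three future payments, so it is labeled 1 regardless of what the user does afterwards. The sketch below shows one way to drop such snapshots. It assumes the observation period ends at a known date; `drop_censored_snapshots`, `horizon_months`, and the toy data are hypothetical names, not part of the original notebook.

```python
import pandas as pd

def drop_censored_snapshots(data, last_observed_date, horizon_months=3):
    """Keep only snapshots whose full label window lies within the observed data."""
    # End of the label window for every snapshot.
    window_end = data["snapshot_date"] + pd.DateOffset(months=horizon_months)
    return data[window_end <= last_observed_date]

# Toy data (hypothetical): the last snapshot's window ends after the data does.
toy_data = pd.DataFrame({
    "user_id": [1, 1, 1],
    "snapshot_date": pd.to_datetime(["2020-01-05", "2020-03-05", "2020-05-05"]),
    "y": [0, 0, 1],
})
toy_last_date = pd.Timestamp("2020-06-05")
toy_kept = drop_censored_snapshots(toy_data, toy_last_date)
print(toy_kept)
```

In this notebook the call might look like `data = drop_censored_snapshots(data, payments["date"].max())`, applied before the split, under the assumption that the last payment date marks the end of the observation period.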


---
# 3. Construct a train, validation, and test dataset
---

In [4]:
# ==================================================
# 3. Construct a train, validation, and test dataset
# ==================================================

# 3.1 Split into train, validation and test based on snapshot_date
# Note:
# To evaluate the model in a realistic setting, the dataset is split
# chronologically based on the snapshot_date. This avoids information
# leakage from the future into the past, which would occur with random
# splits or standard cross-validation.
#
# The earliest 60% of snapshot dates are used for training, the next
# 20% for validation (model selection), and the most recent 20% for
# final testing. This setup simulates the real-world scenario where
# models are trained on historical data and used to predict future
# user behavior.

cutoff_date_1 = data["snapshot_date"].quantile(0.6)
cutoff_date_2 = data["snapshot_date"].quantile(0.8)

train = data[data.snapshot_date <= cutoff_date_1]
val = data[(data.snapshot_date > cutoff_date_1) & (data.snapshot_date <= cutoff_date_2)]
test = data[(data.snapshot_date > cutoff_date_2)]

X_train = train.drop(columns=["user_id", "snapshot_date", "y"])
Y_train = train["y"]

X_val = val.drop(columns=["user_id", "snapshot_date", "y"])
Y_val = val["y"]

X_test = test.drop(columns=["user_id", "snapshot_date", "y"])
Y_test = test["y"]
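
The chronological split can also be wrapped in a small function that verifies the three sets do not overlap in time. This is a minimal sketch on toy data, mirroring the quantile cutoffs used above; `chronological_split` and the `toy_` names are hypothetical.

```python
import pandas as pd

def chronological_split(data, date_col="snapshot_date", q_train=0.6, q_val=0.8):
    """Split rows into train/val/test by quantiles of the date column."""
    cutoff_1 = data[date_col].quantile(q_train)
    cutoff_2 = data[date_col].quantile(q_val)
    train = data[data[date_col] <= cutoff_1]
    val = data[(data[date_col] > cutoff_1) & (data[date_col] <= cutoff_2)]
    test = data[data[date_col] > cutoff_2]
    # Every training date must precede every validation date,
    # and every validation date must precede every test date.
    assert train[date_col].max() < val[date_col].min() < test[date_col].min()
    return train, val, test

# Toy data (hypothetical): ten monthly snapshots.
toy = pd.DataFrame({
    "snapshot_date": pd.date_range("2020-01-01", periods=10, freq="MS"),
    "y": [0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
})
toy_train, toy_val, toy_test = chronological_split(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```

The internal assertion fails loudly if any set is empty or if the date ranges overlap, which is a cheap guard against accidental leakage.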


---
# 4. Train and evaluate a model
---

In [None]:
# =============================
# 4. Train and evaluate a model
# =============================

# 4.1 Test options and evaluation metric
scoring = 'roc_auc'
# Note:
# ROC-AUC is used because the task focuses on ranking churn risk
# rather than making binary decisions at a fixed threshold.
seed = 7

# 4.2 Spot-Check Algorithms
models = {}
models['LR'] = LogisticRegression(random_state=seed)
models['DT'] = DecisionTreeClassifier(random_state=seed)
models['RF'] = RandomForestClassifier(random_state=seed)

results = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    val_pred = model.predict_proba(X_val)[:, 1]
    score = roc_auc_score(Y_val, val_pred)
    results[name] = score
    print(f"{name} validation ROC-AUC: {score:.4f}")

best_model_name = max(results, key=results.get)
best_model = models[best_model_name]

print("Best model:", best_model_name)


# 4.3 Final evaluation on test (only once)
X_train_val = pd.concat([X_train, X_val]) # Note: retrain on train + val
Y_train_val = pd.concat([Y_train, Y_val]) 
best_model.fit(X_train_val, Y_train_val)

test_pred = best_model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(Y_test, test_pred)

print("Test ROC-AUC:", test_auc)
# 0.5383776561198086 (only fit on train)
# 0.5467572259423489 with tuning

LR validation ROC-AUC: 0.6914
DT validation ROC-AUC: 0.5992
RF validation ROC-AUC: 0.6529
Best model: LR
Test ROC-AUC: 0.5542090760141853
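
To make the ranking interpretation of these scores concrete: ROC-AUC equals the probability that a randomly chosen positive (churner) receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes this directly from pairs in plain Python. `pairwise_auc` and the toy values are hypothetical and only illustrate the metric; they are not a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive is ranked higher."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy data (hypothetical): two non-churners followed by two churners.
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_y, toy_scores)
print(toy_auc)
```

Three of the four positive-negative pairs are ordered correctly here, so the value is 0.75. A value near 0.5, like the test score above, means the ranking is only slightly better than random.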


Note: Performance improves when training on later snapshots because users close
to churn exhibit more deterministic outcomes under the assumption of
irreversible monthly churn.