# Prediction -> Decision (Threshold / Top-K)

**Business assumptions (adjustable parameters):**
- Contact cost: `cost = 1` (unit cost)
- If the customer truly churns, successful retention yields a benefit: `benefit = 3`
- At most `K` customers can be contacted per month (resource constraint)
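These parameters imply a break-even point for a single contact. The following is a minimal sketch, assuming `p_churn` can be read as a calibrated probability and that every contacted churner is retained; the function names are hypothetical and not part of the notebook.

```python
def expected_net_gain(p_churn, cost=1.0, benefit=3.0):
    """Expected net gain of contacting one customer with churn probability p_churn."""
    return p_churn * benefit - cost


def break_even_probability(cost=1.0, benefit=3.0):
    """Churn probability at which the expected net gain of a contact is zero."""
    return cost / benefit


# With cost = 1 and benefit = 3, a contact pays off in expectation
# only when p_churn exceeds 1/3.
break_even = break_even_probability()
gain_at_half = expected_net_gain(0.5)   # above break-even, positive
gain_at_fifth = expected_net_gain(0.2)  # below break-even, negative
```

Changing `cost` or `benefit` moves the break-even point, which is why both are kept as adjustable parameters.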

In [76]:
import pandas as pd

In [77]:
COST = 1.0
BENEFIT = 3.0
NO_PEOPLE_CONTACTED_LIST = [20, 50, 100, 200, 300, 500, 800]

## Score Distribution Sanity Check

In [78]:
df = pd.read_parquet("../../data/predictions/validation_predictions.parquet")
print(df["p_churn"].describe(percentiles=[0.5, 0.7, 0.8, 0.9, 0.95, 0.99]))

count    1409.000000
mean        0.415687
std         0.300147
min         0.012712
50%         0.404591
70%         0.649503
80%         0.735047
90%         0.847172
95%         0.890847
99%         0.930516
max         0.952580
Name: p_churn, dtype: float64


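Fact: The median score is about 0.40, but the 70th percentile is already about 0.65, so at least 30% of the 1,409 customers (more than 420) score above 0.5. A fixed `threshold = 0.5` flags a number of customers that depends on the score distribution rather than on the contact budget `K`, so it is often not a good decision rule; top-K is more natural.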
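To see how a fixed threshold leaves the number of contacts uncontrolled, here is a minimal sketch of a threshold policy. It assumes the same `p_churn` and `churn` columns; `simulate_threshold` and the toy frame are hypothetical and for illustration only.

```python
import pandas as pd


def simulate_threshold(df, threshold, cost, benefit):
    """Contact every customer whose p_churn is at or above `threshold`."""
    flagged = df[df["p_churn"] >= threshold]
    n_contacted = len(flagged)
    true_churners = int(flagged["churn"].sum())
    return {
        "threshold": threshold,
        "n_contacted": n_contacted,  # determined by the scores, not by a budget
        "true_churners": true_churners,
        "net_gain": true_churners * benefit - n_contacted * cost,
    }


# Hypothetical toy scores and labels, not taken from the validation set.
toy = pd.DataFrame({
    "p_churn": [0.9, 0.8, 0.6, 0.55, 0.3, 0.1],
    "churn": [1, 1, 0, 1, 0, 0],
})
result = simulate_threshold(toy, threshold=0.5, cost=1.0, benefit=3.0)
```

The `n_contacted` field is the quantity to compare against the monthly limit `K`.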

## Top-K Decision Policy Simulation

In [79]:
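def simulate_top_k(df, k, cost, benefit):
    """
    Simulate a top-K intervention strategy.

    Contacts the k customers with the highest `p_churn` and counts a
    benefit for every true churner among them (retention is assumed to
    always succeed).
    """
    topk = df.sort_values("p_churn", ascending=False).head(k)

    # True churners in top-K
    true_churners = topk["churn"].sum()

    # Charge only for customers actually contacted (matters when k > len(df))
    total_cost = len(topk) * cost
    total_benefit = true_churners * benefit
    net_gain = total_benefit - total_cost

    return {
        "k": k,
        "true_churners": int(true_churners),
        "total_cost": total_cost,
        "total_benefit": total_benefit,
        "net_gain": net_gain,
    }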

## Decision Curve Analysis (Net Gain vs. K)

In [80]:
results = []
for k in NO_PEOPLE_CONTACTED_LIST:
    results.append(simulate_top_k(df, k, cost=COST, benefit=BENEFIT))

pd.DataFrame(results)

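| | k | true_churners | total_cost | total_benefit | net_gain |
|---|---|---|---|---|---|
| 0 | 20 | 17 | 20.0 | 51.0 | 31.0 |
| 1 | 50 | 41 | 50.0 | 123.0 | 73.0 |
| 2 | 100 | 78 | 100.0 | 234.0 | 134.0 |
| 3 | 200 | 141 | 200.0 | 423.0 | 223.0 |
| 4 | 300 | 197 | 300.0 | 591.0 | 291.0 |
| 5 | 500 | 277 | 500.0 | 831.0 | 331.0 |
| 6 | 800 | 345 | 800.0 | 1035.0 | 235.0 |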


Fact: `net_gain` first rises, then falls, peaking at `k = 500` among the tested values; this is the prototype of a **decision curve**.
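One way to read the curve is through the precision of each newly added slice of customers: with `cost = 1` and `benefit = 3`, extending K pays off only while more than 1/3 of the added customers are true churners. The sketch below uses the values from the table above; `best_k` and `marginal_precision` are hypothetical helper names.

```python
# Values copied from the net-gain table above.
results = [
    {"k": 20, "true_churners": 17, "net_gain": 31.0},
    {"k": 50, "true_churners": 41, "net_gain": 73.0},
    {"k": 100, "true_churners": 78, "net_gain": 134.0},
    {"k": 200, "true_churners": 141, "net_gain": 223.0},
    {"k": 300, "true_churners": 197, "net_gain": 291.0},
    {"k": 500, "true_churners": 277, "net_gain": 331.0},
    {"k": 800, "true_churners": 345, "net_gain": 235.0},
]


def best_k(rows):
    """Return the row with the highest net gain among the tested K values."""
    return max(rows, key=lambda r: r["net_gain"])


def marginal_precision(rows):
    """Share of true churners among the customers added by each step in K."""
    out = []
    prev_k, prev_hits = 0, 0
    for r in rows:
        added = r["k"] - prev_k
        hits = r["true_churners"] - prev_hits
        out.append((r["k"], hits / added))
        prev_k, prev_hits = r["k"], r["true_churners"]
    return out


best = best_k(results)
segments = marginal_precision(results)
```

Going from K = 300 to K = 500 adds 80 churners among 200 contacts (0.40, above 1/3), while going from 500 to 800 adds 68 among 300 (about 0.23, below 1/3), which is where the net gain turns down. The peak is only located on the tested grid; a finer grid of K values could place it elsewhere between 300 and 800.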

## Comparison with Random Baseline

In [81]:
def simulate_random(df, k, cost, benefit, seed=42):
    """Baseline: contact k customers chosen uniformly at random (one draw, fixed seed)."""
    rnd = df.sample(k, random_state=seed)
    true_churners = rnd["churn"].sum()
    return {
        "k": k,
        "net_gain": true_churners * benefit - k * cost,
    }

In [82]:
results = []
for k in NO_PEOPLE_CONTACTED_LIST:
    results.append(simulate_random(df, k, cost=COST, benefit=BENEFIT))

pd.DataFrame(results)

| | k | net_gain |
|---|---|---|
| 0 | 20 | -5.0 |
| 1 | 50 | -8.0 |
| 2 | 100 | -22.0 |
| 3 | 200 | -29.0 |
| 4 | 300 | -54.0 |
| 5 | 500 | -86.0 |
| 6 | 800 | -152.0 |
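Each baseline value above comes from a single draw with `seed=42`, so it carries sampling noise. A minimal sketch of averaging the baseline over several seeds follows; `simulate_random_avg`, `n_seeds`, and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def simulate_random_avg(df, k, cost, benefit, n_seeds=100):
    """Average the random-baseline net gain over seeds 0 .. n_seeds - 1."""
    gains = []
    for seed in range(n_seeds):
        sample = df.sample(k, random_state=seed)
        gains.append(sample["churn"].sum() * benefit - k * cost)
    return sum(gains) / len(gains)


# Hypothetical toy labels: 3 churners among 10 customers.
toy = pd.DataFrame({"churn": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

# Sampling all 10 rows always finds all 3 churners: 3 * 3.0 - 10 * 1.0.
full = simulate_random_avg(toy, k=10, cost=1.0, benefit=3.0, n_seeds=5)
```

Applied to the validation frame, the averaged baseline gives a steadier reference for the top-K results, which are positive at every tested K while the single-seed baseline is negative throughout.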
