# [Optimization Project](https://chatgpt.com/s/t_693f266a368081919a228f4d68a173e3)

Executive interpretation (what they really want)

> “Using only information available at the time of application, identify a small number of rules or scores that reduce delinquency without killing active accounts.”

This is not about perfect prediction.
It is about:

- risk screening
- policy rules
- simple thresholds
- explainability

# Step 1: Lock the data assumptions (non-negotiable)

Before modeling, write this down in your notes or README:

## 1.1 Confirm scope

- Population: approved applications only
  - This is an assumption, because no column identifies whether an application was approved.
- Timing: all features measured at application time
  - This is also an assumption, because the sample data contains no dates.
- No post-approval behavior allowed
  - We are not updating the data.

This protects you from leakage accusations later.

## 1.2 Define outcomes clearly

Primary KPIs

1. Not Delinquent
2. Active Account
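
As a minimal sketch of how the two outcomes could be measured, the snippet below computes both rates from 0/1 flags. The `is_active` field name and the sample rows are assumptions made for illustration; only `is_delinquent` appears in the notebook code further down.

```python
# Made-up records; `is_delinquent` and `is_active` are assumed to be 0/1 flags.
accounts = [
    {"is_delinquent": 0, "is_active": 1},
    {"is_delinquent": 1, "is_active": 1},
    {"is_delinquent": 0, "is_active": 0},
    {"is_delinquent": 0, "is_active": 1},
]


def rate(rows, flag):
    """Share of rows where the 0/1 flag is set."""
    return sum(row[flag] for row in rows) / len(rows)


# KPI 1: share of accounts that are not delinquent.
not_delinquent_rate = 1 - rate(accounts, "is_delinquent")
# KPI 2: share of accounts that are active.
active_rate = rate(accounts, "is_active")

print(f"not delinquent: {not_delinquent_rate:.2f}, active: {active_rate:.2f}")
```

Tracking both numbers side by side makes the trade-off visible: a rule that lowers delinquency but also lowers the active rate is "killing active accounts".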

# Step 2: Feature triage (this matters more than modeling)

You already started this, which is good.

## 2.1 Categorize features into 3 buckets

| Category | What it means | Keep? |
| --- | --- | --- |
| Immutable | Identity / history | ✅ Yes |
| Actionable | Can influence decision | ⚠️ Careful |
| Derived / behavioral | Post-decision signals | ❌ No |

Explicitly exclude:

- days from last login
- anything that requires activity after approval
- anything influenced by delinquency itself
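
The triage above could be applied as a simple filter once every feature has a category tag. This is only a sketch: the feature names and the `FEATURE_CATEGORIES` mapping are hypothetical placeholders, not columns confirmed in the project data.

```python
# Hypothetical feature-to-category tags; replace with the real data dictionary.
FEATURE_CATEGORIES = {
    "income": "immutable",
    "payday_inquiries": "immutable",
    "requested_amount": "actionable",
    "days_from_last_login": "derived",
}

# Only these categories are eligible for application-time screening.
ALLOWED = {"immutable", "actionable"}


def select_features(categories: dict[str, str]) -> list[str]:
    """Return the features whose category is allowed, sorted by name."""
    return sorted(name for name, cat in categories.items() if cat in ALLOWED)


kept = select_features(FEATURE_CATEGORIES)
dropped = sorted(set(FEATURE_CATEGORIES) - set(kept))

print("keep:", kept)
print("drop:", dropped)
```

Keeping the tags in one mapping means the exclusion list is written down once and can be reviewed alongside the README assumptions.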

This step alone can win you credibility.

## [2.2 Reduce feature set aggressively](https://chatgpt.com/s/t_693f28c541448191bffee6a7b448f910)

Your mandate says:

> “as simple a strategy as possible”

So your goal is 5–10 features max.

How to reduce without math:

1. [Compute delinquency rate by feature bucket:](https://chatgpt.com/s/t_693f28c541448191bffee6a7b448f910)

   - quartiles for numeric

   - yes/no for booleans

2. Keep features where:

   - delinquency rate clearly increases or decreases

   - effect is monotonic (direction doesn’t flip)
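
The two steps above can be sketched in plain Python so the logic stays visible. The helper names `quartile_rates` and `is_monotonic` and the sample data are made up for illustration; the same idea can be expressed with a dataframe library.

```python
from statistics import quantiles


def quartile_rates(values, outcomes):
    """Return the delinquency rate for each quartile bucket, lowest first."""
    q1, q2, q3 = quantiles(values, n=4)  # three cut points
    buckets = [[], [], [], []]
    for value, outcome in zip(values, outcomes):
        # Count how many cut points the value exceeds to find its bucket.
        index = sum(value > cut for cut in (q1, q2, q3))
        buckets[index].append(outcome)
    # Skip empty buckets, which can occur when many values are tied.
    return [sum(b) / len(b) for b in buckets if b]


def is_monotonic(rates):
    """True if the rates never increase or never decrease across buckets."""
    pairs = list(zip(rates, rates[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Made-up example: 12 applicants, income in thousands, 1 = delinquent.
income = list(range(1, 13))
is_delinquent = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

rates = quartile_rates(income, is_delinquent)
print(rates, is_monotonic(rates))
```

A feature whose rates rise and then fall (or the reverse) would fail `is_monotonic` and be a weaker candidate for a single-threshold rule.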

Example logic:

> “Applicants with ≥3 payday inquiries have 2.3× delinquency”

That’s gold.
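
A statement like that can be checked with a ratio of delinquency rates. The sketch below assumes the comparison is between applicants at or above the threshold and those below it; the `delinquency_lift` helper and the data are illustrative and do not reproduce the 2.3× figure.

```python
def delinquency_lift(feature_values, outcomes, threshold):
    """Ratio of the delinquency rate at/above the threshold to the rate below it.

    Returns None when either group is empty or the baseline rate is zero,
    because the ratio is undefined in those cases.
    """
    flagged = [y for v, y in zip(feature_values, outcomes) if v >= threshold]
    rest = [y for v, y in zip(feature_values, outcomes) if v < threshold]
    if not flagged or not rest or sum(rest) == 0:
        return None
    return (sum(flagged) / len(flagged)) / (sum(rest) / len(rest))


# Made-up example: payday inquiry counts and 0/1 delinquency flags.
payday_inquiries = [0, 0, 1, 1, 2, 2, 3, 3, 4, 5]
is_delinquent = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

lift = delinquency_lift(payday_inquiries, is_delinquent, threshold=3)
print(f"lift at >=3 inquiries: {lift:.1f}x")
```

Reporting the group sizes next to the ratio is worth doing in practice, since a large lift on a handful of applicants is not a reliable rule.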



```python
import os
from typing import List, Dict, Any

import polars as pl
```

```python
# Load the project data and its data dictionary from the working directory.
df = pl.read_csv(os.path.join(os.getcwd(), "Project Data.csv"))
data_dict = pl.read_excel(
    os.path.join(os.getcwd(), "Project Data Dictionary 1.xlsx"),
    columns=["Field", "Description", "Classification", "type"],
)
```

```python
# Bucket income into quartiles (Q1 = lowest, Q4 = highest).
df_q = df.with_columns(
    pl.col("income")
      .qcut(4, labels=["Q1", "Q2", "Q3", "Q4"])
      .alias("income_quartered")
)
```

```python
# Delinquency rate and row count per income quartile.
# Group on the column created in the previous cell.
delinq_by_bucket = (
    df_q
    .group_by("income_quartered")
    .agg(
        pl.len().alias("n"),
        pl.mean("is_delinquent").alias("delinquency_rate"),
    )
    .sort("income_quartered")
)

print(delinq_by_bucket)
```