## Q2. Create a risk scoring system using at least 3 features from the credit dataset. Segment customers into Low, Medium, High Risk. Show:
### a. Number of customers in each segment
### b. Average EMI delay per segment

We treat a customer as risky if they show any of the following:

- Poor EMI repayment behavior

- Frequent or long payment delays

- Weak income documentation (little proof of income)

We therefore use three features to judge risk:

| **Feature**                              | **Risk Indicator**                              |
|------------------------------------------|-------------------------------------------------|
| `emi_paid_months` / `total_emi_months`   | Low paid-to-total ratio = poor repayment        |
| `payment_delay_days`                     | More delay = more risk                          |
| `income_doc_score`                       | Lower score = less reliable income              |


We score each customer with simple rules:

- Each feature adds 0, 1, or 2 points depending on how bad the behavior is

- More points mean more risk

- The sum of the points is the risk score, which ranges from 0 to 6

Once each customer has a risk score, we group them:

- Low Risk → Score `0–2`

- Medium Risk → Score `3–4`

- High Risk → Score `5–6`

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("tractor_credit_data.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   customer_id          300 non-null    object
 1   upazila              300 non-null    object
 2   district             300 non-null    object
 3   purchase_month       300 non-null    object
 4   total_credit_amount  300 non-null    int64 
 5   emi_paid_months      300 non-null    int64 
 6   total_emi_months     300 non-null    int64 
 7   payment_delay_days   300 non-null    int64 
 8   dealer_id            300 non-null    object
 9   income_doc_score     300 non-null    int64 
 10  land_ownership_flag  300 non-null    int64 
dtypes: int64(6), object(5)
memory usage: 25.9+ KB


In [4]:
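def risk_score(row):
    """Return a 0-6 risk score; each of the three features adds 0, 1, or 2 points."""
    score = 0

    # EMI repayment behavior: share of scheduled EMI months actually paid.
    # A loan with no scheduled months is treated as fully repaid (ratio = 1).
    emi_ratio = row["emi_paid_months"] / row["total_emi_months"] if row["total_emi_months"] > 0 else 1
    if emi_ratio < 0.5:
        score += 2   # less than half of the EMIs paid
    elif emi_ratio < 0.75:
        score += 1   # between half and three quarters paid

    # Payment delay: more than 30 days is severe, any delay at all is mild
    if row["payment_delay_days"] > 30:
        score += 2
    elif row["payment_delay_days"] > 0:
        score += 1

    # Income doc score: below 400 is weak, 400-599 is moderate
    if row["income_doc_score"] < 400:
        score += 2
    elif row["income_doc_score"] < 600:
        score += 1

    return score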


In [5]:
df["risk_score"] = df.apply(risk_score, axis=1)
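
The row-wise `apply` above is fine for 300 rows. For larger data, the same rules could be expressed column-wise. The following is a minimal sketch, not the notebook's method: `risk_score_vectorized` and the toy frame `sample` are hypothetical names, and the toy values are copied from the first five rows of the `df.head()` output in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): the first five customers from this notebook's df.head() output.
sample = pd.DataFrame({
    "emi_paid_months": [12, 7, 5, 9, 8],
    "total_emi_months": [12, 12, 12, 12, 12],
    "payment_delay_days": [0, 0, 0, 0, 0],
    "income_doc_score": [320, 430, 759, 489, 320],
})

def risk_score_vectorized(frame):
    """Column-wise version of the same point rules (hypothetical helper)."""
    total = frame["total_emi_months"]
    # Non-positive totals become NaN, then fall back to ratio 1 (fully repaid),
    # mirroring the guard in the row-wise function.
    ratio = (frame["emi_paid_months"] / total.where(total > 0)).fillna(1.0)

    # np.select takes the first condition that matches, like if / elif.
    emi_pts = np.select([ratio < 0.5, ratio < 0.75], [2, 1], default=0)

    delay = frame["payment_delay_days"]
    delay_pts = np.select([delay > 30, delay > 0], [2, 1], default=0)

    doc = frame["income_doc_score"]
    doc_pts = np.select([doc < 400, doc < 600], [2, 1], default=0)

    return pd.Series(emi_pts + delay_pts + doc_pts, index=frame.index, name="risk_score")

print(risk_score_vectorized(sample).tolist())  # → [2, 2, 2, 1, 3]
```

These scores match the `risk_score` column shown for the same five customers.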

In [6]:
df.head()

|   | customer_id | upazila | district | purchase_month | total_credit_amount | emi_paid_months | total_emi_months | payment_delay_days | dealer_id | income_doc_score | land_ownership_flag | risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST_1000 | Banaripara | Barishal | 2023-07 | 346867 | 12 | 12 | 0 | DEALER_8 | 320 | 0 | 2 |
| 1 | CUST_1001 | Godagari | Rajshahi | 2023-11 | 287498 | 7 | 12 | 0 | DEALER_8 | 430 | 0 | 2 |
| 2 | CUST_1002 | Sadar | Bogura | 2023-02 | 394027 | 5 | 12 | 0 | DEALER_1 | 759 | 1 | 2 |
| 3 | CUST_1003 | Daudkandi | Cumilla | 2023-12 | 356730 | 9 | 12 | 0 | DEALER_16 | 489 | 0 | 1 |
| 4 | CUST_1004 | Rupsha | Khulna | 2023-12 | 231551 | 8 | 12 | 0 | DEALER_19 | 320 | 1 | 3 |


In [7]:
def segment(score):
    if score <= 2:
        return "Low"
    elif score <= 4:
        return "Medium"
    else:
        return "High"


In [8]:
df["risk_segment"] = df["risk_score"].apply(segment)
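
As an alternative to a hand-written `segment` function, the same cut-offs could be applied with `pd.cut`, which also yields an ordered categorical so that later group-bys and sorts follow Low < Medium < High. This is only a sketch under that assumption; the `scores` series below is made-up illustration data, not taken from the dataset.

```python
import pandas as pd

# Illustrative scores (assumption): one value on each side of every cut-off.
scores = pd.Series([0, 2, 3, 4, 5, 6], name="risk_score")

# Bins are right-inclusive by default: (-1, 2] -> Low, (2, 4] -> Medium, (4, inf) -> High.
segments = pd.cut(
    scores,
    bins=[-1, 2, 4, float("inf")],
    labels=["Low", "Medium", "High"],
)

print(segments.tolist())     # → ['Low', 'Low', 'Medium', 'Medium', 'High', 'High']
print(segments.cat.ordered)  # → True
```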


In [9]:
df.head()

|   | customer_id | upazila | district | purchase_month | total_credit_amount | emi_paid_months | total_emi_months | payment_delay_days | dealer_id | income_doc_score | land_ownership_flag | risk_score | risk_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST_1000 | Banaripara | Barishal | 2023-07 | 346867 | 12 | 12 | 0 | DEALER_8 | 320 | 0 | 2 | Low |
| 1 | CUST_1001 | Godagari | Rajshahi | 2023-11 | 287498 | 7 | 12 | 0 | DEALER_8 | 430 | 0 | 2 | Low |
| 2 | CUST_1002 | Sadar | Bogura | 2023-02 | 394027 | 5 | 12 | 0 | DEALER_1 | 759 | 1 | 2 | Low |
| 3 | CUST_1003 | Daudkandi | Cumilla | 2023-12 | 356730 | 9 | 12 | 0 | DEALER_16 | 489 | 0 | 1 | Low |
| 4 | CUST_1004 | Rupsha | Khulna | 2023-12 | 231551 | 8 | 12 | 0 | DEALER_19 | 320 | 1 | 3 | Medium |


In [10]:
segment_counts = df["risk_segment"].value_counts().reset_index()
segment_counts.columns = ["Risk Segment", "Number of Customers"]


In [11]:
avg_emi_delay = df.groupby("risk_segment")["payment_delay_days"].mean().reset_index()
avg_emi_delay.columns = ["Risk Segment", "Average EMI Delay (Days)"]


In [12]:
risk_summary = pd.merge(segment_counts, avg_emi_delay, on="Risk Segment")


In [13]:
risk_summary

|   | Risk Segment | Number of Customers | Average EMI Delay (Days) |
|---|--------------|---------------------|--------------------------|
| 0 | Low          | 172                 | 4.796512                 |
| 1 | Medium       | 118                 | 10.211864                |
| 2 | High         | 10                  | 24.0                     |
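
The count and the average delay per segment could also be computed in a single `groupby(...).agg(...)` call instead of two tables joined with `pd.merge`. The sketch below uses a made-up frame `toy` (an assumption, not the real dataset) and hypothetical output column names `customers` and `avg_delay_days`. Note that `"count"` counts non-null delays, which equals the number of customers here only because the dataset has no missing values.

```python
import pandas as pd

# Made-up data (assumption) with the two columns the summary needs.
toy = pd.DataFrame({
    "risk_segment": ["Low", "Low", "Medium", "High", "Medium"],
    "payment_delay_days": [0, 10, 20, 45, 0],
})

order = ["Low", "Medium", "High"]  # present segments from least to most risky

summary = (
    toy.groupby("risk_segment")["payment_delay_days"]
    .agg(customers="count", avg_delay_days="mean")  # named aggregation
    .reindex(order)                                 # fix the row order
    .reset_index()
)

print(summary)
```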
