# Credit Risk Exploratory Data Analysis (EDA)

## Dataset
Default of Credit Card Clients (Taiwan, 30,000 customers)

## Objective
To investigate behavioral and financial patterns that precede credit card default.

---

This notebook performs a structured data audit and foundational exploration 
before any modeling is attempted.

We are performing exploratory data analysis on a credit card default dataset
to understand which financial and behavioral patterns in the previous six months
are associated with default in the next month.

We are not building a predictive model yet.
We are investigating patterns and relationships.

Questions we intend to explore:

1. Is default driven by high debt levels?
2. Is it driven by repayment delays?
3. Does instability in bill or payment behavior increase risk?
4. Are demographic variables strongly associated with default?

In [None]:
import pandas as pd
import numpy as np
import os



# Define relative path (adjust filename if needed)
DATA_PATH = "../data/raw/default of credit card clients.csv"

# Load dataset
df = pd.read_csv(DATA_PATH)

# Display first 5 rows
df.head()

The dataset contains financial and demographic information for 30,000 credit card clients.
Each row represents one customer observed over a 6-month period.

ID is just a unique identifier for each client and will not be used in analysis.

LIMIT_BAL is the total credit limit assigned to the customer.
It represents how much the bank allows the customer to borrow.

SEX indicates gender:
1 = Male  
2 = Female  

EDUCATION indicates education level:
1 = Graduate school  
2 = University  
3 = High school  
4 = Others  
(The data may also contain 0, 5, or 6, which are undocumented and usually treated as unknown or unspecified categories.)

MARRIAGE indicates marital status:
1 = Married  
2 = Single  
3 = Others  
(0 may sometimes appear and represents unknown.)
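
Before relying on the code tables above, it helps to count how many rows carry codes outside them. The following is a minimal sketch on a hypothetical toy frame (`toy`, `documented`, and `undocumented_counts` are illustrative names, not part of the notebook); in the notebook the same function could be applied to `df`.

```python
import pandas as pd

# Hypothetical toy sample standing in for the real data.
toy = pd.DataFrame({
    "EDUCATION": [1, 2, 3, 4, 0, 5, 6, 2],
    "MARRIAGE":  [1, 2, 3, 0, 1, 2, 2, 1],
})

# Documented codes, copied from the data dictionary above.
documented = {
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

def undocumented_counts(frame, documented_codes):
    """Return, per column, the counts of values outside the documented set."""
    out = {}
    for col, codes in documented_codes.items():
        mask = ~frame[col].isin(codes)
        out[col] = frame.loc[mask, col].value_counts().sort_index().to_dict()
    return out

extra = undocumented_counts(toy, documented)
print(extra)
```

How to handle these rows (keep them, merge them into "Others", or drop them) is a judgment call that the notebook does not make at this stage.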

AGE is the age of the customer in years.

PAY_1 to PAY_6 represent repayment status for the last six months.
These show whether the customer paid on time or delayed payments.

Values generally mean:
-2 = No consumption that month  
-1 = Paid in full on time  
0 = No delay  
1 = Payment delayed by 1 month  
2 = Payment delayed by 2 months  
3 and above = Payment delayed by that many months  

Higher positive values indicate more severe delinquency.

BILL_AMT1 to BILL_AMT6 represent the total bill amount (outstanding balance) for each of the last six months.

PAY_AMT1 to PAY_AMT6 represent how much the customer actually paid in each of the last six months.

dpnm represents default payment next month:
0 = No default  
1 = Default  

The objective of this analysis is to understand how past repayment behavior, debt levels, and payment patterns relate to default risk.

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
# check the target distribution: how many customers defaulted?

df['dpnm'].value_counts()

In [None]:
df['dpnm'].value_counts(normalize=True)

## Target Variable Distribution

The target variable `dpnm` represents default payment next month.

Observation:
- ~22.1% of customers default.
- ~77.9% do not default.

This indicates moderate class imbalance.
Accuracy alone would not be a reliable metric in future modeling.
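
To make the accuracy caveat concrete, here is a minimal sketch using hypothetical labels that match the observed 77.9% / 22.1% split. A rule that always predicts "no default" reaches about 78% accuracy while identifying no defaulters at all. `y_true` and `y_pred` are illustrative names and no model is being fitted.

```python
import numpy as np

# Hypothetical labels matching the observed split: 779 non-defaults, 221 defaults.
y_true = np.array([0] * 779 + [1] * 221)

# A trivial rule that always predicts the majority class (no default).
y_pred = np.zeros_like(y_true)

# Accuracy: share of predictions that match the label.
accuracy = (y_pred == y_true).mean()

# Recall on defaulters: share of true defaults that were flagged as default.
recall_default = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy={accuracy:.3f}, recall on defaults={recall_default:.3f}")
```

This is why metrics that look at the default class directly, such as recall or precision, would be more informative in later modeling.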



In [None]:
df.describe()

In [None]:
# checking unique values for repayment status columns

repay_cols = ['PAY_1','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6']

for col in repay_cols:
    print(col, sorted(df[col].unique()))

Repayment status values range from -2 to 8.

-2 indicates no consumption in that month.
-1 indicates payment made on time.
0 indicates no delay.
Positive values (1 to 8) indicate the number of months payment was delayed.

Higher positive values represent more severe delinquency.

These variables are ordinal indicators of repayment behavior. We treat them as ordered categories where possible, although the trend analysis later in the notebook averages the raw codes as a rough severity score.
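
One way to encode the ordering explicitly is a pandas ordered categorical. The sketch below uses a hypothetical toy column and assumes the category range -2 to 8 observed above; `PAY_1_ORD` is an illustrative name. The original integer column is kept, since averaging is not defined for categorical data.

```python
import pandas as pd

# Hypothetical toy values covering the observed range.
toy = pd.DataFrame({"PAY_1": [-2, -1, 0, 1, 2, 8, 0, 2]})

# Category order assumed from the observed range -2..8.
levels = list(range(-2, 9))
toy["PAY_1_ORD"] = pd.Categorical(toy["PAY_1"], categories=levels, ordered=True)

# Ordered categoricals support order-based comparisons and min/max.
severe = (toy["PAY_1_ORD"] >= 2).sum()
worst = toy["PAY_1_ORD"].max()

print(severe, worst)
```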

In [None]:
# checking distribution of credit limit
df['LIMIT_BAL'].describe()

In [None]:
# checking unique credit limits
sorted(df['LIMIT_BAL'].unique())[:10]

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
sns.histplot(df['LIMIT_BAL'], bins=30)
plt.title("Distribution of Credit Limit (LIMIT_BAL)")
plt.xlabel("Credit Limit")
plt.ylabel("Count")
plt.show()

In [None]:
plt.figure(figsize=(8,4))
sns.boxplot(x=df['LIMIT_BAL'])
plt.title("Boxplot of Credit Limit")
plt.show()

Observation:

Credit limits range from 10,000 to 1,000,000.

The distribution is slightly right-skewed (mean > median), indicating 
a subset of customers with very high credit limits.

Credit limits appear in structured increments, suggesting tier-based 
credit assignment rather than continuous random values.
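
The "structured increments" observation can be checked numerically rather than by eye. This is a minimal sketch on hypothetical values; the step size of 10,000 is an assumption and should be adjusted to whatever the unique values in `df['LIMIT_BAL']` suggest.

```python
import pandas as pd

# Hypothetical credit limits standing in for df['LIMIT_BAL'].
limits = pd.Series([10000, 20000, 50000, 50000, 200000, 500000, 1000000])

step = 10_000  # assumed increment, not confirmed by the source

# Share of customers whose limit is an exact multiple of the step.
share_on_grid = (limits % step == 0).mean()

# Number of distinct limit values; a small number relative to the
# row count is consistent with tiered assignment.
n_distinct = limits.nunique()

print(share_on_grid, n_distinct)
```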

LIMIT_BAL represents borrowing capacity and will be analyzed 
in relation to repayment behavior and default risk.

In [None]:
# comparing average credit limit by default status
df.groupby('dpnm')['LIMIT_BAL'].mean()

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='dpnm', y='LIMIT_BAL', data=df)
plt.title("Credit Limit by Default Status")
plt.xlabel("Default (0 = No, 1 = Yes)")
plt.ylabel("Credit Limit")
plt.show()

Observation:

Customers who did not default have a higher average and median credit limit 
compared to those who defaulted.

However, there is significant overlap between the two groups, indicating that 
credit limit alone does not fully explain default behavior.

This suggests that repayment patterns and financial behavior may play a more 
important role than credit capacity alone.

## Repayment Behavior Analysis

We now analyze repayment status variables (PAY_1 to PAY_6).

These variables represent payment delay severity over the last six months.

Objective:
To examine whether repayment delay patterns differ between 
default and non-default customers.

In [None]:
# checking overall distribution of PAY_1

df['PAY_1'].value_counts().sort_index()

In [None]:
# checking repayment status distribution split by default

pd.crosstab(df['PAY_1'], df['dpnm'], normalize='columns')

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='PAY_1', hue='dpnm', data=df)
plt.title("Repayment Status (PAY_1) by Default Status")
plt.xlabel("Repayment Status (PAY_1)")
plt.ylabel("Count")
plt.show()

Observation:

Repayment status in the most recent month (PAY_1) shows strong separation 
between defaulters and non-defaulters.

Defaulters are substantially more likely to have 2 or more months of delay.

Approximately 27.8% of defaulters had a 2-month delay, compared to only 3.5% 
of non-defaulters.

Stable repayment (no delay) is much more common among non-defaulters.

This suggests that recent delinquency is a strong leading indicator of default.

## Delinquency Over Time (PAY_1 to PAY_6)

We now analyze repayment delay across all six months.

Instead of looking at each delay category separately, 
we simplify the problem:

We define a customer as "delayed" if PAY_X >= 1 
(i.e., at least one month of payment delay).

Objective:
To compare the proportion of delayed customers 
between defaulters and non-defaulters across all six months.

This allows us to test whether recent delinquency 
has stronger association with default than older delinquency.
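
The "delayed" flag and the per-group proportions can also be computed in one vectorized step. The sketch below uses a hypothetical toy frame with two PAY columns; the mean of a boolean flag within each group is the proportion of delayed customers. Unlike selecting a `True` row from a crosstab, this does not fail when a column has no delayed customers.

```python
import pandas as pd

# Hypothetical toy data: two repayment columns and the target.
toy = pd.DataFrame({
    "PAY_1": [2, 1, 0, -1, 0, 2],
    "PAY_2": [2, 0, 0, -1, 0, 0],
    "dpnm":  [1, 1, 1, 0, 0, 0],
})
pay_cols = ["PAY_1", "PAY_2"]

# True where the customer was at least one month late.
delayed = toy[pay_cols] >= 1

# Mean of a boolean per group = proportion delayed; transpose so that
# rows are months and columns are default status (0 / 1).
delay_share = delayed.groupby(toy["dpnm"]).mean().T

print(delay_share)
```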

In [None]:
# defining repayment status columns
repay_cols = ['PAY_1','PAY_2','PAY_3','PAY_4','PAY_5','PAY_6']

In [None]:
# dictionary to store delay proportions
delay_results = {}

for col in repay_cols:
    # creating boolean indicator: True if delay >= 1
    delay_flag = df[col] >= 1
    
    # cross tabulation normalized by default group
    table = pd.crosstab(delay_flag, df['dpnm'], normalize='columns')
    
    # store proportion of delayed customers (True row)
    delay_results[col] = table.loc[True]

In [None]:
delay_df = pd.DataFrame(delay_results).T
delay_df

In [None]:
import matplotlib.pyplot as plt

delay_df.plot(kind='bar', figsize=(8,5))
plt.title("Proportion of Customers with Payment Delay (>=1 Month)")
plt.ylabel("Proportion")
plt.xlabel("Repayment Month")
plt.show()

Observation:

The proportion of customers with payment delay (>=1 month) 
is consistently higher in the default group across all six months.

The separation is strongest in PAY_1 (most recent month) 
and gradually decreases as we move to older months.

This suggests that recent delinquency is a stronger signal of 
imminent default, while older delinquency still contributes 
but with weaker association.

This pattern supports the hypothesis that financial distress 
intensifies closer to default.

## Delay Severity Analysis (>=2 Months)

We now test whether more severe delinquency (2 or more months delay) 
shows stronger separation between defaulters and non-defaulters.

Objective:
To examine whether delay severity (>=2 months) 
acts as a stronger threshold indicator of default risk 
compared to delay >=1 month.

In [None]:
# dictionary to store severe delay proportions
severe_delay_results = {}

for col in repay_cols:
    severe_flag = df[col] >= 2
    table = pd.crosstab(severe_flag, df['dpnm'], normalize='columns')
    severe_delay_results[col] = table.loc[True]

severe_df = pd.DataFrame(severe_delay_results).T
severe_df

In [None]:
severe_df.plot(kind='bar', figsize=(8,5))
plt.title("Proportion of Customers with Severe Delay (>=2 Months)")
plt.ylabel("Proportion")
plt.xlabel("Repayment Month")
plt.show()
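
Whether the >=2 threshold separates the groups "more strongly" depends on how separation is measured. The sketch below, on hypothetical toy data, reports both the difference and the ratio of the two group proportions for a given threshold; `separation` is an illustrative helper, not part of the notebook. In this toy example the difference shrinks while the ratio grows when moving from >=1 to >=2, so the two measures should be read together.

```python
import pandas as pd

# Hypothetical toy data: 4 defaulters and 6 non-defaulters.
toy = pd.DataFrame({
    "PAY_1": [2, 3, 1, 0, 1, 0, 0, 2, -1, 0],
    "dpnm":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

def separation(frame, col, threshold, target="dpnm"):
    """Share of each group at or above the threshold, with difference and ratio."""
    share = (frame[col] >= threshold).groupby(frame[target]).mean()
    non_default, default = share.loc[0], share.loc[1]
    return {
        "non_default": non_default,
        "default": default,
        "difference": default - non_default,
        "ratio": default / non_default if non_default > 0 else float("nan"),
    }

sep1 = separation(toy, "PAY_1", threshold=1)
sep2 = separation(toy, "PAY_1", threshold=2)
print(sep1)
print(sep2)
```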

## Payment Coverage Ratio Analysis

We now examine how much customers paid relative to their outstanding bill.

We compute a simple coverage ratio:

PAY_AMT_X / BILL_AMT_X

Objective:
To determine whether defaulters consistently pay a smaller proportion 
of their bill compared to non-defaulters.

In [None]:
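# coverage ratio for month 1; rows with a zero or negative bill are set to 0
# (np.where evaluates the division for every row, then discards those values)
df['COVERAGE_1'] = np.where(df['BILL_AMT1'] > 0,
                            df['PAY_AMT1'] / df['BILL_AMT1'],
                            0)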

In [None]:
df.groupby('dpnm')['COVERAGE_1'].mean()

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='dpnm', y='COVERAGE_1', data=df)
plt.title("Payment Coverage Ratio (Month 1) by Default Status")
plt.xlabel("Default (0 = No, 1 = Yes)")
plt.ylabel("Payment / Bill Ratio")
plt.show()

In [None]:
df.groupby('dpnm')['COVERAGE_1'].median()

In [None]:
plt.figure(figsize=(8,5))
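# restrict the bins to the displayed range so that all 50 bins fall within 0-5
sns.histplot(df[df['dpnm']==0]['COVERAGE_1'], bins=50, binrange=(0, 5), color='blue', label='Non-default', kde=False)
sns.histplot(df[df['dpnm']==1]['COVERAGE_1'], bins=50, binrange=(0, 5), color='orange', label='Default', kde=False)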
plt.legend()
plt.xlim(0,5)
plt.title("Coverage Ratio Distribution (Limited View)")
plt.show()

## Payment Coverage Behavior Over Time

Previously, we examined the payment coverage ratio for the most recent month:

COVERAGE_1 = PAY_AMT1 / BILL_AMT1

We observed that:
- The mean coverage ratio was higher for non-defaulters.
- The median coverage ratio was similar between groups.
- The distribution was highly skewed due to extreme values.

This suggests that raw coverage ratio may be noisy and influenced by outliers.
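
One possible way to reduce the influence of extreme ratios is to cap the coverage ratio and to leave it undefined where there is no positive bill. This is a sketch of an alternative, not the method used in this notebook: the cap at 1.0 (full payment) and the use of NaN instead of 0 are both assumptions, and the data is a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy data, including a zero bill, a negative bill,
# and a large overpayment on a small bill.
toy = pd.DataFrame({
    "BILL_AMT1": [1000.0, 2000.0, 0.0, 50.0, -100.0],
    "PAY_AMT1":  [100.0, 2000.0, 500.0, 5000.0, 0.0],
})

bill = toy["BILL_AMT1"]

# Divide only where the bill is positive; elsewhere the ratio is NaN,
# so "no bill" is not confused with "paid nothing".
ratio = toy["PAY_AMT1"].div(bill.where(bill > 0))

# Cap at 1.0 so that overpayments count as full coverage (assumed cap).
capped = ratio.clip(upper=1.0)

print(ratio.mean(), capped.mean())
```

In the toy data a single overpayment pulls the raw mean to about 33.7, while the capped mean is 0.7. Pandas skips NaN values in `mean()` by default, so rows without a positive bill do not affect either figure.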

We now extend the analysis across all six months.

Objective:
To examine whether defaulters consistently pay a lower proportion 
of their outstanding bills over time.

Instead of focusing on a single month, 
we analyze coverage behavior across PAY_AMT1 to PAY_AMT6 
relative to BILL_AMT1 to BILL_AMT6.

This helps determine whether underpayment is persistent 
rather than isolated.

In [None]:
# creating coverage ratios for all six months

for i in range(1, 7):
    bill_col = f'BILL_AMT{i}'
    pay_col = f'PAY_AMT{i}'
    cov_col = f'COVERAGE_{i}'
    
    df[cov_col] = np.where(df[bill_col] > 0,
                           df[pay_col] / df[bill_col],
                           0)

In [None]:
# average coverage across six months

coverage_cols = [f'COVERAGE_{i}' for i in range(1, 7)]

df['AVG_COVERAGE'] = df[coverage_cols].mean(axis=1)

In [None]:
df.groupby('dpnm')['AVG_COVERAGE'].mean()

In [None]:
df.groupby('dpnm')['AVG_COVERAGE'].median()

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='dpnm', y='AVG_COVERAGE', data=df)
plt.title("Average Payment Coverage (6 Months) by Default Status")
plt.xlabel("Default (0 = No, 1 = Yes)")
plt.ylabel("Average Payment / Bill Ratio")
plt.show()

### Coverage Ratio Conclusion

We analyzed payment coverage behavior across six months by computing:

AVG_COVERAGE = average(PAY_AMT_X / BILL_AMT_X)

Findings:

- The mean average coverage ratio is higher for non-defaulters.
  However, the distribution is heavily right-skewed due to extreme overpayments.

- The median average coverage ratio is moderately higher for non-defaulters
  (approximately 0.087 vs 0.053).

- While defaulters tend to pay a smaller proportion of their bills on average,
  the separation between groups is not as strong as observed in delinquency status.

Conclusion:

Chronic underpayment is associated with higher default risk, 
but repayment delay severity (PAY variables) provides a much stronger signal.

Payment coverage appears to be a secondary behavioral indicator,
while delinquency status remains the dominant predictor observed so far.

## Bill Amount Volatility Analysis

So far, we examined levels (credit limit, delay, coverage).

We now analyze stability.

Objective:
To measure how much bill amounts fluctuate over the six months
for each customer.

Hypothesis:
Customers who default may exhibit higher volatility
in their bill amounts prior to default,
indicating financial instability.

In [None]:
# bill amount columns
bill_cols = [f'BILL_AMT{i}' for i in range(1, 7)]

# compute standard deviation across 6 months
df['BILL_VOLATILITY'] = df[bill_cols].std(axis=1)

In [None]:
df.groupby('dpnm')['BILL_VOLATILITY'].mean()

In [None]:
df.groupby('dpnm')['BILL_VOLATILITY'].median()

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='dpnm', y='BILL_VOLATILITY', data=df)
plt.title("Bill Amount Volatility by Default Status")
plt.xlabel("Default (0 = No, 1 = Yes)")
plt.ylabel("Standard Deviation of Bill Amount")
plt.show()

### Bill Volatility Conclusion

Absolute bill volatility (standard deviation of bill amounts)
is higher among non-defaulters.

This likely reflects higher credit limits and larger spending capacity,
rather than financial instability.

Raw volatility is scale-dependent and does not account for
relative fluctuation compared to typical bill size.

Conclusion:
Absolute bill volatility does not appear to be a strong
indicator of default risk.

Further analysis should consider relative volatility
(normalized by average bill amount).

## Relative Bill Volatility Analysis

Absolute volatility was higher among non-defaulters,
likely due to higher credit limits and larger spending levels.

To properly measure instability, we compute relative volatility:

Relative Volatility = 
Standard Deviation of Bill Amounts / Mean Bill Amount

This adjusts for scale and allows fair comparison
between customers with different spending levels.
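
The ratio defined above is the coefficient of variation. A minimal sketch on hypothetical toy data shows why it is scale-free: two customers whose bills differ by a factor of 100 but fluctuate in the same proportion get the same relative volatility. Here rows with a non-positive mean bill are left as NaN, which is an assumption that differs from filling them with 0.

```python
import pandas as pd

# Hypothetical toy data: same proportional fluctuation at two scales,
# plus a customer with no bills at all.
toy = pd.DataFrame(
    [[1000, 1100, 900, 1000, 1100, 900],
     [100000, 110000, 90000, 100000, 110000, 90000],
     [0, 0, 0, 0, 0, 0]],
    columns=[f"BILL_AMT{i}" for i in range(1, 7)],
)

abs_vol = toy.std(axis=1)    # sample standard deviation (pandas default ddof=1)
avg_bill = toy.mean(axis=1)

# Coefficient of variation; undefined (NaN) where the mean bill is not positive.
rel_vol = abs_vol.div(avg_bill.where(avg_bill > 0))

print(pd.DataFrame({"abs_vol": abs_vol, "rel_vol": rel_vol}))
```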

Objective:
To test whether defaulters exhibit greater proportional
instability in their bill amounts prior to default.

In [None]:
# compute average bill amount across six months
df['AVG_BILL'] = df[bill_cols].mean(axis=1)

In [None]:
# avoid division by zero
df['RELATIVE_VOL'] = np.where(df['AVG_BILL'] > 0,
                              df['BILL_VOLATILITY'] / df['AVG_BILL'],
                              0)

In [None]:
df.groupby('dpnm')['RELATIVE_VOL'].mean()

In [None]:
df.groupby('dpnm')['RELATIVE_VOL'].median()

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='dpnm', y='RELATIVE_VOL', data=df)
plt.title("Relative Bill Volatility by Default Status")
plt.xlabel("Default (0 = No, 1 = Yes)")
plt.ylabel("Relative Volatility")
plt.show()

### Relative Volatility Conclusion

Relative bill volatility remains higher among non-defaulters,
even after adjusting for scale.

This suggests that financial instability in spending patterns
is not a primary driver of default in this dataset.

Default appears more strongly associated with delinquency
and repayment behavior rather than bill amount fluctuations.

Conclusion:
In this dataset, repayment discipline is a stronger signal
than spending instability.

## Delinquency Deterioration Over Time

We previously observed that repayment delay is strongly associated with default,
and that recent delinquency has stronger separation than older delinquency.

We now formally analyze deterioration.

Objective:
To examine whether delinquency severity increases
as we move from PAY_6 (oldest month)
to PAY_1 (most recent month) in the default group.

This helps determine whether default is preceded
by gradual worsening repayment behavior.

In [None]:
# compute average PAY value for each month by default status

trend_data = {}

for col in repay_cols:
    trend_data[col] = df.groupby('dpnm')[col].mean()

trend_df = pd.DataFrame(trend_data).T
trend_df

In [None]:
trend_df.plot(figsize=(8,5))
plt.title("Average Repayment Delay Over Time")
plt.xlabel("Month (PAY_6 oldest → PAY_1 most recent)")
plt.ylabel("Average Delay Score")
plt.show()



Observation:

The average repayment delay score increases steadily
from PAY_6 (oldest month) to PAY_1 (most recent month)
in the default group.

This indicates progressive worsening repayment behavior
as customers approach default.

In contrast, non-defaulters maintain consistently low
or negative delay scores across all months,
indicating stable repayment patterns.

Conclusion:
At the group level, default is strongly associated with gradual delinquency
intensification rather than sudden failure.
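
A rising group average does not by itself show that individual customers worsen over time. One way to check this per customer is to compare the most recent status with the oldest one. The sketch below uses a hypothetical toy frame; `PAY_CHANGE` is an illustrative name, and the simple difference ignores the months in between.

```python
import pandas as pd

# Hypothetical toy data: oldest and most recent repayment status.
toy = pd.DataFrame({
    "PAY_6": [0, 0, 2, 0],
    "PAY_1": [2, 0, 2, -1],
    "dpnm":  [1, 0, 1, 0],
})

# Positive change = status worsened between the oldest and newest month.
toy["PAY_CHANGE"] = toy["PAY_1"] - toy["PAY_6"]

# Share of customers in each group whose status worsened.
worsened_share = (toy["PAY_CHANGE"] > 0).groupby(toy["dpnm"]).mean()

print(worsened_share)
```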

## Demographic Association Analysis

We now examine whether demographic variables
are strongly associated with default.

Variables analyzed:
- SEX
- EDUCATION
- MARRIAGE
- AGE

Objective:
To determine whether default risk is primarily driven
by demographic characteristics,
or whether financial behavior is the dominant factor.

In [None]:
pd.crosstab(df['SEX'], df['dpnm'], normalize='index')

In [None]:
pd.crosstab(df['EDUCATION'], df['dpnm'], normalize='index')

In [None]:
pd.crosstab(df['MARRIAGE'], df['dpnm'], normalize='index')

In [None]:
df.groupby('dpnm')['AGE'].mean()

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x='dpnm', y='AGE', data=df)
plt.title("Age Distribution by Default Status")
plt.show()

### Demographic Association Conclusion

Demographic variables show weak to moderate association with default.

- Gender shows a small difference in default rate.
- Education level shows modest variation, with slightly higher default rates 
  in lower education categories.
- Marital status shows limited variation.
- Age shows negligible difference between default and non-default groups.

Compared to repayment behavior variables,
demographic features exhibit much weaker separation.

Conclusion:
Default risk in this dataset is driven primarily by financial behavior 
(delinquency patterns) rather than demographic characteristics.
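
When comparing default rates across categories, it helps to show the group size next to each rate, because rates in small categories (such as the undocumented EDUCATION codes) can be unstable. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; category 4 has a single customer.
toy = pd.DataFrame({
    "EDUCATION": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4],
    "dpnm":      [0, 0, 1, 0, 1, 1, 0, 1, 0, 0],
})

# Default rate and number of customers per category.
summary = toy.groupby("EDUCATION")["dpnm"].agg(default_rate="mean", n="count")

print(summary)
```

The same pattern could be applied to `SEX` and `MARRIAGE` in the notebook's `df`.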

# Final Project Conclusions

This analysis set out to explore four primary questions regarding credit card default risk.

---

## 1. Is default driven by high debt levels?

Partially, but not primarily.

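While defaulters tend to have lower average credit limits, the two groups overlap heavily.
Credit limit reflects borrowing capacity rather than outstanding debt, so this comparison
alone does not show that debt level separates default from non-default groups.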

Spending volatility and absolute bill levels were not dominant signals.

Conclusion:
Debt capacity and bill size are not the primary drivers of default in this dataset.

---

## 2. Is default driven by repayment delays?

Yes. Strongly.

Findings show:

- Presence of repayment delay is substantially higher among defaulters.
- Severe delay (>=2 months) shows dramatic separation.
- The most recent month (PAY_1) provides the strongest signal.
- Average delay severity increases steadily as customers approach default.

Conclusion:
Repayment delinquency is the strongest signal of default observed in this analysis.
Default is strongly associated with worsening repayment behavior.

---

## 3. Does instability in bill or payment behavior increase risk?

The analysis found no evidence for this.

Absolute and relative bill volatility were both higher among non-defaulters.
Volatility in payment amounts was not examined.
This suggests that instability in bill amounts is not a major driver of default risk.

Conclusion:
Default appears to result from repayment discipline failure,
not chaotic spending patterns.

---

## 4. Are demographic variables strongly associated with default?

Only weakly.

- Gender differences are small.
- Education shows moderate variation.
- Marital status differences are limited.
- Age shows negligible difference.

Compared to repayment behavior variables,
demographics exhibit much weaker predictive strength.

Conclusion:
Default risk in this dataset is primarily behavioral,
not demographic.

---

# Overall Insight

The strongest pattern observed is progressive deterioration in repayment behavior.

Defaulters show:
- Increasing delay severity over time.
- Higher frequency of severe delinquency.
- Clear intensification of repayment problems
  as default approaches.

Default in this dataset is not sudden.
It is preceded by gradual repayment breakdown.

Financial behavior, particularly delinquency history,
is the dominant signal of credit risk.