## EDA Objectives & Constraints

The goal of this exploratory analysis is to understand the structure, quality, and limitations of the credit default dataset without performing feature engineering or modeling.

This EDA is limited to:
- Understanding the target distribution
- Identifying feature types
- Assessing missingness and data quality
- Detecting potential data leakage
- Producing descriptive, univariate summaries only

Apart from renaming the target column and dropping the `ID` identifier, no transformations, encoding, scaling, or modeling are performed in this notebook.
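The missingness and data-quality item in the scope above could be checked along the following lines. This is a minimal sketch on a hypothetical toy frame (`toy`, with invented values and invented gaps), not a result from the actual dataset; the names `missing_summary` and `n_duplicates` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the credit default data;
# the values and the missing entries are made up for illustration.
toy = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 120000, np.nan, 50000],
        "AGE": [24, 26, 34, np.nan],
        "DEFAULT": [1, 1, 0, 0],
    }
)

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        "pct_missing": toy.isna().mean() * 100,
    }
)

# Number of rows that exactly duplicate an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(missing_summary)
print("duplicate rows:", n_duplicates)
```

Running the same two checks on `df` would stay within the descriptive scope, since neither modifies the data.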

In [1]:
import pandas as pd

# header=1: take column names from the second row of the file and skip the first row
df = pd.read_csv("../data/credit_default.csv", header=1)

df.shape, df.columns

((30000, 25),
 Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
        'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
        'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
        'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
        'default payment next month'],
       dtype='object'))
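The `header=1` argument tells pandas to read column names from the second line of the file, which suggests the CSV carries an extra label row above the real header. The raw file is not shown here, so the layout below is an assumption: a small in-memory stand-in with a placeholder first row (`X0,X1,X2,Y`).

```python
import io

import pandas as pd

# Hypothetical miniature of a CSV with an extra label row above the real header.
raw = io.StringIO(
    "X0,X1,X2,Y\n"
    "ID,LIMIT_BAL,SEX,default payment next month\n"
    "1,20000,2,1\n"
    "2,120000,2,1\n"
)

# header=1 uses the line at index 1 as column names; line 0 is discarded.
demo = pd.read_csv(raw, header=1)

print(demo.columns.tolist())
print(demo.shape)
```

If the first line were kept as the header instead, the real column names would end up as a data row and every column would be read as strings.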

In [2]:
df.head()

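|   | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |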


In [3]:
df = df.rename(columns={"default payment next month":"DEFAULT"})
del df['ID']
df.head()

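|   | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEFAULT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |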


## Official Column Definitions

The following definitions are taken from the UCI Machine Learning Repository documentation
for the "Default of Credit Card Clients" dataset.

- LIMIT_BAL — Amount of given credit (NT dollar)
- SEX — Gender (1 = male, 2 = female)
- EDUCATION — Education level (1 = graduate school, 2 = university, 3 = high school, 4 = others)
- MARRIAGE — Marital status (1 = married, 2 = single, 3 = others)
- AGE — Age in years
<br><br>
- PAY_0 — Repayment status in September (the dataset has no PAY_1 column; the series runs PAY_0, PAY_2, ..., PAY_6)
- PAY_2 — Repayment status in August
- PAY_3 — Repayment status in July
- PAY_4 — Repayment status in June
- PAY_5 — Repayment status in May
- PAY_6 — Repayment status in April
<br><br>
- BILL_AMT1 — Bill statement amount in September
- BILL_AMT2 — Bill statement amount in August
- BILL_AMT3 — Bill statement amount in July
- BILL_AMT4 — Bill statement amount in June
- BILL_AMT5 — Bill statement amount in May
- BILL_AMT6 — Bill statement amount in April
<br><br>
- PAY_AMT1 — Amount paid in September
- PAY_AMT2 — Amount paid in August
- PAY_AMT3 — Amount paid in July
- PAY_AMT4 — Amount paid in June
- PAY_AMT5 — Amount paid in May
- PAY_AMT6 — Amount paid in April
<br><br>
- DEFAULT — Default payment next month (1 = yes, 0 = no); renamed here from `default payment next month`
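Because the coded columns are stored as plain integers, it can help to compare the observed codes against the documented ones before trusting the definitions above. The sketch below does this on a hypothetical toy frame; the out-of-range values in `toy` are invented for illustration and say nothing about the real data. The helper `undocumented_values` is an illustrative name, not part of pandas.

```python
import pandas as pd

# Documented codes, copied from the column definitions above.
documented_codes = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

# Hypothetical toy rows; the out-of-range values are invented for illustration.
toy = pd.DataFrame(
    {
        "SEX": [1, 2, 2, 1],
        "EDUCATION": [1, 2, 5, 3],
        "MARRIAGE": [1, 2, 0, 3],
    }
)


def undocumented_values(frame, codes):
    """Return, per column, the observed values that the documentation does not list."""
    return {
        col: sorted(set(frame[col].unique().tolist()) - allowed)
        for col, allowed in codes.items()
    }


unexpected = undocumented_values(toy, documented_codes)
print(unexpected)
```

Calling `undocumented_values(df, documented_codes)` on the loaded data would list any codes that need a decision later, without changing the data in this notebook.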


## Target Variable: Default Indicator

The target variable for this analysis is `DEFAULT`, which indicates whether a client defaulted on their credit card payment in the following month.

This section examines the distribution of the target variable to assess class imbalance and its implications for model evaluation.


In [4]:
# Target distribution
target_counts = df["DEFAULT"].value_counts()
target_percent = df["DEFAULT"].value_counts(normalize=True)

target_counts, target_percent

(DEFAULT
 0    23364
 1     6636
 Name: count, dtype: int64,
 DEFAULT
 0    0.7788
 1    0.2212
 Name: proportion, dtype: float64)
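
The counts above imply that a model predicting "no default" for every client would already be right about 78% of the time, so plain accuracy is a weak yardstick for this target. The short calculation below makes that explicit using only the counts printed above; the variable names are illustrative.

```python
# Counts taken from the value_counts() output above.
n_non_default = 23364
n_default = 6636
n_total = n_non_default + n_default

# Accuracy of a trivial classifier that always predicts the majority class (0).
majority_baseline_accuracy = max(n_non_default, n_default) / n_total

# Ratio of majority-class to minority-class observations.
imbalance_ratio = n_non_default / n_default

print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
print(f"imbalance ratio (0:1): {imbalance_ratio:.2f}")
```

Any later model would need to beat this baseline, and metrics that account for the minority class (for example precision and recall on `DEFAULT = 1`) are likely to be more informative than accuracy alone.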

## Feature Type Classification

This section classifies features into continuous, ordinal, and categorical variables based on dataset documentation, not observed distributions.

No transformations or encoding are performed at this stage.

In [5]:
# Inspect data types
df.dtypes

LIMIT_BAL    int64
SEX          int64
EDUCATION    int64
MARRIAGE     int64
AGE          int64
PAY_0        int64
PAY_2        int64
PAY_3        int64
PAY_4        int64
PAY_5        int64
PAY_6        int64
BILL_AMT1    int64
BILL_AMT2    int64
BILL_AMT3    int64
BILL_AMT4    int64
BILL_AMT5    int64
BILL_AMT6    int64
PAY_AMT1     int64
PAY_AMT2     int64
PAY_AMT3     int64
PAY_AMT4     int64
PAY_AMT5     int64
PAY_AMT6     int64
DEFAULT      int64
dtype: object

## Feature Type Classification (Pre-Processing)

Based on dataset documentation and credit-risk domain knowledge, features are classified by their semantic meaning rather than their numeric encoding.

This classification guides preprocessing decisions and prevents invalid modeling assumptions.


### Feature Groups

**Continuous Variables**
- LIMIT_BAL
- AGE
- BILL_AMT1
- BILL_AMT2
- BILL_AMT3
- BILL_AMT4
- BILL_AMT5
- BILL_AMT6
- PAY_AMT1
- PAY_AMT2
- PAY_AMT3
- PAY_AMT4
- PAY_AMT5
- PAY_AMT6

**Ordinal Variables (Severity / Order Matters)**
- PAY_0
- PAY_2
- PAY_3
- PAY_4
- PAY_5
- PAY_6

**Categorical Variables (Nominal Codes)**
- SEX
- EDUCATION
- MARRIAGE
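
One way to carry these groups forward is to record them as plain lists and confirm that, together with the target, they cover every column exactly once. The sketch below assumes the column names shown in the `df.dtypes` output; the list names (`continuous_features`, `ordinal_features`, `categorical_features`) are illustrative choices, not names used elsewhere in the notebook.

```python
# Feature groups as listed above.
continuous_features = (
    ["LIMIT_BAL", "AGE"]
    + [f"BILL_AMT{i}" for i in range(1, 7)]
    + [f"PAY_AMT{i}" for i in range(1, 7)]
)
ordinal_features = ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)]
categorical_features = ["SEX", "EDUCATION", "MARRIAGE"]
target = "DEFAULT"

# Column names as shown in the df.dtypes output.
all_columns = [
    "LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE",
    "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6",
    "BILL_AMT1", "BILL_AMT2", "BILL_AMT3", "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
    "PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5", "PAY_AMT6",
    "DEFAULT",
]

grouped = continuous_features + ordinal_features + categorical_features

# Each feature should appear in exactly one group.
duplicated = sorted({c for c in grouped if grouped.count(c) > 1})

# Every non-target column should be assigned to some group.
unassigned = sorted(set(all_columns) - set(grouped) - {target})

print("features grouped:", len(grouped))
print("duplicated:", duplicated)
print("unassigned:", unassigned)
```

In the notebook itself, `all_columns` could be replaced by `df.columns.tolist()` so the check tracks the loaded data.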
