## EDA Objectives & Constraints

The goal of this exploratory analysis is to understand the structure, quality, and limitations of the credit default dataset without performing feature engineering or modeling.

This EDA is limited to:
- Understanding the target distribution
- Identifying feature types
- Assessing missingness and data quality
- Detecting potential data leakage
- Producing descriptive, univariate summaries only

No transformations, encoding, scaling, or modeling are performed in this notebook.

In [11]:
import pandas as pd

df = pd.read_csv("../data/credit_default.csv")

In [16]:
df.head()

Unnamed: 0.1,Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X15,X16,X17,X18,X19,X20,X21,X22,X23,Y
0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
1,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
2,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
3,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
4,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0


The first row appears to hold the actual column names, which should replace the current generic headers (**Unnamed: 0**, **X1** through **X23**, and **Y**). This will be handled during cleaning in the **02_cleaning.ipynb** notebook.
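As a preview of that cleaning step, here is a minimal sketch of promoting the first row to column names. It runs on a toy frame rather than the real file, and `promote_first_row` is a hypothetical helper name, not something defined in the cleaning notebook.

```python
import pandas as pd

# Toy stand-in for the raw file: generic headers, real names in the first row.
raw = pd.DataFrame(
    {
        "Unnamed: 0": ["ID", "1", "2"],
        "X1": ["LIMIT_BAL", "20000", "120000"],
        "Y": ["default payment next month", "1", "1"],
    }
)


def promote_first_row(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame whose column names come from the first row."""
    promoted = frame.iloc[1:].reset_index(drop=True)
    promoted.columns = frame.iloc[0].tolist()
    return promoted


preview = promote_first_row(raw)
print(preview.columns.tolist())
```

The original frame is left untouched, which keeps this notebook free of transformations. Passing `header=1` to `pd.read_csv` is another option that may give the same result at load time.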

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30001 entries, 0 to 30000
Data columns (total 25 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  30001 non-null  object
 1   X1          30001 non-null  object
 2   X2          30001 non-null  object
 3   X3          30001 non-null  object
 4   X4          30001 non-null  object
 5   X5          30001 non-null  object
 6   X6          30001 non-null  object
 7   X7          30001 non-null  object
 8   X8          30001 non-null  object
 9   X9          30001 non-null  object
 10  X10         30001 non-null  object
 11  X11         30001 non-null  object
 12  X12         30001 non-null  object
 13  X13         30001 non-null  object
 14  X14         30001 non-null  object
 15  X15         30001 non-null  object
 16  X16         30001 non-null  object
 17  X17         30001 non-null  object
 18  X18         30001 non-null  object
 19  X19         30001 non-null  object
 20  X20   

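First look:
- No null values are reported: every listed column has 30001 non-null entries. That count includes the embedded header row, so there are 30000 actual records.
- Every column is typed as **object**, most likely because the first row holds the column names as strings, which forces each column to be read as text.
- Null counts on object columns would not reveal non-numeric placeholders, so missingness should be rechecked after type conversion.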
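Because every column is read as text, a null count alone cannot rule out placeholder strings. One way to probe this without altering `df` is to coerce a copy to numeric and count the entries that fail to parse. The sketch below uses a toy frame, and the `"n/a"` value is invented purely for illustration.

```python
import pandas as pd

# Toy stand-in: object columns with a header row embedded as the first record.
raw = pd.DataFrame(
    {
        "X1": ["LIMIT_BAL", "20000", "120000", "90000"],
        "X5": ["AGE", "24", "26", "n/a"],  # "n/a" is an invented placeholder
    }
)

# Coerce each column to numeric; anything unparseable becomes NaN.
coerced = raw.apply(pd.to_numeric, errors="coerce")

# Entries that were present in the raw data but are not numeric.
non_numeric_counts = (coerced.isna() & raw.notna()).sum()
print(non_numeric_counts)
```

On the real data, a count of exactly 1 per column would be consistent with the header row being the only non-numeric entry.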

## Official Column Definitions

The following definitions are taken from the UCI Machine Learning Repository documentation for the "Default of Credit Card Clients" dataset.

- X1: LIMIT_BAL — Amount of given credit (NT dollar)
- X2: SEX — Gender (1 = male, 2 = female)
- X3: EDUCATION — Education level (1 = graduate school, 2 = university, 3 = high school, 4 = others)
- X4: MARRIAGE — Marital status (1 = married, 2 = single, 3 = others)
- X5: AGE — Age in years
<br><br>
- X6: PAY_0 — Repayment status in September (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
- X7: PAY_2 — Repayment status in August (scale same as above)
- X8: PAY_3 — Repayment status in July (scale same as above)
- X9: PAY_4 — Repayment status in June (scale same as above)
- X10: PAY_5 — Repayment status in May (scale same as above)
- X11: PAY_6 — Repayment status in April (scale same as above)
<br><br>
- X12: BILL_AMT1 — Bill statement amount in September
- X13: BILL_AMT2 — Bill statement amount in August
- X14: BILL_AMT3 — Bill statement amount in July
- X15: BILL_AMT4 — Bill statement amount in June
- X16: BILL_AMT5 — Bill statement amount in May
- X17: BILL_AMT6 — Bill statement amount in April
<br><br>
- X18: PAY_AMT1 — Amount paid in September
- X19: PAY_AMT2 — Amount paid in August
- X20: PAY_AMT3 — Amount paid in July
- X21: PAY_AMT4 — Amount paid in June
- X22: PAY_AMT5 — Amount paid in May
- X23: PAY_AMT6 — Amount paid in April
<br><br>
- Y: default payment next month - Default payment next month (1 = yes, 0 = no)
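The documented codes can be compared against the values actually observed. The sketch below does this for the four data rows visible in the `df.head()` output, where `PAY_0` already takes the value 0, a code the documented scale does not list. The `DOCUMENTED_CODES` dictionary is an illustrative name, and the sample covers only those four rows, not the full dataset.

```python
import pandas as pd

# Documented codes, transcribed from the definitions above.
DOCUMENTED_CODES = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
    "PAY_0": {-1, 1, 2, 3, 4, 5, 6, 7, 8, 9},
}

# The four data rows shown in the df.head() output (header row excluded).
sample = pd.DataFrame(
    {
        "SEX": [2, 2, 2, 2],
        "EDUCATION": [2, 2, 2, 2],
        "MARRIAGE": [1, 2, 2, 1],
        "PAY_0": [2, -1, 0, 0],
    }
)

# For each column, list observed values that the documentation does not define.
undocumented = {
    column: sorted(set(sample[column].unique().tolist()) - allowed)
    for column, allowed in DOCUMENTED_CODES.items()
}
print(undocumented)
```

Running the same comparison on the full dataset, once the columns are numeric, would show whether other undocumented codes exist.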


## Target Variable: Default Indicator

The target variable for this analysis is `Y`, which indicates whether a client defaulted on their credit card payment in October.

This section examines the distribution of the target variable to assess class imbalance and implications for model evaluation.


In [14]:
# Target distribution
target_counts = df["Y"].value_counts()
target_percent = df["Y"].value_counts(normalize=True)

print(target_counts)
print(target_percent)

Y
0                             23364
1                              6636
default payment next month        1
Name: count, dtype: int64
Y
0                             0.778774
1                             0.221193
default payment next month    0.000033
Name: proportion, dtype: float64


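Target breakdown (excluding the single stray header record):
- ~78% did not default in October (23,364 clients)
- ~22% did default in October (6,636 clients)

The classes are moderately imbalanced, at roughly 3.5 non-defaulters per defaulter. Both classes are well represented, so the data is adequate for modeling, but accuracy alone would be a misleading metric: a model that always predicts "no default" would already score about 78%.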
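The breakdown can be recomputed with the stray header record excluded. This is a minimal sketch that rebuilds the target from the counts printed above, assuming the labels are stored as the strings `"0"` and `"1"` because the column is still dtype object.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
# Assumption: labels are strings, since the column is dtype object.
y = pd.Series(
    ["0"] * 23364 + ["1"] * 6636 + ["default payment next month"],
    name="Y",
)

# Keep only the two valid labels, dropping the stray header record.
valid = y[y.isin(["0", "1"])]

proportions = valid.value_counts(normalize=True)
imbalance_ratio = (valid == "0").sum() / (valid == "1").sum()

print(proportions.round(4))
print(f"non-default : default ratio = {imbalance_ratio:.2f}")
```

Filtering on the valid labels, rather than dropping a row by position, keeps the check correct even if the header record is not the first row.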

## Feature Type Classification (Pre-Processing)

Based on dataset documentation and credit-risk domain knowledge, features are classified by their semantic meaning rather than their numeric encoding.

This classification guides preprocessing decisions and prevents invalid modeling assumptions.


### Feature Groups

**Continuous Variables**
- LIMIT_BAL
- AGE
- BILL_AMT1
- BILL_AMT2
- BILL_AMT3
- BILL_AMT4
- BILL_AMT5
- BILL_AMT6
- PAY_AMT1
- PAY_AMT2
- PAY_AMT3
- PAY_AMT4
- PAY_AMT5
- PAY_AMT6

**Ordinal Variables**
- PAY_0
- PAY_2
- PAY_3
- PAY_4
- PAY_5
- PAY_6

**Categorical Variables**
- SEX
- EDUCATION
- MARRIAGE
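The groups above can be recorded as a plain dictionary so that later notebooks can reuse them. This is only a sketch: the `FEATURE_GROUPS` name and the group keys are illustrative choices, and the names assume the headers have already been replaced with the documented column names.

```python
# Feature groups as listed above; the dictionary name and keys are illustrative.
FEATURE_GROUPS = {
    "continuous": (
        ["LIMIT_BAL", "AGE"]
        + [f"BILL_AMT{i}" for i in range(1, 7)]
        + [f"PAY_AMT{i}" for i in range(1, 7)]
    ),
    # The repayment-status columns skip PAY_1: they run PAY_0, PAY_2, ..., PAY_6.
    "ordinal": ["PAY_0"] + [f"PAY_{i}" for i in range(2, 7)],
    "categorical": ["SEX", "EDUCATION", "MARRIAGE"],
}

all_features = [name for group in FEATURE_GROUPS.values() for name in group]

# Each of the 23 documented features (X1 to X23) should appear exactly once.
assert len(all_features) == 23
assert len(set(all_features)) == 23
```

The two assertions guard against a feature being listed twice or left out when the groups are edited later.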
