# Preprocessing Data

### Required Libraries

In [10]:
! pip install -q pandas numpy matplotlib seaborn

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Loading Dataset

In [12]:
df = pd.read_csv("../data/raw/Credit Risk Benchmark Dataset.csv")

### Removing Duplicates

In [13]:
df.drop_duplicates(inplace=True)
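
As a minimal sketch of what this step does, the toy frame below (values invented for illustration; only the column names come from this notebook) counts how many rows `drop_duplicates` removes. A row is treated as a duplicate only when every column matches an earlier row, and the first occurrence is kept.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    "debt_ratio": [0.2, 0.2, 0.5],
    "rev_util": [0.1, 0.1, 0.9],
})

rows_before = len(toy)

# Drop rows that repeat an earlier row in every column, keeping the first one.
deduped = toy.drop_duplicates().reset_index(drop=True)

rows_removed = rows_before - len(deduped)
print(f"Removed {rows_removed} duplicate row(s)")
```

Comparing the row count before and after is a quick way to confirm whether the raw file contained any exact duplicates at all.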

### Creating New Columns

In [14]:
df['total_late'] = df['late_30_59'] + df['late_60_89'] + df['late_90']
df['financial_stress'] = df['debt_ratio'] * df['rev_util']

**total_late** sums the three delinquency counts (30–59, 60–89, and 90+ days past due) into a single aggregated feature. Rather than analyzing each bucket separately, this variable captures the borrower’s overall history of missed payments. A higher value indicates repeated repayment issues and signals behavioral credit risk, which lending models commonly treat as a strong predictor of default.


**financial_stress** is an interaction feature created by multiplying the debt ratio by revolving utilization (`debt_ratio * rev_util`). It is meant to reflect compounded financial pressure: high leverage combined with high credit usage suggests liquidity strain and an increased probability of default.
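
To make the two definitions concrete, here is a small worked sketch on a toy frame. The column names match the ones used above; the values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Three hypothetical borrowers; values are made up for illustration.
toy = pd.DataFrame({
    "late_30_59": [0, 2, 1],
    "late_60_89": [0, 1, 0],
    "late_90": [0, 3, 0],
    "debt_ratio": [0.5, 0.5, 2.0],
    "rev_util": [0.5, 1.0, 0.25],
})

# Same formulas as in the cell above.
toy["total_late"] = toy["late_30_59"] + toy["late_60_89"] + toy["late_90"]
toy["financial_stress"] = toy["debt_ratio"] * toy["rev_util"]

print(toy[["total_late", "financial_stress"]])
```

The second and third borrowers end up with the same `financial_stress` score even though one has moderate leverage with full utilization and the other has high leverage with low utilization. Because the feature is a plain product, it cannot tell those two profiles apart, so it may be worth keeping `debt_ratio` and `rev_util` alongside it.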

### Saving Preprocessed Data

In [17]:
df.to_csv("../data/processed/processed_data.csv", index=False)
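
As an optional sanity check, the sketch below writes a toy frame to a temporary directory with `index=False` and reads it back, to show that no extra index column appears in the saved file. The temporary path and the toy values are assumptions made for illustration; the real notebook writes to `../data/processed/`.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the processed data; values are invented.
toy = pd.DataFrame({"total_late": [0, 6], "financial_stress": [0.25, 0.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_data.csv")
    # index=False keeps the row index out of the CSV.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have exactly the columns that were written.
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_columns)
```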