## Data Preprocessing

In this notebook, we clean and prepare the dataset for modeling. This includes standardizing column names, encoding categorical variables, and setting up the features and target variable for training. Log transformations and outlier capping were also explored but ultimately discarded based on performance results.


In [2]:
# Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing

## Load and Prepare Dataset

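We load the training data and standardize the column names: surrounding whitespace is stripped, names are lowercased, and spaces are replaced with underscores.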


In [3]:
# Load the data
df = pd.read_csv("data/train.csv")
# Standardize column names: strip whitespace, lowercase, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
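
To make the effect of the column-name cleanup concrete, here is a minimal sketch on a toy DataFrame. The column names in it are made up for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical messy column names, for illustration only
toy = pd.DataFrame({" Person Age ": [22, 35], "Loan Amnt": [5000, 12000]})

# Same chain as in the cell above: strip whitespace, lowercase, spaces -> underscores
toy.columns = toy.columns.str.strip().str.lower().str.replace(' ', '_')

print(list(toy.columns))  # ['person_age', 'loan_amnt']
```

Normalizing names once at load time means every later cell can refer to columns by a single predictable spelling.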

## Encode Categorical Variables

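We map the binary `cb_person_default_on_file` flag to 0/1, one-hot encode the nominal variables (`person_home_ownership` and `loan_intent`), and label-encode `loan_grade`, so that the data can be used by models that require numerical inputs.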


In [None]:
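# Encode categorical variables

# Binary flag: normalize the strings, then map Y/N to 1/0
df['cb_person_default_on_file'] = df['cb_person_default_on_file'].str.strip().str.upper()
df['cb_person_default_on_file'] = df['cb_person_default_on_file'].map({'Y': 1, 'N': 0})

# Nominal features: one-hot encode, dropping the first level of each feature
df = pd.get_dummies(df, columns=['person_home_ownership', 'loan_intent'], drop_first=True)

# loan_grade: integer codes assigned in sorted order of the grade labels
label_encoder = preprocessing.LabelEncoder()
df['loan_grade'] = label_encoder.fit_transform(df['loan_grade'])

# Cast boolean dummy columns to int and tidy the generated column names
df = df.astype({col: 'int' for col in df.select_dtypes('bool').columns})
df.columns = df.columns.str.strip().str.replace(" ", "_")

# Save the preprocessed data
df.to_csv("data/train_preprocessed.csv", index=False)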
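
As a quick sanity check of the encoding steps, the sketch below runs the same operations on a three-row toy DataFrame. The rows and category values are assumptions chosen for illustration, not samples from the real dataset, and only one nominal column is included to keep it short.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical rows for illustration; values are assumed, not taken from train.csv
toy = pd.DataFrame({
    "cb_person_default_on_file": [" y", "N", "Y "],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE"],
    "loan_grade": ["B", "A", "C"],
})

# Binary flag: normalize whitespace and case, then map Y/N to 1/0
flag = toy["cb_person_default_on_file"].str.strip().str.upper()
toy["cb_person_default_on_file"] = flag.map({"Y": 1, "N": 0})

# One-hot encode; drop_first removes the first level in sorted order (MORTGAGE here)
toy = pd.get_dummies(toy, columns=["person_home_ownership"], drop_first=True)

# LabelEncoder assigns codes in sorted label order: A -> 0, B -> 1, C -> 2
encoder = preprocessing.LabelEncoder()
toy["loan_grade"] = encoder.fit_transform(toy["loan_grade"])

# Cast boolean dummy columns to int so every column is numeric
toy = toy.astype({col: "int" for col in toy.select_dtypes("bool").columns})

print(toy)
```

Two details are worth keeping in mind. Any flag value other than `Y` or `N` becomes `NaN` after `map`, so checking `isna().sum()` on that column after mapping is a cheap safeguard. And because `LabelEncoder` orders labels by sorting them, the integer codes follow the grade order only when the labels themselves sort that way, as single-letter grades do.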
