# **Loan Default Prediction**: *Preprocessing & Baseline Modeling*
This notebook carries out the preprocessing steps and sets up the baseline model.

Because this is a classification project, we tailor the preprocessing to the specific algorithm in use at each stage. Before splitting the data, we check that it is in a suitable condition for modeling.

In [11]:
# Import relevant dependencies.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

In [12]:
# Load dataset for preprocessing
loans_info = pd.read_csv("../Data/RawData/Loan_default.csv")  

# Preview data
loans_info.head()

Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0


In the next cell, we drop the `LoanID` column: it only identifies each loan and carries no information that is useful for predicting default.

In [13]:
# Drop irrelevant columns.
loans_info = loans_info.drop(columns=['LoanID'])

To rule out inconsistent labels or formatting issues, we check the values in each categorical column.

In [14]:
# View the value counts for each categorical column (missing values included)
for col in loans_info.select_dtypes(include='object').columns:
    print(f"\n--- {col.upper()} ---")
    print(loans_info[col].value_counts(dropna=False))


--- EDUCATION ---
Bachelor's     64366
High School    63903
Master's       63541
PhD            63537
Name: Education, dtype: int64

--- EMPLOYMENTTYPE ---
Part-time        64161
Unemployed       63824
Self-employed    63706
Full-time        63656
Name: EmploymentType, dtype: int64

--- MARITALSTATUS ---
Married     85302
Divorced    85033
Single      85012
Name: MaritalStatus, dtype: int64

--- HASMORTGAGE ---
Yes    127677
No     127670
Name: HasMortgage, dtype: int64

--- HASDEPENDENTS ---
Yes    127742
No     127605
Name: HasDependents, dtype: int64

--- LOANPURPOSE ---
Business     51298
Home         51286
Education    51005
Other        50914
Auto         50844
Name: LoanPurpose, dtype: int64

--- HASCOSIGNER ---
Yes    127701
No     127646
Name: HasCoSigner, dtype: int64


## **Data Preprocessing**
In this section, we prepare the data for modeling, which involves splitting it into training and testing datasets.

- We use `train_test_split` from sklearn's `model_selection` module to split the data. This is an important step in building reliable machine learning models: we train the model on one portion of the data and assess it on a separate, unseen portion. This helps us detect overfitting and gives a realistic estimate of how well the model generalizes to new data. We pass `stratify=y` so that both splits keep the same proportion of defaults as the full dataset.

In [15]:
# Assign predictors and target variable to their respective variables.
X = loans_info.drop('Default', axis=1)
y = loans_info['Default']

# Split the dataset to train and test datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
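
As a quick sanity check on what `stratify=y` does, the sketch below splits a small toy dataset and compares the positive-class rate in each split. The toy data and the `toy_`-prefixed names are hypothetical and only illustrate the idea; they are not part of the loan dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a 10% positive class.
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series([1] * 10 + [0] * 90, name="Default")

# Same arguments as the split above, applied to the toy data.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# With stratification, the class rate should be close to equal in both splits.
train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print(f"train default rate: {train_rate:.2f}, test default rate: {test_rate:.2f}")
```

The same comparison can be run on the real split with `y_train.mean()` and `y_test.mean()`.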


## **Baseline Model**
In this section we are training our baseline model, using the `Logistic Regression` algorithm.

sklearn's `LogisticRegression` needs numeric input, so the categorical columns must be encoded. Standardization is not strictly required, but it helps the solver converge and makes the default regularization treat all features on the same scale. Before we begin modeling, we preprocess the data to suit logistic regression, that is:
- Standardizing the numerical columns.

- One-hot encoding the categorical columns.
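
To make the two steps concrete, here is a minimal sketch on a three-row toy frame. The toy values and the `toy_`-prefixed names are hypothetical; the column names are borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "Income": [20000, 40000, 60000],
    "HasMortgage": ["Yes", "No", "Yes"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Income"]),  # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["HasMortgage"]),  # one 0/1 column per category
    ]
)

toy_out = toy_preprocessor.fit_transform(toy)
# The result may be a sparse matrix depending on its density; densify for display.
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)
print(toy_out)
```

The first output column is the standardized `Income`; the remaining two are the indicator columns for `HasMortgage`, one per category.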

In [None]:
# Identify column types
categorical_cols = X.select_dtypes(include='object').columns.tolist()
numeric_cols = X.select_dtypes(include='number').columns.tolist()

# Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

# Fit the preprocessor on the training data only, then transform it
X_train_processed = preprocessor.fit_transform(X_train)

# Get new feature names 
# From numeric columns, they stay the same
num_features = numeric_cols

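# From OneHotEncoder (get_feature_names was removed in scikit-learn 1.2;
# get_feature_names_out is available from 1.0 onwards)
cat_features = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols)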

# Combine all new column names
all_features = list(num_features) + list(cat_features)

# Convert back to DataFrames
X_train_df = pd.DataFrame(X_train_processed.toarray() if hasattr(X_train_processed, 'toarray') else X_train_processed,
                          columns=all_features,
                          index=X_train.index)

In [None]:
# Preview the dataframe.
X_train_df.head()

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education_Bachelor's,...,HasMortgage_Yes,HasDependents_No,HasDependents_Yes,LoanPurpose_Auto,LoanPurpose_Business,LoanPurpose_Education,LoanPurpose_Home,LoanPurpose_Other,HasCoSigner_No,HasCoSigner_Yes
15826,0.099486,-1.167307,1.698627,0.312274,-1.170591,-0.449335,-1.335928,1.414459,1.512942,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
147371,0.29962,1.319747,-0.86417,-0.5058,1.714171,0.445947,0.185568,0.707489,-0.045473,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
178180,0.232908,0.453497,-1.700954,0.903803,1.396847,0.445947,-1.201856,-0.706451,1.123338,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
126915,-0.100649,-1.191966,-1.432895,-1.44973,-1.661,0.445947,0.723364,0.000519,1.123338,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
163930,-1.568299,0.434508,1.707671,-1.613345,0.416028,0.445947,0.89811,1.414459,-0.21863,1.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


### **Model 1: `LogisticRegression()` baseline model**