# 2. Preprocessing

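In this notebook, we clean and encode the raw loan data, split it into train and validation sets, and save both the processed data and the fitted preprocessing pipeline for the model.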

In [2]:
import sys
sys.path.append('..')

import pandas as pd
from src.preprocessing.utils import create_preprocessing_pipeline, train_val_split

from joblib import dump

## Loading data

In [3]:
df = pd.read_csv('../data/raw/loan-data.csv')
df.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


## Preprocessing

In [4]:
df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

**Dropping features**

We drop the `Loan_ID` column along with two sensitive features, `Married` and `Education`.

After investigating the model, we found that marital status and education had a negative impact on it, so we drop these columns to avoid biasing the model.

In [5]:
df.drop(columns=['Loan_ID'], inplace=True)
df.drop(columns=['Married', 'Education'], inplace=True)

**Numerical and categorical feature columns**

In [6]:
num_features = [
    'Dependents', 'ApplicantIncome', 'CoapplicantIncome', 
    'LoanAmount', 'Loan_Amount_Term',
]
cat_features = [
    'Gender', 'Self_Employed', 'Property_Area', 'Credit_History',
    # 'Married', # 'Education',
] 

print(num_features)
print(cat_features)

['Dependents', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']
['Gender', 'Self_Employed', 'Property_Area', 'Credit_History']


**Converting "Dependents" feature to numeric**

The number of dependents is treated as an ordinal numeric feature: the open-ended category `3+` is mapped to 3.
This way we avoid one-hot encoding while preserving the order of the values.

In [7]:
display(df.Dependents.value_counts())
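# Assign back instead of chained inplace replace, and cast the whole column to numeric
df['Dependents'] = pd.to_numeric(df['Dependents'].replace('3+', '3'))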

Dependents
0     345
1     102
2     101
3+     51
Name: count, dtype: int64
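The mapping described above can be checked on a small toy column. This is a minimal sketch, not part of the original notebook; the sample values are made up for illustration, and it assumes the raw column holds strings plus missing entries.

```python
import numpy as np
import pandas as pd

# Toy column mimicking the raw 'Dependents' values, including a missing entry.
dependents = pd.Series(['0', '1', '2', '3+', np.nan], dtype=object, name='Dependents')

# Cap the open-ended category at 3, then cast everything to a numeric dtype.
dependents_num = pd.to_numeric(dependents.replace('3+', '3'))

print(dependents_num.dtype)
print(dependents_num.tolist())
```

Missing values stay missing after the conversion, so they can still be imputed later by the preprocessing pipeline.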

**Creating the preprocessing pipeline**

In [8]:
preprocessor = create_preprocessing_pipeline(num_features, cat_features)

**Separating features and target**

In [9]:
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

**Encoding target variable**

In [10]:
# 0 = No, 1 = Yes
y = y.apply(
    lambda x: 1 if x == 'Y' else 0
)
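Note that the lambda above sends every value other than `'Y'` to 0, including missing or misspelled labels. As an optional alternative, the sketch below uses an explicit mapping so that unexpected labels surface as NaN and can be caught; the toy labels are made up for illustration.

```python
import pandas as pd

# Toy target column with the two expected labels.
labels = pd.Series(['Y', 'N', 'Y'], name='Loan_Status')

# Explicit mapping: anything outside the dictionary becomes NaN.
encoded = labels.map({'Y': 1, 'N': 0})

# Fail loudly if an unexpected label slipped through.
if encoded.isna().any():
    raise ValueError('Unexpected values in Loan_Status')

encoded = encoded.astype(int)
print(encoded.tolist())  # [1, 0, 1]
```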

**Invoking the preprocessing pipeline**

In [11]:
X.isnull().sum()

Gender               13
Dependents           15
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
dtype: int64

In [12]:
preprocessor.fit(X)
X = preprocessor.transform(X)

**Merging the features and target**

In [13]:
df = pd.concat([X, y], axis=1)
df.head()

| | Dependents | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Gender_Male | Self_Employed_Yes | Property_Area_Rural | Property_Area_Semiurban | Property_Area_Urban | Credit_History_1.0 | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.827104 | 0.544331 | -1.102837 | -0.149985 | 0.17554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1 |
| 1 | 0.854259 | 0.170974 | 0.750578 | -0.019602 | 0.17554 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0 |
| 2 | -0.827104 | -0.499955 | -1.102837 | -1.335521 | 0.17554 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1 |
| 3 | -0.827104 | -0.743873 | 0.891686 | -0.149985 | 0.17554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1 |
| 4 | -0.827104 | 0.582817 | -1.102837 | 0.176671 | 0.17554 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1 |


**Checking nulls**

In [14]:
df.isna().sum()

Dependents                 0
ApplicantIncome            0
CoapplicantIncome          0
LoanAmount                 0
Loan_Amount_Term           0
Gender_Male                0
Self_Employed_Yes          0
Property_Area_Rural        0
Property_Area_Semiurban    0
Property_Area_Urban        0
Credit_History_1.0         0
Loan_Status                0
dtype: int64

**Split into train and validation sets**

In [15]:
train_df, val_df = train_val_split(
    df, val_size=0.15
)
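The `train_val_split` helper is imported from `src/preprocessing/utils` and its body is not shown here. A minimal sketch of what such a helper might do, assuming it wraps scikit-learn's `train_test_split`, is given below. The name `sketch_train_val_split`, the `target` and `seed` parameters, and the stratification on the target are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def sketch_train_val_split(df, val_size=0.15, target='Loan_Status', seed=42):
    """Hypothetical stand-in for train_val_split."""
    # Stratify on the target so both splits keep a similar class balance.
    return train_test_split(
        df, test_size=val_size, stratify=df[target], random_state=seed
    )


# Made-up toy frame: 14 approved and 6 rejected loans.
toy = pd.DataFrame({'x': range(20), 'Loan_Status': [1] * 14 + [0] * 6})
train_part, val_part = sketch_train_val_split(toy)
print(len(train_part), len(val_part))
```

Fixing the random seed keeps the split reproducible across notebook runs.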

## Save the preprocessed data

In [16]:
train_df.to_csv('../data/processed/train.csv', index=False)
val_df.to_csv('../data/processed/val.csv', index=False)

**Save the preprocessing pipeline**

In [17]:
with open('../models/preprocessor.pkl', 'wb') as f:
    dump(preprocessor, f)
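A saved pipeline can later be restored with `joblib.load` and applied to new data. The sketch below shows the round trip with a stand-in `StandardScaler` and a temporary directory, so that it runs without the project's files; in the notebook's setting the path would be `../models/preprocessor.pkl` and the object would be the fitted `preprocessor`.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessor; any fitted scikit-learn object
# is saved and restored the same way.
fitted = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'preprocessor.pkl'
    dump(fitted, path)      # joblib also accepts a path directly
    restored = load(path)

# The restored object reproduces the original transformation.
new_rows = np.array([[2.0], [4.0]])
print(np.allclose(fitted.transform(new_rows), restored.transform(new_rows)))
```

Only load pickled files from sources you trust, since unpickling can execute arbitrary code.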