STEP 3: Data Cleaning & Preprocessing

This notebook focuses on cleaning the raw loan dataset and
preparing it for machine learning and explainable decision-making.

Key goals:
- Handle missing values responsibly
- Encode categorical variables
- Preserve interpretability
- Prepare clean, reproducible data

1. IMPORTS

In [1]:
import pandas as pd
import numpy as np

2. LOAD RAW DATA

In [2]:
df = pd.read_csv(
    r"E:\ALL Documents\LEVEL 6 Completed\Projects\Week 1 AI & ML & Linux\end-to-end-explainable-ai-system\data\raw\loan-prediction-dataset.csv"
)

df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


3. CHECK MISSING VALUES

In [3]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

4. HANDLE MISSING NUMERICAL VALUES

In [4]:
df['LoanAmount'].fillna(df['LoanAmount'].median(), inplace=True)
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].median, inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['LoanAmount'].fillna(df['LoanAmount'].median(), inplace=True)
1      360.0
2      360.0
3      360.0
4      360.0
       ...  
609    360.0
610    180.0
611    360.0
612    360.0
613    360.0
Name: Loan_Amount_Term, Length: 614, dtype: float64>' has dtype incompatible with float64, please explicitly cast to a compatible dtype first.
  df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].median, inplace=True)


In [5]:
df['LoanAmount'].isnull().sum()
df['Loan_Amount_Term'].isnull().sum()


0

6. HANDLE CREDIT HISTORY

In [6]:
df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True)


In [7]:
df['Credit_History'].isnull().sum()


0

7. HANDLE CATEGORICAL MISSING VALUES

In [8]:
categorical_col = [
    'Gender', 'Married', 'Dependents',
    'Self_Employed'
]

for col in categorical_col:
    df[col].fillna(df[col].mode()[0], inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(df[col].mode()[0], inplace=True)


In [9]:
df[categorical_col].isnull().sum()

Gender           0
Married          0
Dependents       0
Self_Employed    0
dtype: int64

8. VERIFY NO MISSING VALUES

In [10]:
df.isnull().sum()


Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

9. ENCODE CATEGORICAL VARIABLES

In [11]:
binary_mapping = {
    'Y': 1,
    'N': 0
}

df['Loan_Status'] = df['Loan_Status'].map(binary_mapping)

In [12]:
df = pd.get_dummies(
    df,
    columns=[
        'Gender', 'Married', 'Dependents',
        'Education', 'Self_Employed',
        'Property_Area'
    ],
    drop_first=True
)


10. FINAL DATA CHECK

In [13]:
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Loan_ID                  614 non-null    object 
 1   ApplicantIncome          614 non-null    int64  
 2   CoapplicantIncome        614 non-null    float64
 3   LoanAmount               614 non-null    float64
 4   Loan_Amount_Term         614 non-null    object 
 5   Credit_History           614 non-null    float64
 6   Loan_Status              614 non-null    int64  
 7   Gender_Male              614 non-null    bool   
 8   Married_Yes              614 non-null    bool   
 9   Dependents_1             614 non-null    bool   
 10  Dependents_2             614 non-null    bool   
 11  Dependents_3+            614 non-null    bool   
 12  Education_Not Graduate   614 non-null    bool   
 13  Self_Employed_Yes        614 non-null    bool   
 14  Property_Area_Semiurban  6

11. SAVE CLEANED DATA (IMPORTANT)

In [14]:
df.to_csv("../data/processed/cleaned_data.csv", index=False)