STEP 2 — Dataset Understanding — Loan Approval System

This notebook focuses on understanding the structure, quality,
and ethical implications of the loan approval dataset before
any modeling or decision-making is performed.

The goal is to ensure that the AI system learns from reliable,
fair, and meaningful data.

1. Import Libraries

In [1]:
import pandas as pd
import numpy as np

2. Load Data

In [2]:
df = pd.read_csv(
    r"E:\ALL Documents\LEVEL 6 Completed\Projects\Week 1 AI & ML & Linux\end-to-end-explainable-ai-system\data\raw\loan-prediction-dataset.csv"
)

df.head()

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,NaN,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
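
The absolute Windows path in the cell above only resolves on the original machine. A minimal sketch of one portable alternative, assuming the notebook is run from somewhere inside the project folder and the CSV keeps its `data/raw/` location; the helper name `find_project_file` is hypothetical, not part of this project:

```python
from pathlib import Path
from typing import Optional

# Location of the CSV relative to the project root (taken from the path above).
RELATIVE_CSV = Path("data") / "raw" / "loan-prediction-dataset.csv"


def find_project_file(relative: Path, start: Optional[Path] = None) -> Path:
    """Walk upward from `start` (default: current directory) until `relative` exists."""
    start = (start or Path.cwd()).resolve()
    for folder in (start, *start.parents):
        candidate = folder / relative
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{relative} not found in {start} or any parent folder")


# Usage from a notebook cell (pd is imported in cell 1):
# df = pd.read_csv(find_project_file(RELATIVE_CSV))
```

Searching upward rather than hard-coding `..` keeps the call working whether the notebook sits in the project root or in a subfolder.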


3. Dataset Structure

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [4]:
list(df.columns)

['Loan_ID',
 'Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History',
 'Property_Area',
 'Loan_Status']

4. Statistics

In [5]:
df.describe()

,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0
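
The summary above shows why the income columns need attention: the maximum ApplicantIncome (81000) is far above the 75th percentile (5795). A minimal sketch of flagging such values with the 1.5 * IQR rule and compressing the scale with a log transform. The 1.5 multiplier and the use of `np.log1p` are common conventions assumed here, not decisions taken in this notebook, and the helper name `iqr_outlier_mask` is hypothetical:

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Upper fence implied by the ApplicantIncome quartiles reported by df.describe().
q1, q3 = 2877.5, 5795.0
upper_fence = q3 + 1.5 * (q3 - q1)  # 10171.25

# Demonstration values: the five incomes from df.head() plus the reported maximum.
incomes = pd.Series([5849, 4583, 3000, 2583, 6000, 81000], name="ApplicantIncome")
mask = iqr_outlier_mask(incomes)   # only the 81000 entry is flagged
log_incomes = np.log1p(incomes)    # log(1 + x), defined for zero incomes too
```

On the full dataset the same helper would be called as `iqr_outlier_mask(df["ApplicantIncome"])`. Whether flagged rows are capped, transformed, or kept is a modeling decision left open here.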


5. Features & Target

Feature & Target Identification

Target Variable
    - Loan_Status
        - Y → Loan Approved
        - N → Loan Rejected

Input Features
    - Demographic: Gender, Married, Dependents, Education, Self_Employed
    - Financial: ApplicantIncome, CoapplicantIncome, LoanAmount
    - Risk-related: Credit_History
    - Contextual: Property_Area, Loan_Amount_Term

Excluded
    - Loan_ID (row identifier, carries no predictive information)
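
The grouping above can be written down as code so that later steps select columns consistently. A minimal sketch, assuming the target is encoded as 1 for Y and 0 for N; grouping Self_Employed with the demographic features and dropping Loan_ID as a row identifier are assumptions of this sketch, and the names `FEATURE_GROUPS` and `split_features_target` are hypothetical:

```python
import pandas as pd

TARGET = "Loan_Status"

# Feature groups for the loan dataset. Self_Employed is grouped with the
# demographic features here, which is an assumption about where it belongs.
FEATURE_GROUPS = {
    "demographic": ["Gender", "Married", "Dependents", "Education", "Self_Employed"],
    "financial": ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"],
    "risk": ["Credit_History"],
    "contextual": ["Property_Area", "Loan_Amount_Term"],
}


def split_features_target(df: pd.DataFrame):
    """Return (X, y): X holds the grouped features, y is 1 for 'Y' and 0 for 'N'."""
    feature_columns = [col for group in FEATURE_GROUPS.values() for col in group]
    X = df[feature_columns]  # Loan_ID is left out: it only identifies the row
    y = df[TARGET].map({"Y": 1, "N": 0})
    return X, y


# Usage: X, y = split_features_target(df)
```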

6. Target Distribution

In [6]:
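# Raw counts (not displayed: a notebook cell only shows its last expression)
df['Loan_Status'].value_counts()

# Proportions of approved (Y) and rejected (N) loans
df['Loan_Status'].value_counts(normalize=True)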

Loan_Status
Y    0.687296
N    0.312704
Name: proportion, dtype: float64
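
These proportions imply a baseline any model has to beat: always predicting approval would already be right about 68.7% of the time. A small sketch that recovers the class counts and that baseline from the reported figures, taking the 614 rows from `df.info()`; the variable names are illustrative only:

```python
import pandas as pd

n_rows = 614  # from df.info()
proportions = pd.Series({"Y": 0.687296, "N": 0.312704}, name="proportion")

# Recover the class counts from the reported proportions.
counts = (proportions * n_rows).round().astype(int)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()

# Majority-to-minority ratio; values well above 1 indicate imbalance.
imbalance_ratio = counts.max() / counts.min()
```

With roughly two approvals per rejection, accuracy alone can look good while rejections are predicted poorly, so per-class metrics and stratified splits are worth considering in later steps.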

7. Missing Values

In [7]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
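
Raw counts are easier to judge as a share of the 614 rows. A minimal sketch that ranks columns by percentage missing; `missing_report` is a hypothetical helper name, and the `reported` series simply restates the counts printed above:

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# The same percentages computed directly from the counts shown above.
reported = pd.Series({
    "Credit_History": 50, "Self_Employed": 32, "LoanAmount": 22,
    "Dependents": 15, "Loan_Amount_Term": 14, "Gender": 13, "Married": 3,
})
reported_percent = (reported / 614 * 100).round(2)

# Usage on the real data: missing_report(df)
```

Credit_History comes out highest at about 8% of rows, while every other column is missing in under 6% of rows.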

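8. Data Quality

    Data Quality & Noise Observations

- Income values vary widely: ApplicantIncome ranges from 150 to 81000 against a median of 3812.5, which points to outliers and a right-skewed distribution
- Credit_History is missing for 50 of 614 rows (about 8%), the most of any column, which may increase decision uncertainty
- Numeric features sit on very different scales (incomes in the thousands, Credit_History as 0/1), so scaling may be needed for scale-sensitive models

9. Ethics / Data Observations

- Moderate class imbalance: about 69% of loans are approved (Y) and 31% rejected (N)
- Missing values may introduce bias if they are concentrated in particular applicant groups
- The dataset's source and collection process must be evaluated for fairness, since demographic features such as Gender and Married are model inputs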
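
One concrete starting point for the fairness evaluation is to compare approval rates across the groups of a demographic feature. A minimal sketch, assuming the ratio of the lowest to the highest group rate is a useful first summary; the helper names `approval_rates` and `rate_ratio` are hypothetical. The demonstration uses only the Married and Loan_Status values from the five rows shown by `df.head()`, which is far too few to draw conclusions from:

```python
import pandas as pd


def approval_rates(df: pd.DataFrame, group_col: str,
                   target_col: str = "Loan_Status") -> pd.Series:
    """Share of approved loans ('Y') per group; rows with a missing group are skipped."""
    approved = df[target_col].eq("Y")
    return approved.groupby(df[group_col]).mean()


def rate_ratio(rates: pd.Series) -> float:
    """Lowest group approval rate divided by the highest (1.0 means equal rates)."""
    return rates.min() / rates.max()


# Mechanics only: Married and Loan_Status from the five rows of df.head().
head_rows = pd.DataFrame({
    "Married": ["No", "Yes", "Yes", "Yes", "No"],
    "Loan_Status": ["Y", "N", "Y", "Y", "Y"],
})
rates = approval_rates(head_rows, "Married")
ratio = rate_ratio(rates)

# Usage on the real data: approval_rates(df, "Gender")
```

A ratio well below 1 on the full dataset would not prove unfairness by itself, but it would mark the feature for closer review before modeling.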