**Project Overview**

In this project, you will conduct an in-depth Exploratory Data Analysis (EDA) on a Home Loan dataset. The objective is to understand the underlying structure, trends, and relationships in the data through data cleaning, visualization, and statistical analysis. This initial investigation is essential for uncovering patterns that may influence loan approvals and risk assessment.

**Project Introduction**

The home loan industry plays a pivotal role in the financial services sector, enabling individuals and families to secure funding for property purchases. Financial institutions rely on historical loan data to assess creditworthiness and refine their lending practices. The Home Loan dataset contains key information on applicants, such as income, employment status, credit history, and property details, along with the corresponding loan outcomes. A comprehensive EDA can reveal which factors affect loan approvals, defaults, and overall financial risk, and these insights support data-driven decision-making in the mortgage industry.

**Project Objective**

The primary goal of this project is to perform a thorough exploratory analysis of the Home Loan dataset. Specific objectives include:
- Data Cleaning and Preparation: Identify and handle missing values, inconsistencies, and outliers in the dataset.
- Descriptive Analysis: Understand the distribution of key features such as applicant income, loan amounts, and property characteristics.
- Correlation Analysis: Explore relationships between variables (e.g., the impact of credit history on loan approval) using correlation matrices and statistical measures.
- Visualization: Generate meaningful charts and plots (histograms, scatter plots, box plots, etc.) to visually represent data distributions and relationships.
- Insight Generation: Summarize and interpret findings to support subsequent predictive modeling and strategic decision-making in home loan processing.


#### **Project Phases**

#### Phase 1: Data Collection and Preparation

In [4]:
# Importing Libraries

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

In [5]:
# Task 1.1: Load the Home Loan dataset into a Pandas DataFrame.

url1 = 'https://raw.githubusercontent.com/ek-chris/Practice_datasets/refs/heads/main/home_loan_train.csv'
url2 = 'https://raw.githubusercontent.com/ek-chris/Practice_datasets/refs/heads/main/home_loan_test.csv'

df1 = pd.read_csv(url1)
df1.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [6]:
df1_copy = df1.copy()
df1_copy.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [7]:
df2 = pd.read_csv(url2)
df2.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


In [8]:
df2_copy = df2.copy()
df2_copy.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


In [9]:
df1_copy.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


In [10]:
df1_copy.describe(include='all')

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
count,614,601,611,599.0,614,582,614.0,614.0,592.0,600.0,564.0,614,614
unique,614,2,2,4.0,2,2,,,,,,3,2
top,LP001002,Male,Yes,0.0,Graduate,No,,,,,,Semiurban,Y
freq,1,489,398,345.0,480,500,,,,,,233,422
mean,,,,,,,5403.459283,1621.245798,146.412162,342.0,0.842199,,
std,,,,,,,6109.041673,2926.248369,85.587325,65.12041,0.364878,,
min,,,,,,,150.0,0.0,9.0,12.0,0.0,,
25%,,,,,,,2877.5,0.0,100.0,360.0,1.0,,
50%,,,,,,,3812.5,1188.5,128.0,360.0,1.0,,
75%,,,,,,,5795.0,2297.25,168.0,360.0,1.0,,
max,,,,,,,81000.0,41667.0,700.0,480.0,1.0,,


In [11]:
df1_copy.describe(include='object')

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,Property_Area,Loan_Status
count,614,601,611,599,614,582,614,614
unique,614,2,2,4,2,2,3,2
top,LP001002,Male,Yes,0,Graduate,No,Semiurban,Y
freq,1,489,398,345,480,500,233,422


In [12]:
df1_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [13]:
df1_copy.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [14]:
df1_copy.shape

(614, 13)

In [15]:
# Task 1.2: Inspect the dataset for missing values, duplicates, and data type inconsistencies.

# Missing values

df1_copy.isna().sum()


Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
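
Raw counts are easier to compare across columns when expressed as a share of the rows. The following is a minimal sketch of that idea, assuming a small made-up frame (`toy`, a hypothetical stand-in for `df1_copy`) rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df1_copy (values are illustrative only)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(2),
}).sort_values("missing", ascending=False)

print(missing_summary)
```

Applied to the real frame, the same pattern would show at a glance which columns (here, `Credit_History` has the highest count) carry the largest share of gaps.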

In [16]:
df1_copy.duplicated().sum()

0

In [17]:
df1_copy.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object
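
`Dependents` is stored as `object` even though it counts people, because the column contains the label `3+` (visible in later outputs). Whether to convert it is a judgment call. The sketch below shows one possible conversion on a hypothetical sample; mapping `3+` to `3` is an assumption made for illustration and discards the "or more" information:

```python
import pandas as pd

# Hypothetical sample of the Dependents column (illustrative values only)
dependents = pd.Series(["0", "1", "2", "3+", None], name="Dependents")

# The '3+' label is what keeps the column from being parsed as a number
print(dependents.unique())

# One possible conversion: treat '3+' as 3 and keep missing values as <NA>
dependents_num = pd.to_numeric(
    dependents.replace("3+", "3"), errors="coerce"
).astype("Int64")

print(dependents_num.dtype)
```

Keeping the column categorical is equally defensible for EDA; the conversion mainly matters if the column is later fed into numeric summaries or correlations.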

In [18]:
# Task 1.3: Clean the dataset by handling missing values, correcting data types, and addressing outliers.

# Correct the 'ApplicantIncome' dtype to float

df1_copy['ApplicantIncome'] = df1_copy['ApplicantIncome'].astype(float)

In [19]:
# Handle missing values by column type:
# 1. fill numerical (float64) columns with the column median, and
# 2. fill categorical (object) columns with the most frequent value (mode).

df1_copy = df1_copy.apply(lambda col: col.fillna(col.median()) if col.dtype in ['float64'] else col.fillna(col.mode()[0]))
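
The one-line `apply` above checks for `float64` only, so it relies on `ApplicantIncome` having been cast to float beforehand; an integer column would otherwise fall into the mode branch. A more explicit variant, sketched here on a hypothetical toy frame (names and values are illustrative), selects columns by type first:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (illustrative values only)
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
    "Gender": ["Male", "Male", None, "Female"],
})

# Split the columns into numerical and everything else
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.columns.difference(num_cols)

filled = toy.copy()
# Numerical columns: fill with each column's median
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
# Remaining columns: fill with each column's most frequent value
for col in cat_cols:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

Selecting with `include="number"` covers both integer and float columns, which removes the dependency on the earlier type cast.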

In [20]:
# # Check outliers in numerical columns (z-score and IQR methods)

# # numeric columns
# num_cols = df1_copy.select_dtypes(include=np.number).columns

# # 1) Z-score method (uses precomputed z_score if available)
# z = z_score[num_cols] if 'z_score' in globals() else pd.DataFrame(stats.zscore(df1_copy[num_cols]), columns=num_cols)
# z_out_mask = (z.abs() > 3)
# z_out_counts = z_out_mask.sum()

# print("Outlier counts by column (z-score |z|>3):")
# print(z_out_counts[z_out_counts > 0] if z_out_counts.sum() > 0 else "No z-score outliers detected")
# print()

# # Rows that have any z-score outlier
# rows_with_z_outliers = df1_copy[z_out_mask.any(axis=1)]
# print(f"Number of rows with any z-score outlier: {len(rows_with_z_outliers)}")
# display(rows_with_z_outliers.head())

# # 2) IQR method (classic boxplot rule)
# iqr_out_counts = {}
# iqr_masks = []
# for col in num_cols:
#     Q1 = df1_copy[col].quantile(0.25)
#     Q3 = df1_copy[col].quantile(0.75)
#     IQR = Q3 - Q1
#     lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
#     mask = (df1_copy[col] < lower) | (df1_copy[col] > upper)
#     iqr_out_counts[col] = mask.sum()
#     iqr_masks.append(mask)

# iqr_out_counts = pd.Series(iqr_out_counts)
# print("Outlier counts by column (IQR rule):")
# print(iqr_out_counts[iqr_out_counts > 0] if iqr_out_counts.sum() > 0 else "No IQR outliers detected")
# print()

# # Rows with any IQR outlier
# any_iqr_out = pd.concat(iqr_masks, axis=1).any(axis=1)
# rows_with_iqr_outliers = df1_copy[any_iqr_out]
# print(f"Number of rows with any IQR outlier: {len(rows_with_iqr_outliers)}")
# display(rows_with_iqr_outliers.head())

# # Quick boxplot visualization for numeric columns
# plt.figure(figsize=(10, 5))
# df1_copy[num_cols].boxplot(rot=45)
# plt.title("Boxplots of numerical columns")
# plt.tight_layout()
# plt.show()

In [21]:
# Addressing outliers
# Calculate the z-score of every numerical column
z_score = stats.zscore(df1_copy.select_dtypes(include=np.number))
z_score

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
0,0.072991,-0.554487,-0.211241,0.273231,0.411733
1,-0.134412,-0.038732,-0.211241,0.273231,0.411733
2,-0.393747,-0.554487,-0.948996,0.273231,0.411733
3,-0.462062,0.251980,-0.306435,0.273231,0.411733
4,0.097728,-0.554487,-0.056551,0.273231,0.411733
...,...,...,...,...,...
609,-0.410130,-0.554487,-0.889500,0.273231,0.411733
610,-0.212557,-0.554487,-1.258378,-2.522836,0.411733
611,0.437174,-0.472404,1.276168,0.273231,0.411733
612,0.357064,-0.554487,0.490816,0.273231,0.411733
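
For reference, each z-score is the value's distance from the column mean in units of the column's standard deviation, z = (x - mean) / std. By default `scipy.stats.zscore` uses the population standard deviation (`ddof=0`). A small check on hypothetical values (not taken from the dataset):

```python
import numpy as np
from scipy import stats

# Hypothetical income values (illustrative only)
x = np.array([2500.0, 3000.0, 3800.0, 4600.0, 81000.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
manual = (x - x.mean()) / x.std(ddof=0)
auto = stats.zscore(x)

# The manual formula and the SciPy result agree
print(np.allclose(manual, auto))
```

Because the mean and standard deviation are themselves pulled by extreme values, z-scores can understate how unusual a point is in strongly skewed columns such as income.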


In [22]:
# Check outliers in numerical columns (z-score and IQR methods)

# numeric columns
num_cols = df1_copy.select_dtypes(include=np.number).columns

# Z-score method (uses precomputed z_score if available)
z = z_score[num_cols] if 'z_score' in globals() else pd.DataFrame(stats.zscore(df1_copy[num_cols]), columns=num_cols)
z_out_mask = (z.abs() > 3)
z_out_counts = z_out_mask.sum()

print("Outlier counts by column (z-score |z|>3):")
print(z_out_counts[z_out_counts > 0] if z_out_counts.sum() > 0 else "No z-score outliers detected")
print()

# Rows that have any z-score outlier
rows_with_z_outliers = df1_copy[z_out_mask.any(axis=1)]
print(f"Number of rows with any z-score outlier: {len(rows_with_z_outliers)}")
display(rows_with_z_outliers.head())

Outlier counts by column (z-score |z|>3):
ApplicantIncome       8
CoapplicantIncome     6
LoanAmount           15
Loan_Amount_Term     12
dtype: int64

Number of rows with any z-score outlier: 37


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
9,LP001020,Male,Yes,1,Graduate,No,12841.0,10968.0,349.0,360.0,1.0,Semiurban,N
14,LP001030,Male,Yes,2,Graduate,No,1299.0,1086.0,17.0,120.0,1.0,Urban,Y
68,LP001238,Male,Yes,3+,Not Graduate,Yes,7100.0,0.0,125.0,60.0,1.0,Urban,Y
94,LP001325,Male,No,0,Not Graduate,No,3620.0,0.0,25.0,120.0,1.0,Semiurban,Y
126,LP001448,Male,Yes,3+,Graduate,No,23803.0,0.0,370.0,360.0,1.0,Rural,Y


In [23]:
# Set the z-score threshold
threshold = 3

# Attempt to cap the outliers in each numerical column
for i, col in enumerate(df1_copy.select_dtypes(include=[np.number]).columns):
    # Select the values whose |z-score| exceeds the threshold
    outliers = df1_copy[col][abs(stats.zscore(df1_copy[col])) > threshold]
    # Replace the flagged rows (located via .iloc on the precomputed z_score) with clipped values.
    # NOTE: the clip bounds are the min and max of the outliers themselves, not z-score or
    # IQR limits, so every flagged value already lies inside them and stays unchanged;
    # the re-check in the next cell accordingly reports identical counts.
    df1_copy[col] = np.where(np.abs(z_score.iloc[:, i]) > threshold, df1_copy[col].clip(lower=outliers.min(), upper=outliers.max()), df1_copy[col])

In [24]:
# Re-check outliers after capping

# numeric columns (refresh to reflect any changes)
num_cols = df1_copy.select_dtypes(include=np.number).columns

# Recompute z-scores on the current df1_copy
z_score_new = pd.DataFrame(stats.zscore(df1_copy[num_cols]), columns=num_cols)

# Z-score outlier mask and counts (|z| > threshold)
z_out_mask_new = z_score_new.abs() > threshold
z_out_counts_new = z_out_mask_new.sum()

print("Outlier counts by column AFTER capping (z-score |z|>3):")
print(z_out_counts_new[z_out_counts_new > 0] if z_out_counts_new.sum() > 0 else "No z-score outliers detected")
print()

# Rows that have any z-score outlier
rows_with_z_outliers_new = df1_copy[z_out_mask_new.any(axis=1)]
print(f"Number of rows with any z-score outlier AFTER capping: {len(rows_with_z_outliers_new)}")
display(rows_with_z_outliers_new.head())

# Compare with previous z-score counts if available
if 'z_out_counts' in globals():
    compare = pd.concat([z_out_counts.rename('before'), z_out_counts_new.rename('after')], axis=1)
    print("\nComparison of z-score outlier counts (before vs after):")
    display(compare)

# # IQR-based check AFTER capping
# iqr_out_counts_new = {}
# for col in num_cols:
#     Q1 = df1_copy[col].quantile(0.25)
#     Q3 = df1_copy[col].quantile(0.75)
#     IQR = Q3 - Q1
#     lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
#     iqr_out_counts_new[col] = ((df1_copy[col] < lower) | (df1_copy[col] > upper)).sum()

# iqr_out_counts_new = pd.Series(iqr_out_counts_new)
# print("\nOutlier counts by column AFTER capping (IQR rule):")
# print(iqr_out_counts_new[iqr_out_counts_new > 0] if iqr_out_counts_new.sum() > 0 else "No IQR outliers detected")

# # Quick boxplot visualization for numeric columns after capping
# plt.figure(figsize=(10, 5))
# df1_copy[num_cols].boxplot(rot=45)
# plt.title("Boxplots of numerical columns AFTER capping")
# plt.tight_layout()
# plt.show()

Outlier counts by column AFTER capping (z-score |z|>3):
ApplicantIncome       8
CoapplicantIncome     6
LoanAmount           15
Loan_Amount_Term     12
dtype: int64

Number of rows with any z-score outlier AFTER capping: 37


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
9,LP001020,Male,Yes,1,Graduate,No,12841.0,10968.0,349.0,360.0,1.0,Semiurban,N
14,LP001030,Male,Yes,2,Graduate,No,1299.0,1086.0,17.0,120.0,1.0,Urban,Y
68,LP001238,Male,Yes,3+,Not Graduate,Yes,7100.0,0.0,125.0,60.0,1.0,Urban,Y
94,LP001325,Male,No,0,Not Graduate,No,3620.0,0.0,25.0,120.0,1.0,Semiurban,Y
126,LP001448,Male,Yes,3+,Graduate,No,23803.0,0.0,370.0,360.0,1.0,Rural,Y



Comparison of z-score outlier counts (before vs after):


Unnamed: 0,before,after
ApplicantIncome,8,8
CoapplicantIncome,6,6
LoanAmount,15,15
Loan_Amount_Term,12,12
Credit_History,0,0
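
The unchanged counts in the comparison above indicate that the capping step did not alter any values: clipping to the minimum and maximum of the flagged values leaves those values where they are. One common alternative is to clip each column to mean ± threshold × standard deviation. The sketch below illustrates that idea on hypothetical data; `cap_by_zscore` is a helper name introduced here for illustration, not part of the notebook:

```python
import numpy as np
import pandas as pd

def cap_by_zscore(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Clip values to mean ± threshold * std (population std, ddof=0)."""
    mean, std = series.mean(), series.std(ddof=0)
    return series.clip(lower=mean - threshold * std, upper=mean + threshold * std)

# Hypothetical data: 29 ordinary values plus one extreme value
values = pd.Series([100.0] * 29 + [10_000.0])
capped = cap_by_zscore(values)

# The extreme value is pulled down to the upper bound; the rest are untouched
print(values.max(), capped.max())
```

Note that recomputing z-scores on the capped data uses a new mean and standard deviation, so a capped value can still score above the threshold. A before/after check is more informative when it compares against the bounds computed from the original data.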


In [25]:
# Preview the DataFrame after cleaning
df1_copy.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849.0,0.0,128.0,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583.0,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000.0,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583.0,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000.0,0.0,141.0,360.0,1.0,Urban,Y


#### Phase 2: Exploratory Data Analysis (EDA)

In [26]:
# Task 2.1: Conduct descriptive statistics to summarize the key characteristics of the data.

df1_copy.describe()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,614.0,614.0,614.0
mean,5403.459283,1621.245798,145.752443,342.410423,0.855049
std,6109.041673,2926.248369,84.107233,64.428629,0.352339
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.25,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,164.75,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0
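
In the summary above, the mean of `ApplicantIncome` (about 5403) sits well above its median (3812.5), which usually points to a right-skewed distribution. A quick way to quantify this is the sample skewness. A minimal sketch on hypothetical values (not the real column):

```python
import pandas as pd

# Hypothetical income values with one very large entry (illustrative only)
income = pd.Series([2500.0, 3000.0, 3800.0, 4600.0, 81000.0], name="ApplicantIncome")

# Compare mean and median, then compute the sample skewness
summary = {
    "mean": income.mean(),
    "median": income.median(),
    "skew": income.skew(),
}
print(summary)
```

A positive skew together with a mean above the median suggests a long right tail, in which case a log transform or robust statistics may describe the column better than the mean alone.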
