## **Project Overview**

This project will conduct an in-depth Exploratory Data Analysis (EDA) on a Home Loan dataset. The objective is to understand the underlying structure, trends, and relationships in the data through data cleaning, visualization, and statistical analysis. This initial investigation is essential for uncovering patterns that may influence loan approvals and risk assessment.

Financial institutions rely on historical loan data to assess creditworthiness and refine their lending practices. The Home Loan dataset contains key information on applicants, such as income, employment status, credit history, and property details, along with the corresponding loan outcomes. A comprehensive EDA can reveal which factors affect loan approvals, defaults, and overall financial risk, which supports data-driven decision-making in the mortgage industry.

The primary goal of this project is a thorough exploratory analysis of the Home Loan dataset, organized into three stages: data collection and preparation, exploratory data analysis, and reporting of insights.

### **Data Collection and Preparation**

In [3]:
# Importing the necessary libraries

import pandas as pd
import numpy as np

##### **Loading the dataset directly from GitHub into a Pandas DataFrame**

In [2]:
url = r"https://raw.githubusercontent.com/ek-chris/Practice_datasets/refs/heads/main/home_loan_train.csv"

data = pd.read_csv(url)

# Save a local copy of the dataset; the next cell reloads it from disk.
data.to_csv(r"Home_loan", index=False)

In [5]:
df = pd.read_csv(r"Home_loan")
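
The download-then-save steps above can be folded into one helper that only touches the network when no local copy exists. This is a minimal sketch, not the notebook's own method: `load_with_cache` is a hypothetical name, and the demo uses a throwaway two-row file and a placeholder URL so that it runs offline.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_with_cache(url: str, cache_path: str) -> pd.DataFrame:
    """Read the CSV at cache_path if it exists; otherwise download it and cache it."""
    path = Path(cache_path)
    if path.exists():
        # A local copy is available, so no network access is needed.
        return pd.read_csv(path)
    frame = pd.read_csv(url)
    frame.to_csv(path, index=False)  # index=False avoids writing an extra index column
    return frame


# Demo: write a tiny illustrative file first, so the helper takes the cached branch.
demo_path = Path(tempfile.mkdtemp()) / "demo_home_loan.csv"
pd.DataFrame(
    {"Loan_ID": ["LP001002", "LP001003"], "Loan_Status": ["Y", "N"]}
).to_csv(demo_path, index=False)

cached = load_with_cache("https://example.invalid/unused.csv", str(demo_path))
print(cached.shape)
```

In the notebook, the call would take the real `url` and a local file name such as `"Home_loan"`.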


With the data loaded, the next step is to preview the dataset.

In [7]:
df.head()       # See the first 5 rows

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [8]:
df.shape        # Number of rows and columns

(614, 13)

In [9]:
df.info()   # Column names, data types, non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


#### **Inspecting for missing values, duplicates, and data type inconsistencies**

In [10]:
# Checking missing values
df.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
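
Raw counts are easier to judge as a share of the rows. In the output above, for example, `Credit_History` is missing in 50 of 614 rows, roughly 8.1%. The sketch below shows one way to tabulate counts and percentages side by side. The frame `toy` is a small made-up stand-in for `df`, not real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Gender": ["Male", None, "Female", "Male"],
        "LoanAmount": [128.0, np.nan, np.nan, 66.0],
        "Loan_Status": ["Y", "N", "Y", "Y"],
    }
)

# isna().sum() counts missing cells per column; isna().mean() gives their fraction.
missing = pd.DataFrame(
    {
        "missing_count": toy.isna().sum(),
        "missing_pct": (toy.isna().mean() * 100).round(1),
    }
).sort_values("missing_pct", ascending=False)

print(missing)
```

Replacing `toy` with `df` should give the same table for the loan data, sorted so the most incomplete columns appear first.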

In [12]:
# Checking for duplicate rows
df.duplicated().sum()

np.int64(0)

The DataFrame contains no duplicate rows.

In [13]:
# Checking the column data types for inconsistencies
df.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

#### **Cleaning the dataset**

In [None]:
# Fill missing values in the non-numerical columns with each column's mode.
# mode() returns the most frequent value(s); [0] takes the first one.
# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn may not update the DataFrame.
mode_gender = data["Gender"].mode()[0]
data["Gender"] = data["Gender"].fillna(mode_gender)

mode_married = data["Married"].mode()[0]
data["Married"] = data["Married"].fillna(mode_married)

mode_dependents = data["Dependents"].mode()[0]
data["Dependents"] = data["Dependents"].fillna(mode_dependents)

mode_employed = data["Self_Employed"].mode()[0]
data["Self_Employed"] = data["Self_Employed"].fillna(mode_employed)

mode_loan = data["Loan_Status"].mode()[0]
data["Loan_Status"] = data["Loan_Status"].fillna(mode_loan)



In [None]:
# Fill missing values in all numerical columns with their respective medians
data.fillna(data.median(numeric_only=True), inplace=True)

In [None]:
data.isna().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
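
The two cleaning rules used here (mode for non-numerical columns, median for numerical ones) can also be expressed as a single function that leaves its input untouched. This is a sketch under stated assumptions rather than the notebook's own code: `impute_simple` and the frame `toy` are hypothetical, and the function assumes every column has at least one non-missing value.

```python
import numpy as np
import pandas as pd


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() can return several values; take the first one.
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


# Illustrative data only.
toy = pd.DataFrame(
    {
        "Married": ["Yes", "Yes", None, "No"],
        "LoanAmount": [100.0, np.nan, 120.0, 200.0],
    }
)
clean = impute_simple(toy)
print(clean)
```

Working on a copy keeps the raw frame available for comparison, which helps when checking how much imputation shifts the distributions.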

### **Exploratory Data Analysis (EDA)**

**Performing descriptive statistics to summarize the key characteristics of the data.**

In [None]:
# Previewing the first three rows before computing descriptive statistics
data.head(3)

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0.0 | 110.0 | 360.0 | 1.0 | Urban | Y |
| 1 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500.0 | 126.0 | 360.0 | 1.0 | Urban | Y |
| 2 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800.0 | 208.0 | 360.0 | 1.0 | Urban | Y |


In [None]:
# Setting Loan_ID as index
data.set_index("Loan_ID", inplace=True)

In [None]:
data.head(3)

| Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0.0 | 110.0 | 360.0 | 1.0 | Urban | Y |
| LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500.0 | 126.0 | 360.0 | 1.0 | Urban | Y |
| LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800.0 | 208.0 | 360.0 | 1.0 | Urban | Y |


In [None]:
# Selecting the numerical columns and computing their summary statistics
numerical_features = data.select_dtypes(include=['float64', 'int64']).columns
data[numerical_features].describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ApplicantIncome | 981.0 | 5179.795107 | 5695.104533 | 0.0 | 2875.0 | 3800.0 | 5516.0 | 81000.0 |
| CoapplicantIncome | 981.0 | 1601.91633 | 2718.772806 | 0.0 | 0.0 | 1110.0 | 2365.0 | 41667.0 |
| LoanAmount | 981.0 | 142.057085 | 76.395592 | 9.0 | 101.0 | 126.0 | 160.0 | 700.0 |
| Loan_Amount_Term | 981.0 | 342.56473 | 64.482011 | 6.0 | 360.0 | 360.0 | 360.0 | 480.0 |
| Credit_History | 981.0 | 0.849134 | 0.358101 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [None]:
# Checking the number of loan applicants by gender
pd.DataFrame(data["Gender"].value_counts())

| Gender | count |
|---|---|
| Male | 799 |
| Female | 182 |


In [None]:
# Checking the number of loan applicants by Education
pd.DataFrame(data["Education"].value_counts())

| Education | count |
|---|---|
| Graduate | 763 |
| Not Graduate | 218 |


In [None]:
# Checking the number of loan applicants by Property area
pd.DataFrame(data["Property_Area"].value_counts())

| Property_Area | count |
|---|---|
| Semiurban | 349 |
| Urban | 342 |
| Rural | 290 |


**Visualize distributions of numerical features (e.g., applicant income, loan amount) using histograms and box plots.**
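
One possible starting point for this step is a grid with a histogram and a box plot per numerical column. The sketch below assumes matplotlib is installed (the notebook so far imports only pandas and NumPy) and uses randomly generated, right-skewed values in a hypothetical frame `toy` instead of the real loan data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in values; not taken from the dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "ApplicantIncome": rng.lognormal(mean=8.3, sigma=0.6, size=200),
        "LoanAmount": rng.lognormal(mean=4.8, sigma=0.5, size=200),
    }
)

columns = ["ApplicantIncome", "LoanAmount"]
fig, axes = plt.subplots(nrows=len(columns), ncols=2, figsize=(10, 6))
for row, col in enumerate(columns):
    # Left panel: histogram shows the shape of the distribution.
    axes[row, 0].hist(toy[col], bins=30)
    axes[row, 0].set_title(f"{col}: histogram")
    # Right panel: box plot highlights the median, quartiles, and extreme values.
    axes[row, 1].boxplot(toy[col])
    axes[row, 1].set_title(f"{col}: box plot")
fig.tight_layout()
```

With the real data, `toy` would be replaced by `data` after cleaning. Income and loan amounts are often strongly skewed, so a log-scaled axis may make the histograms easier to read.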

**Analyze categorical features (e.g., education, employment status, property area) using bar charts and pie charts.**
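
As a sketch of this step, the block below charts the `Property_Area` counts reported earlier (349 Semiurban, 342 Urban, 290 Rural) as a bar chart and a pie chart. It assumes matplotlib is available. The counts are typed in by hand here so the example is self-contained. In the notebook they would come from `data["Property_Area"].value_counts()`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the Property_Area value_counts() output shown earlier.
area_counts = pd.Series({"Semiurban": 349, "Urban": 342, "Rural": 290}, name="count")

# Percentage share of each category, rounded to one decimal place.
area_share = (area_counts / area_counts.sum() * 100).round(1)

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))
area_counts.plot.bar(ax=ax_bar, rot=0)
ax_bar.set_ylabel("Applicants")
ax_bar.set_title("Applicants by property area")

ax_pie.pie(area_counts, labels=area_counts.index, autopct="%1.1f%%")
ax_pie.set_title("Share by property area")
fig.tight_layout()

print(area_share)
```

The three areas are close in size, so the bar chart is likely the clearer of the two. Pie charts make small differences between similar shares hard to see.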

**Examine relationships between features and the target variable (loan approval status) using scatter plots, correlation matrices, and cross-tabulations.**
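
For categorical-versus-target relationships, a row-normalized cross-tabulation gives approval rates per group directly, and `DataFrame.corr()` covers the numerical columns. The sketch below uses a small made-up frame `toy`. Its values are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the loan data.
toy = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
        "ApplicantIncome": [5800, 4600, 3000, 2600, 6000, 5400, 2300, 3100],
        "LoanAmount": [130.0, 125.0, 70.0, 120.0, 140.0, 260.0, 95.0, 160.0],
        "Loan_Status": ["Y", "N", "N", "Y", "N", "Y", "N", "Y"],
    }
)

# normalize="index" turns each row into proportions, i.e. the approval rate per group.
approval_by_credit = pd.crosstab(
    toy["Credit_History"], toy["Loan_Status"], normalize="index"
)

# Pearson correlation between the numerical columns.
corr = toy[["ApplicantIncome", "LoanAmount", "Credit_History"]].corr()

print(approval_by_credit)
print(corr.round(2))
```

The same two calls applied to `data` would show whether approval rates differ by credit history, and how strongly income and loan amount move together.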

**Identify trends, anomalies, and patterns that could impact loan outcomes.**
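
A common rule of thumb for flagging anomalies is the 1.5 × IQR fence. Using the `ApplicantIncome` quartiles from the summary table (25% = 2875, 75% = 5516), the IQR is 2641 and the upper fence is 5516 + 1.5 × 2641 = 9477.5, far below the reported maximum of 81000. The sketch below applies the rule to a small invented series. `iqr_bounds` is a hypothetical helper name, and the rule is one convention among several, not the notebook's prescribed method.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative incomes with one extreme value.
toy_income = pd.Series([2500, 3000, 3800, 4200, 5500, 81000], name="ApplicantIncome")

low, high = iqr_bounds(toy_income)
outliers = toy_income[(toy_income < low) | (toy_income > high)]

print(low, high)
print(outliers)
```

Flagged values are candidates for a closer look, not automatic deletions. A very high income may be a genuine applicant.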

### **Reporting and Insights**

**Summarize key findings and insights derived from the EDA.**

**Create comprehensive visualizations and dashboards to communicate your insights effectively.**