
# Classification - Mini-Project 2

***Edit this cell with your name(s), tutorial number(s) and ID(s)***

---

Name: Zeina Ahmed Taha Ismail

ID: 58-25324

Tutorial: 4

---

Name: Lara Mohamed

ID: 58-7472

Tutorial: T-4

---


## Dataset Description

This dataset describes loan applicants. It records factors such as income, credit score, employment status, loan details, and other indicators of financial stability, together with the final decision on whether each loan was approved.

| Column | Description|
|-|-|
|ApplicationNumber|Unique identifier assigned to each loan application|
|Age|Applicant’s age in years|
|AnnualIncome|Applicant’s yearly income|
|CreditScore|A score representing the applicant’s creditworthiness|
|EmploymentStatus|Applicant’s current employment situation (Employed, Unemployed, Self-Employed)|
|EducationLevel|Highest educational qualification attained (Highschool, Bachelor, Master, Doctorate, Diploma)|
|LoanAmount|Total amount of money requested for the loan|
|LoanDuration|Duration of the loan in months|
|MaritalStatus|Applicant’s marital state (Divorced, Married, Single, Widowed)|
|NumberOfDependents|Number of individuals financially dependent on the applicant|
|HomeOwnershipStatus|Applicant’s housing status (Mortgage, Own, Rent, Other)|
|BankruptcyHistory|Indicates whether the applicant has previously declared bankruptcy (0 = No, 1 = Yes)|
|LoanPurpose|The primary reason for taking the loan (Debt Consolidation, Home Improvement, Education, Personal)|
|PreviousLoanDefaults|Indicates if the applicant has defaulted on any previous loans (0 = No, 1 = Yes)|
|MonthlyLoanPayment|Amount the applicant would need to pay monthly to repay the loan|
|MonthlyIncome|Average monthly income of the applicant|
|JobTenure|Number of years the applicant has been in their current job|
|LoanApproved|Indicates loan approval status (No = Not Approved, Yes = Approved)|

## Importing Libraries & Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
#plt.style.use("seaborn")

df = pd.read_csv('https://raw.githubusercontent.com/GUC-DM/W2025/refs/heads/main/data/loan_data.csv')
df.head()

## Data Inspection

In [4]:
df.info()  # Row/column counts, data types and non-null counts; helps detect missing values and incorrect data types
display(df.describe(include='all'))  # Summarises numerical and categorical columns: ranges, possible outliers and unusual entries
display(df.isnull().sum())  # Exact number of missing values per column; shows which columns need to be imputed or dropped

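
The inspection step can also report missing values as a share of all rows, which makes it easier to judge whether to impute or drop. The following is a minimal sketch on a toy frame: the values are made up, and only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset description.
toy = pd.DataFrame({
    "EmploymentStatus": ["Employed", None, "Self-Employed", "Employed"],
    "MonthlyIncome": [4000.0, np.nan, np.nan, 5200.0],
    "Age": [25, 40, 31, 58],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing)
```

Columns with a small percentage of missing values are usually candidates for imputation rather than removal.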

## Data Cleaning

In [3]:
# Clean the money columns
money_cols=["AnnualIncome","LoanAmount","MonthlyLoanPayment","MonthlyIncome"]

# Show the columns before cleaning them
print("Before cleaning Money columns")
display(df[money_cols].head())

# Remove the $ sign and the commas, then convert to float
for col in money_cols:
    df[col] = (
        df[col].astype(str)
        .str.replace('$', '', regex=False)
        .str.replace(',', '', regex=False)
        .replace('nan', np.nan)  # astype(str) turns missing values into the string 'nan'
        .astype(float)
    )

# Show the columns after cleaning
print("\n")
print("After cleaning Money columns")
display(df[money_cols].head())

# Next, detect outliers
# The columns checked for outliers are the same money columns
num_cols_to_check = ["AnnualIncome","LoanAmount","MonthlyLoanPayment","MonthlyIncome"]
print("\n")

# Visualize each column with a boxplot to reveal its outliers
plt.figure(figsize=(12,8))

for i, col in enumerate(num_cols_to_check,1):
    plt.subplot(2,2,i)
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col}")

plt.tight_layout()
plt.show()

print("\n")

# Imputation: inspection and cleaning revealed missing values and outliers, so impute the affected columns
# Show the missing-value counts before imputation
print("Before Imputation")
display(df.isnull().sum())

# Three columns have missing values
# 1) EmploymentStatus is categorical, so impute it with the mode
df["EmploymentStatus"] = df["EmploymentStatus"].fillna(df["EmploymentStatus"].mode()[0])

# 2) MonthlyLoanPayment is numerical and has outliers, so impute it with the median, which is less sensitive to outliers than the mean
df["MonthlyLoanPayment"] = df["MonthlyLoanPayment"].fillna(df["MonthlyLoanPayment"].median())

# 3) MonthlyIncome is numerical and also has outliers, so impute it with the median as well
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].median())

# Show the missing-value counts after imputation
print("\n")
print("After Imputation")
display(df.isnull().sum())

# After imputation, check the number of unique values per column to choose an encoding for each
print("\n")
print("number of unique values")
print("\n")
for col in df.columns:
    print(col, ":", df[col].nunique())
    print("=================================")

# List the distinct EducationLevel values so the ordinal ranking can be assigned correctly
print("\n")
print(df["EducationLevel"].unique())

# Drop two columns: ApplicationNumber (an ID column) and BankruptcyHistory (it contains only one unique value)
print('Before Dropping Columns :',df.shape)
df.drop(columns=['ApplicationNumber','BankruptcyHistory'],inplace=True)
print('After Dropping Columns :', df.shape)

# Ordinal-encode EducationLevel with an explicit rank mapping, since its categories are ordered
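
The encoding step announced above has no code in this cell, and the later plots treat `EducationLevel` as ranks from 0 to 4 and `LoanApproved` as 0/1. The following is a minimal sketch of one way to do this on toy rows. The ranking order is our assumption, not something the dataset states, and the label spellings follow the dataset description, so they should be checked against the output of `unique()` first.

```python
import pandas as pd

# Toy rows with made-up values; the category labels follow the dataset description.
toy = pd.DataFrame({
    "EducationLevel": ["Bachelor", "Highschool", "Doctorate", "Diploma", "Master"],
    "LoanApproved": ["Yes", "No", "Yes", "No", "Yes"],
})

# ASSUMPTION: this ranking (lowest to highest) is our choice, not stated in the source.
education_order = ["Highschool", "Diploma", "Bachelor", "Master", "Doctorate"]
education_rank = {level: rank for rank, level in enumerate(education_order)}

# Ordinal encoding: replace each label with its rank (0 = Highschool, 4 = Doctorate).
toy["EducationLevel"] = toy["EducationLevel"].map(education_rank)

# Binary encoding of the target, so that its mean can be read as an approval rate.
toy["LoanApproved"] = toy["LoanApproved"].map({"No": 0, "Yes": 1})

print(toy)
```

An explicit mapping is used because an alphabetical encoder would not respect the intended ranking. Any label missing from the mapping becomes `NaN`, which makes spelling mismatches easy to spot.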




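
The cleaning cell judges outliers visually from boxplots. As a complement, the sketch below walks one made-up column through the same three steps (parse currency strings, flag outliers, impute with the median) and counts outliers with the 1.5 × IQR rule that boxplot whiskers use by default. The data and the variable names are illustrative assumptions, and `pd.to_numeric` is swapped in here for the chained `astype` calls used in the notebook.

```python
import pandas as pd

# Made-up currency strings, including a missing entry and one extreme value.
raw = pd.Series(["$1,200.50", "$980", None, "$1,050", "$1,100", "$25,000"],
                name="MonthlyLoanPayment")

# Strip every character that is not a digit or a decimal point, then convert.
# errors="coerce" turns anything unparseable into NaN instead of raising.
clean = pd.to_numeric(raw.str.replace(r"[^0-9.]", "", regex=True), errors="coerce")

# 1.5 * IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
iqr = q3 - q1
outliers = (clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)

# The median barely moves when an extreme value is present, unlike the mean.
filled = clean.fillna(clean.median())

print("Outliers flagged:", int(outliers.sum()))
print("Mean:", clean.mean(), "Median:", clean.median())
print(filled)
```

Note that the regular expression also strips a minus sign, so this sketch assumes the amounts are never negative.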

## Exploratory Data Analysis

**Q1: On average, which type of educational level has the highest approval rate? Show their order on the graph.**

**Visualization**

In [1]:
# Justification: a bar plot suits comparing one numeric summary (approval rate) across categorical groups (education level).
# We group by EducationLevel and take the mean of 'LoanApproved'; this requires LoanApproved to be coded 0/1, so the mean is the approval rate.
# Sorting the rates makes the order of the education levels visible on the graph.

approval_rate = (df.groupby('EducationLevel')['LoanApproved'].mean()
                 .sort_values(ascending=False).reset_index())

plt.figure(figsize=(8, 5))
sns.barplot(x='EducationLevel', y='LoanApproved', data=approval_rate,
            order=approval_rate['EducationLevel'], palette='viridis')
plt.title('Average Loan Approval Rate by Education Level')
plt.ylabel('Approval Rate (0 to 1)')
plt.xlabel('Education Level (0=HS, 4=Doctorate)')
plt.ylim(0, 1) # Fix the y-axis to the full 0-1 range so the rates are comparable
plt.show()

# OBSERVATION:
# (Fill in based on the graph, e.g., "Doctorate holders appear to have a slightly higher approval rate...")


**Answer for Q1**: Your answer here
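
To back the answer with numbers rather than bar heights alone, the approval rates can be printed in sorted order. The following is a minimal sketch on made-up, already encoded data. It says nothing about the real dataset; it only shows the computation.

```python
import pandas as pd

# Made-up encoded data: EducationLevel as ranks, LoanApproved as 0/1.
toy = pd.DataFrame({
    "EducationLevel": [0, 0, 1, 1, 2, 2, 2, 2],
    "LoanApproved":   [0, 1, 0, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group is the approval rate; sort from highest to lowest.
rates = (toy.groupby("EducationLevel")["LoanApproved"]
            .mean()
            .sort_values(ascending=False))

print(rates)
print("Highest approval rate:", rates.idxmax())
```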

**Q2: How does the annual income vary among approved applicants? Interpret the values of the 3 quartiles.**

**Visualization**

In [None]:
# Justification: A Box Plot is the standard for visualizing the distribution and spread (quartiles)
# of a numerical variable (Income) split by a categorical variable (Approval Status).

plt.figure(figsize=(8, 6))
sns.boxplot(x='LoanApproved', y='AnnualIncome', data=df)
plt.title('Annual Income Distribution: Denied (0) vs Approved (1)')
plt.xlabel('Loan Approved')
plt.ylabel('Annual Income')
plt.show()

# INTERPRETATION:
# The box spans the first quartile (Q1) to the third quartile (Q3), i.e. the Interquartile Range (IQR). The line inside is the median (Q2).
# If the box for '1' sits higher than the box for '0', approved applicants generally earn more.

**Answer for Q2**: Your answer here
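
The three quartiles asked for in Q2 can be read off exactly instead of being estimated from the boxplot. The following is a minimal sketch on made-up incomes, assuming `LoanApproved` is coded 0/1. The numbers are illustrative only.

```python
import pandas as pd

# Made-up incomes; LoanApproved is assumed to be coded 0/1.
toy = pd.DataFrame({
    "AnnualIncome": [30000, 42000, 55000, 61000, 75000, 90000, 38000, 47000],
    "LoanApproved": [0, 0, 1, 1, 1, 1, 0, 1],
})

# Keep only approved applicants, then compute Q1, Q2 (median) and Q3.
approved_income = toy.loc[toy["LoanApproved"] == 1, "AnnualIncome"]
q1, q2, q3 = approved_income.quantile([0.25, 0.50, 0.75])

print(f"Q1 = {q1}: 25% of approved applicants earn this much or less")
print(f"Q2 = {q2}: the median, half earn less and half earn more")
print(f"Q3 = {q3}: 75% of approved applicants earn this much or less")
print(f"IQR = {q3 - q1}: spread of the middle 50%")
```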

**Q3: How does the age of an applicant affect their credit score? (Hint: Use the line of best fit.)**

**Visualization**

In [None]:
# Justification: A Scatter Plot with a Regression Line (regplot) is best for checking
# correlations between two continuous numerical variables (Age and Credit Score).

plt.figure(figsize=(10, 6))
# We reduce alpha (transparency) to see density of points
sns.regplot(x='Age', y='CreditScore', data=df, scatter_kws={'alpha':0.3}, line_kws={'color':'red'})
plt.title('Correlation between Age and Credit Score')
plt.show()

print(f"Correlation Coefficient: {df['Age'].corr(df['CreditScore'])}")

# OBSERVATION:
# If the red line is flat and correlation is near 0, there is no linear relationship.

**Answer for Q3**: Your answer here
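
The line of best fit drawn by `regplot` can also be expressed as a slope and an intercept, which makes the answer to Q3 quantitative. The following is a minimal sketch using `np.polyfit` on made-up points that lie exactly on a line. It illustrates the computation only and implies nothing about the real relationship between age and credit score.

```python
import numpy as np

# Made-up points that lie exactly on credit_score = 2 * age + 500.
age = np.array([20, 30, 40, 50, 60], dtype=float)
credit_score = 2.0 * age + 500.0

# Degree-1 least-squares fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(age, credit_score, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(age, credit_score)[0, 1]

print(f"slope = {slope:.3f} (change in credit score per extra year of age)")
print(f"intercept = {intercept:.3f}")
print(f"r = {r:.3f}")
```

A slope close to zero together with a correlation near zero would indicate that age has little linear effect on credit score.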

**Q4: Is the distribution of applicants' income per month normal or skewed?**

**Visualization**

In [None]:
# Justification: A Histogram with a Kernel Density Estimate (KDE) shows the shape of a single
# numerical variable, which is what we need to judge whether it looks Normal or Skewed.

plt.figure(figsize=(10, 6))
sns.histplot(df['MonthlyIncome'], kde=True, bins=30, color='blue')
plt.title('Distribution of Monthly Income')
plt.show()

# OBSERVATION:
# If the tail extends to the right, it is Right-Skewed (Positively Skewed).
# If it looks like a bell curve, it is Normal.

**Answer for Q4**: Your answer here
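
The visual judgement from the histogram can be supported by two simple numeric checks: the sample skewness and the position of the mean relative to the median. The following is a minimal sketch on made-up incomes with one large value. The data are illustrative only.

```python
import pandas as pd

# Made-up monthly incomes with a long right tail.
toy_income = pd.Series([2000, 2200, 2500, 2700, 3000, 3200, 3500, 12000], dtype=float)

# Sample skewness: about 0 for a symmetric distribution, > 0 for right skew, < 0 for left skew.
skewness = toy_income.skew()
mean, median = toy_income.mean(), toy_income.median()

print(f"skewness = {skewness:.2f}")
print(f"mean = {mean}, median = {median}")
if skewness > 0 and mean > median:
    print("Right-skewed: the high values pull the mean above the median.")
```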

## Data Preparation for Modelling

## Modelling

## Evaluation

## Bonus (Optional)