# Introduction and Problem Statement

Credit risk is the possibility that a borrower may fail to repay a loan. This poses a major challenge for banks and financial institutions, as loan defaults can lead to financial losses.

The objective of this project is to build a **machine learning model** that predicts whether a loan applicant is likely to **repay the loan** or **default**, based on factors such as income, employment status, credit history, and loan amount. The task is framed as **binary classification** on the loan decision: "Yes" (approved) or "No" (rejected).

This prediction model can help financial institutions:
- Reduce credit losses
- Make data-driven decisions
- Improve loan approval processes


# Dataset Understanding and Description

The dataset used in this project is sourced from **Kaggle’s Loan Prediction dataset**.

- **Total Records**: 614  
- **Target Variable**: `Loan_Status` (Y = Approved, N = Not Approved)

## Key Features

| Feature             | Description                                  |
|---------------------|----------------------------------------------|
| Gender              | Male / Female                                |
| Married             | Applicant marital status                     |
| Dependents          | Number of dependents                         |
| Education           | Graduate / Not Graduate                      |
| Self_Employed       | Employment type                              |
| ApplicantIncome     | Monthly income of applicant                  |
| CoapplicantIncome   | Monthly income of coapplicant                |
| LoanAmount          | Loan amount applied for                      |
| Loan_Amount_Term    | Duration of loan (in months)                 |
| Credit_History      | Record of previous loan repayment behavior   |
| Property_Area       | Urban / Rural / Semiurban                    |
| Loan_Status         | Target: Loan approved (Y) or not (N)         |

The dataset includes both categorical and numerical features. It contains some **missing values**, which were handled in Power BI during preprocessing, so the notebook loads an already cleaned file.
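
The Power BI cleaning steps themselves are not part of this notebook. For readers who prefer to stay in pandas, the following is a minimal sketch of one common approach, not the procedure that was actually used: count the missing values per column, then fill numeric columns with the median and categorical columns with the most frequent value. The toy values and the imputation strategy are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame using a few of the dataset's column names; the values are made up.
raw = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Count missing values per column.
missing_counts = raw.isna().sum()
print(missing_counts)

# Assumed strategy: median for numeric columns, most frequent value otherwise.
filled = raw.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Whether median and mode imputation suit the real data depends on how the values are missing, so the result should be compared against the cleaned file before relying on it.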


In [1]:
# Step 1: Import libraries
import pandas as pd

# Load cleaned dataset
df = pd.read_csv('/kaggle/input/cleaned-loan-prediction-dataset/Cleaned Loan Prediction Data.csv')

df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0,128,360,1,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0,141,360,1,Urban,Y


In [2]:
# Drop the Loan_ID column: it identifies the application and carries no predictive signal
df = df.drop(columns='Loan_ID', errors='ignore')

df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0,128,360,1,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,Y
4,Male,No,0,Graduate,No,6000,0,141,360,1,Urban,Y


In [3]:
# Encode the target as integers: Y (approved) -> 1, N (not approved) -> 0
if df['Loan_Status'].dtype == 'object':
    df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})

df.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0,128,360,1,Urban,1
1,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,0
2,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,1
3,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,1
4,Male,No,0,Graduate,No,6000,0,141,360,1,Urban,1


In [4]:
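# One-hot encode the categorical features; drop_first=True drops one level
# per feature so the dummy columns are not redundant
df = pd.get_dummies(df, drop_first=True)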
df.head()

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Gender_Male,Married_Yes,Dependents_1,Dependents_2,Dependents_3+,Education_Not Graduate,Self_Employed_Yes,Property_Area_Semiurban,Property_Area_Urban
0,5849,0,128,360,1,1,True,False,False,False,False,False,False,False,True
1,4583,1508,128,360,1,0,True,True,True,False,False,False,False,False,False
2,3000,0,66,360,1,1,True,True,False,False,False,False,True,False,True
3,2583,2358,120,360,1,1,True,True,False,False,False,True,False,False,True
4,6000,0,141,360,1,1,True,False,False,False,False,False,False,False,True
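
To see what `drop_first=True` does, here is a small sketch on a toy frame (the three rows are made up). pandas orders the category levels alphabetically and drops the first one, so `Rural` becomes the baseline and is represented by both remaining columns being false. This matches the `Property_Area_Semiurban` and `Property_Area_Urban` columns in the output above.

```python
import pandas as pd

# Toy frame with one categorical column; values are illustrative only.
toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"]})

# Levels are sorted (Rural, Semiurban, Urban) and the first is dropped.
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
print(encoded)
```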


In [5]:
# Split features and target
X = df.drop('Loan_Status', axis=1)
y = df['Loan_Status']

In [6]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
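
As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to a stand-in array with 614 entries, the record count stated in the dataset description. The array `rows` is a placeholder and not the real data. With a fractional `test_size`, scikit-learn rounds the test share up, so 20% of 614 gives 123 test rows and 491 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 614 rows of the dataset; only the row count matters here.
rows = np.arange(614)

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_rows), len(test_rows))
```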


In [7]:
from sklearn.linear_model import LogisticRegression

# Create and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)


In [8]:
# Predict on the held-out test set and evaluate
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred) * 100)
# Rows are actual classes, columns are predicted classes (0 = not approved, 1 = approved)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Accuracy: 78.86178861788618
Confusion Matrix:
 [[18 25]
 [ 1 79]]
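
Accuracy alone hides how the errors are distributed. The sketch below derives a few standard metrics directly from the confusion matrix printed above, using scikit-learn's convention that rows are actual classes and columns are predicted classes, with 1 meaning approved. The variable names are chosen here for illustration.

```python
# Cells of the confusion matrix reported above:
# [[TN FP]
#  [FN TP]]
tn, fp = 18, 25
fn, tp = 1, 79

total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all predictions that are correct
precision = tp / (tp + fp)        # predicted approvals that were truly approved
recall = tp / (tp + fn)           # actual approvals the model caught
specificity = tn / (tn + fp)      # actual rejections the model caught

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision:.4f}")
print(f"recall={recall:.4f}")
print(f"specificity={specificity:.4f}")
```

The model identifies 79 of the 80 approved applications but only 18 of the 43 rejected ones, so it leans strongly toward predicting approval. For a credit-risk use case the 25 false positives are the costly errors, since they correspond to applications that were rejected but that the model would approve.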
