## Table of Contents: Logistic Regression – Loan Approval Prediction

Last Edited: September 23rd, 2024

1. Loading the Dataset (`LoanDataset`)
2. Descriptive Analysis (`head()`, `columns`)
3. Data Cleaning (drop missing values, parse currency columns)
4. Feature Selection and Target Definition
5. Encoding the Target Label
6. Train–Test Split
7. Feature Scaling
8. Logistic Regression Model Training
9. Model Prediction on Test Set
10. Model Evaluation: Accuracy Score


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

from google.colab import drive
drive.mount("/content/drive")

path = "/content/drive/MyDrive/Kellton Tech/Model Code/dataset/LoanDataset - LoansDatasest.csv"
df = pd.read_csv(path)
df.head()

Mounted at /content/drive


Unnamed: 0,customer_id,customer_age,customer_income,home_ownership,employment_duration,loan_intent,loan_grade,loan_amnt,loan_int_rate,term_years,historical_default,cred_hist_length,Current_loan_status
0,1.0,22,59000,RENT,123.0,PERSONAL,C,"£35,000.00",16.02,10,Y,3,DEFAULT
1,2.0,21,9600,OWN,5.0,EDUCATION,A,"£1,000.00",11.14,1,,2,NO DEFAULT
2,3.0,25,9600,MORTGAGE,1.0,MEDICAL,B,"£5,500.00",12.87,5,N,3,DEFAULT
3,4.0,23,65500,RENT,4.0,MEDICAL,B,"£35,000.00",15.23,10,N,2,DEFAULT
4,5.0,24,54400,RENT,8.0,MEDICAL,B,"£35,000.00",14.27,10,Y,4,DEFAULT


In [None]:
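# Count missing values per column (a notebook only displays the cell's last expression)
df.isnull().sum()
# Drop every row that has at least one missing value
df_data = df.dropna()
df_data.head()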

Unnamed: 0,customer_id,customer_age,customer_income,home_ownership,employment_duration,loan_intent,loan_grade,loan_amnt,loan_int_rate,term_years,historical_default,cred_hist_length,Current_loan_status
0,1.0,22,59000,RENT,123.0,PERSONAL,C,"£35,000.00",16.02,10,Y,3,DEFAULT
2,3.0,25,9600,MORTGAGE,1.0,MEDICAL,B,"£5,500.00",12.87,5,N,3,DEFAULT
3,4.0,23,65500,RENT,4.0,MEDICAL,B,"£35,000.00",15.23,10,N,2,DEFAULT
4,5.0,24,54400,RENT,8.0,MEDICAL,B,"£35,000.00",14.27,10,Y,4,DEFAULT
5,6.0,21,9900,OWN,2.0,VENTURE,A,"£2,500.00",7.14,1,N,2,DEFAULT
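
Before dropping rows, it can help to see how much data `dropna()` removes. The sketch below uses a small made-up frame (the values and the name `toy` are illustrative, not taken from `LoanDataset`) to count missing values per column and compare row counts before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the loan dataset.
toy = pd.DataFrame({
    "loan_int_rate": [16.02, np.nan, 12.87, 15.23],
    "historical_default": ["Y", None, "N", "N"],
    "cred_hist_length": [3, 2, 3, 2],
})

missing_per_column = toy.isnull().sum()  # number of missing entries per column
toy_clean = toy.dropna()                 # drops any row with at least one missing value
rows_dropped = len(toy) - len(toy_clean)

print(missing_per_column)
print("Rows dropped:", rows_dropped)
```

If the number of dropped rows is large relative to the dataset, imputing values may be worth considering instead of dropping.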


In [None]:
df_data.columns

Index(['customer_id', 'customer_age', 'customer_income', 'home_ownership',
       'employment_duration', 'loan_intent', 'loan_grade', 'loan_amnt',
       'loan_int_rate', 'term_years', 'historical_default', 'cred_hist_length',
       'Current_loan_status'],
      dtype='object')

In [None]:
df_data = df_data.copy()  # explicit copy avoids SettingWithCopyWarning on the assignments below
# Strip the pound sign and thousands separators, then convert to float
df_data['customer_income'] = df_data['customer_income'].str.replace(r'[£,]', '', regex=True).astype(float)
df_data['loan_amnt'] = df_data['loan_amnt'].str.replace(r'[£,]', '', regex=True).astype(float)
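
The currency cleanup can also be wrapped in a small helper so the same logic applies to every money column. This is a minimal sketch: `parse_currency` is a hypothetical helper name, and the sample strings mirror the `loan_amnt` format shown in the output above. It uses `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into `NaN` instead of raising.

```python
import pandas as pd

def parse_currency(series: pd.Series) -> pd.Series:
    """Strip '£' and ',' from each entry and convert the result to float."""
    cleaned = series.astype(str).str.replace(r"[£,]", "", regex=True)
    # errors="coerce" maps anything that still is not a number to NaN
    return pd.to_numeric(cleaned, errors="coerce")

amounts = pd.Series(["£35,000.00", "£1,000.00", "£5,500.00"])
parsed = parse_currency(amounts)
print(parsed.tolist())
```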


In [None]:
# X = df_data[['customer_age', 'customer_income', 'employment_duration',
#              'loan_amnt', 'loan_int_rate',
#              'term_years', 'cred_hist_length']]
# y = df_data['Current_loan_status']

In [None]:
# Features used by the model and the target column
X = df_data[['customer_income', 'loan_amnt', 'loan_int_rate', 'cred_hist_length']]
y = df_data['Current_loan_status']

In [None]:
encoder = LabelEncoder()
y = encoder.fit_transform(y)
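
To read predictions later, it helps to know which integer each label received. `LabelEncoder` stores its classes in sorted order in `classes_`, so the index of a class is its code. The sketch below uses a short made-up label list with the two status values seen in the data.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels using the two values of Current_loan_status
labels = ["DEFAULT", "NO DEFAULT", "DEFAULT", "NO DEFAULT"]

enc = LabelEncoder()
encoded = enc.fit_transform(labels)

# classes_ is sorted, so position i holds the label encoded as integer i
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)
```

With these two labels, `"DEFAULT"` sorts first and is encoded as 0, so a predicted 1 means `"NO DEFAULT"`.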

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
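
If the two loan statuses are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. One option, shown here as a sketch on synthetic data (the 80/20 class ratio is an assumption for illustration only), is the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 samples of class 0 and 20 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```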

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
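
Scaling and model fitting can also be bundled into a single scikit-learn `Pipeline`, so the scaler fitted on the training data is applied automatically whenever the model predicts or scores. The following is a sketch on synthetic data, not the notebook's dataset; the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(50_000, 15_000, 200),  # income-like feature
    rng.normal(12, 3, 200),           # interest-rate-like feature
])
y_demo = (X_demo[:, 1] > 12).astype(int)  # label depends only on the second feature

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# The pipeline fits the scaler on the training data and reuses it at predict/score time
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(Xtr, ytr)
test_accuracy = pipe.score(Xte, yte)
print("Test accuracy:", test_accuracy)
```

Because the pipeline receives raw features, there is no separate scaled copy to keep in sync with the model.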

In [None]:
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

In [None]:
y_pred = log_reg.predict(X_test_scaled)

In [None]:
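# LogisticRegression.score returns mean accuracy, not R-squared. The model was fit on
# scaled features, so score it on X_train_scaled; scoring the unscaled X_train gives
# 0.42980531944063616, which does not reflect the fitted model.
print("Training accuracy:", log_reg.score(X_train_scaled, y_train))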


In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy score:", accuracy_score(y_test, y_pred))


Accuracy score: 0.686920370962584
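
Accuracy alone does not show which class the model gets wrong. A confusion matrix breaks the predictions down per class. The sketch below uses made-up labels (not the notebook's `y_test` and `y_pred`), with 0 standing for `"DEFAULT"` and 1 for `"NO DEFAULT"` as an assumption matching alphabetical label encoding.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 0 = "DEFAULT", 1 = "NO DEFAULT"
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print("Accuracy:", acc)
```

In this toy case the accuracy is 0.625, yet only one of the three true class-0 samples is predicted correctly, which the matrix makes visible.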
