<style>
    .title { color: #2E86C1; font-size: 2.5em; text-align: center; font-weight: bold; }
    .subtitle { color: #2874A6; font-size: 1.8em; font-weight: bold; }
    .section { color: #1B4F72; font-size: 1.4em; font-weight: bold; }
    .highlight { background-color: #D5F5E3; padding: 4px 8px; border-radius: 4px; }
</style>

<div class="title">🏦 Loan Approval Prediction Using Logistic Regression</div>

<div class="subtitle">📌 Project Overview</div>
<p>
    Loan approval is a critical process for financial institutions. To streamline decision-making, I am building a 
    <span class="highlight">logistic regression model</span> that predicts whether a loan application will be approved based on key features such as:
</p>

<ul>
    <li>✔ <b>Applicant Income</b></li>
    <li>✔ <b>Coapplicant Income</b></li>
    <li>✔ <b>Loan Amount & Term</b></li>
    <li>✔ <b>Credit History</b></li>
</ul>

<div class="subtitle">🎯 Objective</div>
<p>
    The goal is to develop a <b>data-driven model</b> that helps in making accurate loan approval predictions. 
    This project follows a structured workflow:
</p>

<ul>
    <li>🔹 <b>Data Cleaning & Preprocessing</b></li>
    <li>🔹 <b>Exploratory Data Analysis (EDA)</b></li>
    <li>🔹 <b>Feature Engineering</b></li>
    <li>🔹 <b>Model Building & Evaluation</b></li>
</ul>

<p>
    By the end of this project, I aim to achieve a well-tuned logistic regression model that balances <b>accuracy</b> and <b>interpretability</b>.
</p>

<div class="subtitle">📂 Dataset</div>
<p>
    The dataset used in this project can be accessed from 
    <a href="https://www.kaggle.com/datasets/ninzaami/loan-predication" target="_blank"><b>this Kaggle link</b></a>.
</p>


## 📥 Loading the Dataset  

To begin, I load the **Loan Approval Dataset** into a Pandas DataFrame. The dataset contains **applicant details, income information, loan attributes, and approval status**. Here’s a glimpse of the data:

🔹 **Categorical Features**: Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status  
🔹 **Numerical Features**: ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History (stored as 0/1 and treated as categorical later on)  

Before proceeding, I'll perform **data cleaning** to handle missing values and inconsistencies. 🚀


In [1]:
import pandas as pd
df = pd.read_csv('loan_approval.csv') 
print(df.head()) 

    Loan_ID Gender Married Dependents     Education Self_Employed  \
0  LP001002   Male      No          0      Graduate            No   
1  LP001003   Male     Yes          1      Graduate            No   
2  LP001005   Male     Yes          0      Graduate           Yes   
3  LP001006   Male     Yes          0  Not Graduate            No   
4  LP001008   Male      No          0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2             1.0   

## 🔍 Data Inspection & Summary

Before diving deeper, it's important to examine the dataset's structure and check for any missing or outlier values. I'll first review:

1. **Dataset Info** – To check the data types and non-null counts.
2. **Missing Values** – To see if there are any missing entries in the columns.
3. **Statistical Summary** – To understand the distribution of numerical features.
4. **Unique Values in 'Gender' Column** – To check the categories of gender for potential encoding.

This step will help me identify issues like missing values, incorrect data types, or outliers, which need to be addressed for further analysis. 🚀


In [2]:
print(df.info())
print(df.isnull().sum())
print(df.describe())
print(df['Gender'].unique())  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
None
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education  

## 🧹 Handling Missing Data  

The dataset contains missing values in seven columns. So that they do not break or distort model training, I'll fill them with a simple rule per column:

- **'LoanAmount'** (numerical) will be filled with the **median**, which is less sensitive to extreme values than the mean.
- **'Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History', and 'Loan_Amount_Term'** will be filled with the **mode** (most frequent value). The last two are stored as numbers but take only a small set of distinct values, so the mode is a reasonable fill.

After filling in the missing values, I'll verify that no more missing data exists in the dataset. This ensures that the model will not encounter any issues when training.  


In [3]:
print(df.isnull().sum())

# Assign the filled column back rather than calling fillna(..., inplace=True)
# on a column selection; the chained inplace form is deprecated in pandas.
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

# Categorical and few-valued columns: fill with the most frequent value
mode_columns = ['Gender', 'Married', 'Dependents', 'Self_Employed',
                'Credit_History', 'Loan_Amount_Term']
for col in mode_columns:
    df[col] = df[col].fillna(df[col].mode()[0])
print(df.isnull().sum())


Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['LoanAmount'].fillna(df['LoanAmount'].median(), inplace=True)

## 🚨 Handling Outliers in 'LoanAmount'

Outliers can significantly impact the performance of machine learning models, especially in regression tasks. Therefore, I will address outliers in the **'LoanAmount'** column using the **Interquartile Range (IQR)** method.

The steps involved:
1. **Calculate the first (Q1) and third (Q3) quartiles** to determine the IQR.
2. **Define the lower and upper bounds** for normal values.
3. **Cap values** that fall outside the bounds to the closest limit.

Capping keeps every row and leaves the quartiles unchanged; only the extreme tails are pulled in to the bounds.

After applying this, I'll check the distribution of 'LoanAmount' again to confirm the adjustments. 📊
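
The three steps above can be wrapped in a small helper, since the same capping logic is reused later in the notebook. This is a minimal sketch: the function name `cap_iqr` and the toy values are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def cap_iqr(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data (illustrative only): one obvious outlier at 700.
toy = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 700.0])
capped = cap_iqr(toy)
print(capped.tolist())
```

Here Q1 is 112.5 and Q3 is 137.5, so the upper bound is 175 and only the value 700 is changed.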


In [4]:
Q1 = df['LoanAmount'].quantile(0.25)
Q3 = df['LoanAmount'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap values outside the IQR bounds (same result as the element-wise lambda)
df['LoanAmount'] = df['LoanAmount'].clip(lower=lower_bound, upper=upper_bound)

print(df['LoanAmount'].describe())


count    614.000000
mean     137.365635
std       55.779749
min        9.000000
25%      100.250000
50%      128.000000
75%      164.750000
max      261.500000
Name: LoanAmount, dtype: float64


## 🔢 Encoding Categorical Variables

To ensure the machine learning model can process categorical features, I will **encode** them into numerical format. This is done using **one-hot encoding**, which creates a binary column for each category. I will **drop the first category** of each column, because a full set of indicator columns always sums to one and is therefore perfectly collinear with the intercept (the "dummy variable trap").

Columns to be encoded:
- **'Gender'**, **'Married'**, **'Education'**, **'Self_Employed'**, and **'Property_Area'**

By applying this transformation, the model will be able to interpret categorical variables as numerical input without any assumptions of ordinal relationships.  

After encoding, I will check the updated dataset to confirm the changes. 📊
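
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy `Property_Area` column (the four rows are made up). The alphabetically first category, `Rural`, becomes the baseline and is represented by all indicator columns being false.

```python
import pandas as pd

# Toy frame with the three Property_Area categories (illustrative rows only)
toy = pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban', 'Urban']})

encoded = pd.get_dummies(toy, columns=['Property_Area'], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```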


In [5]:
df = pd.get_dummies(df, columns=['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area'], drop_first=True)

print(df.head())


    Loan_ID Dependents  ApplicantIncome  CoapplicantIncome  LoanAmount  \
0  LP001002          0             5849                0.0       128.0   
1  LP001003          1             4583             1508.0       128.0   
2  LP001005          0             3000                0.0        66.0   
3  LP001006          0             2583             2358.0       120.0   
4  LP001008          0             6000                0.0       141.0   

   Loan_Amount_Term  Credit_History Loan_Status  Gender_Male  Married_Yes  \
0             360.0             1.0           Y         True        False   
1             360.0             1.0           N         True         True   
2             360.0             1.0           Y         True         True   
3             360.0             1.0           Y         True         True   
4             360.0             1.0           Y         True        False   

   Education_Not Graduate  Self_Employed_Yes  Property_Area_Semiurban  \
0                  

## 💾 Saving the Cleaned Dataset

After handling missing values, outliers, and encoding categorical variables, the dataset is now ready for use in training models. To preserve these changes, I will **save the cleaned dataset** to a new CSV file called **'cleaned_loan_prediction.csv'**. This will allow me to easily load the dataset for future analysis or model training.

The file will not include the index, keeping the structure neat and consistent with the original data.  


In [6]:
df.to_csv('cleaned_loan_prediction.csv', index=False)

## 📊 Exploratory Data Analysis (EDA)

Now that the data is cleaned, it's important to **perform some initial exploratory analysis** to understand the dataset better.

1. **Descriptive Statistics**: I will start by looking at the summary statistics for numerical columns, such as **ApplicantIncome**, **LoanAmount**, and **Credit_History**. This helps in understanding the central tendency, spread, and any potential anomalies.
   
2. **Frequency Distribution**: I will also display the frequency distribution for categorical columns, specifically the **'Loan_Status'** (target variable) and **'Credit_History'**, to see how balanced the target is and the distribution of other key features.

### Key Observations:
- The **Loan_Status** column has **422 'Y'** (approved) and **192 'N'** (not approved) entries, so roughly 69% of applications are approved. The classes are moderately imbalanced, which matters when reading accuracy later.
- The **Credit_History** column has 525 entries of 1.0 (positive credit history) and 89 of 0.0 (negative credit history). The 525 includes the 50 missing values that were filled with the mode.

This analysis gives a good foundation for deciding the next steps in model building. 🚀


In [8]:
import pandas as pd

df = pd.read_csv('cleaned_loan_prediction.csv')

print(df.describe())

print(df['Loan_Status'].value_counts())  # Target variable
print(df['Credit_History'].value_counts())  # Example for another categorical column


       ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
count       614.000000         614.000000  614.000000        614.000000   
mean       5403.459283        1621.245798  137.365635        342.410423   
std        6109.041673        2926.248369   55.779749         64.428629   
min         150.000000           0.000000    9.000000         12.000000   
25%        2877.500000           0.000000  100.250000        360.000000   
50%        3812.500000        1188.500000  128.000000        360.000000   
75%        5795.000000        2297.250000  164.750000        360.000000   
max       81000.000000       41667.000000  261.500000        480.000000   

       Credit_History  
count      614.000000  
mean         0.855049  
std          0.352339  
min          0.000000  
25%          1.000000  
50%          1.000000  
75%          1.000000  
max          1.000000  
Loan_Status
Y    422
N    192
Name: count, dtype: int64
Credit_History
1.0    525
0.0     89
Name: count, dt

## 📊 Statistical Testing: Chi-square Test

Next, I will perform a **Chi-square test** to analyze the relationship between **Credit_History** (a categorical feature) and **Loan_Status** (the target variable). The goal is to assess whether there is a significant association between the applicant's credit history and the loan approval status.

### **Chi-square Test Results:**
- **Chi2 Statistic**: 176.11 — This is the test statistic calculated from the observed and expected frequencies.
- **P-value**: 3.42e-40 — A very low p-value suggests strong evidence against the null hypothesis, meaning there is a statistically significant relationship between **Credit_History** and **Loan_Status**.
- **Degrees of Freedom**: 1 — The degrees of freedom for this test, calculated as the number of categories in **Credit_History** minus one, multiplied by the number of categories in **Loan_Status** minus one.
- **Expected Frequencies**: A table showing the expected counts for each combination of categories.

Since the p-value is far below 0.05, I reject the null hypothesis of independence: **Credit_History** and **Loan_Status** are significantly associated. The test shows association rather than causation, and it is computed on data that includes the imputed credit-history values.

This statistical test helps provide insight into which features are worth considering in the model. 🚀
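
To make the "expected frequencies" and "degrees of freedom" concrete, here is a minimal sketch that rebuilds them from the marginal totals alone. The totals are taken from the value counts printed in the EDA step (89 and 525 for `Credit_History`, 192 and 422 for `Loan_Status`); the variable names are illustrative.

```python
import numpy as np

# Marginal totals from the value counts printed in the EDA step
row_totals = np.array([89.0, 525.0])    # Credit_History: 0.0, 1.0
col_totals = np.array([192.0, 422.0])   # Loan_Status: N, Y
n = row_totals.sum()                    # 614 applications

# Under independence, expected count = row total * column total / n
expected = np.outer(row_totals, col_totals) / n
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print(expected)
print(dof)
```

For a 2×2 table like this one, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the reported statistic is slightly smaller than the uncorrected Pearson value.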


In [9]:
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['Credit_History'], df['Loan_Status'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print("Chi-square Test Results:")
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(expected)


Chi-square Test Results:
Chi2 Statistic: 176.1145746235241
P-value: 3.4183499979091188e-40
Degrees of Freedom: 1
Expected Frequencies Table:
[[ 27.83061889  61.16938111]
 [164.16938111 360.83061889]]


## 🧑‍💼 T-test for Independent Samples: ApplicantIncome

Now, I will conduct a **t-test** to compare the **ApplicantIncome** between the approved and rejected loan groups. The t-test helps assess whether there is a significant difference in the mean income between the two groups.

### **T-test Results for ApplicantIncome:**
- **t-statistic**: -0.1165 — The t-statistic is close to zero, indicating a minimal difference between the two groups.
- **p-value**: 0.9073 — Since the p-value is much greater than 0.05, I fail to reject the null hypothesis. This suggests that there is **no significant difference** in the average income between approved and rejected loan applicants.

### **Conclusion:**
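The lack of a significant result implies that **ApplicantIncome** alone may not be a strong predictor for loan approval. The income distribution is heavily right-skewed (median 3,812.5 vs. maximum 81,000), so a comparison of means can be dominated by a few very large incomes; a log transform or a combined income feature may be worth trying for better model performance.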
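
By default `ttest_ind` assumes equal variances in both groups. As a hedged sketch of two common alternatives for skewed data with unequal spread, the block below runs Welch's t-test (`equal_var=False`) and a Mann-Whitney U test. The data are synthetic log-normal draws, not the loan data, so the printed p-values say nothing about this dataset.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(42)
# Synthetic, right-skewed "incomes" (illustrative only)
group_a = rng.lognormal(mean=8.3, sigma=0.6, size=400)
group_b = rng.lognormal(mean=8.3, sigma=0.9, size=200)

student = ttest_ind(group_a, group_b)                  # assumes equal variances
welch = ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
rank_test = mannwhitneyu(group_a, group_b, alternative='two-sided')

print(f"Student p={student.pvalue:.4f}")
print(f"Welch   p={welch.pvalue:.4f}")
print(f"Mann-Whitney p={rank_test.pvalue:.4f}")
```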


In [10]:
from scipy.stats import ttest_ind

income_approved = df[df['Loan_Status'] == 'Y']['ApplicantIncome']
income_rejected = df[df['Loan_Status'] == 'N']['ApplicantIncome']

t_stat, p_value = ttest_ind(income_approved, income_rejected)

print("t-test Results for ApplicantIncome:")
print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")

t-test Results for ApplicantIncome:
t-statistic: -0.11650844828724542
p-value: 0.907287812130518


### Variance Inflation Factor (VIF) Calculation

In this step, I calculate the Variance Inflation Factor (VIF) for each independent variable. VIF measures the extent to which the variance of an estimated regression coefficient increases due to collinearity with other predictors. A higher VIF indicates high multicollinearity.

Here, I focus on the following variables: 
- `ApplicantIncome`
- `CoapplicantIncome`
- `LoanAmount`
- `Loan_Amount_Term`
- `Credit_History`

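Typically, a VIF above 5-10 suggests problematic multicollinearity, which mainly inflates the variance of the coefficient estimates and makes them harder to interpret; predictive accuracy is usually affected less. If necessary, I will consider transforming or removing variables with high VIF values.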
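
The definition can be written out directly: the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on the other features. The sketch below does this with an intercept in each regression, using only NumPy and synthetic data. As far as I know, `variance_inflation_factor` in statsmodels does not add an intercept itself, so calling it on raw columns without a constant column can give larger values than this version. The helper name `vif_with_intercept` is hypothetical.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: regress each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data: c is nearly a copy of a, b is unrelated to both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

toy_vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(toy_vifs)
```

The two nearly identical columns get a large VIF, while the unrelated column stays close to 1.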


In [11]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

independent_vars = df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']]

vif_data = pd.DataFrame()
vif_data['Variable'] = independent_vars.columns
vif_data['VIF'] = [variance_inflation_factor(independent_vars.values, i) for i in range(independent_vars.shape[1])]

print("VIF Results:")
print(vif_data)


VIF Results:
            Variable       VIF
0    ApplicantIncome  2.321698
1  CoapplicantIncome  1.456320
2         LoanAmount  9.008554
3   Loan_Amount_Term  9.639508
4     Credit_History  5.871926


### Handling Multicollinearity by Removing Highly Correlated Variables

After analyzing the Variance Inflation Factor (VIF) values, it was observed that `LoanAmount` (9.01) and `Loan_Amount_Term` (9.64) have the highest VIFs, indicating multicollinearity. To address this, we will remove the `Loan_Amount_Term` variable and recalculate the VIF for the remaining features. This lowers the VIF of `LoanAmount` to 5.98 and of `Credit_History` to 3.79, which makes the coefficient estimates more stable.

#### Code to Remove `Loan_Amount_Term` and Recalculate VIF


In [12]:
df = df.drop(columns=['Loan_Amount_Term'])

independent_vars = df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Credit_History']]
vif_data = pd.DataFrame()
vif_data['Variable'] = independent_vars.columns
vif_data['VIF'] = [variance_inflation_factor(independent_vars.values, i) for i in range(independent_vars.shape[1])]

print("Updated VIF Results after removing Loan_Amount_Term:")
print(vif_data)


Updated VIF Results after removing Loan_Amount_Term:
            Variable       VIF
0    ApplicantIncome  2.303778
1  CoapplicantIncome  1.450722
2         LoanAmount  5.975299
3     Credit_History  3.791828


### Adding "Monthly Loan Payment" Feature

We will add a new feature **"Monthly Loan Payment"** based on a standard loan payment formula. The formula for monthly payments is:

\[
M = \frac{P \times r \times (1 + r)^n}{(1 + r)^n - 1}
\]

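Where:
- **M** is the monthly payment.
- **P** is the loan amount.
- **r** is the monthly interest rate (annual rate divided by 12). The code assumes an annual rate of 10%, since the dataset has no interest-rate column.
- **n** is the number of payments (loan term in months). The code assumes a fixed 360 months for every loan, because `Loan_Amount_Term` was dropped in the previous step.

With fixed **r** and **n**, the payment is a constant multiple of `LoanAmount` (about 0.00878 × `LoanAmount`), so the new feature is perfectly collinear with `LoanAmount` and adds no new information for a linear model. Each loan's own term would be needed for the feature to carry extra signal.

We will also apply the IQR cap to the newly created feature. Because `LoanAmount` was already capped, this step leaves the values unchanged.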
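
As a worked check of the formula, the sketch below computes the payment for a loan amount of 128 at the notebook's assumed 10% annual rate. The function name `monthly_payment` is illustrative, and the 180-month case is included only to show how the term changes the result.

```python
def monthly_payment(principal: float, annual_rate: float, n_months: int) -> float:
    """Fixed-rate amortized payment: M = P * r * (1 + r)^n / ((1 + r)^n - 1)."""
    r = annual_rate / 12              # monthly interest rate
    growth = (1 + r) ** n_months      # (1 + r)^n
    return principal * r * growth / (growth - 1)

payment_360 = monthly_payment(128.0, 0.10, 360)   # 30-year term
payment_180 = monthly_payment(128.0, 0.10, 180)   # 15-year term, for comparison

print(round(payment_360, 6))
print(round(payment_180, 6))
```

The 360-month result matches the 1.123292 printed for `LoanAmount` 128.0 in the cell output; a shorter term gives a higher payment for the same amount.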


In [14]:
annual_interest_rate = 0.10
monthly_interest_rate = annual_interest_rate / 12
loan_term = 360

df['Monthly_Loan_Payment'] = df['LoanAmount'] * monthly_interest_rate * (1 + monthly_interest_rate) ** loan_term / ((1 + monthly_interest_rate) ** loan_term - 1)

print(df[['LoanAmount', 'Monthly_Loan_Payment']].head())

Q1 = df['Monthly_Loan_Payment'].quantile(0.25)
Q3 = df['Monthly_Loan_Payment'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap values outside the IQR bounds (same result as the element-wise lambda)
df['Monthly_Loan_Payment'] = df['Monthly_Loan_Payment'].clip(lower=lower_bound, upper=upper_bound)

print("Cleaned Monthly Loan Payment Feature:")
print(df['Monthly_Loan_Payment'].head())


   LoanAmount  Monthly_Loan_Payment
0       128.0              1.123292
1       128.0              1.123292
2        66.0              0.579197
3       120.0              1.053086
4       141.0              1.237376
Cleaned Monthly Loan Payment Feature:
0    1.123292
1    1.123292
2    0.579197
3    1.053086
4    1.237376
Name: Monthly_Loan_Payment, dtype: float64


## Adding the "Monthly Loan Payment" feature and Handling Outliers

This cell repeats the previous step with shorter variable names (`r` and `n`): it recomputes "Monthly Loan Payment" with the same assumed annual interest rate of 10% and the same fixed loan term of 30 years (360 months), so the column is overwritten with identical values.

The Interquartile Range (IQR) cap is then applied again to keep the values within the upper and lower bounds.

The formula for the monthly payment is:

\[
M = \frac{P \cdot r \cdot (1 + r)^n}{(1 + r)^n - 1}
\]

Where:
- \( P \) is the loan amount
- \( r \) is the monthly interest rate (annual rate / 12)
- \( n \) is the loan term in months (360 months for a 30-year term)


In [15]:
annual_interest_rate = 0.1
r = annual_interest_rate / 12
n = 360

df['Monthly_Loan_Payment'] = (df['LoanAmount'] * r * (1 + r)**n) / ((1 + r)**n - 1)

Q1 = df['Monthly_Loan_Payment'].quantile(0.25)
Q3 = df['Monthly_Loan_Payment'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap values outside the IQR bounds (same result as the element-wise lambda)
df['Monthly_Loan_Payment'] = df['Monthly_Loan_Payment'].clip(lower=lower_bound, upper=upper_bound)

print(df[['LoanAmount', 'Monthly_Loan_Payment']].head())


   LoanAmount  Monthly_Loan_Payment
0       128.0              1.123292
1       128.0              1.123292
2        66.0              0.579197
3       120.0              1.053086
4       141.0              1.237376


In [16]:
df['Loan_Status'] = df['Loan_Status'].map({'Y': 1, 'N': 0})

print(df['Loan_Status'].head())


0    1
1    0
2    1
3    1
4    1
Name: Loan_Status, dtype: int64


## Splitting the Data into Training and Testing Sets

Now that we have preprocessed and encoded the data, the next step is to split it into training and testing sets. This will allow us to train our machine learning model on one subset of the data and evaluate its performance on another.

We'll use the `train_test_split` function from `sklearn.model_selection` to split the data, with 80% for training and 20% for testing.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Loan_Status'])
y = df['Loan_Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
```


In [26]:
# Display the first few rows of the dataframe to inspect the data
print("First few rows of the dataframe:")
print(df.head())

# Check for missing values in each column
print("\nMissing values in each column:")
print(df.isnull().sum())

# Check the data types of each column
print("\nData types of each column:")
print(df.dtypes)

# Check for any non-numeric values in the dataset
print("\nUnique values in each column (for non-numeric checks):")
for column in df.select_dtypes(include=['object']).columns:
    print(f"{column}: {df[column].unique()}")


First few rows of the dataframe:
  Dependents  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Status  \
0          0             5849                0.0       128.0            1   
1          1             4583             1508.0       128.0            0   
2          0             3000                0.0        66.0            1   
3          0             2583             2358.0       120.0            1   
4          0             6000                0.0       141.0            1   

   Gender_Male  Married_Yes  Education_Not Graduate  Self_Employed_Yes  \
0         True        False                   False              False   
1         True         True                   False              False   
2         True         True                   False               True   
3         True         True                    True              False   
4         True        False                   False              False   

   Property_Area_Semiurban  Property_Area_Urban  Monthly_Lo

In [27]:
# Replace '3+' with '3' in the 'Dependents' column
df['Dependents'] = df['Dependents'].replace('3+', '3')

# Convert 'Dependents' column to numeric type
df['Dependents'] = pd.to_numeric(df['Dependents'])

# Check the unique values again to ensure everything is now numeric
print("\nUnique values in 'Dependents' column after replacement:")
print(df['Dependents'].unique())



Unique values in 'Dependents' column after replacement:
[0 1 2 3]


In [28]:
# Check for any remaining categorical variables
print("\nData types after handling 'Dependents' column:")
print(df.dtypes)

# Convert categorical columns (binary variables) to numeric using one-hot encoding
df = pd.get_dummies(df, drop_first=True)

# Verify the transformation
print("\nColumns after one-hot encoding:")
print(df.columns)

# Split the data into features (X) and target variable (y)
X = df.drop(columns=['Loan_Status'])
y = df['Loan_Status']

# Split the data into training and test sets (80-20 split)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the train and test sets to confirm the split
print("\nShapes of train and test sets:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")



Data types after handling 'Dependents' column:
Dependents                   int64
ApplicantIncome              int64
CoapplicantIncome          float64
LoanAmount                 float64
Loan_Status                  int64
Gender_Male                   bool
Married_Yes                   bool
Education_Not Graduate        bool
Self_Employed_Yes             bool
Property_Area_Semiurban       bool
Property_Area_Urban           bool
Monthly_Loan_Payment       float64
Credit_History_1.0            bool
dtype: object

Columns after one-hot encoding:
Index(['Dependents', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Status', 'Gender_Male', 'Married_Yes', 'Education_Not Graduate',
       'Self_Employed_Yes', 'Property_Area_Semiurban', 'Property_Area_Urban',
       'Monthly_Loan_Payment', 'Credit_History_1.0'],
      dtype='object')

Shapes of train and test sets:
X_train shape: (491, 12)
X_test shape: (123, 12)


In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize the logistic regression model
model = LogisticRegression(max_iter=1000)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model performance
print("\nModel Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# If needed, you can also print the coefficients of the logistic regression model
print("\nModel Coefficients:")
print(model.coef_)



Model Performance:
Accuracy: 0.7886

Confusion Matrix:
[[18 25]
 [ 1 79]]

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.42      0.58        43
           1       0.76      0.99      0.86        80

    accuracy                           0.79       123
   macro avg       0.85      0.70      0.72       123
weighted avg       0.83      0.79      0.76       123


Model Coefficients:
[[ 1.30809473e-01 -9.84922412e-06 -4.68698268e-05 -2.65036908e-03
  -6.08030330e-02  5.66063997e-01 -3.48200090e-01  1.17373462e-01
   7.87616787e-01  1.05070529e-01 -2.32588850e-05  3.26024496e+00]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Model Performance Analysis

### Key Observations:

1. **Confusion Matrix**:
   - **True Positives (79)**: The model correctly predicted **79** loan approvals (class `1`).
   - **True Negatives (18)**: The model correctly predicted only **18** of the 43 actual loan denials (class `0`), which is low.
   - **False Positives (25)**: The model incorrectly predicted **25** loan denials as approvals.
   - **False Negatives (1)**: The model incorrectly predicted only **1** loan approval as a denial, which is very few.

2. **Classification Report**:
   - **Recall for Class 1 (Loan Approval)**: **99%**, indicating that the model is very good at identifying loan approvals.
   - **Recall for Class 0 (Loan Denial)**: **42%**, indicating that the model misses many loan denials.
   - **Precision for Class 0**: **95%**. The model predicted a denial only 19 times and was right in 18 of them, so the high precision reflects how rarely it predicts a denial, not how well it detects denials (recall is only 42%).

3. **Model Coefficients**:
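   - The coefficients show how each feature shifts the log-odds of approval. `Credit_History_1.0` has by far the largest coefficient (about 3.26), followed by `Property_Area_Semiurban` (0.79) and `Married_Yes` (0.57). Coefficients of features on large scales, such as `ApplicantIncome`, look tiny because the model was fit on unscaled data, so magnitudes are not directly comparable across features.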

4. **Convergence Warning**:
   - A **ConvergenceWarning** was raised, indicating that the solver (lbfgs) did not converge within the specified number of iterations. This might be due to unscaled data or the complexity of the problem. It is recommended to either increase the `max_iter` parameter or scale the data for better convergence.
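
To read the coefficient array more easily, each value can be paired with its feature name and ranked by absolute size. The sketch below uses the coefficients printed above and assumes they follow the column order shown in the "Columns after one-hot encoding" output with `Loan_Status` removed. In the notebook itself the equivalent would be `pd.Series(model.coef_[0], index=X.columns)`.

```python
import pandas as pd

# Feature order assumed from the printed column index, without Loan_Status
feature_names = [
    'Dependents', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
    'Gender_Male', 'Married_Yes', 'Education_Not Graduate',
    'Self_Employed_Yes', 'Property_Area_Semiurban', 'Property_Area_Urban',
    'Monthly_Loan_Payment', 'Credit_History_1.0',
]
# Coefficients copied from the model output above
coefficients = [
    1.30809473e-01, -9.84922412e-06, -4.68698268e-05, -2.65036908e-03,
    -6.08030330e-02, 5.66063997e-01, -3.48200090e-01, 1.17373462e-01,
    7.87616787e-01, 1.05070529e-01, -2.32588850e-05, 3.26024496e+00,
]

coef_table = pd.Series(coefficients, index=feature_names)
# Sort by absolute value so the largest effects on the log-odds come first
ranked = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)
print(ranked)
```

Because the features were not scaled for this fit, the ranking mixes effect size with measurement units; it is more meaningful on standardized features.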

---

### Recommendations for Improvement:

1. **Scaling the Data**: Standardizing the features using `StandardScaler` could improve the model's performance and help the solver converge more efficiently.

2. **Hyperparameter Tuning**: Tuning the hyperparameters (e.g., `C` for regularization) or using a different solver like `saga` may help improve model convergence and performance.

3. **Addressing Class Imbalance**: 
   - The model is biased towards predicting loan approvals (class `1`). To address this, consider using techniques like **oversampling** the minority class (class `0`) or adjusting **class weights** in the logistic regression model to improve predictions for loan denials.
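
A minimal sketch of recommendations 1 and 3 combined: a scaling-plus-logistic-regression pipeline, fit once with default class weights and once with `class_weight='balanced'`. The data come from `make_classification` as a synthetic stand-in with a roughly 30/70 class split, so the printed recalls are illustrative and not results for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: about 30% class 0 and 70% class 1
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.3, 0.7], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

candidates = {
    'plain': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'balanced': make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight='balanced'),
    ),
}

recalls = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    # Recall for the minority class (label 0)
    recalls[name] = recall_score(y_te, clf.predict(X_te), pos_label=0)
    print(f"{name}: recall for class 0 = {recalls[name]:.2f}")
```

Balanced class weights typically trade some recall on the majority class for better recall on the minority class, so both should be checked together.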


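## **Scaling the Data**

Scaling the features matters for logistic regression: it helps the solver converge and lets the regularization penalty treat all coefficients on a comparable scale. It does not by itself guarantee higher accuracy. The scaler is fit on the training set only and then applied to the test set, so no test-set information leaks into preprocessing.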

In [30]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## **Model Hyperparameter Tuning**
We can improve the model by fine-tuning its hyperparameters, like the regularization strength (C) or using a different solver.
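
One possible variation, sketched under assumptions: putting the scaler and the model in a single pipeline lets `GridSearchCV` refit the scaler inside each cross-validation fold, and a scoring metric such as `balanced_accuracy` weights both classes equally. The data are synthetic, and the parameter name `logisticregression__C` follows the step naming that `make_pipeline` generates.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a roughly 30/70 class split
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, weights=[0.3, 0.7], random_state=1
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = {'logisticregression__C': [0.01, 0.1, 1, 10]}

# Scaling is refit on the training part of every fold
search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring='balanced_accuracy')
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"Best CV balanced accuracy: {search.best_score_:.3f}")
```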

In [31]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'solver': ['lbfgs', 'saga']
}

# Initialize LogisticRegression
log_reg = LogisticRegression(max_iter=1000)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, verbose=1)

# Fit the model with hyperparameter tuning
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters
print("Best Parameters:", grid_search.best_params_)


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best Parameters: {'C': 0.1, 'solver': 'lbfgs'}


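## **Model Evaluation After Scaling and Tuning**
Once scaling and hyperparameter tuning are done, we evaluate the model on the test data. The tuned model reaches the same test accuracy (0.7886) and the same confusion matrix as the unscaled baseline, so scaling and tuning `C` did not change the test-set predictions; the weak recall on loan denials remains.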


In [32]:
# Use the best estimator from grid search
best_model = grid_search.best_estimator_

# Predict on the test set
y_pred_scaled = best_model.predict(X_test_scaled)

# Calculate accuracy
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy after scaling and tuning: {accuracy_scaled:.4f}")

# Confusion matrix and classification report
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_scaled))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_scaled))


Accuracy after scaling and tuning: 0.7886

Confusion Matrix:
[[18 25]
 [ 1 79]]

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.42      0.58        43
           1       0.76      0.99      0.86        80

    accuracy                           0.79       123
   macro avg       0.85      0.70      0.72       123
weighted avg       0.83      0.79      0.76       123
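
The headline numbers in both reports can be re-derived from the confusion matrix `[[18 25], [1 79]]`. The sketch below does the arithmetic and adds two reference points that the reports do not print: balanced accuracy, and the accuracy of always predicting approval on this test set.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 18, 25   # actual denials: predicted 0, predicted 1
fn, tp = 1, 79    # actual approvals: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall_denial = tn / (tn + fp)        # share of actual denials that were caught
recall_approval = tp / (tp + fn)      # share of actual approvals that were caught
precision_denial = tn / (tn + fn)     # share of predicted denials that were right
balanced_accuracy = (recall_denial + recall_approval) / 2
majority_baseline = (tp + fn) / total  # always predicting approval

print(f"accuracy={accuracy:.4f}, balanced_accuracy={balanced_accuracy:.4f}")
print(f"recall_denial={recall_denial:.4f}, precision_denial={precision_denial:.4f}")
print(f"majority_baseline={majority_baseline:.4f}")
```

Accuracy of 0.7886 compares with about 0.65 for the always-approve baseline, while balanced accuracy is about 0.70 because 25 of the 43 denials are missed.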

