# Loan Approval Prediction: Logistic Regression and Decision Tree Classifiers

This notebook demonstrates how to preprocess a loan dataset, build machine learning models (Logistic Regression and Decision Tree) for loan approval prediction, evaluate their performance, and save the trained models for deployment.

## Importing Libraries and Packages

We import the libraries needed for data manipulation, preprocessing, model training, evaluation, and saving models. Below is a brief explanation of each:

1. pandas: For loading, exploring, and manipulating the dataset.
2. sklearn.model_selection.train_test_split: Splits the dataset into training and testing sets for model evaluation.
3. sklearn.preprocessing.LabelEncoder: Encodes categorical variables into numeric format.
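4. sklearn.impute.SimpleImputer: Handles missing data by replacing null values with a statistic such as the mean, median, or most frequent value. It is imported here, but this notebook drops incomplete rows instead of imputing them.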
5. sklearn.preprocessing.StandardScaler: Scales numerical features for improved model performance.
6. sklearn.linear_model.LogisticRegression: Builds a logistic regression model for binary classification.
7. sklearn.tree.DecisionTreeClassifier: Builds a decision tree classifier for predictions.
8. sklearn.metrics.accuracy_score: Evaluates model accuracy.
9. pickle: Saves and loads trained models for deployment.



# CODE SECTION

In [13]:
# Step 1: Importing Necessary Libraries

import pandas as pd 
# For data manipulation and analysis
from sklearn.model_selection import train_test_split  
# For splitting data into train and test sets
from sklearn.preprocessing import LabelEncoder, StandardScaler  
# For encoding categorical variables and scaling features
from sklearn.impute import SimpleImputer  
# For handling missing values
from sklearn.linear_model import LogisticRegression  
# Logistic Regression model
from sklearn.tree import DecisionTreeClassifier  
# Decision Tree model
from sklearn.metrics import accuracy_score  
# For evaluating the models
import pickle  
# For saving and loading the models


STEP 2: LOAD AND EXPLORE THE DATASET

In [14]:
# Load the dataset
loan_data = pd.read_csv(r'C:\Users\naray\Downloads\LoanApprovalPrediction.csv')

# Display the first few rows to inspect the structure of the dataset
print(loan_data.head())

# Check the distribution of the target variable
print(loan_data['Loan_Status'].value_counts())

# Check for missing values
print(loan_data.isna().sum())


    Loan_ID Gender Married  Dependents     Education Self_Employed  \
0  LP001002   Male      No         0.0      Graduate            No   
1  LP001003   Male     Yes         1.0      Graduate            No   
2  LP001005   Male     Yes         0.0      Graduate           Yes   
3  LP001006   Male     Yes         0.0  Not Graduate            No   
4  LP001008   Male      No         0.0      Graduate            No   

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0             5849                0.0         NaN             360.0   
1             4583             1508.0       128.0             360.0   
2             3000                0.0        66.0             360.0   
3             2583             2358.0       120.0             360.0   
4             6000                0.0       141.0             360.0   

   Credit_History Property_Area Loan_Status  
0             1.0         Urban           Y  
1             1.0         Rural           N  
2             

# DATA PREPROCESSING

1. Handling Missing values

Why Handle Missing Values?

Missing values can lead to inaccurate predictions, biases, and errors in machine learning models. Many algorithms cannot process incomplete data, and ignoring this step may result in poor performance.

Why Did We Remove Instead of Filling Missing Values?

In this dataset, missing values appear in columns such as LoanAmount, Loan_Amount_Term, and Credit_History, which are key predictors of loan approval.

Removing Missing Values: Keeps only rows whose values were actually observed. Since these columns are crucial, imputing them might introduce noise or incorrect information. For example:
Filling Credit_History with the mode would treat applicants with an unknown credit record as if they had the most common one, which could distort its influence on loan decisions.
LoanAmount varies widely between applicants, so a single mean or median value may be a poor stand-in for a missing amount.

Trade-off: Dropping rows shrinks the dataset (505 rows remain after this step) and can bias the sample if values are not missing at random. Imputation is the usual alternative when too many rows would otherwise be lost.

Benefit: By removing rows with missing values, we ensure that only complete, observed data is used for training.
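To make the choice between dropping and filling concrete, here is a minimal sketch on a made-up five-row frame. Only the column names mirror the notebook; the values, the variable names (`toy`, `dropped`, `imputed`), and the choice of median and most-frequent strategies are assumptions for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "LoanAmount": [128.0, np.nan, 66.0, 120.0, 141.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0, 1.0],
})

# Option A (the approach used in this notebook): drop rows with any missing value.
dropped = toy.dropna(subset=["LoanAmount", "Credit_History"])

# Option B (illustrative alternative): fill the gaps instead of dropping rows.
amount_imputer = SimpleImputer(strategy="median")
history_imputer = SimpleImputer(strategy="most_frequent")

imputed = toy.copy()
imputed["LoanAmount"] = amount_imputer.fit_transform(toy[["LoanAmount"]]).ravel()
imputed["Credit_History"] = history_imputer.fit_transform(toy[["Credit_History"]]).ravel()

print("rows before:", len(toy))
print("rows after dropna:", len(dropped))
print("rows after imputation:", len(imputed))
print(imputed)
```

Dropping keeps three of the five rows here, while imputation keeps all five at the cost of inventing two values. Which cost is acceptable depends on how many rows are incomplete in the real data.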



In [15]:
# Remove rows with missing values in critical columns
loan_data.dropna(subset=['LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Gender', 
                         'Married', 'Dependents', 'Self_Employed'], inplace=True)

# Verify if missing values are handled
print(loan_data.isna().sum())


Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64


2. Encoding Categorical Variables

Why Encode Categorical Variables?

Most machine learning algorithms, including the scikit-learn estimators used here, require numerical input. Categorical variables like Gender, Married, and Property_Area need to be converted into numerical values for the model to process them.

How Encoding Helps:

Label Encoding: Converts each category into an integer. LabelEncoder assigns codes in sorted (alphabetical) order, so for Gender the mapping is Female=0, Male=1.
Caveat: For columns with more than two categories, the integer codes imply an order that a linear model such as Logistic Regression will treat as meaningful. One-hot encoding is a common alternative when no such order exists.
Benefit: Encoding allows us to include categorical variables in our models, ensuring they contribute to the prediction process.
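The mapping that LabelEncoder chooses can be inspected rather than assumed. The sketch below uses a made-up `gender` series (the values are illustrative, not taken from the dataset) and reads the mapping back from the fitted encoder's `classes_` attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration only.
gender = pd.Series(["Male", "Female", "Male", "Male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ holds the labels in sorted order; the label at position i is encoded as i.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes.
print(encoder.inverse_transform(codes))
```

Keeping one fitted encoder per column makes it possible to apply the same mapping to new data later, for example at prediction time.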

In [16]:
# Encode categorical variables using LabelEncoder.
# A separate encoder is kept per column so each mapping can be inspected or reused later.
categorical_columns = ['Gender', 'Married', 'Education', 'Self_Employed',
                       'Property_Area', 'Loan_Status']
label_encoders = {}
for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    loan_data[column] = label_encoders[column].fit_transform(loan_data[column])


3. Defining Features and Target in Machine Learning
In machine learning, defining the features (independent variables) and the target (dependent variable) is an important step for training the model. The features are the input variables that the model uses to make predictions, while the target is the output variable the model tries to predict.

Here’s a breakdown of the process in Python:

Defining Features (X) and Target (y)

Features (X): These are the independent variables that the model uses to make predictions. In this case, we remove Loan_ID (an identifier) and Loan_Status (the target) from the features, leaving only the relevant data for prediction.

Target (y): The dependent variable, which is Loan_Status in this case, represents the output the model needs to predict (loan approval status).

Why is this Important?
Loan_ID is a unique identifier with no predictive meaning, and leaving Loan_Status in X would leak the answer to the model. Removing both ensures the model learns only from genuine input features and knows which variable to predict.

In [17]:
# Define features (X) and target (y)
X = loan_data.drop(columns=['Loan_ID', 'Loan_Status'])
y = loan_data['Loan_Status']

# Print the shapes of X and y
print(X.shape, y.shape)


(505, 11) (505,)


4. Splitting Data into Training and Testing Sets
Purpose: Splitting the dataset into two sets:

Training set (70%): Used to train the model.
Testing set (30%): Used to evaluate the model’s performance on unseen data.

Why Split the Data?: Evaluating on data the model has not seen during training gives a realistic estimate of how it will perform on new applications. A split does not prevent overfitting by itself, but a large gap between training and test accuracy reveals it.

Benefit: Setting random_state=42 makes the split reproducible, so results can be compared across runs.
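A 70/30 split can be sketched on synthetic data as follows. The arrays `X_toy` and `y_toy` are made up, and the `stratify` argument is an optional addition that the notebook itself does not use; it keeps the class ratio the same in both parts, which can help when one class is more common than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows with 70 positives and 30 negatives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 70 + [0] * 30)

# stratify=y_toy is an assumption for illustration; the notebook splits without it.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```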


In [18]:
# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the splits
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(353, 11) (152, 11) (353,) (152,)


5. Feature Scaling
Why Scale Features?
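Different features in the dataset have very different ranges (e.g., in the sample rows above, ApplicantIncome is in the thousands while LoanAmount is around one hundred). With unscaled inputs, the regularization penalty in Logistic Regression affects coefficients unevenly and the solver can converge slowly, so larger-scaled features can end up with undue influence.

StandardScaler: Scales each feature to have a mean of 0 and a standard deviation of 1 on the training data, making all features comparable. The scaler is fit on the training set only and then applied unchanged to the test set, so no information from the test data leaks into training.

Benefit:

Improves model convergence during training.
Helps algorithms that are sensitive to feature magnitudes (e.g., Logistic Regression) weigh features on a comparable scale.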
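StandardScaler computes z = (x - mean) / std for each column, using the mean and standard deviation learned during `fit`. The sketch below uses two made-up columns with very different ranges; the arrays and the name `toy_scaler` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: the first column is in the thousands, the second in the hundreds.
train = np.array([[1000.0, 100.0], [3000.0, 200.0], [5000.0, 300.0]])
test = np.array([[7000.0, 400.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = toy_scaler.transform(test)        # reuses the training statistics

print("train means:", train_scaled.mean(axis=0))
print("train stds:", train_scaled.std(axis=0))
print("scaled test row:", test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1. The test row is not centered at 0, because it is transformed with the training statistics rather than its own.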


In [19]:
# Scale the features for better model performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## Model Selection
Why Logistic Regression and Decision Tree?

Logistic Regression: A baseline algorithm for binary classification tasks, known for its simplicity and interpretability.
Decision Tree: A more flexible algorithm that can model complex decision boundaries and handle categorical and numerical data.
Benefit: Using two models allows us to compare their performance and select the one that works best for this problem.

### Logistic Regression
Purpose:

* Train Logistic Regression model: The LogisticRegression class from sklearn.linear_model is used to create a logistic regression model. The model is trained using the training data (X_train_scaled and y_train).
* Evaluate Model: After training the model, predictions are made using the test set (X_test_scaled), and the accuracy of these predictions is calculated by comparing them to the actual outcomes (y_test). This accuracy gives a measure of how well the model performs.

Why Logistic Regression?:
* Binary Classification: Logistic Regression is particularly well-suited for binary classification tasks, where the goal is to classify data into two categories (in this case, whether a loan is approved or denied).
* Simplicity and Interpretability: Logistic Regression is a simple and interpretable algorithm. It outputs probabilities, which can be helpful for understanding the likelihood of loan approval and making informed decisions.
* Effectiveness with Approximately Linear Boundaries: While it may not be the best model for highly complex or non-linear data, Logistic Regression works well when the decision boundary between classes is approximately linear. Whether that holds for loan approval depends on the dataset, which is one reason to compare it against a non-linear model such as a Decision Tree.

Why Use the max_iter=500 Parameter?
The max_iter parameter sets the maximum number of iterations the solver may run. scikit-learn's default is 100; raising it to 500 gives the solver more room to converge, although it does not guarantee convergence. If the limit is reached first, scikit-learn emits a ConvergenceWarning. Scaling the features, as done above, also helps the solver converge in fewer iterations.
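Whether the iteration limit was actually needed can be checked after fitting. The sketch below is an illustration on synthetic data from `make_classification`, not on the loan dataset; it turns ConvergenceWarning into an error so a non-converged fit cannot go unnoticed, then reads the iteration count from the fitted model's `n_iter_` attribute.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 11 features, matching the notebook's feature count.
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)

model = LogisticRegression(max_iter=500)
with warnings.catch_warnings():
    # Raise instead of warn if the solver stops at max_iter without converging.
    warnings.simplefilter("error", ConvergenceWarning)
    model.fit(X_demo, y_demo)

iterations_used = int(model.n_iter_[0])
print(f"solver used {iterations_used} of {model.max_iter} allowed iterations")
```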

Model Evaluation:

* Accuracy Score: The accuracy_score function from sklearn.metrics returns the fraction of correct predictions (a value between 0 and 1), which the code multiplies by 100 to report a percentage. It compares the predicted loan approval status (y_pred_log_reg) against the actual values (y_test).
* Result: The result gives an idea of how well the Logistic Regression model performs in predicting loan approval.

In Summary:
Logistic Regression is a simple, interpretable model ideal for binary classification tasks like loan approval prediction.
The model's performance is evaluated based on accuracy, which provides a straightforward measure of how well the model is performing on unseen data.

In [20]:
# Train Logistic Regression model
log_reg = LogisticRegression(max_iter=500)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred_log_reg = log_reg.predict(X_test_scaled)

# Evaluate the model
log_reg_accuracy = accuracy_score(y_test, y_pred_log_reg)
print(f"Logistic Regression Accuracy: {log_reg_accuracy * 100:.2f}%")


Logistic Regression Accuracy: 84.87%


### Decision Tree
Purpose:
* Train Decision Tree model: The DecisionTreeClassifier from sklearn.tree is used to train a Decision Tree model. The model is fit using the training data (X_train and y_train), learning patterns from the features to predict the target variable (loan approval status).
* Evaluate Model: After training, predictions are made on the test data (X_test), and the accuracy of the model is assessed by comparing the predicted loan approvals (y_pred_tree) with the actual outcomes (y_test).

Why Decision Tree?:
* Interpretability: One of the main advantages of Decision Trees is that they are highly interpretable. The decision-making process is transparent and easy to understand. This is particularly useful in finance, where stakeholders need to understand why a decision was made (e.g., why a loan was approved or denied).
* Non-linear Relationships: Unlike linear models like Logistic Regression, Decision Trees can handle non-linear relationships between features and the target variable. For example, certain interactions between income, credit history, and loan amount might be better captured with a tree structure rather than a simple linear model.
* Feature Importance: Decision Trees can also help identify the most important features for prediction. The fitted model's feature_importances_ attribute, or the tree structure itself, shows which features (e.g., Credit_History, ApplicantIncome) contributed most to the splits.
* Versatility: Decision Trees split on thresholds, so they are insensitive to feature scaling, which is why this model is trained on the unscaled X_train. scikit-learn's DecisionTreeClassifier still requires numeric input, so categorical columns must be encoded first.
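The interpretability and feature-importance points can be illustrated with a small synthetic example. The data below is made up so that the label depends only on Credit_History; the column names are borrowed from the dataset, but the values, the `max_depth=2` limit, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "Credit_History": rng.integers(0, 2, size=200),
    "ApplicantIncome": rng.integers(1000, 10000, size=200),
})
# Synthetic rule: the label is fully determined by Credit_History.
toy_y = toy_X["Credit_History"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(toy_X, toy_y)

# feature_importances_ sums to 1 across features; higher means more impact on the splits.
importances = pd.Series(tree.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))

# export_text prints the learned rules as indented if/else conditions.
print(export_text(tree, feature_names=list(toy_X.columns)))
```

On the real loan data the tree would be deeper and the importances spread across several features, but the same two calls apply to the fitted decision_tree model.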

Model Evaluation:
* Accuracy Score: As with Logistic Regression, we use accuracy_score to measure how well the Decision Tree model performs. The accuracy score compares the predicted results (y_pred_tree) with the actual values (y_test), providing a percentage of correct predictions.
* Result: The accuracy score gives an indication of how well the Decision Tree generalizes to unseen data. A high accuracy score would suggest that the model is performing well, while a lower score might indicate that the model is overfitting or underfitting the data.

In Summary:
* Decision Trees are a powerful, interpretable model for both linear and non-linear relationships. They excel in handling complex decision-making scenarios like loan approval prediction, where factors may interact in non-linear ways.
* The model's performance is evaluated based on accuracy, providing insight into how well the model predicts loan approval or denial.

In [21]:
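# Train Decision Tree model (on unscaled features; trees do not need scaling)
# No random_state is set, so ties between equally good splits may be broken differently between runs
decision_tree = DecisionTreeClassifier()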
decision_tree.fit(X_train, y_train)

# Make predictions
y_pred_tree = decision_tree.predict(X_test)

# Evaluate the model
tree_accuracy = accuracy_score(y_test, y_pred_tree)
print(f"Decision Tree Accuracy: {tree_accuracy * 100:.2f}%")


Decision Tree Accuracy: 73.68%


### Evaluating Models
Why Use Accuracy?
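Accuracy is a basic metric that gives the fraction of correct predictions and an overall sense of how well the model is performing. It can be misleading when one class is much more common than the other, because a model that mostly predicts the majority class still scores well. Metrics such as precision, recall, and the confusion matrix show which kinds of errors the model makes, which matters in lending, where wrongly approving and wrongly denying a loan have different costs.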

Benefit: Evaluation ensures the chosen model meets the desired performance level before deployment.
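To show why accuracy alone can hide problems, here is a small sketch with made-up labels (1 = approved, 0 = denied) and a hypothetical model that approves every application. The labels are illustrative and not taken from the notebook's results.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up ground truth: 7 approved (1) and 3 denied (0).
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# A hypothetical model that approves every application.
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: actual 0/1, columns: predicted 0/1
recall_denied = recall_score(y_true, y_pred, pos_label=0)
precision_approved = precision_score(y_true, y_pred, pos_label=1)

print("accuracy:", acc)
print("confusion matrix:\n", cm)
print("recall for denied loans:", recall_denied)
print("precision for approved loans:", precision_approved)
```

The accuracy is 70% even though the model never identifies a single loan that should be denied, which the recall of 0 for the denied class makes visible.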



### Saving Models and Scaler

Saving the Scaler:

What: The scaler.pkl file stores the StandardScaler object.
Why: It ensures that the same scaling transformation is applied to new input data during prediction, maintaining consistency in the model's behavior when processing new data.

Saving the Logistic Regression Model:

What: The logistic_regression_model.pkl file stores the trained Logistic Regression model.
Why: Saves time and resources by preventing the need to retrain the model each time you want to use it. You can directly load the model for making predictions on new data.

Saving the Decision Tree Model:

What: The decision_tree_model.pkl file stores the trained Decision Tree model.
Why: Similar to the Logistic Regression model, it allows for reuse and avoids retraining, making future predictions faster and more efficient.

Why Save the Models?
Efficiency: Loading a saved model avoids retraining each time predictions are needed, which matters more as datasets grow.
Reusability: Saved models can be loaded in other environments (e.g., production systems, web or mobile applications) to serve predictions on live data.
Consistency: A saved model returns the same predictions for the same input, provided it is loaded with a compatible scikit-learn version.

Things to Keep in Mind
The Logistic Regression model expects scaled input, so scaler.pkl must be loaded and applied to new data first. The Decision Tree was trained on unscaled features and should receive unscaled input.
The label encodings are not saved here, so new data must be encoded with the same category-to-integer mapping before prediction.
pickle can execute arbitrary code when loading a file, so only unpickle files from a trusted source.

In Summary:
Saving the scaler and models allows for efficient, consistent, and reusable predictions without retraining, which is crucial for deploying machine learning systems in production.
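The save-and-reload cycle can be sketched end to end as follows. This is an illustration on synthetic data with hypothetical variable names; it serializes to in-memory bytes with `pickle.dumps` so that it runs without touching the file system, whereas the notebook writes .pkl files with `pickle.dump`.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the loan features.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(100, 3))
y_fit = (X_fit[:, 0] > 0).astype(int)

fitted_scaler = StandardScaler().fit(X_fit)
fitted_model = LogisticRegression(max_iter=500).fit(
    fitted_scaler.transform(X_fit), y_fit
)

# Serialize both objects, then restore them as a deployment script would.
scaler_bytes = pickle.dumps(fitted_scaler)
model_bytes = pickle.dumps(fitted_model)
restored_scaler = pickle.loads(scaler_bytes)
restored_model = pickle.loads(model_bytes)

# New rows must pass through the restored scaler before reaching the model.
new_rows = rng.normal(size=(5, 3))
original_pred = fitted_model.predict(fitted_scaler.transform(new_rows))
restored_pred = restored_model.predict(restored_scaler.transform(new_rows))

print("predictions match:", bool((original_pred == restored_pred).all()))
```

With files, the loading side would use `pickle.load` on a file opened in 'rb' mode, mirroring the 'wb' mode used when saving.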

In [22]:
# Save the scaler
with open('scaler.pkl', 'wb') as file:
    pickle.dump(scaler, file)

# Save the Logistic Regression model
with open('logistic_regression_model.pkl', 'wb') as file:
    pickle.dump(log_reg, file)

# Save the Decision Tree model
with open('decision_tree_model.pkl', 'wb') as file:
    pickle.dump(decision_tree, file)


## Discussion
The Logistic Regression model outperforms the Decision Tree model in terms of accuracy on this test split (84.87% compared to 73.68%, measured on 152 test rows). On this evidence, Logistic Regression is the more reliable choice for predicting loan approval in this particular case, and it would be more appropriate to use it for future predictions of loan approval. Because the comparison rests on a single train/test split and an untuned Decision Tree, the gap should be confirmed (for example, with cross-validation) before treating it as conclusive.
