# Loan Eligibility Automation Project Plan

## Problem statement:

The problem is to automate the loan eligibility process for a finance company, using the customer data provided in an online application form. The company's goal is to create a real-time system that efficiently evaluates whether a customer is eligible for a mortgage loan. Key customer information includes gender, marital status, education, number of dependents, income, loan amount, credit history, and other relevant factors.
By automating this process, the company aims to streamline operations, reduce manual effort, and improve the customer experience through faster responses to loan applications. Ultimately, the company also wants to identify specific customer segments for different loan amounts, allowing for targeted marketing efforts and efficient resource allocation.


## Data needed:

•	Gender: The gender of the applicant.

•	Marital Status: The marital status of the applicant.

•	Education: The education level of the applicant.

•	Number of Dependents: The number of dependents the applicant has.

•	Income: The income of the applicant, which is a crucial factor in determining loan eligibility.

•	Loan Amount: The amount of loan the applicant is requesting.

•	Credit History: The credit history of the applicant, indicating past repayment behavior.

•	Other relevant factors: Additional factors such as employment status, property ownership, existing loans, etc., may also be considered depending on the company's criteria.


## Hypotheses:
1.	Education level and monthly income positively impact the loan eligibility.
2.	A higher number of dependents may negatively influence loan eligibility due to increased financial responsibilities.
3.	Applicants with a better credit history are more likely to be eligible for higher loan amounts.
4.	There is a correlation between loan amount and eligibility, with higher loan amounts being more difficult to obtain.
5.	Gender and marital status do not have a significant impact on loan eligibility. Testing this also serves as a check for fairness and bias.

These hypotheses will be tested during the exploratory data analysis and model building phases of the project.
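As a rough illustration of how a hypothesis such as 3 or 5 might be examined during exploratory analysis, the sketch below compares approval rates across groups. The small DataFrame is made-up toy data, the helper `approval_rate_by` is hypothetical, and the column names and the `Y`/`N` coding of `Loan_Status` are assumptions based on the columns used later in this plan.

```python
import pandas as pd

# Toy data for illustration only; this is not the project's dataset.
toy = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0, 1, 0, 1],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Female", "Male", "Female"],
    "Loan_Status": ["Y", "Y", "N", "N", "N", "Y", "N", "Y"],
})


def approval_rate_by(df, column, target="Loan_Status", positive="Y"):
    """Return the share of approved applications for each value of `column`."""
    approved = df[target].eq(positive)          # boolean Series: True if approved
    return approved.groupby(df[column]).mean()  # mean of booleans = approval rate


rates_credit = approval_rate_by(toy, "Credit_History")
rates_gender = approval_rate_by(toy, "Gender")
print(rates_credit)
print(rates_gender)
```

A comparison like this is only descriptive. Deciding whether a difference is significant would need a formal test on the real data.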



## Decision Tree Model:

### Import Libraries:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import plotly.express as px
import matplotlib.pyplot as plt

This code imports the libraries required for data processing, model training, model assessment, and visualisation. pandas handles the data. From scikit-learn, train_test_split splits the dataset, DecisionTreeClassifier builds the decision tree model, plot_tree visualises the fitted tree, and accuracy_score is available for model assessment. Plotly Express provides interactive plots and matplotlib.pyplot provides static ones. Together, these libraries support each stage of the machine learning workflow used here: data preparation, model training, assessment, and visualisation.


### Next we prepared the data for the model:

In [None]:
df = pd.read_csv('Loans updated2.csv')
df.drop(columns=['Loan_ID'], inplace=True)
# One-hot encode the categorical features only; the target 'Loan_Status' and the numeric columns are left unchanged
ohe = pd.get_dummies(df, columns=['Dependents', 'Married', 'Education',
                                  'Self_Employed', 'Gender', 'Property_Area'])
# Build the features from the encoded frame so that the model receives numeric inputs
X = ohe.drop('Loan_Status', axis=1)
y = ohe['Loan_Status']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

This code segment reads a CSV file called 'Loans updated2.csv' into a pandas DataFrame 'df' and then deletes the 'Loan_ID' column, which is an identifier rather than a predictor. It then one-hot encodes the categorical variables in 'df', resulting in a new DataFrame 'ohe'. The numeric columns ('ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History') are assumed to be stored as numbers already and are not encoded, because one-hot encoding a continuous column would create one column per distinct value. The target 'Loan_Status' is also left out of the encoding. The features 'X' are taken from 'ohe' and contain all columns except 'Loan_Status', while the target 'y' contains only the 'Loan_Status' column. Finally, the dataset is divided into two sets: 80% for training ('X_train', 'y_train') and 20% for validation ('X_val', 'y_val'), using the 'train_test_split' function with a random state of 42.
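A point worth checking in this step is that the target stays out of the feature matrix and that only categorical columns are expanded. The sketch below shows this on a three-row toy frame. The rows are invented for illustration, and the column names simply mirror those assumed in this plan.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the project's dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Property_Area": ["Urban", "Rural", "Urban"],
    "ApplicantIncome": [5000, 3200, 4100],
    "Loan_Status": ["Y", "N", "Y"],
})

# Separate the target first so it can never leak into the features.
y_toy = toy["Loan_Status"]
features = toy.drop(columns=["Loan_Status"])

# Encode only the categorical columns; numeric columns pass through unchanged.
X_toy = pd.get_dummies(features, columns=["Gender", "Property_Area"])

print(X_toy.columns.tolist())
print(X_toy.shape)
```

Each categorical column becomes one indicator column per category, while `ApplicantIncome` stays a single numeric column.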


### Calculating Baseline Accuracy

In [None]:
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 2))

This code calculates baseline accuracy by finding the fraction of the majority class in the training dataset's target variable and displays it as "Baseline Accuracy" rounded to two decimal places.
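To make the baseline concrete, here is a small worked example on invented labels. The `Y`/`N` values and their proportions are toy assumptions, not results from the project's data.

```python
import pandas as pd

# Ten toy labels: seven approvals ("Y") and three rejections ("N").
y_toy = pd.Series(["Y", "Y", "Y", "N", "Y", "N", "Y", "N", "Y", "Y"])

class_shares = y_toy.value_counts(normalize=True)  # share of each class
majority_class = class_shares.idxmax()             # most frequent label
acc_baseline_toy = class_shares.max()              # accuracy of always predicting it

print("Majority class:", majority_class)
print("Baseline accuracy:", round(acc_baseline_toy, 2))
```

A model is only useful if its validation accuracy clearly exceeds this figure, since always predicting the majority class already achieves it.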

### Training and fine-tuning the Decision Tree

In [None]:
depth_hyperparams = range(1, 16)
training_acc = []
validation_acc = []

for d in depth_hyperparams:
    model_dt = DecisionTreeClassifier(max_depth=d, random_state=42)
    model_dt.fit(X_train, y_train)
    training_acc.append(model_dt.score(X_train, y_train))
    validation_acc.append(model_dt.score(X_val, y_val))

'depth_hyperparams' is a range of candidate maximum depths from 1 to 15. Two empty lists, 'training_acc' and 'validation_acc', are created. A loop then iterates over each value in 'depth_hyperparams': a decision tree classifier is created with the current depth ('d') and trained on the training data ('X_train' and 'y_train'). The model's training and validation accuracy are calculated using the 'score' method and appended to 'training_acc' and 'validation_acc', respectively.
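The same loop can be tried end to end on synthetic data to see how the gap between training and validation accuracy behaves as depth grows. This is a sketch only: the data comes from scikit-learn's `make_classification`, not from the loans file, and all variable names below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real project would use X_train, X_val, etc.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

depths = range(1, 16)
train_scores, val_scores = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=42).fit(X_tr, y_tr)
    train_scores.append(tree.score(X_tr, y_tr))  # accuracy on data the tree has seen
    val_scores.append(tree.score(X_va, y_va))    # accuracy on held-out data

# A widening gap at larger depths is a typical sign of overfitting.
gaps = [t - v for t, v in zip(train_scores, val_scores)]
for d, g in zip(depths, gaps):
    print(f"depth={d:2d}  train-val gap={g:.3f}")
```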


### Visualization of Training and Validation Curves

In [None]:
tune_data = pd.DataFrame({'Training': training_acc, 'Validation': validation_acc}, index=depth_hyperparams)
fig = px.line(data_frame=tune_data, x=depth_hyperparams, y=['Training', 'Validation'], title="Training & Validation Curves (Decision Tree Model)")
fig.update_layout(xaxis_title="Maximum Depth", yaxis_title="Accuracy Score")
fig.show()

This code section creates a pandas DataFrame called tune_data that holds the training and validation accuracy scores for each maximum depth of the decision tree model. Using Plotly Express, it then draws a line plot showing how these accuracy scores vary as the maximum depth changes. The x-axis shows the maximum depth values, while the y-axis shows the training and validation accuracy scores. The plot's title is set to "Training & Validation Curves (Decision Tree Model)", and the axes are labelled "Maximum Depth" and "Accuracy Score". This visualisation helps to determine the best maximum depth for the decision tree model: training accuracy typically keeps rising with depth, whereas validation accuracy levels off or falls once the tree starts to overfit.
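Reading the best depth off the plot can also be done programmatically. The sketch below picks the smallest depth that reaches the highest validation score, on the assumption that a shallower tree is preferable when scores tie. The helper `pick_depth` and the scores are hypothetical.

```python
def pick_depth(depths, val_scores):
    """Return the smallest depth that attains the highest validation score."""
    best_score = max(val_scores)
    for depth, score in zip(depths, val_scores):
        if score == best_score:
            return depth  # first match is the shallowest, since depths ascend


# Hypothetical validation scores for depths 1 to 5.
toy_depths = range(1, 6)
toy_val = [0.70, 0.78, 0.81, 0.81, 0.76]

chosen = pick_depth(toy_depths, toy_val)
print("Chosen max_depth:", chosen)
```

With the notebook's own variables, the call would be `pick_depth(depth_hyperparams, validation_acc)`.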


### Final Decision Tree visualization

In [None]:
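# final_model_dt is not defined in the cells above; as an assumption, refit at the depth with the best validation accuracy
best_depth = depth_hyperparams[validation_acc.index(max(validation_acc))]
final_model_dt = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model_dt.fit(X_train, y_train)
feature_names = list(X.columns)
plt.figure(figsize=(18, 12))
plot_tree(decision_tree=final_model_dt, filled=True, max_depth=2, feature_names=feature_names, class_names=True)
plt.axis('off')
plt.show()

This code segment plots the structure of the final decision tree model. Because final_model_dt is not defined in the preceding cells, the code first refits a tree at the depth that gave the highest validation accuracy. This is an assumption about how the final model was chosen. It then retrieves the feature names from the input data X and stores them as a list in the variable feature_names, and creates a figure of the given size. The plot_tree function draws the decision tree from the trained final_model_dt, with settings that fill the nodes with colour, limit the displayed depth to 2, and show feature and class names. Finally, the axis is switched off and the plot is displayed with plt.show().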
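When a figure is inconvenient, for example in a log or a report, the same tree structure can be printed as text with scikit-learn's `export_text`. The sketch below fits a small tree on synthetic data. The data and the feature names `f0` to `f3` are placeholders, not the project's model or columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]  # placeholder feature names

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Render the split rules as indented text, limited to the first two levels.
rules = export_text(demo_tree, feature_names=names, max_depth=2)
print(rules)
```

For the notebook's model, the equivalent call would pass `final_model_dt` and `list(X.columns)`.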