# **Predicting Auto Loan Approval**



##### **Project Type**    - Classification

# **Project Summary -**

Write the summary here within 500-600 words.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Develop a machine learning algorithm to predict whether a loan application will be approved or rejected based on the provided applicant information.**

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

## Python Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Display all the columns in the dataframe
pd.set_option("display.max_columns", None)

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset

df = pd.read_excel(r"/content/drive/MyDrive/Loan_approval.xlsx")

### Dataset First View

In [None]:
# Dataset First Look
df.head(20)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

df.isnull().sum()

In [None]:
missing_percnt=df.isnull().sum()/len(df)*100
missing_percnt

In [None]:
# Visualizing the missing values
plt.figure(figsize=(15, 5))
sns.barplot(x=missing_percnt.index, y=missing_percnt)
plt.xticks(rotation=60)
plt.xlabel("Variables")
plt.ylabel("Missing Percentage")
plt.title("Missing Values")

### What did you know about your dataset?

- In our dataset, we have a total of 29,394 rows and 21 columns.

- We have missing values in the "Income", "No. of Tradelines", "Vehicle Year", "Vehicle Make", "Vehicle Age", and "Vehicle Miles" columns.
- The "Income" column has 557 missing values, and "No. of Tradelines" has 709 missing values.
- Most of the columns contain numeric data types, representing numerical attributes such as income, loan amount, and credit scores.
- Some columns, such as "CreditType" and "Vehicle Make", are categorical, containing object data types.
- There are no duplicate rows in the dataset.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

df.columns

In [None]:
# Dataset Describe

df.describe().T.round(2)

### Variables Description

1.	Loan Number: Unique identifier for each loan application.
2.	CreditType: Type of credit (e.g., prime, subprime).
3.	Job Hours: Number of hours worked per week by the applicant.
4.	Income: Monthly income of the applicant.
5.	LTV (Loan-to-Value): Ratio of the loan amount to the appraised value of the vehicle.
6.	Term: Duration of the loan in months.
7.	Price: Price of the vehicle being financed.
8.	Downpayment: Amount of downpayment made by the applicant.
9.	BookValue: Book value of the vehicle.
10.	Amount Financed: The amount of the loan being applied for.
11.	APR (Annual Percentage Rate): Annualized interest rate on the loan.
12.	Monthly Payment: Monthly payment amount.
13.	Monthly Debt: Total monthly debt obligations of the applicant.
14.	No. of Tradelines: Number of credit tradelines (credit accounts) reported for the applicant.
15.	FICO Score: FICO credit score of the applicant.
16.	ClearFraud Score: Fraud risk score assigned to the applicant.
17.	Vehicle Year: Year of the vehicle being financed.
18.	Vehicle Make: Make of the vehicle.
19.	Vehicle Age: Age of the vehicle.
20.	Vehicle Miles: Mileage of the vehicle.
21.	AutoApproved (Target): Binary variable indicating whether the loan was approved (1) or rejected (0).
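
As a quick sanity check on the LTV definition above, the ratio can be recomputed from the financed amount and the vehicle's value. This is a minimal sketch on made-up numbers. It assumes that LTV is stored as a percentage and that `BookValue` is the valuation used; neither assumption is confirmed by the dataset itself.

```python
import pandas as pd

# Toy rows (made-up values, not taken from the loan dataset)
toy = pd.DataFrame({
    "Amount Financed": [18000.0, 12000.0, 25000.0],
    "BookValue": [20000.0, 10000.0, 25000.0],
})

# Assumption: LTV is expressed as a percentage of the book value
toy["LTV_check"] = (toy["Amount Financed"] / toy["BookValue"] * 100).round(1)
print(toy)
```

Comparing such a recomputed column with the stored `LTV` values is one way to spot rows where the two disagree.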


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

df.nunique()

In [None]:
df["CreditType"].value_counts()

In [None]:
df["AutoApproved"].value_counts()

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

num_col=df.select_dtypes(include=["int", "float"])
cat_col=df.select_dtypes(include=["object"])

In [None]:
df.drop("Vehicle Make", axis=1, inplace= True)

### Summary Statistics

In [None]:
stats = num_col.describe().T.round(2)
stats['skew'] = num_col.skew().round(2)
stats['kurtosis'] = num_col.kurtosis().round(2)

stats

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

#### What all missing value imputation techniques have you used and why did you use those techniques?
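
One common option for skewed numeric columns such as "Income" is median imputation, because the median is not pulled around by extreme values. The sketch below illustrates the idea on a made-up frame; it is not the notebook's chosen method, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (made-up values standing in for the loan data)
toy = pd.DataFrame({
    "Income": [3000.0, np.nan, 5000.0, 100000.0],
    "No. of Tradelines": [4.0, 2.0, np.nan, 6.0],
})

# Median of each numeric column, ignoring missing entries
medians = toy.median(numeric_only=True)

# Fill each column's gaps with that column's median
imputed = toy.fillna(medians)

print(medians)
print(imputed.isnull().sum())
```

The same two lines could be applied to `df` once the imputation strategy for each column has been decided.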

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

## Univariate Analysis

#### Chart - 1

In [None]:
# Chart - 1 visualization code
sns.boxplot(x="AutoApproved", y="Income", showmeans=True, data=df)
plt.xlabel("Loan Approved")
plt.ylabel("Income")
plt.title("Income Distribution by Loan Approval")
plt.show()

In [None]:
df["Income"].describe()

In [None]:
df.loc[df["Income"] >= 250000]

##### 1. Why did you pick the specific chart?

A box plot compares the income distribution of approved and rejected applications, showing the median, spread, and outliers of each group side by side.

##### 2. What is/are the insight(s) found from the chart?

Income Distribution: The boxplot shows that the income distribution for approved loans is noticeably higher than for rejected loans. The median income for approved loans is around $50,000, while for rejected loans it is closer to $30,000. This suggests that income is an important factor in loan approval for ABC Auto Finance.

Outliers: There appear to be outliers on both sides of the distribution, indicating that some high-income applicants were rejected and some low-income applicants were approved.
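
Medians read off a box plot can be confirmed numerically with a grouped summary. This is a minimal sketch on made-up incomes; the column names match the notebook, but the values are assumptions.

```python
import pandas as pd

# Toy data (made-up): three rejected and three approved applicants
toy = pd.DataFrame({
    "Income": [2000, 3000, 4000, 5000, 6000, 7000],
    "AutoApproved": [0, 0, 0, 1, 1, 1],
})

# Median, mean, and group size of income for each approval outcome
summary = toy.groupby("AutoApproved")["Income"].agg(["median", "mean", "count"])
print(summary)
```

Running the same `groupby` on `df` gives the exact medians behind the chart.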

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Improved Loan Approval Process: By understanding the relationship between income and loan approval, ABC Auto Finance can refine their credit risk assessment process. This can lead to approving more loans to qualified borrowers while reducing defaults.

Targeted Marketing: The income distribution can help ABC Auto Finance target their marketing campaigns towards demographics with a higher likelihood of loan approval. This can improve the return on investment for their marketing efforts.

Negative Impacts (if any):

Fairness and Bias: If the income distribution is skewed towards higher income groups due to biases in the loan approval process, it can limit access to loans for qualified low-income borrowers. This can have negative social and ethical implications.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(10, 6))
sns.scatterplot(x='LTV', y='AutoApproved', hue='AutoApproved', data=df, palette='Set1', alpha=0.7)

# Adding a trendline or regression line
sns.regplot(x='LTV', y='AutoApproved', data=df, scatter=False, color='black')

plt.title('Scatterplot of Loan-to-Value (LTV) vs. AutoApproval')
plt.xlabel('Loan-to-Value (LTV)')
plt.ylabel('AutoApproval')
plt.legend(title='AutoApproved', loc='upper right', labels=['Rejected', 'Approved'])

plt.show()

In [None]:
df.loc[(df["LTV"] >= 500) & (df["AutoApproved"] == 1)]

##### 1. Why did you pick the specific chart?

##### 2. What is/are the insight(s) found from the chart?

Negative Correlation: The plot confirms a negative correlation between LTV ratio and loan approval. Data points representing approved loans (blue in the Set1 palette) are concentrated in the lower LTV ratio area (left side), while rejected loans (red) are scattered towards the higher LTV ratio area (right side). This indicates that borrowers with a higher LTV (larger loan amount relative to vehicle value) are less likely to get their loans approved.

Approval Threshold: The data suggests a possible LTV threshold around 400-500, where loan approvals become less frequent. This can be a starting point for ABC Auto Finance to define their LTV risk management strategies.
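
A threshold like this can be checked by computing the approval rate within LTV bands. The sketch below uses made-up rows and hypothetical band edges; both are assumptions chosen only to illustrate the calculation.

```python
import pandas as pd

# Toy data (made-up LTV values and outcomes)
toy = pd.DataFrame({
    "LTV": [80, 95, 110, 130, 150, 420, 480, 510],
    "AutoApproved": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Hypothetical band edges; pick edges that suit the real LTV range
bins = [0, 100, 200, 400, 600]
toy["LTV_band"] = pd.cut(toy["LTV"], bins=bins)

# Mean of a 0/1 column is the approval rate; count shows how many rows back it
rate_by_band = (
    toy.groupby("LTV_band", observed=False)["AutoApproved"].agg(["mean", "count"])
)
print(rate_by_band)
```

Bands with very few rows should be read with caution, since a rate based on a handful of applications is unstable.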

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

ABC Auto Finance can focus on approving loans with lower LTVs to mitigate defaults and improve loan portfolio health.

By segmenting loan applications by LTV and approval status, ABC Auto Finance can also target marketing campaigns more effectively, for example by offering loan products with higher LTV limits to creditworthy borrowers who were previously rejected due to stricter thresholds.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

sns.boxplot(y="FICO", x="AutoApproved", data=df)
plt.xlabel("Loan Approved")
plt.ylabel("FICO Score")
plt.title("FICO Score vs. Loan Approval")
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is a good choice for comparing the distribution of a continuous variable (FICO score) across the categories of another variable (approved vs. rejected).

##### 2. What is/are the insight(s) found from the chart?

Positive association between FICO score and loan approval: the box plot shows that approved applications have higher FICO scores overall than rejected ones. In this dataset, applicants with higher FICO scores are generally more likely to get their loans approved.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Improved Loan Approval Efficiency: By understanding the relationship between FICO score and approval rates, you can potentially streamline the loan approval process. For instance, applications with very high FICO scores (e.g., above 750) might qualify for faster approvals or require less manual review.

Reduced Risk: Focusing on applicants with strong creditworthiness (high FICO scores) can help reduce the risk of defaults, leading to lower delinquency rates and improved portfolio health for your auto finance business.

Are there negative impacts?

While a FICO score is a valuable indicator, relying solely on it could lead to potential drawbacks:

Fair Lending Concerns: Using FICO scores as the primary criterion might raise fair lending concerns if it disproportionately excludes applicants from certain demographics with lower average credit scores but who might be creditworthy nonetheless. Regulatory compliance and fair lending practices are crucial considerations.

Missing Out on Good Borrowers: Some borrowers with lower FICO scores might still be good candidates for loans if they have other positive attributes, such as steady employment, stable income, or a history of on-time payments for other debts. A more holistic approach to evaluating loan applications can help capture these nuances.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(15, 6))
sns.scatterplot(x='Amount Financed', y='Price', data=df)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('Loan Amount (Log Scale)')
plt.ylabel('Vehicle Price (Log Scale)')
plt.title('Loan Amount vs. Vehicle Price (Logarithmic Scale)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

There's a clear positive correlation between loan amount and vehicle price. As vehicle prices increase (on a logarithmic scale), loan amounts also tend to increase. This makes sense: more expensive vehicles typically require larger loans.

The data points are not clustered around a straight line, but rather follow a curved pattern. This indicates a non-linear relationship. The increase in loan amount might not be directly proportional to the increase in vehicle price. For example, the jump in loan amount for a luxury car compared to a mid-range car might be smaller than the price difference between them.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Loan Product Design: This scatter plot can inform the development of targeted loan products. By understanding the loan amount range needed for different vehicle price segments (e.g., budget-friendly cars vs. luxury vehicles), we can tailor loan offerings (like maximum loan amount) to specific customer segments. This can attract a wider range of customers and potentially increase loan applications.

Credit Risk Assessment: Loan amount can be a factor influencing creditworthiness. Analyzing this plot alongside loan approval data can help assess if there's a relationship between the loan amount financed and the likelihood of loan approval. This knowledge can be valuable for developing credit risk assessment models.

#### Chart - 5

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

plt.figure(figsize=(15, 5))
sns.kdeplot(x=df['ClearFraud Score'], y=df['AutoApproved'])
plt.xlabel('ClearFraud Score')
plt.ylabel('AutoApproved')  # the y-axis of this bivariate KDE is the approval flag, not a frequency
plt.title('ClearFraud Score vs. Loan Approval (Bivariate KDE)')
plt.show()

In [None]:
# Check for infinity values
inf_mask = df.isin([np.inf, -np.inf])

# Get rows with infinity values
rows_with_inf = df[inf_mask.any(axis=1)]

print("Rows with infinity values:")
print(rows_with_inf)

In [None]:
# Check for infinity values in a specific column
inf_values_column_A = df[df['ClearFraud Score'] == np.inf]
print("Rows with infinity values in column ClearFraud Score:")
print(inf_values_column_A)

##### 1. Why did you pick the specific chart?

KDE plots are well-suited for representing the distribution of continuous data.

##### 2. What is/are the insight(s) found from the chart?

ClearFraud score has a limited impact on loan approval because approvals happen across the entire score range above 250, and applicants with 0 scores are automatically rejected.


#### Chart - 7

In [None]:
# Chart - 7 visualization code

plt.figure(figsize=(10, 6))
# Count plot: number of rejected (0) and approved (1) applications per credit type
sns.countplot(x="CreditType", hue="AutoApproved", data=df)
plt.xlabel("Credit Type")
plt.ylabel("Count")
plt.title("Count of Auto Approval by Credit Type")

plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(8, 6))
sns.kdeplot(x=df['APR'], y=df['AutoApproved'])
plt.xlabel('APR (%)')
plt.ylabel('AutoApproved')  # the y-axis of this bivariate KDE is the approval flag, not a frequency
plt.title('APR vs. Loan Approval (Bivariate KDE)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

plt.figure(figsize=(8, 6))
sns.scatterplot(
    x="AutoApproved",
    y="No. of Tradelines",
    data=df,
)
plt.xlabel('Loan Approval')
plt.ylabel('No. of Tradelines')
plt.title('No. of Tradelines by Loan Approval')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows whether there is a direct relationship between the number of tradelines and auto approval.

##### 2. What is/are the insight(s) found from the chart?

The plot suggests that the number of tradelines reaches higher values for non-approved loans than for approved loans. This points to an association between the two variables. However, it does not necessarily mean that having more accounts causes loan rejections.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

plt.figure(figsize=(12, 4))
sns.scatterplot(x='Downpayment', y='LTV', data=df)
plt.xlabel('Down Payment')
plt.ylabel('Loan-to-Value (LTV) Ratio')
plt.title('Down Payment vs. LTV Ratio')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The data points in the scatter plot exhibit a positive correlation: loan-to-value ratios tend to increase as down payments increase.

This is counterintuitive, because by definition a larger down payment reduces the amount financed and therefore the LTV. One possible explanation is that a larger down payment reduces the lender's risk, so they might be more willing to approve a loan with a higher LTV. Extreme LTV values may also be driving the trend.

The scatter plot also suggests a non-linear relationship between down payment and LTV ratio. The data points are not clustered around a straight line, but rather follow a curved pattern. This indicates that the increase in LTV ratio might not be directly proportional to the increase in down payment.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Monthly Payment', y='Income', data=df)
plt.xlabel('Monthly Loan Payment')
plt.ylabel('Income')
plt.title('Monthly Payment vs. Income')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows a weak positive correlation between monthly loan payment and income. There's a slight upward trend, indicating that borrowers with higher incomes tend to take out loans with higher monthly payments. However, the data points are quite spread out, suggesting a weak association.

The plot reveals that borrowers with similar loan payments can have significantly different incomes.

For some borrowers, especially those on the lower end of the income spectrum, the monthly payment might represent a significant portion of their income. This could be a risk factor, potentially leading to defaults if unexpected financial burdens arise.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

plt.figure(figsize=(8, 6))
plt.scatter(df['Monthly Dedt'], df['AutoApproved'], c=df['AutoApproved'], cmap='coolwarm')
plt.xlabel('Monthly Debt')
plt.ylabel('Loan Approval (0-Rejected, 1-Approved)')
plt.title('Monthly Debt vs. Loan Approval')
plt.colorbar(label='Loan Approval')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The plot suggests a positive relationship between monthly debt and loan approval: approved loans (AutoApproved = 1, the warm end of the coolwarm colormap) extend further towards higher monthly debt than rejected loans (AutoApproved = 0, the cool end). The color only repeats the approval outcome shown on the y-axis. The data points are spread out, suggesting this relationship is not very strong.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Two features whose interaction we want to inspect (replace with other columns as needed)
feature1 = 'LTV'
feature2 = 'Income'

# Scatter plot with color representing the loan approval outcome
plt.figure(figsize=(8, 6))
plt.scatter(df[feature1], df[feature2], c=df['AutoApproved'], cmap='coolwarm')
plt.xlabel(feature1)
plt.ylabel(feature2)
plt.title('Interaction Plot: ' + feature1 + ' vs. ' + feature2)
plt.colorbar(label='Loan Approval (0-Rejected, 1-Approved)')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

LTV vs. Income and Loan Approval

Color encoding: The coolwarm colormap encodes the binary approval outcome, not a probability:

- Blue: rejected applications (AutoApproved = 0).
- Red: approved applications (AutoApproved = 1).

Spread of Data Points: The data points are scattered, so LTV and income alone do not cleanly separate the two outcomes. Applicants with similar LTV and income can have different loan approval outcomes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

plt.figure(figsize=(15, 6))
correlation = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation, annot=True, cmap="twilight_shifted_r")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
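
To read a correlation matrix with the target in mind, it can help to rank the features by their correlation with `AutoApproved`. The sketch below uses a made-up frame; the `FICO` and `LTV` values are assumptions for illustration only. For a binary target, the Pearson coefficient computed here is the point-biserial correlation.

```python
import pandas as pd

# Toy data (made-up): FICO rises and LTV falls with approval
toy = pd.DataFrame({
    "FICO": [580, 620, 660, 700, 740, 780],
    "LTV": [140, 130, 120, 110, 100, 90],
    "AutoApproved": [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, strongest positive first
corr_with_target = (
    toy.corr()["AutoApproved"]
    .drop("AutoApproved")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Applied to `correlation` from the cell above, the same indexing gives a ranked list that is easier to scan than the full heatmap.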

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15

In [None]:
# visualization code
plt.figure(figsize=(8, 6))
plt.scatter(df['Price'], df['BookValue'])
plt.xlabel('Vehicle Price')
plt.ylabel('Book Value')
plt.title('Book Value vs. Vehicle Price')
plt.show()

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

The data points in the scatter plot exhibit a positive correlation. This means there's a general tendency for book value to increase as vehicle price increases. In simpler terms, more expensive cars tend to have a higher book value, which reflects their market worth.

The data points are scattered around the positive trend line, indicating that the relationship between book value and vehicle price is not perfectly linear. Here are some possible reasons for this spread:

- Vehicle Make and Model: Even for similar prices, different car makes and models might have different book values due to factors like brand reputation, features, and demand.
- Vehicle Condition: The condition of a car (mileage, wear and tear) can significantly impact its book value, even if the price is the same. A car in good condition will likely have a higher book value than a car with similar features but poorer condition.
- Geographic Location: Geographic location can also influence book value. The same car model might have a different book value in different parts of the country.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

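1. Applicants who work more job hours per week are more likely to have their loans approved.
2. Loan approval rates are associated with income level.
3. Applicants with prime credit are more likely to have their loans approved than applicants with subprime credit.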

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant association between the number of job hours worked per week ('Job Hours') and loan approval rates ('AutoApproved').

Alternative Hypothesis (Ha): Applicants with a higher number of job hours (potentially indicating full-time employment) are more likely to get their loans approved compared to applicants with fewer job hours (potentially part-time or unemployed).

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import mannwhitneyu

# 'AutoApproved' is binary (0 = rejected, 1 = approved); drop missing job hours
rejected_hours = df.loc[df['AutoApproved'] == 0, 'Job Hours'].dropna()
approved_hours = df.loc[df['AutoApproved'] == 1, 'Job Hours'].dropna()

# One-sided test matching Ha: job hours of rejected applicants tend to be lower than those of approved applicants
u_statistic, p_value = mannwhitneyu(rejected_hours, approved_hours, alternative='less')

# Print the results
print("mannwhitneyu statistic:", u_statistic)
print("p-value:", p_value)

# Interpretation
if p_value < 0.05:  # Adjust significance level as needed
    print("We can reject the null hypothesis. There is a significant association between job hours and loan approval.")
    print("Approved applicants tend to work more hours than rejected applicants.")
else:
    print("We fail to reject the null hypothesis. There is not enough evidence to conclude that approved applicants work more hours than rejected applicants.")


##### Which statistical test have you done to obtain P-Value?

Mann-Whitney U Test

##### Why did you choose the specific statistical test?

This is a non-parametric test suitable for comparing two independent groups (approved vs. rejected loans in this case). It tests whether values in one group tend to be larger than in the other; it compares medians only when both distributions have a similar shape.
We don't necessarily know if the distribution of job hours is normal (which is an assumption for a t-test).
The Mann-Whitney U Test is a robust option that makes fewer assumptions about the data distribution.
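
With tens of thousands of rows, even a tiny difference can produce a small p-value, so it is worth reporting an effect size next to it. The sketch below derives one from the U statistic on made-up job hours. It assumes a SciPy version in which `mannwhitneyu` returns the U statistic of the first sample; the group values are illustrative only.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Made-up weekly job hours for two small groups
rejected = np.array([20, 25, 30, 35, 40])
approved = np.array([38, 40, 42, 45, 50])

# One-sided test: are hours of rejected applicants stochastically lower?
u_stat, p_value = mannwhitneyu(rejected, approved, alternative="less")

# Share of (rejected, approved) pairs in which the rejected applicant works
# more hours (ties count as half); 0.5 would mean no tendency either way
prob_rejected_higher = u_stat / (len(rejected) * len(approved))
print(u_stat, p_value, prob_rejected_higher)
```

A value far from 0.5 indicates a practically meaningful difference, while a value near 0.5 signals a negligible one even if the p-value is small.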

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant association between income level and loan approval rates.

Alternative Hypothesis (Ha): Loan approval rates are positively correlated with income level.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

# Create a contingency table for income level and loan approval
# (pd.cut with bins=3 splits income into three equal-width ranges)
income_groups = pd.cut(df['Income'], bins=3, labels=['Low', 'Medium', 'High'])
contingency_table = pd.crosstab(income_groups, df['AutoApproved'])

# Perform Chi-Square test of independence
# (correction=True applies Yates' correction only when the table has 1 degree of freedom, i.e. 2x2)
chi2_statistic, p_value, dof, expected_table = chi2_contingency(contingency_table, correction=True)

# Print the results
print("Chi-Square Statistic:", chi2_statistic)
print("p-value:", p_value)

# Interpretation
if p_value < 0.05:  # Adjust significance level as needed
    print("We can reject the null hypothesis. There is a significant association between income level and loan approval rates.")
else:
    print("We fail to reject the null hypothesis. There is not enough evidence to conclude a significant association between income level and loan approval rates.")


##### Which statistical test have you done to obtain P-Value?

Chi-Square Test of Independence

##### Why did you choose the specific statistical test?

A Chi-Square test of independence is suitable because both variables are categorical: income is binned into three levels and loan approval is binary. Yates' correction only affects tables with one degree of freedom (2x2), so it has no effect here when all three income levels are present.
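
Two checks usually accompany a chi-square test: whether the expected counts are large enough for the approximation, and how strong the association is. The sketch below shows both on a made-up 3x2 table. The counts are assumptions, and the threshold of 5 is a common rule of thumb rather than a hard requirement.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts: rows are income groups, columns are rejected (0) / approved (1)
table = pd.DataFrame(
    [[60, 40], [50, 50], [30, 70]],
    index=["Low", "Medium", "High"],
    columns=[0, 1],
)

# chi2_contingency returns the statistic, p-value, degrees of freedom,
# and the table of expected frequencies under independence
chi2, p_value, dof, expected = chi2_contingency(table)

# Rule of thumb: the approximation is reasonable when expected counts are >= 5
min_expected = expected.min()

# Cramér's V: effect size from 0 (no association) to 1 (perfect association)
n = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(dof, min_expected, cramers_v)
```

With equal-width income bins, the upper bins may contain very few applicants, so checking the smallest expected count on the real table is worthwhile.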

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant difference in loan approval rates between applicants with prime and subprime credit type.

Alternative Hypothesis (Ha): Applicants with prime credit (e.g., higher credit scores, lower debt-to-income ratio) are more likely to get their loans approved compared to applicants with subprime credit.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['CreditType'], df['AutoApproved'])
# chi2_contingency returns the statistic, p-value, degrees of freedom, and expected frequencies
chi2_statistic, p_value, dof, expected_table = chi2_contingency(contingency_table)

# Print the results
print("Chi-Square Statistic:", chi2_statistic)
print("p-value:", p_value)
# Interpretation
if p_value < 0.05:  # Adjust significance level as needed
    print("We can reject the null hypothesis. There is a significant association between credit type and loan approval rates.")
else:
    print("We fail to reject the null hypothesis. There is not enough evidence to conclude a significant difference in approval rates between prime and subprime borrowers.")


##### Which statistical test have you done to obtain P-Value?

Chi-Square Test

##### Why did you choose the specific statistical test?

- 'CreditType' is a categorical variable (prime, subprime).
- 'AutoApproved' is a binary variable (approved/rejected).
- We want to assess the association between these two categorical variables.

## ***6. Feature Engineering & Data Pre-processing***

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

def calculate_outlier(df, column):
    Q3 = df[column].quantile(0.75)
    Q1 = df[column].quantile(0.25)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[column] > upper) | (df[column] < lower)]
    percent_outliers = round((outliers.shape[0] / df.shape[0]) * 100, 2)
    return lower, upper, percent_outliers

In [None]:
num_col.head(1)

In [None]:
lower_income, upper_income, percentage_income_outliers=calculate_outlier(df, "Income")
print("lower band",(lower_income))
print("upper band",(upper_income))
print("outlier percent",(percentage_income_outliers))

In [None]:
df["Income"].describe()

In [None]:
sns.boxplot(x="AutoApproved", y="Income", showmeans=True, data=df)
plt.xlabel("Loan Approved")
plt.ylabel("Income")
plt.title("Income Distribution by Loan Approval")
plt.show()

In [None]:
# Cap Income in place: floor at 0 and ceiling at 250000
df.loc[df["Income"] <= 0, "Income"] = 0
df.loc[df["Income"] >= 250000, "Income"] = 250000
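
The same capping can be written in one step with `Series.clip`, which avoids mistyping the target column in separate `.loc` assignments. This is a minimal sketch on made-up incomes; the bounds mirror the thresholds used above.

```python
import pandas as pd

# Made-up incomes with one negative and one extreme value
toy = pd.DataFrame({"Income": [-500, 2500, 4000, 6000, 400000]})

# Values below 0 become 0 and values above 250000 become 250000
toy["Income"] = toy["Income"].clip(lower=0, upper=250000)
print(toy["Income"].tolist())
```

The same call works for the other capped columns by swapping in their column names and bounds.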

In [None]:
# For LTV

In [None]:
sns.boxplot(x="AutoApproved", y="LTV", showmeans=True, data=df)
plt.xlabel("AutoApproved")
plt.ylabel("LTV")
plt.title("LTV Distribution by Loan Approval")
plt.show()

In [None]:
lower_LTV, upper_LTV, percentage_LTV_outliers=calculate_outlier(df, "LTV")
print("lower band",(lower_LTV))
print("upper band",(upper_LTV))
print("outlier percent",(percentage_LTV_outliers))

In [None]:
df.loc[df["LTV"]<= 0, "LTV" ]=0
df.loc[df["LTV"]>= 2000, "LTV" ]=2000

In [None]:
df[df["LTV"]> 2000]

In [None]:
# For Term

In [None]:
sns.boxplot(x="AutoApproved", y="Term", showmeans=True, data=df)
plt.xlabel("AutoApproved")
plt.ylabel("Term")
plt.title("Term Distribution by Loan Approval")
plt.show()

In [None]:
lower_Term, upper_Term, percentage_Term_outliers=calculate_outlier(df, "Term")
print("lower band",(lower_Term))
print("upper band",(upper_Term))
print("outlier percent",(percentage_Term_outliers))

In [None]:
# Cap Term at 80 months
df.loc[df["Term"] >= 80, "Term"] = 80

In [None]:
# For Price

In [None]:
plt.figure(figsize=(15, 4))
sns.scatterplot(x='AutoApproved', y='Price', data=df)
plt.yscale('log')  # log scale on Price only; AutoApproved is 0/1 and 0 cannot be drawn on a log axis
plt.xlabel('AutoApproved')
plt.ylabel('Vehicle Price (Log Scale)')
plt.title('Vehicle Price by Loan Approval (Logarithmic Scale)')
plt.show()

In [None]:
lower_Price, upper_Price, percentage_Price_outliers=calculate_outlier(df, "Price")
print("lower band",(lower_Price))
print("upper band",(upper_Price))
print("outlier percent",(percentage_Price_outliers))

In [None]:
# For Downpayment

In [None]:
sns.boxplot(x="AutoApproved", y="Downpayment", showmeans=True, data=df)
plt.xlabel("AutoApproved")
plt.ylabel("Downpayment")
plt.title("Downpayment Distribution by Loan Approval")
plt.show()

In [None]:
lower_Downpayment, upper_Downpayment, percentage_Downpayment_outliers=calculate_outlier(df, "Downpayment")
print("lower band",(lower_Downpayment))
print("upper band",(upper_Downpayment))
print("outlier percent",(percentage_Downpayment_outliers))

In [None]:
df.loc[df["Downpayment"] < 0, "Downpayment" ]=0
df.loc[df["Downpayment"]>= 200000, "Downpayment" ]=200000

In [None]:
# For BookValue

In [None]:
sns.boxplot(x="AutoApproved", y="BookValue", showmeans=True, data=df)
plt.xlabel("AutoApproved")
plt.ylabel("BookValue")
plt.title("BookValue Distribution by Loan Approval")
plt.show()

In [None]:
lower_BookValue, upper_BookValue, percentage_BookValue_outliers=calculate_outlier(df, "BookValue")
print("lower band",(lower_BookValue))
print("upper band",(upper_BookValue))
print("outlier percent",(percentage_BookValue_outliers))

In [None]:
df.loc[df["BookValue"] < 0, "BookValue" ]=0
df.loc[df["BookValue"]>= 300000, "BookValue" ]=300000

In [None]:
# For Amount Financed

In [None]:
plt.figure(figsize=(15, 4))
sns.scatterplot(x='AutoApproved', y='Amount Financed', data=df)
# Only the amount axis is log-scaled; AutoApproved is a binary flag and does not belong on a log axis
plt.yscale('log')
plt.xlabel('AutoApproved')
plt.ylabel('Amount Financed (Log Scale)')
plt.title('Amount Financed by Loan Approval (Logarithmic Scale)')
plt.show()

In [None]:
df.loc[df["Amount Financed"] < 0, "Amount Financed" ]=0

In [None]:
# For APR

In [None]:
sns.boxplot(x="AutoApproved", y="APR", showmeans=True, data=df)
plt.xlabel("AutoApproved")
plt.ylabel("APR")
plt.title("APR Distribution by Loan Approval")
plt.show()

In [None]:
lower_APR, upper_APR, percentage_APR_outliers=calculate_outlier(df, "APR")
print("lower band",(lower_APR))
print("upper band",(upper_APR))
print("outlier percent",(percentage_APR_outliers))

In [None]:
df.loc[df["APR"]>= 34, "APR" ]=34

In [None]:
# For Monthly Payment

In [None]:
sns.boxplot(x="AutoApproved", y="Monthly Payment", showmeans=True, data=df)
plt.xlabel("AutoApproved")
plt.ylabel("Monthly Payment")
plt.title("Monthly Payment Distribution by Loan Approval")
plt.show()

In [None]:
lower_Monthly_Payment, upper_Monthly_Payment, percentage_Monthly_Payment_outliers=calculate_outlier(df, "Monthly Payment")
print("lower band",(lower_Monthly_Payment))
print("upper band",(upper_Monthly_Payment))
print("outlier percent",(percentage_Monthly_Payment_outliers))

In [None]:
# For Monthly_Debt

sns.boxplot(x="AutoApproved", y="Monthly Dedt", showmeans=True, data=df)
plt.xlabel("AutoApproved")
plt.ylabel("Monthly Debt")
plt.title("Monthly Debt Distribution by Loan Approval")
plt.show()

In [None]:
lower_Monthly_Debt, upper_Monthly_Debt, percentage_Monthly_Debt_outliers=calculate_outlier(df, "Monthly Dedt")
print("lower band",(lower_Monthly_Debt))
print("upper band",(upper_Monthly_Debt))
print("outlier percent",(percentage_Monthly_Debt_outliers))

In [None]:
df.loc[df["Monthly Dedt"]>= 120000, "Monthly Dedt" ]=120000

In [None]:
columns_to_check = ["Income", "LTV", "Term", "Price", "Downpayment", "BookValue", "Amount Financed",
                    "APR", "Monthly Payment", "Monthly Dedt", "No. of Tradelines", "FICO",
                    "ClearFraud Score", "Vehicle Year", "Vehicle Age", "Vehicle Miles"]

for column in columns_to_check:
    lower, upper, percent_outliers = calculate_outlier(df, column)
    print(f"For {column}:")
    print("Lower band:", lower)
    print("Upper band:", upper)
    print("Outlier percent:", percent_outliers)
    print()
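
The cells above cap each column at a hand-picked ceiling. A data-driven alternative is to clip at IQR fences. The sketch below assumes the common 1.5 × IQR rule; the notebook's own `calculate_outlier` helper may compute its bands differently, and `iqr_bounds` and `cap_outliers` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(series, k=1.5):
    """Clip values outside the IQR fences to the nearest fence (winsorizing)."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)

# Toy values, not taken from the loan dataset
apr = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0, 60.0])
capped = cap_outliers(apr)
print(capped.tolist())
```

Clipping keeps every row, which matches the capping approach already used in this notebook; only the choice of threshold differs.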

##### What all outlier treatment techniques have you used and why did you use those techniques?

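Outliers were handled by capping rather than by dropping rows. For each numeric column, the distribution was inspected with a box plot or scatter plot split by `AutoApproved`, and `calculate_outlier` reported the lower band, upper band, and outlier percentage. Values above a fixed per-column ceiling were then set to that ceiling (for example `APR` at 34, `Downpayment` at 200,000, `BookValue` at 300,000, and `Monthly Dedt` at 120,000), and negative `Downpayment`, `BookValue`, and `Amount Financed` values were set to 0. Capping keeps every application in the dataset while limiting the influence of extreme values.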

### 2. Handling Missing Values

In [None]:
df.isnull().sum()

In [None]:
df[df[["Vehicle Year", "Vehicle Age", "Vehicle Miles"]].isnull().any(axis=1)]

In [None]:
# Numeric columns: fill with the column mean
df["Income"] = df["Income"].fillna(df["Income"].mean())

In [None]:
df["No. of Tradelines"] = df["No. of Tradelines"].fillna(df["No. of Tradelines"].mean())
# Vehicle columns: fill with the most frequent value
df["Vehicle Year"] = df["Vehicle Year"].fillna(df["Vehicle Year"].mode()[0])
df["Vehicle Age"] = df["Vehicle Age"].fillna(df["Vehicle Age"].mode()[0])
df["Vehicle Miles"] = df["Vehicle Miles"].fillna(df["Vehicle Miles"].mode()[0])


In [None]:
df.isnull().sum()
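
The cells above fill `Income` with the mean. As an alternative sketch (not what the notebook does), the median is less affected by a few very large incomes, and a missing-value indicator preserves the fact that a value was absent. The toy frame and the `Income_missing` column name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame reusing two column names from the notebook; the values are made up
toy = pd.DataFrame({
    "Income": [3000.0, 4000.0, np.nan, 5000.0, 100000.0],
    "Vehicle Year": [2015.0, 2018.0, 2018.0, np.nan, 2020.0],
})

# Record which rows were missing before filling, so a model can still use that signal
toy["Income_missing"] = toy["Income"].isna().astype(int)

# The median is less sensitive than the mean to a few very large incomes
toy["Income"] = toy["Income"].fillna(toy["Income"].median())

# The mode suits a discrete column such as the model year
toy["Vehicle Year"] = toy["Vehicle Year"].fillna(toy["Vehicle Year"].mode()[0])

print(toy)
```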

In [None]:
df.drop(columns=['Sales', 'ApplNbr'], inplace=True)

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
df["CreditType"].value_counts()

In [None]:
df["CreditType"] = df["CreditType"].map({"Individual":0, "Joint":1})

#### What all categorical encoding techniques have you used & why did you use those techniques?

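`CreditType` takes two values (`Individual` and `Joint`), so it was mapped directly to 0 and 1 with `Series.map`. A binary mapping needs no additional columns, unlike one-hot encoding.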

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
correlation_matrix = num_col.corr()

# Print correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)

#### 2. Feature Selection

##### Select your features wisely to avoid overfitting
**Debt-to-Income Ratio (DTI)**

- Calculation: divide the total monthly debt (`Monthly Dedt`) by the monthly income (`Income`).
- Interpretation: this ratio is the share of income already used to cover existing debts. A higher DTI could suggest a higher risk of delinquency on a new loan.

In [None]:
df['DTI'] = df['Monthly Dedt'] / df['Income']
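
A plain division produces `inf` wherever `Income` is 0. One way to avoid that at the source, sketched here on toy values, is to treat a zero income as missing before dividing, so the result is `NaN` and can be imputed like any other missing value. The `safe_income` name is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Monthly Dedt": [500.0, 1200.0, 300.0],  # column name spelled as in the dataset
    "Income": [5000.0, 0.0, 1500.0],
})

# Treat zero income as missing so the division yields NaN instead of inf
safe_income = toy["Income"].replace(0, np.nan)
toy["DTI"] = toy["Monthly Dedt"] / safe_income

print(toy)
```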

**Loan-to-Value Ratio with Downpayment (LTV_DP)**

- Calculation: subtract the downpayment (`Downpayment`) from the loan amount (`Amount Financed`) and divide by the vehicle price (`Price`).
- Interpretation: this modified LTV accounts for the downpayment, potentially giving a more nuanced picture of the loan-to-vehicle value ratio. A lower LTV_DP might indicate a lower risk of default.

In [None]:
df['LTV_DP'] = (df['Amount Financed'] - df['Downpayment']) / df['Price']

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
plt.figure(figsize=(18, 8))
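# numeric_only avoids errors on non-numeric columns; a diverging colormap suits values in [-1, 1]
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)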

In [None]:
correlation_matrix = df.corr(numeric_only=True)

correlation_matrix.head(39)

In [None]:
# Check for infinity values
inf_mask = df.isin([np.inf, -np.inf])

# Get rows with infinity values
rows_with_inf = df[inf_mask.any(axis=1)]

print("Rows with infinity values:")
rows_with_inf

In [None]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)

In [None]:
df.isnull().sum()

In [None]:
df["DTI"]=df["DTI"].transform(lambda x:x.fillna(x.mean()))

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

In [None]:
df.loc[df["LTV_DP"] < 0, "LTV_DP" ]=0

In [None]:
x=df.drop("AutoApproved",axis=1)
y=df["AutoApproved"]

In [None]:
stats = x.describe().T.round(2)
stats['skew'] = x.skew().round(2)
stats['kurtosis'] = x.kurtosis().round(2)

stats
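
The table above reports skewness but no transformation is applied afterwards. A minimal sketch of one common follow-up, assuming a cutoff of |skew| > 1 for "highly skewed" (the cutoff and the toy values are assumptions, not taken from the notebook), applies `np.log1p` to the affected columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Price": [10_000.0, 12_000.0, 15_000.0, 18_000.0, 250_000.0],
    "Term": [48.0, 60.0, 60.0, 72.0, 72.0],
})

skew_before = toy.skew()
# Hypothetical cutoff: treat |skew| > 1 as highly skewed
skewed_cols = skew_before[skew_before.abs() > 1].index.tolist()

transformed = toy.copy()
# log1p(x) = log(1 + x) is defined at zero, unlike log(x)
transformed[skewed_cols] = np.log1p(transformed[skewed_cols])
skew_after = transformed.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}).round(2))
```

Tree-based models such as Random Forest are largely insensitive to monotonic transformations, so this step matters mostly for Logistic Regression, KNN, and Naive Bayes.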

### 7. Handling Imbalanced Dataset

In [None]:
df["AutoApproved"].value_counts()

In [None]:
df.groupby("AutoApproved").size().plot(kind="pie", autopct="%.2f%%")

In [None]:
x=df.drop("AutoApproved",axis=1)
y=df["AutoApproved"]

In [None]:
from imblearn.over_sampling import BorderlineSMOTE
smote = BorderlineSMOTE(sampling_strategy='auto', k_neighbors=5)

x, y = smote.fit_resample(x, y)
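
Resampling the full dataset before the train/test split means the test set also contains synthetic rows, which can make test scores look better than they would on real applications. A sketch of the split-first pattern follows. To stay dependency-light it uses plain random oversampling via `sklearn.utils.resample` rather than `BorderlineSMOTE`; the same ordering applies if `smote.fit_resample` is called on the training split only. All data here is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced toy data (90 negatives, 30 positives)
rng = np.random.default_rng(42)
X = pd.DataFrame({"f1": rng.normal(size=120), "f2": rng.normal(size=120)})
y = pd.Series([0] * 90 + [1] * 30, name="AutoApproved")

# 1) Split first, stratified so both parts keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 2) Oversample the minority class in the training part only
train = X_train.assign(AutoApproved=y_train)
majority = train[train["AutoApproved"] == 0]
minority = train[train["AutoApproved"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_up])

X_train_bal = train_balanced.drop(columns="AutoApproved")
y_train_bal = train_balanced["AutoApproved"]

print(y_train_bal.value_counts())
print(y_test.value_counts())
```

The test split keeps its original class ratio, so metrics computed on it reflect the distribution the model would face in practice.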

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

In [None]:
from sklearn.model_selection import train_test_split

x_train,x_test, y_train,y_test=train_test_split(x,y, test_size=0.3, random_state=42)

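The data is split 70/30 into training and test sets (`test_size=0.3`), with `random_state=42` so that the split is reproducible.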

### 9. Data Scaling

In [None]:
from sklearn.preprocessing import RobustScaler
rob_scaler = RobustScaler()
x_train = rob_scaler.fit_transform(x_train)
x_test = rob_scaler.transform(x_test)

In [None]:
import joblib

# Persist the fitted scaler so the same transformation can be applied at prediction time
joblib.dump(rob_scaler, "scaling_loan.pkl")
scaling_loan = joblib.load("scaling_loan.pkl")

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, RocCurveDisplay

# Logistic Regression
logistic_regression = LogisticRegression(max_iter=1000)
# Fit the Algorithm
logistic_regression.fit(x_train, y_train)
# Predict on the model
logistic_pred_y = logistic_regression.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
print(confusion_matrix(y_test,logistic_pred_y))
print(accuracy_score(y_test, logistic_pred_y))
print(classification_report(y_test,logistic_pred_y))

In [None]:
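from sklearn.metrics import roc_curve, auc

# Use predicted probabilities for the positive class; hard 0/1 labels yield only a single threshold
logistic_prob_y = logistic_regression.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, logistic_prob_y)
roc_auc = auc(fpr, tpr)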

# Plotting the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
param_grid={'C': [0.1, 1, 10], "penalty": ["l2"]}
cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
logistic_regression=LogisticRegression(max_iter=1000)
Logclf=GridSearchCV(logistic_regression, param_grid, cv=cv, n_jobs=-1, scoring="f1")
# Fit the Algorithm
Logclf.fit(x_train, y_train)
# Predict on the model
best_params = Logclf.best_params_
best_model = Logclf.best_estimator_
y_pred=best_model.predict(x_test)

In [None]:
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test,y_pred))

### ML Model - 2

In [None]:
# ML Model - 2 Implementation
from sklearn.ensemble import RandomForestClassifier
random_forest= RandomForestClassifier()
# Fit the Algorithm
random_forest.fit(x_train,y_train)
# Predict on the model
randomforest_y_pred=random_forest.predict(x_test)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Evaluation Metric Score chart
print(confusion_matrix(y_test,randomforest_y_pred))
print(accuracy_score(y_test, randomforest_y_pred))
print(classification_report(y_test,randomforest_y_pred))

In [None]:
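# Use predicted probabilities for the positive class; hard 0/1 labels yield only a single threshold
randomforest_prob_y = random_forest.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, randomforest_prob_y)
roc_auc = auc(fpr, tpr)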

# Plotting the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc="lower right")
plt.show()

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}
# Perform grid search with cross-validation
cv = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)
rf_clf = GridSearchCV(random_forest, param_grid, cv=cv, n_jobs=-1, scoring='f1')

# Fit the Algorithm
rf_clf.fit(x_train, y_train)

# Predict on the model
best_params = rf_clf.best_params_
best_model = rf_clf.best_estimator_
rf_y_pred=best_model.predict(x_test)

In [None]:
print(confusion_matrix(y_test,rf_y_pred))
print(accuracy_score(y_test, rf_y_pred))
print(classification_report(y_test,rf_y_pred))

### ML Model - 3

In [None]:
pip install xgboost

In [None]:
import xgboost as xgb

xgb_classifier = xgb.XGBClassifier()

# Fit the model
xgb_classifier.fit(x_train, y_train)

# Predict on the test data
y_pred = xgb_classifier.predict(x_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

### ML Model - 4

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gradient_boosting = GradientBoostingClassifier()

# Fit the model
gradient_boosting.fit(x_train, y_train)

# Predict on the test data
y_pred = gradient_boosting.predict(x_test)

# Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

### ML Model - 5

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)  # Specify the number of neighbors to consider
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

### ML Model - 6

In [None]:
# ML Model - 6 Implementation
from sklearn.naive_bayes import GaussianNB
naive_bayes = GaussianNB()

# Fit the Algorithm
naive_bayes.fit(x_train, y_train)

# Predict on the model
nb_y_pred = naive_bayes.predict(x_test)
print(confusion_matrix(y_test, nb_y_pred))
print(accuracy_score(y_test, nb_y_pred))
print(classification_report(y_test, nb_y_pred))
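
Each model above prints its metrics separately, which makes side-by-side comparison awkward. The sketch below collects the same metrics into one table. It runs on synthetic data from `make_classification`, and the `summarize` helper is a hypothetical name, so the numbers it prints say nothing about the loan dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def summarize(name, model, X_tr, X_te, y_tr, y_te):
    """Fit one model and return its test-set metrics as a dict."""
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    return {
        "model": name,
        "accuracy": accuracy_score(y_te, y_hat),
        "precision": precision_score(y_te, y_hat),
        "recall": recall_score(y_te, y_hat),
        "f1": f1_score(y_te, y_hat),
    }

# Synthetic stand-in for the loan data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
results = pd.DataFrame(
    [summarize(n, m, X_tr, X_te, y_tr, y_te) for n, m in candidates.items()]
).set_index("model")

print(results.round(3))
```

Further candidates (XGBoost, Gradient Boosting, KNN, Naive Bayes) could be added to the `candidates` dict in the same way.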

### 1. Which Evaluation metrics did you consider for a positive business impact and why?



1.   Accuracy: This measures the overall percentage of correct predictions. However, with an imbalanced dataset a model can reach high accuracy simply by favoring the majority class, so accuracy alone is not sufficient.

2.   Precision and Recall: These metrics provide a more detailed picture of class-specific performance. Precision tells us how often a positive prediction is truly correct, while recall indicates how well we identify all actual positive cases.


3.   F1-score: This combines precision and recall into a single metric, offering a balanced view of model performance.
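
To make the three definitions above concrete, the following sketch derives precision, recall, and F1 from the confusion-matrix counts of a small hand-made example (the labels are illustrative, not model output from this notebook):

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels {0, 1}, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted approvals, how many were correct
recall = tp / (tp + fn)     # of the actual approvals, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(f1_score(y_true, y_hat))  # library value for comparison
```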





### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Random Forest was selected as the final model because:

* High Accuracy: Random Forest achieved the highest overall accuracy (77.27%), indicating strong predictive power.

* Balanced Class Performance: It demonstrates a good balance between precision and recall for both classes, ensuring reliable predictions for both majority and minority classes.

* Outlier Resilience: Compared to models like Logistic Regression, Random Forest is less susceptible to outliers in the data, leading to more robust predictions.

In [None]:
from sklearn.inspection import permutation_importance

# Calculate permutation importances
perm_importance = permutation_importance(random_forest, x_test, y_test, n_repeats=10, random_state=42)

# Get the feature importance scores
feature_importance = perm_importance.importances_mean

# Sort the features based on importance scores
sorted_indices = np.argsort(feature_importance)[::-1]
sorted_features = x.columns[sorted_indices]

# Create the bar chart using seaborn
sns.barplot(x=feature_importance[sorted_indices], y=sorted_features)

# Set the labels and title
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.title('Feature Importances')

# Show the bar chart
plt.show()

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Answer Here.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

import joblib

joblib.dump(random_forest, "random_forest_loan.pkl")

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
random_forest_loan = joblib.load('random_forest_loan.pkl')

In [None]:
# Make predictions on the test data
predictions = random_forest_loan.predict(x_test)

In [None]:
predictions=pd.DataFrame(predictions)

In [None]:
# Naming Column
predictions = predictions.rename(columns={0: "AutoApproved"})

In [None]:
predictions.head()
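
The sanity check above works because `x_test` was already transformed by `rob_scaler`. Truly unseen, raw applications would need the same scaling before prediction. One way to guarantee that, sketched here on synthetic data, is to bundle the scaler and the classifier in a scikit-learn `Pipeline` and save that single object. The file name `loan_pipeline_demo.pkl` is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic, unscaled stand-in for the loan features
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = (X_raw[:, 0] > 50.0).astype(int)

# Scaler and classifier travel together as one object
pipe = Pipeline([
    ("scaler", RobustScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_raw, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "loan_pipeline_demo.pkl"  # hypothetical file name
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

# Raw (unscaled) rows go straight in; the pipeline applies the scaler first
new_rows = X_raw[:5]
reloaded_preds = reloaded.predict(new_rows)
print(reloaded_preds)
```

With this layout there is only one artifact to deploy, and the scaler cannot be forgotten or mismatched with the model.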

# **Conclusion**

Our evaluation identified Random Forest as the best model for loan approval prediction due to its high accuracy, its balanced performance on both classes of the imbalanced dataset, and its resilience to outliers. Further analysis using `permutation_importance` can provide a deeper understanding of the model's decision-making and may point to ways of improving its performance.