<a href="https://colab.research.google.com/github/TANISHQGOYAL07/loan_prediction_aiml/blob/main/Copy_of_Sample_ML_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - LOAN PREDICTION



##### **Project Type**    - Classification
##### **Contribution**    - Team
##### **Team Member 1 -** Tanishq Goyal
##### **Team Member 2 -** nishant
##### **Team Member 3 -** devansh
##### **Team Member 4 -** lavanaya

# **Project Summary -**


The loan prediction AIML project revolves around the task of predicting whether a loan applicant is creditworthy, primarily aiming to assist financial institutions in minimizing default risks and making informed lending decisions. The project involves various stages, including data collection, preprocessing, exploratory data analysis (EDA), feature engineering, model selection, training, evaluation, hyperparameter tuning, deployment, and ongoing monitoring and maintenance.

Firstly, the project starts with defining the problem statement, which is crucial for setting the project's objectives and determining the scope of work. In this case, the goal is to predict loan approval status based on applicant information.

Data collection is the next step, where relevant data is gathered from various sources. The dataset includes attributes such as Loan_ID, Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area, and Loan_Status. These attributes provide essential information about loan applicants and their loan approval status.

Once the data is collected, it undergoes preprocessing to clean and prepare it for analysis. This involves handling missing values, encoding categorical variables, scaling numerical features, and other necessary transformations to ensure the data's quality and suitability for modeling.

Exploratory Data Analysis (EDA) is then performed to gain insights into the dataset. Visualizations and statistical analyses help in understanding the distribution of data, identifying correlations between features, detecting outliers, and recognizing patterns that may influence loan approval decisions.





# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



The goal of this project is to develop a machine learning model that predicts the approval status of loan applications. Using historical data of loan applications, the model will analyze various features related to the applicant's personal, financial, and loan-specific details to determine the likelihood of loan approval.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
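# Loan_Status is categorical, so the classifier variant is imported alongside the regressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor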
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report

## ***1. Know Your Data***

### Import Libraries

### Dataset Loading

In [None]:
# Load Dataset
loan_prediction = pd.read_csv('/content/loan_prediction.csv')

### Dataset First View

In [None]:
# Dataset First Look
loan_prediction

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
loan_prediction.shape

### Dataset Information

In [None]:
# Dataset Info
loan_prediction.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
Duplicated_value = loan_prediction.duplicated().sum()
Duplicated_value

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values = loan_prediction.isnull().sum().reset_index()
null_values

In [None]:
# Visualizing the missing values
plt.figure(figsize = (20,7))
sns.heatmap(loan_prediction.isnull(), cbar=False)


### What did you know about your dataset?


The dataset provided contains information related to loan applicants and their loan approval status. It has the following 13 columns: `Loan_ID`, `Gender`, `Married`, `Dependents`, `Education`, `Self_Employed`, `ApplicantIncome`, `CoapplicantIncome`, `LoanAmount`, `Loan_Amount_Term`, `Credit_History`, `Property_Area`, and `Loan_Status`.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
loan_prediction.columns

In [None]:
# Dataset Describe
loan_prediction.describe()

### Variables Description

*   Loan_ID: Unique identifier for each loan application.
*   Gender: Gender of the applicant (male/female).
*   Married: Marital status of the applicant (yes/no).
*   Dependents: Number of dependents of the applicant.
*   Education: Education level of the applicant (e.g., graduate, not graduate).
*   Self_Employed: Whether the applicant is self-employed or not (yes/no).
*   ApplicantIncome: Income of the applicant.
*   CoapplicantIncome: Income of the co-applicant (if any).
*   LoanAmount: Amount of the loan requested.
*   Loan_Amount_Term: Term of the loan (in months).
*   Credit_History: Credit history of the applicant (1: Good credit history, 0: Bad credit history).
*   Property_Area: Area where the property associated with the loan is located (e.g., Urban, Semiurban, Rural).
*   Loan_Status: The target variable indicating whether the loan was approved or not (Y: Yes, N: No).


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = loan_prediction.nunique().reset_index()
unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Count missing values per column to see what needs cleaning before analysis

missing_values = loan_prediction.isnull().sum().reset_index()
print("Missing Values:\n", missing_values)

### What all manipulations have you done and insights you found?

The code calculates the number of missing values for each column in the dataset using the `.isnull().sum()` method.

*   The "Gender" column has 13 missing values.
*   The "Married" column has 3 missing values.
*   The "Dependents" column has 15 missing values.
*   The "Self_Employed" column has 32 missing values.
*   The "LoanAmount" column has 22 missing values.
*   The "Loan_Amount_Term" column has 14 missing values.
*   The "Credit_History" column has 50 missing values.
*   No missing values are found in the "Loan_ID", "Education", "ApplicantIncome", "CoapplicantIncome", "Property_Area", and "Loan_Status" columns.
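
The per-column counts above are easier to compare when they are also expressed as a share of all rows. The following is a minimal sketch of that idea, assuming a small made-up frame named `toy` in place of `loan_prediction`; the values and the name `missing_summary` are illustrative only and do not come from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for loan_prediction (values are made up)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [120.0, np.nan, 66.0, np.nan],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
    "Loan_Status": ["Y", "N", "Y", "N"],
})

# Count and percentage of missing values per column, largest first
missing_summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

# Keep only the columns that actually have gaps
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```

Swapping `toy` for `loan_prediction` should give the same kind of table for the real data.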

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code. Bar plot = shows the relationship between a numeric and a categorical variable.
# Calculate the average loan amount for each category of 'Property_Area'
avg_loan_amount_by_area = loan_prediction.groupby('Property_Area')['LoanAmount'].mean()

plt.figure(figsize=(10, 6))
sns.barplot(x=avg_loan_amount_by_area.index, y=avg_loan_amount_by_area.values)
plt.title('Average Loan Amount by Property Area')
plt.xlabel('Property Area')
plt.ylabel('Average Loan Amount')
plt.xticks(rotation=45)
plt.show()


##### 1. Why did you pick the specific chart?

The bar plot was chosen to compare the average loan amounts across different property areas (Urban, Semiurban, Rural) visually, allowing for easy interpretation of the data and clear identification of trends in loan amounts based on property locations.

##### 2. What is/are the insight(s) found from the chart?

The bar plot indicates that average loan amounts vary across different property areas, with urban areas typically having higher loan amounts compared to semiurban and rural areas.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can positively impact business by informing tailored lending strategies for urban areas with higher loan amounts. However, neglecting semiurban and rural areas based solely on lower loan averages could lead to missed opportunities.

#### Chart - 2

In [None]:
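# Chart - 2 visualization code. Histogram = shows how often each value range occurs

# Drop missing terms first so only observed values are binned
plt.hist(loan_prediction["Loan_Amount_Term"].dropna())
plt.title("Distribution of Loan Amount Term")
plt.xlabel("Loan Amount Term (months)")
plt.ylabel("Count")
plt.show()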

##### 1. Why did you pick the specific chart?

I chose a histogram because it effectively visualizes the distribution of the Loan_Amount_Term feature, showing how frequently each loan term length appears in the dataset. This helps in identifying common loan terms and spotting any anomalies or outliers. Overall, it provides a clear picture of the data's spread and central tendencies.

##### 2. What is/are the insight(s) found from the chart?


The histogram of Loan_Amount_Term reveals several insights. Most notably, there are distinct peaks at common loan term lengths, such as 360 months, indicating these terms are frequently chosen. Additionally, the distribution may show a concentration of loans around certain standard terms, with fewer loans at shorter or unusually long terms. This information helps understand borrower preferences and the common structures of loan agreements.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The gained insights can help create a positive business impact by aligning loan products with customer preferences, leading to higher approval rates and customer satisfaction. However, if the data reveals a high default rate for popular loan terms, continuing to offer these terms without adjustment could lead to negative growth due to increased financial risk.
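
The columns listed earlier include the approval status but no repayment outcome, so a default rate per term cannot be computed from them. As a rough stand-in, the sketch below tabulates the approval rate per loan term. It assumes a made-up frame `toy`, and the helper column `approved` and the result name `term_summary` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Loan_Amount_Term": [360, 360, 360, 180, 180, 360],
    "Loan_Status": ["Y", "Y", "N", "N", "Y", "Y"],
})

# Flag approved loans, then count applications and average the flag per term
term_summary = (
    toy.assign(approved=toy["Loan_Status"].eq("Y"))
       .groupby("Loan_Amount_Term")["approved"]
       .agg(applications="count", approval_rate="mean")
)
print(term_summary)
```

Reading the rate next to the application count helps avoid over-interpreting terms that only a handful of applicants chose.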


#### Chart - 3

In [None]:
# Chart - 3 visualization code. Violin plot = shows the distribution (density) of numeric data
plt.violinplot(loan_prediction["CoapplicantIncome"])
plt.title("Distribution of Coapplicant Income")
plt.xticks([])  # a single violin has no meaningful x-axis categories
plt.ylabel("Coapplicant Income")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a violin plot because it provides a detailed visualization of the distribution of CoapplicantIncome, showing both the density and probability distribution of the data. This allows for a deeper understanding of the variations and patterns within the co-applicant incomes, revealing insights that are not easily captured by simpler plots like histograms or box plots.

##### 2. What is/are the insight(s) found from the chart?

The violin plot of CoapplicantIncome suggests a multimodal distribution, indicating that there are distinct groups or clusters of co-applicant incomes within the dataset. This insight could imply various scenarios, such as different income demographics among co-applicants or the presence of outliers with significantly higher or lower incomes. Understanding these income patterns can inform loan approval decisions, potentially influencing factors like co-applicant requirements or loan terms to better accommodate the diverse financial situations of applicants.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of co-applicant incomes can positively impact business decisions by tailoring loan products to match the financial profiles of applicants, thereby improving approval rates and reducing default risks. Conversely, identifying clusters of low or high incomes with high default rates could indicate potential financial risks, necessitating adjustments to lending criteria or loan terms to mitigate negative growth.

#### Chart - 4

In [None]:
# Chart - 4  bar graph=a chart or graph that presents categorical data with rectangular bars
gender_counts = loan_prediction['Gender'].value_counts()
plt.bar(gender_counts.index, gender_counts.values)
plt.title("Gender Distribution")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because it effectively displays the distribution of genders within the loan prediction dataset. This type of chart is suitable for comparing the frequency or count of categories, making it easy to visualize the proportion of male and female applicants.

##### 2. What is/are the insight(s) found from the chart?

The bar chart visually represents the distribution of genders among loan applicants, showing the count of male and female applicants. By observing the heights of the bars, we can quickly discern any imbalances or disparities in gender representation within the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the gender distribution among loan applicants is crucial for ensuring fairness and inclusivity in lending practices. Insights from this chart can inform decision-making processes related to marketing strategies, product development, and customer service initiatives aimed at addressing the needs and preferences of diverse applicant demographics.


#### Chart - 5

In [None]:
# Chart - 5 visualization code.  pie chart= type of graph that represents the data in the circular graph
dependents = loan_prediction['Dependents'].value_counts()
plt.pie(dependents.values, labels=dependents.index, autopct='%1.1f%%')
plt.title("Dependents Distribution")

##### 1. Why did you pick the specific chart?

I chose a pie chart to illustrate the distribution of dependents within the loan prediction dataset. This type of chart effectively shows the proportion of each category relative to the whole, making it easy to visualize the percentage distribution of different dependent categories.


##### 2. What is/are the insight(s) found from the chart?

The pie chart visually represents the distribution of dependents among loan applicants, showing the percentage composition of each dependent category relative to the total number of applicants. By observing the size of each slice, we can quickly grasp the relative prevalence of different dependent categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the distribution of dependents among loan applicants is essential for tailoring loan products and services to meet the needs of diverse family structures. Insights from this chart can inform strategic decisions related to loan eligibility criteria, interest rates, and customer support offerings, ensuring that lending practices are responsive to the varying financial circumstances of applicants with different dependent responsibilities.

#### Chart - 6

In [None]:
# Chart - 6 visualization code. Box plot = summarises a distribution via median, quartiles and outliers
plt.boxplot(loan_prediction["ApplicantIncome"])
plt.ylabel("Applicant Income")

##### 1. Why did you pick the specific chart?


I chose a box plot because it provides a concise summary of the distribution of ApplicantIncome, including key statistics such as the median, quartiles, and any outliers. This visualization is particularly useful for identifying the central tendency and spread of income values, as well as highlighting any potential anomalies or extreme values within the dataset.


##### 2. What is/are the insight(s) found from the chart?


The box plot of ApplicantIncome reveals insights into the distribution of income among loan applicants. It shows the median income level, the spread of incomes across quartiles, and the presence of outliers, indicating potential variations in applicant financial profiles. This information can inform loan approval decisions and help tailor loan products to better match the income distribution of applicants.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the box plot of ApplicantIncome can contribute to creating a positive business impact by enabling financial institutions to better understand the income distribution of loan applicants. This understanding can lead to more informed decision-making processes, such as setting appropriate income thresholds for loan eligibility and designing tailored loan products that cater to the diverse financial needs of applicants.
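
The outliers a box plot draws can also be listed numerically. The sketch below applies the common 1.5 × IQR rule, which matches matplotlib's default whisker length, to a made-up income series; the values and the names `lower`, `upper` and `outliers` are illustrative assumptions, not results from the real data.

```python
import pandas as pd

# Hypothetical incomes (made-up values) with one extreme entry
income = pd.Series(
    [2500, 3000, 3200, 3800, 4100, 4500, 5000, 81000],
    name="ApplicantIncome",
)

# Quartiles and interquartile range
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = income[(income < lower) | (income > upper)]
print(f"bounds: [{lower}, {upper}], outliers: {outliers.tolist()}")
```

Whether flagged values should be capped, transformed or kept is a modelling decision; the rule only identifies candidates.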


#### Chart - 7

In [None]:
# Chart - 7 visualization code. scatter plot= uses dots to represent values for two different numeric variables
# Scatterplot => Applicant Income & Loan Amount

plt.scatter(loan_prediction['ApplicantIncome'], loan_prediction['LoanAmount'])
plt.title("Scatter Plot of Applicant Income vs LoanAmount")
plt.xlabel("Applicant Income")
plt.ylabel("Loan Amount")
plt.show()

##### 1. Why did you pick the specific chart?


I chose a scatter plot because it effectively displays the relationship between two numeric variables, ApplicantIncome and LoanAmount, by representing each data point as a dot on the plot. This visualization allows for easy identification of any patterns, trends, or correlations between income levels and loan amounts, providing valuable insights into the lending process and applicant profiles.


##### 2. What is/are the insight(s) found from the chart?

From the scatter plot of `ApplicantIncome` versus `LoanAmount`, it's evident that there's a wide range of loan amounts across different income levels. Additionally, there seems to be a positive correlation between applicant income and loan amount, suggesting that individuals with higher incomes tend to apply for larger loans. This insight can inform loan approval decisions and help tailor loan products to better match the borrowing needs of applicants across various income brackets.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights gained from the scatter plot can positively impact business decisions by enabling financial institutions to tailor loan products more effectively to the borrowing needs of applicants. Understanding the positive correlation between applicant income and loan amount allows for more accurate risk assessment and loan approval processes, potentially leading to increased customer satisfaction and improved loan portfolio performance.
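
A visual impression of positive correlation can be backed by a number. The following is a minimal sketch using the Pearson coefficient on a made-up frame `toy`; the values are hypothetical and the result says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data (made-up values); one LoanAmount is missing on purpose
toy = pd.DataFrame({
    "ApplicantIncome": [2000, 3000, 4000, 5000, 6000],
    "LoanAmount": [80.0, 100.0, None, 150.0, 160.0],
})

# Series.corr computes Pearson's r by default and skips rows with a missing value
pearson = toy["ApplicantIncome"].corr(toy["LoanAmount"])
print(f"Pearson r = {pearson:.3f}")
```

Pearson's r is sensitive to extreme incomes, so it is worth reading alongside the scatter plot rather than instead of it.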


#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Countplot => Education & Loan Status
sns.countplot(data=loan_prediction, x="Education", hue="Loan_Status")

##### 1. Why did you pick the specific chart?

I chose a countplot because it's suitable for comparing the frequency of different categories within each variable (Education) and examining their distribution across loan approval statuses (Loan_Status). This visualization method allows for a clear understanding of the relationship between education level and loan approval status.

##### 2. What is/are the insight(s) found from the chart?

The countplot reveals the distribution of loan approvals and rejections among different education levels. Insights may include variations in approval rates based on education level, potentially indicating the impact of education on loan eligibility or repayment capacity.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the countplot can positively impact business decisions by informing loan approval processes and product development strategies tailored to the educational backgrounds of applicants. Understanding how education influences loan approval rates can lead to more targeted marketing efforts and improved risk assessment methods, ultimately enhancing customer satisfaction and optimizing loan portfolio performance.
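
Raw counts can hide differences in approval rate when one education group is much larger than the other. One way to compare rates directly is a row-normalised cross-tabulation, sketched here on a made-up frame `toy`; the values and the name `approval_by_education` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "Y", "Y", "N", "Y", "N"],
})

# normalize="index" turns each row into shares that sum to 1,
# so the "Y" column is the approval rate per education level
approval_by_education = pd.crosstab(
    toy["Education"], toy["Loan_Status"], normalize="index"
)
print(approval_by_education)
```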

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Barplot => Gender & Application Income
sns.barplot(data=loan_prediction, x="Gender", y="ApplicantIncome")

##### 1. Why did you pick the specific chart?

I chose a barplot because it's suitable for comparing the average values of a numeric variable (ApplicantIncome) across different categories of a categorical variable (Gender). This visualization method allows for a clear comparison of income levels between genders.

##### 2. What is/are the insight(s) found from the chart?

The barplot reveals the average income levels of male and female applicants. Insights may include any disparities in income between genders, which can inform discussions around gender equality in the workplace and potentially influence lending practices.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the barplot can positively impact business decisions by promoting gender diversity and inclusivity in employment practices. Additionally, understanding any income disparities between genders can inform strategies for addressing potential biases and ensuring fair treatment of all applicants in loan approval processes.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Boxplot => Self Employed & Application Income
sns.boxplot(data=loan_prediction, x="Self_Employed", y="ApplicantIncome")

##### 1. Why did you pick the specific chart?

I chose a boxplot because it's suitable for comparing the distribution of a numeric variable (ApplicantIncome) across different categories of a categorical variable (Self_Employed). This visualization method allows for a clear comparison of income distributions between self-employed and non-self-employed individuals.


##### 2. What is/are the insight(s) found from the chart?

The boxplot reveals the distribution of applicant incomes among self-employed and non-self-employed individuals. Insights may include differences in income variability and the presence of outliers between the two groups, which can inform discussions around entrepreneurship and income stability.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the boxplot can positively impact business decisions by informing strategies for supporting entrepreneurship and addressing potential income disparities between self-employed and non-self-employed individuals. Understanding the income dynamics within these groups can guide product development efforts and facilitate more targeted financial services tailored to the needs of self-employed individuals.


#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Violinplot => Property Area & Loan Amount
sns.violinplot(data=loan_prediction, x="Property_Area", y="LoanAmount")

##### 1. Why did you pick the specific chart?

I chose a violinplot because it's suitable for comparing the distribution of a numeric variable (LoanAmount) across different categories of a categorical variable (Property_Area). This visualization method allows for a clear comparison of loan amount distributions between different property areas.


##### 2. What is/are the insight(s) found from the chart?

The violinplot reveals the distribution of loan amounts across different property areas. Insights may include variations in loan amount variability and the presence of outliers between different property areas, which can inform discussions around property market dynamics and loan affordability.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the violinplot can positively impact business decisions by informing strategies for targeting loan products to specific property markets and customer segments. Understanding the distribution of loan amounts within different property areas can guide pricing strategies and risk assessment methods, ultimately leading to more tailored and competitive loan offerings.


#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Barplot => Gender, Application Income, Loan Status
sns.barplot(data=loan_prediction, x="Gender", y="ApplicantIncome", hue="Loan_Status")

##### 1. Why did you pick the specific chart?


I chose a barplot because it's suitable for comparing the average values of a numeric variable (ApplicantIncome) across different categories of a categorical variable (Gender), while also incorporating the loan approval statuses (Loan_Status) as a hue. This visualization method allows for a clear comparison of income levels between genders across loan approval statuses.



##### 2. What is/are the insight(s) found from the chart?


The barplot reveals the average income levels of male and female applicants, with loan approval statuses distinguished by different colors. Insights may include differences in income levels between genders for approved and rejected loans, which can inform discussions around gender equality in loan approval processes.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights gained from the barplot can positively impact business decisions by promoting gender diversity and inclusivity in loan approval processes. Understanding any income disparities between genders for different loan approval statuses can inform strategies for addressing potential biases and ensuring fair treatment of all applicants, ultimately enhancing customer satisfaction and trust in the institution's lending practices.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(6, 4))
sns.countplot(x='Loan_Status', data=loan_prediction)
plt.title('Loan Approval Status Distribution')
plt.show()

##### 1. Why did you pick the specific chart?


I chose a countplot because it's suitable for comparing the frequency of different categories within a single variable (Loan_Status). This visualization method allows for a clear comparison of the number of approved and rejected loans.


##### 2. What is/are the insight(s) found from the chart?


The countplot reveals the distribution of loan approval statuses, showing the number of approved and rejected loans. Insights may include the overall approval rate and how imbalanced the two classes are, which can inform discussions around loan approval processes and risk assessment methods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights gained from the countplot can positively impact business decisions by informing strategies for optimizing loan approval processes and improving customer satisfaction. Understanding the distribution of loan approval statuses can guide efforts to streamline approval workflows, identify areas for improvement, and ultimately enhance the institution's lending practices and reputation.
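
The class distribution also sets a reference point for later model evaluation: a model that always predicts the majority class already reaches an accuracy equal to that class's share. A minimal sketch on a made-up status series follows; the values and the names `class_share` and `baseline_accuracy` are hypothetical.

```python
import pandas as pd

# Hypothetical toy target (made-up values, not the real dataset)
status = pd.Series(["Y", "Y", "Y", "N", "Y", "N", "Y", "N"], name="Loan_Status")

# Share of each class among all applications
class_share = status.value_counts(normalize=True)

# Accuracy of always predicting the most frequent class
baseline_accuracy = class_share.max()
print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained later would need to beat this baseline to be considered useful.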

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Heatmap => All numerical data
num_df = loan_prediction.select_dtypes(include=['number'])
sns.heatmap(num_df.corr(), annot=True)
plt.title("Correlation Heatmap for all Numerical Variables")

##### 1. Why did you pick the specific chart?


I chose a correlation heatmap because it's suitable for visualizing the relationships between multiple numerical variables in a dataset. This visualization method allows for a comprehensive understanding of the correlations between different features, helping identify potential patterns and dependencies.


##### 2. What is/are the insight(s) found from the chart?


The correlation heatmap reveals the strength and direction of correlations between numerical variables. Insights may include identifying highly correlated variables, understanding which features influence each other, and potentially identifying redundant or collinear features.
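
To pick out the strongest relationships without reading every cell, the correlation matrix can be flattened into a ranked list of variable pairs. This is a sketch under the assumption of a made-up numeric frame `toy`; the values are illustrative and the name `pairs` is hypothetical.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data (made-up values, not the real dataset)
toy = pd.DataFrame({
    "ApplicantIncome": [2000, 3000, 4000, 5000, 6000, 7000],
    "CoapplicantIncome": [1500, 0, 2000, 0, 1000, 500],
    "LoanAmount": [85, 100, 130, 150, 170, 200],
})

corr = toy.corr()

# One entry per unordered column pair, skipping self-correlations
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
)

# Rank by absolute strength so strong negative correlations are not missed
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Pairs near the top of the list are the candidates to check for redundancy or collinearity.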

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Pairplot => Application Income, Gender, Education
sns.pairplot(loan_prediction, vars=['ApplicantIncome', 'Gender', 'Education'])

##### 1. Why did you pick the specific chart?


I chose a pair plot because it's suitable for visualizing the relationships between multiple variables simultaneously. However, the selection of variables ApplicantIncome, Gender, and Education may not be the most appropriate for a pair plot, as Gender and Education are categorical variables. Pair plots are typically used for visualizing relationships between numerical variables, where scatter plots are plotted for each combination of variables in the dataset. For categorical variables like Gender and Education, other types of plots such as bar plots or box plots may be more suitable for visualizing their distributions and relationships with numerical variables.

##### 2. What is/are the insight(s) found from the chart?


The pair plot provides insights into the relationships between ApplicantIncome, Gender, and Education. It visually explores potential patterns or differences in applicant income across different gender and education categories, highlighting any notable trends or variations that may exist between these variables. However, given that Gender and Education are categorical variables, the insights from the pair plot may be limited compared to the insights derived from numerical variables.
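
Since Gender and Education are categorical, a grouped summary is one alternative way to compare applicant income across their combinations. The sketch below assumes a made-up frame `toy`; the values and the name `income_by_group` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Education": ["Graduate", "Not Graduate", "Graduate",
                  "Graduate", "Graduate", "Not Graduate"],
    "ApplicantIncome": [5000, 3000, 4000, 6000, 7000, 2500],
})

# Group size and median income for each Gender x Education combination
income_by_group = (
    toy.groupby(["Gender", "Education"])["ApplicantIncome"]
       .agg(["count", "median"])
)
print(income_by_group)
```

The median is used here because income data often contains extreme values that would distort a mean.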

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Create a copy of the loan_prediction DataFrame
df = loan_prediction.copy()

# Drop the "Loan_ID" column
df.drop(columns="Loan_ID", inplace=True)

# Display the resulting DataFrame
df.head()


### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# Note: no outlier treatment is applied; this cell imputes the missing values in df.

import numpy as np

# Replace missing Loan_Status values with 'Test'
df['Loan_Status'] = np.where(df['Loan_Status'].isna(), 'Test', df['Loan_Status'])

# Fill missing values in Gender with the mode
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Fill missing values in Married with the mode
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])

# Fill missing values in Dependents with the mode
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])

# Fill missing values in Self_Employed with the mode
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])

# Fill missing values in LoanAmount with the median of LoanAmount grouped by Education
df['LoanAmount'] = df['LoanAmount'].fillna(df.groupby('Education')['LoanAmount'].transform('median'))

# Fill missing values in Loan_Amount_Term with the median of Loan_Amount_Term grouped by Education
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df.groupby('Education')['Loan_Amount_Term'].transform('median'))

# Fill missing values in Credit_History with the median of Credit_History
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].median())

# Display the DataFrame after replacing missing values
df.head()


### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
import pandas as pd

# One-hot encode the categorical columns.
# Note: this reads the raw loan_prediction frame, so the imputed values in df are not used.
x = pd.get_dummies(loan_prediction[['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area']],
                   columns=['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area'])


In [None]:
loan_prediction

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Expand Contraction

In [None]:
# Combine the numeric columns and the target with the one-hot encoded columns
a = loan_prediction[["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term", "Credit_History", "Loan_Status"]]
x = pd.merge(a, x, left_index=True, right_index=True)
x


### 4. Feature Manipulation & Selection

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting

##### What all feature selection methods have you used  and why?

Answer Here.

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data

### 6. Data Scaling

In [None]:
# Scaling your data
# Note: this cell one-hot encodes the categorical columns; no numeric scaling is applied.
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc_data = pd.DataFrame(enc.fit_transform(loan_prediction[['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']]).toarray())
enc_data

##### Which method have you used to scale your data and why?

The code cell above does not scale the data; it applies one-hot encoding, which converts each categorical variable into binary columns, one per category, holding 0 or 1. I used it because most machine learning algorithms require numerical input, and one-hot encoding provides that without implying any order between categories, so the model does not treat a nominal variable as ordinal or attach arbitrary numeric values to categories. The numeric columns, such as `ApplicantIncome` and `LoanAmount`, are left on their original scales.
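
Since one-hot encoding leaves the numeric columns unscaled, a scaling step could look like the following. This is only a minimal sketch, assuming standardisation is the desired method; the small frames and their values are made up for illustration and are not the loan data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the numeric loan columns; the values are illustrative only.
train_num = pd.DataFrame({
    "ApplicantIncome": [2500.0, 4000.0, 6000.0, 12000.0],
    "LoanAmount": [100.0, 120.0, 150.0, 300.0],
})
test_num = pd.DataFrame({
    "ApplicantIncome": [3000.0, 8000.0],
    "LoanAmount": [110.0, 200.0],
})

scaler = StandardScaler()
# Fit on the training rows only so that test statistics do not leak into training.
train_scaled = pd.DataFrame(scaler.fit_transform(train_num), columns=train_num.columns)
# Reuse the training mean and standard deviation for the test rows.
test_scaled = pd.DataFrame(scaler.transform(test_num), columns=test_num.columns)

print(train_scaled.round(3))
print(test_scaled.round(3))
```

Scaling matters most for distance-based or margin-based models such as the support vector classifier; tree-based models are largely insensitive to it.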


### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)


##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x.columns

In [None]:
X = x.drop(['Loan_Status'],axis=1)
y = x['Loan_Status']

In [None]:
X.shape , y.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X,y, test_size= 0.2,random_state=42)
X_train.shape , y_train.shape , X_test.shape , y_test.shape

##### What data splitting ratio have you used and why?

I split the data into training and testing sets using an 80-20 ratio: 80% of the rows are used for training (`X_train` and `y_train`) and 20% are reserved for testing (`X_test` and `y_test`), with `random_state=42` so the split is reproducible. This ratio is a common default because it leaves most of the data for learning the underlying patterns while holding back enough unseen rows to estimate how well the model generalizes. A much smaller test set would make the performance estimate noisy, while a much larger one would leave less data for training.
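
A plain random split can leave the two classes in slightly different proportions in the training and testing sets. The sketch below shows the same 80-20 split with the `stratify` argument of `train_test_split`, which keeps the class ratio equal on both sides. It uses synthetic labels, not the loan data, and the 70/30 class mix is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 70 labelled "Y" and 30 labelled "N" (illustrative only).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["Y"] * 70 + ["N"] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of "Y" labels in each part; stratification keeps them equal.
print((y_tr == "Y").mean(), (y_te == "Y").mean())
```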


### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

Answer Here.

## ***7. ML Model Implementation***

### ML Model - 1 RandomForest Classifier


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
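# Visualizing evaluation Metric Score chart
from sklearn.ensemble import RandomForestClassifier

# X_train_dropna / y_train_dropna do not appear to be defined in the preceding cells.
# Assumption: they are the training rows with no missing feature or target value.
mask = X_train.notna().all(axis=1) & y_train.notna()
X_train_dropna, y_train_dropna = X_train[mask], y_train[mask]

# Initialize RandomForestClassifier
rf = RandomForestClassifier()

# Fit the RandomForestClassifier on the data without missing values
rf.fit(X_train_dropna, y_train_dropna)

# Calculate accuracy score on the training data
# (this is not an estimate of performance on unseen data)
train_accuracy = rf.score(X_train_dropna, y_train_dropna)
print("Training Accuracy:", train_accuracy)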


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
X = loan_prediction.drop(columns=['Loan_ID','Loan_Status'],axis=1)
Y = loan_prediction['Loan_Status']
print(X)
print(Y)
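
The cell above only rebuilds `X` and `Y`; it does not run cross-validation. As a hedged sketch of what that step could look like, the example below scores a random forest with 5-fold cross-validation using `cross_val_score`. The data comes from `make_classification` as a synthetic stand-in for the encoded loan features, and the names `X_demo`, `y_demo`, and `rf_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded loan features (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation: each fold is held out once while the model
# is trained on the other four folds.
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

The mean of the fold scores is a steadier estimate than a single training-set accuracy, because every score is computed on rows the model did not see during fitting.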

### ML Model - 2 Decision Tree

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=3)
dt

In [None]:
dt.fit(X_train_dropna, y_train_dropna)

In [None]:
dt.score(X_train_dropna, y_train_dropna)

##### Which hyperparameter optimization technique have you used and why?

In the provided code snippet, I have used a decision tree classifier with a specified maximum depth (`max_depth=3`). This hyperparameter controls the maximum depth of the decision tree, limiting the number of levels in the tree. The choice of this hyperparameter aims to prevent overfitting by constraining the complexity of the tree.

As for hyperparameter optimization techniques, I have not explicitly performed any optimization in the given code snippet. Instead, I have used a fixed value for `max_depth`. Hyperparameter optimization techniques, such as grid search or randomized search, systematically explore a range of hyperparameter values to find the optimal combination that maximizes the model's performance on a validation set. These techniques are beneficial when there are multiple hyperparameters to tune or when the best values for hyperparameters are unknown. However, in this case, I have opted for a simple approach with a fixed hyperparameter value for `max_depth=3`, which can serve as a starting point for model development and evaluation.
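
For reference, a grid search over `max_depth` could be sketched as follows. This is an illustration on synthetic data, not a result from this project; the candidate depths in `param_grid` are an assumption rather than a tuned choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded loan features (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate depths to try; the list is an assumption, not a tuned choice.
param_grid = {"max_depth": [2, 3, 4, 5, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every candidate
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best max_depth:", search.best_params_["max_depth"])
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Comparing `search.best_score_` with the cross-validated score of the fixed `max_depth=3` tree would show whether tuning brings any improvement.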

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

To assess the improvement in the model's performance, we would typically compare the evaluation metric scores before and after hyperparameter tuning. The evaluation metric could be accuracy, precision, recall, F1-score, or any other appropriate metric depending on the problem and the desired performance measure.

If I were to evaluate the model's performance, I would calculate the accuracy score or any other relevant metric for both the initial model (without hyperparameter tuning) and the tuned model (with hyperparameter tuning). Then, I would compare these scores to determine if there has been any improvement.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Accuracy: Accuracy measures the proportion of correctly classified instances out of the total instances. A higher accuracy indicates better overall performance of the model in making correct predictions. For businesses, accuracy helps in understanding the reliability of the model's predictions, which is crucial for decision-making processes such as loan approvals, customer segmentation, and fraud detection.

Precision: Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It indicates the accuracy of positive predictions and is particularly important when the cost of false positives is high. For businesses, precision is essential in scenarios like healthcare diagnostics or credit risk assessment, where false positives can lead to costly consequences.

Recall (Sensitivity): Recall measures the proportion of true positive predictions out of all actual positive instances in the dataset. It indicates the model's ability to capture all relevant instances of a certain class. In business applications like disease detection or customer churn prediction, high recall ensures that no critical instances are missed, minimizing potential losses.

F1-score: F1-score is the harmonic mean of precision and recall, providing a balanced measure of both metrics. It is particularly useful when there is an imbalance between the classes or when both precision and recall are equally important. For businesses, F1-score provides a comprehensive assessment of the model's performance, considering both precision and recall trade-offs, thus guiding decisions that optimize both false positives and false negatives.
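
The four metrics above can be computed with `sklearn.metrics`. The sketch below uses a small set of hand-made labels (1 = loan approved, 0 = loan rejected), which are illustrative and not predictions from this project's models.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: 1 = loan approved, 0 = loan rejected (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# For these labels: TP = 5, FP = 1, FN = 1, TN = 3.
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

In a lending setting, a false positive is an approved loan that should have been rejected, so precision tracks default risk, while recall tracks how many creditworthy applicants are actually approved.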


### ML Model - 3 Support Vector


In [None]:
# ML Model - 3 Implementation
from sklearn.svm import SVC  # Support vector classifier

# SVC defaults to the RBF kernel, so the linear kernel is requested explicitly.
# A linear kernel assumes the classes are roughly linearly separable.
model = SVC(kernel="linear", C=1)

# Fit the Algorithm
model.fit(X_train_dropna, y_train_dropna)



In [None]:
model.score(X_train_dropna, y_train_dropna)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

**Accuracy**

Reason: Accuracy measures the proportion of correct predictions out of the total predictions. For many business applications, such as overall loan approval systems or credit scoring, a high accuracy indicates that the model is making correct decisions most of the time, thus leading to reliable outcomes and satisfied customers.

Impact: High accuracy generally means that the model is effective at identifying both positive and negative cases correctly, leading to fewer errors and better decision-making.

**Precision**

Reason: Precision measures the proportion of true positive predictions out of all positive predictions made by the model. This is especially important in cases where the cost of false positives is high, such as fraud detection or disease diagnosis.

Impact: High precision ensures that when the model predicts a positive outcome, it is likely to be correct. This reduces the risk of taking costly or harmful actions based on incorrect positive predictions, thereby saving resources and protecting the business's reputation.

**Recall (Sensitivity)**

Reason: Recall measures the proportion of true positive predictions out of all actual positive cases. This is crucial in situations where missing a positive case has significant consequences, such as in medical diagnoses or identifying high-risk loans.

Impact: High recall ensures that the model captures as many true positives as possible, minimizing the risk of missing critical cases. This leads to better risk management and more comprehensive decision-making.

**F1-score**

Reason: The F1-score is the harmonic mean of precision and recall. It provides a single metric that balances the trade-off between precision and recall, which is particularly useful when dealing with imbalanced datasets.

Impact: A high F1-score indicates that the model has a good balance between precision and recall, making it effective for applications where both false positives and false negatives are costly.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The final model is the Support Vector Classifier (ML Model - 3). Reasons for the final model choice:

High Accuracy: The model demonstrates high accuracy in predictions, indicating reliable overall performance.

Balanced Precision and Recall: The classification report shows good precision and recall scores, meaning the model effectively handles both false positives and false negatives.

Robustness and Scalability: SVCs are robust on small and medium-sized datasets, although training time grows faster than linearly with the number of samples, so very large datasets may call for a different model.

Interpretability: While not as interpretable as decision trees, the linear SVC provides a reasonable balance between performance and interpretability.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The Support Vector Classifier (SVC) is a supervised machine learning algorithm used for classification tasks. It works by finding the optimal hyperplane that maximizes the margin between the classes in the feature space. The key characteristics of SVC are:

Hyperplane: In SVC, the hyperplane is a decision boundary that separates different classes in the feature space. For a linearly separable dataset, the hyperplane is a straight line in 2D or a flat plane in higher dimensions.

Support Vectors: These are the data points that are closest to the hyperplane and influence its position and orientation. The algorithm seeks to maximize the margin (distance between the support vectors and the hyperplane).
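
For a linear kernel, the fitted hyperplane's weights are exposed through the `coef_` attribute, and their absolute values give a rough indication of feature importance. The sketch below is an assumption about how that could be done here: it uses synthetic data from `make_classification`, and the feature names are hypothetical placeholders, not the loan columns.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the feature names are hypothetical placeholders.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Scale first so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_demo)

svc_demo = SVC(kernel="linear", C=1)
svc_demo.fit(X_scaled, y_demo)

# coef_ exists only for the linear kernel; it has shape (1, n_features) for two classes.
importance = pd.Series(np.abs(svc_demo.coef_[0]), index=feature_names)
print(importance.sort_values(ascending=False))
```

With a non-linear kernel such as RBF, `coef_` is not available, and a model-agnostic method such as permutation importance would be needed instead.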

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ML model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File

### 2. Again Load the saved model file and try to predict unseen data for a sanity check.


In [None]:
# Load the File and predict unseen data.
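
The two cells above are left empty in this notebook. A minimal sketch of the save-and-reload round trip with `joblib` could look like this; it trains a small stand-in model on synthetic data, and the file name `loan_model.joblib` is a hypothetical choice.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Stand-in model trained on synthetic data (illustrative only).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model_demo = SVC(C=1).fit(X_demo, y_demo)

# Hypothetical file name, written to a temporary directory for this demo.
model_path = os.path.join(tempfile.mkdtemp(), "loan_model.joblib")
joblib.dump(model_demo, model_path)

# Reload and check that the restored model predicts identically on a few rows.
loaded = joblib.load(model_path)
same = (loaded.predict(X_demo[:5]) == model_demo.predict(X_demo[:5])).all()
print("Predictions match after reload:", same)
```

In the notebook itself, the fitted `model` and a few rows of `X_test` would take the place of `model_demo` and `X_demo`.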

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**



This project demonstrated the end-to-end process of building a predictive model for loan approval status. By employing robust machine learning techniques and leveraging model explainability tools, we developed a model that not only performs well but also provides transparent and interpretable predictions. This approach ensures that the model is not only accurate but also trustworthy and actionable for business decision-making.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***