# Loan Repayment Assessment in Banking
### Problem Statement:
Welcome to the KnowledgeHut AI hackathon: Loan Repayment Assessment in Banking. You are required
to build and train a model that predicts whether a customer will repay or default on a loan. The
loan dataset provides a challenging classification problem that will test what you have learnt
in this course.
Task:
Your task is to build this model based on the details in this document and submit it. Please read the
details carefully before attempting this hackathon.
You will need to decide the following:
1. Use the specific source or dataset for assessing loan repayment that was shared with you.
2. What is your intended data split ratio for training, validation, and test sets for the loan dataset? How
do you plan to ensure randomness in this split?
3. Do you plan to explore the importance of these components further?
4. Do you anticipate class imbalance in the 'loan_status' feature, where
Paid: Applicant has fully paid the loan (the principal and the interest)
Defaulted: Applicant has not paid the installments in due time for a long period of time, i.e. the client
has defaulted on the loan
If so, how will you address this imbalance?
5. Will you normalize the features? If yes, what normalization techniques do you have in mind?
6. Do you intend to perform data preprocessing tasks such as outlier detection, missing value handling,
or feature selection before training your model?
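
One way to approach the split question in item 2 is a stratified, seeded split. The sketch below is only an illustration: the 70/15/15 ratio, the `random_state` value, and the synthetic `demo` frame are assumptions, not part of the hackathon brief. Stratifying on `loan_status` keeps the Paid/Defaulted proportions similar in every subset, and fixing the seed makes the random shuffle reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption): 1,000 rows, about 20% "Defaulted".
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "loan_amnt": rng.integers(1_000, 40_000, size=1_000),
    "loan_status": rng.choice(["Paid", "Defaulted"], size=1_000, p=[0.8, 0.2]),
})

X = demo.drop(columns="loan_status")
y = demo["loan_status"]

# Step 1: hold out 15% of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
# Step 2: take 15% of the original total (0.15 / 0.85 of the remainder) as validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.value_counts(normalize=True).round(3))
```

If cross-validation is used for tuning, the separate validation set can be dropped and only the test split kept.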


* ✓Descriptive statistics
* ✓EDA
* ✓Data preprocessing
* ✓Feature scaling
* ✓Feature engineering
* ✓Feature selection
* ✓Build model
* ✓Ensemble techniques - Bagging, Boosting
* ✓Cross validation
* ✓Grid search, hyperparameter tuning
✓Evaluation metric: F1 score
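
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the function name `f1_from_counts` and the counts are hypothetical, chosen only to show the arithmetic.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, treating "Defaulted" as the positive class:
# 60 defaults caught, 40 false alarms, 20 defaults missed.
score = f1_from_counts(tp=60, fp=40, fn=20)
print(round(score, 3))
```

Which class counts as positive matters: F1 for the minority class (here assumed to be Defaulted) is usually much lower than F1 for the majority class, so the choice should be stated alongside the reported score.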

## Data Description of Features

### Column Descriptions

- **earliest_cr_line**: The month the borrower's earliest reported credit line was opened.
- **emp_title**: The job title supplied by the Borrower when applying for the loan.
- **fico_range_high**: The upper boundary range the borrower’s FICO at loan origination belongs to.
- **fico_range_low**: The lower boundary range the borrower’s FICO at loan origination belongs to.
- **grade**: LC assigned loan grade.
- **application_type**: Indicates whether the loan is an individual application or a joint application with two co-borrowers.
- **initial_list_status**: The initial listing status of the loan. Possible values are – W, F.
- **num_actv_bc_tl**: Number of currently active bankcard accounts.
- **mort_acc**: Number of mortgage accounts.
- **tot_cur_bal**: Total current balance of all accounts.
- **open_acc**: The number of open credit lines in the borrower's credit file.
- **pub_rec**: Number of derogatory public records.
- **pub_rec_bankruptcies**: Number of public record bankruptcies.
- **purpose**: A category provided by the borrower for the loan request.
- **revol_bal**: Total credit revolving balance.
- **title**: The loan title provided by the borrower.
- **total_acc**: The total number of credit lines currently in the borrower's credit file.
- **verification_status**: Indicates if income was verified by LC, not verified, or if the income source was verified.
- **addr_state**: The state provided by the borrower in the loan application.
- **annual_inc**: The self-reported annual income provided by the borrower during registration.
- **emp_length**: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
- **home_ownership**: The home ownership status provided by the borrower during registration. Our values are: RENT, OWN, MORTGAGE, OTHER.
- **int_rate**: Interest Rate on the loan.
- **loan_amnt**: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
- **sub_grade**: LC assigned loan subgrade.
- **term**: The number of payments on the loan. Values are in months and can be either 36 or 60.
- **revol_util**: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
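
The `emp_length` description above implies an ordinal scale from 0 to 10. A minimal sketch of turning text labels into that scale follows. The label format (`"< 1 year"`, `"3 years"`, `"10+ years"`) is an assumption about how the column is stored, and `parse_emp_length` is a hypothetical helper, so check `df['emp_length'].unique()` before relying on it.

```python
import re
from typing import Optional


def parse_emp_length(value: Optional[str]) -> Optional[int]:
    """Map an employment-length label to an integer in 0..10 (None if missing)."""
    if value is None:
        return None
    text = str(value).strip().lower()
    # "< 1 year" means less than one year, which the description encodes as 0.
    if text.startswith("<"):
        return 0
    match = re.search(r"\d+", text)
    # Cap at 10 because "10+ years" covers ten or more years.
    return min(int(match.group()), 10) if match else None


print([parse_emp_length(v) for v in ["< 1 year", "3 years", "10+ years", None]])
```

With pandas this could be applied as `df['emp_length'].map(parse_emp_length)`; a missing value (`NaN`) contains no digits and therefore maps to `None`.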

### Target

- **loan_status**: Status of the loan.

### Test Data

- **test_loan_data.csv**: This dataset can be used to test the model.


# Import the libraries for machine learning model training

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


: 

In [None]:
df = pd.read_csv('train_loan_data (1).csv')
df.head()

: 

# Descriptive statistics


In [None]:
df.shape

: 

The training data has 80,000 rows and 28 columns.

In [None]:
# Use describe() to get a statistical summary of the numerical columns
df.describe().T

: 

In [None]:
# Use info() to see the dtype and non-null count of every column
df.info()

: 

The `df.info()` output shows that eight columns have missing values: three of them have the object dtype and five have the float dtype.

In [None]:
df['loan_status'].value_counts()

: 
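
The class counts above feed directly into the imbalance question from the problem statement. As a rough sketch, class shares and "balanced" class weights can be derived from the counts as shown below. The 80/20 split used here is illustrative only and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical label column; the 80/20 split is an assumption for illustration.
labels = pd.Series(["Paid"] * 80 + ["Defaulted"] * 20, name="loan_status")

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# "Balanced" weights: n_samples / (n_classes * class_count), the same heuristic
# scikit-learn applies when class_weight="balanced" is passed to an estimator.
class_weights = len(labels) / (len(counts) * counts)

print(shares)
print(class_weights)
```

Weights like these make errors on the rarer class cost more during training. Resampling the training set is another option, but it should only ever be applied to the training split.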

### Convert the loan status into 0 and 1

## EDA (Exploratory Data Analysis)

EDA is primarily about understanding the dataset. It involves:

### Understanding the Data Structure
- Getting to know the types of data, such as numerical, categorical, and the distribution of data.

### Identifying Patterns and Relationships
- Using visualizations and statistical methods to uncover patterns, trends, and relationships between variables.

### Detecting Outliers
- Finding unusual data points that might need special attention.


### *Understanding the Data Structure*


## Finding the null values in the dataset

In [None]:
# Percentage of missing values in each column
df.isnull().sum()/df.shape[0]*100

: 

The output shows the percentage of missing values in each column.

## Separating numerical and categorical columns


In [None]:
# Split the column names into numerical and categorical groups by dtype
numerical_columns = df.select_dtypes(include=['float64','int64']).columns
categorical_columns = df.select_dtypes(include=['object']).columns

: 

The box plots later in this notebook show that most of the numerical features are affected by outliers, so we impute missing values with the median rather than the mean.
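
To make the median-versus-mean point concrete, here is a small sketch with made-up income figures (they are not taken from the dataset). A single extreme value drags the mean far from the typical value, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical annual incomes, chosen only to illustrate the effect of one outlier.
incomes = [42_000, 48_000, 51_000, 55_000, 60_000]
with_outlier = incomes + [2_000_000]

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

Here the mean jumps from 51,200 to 376,000, while the median only shifts from 51,000 to 53,000.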

### Checking for duplicate rows in the dataset

In [None]:
# Count fully duplicated rows
df.duplicated().sum()

: 

This dataset does not contain any duplicate rows.

## Identifying Patterns and Relationships


# Using Sweetviz
Here we import Sweetviz to analyse the data.

In [None]:
! pip install sweetviz


: 

In [None]:
import sweetviz as sv
report = sv.analyze(df)
report.show_notebook()

: 

# Analysis of numerical columns

In [None]:
#correlation
df1=df.copy()
# Convert 'loan_status' to numerical values
df1['loan_status'] = df1['loan_status'].apply(lambda x: 1 if x == 'Paid' else 0)

# Calculate the correlation matrix for numerical columns
numerical_columns = df1.select_dtypes(include=['float64', 'int64']).columns
correlations = df1[numerical_columns].corr()

# Plot the correlation matrix
plt.figure(figsize=(15, 10))
sns.heatmap(correlations, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


: 
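
A full heatmap can be hard to read with many columns. As a complementary sketch, the correlations with the target alone can be ranked by absolute value. The data below is synthetic (an assumption for illustration): `int_rate` is constructed to relate to the target and `noise` is not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in: loan_status is already encoded as 1 = Paid, 0 = Defaulted.
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "loan_status": target,
    "int_rate": 15 - 4 * target + rng.normal(0, 2, size=n),
    "noise": rng.normal(0, 1, size=n),
})

# Correlation of every numerical column with the target, strongest first.
corr = demo.corr()["loan_status"].drop("loan_status")
order = corr.abs().sort_values(ascending=False).index
target_corr = corr.loc[order]

print(target_corr)
```

Pearson correlation only captures linear relationships with a binary target, so a low value here does not prove a feature is useless.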

### Analysing the data with a **pivot table**: a helper function that builds the table so we can look for relationships with the loan status

In [None]:
# Helper that counts Paid/Defaulted loans per category and computes the default rate
def table(column):
    # Creating the pivot table with count
    pivot_table = pd.pivot_table(df, index=column, columns='loan_status', aggfunc='size', fill_value=0)
    pivot_table['Total'] = pivot_table['Defaulted'] + pivot_table['Paid']
    pivot_table['Default Percentage'] = round((pivot_table['Defaulted'] / pivot_table['Total']) * 100,2)
    # Sorting the pivot table by the default percentage
    pivot_table = pivot_table.sort_values('Default Percentage', ascending=False)

    return pivot_table

: 

## Define a function for creating the **box** plot


In [None]:
# Helper that draws a box plot of a single column
def plot_boxplot(column):
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df, x=column)
    # The plot is not split by loan_status, so the title names only the column
    plt.title(f'Boxplot of {column}')

: 

## Define a function for plotting the **histogram**

In [None]:
def plot_histogram(column):
    plt.figure(figsize=(10, 6))
    sns.histplot(data=df, x=column, hue='loan_status')
    plt.title(f'Histogram of {column} by loan_status')

: 

## Define a function for creating the **count plot**

In [None]:
def plot_countplot(column):
    plt.figure(figsize=(10, 6))
    sns.countplot(data=df, x=column, hue='loan_status')
    plt.title(f'Countplot of {column} by loan_status')

: 

## Creating the histograms

In [None]:
# Plot histograms for each numerical column with hue='loan_status'
for column in numerical_columns:
    plot_histogram(column)

: 

In [None]:
plot_histogram('annual_inc')

: 

In [None]:
plot_histogram('fico_range_high')

: 

In [None]:
plot_histogram('fico_range_low')

: 

In [None]:
plot_histogram('loan_amnt')

: 

In [None]:
plot_histogram('open_acc')

: 

In [None]:
plot_histogram('pub_rec')

: 

In [None]:
plot_histogram('pub_rec_bankruptcies')

: 

In [None]:

plot_histogram('revol_bal')

: 

In [None]:
plot_histogram('total_acc')

: 

In [None]:
plot_histogram('mort_acc')

: 

In [None]:
plot_histogram('revol_util')

: 

In [None]:
numerical_columns

: 


Plot the distribution of the interest rate for each loan status.

In [None]:
plt.figure(figsize=(10, 5))
# int_rate is continuous, so a histogram is used here rather than a count plot
sns.histplot(x='int_rate', hue='loan_status', data=df)
plt.title('Interest Rate vs. Loan Status')
plt.show()


: 

This plot shows that loans with a higher interest rate default more often.

In [None]:
sns.countplot(x='term', hue='loan_status', data=df)
plt.title('Term vs. Loan Status')
plt.show()

: 

This plot shows that defaults increase as the loan term gets longer.

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(x='emp_length', hue='loan_status', data=df)
plt.title('Employment Length vs. Loan Status')
plt.show()


: 

This graph shows that borrowers with a long employment history take more loans than the other groups, which also raises their number of defaulters. Employment length on its own does not separate defaulted customers from paid customers well.

In [None]:
# Default rate by home ownership, using the table() helper defined earlier
table('home_ownership')


: 

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(x='home_ownership', hue='loan_status', data=df)
plt.title('Home Ownership vs. Loan Status')
plt.show()


: 

In this graph, most borrowers have a mortgage, rent, or own their home, and defaulters are more common among renters than in the other groups.

In [None]:
# Default rate by loan purpose, using the table() helper defined earlier
table('purpose')


: 

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(y='purpose', hue='loan_status', data=df)
plt.title('Loan Purpose vs. Loan Status')
plt.show()


: 

In [None]:
# Default rate by borrower state, using the table() helper defined earlier
table('addr_state')


: 

In [None]:
# Default rate by loan grade, using the table() helper defined earlier
table('grade')


: 

In [None]:
# Default rate by application type, using the table() helper defined earlier
table('application_type')


: 

In [None]:
# Default rate by initial listing status, using the table() helper defined earlier
table('initial_list_status')


: 

In [None]:
# Default rate by loan subgrade, using the table() helper defined earlier
table('sub_grade')


: 

In [None]:
# Default rate by loan term, using the table() helper defined earlier
table('term')


: 

In [None]:
# Default rate by loan title, using the table() helper defined earlier
table('title')


: 

In [None]:
# Default rate by income verification status, using the table() helper defined earlier
table('verification_status')


: 

In [None]:
categorical_columns

: 

In [None]:
plt.figure(figsize=(15, 10))
sns.countplot(y='addr_state', hue='loan_status', data=df)
plt.title('Address State vs. Loan Status')
plt.show()


: 

## Finding the outliers with box plots

In [None]:
# Before filling the missing values, plot a boxplot of each numerical variable to check for outliers
for column in numerical_columns:
    plt.figure(figsize=(10, 5))
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot of {column}')
    plt.show()


: 

## Preprocessing

Preprocessing involves preparing the data for analysis or modeling. It includes:

### Handling Missing Values
- Filling, imputing, or dropping missing data points.

### Encoding Categorical Variables
- Converting categorical variables into numerical formats that machine learning models can understand.

### Scaling and Normalization
- Adjusting the scale of numerical features to ensure they are on a similar scale.

### Feature Engineering
- Creating new features from existing data to improve model performance.

### Data Cleaning
- Removing or correcting erroneous data points.
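
The steps listed above can be chained into a single scikit-learn pipeline so that imputation, scaling, and encoding are fitted only on the training folds during cross-validation. The sketch below is a minimal illustration under stated assumptions: the synthetic `demo` frame, the three chosen columns, and the `LogisticRegression` classifier are placeholders, not the notebook's actual model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for a few loan columns (assumption, not the real data).
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "annual_inc": rng.normal(60_000, 15_000, size=n),
    "int_rate": rng.uniform(5, 25, size=n),
    "home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], size=n),
})
# Inject some missing values so the imputer has work to do.
demo.loc[demo.sample(20, random_state=1).index, "annual_inc"] = np.nan
# Target loosely tied to int_rate: 1 = Paid, 0 = Defaulted.
y = (demo["int_rate"] + rng.normal(0, 5, size=n) < 18).astype(int)

numeric = ["annual_inc", "int_rate"]
categorical = ["home_ownership"]

preprocess = ColumnTransformer([
    # Numerical columns: median imputation, then standardisation.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Categorical columns: most-frequent imputation, then one-hot encoding.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# 5-fold cross-validated F1 for the class encoded as 1.
scores = cross_val_score(model, demo, y, cv=5, scoring="f1")
print(scores.mean())
```

Keeping the preprocessing inside the pipeline avoids leaking statistics such as medians or scaling parameters from validation rows into training.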


In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

: 

#### Converting the target variable into numerical values

In [None]:
# Encode the target in one vectorised step: 1 = Paid, 0 = anything else (Defaulted)
df['loan_status'] = (df['loan_status'] == 'Paid').astype(int)

: 

### Treating the missing values with SimpleImputer and ColumnTransformer

In [None]:
# List of numerical and categorical features to fill
num_features = ['mort_acc', 'pub_rec_bankruptcies', 'revol_util', 'num_actv_bc_tl', 'tot_cur_bal']
cat_features = ['emp_title', 'emp_length', 'title']

# Define the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), num_features),
        ('cat', SimpleImputer(strategy='most_frequent'), cat_features)
    ])

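# Apply the transformations (the output columns are ordered num_features, then cat_features)
df[num_features + cat_features] = preprocessor.fit_transform(df)

# fit_transform returns one mixed-type array, so restore the numeric dtype
df[num_features] = df[num_features].astype(float)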

# Check the result
print(df.isnull().sum())


: 

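# Handling the outliers

In [None]:
# Handle outliers in numerical columns using the IQR method.
# The target is excluded: on a binary 0/1 column the IQR can be 0, which would
# flag the whole minority class as outliers and overwrite it with the median.
outlier_columns = [c for c in numerical_columns if c != 'loan_status']
for column in outlier_columns:
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Replace values outside the IQR fences with the column median
    df[column] = np.where((df[column] < lower_bound) | (df[column] > upper_bound),
                          df[column].median(), df[column])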

# Verify outlier treatment
for column in numerical_columns:
    plt.figure(figsize=(10, 5))
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot of {column} After Outlier Handling')
    plt.show()


: 
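
Replacing outliers with the median discards how extreme a value was. An alternative, shown here only as a sketch and not as the notebook's method, is to clip values to the IQR fences so they stay at the edge of the range. The helper name `clip_to_iqr` and the sample balances are hypothetical.

```python
import pandas as pd


def clip_to_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of replacing them."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical revolving balances with one extreme value.
balances = pd.Series([1_000, 1_200, 1_500, 1_700, 2_000, 50_000], name="revol_bal")
clipped = clip_to_iqr(balances)

print(clipped.tolist())
```

Clipping keeps the ordering of observations (the largest balance is still the largest), which median replacement does not. For columns where most values are identical, such as counts that are usually 0, either approach can flatten the column, so it is worth checking the IQR before applying it.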
