# EDA

## Date: Nov 7, 2023

---------------

## Table of Contents

## Introduction

In this notebook, we explore the relationship between borrowers' financial characteristics and their loan outcomes. We look for patterns between features, relying on visuals as an aid. This will also help us better understand the data and identify possible feature engineering steps to take when modeling.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path

## Data Dictionary

In [None]:
#Ensure the output is not truncated
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [None]:
#pathlib is used to ensure compatibility across operating systems
try:
    data_destination = Path('../Data/Lending_club/Lending Club Data Dictionary Approved.csv')
    dict_df = pd.read_csv(data_destination, encoding='ISO-8859-1')
    display(dict_df.iloc[:,0:2])
except FileNotFoundError as e:
    # Print the whole exception: it includes the path, and e.args may lack a second element
    print(e)
    print('Check file location')

## Load the Data

In [None]:
# Define the relative path to the file
parquet_file_path = Path('../Data/Lending_club/Cleaned')

try:
    # Read the parquet file
    loans_df = pd.read_parquet(parquet_file_path)
except FileNotFoundError as e:
    # Print the whole exception: it includes the path, and e.args may lack a second element
    print(e)
    print('Check file location')

In [None]:
loans_df.head(5)

## Exploratory Data Analysis

In [None]:
# Separate the data between fully paid and charged off / defaulted loans
paid_loans = loans_df[loans_df['loan_status'] == "Fully Paid"]
defaulted_loans = loans_df[loans_df['loan_status'] == "Charged Off/Default"]

***Loan Status Imbalance***

We will first explore the imbalance in our target variable, i.e. failed versus successful loans. This will become crucial when we start training the models.

In [None]:
# Get the proportion of failed vs successful loans 
loan_status_counts = loans_df['loan_status'].value_counts(normalize=True)

# Place a background grid
sns.set_style("whitegrid")

# Plot the Proportions
loan_status_counts.plot(kind='bar', color='skyblue')
plt.title('Proportion of loans by Status')
plt.xticks(rotation=45) 
plt.xlabel('Loan Status')
plt.ylabel('Proportion')

# Show the plot
plt.tight_layout()
plt.show()

We can see a large difference between the sizes of our categories. This will need to be taken into consideration when we start creating the models.
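
One common way to account for an imbalance like this is to weight each class inversely to its frequency. The cell below is a minimal sketch of that idea, not a step taken in this notebook: the toy `status` series and its 8:2 split are illustrative assumptions standing in for `loans_df['loan_status']`, and the names `imbalance_ratio` and `class_weights` are hypothetical.

```python
import pandas as pd

# Toy stand-in for loans_df['loan_status']; the 8:2 split is illustrative only
status = pd.Series(
    ["Fully Paid"] * 8 + ["Charged Off/Default"] * 2, name="loan_status"
)

# Count the loans in each class
counts = status.value_counts()

# Ratio of the largest class to the smallest class
imbalance_ratio = counts.max() / counts.min()

# "Balanced" weights: n_samples / (n_classes * class_count)
class_weights = (len(status) / (counts.size * counts)).to_dict()

print(imbalance_ratio)
print(class_weights)
```

A dictionary of this shape is the kind of input many classifiers accept for class weighting, so the rarer class contributes more to the loss.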

***Loan Amount***

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(loans_df['loan_amnt'], bins=20, kde=True)
plt.title('Loan Amount Distribution')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.show()

We can see that the majority of loans center around `$10,000`, with a right tail extending to the maximum of `$40,000`.  
This ceiling is due to LC limiting the loan amount to `$40,000`. This gives us a good idea of the range for Loan Amount, as well as how much investors typically risk on a loan.  
More information can be found here:  
https://www.lendingclub.com/help/personal-loan-faq/how-much-can-i-borrow

***Debt to income vs Loan Status***

In [None]:
# DTI vs Loan Status
plt.figure(figsize=(10, 6))
sns.boxplot(x='loan_status', y='dti', data=loans_df)
plt.xticks(rotation=45)
plt.xlabel('Loan Status')
plt.ylabel('Debt to Income Ratio')
plt.title('Debt to Income for Failed and Successful Loans')

plt.show()

Looking at the boxplot, we can see that the median DTI is lower for the successful loans, which also have a smaller IQR. This suggests that borrowers with a lower DTI ratio are more likely to repay their loans.
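
The boxplot reading can be checked numerically by computing the median and IQR of DTI per loan status. The cell below is a small sketch of that check on made-up numbers: the `toy` frame and its values are assumptions for illustration only, and the helper `iqr` is a hypothetical name. In the notebook, `loans_df` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for loans_df; the DTI values are made up for illustration
toy = pd.DataFrame({
    "loan_status": ["Fully Paid"] * 5 + ["Charged Off/Default"] * 5,
    "dti": [8.0, 12.0, 15.0, 18.0, 22.0, 14.0, 19.0, 23.0, 27.0, 35.0],
})

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# One row per loan status, with the median and IQR of DTI as columns
dti_summary = toy.groupby("loan_status")["dti"].agg(median="median", iqr=iqr)

print(dti_summary)
```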

***Number of loans and interest rate over time***

In [None]:
#Link used: https://stackoverflow.com/questions/22276066/how-to-plot-multiple-functions-on-the-same-figure

# Convert 'issue_d' to datetime
loans_df['issue_d'] = pd.to_datetime(loans_df['issue_d'], format='%b-%Y')

#group by issue date and count the number of loans
loans_count = loans_df.groupby(loans_df['issue_d']).size()

# calculate the average interest rate over the same period
average_interest_rate = loans_df.groupby(loans_df['issue_d'])['int_rate'].mean()

fig, ax1 = plt.subplots(figsize=(10, 5))

# number of loans on the left y-axis
ax1.set_xlabel('Issue Date')
ax1.set_ylabel('Number of Loans', color='blue')
ax1.plot(loans_count.index, loans_count, color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

ax2 = ax1.twinx()

#average interest rate on the right y-axis
ax2.set_ylabel('Average Interest Rate', color='red')
ax2.plot(average_interest_rate.index, average_interest_rate, color='red')
ax2.tick_params(axis='y', labelcolor='red')


# Title and show
plt.title('Number of Loans and Average Interest Rate Over Time')
fig.tight_layout()
plt.show()

We can see that there is an inverse relationship between the average interest rate and the number of loans. This suggests that P2P loans are as sensitive to external economic factors as other loans, and that these factors must be considered alongside any conclusions drawn in this project.
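
The visual impression of an inverse relationship can be quantified with a correlation coefficient between the monthly loan count and the monthly average interest rate. The cell below is a sketch on a tiny invented dataset: the `toy` frame is an assumption built so that the relationship is perfectly inverse, and the names `monthly` and `corr` are hypothetical. Applied to `loans_df`, the same steps would give the coefficient for the real data.

```python
import pandas as pd

# Toy stand-in for loans_df: loan volume rises while rates fall (invented values)
toy = pd.DataFrame({
    "issue_d": pd.to_datetime(
        ["2015-01-01"] * 1 + ["2015-02-01"] * 2 + ["2015-03-01"] * 3
    ),
    "int_rate": [14.0, 13.0, 13.0, 12.0, 12.0, 12.0],
})

# Per issue month: number of loans and mean interest rate
monthly = toy.groupby("issue_d")["int_rate"].agg(n_loans="count", avg_rate="mean")

# Pearson correlation between the two monthly series
corr = monthly["n_loans"].corr(monthly["avg_rate"])

print(monthly)
print(corr)
```

Both series trend over time, so a strong coefficient here describes co-movement and does not by itself establish that one drives the other.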

***Interest Rate vs Loan Amount for Fully Paid Loans***

In [None]:
# A hexbin is more appropriate due to the number of datapoints being plotted. The count of each hex is plotted on the right
plt.hexbin(paid_loans['funded_amnt'], paid_loans['int_rate'], gridsize=20, label='Fully Paid')
plt.colorbar()
plt.xlabel('Loan Amount')
plt.xticks(rotation=45) 
plt.ylabel('Interest Rate')
plt.title('Hexbin plot of Interest Rate vs Loan Amount')
plt.show()

***Interest rate by loan Status***

In [None]:
sns.boxplot(data=loans_df, x='loan_status', y='int_rate')
plt.xticks(rotation=45) 
plt.title('Boxplot of Interest Rate by Loan Status')
plt.xlabel('Loan Status')
plt.ylabel('Interest Rate')
plt.show()

There is a difference between fully paid and defaulted / charged off loans. Charged off / defaulted loans have a higher median interest rate, with fully paid loans having one of the lowest. Taken together with the hexbin plot, the majority of loans fall between `$5,000` and `$10,000` with an interest rate of approximately 12%, while the defaulted / charged off loans have a much higher interest rate, placing them further from the central grouping of data on the hexbin plot.

In [None]:
# Plot the data
loans_df['purpose'].value_counts().plot(kind='barh')

# Set the title and labels
plt.title('Purpose of Loans')
plt.xlabel('Frequency')
plt.ylabel('Purpose')

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Retrieve the current figure and axes
fig = plt.gcf()
ax = plt.gca()

# Set the background color for the figure and the axes
fig.set_facecolor('lightblue')
ax.set_facecolor('lightblue') 

# Show the plot
plt.show()

We can see that the purpose of the majority of loans is debt and credit card consolidation. This makes sense, since interest rates for credit cards are usually over `20%`, whereas the interest rate for LC loans averages around `12%`.

In [None]:
# Keep loans at or below the 95th percentile of annual income to limit the effect of outliers
percentile_95 = loans_df['annual_inc'].quantile(0.95)
filtered_loans_df = loans_df[loans_df['annual_inc'] <= percentile_95]

# Create the hexbin plot
plt.figure(figsize=(10, 6))

# Retrieve the current figure and axes
fig = plt.gcf()
ax = plt.gca()

# Set the background color for the figure and the axes
fig.set_facecolor('lightgrey')
ax.set_facecolor('lightgrey')

plt.hexbin(filtered_loans_df['annual_inc'], filtered_loans_df['loan_amnt'], gridsize=45, cmap='plasma')
plt.colorbar(label='Count in bin')
plt.title('Hexbin of Annual Income vs. Loan Amount')
plt.xlabel('Annual Income')
plt.ylabel('Loan Amount')
plt.show()

We can see an odd relationship in this graph, visible as a distinct line through the hexbins. We will research this further.