# Effects of Borrower Characteristics on Loan Repayment

## Investigation Overview

In this investigation, I looked at the characteristics of loan borrowers that could be used to predict their loan repayment behaviour. The main focus was on:
> - Home Ownership
> - Monthly Income
> - Credit Score
> - Borrower State
> - Employment Status

## Dataset Overview

The cleaned dataset contains records for 83,507 loan borrowers, each with 15 attributes. These include the borrower traits of interest listed above, among others.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

In [2]:
# load the dataset into a pandas dataframe
clean_loan_df = pd.read_csv('Datasets/clean_loan_data.csv')

In [None]:
# Data wrangling: remove records with outliers in the stated monthly income column

# Calculate the lower and upper quartile values once (no per-row loop is needed).
q25, q75 = np.percentile(clean_loan_df['StatedMonthlyIncome'], [25, 75])

# Calculate the interquartile range
intr_qr = q75 - q25

# Calculate the minimum and maximum acceptable values for the stated monthly income entries (1.5 x IQR rule).
maxim = q75 + (1.5 * intr_qr)
minim = q25 - (1.5 * intr_qr)

# Replace the outliers with np.nan
clean_loan_df.loc[clean_loan_df['StatedMonthlyIncome'] < minim, 'StatedMonthlyIncome'] = np.nan
clean_loan_df.loc[clean_loan_df['StatedMonthlyIncome'] > maxim, 'StatedMonthlyIncome'] = np.nan

# Drop the records with null entries
clean_loan_df.dropna(inplace=True)
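
The usual 1.5 × IQR rule (Tukey's fences) places the lower bound at Q1 − 1.5·IQR and the upper bound at Q3 + 1.5·IQR. As a sanity check, here is a minimal sketch of that rule on a small made-up series. The helper name `iqr_bounds` and the sample incomes are illustrative assumptions, not part of the loan dataset.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) Tukey fences for numeric values, ignoring NaNs.

    Hypothetical helper for illustration only.
    """
    q25, q75 = np.nanpercentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr


# Made-up incomes for illustration; 50000 is an obvious outlier.
incomes = pd.Series([2000, 2500, 3000, 3500, 4000, 50000], dtype=float)

lower, upper = iqr_bounds(incomes)

# Keep only the values that fall inside the fences.
filtered = incomes[incomes.between(lower, upper)]
```

Filtering with a boolean mask like this avoids the intermediate step of writing `np.nan` into the column, which matters if other columns already contain missing values that should not trigger a row drop.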

## Distribution of Stated Monthly Income
`Stated Monthly Income` is the monthly income that a borrower reported receiving from their occupation or employment.

In [None]:
# A Histogram of Stated Monthly Income
sns.set_theme(style="whitegrid")

bins = np.arange(0, clean_loan_df.StatedMonthlyIncome.max()+500, 500)
plt.hist(data = clean_loan_df, x = 'StatedMonthlyIncome', bins=bins);
plt.title('A Histogram of the Stated Monthly Income', size = 18)
plt.xlabel('Stated Monthly Income')
plt.ylabel('Frequency');
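
The bin edges passed to `plt.hist` are built with `np.arange`, whose stop value is exclusive, which is why one extra bin width is added to the maximum. A minimal sketch with a made-up maximum (the value `7320.0` is an assumption for illustration) shows that the last edge then reaches past the largest observation:

```python
import numpy as np

bin_width = 500
max_income = 7320.0  # hypothetical maximum, for illustration only

# np.arange excludes the stop value, so pad the maximum by one bin width
# to guarantee that the final edge lies at or beyond the largest value.
bins = np.arange(0, max_income + bin_width, bin_width)
```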

## Distribution of Borrowers in every state

In [None]:
# Using the value_counts() function to view the number of borrowers per state.
borrowers_by_state = clean_loan_df.BorrowerState.value_counts()

# Convert the Series to a pandas dataframe, naming both columns explicitly so the
# result does not depend on the default column names of the installed pandas version.
borrowers_by_state = (borrowers_by_state
                      .rename_axis('State')
                      .reset_index(name='borrower_population'))

# Using the newly created dataframe, generate a choropleth map indicating the population of borrowers in every state.
fig = px.choropleth(borrowers_by_state,
                   locations='State',
                    locationmode='USA-states',
                   scope = 'usa',
                   color='borrower_population',
                   color_continuous_scale=px.colors.sequential.Inferno_r)
fig.update_layout(title_text = 'Borrower population by State',
                  title_font_size = 22,
                  title_font_color = 'black',
                  title_x = 0.5)
fig.show();

## Number of Borrowers grouped by Occupation

In [None]:
# Set the theme of the visualization
sns.set_theme(style="whitegrid")

# Set the size of the visualization
f, ax = plt.subplots(figsize=(9,20))

# Set the color of the visualization
base_color = sns.color_palette()[0]

# Define the order in which the bars will appear in the visualization
occupations_order = clean_loan_df.Occupation.value_counts().index

# Visualize the bar graph.
sns.countplot(data=clean_loan_df, y = 'Occupation', color = base_color, order=occupations_order);

# Set the labels and plot title
plt.title('A Visualization of the Number of Borrowers in Every Listed Occupation')
plt.xlabel('Population')
plt.ylabel('Occupation');

## Loan Borrowers grouped by Home Ownership

In [None]:
# Set the theme of the visualization
sns.set_theme(style="whitegrid")

# Set the size of the visualization
f, ax = plt.subplots(figsize=(10,6))

# Set the color of the visualization
base_color = sns.color_palette()[0]

# Define the order in which the bars will appear in the visualization
arr_order = clean_loan_df.IsBorrowerHomeowner.value_counts().index

# Visualize the bar graph.
sns.countplot(data=clean_loan_df, x = 'IsBorrowerHomeowner', color = base_color, order=arr_order);

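# Label the bars to match the plotted order instead of assuming homeowners come first
# (assumes IsBorrowerHomeowner holds boolean values).
ax.set_xticklabels(['Owned House' if owns else 'No House' for owns in arr_order], size=14)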

# Set the labels and plot title
plt.title('Home Ownership status of Loan Borrowers', size = 22)
plt.xlabel('Home Ownership Status', size = 18)
plt.ylabel('Population', size = 18);

## `Stated Monthly Income` Versus `Monthly Loan Payment`

In [None]:
sns.regplot(data=clean_loan_df, x='StatedMonthlyIncome', y='MonthlyLoanPayment', truncate=False, x_jitter=0.7, scatter_kws={'alpha':1/20});
plt.title('A scatter plot of monthly loan payment against stated monthly income');
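
The trend line drawn by `regplot` can be paired with a Pearson correlation coefficient to put a number on the strength of the linear relationship. The sketch below uses synthetic data only: the values are randomly generated under an assumed linear relationship, and the column names merely mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; not drawn from the loan dataset.
rng = np.random.default_rng(0)
income = rng.uniform(1000, 10000, size=500)
payment = 0.05 * income + rng.normal(0, 50, size=500)

demo_df = pd.DataFrame({
    'StatedMonthlyIncome': income,
    'MonthlyLoanPayment': payment,
})

# Pearson correlation between the two columns (a value between -1 and 1).
r = demo_df['StatedMonthlyIncome'].corr(demo_df['MonthlyLoanPayment'])
```

On the real data the same one-line `.corr()` call could be applied to `clean_loan_df`; the resulting value would depend on the dataset and is not reproduced here.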

## `Credit Score Mid_range` versus `Monthly Loan Payment`

In [None]:
sns.regplot(data=clean_loan_df, x='MonthlyLoanPayment', y='CreditScoreMid_range', truncate=False, y_jitter=0.7, scatter_kws={'alpha':1/20});
plt.title('A scatter plot of Credit Score Mid_range against Monthly Loan Payment');

## Monthly Loan Payment by Employment Status

In [None]:
color = sns.color_palette()[0]
sns.violinplot(data = clean_loan_df, y = 'EmploymentStatus', x = 'MonthlyLoanPayment', color=color)
plt.title('Monthly Loan Payment by Employment Status');
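
A violin plot shows the whole distribution for each group; the per-group medians can be tabulated alongside it for a quick numeric comparison. A minimal sketch using a handful of made-up rows (the values are assumptions for illustration, only the column names come from this notebook):

```python
import pandas as pd

# Made-up rows for illustration; not taken from the loan dataset.
demo_df = pd.DataFrame({
    'EmploymentStatus': ['Employed', 'Employed', 'Self-employed',
                         'Self-employed', 'Retired'],
    'MonthlyLoanPayment': [300.0, 500.0, 200.0, 400.0, 150.0],
})

# Median monthly payment per employment status, largest first.
medians = (demo_df
           .groupby('EmploymentStatus')['MonthlyLoanPayment']
           .median()
           .sort_values(ascending=False))
```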

In [None]:
!jupyter nbconvert Loan_Data_Exploration_Part2.ipynb --to slides --post serve --no-input --no-prompt