# Part I - Prosper Loan Data Analysis
## by Selasi Ayittah Randy



## Preliminary Wrangling

> The dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.


In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [None]:
#Load the dataset
loan = pd.read_csv("prosperLoanData.csv")

In [None]:
loan_data = loan.copy()

In [None]:
#Shape of the dataset
loan_data.shape

In [None]:
loan_data.dtypes

In [None]:
#Checking for columns with null values
loan_data.isnull().sum()

In [None]:
loan_data.describe()

In [None]:
selected_columns = ['LoanOriginalAmount', 'BorrowerAPR',"ProsperScore", 'StatedMonthlyIncome', 'Term', 'ProsperRating (Alpha)', 
        'EmploymentStatus','LoanStatus']

In [None]:
#Selected columns of interest
loan_data_cols =loan_data[selected_columns]
loan_data_cols

In [None]:
loan_data_cols.info()

In [None]:
#Drop rows with missing APR
# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
loan_data_cols = loan_data_cols[~loan_data_cols["BorrowerAPR"].isnull()].copy()

In [None]:
loan_data_cols.info()

### What is the structure of your dataset?

> The dataset comprises 113,937 rows and 81 columns.

### What is/are the main feature(s) of interest in your dataset?

> I am interested in finding out which features are best for predicting the borrower APR for a loan.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> I expect that the larger the loan, the lower the APR, and that borrowers with a higher stated monthly income will have larger loan amounts.

## Univariate Exploration


In [None]:
#Distribution of the Loan Original Amount
binsize = 2500
bins = np.arange(0, loan_data_cols['LoanOriginalAmount'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = loan_data_cols, x = 'LoanOriginalAmount', bins = bins)
plt.xlabel('LoanOriginalAmount')
plt.title('Distribution of the LoanOriginalAmount')
plt.show()

The distribution of the loan original amount is right-skewed, with most borrowers receiving less than 20k.
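
A quick numeric check can back up the visual impression of right skew: in a right-skewed sample the mean sits above the median and the skewness coefficient is positive. The sketch below uses a handful of invented loan amounts (not values from the Prosper data) purely to illustrate the check; on the real data the same calls would be applied to `loan_data_cols['LoanOriginalAmount']`.

```python
import pandas as pd

# Invented loan amounts for illustration only; not taken from the Prosper data.
amounts = pd.Series([1000, 2000, 3000, 4000, 4000, 5000, 10000, 15000, 25000, 35000])

mean_amount = amounts.mean()
median_amount = amounts.median()
skewness = amounts.skew()  # sample skewness; positive values indicate a right tail

print(f"mean={mean_amount:.0f}, median={median_amount:.0f}, skew={skewness:.2f}")
```

A mean well above the median, together with a positive skewness, is consistent with the long right tail seen in the histogram.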

In [None]:
# start with a standard-scaled plot
binsize = 0.01
bins = np.arange(0, loan_data_cols['BorrowerAPR'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = loan_data_cols, x = 'BorrowerAPR', bins = bins)
plt.xlabel('BorrowerAPR')
plt.show()

- Most loans have an APR below 0.43; very few loans have an APR above 0.43.

In [None]:
# Check loans with an APR greater than 0.43
loan_data_cols[loan_data_cols.BorrowerAPR>0.43]

- Loans with an APR greater than 0.43 have neither a ProsperRating nor a ProsperScore.
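
One way to confirm this observation without scanning the table by eye is to measure the share of missing values among the high-APR rows. The snippet below is a minimal sketch on a small made-up frame (the rows are assumptions for illustration, not Prosper records); with the real data, `toy` would be replaced by `loan_data_cols`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for loan_data_cols; the values are invented for illustration only.
toy = pd.DataFrame({
    "BorrowerAPR": [0.12, 0.25, 0.45, 0.51],
    "ProsperRating (Alpha)": ["A", "C", np.nan, np.nan],
    "ProsperScore": [9.0, 5.0, np.nan, np.nan],
})

# Keep only the loans above the 0.43 threshold
high_apr = toy[toy["BorrowerAPR"] > 0.43]

# Fraction of missing values per column among those loans (1.0 means all missing)
missing_share = high_apr[["ProsperRating (Alpha)", "ProsperScore"]].isna().mean()
print(missing_share)
```

A share of 1.0 in both columns would mean that every loan above the threshold lacks a rating and a score.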

In [None]:
loan_data_cols.info()

In [None]:
#Convert the ProsperRating column to an ordered category type
rate_order = ['HR','E','D','C','B','A','AA']
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                    categories = rate_order)
loan_data_cols['ProsperRating (Alpha)'] = loan_data_cols['ProsperRating (Alpha)'].astype(ordered_var)
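
The point of the ordered dtype is that sorting, comparisons, and plot axes follow the rating scale rather than alphabetical order. The short sketch below demonstrates this on a few invented ratings (an assumption for illustration, not rows from the dataset), using the same `rate_order` as the cell above.

```python
import pandas as pd

rate_order = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']
ordered_var = pd.api.types.CategoricalDtype(ordered=True, categories=rate_order)

# Invented ratings, used only to demonstrate the behaviour of the dtype.
ratings = pd.Series(['A', 'HR', 'C', 'AA', 'B']).astype(ordered_var)

sorted_ratings = ratings.sort_values().tolist()  # follows rate_order, not the alphabet
at_least_b = (ratings >= 'B').tolist()           # comparisons respect the ordering
lowest, highest = ratings.min(), ratings.max()

print(sorted_ratings, at_least_b, lowest, highest)
```

Without the ordered dtype, a comparison such as `ratings >= 'B'` would fall back to plain string comparison, which does not reflect the rating scale.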


In [None]:
#Studying the Employment Status 
loan_data_cols["EmploymentStatus"].unique()

In [None]:
loan_data_cols["EmploymentStatus"].value_counts()

In [None]:
fig, ax = plt.subplots(nrows=3, figsize = [10,10])

default_color = sb.color_palette()[0]
sb.countplot(data = loan_data_cols, x = 'EmploymentStatus', color = default_color, ax = ax[0])
sb.countplot(data = loan_data_cols, x = 'Term', color = default_color, ax = ax[1])
sb.countplot(data = loan_data_cols, x = 'ProsperRating (Alpha)', color = default_color, ax = ax[2])
plt.xticks(rotation=45);
plt.show()

- Most of the borrowers are employed or working full time.
- Most loans have a term of 36 months (3 years).
- The most common Prosper rating is C, followed by B.

In [None]:
loan_data_cols.dtypes

## Bivariate Exploration

> This section investigates relationships between pairs of the variables
introduced in the univariate exploration above.

In [None]:
numeric_vars = ['LoanOriginalAmount', 'BorrowerAPR',  'StatedMonthlyIncome']
categoric_vars = ['EmploymentStatus', 'ProsperRating (Alpha)','Term']

In [None]:
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(loan_data_cols[numeric_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.show()

In [None]:
# plot matrix: sample 500 loans so that plots are clearer and they render faster
loan_data_cols_samp = loan_data_cols.sample(n=500, replace = False)
print("loan_data_cols_samp.shape =", loan_data_cols_samp.shape)

g = sb.PairGrid(data = loan_data_cols_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter)

- The correlation coefficient between borrower APR and loan amount is -0.323. The scatter plots show that the two variables are negatively correlated: the higher the loan amount, the lower the borrower's APR.
- There is a positive correlation between the borrower's stated monthly income and the loan original amount.
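
Pearson coefficients such as the ones in the heatmap can be pulled around by a few extreme values, which is a common concern for income-type columns. As a cross-check, a rank-based (Spearman) coefficient can be obtained by correlating the ranks. The sketch below uses invented numbers, including one deliberately extreme income, so the values are assumptions and not results from the Prosper data.

```python
import pandas as pd

# Invented values for illustration; the last income is a deliberate outlier.
toy = pd.DataFrame({
    'LoanOriginalAmount': [2000, 5000, 10000, 15000, 25000],
    'BorrowerAPR': [0.35, 0.30, 0.22, 0.18, 0.12],
    'StatedMonthlyIncome': [2500, 3000, 5000, 6500, 90000],
})

pearson = toy.corr()          # linear correlation, as used in the heatmap
spearman = toy.rank().corr()  # Pearson on ranks, i.e. Spearman's rank correlation

print(pearson.round(3))
print(spearman.round(3))
```

If the two matrices agree in sign and rough magnitude on the real data, the conclusions drawn from the heatmap are less likely to be driven by outliers.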

In [None]:
## plot matrix of numeric features against categorical features.

def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x=x, y=y, color=default_color)

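# PairGrid creates its own figure, so a separate plt.figure() call is not needed
g = sb.PairGrid(data = loan_data_cols, y_vars = ['BorrowerAPR', 'StatedMonthlyIncome', 'LoanOriginalAmount'], x_vars = categoric_vars,
                height = 5, aspect = 1.5)
g.map(boxgrid)
# rotate the tick labels on every subplot rather than only the last one
for subplot_ax in g.axes.flat:
    subplot_ax.tick_params(axis = 'x', labelrotation = 45)
plt.show();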

- The borrower APR decreases as the Prosper rating improves.
- Borrowers who are employed receive higher loan original amounts.
- The borrower APR decreases as the loan term increases.
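
The trend read off the box plots can also be summarised as a table of group medians. Below is a minimal sketch of that summary for APR by Prosper rating; the rows are invented for illustration (an assumption, not Prosper records), and on the real data `toy` would be `loan_data_cols`.

```python
import pandas as pd

rate_order = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']

# Invented rows for illustration only.
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['HR', 'HR', 'C', 'C', 'AA', 'AA'],
    'BorrowerAPR': [0.36, 0.34, 0.22, 0.20, 0.08, 0.10],
})
toy['ProsperRating (Alpha)'] = toy['ProsperRating (Alpha)'].astype(
    pd.api.types.CategoricalDtype(ordered=True, categories=rate_order))

# observed=True keeps only the ratings that actually occur in the data
median_apr = (toy.groupby('ProsperRating (Alpha)', observed=True)['BorrowerAPR']
                 .median())
print(median_apr)
```

Because the rating column is an ordered categorical, the result is listed from the worst to the best rating, which makes a decreasing pattern easy to spot.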

In [None]:
# since there's only three subplots to create, using the full data should be fine.
plt.figure(figsize = [10, 10])
# subplot 1:  Prosper rating vs. employment status
plt.subplot(3, 1, 1)
sb.countplot(data = loan_data_cols, x ='EmploymentStatus' , hue ='ProsperRating (Alpha)' , palette = 'Blues')

# subplot 2:Prosper rating vs term
ax = plt.subplot(3, 1, 2)
sb.countplot(data = loan_data_cols, x = 'ProsperRating (Alpha)', hue = 'Term', palette = 'Blues')
ax.legend(ncol = 2) # re-arrange legend to reduce overlapping

# subplot 3: employment status vs. term 
ax = plt.subplot(3, 1, 3)
sb.countplot(data = loan_data_cols, x = 'EmploymentStatus', hue = 'Term', palette = 'Greens')
ax.legend(loc = 1, ncol = 2) # re-arrange legend to remove overlapping
plt.xticks(rotation=10);
plt.show()

- There is an interaction between term and Prosper rating: borrowers with an HR rating only have 36-month loans.
- There are more 60-month and 36-month loans for the B and C ratings.
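
The clustered bar charts can be complemented by a contingency table, which shows empty rating and term combinations as explicit zeros. The sketch below builds one with `pd.crosstab` on invented rows (an assumption for illustration only); the counts do not come from the Prosper data.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['HR', 'HR', 'C', 'C', 'C', 'B', 'B'],
    'Term': [36, 36, 36, 60, 12, 36, 60],
})

# Rows are ratings, columns are terms, cells are loan counts
counts = pd.crosstab(toy['ProsperRating (Alpha)'], toy['Term'])
print(counts)
```

A row with zeros in every column except 36 would correspond to a rating that only has 36-month loans.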

### Relationship between LoanOriginalAmount and BorrowerAPR

In [None]:
plt.figure(figsize = [8, 6])
sb.regplot(data = loan_data_cols, x = 'LoanOriginalAmount', y = 'BorrowerAPR', scatter_kws={'alpha':0.02});


- The borrower APR is negatively correlated with the loan original amount: the higher the loan amount, the lower the APR.
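
The line drawn by `regplot` is an ordinary least-squares fit, so its slope can be recovered directly to put a number on the trend. The following sketch fits a straight line with `np.polyfit` on invented points (an assumption for illustration, not values from the dataset).

```python
import numpy as np

# Invented points for illustration only.
amount = np.array([2000, 5000, 10000, 15000, 25000], dtype=float)
apr = np.array([0.35, 0.30, 0.22, 0.18, 0.12])

# Degree-1 least-squares fit: apr is approximated by slope * amount + intercept
slope, intercept = np.polyfit(amount, apr, deg=1)
r = np.corrcoef(amount, apr)[0, 1]

# Express the slope per 1,000 units of loan amount so it is easier to read
print(f"APR change per 1,000 borrowed: {slope * 1000:.4f}, r = {r:.3f}")
```

A negative slope and a negative `r` both express the same downward trend; the slope adds the size of the effect in APR units.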

## Multivariate Exploration


In [None]:
# Term effect on relationship of APR and loan amount
g=sb.FacetGrid(data=loan_data_cols, aspect=1.2, height=6, col='Term', col_wrap=4)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.2});
g.add_legend();

Term does not appear to have an effect on the relationship between APR and loan amount.

In [None]:
# Prosper Rating effect on relationship of APR and loan amount
g=sb.FacetGrid(data=loan_data_cols, aspect=1.2, height=5, col='ProsperRating (Alpha)', col_wrap=4)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.2});
g.add_legend();

Prosper rating does have an effect on the relationship between LoanOriginalAmount and BorrowerAPR.
For borrowers with an AA rating, BorrowerAPR increases as the loan original amount increases.
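
To check a change of sign like this numerically rather than from the fitted lines alone, the correlation can be computed separately within each rating group. The sketch below does so on invented rows for two ratings; the numbers are assumptions chosen to illustrate the pattern and are not Prosper results.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['HR'] * 4 + ['AA'] * 4,
    'LoanOriginalAmount': [2000, 3000, 4000, 5000, 5000, 10000, 20000, 30000],
    'BorrowerAPR': [0.38, 0.37, 0.35, 0.34, 0.07, 0.08, 0.09, 0.10],
})

# Pearson correlation between amount and APR, computed within each rating
corr_by_rating = {
    rating: grp['LoanOriginalAmount'].corr(grp['BorrowerAPR'])
    for rating, grp in toy.groupby('ProsperRating (Alpha)')
}
print(corr_by_rating)
```

On the real data, a table like this would show at which rating the coefficient crosses zero.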

In [None]:
# Term effect on relationship of APR and loan amount
g=sb.FacetGrid(data=loan_data_cols, aspect=1.2, height=5, col='Term', col_wrap=4)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter=0.04, scatter_kws={'alpha':0.1});
g.add_legend();

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> The multivariate exploration showed that the relationship between borrower APR and loan amount turns from negative to slightly positive as the Prosper rating improves from HR to AA.

> For the rating and term effects on loans, better Prosper ratings come with larger loan amounts across all three terms, and the gap in loan amount between terms also widens.

### Were there any interesting or surprising interactions between features?

> Borrower APR and loan amount are negatively correlated for Prosper ratings from HR to B, but the correlation turns positive for ratings A and AA. Another interesting finding is that borrower APR decreases as the loan term increases for borrowers rated HR to C, whereas for borrowers rated B to AA the APR increases with the term.
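
An interaction between rating and term of this kind can be laid out as a pivot table of mean APR, with one row per rating and one column per term. The sketch below uses invented rows for two ratings (assumptions for illustration, not Prosper records) and computes the difference between the 60-month and 36-month columns.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['D', 'D', 'D', 'D', 'A', 'A', 'A', 'A'],
    'Term': [36, 36, 60, 60, 36, 36, 60, 60],
    'BorrowerAPR': [0.30, 0.32, 0.26, 0.28, 0.10, 0.12, 0.14, 0.16],
})

# Mean APR for every rating and term combination
pivot = toy.pivot_table(index='ProsperRating (Alpha)', columns='Term',
                        values='BorrowerAPR', aggfunc='mean')

# Positive values: APR rises with the longer term; negative values: it falls
term_effect = pivot[60] - pivot[36]
print(pivot)
print(term_effect)
```

Opposite signs in `term_effect` across ratings would indicate that the direction of the term effect depends on the rating.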