# LOAN REPAYMENT

## The Data

We will be using a subset of the LendingClub DataSet obtained from Kaggle: https://www.kaggle.com/wordsforthewise/lending-club

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

### Our Goal

Given historical data on loans, including whether or not the borrower defaulted (charge-off), can we build a model that predicts whether or not a borrower will pay back their loan? That way, when a new potential customer applies, we can assess whether they are likely to repay. Keep classification metrics in mind when evaluating the performance of your model!

The "loan_status" column contains our label.

### Data Overview

----
-----
There are many LendingClub data sets on Kaggle. Here is the information on this particular data set:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>sub_grade</td>
      <td>LC assigned loan subgrade</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_title</td>
      <td>The job title supplied by the Borrower when applying for the loan.*</td>
    </tr>
    <tr>
      <th>7</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>8</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>9</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>10</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>11</th>
      <td>issue_d</td>
      <td>The month in which the loan was funded</td>
    </tr>
    <tr>
      <th>12</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>13</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>14</th>
      <td>title</td>
      <td>The loan title provided by the borrower</td>
    </tr>
    <tr>
      <th>15</th>
      <td>zip_code</td>
      <td>The first 3 numbers of the zip code provided by the borrower in the loan application.</td>
    </tr>
    <tr>
      <th>16</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>17</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>18</th>
      <td>earliest_cr_line</td>
      <td>The month the borrower's earliest reported credit line was opened</td>
    </tr>
    <tr>
      <th>19</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>20</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>21</th>
      <td>revol_bal</td>
      <td>Total credit revolving balance</td>
    </tr>
    <tr>
      <th>22</th>
      <td>revol_util</td>
      <td>Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.</td>
    </tr>
    <tr>
      <th>23</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>24</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
    <tr>
      <th>25</th>
      <td>application_type</td>
      <td>Indicates whether the loan is an individual application or a joint application with two co-borrowers</td>
    </tr>
    <tr>
      <th>26</th>
      <td>mort_acc</td>
      <td>Number of mortgage accounts.</td>
    </tr>
    <tr>
      <th>27</th>
      <td>pub_rec_bankruptcies</td>
      <td>Number of public record bankruptcies</td>
    </tr>
  </tbody>
</table>

---
----

## Starter Code

In [None]:
import pandas as pd

In [None]:
data_info = pd.read_csv("F:\\04- UDEMY COURSES\\02- Data Science\\Py_DS_ML_Bootcamp-master\\TensorFlow_FILES\\DATA\\lending_club_info.csv", index_col='LoanStatNew')

In [None]:
print(data_info.loc['revol_util']['Description'])

In [None]:
def feat_info(col_name):
    print(data_info.loc[col_name]['Description'])

In [None]:
feat_info('mort_acc')

## Loading the data and other imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

In [None]:
df = pd.read_csv("F:\\04- UDEMY COURSES\\02- Data Science\\Py_DS_ML_Bootcamp-master\\TensorFlow_FILES\\DATA\\lending_club_loan_two.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe().transpose()



# Section 1: Exploratory Data Analysis

In [None]:
# The label column 'loan_status'; its count plot is drawn in the next cell
df.loan_status

In [None]:
sns.countplot(data=df, x=df.loan_status)

**TASK: Create a histogram of the loan_amnt column.**

In [None]:
plt.figure(figsize=(12,8))
# df.loan_amnt.plot(kind = 'hist', bins=40)
sns.histplot(data=df, x=df.loan_amnt, bins=40)
plt.grid()

**Let's explore the correlation between the continuous feature variables, calculating it for all continuous numeric variables with the .corr() method.**

In [None]:
df.corr(numeric_only=True)  # skip string columns; required on pandas >= 2.0

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), annot=True)  # numeric columns only

In [None]:
# Features most correlated with "loan_amnt" (numeric columns only)
df.corr(numeric_only=True)['loan_amnt'].sort_values(ascending=False)
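
From pandas 2.0 onward, `.corr()` raises an error when the frame contains string columns instead of silently skipping them. One version-independent way to restrict the calculation to numeric columns is to select them first. The sketch below uses a made-up three-row frame (the values are illustrative and not taken from the LendingClub data):

```python
import pandas as pd

# Toy frame: two numeric columns and one string column (made-up values)
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0],
    "installment": [35.0, 70.0, 105.0],
    "grade": ["A", "B", "C"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric_corr = toy.select_dtypes(include="number").corr()
print(numeric_corr["loan_amnt"].sort_values(ascending=False))
```

Because `installment` is an exact multiple of `loan_amnt` in this toy frame, the two are perfectly correlated; real data will be noisier.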

In [None]:
# loan_amnt
feat_info('loan_amnt')

In [None]:
df.loan_amnt.describe()

In [None]:
# installment
feat_info('installment')

In [None]:
df.installment.describe()

In [None]:
# Scatter plot for 'loan_amnt' v/s 'installment'
plt.figure(figsize=(12,8))
sns.scatterplot(data=df, x=df.installment, y=df.loan_amnt)

In [None]:
# px.scatter(data_frame=df, x=df.loan_amnt, y=df.installment)

**Creating a boxplot showing the relationship between the loan_status and the Loan Amount.**

In [None]:
sns.boxplot(data=df, x=df.loan_status, y=df.loan_amnt)

In [None]:
px.box(data_frame=df, x=df.loan_status, y=df.loan_amnt, color=df.loan_status)

**Calculating the summary statistics for the loan amount, grouped by the loan_status.**

In [None]:
df.groupby('loan_status')['loan_amnt'].describe().transpose()

**Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans. What are the unique possible grades and subgrades?**

In [None]:
# grade
feat_info('grade')

In [None]:
print(df.grade.sort_values().unique())
print()
# or

print(sorted(df.grade.unique()))

In [None]:
# sub_grade
feat_info('sub_grade')

In [None]:
print(df.sub_grade.sort_values().unique())
print()

# or
print(sorted(df.sub_grade.unique()))

**Creating a countplot per grade.**

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(data=df, x=df.grade, hue=df.loan_status)
plt.grid()

In [None]:
plt.figure(figsize=(24, 8))
sub_grade_order = sorted(df.sub_grade.unique())
sns.countplot(data=df, x=df.sub_grade, order=sub_grade_order, palette='coolwarm')
plt.grid()

In [None]:
plt.figure(figsize=(24, 8))
sub_grade_order = sorted(df.sub_grade.unique())
sns.countplot(data=df, x=df.sub_grade, order=sub_grade_order, palette='coolwarm', hue=df.loan_status)
plt.grid()

In [None]:
# Extracting the rows having "F" or "G" as a grade
f_and_g = df[(df['grade'] == 'F') | (df['grade'] == 'G')]

In [None]:
plt.figure(figsize=(12, 8))
subgrade_order = sorted(f_and_g.sub_grade.unique())
sns.countplot(data=f_and_g, x=f_and_g.sub_grade, order=subgrade_order, hue=f_and_g.loan_status,)
plt.grid()

**Creating a new column called 'loan_repaid' which will contain a 1 if the loan status was "Fully Paid" and a 0 if it was "Charged Off".**

In [None]:
df.loan_status.unique()

In [None]:
dict1 = {'Fully Paid':1, 'Charged Off':0}

In [None]:
df['loan_repaid'] = df.loan_status.map(dict1)
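
One thing worth knowing about `.map()` with a dictionary: any value that is not a key in the dictionary becomes `NaN`. The sketch below shows this on a small made-up Series; the `"Current"` status is a hypothetical extra value added only to trigger that behaviour.

```python
import pandas as pd

# Made-up statuses; "Current" is a hypothetical value absent from the mapping
status = pd.Series(["Fully Paid", "Charged Off", "Fully Paid", "Current"])
label_map = {"Fully Paid": 1, "Charged Off": 0}

# Values missing from the dictionary are mapped to NaN
repaid = status.map(label_map)
print(repaid.tolist())
```

Checking `df.loan_status.unique()` first, as done above, confirms that every status present has an entry in the mapping.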

In [None]:
df.head()

**Creating a bar plot showing the correlation of the numeric features to the new loan_repaid column. [Helpful Link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html)**

In [None]:
df.corr(numeric_only=True)['loan_repaid'].sort_values(ascending=False)

In [None]:
df.corr(numeric_only=True)['loan_repaid'][:-1].sort_values(ascending=False).plot(kind='bar')
plt.grid()

# or

In [None]:
df.corr(numeric_only=True)['loan_repaid'].sort_values().drop('loan_repaid').plot(kind='bar')
plt.grid()

---
---
# Section 2: Data PreProcessing

**Section Goals: Remove or fill any missing data. Remove unnecessary or repetitive features. Convert categorical string features to dummy variables.**



In [None]:
df.head()

## Missing Data

**Let's explore the columns with missing data. We use a variety of factors to judge whether each one is useful, and so whether to keep it, discard it, or fill in the missing values.**

In [None]:
len(df)

**Creating a Series that displays the total count of missing values per column.**

In [None]:
df.isnull().sum()

In [None]:
# Percentage of missing data
df.isnull().sum() / len(df) * 100
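
The same percentage can be obtained in one step, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up four-row frame (column names borrowed from the data set, values invented):

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing entries
toy = pd.DataFrame({
    "mort_acc": [0.0, np.nan, 2.0, np.nan],
    "revol_util": [41.8, 53.3, np.nan, 21.5],
    "loan_amnt": [10000, 8000, 15600, 7200],
})

# isnull() yields booleans; their column mean is the fraction missing
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```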

In [None]:
# emp_title
feat_info('emp_title')

In [None]:
df.isnull().sum()['emp_title'] / len(df) * 100  # use len(df) rather than a hard-coded row count

In [None]:
# emp_length
feat_info('emp_length')

In [None]:
df.isnull().sum()['emp_length'] / len(df) * 100  # use len(df) rather than a hard-coded row count

**How many unique employment job titles are there?**

In [None]:
len(df.emp_title.unique())

In [None]:
df.emp_title.value_counts()

**Realistically there are too many unique job titles to try to convert this to a dummy variable feature. Let's remove that emp_title column.**

In [None]:
df = df.drop('emp_title', axis=1)

In [None]:
# df.columns

**Creating a count plot of the emp_length feature column. Challenge: Sort the order of the values.**

In [None]:
sorted_order = sorted(df.emp_length.dropna().unique())

sorted_order

In [None]:
emp_lenght_order = ['< 1 year',
                    '1 year',
                    '2 years',
                    '3 years',
                    '4 years',
                    '5 years',
                    '6 years',
                    '7 years',
                    '8 years',
                    '9 years',
                    '10+ years']

In [None]:
plt.figure(figsize=(18,8))
sns.countplot(data=df, x=df.emp_length, order=emp_lenght_order)

**Plotting the countplot with a hue separating Fully Paid vs Charged Off**

In [None]:
plt.figure(figsize=(18,8))
sns.countplot(data=df, x=df.emp_length, order=emp_lenght_order, hue='loan_status')

In [None]:
# Charged Off Details
emp_co = df[df['loan_status'] == 'Charged Off'].groupby('emp_length').count()['loan_status']

In [None]:
emp_co

In [None]:
# Fully Paid Details
emp_fp = df[df['loan_status'] == 'Fully Paid'].groupby('emp_length').count()['loan_status']

In [None]:
emp_fp

In [None]:
emp_len = emp_co / emp_fp

In [None]:
# ratio of charged-off to fully-paid loans per employment length (a ratio, not a percentage)
emp_len

In [None]:
# charge-off rate: charged-off loans as a fraction of all loans per employment length
emp_ratio = emp_co / (emp_co + emp_fp)

In [None]:
# plot the charged-off / fully-paid ratio
emp_len.plot(kind='bar')

In [None]:
# plot the charge-off rate
emp_ratio.plot(kind='bar')
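
The charge-off rate per group can also be computed without building two separate count Series: compare the status column to "Charged Off" and take the group-wise mean of the resulting booleans. A sketch on made-up rows (the counts are invented for illustration):

```python
import pandas as pd

# Made-up rows: 1 of 2 short-tenure loans and 1 of 4 long-tenure loans charged off
toy = pd.DataFrame({
    "emp_length": ["1 year", "1 year", "10+ years", "10+ years",
                   "10+ years", "10+ years"],
    "loan_status": ["Charged Off", "Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Fully Paid"],
})

# Boolean mask of charged-off loans; its group mean is the charge-off rate
charged_off = toy["loan_status"] == "Charged Off"
charge_off_rate = charged_off.groupby(toy["emp_length"]).mean()
print(charge_off_rate)
```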

**Charge-off rates are extremely similar across all employment lengths. Let's drop the emp_length column.**

In [None]:
df = df.drop('emp_length', axis=1)

In [None]:
# df.columns

In [None]:
df.isnull().sum()

In [None]:
feat_info('purpose')

In [None]:
feat_info('title')

In [None]:
df.title.head(10)

In [None]:
df.purpose.head(10)

In [None]:
# 'title' and 'purpose' carry essentially the same information, so we drop 'title'
df = df.drop('title', axis=1)

In [None]:
feat_info('mort_acc')

In [None]:
df.mort_acc.value_counts()

In [None]:
print("Correlation with the mort_acc column:")
df.corr(numeric_only=True)['mort_acc'].sort_values(ascending=False)

In [None]:
feat_info('total_acc')

In [None]:
print("Mean of mort_acc column per total_acc:")
df.groupby('total_acc')['mort_acc'].mean()  # select the column first so string columns are never averaged

In [None]:
total_acc_avg = df.groupby('total_acc')['mort_acc'].mean()

In [None]:
total_acc_avg

In [None]:
total_acc_avg[100.0]

In [None]:
def fill_mort_acc(total_acc, mort_acc):
    '''
    Accepts the total_acc and mort_acc values for the row.
    Checks if the mort_acc is NaN , if so, it returns the avg mort_acc value
    for the corresponding total_acc value for that row.
    
    total_acc_avg here should be a Series or dictionary mapping each
    total_acc value to the groupby average of mort_acc for that value.
    '''
    
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc]
    else:
        return mort_acc

In [None]:
df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis=1)
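
A row-wise `apply` works, but it can be slow on a few hundred thousand rows. A vectorized alternative (a swapped-in technique, not the approach used in the cell above) is to broadcast each group's mean back onto its rows with `transform` and pass that to `fillna`. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up frame: two total_acc groups, each with one missing mort_acc
toy = pd.DataFrame({
    "total_acc": [10, 10, 10, 25, 25],
    "mort_acc": [0.0, 2.0, np.nan, 3.0, np.nan],
})

# Mean mort_acc of each total_acc group, aligned row by row with the frame
group_mean = toy.groupby("total_acc")["mort_acc"].transform("mean")

# Fill only the missing entries with their group's mean
toy["mort_acc"] = toy["mort_acc"].fillna(group_mean)
print(toy)
```

If a `total_acc` group has no observed `mort_acc` at all, its mean is itself `NaN`, so those rows stay missing under either approach.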

In [None]:
df.isnull().sum()

**revol_util and the pub_rec_bankruptcies have missing data points, but they account for less than 0.5% of the total data. Let's remove the rows that are missing those values in those columns with dropna().**

In [None]:
df = df.dropna()

In [None]:
df.isnull().sum()

## Categorical Variables and Dummy Variables


**List all the columns that are currently non-numeric. [Helpful Link](https://stackoverflow.com/questions/22470690/get-list-of-pandas-dataframe-columns-based-on-data-type)**

[Another very useful method call](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html)

### .select_dtypes():
- Return a subset of the DataFrame's columns based on the column dtypes.


- **Parameters:**
    - **include, exclude** : scalar or list-like
        - A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
        
        
        - To select all numeric types, use np.number or 'number'
        - To select strings you must use the object dtype, but note that this will return all object dtype columns
        - See the numpy dtype hierarchy
        - To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
        - To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
        - To select Pandas categorical dtypes, use 'category'
        - To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
        
        
 #### Syntax:
 - **df.select_dtypes(include=None, exclude=None)**

In [None]:
df.select_dtypes(exclude='number').columns
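
To make the `include` / `exclude` behaviour concrete, here is a small sketch on a made-up frame with two numeric columns and one string column:

```python
import pandas as pd

# Made-up frame mixing float, string, and integer columns
toy = pd.DataFrame({
    "loan_amnt": [10000.0, 8000.0],
    "term": [" 36 months", " 60 months"],
    "open_acc": [16, 17],
})

# 'number' matches every numeric dtype (ints and floats alike)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, non_numeric_cols)
```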

---
**Let's now go through all the string features to see what we should do with them.**

---


### term feature

**Converting the term feature into either a 36 or 60 integer numeric data type using .apply() or .map().**

In [None]:
df.term.value_counts()

In [None]:
# Or just use .map()
df['term'] = df.term.apply(lambda term: int(term[:3]))
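
Slicing the first three characters depends on the exact string layout. A slightly more tolerant sketch, assuming the values look like `" 36 months"` (an assumption inferred from the slice above), pulls out the first run of digits with a regular expression:

```python
import pandas as pd

# Assumed format of the raw term strings (leading space, then the month count)
term = pd.Series([" 36 months", " 60 months", " 36 months"])

# Extract the first run of digits and cast it to an integer
term_months = term.str.extract(r"(\d+)", expand=False).astype(int)
print(term_months.tolist())
```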

In [None]:
df.term.value_counts()

In [None]:
# df.head()

### grade feature

**We already know grade is part of sub_grade, so just drop the grade feature.**

In [None]:
df.grade.unique()

In [None]:
df = df.drop('grade', axis=1)

**Converting the subgrade into dummy variables. Then concatenate these new columns to the original dataframe. Remember to drop the original subgrade column and to add drop_first=True to your get_dummies call.**

In [None]:
df.sub_grade.unique()

In [None]:
subgrade_dummies = pd.get_dummies(df.sub_grade,drop_first=True)

In [None]:
# Adding 'subgrade_dummies' into data set
df = pd.concat([df, subgrade_dummies], axis=1)

In [None]:
df = df.drop('sub_grade', axis=1)
df.head()
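
The effect of `drop_first=True` is easiest to see on a tiny example: the first category (in sorted order) gets no column of its own and is instead represented by all-zero indicators. The subgrade values below are made up for illustration:

```python
import pandas as pd

# Made-up subgrades; "A1" sorts first and becomes the dropped baseline
sub_grade = pd.Series(["A1", "B4", "A1", "C2"], name="sub_grade")

all_dummies = pd.get_dummies(sub_grade)               # columns: A1, B4, C2
reduced = pd.get_dummies(sub_grade, drop_first=True)  # columns: B4, C2

print(list(all_dummies.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly collinear indicator columns, since the dropped category is implied whenever all remaining indicators are zero.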

In [None]:
df.select_dtypes(include='object').columns

### verification_status, application_type, initial_list_status, purpose
**Converting these columns: ['verification_status', 'application_type', 'initial_list_status', 'purpose'] into dummy variables and concatenating them with the original dataframe. Remember to set drop_first=True and to drop the original columns.**

In [None]:
# verification_status
df.verification_status.unique()

In [None]:
# application_type
df.application_type.unique()

In [None]:
# initial_list_status
df.initial_list_status.unique()

In [None]:
# purpose
df.purpose.unique()

In [None]:
# Converting ['verification_status', 'application_type','initial_list_status','purpose'] into dummy variables together
dummies = pd.get_dummies(df[['verification_status', 'application_type', 'initial_list_status', 'purpose']], drop_first=True)

In [None]:
dummies.head()

In [None]:
df = df.drop(['verification_status', 'application_type', 'initial_list_status', 'purpose'], axis=1)

In [None]:
df = pd.concat([df, dummies], axis=1)

In [None]:
df.columns

### home_ownership
**Review the value_counts for the home_ownership column.**

In [None]:
df.home_ownership.value_counts()

In [None]:
df['home_ownership'] = df.home_ownership.replace(to_replace=['NONE', 'ANY'], value='OTHER')

In [None]:
df.home_ownership.value_counts()

In [None]:
dummies = pd.get_dummies(df['home_ownership'], drop_first=True)
df = df.drop('home_ownership', axis=1)
df = pd.concat([df, dummies], axis=1)

In [None]:
# dummies

In [None]:
df.columns

### address
**Let's feature engineer a zip code column from the address in the data set. Creating a column called 'zip_code' that extracts the zip code from the address column.**

In [None]:
df.address.head(10)

In [None]:
df['zip_code'] = df.address.apply(lambda address: address[-5:])
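
The same extraction can be written with the vectorized `.str` accessor. The sketch assumes, as the slice above does, that every address ends with a five-digit zip code; the addresses themselves are invented:

```python
import pandas as pd

# Invented addresses that end in a five-digit zip code
address = pd.Series([
    "123 Example Street\nSometown, XX 12345",
    "456 Sample Ave\nOtherville, YY 67890",
])

# Take the last five characters of each string
zip_code = address.str[-5:]
print(zip_code.tolist())
```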

In [None]:
df.zip_code.unique()

In [None]:
dummies = pd.get_dummies(df['zip_code'], drop_first=True)
df = df.drop(['zip_code', 'address'], axis=1)
df = pd.concat([df, dummies], axis=1)

In [None]:
dummies.head()

### issue_d 

**Keeping this feature would be data leakage: when using our model we wouldn't know beforehand whether or not a loan would be issued, so in theory we wouldn't have an issue date. Drop this feature.**

In [None]:
df.issue_d

In [None]:
df = df.drop('issue_d', axis=1)

In [None]:
df.head()

### earliest_cr_line
**This appears to be a historical time stamp feature. Extract the year from this feature using an .apply function, then convert it to a numeric feature. Set this new data to a feature column called 'earliest_cr_year'. Then drop the earliest_cr_line feature.**

In [None]:
feat_info('earliest_cr_line')

In [None]:
df.earliest_cr_line.head()

In [None]:
# Extracting earliest_cr_line's 'year'
df['earliest_cr_year'] = df.earliest_cr_line.apply(lambda year: int(year[-4:]))
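
Two equivalent ways to get the year are sketched below, assuming the values look like `"Jun-1990"` (abbreviated English month, hyphen, four-digit year; this format is an assumption based on the slice above). Parsing with `pd.to_datetime` is stricter, since it fails loudly on malformed entries instead of slicing them blindly:

```python
import pandas as pd

# Assumed "Mon-YYYY" format; the dates themselves are made up
earliest = pd.Series(["Jun-1990", "Jul-2004", "Aug-2007"])

# Option 1: slice the last four characters and cast to int
year_by_slice = earliest.str[-4:].astype(int)

# Option 2: parse as a date, then read the year component
year_by_parse = pd.to_datetime(earliest, format="%b-%Y").dt.year

print(year_by_slice.tolist(), year_by_parse.tolist())
```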

In [None]:
df.earliest_cr_year.head()

In [None]:
df = df.drop('earliest_cr_line', axis=1)

In [None]:
df.select_dtypes(include='object').columns

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df = df.drop('loan_status', axis=1)

In [None]:
X = df.drop('loan_repaid', axis=1).values
y = df.loan_repaid.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

## Normalizing the Data


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)
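
The scaler is fitted on the training set only, and the test set is transformed with the training minimum and range. The NumPy sketch below reproduces the min-max formula by hand on made-up arrays to show what that means in practice; it is an illustration of the arithmetic, not a replacement for `MinMaxScaler`:

```python
import numpy as np

# Made-up training and test matrices with two features
X_train_toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test_toy = np.array([[2.5, 40.0]])

# Statistics come from the training data only
col_min = X_train_toy.min(axis=0)
col_range = X_train_toy.max(axis=0) - col_min

# Apply the same shift and scale to both sets
train_scaled = (X_train_toy - col_min) / col_range
test_scaled = (X_test_toy - col_min) / col_range

print(train_scaled)
print(test_scaled)
```

Note that the second test feature scales to 1.5: test values outside the training range fall outside [0, 1], which is expected when no test information is allowed to leak into the fit.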

# Creating the Model


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.constraints import max_norm

**Building a sequential model that will be trained on the data. You have unlimited options here, but here is what the solution uses: a model that goes 78 --> 39 --> 19 --> 1 output neuron.**

In [None]:
model = Sequential()

# layer 1 / Input Layer
model.add(Dense(units=78, activation='relu'))
model.add(Dropout(rate=0.2))

# Layer 2 / hidden Layer 1
model.add(Dense(units=39, activation='relu'))
model.add(Dropout(rate=0.2))

# Layer 3 / hidden layer 2
model.add(Dense(units=19, activation='relu'))
model.add(Dropout(rate=0.2))

# Layer 4 / output layer
model.add(Dense(units=1, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy')

In [None]:
model.fit(x=X_train,
          y=y_train,
          batch_size=256,
          validation_data=(X_test, y_test),
          epochs=25)

# Section 3: Evaluating Model Performance


In [None]:
model.history.history

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses.head()

In [None]:
losses.plot()

In [None]:
predictions = (model.predict(X_test) > (0.5)).astype("int32")

In [None]:
predictions.dtype

In [None]:
y_test.dtype

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

In [None]:
# Classification Report
print(classification_report(y_test, predictions))

In [None]:
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, predictions)
display = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=['Charged Off', 'Repaid'])
display.plot()
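
To connect the confusion matrix to the numbers in the classification report, the sketch below counts the four cells by hand on made-up labels (1 = repaid, 0 = charged off) and derives a few metrics from them. The labels are invented and say nothing about the actual model's performance:

```python
import numpy as np

# Made-up true labels and predictions (1 = repaid, 0 = charged off)
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # repaid, predicted repaid
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # charged off, predicted repaid
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # repaid, predicted charged off
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # charged off, predicted charged off

accuracy = (tp + tn) / len(y_true)
precision_repaid = tp / (tp + fp)
recall_charged_off = tn / (tn + fp)

print(accuracy, precision_repaid, recall_charged_off)
```

Here accuracy looks reasonable while recall on the charged-off class is poor, which is why accuracy alone can be misleading when one class dominates.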

**Save your model.**

In [None]:
from tensorflow.keras.models import load_model

In [None]:
model.save("full_data_project_model.h5")

**TASK: Given the customer below, would you offer this person a loan?**

In [None]:
import random
random.seed(101)

#### random.randint(a, b):

- Return random integer in range [a, b], including both end points


- **Syntax:** 
    - **random.randint(a, b)**
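
Because both end points are included, `random.randint(0, n)` can return `n` itself, which is one past the last valid position of a sequence of length `n`. The sketch below contrasts it with `random.randrange(n)`, which excludes the upper bound; `n = 5` and the seed are arbitrary choices for illustration:

```python
import random

n = 5                      # pretend this is the number of rows
rng = random.Random(101)   # local generator so the sketch is reproducible

# randint includes the upper end point; randrange excludes it
draws_randint = [rng.randint(0, n) for _ in range(1000)]
draws_randrange = [rng.randrange(n) for _ in range(1000)]

print(max(draws_randint), max(draws_randrange))
```

For picking a row position, either `random.randrange(n)` or `random.randint(0, n - 1)` stays within bounds.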

In [None]:
random_ind = random.randint(0, len(df) - 1)  # randint is inclusive, so stop at the last valid row position

In [None]:
new_customer = df.drop('loan_repaid', axis=1).iloc[random_ind]
new_customer

In [None]:
new_customer = scaler.transform(new_customer.values.reshape(1, -1))  # one row; -1 infers the feature count

In [None]:
# Predicting for the new customer (1 = repaid, 0 = charged off)
(model.predict(new_customer) > 0.5).astype("int32")

**Now check, did this person actually end up paying back their loan?**

In [None]:
df.iloc[random_ind]['loan_repaid']