# Predicting Loan Defaults

## Goal:

### - Discover features driving borrowers to default.
### - Build a classification model to predict defaults.

___

# Imports

In [1]:
import wrangle2 as w
import explore as e
import preprocess as p
import numpy as np
import modeling as m

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

---

# Dictionary

| Column         | Column_type | Data_type | Description                                                                  |
|----------------|-------------|-----------|------------------------------------------------------------------------------|
| LoanID         | Identifier  | string    | A unique identifier for each loan.                                           |
| Age            | Feature     | integer   | Age of the borrower.                                                         |
| Income         | Feature     | integer   | Annual income of the borrower.                                               |
| LoanAmount     | Feature     | integer   | Amount of money being borrowed.                                              |
| CreditScore    | Feature     | integer   | Credit score of the borrower, indicating their creditworthiness.             |
| MonthsEmployed | Feature     | integer   | Number of months the borrower has been employed.                             |
| NumCreditLines | Feature     | integer   | Number of credit lines the borrower has open.                                |
| InterestRate   | Feature     | float     | Interest rate for the loan.                                                  |
| LoanTerm       | Feature     | integer   | Term length of the loan in months.                                           |
| DTIRatio       | Feature     | float     | Debt-to-Income ratio, borrower's debt compared to their income.              |
| Education      | Feature     | string    | Highest level of education attained by the borrower.                         |
| EmploymentType | Feature     | string    | Type of employment status of the borrower.                                   |
| MaritalStatus  | Feature     | string    | Marital status of the borrower (Single, Married, Divorced).                  |
| HasMortgage    | Feature     | string    | Whether the borrower has a mortgage (Yes or No).                             |
| HasDependents  | Feature     | string    | Whether the borrower has dependents (Yes or No).                             |
| LoanPurpose    | Feature     | string    | Purpose of the loan (Home, Auto, Education, Business, Other).                |
| HasCoSigner    | Feature     | string    | Whether the loan has a co-signer (Yes or No).                                |
| Default        | Target      | integer   | Binary target variable indicating whether the loan defaulted (1) or not (0). |

---

# Acquire & Wrangle



### - Acquired data from Coursera as a CSV file

### - Renamed columns & lowercased column names

### - No missing values

### - Dropped LoanID column

### - Split data 70% / 15% / 15% (train / validate / test)
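
A minimal sketch of the wrangling steps listed above, assuming pandas and scikit-learn. This is not the code in `wrangle2.py`: the helper names `to_snake` and `wrangle_sketch`, the snake_case renaming rule, the stratified split, and the tiny demo frame are all hypothetical.

```python
import re

import pandas as pd
from sklearn.model_selection import train_test_split


def to_snake(name: str) -> str:
    """Convert CamelCase column names such as 'LoanAmount' to 'loan_amount'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def wrangle_sketch(df: pd.DataFrame, seed: int = 42):
    """Rename columns, drop the identifier, and split 70/15/15."""
    df = df.rename(columns=to_snake).drop(columns=["loan_id"])
    # First cut: 70% train, 30% held out; stratify to keep the default rate stable.
    train, rest = train_test_split(
        df, train_size=0.70, stratify=df["default"], random_state=seed
    )
    # Second cut: split the held-out 30% in half (15% validate, 15% test).
    val, test = train_test_split(
        rest, test_size=0.50, stratify=rest["default"], random_state=seed
    )
    return train, val, test


# Tiny synthetic frame so the sketch runs without the Coursera CSV.
demo = pd.DataFrame({
    "LoanID": [f"ID{i}" for i in range(100)],
    "LoanAmount": range(100),
    "Default": [1 if i % 10 == 0 else 0 for i in range(100)],
})
train_demo, val_demo, test_demo = wrangle_sketch(demo)
print(train_demo.shape, val_demo.shape, test_demo.shape)
```

Stratifying on the target is a design choice I am assuming here; with so few defaults it keeps the default rate similar across the three subsets.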

In [2]:
train = w.train_data()

Found Data


In [3]:
train.shape

(255347, 18)

In [4]:
train.head()

Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255347 entries, 0 to 255346
Data columns (total 18 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   LoanID          255347 non-null  object 
 1   Age             255347 non-null  int64  
 2   Income          255347 non-null  int64  
 3   LoanAmount      255347 non-null  int64  
 4   CreditScore     255347 non-null  int64  
 5   MonthsEmployed  255347 non-null  int64  
 6   NumCreditLines  255347 non-null  int64  
 7   InterestRate    255347 non-null  float64
 8   LoanTerm        255347 non-null  int64  
 9   DTIRatio        255347 non-null  float64
 10  Education       255347 non-null  object 
 11  EmploymentType  255347 non-null  object 
 12  MaritalStatus   255347 non-null  object 
 13  HasMortgage     255347 non-null  object 
 14  HasDependents   255347 non-null  object 
 15  LoanPurpose     255347 non-null  object 
 16  HasCoSigner     255347 non-null  object 
 17  Default   

In [6]:
train.columns

Index(['LoanID', 'Age', 'Income', 'LoanAmount', 'CreditScore',
       'MonthsEmployed', 'NumCreditLines', 'InterestRate', 'LoanTerm',
       'DTIRatio', 'Education', 'EmploymentType', 'MaritalStatus',
       'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner',
       'Default'],
      dtype='object')

In [7]:
train.isnull().sum()

LoanID            0
Age               0
Income            0
LoanAmount        0
CreditScore       0
MonthsEmployed    0
NumCreditLines    0
InterestRate      0
LoanTerm          0
DTIRatio          0
Education         0
EmploymentType    0
MaritalStatus     0
HasMortgage       0
HasDependents     0
LoanPurpose       0
HasCoSigner       0
Default           0
dtype: int64

In [8]:
# Show rows where loan_amount_bin is missing (requires the binned column to exist).
train.loan_amount_bin[train.loan_amount_bin.isnull()]

AttributeError: 'DataFrame' object has no attribute 'loan_amount_bin'

In [None]:
test = w.test_data()

In [None]:
val, test = w.val_test(test)

In [None]:
val.shape, test.shape

In [None]:
val.head()

In [None]:
w.summarize(train)

In [None]:
w.summarize(val)

In [None]:
train, val, test = w.wrangle_data()

In [None]:
test.head()

In [None]:
train.shape, val.shape, test.shape

In [None]:
w.summarize(train)

In [None]:
w.summarize(test)

---

# Exploration

### - Binned data for better visuals

In [None]:
sns.barplot(data=train, x='cosigned', y='default')

In [None]:
# Plot distribution of default
e.plt_dist(train, 'default')

The default values are unevenly distributed, so I need to balance the classes before modeling.
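
One way to handle the imbalance without resampling is to weight the classes. The sketch below assumes scikit-learn and uses a synthetic target with an 88/12 split, which I am assuming roughly mirrors this data; it is an illustration, not what the `modeling` module does.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Synthetic target with an 88/12 split (an assumption, not the real data).
y_demo = pd.Series([0] * 88 + [1] * 12, name="default")

# Share of each class in the target.
class_share = y_demo.value_counts(normalize=True)
print(class_share)

# 'balanced' weights are n_samples / (n_classes * class_count).
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

The resulting dictionary can be passed as `class_weight` to scikit-learn estimators that accept it, such as tree-based classifiers.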

In [None]:
train.head()

In [None]:
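# Make copies of train and val with new binned features.
# Note: assigning to `.copy` shadows the DataFrame.copy() method; a separate variable name would be safer.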
train.copy = e.bin_data(train)
val.copy = e.bin_data(val)
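
A hedged sketch of what a binning helper might look like, assuming pandas. The bin edges, labels, and new column names below are illustrative assumptions and may differ from what `e.bin_data` produces.

```python
import pandas as pd

# Synthetic rows; column names follow the lowercased names used in this notebook.
demo = pd.DataFrame({
    "age": [18, 25, 34, 47, 58, 69],
    "interest_rate": [2.5, 6.0, 9.9, 14.2, 19.8, 24.9],
})

# Fixed bins with explicit edges and labels (edges are illustrative).
demo["age_bin"] = pd.cut(
    demo["age"],
    bins=[17, 30, 45, 60, 70],
    labels=["18-30", "31-45", "46-60", "61-70"],
)

# Quantile bins put roughly the same number of rows in each bucket.
demo["interest_rate_bin"] = pd.qcut(
    demo["interest_rate"], q=3, labels=["low", "mid", "high"]
)
print(demo)
```

`pd.cut` suits features with meaningful thresholds, while `pd.qcut` keeps bar heights comparable because each bin holds a similar number of loans.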

## 1. Is there a difference in interest rates between borrowers who default and those who do not?

### H0: Mean of interest rates of defaults = Mean of interest rates of all borrowers

### Ha: Mean of interest rates of defaults != Mean of interest rates of all borrowers

In [None]:
# Call function to visualize barplot.
e.plt_1(train.copy)

In [None]:
# One-sample, two-tailed t-test
e.t_test(train.copy, 'interest_rate')

### Conclusion: 
### - Mean of interest rates of defaults != Mean of interest rates of all borrowers
### - As interest rates increase, the default rate increases
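
For reference, a one-sample, two-tailed t-test of this kind can be sketched with SciPy as follows. The data is synthetic and the means, spread, and `alpha` are assumptions; I am not claiming this matches the internals of `e.t_test`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic interest rates: defaulters are drawn with a higher mean (assumption).
rates_nondefault = rng.normal(loc=13.0, scale=6.0, size=4400)
rates_default = rng.normal(loc=15.0, scale=6.0, size=600)
rates_all = np.concatenate([rates_nondefault, rates_default])

alpha = 0.05
# One-sample, two-tailed test: defaulters' mean vs. the mean of all borrowers.
t_stat, p_value = stats.ttest_1samp(rates_default, popmean=rates_all.mean())

print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```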

---

## 2. Is there a difference in loan amounts between borrowers who default and those who do not?

### H0: Mean of loan amounts of defaults = Mean of loan amounts of all borrowers

### Ha: Mean of loan amounts of defaults != Mean of loan amounts of all borrowers

In [None]:
# Call function to visualize barplot.
e.plt_3(train.copy)

In [None]:
# One-sample, two-tailed t-test
e.t_test(train.copy, 'loan_amount')

### Conclusion: 

### - Mean of loan amounts of defaults != Mean of loan amounts of all borrowers
### - As loan amounts increase, the default rate increases
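
The question compares defaulters with non-defaulters, so a two-sample test is a possible alternative to the one-sample test used in this notebook. The sketch below swaps in Welch's t-test from SciPy on synthetic data; the sample sizes and means are assumptions for the demo only.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic loans: defaulters borrow more on average (assumption for the demo).
demo = pd.DataFrame({
    "default": [0] * 880 + [1] * 120,
    "loan_amount": np.concatenate([
        rng.normal(120_000, 40_000, size=880),
        rng.normal(145_000, 40_000, size=120),
    ]),
})

defaulted = demo.loc[demo["default"] == 1, "loan_amount"]
repaid = demo.loc[demo["default"] == 0, "loan_amount"]

# Welch's t-test: two independent samples, no equal-variance assumption.
t_stat, p_value = stats.ttest_ind(defaulted, repaid, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

Welch's variant is used because the two groups differ greatly in size, so equal variances should not be taken for granted.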

---

## 3. Is there a difference in age between borrowers who default and those who do not?

### H0: Mean of age of defaults = Mean of age of all borrowers

### Ha: Mean of age of defaults != Mean of age of all borrowers

In [None]:
# Call function to visualize barplot.
e.plt_2(train.copy)

In [None]:
# One-sample, two-tailed t-test
e.t_test(train.copy, 'age')

### Conclusion: 
### - Mean of age of defaults != Mean of age of all borrowers
### - As age increases, the default rate decreases
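
The trend across bins can also be checked numerically. A small sketch, assuming pandas and a hypothetical `age_bin` column; the rows are synthetic and only illustrate the calculation.

```python
import pandas as pd

# Synthetic rows; in the notebook these would come from the binned train data.
demo = pd.DataFrame({
    "age_bin": ["18-30"] * 4 + ["31-45"] * 4 + ["46-70"] * 4,
    "default": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 target within each bin is that bin's default rate.
rate_by_bin = (
    demo.groupby("age_bin")["default"]
    .agg(default_rate="mean", n_loans="size")
    .reset_index()
)
print(rate_by_bin)
```

Reporting `n_loans` next to the rate helps spot bins that are too small to trust.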

## EDA Summary:
### - The target distribution is heavily concentrated on non-defaults (0)
### - Interest rates, loan amount, and age seem to drive borrowers to default on loans

---

# Pre-Process

The function is not working: the val subset does not have a 'default' column. How can I measure accuracy without it?

In [None]:
# Split data, scale data, get dummies
X_train, y_train = p.xy_split(train.copy)
X_val, y_val = p.xy_split2(val.copy)

X_train.head()
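
A minimal sketch of a split, encode, and scale step that fails early when a subset has lost its target column. It assumes pandas and scikit-learn; `xy_split_sketch` is a hypothetical helper, not the `p.xy_split` function, and the choice of `MinMaxScaler` is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def xy_split_sketch(train: pd.DataFrame, val: pd.DataFrame, target: str = "default"):
    """Separate the target, one-hot encode, and scale using train statistics only."""
    # Fail early with a clear message if a subset lost its target column.
    for name, frame in (("train", train), ("val", val)):
        if target not in frame.columns:
            raise KeyError(f"'{target}' column missing from {name} subset")

    y_train, y_val = train[target], val[target]
    X_train = pd.get_dummies(train.drop(columns=[target]), drop_first=True, dtype=float)
    # Encode val fully, then keep exactly the train columns (missing ones become 0).
    X_val = pd.get_dummies(val.drop(columns=[target]), dtype=float)
    X_val = X_val.reindex(columns=X_train.columns, fill_value=0)

    # Fit the scaler on train only, then apply it to both subsets.
    scaler = MinMaxScaler().fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
    X_val = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns, index=X_val.index)
    return X_train, y_train, X_val, y_val


train_demo = pd.DataFrame({"age": [20, 40, 60], "education": ["HS", "BS", "MS"], "default": [1, 0, 0]})
val_demo = pd.DataFrame({"age": [30, 50], "education": ["HS", "BS"], "default": [0, 1]})
X_tr, y_tr, X_va, y_va = xy_split_sketch(train_demo, val_demo)
print(X_va)
```

Fitting the scaler and the dummy columns on train alone avoids leaking validation information into preprocessing.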

---

# Modeling

### - Baseline = .88
### - Evaluation metric is accuracy
### - I have moved forward with all features including new binned features
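
A short sketch of how a majority-class baseline can be computed, assuming scikit-learn. The target below is synthetic with 12% defaults, which I am assuming mirrors the .88 baseline; recall is shown next to accuracy only to illustrate what the baseline misses.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic validation target with 12% defaults (an assumption, not the real data).
y_val_demo = np.array([0] * 88 + [1] * 12)

# Baseline model: always predict the most common class.
majority_class = np.bincount(y_val_demo).argmax()
baseline_pred = np.full_like(y_val_demo, majority_class)

baseline_accuracy = accuracy_score(y_val_demo, baseline_pred)
baseline_recall = recall_score(y_val_demo, baseline_pred, zero_division=0)
print(f"accuracy = {baseline_accuracy:.2f}, recall on defaults = {baseline_recall:.2f}")
```

On this synthetic target the baseline scores high on accuracy while catching no defaults, which is why a model can fall below the baseline and still be more useful.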

## Random Forest

In [None]:
m.r_forest(X_train, y_train, X_val, y_val)

## Decision Tree

In [None]:
m.d_tree(X_train, y_train, X_val, y_val)

## K-Nearest Neighbors

In [None]:
m.knn_m(X_train, y_train, X_val, y_val)
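
The three model calls above could be sketched with scikit-learn roughly as follows. The data is synthetic and the hyperparameters (`max_depth`, `n_neighbors`, `class_weight="balanced"`) are assumptions; the `m.r_forest`, `m.d_tree`, and `m.knn_m` helpers may be configured differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 88% class 0) standing in for the loan features.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, weights=[0.88], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=5, class_weight="balanced", random_state=0
    ),
    "decision_tree": DecisionTreeClassifier(
        max_depth=5, class_weight="balanced", random_state=0
    ),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    val_pred = model.predict(X_va)
    results[name] = {
        "train_acc": accuracy_score(y_tr, model.predict(X_tr)),
        "val_acc": accuracy_score(y_va, val_pred),
        "val_recall": recall_score(y_va, val_pred),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Comparing train and validate accuracy side by side makes overfitting visible, and the recall column shows whether a model is finding any defaults at all.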

---

In [None]:
# Plot distribution of default
e.plt_dist(train, 'default')

# Conclusion

### - Baseline = .88
### - Decision tree and random forest models with balanced weight parameters perform worse than the baseline
### - The binary default target is heavily concentrated on one value
### - The K-nearest neighbors model favors one outcome significantly over the other
### - I decided not to run the test set until train and validate accuracy improve

---

# Next Steps:

### - Adjust model hyperparameters 
### - Do some more feature engineering
### - Investigate feature importance and minimize dimension by dropping features
### - Run the final model on the test set and finish the project
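
A hedged sketch of two of these steps, hyperparameter tuning and feature importance, assuming scikit-learn. The grid values and the placeholder feature names are assumptions, and the data is synthetic. Scoring uses accuracy to match the metric in this notebook, though another scorer could be swapped in.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train; feature names are placeholders.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, weights=[0.88], random_state=0
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])

# Small illustrative grid searched with 3-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [1, 10]},
    scoring="accuracy",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)

# Rank features by impurity-based importance from the refit best model.
importances = pd.Series(
    grid.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.head())
```

Features near the bottom of the ranking are candidates for dropping when reducing dimensionality.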


---

# Recommendations:

### - Target loan amounts lower than 150k
### - Require higher qualifications for younger borrowers
### - Target borrowers that qualify with low interest rates