## Give Me Some Credit: Predicting loan repayment delinquencies

### Abstract
Can we look at a loan application and tell, with reasonable accuracy, whether the applicant will pay back their loans on time?

Yes.

### Introduction
This dataset comes from a [2011 Kaggle competition](https://www.kaggle.com/c/GiveMeSomeCredit).

Here is the competition overview with one corrected typo:

"Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.

The goal of this competition is to build a model that (lenders) can use to help make the best financial decisions."

Top scores on the leaderboard were roughly 0.86, measured as area under the ROC curve.

### Data Exploration Summary
My data exploration notebook, with explanations of my decisions, can be found [here](https://github.com/Lukede9/Thinkful/blob/master/Bootcamp/Unit%203/Capstone/Capstone%20EDA.ipynb). The exploration was guided by three questions:

1. Are there any entries in the data that do not make sense (e.g. Monthly Income of 1 billion)?
2. How does a change in age impact the likelihood to default on the loans? What about the other variables?
3. Is any of the information we collected redundant?

#### Missing Data
There was missing data in the monthly income and number of dependents columns.
- Roughly 20% of borrowers did not report monthly income.
- Roughly 2.5% did not report their number of dependents.
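
A quick way to measure the share of missing values per column is sketched below. The toy DataFrame is made up for illustration and only borrows the dataset's column names; it does not reproduce the 20% and 2.5% figures above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are invented for illustration).
toy = pd.DataFrame({
    "MonthlyIncome": [5400.0, np.nan, 3200.0, 7100.0, np.nan],
    "NumberOfDependents": [0.0, 2.0, np.nan, 1.0, 0.0],
    "age": [45, 38, 61, 29, 52],
})

# isna() marks missing cells; the column mean of those flags is the missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```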

#### Outliers
Many variables had extreme outliers whose relationship to the outcome variable was erratic. These include:
- age < 21
- NumberOfTime30-59DaysPastDueNotWorse > 13
- NumberOfTime60-89DaysPastDueNotWorse > 11
- NumberOfTimes90DaysLate > 17
- RevolvingUtilizationOfUnsecuredLines > 2
- NumberRealEstateLoansOrLines > 30
- DebtRatio > 2
- MonthlyIncome > 10000
- NumberOfDependents > 6

#### Multicollinearity
- The three columns about lateness are highly redundant.
- The number of open credit lines and loans is correlated with the number of real estate loans and lines.
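
One way to spot this kind of redundancy is to list the column pairs whose correlation exceeds a cutoff. The sketch below uses invented toy values for the three lateness columns, and the 0.8 cutoff is an assumption for illustration, not a value taken from the exploration notebook.

```python
import pandas as pd

late_cols = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]

# Invented toy counts: borrowers who are late in one bucket tend to be late in the others.
toy = pd.DataFrame({
    late_cols[0]: [0, 0, 1, 2, 0, 5, 0, 3],
    late_cols[1]: [0, 0, 0, 1, 0, 4, 0, 2],
    late_cols[2]: [0, 0, 0, 1, 0, 3, 0, 1],
})

corr = toy.corr()  # Pearson correlation between every pair of columns
cutoff = 0.8       # hypothetical cutoff for "highly correlated"

# Collect each unordered pair once, keeping only the strongly correlated ones.
redundant_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(late_cols)
    for b in late_cols[i + 1:]
    if abs(corr.loc[a, b]) > cutoff
]
for pair in redundant_pairs:
    print(pair)
```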

#### Potential Data Transformations
Taking the square root of certain variables leads to more normal distributions. This includes:
- 'NumberOfOpenCreditLinesAndLoans'
- 'NumberRealEstateLoansOrLines'
- 'combined_lines'
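
As a rough illustration of why the square root helps, the sketch below compares skewness before and after the transform. The counts are invented and simply mimic a right-skewed column such as 'NumberOfOpenCreditLinesAndLoans'.

```python
import numpy as np
import pandas as pd

# Invented, right-skewed counts standing in for a real column.
open_lines = pd.Series(
    [0, 1, 1, 2, 2, 2, 3, 4, 5, 8, 12, 20, 30],
    name="NumberOfOpenCreditLinesAndLoans",
)

# The square root pulls in the long right tail while leaving small values nearly unchanged.
sqrt_lines = np.sqrt(open_lines)

skew_before = open_lines.skew()
skew_after = sqrt_lines.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness closer to 0 indicates a more symmetric distribution, which is the property the transformation is after.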

### Feature Engineering Summary

- Each variable had a point beyond which the data became too sparse to tell us anything about the correlations. I wrote functions to reduce the scale of the outliers past that point.
- The result is that the entries are still in the same order, which is helpful for random forest because tree-based models split on the ordering of values rather than their magnitude. At the same time, some of the outlier values were thousands of times larger than the average entries in a given column. Now they are more similar, which should help the logistic regression classifier.
- I imputed the missing Monthly Incomes based on the age of the applicant.
- I imputed the missing number of dependents to 0, the mode.
- After this, I did what I could to make the features more normally distributed using simple mathematical transformations. The goal was to feed more normal data to the logistic regression model.
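
The outlier scaling and imputation steps above could look something like the sketch below. The actual functions are not shown in this summary, so the helper name `compress_outliers`, the log-based scaling, the threshold of 2, and the use of the per-age median are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def compress_outliers(values: pd.Series, threshold: float) -> pd.Series:
    """Shrink values above `threshold` onto a log scale while keeping their order.

    Hypothetical helper: values at or below the threshold are unchanged;
    a larger value x becomes threshold + log(1 + (x - threshold)).
    """
    excess = (values - threshold).clip(lower=0)
    return values.where(values <= threshold, threshold + np.log1p(excess))


# Invented values: the last two are far larger than the rest.
debt_ratio = pd.Series([0.2, 0.5, 1.9, 40.0, 5000.0])
compressed = compress_outliers(debt_ratio, threshold=2.0)

# Invented applicants with missing income and dependents.
applicants = pd.DataFrame({
    "age": [30, 30, 30, 50, 50, 70],
    "MonthlyIncome": [3000.0, 5000.0, np.nan, 8000.0, np.nan, np.nan],
    "NumberOfDependents": [1.0, np.nan, 0.0, 2.0, 0.0, np.nan],
})

# Assumed rule: fill missing income with the median income of applicants of the
# same age, falling back to the overall median when an age has no reported income.
overall_median = applicants["MonthlyIncome"].median()
age_median = applicants.groupby("age")["MonthlyIncome"].transform("median")
applicants["MonthlyIncome"] = (
    applicants["MonthlyIncome"].fillna(age_median).fillna(overall_median)
)

# Missing dependents are set to 0, the mode reported above.
applicants["NumberOfDependents"] = applicants["NumberOfDependents"].fillna(0)
```

Because the scaling is strictly increasing, the rank of every entry is unchanged, which is the property the random forest benefits from, while the largest values end up within a few units of the threshold.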

### Feature Selection Summary

- I selected features with the goal of reducing multicollinearity within my variables. In doing so, many of the newly engineered features were rejected.

- The helpful new feature that survived this filtering was called 'bi_combined_lates'. It marks whether the applicant has been late even once on a payment in the past two years.
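
A flag like 'bi_combined_lates' could be built roughly as follows. The exact construction is not shown in this summary, so treating it as "any of the three lateness counts is above zero" is an assumption, and `add_bi_combined_lates` is a hypothetical helper name.

```python
import pandas as pd

LATE_COLUMNS = [
    "NumberOfTime30-59DaysPastDueNotWorse",
    "NumberOfTime60-89DaysPastDueNotWorse",
    "NumberOfTimes90DaysLate",
]


def add_bi_combined_lates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 0/1 flag that is 1 if any lateness count is above zero."""
    out = df.copy()
    out["bi_combined_lates"] = (out[LATE_COLUMNS].sum(axis=1) > 0).astype(int)
    return out


# Invented example rows.
toy = pd.DataFrame({
    LATE_COLUMNS[0]: [0, 1, 0, 0],
    LATE_COLUMNS[1]: [0, 0, 0, 0],
    LATE_COLUMNS[2]: [0, 0, 2, 0],
})
flagged = add_bi_combined_lates(toy)
print(flagged["bi_combined_lates"].tolist())
```

Collapsing the three counts into one binary column is one way to keep the lateness signal without carrying three highly correlated features.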

### Model Selection and Tuning
The models I tested are Logistic Regression, K-Nearest Neighbors, Random Forest, and Gradient Boosting. I chose these because they still perform well with non-normal data.

The measure of performance is the area under the ROC curve.
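
For intuition, the area under the ROC curve equals the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter, with ties counted as half. The sketch below computes it directly from that definition; `pairwise_auc` is a hypothetical helper written for illustration, not the function used to produce the results in this report.

```python
import numpy as np


def pairwise_auc(y_true, y_score) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


# Three of the four positive/negative pairs are ordered correctly.
example_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the metric suits an imbalanced outcome like loan default.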

```python
import pandas as pd

# Load the saved AUC scores for each model.
results = pd.read_csv("capstone_model_results.csv")
results
```

| Model | Default Params | Tuned Params |
|---|---|---|
| LogR | 0.809843 | 0.819266 |
| knn | 0.716889 | Takes too long |
| rfc | 0.775778 | 0.846298 |
| gbc | 0.868585 | 0.767071 |


The Logistic Regression was promising but did not improve much from parameter tuning (0.810 to 0.819).

The K-Nearest Neighbors model was never promising, never improved, and would spend your whole life running if you let it.

Random Forest does better and better the more estimators you feed it, but it takes longer to run and caps off a bit shy of what the gradient booster could do.

The Gradient Boosting Classifier does take a fair amount of time to run, but it practically matches the best Kaggle scores while on default parameters (0.869). Tuning the parameters does not seem to help: the tuned model scored lower, at 0.767.
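
The default-versus-tuned comparison can be sketched with scikit-learn as below. This is not the tuning code behind the table: the data is synthetic (`make_classification`), and the parameter grid, fold count, and random seeds are hypothetical choices that keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the credit data (about 10% positives).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Baseline: default parameters, scored by AUC on held-out data.
default_model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
default_auc = roc_auc_score(y_test, default_model.predict_proba(X_test)[:, 1])

# Tuning: a small, hypothetical grid searched with cross-validated AUC.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)
tuned_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print(f"default AUC: {default_auc:.3f}")
print(f"tuned AUC:   {tuned_auc:.3f} with {search.best_params_}")
```

Scoring both models on the same held-out split keeps the comparison fair, since the cross-validated score reported by the search is computed on different data than the baseline's test score.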

### Conclusion

Lending institutions can always use a good credit-scoring algorithm to assess risk, decide whether to approve a credit line or loan, and determine the rates they will offer to applicants.

The models I have developed do just that. They estimate the likelihood that the applicant will default on the loan in the next two years.

The most successful model, Gradient Boosting with default parameters, scored 0.868 for area under the ROC curve.