## Give Me Some Credit: Predicting loan repayment delinquencies

### Abstract
Can we look at a loan application and tell, with reasonable accuracy, whether the applicant will repay on time?

Yes.

### Introduction
This dataset comes from a [2011 Kaggle competition](https://www.kaggle.com/c/GiveMeSomeCredit).

Here is the competition overview with one corrected typo:

"Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.

The goal of this competition is to build a model that (lenders) can use to help make the best financial decisions."

Top scores on the leaderboard were roughly 0.86 (area under the ROC curve).

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import fancyimpute

from fancyimpute import KNN
from sklearn.model_selection import train_test_split

# Models
from sklearn import ensemble
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import tree
from sklearn.linear_model import LogisticRegression

# The PipeLine
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

import scipy.stats as stats

%matplotlib inline

Using TensorFlow backend.


In [2]:
# Competition training dataset and outcome variable
credit = pd.read_csv('cs-training.csv', index_col=0)
delinq = credit.SeriousDlqin2yrs
test_data = pd.read_csv('cs-test.csv', index_col=0)

In [3]:
# Putting data into stratified train/test groups
X_train, X_test_dropna, y_train, y_test = train_test_split(credit,
                                                    delinq, 
                                                    test_size=.2, 
                                                    stratify=delinq, 
                                                    random_state=24
)
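
To see what `stratify` does in a split like this, the following sketch splits a small synthetic, imbalanced target and compares the positive rate in each part. The toy frame and the `toy_*` names are hypothetical and are not the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with 7% positives (not the competition data)
rng = np.random.default_rng(24)
toy = pd.DataFrame({"feature": rng.normal(size=1000)})
toy_target = pd.Series([1] * 70 + [0] * 930, name="SeriousDlqin2yrs")

toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy, toy_target, test_size=0.2, stratify=toy_target, random_state=24
)

# With stratification, both parts keep close to the same positive rate
train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print(f"train: {train_rate:.3f}, test: {test_rate:.3f}")
```

Without `stratify`, a rare class can end up over- or under-represented in the held-out part purely by chance, which makes AUC estimates noisier.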



# Data Exploration Summary
A link to my data exploration and explanations of my decisions can be found [here.](https://github.com/Lukede9/Thinkful/blob/master/Bootcamp/Unit%203/Capstone/Capstone%20EDA.ipynb) Summary:

#### Missing Data
There was missing data in the monthly income and number of dependents columns.
- Roughly 20% of borrowers did not report monthly income.
- Roughly 2.5% did not report their # of dependents.

Many variables had strange outliers that I will treat as missing and impute. These include:
- age < 21
- NumberOfTime30-59DaysPastDueNotWorse > 13
- NumberOfTime60-89DaysPastDueNotWorse > 11
- NumberOfTimes90DaysLate > 17
- RevolvingUtilizationOfUnsecuredLines > 2
- NumberRealEstateLoansOrLines > 30
- DebtRatio > 2
- MonthlyIncome > 10000
- NumberOfDependents > 6

#### Multi-collinearity
- The three columns about lateness are highly redundant.
- The number of open credit lines and loans is correlated with the number of real estate loans or lines.

# Feature Engineering

### Dealing with outliers and missing values:

In [4]:
# Remove outcome variable from DataFrame
def df_remove_outcome(df):
    df = df.drop(columns='SeriousDlqin2yrs')
    return df
# Handling outliers in age
def age_nan(age):
    if age <21:
        return np.nan
    else:
        return age
    
def df_age_nan(df):
    df['age'] = df['age'].apply(lambda x: age_nan(x))
    return df

# Handling outliers in RevolvingUtilizationOfUnsecuredLines
def rev_nan(rev):
    if rev > 2:
        return np.nan
    else:
        return rev
    
def df_rev_nan(df):
    df['RevolvingUtilizationOfUnsecuredLines'] = df['RevolvingUtilizationOfUnsecuredLines'].apply(lambda x: rev_nan(x))
    return df

# Handling outliers in NumberRealEstateLoansOrLines
def real_estate_nan(real_estate):
    if real_estate > 30:
        return np.nan
    else:
        return real_estate

def df_real_estate_nan(df):
    df['NumberRealEstateLoansOrLines'] = df['NumberRealEstateLoansOrLines'].apply(lambda x: real_estate_nan(x))
    return df

# Handling outliers in DebtRatio
def debt_nan(debt):
    if debt > 2:
        return np.nan
    else:
        return debt
    
def df_debt_nan(df):
    df['DebtRatio'] = df['DebtRatio'].apply(lambda x: debt_nan(x))
    return df

# Handling outliers in MonthlyIncome
def income_nan(income):
    if income > 10000:
        return np.nan
    else:
        return income

def df_income_nan(df):
    df['MonthlyIncome'] = df['MonthlyIncome'].apply(lambda x: (income_nan(x)))
    return df

# Handling outliers in NumberOfTime30-59DaysPastDueNotWorse
def late_30(count):
    if count > 13:
        return np.nan
    else:
        return count
    
def df_late_30(df):
    df['NumberOfTime30-59DaysPastDueNotWorse'] = df['NumberOfTime30-59DaysPastDueNotWorse'].apply(lambda x: late_30(x))
    return df

# Handling outliers in NumberOfTime60-89DaysPastDueNotWorse
def late_60(count):
    if count > 11:
        return np.nan
    else:
        return count
    
def df_late_60(df):
    df['NumberOfTime60-89DaysPastDueNotWorse'] = df['NumberOfTime60-89DaysPastDueNotWorse'].apply(lambda x: late_60(x))
    return df

# Handling outliers in NumberOfTimes90DaysLate
def late_90(count):
    if count > 17:
        return np.nan
    else:
        return count
    
def df_late_90(df):
    df['NumberOfTimes90DaysLate'] = df['NumberOfTimes90DaysLate'].apply(lambda x: late_90(x))
    return df

# Handling outliers in NumberOfDependents
def depend(count):
    # Compare the argument, not the function object itself
    if count > 6:
        return np.nan
    else:
        return count
    
def df_depend(df):
    df['NumberOfDependents'] = df['NumberOfDependents'].apply(lambda x: depend(x))
    return df

In [5]:
# Function to replace all outliers with NaN at once
def nan(df):
    df = df_remove_outcome(df)
    df = df_age_nan(df)
    df = df_rev_nan(df)
    df = df_real_estate_nan(df)
    df = df_debt_nan(df)
    df = df_income_nan(df)
    df = df_late_30(df)
    df = df_late_60(df)
    df = df_late_90(df)
    df = df_depend(df)
    return df
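
The per-column helpers above all follow one pattern: values past a cutoff become NaN. A more compact sketch of the same idea, assuming the cutoffs listed in the exploration summary, keeps them in a dictionary and uses `Series.mask`. The names `UPPER_LIMITS`, `MIN_AGE`, and `mask_outliers` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Upper cutoffs taken from the exploration summary; values above become NaN
UPPER_LIMITS = {
    "NumberOfTime30-59DaysPastDueNotWorse": 13,
    "NumberOfTime60-89DaysPastDueNotWorse": 11,
    "NumberOfTimes90DaysLate": 17,
    "RevolvingUtilizationOfUnsecuredLines": 2,
    "NumberRealEstateLoansOrLines": 30,
    "DebtRatio": 2,
    "MonthlyIncome": 10000,
    "NumberOfDependents": 6,
}
MIN_AGE = 21  # ages below this become NaN


def mask_outliers(df):
    """Return a copy of df with out-of-range values replaced by NaN."""
    out = df.copy()
    for col, limit in UPPER_LIMITS.items():
        if col in out.columns:
            out[col] = out[col].mask(out[col] > limit)
    if "age" in out.columns:
        out["age"] = out["age"].mask(out["age"] < MIN_AGE)
    return out


# Tiny hypothetical example
demo = pd.DataFrame({"age": [0, 45], "DebtRatio": [0.3, 5000.0]})
cleaned = mask_outliers(demo)
print(cleaned)
```

Working on a copy leaves the input frame untouched, and the vectorised comparison avoids calling a Python function once per row.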

#### Imputing missing values with FancyImpute

In [6]:
# Create boolean mask of nulls
def nan_tf(df):
    for col in df.columns:
        df[col] = np.isnan(df[col])
    return df

# Returns DataFrame, nulls filled using KNN, k=3
def df_fancyimpute(df):
    # fit_transform runs the KNN imputation and returns a NumPy array
    filled = KNN(k=3).fit_transform(df.values)
    # Keep the original column names and row index
    filled = pd.DataFrame(filled, columns=df.columns, index=df.index)
    return filled
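
For readers without FancyImpute installed, a similar k-nearest-neighbours fill can be sketched with scikit-learn's `KNNImputer` (available from scikit-learn 0.22). This is a swapped-in alternative, not the imputer used in this notebook, and its distance and weighting details may differ from FancyImpute's. The helper name `df_knn_impute` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def df_knn_impute(df, k=3):
    """Return a DataFrame with NaNs filled from the k nearest rows."""
    imputer = KNNImputer(n_neighbors=k)
    filled = imputer.fit_transform(df)
    # Keep the original column names and row index
    return pd.DataFrame(filled, columns=df.columns, index=df.index)


# Tiny hypothetical example with one complete column
demo = pd.DataFrame(
    {
        "age": [30.0, 32.0, 31.0, 60.0, np.nan],
        "MonthlyIncome": [3000.0, 3200.0, np.nan, 9000.0, 3100.0],
        "DebtRatio": [0.30, 0.35, 0.32, 0.90, 0.31],
    },
    index=[10, 11, 12, 13, 14],
)
imputed = df_knn_impute(demo, k=2)
print(imputed)
```

Because the distances are computed on raw values, columns with large ranges such as income dominate the neighbour search; scaling the features first is a common precaution.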

### Creating new features:

The most obvious new features simply combine the correlated features we already have: a total of the three lateness counts and a total of the two line-and-loan counts.

In [7]:
# Take in DF and return it with new column of combined lateness counts
def df_combined_lates(df):
    df['combined_lates'] = df['NumberOfTime30-59DaysPastDueNotWorse'] + \
    df['NumberOfTime60-89DaysPastDueNotWorse'] + \
    df['NumberOfTimes90DaysLate']
    return df

# Take in DF and return it with new column of combined lines feature
def df_combined_lines(df):
    df['combined_lines'] = df['NumberRealEstateLoansOrLines'] + \
    df['NumberOfOpenCreditLinesAndLoans']
    return df

In [8]:
# Returns DataFrame with new features for lateness and loans
def df_new_features(df):
    df = df_combined_lates(df)
    df = df_combined_lines(df)
    return df

# Feature Selection

In [9]:
# Pick best features

# Model Selection and Tuning
The models that I will be testing are Logistic Regression, K-Nearest Neighbors, Random Forest, and Gradient Boosting. I chose these because none of them assumes normally distributed features.

The measure of performance is the area under the ROC curve.
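
As a rough illustration of how the four candidates could be compared on ROC AUC, the sketch below runs 3-fold cross-validation on a synthetic imbalanced dataset from `make_classification`. The data, the hyperparameters, and the `candidates` dictionary are assumptions for illustration; no scores on the competition data are implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with roughly 7% positives (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.93], random_state=24
)

# Placeholder settings, not tuned choices
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=15),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=24),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=24),
}

auc_scores = {}
for name, model in candidates.items():
    # Scaling sits inside the pipeline so each fold is scaled on its own training part
    pipe = Pipeline([("scale", MinMaxScaler()), ("model", model)])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=3, scoring="roc_auc")
    auc_scores[name] = scores.mean()

for name, score in auc_scores.items():
    print(f"{name}: {score:.3f}")
```

AUC is computed from predicted scores rather than hard labels, so it is not tied to a single decision threshold, which suits a target as imbalanced as this one.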

In [10]:
# Pick best model and fine tune it with undersampling and GridSearchCV
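
One possible shape for this step, sketched on synthetic data: randomly undersample the majority class, then search a small grid with `GridSearchCV` scored on ROC AUC. The `undersample` helper, the choice of Random Forest, and the grid values are all assumptions for illustration, not the tuned model for this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def undersample(X, y, random_state=24):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = np.random.default_rng(random_state)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) <= len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([minority, kept_majority])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Synthetic stand-in data (hypothetical), roughly 7% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.93], random_state=24
)
X_bal, y_bal = undersample(X_demo, y_demo)

# Small illustrative grid; the values are placeholders
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=24),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_bal, y_bal)
print(search.best_params_)
```

The helper indexes NumPy arrays; with a DataFrame the same row positions would go through `.iloc`. Undersampling should be applied to the training portion only, so the held-out set keeps the original class balance.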

# Model Results

In [11]:
# display results using roc seaborn visualizations
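
A minimal sketch of a ROC plot, using plain matplotlib and a logistic regression fitted on synthetic data. The data and the model are stand-ins chosen for illustration; the curve says nothing about results on the competition set.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical), roughly 7% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.93], random_state=24
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=24
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]  # probability of the positive class

# ROC points and the area under them
fpr, tpr, _ = roc_curve(y_te, probs)
auc = roc_auc_score(y_te, probs)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"logistic regression (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```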

# Conclusion

In [12]:
# Summarize it all