# Case Study 2: Predicting bank loan defaults and credit scores

You are a data scientist at a consumer bank that wants to use its data on customers' savings, loans, and spending habits to predict whether a customer will default on a loan. The bank also wants to know whether it can predict the credit score that a credit rating agency will assign to each customer.

The bank gives you a dataset, 'credit_score.csv', covering 1000 customers who recently took out loans, and asks you to see whether there is any "signal" in the data for making these predictions.

In [None]:
# load numpy and pandas and the dataset
import numpy as np
import pandas as pd

df = pd.read_csv('credit_score.csv')

print(df)

## Data Processing

Here are steps I recommend to process the data in preparation for analysis.

- The column 'CUST_ID' looks like a customer ID and should not be used as a predictor. Check that its values are unique for every entry in the dataset, then drop the column.
- The targets are 'DEFAULT' (whether a loan defaulted or not) and 'CREDIT_SCORE' (the customer's credit score). Set both of these aside for the analyses later.
- Create your DataFrame of predictors. Check whether any need to be processed as categorical variables. If so, do it like we did last week, or look at the function pd.get_dummies(), which makes one-hot encodings easy.
- Take note of which predictors are real-valued and will likely need to be standardized before the analyses later.
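The steps above can be sketched roughly as follows. This is only an illustration on a tiny made-up DataFrame: the categorical column 'CAT_GAMBLING' and all the values are assumptions, so check the real column names and types in 'credit_score.csv' before reusing any of it.

```python
import pandas as pd

# Toy stand-in for credit_score.csv. CAT_GAMBLING and every value here are hypothetical.
df = pd.DataFrame({
    "CUST_ID": ["C01", "C02", "C03", "C04"],
    "INCOME": [33000.0, 77000.0, 30000.0, 104000.0],
    "CAT_GAMBLING": ["High", "No", "Low", "No"],
    "CREDIT_SCORE": [444, 625, 469, 559],
    "DEFAULT": [1, 0, 1, 0],
})

# 1. The ID column should identify each row uniquely before we drop it.
assert df["CUST_ID"].is_unique

# 2. Set the two targets aside.
y_default = df["DEFAULT"]
y_score = df["CREDIT_SCORE"]

# 3. Everything that is neither an ID nor a target is a predictor.
X = df.drop(columns=["CUST_ID", "DEFAULT", "CREDIT_SCORE"])

# 4. One-hot encode every non-numeric predictor.
categorical_cols = [c for c in X.columns if not pd.api.types.is_numeric_dtype(X[c])]
X = pd.get_dummies(X, columns=categorical_cols, dtype=float)

print(X.columns.tolist())
```

Here pd.get_dummies() leaves the numeric columns alone and replaces each categorical column with one 0/1 column per category.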

To make life easier, I looked at the data in Excel, and the column names below are the ones I think you should standardize.

In [None]:
columns_to_normalize = [
    'INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME', 'R_DEBT_INCOME',
       'R_DEBT_SAVINGS', 'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING',
       'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT',
       'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME',
       'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT', 'T_ENTERTAINMENT_12',
       'T_ENTERTAINMENT_6', 'R_ENTERTAINMENT', 'R_ENTERTAINMENT_INCOME',
       'R_ENTERTAINMENT_SAVINGS', 'R_ENTERTAINMENT_DEBT', 'T_FINES_12',
       'T_FINES_6', 'R_FINES', 'R_FINES_INCOME', 'R_FINES_SAVINGS',
       'R_FINES_DEBT', 'T_GAMBLING_12', 'T_GAMBLING_6', 'R_GAMBLING',
       'R_GAMBLING_INCOME', 'R_GAMBLING_SAVINGS', 'R_GAMBLING_DEBT',
       'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME',
       'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT', 'T_HEALTH_12', 'T_HEALTH_6',
       'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT',
       'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME',
       'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT', 'T_TAX_12', 'T_TAX_6', 'R_TAX',
       'R_TAX_INCOME', 'R_TAX_SAVINGS', 'R_TAX_DEBT', 'T_TRAVEL_12',
       'T_TRAVEL_6', 'R_TRAVEL', 'R_TRAVEL_INCOME', 'R_TRAVEL_SAVINGS',
       'R_TRAVEL_DEBT', 'T_UTILITIES_12', 'T_UTILITIES_6', 'R_UTILITIES',
       'R_UTILITIES_INCOME', 'R_UTILITIES_SAVINGS', 'R_UTILITIES_DEBT',
       'T_EXPENDITURE_12', 'T_EXPENDITURE_6', 'R_EXPENDITURE',
       'R_EXPENDITURE_INCOME', 'R_EXPENDITURE_SAVINGS', 'R_EXPENDITURE_DEBT',
]
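One possible way to standardize only the listed columns is scikit-learn's StandardScaler, fitted on the training split alone so that no information from the test split leaks into the means and standard deviations. The sketch below uses a small synthetic DataFrame and a two-column stand-in for columns_to_normalize; the data and the one-hot column name are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic predictors standing in for the real data (values are made up).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "INCOME": rng.normal(50000, 15000, size=200),
    "SAVINGS": rng.normal(20000, 8000, size=200),
    "CAT_GAMBLING_High": rng.integers(0, 2, size=200).astype(float),  # hypothetical one-hot column
})
cols = ["INCOME", "SAVINGS"]  # short stand-in for columns_to_normalize

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
X_train, X_test = X_train.copy(), X_test.copy()

# Learn the mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(X_train[cols])

# ...then apply the same transformation to both splits.
X_train[cols] = scaler.transform(X_train[cols])
X_test[cols] = scaler.transform(X_test[cols])

print(X_train[cols].mean().round(6))
print(X_train[cols].std(ddof=0).round(6))
```

The one-hot column is left untouched, and the test split ends up only approximately standardized, which is expected.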

## Classification experiment

Now predict the binary label 'DEFAULT'. Run an experiment comparing logistic regression with no regularization, with L1 regularization, and with L2 regularization. You should run multiple cross-validation splits, like last week, so that the comparison does not depend on a single train/test split.

Since this is a classification experiment, use the AUC score as your performance metric. Scikit-learn provides the function 'sklearn.metrics.roc_auc_score' for this.

Getting the Logistic Regression models to behave well requires some tweaking. I have done this for you to save you some time, but in the future you should get used to playing around with them and figuring out good settings yourself. Use the following method calls, where the training dataset needs to be created appropriately.

In [None]:
from sklearn.linear_model import LogisticRegression  # logistic regression with optionally built in regularization.
from sklearn.linear_model import LogisticRegressionCV  # logistic regression with built in regularization. The penalty parameter is Cross-validated
from sklearn.metrics import roc_auc_score # computes the AUC score
import matplotlib.pyplot as plt # for plotting


"""
Use the methods like this, which should look similar to what we did last week:

    clf = LogisticRegression(penalty=None, max_iter=10000)  # no regularization, so there is no penalty parameter to cross-validate
    clf.fit(X_train, Y_train)  # the .fit method operates on the clf object in-place
    Y_pred = clf.predict_proba(X_test)  # this gives us the probabilities of each label
    print(clf.classes_)  # look at this to see the order of the labels in Y_pred
    Y_pred_P1 = Y_pred[:, 1]  # so this is the probability of assigning Y=1
    
For L1 regularization, the following settings performed well for me:

    clf = LogisticRegressionCV(penalty='l1', solver='saga', max_iter=10000)  # uses 5-fold cross validation by default
    
And finally for L2 regularization, the following settings performed well for me:

    clf = LogisticRegressionCV(penalty='l2', max_iter=10000)  # uses 5-fold cross validation by default

You use roc_auc_score like this (definitely google the documentation for all of these methods):

    auc = roc_auc_score(Y_test, Y_pred_P1)
    
where Y_test is a vector of the true 0-1 labels and Y_pred_P1 is a corresponding vector of PROBABILITIES for Y=1.
See the example code I wrote above to see how to predict the probabilities P(Y=1).

"""



## Regression on discrete data

We now move on to the second part of the job, which is to try to predict the credit score (column name is 'CREDIT_SCORE'). Explore the data in this variable and decide what type of model from lecture is most appropriate.
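A few quick checks can help with that decision, for example whether the target is integer-valued, whether it is non-negative, and what range it covers. The sketch below uses a handful of made-up scores in place of df['CREDIT_SCORE']; it only summarizes the variable and does not make the modelling choice for you.

```python
import numpy as np
import pandas as pd

# Hypothetical credit scores standing in for df['CREDIT_SCORE'].
y_score = pd.Series([444, 625, 469, 559, 473, 596, 580, 512], name="CREDIT_SCORE")

# Range, centre, and spread of the target.
print(y_score.describe())

# Properties that matter when choosing between a Gaussian-style and a count-style model.
is_integer_valued = bool(np.all(y_score == np.round(y_score)))
is_non_negative = bool((y_score >= 0).all())
n_distinct = y_score.nunique()
print(is_integer_valued, is_non_negative, n_distinct)
```

A histogram, for example y_score.hist() with matplotlib installed, is a natural next step for looking at the shape of the distribution.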

Depending on what you decide, the below methods from Scikit-learn can be used. You should again be comparing different models on multiple cross validation splits.

What should you use as an evaluation metric? There are several reasonable choices, but for now use root mean squared error (RMSE), like last week.
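As a reminder, RMSE is the square root of the average squared difference between the true and predicted values. A minimal sketch with made-up numbers, using np.sqrt on top of sklearn.metrics.mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted credit scores.
Y_test = np.array([500, 600, 700])
Y_pred = np.array([510.0, 590.0, 700.0])

# Errors are -10, 10, 0, so MSE = 200 / 3 and RMSE = sqrt(200 / 3), roughly 8.165.
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print(round(float(rmse), 3))
```

Because RMSE is in the same units as the target, it can be read directly as a typical error in credit score points.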


In [None]:
from sklearn.linear_model import RidgeCV, LassoCV, PoissonRegressor
from sklearn.model_selection import GridSearchCV

"""
For Ridge regression, everything worked fine out of the box for me:
    
    model = RidgeCV()  # uses efficient leave-one-out cross-validation by default
    model.fit(X_train, Y_train)  
    Y_pred = model.predict(X_test) 
    
For Lasso, the following worked well for me:

    model = LassoCV(max_iter=2000)  # uses 5-fold cross validation by default

And for Poisson Regression, the following setting worked well for me.

    model = PoissonRegressor(max_iter=10000)

Note that, while PoissonRegressor uses L2 regularization by default with a penalty parameter set to alpha=1,
it does not cross-validate a good value for this parameter the way RidgeCV and LassoCV do for you.

So you could consider cross-validating this parameter yourself using sklearn.model_selection.GridSearchCV.

"""
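A rough sketch of how the three regressors could be compared, with GridSearchCV choosing alpha for PoissonRegressor, is given below. It runs on a synthetic count-like target rather than the real credit scores, and the alpha grid, the number of splits, and the helper name make_models are assumptions to adapt.

```python
import numpy as np
from sklearn.linear_model import LassoCV, PoissonRegressor, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and a non-negative, integer-valued target (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
Y = rng.poisson(np.exp(6.0 + 0.1 * X[:, 0] - 0.05 * X[:, 1]))

def make_models():
    # Hypothetical helper returning fresh models; the alpha grid is an assumption.
    alpha_grid = {"alpha": np.logspace(-4, 1, 6)}
    return {
        "ridge": RidgeCV(),
        "lasso": LassoCV(max_iter=2000),
        "poisson": GridSearchCV(PoissonRegressor(max_iter=10000), alpha_grid,
                                cv=5, scoring="neg_root_mean_squared_error"),
    }

splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
rmse = {name: [] for name in make_models()}

for train_idx, test_idx in splitter.split(X):
    # Standardize using statistics from the training split only.
    scaler = StandardScaler().fit(X[train_idx])
    X_train, X_test = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
    for name, model in make_models().items():
        model.fit(X_train, Y[train_idx])  # GridSearchCV refits the best alpha on the full training split
        Y_pred = model.predict(X_test)
        rmse[name].append(np.sqrt(mean_squared_error(Y[test_idx], Y_pred)))

for name, scores in rmse.items():
    print(f"{name}: mean RMSE = {np.mean(scores):.2f} (+/- {np.std(scores):.2f})")
```

Wrapping PoissonRegressor in GridSearchCV puts it on the same footing as RidgeCV and LassoCV, since all three then tune their penalty using the training split only.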
