# Case Study 2: Predicting bank loan defaults and credit scores

You are a data scientist working for a consumer bank that wants to use their data on customer savings, loans, and spending habits to predict whether they will default on a loan. They also want to know if they can predict what credit score a credit rating agency will give them.

They give you a dataset 'credit_score.csv' of 1000 customers who took out loans recently and ask you to see if there is any "signal" in the data to make this prediction.

In [1]:
# load numpy and pandas and the dataset
import numpy as np
import pandas as pd

df = pd.read_csv('./data/credit_score.csv')

print(df)

        CUST_ID  INCOME  SAVINGS     DEBT  R_SAVINGS_INCOME  R_DEBT_INCOME  \
0    C02COQEVYU   33269        0   532304            0.0000        16.0000   
1    C02OZKC0ZF   77158    91187   315648            1.1818         4.0909   
2    C03FHP2D0A   30917    21642   534864            0.7000        17.3000   
3    C03PVPPHOY   80657    64526   629125            0.8000         7.8000   
4    C04J69MUX0  149971  1172498  2399531            7.8182        16.0000   
..          ...     ...      ...      ...               ...            ...   
995  CZQHJC9HDH  328892  1465066  5501471            4.4546        16.7273   
996  CZRA4MLB0P   81404    88805   680837            1.0909         8.3637   
997  CZSOD1KVFX       0    42428    30760            3.2379         8.1889   
998  CZWC76UAUT   36011     8002   604181            0.2222        16.7777   
999  CZZV5B3SAL   44266   309859    44266            6.9999         1.0000   

     R_DEBT_SAVINGS  T_CLOTHING_12  T_CLOTHING_6  R_CLOTHING  .

## Data Processing

Here are steps I recommend to process the data in preparation for analysis.

- We see the column 'CUST_ID', which we can guess is the customer ID and should not be used as a predictor. Check that these are unique for every entry in the dataset, and then remove it from the columns.
- The targets are 'DEFAULT' (whether a loan defaulted or not) and 'CREDIT_SCORE' (the customers credit score). Set both of these apart for analyses later.
- Create your DataFrame of predictors. Check if any need to be processed as categorical variables. If so, do it like we did last week, or you can look at the function pd.get_dummies(), which can help you easily make one-hot encodings.
- Take note of which are real-valued and likely need to be standardized before analyses later.

To make life easier, I looked at the data in Excel and the below column names are the ones I think you should try to standardize.

In [2]:
columns_to_normalize = [
    'INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME', 'R_DEBT_INCOME',
       'R_DEBT_SAVINGS', 'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING',
       'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT',
       'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME',
       'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT', 'T_ENTERTAINMENT_12',
       'T_ENTERTAINMENT_6', 'R_ENTERTAINMENT', 'R_ENTERTAINMENT_INCOME',
       'R_ENTERTAINMENT_SAVINGS', 'R_ENTERTAINMENT_DEBT', 'T_FINES_12',
       'T_FINES_6', 'R_FINES', 'R_FINES_INCOME', 'R_FINES_SAVINGS',
       'R_FINES_DEBT', 'T_GAMBLING_12', 'T_GAMBLING_6', 'R_GAMBLING',
       'R_GAMBLING_INCOME', 'R_GAMBLING_SAVINGS', 'R_GAMBLING_DEBT',
       'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME',
       'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT', 'T_HEALTH_12', 'T_HEALTH_6',
       'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT',
       'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME',
       'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT', 'T_TAX_12', 'T_TAX_6', 'R_TAX',
       'R_TAX_INCOME', 'R_TAX_SAVINGS', 'R_TAX_DEBT', 'T_TRAVEL_12',
       'T_TRAVEL_6', 'R_TRAVEL', 'R_TRAVEL_INCOME', 'R_TRAVEL_SAVINGS',
       'R_TRAVEL_DEBT', 'T_UTILITIES_12', 'T_UTILITIES_6', 'R_UTILITIES',
       'R_UTILITIES_INCOME', 'R_UTILITIES_SAVINGS', 'R_UTILITIES_DEBT',
       'T_EXPENDITURE_12', 'T_EXPENDITURE_6', 'R_EXPENDITURE',
       'R_EXPENDITURE_INCOME', 'R_EXPENDITURE_SAVINGS', 'R_EXPENDITURE_DEBT',
]

In [3]:
# Check CUST_ID is unique
print(df['CUST_ID'].is_unique)
# Drop CUST_ID if it is unique
if df['CUST_ID'].is_unique:
    df.drop(columns=['CUST_ID'], inplace=True)

True


In [4]:
# Save y_default as the 1st target variable
y_default = df['DEFAULT']
# Drop y_default from the dataframe
df.drop(columns=['DEFAULT'], inplace=True)

# Save y_credit_score as the 2nd target variable
y_credit_score = df['CREDIT_SCORE']
# Drop y_credit_score from the dataframe
df.drop(columns=['CREDIT_SCORE'], inplace=True)

In [5]:
# Normalize the data using standard scaler from sklearn
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[columns_to_normalize] = scaler.fit_transform(df[columns_to_normalize])
# display the normalized dataframe
print(df)

       INCOME   SAVINGS      DEBT  R_SAVINGS_INCOME  R_DEBT_INCOME  \
0   -0.777240 -0.933351 -0.263339         -1.024549       1.699167   
1   -0.391097 -0.727370 -0.484123         -0.726575      -0.338334   
2   -0.797934 -0.884465 -0.260730         -0.848054       1.921581   
3   -0.360312 -0.787594 -0.164673         -0.822840       0.296247   
4    0.249525  1.715197  1.639472          0.946701       1.699167   
..        ...       ...       ...               ...            ...   
995  1.823705  2.376077  4.800526          0.098616       1.823599   
996 -0.353740 -0.732750 -0.111975         -0.749494       0.392689   
997 -1.069947 -0.837511 -0.774441         -0.208158       0.362783   
998 -0.753116 -0.915276 -0.190092         -0.968525       1.832222   
999 -0.680487 -0.233413 -0.760677          0.740378      -0.867150   

     R_DEBT_SAVINGS  T_CLOTHING_12  T_CLOTHING_6  R_CLOTHING  \
0         -0.278144      -0.659327     -0.492793    0.192660   
1         -0.143371      -0.134

In [6]:
# Find the columns with object datatype
object_columns = df.select_dtypes(include='object').columns
# Print the name of the columns with object datatype
print(object_columns)


Index(['CAT_GAMBLING'], dtype='object')


In [7]:
# Apply one-hot encoding to the object columns using get_dummies
df = pd.get_dummies(df, columns=object_columns, drop_first=True)
# Check all columns are numeric
print(df.info())
# Print the first 5 rows of the dataframe
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 85 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   INCOME                   1000 non-null   float64
 1   SAVINGS                  1000 non-null   float64
 2   DEBT                     1000 non-null   float64
 3   R_SAVINGS_INCOME         1000 non-null   float64
 4   R_DEBT_INCOME            1000 non-null   float64
 5   R_DEBT_SAVINGS           1000 non-null   float64
 6   T_CLOTHING_12            1000 non-null   float64
 7   T_CLOTHING_6             1000 non-null   float64
 8   R_CLOTHING               1000 non-null   float64
 9   R_CLOTHING_INCOME        1000 non-null   float64
 10  R_CLOTHING_SAVINGS       1000 non-null   float64
 11  R_CLOTHING_DEBT          1000 non-null   float64
 12  T_EDUCATION_12           1000 non-null   float64
 13  T_EDUCATION_6            1000 non-null   float64
 14  R_EDUCATION              

Unnamed: 0,INCOME,SAVINGS,DEBT,R_SAVINGS_INCOME,R_DEBT_INCOME,R_DEBT_SAVINGS,T_CLOTHING_12,T_CLOTHING_6,R_CLOTHING,R_CLOTHING_INCOME,...,R_EXPENDITURE_INCOME,R_EXPENDITURE_SAVINGS,R_EXPENDITURE_DEBT,CAT_DEBT,CAT_CREDIT_CARD,CAT_MORTGAGE,CAT_SAVINGS_ACCOUNT,CAT_DEPENDENTS,CAT_GAMBLING_Low,CAT_GAMBLING_No
0,-0.77724,-0.933351,-0.263339,-1.024549,1.699167,-0.278144,-0.659327,-0.492793,0.19266,0.033098,...,0.333879,-0.562241,-0.417928,1,0,0,0,0,False,False
1,-0.391097,-0.72737,-0.484123,-0.726575,-0.338334,-0.143371,-0.134234,-0.655799,-1.84703,0.528442,...,-0.204296,-0.088731,-0.294962,1,0,0,1,0,False,True
2,-0.797934,-0.884465,-0.26073,-0.848054,1.921581,1.123182,-0.757155,-0.509407,1.222678,-0.483552,...,0.333879,0.317187,-0.421547,1,0,0,1,0,False,False
3,-0.360312,-0.787594,-0.164673,-0.82284,0.296247,0.231386,0.004624,0.042937,0.350766,0.784103,...,0.333879,0.207243,-0.36734,1,0,0,1,0,False,False
4,0.249525,1.715197,1.639472,0.946701,1.699167,-0.227697,-0.647432,-0.614559,-1.23792,-1.128032,...,-0.204296,-0.490648,-0.422317,1,1,1,1,1,False,False


## Classification experiment

Now predict the binary label 'DEFAULT'. Run an experiment comparing Logistic regression with no regularization, with L1 regularization, and with L2 regularization. You should probably run multiple cross validation splits, like last week.

Since this is a classification experiment, use the AUC score as your performance metric. Scikit-learn has the method: 'sklearn.metrics.roc_auc_score'.

Getting the Logistic Regression models to behave well requires some tweaking. I have done this for you to save you some time, but in the future you should get used to playing around with them and figuring out good settings yourself. Use the following method calls, where the training dataset needs to be created appropriately.

In [8]:
from sklearn.linear_model import LogisticRegression  # logistic regression with optionally built in regularization.
from sklearn.linear_model import LogisticRegressionCV  # logistic regression with built in regularization. The penalty parameter is Cross-validated
from sklearn.metrics import roc_auc_score # computes the AUC score
import matplotlib.pyplot as plt # for plotting


"""
Use the methods like this, which should look similar to what we did last week:

    clf = LogisticRegression(penalty=None, max_iter=10000)  # uses 5-fold cross validation by default
    clf.fit(X_train, Y_train)  # the .fit method operates on the clf object in-place
    Y_pred = clf.predict_proba(X_test)  # this gives us the probabilities of each label
    print(clf.classes_)  # look at this to see the order of the labels in Y_pred
    Y_pred_P1 = Y_pred[:, 1]  # so this is the probability of assigning Y=1
    
For L1 regularization, the following settings performed well for me:

    clf = LogisticRegressionCV(penalty='l1', solver='saga', max_iter=10000)  # uses 5-fold cross validation by default
    
And finally for L2 regularization, the following settings performed well for me:

    clf = LogisticRegressionCV(penalty='l2', max_iter=10000)  # uses 5-fold cross validation by default

You use roc_auc_score like this (definitely google the documentation for all of these methods):

    auc = roc_auc_score(Y_test, Y_pred_P1)
    
where Y_test is a vector of the true 0-1 labels and Y_pred_P1 is a corresponding vector of PROBABILITIES for Y=1.
See the example code I wrote above to see how to predict the probabilities P(Y=1).

"""



"\nUse the methods like this, which should look similar to what we did last week:\n\n    clf = LogisticRegression(penalty=None, max_iter=10000)  # uses 5-fold cross validation by default\n    clf.fit(X_train, Y_train)  # the .fit method operates on the clf object in-place\n    Y_pred = clf.predict_proba(X_test)  # this gives us the probabilities of each label\n    print(clf.classes_)  # look at this to see the order of the labels in Y_pred\n    Y_pred_P1 = Y_pred[:, 1]  # so this is the probability of assigning Y=1\n    \nFor L1 regularization, the following settings performed well for me:\n\n    clf = LogisticRegressionCV(penalty='l1', solver='saga', max_iter=10000)  # uses 5-fold cross validation by default\n    \nAnd finally for L2 regularization, the following settings performed well for me:\n\n    clf = LogisticRegressionCV(penalty='l2', max_iter=10000)  # uses 5-fold cross validation by default\n\nYou use roc_auc_score like this (definitely google the documentation for all of the

In [10]:
# Perform a stratified train-test split with 20% of the data in the test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(df, y_default, test_size=0.2, stratify=y_default, random_state=0)

In [12]:
# Instead of K-fold cross validation, I will just take random subsets of the dataset, which is easier to implement.
n_cv = 20

errors_lr = []  # these are containers to hold the test set errors
errors_ridge = []
errors_lasso = []
auc_lr = []
auc_ridge = []
auc_lasso = []

indices = list(range(len(df)))  # how we will index the dataset
n_train = int(len(df) * .80)  # each split will have 80% train and 20% test

# iterate through the test sets
for k in range(n_cv):
    
    np.random.shuffle(indices)  # shuffle the indices. this function works in-place
    train_inds = indices[:n_train]  # slice out the training indices
    test_inds = indices[n_train:]

    Y_train = y_default.iloc[train_inds]  # it is very important to remember to use iloc if using integer index
    X_train = df.iloc[train_inds, :].copy()
    
    Y_test = y_default.iloc[test_inds]
    X_test = df.iloc[test_inds, :].copy()
    
    # standardize the predictors (don't standardize the gender variable)
    for feature_name in columns_to_normalize:
        mean_ = X_train[feature_name].mean()
        std_ = X_train[feature_name].std()
        X_train[feature_name] = (X_train[feature_name] - mean_) / std_
        X_test[feature_name] = (X_test[feature_name] - mean_) / std_  # we must use the training statistics to transform the test set!
    
    # Now fit the models on the training set and predict the test targets
    lr = LogisticRegression(fit_intercept=True).fit(X_train, Y_train)  # Linear regression with an intercept. Do NOT use X_ from statsmodels.
    Y_pred = lr.predict(X_test)  # prediction on a test set
    rmse = np.sqrt(np.mean((Y_test - Y_pred) ** 2))  # root mean squared error is more interpretable than MSE
    errors_lr.append(rmse)
    auc = roc_auc_score(Y_test, Y_pred)
    auc_lr.append(auc)


    ridge = LogisticRegressionCV(penalty='l1', solver='saga', max_iter=10000).fit(X_train, Y_train)  # Ridge regression with an intercept. Selects the penalty from among 0.1, 1.0, and 10 using 5-fold cross validation.
    Y_pred = ridge.predict(X_test) 
    rmse = np.sqrt(np.mean((Y_test - Y_pred) ** 2))
    errors_ridge.append(rmse)
    auc = roc_auc_score(Y_test, Y_pred)
    auc_ridge.append(auc)

    lasso =LogisticRegressionCV(penalty='l2', max_iter=10000).fit(X_train, Y_train)  # Lasso with an intercept. Selects the penalty from among 0.1, 1.0, and 10 using 5-fold cross validation.
    Y_pred = lasso.predict(X_test)
    rmse = np.sqrt(np.mean((Y_test - Y_pred) ** 2))
    errors_lasso.append(rmse)
    auc = roc_auc_score(Y_test, Y_pred)
    auc_lasso.append(auc)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

In [13]:
print(np.mean(auc_lr))
print(np.mean(auc_ridge))
print(np.mean(auc_lasso))

0.572018841961581
0.5527969497392814
0.5464016256872268


## Regression on discrete data

We now move on to the second part of the job, which is to try to predict the credit score (column name is 'CREDIT_SCORE'). Explore the data in this variable and decide what type of model from lecture is most appropriate.

Depending on what you decide, the below methods from Scikit-learn can be used. You should again be comparing different models on multiple cross validation splits.

As an evaluation metric, what should you use? There are different choices, but for now why don't you just try to use Root mean squared error (like last week).


In [9]:
from sklearn.linear_model import RidgeCV, LassoCV, PoissonRegressor
from sklearn.model_selection import GridSearchCV

"""
For Ridge regression, everything worked fine out of the box for me:
    
    model = RidgeCV()  # uses 5-fold cross validation by default
    model.fit(X_train, Y_train)  
    Y_pred = model.predict(X_test) 
    
For Lasso, the following worked well for me:

    model = LassoCV(max_iter=2000)  # uses 5-fold cross validation by default

And for Poisson Regression, the following setting worked well for me.

    model = PoissonRegressor(max_iter=10000)

Note that, while PoissonRegressor uses L2 regularization by default with a penalty parameter set to alpha=2,
it does not cross validate a good value for this parameter the way RidgeCV and LassoCV do for you. 

So you could consider cross validating this parameter using the sklearn.model_selection.GridSearchCV method.

"""


'\nFor Ridge regression, everything worked fine out of the box for me:\n    \n    model = RidgeCV()  # uses 5-fold cross validation by default\n    model.fit(X_train, Y_train)  \n    Y_pred = model.predict(X_test) \n    \nFor Lasso, the following worked well for me:\n\n    model = LassoCV(max_iter=2000)  # uses 5-fold cross validation by default\n\nAnd for Poisson Regression, the following setting worked well for me.\n\n    model = PoissonRegressor(max_iter=10000)\n\nNote that, while PoissonRegressor uses L2 regularization by default with a penalty parameter set to alpha=2,\nit does not cross validate a good value for this parameter the way RidgeCV and LassoCV do for you. \n\nSo you could consider cross validating this parameter using the sklearn.model_selection.GridSearchCV method.\n\n'