### Credit Rating Models

This notebook builds models to predict whether a customer has good or bad credit, using financial and education data. The data was obtained from [this Kaggle page](https://www.kaggle.com/rikdifos/credit-card-approval-prediction). We cleaned the data and did the exploratory analysis in separate notebooks. The goal here is to build a few different models on the data and compare their precision and recall.

In [29]:
import sys
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

Let's begin by importing the main dataset we will work with. In this dataset the continuous features have been rescaled, and the multiclass categorical features use a one-of-K (one-hot) encoding.

In [5]:
# read_csv accepts a path directly, so no explicit file handle is needed
dataframe = pd.read_csv("feature_dataframe_with_one_in_K.csv")

print(dataframe.columns.values)

['Unnamed: 0' 'ID' 'Gender' 'Car' 'Property' 'Children' 'Income' 'Age'
 'Employment_Length' 'Mobile_Phone' 'Work_Phone' 'Phone' 'Email'
 'Family_Size' 'IDList' '6_Month' '12_Month' '24_Month' 'Lifetime'
 'Income_Type_Commercial associate' 'Income_Type_Pensioner'
 'Income_Type_State servant' 'Income_Type_Student' 'Income_Type_Working'
 'Education_Academic degree' 'Education_Higher education'
 'Education_Incomplete higher' 'Education_Lower secondary'
 'Education_Secondary / secondary special'
 'Marriage_Status_Civil marriage' 'Marriage_Status_Married'
 'Marriage_Status_Separated' 'Marriage_Status_Single / not married'
 'Marriage_Status_Widow' 'Housing_Co-op apartment'
 'Housing_House / apartment' 'Housing_Municipal apartment'
 'Housing_Office apartment' 'Housing_Rented apartment'
 'Housing_With parents' 'Occupation_Accountants'
 'Occupation_Cleaning staff' 'Occupation_Cooking staff'
 'Occupation_Core staff' 'Occupation_Drivers' 'Occupation_HR staff'
 'Occupation_High skill tech staff' 'O
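
For reference, a one-of-K encoding like the one in the column names above can be produced with `pandas.get_dummies`. The sketch below uses a tiny made-up frame. The rows are hypothetical, and this is not necessarily how the cleaning notebook built the file.

```python
import pandas as pd

# Hypothetical toy rows; the column names only mimic the real dataset.
toy = pd.DataFrame({
    "Income": [0.2, 0.7, 0.5],
    "Housing": ["With parents", "Rented apartment", "With parents"],
})

# Each category of "Housing" becomes its own 0/1 indicator column,
# named "<column>_<category>".
encoded = pd.get_dummies(toy, columns=["Housing"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set to 1 within a group of encoded columns, which is why the groups `Housing_*`, `Occupation_*`, and so on appear in the column list.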

### Logistic Regression

In the exploratory analysis notebook, we computed the information value of each feature. Features with very low information value mostly add noise to a logistic regression model, so we remove them and keep only the following features.

In [7]:
features = ['Property', 'Income', 'Age', 'Employment_Length', 'Email', 'Family_Size', 'Income_Type_Commercial associate', 
            'Income_Type_Pensioner', 'Income_Type_State servant', 'Income_Type_Student', 'Income_Type_Working', 'Education_Academic degree',
            'Education_Higher education', 'Education_Incomplete higher', 'Education_Lower secondary', 'Education_Secondary / secondary special', 
            'Marriage_Status_Civil marriage', 'Marriage_Status_Married', 'Marriage_Status_Separated', 'Marriage_Status_Single / not married',
            'Marriage_Status_Widow', 'Housing_Co-op apartment', 'Housing_House / apartment', 'Housing_Municipal apartment',
            'Housing_Office apartment', 'Housing_Rented apartment', 'Housing_With parents', 'Occupation_Accountants', 'Occupation_Cleaning staff',
            'Occupation_Cooking staff', 'Occupation_Core staff', 'Occupation_Drivers', 'Occupation_HR staff', 'Occupation_High skill tech staff', 
            'Occupation_IT staff',  'Occupation_Laborers', 'Occupation_Low-skill Laborers',  'Occupation_Managers', 'Occupation_Medicine staff',
            'Occupation_Null', 'Occupation_Private service staff', 'Occupation_Realty agents', 'Occupation_Sales staff', 'Occupation_Secretaries',
            'Occupation_Security staff', 'Occupation_Waiters/barmen staff']
 

labels = ['Lifetime']
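
As a reminder of what is being filtered on, here is a minimal sketch of an information value computation, assuming the usual weight-of-evidence definition, IV = Σ (share₁ − share₀) · ln(share₁ / share₀), summed over the categories of a feature. The function name, the `eps` smoothing term, and the toy data are illustrative assumptions; the exploratory notebook may have computed it differently.

```python
import numpy as np
import pandas as pd

def information_value(feature: pd.Series, label: pd.Series, eps: float = 1e-6) -> float:
    """Information value of a categorical feature against a 0/1 label."""
    counts = pd.crosstab(feature, label)
    # Share of each class that falls into each category of the feature.
    share_0 = counts[0] / counts[0].sum()
    share_1 = counts[1] / counts[1].sum()
    # eps guards against log(0) when a category is empty for one class.
    woe = np.log((share_1 + eps) / (share_0 + eps))
    return float(((share_1 - share_0) * woe).sum())

# Hypothetical toy data: one uninformative and one informative feature.
toy_label = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
flat_feature = pd.Series([0, 0, 1, 1, 0, 0, 1, 1])
useful_feature = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])

iv_flat = information_value(flat_feature, toy_label)
iv_useful = information_value(useful_feature, toy_label)
print(iv_flat, iv_useful)
```

The value is symmetric in the two classes, so it does not matter which label is treated as "good". A feature distributed identically in both classes scores close to zero.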

In [8]:
feature_frame = dataframe.loc[:, features]

In [10]:
labels_frame = dataframe.loc[:, labels]

We have enough data points that we can afford to hold out part of the data and still cross-validate for parameter tuning. Our strategy is as follows. First, we split the data into a training set and a test set. Since our classes are imbalanced, we use synthetic oversampling (SMOTE) to even out the classes in the training set. We also run a GridSearchCV on the training set to tune the logistic regression hyperparameters, optimizing for F1 score. After finding the optimal parameters, we train the model on the entire training set and then measure precision and recall on the test set.

In [16]:
features_array = feature_frame.to_numpy()
labels_array = labels_frame.to_numpy()

In [18]:
features_train, features_test, labels_train, labels_test = train_test_split(features_array, labels_array, test_size=0.3, 
                                                                             stratify = labels_array, random_state=41)

In [50]:
labels_train = labels_train.ravel()
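
The `stratify` argument keeps the class proportions the same in both parts of the split. A small sketch on made-up labels (the 75/25 class mix is hypothetical, not the proportion in our data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 75 zeros and 25 ones.
y_toy = np.array([0] * 75 + [1] * 25)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# With stratification, the share of ones matches in both parts.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a random split of an imbalanced dataset can leave the test set with noticeably more or fewer minority samples than the training set.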

In [22]:
smote = SMOTE(random_state=36)

In [64]:
logistic_parameters = {'C': [0.1, 1, 10, 100, 1000],
                        'max_iter': [1000, 10000, 100000]}



In [65]:
model = Pipeline([('sampling', SMOTE()), ('classification', LogisticRegression())])

In [66]:
search = GridSearchCV(LogisticRegression(), logistic_parameters, scoring='f1', verbose=3)

In [67]:
search.fit(features_train, labels_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ..............C=0.1, max_iter=1000;, score=0.000 total time=   0.1s
[CV 2/5] END ..............C=0.1, max_iter=1000;, score=0.000 total time=   0.1s
[CV 3/5] END ..............C=0.1, max_iter=1000;, score=0.000 total time=   0.1s
[CV 4/5] END ..............C=0.1, max_iter=1000;, score=0.006 total time=   0.1s
[CV 5/5] END ..............C=0.1, max_iter=1000;, score=0.000 total time=   0.1s
[CV 1/5] END .............C=0.1, max_iter=10000;, score=0.000 total time=   0.1s
[CV 2/5] END .............C=0.1, max_iter=10000;, score=0.000 total time=   0.1s
[CV 3/5] END .............C=0.1, max_iter=10000;, score=0.000 total time=   0.1s
[CV 4/5] END .............C=0.1, max_iter=10000;, score=0.006 total time=   0.1s
[CV 5/5] END .............C=0.1, max_iter=10000;, score=0.000 total time=   0.1s
[CV 1/5] END ............C=0.1, max_iter=100000;, score=0.000 total time=   0.1s
[CV 2/5] END ............C=0.1, max_iter=100000;

GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': [0.1, 1, 10, 100, 1000],
                         'max_iter': [1000, 10000, 100000]},
             scoring='f1', verbose=3)

In [68]:
search.best_params_

{'C': 10, 'max_iter': 1000}

In [45]:
smote_2 = SMOTE(random_state=36)

In [56]:
features_resampled, labels_resampled = smote_2.fit_resample(features_train, labels_train)

In [57]:
print(len(features_resampled), len(labels_resampled), labels_resampled.sum())

10252 10252 5126.0
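
The output shows the resampled training set balanced at 5126 samples per class. To give an intuition for where the extra minority samples come from, here is a simplified sketch of the core SMOTE step: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. The function and toy points are illustrative assumptions, not imbalanced-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_point(minority: np.ndarray, index: int, k: int = 2) -> np.ndarray:
    """Interpolate between one minority sample and one of its k nearest neighbors."""
    base = minority[index]
    distances = np.linalg.norm(minority - base, axis=1)
    # Position 0 of the argsort is the point itself (distance 0), so skip it.
    neighbor_ids = np.argsort(distances)[1:k + 1]
    neighbor = minority[rng.choice(neighbor_ids)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbor - base)

# Hypothetical minority-class points in two dimensions.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
new_point = synthetic_point(minority, index=0)
print(new_point)
```

Because new points are interpolated rather than copied, SMOTE adds variety that plain duplication of minority rows would not.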


In [75]:
# C and max_iter are set by hand here; they differ from search.best_params_ above.
clf = LogisticRegression(C=100.0, max_iter=10000)
clf.fit(features_resampled, labels_resampled)

LogisticRegression(C=100.0, max_iter=10000)

In [76]:
pred = clf.predict(features_test)

In [77]:
# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions.
print(confusion_matrix(labels_test, pred))

[[1199  999]
 [ 378  343]]


In [82]:
# scikit-learn metrics take (y_true, y_pred); swapping them exchanges precision and recall.
print("Precision of Logistic Regression Model: ", precision_score(labels_test, pred))
print('Recall of Logistic Regression Model: ', recall_score(labels_test, pred))
print('Accuracy of Logistic Regression Model: ', accuracy_score(labels_test, pred))

Precision of Logistic Regression Model:  0.2555886736214605
Recall of Logistic Regression Model:  0.47572815533980584
Accuracy of Logistic Regression Model:  0.5282631038026722
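
As a cross-check, the metrics can be recomputed by hand from the four counts in the confusion matrix, treating label 1 as the positive class. This is plain arithmetic on the printed counts and does not depend on the argument order passed to scikit-learn.

```python
# Counts read from the confusion matrix printed above.
tp = 343   # predicted 1, labeled 1
fp = 999   # predicted 1, labeled 0
fn = 378   # predicted 0, labeled 1
tn = 1199  # predicted 0, labeled 0

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(accuracy, 4), round(f1, 4))
# prints: 0.2556 0.4757 0.5283 0.3325
```

In words: about one in four predicted positives is correct, and a little under half of the actual positives are recovered.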


Logistic regression does not give us a good model: with an accuracy of about 0.53, it is barely better than a coin flip on the test set. This was somewhat expected given how low the information values of our features were. We want to improve on this model, and we will attempt to do so in a number of different ways. First, we can limit our features further.