## Linear Classifier using Logistic Regression

In [97]:
# import libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [98]:
loan_df = pd.read_csv(r'C:\Users\Joseph\OneDrive\Documents\CFI\Loan Default Prediction With Machine Learning\data\new_data\data\vehicle_loans_feat_1.csv', index_col='UNIQUEID')

In [99]:
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233154 entries, 420825 to 630213
Data columns (total 30 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   DISBURSED_AMOUNT                     233154 non-null  float64
 1   ASSET_COST                           233154 non-null  float64
 2   LTV                                  233154 non-null  float64
 3   MANUFACTURER_ID                      233154 non-null  int64  
 4   EMPLOYMENT_TYPE                      233154 non-null  object 
 5   STATE_ID                             233154 non-null  int64  
 6   AADHAR_FLAG                          233154 non-null  int64  
 7   PAN_FLAG                             233154 non-null  int64  
 8   VOTERID_FLAG                         233154 non-null  int64  
 9   DRIVING_FLAG                         233154 non-null  int64  
 10  PASSPORT_FLAG                        233154 non-null  int64  
 11  PERFORM_

In [100]:
# let's inspect the variable types of the categorical fields

category_cols = ['MANUFACTURER_ID', 'STATE_ID', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE_DESCRIPTION','EMPLOYMENT_TYPE']
loan_df[category_cols].dtypes

MANUFACTURER_ID                   int64
STATE_ID                          int64
DISBURSED_CAT                    object
PERFORM_CNS_SCORE_DESCRIPTION    object
EMPLOYMENT_TYPE                  object
dtype: object

MANUFACTURER_ID and STATE_ID are identifiers, not quantities, so I don't want to treat them as integers. Let's convert all the categorical columns to the category data type.

In [101]:
#convert to categorical type

loan_df[category_cols] = loan_df[category_cols].astype('category')
loan_df[category_cols].dtypes

MANUFACTURER_ID                  category
STATE_ID                         category
DISBURSED_CAT                    category
PERFORM_CNS_SCORE_DESCRIPTION    category
EMPLOYMENT_TYPE                  category
dtype: object
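To illustrate what the conversion changes, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from the loan data; only the column name mirrors the real dataset.

```python
import pandas as pd

# Toy frame with a hypothetical integer ID column (not the loan data)
toy = pd.DataFrame({"STATE_ID": [3, 1, 3, 2, 1]})

# As int64, the IDs look like quantities that can be ordered and averaged
print(toy["STATE_ID"].dtype)

# As category, each distinct ID is treated as a label
toy["STATE_ID"] = toy["STATE_ID"].astype("category")
print(toy["STATE_ID"].dtype)

# The distinct labels pandas inferred, in sorted order
print(list(toy["STATE_ID"].cat.categories))
```

The category dtype matters later because pd.get_dummies encodes category and object columns by default but leaves plain integer columns untouched.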

To keep the first model simple, I'll create a subset of loan_df with only a handful of columns and store it as a separate DataFrame, loan_df_sml.

In [102]:
# create the subset loan_df_sml

small_cols = ['STATE_ID', 'LTV', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE', 'EMPLOYMENT_TYPE', 'LOAN_DEFAULT']
loan_df_sml = loan_df[small_cols]

In [103]:
loan_df_sml.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233154 entries, 420825 to 630213
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           233154 non-null  category
 1   LTV                233154 non-null  float64 
 2   DISBURSED_CAT      233154 non-null  category
 3   PERFORM_CNS_SCORE  233154 non-null  float64 
 4   EMPLOYMENT_TYPE    233154 non-null  category
 5   LOAN_DEFAULT       233154 non-null  int64   
dtypes: category(3), float64(2), int64(1)
memory usage: 7.8 MB


### Training/Test Split

Before I fit (train) my basic linear model, I need to split the data into training and test sets.

In [104]:
# create two variables x and y to match the required parameters for train_test_split

x = loan_df_sml.drop(['LOAN_DEFAULT'], axis=1)
y = loan_df_sml['LOAN_DEFAULT']

In [105]:
# check the rows and columns

print("x has {0} rows and {1} columns".format(x.shape[0], x.shape[1]))
print("y has {0} rows".format(y.count()))

x has 233154 rows and 5 columns
y has 233154 rows


In [106]:
x.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233154 entries, 420825 to 630213
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           233154 non-null  category
 1   LTV                233154 non-null  float64 
 2   DISBURSED_CAT      233154 non-null  category
 3   PERFORM_CNS_SCORE  233154 non-null  float64 
 4   EMPLOYMENT_TYPE    233154 non-null  category
dtypes: category(3), float64(2)
memory usage: 6.0 MB


In [107]:
y.dtype

dtype('int64')

In [108]:
# create train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

train_test_split returns four values:

- x_train: the training rows without the target variable 
- x_test: the test rows without the target variable 
- y_train: the training rows, target variable only 
- y_test: the test rows, target variable only 
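The four return values can be seen on a small synthetic example. This is a sketch, not the notebook's data: the feature values are random, the variable names are hypothetical, and the `stratify` argument is an optional extra that the notebook itself does not use. It asks scikit-learn to keep the class ratio identical in both splits rather than relying on chance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 1,000 rows, 20% positives
rng = np.random.default_rng(42)
features = pd.DataFrame({"LTV": rng.uniform(10, 95, size=1000)})
target = pd.Series([1] * 200 + [0] * 800, name="LOAN_DEFAULT")

# stratify=target keeps the class ratio the same in both splits
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

# Row counts of each split, then the share of positives in each target
print(f_train.shape, f_test.shape)
print(t_train.mean(), t_test.mean())
```

With a dataset as large as the loan data, a plain random split usually lands close to the overall class ratio anyway, so stratification mostly matters for smaller or more imbalanced datasets.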

In [109]:
#check rows and columns

print("x_train has {0} rows and {1} columns".format(x_train.shape[0], x_train.shape[1]))
print("x_test has {0} rows and {1} columns".format(x_test.shape[0], x_test.shape[1]))
print("y_train has {0} rows".format(y_train.count()))
print("y_test has {0} rows".format(y_test.count()))

x_train has 186523 rows and 5 columns
x_test has 46631 rows and 5 columns
y_train has 186523 rows
y_test has 46631 rows


It looks like the number of rows and columns is what I would expect.

In [110]:
#x train info

x_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 186523 entries, 633275 to 501520
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   STATE_ID           186523 non-null  category
 1   LTV                186523 non-null  float64 
 2   DISBURSED_CAT      186523 non-null  category
 3   PERFORM_CNS_SCORE  186523 non-null  float64 
 4   EMPLOYMENT_TYPE    186523 non-null  category
dtypes: category(3), float64(2)
memory usage: 4.8 MB


In [111]:
#y train info

y_train.head()

UNIQUEID
633275    1
646002    0
591252    0
475736    0
639478    0
Name: LOAN_DEFAULT, dtype: int64

In [112]:
#x test info

x_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46631 entries, 617183 to 626383
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   STATE_ID           46631 non-null  category
 1   LTV                46631 non-null  float64 
 2   DISBURSED_CAT      46631 non-null  category
 3   PERFORM_CNS_SCORE  46631 non-null  float64 
 4   EMPLOYMENT_TYPE    46631 non-null  category
dtypes: category(3), float64(2)
memory usage: 1.2 MB


In [113]:
#y test info

y_test.head()

UNIQUEID
617183    1
515702    0
466872    0
632384    0
461426    0
Name: LOAN_DEFAULT, dtype: int64

The training and test sets have the correct columns. Now, let's check the distribution of the class variable.

In [114]:
#check the training target variable

y_train.value_counts(normalize=True)

0    0.783099
1    0.216901
Name: LOAN_DEFAULT, dtype: float64

In [115]:
#check the test target variable

y_test.value_counts(normalize=True)

0    0.782248
1    0.217752
Name: LOAN_DEFAULT, dtype: float64

About 21.7% of the training loans and 21.8% of the test loans defaulted, so the random split preserved the class balance.

Now, I'll use one-hot encoding to convert my categorical columns into numeric indicator columns. Logistic regression in scikit-learn expects numeric input, so the categorical columns have to be encoded before they can be fed into the model.

To do so, I'll create a new variable 'loan_data_dumm' from my 'loan_df_sml'

In [118]:
loan_data_dumm = pd.get_dummies(loan_df_sml, prefix_sep='_', drop_first=True)

In [119]:
loan_data_dumm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233154 entries, 420825 to 630213
Data columns (total 31 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   LTV                            233154 non-null  float64
 1   PERFORM_CNS_SCORE              233154 non-null  float64
 2   LOAN_DEFAULT                   233154 non-null  int64  
 3   STATE_ID_2                     233154 non-null  uint8  
 4   STATE_ID_3                     233154 non-null  uint8  
 5   STATE_ID_4                     233154 non-null  uint8  
 6   STATE_ID_5                     233154 non-null  uint8  
 7   STATE_ID_6                     233154 non-null  uint8  
 8   STATE_ID_7                     233154 non-null  uint8  
 9   STATE_ID_8                     233154 non-null  uint8  
 10  STATE_ID_9                     233154 non-null  uint8  
 11  STATE_ID_10                    233154 non-null  uint8  
 12  STATE_ID_11              

Now, I'll investigate how pd.get_dummies is transforming my dataset.

In [121]:
print(loan_data_dumm['STATE_ID_13'].value_counts())
print(loan_data_dumm['STATE_ID_13'].value_counts(normalize=True))

print(loan_data_dumm['DISBURSED_CAT_60k-75k'].value_counts())
print(loan_data_dumm['DISBURSED_CAT_60k-75k'].value_counts(normalize=True))

0    215270
1     17884
Name: STATE_ID_13, dtype: int64
0    0.923295
1    0.076705
Name: STATE_ID_13, dtype: float64
0    183330
1     49824
Name: DISBURSED_CAT_60k-75k, dtype: int64
0    0.786304
1    0.213696
Name: DISBURSED_CAT_60k-75k, dtype: float64
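The effect of `drop_first=True` is easier to see on a tiny frame. The sketch below uses made-up values (the employment labels are hypothetical and may not match the real dataset) and shows that one level of each categorical column is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (hypothetical values)
toy = pd.DataFrame({
    "EMPLOYMENT_TYPE": pd.Categorical(
        ["Salaried", "Self employed", "Missing", "Salaried"]
    ),
    "LTV": [80.5, 72.0, 65.3, 90.1],
})

# drop_first=True drops the first category level ("Missing" here, because
# the levels are sorted), so that level is encoded as all zeros
encoded = pd.get_dummies(toy, prefix_sep="_", drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which is why a categorical column with k levels yields k - 1 dummy columns in the encoded frame.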


In [122]:
# recreate the training and test set using loan_data_dumm
# note: test_size is 0.3 here, whereas the earlier split used 0.2

x = loan_data_dumm.drop(['LOAN_DEFAULT'], axis=1)
y = loan_data_dumm['LOAN_DEFAULT']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

0    0.782975
1    0.217025
Name: LOAN_DEFAULT, dtype: float64
0    0.782821
1    0.217179
Name: LOAN_DEFAULT, dtype: float64


In [123]:
# fit the model

logistic_model = LogisticRegression()
logistic_model.fit(x_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Let's increase the maximum allowed iterations to resolve the warning.

In [124]:
# the default value is 100, so I'll try 200

logistic_model = LogisticRegression(max_iter=200)
logistic_model.fit(x_train, y_train)

Great! I have successfully trained the model and I no longer see the convergence warning.
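The warning also suggests scaling the data as an alternative to raising max_iter. The notebook does not take that route, but a minimal sketch of it might look like the following. The data here is synthetic, and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the loan data)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(10, 95, size=500),   # LTV-like range
    rng.uniform(0, 900, size=500),   # credit-score-like range
])
y_demo = (X_demo[:, 0] + rng.normal(0, 10, size=500) > 60).astype(int)

# Standardize each feature, then fit logistic regression with default max_iter
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually used
print(scaled_model[-1].n_iter_)
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then reapplied to anything passed to predict or score, which avoids leaking test-set statistics into training.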

Now, I'll generate some predictions for my test set.

In [125]:
# use predict to pass the test features to the model and generate predictions

preds = logistic_model.predict(x_test)
preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
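For context on what predict returns: in a binary logistic regression, the class label comes from thresholding the estimated probability of class 1 at 0.5. The sketch below checks this on a small synthetic problem. The data and the names `X_toy`, `y_toy`, and `toy_model` are hypothetical and unrelated to the loan model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic problem (not the loan data)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of class 1
proba_class1 = toy_model.predict_proba(X_toy)[:, 1]

# Thresholding that probability at 0.5 should reproduce predict()
manual_preds = (proba_class1 > 0.5).astype(int)
print((manual_preds == toy_model.predict(X_toy)).all())
```

If a model rarely assigns a probability above 0.5 to the minority class, predict will return the majority class for almost every row, so inspecting predict_proba can be more informative than the labels alone.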

In [126]:
logistic_model.score(x_test, y_test)

0.7828355755071698

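At first glance the model looks reasonable: it predicted 78.3% of the test cases correctly. However, accuracy can be a misleading measure of model performance. About 78.3% of the test loans did not default, so a model that always predicts 0 would score almost exactly the same.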
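To see why accuracy alone can mislead on imbalanced data, it helps to compare against a baseline that always predicts the most frequent class. The sketch below uses synthetic labels with roughly the same imbalance as the loan data, not the real test set, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic labels: 783 non-defaults and 217 defaults (not the real data)
labels = np.array([0] * 783 + [1] * 217)
dummy_features = np.zeros((len(labels), 1))  # ignored by the baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(dummy_features, labels)
baseline_preds = baseline.predict(dummy_features)

# Accuracy looks respectable even though no default is ever identified
print(accuracy_score(labels, baseline_preds))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(labels, baseline_preds))
```

The second column of the confusion matrix is all zeros because the baseline never predicts a default, which is exactly the failure that an accuracy score hides.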

Let's explore other measures of model performance.