# Vehicle Loan Prediction Machine Learning Model

# Chapter 5 - Linear Classifiers

### Load Data and Import Libraries

- Notice that we have included two new modules from sklearn

In [23]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [24]:
loan_df = pd.read_csv('vehicle_loans_feat.csv', index_col='UNIQUEID')

## Lesson 1 - Train/Test Split

For the rest of this chapter, we will work through the steps of creating a simple linear classifier using Logistic Regression

First let's remind ourselves of the variables we are dealing with

In [25]:
#look at the columns
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 233154 entries, 420825 to 630213
Data columns (total 31 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   DISBURSED_AMOUNT                     233154 non-null  float64
 1   ASSET_COST                           233154 non-null  float64
 2   LTV                                  233154 non-null  float64
 3   MANUFACTURER_ID                      233154 non-null  int64  
 4   EMPLOYMENT_TYPE                      233154 non-null  object 
 5   STATE_ID                             233154 non-null  int64  
 6   AADHAR_FLAG                          233154 non-null  int64  
 7   PAN_FLAG                             233154 non-null  int64  
 8   VOTERID_FLAG                         233154 non-null  int64  
 9   DRIVING_FLAG                         233154 non-null  int64  
 10  PASSPORT_FLAG                        233154 non-null  int64  
 11  PERFORM_CNS_S

It is important that our classifier recognises categorical variables where appropriate.

Let's use the dtypes property to look at the data types of our categorical fields.

In [26]:
#look at categorical data types
category_cols =['MANUFACTURER_ID', 'STATE_ID', 'DISBURSAL_MONTH', 'DISBURSED_CAT', 'PERFORM_CNS_SCORE_DESCRIPTION', 'EMPLOYMENT_TYPE']
loan_df[category_cols].dtypes

MANUFACTURER_ID                   int64
STATE_ID                          int64
DISBURSAL_MONTH                   int64
DISBURSED_CAT                    object
PERFORM_CNS_SCORE_DESCRIPTION    object
EMPLOYMENT_TYPE                  object
dtype: object


- We do not want to treat MANUFACTURER_ID, STATE_ID and DISBURSAL_MONTH as integers
- We can encode our categorical columns with the category data type

In [27]:
#convert to categorical type
loan_df[category_cols] = loan_df[category_cols].astype('category')
loan_df[category_cols].dtypes

MANUFACTURER_ID                  category
STATE_ID                         category
DISBURSAL_MONTH                  category
DISBURSED_CAT                    category
PERFORM_CNS_SCORE_DESCRIPTION    category
EMPLOYMENT_TYPE                  category
dtype: object

### EXERCISE 

- To keep our first model simple, select 6 variables including 'LOAN_DEFAULT' and 'DISBURSED_CAT'
- Using these variables create a subset of loan_df and store it as a separate DataFrame loan_df_sml
- HINT: Think about the results of your exploratory analysis, which variables might be good predictors?

### SOLUTION

- I have selected the following 6 columns: 'LOAN_DEFAULT', 'DISBURSED_CAT', 'AGE', 'EMPLOYMENT_TYPE', 'DISBURSAL_MONTH', 'ASSET_COST'
- You could have selected any six which you are interested in, so long as they include 'LOAN_DEFAULT' and 'DISBURSED_CAT', which we will use later in this chapter

In [28]:
#type solution here
small_cat = ['LOAN_DEFAULT', 'DISBURSED_CAT', 'AGE', 'EMPLOYMENT_TYPE', 'DISBURSAL_MONTH', 'ASSET_COST']
loan_df_sml = loan_df[small_cat]

Nice! Let's have a quick look at our new dataframe

In [29]:
#check the dimensions
loan_df_sml.shape

(233154, 6)

We still have 233154 rows but now there are only 6 columns

In [30]:
#look at the columns
loan_df_sml.info()

<class 'pandas.core.frame.DataFrame'>
Index: 233154 entries, 420825 to 630213
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype   
---  ------           --------------   -----   
 0   LOAN_DEFAULT     233154 non-null  int64   
 1   DISBURSED_CAT    233154 non-null  category
 2   AGE              233154 non-null  float64 
 3   EMPLOYMENT_TYPE  233154 non-null  category
 4   DISBURSAL_MONTH  233154 non-null  category
 5   ASSET_COST       233154 non-null  float64 
dtypes: category(3), float64(2), int64(1)
memory usage: 7.8 MB


### Training/Test Split

- Before we fit (train) our basic linear model we need to split our data into training and test sets.
- Training Data: used to fit the model to our specific data
- Test Data: used to test the predictive power of the trained model  

We can use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from sklearn to create our training and test sets

[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) takes the arrays to be split as its first arguments; we will pass two

- x: all of the rows and columns except the target variable 
- y: all of the rows but just the target variable column

### EXERCISE

- Create two variables x and y to match the required parameters for [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

### SOLUTION

In [31]:
#type solution here
x = loan_df_sml.drop('LOAN_DEFAULT', axis = 1)
y = loan_df_sml['LOAN_DEFAULT']

We should investigate the dimensions of x and y to make sure the above solution is correct

In [32]:
#check the rows and columns
print('x has {0} rows and {1} columns'.format(x.shape[0], x.shape[1]))
print('\n')
print('y has {0} rows'.format(y.count()))

x has 233154 rows and 5 columns


y has 233154 rows


In [33]:
#x info
x.info()

<class 'pandas.core.frame.DataFrame'>
Index: 233154 entries, 420825 to 630213
Data columns (total 5 columns):
 #   Column           Non-Null Count   Dtype   
---  ------           --------------   -----   
 0   DISBURSED_CAT    233154 non-null  category
 1   AGE              233154 non-null  float64 
 2   EMPLOYMENT_TYPE  233154 non-null  category
 3   DISBURSAL_MONTH  233154 non-null  category
 4   ASSET_COST       233154 non-null  float64 
dtypes: category(3), float64(2)
memory usage: 6.0 MB


In [34]:
#y info
y.info()

<class 'pandas.core.series.Series'>
Index: 233154 entries, 420825 to 630213
Series name: LOAN_DEFAULT
Non-Null Count   Dtype
--------------   -----
233154 non-null  int64
dtypes: int64(1)
memory usage: 3.6 MB


Great! Looks like we have what we need, now we can create our training/test data sets

In addition to the required parameters of [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) we will also use 
- test_size: floating value between 0 and 1 indicating the size of the test set 
- random_state: integer value used for random seeding, allows for repeatability of the split

In [35]:
#train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 50)

Notice that [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) returns 4 output values 

- x_train: the training rows without the target variable 
- x_test: the test rows without the target variable 
- y_train: the training rows, target variable only 
- y_test: the test rows, target variable only 
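
As a minimal sketch of these four outputs, the snippet below splits a small made-up dataset. The names toy_x and toy_y and their values are illustrative only and are not part of the loan data. It also checks that reusing the same random_state reproduces the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature, a binary target
toy_x = pd.DataFrame({'feature': range(10)})
toy_y = pd.Series([0, 1, 0, 0, 1, 0, 0, 0, 1, 0], name='target')

# test_size=0.2 holds out 20% of the rows (2 of 10) for testing
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=50
)
print(tx_train.shape, tx_test.shape)

# Features and target stay aligned: both pieces share the same index
print(tx_train.index.equals(ty_train.index))

# Re-running with the same random_state gives the same split
tx_train2, tx_test2, _, _ = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=50
)
print(tx_test.index.equals(tx_test2.index))
```

Changing random_state to a different integer would generally select different rows for the test set.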

Let's familiarize ourselves with this output

In [36]:
#check rows and columns
print('x_train has {0} rows and {1} columns'.format(x_train.shape[0], x_train.shape[1]))
print('\n')
print('x_test has {0} rows and {1} columns'.format(x_test.shape[0], x_test.shape[1]))
print('\n')
print('y_train has {0} rows'.format(y_train.count()))
print('\n')
print('y_test has {0} rows'.format(y_test.count()))

x_train has 186523 rows and 5 columns


x_test has 46631 rows and 5 columns


y_train has 186523 rows


y_test has 46631 rows


Looks like the number of rows and columns is what we would expect

In [37]:
#x train info
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 186523 entries, 522465 to 557705
Data columns (total 5 columns):
 #   Column           Non-Null Count   Dtype   
---  ------           --------------   -----   
 0   DISBURSED_CAT    186523 non-null  category
 1   AGE              186523 non-null  float64 
 2   EMPLOYMENT_TYPE  186523 non-null  category
 3   DISBURSAL_MONTH  186523 non-null  category
 4   ASSET_COST       186523 non-null  float64 
dtypes: category(3), float64(2)
memory usage: 4.8 MB


In [38]:
#y train info
y_train.info()

<class 'pandas.core.series.Series'>
Index: 186523 entries, 522465 to 557705
Series name: LOAN_DEFAULT
Non-Null Count   Dtype
--------------   -----
186523 non-null  int64
dtypes: int64(1)
memory usage: 2.8 MB


In [39]:
#x test info
x_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 46631 entries, 534327 to 445578
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   DISBURSED_CAT    46631 non-null  category
 1   AGE              46631 non-null  float64 
 2   EMPLOYMENT_TYPE  46631 non-null  category
 3   DISBURSAL_MONTH  46631 non-null  category
 4   ASSET_COST       46631 non-null  float64 
dtypes: category(3), float64(2)
memory usage: 1.2 MB


In [40]:
#y test info
y_test.info()

<class 'pandas.core.series.Series'>
Index: 46631 entries, 534327 to 445578
Series name: LOAN_DEFAULT
Non-Null Count  Dtype
--------------  -----
46631 non-null  int64
dtypes: int64(1)
memory usage: 728.6 KB


Brilliant! All the train and test data has the correct columns 

Now let's use [value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) to check the distribution of the class variable

In [41]:
#check the training target variable
y_train.value_counts(normalize=True)

LOAN_DEFAULT
0    0.78282
1    0.21718
Name: proportion, dtype: float64

In [42]:
#check the test target variable
y_test.value_counts(normalize=True)

LOAN_DEFAULT
0    0.783363
1    0.216637
Name: proportion, dtype: float64

Great! Both the training and test set contain defaulted loans at 21.7%! 

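We did not need to stratify: with a dataset this large, the random sampling in [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) keeps the class proportions almost identical in both sets. For smaller datasets, the stratify parameter can be used to enforce matching proportions.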
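
For comparison, here is a small sketch of what a stratified split looks like. The data is made up (100 rows, 20% positives) and the variable names are illustrative; the only addition to the call used earlier is the stratify argument.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical small dataset: 100 rows, 20% positives
small_x = pd.DataFrame({'feature': range(100)})
small_y = pd.Series([1] * 20 + [0] * 80, name='target')

# stratify=small_y keeps the class proportions the same in both splits
sx_train, sx_test, sy_train, sy_test = train_test_split(
    small_x, small_y, test_size=0.2, random_state=50, stratify=small_y
)
print(sy_train.value_counts(normalize=True))
print(sy_test.value_counts(normalize=True))
```

With so few rows, an unstratified split could easily leave the test set with noticeably more or fewer positives than the training set.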

## Lesson 2 - Variable Encoding

Now it's time to build our first binary classifier!

First we create the model object using [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [43]:
#initialize the model
logistic_model = LogisticRegression()

Now we [fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) the training data!

In [44]:
#fit the logistic model
logistic_model.fit(x_train, y_train)

ValueError: could not convert string to float: '45k - 60k'

### One Hot Encoding 

Ok, looks like we made a mistake!

The problem is that Logistic Regression, like most machine learning methods, does not know how to deal with string data
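
The same failure can be reproduced on a tiny made-up dataset. This is only a sketch: the column names and values below are hypothetical and chosen to mimic a string-valued category column.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with one string-valued column
str_x = pd.DataFrame({
    'band': ['low', 'high', 'low', 'high'],
    'amount': [0.1, 0.9, 0.2, 0.8],
})
str_y = pd.Series([0, 1, 0, 1])

try:
    LogisticRegression().fit(str_x, str_y)
    fit_failed = False
except ValueError as err:
    # scikit-learn tries to convert every column to float and fails on strings
    fit_failed = True
    print('ValueError:', err)
```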

We can use [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to one hot encode our categorical variables 

- Remember in lesson 1 we converted our categorical variables to the 'category' data type 
- If we didn't do this, variables like STATE_ID which contained integer representations of categories would be missed by [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)
- Then they would be incorrectly treated as continuous variables
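
To see this behaviour in isolation, the sketch below runs pd.get_dummies on a made-up frame twice: once with an integer-coded column left as integers, and once after converting it to the 'category' type. The values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame: an integer-coded category and a numeric feature
toy = pd.DataFrame({'STATE_ID': [1, 2, 3, 1], 'LTV': [0.5, 0.7, 0.6, 0.9]})

# With an integer dtype, get_dummies leaves STATE_ID untouched
as_int = pd.get_dummies(toy)
print(list(as_int.columns))

# After converting to 'category', STATE_ID is expanded into dummy columns
toy_cat = toy.astype({'STATE_ID': 'category'})
as_cat = pd.get_dummies(toy_cat)
print(list(as_cat.columns))
```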

Let's one hot encode our small dataframe and assign it to a new variable 'loan_data_dumm'


In [47]:
loan_df['STATE_ID'].sample(5)

UNIQUEID
514020    15
453077     8
571603     4
492309     3
558217     4
Name: STATE_ID, dtype: category
Categories (22, int64): [1, 2, 3, 4, ..., 19, 20, 21, 22]

In [48]:
loan_df_sml.sample(10)

| UNIQUEID | LOAN_DEFAULT | DISBURSED_CAT | AGE | EMPLOYMENT_TYPE | DISBURSAL_MONTH | ASSET_COST |
|---|---|---|---|---|---|---|
| 451833 | 0 | 45k - 60k | 0.230769 | Self employed | 8 | 0.017913 |
| 542057 | 0 | 60k - 75k | 0.076923 | Salaried | 9 | 0.044002 |
| 570154 | 0 | 30k - 45k | 0.307692 | Self employed | 11 | 0.038317 |
| 510966 | 0 | 45k - 60k | 0.615385 | Self employed | 9 | 0.021278 |
| 570614 | 0 | 45k - 60k | 0.134615 | Self employed | 11 | 0.023883 |
| 586415 | 0 | 60k - 75k | 0.673077 | Self employed | 10 | 0.022494 |
| 525633 | 0 | 45k - 60k | 0.288462 | Self employed | 9 | 0.029209 |
| 634342 | 0 | 45k - 60k | 0.442308 | Self employed | 10 | 0.020917 |
| 596610 | 0 | 45k - 60k | 0.173077 | Salaried | 10 | 0.020729 |
| 447763 | 0 | 45k - 60k | 0.076923 | Salaried | 8 | 0.019566 |


In [45]:
#one hot encode
loan_data_dumm = pd.get_dummies(loan_df_sml, prefix_sep = '_', drop_first=True)

We are passing three parameters to [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

- loan_df_sml: our small dataframe which we want to encode 
- prefix_sep: the separator placed between the original column name and the category value, new columns will be created like 'EMPLOYMENT_TYPE_Salaried'
- drop_first: drop the first dummy variable for each category 

HOLD ON! Why are we dropping the first dummy variable for each category?
- Think about it
- Suppose a categorical variable has 10 categories, giving 10 boolean dummy columns, and there are no missing values
- Then if a row doesn't belong to any of 9 of the categories
- It must belong to the 10th
- So we can drop one of the dummy columns without losing any information
- This helps to simplify the model and reduce the impact of correlated variables 
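
The argument above can be checked on a small made-up example. In this sketch the 'GRADE' column and its three categories are hypothetical; the point is only that the dropped dummy column can be rebuilt from the remaining ones.

```python
import pandas as pd

# Hypothetical toy column with three categories and no missing values
grades = pd.DataFrame({'GRADE': pd.Categorical(['A', 'B', 'C', 'B'])})

full = pd.get_dummies(grades)                      # one column per category
reduced = pd.get_dummies(grades, drop_first=True)  # first category ('A') dropped
print(list(full.columns))
print(list(reduced.columns))

# The dropped column is recoverable: a row is 'A' exactly when
# none of the remaining dummy columns is set
recovered = reduced.sum(axis=1) == 0
print((recovered == full['GRADE_A'].astype(bool)).all())
```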

Let's look at the results of [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

In [53]:
loan_data_dumm.head(10)

| UNIQUEID | LOAN_DEFAULT | AGE | ASSET_COST | DISBURSED_CAT_150k - 1m | DISBURSED_CAT_30k - 45k | DISBURSED_CAT_45k - 60k | DISBURSED_CAT_60k - 75k | DISBURSED_CAT_75k - 150k | EMPLOYMENT_TYPE_Salaried | EMPLOYMENT_TYPE_Self employed | ... | DISBURSAL_MONTH_3 | DISBURSAL_MONTH_4 | DISBURSAL_MONTH_5 | DISBURSAL_MONTH_6 | DISBURSAL_MONTH_7 | DISBURSAL_MONTH_8 | DISBURSAL_MONTH_9 | DISBURSAL_MONTH_10 | DISBURSAL_MONTH_11 | DISBURSAL_MONTH_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 420825 | 0 | 0.326923 | 0.013442 | False | False | True | False | False | True | False | ... | True | False | False | False | False | False | False | False | False | False |
| 537409 | 1 | 0.307692 | 0.017934 | False | False | True | False | False | False | True | ... | False | False | False | False | False | False | True | False | False | False |
| 417566 | 0 | 0.288462 | 0.015302 | False | False | True | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 624493 | 1 | 0.134615 | 0.018287 | False | False | True | False | False | False | True | ... | False | False | False | False | False | False | False | True | False | False |
| 539055 | 1 | 0.461538 | 0.014636 | False | False | True | False | False | False | True | ... | False | False | False | False | False | False | True | False | False | False |
| 518279 | 0 | 0.211538 | 0.015641 | False | False | True | False | False | False | True | ... | False | False | False | False | False | False | True | False | False | False |
| 529269 | 0 | 0.25 | 0.01539 | False | False | True | False | False | True | False | ... | False | False | False | False | False | False | True | False | False | False |
| 510278 | 0 | 0.230769 | 0.015641 | False | True | False | False | False | True | False | ... | False | False | False | False | False | False | True | False | False | False |
| 490213 | 0 | 0.173077 | 0.015687 | False | False | True | False | False | False | True | ... | False | False | True | False | False | False | False | False | False | False |
| 510980 | 0 | 0.634615 | 0.015264 | False | False | True | False | False | True | False | ... | False | False | False | False | False | False | True | False | False | False |


In [46]:
#check the columns
loan_data_dumm.info()

<class 'pandas.core.frame.DataFrame'>
Index: 233154 entries, 420825 to 630213
Data columns (total 21 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   LOAN_DEFAULT                   233154 non-null  int64  
 1   AGE                            233154 non-null  float64
 2   ASSET_COST                     233154 non-null  float64
 3   DISBURSED_CAT_150k - 1m        233154 non-null  bool   
 4   DISBURSED_CAT_30k - 45k        233154 non-null  bool   
 5   DISBURSED_CAT_45k - 60k        233154 non-null  bool   
 6   DISBURSED_CAT_60k - 75k        233154 non-null  bool   
 7   DISBURSED_CAT_75k - 150k       233154 non-null  bool   
 8   EMPLOYMENT_TYPE_Salaried       233154 non-null  bool   
 9   EMPLOYMENT_TYPE_Self employed  233154 non-null  bool   
 10  DISBURSAL_MONTH_2              233154 non-null  bool   
 11  DISBURSAL_MONTH_3              233154 non-null  bool   
 12  DISBURSAL_MONTH_4             

Great! Looks like we have dummy columns for our categoricals

### EXERCISE 

- Take time to investigate the contents of these new columns
- Make sure you understand how [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) is transforming our dataset

### SOLUTION

In [None]:
#Extra space for exploration


## Lesson 3 - Train and Validate

### EXERCISE 

- Recreate our training and test set using loan_data_dumm
- Make sure the class distributions are correct

### SOLUTION 

In [57]:
#type solution here
x1 = loan_data_dumm.drop('LOAN_DEFAULT', axis = 1)
y1 = loan_data_dumm['LOAN_DEFAULT']

x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size = 0.3, random_state = 69)

In [58]:
print(y1_train.value_counts(normalize = True))
print(y1_test.value_counts(normalize = True))

LOAN_DEFAULT
0    0.783661
1    0.216339
Name: proportion, dtype: float64
LOAN_DEFAULT
0    0.78122
1    0.21878
Name: proportion, dtype: float64


Now let's try to [fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) our model again

In [59]:
#initialize and train logistic regression
logistic_model1 = LogisticRegression()
logistic_model1.fit(x1_train, y1_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Ok! We are nearly there. We have successfully trained our model. But there is a warning we should take care of.

The above warning is telling us that the solver used by LogisticRegression reached its maximum number of iterations before it converged on a solution.

The specifics of this warning are out of the scope of this course. As the message itself suggests, the usual remedies are to increase the number of iterations (max_iter), scale the data, or try a different solver.

Something to keep in mind, but for now, we can try to resolve the warning by increasing the maximum allowed iterations.

The default value is 100, so let's try 200!

*The warning may or may not appear depending on your system, if you don't see any problems you can skip this step*
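
One way to check whether the iteration limit was actually hit is to inspect the fitted model's n_iter_ attribute. The sketch below uses made-up synthetic data rather than the loan data, so the iteration counts it prints say nothing about what you will see in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: 500 rows, 3 features already on a 0-1 scale
rng = np.random.default_rng(0)
demo_x = rng.random((500, 3))
demo_y = (demo_x[:, 0] + 0.2 * rng.standard_normal(500) > 0.5).astype(int)

demo_model = LogisticRegression(max_iter=200)
demo_model.fit(demo_x, demo_y)

# n_iter_ holds the iterations actually used; a value below max_iter
# means the solver stopped on its own rather than hitting the limit
print(demo_model.n_iter_, demo_model.max_iter)
```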

In [61]:
#fit model
logistic_model1 = LogisticRegression(max_iter=200)
logistic_model1.fit(x1_train, y1_train)

Great! We have successfully trained our model

Now we need to generate some predictions for our test set

We pass our test features to the model to generate predictions, using [predict](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict)



In [63]:
#generate predictions
preds = logistic_model1.predict(x1_test)
#We do not pass y1 test because we are trying to get predictions on unlabeled test data.
preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

The output of predict is an array of 0s and 1s representing the loan default prediction

This is great but we need some measure of model performance

The [score](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) function generates predictions and compares the predicted class with the actual class. The output is a floating-point number between 0 and 1 telling us the fraction of loans we correctly classified, also known as the accuracy!
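
As a quick sanity check of what score computes, the sketch below compares it with a manual calculation on made-up synthetic data. The variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: the label depends on the first feature
rng = np.random.default_rng(1)
feat = rng.random((200, 2))
target = (feat[:, 0] > 0.5).astype(int)

clf = LogisticRegression().fit(feat, target)

# score() is the fraction of predictions that equal the true labels
manual_accuracy = np.mean(clf.predict(feat) == target)
print(manual_accuracy, clf.score(feat, target))
```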

In [64]:
#get accuracy
logistic_model1.score(x1_test,y1_test)

0.7812200666218708

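Wow! At first glance our model performed quite well, it predicted 78% of our test cases correctly.

Don't get too excited, accuracy can be a misleading measure of model performance! Notice that 0.78122 is exactly the proportion of non-defaulted loans in the test set, which is the accuracy we would get by predicting 0 for every loan. The next chapter will look at other measures of model performance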
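
To illustrate why accuracy alone can mislead on imbalanced classes, here is a small sketch using scikit-learn's DummyClassifier as a baseline. The labels are made up to roughly mirror the 78/22 class balance seen above, and this baseline is offered only as an illustration, not as part of the course's own workflow.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 78 non-defaults and 22 defaults
labels = np.array([0] * 78 + [1] * 22)
features = np.zeros((100, 1))  # features are ignored by this baseline

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(features, labels)
baseline_accuracy = baseline.score(features, labels)

# High accuracy, yet not a single default is detected
print(baseline_accuracy, baseline.predict(features).sum())
```

Comparing a trained model's accuracy against a baseline like this gives a first hint of whether the model has learned anything beyond the class balance.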