`PART 3: Applying ML Algorithm`
--------------------------------------------
# Machine learning model that can accurately predict if a borrower will pay off their loan on time or not?
In the last two notebooks we cleaned the data. Our eventual goal is to generate features that we can feed into a machine learning algorithm. The algorithm will predict whether or not a loan will be paid off on time, which is recorded in the `loan_status` column of the dataset. To prepare the data, we removed columns that could cause leakage, columns with zero variance, and columns with redundant information. We also cleaned columns with formatting issues and converted categorical columns to dummy variables.

------------------------------------------------

#### class imbalance:
We know there is a class imbalance: the number of `1`s in the **loan_status** column is about 6x the number of `0`s. We need to keep this in mind because it can distort predictions. A model can show high accuracy on such data simply by predicting the majority class, without learning anything useful from the training set.
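A minimal sketch of how the imbalance can be checked, and why accuracy is misleading here. The `status` Series is a hypothetical stand-in for `loans["loan_status"]`, built with six `1`s for every `0` to mirror the ratio described above; it is not the real data.

```python
import pandas as pd

# Hypothetical stand-in for loans["loan_status"]: six 1s for every 0.
status = pd.Series([1] * 6 + [0] * 1, name="loan_status")

# Count how many rows fall in each class.
counts = status.value_counts()

# Ratio of paid-off loans (1) to loans not paid off (0).
ratio = counts.loc[1] / counts.loc[0]

# Accuracy of a "model" that always predicts the majority class.
majority_baseline = counts.max() / counts.sum()

print(counts)
print(ratio, majority_baseline)
```

With a 6:1 ratio, always predicting `1` already gives roughly 86% accuracy, which is why accuracy alone says little about whether the model has learned anything.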

In [1]:
import pandas as pd
loans = pd.read_csv("loans.csv")

In [2]:
loans.head()

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0,0,0,0,0,0,0,0,1,0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0,0,0,0,0,0,0,0,0,1
2,2400.0,15.96,84.33,10,12252.0,1,8.72,0.0,2.0,2.0,...,0,0,0,0,0,1,0,0,1,0
3,10000.0,13.49,339.31,10,49200.0,1,20.0,0.0,1.0,10.0,...,0,0,0,1,0,0,0,0,1,0
4,5000.0,7.9,156.46,3,36000.0,1,11.2,0.0,3.0,9.0,...,0,0,0,0,0,0,0,1,1,0


In [3]:
loans.corr()["loan_status"].sort_values()

int_rate                              -0.210814
term_ 60 months                       -0.171194
revol_util                            -0.099547
purpose_small_business                -0.078515
inq_last_6mths                        -0.070536
loan_amnt                             -0.062140
pub_rec                               -0.050193
dti                                   -0.042815
verification_status_Verified          -0.041976
installment                           -0.030309
purpose_debt_consolidation            -0.021098
home_ownership_RENT                   -0.020678
delinq_2yrs                           -0.019279
emp_length                            -0.016195
purpose_other                         -0.015565
revol_bal                             -0.007141
purpose_renewable_energy              -0.006921
home_ownership_OTHER                  -0.006418
purpose_house                         -0.006330
purpose_educational                   -0.006167
verification_status_Source Verified   -0

We can see above that no single column is strongly correlated with the target column. The strongest correlation shown is `int_rate`, at about -0.21.

## Selecting Error Metric:
Our main focus should be on capturing true positives and true negatives. We can tolerate false negatives, but we should **avoid** false positives as far as possible, because a false positive means funding a loan that will not be paid off, which results in a loss of money.

We can't use accuracy as the error metric: with imbalanced classes, a model that approves every loan still scores high accuracy while losing us money. We need a high true positive rate and a low false positive rate.
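The two rates can be sketched as a small helper. The function name `tpr_fpr` and the toy `actual`/`predicted` lists are illustrative assumptions, not part of the original notebook; the formulas are the standard definitions, tpr = tp / (tp + fn) and fpr = fp / (fp + tn).

```python
import pandas as pd

def tpr_fpr(actual, predicted):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    # Reset indexes so the two Series align row by row.
    actual = pd.Series(actual).reset_index(drop=True)
    predicted = pd.Series(predicted).reset_index(drop=True)

    tp = ((predicted == 1) & (actual == 1)).sum()  # good loans we funded
    fn = ((predicted == 0) & (actual == 1)).sum()  # good loans we rejected
    fp = ((predicted == 1) & (actual == 0)).sum()  # bad loans we funded
    tn = ((predicted == 0) & (actual == 0)).sum()  # bad loans we rejected

    return tp / (tp + fn), fp / (fp + tn)

# Toy example: four good loans and two bad loans.
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 1, 0]
tpr, fpr = tpr_fpr(actual, predicted)
print(tpr, fpr)  # 0.75 0.5
```

Here three of the four good loans are funded (tpr = 0.75) and one of the two bad loans is funded (fpr = 0.5).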

#### Experimenting with False Positive Ratio and False Negative Ratio

In [4]:
import pandas as pd
import numpy

# Predict that all loans will be paid off on time by setting every prediction to 1
predictions = pd.Series(numpy.ones(loans.shape[0]))

# False positive rate: fpr = fp / (fp + tn)
filter_fp = (predictions==1) & (loans["loan_status"]==0)
filter_tn = (predictions==0) & (loans["loan_status"]==0)
# True positive rate: tpr = tp / (tp + fn)
filter_tp = (predictions==1) & (loans["loan_status"]==1)
filter_fn = (predictions==0) & (loans["loan_status"]==1)

fpr = len(loans[filter_fp])/(len(loans[filter_fp])+len(loans[filter_tn]))
tpr = len(loans[filter_tp])/(len(loans[filter_tp])+len(loans[filter_fn]))

In [5]:
fpr,tpr

(1.0, 1.0)

Both rates are 1. A true positive rate of 1 means we correctly identified every good loan. But a false positive rate of 1 means we also labeled every bad loan as good, so predicting `1` for everything does nothing to help us avoid bad loans.

We have already converted all the columns to numeric types, so we can apply machine learning algorithms directly. Applying machine learning algorithms:

### Logistic Regression:

In [9]:
from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression()
features = loans.drop(["loan_status"],axis=1)
target = loans["loan_status"]
logistic_model.fit(features,target)
predictions = logistic_model.predict(features)

In [12]:
pd.Series(predictions).value_counts()

1    37611
0       64
dtype: int64

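This evaluation is overly optimistic because we are predicting on the same data we trained on, so any overfitting would go unnoticed. The model also predicts `1` for almost every row. Let's implement K-Fold cross validation.

#### K-Fold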

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
features = loans.drop(["loan_status"],axis=1)
target = loans["loan_status"]
logistic_model = LogisticRegression()
predictions = cross_val_predict(logistic_model,features,target,cv=3)
predictions = pd.Series(predictions)

#True positive rate
tp_filter = (loans["loan_status"]==1) & (predictions==1)
fn_filter = (loans["loan_status"]==1) & (predictions==0)
tpr = len(loans[tp_filter])/(len(loans[tp_filter]) +len(loans[fn_filter]) )

#False positive rate
fp_filter = (loans["loan_status"]==0) & (predictions==1)
tn_filter = (loans["loan_status"]==0) & (predictions==0)
fpr = len(loans[fp_filter])/(len(loans[fp_filter]) +len(loans[tn_filter]) )

(tpr,fpr)

(0.9987920460880877, 0.9962887363147152)

Both the *true positive rate* and the *false positive rate* are approximately 1, which isn't a good sign: the model is still predicting `1` for nearly every loan. We need to correct for the class imbalance so that both classes carry comparable weight during training.

#### using parameter: class_weight = balanced

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
features = loans.drop(["loan_status"],axis=1)
target = loans["loan_status"]
#-----------------change is done here
logistic_model = LogisticRegression(class_weight="balanced")
#----------------------------------------
predictions = cross_val_predict(logistic_model,features,target,cv=3)
predictions = pd.Series(predictions)

#True positive rate
tp_filter = (loans["loan_status"]==1) & (predictions==1)
fn_filter = (loans["loan_status"]==1) & (predictions==0)
tpr = len(loans[tp_filter])/(len(loans[tp_filter]) +len(loans[fn_filter]) )

#False positive rate
fp_filter = (loans["loan_status"]==0) & (predictions==1)
tn_filter = (loans["loan_status"]==0) & (predictions==0)
fpr = len(loans[fp_filter])/(len(loans[fp_filter]) +len(loans[tn_filter]) )

(tpr,fpr)

(0.6644675710834418, 0.3859714232696233)

We significantly improved the false positive rate by **balancing** the classes, at the cost of a lower true positive rate. Our true positive rate is now about 66% and our false positive rate is about 39%. From a conservative investor's point of view, it's reassuring that the false positive rate is lower, because it means we'll do a better job of avoiding bad loans than if we funded everything. However, we'd only fund about 66% of the good loans (the true positive rate).

We can try to lower the false positive rate further by assigning a harsher penalty for misclassifying the negative class. Setting `class_weight` to `balanced` sets the penalty automatically from the number of `1s` and `0s` in the column, but we can also specify the penalties ourselves.
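To make the automatic penalty concrete, here is a small sketch of the weights that `balanced` produces. The scikit-learn documentation gives the formula as `n_samples / (n_classes * np.bincount(y))`. The array `y` is a hypothetical target with a 6:1 ratio, used only for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical target with six 1s for every 0 (not the real loans data).
y = np.array([1] * 6 + [0] * 1)
classes = np.array([0, 1])

# Documented "balanced" formula: n_samples / (n_classes * count of each class).
manual = len(y) / (len(classes) * np.bincount(y))

# The same weights computed by scikit-learn's helper.
auto = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual)  # weight for class 0 first, then class 1
print(auto)
```

With a 6:1 imbalance, the rarer class `0` ends up with six times the weight of class `1`, so a manual penalty such as `{0: 10, 1: 1}` is harsher than what `balanced` would assign.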

#### Specifying penalty manually:

In [18]:
penalty = {0:10,1:1}
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
features = loans.drop(["loan_status"],axis=1)
target = loans["loan_status"]
#-----------------change is done here
logistic_model = LogisticRegression(class_weight=penalty)
#----------------------------------------
predictions = cross_val_predict(logistic_model,features,target,cv=3)
predictions = pd.Series(predictions)

#True positive rate
tp_filter = (loans["loan_status"]==1) & (predictions==1)
fn_filter = (loans["loan_status"]==1) & (predictions==0)
tpr = len(loans[tp_filter])/(len(loans[tp_filter]) +len(loans[fn_filter]) )

#False positive rate
fp_filter = (loans["loan_status"]==0) & (predictions==1)
tn_filter = (loans["loan_status"]==0) & (predictions==0)
fpr = len(loans[fp_filter])/(len(loans[fp_filter]) +len(loans[tn_filter]) )

(tpr,fpr)

(0.2514712259183547, 0.09482278715902764)

Specifying manual penalties lowered the false positive rate to 9.5% and hence lowered our risk. Note that this comes at the expense of the true positive rate, which dropped to about 25%. While we have fewer false positives, we are also missing opportunities to fund more loans and potentially make more money.
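The trade-off can be explored by trying several penalties in a loop. This is only a sketch under stated assumptions: it uses a synthetic imbalanced dataset from `make_classification` instead of `loans.csv`, and the candidate weights `[1, 5, 10]` and `max_iter=1000` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the loans data: roughly 85% of rows are class 1.
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.15, 0.85], random_state=1)

results = {}
for zero_weight in [1, 5, 10]:
    # Penalize mistakes on class 0 zero_weight times more than on class 1.
    model = LogisticRegression(class_weight={0: zero_weight, 1: 1},
                               max_iter=1000)
    preds = cross_val_predict(model, X, y, cv=3)

    tpr = np.mean(preds[y == 1] == 1)  # share of actual 1s predicted as 1
    fpr = np.mean(preds[y == 0] == 1)  # share of actual 0s predicted as 1
    results[zero_weight] = (tpr, fpr)

for zero_weight, (tpr, fpr) in results.items():
    print(zero_weight, round(tpr, 3), round(fpr, 3))
```

Comparing the printed pairs side by side is one way to pick a penalty that matches how much risk we are willing to accept.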

We can tweak the penalties further, but for now let's try `Random Forests`.
Random forests can model nonlinear relationships and learn complex conditionals, whereas logistic regression can only learn a linear decision boundary in the features.

### Random Forest:

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
features = loans.drop(["loan_status"],axis=1)
target = loans["loan_status"]
#-----------------change is done here
model = RandomForestClassifier(class_weight="balanced",random_state=1)
#----------------------------------------
predictions = cross_val_predict(model,features,target,cv=3)
predictions = pd.Series(predictions)

#True positive rate
tp_filter = (loans["loan_status"]==1) & (predictions==1)
fn_filter = (loans["loan_status"]==1) & (predictions==0)
tpr = len(loans[tp_filter])/(len(loans[tp_filter]) +len(loans[fn_filter]) )

#False positive rate
fp_filter = (loans["loan_status"]==0) & (predictions==1)
tn_filter = (loans["loan_status"]==0) & (predictions==0)
fpr = len(loans[fp_filter])/(len(loans[fp_filter]) +len(loans[tn_filter]) )

(tpr,fpr)

(0.9700799107972495, 0.9181666357394693)

## Conclusion:
Using a random forest classifier didn't improve our false positive rate. The model is likely still weighting the `1` class too heavily and predicting mostly `1s`. We can apply harsher penalties here as well.

Our best model so far had a true positive rate of about `25%` and a false positive rate of about `9.5%`.

We can further tune our models for better predictions.
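As one possible follow-up, the manual penalty used with logistic regression can also be passed to the random forest. This is a sketch on synthetic data from `make_classification`, not on `loans.csv`; the penalty `{0: 10, 1: 1}` is reused from the logistic regression cell, and `n_estimators=50` is an arbitrary assumption, so the rates it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the loans data: roughly 85% of rows are class 1.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.15, 0.85], random_state=1)

# Same manual penalty as before: mistakes on class 0 cost ten times more.
model = RandomForestClassifier(n_estimators=50,
                               class_weight={0: 10, 1: 1},
                               random_state=1)
preds = cross_val_predict(model, X, y, cv=3)

tpr = np.mean(preds[y == 1] == 1)  # share of actual 1s predicted as 1
fpr = np.mean(preds[y == 0] == 1)  # share of actual 0s predicted as 1
print(tpr, fpr)
```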