# 1. Recap

We spent the last two missions cleaning and preparing a dataset that contains data on loans made to members of Lending Club. `Our eventual goal is to generate features from the data, which can feed into a machine learning algorithm.` The algorithm will make predictions about whether or not a loan will be paid off on time, which is contained in the loan_status column of the clean dataset.

As we prepared the data, **`we removed columns that had data leakage issues, contained redundant information, or required additional processing to turn into useful features. We cleaned features that had formatting issues, and converted categorical columns to dummy variables.`**

In the last mission, we noticed that there's a `class imbalance in our target column,` loan_status. There are about 6 times as many loans that were paid off on time (positive case, label of 1) as loans that weren't (negative case, label of 0). Imbalances can cause issues with many machine learning algorithms: a model can appear to have high accuracy while learning very little from the training data. Because of this, we need to keep the class imbalance in mind as we build machine learning models.

In [1]:
import pandas as pd

loans=pd.read_csv('cleaned_loans_2007.csv')
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37675 entries, 0 to 37674
Data columns (total 38 columns):
loan_amnt                              37675 non-null float64
int_rate                               37675 non-null float64
installment                            37675 non-null float64
emp_length                             37675 non-null int64
annual_inc                             37675 non-null float64
loan_status                            37675 non-null int64
dti                                    37675 non-null float64
delinq_2yrs                            37675 non-null float64
inq_last_6mths                         37675 non-null float64
open_acc                               37675 non-null float64
pub_rec                                37675 non-null float64
revol_bal                              37675 non-null float64
revol_util                             37675 non-null float64
total_acc                              37675 non-null float64
home_ownership_MORTGAGE    

# 2. Picking an error metric

We established that this is a `binary classification problem` in the first mission of this course, and we converted the loan_status column to 0s and 1s as a result. Before diving in and selecting an algorithm to apply to the data, `we should select an error metric`.

Our objective here is to make money: we want to fund enough loans that are paid off on time to offset our losses from loans that aren't paid off. `An error metric will help us determine if our algorithm will make us money or lose us money.`

**An error metric will help us figure out when our model is performing well, and when it's performing poorly.**

# 3. Picking an error metric

* Find the number of true negatives.
  * Find the number of items where predictions is 0, and the corresponding entry in loans["loan_status"] is also 0.
  * Assign the result to tn.
* Find the number of true positives.
  * Find the number of items where predictions is 1, and the corresponding entry in loans["loan_status"] is also 1.
  * Assign the result to tp.
* Find the number of false negatives.
  * Find the number of items where predictions is 0, and the corresponding entry in loans["loan_status"] is 1.
  * Assign the result to fn.
* Find the number of false positives.
  * Find the number of items where predictions is 1, and the corresponding entry in loans["loan_status"] is 0.
  * Assign the result to fp.

In [2]:
loans.head(2)

Unnamed: 0,loan_amnt,int_rate,installment,emp_length,annual_inc,loan_status,dti,delinq_2yrs,inq_last_6mths,open_acc,...,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_renewable_energy,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,5000.0,10.65,162.87,10,24000.0,1,27.65,0.0,1.0,3.0,...,0,0,0,0,0,0,0,0,1,0
1,2500.0,15.27,59.83,0,30000.0,0,1.0,0.0,5.0,3.0,...,0,0,0,0,0,0,0,0,0,1


In [3]:
# Get predictions for above task

from sklearn.model_selection import train_test_split

# create train,test sets
features=loans.drop(['loan_status'],axis=1)
labels=loans['loan_status']
train_features,test_features,train_labels,test_labels=train_test_split(features,labels,test_size=0.2,random_state=1)

# train model
from sklearn.linear_model  import LogisticRegression
lr=LogisticRegression(solver='lbfgs')
lr.fit(train_features,train_labels)

# Get predictions for all observations
predictions=lr.predict(features)
loans['predictions']=predictions
loans.predictions[:5]

0    1
1    1
2    1
3    1
4    1
Name: predictions, dtype: int64

In [4]:
# Find the number of true negatives: the model predicts the loan won't be paid off, and the actual label agrees
tn=loans.loc[(loans['predictions']==0)&(loans['loan_status']==0),'predictions']
tn=len(tn)
tn

15

In [5]:
# Find the number of true positives: the model predicts the loan will be paid off, and the actual label agrees
tp=loans.loc[(loans['predictions']==1)&(loans['loan_status']==1),'predictions']
tp=len(tp)
tp

32248

In [6]:
# Find the number of false negatives: the model predicts the loan won't be paid off, but it actually was
fn=loans.loc[(loans['predictions']==0)&(loans['loan_status']==1),'predictions']
fn=len(fn)
fn

38

In [7]:
# Find the number of false positives: the model predicts the loan will be paid off, but it actually wasn't
fp=loans.loc[(loans['predictions']==1)&(loans['loan_status']==0),'predictions']
fp=len(fp)
fp

5374
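
As a cross-check on the four cells above, the same counts can be read off a confusion matrix in one call. The sketch below uses scikit-learn's `confusion_matrix` on a tiny made-up pair of arrays (`actual` and `predicted` are illustrative stand-ins for `loans["loan_status"]` and `loans["predictions"]`, not the real data).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy stand-ins for loans["loan_status"] and loans["predictions"].
actual = np.array([1, 1, 1, 0, 0, 1, 0, 1])
predicted = np.array([1, 1, 0, 1, 0, 1, 1, 1])

# With labels=[0, 1], rows are actual classes and columns are predicted
# classes, so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)
```

Passing `labels=[0, 1]` explicitly keeps the ordering fixed even if one class happens to be missing from the predictions.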

# 4. Class imbalance

We mentioned earlier that there is a significant class imbalance in the loan_status column. There are about 6 times as many loans that were paid off on time (1) as loans that weren't paid off on time (0). `This causes a major issue when we use accuracy as a metric`. `Because of the class imbalance, a classifier can predict 1 for every row and still have high accuracy.`
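
To make that concrete, here is a small worked calculation. It assumes the class totals implied by the tp, fn, fp and tn cells earlier in this notebook; the variable names are only for illustration.

```python
# Class counts taken from the tp/fn/fp/tn cells earlier in this notebook.
n_paid = 32248 + 38        # actual label 1 (tp + fn)
n_not_paid = 5374 + 15     # actual label 0 (fp + tn)
n_total = n_paid + n_not_paid

# A "classifier" that predicts 1 for every row is right on every paid loan
# and wrong on every unpaid one.
always_one_accuracy = n_paid / n_total
always_one_tpr = n_paid / n_paid           # every good loan is funded
always_one_fpr = n_not_paid / n_not_paid   # but so is every bad loan

print(round(always_one_accuracy, 3))
print(always_one_tpr, always_one_fpr)
```

The accuracy comes out at roughly 86%, even though this "model" funds every single bad loan.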

This is why it's important to always be `aware of imbalanced classes in machine learning models`, **`and to adjust your error metric accordingly`**. In this case, we don't want to use accuracy, and should instead use metrics that tell us the number of false positives and false negatives.

**This means that we should optimize for**:

* `high recall` (true positive rate)
* `low fall-out` (false positive rate)

Generally, `if we want to reduce false positive rate, true positive rate will also go down. This is because if we want to reduce the risk of false positives, we wouldn't think about funding riskier loans in the first place.`
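
One way to see this trade-off is to imagine that the model gives each loan a score and that we only fund loans above some cutoff. The sketch below is built on that assumption, with made-up scores and labels (nothing here comes from the loans data); `rates` is a hypothetical helper.

```python
import numpy as np

# Made-up scores standing in for a model's predicted probability of repayment.
actual = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.85, 0.65, 0.30, 0.20])

def rates(threshold):
    """Return (tpr, fpr) when loans with score >= threshold are funded."""
    predicted = (scores >= threshold).astype(int)
    tp = np.sum((predicted == 1) & (actual == 1))
    fn = np.sum((predicted == 0) & (actual == 1))
    fp = np.sum((predicted == 1) & (actual == 0))
    tn = np.sum((predicted == 0) & (actual == 0))
    return tp / (tp + fn), fp / (fp + tn)

lenient_tpr, lenient_fpr = rates(0.5)    # fund anything that looks decent
strict_tpr, strict_fpr = rates(0.75)     # fund only the safest-looking loans

print(lenient_tpr, lenient_fpr)
print(strict_tpr, strict_fpr)
```

Being stricter halves the false positive rate in this toy example, but it also turns away good loans, so the true positive rate drops as well.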

# 5. Class imbalance

## TODO
* Compute the false positive rate for predictions.
  * Compute the number of false positives, then divide by the number of false positives plus the number of true negatives.
  * Assign to fpr.
* Compute the true positive rate for predictions.
  * Compute the number of true positives, then divide by the number of true positives plus the number of false negatives.
  * Assign to tpr.

In [8]:
# false positive rate: loans wrongly predicted as paid, divided by all loans that were actually not paid
fpr=fp/(fp+tn)
fpr

0.9972165522360363

In [9]:
# true positive rate: loans correctly predicted as paid, divided by all loans that were actually paid
tpr=tp/(tp+fn)
tpr

0.9988230192653162

# 6. Cross Validation

In [10]:
from sklearn.model_selection import cross_val_predict

lr=LogisticRegression(solver='lbfgs')
predictions=cross_val_predict(lr,features,labels,cv=3)
predictions=pd.Series(predictions)

# true positives: predicted paid and actually paid
tp_filter=(predictions==1)&(loans['loan_status']==1)
tp=len(predictions[tp_filter])
tp

32250

In [11]:
# false positives: predicted paid but actually not paid
fp_filter=(predictions==1)&(loans['loan_status']==0)
fp=len(predictions[fp_filter])
fp

5367

# 7. Penalizing the classifier

Unfortunately, even though we're not using accuracy as an error metric, the classifier is, and it isn't accounting for the imbalance in the classes. There are a few ways to get a classifier to correct for imbalanced classes. The two main ways are:

* `Use oversampling and undersampling to ensure that the classifier gets input that has a balanced number of each class.`
* `Tell the classifier to penalize misclassifications of the less prevalent class more than the other class.`

`We'll look into oversampling and undersampling first.` They involve taking a sample that contains equal numbers of rows where loan_status is 0, and where loan_status is 1. This way, the classifier is forced to make actual predictions, since predicting all 1s or all 0s will only result in 50% accuracy at most.

The downside of this technique is that since it has to preserve an equal ratio, you have to do one of the following:

* `Throw out many rows of data.` If we wanted equal numbers of rows where loan_status is 0 and where loan_status is 1, one way we could do that is to delete rows where loan_status is 1.
* `Copy rows multiple times`. One way to equalize the 0s and 1s is to copy rows where loan_status is 0.
* `Generate fake data.` One way to equalize the 0s and 1s is to generate new rows where loan_status is 0.
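
The first two options in this list can be sketched with pandas alone. This is a rough illustration on a toy frame, not something we run on the real data; `toy` and the other names are made up for the example, and generating fake rows is left out.

```python
import pandas as pd

# Toy frame standing in for `loans`: 6 paid-off rows, 2 that were not.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400, 10000, 3000, 7000, 5600, 5375],
    "loan_status": [1, 1, 1, 1, 1, 1, 0, 0],
})

minority = toy[toy["loan_status"] == 0]
majority = toy[toy["loan_status"] == 1]

# Undersampling: keep only as many majority rows as there are minority rows.
majority_sample = majority.sample(n=len(minority), random_state=1)
undersampled = pd.concat([minority, majority_sample])

# Oversampling: repeat minority rows (sampling with replacement)
# until the two classes are the same size.
minority_repeated = minority.sample(n=len(majority), replace=True, random_state=1)
oversampled = pd.concat([majority, minority_repeated])

print(undersampled["loan_status"].value_counts())
print(oversampled["loan_status"].value_counts())
```

Both results are balanced, but the undersampled frame has thrown away most of the paid-off rows, and the oversampled one contains the same few unpaid rows several times over.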

Unfortunately, none of these techniques are especially easy. The second method we mentioned earlier, telling the classifier to penalize certain rows more, is actually much easier to implement using scikit-learn.

We can do this by setting the `class_weight parameter to balanced when creating the LogisticRegression instance.` This tells scikit-learn to penalize the misclassification of the minority class during the training process. The penalty means that the logistic regression classifier pays more attention to correctly classifying rows where loan_status is 0. This lowers accuracy when loan_status is 1, but raises accuracy when loan_status is 0.

`By setting the class_weight parameter to balanced, the penalty is set to be inversely proportional to the class frequencies.` You can read more about the parameter [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn-linear-model-logisticregression).
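
To get a feel for what "inversely proportional" means here, the scikit-learn documentation describes the balanced weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the class totals implied by the earlier cells. Treat the numbers as approximate: during cross validation the weights are computed on each training fold, so they will differ slightly.

```python
import numpy as np

# Class counts implied by the earlier cells: index 0 is loan_status == 0,
# index 1 is loan_status == 1.
class_counts = np.array([5374 + 15, 32248 + 38])
n_samples = class_counts.sum()
n_classes = len(class_counts)

# Heuristic described in the scikit-learn docs for class_weight="balanced":
# n_samples / (n_classes * np.bincount(y))
balanced_weights = n_samples / (n_classes * class_counts)
weight_ratio = balanced_weights[0] / balanced_weights[1]

print(balanced_weights)
print(weight_ratio)
```

The ratio between the two weights is simply the ratio of the class sizes, so a mistake on an unpaid loan costs about 6 times as much as a mistake on a paid one.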

# 8. Penalizing the classifier

In [12]:
lr = LogisticRegression(class_weight="balanced",solver='lbfgs')
predictions = cross_val_predict(lr, features, labels, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)



0.567831258130459
0.3802189645574318




# 9. Manual penalties

We significantly improved the false positive rate in the last screen by balancing the classes, which also reduced the true positive rate. Our true positive rate is now around 57%, and our false positive rate is around 38%. From a conservative investor's standpoint, it's reassuring that the false positive rate is lower because it means that we'll be able to do a better job at avoiding bad loans than if we funded everything.
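
A quick back-of-the-envelope check of that last claim: the sketch below estimates what share of the loans we fund would be paid off, using the two rates printed in the previous cell and the class totals implied by the earlier tp, fn, fp and tn cells. The variable names are illustrative, and the result is only as good as those inputs.

```python
# Rates printed by the class_weight="balanced" cell above.
tpr = 0.567831258130459
fpr = 0.3802189645574318

# Class totals implied by the earlier tp/fn/fp/tn cells.
n_paid = 32248 + 38
n_not_paid = 5374 + 15

funded_good = tpr * n_paid        # loans we fund that get paid off
funded_bad = fpr * n_not_paid     # loans we fund that do not

share_good_selective = funded_good / (funded_good + funded_bad)
share_good_fund_all = n_paid / (n_paid + n_not_paid)

print(round(share_good_selective, 3), round(share_good_fund_all, 3))
```

Under these assumptions, the share of funded loans that are paid off goes up by a few percentage points compared with funding everything, at the cost of passing on many good loans.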

`We can try to lower the false positive rate further by assigning a harsher penalty for misclassifying the negative class.` While setting class_weight to balanced will automatically set a penalty based on the number of 1s and 0s in the column, we can also set a manual penalty.

In [13]:
penalty = {
    0: 10,
    1: 1
}

lr = LogisticRegression(class_weight=penalty)
predictions = cross_val_predict(lr, features, labels, cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)



0.2514712259183547
0.09482278715902764


# 10. Random forests

In [14]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
predictions = cross_val_predict(rf, features, labels ,cv=3)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)

print(tpr)
print(fpr)



0.9700799107972495
0.9181666357394693


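The random forest's true positive rate is about 97%, but its false positive rate is about 92%, so it is still funding nearly every loan.

**Given these results, there's still quite a bit of room to improve:**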

* We can tweak the penalties further.
* We can try models other than a random forest and logistic regression.
* We can use some of the columns we discarded to generate better features.
* We can `ensemble multiple models` to get more accurate predictions.
* We can `tune the parameters of the algorithm` to achieve higher performance.