# More Classification, Metrics, and Class Imbalances

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
toy=pd.read_csv('toy_imb.csv')
toy.head()


In [None]:
from google.colab import drive
drive.mount('/content/drive')


The target is built like sklearn's ```make_hastie_10_2``` data, but with a cutoff chosen high enough that fewer observations end up in the positive class.

In [None]:
#def target(row):
#    if row['A']**2+row['B']**2+row['C']**2+row['D']**2+row['E']**2>9:
#        return 1
#    else:
#        return 0
#toy['Target']=toy.apply(target,axis=1)
#toy.to_csv('toy_imb.csv',index=0)
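
If ```toy_imb.csv``` is not at hand, a comparable dataset can be regenerated. The sketch below is only a stand-in: it assumes the five features ```A``` to ```E``` are independent standard normal draws and that there are 1000 rows, neither of which is confirmed by the notebook, and the seed and the name ```toy_sketch``` are arbitrary. Under that assumption the sum of squares follows a chi-squared distribution with 5 degrees of freedom, which puts roughly 11% of rows above 9.

```python
import numpy as np
import pandas as pd

# Assumption: features are i.i.d. standard normal; the original file may differ.
rng = np.random.default_rng(0)
feature_names = list("ABCDE")
toy_sketch = pd.DataFrame(rng.standard_normal((1000, 5)), columns=feature_names)

# Vectorized version of the commented-out target() function:
# label 1 when the sum of squared features exceeds 9, else 0.
toy_sketch["Target"] = ((toy_sketch[feature_names] ** 2).sum(axis=1) > 9).astype(int)

positive_rate = toy_sketch["Target"].mean()
print(positive_rate)
```

The exact positive rate depends on the seed, so numbers downstream will not match the original file exactly.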

In [None]:
len(toy)

In [None]:
toy['Target'].sum()/len(toy)

This dataset has a significant class imbalance. The positive class (1) is only around 11% of the observations. This presents a challenge in identifying a good classifier.

In [None]:
toy.Target.hist()

In [None]:
X=toy.drop('Target',axis=1)
y=toy['Target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=5)

In [None]:
print(sum(y_train)/len(y_train),":",sum(y_test)/len(y_test))

## Let's try a few classifiers

In [None]:
logreg=LogisticRegression().fit(X_train,y_train)

Note: Some folks have been getting a convergence warning (```ConvergenceWarning```) from their logistic regression models in sklearn. If so, a good idea is to scale the data using ```StandardScaler```.

In [None]:
logreg.score(X_test,y_test)

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dum_maj=DummyClassifier(strategy='most_frequent').fit(X_train,y_train)

In [None]:
dum_maj.score(X_test,y_test)

So Logistic Regression and Dummy Classifier have 88% accuracy. How do we feel about that?

Let's also build a simple depth 4 decision tree classifier:

In [None]:
Tclf = DecisionTreeClassifier(max_depth=4,random_state=0).fit(X_train,y_train)

In [None]:
Tclf.score(X_test,y_test)

Confusion matrix (cross-tabulated True vs Predicted) for the **Logistic Regression Classifier**:

In [None]:
pd.crosstab(y_test,logreg.predict(X_test))

Confusion matrix for the **Dummy Classifier**:

In [None]:
pd.crosstab(y_test,dum_maj.predict(X_test))

Confusion matrix for the **Tree Classifier**:

In [None]:
pd.crosstab(y_test,Tclf.predict(X_test))

So, the Logistic Regression classifier and the dummy classifier are just predicting 0 for every observation. The tree classifier seems like it represents small progress, so...

Let's try a random forest classifier:

In [None]:
rf = RandomForestClassifier(n_estimators=400,max_features='sqrt',random_state=0).fit(X_train,y_train)

In [None]:
accuracy_score(y_test,rf.predict(X_test))

Confusion Matrix for the **Random Forest Classifier**:

In [None]:
pd.crosstab(y_test,rf.predict(X_test))

We can add row and column labels for readability:

In [None]:
pd.crosstab(y_test,rf.predict(X_test),rownames=['True'], colnames=['Predicted'])

The tree-based methods are at least predicting class 1 for some observations.

Let's consider the cells in the cross tabulation of Actual Labels versus Model Predictions. Think of Class 1 as **POSITIVE** and Class 0 as **NEGATIVE**.

 \\\  | Predicted 0 | Predicted 1
---:|:---:| ---
**Actual 0** | $TN$ | $FP$
**Actual 1** | $FN$ | $TP$

(From our model's perspective:)

* $TN$ = True Negative
* $TP$ = True Positive
* $FN$ = False Negative
* $FP$ = False Positive

Thus far we've been using accuracy to assess classifiers. Accuracy is correct predictions over total observations:

$$
\text{Accuracy} = \frac{TN+TP}{TN+FN+FP+TP}
$$
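
As a minimal sketch of how the four cells feed the accuracy formula, the snippet below pulls them out with ```confusion_matrix``` from ```sklearn.metrics```. The labels ```y_true``` and ```y_pred``` are made up for illustration and are not from the toy data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (1 = positive, 0 = negative).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0])

# For binary 0/1 labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy straight from the formula: correct predictions over all observations.
accuracy = (tn + tp) / (tn + fn + fp + tp)
print(tn, fp, fn, tp, accuracy)
print(accuracy_score(y_true, y_pred))  # same value from sklearn
```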

In the test set above there are 300 observations:

In [None]:
len(y_test)

In [None]:
sum(y_test)

In [None]:
sum(y_test)/len(y_test)

Only 12% are in Class 1. We can get a seemingly (misleadingly) "*accurate*" classifier by just predicting 0 all the time. Accuracy is 88%.

Obviously, this is a useless classifier. Accuracy is a metric that works well when the classes are roughly balanced but fails to help us evaluate a classifier in the face of imbalanced data.

We need alternative metrics to accuracy to see how/what the classifier is doing. Two important metrics are: ***Precision*** and ***Recall***.

### Precision: What fraction of our *predicted* 1s are *actually* 1s? 

$$
\text{Precision} = \frac{TP}{TP+FP}
$$

### Recall: What fraction of the *actual* 1s do we catch?

$$
\text{Recall}=\frac{TP}{TP+FN}
$$
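
The two formulas can be checked by hand. The helper below is a hypothetical convenience function (```precision_recall``` is not part of sklearn), and returning 0 when a denominator is empty is a convention chosen here, not something the formulas dictate. The counts plugged in are the class-1 counts this notebook uses for the tree classifier (TP = 6, FP = 3, FN = 30).

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive class from raw counts."""
    # Convention assumed here: report 0.0 when a denominator is empty
    # (no predicted positives, or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

p, r = precision_recall(tp=6, fp=3, fn=30)
print(round(p, 2), ":", round(r, 2))
```

With these counts, most predicted 1s are right (precision about two thirds) but only about one in six actual 1s is caught.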

High Precision: A good target if we want to limit $FP$. Example: if people get an automated notification that they tested +, we don't want to tell a lot of people who don't have the disease that they have it. OR: We work for a pharmaceutical company and we are going to run a clinical trial (experiment). It's expensive! We want to be sure the drug in fact works when we say it works, so we want to limit the number of false positives: cases where we say it works but it doesn't.

High Recall: A good target if we want to limit $FN$. Example: Disease diagnosis. A $FN$ here means that we are predicting "No Disease" when, in fact, the disease is present. This has obvious negative consequences for the patient.

Covid antibody test: 1 means ANTIBODIES PRESENT:
* FP = We say you have antibodies but you don't.
* FN = We say you don't have antibodies but you do.

* Precision: fraction of people we claim to have antibodies who actually do.
* Recall: fraction of people truly having antibodies that we say have them.

There's a new blood test that screens for several cancers using cell-free DNA and does not require FDA approval. Cost is around $1000.

* FP : Test says Cancer but patient really does not have cancer.
* FN : Test says NO Cancer but patient really *does* have cancer.

Trade-off: The more you tune up the test's ability to detect the disease (reduce the FNs), the more FPs you will typically get.

Aside: Suppose a test
* gives a correct + result 99% of time 
* gives a correct - result 99% of time 
* The disease has a 1% prevalence in the population

What is the probability that a person has the disease given a positive test result?

$$P(\text{Have} \mid +) = \frac{P(\text{Have and }+)}{P(\text{Have and }+)+P(\text{Don't Have and }+)}$$
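
Plugging the numbers from the aside into this formula is a quick calculation. The variable names below (```sensitivity```, ```specificity```, ```prevalence```) are labels introduced for this sketch; the values are the 99%, 99%, and 1% stated above.

```python
sensitivity = 0.99   # P(+ | Have): correct + result 99% of the time
specificity = 0.99   # P(- | Don't Have): correct - result 99% of the time
prevalence = 0.01    # P(Have)

true_pos = prevalence * sensitivity                # P(Have and +)
false_pos = (1 - prevalence) * (1 - specificity)   # P(Don't Have and +)

p_have_given_pos = true_pos / (true_pos + false_pos)
print(p_have_given_pos)
```

The two terms in the denominator are equal, so the answer works out to one half (up to floating point rounding): even with a 99% accurate test, a positive result is only a coin flip when the disease is this rare.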

<img src="Bayes.png">

Precision: The fraction of True Positives out of all positive tests (True Pos + False Pos) 

In [None]:
#.01*(.99)/(.01*.99+.99*.01)
#image next commented out

<!--<img src="Comp.png">-->

In [None]:
from sklearn.metrics import classification_report

Back to evaluating the classifiers we built above.

Start with the tree classifier:

In [None]:
pd.crosstab(y_test,Tclf.predict(X_test),rownames=["True"],colnames=['Predicted'])

In [None]:
print(classification_report(y_test,Tclf.predict(X_test)))

Precision (class 1): TP/(TP+FP); Recall (class 1): TP /(TP+FN)

In [None]:
print(np.round(6/(6+3),2),":",np.round(6/(6+30),2))

For the Random Forest:

In [None]:
print(classification_report(y_test,rf.predict(X_test)))

For the logistic regression classifier:

In [None]:
pd.crosstab(y_test,logreg.predict(X_test),rownames=["True"],colnames=['Predicted'])

In [None]:
print(classification_report(y_test,logreg.predict(X_test)))

Warnings are raised because we are dividing by 0 when computing precision for class 1: TP=FP=0, so TP/(TP+FP) is undefined and sklearn reports it as 0.
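
One way to handle the undefined precision explicitly is the ```zero_division``` argument, which ```precision_score```, ```recall_score```, ```f1_score```, and ```classification_report``` accept in recent sklearn versions. The sketch below uses made-up labels and an all-zeros prediction to mimic a classifier that never predicts class 1.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only.
y_true = np.array([0, 0, 0, 1, 1])
y_all_zero = np.zeros_like(y_true)   # a classifier that never predicts class 1

# TP = FP = 0, so precision is 0/0; zero_division=0 reports 0.0 without a warning.
prec = precision_score(y_true, y_all_zero, zero_division=0)
# Recall is 0/2 here, which is well defined and equal to 0.
rec = recall_score(y_true, y_all_zero, zero_division=0)
print(prec, rec)
```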

### F1 score

In [None]:
from sklearn.metrics import f1_score

In [None]:
f1_score(y_test,rf.predict(X_test))

f1-score is an average of precision and recall (heavily influenced by the smaller of the two). 

f1 is the harmonic mean of P and R:
$$
\left(\frac{P^{-1}+R^{-1}}{2}\right)^{-1} = \frac{2PR}{P+R}
$$

(Why?) Aside: Suppose a car travels a distance of 100 miles at 30 mph and then 100 miles back at 70 mph. What's the average speed?

Not 50:

$$
\frac{\text{distance}}{\text{time}} = \frac{200}{\frac{100}{30}+\frac{100}{70}} = 42
$$

$$
\frac{2d}{ \frac{d}{s_1} + \frac{d}{s_2} } = \frac{2}{\frac{1}{s_1}+\frac{1}{s_2}} = \left(\frac{s_1^{-1}+s_2^{-1}}{2}\right)^{-1}
$$

f1-score is between 0 and 1, but won't be high (close to 1) unless both precision and recall are high.
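
A small sketch makes the contrast with the ordinary average concrete. ```f1_from_pr``` is a hypothetical helper written for this example, and the precision/recall pairs are made up.

```python
def f1_from_pr(p, r):
    """Harmonic mean of precision p and recall r (0.0 if both are 0)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Made-up (precision, recall) pairs: balanced-high, lopsided, balanced-middling.
for p, r in [(0.9, 0.9), (0.9, 0.1), (0.5, 0.5)]:
    arithmetic = (p + r) / 2
    print(p, r, "arithmetic:", round(arithmetic, 3), "f1:", round(f1_from_pr(p, r), 3))
```

The lopsided pair has the same arithmetic mean as the middling pair, but a much lower f1. The same function applied to the two speeds in the car aside, ```f1_from_pr(30, 70)```, gives the average speed of 42.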

One last reminder: Remember that logistic regression predicts the probability of an observation belonging to class 1. We can use a threshold other than 0.5 to increase the number of class 1 predictions:

In [None]:
logreg.predict_proba(X_test)[0:10]

In [None]:
logreg.predict(X_test)[0:10]

In [None]:
new_lr_pred = logreg.predict_proba(X_test)[:,1] >.2

In [None]:
f1_score(y_test,new_lr_pred)

In [None]:
print(classification_report(y_test,new_lr_pred))

In [None]:
pd.crosstab(y_test,new_lr_pred,colnames=['Predicted'])

Actually, the random forest has a ```predict_proba``` method too.

In [None]:
rf.predict_proba(X_test)[0:10]

In [None]:
new_rf_pred=rf.predict_proba(X_test)[:,1]>.4

In [None]:
f1_score(y_test,new_rf_pred)

In [None]:
f1_score(y_test,rf.predict(X_test))

In [None]:
pd.crosstab(y_test,new_rf_pred)

In [None]:
pd.crosstab(y_test,rf.predict(X_test))

In [None]:
print(classification_report(y_test,new_rf_pred))

P: tp/(tp+fp); R: tp/(tp+fn)

In [None]:
print("rf",18/25,18/(2*18),":","newrf",19/(19+11),19/(19+17))

How about a Boosting Classifier?

In [None]:
bclf = GradientBoostingClassifier(n_estimators=600,max_depth=3,learning_rate=0.1,random_state=0).fit(X_train,y_train)

In [None]:
accuracy_score(y_test,bclf.predict(X_test))

In [None]:
pd.crosstab(y_test,bclf.predict(X_test),rownames=["Actual"],colnames=["Predicted"])

In [None]:
f1_score(y_test,bclf.predict(X_test))

Ever so slightly worse than our RF classifier.

In [None]:
print(classification_report(y_test,bclf.predict(X_test)))

In [None]:
print(classification_report(y_test,rf.predict(X_test)))

### One more example

In [None]:
from sklearn.datasets import load_digits

In [None]:
digits = load_digits()
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 4))
for ax, image in zip(axes, digits.images):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r,interpolation='nearest')

Each image is an 8 by 8 array of pixel values: a digitized handwritten digit. Larger number = darker. Max=16.

In [None]:
digits.images[0]

We can flatten it into a row of 64 column values (features) and try to build a model.

In [None]:
ddf=pd.DataFrame(digits.data,columns=digits.feature_names)

In [None]:
y=pd.DataFrame(digits.target,columns=['Target'])

In [None]:
ddf.head()

Column names have the form ```pixel_row_column```.

In [None]:
y.head()

In [None]:
ddf.shape

In [None]:
y.shape

This could be a 10 class classification problem but we'll make it binary (and imbalanced).

In [None]:
def biny(x):
    if x==9:
        return 1
    else:
        return 0

In [None]:
by=y['Target'].apply(biny)

In [None]:
by[0:21]

In [None]:
digits.images[19]

In [None]:
fig, ax = plt.subplots(figsize=(10,4))
ax.imshow(digits.images[19], cmap=plt.cm.gray_r,interpolation='nearest')
plt.show()

In [None]:
by.sum()/len(by)

Train-Test split

In [None]:
Xd_train, Xd_test, yd_train, yd_test = train_test_split(ddf,by,random_state=0)

In [None]:
yd_test.shape

Reminder that accuracy won't help:

In [None]:
ddum=DummyClassifier(strategy='most_frequent').fit(Xd_train,yd_train)

In [None]:
ddum.score(Xd_test,yd_test)

In [None]:
f1_score(yd_test,ddum.predict(Xd_test))

In [None]:
dlr=LogisticRegression().fit(Xd_train,yd_train)

(This convergence warning goes away if we scale the data.)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
Xd_train_scaled = scaler.fit_transform(Xd_train)
# Reuse the training-set means and standard deviations; do not refit on the test data
Xd_test_scaled = scaler.transform(Xd_test)

In [None]:
scaled_dlr=LogisticRegression().fit(Xd_train_scaled,yd_train)
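
An alternative that keeps the scaling and the model together is a pipeline, sketched below on the same 9-vs-rest digits task. This is a sketch, not the notebook's method: ```make_pipeline``` comes from ```sklearn.pipeline```, the variable names are new, and ```max_iter=1000``` is a value chosen here to give the solver room to converge.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits_data = load_digits()
X_digits = digits_data.data
y_is_nine = (digits_data.target == 9).astype(int)   # 1 if the digit is a 9, else 0

Xp_train, Xp_test, yp_train, yp_test = train_test_split(X_digits, y_is_nine, random_state=0)

# The pipeline fits the scaler on the training rows only and reuses those
# means and standard deviations when it transforms the test rows.
pipe_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_lr.fit(Xp_train, yp_train)

pipe_f1 = f1_score(yp_test, pipe_lr.predict(Xp_test))
print(pipe_f1)
```

Because the scaler lives inside the estimator, the same object can be passed to ```cross_validate``` without leaking test-fold statistics into the scaling step.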

Simple decision tree classifier:

In [None]:
dclf=DecisionTreeClassifier(max_depth=4).fit(Xd_train,yd_train)

In [None]:
dclf.score(Xd_test,yd_test)

In [None]:
dlr.score(Xd_test,yd_test)

In [None]:
scaled_dlr.score(Xd_test_scaled,yd_test)

In [None]:
f1_score(yd_test,dlr.predict(Xd_test))

In [None]:
f1_score(yd_test,scaled_dlr.predict(Xd_test_scaled))

In [None]:
print(classification_report(yd_test,scaled_dlr.predict(Xd_test_scaled)))

In [None]:
pd.crosstab(yd_test,scaled_dlr.predict(Xd_test_scaled),colnames=["Predicted"])

Unscaled logistic regression:

In [None]:
pd.crosstab(yd_test,dlr.predict(Xd_test),colnames=['Predicted'])

Simple decision tree classifier:

In [None]:
f1_score(yd_test,dclf.predict(Xd_test))

In [None]:
pd.crosstab(yd_test,dclf.predict(Xd_test))

In [None]:
print(classification_report(yd_test,dclf.predict(Xd_test)))

Let's build a Random Forest:

In [None]:
digit_rf = RandomForestClassifier(n_estimators=400,max_features='sqrt',random_state=0).fit(Xd_train,yd_train)

In [None]:
print(classification_report(yd_test,digit_rf.predict(Xd_test)))

In [None]:
pd.crosstab(yd_test,digit_rf.predict(Xd_test))

Boosting classifier:

In [None]:
digit_boost = GradientBoostingClassifier(n_estimators=400, learning_rate=0.1,max_depth=2, random_state=0).fit(Xd_train, yd_train)

In [None]:
pd.crosstab(yd_test,digit_boost.predict(Xd_test))

In [None]:
print(classification_report(yd_test,digit_boost.predict(Xd_test)))

Both (scaled) logistic regression and boosting did very well. Boosting is the best.

We can look at a few of the images we "missed" on.

In [None]:
misses=np.where(yd_test!=digit_boost.predict(Xd_test))[0]

In [None]:
misses

All of these were, in fact, 9s, but we predicted NOT.

In [None]:
yd_test.iloc[[90,115,130,211, 312, 429]]

In [None]:
digit_boost.predict(Xd_test)[[90,115,130,211, 312, 429]]

Let's look at 211:

In [None]:
fig, ax = plt.subplots(figsize=(10,4))
ax.imshow(Xd_test.iloc[211].values.reshape(8,8), cmap=plt.cm.gray_r,interpolation='nearest')
plt.show()

Does that look nine-y?

In [None]:
fig, ax = plt.subplots(figsize=(10,4))
ax.imshow(Xd_test.iloc[130].values.reshape(8,8), cmap=plt.cm.gray_r,interpolation='nearest')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,4))
ax.imshow(Xd_test.iloc[115].values.reshape(8,8), cmap=plt.cm.gray_r,interpolation='nearest')
plt.show()

### Imbalanced-learn (imblearn)

First, from the terminal run conda:

```conda install -c conda-forge imbalanced-learn```

```imblearn``` should be available in colab.

In [None]:
import imblearn

In [None]:
from imblearn.over_sampling import SMOTE

Let's go back to our original imbalanced toy data set:

In [None]:
toy.head()

In [None]:
print(toy.shape,toy['Target'].sum()/len(toy))

In [None]:
oversample = SMOTE()
Xs, ys = oversample.fit_resample(X_train, y_train)

In [None]:
np.sum(ys)/len(ys)

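SMOTE creates synthetic positive class observations to make the classes balanced. It uses a *knn* approach: create a new observation at a random point on the line between a positive class observation $x$ and one of its $k$ nearest positive class neighbors (the default in ```imblearn``` is ```k_neighbors=5```).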
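
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not ```imblearn```'s implementation: the function ```smote_like_sample``` and the 2-D points are made up, and ```k = 5``` is chosen to match the ```k_neighbors``` default in ```imblearn```.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class points in 2-D, for illustration only.
minority = rng.standard_normal((20, 2))
k = 5  # number of nearest neighbors to choose from

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbors."""
    x = points[rng.integers(len(points))]
    # Euclidean distances from x to every minority point.
    dists = np.linalg.norm(points - x, axis=1)
    # argsort()[0] is x itself (distance 0), so take the next k indices.
    neighbor_idx = np.argsort(dists)[1:k + 1]
    neighbor = points[rng.choice(neighbor_idx)]
    # Pick a random position on the segment from x to the chosen neighbor.
    gap = rng.random()
    return x + gap * (neighbor - x), x, neighbor

new_point, x, neighbor = smote_like_sample(minority, k, rng)
print(new_point)
```

Repeating this until the minority class matches the majority class in size gives the balanced training set; only the training data is resampled, never the test set.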

In [None]:
print(X_train.shape, Xs.shape)

In [None]:
rfs = RandomForestClassifier(n_estimators=400,max_features='sqrt',random_state=0).fit(Xs,ys)

In [None]:
f1_score(y_test,rfs.predict(X_test))

In [None]:
bs_clf = GradientBoostingClassifier(n_estimators=600,max_depth=3,learning_rate=0.1,random_state=0).fit(Xs,ys)

In [None]:
f1_score(y_test,bs_clf.predict(X_test))

Considerable improvement

In [None]:
pd.crosstab(y_test,bs_clf.predict(X_test))

In [None]:
print(classification_report(y_test,bs_clf.predict(X_test)))

In [None]:
bs_clf_no_os = GradientBoostingClassifier(n_estimators=600,max_depth=3,learning_rate=0.1,random_state=0).fit(X_train,y_train)

In [None]:
f1_score(y_test,bs_clf_no_os.predict(X_test))

Oversampling gave an improvement over the same model fit without oversampling.

In [None]:
pd.crosstab(y_test,bs_clf_no_os.predict(X_test))

In [None]:
print(classification_report(y_test,bs_clf_no_os.predict(X_test)))

### A real dataset with serious class imbalance

Kinda big too so proceed with caution...

In [None]:
cc=pd.read_csv('creditcard.csv')

In [None]:
cc.shape

In [None]:
cc.columns

In [None]:
cc.head()

In [None]:
cc.Class.sum()

In [None]:
cc.Class.sum()/cc.shape[0]

In [None]:
np.round(cc.describe(),2)

In [None]:
Xc=cc.drop('Class',axis=1)
yc=cc['Class']

In [None]:
Xc=Xc.drop('Time',axis=1)

In [None]:
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc,yc,test_size=0.4, random_state=10)

In [None]:
yc_test.sum()

In [None]:
yc_train.sum()

In [None]:
#num_trees=[10,20]

In [None]:
Xc_train.shape

In [None]:
scaler=StandardScaler()
Xc_train_scaled=scaler.fit_transform(Xc_train)
# Reuse the training-set means and standard deviations; do not refit on the test data
Xc_test_scaled=scaler.transform(Xc_test)

In [None]:
lrc=LogisticRegression().fit(Xc_train_scaled,yc_train)

In [None]:
f1_score(yc_test,lrc.predict(Xc_test_scaled))

In [None]:
unsc_lrc=LogisticRegression().fit(Xc_train,yc_train)

In [None]:
dtc=DecisionTreeClassifier(max_depth=4,random_state=0).fit(Xc_train,yc_train)

In [None]:
f1_score(yc_test,dtc.predict(Xc_test))

In [None]:
dtcbig=DecisionTreeClassifier(random_state=0).fit(Xc_train,yc_train)

In [None]:
f1_score(yc_test,dtcbig.predict(Xc_test))

In [None]:
##### Warning: Below takes crazy long! Be careful!!!
#If you did want to validate some parameter choices like number of trees or max-depth etc.

```tqdm``` around the iterator will give a progress bar to show progress through the loop.

In [None]:
#from tqdm import tqdm
#cvres=[]
#for num in tqdm(num_trees):
#    rf_cv_clf = RandomForestClassifier(n_estimators = num, max_features='sqrt',max_depth=4,random_state=0)
#    cvclf = cross_validate(rf_cv_clf, Xc_train, yc_train, cv=5, scoring='f1')
#    cvres.append(cvclf['test_score'].mean())

In [None]:
rfc=RandomForestClassifier(n_estimators = 150, max_features='sqrt',max_depth=4,random_state=0).fit(Xc_train,yc_train)

In [None]:
f1_score(yc_test,rfc.predict(Xc_test))

In [None]:
print(classification_report(yc_test,rfc.predict(Xc_test)))

* FP: we say FRAUD and it's not 
* FN: we say NO FRAUD and it is fraud

In [None]:
pd.crosstab(yc_test,rfc.predict(Xc_test),colnames=["Prediction"])

It's possible that we'd prefer to catch a few more of the fraud cases we are currently missing (the False Negatives).

In [None]:
new_rfc_pred = rfc.predict_proba(Xc_test)[:,1] > .25

In [None]:
pd.crosstab(yc_test,new_rfc_pred,colnames=["Predicted"])

In [None]:
f1_score(yc_test,new_rfc_pred)

That actually *improved* the f1 score.

We can look at the performance of the classifier for various values of the prediction threshold all at once using precision recall curves.

In [None]:
from sklearn.metrics import precision_recall_curve

In [None]:
precision, recall, thresholds = precision_recall_curve(yc_test,rfc.predict_proba(Xc_test)[:,1])

In [None]:
plt.figure(figsize=(8,6))
plt.plot(precision,recall,label='rf')
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.legend(['rf'],loc='best')
plt.show()

Each point on the curve gives Precision and Recall for a particular threshold value.
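
Since every point corresponds to a threshold, the curve can also be scanned for the threshold with the best f1. The sketch below is self-contained and uses synthetic data from ```make_classification``` rather than the credit card data. All variable names are new, and picking the threshold by f1 is one possible criterion, not the only one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, used only so the sketch is self-contained.
Xa, ya = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
Xa_train, Xa_test, ya_train, ya_test = train_test_split(Xa, ya, stratify=ya, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(Xa_train, ya_train).predict_proba(Xa_test)[:, 1]
prec, rec, thr = precision_recall_curve(ya_test, scores)

# prec and rec have one more entry than thr; the final point has no threshold.
prec_t, rec_t = prec[:-1], rec[:-1]
denom = prec_t + rec_t
# f1 at each threshold, defined as 0 where precision + recall is 0.
f1_by_threshold = np.divide(2 * prec_t * rec_t, denom, out=np.zeros_like(denom), where=denom > 0)

best = np.argmax(f1_by_threshold)
# Cross-check: predicting 1 when score >= thr[best] reproduces the same f1.
check = f1_score(ya_test, (scores >= thr[best]).astype(int))
print(thr[best], prec_t[best], rec_t[best], f1_by_threshold[best], check)
```

Choosing the threshold on the test set, as done here for brevity, gives an optimistic f1; in practice the threshold would be chosen on a validation set or by cross-validation.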

One way to get high Recall (identify lots of true 1s) is to predict lots of 1s. 

What's wrong with that? Lots of false positives, i.e., low precision.

In [None]:
max(thresholds)

Let's pick a threshold close to 0.5:

In [None]:
pt=np.argmin(np.abs(thresholds-0.5))

In [None]:
print(pt,":",thresholds[pt])

P and R at that point:

In [None]:
print(np.round(precision[pt],3),np.round(recall[pt],3))

In [None]:
plt.figure(figsize=(8,6))
plt.plot(precision,recall,label='rf')
plt.plot(precision[pt],recall[pt],'^',c='k',markersize=10)
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.show()

Let's choose a lower threshold, which makes it easier to predict Class = 1.

What would we expect in terms of FP, FN?

In [None]:
newpt=np.argmin(np.abs(thresholds-0.25))

In [None]:
plt.figure(figsize=(8,6))
plt.plot(precision,recall,label='rf')
plt.plot(precision[pt],recall[pt],'^',c='k',markersize=10)
plt.plot(precision[newpt],recall[newpt],'o',c='b',markersize=10)
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.show()

In [None]:
print(np.round(precision[newpt],3),":",np.round(recall[newpt],3))

In [None]:
from scipy.stats import hmean

In [None]:
hmean([precision[newpt],recall[newpt]])

In [None]:
hmean([precision[pt],recall[pt]])

The ROC curve is another window into (basically) the same idea. It plots Recall against the FP rate.

In [None]:
from sklearn.metrics import roc_curve

In [None]:
fpr, tpr, thresholds = roc_curve(yc_test,rfc.predict_proba(Xc_test)[:,1])

In [None]:
pt=np.argmin(np.abs(thresholds-0.5))
newpt=np.argmin(np.abs(thresholds-0.25))
thirdpt=np.argmin(np.abs(thresholds-0.0001))

In [None]:
plt.figure(figsize=(8,6))
plt.plot(fpr,tpr,label='rf')
plt.plot(fpr[pt],tpr[pt],'^',c='k',markersize=10)
plt.plot(fpr[newpt],tpr[newpt],'o',c='b',markersize=10)
plt.plot(fpr[thirdpt],tpr[thirdpt],'*',c='r',markersize=10)
plt.xlabel('FPR')
plt.ylabel('TPR (Recall)')
plt.show()

As we increase Recall (our ability to identify true 1s), we pay the price of more FP.

An overall metric of the quality of the classifier is Area Under the ROC Curve (AUC). A good classifier will have a ROC curve that stays close to the y-axis and have high recall without increasing the FP rate (much)... so AUC close to 1.
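
To see that AUC really is the area under the plotted curve, the sketch below integrates the ROC points with ```auc``` from ```sklearn.metrics``` (trapezoidal rule) and compares the result with ```roc_auc_score```. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and scores, for illustration only.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
# Positives get scores shifted upward, so the ranking is better than chance.
score_demo = rng.random(200) + 0.5 * y_demo

fpr_demo, tpr_demo, _ = roc_curve(y_demo, score_demo)

area_trapezoid = auc(fpr_demo, tpr_demo)          # trapezoidal rule on the curve
area_direct = roc_auc_score(y_demo, score_demo)   # same quantity in one call
print(area_trapezoid, area_direct)
```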

In [None]:
from sklearn.metrics import average_precision_score, roc_auc_score

In [None]:
roc_auc_score(yc_test,rfc.predict_proba(Xc_test)[:,1])

Average precision score: the weighted mean of the Precision at each threshold, where each weight is the increase in Recall over the previous threshold.
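
That weighted sum can be reproduced from the output of ```precision_recall_curve```. The sketch below uses made-up labels and scores and compares the hand computation with ```average_precision_score```.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Made-up labels and scores, for illustration only.
rng = np.random.default_rng(1)
y_ap = rng.integers(0, 2, size=200)
score_ap = rng.random(200) + 0.5 * y_ap

p_ap, r_ap, _ = precision_recall_curve(y_ap, score_ap)

# Recall is returned in decreasing order, so np.diff(r_ap) is <= 0;
# the minus sign turns each step into the recall gained at that threshold.
ap_by_hand = -np.sum(np.diff(r_ap) * p_ap[:-1])
ap_sklearn = average_precision_score(y_ap, score_ap)
print(ap_by_hand, ap_sklearn)
```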

In [None]:
average_precision_score(yc_test,rfc.predict_proba(Xc_test)[:,1])

Compare RF to LR

In [None]:
lrfpr, lrtpr, lrthresholds = roc_curve(yc_test,lrc.predict_proba(Xc_test_scaled)[:,1])

In [None]:
plt.figure(figsize=(8,6))
plt.plot(fpr,tpr,label='rf')
plt.plot(lrfpr,lrtpr,label='lr')
plt.plot(fpr[pt],tpr[pt],'^',c='k',markersize=10)
plt.plot(fpr[newpt],tpr[newpt],'o',c='b',markersize=10)
plt.plot(fpr[thirdpt],tpr[thirdpt],'*',c='r',markersize=10)
plt.xlabel('FPR')
plt.ylabel('TPR (Recall)')
plt.legend(['rf','lr'],loc='best')
plt.show()

In [None]:
lr_precision, lr_recall, thresholds = precision_recall_curve(yc_test,lrc.predict_proba(Xc_test_scaled)[:,1])

In [None]:
plt.figure(figsize=(8,6))
plt.plot(precision,recall,label='rf')
plt.plot(lr_precision,lr_recall,label='lr')
plt.xlabel('Precision')
plt.ylabel('Recall')
plt.legend(loc='best')
plt.show()

## Sandbox

##### CreditCard dataset:

In [None]:
cc.shape

* Try a boosting model and see how it does. Validation will take a long time so maybe try 150 trees, learning rate of 0.1, and max depth = 3.
* What seems to be an optimal threshold value to use in the scaled logistic regression model? How are you evaluating what optimal means?
* Also try knn and SMOTE on the credit card data.

Another approach (usually worse, because we're throwing information away) is downsampling: make ```cc``` smaller by sampling the 0s but keep all the 1s:

In [None]:
cc1=cc[cc['Class']==1]
cc0=cc[cc['Class']==0]

In [None]:
ccs=cc0.sample(frac=.1)

In [None]:
ccs.shape

In [None]:
new_cc=pd.concat([ccs,cc1])
new_cc.shape

In [None]:
new_cc['Class'].sum()/new_cc.shape[0]

The positive class is now closer to 1.7% of the data.

In [None]:
new_cc.reset_index(inplace=True,drop=True)

In [None]:
new_cc.head()

#### Digits

Above we treated the digits data as a binary classification problem: 9 or Other. Now try leaving it as a full 10 Class problem. That is, try to correctly classify each digit as 0,1,2,3,4,5,6,7,8, or 9. Use logistic regression, random forest, boosting, knn. What does the classification report look like in this case?

#### Another synthetic dataset:

Play around with this and see how well you can do with classification.

In [None]:
from sklearn import datasets

In [None]:
# weights are the class proportions and should sum to 1 (0.06 here, not 0.6)
X,y=datasets.make_classification(n_samples=1000, n_features=20, n_informative=7, n_redundant=5, n_classes=2, n_clusters_per_class=2, weights=[0.94,0.06], flip_y=0.07, class_sep=1.1, hypercube=False, shift=0.0, scale=1.0, shuffle=True, random_state=10)

In [None]:
fr=pd.DataFrame(X)

In [None]:
fr.head()

In [None]:
y[0:10]