<a href="https://colab.research.google.com/github/quinn-dougherty/DS-Unit-2-Sprint-4-Model-Validation/blob/master/24SC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 # Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll measure success using recall score as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [137]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler,RobustScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

dependent = 'made_donation_07'

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': dependent
})


print(df[dependent].value_counts())

df.head()


0    570
1    178
Name: made_donation_07, dtype: int64


Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_07
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

In [138]:
majority_class = df[dependent].value_counts().idxmax()

y0 = pd.DataFrame(np.full((df.shape[0], 1) , fill_value=majority_class), columns=['predicted' + dependent])

accuracy = np.divide(df[dependent].value_counts()[majority_class], df.shape[0])

print(f'The majority-class baseline has an accuracy score of {accuracy:.3}')


The majority-class baseline has an accuracy score of 0.762


What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

## We _never_ predicted `1` in the majority-class baseline. 
## our `True Positive` rate is `0`. 
## $Recall = \frac{Accurately-predicted-1}{1-is-observed-in-data} = \frac{0}{178} = 0$

## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [139]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(dependent, axis=1), 
                                                    df[dependent], 
                                                    test_size=0.25, 
                                                    shuffle=True)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(561, 4) (187, 4) (561,) (187,)


## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [0]:
from sklearn.preprocessing import StandardScaler,RobustScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Define an estimator and param_grid
pipe = Pipeline(steps=[
    ('scale', RobustScaler()), 
    ('reduce_dim', SelectKBest(f_classif)), 
    ('classify', LogisticRegression(solver='lbfgs'))])

# pipe.fit(X_train,y_train)

# sum(pipe.predict(X_test) == y_test)

## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [141]:
%%time
pg = {
    'reduce_dim__k': range(1,5),
    'classify__class_weight': ['None', 'Balanced'],
    'classify__C': [10**k for k in range(-4, 5)]
}

gs = GridSearchCV(pipe, param_grid=pg, cv=5, scoring='recall', 
                  verbose=10, n_jobs=-1, return_train_score=True, iid=True)
# i'm on GPU
gs.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0255s.) Setting batch_size=14.
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 144 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 242 tasks      | elapsed:    1.6s


CPU times: user 424 ms, sys: 5.46 ms, total: 430 ms
Wall time: 2.29 s


[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:    2.3s finished


## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [143]:
print(gs.best_estimator_)

Pipeline(memory=None,
     steps=[('scale', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('reduce_dim', SelectKBest(k=3, score_func=<function f_classif at 0x7f6e936807b8>)), ('classify', LogisticRegression(C=10, class_weight='None', dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])


In [144]:
best_k = 3
best_classweight = 'None'
best_C = 10

report3 = f'According to my 5-fold gridsearch: \n\tthe best number of ' + \
          f'features to select is {best_k}\n\tthe best class-weighing ' + \
          f'is {best_classweight}\n\tthe best regularization ' +\
          f'strength is {best_C}' 

print(report3)

According to my 5-fold gridsearch: 
	the best number of features to select is 3
	the best class-weighing is None
	the best regularization strength is 10


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

Calculate accuracy

In [146]:
def confusionmatrix(model=gs.best_estimator_, dat=X_test, target=y_test):
#   rows = pd.MultiIndex.from_product([['Actual', 'Predicted'],
#                                      ['Negative', 'Positive']])
  seconds = ['Negative', 'Positive']
  firsts = ['Actual', 'Predicted']
  
  print("at an implicit decision rule of 0.5, i.e., if model(dat)>=0.5 then model(dat)=1")
  
  def tupdex(first): 
    return [first + ' ' + seconds[0], first + ' ' + seconds[1]]
  
  c = np.empty((2,2))
  
  def fill(i,j): 
    val = sum([x==i and y==j for x,y in zip(target, model.predict(dat))])
    c[i][j] = val
    pass
  fill(0,0)
  fill(0,1)
  fill(1,0)
  fill(1,1)
  df = pd.DataFrame(c, index=tupdex(firsts[0]), columns=tupdex(firsts[1]))
  return df

pipe1_cm = confusionmatrix(gs.best_estimator_)

pipe1_cm

at an implicit decision rule of 0.5, i.e., if model(dat)>=0.5 then model(dat)=1


Unnamed: 0,Predicted Negative,Predicted Positive
Actual Negative,134.0,3.0
Actual Positive,45.0,5.0


Calculate precision

In [147]:
def precision(cm=confusionmatrix()): 
  TP = cm['Predicted Positive'].loc['Actual Positive']
  PP = cm['Predicted Positive'].sum()
  return np.divide(TP, PP)

precision()

at an implicit decision rule of 0.5, i.e., if model(dat)>=0.5 then model(dat)=1


0.625

Calculate recall

In [148]:
def recall(cm=confusionmatrix()): 
  TP = cm['Predicted Positive'].loc['Actual Positive']
  AP = cm.loc['Actual Positive'].sum()
  return np.divide(TP, AP)

recall()

def F1(cm=confusionmatrix()): 
  prec = precision(cm)
  reca = recall(cm)
  return 2 * np.divide(prec * reca, prec + reca)

def typeI(cm=confusionmatrix()):
  return cm['Predicted Positive'].loc['Actual Negative']

at an implicit decision rule of 0.5, i.e., if model(dat)>=0.5 then model(dat)=1
at an implicit decision rule of 0.5, i.e., if model(dat)>=0.5 then model(dat)=1
at an implicit decision rule of 0.5, i.e., if model(dat)>=0.5 then model(dat)=1


In [149]:
def confusion_report(cm=confusionmatrix()):
  s1 = f'this model got a precision score of {precision(cm):.3}\n'
  s2 = f'a recall score of {recall(cm):.3}\n'
  s3 = f'an F1 score of {F1(cm):.3}\n'
  s4 = f'and {int(typeI(cm))} Type I errors'
  return s1+s2+s3+s4

initial_pipeline_performance = (report3, confusion_report(pipe1_cm), pipe1_cm)

print(initial_pipeline_performance[0])
print(initial_pipeline_performance[1])
initial_pipeline_performance[2]

at an implicit decision rule of 0.5, i.e., if model(dat)>=0.5 then model(dat)=1
According to my 5-fold gridsearch: 
	the best number of features to select is 3
	the best class-weighing is None
	the best regularization strength is 10
this model got a precision score of 0.625
a recall score of 0.1
an F1 score of 0.172
and 3 Type I errors


Unnamed: 0,Predicted Negative,Predicted Positive
Actual Negative,134.0,3.0
Actual Positive,45.0,5.0


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?

### Part 4
Calculate F1 score and False Positive Rate. 

In [150]:
df.head()

X = df.drop(dependent, axis=1)
y = df[dependent]

poly = PolynomialFeatures(degree=4)
poly.fit(X)
X_poly = pd.DataFrame(poly.transform(X), columns=poly.get_feature_names(X.columns))

print(X_poly.shape)

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_poly, 
                                                    y, 
                                                    test_size=0.25, 
                                                    shuffle=True)
print(X_train2.shape, X_test2.shape, y_train2.shape, y_test2.shape)

(748, 70)
(561, 70) (187, 70) (561,) (187,)


In [151]:
%%time

# i pick 17 as my number by whihc I want observations to outnumber features: 
B = int(np.divide(X_poly.shape[0], 17))

# Define an estimator and param_grid
pipe2 = Pipeline(steps=[
    ('scale', StandardScaler()), 
    ('reduce_dim', SelectKBest(f_classif)), 
    ('classify', LogisticRegression(solver='lbfgs', max_iter=12345))])

pg2 = {
    'reduce_dim__k': range(B//3,B, 2),
    'classify__class_weight': ['None', 'Balanced'],
    'classify__C': [10**k for k in range(-4, 5)]
}

gs2 = GridSearchCV(pipe2, param_grid=pg2, cv=15, scoring='recall', 
                  verbose=4, n_jobs=-1, return_train_score=True, iid=True)
# i'm on GPU
gs2.fit(X_train2, y_train2)

Fitting 15 folds for each of 270 candidates, totalling 4050 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 208 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 1084 tasks      | elapsed:   21.4s
[Parallel(n_jobs=-1)]: Done 2284 tasks      | elapsed:   56.9s
[Parallel(n_jobs=-1)]: Done 2863 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 3226 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 3558 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 3965 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done 4050 out of 4050 | elapsed:  7.7min finished
  f = msb / msw


CPU times: user 13.5 s, sys: 986 ms, total: 14.5 s
Wall time: 7min 45s


In [152]:
gs2.best_estimator_

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('reduce_dim', SelectKBest(k=38, score_func=<function f_classif at 0x7f6e936807b8>)), ('classify', LogisticRegression(C=10000, class_weight='None', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=12345,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False))])

In [154]:
best_k_2 = 38
best_classweight_2 = 'None'
best_C_2 = 10000

report32 = f'According to my 15-fold gridsearch after generating polynomial features of degree 4: \n\tthe best number of ' + \
          f'features to select is {best_k_2}\n\tthe best class-weighing ' + \
          f'is {best_classweight_2}\n\tthe best regularization ' +\
          f'strength is {best_C_2}' 


pipe2_cm = confusionmatrix(gs2.best_estimator_, dat=X_test2, target=y_test2)


second_pipeline_performance = (report32, confusion_report(pipe2_cm), pipe2_cm)

print(second_pipeline_performance[0])
print(second_pipeline_performance[1])
second_pipeline_performance[2]

at an implicit decision rule of 0.5, i.e., if model(dat)>=0.5 then model(dat)=1
According to my 15-fold gridsearch after generating polynomial features of degree 4: 
	the best number of features to select is 38
	the best class-weighing is None
	the best regularization strength is 10000
this model got a precision score of 0.5
a recall score of 0.278
an F1 score of 0.357
and 10 Type I errors


Unnamed: 0,Predicted Negative,Predicted Positive
Actual Negative,141.0,10.0
Actual Positive,26.0,10.0


I ran it at `cv=5` and got an improvement. 
It took under 2 minutes, so I knew setting `cv=15` would take no more than 6 minutes. To be sure, i reduced the amount of "best ks" it would try. 

My hypothesis is that higher cv makes it slightly better. 

# At cv=5
- Precision went up from 2/3 to 0.741 --- an improvement! 
- recall went up from 0.095 to 0.377 --- an improvement! 
- F1 went up from 0.167 to 0.377 --- and improvement! 
- we got more false-positives, up from 2 to 7, tho. 


# at cv=15
- precision is worse than `cv=5` (0.5)
- recall is worse (0.278)
- f1 score slightly worst (0.357)
- more type1 errors, up to 10. 

Should have kept it at `cv=5`!