 # Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The **goal** is to predict the **last column**, that is, whether the donor made a **donation in March 2007**, using information about each donor's history. We'll measure success using **_recall score_ as the model evaluation metric**.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [35]:
# initial imports
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.feature_selection import f_classif  # scoring function passed to SelectKBest
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})
print(df.shape)  # 748 rows by 5 columns
print(df.isna().sum())  # zero nan's; thanks, Ryan Herr!  
df.head()

(748, 5)
months_since_last_donation     0
number_of_donations            0
total_volume_donated           0
months_since_first_donation    0
made_donation_in_march_2007    0
dtype: int64


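<table>
  <tr><th></th><th>months_since_last_donation</th><th>number_of_donations</th><th>total_volume_donated</th><th>months_since_first_donation</th><th>made_donation_in_march_2007</th></tr>
  <tr><th>0</th><td>2</td><td>50</td><td>12500</td><td>98</td><td>1</td></tr>
  <tr><th>1</th><td>0</td><td>13</td><td>3250</td><td>28</td><td>1</td></tr>
  <tr><th>2</th><td>1</td><td>16</td><td>4000</td><td>35</td><td>1</td></tr>
  <tr><th>3</th><td>2</td><td>20</td><td>5000</td><td>45</td><td>1</td></tr>
  <tr><th>4</th><td>1</td><td>24</td><td>6000</td><td>77</td><td>0</td></tr>
</table>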


## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

In [15]:
# make a copy of df and work with the copy going forward
df1 = df.copy()

# no train/test split yet in this cell
# Hat tip to Ryan Herr/LSDS
X = df1.drop('made_donation_in_march_2007', axis='columns')
y_true = df1.made_donation_in_march_2007
majority_class = y_true.mode()[0]
y_pred = np.full(shape=y_true.shape, fill_value=majority_class)

# validate
print(y_true.shape, y_pred.shape)
all(y_pred == majority_class)

(748,) (748,)


True

In [17]:
# compute accuracy_score
print('accuracy score is:', accuracy_score(y_true, y_pred))

accuracy score is: 0.7620320855614974
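
As the prompt notes, the same number can be obtained without building `y_pred`: the accuracy of a majority class baseline is simply the share of the most frequent label. The following is a minimal sketch in plain Python. The class counts (570 non-donors, 178 donors) are an assumption inferred from the 748 rows and the accuracy printed above; in the notebook itself, `y_true.value_counts(normalize=True)` should give the same proportions.

```python
from collections import Counter

# Assumed class counts, inferred from 748 rows and the reported accuracy of ~0.762
labels = [0] * 570 + [1] * 178

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is right exactly this often
baseline_accuracy = majority_count / len(labels)

print('majority class:', majority_class)
print('baseline accuracy:', baseline_accuracy)
```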


What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

In [19]:
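# compute recall_score
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
'''
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.
The recall is intuitively the ability of the classifier to find all the positive samples.
'''
# average=None returns one recall value per class, ordered [class 0, class 1].
# The recall for the positive class (1 = donated) is the second value: the
# baseline never predicts a donation, so tp = 0 and recall = 0.
print('recall score is:', recall_score(y_true, y_pred, average=None))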

recall score is: [1. 0.]
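
The recall of the positive class can also be worked out with no library at all. The sketch below applies tp / (tp + fn) to an always-negative prediction. The helper name `recall_for` and the demo labels (570 negatives and 178 positives, inferred from the baseline accuracy above) are illustrative assumptions, not part of the original notebook.

```python
def recall_for(y_true, y_pred, positive=1):
    """Recall = tp / (tp + fn), treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)

# Assumed labels: 570 non-donors (0) and 178 donors (1)
y_true_demo = [0] * 570 + [1] * 178
# Majority class baseline: predict 0 for everyone
y_pred_demo = [0] * len(y_true_demo)

recall_positive = recall_for(y_true_demo, y_pred_demo, positive=1)
recall_negative = recall_for(y_true_demo, y_pred_demo, positive=0)
print('recall for class 1:', recall_positive)
print('recall for class 0:', recall_negative)
```

Because the baseline never predicts a donation, it finds none of the actual donors, which is why recall is a more demanding metric than accuracy on this imbalanced target.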


## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [21]:
# split into train and test sets (75% / 25%), shuffling the rows first
# cf. https://github.com/johnpharmd/DS-Unit-2-Sprint-4-Model-Validation/blob/master/module-1-begin-modeling-process/LS_DS_241_Begin_modeling_process_LIVE_LESSON.ipynb
X_train, X_test, y_train, y_test = tts(X, y_true, test_size=0.25, shuffle=True)
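
To make the mechanics of a shuffled 75/25 hold-out split concrete, here is a rough sketch in plain Python that operates on row positions only. The helper `shuffle_split` and its `seed` argument are hypothetical names introduced for illustration; this is not how scikit-learn implements its split, only the same idea in miniature.

```python
import math
import random

def shuffle_split(n_rows, test_size=0.25, seed=0):
    """Shuffle row positions, then hold out the first `test_size` share as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffle_split(748)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split stable between runs, which makes later scores comparable.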

## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [32]:
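# SVC pipeline (scaler + RBF-kernel SVM), used by the grid search in Part 2.2
# hat tip to Ryan Herr/LSDS for following URL:
# https://github.com/rasbt/python-machine-learning-book/blob/master/code/bonus/svm_iris_pipeline_and_gridsearch.ipynb
from sklearn.feature_selection import f_classif  # needed by SelectKBest below

cls = SVC(C=10.0, kernel='rbf', gamma=0.1, decision_function_shape='ovr')

kernel_svm = Pipeline([('std', StandardScaler()), ('svc', cls)])

# select features using SelectKBest
features = SelectKBest(f_classif, k=3)

# logistic regression fit on its own, outside any pipeline
log_reg = LogisticRegression().fit(X_train, y_train)

# the pipeline Part 2.1 asks for: scaler -> SelectKBest(f_classif) -> LogisticRegression
# (the step names 'std', 'kbest', 'logreg' are arbitrary labels)
pipe = Pipeline([('std', StandardScaler()), ('kbest', features), ('logreg', LogisticRegression())])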



## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [42]:
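# perform GridSearchCV
# first grid: valid for kernel_svm, whose SVC step is named 'svc'
param_grid = [{'svc__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0],
               'svc__gamma': [0.001, 0.0001], 'svc__kernel': ['rbf']},]
# second grid: adds the k and class_weight values that Part 2.2 asks for.
# NOTE: this grid raises the ValueError recorded below. Pipeline parameters must be
# written as '<step name>__<parameter>' (e.g. 'svc__class_weight'), and 'k' belongs to
# SelectKBest, which is not a step in kernel_svm.
param_grid_adjust = [{'k': [1, 2, 3, 4], 'class_weight': [None, 'balanced'],
                     'svc__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0]},]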

# make gs object
gs = GridSearchCV(estimator=kernel_svm, param_grid=param_grid_adjust, 
                  scoring='recall', 
                  n_jobs=-1, 
                  cv=5, 
                  verbose=1, 
                  refit=True,
                  pre_dispatch='2*n_jobs')

# run gs
gs.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


ValueError: Invalid parameter class_weight for estimator Pipeline(memory=None,
     steps=[('std', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]). Check the list of available parameters with `estimator.get_params().keys()`.
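
The error above arises because `class_weight` and `k` are not parameters of the pipeline itself: grid keys for a `Pipeline` need the form `<step name>__<parameter>`, and each parameter has to belong to a step that is actually in the pipeline. The following is a hedged sketch of the pipeline and grid that Part 2.1 and Part 2.2 describe, not the author's own solution. It assumes synthetic stand-in data from `make_classification` so that it runs without the download, and the step names `scale`, `kbest`, and `logreg` as well as `max_iter=1000` are choices made here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the donor data (assumption): 4 features, imbalanced classes
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=1,
    weights=[0.76, 0.24], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

# Scaler -> SelectKBest(f_classif) -> LogisticRegression
pipe_demo = Pipeline([
    ('scale', StandardScaler()),
    ('kbest', SelectKBest(f_classif)),
    ('logreg', LogisticRegression(max_iter=1000)),
])

# Each key is '<step name>__<parameter name>'
grid_demo = {
    'kbest__k': [1, 2, 3, 4],
    'logreg__class_weight': [None, 'balanced'],
    'logreg__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
}

gs_demo = GridSearchCV(pipe_demo, param_grid=grid_demo, scoring='recall', cv=5)
gs_demo.fit(X_tr, y_tr)  # fit on the training split only; the test split stays held out

print('best CV recall:', gs_demo.best_score_)
print('best params:', gs_demo.best_params_)
```

In the notebook, the same grid would be fit on `X_train, y_train`. The valid key names for any pipeline can be listed with `estimator.get_params().keys()`, as the error message suggests.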

## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [38]:
# display the best cross-validation score and the best parameters from the grid search
# best cv score
print('Best GS Score %.2f' % gs.best_score_)

# best parameters (TODO: refactor param_grid to search k, class_weight, and C)
print('best GS Params %s' % gs.best_params_)

Best GS Score 0.09
best GS Params {'svc__C': 10000.0, 'svc__gamma': 0.001, 'svc__kernel': 'rbf'}


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

Calculate accuracy

In [39]:
# accuracy == (TP + TN)/Total
accuracy = (36 + 85)/187
print('accuracy is:', accuracy)

accuracy is: 0.6470588235294118


Calculate precision

In [40]:
# precision == TP/(TP + FP)
precision = 36/(36 + 58)
print('precision is:', precision)

precision is: 0.3829787234042553


Calculate recall

In [41]:
# recall == sensitivity == TP/(TP + FN)
recall = 36/44
print('recall is:', recall)

recall is: 0.8181818181818182


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?
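
One possible way to approach this part is sketched below, under stated assumptions: the data is synthetic (from `make_classification`), the four donor column names are attached to it only for illustration, and `k=2` and `class_weight='balanced'` are placeholder values rather than results of any search. With a fitted grid search that used `refit=True`, `gs.best_estimator_` would take the place of `final_pipe`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ['months_since_last_donation', 'number_of_donations',
                 'total_volume_donated', 'months_since_first_donation']

# Synthetic stand-in data (assumption); the names above are attached only for illustration
X_syn, y_syn = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=1,
    weights=[0.76, 0.24], random_state=0)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0)

final_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('kbest', SelectKBest(f_classif, k=2)),  # k=2 is a placeholder value
    ('logreg', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
final_pipe.fit(X_syn_train, y_syn_train)

# get_support() returns a boolean mask over the input columns
mask = final_pipe.named_steps['kbest'].get_support()
selected_names = [name for name, keep in zip(feature_names, mask) if keep]
print('selected features:', selected_names)

# Final evaluation: score once on the held-out test split
test_recall = recall_score(y_syn_test, final_pipe.predict(X_syn_test))
print('test recall:', test_recall)
```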

### Part 4
Calculate F1 score and False Positive Rate. 
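
Both quantities follow from the Part 4 confusion matrix. A short worked sketch, using the standard definitions F1 = 2 * precision * recall / (precision + recall) and false positive rate = FP / (FP + TN); the variable names are chosen here for illustration:

```python
# Counts read from the Part 4 confusion matrix
tn, fp, fn, tp = 85, 58, 8, 36

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Share of actual negatives that were wrongly flagged as positive
false_positive_rate = fp / (fp + tn)

print('F1 score is:', f1)
print('false positive rate is:', false_positive_rate)
```

Equivalently, F1 = 2·TP / (2·TP + FP + FN) = 72/138, and the false positive rate is 58/143.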