 # Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll measure success using recall score as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [1]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

In [10]:
#check our values
print(df['made_donation_in_march_2007'].value_counts())

0    570
1    178
Name: made_donation_in_march_2007, dtype: int64


In [12]:
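#accuracy == correct predictions / total predictions
#the majority class baseline always predicts 0, so it is correct for all 570 zeros
print('Majority Class Baseline Accuracy:')
print(570/(570+178))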

Majority Class Baseline Accuracy:
0.7620320855614974


What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

In [13]:
#recall == true positives / (true positives + false negatives)
#the majority class baseline never predicts 1, so it has no true positives
print('Recall for 1:')
print(0)


Recall for 1:
0
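Both baseline numbers can also be checked with scikit-learn. The sketch below is illustrative only: it rebuilds the target from the class counts printed above (570 zeros, 178 ones) rather than loading the CSV, and uses a placeholder feature column, since `DummyClassifier` ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Rebuild the target from the class counts shown above: 570 zeros, 178 ones.
y = np.array([0] * 570 + [1] * 178)
# Placeholder features (an assumption for this sketch): DummyClassifier ignores them.
X = np.zeros((len(y), 1))

# 'most_frequent' always predicts the majority class, here 0.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)
print('Majority class baseline accuracy:', baseline_accuracy)
print('Majority class baseline recall:', baseline_recall)
```

Accuracy looks respectable because 76% of donors did not donate in March 2007, while recall is 0 because the baseline never finds a single donor. That gap is why recall is the metric for this challenge.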


## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [20]:
#split our data
from sklearn.model_selection import train_test_split

X = df.iloc[:,:-1]
y = df['made_donation_in_march_2007']
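#note: test_size=.75 holds out 75% of the rows for testing (187 train, 561 test);
#test_size=.25 would give the 75/25 train/test split described above
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.75, shuffle=True)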

In [28]:
#make sure the split went well
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

187 561
187 561
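The lengths above show 187 training rows and 561 test rows, because `test_size` is the fraction held out for testing. A minimal sketch of the 75/25 split the instructions describe, using placeholder arrays with the same number of rows (748) and class counts as the dataset. The `stratify` and `random_state` arguments are optional additions, not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count (748) and class counts as the dataset.
X = np.arange(748 * 4).reshape(748, 4)
y = np.array([0] * 570 + [1] * 178)

# test_size=0.25 holds out 25% of the rows for testing.
# stratify=y keeps the class ratio similar in both sets.
# random_state=42 is an arbitrary seed that makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, stratify=y, random_state=42)

print(len(X_train), len(X_test))
print(len(y_train), len(y_test))
```

With 748 rows this gives 561 training rows and 187 test rows.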


## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [41]:
#import all necessary components for our pipeline

from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn import pipeline
from sklearn.feature_selection import f_classif
from sklearn.model_selection import GridSearchCV

In [52]:
#build pipeline using Robust scaler, select k best, and logistic regression 

pipe = pipeline.make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='lbfgs'))

In [53]:
#Handy little function to see the keys for my gridsearch
pipe.get_params().keys()

dict_keys(['memory', 'steps', 'robustscaler', 'selectkbest', 'logisticregression', 'robustscaler__copy', 'robustscaler__quantile_range', 'robustscaler__with_centering', 'robustscaler__with_scaling', 'selectkbest__k', 'selectkbest__score_func', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__max_iter', 'logisticregression__multi_class', 'logisticregression__n_jobs', 'logisticregression__penalty', 'logisticregression__random_state', 'logisticregression__solver', 'logisticregression__tol', 'logisticregression__verbose', 'logisticregression__warm_start'])
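The keys above follow a pattern: `make_pipeline` names each step after its lowercased class name, and each step's parameters are addressed as `<step>__<param>`. A small sketch that rebuilds the same three-step pipeline and lists those names:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

pipe = make_pipeline(RobustScaler(), SelectKBest(f_classif), LogisticRegression())

# Step names are the lowercased class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Parameters of a step use the '<step>__<param>' form, which is what
# GridSearchCV expects as keys in param_grid.
step_params = sorted(key for key in pipe.get_params() if '__' in key)
for key in step_params:
    print(key)
```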

## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [55]:
param_grid = {
    'selectkbest__k': [1, 2, 3, 4], 
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0]
}


gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, 
                  scoring='recall', 
                  verbose=1)

gs.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:    3.5s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_classif at 0x1a1d24b730>)), ('logisticregression', LogisticRegression(C=1.0, cla...nalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__class_weight': [None, 'balanced'], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='recall', verbose=1)
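The "72 candidates, totalling 360 fits" in the log above comes from the size of the grid: 4 values of `k` × 2 values of `class_weight` × 9 values of `C`, each fit once per fold. A quick check with `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0]
}

# ParameterGrid enumerates every combination of the listed values.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, 'candidates,', n_fits, 'fits')
```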

## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [56]:
validation_score = gs.best_score_
print()
print('Current Best Cross-Validation Score:', validation_score)
print()
print(' Current Best estimator:', gs.best_estimator_)
print()


Current Best Cross-Validation Score: 0.7507427213309567

 Current Best estimator: Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=4, score_func=<function f_classif at 0x1a1d24b730>)), ('logisticregression', LogisticRegression(C=10.0, class_weight='balanced', dual=False,
   ...enalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False))])
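The best parameters can be read out of the estimator repr above (`k=4`, `class_weight='balanced'`, `C=10.0`), but `best_params_` returns them directly as a dictionary. The sketch below shows this on synthetic data from `make_classification` with a reduced grid, so its scores are not comparable to the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (an assumption for this sketch): 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

pipe = make_pipeline(RobustScaler(), SelectKBest(f_classif),
                     LogisticRegression(solver='lbfgs'))

# Reduced grid to keep the example fast.
param_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [0.01, 1.0, 100.0]
}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='recall')
gs.fit(X, y)

print('Best cross-validation recall:', gs.best_score_)
for name, value in gs.best_params_.items():
    print(name, '=', value)
```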



## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

Calculate accuracy

In [58]:
#accuracy == (true positive + true negative)/Total
tp = 36
tn = 85
fp = 58
fn = 8

accuracy = (tp+tn)/(tp+tn+fp+fn)
print(accuracy)

0.6470588235294118


Calculate precision

In [61]:
#precision == tp / (tp + fp)
precision = tp/(tp+fp)
print(precision)

0.3829787234042553


Calculate recall

In [60]:
#recall == tp / (tp + fn)
recall = tp/(tp+fn)
print(recall)

0.8181818181818182
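The three hand calculations above can be cross-checked against scikit-learn's metric functions. This sketch expands the four confusion-matrix counts into label arrays (one entry per observation) and scores them. The array construction is just one convenient way to do this.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts from the confusion matrix above.
tn, fp, fn, tp = 85, 58, 8, 36

# Actual labels: all negatives first (tn + fp), then all positives (fn + tp).
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
# Predictions in the same order: tn zeros, fp ones, fn zeros, tp ones.
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print('accuracy ', accuracy)
print('precision', precision)
print('recall   ', recall)
```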


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?

### Part 4
Calculate F1 score and False Positive Rate. 

In [68]:
#Imports go here
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import RFECV



In [69]:
#generate polynomial features

poly = PolynomialFeatures(degree=2)
X_train_polynomial = poly.fit_transform(X_train)

In [106]:
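#Use RFECV to narrow features
#note: step=20 is larger than the 15 polynomial features, so each fit
#eliminates down to the minimum in a single step instead of gradually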

scaler = RobustScaler()

X_train_scaled = scaler.fit_transform(X_train_polynomial)

rfe = RFECV(LogisticRegression(solver='lbfgs', class_weight='balanced'), scoring='recall', 
            step=20, cv=3, verbose=1)

X_train_subset = rfe.fit_transform(X_train_scaled, y_train)

Fitting estimator with 15 features.
Fitting estimator with 15 features.
Fitting estimator with 15 features.
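For comparison, here is a hedged sketch of the same idea with `step=1`, so that features are eliminated one at a time, and with `include_bias=False`, so that `PolynomialFeatures` does not add the constant column. It runs on synthetic data from `make_classification`, not the donation data, and `max_iter=1000` is an added setting to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, RobustScaler

# Synthetic stand-in data (an assumption for this sketch): 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# Degree-2 expansion without the bias column: 4 linear + 10 degree-2 terms = 14.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
X_scaled = RobustScaler().fit_transform(X_poly)

# step=1 removes one feature per iteration and cross-validates each subset size.
rfe = RFECV(LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000),
            scoring='recall', step=1, cv=3)
rfe.fit(X_scaled, y)

print(X_poly.shape[1], 'features generated,', rfe.n_features_, 'selected')
```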


In [118]:
#names of selected poly features
#borrowed directly from Ryan Herr (thanks!)

#note: in scikit-learn 1.0 and later this method is named get_feature_names_out
all_names = poly.get_feature_names(X_train.columns)
selected_mask = rfe.support_
selected_names = [name for name, selected in zip(all_names, selected_mask) if selected]

print(f'{rfe.n_features_} Features selected:')
for name in selected_names:
    print(name)

15 Features selected:
1
months_since_last_donation
number_of_donations
total_volume_donated
months_since_first_donation
months_since_last_donation^2
months_since_last_donation number_of_donations
months_since_last_donation total_volume_donated
months_since_last_donation months_since_first_donation
number_of_donations^2
number_of_donations total_volume_donated
number_of_donations months_since_first_donation
total_volume_donated^2
total_volume_donated months_since_first_donation
months_since_first_donation^2
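The naming scheme in the list above (`a^2` for squares, `a b` for interactions, `1` for the bias column) comes from `PolynomialFeatures` itself. A two-column sketch makes the pattern easier to see. The `getattr` fallback is there because the method is `get_feature_names_out` in scikit-learn 1.0 and later, while older versions use `get_feature_names` as in the cell above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

columns = ['months_since_last_donation', 'number_of_donations']

# The values do not matter for naming, only the number of input columns.
poly = PolynomialFeatures(degree=2).fit(np.zeros((3, len(columns))))

# Use the newer method name when available, otherwise the older one.
get_names = getattr(poly, 'get_feature_names_out', None) or poly.get_feature_names
names = list(get_names(columns))

for name in names:
    print(name)
```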


In [115]:
#Define an estimator and param_grid

#feature 0 is the constant bias column ('1') from PolynomialFeatures;
#it triggered a constant-feature warning, so drop it
new_X = X_train_subset[:,1:]

#use selectkbest to tighten up the features generated by the RFECV
pipe = pipeline.make_pipeline(
        SelectKBest(f_classif),
        LogisticRegression(solver='lbfgs', class_weight = 'balanced'))


param_grid = {
    'selectkbest__k': [5,6,7,8,9,10,11,12,13,14],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0,10.0]
}

#cv = 3 because I want to save electricity 
gs = GridSearchCV(pipe, param_grid=param_grid, cv=3, 
                  scoring='recall', 
                  verbose=1)

gs.fit(new_X, y_train)
validation_score = gs.best_score_
print()
print('Cross-Validation Score:', validation_score)
print()
print('Best estimator:', gs.best_estimator_)
print()

Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.



Cross-Validation Score: 0.6830404889228419

Best estimator: Pipeline(memory=None,
     steps=[('selectkbest', SelectKBest(k=6, score_func=<function f_classif at 0x1a1d24b730>)), ('logisticregression', LogisticRegression(C=10.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False))])



[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed:    1.3s finished


In [124]:
# Apply the transformations fitted on the training set to the test set
# (transform only, no refitting), then score on the test set

X_test_polynomial = poly.transform(X_test)
X_test_scaled = scaler.transform(X_test_polynomial)
X_test_subset = rfe.transform(X_test_scaled)

new_X_test = X_test_subset[:,1:]

test_score = gs.score(new_X_test, y_test)
y_pred = gs.predict(new_X_test)
print('Test Score:', test_score)

Test Score: 0.6119402985074627
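An alternative to repeating each transformation by hand is to put every step inside one pipeline, so that `GridSearchCV` refits the preprocessing within each fold and the test set needs only a single `score` call. The sketch below illustrates this on synthetic data from `make_classification`. It uses `SelectKBest` alone for feature selection, which is a simplification of the RFECV-then-SelectKBest approach above, and its numbers are not comparable to the test score reported here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler

# Synthetic stand-in data (an assumption for this sketch): 4 features, binary target.
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# All preprocessing lives in the pipeline, so it is fit on training folds only.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # 14 features, no constant column
    RobustScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000))

param_grid = {
    'selectkbest__k': [5, 8, 11, 14],
    'logisticregression__C': [0.01, 1.0, 10.0]
}

gs = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='recall')
gs.fit(X_train, y_train)

# score() applies the fitted pipeline to raw test features and uses the recall scorer.
test_recall = gs.score(X_test, y_test)
print('Cross-validation recall:', gs.best_score_)
print('Test recall:', test_recall)
```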


In [121]:
#test recall (0.61) is lower than the cross-validation recall (0.68),
#but well above the majority class baseline recall of 0

In [130]:
#generate confusion matrix and f1 score

from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score


print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))
print('')
print('f1 score ',f1_score(y_test, y_pred))
print('')

#false positive rate == FP / (FP + TN)
#ravel() unpacks the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print('False positive rate ', fp/(fp+tn))

Confusion Matrix
[[313 114]
 [ 52  82]]

f1 score  0.49696969696969695

False positive rate  0.26697892271662765
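As a consistency check, the test recall, F1 score, and false positive rate reported above can all be recomputed from the four cells of the printed confusion matrix. The sketch below hard-codes that matrix, so it runs without the fitted model.

```python
import numpy as np

# Confusion matrix printed above: rows are actual (0, 1), columns are predicted (0, 1).
cm = np.array([[313, 114],
               [52, 82]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)                 # share of actual donors that were found
precision = tp / (tp + fp)              # share of predicted donors that were correct
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)    # share of actual non-donors flagged as donors

print('recall             ', recall)
print('precision          ', precision)
print('f1 score           ', f1)
print('false positive rate', false_positive_rate)
```

The recall from the matrix (82/134) matches the test score, which is expected because the grid search used `scoring='recall'`.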
