 # Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

**The goal is to predict the last column, whether the donor made a donation in March 2007**, using information about each donor's history. We'll measure success **using recall score** as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [28]:
#!pip install seaborn --upgrade
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

warnings.filterwarnings('ignore')

In [9]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

In [10]:
df.head()

| | months_since_last_donation | number_of_donations | total_volume_donated | months_since_first_donation | made_donation_in_march_2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


In [11]:
df.isnull().sum().sum()

0

In [12]:
X = df.drop(columns='made_donation_in_march_2007')
y = df.made_donation_in_march_2007

## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)
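
As a rough sketch of the pandas route mentioned above: the majority-class baseline accuracy is just the relative frequency of the most common label. To stay self-contained, the snippet rebuilds a stand-in target with the class counts this dataset has (570 zeros, 178 ones). The names `y_demo`, `class_shares` and `baseline_accuracy_demo` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Stand-in target with this dataset's class counts (570 zeros, 178 ones).
y_demo = pd.Series([0] * 570 + [1] * 178, name="made_donation_in_march_2007")

# Share of each class; the largest share is the majority-class accuracy.
class_shares = y_demo.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy_demo = class_shares.max()

print("Majority class:", majority_class)
print("Baseline accuracy:", baseline_accuracy_demo)
```

On the real data, the same idea would be `y.value_counts(normalize=True).max()`.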

In [13]:
# Finding baseline accuracy using scikit-learn
# y.mode() returns a Series, so take its first element as the scalar fill value
y_pred = np.full(shape=y.shape, fill_value=y.mode()[0])
baseline_accuracy = accuracy_score(y, y_pred)
print('Baseline Accuracy', baseline_accuracy)

Baseline Accuracy 0.7620320855614974


In [15]:
# Finding Baseline Accuracy using Dummy Classifier

pipe = make_pipeline(
    DummyClassifier(strategy='most_frequent',random_state=42))

pipe.fit(X, y)

# Get the scores with the appropriate score function
# Predict with X features and Compare predictions to y labels
y_pred = pipe.predict(X)
dummy_score = accuracy_score(y, y_pred)
print(y.sum())
print(y_pred.sum())
print('Dummy Classification Score (Accuracy):', dummy_score)

178
0
Dummy Classification Score (Accuracy): 0.7620320855614974


What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)
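
The "no code, just your understanding" answer can be sketched with plain arithmetic. Recall is true positives divided by all actual positives (true positives plus false negatives). A model that always predicts the majority class (0) never predicts a positive, so every one of the 178 actual donors becomes a false negative. The variable names below are illustrative only.

```python
# Counts for a model that always predicts the majority class (0).
n_actual_positive = 178                          # donors who did donate in March 2007
tp_baseline = 0                                  # the baseline never predicts class 1
fn_baseline = n_actual_positive - tp_baseline    # every actual donor is missed

# Recall = TP / (TP + FN): the share of actual positives that were found.
baseline_recall_by_hand = tp_baseline / (tp_baseline + fn_baseline)
print(baseline_recall_by_hand)
```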

In [40]:
# A majority-class baseline never predicts the positive class, so it has no
# true positives and no false positives; every actual positive is a false negative.
print(f"Let's check the value counts so we know what our majority class is:\n{y.value_counts()}\n")
print("Because the mode is zero, the baseline predicts no positives: recall = true positives / (true positives + false negatives) = 0 / (0 + 178) = 0")

# Finding baseline recall using scikit-learn
# y.mode() returns a Series, so take its first element as the scalar fill value
y_pred = np.full(shape=y.shape, fill_value=y.mode()[0])
baseline_recall = recall_score(y, y_pred)
print('Baseline Recall', baseline_recall)

Let's check the value counts so we know what our majority class is:
0    570
1    178
Name: made_donation_in_march_2007, dtype: int64

Because the mode is zero, the baseline predicts no positives: recall = true positives / (true positives + false negatives) = 0 / (0 + 178) = 0
Baseline Recall 0.0


In [16]:
# Finding Baseline Recall Using Dummy Classifier


pipe = make_pipeline(
    DummyClassifier(strategy='most_frequent',random_state=42))

pipe.fit(X, y)

# Get the scores with the appropriate score function
# Predict with X features and Compare predictions to y labels
y_pred = pipe.predict(X)
dummy_score = recall_score(y, y_pred)
print('Dummy Classification Score (recall):', dummy_score)

Dummy Classification Score (recall): 0.0


## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)
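
One optional variation, sketched here on synthetic stand-in data rather than the real features: because the classes are imbalanced, passing `stratify` to `train_test_split` keeps the share of positives roughly equal in the train and test parts. Stratification is an assumption added for illustration and is not required by the instructions. `X_demo`, `y_demo` and the `_tr`/`_te` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the dataset's shape and class counts (748 rows, 178 positives).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(748, 4))
y_demo = np.array([0] * 570 + [1] * 178)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,      # hold out 25% for the final test
    shuffle=True,        # random shuffle (the default)
    stratify=y_demo,     # assumption: keep the class ratio similar in both parts
    random_state=42,     # fixed seed so the split is reproducible
)

print(X_tr.shape, X_te.shape)
print("Positive share (train, test):", y_tr.mean(), y_te.mean())
```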


In [17]:
from sklearn.model_selection import train_test_split
def split(X_values, y_values):
    # Randomly shuffle the rows and hold out 25% of them as the test set
    X_train, X_test, y_train, y_test = train_test_split(X_values, y_values, test_size=0.25, random_state=42)
    return X_train, X_test, y_train, y_test
  
X_train, X_test, y_train, y_test = split(X,y)
print ("Took this data...")
print (f'X Shape: {X.shape}\nY Shape: {y.shape}\n\n')
print ("And split it into this data... ")
print (f'X_train Shape: {X_train.shape},\nX_test Shape: {X_test.shape},\ny_train Shape: {y_train.shape},\ny_test Shape: {y_test.shape}')

Took this data...
X Shape: (748, 4)
Y Shape: (748,)


And split it into this data... 
X_train Shape: (561, 4),
X_test Shape: (187, 4),
y_train Shape: (561,),
y_test Shape: (187,)


## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
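
A small sketch of how `make_pipeline` names its steps, which matters later when writing grid search keys. Each step is named after its lowercased class name, and a parameter is addressed as `<step name>__<parameter name>`. `demo_pipe` is a hypothetical name used only for this illustration.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

demo_pipe = make_pipeline(RobustScaler(), SelectKBest(f_classif), LogisticRegression())

# make_pipeline names each step after its lowercased class name.
step_names = list(demo_pipe.named_steps)
print(step_names)

# Tunable parameters follow the pattern '<step name>__<parameter name>'.
tunable = [name for name in demo_pipe.get_params() if "__" in name]
print("selectkbest__k" in tunable, "logisticregression__C" in tunable)
```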

In [18]:

pipe = make_pipeline(
    RobustScaler(), 
    SelectKBest(f_classif), 
    LogisticRegression())

print("Pipeline Created")

Pipeline Created


## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.
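
Before running the search, it can help to estimate how much work it is. The sketch below counts the parameter combinations with scikit-learn's `ParameterGrid` and multiplies by the number of folds. `demo_grid` is a hypothetical name, and the keys assume the step names that `make_pipeline` generates.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    "selectkbest__k": [1, 2, 3, 4],
    "logisticregression__class_weight": [None, "balanced"],
    "logisticregression__C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
}

# Every combination of the listed values is one candidate: 4 * 2 * 9.
n_candidates = len(ParameterGrid(demo_grid))
n_folds = 5
n_fits = n_candidates * n_folds  # each candidate is fit once per fold

print(n_candidates, n_fits)
```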

In [19]:

param_grid = {
    'selectkbest__k': [1,2,3,4], 
    'logisticregression__class_weight' : [None, 'balanced'],
    'logisticregression__C' : [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.00, 1000.0, 10000.0]
}

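# Fit on the train set, with grid search cross-validation.
# Note: for a binary target, 'recall_weighted' averages the recall of both classes
# weighted by support, which equals accuracy; scoring='recall' would score
# recall on the positive class instead.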
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, 
                  scoring='recall_weighted', 
                  verbose=1)
print("Grid Search Cross Validation now running...")
gs.fit(X_train, y_train)
print("Grid Search CV complete...")

Grid Search Cross Validation now running...
Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Grid Search CV complete...


[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:    2.4s finished


## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)
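
As an illustration of the "several ways" to read the results, here is a minimal sketch on synthetic data (not the blood donation dataset) that gets the best score both from the convenience attributes and from the `cv_results_` table. The estimator, grid and names such as `demo_search` are assumptions chosen to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, weights=[0.75, 0.25], random_state=0
)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 1.0, 100.0], "class_weight": [None, "balanced"]},
    cv=5,
    scoring="recall",
)
demo_search.fit(X_demo, y_demo)

# Way 1: the convenience attributes.
print(demo_search.best_score_, demo_search.best_params_)

# Way 2: the full results table, best-ranked candidates first.
results = pd.DataFrame(demo_search.cv_results_).sort_values("rank_test_score")
top_row = results.iloc[0]
print(top_row["mean_test_score"], top_row["params"])
```

If several candidates tie for first place, the top row of the sorted table may show different parameters than `best_params_`, but its mean score matches `best_score_`.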

In [20]:
# Scores for the Question's Answer
validation_score = gs.best_score_
print('***** Grid Search Scores *****')
print('\nBest Cross-Validation Score:', validation_score)
print('Best parameters:', gs.best_params_ ,'\n')

# Scores for me. 
print("\n***** A few more stats just for me *****")

# Predict with the train and test features and compare to the actual labels.
y_pred_train = gs.predict(X_train)
train_score_A = accuracy_score(y_train, y_pred_train)
print('Train Score (Accuracy):', train_score_A)
y_pred_test = gs.predict(X_test)
test_score_A = accuracy_score(y_test, y_pred_test)
print('Test Score (Accuracy):', test_score_A)

train_score_B = gs.score(X_train, y_train)
print('\nTrain Score ("Recall"): ', train_score_B)
test_score_B = gs.score(X_test, y_test)
print('Test Score ("Recall"): ', test_score_B)
print('\nBest estimator:\n', gs.best_estimator_)
cvresults = pd.DataFrame(gs.cv_results_)
print('\nGenerated Results with Shape:', cvresults.shape)

***** Grid Search Scores *****

Best Cross-Validation Score: 0.7807486631016043
Best parameters: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'selectkbest__k': 4} 


***** A few more stats just for me *****
Train Score (Accuracy): 0.7789661319073083
Test Score (Accuracy): 0.7540106951871658

Train Score ("Recall"):  0.7789661319073083
Test Score ("Recall"):  0.7540106951871658

Best estimator:
 Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=4, score_func=<function f_classif at 0x7fe5d18f2158>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

Generated Results with Shape: (72, 23)
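
The accuracy and "Recall" figures above are identical, and the toy sketch below suggests why: support-weighted recall over both classes works out to the same number as accuracy, while the default binary recall only looks at the positive class. The labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 6 actual negatives, 4 actual positives.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)
weighted_recall = recall_score(y_true_demo, y_pred_demo, average="weighted")
positive_recall = recall_score(y_true_demo, y_pred_demo)  # default: positive class only

print("Accuracy:        ", acc)
print("Weighted recall: ", weighted_recall)
print("Positive recall: ", positive_recall)
```

In `GridSearchCV`, these correspond to `scoring='recall_weighted'` and `scoring='recall'` respectively.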


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

In [21]:
true_negative = 85
true_positive = 36
false_negative = 8
false_positive = 58

Calculate accuracy

In [22]:
accuracy = (true_negative + true_positive) / (true_negative + true_positive + false_negative + false_positive)
print(accuracy)

0.6470588235294118


Calculate precision

In [23]:
precision = true_positive / (true_positive + false_positive)
print(precision)

0.3829787234042553


Calculate recall

In [24]:
recall = true_positive / (true_positive + false_negative)
print(recall)

0.8181818181818182
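
As an optional cross-check of the hand calculations, the sketch below rebuilds label arrays from the four cell counts and lets scikit-learn compute the same metrics. The array names are hypothetical. Note that `confusion_matrix(...).ravel()` returns the cells in the order tn, fp, fn, tp for binary labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

tn, fp, fn, tp = 85, 58, 8, 36

# Rebuild label arrays that produce exactly this confusion matrix.
y_true_cm = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_cm = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# Rows are actual classes, columns are predicted classes.
cells = confusion_matrix(y_true_cm, y_pred_cm).ravel().tolist()
print("tn, fp, fn, tp:", cells)

acc_check = accuracy_score(y_true_cm, y_pred_cm)
precision_check = precision_score(y_true_cm, y_pred_cm)
recall_check = recall_score(y_true_cm, y_pred_cm)
print(acc_check, precision_check, recall_check)
```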


## BONUS — How you can earn a score of 3



### Part 1
Do feature engineering, to try improving your cross-validation score.
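
A hedged sketch of one direction to try: ratio features. A sum or difference of existing columns is a linear combination that a linear model can, in principle, already represent, whereas a ratio adds information the model cannot build on its own. The example uses only the five rows shown by `df.head()` earlier; the feature names `donations_per_month` and `recency_share` are hypothetical, and whether they help on the full data is untested here.

```python
import pandas as pd

# The five rows shown by df.head() earlier in this notebook.
demo = pd.DataFrame({
    "months_since_last_donation": [2, 0, 1, 2, 1],
    "number_of_donations": [50, 13, 16, 20, 24],
    "total_volume_donated": [12500, 3250, 4000, 5000, 6000],
    "months_since_first_donation": [98, 28, 35, 45, 77],
})

# Hypothetical ratio feature: average donations per month since the first donation.
# (These rows have no zero denominators; real data may need a guard.)
demo["donations_per_month"] = (
    demo["number_of_donations"] / demo["months_since_first_donation"]
)

# Hypothetical ratio feature: how much of the donor's history has passed since the last donation.
demo["recency_share"] = (
    demo["months_since_last_donation"] / demo["months_since_first_donation"]
)

# In these five rows, volume is exactly 250 c.c. per donation, so it carries
# no information beyond number_of_donations.
volume_is_redundant = (
    demo["total_volume_donated"] == 250 * demo["number_of_donations"]
).all()

print(demo[["donations_per_month", "recency_share"]])
print("Volume redundant in these rows:", volume_is_redundant)
```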



In [25]:
df["new_variable"] = (df["months_since_first_donation"] - df["months_since_last_donation"])

In [26]:
X = df.drop(columns='made_donation_in_march_2007')
y = df.made_donation_in_march_2007

# Split Data
X_train, X_test, y_train, y_test = split(X,y)
print ("Took this data...")
print (f'X Shape: {X.shape}\nY Shape: {y.shape}\n\n')
print ("And split it into this data... ")
print (f'X_train Shape: {X_train.shape},\nX_test Shape: {X_test.shape},\ny_train Shape: {y_train.shape},\ny_test Shape: {y_test.shape}')

# Make Pipeline
pipe = make_pipeline(
    RobustScaler(), 
    SelectKBest(f_classif), 
    LogisticRegression())

print("\nPipeline Created")

# GridSearch CV
param_grid = {
    'selectkbest__k': [1,2,3,4], 
    'logisticregression__class_weight' : [None, 'balanced'],
    'logisticregression__C' : [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.00, 1000.0, 10000.0]
}

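# Fit on the train set, with grid search cross-validation.
# Note: 'recall_weighted' equals accuracy for a binary target;
# scoring='recall' would score recall on the positive class instead.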
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, 
                  scoring='recall_weighted', 
                  verbose=1)
print("\nGrid Search Cross Validation now running...")
gs.fit(X_train, y_train)
print("Grid Search CV complete...")

# Check out the scores
new_validation_score = gs.best_score_
print('\n***** Grid Search Scores *****')
print('New Cross-Validation Score:', new_validation_score)
print('Previous Cross-Validation Score', validation_score)
print('Best parameters:', gs.best_params_ ,'\n')

# Scores for me. 
print("\n***** A few more stats just for me *****")

# Predict with the train and test features and compare to the actual labels.
y_pred_train = gs.predict(X_train)
train_score_C = accuracy_score(y_train, y_pred_train)
print('Train Score (Accuracy):', train_score_C)
y_pred_test = gs.predict(X_test)
test_score_C = accuracy_score(y_test, y_pred_test)
print('Test Score (Accuracy):', test_score_C)

train_score_D = gs.score(X_train, y_train)
print('\nTrain Score ("Recall"): ', train_score_D)
test_score_D = gs.score(X_test, y_test)
print('Test Score ("Recall"): ', test_score_D)

Took this data...
X Shape: (748, 5)
Y Shape: (748,)


And split it into this data... 
X_train Shape: (561, 5),
X_test Shape: (187, 5),
y_train Shape: (561,),
y_test Shape: (187,)

Pipeline Created

Grid Search Cross Validation now running...
Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Grid Search CV complete...

***** Grid Search Scores *****
New Cross-Validation Score: 0.7807486631016043
Previous Cross-Validation Score 0.7807486631016043
Best parameters: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'selectkbest__k': 4} 


***** A few more stats just for me *****
Train Score (Accuracy): 0.7807486631016043
Test Score (Accuracy): 0.7540106951871658

Train Score ("Recall"):  0.7807486631016043
Test Score ("Recall"):  0.7540106951871658


[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:    2.4s finished


### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.
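
One way to read this prompt, sketched on synthetic data: insert a transformer such as `PolynomialFeatures` into the pipeline and expose its `degree` in the grid alongside the existing parameters. The choice of transformer, the small grid and names such as `demo_search` are assumptions for illustration, not the notebook's approach.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, imbalanced stand-in data with four features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, weights=[0.75, 0.25], random_state=0
)

demo_pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(include_bias=False),  # added transformation: interaction and power terms
    SelectKBest(f_classif),
    LogisticRegression(max_iter=1000),
)

demo_grid = {
    "polynomialfeatures__degree": [1, 2],    # new grid parameter for the new step
    "selectkbest__k": [2, 4],
    "logisticregression__class_weight": [None, "balanced"],
}

demo_search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=5, scoring="recall")
demo_search.fit(X_demo, y_demo)

print("Best CV recall:", demo_search.best_score_)
print("Best parameters:", demo_search.best_params_)
```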


In [27]:
# I'm going to try a slightly different one this time. I want to try random forest classifier. 

# Make Pipeline
pipe = make_pipeline(
    RobustScaler(), 
    SelectKBest(f_classif), 
    RandomForestClassifier())

print("\nPipeline Created")

# GridSearch CV
param_grid = {
    'selectkbest__k': [1,2,3,4], 
    "randomforestclassifier__max_depth": [80, 90, 100, 110],
#    "randomforestclassifier__max_features": [2, 3],
    "randomforestclassifier__min_samples_split": [8, 10, 12],
    "randomforestclassifier__min_samples_leaf": [3, 4, 5],
    "randomforestclassifier__bootstrap": [False],
    "randomforestclassifier__n_estimators" :[100, 200, 300, 1000],
    "randomforestclassifier__criterion": ["gini"]}


# Fit on the train set, with grid search cross-validation
gs = GridSearchCV(pipe, param_grid=param_grid, cv=10, 
                  scoring='recall_weighted', 
                  verbose=1, n_jobs=10)

print("\nGrid Search Cross Validation now running...")
gs.fit(X_train, y_train)
print("Grid Search CV complete...")

# Check out the scores
new_validation_score = gs.best_score_
print('\n***** Grid Search Scores *****')
print('New Cross-Validation Score:', new_validation_score)
print('Previous Cross-Validation Score', validation_score)
print('Best parameters:', gs.best_params_ ,'\n')
print('Best score:', gs.best_score_)
print('Best estimator:', gs.best_estimator_)




Pipeline Created

Grid Search Cross Validation now running...
Fitting 10 folds for each of 576 candidates, totalling 5760 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:    1.4s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:    9.8s
[Parallel(n_jobs=10)]: Done 430 tasks      | elapsed:   21.1s
[Parallel(n_jobs=10)]: Done 780 tasks      | elapsed:   41.3s
[Parallel(n_jobs=10)]: Done 1230 tasks      | elapsed:  1.1min
[Parallel(n_jobs=10)]: Done 1780 tasks      | elapsed:  1.6min
[Parallel(n_jobs=10)]: Done 2430 tasks      | elapsed:  2.2min
[Parallel(n_jobs=10)]: Done 3180 tasks      | elapsed:  2.8min
[Parallel(n_jobs=10)]: Done 4030 tasks      | elapsed:  3.6min
[Parallel(n_jobs=10)]: Done 4980 tasks      | elapsed:  4.5min


Grid Search CV complete...

***** Grid Search Scores *****
New Cross-Validation Score: 0.8003565062388592
Previous Cross-Validation Score 0.7807486631016043
Best parameters: {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__max_depth': 80, 'randomforestclassifier__min_samples_leaf': 4, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 200, 'selectkbest__k': 4} 

Best score: 0.8003565062388592
Best estimator: Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=4, score_func=<function f_classif at 0x7fe5d18f2158>)), ('randomforestclassifier', RandomForestClassifier(bootstrap=False, class_weight=None, cr...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])


[Parallel(n_jobs=10)]: Done 5760 out of 5760 | elapsed:  5.2min finished


#### Let's see how its test stats turned out

In [37]:
y_pred_train = gs.predict(X_train)
train_score_C = accuracy_score(y_train, y_pred_train)
print('Random Forest Classifier - Train Score (Accuracy):', train_score_C)
y_pred_test = gs.predict(X_test)
test_score_C = accuracy_score(y_test, y_pred_test)
print('Random Forest Classifier - Test Score (Accuracy):', test_score_C)

train_score_D = gs.score(X_train, y_train)
print('\nRandom Forest Classifier - Train Score ("Recall"): ', train_score_D)
test_score_D = gs.score(X_test, y_test)
print('Random Forest Classifier - Test Score ("Recall"): ', test_score_D)

Random Forest Classifier - Train Score (Accuracy): 0.8645276292335116
Random Forest Classifier - Test Score (Accuracy): 0.7379679144385026

Random Forest Classifier - Train Score ("Recall"):  0.8645276292335116
Random Forest Classifier - Test Score ("Recall"):  0.7379679144385026


### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?


In [38]:
# Which features were selected?
# 'selectkbest' is the autogenerated name of the SelectKBest step in the pipeline
selector = gs.best_estimator_.named_steps['selectkbest']
all_names = X_train.columns

# get_support returns a boolean mask over the input columns
selected_mask = selector.get_support()
# Indexing the column Index with the mask keeps only the selected names
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)

print()
print('Features not selected:')
for name in unselected_names:
    print(name)

Features selected:
months_since_last_donation
number_of_donations
total_volume_donated
new_variable

Features not selected:
months_since_first_donation



### Part 4
Calculate F1 score and False Positive Rate. 
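
For reference, here is a small worked sketch of both formulas, using the example confusion matrix from Part 4 (85 true negatives, 58 false positives, 8 false negatives, 36 true positives). The false positive rate divides by the actual negatives (FP + TN). Dividing by FP + TP instead gives the false discovery rate, which is a different quantity (1 minus precision).

```python
tn, fp, fn, tp = 85, 58, 8, 36

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# False positive rate: share of actual negatives that were flagged as positive.
false_positive_rate = fp / (fp + tn)

# False discovery rate: share of positive predictions that were wrong.
false_discovery_rate = fp / (fp + tp)

print("F1:", f1)
print("False positive rate:", false_positive_rate)
print("False discovery rate:", false_discovery_rate)
```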

In [39]:
from sklearn.metrics import confusion_matrix
y_pred_train = gs.predict(X_train)
train_score_F1 = f1_score(y_train, y_pred_train)
print('Random Forest Classifier - F1 Train Score:', train_score_F1)
y_pred_test = gs.predict(X_test)
test_score_F1 = f1_score(y_test, y_pred_test)
print('Random Forest Classifier - F1 Test Score:', test_score_F1)

# False positive rate = FP / (FP + TN): the share of actual negatives flagged as positive
tn, fp, fn, tp = confusion_matrix(y_train, y_pred_train).ravel()
print(f'Random Forest Classifier - Train False Positive Rate: {fp / (fp + tn):.4f}')

tn, fp, fn, tp = confusion_matrix(y_test, y_pred_test).ravel()
print(f'Random Forest Classifier - Test False Positive Rate: {fp / (fp + tn):.4f}')

Random Forest Classifier - F1 Train Score: 0.6576576576576576
Random Forest Classifier - F1 Test Score: 0.3466666666666667
Random Forest Classifier - Train False Positive Rate: 0.0441
Random Forest Classifier - Test False Positive Rate: 0.1007
