 # Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll measure success using recall score as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [1]:
import pandas as pd
import sklearn.metrics
import numpy as np
import matplotlib.pyplot as plt
import sklearn.model_selection
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
import warnings


df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

In [2]:
df.head()

| | months_since_last_donation | number_of_donations | total_volume_donated | months_since_first_donation | made_donation_in_march_2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


In [3]:
df['donation_lifetime'] = df['months_since_first_donation'] - df['months_since_last_donation']

In [4]:
# df['lifetime_monthly_vol'] = df['total_volume_donated'] / df['donation_lifetime']

In [5]:
df.shape

(748, 6)

In [6]:
df['made_donation_in_march_2007'].value_counts(normalize=True)

0    0.762032
1    0.237968
Name: made_donation_in_march_2007, dtype: float64

In [7]:
df.isnull().sum()

months_since_last_donation     0
number_of_donations            0
total_volume_donated           0
months_since_first_donation    0
made_donation_in_march_2007    0
donation_lifetime              0
dtype: int64

## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

In [8]:
df['made_donation_in_march_2007'].value_counts(normalize=True)

0    0.762032
1    0.237968
Name: made_donation_in_march_2007, dtype: float64

In [9]:
# A majority class baseline always predicts the majority class (donors who did not
# donate blood in March 2007), so it would be right 76.20% of the time.
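
The same baseline can also be checked with scikit-learn's `DummyClassifier`. The sketch below is illustrative only: `y_demo` and `X_demo` are hypothetical stand-ins built from the class counts reported above (748 rows, of which 178 are positive), not the real dataframe.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Stand-in target with the class counts reported above:
# 748 rows, 178 of them positive (178 / 748 = 0.237968).
y_demo = np.array([1] * 178 + [0] * 570)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))
print(round(baseline_accuracy, 6))
```

The accuracy equals the share of the majority class, which is why `value_counts(normalize=True)` answers the question without fitting anything.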

What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

In [10]:
prelim_y_val = df['made_donation_in_march_2007']
majority_class = 0
y_pred_majority = [majority_class] * len(prelim_y_val)

print ('Recall Score:', sklearn.metrics.recall_score(prelim_y_val, y_pred_majority))
print ('Accuracy Score:', sklearn.metrics.accuracy_score(prelim_y_val, y_pred_majority))
print ('roc_auc Score:', sklearn.metrics.roc_auc_score(prelim_y_val, y_pred_majority))
print ('Precision Score:', sklearn.metrics.precision_score(prelim_y_val, y_pred_majority))

# Predicting the majority class (0) for every row misses every positive case,
# so there are no true positives and the recall score is 0.

Recall Score: 0.0
Accuracy Score: 0.7620320855614974
roc_auc Score: 0.5
Precision Score: 0.0
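
Recall is defined as TP / (TP + FN), so the zero can also be read straight off the confusion matrix. This is a minimal sketch on hypothetical stand-in arrays (`y_true_demo`, `y_pred_demo`) that reuse the class counts above rather than the real target column.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Stand-in labels: 178 positives and 570 negatives, as in the dataset summary.
y_true_demo = np.array([1] * 178 + [0] * 570)
y_pred_demo = np.zeros_like(y_true_demo)  # always predict the majority class (0)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall_by_hand = tp / (tp + fn)  # no true positives, 178 false negatives
print(tn, fp, fn, tp, recall_by_hand)
```

Precision, by contrast, is TP / (TP + FP), which has a zero denominator here; scikit-learn reports it as 0 and emits a warning.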




## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [11]:
y = df['made_donation_in_march_2007']
X = df.drop(columns='made_donation_in_march_2007')

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.25)

In [12]:
print ('X_train shape:', X_train.shape)
print ('X_test shape:', X_test.shape)
print ('y_train shape:', y_train.shape)
print ('y_test shape:', y_test.shape)

X_train shape: (561, 5)
X_test shape: (187, 5)
y_train shape: (561,)
y_test shape: (187,)
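
Because the split above is unseeded, the rows in each set change on every run. One possible variation, not what the notebook ran, is to fix `random_state` and stratify on the target so both sets keep roughly the same share of positives. The arrays below are hypothetical stand-ins with the dataset's class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 748 rows, 178 positives (same class balance as the dataset).
y_demo = np.array([1] * 178 + [0] * 570)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# shuffle=True is the default; stratify keeps the class ratio in both sets,
# and random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True,
    stratify=y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With an imbalanced target and a recall metric, stratifying helps ensure the small test set contains enough positive cases to score.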


## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
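
`make_pipeline` names each step after its lowercased class name, and those names become the prefixes of the grid-search parameter keys. A small sketch for checking them (the variable names `pipe_demo`, `step_names`, and `valid_keys` are only for illustration):

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

pipe_demo = make_pipeline(RobustScaler(), SelectKBest(f_classif), LogisticRegression())

# Step names are generated automatically from the class names.
step_names = list(pipe_demo.named_steps)
print(step_names)

# Any key returned by get_params() can be used in a parameter grid,
# written as <step name>__<parameter name>.
valid_keys = pipe_demo.get_params().keys()
print('selectkbest__k' in valid_keys, 'logisticregression__C' in valid_keys)
```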

In [13]:
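from sklearn.feature_selection import f_classif

# Scale, keep the k best features by ANOVA F-test, then classify.
# f_classif is the scorer the instructions ask for; with a binary target it
# ranks features the same way f_regression does.
log_reg_pipe = make_pipeline(RobustScaler(), SelectKBest(f_classif),
                             LogisticRegression(max_iter=200, random_state=42))

# k runs up to 5 because donation_lifetime was added as a fifth feature
param_grid = {'selectkbest__k': range(1, len(X_train.columns)+1),
              'logisticregression__class_weight': [None, 'balanced'],
              'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0]}

# Fit on the train set, with grid search cross-validation.
# Two metrics are scored and refit=False, so results are read from cv_results_.
gs = GridSearchCV(log_reg_pipe, param_grid=param_grid, cv=5,
                  scoring={'recall': 'recall', 'roc_auc': 'roc_auc'}, refit=False,
                  return_train_score=True, verbose=1)

gs.fit(X_train, y_train);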

Fitting 5 folds for each of 90 candidates, totalling 450 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

In [14]:
cross_val_results = pd.DataFrame(gs.cv_results_)




In [15]:
cross_val_results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_selectkbest__k,params,split0_test_recall,split1_test_recall,...,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,split0_train_roc_auc,split1_train_roc_auc,split2_train_roc_auc,split3_train_roc_auc,split4_train_roc_auc,mean_train_roc_auc,std_train_roc_auc
0,0.008558,0.004130,0.004076,0.001428,0.0001,,1,"{'logisticregression__C': 0.0001, 'logisticreg...",0.000000,0.000000,...,0.700944,0.038834,71,0.703502,0.696387,0.690407,0.719061,0.698934,0.701658,0.009675
1,0.004931,0.000227,0.003279,0.000253,0.0001,,2,"{'logisticregression__C': 0.0001, 'logisticreg...",0.000000,0.038462,...,0.724364,0.049276,64,0.670883,0.743383,0.730648,0.753654,0.753405,0.730395,0.030922
2,0.007699,0.001475,0.004593,0.001357,0.0001,,3,"{'logisticregression__C': 0.0001, 'logisticreg...",0.000000,0.076923,...,0.719443,0.058313,66,0.647152,0.738483,0.735714,0.754208,0.746816,0.724475,0.039204
3,0.005574,0.000970,0.003534,0.001003,0.0001,,4,"{'logisticregression__C': 0.0001, 'logisticreg...",0.000000,0.076923,...,0.727575,0.055140,62,0.645055,0.747910,0.741860,0.763995,0.751190,0.730002,0.043085
4,0.004896,0.000259,0.003008,0.000060,0.0001,,5,"{'logisticregression__C': 0.0001, 'logisticreg...",0.000000,0.076923,...,0.727226,0.053677,63,0.627586,0.748796,0.749169,0.772937,0.771788,0.734055,0.054253
5,0.005910,0.000766,0.003615,0.000545,0.0001,balanced,1,"{'logisticregression__C': 0.0001, 'logisticreg...",0.814815,0.807692,...,0.700944,0.038834,71,0.703502,0.696387,0.690407,0.719061,0.698934,0.701658,0.009675
6,0.005306,0.000297,0.003040,0.000064,0.0001,balanced,2,"{'logisticregression__C': 0.0001, 'logisticreg...",0.777778,0.769231,...,0.722335,0.041802,65,0.730532,0.715089,0.714535,0.728682,0.721982,0.722164,0.006645
7,0.005079,0.000305,0.003094,0.000105,0.0001,balanced,3,"{'logisticregression__C': 0.0001, 'logisticreg...",0.777778,0.730769,...,0.703821,0.043413,70,0.714487,0.694408,0.698200,0.709441,0.705011,0.704309,0.007291
8,0.005989,0.001101,0.003363,0.000425,0.0001,balanced,4,"{'logisticregression__C': 0.0001, 'logisticreg...",0.740741,0.730769,...,0.697309,0.045475,90,0.709149,0.689936,0.690504,0.704582,0.698920,0.698618,0.007586
9,0.006013,0.000983,0.003215,0.000135,0.0001,balanced,5,"{'logisticregression__C': 0.0001, 'logisticreg...",0.740741,0.730769,...,0.697615,0.048792,89,0.705766,0.693120,0.692276,0.708015,0.700526,0.699941,0.006399


## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.
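
To see how many pipelines a grid will fit before running it, the grid can be expanded with `ParameterGrid`. The sketch below counts the grid exactly as listed above (k from 1 to 4); `grid_as_specified` is a name chosen here for illustration.

```python
from sklearn.model_selection import ParameterGrid

grid_as_specified = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
}

# Every combination of the listed values is one candidate.
n_candidates = len(ParameterGrid(grid_as_specified))
n_fits = n_candidates * 5  # 5-fold cross-validation fits each candidate 5 times
print(n_candidates, n_fits)
```

The run in this notebook reported 90 candidates rather than 72 because its grid let `k` range up to 5, to cover the extra `donation_lifetime` feature.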

In [16]:
# I completed this after configuring my pipeline. Results are listed above

## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)
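
One of those ways is to let `GridSearchCV` report the winner itself. With multiple scoring metrics, `best_score_` and `best_params_` are only available when `refit` names the metric to optimize, so this sketch uses `refit='recall'` instead of the `refit=False` used in this notebook. The data comes from `make_classification` and is purely synthetic, and the grid is shortened to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic, imbalanced stand-in data (not the blood donation dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.76, 0.24], random_state=0)

pipe_demo = make_pipeline(RobustScaler(), SelectKBest(f_classif),
                          LogisticRegression(max_iter=200))
grid_demo = {'selectkbest__k': [1, 2, 3, 4],
             'logisticregression__class_weight': [None, 'balanced'],
             'logisticregression__C': [0.1, 1.0, 10.0]}

# refit='recall' picks the candidate with the best mean CV recall and
# refits it on all of the data passed to fit().
gs_demo = GridSearchCV(pipe_demo, grid_demo, cv=5,
                       scoring={'recall': 'recall', 'roc_auc': 'roc_auc'},
                       refit='recall')
gs_demo.fit(X_demo, y_demo)

print('Best CV recall:', gs_demo.best_score_)
print('Best parameters:', gs_demo.best_params_)
```

When candidates tie, `best_params_` reports only one of them, so inspecting `cv_results_` is still useful.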

In [17]:
pd.options.display.max_columns = None
cross_val_results.sort_values(by='rank_test_recall', ascending=True)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_selectkbest__k,params,split0_test_recall,split1_test_recall,split2_test_recall,split3_test_recall,split4_test_recall,mean_test_recall,std_test_recall,rank_test_recall,split0_train_recall,split1_train_recall,split2_train_recall,split3_train_recall,split4_train_recall,mean_train_recall,std_train_recall,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,split0_train_roc_auc,split1_train_roc_auc,split2_train_roc_auc,split3_train_roc_auc,split4_train_roc_auc,mean_train_roc_auc,std_train_roc_auc
36,0.005351,0.000420,0.003507,0.000677,0.1,balanced,2,"{'logisticregression__C': 0.1, 'logisticregres...",0.851852,0.884615,0.846154,0.730769,0.692308,0.801230,0.075279,1,0.798077,0.790476,0.790476,0.819048,0.752381,0.790092,0.021562,0.748277,0.763864,0.794723,0.700581,0.725626,0.746617,0.032141,29,0.749986,0.744075,0.736296,0.754707,0.752519,0.747517,0.006642
37,0.005797,0.000974,0.003387,0.000287,0.1,balanced,3,"{'logisticregression__C': 0.1, 'logisticregres...",0.851852,0.884615,0.846154,0.730769,0.692308,0.801230,0.075279,1,0.798077,0.790476,0.752381,0.819048,0.742857,0.780568,0.028643,0.742248,0.763864,0.797406,0.698345,0.722496,0.744867,0.034008,54,0.749818,0.743217,0.736905,0.754181,0.752409,0.747306,0.006396
87,0.005567,0.000550,0.003335,0.000531,10000,balanced,3,"{'logisticregression__C': 10000.0, 'logisticre...",0.851852,0.846154,0.846154,0.730769,0.730769,0.801230,0.057483,3,0.788462,0.790476,0.790476,0.809524,0.771429,0.790073,0.012074,0.746124,0.758945,0.793381,0.701029,0.727415,0.745380,0.030887,37,0.747163,0.743577,0.735161,0.751218,0.753765,0.746177,0.006513
86,0.005315,0.000331,0.003456,0.000406,10000,balanced,2,"{'logisticregression__C': 10000.0, 'logisticre...",0.851852,0.846154,0.846154,0.730769,0.730769,0.801230,0.057483,3,0.788462,0.790476,0.790476,0.809524,0.771429,0.790073,0.012074,0.746124,0.758945,0.793381,0.701029,0.727415,0.745380,0.030887,37,0.747163,0.743577,0.735161,0.751218,0.753765,0.746177,0.006513
57,0.005150,0.000207,0.003134,0.000245,10,balanced,3,"{'logisticregression__C': 10.0, 'logisticregre...",0.851852,0.846154,0.846154,0.730769,0.730769,0.801230,0.057483,3,0.788462,0.790476,0.790476,0.809524,0.771429,0.790073,0.012074,0.746124,0.758945,0.793381,0.702370,0.727415,0.745648,0.030504,34,0.746995,0.743577,0.735299,0.750886,0.753765,0.746104,0.006411
56,0.005907,0.001012,0.003246,0.000344,10,balanced,2,"{'logisticregression__C': 10.0, 'logisticregre...",0.851852,0.846154,0.846154,0.730769,0.730769,0.801230,0.057483,3,0.788462,0.790476,0.790476,0.809524,0.771429,0.790073,0.012074,0.746124,0.758945,0.793381,0.702370,0.727415,0.745648,0.030504,34,0.746995,0.743577,0.735299,0.750886,0.753765,0.746104,0.006411
67,0.005310,0.000346,0.003213,0.000421,100,balanced,3,"{'logisticregression__C': 100.0, 'logisticregr...",0.851852,0.846154,0.846154,0.730769,0.730769,0.801230,0.057483,3,0.788462,0.790476,0.790476,0.809524,0.771429,0.790073,0.012074,0.746124,0.758945,0.793381,0.701029,0.727415,0.745380,0.030887,37,0.747163,0.743577,0.735161,0.751218,0.753765,0.746177,0.006513
66,0.005203,0.000356,0.003184,0.000205,100,balanced,2,"{'logisticregression__C': 100.0, 'logisticregr...",0.851852,0.846154,0.846154,0.730769,0.730769,0.801230,0.057483,3,0.788462,0.790476,0.790476,0.809524,0.771429,0.790073,0.012074,0.746124,0.758945,0.793381,0.701029,0.727415,0.745380,0.030887,37,0.747163,0.743577,0.735161,0.751218,0.753765,0.746177,0.006513
47,0.005136,0.000236,0.003168,0.000146,1,balanced,3,"{'logisticregression__C': 1.0, 'logisticregres...",0.851852,0.846154,0.846154,0.730769,0.730769,0.801230,0.057483,3,0.788462,0.790476,0.790476,0.809524,0.771429,0.790073,0.012074,0.747416,0.758945,0.792039,0.702370,0.726073,0.745372,0.030264,43,0.746855,0.743439,0.735437,0.750831,0.754374,0.746187,0.006514
46,0.005843,0.001141,0.003339,0.000318,1,balanced,2,"{'logisticregression__C': 1.0, 'logisticregres...",0.851852,0.846154,0.846154,0.730769,0.730769,0.801230,0.057483,3,0.788462,0.790476,0.790476,0.809524,0.771429,0.790073,0.012074,0.747416,0.758945,0.793381,0.702370,0.726073,0.745640,0.030679,36,0.746967,0.743439,0.735050,0.750941,0.754374,0.746154,0.006661


In [18]:
number_ones = cross_val_results[cross_val_results['rank_test_recall'] == 1]

In [19]:
number_ones
# With the pipeline configured this way, two parameter settings tie for rank 1:
# C=0.1, class_weight='balanced', k=2 or k=3, each with a mean recall of 0.801230

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_selectkbest__k,params,split0_test_recall,split1_test_recall,split2_test_recall,split3_test_recall,split4_test_recall,mean_test_recall,std_test_recall,rank_test_recall,split0_train_recall,split1_train_recall,split2_train_recall,split3_train_recall,split4_train_recall,mean_train_recall,std_train_recall,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,split0_train_roc_auc,split1_train_roc_auc,split2_train_roc_auc,split3_train_roc_auc,split4_train_roc_auc,mean_train_roc_auc,std_train_roc_auc
36,0.005351,0.00042,0.003507,0.000677,0.1,balanced,2,"{'logisticregression__C': 0.1, 'logisticregres...",0.851852,0.884615,0.846154,0.730769,0.692308,0.80123,0.075279,1,0.798077,0.790476,0.790476,0.819048,0.752381,0.790092,0.021562,0.748277,0.763864,0.794723,0.700581,0.725626,0.746617,0.032141,29,0.749986,0.744075,0.736296,0.754707,0.752519,0.747517,0.006642
37,0.005797,0.000974,0.003387,0.000287,0.1,balanced,3,"{'logisticregression__C': 0.1, 'logisticregres...",0.851852,0.884615,0.846154,0.730769,0.692308,0.80123,0.075279,1,0.798077,0.790476,0.752381,0.819048,0.742857,0.780568,0.028643,0.742248,0.763864,0.797406,0.698345,0.722496,0.744867,0.034008,54,0.749818,0.743217,0.736905,0.754181,0.752409,0.747306,0.006396
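
When several candidates tie on recall, a second metric can break the tie. The sketch below is one possible approach, not part of the original run: it copies three rows from the results above (indices 36, 37, and 87) into a small hypothetical frame, `tied`, and sorts on recall first and ROC AUC second.

```python
import pandas as pd

# Values copied from the cv_results_ rows shown above.
tied = pd.DataFrame(
    {'param_logisticregression__C': [0.1, 0.1, 10000.0],
     'param_selectkbest__k': [2, 3, 3],
     'mean_test_recall': [0.801230, 0.801230, 0.801230],
     'mean_test_roc_auc': [0.746617, 0.744867, 0.745380]},
    index=[36, 37, 87])

# Highest recall first; among equal recalls, highest ROC AUC first.
ranked = tied.sort_values(['mean_test_recall', 'mean_test_roc_auc'],
                          ascending=[False, False])
best_row = ranked.iloc[0]
print(ranked)
```

On these rows the tie-break favors candidate 36 (C=0.1, k=2).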


These are the same results, sorted by ROC AUC rank instead:

In [20]:
cross_val_results.sort_values(by='rank_test_roc_auc', ascending=True)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_selectkbest__k,params,split0_test_recall,split1_test_recall,split2_test_recall,split3_test_recall,split4_test_recall,mean_test_recall,std_test_recall,rank_test_recall,split0_train_recall,split1_train_recall,split2_train_recall,split3_train_recall,split4_train_recall,mean_train_recall,std_train_recall,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,split0_train_roc_auc,split1_train_roc_auc,split2_train_roc_auc,split3_train_roc_auc,split4_train_roc_auc,mean_train_roc_auc,std_train_roc_auc
33,0.005638,0.001062,0.003701,0.001050,0.1,,4,"{'logisticregression__C': 0.1, 'logisticregres...",0.111111,0.115385,0.000000,0.115385,0.076923,0.083809,0.044272,67,0.057692,0.085714,0.085714,0.076190,0.123810,0.085824,0.021574,0.777347,0.794946,0.798971,0.708184,0.728980,0.761714,0.036508,1,0.756191,0.760230,0.758887,0.777976,0.778405,0.766338,0.009766
38,0.005248,0.000208,0.002995,0.000038,0.1,balanced,4,"{'logisticregression__C': 0.1, 'logisticregres...",0.851852,0.846154,0.807692,0.692308,0.653846,0.770516,0.081788,22,0.788462,0.752381,0.761905,0.800000,0.790476,0.778645,0.018234,0.778639,0.793157,0.799419,0.711315,0.724955,0.761528,0.036273,2,0.755548,0.761780,0.759468,0.779665,0.778931,0.767079,0.010177
39,0.005196,0.000224,0.002964,0.000015,0.1,balanced,5,"{'logisticregression__C': 0.1, 'logisticregres...",0.851852,0.846154,0.807692,0.692308,0.653846,0.770516,0.081788,22,0.778846,0.761905,0.761905,0.790476,0.771429,0.772912,0.010847,0.789836,0.795841,0.794499,0.701476,0.721825,0.760747,0.040590,3,0.761502,0.760507,0.758915,0.779250,0.779762,0.767987,0.009443
24,0.006190,0.001109,0.004434,0.000820,0.01,,5,"{'logisticregression__C': 0.01, 'logisticregre...",0.037037,0.076923,0.000000,0.076923,0.076923,0.053532,0.030897,70,0.048077,0.057143,0.076190,0.057143,0.095238,0.066758,0.016935,0.782946,0.795841,0.793157,0.702818,0.724508,0.759895,0.038552,4,0.759266,0.759815,0.760078,0.779028,0.778322,0.767302,0.009293
34,0.005589,0.001020,0.003374,0.000567,0.1,,5,"{'logisticregression__C': 0.1, 'logisticregres...",0.111111,0.153846,0.038462,0.115385,0.076923,0.099167,0.038893,56,0.057692,0.085714,0.095238,0.085714,0.142857,0.093443,0.027722,0.789406,0.794052,0.797630,0.683140,0.724955,0.757893,0.045918,5,0.761334,0.759372,0.757115,0.778779,0.777436,0.766807,0.009333
88,0.005142,0.000117,0.003023,0.000083,10000,balanced,4,"{'logisticregression__C': 10000.0, 'logisticre...",0.851852,0.846154,0.807692,0.692308,0.653846,0.770516,0.081788,22,0.769231,0.752381,0.761905,0.780952,0.790476,0.770989,0.013497,0.789836,0.794946,0.798077,0.684481,0.716458,0.756819,0.047112,6,0.761642,0.758929,0.755426,0.778198,0.776633,0.766166,0.009408
68,0.005156,0.000298,0.002980,0.000051,100,balanced,4,"{'logisticregression__C': 100.0, 'logisticregr...",0.851852,0.846154,0.807692,0.692308,0.653846,0.770516,0.081788,22,0.769231,0.752381,0.761905,0.780952,0.790476,0.770989,0.013497,0.789836,0.794946,0.798077,0.684481,0.716458,0.756819,0.047112,6,0.761614,0.758901,0.755482,0.778142,0.776578,0.766143,0.009375
89,0.005317,0.000230,0.003145,0.000217,10000,balanced,5,"{'logisticregression__C': 10000.0, 'logisticre...",0.851852,0.846154,0.807692,0.692308,0.653846,0.770516,0.081788,22,0.769231,0.752381,0.761905,0.780952,0.790476,0.770989,0.013497,0.789836,0.794946,0.798077,0.684481,0.716458,0.756819,0.047112,6,0.761642,0.758929,0.755426,0.778198,0.776606,0.766160,0.009401
78,0.005632,0.000615,0.003195,0.000192,1000,balanced,4,"{'logisticregression__C': 1000.0, 'logisticreg...",0.851852,0.846154,0.807692,0.692308,0.653846,0.770516,0.081788,22,0.769231,0.752381,0.761905,0.780952,0.790476,0.770989,0.013497,0.789836,0.794946,0.798077,0.684481,0.716458,0.756819,0.047112,6,0.761642,0.758929,0.755426,0.778198,0.776606,0.766160,0.009401
79,0.005193,0.000119,0.003034,0.000120,1000,balanced,5,"{'logisticregression__C': 1000.0, 'logisticreg...",0.851852,0.846154,0.807692,0.692308,0.653846,0.770516,0.081788,22,0.769231,0.752381,0.761905,0.780952,0.790476,0.770989,0.013497,0.789836,0.794946,0.798077,0.684481,0.716458,0.756819,0.047112,6,0.761642,0.758929,0.755426,0.778198,0.776606,0.766160,0.009401


In [21]:
number_ones_roc_auc = cross_val_results[cross_val_results['rank_test_roc_auc'] == 1]

In [22]:
# This is the optimal parameter set for ROC_AUC
number_ones_roc_auc

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_selectkbest__k,params,split0_test_recall,split1_test_recall,split2_test_recall,split3_test_recall,split4_test_recall,mean_test_recall,std_test_recall,rank_test_recall,split0_train_recall,split1_train_recall,split2_train_recall,split3_train_recall,split4_train_recall,mean_train_recall,std_train_recall,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,split0_train_roc_auc,split1_train_roc_auc,split2_train_roc_auc,split3_train_roc_auc,split4_train_roc_auc,mean_train_roc_auc,std_train_roc_auc
33,0.005638,0.001062,0.003701,0.00105,0.1,,4,"{'logisticregression__C': 0.1, 'logisticregres...",0.111111,0.115385,0.0,0.115385,0.076923,0.083809,0.044272,67,0.057692,0.085714,0.085714,0.07619,0.12381,0.085824,0.021574,0.777347,0.794946,0.798971,0.708184,0.72898,0.761714,0.036508,1,0.756191,0.76023,0.758887,0.777976,0.778405,0.766338,0.009766
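
The two rankings disagree sharply: row 33 ranks first on ROC AUC but 67th on recall (mean test recall of about 0.08), while row 36 ranks first on recall but 29th on ROC AUC. With multi-metric scoring, `GridSearchCV` accepts a scorer name for `refit`, so one metric can be chosen to define the "best" parameter set. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the donation data, `f_classif` as the `SelectKBest` score function (the notebook's own choice is not shown in this section), and a smaller, made-up parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic, imbalanced stand-in for the donation data (an assumption, not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=1,
    weights=[0.76, 0.24], random_state=0,
)

pipe = make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif),
    LogisticRegression(max_iter=1000),
)

# Hypothetical grid; step names follow make_pipeline's lowercase class-name convention.
param_grid = {
    'selectkbest__k': [2, 3, 4, 5],
    'logisticregression__C': [0.1, 1.0, 10.0],
    'logisticregression__class_weight': [None, 'balanced'],
}

# Both metrics are recorded in cv_results_, but only recall decides which
# parameter set is refit on the full training data.
search = GridSearchCV(
    pipe, param_grid,
    scoring=['recall', 'roc_auc'],
    refit='recall',
    cv=5,
    return_train_score=True,
)
search.fit(X, y)

print('Best parameters by recall:', search.best_params_)
print('Mean cross-validated recall:', search.best_score_)
```

Because `refit='recall'` is set, `search.best_estimator_` is available for a final evaluation on a held-out test set, and `search.cv_results_` still contains the `rank_test_roc_auc` column for comparison.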


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

In [23]:
true_pos = 36 
false_pos = 58 

true_neg = 85
false_neg = 8
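
The table's layout (rows are actual classes, columns are predicted classes, negative class first) matches the layout scikit-learn's `confusion_matrix` returns for binary labels, so the four counts can also be unpacked in one step. A minimal sketch, where the nested list `matrix` is a hypothetical stand-in typed in by hand from the table:

```python
# Rows are actual classes, columns are predicted classes, negative class first.
matrix = [
    [85, 58],  # actual negative: [true negatives, false positives]
    [8, 36],   # actual positive: [false negatives, true positives]
]

# Unpack row by row so each count gets the same name used in the notebook.
(true_neg, false_pos), (false_neg, true_pos) = matrix

total = true_neg + false_pos + false_neg + true_pos
print('Total observations:', total)
```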

Calculate accuracy

In [24]:
# Accuracy rate = 
# (true positives + true negatives) 
#               /
# (true positives + true negatives + false positives + false negatives)

accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)

print('The Accuracy Score for the above confusion matrix is:', accuracy)

The Accuracy Score for the above confusion matrix is: 0.6470588235294118


Calculate precision

In [25]:
# Precision rate = true positives / (true positives + false positives)

precision = true_pos / (true_pos + false_pos)

print('The Precision Score for the above confusion matrix is:', precision)

The Precision Score for the above confusion matrix is: 0.3829787234042553


Calculate recall

In [26]:
# Recall rate = true positives / (true positives + false negatives)

recall = true_pos / (true_pos + false_neg)

print('The Recall Score for the above confusion matrix is:', recall)

The Recall Score for the above confusion matrix is: 0.8181818181818182
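
Accuracy on its own can mislead when the classes are imbalanced, as they are here (44 actual positives against 143 actual negatives). As a rough sanity check, the sketch below compares the model with a hypothetical majority-class baseline that always predicts "negative". The `baseline_*` names are illustrative and not part of the challenge.

```python
# Counts copied from the confusion matrix above.
true_pos, false_pos = 36, 58
true_neg, false_neg = 85, 8

total = true_pos + false_pos + true_neg + false_neg
actual_neg = true_neg + false_pos  # every observation whose true label is negative
actual_pos = true_pos + false_neg  # every observation whose true label is positive

model_accuracy = (true_pos + true_neg) / total
model_recall = true_pos / actual_pos

# A baseline that always predicts the majority class gets every majority-class
# observation right and every minority-class observation wrong.
baseline_accuracy = max(actual_neg, actual_pos) / total
# Its recall is 0 when the majority class is negative: it never predicts a positive.
baseline_recall = 0.0 if actual_neg >= actual_pos else 1.0

print('Model accuracy:   ', model_accuracy)
print('Baseline accuracy:', baseline_accuracy)
print('Model recall:     ', model_recall)
print('Baseline recall:  ', baseline_recall)
```

The baseline reaches a higher accuracy (143/187, about 0.76) than the model (about 0.65) while identifying no donors at all, which is consistent with using recall rather than accuracy as the evaluation metric here.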


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?

### Part 4
Calculate F1 score and False Positive Rate. 

# Bonus Checklist

In [28]:
# Part 1

# Done. I added a 'donation lifetime' column: the months since the first donation minus the months
# since the most recent donation. I had also coded a feature showing each donor's average monthly
# donation volume, but it proved too much for the cross-validation computations, so I
# commented it out.



# Part 2

# Done. I added ROC_AUC as an additional scoring metric.

# Part 3

# Because I specified two scoring metrics, the estimator is not refit on the parameter
# setting with the best cross-validated score; with two metrics there is no single "best" score.
# To resolve this, I would have to pick one metric and refit the estimator on that metric's
# optimal parameter set.
# In short, I know how to do this, but I am refraining for fear of breaking my currently
# functional code for the sake of optional bonus points.

# Part 4

f1_score = (2*precision*recall)/(precision+recall)
print('F1 Score:', f1_score)
# Done for the F1 score; the False Positive Rate is not calculated in this cell.

F1 Score: 0.5217391304347826
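
Bonus Part 4 also asks for the False Positive Rate, which the cell above does not compute. A minimal sketch using the same confusion-matrix counts and the standard definition, false positives divided by all actual negatives (the variable names `false_positive_rate` and `specificity` are my own):

```python
# Counts copied from the confusion matrix in Part 4.
false_pos = 58
true_neg = 85

# False Positive Rate = false positives / (false positives + true negatives)
false_positive_rate = false_pos / (false_pos + true_neg)

# Specificity (true negative rate) is the complement of the False Positive Rate.
specificity = true_neg / (true_neg + false_pos)

print('False Positive Rate:', false_positive_rate)
print('Specificity:', specificity)
```

With these counts the False Positive Rate is 58/143, roughly 0.406: about 41% of the people who did not donate were predicted to be donors.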
