 # Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll measure success using recall score as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [0]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

In [45]:
df.head()

| | months_since_last_donation | number_of_donations | total_volume_donated | months_since_first_donation | made_donation_in_march_2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
months_since_last_donation     748 non-null int64
number_of_donations            748 non-null int64
total_volume_donated           748 non-null int64
months_since_first_donation    748 non-null int64
made_donation_in_march_2007    748 non-null int64
dtypes: int64(5)
memory usage: 29.3 KB


## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

> With a majority class baseline, the accuracy score is 0.762 (always guessing that no donation was made in March 2007)

In [47]:
# With a majority class baseline, the accuracy score is 0.762
# (always guessing that no donation was made in March 2007)

df.made_donation_in_march_2007.value_counts(normalize=True)

0    0.762032
1    0.237968
Name: made_donation_in_march_2007, dtype: float64
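
The same baseline can also be obtained with scikit-learn. The following is a minimal sketch, assuming toy labels that reproduce the class counts reported for this dataset (570 zeros, 178 ones); the names `baseline` and `baseline_accuracy` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels with the same class counts as the dataset (570 zeros, 178 ones).
y = np.array([0] * 570 + [1] * 178)
X = np.zeros((len(y), 1))  # placeholder features; the dummy model ignores them

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
print(round(baseline_accuracy, 3))  # 0.762
```

A fitted `DummyClassifier` can also be scored on a held-out set later, which keeps the baseline and the real model on the same evaluation path.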

What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

 > Recall is the number of correctly predicted positives divided by the number of actual positives: TP / (TP + FN)
- Recall for 'Not in March 2007' (0) with a majority class baseline would be 1
- Recall for 'March 2007' (1) with a majority class baseline would be 0

In [48]:
# Recall is the number of correctly predicted positives divided by the number of actual positives
# Recall for 'Not in March 2007' (0) with a majority class baseline would be 1
# Recall for 'March 2007' (1) with a majority class baseline would be 0
import numpy as np
from sklearn.metrics import classification_report

# Predict the most frequent class (the mode) for every row
y_pred = np.full(df.made_donation_in_march_2007.shape,
                df.made_donation_in_march_2007.mode()[0])


print(classification_report(df.made_donation_in_march_2007, y_pred))

              precision    recall  f1-score   support

           0       0.76      1.00      0.86       570
           1       0.00      0.00      0.00       178

   micro avg       0.76      0.76      0.76       748
   macro avg       0.38      0.50      0.43       748
weighted avg       0.58      0.76      0.66       748
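
To read a single recall value rather than the whole report, `recall_score` with `pos_label` can be used. This is a small sketch on toy labels that assume the same class counts as above (570 and 178); the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0] * 570 + [1] * 178)
y_pred = np.zeros_like(y_true)  # majority class baseline: always predict 0

# Recall for a class = correctly predicted members of that class / actual members.
recall_positive = recall_score(y_true, y_pred, pos_label=1)
recall_negative = recall_score(y_true, y_pred, pos_label=0)
print(recall_positive, recall_negative)  # 0.0 1.0
```

Since the challenge measures success by recall on the positive class, the number a model has to beat is the 0.0 for class 1.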





## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [49]:
from sklearn.model_selection import train_test_split

# Defining X and Y
X = df.drop(columns='made_donation_in_march_2007')
y = df['made_donation_in_march_2007']

# Splitting data into train & test
# Shuffle parameter is True by default with sklearn's train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Checking shape for each set
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((561, 4), (561,), (187, 4), (187,))
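
Because the classes are imbalanced, one optional variation is to stratify the split so both sets keep roughly the same share of positives. The sketch below assumes toy data with the dataset's class counts; `stratify=y` is an addition that the original cell does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 748 rows with 570 negatives and 178 positives.
y = np.array([0] * 570 + [1] * 178)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio nearly equal in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```

Without stratification, a random 25% sample can end up with noticeably more or fewer positives than the train set, which makes recall on the test set noisier.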

## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [0]:
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler

# Making a pipeline ('pipe')
pipe = make_pipeline(
    RobustScaler(),  # scales by median and IQR, so it is robust to outliers
    SelectKBest(f_regression),  # the task names f_classif; f_regression is what this run used
    LogisticRegression(solver='lbfgs'))
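
For reference, here is a hedged sketch of the pipeline as the task text describes it, with `f_classif` as the scoring function. The name `sketch_pipe` and the choice `k=2` are illustrative. It also shows the step names that `make_pipeline` generates, which are the prefixes used later in the parameter grid.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

sketch_pipe = make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif, k=2),  # f_classif: ANOVA F-test for classification targets
    LogisticRegression(solver='lbfgs'))

# make_pipeline names each step after its lowercased class name.
step_names = list(sketch_pipe.named_steps)
print(step_names)  # ['robustscaler', 'selectkbest', 'logisticregression']
```

A grid key such as `selectkbest__k` is the step name, two underscores, then the parameter name.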

## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [67]:
param_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0]
}

# Fit on the train set, with grid search cross-validation.
# Note: this run uses 3 folds and accuracy (to compare with the baseline accuracy),
# not the 5 folds and recall score that the task asks for.
gs = GridSearchCV(pipe, param_grid=param_grid, cv=3,
                  scoring='accuracy',
                  verbose=False)

gs.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_regression at 0x7f777eaac730>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__class_weight': [None, 'balanced'], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=False)
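
The task text asks for 5 folds and recall as the scoring metric. The following sketch shows that configuration on synthetic data from `make_classification`, which is an assumed stand-in for the blood donation data. The `demo_*` names and the shortened `C` list are illustrative, and the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in: 4 features, roughly 76% negatives and 24% positives.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.76], random_state=0)

demo_pipe = make_pipeline(
    RobustScaler(), SelectKBest(f_classif), LogisticRegression(solver='lbfgs'))

demo_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [0.01, 1.0, 100.0],
}

# cv=5 and scoring='recall' match the task description.
demo_search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=5, scoring='recall')
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(demo_search.best_score_)
```

When recall is the metric, `class_weight='balanced'` tends to matter more than it does under accuracy, because it penalizes missed positives more heavily.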

## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [68]:
validation_score = gs.best_score_
# best_score_ is the mean cross-validated accuracy of the best parameter combination
print('Cross-Validation Score:', validation_score) 
print()
print('Best estimator:', gs.best_estimator_)

Cross-Validation Score: 0.7896613190730838

Best estimator: Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=4, score_func=<function f_regression at 0x7f777eaac730>)), ('logisticregression', LogisticRegression(C=0.1, class_weight=None, dual=False, fit_i...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])
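
The values of the best parameters can also be read directly from `best_params_`, and the full grid can be inspected through `cv_results_`. This is a minimal sketch on synthetic data; the estimator, grid, and variable names here are assumptions chosen to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver='lbfgs'),
    param_grid={'C': [0.1, 1.0, 10.0], 'class_weight': [None, 'balanced']},
    cv=5, scoring='recall')
search.fit(X_demo, y_demo)

print('Best CV score:', search.best_score_)
print('Best parameters:', search.best_params_)  # plain dict of the winning values

# One row per parameter combination, best rank first.
results = pd.DataFrame(search.cv_results_).sort_values('rank_test_score')
print(results[['param_C', 'param_class_weight', 'mean_test_score']].head())
```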


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

Calculate accuracy

In [53]:
# Accuracy: overall, how often is the classifier correct
# (TruePos + TrueNeg) / Total

tp = 36   # true positive
tn = 85   # true negative
fp = 58   # false positive
fn = 8    # false negative
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
accuracy

0.6470588235294118

Calculate precision

In [54]:
# Precision: Probability of correct prediction when it predicts Positive
# TruePos/predicted Positive

precision = tp / (tp + fp)
precision

0.3829787234042553

Calculate recall

In [55]:
# Recall: how often the model predicts Positive when the actual class is Positive
# TruePos / actual Positive
recall = tp / (tp + fn)
recall

0.8181818181818182
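
As a cross-check, the four cell counts can be expanded into label arrays and passed to scikit-learn's metric functions. This is a sketch under the standard definitions accuracy = (TP + TN) / total, precision = TP / (TP + FP), and recall = TP / (TP + FN); the `sk_*` names are illustrative.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

tn, fp, fn, tp = 85, 58, 8, 36

# Expand the four cell counts into matching true/predicted label arrays.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

# scikit-learn lays the matrix out as [[tn, fp], [fn, tp]].
cm = confusion_matrix(y_true, y_pred)

sk_accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
sk_precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
sk_recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
print(cm)
print(sk_accuracy, sk_precision, sk_recall)
```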

## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?

### Part 4
Calculate F1 score and False Positive Rate. 
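
For the confusion matrix in Part 4 above, F1 score and False Positive Rate can be computed from the same four counts. A short sketch, using the standard definitions F1 = 2 * precision * recall / (precision + recall) and FPR = FP / (FP + TN):

```python
tp, tn, fp, fn = 36, 85, 58, 8

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# False Positive Rate: share of actual negatives wrongly predicted positive.
false_positive_rate = fp / (fp + tn)

print(round(f1, 4), round(false_positive_rate, 4))  # 0.5217 0.4056
```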

In [0]:
# Work on a copy so the new bonus feature does not modify df
df1 = df.copy()

In [81]:
df1.head()

| | months_since_last_donation | number_of_donations | total_volume_donated | months_since_first_donation | made_donation_in_march_2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


In [82]:
# Using copy of df as df1 for Bonus Section
df1.columns

Index(['months_since_last_donation', 'number_of_donations',
       'total_volume_donated', 'months_since_first_donation',
       'made_donation_in_march_2007'],
      dtype='object')

In [0]:
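# New feature in the bonus DataFrame (df1).
# Despite its name, this is the average number of months per donation:
# months between first and last donation, divided by the number of donations.
df1['avg_donations_per_month'] = (df1.months_since_first_donation - 
                                  df1.months_since_last_donation) / df1.number_of_donations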
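
To make the direction of the ratio explicit, the sketch below computes both a rate (donations per month) and its months-per-donation counterpart on a toy frame built from the five rows shown by `df.head()`. The column names `donations_per_month` and `months_per_donation` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# First five rows of the dataset, as displayed by df.head().
toy = pd.DataFrame({
    'months_since_last_donation': [2, 0, 1, 2, 1],
    'number_of_donations': [50, 13, 16, 20, 24],
    'months_since_first_donation': [98, 28, 35, 45, 77],
})

# Hypothetical rate feature: donations per month since the first donation.
toy['donations_per_month'] = (
    toy['number_of_donations'] / toy['months_since_first_donation'])

# Same formula as the cell above: active months divided by donation count.
toy['months_per_donation'] = (
    (toy['months_since_first_donation'] - toy['months_since_last_donation'])
    / toy['number_of_donations'])

print(toy[['donations_per_month', 'months_per_donation']].round(3))
```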

In [85]:
# Defining X and Y
X1 = df1.drop(columns='made_donation_in_march_2007')
y1 = df1['made_donation_in_march_2007']

# Splitting data into train & test
# Shuffle parameter is True by default with sklearn's train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(
    X1, y1, test_size=0.25, random_state=0)

# Checking shape for each set
X_train1.shape, y_train1.shape, X_test1.shape, y_test1.shape

((561, 5), (561,), (187, 5), (187,))

In [0]:
# Making a second pipeline ('pipe2')
pipe2 = make_pipeline(
    StandardScaler(),  # <------ Trying StandardScaler now
    SelectKBest(f_regression),
    LogisticRegression(solver='lbfgs'))

In [92]:
# added new Cs
param_grid1 = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [.0001, .0002, .001, .002,  .01, .02, .1, .2, 
                              1.0, 2.0, 10.0, 20.0, 100.00, 200.0, 1000.0, 
                              2000.0, 10000.0, 20000.0]
}

# Fit on the train set, with grid search cross-validation
# (3 folds and accuracy again, to stay comparable with the first grid search)
gs1 = GridSearchCV(pipe2, param_grid=param_grid1, cv=3,
                  scoring='accuracy',
                  verbose=False)

gs1.fit(X_train1, y_train1)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_regression at 0x7f777eaac730>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_inte...nalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__class_weight': [None, 'balanced'], 'logisticregression__C': [0.0001, 0.0002, 0.001, 0.002, 0.01, 0.02, 0.1, 0.2, 1.0, 2.0, 10.0, 20.0, 100.0, 200.0, 1000.0, 2000.0, 10000.0, 20000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=False)

In [93]:
validation_score1 = gs1.best_score_
# best_score_ is the mean cross-validated accuracy of the best parameter combination
print('Cross-Validation Score:', validation_score1) 
print()
print('Best estimator:', gs1.best_estimator_)

Cross-Validation Score: 0.7896613190730838

Best estimator: Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selectkbest', SelectKBest(k=4, score_func=<function f_regression at 0x7f777eaac730>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])


In [94]:
selector = gs1.best_estimator_.named_steps['selectkbest']
all_names = X_train1.columns
selected_mask = selector.get_support() # .get_support shows if feature was selected or not
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)

print()
print('Features not selected:')
for name in unselected_names:
    print(name)

Features selected:
months_since_last_donation
number_of_donations
total_volume_donated
avg_donations_per_month

Features not selected:
months_since_first_donation


In [95]:
# Test score (accuracy, because gs1 was built with scoring='accuracy')
test_score = gs1.score(X_test1, y_test1)
print('Test Score:', test_score)

Test Score: 0.7272727272727273
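
Since the challenge names recall as the success metric, a final evaluation could report recall and a confusion matrix alongside accuracy. The following is a sketch on synthetic data (an assumed stand-in, not the blood donation data), with a plain `LogisticRegression` in place of the tuned pipeline. All names are illustrative and the printed numbers do not describe the real model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.76], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

model = LogisticRegression(solver='lbfgs', class_weight='balanced')
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

test_accuracy = model.score(X_te, y_te)  # classifier default score is accuracy
test_recall = recall_score(y_te, y_hat)  # recall on the positive class
test_cm = confusion_matrix(y_te, y_hat)

print('Test accuracy:', test_accuracy)
print('Test recall:', test_recall)
print(test_cm)
```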


