# Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll measure success using recall score as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [37]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

def ini_preview(df):
  print(df.head().T)
  print("-"*100)
  for i in df.columns:
    print(i)
    print(df[i].value_counts().index.sort_values())  
    print("-"*100)
ini_preview(df)

                                 0     1     2     3     4
months_since_last_donation       2     0     1     2     1
number_of_donations             50    13    16    20    24
total_volume_donated         12500  3250  4000  5000  6000
months_since_first_donation     98    28    35    45    77
made_donation_in_march_2007      1     1     1     1     0
----------------------------------------------------------------------------------------------------
months_since_last_donation
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 25, 26, 35, 38, 39, 40, 72, 74], dtype='int64')
----------------------------------------------------------------------------------------------------
number_of_donations
Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 33, 34, 38, 41, 43, 44, 46, 50], dtype='int64')
----------------------------------------------------------------------------------------------------
tot

#### Import

In [38]:
%matplotlib inline
from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)

In [59]:
# preview data
print("df shape:"), print(df.shape), print("---"*20)
print("df columns:"), print(df.columns), print("---"*20)
print("df select_dtypes(include=[np.number]).columns.values:"), print(df.select_dtypes(include=[np.number]).columns.values), print("---"*20)
print("df select_dtypes(exclude=[np.number]).columns:"), print(df.select_dtypes(exclude=[np.number]).columns), print("---"*20)
print("df dtypes.sort_values(ascending=False):"), print(df.dtypes.sort_values(ascending=False)), print("---"*20)
print("df head().T:"), print(df.head().T), print("---"*20)
print("df isnull().sum().sum():"), print(df.isnull().sum().sum()), print("---"*20)
print("df isna().sum().sort_values(ascending=False):"), print(df.isna().sum().sort_values(ascending=False)), print("---"*20)
# nan finder
print("columns[df.isna().any()].tolist():"), print(df.columns[df.isna().any()].tolist()), print("")
# stats data
print("df corr().T:"), print(df.corr().T), print("")
print("df describe(include='all').T:"), print(df.describe(include='all').T), print("")

df shape:
(748, 5)
------------------------------------------------------------
df columns:
Index(['months_since_last_donation', 'number_of_donations', 'total_volume_donated', 'months_since_first_donation', 'made_donation_in_march_2007'], dtype='object')
------------------------------------------------------------
df select_dtypes(include=[np.number]).columns.values:
['months_since_last_donation' 'number_of_donations' 'total_volume_donated'
 'months_since_first_donation' 'made_donation_in_march_2007']
------------------------------------------------------------
df select_dtypes(exclude=[np.number]).columns:
Index([], dtype='object')
------------------------------------------------------------
df dtypes.sort_values(ascending=False):
made_donation_in_march_2007    int64
months_since_first_donation    int64
total_volume_donated           int64
number_of_donations            int64
months_since_last_donation     int64
dtype: object
-----------------------------------------------------------

(None, None, None)

## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

In [122]:
from sklearn.metrics import accuracy_score

# Data source
X = df.drop(columns=["made_donation_in_march_2007"], axis=1)
y = df["made_donation_in_march_2007"]

# Majority class baseline = mode
majority_class = y.mode()[0]
y_pred = np.full(shape=y.shape, fill_value=majority_class)

# Accuracy score
accuracy = accuracy_score(y,y_pred)
print('Accuracy:',accuracy)

Accuracy: 0.7620320855614974
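
The prompt notes that pandas alone can answer this question. A minimal sketch of that route, assuming a stand-in target series rebuilt from the figures reported in this notebook (748 rows and a baseline accuracy of 0.762, which implies 570 negatives and 178 positives); `y_demo` is a hypothetical name, not a variable from the notebook:

```python
import pandas as pd

# Stand-in for the target column, rebuilt from the reported shape (748 rows)
# and baseline accuracy (570 / 748 ≈ 0.762); not loaded from the real file.
y_demo = pd.Series([0] * 570 + [1] * 178, name="made_donation_in_march_2007")

# Relative frequency of each class, most frequent first.
class_shares = y_demo.value_counts(normalize=True)

# A majority class baseline always predicts the most frequent class,
# so its accuracy equals that class's share of the rows.
majority_label = class_shares.index[0]
baseline_accuracy = class_shares.iloc[0]
print(majority_label, baseline_accuracy)
```

On the real data the same idea would be `y.value_counts(normalize=True)`.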


What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

In [123]:
from sklearn.metrics import recall_score
recall = recall_score(y, y_pred)
print('Recall score from majority class baseline:',recall)

Recall score from majority class baseline: 0.0
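
The same answer follows without code. Recall is the share of actual positives that the model finds. The majority class here is the negative class, so the baseline never predicts a positive: there are no true positives, and every actual positive is counted as a false negative.

$$
\text{recall} = \frac{TP}{TP + FN} = \frac{0}{0 + FN} = 0
$$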


## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [124]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.25)
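
Because no seed is set, the split above draws a different partition on every run, so the scores further down will not reproduce exactly. A hedged variant, assuming you want a repeatable split with the same class balance in both sets. The seed value `42` and the synthetic stand-in data are illustrative choices, not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count and class balance as the
# blood donation data (748 rows, 178 positives); feature values are random.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(0, 50, size=(748, 4)),
                      columns=["f0", "f1", "f2", "f3"])
y_demo = pd.Series([0] * 570 + [1] * 178)

# random_state fixes the shuffle; stratify keeps the positive rate
# nearly identical in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True,
    stratify=y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Stratifying matters more when the metric is recall, since a test set with few positives makes the score noisy.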

## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [125]:
import warnings
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
# data processing
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
# model setup
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver = 'lbfgs'))
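
`make_pipeline` names each step after its lowercased class name, and those names become the prefixes used to address parameters, in the form `<step>__<parameter>`. A small sketch that lists them, rebuilding the same pipeline under the hypothetical name `demo_pipeline` so the block runs on its own:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='lbfgs'))

# Auto-generated step names, in pipeline order.
step_names = list(demo_pipeline.named_steps)
print(step_names)

# Nested parameters are addressed as '<step>__<parameter>'.
tunable = [p for p in demo_pipeline.get_params() if '__' in p]
print('selectkbest__k' in tunable, 'logisticregression__C' in tunable)
```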

## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [126]:
# Define param_grid
param_grid = {
    'selectkbest__k': [1,2,3,4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C' : [.0001,.001,.01,.1,1.0,10.0,100.00,1000.0,10000.0]
}

# Fit on the train set, with grid search cross-validation
gs = GridSearchCV(pipeline, param_grid=param_grid,cv=5, scoring='recall', verbose=1)
gs.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:    7.5s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_classif at 0x7f15064ee510>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__class_weight': [None, 'balanced'], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='recall', verbose=1)
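
The log reports 72 candidates and 360 fits. That count is the product of the grid sizes (4 values of `k`, 2 of `class_weight`, 9 of `C`) multiplied by the 5 folds. A quick check with scikit-learn's `ParameterGrid`, using a copy of the grid under the hypothetical name `demo_grid`:

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
}

# Every combination of the listed values is one candidate.
n_candidates = len(ParameterGrid(demo_grid))

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

With `refit=True`, the search also fits the best candidate once more on the full training set; that extra fit is not part of the total in the log.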

## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [127]:
# Cross-Validation Results
validation_score = gs.best_score_
print('Validation Score: ', validation_score)
print('Best parameter:', gs.best_params_)
print('Best estimator:', gs.best_estimator_)

Validation Score:  0.7869710173631743
Best parameter: {'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'selectkbest__k': 2}
Best estimator: Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selectkbest', SelectKBest(k=2, score_func=<function f_classif at 0x7f15064ee510>)), ('logisticregression', LogisticRegression(C=0.1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False))])
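
Another way to get the same information is the `cv_results_` attribute, which holds one entry per candidate. A sketch on synthetic data from `make_classification`, an assumption made so the block runs on its own. The grid is shortened and the scores will not match the notebook's:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 76% negatives).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.76], random_state=0)

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), SelectKBest(f_classif),
                  LogisticRegression(solver='lbfgs')),
    param_grid={'selectkbest__k': [1, 2, 3, 4],
                'logisticregression__class_weight': [None, 'balanced'],
                'logisticregression__C': [0.01, 1.0, 100.0]},
    cv=5, scoring='recall')
demo_search.fit(X_demo, y_demo)

# One row per candidate, best rank first.
results = (pd.DataFrame(demo_search.cv_results_)
           .sort_values('rank_test_score')
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']])
print(results.head())
```

Looking at the whole table, rather than only the winner, shows how close the runner-up candidates are and how much the score varies across folds.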


In [129]:
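# Pull the fitted SelectKBest step out of the best pipeline
selector = gs.best_estimator_.named_steps['selectkbest']
all_names = X_train.columns
# Boolean mask over the input columns: True where the feature was kept
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]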

print(all_names)

print("-"*100)
print('Features selected:')
for name in selected_names:
  print(name)
  
print("-"*100)
print("Features not selected:")
for name in unselected_names:
  print(name)

print("-"*100)
y_pred = gs.predict(X_test)
recall = recall_score(y_test, y_pred)
print('recall_score:', recall)

Index(['months_since_last_donation', 'number_of_donations', 'total_volume_donated', 'months_since_first_donation'], dtype='object')
----------------------------------------------------------------------------------------------------
Features selected:
months_since_last_donation
total_volume_donated
----------------------------------------------------------------------------------------------------
Features not selected:
number_of_donations
months_since_first_donation
----------------------------------------------------------------------------------------------------
recall_score: 0.8571428571428571


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

Calculate accuracy

In [71]:
true_negative  = 85
false_positive = 58
false_negative = 8
true_positive  = 36
predicted_negative = true_negative + false_negative
predicted_positive = true_positive + false_positive
actual_negative = true_negative + false_positive
actual_positive = true_positive + false_negative

accuracy = (true_negative + true_positive) / (true_negative + false_positive + false_negative + true_positive)
precision = true_positive / predicted_positive
recall = true_positive / actual_positive
print(accuracy)

0.6470588235294118


Calculate precision

In [72]:
print(precision)

0.3829787234042553


Calculate recall

In [73]:
print(recall)

0.8181818181818182
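
The three calculations share the same four counts, so they can be wrapped in one helper. A minimal sketch; `metrics_from_counts` is a hypothetical name, not a scikit-learn function:

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return accuracy, precision, and recall from confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,   # correct predictions / all predictions
        'precision': tp / (tp + fp),     # true positives / predicted positives
        'recall': tp / (tp + fn),        # true positives / actual positives
    }

# Counts from the confusion matrix above.
scores = metrics_from_counts(tn=85, fp=58, fn=8, tp=36)
for name, value in scores.items():
    print(name, round(value, 4))
```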


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

In [132]:
from sklearn.preprocessing import RobustScaler

# Data source
X = df.drop(columns=["made_donation_in_march_2007"], axis=1)
y = df["made_donation_in_march_2007"]

# Expand the features with PolynomialFeatures (degree 2 by default) before the split
poly = PolynomialFeatures()
X = poly.fit_transform(X)
X = pd.DataFrame(X)  # column names become integer positions 0..14

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.25)

pipeline = make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver = 'liblinear'))

In [133]:
warnings.filterwarnings(action='ignore', category=RuntimeWarning)
# Define param_grid
param_grid = {
    'selectkbest__k': range(1, len(X_train.columns)+1),
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C' : [.0001,.001,.01,.1,1.0,10.0,100.00,1000.0,10000.0]
}

# Fit on the train set, with grid search cross-validation
gs = GridSearchCV(pipeline, param_grid=param_grid,cv=5, scoring='recall', verbose=1)
gs.fit(X_train, y_train)

Fitting 5 folds for each of 270 candidates, totalling 1350 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1350 out of 1350 | elapsed:   34.7s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_classif at 0x7f15064ee510>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_int...ty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': range(1, 16), 'logisticregression__class_weight': [None, 'balanced'], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='recall', verbose=1)

In [134]:
# Cross-Validation Results
validation_score = gs.best_score_
print('Validation Score: ', validation_score)
print('Best parameter:', gs.best_params_)
print('Best estimator:', gs.best_estimator_)

Validation Score:  0.8003565062388592
Best parameter: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'balanced', 'selectkbest__k': 8}
Best estimator: Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=8, score_func=<function f_classif at 0x7f15064ee510>)), ('logisticregression', LogisticRegression(C=0.01, class_weight='balanced', dual=False,
 ...ty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])


### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?

In [135]:
selector = gs.best_estimator_.named_steps['selectkbest']
all_names = X_train.columns
selected_mask = selector.get_support()
selected_names=all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print(all_names)

print("-"*100)
print('Features selected:')
for name in selected_names:
  print(name)
  
print("-"*100)
print("Features not selected:")
for name in unselected_names:
  print(name)

print("-"*100)
y_pred = gs.predict(X_test)
recall = recall_score(y_test, y_pred)
print('recall_score:', recall)

RangeIndex(start=0, stop=15, step=1)
----------------------------------------------------------------------------------------------------
Features selected:
1
2
3
5
8
9
10
12
----------------------------------------------------------------------------------------------------
Features not selected:
0
4
6
7
11
13
14
----------------------------------------------------------------------------------------------------
recall_score: 0.7169811320754716


### Part 4
Calculate F1 score and False Positive Rate. 

In [136]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))

print("-"*100)
pd.DataFrame(confusion_matrix(y_test, y_pred), 
             columns=['Predicted Negative', 'Predicted Positive'], 
             index=['Actual Negative', 'Actual Positive'])

              precision    recall  f1-score   support

           0       0.85      0.62      0.72       134
           1       0.43      0.72      0.54        53

   micro avg       0.65      0.65      0.65       187
   macro avg       0.64      0.67      0.63       187
weighted avg       0.73      0.65      0.66       187

----------------------------------------------------------------------------------------------------


                 Predicted Negative  Predicted Positive
Actual Negative                  83                  51
Actual Positive                  15                  38


In [139]:
true_negative  = 83
false_positive = 51
false_negative = 15
true_positive  = 38
predicted_negative = true_negative + false_negative
predicted_positive = true_positive + false_positive
actual_negative = true_negative + false_positive
actual_positive = true_positive + false_negative

accuracy = (true_negative + true_positive) / (true_negative + false_positive + false_negative + true_positive)
precision = true_positive / predicted_positive
recall = true_positive / actual_positive

FPR = false_positive/(false_positive+true_negative)
f1 = 2 * precision*recall / (precision+recall)
print('Accuracy:',accuracy)
print('Precision:',precision)
print('Recall:',recall)
print('False Positive Rate:',FPR)
print('F1 Score:',f1)

Accuracy: 0.6470588235294118
Precision: 0.42696629213483145
Recall: 0.7169811320754716
False Positive Rate: 0.3805970149253731
F1 Score: 0.5352112676056338
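
The counts above were typed in by hand from the confusion matrix. A sketch that unpacks them programmatically instead, using label arrays rebuilt from the reported counts as stand-ins for `y_test` and `y_pred` (the `_demo` names are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Stand-in labels rebuilt from the reported counts: TN=83, FP=51, FN=15, TP=38.
y_true_demo = np.array([0] * 83 + [0] * 51 + [1] * 15 + [1] * 38)
y_pred_demo = np.array([0] * 83 + [1] * 51 + [0] * 15 + [1] * 38)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# False positive rate: share of actual negatives flagged as positive.
false_positive_rate = fp / (fp + tn)
f1 = f1_score(y_true_demo, y_pred_demo)
print('False Positive Rate:', false_positive_rate)
print('F1 Score:', f1)
```

Unpacking from `confusion_matrix` avoids transcription mistakes when the split, and therefore the counts, change between runs.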
