 # Pierre D. Data Science Unit 2 Sprint Challenge 4 — Model Validation

In [0]:
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import make_pipeline, Pipeline 
from sklearn.metrics import accuracy_score, recall_score
from sklearn.preprocessing import MaxAbsScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split


## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll measure success using recall score as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [0]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

Accuracy score = .76

Recall score = 0

In [3]:
df.head()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_in_march_2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [4]:
mcb1 = 1 - df.made_donation_in_march_2007.mean()
mcb1  # share of the majority class (0): 1 minus the mean of the binary target

0.7620320855614973
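The prompt also allows a scikit-learn answer. The following is a minimal sketch, not the notebook's own cell: it uses a synthetic label vector (an assumption) with 570 zeros and 178 ones, chosen to match the 748 rows and the 0.762 majority share shown above, and scores a `DummyClassifier` with the `most_frequent` strategy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the target column (assumption): 570 zeros, 178 ones.
y_demo = np.array([0] * 570 + [1] * 178)
# The dummy model ignores feature values, so a placeholder column is enough.
X_demo = np.zeros((len(y_demo), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

# Accuracy of always predicting the most frequent class.
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_accuracy)
```

With the real data, `X_demo` and `y_demo` would be replaced by the feature columns and `df.made_donation_in_march_2007`.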

What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

In [0]:
# No train/test split yet (the prompt allows this): both sets are the full dataset.
X_train = df.drop(columns='made_donation_in_march_2007')
y_train = df.made_donation_in_march_2007 

X_test  = df.drop(columns='made_donation_in_march_2007')
y_test  = df.made_donation_in_march_2007 

In [6]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((748, 4), (748, 4), (748,), (748,))

In [0]:
import numpy as np

majority_class = y_train.mode()[0]
y_pred = np.full(shape=y_test.shape, fill_value=majority_class)

In [8]:
from sklearn.metrics import recall_score
recall_score(y_test, y_pred)

0.0
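The zero follows directly from the definition of recall, TP / (TP + FN): a model that never predicts the positive class has no true positives. A small sketch on toy labels (the arrays are illustrative assumptions, not the donation data):

```python
import numpy as np

# Toy labels (assumption): two positives among eight samples.
y_true_demo = np.array([0, 0, 0, 1, 0, 1, 0, 0])
# Majority class baseline: predict 0 for every sample.
y_pred_demo = np.zeros_like(y_true_demo)

true_positives = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))
false_negatives = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))

# Recall = TP / (TP + FN); TP is 0 because no positive is ever predicted.
recall_demo = true_positives / (true_positives + false_negatives)
print(recall_demo)
```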

## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [0]:
X = df.drop(columns='made_donation_in_march_2007')
y = df['made_donation_in_march_2007'] 

In [0]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, test_size=.25)

In [11]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((561, 4), (187, 4), (561,), (187,))
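The split above is random on every run. One possible variation, sketched here on synthetic data, adds `random_state` for reproducibility and `stratify` to keep the class balance similar in both sets. Both arguments are assumptions added for illustration; the notebook's own split uses neither.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (assumption): 748 rows, 4 features, about 24% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(748, 4))
y_demo = np.array([0] * 570 + [1] * 178)

# Shuffled 75/25 split, stratified on the target and seeded for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True,
    stratify=y_demo, random_state=42)

print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```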

## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [0]:
# Define an estimator and param_grid
pipe = Pipeline(steps=[
    ('MaxAbsScaler', MaxAbsScaler()), 
    ('SelectKBest', SelectKBest(f_classif)), 
    ('LogisticRegression', LogisticRegression(solver='saga'))])
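An alternative sketch uses `make_pipeline`, which is imported at the top of the notebook but not used. It names each step automatically with the lowercased class name, so grid keys would then take the form `selectkbest__k` rather than `SelectKBest__k`. The variable names here are illustrative.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Same three steps as above, with automatically generated step names.
auto_pipe = make_pipeline(
    MaxAbsScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='saga'))

# Step names are the lowercased class names.
step_names = list(auto_pipe.named_steps)
print(step_names)
```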




In [0]:
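# class_weight takes the None object or the lowercase string 'balanced';
# the strings 'None' and 'Balanced' used in the recorded run below are not recognized options.
parameter_grid = {
    'SelectKBest__k': range(1, 5),
    'LogisticRegression__class_weight': [None, 'balanced'],
    'LogisticRegression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
}

# The `iid` argument was removed from GridSearchCV in later scikit-learn releases, so it is omitted.
gscv = GridSearchCV(pipe, param_grid=parameter_grid, cv=5, scoring='recall',
                    verbose=10, n_jobs=10, return_train_score=True)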



## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [14]:
gscv.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 tasks      | elapsed:    6.3s
[Parallel(n_jobs=10)]: Done  12 tasks      | elapsed:    6.5s
[Parallel(n_jobs=10)]: Done  21 tasks      | elapsed:    6.8s
[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:    7.0s
[Parallel(n_jobs=10)]: Done  41 tasks      | elapsed:    7.4s
[Parallel(n_jobs=10)]: Done  52 tasks      | elapsed:    7.6s
[Parallel(n_jobs=10)]: Done  65 tasks      | elapsed:    7.9s
[Parallel(n_jobs=10)]: Done  78 tasks      | elapsed:    8.2s
[Parallel(n_jobs=10)]: Done  93 tasks      | elapsed:    8.5s
[Parallel(n_jobs=10)]: Done 108 tasks      | elapsed:    8.8s
[Parallel(n_jobs=10)]: Done 125 tasks      | elapsed:    9.1s
[Parallel(n_jobs=10)]: Done 142 tasks      | elapsed:    9.4s
[Parallel(n_jobs=10)]: Done 161 tasks      | elapsed:    9.8s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:   10.2s
[Parallel(n_jobs=10)]: Done 201 tasks      | elapsed:  

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('MaxAbsScaler', MaxAbsScaler(copy=True)), ('SelectKBest', SelectKBest(k=10, score_func=<function f_classif at 0x7f15e5606730>)), ('LogisticRegression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='saga',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=10,
       param_grid={'SelectKBest__k': range(1, 5), 'LogisticRegression__class_weight': ['None', 'Balanced'], 'LogisticRegression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='recall', verbose=10)
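For reference, here is a self-contained sketch of the grid described in this part, run on synthetic data from `make_classification` rather than the donation data. The dataset, the step names (`scale`, `select`, `clf`), and `max_iter=1000` are assumptions made for the example. It passes `class_weight` as the `None` object and the lowercase string `'balanced'`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic, imbalanced binary problem with 4 features (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.76], random_state=0)

demo_pipe = Pipeline(steps=[
    ('scale', MaxAbsScaler()),
    ('select', SelectKBest(f_classif)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# 4 values of k x 2 class weights x 9 values of C = 72 candidates.
demo_grid = {
    'select__k': [1, 2, 3, 4],
    'clf__class_weight': [None, 'balanced'],
    'clf__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
}

demo_search = GridSearchCV(demo_pipe, param_grid=demo_grid, cv=5, scoring='recall')
demo_search.fit(X_demo, y_demo)

n_candidates = len(demo_search.cv_results_['params'])
print(n_candidates, demo_search.best_params_, demo_search.best_score_)
```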

## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [15]:
print('Best Cross Validation Score using recall', gscv.best_score_)

Best Cross Validation Score using recall 0.16406396798553663


**Best Parameters**

k: 4

class_weight: None

C: 1000

In [16]:
gscv.best_estimator_


Pipeline(memory=None,
     steps=[('MaxAbsScaler', MaxAbsScaler(copy=True)), ('SelectKBest', SelectKBest(k=4, score_func=<function f_classif at 0x7f15e5606730>)), ('LogisticRegression', LogisticRegression(C=1000.0, class_weight='None', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='saga', tol=0.0001, verbose=0, warm_start=False))])
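Besides `best_score_` and `best_estimator_`, a fitted search exposes `best_params_` and `cv_results_`. The sketch below shows one way to display them, using a small synthetic problem and a one-parameter grid. Both are assumptions chosen to keep the example short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic problem and a tiny grid (assumptions, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0)
demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 1.0, 100.0]},
    cv=5, scoring='recall')
demo_search.fit(X_demo, y_demo)

print('Best CV recall:', demo_search.best_score_)
print('Best parameters:', demo_search.best_params_)

# Full results table, best-ranked candidate first.
results = (pd.DataFrame(demo_search.cv_results_)
           .sort_values('rank_test_score')
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']])
print(results)
```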

## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

In [0]:
# Another way to read the table is to label the four cells a, b, c, d (as in the formula images below), which makes the metrics easier to explain.


<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85(A)</td>
    <td>58(B)</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8(C)</td>
    <td>36(D)</td>
  </tr>
</table>

Calculate accuracy

![alt text](http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/cm1.gif)

In [18]:
Accuracy = (85 + 36) / (85 + 36 + 8 + 58)
print('Accuracy is', Accuracy)

Accuracy is 0.6470588235294118


Calculate precision

![alt text](http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/cm6.gif)

In [19]:
Precision = 36 / (58 + 36)  # TP / (TP + FP): 58 actual negatives were predicted positive
print('Precision is', Precision)

Precision is 0.3829787234042553


Calculate recall

![alt text](http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/cm2.gif)

In [20]:
Recall = 36 / (8 + 36)  # TP / (TP + FN): 8 actual positives were predicted negative
print('Recall is', Recall)

Recall is 0.8181818181818182
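The hand calculations in this part can be cross-checked with scikit-learn. This sketch rebuilds label arrays that reproduce the confusion matrix above (rows are actual classes, columns are predicted classes) and then calls the library metrics. The array construction is an illustrative device, not part of the original exercise.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Cell counts from the table: actual negative row, then actual positive row.
tn, fp, fn, tp = 85, 58, 8, 36

# Label arrays that reproduce exactly those counts.
y_true_cm = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred_cm = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

cm = confusion_matrix(y_true_cm, y_pred_cm)
accuracy_cm = accuracy_score(y_true_cm, y_pred_cm)    # (TP + TN) / total
precision_cm = precision_score(y_true_cm, y_pred_cm)  # TP / (TP + FP)
recall_cm = recall_score(y_true_cm, y_pred_cm)        # TP / (TP + FN)

print(cm)
print(accuracy_cm, precision_cm, recall_cm)
```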


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?
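One possible approach to showing the selected feature names is the boolean mask returned by `SelectKBest.get_support()`. The sketch below uses random data with the notebook's column names and a toy target. The data, the target, and `k=2` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Random stand-in features using the notebook's column names (assumption).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=['months_since_last_donation', 'number_of_donations',
             'total_volume_donated', 'months_since_first_donation'])
# Toy target that depends only on the first two columns.
y_demo = ((X_demo['months_since_last_donation']
           + X_demo['number_of_donations']) > 0).astype(int)

demo_pipe = make_pipeline(
    MaxAbsScaler(), SelectKBest(f_classif, k=2), LogisticRegression(max_iter=1000))
demo_pipe.fit(X_demo, y_demo)

# Boolean mask over the input columns: True where the feature was kept.
mask = demo_pipe.named_steps['selectkbest'].get_support()
selected_names = list(X_demo.columns[mask])
print(selected_names)
```

With a fitted grid search, the same mask is available from the selector step of `best_estimator_`.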

### Part 4
Calculate F1 score and False Positive Rate. 
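A minimal sketch of this calculation, assuming it refers to the Part 4 confusion matrix (TN = 85, FP = 58, FN = 8, TP = 36):

```python
# Cell counts from the Part 4 confusion matrix.
tn, fp, fn, tp = 85, 58, 8, 36

precision_val = tp / (tp + fp)
recall_val = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1_val = 2 * precision_val * recall_val / (precision_val + recall_val)

# False positive rate: share of actual negatives predicted positive.
fpr_val = fp / (fp + tn)

print('F1 is', f1_val)
print('False Positive Rate is', fpr_val)
```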