
 # Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll measure success using recall score as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [0]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

In [0]:
# Tools:
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

In [45]:
df.head()

|   | months_since_last_donation | number_of_donations | total_volume_donated | months_since_first_donation | made_donation_in_march_2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


In [19]:
df["made_donation_in_march_2007"].value_counts()

0    570
1    178
Name: made_donation_in_march_2007, dtype: int64

In [22]:
df["made_donation_in_march_2007"].shape

(748,)

In [24]:
# Majority-class count divided by the number of observations gives the baseline accuracy:
570 / 748

0.7620320855614974

## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

In [46]:
# Majority class baseline using libraries:
from sklearn.metrics import accuracy_score
import numpy as np

y_true = df["made_donation_in_march_2007"]
majority_class = y_true.mode()[0]
# Predict the majority class for every row; the length comes from the data, not a hard-coded 748
y_pred = np.full(len(y_true), fill_value=majority_class)

accuracy_score(y_true, y_pred)

0.7620320855614974

What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)



---

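Recall asks: "of the donors who actually donated in March 2007, how many did the model identify?"

In other words, 

Recall = true positives / (true positives + false negatives)

The majority class baseline predicts "no donation" (0) for every donor, so it never produces a true positive, and all 178 actual donors are false negatives:

Recall = 0 / (0 + 178) = 0

In our majority class baseline, accuracy is about 76% but recall is 0%, which is why accuracy alone is misleading for this imbalanced target.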

---
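A quick check with scikit-learn, treating a March 2007 donation (label 1) as the positive class. This is a sketch: the target array is rebuilt from the class counts reported by `value_counts()` (570 zeros, 178 ones) rather than loaded from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Rebuild the target from the class counts (assumption: order does not matter here).
y_true = np.array([0] * 570 + [1] * 178)

# A majority class baseline predicts the most common label for every row.
majority_class = np.bincount(y_true).argmax()
y_pred = np.full_like(y_true, majority_class)

baseline_accuracy = accuracy_score(y_true, y_pred)
# recall_score treats label 1 as the positive class by default.
baseline_recall = recall_score(y_true, y_pred)

print('Accuracy:', baseline_accuracy)
print('Recall:', baseline_recall)
```

Because no row is ever predicted as 1, the baseline cannot recover any actual donor, whatever its accuracy.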



## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [0]:
from sklearn.model_selection import train_test_split

# Splitting data into train, test sets:
X = df.drop("made_donation_in_march_2007", axis=1)
y = df["made_donation_in_march_2007"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True)

## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [51]:
# Ensuring we have no nulls and all numeric features for Scikit-learn:
def no_nulls(df):
    return not any(df.isnull().sum())
  
def all_numeric(df):
    from pandas.api.types import is_numeric_dtype
    return all(is_numeric_dtype(df[col]) for col in df)


no_nulls(X_train), all_numeric(X_train)

(True, True)

In [0]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

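# Define the estimator (the param_grid is set up in Part 2.2).
# Note: the instructions specify f_classif; this run used f_regression,
# which is what the recorded output in Part 3 reflects.
pipe = make_pipeline(
    RobustScaler(), 
    SelectKBest(f_regression), 
    LogisticRegression(solver='lbfgs'))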
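
The instructions ask for `SelectKBest(f_classif)`, while the cell above uses `f_regression`. A minimal sketch of the pipeline as specified is shown here. It is fitted on synthetic data from `make_classification`, which is an assumption for illustration and not the blood donation data; the `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the four donor features (not the real data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=0)

# f_classif is the ANOVA F-test intended for classification targets.
pipe_demo = make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif, k=2),
    LogisticRegression(solver='lbfgs'))

pipe_demo.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name;
# these names are the prefixes used in a GridSearchCV param_grid.
step_names = list(pipe_demo.named_steps)
print(step_names)
```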



## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [73]:
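# Setting up the parameter grid.
# Note: C=10.0 from the instructions is missing here, so the search covers
# 4 * 2 * 8 = 64 candidates rather than 72.
param_grid = {
    'selectkbest__k': (1,2,3,4), 
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 100.00, 1000.0, 10000.0]
}

# Fitting on the train set with GridSearchCV.
# Note: the instructions ask for scoring='recall'; this run used
# 'neg_mean_absolute_error', so best_score_ is a negated error rate, not recall.
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5,
                 scoring='neg_mean_absolute_error',
                 verbose=1)

gs.fit(X_train, y_train)
val_score = gs.best_score_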



Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 320 out of 320 | elapsed:    3.0s finished
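
The instructions call for recall as the scoring metric and nine values of `C`, whereas the run above scored with `neg_mean_absolute_error` and searched eight values of `C`. The following sketch shows the search as specified. It runs on synthetic, imbalanced data from `make_classification` (an assumption, not the donation data), so its scores say nothing about the real problem.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in with roughly the same class imbalance (76% / 24%).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0,
    weights=[0.76, 0.24], random_state=0)

pipe_demo = make_pipeline(
    RobustScaler(), SelectKBest(f_classif), LogisticRegression(solver='lbfgs'))

# The full grid from the instructions: 4 * 2 * 9 = 72 candidates.
param_grid_demo = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0,
                              10.0, 100.0, 1000.0, 10000.0],
}

# scoring='recall' makes best_score_ the mean cross-validated recall.
gs_demo = GridSearchCV(pipe_demo, param_grid=param_grid_demo,
                       cv=5, scoring='recall')
gs_demo.fit(X_demo, y_demo)

n_candidates = len(gs_demo.cv_results_['params'])
print('Candidates:', n_candidates)
print('Best CV recall:', gs_demo.best_score_)
```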


## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [75]:
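# best_score_ is the negated MAE. For 0/1 labels, MAE equals the misclassification
# rate, so the value printed here is an error rate (about 78% accuracy), not recall.
print('Cross-Validation Score:', -val_score)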
print('\n Best estimator:', gs.best_estimator_)

Cross-Validation Score: 0.2192513368983957

 Best estimator: Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=4, score_func=<function f_regression at 0x7f2c754b9620>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_i...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])
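
`best_estimator_` prints the whole pipeline, while `best_params_` returns only the tuned values (`k`, `class_weight`, `C`). A small sketch follows, using synthetic data and a reduced grid (both are assumptions made to keep the example short).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (not the donation data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=0)

pipe_demo = make_pipeline(
    RobustScaler(), SelectKBest(f_classif), LogisticRegression(solver='lbfgs'))

# Reduced grid for illustration only.
small_grid = {
    'selectkbest__k': [1, 2],
    'logisticregression__class_weight': [None, 'balanced'],
    'logisticregression__C': [0.1, 1.0],
}

gs_demo = GridSearchCV(pipe_demo, param_grid=small_grid, cv=5, scoring='recall')
gs_demo.fit(X_demo, y_demo)

# best_params_ is a plain dict keyed by '<step name>__<parameter>'.
print('Best CV score:', gs_demo.best_score_)
for name, value in sorted(gs_demo.best_params_.items()):
    print(name, '=', value)
```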


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

In [0]:
TP = 36
TN = 85
FP = 58
FN = 8
Total = TP + TN + FP + FN

Calculate accuracy

In [39]:
accuracy = (TP+TN) / Total
print("Accuracy = {}".format(accuracy))

Accuracy = 0.6470588235294118


Calculate precision

In [41]:
precision = TP/(FP+TP)
print("Precision = {}".format(precision))

Precision = 0.3829787234042553


Calculate recall

In [42]:
# Recall is the True Positive Rate (aka 'Sensitivity'):
recall = TP/(TP+FN)
print("Recall = {}".format(recall))

Recall = 0.8181818181818182




---


**BONUS:**




Calculate F1 Score

In [77]:
F1_score = ((precision*recall)/(precision+recall))*2
print("F1 Score = {}".format(F1_score))

F1 Score = 0.5217391304347826


Calculate False Positive Rate

In [44]:
false_positive_rate = FP/(FP+TN)
print("False Positive Rate = {}".format(false_positive_rate))

False Positive Rate = 0.40559440559440557
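
As a cross-check, the hand-computed metrics can be reproduced with scikit-learn by expanding the four counts into label arrays. The array construction is an illustration and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

TP, TN, FP, FN = 36, 85, 58, 8

# One (actual, predicted) pair per observation in each cell of the matrix.
y_true = np.array([1] * TP + [0] * TN + [0] * FP + [1] * FN)
y_pred = np.array([1] * TP + [0] * TN + [1] * FP + [0] * FN)

# For labels 0 and 1, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    'accuracy': accuracy_score(y_true, y_pred),
    'precision': precision_score(y_true, y_pred),
    'recall': recall_score(y_true, y_pred),
    'f1': f1_score(y_true, y_pred),
    # scikit-learn has no direct FPR scorer, so compute it from the counts.
    'false_positive_rate': fp / (fp + tn),
}

for name, value in metrics.items():
    print(f'{name}: {value:.4f}')
```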


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?
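
One possible approach, sketched under stated assumptions: the `SelectKBest` step exposes a boolean mask through `get_support()`, which can index the training columns to recover feature names. The data below is synthetic (random values under three of the notebook's column names, with a made-up target tied to recency), so the printed names and score are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (assumption, not the real donor records).
rng = np.random.default_rng(0)
n = 400
X_demo = pd.DataFrame({
    'months_since_last_donation': rng.integers(0, 40, n),
    'number_of_donations': rng.integers(1, 30, n),
    'months_since_first_donation': rng.integers(2, 100, n),
})
# Made-up target: recent donors are more likely to be labelled 1.
y_demo = (X_demo['months_since_last_donation']
          + rng.normal(0, 8, n) < 12).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

pipe_demo = make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif, k=2),
    LogisticRegression(solver='lbfgs', class_weight='balanced'))
pipe_demo.fit(X_train, y_train)

# get_support() returns a boolean mask over the input columns.
mask = pipe_demo.named_steps['selectkbest'].get_support()
selected_features = list(X_train.columns[mask])
print('Selected features:', selected_features)

# Final evaluation: score once on the held-out test set.
test_recall = recall_score(y_test, pipe_demo.predict(X_test))
print('Test recall:', test_recall)
```

With a fitted `GridSearchCV`, the same mask would be read from `gs.best_estimator_.named_steps['selectkbest']`.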

### Part 4
Calculate F1 score and False Positive Rate. 