
 # Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

In [0]:
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error, recall_score
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split, cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.linear_model import LogisticRegression

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll use recall as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [0]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})

In [3]:
df.head(3)

| | months_since_last_donation | number_of_donations | total_volume_donated | months_since_first_donation | made_donation_in_march_2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |


In [4]:
df.shape

(748, 5)

## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

In [5]:
#made_donation_in_march_2007 is binary and 0 is the majority class, so the baseline predicts all zeroes

prediction = [0] * len(df)
accuracy_score(df['made_donation_in_march_2007'], prediction) #our baseline accuracy is 76.20%

0.7620320855614974
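
The same baseline can be read off with pandas alone. This is a minimal sketch that assumes a stand-in Series `y` with 570 zeros and 178 ones; those counts are inferred from the 748 rows and the accuracy printed above, not loaded from the real file.

```python
import pandas as pd

# Stand-in target (assumption): 570 non-donors and 178 donors, 748 rows in total.
y = pd.Series([0] * 570 + [1] * 178, name='made_donation_in_march_2007')

# Relative frequency of each class; the largest share is the
# accuracy of always predicting the majority class.
class_shares = y.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(majority_class, baseline_accuracy)
```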

What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

In [6]:
recall_score(df['made_donation_in_march_2007'], prediction)

0.0
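
Recall is TP / (TP + FN). A baseline that never predicts the positive class has no true positives, so its recall is 0 no matter how many donors there are. A small sketch of that reasoning, assuming 570 negatives and 178 positives (counts inferred from the 748 rows and the 0.762 baseline accuracy, not read from the file):

```python
from sklearn.metrics import confusion_matrix

# Stand-in labels (assumption): 570 negatives, 178 positives.
y_true = [0] * 570 + [1] * 178
# Majority class baseline: always predict 0.
y_pred = [0] * len(y_true)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)

print(tp, fn, recall)
```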

In [0]:
#bonus - feature engineering
df['donations_per_month'] = df['number_of_donations'] / (df['months_since_first_donation'] - df['months_since_last_donation'])
df['volume_per_donation'] = df['total_volume_donated'] / df['number_of_donations']

In [8]:
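#fitting the pipeline complained that the input contains infinite values, so count the finite values per column
import numpy as np
np.isfinite(df).sum() #donations_per_month has only 556 finite values (division by zero when first and last donation fall in the same month), so it is dropped below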

months_since_last_donation     748
number_of_donations            748
total_volume_donated           748
months_since_first_donation    748
made_donation_in_march_2007    748
donations_per_month            556
volume_per_donation            748
dtype: int64

In [9]:
df.isnull().sum() #...and there are no NaNs

months_since_last_donation     0
number_of_donations            0
total_volume_donated           0
months_since_first_donation    0
made_donation_in_march_2007    0
donations_per_month            0
volume_per_donation            0
dtype: int64

In [10]:
df.info() #...and they are properly encoded (no string nulls)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 7 columns):
months_since_last_donation     748 non-null int64
number_of_donations            748 non-null int64
total_volume_donated           748 non-null int64
months_since_first_donation    748 non-null int64
made_donation_in_march_2007    748 non-null int64
donations_per_month            748 non-null float64
volume_per_donation            748 non-null float64
dtypes: float64(2), int64(5)
memory usage: 41.0 KB


In [0]:
df = df.drop(columns='donations_per_month')
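
Dropping the column is the simplest fix. An alternative, sketched here under assumptions rather than taken from the notebook, is to keep the feature and handle the rows where the first and last donation fall in the same month. The toy frame `toy` and the fill value of 0 are illustrative choices only.

```python
import numpy as np
import pandas as pd

# Toy rows (assumption); the second row has first == last donation month.
toy = pd.DataFrame({
    'number_of_donations': [50, 1, 16],
    'months_since_first_donation': [98, 2, 35],
    'months_since_last_donation': [2, 2, 1],
})

active_months = toy['months_since_first_donation'] - toy['months_since_last_donation']

# Turn zero denominators into NaN so the division gives NaN instead of inf,
# then fill those rows with a chosen default (0 here, an arbitrary choice).
rate = toy['number_of_donations'] / active_months.replace(0, np.nan)
toy['donations_per_month'] = rate.fillna(0)

print(toy['donations_per_month'].tolist())
```

Whether 0 is a sensible default depends on the model; the point is only that every value ends up finite.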

## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [0]:
X = df.drop(columns='made_donation_in_march_2007')
y = df['made_donation_in_march_2007']
  
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=.25)
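
Because only about 24% of rows are positive, a plain random split can shift the class balance between train and test. The sketch below runs on synthetic stand-in data (`X_demo`, `y_demo` are assumptions). `stratify` keeps the class ratio similar in both parts, and `random_state=42` is an arbitrary seed that makes the split repeatable. Neither argument is used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same size and class balance as the dataset (assumption).
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.rand(748, 4), columns=['f1', 'f2', 'f3', 'f4'])
y_demo = pd.Series([0] * 570 + [1] * 178)

# 75% train, 25% test, shuffled, with class proportions preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```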

In [13]:
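#an earlier 'outlier cleaning' cell removed every row with class 1, which caused a one-class error when fitting
#that cell was removed, so confirm that both classes are present in the training set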
ytrain.value_counts()

0    420
1    141
Name: made_donation_in_march_2007, dtype: int64

## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [0]:
pipeline = make_pipeline(
           RobustScaler(),
           SelectKBest(f_classif),
           LogisticRegression(solver='lbfgs', max_iter=5000))
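
`make_pipeline` names each step after its lowercased class name, and those names are what a parameter grid refers to with the `step__parameter` syntax. A short sketch that rebuilds the same pipeline under the hypothetical name `demo_pipeline` and inspects it:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

demo_pipeline = make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='lbfgs', max_iter=5000))

# Auto-generated step names, in pipeline order.
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# Nested parameters are addressed as '<step name>__<parameter>'.
default_k = demo_pipeline.get_params()['selectkbest__k']
print(default_k)
```

The default `k` of `SelectKBest` is 10, which is more than the number of features in this dataset, so `k` should be set explicitly before fitting.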

## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [0]:
param_grid = {
    'selectkbest__k': [1,2,3,4],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0],
    'logisticregression__class_weight': [None, 'balanced']
}
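
It can help to know how much work a grid implies before fitting it. A small sketch using `ParameterGrid` to count the candidate combinations in the grid above; the variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__C': [.0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0],
    'logisticregression__class_weight': [None, 'balanced']
}

# 4 values of k * 9 values of C * 2 class weights.
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per fold with 5-fold cross-validation.
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```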

In [16]:
gridsearch = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring='recall')

gridsearch.fit(Xtrain, ytrain)

  f = msb / msw

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_classif at 0x7f751b9e1158>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_int...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0], 'logisticregression__class_weight': [None, 'balanced']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='recall', verbose=0)

## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [20]:
gsresults = pd.DataFrame(gridsearch.cv_results_).sort_values(by='rank_test_score')
gsresults = gsresults[['rank_test_score', 'mean_test_score', 'mean_train_score',
                      'param_selectkbest__k', 'param_logisticregression__class_weight',
                      'param_logisticregression__C']]
gsresults.head(1)



| | rank_test_score | mean_test_score | mean_train_score | param_selectkbest__k | param_logisticregression__class_weight | param_logisticregression__C |
|---|---|---|---|---|---|---|
| 37 | 1 | 0.794702 | 0.796128 | 2 | balanced | 1 |


In [23]:
gsresults.loc[37]

rank_test_score                                  1
mean_test_score                           0.794702
mean_train_score                          0.796128
param_selectkbest__k                             2
param_logisticregression__class_weight    balanced
param_logisticregression__C                      1
Name: 37, dtype: object
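
A fitted `GridSearchCV` also exposes the same information directly through its `best_score_` and `best_params_` attributes. The sketch below is self-contained and runs on synthetic data from `make_classification` with a reduced grid (both are assumptions for illustration), so its numbers will not match the results above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic, imbalanced stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=0)

pipe = make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='lbfgs', max_iter=5000))

# Reduced grid to keep the example fast.
grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__C': [0.01, 1.0, 100.0],
    'logisticregression__class_weight': [None, 'balanced']
}

search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring='recall')
search.fit(X_demo, y_demo)

# Mean cross-validated recall of the best candidate, and its parameters.
print(search.best_score_)
print(search.best_params_)
```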

## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

Calculate accuracy

In [0]:
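false_positive = 58
true_positive = 36

false_negative = 8
true_negative = 85

#correct predictions divided by all predictions
total = true_positive + true_negative + false_positive + false_negative
accuracy = (true_positive + true_negative) / total
accuracy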

0.6470588235294118

Calculate precision

In [0]:
actual_negative = 85 + 58
actual_positive = 8 + 36

predicted_negative = 85 + 8
predicted_positive = 58 + 36

precision = true_positive / predicted_positive
precision

0.3829787234042553

Calculate recall

In [0]:
recall = true_positive / actual_positive
recall

0.8181818181818182

In [0]:
#F1 Score
f1 = 2*precision*recall / (precision+recall)
f1

0.5217391304347826

In [0]:
#False Positive Rate
false_positive_rate = false_positive/actual_negative
false_positive_rate

0.40559440559440557
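
The hand calculations in this part can be cross-checked with scikit-learn by rebuilding label arrays that reproduce the confusion matrix. This is a sketch for verification only; the arrays `y_true` and `y_pred` are constructed from the four counts and are not real predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Rebuild labels from the matrix: 85 TN, 58 FP, 8 FN, 36 TP.
y_true = np.array([0] * 85 + [0] * 58 + [1] * 8 + [1] * 36)
y_pred = np.array([0] * 85 + [1] * 58 + [0] * 8 + [1] * 36)

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# scikit-learn has no dedicated FPR scorer here, so compute it from the counts.
false_positive_rate = fp / (fp + tn)

print(accuracy, precision, recall, f1, false_positive_rate)
```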

## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?
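
One possible way to approach this part, sketched on synthetic data with hypothetical column names (`feature_0` to `feature_4`): the boolean mask from `SelectKBest.get_support()` picks the selected column names, and the held-out test set is scored once with the refitted best estimator. The data, grid, and seeds are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data with named columns (assumption).
X_arr, y_arr = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(5)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_arr, test_size=0.25, stratify=y_arr, random_state=0)

pipe = make_pipeline(
    RobustScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='lbfgs', max_iter=5000))
grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__class_weight': [None, 'balanced']
}
search = GridSearchCV(pipe, param_grid=grid, cv=5, scoring='recall')
search.fit(X_tr, y_tr)

# Boolean mask over the input columns, True where a feature was kept.
selector = search.best_estimator_.named_steps['selectkbest']
selected_names = X_tr.columns[selector.get_support()].tolist()
print(selected_names)

# Final evaluation: recall on the held-out test set, computed once.
test_recall = recall_score(y_te, search.predict(X_te))
print(test_recall)
```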

### Part 4
Calculate F1 score and False Positive Rate. 