
# Data Science Unit 2 Sprint Challenge 4 — Model Validation

Follow the instructions for each numbered part to earn a score of 2. See the bottom of the notebook for a list of ways you can earn a score of 3.

## Predicting Blood Donations

Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive.

The goal is to predict the last column, whether the donor made a donation in March 2007, using information about each donor's history. We'll measure success using recall score as the model evaluation metric.

Good data-driven systems for tracking and predicting donations and supply needs can improve the entire supply chain, making sure that more patients get the blood transfusions they need.

#### Run this cell to load the data:

In [1]:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data')

df = df.rename(columns={
    'Recency (months)': 'months_since_last_donation', 
    'Frequency (times)': 'number_of_donations', 
    'Monetary (c.c. blood)': 'total_volume_donated', 
    'Time (months)': 'months_since_first_donation', 
    'whether he/she donated blood in March 2007': 'made_donation_in_march_2007'
})
df.head()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_in_march_2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [2]:
df.dtypes

months_since_last_donation     int64
number_of_donations            int64
total_volume_donated           int64
months_since_first_donation    int64
made_donation_in_march_2007    int64
dtype: object

In [3]:
df.shape

(748, 5)

In [4]:
df.isna().sum()

months_since_last_donation     0
number_of_donations            0
total_volume_donated           0
months_since_first_donation    0
made_donation_in_march_2007    0
dtype: int64

In [5]:
df['made_donation_in_march_2007'].value_counts()

0    570
1    178
Name: made_donation_in_march_2007, dtype: int64

## Part 1.1 — Begin with baselines

What **accuracy score** would you get here with a **"majority class baseline"?** 
 
(You don't need to split the data into train and test sets yet. You can answer this question either with a scikit-learn function or with a pandas function.)

In [6]:
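# The majority class baseline always predicts the most frequent class,
# so its accuracy equals that class's share of the data.
majority_share = df['made_donation_in_march_2007'].value_counts(normalize=True).max()

print("Accuracy Score: {:.2f}".format(majority_share))

Accuracy Score: 0.76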


What **recall score** would you get here with a **majority class baseline?**

(You can answer this question either with a scikit-learn function or with no code, just your understanding of recall.)

In [0]:
import numpy as np

majority_class = df['made_donation_in_march_2007'].mode()[0]
y_pred = np.full(shape=df['made_donation_in_march_2007'].shape, fill_value=majority_class)



In [8]:
from sklearn.metrics import accuracy_score

accuracy_score(df['made_donation_in_march_2007'], y_pred)


0.7620320855614974

In [9]:
from sklearn.metrics import confusion_matrix

confusion = confusion_matrix(df['made_donation_in_march_2007'], y_pred)
print("Confusion Matrix:\n{}".format(confusion))

Confusion Matrix:
[[570   0]
 [178   0]]


In [10]:
# Recall for class 0 (False) is a perfect 1.00, while recall for class 1 (True) is 0.00.
# The support-weighted average of the two is 0.76; the unweighted (macro) average is 0.50.


from sklearn.metrics import classification_report

print(classification_report(df['made_donation_in_march_2007'], y_pred))

              precision    recall  f1-score   support

           0       0.76      1.00      0.86       570
           1       0.00      0.00      0.00       178

   micro avg       0.76      0.76      0.76       748
   macro avg       0.38      0.50      0.43       748
weighted avg       0.58      0.76      0.66       748
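
The different recall averages in the report can be reproduced directly. The following is a minimal sketch that rebuilds the labels from the class counts shown earlier (570 and 178) instead of loading the data; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts taken from the value_counts() output earlier in the notebook.
y_true = np.array([0] * 570 + [1] * 178)
y_pred = np.zeros_like(y_true)  # majority class baseline: always predict 0

accuracy = accuracy_score(y_true, y_pred)
recall_pos = recall_score(y_true, y_pred)                            # recall for class 1 only
recall_macro = recall_score(y_true, y_pred, average='macro')         # unweighted mean over classes
recall_weighted = recall_score(y_true, y_pred, average='weighted')   # mean weighted by support

print("accuracy:        {:.2f}".format(accuracy))
print("recall (class 1): {:.2f}".format(recall_pos))
print("recall (macro):   {:.2f}".format(recall_macro))
print("recall (weighted): {:.2f}".format(recall_weighted))
```

For the positive class, which is the one the challenge cares about, the majority class baseline has a recall of zero because it never predicts a donation.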





## Part 1.2 — Split data

In this Sprint Challenge, you will use "Cross-Validation with Independent Test Set" for your model evaluation protocol.

First, **split the data into `X_train, X_test, y_train, y_test`**, with random shuffle. (You can include 75% of the data in the train set, and hold out 25% for the test set.)


In [11]:
from sklearn.model_selection import train_test_split

X = df.loc[:, 'months_since_last_donation':'months_since_first_donation'].values
y = df['made_donation_in_march_2007'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=42)

print("X_train: {}".format(X_train.shape[0]))
print("y_train: {}".format(y_train.shape[0]))
print("X_test: {}".format(X_test.shape[0]))
print("y_test: {}".format(y_test.shape[0]))

X_train: 561
y_train: 561
X_test: 187
y_test: 187
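
Because only about 24% of the rows are positive, one option worth considering is a stratified split, which keeps the class ratio the same in both sets. The sketch below is an assumption rather than what the notebook does: it uses placeholder random features and labels rebuilt from the class counts, and passes `stratify` to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape and class counts as the notebook's dataset.
rng = np.random.RandomState(0)
X_demo = rng.rand(748, 4)
y_demo = np.array([0] * 570 + [1] * 178)

# stratify=y_demo keeps the positive rate (about 0.24) the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, shuffle=True, stratify=y_demo, random_state=42
)

print("train positive rate: {:.3f}".format(y_tr.mean()))
print("test positive rate:  {:.3f}".format(y_te.mean()))
```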


## Part 2.1 — Make a pipeline

Make a **pipeline** which includes:
- Preprocessing with any scikit-learn [**Scaler**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- Feature selection with **[`SelectKBest`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)([`f_classif`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html))**
- Classification with [**`LogisticRegression`**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(
    MinMaxScaler(),
    SelectKBest(f_classif),
    LogisticRegression()
)
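
`make_pipeline` names each step after its lowercased class name, and those names become the prefixes used in a parameter grid (for example `selectkbest__k`). A small sketch to list them, using a separate `demo_pipe` so the notebook's own `pipe` is left untouched:

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

demo_pipe = make_pipeline(MinMaxScaler(), SelectKBest(f_classif), LogisticRegression())

# Each entry of .steps is a (name, estimator) pair; the names are generated automatically.
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)
```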


## Part 2.2 — Do Grid Search Cross-Validation

Do [**GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) with your pipeline. Use **5 folds** and **recall score**.

Include these **parameters for your grid:**

#### `SelectKBest`
- `k : 1, 2, 3, 4`

#### `LogisticRegression`
- `class_weight : None, 'balanced'`
- `C : .0001, .001, .01, .1, 1.0, 10.0, 100.00, 1000.0, 10000.0`


**Fit** on the appropriate data.

In [13]:
from sklearn.metrics import recall_score

param_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.00, 1000.0, 10000.0],
    'logisticregression__class_weight': [None, 'balanced']
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5,
                   scoring='recall', verbose=1)

grid.fit(X_train, y_train)

Fitting 5 folds for each of 72 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 360 out of 360 | elapsed:    2.1s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('selectkbest', SelectKBest(k=10, score_func=<function f_classif at 0x7f9f687d3488>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0], 'logisticregression__class_weight': [None, 'balanced']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='recall', verbose=1)
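
The 72 candidates and 360 fits reported in the log follow from the size of the grid: 4 values of `k`, 9 values of `C`, and 2 values of `class_weight`, each evaluated on 5 folds. A short sketch that counts them with `ParameterGrid` (the name `param_grid_demo` is only for illustration):

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0],
    'logisticregression__class_weight': [None, 'balanced'],
}

n_candidates = len(ParameterGrid(param_grid_demo))  # every combination of the three lists
n_fits = n_candidates * 5                           # one fit per candidate per fold

print("candidates: {}, total fits: {}".format(n_candidates, n_fits))
```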

## Part 3 — Show best score and parameters

Display your **best cross-validation score**, and the **best parameters** (the values of `k, class_weight, C`) from the grid search.

(You're not evaluated here on how good your score is, or which parameters you find. You're only evaluated on being able to display the information. There are several ways you can get the information, and any way is acceptable.)

In [14]:
print("best cross-validation score: {:.2f}".format(grid.best_score_))
print()
print("best parameters: {}".format(grid.best_params_))

best cross-validation score: 0.78

best parameters: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'balanced', 'selectkbest__k': 1}


## Part 4 — Calculate classification metrics from a confusion matrix

Suppose this is the confusion matrix for your binary classification model:

<table>
  <tr>
    <th colspan="2" rowspan="2"></th>
    <th colspan="2">Predicted</th>
  </tr>
  <tr>
    <th>Negative</th>
    <th>Positive</th>
  </tr>
  <tr>
    <th rowspan="2">Actual</th>
    <th>Negative</th>
    <td>85</td>
    <td>58</td>
  </tr>
  <tr>
    <th>Positive</th>
    <td>8</td>
    <td>36</td>
  </tr>
</table>

Calculate accuracy

In [15]:
print("Accuracy: {:.2f}".format((36 + 85) / (36 + 85 + 58 + 8)))

Accuracy: 0.65


Calculate precision

In [16]:
print("Precision: {:.2f}".format(36 / (36 + 58)))


Precision: 0.38


Calculate recall

In [17]:
print("Recall: {:.2f}".format(36 / (36 + 8)))

Recall: 0.82
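
The three calculations above, plus the F1 score and False Positive Rate mentioned in the bonus section, can be collected in one place. This is a sketch using the four counts from the Part 4 table; the variable names are illustrative.

```python
# Counts from the Part 4 confusion matrix (rows = actual, columns = predicted).
tn, fp = 85, 58   # actual negative: predicted negative, predicted positive
fn, tp = 8, 36    # actual positive: predicted negative, predicted positive

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of predicted positives, how many are correct
recall = tp / (tp + fn)               # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # of actual negatives, how many are flagged

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1),
                    ("false positive rate", false_positive_rate)]:
    print("{}: {:.2f}".format(name, value))
```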


## BONUS — How you can earn a score of 3

### Part 1
Do feature engineering, to try improving your cross-validation score.

### Part 2
Add transformations in your pipeline and parameters in your grid, to try improving your cross-validation score.

### Part 3
Show names of selected features. Then do a final evaluation on the test set — what is the test score?

### Part 4
Calculate F1 score and False Positive Rate. 

In [18]:
df.head()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_in_march_2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [19]:
# RandomForestClassifier baseline

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# Since we've already done train_test_split

rf = RandomForestClassifier(n_estimators=100)

kfold = KFold(n_splits=5)
scores = cross_val_score(rf, X_train, y_train, cv=kfold)

print("Cross-validation scores:\n{}".format(scores))
print()
print("Cross-validation scores mean:\n{}".format(scores.mean()))
print()
print("Cross-validation scores standard deviation:\n{}".format(scores.std()))

Cross-validation scores:
[0.78761062 0.75       0.79464286 0.78571429 0.74107143]

Cross-validation scores mean:
0.7718078381795197

Cross-validation scores standard deviation:
0.02183970673904591


In [20]:
# The baseline RandomForestClassifier's mean cross-validation accuracy (about 0.77)
# is essentially the same as the majority class baseline (0.76)

df_2 = df.copy()
df_2.head()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_in_march_2007
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


In [21]:
df_2['number_of_donations_and_total_volume_donated'] = df['number_of_donations'] * df['total_volume_donated']
df_2.head()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_in_march_2007,number_of_donations_and_total_volume_donated
0,2,50,12500,98,1,625000
1,0,13,3250,28,1,42250
2,1,16,4000,35,1,64000
3,2,20,5000,45,1,100000
4,1,24,6000,77,0,144000


In [22]:
df_2.describe()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_in_march_2007,number_of_donations_and_total_volume_donated
count,748.0,748.0,748.0,748.0,748.0,748.0
mean,9.506684,5.514706,1378.676471,34.282086,0.237968,16115.975936
std,8.095396,5.839307,1459.826781,24.376714,0.426124,49268.17004
min,0.0,1.0,250.0,2.0,0.0,250.0
25%,2.75,2.0,500.0,16.0,0.0,1000.0
50%,7.0,4.0,1000.0,28.0,0.0,4000.0
75%,14.0,7.0,1750.0,50.0,0.0,12250.0
max,74.0,50.0,12500.0,98.0,1.0,625000.0


In [23]:
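# Outlier removal: keep only rows where every column has |z-score| < 3.
# Note that the z-scores are computed over all columns of df_2, including the binary target.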

from scipy import stats

df_3 = df_2[(np.abs(stats.zscore(df_2)) < 3).all(axis=1)]
df_3.describe()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_in_march_2007,number_of_donations_and_total_volume_donated
count,729.0,729.0,729.0,729.0,729.0,729.0
mean,9.21262,5.060357,1265.089163,33.238683,0.234568,10983.88203
std,7.037385,4.284098,1071.024478,23.551979,0.424019,18375.364357
min,0.0,1.0,250.0,2.0,0.0,250.0
25%,3.0,2.0,500.0,16.0,0.0,1000.0
50%,7.0,4.0,1000.0,28.0,0.0,4000.0
75%,14.0,7.0,1750.0,48.0,0.0,12250.0
max,26.0,23.0,5750.0,98.0,1.0,132250.0
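
To make the filtering rule concrete, here is a small sketch on made-up data. Unlike the cell above, it computes z-scores on the feature columns only and leaves the target out, which is an assumption about what may be preferable rather than the notebook's approach; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up frame: the last row has an extreme value in feature_a.
demo = pd.DataFrame({
    'feature_a': list(range(19)) + [1000],
    'feature_b': [i % 5 for i in range(20)],
    'target': [0, 1] * 10,
})

feature_cols = ['feature_a', 'feature_b']

# Column-wise z-scores of the features; the target column is deliberately excluded.
z = np.abs(stats.zscore(demo[feature_cols].values))

# Keep a row only if all of its feature z-scores are below 3.
keep = (z < 3).all(axis=1)
filtered = demo[keep]

print("rows before: {}, rows after: {}".format(len(demo), len(filtered)))
```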


In [24]:
# Let's run RandomForestClassifier baseline on a DataFrame without Outliers

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# split the data for new dataframe
X_df = df_3.drop('made_donation_in_march_2007', axis=1)
X = X_df.values
y = df_3['made_donation_in_march_2007'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=42)

rf = RandomForestClassifier(n_estimators=100)

kfold = KFold(n_splits=5)
scores = cross_val_score(rf, X_train, y_train, cv=kfold)

print("Cross-validation scores:\n{}".format(scores))
print()
print("Cross-validation scores mean:\n{}".format(scores.mean()))
print()
print("Cross-validation scores standard deviation:\n{}".format(scores.std()))

Cross-validation scores:
[0.69090909 0.74311927 0.77981651 0.77981651 0.80733945]

Cross-validation scores mean:
0.7602001668056714

Cross-validation scores standard deviation:
0.040211253355166225


In [25]:
# Logistic Baseline with one interaction term and outliers removed

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

# split the data for new dataframe
X_df = df_3.drop('made_donation_in_march_2007', axis=1)
X = X_df.values
y = df_3['made_donation_in_march_2007'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=42)

lr = LogisticRegression(solver='lbfgs', max_iter=1000)

kfold = KFold(n_splits=5)
scores = cross_val_score(lr, X_train, y_train, cv=kfold)

print("Cross-validation scores:\n{}".format(scores))
print()
print("Cross-validation scores mean:\n{}".format(scores.mean()))
print()
print("Cross-validation scores standard deviation:\n{}".format(scores.std()))

Cross-validation scores:
[0.73636364 0.76146789 0.82568807 0.81651376 0.8440367 ]

Cross-validation scores mean:
0.7968140116763971

Cross-validation scores standard deviation:
0.04087877154224808


In [26]:
# Logistic Baseline with one interaction term and outliers removed with scaled data

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler

# No need to split the data since we've already done it

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

lr = LogisticRegression(solver='lbfgs', max_iter=1000)

kfold = KFold(n_splits=5)
scores = cross_val_score(lr, X_train_scaled, y_train, cv=kfold)

print("Cross-validation scores:\n{}".format(scores))
print()
print("Cross-validation scores mean:\n{}".format(scores.mean()))
print()
print("Cross-validation scores standard deviation:\n{}".format(scores.std()))

Cross-validation scores:
[0.74545455 0.75229358 0.80733945 0.82568807 0.82568807]

Cross-validation scores mean:
0.7912927439532944

Cross-validation scores standard deviation:
0.03534303191029202
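
In the cell above the scaler is fit on all of `X_train` before cross-validation, so each validation fold contributes to the scaling statistics. One common alternative is to put the scaler inside a pipeline so it is refit on the training portion of every fold. The sketch below shows that pattern on synthetic data from `make_classification`; it is not a rerun of the notebook's experiment, and the names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 5 features, like the engineered frame).
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# The scaler is part of the estimator, so it is fit only on each fold's training rows.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000),
)

scores = cross_val_score(scaled_lr, X_demo, y_demo, cv=KFold(n_splits=5))
print("Cross-validation scores:\n{}".format(scores))
```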




In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

from sklearn.metrics import recall_score

param_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.00, 1000.0, 10000.0],
    'logisticregression__class_weight': [None, 'balanced']
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5,
                   scoring='recall')

grid.fit(X_train, y_train)





GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_classif at 0x7f9f687d3488>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0], 'logisticregression__class_weight': [None, 'balanced']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='recall', verbose=0)

In [28]:
print("LogisticRegression with one interaction term, outliers removed, and scaled data:")
print()
print("best cross-validation score: {:.2f}".format(grid.best_score_))
print()
print("best parameters: {}".format(grid.best_params_))

LogisticRegression with one interaction term, outliers removed, and scaled data:

best cross-validation score: 0.80

best parameters: {'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'selectkbest__k': 2}


In [30]:
# Let's try correcting for imbalanced data with SMOTE

from imblearn.over_sampling import SMOTE

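# Named `smote` rather than `os` to avoid shadowing the standard-library module name.
smote = SMOTE(random_state=42)

# Re-split with a 30% test set, then oversample only the training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

# fit_resample is the current imbalanced-learn method name (older releases also offered fit_sample).
os_data_X, os_data_y = smote.fit_resample(X_train, y_train)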

os_data_y_s = pd.Series(os_data_y)
os_data_y_s.value_counts()

1    394
0    394
dtype: int64

In [31]:
# Logistic Baseline with one interaction term and outliers removed with scaled data and imbalanced data corrected with SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler

# No need to split the data since we've already done it

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(os_data_X)
# X_test_scaled = scaler.transform(X_test)

lr = LogisticRegression(solver='lbfgs', max_iter=1000)

kfold = KFold(n_splits=5)
scores = cross_val_score(lr, X_train_scaled, os_data_y, cv=kfold)

print("Cross-validation scores:\n{}".format(scores))
print()
print("Cross-validation scores mean:\n{}".format(scores.mean()))
print()
print("Cross-validation scores standard deviation:\n{}".format(scores.std()))

Cross-validation scores:
[0.70253165 0.56329114 0.58227848 0.75796178 0.49044586]

Cross-validation scores mean:
0.6193017818269774

Cross-validation scores standard deviation:
0.09722712766985112
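
One possible contributor to the uneven fold scores above is the fold composition. Assuming the resampler returns the original rows first and the synthetic minority rows after them, an unshuffled `KFold` produces folds with very different class mixes. The sketch below only imitates that label ordering, using counts derived from the outputs above (394 majority rows, 116 original minority rows, 278 synthetic minority rows); it does not call SMOTE itself.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.RandomState(0)

# Original training labels in random order, followed by synthetic positives (assumed ordering).
original = rng.permutation(np.array([0] * 394 + [1] * 116))
synthetic = np.ones(278, dtype=int)
y_demo = np.concatenate([original, synthetic])
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this illustration

def positive_fraction_per_fold(splitter):
    """Share of positive labels in each validation fold."""
    return [y_demo[val_idx].mean() for _, val_idx in splitter.split(X_demo, y_demo)]

plain = positive_fraction_per_fold(KFold(n_splits=5))
stratified = positive_fraction_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", np.round(plain, 2))
print("StratifiedKFold:", np.round(stratified, 2))
```

Under that assumption the last unshuffled fold contains only positives, whereas `StratifiedKFold` (or `KFold(shuffle=True)`) keeps the folds comparable.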




In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif),
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

from sklearn.metrics import recall_score

param_grid = {
    'selectkbest__k': [1, 2, 3, 4],
    'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.00, 1000.0, 10000.0],
    'logisticregression__class_weight': [None, 'balanced']
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5,
                   scoring='recall')

grid.fit(os_data_X, os_data_y)





GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('selectkbest', SelectKBest(k=10, score_func=<function f_classif at 0x7f9f687d3488>)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'selectkbest__k': [1, 2, 3, 4], 'logisticregression__C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0], 'logisticregression__class_weight': [None, 'balanced']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='recall', verbose=0)

In [34]:
print("LogisticRegression with one interaction term, outliers removed, scaled data, and imbalance-corrected data:")
print()
print("best cross-validation score: {:.2f}".format(grid.best_score_))
print()
print("best parameters: {}".format(grid.best_params_))

LogisticRegression with one interaction term, outliers removed, scaled data, and imbalance-corrected data:

best cross-validation score: 0.81

best parameters: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'selectkbest__k': 2}


In [38]:
df_3.head()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,made_donation_in_march_2007,number_of_donations_and_total_volume_donated
1,0,13,3250,28,1,42250
2,1,16,4000,35,1,64000
3,2,20,5000,45,1,100000
5,4,4,1000,4,0,4000
6,2,7,1750,14,1,12250


In [37]:
X_train[0]

array([  21,    2,  500,   21, 1000])

In [39]:
y_train[0]

0

In [40]:
X_df = pd.DataFrame(X_train)
X_df.columns = ['months_since_last_donation', 'number_of_donations', 'total_volume_donated', 'months_since_first_donation', 'number_of_donations_and_total_volume_donated']
X_df.head()

Unnamed: 0,months_since_last_donation,number_of_donations,total_volume_donated,months_since_first_donation,number_of_donations_and_total_volume_donated
0,21,2,500,21,1000
1,16,2,500,16,1000
2,2,1,250,2,250
3,14,2,500,21,1000
4,4,2,500,31,1000


In [41]:
# Which features were selected?
selector = grid.best_estimator_.named_steps['selectkbest']
all_names = X_df.columns
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]
unselected_names = all_names[~selected_mask]

print('Features selected:')
for name in selected_names:
    print(name)


Features selected:
months_since_last_donation
total_volume_donated


In [42]:
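# Evaluate the grid that was already fit on the SMOTE-resampled training data.
# grid.score uses the scorer given to GridSearchCV (scoring='recall'), so this value is recall, not accuracy.
# Note: the outputs recorded below came from a run that first called grid.fit(X_test, y_test),
# which refits the search on the test set, so they are not a true held-out estimate.
y_pred = grid.predict(X_test)

print("Test set recall score: {:.2f}".format(grid.score(X_test, y_test)))



Test set recall score: 0.76




In [43]:
print("Test set recall score: {:.2f}".format(grid.score(X_test, y_test)))

Test set recall score: 0.76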




In [44]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.52      0.65       164
           1       0.35      0.76      0.48        55

   micro avg       0.58      0.58      0.58       219
   macro avg       0.61      0.64      0.57       219
weighted avg       0.74      0.58      0.61       219



In [45]:
confusion = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:\n{}".format(confusion))

Confusion Matrix:
[[86 78]
 [13 42]]


In [51]:
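# From the classification report above: F1 for the positive class is 0.48
# (the support-weighted average F1 over both classes is 0.61)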

FP = 78
TN = 86

# False Positive Rate:
FPR = FP / (FP + TN)

print("False Positive Rate: {:.2f}%".format(FPR * 100))

False Positive Rate: 47.56%
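
The F1 score can be derived from the same confusion matrix in the same way as the False Positive Rate. A minimal sketch, assuming scikit-learn's layout of `[[TN, FP], [FN, TP]]` and using the matrix printed above:

```python
import numpy as np

# Confusion matrix printed above, in scikit-learn's [[TN, FP], [FN, TP]] layout.
confusion_demo = np.array([[86, 78],
                           [13, 42]])
tn, fp, fn, tp = confusion_demo.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # F1 for the positive class
false_positive_rate = fp / (fp + tn)

print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 score: {:.2f}".format(f1))
print("False Positive Rate: {:.2f}%".format(false_positive_rate * 100))
```

These values agree with the class 1 row of the classification report (precision 0.35, recall 0.76, F1 0.48).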
