# Supervised Learning with scikit-learn - Part 3

> Chapter 3 - How good is your model?

- toc: true
- branch: master
- badges: true
- comments: true
- author: Hai Nguyen
- categories: [Datacamp, Machine Learning, Supervised Learning, Python, Classification, Overfitting, Underfitting, scikit-learn]
- image: images/supervised_learning_p3.png
- hide: false
- search_exclude: true

> Having trained models, now you will learn how to evaluate them. In this chapter, you will be introduced to several metrics along with a visualization technique for analyzing classification model performance using scikit-learn. You will also learn how to optimize classification and regression models through the use of hyperparameter tuning.

In [1]:
import pandas as pd
import numpy as np
import warnings

pd.set_option('display.expand_frame_repr', False)

warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)


## 3.1 How good is your model?

Classification metrics
- Measuring model performance with accuracy:
    - Fraction of correctly classified samples
    - Not always a useful metric

Class imbalance
- Classification for predicting fraudulent bank transactions
    - 99% of transactions are legitimate; 1% are fraudulent

- Could build a classifier that predicts NONE of the transactions are fraudulent
    - 99% accurate!
    - But terrible at actually predicting fraudulent transactions
    - Fails at its original purpose

- Class imbalance: Uneven frequency of classes
- Need a different way to assess performance
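
The fraud scenario above can be reproduced with a few lines of NumPy. This is a minimal sketch with made-up labels (1,000 hypothetical transactions, 1% fraudulent), not real transaction data; it only illustrates why accuracy alone can mislead on imbalanced classes.

```python
import numpy as np

# Hypothetical labels: 1,000 transactions, 1% fraudulent (label 1)
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "classifier" that predicts every transaction is legitimate (label 0)
y_pred = np.zeros_like(y_true)

# Accuracy: fraction of correctly classified samples
accuracy = (y_pred == y_true).mean()

# Recall on the fraud class: fraction of actual frauds that were caught
recall_fraud = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy:     {accuracy:.2f}")
print(f"fraud recall: {recall_fraud:.2f}")
```

The accuracy is high only because the majority class dominates; the recall on the fraud class shows that no fraudulent transaction is ever detected.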

![](./images/metrics1.png)

![](./images/metrics2.png)

![](./images/metrics3.png)

![](./images/metrics4.png)



In [9]:
df = pd.read_csv('./datasets/diabetes_clean.csv', index_col=None)
display(df.head())

diabetes_df = df.loc[(df['glucose'] != 0) & (df['bmi'] != 0)].copy()


X = diabetes_df[["bmi", "age"]].values
y = diabetes_df["diabetes"].values

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [10]:
# Confusion matrix in scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix


knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))

[[151  47]
 [ 47  56]]
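
For binary labels 0 and 1, scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, so the layout is `[[TN, FP], [FN, TP]]`. As a rough sketch, the usual metrics can be derived by hand from the matrix printed above (the values are copied from that output; the variable names are only illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = np.array([[151, 47],
               [47, 56]])

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # correct predictions / all predictions
precision = tp / (tp + fp)           # true positives / predicted positives
recall = tp / (tp + fn)              # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1-score:  {f1:.3f}")
```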



### Deciding on a primary metric

As you have seen, several metrics can be useful to evaluate the performance of classification models, including accuracy, precision, recall, and F1-score.

In this exercise, you will be provided with three different classification problems, and your task is to select the problem where precision is best suited as the primary metric.

> Answer: A model predicting if a customer is a high-value lead for a sales team with limited capacity.

With limited capacity, the sales team needs the model to return the highest proportion of true positives compared to all predicted positives, thus minimizing wasted effort.
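
To make the trade-off concrete, here is a small sketch with invented data: ten hypothetical customers and two hypothetical models. Model A flags few customers but is right about each one; model B flags many customers so that it catches every true lead. The labels and predictions are assumptions for illustration only.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth for 10 customers: 1 = high-value lead
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical model A: flags only two customers, both real leads
y_pred_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical model B: flags eight customers to catch every real lead
y_pred_b = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

precision_a = precision_score(y_true, y_pred_a)
recall_a = recall_score(y_true, y_pred_a)
precision_b = precision_score(y_true, y_pred_b)
recall_b = recall_score(y_true, y_pred_b)

print(f"model A: precision={precision_a:.2f}, recall={recall_a:.2f}")
print(f"model B: precision={precision_b:.2f}, recall={recall_b:.2f}")
```

With a sales team that can only follow up on a handful of leads, model A wastes no calls, whereas half of the calls prompted by model B go to customers who are not high-value leads.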


### Assessing a diabetes prediction classifier

In this chapter you'll work with the diabetes_df dataset introduced previously.

The goal is to predict whether or not each individual is likely to have diabetes based on the features body mass index (BMI) and age (in years). Therefore, it is a binary classification problem. A target value of 0 indicates that the individual does not have diabetes, while a value of 1 indicates that the individual does have diabetes.

diabetes_df has been preloaded for you as a pandas DataFrame and split into X_train, X_test, y_train, and y_test. In addition, a KNeighborsClassifier() has been instantiated and assigned to knn.

You will fit the model, make predictions on the test set, then produce a confusion matrix and classification report.

Instructions:
- Import confusion_matrix and classification_report.
- Fit the model to the training data.
- Predict the labels of the test set, storing the results as y_pred.
- Compute and print the confusion matrix and classification report for the test labels versus the predicted labels.


In [11]:
# Import confusion_matrix and classification_report
from sklearn.metrics import classification_report, confusion_matrix

knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[168  30]
 [ 69  34]]
              precision    recall  f1-score   support

           0       0.71      0.85      0.77       198
           1       0.53      0.33      0.41       103

    accuracy                           0.67       301
   macro avg       0.62      0.59      0.59       301
weighted avg       0.65      0.67      0.65       301



The model produced 168 true negatives, 34 true positives, 30 false positives, and 69 false negatives. The classification report shows a better F1-score for the zero class, which represents individuals who do not have diabetes.
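
The per-class rows of the classification report can be recomputed from the confusion matrix as a sanity check. In scikit-learn's layout (rows are true labels, columns are predicted labels), each class's precision is its diagonal entry divided by its column sum, and its recall is the diagonal entry divided by its row sum. A minimal sketch, using the matrix printed above:

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = np.array([[168, 30],
               [69, 34]])

correct = np.diag(cm)                 # correctly classified samples per class

precision = correct / cm.sum(axis=0)  # divide by number predicted as each class
recall = correct / cm.sum(axis=1)     # divide by number truly in each class
f1 = 2 * precision * recall / (precision + recall)

for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.2f}, "
          f"recall={recall[label]:.2f}, f1={f1[label]:.2f}")
```

Rounded to two decimals, these values agree with the precision, recall, and f1-score columns of the report for classes 0 and 1.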

## 3.2 Logistic regression and the ROC curve



### Building a logistic regression model


### The ROC curve


### ROC AUC


## 3.3 Hyperparameter tuning


### Hyperparameter tuning with GridSearchCV


### Hyperparameter tuning with RandomizedSearchCV