# Supervised Learning with scikit-learn

## Chapter 3

#### Chapter Slides

https://drive.google.com/drive/folders/1YH14o1I9KlcEpPEvtuDh89hxY68C4ibe?usp=sharing

#### Accuracy 

Common metric used to evaluate the performance of a classification model. However, it can be misleading in some cases, **such as imbalanced datasets**, where the majority class dominates the minority class, leading to high accuracy even with poor performance on the minority class.

In such cases, metrics such as **precision**, **recall**, **F1-score**, **AUC-ROC**, and confusion matrix can provide a more comprehensive evaluation of the model's performance.

In a class imbalanced scenario, accuracy can be misleading and can provide a false sense of the model's performance. A model that always predicts the majority class can achieve high accuracy but fail to identify the minority class, which is the important class in this case (fraudulent transactions). 

To address class imbalance, you can use metrics such as **precision**, **recall**,**F1-score**, **AUC-ROC**, and the confusion matrix to get a better understanding of the model's performance. 

Precision measures the proportion of true positive predictions out of all positive predictions, recall measures the proportion of true positive predictions out of all actual positive cases, and F1-score is the harmonic mean of precision and recall. 

**AUC-ROC (Area Under the Receiver Operating Characteristic Curve)** measures the model's ability to distinguish between positive and negative classes. The confusion matrix provides a detailed summary of the true positive, true negative, false positive, and false negative predictions made by the model.

There are several ways to assess the performance of a classification model, depending on the problem you are trying to solve and the distribution of your data. Some commonly used metrics include:

<p>
  <ol>
    <li><b>Accuracy:</b><br>
      The proportion of correct predictions made by the model, calculated as (True Positives + True Negatives) / Total.</li>
    <br>
    <li><b>Precision:</b><br>
      The proportion of true positive predictions out of all positive predictions, calculated as True Positives / (True Positives + False Positives).</li>
    <br>
    <li><b>Recall (or Sensitivity or True Positive Rate):</b><br>
      The proportion of true positive predictions out of all actual positive cases, calculated as True Positives / (True Positives + False Negatives).</li>
    <br>
    <li><b>F1-Score:</b><br>
      The harmonic mean of precision and recall, calculated as 2 * (Precision * Recall) / (Precision + Recall).</li>
    <br>
    <li><b>AUC-ROC (Area Under the Receiver Operating Characteristic Curve):</b><br>
      A measure of the model's ability to distinguish between positive and negative classes, represented by a graph that plots the True Positive Rate against the False Positive Rate.</li>
    <br>
    <li><b>Confusion Matrix:</b><br>
      A table that summarizes the true positive, true negative, false positive, and false negative predictions made by the model.</li>
  </ol>
</p>


#### Precision

Metric used in classification to evaluate the quality of positive predictions made by a model. It measures the proportion of true positive predictions out of all positive predictions made by the model. Precision is calculated as:

<p style="background-color:LavenderBlush;"><b>Precision = True Positives / (True Positives + False Positives)</b></p>

Where True Positives (TP) are the number of samples that are correctly predicted as positive by the model, and False Positives (FP) are the number of samples that are incorrectly predicted as positive by the model. Precision provides information on the ability of the model to make correct positive predictions. High precision means that the model is making fewer false positive predictions, and thus has a low rate of false alarms.

Precision should be used in combination with other metrics, such as recall and F1-score, to get a more complete understanding of the model's performance.

#### Recall

Known as Sensitivity or True Positive Rate, is a metric used in classification to evaluate the ability of a model to identify positive instances. It measures the proportion of true positive predictions made by the model out of all actual positive instances. Recall is calculated as:

<p style="background-color:LavenderBlush;"><b>Recall = True Positives / (True Positives + False Negatives)
</b></p>

Where True Positives (TP) are the number of samples that are correctly predicted as positive by the model, and False Negatives (FN) are the number of samples that are incorrectly predicted as negative by the model. Recall provides information on the ability of the model to find all positive instances. High recall means that the model is able to identify most of the positive instances.

Recall should be used in combination with other metrics, such as precision and F1-score, to get a more complete understanding of the model's performance.

#### The F1 score

A measure of a model's accuracy that balances precision and recall. It is the harmonic mean of precision and recall

<p style="background-color:Lavender;"><b>calculated as 2 * (Precision * Recall) / (Precision + Recall)
</b></p>

The F1 score is a useful metric when there is an imbalance between precision and recall, or when the cost of false positives and false negatives is not equal. A high F1 score indicates a good balance between precision and recall, whereas a low F1 score indicates a poor balance.

### Deciding on a primary metric

As you have seen, several metrics can be useful to evaluate the performance of classification models, including accuracy, precision, recall, and F1-score.

In this exercise, you will be provided with three different classification problems, and your task is to select the problem where precision is best suited as the primary metric.

#### A model predicting the presence of cancer as the positive class.

### Assessing a diabetes prediction classifier

In this chapter you'll work with the diabetes_df dataset introduced previously.

The goal is to predict whether or not each individual is likely to have diabetes based on the features body mass index (BMI) and age (in years). Therefore, it is a binary classification problem. A target value of 0 indicates that the individual does not have diabetes, while a value of 1 indicates that the individual does have diabetes.

diabetes_df has been preloaded for you as a pandas DataFrame and split into X_train, X_test, y_train, and y_test. In addition, a KNeighborsClassifier() has been instantiated and assigned to knn.

You will fit the model, make predictions on the test set, then produce a confusion matrix and classification report.

In [33]:
import pandas as pd


df = pd.read_csv('/home/minmin/Downloads/diabetes_clean.csv')
df.head()

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


#### Instructions

Import confusion_matrix and classification_report.

Fit the model to the training data.

Predict the labels of the test set, storing the results as y_pred.

Compute and print the confusion matrix and classification report for the test labels versus the predicted labels.

In [34]:
# Import necessary libraries
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load your data into a Pandas dataframe
df = pd.read_csv('/home/minmin/Downloads/diabetes_clean.csv')

# Split the data into training and testing sets
X = df.drop('diastolic', axis=1)
y = df['diastolic']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train the KNN classifier on the training data
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# Predict the labels on the test data
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(classification_report(y_test, y_pred, zero_division=1))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.18      0.57      0.28         7
          24       0.00      1.00      0.00         0
          40       0.00      1.00      0.00         0
          44       0.00      0.00      0.00         2
          46       0.00      1.00      0.00         0
          48       0.00      0.00      0.00         1
          50       0.00      0.00      0.00         5
          52       0.00      0.00      0.00         1
          54       0.00      0.00      0.00         4
          55       0.00      1.00      0.00         0
          56       0.00      0.00      0.00         2
          58       0.11      0.67      0.19         3
          60       0.12      0.11      0.12         9
          62       0.00      0.00      0.00         8
          64       0.00      0.00      0.00         9
          65       1.00      0.00      0.00         4
          66       0.10      0.20      0.13         5
          68       0.00    

####  Logistic regression

Type of regression analysis used for solving binary classification problems, i.e. problems where the dependent variable can only take two values, 0 or 1. The output of logistic regression is a probability that a given data point belongs to one of the two classes. The threshold value of 0.5 is used to make the final class prediction, where if the probability is greater than 0.5, it is labeled as 1, otherwise it is labeled as 0.

### Building a logistic regression model

In this exercise, you will build a logistic regression model using all features in the diabetes_df dataset. The model will be used to predict the probability of individuals in the test set having a diabetes diagnosis.

The diabetes_df dataset has been split into X_train, X_test, y_train, and y_test, and preloaded for you.

#### Instructions

Import LogisticRegression.

Instantiate a logistic regression model, logreg.

Fit the model to the training data.

Predict the probabilities of each individual in the test set having a diabetes diagnosis, storing the array of positive probabilities as y_pred_probs.



In [52]:
"""
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train, y_train)

y_pred_probs = logreg.predict_proba(X_test)[:, 1]

print(y_pred_probs[:10])

"""
# Import libraries
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logreg = LogisticRegression(solver='saga')

# Use RFE to select the most relevant features
rfe = RFE(logreg, n_features_to_select=10)
X_train_rfe = rfe.fit_transform(X_train, y_train)

# Fit the model on the selected features
logreg.fit(X_train_rfe, y_train)

# Predict probabilities
y_pred_probs = logreg.predict_proba(rfe.transform(X_test))[:, 1]

print(y_pred_probs[:10])


[0.00222979 0.00123342 0.00083997 0.00235243 0.00226814 0.00173799
 0.000134   0.00022037 0.00161866 0.00208011]




### The ROC curve

Now you have built a logistic regression model for predicting diabetes status, you can plot the ROC curve to visualize how the true positive rate and false positive rate vary as the decision threshold changes.

The test labels, y_test, and the predicted probabilities of the test features belonging to the positive class, y_pred_probs, have been preloaded for you, along with matplotlib.pyplot as plt.

You will create a ROC curve and then interpret the results.

#### Instructions 1/2

Import roc_curve.

Calculate the ROC curve values, using y_test and y_pred_probs, and unpacking the results into fpr, tpr, and thresholds.

Plot true positive rate against false positive rate.

In [73]:
#code to try 

"""
# Import roc_auc_score
from sklearn.metrics import roc_auc_score
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# Instantiate the encoder
encoder = OneHotEncoder()
y_test_bin = pd.get_dummies(y_test).to_numpy()

# Fit and transform the 1-dimensional y_test array to a 2-dimensional binary array
y_test_bin = encoder.fit_transform(y_test.to_numpy().reshape(-1, 1))

# Initialize list to store fpr, tpr, and thresholds values
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# Loop over each class
for i in range(y_test_bin.shape[1]):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_pred_probs[:, i])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = roc_auc_score(y_test_bin[:, i], y_pred_probs[:, i])
    aucs.append(roc_auc)
    
# Compute mean tpr and AUC
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = np.mean(aucs)
std_auc = np.std(aucs)

# Plot the ROC curve
plt.plot(mean_fpr, mean_tpr, label='Mean ROC (AUC = {:0.2f} $\pm$ {:0.2f})'.format(mean_auc, std_auc),
         color='blue', lw=2, alpha=.8)

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.legend(loc="lower right")
plt.show()
"""

#the answer


"""
# Import roc_curve
from sklearn.metrics import roc_curve

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0, 1], [0, 1], 'k--')

# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.show()
# Import roc_curve
from sklearn.metrics import roc_curve

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0, 1], [0, 1], 'k--')

# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.show()
"""

"\n# Import roc_curve\nfrom sklearn.metrics import roc_curve\n\n# Generate ROC curve values: fpr, tpr, thresholds\nfpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)\n\nplt.plot([0, 1], [0, 1], 'k--')\n\n# Plot tpr against fpr\nplt.plot(fpr, tpr)\nplt.xlabel('False Positive Rate')\nplt.ylabel('True Positive Rate')\nplt.title('ROC Curve for Diabetes Prediction')\nplt.show()\n# Import roc_curve\nfrom sklearn.metrics import roc_curve\n\n# Generate ROC curve values: fpr, tpr, thresholds\nfpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)\n\nplt.plot([0, 1], [0, 1], 'k--')\n\n# Plot tpr against fpr\nplt.plot(fpr, tpr)\nplt.xlabel('False Positive Rate')\nplt.ylabel('True Positive Rate')\nplt.title('ROC Curve for Diabetes Prediction')\nplt.show()\n"

### ROC AUC

The ROC curve you plotted in the last exercise looked promising.

Now you will compute the area under the ROC curve, along with the other classification metrics you have used previously.

The confusion_matrix and classification_report functions have been preloaded for you, along with the logreg model you previously built, plus X_train, X_test, y_train, y_test. Also, the model's predicted test set labels are stored as y_pred, and probabilities of test set observations belonging to the positive class stored as y_pred_probs.

A knn model has also been created and the performance metrics printed in the console, so you can compare the roc_auc_score, confusion_matrix, and classification_report between the two models.

#### Instructions

Import roc_auc_score.

Calculate and print the ROC AUC score, passing the test labels and the predicted positive class probabilities.

Calculate and print the confusion matrix.

Call classification_report().

In [87]:
"""
# Import roc_auc_score
from sklearn.metrics import roc_auc_score

# Calculate roc_auc_score
print(roc_auc_score(y_test, y_pred_probs, multi_class='ovr'))

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the classification report
print(classification_report(y_test, y_pred))
"""

"\n# Import roc_auc_score\nfrom sklearn.metrics import roc_auc_score\n\n# Calculate roc_auc_score\nprint(roc_auc_score(y_test, y_pred_probs, multi_class='ovr'))\n\n# Calculate the confusion matrix\nprint(confusion_matrix(y_test, y_pred))\n\n# Calculate the classification report\nprint(classification_report(y_test, y_pred))\n"

### Hyperparameter tuning with GridSearchCV

Now you have seen how to perform grid search hyperparameter tuning, you are going to build a lasso regression model with optimal hyperparameters to predict blood glucose levels using the features in the diabetes_df dataset.

X_train, X_test, y_train, and y_test have been preloaded for you. A KFold() object has been created and stored for you as kf, along with a lasso regression model as lasso.

#### Instructions

Import GridSearchCV.

Set up a parameter grid for "alpha", using np.linspace() to create 20 evenly spaced values ranging from 0.00001 to 1.

Call GridSearchCV(), passing lasso, the parameter grid, and setting cv equal to kf.

Fit the grid search object to the training data to perform a cross-validated grid search.

In [86]:
"""
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}

# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv= kf)

# Fit to the training data
lasso_cv.fit(X_train,y_train)
print("Tuned lasso paramaters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))
"""

'\n# Import GridSearchCV\nfrom sklearn.model_selection import GridSearchCV\n\n#\xa0Set up the parameter grid\nparam_grid = {"alpha": np.linspace(0.00001, 1, 20)}\n\n# Instantiate lasso_cv\nlasso_cv = GridSearchCV(lasso, param_grid, cv= kf)\n\n# Fit to the training data\nlasso_cv.fit(X_train,y_train)\nprint("Tuned lasso paramaters: {}".format(lasso_cv.best_params_))\nprint("Tuned lasso score: {}".format(lasso_cv.best_score_))\n'

Hyperparameter tuning with RandomizedSearchCV
As you saw, GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space. In this case, you can use RandomizedSearchCV, which tests a fixed number of hyperparameter settings from specified probability distributions.

Training and test sets from diabetes_df have been pre-loaded for you as X_train. X_test, y_train, and y_test, where the target is "diabetes". A logistic regression model has been created and stored as logreg, as well as a KFold variable stored as kf.

You will define a range of hyperparameters and use RandomizedSearchCV, which has been imported from sklearn.model_selection, to look for optimal hyperparameters from these options.


Instructions
Create params, adding "l1" and "l2" as penalty values, setting C to a range of 50 float values between 0.1 and 1.0, and class_weight to either "balanced" or a dictionary containing 0:0.8, 1:0.2.
Create the Randomized Search CV object, passing the model and the parameters, and setting cv equal to kf.
Fit logreg_cv to the training data.
Print the model's best parameters and accuracy score.



In [92]:
"""
from sklearn.model_selection import GridSearchCV

# Create the parameter space
params = {"penalty": ["l1", "l2"],
         "tol": np.linspace(0.0001, 1.0, 50),
         "C": np.linspace(0.1, 1.0, 50),
         "class_weight": ["balanced", {0:0.8, 1:0.2}]}

# Instantiate the RandomizedSearchCV object
logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)

# Fit the data to the model
logreg_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))
"""

'\nfrom sklearn.model_selection import GridSearchCV\n\n#\xa0Create the parameter space\nparams = {"penalty": ["l1", "l2"],\n         "tol": np.linspace(0.0001, 1.0, 50),\n         "C": np.linspace(0.1, 1.0, 50),\n         "class_weight": ["balanced", {0:0.8, 1:0.2}]}\n\n# Instantiate the RandomizedSearchCV object\nlogreg_cv = RandomizedSearchCV(logreg, params, cv=kf)\n\n# Fit the data to the model\nlogreg_cv.fit(X_train, y_train)\n\n# Print the tuned parameters and score\nprint("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))\nprint("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))\n'

### Hyperparameter tuning with RandomizedSearchCV

As you saw, GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space. In this case, you can use RandomizedSearchCV, which tests a fixed number of hyperparameter settings from specified probability distributions.

Training and test sets from diabetes_df have been pre-loaded for you as X_train. X_test, y_train, and y_test, where the target is "diabetes". A logistic regression model has been created and stored as logreg, as well as a KFold variable stored as kf.

You will define a range of hyperparameters and use RandomizedSearchCV, which has been imported from sklearn.model_selection, to look for optimal hyperparameters from these options.

#### Instructions

Create params, adding "l1" and "l2" as penalty values, setting C to a range of 50 float values between 0.1 and 1.0, and class_weight to either "balanced" or a dictionary containing 0:0.8, 1:0.2.

Create the Randomized Search CV object, passing the model and the parameters, and setting cv equal to kf.

Fit logreg_cv to the training data.

Print the model's best parameters and accuracy score.

In [94]:
"""
# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))
"""

'\n# Print the tuned parameters and score\nprint("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))\nprint("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))\n'