# Fine-tuning your model

Having trained your model, your next task is to evaluate its performance. In this chapter, you will learn about other metrics available in scikit-learn that let you assess your model's performance in a more nuanced manner. You will then learn to optimize your classification and regression models using hyperparameter tuning.

# (1) How good is your model?

## Classification metrics
- Measuring model performance with accuracy
    - Fraction of correctly classified samples

## Class imbalance example: Emails
- Spam classification
    - 99% of emails are real; 1% of emails are spam
- Could build a classifier that predicts ALL emails as real
    - 99% accurate
    - But horrible at actually classifying spam
    - Fails at its original purpose
- Need more nuanced metrics
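
The arithmetic behind this example can be checked with a short sketch. The 1,000-email sample and the variable names are illustrative assumptions, not data from the course.

```python
# Hypothetical sample: 1,000 emails, 1% spam (label 1) and 99% real (label 0).
y_true = [1] * 10 + [0] * 990

# A "classifier" that predicts every email as real.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Number of spam emails that were actually flagged as spam.
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy: {accuracy:.2f}")
print(f"spam caught: {spam_caught} of {sum(y_true)}")
```

The accuracy looks excellent even though not a single spam email is caught, which is why accuracy alone can mislead on imbalanced classes.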

## Diagnosing classification predictions
- Confusion matrix

| | Predicted: | Predicted: |
| :-: | :-: | :-: |
| | Spam Email | Real Email |
| Actual: Spam Email | True Positive | False Negative |
| Actual: Real Email | False Positive | True Negative |

- Accuracy:
$$\frac{tp + tn}{tp + tn + fp + fn}$$

## Metrics from the confusion matrix
- Precision $\frac{tp}{tp + fp}$
- Recall $\frac{tp}{tp + fn}$
- F1 score: $2\cdot \frac{precision \cdot recall}{precision + recall}$
- High precision: Not many real emails predicted as spam
- High recall: Predicted most spam emails correctly
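
As a worked example of these formulas, the sketch below computes each metric from confusion-matrix counts. The counts and the helper name `precision_recall_f1` are hypothetical, chosen only for illustration, and the helper assumes the denominators are non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0, so no division by zero occurs.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a spam filter evaluated on 1,000 emails.
tp, fp, fn, tn = 8, 4, 2, 986

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```

With these counts the accuracy is very high, while precision and recall show that a third of the spam flags are wrong and a fifth of the spam is missed.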

## Confusion matrix in scikit-learn

In [None]:
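from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X and y are assumed to be the feature and target arrays defined earlier
knn = KNeighborsClassifier(n_neighbors=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)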

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
print(classification_report(y_test, y_pred))

# Exercise I: Metrics for classification

In Chapter 1, you evaluated the performance of your k-NN classifier based on its accuracy. However, as Andy discussed, accuracy is not always an informative metric. In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.

You may have noticed in the video that the classification report consisted of three rows, and an additional support column. The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.
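
To make the meaning of support concrete, here is a minimal sketch that counts the true samples per class. The label list is made up for illustration and is not the exercise data.

```python
from collections import Counter

# Hypothetical true labels for a small test set (0 = no diabetes, 1 = diabetes).
y_test = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

# The support of a class is the number of test samples whose true label is that class.
support = Counter(y_test)

for label in sorted(support):
    print(f"class {label}: support = {support[label]}")
```

Support depends only on the true labels, so it stays the same whichever classifier produced the predictions.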

Here, you'll work with the [PIMA Indians](https://www.kaggle.com/uciml/pima-indians-diabetes-database) dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of `0` indicates that the patient does not have diabetes, while a value of `1` indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values.

The dataset has been loaded into a DataFrame `df` and the feature and target variable arrays `X` and `y` have been created for you. In addition, `sklearn.model_selection.train_test_split` and `sklearn.neighbors.KNeighborsClassifier` have already been imported.

Your job is to train a k-NN classifier on the data and evaluate its performance by generating a confusion matrix and classification report.

### Instructions

- Import `classification_report` and `confusion_matrix` from `sklearn.metrics`.
- Create training and testing sets with 40% of the data used for testing. Use a random state of `42`.
- Instantiate a k-NN classifier with `6` neighbors, fit it to the training data, and predict the labels of the test set.
- Compute and print the confusion matrix and classification report using the `confusion_matrix()` and `classification_report()` functions.


In [None]:
# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


# (2) Logistic regression and the ROC curve

## Logistic regression for binary classification
- Logistic regression outputs probabilities
- If the probability 'p' is greater than 0.5:
    - The data is labeled '1'
- If the probability 'p' is less than 0.5:
    - The data is labeled '0'
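
The labeling rule above can be sketched in a few lines. The scores and the helper names `sigmoid` and `label_from_probability` are hypothetical. Assigning '0' when the probability is exactly 0.5 is an assumption of this sketch, since the rule above does not cover that case.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def label_from_probability(p, threshold=0.5):
    """Return 1 if p is greater than the threshold, otherwise 0."""
    return 1 if p > threshold else 0


# Hypothetical linear-model scores for four samples.
scores = [-2.0, -0.5, 0.3, 1.5]
probabilities = [sigmoid(z) for z in scores]
labels = [label_from_probability(p) for p in probabilities]

for z, p, label in zip(scores, probabilities, labels):
    print(f"score={z:+.1f}  p={p:.3f}  label={label}")
```

Passing a different `threshold` value changes which samples are labeled '1' without retraining anything.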

## Linear decision boundary

<img src="image/Screenshot 2021-02-02 015911.png">

## Logistic regression in scikit-learn

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

## Probability thresholds
- By default, logistic regression threshold = 0.5
- Not specific to logistic regression

## The ROC curve

<img src="image/Screenshot 2021-02-02 020337.png">
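
Each point on an ROC curve is the (false positive rate, true positive rate) pair obtained at one threshold. The sketch below computes those pairs by hand for a toy example. The labels, scores, thresholds, and the helper name `roc_points` are all hypothetical, and treating a score equal to the threshold as a positive prediction is an assumption of this sketch.

```python
def roc_points(y_true, y_score, thresholds):
    """Return one (fpr, tpr) pair per threshold.

    Assumes y_true contains at least one positive and one negative label.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in thresholds:
        # Predict 1 when the score reaches the threshold.
        y_pred = [1 if s >= t else 0 for s in y_score]
        tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
        fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score, thresholds=[1.0, 0.8, 0.4, 0.35, 0.1])
for fpr, tpr in points:
    print(f"fpr={fpr:.2f}  tpr={tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0) toward (1, 1), because more samples are predicted positive.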

## Plot the ROC curve

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Predicted probability of the positive class (column 1)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.legend()
plt.show()

<img src="image/Screenshot 2021-02-02 020732.png">

In [None]:
logreg.predict_proba(X_test)[:, 1]

# Exercise II: Building a logistic regression model

Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. You'll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There's only one way to find out!

The feature and target variable arrays `X` and `y` have been pre-loaded, and `train_test_split` has been imported for you from `sklearn.model_selection`.

### Instructions

- Import:
    - `LogisticRegression` from `sklearn.linear_model`.
    - `confusion_matrix` and `classification_report` from `sklearn.metrics`.
- Create training and test sets with 40% (or `0.4`) of the data used for testing. Use a random state of `42`. This has been done for you.
- Instantiate a `LogisticRegression` classifier called `logreg`.
- Fit the classifier to the training data and predict the labels of the test set.
- Compute and print the confusion matrix and classification report. This has been done for you, so hit 'Submit Answer' to see how logistic regression compares to k-NN!


In [None]:
# Import the necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the classifier: logreg
logreg = LogisticRegression()

# Fit the classifier to the training data
logreg.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


# Exercise III: Plotting an ROC curve

Great job in the previous exercise - you now have a new addition to your toolbox of classifiers!

Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. As Hugo demonstrated in the video, most classifiers in scikit-learn have a `.predict_proba()` method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, you'll now evaluate its performance by plotting an ROC curve. In doing so, you'll make use of the `.predict_proba()` method and become familiar with its functionality.

Here, you'll continue working with the PIMA Indians diabetes dataset. The classifier has already been fit to the training data and is available as `logreg`.

### Instructions

- Import `roc_curve` from `sklearn.metrics`.
- Using the `logreg` classifier, which has been fit to the training data, compute the predicted probabilities of the labels of the test set `X_test`. Save the result as `y_pred_prob`.
- Use the `roc_curve()` function with `y_test` and `y_pred_prob` and unpack the result into the variables `fpr`, `tpr`, and `thresholds`.
- Plot the ROC curve with `fpr` on the x-axis and `tpr` on the y-axis.


In [None]:
# Import necessary modules
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

# Plot
<img src="image/2021-02-02-022118.svg" width=50%>

# Exercise IV: Precision-recall Curve

When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds. As a reminder, precision and recall are defined as:

$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
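
To see how precision and recall move as the threshold changes, the sketch below evaluates both at three thresholds on a toy example. The labels, scores, and the helper name `precision_recall_at` are hypothetical. Returning `None` when no positive predictions are made is a choice made for this sketch.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Return (precision, recall) at one threshold.

    Precision is None when no sample is predicted positive (tp + fp == 0).
    Assumes y_true contains at least one positive label.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
    fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
    fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn)
    return precision, recall


# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.2, 0.6, 0.55, 0.9, 0.3, 0.1]

results = {t: precision_recall_at(y_true, y_score, t) for t in (0.0, 0.5, 0.95)}
for t, (precision, recall) in results.items():
    print(f"threshold={t}: precision={precision}, recall={recall}")
```

At the lowest threshold every sample is predicted positive, so recall is 1. At the highest threshold no sample is predicted positive, so recall is 0 and precision cannot be computed.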

On the right, a precision-recall curve has been generated for the diabetes dataset. The classification report and confusion matrix are displayed in the IPython Shell.

Study the precision-recall curve and then consider the statements given below. Choose the one statement that is **not** true. Note that here, the class is positive (1) if the individual has diabetes.

<img src="image/2021-02-02-022453.svg">

### Instructions

- A recall of 1 corresponds to a classifier with a low threshold in which all females who contract diabetes were correctly classified as such, at the expense of many misclassifications of those who did not have diabetes.

- Precision is undefined for a classifier which makes no positive predictions, that is, classifies everyone as not having diabetes.

- When the threshold is very close to 1, precision is also 1, because the classifier is absolutely certain about its predictions.

- Precision and recall take true negatives into consideration. (This is the statement that is not true: neither formula contains TN.)