In this notebook, we're going to build a classification model that discriminates between malignant and benign skin lesions. The data will come from a machine learning challenge published by the [International Skin Imaging Collaboration (ISIC)](https://challenge.isic-archive.com/data/) in 2016.

This will also be our first encounter with [`scikit-learn`](https://scikit-learn.org/stable/index.html) — the most popular Python library for building and evaluating machine learning models.

In [None]:
!pip install scikit-learn
import sklearn

# Important: Run this code cell each time you start a new session!

In [None]:
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install opencv-python
!pip install scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cv2
import sklearn

In [None]:
!wget -Ncnp https://isic-challenge-data.s3.amazonaws.com/2016/ISBI2016_ISIC_Part3_Training_Data.zip

In [None]:
!unzip -n ISBI2016_ISIC_Part3_Training_Data.zip

In [None]:
!wget -Ncnp https://isic-challenge-data.s3.amazonaws.com/2016/ISBI2016_ISIC_Part3_Training_GroundTruth.csv

In [None]:
!wget -Ncnp https://isic-challenge-data.s3.amazonaws.com/2016/ISBI2016_ISIC_Part1_Training_GroundTruth.zip

In [None]:
!unzip -n ISBI2016_ISIC_Part1_Training_GroundTruth.zip

# Step 1: Define the Problem You Are Trying to Solve

The overarching goal of the ISIC 2016 Challenge was to develop image analysis tools that automatically diagnose melanoma from dermoscopic images. The organizers of the challenge provided collections of 1022 (w) $\times$ 767 (h) px images gathered from distinct patients. Each image was determined to be benign or malignant based on the judgment of a clinician.

In [None]:
# The relevant folders and files associated with this dataset
# (we will talk about some of them later)
image_folder = 'ISBI2016_ISIC_Part3_Training_Data'
segementation_folder = 'ISBI2016_ISIC_Part1_Training_GroundTruth'
label_filename = 'ISBI2016_ISIC_Part3_Training_GroundTruth.csv'

In [None]:
# Load two pre-selected image files to show what they look like
benign_filename = 'ISIC_0000000.jpg'
malignant_filename = 'ISIC_0000002.jpg'
benign_img = cv2.imread(os.path.join(image_folder, benign_filename))
benign_img = cv2.cvtColor(benign_img, cv2.COLOR_BGR2RGB)
malignant_img = cv2.imread(os.path.join(image_folder, malignant_filename))
malignant_img = cv2.cvtColor(malignant_img, cv2.COLOR_BGR2RGB)

# Show the images and their labels
plt.figure(figsize=(6, 3))
plt.subplot(1, 2, 1), plt.imshow(benign_img), plt.title('Benign')
plt.subplot(1, 2, 2), plt.imshow(malignant_img), plt.title('Malignant')
plt.show()

Because we are deciding between distinct outcome categories, we will want to create a ***classification model***. We have two possible labels: 'benign' and 'malignant'. For simplicity, we will call these labels 'negative' (`0`) and 'positive' (`1`) respectively.

# Step 2: Create Your Features and Labels

## Creating the Labels

The labels for our dataset are provided in a `.csv` file. Let's load it and see what it looks like:

In [None]:
labels_df = pd.read_csv(label_filename, header=None)
labels_df

The first column provides the name of an image file (without the extension), and the second column provides the label that should be associated with that image.

To make this more usable, we are going to do two things:
1. To make it easier to merge our features with our labels, we will set the index of the `DataFrame` to be the image name.
2. Most libraries prefer that labels are represented with numbers rather than strings of text. Therefore, we are going to map the text `'benign'` to `0` and `'malignant'` to `1`.

In [None]:
labels_df.rename(columns={0: 'Image Name', 1: 'Label'}, inplace=True)
labels_df.set_index(['Image Name'], inplace=True)
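# Assign the mapped column back instead of modifying it in place, since
# in-place edits on a selected column are unreliable in recent pandas versions
labels_df['Label'] = labels_df['Label'].map({'benign': 0, 'malignant': 1})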
labels_df

## Plan for Creating the Features

Creating the features is not going to be as straightforward. We could theoretically use the color of every single pixel as its own feature, but this would result in an extremely high-dimensional feature space (over two million values for a 1022 $\times$ 767 px color image) that would likely lead to overfitting. Deep learning models are designed to handle such raw data, but traditional machine learning models generally are not.

Instead, we are going to use the image processing techniques we discussed earlier to summarize each image as a series of numerical features. More specifically, we are going to locate the skin lesion within the broader image and then quantify aspects of the lesion that we think will be relevant for prediction.

We could generate hundreds of random features and hope that the machine learning model can figure out which ones are most important. We could also generate hundreds of random features and hope that a feature selection process can determine which features are more important before model training.

However, having domain expertise about the problem we are trying to solve can save us significant time and effort while possibly leading to a more accurate model. In this particular case, we can use the ABCDE rule of dermatology. This rule is a handy tool that helps people visually identify potential signs of melanoma. It stands for:

* **Asymmetry (A):** Melanomas are often asymmetric, meaning one half of the mole or lesion does not mirror the other half.
* **Border (B):** Melanomas typically have irregular, ragged, or blurred borders, rather than smooth and well-defined edges.
* **Color (C):** Melanomas often exhibit a variety of colors within the same lesion, such as different shades of brown, black, red, or blue.
* **Diameter (D):** Melanomas tend to be larger in size compared to benign moles. Although the exact threshold may vary, a diameter greater than 6 millimeters is often considered a warning sign.
* **Evolution (E):** Any significant change in size, shape, color, or texture of a mole or lesion over time should be closely monitored.

Because we will only be looking at a single image of each skin lesion, we will only be able to extract features representing the first four components of the rule.

Below is a description of the specific values we are going to calculate for each rule component along with a brief summary of their intuition. There are multiple ways we could have formulated these features (e.g., average RGB color instead of HSV color), and some of these calculations are more advanced than others.

Knowing how to translate human-interpretable rules to features is a difficult skill that comes with practice, exposure to a diverse toolbox of techniques, and a healthy amount of internet searching for code examples and academic papers. For now, the main takeaway should simply be that each feature was informed by a combination of domain expertise and knowledge about how to work with image data.

| Rule Component | Feature | Explanation |
|------|:-----|:-----|
| Asymmetry (A) | The Hausdorff distance between the contour and a flipped version of it | The Hausdorff distance measures how far apart two shapes are, so a larger value means the lesion is less symmetric |
| Border (B) | The ratio between the perimeter of the skin lesion and the perimeter of a smoothed version of the skin lesion's contour | The more jagged the contour, the more different it will be from the smoothed contour |
| Color (C) | The color of the skin lesion in HSV | Using HSV since it will extract more intuitive color components |
| Diameter (D) | The diameter of the minimum enclosing circle | We care about the widest line through the skin lesion |

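To make the asymmetry idea more concrete, here is a minimal sketch of a Hausdorff distance on two toy shapes. The shapes and the `hausdorff_distance()` helper are made up for illustration, and the calculation only looks at polygon vertices rather than full outlines. A shape that mirrors onto itself scores zero, while a lopsided shape scores higher.

```python
import numpy as np

def hausdorff_distance(pts_a, pts_b):
    """Symmetric Hausdorff distance between two (N, 2) arrays of points."""
    # Pairwise Euclidean distances, shape (len(pts_a), len(pts_b))
    diffs = pts_a[:, None, :] - pts_b[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # How far the worst-matched point in A is from its nearest point in B,
    # and the same in the other direction
    a_to_b = dists.min(axis=1).max()
    b_to_a = dists.min(axis=0).max()
    return max(a_to_b, b_to_a)

def mirror_about_x(points, x_axis):
    """Reflect points about the vertical line x = x_axis."""
    mirrored = points.copy()
    mirrored[:, 0] = 2 * x_axis - mirrored[:, 0]
    return mirrored

# A square is symmetric about the vertical line x = 2
square = np.array([[0, 0], [4, 0], [4, 4], [0, 4]], dtype=float)
# An L-shaped outline is not
l_shape = np.array([[0, 0], [4, 0], [4, 1], [1, 1], [1, 3], [0, 3]], dtype=float)

sym_score = hausdorff_distance(square, mirror_about_x(square, 2))
asym_score = hausdorff_distance(l_shape, mirror_about_x(l_shape, 2))
print(f'Square: {sym_score:0.2f}, L-shape: {asym_score:0.2f}')
```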
## Extracting the Features from a Single Image

Properly segmenting a skin lesion from an image is difficult for multiple reasons:
* Skin lesions can be red, brown, black, or purple, so a single color filter won't suffice
* People can have different skin tones, so a dynamic brightness threshold wouldn't work either
* Hair can cover skin lesions and make it more difficult to accurately detect edges

We are going to skip this step and rely on image annotations provided by the ISIC challenge organizers. These annotations indicate where the skin lesion is according to a binary image where white pixels belong to the skin lesion and black pixels correspond to everything else.

In [None]:
filename = 'ISIC_0000001'

# Load an image and its corresponding annotation
rgb_filename = filename + '.jpg'
seg_filename = filename + '_Segmentation.png'
img = cv2.imread(os.path.join(image_folder, rgb_filename))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
seg_img = cv2.imread(os.path.join(segementation_folder, seg_filename))
seg_img = cv2.cvtColor(seg_img, cv2.COLOR_BGR2GRAY)

# Show the image with its annotation
plt.figure(figsize=(6, 3))
plt.subplot(1, 2, 1), plt.imshow(img), plt.title('RGB Image')
plt.subplot(1, 2, 2), plt.imshow(seg_img, cmap='gray'), plt.title('Annotation')
plt.show()

To extract the contour from the image on the right, all we need to do is call `cv2.findContours()` and return the first (and only) contour from the list:

In [None]:
def extract_contour(seg_img):
    """
    Extracts the lone contour from the image annotation
    seg_img: a binary image representing an annotation
    """
    cnts, hierarchy = cv2.findContours(seg_img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    return cnts[0]

Once we have a contour that defines where the skin lesion is within the corresponding color image, we can use a series of helper functions to extract our features. We will create one helper function for each component of the ABCD(E) rule. The features related strictly to the shape of the skin lesion only require the contour, while the features related to color require both the contour and the original color image.

Again, you do not need to understand the exact mechanics of these calculations, as some are more complicated than you might expect. You should simply appreciate the fact that many of these helper functions use image processing techniques that we discussed earlier in the school year.

If you get lost looking at these helper functions, jump straight to the next header.

In [None]:
from scipy.spatial.distance import cdist

# Helper functions for calculating our custom asymmetry score
def rotate_contour(cnt, angle, center):
    rotation_matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated_contour = cv2.transform(cnt.reshape(-1, 1, 2), rotation_matrix).reshape(-1, 2).astype(int)
    return rotated_contour

def flip_contour_horiz(cnt, width):
    return np.array([[[width - point[0][0], point[0][1]] for point in cnt]], dtype=np.int32)

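def get_hausdorff_distance(cnt1, cnt2):
    pts1 = np.array(cnt1).squeeze()
    pts2 = np.array(cnt2).squeeze()
    distances = cdist(pts1, pts2)
    # The Hausdorff distance is symmetric: take the larger of the two
    # directed distances (cnt1 -> cnt2 and cnt2 -> cnt1)
    return max(np.max(np.min(distances, axis=1)), np.max(np.min(distances, axis=0)))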

In [None]:
def compute_asymmetry(img, cnt):
    """
    Compute the asymmetry of the skin lesion by comparing the contour with a
    reflected version of itself
    img: the image of the skin lesion
    cnt: the contour of the skin lesion
    """
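    # Fit an ellipse to the contour to find its center and orientation
    center, axes, angle = cv2.fitEllipse(cnt)

    # Rotate the contour so that it is upright
    rotated_cnt = rotate_contour(cnt, angle, center)

    # Flip the rotated contour horizontally about the vertical line through
    # the ellipse center (x' = 2*cx - x) so that both shapes stay aligned
    flipped_rotated_cnt = flip_contour_horiz(rotated_cnt.reshape(-1, 1, 2), 2 * center[0])

    # Measure how far the contour is from its mirror image
    distance = get_hausdorff_distance(rotated_cnt, flipped_rotated_cnt)

    # Compute the diameter of the contour
    _, r = cv2.minEnclosingCircle(cnt)
    d = 2*r

    # Normalize by the diameter so the asymmetry score does not depend on
    # the size of the lesion
    return distance / d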

In [None]:
def compute_border(cnt):
    """
    Compute the jaggedness of the skin lesion's border by comparing the
    perimeter of the actual border to the perimeter of a smoothed version of it
    cnt: the contour of the skin lesion
    """
    # Compute the perimeter
    perimeter = cv2.arcLength(cnt, True)

    # Approximate the contour with a simpler polygon
    # (cv2.approxPolyDP() smooths the contour; it does not compute a convex hull)
    epsilon = 0.01 * perimeter
    approx = cv2.approxPolyDP(cnt, epsilon, True)

    # Compute the perimeter of the simplified polygon
    simplified_perimeter = cv2.arcLength(approx, True)

    # Return the ratio between the two
    return simplified_perimeter / perimeter

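To build intuition for the border feature, the sketch below compares a smooth outline with a jagged one using plain `numpy`. The shapes are synthetic, `polygon_perimeter()` is a hypothetical helper, and the smooth circle stands in for the simplified contour that `cv2.approxPolyDP()` would produce. The ratio drops well below 1 as the outline becomes more jagged.

```python
import numpy as np

def polygon_perimeter(points):
    """Perimeter of a closed polygon given as an (N, 2) array of vertices."""
    # Edge vectors from each vertex to the next one (wrapping around)
    edges = np.roll(points, -1, axis=0) - points
    return np.linalg.norm(edges, axis=1).sum()

angles = np.linspace(0, 2 * np.pi, 200, endpoint=False)
unit_circle = np.column_stack([np.cos(angles), np.sin(angles)])

# Smooth outline: a circle of radius 50
smooth = 50 * unit_circle

# Jagged outline: the same circle with the radius alternating between 55 and 45
radii = 50 + 5 * np.where(np.arange(200) % 2 == 0, 1, -1)
jagged = radii[:, None] * unit_circle

# Ratio of the smoothed perimeter to the actual perimeter
smooth_ratio = polygon_perimeter(smooth) / polygon_perimeter(smooth)
jagged_ratio = polygon_perimeter(smooth) / polygon_perimeter(jagged)
print(f'Smooth border: {smooth_ratio:0.2f}, jagged border: {jagged_ratio:0.2f}')
```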
In [None]:
def compute_color(img, cnt):
    """
    Compute the average color of the skin lesion within the contour
    img: the image of the skin lesion
    cnt: the contour of the skin lesion
    """
    # Convert the image to HSV
    hsv_img = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)

    # Recreate the binary mask using the contour
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    cv2.drawContours(mask, [cnt], -1, (255), thickness=-1)

    # Compute the average HSV color over the pixels inside the mask
    # (cv2.mean() ignores pixels where the mask is zero, so we can pass the
    # HSV image directly without blacking out the background first)
    return cv2.mean(hsv_img, mask=mask)[:3]

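To see why the mask matters when averaging color, here is a small sketch on a synthetic 4 $\times$ 4 image with made-up pixel values. It uses `numpy` boolean indexing instead of `cv2.mean()`, but the idea is the same: only pixels inside the mask contribute to the average.

```python
import numpy as np

# Synthetic 3-channel image: a bright background with a darker 2x2 "lesion"
toy_img = np.full((4, 4, 3), (90, 50, 250), dtype=np.uint8)
toy_img[1:3, 1:3] = (10, 200, 100)

# Binary mask: 255 inside the lesion, 0 everywhere else
toy_mask = np.zeros((4, 4), dtype=np.uint8)
toy_mask[1:3, 1:3] = 255

# Average over every pixel versus only the pixels inside the mask
mean_all = toy_img.reshape(-1, 3).mean(axis=0)
mean_inside = toy_img[toy_mask > 0].mean(axis=0)
print(f'Whole image: {mean_all}')
print(f'Lesion only: {mean_inside}')
```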
In [None]:
def compute_diameter(cnt):
    """
    Compute the diameter of the skin lesion according to the min enclosing circle
    cnt: the contour of the skin lesion
    """
    _, r = cv2.minEnclosingCircle(cnt)
    return 2*r

## Extract the Features from All Images

Now that we have helper functions to calculate our features, let's put everything together into a single function. This function will take a single image as input and return all of the features calculated for that image as a `dict`.

In [None]:
def process_img(filename):
    """
    Process a skin lesion image and produce all of the features according to
    the ABCD(E) rule as a dictionary (one value per key)
    filename: the name of the skin lesion image without the file extension
    """
    # Get the contour filename
    rgb_filename = filename + '.jpg'
    seg_filename = filename + '_Segmentation.png'

    # Get both of the images (RGB and segmentation annotation)
    img = cv2.imread(os.path.join(image_folder, rgb_filename))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    seg_img = cv2.imread(os.path.join(segementation_folder, seg_filename))
    seg_img = cv2.cvtColor(seg_img, cv2.COLOR_BGR2GRAY)

    # Get the contour
    cnt = extract_contour(seg_img)

    # Extract features from the image
    asymmetry = compute_asymmetry(img, cnt)
    border = compute_border(cnt)
    color = compute_color(img, cnt)
    diameter = compute_diameter(cnt)

    # Combine everything into a feature vector
    feature_dict = {'Asymmetry': asymmetry,
                    'Border': border,
                    'Color (H)': color[0],
                    'Color (S)': color[1],
                    'Color (V)': color[2],
                    'Diameter': diameter}
    return feature_dict

In [None]:
# Test our function
filename = 'ISIC_0000000'
process_img(filename)

To get the features for all of the images in our dataset, we will iterate through all of the files and call our `process_img()` function on each image. We will gather the results in a single `DataFrame` that will hold all of our features. Running this will take some time since we have lots of images.

In [None]:
# Get all the image filenames but remove the extension
# (skip anything in the folder that is not a .jpg file)
img_filenames = os.listdir(image_folder)
img_filenames = sorted([os.path.splitext(f)[0] for f in img_filenames
                        if f.lower().endswith('.jpg')])

# Iterate through the filenames and collect one feature dict per image
feature_dicts = []
for img_filename in img_filenames:
    # Generate the features
    feature_dict = process_img(img_filename)

    # Add the image name
    feature_dict['Image Name'] = img_filename
    feature_dicts.append(feature_dict)

# Build the DataFrame once and set the index to the image name
features_df = pd.DataFrame(feature_dicts)
features_df.set_index(['Image Name'], inplace=True)
features_df

## Grouping Features and Labels Together

To finalize our features and labels, we will combine `labels_df` and `features_df` into a single `DataFrame`. While this is an optional step, it will yield a few benefits:
1. We can export this `DataFrame` as a `.csv` to colleagues so that they can analyze the processed data for themselves
2. We can save this processed data for later so that we don't have to generate features from scratch
3. We can make sure the features and the labels are properly matched as we do our dataset splitting

To ensure that everything lines up properly, we will combine `labels_df` and `features_df` so that the rows line up according to their index (the image name).

In [None]:
df = features_df.merge(labels_df, left_index=True, right_index=True)
df

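The toy example below sketches how an index-based merge lines rows up. The image names and values are invented for illustration. Because `merge()` performs an inner join by default, only images that appear in both tables are kept, and the row order of the two tables does not matter.

```python
import pandas as pd

# Hypothetical features for three images
toy_features = pd.DataFrame(
    {'Diameter': [120.0, 85.5, 230.1]},
    index=pd.Index(['img_a', 'img_b', 'img_c'], name='Image Name'))

# Hypothetical labels in a different order, with one image missing
# and one extra image that has no features
toy_labels = pd.DataFrame(
    {'Label': [1, 0, 0]},
    index=pd.Index(['img_c', 'img_a', 'img_d'], name='Image Name'))

# Rows are matched by image name; unmatched rows are dropped
toy_df = toy_features.merge(toy_labels, left_index=True, right_index=True)
print(toy_df)
```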
# Step 3: Decide How the Data Should Be Split for Training and Testing

The organizers of the ISIC challenge technically provided separate datasets for model training and testing. However, we are going to make our own splits for the sake of practice.

Since we only have one image per person, we can treat the images independently and do not have to worry about splitting our dataset in any fancy way. We are going to do a simple 80%-20% split where 80% of the images will be used for model training and the rest will be used for model testing. We can do this by using the `train_test_split()` function, which takes two input parameters:
1. **arrays:** A list, `numpy` array, or `DataFrame` containing our data
2. **test_size:** The fraction of the data that will be assigned to the test split

In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2)
print(f'Number of samples in train data: {len(train_df)}')
print(train_df.head())
print(f'Number of samples in test data: {len(test_df)}')
print(test_df.head())

The splits are randomly decided, which means that we will get a different split each time we call this function. Randomization is important since we want to avoid fine-tuning our pipeline for a very specific configuration. However, randomization can also make it more difficult to debug our code since we won't be able to tell whether we are getting different results because of the randomness or because of changes we made.

One way to avoid this issue is by setting a ***random seed*** — an arbitrarily selected number that controls how random numbers are generated. As we are building our pipeline, we can set the random seed so that we get the same results every time. Once we are confident that everything is working properly, we can "turn off" the random seed by setting it to `None`. We will keep the random seed set in this notebook so that everyone gets the same results when they get to the end.

In [None]:
# Set the random seed to an arbitrary number of your choosing
np.random.seed(42)

# Rerunning this code will always have the same outcome
train_df, test_df = train_test_split(df, test_size=0.2)
print(f'Number of samples in train data: {len(train_df)}')
print(train_df.head())
print(f'Number of samples in test data: {len(test_df)}')
print(test_df.head())

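As an alternative sketch, `train_test_split()` also accepts a `random_state` argument that fixes the shuffle for that one call without touching the global `numpy` seed. The toy array below is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 10 samples with 2 columns each
toy_data = np.arange(20).reshape(10, 2)

# Passing the same random_state produces the same split every time
train_a, test_a = train_test_split(toy_data, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy_data, test_size=0.2, random_state=42)

print(f'Train size: {len(train_a)}, test size: {len(test_a)}')
print(f'Same test split both times: {np.array_equal(test_a, test_b)}')
```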
Once we have our train and test splits, we will separate our data back into features and labels since `scikit-learn` models take the features and the labels as separate inputs. We will also convert them to `numpy` arrays using `.values`.

In [None]:
x_train = train_df.drop('Label', axis=1).values
y_train = train_df['Label'].values
x_test = test_df.drop('Label', axis=1).values
y_test = test_df['Label'].values

# Step 4: (Optional) Add Feature Selection

Given that we only have a few features and they are informed by domain expertise, we are going to skip this step and assume that we have a reasonable set of features.

# Step 5: (Optional) Balance Your Dataset

We can check the balance of our dataset by looking at the frequency of values in our labels column. We can do this by checking either the `DataFrames` or the `numpy` arrays:

In [None]:
def print_label_dist(y):
    """
    Prints out the balance between positive and negative samples
    y: a 1D array of labels
    """
    num_neg = np.count_nonzero(y == 0)
    num_pos = np.count_nonzero(y == 1)
    print(f'Number of benign samples: {num_neg}')
    print(f'Number of malignant samples: {num_pos}')
    print(f'Fraction of positive samples: {num_pos/(num_pos+num_neg):0.2f}')

In [None]:
print_label_dist(df['Label'].values)

Notice that we have many more benign cases than we do malignant ones in our overall dataset. This imbalance trickles down once we split our data into train and test sets.

In [None]:
print_label_dist(y_train)

In [None]:
print_label_dist(y_test)

For now, we are going to leave this undisturbed, but we will revisit this issue the next time we look at this model.

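To illustrate why imbalance deserves attention, here is a minimal sketch with hypothetical label counts (80 benign, 20 malignant; these are not the counts in our dataset). A "model" that always predicts benign looks accurate overall even though it never catches a single malignant case.

```python
import numpy as np

# Hypothetical labels chosen for illustration: 80 benign (0), 20 malignant (1)
y_true_toy = np.array([0] * 80 + [1] * 20)

# A trivial "model" that ignores its input and always predicts benign
y_pred_toy = np.zeros_like(y_true_toy)

# Fraction of all predictions that are correct
accuracy = np.mean(y_pred_toy == y_true_toy)
# Fraction of malignant cases that were caught
sensitivity = np.sum((y_pred_toy == 1) & (y_true_toy == 1)) / np.sum(y_true_toy == 1)
# Fraction of benign cases that were correctly identified
specificity = np.sum((y_pred_toy == 0) & (y_true_toy == 0)) / np.sum(y_true_toy == 0)

print(f'Accuracy: {accuracy:0.2f}')
print(f'Sensitivity: {sensitivity:0.2f}')
print(f'Specificity: {specificity:0.2f}')
```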
# Step 6: Select an Appropriate Model

`scikit-learn` provides numerous classification model architectures with their own advantages and disadvantages. For now, we are going to stick with a ***random forest classifier***, which uses a collection of decision trees to make a decision.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# Step 7: (Optional) Select Your Hyperparameters

For now, we are going to stick with the default hyperparameters for our model.

# Step 8: Train and Test Your Model

We are finally ready to train our machine learning model! All we need to do is call the `.fit()` method while providing the features and labels from our training dataset. Underneath the hood, the model will adjust its underlying parameters and decision boundaries in order to optimize its performance on that data.

In [None]:
clf.fit(x_train, y_train)

Once we've trained the model, we can see how the model would classify a set of features by calling the `.predict()` method. People often forgo generating predictions for the training dataset since it will not give us a real indication of how the model will perform on previously unseen data. However, we will generate predictions for both our training and test data so that we can compare the model accuracy on both sets.

In [None]:
y_train_pred = clf.predict(x_train)
y_test_pred = clf.predict(x_test)

To plot an ROC curve, we actually need to generate our predictions a slightly different way. Rather than generating binary predictions (`0` or `1`), we will need to generate probabilistic predictions that indicate the likelihood that the given sample belongs to the positive class.

We can do this by calling `.predict_proba()` instead of `.predict()` on our model. This method produces an $n \times c$ array, where $n$ is the number of samples and $c$ is the number of classes in our dataset. The number in row $i$ and column $j$ indicates the estimated probability that sample $i$ belongs to class $j$ according to our model.

Let's compare the output of `.predict()` and `.predict_proba()` on a subset of our training dataset features:

In [None]:
y_binary = clf.predict(x_train[:10, :])
y_prob = clf.predict_proba(x_train[:10, :])
print(y_binary)
print(y_prob)

Notice that only the 7th row has a `1` in the output of `.predict()`, indicating that it was the only sample that was predicted to be malignant out of the 10 samples. The corresponding row in the output of `.predict_proba()` is the only one where the value on the left is less than the value on the right, reflecting the fact that the model was more confident that the sample was malignant instead of benign.

Since we only have two classes, we are simply going to save the rightmost column, which indicates the likelihood that each sample belongs to the positive class (`malignant`) according to our model.

In [None]:
y_train_pred_prob = clf.predict_proba(x_train)[:, 1]
y_test_pred_prob = clf.predict_proba(x_test)[:, 1]

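The relationship between the two methods can be checked on synthetic data. The sketch below uses `make_classification()` to generate a made-up dataset (not our skin lesion features) and shows that each row of `.predict_proba()` sums to 1 and that `.predict()` picks the class with the largest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
toy_clf = RandomForestClassifier(random_state=0).fit(X_toy, y_toy)

toy_binary = toy_clf.predict(X_toy[:10])
toy_prob = toy_clf.predict_proba(X_toy[:10])

# Each row holds one probability per class, and the row sums to 1
row_sums = toy_prob.sum(axis=1)

# Picking the most likely class in each row reproduces .predict()
from_prob = toy_clf.classes_[np.argmax(toy_prob, axis=1)]
print(f'Shape of probabilities: {toy_prob.shape}')
print(f'Matches .predict(): {np.array_equal(toy_binary, from_prob)}')
```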
# Step 9: Use an Appropriate Method for Interpreting Results

Now that we have predictions, we will examine a variety of metrics to see how well our model performed. Most of the functions we will discuss in this section require two inputs:
1. **y_true:** The known ground-truth labels from our dataset
2. **y_pred:** The labels predicted from the model

We will start by examining how well our model worked on the training dataset, after which we will revisit them for our test dataset.

## Confusion Matrix

We can manually generate and save a confusion matrix using the `confusion_matrix()` function:

In [None]:
from sklearn.metrics import confusion_matrix

# Generate the confusion matrix
cm = confusion_matrix(y_train, y_train_pred)
print(cm)

# Split the confusion matrix according to decision outcomes
tn = cm[0][0]
fp = cm[0][1]
fn = cm[1][0]
tp = cm[1][1]
print(f'True positives: {tp}')
print(f'True negatives: {tn}')
print(f'False positives: {fp}')
print(f'False negatives: {fn}')

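For a binary problem, the four entries can also be unpacked in one step by flattening the matrix with `.ravel()`. The labels below are invented so that the counts are easy to verify by hand.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 5 negatives and 3 positives
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels and columns are predicted labels, so flattening
# the 2x2 matrix yields the order: tn, fp, fn, tp
tn_toy, fp_toy, fn_toy, tp_toy = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f'TN: {tn_toy}, FP: {fp_toy}, FN: {fn_toy}, TP: {tp_toy}')
```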
However, most people prefer to generate a figure that shows the visualization since that's what ends up getting put into a paper or report. `scikit-learn` provides a handy class called `ConfusionMatrixDisplay` for creating such a visualization.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
classes = ['benign', 'malignant']
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred,
                                        display_labels=classes)
plt.show()

## Classification Accuracy Rates

The `metrics` module provides numerous functions you can call to calculate various scores. A few examples are provided below:

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

print(f'Accuracy: {accuracy_score(y_train, y_train_pred)}')
print(f'F1 Score: {f1_score(y_train, y_train_pred)}')
print(f'Precision: {precision_score(y_train, y_train_pred)}')
print(f'Recall: {recall_score(y_train, y_train_pred)}')

The `metrics` module does not provide functions named after sensitivity and specificity, which are commonly used in medical applications. Sensitivity is the same quantity as recall, and specificity is the recall of the negative class. We can also calculate both ourselves using the entries of the confusion matrix:

In [None]:
# Get the confusion matrix
cm = confusion_matrix(y_train, y_train_pred)
tn = cm[0][0]
fp = cm[0][1]
fn = cm[1][0]
tp = cm[1][1]

# Calculate sensitivity and specificity
sens = tp / (tp+fn)
spec = tn / (tn+fp)
print(f'Sensitivity: {sens}')
print(f'Specificity: {spec}')

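As a cross-check, sensitivity and specificity can also be obtained from `recall_score()` by choosing which class counts as positive through its `pos_label` argument. The toy labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 5 negatives and 3 positives
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

# Manual calculation from the confusion matrix
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
sens_manual = cm_toy[1][1] / (cm_toy[1][1] + cm_toy[1][0])
spec_manual = cm_toy[0][0] / (cm_toy[0][0] + cm_toy[0][1])

# Same quantities using recall: sensitivity is the recall of class 1,
# and specificity is the recall of class 0
sens_recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)
spec_recall = recall_score(y_true_toy, y_pred_toy, pos_label=0)
print(f'Sensitivity: {sens_manual:0.2f} vs {sens_recall:0.2f}')
print(f'Specificity: {spec_manual:0.2f} vs {spec_recall:0.2f}')
```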
`classification_report()` provides a quick printout of commonly used classification accuracy metrics. The report breaks down the performance across the individual classes, which can be useful for determining whether your model is performing better for one class versus another. The table below explains what each of the numbers represents:

| |Precision|Recall|F1-Score|Support|
|-----|-----|-----|-----|-----|
| **Benign** | Precision if `benign` is treated as the positive label | Recall if `benign` is treated as the positive label | F1-score if `benign` is treated as the positive label | The number of examples that were labeled `benign` |
| **Malignant** | Precision if `malignant` is treated as the positive label | Recall if `malignant` is treated as the positive label | F1-score if `malignant` is treated as the positive label | The number of examples that were labeled `malignant` |
| **Accuracy** | | | The overall accuracy across the entire dataset (shown in the F1-score column) | The number of examples in the entire dataset |
| **Macro avg** | The unweighted average precision across both classes | The unweighted average recall across both classes | The unweighted average F1-score across both classes | The number of examples in the entire dataset |
| **Weighted avg** | The average precision across both classes, weighted by support | The average recall across both classes, weighted by support | The average F1-score across both classes, weighted by support | The number of examples in the entire dataset |


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, y_train_pred, target_names=classes))

## ROC Curve

We can generate the raw data for an ROC curve using `roc_curve()` and the area under the curve (AUC) using `roc_auc_score()`. Instead of providing the predicted binary labels from our model, we will need to provide the predicted likelihood scores that were output by `.predict_proba()`.

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_train, y_train_pred_prob)
auc = roc_auc_score(y_train, y_train_pred_prob)
print(f'AUC: {auc}')

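One way to interpret the AUC is as the fraction of positive-negative pairs in which the positive sample receives the higher score. The sketch below checks this on four made-up samples by comparing a manual pairwise count against `roc_auc_score()`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores for illustration
y_true_toy = np.array([0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8])

auc_sklearn = roc_auc_score(y_true_toy, y_score_toy)

# Compare every positive score with every negative score
pos_scores = y_score_toy[y_true_toy == 1]
neg_scores = y_score_toy[y_true_toy == 0]
wins = (pos_scores[:, None] > neg_scores[None, :]).sum()
ties = (pos_scores[:, None] == neg_scores[None, :]).sum()

# Ties count as half a win
auc_pairwise = (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))
print(f'AUC from scikit-learn: {auc_sklearn}')
print(f'AUC from pairwise comparison: {auc_pairwise}')
```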
As with confusion matrices, `scikit-learn` also provides a handy class called `RocCurveDisplay` for generating an ROC curve visualization for us.

In [None]:
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(y_train, y_train_pred_prob)
plt.show()

## Comparing Performance on Train and Test Data

Let's create a function that will generate a detailed classification accuracy report combining a subset of the aforementioned metrics and visualizations:

In [None]:
def classification_evaluation(y_true, y_pred, y_pred_prob):
    """
    Generate a series of graphs that will help us determine the performance of
    a binary classifier model
    y_true: the target binary labels
    y_pred: the predicted binary labels
    y_pred_prob: the predicted likelihood scores for a positive label
    """
    # Calculate f1 score, sensitivity, and specificity
    cm = confusion_matrix(y_true, y_pred)
    tn = cm[0][0]
    fp = cm[0][1]
    fn = cm[1][0]
    tp = cm[1][1]
    f1 = f1_score(y_true, y_pred)
    sens = tp / (tp+fn)
    spec = tn / (tn+fp)

    # Generate the confusion matrix
    cm_title = f'Confusion Matrix \n(Sensitivity: {sens:0.2f}, Specificity: {spec:0.2f})'
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=classes)
    plt.title(cm_title)
    plt.show()

    # Display the ROC curve
    roc_title = f'ROC Curve (F1 score: {f1:0.2f})'
    RocCurveDisplay.from_predictions(y_true, y_pred_prob)
    plt.title(roc_title)
    plt.show()

Let's run this function on both our train and test predictions to see the disparity in performance:

In [None]:
classification_evaluation(y_train, y_train_pred, y_train_pred_prob)

In [None]:
classification_evaluation(y_test, y_test_pred, y_test_pred_prob)

So what can we learn from all of these results?
* As we saw earlier, our dataset has many more benign cases than it does malignant cases. This imbalance could bias our model to assume that skin lesions are benign.
* Our model achieved perfect accuracy on our training dataset, which means that the model was able to learn patterns in the features we provided. On its own, however, this says little about new data, since random forests with default settings can fit their training data almost exactly.
* Our model did not perform so well on the test dataset, achieving a poor sensitivity and F1 score. Although we achieved a high specificity score, that was because almost all of the test cases were predicted to be benign. This confirms our suspicion that the model may be biased.

To summarize, our model was able to learn from the features we provided, but it did not generalize to the unseen test dataset. We are going to revisit this machine learning pipeline in a later session to see how we can improve its ability to generalize to new data.