### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 12 - Predictive modelling (supplementary material)

*Written by:* Oliver Scott

**This notebook provides a general introduction to predictive modelling using scikit-learn.**

Do not be afraid to make changes to the code cells to explore how things work!

-----

### What is predictive modelling?

Predictive modelling is the use of statistical models to forecast future outcomes. Typically, a predictive model incorporates a machine learning algorithm that is trained on existing observations to make predictions for new data. Predictive modelling can generally be divided into two main categories: regression and classification.

**Regression** models analyze relationships between a dependent variable (often called the "outcome" or "response") and one or more independent variables (known as "predictors" or "features"). For example, a regression model might predict the response of a biological system under different conditions.

**Classification** models aim to assign discrete class labels to observations based on independent variables, rather than providing a continuous output like regression. For instance, a classification model could be used to diagnose a condition based on measurements such as gene expression levels.

In this notebook, we will focus on a supervised classification objective: **predicting predefined class labels for a set of observations**.

### Supervised classification

Classification tasks are generally divided into two main categories: supervised and unsupervised. In supervised learning, class labels for the training data are known *a priori*, enabling us to "supervise" the model’s learning process.

The image below illustrates a supervised classification task with two variables ($x_1$ and $x_2$), where each class is indicated by a colour. The dotted line represents the (linear) decision boundary used to define two decision regions. New observations are classified based on the region they fall into, although it’s likely that the model will misclassify a percentage of unseen samples.

<p align="center">
  <img src="https://scipython.com/static/media/uploads/blog/logistic_regression/decision-boundary.png" alt="Decision Boundary" width="50%"/>
  <br>
</p>

[Image source](https://scipython.com/static/media/uploads/blog/logistic_regression/decision-boundary.png)

In contrast, unsupervised classification is used when class labels are unknown *a priori* and must be inferred directly from the observations. Typically, an unsupervised algorithm includes a clustering technique that groups data based on some form of distance (or similarity) measurement.
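
As a brief aside, the clustering idea described above can be sketched in a few lines. This is only an illustration: the data are synthetic (generated with `make_blobs`), and the choice of k-means with two clusters is an assumption made for the example, not something used later in this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): 200 points around 2 centres
X_demo, _ = make_blobs(n_samples=200, centers=2, random_state=0)

# Group the points into 2 clusters based on Euclidean distance
# Note that no class labels are passed to the algorithm
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_demo)

print('Cluster sizes:', [int((cluster_ids == k).sum()) for k in range(2)])
```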

In this notebook, we will go through a basic pipeline for constructing a supervised classification model.

### What is scikit-learn?

[scikit-learn](https://scikit-learn.org/stable/) (sklearn) is a popular and robust Python package for machine learning. It offers a range of efficient tools for machine learning and statistical modelling, including classification, regression, clustering, and dimensionality reduction. You can explore the full set of features on the [project's website](https://scikit-learn.org/stable/). 

In this notebook, we will use scikit-learn to construct a supervised classification model.

-----

## Contents

1. [Objective](#Objective)
2. [Data analysis](#Data-analysis)
3. [Dimensionality reduction](#Dimensionality-reduction)
4. [Train/Test splits](#Train/Test-splits)
5. [Training models](#Training-models)
6. [Model evaluation](#Model-evaluation)
7. [Feature importance](#Feature-importance)
8. [Discussion](#Discussion)

-----

#### Reference:

- [scikit-learn documentation](https://scikit-learn.org/stable/)
- K. P. Bennett and O. L. Mangasarian's "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software, 1992.

-----

## Objective

In this notebook, we’ll learn how to construct a supervised classification model using the **Breast Cancer Wisconsin (Diagnostic) dataset** created by Dr William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, USA. The dataset features are derived from an image of a fine needle aspirate (FNA) of a breast mass, where software called *Xcyt* was used to measure the cell nuclei characteristics in the image. The software calculates 10 features using a curve-fitting algorithm and then computes the mean, extreme values, and standard error for each feature, resulting in 30 real-valued attributes.

<p align="center">
  <img src="https://jithinjk.github.io/blog/images/histo/pcam.png" alt="breast cancer dataset distribution" width="100%"/>
  <br>
</p>

[Image source](https://jithinjk.github.io/blog/images/histo/pcam.png)

#### Attributes:

**Target:**
- Malignant or Benign

**Features:**
- **Radius**: Mean of distances from center to points on the perimeter
- **Texture**: Standard deviation of gray-scale values
- **Perimeter**
- **Area**
- **Smoothness**: Local variation in radius lengths
- **Compactness**: Calculated as $perimeter^2 / area - 1.0$
- **Concavity**: Severity of concave portions of the contour
- **Concave Points**: Number of concave portions of the contour
- **Symmetry**
- **Fractal Dimension**: coastline approximation $- 1$

For each feature, the mean, standard error, and “worst” or largest (mean of the three largest values) were computed, creating 30 features in total. For example, field 3 is `mean radius`, field 13 is `radius error`, and field 23 is `worst radius`.

**Class distribution**: 357 benign, 212 malignant

#### Goal

Our goal is to build a model capable of classifying whether a breast sample is malignant or benign based on these features using a supervised classification algorithm.

#### Discussion points:

- How could an accurate machine learning model for cancer diagnosis be beneficial in a clinical setting?
- What are some reasons a machine learning model might not be relied upon for such diagnoses?
- Can you think of other biomedical or unrelated problems that might benefit from predictive modelling?

-----

## Data analysis

Before we even think about building a model, we need to understand the data we have been given. The first step is getting the data into Python. Luckily, scikit-learn provides a function for loading this particular dataset.

scikit-learn is imported with the name `sklearn` (hyphens `-` are not allowed in Python package/module names).

In [None]:
from sklearn import datasets  # Import the `datasets` module

dataset = datasets.load_breast_cancer(as_frame=True)

# No output is expected from this cell

The `dataset` object contains some useful attributes.

In [None]:
# Print the targets in the dataset
print('Targets:', dataset.target_names)

In [None]:
# Print the names of the features in the dataset
print('Features:', dataset.feature_names)

The scikit-learn dataset also contains the actual data as a Pandas DataFrame (features) and a Pandas Series (target). 

Let's take a look at these and assign them to variables to use later.

In [None]:
target = dataset.target  # Our class labels
target.unique()

In [None]:
features = dataset.data  # Our features
features

Notice that the targets are binary encoded as `[0, 1]`. Let’s check the count of each label.

In [None]:
target.value_counts()

It appears there are 212 examples of class 0 ('malignant') and 357 examples of class 1 ('benign').

To better reflect a real-life scenario, we may prefer to reverse the class labels, with 0 representing 'benign' and 1 representing 'malignant'.

In [None]:
target = 1 - target  # Simple way to flip the labels
target.value_counts()

We can also plot the class counts to get a visual representation.

In [None]:
# We add this 'Jupyter magic' to display plots in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

# Construct the plot
target.value_counts(sort=True).plot.bar()

# Set the axis labels and x-axis ticks
plt.xlabel('Target classification')
plt.ylabel('Count')
plt.xticks([0, 1], ['benign', 'malignant'], rotation=0);

Now let’s take a look at the features. Remember, we can use `.info()` and `.describe()` to get an overview of the dataset. Check for any null values and ensure the data types are appropriate.

In [None]:
features.info()

In [None]:
features.describe()

The dataset looks clean, with no null values, and all features are real-valued, which saves us some additional preprocessing work. If we had discrete features, we’d need to convert them into numerical values, often using a technique called [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/).
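
For reference, a minimal sketch of one-hot encoding with pandas is shown below. The small DataFrame and its `grade` column are hypothetical (our dataset has no discrete features); `pd.get_dummies()` is one of several ways to do this.

```python
import pandas as pd

# Hypothetical example data: 'grade' is a discrete (categorical) feature
demo = pd.DataFrame({
    'radius': [12.1, 20.3, 15.8, 11.4],
    'grade': ['low', 'high', 'medium', 'low'],
})

# One-hot encode 'grade': each category becomes its own indicator column
encoded = pd.get_dummies(demo, columns=['grade'])
print(encoded.columns.tolist())
encoded
```

Each row has exactly one indicator column switched on, so no artificial ordering is imposed on the categories.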

Interpreting raw numbers alone can be challenging, so visualising the feature distributions can provide helpful insights. We can plot these distributions with histograms or [density plots](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.density.html). Let’s try visualising a few feature distributions.

In [None]:
# Create a histogram
features['mean radius'].plot.hist();

In [None]:
# Create a density plot
features['mean texture'].plot.density();

From the distributions of our data, we can observe that each feature occupies a different value range, although all are approximately [normally distributed](https://en.wikipedia.org/wiki/Normal_distribution). This suggests that scaling the features to the same range might improve our model's performance.

**Feature scaling** is a crucial part of data preprocessing, as some machine learning algorithms depend on it to function correctly. For instance, classifiers that use distance measurements can be disproportionately influenced by features with larger ranges. Feature scaling ensures that each feature contributes roughly equally to the distance calculations. However, not all algorithms require scaled input data.

In some cases, you may also consider **normalising** features so that distributions become closer to a normal shape. Normally distributed features often lead to better-performing models, as they provide an approximately equal number of observations above and below the mean. Many models assume normally distributed input data.

The difference between scaling and normalisation is that scaling changes the **range** of your data, while normalisation changes the **shape** of its distribution. Thus, normalisation is useful when your data distribution is irregular, while scaling is helpful when your data is normally distributed but has different ranges. If you’d like more detail, there’s a helpful guide [here](https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff).
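
The distinction can be checked numerically. The sketch below uses synthetic, right-skewed data (an assumption for illustration) and compares the skewness before and after each operation. A log transform stands in here for "normalisation"; other transforms could equally be used.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed data (log-normal), one feature with 1000 observations
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Scaling: shifts and rescales the values, but the shape is unchanged
scaled = StandardScaler().fit_transform(raw)

# Normalising (here a log transform): changes the shape of the distribution
logged = np.log(raw)

skew_raw = skew(raw[:, 0])
skew_scaled = skew(scaled[:, 0])
skew_logged = skew(logged[:, 0])

print('Skewness (raw):   ', round(skew_raw, 3))
print('Skewness (scaled):', round(skew_scaled, 3))
print('Skewness (logged):', round(skew_logged, 3))
```

The raw and scaled skewness values should match, while the log-transformed values should have a skewness much closer to zero.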

Remember, always examine your data!

#### Looking for correlations

Before performing any feature scaling, we can examine correlations in the data to identify features that may be especially useful for classification. By comparing the feature distributions for each class label and exploring the correlations between individual features, we gain insights into the data. For this analysis, we’ll use a Python package called [seaborn](https://seaborn.pydata.org/). **Seaborn** builds upon Matplotlib, offering convenient functions for creating informative visualisations.

Here, we’ll use the [`pairplot()` function](https://seaborn.pydata.org/generated/seaborn.pairplot.html) to plot pairwise relationships within the DataFrame:

 - `dataset.frame` is used, as it contains all features along with the target column.
 - The `hue` argument specifies which column to use for colouring the data points.
 - The `vars` argument defines which columns to compare.

In [None]:
import seaborn as sns

# We will restrict the columns to ones that contain the 'mean' features
cols = [x for x in features.columns if 'mean' in x]

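# Now we can construct the plot
# Note: `dataset.frame` keeps the original encoding (0 = malignant, 1 = benign),
# not the flipped labels stored in our `target` variable
sns.pairplot(dataset.frame, hue='target', vars=cols);

# You may see a `FutureWarning` message (a warning, not an error)
# The plots may take a while to load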

The plot is quite large but reveals interesting insights about our data. From the feature distributions on the diagonal, we can see that geometry-based features (such as radius, perimeter, and area) vary more across classes compared to features like texture and smoothness. Additionally, features like radius and perimeter show strong correlations, which makes sense given their geometric relationship.

Having so many plots on one page can sometimes be overwhelming. Instead, we can use a heatmap to visualise correlations between features more clearly. 

Pandas provides a `.corr()` method that calculates the correlation matrix, which we can then pass to Seaborn’s `heatmap()` [function](https://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap) for a consolidated view of feature correlations in a single plot.

In [None]:
# First calculate the correlation matrix
correlation = features.corr()

# Plot the result 
# Here we set up a Matplotlib figure to specify the size
plt.figure(figsize=(16,12)) 
sns.heatmap(correlation);

From the heatmap, we can observe several key correlations:

- **Geometric features**: High correlations between features like mean radius, perimeter, and area indicate that larger tumors have larger measurements across these dimensions.
- **"Worst" features**: Features labeled "worst" (e.g., worst radius, worst perimeter) are strongly correlated, as extreme values across these measurements often coincide.
- **Shape-related features**: Mean compactness, concavity, and concave points are moderately to strongly correlated, as they all relate to the tumor's contour.
- **Lower correlations**: Texture and symmetry have lower correlations with geometric features, suggesting they capture different tumor characteristics.

These correlations suggest some features provide overlapping information, which could allow for *dimensionality reduction* in further analysis.

Suppose we've identified that "mean concave points" may be a useful feature based on the distributions we’ve plotted. We can examine the correlation between this feature and the class label using the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). Since our target variable is dichotomous, this correlation is specifically known as a point-biserial correlation coefficient (real-valued vs. dichotomous categorical). We can use the [SciPy](https://scipy.org/) package to perform this test.

In [None]:
from scipy.stats import pearsonr

result = pearsonr(features['mean concave points'], target)
print('Correlation:', result[0])

In the context of a binary target variable (0 for "benign" and 1 for "malignant"), a positive correlation with a feature means that as the feature's value increases, the likelihood of the target being 1 ("malignant") also increases. Conversely, a negative correlation implies that as the feature’s value increases, the likelihood of the target being 0 ("benign") increases.

Looks like there is a definite correlation between the 'mean concave points' and our target variable. Higher concave points values are more often found in the "malignant" cases.
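
As a quick check of the point-biserial claim above, SciPy also provides `pointbiserialr()`, which should return the same value as `pearsonr()` for a binary variable. The sketch below reloads the data so that it runs on its own; the variable names are chosen for this example only.

```python
from scipy.stats import pearsonr, pointbiserialr
from sklearn import datasets

data = datasets.load_breast_cancer(as_frame=True)
x = data.data['mean concave points']
y = 1 - data.target  # Flipped labels, as above: 1 = malignant

# Pearson correlation between the feature and the binary target
r_pearson = pearsonr(x, y)[0]

# Point-biserial correlation (binary variable first, continuous second)
r_pb = pointbiserialr(y, x)[0]

print('Pearson r:       ', r_pearson)
print('Point-biserial r:', r_pb)
```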

Can you identify any other features with high correlation to the target?

Let's calculate a correlation for all features vs. the target and produce a visualisation.

In [None]:
import pandas as pd

# Step 1: Construct a DataFrame to store feature names and their correlation values
rankings = pd.DataFrame({
    'feature': features.columns  # The 'feature' column holds the feature names
})

# Step 2: Calculate the Pearson correlation coefficient for each feature
# We use an anonymous function to compute Pearson r between each feature and the target
rankings['pearsonr'] = rankings.feature.apply(lambda x: pearsonr(features[x], target)[0])

# Step 3: Sort the features by Pearson r in ascending order (most negative first;
# the horizontal bar chart draws the first row at the bottom)
rankings.sort_values('pearsonr', inplace=True)
rankings.reset_index(inplace=True, drop=True)

# Step 4: Plot the correlation results as a horizontal bar chart
ax = rankings['pearsonr'].plot.barh(figsize=(12, 10))
ax.set_title('Feature vs. Target Pearson correlation');
ax.set_yticklabels(rankings.feature);  # Set y-axis labels as feature names
ax.set_xlabel('Correlation');  # Label the x-axis as 'Correlation'
ax.set_ylabel('Feature'); # Label the y-axis as 'Feature'

The bar chart provides an overview of how each feature correlates with the target variable. Features at the top of the chart (higher positive correlation) may be more indicative of one class, while those at the bottom (higher negative correlation) may suggest the opposite class. This insight can help us identify which features are most informative for classifying the target.

We observe that some features show strong correlations with the target, while others appear to be less useful. Notably, both positive and negative correlations can be valuable. In practice, we might reduce the number of features to simplify the model, though in this notebook, we’ll use all features and later assess if the model’s results align with this correlation-based ranking.
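
If we did want to reduce the feature set, one simple (and fairly crude) option is to keep the features with the largest absolute correlation to the target. The sketch below assumes we keep the top 10; that number is arbitrary and chosen only for illustration.

```python
from sklearn import datasets

data = datasets.load_breast_cancer(as_frame=True)
X = data.data
y = 1 - data.target  # Flipped labels, as above: 1 = malignant

# Pearson correlation of every feature column with the target
corr = X.corrwith(y)

# Rank by absolute value so strong negative correlations also count
top_k = corr.abs().sort_values(ascending=False).head(10).index.tolist()

# Keep only the selected columns
X_reduced = X[top_k]
print('Selected features:', top_k)
print('Reduced shape:', X_reduced.shape)
```

Bear in mind that this ranking looks at each feature in isolation, so it may keep several features that carry almost the same information (for example radius, perimeter, and area).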

-----

## Dimensionality reduction

The number of input variables or 'features' in a dataset is referred to as its dimensionality. Dimensionality reduction techniques are a group of algorithms that can reduce the dimensionality of a dataset while preserving its essential information. Higher dimensionality often makes model construction more challenging, a problem known as the "*curse of dimensionality*".

Using dimensionality reduction can simplify feature sets for machine learning algorithms and also help visualise data with more than three dimensions. Visualisations in reduced dimensions can guide the next steps in predictive modelling, such as selecting suitable models.

**[Principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis)** is a widely-used dimensionality reduction tool. PCA computes the principal components: orthogonal directions (linear combinations of the original features) ordered by how much of the data's variance they capture. PCA is considered an unsupervised machine learning technique because it does not use the class labels. The mathematics of PCA is beyond the scope of this notebook.

Depicting data in two or three dimensions is straightforward with x-, y-, and z-axes, but higher dimensions cannot be plotted directly. PCA helps by reducing dimensions to two or three, making the data more visually interpretable. Remember, as noted earlier, some algorithms are sensitive to scale; PCA is one of them. 

Let's start by scaling the data using scikit-learn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which removes the mean and scales to unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

# Step 1: Setup the scaler
scaler = StandardScaler() 

# Step 2: Fit the scaler to the feature data and transform it
# This fits the scaler to the features dataset, calculating the mean and standard deviation for each feature,
# and then transforms each feature by subtracting the mean and dividing by the standard deviation
scaled_X = scaler.fit_transform(features)  # This results in a scaled dataset where each feature has a mean of 0 and a standard deviation of 1

# No output is expected from this cell

Now let's transform the scaled data with PCA.

In [None]:
from sklearn.decomposition import PCA

# Step 1: Create an instance of PCA to reduce dimensionality
# Setting n_components=2 will reduce the dataset to two principal components, making it easier to visualise
pca = PCA(n_components=2)

# Step 2: Fit the PCA model to the scaled data and transform it
# This step calculates the two principal components that capture the maximum variance in the data 
# and then transforms the original data to this reduced two-dimensional space
Xt = pca.fit_transform(scaled_X)

# Display the shape of the transformed data to verify the reduction
print('New shape:', Xt.shape)

# Step 3: Create a new DataFrame to store the principal components
# This DataFrame includes two columns: PC1 and PC2, representing the first and second principal components
components = pd.DataFrame(Xt, columns=['PC1', 'PC2'])

# Step 4: Add the target labels for easier visualisation of the classes
components['Target'] = target
components

A simple way to visualise this is a scatter plot, drawn with seaborn's `scatterplot()` function. This visualisation helps us understand how well the reduced dimensions separate the classes.

In [None]:
# Convert binary labels to text
components['Target'] = components.Target.apply(lambda x: 'Benign' if x == 0 else 'Malignant')

# Create scatter plot
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PC1', y='PC2', hue='Target', data=components);

Our classes look quite easily separable using the features we have. This is good news for our predictive modelling: we can be reasonably confident that a machine learning method will perform well. 

If we were to use features from PCA to train a model, how do we decide how many components to use? We look at the cumulative sum of the explained variance ratio, which tells us how much of the dataset's total variance is captured by adding each new component. By plotting this cumulative variance, we can see the point at which adding more components provides diminishing returns in terms of the variance explained.

This plot helps us determine the optimal number of components to keep. For example, if we see that 95% of the variance is explained by the first 10 components, we might decide to use 10 components for our model instead of the full set. This allows us to reduce dimensionality while retaining most of the information in the dataset.

The code first fits the PCA model to the scaled data, finds the explained variance ratio for each component, and then plots the cumulative sum of these ratios against the number of components. (*Don't worry if you do not understand the next piece of code, the plot is more interesting.*)

In [None]:
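import numpy as np

# Fit PCA with all components and accumulate the explained variance ratios
pca = PCA().fit(scaled_X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Plot against 1..n so the x-axis reads as the number of components kept
plt.plot(np.arange(1, len(cumulative) + 1), cumulative)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');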

**We can see now that ~6 components explain ~90% of the entire variance!** This could be a practical starting point for building a model, as it balances retaining most of the information while reducing the dataset's dimensionality. 

Reducing the number of features to 6 components simplifies the dataset, potentially improving the model's computational efficiency and performance without a significant loss of information. This approach is particularly helpful if we suspect that only a few of the features are truly informative for the classification task.
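
Rather than reading the number of components off the plot, scikit-learn's `PCA` also accepts a fraction for `n_components` and then keeps as many components as are needed to reach that share of the variance. A minimal sketch, assuming a 90% target (the threshold is our choice here, not a rule):

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reload and scale the features so this example runs on its own
X = datasets.load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining at least 90% of the variance
pca_90 = PCA(n_components=0.90)
X_reduced = pca_90.fit_transform(X_scaled)

print('Components kept:', pca_90.n_components_)
print('Variance explained:', pca_90.explained_variance_ratio_.sum())
```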

## Train/Test splits

To ensure that our model can generalise well to new data, we divide our dataset into two subsets:

1. **Training set**: The larger portion of the data used to train the model.
2. **Test set**: A smaller portion that remains "unseen" by the model during training. This allows us to evaluate how well the model performs on new, previously unseen data, helping us assess its generalisability.

Testing on an unseen subset helps us identify if our model is overfitting—meaning it performs well on the training data but fails to generalise to new data. Overfitting can lead to misleadingly high performance during training but poor accuracy in real-world scenarios.

<p align="center">
  <img src="https://www.researchgate.net/profile/Konstantinos-Patlatzoglou/publication/364197638/figure/fig6/AS:11431281088359012@1665048828903/sualizations-of-model-underfitting-and-overfitting-resulting-in-different-training-and.png" alt="Overfitting" width="80%"/>
  <br>
</p>

[Image source](https://www.researchgate.net/profile/Konstantinos-Patlatzoglou/publication/364197638/figure/fig6/AS:11431281088359012@1665048828903/sualizations-of-model-underfitting-and-overfitting-resulting-in-different-training-and.png)

In this example, we’ll use an 80/20 split, with 80% for training and 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

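# Split the features (pandas DataFrame) and labels (pandas Series)
# shuffle=False keeps the original row order, so the last 20% of rows form the test set
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, shuffle=False)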

print('X train shape:', x_train.shape)
print('X test shape:', x_test.shape)
print('Y train shape:', y_train.shape)
print('Y test shape:', y_test.shape)
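
The split above keeps the rows in their original order. A common alternative, sketched below, is a shuffled and *stratified* split, which keeps the proportion of each class roughly equal in the training and test sets. This is not the split used in the rest of this notebook, so the results reported later would differ if you switched to it; `random_state=0` is an arbitrary choice to make the example repeatable.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer(as_frame=True)
X, y = data.data, 1 - data.target  # 1 = malignant

# Shuffle the rows and preserve the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Since the labels are 0/1, the mean is the fraction of malignant samples
print('Malignant fraction (all):  ', round(y.mean(), 3))
print('Malignant fraction (train):', round(y_tr.mean(), 3))
print('Malignant fraction (test): ', round(y_te.mean(), 3))
```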

## Training models

There are numerous machine learning algorithms we could apply to this problem. Popular choices include:

- [Nearest Neighbours](https://scikit-learn.org/stable/modules/neighbors.html)
- [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
- [Decision Tree](https://scikit-learn.org/stable/modules/tree.html)
- [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#random-forests)
- [Support Vector Machine](https://scikit-learn.org/stable/modules/svm.html)
- [Neural Network](https://scikit-learn.org/stable/modules/neural_networks_supervised.html)

While detailed explanations of these algorithms are beyond the scope of this notebook, you can explore the links above if you're interested in learning more about machine learning techniques.

In practice, data scientists often train multiple algorithms to identify the best-performing solution. Each algorithm makes different assumptions about the data to generalise and make predictions, so the effectiveness of a model depends on how well these assumptions align with the underlying patterns in the data.

In this notebook, we will focus on the **Logistic Regression algorithm** to model our problem. Despite its name, Logistic Regression is actually used for classification tasks and is conceptually similar to [linear regression](https://en.wikipedia.org/wiki/Linear_regression). However, unlike linear regression, which fits data to a straight line, logistic regression uses an S-shaped curve. This method is based on the concept of [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation), which estimates the parameters of an assumed probability distribution function based on observed data, aiming to make the observed data most probable.

<p align="center">
  <img src="https://static.javatpoint.com/tutorial/machine-learning/images/linear-regression-vs-logistic-regression.png" alt="Logistic vs Linear Regression" width="50%"/>
  <br>
</p>

[Image source](https://static.javatpoint.com/tutorial/machine-learning/images/linear-regression-vs-logistic-regression.png)

One advantage of Logistic Regression is that it can provide probability estimates for class membership, giving insight into the model's confidence in its predictions. What do you think are the benefits of being able to provide a probability estimation for a particular prediction?

Based on our observations from PCA, the two classes appear close to linearly separable, which makes Logistic Regression a reasonable choice. Let's build a model using a [Logistic Regression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) from scikit-learn. While scaling isn’t strictly necessary, it can help speed up convergence and makes the coefficients directly comparable across features.

In [None]:
from sklearn.linear_model import LogisticRegression

# Setup our standard scaler
scaler = StandardScaler()

# Scale our training features
x_train_scaled = scaler.fit_transform(x_train)

# Initialize our model (default parameters)
lr = LogisticRegression()

# Fit our model on the scaled training data (features, target)
lr.fit(x_train_scaled, y_train)

That was simple and fast! 

**The model however is fairly useless to us if we have not performed any evaluation.** How do we know if it can generalise to unseen data?

## Model evaluation

Model evaluation is performed on a held-out set of observations to assess a model's ability to generalise to unseen data. Here, we set aside 20% of our data for this purpose. 

To begin, we need to make predictions for the test observations using our Logistic Regression model. Since we scaled the training data, we also need to scale the test data in the same way. 

However, instead of using `.fit_transform()`, which both fits and transforms the data, we will use `.transform()` on the test set. This ensures that the scaling is consistent with the training data and prevents data leakage from the test set.

In [None]:
# Scale our testing features
x_test_scaled = scaler.transform(x_test)

# Make our predictions
y_pred = lr.predict(x_test_scaled)
print('Predictions:', y_pred)
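
The scale-then-predict steps can also be bundled into a single scikit-learn pipeline, which fits the scaler on the training data only and reuses it automatically at prediction time. The sketch below is one possible way to do this, not the approach used in the rest of the notebook. It also prints the class probabilities mentioned earlier via `.predict_proba()`.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer(as_frame=True)
X, y = data.data, 1 - data.target  # 1 = malignant
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

# The pipeline scales the data and then fits the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

# One row per test sample; columns are P(benign) and P(malignant)
proba = model.predict_proba(X_te)
print(proba[:5].round(3))
```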

#### Accuracy
Accuracy is perhaps the simplest metric we can use to evaluate our model. It measures how well our model correctly predicts the true labels of input observations, calculated as follows:

$Accuracy = \frac{Number\ of \ correct\ predictions}{Total\ number\ of\ predictions}$

In other words, accuracy reflects the proportion of correct predictions over the total predictions made by the model.

In [None]:
# Calculate the accuracy (fraction of correct predictions) on the test data
accuracy = lr.score(x_test_scaled, y_test)

# Format as a percentage with one decimal place
print(f'Accuracy of Logistic Regression model on test data: {accuracy:.1%}')

Although accuracy is a common evaluation metric, it should be treated with caution! Accuracy is only a reliable metric if the dataset has balanced classes. For example, if a dataset contains 90% of class A and only 10% of class B, a model could achieve 90% accuracy by always predicting class A. However, this would not reflect the model’s ability to distinguish between classes accurately.

This limitation can be especially concerning in critical applications, such as diagnostics, where a model might fail to identify a disease due to an over-reliance on accuracy as a performance metric. In our case, the "malignant" class is the minority class, so we should be cautious with accuracy and consider using other metrics, like precision, recall, or F1-score, to get a better sense of the model’s true performance.
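
The 90%/10% example above can be reproduced with a few lines of code. The labels below are made up for illustration: class B (the minority) is encoded as 1, and the "model" simply predicts class A (0) every time.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 90 samples of class A (0) and 10 of class B (1)
y_true = [0] * 90 + [1] * 10

# A "model" that always predicts class A
y_always_a = [0] * 100

acc = accuracy_score(y_true, y_always_a)
# zero_division=0 returns 0 (without a warning) when a metric is undefined
prec = precision_score(y_true, y_always_a, zero_division=0)
rec = recall_score(y_true, y_always_a, zero_division=0)
f1 = f1_score(y_true, y_always_a, zero_division=0)

print('Accuracy: ', acc)
print('Precision:', prec)
print('Recall:   ', rec)
print('F1-score: ', f1)
```

The accuracy looks respectable, yet the recall for class B is zero: the model never finds a single minority-class sample.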

#### Confusion matrix

In a classification problem, there are four possible types of outcomes:

- **True Positives (TP)**: Correctly predicted positive cases (success!)
- **False Positives (FP)**: Incorrectly predicted positive cases (Type I error, or failure)
- **True Negatives (TN)**: Correctly predicted negative cases (success!)
- **False Negatives (FN)**: Incorrectly predicted negative cases (Type II error, or failure)

A **confusion matrix** is a useful tool for visualising these outcomes in a grid format. Ideally, we want higher values along the diagonal (representing true positives and true negatives), which indicates accurate predictions. 

The confusion matrix also serves as a foundation for calculating other important metrics, such as precision, recall, and F1-score:

<p align="center">
  <img src="https://2.bp.blogspot.com/-EvSXDotTOwc/XMfeOGZ-CVI/AAAAAAAAEiE/oePFfvhfOQM11dgRn9FkPxlegCXbgOF4QCLcBGAs/s1600/confusionMatrxiUpdated.jpg" alt="Confusion Matrix" width="60%"/>
  <br>
</p>

[Image source](https://2.bp.blogspot.com/-EvSXDotTOwc/XMfeOGZ-CVI/AAAAAAAAEiE/oePFfvhfOQM11dgRn9FkPxlegCXbgOF4QCLcBGAs/s1600/confusionMatrxiUpdated.jpg)

Let’s plot a confusion matrix for our model to visualise its performance.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Construct a confusion matrix (NumPy array)
c_m = confusion_matrix(y_true=y_test, y_pred=y_pred)

# Plot the confusion matrix
ConfusionMatrixDisplay(c_m, display_labels=['Benign', 'Malignant']).plot(cmap='Blues');

In this confusion matrix, we can observe the following:

- The model has predicted 86 true negatives (correctly identified "benign" cases) and 26 true positives (correctly identified "malignant" cases).
- There are 2 false positives, where "benign" cases were incorrectly predicted as "malignant".
- There are no false negatives, meaning no "malignant" cases were incorrectly classified as "benign".

The model's errors are limited to a small number of false positives, where "benign" cases are flagged as "malignant". In a clinical context, false positives (diagnosing "benign" as "malignant") are often preferable to false negatives (diagnosing "malignant" as "benign"), because a false negative means that a cancer goes undetected.
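
As a worked example, the counts reported in the confusion matrix above can be plugged into the standard definitions of accuracy, precision, recall and F1-score. This is a minimal sketch in plain Python; the counts are copied from the text rather than recomputed from the model.

```python
# Counts read from the confusion matrix above
tp, fp, tn, fn = 26, 2, 86, 0

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)  # of the cases flagged "malignant", how many really are
recall = tp / (tp + fn)     # of the truly "malignant" cases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
# accuracy=0.982 precision=0.929 recall=1.000 F1=0.963
```

The two false positives lower precision, while recall stays at 1 because no "malignant" case was missed.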

#### Receiver operating characteristic (ROC) curve

The ROC curve is a visual tool that summarises the confusion matrices obtained at every possible classification threshold, making it popular for evaluating binary classifiers. It plots the *true positive rate* (Recall/Sensitivity) against the *false positive rate*. A model that performs no better than random chance produces a diagonal line from (0, 0) to (1, 1), indicating no discrimination between classes.

The **area under the curve (AUC)** provides a single metric to evaluate the model's performance. An **AUC** of 1 indicates perfect classification, while 0.5 suggests random guessing. The **AUC** represents the likelihood that the classifier will correctly rank a randomly chosen positive example higher than a randomly chosen negative one, making it a valuable metric for classifier evaluation.
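
The ranking interpretation of the AUC can be checked by hand. The sketch below uses four made-up examples (the `labels` and `scores` lists are illustrative only) and counts the fraction of positive/negative pairs in which the positive example receives the higher score.

```python
from itertools import product

# Toy labels and predicted scores, for illustration only
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos_scores = [s for s, y in zip(scores, labels) if y == 1]
neg_scores = [s for s, y in zip(scores, labels) if y == 0]

# Fraction of (positive, negative) pairs in which the positive scores higher;
# ties count as half a win
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos_scores, neg_scores))
pairwise_auc = wins / (len(pos_scores) * len(neg_scores))
print(pairwise_auc)  # 0.75
```

Three of the four pairs are ranked correctly, so the AUC is 0.75; only the positive example scored 0.35 falls below a negative example (scored 0.4).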

<p align="center">
  <img src="https://miro.medium.com/max/1400/1*uC8BcLIMqYTmmojrFFzB9g.png" alt="ROC Curve" width="40%"/>
  <br>
</p>

[Image source](https://miro.medium.com/max/1400/1*uC8BcLIMqYTmmojrFFzB9g.png)

Let’s calculate the **AUC** and plot the ROC curve for our model.

In [None]:
from sklearn.metrics import RocCurveDisplay, roc_curve, auc

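# First calculate our FPR and TPR.
# Note: y_pred holds hard class labels, so the curve has a single operating point;
# passing predicted probabilities for the positive class would trace the full curve.
fpr, tpr, thresholds = roc_curve(y_test, y_pred)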

# Calculate the AUC from FPR and TPR
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc, estimator_name='Logistic Regression').plot();

- The ROC curve shows a high *true positive rate* at a low *false positive rate*, with an AUC of 0.99.
- An AUC this close to 1 indicates that the model separates the two classes well on this test set.
- A single, small test set gives a noisy estimate, however, so a larger dataset would make the result more trustworthy and make overfitting easier to detect.
- Further evaluation on larger and more diverse datasets would be necessary to confirm the model's robustness for real-life diagnostic use, particularly in critical applications like healthcare.
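
The ROC cell above passes hard class predictions (`y_pred`) to `roc_curve`, which yields a curve with a single operating point. A common alternative is to pass predicted probabilities for the positive class, so that many thresholds are swept. The sketch below illustrates the difference on synthetic data from `make_classification`; the data and the names `X_demo`, `y_demo` and `clf_demo` are assumptions for illustration and are not part of this notebook's dataset or model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Hard labels give a single threshold; probabilities sweep many thresholds
fpr_hard, tpr_hard, _ = roc_curve(y_te, clf_demo.predict(X_te))
proba = clf_demo.predict_proba(X_te)[:, 1]  # probability of the positive class
fpr_prob, tpr_prob, _ = roc_curve(y_te, proba)
auc_prob = roc_auc_score(y_te, proba)

print("points on curve (hard labels):  ", len(fpr_hard))
print("points on curve (probabilities):", len(fpr_prob))
print("AUC from probabilities:", round(auc_prob, 3))
```

The AUC computed from probabilities reflects how well the model ranks cases across all thresholds, rather than its behaviour at the default 0.5 cut-off alone.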

In practice, a data scientist would apply [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) techniques to obtain more reliable estimates of evaluation metrics. Cross-validation involves dividing the data into multiple subsets, training the model on some subsets while validating it on others. This process is repeated multiple times, allowing the model's performance to be assessed across different splits of the data. The result is a more robust estimate of performance, and a large gap between training and validation scores helps to reveal overfitting.
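
A minimal sketch of 5-fold cross-validation with scikit-learn is shown below. It uses synthetic data so that it runs on its own; the names `X_cv`, `y_cv` and `model_cv`, the number of folds and the `roc_auc` scoring choice are all assumptions for illustration, not settings used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the dataset used in this notebook)
X_cv, y_cv = make_classification(n_samples=500, n_features=10, random_state=0)

# Scaling sits inside the pipeline so it is re-fitted on each training fold only
model_cv = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class balance similar in every split
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model_cv, X_cv, y_cv, cv=folds, scoring="roc_auc")

print("AUC per fold:", cv_scores.round(3))
print(f"mean = {cv_scores.mean():.3f}, std = {cv_scores.std():.3f}")
```

Reporting the mean together with the spread across folds gives a sense of how much a single train/test split could mislead.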

## Feature importance

We've now trained and evaluated a Logistic Regression model, but how can we interpret it? Specifically, what features is the model using to make predictions, and which are less influential? 

Model interpretation is a critical aspect of machine learning, especially when applying models in real-life contexts where understanding the decision-making process is essential. Fortunately, we can use the model’s coefficients to get a basic sense of 'feature importance'.

> In logistic regression, the coefficient $\beta$ of a predictor $X$ describes how a one-unit increase in $X$ changes the odds of the outcome, with all other predictors held constant. Specifically:
>
> - If $\beta$ is positive, increasing $X$ raises the odds of the outcome.
> - If $\beta$ is negative, increasing $X$ lowers the odds of the outcome.
>
> Each one-unit increase in $X$ multiplies the odds by $e^{\beta}$.

Since our data has been scaled, all features are on the same scale, making the coefficients directly comparable.
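
To see what the odds interpretation means in numbers, here is a small worked example in plain Python. The coefficient `beta = 0.7` and the starting probability of 0.2 are hypothetical values chosen for illustration; they are not taken from our fitted model. If the features were standardised, "one unit" corresponds to one standard deviation of the feature.

```python
import math

beta = 0.7  # hypothetical coefficient, not taken from our fitted model

odds_ratio = math.exp(beta)            # multiplicative change in odds per unit of X
odds_before = 0.2 / (1 - 0.2)          # probability 0.2 expressed as odds (0.25)
odds_after = odds_before * odds_ratio  # odds after a one-unit increase in X
prob_after = odds_after / (1 + odds_after)

print(f"odds ratio = {odds_ratio:.2f}, probability 0.20 -> {prob_after:.2f}")
# odds ratio = 2.01, probability 0.20 -> 0.33
```

Note that the coefficient acts on the odds, not directly on the probability: doubling the odds moves a probability of 0.20 to about 0.33, not to 0.40.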

In [None]:
# Extract coefficients from the trained Logistic Regression model
coefficients = lr.coef_[0]

# Create a DataFrame to store feature names and their corresponding coefficients
feat_importance = pd.DataFrame({
    'feature': features.columns,
    'coef': coefficients
})

# Sort features by their coefficient values
feat_importance.sort_values('coef', inplace=True)

# Plot the feature importance as a horizontal bar chart
ax = feat_importance['coef'].plot.barh(figsize=(12, 10))
ax.set_title('Logistic Regression coefficients');
ax.set_yticklabels(feat_importance.feature);
ax.set_xlabel('Coefficient');

The "feature importances" (logistic regression coefficients) in this chart have some similarities with the Pearson correlation coefficients calculated earlier, but there are also notable differences:

1. **Top predictors**: Features like "radius error," "worst texture," and "worst concave points" are prominent in both analyses, indicating they are significant predictors of the target outcome in both Pearson correlation and logistic regression.

2. **Sign direction**: The sign (positive or negative) of coefficients aligns with the Pearson correlation directions. For instance, features with high positive coefficients (like "radius error" and "worst texture") also had high positive Pearson correlation scores, meaning they both increase the likelihood of the target being "malignant".

3. **Magnitude differences**: Some features with strong correlations have less pronounced coefficients and vice versa. This is because each Logistic Regression coefficient is estimated jointly with all the others, so correlated features can share or shift weight between them, while Pearson's r measures each feature's linear correlation with the target in isolation, ignoring multicollinearity.

4. **Interpretation nuances**: Logistic Regression coefficients help interpret how each feature contributes to the odds of classification, while Pearson’s r only tells us about direct linear associations. This difference could lead to discrepancies in feature importance when features are interdependent.

While there are overlaps, Logistic Regression coefficients provide a model-specific measure of feature importance that is conditional on the other features in the model, which might differ from the more isolated view provided by Pearson correlations.
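
The effect of correlated features can be reproduced in a small simulation. In the sketch below, which uses entirely synthetic data and illustrative names (`x1`, `x2`, `y_sim`), the outcome depends only on `x1`, while `x2` is a noisy copy of `x1`. Both features correlate with the outcome, but the fitted model should place most of its weight on `x1`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# x1 drives the outcome; x2 is only a noisy copy of x1
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)
p = 1 / (1 + np.exp(-2 * x1))  # true probability depends on x1 alone
y_sim = rng.binomial(1, p)

# Pearson correlation of each feature with the outcome, taken in isolation
r_x1 = np.corrcoef(x1, y_sim)[0, 1]
r_x2 = np.corrcoef(x2, y_sim)[0, 1]

# Logistic regression coefficients, estimated jointly
X_sim = np.column_stack([x1, x2])
coef_x1, coef_x2 = LogisticRegression().fit(X_sim, y_sim).coef_[0]

print(f"Pearson r:    x1={r_x1:.2f}, x2={r_x2:.2f}")
print(f"Coefficients: x1={coef_x1:.2f}, x2={coef_x2:.2f}")
```

Here `x2` looks informative when judged by Pearson's r alone, yet it adds little once `x1` is already in the model, which is the kind of discrepancy described in the list above.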

There are, of course, more sophisticated and more robust ways of calculating feature importance.
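
One example of such an approach is permutation importance, available in scikit-learn as `sklearn.inspection.permutation_importance`: each feature is shuffled in turn on held-out data, and the resulting drop in a chosen score is recorded. The sketch below is a minimal illustration on synthetic data; the data, the names `X_pi`, `y_pi` and `clf_pi`, and the `roc_auc` scoring choice are assumptions and are not part of this notebook's analysis.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 3 informative features and 3 pure-noise features
X_pi, y_pi = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_pi, y_pi, random_state=0, stratify=y_pi)

clf_pi = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature in turn on held-out data and record the drop in AUC
result = permutation_importance(
    clf_pi, X_te, y_te, scoring="roc_auc", n_repeats=20, random_state=0
)

# List features from most to least important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Because the score is measured on held-out data, this approach ties importance to predictive performance and can be applied to any fitted model, not only linear ones.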

## Discussion

Predictive modelling is a huge part of the data science landscape today. It involves applying statistical techniques to forecast future outcomes. Most predictive models rely on machine learning algorithms to analyse patterns and make predictions. Thorough data exploration is critical for building a successful model. Accurate model evaluation is especially important when considering real-world applications.

Feel free to add more code cells and experiment with the concepts you have learnt.

You can use this notebook as a reference if you need to refresh your knowledge of any of the concepts explored.

You can click [here](#Contents) to go back to the top.