### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 07 - Introduction to Predictive Modelling

*Written by:* Oliver Scott

**This notebook provides a general introduction to predictive modelling using scikit-learn.**

Do not be afraid to make changes to the code cells to explore how things work!

-----

### What is Predictive Modelling?

In short, predictive modelling is the application of statistical models to predict future outcomes. Typically, a predictive model includes some form of machine learning algorithm, which is trained using existing observations in order to make a prediction given new observations. Predictive modelling can be generalised into two main classes: regression and classification.

**Regression:**

Regression models are based on the analysis of relationships between a dependent variable ('outcome'/'response' variable) and one or more independent variables ('predictors'/'features'), e.g. predicting the output of a biological system under different conditions. 

**Classification:**

Classification models aim to assign discrete class labels to a set of independent variables rather than a continuous value as in regression. For example a classification model could be used to diagnose a condition (or not) based on measurements such as gene expression.

To keep it simple, in this notebook we will focus on a supervised classification objective, predicting predefined class labels for a set of observations.

### Supervised?

Classification tasks can be further grouped into two main categories: supervised and unsupervised. In supervised learning, we know the class labels for the training data *a priori*, hence we can use this knowledge to 'supervise' the model's learning process. The following image illustrates a classification task for samples with two random variables (x1, x2), where the class is indicated by the colour and the dotted line represents the (linear) decision boundary used to define two decision regions. New observations are assigned to a class depending on which decision region they fall into. We should assume that our model will not be completely accurate when given unseen observations, misclassifying some percentage of input samples.

<p align="center">
  <img src="https://scipython.com/static/media/uploads/blog/logistic_regression/decision-boundary.png" alt="Decision boundary" width="30%"/>
  <br>
</p>
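
To make the idea of a decision boundary concrete, here is a minimal sketch of a linear classifier on two variables. The weights `w1`, `w2` and the bias `b` are hypothetical values picked by hand for illustration; a real model would learn them from labelled training data.

```python
def predict(x1, x2, w1=1.0, w2=1.0, b=-5.0):
    """Assign a class from the side of the line w1*x1 + w2*x2 + b = 0."""
    score = w1 * x1 + w2 * x2 + b
    return 1 if score > 0 else 0


# Hypothetical observations (x1, x2): two on each side of the boundary
samples = [(1.0, 2.0), (2.0, 1.5), (4.0, 3.0), (3.5, 4.0)]
labels = [predict(x1, x2) for x1, x2 in samples]
print(labels)
```

Every point on one side of the line receives class 0 and every point on the other side receives class 1, which is exactly what the two decision regions in the image represent.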

In contrast to supervised classification, unsupervised approaches are used when the labels for a set of observations are not known *a priori* and must be inferred from the observations themselves. Typically an unsupervised algorithm consists of some form of 'clustering' algorithm, grouping data into clusters (groups) based on some form of distance (or similarity) measurement. In this notebook we will run through a basic pipeline for constructing a supervised classification model.

### What is scikit-learn?

[scikit-learn](https://scikit-learn.org/stable/) (sklearn) is one of the most popular and robust Python packages for machine learning. It provides a selection of efficient tools for machine learning and statistical modelling including classification, regression, clustering and dimensionality reduction. The full set of features can be seen at the [project's website](https://scikit-learn.org/stable/). In this notebook we will be utilising scikit-learn to construct a supervised classification model.

-----

# Contents

1. [Objective](#Objective)
2. [Data Analysis](#Data-Analysis)
3. [Dimensionality Reduction](#Dimensionality-Reduction)
4. [Training Models](#Training-Models)
5. [Model Evaluation](#Model-Evaluation)
6. [Feature Importance](#Feature-Importance)
7. [Discussion](#Discussion)

-----

#### Extra Resources:

- [scikit-learn documentation](https://scikit-learn.org/stable/)

-----

#### References:

- [scikit-learn documentation](https://scikit-learn.org/stable/)

-----

## Objective

As mentioned previously, in this notebook we will be learning how to construct a supervised classification model. Specifically, we will be using the 'Breast Cancer Wisconsin (Diagnostic) dataset' created by Dr. William H. Wolberg, physician at the University of Wisconsin Hospital at Madison, Wisconsin, USA. The features are computed from an image of a fine needle aspirate (FNA) of a breast mass. A software program, 'Xcyt', was used to describe characteristics of the cell nuclei present in the image. The program constructs 10 features using a curve-fitting algorithm and then calculates the mean value, extreme value and standard error for each feature, resulting in 30 real-valued features. The process is described in [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

<p align="center">
  <img src="https://jithinjk.github.io/blog/images/histo/pcam.png" alt="breastcancer" width="100%"/>
  <br>
</p>

[image source](https://jithinjk.github.io/blog/images/histo/pcam.png)

**Attributes:**

Target:

- Malignant or Benign

Features:

- Radius (mean of distances from center to points on the perimeter)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness (local variation in radius lengths)
- Compactness (perimeter^2 / area - 1.0)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry
- Fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
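
The 10 x 3 = 30 arithmetic can be sketched in a few lines. The naming pattern below ('mean radius', 'radius error', 'worst radius') is assumed to match the column names used by scikit-learn's copy of the dataset; it is only an illustration of how the 30 features arise.

```python
# The ten base measurements computed for each cell nucleus
base_features = [
    'radius', 'texture', 'perimeter', 'area', 'smoothness',
    'compactness', 'concavity', 'concave points', 'symmetry',
    'fractal dimension',
]

# Each measurement is summarised three ways: mean, standard error, worst
feature_names = (
    [f'mean {name}' for name in base_features]
    + [f'{name} error' for name in base_features]
    + [f'worst {name}' for name in base_features]
)

print(len(feature_names))  # 10 measurements x 3 summaries
print(feature_names[0], '|', feature_names[10], '|', feature_names[20])
```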

Class distribution: 357 benign, 212 malignant

**Objective**

Our objective is to construct a model capable of classifying whether a breast sample is malignant or benign using the given features and a supervised classification algorithm.

**Talking Points:**

- Do you think that using a machine learning model that can diagnose cancer accurately would be beneficial in a clinical setting?
- Can you think of any reasons why a machine learning model might not be used to make such diagnoses?
- Can you identify any other biomedical related (or not) problems that may benefit from predictive modelling?

-----

## Data Analysis

Before we even think about building a model we need to understand the data we have been given/collected. Probably the most important step is actually getting the data into Python. Luckily scikit-learn provides a function for loading this particular dataset. scikit-learn is imported with the name `sklearn` (hyphens `-` are not allowed in Python package/module names):


In [None]:
from sklearn import datasets  # import the datasets module

# Load the dataset (it ships with scikit-learn) as pandas objects:
dataset = datasets.load_breast_cancer(as_frame=True)

The dataset object contains some useful attributes to get some initial information about the dataset:

In [None]:
# Print the targets in the dataset
print('Targets:', dataset.target_names)

In [None]:
# Print the names of the features in the dataset
print('Features:', dataset.feature_names)

The scikit-learn dataset also contains the actual data as a Pandas `DataFrame` (features) and a Pandas `Series` (target). Let's take a look at these and assign them to variables to use later:

In [None]:
target = dataset.target  # Our class labels
target.unique()

In [None]:
features = dataset.data  # Our features
features.head(5)

Notice that the targets are binary encoded [0, 1]. Let's check how many there are of each label:

In [None]:
target.value_counts()  # remember this from the last notebook?

Looks like there are 212 examples of class 0 'malignant' and 357 examples of class 1 'benign'. I would prefer that the class labels were the other way around, 0 being 'benign' and 1 being 'malignant' so that it reflects a real-life scenario:

In [None]:
target = 1 - target  # simple way to flip the labels!

We can also plot the class counts to get a nice visual representation:

In [None]:
# We add this 'Jupyter magic' to display plots in the notebook.
%matplotlib inline
import matplotlib.pyplot as plt

# Construct the plot (sorted by label so the bars line up with the tick labels below):
target.value_counts().sort_index().plot.bar()

# Set the axis labels:
plt.xlabel('Target classification')
plt.ylabel('Count')
plt.xticks([0, 1], ['benign', 'malignant']);

Now let's take a look at the features. Remember that we can use `.info()` and `.describe()` to get some nice global summaries. Check for any null values and make sure the data types make sense:

In [None]:
features.info()

In [None]:
features.describe()

Looks like the dataset is nice and clean, no null values! All of the features are also real-valued, saving us from some extra work. When we have discrete features we need to use a method to 'encode' these features into numerical values. A technique called [one-hot-encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) is often used in this case.
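
As a rough illustration of what one-hot encoding does, the sketch below encodes a made-up discrete feature by hand. The `tissue` values and the `one_hot_encode` helper are hypothetical and not part of this dataset or of scikit-learn.

```python
def one_hot_encode(values):
    """Return the sorted categories and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical discrete feature (not part of the breast cancer dataset)
tissue = ['breast', 'lung', 'breast', 'skin']
categories, encoded = one_hot_encode(tissue)

print(categories)
for row in encoded:
    print(row)
```

Each category becomes its own 0/1 column, so no artificial ordering is imposed on the categories. In practice you would more likely reach for `pandas.get_dummies` or scikit-learn's `OneHotEncoder` than write this yourself.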

Looking at raw numbers can often be hard to interpret, so visualising the distributions of features can be insightful. We can use pandas to plot these distributions using a histogram or a [density plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.density.html). Let's try visualising a couple of feature distributions:

In [None]:
# Create a histogram:
features['mean radius'].plot.hist();

In [None]:
# Create a density plot:
features['mean texture'].plot.density();

What observations can you make about the features? Try plotting some further features to make further observations.

From the distributions of our data we can see that each feature occupies a different value range, although all are approximately [normally distributed](https://en.wikipedia.org/wiki/Normal_distribution). This observation tells us that we may benefit from scaling the features so that they occupy the same value range.  

**Feature scaling** is an important part of data pre-processing, as some machine learning algorithms will not function correctly without it. For example, some classifiers use a distance measurement between two points. If one feature has a large range of values, then this distance measurement will be influenced far more by that feature than by a feature with a smaller range. Feature scaling brings all data into the same range so that each feature contributes approximately proportionately to the distance. Note that not all machine learning algorithms require that input data is scaled.
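
The effect of value ranges on a distance measurement can be seen in a small sketch. The three samples below are made-up values loosely resembling an area-like and a smoothness-like feature, and the `standardise` helper is a hand-written stand-in for a proper scaler.

```python
import math
import statistics


def standardise(column):
    """Rescale a list of values to zero mean and unit variance (z-scores)."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return [(value - mean) / std for value in column]


def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical samples: (area-like feature, smoothness-like feature)
area = [500.0, 520.0, 1000.0]
smoothness = [0.08, 0.16, 0.09]

raw = list(zip(area, smoothness))
scaled = list(zip(standardise(area), standardise(smoothness)))

# Distance from sample 0 to sample 1 before and after scaling
print('raw:   ', euclidean(raw[0], raw[1]))
print('scaled:', euclidean(scaled[0], scaled[1]))
```

Before scaling, the distance is almost entirely the difference in area, even though smoothness doubles between the two samples. After scaling, the smoothness difference is no longer drowned out.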

In some cases you may want to **normalise** features so that features with strange distributions become more 'normal'. Normally distributed features tend to lead to better models because there is an approximately equal number of observations above and below the mean. Many models make the assumption that your data is normally distributed.

The difference between scaling and normalisation is that scaling changes the **range** of your data whereas normalisation changes the **shape** of the distribution. Therefore normalisation is useful when you know your data is not normally distributed and scaling is useful when your data is normally distributed but occupies different value ranges. There is a nice guide [here](https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff) if you need further explanation.
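
The range-versus-shape distinction can be checked numerically. In the sketch below, skewness is used as a simple measure of shape, min-max scaling stands in for 'scaling', and a log transform stands in for 'normalisation'. Both are just one common choice each, and the data values are made up.

```python
import math
import statistics


def skewness(values):
    """Population skewness: 0 if symmetric, positive for a long right tail."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def min_max_scale(values):
    """Scaling: squeeze values into the range [0, 1], keeping their shape."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Hypothetical right-skewed feature (made-up values)
raw = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 15.0, 40.0]

scaled = min_max_scale(raw)               # changes the range only
transformed = [math.log(v) for v in raw]  # changes the shape

print('skew raw:        ', round(skewness(raw), 3))
print('skew scaled:     ', round(skewness(scaled), 3))
print('skew transformed:', round(skewness(transformed), 3))
```

The scaled values have the same skewness as the raw values because only the range changed, whereas the log-transformed values are noticeably less skewed.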

Always look at your data!

-----

**Looking for correlations**

Before we perform any feature scaling we can look at existing correlations in the data to identify features which may be more useful for the classification task. We can compare the distributions of features for each class label and also look at the correlations between individual features. For this analysis we will use a Python package called [seaborn](https://seaborn.pydata.org/). Seaborn is built upon matplotlib providing convenience functions for creating useful visualisations.

Here we will use the [`pairplot()` function](https://seaborn.pydata.org/generated/seaborn.pairplot.html) which plots pairwise relationships in the DataFrame:

 - We use `dataset.frame` as it contains all features and also the target column
 - The `hue` argument specifies which column to use for colouring the data
 - The `vars` argument specifies which columns to compare

In [None]:
import seaborn as sns  # import seaborn

# we will restrict the columns to ones that contain the 'mean' features
cols = [x for x in features.columns if 'mean' in x]

# Now we can construct the plot (may take a few seconds)
sns.pairplot(dataset.frame, hue='target', vars=cols);

The plot is pretty large, but it reveals some interesting insights into our data. From the feature distributions (diagonal) we can see that the geometry-based features (radius, perimeter, area etc.) differ more between the classes than other features such as texture and smoothness. We can also see that features such as radius and perimeter are highly correlated (obviously!). What else can you decipher from the above plot?

Sometimes having so many plots on one page can be hard to interpret. Instead, we can use a heatmap to plot the correlation between features. Pandas has a method `.corr()` which calculates the pairwise correlation matrix, and we can feed the result into the seaborn `heatmap()` [function](https://seaborn.pydata.org/generated/seaborn.heatmap.html?highlight=heatmap#seaborn.heatmap). Now we can also see all the features in one plot:

In [None]:
# First calculate the correlation matrix
correlation = features.corr()

# Plot the result (here we set up a matplotlib figure to specify the size)
plt.figure(figsize=(16,12)) 
sns.heatmap(correlation);

What correlations can you pick out? Do they make sense?

Say we have identified that 'mean concave points' may be a useful feature based on the distributions we have plotted. We can check the correlation of this feature with the class label. One could use the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) to test this. Because our target value is dichotomous, the Pearson coefficient here is also known as the point-biserial correlation coefficient (real-valued vs dichotomous categorical). The [SciPy](https://scipy.org/) package has statistical tools we can utilise:

In [None]:
from scipy.stats import pearsonr

col = 'mean concave points'
result = pearsonr(features[col], target)

print('Correlation:', result[0])

Looks like there is a definite correlation between the 'mean concave points' and our target variable! Can you identify any other features with high correlation to the target?
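
For those curious about what the coefficient computes, here is a hand-written sketch showing that the Pearson formula and the point-biserial formula give the same number when one variable is binary. The `values` and `labels` are made up, and both helper functions are illustrative rather than SciPy's implementation.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def point_biserial(values, labels):
    """Point-biserial correlation between a real-valued and a 0/1 variable."""
    group1 = [v for v, label in zip(values, labels) if label == 1]
    group0 = [v for v, label in zip(values, labels) if label == 0]
    n, n1, n0 = len(values), len(group1), len(group0)
    mean_gap = statistics.mean(group1) - statistics.mean(group0)
    return mean_gap / statistics.pstdev(values) * math.sqrt(n1 * n0 / n ** 2)


# Hypothetical feature values and binary labels (made up for illustration)
values = [0.02, 0.03, 0.05, 0.04, 0.09, 0.11, 0.08, 0.12]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print('Pearson r:     ', pearson_r(values, labels))
print('Point-biserial:', point_biserial(values, labels))
```

The point-biserial form makes the interpretation clearer: the correlation is large when the gap between the two class means is large relative to the overall spread of the feature.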

Why not calculate a correlation for all features vs the target and produce a visualisation:

In [None]:
import pandas as pd

# We can construct a dataframe to store our correlations:
rankings = pd.DataFrame({
    'feature': features.columns
})

# Calculate the Pearson correlation coefficient (r) for each feature:
rankings['pearsonr'] = rankings.feature.apply(lambda x: pearsonr(features[x], target)[0])

# Let's sort the features by correlation, from most negative to most positive:
rankings.sort_values('pearsonr', inplace=True)
rankings.reset_index(inplace=True, drop=True)

# Plot the result
ax = rankings['pearsonr'].plot.barh(figsize=(12, 10))
ax.set_title('Feature vs Target Pearson Correlation')
ax.set_yticklabels(rankings.feature)
ax.set_xlabel('Correlation');

We have some useful features and some perhaps not so useful features. Note that negative correlations are just as useful as positive correlations. We could potentially reduce the number of features we use for training to reduce the complexity of the model. In this notebook we will use all the features and check whether the model agrees with the correlation based ranking.

-----

## Dimensionality Reduction

The number of input variables or 'features' in a dataset is referred to as its dimensionality. Dimensionality reduction techniques are a group of algorithms that reduce the dimensionality of a dataset while preserving as much of its information content as possible. The higher the dimensionality of a dataset, the more challenging it often is to construct a model. This problem is known as 'the curse of dimensionality'.

While dimensionality reduction techniques can be used to simplify your features before input into a machine learning algorithm, they are also a great way to help visualise data with more than three dimensions. These sorts of visualisations can help us decide the next steps of our predictive modelling project, such as which model might be applicable.

**Principal Component Analysis (PCA)**

[Principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) is by far the most popular tool for dimensionality reduction. PCA is the process of computing the principal components and using them to perform a change of basis on the data. PCA considers the most informative dimensions of the data called principal components. These components capture most of the variation in the original dataset. PCA can be considered an unsupervised machine learning technique. 

The mathematics of PCA is beyond the scope of this notebook, but those interested can take a look at the [wikipedia article](https://en.wikipedia.org/wiki/Principal_component_analysis).
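
For a feel of what 'computing the principal components and performing a change of basis' means in practice, here is a minimal NumPy sketch based on an eigendecomposition of the covariance matrix. The `pca_sketch` function and the synthetic data are illustrative assumptions. scikit-learn's own `PCA` is based on a singular value decomposition, so the signs of its components may differ.

```python
import numpy as np


def pca_sketch(X, n_components):
    """Project X onto its top principal components via eigendecomposition."""
    X_centred = X - X.mean(axis=0)                # PCA works on centred data
    covariance = np.cov(X_centred, rowvar=False)  # feature-by-feature covariance
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)

    # eigh returns eigenvalues in ascending order, so reverse them to put
    # the directions of greatest variance first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    components = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_centred @ components, explained_ratio


# Synthetic data: three features, two of which are strongly correlated
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

Xt, ratio = pca_sketch(X, n_components=2)
print('New shape:', Xt.shape)
print('Explained variance ratio:', ratio)
```

Because two of the three synthetic features carry nearly the same information, two components are enough to capture almost all of the variance.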

**Visualisation**

Depicting data in two or three dimensions is easy with x-, y- and z-axes; however, depicting more dimensions than that in a single plot is not practical. PCA is able to reduce the dimensions to two or three so that we can visualise the data effectively. Remember that earlier we mentioned that some algorithms are sensitive to scale? PCA is one of these algorithms, so let's begin by scaling the data using scikit-learn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which standardises each feature by removing the mean and scaling to unit variance:

In [None]:
from sklearn.preprocessing import StandardScaler

# First setup the scaler:
scaler = StandardScaler() 

# Now scale the features (now a NumPy Array):
scaled_X = scaler.fit_transform(features)
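
If you want to convince yourself of what `StandardScaler` does, the following sketch compares it with the manual calculation on a small array of made-up measurements (the values are hypothetical, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: two features on very different scales
X = np.array([
    [500.0, 0.08],
    [520.0, 0.16],
    [1000.0, 0.09],
    [750.0, 0.11],
])

scaled = StandardScaler().fit_transform(X)

# The same result by hand: subtract the column mean, then divide by the
# column standard deviation (NumPy's default population standard deviation)
by_hand = (X - X.mean(axis=0)) / X.std(axis=0)

print('Column means:', scaled.mean(axis=0).round(6))
print('Column stds: ', scaled.std(axis=0).round(6))
print('Matches manual calculation:', np.allclose(scaled, by_hand))
```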

Now let's transform the scaled data with PCA:

In [None]:
from sklearn.decomposition import PCA

# We will reduce the dimensions to 2:
pca = PCA(n_components=2)

# Transform the data
Xt = pca.fit_transform(scaled_X)

# Check the output shape!
print('New Shape:', Xt.shape)

# We can save these components to a new DataFrame
components = pd.DataFrame(Xt, columns=['PC1', 'PC2'])

# Add the targets to help with plotting
components['Target'] = target
components.head(3)

A nice way to visualise this is to use a simple scatter plot using seaborn `scatterplot()`:

In [None]:
# Convert binary labels to text:
components['Target'] = components.Target.apply(lambda x: 'Benign' if x == 0 else 'Malignant')

# Create scatter plot:
plt.figure(figsize=(10, 8)) # Make figure larger!
sns.scatterplot(x='PC1', y='PC2', hue='Target', data=components);

Seems like our classes look quite easily separable using the features we have! This is good news for our predictive modelling. We can be quite confident that a machine learning method will perform well. If we were to use features from PCA to train a model, how do we decide how many components to use? We could simply look at the cumulative sum of the explained variance ratio! Don't worry if you do not understand the next piece of code, the plot is more interesting.

In [None]:
import numpy as np

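pca = PCA().fit(scaled_X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Plot against 1..n so the x-axis reads as a number of components
plt.plot(np.arange(1, len(cumulative) + 1), cumulative)
plt.xlabel('number of components')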
plt.ylabel('cumulative explained variance');

We can now easily see that ~6 components explain ~90% of the total variance! Maybe this would be a good place to start if using these components as input to a model.
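
Reading the number off the plot works, but it can also be computed. The helper below is a hypothetical sketch (the name `components_needed` and the example ratios are made up) that finds the smallest number of components whose cumulative explained variance reaches a chosen threshold:

```python
import numpy as np


def components_needed(explained_variance_ratio, threshold=0.90):
    """Smallest number of components whose cumulative variance reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first index where cumulative >= threshold;
    # add 1 because component counts start at 1, not 0
    return int(np.searchsorted(cumulative, threshold) + 1)


# Hypothetical ratios for illustration (not the values from this dataset)
ratios = np.array([0.50, 0.30, 0.15, 0.05])
print(components_needed(ratios, threshold=0.90))
```

With the fitted `pca` object from the cell above, `components_needed(pca.explained_variance_ratio_)` should give the count directly. scikit-learn's `PCA` also accepts a fraction, e.g. `PCA(n_components=0.90)`, to keep just enough components to explain that share of the variance.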

## Training Models

## Model Evaluation

## Feature Importance
