# Visualisations and Explainability of Machine Learning models for time series data

In the 0_Intro.ipynb and 1_Baseline.ipynb notebooks, we created a machine learning pipeline for classifying accelerometry data. The pipeline's classification of activity labels in the test set is not perfect, so we would like to improve the model. This is a difficult task, and doing it efficiently and effectively requires visualisations and model explainability.

## Setup

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn import metrics
from sklearn import preprocessing
from sklearn import manifold
import seaborn as sns

# For reproducibility
np.random.seed(42)

## Load the dataset

In [None]:
# Path to your extracted windows
DATASET_PATH = 'processed_data/'
X_FEATS_PATH = 'X_feats.pkl'  # file name of your extracted features, if you have one
print(f'Content of {DATASET_PATH}')
print(os.listdir(DATASET_PATH))

X = np.load(DATASET_PATH+'X.npy', mmap_mode='r')
Y = np.load(DATASET_PATH+'Y.npy')
T = np.load(DATASET_PATH+'T.npy')
pid = np.load(DATASET_PATH+'pid.npy')
X_feats = pd.read_pickle(DATASET_PATH+X_FEATS_PATH)

# As before, let's map the text annotations to simplified labels
ANNO_LABEL_DICT_PATH = 'capture24/annotation-label-dictionary.csv'
anno_label_dict = pd.read_csv(ANNO_LABEL_DICT_PATH, index_col='annotation', dtype='string')
Y = anno_label_dict.loc[Y, 'label:Walmsley2020'].to_numpy()
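The mapping above relies on label-based indexing: passing an array of annotation strings to `.loc` returns one row per annotation, in the same order. A minimal sketch with a made-up dictionary (the annotation strings and the `toy_` names below are hypothetical, not taken from Capture-24):

```python
import numpy as np
import pandas as pd

# Hypothetical annotation dictionary; the real one is loaded from the CSV above
toy_dict = pd.DataFrame(
    {'label:Walmsley2020': ['sleep', 'light', 'moderate-vigorous']},
    index=pd.Index(['sleeping', 'washing dishes', 'running'], name='annotation'),
)

# One fine-grained annotation per window (made-up values)
toy_Y = np.array(['running', 'sleeping', 'sleeping', 'washing dishes'])

# .loc with an array of keys returns rows in the same order as the keys
toy_labels = toy_dict.loc[toy_Y, 'label:Walmsley2020'].to_numpy()
print(toy_labels)
```

An annotation that is missing from the dictionary would raise a `KeyError`, which is a useful early check that the dictionary covers the dataset.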

The first task in understanding any dataset is to visualise the data. Here we plot five examples of each label.

In [None]:
def plot_acc_samples(X, Y, n_samples=5, plot_norm=False):
    labels = np.unique(Y)
    num_labels = len(labels)
    
    fig, axs = plt.subplots(num_labels, n_samples, figsize=(16,8))
    fig.set_facecolor('white')
    for i in range(num_labels):
        axs[i,0].set_ylabel(labels[i])
        idxs = np.where(Y==labels[i])[0]
        for j in range(n_samples):
            if plot_norm:
                axs[i,j].plot(np.linalg.norm(X[idxs[j]], axis=1)-1)
                axs[i,j].set_ylim([-2,2])
                legend = ['norm']
            else:
                axs[i,j].plot(X[idxs[j]])
                legend = ['x', 'y', 'z']
            axs[i,j].set_xticks([])
            axs[i,j].set_yticks([])
    fig.legend(legend)
    plot_desc = "normalised" if plot_norm else "raw"
    plt.suptitle("Example {} plots for each label in Capture 24 dataset".format(plot_desc))
    fig.tight_layout()

plot_acc_samples(X, Y, plot_norm=False)

Our raw data contains 30-second segments of acceleration along the x, y and z axes. When plotted together, the three axes can overlap, which makes the signal harder to read. For this reason, it is sometimes more useful to plot the Euclidean norm of the acceleration, subtracting 1g to remove gravity.

Note: All plots of the acceleration norm minus one are limited to between -2g and 2g.
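The norm-minus-one transformation can be sketched on a single synthetic window. The window length (3000 samples, i.e. 30 seconds at an assumed 100 Hz) and the signal itself are assumptions made for illustration only:

```python
import numpy as np

# Assumed window shape: 30 s at an assumed 100 Hz, 3 axes (x, y, z), in units of g
n_samples = 3000
window = np.zeros((n_samples, 3))
window[:, 2] = 1.0  # device at rest: gravity (1g) entirely on the z axis

# Add a small synthetic 1 Hz movement on the x axis
t = np.arange(n_samples) / 100
window[:, 0] = 0.1 * np.sin(2 * np.pi * 1.0 * t)

# Euclidean norm across the three axes, minus 1g for gravity
norm_minus_one = np.linalg.norm(window, axis=1) - 1
print(norm_minus_one.shape)
print(norm_minus_one.min(), norm_minus_one.max())
```

The three traces collapse into a single trace per window, which is what `plot_acc_samples` draws when `plot_norm=True`.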

In [None]:
plot_acc_samples(X, Y, plot_norm=True)

**Exercise 1**: Inspecting the differences between the raw and normalised signals, what are the disadvantages of plotting only the normalised signal to observe patterns?

**Exercise 2**: What are the noticeable differences between the examples for each label? Is this as you expected?

**Exercise 3**: For simplicity of code, the plot_acc_samples function plots the first n accelerometry samples of a given label. Why might there be an issue with using these samples to represent the entire dataset? What one line of code could be added to address this issue?

Hint: How are the X and Y arrays ordered? What is the meaning of the first n samples?

## t-SNE visualization

Instead of plotting the raw or minimally processed accelerometry data, we can use unsupervised dimensionality reduction methods to view the data. While not perfect, if the points of a certain label are clearly grouped in a t-SNE plot, we would expect our classifier to perform well in predicting this label.

Note: The data needs to be downsampled 100x (keeping every 100th window) so that t-SNE runs in a reasonable time.

In [None]:
def scatter_plot(X, Y):
    unqY = np.unique(Y)
    fig, ax = plt.subplots()
    plt.title("t-SNE of raw Capture24 accelerometry windows")
    for y in unqY:
        X_y = X[Y==y]
        ax.scatter(X_y[:,0], X_y[:,1], label=y, alpha=.5, s=10)
    fig.legend()
    
print("Plotting t-SNE on raw accelerometry windows...")

scaler = preprocessing.StandardScaler()  # standardise each column before t-SNE

X_scaled = scaler.fit_transform(X[::100].reshape(X[::100].shape[0],-1))

tsne = manifold.TSNE(n_components=2,  # project down to 2 components
    init='random', random_state=42, perplexity=100, learning_rate='auto')
X_tsne_pca = tsne.fit_transform(X_scaled)
scatter_plot(X_tsne_pca, Y[::100])
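The subsampling and flattening steps used before t-SNE can be checked on a small random array. This is only a sketch: the array shape and the `toy_` names are hypothetical, and the real windows are much longer.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)
# Hypothetical data: 1000 windows, 30 samples each, 3 axes
toy_X = rng.normal(size=(1000, 30, 3))

toy_sub = toy_X[::100]                            # keep every 100th window
toy_flat = toy_sub.reshape(toy_sub.shape[0], -1)  # one flat row per window
toy_scaled = preprocessing.StandardScaler().fit_transform(toy_flat)

print(toy_sub.shape, toy_flat.shape)
```

Slicing the labels with the same step, as in `Y[::100]`, keeps each label aligned with its window.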

## Train/test split

In [None]:
# Hold out participants P101-P151 for testing (51 participants)
test_ids = [f'P{i}' for i in range(101,152)]
mask_test = np.isin(pid, test_ids)
mask_train = ~mask_test
X_train, Y_train, T_train, pid_train = \
    X_feats[mask_train], Y[mask_train], T[mask_train], pid[mask_train]
X_test, Y_test, T_test, pid_test, X_raw_test = \
    X_feats[mask_test], Y[mask_test], T[mask_test], pid[mask_test], X[mask_test]
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
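Because the split is made by participant rather than by window, no participant contributes data to both sets. A minimal sketch of that check on made-up participant IDs (the `toy_` arrays are hypothetical):

```python
import numpy as np

# Hypothetical participant IDs, one per window
toy_pid = np.array(['P001', 'P001', 'P101', 'P102', 'P002', 'P151'])
toy_test_ids = [f'P{i}' for i in range(101, 152)]

toy_mask_test = np.isin(toy_pid, toy_test_ids)
toy_mask_train = ~toy_mask_test

# No participant should appear on both sides of the split
overlap = np.intersect1d(toy_pid[toy_mask_train], toy_pid[toy_mask_test])
print(toy_mask_test)
print(overlap)
```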

## Train a random forest classifier

*This may take a while*

In [None]:
# oob_score=True keeps the out-of-bag estimates, which HMM smoothing would use (not covered in this notebook)
clf = BalancedRandomForestClassifier(
    n_estimators=1000,
    replacement=True,
    sampling_strategy='not minority',
    oob_score=True,
    n_jobs=4,
    random_state=42,
    verbose=1
)
clf.fit(X_train, Y_train)

## Model performance

The classification report gives standard metrics for measuring the performance of a model, including accuracy and macro F1 score.

In [None]:
Y_test_pred = clf.predict(X_test)
print('\nClassifier performance')
print('Out of sample:\n', metrics.classification_report(Y_test, Y_test_pred))
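The macro F1 score in the report is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. A small sketch on made-up labels (the `toy_` arrays are hypothetical):

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for six windows
toy_true = np.array(['sleep', 'sleep', 'sleep', 'sleep', 'light', 'light'])
toy_pred = np.array(['sleep', 'sleep', 'sleep', 'light', 'light', 'sleep'])

accuracy = metrics.accuracy_score(toy_true, toy_pred)
# average=None returns one F1 score per class, in the order given by `labels`
per_class_f1 = metrics.f1_score(toy_true, toy_pred, average=None, labels=['light', 'sleep'])
macro_f1 = metrics.f1_score(toy_true, toy_pred, average='macro')

print(accuracy, per_class_f1, macro_f1)
```

Here accuracy is dominated by the more frequent class, while macro F1 is pulled down by the weaker performance on the less frequent one.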

Another way to visualise the model performance is the confusion matrix, which compares the true test labels to the predicted test labels. This gives us an understanding of how the model incorrectly assigns labels.

In [None]:
plt.figure(figsize=(8,8))
# Pass labels explicitly so rows and columns line up with clf.classes_
conf = pd.DataFrame(
    metrics.confusion_matrix(Y_test, Y_test_pred, labels=clf.classes_, normalize='true'),
    index=clf.classes_, columns=clf.classes_)
sns.set(font_scale = 2)
sns.heatmap(conf, annot=True, fmt='.2f').set(title="Confusion matrix", xlabel="Predicted", ylabel="True")
plt.xticks(rotation=45)
sns.set(font_scale = 1)
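With `normalize='true'`, each row of the confusion matrix is divided by the number of windows that truly carry that label, so every row sums to 1. A minimal sketch on made-up labels (the `toy_` arrays are hypothetical):

```python
import numpy as np
from sklearn import metrics

toy_true = np.array(['sleep', 'sleep', 'sleep', 'sleep', 'light', 'light'])
toy_pred = np.array(['sleep', 'sleep', 'sleep', 'light', 'light', 'sleep'])
toy_classes = ['light', 'sleep']

# Raw counts: rows are true labels, columns are predicted labels
counts = metrics.confusion_matrix(toy_true, toy_pred, labels=toy_classes)
# Row normalisation by hand
by_hand = counts / counts.sum(axis=1, keepdims=True)
# The same thing via scikit-learn
normalised = metrics.confusion_matrix(
    toy_true, toy_pred, labels=toy_classes, normalize='true')

print(counts)
print(normalised)
```

The diagonal of the row-normalised matrix is therefore the per-class recall.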

When observing the performance of a model, it is important to inspect its failure cases, that is, the samples for which the model prediction does not match the true label. Inspecting these can help identify potential causes of failure, such as:
- Lack of model generalisability
- Poor/inaccurate labelling of truth labels
- Non-trivial differences between labels

In [None]:
def plot_failure_cases(X, Y, Y_pred, label, n_samples=5):
    labels = np.unique(Y_pred)
    num_labels = len(labels)
    
    fig, axs = plt.subplots(num_labels, n_samples, figsize=(16,8))
    fig.supylabel('Predicted label')
    fig.set_facecolor('white')
    for i in range(num_labels):
        idxs = np.where((Y_pred==labels[i]) & (Y==label))[0]
        axs[i,0].set_ylabel("{}: {:.1f}%".format(labels[i], 100*len(idxs)/sum(Y==label)))
        for j in range(n_samples):
            # Guard: a predicted label may have fewer than n_samples matching windows
            if j < len(idxs):
                axs[i,j].plot(np.linalg.norm(X[idxs[j]], axis=1)-1)
            axs[i,j].set_ylim([-2,2])
            axs[i,j].set_xticks([])
            axs[i,j].set_yticks([])
    fig.legend(['norm'])
    plt.suptitle("Example normalised plots for labelled {} activity".format(label))
    fig.tight_layout()

plot_failure_cases(X_raw_test, Y_test, Y_test_pred, label='sleep')
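Alongside the plots, it can help to count which wrong labels a given true label is most often assigned. A minimal sketch on made-up labels (the `toy_` arrays are hypothetical):

```python
import numpy as np
import pandas as pd

toy_true = np.array(['sleep', 'sleep', 'sleep', 'sedentary', 'sleep', 'light'])
toy_pred = np.array(['sleep', 'sedentary', 'sedentary', 'sedentary', 'light', 'light'])

label = 'sleep'
# Indices of windows truly labelled `label` that the model got wrong
failure_idxs = np.where((toy_true == label) & (toy_pred != label))[0]
# How often each wrong label was predicted for these windows
confusions = pd.Series(toy_pred[failure_idxs]).value_counts()

print(failure_idxs)
print(confusions)
```

The same indices can be used to pull out the corresponding raw windows for closer inspection.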

**Exercise 4**: Plot the failure cases for the other labelled activities. What do you think is the main cause for these failure cases?

## Feature selection

The selection of features to be extracted for this classification task is a key factor in improving model performance. In previous notebooks, exercises encouraged you to expand the list of features extracted from the accelerometry windows. We note, however, that extracting more features requires more compute power and takes longer to run. In some cases, these extra features may add minimal improvements to performance. We therefore seek efficient ways to determine which of the extracted features can be removed or replaced while maintaining good model performance.

### Feature correlation

Feature correlation indicates how strongly a pair of features are associated with one another. A pair of features with a high absolute correlation coefficient (close to 1) makes for an inefficient model, as only one of the two is needed to capture the information used for classification. The correlation between all extracted features can be visualised as a correlation matrix, and for any pair with a high value, one of the two features can be removed from the feature extraction pipeline.

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(X_feats.corr().abs()).set(title="Correlation matrix of extracted features")
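Reading individual pairs off a large heatmap can be difficult, so one option is to list the highly correlated pairs directly. The sketch below uses synthetic features and an arbitrary threshold of 0.9; both are assumptions for illustration, not part of the pipeline:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
# Hypothetical feature table
toy_feats = pd.DataFrame({
    'a': a,
    'a_copy': a + 0.01 * rng.normal(size=500),  # near-duplicate of 'a'
    'b': rng.normal(size=500),                  # independent feature
})

corr = toy_feats.corr().abs()
# Keep the upper triangle only, so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
threshold = 0.9  # arbitrary cut-off chosen for illustration
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Each entry of `high_pairs` names two features, of which one is a candidate for removal.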

**Exercise 5**: What is the correlation coefficient for the mean and median normalised acceleration? Do you think both features need to be extracted for an efficient model? 

### Feature importance

Feature importance is an explainable AI technique that reveals the relative significance of individual features to model outputs. There are many methods for determining feature importance; in this notebook, we will use Gini importance. When looking to optimise feature extraction (less compute power and time, better model performance), features with the lowest importance should be removed first.

#### Gini importance
Gini importance is a feature importance measure that can be read directly from a fitted BalancedRandomForestClassifier. More information on how exactly it works can be found [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-213).

Note: Gini importance is known to perform poorly with highly correlated features. If relying only on Gini importance to determine which features to remove, highly correlated features should be removed first.

In [None]:
plt.figure(figsize=(4,10))
sns.barplot(x=clf.feature_importances_, y=X_feats.columns).set(title="Gini feature importance")
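To pick removal candidates, the importances can also be put in a labelled, sorted series. The sketch below uses scikit-learn's plain `RandomForestClassifier` as a stand-in for `BalancedRandomForestClassifier`, purely to stay self-contained, and the data and feature names are synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic features: one that determines the label, two that are pure noise
toy_X = pd.DataFrame({
    'informative': rng.normal(size=400),
    'noise_1': rng.normal(size=400),
    'noise_2': rng.normal(size=400),
})
toy_y = (toy_X['informative'] > 0).astype(int)

toy_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# Label each importance with its feature name, then rank from high to low
importances = pd.Series(toy_clf.feature_importances_, index=toy_X.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

The importances are normalised to sum to 1, so each value can be read as a share of the total; the features at the bottom of `ranked` are the first candidates for removal.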

**Exercise 6**: Which features have the highest Gini importance? Does this match your expectation?

# References

Gini importance

- [A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-213)
- [Feature Importance Measures for Tree Models — Part I](https://medium.com/the-artificial-impostor/feature-importance-measures-for-tree-models-part-i-47f187c1a2c3)
