<h1><span style="color:green">The GREEN exercise</span></h1>

Patterns from each of ***two classes have been drawn from distributions with the same mean***. That is, on average the patterns associated with the two classes are identical, and we would typically not want to conclude that the classes differ. However, a simple classification analysis has been run for each of 40 participants, and the ***classifier accuracy is significantly above chance***.

***Is this correct?*** 

***Change the analysis so that it no longer finds a significant difference between the classes.***

## Getting ready

Import the packages we might need:

In [None]:
import numpy as np              # This lets python process matrices, like Matlab
import matplotlib.pyplot as plt # This lets python plot graphs like Matlab
import seaborn as sns           # This provides another popular set of plotting functions
import scipy.stats              # This provides statistical functions like t-tests (imported explicitly so scipy.stats is available)

# scikit-learn is the major library for machine learning in Python:
import sklearn as skl
from sklearn import preprocessing         # includes LabelEncoder, OneHotEncoder, StandardScaler...
from sklearn import model_selection       # includes StratifiedKFold, LeaveOneGroupOut...
from sklearn import linear_model          # includes LogisticRegression, RidgeClassifier...
from sklearn import svm                   # includes SVC, NuSVC & LinearSVC...
from sklearn import discriminant_analysis # includes LinearDiscriminantAnalysis
from sklearn import metrics               # includes roc_auc_score...
from sklearn import pipeline              # includes make_pipeline
from sklearn import inspection            # includes DecisionBoundaryDisplay

Set the random number generator:

In [None]:
np.random.seed(1)

## Simulate some data (two balanced classes) with no mean difference between conditions:

In [None]:
npeople = 40 # number of participants

nvox = 2 # number of voxels
nruns= 6 # number of runs
n_samples_per_run = 10 # number of samples/patterns per run; these will be divided into conditions/classes "0" and "1"
proportion_of_samples_from_condition_0 = 0.5

mu = np.arange(nvox)+1 # mean activation for both conditions (voxels all have different activation strength)
voxel_covariance = np.diag(mu)+1 # voxel covariance: independent noise per voxel plus some covariance

null_data = [] # list of pattern matrices (one per participant)

for p in np.arange(npeople):
    
    null_data_per_run = [] # list of pattern matrices (one for each run)
    labels_per_run    = [] # list of label vectors    (one for each run)
    for r in np.arange(nruns):
        # label each sample as condition 0 or 1 (the +0 is a trick to convert the logical values to integers):
        label_vector = ((np.arange(n_samples_per_run)/n_samples_per_run) >= proportion_of_samples_from_condition_0) + 0
        
        # sample activations with patterns from each class having the same mean (mu):
        data_matrix0 =  np.random.multivariate_normal(mu, voxel_covariance,   size=np.sum(label_vector == 0))
        data_matrix1 =  np.random.multivariate_normal(mu, voxel_covariance*8, size=np.sum(label_vector == 1))
        data_matrix  =  np.concatenate((data_matrix0, data_matrix1))
        
        null_data_per_run.append( data_matrix ) 
        labels_per_run.append( label_vector ) 
    
    # concatenate runs for this participant
    null_data.append(   np.concatenate(null_data_per_run, axis = 0) )

# labels and run indices are the same for all participants:
labels =      np.concatenate(labels_per_run, axis = 0) 
run_indices = np.concatenate([[i] * n_samples_per_run for i in range(nruns)]) 

Plot the data for the first two voxels of the first participant:

In [None]:
fig = plt.figure(figsize = (8,8))  # create a matplotlib figure
plt.rcParams.update({'font.size': 14})
plt.title('Activation patterns for first two voxels')
scatter = plt.scatter(null_data[0][:, 0], null_data[0][:, 1],
                      s = 90, alpha = 0.7, c = labels, cmap = 'bwr')
plt.legend(handles=scatter.legend_elements()[0], labels=[str(c) for c in np.unique(labels)]) # one legend entry per class, in sorted order
plt.xlabel('Voxel 1 activity')
plt.ylabel('Voxel 2 activity')
plt.axis('equal')
plt.show()

Specify the pre-processing, classification pipeline, and leave-one-run-out cross-validation scheme:

In [None]:
scaler = skl.preprocessing.StandardScaler()
SVM    = skl.svm.SVC()
pipe   = skl.pipeline.make_pipeline(scaler, SVM)
logo   = skl.model_selection.LeaveOneGroupOut()  
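To see what `LeaveOneGroupOut` does with the run indices, the sketch below rebuilds the same layout as the simulation (6 runs of 10 samples, the first half of each run labelled 0) using dummy features, and lists the folds. The `demo_` names are hypothetical, chosen so the cell does not overwrite the notebook's own variables. Each fold should hold out one whole run, and because every run contains 5 samples of each class, chance-level balanced accuracy is 0.5 in every fold.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Rebuild the notebook's layout with illustrative names: 6 runs x 10 samples,
# first half of each run labelled 0 and second half labelled 1.
demo_nruns, demo_per_run = 6, 10
demo_labels = np.tile((np.arange(demo_per_run) >= demo_per_run // 2).astype(int), demo_nruns)
demo_runs = np.repeat(np.arange(demo_nruns), demo_per_run)
demo_X = np.zeros((demo_labels.size, 2))  # dummy features: the split only depends on the groups

splitter = LeaveOneGroupOut()
folds = list(splitter.split(demo_X, demo_labels, groups=demo_runs))
for k, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {k}: test run {np.unique(demo_runs[test_idx])}, "
          f"{train_idx.size} train / {test_idx.size} test samples, "
          f"test class counts {np.bincount(demo_labels[test_idx])}")
```

Holding out whole runs keeps samples that share run-specific noise from appearing in both the training and the test set of the same fold.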

Run the classification for each participant:

In [None]:
accuracy = np.full(npeople, np.nan)
for p in np.arange(npeople):
    accuracy_per_fold = skl.model_selection.cross_val_score(pipe, null_data[p], labels, groups = run_indices, cv = logo, scoring = 'balanced_accuracy')
    accuracy[p] = np.mean(accuracy_per_fold)
    print("Participant ", p, ": Accuracy per fold: ", accuracy_per_fold, "Mean accuracy (for random data):", accuracy[p])

Run a one-sample t-test of accuracy against chance (0.5) across participants, which finds a significant difference between the classes:

In [None]:
result = scipy.stats.ttest_1samp(accuracy, 0.5, alternative = 'greater')
print("Mean accuracy across participants (for random data):", np.mean(accuracy))
print(f't({result.df}) = {result.statistic:.2f}; p = {result.pvalue:.2e}')
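For reference, the same test can be computed by hand. The sketch below uses made-up accuracies (drawn at random purely for illustration, not results from this notebook) and checks that the textbook one-sample t statistic, with n - 1 degrees of freedom and an upper-tail p-value, matches `scipy.stats.ttest_1samp(..., alternative='greater')`.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
# Made-up accuracies for 40 hypothetical participants (illustration only, not notebook results)
acc = 0.5 + rng.normal(0.03, 0.05, size=40)

n = acc.size
# One-sample t statistic against chance: mean difference divided by its standard error
t_manual = (acc.mean() - 0.5) / (acc.std(ddof=1) / np.sqrt(n))
# One-sided p-value for alternative='greater': upper-tail area of a t distribution with n - 1 df
p_manual = scipy.stats.t.sf(t_manual, df=n - 1)

res = scipy.stats.ttest_1samp(acc, 0.5, alternative='greater')
print(f"by hand: t({n - 1}) = {t_manual:.2f}; p = {p_manual:.2e}")
print(f"scipy:   t({n - 1}) = {res.statistic:.2f}; p = {res.pvalue:.2e}")
```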

## Questions:
 - We have two classes, drawn from distributions with the same mean pattern, so why is accuracy not close to 0.5?
 - Is the classifier performing as it should?
 - Can you change the analysis so that the classification score is closer to chance (equivalent to 50% accuracy)?

## Hints:
- Consider the data that is being provided to the classifier per class.
- What classifier is being used? Consider other options when creating it.
- Is there a "better" way to measure classifier performance in this case?

# Explanation and possible solutions:

The two classes come from distributions with the same *mean*, but *different variances*. Classifiers can typically use information in the variance to distinguish the classes. Sometimes two representations might meaningfully differ in their variance, the brain might be able to use this information, and we might want to capture it. However, classes can also have different variances for uninteresting reasons: one condition might be based on fewer trials, participants might move more in one condition, participants might pay more attention in one condition, etc.
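The difference is easy to confirm from the simulation parameters. The sketch below redraws a large sample (5000 patterns per class, an arbitrary choice so the estimates are stable) with the notebook's `mu` and `voxel_covariance`, using 8 times the covariance for class 1 as in the simulation cell, and compares per-class means and variances.

```python
import numpy as np

rng = np.random.default_rng(1)
nvox = 2
mu = np.arange(nvox) + 1                 # same mean for both classes, as in the simulation
voxel_covariance = np.diag(mu) + 1       # class 0 covariance
n = 5000                                 # large sample so the estimates are stable (illustrative)

class0 = rng.multivariate_normal(mu, voxel_covariance, size=n)
class1 = rng.multivariate_normal(mu, voxel_covariance * 8, size=n)  # same mean, 8x the covariance

print("class 0 mean:", class0.mean(axis=0).round(2), " variance:", class0.var(axis=0).round(2))
print("class 1 mean:", class1.mean(axis=0).round(2), " variance:", class1.var(axis=0).round(2))
var_ratio = class1.var(axis=0) / class0.var(axis=0)
print("variance ratio (class 1 / class 0):", var_ratio.round(2))
```

The means should agree closely, while each voxel's variance should be roughly 8 times larger for class 1.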

The classifier is doing exactly what it is supposed to: maximizing its average classification performance. The issue is that the classifier can use information in the pattern variance, whereas we might only be interested in the mean pattern.
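To see that variance alone can support above-chance classification, consider a deliberately simple rule (our own illustration, not a classifier used in this notebook): call a pattern class 1 if its Euclidean distance from the overall mean exceeds a threshold. The threshold is chosen on one simulated half of the data and evaluated on a second, independent half; the sample sizes and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, 2.0])           # shared mean for both classes (as in the simulation)
cov0 = np.diag(mu) + 1              # class 0 covariance; class 1 uses 8x this
n = 2000                            # samples per class in each half (illustrative)

def simulate(n):
    """Draw n patterns per class: same mean, class 1 has 8x the covariance."""
    x0 = rng.multivariate_normal(mu, cov0, size=n)
    x1 = rng.multivariate_normal(mu, cov0 * 8, size=n)
    return np.vstack([x0, x1]), np.repeat([0, 1], n)

def balanced_accuracy(y, pred):
    """Mean of the per-class hit rates."""
    return np.mean([np.mean(pred[y == c] == c) for c in (0, 1)])

X_train, y_train = simulate(n)
X_test, y_test = simulate(n)

# Euclidean distance of each pattern from the overall training mean
centre = X_train.mean(axis=0)
d_train = np.linalg.norm(X_train - centre, axis=1)
d_test = np.linalg.norm(X_test - centre, axis=1)

# Rule: predict class 1 if the pattern is far from the centre.
# Pick the distance threshold that maximises balanced accuracy on the training half only.
candidates = np.quantile(d_train, np.linspace(0.05, 0.95, 91))
train_scores = [balanced_accuracy(y_train, (d_train > t).astype(int)) for t in candidates]
threshold = candidates[int(np.argmax(train_scores))]

test_acc = balanced_accuracy(y_test, (d_test > threshold).astype(int))
print(f"threshold = {threshold:.2f}, held-out balanced accuracy = {test_acc:.3f}")
```

Held-out balanced accuracy should land well above 0.5 even though the class means are identical: the high-variance class simply lies far from the centre more often. A non-linear classifier such as the RBF SVM can learn a boundary of this kind on its own.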

The classifier used in the example is a support vector machine (SVM) initialised with no input arguments. This defaults to a non-linear (radial basis function, RBF) kernel, which is very sensitive to class variance because its decision boundary can curve around the tighter cluster. Let's visualise its classification boundary for one participant:

In [None]:
SVM.fit(null_data[0], labels)
skl.inspection.DecisionBoundaryDisplay.from_estimator(SVM, null_data[0], 
                                                      alpha = 0.5, ax = fig.axes[0], cmap = 'bwr', response_method = 'predict');
fig

A simpler, linear classifier is less prone to overfitting (which is typically better for fMRI data anyway), and is also less sensitive to differences in pattern variance, because a flat decision boundary cannot enclose the tighter cluster. A linear SVM can be specified like this:

In [None]:
SVM    = skl.svm.SVC(kernel = 'linear') ##################### The change
pipe   = skl.pipeline.make_pipeline(scaler, SVM)

for p in np.arange(npeople):
    accuracy_per_fold = skl.model_selection.cross_val_score(pipe, null_data[p], labels, groups  = run_indices, cv = logo, scoring = 'balanced_accuracy')
    accuracy[p] = np.mean(accuracy_per_fold)
    
result = scipy.stats.ttest_1samp(accuracy, 0.5, alternative = 'greater')
print("Mean accuracy across participants (for random data):", np.mean(accuracy))
print(f't({result.df}) = {result.statistic:.2f}; p = {result.pvalue:.2e}')

Accuracy is closer to chance, but still significantly above chance.

This is because the linear decision boundary can be placed to one side of the tighter cluster. It is then correct for almost all samples from the tight cluster, while still being correct for some samples from the noisier cluster, so balanced accuracy exceeds 50%:
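The same effect can be worked out exactly in one dimension. The sketch below is a caricature under stated assumptions (not the notebook's data): both classes have mean 0, class 0 has standard deviation 1, class 1 has standard deviation sqrt(8), and the "linear classifier" is a single threshold, predicting class 1 when x exceeds it. Expected balanced accuracy then follows directly from the normal CDF.

```python
import numpy as np
from scipy.stats import norm

# 1-D caricature (an illustrative assumption, not the notebook's data):
# both classes have mean 0; class 0 has SD 1 and class 1 has SD sqrt(8).
sd0, sd1 = 1.0, np.sqrt(8)

def expected_balanced_accuracy(threshold):
    """Expected balanced accuracy of the rule 'predict class 1 if x > threshold'."""
    correct0 = norm.cdf(threshold / sd0)   # fraction of class 0 below the boundary
    correct1 = norm.sf(threshold / sd1)    # fraction of class 1 above the boundary
    return (correct0 + correct1) / 2

thresholds = np.linspace(0, 4, 41)
scores = np.array([expected_balanced_accuracy(t) for t in thresholds])
best = thresholds[np.argmax(scores)]
print(f"boundary at the shared mean (x = 0): {expected_balanced_accuracy(0.0):.3f}")
print(f"boundary at x = 2:                   {expected_balanced_accuracy(2.0):.3f}")
print(f"best boundary on the grid (x = {best:.1f}): {scores.max():.3f}")
```

At the shared mean the rule is exactly at chance. Shifting the boundary off the mean raises the hit rate on the tight class faster than it lowers the hit rate on the spread-out class, so every positive threshold gives balanced accuracy above 0.5.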

In [None]:
fig = plt.figure(figsize = (8,8))  # create a matplotlib figure
plt.rcParams.update({'font.size': 14})
plt.title('Activation patterns for first two voxels')
scatter = plt.scatter(null_data[0][:, 0], null_data[0][:, 1],
                      s = 90, alpha = 0.7, c = labels, cmap = 'bwr')
plt.legend(handles=scatter.legend_elements()[0], labels=[str(c) for c in np.unique(labels)]) # one legend entry per class, in sorted order
plt.xlabel('Voxel 1 activity')
plt.ylabel('Voxel 2 activity')
plt.axis('equal')

SVM.fit(null_data[0], labels)
skl.inspection.DecisionBoundaryDisplay.from_estimator(SVM, null_data[0], 
                                                      alpha = 0.5, ax = fig.axes[0], cmap = 'bwr', response_method = 'predict');

plt.show()

scikit-learn also provides a separate implementation of a linear SVM, `LinearSVC`:

In [None]:
SVM    = skl.svm.LinearSVC(dual = 'auto') ##################### The change
pipe   = skl.pipeline.make_pipeline(scaler, SVM)

for p in np.arange(npeople):
    accuracy_per_fold = skl.model_selection.cross_val_score(pipe, null_data[p], labels, groups  = run_indices, cv = logo, scoring = 'balanced_accuracy')
    accuracy[p] = np.mean(accuracy_per_fold)
    
result = scipy.stats.ttest_1samp(accuracy, 0.5, alternative = 'greater')
print("Mean accuracy across participants (for random data):", np.mean(accuracy))
print(f't({result.df}) = {result.statistic:.2f}; p = {result.pvalue:.2e}')

This time the accuracy is much closer to chance, but still significantly above chance. I'm not sure of the exact difference between these two implementations of a linear SVM, or why they seem to differ in their sensitivity to pattern variance. If you know, please tell me!
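One way to explore this is to fit both estimators to the same data and compare the hyperplanes they find. According to the scikit-learn documentation, `LinearSVC` is built on liblinear rather than libsvm, minimises the squared hinge loss by default (`SVC` uses the standard hinge loss), and penalises the intercept like any other weight; increasing `intercept_scaling` reduces that penalty. The sketch below is only an exploratory comparison on freshly simulated data (sample size and seed are arbitrary assumptions), using `dual=False` (the primal solver, suitable when samples outnumber features). It does not by itself establish which of these differences drives the different sensitivity to pattern variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(3)
mu = np.array([1.0, 2.0])
cov0 = np.diag(mu) + 1
n = 100  # samples per class (illustrative)
X = np.vstack([rng.multivariate_normal(mu, cov0, size=n),
               rng.multivariate_normal(mu, cov0 * 8, size=n)])  # same mean, 8x covariance for class 1
y = np.repeat([0, 1], n)
X = StandardScaler().fit_transform(X)  # mirror the scaling step in the notebook's pipeline

models = {
    "SVC(kernel='linear')": SVC(kernel="linear"),
    "LinearSVC(dual=False)": LinearSVC(dual=False, max_iter=10_000),
    "LinearSVC(intercept_scaling=100)": LinearSVC(dual=False, intercept_scaling=100, max_iter=10_000),
}
for name, model in models.items():
    model.fit(X, y)
    w, b = model.coef_.ravel(), model.intercept_[0]
    # Signed distance from the origin (the mean of the standardised data) to the hyperplane w.x + b = 0
    offset = -b / np.linalg.norm(w)
    print(f"{name:34s} w = {np.round(w, 3)}, b = {b:+.3f}, boundary offset = {offset:+.3f}")
```

A boundary offset far from zero means the hyperplane sits to one side of the data, which is the configuration that lets a linear classifier exploit a variance difference.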

Anyway, the accuracy of a linear classifier is less sensitive to differences in pattern variance than that of a non-linear classifier, but it is still somewhat sensitive to them.

For linear classifiers, a performance metric that should, on average, be insensitive to pattern variance is the mean signed distance to the decision boundary, where the sign is positive for correct classifications and negative for misclassifications. Under the null hypothesis of no mean difference in patterns, this metric should have an expectation of zero, regardless of pattern variance:
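Why should the expectation be zero? A short derivation, under simplifying assumptions: treat the linear decision function $f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$ as fixed (under cross-validation it is fitted on training runs that are independent of the test run, so the argument holds conditional on the fitted $\mathbf{w}$ and $b$), code the true label as $y \in \{-1, +1\}$ (class 0 and class 1), and assume a balanced test set in which both classes share the mean $\boldsymbol{\mu}$ but may have different covariances $\Sigma_0$ and $\Sigma_1$:

$$
\mathbb{E}\left[y\,f(\mathbf{x})\right]
= \tfrac{1}{2}\,\mathbb{E}\left[f(\mathbf{x}) \mid y = +1\right] - \tfrac{1}{2}\,\mathbb{E}\left[f(\mathbf{x}) \mid y = -1\right]
= \tfrac{1}{2}\left(\mathbf{w}^\top\boldsymbol{\mu} + b\right) - \tfrac{1}{2}\left(\mathbf{w}^\top\boldsymbol{\mu} + b\right)
= 0 .
$$

Because $f$ is linear, only the class means enter, so $\Sigma_0$ and $\Sigma_1$ drop out. The signed value used below, $|f(\mathbf{x})|$ multiplied by $+1$ for a correct and $-1$ for an incorrect prediction, is exactly $y\,f(\mathbf{x})$, and dividing by $\lVert\mathbf{w}\rVert$ to obtain a geometric distance does not change the conclusion. Accuracy, by contrast, is $\Pr\left[y\,f(\mathbf{x}) > 0\right]$, which does depend on the covariances.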

In [None]:
signed_decisions = np.full(npeople, np.nan)
for p in np.arange(npeople):
    decisions = skl.model_selection.cross_val_predict(pipe, null_data[p], labels, groups = run_indices, cv = logo, method = 'decision_function')
    # For a linear classifier, these are proportional to the signed distance of each test sample to the hyperplane, with the sign indicating the predicted class.
    # To get a measure of performance, we want the sign to indicate accuracy instead:
    predictions = skl.model_selection.cross_val_predict(pipe, null_data[p], labels, groups = run_indices, cv = logo, method = 'predict') # binary class assignments
    correct = (predictions == labels)*2-1   # +1 for correct, -1 for incorrect (a new name, so the accuracy array above is not overwritten)
    signed_decisions_per_sample = np.abs(decisions) * correct
    signed_decisions[p] = np.mean(signed_decisions_per_sample)

result = scipy.stats.ttest_1samp(signed_decisions, 0, alternative = 'greater')
print("Mean signed distance to hyperplane across participants (for random data):", np.mean(signed_decisions))
print(f't({result.df}) = {result.statistic:.2f}; p = {result.pvalue:.2e}')

ax = sns.histplot(signed_decisions_per_sample, element = 'step', alpha = 0.5)
ax.set(xlabel = 'Signed distance to hyperplane (per test sample)');
ax.set(title = 'Example participant (the last one in the loop)');
lh = ax.axvline(0, color = 'k',label = 'chance')
mh = ax.plot(np.mean(signed_decisions_per_sample), 0, marker = 'o', color = 'r', markersize = 10, label = 'observed mean')
ax.legend();
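The zero-expectation argument can also be checked numerically. The sketch below uses a fixed, hand-picked hyperplane (a hypothetical choice, not one fitted by the notebook) placed two class-0 standard deviations to one side of the shared mean on voxel 1, and simulates many samples with the notebook's mean and covariance settings. The mean signed decision value should come out close to zero, while balanced accuracy stays clearly above 0.5.

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, 2.0])
cov0 = np.diag(mu) + 1
n = 100_000  # per class; large so the averages are close to their expectations

x0 = rng.multivariate_normal(mu, cov0, size=n)       # tight class
x1 = rng.multivariate_normal(mu, cov0 * 8, size=n)   # same mean, 8x covariance
X = np.vstack([x0, x1])
y_signed = np.repeat([-1, 1], n)                     # -1 for class 0, +1 for class 1

# A fixed, hand-picked hyperplane placed to one side of the tight class (hypothetical choice)
w = np.array([1.0, 0.0])
b = -(mu[0] + 2 * np.sqrt(cov0[0, 0]))               # boundary 2 class-0 SDs above the mean on voxel 1
f = X @ w + b                                        # decision values; positive predicts class 1

signed = y_signed * f                                # positive if correct, negative if incorrect
balanced_acc = 0.5 * (np.mean(f[y_signed == -1] < 0) + np.mean(f[y_signed == 1] > 0))
print(f"mean signed decision value: {signed.mean():+.3f}")
print(f"balanced accuracy:          {balanced_acc:.3f}")
```

The two numbers disagree in exactly the way the derivation predicts: the hyperplane's offset buys above-chance accuracy from the variance difference, but it contributes nothing to the expected signed decision value.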