<a href="https://colab.research.google.com/github/Schauhan21/AC/blob/main/exercise3_for_student_updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Affective Computing - Programming Assignment 3

### Objective

Your task is to use the **feature-level** method to combine facial expression features and audio features. A **multi-modal emotion recognition system** is constructed to distinguish happy from sad expressions (a binary-class problem) using a classifier training and testing structure.

The original data is based on lab1 and lab2, from ten actors acting out happy and sad behaviors.
* Task 1: **Subspace-based feature fusion** method: In this case, z-score normalization is utilized. Please read “Fusing Gabor and LBP feature sets for kernel-based face recognition” and learn how to use subspace-based feature fusion method for multi-modal system.

* Task 2: Based on Task 1, use **Canonical Correlation Analysis (CCA)** to calculate the correlation coefficients of facial expression and audio features. Finally, use CCA to build a multi-modal emotion recognition system. The method is described in the conference paper “Feature fusion method based on canonical correlation analysis and handwritten character recognition”.
* Task 3: Based on Task 1, create a **Leave-One-Subject-Out (LOSO) cross-validation** to estimate the performance more reliably.

To build the emotion recognition systems, Support Vector Machine (SVM) classifiers are trained. 50 videos from 5 participants are used to train the systems on spatiotemporal features. The remaining data (50 videos) are used to evaluate the trained systems.

## Task 1. Subspace-based method
Please read “Fusing Gabor and LBP feature sets for kernel-based face recognition” and apply their framework for the exercise. We use a Support Vector Machine (SVM) with a linear kernel for classification. Instead of Gabor features, we use the prosodic features from the previous exercise.


### Setting up the environment 

First, we need to import the basic modules for loading and processing the data.

In [None]:
import sys
sys.path.append('../')
from skimage import io
from skimage import transform
from skimage import color
from skimage import img_as_ubyte
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import sklearn
import scipy.io as sio

### Loading data 

We load the facial expression data (training data, training class, testing data, testing class) and the audio data (training data, testing data).

In [None]:
mdata = sio.loadmat('lab3_data.mat')

#Facial expression training and testing data, training and testing class
training_data = mdata["training_data"]
testing_data = mdata["testing_data"]
training_class = np.ravel(mdata["training_class"])
testing_class = np.ravel(mdata["testing_class"])

#Audio training and testing data
training_data_proso = mdata["training_data_proso"]
testing_data_proso = mdata["testing_data_proso"]



### Extract the subspace for facial expression features and audio features
Extract the subspace for facial expression features and audio features with principal component analysis, using the __[`sklearn.decomposition.PCA()`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)__ function.
`reduced_dim` is the dimensionality of the reduced subspace.
Set `reduced_dim` to 20 and 15 for facial expression features and audio features, respectively. Normalization should be done sample-wise. The test data should be normalized with the values from the training data.
For concatenating the features use the __[`np.concatenate()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html)__ function. Set the random state to 0. Depending on the chosen solver, PCA may use a randomized truncated SVD, in which case the results vary with the seed.
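
The requirement that test data be normalized with training statistics can be illustrated with a minimal sketch. It assumes z-scoring each feature column, which is what `StandardScaler` does, and uses synthetic arrays (hypothetical stand-ins, not the contents of `lab3_data.mat`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for PCA-projected features (hypothetical data)
train = rng.normal(loc=5.0, scale=2.0, size=(50, 15))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 15))

# Per-feature statistics come from the training set only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), matching StandardScaler

train_z = (train - mu) / sigma
test_z = (test - mu) / sigma  # the test set reuses the training mean and std

# Compare against StandardScaler fitted on the training set
scaler = StandardScaler().fit(train)
print(np.allclose(train_z, scaler.transform(train)))
print(np.allclose(test_z, scaler.transform(test)))
```

Fitting the scaler on the test set instead would leak test statistics into preprocessing, so the manual version and `StandardScaler` both reuse `mu` and `sigma` from training.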

In [None]:
from sklearn.decomposition import PCA 
from scipy import stats
from sklearn.preprocessing import StandardScaler

#Set Reduced_dim for facial expression features and audio features, respectively.
reduced_dim_v = 20
reduced_dim_a = 15

#Extract the subspace for facial expression features through PCA
pca_v = PCA(reduced_dim_v, random_state=0) #Random state ensures we get same results on different runs
pca_v.fit(training_data)

#Transform training_data and testing data respectively
training_data_transform = pca_v.transform(training_data)
testing_data_transform = pca_v.transform(testing_data)

#Extract the subspace for audio features through PCA
pca_a = PCA(reduced_dim_a, random_state=0) 
pca_a.fit(training_data_proso)

#Transform the training_data and testing_data respectively
training_data_trans = pca_a.transform(training_data_proso)
testing_data_trans = pca_a.transform(testing_data_proso)

#Normalize the features
norm_a = StandardScaler()
norm_v = StandardScaler()

norm_training_a = norm_a.fit_transform(training_data_trans)
norm_testing_a = norm_a.transform(testing_data_trans)
norm_training_v = norm_v.fit_transform(training_data_transform)
norm_testing_v = norm_v.transform(testing_data_transform)


#Concatenate the transformed training data of facial expression features and audio features together
combined_train = np.concatenate((norm_training_v, norm_training_a), axis=1)

#Concatenate the transformed testing data of facial expression features and audio features together
combined_test = np.concatenate((norm_testing_v, norm_testing_a), axis=1)



### Question 1. Why is PCA used? Why not just concatenate the extracted features without PCA?

### Your answer:

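PCA reduces the dimensionality of each feature set while keeping most of its variance; some information is discarded, but mainly along low-variance directions. Concatenating the raw features directly would give a 723-dimensional vector (708 facial + 15 audio dimensions) for only 50 training samples, which invites the curse of dimensionality, and the much larger facial feature set could dominate the fused representation.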
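
How much variance a PCA projection keeps can be checked with `explained_variance_ratio_`. The sketch below is only an illustration on synthetic data (a hypothetical matrix with 5 latent factors, not the lab features), so the printed number says nothing about the actual assignment data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 200 features driven by 5 latent factors plus noise
latent = rng.normal(size=(50, 5))
mixing = rng.normal(size=(5, 200))
X = latent @ mixing + 0.1 * rng.normal(size=(50, 200))

pca = PCA(n_components=5, random_state=0).fit(X)
retained = pca.explained_variance_ratio_.sum()
projected = pca.transform(X)

print("Variance retained by 5 components:", round(float(retained), 3))
print("Shape after projection:", projected.shape)
```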

### Feature classification
Use the __[`SVM`](http://scikit-learn.org/stable/modules/svm.html)__ function to train Support Vector Machine (SVM) classifiers.
Construct an SVM using the combined training data and a linear kernel. The `training_class` group vector contains the class of samples: 1 = happy, 2 = sadness, corresponding to the rows of the training data matrices.

Then, calculate the average classification performance for both training and testing data. The correct class labels corresponding to the rows of the training and testing data matrices are in the variables `training_class` and `testing_class`, respectively.

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score
# Train SVM classifier
classifier = svm.SVC(kernel="linear")
classifier.fit(combined_train, training_class)

#The prediction results
predict_train = classifier.predict(combined_train)
predict_test = classifier.predict(combined_test)

#Calculate and print the training accuracy and testing accuracy. 
testing_acc = accuracy_score(testing_class, predict_test)
print(testing_acc)

training_acc = accuracy_score(training_class, predict_train)
print(training_acc)



0.98
1.0


Compute the confusion matrices using the __[`sklearn.metrics.confusion_matrix()`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)__ function for both the training data and testing data.

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(training_class, predict_train))

print(confusion_matrix(testing_class, predict_test))


[[25  0]
 [ 0 25]]
[[25  0]
 [ 1 24]]
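
As a small sketch of how to read these matrices: in scikit-learn's convention rows are true classes and columns are predicted classes, with labels in sorted order (here 1 = happy, then 2 = sadness). The testing matrix above is copied in by hand; the variable names are illustrative.

```python
import numpy as np

# Testing confusion matrix from the output above (rows: true class, columns: prediction)
cm = np.array([[25, 0],
               [1, 24]])

accuracy = np.trace(cm) / cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)     # share of each true class recovered
precision_per_class = np.diag(cm) / cm.sum(axis=0)  # share of each prediction that is correct

print("accuracy:", accuracy)
print("recall (happy, sadness):", recall_per_class)
print("precision (happy, sadness):", precision_per_class)
```

The single off-diagonal entry is one sadness sample predicted as happy, which accounts for the 0.98 testing accuracy.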


## Task 2. 
As opposed to a simple concatenation we can try something smarter that utilizes the common characteristics of the fused features. This is achieved using the CCA. Use the PCA transformed vectors and set the number of components for the CCA to be 15.


Use the __[`sklearn.cross_decomposition.CCA()`](http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html)__ function to calculate the correlation coefficients of facial expression features and audio features. For `n_components` of CCA use the same number as the reduced dimensionality of the audio features in the previous task.

In [None]:
from sklearn.cross_decomposition import CCA
import numpy as np

#Use CCA to construct the Canonical Projective Vector (CPV)
cca = CCA(15)
cca.fit(training_data_transform, training_data_trans)

#Construct Canonical Correlation Discriminant Features (CCDF) for both the training data and testing data
train_v_cca, train_a_cca = cca.transform(training_data_transform, training_data_trans)
test_v_cca, test_a_cca = cca.transform(testing_data_transform, testing_data_trans)


# Concatenate the CCA transformed features for training data and testing data
combined_train_cca = np.concatenate((train_v_cca, train_a_cca), axis=1)
combined_test_cca = np.concatenate((test_v_cca, test_a_cca), axis=1)


Train an SVM classifier using a linear kernel, print the training and testing accuracy and compute the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix
# Train SVM classifier
classifier = svm.SVC(kernel="linear")
classifier.fit(combined_train_cca, training_class)

#The prediction results
predict_train = classifier.predict(combined_train_cca)
predict_test = classifier.predict(combined_test_cca)

#Calculate and print the training accuracy and testing accuracy. 
from sklearn.metrics import accuracy_score

training_acc = accuracy_score(training_class, predict_train)
print(training_acc)

testing_acc = accuracy_score(testing_class, predict_test)
print(testing_acc)

print(confusion_matrix(training_class, predict_train))


1.0
0.92
[[25  0]
 [ 0 25]]


### Question 2. In this exercise a feature-level method was used to fuse the features. What are the other types of methods for data fusion?

### Your answer:

Model-level, decision-level, match-score-level and sensor-level.
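
For contrast with feature-level fusion, a late-fusion variant can be sketched as follows: one SVM per modality, with the two decision values summed before thresholding. This is an illustrative assumption about how such fusion might be set up, run on synthetic data (the helper `make_view` and all arrays are hypothetical), not a method prescribed by the assignment.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
# Hypothetical binary labels (1 and 2), 25 samples per class
y_train = np.repeat([1, 2], 25)
y_test = np.repeat([1, 2], 25)

def make_view(labels, dim, shift):
    # Gaussian features; class 2 samples are shifted along every dimension
    return rng.normal(size=(len(labels), dim)) + shift * (labels[:, None] == 2)

Xv_train, Xv_test = make_view(y_train, 20, 1.0), make_view(y_test, 20, 1.0)
Xa_train, Xa_test = make_view(y_train, 15, 1.0), make_view(y_test, 15, 1.0)

# Train one linear SVM per modality
clf_v = svm.SVC(kernel="linear").fit(Xv_train, y_train)
clf_a = svm.SVC(kernel="linear").fit(Xa_train, y_train)

# Sum the signed decision values; positive scores map to the second class
fused_score = clf_v.decision_function(Xv_test) + clf_a.decision_function(Xa_test)
fused_pred = np.where(fused_score > 0, clf_v.classes_[1], clf_v.classes_[0])
fused_acc = np.mean(fused_pred == y_test)
print("Fused accuracy:", fused_acc)
```

Unlike feature-level fusion, each modality keeps its own classifier here, so one modality can be retrained or dropped without touching the other.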

### Question 3. Compare the results from all the different methods from assignments 1, 2 and 3. What method performed the best? What was the worst? Hypothesize as to why certain methods performed better than others.

### Your answer:

## Task 3: 
For a more reliable evaluation, often the **Leave-One-Subject-Out (LOSO) cross-validation** is used instead of the common train-test split. Cross-validation gives us a more reliable measure of the performance as all of the data is used for both training and testing. LOSO is used as emotions are highly dependent on the subject. By using LOSO, we guarantee that a subject is always in either the training or testing data and not in both.

* Join the training/testing data matrices and the class vectors. Combine also the ‘training_personID’ and ‘testing_personID’ vectors.

* Assume we have a total of $n$ subjects. Now, we will create a total of $n$ folds (loops), where each fold's training set contains the data from $n-1$ subjects and the testing set consists of only $1$ subject.

* Follow the steps taken in the first task: project the data to a subspace using PCA, concatenate the audio and video features together, train an SVM and finally evaluate the performance.

* The solution should be able to generalize over different numbers of subjects and samples, *e.g.*, a dataset may have 24 subjects, where subject1 has 4 samples and subject2 has 32 samples.
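
The fold structure described in the list above can also be produced with scikit-learn's `LeaveOneGroupOut`, which handles any number of subjects and uneven sample counts. The sketch below uses a toy array (hypothetical subject IDs and features) purely to show the splits; it is an alternative to writing the masks by hand, not a required approach.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy dataset: 3 subjects with unequal numbers of samples
subjects = np.array([1, 1, 2, 2, 2, 2, 5, 5, 5])
X = np.arange(len(subjects) * 2).reshape(len(subjects), 2)

logo = LeaveOneGroupOut()
folds = list(logo.split(X, groups=subjects))

for train_idx, test_idx in folds:
    held_out = np.unique(subjects[test_idx])
    print("held-out subject:", held_out,
          "train size:", len(train_idx), "test size:", len(test_idx))
```

Each fold returns integer index arrays, so `X[train_idx]` and `X[test_idx]` give the per-fold training and testing sets.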

In [None]:
mdata = sio.loadmat('lab3_data.mat')

#Combine the training data, testing data, labels and person ID for video and audio respectively,
#in order to get the whole dataset. 
video_data = np.append(mdata["training_data"], mdata["testing_data"], axis=0)
proso_data = np.append(mdata["training_data_proso"], mdata["testing_data_proso"], axis=0)

labels = np.append(np.ravel(mdata["training_class"]), np.ravel(mdata["testing_class"]), axis=0)
subjects = np.append(np.ravel(mdata["training_personID"]), np.ravel(mdata["testing_personID"]), axis=0)

#Get the unique subject IDs
subject_ids = np.unique(subjects)

#Print the shapes and the list of subject_ids for a sanity check
print("Shape of video_data:", video_data.shape)
print("Shape of proso_data:", proso_data.shape)
print("Shape of labels:", labels.shape)
print("Shape of subjects:", subjects.shape)
print("Value of subject_ids:", subject_ids)


Shape of video_data: (100, 708)
Shape of proso_data: (100, 15)
Shape of labels: (100,)
Shape of subjects: (100,)
Value of subject_ids: [ 1  2  3  4  5  7  8  9 10 12]


In [None]:
accuracies = []
#Loop over each subject
for subject_id in subject_ids:
    #Create boolean masks for the training and testing set indices
    #train_idx has the form [True, True, False, ...], where True marks the samples
    #that do not belong to the current subject_id
    train_idx = subjects != subject_id
    #Similarly for test_idx, True marks the samples of the current subject_id
    test_idx = subjects == subject_id
    
    #Create the training and testing sets for lbp, proso and labels by indexing video_data, proso_data and labels
    #with the boolean masks train_idx and test_idx
    
    lbp_data_train = video_data[train_idx]
    proso_data_train = proso_data[train_idx]
    labels_train = labels[train_idx]
    
    lbp_data_test = video_data[test_idx]
    proso_data_test = proso_data[test_idx]
    labels_test = labels[test_idx]
    
    #Create the PCA for both lbp and proso. We take a slight shortcut compared to task 1,
    #by using the whiten=True parameter for normalizing the features. This means that
    #there is no need for normalization afterwards
    pca_v = PCA(n_components=20, whiten=True, random_state=0)
    pca_a = PCA(n_components=15, whiten=True, random_state=0)
    
    #Fit the PCAs with the training data
    pca_v.fit(lbp_data_train)
    pca_a.fit(proso_data_train)
    
    #Transform both the training and testing data with the PCA
    lbp_data_train_pca = pca_v.transform(lbp_data_train)
    proso_data_train_pca = pca_a.transform(proso_data_train)
    lbp_data_test_pca = pca_v.transform(lbp_data_test)
    proso_data_test_pca = pca_a.transform(proso_data_test)
    
    #Concatenate the features together
    train_combined = np.concatenate((lbp_data_train_pca, proso_data_train_pca), axis=1) 
    test_combined = np.concatenate((lbp_data_test_pca, proso_data_test_pca), axis=1)
    
    #Create a linear SVM and train it
    classifier = svm.SVC(kernel="linear")
    classifier.fit(train_combined, labels_train)
    
    #Calculate the accuracy for the testing data and add it to the list of accuracies
    predict_test = classifier.predict(test_combined)
    accuracies.append(accuracy_score(labels_test, predict_test))
    
#Calculate the average of the accuracies. Print both the list of accuracies and the average    
print("List of Accuracies: ", accuracies)
print("Mean: ", np.mean(accuracies))

List of Accuracies:  [0.9, 0.8, 1.0, 0.9, 0.9, 1.0, 1.0, 1.0, 0.8, 1.0]
Mean:  0.93


### Question 4. The accuracy of LOSO (0.93) is lower than the accuracy achieved by the train-test split (0.98) in task 1. Hypothesize as to why the two are different. Which one is better for evaluation?

### Your answer:

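The two estimates differ partly because the train-test split is a single partition: its 0.98 is measured on only five test subjects, so it depends on which subjects land in the test set. LOSO averages over ten folds and includes harder subjects (two folds reach only 0.8), which pulls the mean down to 0.93. LOSO is the better evaluation protocol: every sample is tested exactly once, and the held-out subject is never seen during training, so the score reflects how well the system generalizes to new people.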
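
The spread of the per-subject accuracies printed by the LOSO loop can be summarized as a quick check. The values below are copied by hand from that output; the variable names are illustrative.

```python
import numpy as np

# Per-subject accuracies copied from the LOSO output above
loso_acc = np.array([0.9, 0.8, 1.0, 0.9, 0.9, 1.0, 1.0, 1.0, 0.8, 1.0])

mean_acc = loso_acc.mean()
std_acc = loso_acc.std()  # population standard deviation across the ten folds

print("mean:", mean_acc)
print("std:", std_acc)
print("min/max:", loso_acc.min(), loso_acc.max())
```

A single train-test split reports only one number, whereas the fold-to-fold spread here shows how strongly the result depends on the held-out subject.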