# Affective Computing - Programming Assignment 3

### Objective

Your task is to use the feature-level method to combine facial expression features and audio features. You will build a multi-modal emotion recognition system that distinguishes happy from sad facial expressions (a binary classification problem) using a classifier training and testing structure.

The data comes from lab1 and lab2: ten actors performing happy and sad behaviors.
* Task 1: Subspace-based feature fusion method: In this case, z-score normalization is utilized. Please read “Fusing Gabor and LBP feature sets for kernel-based face recognition” and learn how to use subspace-based feature fusion method for multi-modal system.

* Task 2: Based on Task 1, use Canonical Correlation Analysis (CCA) to calculate the correlation coefficients of facial expression and audio features. Finally, use CCA to build a multi-modal emotion recognition system. The method is described in the conference paper “Feature fusion method based on canonical correlation analysis and handwritten character recognition”.
* Task 3: Based on Task 1, create a Leave-One-Subject-Out (LOSO) cross-validation to estimate the performance more reliably.

To build the emotion recognition systems, Support Vector Machine (SVM) classifiers are trained. 50 videos from 5 participants are used to train the systems using spatiotemporal features. The remaining 50 videos are used to evaluate the performance of the trained systems.

## Task 1. Subspace-based method  
Please read “Fusing Gabor and LBP feature sets for kernel-based face recognition” and apply their framework for the exercise. We use Support Vector Machine (SVM) with linear kernel for classification. As opposed to using Gabor features we are using the prosodic features from the last exercise.


### Setting up the environment 

First, we need to import the basic modules for loading and processing the data.

In [14]:
import sys
sys.path.append('../')
from skimage import io
from skimage import transform
from skimage import color
from skimage import img_as_ubyte
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import sklearn
import scipy.io as sio

### Loading data  <font color='red'>(0.5 point)</font>

We load the facial expression data (training data, training class, testing data, testing class) and audio data (training data, testing data)

In [15]:
mdata = sio.loadmat('lab3_data.mat')
#Facial expression training and testing data, training and testing class
training_data = mdata["training_data"]
testing_data = mdata["testing_data"]
training_class = mdata["training_class"].ravel()
testing_class = mdata["testing_class"].ravel()

#Audio training and testing data
training_data_proso = mdata["training_data_proso"]
testing_data_proso = mdata["testing_data_proso"]

### Extract the subspace for facial expression features and audio features <font color='red'>(2 point)</font>
Extract the subspace for facial expression features and audio features using principal component analysis, implemented in the **PCA class**.
The `reduced_dim` is the dimensionality of the reduced subspace.
Set `reduced_dim` to 20 and 15 for facial expression features and audio features, respectively. Normalization should be done subject wise. The test data should be normalized with the values from the training data.
For concatenating the features use the __[`np.concatenate()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html)__ function.
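
The normalization and concatenation steps described above can be sketched as follows. This is a minimal illustration on synthetic arrays, not the lab data; the helper name `zscore_with_train_stats` and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for PCA-projected features (not the lab data)
train_v = rng.normal(5.0, 2.0, size=(50, 20))
test_v = rng.normal(5.0, 2.0, size=(50, 20))
train_a = rng.normal(-1.0, 0.5, size=(50, 15))
test_a = rng.normal(-1.0, 0.5, size=(50, 15))

def zscore_with_train_stats(train, test):
    """Standardize both sets using the mean and std of the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

train_v_n, test_v_n = zscore_with_train_stats(train_v, test_v)
train_a_n, test_a_n = zscore_with_train_stats(train_a, test_a)

# Feature-level fusion: stack the two modalities along the feature axis
fused_train = np.concatenate((train_v_n, train_a_n), axis=1)
fused_test = np.concatenate((test_v_n, test_a_n), axis=1)
print(fused_train.shape, fused_test.shape)
```

Reusing the training statistics for the test set keeps information from the test data out of the preprocessing.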

You will implement the PCA class with two methods, **fit** and **transform**. The **fit** method takes one input array and returns nothing; the **transform** method takes one input array and returns a transformed array with dimensions (n_samples, n_components). Use __[`numpy.linalg.svd`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html)__ to extract the singular values.
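
As a sanity check on the SVD route, the sketch below uses synthetic data to show that the squared singular values of the centered data, divided by `n_samples - 1`, match the eigenvalues of the covariance matrix. The variable names are chosen for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))      # synthetic data: 30 samples, 5 features
Xc = X - X.mean(axis=0)           # center each feature

# Route 1: SVD of the centered data; rows of Vt are the principal axes
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_from_svd = S ** 2 / (X.shape[0] - 1)

# Route 2: eigenvalues of the covariance matrix (features as columns)
cov = np.cov(Xc, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order

print(np.allclose(var_from_svd, eigvals))  # True
```

Note that `np.cov` treats rows as variables by default, so `rowvar=False` is needed when samples are stored in rows.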

In [16]:
class PCA:
    """Principal component analysis (PCA).
    Parameters
    ----------
    n_components : int
        Number of principal components to use.
    whiten : bool, default=False
        When true, the output of transformed features is divided by the
        square root of the explained variance.
    Examples
    --------
    >>> import numpy as np
    >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    >>> pca = PCA(n_components=2)
    >>> pca.fit(X)
    >>> pca.transform(X)
    array([[ 1.38340578,  0.2935787 ],
           [ 2.22189802, -0.25133484],
           [ 3.6053038 ,  0.04224385],
           [-1.38340578, -0.2935787 ],
           [-2.22189802,  0.25133484],
           [-3.6053038 , -0.04224385]])
    """
    def __init__(self, n_components: int, whiten: bool = False) -> None:
        self.n_components = n_components
        self.whiten = whiten
        self.selected_components = None
        self.mean = None 
                   
    def fit(self, X: np.ndarray) -> None:
        """Fit the model with X.
        Parameters
        ----------
        X : a numpy array with dimensions (n_samples, n_features)
        """  
        n_samples, n_features = X.shape
        
        #Step 1: Find the mean, and center the data
        self.mean = np.mean(X, axis=0)
        X = X - self.mean
        
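        #Step 2: The covariance matrix is not formed explicitly. For centered X,
        #cov = X.T @ X / (n_samples - 1), so the SVD of X yields its eigenvectors directly.

        #Step 3: Apply SVD to the centered data and choose the components.
        #X is not square, so the hermitian argument must stay False.
        U, S, V = np.linalg.svd(X, full_matrices=False, hermitian=False)
        #Rows of V are the principal axes, sorted by decreasing singular value
        self.selected_components = V[:self.n_components]
        #Explained variance of each kept component, from the singular values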
        self.explained_variance = ((S ** 2) / (n_samples - 1))[:self.n_components]

        
    def transform(self, X: np.ndarray) -> np.ndarray:
        """Transform X with the fitted model.
        Parameters
        ----------
        X : a numpy array with dimensions (n_samples, n_features)
        
        Returns
        -------
        X_transformed: a numpy array with dimensions (n_samples, n_components)
        """
        # Center the data
        X_transformed = X - self.mean
        
        # Step 4: Choose and transform the features
        X_transformed = np.dot(X_transformed, self.selected_components.T)
        
        if self.whiten:
            # Normalize the transform features
            X_transformed /= np.sqrt(self.explained_variance)
            
        return X_transformed
    

In [17]:
#X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
#pca = PCA(n_components=2)
#pca.fit(X)
#pca.transform(X)
#print(training_data.shape)
#print(training_data_proso.shape)

In [18]:
#from sklearn.decomposition import PCA 
from scipy import stats

#Set Reduced_dim for facial expression features and audio features, respectively.
reduced_dim_v = 20
reduced_dim_a = 15

#Extract the subspace for facial expression features through PCA. 
#If you are using sklearn use random_state=0, to ensure consistent results
pca_v = PCA(reduced_dim_v)
pca_v.fit(training_data)

#Transform training_data and testing data respectively
v_transformed_training_data = pca_v.transform(training_data)
v_transformed_testing_data = pca_v.transform(testing_data)

#Extract the subspace for audio features through PCA
pca_a = PCA(reduced_dim_a)
pca_a.fit(training_data_proso)

#Transform the training_data and testing_data respectively
a_transformed_training_data = pca_a.transform(training_data_proso)
a_transformed_testing_data = pca_a.transform(testing_data_proso)

#Normalize the features
#v_normalized_training_data = stats.zscore(v_transformed_training_data)
#a_normalized_training_data = stats.zscore(a_transformed_training_data)

#Statistics come from the training data only and are reused for the testing data
v_mean = np.mean(v_transformed_training_data, axis=0)
v_std = np.std(v_transformed_training_data, axis=0)
v_normalized_training_data = (v_transformed_training_data - v_mean) / v_std
v_normalized_testing_data = (v_transformed_testing_data - v_mean) / v_std

a_mean = np.mean(a_transformed_training_data, axis=0)
a_std = np.std(a_transformed_training_data, axis=0)
a_normalized_training_data = (a_transformed_training_data - a_mean) / a_std
a_normalized_testing_data = (a_transformed_testing_data - a_mean) / a_std

#Concatenate the transformed training data of facial expression features and audio features together
combined_train = np.concatenate((v_normalized_training_data,a_normalized_training_data),axis=1)

#Concatenate the transformed testing data of facial expression features and audio features together
combined_test = np.concatenate((v_normalized_testing_data,a_normalized_testing_data),axis=1)

### Question 1. Why is PCA used? Why not just concatenate the extracted features without PCA? <font color='red'>(0.5 point)</font>

### Your answer:

It reduces dimensionality and makes the processing more efficient.
The facial expression data has 708 features (dimensions), while only 50 training samples are available. With PCA, this is reduced to 20 dimensions that retain most of the variance of the original data, so we trade a small loss of information for a much cheaper classification problem. Reducing both modalities to a similar size also keeps the 708 facial features from dominating the 15 audio features after concatenation.
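
How much information a reduced subspace retains can be checked with the cumulative explained variance. The sketch below uses synthetic data built from 5 latent factors, which is an assumption for illustration and not the lab data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 50 observed features driven by 5 latent factors plus noise
latent = rng.normal(size=(100, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(100, 50))

Xc = X - X.mean(axis=0)
S = np.linalg.svd(Xc, compute_uv=False)   # singular values only
explained_ratio = S ** 2 / np.sum(S ** 2)
cumulative = np.cumsum(explained_ratio)

# Fraction of the total variance captured by the first 5 components
print(cumulative[4])
```

When a few components already capture nearly all the variance, the remaining dimensions mostly carry noise and can be dropped with little loss.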

### Feature classification <font color='red'>(0.5 point)</font>
Use the __[`SVM`](http://scikit-learn.org/stable/modules/svm.html)__ function to train Support Vector Machine (SVM) classifiers.
Construct an SVM using the combined training data and linear kernel. The `training_class` group vector contains the class of samples: 1 = happy, 2 = sadness, corresponding to the rows of the training data matrices.

Then, calculate average classification performances for both training and testing data. The correct class labels corresponding with the rows of the training and testing data matrices are in the variables ‘training_class’ and ‘testing_class’, respectively.

In [19]:
from sklearn import svm
from sklearn.metrics import accuracy_score

# Train SVM classifier
svm_p = svm.SVC(kernel='linear')
svm_p.fit(combined_train, training_class)

#The prediction results
predict_train = svm_p.predict(combined_train)
predict_test = svm_p.predict(combined_test)

#Calculate and print the training accuracy and testing accuracy. 
print(accuracy_score(training_class, predict_train))
print(accuracy_score(testing_class, predict_test))

1.0
0.98


### <font color='red'>(0.5 point)</font>
Compute the confusion matrices using the __[`sklearn.metrics.confusion_matrix()`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)__ function for both the training data and testing data.
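
To help read the output, here is a small sketch with made-up labels (not the lab data). In scikit-learn's convention, rows correspond to the true class and columns to the predicted class.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = happy, 2 = sadness
y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 2, 1, 2]

cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
print(cm)
# [[3 0]
#  [1 2]]

# The diagonal counts correct predictions, so accuracy is trace / total
accuracy = cm.trace() / cm.sum()
```

Here the single off-diagonal entry in the second row is a sadness sample that was predicted as happy.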


In [20]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(training_class,predict_train))
print(confusion_matrix(testing_class,predict_test))

[[25  0]
 [ 0 25]]
[[25  0]
 [ 1 24]]


## Task 2. 
As opposed to a simple concatenation we can try something smarter that utilizes the common characteristics of the fused features. This is achieved using the CCA. Use the PCA transformed vectors and set the number of components for the CCA to be 15.


### <font color='red'>(1 point)</font>

Use (__[`sklearn.cross_decomposition.CCA()`](http://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html)__) function to calculate the correlation coefficients of facial expression features and audio features. For `n_components` of CCA use the same number as the reduced dimensionality of the audio features in the previous task.

In [21]:
from sklearn.cross_decomposition import CCA
import numpy as np

#Use CCA to construct the Canonical Projective Vector (CPV)
cca = CCA(reduced_dim_a)
cca.fit(v_transformed_training_data, a_transformed_training_data)

#Construct Canonical Correlation Discriminant Features (CCDF) for both the training data and testing data
v_training_c, a_training_c = cca.transform(v_transformed_training_data, a_transformed_training_data)
v_testing_c, a_testing_c = cca.transform(v_transformed_testing_data, a_transformed_testing_data)

# Concatenate the CCA transformed features for training data and testing data
combined_train_c = np.concatenate((v_training_c,a_training_c),axis=1)
combined_test_c = np.concatenate((v_testing_c,a_testing_c),axis=1)

### <font color='red'>(1 point)</font>
Train an SVM classifier using a linear kernel, print the training and testing accuracy and compute the confusion matrix.

In [22]:
#Train svm classifier 
svm_c = svm.SVC(kernel='linear')
svm_c.fit(combined_train_c, training_class)

#The prediction results
predict_train_c = svm_c.predict(combined_train_c)
predict_test_c = svm_c.predict(combined_test_c)

#Calculate and print the training accuracy and testing accuracy. 
print(accuracy_score(training_class, predict_train_c))
print(accuracy_score(testing_class, predict_test_c))

# Compute the confusion matrix using sklearn.metrics.confusion_matrix() function for training data and testing data respectively
print(confusion_matrix(training_class,predict_train_c))
print(confusion_matrix(testing_class,predict_test_c))


1.0
0.92
[[25  0]
 [ 0 25]]
[[25  0]
 [ 4 21]]


### Question 2. In this exercise a feature-level method was used to fuse the features. What are the other types of methods for data fusion? <font color='red'>(0.5 point)</font>

### Your answer:

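Other types of data fusion are observation-level (also called data-level or early) fusion, where the raw signals are combined before feature extraction, and decision-level (late) fusion, where the outputs of separate per-modality classifiers are combined.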
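
For contrast with the feature-level approach used in this exercise, decision-level fusion could look roughly like the sketch below. The per-modality scores are hypothetical numbers, and equal-weight averaging is only one of several possible combination rules.

```python
import numpy as np

# Hypothetical per-modality scores: estimated probability that a clip is "happy"
p_video = np.array([0.9, 0.4, 0.2, 0.6])
p_audio = np.array([0.8, 0.7, 0.1, 0.2])

# Decision-level fusion: combine the classifier outputs, not the features
p_fused = 0.5 * p_video + 0.5 * p_audio

# Map fused scores to the class labels used here: 1 = happy, 2 = sadness
labels = np.where(p_fused >= 0.5, 1, 2)
print(labels)  # [1 1 2 2]
```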

### Question 3. Compare the results from all the different methods from assignments 1, 2 and 3. What method performed the best? What was the worst? Hypothesize as to why certain methods performed better than others. <font color='red'>(0.5 point)</font>

### Your answer:

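The worst accuracy was achieved by analyzing just the image data. The most accurate method turned out to be the one using the fusion of the image and the audio data. A plausible explanation is that the two modalities carry complementary cues, so samples that are ambiguous in one modality can still be separated using the other.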

## Task 3: 
For a more reliable evaluation, often the Leave-One-Subject-Out (LOSO) cross-validation is used instead of the common train-test split. Cross-validation gives us a more reliable measure of the performance as all of the data is used for both training and testing. LOSO is used as emotions are highly dependent on the subject. By using LOSO, we guarantee that a subject is always in either the training or testing data and not in both.

* Join the training/testing data matrices and the class vectors. Also combine the ‘training_personID’ and ‘testing_personID’ vectors.

* Assume we have a total of $n$ subjects. Now, we will create a total of $n$ folds (loops), where each fold's training set contains the data from $n-1$ subjects and the testing set consists of only $1$ subject.

* Follow the steps taken in the first task: project the data to a subspace using PCA, concatenate the audio and video features together, train an SVM and finally evaluate the performance.

* The solution should be able to generalize over different numbers of subjects and samples, *e.g.*, a dataset may have 24 subjects, where subject1 has 4 samples and subject2 has 32 samples.
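
The fold construction described in the list above can be sketched with boolean masks. The generator name `loso_splits` and the subject IDs are made up for this example; the masks work for any number of subjects and any number of samples per subject.

```python
import numpy as np

def loso_splits(subjects):
    """Yield (subject_id, train_mask, test_mask) for each unique subject."""
    subjects = np.asarray(subjects)
    for subject_id in np.unique(subjects):
        test_mask = subjects == subject_id
        yield subject_id, ~test_mask, test_mask

# Hypothetical subject IDs with unequal sample counts per subject
subjects = np.array([1, 1, 1, 1, 2, 2, 7, 7, 7])
folds = list(loso_splits(subjects))

for subject_id, train_mask, test_mask in folds:
    # Prints the held-out subject and the train/test sample counts
    print(subject_id, train_mask.sum(), test_mask.sum())
```

Because the test mask is the exact complement of the training mask, no subject can appear on both sides of a fold.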

### <font color='red'>(0.5 point)</font>

In [23]:
mdata = sio.loadmat('lab3_data.mat')
#Combine the training data, testing data, label and person ID for video and audio respectively, in order to get the whole dataset. 
lbp_data = np.concatenate((mdata["training_data"],mdata["testing_data"]),axis=0)
proso_data =  np.concatenate((mdata["training_data_proso"],mdata["testing_data_proso"]),axis=0)

labels = np.concatenate((mdata["training_class"],mdata["testing_class"]),axis=0).ravel()
subjects = np.concatenate((mdata["training_personID"],mdata["testing_personID"]),axis=0).ravel()

#Get the unique subject IDs
subject_ids = np.unique(subjects)

#Print the shapes and the list of subject_ids for a sanity check
print(lbp_data.shape)
print(proso_data.shape)
print(labels.shape)
print(subjects.shape)
print(subject_ids)

(100, 708)
(100, 15)
(100,)
(100,)
[ 1  2  3  4  5  7  8  9 10 12]


### <font color='red'>(2 point)</font>

In [24]:
accuracies = [] 
#Loop over each subject
for subject_id in subject_ids:
    #Create a boolean array for the training and testing set indices
    #The train_idx should be a list of form [True, True, False, ...], where True indicates the position
    #for the samples that are not the current subject_id
    train_idx = subjects != subject_id
    #Similar for the test_idx, True indicates the position of the current subject_id
    test_idx = subjects == subject_id
    
    #Create the training and testing sets for lbp, proso and labels by indexing lbp_data, proso_data and labels
    #with the boolean arrays train_idx and test_idx
    lbp_data_train = lbp_data[train_idx]
    proso_data_train = proso_data[train_idx]
    labels_train = labels[train_idx]
    
    lbp_data_test = lbp_data[test_idx]
    proso_data_test = proso_data[test_idx]
    labels_test = labels[test_idx]
    
    #Create the PCA for both lbp and proso. We take a slight shortcut compared to task 1,
    #by using the whiten=True parameter for normalizing the features. This means that
    #there is no need for normalization afterwards
    pca_v_l = PCA(n_components=20, whiten=True)
    pca_a_l = PCA(n_components=15, whiten=True)
    
    #Fit the PCAs with the training data
    pca_v_l.fit(lbp_data_train)
    pca_a_l.fit(proso_data_train)
    
    #Transform both the training and testing data with the PCA
    v_transformed_train = pca_v_l.transform(lbp_data_train)
    v_transformed_test = pca_v_l.transform(lbp_data_test)
    
    a_transformed_train = pca_a_l.transform(proso_data_train)
    a_transformed_test = pca_a_l.transform(proso_data_test)
    
    #Concatenate the features together
    combined_train_l = np.concatenate((v_transformed_train,a_transformed_train),axis=1)
    combined_test_l = np.concatenate((v_transformed_test,a_transformed_test),axis=1)
    
    #Create a linear SVM and train it
    svm_l = svm.SVC(kernel='linear')
    svm_l.fit(combined_train_l, labels_train)

    predict_train_l = svm_l.predict(combined_train_l)
    predict_test_l = svm_l.predict(combined_test_l)

    #Calculate the accuracy for the testing data and add it to the list of accuracies
    accuracies.append(accuracy_score(labels_test, predict_test_l))
        
#Calculate the average of the accuracies. Print both the list of accuracies and the average    
print(accuracies)
print(np.mean(accuracies))

[0.9, 0.8, 1.0, 0.9, 0.9, 1.0, 1.0, 1.0, 0.8, 1.0]
0.93


### Question 4. The accuracy of LOSO (0.93) is lower than the accuracy achieved by the train-test split (0.98) in task 1. Hypothesize as to why the two are different. Which one is better for evaluation?  <font color='red'>(0.25 point)</font>

### Your answer:

LOSO ensures that the person on whom the model is tested was never used in the training process. This lowers the accuracy measured in the laboratory, but gives more insight into how the model would perform in a real-world application, where the faces have never been seen before. The train-test split has higher accuracy because, although the samples in the train and test sets are not the same, they come from the same group of people. LOSO is therefore the better choice for evaluation.

### Question 5. In PCA, why is using the `whiten` parameter better, and why does it replace the normalization?  <font color='red'>(0.25 point)</font>

### Your answer:

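Whitening rescales every principal component to unit variance by dividing it by the square root of its explained variance. Because the PCA projection is already centered on the training mean, the whitened training features have zero mean and unit variance, which is what z-score normalization would produce (up to the n versus n-1 convention), so the separate normalization step becomes redundant. The test data is scaled with the same training-derived mean and variances. Equal variances also keep a few high-variance components from dominating the linear SVM after the facial and audio features are concatenated.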
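
The equivalence between whitening and z-scoring the projected features can be checked numerically. This is a minimal sketch on synthetic data; it uses the sample standard deviation (`ddof=1`) so that it matches the `n_samples - 1` convention of the explained variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data whose features have very different scales
X = rng.normal(size=(40, 6)) * np.array([10, 5, 2, 1, 0.5, 0.1])

Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3
proj = Xc @ Vt[:k].T                                  # plain PCA projection
explained_variance = (S ** 2 / (X.shape[0] - 1))[:k]
whitened = proj / np.sqrt(explained_variance)         # what whiten=True does

# z-score of the projected features, using the sample standard deviation
zscored = (proj - proj.mean(axis=0)) / proj.std(axis=0, ddof=1)

print(np.allclose(proj.mean(axis=0), 0))   # True: projections are centered
print(np.allclose(whitened, zscored))      # True: whitening equals z-scoring
```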