# Machine learning for acoustic signals and text

This practical work provides an introduction to the basics of machine learning (ML) for acoustic signals and text using Python. ML refers to a set of artificial intelligence (AI) algorithms that learn to perform a task using data. The type of data used is usually referred to as the modality used for ML. Here we will be working with audio (specifically acoustic signals) and text, and the tasks considered here are sentiment classification from text, emotion classification from acoustic signals, and gender classification for acoustic signals. To perform an ML task from different input modalities, there are traditionally several stages involved, which are shown below [(Hüffmeier et al. 2020)](https://www.researchgate.net/figure/An-Overview-of-the-Steps-That-Compose-the-Machine-Learning-Process-adopted-from-13_fig4_348446831):

![steps-ML](https://www.researchgate.net/profile/Johannes-Hueffmeier/publication/348446831/figure/fig4/AS:979783357825027@1610609964871/An-Overview-of-the-Steps-That-Compose-the-Machine-Learning-Process-adopted-from-13.png)


Here, you will learn the following:

- Data processing: download and load a dataset
- Data partitions: what is the difference between different partitions and how one can define them
- Feature extraction for text (tf-idf) and acoustics (filterbanks)
- Defining different metrics (accuracy, and unweighted average recall)
- Learning the basics of Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs)
- Training an SVM and a MultiLayer Perceptron (MLP), which is an ANN model
- Optimisation over different hyper-parameters, depending on the used model
- Evaluation and analysis of the trained models

Note that this practical work is divided into parts that are already filled in, which you just have to follow, and parts that you need to complete, which are marked as `Question` or `Exercise`. For the questions, write your answer under the question; for the exercises, fill in the missing parts of the code.

## Setting up the environment

Before we start with the practical work, we need to install and import the packages we will be using. To do so, run the scripts below:

In [1]:
#!pip install beautifulsoup4==4.11.2
!pip install nltk==3.8.1
!pip install numpy==1.24.1
!pip install pandas==1.5.3
!pip install -U scikit-learn==1.2.1
!pip install ffmpeg-python==0.2.0
!pip install scipy==1.10.0
!pip install matplotlib==3.6.3
!pip install python_speech_features==0.6.0

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.

In [2]:
import os, glob
import numpy as np
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

## Data processing and feature extraction

This section is divided into two parts, processing textual datasets and processing acoustic datasets, and introduces the processing stages and feature extraction for each modality. Each part is explained in more detail in its own section below.


### Processing a textual dataset

Here, we will learn how to load a textual dataset, and extract `tf-idf` features for machine learning (`tf-idf` was the subject of the previous practical work). Here, we will focus on a sentiment analysis dataset, called `Allociné`, which consists of reviews of French television series. These reviews can be positive (labelled `1`) or negative (labelled `0`). In this section, we do the following:

- Download the `Allociné` corpus
- Load the three partitions of `train`, `dev`, and `test` into memory
- Extract `tf-idf` features for each partition
- Read sentiment targets as numerical values, ready for machine learning usage

#### Downloading the dataset

You can run the script below to download the dataset. This dataset is the same as the one used in the previous practical work, so the related files can also be copied here and placed in the directory `allocine`, next to where this `TP.ipynb` file is stored.

In [3]:
#dl_path = "http://sentiment.nlproc.org/sentiment-dataset-fr.zip"
#os.system(f"wget {dl_path}")
#os.system(f"unzip ./sentiment-dataset-fr.zip -d ./allocine")
#os.system(f"rm ./sentiment-dataset-fr.zip")

#### Loading the data

Run the script below to load the dataset into memory.

In [4]:
def get_all_dataset(path):
    # Each row of the .tsv file is "<label>\t<comment>"; everything is loaded as strings
    data = np.loadtxt(path, dtype='str', delimiter='\t', skiprows=0)
    return data

# you might change the following lines according to where the files are stored in your system
train_data = get_all_dataset("./allocine/fr/train.tsv") 
dev_data = get_all_dataset("./allocine/fr/dev.tsv") 
test_data = get_all_dataset("./allocine/fr/test.tsv") 

print("An example of a comment in the train partition:\n", train_data[0,1], "\n")
print("An example of a comment in the dev partition:\n", dev_data[0,1], "\n")
print("An example of a comment in the test partition:\n", test_data[0,1], "\n")


An example of a comment in the train partition:
 probablement le meilleur pilote jamais réalisé pour une série télé . diablement addictif et interpété de manière inspiré , lost est une série à ne pas manquer . 

An example of a comment in the dev partition:
 ca commence doucement dans les premiers épisodes mais ensuite l'histoire prend une ampleur innatendue . bons débuts à confirmer dans la saison 2 . 

An example of a comment in the test partition:
 j'ai commencé à regarder la série à ses débuts et j'avais beaucoup aimé ( concept original , quelques scènes sympathiques ) . mais depuis le départ de shannen doherty , cette série a perdu tout son charmed . dommage , en tout cas vivement la fin , les scenarii s'enlisent , on décroche trop rapidement . 



#### Question 1

- You may have noticed that the dataset above comes with predefined `train`, `dev` and `test` partitions.
    - What is the difference between the `train`, `dev`, and `test` partitions? What is the purpose of each one?
    - Why do you think this dataset (like many others) has predefined partitions?


#### Extracting features

In the previous practical work, we learned about the statistical measure 'tf-idf', which can be used as a basic but useful feature for processing text. In this practical work, we would like to extract features for different partitions using `tf-idf`.
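
As a reminder of what the vectoriser computes, here is a minimal sketch in plain Python on a hypothetical three-sentence corpus. It assumes the default settings of `TfidfVectorizer` (smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row) and uses simple whitespace tokenisation, which is not the tokeniser that `sklearn` uses. The names `fit_idf` and `transform` are illustrative only.

```python
import math
from collections import Counter

def fit_idf(corpus):
    """Learn the vocabulary and smoothed idf weights from the training corpus only."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc.split()))  # document frequency: count each term once per document
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    return vocab, idf

def transform(doc, vocab, idf):
    """Map one document to an L2-normalised tf-idf row; words outside the vocabulary are ignored."""
    tf = Counter(doc.split())
    row = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else row

train_corpus = ["super série", "série nulle", "super super acteur"]  # hypothetical toy data
vocab, idf = fit_idf(train_corpus)
row = transform("super série inconnue", vocab, idf)  # "inconnue" is not in the vocabulary
print(vocab)
print([round(v, 3) for v in row])
```

The vocabulary and the idf weights are learned once, on the training corpus, and then reused unchanged for every other document, which mirrors how `vectorizer` is fitted on `train_data` only.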

In [5]:
def get_vectorizer(corpus):
    vectorizer = TfidfVectorizer()
    _ = vectorizer.fit_transform(corpus)
    return vectorizer

corpus = train_data[:,1]
vectorizer = get_vectorizer(corpus)

feats_train_text = vectorizer.transform(train_data[:,1]).toarray()
tars_train_text  = [int(num) for num in train_data[:,0]]

feats_dev_text = vectorizer.transform(dev_data[:,1]).toarray()
tars_dev_text  = [int(num) for num in dev_data[:,0]]

feats_test_text = vectorizer.transform(test_data[:,1]).toarray()
tars_test_text  = [int(num) for num in test_data[:,0]]

print("the first five tf-idf features for the train partition:\n", feats_train_text[:5])
print("the first five targets for the train partition:\n", tars_train_text[:5])

the first five tf-idf features for the train partition:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
the first five targets for the train partition:
 [1, 0, 1, 0, 1]


#### Question 2

- Above, what does each row in `feats_train_text` represent? Features are meant to provide distinctive attributes of the data; how can `tf-idf` values do that?

- What does each item in the `tars_train_text` indicate?

### Processing an acoustic dataset

Above, we talked about loading a textual dataset and extracting textual features. Here, we will do the same for an acoustic dataset. More specifically, we will work with the Canadian French Emotional (CaFE) speech dataset, which contains six different sentences, pronounced by six male and six female actors, in six basic emotions plus one neutral emotion. The six basic emotions are acted at two different intensities, mild ("Faible") and strong ("Fort") ([see here](https://zenodo.org/record/1478765#.Y_G96-zMI-Q)).

#### Downloading the dataset

You can download the dataset following [this link](https://zenodo.org/record/1478765#.Y_G96-zMI-Q), or you can run the scripts below to automatically download the dataset, unzip it, and remove the zip file keeping only the data folder.

In [6]:
#cmd = "curl 'https://zenodo.org/record/1478765/files/CaFE_48k.zip?download=1' --compressed --output cafe.zip"
#os.system(cmd)
#os.system("unzip ./cafe.zip -d ./cafe")
#os.system("rm ./cafe.zip")

#### Converting audio files (Exercise 1)

It is common practice in speech processing applications to convert all audio files to 16 kHz PCM-encoded wav files (the reason was the subject of a question in the previous practical work). Here, we therefore want to write a function that takes the directory containing the `CaFE` audio files and writes the converted files, with the same folder structure, to another directory.

**Note**: You may reuse the `ffmpeg` encoding script from the previous practical work for the audio conversion.

In [7]:
def printProgressBar (iteration: int, total: int, prefix = '', suffix = '', decimals = 1, length = "fit", fill = '█') -> None:
    """Prints a progress bar on the terminal
    """
    if length=="fit":
        rows, columns = os.popen('stty size', 'r').read().split() # checks how wide the terminal width is
        length = int(columns) // 2
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = '\r')
    if iteration == total: # go to new line when the progress bar is finished
        print()

def writeWavFiles(wavs_dir:str, output_dir:str, ext="wav") -> None:
    """
    This function writes wav files in the specific format of 16 bit integer PCM encoding at 16k rate and one audio channel (mono)
    Inputs:
        `wavs_dir`: the input folder of files where the unprocessed files exist
        `output_dir`: the folder of processed wav files for the output
    
    Note that we would like to keep the structure of folders inside `wavs_dir` for `output_dir`, 
    for example: the file "[wavs_dir]/Surprise/Faible/01-S-1-1.wav" would be processed
    and put as "[output_dir]/Surprise/Faible/01-S-1-1.wav"
    """
    wavFiles = glob.glob(os.path.join(wavs_dir, "**", "*."+ext), recursive=True)
    for i, filePath in enumerate(wavFiles):
        printProgressBar(i + 1, len(wavFiles), prefix = 'Transforming audio files:', suffix = 'complete', length=50)
        fileDirectory = os.path.split(filePath)[0]
        newName = os.path.split(filePath)[-1]
        newName = os.path.splitext(newName)[0] + ".wav"
        # Mirror the sub-folder structure of `wavs_dir` inside `output_dir`
        relDirectory = os.path.relpath(fileDirectory, wavs_dir)
        fileNewPath = os.path.normpath(os.path.join(output_dir, relDirectory, newName))
        os.makedirs(os.path.dirname(fileNewPath), exist_ok=True)
        # Quote both paths so that names containing spaces do not break the shell command
        os.system('ffmpeg -i "' + filePath + '" -ar 16000 -ac 1 -c:a pcm_s16le -af "volume=0dB" -hide_banner -v 0 -y "' + fileNewPath + '"')

#writeWavFiles("./cafe", "./cafe_p")


#### Extracting features (Exercise 2)

Here, the objective is to extract useful audio features for ML. `mel filter-banks` and `mfcc` features are two widely used traditional feature extraction techniques that are still popular today, thanks to their effectiveness and low computational requirements ([learn more here](https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html)). Such features are calculated over short periods of time (about 25 ms), during which an acoustic signal is considered to be statistically stationary. As a result, the number of feature vectors differs from one audio file to another, depending on the duration of the file. This is problematic for many machine learning approaches, as the input needs to have a fixed length. This is traditionally addressed by averaging the features over time. However, averaging is only one way of statistically representing data. In order to better represent the set of acoustic features of an uttered phrase, here we would also like to use the standard deviation (`std`), in addition to the average (`mean`). Here is a summary of your tasks for this exercise:

- Here we will work with the [python-speech-features](https://python-speech-features.readthedocs.io/en/latest/) package. You may use `logfbank` and `mfcc`, which are already imported, and pass one of them as `func`. You can also pass `nfilt=40` through `**kwargs` to set the number of filters used by these functions. See the `python-speech-features` documentation for more detail.
- Write a function that gets the directory of the processed wav files as input (see above), as well as a feature extraction function, which can be `logfbank` or `mfcc`
- This function should calculate the mean and std of each feature set calculated for each file.
- The output should be a dictionary that contains the basenames of the files as keys, and the features as values, for example, {"01-N-1-1": [2.68227271 3.66441116 2.60504873 ...], ...}
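
The pooling step described above can be sketched without any audio at all. The example below uses hypothetical frame-level features (lists of numbers standing in for the output of `logfbank` or `mfcc`) and plain Python; `pool_mean_std` is an illustrative name. It uses the population standard deviation, which is also what `np.std` computes by default.

```python
from statistics import fmean, pstdev

def pool_mean_std(frames):
    """Collapse a (n_frames x n_feats) list of frames into one vector of length 2 * n_feats."""
    columns = list(zip(*frames))  # one tuple per feature dimension
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std over the time axis
    return means + stds

# Two hypothetical utterances with different durations (3 and 5 frames), 2 features per frame
short_utt = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
long_utt = [[0.0, 4.0], [0.0, 6.0], [2.0, 4.0], [2.0, 6.0], [1.0, 5.0]]

print(pool_mean_std(short_utt))
print(pool_mean_std(long_utt))
```

Whatever the number of frames, the pooled vector always has twice as many values as there are features per frame, which is what makes it usable as a fixed-length input.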



In [8]:
def extract_features(wavs_dir:str, func:object, **kwargs) -> dict:
    """
    This function extracts features from audio wav files.
    Inputs:
        wavs_dir: the directory where the processed 16-bit and 16khz wav files exist
        func: the function that can extract different features
        **kwargs: to pass extra options for the `func` input
    output:
        a dictionary that contains the basenames of the files as keys, and the features as values
        for example, {"01-N-1-1": [2.68227271 3.66441116 2.60504873 ...], ...}
    """
    output = {}
    wav_paths = glob.glob(os.path.join(wavs_dir, "**", "*.wav"), recursive=True)
    for i, wav_path in enumerate(wav_paths):
        printProgressBar(i + 1, len(wav_paths), prefix = 'extracting features:', suffix = 'complete', length=50)
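        (rate, sig) = wav.read(wav_path)
        file_name = os.path.splitext(os.path.basename(wav_path))[0]
        # Pass the file's actual sample rate instead of relying on the 16 kHz default of `func`
        feats_all = func(sig, samplerate=rate, **kwargs)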
        means = np.mean(feats_all, 0)
        std = np.std(feats_all, 0)
        feats = np.concatenate((means, std))
        output[file_name] = feats
    return output

feats_fbank = extract_features("./cafe_p", logfbank, nfilt=40)
feats_mfcc  = extract_features("./cafe_p", mfcc, nfilt=40)

extracting features: |██████████████████████████████████████████████████| 100.0% complete
extracting features: |██████████████████████████████████████████████████| 100.0% complete


#### Defining targets (Exercise 3)

Now that we have extracted features, we also need targets in order to train and evaluate an ML model in a supervised manner. Datasets usually come with some documentation describing the data and how it can be used. Here, you can look at the `Readme.txt` file inside the directory of the `CaFE` dataset. As you will see, it gives the gender and the emotional expression for each file. In the code section below, we would like to write two functions that provide a numerical representation of the emotion and gender information for each file. The numerical representation can simply be an integer assigned to each class (for example, angry=0, disgust=1, etc., or female=0, male=1).

In [9]:
def get_emo_tar(file_names:list) -> dict:
    """
    This function gets a list of file_names based on CaFE dataset, and outputs numerical targets for machine learning
    Here, the output targets represent different emotional expressions
    Inputs:
        file_names: the list of file names of the CaFE dataset
    Output:
        a dictionary that contains the basenames of the files as keys, and the targets as values
        for example, {"01-N-1-1": 3, '11-S-2-5': 5, ...}
    """
    tars = {}
    emos = ["C", "D", "J", "N", "P", "S", "T"]
    for file_name in file_names:
        tar = emos.index(file_name[3])
        tars[file_name] = tar
    return tars

def get_gen_tar(file_names:list) -> dict:
    """
    This function gets a list of file_names based on CaFE dataset, and outputs numerical targets for machine learning
    Here, the output targets represent different genders
    Inputs:
        file_names: the list of file names of the CaFE dataset
    Output:
        a dictionary that contains the basenames of the files as keys, and the targets as values
        for example, {"01-N-1-1": 1, '08-S-2-6': 0, ...}
    """
    tars = {}
    for file_name in file_names:
        actor_num = file_name[:2]
        rem = int(actor_num) % 2
        tar = 0 if rem == 0 else 1
        tars[file_name] = tar
    return tars

cafe_ids = list(feats_fbank.keys())
emo_tars = get_emo_tar(cafe_ids)
gender_tars  = get_gen_tar(cafe_ids)
print("example of file ids:\n", cafe_ids[0])
print("example of emotion targets:\n", emo_tars[cafe_ids[0]])
print("example of gender targets:\n", gender_tars[cafe_ids[0]])

example of file ids:
 07-S-1-1
example of emotion targets:
 5
example of gender targets:
 1


## Training and testing machine learning models

Now that we have features and targets for both our acoustic and textual datasets, it's time to train and test some ML models. Here, we will work with the `SVM` and `MLP` models provided by the `sklearn` Python package. We will also learn how to optimise such models using the development set, before testing them on the test set and evaluating the results. The evaluation, however, requires a specific metric. In what follows, you will define two metrics to be used for evaluating the ML models after training.


### Defining metrics (Exercise 4)

Here, as the task is classification, we will work with `accuracy` and `Unweighted Average Recall (UAR)`. You are most probably already familiar with accuracy; however, as a metric it can be limited at times (why this is the case is asked in the question below). `Unweighted Average Recall` addresses some of these limitations by first calculating the recall of each target class (the fraction of examples of that class that are predicted correctly), and then averaging these per-class recalls with equal weight.

To learn more about `UAR`, you can look at [this powerpoint](https://ibug.doc.ic.ac.uk/media/uploads/documents/ml-lecture3-2014.pdf), [this article](https://ogunlao.github.io/blog/2021/04/24/consider_uar_accuracy.html), or [the sklearn metric here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html). However, if you decide to use the sklearn package, be careful with its input parameters.

In [10]:
def accuracy(outs:list, tars:list) -> float:
    """
    Calculating and returning the accuracy between outputs (`outs`) and targets (`tars`),
    where each one is a list of integers like [1,0,1,2,2,1,0,3], with each integer indicating the target label
    """
    accuracy = np.mean([out==tar for out, tar in zip(outs, tars)])
    return accuracy

def UAR(outs:list, tars:list) -> float:
    """
    Calculating and returning the unweighted average recall between outputs (`outs`) and targets (`tars`),
    where each one is a list of integers like [1,0,1,2,2,1,0,3], with each integer indicating the target label
    """
    tarsSet = list(set(tars))
    corrects = {}
    totals = {}
    for i in tarsSet:
        corrects[i] = 0
        totals[i] = 0
    for i, out in enumerate(outs):
        tar = tars[i]
        totals[tar] += 1
        if out == tar: corrects[tar] += 1 # Counting when target and output match per each class 
    uar = 0
    for i in tarsSet:
        uar += corrects[i] / totals[i] # Calculating the accuracy per each class 
    uar = uar / len(tarsSet) # Calculating the average of accuracies per each class 
    return uar

outputs = [1,1,1,1,0,1,1,1]
targets = [0,0,1,1,1,1,1,1]
print("Accuracy example:", accuracy(outputs, targets))
print("UAR example:", UAR(outputs, targets))

Accuracy example: 0.625
UAR example: 0.4166666666666667
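
Written as a formula, with $K$ target classes, $n_k$ examples whose target is class $k$, and $c_k$ of those examples predicted correctly, the `UAR` function above computes:

$\mathrm{UAR} = \frac{1}{K} \sum_{k=1}^{K} \frac{c_k}{n_k}$

For the example above, class `0` gives $c_0 / n_0 = 0/2$ and class `1` gives $c_1 / n_1 = 5/6$, so $\mathrm{UAR} = \frac{1}{2}\left(0 + \frac{5}{6}\right) \approx 0.417$, whereas the accuracy is $5/8 = 0.625$.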


#### Question 3

- Where do you think it is more useful to use UAR to evaluate a model rather than accuracy? and why?

### Machine learning of textual features

Now that we've defined our metrics, let's quickly train a model on some of our data to see what can be done and how. Different ML models can be used to map a set of features to a set of classes or labels. Here we want to use `SVC` from the `sklearn` package to do `SVM` classification. SVMs are one of the best-known ML methods, conceived about 30 years ago, but still in use today. Basic SVMs learn to separate two classes by a hyperplane in the space of features. This hyperplane is trained to be the optimal separator of the classes, that is, the one with the largest margin to the closest training examples (see figure below). In the first formulations of SVMs this separator is linear, but our input features are not always linearly separable. Therefore, a mathematical kernel mechanism was soon introduced, which implicitly maps the input features into a higher-dimensional space where they are easier to separate with a linear hyperplane.

For more information on how `SVM`s work, see [its wikipedia page](https://en.wikipedia.org/wiki/Support_vector_machine) and [this MIT video](https://www.youtube.com/watch?v=_PwhiWxHK8o).

![SVM-wiki](https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/SVM_margin.png/600px-SVM_margin.png)
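
To give some intuition for why a higher-dimensional space can help, here is a minimal sketch with hypothetical one-dimensional data. No single threshold on $x$ separates the two classes, but after the explicit mapping $x \mapsto (x, x^2)$ a linear rule does. Kernel SVMs obtain a similar effect implicitly, without computing the mapped features; the data, the hand-picked decision rule and the function names are illustrative only, and nothing is trained here.

```python
# Hypothetical 1-D data: class 1 sits between the two groups of class 0
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]

def threshold_separates(values, labels):
    """Check whether a single threshold on a 1-D value splits the two classes perfectly."""
    pairs = sorted(zip(values, labels))
    sorted_labels = [label for _, label in pairs]
    # Separable by a threshold only if the sorted labels change value at most once
    changes = sum(a != b for a, b in zip(sorted_labels, sorted_labels[1:]))
    return changes <= 1

def feature_map(x):
    """Explicit mapping of a 1-D input into two dimensions."""
    return (x, x * x)

def linear_rule(point):
    """A hand-picked linear decision rule in the mapped space: class 1 when x^2 - 1 < 0."""
    _, x_squared = point
    return 1 if x_squared - 1.0 < 0 else 0

print("separable in 1-D:", threshold_separates(xs, ys))
preds = [linear_rule(feature_map(x)) for x in xs]
print("separable after mapping:", preds == ys)
```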

Run the script below to train an `SVM` sentiment classifier on a portion of the training features. The script then predicts the sentiment classes for a portion of the development data and evaluates the predictions with the two metrics discussed above, `accuracy` and `UAR`.

In [11]:
clf = SVC(C=10.0, kernel='linear')
clf.fit(feats_train_text[:300], tars_train_text[:300])
preds = clf.predict(feats_dev_text[:300])
print("Accuracy:", accuracy(preds, tars_dev_text[:300]))
print("UAR:", UAR(preds, tars_dev_text[:300]))

Accuracy: 0.81
UAR: 0.8133414932680538


#### Question 4

- Try changing the kernel in the code above from `linear` to `rbf`, 
    - Which results were better? and why do you think that is? 
    - What is the difference between the `rbf` kernel and the `linear` kernel of SVM? (this question requires some research)

### Hyper-parameter optimisation (Exercise 5)

You can take a look at the documentation of `SVC` [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). As you might notice, there are several different options you can tweak in order to get better results. This can be thought of as an optimisation problem. In the code below, we would like to write a function that optimises `C` (the regularisation parameter, sometimes referred to as the SVM complexity) using the development set.

In [13]:
def get_best_model_svm(feats_train, tars_train, feats_dev, tars_dev, Cs):
    """
    This function gets features, and targets for training and development sets,
    then finds the best SVM model (`SVC`) for the parameter `C`, given a list of possible `C`s
    Here, the output is the best_c, and also the best trained model 
    By best, we mean the model that has the best accuracy on the development set
    Inputs:
        feats_train: the list of training features
        tars_train: the list of training targets
        feats_dev: the list of development features
        tars_dev: the list of development targets
        Cs: the list of possible C values
    Output:
        best_c: the C value resulting in the best accuracy on the development set
        best_model: the SVM model `SVC` resulting in the best accuracy on the development set
    """
    best_accuracy = 0
    best_c = Cs[0]
    best_model = None
    for C in Cs:
        clf = SVC(C=C, kernel='rbf')
        clf.fit(feats_train, tars_train)
        preds = clf.predict(feats_dev)
        acc = accuracy(preds, tars_dev)
        print(C, acc)
        if acc > best_accuracy:
            best_accuracy = acc
            best_c = C
            best_model = clf
    return best_c, best_model

Cs = [0.01, 0.1, 1, 10, 100]
best_c, best_model_text = get_best_model_svm(feats_train_text, tars_train_text, feats_dev_text, tars_dev_text, Cs)
    
preds = best_model_text.predict(feats_test_text)
print("best C is:", best_c)
print("Accuracy:", accuracy(preds, tars_test_text))
print("UAR:", UAR(preds, tars_test_text))

0.01 0.5703324808184144
0.1 0.5703324808184144
1 0.9156010230179028
10 0.9156010230179028
100 0.9156010230179028
best C is: 1
Accuracy: 0.9002557544757033
UAR: 0.9001831501831502


#### Question 5

- How do you think changing the value of `C` for `SVC` affects the training of the model when calling the `fit` function?

### Testing trained model with custom inputs

In [39]:
custom_inputs = [
                 "c'est nul à chier",
                 "c'est pas intéressant",
                 "c'est très intéressant",
                 "wow c'est cool",
                 "je me suis bien amusé avec ce TP",
                 "c'est pas cool ça!!!",
                ]
feats = vectorizer.transform(custom_inputs).toarray()
preds = best_model_text.predict(feats)

for p, pred in enumerate(preds):
    label = "Positive" if pred == 1 else "Negative"
    print(custom_inputs[p], "->", label)

c'est nul à chier -> Negative
c'est pas intéressant -> Negative
c'est très intéressant -> Positive
wow c'est cool -> Positive
je me suis bien amusé avec ce TP -> Positive
c'est pas cool ça!!! -> Negative


### Machine learning of acoustic features

Above, we trained an ML model on textual features and optimised it on the development set. The same is done here, but for the acoustic features.

### Partitioning

The textual dataset used above had predefined `train`, `dev`, and `test` partitions. However, this is not always the case. For example, the `CaFE` dataset does not provide such a partitioning. Thus, here we will define a partitioning of our own, with about 70 percent of the data as the training set, 15 percent as the development set and another 15 percent as the test set. We would also like these partitions to be balanced in terms of gender, and no speaker to appear in more than one partition.

Run the script below to do the partitioning.

In [None]:
def get_partitions(cafe_ids, train_ids=(), dev_ids=(), test_ids=()):
    train_keys, dev_keys, test_keys = [], [], []
    for cafe_id in cafe_ids:
        actor_num = cafe_id[:2]
        if actor_num in train_ids: train_keys.append(cafe_id)
        if actor_num in dev_ids: dev_keys.append(cafe_id)
        if actor_num in test_ids: test_keys.append(cafe_id)
    return train_keys, dev_keys, test_keys

train_ids = ["01", "02", "03", "04", "05", "06", "07", "08"]
dev_ids   = ["09", "10"]
test_ids  = ["11", "12"]
train_keys, dev_keys, test_keys = get_partitions(cafe_ids, train_ids, dev_ids, test_ids)
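
The two constraints stated above (no speaker shared between partitions, gender balance) can be checked directly from the actor numbers. The sketch below repeats the three lists of actor numbers so that it runs on its own, and assumes the same convention as `get_gen_tar` (odd and even actor numbers correspond to different genders); `check_partitions` is an illustrative helper.

```python
train_ids = ["01", "02", "03", "04", "05", "06", "07", "08"]
dev_ids = ["09", "10"]
test_ids = ["11", "12"]

def check_partitions(partitions):
    """Return (speakers_disjoint, gender_balanced) for a dict of partition name -> actor ids."""
    all_ids = [actor for ids in partitions.values() for actor in ids]
    speakers_disjoint = len(all_ids) == len(set(all_ids))  # no actor appears twice
    gender_balanced = all(
        sum(int(actor) % 2 for actor in ids) * 2 == len(ids)  # as many odd as even numbers
        for ids in partitions.values()
    )
    return speakers_disjoint, gender_balanced

partitions = {"train": train_ids, "dev": dev_ids, "test": test_ids}
disjoint, balanced = check_partitions(partitions)
print("speaker-independent:", disjoint)
print("gender-balanced:", balanced)
for name, ids in partitions.items():
    print(name, f"{len(ids) / 12:.0%} of the actors")
```

The printed percentages are computed over actors, not over files, so they only match the share of the data if every actor contributes the same number of recordings.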

### Training an MLP classifier

Here, instead of the SVM, we train an `MLP` classifier. A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN), and the term MLP often loosely means any feedforward ANN ([taken from its wiki here](https://en.wikipedia.org/wiki/Multilayer_perceptron)). ANNs are a type of machine learning technique loosely based on the concept of biological neural networks in the human brain. Each artificial neuron, like a biological neuron with its synapses and axon, can be connected to other neurons to send or receive information. Artificial neurons are usually grouped together into what are called neural layers. The most basic form of artificial neural layer is the fully connected layer, where every neuron of one layer is connected to every neuron of the next layer. To describe mathematically how a fully connected layer works, we can consider the input of the layer to be a numerical vector that is transformed into a different vector through a matrix multiplication, usually followed by a non-linear function. This process can be written as follows:

$y = h(Wx + b)$

Where $x$ is the input vector, $W$ is the weight matrix, $b$ is the "bias" vector, which offsets the result of the matrix multiplication, $h(.)$ is usually a non-linear function such as the hyperbolic tangent or the sigmoid, and $y$ is the output vector. An `MLP`, which stacks several fully connected layers, can be depicted as below:

![MLP-fig](https://www.researchgate.net/profile/Mohamed-Zahran-16/publication/303875065/figure/fig4/AS:371118507610123@1465492955561/A-hypothetical-example-of-Multilayer-Perceptron-Network.png)
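
A minimal sketch of the computation $y = h(Wx + b)$ for one fully connected layer, in plain Python with $h = \tanh$. The weights, biases and input are hypothetical values chosen by hand; in a trained network they are learned from data.

```python
import math

def dense_layer(x, W, b, h=math.tanh):
    """Compute y = h(Wx + b) for one fully connected layer.

    x: input vector of length n_in
    W: weight matrix as a list of n_out rows, each of length n_in
    b: bias vector of length n_out
    """
    return [h(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

x = [1.0, -2.0, 0.5]              # 3 input features
W = [[0.2, -0.1, 0.4],            # 2 output neurons, each connected to all 3 inputs
     [-0.3, 0.0, 0.1]]
b = [0.0, 0.25]

hidden = dense_layer(x, W, b)                        # first layer: 3 inputs -> 2 neurons
output = dense_layer(hidden, [[1.0, -1.0]], [0.0])   # second layer: 2 inputs -> 1 neuron
print(hidden)
print(output)
```

Feeding the output of one layer into the next, as in the last two calls, is all that "multilayer" means here.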

Similar to `SVM`, the `sklearn` package also provides an easy-to-use [MLP classifier (click here for the documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). In what follows, we will see how it can be used to train and evaluate a model:

In [None]:
feats = feats_fbank
tars  = gender_tars

feats_train = [feats[key] for key in train_keys]
tars_train  = [tars[key] for key in train_keys]
feats_dev = [feats[key] for key in dev_keys]
tars_dev  = [tars[key] for key in dev_keys]
feats_test = [feats[key] for key in test_keys]
tars_test  = [tars[key] for key in test_keys]

#clf = SVC(C=10.0, kernel='rbf')
clf = MLPClassifier(random_state=1, max_iter=500, hidden_layer_sizes=(256,))
clf.fit(feats_train, tars_train)
preds = clf.predict(feats_test)
print("Accuracy:", accuracy(preds, tars_test))
print("UAR:", UAR(preds, tars_test))

#### Question 6

- Here, the `accuracy` is the same as `UAR`, why do you think that is?

### Hyper-parameter optimisation (Exercise 6)

Similar to the optimisation of `C` for the `SVM` above, here we want to optimise the number of hidden layers and the number of nodes in each. After writing the function below, you can choose the set of hidden-layer configurations you want to optimise over.

In [None]:
def get_best_model_mlp(feats_train, tars_train, feats_dev, tars_dev, hiddens):
    """
    This function gets features, and targets for training and development sets,
    then finds the best MLP model for the hidden layer
    Here, the output is the best hidden layer values, and also the best trained model 
    By best, we mean the model that has the best accuracy on the development set
    Inputs:
        feats_train: the list of training features
        tars_train: the list of training targets
        feats_dev: the list of development features
        tars_dev: the list of development targets
        hiddens: the list of possible hidden layers values
    Output:
        best_hidden: the hidden layers value resulting in the best accuracy on the development set
        best_model: the MLP model resulting in the best accuracy on the development set
    """
    best_accuracy = 0
    best_model = None
    best_hidden = hiddens[0]
    for h, hidden in enumerate(hiddens):
        clf = MLPClassifier(random_state=0, max_iter=500, hidden_layer_sizes=hidden)
        clf.fit(feats_train, tars_train)
        preds = clf.predict(feats_dev)
        acc = accuracy(preds, tars_dev)
        print(hidden, acc)
        if acc > best_accuracy:
            best_accuracy = acc
            best_hidden = hidden
            best_model = clf
    return best_hidden, best_model

# Change the `hiddens` value to include more options for hidden layers optimisation
# Note the trailing comma: `(32,)` is a one-element tuple, whereas `(32)` is just the integer 32
hiddens = [(32,), (64,), (100,), (128, 64), (256, 128, 64)]
best_hidden, best_model = get_best_model_mlp(feats_train, tars_train, feats_dev, tars_dev, hiddens)
#best_c, best_model = get_best_model_svm(feats_train, tars_train, feats_dev, tars_dev, [0.1,1,10,100])
    
preds = best_model.predict(feats_test)
print("best_hidden:", best_hidden)
print("Accuracy:", accuracy(preds, tars_test))
print("UAR:", UAR(preds, tars_test))
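
When choosing candidate values for `hiddens`, it can help to know how many trainable parameters each configuration has. The sketch below counts the weights and biases of a fully connected network. It assumes 80 input features (mean and std of 40 filter-bank values, as extracted above) and a single output unit; both numbers are assumptions to adapt to your own features and task, and `count_parameters` is an illustrative helper, not part of `sklearn`.

```python
def count_parameters(n_in, hidden, n_out):
    """Count the weights and biases of an MLP with the given layer sizes."""
    sizes = [n_in, *hidden, n_out]
    # Each layer has fan_in * fan_out weights plus fan_out biases
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))

n_in, n_out = 80, 1  # assumed sizes, see the text above
for hidden in [(32,), (64,), (100,), (128, 64), (256, 128, 64)]:
    print(hidden, count_parameters(n_in, hidden, n_out))
```

Comparing these counts with the number of training examples gives a rough idea of which configurations are likely to be too small or too large for the dataset.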

#### Question 7

- Try to justify your set of choices for `hiddens`. Why did you choose that set of hidden values for optimisation?

- Try replacing `get_best_model_mlp` with `get_best_model_svm`. What do you observe? (report the results here and compare them)

- Go back to the previous block of code and change the target `tars` to be `emo_tars` instead of `gender_tars`. Then, run the same code above for `get_best_model_mlp` and `get_best_model_svm`, report the results here, and compare them to the results of `tars=gender_tars`

## Summary

In this practical work, you have learnt the basics of machine learning processes for tasks related to acoustic signals and text. You learnt how to load different acoustic and textual datasets into memory, process them, partition them if necessary, and extract effective features from them. You then learnt how to use these features with two well-established machine learning techniques (SVMs and MLPs) to train models that predict the sentiment of a text, or the gender and emotion of a speaker. You also learnt how to optimise the hyper-parameters of a machine learning model and how to evaluate it after training. The knowledge gained in this practical work can be applied to a large number of acoustic tasks (such as speech recognition, speaker recognition, emotion recognition, language identification) and a large number of textual tasks (such as text classification, sentiment analysis, and topic modeling).