In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
from embedding_api import EmbeddingAPI
from classifier_utilty import create_classifier, train_classifier, test_classifier
from data_utils import load_split

from tqdm import tqdm

# Setup Data and API

In [4]:
train_audios, train_labels = load_split("splits/embedding_train_split.txt")

100%|██████████| 599/599 [00:18<00:00, 32.41it/s]


In [5]:
test_audios, test_labels = load_split("splits/classifier_test_split.txt")

100%|██████████| 200/200 [00:05<00:00, 33.62it/s]


In [6]:
api = EmbeddingAPI()

In [7]:
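def get_embeddings(audios, api, model, embed_type="scene", aggregation="mean"):
    """Collect the embeddings for every entry in `audios` into one flat list."""
    if embed_type not in ("scene", "timestamp"):
        raise ValueError(f"Unknown embed_type: {embed_type!r}")

    embeddings = []
    for audio in tqdm(audios):
        if embed_type == "scene":
            embedding = api.get_scene_embeddings(audio, model, aggregation=aggregation).numpy()
        else:
            # The second return value is not needed here.
            embedding, _ = api.get_timestamp_embeddings(audio, model)
            embedding = embedding.numpy()
            # Flatten everything after the first axis into a single vector per item.
            batch_size = embedding.shape[0]
            embedding = embedding.reshape(batch_size, -1)
        embeddings.extend(embedding)

    return embeddings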

# Training Info
The models that require pretraining were pretrained on the GTZAN Genre Classification dataset. Since the same dataset is used to evaluate the linear classifier, a hold-out split was created that is never seen during pretraining.  
The remaining data was divided into a train and a test split for training the DL models. The same training split was then used to train the linear classifier, with the extracted embeddings as input.
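
The splitting scheme described above can be sketched as follows. This is only an illustration: the function name `three_way_split`, the fractions, the seed, and the clip names are all hypothetical and are not taken from the actual split files.

```python
import random

def three_way_split(items, holdout_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint train / test / hold-out parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_holdout = int(len(items) * holdout_frac)
    n_test = int(len(items) * test_frac)
    holdout = items[:n_holdout]                 # only used to evaluate the classifier
    test = items[n_holdout:n_holdout + n_test]  # used while pretraining the DL models
    train = items[n_holdout + n_test:]          # pretraining and classifier training
    return train, test, holdout

# Hypothetical clip names, for illustration only.
clips = [f"clip_{i:04d}.wav" for i in range(1000)]
train, test, holdout = three_way_split(clips)
print(len(train), len(test), len(holdout))  # 600 200 200
```

Because all three parts come from a single shuffled list, no clip can appear in more than one split.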

# MFCC Model

## Timestamp Embedding
At each step, MFCC features are calculated and used as the embedding for that frame.

In [8]:
model = api.load_model("saved_models/mfcc_model.ckpt")

In [9]:
train_embeddings = get_embeddings(train_audios, api, model, embed_type="timestamp")

100%|██████████| 599/599 [00:05<00:00, 113.33it/s]


In [10]:
test_embeddings = get_embeddings(test_audios, api, model, embed_type="timestamp")

100%|██████████| 200/200 [00:01<00:00, 107.08it/s]


In [11]:
classifier = create_classifier()
train_classifier(classifier, train_embeddings, train_labels)
test_classifier(classifier, test_embeddings, test_labels)

Accuracy: 0.49


## Scene Embedding
These are the timestamp embeddings averaged over the time-steps to produce a single representation of the entire sequence.
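
As a minimal sketch of this averaging, assuming the timestamp embeddings of one clip form a `(time_steps, features)` array (the helper name `mean_scene_embedding` and the toy values are made up; the real aggregation happens inside `api.get_scene_embeddings`):

```python
import numpy as np

def mean_scene_embedding(timestamp_embeddings: np.ndarray) -> np.ndarray:
    """Average a (time_steps, features) array over time into one (features,) vector."""
    return timestamp_embeddings.mean(axis=0)

# Toy example: 4 time-steps with 3 features each (made-up values).
frames = np.array([[1.0, 2.0, 3.0],
                   [3.0, 2.0, 1.0],
                   [0.0, 0.0, 0.0],
                   [4.0, 4.0, 4.0]])
scene = mean_scene_embedding(frames)
print(scene.shape)  # (3,)
print(scene)        # [2. 2. 2.]
```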

In [12]:
train_embeddings = get_embeddings(train_audios, api, model)

100%|██████████| 599/599 [00:04<00:00, 121.34it/s]


In [13]:
test_embeddings = get_embeddings(test_audios, api, model)

100%|██████████| 200/200 [00:01<00:00, 117.97it/s]


In [14]:
classifier = create_classifier()
train_classifier(classifier, train_embeddings, train_labels)
test_classifier(classifier, test_embeddings, test_labels)

Accuracy: 0.585


The scene embeddings give better results (0.585 vs. 0.49). One possible reason is that the flattened timestamp embeddings make for a much larger classifier input (211264 values) and therefore carry more information that is irrelevant to the task.
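
To make the size difference concrete, here is a small sketch. The shape of 3301 frames by 64 features is an assumption: it is one factorization consistent with 211264 and with the sequence length of more than 3000 time-steps mentioned later, but the actual shape is not stated in this notebook.

```python
import numpy as np

# Assumed shape: 3301 frames x 64 features = 211264 values per clip.
n_frames, n_features = 3301, 64
clip = np.zeros((n_frames, n_features), dtype=np.float32)

flattened = clip.reshape(-1)  # what the timestamp variant feeds to the classifier
averaged = clip.mean(axis=0)  # what the mean-aggregated scene variant feeds to it

print(flattened.shape)  # (211264,)
print(averaged.shape)   # (64,)
```

With only 599 training clips, a linear classifier on the flattened input has far more input features than examples, which makes it easier to fit noise.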

# LSTM Model

## Timestamp Embeddings
The MFCC features are fed to the LSTM model, and the hidden state it produces at each time-step is used as the embedding.

In [15]:
model = api.load_model("saved_models/lstm_model.ckpt")

In [16]:
train_embeddings = get_embeddings(train_audios, api, model, embed_type="timestamp")

100%|██████████| 599/599 [00:08<00:00, 72.84it/s]


In [17]:
test_embeddings = get_embeddings(test_audios, api, model, embed_type="timestamp")

100%|██████████| 200/200 [00:02<00:00, 69.00it/s]


In [18]:
classifier = create_classifier()
train_classifier(classifier, train_embeddings, train_labels)
test_classifier(classifier, test_embeddings, test_labels)

Accuracy: 0.5


## Scene Embeddings
Since LSTMs aggregate information as they progress, the final hidden state is used as the scene embedding.

In [19]:
train_embeddings = get_embeddings(train_audios, api, model, embed_type="scene", aggregation="last")

100%|██████████| 599/599 [00:08<00:00, 73.28it/s]


In [20]:
test_embeddings = get_embeddings(test_audios, api, model, embed_type="scene", aggregation="last")

100%|██████████| 200/200 [00:02<00:00, 71.46it/s]


In [21]:
classifier = create_classifier()
train_classifier(classifier, train_embeddings, train_labels)
test_classifier(classifier, test_embeddings, test_labels)

Accuracy: 0.045


Although the timestamp embeddings from the LSTM performed comparably to the MFCC ones (0.5 vs. 0.49), the scene embeddings perform very poorly: 0.045 is below the roughly 0.1 expected from random guessing over GTZAN's ten genres. A likely cause is the very long sequences (>3000 time-steps), over which the information may get diluted so that the last state retains little that is useful.

## Scene Embeddings (Mean Aggregation)
Since averaging over the time-steps gave decent results for the MFCC features, it is worth trying the same with the LSTM hidden states by taking their mean.
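
The two aggregation modes can be compared with a small sketch. The helper `aggregate` is hypothetical and only mirrors the `aggregation` argument used in this notebook; the internals of `api.get_scene_embeddings` are not shown here, and the hidden states below are random placeholders.

```python
import numpy as np

def aggregate(hidden_states: np.ndarray, aggregation: str = "mean") -> np.ndarray:
    """Reduce (time_steps, features) hidden states to a single (features,) vector."""
    if aggregation == "last":
        return hidden_states[-1]           # only the final time-step contributes
    if aggregation == "mean":
        return hidden_states.mean(axis=0)  # every time-step contributes equally
    raise ValueError(f"Unknown aggregation: {aggregation!r}")

rng = np.random.default_rng(0)
hidden = rng.standard_normal((3301, 16))  # placeholder hidden states, not model output

last_vec = aggregate(hidden, "last")
mean_vec = aggregate(hidden, "mean")
print(last_vec.shape, mean_vec.shape)  # (16,) (16,)
```

Both modes yield a vector of the same size, so the linear classifier is unchanged; only the amount of the sequence that influences the vector differs.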

In [22]:
train_embeddings = get_embeddings(train_audios, api, model, embed_type="scene", aggregation="mean")

100%|██████████| 599/599 [00:07<00:00, 78.01it/s]


In [23]:
test_embeddings = get_embeddings(test_audios, api, model, embed_type="scene", aggregation="mean")

100%|██████████| 200/200 [00:02<00:00, 76.95it/s]


In [24]:
classifier = create_classifier()
train_classifier(classifier, train_embeddings, train_labels)
test_classifier(classifier, test_embeddings, test_labels)

Accuracy: 0.275


At 0.275, this is a clear improvement over the last-state aggregation, though still well below the MFCC scene embeddings (0.585).

One possible conclusion is that useful information does not reach the final hidden state. We can use this observation to improve the pretraining step.

# LSTM Model (Mean Training)

## Timestamp Embeddings
These are computed in the same way as for the previous model.

In [25]:
model = api.load_model("saved_models/lstm_model_mean.ckpt")

In [26]:
train_embeddings = get_embeddings(train_audios, api, model, embed_type="timestamp")

100%|██████████| 599/599 [00:07<00:00, 76.57it/s]


In [27]:
test_embeddings = get_embeddings(test_audios, api, model, embed_type="timestamp")

100%|██████████| 200/200 [00:02<00:00, 77.34it/s]


In [28]:
classifier = create_classifier()
train_classifier(classifier, train_embeddings, train_labels)
test_classifier(classifier, test_embeddings, test_labels)

Accuracy: 0.565


At 0.565, this exceeds the MFCC timestamp embeddings (0.49) and comes close to the MFCC scene embeddings (0.585). Training on the mean of the hidden states appears to free the DL model from having to carry everything through to the end of the sequence, letting it learn more relevant information.

## Scene Embeddings
The hidden states are averaged over the time-steps to produce a combined representation.

In [29]:
train_embeddings = get_embeddings(train_audios, api, model, embed_type="scene", aggregation="mean")

100%|██████████| 599/599 [00:07<00:00, 79.54it/s]


In [30]:
test_embeddings = get_embeddings(test_audios, api, model, embed_type="scene", aggregation="mean")

100%|██████████| 200/200 [00:02<00:00, 71.47it/s]


In [31]:
classifier = create_classifier()
train_classifier(classifier, train_embeddings, train_labels)
test_classifier(classifier, test_embeddings, test_labels)

Accuracy: 0.28


At 0.28 versus 0.275, this is no meaningful improvement over the mean-aggregated scene embeddings of the previous model.

# Final Comments 
It is clear that there is room for improvement in how the timestamp embeddings are aggregated into a more compact scene embedding.  
One avenue to explore is an attention mechanism, which would provide a more sophisticated way of assembling the information into a single vector.  
Additionally, a different input for the pretraining task could allow the model to learn better. This could mean learning from spectrograms or from the raw waveform directly, but each comes with its own set of challenges.
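
As a rough sketch of what attention-based aggregation could look like, the snippet below pools hidden states with softmax weights from a scaled dot product against a query vector. Nothing here is part of the existing models: `attention_pool` is a hypothetical helper, the hidden states are random placeholders, and the query would have to be learned during pretraining rather than sampled.

```python
import numpy as np

def attention_pool(hidden_states: np.ndarray, query: np.ndarray):
    """Pool (time_steps, features) hidden states into one (features,) vector.

    Each time-step gets a weight from a softmax over its scaled dot product
    with `query`; the result is the weighted sum of the hidden states.
    """
    scores = hidden_states @ query / np.sqrt(hidden_states.shape[1])
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ hidden_states, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal((3000, 16))  # placeholder hidden states
query = rng.standard_normal(16)           # would be learned in practice

pooled, weights = attention_pool(hidden, query)
print(pooled.shape, weights.shape)  # (16,) (3000,)
```

With an all-zero query the weights are uniform and the result reduces to the plain mean, so mean aggregation is a special case of this scheme.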