# 3SA: Semantic Search for Speeches in Audio

#### Abstract:

Semantic search is the ability to search for documents by understanding the overall meaning of the query rather than relying on simple keyword matches. Recent breakthroughs in NLP such as BERT, ALBERT, and RoBERTa have paved the way for powerful semantic search engines. Most of these search algorithms, however, focus on textual information: both the documents and the query are in natural language. In this project, we aim to develop a semantic search algorithm for arbitrary objects (objects that are not in natural language), specifically speech in audio files, by leveraging advanced NLP techniques. We introduce 3SA, Semantic Search for Speeches in Audio, which enables semantic search over audio files. We run our experiments on the LibriSpeech dataset and evaluate the search results using basic information retrieval metrics.

#### Contents:

   1. [Gather Data](#data)
   2. [Transforming Transcripts to Embeddings](#transform)
   3. [Data Generator](#datagen)
   4. [Model Building and Training](#model)
   5. [Evaluation](#evaluate)
   6. [Analysis](#analysis)
   
#### Files created in this notebook: [Download here](https://drive.google.com/open?id=1yguzkmUV0Z62l7Dze_sh10Xnw7yroy90)

#### Project flow:

![alt text](model_arch.jpg "Project overview")

In [30]:
# Import all the required packages

from datautils import *
from label_prep import *
from model_arch1 import Model_1
from model import Model_2

import pandas as pd
import numpy as np

%matplotlib notebook
from IPython.utils import io
import plotly.graph_objects as go
from scipy import signal
import soundfile as sound

from keras import optimizers
from keras.models import load_model

import tensorflow as tf
import tensorboard as tb
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile
from torch.utils.tensorboard import SummaryWriter

import nmslib
from IPython.display import Audio
import  random
from googletrans import Translator

<a id='data'></a>
## 1. Gather Data


We perform our experiments on the LibriSpeech dataset. You can either download the data manually and extract the files, or simply run the cell below. The function calls PyTorch's built-in LibriSpeech dataset, which downloads the data for you. Make sure the extracted LibriSpeech folder is placed in the 'data' directory. The function is implemented in the __datautils.py__ file.

__Warning__: We have commented out the function call because running it downloads ~6.5GB of data. Make sure you have a high-bandwidth internet connection before you uncomment and run it.

In [2]:
# Create required folders

!mkdir data
!mkdir tmp

data_path = 'data' # folder to save data

# Folder to save models and output files
# All the files saved here are also available in the download link provided above
tmp_path = 'tmp'

# Download librispeech data. Stored in data_path by default

# train_data_model2, dev_data_model2, test_data_model2 = get_data()

Once the data is downloaded, we create map files for the dataset: CSV files that store the location of each audio file and its corresponding utterance (transcript). Loading the whole dataset into RAM is expensive, so we read the CSV files instead and process the data batch by batch.
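As a rough illustration of what building such a map file involves, the sketch below walks a LibriSpeech-style folder and writes one row per utterance. It is not the `create_map` implementation from `datautils.py`; `build_map` is a hypothetical name. It assumes the standard LibriSpeech layout (one `*.trans.txt` file per chapter folder, each line holding an utterance ID followed by the transcript, with the audio stored as `<ID>.flac` in the same folder) and assumes transcripts are lowercased.

```python
import csv
import os


def build_map(input_path, output_path):
    """Write a CSV with 'location' and 'utterance' columns (hypothetical helper)."""
    rows = []
    for folder, _, files in sorted(os.walk(input_path)):
        for name in sorted(files):
            if not name.endswith('.trans.txt'):
                continue
            with open(os.path.join(folder, name), encoding='utf-8') as handle:
                for line in handle:
                    line = line.strip()
                    if not line:
                        continue
                    # Each line is "<utterance-id> <transcript>"
                    utt_id, _, text = line.partition(' ')
                    location = os.path.join(folder, utt_id + '.flac')
                    rows.append((location, text.lower()))  # lowercasing is an assumption

    with open(output_path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['location', 'utterance'])
        writer.writerows(rows)
    return len(rows)
```

Only file paths and text are kept in memory here; the audio itself is read later, one batch at a time.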

In [6]:
# Create map files for the dataset

# Train set
input_path = data_path +'/LibriSpeech/train-clean-100' # Path of LibriSpeech train folder
output_path = data_path + '/train.csv' # Save path

create_map(input_path,output_path)

# Dev set
input_path = data_path +'/LibriSpeech/dev-clean' # Path of LibriSpeech dev folder
output_path = data_path + '/dev.csv' # Save path

create_map(input_path,output_path)

# Test set
input_path = data_path +'/LibriSpeech/test-clean' # Path of LibriSpeech test folder
output_path = data_path + '/test.csv' # Save path

create_map(input_path,output_path)

100%|██████████| 251/251 [01:12<00:00,  3.46it/s]


CSV file saved at:  data/train.csv


100%|██████████| 40/40 [00:06<00:00,  6.64it/s]


CSV file saved at:  data/dev.csv


100%|██████████| 40/40 [00:05<00:00,  6.90it/s]

CSV file saved at:  data/test.csv





You can read the CSV files to explore the data. Each CSV file has 'location' and 'utterance' columns. If you called the get_data function earlier, you can explore those datasets as well. The function returns indexable datasets in which each item is a tuple of the form: waveform, sample_rate, utterance, speaker_id, chapter_id, and utterance_id.

In [45]:
train  = pd.read_csv('data/train.csv')
test  = pd.read_csv('data/test.csv')
dev  = pd.read_csv('data/dev.csv')



print('Transcript - ',train.iloc[0,2]) # Utterance
Audio(filename = train.iloc[0,1]) # File location

# If you run the get_data function

# print('Transcript - ', train_data_model2[0][2])
# Audio(train_data_model2[0][0],rate = train_data_model2[0][1])

Transcript -  the night of the sixteenth to the seventeenth of february eighteen thirty three was a blessed night above its shadows heaven stood open it was the wedding night of marius and cosette the day had been adorable


<a id='transform'></a>
## 2. Transforming Transcripts to Embeddings

The goal is to learn sentence embeddings of the transcripts from the audio files. To do so, we follow a supervised approach: we first convert all our transcripts to vectors, which then act as labels.

To transform the transcripts, we use the Smooth Inverse Frequency (SIF) technique, as described in our report. SIF takes pre-trained word vectors and computes a weighted average of the words in a sentence to produce a sentence vector. The approach is efficient and has given good results on the STS benchmark dataset.

We apply SIF to all our transcripts using fastText pre-trained word vectors. The fast-sentence-embeddings package does this efficiently; for convenience, we also built a wrapper class, __SIF_embeddings__. We recommend referring to the __label_prep.py__ file and [this](https://github.com/oborchers/Fast_Sentence_Embeddings) library for more information.
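To make the weighting concrete, here is a minimal NumPy sketch of SIF. It is not the code used by the fast-sentence-embeddings package or by `SIF_embeddings`; `sif_sentence_vectors` is a hypothetical name, word probabilities are estimated from the input sentences themselves, and the smoothing constant `a = 1e-3` is an assumed default. Each word is weighted by `a / (a + p(w))`, the weighted vectors are averaged, and the projection onto the first principal component of all sentence vectors is then removed.

```python
from collections import Counter

import numpy as np


def sif_sentence_vectors(sentences, word_vectors, a=1e-3):
    """Sketch of SIF: weighted average of word vectors, minus the common component."""
    tokens = [sentence.split() for sentence in sentences]
    counts = Counter(word for sent in tokens for word in sent)
    total = sum(counts.values())
    dim = len(next(iter(word_vectors.values())))

    emb = np.zeros((len(tokens), dim))
    for i, sent in enumerate(tokens):
        known = [w for w in sent if w in word_vectors]
        if not known:
            continue  # leave an all-zero vector for out-of-vocabulary sentences
        # Frequent words get small weights, rare words get weights close to 1
        weights = np.array([a / (a + counts[w] / total) for w in known])
        vectors = np.array([word_vectors[w] for w in known])
        emb[i] = weights @ vectors / len(known)

    # Remove the projection onto the first principal component
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    first_pc = vt[0]
    return emb - np.outer(emb @ first_pc, first_pc)
```

Here `word_vectors` is any mapping from a word to a 1-D array, for example a dictionary built from the fastText `.vec` file.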

The link to the pre-trained vectors is commented out in the cell below. Once you download the vectors and pass their path to the SIF model, the model needs to be trained on all the transcripts. For convenience, we have already trained the model, so we simply load it.

__Warning__: Both the fastText vectors and the trained pickle files are approximately 3GB, so we recommend downloading the files beforehand from the link provided above.

In [6]:
# Train a sentence embedding model on all the transcripts

# Download and unzip Fast-text word embeddings

# !wget -O tmp/wordvectors.zip 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip'
# !unzip tmp/wordvectors.zip -d tmp/

# sif_model = SIF_embeddings(tmp_path+'/crawl-300d-2M.vec')

# l1 = train['utterance'].tolist() 
# l2 = dev['utterance'].tolist() 
# l3 = test['utterance'].tolist() 

# sif_model.train(l1+l2+l3)

# sif_model.model.save(tmp_path+'/sentence_embedding.pickle') # The model saves a pickle file and a npy file. Both are required when loading the model


# Or load the trained model

sif_model = SIF_embeddings(tmp_path+'/sentence_embedding.pickle')


<a id='datagen'></a>
## 3. Data Generator

Next, we build a data generator class for our CSV files. Its purpose is to generate features and labels on the fly rather than fitting all the data in RAM. To invoke it, we pass all the map files to the class along with batch_size and the trained SIF model. The max_train_len argument limits the number of data points used for training, which speeds up training. For our experiment, we limit the training set to 5000 files. The dev and test sets are limited to 10% of the training data, i.e. 500 files each. The class is implemented in the __datautils.py__ file.
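The idea of generating batches on the fly can be sketched as follows. This is not the `LibriSpeech` class from `datautils.py`; `batch_generator`, `featurize`, and `embed` are hypothetical names, and zero-padding every spectrogram to the longest one in its batch is an assumption about how variable-length audio is handled.

```python
import numpy as np


def batch_generator(rows, featurize, embed, batch_size=10):
    """Yield (features, labels) batches forever from (location, utterance) rows."""
    while True:  # Keras-style generators are expected to loop indefinitely
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            feats = [featurize(location) for location, _ in batch]

            # Zero-pad every spectrogram to the longest one in this batch
            max_len = max(f.shape[0] for f in feats)
            x = np.zeros((len(batch), max_len, feats[0].shape[1]), dtype=np.float32)
            for i, f in enumerate(feats):
                x[i, :f.shape[0]] = f

            # One sentence vector per transcript acts as the label
            y = np.stack([embed(utterance) for _, utterance in batch])
            yield x, y
```

Only one batch of audio is held in memory at a time, which is the point of reading from the map files.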

In [46]:

# Define batch size
batch_size = 10

# Invoke data generator class
data = LibriSpeech(train = 'data/train.csv',\
                   dev = 'data/dev.csv',\
                   test = 'data/test.csv',
                   se = sif_model,\
                   batch_size = batch_size,\
                   max_train_len = 5000) # Limit train size to 5000

print('Length of training set - ',len(data.train))
print('Length of dev set - ',len(data.dev))
print('Length of test set - ',len(data.test))


Length of training set -  5000
Length of dev set -  500
Length of test set -  500


In [49]:
features, labels = next(data.gen_train()) # Generate sample features and labels from the train set
print(features.shape)
print(labels.shape)

(10, 1607, 161)
(10, 300)


Since our batch size is 10, the generator yields 10 features and 10 labels per batch. The labels are 300-dimensional sentence vectors generated by the SIF model. The features are spectrograms: the audio signal is split into short overlapping windows, a Fast Fourier transform is applied to each window, and the result shows how the energy in each frequency band changes over time. We use spectrograms because the signal at a single time step carries little information, whereas a spectrogram captures much more of the structure of the audio. Additional information is provided in our report. A sample spectrogram is plotted below for reference.
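The last dimension of the feature shape, 161, follows from the window size: LibriSpeech audio is sampled at 16 kHz, so a 20 ms window holds 320 samples, and a real FFT of 320 samples has 320 / 2 + 1 = 161 frequency bins. The NumPy-only sketch below illustrates this framing. `log_spectrogram_frames` is a hypothetical name, and its scaling and window differ slightly from `scipy.signal.spectrogram`, so the values will not match the notebook's features exactly.

```python
import numpy as np


def log_spectrogram_frames(audio, sample_rate, window_ms=20, step_ms=10, eps=1e-10):
    """Log power spectrogram with shape (time frames, frequency bins)."""
    nperseg = int(round(window_ms * sample_rate / 1e3))  # samples per window
    step = int(round(step_ms * sample_rate / 1e3))       # hop between windows
    n_frames = 1 + (len(audio) - nperseg) // step

    window = np.hanning(nperseg)
    frames = np.stack([audio[i * step:i * step + nperseg] * window
                       for i in range(n_frames)])

    # rfft keeps only the non-negative frequencies: nperseg // 2 + 1 bins
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)
```

For a one-second signal at 16 kHz this returns an array of shape (99, 161); longer utterances only add time frames.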

In [27]:

samples, sample_rate = sound.read(train.iloc[3,1])


def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

freqs, times, spectrogram = log_specgram(samples, sample_rate)

fig = go.Figure(data=[go.Surface(z=spectrogram.T, x=times, y=freqs)])
fig.update_traces(contours_z=dict(show=True, usecolormap=True,
                                  highlightcolor="limegreen", project_z=True))

fig.update_layout(title='Spectrogram values for the sample audio file', autosize=False,
                  scene = dict(xaxis_title='time',yaxis_title='frequency',zaxis_title='spectrogram'),
                  scene_camera_eye=dict(x=1.87, y=0.88, z=-0.64),
                  width=600, height=600,
                  margin=dict(l=65, r=50, b=65, t=90))

fig.show()

# The Plotly figure may not render when the notebook is reloaded, so a static picture is included for reference


![alt text](spectrogram.png "Spectrogram")

<a id='model'></a>
## 4. Model Building and Training

We experimented with two architectures. The first uses a pre-trained ASR model coupled with a few dense layers, and comes in a short and a long version. The short version did not give the desired results, so we discarded it. The second architecture did not give the desired results either, so its training cell is commented out.

The first architecture is implemented in Keras, in the __model_arch1.py__ file. It requires a pre-trained ASR model; the link is provided in the cell. We first compile the model and then call the train function, passing the data generator object to it.
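The training cell compiles the model with Keras's `cosine_proximity` loss, which is the negative cosine similarity between the predicted and target vectors: minimizing it pushes the predicted embedding to point in the same direction as the SIF label. The NumPy function below is a sketch of that quantity for illustration, not the Keras source; the `eps` guard against zero-length vectors is an added assumption.

```python
import numpy as np


def cosine_proximity(y_true, y_pred, eps=1e-12):
    """Negative cosine similarity per sample; -1 means perfectly aligned."""
    true_norm = np.maximum(np.linalg.norm(y_true, axis=-1, keepdims=True), eps)
    pred_norm = np.maximum(np.linalg.norm(y_pred, axis=-1, keepdims=True), eps)
    return -np.sum((y_true / true_norm) * (y_pred / pred_norm), axis=-1)
```

Because both vectors are normalized, the loss ignores their lengths, which matches the cosine-similarity search used later for retrieval.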

The second architecture consists of bidirectional GRUs coupled with dense layers. This model is implemented in PyTorch, in the __model.py__ file. The idea is to take the hidden output of the GRU as a representation of the audio spectrogram and pass it through dense layers to map it into the sentence vector space. Unfortunately, this model did not give good results, and due to time and resource constraints we did not explore this architecture further.

In [None]:
# Model architecture 1


# Build model from scratch using a pretrained ASR model

# !wget -O tmp/model_end.h5 'https://github.com/soheillll/Automatic-Speech-Recognition/raw/master/results/model_end.h5'

# Invoke model by passing the pre-trained ASR model
# model = Model_1(asr_path = 'tmp/model_end.h5', mtype = 'long')

# Or load trained model

# Invoke model by passing previously trained model
model = Model_1('tmp/model.h5','tmp/model_wts.h5')

# Compile model with Nadam optimizer and cosine proximity as loss function
model.compile(optimizer=optimizers.Nadam(lr=0.002),loss='cosine_proximity')

# Train the model
model.train(data = data,\
            epochs = 10,\
            save_path = tmp_path)



In [None]:
# Model architecture 2

# from model import Model_2

# audio_preprocessor = spectrogrm_preprocessor()

# Create new model
# model = Model_2(sentence_embedder = sif_model,\
#                 audio_preporcessor = audio_preprocessor)

# Load the trained model
# model = Model_2(sentence_embedder = sif_model,\
#                 audio_preporcessor = audio_preprocessor,\
#                 load = True, load_path = 'tmp/model.tar')

# model.train(train_data_model2,\
#             epochs = 10,\
#             save_path = tmp_path,\
#             save_freq = 2)

<a id='evaluate'></a>
## 5. Evaluation

Our evaluation involves several steps. First, we load our model and predict sentence embeddings for all the audio files in the test set. To evaluate semantic searchability, we then sample 200 files from the test set and paraphrase their transcripts. We project these paraphrased sentences into the vector space using the SIF model, so that the audio files and the paraphrased transcripts share the same embedding space. If the nearest neighbors of a paraphrased transcript include the corresponding audio file, the model has achieved its purpose for that query. (The formulation is provided in our report for reference.)

In [19]:

# Load the model
model = load_model('tmp/model.h5')
model.load_weights('tmp/model_wts.h5')


test_set = data.test # test-set

# predict sentence embeddings for all the audio files
predictions = np.zeros((len(test_set),300))

for i in range(len(test_set)):
    spectrogram = data.featurize(test_set.iloc[i,1])
    out = model.predict(np.expand_dims(spectrogram, axis=0)).squeeze(0)
    predictions[i,:] = out
    
np.save('tmp/predictions.npy',predictions)


In [None]:

# Paraphrase the transcripts using google translator

df = data.test.sample(n=200) # Sample 200 files

# Requires an internet connection (googletrans queries Google Translate)
from googletrans import Translator

translator = Translator()
df['paraphrased'] = None
for index, row in df.iterrows():
    # Back-translation: English -> German -> English yields a paraphrase
    de_translations = translator.translate(row['utterance'].lower(), dest='de')
    paraphrase_translations = translator.translate(de_translations.text, dest='en')
    # Assign through .loc on the frame itself; chained indexing may write to a copy
    df.loc[index, 'paraphrased'] = paraphrase_translations.text

df.to_csv('tmp/evaluation_set.csv')


# Generate sentence embeddings using SIF model for the paraphrased transcripts
# We also store the index of the corresponding audio file in the test set

eval_set = pd.read_csv('tmp/evaluation_set.csv')
parasemb = np.zeros([len(eval_set),301])

for i in range(len(eval_set)):
    # Select the paraphrase by column name rather than by position
    emb = sif_model(eval_set['paraphrased'].iloc[i]).squeeze(0)
    parasemb[i,0] = float(eval_set.iloc[i,0]) # Index of the audio file in the test set
    parasemb[i,1:] = emb
    
np.save('tmp/paraphrase_embs.npy',parasemb)

We define a small class to search files in the embedding space, using the NMSLIB package. The library provides fast nearest-neighbor search over data points in n-dimensional space. We create an index on the predictions and then, for each paraphrased transcript, search for audio files. We retrieve 1, 3, 5, and 10 neighbors for each paraphrase and expect to find the corresponding audio file among them.
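The metric itself does not depend on NMSLIB. As a point of comparison, the sketch below computes the same hit rate with exact, brute-force cosine similarity in NumPy, which is feasible for a few hundred vectors. `hit_rate_at_k` is a hypothetical helper and is not part of the notebook's code; since HNSW is an approximate index, its results may differ slightly from this exact search.

```python
import numpy as np


def hit_rate_at_k(queries, targets, corpus, k=3):
    """Fraction of queries whose target row of `corpus` is among the k nearest."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    queries_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)

    sims = queries_n @ corpus_n.T               # cosine similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]    # k most similar rows per query
    hits = (top_k == np.asarray(targets)[:, None]).any(axis=1)
    return float(hits.mean())
```

With the arrays from this notebook, `queries` would be the paraphrase embeddings without their first column, `targets` that first column cast to integers, and `corpus` the predicted audio embeddings.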

In [114]:
class search_files:
    
    def __init__(self, load_path = None, sif_model = None,
                 audio_files = None, predictions = None):
        
        if load_path:
            # Load previously created index on predictions
            self.index = nmslib.init(method = 'hnsw', space = 'cosinesimil')
            self.index.loadIndex(load_path)
            
        else:
            # Create new index on the predictions
            self.index = nmslib.init(method = 'hnsw', space = 'cosinesimil')
            self.index.addDataPointBatch(predictions)
            self.index.createIndex({'post': 2}, print_progress=True)
            self.index.saveIndex('tmp/search_index.nmslib')
            
        self.data = audio_files
        self.featurize = sif_model

    def search(self, query, k = 3):
        # Search audio files for paraphrase transcripts
        inp = self.featurize(query).squeeze(0)
        res,dist = self.index.knnQuery(inp, k=k)
    
        for i in res:
            print('+++++++++++++++++')
            print('File '+str(i)+':\n')
            display(Audio(filename = self.data.iloc[i,1])) # display() (built into IPython sessions) renders the widget inside a loop
            print('Utterance: ',self.data.iloc[i,2])
            
    def evaluate(self,paraphrase_embs,k=3):
        # Evaluate the paraphrase transcripts
        tp = 0
        
        # for each paraphrased transcript
        for i in range(len(paraphrase_embs)):
            r,d = self.index.knnQuery(paraphrase_embs[i][1:],k=k) # fetch k neighbors for it
            if int(paraphrase_embs[i][0]) in r:
                tp+=1 # if the corresponding audio file is in the neighbors increase the count
        return tp/len(paraphrase_embs) # fetched / total_length
        

In [118]:
sf = search_files(load_path = 'tmp/search_index.nmslib', sif_model=sif_model, audio_files = test_set)

**Before evaluating the model, let's see what kind of search it enables.**

In [116]:
sf.search('what a fine lad')

+++++++++++++++++
File 89:

Utterance:  and yet what a fine gallant lad
+++++++++++++++++
File 412:

Utterance:  oh well sir what about him
+++++++++++++++++
File 135:

Utterance:  he impressed me as being a perfectly honest man


In [117]:
sf.search('test time')

+++++++++++++++++
File 369:

Utterance:  i doubt whether branwell was maintaining himself at this time
+++++++++++++++++
File 127:

Utterance:  to morrow is the examination
+++++++++++++++++
File 269:

Utterance:  every chance she could steal after practice hours were over and after the clamorous demands of the boys upon her time were fully satisfied was seized to fly on the wings of the wind to the flowers


The cells above illustrate how the model facilitates this kind of search. When we pose a query, the search class returns the relevant audio files. Because our model maps the audio files into the natural language embedding space, we can fetch them using ordinary English text. In the second example, for the search phrase *'test time'*, we were able to fetch semantically related audio files, such as one about an examination, even though no result contains the word 'test'.

__Note__: The above examples were cherry picked to demonstrate the capability and usability of a 3SA model. The audio widget may not be displayed sometimes due to an issue with the nbformat library.

Now, we evaluate our model as mentioned earlier.

In [121]:
for k in [1,3,5,10]: # number of neighbors
    print('k = '+str(k)+' acc. - '+ str(sf.evaluate(parasemb,k)))

k = 1 acc. - 0.365
k = 3 acc. - 0.37
k = 5 acc. - 0.38
k = 10 acc. - 0.39


As we mentioned earlier, we search the space for 1,3,5, and 10 neighbors and we print the scores.

__What does it mean?__

Consider k = 5. The accuracy score is 38%, which means that for only 76 of the 200 paraphrased transcripts the corresponding audio file appeared among the 5 nearest neighbors. The same interpretation applies to the other values of k. The accuracy is low, and it barely improves as k grows from 1 (36.5%) to 10 (39%).
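The scores can be turned back into counts with a little arithmetic, using the accuracies printed above and the 200 sampled paraphrases. The snippet below is only a convenience calculation; the variable names are illustrative.

```python
n_queries = 200
accuracy = {1: 0.365, 3: 0.37, 5: 0.38, 10: 0.39}

# Number of paraphrases whose audio file was found among the k neighbors
hits = {k: round(acc * n_queries) for k, acc in accuracy.items()}

for k, count in hits.items():
    print('k = %d: %d of %d paraphrases matched' % (k, count, n_queries))
```

Going from 1 to 10 neighbors recovers only 5 additional files (73 versus 78), which suggests that when the correct audio file is not the nearest neighbor, it is usually not close by at all.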

<a id='analysis'></a>
## 6. Analysis

We plot the vectors predicted by our model alongside the paraphrased transcript vectors (given by the SIF model). Since the vectors are 300-dimensional, we use TensorBoard's embedding projector to project these points into a 3D space. The application computes the PCA components itself in order to project them.
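For readers without TensorBoard, a PCA projection to three dimensions can be sketched directly in NumPy. `pca_project` is a hypothetical helper, and the projector's own preprocessing may differ in detail from this plain, centered PCA.

```python
import numpy as np


def pca_project(vectors, n_components=3):
    """Project row vectors onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

Applied to the stacked array of audio and sentence vectors built in the next cell, this would return one 3-D point per vector.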

In [39]:

preds = np.load('tmp/predictions.npy')
paraembs = np.load('tmp/paraphrase_embs.npy')

metadata = []
vecs = np.zeros((400,300))
c = 0
n = 0
for i in range(200):
    v = paraembs[i]
    
    metadata.append(['audio_'+str(n), str(n)])
    vecs[c,:] = preds[int(v[0])]
    
    metadata.append(['sentence_'+str(n), str(n)])
    vecs[c+1,:] = v[1:]
    
    c+=2
    n+=1


writer = SummaryWriter('tmp/runs2')
writer.add_embedding(vecs, metadata, metadata_header= ['file_type','idx'])
writer.close()


In [42]:
%reload_ext tensorboard
%tensorboard --logdir=tmp/runs2

# The TensorBoard view might be lost when reloading the notebook, so we attached a screenshot

![alt text](tensorb.png "Embeddings")

**audio_{i}** represents the embedding predicted by the model. **sentence_{i}** is the sentence embedding generated by the SIF model for the paraphrased transcript, and is shown in the same color as the corresponding audio_{i}. As we can see from the plot, some pairs lie close together. For instance, audio_148 at the top right (in orange) is close to sentence_148, which means that our model mapped this file successfully. In the image above, we filtered on the audio_14 file, and the projector displays all of its nearby neighbors. The color there is black, indicating that other files overlap at this location. After inspecting the files, we believe we found some insights into why and where the model might be failing.

- The way words are pronounced differs from person to person. Some people speak fast while others take their time. A change in pronunciation might completely change the phonemes and the resulting spectrogram, which affects the way the model learns. In addition, words like 'two' and 'too' might have similar spectrograms.

- On top of this, most transcripts are fragments of a sentence rather than whole sentences. Because such fragments lack subjects and grammatical structure, it is hard to capture their semantic meaning.

- Considering only the last hidden layer output as the sentence representation might not be working as expected. Capturing the representation across all timesteps might have given better scores.
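The last point can be made concrete with a small NumPy sketch that contrasts the two pooling strategies on a batch of recurrent outputs. The array layout `(batch, time, hidden)` and the helper names `last_step` and `mean_pool` are assumptions for illustration; this is not code from either model, and we did not test whether mean pooling actually improves the scores.

```python
import numpy as np


def last_step(hidden, lengths):
    """Representation from the final valid timestep of each sequence."""
    idx = np.asarray(lengths) - 1
    return hidden[np.arange(hidden.shape[0]), idx]


def mean_pool(hidden, lengths):
    """Average over all valid timesteps, ignoring zero-padded ones."""
    steps = np.arange(hidden.shape[1])[None, :]
    mask = (steps < np.asarray(lengths)[:, None]).astype(hidden.dtype)
    summed = (hidden * mask[:, :, None]).sum(axis=1)
    return summed / mask.sum(axis=1, keepdims=True)
```

Mean pooling lets every part of the utterance contribute to the sentence vector, whereas the last step has to compress the whole utterance into a single state.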

__Takeaways__

Although our model gives low scores, it works to some extent. We could improve it by using a better pre-trained ASR model; the one we used is very basic. With a state-of-the-art ASR model that is more robust to these variations in spectrograms (such as 'too' and 'two'), we believe this approach could work. Alternatively, we could use a text-to-speech model and represent the transcripts in an audio embedding space rather than the other way around. This would eliminate the problem of spectrogram variation, since we would control how the audio is generated.