## Speech Recognition


In [None]:
import numpy as np
from pydub import AudioSegment
import random
import sys
import io
import os
import glob
import IPython
import matplotlib.pyplot as plt
from scipy.io import wavfile
%matplotlib inline

# Used to standardize volume of audio clip
def match_target_amplitude(sound, target_dBFS):
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

# 1 - Creating a dataset for audio samples

Let's start by building a dataset for training our trigger word detection model. A speech dataset should ideally be as close as possible to the application we want to run it on. In our case, we'd like to detect the word "activate" in working environments (libraries, homes, offices, open spaces, ...). We therefore need recordings that mix positive words ("activate") and negative words (random words other than "activate") over different background sounds. Let's create a representative dataset for our task.

## 1.1 - Listening to the data   

In the files provided for this lab, we have a number of recordings of background sounds in different places, such as libraries, cafes, restaurants, homes and offices, as well as audio snippets of people saying positive and negative words in different accents.

These recordings can be found in the `raw_data` directory, where we have a number of raw audio files of the positive words, negative words, and background noise. We will use these audio files to synthesize a dataset to train the model.

- The "activate" directory contains positive examples of people saying the word "activate" (one word per audio recording). 
- The "negatives" directory contains negative examples of people saying random words other than "activate" (one word per audio recording). 
- The "backgrounds" directory contains 10 second clips of background noise in different environments.



Run the cells below to listen to some examples using `IPython.display.Audio(PATH)`.

In [None]:
# Listen to some activates

In [None]:
# Listen to some negatives

In [None]:
# Listen to some background noise

You will use these three types of recordings (positives/negatives/backgrounds) to create a labelled dataset.

## 1.2 - From audio recordings to spectrograms

The provided recordings are sampled at 44100 Hz. This means the microphone gives us 44100 numbers per second. Thus, a 10 second audio clip is represented by 441000 numbers (= $10 \times 44100$). 

It is quite difficult to figure out from this "raw" representation of audio whether the word "activate" was said. To help our sequence model learn to detect trigger words more easily, we will compute a *spectrogram* of the audio. A spectrogram is computed by sliding a window over the raw audio signal and calculating the most active frequencies in each window using a Fourier transform.

Let's create a function to convert raw audio to spectrograms. We will use these hyperparameters:
- Length of each window segment `nfft = 200`
- Sampling frequencies `fs = 8000`
- Overlap between windows `noverlap = 120`
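
To make the sliding-window idea concrete, here is a minimal NumPy sketch of a spectrogram computed by hand. It is only an illustration: `plt.specgram` applies its own scaling and defaults, and the helper name `naive_spectrogram` and the 1000 Hz test tone are assumptions made for this example.

```python
import numpy as np

def naive_spectrogram(signal, nfft=200, noverlap=120):
    """Magnitude spectrogram: one column of frequency magnitudes per window."""
    step = nfft - noverlap                      # hop between window starts
    window = np.hanning(nfft)                   # taper to reduce spectral leakage
    starts = range(0, len(signal) - nfft + 1, step)
    frames = np.array([signal[s:s + nfft] * window for s in starts])
    # rfft keeps the nfft // 2 + 1 non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (n_freq, n_windows)

fs = 8000                                       # sampling frequency in Hz
t = np.arange(fs) / fs                          # one second of timestamps
tone = np.sin(2 * np.pi * 1000 * t)             # hypothetical 1000 Hz test tone
spec = naive_spectrogram(tone)
peak_bins = spec.argmax(axis=0)                 # most active frequency bin per window
peak_hz = peak_bins * fs / 200                  # bin index -> frequency in Hz
print(spec.shape, peak_hz[0])
```

Each column of `spec` describes one window, and for a pure tone the most active bin sits at the tone's frequency in every window.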

In [None]:
# Use plt.specgram with the hyperparameters listed above (nfft, fs, noverlap)
# If the audio data has two channels, use only one of them, since they are equal
def graph_spectrogram(wav_file):
    _, data = wavfile.read(wav_file)
    ## .....
    pxx, freqs, bins, im = plt.specgram()
    return pxx

In [None]:
IPython.display.Audio("audio_examples/example_train.wav")

In [None]:
x = graph_spectrogram("audio_examples/example_train.wav")

The graph above represents how active each frequency is (y axis) over a number of time-steps (x axis). 

The dimension of the output spectrogram depends upon the hyperparameters of the spectrogram and the length of the input. In this notebook, we will be working with 10 second audio clips as the "standard length" for our training examples. The number of timesteps of the spectrogram will be 5511.
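
The figure of 5511 time steps can be checked with a little arithmetic. The sketch below assumes windows are only taken where they fit entirely inside the signal (no padding at the edges) and that the spectrogram is one-sided; the helper name `spectrogram_shape` is hypothetical.

```python
def spectrogram_shape(n_samples, nfft=200, noverlap=120):
    """Return (n_freq, n_windows) for a one-sided spectrogram of a real signal."""
    step = nfft - noverlap                       # samples between window starts
    n_windows = (n_samples - noverlap) // step   # windows that fit entirely in the signal
    n_freq = nfft // 2 + 1                       # non-negative frequency bins
    return n_freq, n_windows

n_freq_check, n_windows_check = spectrogram_shape(10 * 44100)
print(n_freq_check, n_windows_check)  # 101 5511
```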


In [None]:
audio_file = "audio_examples/example_train.wav"
_, data = wavfile.read(audio_file)
x = graph_spectrogram(audio_file)
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)

Given the results above, we can define:

In [None]:
Tx = # The number of time steps input to the model from the spectrogram
n_freq =  # Size of each input at each time step

So, starting from 10 seconds of audio sampled at 44100 Hz (raw audio), we transformed a 1D signal of size 441000 into a 2D spectrogram of size `[Tx, n_freq]`.

So the key values are:

- $441000$ (number of raw audio samples)
- $5511 = T_x$ (spectrogram output, and dimension of input to RNN). 
- $10000$ (used by the `pydub` module to synthesize audio) 
- $1375 = T_y$ (the number of steps in the output of the RNN). 

Each of these representations corresponds to exactly 10 seconds of time.

In our case, we will output $T_y = 1375$ predictions for each 10s input, so every $10/1375 \approx 0.0073$ seconds we will predict whether someone recently finished saying "activate".

Consider also the 10000 number above. This corresponds to discretizing the 10sec clip into 10/10000 = 0.001 second intervals. 0.001 seconds is also called 1 millisecond, or 1ms. So when we say we are discretizing according to 1ms intervals, it means we are using 10,000 steps.
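
Since the same 10 seconds are indexed in several ways, it helps to have a conversion between them. The sketch below maps a pydub time in milliseconds to an output step; the helper name `ms_to_output_step` is an assumption, and truncating with `int` is one reasonable rounding choice among others.

```python
CLIP_MS = 10_000     # pydub steps in a 10 s clip (1 ms each)
N_OUTPUT_STEPS = 1375  # output steps of the model (T_y)

def ms_to_output_step(time_ms, clip_ms=CLIP_MS, n_steps=N_OUTPUT_STEPS):
    """Map a time in milliseconds to the index of the output step containing it."""
    return int(time_ms * n_steps / clip_ms)

step_seconds = 10 / N_OUTPUT_STEPS
print(round(step_seconds, 4))       # 0.0073
print(ms_to_output_step(4251))      # 584
print(ms_to_output_step(9700))      # 1333
```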


In [None]:
Ty =  # The number of time steps in the output of our model

## 1.3 - Generating a single training example

Because speech data is hard to acquire and label, we will synthesize the training set from the audio clips of activates, negatives, and backgrounds. Recording lots of 10 second audio clips with random "activates" in them is quite slow, so instead we use separate audio samples of positive words, negative words, and background noise. To create a single training example, we will:

- Pick a random 10 second background audio clip
- Randomly insert 0-4 audio clips of "activate" into this 10sec clip
- Randomly insert 0-2 audio clips of negative words into this 10sec clip

Because we know exactly where we added each "activate" clip, we can create the labels at the same time. For this we will use the pydub package to manipulate audio. Pydub converts raw audio files into lists of Pydub data structures. Pydub uses 1ms as the discretization interval, which is why a 10sec clip is always represented using 10,000 steps.

Now let's first load the samples we have in `raw_data`. Please complete the following function:

In [None]:
# Load raw audio files for speech synthesis
def load_raw_audio():
    activates = []
    backgrounds = []
    negatives = []
    # Use AudioSegment.from_wav(wav_file_path)
    return activates, negatives, backgrounds

In [None]:
# Load audio segments using pydub 
activates, negatives, backgrounds = load_raw_audio()

print("background len: " + str(len(backgrounds[0])))    # Should be 10,000, since it is a 10 sec clip
print("activate[0] len: " + str(len(activates[0])))     # Maybe around 1000, since an "activate" audio clip is usually around 1 sec (but varies a lot)
print("activate[1] len: " + str(len(activates[1])))     # Different "activate" clips can have different lengths 

**Overlaying positive/negative words on the background**:

So the objective is:
- Add 0-4 "activate" clips with no overlap between them,
- Add 0-2 negative clips with no overlap between them,
- The total length always stays equal to 10s,
- The labels $y^{\langle t \rangle}$, of size $T_y = 1375$, are all 0 at the start. Each time we add a new "activate" clip, we update the labels at the correct position for the correct number of steps (converting between the input step and the corresponding output step).

Here's a figure illustrating the labels $y^{\langle t \rangle}$ for a clip into which we have inserted "activate", "innocent", "activate", "baby". Note that the positive labels "1" are associated only with the positive words.

<img src="images/label_diagram.png" style="width:500px;height:200px;">

To create the training set, we will need to implement the following functions:
    
1. `get_random_time_segment(segment_ms)` gets a random time segment in our background audio
2. `is_overlapping(segment_time, previous_segments)` checks if a time segment overlaps with previously inserted segments
3. `insert_audio_clip(background, audio_clip, previous_segments)` inserts an audio segment at a random time in our background audio using `get_random_time_segment` and `is_overlapping`
4. `insert_ones(y, segment_end_ms)` inserts 1's into our label vector y after the word "activate"

In [None]:
def get_random_time_segment(segment_ms):
    """
    Returns a random time segment of duration `segment_ms` (in ms)
    into which an audio clip of that duration can be inserted.
    """
    
    return (segment_start, segment_end)
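
As a rough guide, one way to draw such a segment is sketched below. It uses the standard library `random` module rather than `np.random`, so seeded values will differ from the expected outputs quoted in this notebook, and the name `random_time_segment` and the inclusive-end convention are assumptions.

```python
import random

def random_time_segment(segment_ms, clip_ms=10_000, rng=random):
    """Pick a random (start, end) window of length segment_ms inside the clip.

    Both ends are inclusive, so the window covers the steps
    start, start + 1, ..., start + segment_ms - 1.
    """
    start = rng.randint(0, clip_ms - segment_ms)   # randint is inclusive on both ends
    end = start + segment_ms - 1
    return start, end

random.seed(0)
print(random_time_segment(916))
```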

Implement `is_overlapping(segment_time, previous_segments)` to check if a new time segment overlaps with any of the previous segments.

In [None]:
def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.
    Returns True if the time segment overlaps with any of the existing segments, False otherwise
    """

    return overlap

In [None]:
overlap1 = is_overlapping((950, 1430), [(2000, 2550), (260, 949)])
overlap2 = is_overlapping((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)])
print("Overlap 1 = ", overlap1) # Must be False
print("Overlap 2 = ", overlap2) # Must be True
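
The two expected results above imply that segment ends are inclusive: `(2305, 2950)` overlaps `(1900, 2305)` because both contain step 2305. A minimal sketch of a check with that behaviour follows; the name `segments_overlap` is hypothetical and this is only one way to write it.

```python
def segments_overlap(segment_time, previous_segments):
    """Return True if segment_time shares at least one time step with any previous segment."""
    start, end = segment_time
    return any(start <= prev_end and end >= prev_start
               for prev_start, prev_end in previous_segments)

print(segments_overlap((950, 1430), [(2000, 2550), (260, 949)]))                    # False
print(segments_overlap((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)]))    # True
```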


Implement `insert_audio_clip()` to overlay an audio clip onto the background 10sec clip.

In [None]:
def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the 
    audio segment does not overlap with existing segments.
    """
    
    new_background = background.overlay(--, position = --)
    
    return new_background, segment_time
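
The placement logic can be sketched without any audio: keep drawing random windows until one is free, then record it. The helper name `place_without_overlap` and the `max_tries` cap are assumptions added here to avoid an endless loop when the clip is already full. With pydub, the chosen start would then be passed as the `position` argument of `background.overlay(...)`.

```python
import random

def place_without_overlap(segment_ms, previous_segments, clip_ms=10_000, max_tries=1000):
    """Draw random windows until one is free; record and return it (or None on failure)."""
    for _ in range(max_tries):
        start = random.randint(0, clip_ms - segment_ms)
        end = start + segment_ms - 1
        # Free means the window lies entirely before or after every previous segment
        free = all(end < p_start or start > p_end for p_start, p_end in previous_segments)
        if free:
            previous_segments.append((start, end))
            return start, end
    return None

random.seed(1)
segments = [(3790, 4400)]
print(place_without_overlap(916, segments), segments)
```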

In [None]:
np.random.seed(5)
test_activate = AudioSegment.from_wav('./raw_data/activates/3_act2.wav')  # overlay needs AudioSegments, not paths
test_bg = AudioSegment.from_wav('./raw_data/backgrounds/2.wav')
audio_clip, segment_time = insert_audio_clip(test_bg, test_activate, [(3790, 4400)])
audio_clip.export("insert_test.wav", format="wav")
print("Segment Time: ", segment_time) # Must be (2254, 3169) if you use np.random
IPython.display.Audio("insert_test.wav")

Implement `insert_ones()`, which takes the labels (a vector of size 1375) and sets 50 ones starting at the correct position. Note that the segment end is given in milliseconds (the clip has 10,000 steps), so convert it to an output step first.

In [None]:
def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment
    should be set to 1 ("strictly" means the label at the end step itself stays 0).
    """
    
    return y

In [None]:
arr1 = insert_ones(np.zeros((1, Ty)), 9700)
plt.plot(insert_ones(arr1, 4251)[0,:])
print("sanity checks:", arr1[0][1333], arr1[0][634], arr1[0][635]) # Must be 0.0 1.0 0.0
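
For reference, here is a minimal sketch of a labelling rule that satisfies the sanity checks above. The name `label_after_segment` is hypothetical, and truncating the millisecond-to-step conversion with `int` is an assumption that happens to match the expected values.

```python
import numpy as np

def label_after_segment(y, segment_end_ms, n_ones=50, clip_ms=10_000):
    """Set the n_ones labels strictly after the segment end to 1 (in place)."""
    n_steps = y.shape[1]
    end_step = int(segment_end_ms * n_steps / clip_ms)   # ms -> output step
    # Slicing clips automatically at the end of the array
    y[0, end_step + 1 : end_step + 1 + n_ones] = 1
    return y

labels = label_after_segment(np.zeros((1, 1375)), 9700)
labels = label_after_segment(labels, 4251)
print(labels[0][1333], labels[0][634], labels[0][635])  # 0.0 1.0 0.0
```

Near the end of the clip fewer than 50 labels fit, so the run of ones is simply cut off at the last output step.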

Finally, you can use `insert_audio_clip` and `insert_ones` to create a new training example.

1. Initialize the label vector $y$ as a numpy array of zeros and shape $(1, T_y)$.
2. Initialize the set of existing segments to an empty list.
3. Randomly select 0 to 4 "activate" audio clips, and insert them onto the 10sec clip. Also insert labels at the correct position in the label vector $y$.
4. Randomly select 0 to 2 negative audio clips, and insert them into the 10sec clip. 


In [None]:
def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.
    """
    np.random.seed(18) # Setting the random seed
    background = background - 20 # Making background quieter

    #### Add your code here
    
    background = match_target_amplitude(background, -20.0) # Standardize the volume of the audio clip 
    file_handle = background.export("train" + ".wav", format="wav") # Export new training example 
    print("File (train.wav) was saved in your directory.")
    x = graph_spectrogram("train.wav") # Convert to spectrogram
    
    return x, y

In [None]:
x, y = create_training_example(backgrounds[0], activates, negatives)

Now you can listen to the training example you created and compare it to the spectrogram generated above.

In [None]:
IPython.display.Audio("train.wav")

In [None]:
print(f'Beginning of the first activate at: {np.where(y > 0)[1][0]}') # Must be 337
print(f'Beginning of the second activate at: {np.where(y > 0)[1][50]}') # Must be 522

Finally, you can plot the associated labels for the generated training example.

In [None]:
plt.plot(y[0]) # Must have two peaks, starting near the indices printed above

## 1.4 - Loading the train and val sets

We've now implemented the code needed to generate a single training example. The same approach was used to create the sets in `XY_train/` and `XY_val/`. Let's load them:

In [None]:
# Load preprocessed training examples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")
# Load preprocessed val set examples
X_val = np.load("./XY_val/X_val.npy")
Y_val = np.load("./XY_val/Y_val.npy")

# 2 - Model


In [None]:
from keras.callbacks import ModelCheckpoint
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam

## 2.1 - Build the model

Here is the architecture we will use. Take some time to look over the model and see if it makes sense. 
<img src="images/model.png" style="width:1000px;height:1000px;">

The model takes a 5511-step spectrogram as input. We first use a 1D convolution to go from Tx = 5511 to Ty = 1375 steps, and then use two recurrent layers to output the predictions.

Note that we use a uni-directional RNN rather than a bi-directional RNN. This is really important for trigger word detection, since we want to be able to detect the trigger word almost immediately after it is said. If we used a bi-directional RNN, we would have to wait for the whole 10sec of audio to be recorded before we could tell if "activate" was said in the first second of the audio clip.  

Implementing the model can be done in four steps:


- For the CONV layer, use `Conv1D()` with 196 filters and a filter size of 15 (`kernel_size=15`), and **find the correct stride to get an output of size 1375 from an input of size 5511**. [[See documentation.](https://keras.io/layers/convolutional/#conv1d)]

- For the two GRU layers, use `GRU()`. [[See documentation.](https://keras.io/layers/recurrent/#GRU)]

- Create a time-distributed dense layer as follows: `X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)`. This creates a dense layer followed by a sigmoid, so that the parameters used for the dense layer are the same for every time step and the output between 0 and 1. [[See documentation](https://keras.io/layers/wrappers/).]
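
To search for the stride, the output length of a 1D convolution can be computed directly. The sketch below assumes no padding (the Keras `Conv1D` default, `padding="valid"`); the helper name `conv1d_output_length` is hypothetical.

```python
def conv1d_output_length(n_in, kernel_size, stride):
    """Output length of a 1D convolution with no padding ("valid")."""
    return (n_in - kernel_size) // stride + 1

# Try a few candidate strides for Tx = 5511 and kernel_size = 15
for stride in range(1, 7):
    print(stride, conv1d_output_length(5511, 15, stride))
```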

Implement `model()`. The architecture is presented in Figure 3.

In [None]:
def model(input_shape):
    """
    Function creating the model's graph in Keras.
    Returns a Keras model instance
    """
    
    X_input = Input(...)

    # CONV layer
    # First GRU Layer
    # Second GRU Layer

    
    # Time-distributed dense layer
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed  (sigmoid)

    model = Model(inputs = X_input, outputs = X)
    
    return model  

In [None]:
model = model(input_shape = (Tx, n_freq))

Let's print the model summary to keep track of the shapes.

In [None]:
model.summary()

'''
We must have as output:
Total params: 522,561
Trainable params: 521,657
Non-trainable params: 904
'''

The output of the network is of shape (None, 1375, 1) while the input is (None, 5511, 101). The Conv1D layer has reduced the number of steps from 5511 in the spectrogram to 1375.
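
As a sanity check on the parameter totals, the arithmetic below reproduces them for one layer configuration. The configuration itself is an assumption, since the text only fixes the Conv1D settings and the final dense layer: GRU layers with 128 units, a batch-normalization layer after the convolution and after each GRU, and the classic GRU parameterization with a single bias vector per gate (newer Keras defaults add a second bias). If your summary differs, compare it layer by layer.

```python
n_freq, filters, kernel, units = 101, 196, 15, 128   # units=128 is an assumption

conv = kernel * n_freq * filters + filters            # Conv1D weights + biases

def gru_params(n_in, n_units):
    # 3 gates, each with input weights, recurrent weights and one bias vector
    return 3 * (n_units * (n_in + n_units) + n_units)

def batchnorm_params(n_features):
    # gamma and beta are trainable; moving mean and variance are not
    return 2 * n_features, 2 * n_features

gru1, gru2 = gru_params(filters, units), gru_params(units, units)
dense = units + 1                                     # Dense(1): weights + bias

bn = [batchnorm_params(n) for n in (filters, units, units)]
trainable = conv + gru1 + gru2 + dense + sum(t for t, _ in bn)
non_trainable = sum(nt for _, nt in bn)
print(trainable, non_trainable, trainable + non_trainable)  # 521657 904 522561
```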

## 2.2 - Fit the model

Trigger word detection takes a long time to train. To save time, we've already trained a model for about 3 hours on a GPU using the architecture you built above, and a large training set of about 4000 examples. Let's load the model. 

In [None]:
# Load the models/
model = ...

You can train the model further, using the Adam optimizer and binary cross entropy loss, as follows. This will run quickly because we are training just for one epoch and with a small training set of 26 examples. 

In [None]:
# Create an optimizer (use Adam), and pass the parameters (learning rate, momentum values and decay rate)
opt = ...
# Compile the model, use loss binary CE and the metric as accuracy
model.compile(...)

In [None]:
# Fit the model using a batch of 5 and one epoch
model.fit(...) # Accuracy at the end must be ~ 98 / 97.5 %

## 2.3 - Test the model

Finally, let's see how your model performs on the dev set.

In [None]:
loss, acc = model.evaluate(X_val, Y_val)
print("Dev set accuracy = ", acc) # Must be in the range of 92 - 95 %

This looks pretty good! However, accuracy isn't a great metric for this task, since the labels are heavily skewed to 0's, so a neural network that just outputs 0's would get slightly over 90% accuracy.

In [None]:
Y_pred = model.predict(...)

def f1_score(Y_pred, Y_val, threshold):
    # Threshold the predictions, count true/false positives and false negatives, then compute F1
    return f1
    
f1_score(Y_pred, Y_val, threshold=0.5)
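
To see why F1 is more informative than accuracy on skewed labels, here is a minimal NumPy sketch on toy data. The function name `f1_from_probabilities`, the toy arrays, and the choice to return 0 when there are no true positives are assumptions for illustration.

```python
import numpy as np

def f1_from_probabilities(y_prob, y_true, threshold=0.5):
    """F1 score of thresholded predictions against binary labels."""
    y_hat = (np.asarray(y_prob) > threshold).astype(int).ravel()
    y_true = np.asarray(y_true).astype(int).ravel()
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    if tp == 0:
        return 0.0          # no true positives: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
all_zeros = np.zeros(10)
print(np.mean((all_zeros > 0.5) == y_true_toy))      # accuracy of "always 0": 0.8
print(f1_from_probabilities(all_zeros, y_true_toy))  # 0.0
```

A predictor that always outputs 0 looks accurate on skewed labels but gets an F1 of 0, because it never finds a positive.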

# 3 - Making Predictions

Now that you have built a working model for trigger word detection, let's use it to make predictions. To do this, implement a function that computes the spectrogram of an input audio clip, swaps the axes (from (freqs, Tx) to (Tx, freqs), as the model expects), passes it through the network to get the predictions, and plots them.
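
The axis handling can be sketched on a placeholder array. The zero array below stands in for a real spectrogram, and adding a leading batch dimension is an assumption based on the model's input shape `(None, Tx, n_freq)`.

```python
import numpy as np

n_freq_demo, Tx_demo = 101, 5511
spectrogram = np.zeros((n_freq_demo, Tx_demo))   # placeholder for a real spectrogram

x_demo = spectrogram.swapaxes(0, 1)              # (n_freq, Tx) -> (Tx, n_freq)
x_demo = np.expand_dims(x_demo, axis=0)          # add batch dimension -> (1, Tx, n_freq)
print(x_demo.shape)                              # (1, 5511, 101)
```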

In [None]:
def detect_triggerword(filename):
    plt.subplot(2, 1, 1)

    # preprocessing
    predictions = # predict
    
    plt.subplot(2, 1, 2)
    plt.plot(predictions[0,:,0])
    plt.ylabel('probability')
    plt.show()
    return predictions

Once we've estimated the probability of having detected the word "activate" at each output step, we can trigger a "bell" sound to play when the probability is above a certain threshold. Further, $y^{\langle t \rangle}$ might be near 1 for many values in a row after "activate" is said, yet we want to play the ring sound only once. So we will insert a ring sound at most once every 75 output steps. This will help prevent us from inserting two ring sounds for a single instance of "activate".
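
The "at most once every 75 output steps" rule can be sketched independently of the audio handling. The helper names `ring_positions` and `step_to_ms` and the toy probabilities are assumptions; the millisecond conversion assumes a 10 second clip.

```python
def ring_positions(probabilities, threshold=0.5, min_gap=75):
    """Return the output steps at which a ring should be inserted.

    A ring fires when the probability exceeds the threshold and at least
    min_gap steps have passed since the previous ring.
    """
    rings = []
    last_ring = None
    for i, p in enumerate(probabilities):
        if p > threshold and (last_ring is None or i - last_ring >= min_gap):
            rings.append(i)
            last_ring = i
    return rings

def step_to_ms(step, n_steps=1375, clip_ms=10_000):
    """Convert an output step index to a position in milliseconds."""
    return int(step / n_steps * clip_ms)

probs = [0.0] * 1375
probs[400:450] = [0.9] * 50                  # one detected "activate"
probs[900:950] = [0.9] * 50                  # a second one
rings = ring_positions(probs)
print(rings, [step_to_ms(s) for s in rings])  # [400, 900] [2909, 6545]
```

The millisecond positions are what an overlay call would need, since pydub indexes audio in milliseconds.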

In [None]:
ring_file = "audio_examples/ring.wav"
def ring_on_activate(filename, predictions, threshold):
    # open both wav files
    audio_clip = 
    ring = 
    # if the prediction exceeds the threshold and at least 75 output steps have passed since the last ring, add a ring sound
    # superpose audio and background using pydub (audio_clip.overlay)
    # ...
    audio_clip.export("ring_output.wav", format='wav')

## 3.3 - Test examples

Let's explore how our model performs on two unseen audio clips from the development set. Let's first listen to the two dev set clips.

In [None]:
IPython.display.Audio("./raw_data/test/1.wav")

In [None]:
IPython.display.Audio("./raw_data/test/2.wav")

Now run the model on these audio clips and see if it adds a ring sound after "activate"!

In [None]:
filename = "./raw_data/test/1.wav"
# Predict & add bell sounds
# Hear the results
IPython.display.Audio("./ring_output.wav")

In [None]:
filename  = "./raw_data/test/2.wav"
# Predict & add bell sounds
# Hear the results
IPython.display.Audio("./ring_output.wav")

# 4 - Try your own example


Record a given audio clip of you saying the word "activate" and other random words. Be sure to use the audio as a wav file. If your audio is recorded in a different format (such as mp3) there is free software that you can find online for converting it to wav.

Your recording may be longer or shorter than 10 seconds. Complete the code below to trim or pad it as needed to make it exactly 10 seconds.

Use the correct functions that can be found here [Pydub Docs](https://www.pydoc.io/pypi/pydub-0.9.5/autoapi/audio_segment/index.html#)
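
The trim-or-pad idea is easiest to see on a plain array of samples. The sketch below is an array-based analogue, not the pydub solution itself; the name `pad_or_trim` is hypothetical. With pydub the same idea is typically expressed by slicing in milliseconds and overlaying the result on a silent 10 second segment.

```python
import numpy as np

def pad_or_trim(samples, sample_rate=44100, seconds=10):
    """Return exactly seconds * sample_rate samples: cut the tail or append zeros (silence)."""
    target = sample_rate * seconds
    samples = np.asarray(samples)[:target]            # trim if too long
    missing = target - len(samples)
    return np.pad(samples, (0, missing))              # pad with zeros if too short

short_clip = np.ones(3 * 44100)                       # hypothetical 3 s recording
long_clip = np.ones(12 * 44100)                       # hypothetical 12 s recording
print(len(pad_or_trim(short_clip)), len(pad_or_trim(long_clip)))  # 441000 441000
```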

In [None]:
# Preprocess the audio to the correct format
def preprocess_audio(filename):
    # Trim -> pad -> set frame rate to 44100
    segment = 
    segment.export(filename, format='wav')

Now load your file

In [None]:
your_filename = ""

In [None]:
preprocess_audio(your_filename)
IPython.display.Audio(your_filename) # listen to the audio 

Finally, use the model to predict when you say activate in the 10 second audio clip, and trigger a ring bell. If beeps are not being added appropriately, try to adjust the threshold.

In [None]:
threshold = 0.5
prediction = detect_triggerword(your_filename)
ring_on_activate(your_filename, prediction, threshold)
IPython.display.Audio("./ring_output.wav")

# 5 - Use conv nets as your model



In this lab, we first transformed our raw audio into a spectrogram, created a dataset, and then used a recurrent model to detect the trigger word. One possible alternative is to convert the spectrograms into MFCCs (Mel-frequency cepstral coefficients), create a dataset of images with the correct labels, and construct a CNN for detecting the trigger word. The CNN can perform binary detection (is the word in the audio clip or not) or output a vector indicating the location of the trigger words; it is up to you.

* **Step1:** Transform the training / val samples (from spectrograms) into images (using MFCC)
* **Step2:** Adjust the predictions / labels for a computer vision task
* **Step3:** Save images & labels
* **Step4:** Create a convnet
* **Step5:** Train & Test

Tutorials:
https://www.kaggle.com/davids1992/speech-representation-and-data-exploration
https://www.kaggle.com/alexozerin/end-to-end-baseline-tf-estimator-lb-0-72