# Problem Set 6: Deep learning ASR

In this problem set, you'll be building a deep neural network for recognizing a set of 10 commands of the sort that could be used to control a robot or character in a videogame. You won't be writing a lot of code for this problem set, but you will answer some questions and experiment a bit with deep learning architectures.

The deliverable for this assignment, due on **Wednesday, November 11, 11:59pm Boston time**, is this Jupyter notebook. For this assignment, we will need to see the output of all your code, but we won't have time to run your code since it could take several hours. I recommend that you do `Kernel -> Restart & Run All` a few hours before the assignment is due so that all the code can run and complete before the deadline.

**Important Notes**

* Unless I say otherwise, you don't need to change any of the code blocks before you run them. A lot of what you will do for this problem set is answer questions and run the existing code.

* Even though there is deep learning going on here, you should be able to complete this on your own computer. If you can't (e.g., it doesn't work on Windows, or you can't get Python 3.8 installed), then use [Google Colab](https://colab.research.google.com). There's a version of this notebook in the repo (`ps6-colab.ipynb`) that includes code for uploading and storing files so that they can be used with Colab.

* The code can take a long time to run! If you start this the day it's due, it's likely that it won't be done by 11:59pm.

## Step 1: Importing libraries and checking compatibility

You can install librosa with Python 3.7, but I have not been able to make librosa work on anything under Python 3.8, so you should use Python 3.8 or higher for this problem set. Not sure which Python you are running? Go up to `Help -> About` to find out. If you are running a version below 3.8 and you are not sure what to do, download the latest Anaconda, install librosa, and then launch Jupyter from Anaconda.
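If you would rather check the version from inside the notebook, here is a minimal sketch. The helper name `python_is_new_enough` is made up for illustration; it only compares the running interpreter against the 3.8 minimum mentioned above.

```python
import sys

# Minimum version this problem set assumes (taken from the text above).
REQUIRED = (3, 8)

def python_is_new_enough(version_info=sys.version_info, required=REQUIRED):
    """Return True if the (major, minor) version is at least `required`."""
    return tuple(version_info[:2]) >= required

print("Running Python", sys.version.split()[0])
print("New enough for this problem set:", python_is_new_enough())
```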

In [None]:
import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
import warnings
warnings.filterwarnings("ignore")

## Step 2: Obtaining and Examining the data

We are going to build an ASR model that will recognize 10 speech commands. The data we'll be using is a subset of the [Google Speech Commands dataset](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html). 

1. [Download the data from here](http://cs.bc.edu/~prudhome/speech_commands.tgz)

2. Move the file you downloaded to this exact directory that you are in right now.

3. Untar and unzip the file (e.g., with ``tar xzf speech_commands.tgz`` or probably by double clicking it).

In the code block below, I'm just visualizing one of the waves for fun.

In [None]:
train_audio_path = 'speech_commands/'

# Use Librosa to read in the wav file.
samples, sample_rate = librosa.load(train_audio_path+ 'yes/0a7c2a8d_nohash_0.wav', sr = 16000)

# Plot the wav file.
fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(211)
ax1.set_title('Raw wave of ' + train_audio_path + 'yes/0a7c2a8d_nohash_0.wav')
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Amplitude')
# One x value per sample, running from 0 to the clip's duration in seconds.
ax1.plot(np.linspace(0, len(samples)/sample_rate, len(samples)), samples)

Great! It looks like a normal wave file. Let's look at a spectrogram. The code below will make a mel spectrogram, which is a different visualization than we're used to seeing in Praat, but you should be able to see a clear vowel (the "e" in "yes") followed by a clear fricative (the "s" in "yes").

In [None]:
spectro = librosa.feature.melspectrogram(y=samples, sr=sample_rate, n_mels=128, fmax=8000)

plt.figure(figsize=(10, 4))
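# Pass the sampling rate and the same fmax used above so the axes are labeled correctly.
librosa.display.specshow(librosa.power_to_db(spectro, ref=np.max), sr=sample_rate, y_axis='mel', fmax=8000, x_axis='time')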
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')


### Q1: Using the code above, write code in the code block below to open a `.wav` file for a different command, then produce its mel spectrogram. Does the spectrogram look more or less as you would expect?

In [None]:
## write your code for Q1 here

Finally, let's see how many training examples there are for each command! I filtered out some commands in the original data set that were too short or that were sampled at something other than 16kHz.

In [None]:
labels=["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

#find count of each label and plot bar graph
no_of_recordings=[]
for label in labels:
    waves = [f for f in os.listdir(train_audio_path + label) if f.endswith('.wav')]
    no_of_recordings.append(len(waves))
    
#plot
plt.figure(figsize=(30,5))
index = np.arange(len(labels))
plt.bar(index, no_of_recordings)
plt.xlabel('Commands', fontsize=12)
plt.ylabel('No of recordings', fontsize=12)
plt.xticks(index, labels, fontsize=15, rotation=60)
plt.title('No. of recordings for each command')
plt.show()

### Q2: Which command has the most examples? Which has the least? (Note that you might not be able to tell which has the least just by looking at the graph. You might have to print out some of the values from the code block above.)

*Enter your answer to Q2 here*

## Step 3: Data preparation

The code block below reads in each `.wav` file, downsamples it to 8000Hz, and then saves the data to the `all_wave` list and its label (i.e., the name of the command) to the `all_label` list.

**This code takes some time to run! Run the block below, and then check back every 5 minutes or so.**

In [None]:
# <-- Wait to execute the next code block until after you see a number in between these square brackets.
# This will take a little while.
all_wave = []
all_label = []
for label in labels:
    print(label)
    waves = [f for f in os.listdir(train_audio_path + '/'+ label) if f.endswith('.wav')]
    for wav in waves:
        samples, sample_rate = librosa.load(train_audio_path + '/' + label + '/' + wav, sr = 16000)
        # Keyword arguments are required by recent versions of librosa.
        samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)
        all_wave.append(samples)
        all_label.append(label)
        

Remember, computers don't like dealing with strings. For this reason, we are going to associate each label ("yes", "no", "up", "down", etc.) with a unique integer. The code below does this for you!

### Q3: Add a few lines of code at the bottom of the following code block to help you figure out the integer value associated with each label.

In [None]:
# Use the sklearn preprocessing function that will do this for you.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# This makes a list y that contains, for each example in the all_label list,
# the integer associated with that label
y=le.fit_transform(all_label)  # just like all_label but with integers rather than strings

# This just gives you the labels (i.e., the classes).
classes= list(le.classes_)

### Add code here to print out each label and its integer representation. ###
# PRINT OUT EACH LABEL ONLY ONCE! I expect to see something like this:
# 9 yes
# 5 on
# 0 down
# etc.


Remember that neural nets like to deal with vectors. For that reason, we are going to convert each label from an integer to a **one-hot vector**. In other words, the output for the neural net will be a vector whose length is the number of labels (10) with zeros in all places except for the one associated with the predicted label. For instance, if the integer associated with "down" is 0, then the one-hot vector would be `[1 0 0 0 0 0 0 0 0 0]`.

In [None]:
# print the original integer label and the string label for one training example
print(y[0], all_label[0])

# convert integer labels to one-hot vectors
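# to_categorical can be imported directly from keras.utils
# (the np_utils module is not available in newer versions of Keras).
from keras.utils import to_categorical
y=to_categorical(y, num_classes=len(labels))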

# print an example one-hot vector and the string label for that training example
print(y[0], all_label[0])
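To see what the one-hot conversion is doing, here is a minimal NumPy sketch of the same idea. The function `one_hot` and the example labels are made up for illustration; this is not how Keras implements it internally, just an equivalent way to think about it.

```python
import numpy as np

def one_hot(int_labels, num_classes):
    """Turn a 1D sequence of integer labels into a 2D array of one-hot rows."""
    int_labels = np.asarray(int_labels)
    encoded = np.zeros((len(int_labels), num_classes), dtype=np.float32)
    # Put a 1 in the column named by each label; every other entry stays 0.
    encoded[np.arange(len(int_labels)), int_labels] = 1.0
    return encoded

example = one_hot([0, 3, 9], num_classes=10)
print(example[0])               # the one-hot row for integer label 0
print(example.argmax(axis=1))   # argmax recovers the original integer labels
```

Taking the `argmax` of a one-hot row gets you back to the integer, which is the same trick used later to turn the model's output into a predicted class.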


Right now we have our audio data in a 2D "list of lists". We need to make it into the shape the neural net will be expecting, which is a 3D matrix (21K example commands with 8000 samples per command, with each sample as its own little vector). The code below will do this for you, and you can see how the shape changes from the original 2D list of lists to the 3D matrix.

In [None]:
print(all_wave[0])
all_wave = np.array(all_wave).reshape(-1,8000,1)
print(all_wave[0])
np.shape(all_wave)
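If the reshape is hard to picture, here is a tiny sketch with made-up data (four all-zero "recordings" rather than the real 21K) that shows only how the shape changes. The names `fake_waves`, `as_2d`, and `as_3d` are hypothetical.

```python
import numpy as np

# Four made-up "recordings" of 8000 samples each (all zeros, just to show shapes).
fake_waves = [np.zeros(8000, dtype=np.float32) for _ in range(4)]

as_2d = np.array(fake_waves)          # one row per recording
as_3d = as_2d.reshape(-1, 8000, 1)    # -1 lets NumPy work out the number of recordings

print(as_2d.shape)   # (4, 8000)
print(as_3d.shape)   # (4, 8000, 1)
```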

Finally, we will split the data into a training and validation set using the sklearn `train_test_split` function.

In [None]:
from sklearn.model_selection import train_test_split
x_tr, x_val, y_tr, y_val = train_test_split(np.array(all_wave),np.array(y),stratify=y,test_size = 0.2,random_state=777,shuffle=True)

## Step 4: Initializing the model

We're using Keras, which is a Python API for TensorFlow, a widely-used deep learning library. Below you will see a lot of different import statements for the various components of our deep learning model. 

In [None]:
from keras.layers import Dense, Dropout, Flatten, Conv1D, Input, MaxPooling1D
from keras.models import Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import backend as K

We are making a convolutional neural net (CNN). There are many layers in our architecture, as you can see in the code block below: 3 convolutional layers (`Conv1D`), a "flattening" layer (`Flatten`), one dense (`Dense`) layer, and the dense output layer. What does all this mean? 


### Convolutional layers
We'll be talking about CNNs a little bit in class on Thursday. It's often easier to conceptualize CNNs as they are applied to images rather than speech. You'll remember from class that animation where a small "filter" matrix of weights (also called a "kernel") sweeps across the input matrix, and the output from each step in the sweep is run through an activation function and that output is pooled. The gif below gives you an idea of what this is like (minus the activation function), where the weights in the filter are applied to elements in the matrix underneath, and the sum is entered into the output element ("convolved feature").

![convolution](convolution.gif)


The arguments to the `Conv1D` layers give us the parameters of this process. The first argument is the number of filters/kernels. The second argument is the size or the dimensions of that filter. The `strides` argument is how far forward the filter jumps as it sweeps.

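In the above gif, each output value is the sum of the filter's weighted inputs. In the code below, each convolutional layer is followed by max pooling, which means that you keep only the maximum value in each small window of the filter's output (here, every 3 consecutive values).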
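Here is a minimal pure-Python sketch of one filter sweeping over a 1D signal, followed by a ReLU activation and max pooling. The signal and filter weights are made up, and the function names are hypothetical; Keras does the same kind of arithmetic far more efficiently, with many filters at once and with weights learned during training.

```python
def conv1d_valid(signal, kernel, stride=1):
    """Slide `kernel` along `signal`; each output is the weighted sum of one window."""
    out = []
    for start in range(0, len(signal) - len(kernel) + 1, stride):
        window = signal[start:start + len(kernel)]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

def relu(values):
    """Replace negative values with 0 and leave the rest alone."""
    return [max(0, v) for v in values]

def max_pool1d(values, pool_size):
    """Keep the largest value in each non-overlapping window of `pool_size`."""
    n_windows = len(values) // pool_size   # leftover values at the end are dropped
    return [max(values[i * pool_size:(i + 1) * pool_size]) for i in range(n_windows)]

signal = [1, 3, 2, 5, 4, 0, 2, 1]   # made-up input
kernel = [1, 2, 1]                  # made-up filter weights (size 3)

convolved = conv1d_valid(signal, kernel)     # 8 - 3 + 1 = 6 output values
pooled = max_pool1d(relu(convolved), 3)      # 6 values pooled in windows of 3

print(convolved)
print(pooled)
```

Notice that the output gets shorter at each stage: a filter of size 3 with `strides=1` trims 2 values off the length, and pooling with a window of 3 divides the length by 3.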

Finally, `Dropout` tells the model to randomly set some percentage of the current matrix of features to 0. Dropout is a kind of regularization, or a method for avoiding overfitting. By randomly throwing out some features, dropout forces the model to look for relationships in all the features and to avoid becoming too dependent on a small number of specific features.
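As a rough illustration of dropout, here is a NumPy sketch with made-up features. The function `dropout` below is hypothetical and only mimics what happens during training: some values are zeroed at random, and the survivors are scaled up by `1 / (1 - rate)` so the overall sum stays about the same. At prediction time dropout is switched off.

```python
import numpy as np

def dropout(features, rate, rng):
    """Zero out roughly `rate` of the features and rescale the ones that survive."""
    keep_mask = rng.random(features.shape) >= rate
    return features * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
features = np.ones(10000)        # made-up features, all equal to 1
dropped = dropout(features, rate=0.3, rng=rng)

print("fraction set to zero:", np.mean(dropped == 0))
print("value of surviving features:", dropped[dropped != 0][0])
```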

### Other layers
The `Flatten` layer is straightforward: it just turns multidimensional input into a 1D vector.

The `Dense` layers are the kind of layers you think of when you think of a basic neural network. The input is a vector, and every input element goes through every node ("fully connected"). 

The final (output) layer is a dense layer where the length of the output is equal to the number of classes. The `softmax` activation function converts the output into (more or less) a probability distribution over the classes, and the predicted class is the class with the highest probability.
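To make the softmax step concrete, here is a small NumPy sketch. The raw scores are made up (and there are only four of them rather than ten), and `softmax` here is a hand-written illustration, not the Keras implementation.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into positive values that sum to 1."""
    scores = np.asarray(scores, dtype=np.float64)
    shifted = scores - scores.max()   # doesn't change the result; avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

raw_scores = [2.0, 1.0, 0.1, -1.0]    # made-up outputs of the last dense layer
probs = softmax(raw_scores)
predicted_index = int(np.argmax(probs))

print(probs)
print("predicted class index:", predicted_index)
```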

In [None]:
K.clear_session()

inputs = Input(shape=(8000,1))

#First Conv1D layer
conv = Conv1D(8,13, padding='valid', activation='relu', strides=1)(inputs)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Second Conv1D layer
conv = Conv1D(16, 11, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Third Conv1D layer
conv = Conv1D(32, 9, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
conv = Dropout(0.3)(conv)

#Flatten layer
conv = Flatten()(conv)

#Dense Layer 1
conv = Dense(128, activation='relu')(conv)
conv = Dropout(0.3)(conv)

# Output layer
outputs = Dense(len(labels), activation='softmax')(conv)

model = Model(inputs, outputs)
model.summary()
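If you want to check the shapes that `model.summary()` reports, the lengths can be worked out by hand. This is a sketch under the settings used above (`padding='valid'`, `strides=1`, pooling windows of 3); the helper names are made up.

```python
def conv_output_length(length, kernel_size, stride=1):
    """Output length of a 'valid' 1D convolution."""
    return (length - kernel_size) // stride + 1

def pool_output_length(length, pool_size):
    """Output length of non-overlapping pooling (leftover values are dropped)."""
    return length // pool_size

length = 8000   # samples per recording
for n_filters, kernel_size in [(8, 13), (16, 11), (32, 9)]:
    length = conv_output_length(length, kernel_size)
    length = pool_output_length(length, 3)
    print(f"after Conv1D({n_filters}, {kernel_size}) + MaxPooling1D(3): ({length}, {n_filters})")

flattened = length * 32   # the last conv layer has 32 filters
print("size of the Flatten output:", flattened)
```

The three stages shrink the 8000 samples to 2662, then 884, then 292 time steps, so the `Flatten` layer hands a vector of 292 x 32 = 9344 values to the dense layer.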

### Other model parameters and components

The `compile()` method is how we set the loss function we want to use (e.g., cross entropy, mean squared error), the optimizer (adam, sgd), and what metrics we want to report (accuracy).

`EarlyStopping` tells the model it can stop training if, in this case, the loss on the validation set doesn't improve by more than 0.0001 for 10 epochs.

`ModelCheckpoint` allows us to save out the best performing model so that we can reload it later and test new data.
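The early stopping rule is easier to follow as a loop. Below is a simplified pure-Python sketch of the logic, not the actual Keras code; the function name and the validation losses are made up, and I use a patience of 3 in the example so the list can stay short.

```python
def epochs_before_stopping(val_losses, patience=10, min_delta=0.0001):
    """Return the epoch (counting from 1) at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # The loss improved by more than min_delta: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)   # never ran out of patience

made_up_losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.69995, 0.7]
print(epochs_before_stopping(made_up_losses, patience=3))
```

In the example, the loss stops improving by more than 0.0001 after epoch 3, so with a patience of 3 training would stop at epoch 6.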

In [None]:
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10, min_delta=0.0001) 
mc = ModelCheckpoint('best_model.hdf5', monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

## Step 5: Training the model

The line of code below trains the model for 10 epochs. This means that it will go through the whole dataset 10 times. The batch size is 32, meaning that it will update the weights after every 32 input examples. The `callbacks` are just the arguments for early stopping and saving out a model checkpoint, described above.

After you run the code below, make sure it's going, and then go have a coffee or take a walk. It will be a while -- maybe 20 or 30 minutes depending on your machine. You'll know it's done when you see the `MODEL TRAINING COMPLETE` message printed out.

In [None]:
history=model.fit(x_tr, y_tr ,epochs=10, callbacks=[es,mc], batch_size=32, validation_data=(x_val,y_val))

print("MODEL TRAINING COMPLETE!")
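To get a feel for what the batch size means, here is a small sketch that counts weight updates. The training set size of 16,800 is an assumption for illustration (80% of roughly 21,000 recordings); your actual number is `len(x_tr)`. The function name is made up.

```python
import math

def updates_per_epoch(n_train_examples, batch_size):
    """Number of weight updates in one pass over the training data."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(n_train_examples / batch_size)

n_train = 16800   # hypothetical training set size; use len(x_tr) for the real one
per_epoch = updates_per_epoch(n_train, batch_size=32)

print("updates per epoch:", per_epoch)
print("updates over 10 epochs:", per_epoch * 10)
```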

## Step 6: Reviewing the model

Okay, you have trained a CNN to recognize ten spoken commands! In the output above, you'll see that the accuracy started off low but steadily increased, while the loss started out high, then steadily decreased.

Let's plot the loss on the training and on the validation set to see whether the model is learning over time.

In [None]:
from matplotlib import pyplot as plt
plt.plot(history.history['loss'], label='train') 
plt.plot(history.history['val_loss'], label='test') 
plt.legend() 
plt.show()

### Q4: In the code block below, write code that will print out a graph like the one above but for accuracy on the training and test set.

In [None]:
# EMILY REMOVE THIS CODE
plt.plot(history.history['accuracy'], label='train') 
plt.plot(history.history['val_accuracy'], label='test') 
plt.legend() 
plt.show()

## Step 7: Testing the model with your own voice

Using Praat, record yourself saying each of the 10 commands. Save each one out as a `WAV` file into the same directory as this Jupyter notebook, naming each one with its label, e.g., `yes.wav` or `stop.wav`.

Then you can run the code blocks below to test the model on your own recordings.

In [None]:
# This allows us to load the model we just trained, above.

from keras.models import load_model
model=load_model('best_model.hdf5')

In [None]:
# This is a little function that takes a recording
# and predicts which command it is.

def predict(audio):
    prob=model.predict(audio.reshape(1,8000,1))
    index=np.argmax(prob[0])
    return classes[index]

In [None]:
# This code will read in each of your recorded commands
# and resample it to the 8000Hz sampling rate the model expects.

for l in labels:
    wavefile = l + ".wav"
    samples, sample_rate = librosa.load(wavefile, sr = 44100)
    # Keyword arguments are required by recent versions of librosa.
    samples = librosa.resample(samples, orig_sr=sample_rate, target_sr=8000)

    # This is a dumb way to trim or pad your recording to 1 second,
    # which is necessary for the way I've set this up.
    # Basically: just chop off the difference on either end, or
    # if it's too short, add a bunch of zeros at the end.

    if len(samples) > 8000:
        toremove = int( (len(samples) - 8000)/2)
        if len(samples) % 2 == 0:
            samples = samples[toremove:-toremove]
        else:
            samples = samples[toremove:-toremove-1]
    else:
        samples = np.pad(samples, (0, 8000-len(samples)))
    print("actual:", l, "\tpredicted:", predict(samples))
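The trim-or-pad step in the loop above can also be written as a small stand-alone function, which makes it easier to test by itself. This is a sketch with a made-up name, `fit_to_length`; it is meant to behave the same way as the code above (trim equally from both ends, or pad with zeros at the end).

```python
import numpy as np

def fit_to_length(samples, target_len=8000):
    """Center-crop a recording that is too long, or zero-pad one that is too short."""
    samples = np.asarray(samples)
    if len(samples) > target_len:
        start = (len(samples) - target_len) // 2
        return samples[start:start + target_len]
    # Pad only at the end; a recording of exactly target_len comes back unchanged.
    return np.pad(samples, (0, target_len - len(samples)))

too_long = fit_to_length(np.ones(12345))
too_short = fit_to_length(np.ones(5000))
print(len(too_long), len(too_short))
```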
    

### Q5: Report the conditions under which you made the recording: location, kind of microphone, whether you were wearing a mask. What accuracy did the recognizer yield on your recorded commands? 

*Enter your answer to Q5 here*

## Step 8: Refining your model

I made many more or less arbitrary choices about how to train the model, above. 

* I decided to do only 10 epochs of training. The model seemed to be still improving when I stopped, so what would happen if I trained for more epochs? 
* There are three convolution layers and 2 dense layers -- why not more or fewer? 
* Dropout was 0.3, but what if it had been more or less? 
* What if we used a different activation function or a different measure of loss or a different pooling method?
* What if we change the various parameters for the convolutional layers?

### Q6: In the code block(s) below, I'd like you to build, fit, and test a new model by changing one or more of these parameters, above. Mostly you will be copying and pasting the code above starting with Step 4, and then making slight modifications.

In [None]:
### Write your code here!
### Feel free to include the plots, as well.
### Don't forget to test on your voice data, too,
### and feel free to create more challenging recordings 
### (e.g., by wearing a mask, recording in a noisy environment).

### Q7: How did your new model perform on the validation set and on your own voice test set compared to the model we originally created above?

*Enter your answer to Q7 here*

## Step 9: Submission
The deliverable for this assignment, due on **Wednesday, November 11, 11:59pm Boston time**, is this Jupyter notebook. 

**IMPORTANT:** For this assignment, we will need to see the output of all your code, but we will not have time to actually run your code since it could take several hours. I recommend that you do `Kernel -> Restart & Run All` a few hours before the assignment is due so that all the code can run and complete before the deadline.

