In this notebook we will build a small speech recognition model that tells the spoken words "yes" and "no" apart.

Below we'll import the libraries we'll be using.

In [None]:
import os
import librosa   #for audio processing
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")

Next, we'll download the Speech Commands data set (version 0.01) from the TensorFlow download server.

In [None]:
if not os.path.exists('speech_commands_v0.01.tar.gz'):
    import urllib.request
    data = urllib.request.urlopen('http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz').read()
    with open('speech_commands_v0.01.tar.gz', 'wb') as f:
        f.write(data)

Here, we extract the archive we just downloaded into the `speech_commands` directory.

In [None]:
if not os.path.exists('speech_commands'):
    import tarfile
    os.mkdir('speech_commands')
    tarfile.open('speech_commands_v0.01.tar.gz').extractall('speech_commands')

The data set contains sound clips of the following 30 English words:
* zero, one, two, three, four, five, six, seven, eight, nine
* left, right, up, down
* yes, no, on, off, stop, go
* bed, bird, cat, dog, house, tree
* marvin, sheila, happy, wow

Each word has between roughly 1,700 and 2,400 clips, stored as .wav files.
There are six additional clips in the `_background_noise_` directory.
Most clips are 1 second long, though some clips have slightly different durations.
All clips have sampling rates of 16,000 samples per second.
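
The relationship between clip length and sampling rate can be sketched in a few lines. This is only an illustration: the helper names `clip_duration_seconds` and `is_one_second` are hypothetical and are not used elsewhere in this notebook, and the 14861-sample clip is a made-up example.

```python
SAMPLE_RATE = 16000  # samples per second, as stated for this data set


def clip_duration_seconds(num_samples, sample_rate=SAMPLE_RATE):
    """Return the duration of a clip in seconds."""
    return num_samples / sample_rate


def is_one_second(num_samples, sample_rate=SAMPLE_RATE):
    """Return True when the clip holds exactly one second of audio."""
    return num_samples == sample_rate


print(clip_duration_seconds(16000))  # a full one-second clip
print(clip_duration_seconds(14861))  # a hypothetical, slightly shorter clip
print(is_one_second(14861))
```

A clip therefore counts as "one second" only when it contains exactly 16,000 samples.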

The next code cell calculates some statistics of the data set.
Its output is:
```
bed    1713 samples (1484 one-second samples)
bird   1731 samples (1521 one-second samples)
cat    1733 samples (1515 one-second samples)
dog    1746 samples (1547 one-second samples)
down   2359 samples (2152 one-second samples)
eight  2352 samples (2111 one-second samples)
five   2357 samples (2161 one-second samples)
four   2372 samples (2158 one-second samples)
go     2372 samples (2101 one-second samples)
happy  1742 samples (1549 one-second samples)
house  1750 samples (1560 one-second samples)
left   2353 samples (2165 one-second samples)
marvin 1746 samples (1578 one-second samples)
nine   2364 samples (2174 one-second samples)
no     2375 samples (2098 one-second samples)
off    2357 samples (2143 one-second samples)
on     2367 samples (2105 one-second samples)
one    2370 samples (2103 one-second samples)
right  2367 samples (2155 one-second samples)
seven  2377 samples (2170 one-second samples)
sheila 1734 samples (1578 one-second samples)
six    2369 samples (2199 one-second samples)
stop   2380 samples (2174 one-second samples)
three  2356 samples (2143 one-second samples)
tree   1733 samples (1521 one-second samples)
two    2373 samples (2140 one-second samples)
up     2375 samples (2062 one-second samples)
wow    1745 samples (1525 one-second samples)
yes    2377 samples (2157 one-second samples)
zero   2376 samples (2203 one-second samples)
_background_noise_    6 samples (   0 one-second samples)
```

In [None]:
top = 'speech_commands'
for word in os.listdir(top):
  word_path = os.path.join(top, word)
  if not os.path.isdir(word_path):
    continue
  total = one_second = 0
  for file in os.listdir(word_path):
    if not file.endswith('.wav'):
      continue
    file_path = os.path.join(word_path, file)
    samples, sample_rate = librosa.load(file_path, sr=None)
    total += 1
    if sample_rate != 16000:
      print(f'{file_path} has wrong sample rate {sample_rate}')
    elif samples.shape == (16000,):
      one_second += 1
  print('%-6s %4d samples (%4d one-second samples)' % (word, total, one_second))

Let's load an example spoken command into `samples` and plot its waveform.

In [None]:
train_path = 'speech_commands/'
filename = train_path+'no/afe0b87d_nohash_0.wav'
# By specifying sr=None, librosa.load keeps the original sampling rate of the clip,
# which is 16,000 samples per second for all clips in this data set.
# If an int is given (e.g., sr=20000), the clip will be resampled to that rate.
# If the sr parameter is not given, it defaults to sr=22050.
# samples will be a one-dimensional np.ndarray of type float32,
# with its size equal to the number of samples.
# (This particular clip is one second long, so samples.shape is (16000,).)
# Each element of the array is between -1.0 and 0.9999695 (i.e., 32767/32768).
# (The original data are 2-byte integers between -32768 and 32767,
# and they were scaled by dividing by 32768.)
samples, sample_rate = librosa.load(filename, sr=None)
fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(211)
ax1.set_title('Raw signal of ' + filename)
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Amplitude')
ax1.plot(np.linspace(0, len(samples)/sample_rate, len(samples)), samples)
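
The value range described in the comments above can be checked with a small sketch. It assumes the scaling is a plain division by 32768, as those comments state; it does not reproduce how librosa reads the file internally, and the `raw` values are made up for illustration.

```python
import numpy as np

# Hypothetical raw 16-bit PCM values, including both extremes of the int16 range.
raw = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)

# Dividing by 32768 maps the integers into the half-open interval [-1.0, 1.0).
scaled = raw.astype(np.float32) / np.float32(32768)

print(scaled.dtype)
print(scaled.min(), scaled.max())
```

The largest positive value is 32767/32768, which is why the scaled signal never quite reaches 1.0.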

Below we play the audio clip stored in `samples`.

In [None]:
ipd.Audio(samples,rate=sample_rate,autoplay=True)

Below we load the one-second `yes` and `no` clips into `all_wavs` and their labels into `all_labs`. Clips of any other length are skipped.

We'll also print the number of examples in `all_wavs`.

In [None]:
import os

directory = 'speech_commands/'

all_wavs = []
all_labs = []
for label in ['yes', 'no']:
    print(label)
    wavs = [f for f in os.listdir(directory + label) if f.endswith('.wav')]
    for wav in wavs:
        samples, sample_rate = librosa.load(directory + label + '/' + wav, sr=None)
        if len(samples) == 16000: 
            all_wavs.append(samples)
            all_labs.append(label)
print(
    len(all_wavs), 'examples,',
    all_labs.count('yes'), 'yes,', all_labs.count('no'), 'no.')

Below we split the data into training and test sets. `X_train` holds the audio clips used for training and `y_train` holds their labels. `X_test` and `y_test` hold the test clips and their labels, respectively.

In [None]:
from sklearn.model_selection import train_test_split
 
all_wavs = np.array(all_wavs).reshape(-1,16000,1)
all_labs = np.array([lab == 'yes' for lab in all_labs])
X_train, X_test, y_train, y_test = train_test_split(all_wavs,all_labs,test_size = 0.2)
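
To see what the reshape and the label conversion in the cell above do, here is a toy version. The clip length of 4 is a stand-in for 16000, and `toy_wavs` and `toy_labs` are made-up data, not part of the data set.

```python
import numpy as np

clip_length = 4  # stand-in for 16000, to keep the example small
toy_wavs = [np.zeros(clip_length, dtype=np.float32) for _ in range(3)]
toy_labs = ['yes', 'no', 'yes']

# Add a trailing channel axis: (clips, time steps) -> (clips, time steps, 1).
X = np.array(toy_wavs).reshape(-1, clip_length, 1)

# Turn the string labels into booleans: True for 'yes', False for 'no'.
y = np.array([lab == 'yes' for lab in toy_labs])

print(X.shape)  # (3, 4, 1)
print(y)        # one boolean per clip
```

The trailing axis of size 1 is the single audio channel that the `Input(shape=(16000,1))` layer of the model expects.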

In the following lines, we assemble the layers of our speech recognition model.

In [None]:
from keras.layers import Conv1D, Input, MaxPooling1D, Flatten, Dense
from keras.models import Model
 
inputs = Input(shape=(16000,1))
 
#First Conv1D layer
conv = Conv1D(8,13, padding='valid', activation='relu', strides=1)(inputs)
conv = MaxPooling1D(3)(conv)
 
#Second Conv1D layer
conv = Conv1D(16, 11, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
 
#Third Conv1D layer
conv = Conv1D(32, 9, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
 
#Fourth Conv1D layer
conv = Conv1D(64, 7, padding='valid', activation='relu', strides=1)(conv)
conv = MaxPooling1D(3)(conv)
 
#Flatten layer
conv = Flatten()(conv)
 
#Dense Layer 1
conv = Dense(256, activation='relu')(conv)
 
#Dense Layer 2
conv = Dense(128, activation='relu')(conv)
 
outputs = Dense(1, activation='sigmoid')(conv)
 
model = Model(inputs, outputs)
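
It can help to trace how the time axis shrinks through the four convolution and pooling stages above. The sketch below uses the standard output-length formulas for `'valid'` padding and assumes each pooling layer uses a stride equal to its pool size of 3; the helper names are hypothetical.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_length(length, pool_size):
    """Output length of 1-D max pooling whose stride equals pool_size."""
    return length // pool_size


length = 16000  # one second of audio at 16,000 samples per second
layers = [(8, 13), (16, 11), (32, 9), (64, 7)]  # (filters, kernel_size)
lengths = []
for filters, kernel_size in layers:
    length = conv1d_valid_length(length, kernel_size)
    length = maxpool1d_length(length, 3)
    lengths.append(length)
    print(f'{filters:2d} filters, kernel {kernel_size:2d} -> {length} time steps')

# Flatten multiplies the remaining time steps by the last layer's filter count.
flattened = length * layers[-1][0]
print('Flatten output size:', flattened)
```

Under these assumptions the sequence shrinks from 16000 to 5329, 1773, 588 and finally 194 time steps, so the first dense layer receives 194 × 64 = 12416 values per clip.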

We then `fit` the model. We use `mean_squared_error` as the `loss` and optimize the weights with `adam` as our `optimizer`. We iterate over the training data 15 times. After each pass, or `epoch`, Keras prints the `accuracy` and `loss` of our model so far.
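
To make the two reported quantities concrete, here is a plain-Python sketch of a mean squared error and of an accuracy computed by thresholding sigmoid outputs at 0.5. The threshold and the helper names are assumptions for illustration, and the predictions are made up; this is not Keras's own implementation.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    correct = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = [1, 0, 1, 0]          # 1 stands for 'yes', 0 for 'no'
y_pred = [0.9, 0.2, 0.4, 0.6]  # hypothetical sigmoid outputs

mse = mean_squared_error(y_true, y_pred)
acc = binary_accuracy(y_true, y_pred)
print('loss (MSE):', mse)
print('accuracy:', acc)
```

The loss rewards predictions close to the label, while the accuracy only counts whether each prediction falls on the right side of the threshold.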

In [None]:
model.compile(loss='mean_squared_error',optimizer='adam',metrics=['accuracy'])
 
model.fit(X_train, y_train ,epochs=15, batch_size=32)

We then report the final `accuracy` and `loss` on the `X_test` and `y_test` data.

In [None]:
model.evaluate(X_test, y_test)

In [None]:
# We have a total of 4255 samples.
# The training set has 4255 * 80% = 3404 samples.
# The test set has 4255 * 20% = 851 samples.
X_train.shape, y_train.shape, X_test.shape, y_test.shape
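
The sizes in the comments above follow from the one-second clip counts for `yes` and `no` in the statistics output earlier in this notebook. The sketch below assumes the test-set size is the requested fraction rounded up, with the remainder going to training.

```python
import math

# One-second clip counts taken from the statistics output earlier in the notebook.
one_second_counts = {'yes': 2157, 'no': 2098}

n_total = sum(one_second_counts.values())
test_fraction = 0.2

# Assumption: the test set gets the fraction rounded up, training gets the rest.
n_test = math.ceil(n_total * test_fraction)
n_train = n_total - n_test

print(n_total, n_train, n_test)
```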