# Wav2Letter Example using Google Speech Command Dataset

The Google Speech Command Dataset can be found [here](https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data). It was chosen as a quick and convenient way to test Wav2Letter's performance.

## Load Data

In [1]:
from Wav2Letter.data import GoogleSpeechCommand

# using google's speech command dataset
gs = GoogleSpeechCommand()
_inputs, _targets = gs.load_vectors("./speech_data")

In [2]:
mfcc_features = 13
grapheme_count = gs.intencode.grapheme_count

In [3]:
import torch
inputs = torch.Tensor(_inputs)
targets = torch.IntTensor(_targets)

In [4]:
print(inputs.shape)
print(targets.shape)

torch.Size([64721, 225, 13])
torch.Size([64721, 6])


## Build Model

In [5]:
import torch.nn as nn
import torch.optim as optim
from Wav2Letter.model import Wav2Letter

model = Wav2Letter(mfcc_features, grapheme_count)
print(model.layers)

ctc_loss = nn.CTCLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

Sequential(
  (0): Conv1d(13, 250, kernel_size=(48,), stride=(2,))
  (1): ReLU()
  (2): Conv1d(250, 250, kernel_size=(7,), stride=(1,))
  (3): ReLU()
  (4): Conv1d(250, 250, kernel_size=(7,), stride=(1,))
  (5): ReLU()
  (6): Conv1d(250, 250, kernel_size=(7,), stride=(1,))
  (7): ReLU()
  (8): Conv1d(250, 250, kernel_size=(7,), stride=(1,))
  (9): ReLU()
  (10): Conv1d(250, 250, kernel_size=(7,), stride=(1,))
  (11): ReLU()
  (12): Conv1d(250, 250, kernel_size=(7,), stride=(1,))
  (13): ReLU()
  (14): Conv1d(250, 250, kernel_size=(7,), stride=(1,))
  (15): ReLU()
  (16): Conv1d(250, 2000, kernel_size=(32,), stride=(1,))
  (17): ReLU()
  (18): Conv1d(2000, 2000, kernel_size=(1,), stride=(1,))
  (19): ReLU()
  (20): Conv1d(2000, 25, kernel_size=(1,), stride=(1,))
)
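
The printed layers list no padding, so each convolution shortens the time axis. The following is a minimal sketch, not part of the Wav2Letter package, that applies the standard `Conv1d` output-length formula to the kernel sizes and strides shown above. It assumes zero padding and a dilation of 1 (PyTorch omits both from the printout when they are at their defaults); the helper names `conv1d_out_len` and `stack_out_len` are hypothetical.

```python
# (kernel_size, stride) for each Conv1d in the printed Sequential.
CONV_LAYERS = [(48, 2)] + [(7, 1)] * 7 + [(32, 1), (1, 1), (1, 1)]


def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Standard Conv1d output-length formula (floor division)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def stack_out_len(length, layers=CONV_LAYERS):
    """Push a frame count through every convolution in order."""
    for kernel_size, stride in layers:
        length = conv1d_out_len(length, kernel_size, stride)
    return length


print(conv1d_out_len(225, 48, 2))  # 89 frames after the first layer
print(stack_out_len(225))          # 16 frames after the whole stack
```

Under these assumptions, 225 input frames yield 16 output frames. That matches the length of the decoded output in the Evaluate section and is longer than the 6 target labels, which CTC needs in order to align the two sequences.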


## Train

In [6]:
# Each mfcc feature is a channel
# https://pytorch.org/docs/stable/nn.html#torch.nn.Conv1d
# transpose (sample_size, in_frame_len, mfcc_features)
# to      (sample_size, mfcc_features, in_frame_len)
inputs = inputs.transpose(1, 2)
print(inputs.shape)

torch.Size([64721, 13, 225])


In [7]:
# do short training run
batch_size = 256
model.fit(inputs, targets, optimizer, ctc_loss, batch_size, epoch=10)

epoch 1 : step 1 / 253 , loss  6.584851264953613
epoch 1 : step 51 / 253 , loss  2.7819130420684814
epoch 1 : step 101 / 253 , loss  2.7523272037506104
epoch 1 : step 151 / 253 , loss  2.6992950439453125
epoch 1 : step 201 / 253 , loss  2.7544894218444824
epoch 1 : step 251 / 253 , loss  2.7575273513793945
epoch 1 average epoch loss 2.921410348575577
epoch 2 : step 1 / 253 , loss  2.7081358432769775
epoch 2 : step 51 / 253 , loss  2.744292736053467
epoch 2 : step 101 / 253 , loss  2.740218162536621
epoch 2 : step 151 / 253 , loss  2.6845641136169434
epoch 2 : step 201 / 253 , loss  2.742398738861084
epoch 2 : step 251 / 253 , loss  2.7068052291870117
epoch 2 average epoch loss 2.7363557551689297
epoch 3 : step 1 / 253 , loss  2.6324448585510254
epoch 3 : step 51 / 253 , loss  2.629676103591919
epoch 3 : step 101 / 253 , loss  2.5913217067718506
epoch 3 : step 151 / 253 , loss  2.526848793029785
epoch 3 : step 201 / 253 , loss  2.5802133083343506
epoch 3 : step 251 / 253 , loss  2.56799
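
The `253` steps per epoch in the log are consistent with splitting the 64721 samples into batches of 256, with a smaller final batch. A minimal sketch of that arithmetic follows. `batch_bounds` is a hypothetical helper, and how `model.fit` actually slices its batches is an assumption here.

```python
import math


def batch_bounds(sample_count, batch_size):
    """Yield (start, end) index pairs that cover all samples in order."""
    for start in range(0, sample_count, batch_size):
        yield start, min(start + batch_size, sample_count)


sample_count, batch_size = 64721, 256
bounds = list(batch_bounds(sample_count, batch_size))

print(len(bounds), math.ceil(sample_count / batch_size))  # 253 253
print(bounds[-1])  # (64512, 64721), i.e. a final batch of 209 samples
```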

## Evaluate

In [9]:
from Wav2Letter.decoder import GreedyDecoder

sample = inputs[0]
sample_target = targets[0]

print(sample.shape)

torch.Size([13, 225])


In [11]:
log_prob = model.eval(sample)
output = GreedyDecoder(log_prob)
print(output)

tensor([0, 0, 0, 0, 0, 8, 9, 0, 6, 1, 0, 0, 0, 1, 0, 1])

In [12]:
sample_target

tensor([8, 9, 6, 1, 1, 1], dtype=torch.int32)

**Blank labels are 0; pad labels are 1.**

**As you can see, once the 0s and 1s are removed from the output, the model has predicted the correct labels!**
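
The clean-up described above can be sketched in plain Python. This is a hedged illustration, not the `GreedyDecoder` implementation: it applies the usual CTC rule (merge consecutive repeats, then drop blanks) and then strips pads. The function names `ctc_collapse` and `strip_pads` are hypothetical.

```python
BLANK, PAD = 0, 1  # label conventions stated in this notebook


def ctc_collapse(labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blank labels."""
    collapsed = []
    previous = None
    for label in labels:
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


def strip_pads(labels, pad=PAD):
    """Remove padding labels."""
    return [label for label in labels if label != pad]


output = [0, 0, 0, 0, 0, 8, 9, 0, 6, 1, 0, 0, 0, 1, 0, 1]
target = [8, 9, 6, 1, 1, 1]

collapsed = ctc_collapse(output)
print(collapsed)                                    # [8, 9, 6, 1, 1, 1]
print(strip_pads(collapsed) == strip_pads(target))  # True
```

For this sample, collapsing the output already reproduces the padded target exactly, because the blanks between the three pad labels keep them from being merged.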