### Speech recognition with torchaudio Part 2 using the dataloader

In this notebook we are going to build a nueral network based on the previous notebook that will classify speech commands.

In this notebook we are going to have a look on how we can use the `Dataloader` to prepare our dataset in a way that our network will load it eaisly.

In [None]:
!pip install pydub torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

### Imports

In [1]:
import torch
from torch import nn
from torch.nn import functional as F
import torchaudio, sys

import matplotlib.pyplot as plt
import IPython.display as ipd

from tqdm import tqdm

torch.__version__, torchaudio.__version__

  '"sox" backend is being deprecated. '


('1.7.0+cu101', '0.7.0')

### Device


In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Dataset

We are going to use [SpeechCommands](https://arxiv.org/abs/1804.03209) dataset which contains of 35 different commands spoken by different people. The dataset `SPEECHCOMMANDS` is a `torch.utils.data.Dataset` version of the dataset. In this dataset, all audio files are about 1 second long (and so about 16000 time frames long).

### Spitting the dataset into `train`, `test` and `validation`.

In [3]:
from torchaudio.datasets import SPEECHCOMMANDS
import os
import numpy as np


In [4]:
class SubsetSC(SPEECHCOMMANDS):
  def __init__(self, subset: str = None):
    super().__init__("./", download=True)

    def load_list(filename):
      filepath = os.path.join(self._path, filename)
      with open(filepath) as f:
        return [os.path.join(self._path, line.strip()) for line in f]

    if subset == "validation":
      self._walker = load_list("validation_list.txt")
    elif subset == "testing":
      self._walker = load_list("testing_list.txt")
    elif subset == "training":
      excludes = load_list("validation_list.txt") + load_list("testing_list.txt")
      excludes = set(excludes)
      self._walker = [w for w in self._walker if w not in excludes]


### Creating train and test splits.

In [5]:
train_set = SubsetSC("training")
test_set = SubsetSC("testing")
val_set = SubsetSC("validation")

### Dataset filtering

We are going to filter the this dataset and get what we want from it. We are goig to take `waveform` with it's correcponding `label`.

In [6]:
#  Train 
X_train = [waveform for waveform, _, label, __, ___ in train_set]
y_train = [label for waveform, _, label, __, ___ in train_set]

# Test

X_test = [waveform for waveform, _, label, __, ___ in test_set]
y_test = [label for waveform, _, label, __, ___ in test_set]

# Validation

X_val = [waveform for waveform, _, label, __, ___ in val_set]
y_val = [label for waveform, _, label, __, ___ in val_set]

In [7]:
from torch.utils.data import Dataset, DataLoader

In [8]:
X_train[0], y_train[0]

(tensor([[-0.0658, -0.0709, -0.0753,  ..., -0.0700, -0.0731, -0.0704]]),
 'backward')

In [9]:
len(y_train), len(X_train)

(84843, 84843)

### Testing audio

In [10]:
sample_rate = 16000

In [11]:
ipd.Audio(X_train[0].numpy(), rate=sample_rate)

### Labels

In [12]:
labels = list(set(list(sorted(y_train))))
len(labels), labels[:2]

(35, ['eight', 'right'])

In [13]:
def label_to_index(label):
  return labels.index(label)

In [14]:
label_to_index("down")

5

### Creating the dataset.

In [15]:
class SpeechCommands(Dataset):
  def __init__(self, x, y, transform=None):
    self.transform = transform
    self.x = x
    self.y = [label_to_index(i) for i in y]
    self.len = len(y)
        
  def __len__(self):
    return self.len

  def __getitem__(self, index):
    sample = self.x[index], self.y[index]
    if self.transform:
        sample = self.transform(sample)
    return sample

In [16]:
class ToTensor:
    def __call__(self, samples):
        x, y = samples
        return torch.Tensor(x), torch.Tensor([y])

In [17]:
train = SpeechCommands(X_train, y_train, transform=ToTensor())
test = SpeechCommands(X_test, y_test, transform=ToTensor())
val = SpeechCommands(X_val, y_val, transform=ToTensor())

In [18]:
val[0]

(tensor([[-0.0004, -0.0007, -0.0009,  ...,  0.0062,  0.0058,  0.0057]]),
 tensor([1.]))

### Dataloaders
We are then going to create data lodaers for both our sets:

1. `train`
2. `test`
3. `validation`

We are also going to pad the sequences by applying the collate_function to the dataloader beacause the `DataLoader` class takes the following as args and keyword args.

```py
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
```

We are also going to transform the `waveform` by reducing the `sample_rate` from `16000` to `8000`




In [19]:
new_sample_rate = 8000
transform = torchaudio.transforms.Resample(orig_freq=sample_rate,
                                           new_freq=new_sample_rate)

In [20]:
def pad_sequence(batch):
  batch = [item.t() for item in batch]
  batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
  return batch.permute(0, 2, 1)


def collate_fn(batch):
  tensors, targets = [], []

  for waveform, label in batch:
    tensors += [transform(waveform)]
    targets += [label.type(torch.LongTensor).squeeze()]

  tensors = pad_sequence(tensors)
  targets = torch.stack(targets)
  return tensors, targets

In [21]:
BATCH_SIZE = 64

train_loader = DataLoader(
    train,
    shuffle=True,
    batch_size=BATCH_SIZE,
    collate_fn=collate_fn
)

test_loader = DataLoader(
    test,
    shuffle=True, # should be false
    batch_size=BATCH_SIZE,
    collate_fn = collate_fn
)

val_loader = DataLoader(
    val,
    shuffle=True, #should be false
    batch_size=BATCH_SIZE,
    collate_fn=collate_fn

)


### Creating a model

Next we are going to create a model.

In [22]:
class M5(nn.Module):
  def __init__(self, n_input=1, n_output=35, stride=16, n_channel=32):
    super(M5, self).__init__()
    self.conv1 = nn.Conv1d(n_input, n_channel, kernel_size=80, stride=stride)
    self.bn1 = nn.BatchNorm1d(n_channel)
    self.pool1 = nn.MaxPool1d(4)

    self.conv2 = nn.Conv1d(n_channel, n_channel, kernel_size=3)
    self.bn2 = nn.BatchNorm1d(n_channel)
    self.pool2 = nn.MaxPool1d(4)

    self.conv3 = nn.Conv1d(n_channel, 2 * n_channel, kernel_size=3)
    self.bn3 = nn.BatchNorm1d(2 * n_channel)
    self.pool3 = nn.MaxPool1d(4)

    self.conv4 = nn.Conv1d(2 * n_channel, 2 * n_channel, kernel_size=3)
    self.bn4 = nn.BatchNorm1d(2 * n_channel)
    self.pool4 = nn.MaxPool1d(4)

    self.fc1 = nn.Linear(2 * n_channel, n_output)


  def forward(self, x):
    x = self.conv1(x)
    x = F.relu(self.bn1(x))
    x = self.pool1(x)
    x = self.conv2(x)
    x = F.relu(self.bn2(x))
    x = self.pool2(x)
    x = self.conv3(x)
    x = F.relu(self.bn3(x))
    x = self.pool3(x)
    x = self.conv4(x)
    x = F.relu(self.bn4(x))
    x = self.pool4(x)
    x = F.avg_pool1d(x, x.shape[-1])
    x = x.permute(0, 2, 1)
    x = self.fc1(x)
    return x

In [23]:
model = M5(n_input=1, n_output=len(labels)).to(device)
model

M5(
  (conv1): Conv1d(1, 32, kernel_size=(80,), stride=(16,))
  (bn1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool1): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv1d(32, 32, kernel_size=(3,), stride=(1,))
  (bn2): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool2): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (conv3): Conv1d(32, 64, kernel_size=(3,), stride=(1,))
  (bn3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool3): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (conv4): Conv1d(64, 64, kernel_size=(3,), stride=(1,))
  (bn4): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (pool4): MaxPool1d(kernel_size=4, stride=4, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=64, out_features=35, bias=True)
)

In [24]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")

Total number of paramaters: 26,915
Total tainable parameters: 26,915


### Criterion and Optimizer

In [25]:
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss().to(device)


### Categorical accuracy function

In [26]:
def categorical_accuracy(preds, y):
  top_pred = preds.argmax(1, keepdim = True)
  correct = top_pred.eq(y.view_as(top_pred)).sum()
  acc = correct.float() / y.shape[0]
  return acc

### Train and evaluation function

In [27]:
def train(model, iterator, optimizer, criterion):
  epoch_loss = 0
  epoch_acc = 0
  model.train()
  for X, y in iterator:
    X, y = X.to(device), y.type(torch.LongTensor).to(device)
    optimizer.zero_grad()
    predictions = model(X).squeeze(1)
    loss = criterion(predictions, y)
    acc = categorical_accuracy(predictions, y)
    loss.backward()
    optimizer.step()
    epoch_loss += loss.item()
    epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
  epoch_loss = 0
  epoch_acc = 0
  model.eval()
  with torch.no_grad():
    for X, y in iterator:
      X, y = X.to(device), y.type(torch.LongTensor).to(device)
      predictions = model(X).squeeze(1)
      loss = criterion(predictions, y)
      acc = categorical_accuracy(predictions, y)
      epoch_loss += loss.item()
      epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Train Loop

We will create helper functions that will help us to visualize training each and every epoch.

1. Time to string 

In [28]:
import time
from prettytable import PrettyTable

In [29]:

def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

2. tabulate training

In [30]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)
  

In [32]:
N_EPOCHS = 10
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss, train_acc = train(model, train_loader, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, val_loader, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    end = time.time()
    visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)

+--------------------------------------------+
|     EPOCH: 01/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 1.833 |    0.490 | 0:02:47.17 |
| Validation | 1.131 |    0.672 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.997 |    0.714 | 0:02:45.91 |
| Validation | 0.945 |    0.714 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   

### Evaluating the best model.

In [33]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_loader, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.631 | Test Acc: 81.83%


### Model inference

In [34]:
def index_to_label(i):
  return labels[i]

In [59]:
for batch in train_loader:
  break

In [65]:
def predict(index):
  label = torch.argmax(model(batch[0][index].unsqueeze(1).to(device)), dim=-1).squeeze().item()

  preds ={
      "actual_label": batch[1][index].item(),
      "predicted_label": label,
      "actual_class": index_to_label(batch[1][index].item()),
      "predicted_class": index_to_label(label)
  }
  return preds


### Train example

In [76]:
index=0
sample_rate=8000
waveform = batch[0][index]
predict(index)

{'actual_class': 'up',
 'actual_label': 27,
 'predicted_class': 'up',
 'predicted_label': 27}

In [77]:
ipd.Audio(waveform.numpy(), rate=sample_rate)

In [78]:
index=-1
sample_rate=8000
waveform = batch[0][index]
predict(index)

{'actual_class': 'four',
 'actual_label': 25,
 'predicted_class': 'four',
 'predicted_label': 25}

In [79]:
ipd.Audio(waveform.numpy(), rate=sample_rate)

### Valid example

In [82]:
for batch in val_loader:
  break

In [83]:
index=0
sample_rate=8000
waveform = batch[0][index]
predict(index)

{'actual_class': 'nine',
 'actual_label': 16,
 'predicted_class': 'no',
 'predicted_label': 33}

In [84]:
ipd.Audio(waveform.numpy(), rate=sample_rate)

In [85]:
index=-1
sample_rate=8000
waveform = batch[0][index]
predict(index)

{'actual_class': 'sheila',
 'actual_label': 9,
 'predicted_class': 'sheila',
 'predicted_label': 9}

In [86]:
ipd.Audio(waveform.numpy(), rate=sample_rate)

### Test example

In [87]:
for batch in test_loader:
  break

In [88]:
index=0
sample_rate=8000
waveform = batch[0][index]
predict(index)

{'actual_class': 'follow',
 'actual_label': 31,
 'predicted_class': 'follow',
 'predicted_label': 31}

In [89]:
ipd.Audio(waveform.numpy(), rate=sample_rate)

In [90]:
index=-1
sample_rate=8000
waveform = batch[0][index]
predict(index)

{'actual_class': 'off',
 'actual_label': 10,
 'predicted_class': 'off',
 'predicted_label': 10}

In [91]:
ipd.Audio(waveform.numpy(), rate=sample_rate)