# Keyword spotting

[Interesting Article about audio](https://www.seeedstudio.com/blog/2018/11/23/6-important-speech-recognition-technology-you-need-to-know/)

Keyword spotting consists of detecting a limited set of keywords; it is typically what wakes up IoT devices (Alexa, etc.). One of the available datasets is [Google Speech Commands](https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html), described in the [related paper](https://arxiv.org/abs/1804.03209). This is the one we are going to focus on. It contains 35 words for a total of 105,829 utterances, recorded by different users with their phone or laptop microphone (the data was collected through a web application). The dataset also contains background noise audio (see the `_background_noise_` folder), because it is important to be able to distinguish audio that contains speech from audio that contains none.

Here is the list of words and the number of occurrences of each:

![List of keywords](images/Capture.PNG)
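The counts in the figure can be recomputed from a local copy of the dataset. The sketch below assumes the extracted archive has one folder per keyword containing `.wav` files, plus the `_background_noise_` folder mentioned above; the function name `count_utterances` and the example path are illustrative, not part of the dataset tooling.

```python
from collections import Counter
from pathlib import Path


def count_utterances(dataset_root):
    """Count .wav files per keyword folder, skipping the background-noise folder."""
    counts = Counter()
    for word_dir in Path(dataset_root).iterdir():
        # Each keyword has its own folder; "_background_noise_" holds noise clips, not a keyword.
        if not word_dir.is_dir() or word_dir.name == "_background_noise_":
            continue
        counts[word_dir.name] = sum(1 for _ in word_dir.glob("*.wav"))
    return counts


# Hypothetical usage (the path is an assumption):
# for word, n in count_utterances("data/speech_commands_v0.02").most_common():
#     print(word, n)
```

Files at the top level of the dataset (README, split lists) are ignored because only directories are counted.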

For the project, we could use several of these keywords as commands for the robot to understand. We use version 2 (V2) of the dataset.
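One possible way to restrict the dataset to a handful of robot commands is to give each command its own class and group every other word into a shared "unknown" class. This is only a sketch: the choice of commands, the helper `build_label_map`, and the shortened word list are assumptions made for illustration.

```python
def build_label_map(commands, all_words):
    """Map each command to its own class index and every other word to a shared 'unknown' index."""
    commands = sorted(commands)
    unknown_index = len(commands)
    label_map = {word: i for i, word in enumerate(commands)}
    for word in all_words:
        # Words that are not robot commands all share the last index.
        label_map.setdefault(word, unknown_index)
    return label_map


# Hypothetical command subset and a shortened stand-in for the 35 dataset words.
ROBOT_COMMANDS = ["go", "stop", "left", "right"]
DATASET_WORDS = ["go", "stop", "left", "right", "yes", "no", "cat"]

label_map = build_label_map(ROBOT_COMMANDS, DATASET_WORDS)
print(label_map)
```

Background noise clips could be handled the same way with an extra "silence" class, which this sketch leaves out.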


As a side note, a framework called [fairseq](https://github.com/facebookresearch/fairseq) could be used for more complex speech recognition tasks. It is very popular (more than 20k stars on GitHub).

We will use the model from this [paper](https://arxiv.org/ftp/arxiv/papers/2101/2101.04792.pdf), as it has the best reported results on [Papers with Code](https://paperswithcode.com/sota/keyword-spotting-on-google-speech-commands) for our dataset.

In [6]:
# Dependencies (uncomment the lines you need to install).
# ! pip install nemo-toolkit
# Note: on win-64, conda reported PackagesNotFoundError for faiss-gpu on the pytorch channel.
# ! conda install -c pytorch faiss-gpu
# ! pip install TextGrid
# ! pip install hydra-core
# ! pip install pyannote.audio
# ! pip install webdataset
# ! pip install inflect
# ! pip install youtokentome
# ! pip install sentencepiece
# ! pip install transformers
# ! pip install pytorch-lightning
# ! pip install braceexpand

In [32]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [33]:
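class Res15(nn.Module):
    """Res15 encoder: one input conv followed by 13 (optionally dilated) 3x3 convs with residual links."""

    def __init__(self, n_maps, dilation=True):
        super().__init__()
        # Input convolution: 1 input channel -> n_maps feature maps.
        self.conv0 = nn.Conv2d(1, n_maps, (3, 3), padding=(1, 1), bias=False)
        self.n_layers = n_layers = 13
        # With dilation, the rate doubles every three layers (1, 1, 1, 2, 2, 2, 4, ...).
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel.
        rates = [int(2 ** (i // 3)) if dilation else 1 for i in range(n_layers)]
        self.convs = [nn.Conv2d(n_maps, n_maps, (3, 3), padding=d, dilation=d, bias=False)
                      for d in rates]
        # Register each conv and its batch norm so their parameters are tracked by the module.
        for i, conv in enumerate(self.convs):
            self.add_module("bn{}".format(i + 1), nn.BatchNorm2d(n_maps, affine=False))
            self.add_module("conv{}".format(i + 1), conv)

    def forward(self, audio_signal, length=None):
        # Add a channel axis: (batch, d1, d2) -> (batch, 1, d1, d2).
        x = audio_signal.unsqueeze(1)
        for i in range(self.n_layers + 1):
            y = F.relu(getattr(self, "conv{}".format(i))(x))
            if i == 0:
                # Optional pooling after the input conv; this class does not define "pool" itself.
                if hasattr(self, "pool"):
                    y = self.pool(y)
                old_x = y
            if i > 0 and i % 2 == 0:
                # Residual connection every two conv layers.
                x = y + old_x
                old_x = x
            else:
                x = y
            if i > 0:
                x = getattr(self, "bn{}".format(i))(x)
        # Global average pooling over the two spatial axes.
        x = x.view(x.size(0), x.size(1), -1)  # shape: (batch, n_maps, d1 * d2)
        x = torch.mean(x, 2)  # shape: (batch, n_maps)
        # Output shape: (batch, 1, n_maps); length is passed through unchanged.
        return x.unsqueeze(-2), length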
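
The dilation schedule in `Res15.__init__` controls how much of the input each output position can see. As a rough sketch, assuming stride 1 and no pooling (as in the class above, where no `pool` module is defined), the receptive field along one axis can be computed in plain Python. The helper name `receptive_field` is illustrative only.

```python
N_LAYERS = 13

# Same schedule as in Res15.__init__: the dilation doubles every three layers.
dilations = [2 ** (i // 3) for i in range(N_LAYERS)]


def receptive_field(dilations, kernel_size=3):
    """Receptive field (along one axis) of conv0 followed by the dilated convs, all with stride 1."""
    rf = 1
    rf += kernel_size - 1  # conv0, dilation 1
    for d in dilations:
        # A dilated conv widens the receptive field by (kernel_size - 1) * dilation.
        rf += (kernel_size - 1) * d
    return rf


print(dilations)                   # [1, 1, 1, 2, 2, 2, 4, 4, 4, 8, 8, 8, 16]
print(receptive_field(dilations))  # 125
```

Without dilation (all rates equal to 1), the same stack would only cover 29 input positions per axis, which is why the dilated variant captures much longer context at the same parameter count.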