## ASR-based keyword spotting system

What follows presents a keyword spotting (KWS) system based solely on an ASR engine. The pretrained model can be found [here](http://zamia-speech.org/asr/). It was trained on ~ **1500 hours** of speech (tedlium3, librispeech, voxforge and other open source datasets) and achieves **8.84% WER**. For more details, see *Zamia*'s [release post](https://goofy.zamia.org/lm/2019/06/20/1500-Hours-160k-Words-English-Zamia-Speech-Models-Released.html). The KWS pipeline is straightforward: the user provides a wav file sampled at **16kHz**, the ASR engine decodes the speech to text, and the keywords are then searched for in the resulting transcription.

### Steps (first time)

Perform the following steps, in order, to be able to run the keyword spotting system defined below.

Download the pretrained model and uncompress it (**500MB** uncompressed)
- `wget https://goofy.zamia.org/zamia-speech/asr-models/kaldi-generic-en-tdnn_f-r20190609.tar.xz`
- `tar xf kaldi-generic-en-tdnn_f-r20190609.tar.xz`

Clone the repo and move the pretrained model into it
- `git clone https://gitlab.com/SpeechMasterStudents/kws`, then create the folders `model` and `transcriptions` in `kws/asr_kws/kaldi-generic-en-tdnn_f-r20190609/v1/`
- copy the contents of `kaldi-generic-en-tdnn_f-r20190609/model` (from the extracted archive) into `kws/asr_kws/kaldi-generic-en-tdnn_f-r20190609/v1/model/` (in the repo)

Upload your *.wav* audio files to `input/audio`, then create the files `transcriptions/spk2utt` and `transcriptions/wav.scp` by running the function below

In [1]:
from os import listdir
from os.path import isfile, join

def prepare_files(path2audio='input/audio/', path2files='transcriptions/', path_in_docker='/opt/kaldi/egs/alibel_model/v1/input/audio/'):
    '''
    Creates the files wav.scp and spk2utt.
    - wav.scp : a 2-column file, left utterance_id, right path to the audio file inside the docker image
    - spk2utt : a 2-column file, left speaker_id, right utterance_id
    We assume that each audio file contains one utterance and that every audio file has a different speaker_id.

    Args:
    path2audio: local path to the .wav audio files.
    path2files: path to the transcriptions folder.
    path_in_docker: path to the .wav audio files inside the docker image (replace `alibel_model` with the name of the folder you created)

    '''
    # list all .wav audio files (skipping hidden and non-wav files), in a stable order
    filenames = sorted(f for f in listdir(path2audio)
                       if isfile(join(path2audio, f)) and f.endswith('.wav') and not f.startswith('.'))

    # prepare wav.scp file
    with open(join(path2files, 'wav.scp'), mode='w+') as fp:
        for i, filename in enumerate(filenames):
            utt_id = filename.split('.wav')[0] + '_utt' + str(i)
            fp.write(utt_id + ' ' + join(path_in_docker, filename) + '\n')
    
    print('File wav.scp created in ' + join(path2files, 'wav.scp'))        

    # prepare spk2utt file
    with open(join(path2files, 'spk2utt'), mode='w+') as fp:
        for i, filename in enumerate(filenames):
            utt_id = filename.split('.wav')[0] + '_utt' + str(i)
            spk_id = 'speaker_' + str(i)
            fp.write(spk_id + ' ' + utt_id + '\n')
            
    print('File spk2utt created in ' + join(path2files, 'spk2utt'))   

In [2]:
prepare_files()

File wav.scp created in transcriptions/wav.scp
File spk2utt created in transcriptions/spk2utt
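
Before launching the decoder, it can help to check that the two files agree with each other. The sketch below is not part of the repo: `check_kaldi_files` is a hypothetical helper that assumes the layout written by `prepare_files()` (utterance id in the first column of `wav.scp`, utterance ids after the speaker id in `spk2utt`) and file names without spaces.

```python
from os.path import join


def check_kaldi_files(path2files='transcriptions/'):
    """Return the utterance ids that appear in only one of wav.scp / spk2utt."""
    # wav.scp: the first column is the utterance id
    with open(join(path2files, 'wav.scp')) as fp:
        scp_ids = {line.split(' ', 1)[0] for line in fp if line.strip()}

    # spk2utt: the first column is the speaker id, the rest are utterance ids
    spk_ids = set()
    with open(join(path2files, 'spk2utt')) as fp:
        for line in fp:
            spk_ids.update(line.split()[1:])

    # symmetric difference: ids present in exactly one of the two files
    return sorted(scp_ids ^ spk_ids)
```

An empty list means that every utterance in `wav.scp` has a speaker entry in `spk2utt` and vice versa.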


Download Kaldi's Docker image (CPU-based)
- `docker pull kaldiasr/kaldi`
- run the image and create a new folder (e.g. `alibel_model`) in `/opt/kaldi/egs/`

Save changes and stop container 
- `docker commit <container_id> kaldiasr/kaldi:latest`
- `docker stop <container_id>`

Run the newly saved image and mount the repo into it to perform decoding
- `docker run -it -v ~/kws/asr_kws/kaldi-generic-en-tdnn_f-r20190609:/opt/kaldi/egs/alibel_model kaldiasr/kaldi:latest`
- `cd` to `egs/alibel_model/v1` and run `./decode.sh`

Transcriptions can be found under `transcriptions/transcribed_speech.txt` 

Run the function `kws()` defined below to search for the desired keywords

### Keyword spotting

In [3]:
def kws(keywords, path2transcription='transcriptions/transcribed_speech.txt'):
    '''
    Counts, for each keyword, the utterances whose transcription contains it
    (substring match) and prints the resulting dict.

    Args:
    keywords: customizable list of keywords.
    path2transcription: path to the transcription file (i.e. path to file `transcribed_speech.txt`)

    Prints:
    Dict, {keyword: number_of_utterances_where_spotted}
    '''

    # Initialize keyword dictionary
    kw_dict = {kw: 0 for kw in keywords}

    with open(path2transcription) as fp:
        for line in fp:
            # each line is `<utterance_id> <transcription>`; the transcription may be empty
            utt_id, _, transcription = line.strip().partition(' ')
            for kw in keywords:
                if kw in transcription:
                    kw_dict[kw] += 1

    print('ASR based keyword spotting results for the keywords provided :\n')
    print(kw_dict)
    

In [4]:
kws(keywords=['learn', 'protest', 'chief'])

ASR based keyword spotting results for the keywords provided :

{'learn': 1, 'protest': 2, 'chief': 1}
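
Note that `kw in transcription` is a substring test, so `learn` also matches `learning`, and each utterance contributes at most one count per keyword. If whole-word occurrence counts are preferred, one possible variant is sketched below. `count_keywords` is a hypothetical name, not a function of the repo; it assumes single-word keywords, the same `utt_id transcription` line format, and that keywords and transcriptions use the same casing.

```python
def count_keywords(keywords, lines):
    """Count whole-word occurrences of each keyword over transcription lines."""
    counts = {kw: 0 for kw in keywords}
    for line in lines:
        # drop the utterance id (first token), keep the transcribed words
        words = line.split()[1:]
        for word in words:
            if word in counts:
                counts[word] += 1
    return counts


lines = ['a_utt0 we learn by learning', 'b_utt1 the chief met the chief']
print(count_keywords(['learn', 'chief'], lines))
# prints {'learn': 1, 'chief': 2}
```

Here `learning` is not counted for `learn`, and the two occurrences of `chief` in the same utterance are both counted.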


### Steps (any other time)

- Upload the desired audio files to `input/audio`. Note that audio files must be sampled at **16kHz** and must be **less than 20 seconds long**.
- Create files `wav.scp` and `spk2utt`: `python3 prepare_files.py --path2audio='input/audio/' --path2files='transcriptions/' --path_in_docker='/opt/kaldi/egs/alibel_model/v1/input/audio/'`
- Launch docker image and attach repo to it: `docker run -it -v ~/kws/asr_kws/kaldi-generic-en-tdnn_f-r20190609:/opt/kaldi/egs/alibel_model kaldiasr/kaldi:latest`
- Run the `decode.sh` script: `cd egs/alibel_model/v1` and `./decode.sh`
- Run keyword spotting system: `python3 asr_kws.py --keywords 'keyword_1' 'keyword_2' --path2transcription='transcriptions/transcribed_speech.txt'`
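
The two constraints on the input audio (16kHz sampling rate, less than 20 seconds) can be checked before decoding. A minimal sketch using Python's standard `wave` module follows; `check_wav` is a hypothetical helper that is not part of the repo, and it assumes uncompressed PCM wav files.

```python
import wave


def check_wav(path, expected_rate=16000, max_seconds=20.0):
    """Return (ok, sample_rate, duration_in_seconds) for a PCM wav file."""
    with wave.open(path, 'rb') as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / float(rate)
    # ok only if the file matches the expected rate and is short enough
    ok = rate == expected_rate and duration < max_seconds
    return ok, rate, duration
```

Files for which `ok` is `False` would need to be resampled or split before being listed in `wav.scp`.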

#### What to do next

- Create an evaluation dataset to set a baseline score for this method
- Depending on the score, investigate whether an end-to-end (**ASR-free**) approach would be better suited
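
As a starting point for the baseline score, one option is to compare the keywords detected in each utterance against hand-labelled ones and report precision and recall. The sketch below is only an illustration under that assumption: `precision_recall` and the two dict layouts are hypothetical, and no such evaluation exists in the repo yet.

```python
def precision_recall(reference, hypothesis):
    """Compare detected keywords with labelled ones, utterance by utterance.

    reference, hypothesis: dict {utt_id: set of keywords}
    Utterances missing from `reference` are ignored.
    """
    tp = fp = fn = 0
    for utt_id, ref_kws in reference.items():
        hyp_kws = hypothesis.get(utt_id, set())
        tp += len(ref_kws & hyp_kws)   # keywords correctly detected
        fp += len(hyp_kws - ref_kws)   # keywords detected but not labelled
        fn += len(ref_kws - hyp_kws)   # labelled keywords that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The same function could later score an ASR-free system on the same dataset, which would make the two approaches directly comparable.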