# Deep Speech 2

## 1. Download and preprocess data

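The script below downloads the Common Voice corpus, converts the MP3 clips to WAV, and writes the preprocessed data to the directory `./CommonVoice_dataset/`. The `--min-duration` and `--max-duration` flags restrict the data to clips between 1 and 15 seconds long.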
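
As a rough illustration of what a duration filter such as `--min-duration 1 --max-duration 15` amounts to, the sketch below measures the length of a WAV file with the standard-library `wave` module and keeps only clips inside the range. The function names and the inclusive bounds are assumptions made for this example; they are not taken from `common_voice.py`.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def filter_by_duration(paths, min_duration=1.0, max_duration=15.0):
    """Keep only the clips whose duration lies in [min_duration, max_duration].

    Hypothetical helper: the inclusive bounds are an assumption.
    """
    return [
        path for path in paths
        if min_duration <= wav_duration_seconds(path) <= max_duration
    ]
```

Very short clips carry little usable speech and very long ones inflate memory use per batch, which is the usual motivation for pruning on both ends.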

In [8]:
!python deepspeech.pytorch/data/common_voice.py --min-duration 1 --max-duration 15

Could not find downloaded Common Voice archive, Downloading corpus...
100% [..............................................] 12852160484 / 12852160484Unpacking corpus to CommonVoice_dataset/CV_unpacked ...
Converting mp3 to wav for CommonVoice_dataset/CV_unpacked/cv_corpus_v1/cv-valid-dev.csv.
sox WARN rate: rate clipped 1 samples; decrease volume?
sox WARN dither: dither clipped 1 samples; decrease volume?
sox WARN rate: rate clipped 18 samples; decrease volume?
sox WARN dither: dither clipped 14 samples; decrease volume?
sox WARN rate: rate clipped 14 samples; decrease volume?
sox WARN dither: dither clipped 12 samples; decrease volume?
sox WARN rate: rate clipped 1 samples; decrease volume?
sox WARN dither: dither clipped 1 samples; decrease volume?
sox WARN rate: rate clipped 171 samples; decrease volume?
sox WARN dither: dither clipped 148 samples; decrease volume?
sox WARN rate: rate clipped 2 samples; decrease volume?
sox WARN dither: dither clipped 2 samples; decrease volume?

## 2. Model Training

### Train Acoustic Model
Training the acoustic model only requires passing the training and validation manifest files to the training script. The data loader converts each audio clip into a spectrogram before passing it to the neural network.
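
To make the audio-to-spectrogram step concrete, here is a minimal pure-Python sketch of a log-magnitude spectrogram. It assumes a 16 kHz sample rate, a 20 ms Hamming window, and a 10 ms stride; these values, the function names, and the naive DFT are choices made for this illustration (a real data loader would use an FFT library and may pad or center frames differently).

```python
import cmath
import math


def hamming(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def log_spectrogram(samples, sample_rate=16000, window_size=0.02,
                    window_stride=0.01):
    """Return a list of frames, each a list of log(1 + |DFT|) values."""
    win_length = round(sample_rate * window_size)    # samples per window
    hop_length = round(sample_rate * window_stride)  # samples between windows
    window = hamming(win_length)
    n_bins = win_length // 2 + 1                     # non-negative frequencies
    frames = []
    for start in range(0, len(samples) - win_length + 1, hop_length):
        frame = [s * w for s, w in zip(samples[start:start + win_length], window)]
        row = []
        for k in range(n_bins):
            # Naive DFT: fine for a sketch, far too slow for real training data.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / win_length)
                        for n, x in enumerate(frame))
            row.append(math.log1p(abs(coeff)))
        frames.append(row)
    return frames


# A 1 kHz tone, 30 ms long, yields 2 frames of 161 frequency bins each.
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(480)]
spectrogram = log_spectrogram(tone)
print(len(spectrogram), len(spectrogram[0]))
```

With these assumed settings each frame has 161 frequency bins spaced 50 Hz apart, so the 1 kHz tone peaks in bin 20.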

In [43]:
# We can view the default list of parameters for training by providing the --help flag
!cd deepspeech.pytorch/ && python train.py --help

usage: train.py [-h] [--train-manifest DIR] [--val-manifest DIR]
                [--sample-rate SAMPLE_RATE] [--batch-size BATCH_SIZE]
                [--num-workers NUM_WORKERS] [--labels-path LABELS_PATH]
                [--window-size WINDOW_SIZE] [--window-stride WINDOW_STRIDE]
                [--window WINDOW] [--hidden-size HIDDEN_SIZE]
                [--hidden-layers HIDDEN_LAYERS] [--rnn-type RNN_TYPE]
                [--epochs EPOCHS] [--cuda] [--lr LR] [--momentum MOMENTUM]
                [--max-norm MAX_NORM] [--learning-anneal LEARNING_ANNEAL]
                [--silent] [--checkpoint]
                [--checkpoint-per-batch CHECKPOINT_PER_BATCH] [--visdom]
                [--tensorboard] [--log-dir LOG_DIR] [--log-params] [--id ID]
                [--save-folder SAVE_FOLDER] [--model-path MODEL_PATH]
                [--continue-from CONTINUE_FROM] [--finetune] [--augment]
                [--noise-dir NOISE_DIR] [--noise-prob NOISE_PROB]
                [--noi

In [None]:
!cd deepspeech.pytorch/ && python train.py --train-manifest /workspace/cv-valid-train_manifest.csv \
        --val-manifest /workspace/cv-valid-dev_manifest.csv --cuda

DeepSpeech(
  (conv): MaskConv(
    (seq_module): Sequential(
      (0): Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5))
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): Hardtanh(min_val=0, max_val=20, inplace)
      (3): Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5))
      (4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Hardtanh(min_val=0, max_val=20, inplace)
    )
  )
  (rnns): Sequential(
    (0): BatchRNN(
      (rnn): GRU(1312, 800, bidirectional=True)
    )
    (1): BatchRNN(
      (batch_norm): SequenceWise (
      BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      (rnn): GRU(800, 800, bidirectional=True)
    )
    (2): BatchRNN(
      (batch_norm): SequenceWise (
      BatchNorm1d(800, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True))
      (rnn): GRU(800, 800, bidirection
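
The input size of the first GRU (1312) in the printout above can be derived from the two convolution layers. The sketch below applies the standard convolution output-size formula along the frequency axis, assuming the spectrogram has 161 frequency bins (a 16 kHz sample rate with a 20 ms window, which is an assumption not shown in the output) and that the first entry of each `kernel_size`, `stride`, and `padding` tuple refers to frequency.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


sample_rate = 16000   # assumed, not shown in the printout
window_size = 0.02    # assumed, in seconds
n_fft = round(sample_rate * window_size)
freq_bins = n_fft // 2 + 1

# Frequency-axis parameters copied from the two Conv2d layers in the printout.
after_conv1 = conv_out_size(freq_bins, kernel=41, stride=2, padding=20)
after_conv2 = conv_out_size(after_conv1, kernel=21, stride=2, padding=10)

out_channels = 32
rnn_input_size = out_channels * after_conv2
print(freq_bins, after_conv1, after_conv2, rnn_input_size)  # 161 81 41 1312
```

The result matches `GRU(1312, 800, bidirectional=True)`, which suggests the channel and frequency dimensions are flattened together before the recurrent layers.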

#### Test Acoustic Model

In [None]:
!cd deepspeech.pytorch/ && python test.py --model-path models/deepspeech_final.pth \
        --test-manifest /workspace/cv-valid-test_manifest.csv --decoder greedy 
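
The command above selects a greedy decoder. As a minimal sketch of what greedy decoding of CTC output generally involves (not the repository's implementation), the code below picks the most likely label in every frame, collapses consecutive repeats, and drops the blank symbol. The label set, the blank index, and the probabilities are made up for this example.

```python
def ctc_greedy_decode(frame_probs, labels, blank_index=0):
    """Decode per-frame label probabilities with the best-path heuristic."""
    # 1. Take the most likely label index in each frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_probs]
    # 2. Collapse repeats, then 3. remove blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


labels = ["_", "A", "B"]  # hypothetical label set; "_" is the CTC blank
frame_probs = [
    [0.1, 0.8, 0.1],  # A
    [0.1, 0.7, 0.2],  # A (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank (separates the two As)
    [0.2, 0.6, 0.2],  # A
    [0.1, 0.1, 0.8],  # B
    [0.2, 0.1, 0.7],  # B (repeat, collapsed)
]
print(ctc_greedy_decode(frame_probs, labels))  # AAB
```

Greedy decoding is fast because it never looks beyond the current frame, but for the same reason it cannot use a language model to choose between acoustically similar words.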

### Train Language Model
We use KenLM to train an n-gram language model on the training transcripts. The resulting model can then be combined with a beam search decoder to improve the quality of the predictions.
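
One common way to combine the two models is to rank each beam search candidate by a weighted sum of its acoustic log probability, its language model log probability, and a word-count bonus. The sketch below illustrates that idea only; the weights `alpha` and `beta`, the candidate transcripts, and all scores are hypothetical values chosen for this example.

```python
def combined_score(acoustic_log_prob, lm_log_prob, word_count,
                   alpha=0.8, beta=1.0):
    """Weighted sum of acoustic score, LM score, and a word-count bonus."""
    return acoustic_log_prob + alpha * lm_log_prob + beta * word_count


# (transcript, acoustic log prob, LM log prob): all numbers are made up.
candidates = [
    ("THIS IS FOUR THE BOY", -4.1, -14.0),
    ("THIS IS FOR THE BOY", -4.3, -9.5),
]

best = max(
    candidates,
    key=lambda c: combined_score(c[1], c[2], len(c[0].split())),
)
print(best[0])  # THIS IS FOR THE BOY
```

Here the acoustic model slightly prefers the first candidate, but the language model penalty on the unlikely word sequence tips the combined score toward the second.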

In [35]:
# Create the training transcript file, one transcript per line
import os

training_transcripts_file = 'training_transcripts.txt'
txt_dir = './CommonVoice_dataset/cv-valid-train/txt/'
with open(training_transcripts_file, 'w') as transcript_file:
    for filename in os.listdir(txt_dir):
        with open(os.path.join(txt_dir, filename), 'r') as f:
            # Each .txt file holds one transcript; strip any trailing newline
            line = f.readline().strip()
        if line:  # skip empty transcript files
            transcript_file.write(line + '\n')

In [36]:
!head training_transcripts.txt

A MONK DRESSED IN BLACK CAME TO THE GATES
IT'S THE OASIS SAID THE CAMEL DRIVER
IT WAS THE FIRST TIME SHE HAD DONE THAT
SHE'LL BE BACK IN A SECOND
THIS IS FOR THE BOY
IT WAS HIS HEART THAT WOULD TELL HIM WHERE HIS TREASURE WAS HIDDEN
IT WAS VERY HARD FOR HER TO FOCUS
AND IN THAT MOOD HE WAS GRATEFUL TO BE IN LOVE
HENDERSON WAS TAKING IT IN
HE WAS MORE CONFIDENT IN HIMSELF THOUGH AND FELT AS THOUGH HE COULD CONQUER THE WORLD


In [39]:
!kenlm/build/bin/lmplz -o 2 < training_transcripts.txt > cv_2gram_lm.arpa
!kenlm/build/bin/build_binary cv_2gram_lm.arpa cv_2gram_lm.trie

=== 1/5 Counting and sorting n-grams ===
Reading /workspace/training_transcripts.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 1851187 types 8006
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:96072 2:11763831808
Statistics:
1 8006 D1=0.672643 D2=1.09902 D3+=1.33225
2 34424 D1=0.388199 D2=0.759291 D3+=1.71911
Memory estimate for binary LM:
type     kB
probing 808 assuming -p 1.5
probing 839 assuming -r models -p 1.5
trie    372 without quantization
trie    276 assuming -q 8 -b 8 quantization 
trie    372 assuming -a 22 array pointer compression
trie    276 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:96072 2:550784
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85-

In [40]:
!head cv_2gram_lm.arpa

\data\
ngram 1=8006
ngram 2=34424

\1-grams:
-4.5863776	<unk>	0
0	<s>	-2.0972793
-1.0751288	</s>	0
-1.997083	A	-1.486713
-4.128622	MONK	-1.6754742
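
Each ARPA entry holds a base-10 log probability, the n-gram itself, and (for n-grams that can be extended) a backoff weight. The sketch below shows how a bigram backoff model of this kind scores a word given its predecessor: it uses the bigram entry when one exists, and otherwise adds the predecessor's backoff weight to the word's unigram log probability. The unigram values are copied from the output above; the bigram entry is hypothetical, since the bigram section is not shown.

```python
# word -> (log10 probability, backoff weight), copied from the 1-grams above.
unigrams = {
    "<unk>": (-4.5863776, 0.0),
    "<s>": (0.0, -2.0972793),
    "</s>": (-1.0751288, 0.0),
    "A": (-1.997083, -1.486713),
    "MONK": (-4.128622, -1.6754742),
}

# (previous word, word) -> log10 probability. Hypothetical value for illustration.
bigrams = {("A", "MONK"): -0.9}


def bigram_log10_prob(previous, word):
    """Score `word` given `previous` with a simple backoff rule."""
    if (previous, word) in bigrams:
        return bigrams[(previous, word)]
    # No bigram entry: back off to the unigram, paying the context's penalty.
    backoff = unigrams.get(previous, unigrams["<unk>"])[1]
    unigram_log_prob = unigrams.get(word, unigrams["<unk>"])[0]
    return backoff + unigram_log_prob


print(bigram_log10_prob("A", "MONK"))  # uses the bigram entry
print(bigram_log10_prob("MONK", "A"))  # backs off to the unigram for "A"
```

In the second call the score is the backoff weight of `MONK` plus the unigram log probability of `A`, which is how unseen word pairs still receive a nonzero probability.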


In [42]:
!python deepspeech.pytorch/test.py --model-path deepspeech.pytorch/models/deepspeech_final.pth --test-manifest \
    test_manifest.csv --decoder beam --beam-width 10 \
    --lm-path cv_2gram_lm.trie

Traceback (most recent call last):
  File "deepspeech.pytorch/test.py", line 27, in <module>
    model = DeepSpeech.load_model(args.model_path)
  File "/workspace/deepspeech.pytorch/model.py", line 239, in load_model
    package = torch.load(path, map_location=lambda storage, loc: storage)
  File "/miniconda/envs/py36/lib/python3.6/site-packages/torch/serialization.py", line 356, in load
    f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'deepspeech.pytorch/models/deepspeech_final.pth'

The `FileNotFoundError` above means that the checkpoint `deepspeech.pytorch/models/deepspeech_final.pth` does not exist yet, most likely because the training cell in section 2 has not been run to completion. Train the acoustic model first, then rerun this cell.
