## How to Access Open Voice Training Data: Mozilla's Common Voice Platform

**Assessment Two**

## Introduction
In this notebook, you will use a pre-trained DeepSpeech model to transcribe English speech to text.

### Notebook Configuration
In this section, we shall install all the libraries and files required to build a simple transcriber.
1. <a name="deepspeech-install"></a>Install the `deepspeech` library.
2. Download the English (en-US) pre-trained `deepspeech` model.
3. Download the sample audio files to test the model.
4. Test the model on the downloaded audio samples.

In [1]:
# 1. Install the deepspeech library
!pip install deepspeech

# 2. Download the English (en-US) pre-trained DeepSpeech model.
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.0/deepspeech-0.8.0-models.pbmm
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.0/deepspeech-0.8.0-models.tflite
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.0/deepspeech-0.8.0-models.scorer

# 3. Import the deepspeech library
import deepspeech

# 4. Initialize the model with the pretrained weights and the scorer.
model_file_path = 'deepspeech-0.8.0-models.pbmm'
scorer_file_path = 'deepspeech-0.8.0-models.scorer'
model = deepspeech.Model(model_file_path)

# Scorer initialization
model.enableExternalScorer(scorer_file_path)

Collecting deepspeech
  Downloading https://files.pythonhosted.org/packages/43/ff/f17ff70af03d27afb749f866cab2e6f5def29e02d5aa2762afc68ea92eab/deepspeech-0.8.2-cp36-cp36m-manylinux1_x86_64.whl (8.3MB)
     |████████████████████████████████| 8.3MB 6.4MB/s 
Installing collected packages: deepspeech
Successfully installed deepspeech-0.8.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   652  100   652    0     0   2910      0 --:--:-- --:--:-- --:--:--  2910
100  180M  100  180M    0     0  27.7M      0  0:00:06  0:00:06 --:--:-- 34.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   654  100   654    0     0   3070      0 --:--:-- --:--:-- --:--:--  3070
100 45.1M  100 45.1M    0     0  17.6M      0  0:00:02  0:00:02 --:--:-- 32.3M
  % Total    % Receiv

Alternatively, if you need to use the `deepspeech` version that supports a [Graphics Processing Unit](https://en.wikipedia.org/wiki/Graphics_processing_unit), you could instead use the command below. _(Uncomment the second line before running the cell.)_

Make sure to change the runtime type from **None** to **GPU**. To do this, follow these steps:
1. Click the **Runtime** tab in the notebook's top menu.
2. Click **Change runtime type**.
3. Change from **None** to **GPU**.

In [None]:
# 1. (Alternatively) Install the DeepSpeech Library with GPU support.
# !pip install deepspeech-gpu

The first code cell also downloaded the pretrained model files, which we shall use to transcribe spoken sentences from voice (speech) to text.


---
Notes:
- The model with the `.pbmm` extension is memory mapped and thus memory efficient and fast to load. 
- The model with the `.tflite` extension is converted to use TFLite, has post-training quantization enabled, and is more suitable for resource constrained environments.


Next, we shall download audio samples that we shall use to test the pre-trained model. The downloaded files will be unzipped before use.

In [2]:
# 3. Download the sample audio files to test the model.
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.0/audio-0.8.0.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   642  100   642    0     0   2853      0 --:--:-- --:--:-- --:--:--  2853
100  193k  100  193k    0     0   211k      0 --:--:-- --:--:-- --:--:--  383k


In [3]:
# Unzip the files.
!tar -xvzf audio-0.8.0.tar.gz

._audio
audio/
audio/._2830-3980-0043.wav
audio/2830-3980-0043.wav
audio/._Attribution.txt
audio/Attribution.txt
audio/._4507-16021-0012.wav
audio/4507-16021-0012.wav
audio/._8455-210777-0068.wav
audio/8455-210777-0068.wav
audio/._License.txt
audio/License.txt


Next, we test the pretrained model on the sample audio files.

To do this, we shall run the `deepspeech` command, which was made available after the [first step](#deepspeech-install) of the notebook configuration. 

---
Notes:
- We run the `deepspeech` command with the `-h` flag to better understand how the command is used.



In [4]:
!deepspeech -h

usage: deepspeech [-h] --model MODEL [--scorer SCORER] --audio AUDIO
                  [--beam_width BEAM_WIDTH] [--lm_alpha LM_ALPHA]
                  [--lm_beta LM_BETA] [--version] [--extended] [--json]
                  [--candidate_transcripts CANDIDATE_TRANSCRIPTS]

Running DeepSpeech inference.

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model (protocol buffer binary file)
  --scorer SCORER       Path to the external scorer file
  --audio AUDIO         Path to the audio file to run (WAV format)
  --beam_width BEAM_WIDTH
                        Beam width for the CTC decoder
  --lm_alpha LM_ALPHA   Language model weight (lm_alpha). If not specified,
                        use default from the scorer package.
  --lm_beta LM_BETA     Word insertion bonus (lm_beta). If not specified, use
                        default from the scorer package.
  --version             Print version and exits
  --extended          

<a name="test-step"></a>From the output above, the `deepspeech` command requires two arguments: `--model`, the path to the model, and `--audio`, the path to the audio file to transcribe. The `--scorer` argument is optional.
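
As a small illustration of the required and optional arguments, the sketch below assembles the argument list for one `deepspeech` run. The helper name `build_deepspeech_command` is hypothetical and not part of the `deepspeech` package. It only uses the `--model`, `--audio`, and `--scorer` flags listed in the help output above.

```python
def build_deepspeech_command(model_path, audio_path, scorer_path=None):
    """Return the argument list for one `deepspeech` CLI run (hypothetical helper)."""
    # --model and --audio are required by the CLI.
    command = ["deepspeech", "--model", model_path, "--audio", audio_path]
    # --scorer is optional, so it is only added when a path is given.
    if scorer_path is not None:
        command += ["--scorer", scorer_path]
    return command


command = build_deepspeech_command(
    "deepspeech-0.8.0-models.pbmm",
    "audio/2830-3980-0043.wav",
    scorer_path="deepspeech-0.8.0-models.scorer",
)
print(" ".join(command))
```

On a machine where the `deepspeech` command is installed, a list like this could be passed to `subprocess.run`.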

In [5]:
# 4. Test the pretrained deepspeech model
!deepspeech --model deepspeech-0.8.0-models.pbmm --scorer deepspeech-0.8.0-models.scorer \
            --audio audio/2830-3980-0043.wav

# Experiment with the other sample .wav files to check whether the transcription is correct.

Loading model from file deepspeech-0.8.0-models.pbmm
TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
2020-08-31 18:07:34.694321: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
Loaded model in 0.0106s.
Loading scorer from files deepspeech-0.8.0-models.scorer
Loaded scorer in 0.000259s.
Running inference.
experience proves this
Inference took 1.459s for 1.975s audio file.


Notes:
- You could download one of the `.wav` files from the audio folder, listen to the speech, and compare it with the second-to-last line of the `deepspeech` command output.
- If the two match, everything is working as expected.
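
If you capture the command output as text, the transcript can be picked out programmatically. The sketch below assumes the output has the same layout as the cell above, with the transcript on the second-to-last non-empty line. The helper name `extract_transcript` is hypothetical, and the sample text is an abbreviated copy of that output.

```python
def extract_transcript(cli_output):
    """Return the second-to-last non-empty line of the captured output."""
    lines = [line.strip() for line in cli_output.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least two non-empty lines of output")
    return lines[-2]


# Abbreviated copy of the output shown in the cell above.
sample_output = """Loading model from file deepspeech-0.8.0-models.pbmm
Loaded model in 0.0106s.
Running inference.
experience proves this
Inference took 1.459s for 1.975s audio file.
"""

print(extract_transcript(sample_output))
```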

### DeepSpeech Model
In this section, we shall set up the model that we shall use to transcribe the sentence that we shall speak. The `model` object was already initialized with the pretrained weights and the scorer in the first code cell.

This is similar to what was done in the [test step](#test-step). At this stage, the model is ready to receive and transcribe spoken English sentences.

### A Simple Transcriber
A transcriber consists of two main modules: one that captures voice from a microphone, and one that converts the voice to text. These modules work simultaneously. The voice-capturing module keeps producing chunks of the speech stream, while the other module listens for these chunks and converts them to text as they arrive, updating the transcribed text.
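
The two-module idea can be sketched with a queue shared between two threads. This is only an illustration using the Python standard library: the function names are hypothetical, the "audio chunks" are plain byte strings, and the `transcribe_chunk` callable stands in for a real speech-to-text model.

```python
import queue
import threading


def capture_chunks(chunks, chunk_queue):
    """Producer: push each chunk onto the queue, then a None sentinel."""
    for chunk in chunks:
        chunk_queue.put(chunk)
    chunk_queue.put(None)  # signals that the stream has ended


def transcribe_chunks(chunk_queue, transcribe_chunk, results):
    """Consumer: convert chunks to text as they arrive."""
    while True:
        chunk = chunk_queue.get()
        if chunk is None:
            break
        results.append(transcribe_chunk(chunk))


def run_pipeline(chunks, transcribe_chunk):
    """Run the producer and the consumer concurrently and join the text."""
    chunk_queue = queue.Queue()
    results = []
    producer = threading.Thread(target=capture_chunks, args=(chunks, chunk_queue))
    consumer = threading.Thread(
        target=transcribe_chunks, args=(chunk_queue, transcribe_chunk, results)
    )
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return " ".join(results)


# Stand-in data: each "chunk" already contains its own text.
fake_chunks = [b"experience", b"proves", b"this"]
text = run_pipeline(fake_chunks, lambda chunk: chunk.decode("ascii"))
print(text)
```

A single consumer reading from a first-in, first-out queue keeps the chunks in the order they were captured.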

---
**[TO-DO]**<br>
To capture your custom sentence:
1. Download and install the [Audacity Tool](https://www.fosshub.com/Audacity.html?dwl=audacity-win-2.4.2.exe) _(This URL links directly to the `.exe` file.)_
2. When the program has installed successfully, click the record button &#9210;.
3. Record your sentence(s). When done, click the stop button &#9209;.
4. To save the recording, open the **File** menu and export the file as `.wav`. The transcription function used later in this notebook reads the file with Python's `wave` module, which does not read `.mp3` files.
5. Finally, upload the file to this notebook and note its path.
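
Before transcribing a recording, it can help to check its format. The sketch below uses only the standard-library `wave` module. It assumes the model expects 16 kHz, mono, 16-bit audio, which matches the PyAudio parameters used in the optional real-time section of this notebook. The function names and the generated `demo.wav` file are hypothetical.

```python
import os
import tempfile
import wave


def describe_wav(filepath):
    """Return the basic format properties of a WAV file."""
    with wave.open(filepath, "rb") as wav_file:
        rate = wav_file.getframerate()
        frames = wav_file.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "duration_seconds": frames / rate,
        }


def matches_expected_format(info, expected_rate=16000):
    """Check for mono, 16-bit audio at the expected sample rate (an assumption)."""
    return (
        info["sample_rate"] == expected_rate
        and info["channels"] == 1
        and info["sample_width_bytes"] == 2
    )


# Demo: write one second of silence so the example runs without a recording.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info, matches_expected_format(info))
```

If a recording does not match, Audacity can change the sample rate and channel count before export.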



### Transcription Process
With the voice file uploaded, we shall run it through the model for the transcription.

In [None]:
# 1. Set the filepath: change 'audio/4507-16021-0012.wav' to the full path of your custom sentence file.
filepath = 'audio/4507-16021-0012.wav'

In [None]:
#@title simple_transcriber function (Run Cell) { display-mode: "form" }
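import wave
import numpy as np

# transcription function
def simple_transcriber(filepath):
  """
    This function transcribes the input sentence from voice to text.

    filepath <string> path/to/custom_sentence (a 16-bit PCM .wav file)
  """
  # Read the whole file; the context manager closes it afterwards.
  with wave.open(filepath, 'rb') as w:
    rate = w.getframerate()
    frames = w.getnframes()
    buffer = w.readframes(frames)

  # Warn if the file's sample rate differs from the rate the model expects.
  if rate != model.sampleRate():
    print('Warning: file is {} Hz but the model expects {} Hz.'.format(
        rate, model.sampleRate()))

  # DeepSpeech expects a 16-bit integer array, but the wave library returns a
  # byte array, so we use numpy to convert it to the 16-bit format.
  data = np.frombuffer(buffer, dtype=np.int16)

  # Transcribe
  text = model.stt(data)
  return text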

In [None]:
# Transcription
text_from_speech = simple_transcriber(filepath)
print(text_from_speech)
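
To make the byte-to-integer conversion step concrete, the sketch below decodes a few bytes of 16-bit audio by hand with the standard-library `struct` module. It is an illustration of what the conversion means, not a replacement for the `np.frombuffer` call in `simple_transcriber`. The helper name `bytes_to_int16_samples` is hypothetical, and little-endian byte order is assumed, as used by WAV files.

```python
import struct


def bytes_to_int16_samples(buffer):
    """Decode little-endian 16-bit PCM bytes into a list of Python ints."""
    if len(buffer) % 2 != 0:
        raise ValueError("16-bit PCM data must have an even number of bytes")
    sample_count = len(buffer) // 2
    # "<" means little-endian; "h" means one signed 16-bit integer.
    return list(struct.unpack("<{}h".format(sample_count), buffer))


# Six bytes hold three samples, because each sample takes two bytes.
raw = b"\x01\x00\xff\xff\x00\x80"
samples = bytes_to_int16_samples(raw)
print(samples)  # [1, -1, -32768]
```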

### Alternatively (Optional)
Instead of using the Audacity Tool, it is possible to record your custom sentence and get it transcribed in real time.

---
**NOTE:**<br>
This cannot be done on Google Colaboratory because that environment runs on a virtual machine without access to a microphone. Using this method requires downloading this notebook and running it locally.

Steps
1. Install the `pyaudio` module, a Python binding for [PortAudio](http://www.portaudio.com/), an open-source audio input and output library. We install `pyaudio` in two steps:
- Install the dependency tools in the environment.
- Then, using `pip`, install the `pyaudio` library.

---

- The `pyaudio` library captures the sentence(s) that you speak.

In [None]:
# 5. Install pyaudio module to capture audio.
!apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
!pip install pyaudio

Run the cell below only on your local computer. In this environment it raises an `OSError` because there is no access to a microphone.

In [None]:
#@title Run this cell only on your local computer! { display-mode: "form" }

import time

import numpy as np
import pyaudio

# Create a Streaming session
context = model.createStream()

# Encapsulate DeepSpeech audio feeding into a callback for PyAudio
text_so_far = ''
def process_audio(in_data, frame_count, time_info, status):
    global text_so_far
    data16 = np.frombuffer(in_data, dtype=np.int16)
    # In DeepSpeech 0.7 and later, the streaming methods are called on the
    # stream object returned by createStream(), not on the model.
    context.feedAudioContent(data16)
    text = context.intermediateDecode()
    if text != text_so_far:
        print('Interim text = {}'.format(text))
        text_so_far = text
    return (in_data, pyaudio.paContinue)

# PyAudio parameters
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK_SIZE = 1024

# Feed audio to deepspeech in a callback to PyAudio
audio = pyaudio.PyAudio()
stream = audio.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK_SIZE,
    stream_callback=process_audio
)

def real_time_transcriber():
    print('Please start speaking, when done press Ctrl-C ...')
    stream.start_stream()

    try:
        while stream.is_active():
            time.sleep(0.1)
    except KeyboardInterrupt:
        # PyAudio
        stream.stop_stream()
        stream.close()
        audio.terminate()
        print('Finished recording.')
        # DeepSpeech
        text = context.finishStream()
        print('Final text = {}'.format(text))

In [None]:
# Run this cell and follow the prompts.
real_time_transcriber()
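
As a rough guide to how often the callback fires, the sketch below works out the duration and size of one audio chunk from the PyAudio parameters used above (`RATE`, `CHANNELS`, `CHUNK_SIZE`). The two helper functions are hypothetical, and the 2 bytes per sample corresponds to the 16-bit `paInt16` format.

```python
def chunk_duration_seconds(chunk_size, rate):
    """Seconds of audio in one chunk of `chunk_size` frames."""
    return chunk_size / rate


def chunk_size_bytes(chunk_size, channels, sample_width_bytes=2):
    """Bytes in one chunk; 16-bit samples take 2 bytes each."""
    return chunk_size * channels * sample_width_bytes


RATE = 16000
CHANNELS = 1
CHUNK_SIZE = 1024

duration = chunk_duration_seconds(CHUNK_SIZE, RATE)
size = chunk_size_bytes(CHUNK_SIZE, CHANNELS)
print('{:.3f} s per chunk, {} bytes per chunk'.format(duration, size))
```

With these values, each callback receives 64 ms of audio, so the interim text can be updated roughly 15 times per second.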