<a id="top"></a>
# Speech Recognition

## OpenVINO version check:
You are currently using the latest development version of Intel® Distribution of OpenVINO™ Toolkit. Alternatively, you can open a version of this notebook for the Intel® Distribution of OpenVINO™ Toolkit LTS version by [clicking this link](../../../../openvino-lts/developer-samples/cpp/speech_recognition/speech_recognition.ipynb).


This sample reference implementation showcases an offline speech recognition application that uses an audio file, a Kaldi acoustic model, and a Kaldi language model. It also highlights the fundamental libraries:
* OpenVINO™ Inference Engine
* Intel® Speech Feature Extraction (Speech Library)
* Intel® Speech Decoder libraries (Speech Library)

<img src="img/speech_recognition_pipeline.png">



## Overview of how it works
At start-up, the sample application reads the command-line arguments, loads the audio file (.wav) into memory, and parses the configuration file. The configuration file specifies the inference device, the path to the model IR (.xml + .bin), and other settings for the Intel® Speech Feature Extraction and Intel® Speech Decoder libraries, which are described in the Configuration File section. After initialization, the audio goes through feature extraction. The resulting feature vectors serve as input to the acoustic model, where inference is run to transcribe them into context-dependent phonemes. Finally, the decoder uses a language model to turn the phonemes into words.

A job is submitted to each of the following hardware targets: the Intel® Gaussian & Neural Accelerator (GNA), an Intel® Core U-series (Whiskey Lake) CPU, an Intel® Core CPU, an Intel® Xeon® CPU, and Intel® HD Graphics GPUs.
After the entire audio file has been processed, the output is stored in the `<JOB_NAME>.o<JOB_ID>` file in the current working directory, where it can be viewed from within the Jupyter Notebook instance.

## Demonstration objectives
* Audio input with an acoustic model for Inference
* Inference performed on edge hardware (rather than on the development node hosting this Jupyter notebook)
* Demonstrate the Speech Library API in action


## Step 0: Set Up

### 0.1: Import dependencies

Run the cell below to import the Python dependencies needed to display the results and listen to the audio in this notebook.
(Tip: select the cell and press **Ctrl+Enter** to run it.)

In [None]:
!/opt/intel/openvino/bin/setupvars.sh

In [None]:
from IPython.display import HTML, Audio
import os
import time
import sys                                     
from qarpo.demoutils import *

#install pyyaml for parsing config
!pip3 install --user -U PyYAML

### 0.2: Build the speech library, the kaldi slm tool, and the offline speech recognition app.

In [None]:
!./build_speech_lib.sh

We will start by processing an audio file to see how the Speech Library and OpenVINO's Inference Engine work together to extract features, run inference on those features, and decode the phonemes to text.

We will go over the Speech Recognition Pipeline with OpenVINO in several steps:

1. Create a configuration file.
2. Understand the Speech library API that includes the Intel® Speech Feature Extraction and Intel® Speech Decoder.
3. Execute the Offline Speech Recognition application on Development Node(CPU).
4. Create a job file to target different hardware types.
5. Submit jobs to the queue.
6. View the results and hardware performance comparison.

## Download Model Files

We will use the pre-trained LibriSpeech DNN model created from the Kaldi S5 NNet1 framework.
**We do not need to run the Model Optimizer, which converts a model to IR (.xml + .bin), because that step has already been done and we are downloading the IR files directly.**

The script will produce the following files:
* **speech_recognition_config.template** - This is a template for the configuration of the parameters/options to set for the feature extraction step, the inference step, and the decode step. 
* **lspeech_s5_ext.feature_transform** - The feature transform file holds a fixed function that serves as a front end and expands dimensionality so that low-dimension inputs can be used, saving disk space and read throughput. It is used during the feature extraction stage.
* **lspeech_s5_ext.xml** - The human-readable IR file that describes the acoustic model: its layers, each layer's parameters, and how the layers are connected.
* **lspeech_s5_ext.bin** - The IR file that holds the weights of the acoustic model.
* **hclg.fst** - This is the language model/decoding graph, a Finite State Transducer (.fst) composed from the HMM structure (H), phonetic context dependency (C), lexicon (L), and grammar (G). [Visit Kaldi Documentation to read more about this](https://kaldi-asr.org/doc/graph_recipe_test.html).
* **labels.bin** - This holds the symbol table that describes the alphabet of the input and output labels for arcs in the Finite State Transducer (hclg.fst).

In [None]:
!python3 speech_recognition_model.py -c /opt/intel/openvino/data_processing/audio/speech_recognition/models/intel/lspeech_s5_ext/model.yml
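After the download finishes, it can be useful to confirm that the files listed above are actually present. The following is a minimal sketch: the file names come from the list above, while the helper name `missing_model_files` and the idea that all files land in a single directory (such as the `model/FP32` directory used in the configuration examples) are assumptions.

```python
import os

# File names taken from the list above. Whether they all live in the same
# directory is an assumption made for this sketch.
EXPECTED_MODEL_FILES = [
    "speech_recognition_config.template",
    "lspeech_s5_ext.feature_transform",
    "lspeech_s5_ext.xml",
    "lspeech_s5_ext.bin",
    "hclg.fst",
    "labels.bin",
]


def missing_model_files(model_dir, expected=EXPECTED_MODEL_FILES):
    """Return the expected file names that are not present in model_dir."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# Example usage (hypothetical directory):
# print(missing_model_files("model/FP32"))
```

An empty list means every expected file was found; any names returned point to files that still need to be downloaded or that were written to a different location.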

## Configuration File

The configuration file is critical for setting the parameters of each stage: Feature Extraction, Inference Engine, and the Decoder. We will highlight the **critical parameters** changed for each stage of the pipeline in this sample. To learn more about all the configurable options offered in each stage, visit the [OpenVINO documentation](https://docs.openvinotoolkit.org/latest/_inference_engine_samples_speech_libs_and_demos_Offline_speech_recognition_demo.html).

There are three configuration files that are pre-configured for you to use:
* speech_lib_CPU.cfg
* speech_lib_GPU.cfg
* speech_lib_GNA.cfg

We will show the CPU specific configuration file for now.


In [None]:
!cat speech_lib_CPU.cfg

### Feature Extraction Parameters

**-fe:rt:featureTransform** -> path to kaldi feature transform file 

Example:

**-fe:rt:featureTransform** model/FP32/lspeech_s5_ext.feature_transform 

### Inference Engine Parameters

**-inference:device** -> The device used to run inference (CPU|GPU|GNA_AUTO)

Example:

**-inference:device** CPU


### Decoding Parameters

**-dec:wfst:acousticModelFName** - path to the acoustic model .xml file without .xml extension

Example: 

**-dec:wfst:acousticModelFName** model/FP32/lspeech_s5_ext

**-dec:wfst:fsmFName** - path to language model 

Example:

**-dec:wfst:fsmFName** model/FP32/hclg.fst 

**-dec:wfst:outSymsFName** - Path to Symbols file.

Example:

**-dec:wfst:outSymsFName** model/FP32/labels.bin 
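To inspect these parameters from Python, the configuration text can be read into a dictionary. This is only a sketch: it assumes each parameter sits on its own line as `-name value`, as in the examples above, and the function name `parse_speech_config` is hypothetical. The Speech Library does its own parsing; this helper is purely for inspection inside the notebook.

```python
def parse_speech_config(text):
    """Parse '-name value' lines into a dict (assumed file format)."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.startswith("-"):
            continue  # skip blank lines and anything that is not a parameter
        parts = line.split(None, 1)
        # A parameter without a value is stored with an empty string.
        params[parts[0]] = parts[1].strip() if len(parts) == 2 else ""
    return params


# Sample text built from the parameter examples shown above.
sample = """\
-fe:rt:featureTransform model/FP32/lspeech_s5_ext.feature_transform
-inference:device CPU
-dec:wfst:acousticModelFName model/FP32/lspeech_s5_ext
-dec:wfst:fsmFName model/FP32/hclg.fst
-dec:wfst:outSymsFName model/FP32/labels.bin
"""
config = parse_speech_config(sample)
print(config["-inference:device"])
```

To apply it to one of the pre-configured files, read the file first, for example `parse_speech_config(open("speech_lib_CPU.cfg").read())`.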

## Input Audio File

In [None]:
Audio("/data/reference-sample-data/speech-recognition/how_are_you_doing.wav",autoplay=False)
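Before running recognition, it can help to look at the basic properties of the input file. The sketch below uses Python's standard `wave` module to report the channel count, sample width, sample rate, and duration; the helper name `describe_wav` is hypothetical, and no particular audio format is assumed for the sample file.

```python
import wave


def describe_wav(path):
    """Return basic header information for a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


# Example usage with the sample file from this notebook:
# print(describe_wav("/data/reference-sample-data/speech-recognition/how_are_you_doing.wav"))
```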

## Speech Library

The Speech Library serves as a wrapper around the Intel® Speech Feature Extraction and Intel® Speech Decoder libraries that takes care of the initialization of core components and data passing while exposing a simple C++ API.

The five main function calls are:

* **SpeechLibraryCreate** - Creates an instance of the Speech Library with a callback handle.
* **SpeechLibraryInitialize** - Takes the Speech Library handle and the configuration file, then parses the configuration file to load and initialize the settings for each stage (wav file, acoustic model, language model, etc.) and build the pipeline (feature extraction, inference, and decode).
* **SpeechLibraryPushData** - Takes the audio data, either from a wav file or from a microphone, and pushes it through the entire pipeline: feature extractor -> Inference Engine -> decoder.
* **SpeechLibraryGetResult** - Returns the transcribed audio.
* **SpeechLibraryRelease** - Releases all resources tied to the handle for the current speech recognition pipeline.
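The push-style data flow can be illustrated in Python without calling the Speech Library itself. The sketch below reads a .wav file in fixed-size blocks with the standard `wave` module; in the C++ application, blocks like these would be handed to the push call one after another. The function name `iter_wav_chunks` and the chunk size are assumptions chosen for illustration, not values taken from the sample code.

```python
import wave


def iter_wav_chunks(path, frames_per_chunk=4000):
    """Yield successive blocks of raw PCM bytes from a .wav file."""
    with wave.open(path, "rb") as wav:
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break  # end of file reached
            yield chunk


# Example usage: count how many blocks the sample file would be split into.
# path = "/data/reference-sample-data/speech-recognition/how_are_you_doing.wav"
# print(sum(1 for _ in iter_wav_chunks(path)))
```

Feeding audio in blocks rather than all at once mirrors how a microphone stream would arrive, which is why a push-based interface suits both file and live input.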

### Speech API Call Flow Diagram

<img src="img/speech_library_api.png" />

### Speech API Used In Sample Code

In [None]:
!sed -n 227,265p /opt/intel/openvino/data_processing/audio/speech_recognition/demos/offline_speech_recognition_demo/src/speech_library_app.cpp

## Run Inference on Dev Node CPU

In [None]:
!./intel64/Release/offline_speech_recognition_app -h

In [None]:
!./intel64/Release/offline_speech_recognition_app -wave=/data/reference-sample-data/speech-recognition/how_are_you_doing.wav -c=speech_lib_CPU.cfg

## Create a Job File

All the code up to this point has been run within the Jupyter Notebook instance running on a development node based on an Intel® Xeon® Scalable Processor, where the Notebook is allocated a single core. We will now run the workload on several of the DevCloud's edge compute nodes. We will send work to the edge compute nodes by submitting jobs into a queue. For each job, we will specify the type of edge compute server that must be allocated for the job.

To pass the required settings through the job script to the C++ application, we will use the following arguments:

* `-c`&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;location of the configuration file
* `-i`&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;location of the wav file


The job file will be executed directly on the edge compute node.

In [None]:
%%writefile speech_recognition_job.sh

ME=$(basename "$0")

# The default path for the job is your home directory, so we change directory to where the files are.
cd "$PBS_O_WORKDIR"

# Parse the options: -c <configuration file>, -i <wav file>
while getopts 'c:i:?' OPTION; do
    case "$OPTION" in
    c)
        CONFIG_FILE=$OPTARG
        echo "$ME is using config file $OPTARG"
        ;;
    i)
        WAVE_FILE=$OPTARG
        echo "$ME is using wave file $OPTARG"
        ;;
    esac
done

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/intelpython3/lib/

# Run the offline speech recognition application on the allocated node
./intel64/Release/offline_speech_recognition_app -wave="$WAVE_FILE" -c="$CONFIG_FILE"

In the output of the cell below, the properties describe each node, and the number on the left is the number of available nodes with those properties.

In [None]:
!pbsnodes | grep compnode | awk '{print $3}' | sort | uniq -c
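The same counting can be done in Python, which makes the shell pipeline easier to follow. The sketch below keeps lines containing `compnode`, takes the third whitespace-separated field (what `awk '{print $3}'` selects), and counts identical values. The sample text is made up for illustration and is not real `pbsnodes` output; unlike the shell pipeline, lines with fewer than three fields are skipped rather than counted as empty entries.

```python
from collections import Counter


def count_node_properties(pbsnodes_output):
    """Count identical property strings on lines that mention 'compnode'."""
    counts = Counter()
    for line in pbsnodes_output.splitlines():
        if "compnode" not in line:
            continue
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1  # third field, like awk's $3
    return counts


# Hypothetical sample input, for illustration only.
sample_output = """\
     properties = idc001skl,compnode,example-a
     properties = idc001skl,compnode,example-a
     properties = idc007xv5,compnode,example-b
     state = free
"""
for properties, count in sorted(count_node_properties(sample_output).items()):
    print(count, properties)
```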

###  Job queue submission

Each cell below will submit a job to different edge compute nodes.
The output of the cell is the `JobID` of your job, which you can use to track progress of a job.

**Note**: You can submit all six jobs at once or one at a time.

After submission, they will go into a queue and run as soon as the requested compute resources become available. 
(tip: **shift+enter** will run the cell and automatically move you to the next cell. So you can hit **shift+enter** multiple times to quickly run multiple cells)

**Note**: If you want to use your own audio file, change the environment variable `AUDIO` in the following cell from "/data/reference-sample-data/speech-recognition/how_are_you_doing.wav" to the full path of your uploaded .wav file.




In [None]:
os.environ["AUDIO"] = "/data/reference-sample-data/speech-recognition/how_are_you_doing.wav"

### 10th Generation  Intel® Core CPU with GNA
In the cell below, we submit a job to an edge node with a <a href="https://www.intel.com/content/www/us/en/products/processors/core/i7-processors/i7-1065g7.html">10th Generation Intel Core CPU</a>. The inference workload will run on the GNA.

In [None]:
#Submit job to the queue
job_id_gna = !qsub speech_recognition_job.sh -l nodes=1:i7-1065g7 -F "-c speech_lib_GNA.cfg -i $AUDIO " -N speech_gna
print(job_id_gna[0]) 
#For viewing results
output_file_gna = "speech_gna.o"+job_id_gna[0].split('.')[0]

### 8th Generation  Intel® Core CPU 
In the cell below, we submit a job to an edge node with an <a href="https://www.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/whiskey-lake/overview.html">Intel 8th Generation Intel Core Whiskey Lake CPU</a>. The inference workload will run on the CPU.

In [None]:
#Submit job to the queue
job_id_whiskeylake = !qsub speech_recognition_job.sh -l nodes=1:idc016ai7 -F "-c speech_lib_CPU.cfg -i $AUDIO " -N speech_whiskeylake_cpu
print(job_id_whiskeylake[0]) 
#For viewing results
output_file_whiskeylake_cpu = "speech_whiskeylake_cpu.o"+job_id_whiskeylake[0].split('.')[0]

### Intel® CPU 
In the cell below, we submit a job to an <a 
    href="https://software.intel.com/en-us/iot/hardware/iei-tank-dev-kit-core">IEI 
    Tank 870-Q170</a> edge node with an <a 
    href="https://ark.intel.com/products/88186/Intel-Core-i5-6500TE-Processor-6M-Cache-up-to-3-30-GHz-">Intel 
    Core i5-6500TE</a>. The inference workload will run on the CPU.



In [None]:
#Submit job to the queue
job_id_core = !qsub speech_recognition_job.sh -l nodes=1:idc001skl -F "-c speech_lib_CPU.cfg -i $AUDIO " -N speech_core
print(job_id_core[0]) 
#For viewing results
output_file_core = "speech_core.o"+job_id_core[0].split('.')[0]


### Intel® Xeon® CPU 
In the cell below, we submit a job to an <a 
    href="https://software.intel.com/en-us/iot/hardware/iei-tank-dev-kit-core">IEI 
    Tank 870-Q170</a> edge node with an <a 
    href="https://ark.intel.com/products/88178/Intel-Xeon-Processor-E3-1268L-v5-8M-Cache-2-40-GHz-">Intel 
    Xeon Processor E3-1268L v5</a>. The inference workload will run on the CPU.
    

In [None]:
#Submit job to the queue
job_id_xeon = !qsub speech_recognition_job.sh -l nodes=1:idc007xv5 -F "-c speech_lib_CPU.cfg -i $AUDIO " -N speech_xeon
print(job_id_xeon[0]) 
#For viewing results
output_file_xeon = "speech_xeon.o"+job_id_xeon[0].split('.')[0]

### Intel® Core CPU with Intel® GPU
In the cell below, we submit a job to an <a 
    href="https://software.intel.com/en-us/iot/hardware/iei-tank-dev-kit-core">IEI 
    Tank 870-Q170</a> edge node with an <a href="https://ark.intel.com/products/88186/Intel-Core-i5-6500TE-Processor-6M-Cache-up-to-3-30-GHz-">Intel Core i5-6500TE</a>. The inference workload will run on the Intel® HD Graphics 530 card integrated with the CPU.

In [None]:
#Submit job to the queue
job_id_gpu = !qsub speech_recognition_job.sh -l nodes=1:idc001skl -F "-c speech_lib_GPU.cfg -i $AUDIO " -N speech_gpu
print(job_id_gpu[0]) 
#For viewing results
output_file_gpu = "speech_gpu.o"+job_id_gpu[0].split('.')[0]


### UP Squared Grove IoT Development Kit
In the cell below, we submit a job to an <a 
    href="https://software.intel.com/en-us/iot/hardware/up-squared-grove-dev-kit">UP Squared Grove IoT Development Kit</a> edge node with an <a 
    href="https://ark.intel.com/products/96488/Intel-Atom-x7-E3950-Processor-2M-Cache-up-to-2-00-GHz-">Intel Atom® x7-E3950 Processor</a>. The inference  workload will run on the integrated Intel® HD Graphics 505 card.

In [None]:
#Submit job to the queue
job_id_up2 = !qsub speech_recognition_job.sh -l nodes=1:idc008u2g -F "-c speech_lib_GPU.cfg -i $AUDIO " -N speech_up2_gpu
print(job_id_up2[0]) 
#For viewing results
output_file_up2_gpu = "speech_up2_gpu.o"+job_id_up2[0].split('.')[0]

### Check the Progress

Check the progress of the jobs. `Q` status stands for `queued`, `R` for `running`. How long a job stays queued depends on the number of users. Once a job starts, it should take up to 5 minutes to run. If the job is no longer listed, it's done.

In [None]:
liveQstat()

You should see the jobs you have submitted (referenced by the `Job ID` that is displayed right after you submit each job).
There should also be an extra job in the queue "jupyterhub": this job runs your current Jupyter Notebook session.

The 'S' column shows the current status. 
- If it is in Q state, it is in the queue waiting for available resources. 
- If it is in R state, it is running. 
- If the job is no longer listed, it means it is completed.

**Note**: Time spent in the queue depends on the number of users accessing the edge nodes. Once these jobs begin to run, they should take from 1 to 5 minutes to complete. 
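As an alternative to watching the queue, a cell can simply wait for a job's output file to appear. The sketch below polls for a file until it exists or a timeout expires. It assumes that the output file shows up in the working directory once the job has finished; the helper name `wait_for_file` and the default timings are hypothetical.

```python
import os
import time


def wait_for_file(path, timeout_s=300.0, poll_interval_s=5.0):
    """Return True once path exists, or False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Example usage with one of the file names built at submission time:
# if wait_for_file(output_file_core):
#     print("speech_core has finished")
```

The timeout only covers the waiting done by this helper; a job that sits in the queue for a long time may need a larger value.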

***Wait!***

Please wait for the inference jobs to complete before proceeding to the next step.

## Step 3: View Results

Once the jobs are completed, the queue system writes the `stdout` and `stderr` streams of each job into files named `speech_{type}.o{JobID}` and `speech_{type}.e{JobID}`, respectively. Here, `speech_{type}` corresponds to the `-N` option of qsub, for example `speech_core` for the Core CPU target.
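The naming convention for the output files can be captured in a small helper. The sketch below mirrors what the submission cells do when they build `output_file_*` (keep the part of the job ID before the first dot and append it to the job name); the function name `job_output_files` and the job ID in the example are made up for illustration.

```python
def job_output_files(job_name, job_id):
    """Return the (stdout, stderr) file names for a submitted job."""
    numeric_id = job_id.split(".")[0]  # drop the server suffix of the job ID
    return f"{job_name}.o{numeric_id}", f"{job_name}.e{numeric_id}"


# Hypothetical job ID, for illustration only.
stdout_name, stderr_name = job_output_files("speech_core", "12345.example-server")
print(stdout_name, stderr_name)
```

Reading the `.e` file alongside the `.o` file is a quick way to see error messages when a job produced no transcription.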


### 10th Generation  Intel® Core CPU with GNA

In [None]:
filepath = os.path.join(os.getcwd(), output_file_gna)
# Use a context manager so the file is closed after reading
with open(filepath, 'r') as fd:
    print(fd.read())

### 8th Generation  Intel® Core CPU 

In [None]:
filepath = os.path.join(os.getcwd(), output_file_whiskeylake_cpu)
# Use a context manager so the file is closed after reading
with open(filepath, 'r') as fd:
    print(fd.read())

### Intel® CPU 

In [None]:
filepath = os.path.join(os.getcwd(), output_file_core)
# Use a context manager so the file is closed after reading
with open(filepath, 'r') as fd:
    print(fd.read())

### Intel® Xeon® CPU 

In [None]:
filepath = os.path.join(os.getcwd(), output_file_xeon)
# Use a context manager so the file is closed after reading
with open(filepath, 'r') as fd:
    print(fd.read())

### Intel® Core CPU with Intel® GPU

In [None]:
filepath = os.path.join(os.getcwd(), output_file_gpu)
# Use a context manager so the file is closed after reading
with open(filepath, 'r') as fd:
    print(fd.read())

### UP Squared Grove IoT Development Kit

In [None]:
filepath = os.path.join(os.getcwd(), output_file_up2_gpu)
# Use a context manager so the file is closed after reading
with open(filepath, 'r') as fd:
    print(fd.read())

## Next steps
- [More Jupyter* Notebook Samples](https://devcloud.intel.com/edge/advanced/sample_applications/) - additional sample applications 
- [Jupyter* Notebook Tutorials](https://devcloud.intel.com/edge/get_started/tutorials) - sample application Jupyter* Notebook tutorials
- [Intel® Distribution of OpenVINO™ toolkit Main Page](https://software.intel.com/openvino-toolkit) - learn more about the tools and use of the Intel® Distribution of OpenVINO™ toolkit for implementing inference on the edge


## About this notebook

For technical support, please see the [Intel® DevCloud Forums](https://software.intel.com/en-us/forums/intel-devcloud-for-edge)

<p style=background-color:#0071C5;color:white;padding:0.5em;display:table-cell;width:100pc;vertical-align:middle>
<img style=float:right src="https://devcloud.intel.com/edge/static/images/svg/IDZ_logo.svg" alt="Intel DevCloud logo" width="150px"/>
<a style=color:white>Intel® DevCloud for the Edge</a><br>   
<a style=color:white href="#top">Top of Page</a> |
<a style=color:white href="https://devcloud.intel.com/edge/static/docs/terms/Intel-DevCloud-for-the-Edge-Usage-Agreement.pdf">Usage Agreement (Intel)</a> | 
<a style=color:white href="https://devcloud.intel.com/edge/static/docs/terms/Colfax_Cloud_Service_Terms_v1.3.pdf">Service Terms (Colfax)</a>
</p>