# Transcribe Audio

## Overview

Audio transcription is also known as Automatic Speech Recognition (ASR).

Task 2 - Data Preparation & EDA

Sub Task 3 - Transcribe Audio

Description: Create a written representation of the audio

## Libraries & Models

| Library | Key Features | Status | Remarks |
| :- | :- | --- | :- |
|NeMo||In Progress|Transcribes audio with word timestamps; options for adding punctuation are being explored. Code adapted from https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb|
|Wav2Vec2||In Progress|Exploring options to obtain word timestamps as well|
|Vosk||In Progress|Following the article on Vosk at https://towardsdatascience.com/speech-recognition-with-timestamps-934ede4234b2|

## Audio Files

|Audio File|Length|Description|Source|
|:-|:-|:-|:-|
|an4_diarize_test.wav|5 seconds|Dates by 2 speakers|https://nemo-public.s3.us-east-2.amazonaws.com/an4_diarize_test.wav|
|OSR_us_000_0010_8k.wav|33 seconds|Harvard sentences in British accent by 1 person|https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav|
|OSR_us_000_0060_8k.wav|58 seconds|Harvard sentences in neutral accent by 1 person|https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0060_8k.wav|

# Install Dependencies

## NeMo Specific Dependencies

In [None]:
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install unidecode

## Install NeMo
BRANCH = 'r1.10.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]

## Install TorchAudio
!pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9675 sha256=1bfd63dc5bd527ad57e702869b99728ec0335c50a771f726db0ce2231327413b
  Stored in directory: /root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libsndfile1 is already the newest version (1.0.28-4ubuntu0.18.04.2).
ffmpeg is already the newest version (7:3.4.11-0ubuntu0.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be ins

# Import Libraries

## Common Libraries

In [None]:
import os
import librosa
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Audio, display

## NeMo Specific Libraries

In [None]:
from omegaconf import OmegaConf
import wget
import json
from nemo.collections.asr.parts.utils.decoder_timestamps_utils import ASR_TIMESTAMPS

[NeMo W 2022-08-18 15:32:04 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.


# Setup Data Directory

In [None]:
ROOT_DIR = os.getcwd()
text_data = 'data'
DATA_DIR = os.path.join(ROOT_DIR, text_data)
os.makedirs(DATA_DIR, exist_ok=True)

# List the directory details
print(f"Root or Current Working Directory: {ROOT_DIR}")
root_dir_contents = !ls
print(f"Contents of Current Directory: {root_dir_contents[0]}")
print(f"Data Directory: {DATA_DIR}")

Root or Current Working Directory: /content
Contents of Current Directory: data  sample_data
Data Directory: /content/data


# Download Audio Clips

## List of Audio Clips

In [None]:
audio_files = []
audio_files.append({"name": "an4_diarize_test", "extn": "wav", "url":"https://nemo-public.s3.us-east-2.amazonaws.com/an4_diarize_test.wav"})
audio_files.append({"name": "OSR_us_000_0010_8k", "extn": "wav", "url":"https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav"})
audio_files.append({"name": "OSR_us_000_0060_8k", "extn": "wav", "url":"https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0060_8k.wav"})

## Download to Data Directory

In [None]:
audio_file_paths = []
audio_file_path = ''

for audio_file in audio_files:
    audio_file_path = os.path.join(DATA_DIR, audio_file['name'] + '.' + audio_file['extn'])
    if not os.path.exists(audio_file_path):
        !wget --directory-prefix={text_data} {audio_file['url']}
    if os.path.exists(audio_file_path):
        audio_file_paths.append(audio_file_path)

print(f"Full path of downloaded audio files: {audio_file_paths}")

--2022-08-18 15:32:26--  https://nemo-public.s3.us-east-2.amazonaws.com/an4_diarize_test.wav
Resolving nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)... 52.219.101.146
Connecting to nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)|52.219.101.146|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 166444 (163K) [audio/wav]
Saving to: ‘data/an4_diarize_test.wav’


2022-08-18 15:32:27 (4.53 MB/s) - ‘data/an4_diarize_test.wav’ saved [166444/166444]

--2022-08-18 15:32:27--  https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
Resolving www.voiptroubleshooter.com (www.voiptroubleshooter.com)... 162.241.218.124
Connecting to www.voiptroubleshooter.com (www.voiptroubleshooter.com)|162.241.218.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 538014 (525K) [audio/x-wav]
Saving to: ‘data/OSR_us_000_0010_8k.wav’


2022-08-18 15:32:27 (2.26 MB/s) - ‘data/OSR_
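
The download cell above relies on the `!wget` shell escape, which only works inside a notebook. As a rough sketch of the same "download only if missing" logic in plain Python, the helpers below use `urllib.request.urlretrieve` from the standard library. The names `target_path` and `download_if_missing` are hypothetical and not part of the original notebook; the entry layout (`name`, `extn`, `url`) is the one used in the list of audio clips.

```python
import os
import urllib.request


def target_path(data_dir, entry):
    """Build the local path for an entry shaped like the audio_files items."""
    return os.path.join(data_dir, f"{entry['name']}.{entry['extn']}")


def download_if_missing(entry, data_dir):
    """Download entry['url'] into data_dir unless the file already exists."""
    os.makedirs(data_dir, exist_ok=True)
    path = target_path(data_dir, entry)
    if not os.path.exists(path):
        # Network access happens only when the file is not cached locally
        urllib.request.urlretrieve(entry["url"], path)
    return path


# Only the path is computed here; no download is triggered by this demo
example_entry = {
    "name": "an4_diarize_test",
    "extn": "wav",
    "url": "https://nemo-public.s3.us-east-2.amazonaws.com/an4_diarize_test.wav",
}
example_target = target_path("data", example_entry)
print(example_target)
```

Calling `download_if_missing(audio_file, DATA_DIR)` inside the loop would then take the place of the two `os.path.exists` checks, assuming the URLs remain reachable.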

## Display Waveform Function

In [None]:
def display_waveform(signal, sampling_rate, text='Audio', overlay_color=None):
    # Plot the raw samples in black; overlay_color is an optional set of colors drawn on top
    fig, ax = plt.subplots(1, 1)
    fig.set_figwidth(20)
    fig.set_figheight(2)
    plt.scatter(np.arange(len(signal)),
                signal,
                s=1, marker='o', c='k')
    if overlay_color is not None and len(overlay_color):
        plt.scatter(np.arange(len(signal)),
                    signal,
                    s=1, marker='o', c=overlay_color)
    fig.suptitle(text, fontsize=16)
    plt.xlabel('time (secs)', fontsize=18)
    plt.ylabel('signal strength', fontsize=14)
    plt.axis([0, len(signal), -0.5, +0.5])
    time_axis, _ = plt.xticks()
    plt.xticks(time_axis[:-1], time_axis[:-1]/sampling_rate)
    plt.show()

## Inspect Audio Clips

In [None]:
for audio_file in audio_file_paths:
    signal, sampling_rate = librosa.load(audio_file, sr=None)
    print(f"File Name: {audio_file}, Sampling Rate: {sampling_rate}")
    display(Audio(signal, rate=sampling_rate))
    display_waveform(signal, sampling_rate)

Output hidden; open in https://colab.research.google.com to view.
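
The Length column in the Audio Files table can be cross-checked without listening to the clips. The sketch below is one possible way to do that with the standard-library `wave` module, assuming the files are plain PCM WAV (which `wave` requires). The helper names `wav_duration_seconds` and `write_silent_wav` are hypothetical, and the demo uses a synthetic silent clip rather than the downloaded files.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file as frames / frame rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_silent_wav(path, seconds, sampling_rate=8000):
    """Write a mono 16-bit silent clip; only used to demonstrate the helper."""
    n_frames = int(seconds * sampling_rate)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silent_wav(demo_path, seconds=2.5)
demo_duration = wav_duration_seconds(demo_path)
print(f"Duration: {demo_duration:.2f} s")
```

With the arrays already loaded by `librosa.load`, the same value is simply `len(signal) / sampling_rate`.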

# NeMo based ASR

## Download Configuration File

In [None]:
NEMO_CONFIG_URL = "https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/offline_diarization_with_asr.yaml"
if not os.path.exists(os.path.join(DATA_DIR, 'offline_diarization_with_asr.yaml')):
    NEMO_CONFIG = wget.download(NEMO_CONFIG_URL, DATA_DIR)
else:
    NEMO_CONFIG = os.path.join(DATA_DIR, 'offline_diarization_with_asr.yaml')

nemo_cfg = OmegaConf.load(NEMO_CONFIG)

## Choose ASR Model & Input Audio

In [None]:
# Choose the pretrained model that NeMo should use for ASR
nemo_asr_model_path = 'QuartzNet15x5Base-En'

# Choose the input audio file from among the audio clips downloaded earlier
audio_file_index = 0

## Create Manifest

In [None]:
meta = {
'audio_filepath': audio_file_paths[audio_file_index], 
'offset': 0,
'duration': None, 
'label': 'infer', 
'text': '-', 
'num_speakers': None,
'rttm_filepath': None,
'uem_filepath': None
}

with open(os.path.join(DATA_DIR, 'input_manifest.json'), 'w') as fp:
    json.dump(meta, fp)
    fp.write('\n')

nemo_cfg.diarizer.manifest_filepath = os.path.join(DATA_DIR, 'input_manifest.json')
nemo_cfg.diarizer.out_dir = DATA_DIR
nemo_cfg.diarizer.asr.model_path = nemo_asr_model_path

# Display the Configuration File if needed
# print(OmegaConf.to_yaml(nemo_cfg))
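
The manifest written above holds a single JSON object followed by a newline, which suggests a one-object-per-line layout. As a minimal sketch under that assumption, the helpers below write one such line per audio file and read them back. `write_manifest` and `read_manifest` are hypothetical names, the clip paths in the demo are placeholders, and whether the NeMo version used here processes several entries in one run has not been verified in this notebook.

```python
import json
import os
import tempfile


def write_manifest(audio_paths, manifest_path):
    """Write one JSON object per line, using the same fields as the cell above."""
    with open(manifest_path, "w") as fp:
        for audio_path in audio_paths:
            entry = {
                "audio_filepath": audio_path,
                "offset": 0,
                "duration": None,
                "label": "infer",
                "text": "-",
                "num_speakers": None,
                "rttm_filepath": None,
                "uem_filepath": None,
            }
            fp.write(json.dumps(entry) + "\n")
    return manifest_path


def read_manifest(manifest_path):
    """Parse the manifest back into a list of dictionaries, skipping blank lines."""
    with open(manifest_path) as fp:
        return [json.loads(line) for line in fp if line.strip()]


# Placeholder paths; in the notebook these would come from audio_file_paths
demo_manifest = os.path.join(tempfile.mkdtemp(), "input_manifest.json")
write_manifest(["data/clip_a.wav", "data/clip_b.wav"], demo_manifest)
demo_entries = read_manifest(demo_manifest)
print(f"Manifest entries: {len(demo_entries)}")
```

Reading the file back is a cheap way to confirm that every line is valid JSON before handing the manifest to the decoder.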

## Execute Transcription (ASR)

In [None]:
nemo_asr_ts_decoder = ASR_TIMESTAMPS(**nemo_cfg.diarizer)
nemo_asr_model = nemo_asr_ts_decoder.set_asr_model()
nemo_words, nemo_words_ts = nemo_asr_ts_decoder.run_ASR(nemo_asr_model)

print("NeMo based ASR:")
print("Decoded word output dictionary: \n", nemo_words[audio_files[audio_file_index]['name']])
print("Word-level timestamps dictionary: \n", nemo_words_ts[audio_files[audio_file_index]['name']])

[NeMo I 2022-08-18 15:45:32 speaker_utils:82] Number of files to diarize: 1
[NeMo I 2022-08-18 15:45:32 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.10.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.
[NeMo I 2022-08-18 15:45:32 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.10.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo
[NeMo I 2022-08-18 15:45:32 common:789] Instantiating model from pre-trained checkpoint
[NeMo I 2022-08-18 15:45:33 features:200] PADDING: 16
[NeMo I 2022-08-18 15:45:34 save_restore_connector:243] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.10.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.


Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

NeMo based ASR:
Decoded word output dictionary: 
 ['eleven', 'twenty', 'seven', 'fifty', 'seven', 'october', 'twenty', 'fourth', 'nineteen', 'seventy']
Word-level timestamps dictionary: 
 [[0.56, 1.0], [1.14, 1.5], [1.54, 2.06], [2.14, 2.48], [2.52, 3.24], [3.34, 3.74], [3.78, 4.04], [4.08, 4.32], [4.46, 4.8], [4.82, 5.18]]
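
The two lists printed above are parallel: the i-th timestamp pair gives the start and end time, in seconds, of the i-th word. One possible way to combine them into a single structure is sketched below, using the values copied from the output above. The helper `align_words` and the variable names are hypothetical and not part of NeMo.

```python
# Values copied from the NeMo output above for an4_diarize_test
words = ['eleven', 'twenty', 'seven', 'fifty', 'seven',
         'october', 'twenty', 'fourth', 'nineteen', 'seventy']
word_ts = [[0.56, 1.0], [1.14, 1.5], [1.54, 2.06], [2.14, 2.48], [2.52, 3.24],
           [3.34, 3.74], [3.78, 4.04], [4.08, 4.32], [4.46, 4.8], [4.82, 5.18]]


def align_words(words, word_ts):
    """Pair each word with its [start, end] timestamp in seconds."""
    if len(words) != len(word_ts):
        raise ValueError("words and timestamps must have the same length")
    return [
        {"word": word, "start": start, "end": end}
        for word, (start, end) in zip(words, word_ts)
    ]


aligned = align_words(words, word_ts)
for item in aligned:
    print(f"{item['start']:5.2f} - {item['end']:5.2f}  {item['word']}")

# Plain transcript without timing information
transcript = " ".join(words)
print(transcript)
```

In the notebook, `words` and `word_ts` would be taken from `nemo_words` and `nemo_words_ts` under the key for the chosen file instead of being typed in by hand.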
