# Audacity WaveformToLabels Example

In this notebook we load a speech-to-text model from Facebook using Hugging Face's Transformers package. We look at the dependencies needed to serialize a model, create a wrapper class for a pretrained WaveformToLabels model, and save the wrapped model so that it can easily be used in Audacity.

## Dependencies

In [1]:
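# Note: torchaudio 0.8.0 requires torch==1.8.0, so the torch==1.8.1 pin below
# conflicts with it and pip ends up reinstalling torch 1.8.0 (see the output).
# Pinning torch==1.8.0 here avoids the conflict.
!pip install "torch==1.8.1"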
!pip install "torchaudio==0.8.0"
!pip install transformers

Collecting torch==1.8.1
  Using cached torch-1.8.1-cp39-cp39-win_amd64.whl (190.5 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.8.0
    Uninstalling torch-1.8.0:
      Successfully uninstalled torch-1.8.0
Successfully installed torch-1.8.1


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 0.8.0 requires torch==1.8.0, but you have torch 1.8.1 which is incompatible.


Collecting torch==1.8.0
  Using cached torch-1.8.0-cp39-cp39-win_amd64.whl (190.5 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.8.1
    Uninstalling torch-1.8.1:
      Successfully uninstalled torch-1.8.1
Successfully installed torch-1.8.0
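
The conflict in the log above can be caught before running the rest of the notebook. The following is a minimal sketch, not part of the original notebook: the helper names `installed_version` and `versions_match` are hypothetical, and the compatibility table only contains the one pair that pip reports in the log.

```python
from importlib import metadata
from typing import Optional

# torchaudio release -> torch release it requires. Only the pair reported by
# pip in the log above is listed; extend the table for other releases.
REQUIRED_TORCH = {"0.8.0": "1.8.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def versions_match(torch_version: Optional[str],
                   torchaudio_version: Optional[str]) -> bool:
    """True if the torch build is the one this torchaudio release requires."""
    if torch_version is None or torchaudio_version is None:
        return False
    # Drop local build suffixes such as "+cpu" before comparing.
    torch_version = torch_version.split("+")[0]
    torchaudio_version = torchaudio_version.split("+")[0]
    return REQUIRED_TORCH.get(torchaudio_version) == torch_version


if __name__ == "__main__":
    # Reports whether the currently installed pair is consistent.
    print(versions_match(installed_version("torch"),
                         installed_version("torchaudio")))
```

Passing the version strings in explicitly keeps the comparison testable without either package installed.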


In [2]:
%%capture
import torch
from transformers import Speech2TextForConditionalGeneration, Speech2TextProcessor
import torchaudio
import json

# use no grad!
torch.set_grad_enabled(False)

Git LFS is needed if you want to upload your model to Hugging Face from the command line.

In [3]:
%%capture
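# required for huggingface
# Debian/Ubuntu only (e.g. Colab); on other systems install git-lfs with your package manager.
# -y answers the confirmation prompt, which a notebook cell cannot do interactively.
!sudo apt-get install -y git-lfs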
!git lfs install

## Storing Labels
If your model has a large number of labels, this block of code reads each line of a text file as a label and stores the labels in a list. This minimizes issues when creating your model's metadata.

In [4]:
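def readFile(fileName):
    """Return the lines of a text file as a list of strings, one label per line."""
    # the context manager closes the file even if reading fails
    with open(fileName, "r") as fileObj:
        return fileObj.read().splitlines()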

In [5]:
labels = readFile('assets/vocab.txt')
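
Before the labels go into the metadata, it can help to check the file for blank lines and duplicates. The sketch below is an illustration under stated assumptions: `load_labels` and `duplicate_labels` are hypothetical helpers (not part of `torchaudacity`), and the vocabulary file is a temporary toy file rather than `assets/vocab.txt`.

```python
from collections import Counter
from pathlib import Path
import tempfile


def load_labels(path):
    """Read one label per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def duplicate_labels(labels):
    """Return the labels that occur more than once, sorted alphabetically."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Toy vocabulary file with one blank line and one repeated label.
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "vocab.txt"
    vocab_path.write_text("hello\nworld\n\nhello\n", encoding="utf-8")
    demo_labels = load_labels(vocab_path)

print(demo_labels)                    # ['hello', 'world', 'hello']
print(duplicate_labels(demo_labels))  # ['hello']
```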


## Wrapping the model
We need to create a `.pt` file containing the model itself, and a JSON string with the model's metadata. This metadata tells end users about the model's domain, sample rate, labels, etc.

`torchaudacity` provides a `WaveformToLabels` class. We will use this as the base class for our pretrained model's wrapper. The `WaveformToLabels` class provides us with tests to ensure that our model is receiving properly sized input, and outputting the expected tensor shapes for Audacity's Deep Learning Analyzer.

In [6]:
import sys
sys.path.append("..")

In [62]:
from torchaudacity import WaveformToLabels

class model_wrapper(WaveformToLabels):
    def __init__(self, model, processor, vocab):
        super().__init__(model_wrapper)
        self._model = model
        self._processor = processor

    def do_forward_pass(self, input):
        # convert the raw waveform into the input features the model expects
        input_features = self._processor(
            input[0],
            sampling_rate=16_000,
            return_tensors="pt"
        ).input_features

        # get predictions, and decode them
        generated_ids = self._model.generate(input_ids=input_features)
        # use the wrapped processor rather than the notebook-level global
        transcription = self._processor.tokenizer.batch_decode(generated_ids)[0].split(' ')
        num_preds = len(transcription)

        # model predictions must be logits or one-hot encoded
        vocab = self._processor.tokenizer.get_vocab()  # build the lookup dict once
        preds_onehot = torch.zeros(num_preds, 10000)  # 10000 columns, one per vocabulary entry
        # SentencePiece marks word starts with '▁' (U+2581), not an ASCII underscore,
        # so a decoded word is looked up as-is first and then with that prefix
        for i, token in enumerate(transcription):
            if token in vocab:
                preds_onehot[i][vocab[token]] = 1
            elif '▁' + token in vocab:
                preds_onehot[i][vocab['▁' + token]] = 1
            else:
                # fall back to index 3 (presumably the tokenizer's <unk> entry)
                preds_onehot[i][3] = 1
        
        # this model does not provide timestamps, so we assign
        # equally sized, back-to-back time ranges to the predictions
        total_time = input.shape[1] / 16000
        equal_size_timestamp = total_time / num_preds
        timestamps = torch.zeros(num_preds, 2)
        for i in range(num_preds):
            if i > 0:
                # each range starts where the previous one ended
                timestamps[i][0] = timestamps[i-1][1]
            timestamps[i][1] = timestamps[i][0] + equal_size_timestamp

        # return the predictions and timestamps as a tensor
        return (preds_onehot, timestamps)
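
The timestamp arithmetic in `do_forward_pass` can be checked in isolation. Below is a minimal pure-Python sketch of the same idea; `equal_time_ranges` is a hypothetical helper, and it computes each boundary as a multiple of the step instead of accumulating from the previous range as the wrapper does.

```python
def equal_time_ranges(num_samples, sample_rate, num_preds):
    """Split a clip into `num_preds` back-to-back (start, end) ranges in seconds."""
    if num_preds <= 0:
        return []
    total_time = num_samples / sample_rate
    step = total_time / num_preds
    # Each boundary is a multiple of the step, so rounding error does not accumulate.
    return [(i * step, (i + 1) * step) for i in range(num_preds)]


# 32000 samples at 16 kHz is a 2-second clip; two predictions get 1 second each.
print(equal_time_ranges(32000, 16000, 2))  # [(0.0, 1.0), (1.0, 2.0)]
```

For a 32000-sample clip at 16 kHz with two predictions, this yields the ranges 0 to 1 s and 1 to 2 s.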

In [63]:
model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-librispeech-asr", torchscript=True)
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-librispeech-asr")
model.eval()

torchscript_model = model_wrapper(model, processor, labels)

In [65]:
dummy_input = torch.randn((1, 32000))
torchscript_model(dummy_input)

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([[0., 1.],
         [1., 2.]]))
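
Each row of the first tensor above is a one-hot vector over the tokenizer vocabulary, and each row of the second is a (start, end) pair in seconds. The word-to-index lookup behind the one-hot rows can be sketched in pure Python with a toy vocabulary. Everything here is illustrative: `word_to_index`, `one_hot_rows`, and `toy_vocab` are hypothetical, and the `"▁"` prefix is the SentencePiece word-start marker, which this sketch assumes the vocabulary uses.

```python
def word_to_index(word, vocab, unk_index, prefix="▁"):
    """Look up `word` directly, then with the word-start prefix, else fall back."""
    if word in vocab:
        return vocab[word]
    if prefix + word in vocab:
        return vocab[prefix + word]
    return unk_index


def one_hot_rows(words, vocab, unk_index):
    """Build one one-hot row (a list of floats) per word."""
    rows = []
    for word in words:
        row = [0.0] * len(vocab)
        row[word_to_index(word, vocab, unk_index)] = 1.0
        rows.append(row)
    return rows


# Toy vocabulary; a real tokenizer vocabulary has thousands of entries.
toy_vocab = {"<unk>": 0, "▁hello": 1, "▁world": 2, "s": 3}
rows = one_hot_rows(["hello", "s", "audacity"], toy_vocab,
                    unk_index=toy_vocab["<unk>"])
print(rows)
# [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
```

The last word is not in the toy vocabulary in either form, so it maps to the unknown-token index.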

## Model Metadata

We need to create a `metadata.json` file for our model. This file will be added to the Hugging Face repo and lets Audacity show users the key facts about the model, such as its sample rate and labels. See the [contributing documentation](https://github.com/hugofloresgarcia/torchaudacity) for the full metadata schema.

In [66]:
# create a dictionary with model metadata
metadata = {
    'sample_rate': 16000, 
    'domain_tags': ['speech'],
    'short_description': 'I will label your speech into text :]',
    'long_description': 
              'This is an Audacity wrapper for the model, '
              'forked from the repository '
              'facebook/s2t-medium-librispeech-asr. '
              'This model was trained by Changhan Wang '
              'and Yun Tang and Xutai Ma and Anne Wu '
              'and Dmytro Okhonko and Juan Pino.',
    'tags': ['speech-to-text'],
    'effect_type': 'waveform-to-labels',
    'multichannel': False,
    'labels': list(processor.tokenizer.get_vocab().keys()),
}
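
A quick type check on the dictionary can catch mistakes before the model is saved. The sketch below is an assumption-laden illustration: `EXPECTED_TYPES` mirrors only the keys used in this notebook, not the official `torchaudacity` schema, and `metadata_problems` and `example_metadata` are hypothetical names.

```python
import json

# Expected Python type for each key used in this notebook's metadata dict.
# This mirrors the example above, not the official schema.
EXPECTED_TYPES = {
    "sample_rate": int,
    "domain_tags": list,
    "short_description": str,
    "long_description": str,
    "tags": list,
    "effect_type": str,
    "multichannel": bool,
    "labels": list,
}


def metadata_problems(meta):
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in meta:
            problems.append(f"missing key: {key}")
        elif not isinstance(meta[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


# Toy metadata with a two-entry label list.
example_metadata = {
    "sample_rate": 16000,
    "domain_tags": ["speech"],
    "short_description": "toy example",
    "long_description": "A toy metadata dictionary for illustration.",
    "tags": ["speech-to-text"],
    "effect_type": "waveform-to-labels",
    "multichannel": False,
    "labels": ["hello", "world"],
}

print(metadata_problems(example_metadata))  # []
# The dictionary must also survive a round trip through JSON.
print(json.loads(json.dumps(example_metadata)) == example_metadata)  # True
```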

## Saving Our Model & Metadata

We will now save the wrapped model locally by tracing it with TorchScript (`torch.jit.trace`) and then passing the result through `torch.jit.script`, which yields a `ScriptModule` or `ScriptFunction`. We can then use `torchaudacity`'s utility function `save_model` to save the model and metadata easily.

In [67]:
from torchaudacity.utils import save_model
from pathlib import Path

In [68]:
# compiling and saving model
dummy_input = torch.randn((1, 2048)) # dummy input for model tracing
traced_model = torch.jit.trace(torchscript_model, dummy_input)
serialized_model = torch.jit.script(traced_model)

save_model(serialized_model, metadata, Path('audacity-s2t-medium'))