In [1]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
"""

BRANCH = 'main'

In [2]:
if 'google.colab' in str(get_ipython()):
  !pip install -q condacolab
  import condacolab
  condacolab.install()

✨🍰✨ Everything looks OK!


In [3]:

# If you're using Google Colab and not running locally, run this cell.
# install NeMo
if 'google.colab' in str(get_ipython()):
  !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

Collecting nemo_toolkit[all]
  Cloning https://github.com/NVIDIA/NeMo.git (to revision debug_itn) to /tmp/pip-install-sl4dfir1/nemo-toolkit_2b0f4f353bbb4b10a38be9e09869259a
  Running command git clone -q https://github.com/NVIDIA/NeMo.git /tmp/pip-install-sl4dfir1/nemo-toolkit_2b0f4f353bbb4b10a38be9e09869259a
  Running command git checkout -b debug_itn --track origin/debug_itn
  Switched to a new branch 'debug_itn'
  Branch 'debug_itn' set up to track remote branch 'debug_itn' from 'origin'.


In [4]:
if 'google.colab' in str(get_ipython()):
  !conda install -c conda-forge pynini=2.1.3

Collecting package metadata (current_repodata.json): - \ | / - \ | / - \ | / - \ | / - \ failed

InvalidVersionSpec: Invalid version '4.19.112+': empty version component



In [5]:
import os
import wget
import pynini

# Task Description

Inverse text normalization (ITN), also called denormalization, is a part of the Automatic Speech Recognition (ASR) post-processing pipeline. 

ITN is the task of converting the raw spoken-form output of the ASR model into its written form to improve readability. For example, `in nineteen seventy` should be changed to `in 1970` and `one hundred and twenty three dollars` to `$123`.

# NeMo Inverse Text Normalization

The NeMo ITN tool is a Python package that is based on weighted finite-state
transducer (WFST) grammars. The tool uses [`Pynini`](https://github.com/kylebgorman/pynini) to construct WFSTs, and the created grammars can be exported and integrated into [`Sparrowhawk`](https://github.com/google/sparrowhawk) (an open-source version of [The Kestrel TTS text normalization system](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)) for production. The NeMo ITN tool can be seen as a Python extension of `Sparrowhawk`. 

Currently, the NeMo tool supports English and the following semiotic classes from the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish):
DATE, CARDINAL, MEASURE, DECIMAL, ORDINAL, MONEY, TIME, PLAIN. 

The toolkit is modular, easily extendable, and can be adapted to other languages and to related tasks such as text normalization. Because it lives in a Python environment, the text-covering grammars are easy to combine with neural networks.

The overall NeMo ITN pipeline from development in `Pynini` to deployment in `Sparrowhawk` is shown below:
![alt text](deployment.png "Inverse Text Normalization Pipeline")

# Quick Start

## Add ITN to your Python ASR post-processing workflow

ITN is a part of the `nemo_tools` package and can be easily integrated into an existing pipeline. Installation instructions can be found [here](https://github.com/NVIDIA/NeMo/tree/main/nemo_tools).

In [None]:
from nemo_tools.text_denormalization.denormalize import denormalize

raw_text = "we paid one hundred and twenty three dollars for this desk, and this."
denormalize(raw_text, verbose=False)

In the above cell, `one hundred and twenty three dollars` is converted to `$123`, and the rest of the words remain unchanged.

## Run Inverse Text Normalization on an input from a file

Use `run_predict.py` to convert spoken-form text from a file `INPUT_FILE` to its written form and save the output to `OUTPUT_FILE`. Under the hood, `run_predict.py` calls `denormalize()` (see the above section).

In [None]:
# If you're running the notebook locally, update NEMO_TOOLS_PATH below.
# In Colab, the required scripts are downloaded from the NeMo GitHub repository.

NEMO_TOOLS_PATH = '<UPDATE_PATH_TO_NeMo_root>/nemo_tools/text_denormalization'
DATA_DIR = 'data_dir'
os.makedirs(DATA_DIR, exist_ok=True)

if 'google.colab' in str(get_ipython()):
    NEMO_TOOLS_PATH = '.'

    required_files = ['run_predict.py',
                      'run_evaluate.py']
    for file in required_files:
        if not os.path.exists(file):
            file_path = 'https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/nemo_tools/text_denormalization/' + file
            print(file_path)
            wget.download(file_path)
elif not os.path.exists(NEMO_TOOLS_PATH):
    raise ValueError('Update NEMO_TOOLS_PATH to point to <NeMo root>/nemo_tools/text_denormalization')

INPUT_FILE = f'{DATA_DIR}/test.txt'
OUTPUT_FILE = f'{DATA_DIR}/test_itn.txt'

! echo "on march second twenty twenty" > $DATA_DIR/test.txt
! python $NEMO_TOOLS_PATH/run_predict.py --input=$INPUT_FILE --output=$OUTPUT_FILE

In [None]:
# check that the raw text was indeed converted to the written form
! cat $OUTPUT_FILE

## Run evaluation

The [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish) consists of 1.1 billion words of English text from Wikipedia, divided across 100 files. The normalized text is obtained with [The Kestrel TTS text normalization system](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA).

Although a large fraction of this dataset can be reused for ITN by swapping input with output, the dataset is not bijective. 

For example, `1,000 -> one thousand`, `1000 -> one thousand`, `3:00pm -> three p m`, and `3 pm -> three p m` are all valid data samples for normalization, but the inverse mapping is ambiguous for ITN: `one thousand` could be written as either `1,000` or `1000`.

We used regex rules to disambiguate samples where possible; see `nemo_tools/text_denormalization/clean_eval_data.py`.
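The ambiguity described above can be made concrete with a small, library-free sketch. It groups normalization pairs by their spoken form and reports every spoken form that maps back to more than one written form. The helper name `ambiguous_spoken_forms` is hypothetical and is not part of `clean_eval_data.py`; the pairs are the examples quoted in this notebook.

```python
from collections import defaultdict

# (written, spoken) pairs in the normalization direction, taken from the examples above.
tn_pairs = [
    ('1,000', 'one thousand'),
    ('1000', 'one thousand'),
    ('3:00pm', 'three p m'),
    ('3 pm', 'three p m'),
    ('$123', 'one hundred and twenty three dollars'),
]


def ambiguous_spoken_forms(pairs):
    """Return {spoken: [written, ...]} for spoken forms with several written forms."""
    by_spoken = defaultdict(set)
    for written, spoken in pairs:
        by_spoken[spoken].add(written)
    return {spoken: sorted(forms) for spoken, forms in by_spoken.items() if len(forms) > 1}


for spoken, forms in ambiguous_spoken_forms(tn_pairs).items():
    print(f'{spoken!r} has {len(forms)} written forms: {forms}')
```

Samples that show up in this report cannot be scored by a plain swap of input and output, which is why they need to be disambiguated or filtered before evaluation.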

To run evaluation, the `input` file should follow the Google Text normalization dataset format.

Example evaluation run: 

`python run_evaluate.py \
        --input=./en_with_types/output-00001-of-00100 \
        [--denormalizer nemo] \
        [--cat CATEGORY] \
        [--filter]`
        
        
Use `--cat` to specify a `CATEGORY` to run evaluation on (all other categories are excluded from evaluation). With the `--filter` option, the provided data is cleaned to avoid ambiguities (use `clean_eval_data.py` to clean up the data upfront).
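For orientation, here is a minimal sketch of how a file in the Google Text normalization dataset format could be read. It assumes tab-separated rows of `class`, `written`, `spoken`, with `<self>` meaning that the token is spoken as written, `sil` marking silent punctuation, and an `<eos>` row closing each sentence. The function `read_sentences` is a hypothetical helper, not the loader used by `run_evaluate.py`.

```python
def read_sentences(lines):
    """Group tab-separated `class, written, spoken` rows into sentences."""
    sentences, current = [], []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if fields[0] == '<eos>':
            # An `<eos>` row closes the current sentence.
            sentences.append(current)
            current = []
        elif len(fields) == 3:
            semiotic_class, written, spoken = fields
            if spoken == '<self>':
                spoken = written  # token is read out unchanged
            elif spoken == 'sil':
                spoken = ''  # punctuation is not pronounced
            current.append((semiotic_class, written, spoken))
    return sentences


sample = (
    "PLAIN\tON\t<self>\n"
    "DATE\t22 July 2012\tthe twenty second of july twenty twelve\n"
    "PUNCT\t.\tsil\n"
    "<eos>\t<eos>\n"
)

sentences = read_sentences(sample.splitlines())
for semiotic_class, written, spoken in sentences[0]:
    # For ITN, the spoken column is the input and the written column is the target.
    print(f'{semiotic_class:6} spoken={spoken!r} -> written={written!r}')
```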

In [None]:
eval_text = """PLAIN	ON	<self>
DATE	22 July 2012	the twenty second of july twenty twelve
PUNCT	.	sil
<eos>	<eos>
"""

INPUT_FILE_EVAL = f'{DATA_DIR}/test_eval.txt'

with open(INPUT_FILE_EVAL, 'w') as f:
    f.write(eval_text)
    
! python $NEMO_TOOLS_PATH/run_evaluate.py --input=$INPUT_FILE_EVAL

The `run_evaluate.py` call outputs both **sentence-level** and **token-level** accuracies.
For our example, the expected output is the following:

```
Loading training data: test_eval.txt
Sentence level evaluation...
- Data: 1 sentences
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 79.83it/s]
- Deormalized. Evaluating...
- Accuracy: 1.0
Token level evaluation...
- Token type: PLAIN
  - Data: 1 tokens
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 519.68it/s]
  - Denormalized. Evaluating...
  - Accuracy: 1.0
- Token type: DATE
  - Data: 1 tokens
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 165.33it/s]
  - Denormalized. Evaluating...
  - Accuracy: 1.0
- Accuracy: 1.0
 - Total: 2 

         Class Num Tokens nemo
0   sent level          1  1.0
1        PLAIN          1  1.0
2         DATE          1  1.0
3     CARDINAL          0    0
4      LETTERS          0    0
5     VERBATIM          0    0
6      MEASURE          0    0
7      DECIMAL          0    0
8      ORDINAL          0    0
9        DIGIT          0    0
10       MONEY          0    0
11   TELEPHONE          0    0
12  ELECTRONIC          0    0
13    FRACTION          0    0
14        TIME          0    0
15     ADDRESS          0    0
```
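The difference between the two metrics can be sketched as follows. This is an illustrative approximation, not the code in `run_evaluate.py`: `toy_denormalize` is a hypothetical stand-in for a real denormalizer, and the exact matching rules of the script (casing, punctuation handling) are not reproduced here.

```python
from collections import defaultdict


def toy_denormalize(text):
    """Hypothetical stand-in for a real denormalizer: rewrites one known date."""
    return text.replace('the twenty second of july twenty twelve', '22 July 2012')


def token_accuracy(tokens, denormalizer):
    """tokens: list of (class, written, spoken). Returns {class: accuracy}."""
    correct, total = defaultdict(int), defaultdict(int)
    for semiotic_class, written, spoken in tokens:
        total[semiotic_class] += 1
        correct[semiotic_class] += int(denormalizer(spoken) == written)
    return {cls: correct[cls] / total[cls] for cls in total}


def sentence_accuracy(sentences, denormalizer):
    """A sentence counts as correct only if the whole written string matches."""
    hits = 0
    for tokens in sentences:
        spoken = ' '.join(s for _, _, s in tokens)
        written = ' '.join(w for _, w, _ in tokens)
        hits += int(denormalizer(spoken) == written)
    return hits / len(sentences)


tokens = [
    ('PLAIN', 'on', 'on'),
    ('DATE', '22 July 2012', 'the twenty second of july twenty twelve'),
]
print(token_accuracy(tokens, toy_denormalize))
print(sentence_accuracy([tokens], toy_denormalize))
```

Token-level accuracy isolates each semiotic class, while sentence-level accuracy also penalizes errors that only appear when tokens are processed in context.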

# C++ deployment

Instructions on how to export `Pynini` grammars and run them with `Sparrowhawk` can be found at [NeMo/text_denormalization/tools/text_denormalization](https://github.com/NVIDIA/NeMo/tree/text_denormalization/tools/text_denormalization).

# WFST and Common Pynini Operations

A finite-state acceptor (FSA) is a finite-state automaton that has a finite number of states and produces no output. An FSA either accepts a string (when a matching pattern is found) or rejects it (when no match is found).

In [None]:
print([byte for byte in bytes('fst', 'utf-8')])

# create an acceptor from a string
pynini.accep('fst')

Here, `0` is the start state, `1` and `2` are intermediate states, and `3` is the final (accepting) state.
By default (`token_type="byte"`), `Pynini` interprets the string as a sequence of bytes, assigning one byte per arc.

A finite state transducer (FST) not only matches the pattern but also produces output according to the defined transitions.

In [None]:
# create an FST
pynini.cross('fst', 'FST')

Pynini supports the following operations:

- `closure` - Computes concatenative closure.
- `compose` - Constructively composes two FSTs.
- `concat` - Computes the concatenation (product) of two FSTs.
- `difference` - Constructively computes the difference of two FSTs.
- `invert`  - Inverts the FST's transduction.
- `optimize` - Performs a generic optimization of the FST.
- `project` - Converts the FST to an acceptor using input or output labels.
- `shortestpath` - Constructs an FST containing the shortest path(s) in the input FST.
- `union` - Computes the union (sum) of two or more FSTs.


A list of the most commonly used `Pynini` operations can be found at [https://github.com/kylebgorman/pynini/blob/master/CHEATSHEET](https://github.com/kylebgorman/pynini/blob/master/CHEATSHEET). 

Pynini examples can be found at [https://github.com/kylebgorman/pynini/tree/master/pynini/examples](https://github.com/kylebgorman/pynini/tree/master/pynini/examples).
Use `help()` to explore the functionality. For example:

In [None]:
help(pynini.union)

# NeMo ITN API

NeMo ITN defines the following APIs that are called in sequence:

- `classify()` - creates a linear automaton from the input string and composes it with the final classification WFST, which transduces numbers and inserts semantic tags.  
- `parse()` - parses the tagged string into a list of key-value items representing the different semiotic tokens.
- `generate_reorderings()` - takes the parsed tokens and generates string serializations with different reorderings of the key-value items. This is important since WFSTs can only process input linearly, but the word order can change from spoken to written form (e.g., `three dollars -> $3`). 
- `verbalize()` - takes the intermediate string representation and composes it with the final verbalization WFST, which removes the tags and returns the written form.  

![alt text](pipeline.png "Inverse Text Normalization Pipeline")
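The four stages can be mimicked in plain Python to show why the reordering step is needed. Everything below is a toy: the function names mirror the API names above, but the bodies, the `money{...}` tag syntax, and the three-word vocabulary are hypothetical and do not reflect the actual NeMo implementation, which uses WFSTs for classification and verbalization.

```python
import itertools
import re

NUMBERS = {'one': '1', 'two': '2', 'three': '3'}
MONEY = re.compile(r'\b(' + '|'.join(NUMBERS) + r') dollars\b')


def classify(text):
    """Tag money spans with key-value fields, in spoken (left-to-right) order."""
    return MONEY.sub(lambda m: f'money{{integer:{NUMBERS[m.group(1)]},currency:$}}', text)


def parse(tagged):
    """Turn the tagged string into a list of (class, [(key, value), ...]) tokens."""
    tokens = []
    for item in tagged.split():
        match = re.fullmatch(r'money\{(.*)\}', item)
        if match:
            fields = [tuple(kv.split(':', 1)) for kv in match.group(1).split(',')]
            tokens.append(('money', fields))
        else:
            tokens.append(('plain', [('name', item)]))
    return tokens


def generate_reorderings(tokens):
    """Yield every combination of per-token field orderings."""
    per_token = [
        [(cls, list(order)) for order in itertools.permutations(fields)]
        for cls, fields in tokens
    ]
    return itertools.product(*per_token)


def verbalize(candidate):
    """Accept only the written order (currency before integer) and drop the tags."""
    words = []
    for cls, fields in candidate:
        if cls == 'money' and [key for key, _ in fields] != ['currency', 'integer']:
            return None  # this ordering is not accepted by the toy verbalizer
        words.append(''.join(value for _, value in fields))
    return ' '.join(words)


def toy_itn(text):
    for candidate in generate_reorderings(parse(classify(text))):
        written = verbalize(candidate)
        if written is not None:
            return written
    return text


print(toy_itn('we paid three dollars'))
```

The classifier emits the fields in the order they were spoken (integer, then currency), so the verbalizer only succeeds on the reordered candidate where the currency comes first.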

# References and Further Reading:

- [Ebden, Peter, and Richard Sproat. "The Kestrel TTS text normalization system." Natural Language Engineering 21.3 (2015): 333.](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)
- [Gorman, Kyle. "Pynini: A Python library for weighted finite-state grammar compilation." Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata. 2016.](https://www.aclweb.org/anthology/W16-2409.pdf)
- [Mohri, Mehryar, Fernando Pereira, and Michael Riley. "Weighted finite-state transducers in speech recognition." Computer Speech & Language 16.1 (2002): 69-88.](https://cs.nyu.edu/~mohri/postscript/csl01.pdf)