In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect
"""

# Problem Definition

<br>
<br>
<img src="https://raw.githubusercontent.com/NVIDIA/NeMo/tn_tutorial/tutorials/text_processing/images/task_overview.png" width="600"/>


**Text normalization (TN)** is the task of converting text in canonical written form to its verbalized form. For example, *10:00* -> *ten o'clock* or *10kg* -> *ten kilograms*.

**Inverse text normalization (ITN)** does the reverse and converts verbalized text back to its written form. For example, *in nineteen seventy five* -> *in 1975* and *one hundred and twenty three dollars* -> *$123*.

A sentence can be split up into semiotic tokens stemming from a variety of classes, where the spoken form differs from the written form. Examples are *dates*, *decimals*, *cardinals*, *measures*, etc. A good TN or ITN system is able to handle a variety of **semiotic classes**.

TN is used in the pre-processing stage of Text-To-Speech (TTS) systems, whereas ITN is used to post-process Automatic Speech Recognition (ASR) outputs. Audio-based TN can be used to normalize ASR training data for better ASR accuracy.


# NeMo package overview

`nemo_text_processing` is a Python package automatically installed with [`NeMo`](https://github.com/NVIDIA/NeMo). `nemo_text_processing` supports 
- TN
- audio-based TN
- ITN

The toolkit is based on weighted finite-state transducer (WFST) grammars. It uses [`Pynini`](https://www.openfst.org/twiki/bin/view/GRM/PyniniDocs) to construct WFSTs.
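
To make the idea of a weighted transducer concrete, here is a toy sketch in plain Python. It is not Pynini code and not how the NeMo grammars are built: it models a single-state transducer whose arcs map a digit to one or more spoken forms, each with a weight, and it prefers the output with the lowest total weight. The arc table, the weights and the `transduce` helper are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Arcs of a single-state toy transducer: input symbol -> (output, weight) alternatives.
# Lower weight means preferred. Symbols and weights are invented for this sketch.
ARCS: Dict[str, List[Tuple[str, float]]] = {
    "0": [("zero", 0.0), ("oh", 1.0)],
    "1": [("one", 0.0)],
    "2": [("two", 0.0)],
    "7": [("seven", 0.0)],
}


def transduce(text: str) -> List[Tuple[float, str]]:
    """Return every (total weight, output) pair the toy transducer accepts, cheapest first."""
    paths: List[Tuple[float, List[str]]] = [(0.0, [])]
    for symbol in text:
        if symbol not in ARCS:
            return []  # no arc for this symbol: the input is rejected
        # extend every partial path with every alternative for this symbol
        paths = [
            (weight + arc_weight, words + [word])
            for weight, words in paths
            for word, arc_weight in ARCS[symbol]
        ]
    return sorted((weight, " ".join(words)) for weight, words in paths)


options = transduce("007")
print(options[0][1])  # zero zero seven
print(len(options))   # 4
```

A real grammar has many states and composes several transducers, but the principle is the same: the weights rank competing outputs for the same input.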

`nemo_text_processing` supports a wide range of grammars across a number of languages.

The toolkit is modular and easily extendable. A tutorial for system details and customization can be found in the [WFST tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/text_processing/WFST_Tutorial.ipynb). The Python environment allows integration into an existing Python application. 

The Python system can be seamlessly deployed to C++. The pipeline is shown below.
The WFST-based grammars can be exported into an [OpenFST](https://www.openfst.org/) Archive File (FAR) and dropped into [`Sparrowhawk`](https://github.com/google/sparrowhawk), an open-source version of the [Kestrel TTS text normalization system](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA). As an example, `nemo_text_processing` is used in [NVIDIA RIVA](https://www.nvidia.com/en-us/ai-data-science/products/riva-enterprise/).

<img src="https://raw.githubusercontent.com/NVIDIA/NeMo/tn_tutorial/tutorials/text_processing/images/deployment_pipeline.png" width="600"/>

# How to use
### 1. Installation

In [None]:
## Install NeMo, which installs both the nemo and nemo_text_processing packages
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nemo_text_processing]

In [1]:
# try importing nemo_text_processing and other dependencies
import nemo_text_processing
import os

### 2. Text Normalization

In [2]:
# create text normalization instance that works on cased input
from nemo_text_processing.text_normalization.normalize import Normalizer
normalizer = Normalizer(input_case='cased', lang='en')

[NeMo W 2022-04-28 20:41:26 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.


[NeMo I 2022-04-28 20:41:26 tokenize_and_classify:92] Creating ClassifyFst grammars.


#### 2.1 Run TN on input string

In [3]:
# run normalization on example string input
written = "We paid $123 for this desk."
normalized = normalizer.normalize(written, verbose=True, punct_post_process=True)
print(normalized)

tokens { name: "We" } tokens { name: "paid" } tokens { money { currency_maj: "dollars" integer_part: "one hundred and twenty three" } } tokens { name: "for" } tokens { name: "this" } tokens { name: "desk" }  tokens { name: "." }
We paid one hundred and twenty three dollars for this desk.


Intermediate semiotic class information is shown if `verbose=True`.
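
As a rough illustration of how the tagged output can be read, the sketch below extracts one label per token from the verbose string printed above. It assumes only the textual layout visible in that output (`tokens { name: ... }` for plain words, `tokens { <class> { ... } }` for semiotic tokens); `token_classes` is a hypothetical helper, not part of `nemo_text_processing`.

```python
import re
from typing import List

# Verbose output copied from the cell above.
tagged = (
    'tokens { name: "We" } tokens { name: "paid" } '
    'tokens { money { currency_maj: "dollars" integer_part: "one hundred and twenty three" } } '
    'tokens { name: "for" } tokens { name: "this" } tokens { name: "desk" }  tokens { name: "." }'
)

# After "tokens {" comes either a plain field ("name:") or a nested class block ("money {").
TOKEN_HEAD = re.compile(r'tokens \{ (\w+)\s*([:{])')


def token_classes(tagged_text: str) -> List[str]:
    """Return one label per token: the semiotic class name, or 'plain' for name-only tokens."""
    labels = []
    for head, delimiter in TOKEN_HEAD.findall(tagged_text):
        labels.append(head if delimiter == "{" else "plain")
    return labels


print(token_classes(tagged))
# ['plain', 'plain', 'money', 'plain', 'plain', 'plain', 'plain']
```
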
#### 2.2 Run TN on input file

In [4]:
# create temporary data folder and example input file
DATA_DIR = 'tmp_data_dir'
os.makedirs(DATA_DIR, exist_ok=True)
INPUT_FILE = f'{DATA_DIR}/inference.txt'
! echo -e 'The alarm went off at 10:00a.m. \nI received $123' > $INPUT_FILE

In [5]:
# check input file was properly created
! cat $INPUT_FILE

The alarm went off at 10:00a.m. 
I received $123


In [6]:
# load input file into 'data' - a list of strings
data = []
with open(INPUT_FILE, 'r') as fp:
    for line in fp:
        data.append(line.strip())
data

['The alarm went off at 10:00a.m.', 'I received $123']

In [7]:
# run normalization on 'data'
normalizer.normalize_list(data, punct_post_process=True)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.13it/s]


['The alarm went off at ten AM',
 'I received one hundred and twenty three dollars']

#### 2.3 Evaluate TN on written-normalized text pairs

The evaluation data needs to follow a specific format. For example, the sentence 'on 22 july 2012 they worked until 12:00' and its normalization are represented as:

In [8]:
# example evaluation sentence
eval_text =  """PLAIN\ton\t<self>
DATE\t22 july 2012\tthe twenty second of july twenty twelve
PLAIN\tthey\t<self>
PLAIN\tworked\t<self>
PLAIN\tuntil\t<self>
TIME\t12:00\ttwelve o'clock
<eos>\t<eos>
"""
EVAL_FILE = f'{DATA_DIR}/eval.txt'
with open(EVAL_FILE, 'w') as fp:
    fp.write(eval_text)
! cat $EVAL_FILE

PLAIN	on	<self>
DATE	22 july 2012	the twenty second of july twenty twelve
PLAIN	they	<self>
PLAIN	worked	<self>
PLAIN	until	<self>
TIME	12:00	twelve o'clock
<eos>	<eos>


That is, every sentence is broken into semiotic tokens, one per line, and terminated by the end-of-sentence token `<eos>`. A plain token is written as `[SEMIOTIC CLASS] [TAB] [WRITTEN] [TAB] <self>`; any other token as `[SEMIOTIC CLASS] [TAB] [WRITTEN] [TAB] [NORMALIZED]`.
This format was introduced in the [Google text normalization dataset](https://arxiv.org/abs/1611.00068).
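
The format is simple enough to parse by hand. The following is a minimal sketch in plain Python, assuming exactly the layout described above and handling only the `<self>` marker; `parse_sentences` and `sample` are hypothetical names, and this is not the implementation behind the NeMo data loader utilities.

```python
from typing import List, Tuple

# Same sentence as in the evaluation file above.
sample = (
    "PLAIN\ton\t<self>\n"
    "DATE\t22 july 2012\tthe twenty second of july twenty twelve\n"
    "PLAIN\tthey\t<self>\n"
    "PLAIN\tworked\t<self>\n"
    "PLAIN\tuntil\t<self>\n"
    "TIME\t12:00\ttwelve o'clock\n"
    "<eos>\t<eos>\n"
)


def parse_sentences(text: str) -> List[Tuple[str, str]]:
    """Turn tab-separated token lines into (written, normalized) sentence pairs."""
    pairs, written, normalized = [], [], []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "<eos>":
            # end of sentence: emit the pair and start a new one
            pairs.append((" ".join(written), " ".join(normalized)))
            written, normalized = [], []
        elif len(fields) == 3:
            _semiotic_class, token, spoken = fields
            written.append(token)
            # "<self>" means the spoken form equals the written form
            normalized.append(token if spoken == "<self>" else spoken)
    return pairs


print(parse_sentences(sample))
```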

In [9]:
# Parse evaluation file into written and normalized sentence pairs
from nemo_text_processing.text_normalization.data_loader_utils import load_files, training_data_to_sentences
eval_data = load_files([EVAL_FILE])
sentences_un_normalized, sentences_normalized, sentences_class_types = training_data_to_sentences(eval_data)
print(list(zip(sentences_un_normalized, sentences_normalized)))

[('on 22 july 2012 they worked until 12:00', "on the twenty second of july twenty twelve they worked until twelve o'clock")]


In [10]:
# run prediction
sentences_prediction = normalizer.normalize_list(sentences_un_normalized)
print(sentences_prediction)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14.93it/s]

["on the twenty second of july twenty twelve they worked until twelve o'clock"]





In [11]:
# measure sentence accuracy
from nemo_text_processing.text_normalization.data_loader_utils import evaluate
sentences_accuracy = evaluate(
    preds=sentences_prediction, labels=sentences_normalized, input=sentences_un_normalized
)
print("- Accuracy: " + str(sentences_accuracy))

- Accuracy: 1.0


You can also break down evaluation accuracy by semiotic class using the script [`NeMo/nemo_text_processing/text_normalization/run_evaluate.py`](https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/run_evaluate.py).
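
To show what a per-class breakdown means, here is a small sketch that computes exact-match accuracy per semiotic class. It is not the logic of `run_evaluate.py`: `accuracy_by_class` is a hypothetical helper, and the lookup-based normalizer is a stand-in (deliberately wrong on the TIME token) so that the example runs without NeMo installed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def accuracy_by_class(
    tokens: Iterable[Tuple[str, str, str]], normalize_fn: Callable[[str], str]
) -> Dict[str, float]:
    """Exact-match accuracy per semiotic class for (class, written, expected) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for semiotic_class, written, expected in tokens:
        total[semiotic_class] += 1
        if normalize_fn(written) == expected:
            correct[semiotic_class] += 1
    return {name: correct[name] / total[name] for name in total}


# Stand-in normalizer: a fixed lookup table that falls back to the input text.
lookup = {"22 july 2012": "the twenty second of july twenty twelve", "12:00": "twelve"}
tokens = [
    ("DATE", "22 july 2012", "the twenty second of july twenty twelve"),
    ("TIME", "12:00", "twelve o'clock"),
    ("PLAIN", "they", "they"),
]
print(accuracy_by_class(tokens, lambda text: lookup.get(text, text)))
# {'DATE': 1.0, 'TIME': 0.0, 'PLAIN': 1.0}
```

In the notebook, any callable that maps a written string to its normalized form could be passed as `normalize_fn`.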

### 3. Inverse Text Normalization
ITN supports an API equivalent to TN's. Here we only show inverse normalization on an input string.

In [12]:
# create inverse text normalization instance
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer
inverse_normalizer = InverseNormalizer(lang='en')

[NeMo I 2022-04-28 20:41:45 tokenize_and_classify:70] Creating ClassifyFst grammars.


In [13]:
# run ITN on example string input
spoken = "we paid one hundred twenty three dollars for this desk"
un_normalized = inverse_normalizer.inverse_normalize(spoken, verbose=True)
print(un_normalized)

tokens { name: "we" } tokens { name: "paid" } tokens { money { integer_part: "123" currency: "$" } } tokens { name: "for" } tokens { name: "this" } tokens { name: "desk" }
we paid $123 for this desk


### 4. Audio-based Text Normalization
Audio-based text normalization uses extended WFST grammars to provide a range of possible normalization options.
The following example shows the workflow. (Disclaimer: the exact values in the graphic do not necessarily reflect the real system's behavior.)
1. The text "627" is sent to the extended TN WFST grammar.
2. The grammar outputs five different verbalization options based on the text input alone.
3. If an audio file is provided, the audio transcript is compared with the verbalization options to find the correct normalization based on character error rate. The transcript is generated using a pretrained NeMo ASR model.

More information can be found at https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html#audio-based-text-normalization.
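
The selection in step 3 can be sketched in plain Python. This is an illustration under stated assumptions, not NeMo's implementation: character error rate is taken as the Levenshtein distance divided by the transcript length, the transcript is treated as the reference, and the options and transcript below are made up. `edit_distance`, `cer` and `pick_option` are hypothetical helpers.

```python
from typing import List, Tuple


def edit_distance(reference: str, hypothesis: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_char != hyp_char),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def pick_option(transcript: str, options: List[str]) -> Tuple[str, float]:
    """Return the option with the lowest CER against the transcript, with its CER."""
    scored = [(cer(transcript, option), option) for option in options]
    best_cer, best_option = min(scored)
    return best_option, best_cer


# Made-up verbalization options and ASR transcript for the written input "627".
options = ["six hundred twenty seven", "six two seven", "six twenty seven"]
print(pick_option("six twenty seven", options))
# ('six twenty seven', 0.0)
```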

<img src="https://raw.githubusercontent.com/NVIDIA/NeMo/tn_tutorial/tutorials/text_processing/images/audio_based_tn.png" width="600"/>

The following shows an example of how to generate multiple normalization options:

In [None]:
# import non-deterministic WFST-based TN module
from nemo_text_processing.text_normalization.normalize_with_audio import NormalizerWithAudio

# initialize normalizer
normalizer = NormalizerWithAudio(
    lang="en",
    input_case="cased",
    overwrite_cache=False,
    cache_dir="cache_dir",
)
# create up to 10 normalization options
print(normalizer.normalize("123", n_tagged=10, punct_post_process=True))

# C++ deployment

Sparrowhawk is written in C++ and operates similarly to NeMo's Python TN and ITN.
To deploy the grammars you need:
- [Docker](https://www.docker.com/)
- the [NeMo source code](https://github.com/NVIDIA/NeMo), which includes the grammars

Run [NeMo/tools/text_processing_deployment/export_grammars.sh](https://github.com/NVIDIA/NeMo/blob/main/tools/text_processing_deployment/export_grammars.sh).

More details can be found under https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_processing_deployment.html

# Tutorial on how to customize grammars

https://github.com/NVIDIA/NeMo/blob/main/tutorials/text_processing/WFST_Tutorial.ipynb

https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/intro.html 

# References and Further Reading:


- [Zhang, Yang, Bakhturina, Evelina, Gorman, Kyle and Ginsburg, Boris. "NeMo Inverse Text Normalization: From Development To Production." (2021)](https://arxiv.org/abs/2104.05055)
- [Ebden, Peter, and Richard Sproat. "The Kestrel TTS text normalization system." Natural Language Engineering 21.3 (2015): 333.](https://www.cambridge.org/core/journals/natural-language-engineering/article/abs/kestrel-tts-text-normalization-system/F0C18A3F596B75D83B75C479E23795DA)
- [Gorman, Kyle. "Pynini: A Python library for weighted finite-state grammar compilation." Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata. 2016.](https://www.aclweb.org/anthology/W16-2409.pdf)
- [Mohri, Mehryar, Fernando Pereira, and Michael Riley. "Weighted finite-state transducers in speech recognition." Computer Speech & Language 16.1 (2002): 69-88.](https://cs.nyu.edu/~mohri/postscript/csl01.pdf)