ThutmoseTaggerModel, a new model for inverse text normalization #4011

bene-ges · 2022-04-15T21:02:20Z

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

What does this PR do?

Adds ThutmoseTaggerModel, a new tagger-based model for inverse text normalization.

Collection: [NLP]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

[x] New Feature
[ ] Bugfix
[ ] Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)

lgtm-com · 2022-04-15T21:13:16Z

This pull request introduces 8 alerts when merging 42730ba into 9005f23 - view on LGTM.com

new alerts:

7 for Module-level cyclic import
1 for Module is imported with 'import' and 'import from'

examples/nlp/duplex_text_normalization/data/data_split.py

examples/nlp/text_normalization_as_tagging/conf/thutmose_tagger_itn_config.yaml

examples/nlp/text_normalization_as_tagging/utils/eval.py

examples/nlp/text_normalization_as_tagging/utils/eval_per_class.py

examples/nlp/text_normalization_as_tagging/utils/prepare_corpora_after_alignment.py

examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py

examples/nlp/text_normalization_as_tagging/utils/extract_giza_alignments.py

ekmb · 2022-04-19T17:19:13Z

examples/nlp/text_normalization_as_tagging/utils/extract_giza_alignments.py

+    '--giza_suffix', type=str, required=True, help='suffix of alignment files, e.g. \"Ahmm.5\", \"A3.final\"'
+)
+parser.add_argument('--out_filename', type=str, required=True, help='Output file')
+parser.add_argument('--lang', type=str, required=True, help="Language")


en and ru only?
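The question above is whether `--lang` should accept only the languages the script actually handles. A minimal sketch of one way to make that explicit with argparse's `choices`, assuming only `en` and `ru` are supported (an assumption taken from the reviewer's question, not confirmed in this thread); `SUPPORTED_LANGS` and `build_parser` are hypothetical names, not part of the PR:

```python
import argparse

# Assumption: only English and Russian are supported, as the review question suggests.
SUPPORTED_LANGS = ("en", "ru")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with an illustrative subset of the script's arguments."""
    parser = argparse.ArgumentParser(description="Extract GIZA++ alignments (illustrative sketch)")
    parser.add_argument(
        '--giza_suffix', type=str, required=True, help='suffix of alignment files, e.g. "Ahmm.5", "A3.final"'
    )
    parser.add_argument('--out_filename', type=str, required=True, help='Output file')
    # `choices` makes argparse reject any other value with a clear error message.
    parser.add_argument('--lang', type=str, required=True, choices=SUPPORTED_LANGS, help='Language: en or ru')
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(['--giza_suffix', 'A3.final', '--out_filename', 'out.tsv', '--lang', 'ru'])
print(args.lang)
```

With this in place, an unsupported value such as `--lang de` makes argparse print a usage error and exit instead of failing later inside the script.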

examples/nlp/text_normalization_as_tagging/utils/get_label_vocab.py

nemo/collections/nlp/models/text_normalization_as_tagging/thutmose_tagger.py

nemo/collections/nlp/data/text_normalization_as_tagging/thutmose_tagger_dataset.py

lgtm-com · 2022-04-27T21:58:04Z

This pull request introduces 1 alert when merging 1d48227 into 0d052c8 - view on LGTM.com

new alerts:

1 for Unused import

lgtm-com · 2022-04-28T08:17:17Z

This pull request introduces 1 alert when merging b03d10e into 70d9687 - view on LGTM.com

new alerts:

1 for Unused import

examples/nlp/text_normalization_as_tagging/helpers.py

examples/nlp/text_normalization_as_tagging/prepare_dataset_en.sh

examples/nlp/text_normalization_as_tagging/train_ru.sh

examples/nlp/text_normalization_as_tagging/utils/corpus_errors.ru

examples/nlp/text_normalization_as_tagging/utils/eval.py

examples/nlp/text_normalization_as_tagging/utils/extract_giza_alignments.py

examples/nlp/text_normalization_as_tagging/utils/filter_sentences_with_errors.py

nemo/collections/nlp/data/text_normalization_as_tagging/utils.py

ekmb · 2022-04-28T17:10:01Z

nemo/collections/nlp/models/text_normalization_as_tagging/thutmose_tagger.py

+
+        src_hiddens = self.bert_model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask)
+        log_softmax = self.logits(hidden_states=src_hiddens)
+        log_softmax_semiotic = self.semiotic_logits(hidden_states=src_hiddens)


Why is this called log_softmax if log_softmax=False in the self.logits init?

bene-ges · 2022-04-28

OK, I renamed it.
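The naming matters because raw logits and log-softmax outputs are different quantities, even though both give the same argmax. A small plain-Python sketch of the difference (this is not the NeMo code; the `log_softmax` helper and the example scores are made up for illustration):

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


# Unnormalized scores: what a classifier head returns when log-softmax is turned off.
raw_logits = [2.0, 0.5, -1.0]
log_probs = log_softmax(raw_logits)

# Exponentiated log-probabilities sum to 1; exponentiated raw logits generally do not.
print(sum(math.exp(v) for v in log_probs))
print(sum(math.exp(x) for x in raw_logits))

# The predicted class (argmax) is the same for both.
print(raw_logits.index(max(raw_logits)) == log_probs.index(max(log_probs)))
```

A variable holding the first kind of value is better named something like `logits`, which is presumably what the rename addressed.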

ekmb · 2022-04-28T17:11:06Z

nemo/collections/nlp/models/text_normalization_as_tagging/thutmose_tagger.py

+                    span_predictions.append(cid)
+                else:
+                    span_predictions.append(self.tag_classification_report.num_classes - 1)  # this stands for WRONG
+            assert len(span_labels) == len(span_predictions)


Raise errors instead of using assert.

ekmb · 2022-04-28T17:11:11Z

nemo/collections/nlp/models/text_normalization_as_tagging/thutmose_tagger.py

+                else:
+                    # this stands for WRONG
+                    multiword_span_predictions.append(self.tag_classification_report.num_classes - 1)
+            assert len(multiword_span_labels) == len(multiword_span_predictions)


Raise errors instead of using assert.
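The two review comments above ask for the length checks to raise exceptions. A minimal sketch of that pattern, assuming `ValueError` is an acceptable exception type (the thread does not say which one was chosen); `check_same_length` is a hypothetical helper, not a function from the PR:

```python
def check_same_length(labels, predictions, name="span"):
    """Raise instead of assert: the check still runs under `python -O`, which strips assert statements."""
    if len(labels) != len(predictions):
        raise ValueError(
            f"Length mismatch for {name}: {len(labels)} labels vs {len(predictions)} predictions"
        )


# Equal lengths pass silently.
check_same_length([1, 2, 3], [1, 2, 0])

# A mismatch raises a descriptive error that callers can catch.
try:
    check_same_length([1, 2, 3], [1, 2], name="multiword span")
except ValueError as err:
    print(err)
```

Compared with `assert len(a) == len(b)`, the explicit raise carries a message that names the offending lengths and cannot be disabled by interpreter flags.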

lgtm-com · 2022-04-29T17:18:48Z

This pull request introduces 1 alert when merging fc8c6ce into 58ff608 - view on LGTM.com

new alerts:

1 for Unused import

lgtm-com · 2022-04-29T17:32:11Z

This pull request introduces 1 alert when merging 883d9e8 into 58ff608 - view on LGTM.com

new alerts:

1 for Unused import

lgtm-com · 2022-04-29T17:44:04Z

This pull request introduces 1 alert when merging 7073117 into 58ff608 - view on LGTM.com

new alerts:

1 for Unused import

examples/nlp/text_normalization_as_tagging/dataset_preparation/prepare_corpora_for_alignment.py

nemo/collections/nlp/models/text_normalization_as_tagging/thutmose_tagger.py

ThutmoseTaggerModel, a new model for inverse text normalization

42730ba

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

ekmb self-requested a review April 15, 2022 21:18

Alexandra Antonova added 4 commits April 17, 2022 18:14

fix circular imports

8fc60a1

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

small fixes for logging

805dda8

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

black codestyle

badbf96

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

1. Remove unused regexps from extract_giza_alignments.py. 2. Change l…

25a3b33

…icense headers Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

ekmb requested changes Apr 19, 2022

Alexandra Antonova added 3 commits April 22, 2022 20:17

Fixes for pr review

e36ec94

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

1. Add file corpus_errors.ru. 2. Style fixes

d610544

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

Add test to Jenkins

4940610

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

bene-ges marked this pull request as ready for review April 26, 2022 19:39

Alexandra Antonova added 3 commits April 28, 2022 00:35

Add head for semiotic labels

dba0d14

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

style fix

c495f5b

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

Merge latest main

1d48227

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

test commit to debug jenkins

b03d10e

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

look at semiotic predictions during swapping

b08c7ce

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

ekmb requested changes Apr 28, 2022

ekmb reviewed Apr 28, 2022

fixes for code review

fc8c6ce

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

fix setup

883d9e8

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

bene-ges requested a review from ekmb April 29, 2022 17:20

remove commented code

7073117

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

ekmb reviewed Apr 29, 2022

examples/nlp/text_normalization_as_tagging/dataset_preparation/prepare_corpora_for_alignment.py (outdated, resolved)

More fixes for code review

e96a242

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

ekmb reviewed Apr 29, 2022

examples/nlp/text_normalization_as_tagging/dataset_preparation/prepare_corpora_for_alignment.py (resolved)

nemo/collections/nlp/models/text_normalization_as_tagging/thutmose_tagger.py (outdated, resolved)

change assert to raise

a46f497

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

ekmb previously approved these changes Apr 29, 2022

Alexandra Antonova added 2 commits April 30, 2022 00:40

Merge branch 'main' into thutmose_tagger

2039b8c

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

style

155171e

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

bene-ges dismissed ekmb’s stale review via 155171e April 29, 2022 21:48

ekmb approved these changes Apr 29, 2022

Merge branch 'main' into thutmose_tagger

43c283a

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

ekmb merged commit c232ef3 into NVIDIA:main Apr 29, 2022

bene-ges deleted the thutmose_tagger branch May 3, 2022 14:22
