Remove generic tokenizer and support multiple languages for the word cloud. #388

gabegma · 2023-01-24T01:06:14Z

Description:

I investigated whether the generic tokenizer was still useful. It turned out that it wasn't really, so I made the following changes:

- Dummy saliencies (equal to 0) are no longer computed when saliency is not available. Computing them was counter-productive: they were not used in the UI, and for the top words, the words were converted to sub-tokens and then back to words.
- Because of that, I added saliency as an optional startup task.
- Furthermore, I needed to refactor the top words code so that it no longer calls the dummy saliency method when saliency is not available. This turned out to be an opportunity: I now use the spaCy model (which, as we know, can be defined per language), which already has its own list of punctuation chars (is_punct) and stop words (is_stop). The chars are a bit different from the ones we had, but as you can see in the tests, the differences seem ok to me. Plus, it will now work for French :).
- However, I needed to add a char replacement for this to work. The English spaCy model did not recognize "`" as a punctuation sign. I believe this also caused problems that we weren't seeing when counting tokens or performing POS tagging. Replacing it with "'" slightly changed the values in the dataset warnings (token count).
  - Example from sst2 with the "`":
  designed to provide a mix of smiles and tears , `` crossroads '' instead provokes a handful of unintentional howlers and numerous yawns .
- I needed to change the scope of TopWordsModule from ModelContractConfig to AzimuthConfig, since it now relies on either the model config or the syntax config, depending on whether saliency is available. AzimuthConfig is a bit too broad, but since this module computes fast, I think that it is ok.
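
To illustrate the backtick replacement and the punctuation/stop word filtering described above, here is a minimal standalone sketch. It is not the Azimuth code and it does not use spaCy: it assumes punctuation detection follows Unicode categories (which is my understanding of how is_punct behaves, but treat that as an assumption), and the function names and the tiny stop word list are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Iterable, List, Set, Tuple


def is_unicode_punct(token: str) -> bool:
    """Return True if every character is in a Unicode punctuation category (P*)."""
    return bool(token) and all(unicodedata.category(c).startswith("P") for c in token)


def normalize_backticks(utterance: str) -> str:
    """Replace backticks with apostrophes so that they count as punctuation."""
    return utterance.replace("`", "'")


def top_words(
    utterances: Iterable[str], stop_words: Set[str], top_x: int = 5
) -> List[Tuple[str, int]]:
    """Count words, skipping punctuation-only tokens and stop words."""
    counts: Counter = Counter()
    for utterance in utterances:
        # A whitespace split is enough here because the example is already
        # whitespace-separated; a real tokenizer would replace this step.
        for token in normalize_backticks(utterance).lower().split():
            if is_unicode_punct(token) or token in stop_words:
                continue
            counts[token] += 1
    return counts.most_common(top_x)


SST2_EXAMPLE = (
    "designed to provide a mix of smiles and tears , `` crossroads '' instead "
    "provokes a handful of unintentional howlers and numerous yawns ."
)
# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"a", "and", "of", "to"}

print(is_unicode_punct("``"))  # False: "`" is a modifier symbol (Sk), not punctuation
print(is_unicode_punct(normalize_backticks("``")))  # True: "'" is punctuation (Po)
print(top_words([SST2_EXAMPLE], STOP_WORDS, top_x=3))
```

Without the replacement, the "``" token would survive the punctuation filter and be counted as a word, which is the kind of silent discrepancy mentioned above for token counts.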

Checklist:

You should check all boxes before the PR is ready. If a box does not apply, check it to acknowledge it.

ISSUE NUMBER. You linked the issue number (Ex: Resolve #XXX).
PRE-COMMIT. You ran pre-commit on all commits, or else, you ran pre-commit run --all-files at the end.
USER CHANGES. The changes are added to CHANGELOG.md and the documentation, if they impact our users.
DEV CHANGES.
- Update the documentation if this PR changes how to develop/launch on the app.
- Update the README files and our wiki for any big design decisions, if relevant.
- Add unit tests, docstrings, typing and comments for complex sections.

JosephMarinier

Cool cool cool! 😚👌 I only have some small comments.

azimuth/utils/utterance.py

azimuth/modules/base_classes/artifact_manager.py

azimuth/modules/word_analysis/top_words.py

gabegma · 2023-01-25T20:22:55Z

@JosephMarinier @lindsaydbrin actually, I realized that I could create a specific config scope for TopWords. See my last commit. Does that make sense to you?

azimuth/modules/word_analysis/top_words.py

lindsaydbrin · 2023-01-25T20:35:43Z

> @JosephMarinier @lindsaydbrin actually, I realized that I could create a specific config scope for TopWords. See my last commit. Does that make sense to you?

If it works, seems like a more direct approach! It seems like if we do this a lot, we could have a tangled mess of custom config pairings, but we can deal with that if it happens. But anyway, generally, I'd defer to @JosephMarinier on this.

docs/docs/user-guide/exploration-space/prediction-overview.md

azimuth/modules/model_contracts/hf_text_classification.py

lindsaydbrin

Looks great - thanks for taking care of this, and for such helpful Description context! Some small edits on comments, one small question mostly for my understanding, and I'll let Joseph comment on the config scope change. It was already approved, but here's another anyhow. 😆

Co-authored-by: Lindsay Brin <lindsay.brin@servicenow.com>

azimuth/config.py

gabegma added 2 commits January 23, 2023 19:26

Refactor top words

3ef02ef

Update documentation

7b5b7ae

gabegma self-assigned this Jan 24, 2023

gabegma changed the title ~~Remove generic tokenizer~~ Remove generic tokenizer and support multiple languages for the word cloud. Jan 24, 2023

Remove tokenizer and dummy saliency

c0db735

gabegma force-pushed the ggm/remove-tokenizer branch from adde1d1 to c0db735 Compare January 24, 2023 01:45

gabegma requested review from JosephMarinier and lindsaydbrin January 24, 2023 02:05

JosephMarinier reviewed Jan 24, 2023

azimuth/utils/utterance.py (outdated)

JosephMarinier reviewed Jan 24, 2023

azimuth/modules/word_analysis/top_words.py (outdated)

JosephMarinier reviewed Jan 24, 2023

tests/test_modules/test_word_analysis/test_top_words.py (outdated)

JosephMarinier reviewed Jan 24, 2023

azimuth/modules/word_analysis/top_words.py (outdated)

JosephMarinier approved these changes Jan 24, 2023

Adapt based on comments

c1ff0a5

JosephMarinier reviewed Jan 25, 2023

azimuth/utils/utterance.py (outdated)

Move reg ex

d60066a