Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azimuth for French with language selection in config #239

Merged
merged 37 commits into from Nov 30, 2022
Merged

Conversation

lindsaydbrin
Copy link
Contributor

@lindsaydbrin lindsaydbrin commented Sep 9, 2022

Resolve #265

New news

No longer a draft PR; should run acceptably on French. Language-switching from config, updated behavioral tests, and multilingual FAISS encoder now included. Also more consistent reference to language-specific defaults in docs.

Old news (for reference):

Draft PR because:

  • Goal was to get it working on a branch / not meant to be merged.
  • More work required or preferred:
    • Language-switching from config (not hard-coded French, as it is now) = required
    • Changes for behavioral tests
    • Changes for FAISS encoder

Description:

New news

This code makes Azimuth work on French data/pipelines, and the language can be selected in the config.

Main changes:

  • Part of Speech tagging: French defaults for spaCy model and dependency tags (to detect subjects and objects)
  • Behavioral tests: French neutral tokens; fixed removal of space before punctuation (also submitted nlpaug issue)
  • Config: Dynamic selection of language-specific defaults in the config

Out of scope things that could improve French Azimuth:

  • Multilingual FAISS encoder (English-trained option works acceptably but not optimally)
  • Multilingual tokenizer (Only used for comparing average token count per utterance between train and eval, so should have minimal impact on any conclusions)

Old news (for reference)

Addressed one aspect of Azimuth running for French models: spaCy part-of-speech tagging. This branch loads a French model (instead of English) and the missing_xyz smart tag POS/dependency tag lists have been updated to include French tags.

Several issues came up:

  1. Single quote issue: Dataset included text with a single quote instead of an apostrophe, which broke POS-tagging (in particular, j' was not tagged as a subject) and search from the control panel. I added a new function to replace single quotes with apostrophes and used it both for searching and filtering. It worked reasonably well; out of 370 instances of j' in the eval set, 326 were tagged missing_verb before the fix, and 22 afterward (most of which started with "j'aimerais").
  2. So then there was a problem with "j'" in "j'aimerais" being tagged as a subject (sometimes). This was solved with the md spaCy model (instead of sm).
  3. SpaCy sometimes failed to identify verbs when they were the first word (common, especially with the imperative tense, e.g. "aidez-moi...") and/or capitalized (i.e. even in the middle of the sentence, which is not typical). Changing to the md model was a substantial improvement and fixed the majority of the issue. The exceptions seem to all be cases when the verb is capitalized (but not all capitalized verbs have problems; most don't). The capitalization issue seems to be fixed with the lg model...but it's like 500 MB...so I'm not sure if that's worth it (and haven't done it yet).

Checklist:

You should check all boxes before the PR is ready. If a box does not apply, check it to acknowledge it.

  • ISSUE NUMBER. You linked the issue number (Ex: Resolve #XXX).
  • PRE-COMMIT. You ran pre-commit on all commits, or else, you
    ran pre-commit run --all-files at the end.
  • FRONTEND TYPES. Regenerate the front-ent types if you played with types and routes.
    Run cd webapp && yarn types while the back-end is running.
  • USER CHANGES. The changes are added to CHANGELOG.md and the documentation, if they impact
    our users.
  • DEV CHANGES.
    • Update the documentation if this PR changes how to develop/launch on the app.
    • Update the README files and our wiki for any big design decisions, if relevant.
    • Add unit tests, docstrings, typing and comments for complex sections.

Copy link
Contributor

@gabegma gabegma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, this gives us a good idea of the changes required! Some comments.

azimuth/utils/filtering.py Outdated Show resolved Hide resolved
poetry.lock Outdated Show resolved Hide resolved
azimuth/config.py Outdated Show resolved Hide resolved
azimuth/config.py Outdated Show resolved Hide resolved
azimuth/config.py Outdated Show resolved Hide resolved
@lindsaydbrin lindsaydbrin marked this pull request as ready for review October 7, 2022 19:15
pyproject.toml Outdated Show resolved Hide resolved
Copy link
Contributor

@JosephMarinier JosephMarinier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hooo yeah! 👍

@lindsaydbrin
Copy link
Contributor Author

@gabegma I'll wait for you to read docs (at your leisure - hope you enjoy nonfiction 😄 📖 ) before merging!

@lindsaydbrin lindsaydbrin changed the title Lindsay/french Run Azimuth on French data/models with language selection in config Oct 19, 2022
@lindsaydbrin lindsaydbrin changed the title Run Azimuth on French data/models with language selection in config Azimuth for French with language selection in config Oct 19, 2022
Copy link
Contributor

@gabegma gabegma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome PR! And great work with the documentation. I have a few comments. I've left the regex review to @JosephMarinier ;)

azimuth/config.py Outdated Show resolved Hide resolved
azimuth/utils/ml/perturbation_functions.py Outdated Show resolved Hide resolved
azimuth/utils/ml/perturbation_functions.py Outdated Show resolved Hide resolved
docs/docs/key-concepts/syntax-analysis.md Outdated Show resolved Hide resolved
docs/docs/reference/configuration/analyses/syntax.md Outdated Show resolved Hide resolved
docs/docs/reference/configuration/language.md Show resolved Hide resolved
docs/docs/reference/configuration/project.md Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
Copy link
Contributor

@gabegma gabegma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!! Love the documentation :)

poetry.lock Show resolved Hide resolved
# This class should remain empty!
pass
# Before adding attributes: Remember that dependence on an attribute in AzimuthConfig will
# cause a module to include all other configs in its scope.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change only one word to be clear that it won't automatically "cause" that, but it will "force" you to set the whole AzimuthConfig as the module's scope.

Suggested change
# cause a module to include all other configs in its scope.
# force the module to include all other configs in its scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# cause a module to include all other configs in its scope.
# force a module to include all other configs in its scope.

Sounds good! Just going to use "a" instead of "the."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I used "the" was to indicate that I am referring to the same module as the one with a "dependence on an attribute in AzimuthConfig". I didn't want to sentence to be understood as "dependence on an attribute [...] will force [another random] module to include all other configs in its scope".

Maybe I should have modified the first line

Remember that if a module depends on an attribute in AzimuthConfig, it will need to include all other configs in its scope.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand the intention exactly! When I read what I committed, I read it as exactly that, whereas with "the," the pronoun feels surprising. I'm having trouble figuring out how to explain the grammatical reasoning...

Currently:

Before adding attributes: Remember that dependence on an attribute in AzimuthConfig will cause a module to include all other configs in its scope.

It is clear to me that the first phrase ("dependence...AzimuthConfig") belongs to the "a module" mentioned later - that it's a specific module that is the topic (subject?) of the sentence.

With the:

Before adding attributes: Remember that dependence on an attribute in AzimuthConfig will cause the module to include all other configs in its scope.

...I'm like, wait, what? Which module were we talking about? Were we talking about a module already?

But, all that said, even if I had a sound grammatical argument, it doesn't matter if it is confusing to you (or any other user). So to be super clear, we could change it to what you suggested, with a slight rephrasing in line with your earlier suggestion, and one of mine to be concise:

Reminder: If a module depends on an attribute in AzimuthConfig, the module will be forced to include all other configs in its scope.

What do you think? I'm not opposed to doing a new PR for one comment, all in the name of clear communication. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's perfect! PR: #310

azimuth/config.py Outdated Show resolved Hide resolved
Copy link
Contributor

@JosephMarinier JosephMarinier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful! Thank you!

Co-authored-by: Joseph Marinier <joseph.marinier@servicenow.com>
@lindsaydbrin lindsaydbrin enabled auto-merge (squash) November 30, 2022 03:32
azimuth/config.py Outdated Show resolved Hide resolved
@lindsaydbrin lindsaydbrin merged commit c9b1963 into main Nov 30, 2022
@lindsaydbrin lindsaydbrin deleted the lindsay/french branch November 30, 2022 03:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Run Azimuth on French data with language selection in config
5 participants