-
Notifications
You must be signed in to change notification settings - Fork 140
Fixes issue 228 #234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes issue 228 #234
Conversation
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
for more information, see https://pre-commit.ci
tbartley94
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make this change a bit more powerful if you don't see any issue with suggestions.
Also, move this to whitelist class, not electronic since your current use cases are more whitelist styled.
|
|
||
| # Vebalizes common hyphenated nominal compounds (e.g. 3D-Drucker) | ||
| verbalized_abbreviations = pynini.project(abbreviations, "output") | ||
| DE_CHARS = pynini.union(*"äöüß") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to DE graph and import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
| graph_abbreviation = pynini.string_file(get_abs_path("data/electronic/abbreviations.tsv")) | ||
| hyphen_accep = pynini.accep("-") | ||
| graph_compound_a = ( | ||
| pynutil.insert("fragment_id:") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this valid for sparrowhawk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's a valid field in SP's Electronic semiotic class implementation, but since it's better to move everything to Whitelist, then I suppose it doesn't matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, what are our options with whitelist sparrowhawk wise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing how the SP proto doesn't even list Whitelist as a semiotic class -- rightfully so IMO -- the options are endless? The current implementation of Whitelist doesn't do much besides case-folding and deterministic routing. Moving the logic there should work, however it will most definitely give precedence to cases that we may not want to prioritize (e.g. DGX-1 not being ELECTRONIC is probably more acceptable than 2-Liter not being MEASURE). On the other hand, leaving the current setup with the limitations imposed by SP's limitations lets you keep a tight lid on what "electronic" things go in it by simply editing the .tsv file.
Perhaps, the best of two worlds would be to leave the above in place, and expand Whitelist to match everything ABC-### and ###-ABC? That way all electronics stay Electronicand we have aWhitelistthat capturesG-9, COVID-19`, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to reiterate: we add the functionality to both whitelist and electronic, but whitelist is more general while elctronic captures specific whitelist?
I think that can work. But we can also just add negative weights within whitelist to take care of precedence cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As extra as that sounds -- yes. It may not matter for the final output, but I think it would be useful to separate evidently ELECTRONIC things from WHITELIST during tagging.
If you don't think it matters all that much (and it may not), I'll add the changes to WHITELIST and we push like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, redundancy becomes a pain on debugging. Just move everything to whitelist, but make the key electronic terms in the actual whitelist while the more liberal expansion a few lines in the tagger graph. Make the weight for the whitelist file have highest priority. If we get annoyed by the liberal expansion, we comment out and work on later. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds very sensible. Let me implement that and see how well it meshes with the rest of the the grammar.
|
|
||
| graph = url | domain_graph | email | tag | ||
| # Implements a graph for commonly-used hyphenated compound nouns (e.g. 3D-Drucker, 2D-Mammogram) | ||
| graph_abbreviation = pynini.string_file(get_abs_path("data/electronic/abbreviations.tsv")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, thoughts on doing a more comprehensive pattern? Technically we can extend to any use of [0-9]+-[A-Za-z] Can you see potential conflict with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also reverse, would allow capturing things like G-9 summit, COVID-19.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this will make it so that Electronic captures everything that's ###-ABC or ABC-###. This of course won't matter if all this is moved to Whitelist. Besides that, I don't think I'm seeing any immediate problems, which doesn't mean that they won't emerge once this is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, my vote is trying out but make sure it's a change that can be commented out. That way the fix can be quick if we break anything.
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
for more information, see https://pre-commit.ci
…se) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
for more information, see https://pre-commit.ci
|
I added the explicitly targeted strings to |
|
Oh go figure. Can you do a quick check to see if that's consistent across the rest of the repo? Will merge once import checks are fixed. |
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
for more information, see https://pre-commit.ci
The merge-base changed after approval.
tbartley94
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
tbartley94
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
The merge-base changed after approval.
|
@zoobereq since i tried to push last time I'm not allowed for final review. Ping Marianna for a quick stamp. |
The merge-base changed after approval.
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
The merge-base changed after approval.
The merge-base changed after approval.
Signed-off-by: Simon Zuberek <szuberek@nvidia.com>
* Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> --------- Signed-off-by: Simon Zuberek <szuberek@nvidia.com> Co-authored-by: Simon Zuberek <szuberek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Ankit Narwade <anarwade@nvidia.com>
* Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> --------- Signed-off-by: Simon Zuberek <szuberek@nvidia.com> Signed-off-by: Ankit Narwade <anarwade@nvidia.com>
* Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> --------- Signed-off-by: Simon Zuberek <szuberek@nvidia.com> Signed-off-by: Ankit Narwade <anarwade@nvidia.com>
* Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> --------- Signed-off-by: Simon Zuberek <szuberek@nvidia.com> Co-authored-by: Simon Zuberek <szuberek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
* Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <szuberek@nvidia.com> * Updates the cache Signed-off-by: Simon Zuberek <szuberek@nvidia.com> --------- Signed-off-by: Simon Zuberek <szuberek@nvidia.com> Co-authored-by: Simon Zuberek <szuberek@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do ?
The fix addresses one of the issues reported in #228, in particular:
The updated system correctly transduces these common hyphenated nominal compounds and can be easily expanded to include others.
Before your PR is "Ready for review"
Pre checks:
git commit -sto sign.pytestor (if your machine does not have GPU)pytest --cpufrom the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')).bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...pytestand Sparrowhawk here.__init__.pyfor every folder and subfolder, includingdatafolder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.to all newly added Python files?Copyright 2015 and onwards Google, Inc.. See an example here.try import: ... except: ...) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.