Update datasets #309

gabegma · 2022-11-30T03:01:54Z

Resolve #267

Description:

Checklist:

You should check all boxes before the PR is ready. If a box does not apply, check it to acknowledge it.

ISSUE NUMBER. You linked the issue number (Ex: Resolve #XXX).
PRE-COMMIT. You ran pre-commit on all commits, or else you ran pre-commit run --all-files at the end.
USER CHANGES. The changes are added to CHANGELOG.md and the documentation, if they impact our users.
DEV CHANGES.
- Update the documentation if this PR changes how to develop/launch on the app.
- Update the README files and our wiki for any big design decisions, if relevant.
- Add unit tests, docstrings, typing and comments for complex sections.

gabegma · 2022-12-01T22:00:22Z

@Dref360 do you think my fixes for updating datasets make sense?

Dref360 · 2022-12-01T22:02:37Z

Looks good. We never run anything on the DatasetDict objects themselves, so there is nothing to worry about if they don't have the same features.

azimuth/routers/v1/utterances.py

tests/test_loading_resources.py

JosephMarinier · 2022-12-15T14:38:07Z

tests/test_modules/test_dataset_analysis/test_dataset_warnings.py

+    eval_dm._base_dataset_split.features["label"] = ClassLabel(
+        num_classes=3, names=existing_classes + ["NO_INTENT"]
+    )
+    train_dm._base_dataset_split.features["label"] = ClassLabel(
+        num_classes=3, names=existing_classes + ["NO_INTENT"]
+    )


Is it possible to instantiate only one ClassLabel object and pass it to both dataset managers?

Suggested change

-    eval_dm._base_dataset_split.features["label"] = ClassLabel(
-        num_classes=3, names=existing_classes + ["NO_INTENT"]
-    )
-    train_dm._base_dataset_split.features["label"] = ClassLabel(
-        num_classes=3, names=existing_classes + ["NO_INTENT"]
-    )
+    eval_dm._base_dataset_split.features["label"] = train_dm._base_dataset_split.features[
+        "label"
+    ] = ClassLabel(num_classes=3, names=existing_classes + ["NO_INTENT"])
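As a side note on the chained form: in Python, a chained assignment evaluates its right-hand side once and then binds that single object to each target from left to right, so only one label object is created. The sketch below illustrates this with a hypothetical stand-in class (FakeClassLabel) and two plain dicts in place of the real features mappings; none of these names come from the PR or from the datasets library.

```python
# Hypothetical stand-in for datasets.ClassLabel that counts instantiations.
class FakeClassLabel:
    instances = 0

    def __init__(self, num_classes, names):
        FakeClassLabel.instances += 1
        self.num_classes = num_classes
        self.names = names


# Plain dicts standing in for eval_dm._base_dataset_split.features and
# train_dm._base_dataset_split.features (assumption for illustration).
eval_features = {}
train_features = {}
existing_classes = ["negative", "positive"]  # illustrative values only

# Chained assignment: the right-hand side is evaluated once, then bound
# to each target from left to right.
eval_features["label"] = train_features["label"] = FakeClassLabel(
    num_classes=3, names=existing_classes + ["NO_INTENT"]
)

# Both targets now refer to the very same object.
shared = eval_features["label"] is train_features["label"]
```

The temporary-variable and loop variants have the same effect; they differ only in readability.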

Or with a temporary variable:

Suggested change

-    eval_dm._base_dataset_split.features["label"] = ClassLabel(
-        num_classes=3, names=existing_classes + ["NO_INTENT"]
-    )
-    train_dm._base_dataset_split.features["label"] = ClassLabel(
-        num_classes=3, names=existing_classes + ["NO_INTENT"]
-    )
+    class_label = ClassLabel(num_classes=3, names=existing_classes + ["NO_INTENT"])
+    eval_dm._base_dataset_split.features["label"] = class_label
+    train_dm._base_dataset_split.features["label"] = class_label

Or if we move the creation of dms before that, we can loop on it:

Suggested change

-    eval_dm._base_dataset_split.features["label"] = ClassLabel(
-        num_classes=3, names=existing_classes + ["NO_INTENT"]
-    )
-    train_dm._base_dataset_split.features["label"] = ClassLabel(
-        num_classes=3, names=existing_classes + ["NO_INTENT"]
-    )
+    class_label = ClassLabel(num_classes=3, names=existing_classes + ["NO_INTENT"])
+    for dm in dms.values():
+        dm._base_dataset_split.features["label"] = class_label

Here is a complete diff of the last idea, in case it was not clear:

     # Adding a rejection class
     eval_dm: DatasetSplitManager = mod.get_dataset_split_manager(DatasetSplitName.eval)
     train_dm: DatasetSplitManager = mod.get_dataset_split_manager(DatasetSplitName.train)
-    existing_classes = eval_dm.get_class_names(labels_only=True)
-    eval_dm._base_dataset_split.features["label"] = ClassLabel(
-        num_classes=3, names=existing_classes + ["NO_INTENT"]
-    )
-    train_dm._base_dataset_split.features["label"] = ClassLabel(
-        num_classes=3, names=existing_classes + ["NO_INTENT"]
-    )
-    eval_dm._base_dataset_split = eval_dm._base_dataset_split.map(
-        lambda u, i: {"label": 2 if i % 10 == 0 else u["label"]}, with_indices=True
-    )
     dms = {
         DatasetSplitName.eval: eval_dm,
         DatasetSplitName.train: train_dm,
     }
+    existing_classes = eval_dm.get_class_names(labels_only=True)
+    class_label = ClassLabel(num_classes=3, names=existing_classes + ["NO_INTENT"])
+    for dm in dms.values():
+        dm._base_dataset_split.features["label"] = class_label
+    eval_dm._base_dataset_split = eval_dm._base_dataset_split.map(
+        lambda u, i: {"label": 2 if i % 10 == 0 else u["label"]}, with_indices=True
+    )
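For readers who want to see what this setup does without the project fixtures, here is a minimal sketch under stated assumptions. FakeSplit is a hypothetical stand-in for a dataset split, not the datasets or Azimuth API, and the label definition is a plain dict. It shows the two steps of the diff: one label definition shared by every split, then every tenth eval row relabeled to class index 2 (the added "NO_INTENT" class).

```python
from dataclasses import dataclass, field


@dataclass
class FakeSplit:
    """Hypothetical stand-in for a dataset split: rows plus a features dict."""

    rows: list
    features: dict = field(default_factory=dict)

    def map(self, fn, with_indices=False):
        # Mimics the shape of the call in the diff: fn returns the columns
        # to update for each row, optionally receiving the row index.
        new_rows = [
            {**row, **(fn(row, i) if with_indices else fn(row))}
            for i, row in enumerate(self.rows)
        ]
        return FakeSplit(new_rows, self.features)


existing_classes = ["negative", "positive"]  # illustrative values only
class_label = {"num_classes": 3, "names": existing_classes + ["NO_INTENT"]}

splits = {
    "eval": FakeSplit([{"label": i % 2} for i in range(25)]),
    "train": FakeSplit([{"label": i % 2} for i in range(25)]),
}

# One shared label definition for every split.
for split in splits.values():
    split.features["label"] = class_label

# Relabel every tenth eval row (indices 0, 10, 20, ...) to the rejection class.
splits["eval"] = splits["eval"].map(
    lambda u, i: {"label": 2 if i % 10 == 0 else u["label"]}, with_indices=True
)

rejected = [i for i, row in enumerate(splits["eval"].rows) if row["label"] == 2]
```

Only the eval split gets rejection-class rows here; the train split keeps its original labels while still declaring the third class in its features.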

gabegma · 2022-12-19

Thank you for the recommendation! I cleaned it up further, because I found it hard to read: sometimes we edited the values in the dict, and sometimes eval_dm directly. LMK what you think.

JosephMarinier · 2022-12-19

Haha! I had made the exact same change locally while reviewing, but I thought I was asking too much. That's perfect! 👍

JosephMarinier

That's a relief! Thank you! I have some minor comments, but that's good to go!

gabegma added 2 commits December 1, 2022 16:59

Update datasets

c4aad91

Fix file-based tests

15aac79

gabegma force-pushed the ggm/update-datasets branch from 0886ada to 15aac79 December 1, 2022 21:59

gabegma marked this pull request as ready for review December 1, 2022 22:00

gabegma requested review from JosephMarinier and lindsaydbrin December 2, 2022 01:38

Dref360 reviewed Dec 5, 2022

azimuth/routers/v1/utterances.py (resolved)

Dref360 reviewed Dec 5, 2022

tests/test_loading_resources.py (resolved)

JosephMarinier reviewed Dec 15, 2022

tests/test_loading_resources.py (outdated, resolved)

JosephMarinier reviewed Dec 15, 2022

JosephMarinier approved these changes Dec 15, 2022

gabegma added 3 commits December 19, 2022 15:33

Change based on review

8465d2d

Merge remote-tracking branch 'origin/main' into ggm/update-datasets

33304cc

Fix bug on main

cbadf49

gabegma merged commit 7afbd5c into main Dec 20, 2022

gabegma deleted the ggm/update-datasets branch December 20, 2022 00:18


gabegma Dec 19, 2022

JosephMarinier Dec 19, 2022
