Add support for HuggingFace Datasets by ernestum · Pull Request #677 · HumanCompatibleAI/imitation

ernestum · 2023-02-09T16:52:33Z

Description

As per #651, we want to start storing trajectories in HuggingFace Datasets.

This PR adds support for HuggingFace Datasets and moves imitation.data.types.load/save to imitation.data.serialize.load/save. Datasets are stored in the HuggingFace format by default.

The conversion script is adapted accordingly and I re-structured the tests.

I also moved the loading/saving functionality to imitation.data.serialize instead of imitation.data.types to mimic the layout of imitation.policies.

Design Choices

I introduced a new class (imitation.data.huggingface_datasets_conversion.TrajectoryDatasetSequence) that presents a HuggingFace dataset as a sequence of imitation.data.types.Trajectory. In its __getitem__ method the dataset is queried and the result is converted to a trajectory (or slice of trajectories) ad-hoc.
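The wrapper idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Trajectory` class is a simplified stand-in for imitation.data.types.Trajectory, a plain dict of equal-length columns stands in for the HuggingFace dataset, and the column names (`obs`, `acts`, `terminal`) and the class name `TrajectorySequence` are assumptions made for the example.

```python
import dataclasses
from collections.abc import Sequence
from typing import Any, Dict, List


@dataclasses.dataclass(frozen=True)
class Trajectory:
    """Simplified stand-in for imitation.data.types.Trajectory."""

    obs: List[float]
    acts: List[float]
    terminal: bool


class TrajectorySequence(Sequence):
    """Presents a column-oriented table as a sequence of Trajectory objects."""

    def __init__(self, table: Dict[str, List[Any]]):
        # Hypothetical layout: one column per field, one row per trajectory.
        self._table = table

    def __len__(self) -> int:
        return len(self._table["obs"])

    def _row(self, idx: int) -> Trajectory:
        # The Trajectory object is built only when a row is requested.
        return Trajectory(
            obs=self._table["obs"][idx],
            acts=self._table["acts"][idx],
            terminal=self._table["terminal"][idx],
        )

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # A slice yields a list of trajectories.
            return [self._row(i) for i in range(*idx.indices(len(self)))]
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        return self._row(idx)


table = {
    "obs": [[0.0, 1.0], [2.0, 3.0, 4.0]],
    "acts": [[1.0], [0.0, 1.0]],
    "terminal": [True, False],
}
seq = TrajectorySequence(table)
first = seq[0]       # a single Trajectory
both = seq[0:2]      # a list of two Trajectory objects
```

Because the class derives from `collections.abc.Sequence`, iteration, `in`, and `reversed` come for free from `__len__` and `__getitem__`.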

I found that the "info" dicts were very heterogeneous in nature and could therefore not be mapped to a fixed dataset layout. I decided to serialize the info dicts using jsonpickle and store them as a list of strings. Since decoding them is far more expensive than reading the other (potentially memory-mapped) features (observations, actions, terminals), and we often don't care about the infos, I opted for a lazy decoding wrapper. It decodes the info dicts only when they are accessed and keeps decoded dicts in an LRU cache to avoid decoding them multiple times.
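A lazy decoding wrapper with an LRU cache might look like the sketch below. It is an illustration under stated assumptions rather than the PR's implementation: the standard-library `json` module stands in for jsonpickle so the example has no third-party dependencies, and the class name `LazyInfos`, the `cache_size` parameter, and the `decode_calls` counter are hypothetical.

```python
import functools
import json
from collections.abc import Sequence
from typing import Any, Dict, List


class LazyInfos(Sequence):
    """Decodes serialized info dicts on access and caches the results."""

    def __init__(self, encoded: List[str], cache_size: int = 128):
        self._encoded = encoded
        self.decode_calls = 0  # counts real decodes, for demonstration only
        # Per-instance LRU cache keyed by row index.
        self._decode_cached = functools.lru_cache(maxsize=cache_size)(self._decode)

    def _decode(self, idx: int) -> Dict[str, Any]:
        self.decode_calls += 1
        # json is a stand-in here; the PR describes using jsonpickle.
        return json.loads(self._encoded[idx])

    def __len__(self) -> int:
        return len(self._encoded)

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            return [self._decode_cached(i) for i in range(*idx.indices(len(self)))]
        if idx < 0:
            idx += len(self)
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        return self._decode_cached(idx)


infos = LazyInfos([json.dumps({"step": i}) for i in range(3)])
calls_before = infos.decode_calls   # nothing is decoded at construction time
a = infos[1]
b = infos[1]                        # served from the cache
calls_after = infos.decode_calls
```

One caveat of this design: the cache hands out the same dict object on repeated access, so callers that mutate an info dict would affect later reads.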

Testing

Old tests have been updated and split up into more atomic unit tests.

ernestum · 2023-02-20T14:51:22Z

@AdamGleave Right now the tests fail due to lack of git lfs support in our Docker image. Is there any reason we don't use git lfs? If not, I would add it according to this guide: https://naiyer.dev/post/2020/09/05/using-git-lfs-in-ci/

edit: I started this in #683

codecov · 2023-02-21T14:44:45Z

Codecov Report

Merging #677 (29bd1c9) into master (4ea1ee2) will decrease coverage by 0.06%.
The diff coverage is 96.10%.

@@            Coverage Diff             @@
##           master     #677      +/-   ##
==========================================
- Coverage   96.31%   96.25%   -0.06%     
==========================================
  Files          89       91       +2     
  Lines        8620     8685      +65     
==========================================
+ Hits         8302     8360      +58     
- Misses        318      325       +7

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/imitation/data/types.py | `98.19% <ø> (-0.01%)` | ⬇️ |
| src/imitation/data/huggingface_utils.py | `86.27% <86.27%> (ø)` | |
| src/imitation/data/serialize.py | `95.45% <95.45%> (ø)` | |
| src/imitation/algorithms/adversarial/common.py | `96.83% <100.00%> (ø)` | |
| src/imitation/algorithms/bc.py | `98.33% <100.00%> (ø)` | |
| src/imitation/algorithms/dagger.py | `100.00% <100.00%> (ø)` | |
| src/imitation/policies/serialize.py | `100.00% <100.00%> (ø)` | |
| src/imitation/scripts/analyze.py | `91.40% <100.00%> (ø)` | |
| src/imitation/scripts/convert_trajs.py | `94.11% <100.00%> (+5.22%)` | ⬆️ |
| src/imitation/scripts/eval_policy.py | `100.00% <100.00%> (ø)` | |

... and 12 more

ernestum · 2023-02-27T12:13:30Z

Not sure what is going on with the coverage reports again 🤷

AdamGleave · 2023-02-28T19:13:29Z

@Rocamonde can you review this please?

Rocamonde · 2023-03-05T01:39:29Z

+import datasets
+import numpy as np
+
+from imitation.data import huggingface_datasets_conversion as hfds


Hmmm... Maybe have a shorter module name in the first place?

We could go with hf_datasets_conversion? I am open to other suggestions, but I would like the name to stay descriptive. I would rather type a few more letters now than wonder later what a module was supposed to do.

I might call it huggingface_data_converter or something like that? I think having a very descriptive name but then renaming the import to a non-obvious abbreviation is probably just as bad for readability.

Hmm. I am not really happy with that name either. I would expect a class named HuggingFaceDataConverter in such a module.

When it is this hard to come up with a good name for a module, that is often an indicator of bad architecture in the module. So I looked at its content again and the following came up:

When saving, we create a HF dataset just to write it to disk immediately. This involves a (shallow) copy of the data.

When loading, we have a wrapper that makes a HF dataset visible as a sequence of trajectories. Trajectory objects are generated on the fly.

To make this symmetrical, I should have used the Dataset.from_generator method, with a generator that constructs dicts from trajectories on the fly, instead of the Dataset.from_dict method (documentation here).
This is also the most memory-efficient way to create datasets and it would make it possible to stream sampled trajectories straight to disk.
I will push a version of this soon.
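The two construction paths discussed here can be sketched as follows. This is an illustrative sketch only: `Trajectory` is a simplified stand-in, the column names and the helper names `trajectory_rows` and `trajectories_to_columns` are hypothetical, and the `datasets` calls appear only in comments so that the example runs without the library installed.

```python
from typing import Any, Dict, Iterable, Iterator, List, NamedTuple


class Trajectory(NamedTuple):
    """Simplified stand-in for imitation.data.types.Trajectory."""

    obs: List[float]
    acts: List[int]
    terminal: bool


def trajectory_rows(trajectories: Iterable[Trajectory]) -> Iterator[Dict[str, Any]]:
    """Yields one row dict per trajectory, built on the fly."""
    for traj in trajectories:
        yield {"obs": traj.obs, "acts": traj.acts, "terminal": traj.terminal}


def trajectories_to_columns(trajectories: Iterable[Trajectory]) -> Dict[str, List[Any]]:
    """Builds the full column dict in memory, as from_dict would need it."""
    columns: Dict[str, List[Any]] = {"obs": [], "acts": [], "terminal": []}
    for row in trajectory_rows(trajectories):
        for key, value in row.items():
            columns[key].append(value)
    return columns


trajs = [Trajectory([0.0, 1.0], [1], True), Trajectory([2.0], [], False)]
rows = list(trajectory_rows(trajs))
columns = trajectories_to_columns(trajs)

# With the `datasets` library installed, the two variants would be roughly:
#   datasets.Dataset.from_dict(columns)
#   datasets.Dataset.from_generator(
#       trajectory_rows, gen_kwargs={"trajectories": trajs}
#   )
```

The generator variant never needs all rows in memory at once, which is what would allow sampled trajectories to be streamed to disk.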

Concerning the naming of the module: I would propose to just call it huggingface_utils or even hf_utils. For symmetry, we could refactor imitation.policies.serialize and pull most of the HF-specific code out into an imitation.policies.huggingface_utils module.

That sounds like a good decision, thanks for taking a look at this! Huggingface_utils sounds good. Let me know once you've made those changes.

Added it just now. Unfortunately this change made the save a lot slower (test_types.test_save_trajectories executes orders of magnitude slower). Maybe here @simoninithomas can give us some insight?

So I asked the dataset team:

from_dict writes in RAM

from_generator writes on disk

So should I switch between from_dict and from_generator based on the size of the dataset (assuming that I know the number of elements coming out of the generator), or is that something you might implement on the datasets library end?

If your dataset is a Python dict, it already takes up RAM and is therefore relatively small, so you can load it with from_dict. But if you load files one by one, then from_generator is better.

ernestum · 2023-03-21T11:30:29Z

I finally decided to go without the from_generator function and only use it when we actually need it. Right now our datasets are small enough to fit in memory.

@Rocamonde could you give this another pass and then we can merge it?

Rocamonde · 2023-03-23T05:13:39Z

Sure! Will take a look tomorrow.

Rocamonde · 2023-03-26T20:00:20Z

+import pathlib
 import warnings

+import imitation.data.serialize


In other places we're importing from imitation.data import serialize

Good point. Even found some more places where this is the case. I made sure it is always imported as from imitation.data import serialize or from imitation import data whenever we also need the policies.serialize module in the same file.

Rocamonde

LGTM

AdamGleave · 2023-04-07T22:12:02Z

Anything blocking merging now we have an LGTM? Is it just the code coverage?

ernestum · 2023-04-14T16:14:45Z

What is missing in coverage right now (as I interpret the codecov output):

- loading old versions of the Trajectory dataclass, which did not have the terminal flag yet
- specifying the log level with something that cannot be cast to an int

If you think we can live without that coverage, then please merge this.

AdamGleave · 2023-04-27T02:37:13Z

Sorry for the slow turnaround -- yes, I think that coverage isn't essential, I have merged now.

ernestum force-pushed the huggingface_datasets branch 3 times, most recently from 18d6c23 to f8309cc Compare February 20, 2023 14:37

ernestum force-pushed the huggingface_datasets branch 3 times, most recently from 4b1c834 to d3940b3 Compare February 21, 2023 14:20

ernestum mentioned this pull request Feb 21, 2023

Enable git lfs #683

Merged

ernestum force-pushed the huggingface_datasets branch 2 times, most recently from 7eb32f8 to 798071a Compare February 24, 2023 14:35

ernestum marked this pull request as ready for review February 24, 2023 14:36

ernestum requested a review from AdamGleave February 27, 2023 11:48

AdamGleave requested a review from Rocamonde February 28, 2023 19:12

AdamGleave changed the title ~~Add support fur HuggingFace Datasets~~ Add support for HuggingFace Datasets Feb 28, 2023

Rocamonde reviewed Mar 5, 2023

ernestum added 11 commits March 21, 2023 11:42

Make the imitation.types.safe/load functions store/retrieve from HuggingFace Datasets.

6953887

Add datasets dependency.

a907c9f

Add our own transform to numpy arrays to work around huggingface/datasets#5517

6812db1

Move serialization related code from imitation.data.types to imitation.data.serialize and encode infos using jsonpickle to support arbitrary infos structure.

2353c63

Move numpy conversion logic into huggingface_datasets_conversion.py

4298b9d

Improve warnings and comments.

5af3a71

Fix convert_trajs.py script and its tests.

32ebd2b

Add no cover pragma to convert_trajs main.

d02a2a7

Add no cover pragma to the case of loading an unknown trajectory format.

3ce206e

Fix inconsistent imports.

b5a6f30

Rename huggingface_datasets_conversion.py to huggingface_utils.py and fix the documentation of imitation.types.serialize.save.

6d4b8ae

ernestum force-pushed the huggingface_datasets branch from f4ab25e to 6d4b8ae Compare March 21, 2023 11:12

ernestum requested a review from Rocamonde March 21, 2023 11:30

Rocamonde reviewed Mar 26, 2023

Normalize imitation.data.serialize imports across the repo, move parse_path to the utils and load_rollouts_from_huggingface to data.serialize.

cd8ac9a

ernestum requested a review from Rocamonde March 27, 2023 20:00

Rocamonde approved these changes Mar 27, 2023

Comment thread tests/data/test_types.py

Comment thread src/imitation/scripts/train_rl.py Outdated

Comment thread src/imitation/scripts/convert_trajs.py

Comment thread src/imitation/scripts/convert_trajs.py

ernestum added 2 commits March 28, 2023 10:59

Remove now unneeded pytype error suppression.

61dfcc6

Fix formatting in convert_trajs

29bd1c9

AdamGleave merged commit bbdcb29 into master Apr 27, 2023

AdamGleave deleted the huggingface_datasets branch April 27, 2023 02:37

ernestum mentioned this pull request May 30, 2023

Transition from Models Hub to Datasets Hub for expert trajectories #723

Merged


Rocamonde Mar 6, 2023 • edited

simoninithomas Mar 7, 2023 • edited

4 participants
