Skip to content

Add support for HuggingFace Datasets#677

Merged
AdamGleave merged 14 commits into
masterfrom
huggingface_datasets
Apr 27, 2023
Merged

Add support for HuggingFace Datasets#677
AdamGleave merged 14 commits into
masterfrom
huggingface_datasets

Conversation

@ernestum
Copy link
Copy Markdown
Collaborator

@ernestum ernestum commented Feb 9, 2023

Description

As per #651 we want to start storing trajectories in HuggingFace Datasets

This PR adds support for HuggingFace Datasets and moves imitation.data.types.load/save to imitation.data.seralize.load/save. Datasets are stored in the HuggingFace format by default.

The conversion script is adapted accordingly and I re-structured the tests.

I also moved the loading/saving functionality to imitation.data.serialize instead of imitation.data.types to mimic the layout of imitation.policies.

Design Choices

I introduced a new class (imitation.data.huggingface_datasets_conversion.TrajectoryDatasetSequence) that presents a HuggingFace dataset as a sequence of imitation.data.types.Trajectory. In its __getitem__ method the dataset is queried and the result is converted to a trajectory (or slice of trajectories) ad-hoc.

I found that the "info" dicts were very heterogeneous in nature and could therefore not be mapped to some dataset layout. I decided to serialize the info dicts using jsonpickle and then just store a list of strings. Since decoding this is far more expensive than reading the other (potentially memory-mapped) features (observations, actions, terminals) and often we don't care about the "infos", I decided for a lazy decoding wrapper. It will only decode the info dicts when accessed and keep decoded dicts in an LRU cache to avoid decoding them multiple times.

Testing

Old tests have been updated and split up into more atomic unit tests.

@ernestum ernestum force-pushed the huggingface_datasets branch 3 times, most recently from 18d6c23 to f8309cc Compare February 20, 2023 14:37
@ernestum
Copy link
Copy Markdown
Collaborator Author

ernestum commented Feb 20, 2023

@AdamGleave Right now the tests fail due to lack of git lfs support in our Docker image. Is there any reason we don't use git lfs? If not, I would add it according to this guide: https://naiyer.dev/post/2020/09/05/using-git-lfs-in-ci/

edit: I started this in #683

@ernestum ernestum force-pushed the huggingface_datasets branch 3 times, most recently from 4b1c834 to d3940b3 Compare February 21, 2023 14:20
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 21, 2023

Codecov Report

Merging #677 (29bd1c9) into master (4ea1ee2) will decrease coverage by 0.06%.
The diff coverage is 96.10%.

@@            Coverage Diff             @@
##           master     #677      +/-   ##
==========================================
- Coverage   96.31%   96.25%   -0.06%     
==========================================
  Files          89       91       +2     
  Lines        8620     8685      +65     
==========================================
+ Hits         8302     8360      +58     
- Misses        318      325       +7     
Impacted Files Coverage Δ
src/imitation/data/types.py 98.19% <ø> (-0.01%) ⬇️
src/imitation/data/huggingface_utils.py 86.27% <86.27%> (ø)
src/imitation/data/serialize.py 95.45% <95.45%> (ø)
src/imitation/algorithms/adversarial/common.py 96.83% <100.00%> (ø)
src/imitation/algorithms/bc.py 98.33% <100.00%> (ø)
src/imitation/algorithms/dagger.py 100.00% <100.00%> (ø)
src/imitation/policies/serialize.py 100.00% <100.00%> (ø)
src/imitation/scripts/analyze.py 91.40% <100.00%> (ø)
src/imitation/scripts/convert_trajs.py 94.11% <100.00%> (+5.22%) ⬆️
src/imitation/scripts/eval_policy.py 100.00% <100.00%> (ø)
... and 12 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@ernestum ernestum mentioned this pull request Feb 21, 2023
@ernestum ernestum force-pushed the huggingface_datasets branch 2 times, most recently from 7eb32f8 to 798071a Compare February 24, 2023 14:35
@ernestum ernestum marked this pull request as ready for review February 24, 2023 14:36
@ernestum ernestum requested a review from AdamGleave February 27, 2023 11:48
@ernestum
Copy link
Copy Markdown
Collaborator Author

Not sure what is going on with the coverage reports again 🤷

@AdamGleave AdamGleave requested a review from Rocamonde February 28, 2023 19:12
@AdamGleave AdamGleave changed the title Add support fur HuggingFace Datasets Add support for HuggingFace Datasets Feb 28, 2023
@AdamGleave
Copy link
Copy Markdown
Member

@Rocamonde can you review this please?

Comment thread src/imitation/algorithms/dagger.py
Comment thread tests/scripts/test_scripts.py
Comment thread src/imitation/data/huggingface_datasets_conversion.py Outdated
Comment thread src/imitation/data/serialize.py Outdated
import datasets
import numpy as np

from imitation.data import huggingface_datasets_conversion as hfds
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... Maybe have a shorter module name in the first place?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could go with hf_datasets_conversion? I am happy about other suggestions but I would like the name to stay descriptive. I rather type some more letters now than wonder what a module was supposed to do later.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might call it huggingface_data_converter or something like that? I think having a very descriptive name but then renaming the import to a non-obvious abbreviation is probably just as bad for redability

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I am not really happy with that name either. I would expect a class named HuggingFaceDataConverter in such a module.

When it is so hard to come up with a good name for a module that is often an indicator for bad architecture of the module. So I looked at its content again and the following came up:

When saving, we create a HF dataset just to write it to disk immediately. This involves a (shallow) copy of the data.

When loading, we have a wrapper that makes a HF dataset visible as a sequence of trajectories. Trajectory objects generated on-the-fly.

To make this symmetrical, I should have used the Dataset.from_generator method instead of the Dataset.from_dict method with a generator that constructs dicts from trajectories on-the-fly (documentation here).
This is also the most memory-efficient way to create datasets and it would make it possible to stream sampled trajectories straight to disk.
I will push a version of this soon.

Concerning the naming of the module: I would propose to just call it huggingface_utils or even hf_utils. In symmetry to that we could refactor imitation.policies.serialize and pull out most of the HF-specific code in a imitation.policies.huggingface_utils module.

Copy link
Copy Markdown
Member

@Rocamonde Rocamonde Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good decision, thanks for taking a look at this! Huggingface_utils sounds good. Let me know once you've made those changes.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it just now. Unfortunately this change made the save a lot slower (test_types.test_save_trajectories executes orders of magnitude slower). Maybe here @simoninithomas can give us some insight?

Copy link
Copy Markdown

@simoninithomas simoninithomas Mar 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I asked the dataset team:

  • from_dict writes in RAM
  • from_generator writes on disk

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I better switch between from_dict and from_generator based on the size of the dataset (assuming that I know the number of elements coming out of the generator) or is that something, that you maybe implement on the datasets library end?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your dataset is a python dict, it takes up RAM and is therefore relatively small so you can load it with from_dict. But if you both load files one by one, then from_generator is better

@ernestum ernestum force-pushed the huggingface_datasets branch from f4ab25e to 6d4b8ae Compare March 21, 2023 11:12
@ernestum
Copy link
Copy Markdown
Collaborator Author

I finally decided to go without the from_generator function and only use it when we actually need it. Right now our datasets are small enough to fit in memory.

@Rocamonde could you give this another pass and then we can merge it?

@ernestum ernestum requested a review from Rocamonde March 21, 2023 11:30
@Rocamonde
Copy link
Copy Markdown
Member

Rocamonde commented Mar 23, 2023 via email

Comment thread tests/data/test_types.py
Comment thread src/imitation/scripts/convert_trajs.py Outdated
import pathlib
import warnings

import imitation.data.serialize
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places we're importing from imitation.data import serialize

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Even found some more places where this is the case. I made sure it is always imported as from imitation.data import serialize or from imitation import data whenever we also need the policies.serialize module in the same file.

Comment thread src/imitation/data/serialize.py
…e_path to the utils and load_rollouts_from_huggingface to data.serialize.
@ernestum ernestum requested a review from Rocamonde March 27, 2023 20:00
Copy link
Copy Markdown
Member

@Rocamonde Rocamonde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread tests/data/test_types.py
Comment thread src/imitation/scripts/train_rl.py Outdated
Comment thread src/imitation/scripts/convert_trajs.py
Comment thread src/imitation/scripts/convert_trajs.py
@AdamGleave
Copy link
Copy Markdown
Member

Anything blocking merging now we have an LGTM? Is it just the code coverage?

@ernestum
Copy link
Copy Markdown
Collaborator Author

What is missing in coverage right now (as I interpret the codecov output)

  • loading old versions of the Trajectory dataclass which did not have the terminal flag yet.
  • specifying the log-level with something that can not be cast to an int

If you think we can live without that coverage then you can merge this please.

@AdamGleave AdamGleave merged commit bbdcb29 into master Apr 27, 2023
@AdamGleave AdamGleave deleted the huggingface_datasets branch April 27, 2023 02:37
@AdamGleave
Copy link
Copy Markdown
Member

Sorry for the slow turnaround -- yes, I think that coverage isn't essential, I have merged now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants