Add support for HuggingFace Datasets #677
@AdamGleave Right now the tests fail due to lack of git lfs support in our Docker image. Is there any reason we don't use git lfs? If not, I would add it according to this guide: https://naiyer.dev/post/2020/09/05/using-git-lfs-in-ci/
Edit: I started this in #683.
Codecov Report
@@ Coverage Diff @@
## master #677 +/- ##
==========================================
- Coverage 96.31% 96.25% -0.06%
==========================================
Files 89 91 +2
Lines 8620 8685 +65
==========================================
+ Hits 8302 8360 +58
- Misses 318 325 +7
Not sure what is going on with the coverage reports again 🤷
@Rocamonde can you review this please?
import datasets
import numpy as np

from imitation.data import huggingface_datasets_conversion as hfds
Hmmm... Maybe have a shorter module name in the first place?
We could go with hf_datasets_conversion? I am open to other suggestions, but I would like the name to stay descriptive. I would rather type a few more letters now than wonder later what a module was supposed to do.
I might call it huggingface_data_converter or something like that? I think having a very descriptive name but then renaming the import to a non-obvious abbreviation is probably just as bad for readability.
Hmm. I am not really happy with that name either. I would expect a class named HuggingFaceDataConverter in such a module.
When it is this hard to come up with a good name for a module, that is often an indicator of bad architecture in the module. So I looked at its content again and the following came up:
When saving, we create a HF dataset just to write it to disk immediately. This involves a (shallow) copy of the data.
When loading, we have a wrapper that makes a HF dataset visible as a sequence of trajectories. Trajectory objects are generated on the fly.
To make this symmetrical, I should have used the Dataset.from_generator method, with a generator that constructs dicts from trajectories on the fly, instead of the Dataset.from_dict method.
This is also the most memory-efficient way to create datasets and it would make it possible to stream sampled trajectories straight to disk.
I will push a version of this soon.
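A minimal sketch of the generator idea described above, assuming a simplified, hypothetical `Trajectory` stand-in rather than the real `imitation.data.types.Trajectory`. The names `trajectory_records`, `obs`, `acts` and `terminal` are illustrative assumptions. A generator function of this shape is the kind of callable that `Dataset.from_generator` accepts; that call is left out so the sketch runs without the datasets library.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class Trajectory:
    """Hypothetical, simplified stand-in for imitation.data.types.Trajectory."""

    obs: List[List[float]]  # one more observation than actions
    acts: List[int]
    terminal: bool


def trajectory_records(
    trajectories: Iterable[Trajectory],
) -> Iterator[Dict[str, Any]]:
    """Yield one plain dict per trajectory, converting lazily.

    Nothing is converted until the consumer asks for the next record, so
    only one converted trajectory has to exist at a time.
    """
    for traj in trajectories:
        yield {"obs": traj.obs, "acts": traj.acts, "terminal": traj.terminal}


trajs = [
    Trajectory(obs=[[0.0], [1.0], [2.0]], acts=[0, 1], terminal=True),
    Trajectory(obs=[[5.0], [6.0]], acts=[1], terminal=False),
]
records = trajectory_records(trajs)
first = next(records)  # converts only the first trajectory
remaining = list(records)  # converts the rest on demand
```

Because `trajectories` can itself be a generator (for example, a rollout sampler), this shape would also allow sampled trajectories to be streamed rather than collected first.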
Concerning the naming of the module: I would propose to just call it huggingface_utils or even hf_utils. In symmetry to that, we could refactor imitation.policies.serialize and pull most of the HF-specific code out into an imitation.policies.huggingface_utils module.
That sounds like a good decision, thanks for taking a look at this! Huggingface_utils sounds good. Let me know once you've made those changes.
Added it just now. Unfortunately this change made the save a lot slower (test_types.test_save_trajectories executes orders of magnitude slower). Maybe here @simoninithomas can give us some insight?
So I asked the dataset team:
from_dict writes in RAM
from_generator writes on disk
So should I switch between from_dict and from_generator based on the size of the dataset (assuming that I know the number of elements coming out of the generator), or is that something you might implement on the datasets library side?
If your dataset is a Python dict, it already sits in RAM and is therefore relatively small, so you can load it with from_dict. But if you load files one by one, then from_generator is better.
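The trade-off discussed in this thread can be illustrated with plain-Python stand-ins. This is a sketch of the two strategies, not the datasets API: `collect_columns` mimics the "everything in RAM" approach and `stream_to_disk` mimics the "write as you go" approach. All function names and the JSON-lines file format are assumptions made for the example.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, Iterator, List


def make_records(n: int) -> Iterator[Dict[str, Any]]:
    """Produce records one at a time, as a trajectory generator would."""
    for i in range(n):
        yield {"acts": [i, i + 1], "terminal": i % 2 == 0}


def collect_columns(records: Iterable[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """In-RAM strategy: materialise every column before anything is written."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def stream_to_disk(records: Iterable[Dict[str, Any]], path: str) -> int:
    """On-disk strategy: write each record as it arrives.

    Only one record is held in memory at a time, at the cost of doing
    file I/O for every record.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
            count += 1
    return count


columns = collect_columns(make_records(3))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.jsonl")
    written = stream_to_disk(make_records(3), path)
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
```

For small data that already fits in memory, the first strategy avoids per-record disk writes; the second only pays off once the data is too large to hold at once or arrives incrementally.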
…ingFace Datasets.
…n.data.serialize and encode infos using jsonpickle to support arbitrary infos structure.
… fix the documentation of imitation.types.serialize.save.
I finally decided to go without the @Rocamonde could you give this another pass and then we can merge it?
Sure! Will take a look tomorrow.
import pathlib
import warnings

import imitation.data.serialize
In other places we're importing from imitation.data import serialize
Good point. Even found some more places where this is the case. I made sure it is always imported as from imitation.data import serialize or from imitation import data whenever we also need the policies.serialize module in the same file.
…e_path to the utils and load_rollouts_from_huggingface to data.serialize.
Anything blocking merging now we have an LGTM? Is it just the code coverage?
What is missing in coverage right now (as I interpret the codecov output)
If you think we can live without that coverage, then please merge this.
Sorry for the slow turnaround -- yes, I think that coverage isn't essential, I have merged now.
Description
As per #651, we want to start storing trajectories in HuggingFace Datasets.
This PR adds support for HuggingFace Datasets and moves imitation.data.types.load/save to imitation.data.serialize.load/save. Datasets are stored in the HuggingFace format by default.
The conversion script is adapted accordingly and I re-structured the tests.
I also moved the loading/saving functionality to imitation.data.serialize instead of imitation.data.types to mimic the layout of imitation.policies.
Design Choices
I introduced a new class (imitation.data.huggingface_datasets_conversion.TrajectoryDatasetSequence) that presents a HuggingFace dataset as a sequence of imitation.data.types.Trajectory. In its __getitem__ method the dataset is queried and the result is converted to a trajectory (or slice of trajectories) ad hoc.
I found that the "info" dicts were very heterogeneous in nature and could therefore not be mapped to some dataset layout. I decided to serialize the info dicts using jsonpickle and then just store a list of strings. Since decoding this is far more expensive than reading the other (potentially memory-mapped) features (observations, actions, terminals) and often we don't care about the "infos", I opted for a lazy decoding wrapper. It will only decode the info dicts when accessed and keep decoded dicts in an LRU cache to avoid decoding them multiple times.
Testing
Old tests have been updated and split up into more atomic unit tests.
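The lazy decoding wrapper described under Design Choices could look roughly like the sketch below. It is not the code from this PR: the class name `LazyInfos`, the `cache_size` parameter and the `decode_calls` counter are hypothetical, and the standard-library `json` module stands in for jsonpickle so the example has no third-party dependencies.

```python
import functools
import json
from collections.abc import Sequence
from typing import Any, Dict, List


class LazyInfos(Sequence):
    """Present a list of encoded strings as a sequence of decoded dicts.

    Entries are decoded only when accessed, and decoded results are kept
    in a per-instance LRU cache so repeated access does not decode again.
    """

    def __init__(self, encoded: List[str], cache_size: int = 128):
        self._encoded = encoded
        self.decode_calls = 0  # bookkeeping for the demonstration only
        # Wrap the bound method so that each instance owns its own cache.
        self._decode = functools.lru_cache(maxsize=cache_size)(
            self._decode_uncached
        )

    def _decode_uncached(self, index: int) -> Dict[str, Any]:
        self.decode_calls += 1
        return json.loads(self._encoded[index])  # jsonpickle in the PR

    def __len__(self) -> int:
        return len(self._encoded)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self._decode(i) for i in range(*index.indices(len(self)))]
        if index < 0:
            index += len(self)  # normalise so -1 and len-1 share a cache slot
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._decode(index)


infos = LazyInfos([json.dumps({"step": i}) for i in range(4)])
a = infos[2]    # first access: decoded
b = infos[2]    # second access: served from the cache
c = infos[-1]   # negative index, decodes entry 3
d = infos[1:3]  # decodes entry 1, reuses cached entry 2
```

One caveat of this design is that cached dicts are shared between callers, so mutating a returned dict changes what later accesses see; returning copies would avoid that at extra cost.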