Introduce Cache 1/n #18642

tchaton · 2023-09-26T17:50:03Z

What does this PR do?

This PR introduces the Cache and CacheDataLoader objects to optimise data fetching in the cloud.

Here are the different ways of using the Cache.

For beginners:

import os
import torch
from lightning.data.cache import CacheDataLoader
from lightning.pytorch.demos.boring_classes import RandomDataset

dataset = RandomDataset(64, 64)
dataloader = CacheDataLoader(dataset, cache_dir="./cache", chunk_bytes=2 << 12)
for batch in dataloader:
    ...  # do your thing

print(sorted(os.listdir("./cache")))  # ['0.index.json', 'chunk-0-0.bin', 'chunk-0-1.bin', 'chunk-0-2.bin']
# Your dataset is optimised for the cloud
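
To make the `chunk_bytes=2 << 12` argument and the resulting file listing more concrete, here is a minimal sketch of byte-budgeted chunking. It is not the Cache implementation from this PR: the `write_chunks` function, the use of `pickle`, the length-prefixed layout, and the index schema are all assumptions made for illustration. Only the file names mimic the listing above.

```python
import json
import os
import pickle
import tempfile


def write_chunks(items, cache_dir, chunk_bytes):
    """Pack pickled items into chunk files holding at most `chunk_bytes` of payload each."""
    os.makedirs(cache_dir, exist_ok=True)
    index = []  # one entry per chunk: file name and number of items
    buffer, size = [], 0

    def flush():
        nonlocal buffer, size
        if not buffer:
            return
        # Hypothetical naming scheme that mirrors the listing in the PR description.
        name = f"chunk-0-{len(index)}.bin"
        with open(os.path.join(cache_dir, name), "wb") as f:
            for blob in buffer:
                # Length-prefix each item so a reader could split the chunk again.
                f.write(len(blob).to_bytes(4, "little"))
                f.write(blob)
        index.append({"filename": name, "num_items": len(buffer)})
        buffer, size = [], 0

    for item in items:
        blob = pickle.dumps(item)
        # Start a new chunk when adding this item would exceed the byte budget.
        if buffer and size + len(blob) > chunk_bytes:
            flush()
        buffer.append(blob)
        size += len(blob)
    flush()

    with open(os.path.join(cache_dir, "0.index.json"), "w") as f:
        json.dump({"chunks": index}, f)
    return index


chunk_bytes = 2 << 12  # 2 * 2**12 = 8192 bytes
with tempfile.TemporaryDirectory() as cache_dir:
    # Five items of roughly 3 KB each: two fit in a chunk, a third would overflow it.
    chunk_index = write_chunks([bytes(3000)] * 5, cache_dir, chunk_bytes)
    files = sorted(os.listdir(cache_dir))
```

With these inputs the sketch writes one index file and three chunk files, holding two, two, and one item respectively.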

For advanced users:

import os
import numpy as np
import torch
from PIL import Image
from torchvision import transforms as T
from torch.utils.data import Dataset
from lightning import seed_everything
from lightning.data.cache import Cache, CacheDataLoader
from lightning.fabric import Fabric

class BoringImageDataset(Dataset):

    def __init__(self, size=100, num_classes=10):
        self.cache = Cache("./cache", chunk_size=2 << 12)
        if not self.cache.filled:
            # create_random_images is a helper that is not shown here
            self.data = create_random_images(size, num_classes)
        self.transform = T.Compose([T.ToTensor()])

    def __len__(self):
        return len(self.cache) if self.cache.filled else len(self.data)

    def __getitem__(self, index):
        data = self.cache[index] if self.cache.filled else self.data[index]
        if not self.cache.filled:
            self.cache[index] = data
        data["image"] = self.transform(data["image"]).unsqueeze(0)
        data["class"] = torch.tensor([data["class"]])
        return data

def main(fabric):
    seed_everything(42)

    dataset = BoringImageDataset()
    dataloader = CacheDataLoader(dataset, num_workers=34, batch_size=4)

    # `model` is assumed to be defined elsewhere
    model = fabric.setup(model)

    dataloader = fabric.setup_dataloaders(dataloader)  # to be added soon, but not needed as the CacheDataLoader handles distribution
    
    ...

    for epoch in range(10):
        for batch in dataloader:
            loss = model(batch)
            ...

if __name__ == "__main__":
    fabric = Fabric(accelerator="auto", devices=2, strategy="ddp_spawn")
    fabric.launch(main)
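
The dataset above follows a fill-then-read pattern: while the cache is not yet filled, items come from the original source and are written into the cache; once it is filled, items are served from the cache only. The sketch below reproduces that control flow with a hypothetical in-memory `ToyCache`. How the real Cache decides that it is `filled` is not described here, so the size-based rule is an assumption.

```python
class ToyCache:
    """Hypothetical in-memory stand-in exposing the interface used by the dataset."""

    def __init__(self, expected_size):
        self._items = {}
        self._expected_size = expected_size

    @property
    def filled(self):
        # Assumption: the cache counts as filled once every index has been written.
        return len(self._items) == self._expected_size

    def __setitem__(self, index, value):
        self._items[index] = value

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


class ToyDataset:
    def __init__(self, size=8):
        self.cache = ToyCache(expected_size=size)
        self.data = [{"value": i, "class": i % 2} for i in range(size)]
        self.source_reads = 0  # counts reads that bypass the cache

    def __len__(self):
        return len(self.cache) if self.cache.filled else len(self.data)

    def __getitem__(self, index):
        if self.cache.filled:
            return self.cache[index]
        # First pass: read from the source and populate the cache.
        self.source_reads += 1
        item = self.data[index]
        self.cache[index] = item
        return item


dataset = ToyDataset(size=8)
first_epoch = [dataset[i] for i in range(len(dataset))]
second_epoch = [dataset[i] for i in range(len(dataset))]
```

After the first pass the source is no longer touched: `dataset.source_reads` stays at 8 while the second pass returns the same items from the cache.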

In follow-up PRs, I will investigate the following:

Add more tests for the samplers, CacheDataLoader, etc...
Start benchmarking and optimising the Cache.
Add support for the IterableDataset
Add support for the LightningIterable
Optimise the index for very large datasets by dropping the offsets and mapping.
Add support for NLP (tensors) datasets
Add resumability
Prepare the MLPerf Dataset
Explore FFCV integration
...

📚 Documentation preview 📚: https://pytorch-lightning--18642.org.readthedocs.build/en/18642/

cc @Borda

codecov · 2023-10-06T16:57:15Z

Codecov Report

Merging #18642 (0e5342c) into master (fdae213) will decrease coverage by 20%.
The diff coverage is 81%.

Additional details and impacted files

@@            Coverage Diff            @@
##           master   #18642     +/-   ##
=========================================
- Coverage      83%      64%    -20%     
=========================================
  Files         428      435      +7     
  Lines       33484    34530   +1046     
=========================================
- Hits        27904    21974   -5930     
- Misses       5580    12556   +6976
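
The figures in the report can be cross-checked with a few lines of arithmetic. This sketch assumes coverage is computed as hits divided by lines and rounded to a whole percent for display, which would explain why the reported change is -20% although 64 minus 83 is -19.

```python
def coverage(hits, lines):
    """Return coverage as a percentage of covered lines."""
    return 100 * hits / lines


master = coverage(27904, 33484)  # about 83.3
pr = coverage(21974, 34530)      # about 63.6
delta = pr - master              # about -19.7, displayed as -20

# Misses are the lines that were not hit.
master_misses = 33484 - 27904
pr_misses = 34530 - 21974
```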

justusschock

lgtm, just some minor comments

src/lightning/data/cache/config.py

src/lightning/data/cache/downloader.py

src/lightning/data/cache/serializers.py

.github/workflows/ci-tests-data.yml

tchaton and others added 29 commits September 26, 2023 16:11

update

2e3e1c2

update

90bcd89

[pre-commit.ci] auto fixes from pre-commit.com hooks

8d76988

for more information, see https://pre-commit.ci

update

fa8a5f3

[pre-commit.ci] auto fixes from pre-commit.com hooks

70332a9

update

a894cc4

update

e1ebe37

[pre-commit.ci] auto fixes from pre-commit.com hooks

bf47412

update

35cae78

[pre-commit.ci] auto fixes from pre-commit.com hooks

2376c3e

update

28fab53

update

7f54886

[pre-commit.ci] auto fixes from pre-commit.com hooks

c1b197f

update

81f5a79

Merge branch 'introduce_cache' of https://github.com/Lightning-AI/lightning into introduce_cache

019b1bd

[pre-commit.ci] auto fixes from pre-commit.com hooks

c2ee47c

update

d0708e0

Merge branch 'introduce_cache' of https://github.com/Lightning-AI/lightning into introduce_cache

7ddba5c

[pre-commit.ci] auto fixes from pre-commit.com hooks

6f6ce5f

update

23bc5c4

Merge branch 'introduce_cache' of https://github.com/Lightning-AI/lightning into introduce_cache

f3430d7

update

f79a292

[pre-commit.ci] auto fixes from pre-commit.com hooks

4770038

update

dd0991c

Merge branch 'introduce_cache' of https://github.com/Lightning-AI/lightning into introduce_cache

a3a9ff7

update

32fe811

update

1e3d1ab

Merge branch 'master' into introduce_cache

7859601

[pre-commit.ci] auto fixes from pre-commit.com hooks

d05e34f

tchaton changed the title from "Introduce Cache" to "Introduce Cache 1/n" Sep 28, 2023

thomas added 2 commits October 6, 2023 15:33

update

e356838

update

c4ddeb8

tchaton marked this pull request as ready for review October 6, 2023 16:36

tchaton requested review from ethanwharris, justusschock and lantiga October 6, 2023 16:37

update

da985f2

update

0e5342c

ethanwharris approved these changes Oct 6, 2023

thomas added 3 commits October 7, 2023 15:48

update

3aed4c5

update

1f41d60

update

bab1f1c

justusschock approved these changes Oct 9, 2023

src/lightning/data/cache/config.py (outdated)

src/lightning/data/cache/downloader.py

src/lightning/data/cache/serializers.py

mergify bot added the ready PRs ready to be merged label Oct 9, 2023

thomas added 2 commits October 9, 2023 09:56

update

2c0ee2f

update

3d57af8

Lightning-AI deleted a comment from justusschock Oct 9, 2023

update

9ff7dc4

tchaton requested a review from carmocca as a code owner October 9, 2023 11:37

github-actions bot added the ci Continuous Integration label Oct 9, 2023

update

bfa57c0

github-actions bot added the dependencies Pull requests that update a dependency file label Oct 9, 2023

thomas added 2 commits October 9, 2023 13:52

update

e733e59

update

1d0f5e4

Borda reviewed Oct 9, 2023

.github/workflows/ci-tests-data.yml

update

ff7b629

github-actions bot removed the package label Oct 9, 2023

tchaton merged commit 1d5851f into master Oct 9, 2023

tchaton deleted the introduce_cache branch October 9, 2023 15:06

6 participants
