
DataPipeline PoC #141

Merged — 191 commits into master from datapipeline_poc_1 on Mar 29, 2021

Conversation

@tchaton (Contributor) commented Feb 22, 2021

What does this PR do?

This PR introduces the new API for DataPipeline.

Objective:
Provide a flexible API that organises user processing code for better readability, easier debugging, and higher performance.

A DataPipeline is composed of two parts: a Preprocess and a Postprocess.

Preprocess implements the following hooks (a minimal sketch follows the list):

  • load_data
  • load_sample
  • per_sample_pre_tensor_transform
  • per_sample_to_tensor_transform
  • per_sample_post_tensor_transform
  • per_batch_transform
  • collate
  • per_sample_transform_on_device
  • per_batch_transform_on_device
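
To make the per-sample hooks concrete, here is a minimal Preprocess sketch covering the three tensor-transform hooks. The hook names come from the list above; the import path and the class MinimalImagePreprocess are illustrative assumptions, not the definitive API:

from typing import List

import torch
from PIL import Image

from flash.data.process import Preprocess  # assumed import path for this PR

class MinimalImagePreprocess(Preprocess):

    def load_data(self, folder: str) -> List[str]:
        # map an input (here, a folder path) to a list of sample descriptors
        return ["a.jpg", "b.jpg"]

    def load_sample(self, path: str) -> Image.Image:
        # turn one descriptor into an actual sample
        return Image.new("RGB", (64, 64))

    def per_sample_pre_tensor_transform(self, sample: Image.Image) -> Image.Image:
        # per-sample transform while the sample is still a PIL image (resize, crop, ...)
        return sample

    def per_sample_to_tensor_transform(self, sample: Image.Image) -> torch.Tensor:
        # per-sample conversion from PIL image to tensor
        return torch.zeros(3, 64, 64)

    def per_sample_post_tensor_transform(self, sample: torch.Tensor) -> torch.Tensor:
        # per-sample transform once the sample is a tensor (normalisation, ...)
        return sample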

Postprocess implements the following hooks (also sketched below):

  • per_batch_transform
  • per_sample_transform
  • uncollate
  • export_data
  • export_sample
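
By symmetry, a minimal Postprocess sketch; the same caveat applies, and the export_sample signature in particular is an assumption for illustration:

from typing import Any, List

import torch

from flash.data.process import Postprocess  # assumed import path for this PR

class MinimalPostprocess(Postprocess):

    def per_batch_transform(self, batch: Any) -> Any:
        # runs on a whole output batch before it is split into samples
        return batch

    def uncollate(self, batch: torch.Tensor) -> List[torch.Tensor]:
        # splits a batch back into individual samples
        return list(torch.unbind(batch, dim=0))

    def per_sample_transform(self, sample: Any) -> Any:
        # runs on each uncollated sample
        return sample

    def export_sample(self, sample: Any, path: str) -> None:
        # persists a single sample, e.g. one prediction (signature assumed)
        torch.save(sample, path)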

The DataPipeline is aware of the Trainer's RunningStage, meaning it knows whether it is running training, validation, testing, or predicting.

Users can customise each hook for a specific RunningStage by adding train, validation, test, or predict as a prefix to the hook name: for example, a train_load_data function would be used for the training stage only. Alternatively, a hook can branch on the booleans self.training, self.validating, self.testing, and self.predicting, as in the sketch below.
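
A minimal sketch of both styles, reusing the MinimalImagePreprocess sketch above (the _train_augment attribute is hypothetical and would be set in __init__):

class StageAwarePreprocess(MinimalImagePreprocess):

    def train_load_data(self, folder: str):
        # prefix style: this override is used for the training stage only;
        # all other stages fall back to load_data
        return ["train_a.jpg", "train_b.jpg"]

    def per_sample_pre_tensor_transform(self, sample):
        # boolean style: one shared hook branching on the current RunningStage
        if self.training:
            return self._train_augment(sample)  # hypothetical augmentation callable
        return sample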

@mock.patch("torch.save")  # need to mock torch.save or we get pickle error
def test_dummy_example(tmpdir):

    class ImageClassificationPreprocess(Preprocess):

        def __init__(self, to_tensor_transform, train_per_sample_transform_on_device):
            super().__init__()
            self._to_tensor = to_tensor_transform  # e.g. T.ToTensor()
            self._train_per_sample_transform_on_device = train_per_sample_transform_on_device  # e.g. T.RandomHorizontalFlip()

        def load_data(self, folder: str):
            # from folder -> return files paths
            return ["a.jpg", "b.jpg"]

        def load_sample(self, path: str) -> Image.Image:
            # from a file path, load the associated image
            img8Bit = np.uint8(np.random.uniform(0, 1, (64, 64, 3)) * 255.0)
            return Image.fromarray(img8Bit)

        def per_sample_to_tensor_transform(self, pil_image: Image.Image) -> torch.Tensor:
            # convert pil image into a tensor
            return self._to_tensor(pil_image)

        def train_per_sample_transform_on_device(self, sample: Any) -> Any:
            # apply an augmentation per sample on gpu for train only
            return self._train_per_sample_transform_on_device(sample)

    class CustomModel(Task):

        def __init__(self):
            # This would be a CNN and the loss cross entropy :)
            super().__init__(model=torch.nn.Linear(1, 1), loss_fn=torch.nn.MSELoss())

        def training_step(self, batch, batch_idx):
            assert batch.shape == torch.Size([2, 3, 64, 64])

        def validation_step(self, batch, batch_idx):
            assert batch.shape == torch.Size([2, 3, 64, 64])

        def test_step(self, batch, batch_idx):
            assert batch.shape == torch.Size([2, 3, 64, 64])

    class CustomDataModule(DataModule):

        preprocess_cls = ImageClassificationPreprocess

        @property
        def preprocess(self):
            return self.preprocess_cls(
                self.to_tensor_transform,
                self.train_per_sample_transform_on_device)

        @classmethod
        def from_folders(
            cls, 
            train_folder: Optional[str], 
            val_folder: Optional[str], 
            test_folder: Optional[str], 
            predict_folder: Optional[str], 
            to_tensor_transform: torch.nn.Module, 
            train_per_sample_transform_on_device: torch.nn.Module, 
            batch_size: int):

            # attach the arguments for the preprocess onto the cls
            cls.to_tensor_transform = to_tensor_transform
            cls.train_per_sample_transform_on_device = train_per_sample_transform_on_device
            
            # call ``from_load_data_inputs``
            return cls.from_load_data_inputs(
                train_load_data_input=train_folder, 
                valid_load_data_input=val_folder, 
                test_load_data_input=test_folder, 
                predict_load_data_input=predict_folder, 
                batch_size=batch_size)
            
    datamodule = CustomDataModule.from_folders(
        "train_folder", "val_folder", "test_folder", None, T.ToTensor(), T.RandomHorizontalFlip(), batch_size=2)

    assert isinstance(datamodule.train_dataloader().dataset[0], Image.Image)
    batch = next(iter(datamodule.train_dataloader()))
    assert batch[0].shape == torch.Size([3, 64, 64])

    model = CustomModel()
    trainer = Trainer(
        max_epochs=1,
        limit_train_batches=2,
        limit_val_batches=1,
        limit_test_batches=2,
        limit_predict_batches=2,
        num_sanity_val_steps=1
    )
    trainer.fit(model, datamodule=datamodule)
    trainer.test(model)

TODOs:

  • Add support for per_sample_pre_tensor_transform, per_sample_to_tensor_transform, per_sample_post_tensor_transform
  • Add tests for the above hooks
  • Add tests for reload_dataloaders_every_n_epochs
  • Add check that per_sample_post_tensor_transform receives tensors + tests
  • Convert to new API for ImageClassifier.
  • Convert to new API for ObjectDetector (WIP). @kaushikb11 Mind having a look?
  • Convert to new API for SummarizationTask
  • Convert to new API for TabularClassifier.
  • Convert to new API for TextClassifier
  • Convert to new API for TranslationTask
  • Resolve failing CI tests + update outdated tests

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks commented Feb 22, 2021

Hello @tchaton! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-29 18:33:32 UTC

@codecov bot commented Feb 23, 2021

Codecov Report

Merging #141 (e2f24dc) into master (3b4c6b6) will increase coverage by 3.37%.
The diff coverage is 78.15%.

❗ Current head e2f24dc differs from pull request most recent head de3327b. Consider uploading reports for the commit de3327b to get more accurate results.

@@            Coverage Diff             @@
##           master     #141      +/-   ##
==========================================
+ Coverage   76.52%   79.89%   +3.37%     
==========================================
  Files          56       55       -1     
  Lines        2334     2447     +113     
==========================================
+ Hits         1786     1955     +169     
+ Misses        548      492      -56     
Flag Coverage Δ
unittests 79.89% <78.15%> (+3.37%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
flash/core/data/utils.py 29.26% <ø> (-58.83%) ⬇️
flash/vision/detection/model.py 72.46% <ø> (-0.14%) ⬇️
flash/text/seq2seq/core/model.py 61.53% <25.00%> (-2.53%) ⬇️
flash/text/seq2seq/core/data.py 42.62% <32.55%> (-42.31%) ⬇️
flash/text/classification/data.py 40.90% <36.55%> (-44.81%) ⬇️
flash/text/classification/model.py 63.63% <40.00%> (-33.34%) ⬇️
flash/text/seq2seq/summarization/data.py 57.14% <47.61%> (-27.48%) ⬇️
flash/text/seq2seq/summarization/model.py 82.35% <66.66%> (+5.88%) ⬆️
flash/vision/utils.py 69.23% <69.23%> (ø)
flash/tabular/classification/data/dataset.py 75.75% <75.00%> (ø)
... and 32 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b4c6b6...de3327b.

@kaushikb11 mentioned this pull request on Feb 24, 2021.
@Borda added the Priority label on Feb 24, 2021.
@Borda (Member) commented Feb 24, 2021

@tchaton @justusschock Can we get this done ASAP? It blocks the transition to a proper PL version.

@justusschock (Member) commented

@Borda why does it block? This should be independent.

@Borda (Member) commented Feb 24, 2021

> @Borda why does it block? This should be independent.

Please check #133 (comment).

@tchaton merged commit ba34bf4 into master on Mar 29, 2021.
@tchaton deleted the datapipeline_poc_1 branch on March 29, 2021.