Skip to content

Commit

Permalink
Organize benchmarks section
Browse files Browse the repository at this point in the history
  • Loading branch information
williamFalcon committed Jul 5, 2024
1 parent 052ad7b commit 18602f5
Showing 1 changed file with 77 additions and 37 deletions.
114 changes: 77 additions & 37 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,8 @@ Transform Optimize
<a href="#transform-datasets">Transform datasets</a> •
<a href="#key-features">Key features</a> •
<a href="#benchmarks">Benchmarks</a> •
<a href="#runnable-templates">Templates</a>
<a href="#start-from-a-template">Templates</a> •
<a href="#community">Community</a>
</p>

&nbsp;
Expand Down Expand Up @@ -576,61 +577,74 @@ Explore an example setup of litdata with MinIO in the [LitData with MinIO](https
----

# Benchmarks
In this section we show benchmarks for speed to optimize a dataset and the resulting streaming speed ([Reproduce the benchmark](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries)).

In order to measure the effectiveness of LitData, we used a commonly used dataset for benchmarks: [Imagenet-1.2M](https://www.image-net.org/) where the training set contains `1,281,167 images`.
## Streaming speed

To align with other benchmarks, we measured the streaming speed (`images per second`) loaded from [AWS S3](https://aws.amazon.com/s3/) for several frameworks.
Data optimized and streamed with LitData achieves a 20x speed up over non optimized data and 2x speed up over other streaming solutions.

Speed to stream Imagenet 1.2M from AWS S3:

Reproduce our benchmark **by running** this [Studio](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries).
| Framework | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) | Images / sec 1st Epoch (torch16) | Images / sec 2nd Epoch (torch16) |
|---|---|---|---|---|
| PL Data | **5800.34** | **6589.98** | **6282.17** | **7221.88** |
| Web Dataset | 3134.42 | 3924.95 | 3343.40 | 4424.62 |
| Mosaic ML | 2898.61 | 5099.93 | 2809.69 | 5158.98 |

<details>
<summary> Benchmark details</summary>
&nbsp;

### Imagenet-1.2M Streaming from AWS S3
The [Imagenet-1.2M dataset](https://www.image-net.org/) contains `1,281,167 images`.
To align with other benchmarks, we measured the streaming speed (`images per second`) loaded from [AWS S3](https://aws.amazon.com/s3/) for several frameworks.

We can observe LitData is up to 85 % faster than the second best. Higher is better in the table below.
**Streaming Imagenet-1.2M from AWS S3** (Higher is better)

| Framework | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) | Images / sec 1st Epoch (torch16) | Images / sec 2nd Epoch (torch16) |
|---|---|---|---|---|
| PL Data | **5800.34** | **6589.98** | **6282.17** | **7221.88** |
| Web Dataset | 3134.42 | 3924.95 | 3343.40 | 4424.62 |
| Mosaic ML | 2898.61 | 5099.93 | 2809.69 | 5158.98 |

### Imagenet-1.2M Conversion
</details>

We measured how fast the 1.2 million images can converted into a streamable format. Faster is better in the table below.
&nbsp;

## Time to optimize data
LitData optimizes the Imagenet dataset for fast training 3-5x faster than other frameworks:

Time to optimize 1.2 million ImageNet images (Faster is better):
| Framework |Train Conversion Time | Val Conversion Time | Dataset Size | # Files |
|---|---|---|---|---|
| PL Data | **10:05 min** | **00:30 min** | **143.1 GB** | 2.339 |
| Web Dataset | 32:36 min | 01:22 min | 147.8 GB | 1.144 |
| Mosaic ML | 49:49 min | 01:04 min | **143.1 GB** | 2.298 |

&nbsp;

# Runnable Templates

Fastest way to learn is with [Studios](https://lightning.ai/studios).

[Studios](https://lightning.ai/studios) are reproducible cloud IDE with data, code, dependencies, e.g. so redo everything yourself with ease!

We've published [public templates](https://lightning.ai/studios) that demonstrates how best to use the LitData framework at scale and with several data types.

Sign up [here](https://lightning.ai/) and run your first Studio for free.
----

| Studio | Data type | Dataset |
| -------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | --------------------------------------------------------------------------------------------------------------------------------------: |
| [Use or explore LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) | Image & Text |[LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) |
| [Convert GeoSpatial data to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) | Image & Mask | [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) |
| [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) | Image & Label | [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) |
| [Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) | Text | [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) |
| [Tokenize 2M Swedish Wikipedia Articles](https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles) | Text | [Swedish Wikipedia](https://huggingface.co/datasets/wikipedia) |
| [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) | Text | [English Wikipedia](https://huggingface.co/datasets/wikipedia) |
| [Convert parquets to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) | Parquet Files | Randomly Generated data |
# Paralellize transforms and data optimization on cloud machines
<div align="center">
<img alt="Lightning" src="https://pl-flash-data.s3.amazonaws.com/data-prep.jpg" width="700px">
</div>

# Infinite cloud data processing
## Parallelize data transforms

If you want to scale data processing, you typically need more machines and if you do this yourself, this becomes very tedious and can take a long time to get there.
Transformations with LitServe are linearly parallelizable across machines.

For example, let's say that it takes 56 hours to embed a dataset on a single A10G machine. With LitServe,
this can be speed up by adding more machines in parallel

Instead, create a free account on the [Lightning.ai](https://lightning.ai/) platform and use as many machines as you need from code.
| Number of machines | Hours |
|-----------------|--------------|
| 1 | 56 |
| 2 | 28 |
| 4 | 14 |
| ... | ... |
| 64 | 0.875 |

On the platform, simply specify the number of nodes and the machine type you need as follows:
To scale the number of machines, run the processing script on [Lightning Studios](https://lightning.ai/):

```python
from litdata import map, Machine
Expand All @@ -642,7 +656,8 @@ map(
)
```

Also, the `optimize` operator can do the same to make immense datasets streamable as follows:
## Parallelize data optimization
To scale the number of machines for data optimization, use [Lightning Studios](https://lightning.ai/):

```python
from litdata import optimize, Machine
Expand All @@ -654,15 +669,40 @@ optimize(
)
```

&nbsp;

Within the [LAION 400 MILLION Studio](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset), we utilized 32 machines, each equipped with 32 CPUs, to execute the `optimize` operator, enabling the download of 400 million images in just 2 hours. Below is a screenshot of that job within the [Lightning.ai](https://lightning.ai/) platform. You can execute it yourself [here](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset).
Example: [Process the LAION 400 million image dataset in 2 hours on 32 machines, each with 32 CPUs](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset).

<div align="center">
&nbsp;

<img alt="Lightning" src="https://pl-flash-data.s3.amazonaws.com/data-prep.jpg" width="800px" style="max-width: 100%;">
----

</div>
# Start from a template
Below are templates for real-world applications of LitData at scale.

## Templates: Transform datasets

| Studio | Data type | Time (minutes) | Machines | Dataset |
| ------ | ----------------- | ----------------- | -------------- | -------------- |
| [Download LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) | Image & Text | 120 | 32 |[LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) |
| [Tokenize 2M Swedish Wikipedia Articles](https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles) | Text | 7 | 4 | [Swedish Wikipedia](https://huggingface.co/datasets/wikipedia) |
| [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) | Text | 15 | 3 | [English Wikipedia](https://huggingface.co/datasets/wikipedia) |

## Templates: Optimize + stream data

| Studio | Data type | Time (minutes) | Machines | Dataset |
| ------ | ----------------- | ----------------- | -------------- | -------------- |
| [Convert GeoSpatial data to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) | Image & Mask | 120 | 32 | [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) |
| [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) | Image & Label | 10 | 1 | [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) |
| [Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) | Text | 240 | 32 | [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) |
| [Convert parquets to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) | Parquet Files | 12 | 16 | Randomly Generated data |

&nbsp;

----

# ⚡ Contributors
# Community
LitData is a community project accepting contributions - Let's make the world's most advanced AI data processing framework.

We welcome any contributions, pull requests, or issues. If you use the Streaming Dataset for your own project, please reach out to us on [Discord](https://discord.com/invite/XncpTy7DSt).
- [Get help from 5,0000+ developers on our Discord](https://discord.com/invite/XncpTy7DSt)
- [Licensed under the Apache 2.0 License](https://github.com/Lightning-AI/litdata/blob/main/LICENSE)

0 comments on commit 18602f5

Please sign in to comment.