Commit f739826

docs: update README with stories (#677)
1 parent 21786d1 commit f739826

File tree

1 file changed: +47 -18 lines


README.md

Lines changed: 47 additions & 18 deletions
@@ -79,10 +79,39 @@ pip install 'litdata[extras]'
----

# Speed up model training

Stream datasets directly from cloud storage without local downloads. Choose the approach that fits your workflow:

## Option 1: Start immediately with existing data ⚡⚡

Stream raw files directly from cloud storage - no pre-optimization needed.
```python
from litdata import StreamingRawDataset
from torch.utils.data import DataLoader

# Point to your existing cloud data
dataset = StreamingRawDataset("s3://my-bucket/raw-data/")
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    # Process raw bytes on-the-fly
    pass
```

**Key benefits:**

**Instant access:** Start streaming immediately without preprocessing.
**Zero setup time:** No data conversion or optimization required.
**Native format:** Work with original file formats (images, text, etc.).
**Flexible processing:** Apply transformations on-the-fly during streaming.
**Cloud-native:** Stream directly from S3, GCS, or Azure storage.

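The "process raw bytes on-the-fly" step in the loop above is entirely up to you. A minimal stdlib-only sketch, assuming the raw files are UTF-8 text; the `decode_tokens` helper is hypothetical and not part of litdata:

```python
# Hypothetical helper illustrating on-the-fly processing of raw bytes.
# StreamingRawDataset yields file contents as bytes; here we decode
# UTF-8 text and split it into whitespace-separated tokens.
def decode_tokens(raw: bytes) -> list[str]:
    """Decode raw text bytes into a list of tokens."""
    return raw.decode("utf-8").split()

# In a real loop this would be an item yielded by the dataloader.
sample = b"stream data without downloading"
print(decode_tokens(sample))  # ['stream', 'data', 'without', 'downloading']
```

The same pattern applies to images: decode the bytes (e.g. with PIL) inside the loop or in a dataset transform instead of ahead of time.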
## Option 2: Optimize for maximum performance ⚡⚡⚡
Accelerate model training (20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data without local downloads, using features like loading data subsets, accessing individual samples, and resumable streaming.

**Step 1: Optimize your data (one-time setup)**

Transform raw data into optimized chunks for maximum streaming speed. This step formats the dataset for fast loading by writing data in an efficient chunked binary format.

```python
import numpy as np
from PIL import Image
import litdata as ld

def random_images(index):
    # Replace with your actual image loading here (e.g., .jpg, .png, etc.)
    # Recommended: use compressed formats like JPEG for better storage and optimized streaming speed
    # You can also apply resizing or reduce image quality to further increase streaming speed and save space
    fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    fake_labels = np.random.randint(10)

    # You can use any key:value pairs. Note that their types must not change between samples, and Python lists must
    # always contain the same number of elements with the same types
    data = {"index": index, "image": fake_images, "class": fake_labels}

    return data

if __name__ == "__main__":
    # The optimize function writes data in an optimized format
    ld.optimize(
        fn=random_images,          # the function applied to each input
        inputs=list(range(1000)),  # the inputs to the function (here it's a list of numbers)
        output_dir="fast_data",    # optimized data is stored here
        num_workers=4,             # the number of workers on the same machine
        chunk_bytes="64MB"         # size of each chunk
    )
```
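To make "chunked binary format" concrete: samples are serialized and packed into size-capped chunks, with an index recording where each sample lives, which is what enables random access and subset loading. A toy stdlib-only sketch of the idea; this is an illustration, not litdata's actual on-disk format:

```python
import io
import pickle

def write_chunks(samples, chunk_bytes=256):
    """Pack pickled samples into chunks capped at chunk_bytes each.

    Returns the list of chunk blobs and an index where
    index[i] = (chunk_id, offset, length) for sample i.
    """
    chunks, index = [], []
    buf = io.BytesIO()
    for sample in samples:
        blob = pickle.dumps(sample)
        # Seal the current chunk once the next sample would overflow it
        if buf.tell() and buf.tell() + len(blob) > chunk_bytes:
            chunks.append(buf.getvalue())
            buf = io.BytesIO()
        index.append((len(chunks), buf.tell(), len(blob)))
        buf.write(blob)
    chunks.append(buf.getvalue())
    return chunks, index

def read_sample(chunks, index, i):
    """Fetch one sample via the index, without scanning other chunks."""
    chunk_id, offset, length = index[i]
    return pickle.loads(chunks[chunk_id][offset:offset + length])

chunks, index = write_chunks([{"index": i, "class": i % 10} for i in range(100)])
print(read_sample(chunks, index, 42))  # {'index': 42, 'class': 2}
```

In the real format each chunk is a separate file in `output_dir`, so streaming a subset only downloads the chunks that contain the requested samples.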
@@ -122,14 +151,14 @@ aws s3 cp --recursive fast_data s3://my-bucket/fast_data

**Step 3: Stream the data during training**

Load the data by replacing the PyTorch Dataset and DataLoader with the StreamingDataset and StreamingDataLoader.

```python
import litdata as ld

dataset = ld.StreamingDataset('s3://my-bucket/fast_data', shuffle=True, drop_last=True)

# Custom collate function to handle the batch (optional)
def collate_fn(batch):
    return {
        "image": [sample["image"] for sample in batch],
        "class": [sample["class"] for sample in batch],  # keys are illustrative; use whichever your samples carry
    }

dataloader = ld.StreamingDataLoader(dataset, batch_size=32, collate_fn=collate_fn)

for sample in dataloader:
    pass  # training step goes here
```

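The collate pattern used in Step 3 can be exercised without any cloud data. A minimal stdlib-only sketch with fabricated sample dicts (the `img0`/`img1` values are placeholders):

```python
# Stand-in samples shaped like those produced in Step 1 (index/image/class keys).
batch = [
    {"index": 0, "image": "img0", "class": 3},
    {"index": 1, "image": "img1", "class": 7},
]

# Collate a list of sample dicts into a dict of lists, one entry per key.
def collate_fn(batch):
    return {key: [sample[key] for sample in batch] for key in batch[0]}

print(collate_fn(batch)["class"])  # [3, 7]
```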
**Key benefits:**

**Accelerate training:** Optimized datasets load 20x faster.
**Stream cloud datasets:** Work with cloud data without downloading it.
**PyTorch-first:** Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, and Hugging Face.
**Easy collaboration:** Share and access datasets in the cloud, streamlining team projects.
**Scale across GPUs:** Streamed data automatically scales to all GPUs.
**Flexible storage:** Use S3, GCS, Azure, or your own cloud account for data storage.
**Compression:** Reduce your data footprint with advanced compression algorithms.
**Run local or cloud:** Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.
**Enterprise security:** Self-host or process data on your own cloud account with Lightning Studios.

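The compression benefit is easy to see on repetitive data. A stdlib-only illustration where `zlib` stands in for whichever algorithm the library applies to chunks; the point is the footprint reduction, not the specific codec:

```python
import zlib

# 13 KB of highly repetitive records, like a simple tabular dataset.
raw = b"sample,label\n" * 1000
compressed = zlib.compress(raw, level=9)

print(len(raw), len(compressed))  # compressed is far smaller than raw
```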