Magic FileSerializer is causing issues #178

@tchaton

Some users have run into issues when a sample contains a filepath: litdata automatically detects that the path points to a valid file and reads its content into the sample.

This kept user code concise, but users unaware of this behaviour ended up with far more chunks than they expected.
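
To make the behaviour concrete, here is a minimal sketch of what such path auto-detection amounts to. It is not litdata's actual code: `serialize_magic` and `serialize_plain` are hypothetical helpers written only for illustration, and the file is a small temporary stand-in.

```python
import os
import tempfile


def serialize_magic(value):
    """Hypothetical 'magic' serializer: a string naming an existing file
    is replaced by that file's bytes (illustration only, not litdata code)."""
    if isinstance(value, str) and os.path.isfile(value):
        with open(value, "rb") as f:
            return f.read()
    return str(value).encode("utf-8")


def serialize_plain(value):
    """Hypothetical plain serializer: a string is always stored as a string."""
    return str(value).encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "payload.bin")
    with open(path, "wb") as f:
        f.write(b"\0" * 1024)  # small stand-in for a large file

    magic_size = len(serialize_magic(path))  # size of the file content
    plain_size = len(serialize_plain(path))  # size of the path string

print(magic_size, plain_size)
```

With the magic variant, the stored size of a sample depends on the file behind the path rather than on the string the user returned, which is easy to miss.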

Example:

fallocate -l 50MB gentoo_root.img

from litdata import optimize

def fn(filepath):
    return filepath


optimize(
    fn=fn,
    inputs=["gentoo_root.img" for _ in range(10)],
    output_dir="./data",
    chunk_bytes="64MB",
    num_workers=1,
)

Each sample stores the entire 50MB file, and with chunk_bytes="64MB" only one such sample fits per chunk, so you end up with 10 chunks holding 10 copies of the file instead of 1 chunk with 10 samples that each contain just the filepath.
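
The chunk count can be checked with a small back-of-the-envelope model. This is a simplified greedy packing sketch, not litdata's writer: `count_chunks` is a hypothetical helper, and both "50MB" and "64MB" are assumed to be decimal megabytes.

```python
def count_chunks(sample_sizes, chunk_bytes):
    """Greedy packing model: open a new chunk whenever the next sample
    would not fit in the current one (simplified, not litdata's writer)."""
    chunks = 0
    used = 0
    for size in sample_sizes:
        if chunks == 0 or used + size > chunk_bytes:
            chunks += 1
            used = 0
        used += size
    return chunks


FILE_SIZE = 50 * 1000 * 1000    # assumed size of gentoo_root.img in bytes
CHUNK_BYTES = 64 * 1000 * 1000  # assumed meaning of chunk_bytes="64MB"
PATH = "gentoo_root.img"

# Each sample holds the file content: two samples never fit in one chunk.
chunks_with_content = count_chunks([FILE_SIZE] * 10, CHUNK_BYTES)

# Each sample holds only the path string: all ten fit in one chunk.
chunks_with_path = count_chunks([len(PATH.encode("utf-8"))] * 10, CHUNK_BYTES)

print(chunks_with_content, chunks_with_path)
```

Under these assumptions the model gives 10 chunks when the file content is stored and 1 chunk when only the path string is stored, matching the numbers above.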

Labels: bug, help wanted