Dataset not created when using map() on data structure without file paths inside #99

@awaelchli

Description

🐛 Bug

Dataset does not get created when map is given a list of items without paths.

To Reproduce

  1. pip install litdata==0.2.3 lightning_sdk==0.1.3
  2. Run the script under "Code sample" below
  3. Wait for job to finish
  4. Restart studio
  5. ls /teamspace/datasets

Code sample

import os
from litdata import map
from lightning_sdk import Machine


def create_files(idx, output_dir):
    with open(os.path.join(output_dir, f"{idx}.txt"), "w") as f:
        f.write(str(idx))


def main():
    # root_dir = "./data-processed"
    root_dir = "/teamspace/datasets/data-processed"
    # os.makedirs(root_dir, exist_ok=True)

    inputs = list(range(100))

    map(
        fn=create_files,
        inputs=inputs,
        output_dir=root_dir,
        num_workers=os.cpu_count(),
        num_nodes=2,
        machine=Machine.CPU,
    )


if __name__ == "__main__":
    main()

In #73, an extra condition, "and self.input_dir.path", was added:
https://github.com/Lightning-AI/litdata/blame/58f7aeb5836383ff839b4d966462c568aa6e7435/src/litdata/processing/data_processor.py#L1022
The assumption there was that map always receives a data structure of file paths, but that is not always true. For example, map could be called on a list of URLs to download and process, or on a list of integers as in the script above.
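To make the suspected failure mode concrete, here is a minimal, self-contained sketch. It is not litdata's actual code: `infer_input_dir` and `should_register_output` are hypothetical names invented for illustration, and the real condition at the linked line may differ. It only shows how a guard that also requires an input directory would skip the output step whenever the inputs are not file paths.

```python
import os
from typing import Any, Optional, Sequence


def infer_input_dir(inputs: Sequence[Any]) -> Optional[str]:
    """Hypothetical helper: derive an input directory only if every input is an existing path."""
    if not inputs:
        return None
    for item in inputs:
        if not (isinstance(item, str) and os.path.exists(item)):
            # Integers, URLs, and other non-path items yield no input directory.
            return None
    return os.path.dirname(os.path.abspath(inputs[0]))


def should_register_output(
    output_dir: Optional[str],
    input_dir: Optional[str],
    require_input_dir: bool,
) -> bool:
    """Hypothetical guard: decide whether the output dataset gets created."""
    if require_input_dir:
        # Mirrors the shape of the extra condition described in the issue (assumed).
        return bool(output_dir) and bool(input_dir)
    return bool(output_dir)


output_dir = "/teamspace/datasets/data-processed"

# Inputs from the reproduction script: plain integers, no file paths.
int_inputs = list(range(100))
int_input_dir = infer_input_dir(int_inputs)

# Inputs from the URL use case mentioned in the issue (example URLs are made up).
url_inputs = ["https://example.com/a.bin", "https://example.com/b.bin"]
url_input_dir = infer_input_dir(url_inputs)

guarded_ints = should_register_output(output_dir, int_input_dir, require_input_dir=True)
guarded_urls = should_register_output(output_dir, url_input_dir, require_input_dir=True)
unguarded_ints = should_register_output(output_dir, int_input_dir, require_input_dir=False)

print(guarded_ints, guarded_urls, unguarded_ints)
```

Under these assumptions, a fix would make the decision depend on the output directory alone, or treat a missing input directory as a valid case instead of a reason to skip.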

Labels: bug, help wanted