Adding support for raw python generator in addition to Dataset

The main goal is to ease the creation of streaming data for the pipeline. `Dataset` is more involved and PyTorch-specific. This PR provides a way to use a Python generator too. This enables huggingface#14250 but can be proposed as a standalone PR.

```python
from transformers import pipeline


def read_data(filename):
    with open(filename, "r") as f:
        for line in f:
            # Yield each line of the file, not the file object itself.
            yield line


pipe = pipeline("text-classification")
for classified in pipe(read_data("large_file.txt")):
    print("Success ! ", classified)
```

The main caveat is the interaction with `DataLoader` when `num_workers>1`. When you have multiple workers, each one receives a copy of the generator (as with `IterableDataset`). That means a naive iterator will fail, since every worker iterates over all items of the generator.

There are ways to do clever "skipping", but that can still be costly: every worker still has to pass through all items of the generator and simply ignores the ones it does not handle. Whether this is acceptable depends on the case.

Using `num_workers=1` is the simplest fix, and it should be good enough if the cost of loading your data is small. In the example above, for instance, smart tricks to skip some lines are unlikely to be a net positive.

If there are better ways to "jump" through the data, then using `Dataset` is advised instead, since different workers can then jump to their own items directly.
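The "skipping" idea mentioned above can be sketched in plain Python, without a `DataLoader`. This is only an illustration under stated assumptions: `read_items` and `shard` are hypothetical helpers, not part of this PR or of `transformers`, and the workers are simulated with a loop. Inside a real PyTorch worker process, the worker id and worker count would come from `torch.utils.data.get_worker_info()`.

```python
from itertools import islice


def read_items():
    # Hypothetical stand-in for a streaming source, e.g. lines of a large file.
    for i in range(10):
        yield f"item-{i}"


def shard(iterable, worker_id, num_workers):
    # Hypothetical helper: keep items worker_id, worker_id + num_workers, ...
    # The skipped items are still produced by the source and then discarded,
    # so every worker still walks through the whole stream.
    return islice(iterable, worker_id, None, num_workers)


num_workers = 3

# Naive case: each simulated worker has its own copy of the generator
# and therefore sees every item.
naive = [list(read_items()) for _ in range(num_workers)]

# Sharded case: each simulated worker keeps only its own slice.
sharded = [list(shard(read_items(), w, num_workers)) for w in range(num_workers)]

print(sum(len(items) for items in naive))    # total items handled, naive
print(sum(len(items) for items in sharded))  # total items handled, sharded
print(sharded[0])
```

The sharded version removes the duplicates, but each worker still pays the full cost of reading the source, which is why a single worker is often the simpler choice for cheap-to-load data.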
Narsil · Dec 8, 2021 · e487dbf
1 parent 2e12d90
commit e487dbf
Showing 1 changed file with 6 additions and 1 deletion.
diff --git a/src/transformers/pipelines/base.py b/src/transformers/pipelines/base.py
@@ -1087,7 +1087,12 @@ def __call__(self, inputs, *args, num_workers=0, batch_size=1, **kwargs):
                 return outputs
             else:
                 return self.run_multi(inputs, preprocess_params, forward_params, postprocess_params)
-        elif Dataset is not None and isinstance(inputs, Dataset):
+        elif (
+            Dataset is not None
+            and isinstance(inputs, Dataset)
+            or isinstance(inputs, types.GeneratorType)
+            and self.framework == "pt"
+        ):
             return self.get_iterator(
                 inputs, num_workers, batch_size, preprocess_params, forward_params, postprocess_params
             )