Do we want to support iterators for data loading? #19
Comments
Comment by elijahbenizzy: Yeah, I'd need to see more of a workflow/motivating example for this, TBH. E.g., a GraphAdapter that chunks up data and iterates in a streaming sense could be high-value as well...
OK, inspired by @jdonaldson's hamilton-like framework, I'm curious what could happen if we use chunking. I'm going to walk through a few different UIs I've been mulling over that would fit into the way we currently do things. Some use cases:
### Requirements/nice-to-haves
### Ideas

#### Chunking with type annotations

The idea here is that we use a `Chunked[...]` type annotation to mark outputs that are released one batch at a time:

```python
def training_files() -> List[str]:
    return [file_1, file_2, file_3]


def training_batch(training_files: List[str]) -> Chunked[pd.DataFrame]:
    """Releases the dataframe in batches for training."""
    for mini_training_file in training_files:
        yield pd.read_csv(mini_training_file)


def bad_data_filtered(training_batch: pd.DataFrame) -> pd.DataFrame:
    """Map operation that takes in a dataframe and outputs a dataframe.
    This is run each time training_batch yields a batch -- we know that
    to be the case, as it is a pure map function."""
    return training_batch.filter(_filter_fn(training_batch))


def X(training_batch: pd.DataFrame) -> pd.DataFrame:
    return training_batch[FEATURE_COLUMNS]


def Y(training_batch: pd.DataFrame) -> pd.Series:
    return training_batch[TARGET_COLUMNS]


def model(X: Chunked[pd.DataFrame], Y: Chunked[pd.Series]) -> Model:
    """Aggregate operation. This is run once per upstream yield, but only
    the final model is returned."""
    model = Model()
    for x, y in zip(X, Y):
        model.minibatch_train(x, y)
    return model
```

Say we have three batches -- what's the order of operations?
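One candidate answer is fully streaming, depth-first execution: each batch flows through every map node and into the aggregator before the next batch is loaded. Here's a framework-free simulation of that semantics, with ints standing in for dataframes -- a sketch of one possible ordering, not a settled design:

```python
import itertools

def training_batch():
    for i in range(3):
        print(f"load batch {i}")
        yield i

def x_node(batch):
    print(f"  map X on batch {batch}")
    return batch

def y_node(batch):
    print(f"  map Y on batch {batch}")
    return batch

def model(xs, ys):
    for x, y in zip(xs, ys):
        print(f"    minibatch_train({x}, {y})")
    return "model"

# tee() lets X and Y share one pass over the generator without re-running it.
b_for_x, b_for_y = itertools.tee(training_batch())
model(map(x_node, b_for_x), map(y_node, b_for_y))
# Prints, in order: load batch 0 / map X / map Y / train / load batch 1 / ...
# i.e. each batch is fully processed before the next one is loaded.
```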
So, the rules are:
There are some interesting implications/extensions.

#### Hydrating from cache

In this case, `model` accepts a previous model from a prior DAG run, or you can seed it with a specific result:

```python
def model(X: Chunked[pd.DataFrame], Y: Chunked[pd.Series], model: Model = None) -> Model:
    """Aggregate operation. This is run using the generator from before,
    but the final model is returned."""
    model = Model() if model is None else model
    for x, y in zip(X, Y):
        model.minibatch_train(x, y)
    return model
```

Note, however, that ...

#### Parallelization

Any ...

#### Logging functions

We could potentially break the aggregator into two and have one map over it. There's likely a much cleaner way to write this, but:

```python
def model_training(X: Chunked[pd.DataFrame], Y: Chunked[pd.Series]) -> Chunked[Tuple[Model, TrainingMetrics]]:
    """Aggregate operation broken up. Note that the model is just
    returned on the last yield."""
    model = Model()
    for x, y in zip(X, Y):
        metrics = model.minibatch_train(x, y)
        yield model, metrics


# Just returns the ID of where it was logged.
# This is a side-effecty function and should likely be a materialized result.
def logger(model_training: Chunked[Tuple[Model, TrainingMetrics]]) -> str:
    run_instance = generate_run_instance_from_logger()
    for _, metrics in model_training:
        run_instance.log(metrics)
    return run_instance.id


def model(model_training: Chunked[Tuple[Model, TrainingMetrics]]) -> Model:
    *_, (final_model, _) = model_training
    return final_model
```

Note that this implementation likely involves intelligent caching, as two outputs depend on it.

#### Nested chunking

Yeah, this is too messy to allow for now, so I'm going to go ahead and say it's banned. Could be implemented if we allow the ...
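In a first cut, that "intelligent caching" could be as simple as `itertools.tee`: when two nodes (`logger` and `model`) consume the same `Chunked` output, the framework hands each one its own cursor over a single pass. A minimal sketch -- `X_chunks`/`Y_chunks` are hypothetical inputs, and none of this is Hamilton's actual machinery:

```python
import itertools

training_gen = model_training(X_chunks, Y_chunks)  # yields (model, metrics)
for_logger, for_model = itertools.tee(training_gen)

run_id = logger(for_logger)     # iterates, logging metrics per chunk
final_model = model(for_model)  # iterates, keeping only the last model

# Caveat: tee() buffers items until both consumers have read them, so running
# the consumers serially like this buffers the whole stream; a real
# implementation would interleave the two consumers.
```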
OK, latest API proposal -- feel free to react/comment. Planning to scope/implement shortly. Note these are just adaptations of the above. Two primary use-cases we want to handle:
### API

We are using type hints to handle this -- they will be loose wrappers over generators. Three generic types, inspired (in part) by classic map-reduce:
```python
def files_to_process(dir: str) -> Sequential[str]:
    """Anything downstream of this that depends on `files_to_process` as a `str`
    will run in sequence, until a Collect function."""
    for f in list_dir(dir):
        if f.endswith('.csv'):
            yield f
```

```python
def files_to_process(dir: str) -> Parallel[str]:
    """Anything downstream of this that depends on `files_to_process` as a `str`
    can run in parallel, until a Collect function. Note that this means the
    entire generator might be run greedily, and the results dished out."""
    for f in list_dir(dir):
        if f.endswith('.csv'):
            yield f
```
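To spell out the intended contrast, here's a framework-free sketch of the assumed semantics of the two annotations (`run_sequential`/`run_parallel` are hypothetical helpers, not part of the proposal):

```python
from concurrent.futures import ThreadPoolExecutor

def run_sequential(generator, downstream):
    # Sequential: pull one item at a time, so the generator and the
    # downstream node run in lockstep and only one item is in flight.
    for item in generator:
        yield downstream(item)

def run_parallel(generator, downstream):
    # Parallel: the generator may be drained greedily up front, with the
    # downstream work fanned out to a pool of workers.
    items = list(generator)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(downstream, items))
```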
```python
# Setting up a basic map-reduce example
def file_loaded(files_to_process: str) -> str:
    with open(files_to_process, 'r') as f:
        return f.read()


def file_counted(file_loaded: str) -> Counter:
    return Counter(tokenize(file_loaded))


def word_counts(file_counted: Collect[Counter]) -> Counter:
    """Joins all the word counts."""
    full_counter = Counter()
    for counter in file_counted:
        for word, count in counter.items():
            full_counter[word] += count
    return full_counter
```

The rules are as follows:
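Whatever the exact rules, as a sanity check the same map-reduce can be wired up by hand with plain generators. A minimal sketch -- `tokenize` is stubbed here since the example above assumes it exists:

```python
import os
from collections import Counter

def tokenize(text: str) -> list:
    # Stand-in for the tokenize() helper the example assumes.
    return text.lower().split()

def word_counts_by_hand(dir: str) -> Counter:
    full_counter = Counter()
    for f in os.listdir(dir):                         # files_to_process
        if not f.endswith('.csv'):
            continue
        with open(os.path.join(dir, f), 'r') as fh:   # file_loaded
            counted = Counter(tokenize(fh.read()))    # file_counted
        full_counter.update(counted)                  # the Collect step
    return full_counter
```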
### Edge cases
### Implementation

TBD for now -- will scope out and report. The first implementation will likely break some of the edge cases above. Some requirements, though:
More edge cases:
Issue by skrawcz
Monday Feb 07, 2022 at 17:25 GMT
Originally opened as stitchfix/hamilton#68
### What?
If we want to chunk over data, a natural way to do that is via an iterator.
Example: Enable "input" functions to be iterators
This is fraught with some edge cases, but could be a more natural way to chunk over large data sets. This perhaps requires a new driver, as we'd want some `next()`-type semantic logic on the output of `execute()`.

Originally posted by @skrawcz in stitchfix/hamilton#43 (comment)
Things to think through to decide whether this is something useful to have:

- Should `execute()` only do one iteration?
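To make the `next()` idea concrete, here's a rough sketch of what an iterating driver could look like. Everything here (`IteratingDriver`, `execute_iter`) is hypothetical, for illustration only -- it is not Hamilton's actual API:

```python
from typing import Any, Callable, Dict, Iterable, Iterator

class IteratingDriver:
    """Hypothetical driver exposing next()-style semantics: execute_iter()
    yields one result-set per input chunk instead of one final result."""

    def __init__(self, nodes: Dict[str, Callable[[Dict[str, Any]], Any]]):
        # name -> function computing that node from a dict of chunk inputs
        self.nodes = nodes

    def execute_iter(
        self,
        final_vars: Iterable[str],
        chunked_inputs: Iterator[Dict[str, Any]],
    ) -> Iterator[Dict[str, Any]]:
        for inputs in chunked_inputs:
            yield {var: self.nodes[var](inputs) for var in final_vars}

# Callers would then pull results chunk-by-chunk:
# for result in driver.execute_iter(["model"], batches):
#     ...
```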