
Clarify meaning of ingest! versus update! #10

Merged
merged 5 commits into master on Jan 25, 2023

Conversation

@ablaom (Member) commented Jan 22, 2023

@codecov-commenter commented Jan 22, 2023

Codecov Report

Merging #10 (e170e2e) into master (bd75a42) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #10   +/-   ##
=======================================
  Coverage   27.50%   27.50%           
=======================================
  Files           5        5           
  Lines          80       80           
=======================================
  Hits           22       22           
  Misses         58       58           


@jeremiedb commented

I remain a little confused about the extent to which these terms translate unambiguously across the variety of algorithms and their implementations.

For a GBT / EvoTree (a sketch follows this list):

  • fit: preprocesses X / Y, creates a cache, then applies grow_evotree! for some number of iterations
  • update!: essentially grow_evotree!, that is, adds a tree to the model, assuming no change to the data (uses the cache); some hyper-params may have changed (learning rate, regularization, ...) but not all (nbins couldn't, as it would require an expensive re-creation of the cache)
  • ingest!: continues training using new data. This functionality is not supported in the current implementation. Is there an actual use case for which it would be expected to be supported for GBT?
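
To make concrete what I mean by the three verbs, here is a minimal sketch in Julia. Everything below is a made-up stand-in for illustration, not the actual EvoTrees internals, with `grow_tree!` playing the role of `grow_evotree!`:

```julia
# Illustrative stand-ins only; not the EvoTrees.jl implementation.
mutable struct GBTModel
    trees::Vector{Any}   # the boosted trees
    cache::Any           # preprocessed data: binned features, gradients, ...
    eta::Float64         # learning rate; may change between calls
end

build_cache(x, y) = (x = x, y = y)               # stand-in for binning / preprocessing
grow_tree!(m::GBTModel) = push!(m.trees, :tree)  # stand-in for one boosting round

function fit(x, y; eta = 0.1, nrounds = 10)
    m = GBTModel(Any[], build_cache(x, y), eta)
    foreach(_ -> grow_tree!(m), 1:nrounds)       # analogous to grow_evotree!
    return m
end

# update!: more rounds on the *same* data, reusing the cache; the learning
# rate may change, but cache-baked hyper-params (e.g. nbins) may not.
function update!(m::GBTModel; eta = m.eta, nrounds = 1)
    m.eta = eta
    foreach(_ -> grow_tree!(m), 1:nrounds)
    return m
end

# ingest!: continue training on *new* data, which would force the expensive
# cache re-creation (not supported by the current implementation).
function ingest!(m::GBTModel, xnew, ynew; nrounds = 1)
    m.cache = build_cache(xnew, ynew)
    foreach(_ -> grow_tree!(m), 1:nrounds)
    return m
end
```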

Is the intent of ingest! to act as support for online learning? As I understand it, the intended effect of fit(x1, y1, m); ingest!(x2, y2, m) is to be equivalent to fit(x3, y3), where x3 is the concatenation of x1 and x2 (and y3 of y1 and y2). If that is the case, then I guess some extra information needs to be captured during fit for, say, a linear model to exhibit such behavior, where fit + ingest! equals fit on the concatenated data.
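
For the linear-model case, that "extra information" could be the sufficient statistics XᵀX and Xᵀy, which make the identity exact. A sketch (all names illustrative, not the framework's API):

```julia
using LinearAlgebra

mutable struct OLS
    XtX::Matrix{Float64}   # accumulated XᵀX
    Xty::Vector{Float64}   # accumulated Xᵀy
    coef::Vector{Float64}
end

function fit(X::Matrix, y::Vector)
    XtX, Xty = X' * X, X' * y
    OLS(XtX, Xty, XtX \ Xty)
end

# Folding new data into the sufficient statistics reproduces exactly
# the fit on the concatenated data.
function ingest!(m::OLS, Xnew::Matrix, ynew::Vector)
    m.XtX += Xnew' * Xnew
    m.Xty += Xnew' * ynew
    m.coef = m.XtX \ m.Xty
    return m
end

X1, y1 = randn(50, 3), randn(50)
X2, y2 = randn(30, 3), randn(30)
m = ingest!(fit(X1, y1), X2, y2)
@assert m.coef ≈ fit(vcat(X1, X2), vcat(y1, y2)).coef
```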

Is it assumed that for ingest!, the new data keeps the same features, or could it be a subset / superset? The latter could be relevant in situations where one uses the initial model as an offset model, over which training could be performed, potentially on additional features, though I don't think such a mechanism would be the appropriate way to achieve this rather than explicit model stacking.

For neural nets, where a model is fed data through a DataLoader, I'm not too clear which of update! and ingest! best applies. Is each batch of an epoch to be considered new data? Or would ingest! only be used if a new DataLoader is built on new data?
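
One plausible reading, sketched generically (the loader and training step below are stand-ins, not Flux's actual API): batches within an epoch are not new data, so re-running the same loader is update!, while ingest! is reserved for a loader built over genuinely new observations:

```julia
# Stand-ins only: a "DataLoader" as a plain batch iterator over columns,
# and a no-op in place of the real gradient step.
make_loader(X, y; batchsize = 32) =
    ((X[:, idx], y[idx]) for idx in Iterators.partition(eachindex(y), batchsize))

train_batch!(model, batch) = model   # stand-in for one gradient update

# update!: more passes over the *same* loader (possibly with new
# hyper-parameters); a batch within an epoch is not "new data".
function update!(model, loader; epochs = 1)
    for _ in 1:epochs, batch in loader
        train_batch!(model, batch)
    end
    return model
end

# ingest!: only when a loader is built over genuinely new observations,
# continuing from the current weights.
ingest!(model, Xnew, ynew; batchsize = 32, epochs = 1) =
    update!(model, make_loader(Xnew, ynew; batchsize = batchsize); epochs = epochs)
```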

I think one reason I find the update / ingest distinction not so clear is that the difference in implications between the two verbs may have more to do with algorithm implementations, and whether they involve preprocessing / caching, than with genuinely distinct, generally applicable verbs.

For example, a GBT using the exact method (one which does not require data preprocessing) could have its tree construction implemented using a streaming / online approach. Each iteration could be fed either entirely new data (with the same features) or just another subsample of the original data. The situation is similar for neural nets, where I don't see a fundamental distinction between a batch from a fixed dataset and a batch coming from an entirely new one. And in all cases, I think there are some parameters that can be changed through both update! and ingest!, like learning rates and regularization, and others that can't, like the number of features or the size of the hidden layers.
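
To illustrate: for a purely streaming learner with no cache (an illustrative SGD on squared loss; names are made up), the two verbs collapse to the same inner step, and only the provenance of the data differs:

```julia
mutable struct SGDModel
    w::Vector{Float64}
    eta::Float64     # learning rate; changeable through either verb
end

# One stochastic gradient step on the squared loss for a single example.
function step!(m::SGDModel, x::Vector, y::Real)
    m.w .-= m.eta .* (m.w' * x - y) .* x
    return m
end

# With no cache to invalidate, both verbs are the same loop; the number
# of features (length of w) is the kind of thing neither verb can change.
update!(m::SGDModel, X, Y) = (foreach(i -> step!(m, X[:, i], Y[i]), eachindex(Y)); m)
ingest!(m::SGDModel, X, Y) = update!(m, X, Y)
```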

Perhaps this has already been done, but I'm wondering whether a clarification of the scope of the algorithms / use cases supported by the framework would help. By that I mean making explicit what the implications are (is there any overhead, and in what circumstances) for a variety of algorithm families, notably:

  • Linear models
  • Neural nets
  • Gradient boosted trees
  • Algorithms requiring a cache / initialization vs. those that don't

Given the broadly different crowds that may feel concerned by the framework, it also comes with very different perspectives on what the "natural" way of doing things is and what looks like a reasonable compromise (for instance, performance overhead is a big deal in my production-oriented usage, but isn't for many research / educational ones).

@ablaom merged commit aa393c0 into master on Jan 25, 2023