
Clarify meaning of ingest! versus update! #10

Merged
merged 5 commits into master on Jan 25, 2023

Conversation

@ablaom (Member) commented Jan 22, 2023

@codecov-commenter commented Jan 22, 2023

Codecov Report

Merging #10 (e170e2e) into master (bd75a42) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #10   +/-   ##
=======================================
  Coverage   27.50%   27.50%           
=======================================
  Files           5        5           
  Lines          80       80           
=======================================
  Hits           22       22           
  Misses         58       58           


@jeremiedb commented

I remain a little confused about the extent to which these terms translate unambiguously across the variety of algorithms and their implementations.

For a GBT / EvoTree (a sketch follows this list):

  • fit: preprocesses X / Y, creates a cache, then applies grow_evotree! for some number of iterations
  • update!: essentially grow_evotree!, that is, adds a tree to the model, assuming no change to the data (uses the cache); some hyper-params may have changed (learning rate, regularization, ...) but not all (nbins couldn't, as it would require an expensive re-creation of the cache)
  • ingest!: continues training using new data. This functionality is not supported in the current implementation. Is there an actual use case for which it would be expected to be supported for GBT?
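
To make concrete what I mean by the three verbs, here is a minimal sketch in Julia. Everything below is a made-up stand-in for illustration, not the actual EvoTrees internals, with `grow_tree!` playing the role of `grow_evotree!`:

```julia
# Illustrative stand-ins only; not the EvoTrees.jl implementation.
mutable struct GBTModel
    trees::Vector{Any}   # the boosted trees
    cache::Any           # preprocessed data: binned features, gradients, ...
    eta::Float64         # learning rate; may change between calls
end

build_cache(x, y) = (x = x, y = y)               # stand-in for binning / preprocessing
grow_tree!(m::GBTModel) = push!(m.trees, :tree)  # stand-in for one boosting round

function fit(x, y; eta = 0.1, nrounds = 10)
    m = GBTModel(Any[], build_cache(x, y), eta)
    foreach(_ -> grow_tree!(m), 1:nrounds)       # analogous to grow_evotree!
    return m
end

# update!: more rounds on the *same* data, reusing the cache; the learning
# rate may change, but cache-baked hyper-params (e.g. nbins) may not.
function update!(m::GBTModel; eta = m.eta, nrounds = 1)
    m.eta = eta
    foreach(_ -> grow_tree!(m), 1:nrounds)
    return m
end

# ingest!: continue training on *new* data, which would force the expensive
# cache re-creation (not supported by the current implementation).
function ingest!(m::GBTModel, xnew, ynew; nrounds = 1)
    m.cache = build_cache(xnew, ynew)
    foreach(_ -> grow_tree!(m), 1:nrounds)
    return m
end
```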

Is the intent of ingest! to act as support for online learning? As I understand it, the intended effect of fit(x1, y1, m); ingest!(x2, y2, m) is to be equivalent to fit(x3, y3), where x3 is the concatenation of x1 and x2 (and y3 of y1 and y2). If that is the case, then I guess some extra information needs to be captured during fit for, say, a linear model to exhibit such behavior, where fit + ingest! equals fit on the concatenated data.
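
For the linear-model case, that "extra information" could be the sufficient statistics XᵀX and Xᵀy, which make the identity exact. A sketch (all names illustrative, not the framework's API):

```julia
using LinearAlgebra

mutable struct OLS
    XtX::Matrix{Float64}   # accumulated XᵀX
    Xty::Vector{Float64}   # accumulated Xᵀy
    coef::Vector{Float64}
end

function fit(X::Matrix, y::Vector)
    XtX, Xty = X' * X, X' * y
    OLS(XtX, Xty, XtX \ Xty)
end

# Folding new data into the sufficient statistics reproduces exactly
# the fit on the concatenated data.
function ingest!(m::OLS, Xnew::Matrix, ynew::Vector)
    m.XtX += Xnew' * Xnew
    m.Xty += Xnew' * ynew
    m.coef = m.XtX \ m.Xty
    return m
end

X1, y1 = randn(50, 3), randn(50)
X2, y2 = randn(30, 3), randn(30)
m = ingest!(fit(X1, y1), X2, y2)
@assert m.coef ≈ fit(vcat(X1, X2), vcat(y1, y2)).coef
```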

Is it assumed that for ingest!, the new data keeps the same features, or could it be a subset / superset? The latter could be relevant in situations where one uses the initial model as an offset model, over which training could be performed, potentially on additional features, though I don't think such a mechanism would be the appropriate way to achieve this rather than explicit model stacking.

For neural nets, where a model is fed data through a DataLoader, I'm not too clear which of update! and ingest! best applies. Is each batch of an epoch to be considered new data? Or would ingest! only be used if a new DataLoader is built on new data?
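
One plausible reading, sketched generically (the loader and training step below are stand-ins, not Flux's actual API): batches within an epoch are not new data, so re-running the same loader is update!, while ingest! is reserved for a loader built over genuinely new observations:

```julia
# Stand-ins only: a "DataLoader" as a plain batch iterator over columns,
# and a no-op in place of the real gradient step.
make_loader(X, y; batchsize = 32) =
    ((X[:, idx], y[idx]) for idx in Iterators.partition(eachindex(y), batchsize))

train_batch!(model, batch) = model   # stand-in for one gradient update

# update!: more passes over the *same* loader (possibly with new
# hyper-parameters); a batch within an epoch is not "new data".
function update!(model, loader; epochs = 1)
    for _ in 1:epochs, batch in loader
        train_batch!(model, batch)
    end
    return model
end

# ingest!: only when a loader is built over genuinely new observations,
# continuing from the current weights.
ingest!(model, Xnew, ynew; batchsize = 32, epochs = 1) =
    update!(model, make_loader(Xnew, ynew; batchsize = batchsize); epochs = epochs)
```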

I think one reason I find the update / ingest distinction not so clear is that the difference in implications between the two verbs may have more to do with algorithm implementations, and whether they involve preprocessing / caching, than with genuinely distinct, generally applicable verbs.

For example, a GBT using the exact method (one which does not require data preprocessing) could have its tree construction implemented using a streaming / online approach. Each iteration could be fed either entirely new data (with the same features) or just another subsample of the original data. The situation is similar for neural nets, where I don't see a fundamental distinction between a batch from a fixed dataset and a batch coming from an entirely new one. And in all cases, I think there are some parameters that can be changed through both update! and ingest!, like learning rates and regularization, and others that can't, like the number of features or the size of the hidden layers.
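
To illustrate: for a purely streaming learner with no cache (an illustrative SGD on squared loss; names are made up), the two verbs collapse to the same inner step, and only the provenance of the data differs:

```julia
mutable struct SGDModel
    w::Vector{Float64}
    eta::Float64     # learning rate; changeable through either verb
end

# One stochastic gradient step on the squared loss for a single example.
function step!(m::SGDModel, x::Vector, y::Real)
    m.w .-= m.eta .* (m.w' * x - y) .* x
    return m
end

# With no cache to invalidate, both verbs are the same loop; the number
# of features (length of w) is the kind of thing neither verb can change.
update!(m::SGDModel, X, Y) = (foreach(i -> step!(m, X[:, i], Y[i]), eachindex(Y)); m)
ingest!(m::SGDModel, X, Y) = update!(m, X, Y)
```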

Perhaps this has already been done, but I'm wondering whether a clarification of the scope of the algorithms / use cases supported by the framework would help. By that I mean making explicit what the implications are (is there any overhead, and in what circumstances) for a variety of algorithm families, notably:

  • Linear models
  • Neural nets
  • Gradient boosted trees
  • Algorithms requiring a cache / initialization vs. those that don't

Given the broadly different crowds that may feel concerned by the framework, it also comes with very different perspectives on what the "natural" way of doing things is and what looks like a reasonable compromise (for instance, performance overhead is a big deal in my production-oriented usage, but isn't for many research / educational ones).

@ablaom merged commit aa393c0 into master on Jan 25, 2023