Composite serialization #17
Conversation
Codecov Report

```diff
@@           Coverage Diff            @@
##              dev      #17    +/-   ##
===========================================
- Coverage   100.00%   97.89%   -2.11%
===========================================
  Files            2        2
  Lines           32       95      +63
===========================================
+ Hits            32       93      +61
- Misses           0        2       +2
```

Continue to review full report at Codecov.
src/MLJSerialization.jl (Outdated)

```julia
import IterationControl
```

```julia
export serializable, restore!, save, machine
```
Currently save is not exported. Are you wanting to export it for convenience?
Yes, I was actually wondering why it wasn't.
```julia
end

setreport!(copymach, mach.report)
```
Minor detail: maybe we want to set `copymach.state = -1` to tag the machine as "serialisable", so that we can throw an error in the new situation where a user tries `predict(mach, ...)` when `mach` has been stripped of data (logic that will need adding at MLJBase/src/operations.jl).
I suppose that depends on how faithful you want the state to be, and what the current logic behind it is. Let's imagine a situation where data comes in every day: we update our machine with it as it comes and serialize it back to disk. I suppose we would like the state to record how many times we have updated our machine? Would that be captured at the moment? Setting it to -1 will necessarily erase this information.
Well, currently MLJ does not support incremental learning, that is, "update machine with new data" (eg, gradient descent models) - only updating with new model parameters or new views of the same data. If you get new data, you need to rebind your model with all the data you have and retrain from scratch. There was some plan to support incremental learning, but in that case we would probably introduce a new field to keep track of the number of "data injections".
Presently the `state` variable is used to determine: (i) whether the machine has been trained (`state > 0`); and (ii) whether a machine upstream of a machine `mach` in a learning network has been trained since the last time `mach` was trained (via the `old_upstream_state` field, which is a snapshot of the state of every upstream machine).
On the other hand, if you have a reliable way to detect whether a machine has been serialised, we can go with that.
The problem with setting the state to -1 is that, when restored, the machine cannot be used to predict again unless we set the state back to 1 in `restore!`, which seems a bit counterproductive. Any thoughts?
Ah. You are right. My first thought is to change the predict logic to allow -1.
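For what it's worth, the convention being converged on here can be sketched as follows. This is illustrative only, not actual MLJBase code: `ToyMachine`, `make_serializable!` and `check_operable` are made-up names, and the real check would live in MLJBase/src/operations.jl.

```julia
# Illustrative sketch of the state convention discussed above.
# All names here are assumptions, not the real MLJBase implementation.
mutable struct ToyMachine
    state::Int   # 0 = never trained; n >= 1 = trained n times
    data::Any    # training data, stripped before serialization
end

# Tag a stripped machine with state == -1, as proposed:
function make_serializable!(mach::ToyMachine)
    mach.data = nothing
    mach.state = -1
    return mach
end

# The predict-time check then allows -1, so a restored machine is usable:
function check_operable(mach::ToyMachine)
    mach.state == 0 && error("Machine has not been trained.")
    return nothing   # state >= 1 (trained) and state == -1 (restored) pass
end
```

The point of allowing `-1` in the check is that `restore!` then does not need to overwrite the tag just to make `predict` work again.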
Let's also clear the

During our off-line discussion, I think I missed the essential point here: If we leave the
@ablaom I think I am not too far from a working solution and have a few more questions (I am also available for a chat if you want to discuss more):
Thanks for this progress. I think another meeting before proceeding too much further is a good idea, but I will make some quick responses to help you prepare:
This previously existed to enable the user to pass options to a model-specific serialiser, and to the default JLSO serialiser. The latter use-case is not relevant under the new design. I don't know of a current use-case for model-specific serialiser options, so if they get in the way, I guess we can get rid of them. (I didn't really understand the conflict, but we can discuss.)
Sounds like you're throwing the baby out with the bathwater. Iteration control still needs a
DecisionTree is used in current tests, but you can use something different if you prefer. Note that
The original reason for shifting this out was that JLSO is (or was) quite heavy, slowing down loading and pre-compilation. Under the new design we are dumping it, so moving this back might make sense.
The goal is to address: #15
This is a proposal for MLJSerialization. There are 4 high-level functions:

- `serializable`: makes a copy of a machine that is serializable, i.e. removes data and makes fitresults serializable
- `restore!`: in-place modification of a machine by restoration of the fitresults
- `save`: combines `serializable` with the built-in `serialize` method
- `machine`: combines the built-in `deserialize` with `restore!`

So a user can either use the provided end-to-end `save` and `machine`, or use the Serialization module of their choice. Happy to discuss the change, which unfortunately is almost a full rewrite and will most probably be breaking. However, I have just overloaded the `MLJModelInterface.save` method, so there is no change of logic.
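To make the two intended workflows concrete, here is a sketch assuming the four functions behave as described above. The model choice, the file name, and the `make_blobs` toy data are placeholders for illustration, not part of the proposal itself.

```julia
using MLJ, MLJSerialization
using Serialization                 # stdlib, for the bring-your-own route

X, y = make_blobs(100, 3)           # toy data (placeholder)
Tree = @load DecisionTreeClassifier pkg=DecisionTree
mach = machine(Tree(), X, y)
fit!(mach)

# Route 1: the end-to-end helpers proposed here
save("mach.jls", mach)              # serializable + built-in serialize
mach2 = machine("mach.jls")         # built-in deserialize + restore!
predict(mach2, X)

# Route 2: any serializer of the user's choice
smach = serializable(mach)          # data stripped, fitresult serializable
serialize("mach.jls", smach)
smach = deserialize("mach.jls")
restore!(smach)
predict(smach, X)
```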
Progress tracking:

- [ ] Set of generic questions on machines. I find it difficult to understand the organization/separation of concerns of the data structures in machines (`cache`, `report`, `fitresult`). A few subquestions:
  - `TunedModel` --> should be dealt with.
- [ ] Currently, there is a problem with the filename used by `XGBoostRegressor`, which will be used multiple times in a stack, for instance, and then erased. Is it really necessary to pass the filename to `save` for models, as it doesn't seem to be of any use in this example? As a user I would not expect to have, say, 20 files saved if I save a `Stack`. I think `IterationControl` was used for that purpose, but I haven't looked into it yet, nor do I know if it can be used easily with my proposal. @ablaom has posted: Drop `filename` arg from `save(filename, model, ...)` MLJBase.jl#724
- [ ] `machine` should accept new arguments to be fitted again
- [ ] Manage `state` --> probably set to `-1`.
- [ ] Migrate to MLJBase, MLJTuning, MLJEnsembles, MLJIteration
- [ ] Test `report_nodes` is correctly saved; maybe I could finish this first to have a built-in example.
- [ ] Further testing
- [ ] Drop support for 1.0? (as I think this is the case for MLJ in general: https://github.com/JuliaAI/MLJBase.jl/releases/tag/v0.19.4) @ablaom suggests 1.6
- [ ] I have changed the dependency management to having 2 `Project.toml` files, as I find it easier to stack the environments for development. I hope this is fine.
- [ ] Probably remove `kwargs...`?
- [ ] Modify the `Save` control to pass a custom `save` method to override the default `Serialization.save`
- [ ] Document somewhere that serializable parameters have to be part of the model definition
- [ ] Mark this package as deprecated?
Hope that helps, and happy to take remarks.