Skip to content
This repository was archived by the owner on Apr 7, 2022. It is now read-only.

Conversation

@olivierlabayle
Copy link
Collaborator

@olivierlabayle olivierlabayle commented Jan 11, 2022

The goal is to address: #15

This is a proposal for MLJSerialization. There are 4 high level functions:

  • serializable: makes a copy of a machine that is serializable, ie removes data and make fitresults serializable
  • restore!: in place modification of a machine by restoration of the fitresults
  • save: combines serializable with the built-in serialize method
  • machine: combines the build-in deserialize with restore!

So a user can either use the end-to-end provided save and machine or use the Serialization module of their choice. Happy to discuss the change which unfortunately is almost a full rewrite and will be most probably breaking. However I have just overloaded the MLJModelInterface.save method so there is no change of logic.

@edit

Progress tracking:

  • Set of generic questions on machines. I find it difficult to understand the organization/separation of concernes of the datastructures in machines (cache, report, fitresult). A few subquestions:

    • Fields initialization: many fields start undefined
    • Fields Types: various types can be found for each field
    • Fields concern. What is the concern of what field? For instance I wasn't expecting to see a machine in the cache of a TunedModel --> should be dealt with.
  • Currently, there is a problem with the filename used by XGBoostRegressor that will be used multiple times in a stack for instance and then erased. Is it really necessary to pass the filename to save for models as it doesn't seem to be of any use in this example? As a user I would not expect to have say 20 files saved if I save a Stack for instance. I think the IterationControl was used for that purpose but I haven't looked into it yet neither do I know if it can be used easily with my proposal. @ablaom has posted Drop filename arg from save(filename, model, ...) MLJBase.jl#724

  • machine should accept new arguments to be fitted again

  • Manage state --> probably set to -1.

  • migrate to MLJBase, MLJTuning, MLJEnsembles, MLJIteration

  • Test report_nodes is correctly saved, maybe I could finish this first to have a built in example.

  • Further testing

    • Unsupervised Model
    • maybe check file size
  • Drop support for 1.0? (as I think this is the case for MLJ in general : https://github.com/JuliaAI/MLJBase.jl/releases/tag/v0.19.4 @ablaom suggest 1.6

  • I have changed the dependency management to having 2 Project.toml files as I find it easier to stack the environments for development. I hope this is fine.

  • Probably remove kwargs...?

  • Modify Save control to pass custom save method to override default Serialization.save

  • Document somewhere that serializable parameters have to be part of the model def

  • Mark this package as deprecated ?

Hope that helps and happy to take remarks

@codecov-commenter
Copy link

Codecov Report

Merging #17 (7766135) into dev (161e27d) will decrease coverage by 2.10%.
The diff coverage is 98.68%.

Impacted file tree graph

@@             Coverage Diff             @@
##               dev      #17      +/-   ##
===========================================
- Coverage   100.00%   97.89%   -2.11%     
===========================================
  Files            2        2              
  Lines           32       95      +63     
===========================================
+ Hits            32       93      +61     
- Misses           0        2       +2     
Impacted Files Coverage Δ
src/controls.jl 100.00% <ø> (ø)
src/machines.jl 97.56% <98.68%> (-2.44%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 161e27d...7766135. Read the comment docs.

import IterationControl

export serializable, restore!, save, machine

Copy link
Member

@ablaom ablaom Jan 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently save is not exported. Are you wanting to export it for convenience?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I was actually wondering why it wasn't?

end

setreport!(copymach, mach.report)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor detail: Maybe we want to set the value of copymach.state=-1 to tag the machine as "serialisable" so we can throw an error in the new situation that user tries predict(mach, ....) when mach has been stripped of data (logic that will need adding at MLJBase/src/operations.jl).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that depends on how faithful you want the state to be and what is the current logic behind it. Let's imagine a situation where data comes in every day and we update our machine with it as it comes and serialize it back to disk. I suppose we would like to record how many times we have updated our machine in the state? Would that be captured at the moment? Setting it ot -1, will necessarily erase this information.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, currently MLJ does not support incremental learning, that is "update machine with new data" (eg, gradient descent models) - only update with new model parameters or new views of the same data. If you get new data, you need to rebind your model with all the data you have and retrain from scratch. There was some plan to support incremental learning but in that case we would probably introduce a new field to keep track of the number of "data injections".

Presently the state variable is used to determine: (i) has the machine been trained (state > 0) and (ii) whether a machine upstream of a machine mach in a learning network has been trained since the last time mach was trained (via the old_upstream_state field, which is a snapshot of the state of every upstream machine).

On the other hand, if you have a reliable way to detect if a machine has been serialised, we can go with that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with setting the state to -1 is that when restored, the machine cannot be used again to predict unless we set the state back again to 1 in restore which seems a bit counterproductive. Any thought?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. You are right. My first thought is to change the predict logic to allow -1.

@ablaom ablaom marked this pull request as draft January 18, 2022 02:03
@ablaom
Copy link
Member

ablaom commented Jan 18, 2022

Let's also clear the old_rows field. See https://discourse.julialang.org/t/what-does-mlj-save-really-save/74606/8?u=ablaom

@ablaom
Copy link
Member

ablaom commented Jan 20, 2022

  • Currently, there is a problem with the filename used by XGBoostRegressor that will be used multiple times in a stack for instance and then erased. Is it really necessary to pass the filename to save for models as it doesn't seem to be of any use in this example? As a user I would not expect to have say 20 files saved if I save a Stack for instance. I think the IterationControl was used for that purpose but I haven't looked into it yet neither do I know if it can be used easily with my proposal.

During our off-line discussion, I think I missed the essential point here: If we leave the filename argment in save(filename, ::Model, ...) then serializable needs filename as an argument, which is silly. So I agree that we make the breaking change (in MLJBase) to drop the filename argument, and to find a workaround for XGBoost.

@olivierlabayle
Copy link
Collaborator Author

@ablaom I think I am not too far from a working solution and have a few more questions (I am also available for a chat if you want to discuss more):

  • kwargs...: I also asked the question here. Is this really necessary?
  • IterationControl: I have removed the IterationControl that now seems unecesssary here unless I missed something?
  • Testing: I see there is a dependency on the DecisionTree package but you also seem to have overriden it? I am also not using the _dummymodel at the moment. I am planning on adding a testing deps on MLJModels though to also test for a few transformers. MLJModels seems like a fair dependency to have since it is maintained by the MLJ team if I am not mistaken.
  • Code Organisation: I am pondering the requirement for the MLJSerialization package. If we want to keep it, I think we will need to have an "unnatural" dependency on MLJTuning.
    • I can't move code to either MLJTuning or MLJBase since their save function depends on serializable that now lives here.
    • Moving serializable, restore!, save and machine to MLJBase would allow to move the rest of the functions to their respective packages. Plus it would also make it easier to reuse MLJBase.replace.
    • I know you are trying to reduce the size of MLJBase so there may be more things to consider.

@ablaom
Copy link
Member

ablaom commented Jan 27, 2022

@ablaom I think I am not too far from a working solution and have a few more questions (I am also available for a chat if you want to discuss more):

Thanks for this progress. I think another meeting before proceeding too much further is good idea, but I will make some quick responses to help you prepare:

  • kwargs...: I also asked the question here. Is this really necessary?

This previously existed to enable user to pass options to a model-specific serialiser, and to the default JLSO serialiser. The latter use-case is not relevant under the new design. I don't know of a current use-case for model-specific serializer options, so if they get in the way, I guess we can get rid of them. (I didn't really understand the conflict, but we can discuss.)

  • IterationControl: I have removed the IterationControl that now seems unecesssary here unless I missed something?

Sounds like your throwing the baby out with the bath water. Iteration control still needs a save control, so iterative models can create periodic snapshots of the model. The control may need a breaking redesign, however. In particular, we may need to add parameters saying what serialiser is going to be used, in case the user is not happy with the default.

  • Testing: I see there is a dependency on the DecisionTree package but you also seem to have overriden it? I am also not using the _dummymodel at the moment. I am planning on adding a testing deps on MLJModels though to also test for a few transformers. MLJModels seems like a fair dependency to have since it is maintained by the MLJ team if I am not mistaken.

DecisionTree is used in current tests, but you can use something different if you prefer. Note that test/_models contains a local version of the MLJ <-> DecisionTreeInterface (essentially what is now in MLJDecisionTreeInterface) but this may well be an anachronism. (MLJModels used to contain this interface but also depended on MLJBase, so was not good as a test dependency. MLJModels no longer depends on MLJBase.) I just checked MLJModels - it has a lot of dependencies, but probably most are already dependencies of MLJBase, so it's probably okay to use. In the future it is likely that the transformer live in a separate package which might be a little more lightweight.

_dummymodel is needed to test the control for MLJIteration; I think we'll still need it.

  • Code Organisation: I am pondering the requirement for the MLJSerialization package. If we want to keep it, I think we will need to have an "unnatural" dependency on MLJTuning.

    • I can't move code to either MLJTuning or MLJBase since their save function depends on serializable that now lives here.
    • Moving serializable, restore!, save and machine to MLJBase would allow to move the rest of the functions to their respective packages. Plus it would also make it easier to reuse MLJBase.replace.
    • I know you are trying to reduce the size of MLJBase so there may be more things to consider.

The original reason for shifting this out was that JLSO is (or was) quite heavy, slowing down loading and pre-compilation. Under the new design, we are dumping it, so moving this back might make sense.

@olivierlabayle olivierlabayle closed this by deleting the head repository Mar 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants