Add clone functionality for module and containers #1111

Closed
wants to merge 27 commits into from

@benborder benborder commented May 23, 2023

Original Issue: #1110

closes #1110

Summary

Adds cloning functionality to modules and containers by adding a pure virtual clone() method to Module.

  • clone() performs a deep copy of parameters and modules, taking advantage of the underlying tensors' copy-on-write semantics, if the backend implements them, to minimise memory usage.
  • Every module and container must now implement clone(). This has been done for the core library and some of the simpler modules in pkg; the remaining modules throw a runtime error indicating that clone is unimplemented.
  • Core modules have been updated where necessary with appropriate copy and move constructors and assignment operators to perform a deep copy of Variable.
  • Core containers have been updated to use a new macro, FL_BASIC_CONTAINER_CLONING, where possible. It implements clone() as well as the appropriate copy and move constructors and assignment operators.
  • Users must be aware of and manage the lifetimes and cloning behaviour of their modules/containers. This means that users with custom or shared lifetime requirements should not use the FL_BASIC_CONTAINER_CLONING macro, but should implement their own clone() override, along with copy and move constructors and assignment operators, to achieve their desired behaviour.
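
As a rough illustration of the design summarised above, the following standalone sketch shows a pure virtual clone() on a base class together with one concrete override. It does not use Flashlight: `Module`, `Scale` and `params()` are simplified stand-ins invented for this example, and a plain `std::vector<float>` takes the place of Variable, so no copy-on-write behaviour is modelled.

```cpp
#include <memory>
#include <vector>

// Simplified stand-in for a module base class (not the Flashlight class).
class Module {
 public:
  virtual ~Module() = default;
  // Pure virtual: every concrete module must say how it is copied.
  virtual std::unique_ptr<Module> clone() const = 0;
  virtual std::vector<float>& params() = 0;
};

// Hypothetical concrete module that holds its parameters by value.
class Scale : public Module {
 public:
  explicit Scale(float s) : params_{s} {}

  std::unique_ptr<Module> clone() const override {
    // The implicit copy constructor copies the vector, so the clone owns
    // its own parameter storage.
    return std::make_unique<Scale>(*this);
  }

  std::vector<float>& params() override { return params_; }

 private:
  std::vector<float> params_;
};

// Returns true if mutating the clone leaves the original untouched.
bool cloneIsIndependent() {
  Scale original(2.0f);
  std::unique_ptr<Module> copy = original.clone();
  copy->params()[0] = 5.0f;
  return original.params()[0] == 2.0f && copy->params()[0] == 5.0f;
}
```

Because clone() is pure virtual in this sketch, a subclass that forgets to override it fails to compile when instantiated, which matches the "every module must implement clone()" requirement described above.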

Test Plan (required)

Added tests and ran them successfully locally. CI is also used.

@facebook-github-bot added the CLA Signed label May 23, 2023
@jacobkahn self-requested a review May 23, 2023 23:41
@jacobkahn added the enhancement label May 23, 2023

@jacobkahn jacobkahn left a comment

@benborder, this CRTP idea is a nice one. If I'm being honest, I think breaking the existing API is fine, and there's a way we could do this which would be pretty minimally invasive. tl;dr -- I think we can add a pure virtual clone() on the base Module/Container and implement things that way. Module already being pure means this is pretty simple as is.

Downstream users that use module but want its copy constructor can and should implement copy if they want this functionality. If they don't, they can raise an exception when updating. FL is still in a stage where we're not as concerned about keeping the API perfect -- we're still iterating and have the luxury of making decisions (like this one!). Keeping this as a method on the class avoids some additional inheritance complexity which in my opinion significantly complexifies the interface and type hierarchy.

Enables users to easily add cloning for standard Modules and Containers

A few simple ways to deal with this being optional:

  • users can make their types' copy constructors if they don't want to play by copy constructor rules
  • users can throw exceptions in copy methods
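
The "throw an exception" option can be sketched as follows. This is a standalone example under assumed names: `Module` is a simplified stand-in for the base class and `NotCloneable` is a hypothetical module that opts out of cloning.

```cpp
#include <memory>
#include <stdexcept>

// Simplified stand-in for a module base class (not the Flashlight class).
class Module {
 public:
  virtual ~Module() = default;
  virtual std::unique_ptr<Module> clone() const = 0;
};

// Hypothetical module whose author has not implemented cloning.
class NotCloneable : public Module {
 public:
  std::unique_ptr<Module> clone() const override {
    // Satisfies the interface but reports the missing feature at runtime.
    throw std::runtime_error("NotCloneable::clone is not implemented");
  }
};

// Returns true if calling clone() raises std::runtime_error.
bool cloneThrows() {
  NotCloneable m;
  try {
    (void)m.clone();
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

The trade-off is that the failure moves from compile time to run time, so callers only find out when they actually attempt the clone.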

Let me know what you think. I'll leave more specific code comments after we figure out direction.

@benborder

@jacobkahn The intention with this approach was to avoid any breaking changes while minimising the amount of code added/changed. Given breaking changes are an option, the approach you suggest is definitely a better one.

It would be good to get some clarification on the desired behaviour of Module and Container copy constructors. I'm not sure what use case there is for shallow copy semantics for Module and Container parameters and modules? So I'd be inclined to require their copy constructors perform a deep copy of parameters/modules, which I think is what you are suggesting?

This would mean simple modules like activation functions can continue to use the compiler generated copy constructor. Furthermore, if the copy constructor is assumed to perform a deep copy there would be no need for a copy() member function, only a clone(), which would minimise changes to the Module interface.

@jacobkahn

desired behaviour of Module and Container copy constructors

Agree -- let's think about this. My thoughts:

  • agree shallow copy via module copy constructors is a bit confusing. Copy-on-write makes sense for tensors and Variables; not modules, since they aren't mutated in the same way. Copy ctor for modules doing a deep copy makes a lot of sense.
  • re compiler generated copy ctors -- this can work, but one must be careful, as copying underlying params_ calls Variable copy ctor which shallow copies underlying tensors. That COW is by design and matters because of how autograd works, so I think the implementation of copy for modules should be explicit and possibly user defined.

A pure virtual copy on Module that's implemented by each subclass and used by the copy ctor could make this easy.
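
The concern about compiler-generated copy constructors can be illustrated with a small standalone sketch. `SharedVar` is an invented stand-in for a Variable whose copies refer to the same underlying buffer (modelled here with a `std::shared_ptr`, with no copy-on-write), and both module types are hypothetical.

```cpp
#include <memory>
#include <vector>

// Stand-in for a Variable: copying it copies the pointer, not the buffer.
struct SharedVar {
  std::shared_ptr<std::vector<float>> data;
};

// Relies on the compiler-generated copy constructor, so copies share w.data.
struct ImplicitCopyModule {
  SharedVar w{std::make_shared<std::vector<float>>(1, 1.0f)};
};

// Defines the copy constructor explicitly and allocates a fresh buffer.
struct ExplicitCopyModule {
  SharedVar w{std::make_shared<std::vector<float>>(1, 1.0f)};

  ExplicitCopyModule() = default;
  ExplicitCopyModule(const ExplicitCopyModule& other)
      : w{std::make_shared<std::vector<float>>(*other.w.data)} {}
};

// True if a write through the copy is visible in the original.
bool implicitCopySharesData() {
  ImplicitCopyModule a;
  ImplicitCopyModule b = a;
  (*b.w.data)[0] = 9.0f;
  return (*a.w.data)[0] == 9.0f;
}

// True if a write through the copy leaves the original unchanged.
bool explicitCopyIsIndependent() {
  ExplicitCopyModule a;
  ExplicitCopyModule b = a;
  (*b.w.data)[0] = 9.0f;
  return (*a.w.data)[0] == 1.0f;
}
```

In this toy model the implicit copy silently aliases the parameters, which is the kind of surprise an explicit, user-defined copy is meant to avoid.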

@benborder force-pushed the module_container_clone branch 3 times, most recently from adf4351 to 4c9cf15 May 25, 2023 14:19
@benborder

I generally agree. I think requiring users to implement their own clone() method is the best we can do to force them to think about how to correctly implement their copy constructors. It also means that if users want to implement a custom copy constructor that does not perform a deep copy, they can still provide a deep copy via the clone() method.

I've made changes to reflect this. ModuleTest builds and passes, but I'll wait until we agree on a direction before addressing the remaining build failures from clone() being pure virtual.

@jacobkahn jacobkahn left a comment

Looks like CI is failing because some classes are still missing clone() implementations -- flashlight/fl/contrib modules need this as well. Would be great if you could update flashlight/pkg too while you're at it. CI baselines for those coming soon.

@benborder changed the title from "Add Cloneable template classes to add clone functionality" to "Add clone functionality for module and containers" May 28, 2023
@benborder

@jacobkahn I went through and fixed all the compile issues. I added copy constructors and copy assignment operators where the implementation was simple enough or the module/container provided core functionality. For the rest, I added a runtime exception indicating that cloning is not implemented.

One thing to note: because copy constructors now perform a deep copy of a module's variables, existing code may not behave as it did before. For example, the ResidualFwd test in flashlight/fl/test/contrib/modules/ContribModuleTest.cpp shares a BatchNorm module between the resModule1 and resModule2 containers. Previously, even though a copy was taken when adding the BatchNorm module to the containers, the running mean/var variables were updated in forward passes for both containers (as the underlying tensors of the BatchNorm module's variables were shared between the two networks). With these changes, copying BatchNorm performs a deep copy of its variables. If the previous behaviour is desired, the module can be wrapped in a shared_ptr to avoid copies and share it between multiple modules/containers.

There is one potential issue I foresee with the current implementation of cloning, which is the handling of shared modules between multiple modules/containers. For example, in the ResidualFwd test, even though the BatchNorm layer is shared between resModule1 and resModule2, when performing a deep copy of either module their layers will no longer be shared. I think most occurrences are likely to be unintended bugs, but there may be some rare edge case scenarios where it is intended behaviour and as such users will need to handle it accordingly.
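
The sharing issue described above can be sketched in isolation. The types below are invented for this example and are much simpler than the real containers: `Layer` stands in for a module with running statistics, and `Container` holds its layers through `std::shared_ptr` so that two containers can refer to the same layer.

```cpp
#include <memory>
#include <vector>

// Stand-in for a layer with running statistics (e.g. a batch norm layer).
struct Layer {
  float runningMean = 0.0f;
};

// Hypothetical container that may share layers with other containers.
struct Container {
  std::vector<std::shared_ptr<Layer>> layers;

  // Deep copy: every layer is duplicated, so the clone shares nothing.
  Container clone() const {
    Container out;
    for (const auto& layer : layers) {
      out.layers.push_back(std::make_shared<Layer>(*layer));
    }
    return out;
  }
};

// True if an update made through one container is seen by the other.
bool sharedLayerSeesUpdates() {
  auto bn = std::make_shared<Layer>();
  Container a, b;
  a.layers.push_back(bn);
  b.layers.push_back(bn);
  a.layers[0]->runningMean = 1.0f;
  return b.layers[0]->runningMean == 1.0f;
}

// True if cloning one container detaches it from the shared layer.
bool cloneBreaksSharing() {
  auto bn = std::make_shared<Layer>();
  Container a, b;
  a.layers.push_back(bn);
  b.layers.push_back(bn);
  Container c = a.clone();
  c.layers[0]->runningMean = 2.0f;
  return c.layers[0] != a.layers[0] && b.layers[0]->runningMean == 0.0f;
}
```

A clone that needs to preserve sharing would have to copy the pointers for the shared layers instead of the layers themselves, which is the sort of custom behaviour left to the user.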

@jacobkahn

@benborder

resnet, prev behavior

Prev behavior would be desired. shared_ptr should be used in that case where the copy constructors of modules are used ambiguously.

Any existing user code which uses the weird behavior should be changed anyways, as there's inherent ambiguity in the old system. If you're sharing module state very explicitly, you should use pointers or references, else use the copy ctor.

Two other things as discussed offline:

  • we can and should just return unique_ptr<Module> from clone(). If the user wants to share the resulting cloned module, they can write std::shared_ptr<Module>(myModule.clone().release()). This maintains all of the desired flexibility at negligible overhead cost.
  • I'm now coming down on the side of it not being desirable for tensors to be deep copied in modules. Flashlight has always firmly supported copy-on-write with Tensors. The semantics for copying modules, in my opinion, should be as follows:
    • Copying a module deep copies its Variable parameters
    • Deep copying a Variable simply calls the copy constructor of its underlying Tensor rather than pointing to identical tensor data per sharedData/sharedGrad and friends.
    • Underlying tensor library implementations can track whether or not tensors need to be deep copied based on the operations that will be performed on them. For example: if a module is being copied but its parameters are only being read from, the tensors underlying its variables don't need to be deep copied and can be read from the same data.
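
A minimal sketch of promoting the result of clone() to shared ownership, assuming simplified stand-in types (`Module`, `Relu`) and a hypothetical helper named `cloneShared`. Besides the release() form mentioned above, the standard library also lets a `std::unique_ptr` rvalue convert directly to a `std::shared_ptr`, which is what this example relies on.

```cpp
#include <memory>

// Simplified stand-in for a module base class (not the Flashlight class).
struct Module {
  virtual ~Module() = default;
  virtual std::unique_ptr<Module> clone() const = 0;
};

// Hypothetical stateless module.
struct Relu : Module {
  std::unique_ptr<Module> clone() const override {
    return std::make_unique<Relu>(*this);
  }
};

// Hypothetical helper: clone a module and hand back shared ownership.
std::shared_ptr<Module> cloneShared(const Module& m) {
  // shared_ptr has a converting constructor from a unique_ptr rvalue,
  // so ownership is transferred without calling release().
  return m.clone();
}
```

Returning `unique_ptr` keeps the cheaper, exclusive-ownership case as the default while leaving the shared case one conversion away.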

@benborder

After an offline discussion with @jacobkahn about whether Modules should fully deep copy their parameters, circumventing COW semantics, or deep copy using COW semantics where the backend implements them, and after identifying some potential edge-case issues and resolving some misunderstandings on my part, we agreed:

  • Whilst cloned modules/containers that are using tensor backends with COW semantics are not thread safe, this can either be addressed by the user by adding synchronisation, or resolved in the future by providing functionality to allow a full deep copy or even by some other means.
  • It will be up to users to correctly manage the lifetimes and cloning behaviour of their modules/containers and be aware of any consequences of cloning a module/container. This means if users have any custom or shared lifetime requirements they should not use the FL_BASIC_CONTAINER_CLONING macro, but implement their own clone() override method as well as copy, assignment and move constructors to achieve their desired behaviour.

The following changes were made:

  • Return unique_ptr for clone() instead of shared_ptr. unique_ptr can be promoted to a shared_ptr and a convenience function was added to Container to facilitate this.
  • Added a copy method to Variable with optional forced deep copy to circumvent COW of the underlying tensor if applicable.
  • Added missing move constructors.
  • Ensure train is correctly set in Module.
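
To make the "copy with optional forced deep copy" idea concrete, here is a toy copy-on-write buffer. It is only a sketch under stated assumptions: `CowTensor`, `sharesWith` and the `copy(bool deep)` signature are invented for illustration and are not the Variable or Tensor API from this PR.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy copy-on-write buffer standing in for a backend tensor.
class CowTensor {
 public:
  explicit CowTensor(std::vector<float> values)
      : data_(std::make_shared<std::vector<float>>(std::move(values))) {}

  float get(std::size_t i) const { return (*data_)[i]; }

  void set(std::size_t i, float x) {
    // Detach from other owners before the first write (the "copy" in COW).
    // This check is not synchronised, so it is not thread safe.
    if (data_.use_count() > 1) {
      data_ = std::make_shared<std::vector<float>>(*data_);
    }
    (*data_)[i] = x;
  }

  bool sharesWith(const CowTensor& other) const {
    return data_ == other.data_;
  }

  // Hypothetical copy helper: deep = true duplicates the buffer at once,
  // deep = false defers the duplication until one side writes.
  CowTensor copy(bool deep = false) const {
    CowTensor out(*this);
    if (deep) {
      out.data_ = std::make_shared<std::vector<float>>(*data_);
    }
    return out;
  }

 private:
  std::shared_ptr<std::vector<float>> data_;
};
```

The unsynchronised `use_count()` check in `set` is a small-scale version of the thread-safety caveat noted above: a lazy copy is cheap, but concurrent writers need either external synchronisation or the forced deep copy.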

@facebook-github-bot

@jacobkahn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jacobkahn jacobkahn left a comment

Leaving some preliminary comments first — will in parallel start reviewing everything else.

@jacobkahn jacobkahn left a comment

It might make sense to add a small test fixture to test copying basic modules (make sure behavior is the same; this can also serve as a chance to test that parameters aren't shared) as well as a test for copying a basic compositional Container. Would you be able to add those in this PR as well?
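
The kind of check requested here could look roughly like the following. This is a standalone sketch with invented types (`Module`, `Scale`, `Sequential`) and plain boolean helpers; it is not the test that was added to ModuleTest.cpp, and it does not use the repository's test framework.

```cpp
#include <memory>
#include <vector>

// Simplified stand-in for a module base class (not the Flashlight class).
struct Module {
  virtual ~Module() = default;
  virtual float forward(float x) const = 0;
  virtual std::unique_ptr<Module> clone() const = 0;
};

// Hypothetical basic module with a single parameter.
struct Scale : Module {
  float w;
  explicit Scale(float weight) : w(weight) {}
  float forward(float x) const override { return w * x; }
  std::unique_ptr<Module> clone() const override {
    return std::make_unique<Scale>(*this);
  }
};

// Hypothetical compositional container that clones its children on copy.
struct Sequential : Module {
  std::vector<std::unique_ptr<Module>> mods;

  Sequential() = default;
  Sequential(const Sequential& other) {
    for (const auto& m : other.mods) mods.push_back(m->clone());
  }

  float forward(float x) const override {
    for (const auto& m : mods) x = m->forward(x);
    return x;
  }
  std::unique_ptr<Module> clone() const override {
    return std::make_unique<Sequential>(*this);
  }
};

// Builds the container used by both checks: Scale(2) followed by Scale(3).
Sequential makeNet() {
  Sequential s;
  s.mods.push_back(std::make_unique<Scale>(2.0f));
  s.mods.push_back(std::make_unique<Scale>(3.0f));
  return s;
}

// Check 1: the clone computes the same output as the original.
bool cloneHasSameBehaviour() {
  Sequential s = makeNet();
  auto c = s.clone();
  return c->forward(1.0f) == s.forward(1.0f);
}

// Check 2: changing the original's parameter does not affect the clone.
bool cloneDoesNotShareParams() {
  Sequential s = makeNet();
  auto c = s.clone();
  static_cast<Scale&>(*s.mods[0]).w = 10.0f;
  return c->forward(1.0f) == 6.0f && s.forward(1.0f) == 30.0f;
}
```

The two helpers correspond to the two properties mentioned in the comment: unchanged behaviour after copying, and no shared parameters between the original and the copy.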

@facebook-github-bot

@benborder has updated the pull request. You must reimport the pull request before landing.


@jacobkahn jacobkahn left a comment

Final stretch! Just a few more comments. Thanks for bearing with me on this. I'll rebase this branch shortly.

@jacobkahn jacobkahn left a comment

🎉 I think we're ready to merge this after

  • rebasing
  • CI still being green enough
  • double checking any new non-class functions are explicitly exported via FL_API (I don't think there are any)

Cheers to sticking with this PR -- this makes so many semantics clearer and behaviors more explicit. A huge step towards a less bug-prone set of interfaces on modules, and sensible copy semantics throughout FL!

@benborder

Awesome!

Agreed, it's a solid set of improvements. Glad we can get it over the line!

@jacobkahn jacobkahn left a comment

A few final lints

@facebook-github-bot

@jacobkahn merged this pull request in f354e7f.

Development

Successfully merging this pull request may close these issues.

Add Module and Container deep copy/cloning
3 participants