Storing post installation artifacts in offline mirror #50

Open: wants to merge 1 commit into base: master

UnrememberMe

Initial version of RFC for storing post installation artifacts in offline mirror.

Derived from part of the discussion from yarnpkg/yarn#393 with @bestander

@UnrememberMe changed the title from "Initial version of RFC" to "Storing post installation artifacts in offline mirror" on Feb 21, 2017

##Modification to `yarn` offline mirror structure and `yarn.lock`

- Store post installation artifacts under post-installation subdirectory when using yarn offline mirror. _resolved_ field should reflect this change as well.

Why does the resolved field need to change?

Author

We now need to store a path (post-install/foo.tar.gz#xxxx) instead of just a file name (foo.tar.gz#xxxx). Very minor difference, but I thought I should call it out.

Is the path configurable, i.e. is it ever going to change to something other than 'post-install'?

Author

It could be configurable via a yarn option or even an environment variable. The key feature is that we need some way in the stored file itself to tell us that it is a post-install artifact.

Imagine a project adding a new dependency with only the offline mirror available: the Yarn CLI must know which installation steps to skip. In my view, the only place we can store this information is in the file name or directory path.
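
As a minimal sketch of the idea above (not Yarn's actual implementation), a helper could read the marker directly from the `resolved` value. The `parseResolved` name, the `ResolvedEntry` shape, and the fixed `post-install` directory are assumptions made for illustration; only the `post-install/foo.tar.gz#xxxx` form comes from this thread.

```typescript
// Hypothetical default; the thread notes this could become configurable.
const POST_INSTALL_DIR = "post-install";

interface ResolvedEntry {
  fileName: string;               // file name inside the offline mirror
  hash: string | null;            // part after "#", if any
  isPostInstallArtifact: boolean; // true when stored under the post-install directory
}

// Splits a resolved value such as "post-install/foo.tar.gz#xxxx" into its parts.
function parseResolved(resolved: string, dir: string = POST_INSTALL_DIR): ResolvedEntry {
  const hashIndex = resolved.lastIndexOf("#");
  const path = hashIndex === -1 ? resolved : resolved.slice(0, hashIndex);
  const hash = hashIndex === -1 ? null : resolved.slice(hashIndex + 1);

  const prefix = dir + "/";
  const isPostInstallArtifact = path.slice(0, prefix.length) === prefix;
  const fileName = isPostInstallArtifact ? path.slice(prefix.length) : path;

  return { fileName, hash, isPostInstallArtifact };
}
```

With this shape, an installer could skip lifecycle scripts whenever `isPostInstallArtifact` is true and fall back to the normal flow otherwise.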

Author

On second thought, there is an alternative to storing the post-installation information in the file name or directory name. We could potentially add a file, say .post-install, to the stored artifact. In this case, we do not need to change the resolved field or the current structure of the offline mirror, which is a flat list.
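
A rough sketch of that alternative, assuming the archive's entry names have already been listed by some tar reader. The `hasPostInstallMarker` helper and the `package` root folder default are assumptions for this example; `.post-install` is the file name suggested above.

```typescript
// Marker file name taken from the suggestion above.
const MARKER_FILE = ".post-install";

// Returns true when the archive listing contains the marker file at the
// package root. `entries` is assumed to be a list of paths inside the archive,
// e.g. ["package/package.json", "package/.post-install"].
function hasPostInstallMarker(entries: string[], root: string = "package"): boolean {
  return entries.indexOf(root + "/" + MARKER_FILE) !== -1;
}
```

Because the marker lives inside the artifact, the mirror stays a flat list and `yarn.lock` entries keep their current form.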

Member

@bestander left a comment

This is a complex problem, but I don't think it can be solved via the offline mirror.

*How should this feature be introduced and taught to existing Yarn users?*
Explain the intended use case with illustrated work flow.

# Drawbacks
Member

  1. I think it might work for a subset of Node.js npm packages that don't read from or write to folders outside of the package.
    This won't be a generic solution for packages heavy on native code; we are working on https://github.com/jordwalke/esy to address that.

  2. Offline mirror is designed to be cross-platform because it caches things at the fetching phase.
    This feature will be platform-specific, and in some cases machine-specific (binaries sometimes store local paths), and it is a linking-phase cache.

Author

  1. Are there examples of packages that either read from or write to folders outside of the package, or that store absolute paths from the local machine? The explicit assumption in the RFC is that very few packages, if any, behave this way.

  2. I specifically avoided the platform dependency issue in an effort to limit the reach of this RFC. I guess this is a can of worms that I cannot avoid :-(.

    There are two main ways to deal with platform-specific code: storing precompiled binaries or compiling during installation. Prior art includes Python wheels (https://www.python.org/dev/peps/pep-0427/), Ruby Gems (http://guides.rubygems.org/specification-reference/#platform=), and Go (https://golang.org/src/go/build/doc.go). Python and Go store platform-dependent binaries in their packages, while Ruby recompiles during gem installation.

    Storing platform dependent post installation packages via a scheme similar to the one outlined in this RFC is my preferred choice.
    Pros:

    • Guaranteed consistent installation across machines with the same os / arch / node version combination
    • Compatible with NREs.
    • Adds no cost for package owners. The choice of which os / arch / node version combinations to support is made by the post-installation package maintainer, presumably someone who has those specific needs.
    • Possible to tar up the entire set of installed packages and copy it to other machines with the same platform. This means a single version can be tracked as a tar file across its life cycle, which can be a great feature for enterprises.

    Cons:

    • Matrix of os / arch / node version to support can explode. This is somewhat mitigated by the fact that maintainers can choose how big a matrix they want to support.
    • Will not work if there is machine-specific code, such as linking to an absolute path
    • Adds further complexity to Yarn (or a plugin).
  3. I chose to reuse the offline mirror because I didn't want to introduce another cache. If it is conceptually cleaner to have a separate cache for post-installation artifacts, that's a change we should make to this RFC.
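
To make the os / arch / node version matrix from point 2 concrete, here is a small sketch. The `Platform` shape, the `artifactKey` format, and the `supportMatrix` helper are all assumptions for illustration; the RFC does not specify a key format.

```typescript
// Hypothetical shape; in Node these values could come from
// process.platform, process.arch and process.version.
interface Platform {
  os: string;
  arch: string;
  nodeVersion: string;
}

// Builds a cache key for one post-installation artifact.
// The key layout is an assumption made for this sketch.
function artifactKey(name: string, version: string, p: Platform): string {
  return [name, version, p.os, p.arch, "node" + p.nodeVersion].join("-");
}

// Enumerates every os / arch / node combination a maintainer opts into.
function supportMatrix(oses: string[], arches: string[], nodeVersions: string[]): Platform[] {
  const matrix: Platform[] = [];
  for (const os of oses) {
    for (const arch of arches) {
      for (const nodeVersion of nodeVersions) {
        matrix.push({ os, arch, nodeVersion });
      }
    }
  }
  return matrix;
}
```

Three operating systems, two architectures and three Node versions already yield 18 artifacts per package version, which is the explosion listed under Cons; a maintainer limits it by passing shorter lists.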

Member

  1. Here is an example with node-gyp: "Parallel workers running install scripts can interfere" (yarn#1874).
  2. Yep, I know the pain, but we have to deal with it, as many projects share the same yarn.lock files and offline mirror .tgz files across all operating systems.
  3. I am pretty sure a postinstall cache should be independent from the offline mirror.


This observation leads to the assumption that most installation scripts only modify files and directories within its own folders. Consequently, we can store the post installation content as artifacts without worrying about inter-module dependencies.

##Modification to `yarn` offline mirror structure and `yarn.lock`
Member

The offline mirror kicks in at the fetch phase.
After the .tgz file is extracted into the global cache folder, the link phase starts.
During the link phase, files are copied from the cache into node_modules (taking hoisting into account), and then lifecycle scripts are executed, which modify some files in those node_modules folders.

You would have to generate a new .tgz file for each package folder that was modified by the lifecycle scripts phase, disable their lifecycle scripts, and then modify the yarn.lock file to point to the new .tgz file.

That could be quite hard to implement without bringing too much complexity into Yarn.
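
One part of the steps above, finding which package folders changed during the lifecycle scripts phase, could be sketched as a comparison of two snapshots. The `FileHashes` and `TreeSnapshot` types and the `packagesNeedingRepack` helper are hypothetical; how the hashes are collected from disk is left out, and none of this reflects Yarn internals.

```typescript
// Relative file path -> content hash, for one package folder.
type FileHashes = Record<string, string>;
// Package folder name -> its file hashes.
type TreeSnapshot = Record<string, FileHashes>;

// True when both listings contain the same files with the same hashes.
function sameFiles(a: FileHashes, b: FileHashes): boolean {
  const aKeys = Object.keys(a);
  if (aKeys.length !== Object.keys(b).length) return false;
  return aKeys.every((k) => b[k] === a[k]);
}

// Compares snapshots taken before and after lifecycle scripts ran and
// returns the packages whose folders were added to, removed from, or modified.
function packagesNeedingRepack(before: TreeSnapshot, after: TreeSnapshot): string[] {
  return Object.keys(after)
    .filter((pkg) => {
      const prev = before[pkg];
      const next = after[pkg];
      return prev === undefined || next === undefined || !sameFiles(prev, next);
    })
    .sort();
}
```

Only the packages returned here would need a new .tgz and an updated yarn.lock entry; everything else can keep pointing at the original download.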

Author

It sounds like your suggestion is to use a separate cache not related to offline mirror to store those artifacts?

Member

Yes, I think this will make the offline mirror cache too confusing.
The idea of the offline mirror cache is that it stores each file as it was downloaded from a remote repository; this RFC adds a lot of new conditions.

- Installation scripts may serve legitimate purposes in certain circumstances
- Requires significant efforts to educate node module writers
- Working on a per package basis and updating all dependent packages might take a long time for the necessary changes to propagate.

Member

The npm community is large and free to do anything, so it will be impossible to enforce any kind of behavior.

The right thing to do would be for community members to work with packages individually so that they can be installed through a mirror (sinopia-based mirrors have the same problem) and without Internet access: raise issues, send PRs, fork.

This may be a dumb question, but why do some npm packages need Internet access to be installed? Why can't they hold all needed information within the package itself (aside from defined dependencies)?

Phantomjs, for example, actually downloads its platform-specific binary upon npm module installation. The npm module is just a wrapper.

I suppose it could package up each target platform/architecture binary and only configure the intended one for that runtime.

Author

Although I agree that the right thing to do is to work with the package owner to remove network dependencies, the process has proven slow, and owners are sometimes unresponsive. We need to work not only with the owner of the package in question but, in some cases, with dependent packages and dependents of dependents as well.

Member

That is the default assumption - a package is released "as is" and I think it is an exception when a package author has time to support more use cases.

- Update to installation time downloads are ignored / require explicit action
Installation scripts tends to download the latest version of dependencies. A stored post installation artifacts will always have the same version of dependencies and thus potentially will not have the latest dependencies. To update such installation time downloaded dependencies, explicit actions from offline mirror maintainers will be required.

# Alternatives
Member

As I said above, this RFC goes beyond the concerns of Offline Mirror feature.

I think the problem may be solved by caching and sharing a built package in some way.
This may not work across platforms and machines; it depends on each package and how a project is built.
I would try:

  • disable lifecycle scripts for a package that needs Internet access (maybe have this setting in package.json)
  • before a package is installed from the Yarn cache, replace the cached content with the prebuilt content. Packing, sharing and replacing in the cache could be automated in some way by Yarn, a plugin, or a third-party script
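
The two steps above could be combined into a per-package install plan, sketched below. The `InstallStep` shape and `planInstall` helper are hypothetical, and `prebuilt` is assumed to map a package name to wherever its prebuilt content was shared; neither Yarn nor package.json defines these today.

```typescript
interface InstallStep {
  name: string;
  source: "prebuilt" | "cache"; // where the package content is taken from
  runLifecycleScripts: boolean; // disabled when prebuilt content is used
}

// For each package, prefer shared prebuilt content and skip its lifecycle
// scripts; otherwise install from the regular cache and run scripts as usual.
function planInstall(packages: string[], prebuilt: Record<string, string>): InstallStep[] {
  return packages.map((name): InstallStep => {
    const hasPrebuilt = Object.prototype.hasOwnProperty.call(prebuilt, name);
    return {
      name,
      source: hasPrebuilt ? "prebuilt" : "cache",
      runLifecycleScripts: !hasPrebuilt,
    };
  });
}
```

Keeping the decision in a plan like this leaves the offline mirror untouched, which matches the point that a postinstall cache should be independent from it.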

I think those that operate in a NRE would likely be less concerned about cross platform compatibility

> I think those that operate in a NRE would likely be less concerned about cross platform compatibility

I'm at Red Hat, working in NREs on multiple architectures.

@bestander
Member

I would look at this from another angle: some FB employees are working on a generic solution to bring proper native binary compilation to Yarn, see yarnpkg/yarn#480 (comment).
The idea is to stop running post-install scripts and have a layer on top of Yarn that is responsible for dealing with native builds and caches.
Once Yarn/esy integration gets into shape (hopefully within the month), we could come up with something lightweight for Node npm packages like this.

@BYK BYK self-assigned this Sep 29, 2017
@BYK BYK removed their assignment Oct 6, 2017
5 participants