
WIP: Initial proof-of-concept to have a static cache #162

Closed

Conversation

@AndrewFarley (Contributor) commented Mar 14, 2018

WARNING: Please do not merge as-is
Fixes #157 (filed by me)

This is a proof of concept that will probably need a bit of love. It creates a cache of the statically compiled pip output and, based on the MD5 hash of the generated requirements.txt, automatically re-uses the last compiled output without ever entering Docker. Our build times dropped from 2-3 minutes to a few seconds.

Implementation Details

Items from .serverless/requirements are placed into a .serverless-python-requirements-static-cache folder at the project root, along with a file storing the MD5 sum of the requirements.txt that was used to build that static cache. Since the .serverless folder is completely destroyed before context is passed to the serverless-python-requirements plugin, the requirements have to be stored somewhere else.
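
For context, here's a minimal sketch of the cache flow described above. The cache folder name matches this PR, but the helper names, the `requirements.md5` file name, and the exact layout are illustrative assumptions rather than the actual code in this branch.

```js
// Sketch only: MD5-keyed static cache, roughly as described above.
// Helper names and the requirements.md5 file name are assumptions.
const crypto = require('crypto');
const fs = require('fs-extra');
const path = require('path');

const cacheDir = path.join(process.cwd(), '.serverless-python-requirements-static-cache');
const md5File = path.join(cacheDir, 'requirements.md5');

function requirementsMd5(requirementsPath) {
  // Hash the generated requirements.txt so the cache is keyed on its contents.
  const contents = fs.readFileSync(requirementsPath);
  return crypto.createHash('md5').update(contents).digest('hex');
}

function cacheIsValid(requirementsPath) {
  // The cache is reusable only if it exists and was built from the same requirements.txt.
  if (!fs.existsSync(md5File)) return false;
  return fs.readFileSync(md5File, 'utf8').trim() === requirementsMd5(requirementsPath);
}

function saveCache(requirementsPath, compiledDir) {
  // Copy .serverless/requirements into the project-root cache and record its MD5.
  fs.emptyDirSync(cacheDir);
  fs.copySync(compiledDir, path.join(cacheDir, 'requirements'));
  fs.writeFileSync(md5File, requirementsMd5(requirementsPath));
}

function restoreCache(compiledDir) {
  // Copy the cached output back into .serverless/requirements, skipping pip/Docker entirely.
  fs.copySync(path.join(cacheDir, 'requirements'), compiledDir);
}
```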

Problems / Suggestions / Ideas / Input Needed

  • Note: I had this implemented on a 3.x version of this plugin, but I figured I should rework my changes for the latest code. Let me know if this was wrong and I can realign.
  • Need to test later: This code "should" also work with single-function deploys, but I can't test that because of #161 (Fatal error when trying to deploy single-functions), so I'm assuming it works for now.
  • Feature: This should probably be behind a flag that is disabled by default, so as not to disrupt the existing workflow and usage of this plugin. Possibly a new flag staticCache: true.
  • Idea: Another option might be to see if there's a lifecycle hook in serverless that fires "before" the .serverless folder is removed, so I could "save" the .serverless/requirements folder from there and re-use it on the next run.
  • BUG: This becomes irrelevant if we re-use the .serverless/requirements folder (above) or use a system temp path for the cache (see the sketch below). As this PR works now, though, I can't get serverless to ignore the new cache folder when packaging the zip file: even if I add .serverless-python-requirements-static-cache to the exclude list in serverless.yml, it is still included. This is outside my knowledge of Serverless and this plugin, so I'm hoping someone can help; it's the only major bug I can see. Using another location for the static cache (such as OS-dependent temp storage) would also fix this problem.
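
Regarding the last point, a tiny sketch of what an OS-dependent temp location could look like (the serverless-python-requirements sub-folder name is an assumption, not something this PR implements yet):

```js
// Sketch only: keep the static cache outside the project root so the
// packaging step never sees it and no exclude rule is needed.
const os = require('os');
const path = require('path');

function staticCacheRoot() {
  // e.g. /tmp/serverless-python-requirements on Linux/macOS
  return path.join(os.tmpdir(), 'serverless-python-requirements');
}
```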

Sample Output

In some brief tests, I've noticed that the larger the project and/or the more compilation required, the greater the savings (which makes sense). The time it takes now is mostly just zipping everything up.

$ time serverless package
Serverless: Generated requirements of requirements.txt in .serverless...
Serverless: Installing requirements from .serverless/requirements.txt ...
Serverless: Docker Image: lambci/lambda:build-python3.6
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Injecting required Python packages to package...

real	1m53.788s

$ time serverless package
Serverless: Generated requirements of requirements.txt in .serverless...
Serverless: Using static cache...
Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Injecting required Python packages to package...

real	0m20.776s

Questions / Thoughts / Input / Feedback welcome. Feel free to take this idea and run with it, re-implementing it as you see fit, or give me enough input and I can try to re-align. I'm (clearly) not really a JavaScript programmer, though. I need this for a few of my development teams to support faster development cycles. :)

@dschep (Contributor) commented Mar 14, 2018

I'll take a look at this next week.

@kichik (Contributor) commented Mar 14, 2018

You can probably speed it up even more by having injectAllRequirements() inject your new folder instead of .serverless/requirements. This way there is no need to copy and delete all the requirements (which can be quite a lot of files) all the time.

@AndrewFarley (Contributor, Author) commented:

I'll take a stab later this week at putting this in the OS temp directory (other feedback also recommended this, so the cache can be shared between stacks that use the same requirements), unless there's some reason I shouldn't. Then I'll also try to modify injectAllRequirements() to pull from that folder to speed it up even more.

@Valian commented Mar 27, 2018

I'm also very interested in that feature, because reinstalling packages isn't really necessary.

Keep in mind that the temp directory is cleared after a system restart, at least on Linux. Programs usually use the user's home directory, e.g. ~/.cache/serverless.

If I can help somehow, just say so; I have experience in both JS and Python.

@AndrewFarley (Contributor, Author) commented:

@Valian Glad to see some shared interest... that user home directory approach works on Mac/Linux, but not Windows (or does it?). Also, I personally dislike storing things in a user's home folder without their knowledge; I think the system temp directory is more appropriate since it is cleared on reboot. Anyway, I emailed you; let's have a chat and maybe you can help me nail this down. Cheers!

@AndrewFarley (Contributor, Author) commented Mar 27, 2018

@dschep Could I get your thoughts on the points I've written up below? I'm going to work on making this merge request actually happen, because I'm seeing a lot of interest from the people I'm sharing it with, but there are going to be a few bigger changes, and I'd like your thoughts on them first so I don't waste my time on something that will never be merged in case anything seems fundamentally flawed to you.

  • I'll be calling this feature a "static cache". I don't know what the heck else to call it... "compiled-pip-cache"? I have no opinion on the matter; if you have a better name, please help. :P So, assuming this is the name...
  • Completely deprecating the .serverless/requirements folder and instead using a folder in the system temp directory, e.g. os.tmpdir()/serverless-python-requirements, because the <project_dir>/.serverless folder is removed on every deploy and can't be used for caching purposes.
    • I'm recommending this rather than trying to keep the existing code path for when the static cache isn't used, so as to keep the code simpler. Thoughts?
  • Within this tempdir will be MD5-named folders, where each name is the MD5 of the requirements.txt file that produced those compiled requirements.
  • Creation of that tempdir and the MD5 folder within it (if necessary) will happen before pip requirements generation; the volume mounted in Docker will be changed to this MD5 tempdir; and requirements generation will be skipped if the folder already exists, is populated, and the user wants to use cached pip output.
  • Modification of the zip injection to inject from this tempdir (whether it was just populated or re-used from a previous cached run).
  • Modification of the main README.md to include information on this "static cache": how to use it, what it does, and a suggestion to use staticCacheLocation to put the data in your home folder (e.g. ~/.cache/serverless).
  • Addition of a few configuration variables
Variable Name | Value | Description
--- | --- | ---
useStaticCache | `false/true` | Default: false. This enables or disables the caching completely. I would like to make this default to true after a few releases, once more people have used/tested this feature.
staticCacheLocation | `<path>` | Default: os.tmpdir()/serverless-python-requirements. This lets the user specify where the static cache should be stored. Useful for people who want to persist their Python requirements cache beyond a reboot, and/or for CI/CD build servers that use shared storage to let multiple build machines leverage a cache and speed builds up.
staticCacheMaxVersions | `<integer>` | Default: 0. This limits the number of "versions"/caches kept in the cache folder. Setting it to 0 means no maximum, which makes sense when the cache lives in os.tmpdir(), but users with a non-temp location may want to limit the number of caches present.

  • Note: edited to remove staticCacheFlush; it was an afterthought.
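
To make the proposal concrete, here's a rough sketch of how the options above could be resolved and how old cache versions might be pruned. The option names match the table, but the code itself is illustrative and assumes the per-MD5 folder layout described above, not any actual implementation.

```js
// Illustrative only: resolve the proposed options and prune old cache versions.
const os = require('os');
const path = require('path');
const fs = require('fs-extra');

function resolveCacheOptions(custom) {
  // custom = the plugin's options from serverless.yml (names per the table above)
  return {
    useStaticCache: custom.useStaticCache || false,
    staticCacheLocation:
      custom.staticCacheLocation || path.join(os.tmpdir(), 'serverless-python-requirements'),
    staticCacheMaxVersions: custom.staticCacheMaxVersions || 0,
  };
}

function cacheDirForHash(options, md5Hash) {
  // One MD5-named folder per distinct requirements.txt.
  return path.join(options.staticCacheLocation, md5Hash);
}

function pruneOldVersions(options) {
  // Keep at most staticCacheMaxVersions entries (0 = unlimited), dropping oldest first.
  if (!options.staticCacheMaxVersions) return;
  const entries = fs
    .readdirSync(options.staticCacheLocation)
    .map((name) => {
      const full = path.join(options.staticCacheLocation, name);
      return { full, mtime: fs.statSync(full).mtimeMs };
    })
    .sort((a, b) => a.mtime - b.mtime);
  while (entries.length > options.staticCacheMaxVersions) {
    fs.removeSync(entries.shift().full);
  }
}
```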

@Valian commented Mar 27, 2018

For me, all these points are valid and welcome, maybe except staticCacheFlush. When both useStaticCache and staticCacheFlush are on, it's almost equivalent to useStaticCache: false. Flushing is a one-time action and as such shouldn't be driven by the configuration file; users can clear the cache on their own.

@dschep (Contributor) commented Mar 28, 2018

Your writeup sounds good for the most part, @AndrewFarley. I agree with @Valian's point re staticCacheFlush. Also, if there's a Node equivalent to the appdirs Python module, I'd prefer that be used instead of tmpdir, because then you can use directories explicitly made for caches.
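
For reference, this is roughly what using the appdirectory module (a Node port of Python's appdirs) looks like; the appName here is just an example value:

```js
// Sketch: a platform-appropriate cache directory via the appdirectory module,
// instead of os.tmpdir(), which is wiped on reboot.
const AppDirectory = require('appdirectory');

const dirs = new AppDirectory({ appName: 'serverless-python-requirements' });

// Roughly ~/.cache/serverless-python-requirements on Linux and
// ~/Library/Caches/serverless-python-requirements on macOS.
const cacheLocation = dirs.userCache();
console.log(cacheLocation);
```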

@AndrewFarley (Contributor, Author) commented:

Thanks @dschep, I'll keep you guys posted... :)

@AndrewFarley (Contributor, Author) commented:

FYI: opening a new PR to re-review the proposed code.

dschep pushed a commit that referenced this pull request Sep 8, 2018
Fixes #157 (filed by me)

## What this does
* Makes pip's download caching a "first-class citizen" as an option directly in this plugin's options.  This will "fix" a few attempted workarounds for using the pip cache, specifically in Docker, and will simplify the feature (the user simply has to enable it rather than specify a folder).  In a future MR, I'd highly suggest enabling this by default.
* Second, it adds a new type of caching called "static caching", which caches the outputs of this plugin.  This greatly speeds up every single build as long as the feature is enabled and your requirements.txt file does not change.  In a future MR, I'd highly suggest enabling this by default as well.
* Both the pip download cache and the static cache are shared between any projects of the same user through an [appdir](https://www.npmjs.com/package/appdirectory) cache folder when packaging your service.  This _especially_ helps projects that heavily use Docker (Win/Mac) for deployments or development, projects with pip modules that need to compile every time, and projects with long or complex requirements.txt files, and it also helps if you use the same requirements.txt across multiple projects (common in team environments).

## Implementation details
* When either cache is enabled, this plugin now caches those requirements (download or static) to an "appdir" cache folder (per the [appdirectory](https://www.npmjs.com/package/appdirectory) node module).
* When this feature is NOT enabled, nothing changes
* Injection happens directly from the new cached requirements directory via a symlink created in the right place in `.serverless` or `.serverless/functionname` if deploying individually.
* As mentioned above, when the static cache is enabled a symlink pointing to it is created in the `.serverless` folder, so you still "know" where your cache is (for both individually and non-individually packaged functions).
* The requirements.txt "generator" was improved to remove comments and empty lines and to sort the list of items before using it (or checking its MD5 sum).  This allows more actual MD5 matches between projects when requirements files differ only in comments and the like (a rough sketch of this normalization follows the options table below).
* A new command was added to the CLI to flush the download/static cache, called cleanCache and invokable with `serverless requirements cleanCache`.  This clears everything, including both the download and static caches.
* A handful of new tests were created for various edge conditions I found while doing this refactoring.  Some were based on bugs other people hit while using this plugin with certain combinations of options, and some are not directly related to this merge's intent, but they're part of the same stream of work.  Sorry the tests take a lot longer to run now that there are many more of them.
* A UID bug fix related to Docker + pip was implemented (seen in a few other issues), from @cgrimal
* The following new configurable custom options were added to this plugin...

Variable Name | Value | Description
--- | --- | ---
useStaticCache | `false/true` | Default: false.  This enables or disables the static cache.  After some testing I would like to make this default to true, as it will greatly help everyone and there's no reason not to enable it; defaulting to true may also help weed out issues faster.  I'll gladly step up to quickly fix any bugs people hit with it, since I'm now well accustomed to the code.
useDownloadCache | `false/true` | Default: false.  This enables or disables the pip download cache.  This was previously the "example" code using a pipEnvExtraCmd to specify a local folder to cache downloads to.  It does not require a cache location to be set; if not specified, it will use an appdirs.usercache() location.
cacheLocation | `<path>` | Default: [appdirectory](https://www.npmjs.com/package/appdirectory).userCache(appName: serverless-python-requirements).  This lets the user specify where the caches (both static and download) should be stored.  Useful for advanced setups such as a cache shared globally between users, or CI/CD build servers on shared storage that let multiple build machines leverage a cache and speed builds up.  For example, mount a shared NFS store on all your CI/CD runners at `/mnt/shared` and set this value to `/mnt/shared/sls-py-cache`.
staticCacheMaxVersions | `<integer>` | Default: 0.  This restricts the maximum number of caches in the cache folder.  Setting it to 0 means no maximum.  Useful for build/CI/CD machines with limited disk space that don't want to (potentially) cache hundreds or thousands of versions, although I would be disturbed if a project had hundreds of changes to its requirements.txt file.
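
As a companion to the requirements.txt normalization point above, here is a rough sketch of the idea; the function names are illustrative, not the actual code in the new PR.

```js
// Sketch: normalize requirements.txt before hashing so cosmetic differences
// (comments, blank lines, ordering) don't defeat the static cache.
const crypto = require('crypto');
const fs = require('fs');

function normalizeRequirements(requirementsPath) {
  return fs
    .readFileSync(requirementsPath, 'utf8')
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line && !line.startsWith('#')) // drop blanks and comments
    .sort()
    .join('\n');
}

function requirementsHash(requirementsPath) {
  // Equivalent requirements files now hash to the same cache entry.
  return crypto
    .createHash('md5')
    .update(normalizeRequirements(requirementsPath))
    .digest('hex');
}
```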

## TODO
- [X] Feature Implementation
- [X] BUG: Deploying single-functions fails (Packaging works, but fails because of #161 )
- [X] Code Styling / Linting
- [X] Test to be sure Pipfile / generated requirements.txt still works
- [X] Tested a bunch on Mac / Linux with and without Docker
- [X] Adding Tests for Download Cache
- [X] Make sure zip feature still works
- [X] Ensure all existing tests pass
- [X] Adding Tests for static cache
- [X] Updating README.md to inform users how to use it
- [X] Make sure dockerSsh works
- [X] Implement error when trying to use --cache-dir with dockerizePip (won't work)
- [X] Implement suggestion when trying to use --cache-dir without dockerizePip
- [x] Test on Windows
- [x] Iterate through any feedback
- [x] Rebase with master constantly, awaiting merge...  :)

Replaces #162