Cache optimization for `brew linkage` command. #3720

Merged: 34 commits, May 24, 2018

Conversation

8 participants
@armcburney (Contributor) commented Jan 22, 2018

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew style with your changes locally?
  • Have you successfully run brew tests with your changes locally?

Hi Homebrew maintainers!

This is my first time contributing to Homebrew. I saw issue #3008, and thought it would be a good first issue to work on.

This pull request is still a work-in-progress. I was hoping to get some advice from the maintainers before proceeding any further.

I implemented a simple caching mechanism using SQLite3 to store brew linkage-related data on-disk in a database located in the HOMEBREW_CACHE directory.

The pull request creates two new files (Library/Homebrew/os/mac/linkage_store.rb and Library/Homebrew/os/mac/linkage_database.rb), and changes a bit of Library/Homebrew/os/mac/linkage_checker.rb.

Any constructive criticism and/or design advice would be greatly appreciated!

All the best,
Andrew

Performance Enhancements

The caching improves performance for the brew linkage command. I've attached two flame graphs showing the breakdown of the command time, as well as two sample runs of the command each with 106 packages (including larger packages such as boost).

The command runs in under 5 seconds (as requested in #3008) with the caching mechanism utilized.

  • The first run uses a new --rebuild flag, which forces the cache to rebuild. With --rebuild, the command behaves exactly as brew linkage currently does.
  • The second run uses the SQLite3 cache, which is enabled by default.

Flame Graphs

Without Optimization

Without the caching (i.e., using the --rebuild flag) the command (instrumented starting from linkage.rb) took 11.5 seconds to complete. The library boost took 1.01 seconds to complete checking dylibs.

[Screenshot: flame graph, without optimization]

With Optimization

With the caching (i.e., by default) the command (instrumented starting from linkage.rb) took 182 milliseconds to complete. The library boost took 1.38 milliseconds to complete checking dylibs.

[Screenshot: flame graph, with optimization]

Input

Run 1 (no caching)

time brew linkage --rebuild apache-spark docker gettext heroku libidn2 mysql py2cairo sbt atk docker-compose gimme hicolor-icon-theme libmpc nettle pyenv scala autoconf docker-machine git hub libpng node pyenv-virtualenv scalastyle automake dockutil glfw hugo librsvg nvm pygobject sqlite bash elasticsearch glib icu4c libtasn1 openjpeg pygtk texinfo bdw-gc elixir glm imagemagick libtiff openssl python tree boost emacs gmp imagemagick@6 libtool openssl@1.1 python3 wget cairo emacs-plus gnu-sed isl libunistring p11-kit r wxmac cask erlang gnutls ispell libyaml pango rbenv xz cmake fontconfig go jpeg little-cms2 pcre rbenv-gemset yarn coreutils freetype gobject-introspection jruby llvm pixman readline zookeeper crystal-lang gcc graphite2 libcroco lua pkg-config redis dash gdbm gtk+ libevent maven poppler ruby dateutils gdk-pixbuf harfbuzz libffi mpfr postgresql ruby-build

Run 2 (SQLite3 caching)

time brew linkage apache-spark docker gettext heroku libidn2 mysql py2cairo sbt atk docker-compose gimme hicolor-icon-theme libmpc nettle pyenv scala autoconf docker-machine git hub libpng node pyenv-virtualenv scalastyle automake dockutil glfw hugo librsvg nvm pygobject sqlite bash elasticsearch glib icu4c libtasn1 openjpeg pygtk texinfo bdw-gc elixir glm imagemagick libtiff openssl python tree boost emacs gmp imagemagick@6 libtool openssl@1.1 python3 wget cairo emacs-plus gnu-sed isl libunistring p11-kit r wxmac cask erlang gnutls ispell libyaml pango rbenv xz cmake fontconfig go jpeg little-cms2 pcre rbenv-gemset yarn coreutils freetype gobject-introspection jruby llvm pixman readline zookeeper crystal-lang gcc graphite2 libcroco lua pkg-config redis dash gdbm gtk+ libevent maven poppler ruby dateutils gdk-pixbuf harfbuzz libffi mpfr postgresql ruby-build

Output

Run 1 (no caching)

...
brew linkage --rebuild apache-spark docker gettext heroku libidn2 mysql  sbt   5.46s user 5.30s system 94% cpu 11.418 total

Run 2 (SQLite3 caching)

...
brew linkage apache-spark docker gettext heroku libidn2 mysql py2cairo sbt at  0.85s user 0.56s system 73% cpu 1.898 total

Files Changed

Library/Homebrew/os/mac/linkage_database.rb

LinkageDatabase is a module to interface with the SQLite3 linkage database. The public interface is:

# Expose linkage database through LinkageDatabase module
def db
  @db ||= SQLite3::Database.new "#{HOMEBREW_CACHE}/linkage.db"
end

# Checks if the database has been initialized
def empty?(keys:)
  ...
end

# Takes in an array of strings, and formats them into a SQL list string
def format_database_list(list)
  ...
end
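The elided format_database_list above presumably quotes each key for interpolation into a SQL IN (...) clause. A minimal sketch of one way that could work (the escaping rule, doubling embedded single quotes, is standard SQL; the body itself is a guess, not the PR's code):

```ruby
# Hypothetical body for format_database_list: quote each key (doubling any
# embedded single quotes, per SQL string-literal rules) and join with commas,
# yielding a fragment usable inside `IN (...)`.
def format_database_list(list)
  list.map { |item| "'#{item.to_s.gsub("'", "''")}'" }.join(", ")
end

puts format_database_list(%w[boost openssl])  # → 'boost', 'openssl'
```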

Caching Schema

The cache uses a single-table schema that includes the relevant linkage information.

CREATE TABLE IF NOT EXISTS linkage (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL,
  path TEXT NOT NULL,
  type TEXT NOT NULL CHECK (type IN (#{all_types})),
  label TEXT CHECK (label IS NULL OR (type IN (#{hash_types}))),
  UNIQUE(name, path, type, label) ON CONFLICT IGNORE
);
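Given the UNIQUE(...) ON CONFLICT IGNORE clause, re-inserting an identical row is a silent no-op, which lets the cache be repopulated idempotently. An illustrative insert (the values are made up):

```sql
INSERT INTO linkage (name, path, type, label)
VALUES ('boost', '/usr/lib/libSystem.B.dylib', 'system_dylibs', NULL);
-- Running the identical INSERT again leaves the table unchanged: the
-- UNIQUE(name, path, type, label) constraint fires and the row is ignored.
```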

Library/Homebrew/os/mac/linkage_store.rb

The LinkageStore class is responsible for fetching, inserting, updating, and deleting data in the SQLite3 cache.

The public interface for the class is:

# Updates cached values in SQLite3 tables according to the type of data stored
def update!(
      key:,
      array_linkage_values: { system_dylibs: %w[], variable_dylibs: %w[],
                              broken_dylibs: %w[], undeclared_deps: %w[],
                              unnecessary_deps: %w[] },
      hash_linkage_values: { brewed_dylibs: {}, reverse_links: {} }
    )
  ...
end

# Fetches a subset of paths by looking up an array of keys
def fetch_path_values!(type:, keys:)
  ...
end

# Fetches path and label values from a table name given an array of keys and
# returns them in a hash format with the labels as keys to an array of paths
def fetch_hash_values!(type:, keys:)
  ...
end

# Deletes rows given an array of keys. If the row's name
# attribute contains a value in the array of keys, then that row is deleted
def flush_cache_for_keys!(keys:)
  ...
end

Library/Homebrew/dev-cmd/linkage.rb

The brew linkage command now has the option to pass in the --rebuild flag. This flag will force the cache to rebuild, and the LinkageChecker will call check_dylibs for each of the kegs passed into the command. This option causes the command to work like it currently does (without the SQLite3 caching). By default, the brew linkage command will now use caching.
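The dispatch described above can be sketched as follows; this is a simplified, hypothetical shape (a stub stands in for LinkageChecker, and the method names are illustrative), not the PR's actual code:

```ruby
# Stub standing in for LinkageChecker, recording which path was taken.
class StubChecker
  attr_reader :actions

  def initialize
    @actions = []
  end

  def flush_cache_and_check_dylibs
    @actions << :rebuilt
  end

  def fetch_from_cache
    @actions << :cached
  end
end

# With --rebuild: flush the cached rows and recompute via check_dylibs.
# Without it (the default): serve results straight from the SQLite3 cache.
def run_linkage(checker, argv)
  if argv.include?("--rebuild")
    checker.flush_cache_and_check_dylibs
  else
    checker.fetch_from_cache
  end
  checker
end

puts run_linkage(StubChecker.new, ["--rebuild"]).actions.inspect  # → [:rebuilt]
```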

Questions

Updating the cache for a single keg

  • At what point should I update the cache for a single keg? I was thinking brew install <formula>, but I'm not sure if there are any other commands where I should be updating the cache for a particular keg.

Updating the cache for multiple kegs

  • At what points should I update the cache for multiple kegs at once? I was thinking that when the brew update command is invoked, the cache should be invalidated (i.e., flushed) and rebuilt on the next run of the brew linkage command.
  • Any advice on how to handle updating multiple kegs, and are there any cases I'm not considering where the cache would need to be updated?

How do you recommend writing tests for these changes?

  • Does Homebrew have any specific output matchers or helper functions that would be useful to write tests for my changes?

Outstanding Issues

  • Write benchmark and integration tests for the caching system
  • Write unit tests for Library/Homebrew/os/mac/linkage_database.rb and Library/Homebrew/os/mac/linkage_store.rb.

@armcburney armcburney force-pushed the armcburney:cache-optimization branch from 6a56e7c to d33be1c Jan 23, 2018

@alyssais (Contributor) commented Jan 23, 2018

Sorry for such a major comment, but: I'd rather not install a gem if we could get away with it.

What do you think about using a persistence solution that comes with Ruby, such as pstore?
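For reference, PStore ships with Ruby's standard library and persists marshalled objects inside transactions; a minimal sketch of its usage (the path and keys here are illustrative, not from the PR):

```ruby
require "pstore"
require "tmpdir"

# PStore writes marshalled Ruby objects to a single file; all reads and
# writes happen inside transactions.
path  = File.join(Dir.mktmpdir, "linkage.pstore")
store = PStore.new(path)

store.transaction do
  store["boost"] = { system_dylibs: ["/usr/lib/libSystem.B.dylib"] }
end

store.transaction(true) do  # true = read-only transaction
  puts store["boost"][:system_dylibs].first  # → /usr/lib/libSystem.B.dylib
end
```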

@alyssais (Contributor) commented Jan 23, 2018

Good job wrapping the database in its own class, though. Means we can try out different solutions fairly easily.

@armcburney (Contributor) commented Jan 23, 2018

@alyssais No worries, I can try out pstore instead and see how the performance compares to the original implementation.

The reason I initially opted to go with the SQLite3 gem was because @MikeMcQuaid mentioned experimenting with SQLite in #3379.

@alyssais (Contributor) commented Jan 23, 2018

@ilovezfs (Contributor) commented Jan 23, 2018

Using sqlite seems nicer to me.

@armcburney (Contributor) commented Jan 24, 2018

Hi @alyssais, and @ilovezfs

I implemented a caching solution using pstore, and created a PR on my personal fork to compare the changes with the SQLite3 implementation. I found pstore transactions to be expensive. It takes a bit of time to persist and fetch the hash structures (this could be due to the way I implemented it, so I'm open to any suggestions for how I could improve the code).

I've documented my findings in the PR, where I contrast the time differences between the two solutions in the description. I would recommend going with the SQLite3 implementation over the pstore implementation. I believe the tradeoff of installing another gem is worth the performance increases for the caching.

I'm open to any suggestions on how I could improve the pstore or SQLite3 implementations. 🙂

All the best,
Andrew

@alyssais (Contributor) commented Jan 24, 2018

Wow, thanks for such a detailed investigation!! You really went above and beyond and I'm super impressed.

I'm curious to see what @MikeMcQuaid thinks, since he's been fairly anti-dependency in the past for user-facing stuff IIRC, but also was the one to suggest SQLite.

@MikeMcQuaid (Member) commented Jan 25, 2018

This pull request is still a work-in-progress. I was hoping to get some advice from the maintainers before proceeding any further.

It looks great so far, well done! High-level thoughts are that I'd love to see all the SQLite stuff abstracted away sufficiently that we could use it for more than just linkage, and to think about what other data (e.g. dependency trees) would be good to store and when it would be good to update it (for dependencies: on brew update; for installation data: on brew (re)install/upgrade).

With the caching (i.e., by default) the command (instrumented starting from linkage.rb) took 182 milliseconds to complete. The library boost took 1.38 milliseconds to complete checking dylibs.

These are pretty great performance gains, well done. What I'd be interested in (if you can summarise it in a little table) is the timings before and after this change with an empty/invalid cache and with a valid cache. Additionally, does doing something like brew reinstall boost and ending up with files which vary slightly in metadata affect anything at all?

Sorry for such a major comment, but: I'd rather not install a gem if we could get away with it.

We need to not install a gem at runtime but it's fine to use the sqlite gem because it has no dependencies (as far as I can tell). It would need to be vendored into Library/Homebrew/vendor/ but hold off on doing so for now because it'll bloat the diff.

@alyssais (Contributor) commented Jan 25, 2018

We need to not install a gem at runtime but it's fine to use the sqlite gem because it has no dependencies (as far as I can tell). It would need to be vendored into Library/Homebrew/vendor/ but hold off on doing so for now because it'll bloat the diff.

Ah, sweet!

@AndrewMcBurney you mentioned (I think in the other PR) that you want to clean this up a bit before review. Shout when you’re ready and I’ll have a look. Great job so far!

@woodruffw (Member) commented Jan 25, 2018

This is a bit late, but to add onto what @alyssais said about PStore: Ruby also provides DBM as an abstraction over various Berkeley-type DBs. Depending on the BDB-like supplied by OS X, it might offer better performance (Marshal is fast, but a native DB with string, string KV pairs is probably faster).

Edit: On 10.13.1, it looks like we've got BerkeleyDB 4.7.25:

$ db_dump -V
Berkeley DB 4.7.25: (May 15, 2008)

@MikeMcQuaid (Member) commented Jan 25, 2018

This is a bit late, but to add onto what @alyssais said about PStore: Ruby also provides DBM as an abstraction over various Berkeley-type DBs. Depending on the BDB-like supplied by OS X, it might offer better performance (Marshal is fast, but a native DB with string, string KV pairs is probably faster).

This seems like an interesting thing to explore. It'd also be nice if we could get comparable performance without the more rigid overhead of a typed DB schema.

@RandomDSdevel (Contributor) commented Jan 25, 2018

@woodruffw, possibly also @MikeMcQuaid: For the record, Homebrew Core has berkeley-db@4 at v4.8.30.

@armcburney (Contributor) commented Jan 25, 2018

Hi @alyssais and @MikeMcQuaid,

I refactored the existing SQLite3 implementation to make the code more extensible for adding new caching mechanisms in the future. I think this should be easy to extend for further commands, and could potentially be used to implement a solution to increase performance for brew uses.

Code Changes

Abstractions

Library/Homebrew/os/mac/database/database.rb

Database is an abstract base class representing a database caching schema residing in the Homebrew cache. It requires concretions to override the pure-virtual create_tables method, in which a concrete class specifies the schema for the given database.

Public interface:

# Creates a database in the Homebrew cache with the name passed in
#
# @param  [String] name
# @raise  [SQLite3::CantOpenException]
# @return [nil]
def initialize(name) ... end

# Memoized `SQLite3` database object with on-disk database located in the
# `HOMEBREW_CACHE`
#
# @return [SQLite3::Database] db
def db ... end

# Abstract method overridden by concretion classes. Creates database tables for
# the corresponding database schema
#
# @abstract
# @return [nil]
def create_tables ... end
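The abstract-method contract above is conventionally expressed in Ruby with NotImplementedError; a minimal sketch (the class bodies here are illustrative, not the PR's code):

```ruby
# Abstract base: create_tables must be overridden by a concretion.
class Database
  def create_tables
    raise NotImplementedError, "#{self.class} must override create_tables"
  end
end

# Illustrative concretion returning its schema DDL.
class LinkageDatabase < Database
  def create_tables
    "CREATE TABLE IF NOT EXISTS linkage (...)"
  end
end

puts LinkageDatabase.new.create_tables
```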

Library/Homebrew/os/mac/store/store.rb

Store is a class which acts as an interface to a persistent storage mechanism: if the cache hasn't changed, Homebrew skips the extra processing and simply fetches the data stored in the cache.

Concretions extending Store must implement its public interface by overriding the update!, fetch, and flush_condition methods.

# Initializes new `Store` class
#
# @param  [Database] database
# @return [nil]
def initialize(database) ... end

# Updates cached values in persistent storage according to the type of data
# stored
#
# @abstract
# @param  [Any]
# @return [nil]
def update!(*) ... end

# Fetches cached values in persistent storage according to the type of data
# stored
#
# @abstract
# @param  [Any]
# @return [Any]
def fetch(*) ... end

# A condition to specify what rows to delete when flushing the cache. This
# method should be overridden by concrete classes extending abstract
# `Database` class
#
# @abstract
# @return [String]
def flush_condition ... end

# Deletes rows where the `flush_condition` holds
#
# @return [nil]
def flush_cache! ... end

Concretions

Library/Homebrew/os/mac/database/linkage_database.rb

  • Implements the Database interface
  • Includes a helper LinkageDatabaseTypes module for shared type information

Library/Homebrew/os/mac/store/linkage_store.rb

  • Implements the Store interface
  • Includes a helper LinkageDatabaseTypes module for shared type information

Helpers

Library/Homebrew/os/mac/database/helpers/linkage_database_types.rb

  • Both LinkageDatabase and LinkageStore depend on knowing the HASH_LINKAGE_TYPES. I didn't want to break the Dependency Inversion Principle by coupling column-specific type information in the LinkageDatabase class. The LinkageStore would need to depend on constants defined in the concrete LinkageDatabase class, rather than depending on the interface for the abstract Database class. Hence, I created this module to be included in both classes which depend on the shared information.
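The shared-module arrangement can be sketched like this (the constant's contents mirror the hash linkage types named earlier in the PR; the predicate method and class name are illustrative):

```ruby
# A single module holds the type information both the database and the store
# need, so neither concrete class has to depend on the other.
module LinkageDatabaseTypes
  HASH_LINKAGE_TYPES = %w[brewed_dylibs reverse_links].freeze

  def hash_linkage_type?(type)
    HASH_LINKAGE_TYPES.include?(type)
  end
end

class LinkageStoreExample
  include LinkageDatabaseTypes
end

puts LinkageStoreExample.new.hash_linkage_type?("brewed_dylibs")  # → true
```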

Performance Table

What I'd be interested in (if you can summarise in a little table) the timings before and after this change with empty/invalid cache and with a valid cache.

I've summarized my findings in the table below, but I've included all my bash input/output so you can reproduce or point out a potential mistake in my investigation.

# Build the cache when the cache is full
time brew linkage --rebuild apache-spark docker-machine git hub libpng node pyenv-virtualenv scalastyle atk dockutil glfw hugo librsvg nvm pygobject sqlite autoconf elasticsearch glib icu4c libtasn1 openjpeg pygtk texinfo automake elixir glm imagemagick libtiff openssl python bash emacs gmp imagemagick@6 libtool openssl@1.1 python3 tree bdw-gc emacs-plus gnu-sed isl libunistring p11-kit r wget boost erlang gnutls ispell libyaml pango rbenv wxmac cairo fontconfig go jpeg little-cms2 pcre rbenv-gemset xz cask freetype gobject-introspection jruby llvm pixman readline yarn cmake gcc graphite2 libcroco lua pkg-config redis coreutils gdbm gtk+ libevent maven poppler ruby crystal-lang gdk-pixbuf harfbuzz libffi mpfr postgresql ruby-build docker gettext heroku libidn2 mysql py2cairo sbt docker-compose gimme hicolor-icon-theme libmpc nettle pyenv scala > build-cache-when-cache-full.txt
# > Output:
brew linkage --rebuild apache-spark docker-machine git hub libpng node   atk   11.65s user 19.74s system 54% cpu 57.822 total

# Use the cache when the cache is full
time brew linkage apache-spark docker-machine git hub libpng node pyenv-virtualenv scalastyle atk dockutil glfw hugo librsvg nvm pygobject sqlite autoconf elasticsearch glib icu4c libtasn1 openjpeg pygtk texinfo automake elixir glm imagemagick libtiff openssl python bash emacs gmp imagemagick@6 libtool openssl@1.1 python3 tree bdw-gc emacs-plus gnu-sed isl libunistring p11-kit r wget boost erlang gnutls ispell libyaml pango rbenv wxmac cairo fontconfig go jpeg little-cms2 pcre rbenv-gemset xz cask freetype gobject-introspection jruby llvm pixman readline yarn cmake gcc graphite2 libcroco lua pkg-config redis coreutils gdbm gtk+ libevent maven poppler ruby crystal-lang gdk-pixbuf harfbuzz libffi mpfr postgresql ruby-build docker gettext heroku libidn2 mysql py2cairo sbt docker-compose gimme hicolor-icon-theme libmpc nettle pyenv scala > use-cache-when-cache-full.txt
# > Output:
brew linkage apache-spark docker-machine git hub libpng node pyenv-virtualenv  0.51s user 0.20s system 93% cpu 0.757 total

# Check output is good
vimdiff build-cache-when-cache-full.txt use-cache-when-cache-full.txt

# Cleanup the homebrew cache
brew cleanup
rm -rf ~/Library/Caches/Homebrew/*

# Build the cache when the cache is empty
time brew linkage --rebuild apache-spark docker-machine git hub libpng node pyenv-virtualenv scalastyle atk dockutil glfw hugo librsvg nvm pygobject sqlite autoconf elasticsearch glib icu4c libtasn1 openjpeg pygtk texinfo automake elixir glm imagemagick libtiff openssl python bash emacs gmp imagemagick@6 libtool openssl@1.1 python3 tree bdw-gc emacs-plus gnu-sed isl libunistring p11-kit r wget boost erlang gnutls ispell libyaml pango rbenv wxmac cairo fontconfig go jpeg little-cms2 pcre rbenv-gemset xz cask freetype gobject-introspection jruby llvm pixman readline yarn cmake gcc graphite2 libcroco lua pkg-config redis coreutils gdbm gtk+ libevent maven poppler ruby crystal-lang gdk-pixbuf harfbuzz libffi mpfr postgresql ruby-build docker gettext heroku libidn2 mysql py2cairo sbt docker-compose gimme hicolor-icon-theme libmpc nettle pyenv scala > build-cache-when-cache-empty.txt
# > Output:
brew linkage --rebuild apache-spark docker-machine git hub libpng node   atk   5.14s user 5.35s system 94% cpu 11.111 total

# Use the cache when the cache is 'empty' (i.e., just recently filled)
time brew linkage apache-spark docker-machine git hub libpng node pyenv-virtualenv scalastyle atk dockutil glfw hugo librsvg nvm pygobject sqlite autoconf elasticsearch glib icu4c libtasn1 openjpeg pygtk texinfo automake elixir glm imagemagick libtiff openssl python bash emacs gmp imagemagick@6 libtool openssl@1.1 python3 tree bdw-gc emacs-plus gnu-sed isl libunistring p11-kit r wget boost erlang gnutls ispell libyaml pango rbenv wxmac cairo fontconfig go jpeg little-cms2 pcre rbenv-gemset xz cask freetype gobject-introspection jruby llvm pixman readline yarn cmake gcc graphite2 libcroco lua pkg-config redis coreutils gdbm gtk+ libevent maven poppler ruby crystal-lang gdk-pixbuf harfbuzz libffi mpfr postgresql ruby-build docker gettext heroku libidn2 mysql py2cairo sbt docker-compose gimme hicolor-icon-theme libmpc nettle pyenv scala > use-cache-when-cache-empty.txt
# > Output:
brew linkage apache-spark docker-machine git hub libpng node pyenv-virtualenv  0.50s user 0.20s system 94% cpu 0.741 total

# Sanity check
vimdiff build-cache-when-cache-full.txt build-cache-when-cache-empty.txt
vimdiff build-cache-when-cache-full.txt use-cache-when-cache-empty.txt

Performance for cache update scenarios

Run                                                                  Performance
Build the cache when the cache is full                               11.65s user 19.74s system 54% cpu 57.822 total
Use the cache when the cache is full                                 0.51s user 0.20s system 93% cpu 0.757 total
Build the cache when the cache is empty                              5.14s user 5.35s system 94% cpu 11.111 total
Use the cache when the cache is 'empty' (i.e., just recently filled) 0.50s user 0.20s system 94% cpu 0.741 total

Note: the CPU was running at 94% in the second cache-building run (Build the cache when the cache is empty), which explains why it was considerably faster than the first cache-building run (Build the cache when the cache is full), which used only 54% CPU.

boost reinstall

Additionally, does doing something like brew reinstall boost and ending up with files which vary slightly in metadata affect anything at all?

I piped the output of four commands into text files, and ran a vimdiff to compare outputs as follows:

# Running linkage without and with caching, and piping output to .txt files
brew linkage --rebuild boost > boost-rebuild.txt
brew linkage boost > boost.txt

# Sanity check to ensure output is the same in cached and uncached versions
vimdiff boost.txt boost-rebuild.txt

# Reinstalling boost
brew reinstall boost

# Rebuilding the cache
brew linkage --rebuild boost > boost-rebuild-re.txt
brew linkage boost > boost-re.txt

# Comparing output between the two versions
vimdiff boost.txt boost-rebuild-re.txt
vimdiff boost.txt boost-re.txt

All the metadata remained the same when reinstalling the formula. However, I should probably write code to reset the cache for formulae passed into this command, to ensure anything updated is accounted for. This segues into the next section.

Setting the cache

Currently, the only way to set the cache for a list of packages is by passing in the --rebuild flag to the brew linkage <packages> command. Hence, if you run brew linkage <packages> before building the cache (i.e., running brew linkage --rebuild), the output will not be correct.

I have some pseudocode showing what building the cache for a single keg versus all kegs would entail. However, I'm not quite sure where this code should be added to ensure brew linkage is never called without the cache properly being set.

# For updating the cache for one `keg`
LinkageChecker.new(keg).check_dylibs

# For updating the cache for all `kegs`
kegs.each { |keg| LinkageChecker.new(keg).check_dylibs }

Any advice would be greatly appreciated! 🙂

Testing

  • Any advice on how best to stub the SQLite3 database calls / caching for test execution?
  • When I run brew tests locally, I get an SQLite3::CantOpenException. Do you have any advice for how to work around this, or how to utilize the HOMEBREW_CACHE (or an alternative directory) within the test suites?
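One way to sidestep both questions in a unit-test suite (an approach sketch, not from the PR): depend only on the Store interface and inject an in-memory fake, so no on-disk SQLite database is ever opened and SQLite3::CantOpenException cannot occur:

```ruby
# Hypothetical in-memory stand-in implementing a Store-style interface, so
# unit tests never touch HOMEBREW_CACHE or SQLite3.
class FakeLinkageStore
  def initialize
    @data = {}
  end

  def update!(key:, values:)
    @data[key] = values
  end

  def fetch(key:)
    @data.fetch(key, [])
  end

  def flush_cache!
    @data.clear
  end
end

store = FakeLinkageStore.new
store.update!(key: "boost", values: ["/usr/lib/libSystem.B.dylib"])
puts store.fetch(key: "boost").first  # → /usr/lib/libSystem.B.dylib
```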

Thank you! 🙌

@MikeMcQuaid (Member) commented Jan 27, 2018

Performance for cache update scenarios

It would be great to add to your table how these numbers compare to the same behaviour without the cache being used at all (i.e. the existing code). Thanks!

Any advice on how best to stub the SQLite3 database calls / caching for test execution?

I recommend not stubbing these for integration tests and not adding unit tests that would need them stubbed.

@armcburney (Contributor) commented Jan 27, 2018

Hey @MikeMcQuaid,

On current origin/master, I ran the command I used before twice:

# Run 1) Note: CPU at 45%
time brew linkage apache-spark docker-machine git hub libpng node pyenv-virtualenv scalastyle atk dockutil glfw hugo librsvg nvm pygobject sqlite autoconf elasticsearch glib icu4c libtasn1 openjpeg pygtk texinfo automake elixir glm imagemagick libtiff openssl python bash emacs gmp imagemagick@6 libtool openssl@1.1 python3 tree bdw-gc emacs-plus gnu-sed isl libunistring p11-kit r wget boost erlang gnutls ispell libyaml pango rbenv wxmac cairo fontconfig go jpeg little-cms2 pcre rbenv-gemset xz cask freetype gobject-introspection jruby llvm pixman readline yarn cmake gcc graphite2 libcroco lua pkg-config redis coreutils gdbm gtk+ libevent maven poppler ruby crystal-lang gdk-pixbuf harfbuzz libffi mpfr postgresql ruby-build docker gettext heroku libidn2 mysql py2cairo sbt docker-compose gimme hicolor-icon-theme libmpc nettle pyenv scala > master-with-homebrew-cache.txt
# > Output:
brew linkage apache-spark docker-machine git hub libpng node pyenv-virtualenv  7.18s user 11.19s system 45% cpu 40.115 total

# Run 2) Note: CPU at 99%
time brew linkage apache-spark docker-machine git hub libpng node pyenv-virtualenv scalastyle atk dockutil glfw hugo librsvg nvm pygobject sqlite autoconf elasticsearch glib icu4c libtasn1 openjpeg pygtk texinfo automake elixir glm imagemagick libtiff openssl python bash emacs gmp imagemagick@6 libtool openssl@1.1 python3 tree bdw-gc emacs-plus gnu-sed isl libunistring p11-kit r wget boost erlang gnutls ispell libyaml pango rbenv wxmac cairo fontconfig go jpeg little-cms2 pcre rbenv-gemset xz cask freetype gobject-introspection jruby llvm pixman readline yarn cmake gcc graphite2 libcroco lua pkg-config redis coreutils gdbm gtk+ libevent maven poppler ruby crystal-lang gdk-pixbuf harfbuzz libffi mpfr postgresql ruby-build docker gettext heroku libidn2 mysql py2cairo sbt docker-compose gimme hicolor-icon-theme libmpc nettle pyenv scala > master-with-homebrew-cache.txt
# > Output:
brew linkage apache-spark docker-machine git hub libpng node pyenv-virtualenv  4.70s user 3.97s system 99% cpu 8.735 total

An updated table comparing the two branches would be:

Run                                                Performance
AndrewMcBurney/cache-optimization: Build the cache 5.14s user 5.35s system 94% cpu 11.111 total
AndrewMcBurney/cache-optimization: Use the cache   0.50s user 0.20s system 94% cpu 0.741 total
origin/master                                      4.70s user 3.97s system 99% cpu 8.735 total

So it takes longer to build the cache than it currently does on origin/master (as expected with the database calls). However, the performance is significantly improved when fetching the data from SQLite3.

@armcburney (Contributor) commented Jan 29, 2018

Hi @alyssais and @MikeMcQuaid,

Could I get a code review for my changes when you have some free time please?

Thank you! 🙂

@armcburney armcburney force-pushed the armcburney:cache-optimization branch from 4523a49 to 9450381 Feb 5, 2018

@armcburney (Contributor) commented Feb 5, 2018

Hi @alyssais and @MikeMcQuaid,

The reason CI is failing is due to an issue installing the sqlite3 gem:

==> Installing or updating 'sqlite3' gem
Fetching: sqlite3-1.3.13.gem (100%)
Building native extensions.  This could take a while...
ERROR:  Error installing sqlite3:
	ERROR: Failed to build gem native extension.

However, @MikeMcQuaid stated:

We need to not install a gem at runtime but it's fine to use the sqlite gem because it has no dependencies (as far as I can tell). It would need to be vendored into Library/Homebrew/vendor/ but hold off on doing so for now because it'll bloat the diff.

So we wouldn't have the gem install issues with the vendored solution.

Would it be possible to get a code-review or advice for next-steps I should take please?

Thank you! 🙂

@MikeMcQuaid (Member) commented Feb 11, 2018

So we wouldn't have the gem install issues with the vendored solution.

It'd be good to do that to see the CI results I think 👍

This is a bit late, but to add onto what @alyssais said about PStore: Ruby also provides DBM as an abstraction over various Berkeley-type DBs. Depending on the BDB-like supplied by OS X, it might offer better performance (Marshal is fast, but a native DB with string, string KV pairs is probably faster).

I'd love to see an investigation of this as part of the PR too. Ideally the existing PStore implementation, one using BDB, and the SQLite one would all be worth investigating, both from a benchmark perspective and to compare their code implementations (I'd take a slightly slower cache with much simpler code).

I'm gonna do a quick code review now, too. Sorry for the delay in this and thanks for all your work so far @AndrewMcBurney!

@MikeMcQuaid (Member) left a comment

A few comments. Generally, the levels of abstraction here can stay while playing around with various implementations, but it would be good to reduce this down to e.g. a single new class, with extensions to existing classes, for a final merged PR.

@@ -9,6 +9,9 @@
#:
#: If `--reverse` is passed, print the dylib followed by the binaries
#: which link to it for each library the keg references.
#:
#: If `--rebuild` is passed, flushes sqlite3 db row for 'keg.name' and

@MikeMcQuaid (Member) commented Feb 11, 2018

Doesn't matter for now but would be good to just reference the cache rather than an implementation (sqlite)


# Force a flush of the 'cache' and check dylibs if the `--rebuild`
# flag is passed
result.check_dylibs if ARGV.include?("--rebuild")

@MikeMcQuaid (Member) commented Feb 11, 2018

Could rename the method flush_cache_and_check_dylibs and kill comment entirely

#
# @abstract
#
class Database

@MikeMcQuaid (Member) commented Feb 11, 2018

DatabaseCache or similar?


def initialize(keg, formula = nil)
@keg = keg
@keg = keg

MikeMcQuaid (Member), Feb 11, 2018

Just leave this formatting as-is; I'm not opposed to having it like this for new files but it makes it trickier to review when there's unrelated formatting changes.

@armcburney armcburney force-pushed the armcburney:cache-optimization branch 2 times, most recently from 496b2a2 to 337e011 Feb 11, 2018

armcburney (Contributor) commented Feb 12, 2018

Hi @MikeMcQuaid,

I implemented a DBM Berkeley db solution like you suggested. I've summarized the performance comparison between the SQLite3 implementation and the DBM implementation below.

I created a pull request against my personal fork of Homebrew to show what the changes would look like against origin/master.

Should you wish to proceed with the DBM implementation (which I recommend over the SQLite3 solution after this investigation), I'll push the code to this pull request, clean up the commits, and write tests for the changes.

Performance Comparison

Summary

Implementation   No Optimization   Cache Optimization
origin/master    8.04 seconds      N/A
SQLite3          10.3 seconds      180 milliseconds
DBM              8.14 seconds      208 milliseconds

Observations

  • It takes longer to build the cache (i.e., the No Optimization column) for both SQLite3 and DBM implementations than it currently takes on origin/master
  • The DBM implementation can build the cache faster than the SQLite3 implementation
  • It's much faster to run the brew linkage command when the values are cached (i.e., the Cache Optimization column for DBM and SQLite3)
  • The SQLite3 implementation is faster at accessing values from the cache than the DBM implementation (but not by much)

Flame Graphs

origin/master

Without Optimization

The baseline without caching (i.e., the time it takes for brew linkage to complete on origin/master) took 8.04 seconds to complete. The library boost took 704 milliseconds to complete checking the dylibs and printing the output.

[flame graph: origin/master baseline]

SQLite3

Without Optimization

Without the caching (i.e., using the --rebuild flag) the command (instrumented starting from linkage.rb) took 10.3 seconds to complete. The library boost took 777 milliseconds to complete checking the dylibs and printing the output.

[flame graph: SQLite3 without optimization]

With Optimization

With the caching (i.e., by default) the command (instrumented starting from linkage.rb) took 180 milliseconds to complete. The library boost took 1.77 milliseconds to complete checking the dylibs and printing the output.

[flame graph: SQLite3 with optimization]

DBM

Without Optimization

Without the caching (i.e., using the --rebuild flag) the command (instrumented starting from linkage.rb) took 8.14 seconds to complete. The library boost took 686 milliseconds to complete checking the dylibs and printing the output.

[flame graph: DBM without optimization]

With Optimization

With the caching (i.e., by default) the command (instrumented starting from linkage.rb) took 208 milliseconds to complete. The library boost took 2.15 milliseconds to complete checking the dylibs and printing the output.

[flame graph: DBM with optimization]

Recommendations

I would recommend proceeding with the DBM implementation because:

  • the solution doesn't require a vendored gem like the SQLite3 implementation does
  • the SQLite3 gem requires building native extensions from C with rake/extensioncompiler
  • setting the cache is slightly faster than with the SQLite3 implementation
  • while the SQLite3 solution is faster at fetching data from the cache, the difference is small
  • the code is cleaner to read, and the cache structure/schema is easier to understand
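Regarding the last point, one property worth noting is that DBM stores only string keys and string values, so structured linkage data has to be encoded before writing. A minimal sketch of that flat schema (the keg name, keys, and JSON encoding here are illustrative, not the PR's actual serialization format):

```ruby
require "json"

# Hypothetical sketch of a DBM-style flat schema: string keys, string
# values. A plain Hash stands in for the DBM handle; the real store
# persists to a file on disk.
db = {}

linkage = {
  "system_dylibs" => ["/usr/lib/libz.1.dylib"],
  "broken_dylibs" => [],
}

# Encode structured data to a string before writing under the keg name.
db["boost"] = JSON.generate(linkage)

# Reading decodes the string value back into structured data.
restored = JSON.parse(db["boost"])
```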

@armcburney armcburney referenced this pull request Feb 12, 2018

Closed

Berkeley db cache optimization for `brew linkage` command. #3

5 of 6 tasks complete

@armcburney armcburney force-pushed the armcburney:cache-optimization branch from 138ea19 to 360a301 May 21, 2018

MikeMcQuaid (Member) commented May 21, 2018

Is the desired cache usage this?

Yes, I think so. I'm open to being convinced otherwise but it seemed strange to me to build/store a cache on disk which would never be used without an option or environment variable being passed.

@armcburney armcburney force-pushed the armcburney:cache-optimization branch from f049e3b to b8bf7ba May 21, 2018

Changed cache usage behavior.
1. Running `brew linkage some_package` does not set the cache.
2. Running `brew linkage --cached some_package` when `DatabaseCache.empty?` returns `true` should build the cache.
3. Running `brew linkage --cached some_package` when `DatabaseCache.empty?` returns `false` should use the cache.
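The three behaviors above boil down to only consulting (or building) the cache when the flag is passed. A minimal sketch of that control flow, with invented method names standing in for the real linkage check:

```ruby
# Hypothetical sketch: `check_dylibs` stands in for the real dylib walk.
def check_dylibs(keg_name)
  ["#{keg_name}: dylibs checked"]
end

# Without --cached the cache is never touched; with it, a missing entry
# is built on first use and read back thereafter.
def run_linkage(keg_name, argv, cache)
  return check_dylibs(keg_name) unless argv.include?("--cached")

  cache[keg_name] ||= check_dylibs(keg_name)
end

cache = {}
run_linkage("testball", [], cache)           # plain run: cache stays empty
run_linkage("testball", ["--cached"], cache) # builds, then reuses the entry
```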

@armcburney armcburney force-pushed the armcburney:cache-optimization branch from b8bf7ba to 010207b May 21, 2018

@MikeMcQuaid MikeMcQuaid force-pushed the armcburney:cache-optimization branch from 3bd83a3 to 44f5d3e May 22, 2018

MikeMcQuaid (Member) commented May 22, 2018

@AndrewMcBurney I've pushed some refactoring here; I couldn't really figure out how to explain what I wanted without writing out some of the code myself. Let me know if you object to any of it. The main thing I wanted was to have the DatabaseCache (now called CacheStoreDatabase) be the only thing that calls DBM objects, the CacheStore and subclasses (i.e. LinkageStore, now called LinkageCacheStore) be the only thing that calls DatabaseCache/CacheStoreDatabase objects, and the LinkageChecker be the only thing that calls LinkageStore/LinkageCacheStore objects.
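The class names below come from the comment above, but the method bodies are illustrative stand-ins, not the merged code. The point is the layering: only the database class touches the underlying key-value handle, and only the store talks to the database:

```ruby
# Sketch of the layered design. Only CacheStoreDatabase would touch the
# DBM handle (a Hash stands in here); only LinkageCacheStore would talk
# to CacheStoreDatabase; LinkageChecker would talk only to the store.
class CacheStoreDatabase
  def initialize
    @db = {} # the real class opens a DBM file here
  end

  def get(key)
    @db[key]
  end

  def set(key, value)
    @db[key] = value
  end
end

class LinkageCacheStore
  def initialize(keg_name, database)
    @keg_name = keg_name
    @database = database
  end

  def keg_exists?
    !@database.get(@keg_name).nil?
  end

  def update!(data)
    @database.set(@keg_name, data)
  end
end

store = LinkageCacheStore.new("boost", CacheStoreDatabase.new)
store.update!("system_dylibs" => []) unless store.keg_exists?
```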

Write benchmark and integration tests for the caching system
Write unit tests for Library/Homebrew/os/mac/linkage_database.rb and Library/Homebrew/os/mac/linkage_store.rb.

@AndrewMcBurney I think the code is now enough that you can proceed with these (if you're still up for it). I think you can skip the benchmark tests and focus mainly on unit tests with a few integration tests to check things are working end-to-end. If you're not going to have time to do this: shout and I'll take a look.

@armcburney armcburney force-pushed the armcburney:cache-optimization branch from 934ef00 to b2e046a May 22, 2018

@armcburney armcburney force-pushed the armcburney:cache-optimization branch from b2e046a to 91f7a5e May 22, 2018

armcburney (Contributor) commented May 22, 2018

Hi @MikeMcQuaid,

I added unit tests for the new classes, but I was hoping to get some advice for the integration tests. I've looked at the other dev-cmd test suites, and structured the linkage_spec as such:

describe "brew linkage", :integration_test do
  context "no cache" do
    it "works when one argument is provided" do
      setup_test_formula "testball"

      expect(File).to_not receive(:write)
      expect { brew "linkage", "foo" }
        .to be_a_success
    end
  end
end

I thought this would work, but it's currently failing. Does Homebrew have any test helper methods to setup kegs that I could pass into the brew linkage command?

Thanks!

MikeMcQuaid (Member) commented May 23, 2018

expect(File).to_not receive(:write)

This is the bit that may not work as expected.

What's the failure you're seeing?

expect { brew "linkage", "foo" }

Should be brew "linkage", "testball".

If neither of those helps, share the exact failure you're seeing. FYI don't go overboard in integration tests; we have too many and I'm going to delete some.

MikeMcQuaid (Member) commented May 23, 2018

FYI don't go overboard in integration tests; we have too many and I'm going to delete some.

See #4202 for this.

armcburney (Contributor) commented May 23, 2018

What's the failure you're seeing?

It fails, but the error message isn't too descriptive:

Failures:

  1) brew linkage no cache works when one argument is provided
     Failure/Error:
       expect { brew "linkage", "testball" }
         .to be_a_success

       expected #<Proc:0x007f8a4eda77f0@/usr/local/Homebrew/Library/Homebrew/test/dev-cmd/linkage_spec.rb:13> to be a success
     # ./test/dev-cmd/linkage_spec.rb:13:in `block (3 levels) in <top (required)>'
     # ./test/support/helper/spec/shared_context/integration_test.rb:44:in `block (2 levels) in <top (required)>'
     # ./test/spec_helper.rb:103:in `block (2 levels) in <top (required)>'
     # ./vendor/bundle/ruby/2.3.0/gems/rspec-retry-0.5.7/lib/rspec/retry.rb:115:in `block in run'
     # ./vendor/bundle/ruby/2.3.0/gems/rspec-retry-0.5.7/lib/rspec/retry.rb:104:in `loop'
     # ./vendor/bundle/ruby/2.3.0/gems/rspec-retry-0.5.7/lib/rspec/retry.rb:104:in `run'
     # ./vendor/bundle/ruby/2.3.0/gems/rspec-retry-0.5.7/lib/rspec_ext/rspec_ext.rb:12:in `run_with_retry'
     # ./vendor/bundle/ruby/2.3.0/gems/rspec-retry-0.5.7/lib/rspec/retry.rb:33:in `block (2 levels) in setup'
     # ./vendor/bundle/ruby/2.3.0/gems/rspec-wait-0.0.9/lib/rspec/wait.rb:46:in `block (2 levels) in <top (required)>'

for the test:

describe "brew linkage", :integration_test do
  context "no cache" do
    it "works when one argument is provided" do
      setup_test_formula "testball"

      expect { brew "linkage", "testball" }
        .to be_a_success
    end
  end
end

MikeMcQuaid (Member) commented May 23, 2018

It fails, but the error message isn't too descriptive:

Yeah, they aren't ideal 😭

I'd suggest adding an assertion for the standard/error output (both as good practice and to help debug).

armcburney (Contributor) commented May 23, 2018

Hi @MikeMcQuaid, on stderr, it's printing: Error: No such keg: /tmp/homebrew-tests-20180523-61114-1xow8iw/cellar/testball. Is there a helper function for creating kegs that I could use in these tests? Thank you!

MikeMcQuaid (Member) commented May 23, 2018

Hi @MikeMcQuaid, on stderr, it's printing: Error: No such keg: /tmp/homebrew-tests-20180523-61114-1xow8iw/cellar/testball. Is there a helper function for creating kegs that I could use in these tests? Thank you!

@AndrewMcBurney I'd copy-paste from the upgrade and migrate integration tests, which both rely on a keg existing. There's no helper function beyond setup_test_formula.

@MikeMcQuaid MikeMcQuaid merged commit 97645d0 into Homebrew:master May 24, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@wafflebot wafflebot bot removed the in progress label May 24, 2018

MikeMcQuaid (Member) commented May 24, 2018

Great work on this @AndrewMcBurney! Sorry for the delay in merging but so happy to see this in now!

@lock lock bot added the outdated label Jun 23, 2018

@lock lock bot locked as resolved and limited conversation to collaborators Jun 23, 2018
