
Shared storage cache #4989

Closed
ghoshkaj wants to merge 17 commits into implement-cache from shared-storage-cache

Conversation

@ghoshkaj ghoshkaj (Member) commented Apr 1, 2018

Issue

Use a shared storage cache to compare against the performance of the threadlocal cache in #4876. This PR exists to test which of the two is faster; the faster implementation will be kept. Related to the cache considerations detailed here.

Tasklist

  • implement singleton pattern to make the cache accessible from global storage
  • ensure that ^ is thread safe (01e6562#diff-3765a18558eecd8c9e22fbf48fb2eccbR96)
  • implement mutexes for all functions in the cache (to lock for every read and every write)
  • swap the mutex for a ReadWriteLock (lock only on write, so that reads can still happen in parallel; see the sketch after this list)
  • run benchmarks keeping the following in mind:
    • create a profile for the number of threads used: 1, 2, 8, 16 (with fewer threads there is less blocking between threads, while the higher end may create a deadlock; profiling these runs gives us a better idea of the ideal number of threads)
    • we want the cache to use at most 500 MB of extra memory in total, so the line size of this cache can be 26214400
    • so, when benchmarking the threadlocal cache, make sure that the line size is divided by the number of threads in use, so that the combined total is not greater than 500 MB
    • run 20000 requests for a 25x25 matrix and a 100x100 matrix on a large dataset
  • either merge into the base PR or close as not necessary
  • CHANGELOG.md entry (How to write a changelog entry)
  • update relevant Wiki pages
  • add tests (see testing documentation)
  • review
  • adjust for comments
  • cherry pick to release branch
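
A minimal sketch of the mutex-then-ReadWriteLock idea from the tasklist, assuming a toy key/value cache (the class name and the key/value types are placeholders, not the PR's actual code):

```cpp
#include <boost/thread/locks.hpp>
#include <boost/thread/shared_mutex.hpp>
#include <boost/optional.hpp>
#include <cstdint>
#include <unordered_map>

// Placeholder shared cache: shared lock for reads, unique lock for writes,
// so parallel readers never block each other.
class SharedDurationCache
{
  public:
    boost::optional<int> Get(std::uint64_t key) const
    {
        boost::shared_lock<boost::shared_mutex> lock(mutex); // many concurrent readers
        auto it = map.find(key);
        if (it == map.end())
            return boost::none;
        return it->second;
    }

    void Put(std::uint64_t key, int duration)
    {
        boost::unique_lock<boost::shared_mutex> lock(mutex); // exclusive writer
        map[key] = duration;
    }

  private:
    mutable boost::shared_mutex mutex;
    std::unordered_map<std::uint64_t, int> map;
};
```

The plain-mutex variant from the previous tasklist item is the same shape with `boost::unique_lock` in both paths; the ReadWriteLock only pays off when reads dominate writes, which is exactly what the benchmarks below should confirm.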

Requirements / Relations

#4876

@ghoshkaj ghoshkaj requested a review from oxidase April 2, 2018 04:32
@ghoshkaj ghoshkaj self-assigned this Apr 2, 2018
@ghoshkaj ghoshkaj (Member, Author) commented Apr 2, 2018

@oxidase I didn't make the cache a singleton, since all accesses to the cache go through the SearchEngine. Instead I got rid of the boost::thread_specific_ptr and used a std::unique_ptr to simulate a singleton. However, I'm not sure that it is thread safe here, when we create a new cache for a new dataset (timestamp); see the sketch below.
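
To make the concern concrete, the pattern in question is roughly a check-then-act on the shared pointer. A sketch with placeholder types, not the PR's exact code:

```cpp
#include <cstdint>
#include <memory>

struct UnpackingCache // placeholder for the cache type held in SearchEngineData
{
    explicit UnpackingCache(std::uint32_t ts) : timestamp(ts) {}
    std::uint32_t timestamp;
};

std::unique_ptr<UnpackingCache> cache; // one instance shared by all query threads

// Without synchronization this is a classic check-then-act race: two threads
// can both observe a stale timestamp and both reset the pointer, while a
// third thread may still be reading through the old one.
void EnsureCacheForDataset(std::uint32_t current_timestamp)
{
    if (!cache || cache->timestamp != current_timestamp)
        cache = std::make_unique<UnpackingCache>(current_timestamp);
}
```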

@ghoshkaj ghoshkaj (Member, Author) commented Apr 2, 2018

After talking with @danpat, I converted the cache pointer into a cache object: f39e37a. But tests were still failing non-deterministically.

Then I talked to @miccolis and went with putting a lock in front of the call to the clear function from within the SearchEngine: 01e6562. The tests have stopped failing, but this probably makes it slower.

Some conversation came up around whether this shared storage cache should be inter-thread or inter-process shared storage. As of now, this implementation is inter-thread shared storage. The decision will be made after performance measurements of threadlocal vs. inter-thread shared storage. Hopefully we don't have to venture into inter-process shared storage 😬 🤞
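
A sketch of the shape of that fix, with a hypothetical mutex serializing the clear call (type and method names are illustrative; the actual change is in 01e6562):

```cpp
#include <cstdint>
#include <mutex>

struct UnpackingCache // placeholder; method names are illustrative
{
    std::uint32_t GetTimestamp() const { return timestamp; }
    void Clear(std::uint32_t new_ts) { timestamp = new_ts; /* drop all entries */ }
    std::uint32_t timestamp = 0;
};

UnpackingCache cache;   // now a plain object, not a pointer (f39e37a)
std::mutex cache_mutex; // hypothetical guard around the dataset swap

// The lock makes the staleness check and the clear atomic, at the cost of
// briefly serializing every query through this point.
void ClearCacheIfStale(std::uint32_t current_timestamp)
{
    std::lock_guard<std::mutex> guard(cache_mutex);
    if (cache.GetTimestamp() != current_timestamp)
        cache.Clear(current_timestamp);
}
```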

@ghoshkaj ghoshkaj changed the base branch from master to compute-annotations-for-table-at-runtime April 3, 2018 00:52
@ghoshkaj ghoshkaj changed the base branch from compute-annotations-for-table-at-runtime to master April 3, 2018 00:59
@TheMarex TheMarex (Member) commented Apr 3, 2018

One consideration for using a shared storage is that there can be multiple datasets (usually two) in flight at the same time. This is not a problem for thread-local caching, since a single thread will only ever use either the new or the old data. For a shared cache we need to keep track of the dataset used to compute every entry in the cache, which means we need to extend the cache line to include the timestamp.

The full timestamp used by OSRM is a 32-bit integer, but for the cache it is really only important to be able to differentiate between the datasets still in use. Typically we limit the number of datasets to 2, so in theory 1 bit of information would be enough. For the sake of generality I would suggest we use 1 byte (== 256 possible datasets) to start and optimize this later.

The cache design would need to be changed in the following way:

  1. Each cache entry gets an additional generation value of one unsigned char (== 1 byte)
  2. At the beginning of every query we obtain the generation for a given timestamp from the cache.
  3. When inserting a value into the cache the user needs to pass in the generation with the value.
  4. When retrieving a value from the cache the user needs to pass in the generation as well.
  5. The generation would be used to compute the hash for the key in unordered_map.
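
A minimal sketch of steps 1-2, assuming an illustrative `GenerationTable` helper (not a proposed API); the returned generation is then passed along with every insert and lookup, so it ends up in the hashed key (steps 3-5):

```cpp
#include <cstdint>
#include <unordered_map>

using Generation = unsigned char; // 1 byte, per step 1

// Step 2: at the start of a query, map the dataset timestamp to a small
// generation id. Only datasets still in flight occupy a slot, so one byte
// of key space is plenty in practice.
class GenerationTable
{
  public:
    Generation GetGeneration(std::uint32_t timestamp)
    {
        auto it = table.find(timestamp);
        if (it != table.end())
            return it->second;
        return table[timestamp] = next_generation++; // wraps after 256 datasets
    }

  private:
    std::unordered_map<std::uint32_t, Generation> table;
    Generation next_generation = 0;
};
```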

@ghoshkaj ghoshkaj (Member, Author) commented Apr 4, 2018

@TheMarex the steps above are really clever, and adding the unsigned char generation as part of the hash would have been so cool, except that with the vendored version of the LRU cache that I'm using, I don't have access to the hashing function anymore, so I can't add the generation to the unique key signature. Do you have a suggestion for a way around this problem? For now, I've just added the full timestamp as part of the key in the line, but that adds an extra 8 bytes to the cache.

@TheMarex TheMarex (Member) commented Apr 4, 2018

> LRU cache that I'm using, I don't have access to the hashing function anymore

You can do it the same way you currently handle the full timestamp value, by making it part of the key. The generation logic can live in the interface that you build around the LRU cache. It seems like boost automatically generates a hash function for a std::tuple, so extending the actual hash function is unnecessary.
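
For example, relying on the std::tuple hashing mentioned above, the whole composite key can be a tuple and boost::hash can serve as the map's hasher directly (a sketch; the key fields are illustrative):

```cpp
#include <cstdint>
#include <tuple>
#include <unordered_map>
#include <boost/functional/hash.hpp>

// Illustrative key fields: (node_from, node_to, exclude_index, generation)
using CacheKey = std::tuple<std::uint32_t, std::uint32_t, unsigned char, unsigned char>;

// boost::hash already knows how to hash a std::tuple of hashable fields,
// so no hand-written hash function is needed.
std::unordered_map<CacheKey, int, boost::hash<CacheKey>> cache;
```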

@ghoshkaj ghoshkaj (Member, Author) commented Apr 4, 2018

@TheMarex Ooo, actually I could just hash it myself and pass the hash to the library as the key!! Going to do that! (We posted at the same time.) Going to move forward with your suggestion, roughly as below.
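
That is, hash the composite key up front and hand the vendored LRU cache a plain std::size_t, so the library never needs to see the hash function at all (a sketch; `CacheKey` fields as in the previous snippet):

```cpp
#include <cstdint>
#include <tuple>
#include <boost/functional/hash.hpp>

using CacheKey = std::tuple<std::uint32_t, std::uint32_t, unsigned char, unsigned char>;

// The vendored library keys on std::size_t, so the generation (and every
// other field) is folded into the key before it ever reaches the cache.
std::size_t HashKey(const CacheKey &key)
{
    return boost::hash<CacheKey>{}(key);
}
```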

@ghoshkaj ghoshkaj force-pushed the shared-storage-cache branch 2 times, most recently from 7e28a9d to 71892a5 April 5, 2018 12:40
@ghoshkaj ghoshkaj changed the base branch from master to implement-cache April 5, 2018 17:47
@ghoshkaj ghoshkaj force-pushed the implement-cache branch 5 times, most recently from 3bfdaf3 to 729a399 April 8, 2018 23:40
@ghoshkaj ghoshkaj force-pushed the shared-storage-cache branch 2 times, most recently from fac25d1 to 97cd4c7 April 9, 2018 03:30
@ghoshkaj ghoshkaj (Member, Author) commented Apr 9, 2018

All times in ms.

| branch-gitsha \ benchmark | usca 25x25, 10k reqs: 2 threads | 4 threads | 8 threads | 16 threads | usca 100x100, 5k reqs: 2 threads | 4 threads | 8 threads | 16 threads | berlin-latest 5x5, 20k reqs: 2 threads | 4 threads | 8 threads | 16 threads |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| master | 20.556 | 20.807 | 20.750 | 20.755 | 97.250 | 97.154 | 97.017 | 96.763 | | | | |
| threadlocal-cache (250 MB) | | | | | | | | | | | | |
| shared-storage (250 MB) | | | | | | | | | | | | |
| threadlocal-cache (500 MB) | 31.592 | 31.833 | 32.497 | 33.065 | 226.447 | 226.923 | 431.501 | 225.378 | | | | |
| shared-storage (500 MB) | | | | | | | | | | | | |
| threadlocal-cache (1000 MB) | | | | | | | | | | | | |
| shared-storage (1000 MB) | | | | | | | | | | | | |

Running osrm-runner with this command, adjusting the request length and number of requests as appropriate:

./scripts/osrm-runner.js -n 10000 -b -122.65,37.26,-119.25,39.83  -p '/table/v1/driving/{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{};{}' -f '$.durations[0]'

For thread counts, use the -t parameter on ./osrm-routed.

@ghoshkaj ghoshkaj force-pushed the implement-cache branch 2 times, most recently from c7983bd to 0470a4e April 9, 2018 13:15
copy dummy cache over

implement retrievePackedPathFromSearchSpace

calculate packed_path_from_source_to_middle

debugging the retrievePackedPathFromSearchSpace function implementation

adding in packed_path_from_source_to_middle

cache is partway working

unpack path and get duration that way

the computeDurationForEdge method

comment out cache

clean up the code

move vector creation and allocation to outside of loop

hack to not return vectors on facade.GetUncompressedForwardDurations and facade.GetUncompressedReverseDurations

clean up hack

add exclude_index to cache key

clearing cache with timestamp

rebase against vectors->range pr

swapped out unordered_map cache with a boost_lru implementation

calculation for cache size

cleaned up comment about cache size calculations

unit tests

cache uses unsigned char for exclude index

clean up cache and unit tests

shared lock for reads and unique lock for writes

declare cache as object and not pointer in search engine data to simulate singleton declaration

put a lock in front of the clear function to make it threadsafe

remove clear function from cache because cache will never get dropped

unit tests

unit tests and timestamp as part of key

cache generations

hash the key

500 mb

1000 mb

250 mb

rebase against implement-cache
@ghoshkaj ghoshkaj force-pushed the implement-cache branch 3 times, most recently from 26122e3 to 62d7d08 May 7, 2018 15:36
@ghoshkaj ghoshkaj closed this Apr 18, 2020
@DennisOSRM DennisOSRM deleted the shared-storage-cache branch November 6, 2022 14:13