High latency reads in GCS connector #114

Closed

medb opened this issue Jul 12, 2018 · 8 comments

Comments

medb commented Jul 12, 2018

This issue focuses on the high-latency reads discussed in #108.

From @sidseth:

Any thoughts on how first-read performance can be improved, or at least how connections can be reused when 100s or 1000s of files are each being opened once?

The access pattern is quite simple to reproduce.
Generate a bunch of files, then perform the following on each of them (by default, ORC will launch 10 threads to read the stripe details from these files):

byte[] buffer = new byte[16384];
FileSystem fs = FileSystem.get(new URI(pathString), conf);
FSDataInputStream is = fs.open(new Path(pathString));
// Read the last 16 KB of the file (fileLength is the file's total length in bytes).
is.readFully(fileLength - 16384, buffer, 0, 16384);

https://github.com/rajeshbalamohan/hadoop-aws-wrapper is useful for tracking the various invocations made in the readFully call. It may need some modifications. (Can post a PR to that repo tomorrow.)

medb commented Jul 12, 2018

@sidseth I have made some optimizations to address this issue in #110; I plan to mainline them soon.

Could you check whether they help your use case?

sidseth commented Jul 12, 2018

@medb - will try out the patch and get back with details; that will likely be early next week, though.

medb commented Jul 18, 2018

We just released GCS connector 1.9.2, which includes all the performance optimizations.

To take advantage of all available optimizations, set the following properties:

fs.gs.inputstream.fadvise=RANDOM
fs.gs.io.buffersize=524288
fs.gs.inputstream.footer.prefetch.size=65536
fs.gs.performance.cache.enable=true
fs.gs.performance.cache.max.entry.age.ms=1800000
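
For example, these could be applied programmatically before creating the FileSystem (a minimal sketch; the gs:// URI is a placeholder, and the same values can equally go in core-site.xml):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Apply the settings listed above before the FileSystem instance is created.
Configuration conf = new Configuration();
conf.set("fs.gs.inputstream.fadvise", "RANDOM");
conf.setInt("fs.gs.io.buffersize", 524288);
conf.setInt("fs.gs.inputstream.footer.prefetch.size", 65536);
conf.setBoolean("fs.gs.performance.cache.enable", true);
conf.setLong("fs.gs.performance.cache.max.entry.age.ms", 1800000L);
FileSystem fs = FileSystem.get(new URI("gs://my-bucket/"), conf);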

sidseth commented Jul 18, 2018

Thanks @medb. I've got some information from the previous patch (using FADVISE=RANDOM) and will run some more tests with the new changes. From a quick glance, it seemed like the footer reads from multiple files were faster. Let me analyze the results a little more before posting details. Also noticed that listLocatedStatus was taking a very long time (relatively), which causes a slowdown.
Any particular reason to reduce the buffer size from the default 8MB to 0.5MB?

medb commented Jul 18, 2018

Buffer size limits the minimum HTTP range request size in RANDOM mode.
In my SparkSQL tests with ORC files it led to redundant data transfer (up to 2x); my guess is that ORC reads ~1MB pages at a time, so 8MB range requests are wasteful for it.

sidseth commented Jul 19, 2018

Got it.
Here's what I've noticed from a bunch of runs.

  1. The patches for fadvise=RANDOM have not made a significant difference to the footer read times (this was with fadvise=RANDOM and the set of patches from 2 days ago - no fs.gs.performance.cache or footer prefetch). This was from running queries, but a few micro-benchmarks from my local system had similar results. Will run a few micro-benchmarks from GCS nodes to make sure that's the case (a minimal benchmark sketch follows this list).

  2. With the latest set of patches and the settings mentioned above, the readFully call is way faster. However, open is now quite a bit slower, since that's where the prefetch happens. Combined, though, for queries (not micro-benchmarks) I'm seeing a 20-40% improvement.

  3. listLocatedStatus sees a significant improvement with the performance cache enabled. Is this recommended? (Will look at the memory impact before enabling it by default.)
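
For reference, the micro-benchmark is roughly along these lines (a sketch, not the exact harness; the gs:// path argument is a placeholder, the file is assumed to be at least 16 KB, and timing is just wall-clock around open and readFully):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FooterReadBenchmark {
  public static void main(String[] args) throws Exception {
    String pathString = args[0];  // e.g. gs://my-bucket/file.orc (placeholder)
    Configuration conf = new Configuration();
    conf.set("fs.gs.inputstream.fadvise", "RANDOM");

    FileSystem fs = FileSystem.get(new URI(pathString), conf);
    Path path = new Path(pathString);
    long fileLength = fs.getFileStatus(path).getLen();
    byte[] buffer = new byte[16384];

    // Time the open call separately, since footer prefetch (if enabled) happens there.
    long openStart = System.nanoTime();
    FSDataInputStream is = fs.open(path);
    long openNanos = System.nanoTime() - openStart;

    // Time the positioned read of the last 16 KB (the ORC footer/stripe metadata region).
    long readStart = System.nanoTime();
    is.readFully(fileLength - buffer.length, buffer, 0, buffer.length);
    long readNanos = System.nanoTime() - readStart;
    is.close();

    System.out.printf("open: %.1f ms, readFully: %.1f ms%n",
        openNanos / 1e6, readNanos / 1e6);
  }
}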

medb commented Jul 19, 2018

Thanks for sharing test results!

  1. Yes, for footer reads alone RANDOM mode is not necessarily beneficial: the footer is relatively small and at the end of the file, so there is no difference between STREAMING and RANDOM mode in this case.

  2. Regarding open: if the test doesn't actually read the footer to parse data in the file (as I think is the case with the readFully call), you can set fs.gs.inputstream.footer.prefetch.size=0 so that the GCS metadata request is sent in the open call instead of pre-fetching the footer; that could be better for this use case.

  3. Another thing to consider is data staleness when using the performance cache; by default it is set to 5 seconds.

medb commented Aug 8, 2018

GCS connector 1.9.4 was released with improvements to random-read latency.
To take advantage of them, set the following properties:

fs.gs.io.buffersize=0
fs.gs.inputstream.min.range.request.size=262144
fs.gs.performance.cache.enable=true
fs.gs.performance.cache.max.entry.age.ms=300000
fs.gs.inputstream.fast.fail.on.not.found.enable=false
fs.gs.inputstream.fadvise=RANDOM
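
Applied programmatically, this is the same pattern as the earlier 1.9.2 snippet (a sketch; the gs:// URI is a placeholder, and core-site.xml works equally well):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// 1.9.4 settings for low-latency random reads, as listed above.
Configuration conf = new Configuration();
conf.setInt("fs.gs.io.buffersize", 0);
conf.setInt("fs.gs.inputstream.min.range.request.size", 262144);
conf.setBoolean("fs.gs.performance.cache.enable", true);
conf.setLong("fs.gs.performance.cache.max.entry.age.ms", 300000L);
conf.setBoolean("fs.gs.inputstream.fast.fail.on.not.found.enable", false);
conf.set("fs.gs.inputstream.fadvise", "RANDOM");
FileSystem fs = FileSystem.get(new URI("gs://my-bucket/"), conf);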
