initial review notes #1

Open
davepacheco opened this issue Feb 21, 2019 · 5 comments

davepacheco commented Feb 21, 2019

I'm filing this issue as a place to put feedback from my initial review. I reviewed commit d5fb9d6.

First, this is great -- a lot of work for a short time! Most of my major feedback relates to performance and operational considerations.

Here are the major issues I think we definitely want to address:

  • The sqlite update and write to the data files should probably happen once per batch, not per record. I think this will have a pretty major impact on both CPU usage and performance due to not having to fsync the sqlite file for each path written. (See the sketch after this list.)
  • I'd start with a batch size of 5K or 10K. The data path is limited to 1K, but I think we can push this a little further.
  • I'd start with f_delay = 0. It's good that we have this option in there, but we should have just one client per shard, and having one request outstanding all the time seems reasonable.
  • The shard number needs to appear somewhere in the database, either in the table itself or in the name of the database file. Otherwise, concurrent instances of this program will stomp on each other. I'd probably recommend putting this into the database file name for reasons described later. (Another option would be to take the database name as an argument and putting the shard name into the table.)
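
For illustration, here is a minimal sketch of the batched flush, assuming the node "sqlite3" module; the checkpointBatch name, the stream_position table, and the marker column are all hypothetical:

    var fs = require('fs');

    /*
     * Hypothetical sketch: append one batch of paths to the data file, then
     * record the new stream position in sqlite exactly once, so there is one
     * sqlite update (and one fsync) per batch rather than per record.
     */
    function checkpointBatch(db, datafile, paths, streamPosition, callback) {
        fs.appendFile(datafile, paths.join('\n') + '\n', function (err) {
            if (err) {
                callback(err);
                return;
            }
            db.run('UPDATE stream_position SET marker = ?',
                [ streamPosition ], callback);
        });
    }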

Here are the operational considerations we'll need to answer:

  • How many shards will we assign to each process? I think we want to use multiple processes in case this winds up being CPU-bound (which seems not unlikely). Splitting it across multiple processes might also mitigate risk if one of them fails to make forward progress for unforeseen reasons. However, the more processes, the more memory used. If we had 100 processes each using 500 MB of memory, that'd be 50 GB of memory. We also only have a few dozen cores available. I'd probably start with something like 5-10 shards assigned to each process.
  • How will we orchestrate the processes? We can have each one be a separate SMF service instance and have the program take a list of shards on the command line.
  • Does this program leak memory? We'll want to check for this early on.
  • What's the bottleneck? We should profile this on a large-ish dataset (or else plan to profile and iterate in prod -- that's probably fine too). We could make it more efficient by parallelizing the RPC and the file writes, but that's quite a lot more work, and I don't think we should do that unless we find it's not running quickly enough.
  • What's the impact on the system it runs on?
    • disk space: we should calculate the max disk space
      • Jan: KR: 125 bytes per key x 1M keys per node x 1015 nodes = 118 GiB
        (uncompressed)
    • I/O starvation? It's all async I/O, and a total of only 118 GiB -- maybe not so bad? May depend on the rate.
    • CPU starvation: doesn't seem likely to be CPU-bound if we tune it right, but it's hard to know.
  • Impact on metadata tier: one client per shard seems okay, even with no delay

From the above: I think it would be safest if we can deploy this to a new CN on the Manta network.

I'm not sure how big an issue this will be:

  • It may be quite expensive to use appendSync every time instead of just keeping a file open and appending to it. We can try profiling first and seeing.

jmwski pushed a commit that referenced this issue Feb 21, 2019

jmwski commented Feb 21, 2019

  • The sqlite update and write to the data files should probably happen once per batch, not per record. I think this will have a pretty major impact on both CPU usage and performance due to not having to fsync the sqlite file for each path written.

I've added a checkpoint prototype function that is called once per 'end' event received from findObjects.
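
In outline, the flow looks something like this (a simplified sketch, not necessarily the exact code; it assumes node-moray's findObjects emits 'record' and 'end' events as described above, and "checkpoint" stands for the prototype function mentioned here):

    /*
     * Sketch: issue one findObjects request per batch and checkpoint once on
     * 'end' rather than once per record.
     */
    function listBatch(client, bucket, filter, batchSize, checkpoint, cb) {
        var records = [];
        var req = client.findObjects(bucket, filter, { limit: batchSize });

        req.on('record', function (record) {
            records.push(record);
        });
        req.once('error', cb);
        req.once('end', function () {
            checkpoint(records, cb);
        });
    }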

  • I'd start with a batch size of 5K or 10K. The data path is limited to 1K, but I think we can push this a little further.

Agreed, this will remain the default batch size.

  • I'd start f_delay = 0. It's good that we have this option in there, but we should have just one client per shard, and having one request outstanding all the time seems reasonable.

Done.

  • The shard number needs to appear somewhere in the database, either in the table itself or in the name of the database file. Otherwise, concurrent instances of this program will stomp on each other. I'd probably recommend putting this into the database file name for reasons described later. (Another option would be to take the database name as an argument and putting the shard name into the table.)

Database filenames are now formatted as 2.moray.orbit.example.com-stream_position.db. Each process will have one client per shard, and that client will be the exclusive reader and writer of its database.
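
For example (a trivial sketch; the helper name is hypothetical):

    /* Embed the shard's srvDomain in the per-shard database filename. */
    function dbFilenameForShard(srvDomain) {
        return (srvDomain + '-stream_position.db');
    }

    /* dbFilenameForShard('2.moray.orbit.example.com')
     *   => '2.moray.orbit.example.com-stream_position.db' */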

  • How many shards will we assign to each process?

I've made this configurable by adding a shards array to the config file accepted by the feeder.

  • How will we orchestrate the processes?

I think having an array of srvDomains for shards we want to assign to the feeder is reasonable. This could probably look similar to GC_ASSIGNED_SHARDS.
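
Something along these lines, say (a hypothetical sketch of the config file; only the "shards" array comes from the change described above, and the srvDomain values just follow the naming used elsewhere in this thread):

    {
        "shards": [
            "1.moray.orbit.example.com",
            "2.moray.orbit.example.com"
        ]
    }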

  • Does this program leak memory?

Do you have a good litmus test for this? My first thought would be to run the process for a while and look at the output of prstat -s rss to check whether the RSS runs off. I have tried to ensure that all resources are closed in the state_done method and that the feeder always ends up in that state eventually.

I will make sure that the req object returned from node-moray's findObjects is not leaked. My first thought would be that these are gc'd after they emit the 'end' event.

  • What's the bottleneck?

I will profile this while listing a large directory in my test environment. What tools would you use to determine which part of the code is consuming the most time?

  • It may be quite expensive to use appendSync every time instead of just keeping a file open and appending to it. We can try profiling first and seeing.

I have changed the feeder to use a separate writeStream for each instruction object listing (1 per storage node). The feeder will create these streams when it starts up and close them in the done state. Do you think that running in a Manta with many storage nodes (~1,000) might lead to memory issues? I'm not sure how large the objects associated with each stream are, but I could look into that. Do you have a suggestion for how I could find this information?
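
Roughly, the stream handling looks like this (a simplified sketch, not the exact code; the storageIds list, instrDir path, and file naming are stand-ins):

    var fs = require('fs');
    var path = require('path');

    /*
     * Sketch: open one long-lived write stream per storage node at startup
     * instead of doing a synchronous append per instruction.
     */
    function openListingStreams(instrDir, storageIds) {
        var streams = {};
        storageIds.forEach(function (storageId) {
            streams[storageId] = fs.createWriteStream(
                path.join(instrDir, storageId + '.instructions'),
                { flags: 'a' });
        });
        return (streams);
    }

    /* In the done state, close every stream. */
    function closeListingStreams(streams, callback) {
        var ids = Object.keys(streams);
        var pending = ids.length;
        if (pending === 0) {
            callback();
            return;
        }
        ids.forEach(function (storageId) {
            streams[storageId].end(function () {
                if (--pending === 0) {
                    callback();
                }
            });
        });
    }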

davepacheco commented Feb 21, 2019

For both the memory leak check and CPU profiling, it may make sense to sanity-check it in a test environment, but the results could be different in prod (particularly for CPU profiling), so I wouldn't necessarily spend a ton of time on it. I think it would be better to prioritize making it possible to iterate in prod.

  • Does this program leak memory?

Do you have a good litmus test for this?

I would just run it for an extended period and see if the RSS usage stabilizes. That's the only way I know to tell.

  • What's the bottleneck?
    I will profile this while listing a large directory in my test environment. What tools would you use to determine which part of the code is consuming the most time?

I would use prstat to verify whether it's on CPU (and in userland) very much. If so, then I'd profile with:

# dtrace -n 'profile-97/pid == $YOURPID && arg1/{ @[ustack(80, 8192)] = count(); }' -c 'sleep 60' > stacks.out

This will profile at 97Hz for 60 seconds. Then I'd copy this to your laptop and use stackvis:

# stackvis dtrace flamegraph-d3 < stacks.out > stacks.htm

to generate a flame graph. You can open this in your browser.

There are a few notes about this process:

  • If you get any DTrace errors, you may need to bump the second parameter, which is the buffer size for the recorded stack. (It depends on the error, though.)
  • If the flame graph has no text in it, or all the text overflows all the boxes, this is a known stackvis issue. You can try another browser. You can also try to use the old SVG-based flame graph visualization instead with stackvis dtrace flamegraph-svg < stacks.out > stacks.svg.
  • If you find a lot of stacks in the flame graph have the exact same height and don't seem to start at the same bottom frame (which should be _start in libc or the like), then you may need to increase the parameters to ustack (which control how many stack frames to record and how much buffer space to use, respectively).

jmwski commented Feb 22, 2019

I kicked off a test run of the latest commit in us-east (409a14c) in our ops zone there.

I would just run it for an extended period and see if the RSS usage stabilizes. That's the only way I know to tell.

After writing out around 50,000 instructions, the process held steady at about 56M RSS (64M virtual size):

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
494351 root       64M   56M sleep    1    0   0:00:09 0.0% node/11

Here's an example of the output of prstat -p <PID> -mLc 1:

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
494351 root     1.4 0.3 0.0 0.0 0.0 0.0  98 0.1  46  35 359   0 node/1
494351 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0  18   3  48   0 node/9
494351 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0  16   1  48   0 node/7
494351 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0  17   1  48   0 node/8
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0  12   0  30   0 node/10
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/11
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
494351 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/2
Total: 1 processes, 11 lwps, load averages: 0.03, 0.08, 0.08

Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph.htm. If I'm reading that correctly, I think this process is bottlenecked on CRC checksum validation when decoding fast message buffers in the node-moray client.

This data is from a process assigned one shard in us-east. Roughly, the process was able to write 300,000 instructions to a listing file in 45 minutes of execution. To list 1,000,000 instruction objects at this rate would take 150 minutes (around 2.5 hours).

jmwski commented Feb 22, 2019

Here are some datapoints from a lister process running with two assigned shards in us-east:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
544199 root       68M   61M sleep   59    0   0:00:33 0.1% node/11

RSS holds steady at 61M (slightly higher than for the process running with 1 shard). This process spends slightly more time in user mode:

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
544199 root     3.6 0.7 0.1 0.0 0.0 0.0  95 0.3 130  84 975   0 node/1
544199 root     0.0 0.2 0.0 0.0 0.0 100 0.0 0.1  65   4 192   0 node/9
544199 root     0.0 0.2 0.0 0.0 0.0 100 0.0 0.1  48  13 142   0 node/10
544199 root     0.0 0.2 0.0 0.0 0.0 100 0.0 0.0  59   2 173   0 node/11
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/8
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/7
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
544199 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/2
Total: 1 processes, 11 lwps, load averages: 0.13, 0.16, 0.14

Here's the flamegraph for a process with two shards assigned to it: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-2-shards.htm.

After 17 minutes of execution, the process juggling 2 shards wrote out 395,549 instructions (this is a sum across multiple storage nodes).

jmwski commented Feb 22, 2019

4 shards: User time is substantially higher with 4 shards:

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
561881 root      13 1.8 0.1 0.0 0.0 0.0  84 0.8 898 281  5K   0 node/1
561881 root     0.1 5.9 0.0 0.0 0.0  94 0.0 0.3 274 115 811   0 node/8
561881 root     0.2 5.5 0.0 0.0 0.0  94 0.0 0.4 259 153 737   0 node/9
561881 root     0.1 4.9 0.0 0.0 0.0  95 0.0 0.2 223  49 659   0 node/10
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/11
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/7
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
561881 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/2
Total: 1 processes, 11 lwps, load averages: 0.79, 0.47, 0.28

RSS is not significantly affected:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
561881 root       65M   58M sleep   59    0   0:00:08 0.1% node/11

Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-4-shards.htm

8 shards: There's significantly more USR time here:

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
565813 root      17 2.4 0.2 0.0 0.0 0.0  78 1.7  1K 534  7K   0 node/1
565813 root     0.2 1.9 0.0 0.0 0.0  98 0.0 0.3 366  62  1K   0 node/10
565813 root     0.2 1.8 0.0 0.0 0.0  98 0.0 0.2 355  36  1K   0 node/9
565813 root     0.2 1.7 0.0 0.0 0.0  98 0.0 0.3 344  66 987   0 node/11
565813 root     0.1 1.7 0.0 0.0 0.0  98 0.0 0.3 323  58 926   0 node/8
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/7
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
565813 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/2
Total: 1 processes, 11 lwps, load averages: 0.43, 0.28, 0.20

But RSS seems to stabilize without much growth compared to the 4-shard case:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
565813 root       76M   71M sleep   58    0   0:00:54 0.7% node/11

Here's the flamegraph: http://us-east.manta.joyent.com/jan.wyszynski/public/flamegraph-8-shards.htm
