use balanced ssds instead of spinning disks for remote-dev #3079

Merged
merged 2 commits into from
May 12, 2022

Conversation

joaquincasares
Contributor

@joaquincasares joaquincasares commented May 11, 2022

Description

Move new remote-dev boxes from HDDs to SSDs for better random read/write performance, to contend with:

  • the solana-validator writing about 10 Mb/s of data
  • dockerd writing logs
  • ganache writing blocks to disk
  • postgres maintaining its database
  • elasticsearch maintaining its database
  • vscode-server's remote-ssh filewatchers
  • git integration for certain p10k.sh shells
  • normal terminal usage

Ideally, we would separate data for the solana-validator, ganache, postgres, and elasticsearch into a separate mounted/attached disk to lessen disk pressure on the boot disk allowing for a more responsive terminal and IDE experience.

If this change does not show vast improvement, we can try the pd-ssd disk type before considering a dedicated mounted disk as recommended by Docker:

  • "Avoid storing application data in your container’s writable layer using storage drivers. This increases the size of your container and is less efficient from an I/O perspective than using volumes or bind mounts."
  • "Instead, store data using volumes."

Background

After logging into my remote-dev machine, my terminal was laggy.

top showed the CPU was idle about 75% of the time. free -m showed swap was off.

sar -b 1 was showing about 10 Mb/s being written out with about 1000 write transactions per second with 0 reads. However, pre-PR we were seeing sub-MB disk writes with an occasional burst to 100 M/s while post-PR we see a normal flow of at least 4 M/s with frequent bursts of 100 M/s.

sudo iotop -oP showed that the majority of write requests came from solana-validator, dockerd, postgres, and ganache, in that order.

vmstat 1 5 showed about 50k context switches per second.

iostat -x sda shows pre-PR r_await and w_await times (Vocabulary) of 4.37 and 9.74 ms, respectively. Post-PR we see 0.70 and 5.44 ms on one node and 0.44 and 3.87 ms on the node with --fast. This shows that with pd-balanced we tend to wait roughly 84-90% less for reads and 44-60% less for writes.
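
As a rough check of these percentages, the relative reduction can be recomputed from the quoted iostat figures. This is a minimal sketch: the function name `pct_reduction` and the dictionary labels are illustrative and not part of the PR, and the only inputs are the await values listed above (iostat reports them in milliseconds).

```python
# Await times (ms) quoted above from `iostat -x sda`; labels are illustrative.
PRE_PR = {"r_await": 4.37, "w_await": 9.74}
POST_PR = {"r_await": 0.70, "w_await": 5.44}
POST_PR_FAST = {"r_await": 0.44, "w_await": 3.87}


def pct_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100


# Compare each post-PR node against the pre-PR baseline, metric by metric.
for label, sample in (("post-PR", POST_PR), ("post-PR --fast", POST_PR_FAST)):
    for metric, before in PRE_PR.items():
        reduction = pct_reduction(before, sample[metric])
        print(f"{label} {metric}: {reduction:.0f}% less wait")
```

On these figures the --fast node comes out at about 90% (reads) and 60% (writes), while the other node is nearer 84% and 44%.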

pidstat -w 3 10   > /tmp/pidstat.out
pidstat -wt 3 10  > /tmp/pidstat-t.out
strace -c -f -p $PID

The above commands showed that node processes accounted for the majority of the cswch/s. However, strace showed they were stalling on futex calls, implying node-based mutexes were the culprit. That may be expected, but I'm unable to confirm without past data.

Some random glances samples that are representative of a live view are:

  • pre-PR:
    • ctx_sw: 38k
    • inter: 25503
    • sw_int: 6154
    • iowait: Min:7.6 Mean:9.9 Max:12.4
  • post-PR:
    • ctx_sw: 61k
    • inter: 42494
    • sw_int: 10934
    • iowait: Min:8.8 Mean:11.7 Max:16.3
  • post-PR w/ --fast:
    • ctx_sw: 59k
    • inter: 43123
    • sw_int: 10556
    • iowait: Min:10.4 Mean:12.7 Max:18.7

-- Vocabulary

What glances highlights is that:

  • We are able to context switch about 1.6x as often (61% more) when using pd-balanced.
  • We are able to accept about 1.7x as many (67% more) hardware (network, disk) interrupts.
  • We are able to accept about 1.8x as many (78% more) kernel-level interrupts.
  • We are able to saturate the disks for longer with high iowait.
  • Given that we've gone above 10 on our iowait stat, it's an indicator that we should have already had a plan in place to migrate when iowait was above 7.
  • Sadly, even the cheaper pd-balanced SSDs are not enough to run our 4 databases + docker + VSCode server.
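
The ratios in the list above can be recomputed from the glances samples. The sketch below is an assumption-laden illustration, not tooling from the PR: the names `SAMPLES`, `pct_increase`, `compare`, and `needs_migration_plan` are hypothetical, the ctx_sw values are the rounded figures from the description, and the threshold of 7 is the iowait level the description names for having a migration plan.

```python
# glances samples quoted in the description; ctx_sw is rounded to thousands there.
SAMPLES = {
    "pre-PR": {"ctx_sw": 38_000, "inter": 25_503, "sw_int": 6_154, "iowait_mean": 9.9},
    "post-PR": {"ctx_sw": 61_000, "inter": 42_494, "sw_int": 10_934, "iowait_mean": 11.7},
    "post-PR --fast": {"ctx_sw": 59_000, "inter": 43_123, "sw_int": 10_556, "iowait_mean": 12.7},
}

# Threshold taken from the description: have a migration plan once iowait exceeds 7.
IOWAIT_PLAN_THRESHOLD = 7.0


def pct_increase(before: float, after: float) -> float:
    """Return how much larger `after` is than `before`, as a percentage."""
    return (after - before) / before * 100


def compare(baseline: dict, candidate: dict) -> dict:
    """Percentage increase of each counter in `candidate` over `baseline`."""
    return {
        metric: pct_increase(baseline[metric], candidate[metric])
        for metric in ("ctx_sw", "inter", "sw_int")
    }


def needs_migration_plan(sample: dict) -> bool:
    """True when the mean iowait is above the planning threshold."""
    return sample["iowait_mean"] > IOWAIT_PLAN_THRESHOLD


for label in ("post-PR", "post-PR --fast"):
    deltas = compare(SAMPLES["pre-PR"], SAMPLES[label])
    summary = ", ".join(f"{metric} +{value:.0f}%" for metric, value in deltas.items())
    print(f"{label}: {summary}; plan needed: {needs_migration_plan(SAMPLES[label])}")
```

On the quoted samples, post-PR works out to roughly 61% more context switches, 67% more hardware interrupts, and 78% more software interrupts than pre-PR, and every sample's mean iowait is already above the threshold of 7.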

While our write throughputs would effectively double, we don't seem to be close to our current limitation of 400 MB/s.

However, while theoretical disk throughput could be similar between pd-standard and pd-balanced disks, the former are backed by HDDs while the latter are backed by SSDs at a cheaper price point than pd-ssd disks.

Tests

Running A in split view for 4 machines with pd-balanced (pre-PR, post-PR, post-PR with --fast) and pd-ssd (post-PR with --fast).

How will this change be monitored? Are there sufficient logs?

#infra-troubleshooting

@cheran-senthil
Contributor

cheran-senthil commented May 12, 2022

I think this is the associated linear card, https://linear.app/audius/issue/INF-41/prevent-remote-dev-becoming-100percent-full

I don't see how this PR reduces disk space usage. I have follow up PRs that drastically reduce Solana and Ganache disk usage already.

Moreover, I think it is possible that remote boxes are being set up without the default log size limit from https://github.com/AudiusProject/audius-docker-compose/blob/main/setup.sh#L21, which is usually only added when provisioned with --fast (?). If it is not being added, that is probably the cause of most of this trouble.

@dmanjunath
Contributor

dmanjunath commented May 12, 2022

i agree with @cheran-senthil that this probably doesn't solve the initial problem of us writing a ton of data. however i think this is a good change that should give us better read perf, so we should merge it. ultimately we gotta find out how to write less to disk.

@dmanjunath
Contributor

and yes, we should bake the docker log size change into the non fast create flow

@joaquincasares
Contributor Author

@cheran-senthil you're correct that the linear card doesn't quite match this work, but both come from the same focus on disk usage, whether that is disk consumption or disk utilization. If need be, I can create a different ticket for disk utilization.

@dmanjunath what the stats in the description state is that we will get better performance, but even with pd-balanced SSDs we're still hitting high iowaits, which is definitely going to cause problems as we start utilizing elasticsearch as well. Personally, I would hate to hurt morale by delivering a fix that will only work for a few weeks before we start seeing the same level of pain we have had for the past couple of months.

I don't think the correct solution is pd-balanced due to these iowaits of 18, when they should be below 7-10. We're already getting data that the cheapest SSD option works faster but is under even higher disk contention.

For this to really benefit our team, we should either do 2 pd-balanced disks (what I and Docker recommend), or perhaps one pd-ssd (which will be costly, but I can test how well this does in this PR).

Thoughts?

@joaquincasares joaquincasares merged commit a4268ff into master May 12, 2022
@joaquincasares joaquincasares deleted the jc-inf-41 branch May 12, 2022 22:19