use balanced ssds instead of spinning disks for remote-dev #3079
Conversation
I think this is the associated Linear card (https://linear.app/audius/issue/INF-41/prevent-remote-dev-becoming-100percent-full). I don't see how this PR reduces disk space usage. I have follow-up PRs that drastically reduce Solana and Ganache disk usage already. Moreover, I think it is possible that remote boxes are being set up without the default log size limit from https://github.com/AudiusProject/audius-docker-compose/blob/main/setup.sh#L21 that is usually only added when provisioned with
i agree with @cheran-senthil that this probably doesn't solve the initial problem of us writing a ton of data. however i think this is a good change that should give us better read perf, so we should merge it. ultimately we gotta find out how to write less to disk.
and yes, we should bake the docker log size change into the non fast create flow
@cheran-senthil you're correct that the Linear card doesn't quite match this work, but both come from the same focus on disk usage, whether it be disk consumption or disk utilization. If need be, I can create a different ticket for disk utilization. @dmanjunath what the stats in the description state is that we will get better performance, but even with I don't think the correct solution is For this to really benefit our team, we should either do 2 Thoughts?
Description
Move to SSDs from HDDs for new remote-dev boxes for better random read/write patterns to contend with:
Ideally, we would separate data for the solana-validator, ganache, postgres, and elasticsearch into a separate mounted/attached disk to lessen disk pressure on the boot disk allowing for a more responsive terminal and IDE experience.
If this change does not show vast improvement, we can try the `pd-ssd` disk type before considering a dedicated mounted disk as recommended by Docker:
Background
After logging into my remote-dev machine, my terminal was laggy.
- `top` showed the CPU was idle about 75% of the time.
- `free -m` showed swap was off.
- `sar -b 1` was showing about 10 Mb/s being written out with about 1000 write transactions per second and 0 reads. However, pre-PR we were seeing sub-MB disk writes with an occasional burst to 100 M/s, while post-PR we see a normal flow of at least 4 M/s with frequent bursts of 100 M/s.
- `sudo iotop -oP` showed the majority of write requests coming from solana-validator, dockerd, postgres, and ganache, in that order.
- `vmstat 1 5` showed about 50k context switches per second.
- `iostat -x sda` shows pre-PR `r_await` and `w_await` times (Vocabulary) of 4.37 ms and 9.74 ms, respectively. Post-PR we see 0.70 ms and 5.44 ms on one node and 0.44 ms and 3.87 ms on the node with `--fast`. This shows that with `pd-balanced` we wait roughly 84-90% less for reads and 44-60% less for writes.
The above commands showed that `node` processes accounted for the majority of the `cswch/s`. However, `strace` showed they were stalling on `futex` calls, implying node-based mutexes were the culprit, which may be expected, but I'm unable to confirm without past data.
Some random `glances` samples that are representative of a live view are:
[glances screenshots not preserved; one was labeled `--fast`]
-- Vocabulary
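For reference, here is a minimal Python sketch that recomputes the relative drop in await times from the `iostat` figures quoted above. The helper name `pct_reduction` and the node labels are made up for illustration and are not part of any tooling in this repo.

```python
def pct_reduction(before: float, after: float) -> float:
    """Return the percentage drop going from `before` to `after`."""
    return (before - after) / before * 100.0


# Await times in milliseconds, copied from the iostat numbers in this description.
pre_pr = {"r_await": 4.37, "w_await": 9.74}
post_pr = {
    "post-PR": {"r_await": 0.70, "w_await": 5.44},
    "post-PR --fast": {"r_await": 0.44, "w_await": 3.87},
}

# Percentage reduction per node and per metric, relative to the pre-PR box.
reductions = {
    node: {
        metric: pct_reduction(pre_pr[metric], value)
        for metric, value in stats.items()
    }
    for node, stats in post_pr.items()
}

for node, stats in reductions.items():
    print(
        f"{node}: reads {stats['r_await']:.1f}% lower, "
        f"writes {stats['w_await']:.1f}% lower"
    )
```

With these inputs the read wait drops by about 84% and 90%, and the write wait by about 44% and 60%, so the larger figures apply to the box created with `--fast`.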
What `glances` highlights is that:
- [...] `pd-balanced`.
- [...] `iowait`.
- [...] `iowait` stat, it's an indicator that we should have already had a plan in place to migrate when `iowait` was above 7.
- `pd-balanced` SSDs are not enough to run our 4 databases + docker + VSCode server.
While our write throughputs would effectively double, we don't seem to be close to our current limitation of 400 MB/s.
However, while theoretical disk throughput could be similar between `pd-standard` and `pd-balanced` disks, the former are backed by HDDs while the latter are backed by SSDs at a cheaper price point than `pd-ssd` disks:
Tests
Running A in split view for 4 machines with `pd-balanced` (pre-PR, post-PR, post-PR with `--fast`) and `pd-ssd` (post-PR with `--fast`).
How will this change be monitored? Are there sufficient logs?
#infra-troubleshooting