Move off manually maintained NFS servers on GCP #1105
@yuvipanda could you provide a basic explanation of the steps we'd need to take, or how complicated it would be? Anything that could help us understand when to prioritize this and how much work it would be.
To signal that this is reasonably stable infrastructure, we're removing 'pilot' from a few domain names. I've set up a wildcard domain *.2i2c.cloud to point to the pilot-hubs cluster's nginx-ingress service IP, so merging this would just switch these 3 hubs over to drop the .pilot part of their domain. This does mean our 'primary' cluster becomes a bit more important, so we should make it a little more resilient. See 2i2c-org#1105 and 2i2c-org#1102. Ref 2i2c-org#989
Only the shared 2i2c cluster on GCP is still using a manual NFS server now. Let's get rid of it before it goes critical and requires someone with XFS knowledge to fix.
This simply creates the filestore. I'll manually mount this in the existing NFS server, and copy files over before actually moving hubs over. Ref 2i2c-org#1105
I did a copy of the files from the old NFS server to filestore by:
I'm going to run an rsync ( |
rsync takes about 44 minutes now.
Looking at Grafana, temple is the only hub that seems to currently consistently have users. The process for moving a hub over is:
1. Wait for the hub to have no users.
2. Do an rsync just for that hub's home directories. For most of these hubs I expect this will be pretty quick.
3. If no users logged on while this rsync was happening, proceed. Otherwise, go to 1.
4. Delete the PV, PVC and pod providing shared home directory metrics. This will need to be done manually once, as we are changing the PV and that is immutable. This is a rare enough event that not automating this is fine, plus I don't actually want to automate deleting PVs (just in case it deletes data we actually want!).
5. Deploy the change, move things to the new home directory.
6. Repeat!
Ref 2i2c-org#1105
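Steps 1 to 3 of the process above amount to a retry loop, which could be sketched roughly as below. This is only an illustration, not the procedure that was actually run: `sync_until_quiet`, `count_active_users`, `sync_home_dirs` and `wait` are hypothetical names, and the caller is assumed to supply its own way of counting users (for example from the hub's metrics) and of running the per-hub rsync.

```python
import time
from typing import Callable


def sync_until_quiet(
    hub: str,
    count_active_users: Callable[[str], int],  # hypothetical: returns current user count
    sync_home_dirs: Callable[[str], None],     # hypothetical: runs the per-hub rsync
    wait: Callable[[], None] = lambda: time.sleep(60),
    max_attempts: int = 100,
) -> bool:
    """Return True once a sync finished with no users before or after it."""
    for _ in range(max_attempts):
        # Step 1: wait for the hub to have no users.
        if count_active_users(hub) > 0:
            wait()
            continue
        # Step 2: rsync just this hub's home directories.
        sync_home_dirs(hub)
        # Step 3: proceed only if the hub is still idle, otherwise start over.
        if count_active_users(hub) == 0:
            return True
        wait()
    return False
```

Checking the user count only before and after the sync would miss someone who logged in and out again while rsync was running, so a real check would more likely look at login activity over the whole window.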
Looking at Grafana, temple is the only hub that seems to currently consistently have users. The process for moving a hub over is:
1. Wait for the hub to have no users.
2. Do an rsync just for that hub's home directories. For most of these hubs I expect this will be pretty quick.
3. If no users logged on while this rsync was happening, proceed. Otherwise, go to 1.
4. Delete the PV, PVC and pod providing shared home directory metrics. This will need to be done manually once, as we are changing the PV and that is immutable. This is a rare enough event that not automating this is fine, plus I don't actually want to automate deleting PVs (just in case it deletes data we actually want!). Note that if *any* users are active, the PV won't actually delete (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#storage-object-in-use-protection).
5. Deploy the change, move things to the new home directory.
6. Repeat!
Ref 2i2c-org#1105
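The in-use protection mentioned in step 4 shows up on the object itself: a protected PV or PVC that has been asked to delete keeps a `deletionTimestamp` in its metadata while a `kubernetes.io/pv-protection` or `kubernetes.io/pvc-protection` finalizer is still present. A small sketch of that check, assuming the object has already been loaded as a dict (for example from `kubectl get pv <name> -o json`); the function name is made up for illustration:

```python
PROTECTION_FINALIZERS = {
    "kubernetes.io/pv-protection",
    "kubernetes.io/pvc-protection",
}


def deletion_blocked_by_protection(obj: dict) -> bool:
    """True if deletion was requested but a protection finalizer is still holding it."""
    metadata = obj.get("metadata", {})
    finalizers = metadata.get("finalizers") or []
    deletion_requested = "deletionTimestamp" in metadata
    still_protected = any(f in PROTECTION_FINALIZERS for f in finalizers)
    return deletion_requested and still_protected
```

If this returns True for the PV after the manual delete, something is still using the volume, and the delete will only complete once that use ends.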
While doing #2672 I discovered that the cloudbank cluster also uses a manually set up NFS server! Good catch, and I'll eventually migrate that one too. This will make our cluster design more uniform everywhere.
Brings this in line with all our other clusters. Ref 2i2c-org#1105
Am running the same copy steps for cloudbank hubs as well now.
Will wait for there to be no users to get this done. Ref 2i2c-org#1105
Everything on the 2i2c shared cluster has been migrated, and I've shut down the original VM! I'll clean it up in a few days if everything is alright.
Copy completed on cloudbank in about 17h. Will run rsync and prep
Similar to 2i2c-org#2672. That PR has information on the process used to deploy this. Ref 2i2c-org#1105
rsync takes about 50m to complete now! I'll just leave it running in a loop, and find an opportune moment to deploy.
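Leaving rsync "running in a loop" could look something like the sketch below. The exact rsync flags and paths used here are not recorded in this thread, so `--archive`, the function names and the example paths are all assumptions for illustration only.

```python
import subprocess
import time
from typing import Callable, List, Optional


def rsync_command(src: str, dst: str) -> List[str]:
    # Trailing slashes make rsync copy the *contents* of src into dst.
    # --archive (same as -a) preserves permissions, owners and timestamps.
    return ["rsync", "--archive", src.rstrip("/") + "/", dst.rstrip("/") + "/"]


def sync_in_loop(
    src: str,
    dst: str,
    rounds: Optional[int] = None,  # None means loop until interrupted
    pause_seconds: float = 60.0,
    run: Callable[..., object] = subprocess.run,
) -> List[float]:
    """Run rsync repeatedly and return how long each pass took, in seconds."""
    durations: List[float] = []
    done = 0
    while rounds is None or done < rounds:
        started = time.monotonic()
        run(rsync_command(src, dst))
        durations.append(time.monotonic() - started)
        done += 1
        time.sleep(pause_seconds)
    return durations
```

Usage would be along the lines of `sync_in_loop("/old-nfs/homes", "/filestore/homes")`, with both paths hypothetical. Keeping the target close to up to date this way means the final pass before the switch has little left to copy.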
Still waiting for an opportune moment without any skyline users.
Cloudbank fully migrated, and the NFS server there has been shut down! I'll wait for a week then kill it. I've also deleted the disk of the 2i2c shared cluster, although a snapshot remains. I've left the NFS server on still, just in case we need to bring it back up. I'll kill it in a week.
Wieeee thank you for working on this @yuvipanda!!!! I feel a sense of relief with fewer cluster/hub specific exceptions to consider!
Ref 2i2c-org#1105 The servers are all gone now. The infrastructure diagram was also edited to be slightly more accurate.
All cleaned up now!
Context
On the pilot-hubs cluster as well as cloudbank (I think?), user home directories are on a hand-rolled NFS VM. This is problematic: when it fills up, it is a bit complex to resize, and while resizing a PD is fairly standard, I don't think we have it documented.
We should just switch over to paying money for Google Filestore.
This is one of those issues that isn't an issue until it is, and when it is an issue, it's a big issue.
Next steps