
Move off manually maintained NFS servers on GCP #1105

Closed
4 tasks done
yuvipanda opened this issue Mar 15, 2022 · 13 comments
@yuvipanda
Member

yuvipanda commented Mar 15, 2022

Context

On the pilot-hubs cluster, as well as cloudbank (I think?), user home directories are on a hand-rolled NFS VM. This is problematic: when it fills up, it is a bit complex to resize, and while resizing a PD is fairly standard, I don't think we have it documented.

We should just switch over to paying money for Google Filestore.

This is one of those issues that isn't an issue until it is, and when it is an issue, it's a big issue.

Next steps

  • Figure out which clusters are using hand-rolled NFS
  • Modify appropriate cluster tfvars to enable google filestore creation
  • Migrate data from current NFS server to google filestore
  • Decommission old NFS VM
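For the second step, this is roughly a matter of flipping a Filestore flag in the cluster's tfvars file. A hypothetical sketch follows; the variable names here are illustrative assumptions, not necessarily the ones our terraform modules actually use:

```terraform
# Hypothetical tfvars fragment -- check the cluster's terraform
# module for the real variable names and supported tiers.
enable_filestore      = true
filestore_capacity_gb = 1024
```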
@choldgraf
Member

@yuvipanda could you provide a basic explanation of the steps we'd need to take, or how complicated it would be? Anything that could help us understand when to prioritize this and how much work it would be.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 17, 2022
To signal that this is reasonably stable infrastructure,
we're removing 'pilot' from a few domain names. I've set up
a wildcard domain *.2i2c.cloud to point to the pilot-hubs
cluster's nginx-ingress service IP, so merging this would
just switch out these 3 hubs to get rid of the .pilot
part of their domain.

This does mean our 'primary' cluster becomes a bit more
important, so we should make it a little more resilient.
See 2i2c-org#1105
and 2i2c-org#1102

Ref 2i2c-org#989
@yuvipanda
Member Author

Only the shared 2i2c cluster on GCP is still using a manual NFS server now. Let's get rid of it before it goes critical and requires someone with XFS knowledge to fix.

@yuvipanda yuvipanda self-assigned this Jun 14, 2023
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 14, 2023
This simply creates the filestore. I'll manually mount this
on the existing NFS server, and copy files over before actually
moving hubs over.

Ref 2i2c-org#1105
@yuvipanda
Member Author

yuvipanda commented Jun 19, 2023

I did a copy of the files from the old NFS server to filestore by:

  1. Mounting the filestore on the existing NFS server, with sudo mount -t nfs -o soft,noatime 10.234.45.250:/homes/ filestore/
  2. Copying the existing files over, with time sudo cp -rav /export/home-01/homes/ /export/filestore/. This took about 31h to complete.

I'm going to run an rsync (time sudo rsync -a -P --delete /export/home-01/homes/ /export/filestore/homes/) now, and then I'll move over hubs one by one based on when they are unused.
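The two-phase copy above can be sketched as a script. The Filestore IP and paths are the ones from this comment; the run() indirection only records and prints each command, so the sketch is safe to execute as a dry run rather than a real migration:

```shell
# Sketch of the copy: bulk cp first (slow; ~31h here), then a cheap
# rsync pass to catch up on anything that changed meanwhile.
# run() records/prints each command instead of executing it.
CMDS=""
run() { CMDS="$CMDS$*;"; echo "+ $*"; }

# 1. Mount the Filestore export on the existing NFS server.
run sudo mount -t nfs -o soft,noatime 10.234.45.250:/homes/ /export/filestore/

# 2. Bulk copy, then incremental catch-up.
run sudo cp -rav /export/home-01/homes/ /export/filestore/
run sudo rsync -a -P --delete /export/home-01/homes/ /export/filestore/homes/
```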

@yuvipanda
Member Author

rsync takes about 44 minutes now.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 19, 2023
Looking at Grafana, temple is the only hub that seems to
currently consistently have users. The process for moving a
hub over is:

1. Wait for the hub to have no users
2. Do an rsync just for that hub's home directories. For most
   of these hubs I expect this will be pretty quick.
3. If no users had logged on while this rsync was happening,
   proceed. If not, go to 1.
4. Delete the PV, PVC and pod providing shared home directory
   metrics. This will need to be done manually once, as we are
   changing the PV and that is immutable. This is a rare enough
   event that not automating this is fine, plus I don't actually
   want to automate deleting PVs (just in case it deletes data
   we actually want!)
5. Deploy the change, move things to the new home directory.
6. Repeat!

Ref 2i2c-org#1105
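Step 4 of the process above could look roughly like the following. The namespace, label, and object names are assumptions (each hub's actual PV/PVC names may differ), and run() deliberately prints instead of executing, since deleting PVs by hand is exactly the step we don't want automated:

```shell
# Hypothetical commands for step 4 (removing the immutable PV/PVC).
# Object names are illustrative; run() prints rather than executes.
CMDS=""
run() { CMDS="$CMDS$*;"; echo "+ $*"; }

HUB=temple  # example hub name from this thread

run kubectl -n "$HUB" delete pod -l component=shared-volume-metrics
run kubectl -n "$HUB" delete pvc home-nfs
run kubectl delete pv "${HUB}-home-nfs"
```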
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 19, 2023
Looking at Grafana, temple is the only hub that seems to
currently consistently have users. The process for moving a
hub over is:

1. Wait for the hub to have no users
2. Do an rsync just for that hub's home directories. For most
   of these hubs I expect this will be pretty quick.
3. If no users had logged on while this rsync was happening,
   proceed. If not, go to 1.
4. Delete the PV, PVC and pod providing shared home directory
   metrics. This will need to be done manually once, as we are
   changing the PV and that is immutable. This is a rare enough
   event that not automating this is fine, plus I don't actually
   want to automate deleting PVs (just in case it deletes data
   we actually want!). Note that if *any* users are active, the
   PV won't actually delete (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#storage-object-in-use-protection).
5. Deploy the change, move things to the new home directory.
6. Repeat!

Ref 2i2c-org#1105
@yuvipanda
Member Author

While doing #2672 I discovered that the cloudbank cluster also uses a manually set up NFS server! Good catch, and I'll eventually migrate that one too. This will make our cluster design more uniform everywhere.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 19, 2023
Brings this in line with all our other clusters.

Ref 2i2c-org#1105
@yuvipanda
Member Author

Am running the same copy steps for cloudbank hubs as well now.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 20, 2023
Will wait for there to be no users to get this done.

Ref 2i2c-org#1105
@yuvipanda
Member Author

Everything on the 2i2c shared cluster has been migrated, and I've shut down the original VM! I'll clean it up in a few days if everything is alright.

@yuvipanda
Member Author

Copy completed on cloudbank in about 17h. Will run rsync and prep.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 20, 2023
Similar to 2i2c-org#2672.
That PR has information on the process used to deploy this.

Ref 2i2c-org#1105
@yuvipanda
Member Author

rsync takes about 50m to complete now! I'll just leave it running in a loop, and find an opportune moment to deploy.
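The "leave it running in a loop" part is just repeated incremental rsyncs to keep the Filestore copy warm until a quiet deploy window opens. A minimal sketch, bounded to three passes here for illustration (in practice it ran until the switchover), again with run() printing instead of executing:

```shell
# Repeated incremental syncs keep old and new storage converged.
# run() records/prints each command so this is a safe dry run.
CMDS=""
run() { CMDS="$CMDS$*;"; echo "+ $*"; }

i=0
while [ "$i" -lt 3 ]; do
  run sudo rsync -a -P --delete /export/home-01/homes/ /export/filestore/homes/
  i=$((i + 1))
done
```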

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 21, 2023
Similar to 2i2c-org#2672.
That PR has information on the process used to deploy this.

Ref 2i2c-org#1105
@damianavila damianavila moved this from Needs Shaping / Refinement to In progress in DEPRECATED Engineering and Product Backlog Jun 22, 2023
@damianavila damianavila moved this to In Progress ⚡ in Sprint Board Jun 22, 2023
@damianavila damianavila moved this from In Progress ⚡ to Waiting 🕛 in Sprint Board Jun 22, 2023
@damianavila damianavila moved this from Waiting 🕛 to In Progress ⚡ in Sprint Board Jun 22, 2023
@yuvipanda
Member Author

Still waiting for an opportune moment without any skyline users.

@yuvipanda
Member Author

Cloudbank fully migrated, and the NFS server there has been shut down! I'll wait for a week then kill it.

I've also deleted the disk of the 2i2c shared cluster, although a snapshot remains. I've left the NFS server on still, just in case we need to bring it back up. I'll kill it in a week.

@consideRatio
Contributor

Wieeee thank you for working on this @yuvipanda!!!! I feel a sense of relief with fewer cluster/hub specific exceptions to consider!

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 30, 2023
Ref 2i2c-org#1105

The servers are all gone now.

The infrastructure diagram was also edited to be slightly
more accurate.
@yuvipanda
Member Author

All cleaned up now!
