
add snapshot step in etcdutl migrate #17911

Open
siyuanfoundation opened this issue Apr 29, 2024 · 13 comments
Labels: area/etcdutl, priority/important-soon, type/feature

Comments

siyuanfoundation (Contributor) commented Apr 29, 2024

What would you like to be added?

Currently during downgrade, if there is a WAL entry with fields from a higher version, the offline migration tool etcdutl migrate aborts because of the version check. This blocks the offline downgrade even when the higher-version entries have already been committed to the backend.

By contrast, in the online migration, after downgrade is enabled, ClusterVersionSet downgrades the cluster version and forces a snapshot, so the old WAL entries are compacted and do not block the migration.

We need to make the offline migration consistent with the online migration flow by adding a snapshot step before calling schema.Migrate, as sketched below.

Why is this needed?

To make offline downgrade consistent with online downgrade.

#13168 Milestone 3
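
A rough sketch of where the new step would sit in the flow; every helper name below is a hypothetical placeholder rather than a real etcd API:

package main

import "fmt"

// readConsistentIndex would return the last index applied to the backend (placeholder).
func readConsistentIndex(dataDir string) (uint64, error) { return 0, nil }

// snapshotAndCompactWAL would snapshot at index ci and drop WAL entries up to ci,
// mirroring what the online ClusterVersionSet plus forced snapshot achieves (placeholder).
func snapshotAndCompactWAL(dataDir string, ci uint64) error { return nil }

// migrateSchema stands in for the existing schema.Migrate step (placeholder).
func migrateSchema(dataDir, target string) error { return nil }

func offlineMigrate(dataDir, target string) error {
	ci, err := readConsistentIndex(dataDir)
	if err != nil {
		return err
	}
	// New step: entries already committed to the backend are compacted away
	// instead of tripping the WAL version check.
	if err := snapshotAndCompactWAL(dataDir, ci); err != nil {
		return err
	}
	// Existing step: rewrite the storage schema to the target version.
	return migrateSchema(dataDir, target)
}

func main() {
	fmt.Println(offlineMigrate("default.etcd", "3.5"))
}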

siyuanfoundation (Contributor, Author) commented:

cc @serathius @henrybear327

jmhbnz added the area/etcdutl and priority/important-soon labels and removed area/etcdctl on May 2, 2024
ah8ad3 (Contributor) commented May 6, 2024

If you think this task is suitable for me, I would like to give it a try.

siyuanfoundation (Contributor, Author) commented:

@ah8ad3 yes, it would be great if you could give it a try. You can ping me on Slack if you have any questions. Thank you!

ah8ad3 (Contributor) commented May 11, 2024

Sorry, do we have any implementation for getting a snapshot without needing a server? I couldn't find any; everything I found needs an Endpoint to get the snapshot from a server.
If I'm right, we either need to create an offline snapshotter or use snapshot.SaveWithVersion.
Am I right? @siyuanfoundation @serathius

siyuanfoundation (Contributor, Author) commented:

snapshot.SaveWithVersion fetches the snapshot from a remote etcd server, and when the data is connected to an etcd server, it could add new entries, which is not desirable. So it would need to be an offline snapshotter, or just something to clean up the old WAL entries. @ah8ad3, can you dive into the details of EtcdServer.snapshot to see if it is possible?

ah8ad3 (Contributor) commented May 13, 2024

Thanks for the answer. I'll try to understand the code and see what we can do. @siyuanfoundation

ah8ad3 (Contributor) commented May 13, 2024

Also, can you please explain what is meant by

something to clean up the old WAL entries

Clean the version tags off the entries, or delete them?

ah8ad3 (Contributor) commented May 13, 2024

I think I got the answer to my last question.
I dove into the code and found that we have an example raft directory that seems to do the job we want here, but I need confirmation that I am on the right path. @siyuanfoundation @serathius

func (rc *raftNode) maybeTriggerSnapshot(applyDoneC <-chan struct{}) {
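
For context, the pattern that function implements (snapshot at the applied index, then compact the raft log up to it) can be shown in isolation with the go.etcd.io/raft/v3 MemoryStorage. A minimal standalone sketch, not the raftexample code itself:

package main

import (
	"fmt"
	"log"

	"go.etcd.io/raft/v3"
	"go.etcd.io/raft/v3/raftpb"
)

func main() {
	storage := raft.NewMemoryStorage()

	// Pretend five normal entries were appended and applied.
	var ents []raftpb.Entry
	for i := uint64(1); i <= 5; i++ {
		ents = append(ents, raftpb.Entry{Term: 1, Index: i, Type: raftpb.EntryNormal})
	}
	if err := storage.Append(ents); err != nil {
		log.Fatal(err)
	}

	appliedIndex := uint64(5)
	confState := raftpb.ConfState{Voters: []uint64{1}}

	// Snapshot at the applied index, then compact the log up to it;
	// the compaction is what actually drops the old entries.
	if _, err := storage.CreateSnapshot(appliedIndex, &confState, []byte("app state")); err != nil {
		log.Fatal(err)
	}
	if err := storage.Compact(appliedIndex); err != nil {
		log.Fatal(err)
	}

	first, _ := storage.FirstIndex()
	last, _ := storage.LastIndex()
	// After compaction FirstIndex is appliedIndex+1, so first > last
	// means the in-memory log is now empty.
	fmt.Printf("first=%d last=%d\n", first, last)
}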

ah8ad3 (Contributor) commented May 15, 2024

I ended up reading more about raft to understand the design; here is what stopped me from pursuing an offline snapshotter implementation.
[Screenshot from 2024-05-15 09-03-46]

Based on this screenshot, we can't decide offline which member to get a snapshot from; we should only get it from the online leader. So I think we should look into the other suggestion, which was

or just something to clean up the old WAL entries

which I'm looking into to see how to do it.

Update:
I didn't think about the migrate command when writing this; I was thinking about a snapshot of the whole cluster. If we take the data-dir and only check that member, we don't need to care about raft anymore.

siyuanfoundation (Contributor, Author) commented:

I think you can do ReadConsistentIndex directly from the backend, and Compact (i.e. remove) any entries before the consistent index. I don't think we need to keep catch-up entries in the migrate process; is that correct? @serathius
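
For reference, the consistent index can be read straight out of the backend db file: etcd stores it in the meta bucket under the consistent_index key as an 8-byte big-endian integer. A minimal standalone sketch with bbolt (not the etcd cindex API itself; untested):

package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

// readConsistentIndex reads the last index applied to the backend directly
// from the bolt db, mirroring what cindex.ReadConsistentIndex exposes.
func readConsistentIndex(dbPath string) (uint64, error) {
	db, err := bolt.Open(dbPath, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		return 0, err
	}
	defer db.Close()

	var ci uint64
	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("meta"))
		if b == nil {
			return fmt.Errorf("meta bucket not found")
		}
		v := b.Get([]byte("consistent_index"))
		if len(v) != 8 {
			return fmt.Errorf("unexpected consistent_index value: %x", v)
		}
		ci = binary.BigEndian.Uint64(v)
		return nil
	})
	return ci, err
}

func main() {
	ci, err := readConsistentIndex("default.etcd/member/snap/db")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("consistent index:", ci)
}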

ah8ad3 (Contributor) commented May 19, 2024

Thank you @siyuanfoundation. It seems ReadConsistentIndex will get the applied index and will help, but I have a bit of trouble understanding and implementing the downgrade, which I ask about below. If you think this needs more consideration, we can bring it up at the community meeting or ask on Slack.

I tested the current behavior of etcdutl migrate and etcdctl downgrade; it seems:

  1. etcdctl downgrade 3.5 changes the version in the db.
    After running etcd-dump-db iterate-bucket default.etcd/member/snap/db cluster --decode=true:
key="downgrade", value="{\"target-version\":\"3.5.0\",\"enabled\":true}"
key="clusterVersion", value="3.5.0"

and forces a snapshot, which removes all entries from the WAL file:
etcd-dump-logs default.etcd

Snapshot:
term=8 index=34 nodes=[8e9e05c52164694d] confstate={"voters":[10276657743932975437],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=8e9e05c52164694d clusterID=cdf818194e3a8c32 term=8 commitIndex=34 vote=8e9e05c52164694d
WAL entries: 0
term	     index	type	data

Entry types (Normal,ConfigChange) count is : 0
  2. Running etcdutl migrate --data-dir default.etcd --target-version 3.5 --force succeeds.
    After running etcd-dump-db iterate-bucket default.etcd/member/snap/db cluster --decode=true:
key="downgrade", value="{\"target-version\":\"\",\"enabled\":false}"
key="clusterVersion", value="3.6.0"

So the clusterVersion is not changed.

and the WAL:
etcd-dump-logs default.etcd

Snapshot:
term=8 index=34 nodes=[8e9e05c52164694d] confstate={"voters":[10276657743932975437],"auto_leave":false}
Start dumping log entries from snapshot.
WAL metadata:
nodeID=8e9e05c52164694d clusterID=cdf818194e3a8c32 term=9 commitIndex=38 vote=8e9e05c52164694d
WAL entries: 4
lastIndex=38
term	     index	type	data
   9	        35	norm	
   9	        36	norm	header:<ID:7587878799003708674 > cluster_member_attr_set:<member_ID:10276657743932975437 member_attributes:<name:"default" client_urls:"http://localhost:2379" > > 
   9	        37	norm	header:<ID:7587878799003708677 > downgrade_info_set:<> 
   9	        38	norm	header:<ID:7587878799003708678 > cluster_version_set:<ver:"3.6.0" > 

Entry types (Normal,ConfigChange) count is : 4

It seems to me that etcdctl downgrade did a generally better job of downgrading than etcdutl migrate. We need to remove the cluster_version_set:<ver:"3.6.0" > entry, maybe with compaction. But are you sure just compacting will be fine?

In addition, I suggest we remove the extra tools and move all the utilities for db and WAL files (etcd-dump-logs, etcd-dump-db) into etcdutl. Having this many tools is a bit hard to use and confusing.

siyuanfoundation (Contributor, Author) commented:

Thank you @ah8ad3 for digging into this.
etcdctl downgrade works perfectly fine for the online downgrade; you need to start the etcdctl downgrade process before shutting down any node.
etcdutl migrate is meant for migrating the files offline, say if you have the data files for 3.6 and you want to use them to start a 3.5 cluster. This issue is about making etcdutl migrate converge as much as possible with etcdctl downgrade.

When you run etcdutl migrate, it aborts if there are any WAL entries that are not compatible with the old version, as described in this issue.
When you run etcdutl migrate --force, it just changes the storage version and pretends the migration is complete. I think you also uncovered that "clusterVersion" should be changed too.

I think you make a really good point that there are too many tools. We can raise that in the community meeting.

ah8ad3 (Contributor) commented Jun 8, 2024

Sorry it took me a while to respond; this issue was a bit hard for me, as I'm not that confident in the etcd codebase.
I read a lot about snapshotting, and I still think I may not know everything about it.
In the function func (s *EtcdServer) snapshot(snapi uint64, confState raftpb.ConfState), a v2 snapshot is created for compatibility, but I don't think we need that for this issue, so I went with the standard v3 snapshot, which is here, and the final code using it is something like this:

	partpath := "etcd-test-db"
	f, err := os.OpenFile(partpath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, fileutil.PrivateFileMode)
	if err != nil {
		panic(fmt.Errorf("could not open %s (%v)", partpath, err))
	}
	defer f.Close() // close deferred only after a successful open
	c.lg.Info("created temporary db file", zap.String("path", partpath))
	snapshot := c.be.Snapshot() // backend.Snapshot over the bolt db
	defer snapshot.Close()
	if _, err := snapshot.WriteTo(f); err != nil {
		panic(fmt.Errorf("could not write snapshot to %s (%v)", partpath, err))
	}

This creates a snapshot from the db, and I tested the restore with
etcdutl snapshot restore etcd-test-db --data-dir etcd-test-db-dir, which works, and etcd can start with it.
The only part still missing, in my mind, is that we should somehow check the last consistent index against the snapshot, but I haven't thought that through yet.
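
One possible shape for that check, as an untested sketch: read all WAL entries and report anything past the backend's consistent index. This assumes the main-branch (3.6) module layout, and entriesAfter is my own helper name; a real implementation would open the WAL at the latest valid snapshot the way etcd-dump-logs does, and would also version-check the returned entries:

package main

import (
	"fmt"
	"log"
	"path/filepath"

	"go.etcd.io/etcd/server/v3/storage/wal"
	"go.etcd.io/etcd/server/v3/storage/wal/walpb"
	"go.etcd.io/raft/v3/raftpb"
	"go.uber.org/zap"
)

// entriesAfter returns the WAL entries whose index is greater than ci,
// i.e. entries not yet reflected in the backend.
func entriesAfter(dataDir string, ci uint64) ([]raftpb.Entry, error) {
	// Open the WAL from the zero snapshot and read everything; fine for a
	// sketch, but see the caveat about picking the right snapshot above.
	w, err := wal.Open(zap.NewExample(), filepath.Join(dataDir, "member", "wal"), walpb.Snapshot{})
	if err != nil {
		return nil, err
	}
	defer w.Close()

	_, _, ents, err := w.ReadAll()
	if err != nil {
		return nil, err
	}
	var newer []raftpb.Entry
	for _, e := range ents {
		if e.Index > ci {
			newer = append(newer, e)
		}
	}
	return newer, nil
}

func main() {
	// 34 is just an example value taken from the dump above.
	ents, err := entriesAfter("default.etcd", 34)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d WAL entries past the consistent index\n", len(ents))
}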

I also asked about this issue in the main Slack channel https://kubernetes.slack.com/archives/C3HD8ARJ5/p1716904436225239 , and Marek said:

You don't need to remove any files (assuming that #17227 is fixed). You just need to create a snapshot based on the last commit index, assuming that there is newer version entries after last snapshot or database consistent index. (edited)

And also

If there is a incompatible WAL entry after commit index, then we cannot downgrade the storage. We should give up and return error.

So I think this issue can be fixed with the code I suggested, and we should fix the other issue, #17227.
WDYT @siyuanfoundation ?
