Disk snapshots disappearing #2687

Closed
tkald opened this Issue Dec 5, 2018 · 3 comments


tkald commented Dec 5, 2018

Description
Somewhere after the upgrade to OpenNebula 5.6, my VM disk snapshots on Ceph storage started to misbehave.
Disk snapshots previously taken with 5.4 are overwritten by new snapshots: the snapshot index starts again from zero. Sometimes duplicate snapshot indexes also appear.
[screenshot: snapshot list showing duplicate indexes]

Snapshots taken with 5.2 disappear completely from Sunstone if I try to take a new snapshot with 5.6.
Before taking a new disk snapshot:
[screenshot: snapshot list before the new snapshot]
And after:
[screenshot: snapshot list after the new snapshot]
I am also unable to take any new disk snapshots on that disk image.

Listing the RBD disk snapshots on the Ceph cluster shows that the snapshots are still present:

rbd snap ls one/one-93
SNAPID NAME     SIZE
    81 0    24576 MB
   922 1    24576 MB
   923 2    24576 MB

The VM log also shows that OpenNebula is trying to overwrite existing snapshots:

Wed Dec 5 20:21:19 2018 [Z0][VM][I]: New LCM state is DISK_SNAPSHOT
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: Command execution failed (exit code: 17): /var/lib/one/remotes/tm/ceph/snap_create_live r620-4:/var/lib/one//datastores/100/1587/disk.0 0 1587 101
Wed Dec 5 20:21:21 2018 [Z0][VMM][E]: snap_create_live: Command " set -e -o pipefail
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: 
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: if virsh -c qemu:///system domfsfreeze one-1587 ; then
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: trap "virsh -c qemu:///system domfsthaw one-1587" EXIT TERM INT HUP
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: fi
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: 
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: RBD="rbd --id libvirt"
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: 
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: rbd_check_2 one/one-93
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: 
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: rbd --id libvirt snap create one/one-93@0
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: rbd --id libvirt snap protect one/one-93@0" failed: rbd: failed to create snapshot: (17) File exists
Wed Dec 5 20:21:21 2018 [Z0][VMM][E]: Error creating snapshot one/one-93@0
Wed Dec 5 20:21:21 2018 [Z0][VMM][I]: Failed to execute transfer manager driver operation: tm_snap_create_live.
Wed Dec 5 20:21:21 2018 [Z0][VMM][E]: Error creating new disk snapshot: Error creating snapshot one/one-93@0
Wed Dec 5 20:21:21 2018 [Z0][VM][I]: New LCM state is RUNNING
Wed Dec 5 20:21:21 2018 [Z0][LCM][E]: Could not take disk snapshot.
Wed Dec 5 20:23:16 2018 [Z0][VM][I]: New state is ACTIVE
Wed Dec 5 20:23:16 2018 [Z0][VM][I]: New LCM state is DISK_SNAPSHOT
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: Command execution failed (exit code: 17): /var/lib/one/remotes/tm/ceph/snap_create_live r620-4:/var/lib/one//datastores/100/1587/disk.0 1 1587 101
Wed Dec 5 20:23:18 2018 [Z0][VMM][E]: snap_create_live: Command " set -e -o pipefail
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: 
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: if virsh -c qemu:///system domfsfreeze one-1587 ; then
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: trap "virsh -c qemu:///system domfsthaw one-1587" EXIT TERM INT HUP
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: fi
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: 
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: RBD="rbd --id libvirt"
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: 
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: rbd_check_2 one/one-93
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: 
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: rbd --id libvirt snap create one/one-93@1
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: rbd --id libvirt snap protect one/one-93@1" failed: rbd: failed to create snapshot: (17) File exists
Wed Dec 5 20:23:18 2018 [Z0][VMM][E]: Error creating snapshot one/one-93@1
Wed Dec 5 20:23:18 2018 [Z0][VMM][I]: Failed to execute transfer manager driver operation: tm_snap_create_live.
Wed Dec 5 20:23:18 2018 [Z0][VMM][E]: Error creating new disk snapshot: Error creating snapshot one/one-93@1
Wed Dec 5 20:23:18 2018 [Z0][VM][I]: New LCM state is RUNNING
Wed Dec 5 20:23:18 2018 [Z0][LCM][E]: Could not take disk snapshot.

To Reproduce
Steps to reproduce the behavior.

Expected behavior
New disk snapshots are created with unused indexes, and previously taken snapshots remain visible in Sunstone.

Details

  • Affected Component: [Sunstone, Storage]
  • Hypervisor: [KVM on Ubuntu 16.04]
  • Version: [5.6.2]

Additional context
Ceph version 10.2.11

Progress Status

  • Branch created
  • Code committed to development branch
  • Testing - QA
  • Documentation
  • Release notes - resolved issues, compatibility, known issues
  • Code committed to upstream release/hotfix branches
  • Documentation committed to upstream release/hotfix branches

tkald commented Dec 5, 2018

UPDATE:
The 4th snapshot attempt succeeded:
[screenshot: snapshot list after the 4th attempt]

rbd snap ls one/one-93
SNAPID NAME     SIZE
    81 0    24576 MB
   922 1    24576 MB
   923 2    24576 MB
  1613 3    24576 MB

But the old snapshots are still not visible in Sunstone.

For the snapshots taken with 5.4 where duplicate indexes were shown (1st picture of the original post), the old snapshots are no longer present on the Ceph cluster:

rbd snap ls one/one-339
SNAPID NAME     SIZE
  1612 0    81920 MB

vholer commented Dec 13, 2018

In 5.4, there was a problem where the next snapshot ID was always calculated as maximum+1 over the current list of snapshots, which led to snapshot IDs being reused (if some of them were deleted from the tail and OpenNebula was restarted in the meantime). As part of the fix for #2189, we started using a persistent value, NEXT_SNAPSHOT (id), which is now part of the VM / image templates.
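
The reuse problem can be sketched in a few lines of Ruby (illustrative only, not OpenNebula source):

```ruby
# Sketch of the 5.4 ID-reuse problem: the next snapshot ID was
# recomputed as max+1 over the *current* snapshot list, so deleting
# the tail snapshot and restarting OpenNebula hands out an ID that
# may still exist as an RBD snapshot name on the Ceph cluster.
snapshot_ids = [0, 1, 2]

next_id = snapshot_ids.max + 1   # => 3, fine so far

snapshot_ids.pop                 # snapshot 2 deleted from the tail
# after a restart, the ID is recomputed from the remaining list:
next_id = snapshot_ids.max + 1   # => 2, reused; "rbd snap create ...@2"
                                 # then fails with EEXIST if the old
                                 # RBD snapshot "2" was left behind
```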

On database upgrade, NEXT_SNAPSHOT (id) should be calculated from the list of existing snapshots. But it looks to me like there is a wrong condition check in the database migrator, and the calculated NEXT_SNAPSHOT isn't persisted into the templates. This leads to snapshot IDs being reused again from 0 for all existing VMs.

sxml = doc.xpath("//SNAPSHOTS")
if !sxml
    ns = doc.create_element("NEXT_SNAPSHOT")
    ns.content = next_snapshot
    sxml = sxml.first.add_child(ns)
end
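
If that diagnosis is right, the corrected guard would test whether the XPath result is empty rather than nil, since a node set is truthy even when it contains no elements. The sketch below is illustrative only: it uses stdlib REXML to stay self-contained (the real migrator uses Nokogiri, so method names differ), and `persist_next_snapshot` is a hypothetical helper name.

```ruby
require 'rexml/document'

# Illustrative sketch of the corrected migrator logic. Assumption:
# next_snapshot has already been computed as max(snapshot ID) + 1.
# The suspected bug in the quoted code: `if !sxml` never fires because
# the node set is truthy even when empty, so NEXT_SNAPSHOT is never
# written. The guard must test for non-emptiness instead.
def persist_next_snapshot(doc, next_snapshot)
  sxml = REXML::XPath.match(doc, "//SNAPSHOTS")

  unless sxml.empty?                       # not: if !sxml
    ns = REXML::Element.new("NEXT_SNAPSHOT")
    ns.text = next_snapshot.to_s
    sxml.first.add_element(ns)
  end
end

body = REXML::Document.new(
  "<VM><SNAPSHOTS><SNAPSHOT><ID>2</ID></SNAPSHOT></SNAPSHOTS></VM>"
)
persist_next_snapshot(body, 3)
```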

Just a quick note, I might be wrong.

vholer added a commit that referenced this issue Dec 13, 2018

vholer added a commit that referenced this issue Dec 17, 2018

vholer added a commit that referenced this issue Dec 17, 2018

vholer added a commit that referenced this issue Dec 18, 2018

vholer added a commit that referenced this issue Dec 18, 2018


vholer commented Dec 18, 2018

We have:

  1. a database fix when upgrading to 5.7.80, to be merged into master (#2739)
  2. a documentation update (OpenNebula/docs#448) for one-5.6 and one-5.6-maintenance, with workarounds for 5.6 users
  3. TBD - inform users in the forum