Volume creating exception #8
Thanks for the report, that looks like we are forgetting to delete the volume from the
Sure, please
Unfortunately, neither of the two error reports gives more insight into the actual error. Your first post shows that you create a new Resource-Definition, a new Volume-Definition, and the controller successfully (!) auto-places the 2 resources in storage pool 'data' on nodes 'pve1-2', 'pve1-3'. Another question is the following: the two ErrorReports you gave me complain about access to a deleted volume. I would be very interested in how that is possible. What steps did you take before the auto-place failed with the shown error? Did you try some auto-places before and delete them again? Did some errors occur which didn't seem too big of an issue? If a volume claims that it is already deleted, it should already be deleted from the database. That means you can try restarting the controller, which will reload all data from the database. That might fix your current issue, allowing you to use the auto-place feature again. (I'm still interested in what actually happened / how you got into this situation :) )
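A minimal sketch of such a restart, assuming the controller was installed from packages and runs under the usual systemd unit name:

```sh
# Restart the controller so it reloads all objects from its database.
# 'linstor-controller' is the typical packaged unit name; adjust if yours differs.
systemctl restart linstor-controller

# Afterwards, check that nodes and resources came back as expected.
linstor node list
linstor resource list
```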
Quick update: by accident I somehow managed to recreate the "Access to deleted volume" issue, and I think that I fixed that problem. I will not close this issue yet, because the "Access to deleted volume" was more or less just a side-effect of the original bug (why both drbd-resources were Secondary), which we should investigate further here.
OK, I updated linstor to the latest version on all nodes and found out that the satellites go OFFLINE immediately after creating new resources on them. Example: I've prepared a new VM in proxmox. It is a VM, not a container, so it doesn't require filesystem creation. Then I do the following procedure:
There are no error reports,
no exceptions, no errors, nothing more. From the controller side:
UPD: after I stopped the linstor-satellite process I got the following output:
Looks like the reason is a hung
This one really bothers me:
Within 2 ms, 3 different threads reported "Controller connected and authenticated". Are you sure that those satellites are only registered with ONE currently active and running controller node? Edit: the next linstor release will also log the ip:port of the connected and authenticated controller :)
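One hedged way to check that only a single controller is talking to a satellite (3366 is the default plain-connector port on the satellite; adjust if you changed it):

```sh
# On the satellite node: show established connections on the satellite port.
# More than one remote peer here would mean more than one controller is connected.
ss -tnp | grep ':3366'
```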
For LVM, as it bit me today on a xen-server: it is really important to set filters in
Ok, it seems I found out where the problem is:
I used some iscsi devices on this testing environment previously; it seems they weren't disconnected properly.
Linstor's default timeout for external commands is 45 seconds.
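A quick hedged check of whether the LVM calls are anywhere near that limit (vgs/lvs are the kind of external commands the satellite runs for LVM storage pools):

```sh
# Time the LVM scans by hand; anything approaching 45 seconds will trip the timeout.
time vgs
time lvs
```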
No, my iscsi device is working fine.
I'm sure, but I'll check it via tcpdump now
No other connections. I also tried strace, but it shows nothing because the process forks.
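For reference, a hedged strace invocation that does follow the forked children; replace the placeholder with the satellite's actual PID:

```sh
# -f follows child processes, so the short-lived external commands (vgs, lvs, ...)
# show up instead of disappearing into the fork; -e trace=execve keeps the output readable.
strace -f -e trace=execve -p <satellite-pid>
```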
If I restart the controller, it shows the node as
Starting the controller or satellite "by hand", you can pass the
Regarding the change in online-status: if the satellite does not go into ONLINE state (only from connected -> offline), it might be that the satellite is unhappy with some data it initially received from the controller. That might be the satellite's node-name it gets assigned by the controller (which has to match its local
Can you try deleting and re-adding the node? Or do you already have resources deployed there which would be lost when you delete?
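A hedged way to compare the name the node was registered under with what the satellite sees locally, assuming the check is against the node's own hostname:

```sh
# On the satellite: the local node name.
uname -n

# On the controller: the name the node was registered under.
linstor node list
```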
Well, as I said, if I remove any resources from this node it goes to
By the way, I enabled collecting only
Then I started the controller. It thinks for a long time on those actions:
Yes, it is not helping either. I could probably reinstall the system on the satellites, but I want to find the reason for this problem, to avoid similar issues on production in the future.
The lvm commands execute slowly not because of iscsi; it is because the node has a drbd device which is currently Secondary :-/
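A hedged way to confirm that the slow part really is the scan of the Secondary DRBD minor:

```sh
# Confirm the resource is Secondary on this node.
drbdadm status

# Run a very verbose scan and look for the scanner touching /dev/drbd*;
# a Secondary DRBD device refuses to be opened, which is what stalls the scan.
vgs -vvvv 2>&1 | grep -i drbd
```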
The controllers and satellites normally log their connections. It would probably make sense to run both of them in the foreground, to watch what's going on.
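A hedged sketch of doing that; the launcher script locations are assumptions and depend on how linstor-server is packaged on your distribution:

```sh
# Stop the services first so the ports are free.
systemctl stop linstor-controller    # on the controller node
systemctl stop linstor-satellite     # on the satellite node

# Then start the processes by hand and watch their output directly.
/usr/share/linstor-server/bin/Controller   # on the controller node
/usr/share/linstor-server/bin/Satellite    # on the satellite node
```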
@raltnoeder, well, if you don't mind I can arrange that nice trick:
Yes, nodes go OFFLINE immediately after placing resources on them; they are marked as ONLINE again only after I manually switch the device into primary mode (to start syncing):
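That manual switch is presumably something along these lines (a sketch only; <resource> is a placeholder for the actual resource name):

```sh
# Promote the DRBD resource on one node so the initial sync can start.
# --force is only needed when no UpToDate data exists yet.
drbdadm primary --force <resource>

# Watch the sync progress and the peers returning to a healthy state.
drbdadm status <resource>
```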
It also works if I just remove the resource from the node and then restart the linstor-satellite service.
There is some kind of "panic-connection-closing" mechanism in the satellite's method which applies the data received from the controller. However, the command right before the
I'd suggest the same as Robert suggested, namely running at least the satellite (or also the controller) in the foreground and simply watching the output of the processes in case anything suspicious occurs. You could also increase the log-level to DEBUG or even TRACE in the
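A hedged sketch of raising that log level; the location of the logging configuration is an assumption and varies between packages:

```sh
# Locate the logback configuration shipped with the satellite.
find / -name 'logback.xml' 2>/dev/null

# In that file, raise the root logger level (e.g. level="INFO" -> level="TRACE"),
# then restart the satellite so the new level takes effect.
systemctl restart linstor-satellite
```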
@ghernadi thanks for the info, now I know how to improve the logging... Now, I've done the following steps.
Then I create a new proxmox VM: controller-trace.log
OK, the node is going offline: satellite_trace2.txt
The node is still OFFLINE, but it seems the controller is doing something and executes some command on the satellite; I don't see any error messages there. I stop them and try once more: satellite_trace3.txt
Do you see these repeatedly appearing messages? OK, I shut down all components and run, on the node:
The status of the resource changes from
satellite_trace4.txt
Everything is fine after this action.
I've got an error report! Creating a new resource from scratch: controller-trace5.log
Looks like you have some strange setup...
That means that the controller has loaded 1 node ("pve1-3"), 1 resource ("VM-1232-DISK-1") with 1 volume (volume-number 0) from its database. satellite-log:
These two lines say that we are indeed looking at the logs from a satellite called "pve1-3". That's good.
However, in the
Additionally: The controller-log line
Regarding the ErrorReport: fair enough, I will look into this
@ghernadi thanks, the resource is called
PS: oh, I just found that there is a time shift. I hadn't set GMT+2 inside the container, so the time differs by two hours in the logs
After a quick look at
According to the logs, the satellite is now offline? Does it stay offline, or does the controller reconnect to it?
Basically, if I leave both of them for a while I see the following messages on the controller:
I attach a longer log file covering the period where the controller can't connect to the satellite, until it has
satellite-trace6.log
UPD: hmm, the date is right now, but the controller still uses UTC, how is that possible? controller-trace7.log (bad case)
UPD4: logs after executing (after that my node is ONLINE):
controller-trace8.log (good case)
UPD5: satellite-trace8.log was wrong, uploaded it again
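On the time shift: a hedged way to align the container's timezone (the zone name below is a placeholder), keeping in mind that the Java processes only read the timezone at startup, which would explain the controller still logging in UTC until it is restarted:

```sh
# Set the container's timezone (replace Europe/Berlin with your actual zone).
timedatectl set-timezone Europe/Berlin

# The JVM picks the timezone up at startup, so restart the controller
# to get local-time timestamps in its logs.
systemctl restart linstor-controller
```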
Okay, somehow "good news" here: I was able to reproduce your issue .... 🎉
With this code-snippet added, I get pretty much the same behaviour as you show in your log files. In summary: I agree that we will need to investigate why the connection breaks down when the "vgs" command (or other external commands) takes too long. Until then, you might want to configure your lvm.conf in a way that lvm excludes all devices except the actual PVs. This should speed up your vgs command even when a drbd device is running.
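A hedged sketch of such a filter; /dev/sdb is only a placeholder for whichever physical volume actually backs your LINSTOR volume group:

```sh
# In /etc/lvm/lvm.conf, inside the devices { } section, accept only the real PV
# and reject everything else (DRBD minors, stale iSCSI devices, ...):
#
#     global_filter = [ "a|^/dev/sdb$|", "r|.*|" ]
#
# Afterwards, verify that only the intended device is scanned and that the scan is fast again:
pvs
vgs
```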
Good news, |
Adding a quick command to achieve this:
Bug
Versions
0.6.0
2.9.0
5.2
9.0.15
9.5.0
Details
Hi, I'm testing the linstor-proxmox plugin, and I have some problems:
When I try to create an lxc container, I get the following output:
Experimenting, I found out that at the moment of creating the filesystem, both devices are in Secondary
After that I can't create any new devices, because auto-placing is not working anymore:
Full stacktrace: ErrorReport-5B89B429-000001.log
I have a three-node cluster and the following config: