Volume creating exception #8

Closed
kvaps opened this issue Aug 31, 2018 · 31 comments

@kvaps

kvaps commented Aug 31, 2018

Bug

Versions

  • linstor 0.6.0
  • linstor-proxmox plugin 2.9.0
  • proxmox-ve 5.2
  • drbd kernel 9.0.15
  • drbdadm 9.5.0
  • driver: lvm

Details

Hi, I'm testing the linstor-proxmox plugin and I have some problems.

When I try to create an LXC container, I get the following output:

SUCCESS:
Description:
    New resource definition 'vm-555-disk-1' created.
Details:
    Resource definition 'vm-555-disk-1' UUID is: e82d64c5-3a83-48d1-b9e7-6bb342320f0b
SUCCESS:
Description:
    Resource definition 'vm-555-disk-1' modified.
Details:
    Resource definition 'vm-555-disk-1' UUID is: e82d64c5-3a83-48d1-b9e7-6bb342320f0b
SUCCESS:
    New volume definition with number '0' of resource definition 'vm-555-disk-1' created.
SUCCESS:
Description:
    Resource 'vm-555-disk-1' successfully autoplaced on 2 nodes
Details:
    Used storage pool: 'data'
    Used nodes: 'pve1-2', 'pve1-3'
mke2fs 1.43.4 (31-Jan-2017)
Could not open /dev/drbd/by-res/vm-555-disk-1/0: Wrong medium type
WARNING: Satellite connection lost
error with cfs lock 'storage-drbdstorage': Could not remove vm-555-disk-1: exit code 11
TASK ERROR: command 'mkfs.ext4 -O mmp -E 'root_owner=0:0' /dev/drbd/by-res/vm-555-disk-1/0' failed: exit code 1

Experimenting, I found out that at the moment of creating the filesystem both devices are in Secondary.

After that I can't create any new devices, because autoplacing is not working anymore:

Reported error:
===============

Category:                           RuntimeException
Class name:                         AccessToDeletedDataException
Class canonical name:               com.linbit.linstor.AccessToDeletedDataException
Generated at:                       Method 'checkDeleted', Source file 'VolumeData.java', Line #416

Error message:                      Access to deleted volume

Error context:
    Registration of auto-placing resource: 'vm-333-disk-1' failed due to an unknown exception.

Full stacktrace: ErrorReport-5B89B429-000001.log

I have a three-node cluster and the following config:

drbd: drbdstorage
   content images,rootdir
   redundancy 2
   controller 10.28.36.172
   controllervm 103
@ghernadi
Contributor

ghernadi commented Sep 3, 2018

Thanks for the report; it looks like we are forgetting to delete the volume from the FreeSpaceManager's internal volume list. Will investigate further.
The question I am more interested in is what happened that the volume was deleted in the first place? Could you please also attach the ErrorReport-5B89B429-000000.log? Hopefully that gives some insight into why both DRBD resources were Secondary when the plugin tried to create the FS.

@kvaps
Author

kvaps commented Sep 3, 2018

Sure, please
ErrorReport-5B89B429-000000.log

@ghernadi
Contributor

ghernadi commented Sep 4, 2018

Unfortunately, neither of the two error reports gives more insight into the actual error.

Your first post shows that you create a new Resource-Definition and a new Volume-Definition, and the controller successfully (!) auto-places the 2 resources in storage pool 'data' on nodes 'pve1-2' and 'pve1-3'.
For whatever reason one of the satellites fails to create the FS on the newly created volume. We would need other ErrorReports for that (from the satellite node, not from the controller).

Another question is the following: The two ErrorReports you gave me complain about access to a deleted volume. I would be very interested in how that is possible.

What steps did you do before the auto-place failed with the shown error? Did you try some auto-places before and delete them again? Did some errors occur which didn't seem like too big of an issue?

If a volume claims that it is already deleted, it should already be deleted from the database. That means you can try restarting the controller, which will reload all data from the database. That might fix your current issue, allowing you to use the auto-place feature again.
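
On a systemd-based install that would simply be something like:

systemctl restart linstor-controller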

(I'm still interested in what actually happened / how you got into this situation :) )

@ghernadi
Contributor

ghernadi commented Sep 4, 2018

Quick update: by accident I somehow managed to recreate the issue of "Access to deleted volume" and I think that I fixed that problem.

I will not close this issue yet, because the "Access to deleted volume" was more or less just a side-effect of the original bug (why both drbd-resources were secondary), which we should investigate further here.

@kvaps
Author

kvaps commented Sep 5, 2018

OK, I updated linstor to the latest version (0.6.2) on all nodes.

I found out that the satellites go OFFLINE immediately after creating new resources on them.

Example: I've prepared a new VM in Proxmox. It's a VM, not a container, so it doesn't require filesystem creation.
The DRBD devices were prepared, but both stayed in Secondary, because both nodes were marked as OFFLINE.

Then I did the following procedure (the rough commands are shown after the list):

  • Stop linstor-controller.service
  • Restart linstor-satellite.service on all nodes
  • Start linstor-controller.service
  • Watch what changed
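
Roughly these commands (using the systemd units named above):

systemctl stop linstor-controller       # on the controller node
systemctl restart linstor-satellite     # on every satellite node
systemctl start linstor-controller      # on the controller node again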

There are no error reports; linstor-satellite just shows the following in the log after I bring up my linstor-controller:

15:28:56.491 [MainWorkerPool-1] INFO  LINSTOR/Satellite - Controller connected and authenticated
15:28:56.651 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Node 'pve1-2' created.
15:28:56.652 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Node 'pve1-3' created.
15:28:56.653 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Storage pool 'DfltDisklessStorPool' created.
15:28:56.653 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Storage pool 'data' created.
15:28:56.659 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Resource 'vm-1231-disk-1' created for node 'pve1-2'.
15:28:56.659 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Resource 'vm-1231-disk-1' created for node 'pve1-3'.

No exceptions, no errors, nothing more.

From the controller side:

  • Nodes are marked as Connecting after startup, then immediately go OFFLINE.
  • The node which is not holding a DRBD resource is always ONLINE.

@kvaps
Author

kvaps commented Sep 5, 2018

UPD: after I stopped the linstor-satellite process I got the following output:

15:38:53.242 [Thread-10] INFO  LINSTOR/Satellite - Shutdown in progress
15:38:53.248 [StltWorkerPool_0000] ERROR LINSTOR/Satellite - Initialization of storage for resource 'vm-1231-disk-1' volume 0 failed [Report number 5B8FDA04-000001]

15:38:53.248 [MainWorkerPool-7] INFO  LINSTOR/Satellite - Controller connected and authenticated
15:38:53.248 [MainWorkerPool-1] INFO  LINSTOR/Satellite - Controller connected and authenticated
15:38:53.249 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Controller connected and authenticated
15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Shutting down service instance 'DeviceManager' of type DeviceManager
15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Waiting for service instance 'DeviceManager' to complete shutdown
15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Shutting down service instance 'DrbdEventPublisher-1' of type DrbdEventPublisher
15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Waiting for service instance 'DrbdEventPublisher-1' to complete shutdown
15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Shutting down service instance 'DrbdEventService-1' of type DrbdEventService
15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Waiting for service instance 'DrbdEventService-1' to complete shutdown
15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Shutting down service instance 'FileEventService' of type FileEventService
15:38:53.249 [MainWorkerPool-8] ERROR LINSTOR/Satellite - Command 'vgs hv3 -o vg_free --units k --noheadings --nosuffix' returned with exitcode 130. 

Standard out: 


Error message: 

 [Report number 5B8FDA04-000000]

15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Waiting for service instance 'FileEventService' to complete shutdown
15:38:53.249 [Thread-10] INFO  LINSTOR/Satellite - Shutting down service instance 'NetComService' of type NetComService
15:38:53.250 [Thread-10] INFO  LINSTOR/Satellite - Waiting for service instance 'NetComService' to complete shutdown
15:38:53.250 [Thread-10] INFO  LINSTOR/Satellite - Shutting down service instance 'TimerEventService' of type TimerEventService
15:38:53.250 [Thread-10] INFO  LINSTOR/Satellite - Waiting for service instance 'TimerEventService' to complete shutdown
15:38:53.250 [Thread-10] INFO  LINSTOR/Satellite - Shutdown complete

5B8FDA04-000000.log

@kvaps
Author

kvaps commented Sep 5, 2018

Looks like the reason is a hung vgs hv3 -o vg_free --units k --noheadings --nosuffix command;
without linstor it works correctly, but somehow slowly:

# time vgs hv3 -o vg_free --units k --noheadings --nosuffix
  880820224.00

real    0m10.149s
user    0m0.006s
sys     0m0.000s

@ghernadi
Contributor

ghernadi commented Sep 5, 2018

This one really bothers me:

15:38:53.248 [MainWorkerPool-7] INFO LINSTOR/Satellite - Controller connected and authenticated 
15:38:53.248 [MainWorkerPool-1] INFO LINSTOR/Satellite - Controller connected and authenticated 
15:38:53.249 [MainWorkerPool-2] INFO LINSTOR/Satellite - Controller connected and authenticated

Within 2 ms, 3 different threads reported "Controller connected and authenticated". Are you sure that those satellites are registered with only ONE currently active and running controller node?

Edit: the next linstor release will also log the ip:port of the connected and authenticated controller :)

@rck
Member

rck commented Sep 5, 2018

For LVM, as it bit me today on xen-server: it is really important to set filters in lvm.conf. I have already had many linstor "issues" (which are not actual linstor issues) on different platforms which were related to LVM. A slow lvs was usually a sign of that. I then usually set filters allowing the PV devices and ignoring everything else, and after that everything is usually fine.
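
A minimal sketch of what I mean in /etc/lvm/lvm.conf (the device paths here are just examples, use your actual PVs):

devices {
    # accept only the real PVs, reject everything else
    global_filter = [ "a|^/dev/sdb$|", "a|^/dev/sdc$|", "r|.*|" ]
}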

@kvaps
Author

kvaps commented Sep 5, 2018

OK, it seems I found out where the problem is:

# pstree -l 2158200                                                                                                                                                                                                               
java─┬─drbdsetup                                                                                                                                                                                
     ├─vgs                                                                                                                                                              
     └─51*[{java}]     
# ps aux | grep vgs
root     2158869  0.0  0.0  35712  4744 pts/3    D+   15:45   0:00 vgs -o vg_name --noheadings
root     2158872  0.0  0.0  34040  4652 ?        S    15:45   0:00 /sbin/vgscan --ignorelockingfailure --mknodes
# time vgs -o vg_name --noheadings
  hv3

real    0m20.556s
user    0m0.000s
sys     0m0.006s

I used some iSCSI devices on this testing environment previously; it seems they weren't disconnected properly.
That's why these LVM commands take so long to execute. I presume linstor wasn't expecting that, or was it?

@ghernadi
Contributor

ghernadi commented Sep 5, 2018

Linstor's default timeout for external commands is 45 seconds.

@kvaps
Author

kvaps commented Sep 5, 2018

No, my iscsi device is working fine.

Are you sure that those satellites are only registered within ONE currently active and running controller node?

I'm sure, but I'll check it via tcpdump now
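
Something like this, assuming the satellite listens on its default plain port 3366:

tcpdump -ni any 'tcp port 3366'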

@kvaps
Author

kvaps commented Sep 5, 2018

No other connections. I also tried strace, but it shows nothing because of the forked processes.
Is there any debugging option for the satellite?

@kvaps
Author

kvaps commented Sep 5, 2018

If I restart the controller, it shows the node as Connected for a few seconds, then changes it to OFFLINE.

@ghernadi
Contributor

ghernadi commented Sep 5, 2018

Starting the controller or satellite "by hand", you can pass the -d option to enable the debug console. In there you can list the currently open connections, but I'm afraid those 2 ms are not really much time to debug...
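
For example (assuming the start scripts of the linstor-server package are under /usr/share/linstor-server/bin; adjust the path to your installation):

/usr/share/linstor-server/bin/Satellite -d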

Regarding the change in online status: if the satellite does not go into ONLINE state (only from Connected -> OFFLINE), it might be that the satellite is unhappy with some data it initially received from the controller. That might be the satellite's node name assigned by the controller (it has to match its local uname -n), or a version mismatch, etc. Although a version mismatch should be visible via the client as something like OFFLINE (VERSION MISMATCH)...

Can you try deleting and re-adding the node? Or do you already have deployed resources there which would be lost when you delete?
If you can do it, you might have to use the node lost command, as a simple delete would require the satellite to be considered ONLINE :)
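
Roughly (node name and address are placeholders):

linstor node lost <node-name>                # force-remove a node the controller cannot reach
linstor node create <node-name> <node-ip>    # register it again afterwards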

@kvaps
Author

kvaps commented Sep 5, 2018

Well, as I said, if I remove all resources from this node it goes ONLINE without any problems.
There is some problem at a deeper level.

By the way, I enabled collecting only open calls via this strace command:

strace -e open -f -p `pgrep -f satellite` 2>&1 | tee /tmp/stace_log_open.txt

Then I started the controller.

It spends a long time on these open calls:

open("/dev/drbd1004",

stace_log_open.txt

@kvaps
Author

kvaps commented Sep 5, 2018

Can you try deleting and re-adding the node?

Yes, that does not help either. I could probably reinstall the system on the satellites, but I want to find the reason for this problem, to avoid similar issues in production in the future.

@kvaps
Author

kvaps commented Sep 5, 2018

The LVM commands execute slowly not because of iSCSI, but because the node has a DRBD device which is currently Secondary :-/

@raltnoeder
Member

The controllers and satellites normally log their connections. It would probably make sense to run both of them in the foreground, to watch what's going on.
If a controller port is reachable on the internet (or forwarded correctly), we could also add some database entries to create a privileged identity and configure an SSL connector, so we could remotely debug the controller.

@kvaps
Author

kvaps commented Sep 5, 2018

@raltnoeder, well, if you don't mind, I can organize that.
No problem, it's just a testing instance for me. Please let me know the external IP address you will connect from, and I'll configure NAT to the linstor-controller and linstor-satellites for you.

nice trick:

curl icanhazip.com

@kvaps
Author

kvaps commented Sep 6, 2018

Yes, nodes go OFFLINE immediately after placing resources on them; they are marked as ONLINE only after I manually switch the device into Primary mode (to start syncing):

drbdadm primary <resource> --force
drbdadm secondary <resource>

It also works if I just remove the resource from the node and then restart the linstor-satellite service.

@ghernadi
Contributor

ghernadi commented Sep 6, 2018

There is some kind of "panic-connection-closing" mechanism in the satellite's method which applies the data received from the controller. However, the command right before the connection.close() actually generates an ErrorReport.

I'd suggest the same as Robert suggested, namely running at least the satellite (or also the controller) in the foreground and simply watching the output of the processes for anything suspicious. You could also increase the log level to DEBUG or even TRACE in the logback.xml located in /usr/share/linstor-server/lib/conf/.
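
For example, the root logger level in that logback.xml can be raised; a sketch of the relevant part (the appender name is whatever your file already uses):

<root level="TRACE">
    <appender-ref ref="STDOUT" />
</root>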

@kvaps
Author

kvaps commented Sep 6, 2018

@ghernadi thanks for the info, now I know how to improve the logging...

Now, I've done the following steps:

  • Cleaned up the old failed resources
  • Reduced the number of satellites to 1
  • Changed the number of replicas to 1
  • Enabled TRACE for linstor-controller and linstor-satellite
  • Ran both services, linstor-controller and linstor-satellite, in the foreground and collected their output

Then I created a new Proxmox VM:

controller-trace.log
satellite-trace.log

OK, the node goes OFFLINE.
I shut down the satellite and controller, then started them again:

satellite_trace2.txt
controller-trace2.log

The node is still OFFLINE, but it seems the controller is doing something and executes some commands on the satellite; I don't see any error messages there.

I stopped them and tried once more:

satellite_trace3.txt
controller-trace3.log

Do you see these repeatedly appearing messages?
I think something is wrong here.

OK, I shut down all components and ran, on the node:

drbdadm primary --force vm-106-disk-1
drbdadm secondary vm-106-disk-1

The status of the resource changes from Inconsistent to UpToDate; then I try to run the daemons again:

satellite_trace4.txt
controller-trace4.log

Everything is fine after this action.

@kvaps
Author

kvaps commented Sep 6, 2018

I've got an error report!

Creating a new resource from scratch:

controller-trace5.log
satellite_trace5.txt
ErrorReport-5B911484-000000.log

@ghernadi
Contributor

ghernadi commented Sep 6, 2018

Looks like you have a somewhat strange setup...
Just from looking at the first two logs, the controller says:

10:44:50.939 [Main] TRACE LINSTOR/Controller - Loading all Nodes
...
10:44:50.945 [Main] TRACE LINSTOR/Controller - Node loaded from DB (NodeName=pve1-3)
10:44:50.945 [Main] TRACE LINSTOR/Controller - Loaded 1 Nodes
10:44:50.945 [Main] TRACE LINSTOR/Controller - Loading all ResourceDefinitions
...
10:44:50.968 [Main] TRACE LINSTOR/Controller - Loaded 6 ResourceDefinitions
...
10:44:51.041 [Main] TRACE LINSTOR/Controller - Loading all Resources
10:44:51.043 [Main] TRACE LINSTOR/Controller - Loaded all (3) properties for instance (InstanceName=/resources/PVE1-3/VM-1232-DISK-1)
10:44:51.043 [Main] TRACE LINSTOR/Controller - Loaded 1 Resources
...
10:44:51.057 [Main] TRACE LINSTOR/Controller - Loading all Volumes
10:44:51.058 [Main] TRACE LINSTOR/Controller - Loading properties for instance (InstanceName=/volumes/PVE1-3/VM-1232-DISK-1/0)
10:44:51.059 [Main] TRACE LINSTOR/Controller - Loaded all (0) properties for instance (InstanceName=/volumes/PVE1-3/VM-1232-DISK-1/0)
10:44:51.059 [Main] TRACE LINSTOR/Controller - Loaded 1 Volumes

That means that the controller has loaded 1 node ("pve1-3"), 1 resource ("VM-1232-DISK-1") with 1 volume (volume-number 0) from its database.
OK.

satellite-log:

12:32:05.401 [MainWorkerPool-1] DEBUG LINSTOR/Satellite - Executing command: uname -n
12:32:05.402 [Thread-17] DEBUG LINSTOR/Satellite - pve1-3

These two lines say that we are indeed looking at the logs from a satellite called "pve1-3". That's good.

12:32:05.472 [MainWorkerPool-2] DEBUG LINSTOR/Satellite - Start handling oneway call FullSyncData
...
12:32:05.525 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Node 'pve1-2' created.
12:32:05.525 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Node 'pve1-3' created.
...
12:32:05.533 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Resource 'vm-104-disk-2' created for node 'pve1-2'.
12:32:05.533 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Resource 'vm-104-disk-2' created for node 'pve1-3'.
12:32:05.535 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Resource 'vm-1232-disk-1' created for node 'pve1-2'.
12:32:05.535 [MainWorkerPool-2] INFO  LINSTOR/Satellite - Resource 'vm-1232-disk-1' created for node 'pve1-3'.
...
12:32:05.536 [MainWorkerPool-2] TRACE LINSTOR/Satellite - Full sync with controller finished

However, in the FullSync we receive 2 nodes, with 2 resources each (the volume count is unfortunately not logged).
I know that you already checked, but this really looks like there is a second controller involved.
Whenever the satellite shuts down, it forgets ALL data about nodes and resources (running DRBD configurations remain untouched, but the whole "Linstor state" is completely wiped and never persisted anywhere on the satellite). That means, when the satellite prints something about node "pve1-2": where do this node / data / node-name and the corresponding resource(s) come from?

Additionally: the controller log line Sending authentication to satellite <node-name> is always printed before the controller sends an authorization message to the corresponding satellite. When the satellite receives such a message, it performs some checks (like uname -n...) and on success prints Controller connected and authenticated. The controller sends the Sending authentication to satellite ... message 2 times (the second time because the connection broke for some reason and was re-established, okay). But the satellite log line appears 11 times within the log file.

Regarding the ErrorReport: fair enough, I will look into this ConcurrentModificationException.

@kvaps
Author

kvaps commented Sep 6, 2018

@ghernadi thanks,
I presume I made a mistake with satellite-trace.log and copied the wrong log file.
Please take a look at controller-trace5.log and satellite_trace5.txt instead.
There is exactly one node and exactly one resource.

The resource is called test.

PS: oh, I just found that there is a time shift: I hadn't set GMT+2 inside the container, so the time differs by two hours in the logs.

@ghernadi
Contributor

ghernadi commented Sep 6, 2018

After a quick look at controller-trace5.log and satellite_trace5.txt, here is what puzzles me (nothing really special):

  • the unexpected disconnect at the end of both
  • a very different local time on the two machines
  • very long execution times of the vgs commands at the end of the satellite log

After these logs, is the satellite now offline? Does it stay offline, or does the controller reconnect to it?

@kvaps
Author

kvaps commented Sep 6, 2018

  1. As the services were running in the foreground, the logs end where I pressed Ctrl+C after a while.
  2. As I wrote, the timezone wasn't set for the controller container; that is already fixed, but the problem is not there :(
  3. The very long execution is because of the currently non-operable DRBD device:
# drbdadm status
test role:Secondary
  disk:Inconsistent

# strace -e open vgs
...
open("/dev/drbd1000", O_RDONLY|O_DIRECT|O_NOATIME) = -1 EMEDIUMTYPE (Wrong medium type)
open("/dev/drbd1000", O_RDONLY|O_NOATIME) = -1 EMEDIUMTYPE (Wrong medium type)
...

Basically, if I leave both of them running for a while, I see the following messages on the controller:

13:10:28.828 [PlainConnector] DEBUG LINSTOR/Controller - Sending authentication to satellite 'pve1-3'
13:10:31.189 [MainWorkerPool-1] DEBUG LINSTOR/Controller - Start handling API call LstStorPool
13:10:31.190 [MainWorkerPool-1] TRACE LINSTOR/Controller - Running in scope of ApiCall 'LstStorPool' from 10.28.36.161:50000 start
13:10:31.190 [MainWorkerPool-1] TRACE LINSTOR/Controller - Running in scope of ApiCall 'LstStorPool' from 10.28.36.161:50000 finished
13:10:31.190 [MainWorkerPool-1] TRACE LINSTOR/Controller - Running in scope of ApiCall 'LstStorPool' from 10.28.36.161:50000 start
13:10:31.190 [MainWorkerPool-1] TRACE LINSTOR/Controller - Running in scope of ApiCall 'LstStorPool' from 10.28.36.161:50000 finished
13:10:35.828 [TaskScheduleService] TRACE LINSTOR/Controller - Connection to Node: 'pve1-3' lost. Removed from pingList, added to reconnectList.
13:10:35.828 [TaskScheduleService] TRACE LINSTOR/Controller - Event 'connection closed' start: Node: 'pve1-3'
13:10:35.828 [TaskScheduleService] TRACE LINSTOR/Controller - Event 'connection closed' end: Node: 'pve1-3'
13:10:35.829 [TaskScheduleService] TRACE LINSTOR/Controller - Running in scope of ApiCall 'LstStorPool' from 10.28.36.161:50000 start
13:10:35.829 [TaskScheduleService] TRACE LINSTOR/Controller - Running in scope of ApiCall 'LstStorPool' from 10.28.36.161:50000 finished

I've attached longer logs covering a period where the controller can't connect to the satellite; as long as it has the test resource, it shows as OFFLINE (the time is fixed there).

satellite-trace6.log
controller-trace6.log

UPD: hmm, the date is right now, but the controller still uses UTC, how is that possible?
UPD2: OK, the time is fine after dpkg-reconfigure tzdata.
UPD3: synchronized long logs:

controller-trace7.log (bad case)
satellite-trace7.log (bad case)

UPD4: logs after executing the following (after that my node is ONLINE):

drbdadm primary --force test
drbdadm secondary test

controller-trace8.log (good case)
satellite-trace8.log (good case)

UPD5: satellite-trace8.log was wrong; I uploaded it again.

@ghernadi
Contributor

ghernadi commented Sep 6, 2018

Okay, somehow "good news" here: I was able to reproduce your issue .... 🎉
The "not so good news" is that the reproduction includes something like this in our "ExternalCommandExecutor" class:

if (command[0].equals("vgs"))
{
    try
    {
        Thread.sleep(5_000);
    }
    catch (InterruptedException exc)
    {
        exc.printStackTrace();
    }
}

With this code-snippet added, I get pretty much the same behaviour as you show in your log files.

In summary: I agree that we will need to investigate why the connection breaks down when the vgs command (or another external command) takes too long.

Until then, you might want to configure your lvm.conf so that LVM excludes all devices except the actual PVs. This should speed up your vgs command even when a DRBD device is present.

@kvaps
Author

kvaps commented Sep 6, 2018

Good news!
I was glad to help; you're making a nice piece of software, thanks! :)

@kvaps
Author

kvaps commented Sep 6, 2018

Adding /dev/drbd.* to the global_filter in lvm.conf is a usable workaround for this issue for now.

A quick command to achieve this:

sed -i '/^[^#]*global_filter *=/ s/ *] *$/, "r|\/dev\/drbd.*|" ]/' /etc/lvm/lvm.conf
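
After that, the line in /etc/lvm/lvm.conf ends up looking roughly like this (assuming the stock Proxmox filter that already rejects /dev/zd.*):

global_filter = [ "r|/dev/zd.*|", "r|/dev/drbd.*|" ]

and vgs/lvs are fast again even with an Inconsistent drbd device present:

time vgs -o vg_name --noheadings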
