Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConnectionManager loses db server port when CDJ disconnects, but won't gain it again #35

Closed
ben-xo opened this issue Oct 4, 2019 · 12 comments

Comments

@ben-xo
Copy link
Contributor

commented Oct 4, 2019

ConnectionManager -1s the db server port when the relevant CDJ disconnects in the deviceLost() method. However, the paired deviceFound() method is only triggered the first time a device is seen (as DeviceFinder::deliverFoundAnnouncement() has a guard which only delivers if the device is new, on line 220). The upshot is that if the CDJ disconnects and then rejoins the network, its db server port is stuck at -1.

This manifests as a sudden spewing of messages from MetadataFinder (once per CdjStatus packet) all saying the same thing:

ERROR [MetadataFinder metadata request] (MetadataFinder.java:114) - Problem requesting metadata, returning null
java.lang.IllegalStateException: Player 2 does not have a db server requesting metadata
	at org.deepsymmetry.beatlink.dbserver.ConnectionManager.allocateClient(ConnectionManager.java:113)
	at org.deepsymmetry.beatlink.dbserver.ConnectionManager.invokeWithClientSession(ConnectionManager.java:192)
	at org.deepsymmetry.beatlink.data.MetadataFinder.requestMetadataInternal(MetadataFinder.java:112)
	at org.deepsymmetry.beatlink.data.MetadataFinder.access$1100(MetadataFinder.java:41)
	at org.deepsymmetry.beatlink.data.MetadataFinder$8.run(MetadataFinder.java:1978)
	at java.lang.Thread.run(Thread.java:748)

version: 0.5.2

@brunchboy

This comment has been minimized.

Copy link
Member

commented Oct 5, 2019

Thanks for the report. I’m traveling dealing with some family issues for a bit so I probably won’t be able to work on this in the next few days (not even sure when I’ll get WiFi for my laptop, and no Pioneer gear with me) but this is a good clue towards fixing a long-standing problem!

@brunchboy brunchboy added the bug label Oct 5, 2019
@ben-xo

This comment has been minimized.

Copy link
Contributor Author

commented Oct 5, 2019

Hey no worries, hope everything's okay. I'll take a look at it as well. To me it feels a bit like it might be a race condition in the lifecycles.

It's not 100% clear from the docs in what order (or even, which of) the main singletons you need to start(). They all seem to instantiate each other when needed, and then listen to each others lifecycles to tear themselves down; and I feel as if the problem is somewhere in that area.

I have a couple of ideas I'm checking out, but there are some parts of the lifecycle that could just do with a little more logging (such as DeviceFinder::expireDevice()).

I'm have a few hunches, but the main one is that perhaps DeviceFinder::expireDevice() races with deviceFound() (they're both threaded after all), and maybe sometimes fire in the wrong order?

I am reliably able to reproduce the problem if I power-cycle the CDJ; and can only then fix it by cycling my app.

@ben-xo

This comment has been minimized.

Copy link
Contributor Author

commented Oct 5, 2019

I also have a related issue (although not identical): if i start my app first then power-on the CDJ second (or: connect the network cable to an already-on CDJ), then the first exception I get is a socket-timeout:

2019-10-05 11:44:26,520  INFO [main] (Application.java:189) - Starting VirtualCDJ…
2019-10-05 11:44:26,522  INFO [main] (VirtualCdj.java:584) - Found matching network interface en8 (en8), will use address /169.254.55.40/16 [/169.254.255.255]
2019-10-05 11:44:27,873  INFO [main] (Application.java:213) - Starting MetadataFinder…

[ first exception ]

2019-10-05 11:44:37,876  WARN [main] (ConnectionManager.java:292) - Problem requesting database server port number
java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at org.deepsymmetry.beatlink.dbserver.ConnectionManager.requestPlayerDBServerPort(ConnectionManager.java:279)
	at org.deepsymmetry.beatlink.dbserver.ConnectionManager.start(ConnectionManager.java:463)
	at org.deepsymmetry.beatlink.data.MetadataFinder.start(MetadataFinder.java:1358)
	at am.xo.cdjscrobbler.Application.startVirtualCdj(Application.java:218)
	at am.xo.cdjscrobbler.Application.start(Application.java:138)
	at am.xo.cdjscrobbler.Application.main(Application.java:82)
2019-10-05 11:44:37,879  INFO [main] (MetadataFinder.java:351) - Processing device found, number:2, name:"CDJ-2000".
2019-10-05 11:44:37,879  INFO [main] (Application.java:140) - Starting QueueProcessor…
2019-10-05 11:44:38,034  INFO [Thread-1] (MetadataFinder.java:884) - Reporting media mounted in SlotReference[player:2, slot:USB_SLOT]
2019-10-05 11:44:38,042  INFO [Thread-1] (MetadataFinder.java:884) - Reporting media mounted in SlotReference[player:2, slot:CD_SLOT]

[ second exception ]

2019-10-05 11:44:38,044 ERROR [MetadataFinder metadata request] (MetadataFinder.java:124) - Problem requesting metadata, returning null
java.lang.IllegalStateException: Player 2 does not have a db server requesting metadata
	at org.deepsymmetry.beatlink.dbserver.ConnectionManager.allocateClient(ConnectionManager.java:113)
	at org.deepsymmetry.beatlink.dbserver.ConnectionManager.invokeWithClientSession(ConnectionManager.java:192)
	at org.deepsymmetry.beatlink.data.MetadataFinder.requestMetadataInternal(MetadataFinder.java:122)
	at org.deepsymmetry.beatlink.data.MetadataFinder.access$500(MetadataFinder.java:39)
	at org.deepsymmetry.beatlink.data.MetadataFinder$7.run(MetadataFinder.java:1312)
	at java.lang.Thread.run(Thread.java:748)

[... this second exception then loops twice a second ...]
@ben-xo

This comment has been minimized.

Copy link
Contributor Author

commented Oct 5, 2019

Update: I have found that if I drop DeviceFinder::MAXIMUM_AGE down to 2 seconds (instead of 10), then the problem happens a lot less often.

I think this is what's happening (although I base this only on reading the code and looking at log output):

  • network link down -causes-> device announcement fails -causes-> virtualCDJ stops -causes-> MetadataFinder stops
  • My application sees the lifecycle event for MetadataFinder and attempts to restart virtualcdj and metadatafinder immediately.
  • this eventually succeeds, but it takes about 20 to 30 seconds (some of which the CDJ's "Player N" status is flickering as it comes to terms with the network) - MetadataFinder sees a DeviceFound announcement
  • however, in the middle of that process (one would assume 10 seconds in, but it isn't logged), the DeviceFinder sends a deviceLost announcement, which -1's the db server port

Shortening the timeout for the expiry announcement isn't the proper fix for this. I have a suspicion that part of the problem may be that after one round of device disconnections, MetadataFinder is first in the announcement queue, before virtual cdj - it would usually be the other way round.

@ben-xo

This comment has been minimized.

Copy link
Contributor Author

commented Oct 5, 2019

Created pull request #37 . You don't necessarily have to merge it as-is, as it feels like there might be solution that isn't quite so… zero-knowledge. Nevertheless, this fixes it well enough for me, although a network disconnect still incurs a roughly 30 second penalty. But at least now it recovers of its own accord!

@brunchboy

This comment has been minimized.

Copy link
Member

commented Oct 6, 2019

There is definitely a lot more state and synchronized methods scattered around in this stuff than I like, I sometimes think about rewriting it all with a cleaner design now that we know what works and what doesn’t (maybe even in Kotlin so we can just use functions instead of singletons for some things), but on the other hand I always have more interesting things to pursue and new features to add. Thanks for the PR, and I am pretty sure the issue you are working on here is the one described at the start of Deep-Symmetry/beat-link-trigger#18

@ben-xo

This comment has been minimized.

Copy link
Contributor Author

commented Oct 6, 2019

What a small world, I also happen to know Marcus Krooked!

@ben-xo

This comment has been minimized.

Copy link
Contributor Author

commented Oct 6, 2019

Regarding the architecture - (just throwing ideas out) it would probably make sense to introduce a "control plane" device / class/ thread that owned all of the instances and responded to their lifecycles, rather than having the lifecycle of each device work independently. That would mean all of your world-inspecting interactions happen in one place.

@brunchboy

This comment has been minimized.

Copy link
Member

commented Oct 13, 2019

That would be a very different project. The idea here is to allow you to use whatever pieces you need, and have them manage their own dependencies as much as they can. Something like Clojure’s immutable data structures and atomic update functions would help make this a lot less fraught.

Anyway, to help me better understand what is happening here, how many CDJs do you have on your network? And what exactly is happening? In the initial bug report it sounded like you were taking a CDJ off the network. But in your comment about dropping MAXIMUM_AGE to 2 seconds you talk about the entire network going away and coming back, which is a completely different scenario. That should probably be a separate issue, so we can focus on one thing at a time? But I am going to do some of my own experimentation now and see what I can find.

@brunchboy

This comment has been minimized.

Copy link
Member

commented Oct 13, 2019

Also, I take it that you are using a real player number in your application. Does Crate Digger ever succeed at downloading the exported database and start offering the metadata that way? Actually, are you even starting the CrateDigger singleton? It occurs to me that you might not be, and this is a newer and faster way of getting metadata, after an initial pause for it to download and parse the entire database file from the NFS server running in the CDJs. It is, in fact, the only way that you can obtain metadata when there are four physical CDJs on the network. Once you have it running, and once it has got the data for the players’ mounted media, it registers itself with the MetadataFinder and handles metadata requests for those media.

@brunchboy

This comment has been minimized.

Copy link
Member

commented Oct 13, 2019

I can absolutely reproduce the problem of the ConnectionManager getting its database port number stuck at -1 if I power off a player for a while, so I am adding more logging to see how to fix that. But the reason I haven’t been running into this issue myself is that I have been using the CrateDigger approach to getting metadata for almost a year now, and you should too. I need to add some pointers to the MetadataFinder documentation encouraging people to always also call org.deepsymmetry.beatlink.data.CrateDigger.getInstance().start() whenever they fire up the MetadataFinder.

@brunchboy

This comment has been minimized.

Copy link
Member

commented Oct 13, 2019

Ok, you’re not going to believe what is happening here. It turns out that this isn’t any sort of lifecycle management problem or race condition in Beat Link. The ConnectionManager is properly receiving the device found announcement, and asking the player what port to use to reach dbserver, and the player itself is replying -1 to us, and we are recording that in the table! Apparently if you ask a player too soon after it has powered up, it is not yet running a dbserver and it responds this way!

@brunchboy brunchboy closed this in 7c0a295 Oct 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.