Skip to content

Conversation

@georgweiss
Copy link
Collaborator

@georgweiss georgweiss commented Dec 12, 2023

Discovered when analyzing inability to acknowledge alarm:

When client sends producer messages which may/should fail (e.g. due to Kafka config), user will not be aware of the failure as the call to asynchronous producer.send() always succeeds without exception. A get() call is needed to trigger an exception.

This PR adds what is needed to make user aware of failed alarm actions.

Of course one may consider calling get() with a timeout, e.g. to deal with network issues. If so, what should the timeout be set to? 3s? 5s?

@kasemir
Copy link
Collaborator

kasemir commented Dec 12, 2023

What types of errors do you get? A generic "Kafka is inaccessible" type of situation which is already indicated as "Disconnected from alarm server"? Or does everything look great, just those writes fail?

Checking for and logging errors is of course still a good idea. But these are called by the user on the UI thread. Blocking is a bad idea because it could freeze the UI indefinitely.

At minimum, a timeout is necessary, like 10 seconds for a default. But then the UI would still freeze for 10 seconds.

Best would be a timeout like 10 seconds inside each of the AlarmClient methods, plus updating all the calls from for example

server_mode.setOnAction(event ->  client.setMode(! client.isMaintenanceMode()));

which directly performs the call on the UI thread into

server_mode.setOnAction(event ->
    JobManager.schedule("SetClientMode", monitor -> client.setMode(! client.isMaintenanceMode())));     

which runs it in a background thread.

@georgweiss
Copy link
Collaborator Author

In our use case clients must be whitelisted in Kafka in order to be allowed to send a producer message like acknowledge. Currently no exception is triggered if a client is not whitelisted, but of course the action fails. Nothing is logged.

Sure, I can add timeout and use the JobManager instead of blocking the UI thread. Not sure what the Kafka client does, though, in terms of handling connection issues. But an ack action should fail quickly.

@kasemir
Copy link
Collaborator

kasemir commented Dec 12, 2023

clients must be whitelisted in Kafka in order to be allowed to send a producer message like acknowledge

I see. In https://github.com/ControlSystemStudio/phoebus/blob/master/app/alarm/Readme.md there is a section on Auth & Auth. Are you using that, and the whitelist is that list of super.users=...?

@georgweiss
Copy link
Collaborator Author

georgweiss commented Dec 12, 2023

No auth, but rather a list of host addresses allowed to send producer messages.

The idea is to allow some users to acknowledge on certain alarm configs, while only hosts on the control room network are unrestricted.

@kasemir
Copy link
Collaborator

kasemir commented Dec 12, 2023

rather a list of host addresses

That should be of general interest since it's much easier to configure than the full encryption, auth & auth!
Maybe add a blurb to the https://github.com/ControlSystemStudio/phoebus/blob/master/app/alarm/Readme.md, "Simple Access Whitelist" just before the 3 pages of "Encryption, Authentication and Authorization"?

@georgweiss
Copy link
Collaborator Author

Sure, I can do that.

@georgweiss
Copy link
Collaborator Author

As it turns out, most calls to AlarmClient methods where producer.send() is called are actually already invoked using JobManager. Will update those that are not.

@georgweiss
Copy link
Collaborator Author

georgweiss commented Dec 14, 2023

Updated as discussed. Also updated alarm server code to be able to log errors.

Observed that the timeout set in producer.send().get(timeout) is not considered if the client connection to Kafka is dropped. From the documentation I conclude that this is due to Kafka internal logic. Instead the call times out after 60s.
To avoid the get() call to block for 60s, I've added a property setting that will set the "max.block.ms" Kafka client property. Defaults to 10s.

The alarm item configuration dialog I decided to actually hang until a result is available. Reason is that if the call to sendItemConfigurationUpdate fails, en error dialog is shown on top of the configuration dialog. The alternative to close the dialog and later (up to max.block.ms milliseconds) - when the editor dialog is gone - show an error message is in my view worse.

Not added documentation on how to use Kafka config to block producer messages at this point. Need some time to put together something useful.

Copy link
Collaborator

@kasemir kasemir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devastated to lose the divinely inspired source code alignment in AlarmClient.java to the blind application of a soul-crushing "formatter", but otherwise fine.

@georgweiss
Copy link
Collaborator Author

@kasemir, sorry about that, my bad. Did not pay attention when I invoked IntelliJ's diabolic code cleanup tool. I usually do it to get rid of unused imports...

@georgweiss georgweiss merged commit 2fac707 into master Dec 16, 2023
@georgweiss georgweiss deleted the alarm_patches branch December 16, 2023 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants