
Help with Poller #62

Closed
aesculus opened this issue Jul 16, 2023 · 52 comments

Comments

@aesculus

My use case is fairly simple. I want to query my two heat pumps periodically (every 5 seconds or so) and see what state they are in (heating/cooling/standby) and what the physical characteristics are of the units and the zones they are supporting. There are 2 S30 thermostats plus an additional slave thermostat, for a total of 3 zones. I do not want to control or set anything, and I want to do all of this locally (not via the cloud).

At first glance the poller would appear to meet my needs, but I am having some issues getting the results I expect and could use some guidance.

  1. It appears that when you refactored the systemList and zoneList you did not update the poller samples. NBD as it was easy to fix.
  2. Since I only want to try the poller demo, I commented out the "cloud_message_pump_task" and "command_reader_task" from the test_async_local.py file. This left me without a connection to the equipment, so I copied "await s30api.serverConnect()" and placed it at the top of the poller routine. That does seem to connect to the S30.
  3. But when I try to get any value other than the sysId, it's None or blank: login Success homes [1] systems [1]
  4. Iterating through the zones I get only 1 (should have 2 in this connection) and all the fields are blank.
  5. Lastly - and something to worry about once the above is addressed - how do I get the other S30, which is a different system? Do I need to log off and repeat the process with it, or create two instances of the s30api variable, one for each system?

Log
2023-07-16 10:42:58,793 [MainThread ] [DEBUG] serverConnect - Entering
2023-07-16 10:42:58,793 [MainThread ] [DEBUG] Closing Session
2023-07-16 10:42:58,793 [MainThread ] [DEBUG] Creating Session
2023-07-16 10:42:58,794 [MainThread ] [DEBUG] authenticate - Enter
2023-07-16 10:42:58,794 [MainThread ] [DEBUG] login - Enter
2023-07-16 10:42:58,942 [MainThread ] [INFO ] Creating lennox_home homeId [local]
2023-07-16 10:42:58,942 [MainThread ] [INFO ] Updating lennox_home homeIdx [0 homeId [local] homeName [local]
2023-07-16 10:42:58,943 [MainThread ] [INFO ] Creating lennox_system sysId [LCC]
2023-07-16 10:42:58,943 [MainThread ] [INFO ] Update lennox_system idx [0] sysId [LCC]
2023-07-16 10:42:58,943 [MainThread ] [INFO ] login Success homes [1] systems [1]
2023-07-16 10:42:58,943 [MainThread ] [DEBUG] Negotiate - Enter
2023-07-16 10:42:58,944 [MainThread ] [DEBUG] serverConnect - Complete
lsystem LCC
lsystem LCC
lsystem LCC
lsystem LCC

@PeteRager
Owner

Great set of questions. That API sample is really out of date. I'd take a pull request with your updated sample.

You will need to run the message pump, as the configuration is delivered asynchronously in messages. There is no "polling" interface to the equipment; fundamentally it's a pub/sub model. The "cloud_message_pump_task" is mislabeled; it should be called message_pump_task (initially the API only supported the cloud connection).

So basically the way it works is: as the message pump gets messages, it updates its in-memory view of the system. The API bootstraps by requesting subscriptions to a set of topics. The initial messages from the S30 contain the full information for the topic (for example, a list of zones). Subsequent messages provide updates (for example, the temperature in a zone).

The poller task is just periodically reading that in-memory representation.
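For illustration, a minimal poller could look like the sketch below. This is a hedged sketch: the attribute names (system_list, zone_list, name, temperature) are assumptions based on the refactored names discussed above.

import asyncio

async def api_poller_task(s30api) -> None:
    # Periodically read the in-memory view that the message pump keeps current.
    while True:
        for system in s30api.system_list:
            for zone in system.zone_list:
                print(f"[{zone.name}] Temp [{zone.temperature}]")
        await asyncio.sleep(5)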

To set up two connections you create two instances of the API. If you do that, use a different APP_ID for each connection, as the S30s do talk to each other and this will cause them to get confused.

s30api = s30api_async("none", "none", APP_ID, IP_ADDRESS)

What is done in the Home Assistant integration is that the zone temperature sensors set up a callback, and that callback gets called whenever a requested property changes.

https://github.com/PeteRager/lennoxs30/blob/25ce86f288fa4b94aa6096136efd35039467186d/custom_components/lennoxs30/sensor.py#L409C1-L409C1

So that's an approach also.
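Outside of HA, that callback pattern might look roughly like this - a hedged sketch, assuming the zone object exposes registerOnUpdateCallback and these property names, as the linked sensor code suggests:

def zone_changed() -> None:
    # Invoked whenever one of the requested properties changes.
    print(zone.name, zone.temperature)

zone = s30api.system_list[0].zone_list[0]
zone.registerOnUpdateCallback(zone_changed, ["temperature", "humidity"])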

@aesculus
Author

Thanks for the info. I will attempt to play with it a bit more with that knowledge.

Note that I am not using HA. My plan is to write to an InfluxDB database and then read it later with Grafana. So whatever I create will most likely end up as a Docker container that feeds the DB.

@aesculus
Author

Do you see anything wrong with this approach to defining 2 systems? It seems to work OK.

async def multiple_tasks(s30api, s30api2):
    input_coroutines = [
        cloud_message_pump_task(s30api),
        api_poller_task(s30api),
        cloud_message_pump_task(s30api2),
        api_poller_task(s30api2),
    ]
    res = await asyncio.gather(*input_coroutines, return_exceptions=True)
    return res

def main():
    s30api = s30api_async("none", "none", APP_ID, IP_ADDRESS)
    s30api2 = s30api_async("none", "none", APP_ID2, IP_ADDRESS2)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(multiple_tasks(s30api, s30api2))

Console Output (Zone 1 is a separate system):
[Master Bedroom] Temp [80] Humidity [43] SystemMode [cool] FanMode [auto] HumidityMode [off] Cool Setpoint [80] Heat Setpoint [62] Outdoor - Temp [104]
[Core Residence] Temp [78] Humidity [43] SystemMode [cool] FanMode [auto] HumidityMode [off] Cool Setpoint [78] Heat Setpoint [66] Outdoor - Temp [104]
[Zone 1] Temp [81] Humidity [43] SystemMode [cool] FanMode [auto] HumidityMode [off] Cool Setpoint [80] Heat Setpoint [62] Outdoor - Temp [105]

@PeteRager
Owner

That looks great.

A case to handle will be when the connection fails - the message pump task will generate an exception, at which point the connection will need to be reestablished. You can simulate this by rebooting your S30. Or there is a simulator in the repository: in VS Code there are debug configs that start a simulator on localhost on a configurable port (8081, 8082, etc.). You can then point the API at localhost:8081; there's a parameter in the constructor to use HTTP instead of the default HTTPS. I do most dev work against the simulators.
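In other words, something like this should point the API at a simulator - a hedged sketch; the protocol parameter name is taken from the constructor signature quoted later in this thread:

sim_api = s30api_async("none", "none", APP_ID, "localhost:8081", protocol="http")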

There's also a rich set of diagnostic data that can be obtained - for example the heat pump inverter current - useful for energy monitoring.

@aesculus
Author

A case to handle will be when the connection fails - the message pump task will generate an exception at which point the connection will need to be reestablished.

This seems to happen at initialization, but it then recovers. Is the recovery not automatic?

@PeteRager
Owner

The API has no recovery / retry coded in it.

That said, the S30 does keep state information regarding the APPID subscription and message queue, so it is possible that some scenarios do recover if this state info has not been removed. For example, a temporary network glitch would cause the pump to generate an exception, but when the network is restored it should pick up where it left off. I'm not sure of the rules for when the S30 removes the subscriber; I've assumed there must be a timeout, a maximum number of queued messages, and that it gets cleared on restart - but I've never done a detailed analysis. Early on there were indications that failure to disconnect would cause the message queue to grow indefinitely and the device to automatically reboot.

@aesculus
Author

aesculus commented Jul 17, 2023

Just to be sure, are we talking about this exception in the sample?

# Intermittent errors due to Lennox servers, etc., may occur; log and keep pumping.
except S30Exception as e:
    print("Message pump error " + str(e))

From your comment it sounds like you expect some communication errors to occur, but that it should keep running and respond when the connection resumes? Are you saying that may not be the case?

Would some sort of timer, with a reconnection request (await s30api.serverConnect) when it is exceeded, be what you have in mind? The timer would get reinitialized after every successful read? Could it be as simple as that?

Any idea what happens to the buffered data when you close the connection? I suspect it is wiped out, so if I were too aggressive with this I could inadvertently lose data. Maybe wait a number of minutes before a reconnection is called?

I see that when you connect you force a close_session first during the process. Should I add a close_session for the keyboard abort too? In my normal use case this will run forever but could be shut down via Docker, so I guess I will need to find a way to trap the container shutting down and terminate the connection then too.

@aesculus
Author

There's also a rich set of diagnostic data that can be obtained - for example the heat pump inverter current - useful for energy monitoring.

Unfortunately this only appears to be available during/after a diagnostic test. I tried it for one I ran yesterday and got the same values back, and the other S30, which has not had a diagnostic test run on it (at least since a reboot), had None as its values.

This is exactly what I would like to see and plot, though. Any chance there is something lurking somewhere that reports real-time voltage and current?

@aesculus
Author

@PeteRager Any thoughts on how I can tell if a system is currently using the strip heaters but not set to Emergency Heat?

I looked for Aux Heat or stages of heating. One issue is that the variable speed units do not have fixed stages. But at some point they will go into aux heating.

I thought Demand would have done it, but this appears to be a value for the airflow.

@PeteRager
Owner

On error handling: the test program is overly optimistic and just catches errors and continues. That will not work reliably; at some point the code will need to go back through the initialization process - reconnect and resubscribe - and then start the message pump again. When building the HASS integration, I put all the retry logic into the integration; it took a while to get it right. That logic is in this file - it is convoluted, but has been very reliable:

https://github.com/PeteRager/lennoxs30/blob/master/custom_components/lennoxs30/__init__.py
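Outside of HASS, a stripped-down version of that retry logic might look like the sketch below. This is a hedged sketch built only from calls used elsewhere in this thread; the import paths and the RETRY_DELAY constant are assumptions.

import asyncio
from lennoxs30api import s30api_async
from lennoxs30api.s30exception import S30Exception

RETRY_DELAY = 30  # seconds; arbitrary backoff chosen for this sketch

async def resilient_pump_task(s30api) -> None:
    while True:
        try:
            # Go back through initialization: reconnect and resubscribe.
            await s30api.serverConnect()
            for lsystem in s30api.system_list:
                await s30api.subscribe(lsystem)
            # Then pump messages until something fails.
            while True:
                await s30api.messagePump()
                await asyncio.sleep(1)
        except S30Exception as e:
            print("Connection failed: " + str(e))
            await asyncio.sleep(RETRY_DELAY)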

Regarding the last two questions on diagnostic data: call this to turn diags on:

await system.set_diagnostic_level(2)

After that you'll get a stream of diagnostic data forever (or until the S30 reboots or you set it back to zero).

Read this document, as there are some potential stability issues that could arise:
https://github.com/PeteRager/lennoxs30/blob/master/docs/diagnostics.md

The electric heat data is part of the diagnostic data. There is a diagnostic element called #_of_electric_heat_sections_on on the indoor unit. This test shows how to access them:

def test_process_diagnostics():

This is what you are looking for - it looks like diagnostic 13 on equipment 2.
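Reading that value might look roughly like this - a hedged sketch; the equipment/diagnostics container attributes are assumptions based on the test above, and the keys 2 and 13 come from "diagnostic 13 on equipment 2":

system = s30api.system_list[0]
await system.set_diagnostic_level(2)        # turn the diagnostic stream on
diag = system.equipment[2].diagnostics[13]  # "#_of_electric_heat_sections_on"
print(diag.name, diag.value)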

Yes, demand is CFM of airflow.

@aesculus
Author

Wow. Great stuff. This will keep me busy for a bit.

@aesculus
Author

aesculus commented Jul 17, 2023

Quick question. Re-reading your setup instructions for HA, under Emergency Heat I see this statement:

If the Lennox Auxiliary Heat is running, the aux attribute in the HA Climate entity will be set to True and the HA Climate Entity will show Heating

This seems to imply that you can detect if the heat strips have been engaged during normal heating (i.e. 2nd stage or aux heating). Is this true?

This may be all I need for my purposes to monitor excessive heat strip use. Somewhere/somehow I can keep track of how long this goes on per operation/day and make a judgement call on whether it is excessive. It's a high-level use case to catch a broken system before I run up a huge electric bill (don't ask me why this is a concern).

BTW, why are the attributes defrost, outdoor_temperature and aux on the zone vs the system?

@PeteRager
Owner

These indicators are used in a system where there is a heat pump and a gas furnace: when the heat pump is locked out due to the ambient temperature being too low and the gas furnace is running. Whether they work for the heat strips also, I do not know; either way I'll update the docs based on what you find.

Outdoor temperature is on the system object. The other two are on the zone because that's the way the S30 models the data. In general the zone and system properties map 1:1 with a JSON attribute in the zone or system messages.

What could be helpful is to set up HASS in Docker and enable the integration. This will provide a GUI with all the data. Turn on message logging and it'll dump the messages to a file; then, as the strips come on, we can see what is in the messages. It is possible there is a parameter the API is not processing. The API also has the ability to log to a file. Either way, getting the messages will be helpful to see what changes when.

@aesculus
Author

The heat strip experiment will have to wait a bit. Fighting 100F+ temps right now.

I'll look into the HASS Docker thing if I get some time later in the week. The test Python stuff spits out a lot of data and it's easy to add to it on the fly.

@aesculus
Author

Had a few minutes free, so I installed HA via Docker. It all seems to be OK, but when I went to add Lennox S30 as an integration, nothing shows when I search for len or lennox. Thoughts?

@PeteRager
Owner

The integration is a community add-on and so needs to be installed.

The simplest way is to download the latest release:
https://github.com/PeteRager/lennoxs30/releases/tag/2023.6.1

Unzip it and put the contents of custom_components/lennoxs30 into the /config/custom_components/lennoxs30 folder, where /config is the volume used by Docker. Then restart HASS.

Alternatively, install HACS (Home Assistant Community Store) https://hacs.xyz/ and follow the instructions here on how to add the integration: https://github.com/PeteRager/lennoxs30#hacs

@aesculus
Author

OK. Up and running. Both systems added.

Now where do I find these log files?

@PeteRager
Owner

PeteRager commented Jul 18, 2023

This section describes how to enable message logging (ignore the debug logging). The files will appear in the /config directory.

https://github.com/PeteRager/lennoxs30#reporting-bugs

@aesculus
Author

aesculus commented Jul 19, 2023

Just to let you know, I did enable logging and verified it was working. It seems to have about the same detail as I was getting in the test scripts (minus the prints from the zones), but I did not do a detailed comparison.

NOTE: I did not enable the diagnostic feature as I am not ready for that test, so maybe more details appear then.

For now I will return to working on my simple usage collection server (I need to embed your recovery logic). Also interesting to note: one of my systems occasionally has WiFi issues. This came up during some of my tests today, and in HASS I could see the state going from connected to disconnected to connecting, which showed the value of your recovery routines.

@aesculus
Author

aesculus commented Jul 20, 2023

Pete: I think there is evidence that we should be able to detect the heat strips coming on without the diagnostics, based on these reports that Lennox makes available to S30 users:

[image: screenshot of a Lennox usage report]

@aesculus
Author

aesculus commented Aug 1, 2023

@PeteRager

On error handling: the test program is overly optimistic and just catches errors and continues. That will not work reliably; at some point the code will need to go back through the initialization process - reconnect and resubscribe - and then start the message pump again. When building the HASS integration, I put all the retry logic into the integration; it took a while to get it right. That logic is in this file - it is convoluted, but has been very reliable.

I have been studying the file and think I understand much of its logic. I will have to gently remove the HASS stuff, as I will be running it under Docker, just feeding an InfluxDB database for use in a Grafana dashboard.

I must admit it's rather complex, but I think I get the gist of it. Now my question is much more basic.
Do you have a sample or suggestion on where I would:

  • Jump into the code (i.e. start the processes)
  • Where/how I would capture the message pump data for my DB feed
  • Where/how I would enter the shutdown logic
  • Any other high-level calls I should prepare to manage the solution

@PeteRager
Owner

It's a good question.

Have you looked at the InfluxDB connector in HASS? I think you can configure it to send any entity you want to Influx. This may be a no-code solution.

I have had a couple of requests to create an MQTT server, as this would allow integration with lots of systems. That model would be something like this: https://fullstackenergy.com/mqtt-into-influx/ - with the benefit of being widely applicable.

Jump into the code (i.e. start the processes)

  • Don't understand this question.

Where/how I would capture the message pump data for my DB feed?

  • Derive a class from the API and override processMessage(); this function will get called on every message with a JSON dict.

Where/how I would enter the shutdown logic?

  • Docker should have some type of notification hook that gets called when the container is getting shut down.

Any other high-level calls I should prepare to manage the solution?

@aesculus
Author

aesculus commented Aug 2, 2023

Thanks for getting back so quickly. Sorry for not being more complete in my questions.

I am good on understanding how to inject my data into InfluxDB via Docker, and also on when to trigger the shutdown event.

My questions were more about what the calls would look like to the init example you built for HASS. Clearly I don't need all the HASS setup functionality and will replace that with my own variable-setting routines. But after that, how do I kick off the processes to poll the S30?

For example, it looks like I would start by calling async_setup_entry. Would this then cascade through all the other routines buried in the manager class?

And to stop the processes, call async_unload_entry?

And I notice that the return received is the result of the message routine. What do I need to do to subscribe to this so I can parse the message stream in my code?

Sorry but I am a Python newbie.

@aesculus
Author

aesculus commented Aug 3, 2023

@PeteRager Just pinging you in case you did not see my last post.

@PeteRager
Owner

PeteRager commented Aug 3, 2023

Yes, your statements regarding async_setup_entry are correct; calling that should kick off the process, and likewise async_unload_entry should stop it.

I have often thought that untangling the manager class from HASS and having it in the API would make the API more complete and would make the HASS integration simpler.

There is no callback handler to get the JSON messages directly, so our options are to add one to the API, or to derive a class from the API and instantiate that instead.

This is Python pseudo-code that may need tweaking, but it should get you going in the right direction.

class MyAPI(s30api_async):
    def __init__(
        self,
        username: str,
        password: str,
        app_id: str,
        ip_address: str = None,
        protocol: str = "https",
        pii_message_logs=True,
        message_debug_logging=True,
        message_logging_file=None,
        timeout: int = None,
    ):
        # Call super().__init__() with the parameters
        super().__init__(
            username,
            password,
            app_id,
            ip_address=ip_address,
            protocol=protocol,
            pii_message_logs=pii_message_logs,
            message_debug_logging=message_debug_logging,
            message_logging_file=message_logging_file,
            timeout=timeout,
        )

    def processMessage(self, message):
        # Your code here: inspect the JSON dict
        super().processMessage(message)

And then you'd instantiate this class instead of s30api_async.
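For example, mirroring the constructor call used earlier in the thread:

s30api = MyAPI("none", "none", APP_ID, IP_ADDRESS)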

@aesculus
Author

aesculus commented Aug 4, 2023

Outstanding. Give me a few days to stitch something together and I will report back my findings.

@aesculus
Author

aesculus commented Aug 7, 2023

I have often thought that untangling the manager class from HASS and having it in the API would make the API more complete and would make the HASS integration simpler.

Looking over the complexity of stripping out HASS and overriding the message handler, I have decided to wait and see if you end up doing this.

In the meantime I will see if I can hack the original sample above with some very basic recovery logic.

My brain hurts. :-(

@aesculus
Author

OK, I am back at this again. Have a few questions:
And then you'd instantiate this class instead of s30api_async
This I suppose is: self.api: s30api_async = MyAPI(

Call super().__init__() with the parameters

Confused about this. My plan was to create a new file based on __init__.py after stripping out all the HASS stuff, and then just call async_setup_entry to start it. I could never find out how you kick this off in HASS.

One issue I am having with my conversion is how to define the config object. Is that what CONFIG_SCHEMA is doing? I cannot find the source for that.

I was also in the process of creating a stripped-down ConfigType that just supports the class structure without all the HASS functions embedded.

@PeteRager
Owner

One of the challenges with that starting point is unwinding and understanding the HASS objects. The main element of the config entry is a dictionary of the configuration values. There are some examples in the test directory where I set up a fake config entry to test those functions; that could provide a good starting point.

Another alternative could be to go back to the command line program and get that executing so that data is flowing into Influx. A simple way to handle errors is to have the program exit and have the Docker container auto-restart, so essentially a failure stops it and it restarts from scratch. Simple may be better. That HASS integration was my first attempt at Python programming, and while it works very reliably it's not very modular, and unwrapping it may be a lot of work.

Myself, I'd likely just put this into the HASS configuration.yaml and be done with it. This will send all the data collected by the integration to Influx:

# Example filter to include specified domains and exclude specified entities
influxdb:
  include:
    domains:
      - sensor
      - climate
      - binary_sensor

@aesculus
Author

Good tip on the Docker restart. I'll set it to automatic.

I was heading in the direction of simple, using the base of the command line examples, and then got caught up in thinking the HASS one was more complete. Sometimes complexity looks like elegance when you really don't understand everything. :-)

I'll reset and take another stab at it. I have all the parts, so it should be fairly easy to crank something out now that you've put me back on the rails.

@aesculus
Author

Quick question: if running as a Docker container, is it necessary to shut down the s30api_async object before the container terminates, or will this happen when the container is gone?

I don't want to leave connections hanging around unnecessarily. Or is it NBD, especially if I use the same app_id on a restart?

@PeteRager
Owner

PeteRager commented Aug 16, 2023

I would try to log out, as it will prevent messages from accumulating for that app_id. I don't know how long they accumulate for, or the side effects of the S30 running out of space. That said, if it's in trouble it usually reboots on its own.

@aesculus
Author

OK. It looks like the best way to detect a container shutdown is via signals (SIGTERM), and that we have 10 seconds to gracefully quiesce the container before it's shut down?

Any experience or best practices here?

@aesculus
Author

As an experiment I wanted to determine where and how to do the shutdown. It looked like I would do this in the message_pump_task routine, so for now I just put in a counter to allow a few executions and then terminate the program:

running = True

# This task gets messages from S30 and processes them
async def message_pump_task(s30api: s30api_async) -> None:
    global running
    my_counter = 0
    try:
        # Login and establish connection
        await s30api.serverConnect()
        # For each S30 found, initiate the data subscriptions.
        for lsystem in s30api.system_list:
            await s30api.subscribe(lsystem)
    # Catch errors and exit this task.
    except S30Exception as e:
        print("Failed to connect error " + str(e))
        return
    while running:
        my_counter += 1
        print(my_counter)
        if my_counter > 20:
            running = False
        # Checks for new messages and processes them which may update the state of zones, etc.
        try:
            await s30api.messagePump()
            await asyncio.sleep(1)
        # Intermittent errors due to Lennox servers, etc, may occur, log and keep pumping.
        except S30Exception as e:
            print("Message pump error " + str(e))
    else:
        await s30api.shutdown()
        print("shutdown")

This resulted in what I think is a smooth exit without an error:

2023-08-17 14:36:40,147 [MainThread  ] [INFO ]  logout - Entering - [https://192.168.1.224/Endpoints/Downstairs/Disconnect]
shutdown
2023-08-17 14:36:40,282 [MainThread  ] [INFO ]  logout - Entering - [https://192.168.1.192/Endpoints/Upstairs/Disconnect]
shutdown
Program Ended

Before I put in the shutdown I would get:

Program Ended
2023-08-17 14:28:23,398 [MainThread  ] [ERROR]  Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f2824494430>
2023-08-17 14:28:23,399 [MainThread  ] [ERROR]  Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f28244945e0>
2023-08-17 14:28:23,399 [MainThread  ] [ERROR]  Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7f28244a06a0>, 1228009.631245871)]']
connector: <aiohttp.connector.TCPConnector object at 0x7f28244946a0>
2023-08-17 14:28:23,400 [MainThread  ] [ERROR]  Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7f28244a0400>, 1228011.270983376)]']
connector: <aiohttp.connector.TCPConnector object at 0x7f2824494a60>
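For the real Docker case, the test counter could be replaced by a SIGTERM handler that flips the same running flag - a hedged sketch, assuming a Unix host and a running asyncio event loop:

import asyncio
import signal

def request_shutdown() -> None:
    # Flip the flag the pump loop checks; its else-branch then calls
    # s30api.shutdown() before Docker's 10-second grace period expires.
    global running
    running = False

loop = asyncio.get_event_loop()
loop.add_signal_handler(signal.SIGTERM, request_shutdown)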

@PeteRager
Owner

That looks great!

@aesculus
Author

OK. Here is my first hack at collecting zone information once every minute and storing it in an InfluxDB database. I will use this with https://github.com/jasonacox/Powerwall-Dashboard, so I will be placing it in the same DB. I am using the unique_id for each zone as the measurement, so each zone has its own data set.

This does not have a lot of error recovery built in, so I will see how it goes after running it for a while. I will make it a Docker container once it seems stable enough.

Next up will be to create a Grafana dashboard to display the data, with some of it emulating the monthly summaries that come from Lennox. #62 (comment)

server.py.txt

@aesculus
Author

Well, my attempt to get it running in Docker has had limited success. When I run it I get this error in the log:

from lennoxs30api.s30exception import S30Exception
ModuleNotFoundError: No module named 'lennoxs30api'

So it cannot find the module. Any thoughts on what that may be? No errors during the build.

@aesculus
Author

aesculus commented Aug 25, 2023

Small update: still struggling here. For some reason Docker is not finding the packages that have been installed into site-packages. I verified that the path was good, and pip also says it's valid. :-(

EDIT: Making good progress. Past all the module prereqs and the config file. Now trying to solve why it can't write to the log file that is there and has rw permissions for everyone.

EDIT 2: Success. I had to redo the way logging is done in Docker, as it would not write in the format I used first. Another sample to follow after I run this for a few days and it seems to work OK.

Just barely started on the Dashboard in Grafana.

@aesculus
Author

The system has been working well, and I had been making slow progress on the Grafana dashboard until this last night: #65

@aesculus
Author

aesculus commented Sep 5, 2023

@PeteRager OK, I have hit a wall. For some reason one of my S30s just stops updating the message pump data. This can happen after a few minutes or a few hours. It does not throw any errors, so there is nothing to trap to restart it. The message queue just has the same values in it, and my app will just keep pumping away.

Thoughts on what this is and how to detect it?

BTW my Grafana dashboard is complete. All I need now is a reliable data feed.

@PeteRager
Owner

When it's in that state, no messages are being processed. Meaning the get is not returning any messages?

The only time I've seen something like that is when I've had two applications running using the same app_id, but typically this would just cause each application to get a subset of the messages.

Are you using different app_ids for the two S30s?

Are the two S30s running the same firmware version?

@aesculus
Author

aesculus commented Sep 5, 2023

I am using the sample api_poller_task. So I guess it just keeps looping, reading whatever was dumped into the in-memory state? And if await s30api.subscribe(lsystem) does not update the values, I just process the same values over and over.

Yes different app_ids for the two S30s.

I could have sworn that before, when there was an issue with subscribe, I would get an error. That is what I programmed for.

@PeteRager
Owner

As a way to detect no messages: there is a metrics object attached to the API, api.metrics.

In that object is a last receive time and a message count, so this could be used to detect the connection not sending data. Typically there are multiple messages a minute, so 10 minutes of no messages could be the diagnostic to detect this.
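A simple staleness check might look like this - a hedged sketch; the metrics attribute names are assumptions based on the values printed later in this thread:

from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=10)  # threshold suggested above

def connection_stale(s30api) -> bool:
    # last_receive_time is timezone-aware in the logs, so compare
    # against an aware UTC "now".
    last = s30api.metrics.last_receive_time
    return last is None or datetime.now(timezone.utc) - last > STALE_AFTER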

@aesculus
Author

aesculus commented Sep 5, 2023

That is probably what I need; I will code that up. Since I am on a LAN, I would think no messages in a minute or so would mean a problem?

@PeteRager
Owner

Maybe log that message count to Grafana, see how often it typically changes, then go 3x or so around that.

Is that the single zone S30 or the multizone one? Restarting the S30 may help, but I've only ever had to do that once in two years.

@aesculus
Author

aesculus commented Sep 5, 2023

Good idea about putting it in Grafana. Easier than hunting down through Docker Compose.

It's the zoned one. The single zone one seems to be stable. I rebooted both of them last week, once I "thought" I had my code stable. :-(

@PeteRager
Owner

Take a look at the firmware version. Mine are at 3.81.213

@PeteRager
Owner

Is the WiFi signal strength the same on both of them? If the WiFi is weak and drops out occasionally, I could see this situation arising. The devices do send the RSSI in messages periodically; it's not captured by the API but should be in the message logs. The other value to look at is sysUpTime on the system object, which tells how long in seconds the system has been running. If it resets to zero, that means the S30 has restarted - but if that happened I'd expect communication errors to be reported.

@aesculus
Author

aesculus commented Sep 6, 2023

@PeteRager Here are the two metrics from both systems after they were established and had been reporting messages for three minutes (1 minute polling time). Note Downstairs has two zones vs Upstairs having one.

INFO:system=Aesculus_Upstairs message_count = 40 last_receive_time = 2023-09-06 17:47:01.322791+00:00
INFO:system=Aesculus_Downstairs message_count = 47 last_receive_time = 2023-09-06 17:47:35.145735+00:00
INFO:system=Aesculus_Upstairs message_count = 47 last_receive_time = 2023-09-06 17:47:54.404752+00:00
INFO:system=Aesculus_Downstairs message_count = 57 last_receive_time = 2023-09-06 17:48:36.324384+00:00
INFO:system=Aesculus_Upstairs message_count = 57 last_receive_time = 2023-09-06 17:49:03.409017+00:00
INFO:system=Aesculus_Downstairs message_count = 67 last_receive_time = 2023-09-06 17:49:36.249007+00:00

Are these the metrics you described to check? The other time-related ones are: last_metric_time, last_message_time, last_reconnect_time, last_send_time.

And seeing these, what kinds of deltas would indicate a problem?

My S30 FW is from July 2022. My WiFi signal has 5 bars, but I have seen the device randomly disconnect before, though rarely.

@PeteRager
Owner

Looks good. If you don't get any messages for 5 minutes that would mean there's an issue and the code could try to reconnect.

@aesculus
Author

aesculus commented Sep 7, 2023

I think I have an experiment running now. I tested it immediately and it did the disconnect and then restarted, so I am hopeful. Now if I did not screw up my date math, I have a chance.

I did it with a 3 minute lag. My Python date math needs testing, as some calls give back UTC and others local time, so I have to fix that.

@aesculus
Author

Small update but making progress. Once I have this resolved I am done.

#65 (comment)

aesculus closed this as completed Oct 3, 2023