Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCS_Common: whitelist AUTOPILOT_VERSION for in_delay_callback sending #13727

Conversation

peterbarker
Copy link
Contributor

GCSs may request this very early on in the boot process, particularly
for SITL.

If we try to send it during a delay callback then we end up dropping it
at the moment - but we'd already sent the ack in response to the
request.

The sequence here is that we receive the request for the message from the GCS, and put that into the pushed_ap_message_ids bitmask. Something then calls delay_ms(...) for some period of time, and we start to push messages out. We choose this one to go out - but in try_send_message we discover that we're not allowed to send it. We return true, saying we sent it - lying - if we don't do this then at the moment we would effectively have "blocker" messages which stop any of the in-delay-callback-whitelisted-messages (e.g. heartbeat) getting out.

The more elaborate fix for this is probably to have try_send_message return an enumeration value instead of true/false. There are three callers, however, and things sending from a queue might have problems moving onto the next entry while preserving the need to send the first. An alternate fix might be for the caller to check if we're in a delay callback itself and change its behaviour appropriately - so, for example:

            if (!do_try_send_message(next)) {
                break;
            }
            pushed_ap_message_ids.clear(next);

would become more elaborate.

Another option might be to whitelist virtually all of the messages - that requires an audit of the message sending to see how long they could possibly take and/or modifying the existing timing code to ensure the deadlines are not exceeded.

GCSs may request this very early on in the boot process, particularly
for SITL.

If we try to send it during a delay callback then we end up dropping it
at the moment - but we'd already sent the ack in response to the
request.
@DonLakeFlyer
Copy link
Contributor

So the issue here is that a gcs really wants to know capability bits asap. In most cases it will delay completion of a vehicle connection state machine to wait for it. And depending on how the gcs is implemented with respect to timeouts and retries it may slow down the process of completing the vehicle connection. Which in turn could make the user experience crappy/slow.

@peterbarker
Copy link
Contributor Author

@DonLakeFlyer Yep - thus whitelisting it here while we look at the deeper issue I think is a good idea :-)

Can you confirm this has the autopilot spit the message out as well as the ack for your test cases, please?

@DonLakeFlyer
Copy link
Contributor

I'll give it a try tomorrow. I"m throwing in the towel for the day. Tired of chasing crazed race conditions in spaghetti QGC code.

@rmackay9
Copy link
Contributor

rmackay9 commented Mar 5, 2020

I don't really know what I'm talking about but maybe AP should just proactively send the capabilities soon after connecting?

@DonLakeFlyer
Copy link
Contributor

FYI: I've been testing current ArduPilot with respect to when it finally will send AUTOPILOT_VERSION. It takes a full 5 seconds from the Vehicle showing up in QGC after connect USB to before it will reponsd to the request and send the AUTOPILOT_VERSION. That's a really nasty delay.

Haven't had time to try this new code. Still fighting with this on QGC Stable trying to get it to work with QGC and current ArduPilot releases.

@DonLakeFlyer
Copy link
Contributor

I don't really know what I'm talking about but maybe AP should just proactively send the capabilities soon after connecting?

Unless it will contiuously stream that won't work. Need to guarantee delivery.

@DonLakeFlyer
Copy link
Contributor

I tested this with ArduCopter and I get AUTOPILOT_VERSION back the first time I ask for it. But it reports support for rally, but no support for fence. So not sure what that's about. Master/Stable ardupilot reports correct bits for copter.

@peterbarker
Copy link
Contributor Author

I tested this with ArduCopter and I get AUTOPILOT_VERSION back the first time I ask for it. But it reports support for rally, but no support for fence. So not sure what that's about. Master/Stable ardupilot reports correct bits for copter.

Sorry, I'm having trouble parsing that.

Your second sentence seems to say it doesn't work but your third says it does seem to work?

@DonLakeFlyer
Copy link
Contributor

Your second sentence seems to say it doesn't work but your third says it does seem to work?

Correct :)

  • Does Work: I get a response to MAV_CMD_REQUEST_AUTOPILOT_CAPABILITIES right away with an AUTOPILOT_VERSION message on the first request.
  • Doesn't Work: The capability bits in the AUTOPILOT_VERSION message specify support for rally but no support for fence. Since I tested with Copter it should report support for fence as well.

@peterbarker
Copy link
Contributor Author

@DonLakeFlyer Hmph. So we probably shouldn't merge this until we're returning the correct bits.

I can't see anything here by inspection.

Copter.h has AC_Fence as a normal member of the Copter object - not a pointer. So it should be constructed at program startup. Its constructor sets the singleton pointer - which is what we're getting with saying "AP::fence()".

So I can't see the race condition here :-(

I'll have to try to reproduce.

@peterbarker
Copy link
Contributor Author

@DonLakeFlyer I've just tried to reproduce here and failed.

I:

  • flashed a CubeBlack with this PR / arducopter
  • turned the cube off (switch on USB hub)
  • started qgc (compiled from source, a7c5a57)
  • waited for parameters to finish downloading
  • changed to the "planning" page
  • noted that all mission types could be planned (rally / fence / waypoints)
  • checked the mav tlog file:
pbarker@bluebottle:/tmp$ mavlogdump.py FlightData146713.mavlink151117 --t AUTOPILOT_VERSION
2020-05-11 10:12:21.85: AUTOPILOT_VERSION {capabilities : 64495, flight_sw_version : 67109888, middleware_sw_version : 0, os_sw_version : 0, board_version : 0, flight_custom_version : [48, 48, 53, 55, 48, 51, 101, 99], middleware_custom_version : [0, 0, 0, 0, 0, 0, 0, 0], os_custom_version : [51, 51, 49, 102, 101, 55, 53, 100], vendor_id : 0, product_id : 0, uid : 0, uid2 : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
pbarker@bluebottle:/tmp$ 

So I'm unable to reproduce what you're seeing - the fence bit changing value during the boot process. If we ever moved to dynamically allocate these objects we could certainly see the problem - and that's a real possibility on Plane - but we're not doing that AFAICS.

Any ideas?

@peterbarker peterbarker merged commit 5100c9f into ArduPilot:master May 11, 2020
@peterbarker peterbarker deleted the pr/autopilot-version-in-delay-callback branch May 13, 2020 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants