This repository has been archived by the owner on May 1, 2024. It is now read-only.

EKF: replacement of autocode with sympy generated output #868

Closed
wants to merge 19 commits

Conversation

priseborough
Collaborator

@priseborough priseborough commented Jul 16, 2020

This is a work in progress. Derivations and conversion to C++ code are complete; however, test benches and SITL/flight testing require further work.

This is built on proof of concept work by @RomanBapst - see https://github.com/RomanBapst/ecl/tree/ekf_python

Test comparison programs have been written that compare the output from the sympy generated code to the legacy code in master. The largest fraction difference, defined as (new_value - old_value)/old_value, is found and printed. For example, the covariance prediction gives:

max_diff_fraction = 2.074566e-06 , old = -1.490614e+17 , new = -1.490610e+17 , location index = 4,7
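The metric can be sketched as follows (a minimal, self-contained sketch with illustrative names, not the actual test-bench code from the linked derivation directory):

```cpp
#include <cassert>
#include <cmath>

// Sketch of the comparison metric: scan two covariance matrices and record
// the largest fractional difference (new - old) / old and its location.
struct MaxDiff {
    float fraction;
    int row;
    int col;
};

MaxDiff maxDiffFraction(const float oldP[][3], const float newP[][3], int n)
{
    MaxDiff result{0.0f, 0, 0};
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            if (oldP[i][j] != 0.0f) {
                const float frac = std::fabs((newP[i][j] - oldP[i][j]) / oldP[i][j]);
                if (frac > result.fraction) {
                    result = {frac, i, j};
                }
            }
        }
    }
    return result;
}
```

Zero `old` entries are skipped here to avoid division by zero; the real test programs may handle that case differently.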

Comparing the observation Jacobians for the various yaw fusion methods gives:

321 option A max_diff_fraction = 1.322706e-07 , old = 9.012533e-01 , new = 9.012532e-01 , location index = 3
321 option B max_diff_fraction = 3.555763e-07 , old = 1.005770e+00 , new = 1.005769e+00 , location index = 2
312 option A max_diff_fraction = 1.696653e-07 , old = -2.810457e+00 , new = -2.810457e+00 , location index = 2
312 option B max_diff_fraction = 5.844245e-07 , old = -4.589488e-01 , new = -4.589485e-01 , location index = 1

Differences so far have been confined to the last significant figure.

Testing

Comparison test programs for the covariance prediction and observation methods. See https://github.com/PX4/ecl/blob/pr-ekfSymPyDerivationConversion/EKF/python/ekf_derivation/run_test_programs.sh

SITL testing that exercises all observation methods. See the following:
SITL gazebo test https://review.px4.io/plot_app?log=a08ff241-6840-456b-b509-491235c4267d
SITL gazebo_iris_opt_flow test https://logs.px4.io/plot_app?log=9c29150e-ab23-4245-869c-bb2a978cebc6
SITL gazebo plane test https://logs.px4.io/plot_app?log=a568008f-92f8-48c8-93ed-694b7a94dd19

TODO: SITL testing for MC body frame specific force fusion and GPS yaw fusion.

@priseborough priseborough self-assigned this Jul 16, 2020
@priseborough priseborough changed the title EKF: preliminary replacement of autocode with sympy generated output EKF: replacement of autocode with sympy generated output Jul 16, 2020
Contributor

@kamilritz kamilritz left a comment


Very cool. I am super happy that this is finally happening. I appreciate that you paid attention to const correctness and used const references to grab the state's values.

I would like to raise a discussion about the size of the observation Jacobian container. I saw that it is not handled consistently throughout this PR. At some points it is a container of size 24 with a lot of entries being 0; sometimes it is a container of size << 24 containing only the non-zero elements. I am not sure how to trade off efficiency and readability here. Maybe @jkflying can help us.
Anyway, it would be great if we could decide on a single approach, as that will help us simplify the sparse computation of KHP into a single function.

Additionally, would it be possible to swap in the autogenerated code in multiple steps/PRs? I think this would make it much simpler to review, and to rebase on any change in master.

Write comparison test programs for the rest of the observation methods.
SITL and/or flight testing that exercises all observation methods. I can do the MC and FW tests, which will exercise most of them, but am unsure how to exercise the external vision/VIO fusion methods.

I am happy to help here. Can you be a bit more specific about the nature of these comparison test programs? Do you want to compare state outputs similar to the "change output predictor", or write unit tests or general SITL tests?


if (update_wind_only) {
// If we are getting aiding from other sources, then don't allow the airspeed measurements to affect the non-windspeed states
for (unsigned row = 0; row <= 21; row++) {
Contributor

As Kfusion is a matrix::Vector it will be already initialized with zeros. No need to set it to zero again.

Collaborator Author

Will fix


for (uint8_t i = 0; i < _k_num_states; i++) { H_TAS[i] = 0.0f; }
// Intermediate variables
const float HK0 = vn - vwn;
Contributor

The naming HK* for the intermediate variables could be a bit confusing as there are also variables with the name KHP or KH.

Collaborator Author

I don't see how it makes it more confusing. They are clearly intermediate variables.

Contributor

There is indeed a comment saying that it is an intermediate variable. Nevertheless, if someone sees HK0, they might think it is the 0th element of the result of an operation between two vector/matrix quantities H and K; that is how I read it at first.
The former t0, t1, ... notation was, at least for me, clearer at first glance. I understand that it is probably less safe when copying generated code output. Maybe there is a good option in between, something like t_hk_0 or tmp_hk_0.

SPP[8] = 2*q0*q3 + 2*q1*q2;
SPP[9] = 2*q0*q2 + 2*q1*q3;
SPP[10] = SF[16];
const float PS0 = powf(q1, 2);
Contributor

Should this powf be replaced with sq?

Collaborator Author

The compiler should optimise out powf(x,2) and doing a find and replace on it has risks.
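For context, the `sq` helper suggested above is just a single multiply; a minimal sketch (assuming a free inline function, which may differ from the helper actually used in ecl):

```cpp
#include <cassert>

// Minimal sketch of an sq() helper: guarantees a single multiply,
// independent of whether the compiler optimises powf(x, 2.0f).
static inline float sq(float x) { return x * x; }
```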

SPP[9] = 2*q0*q2 + 2*q1*q3;
SPP[10] = SF[16];
const float PS0 = powf(q1, 2);
const float PS1 = 0.25F*daxVar;
Contributor

In most situations we used a small f for specifying the type of the literal. Is there a specific reason why a capital F is used here? It would be nice to have consistency.

Collaborator Author

Yes, but that's what Python outputs.

const float PS45 = P(0,13) - P(1,13)*PS11 + P(10,13)*PS6 + P(11,13)*PS7 + P(12,13)*PS9 - P(2,13)*PS12 - P(3,13)*PS13;
const float PS46 = PS37 + PS38;
const float PS47 = P(0,15) - P(1,15)*PS11 + P(10,15)*PS6 + P(11,15)*PS7 + P(12,15)*PS9 - P(2,15)*PS12 - P(3,15)*PS13;
const float PS48 = 2*PS47;
Contributor

Would it make a difference using 2.f instead of 2?

Collaborator Author

The difference testing says no.

Contributor

Cool. I assume the test checks the numerical output. What about computational speed? I guess the compiler is forcing a float multiplication.

const float HK29 = -HK12*P(1,23) + HK12*P(1,5) - HK13*P(1,6) + HK14*P(1,1) + HK15*P(0,1) - HK16*P(1,2) + HK17*P(1,3) - HK7*P(1,22) + HK7*P(1,4);
const float HK30 = -HK12*P(2,23) + HK12*P(2,5) - HK13*P(2,6) + HK14*P(1,2) + HK15*P(0,2) - HK16*P(2,2) + HK17*P(2,3) - HK7*P(2,22) + HK7*P(2,4);
const float HK31 = -HK12*P(3,23) + HK12*P(3,5) - HK13*P(3,6) + HK14*P(1,3) + HK15*P(0,3) - HK16*P(2,3) + HK17*P(3,3) - HK7*P(3,22) + HK7*P(3,4);
float HK32;
Contributor

No need to declare the variable already here. Can be declared const on initialization.

Collaborator Author

I don't see how that is possible. It needs to be declared in the larger scope and only updated if the inverse is numerically OK.

const float HK29 = 4/powf(HK17, 2);
const float HK30 = -HK16*P(0,2) - HK24*P(1,2) - HK25*P(2,2) + HK26*P(2,3);
const float HK31 = -HK16*P(0,3) - HK24*P(1,3) - HK25*P(2,3) + HK26*P(3,3);
float HK32;
Contributor

No need to declare the variable already here. Can be declared const on initialization.

Collaborator Author

I don't see how that is possible. It needs to be declared in the larger scope and only updated if the inverse is numerically OK.

Contributor

As we directly return inside the if statement, we could declare the variable afterwards.

float HK32;
if(bla) {
  HK32 = ...;
  status_ = A;
} else {
   // Do stuff not related to HK32
  return;
}

could be

if(!bla) {
   // Do stuff not related to HK32
  return;
}
const float HK32 = ...;
status_ = A;

}
} else {
for (uint8_t i = 0; i < 16; i++) {
Kfusion(i) = 0.0f;
Contributor

No need to set 0 to 0. Kfusion is initialized to 0.

Collaborator Author

Will fix

Vector24f gain;
// calculate observation Jacobians and Kalman gains
if (obs_index == 0) {
// Observation Jacobians - axis 0
Contributor

Very nice, I really like that we can now compute K and H after computing the innovation variance of both axes. This is how it has been done everywhere else.

}
} else {
for (int i = 16; i <= 21; i++) {
Kfusion(i) = 0.0f;
Contributor

no need to set 0 to 0.

Collaborator Author

Will fix

@dagar
Member

dagar commented Jul 18, 2020

The flash savings here are looking great. What this likely means is that we'll be able to comfortably afford to increase the optimization level from -Os -> -O2 on the embedded side, which can make a significant difference for Matrix library usage.

Screenshot from 2020-07-18 12-51-11

For reference, the entire estimator (PX4 ekf2 frontend + ecl/EKF backend) is about 140 KB of flash on px4_fmu-v5.

@priseborough priseborough force-pushed the pr-ekfSymPyDerivationConversion branch from dba97cc to a93349c Compare July 20, 2020 06:45
@priseborough
Collaborator Author

I've added test programs to compare the before and after autocode and rebased.

@priseborough
Collaborator Author

@kamilritz I did some simple test programs for the covariance prediction and the observation Jacobians and Kalman gains that compare them before and after. Differences are within what is expected for single-precision calculations.

9136a0f

@kamilritz
Contributor

I would like to raise a discussion about the size of the observation Jacobian container. I saw that it is not handled consistently throughout this PR, or in the codebase in general. At some points it is a container of size 24 with a lot of entries being 0; sometimes it is a container of size << 24 containing only the non-zero elements. I am not sure about the trade-off between efficiency and readability here. Maybe @jkflying can help us.
Anyway, it would be great if we could decide on a single approach, as that will help us simplify the sparse computation of KHP into a single function.

@priseborough @jkflying

@jkflying
Contributor

This is a pretty big PR, and I'm not able to pull out the patterns in the time I have. However, it sounds like the 'gather-scatter' pattern would work well here, allowing you to work with small dense matrices rather than large sparse matrices.

How this works is you create a permutation matrix which selects which rows/cols you keep (and can reorder them too, if you want). So rather than having a large matrix Q (dimensions mxm), we have a smaller D (nxn) with the non-zero values, and permutation P (mxn) with a single 1 per column, such that PDP' = Q. We would have to add support for permutation matrix multiplications to Matrix, which would enable this kind of design work. To be efficient, and because we don't have the same compiler type optimization Eigen does, we'd need something like an operator+=(PermutationMatrix<M,Q>, Matrix<Q,R>, PermutationMatrix<R,N>); in Matrix, to expand and fill (aka 'scatter') the small dense matrix into the large matrix.
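The gather/scatter idea can be illustrated with plain arrays (a sketch only; `scatter` and the compact index representation of P are hypothetical, not Matrix library API):

```cpp
#include <cassert>

// Represent the permutation/selection matrix P (M x N, one 1 per column)
// compactly by idx[j] = row index of the 1 in column j. Expanding the
// small dense D (N x N) into the large sparse Q (M x M) via Q = P*D*P'
// then reduces to scattering each D(i,j) into Q(idx[i], idx[j]).
template<int M, int N>
void scatter(const float D[N][N], const int idx[N], float Q[M][M])
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < M; j++) {
            Q[i][j] = 0.0f;
        }
    }
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            Q[idx[i]][idx[j]] = D[i][j];
        }
    }
}
```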

@kamilritz
Contributor

kamilritz commented Jul 20, 2020

How this works is you create a permutation matrix which selects which rows/cols you keep (and can reorder them too, if you want). So rather than having a large matrix Q (dimensions mxm), we have a smaller D (nxn) with the non-zero values, and permutation P (mxn) with a single 1 per column, such that PDP' = Q. We would have to add support for permutation matrix multiplications to Matrix, which would enable this kind of design work. To be efficient, and because we don't have the same compiler type optimization Eigen does, we'd need something like an operator+=(PermutationMatrix<M,Q>, Matrix<Q,R>, PermutationMatrix<R,N>); in Matrix, to expand and fill (aka 'scatter') the small dense matrix into the large matrix.

@jkflying That is a cool concept. I am not sure if we need to go that far. The observation jacobian H is almost the only consistently sparse vector/matrix quantity. I think it would be enough to have a sparse KHP computation, that takes benefit of knowing which are the non-zero element of H.

The question is: should we store H as a Vector<float, 24>, initialize everything to 0, set the non-zero elements, and pass it to a function computeKHP(..., Vector<float, 24> H, Vector<size_t, numberOfNonZeroElements> indices), or
should we store H as a Vector<float, numberOfNonZeroElements>, set all elements, and pass it to a function computeKHP(..., Vector<float, numberOfNonZeroElements> H, Vector<size_t, numberOfNonZeroElements> indices)?
In the former version the entries of H can be easily interpreted; in the latter version one needs to know the mapping to the states. It should also be easier to autogenerate the code for the first version. But in the first version we always need to initialize the rest to 0, so it would be less efficient.
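The second option could look something like this (a sketch; `SparseH` and `dotSparse` are hypothetical names, not a proposed API):

```cpp
#include <cassert>

// Store only the non-zero Jacobian entries together with the state
// indices (0..23) they map to.
template<int NZ>
struct SparseH {
    float val[NZ]; // non-zero Jacobian entries
    int idx[NZ];   // corresponding state indices
};

// Example use: a dot product H * x computed over the non-zero terms only.
template<int NZ, int NS>
float dotSparse(const SparseH<NZ> &H, const float state[NS])
{
    float sum = 0.0f;
    for (int i = 0; i < NZ; i++) {
        sum += H.val[i] * state[H.idx[i]];
    }
    return sum;
}
```

A sparse KHP computation would iterate over the same `idx` list.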

@jkflying
Contributor

Yes, my usage of this pattern was in a very different situation, with matrices of 5000+ x 5000+ but with lots of sparsity, which could be reduced down to a ~100x100 dense matrix with a permutation matrix. Particularly on larger processors (and GPUs) there is a lot of efficiency (~20x) to be gained by not spreading computations over a large matrix but operating on adjacent elements instead. I'm not sure how much benefit it will have here, considering the small sizes of our vectors and that we don't have lots of tiered memory caches (i.e. L1/L2/L3/RAM/swap).

I don't see the (mathematical) need to initialize the other elements of the matrix to 0, since we can ignore them if we know which indices we need. The initialization is just because it is in a Vector, right? You could also make a 'reverse index' which maps indices in the sparse matrix to the dense matrix. Then the autogenerated code could use the 24-based indexing, which gets translated to the dense indexing. But this might be going too far; with the small amount of computation we're doing, I don't think it would be worth the benefit.

@priseborough
Collaborator Author

I've added a script file to run the test programs.

@priseborough priseborough force-pushed the pr-ekfSymPyDerivationConversion branch from 757de17 to 7d74568 Compare July 21, 2020 11:59
@priseborough
Collaborator Author

With respect to the use of the powf(x, y) function: although the compiler appears to be optimising it to x*x, this leaves the code at the discretion of the compiler, which is not ideal. At yesterday's estimation call we discussed the options. One suggestion was to define an inlined ecl::powf(x, y) function to be used for integer values of y. I have implemented that in 0a4f5cc

The other option is to modify the Python script to do a find and replace of all powf(x, 2) instances with sq(x). The total number of instances of powf(x, y) is not large, so an editor-guided manual replacement is possible, but not ideal.

@dagar what is your view on relying on the compiler optimisations for this function? Can compilers be relied on to interpret powf(x, 2) as x*x? I don't have access to the ISO/IEC 9899:2011 document.
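One possible shape for the inlined helper (a sketch only; the exponent is taken as an int, and a deleted float overload forces a compile error for non-integer exponents — this is not necessarily how 0a4f5cc implements it):

```cpp
#include <cassert>

namespace ecl {

// Integer-exponent power: compiles to a handful of multiplies and never
// falls back to the libm powf.
inline float powf(float x, int exp)
{
    float result = 1.0f;
    const bool invert = exp < 0;
    for (int n = invert ? -exp : exp; n > 0; n--) {
        result *= x;
    }
    return invert ? 1.0f / result : result;
}

// Reject non-integer exponents such as powf(x, 0.5f) at compile time.
float powf(float, float) = delete;

} // namespace ecl
```

With this, any generated call with a fractional exponent fails to build and can be rewritten by hand (e.g. as a sqrtf).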

@jkflying
Contributor

GCC at least has been able to optimize powf(x, 2) for a very long time. I'm more concerned about cases where we have something like powf(x, 1/2.), which will be orders of magnitude slower (and less accurate) than sqrt(x) and doesn't get optimized even on current compilers. The ecl::powf solution will at least give a compiler error here, which can be fixed manually. At some point, if we really decide we hate the manual fixes, we could even make an ecl::powf(float, float) version which errors on everything except some obvious use-cases like 0.5.

Also maybe I missed it, but the ecl::powf could easily be added to the generator with something like

sed 's/\(ecl::\)\{0,1\}powf/ecl::powf/' generated/*.cpp

to make it happen with all files automatically.

@dagar
Member

dagar commented Jul 22, 2020

@dagar what is your view on relying on the compiler optimisations for this function? Can compilers be relied on to interpret powf(x, 2) as x*x? I don't have access to the ISO/IEC 9899:2011 document.

Is the concern performance or correctness?

@priseborough
Collaborator Author

priseborough commented Jul 23, 2020

@dagar The number of usages is not large, so the concern is firstly correctness and secondly performance.

Edit: What I don't want is a future change in compiler producing a slightly different numerical result.

@priseborough
Collaborator Author

@jkflying There is one place in the airspeed fusion where the code generator used powf(x, -1.0F/2.0F), but that can be reworked by hand. As you point out, ecl::powf will generate a compiler error for instances where non-integer exponents are used, so your proposal for a global find and replace should work OK and we can handle the odd exception manually.

@priseborough
Collaborator Author

The feedback to date is that the overall approach looks OK, so I will now break this down into a number of smaller PRs, starting with the covariance prediction and a separate PR for each fusion type.

@kamilritz
Contributor

@priseborough Regarding the observation jacobian H, it would be great if all of them could be of type Vector<float, numberOfNonZeroElements>

@priseborough
Collaborator Author

@priseborough Regarding the observation jacobian H, it would be great if all of them could be of type Vector<float, numberOfNonZeroElements>

Doesn't work in all instances because sometimes we only need a few entries and optimise accordingly.

@kamilritz
Contributor

@priseborough Regarding the observation jacobian H, it would be great if all of them could be of type Vector<float, numberOfNonZeroElements>

Doesn't work in all instances because sometimes we only need a few entries and optimise accordingly.

The numberOfNonZeroElements could vary for each measurement, e.g. for yaw Vector<float, 4> H_YAW, for airspeed Vector<float, 5> H_TAS, and so on. Would it be doable to optimise the zero entries away in each single case? Or is this too much manual work?
It would be great if we could be consistent across all measurements, because then we can have a single function doing the sparse KHP computation.

@priseborough
Collaborator Author

@kamilritz There are sparse matrix techniques that could be applied, but we would need to provide an occupancy mask. This would end up trading memory use for code commonality. Let's have the discussion when the PR for the first observation method goes in.

If you are able to look at the options for doing this efficiently, I will in the meantime prepare the covariance prediction PR so we don't have to make a decision right now.

@dagar
Member

dagar commented Jul 23, 2020

@dagar The number of usages is not large, so the concern is firstly correctness and secondly performance.

Edit: What I don't want is a future change in compiler producing a slightly different numerical result.

It's not an unreasonable concern, although in this particular case I really don't think it's going to be an issue.

https://godbolt.org/z/7d1q87 - With any optimization at all (-O1) you can see powf(n, 2.f) is the same as n * n (__aeabi_fmul).

Screenshot from 2020-07-23 11-44-56

In general for this kind of thing it's another reason why I want to make sure we're continuously updating PX4/Firmware with each PX4/ecl change. On the PX4/Firmware side we can fairly easily monitor changes in execution time, memory usage, and flash usage over time across a range of boards. We lose a lot of detail when each PX4/ecl bump is large.
