Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CTD at KDEN in DataCleansing / GetToPos #207

Closed
TwinFan opened this issue Jan 24, 2021 · 4 comments
Closed

CTD at KDEN in DataCleansing / GetToPos #207

TwinFan opened this issue Jan 24, 2021 · 4 comments
Labels
bug Something isn't working
Milestone

Comments

@TwinFan
Copy link
Owner

TwinFan commented Jan 24, 2021

Describe the bug
One user reports reoccuring crashes when nearing or being at KDEN, standard X-Plane scenery. In the crash dumps he provided two patterns are visible, but they are just one line apart in the source code, so I treat them for the moment as related.

At the same time, other provided crash dumps point into X-Plane. We should not completely forget about them because what we see here is quite weird...not sure what happened to our memory. And maybe sometimes it happens to X-Plane's memory?

To Reproduce
Have LiveTraffic run at KDEN.
Channels involved are OpenSky and Open Glider Network. No ADSBEx, no RealTraffic.
Active channels are important as the issue is about auto-land aircraft.

Expected behavior
No crash.

Technical Info:

  • OS: Windows 10.0 (build 10586/2)
  • X-Plane version: 11.51r1
  • LiveTraffic version: 2.40.201221

1st Pattern in LTFlightData:916 calling pAc->GetToPos

Log.txt
Two dumps report the same call stack:
crash_report_01_24_2021_17_25_50.zip
crash_report_01_22_2021_18_42_05.zip

Call Stack

[Inline Frame] LiveTraffic.xpl!positionTy::operator<(const positionTy &) Line 383
	at LiveTraffic\Include\CoordCalc.h(383)
LiveTraffic.xpl!LTAircraft::GetToPos() Line 1352
	at LiveTraffic\src\LTAircraft.cpp(1352)
LiveTraffic.xpl!LTFlightData::DataCleansing(bool & bChanged=false) Line 564
	at LiveTraffic\src\LTFlightData.cpp(564)
LiveTraffic.xpl!LTFlightData::CalcNextPos(double simTime=1611509086.0000000) Line 1016
	at LiveTraffic\src\LTFlightData.cpp(1016)
LiveTraffic.xpl!LTFlightData::CalcNextPosMain() Line 1353
	at LiveTraffic\src\LTFlightData.cpp(1353)
[Inline Frame] LiveTraffic.xpl!std::invoke(void(*)() &&) Line 1584
	at MSVC\14.28.29333\include\type_traits(1584)
LiveTraffic.xpl!std::thread::_Invoke<std::tuple<void (__cdecl*)(void)>,0>(void * _RawVals=0x00000240eaa342b0) Line 44
	at MSVC\14.28.29333\include\thread(44)
ucrtbase.dll!_crt_at_quick_exit�()
kernel32.dll!BaseThreadInitThunk�()
ntdll.dll!RtlUserThreadStart()

In GetToPos it crashes while applying the operator <:

// position heading to (usually posList.back(), but ppos if ppos > posList.back())
const positionTy& LTAircraft::GetToPos() const
{
    if ( posList.size() >= 2 )
===>    return ppos < posList.back() ? posList.back() : ppos;
    else
        return ppos;
}

The operator < is the actual point of bad memory access. It's a one-liner:

    inline bool operator<  (const positionTy& p) const { return ts() < p.ts(); }

The memory location accessed is reported as 0x0000000000000018. Probably, the timestamp ts is just 18 bytes down from the beginning.
Interesting is, however, that the crash location is the operator <, not the also inlined function ts(), which returns one of the positionTy::v valarray elements.

2nd Pattern in LTFlightData:917 calling LTAptFindRwy

This is just one line down in CalcNextPos, ie. right after the call to pAc->GetToPos.

Log Text
crash_report_01_22_2021_17_19_06.zip

Call Stack

LiveTraffic.xpl!LTAptFindRwy(const LTAircraft::FlightModel & _mdl, const positionTy & _from, double _speed_m_s, std::string & _rwyId, const std::string & _logTxt={...}) Line 2511
	at LiveTraffic\Src\LTApt.cpp(2511)
[Inline Frame] LiveTraffic.xpl!LTAptFindRwy(const LTAircraft &) Line 55
	at LiveTraffic\Include\LTApt.h(55)
LiveTraffic.xpl!LTFlightData::CalcNextPos(double simTime=1611335878.1228712) Line 917
	at LiveTraffic\src\LTFlightData.cpp(917)
LiveTraffic.xpl!LTFlightData::CalcNextPosMain() Line 1353
	at LiveTraffic\src\LTFlightData.cpp(1353)
... (same as above)

So we now call into LTAptFindRwy and crash while reading address 0x0000000000000020 (so it is 2 bytes later) in line 2511 of LTApt.cpp:

        const Apt& apt = iterApt->second;
        
        // Find the rwy endpoints matching the current plane's heading
        for (const RwyEndPt& re: apt.GetRwyEndPtVec())
        {
            // skip if rwy heading differs too much from flight heading
===>        if (std::abs(HeadingDiff(re.heading, _from.heading())) > ART_RWY_MAX_HEAD_DIFF)
                continue;

Analysis

The crash is in the LT_CalcPos worker thread, which is why the X-Plane log couldn't identify LiveTraffic as the crash didn't occur in X-Plane's main thread.

Situation is: Plane is approaching with no live data. Auto-land is kicking in. But before (1st pattern) or while (2nd pattern= trying to actually find a runway the crash happens at the above locations.

1st pattern We are in code of an aircraft object. The aircraft object has been used before. One could think into the direction of aircraft removal while this code is going on...but aircraft removal is pretty tightly safeguarded with two mutexes. And if that would be the reason we would hear a lot more people scream and it would not be limited to KDEN.

2nd pattern is totally unrelated to the aircraft object but combs through the apt scenery data. The objects involved are the airport, its rwy endpoints, and a position passed down from CalcNextPos. And guess what...the position is the result of a call to _ac.GetToPos() in LTAptFindRwy (LTApt.h:55).

So the constant in both patterns is the aircraft's position. But why at KDEN?

@TwinFan
Copy link
Owner Author

TwinFan commented Jan 24, 2021

Could be linked to #208: The latter makes planes disappear rather suddenly quite often, way more often than what normally would be expected. So if there is still some kind of race condition that would allow a plane to be removed while in a separate thread calculations on it are still on, then the auto-land issue #208 would increase the likelihood of this race condition being triggered.
How is that related to KDEN? Maybe only because it has rather much traffic but maybe OpenSky's coverage doesn't always extend down to ground, causing more auto-lands.

@TwinFan
Copy link
Owner Author

TwinFan commented Jan 24, 2021

Accessing aircraft in LTFlightData::CalcNextPosMain is not guarded by the LTFlightData::dataAccessMutex mutex. Only a little later, in CalcNextPos (called by CalcNextPosMain) the lock is taken.
Also, while removing aircraft completely, the global mapFdMutex is in use, but when starting to remove an individual plane the aircraft-specific LTFlightData::dataAccessMutex is not set. It is set a little later in the destructor of LTFlightData.

So there is a theoretical point in time, as short as it may be, for a race condition:

  • CalcNextPosMain doesn't lock and finds the FD object valid, and starts to call CalcNextPos
  • in the main thread, the aircraft is set invalid and even removed
  • CalcNextPos continues (one would need to argue that it still can get the lock...that needs another loophole somewhere) and subsequently fails.

Solution: Consequently guard all LTFlightData member functions with the dataAccessMutex.

@TwinFan TwinFan added the bug Something isn't working label Jan 25, 2021
@TwinFan TwinFan added this to the v2.x milestone Jan 25, 2021
@TwinFan
Copy link
Owner Author

TwinFan commented Jan 26, 2021

Code changes didn't solve the problem. New crashes surfaced at slightly different positions, obviously in different situation, but still it is the calculation thread, which crashes, and still it is when accessing the plane's current position Aircraft::ppos, complaining

Access violation reading location 0x0000000000000018

In the 4th pattern, the exception shown in Visual Studio even clearly reads:

Unhandled exception thrown: read access violation. ppos_i.v._Myptr was nullptr

stating that an internal pointer of the std::valarray implementation of the runtime library was nullptr causing the access violation when trying to access the timestamp, stored in the 4th valarray element.

While sorting all crash dumps I noted that one was left from yesterday, which I forgot to assign to any of the patterns...turns out: That crash dump of yesterday shows what is now noted here as pattern 3.
=> Pattern 3 is not new but happened before the code change already!

What strikes me is that for reaching the point of crash in the 4th pattern, execution came very surely past the spot in the 3rd pattern: pAc && posDeque.size() >= 1 must be fulfilled for the 4th pattern to happen, which is the condition allowing the code in the 3rd pattern to happen. Why didn't it crash there already?

3rd Pattern

Call Stack

[Inline Frame] LiveTraffic.xpl!positionTy::operator<(const positionTy &) Line 383
	at LiveTraffic\Include\CoordCalc.h(383)
LiveTraffic.xpl!LTAircraft::GetToPos() Line 1352
	at LiveTraffic\src\LTAircraft.cpp(1352)
LiveTraffic.xpl!LTFlightData::DataCleansing(bool & bChanged=false) Line 574
	at LiveTraffic\src\LTFlightData.cpp(574)
LiveTraffic.xpl!LTFlightData::CalcNextPos(double simTime=1611630439.0000000) Line 1032
	at LiveTraffic\src\LTFlightData.cpp(1032)
LiveTraffic.xpl!LTFlightData::CalcNextPosMain() Line 1375
	at LiveTraffic\src\LTFlightData.cpp(1375)
[Inline Frame] LiveTraffic.xpl!std::invoke(void(*)() &&) Line 1584
	at MSVC\14.28.29333\include\type_traits(1584)
LiveTraffic.xpl!std::thread::_Invoke<std::tuple<void (__cdecl*)(void)>,0>(void * _RawVals=0x00000153cec5fbe0) Line 44
	at MSVC\14.28.29333\include\thread(44)
ucrtbase.dll!_crt_at_quick_exit�()
kernel32.dll!BaseThreadInitThunk�()
ntdll.dll!RtlUserThreadStart()

So it is again Aircraft::GetToPos(), which crashes, we had that before in the 1st pattern. But it is called from a different place this time:

    // *** Remove weird positions ***
    if (( pAc && posDeque.size() >= 1) ||
        (!pAc && posDeque.size() >= 3))
    {
        positionTy pos1;
        vectorTy v2;
        double h1 = NAN;
        dequePositionTy::iterator iter = posDeque.begin();
        
        // position _before_ the first position in the deque
        if (pAc) {
--->        pos1 = pAc->GetToPos();
            h1 = pAc->GetTrack();       // and the heading towards pos1

crash_report_01_25_2021_15_28_06 - 3rd pattern.zip
crash_report_01_26_2021_03_08_22.zip

4th Pattern

Call Stack

[Inline Frame] LiveTraffic.xpl!positionTy::ts() Line 414
	at LiveTraffic\Include\CoordCalc.h(414)
LiveTraffic.xpl!LTFlightData::CalcNextPos(double simTime=1611675999.0000000) Line 1149
	at LiveTraffic\src\LTFlightData.cpp(1149)
LiveTraffic.xpl!LTFlightData::CalcNextPosMain() Line 1375
	at LiveTraffic\src\LTFlightData.cpp(1375)
[Inline Frame] LiveTraffic.xpl!std::invoke(void(*)() &&) Line 1584
	at MSVC\14.28.29333\include\type_traits(1584)
LiveTraffic.xpl!std::thread::_Invoke<std::tuple<void (__cdecl*)(void)>,0>(void * _RawVals=0x0000026b4f2b3920) Line 44
	at MSVC\14.28.29333\include\thread(44)
ucrtbase.dll!_crt_at_quick_exit�()
kernel32.dll!BaseThreadInitThunk�()
ntdll.dll!RtlUserThreadStart()

The place calling positionTy::ts() is

            else for (size_t i = 0;
                      std::isnan(rotateTS) && posDeque.size() >= i+1;
                      i++ )
            {
                // i == 0 is as above with actual a/c present position
                // in later runs we use future data from our queue
                const positionTy& ppos_i  = i == 0 ? pAc->GetPPos() : posDeque[i-1];
                positionTy& to_i          = posDeque[i];
                const double to_i_ts      = to_i.ts();  // the reference might become invalid later once we start erasing, so we copy this timestamp that we need
                
                // we look up to 60s into the future
--->            if (ppos_i.ts() > simTime + MDL_TO_LOOK_AHEAD)
                    break;

crash_report_01_26_2021_15_47_41 - 4th pattern.zip

@TwinFan
Copy link
Owner Author

TwinFan commented Jan 26, 2021

Will try to catch positionTy::v._Myptr == nullptr if I find a way to get access to that internal member variable, and bail out with a proper exception, which can be handled in the try/catch blocks. Maybe we can then get further insights this way.

TwinFan added a commit that referenced this issue Feb 2, 2021
Regression introduced with changes for #207
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant