Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random excel crashes while editing a formula #87

Closed
gmichaud opened this issue Feb 6, 2020 · 23 comments · Fixed by #98
Closed

Random excel crashes while editing a formula #87

gmichaud opened this issue Feb 6, 2020 · 23 comments · Fixed by #98

Comments

@gmichaud
Copy link
Contributor

gmichaud commented Feb 6, 2020

We have been seeing this randomly from time to time and have just narrowed it down to ExcelDna Intellisense. Heap corruption is reported as Excel crashes.

The following minidump shows the full callstack and error details EXCEL.EXE.1956.dmp.zip

Screen Shot 2020-02-06 at 11 05 23

As you can see ExcelDna.IntelliSense is reported in the formula. I'm not able to replicate this issue but it does happen from time to time. I suspect that this won't be enough to help you troubleshoot the issue, is there anything that I could do to help identify and resolve the problem?

@gmichaud
Copy link
Contributor Author

gmichaud commented Feb 6, 2020

P.S. I was going through the intellisense source code and noticed that there's only one place where XLCALL32.dll is invoked by the library, and this function mentions access violations if called during shutdown. I was not shutting down in my case.

Link to code:

[HandleProcessCorruptedStateExceptions]

@govert
Copy link
Member

govert commented Feb 7, 2020

Maybe you can switch on a little bit of logging - Say the warning level, and see if that gets anything.
You need a .config file with something like this https://github.com/Excel-DNA/IntelliSense/blob/master/Source/ExcelDna.IntelliSense.Host/App.config

Note that "All" (Verbose) is probably overwhelming.

@govert
Copy link
Member

govert commented Jul 29, 2020

From the minidump it seems the error that this crashes with is "0xC0000374: A heap has been corrupted".

@gmichaud
Copy link
Contributor Author

@govert this is what I see as well, but I couldn't find anything useful in the crash dump to point me in the right direction. Anytime I was able to get a dump the content proved unusable; a few times I saw seeing a fault in RICHED20.dll, but it's not consistently happening there. It's always happening when editing a formula.

Is this a behaviour you have seen before?

@gmichaud
Copy link
Contributor Author

One more today, always while editing a formula

Faulting application name: EXCEL.EXE, version: 16.0.13029.20308, time stamp: 0x5f20d490
Faulting module name: RICHED20.DLL, version: 16.0.13029.20170, time stamp: 0x5f0993f1
Exception code: 0xc0000005
Fault offset: 0x00022ad9
Faulting process id: 0x1e34
Faulting application start time: 0x01d6676754742a49
Faulting application path: C:\Program Files (x86)\Microsoft Office\root\Office16\EXCEL.EXE
Faulting module path: C:\Program Files (x86)\Microsoft Office\root\vfs\ProgramFilesCommonX86\Microsoft Shared\OFFICE16\RICHED20.DLL
Report Id: a38d438d-ef32-415b-84df-60950ffd931b
Faulting package full name:
Faulting package-relative application ID:

I'm attaching the WER file generator by windows after the crash; there's nothing else that I have at this point :(

Report.wer.zip

@govert
Copy link
Member

govert commented Jul 31, 2020

I have had occasional reports of Excel crashes with IntelliSense, and I don't have any telemetry to tell how widely it is used.
But I've not heard of it happening as often as on your machine, so I am wondering if something is special there, and whether it might affect the Excel or IntelliSense.

Are you running on a Mac inside a Parallels VM? I see some Parallels stuff in the WER log.

@gmichaud
Copy link
Contributor Author

@govert yes i'm running inside Parallels. I've had other reports from customers regarding the same crash; I don't think they're running Parallels.

@govert
Copy link
Member

govert commented Jul 31, 2020

OK - yes, I can't think why Parellels would make a difference.

In this case it was an Access Violation. I expect to catch those if they are converted to a .NET exception.
Can you help set up some kind of automated testing, where we run Excel and write formulas to cells, simulate mouse movements etc with some tool? If we can recreate this, it would help immensely, of course.

@gmichaud
Copy link
Contributor Author

gmichaud commented Sep 1, 2020

@govert do you have any documentation regarding LPenHelper? I noticed that the memory pointed by the char* pointer you receive is never freed. The function where it is used also has the HandleProcessCorruptedStateExceptions attribute, letting me think the issue is related. We also found articles pointing to memory vulnerabilities in this function here https://www.snort.org/search?query=LPenHelper&submit_search=

@wh1t3cAt1k
Copy link

@govert to add context, we are referring to the method GetFormulaEditPrefix in the XlCall class.

@govert
Copy link
Member

govert commented Sep 1, 2020

@gmichaud I do not know of any documentation for the LPenHelper functionm beyond the xlcall.h header in the Excel SDK - .

From the link you give to the 'snort' site I can follow further on to the Microsoft Security Bulletins, but see nothing there related to the LPenHelper part. I can't see how a vulnerability there could somehow be related to remote code execution. So I think those snort advisories are spurious, though they might relate to Excel crashes (like yours) being logging from that function.

The char* pointer returned from the LPenHelper call is explicitly called out in the Excel header as "READ ONLY!!!" (with the three exclamation marks), so I understood that as indicating we should not be trying to free it.

typedef struct _fmlainfo
{
	int wPointMode;	// current edit mode.  0 => rest of struct undefined
	int cch;	// count of characters in formula
	char *lpch;	// pointer to formula characters.  READ ONLY!!!
	int ichFirst;	// char offset to start of selection
	int ichLast;	// char offset to end of selection (may be > cch)
	int ichCaret;	// char offset to blinking caret
} FMLAINFO;

@govert
Copy link
Member

govert commented Sep 1, 2020

The other approach I tried for reading the formula while editing was through the accessibility interfaces, but I was not able to make that stable or reliable at all.

My suggestion is still to set up some testing framework, but I would need some help with that.

@wh1t3cAt1k
Copy link

wh1t3cAt1k commented Sep 1, 2020

Setting up such a UI testing framework with real Excel onboard would be a large undertaking...
I agree it could help troubleshoot the problem. I guess we'll have to in the end, if nothing helps.

In the meanwhile, during the code review, we noticed three potential issues:

image

  1. Could fmlaInfo.lpch ever be a null pointer? If yes, the Marshal call would be invalid, so we could add a check for that.
  2. Double marshalling (one for logging purposes, other for result return) seems unwarranted... Probably we can replace that with a single call?

image

  1. This can be a source of platform-dependent bugs, because technically a long in C can be longer than 32 bits. Could it truncate the pointer address as a result and make us access an incorrect address?

Currently, all our crash reports are isolated to "formula editing" and whatever stack traces we have mention XLCALL32.dll, so I highly suspect that the fix can be scoped to the XlCall class...

@govert
Copy link
Member

govert commented Sep 1, 2020

  1. The documentation for Marshal.PtrToStringUni indicates that will return null if the input pointer is null (https://docs.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.marshal.ptrtostringuni?view=netframework-4.0#System_Runtime_InteropServices_Marshal_PtrToStringUni_System_IntPtr_System_Int32_)

  2. The two strings we copy out are not the same length, but the Verbose Logging case could be guarded explicitly to only be called when verbose logging is enabled. I can't see how this relates to the process crashing, though.

  3. My understanding of the Windows header convention is that long means 32-bit integer in both 32-bit and 64-bit versions. See https://docs.microsoft.com/en-us/windows/win32/winprog/windows-data-types . Is that not correct?
    The question in my comment about 'so why different' refer to the fact that some API return types are 'int' in the xlcall.h file, while others are 'long'. The question is for Microsoft, because these types mean the same thing. A possible answer would be that the differences in the header were inherited from a 16-bit history, whether 'int' and 'long' meant different things.

I agree that the LPenHelper call is the most likely cause of trouble, but I don't know whether we are violating some Excel API requirement, or whether we are exposing a bug in Excel. Without a reproducible case, I don't feel able to open a ticket with Microsoft. Maybe we're using a really old API call that isn't really supported, and that trips over some Excel internals in some environments. That would be sad, because I couldn't find another way to implement the IntelliSense.

@govert
Copy link
Member

govert commented Sep 1, 2020

One option I mention in the scattered TODOs there is to find a way to move the UpdateEditState call onto the main Excel thread. I remember finding that this would not be an easy change though - I have tried to avoid interfering with the main input loop as much as possible.

@gmichaud
Copy link
Contributor Author

gmichaud commented Sep 1, 2020

I'm able to get it to crash reliably by adding a loop calling LPenHelper in a loop. Behaviour is the same that I experience occasionally at run time, Excel closes without showing any exception in Visual Studio even if it's attached to the process.

https://recordit.co/NQkQRgA510

Screen Shot 2020-09-01 at 11 46 52

@gmichaud
Copy link
Contributor Author

gmichaud commented Sep 1, 2020

With WinDbg attached, i'm able to catch the following exception details:

Screen Shot 2020-09-01 at 11 54 20

I know 100M calls is not realistic, but it could simply expose a race condition or threading issue that can happen with very specific timing.

Looks a lot like the random exceptions i've seen in the wild with our issue and that also include RICHED20.dll...

@govert
Copy link
Member

govert commented Sep 1, 2020

That's useful - I can also get it to crash, and if I reduce to 100k calls I get really weird behaviour in Excel between the in-sheet editing and the formula bar !?

It would be good to get this to run somewhere in the main thread and see if that at least does not crash. If it doesn't cause trouble from the main thread, we just need to call it at the right time without interfering with the editing . . .

@gmichaud
Copy link
Contributor Author

gmichaud commented Sep 1, 2020

@govert I was checking for memory leaks, this is why I made it so high -- I confirm it can be replicated with 100k, even 10k calls. Will try it from the main thread now.

@gmichaud
Copy link
Contributor Author

gmichaud commented Sep 1, 2020

@govert One of the things I tried is to set a timer to avoid making repeated calls to GetFormulaEditPrefix while the user is still typing -- the crash doesn't happen as quickly but it still happens. Using _syncContextMain instead of _syncContextAuto ultimately seems to be what helps the most, but it slows down keyboard input (that's running on the main UI thread, correct?) Combing a timer with the use of _syncContextMain seems to be the solution: the UI is responsive, and it's no longer crashing (and I still have that loops that repeatedly calls LPenHelper)

       void UpdateEditStateAfterTimeout()
        {
            const int TimeoutBeforeUpdate = 100;

            if (_updateTimer == null)
            {
                _updateTimer = new System.Timers.Timer();
                _updateTimer.AutoReset = false;
                _updateTimer.Interval = TimeoutBeforeUpdate;
                _updateTimer.Elapsed += new System.Timers.ElapsedEventHandler((o,e) => UpdateEditState());
            }

            //This resets any currently running timer and ensures it only fires after the interval set above
            _updateTimer.Stop();
            _updateTimer.Start();
        }


        void UpdateEditState(bool moveOnly = false)
        {
            // Switches to our Main UI thread, updates current state and raises StateChanged event
            _syncContextMain.Post(_ =>
            {
                UpdateEditStateImpl(moveOnly);
            }, null);
        }

@gmichaud
Copy link
Contributor Author

gmichaud commented Sep 1, 2020

@govert here's the fix I made on my side which seems to resolve the problem: VelixoSolutions@c877ba9

If you think this makes sense I'll probably give a debug build to the customers that complain the most to see if issue goes away for them.

@govert
Copy link
Member

govert commented Sep 1, 2020

@gmichaud Thank you for digging into this! I'll try to have a look at your changes soon - it would make sense if the LPenHelper call must be made on the main thread. As you point out, managing the timing then without interfering with the editing experience is tricky. Some feedback from testing elsewhere would help to confirm that the crashes go away.

@gmichaud
Copy link
Contributor Author

gmichaud commented Sep 2, 2020

@govert we just shipped an update to a few customers. I'll let you know in a few weeks if the issue is gone. @wh1t3cAt1k also refactored my changes a bit and added cleanup, if it works out fine I'll open a PR. Thank again for your input!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants