IndexOutOfRangeException at FixedSizeDictionary`3.AddOrReplace #498

Closed
phil000 opened this issue Feb 3, 2022 · 47 comments · Fixed by #512

@phil000

phil000 commented Feb 3, 2022

We're getting this exception periodically in Handlebars, Version=2.1.0.0, when trying to render content from a compiled delegate (HandlebarsTemplate<object, object>).

I'm not sure what else I can provide because it doesn't seem related to the content being rendered. When debugging, it always works.

We were previously using 2.0.9 on .NET 5. We upgraded our project to .NET 6.0 and Handlebars 2.1.0 and the problem apparently started.

System.IndexOutOfRangeException:
   at HandlebarsDotNet.Collections.FixedSizeDictionary`3.AddOrReplace (Handlebars, Version=2.1.0.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.BindingContext.Initialize (Handlebars, Version=2.1.0.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.BindingContext+BindingContextPool.CreateContext (Handlebars, Version=2.1.0.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.BindingContext.Create (Handlebars, Version=2.1.0.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.HandlebarsEnvironment+<>c__DisplayClass15_0.<Compile>b__0 (Handlebars, Version=2.1.0.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.HandlebarsEnvironment+<>c__DisplayClass16_0.<Compile>b__0 (Handlebars, Version=2.1.0.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at OnSend.MessagingService.Core.Templating.HandlebarsMessageBuilder.BuildMessage (OnSend.MessagingService.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null: D:\a\1\s\OnSend.MessagingService.Core\Templating\HandlebarsMessageBuilder.cs:68)
oformaniuk added the bug label Feb 3, 2022
@phil000
Author

phil000 commented Feb 3, 2022

FYI,
Just checking our logs, we've never seen this issue before upgrading to .NET 6.0 and 2.1.0 a few days ago.

I've temporarily downgraded back to 2.0.9 to see whether that resolves the issue.

@oformaniuk
Member

Hello @phil000
Thanks for the issue report.
There were no changes related to this code path in 2.1.0. It's also unclear how the .NET upgrade could have caused it.

@phil000
Author

phil000 commented Feb 3, 2022

Thank you for that.

I've reverted to .NET 6.0 + 2.0.9 and deployed to our test environment already.

I'll be able to report in a few days if we see the issue keep occurring, or if it stops.

Either way that will provide a path forward.

There are a lot of calls to 'AddOrReplace' in Initialize, so a debug build, and maybe a process dump taken at the time of the exception, may help.

@oformaniuk
Member

@phil000, additional information will definitely help 👍
Looking forward to more info.

@phil000
Author

phil000 commented Feb 7, 2022

Hi.
For your information, we have been monitoring our test environment on .NET 6.0 + 2.0.9 for several days and have not seen the 'IndexOutOfRangeException' occur again.

We have deployed this configuration into production as of now.

Next step is to see if we can get the 'IndexOutOfRangeException' to occur again in the test environment on 2.1.0 and provide a debug stack trace.

@phil000
Author

phil000 commented Feb 9, 2022

This issue hasn't occurred again. Happy to close it as we cannot reproduce it.

@oformaniuk
Member

@phil000, thanks for keeping me updated.

@jeffclary

I am seeing a very similar issue in version 2.0.9 and netstandard2.1. My stack trace is a bit different because I am using Handlebars.Net.Extension.Json version 1.0.0.

The behavior is the same, in that I see only periodic errors that I can't explain. Still investigating on my end.

@ChadNedzlek

ChadNedzlek commented Feb 18, 2022

We are seeing this in our service as well, several dozen times a day, though our call stack isn't quite the same:

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at HandlebarsDotNet.Collections.FixedSizeDictionary`3.AddOrReplace(TKey& key, TValue& value, EntryIndex`1& index)
   at HandlebarsDotNet.Iterators.ListIterator`2.Iterate(EncodedTextWriter& writer, BindingContext context, ChainSegment[] blockParamsVariables, Object input, TemplateDelegate template, TemplateDelegate ifEmpty)
   at HandlebarsDotNet.Iterators.ListIterator`2.HandlebarsDotNet.Iterators.IIterator.Iterate(EncodedTextWriter& writer, BindingContext context, ChainSegment[] blockParamsVariables, Object input, TemplateDelegate template, TemplateDelegate ifEmpty)
   at HandlebarsDotNet.Compiler.Iterator.Iterate(BindingContext context, EncodedTextWriter writer, ChainSegment[] blockParamsVariables, Object target, TemplateDelegate template, TemplateDelegate ifEmpty)
   at lambda_method(Closure , EncodedTextWriter& , BindingContext )
   at HandlebarsDotNet.HandlebarsEnvironment.<>c__DisplayClass15_0.<Compile>b__0(TextWriter writer, Object context, Object data)
   at HandlebarsDotNet.HandlebarsEnvironment.<>c__DisplayClass16_0.<Compile>b__0(Object context, Object data)

It definitely seems like something is weird about the FixedSizeDictionary. It's also not consistent: we run it ~3000 times a day, and it only fails around 30 times. But that still means roughly 1% of our users get essentially random failures, which is a bit too high for comfort. We are about to attempt to just wrap all the Handlebars code in a giant "try { Render } catch { Render again }" loop, but that's really unfortunate for our service to have to do.
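For illustration only, a minimal sketch of such a retry wrapper; the class and member names are hypothetical, it assumes a compiled HandlebarsTemplate<object, string> as used elsewhere in this thread, and it only masks the underlying bug rather than fixing it:

using System;
using HandlebarsDotNet;

public class RetryingRenderer
{
    private readonly HandlebarsTemplate<object, string> _template;

    public RetryingRenderer(string templateSource)
    {
        // Compile once; the delegate is reused for every render.
        _template = Handlebars.Compile(templateSource);
    }

    public string Render(object model)
    {
        try
        {
            return _template(model);
        }
        catch (IndexOutOfRangeException)
        {
            // If the first render hit a corrupted pooled context, a second
            // attempt may get a healthy one. This is a workaround, not a fix.
            return _template(model);
        }
    }
}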

@ChadNedzlek

ChadNedzlek commented Feb 18, 2022

My guess is there is some sort of multithreading issue, and some Handlebars data structure we are sharing isn't actually thread safe (that's the only way I can explain the infrequency, and a failure to look something up in an immutable input).

Is the "HandlebarsTemplate" delegate returned from IHandlebars.Compile not reusable? Or perhaps not threadsafe (so we could have a pool of them shared, but perhaps they have to be rented)?

@oformaniuk
Member

oformaniuk commented Feb 19, 2022

@ChadNedzlek , @jeffclary
It is thread safe (by design). FixedSizeDictionary is part of BindingContext which is pooled and should not be shared between different threads at any given moment in time.
However, I suspect that there's some race-condition involved.
I've already tried to tackle the problem but I still cannot reproduce it in a way I can get some useful data about what is happening.
If you see the error regularly and are able to provide a dump, or can put together a consistent repro, that would help.
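For reference, a minimal sketch of the usage pattern that design implies: one shared compiled delegate, many concurrent renders, each of which is expected to rent and return its own pooled BindingContext internally (the template text and model here are made up):

using System.Threading.Tasks;
using HandlebarsDotNet;

var template = Handlebars.Compile("Hello {{name}}!");

// Many concurrent renders of the same compiled template; the pooled
// BindingContext instances should never be shared between these calls.
Parallel.For(0, 10_000, i =>
{
    _ = template(new { name = $"user-{i}" });
});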

@ChadNedzlek

ChadNedzlek commented Feb 22, 2022

I'll try my best. Unfortunately it's server code, so it just causes 500s in our service, and taking a dump isn't really feasible (especially not one I could share, since it would contain customer data). If it's not shared between threads, it would be weird for it to be a race condition (since what two things would be racing?).

It's definitely causing us serious problems in our server... I'm not sure what we are going to do here.

@ChadNedzlek

I'm going to try making a small repro that basically just does millions of simultaneous renders and see if I can trigger it... I'm not sure what call stack that's coming from, though, so it's going to take me a while to produce anything. I'm not even sure what it's iterating.

@ChadNedzlek

Hrm... I think it might be a double return of something from a pool. In general, disposing something is supposed to be safe to do multiple times, but a lot of the Disposes in Handlebars appear to return things to the pool unchecked, including "contexts". So if we disposed something twice that we shouldn't have, two copies of the same object end up in the pool, and will just randomly crash later. (documentation here about 6 paragraphs down)

It would probably be good if the returning disposes were multi-call safe. I'm going to check our code base for any potential double disposes and see if I can reproduce the issue.
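For illustration, one common way to make a pool-returning Dispose safe against repeated calls is an Interlocked guard; this is a generic sketch, not the actual BindingContext code:

using System;
using System.Threading;

// Generic sketch: Dispose returns the instance to its pool at most once,
// even if callers dispose it repeatedly.
public sealed class PooledItem : IDisposable
{
    private readonly Action<PooledItem> _returnToPool; // supplied by the owning pool
    private int _returned; // 0 = leased, 1 = already returned

    public PooledItem(Action<PooledItem> returnToPool) => _returnToPool = returnToPool;

    public void Dispose()
    {
        // Only the first caller observes the 0 -> 1 transition; later calls
        // are no-ops, so the same reference can never be pooled twice.
        if (Interlocked.Exchange(ref _returned, 1) == 0)
        {
            _returnToPool(this);
        }
    }
}

A real pool would also reset the flag when leasing the item back out.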

@ChadNedzlek

That would explain why the crash is so random... it will happen later to whatever poor code ends up sharing the object from the pool.

@ChadNedzlek

Not only do we not have any double disposes, we don't have any disposes at all. But we are using a dependency injection framework, which is managing lifetimes for us, so it might be disposing things in unexpected ways.

For now, I think I'm going to have to put a global lock around all our calls to Handlebars, since a lot of these corrupted pools look to be effectively singletons, and spread pretty generously all over the place, so there isn't a good way to avoid them. Hopefully only allowing a single page to render at a time per process will avoid the race condition... though that's brutal for performance.
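A rough sketch of that "one render at a time" workaround, using a process-wide semaphore (the wrapper is hypothetical; it trades most of the throughput for avoiding the suspected race):

using System.Threading;
using HandlebarsDotNet;

public static class SerializedRenderer
{
    // Process-wide gate: at most one Handlebars render runs at any time.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(1, 1);

    public static string Render(HandlebarsTemplate<object, string> template, object model)
    {
        Gate.Wait();
        try
        {
            return template(model);
        }
        finally
        {
            Gate.Release();
        }
    }
}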

@ChadNedzlek

ChadNedzlek commented Feb 24, 2022

If we could have some version of the assembly with some sort of double-dispose/double-return validation we could opt into, I'd be happy to run it; we produce the problem hundreds of times a day, so I'd be able to pretty quickly identify whether that's what's happening. The call stack doesn't seem to be the same each time, though it's almost always from some sort of ListIterator, and the problem gets worse the longer the process has been running (because the pool gets more and more corrupt, if my debugging is to be trusted), until our server eventually gets so many exceptions that it reboots itself; then the problem goes away for about a day until the ramp-up starts again.

@ChadNedzlek

ChadNedzlek commented Feb 24, 2022

Actually, the repository is simple enough that I might try building a copy of the DLL myself with that validation and see if I can find it.

@ChadNedzlek

Sorry to spam the thread so much, but I'm running this fork https://github.com/ChadNedzlek/Handlebars.Net/commit/4f42cdaed3730d1dae6b6c52fe56e3693e5223bc in my service in one of our staging environments for a while, to see whether what I suspect is happening actually is and, if so, where (the exception I added should have the double return on the stack).

@Rayzbam

Rayzbam commented Feb 24, 2022

We just had the same issue on a .NET 6 program too.
We tried everything without being able to explain this behavior. We just restarted our service and everything worked after that.
Let me know if you need some input, but we don't actually have much more to provide.

Here is our stack :
[screenshot of the stack trace]

@ChadNedzlek

Mine is happening on .NET Core 3.1. I don't think the version of .NET has anything to do with the problem; there is just some race condition happening with one of the iterators that's somehow being shared and shouldn't be. I haven't caught the problem in my staging environment, but it's incredibly low scale, so it's possible I need to scale up to cause it to happen.

@ChadNedzlek

ChadNedzlek commented Feb 24, 2022

Hrmm... it might also be something using a context that's already been returned, which is a lot harder to track (since there is no lease or anything to attach diagnostic context to, and the objects are totally usable after being disposed, because the pooling will rehydrate them, with no way to tell). The contexts are disposable/returnable, and passed to a LOT of methods that build methods that build methods, so I'm having a hard time following the code to see if anyone is keeping a reference to a context that's getting disposed by the caller. I might just remove the pooling altogether, see if that makes my problem go away, and then just use that fork going forward.

@rexm
Member

rexm commented Feb 24, 2022

I'm seeing some speculation here but not a lot of evidence. Can you at least point us at some specific lines of code you're referring to that you think are suspect?

@ChadNedzlek

ChadNedzlek commented Feb 24, 2022

It's hard to tell, because the race condition is so difficult to detect/cause, and the call stacks appear to be fairly random. The "disposes" are difficult to find, because any random using can call one, and there are hundreds in the code base, which I'm not familiar with. I'd be happy to try and provide any assistance I can to diagnose the problem, because it happens hundreds of times a day to my service. But I need something here, because my server is crashing. I'm happy to try any potential workaround, or to gather any diagnostic information that would assist (that isn't a dump containing customer data). I'm doing my best to solve or prevent the problem.

@rexm
Member

rexm commented Feb 25, 2022

We need to narrow it down to a manageable number of suspicious code areas, below "hundreds". Can you clarify with a few specific examples of these hundreds? You mentioned double-dispose problems; can you point to the body of a Dispose that would be problematic if invoked twice, so we could start from there? I went through the whole codebase and found approximately 10 instances of using() that might lead to a Dispose invocation in Handlebars code, for example this one.

@ChadNedzlek

ChadNedzlek commented Feb 25, 2022

For example: BindingContext.Dispose: https://github.com/Handlebars-Net/Handlebars.Net/blob/master/source/Handlebars/Pools/BindingContext.Pool.cs#L21

If that is called twice, the same context object is returned to the pool twice, effectively duplicating it. Any two future renders might then actually be sharing a context (since the same reference was put into the pool twice), which could lead to strange behavior as they both attempt to simultaneously modify the BlockParamsObject inside that context (which, based on the code and the call stacks, is the object that's throwing the exception).
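To make the hazard concrete, a toy illustration (not the library's pool): once the same reference has been returned twice, two later "distinct" rentals are in fact one shared object:

using System;
using System.Collections.Concurrent;

var pool = new ConcurrentBag<object>();
var context = new object();

pool.Add(context); // first Dispose/Return
pool.Add(context); // second Dispose/Return of the same instance

pool.TryTake(out var first);
pool.TryTake(out var second);

// Two callers now each believe they own a context, but it is one object,
// so concurrent renders end up mutating shared state.
Console.WriteLine(ReferenceEquals(first, second)); // True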

@rexm
Member

rexm commented Feb 25, 2022

Digging into the implementation, it does look like calling Return would not be thread safe. Have you found any other instances we should also look into? @zjklee do you think we should make sure FixedSizeDictionary is thread safe, or put a semaphore in pool.Return?
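As a sketch of the second option, a Return guarded by a lock; this is generic code, not the actual Handlebars.Net pool:

using System.Collections.Generic;

// Minimal pool whose Rent/Return are serialized by a lock. Note that this
// protects the pool's internal state but does not by itself prevent the
// double-return scenario discussed above.
public sealed class LockedPool<T> where T : class, new()
{
    private readonly Stack<T> _items = new Stack<T>();
    private readonly object _sync = new object();

    public T Rent()
    {
        lock (_sync)
        {
            return _items.Count > 0 ? _items.Pop() : new T();
        }
    }

    public void Return(T item)
    {
        lock (_sync)
        {
            _items.Push(item);
        }
    }
}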

@madmox

madmox commented Mar 3, 2022

Same issue here; the error started popping up randomly a few weeks or months ago. We are running the app under .NET Core 3.1.22 and the NuGet package version is 2.0.10. The error happens when we compile the template:

public class HandlebarsTemplateService : ITemplateService
{
    private readonly IHandlebars _compiler;

    public HandlebarsTemplateService()
    {
        this._compiler = Handlebars.Create(new HandlebarsConfiguration() { TextEncoder = null });
    }

    string ITemplateService.GetTemplatedSimpleString(string template, Dictionary<string, object> parameters)
    {
        HandlebarsTemplate<object, string> compiledTemplate = this._compiler.Compile(template); // randomly fails
        return compiledTemplate(parameters);
    }
}
System.IndexOutOfRangeException: Index was outside the bounds of the array.
  ?, in void FixedSizeDictionary<TKey, TValue, TComparer>.AddOrReplace(in TKey key, in TValue value, out EntryIndex<TKey> index)
  ?, in void BindingContext.Initialize()
  ?, in BindingContext BindingContextPool.CreateContext(ICompiledHandlebarsConfiguration configuration, object value, BindingContext parent, TemplateDelegate partialBlockTemplate)
  ?, in BindingContext BindingContext.Create(ICompiledHandlebarsConfiguration configuration, object value)
  ?, in HandlebarsTemplate<TextWriter, object, object> HandlebarsEnvironment.Compile(TextReader template)+(TextWriter writer, object context, object data) => { } x 2

What I find strange is that we are running this code on AWS Lambda in the main thread (no use of parallel tasks), so even though the HandlebarsTemplateService instance is a singleton that is reused between two consecutive Lambda invocations, it should never be subject to concurrent executions :/ (multiple parallel Lambda invocations are not supposed to share execution contexts). Or I might be misunderstanding how Lambdas actually work with singletons.

@oformaniuk
Member

@rexm , I do not think that the issue is in FixedSizeDictionary itself. At this point I do see some merit in the theory about unsafe pooling.
Unfortunately, I still have not been able to reproduce it on my side in order to verify the fix, so I may need some help from the people here who do see the issue. I'm going to publish a beta package soon.

@ChadNedzlek

Let me know when you've got a beta version ready to go and I'll give it a run for a couple weeks in our service. If we don't see the problem at that point, we can assume it's fixed.

@oformaniuk
Member

oformaniuk commented Mar 4, 2022

@ChadNedzlek

Perfect, I'll get it checked in to our staging environment tomorrow, and that should roll out next week, so I should be able to report back in 7-8 days from now.

@pavisalavisa

Same issue here. I'm running .NET Core 3.1 with Handlebars.NET 2.0.9. The app itself is an Azure function app running with an in-process runtime model.

    class HtmlTemplate
    {
        private readonly HandlebarsTemplate<object, string> template;

        public HtmlTemplate(Stream templateStream)
        {
            using var reader = new StreamReader(templateStream);
            var source = reader.ReadToEnd();
            template = Handlebars.Compile(source);
        }

        public string Render<T>(T model)
        {
            return template(model);
        }
    }

    // Usage
    var template = new HtmlTemplate(stream);
    var renderedHtml = template.Render(model);
System.IndexOutOfRangeException:
   at HandlebarsDotNet.Collections.FixedSizeDictionary`3.AddOrReplace (Handlebars, Version=2.0.9.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.BindingContext.Initialize (Handlebars, Version=2.0.9.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.BindingContext+BindingContextPool.CreateContext (Handlebars, Version=2.0.9.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.HandlebarsEnvironment+<>c__DisplayClass15_0.<Compile>b__0 (Handlebars, Version=2.0.9.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
   at HandlebarsDotNet.HandlebarsEnvironment+<>c__DisplayClass16_0.<Compile>b__0 (Handlebars, Version=2.0.9.0, Culture=neutral, PublicKeyToken=22225d0bf33cd661)
  at (client code)...

I have observed this issue only once so far, and it doesn't seem to be correlated with any traffic spikes. In my case, only a portion of client requests were affected by this issue. Restarting the app fixed the issue immediately.

This observation fits into your hypothesis about the pool getting corrupted.

@madmox I had the same confusion. In my case, the in-process runtime model means that static host resources are shared between individual function executions. As a result, only functions running in the host with a corrupted resource pool had this issue.

I'll try updating the package to the beta release. There's no telling whether that's going to fix it, given the unpredictable nature of the problem.

@madmox

madmox commented Mar 10, 2022

@pavisalavisa I suspected something similar in my case, but the AWS documentation makes it pretty clear that Lambda execution environments can only run one function invocation at a time. If that is the case, static resources are not shared across parallel Lambda invocations, although they are generally shared across consecutive invocations, because the same Lambda execution environment is reused whenever possible.

Also, I never observed anything like a "corrupted" resource pool: I never observed the event twice in the same week, even without restarting anything. Maybe the execution environment is recycled if the function errors out. I could investigate further by tracing a static random variable (to test the "shared static resources" theory) and running a small load test, though I'm not sure I will have the time to do so.

@oformaniuk
Member

@ChadNedzlek , any news from your side?

@ChadNedzlek

The incidence seems to have gone down, but not to zero, which is... confusing. We have only had 1 hit since deploying the beta version 5 days ago to this particular set of machines, whereas we previously had around a dozen a day. But I'm confused that there would be any at all. It somehow seems to have changed the behavior but not eradicated it, which, given the change I saw, doesn't make any sense; it seems like it should either have done nothing or solved the problem.

@oformaniuk
Member

@ChadNedzlek , thanks for the update.
Could it be due to some rolling update, with the error actually coming from the old version?
Anyway, it's been an additional 4 days; how does the picture with the errors look now?

@pavisalavisa

FYI I have observed another instance of this issue since deploying the beta package.

There's probably more to this problem.

@ChadNedzlek

Yeah, unfortunately it still seems to be prevalent. We deployed the fix March 10, and for a couple of days it looked fine, but then a whole batch of the errors came back, so it looks like the problem is somewhere else:
[screenshot]

I still haven't managed to get it to reproduce in a non-production environment to get a dump of it. I can't figure out what the difference is other than scale.

oformaniuk added a commit that referenced this issue Mar 26, 2022
Fix collision handling in `FixedSizeDictionary`
@oformaniuk
Member

oformaniuk commented Mar 26, 2022

Thanks to a failure in GitHub Actions I was able to get some more details about the problem:

Error Message:
   System.IndexOutOfRangeException : Index was outside the bounds of the array.
  Stack Trace:
     at HandlebarsDotNet.Collections.FixedSizeDictionary`3.AddOrReplace(TKey& key, TValue& value, EntryIndex`1& index) in D:\a\Handlebars.Net\Handlebars.Net\source\Handlebars\Collections\FixedSizeDictionary.cs:line 307
   at HandlebarsDotNet.ValueProviders.BlockParamsValues.CreateProperty(Int32& variableIndex, EntryIndex`1& index) in D:\a\Handlebars.Net\Handlebars.Net\source\Handlebars\ValueProviders\BlockParamsValues.cs:line 30
   at HandlebarsDotNet.Iterators.DictionaryIterator`3.Iterate(EncodedTextWriter& writer, BindingContext context, ChainSegment[] blockParamsVariables, Object input, TemplateDelegate template, TemplateDelegate ifEmpty) in D:\a\Handlebars.Net\Handlebars.Net\source\Handlebars\Iterators\DictionaryIterator'2.cs:line 26
   at HandlebarsDotNet.Iterators.DictionaryIterator`3.HandlebarsDotNet.Iterators.IIterator.Iterate(EncodedTextWriter& writer, BindingContext context, ChainSegment[] blockParamsVariables, Object input, TemplateDelegate template, TemplateDelegate ifEmpty)
   at HandlebarsDotNet.Compiler.Iterator.Iterate(BindingContext context, EncodedTextWriter writer, ChainSegment[] blockParamsVariables, Object target, TemplateDelegate template, TemplateDelegate ifEmpty) in D:\a\Handlebars.Net\Handlebars.Net\source\Handlebars\Compiler\Translation\Expression\IteratorBinder.cs:line 135
   at lambda_method(Closure , EncodedTextWriter& , BindingContext )
   at HandlebarsDotNet.HandlebarsEnvironment.<>c__DisplayClass15_0.<Compile>b__0(TextWriter writer, Object context, Object data) in D:\a\Handlebars.Net\Handlebars.Net\source\Handlebars\HandlebarsEnvironment.cs:line 171
   at HandlebarsDotNet.HandlebarsEnvironment.<>c__DisplayClass16_0.<Compile>b__0(Object context, Object data) in D:\a\Handlebars.Net\Handlebars.Net\source\Handlebars\HandlebarsEnvironment.cs:line 185
   at HandlebarsDotNet.Test.BasicIntegrationTests.DictionaryEnumeratorWithBlockParams(IHandlebars handlebars) in D:\a\Handlebars.Net\Handlebars.Net\source\Handlebars.Test\BasicIntegrationTests.cs:line 814

According to the stack trace, the problem happened due to a bug in the collision-handling code. It could occur regardless of pooling or load, and depends mostly on the hash code values involved and the state of the internal storage.
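For readers unfamiliar with this failure mode, a deliberately simplified illustration (not the actual FixedSizeDictionary code) of how a collision-handling bug in a fixed-size, array-backed map can throw IndexOutOfRangeException only for particular hash codes and fill states:

using System.Collections.Generic;

// Toy fixed-size map with broken linear probing, for illustration only.
class TinyFixedMap<TKey, TValue>
{
    private readonly KeyValuePair<TKey, TValue>?[] _entries;

    public TinyFixedMap(int capacity) => _entries = new KeyValuePair<TKey, TValue>?[capacity];

    public void AddOrReplace(TKey key, TValue value)
    {
        int index = (key!.GetHashCode() & int.MaxValue) % _entries.Length;

        // Bug: the probe never wraps around, so when the home slot is near the
        // end of the array and occupied by a colliding key, `index` walks past
        // the last slot and the next access throws IndexOutOfRangeException.
        while (_entries[index].HasValue &&
               !EqualityComparer<TKey>.Default.Equals(_entries[index].Value.Key, key))
        {
            index++; // should wrap: index = (index + 1) % _entries.Length;
        }

        _entries[index] = new KeyValuePair<TKey, TValue>(key, value);
    }
}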

@ChadNedzlek , I'm going to release a new version of the package today and I'd really appreciate it if you could deploy it to your environment to verify that the problem has gone away.

@ChadNedzlek

That's great news. I will be happy to try out the new code during our next deployment.

@ChadNedzlek

Is this included in the 2.1.1 version on Nuget? If so, I'll pull it and try it out.

@oformaniuk
Member

> Is this included in the 2.1.1 version on Nuget? If so, I'll pull it and try it out.

Yes, it is part of 2.1.1

@oformaniuk
Member

@ChadNedzlek , are there any updates from your side?

@ChadNedzlek

It's been in the staging environment for a week without issue. It's rolling out a bit broader in the next couple days. I'll let you know what I see.

@ChadNedzlek

After running in production for a week, I don't think I've seen any more evidence of this. Looks like it's all good.

@oformaniuk
Member

@ChadNedzlek , awesome, thanks for keeping us posted!
