Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[.Net SDK, state service] Doesn't save state #4477

Closed
SenseVision opened this issue Apr 16, 2018 · 71 comments
Closed

[.Net SDK, state service] Doesn't save state #4477

SenseVision opened this issue Apr 16, 2018 · 71 comments
Assignees

Comments

@SenseVision
Copy link

SenseVision commented Apr 16, 2018

Bot Info

Issue Description

It started about 2 weeks ago .

Bot doesn't save it's current state \ deserialize the previous state instead of current (the last one where he stopped ).

Occurs rarely.(It could be user get this bug , it could be not , user could get this bug several times per conversation)

Occurs in random moment of the workflow (doesn't depend on which dialog is \ context.Wait(Smth) ).

No Exceptions at all ! Neither in workflow nor http requests to Azure Tables.

Bot process looks like :

Bot state is 0
Bot moves to the state 1( send message 1)
Bot waiting for an answer( user message 1)
//
User gets bot message 1
User respond (user message 1)
//
Bot accept answer (user message 1) , Bot state = 1( expected = 1)
Bot moves to the state 2( send message 2)
Bot waiting for an answer( user message 2)
//
User gets bot message 2
User respond (user message 2)
//
Bot accept answer (user message 2) , Bot state = 1 ( expected = 2)

Actual Results

Not really good example (because use nested dialog), but same was with dialogues , wich doesn't know anything about each other.

errorstate

Later , didn't changed anything in code \ server config any elsewhere
ok

@JasonSowers
Copy link
Contributor

Please share your code, this will be next to impossible to diagnose without your code.

@SenseVision
Copy link
Author

@JasonSowers added as collaborator to the copy of repository

Also today we have tried to downgrade BF to 3.12.4(stable version we used before 3.13.1) right now 30+ full conversations(25-40 interactions) , no bug

@oflahero
Copy link

You're not going mad SenseVision, I just had a panic attack for most of the afternoon today because I thought I had completely messed up my work (and mediocre understanding of how BotFramework worked).

Exactly the same experience after moving from 3.14.0 to either 3.14.1 or 3.15 (not enough testing between upgrades to determine which) - random mutations in conversation position, sometimes even going back to points in 'completed' (Context.Done) dialogs. I only noticed properly after upgrading again to 3.15.

I'm using Azure Table Storage for state, and an Azure app service for the bot. Also 3.2.5 for Microsoft.Bot.Builder.Azure. Dialogs for conversations. (No LUIS, no FormFlow, or anything fancy. Bog-standard MessagesController.Post -> scope.Resolve, Conversation.SendAsync.... -> RootDialog -> context.Call(other dialogs) )

After downgrading Bot.Builder and Bot.Builder.History from 3.15 to 3.13.1 (and BotConnector to 3.14.07, although I can't say right now which is making the difference) everything is OK again. WindowsAzure.Storage is at 9.1.1.

To Jason and the team - apologies, as this kind of random comment that is missing a workable repro is about as welcome as a kick in the teeth :)

If I can find a consistent repro, I will certainly post it, but as SenseVision notes, it is frustratingly random.

@JasonSowers
Copy link
Contributor

I got your invite, I will start looking into this. We recently released a new SDK version 3.15 can you please confirm if you are experiencing the same issue in 3.15. @oflahero I couldn't gather if 3.15 was working or not for you from your post.

@oflahero
Copy link

Small update: Bot.Connector is not the issue. It's a problem in Bot.Builder (and/or .History) between 3.13.1 (OK) and 3.15 (not OK).

I can't test if it's Bot.Builder.History in isolation as upgrading that nupkg requires Bot.Builder.

I will try 3.14.1.1 at some point tomorrow and try to narrow it down.

@oflahero
Copy link

I experienced the same issues with 3.14.1.1.

So the problem appears to have been introduced somewhere when moving between 3.13.1 to 3.14.1.1 in Microsoft.Bot.Builder and Microsoft.Bot.Builder.History (unless I was doing something stupid in my code that relied on something that was fixed after 3.13.1!) I don't have any further granularity info.

@JasonSowers
Copy link
Contributor

@oflahero Can you please share your project as well? I'm trying to make a minimum reproducible example and want to see if there are any similarities between your project and SenseVision's

So I have started with the most simple implementation of saving state data and I am adding pieces in from SenseVision's project to try to figure out the root cause, but I haven't been able to reproduce it yet. I will be spending the day looking into this

@JasonSowers
Copy link
Contributor

JasonSowers commented Apr 18, 2018

@oflahero @SenseVision Can either of you confirm this happens in the emulator as well? I have been trying all day to reproduce this with no luck.

@oflahero
Copy link

Hey there. I spent a bit of time on this today too - I've settled on 3.14.0 for Bot.Builder, Bot.Builder.History and Bot.Connector as the most stable configuration. Anything higher (>=3.14.1) and the issue is back.

I tried to reproduce in the emulator with no luck. It's only happening for me in the channel, which in my case is Facebook Messenger.

My project's commercial, and currently pretty large, so no sharing I'm afraid. I understand the frustration of spending time trying to reproduce an issue like this. I will hopefully try to do a pared-down minimal project that exhibits the problem, but I wouldn't be surprised if it doesn't. The problem project has so many interconnected moving parts - targets full 4.7.1, also NetStandard by some NuGet packages. I have enough issues with assembly bindings as it is.

I can understand putting this on hold til more people claim to have the issue also.

@JasonSowers
Copy link
Contributor

JasonSowers commented Apr 18, 2018

Let me try facebook, no luck with repro in the emulator, webchat, (or facebook now) 👎

I would really prefer to get to the bottom of this rather than put it on hold. I would guess it is affecting more customers than just you 2. So any more help you can give would be greatly appreciated

@oflahero
Copy link

Crazy busy today but I'll try to do a repro later if I can.

@SenseVision
Copy link
Author

SenseVision commented Apr 20, 2018

@JasonSowers sorry , i haven't answered , was super busy .

Check 3.15 ->
the earliest i'll be able try would be on Tuesday, i guess.

Can either of you confirm this happens in the emulator as well? ->
3.14.1.1 - yes . Emulator and FB.

I've rechecked references history , looks like @oflahero is right 3.13.1 is OK

@thschuetze
Copy link

Hi guys,
we experienced the same thing after upgrading from 3.13 to 3.14.1.1. dialogs seem to jump to a previous state. Seemed to be most reliably reproducible in the web versions of Slack and MS Teams (not sure if that has anything to do with it though)

Our guess was that something in the way the bot state is saved has changed between those versions (we are using TableStorage with a custom key for the bot state), but we couldn't find anything suspicious in the Bot.Builder commits...

So we just reverted back to 3.13.1 and everything works again.
Would be nice to get this out of the way so we can upgrade again.

@oflahero
Copy link

Welcome to the club!

I'm about halfway through setting up a minimal project - haven't forgotten about you Jason. No guarantees it'll exhibit the symptoms, mind.

@JasonSowers
Copy link
Contributor

No worries guys, been busy over here as well. I hope to make some more time to try to get to the bottom of this next week. All of my bots are running 3.15 (because I test the SDK before we release it) and I have not noticed any issues with state like you are seeing, so I'm hoping one of you can provide a clue here especially because I haven't been able to reproduce this.

@oflahero
Copy link

oflahero commented Apr 22, 2018

I have a repo at https://github.com/oflahero/315Issue where it looks like I can reproduce the issue.

It's a simple bot that echoes what you type with the length of the input string. If you enter 'find', it also launches a dialog simulating searching for a phone or email contact by name substring.

There are only three contacts hardcoded. Once it either returns a match or tells you there are no matches, you're out of the dialog again.

Using branch 315 (Bot Builder & Connector at 3.15.0), and the Facebook channel, you can eventually (after repeatedly hitting 'find' and responding appropriately) observe it jump out of sync. It may take a good ten minutes of this.

Note: there's some code in the repo that isn't used - I was just trying to grab stuff from the problem project in order to reproduce the issue.

I have a publicly available FB messenger page hosting the 3.15.0 version, if anyone wants to try it without installing themselves. Just send me your FB profile identifier so I can add you as a test user to the app on FB Developer, and I'll let you know where to look.

Branch 314 is the stable (3.14.0) version.

The 'UnrelatedPostIssue' branch is another problem entirely - it's a one-character diff commit, but it causes a 'BadRequest' exception in the bot on PostAsync. It's completely floored me.

@seca23
Copy link

seca23 commented Apr 23, 2018

Hi Guys
I have also noticed that our dialogs are jumping to a previous state ever since I updgraded the bot builder sdk from 3.13.1 to 3.15.0. It appears to specifically occur when there is a delay in the user response. We are storing our state using local SQL TableStorage. I have first noticed this via the directline channel. It also occurs in the emulator.

@thschuetze
Copy link

I went through the Bot.Builder commits between 3.13.1 and 3.14.1.1

The only change related to our issue seems to be CachingBotDataStore changes in commit
992e29f
(the cache was changed from local Dictionary to MemoryCache)

I couldn't spot anything suspicious with the changes, but @oflahero could you register your TableBotDataStore directly as IBotDataStore and remove the CachingBotDataStore registration in your example and see if that helps?

@oflahero
Copy link

As I understand it though @thschuetze , you need a CachingBotDataStore sitting on top of your keyed registration for it to work though? At least until we get BotBuilder v4... (see issue #4376)

Maybe there's a simpler way. I'll be the first to admit that the tight integration of Autofac into BotBuilder is a pet hate of mine (mainly because it confuses the hell out of me, and I loathe DI on principle :) )

@JasonSowers
Copy link
Contributor

JasonSowers commented Apr 23, 2018

Awesome work guys, thanks for helping us work this out. I will be spending time on this today, but mainly tomorrow since its late in the day now trying to address the issue.

As a note when I spent nearly a whole day trying to reproduce I tried both with and without CachingBotDataStore and was unsuccessful.

@seca23 this issue does not seem related can you please open a new issue with details about what you are experiencing.

@thschuetze
Copy link

thschuetze commented Apr 24, 2018

@oflahero you are right, I wasn't aware of that issue.
Could you try to register the old version of the CachingBotDataStore in the module to check if those changes have anything to do with the issue.

Btw. I got your example code and deployed it, but so far haven't been able to reproduce the issue with Slack (web), Teams (web) or Skype (iOS app). I don't have a facebook bot available that I could connect.
Is there a specific way to reproduce it more reliably? Answering/typing as quick as possible for example?

@JasonSowers
Copy link
Contributor

Good news I was able to reproduce in the emulator, the bad news is I need to catch it on a turn it fails on this is going to be a tough one to find 👎

@JasonSowers
Copy link
Contributor

JasonSowers commented Apr 24, 2018

@seca23 I actually think your issue is related after looking into this. Are you using multiple dialogs/prompts are you able to share what your dialogs look like if so

@seca23
Copy link

seca23 commented Apr 25, 2018

@JasonSowers from the MessageController I call a LuisDialog. When an intent is triggered it calls a new dialog [that inherits from IDialog]. This dialog contains a few prompts and then starts a form flow. Now, at first it seemed like the problem existed at the completion of the form flow. However, later on, I have noticed that the problem also occurred in the middle of the form flow, and in other prompt dialogs. This happens intermittently. If I run through a form flow quickly, it seems to be fine again. However, especially when there is a delay in user response the problem seems to occur.

Our bot endpoint is local and we store the state in local SQL table storage.

@oflahero
Copy link

@JasonSowers great stuff!

@thschuetze It can be very frustrating trying to reproduce it. I gave up today after five solid minutes of banging the keyboard. I did on two occasions find that the problem appeared when I had left the conversation in the middle of a dialog for a significant time.

How long that has to be I can't say, but my Azure website had definitely spun down by then and took an extra ten secs or so to get going again (don't let anyone fool you that Azure's 'always on' setting really does what it says on the tin)

@JasonSowers
Copy link
Contributor

JasonSowers commented Apr 25, 2018

It seems prompts (which are dialogs themselves) seem to be the culprit here. Is anyone noticing this behavior outside of on a step when a prompt is involved?

@oflahero
Copy link

That could fit my situation; I'm using a PromptDIalog.Text. My example's not big enough to rule out non-prompt interactions though, might have to add a few more dummy steps in the process.

@VanjuliK
Copy link

VanjuliK commented Apr 27, 2018

Hi guys,
I am also facing the same issue in my project.
Bot Info:
SDK Platform: .NET
Active Channels: Skype For Business, Emulator
Deployment Environment: Azure Tables

Package Version Info:
Microsoft.Bot.Builder - 3.14.1.1
Microsoft.Bot.Builder.Azure - 3.2.5
Microsoft.Bot.Builder.CognitiveServices -1.1.2
Microsoft.bot.Builder.History - 3.14.1.1
Microsoft.Bot.Connector - 3.14.1.1

Issue:
There are multiple dialogs for various chat flows and prompts used in the project. I have implemented bot state service using Azure Table and used CachingBotDataStore also. I have ended the context wherever required and returned to parent dialogs from child dialog. The issue is intermittent and the bot is deployed on SfB channel. The behavior is random as sometimes the code works without any issue but other times there is problem with context. I have used Prompts extensively in my project as it's a guided flow for user.

@delight-by
Copy link

delight-by commented Jun 14, 2018

@JasonSowers

This is the workaround we recommend right now.

Wonder how you can recommend that if CachingBotDataStore is the only thing that makes the state being saved (because of its FlushAsync implementation that loads state instances and calls SaveAsync on them). Even InMemoryDataStore doesn't work without it in tests.

@mitkodi
Copy link
Contributor

mitkodi commented Jun 18, 2018

I have the same issue with v3.15.2.2. I'm using SqlBotDataStore and it happens regardless I register CachingBotDataStore or not. The bot uses only one dialog and keeps track of its state in a property, but sometimes the UPDATE [dbo].[SqlBotDataEntities] is not called and on the next message the state is restored to the previous one.

@peterbozso
Copy link

@JasonSowers I'd be glad to provide you an example, but the code I am working on is non-shareable. I'll try to reproduce it in a small sample project as soon as I'll have some time.

@sbuffkin
Copy link

@peterbozso Thank you very much. I am working with Jason on this and a reproduction would go a long way to being able to get closer to the heart of this issue.

@mitkodi
Copy link
Contributor

mitkodi commented Jun 19, 2018

If it could help to someone. Yesterday I changed the scope of SqlBotDataStore registration from SingleInstance to InstancePerLifetimeScope and haven't faced the issue since then (more than 24 hours).

@seca23
Copy link

seca23 commented Jun 19, 2018

@mitkodi
I am still experiencing the issue even with InstancePerLifetimeScope

Version: v3.15.2.2

Here is how my registration looks:

var store = new SqlBotDataStore("BotDataContextConnectionString");
builder.Register(c => new CachingBotDataStore(store, CachingBotDataStoreConsistencyPolicy.ETagBasedConsistency)) .As<IBotDataStore<BotData>>() .AsSelf() .InstancePerLifetimeScope();

@mitkodi
Copy link
Contributor

mitkodi commented Jun 19, 2018

@seca23 I don't wrap the SqlBotDataStore by CachingBotDataStore. I register only SqlBotDataStore:

var store = new SqlBotDataStore(connectionString);
builder.Register(c => store)
    .Keyed<IBotDataStore<BotData>>(AzureModule.Key_DataStore)
    .AsSelf()
    //.SingleInstance();
    .InstancePerLifetimeScope();

@seca23
Copy link

seca23 commented Jun 20, 2018

@mitkodi, I see what you mean. Just to confirm, I am using an on-premise SqlBotDataStore. Thanks, I will surely try it out and let you know if it resolves my problem.

@egracer
Copy link

egracer commented Jun 21, 2018

Not sure if this will help, but we are seeing the same issue while running automated tests. The same conversation will persist the DialogState key in private conversation data during one execution of the test, and will not persist the same key during another execution of the test. The behavior is empirically intermittent — we are not changing anything about the test between executions.

I am running v3.15.2 for Builder & Builder.History. We are using the CachingBotDataStore which wraps the DocumentDbBotDataStore like so:

            builder.Register(c => _botStore)
                .Keyed<IBotDataStore<BotData>>(AzureModule.Key_DataStore)
                .AsSelf()
                .SingleInstance();
            builder.Register(c => new CachingBotDataStore(_botStore, CachingBotDataStoreConsistencyPolicy.ETagBasedConsistency))
                .As<IBotDataStore<BotData>>()
                .AsSelf()
                .InstancePerLifetimeScope();

Sample, incorrect private conversation data saved after processing a message during the test:

{
  "someKey": {}
}

Sample, correct private conversation data saved after processing the same message during a different execution of the same test:

{
 "someKey": {},
  "ResumptionContext": {
    "IsTrustedServiceUrl": true
  },
  "DialogState": "Big long string of serialized Dialog"
}

@egracer
Copy link

egracer commented Jun 21, 2018

Looking a bit further, it also seems that the CachingBotStore invalidates the cache with every flush, but since we're running multiple instances of our Bot, that strategy is not going to work reliably without sticky sessions. One instance of the Bot will flush the state and invalidate the cache, but the other instance will retain the previous state until it flushes its own changes.

@fanidamj-zz fanidamj-zz changed the title Doesn't save state [.Net SDK, state service] Doesn't save state Jun 21, 2018
@oflahero
Copy link

I've been playing with a dev deployment for a couple of days using 3.15.2.2, using SqlBotDataStore (Instance PerLifetimeScope()) with no surrounding CachingBotDataStore. It appears to be stable so far, no occurrences of the issue.

@oflahero
Copy link

oflahero commented Jun 27, 2018

Update: yesterday I observed behaviour matching this issue again, still using SqlBotDataStore, InstancePerLifetimeScope() on its own (no CachingBotDataStore.) Obviously I can't be 100% sure that the cause is the same, but the symptom is identical.

My RootDialog put a new child dialog on the stack, but then on replying to the first message from the child dialog, the bot acted as if the user was still in the MessageReceivedAsync loop of the RootDialog. That is to say, the new bot state ('now in child dialog') was not saved.

I didn't see any SQL Server dependency failures for this operation - had the internal 'bot state save' processing failed, I assume some kind of exception or trace would have been emitted?

If it helps, I've attached the ApplicationInsights timeline from the bot request. Times are GMT (UTC+1). There appear to be more calls to login.botframework.com (and one to login.microsoftonline.com) than I'd expected. And it was after a long period of inactivity for the bot up to that point (it had last received an /api/messages request the day before.)

In the meantime, back I go to 3.14.0 just in case.

image

@JasonSowers
Copy link
Contributor

Thank you, everyone, for your continued patience with this issue. I finally have some good news. We think we have identified the issue and will be testing a private release internally very soon. We hope to have this out in the next release. I will continue to keep you updated as I have information.

@oflahero
Copy link

Nice one! Any hints while we're waiting Jason? Something to do with authentication timeout affecting the state save?

@JasonSowers
Copy link
Contributor

@oflahero @SenseVision @thschuetze @seca23 @VanjuliK @glasody @jamesmundy @tempel3 @peterbozso @delight-by @mitkodi @egracer

We believe this fix works, but there have already been instances of my testing being successful and others still reporting failures. If anyone would like to test this before we release it to nuget you can find the packages here for 3.15.3 here. Please let us know how this works out for you. I will be actively monitoring this thread today.

@EricDahlvang
Copy link
Member

@oflahero You can check out the PR here: #4810

@oflahero
Copy link

Thanks guys. If I'm reading it right, this fix will have no effect on a setup using SqlBotDataStore with no CachingBotDataStore?

@JasonSowers
Copy link
Contributor

@oflahero It shouldn't

@mitkodi
Copy link
Contributor

mitkodi commented Jul 3, 2018

@JasonSowers for me it works till now. I wrapped again SqlBotDataStore (SingleInstance) by CachingBotDataStore (InstancePerLifetimeScope) and haven't faced the issue last three days.

@seca23
Copy link

seca23 commented Jul 3, 2018

@JasonSowers, I have also upgraded to 3.15.3 and everything seems to work fine.
I am also using the CachingBotDataStore with an on-premise SqlBotDataStore, and everything looks great.
Thanks a mil.

@JasonSowers
Copy link
Contributor

@oflahero you should no longer have the CachingBotDataStore disabled. In fact, we advise against it at this point.

@oflahero
Copy link

oflahero commented Jul 3, 2018

Thanks Jason, I saw the wind was blowing that way alright and added it back in only a few hours ago. Looking good so far.

@peterbozso
Copy link

@JasonSowers I just checked with the project I mentioned above and 3.15.3 seems to fix the issue for me completely.
Thanks for your work!

@JasonSowers
Copy link
Contributor

If anyone is still experiencing issues related to this, please open a new issue. Thanks again to everyone for your help and patience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests