
[BUG] ServiceBus.Core.MessageSender.OnSendAsync throws "InvalidCastException: Unable to cast object of type 'Microsoft.Azure.Amqp.Framing.Accepted' to type 'Microsoft.Azure.Amqp.Transaction.Declared'." #14836

Closed
chwarr opened this issue Sep 3, 2020 · 22 comments
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team.
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.
  • Service Bus

Comments

@chwarr
Member

chwarr commented Sep 3, 2020

Describe the bug

ServiceBus.Core.MessageSender.OnSendAsync is occasionally throwing the exception "System.InvalidCastException: Unable to cast object of type 'Microsoft.Azure.Amqp.Framing.Accepted' to type 'Microsoft.Azure.Amqp.Transaction.Declared'."

Expected behavior

No InvalidCastException is thrown. If this is an error caused by how I'm using the library, a more descriptive exception than an AMQP cast exception is thrown so that I can fix how I'm calling the library/service.

Actual behavior (include Exception or Stack Trace)

Here's an exception stack trace excerpt. Everything after the "..." is Polly retry code or my service's code:

System.InvalidCastException: Unable to cast object of type 'Microsoft.Azure.Amqp.Framing.Accepted' to type 'Microsoft.Azure.Amqp.Transaction.Declared'.
   at Microsoft.Azure.ServiceBus.Core.MessageSender.OnSendAsync(IList`1 messageList)
   at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
   at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
   at Microsoft.Azure.ServiceBus.Core.MessageSender.SendAsync(IList`1 messageList)
   at Polly.AsyncPolicy.<>c__DisplayClass40_0.<<ImplementationAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
...

To Reproduce

Unknown. This happens occasionally when our service tries to add items to a Service Bus queue. I can provide more details, but I would appreciate some guidance on what you're looking for.

Environment:

  • Name and version of the Library package used:
  • Hosting platform or OS and .NET runtime version:
    • .NET Core version 3.1.0 (commit 157910edee) with runtimes:
      • Microsoft.AspNetCore.App 3.1.0 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
      • Microsoft.NETCore.App 3.1.0 [/usr/share/dotnet/shared/Microsoft.NETCore.App]
    • Running on Azure Kubernetes Service with
      • Kubernetes version 1.14.8
      • Linux kernel version "4.15.0-1089-azure #99~16.04.1-Ubuntu SMP Fri Jun 5 15:30:32 UTC 2020 x86_64 GNU/Linux"
  • IDE and version : N/A (exception is encountered when running under Azure Kubernetes Service, not an IDE)
@ghost added the needs-triage label Sep 3, 2020
@jsquire added the Client, Service Attention, and Service Bus labels Sep 3, 2020
@ghost removed the needs-triage label Sep 3, 2020
@ghost

ghost commented Sep 3, 2020

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl, @axisc.

@jsquire
Member

jsquire commented Sep 3, 2020

//cc: @JoshLove-msft, @MiYanni

@jsquire
Member

jsquire commented Sep 3, 2020

Thank you for your feedback. Tagging and routing to the team best able to assist.

@axisc

axisc commented Sep 4, 2020

@chwarr could you share a code repro for when you see this?

@chwarr
Member Author

chwarr commented Sep 4, 2020

This happens intermittently, so I haven't been able to craft a minimal repro. It's also in a code base that I'm not the most familiar with, so I'm learning it at the same time as debugging this issue.

  • The service code that is encountering this exception looks like this:
    public Task AddItem(MyServicesQueueItem item)
    {
        string serializedItem = JsonConvert.SerializeObject(item);
        byte[] itemBlob = Encoding.UTF8.GetBytes(serializedItem);
        var message = new Message(itemBlob);
        // _combinedPolicy is a Polly AsyncPolicy
        // _serviceBusSender is a Microsoft.Azure.ServiceBus.Core.MessageSender
        return _combinedPolicy.ExecuteAsync(() => _serviceBusSender.SendAsync(message));
    }
  • The messages are around 170 bytes.
  • There is an active transaction:
    _transaction = new TransactionScope(
        TransactionScopeOption.RequiresNew,
        new TransactionOptions
        {
            Timeout = DefaultTransactionTimeout,
            IsolationLevel = IsolationLevel.Serializable,
        },
        TransactionScopeAsyncFlowOption.Enabled);
  • There is typically one message being read and one message being written during the transaction.
  • The exception is happening when calling Microsoft.Azure.ServiceBus.Core.MessageSender.SendAsync(), not System.Transactions.TransactionScope.Complete().
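
Putting the pieces above together, the overall flow looks roughly like the sketch below. The receiver, entity names, send-via wiring, and the Polly policy construction here are illustrative placeholders, not our exact code; only the send path shown earlier is copied from the real service.

// Illustrative sketch only; names marked as placeholders are not from the real service.
using System;
using System.Text;
using System.Threading.Tasks;
using System.Transactions;
using Microsoft.Azure.ServiceBus;
using Microsoft.Azure.ServiceBus.Core;
using Newtonsoft.Json;
using Polly;

class TransactionalSendSketch
{
    static async Task RunAsync()
    {
        var connection = new ServiceBusConnection("<connection-string>");            // placeholder
        var receiver = new MessageReceiver(connection, "input-queue");               // placeholder entity
        var sender = new MessageSender(connection, "output-queue", "input-queue");   // send-via wiring is an assumption
        var policy = Policy.Handle<Exception>().RetryAsync(3);                       // stand-in for _combinedPolicy

        Message received = await receiver.ReceiveAsync();

        using (var scope = new TransactionScope(
            TransactionScopeOption.RequiresNew,
            new TransactionOptions { IsolationLevel = IsolationLevel.Serializable },
            TransactionScopeAsyncFlowOption.Enabled))
        {
            byte[] body = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(new { Id = 1 }));
            // The InvalidCastException surfaces inside this SendAsync call.
            await policy.ExecuteAsync(() => sender.SendAsync(new Message(body)));
            await receiver.CompleteAsync(received.SystemProperties.LockToken);
            scope.Complete();
        }
    }
}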

I'm happy to provide more details. Let me know what you're looking for.
(If you want to look at the repo with all this service's code, find me in the Microsoft company directory and I can point you at the code.)

@lucasmaj

lucasmaj commented Sep 7, 2020

I am experiencing exactly the same exception intermittently.

System.InvalidCastException: Unable to cast object of type 'Microsoft.Azure.Amqp.Framing.Accepted' to type 'Microsoft.Azure.Amqp.Transaction.Declared'.
   at Microsoft.Azure.ServiceBus.Core.MessageReceiver.DisposeMessagesAsync(IEnumerable`1 lockTokens, Outcome outcome)
   at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
   at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)
   at Microsoft.Azure.ServiceBus.Core.MessageReceiver.CompleteAsync(IEnumerable`1 lockTokens)
  • For us it occurs on IMessageSession.CompleteAsync(string lockToken).
  • We use it inside an active transaction, where one message read results in multiple messages sent through send-via.
  • We use the SessionClient.AcceptMessageSessionAsync(TimeSpan operationTimeout) approach, not the message pump.
  • Latest Microsoft.Azure.ServiceBus 4.1.3 on .NET Core 3.1 in a Linux container.
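
In the meantime, the shape of our receive path reduces to roughly the sketch below; the entity names, timeouts, and send-via wiring here are placeholders, not our production code.

// Illustrative sketch only: session-based receive, transactional send-via, and complete.
using System;
using System.Threading.Tasks;
using System.Transactions;
using Microsoft.Azure.ServiceBus;
using Microsoft.Azure.ServiceBus.Core;

class SessionFlowSketch
{
    static async Task RunAsync(ServiceBusConnection connection)
    {
        var sessionClient = new SessionClient(connection, "input-queue", ReceiveMode.PeekLock);   // placeholder entity
        var sender = new MessageSender(connection, "output-queue", "input-queue");                // send-via through the input queue

        IMessageSession session = await sessionClient.AcceptMessageSessionAsync(TimeSpan.FromSeconds(30));
        Message received = await session.ReceiveAsync(TimeSpan.FromSeconds(30));

        using (var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled))
        {
            await sender.SendAsync(new Message(received.Body));   // one of several outgoing messages
            // The InvalidCastException reported above is thrown from this CompleteAsync call.
            await session.CompleteAsync(received.SystemProperties.LockToken);
            scope.Complete();
        }
    }
}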

I am happy to provide more details. Let me know what you're looking for.

@23min

23min commented Feb 17, 2021

We are seeing this in our subscription too, but we are seeing the exception on DisposeMessagesAsync as well, rather than just OnSendAsync.

It is reproducible only under load.

@DorothySun216
Contributor

@23min Can you also share your user scenario or a piece of code if possible? Under how much load does this occur, and where is the exception thrown from? What SDK version are you using?
I am trying to gather more data points and work on a repro.

@DorothySun216
Contributor

@lucasmaj @chwarr I tried to repro this today but couldn't. Do you by any chance have any logs that include an exact timestamp of when the incident happened? We need either a repro or some logs that we can investigate further.

@chwarr
Member Author

chwarr commented Feb 26, 2021

I haven't been working on this for a while. The freshest timestamp I have is 2020-10-14 06:26:40Z.

The task I was using Service Bus for has concluded, so I won't be able to get you a fresher repro than that.

I've sent you an Event Source trace log from that October incident at your Microsoft email.

I hope this helps.

@23min

23min commented Mar 4, 2021

We are using 5.1.0. We will soon switch to 5.1.1 or 5.1.2, but there is no date for that yet.

Stack trace:

System.InvalidCastException:
   at Microsoft.Azure.ServiceBus.Core.MessageReceiver+<DisposeMessagesAsync>d__101.MoveNext (Microsoft.Azure.ServiceBus, Version=5.1.0.0, Culture=neutral, PublicKeyToken=7e34167dcc6d6d8c)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.ServiceBus.RetryPolicy+<RunOperation>d__19.MoveNext (Microsoft.Azure.ServiceBus, Version=5.1.0.0, Culture=neutral, PublicKeyToken=7e34167dcc6d6d8c)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.ServiceBus.RetryPolicy+<RunOperation>d__19.MoveNext (Microsoft.Azure.ServiceBus, Version=5.1.0.0, Culture=neutral, PublicKeyToken=7e34167dcc6d6d8c)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.ServiceBus.Core.MessageReceiver+<CompleteAsync>d__69.MoveNext (Microsoft.Azure.ServiceBus, Version=5.1.0.0, Culture=neutral, PublicKeyToken=7e34167dcc6d6d8c)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Biometria.Extensions.AzureServiceBus.Transports.EndpointWorker+<OnMessageReceivedAsync>d__21.MoveNext (Biometria.Extensions.AzureServiceBus, Version=5.5.2.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Biometria.Extensions.AzureServiceBus.Transports.EndpointWorker+<OnMessageReceivedAsync>d__21.MoveNext (Biometria.Extensions.AzureServiceBus, Version=5.5.2.0, Culture=neutral, PublicKeyToken=null)

The failure occurs in roughly 1 in 10,000 transactions through our system; the total number of transactions on Service Bus is at least ten times that.

@DorothySun216
Contributor

DorothySun216 commented Mar 4, 2021

@23min thanks so much. I don't think 5.1.2 will make any difference since we haven't identified the issue yet. If this exception happened within the last 30 days, can you open an Azure support ticket with us? https://azure.microsoft.com/en-us/support/create-ticket/ We will investigate further.

@DorothySun216
Contributor

We don't have enough information to answer this right now since we suspect this is an issue in the AMQP layer. We will get back when we find proof.

@yvgopal
Member

yvgopal commented Jun 25, 2021

We couldn't reproduce it on our end yet. Something is going wrong in the AMQP library in a particular corner case, but so far I haven't been able to reproduce it and don't see any particular code path that may have caused it. We will keep this open and continue to investigate as a low-priority item. In the meantime, if any of you can consistently reproduce it and identify a reproducible pattern, we will be happy to fix it.

@LilanSilva

I also face the same issue. Please check the steps below; hopefully you can reproduce it.

  • The issue comes with heavy load. With 5 parallel instances (5 separate workers sending messages in parallel into the queue using a transaction scope), I don't see any issue.
  • When I increase it to 10 parallel process instances (10 separate workers sending messages in parallel into the queue using a transaction scope), it starts to throw the above error.

It seems the issue comes with load; a rough sketch of the harness follows below.
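
The worker counts, entity name, and payload in this sketch are placeholders, not my actual code.

// Illustrative load harness: N workers, each sending inside its own TransactionScope.
using System;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Transactions;
using Microsoft.Azure.ServiceBus;
using Microsoft.Azure.ServiceBus.Core;

class ParallelSendRepro
{
    static async Task Main()
    {
        const int workers = 10;   // 5 workers run cleanly; 10 start throwing the InvalidCastException

        await Task.WhenAll(Enumerable.Range(0, workers).Select(async worker =>
        {
            var sender = new MessageSender("<connection-string>", "test-queue");   // placeholders
            for (int i = 0; i < 1_000; i++)
            {
                using var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled);
                byte[] body = Encoding.UTF8.GetBytes($"worker {worker} message {i}");
                await sender.SendAsync(new Message(body));
                scope.Complete();
            }
            await sender.CloseAsync();
        }));
    }
}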

@ankitvijay

ankitvijay commented Nov 2, 2021

Hi Team,
We face the same issue during peak load when we have thousands of messages in the queue. We have opened a ticket with Azure support. We are waiting to hear from them.

Update: Unfortunately, we did not receive a positive response on the ticket we opened. We still have this issue.

@jsquire
Member

jsquire commented Oct 11, 2022

@EldertGrootenboer: Would you be able to follow-up on this?

@mirmansoor

Hi Team,
We have seen this exception in excess of 10k times in the past 90 days in our production environment. It occurs during high load. We can provide stack traces and other diagnostic information if it helps.

@BillVolz

I am getting this in the latest client. The line that causes this is at the link below. I looked over the code a little to see if I could find the cause, but this is some serious multi-threaded code. My guess, from what I am seeing, is that something is getting disposed or cancelled while one of these async tasks is in flight, and something in the queue processing then returns the wrong value. Again, it's impossible to reproduce on demand, but it happens several times a day when we are processing hundreds of thousands of messages. The reason I think it's a cancel/dispose issue is that we are running in containers, and this happens often when autoscaling and not so much in the debugger.

https://github.com/Azure/azure-amqp/blob/master/src/Transaction/Controller.cs#L81
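
For what it's worth, the failing cast itself is trivial to demonstrate in isolation; the snippet below is only an illustration of the exception, not the library's actual code path.

// Accepted and Declared are sibling Outcome types, so this cast always throws
// with the exact message reported in this issue.
using System;
using Microsoft.Azure.Amqp.Framing;
using Microsoft.Azure.Amqp.Transaction;

class CastDemo
{
    static void Main()
    {
        Outcome outcome = new Accepted();   // what actually came back on the wire
        var declared = (Declared)outcome;   // InvalidCastException: Unable to cast ... 'Accepted' to ... 'Declared'
        Console.WriteLine(declared.TxnId);
    }
}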

@chklauser

We are also seeing this error regularly (on the receiving side, when messages get marked as completed). We have also noticed that the error rate is correlated with the number of pod starts/stops.

NuGet package versions:
  • Microsoft.Azure.Amqp 2.6.2
  • Azure.Messaging.ServiceBus 7.15.0
  • Azure.Core.Amqp 1.3.0

Example stack trace:

System.InvalidCastException: Unable to cast object of type 'Microsoft.Azure.Amqp.Framing.Accepted' to type 'Microsoft.Azure.Amqp.Transaction.Declared'.
   at Microsoft.Azure.Amqp.Transaction.Controller.DeclareAsync(CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpTransactionEnlistment.OnCreateAsync(TimeSpan timeout, CancellationToken cancellationToken)
   at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout, CancellationToken cancellationToken)
   at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpTransactionManager.EnlistAsync(Transaction transaction, AmqpConnectionScope connectionScope, TimeSpan timeout)
   at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.DisposeMessageAsync(Guid lockToken, Outcome outcome, TimeSpan timeout)
   at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.CompleteInternalAsync(Guid lockToken, TimeSpan timeout)
   at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.<>c.<<CompleteAsync>b__47_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.<>c__22`1.<<RunOperation>b__22_0>d.MoveNext()
--- End of stack trace from previous location ---
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1,TResult](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken, Boolean logTimeoutRetriesAsVerbose)
   at Azure.Messaging.ServiceBus.ServiceBusRetryPolicy.RunOperation[T1](Func`4 operation, T1 t1, TransportConnectionScope scope, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.Amqp.AmqpReceiver.CompleteAsync(Guid lockToken, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.ServiceBusReceiver.CompleteMessageAsync(ServiceBusReceivedMessage message, CancellationToken cancellationToken)
   at Azure.Messaging.ServiceBus.ProcessMessageEventArgs.CompleteMessageAsync(ServiceBusReceivedMessage message, CancellationToken cancellationToken)
   at NServiceBus.Transport.AzureServiceBus.ProcessMessageEventArgsExtensions.SafeCompleteMessageAsync(ProcessMessageEventArgs args, ServiceBusReceivedMessage message, TransportTransactionMode transportTransactionMode, AzureServiceBusTransportTransaction azureServiceBusTransaction, CancellationToken cancellationToken) in /_/src/Transport/Receiving/ProcessMessageEventArgsExtensions.cs:line 21
   at NServiceBus.Transport.AzureServiceBus.MessagePump.ProcessMessage(ServiceBusReceivedMessage message, ProcessMessageEventArgs processMessageEventArgs, String messageId, Dictionary`2 headers, BinaryData body, CancellationToken messageProcessingCancellationToken) in /_/src/Transport/Receiving/MessagePump.cs:line 352
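
The failing receive path reduces to roughly the shape below. NServiceBus drives this for us, so this is a hand-written approximation; the queue name and client wiring are placeholders.

// Approximation with Azure.Messaging.ServiceBus 7.x: completing a received message
// while a TransactionScope is active enlists the connection in an AMQP transaction
// (AmqpTransactionManager.EnlistAsync -> Controller.DeclareAsync in the trace above),
// which is where the InvalidCastException surfaces.
using System.Threading.Tasks;
using System.Transactions;
using Azure.Messaging.ServiceBus;

class CompleteInTransactionSketch
{
    static async Task RunAsync(ServiceBusClient client)
    {
        ServiceBusReceiver receiver = client.CreateReceiver("input-queue");   // placeholder queue
        ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync();

        using (var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled))
        {
            await receiver.CompleteMessageAsync(message);
            scope.Complete();
        }
    }
}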

@jsquire removed the Service Attention label Jul 24, 2023
@jsquire
Member

jsquire commented Jul 24, 2023

@JoshLove-msft : From the most recent report, it looks as if this can be reproduced on Azure.Messaging.ServiceBus. Would you please take a look and transfer to the AMQP repo if it turns out to be in the AMQP lib?

@jsquire added the customer-reported, question, and needs-team-attention labels Jul 24, 2023
@JimSuplizio
Member

Hi @chwarr, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.

@JimSuplizio closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 5, 2024
@Azure locked and limited conversation to collaborators Mar 5, 2024