ServiceBus Topic Trigger requires Manage rights to work properly #1048

Closed
mathewc opened this Issue Dec 14, 2016 · 33 comments

Contributor

mathewc commented Dec 14, 2016

Our scale controller relies on the ServiceBus GetSubscription API to access the MessageCount for a subscription to determine whether it should scale out. That API requires Manage rights. So while the Function runtime supports Listen rights, users will see unexpected behavior unless they give us a Manage rights connection string (until this issue is fixed).
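For concreteness, here is a minimal sketch of the management call in question, using the classic WindowsAzure.ServiceBus SDK; the connection setting, topic, and subscription names are placeholders. GetSubscription is a management-plane operation, so a Listen-only connection string causes the call to fail with an authorization error instead of returning the subscription description:

using System;
using Microsoft.ServiceBus;

class MessageCountProbe
{
    static void Main()
    {
        // Placeholder values for illustration only.
        string connectionString = Environment.GetEnvironmentVariable("ServiceBusConnection");
        string topicPath = "my-topic";
        string subscriptionName = "my-subscription";

        var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

        // Requires Manage rights: with Listen-only rights this call throws
        // an authorization exception rather than returning the description.
        var subscription = namespaceManager.GetSubscription(topicPath, subscriptionName);
        Console.WriteLine("Unprocessed messages: " + subscription.MessageCount);
    }
}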

paulbatum added this to the January 2017 milestone Dec 19, 2016

paulbatum added the bug label Dec 19, 2016

Member

paulbatum commented Dec 20, 2016

Can we find some way to get the message count without Manage rights? The answer is probably no, but it's worth doing some more investigation, as this would be the simplest fix.

Assuming the answer is no, have the Functions runtime emit a warning to the host logs indicating that it needs a Manage-level connection string. Only do so when running in the Dynamic (Consumption) SKU.

During investigation, use the ScaleControllerEvents table to verify you are reproing the issue correctly.
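As a rough sketch of what such a check-and-warn could look like from the runtime side (a hypothetical helper, not the actual Functions host code; it assumes the classic WindowsAzure.ServiceBus SDK, the WebJobs SDK TraceWriter for logging, and the WEBSITE_SKU app setting, which App Service sets to "Dynamic" on the Consumption plan):

using System;
using Microsoft.Azure.WebJobs.Host;
using Microsoft.ServiceBus;

static class ManageRightsCheck
{
    // Hypothetical helper: warn when the connection string cannot perform
    // management operations, but only when running in the Dynamic SKU.
    public static void WarnIfListenOnly(string connectionString, string topicPath,
        string subscriptionName, TraceWriter log)
    {
        if (Environment.GetEnvironmentVariable("WEBSITE_SKU") != "Dynamic")
            return;

        try
        {
            // A management-plane call that succeeds only with Manage rights.
            NamespaceManager.CreateFromConnectionString(connectionString)
                .GetSubscription(topicPath, subscriptionName);
        }
        catch (UnauthorizedAccessException)
        {
            log.Warning("The ServiceBus connection string has Listen-only rights; " +
                        "the scale controller needs Manage rights to read message counts.");
        }
    }
}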

Member

paulbatum commented Jan 18, 2017

@mamaso So we went with the emit-host-logs approach, right? I see that's merged; can we close this?

Contributor

mamaso commented Jan 18, 2017

@mathewc and I chatted and decided to revert the logging part of the PR; we felt it wasn't worth the added complexity when the real problem is in the scale controller and the host logs are not visible enough.

Keep this open for:

  1. the scale controller work
  2. the work to route BindingProvider traces to function logs (maybe)
Contributor

lindydonna commented Mar 14, 2017

FYI to @mathewc @davidebbo: we had three support cases in Jan/Feb that matched this symptom.

Customers report that their functions seem to only work properly when opened in the portal. I'm assuming it's because the Scale Controller doesn't have the right permissions.

I'm adding the reliability label and removing the milestone so we re-triage.

Member

christopheranderson commented Mar 20, 2017

@tohling - could you please provide status on this?

Member

tohling commented Mar 20, 2017

This issue should be fixed now.

tohling closed this Mar 20, 2017

napalm684 commented Mar 30, 2017

I don't think this is fixed yet. I opened Azure/Azure-Functions#229. I am going to try giving the function's connection string Manage rights to see if that resolves things.

Contributor

lindydonna commented Mar 30, 2017

@tohling Was the fix actually deployed? I.e., if customers have a connection string with Listen rights, will the Scale Controller scale out?

Contributor

mamaso commented Apr 17, 2017

@tohling how did we resolve this on the scale controller side? With the listener error changes I made recently, we could check for Manage rights (from the runtime) and propagate an error to the offending function if necessary.

Member

tohling commented Apr 17, 2017

If the connection string has Manage rights, then both the total unprocessed message count and age of the first message in the SB queue/topic are used as metrics for scale decisions.

If the connection string has only Listen rights, then the ScaleController will use only the age of the first message in the SB queue/topic as a metric for scale decisions.
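To make the two modes concrete, here is a rough sketch using the classic WindowsAzure.ServiceBus SDK. The structure and thresholds are hypothetical (the actual scale controller logic is internal to the service); the point is that Peek is a receive-side operation that works with Listen rights, while reading MessageCount is a management operation that requires Manage rights:

using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

static class ScaleMetricsSketch
{
    public static bool ShouldScaleOut(string connectionString, string topicPath, string subscriptionName)
    {
        // Age of the first message: Peek only needs Listen rights.
        var client = SubscriptionClient.CreateFromConnectionString(
            connectionString, topicPath, subscriptionName);
        BrokeredMessage first = client.Peek();
        TimeSpan age = first == null ? TimeSpan.Zero : DateTime.UtcNow - first.EnqueuedTimeUtc;

        long messageCount = 0;
        try
        {
            // Total unprocessed message count: a management-plane call,
            // available only with Manage rights.
            messageCount = NamespaceManager.CreateFromConnectionString(connectionString)
                .GetSubscription(topicPath, subscriptionName).MessageCount;
        }
        catch (UnauthorizedAccessException)
        {
            // Listen-only connection string: fall back to the age metric alone.
        }

        // Hypothetical thresholds, for illustration only.
        return messageCount > 1000 || age > TimeSpan.FromMinutes(1);
    }
}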

Contributor

mamaso commented Apr 17, 2017

Thanks @tohling, nice solution! Sounds like we don't need the warning in that case 👍

Contributor

lindydonna commented Apr 17, 2017

@tohling and @mamaso I'm going to improve the documentation on our Service Bus bindings. Thanks for the details on how everything works. It sounds like scaling will work best with Manage rights, so that is what we'd recommend?

Member

tohling commented Jun 23, 2017

Reopening bug due to regression. See #1610 (comment)

omerlh commented Jun 27, 2017

Hey,
This is happening to me as well - and the access right is already set to "manage". I am working on this with the support team, but wanted to post an update here too. Currently my workaround is adding a timer function.

Member

tohling commented Jun 27, 2017

@omerlh, if you are using an SB connection string that already has Manage rights, then your Function should trigger when there is a new event in your queue or topic, and you should not need a Timer function. The regression described in #1610 (comment) only applies to connection strings with only Listen rights.

If you are seeing this issue with Manage rights, then there might be another issue. Could you share your Function App name directly or indirectly so that we can investigate?

omerlh commented Jun 27, 2017

@tohling, those are the requested details:
2017-06-27T08:11:51.359 Function completed (Success, Id=fea151ed-69f8-4374-ad58-905645418a00, Duration=2770ms). The region is South Central US.
Thank you!

Member

tohling commented Jun 27, 2017

@omerlh, unfortunately, having the request ID is not sufficient for me to look up the information about your Function App. Kindly share your Function App name directly or indirectly (see the link with instructions in my previous comment).

Member

tohling commented Jun 27, 2017

@omerlh, sorry, I just found out that there is a way to retrieve your Function App name based on the info you provided. I'll investigate and update this thread soon.

omerlh commented Jun 27, 2017

@tohling I am glad to hear that. The instructions said to share exactly this information, so I guessed it had to be useful...

Member

tohling commented Jun 27, 2017

@omerlh, I looked at our internal logs, and as of the current timestamp we are seeing System.UnauthorizedException for your ServiceBus connection string, similar to the one mentioned in this SO thread.

The last recorded modification to your Function's function.json occurred on 6/27/2017 7:36:20 AM (UTC), and the accessRights setting is "listen". Kindly double-check your Function's configuration to verify that you are using the expected connection string with Manage access rights.

omerlh commented Jun 28, 2017

@tohling sorry about that, it was my mistake. I do have to say that it is a bit weird that I did not see any error - actually, it looks like everything is working as expected...

Member

tohling commented Jun 28, 2017

@omerlh, no problem at all. I am glad that I was able to locate the issue. Yes, your observation is correct and we are actively working on a better way to surface those errors in the next few iterations of the product. But in case you are interested, here's an in-depth explanation of why you are experiencing this unexpected behavior.

Background

A Function App instance is a process instance for all Functions inside the Function App. For a Function to execute, at least 1 Function App instance must be running.

Typically, a Function App instance is launched due to one of the following scenarios:

  1. Function App is being accessed via Azure Portal - The UX workflow loads the Function App on behalf of the user, effectively forcing the Function App to be launched and come alive.

  2. Function App instance is launched by central listener service - There is currently a central listener service that acts as the proxy listener for events on all triggers. It is responsible for

    • listening for new events,
    • launching a Function App instance if no active instance exists, and
    • scaling new instances when necessary.

For Functions created under the Consumption Plan, a Function App instance will stay alive for 5 minutes (or 10 minutes if configured as such). After that period, the Function App instance will idle out. Once the last Function App instance idles out, if Scenario #1 does not occur, then the Function will only be triggered by new events arriving in Scenario #2.

Issue

This bug is fixed, but until the fix is rolled out to production, all ServiceBus-triggered Functions using Listen rights will throw an UnauthorizedException, causing the "listen for events" code path to fail for these types of Functions. As a result, when new events arrive in your ServiceBus, your Function App instance will not be launched, resulting in missed executions.

There is currently limited bi-directional communication between the central listener and any Function App instance. We are working on ways to bridge that gap so that customers can be promptly alerted of such issues in the future.

omerlh commented Jun 28, 2017

@tohling thank you for the detailed explanation! I really appreciate it.
Anyway, I updated the SAS correctly, but it stopped working again. What I did was add the required permissions to the existing SAS policy the function was using. The connection string looked the same, so as far as I can tell it should have the correct access rights. Could you please take a look? If it still does not have them, I will create a new SAS and use it.

Member

tohling commented Jun 28, 2017

@omerlh, could you first edit the function.json, change "accessRights" to "Manage", and see if that resolves the issue? You may also remove the "accessRights" entry entirely from the function.json file, since the default is to use Manage rights.
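For reference, a minimal example of what the relevant binding entry in function.json could look like after that change (the values are placeholders; this mirrors the shape of the config shown later in this thread):

{
  "name": "mySbMsg",
  "type": "serviceBusTrigger",
  "direction": "in",
  "topicName": "<your-topic>",
  "subscriptionName": "<your-subscription>",
  "connection": "<your-connection-setting>",
  "accessRights": "manage"
}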

omerlh commented Jun 28, 2017

@tohling I already did that a few days ago - that's why I was expecting the issue to be resolved. I was also surprised to find that I can set the access right to Manage even though the connection string does not allow it.

Member

tohling commented Jun 28, 2017

@omerlh, strange. Our records indicate that this is the current config for your function.json, as saved on 6/28/2017 3:18:43 AM (UTC):

[
  {
    "name": "mySbMsg",
    "type": "serviceBusTrigger",
    "direction": "in",
    "topicName": "[removed for privacy]",
    "subscriptionName": "[removed for privacy]",
    "connection": "pubsub**********g",
    "accessRights": "listen",
    "functionName": "HandleNewEvent"
  }
]

The accessRights setting is still "listen". Could you try editing it in the UI and saving the function.json again?

omerlh commented Jun 28, 2017

@tohling I was looking at the Integrate blade, and it was set to Manage there:
[screenshot: Integrate blade showing the access rights set to Manage]
Anyway, I have now also changed it in the function.json file, and I noticed the following error when I opened the function:

Host Error: Microsoft.ServiceBus: The remote server returned an error: (401) Unauthorized. claim is empty. TrackingId:dce2649c-0f4c-447a-986d-8ee0317e1f28_G8, SystemTracker:pubsuboffloading.servicebus.windows.net:analytic-event-received-v2, Timestamp:6/28/2017 4:15:00 AM. System: The remote server returned an error: (401) Unauthorized.

So I updated the connection string with the SAS again, and now it seems to work. I'll post an update if it stops working again.

Member

tohling commented Jun 28, 2017

@omerlh, OK, I am watching the logs as well and will update this thread if I still see the errors. Close the Portal and let's observe this in the next hour. Make sure that there are new events coming into your ServiceBus.

omerlh commented Jun 28, 2017

@tohling Can you also take a look into why the access rights shown in the web UI were different?

Member

tohling commented Jun 28, 2017

@omerlh, looks like things are good now. I can see that the UnauthorizedException is gone as of ~1 hour ago. The central listener is now detecting new events entering your ServiceBus and contacting your Function App instance as expected.

As for the inconsistency in the UI: if the entries on the page were already saved, you should not see the Save and Cancel buttons. The UI snapshot you provided seems to indicate that some changes were made to the page but not yet saved. I also tried to repro the anomaly to be sure, but was unsuccessful. If you can repro this, please reach out to us by filing a bug at our UX repo and we will investigate promptly.

omerlh commented Jun 28, 2017

@tohling yeah, it seems to work. Regarding the UI - I am not sure how to reproduce it, but it has looked the same for the past week, so I am not sure that is the explanation...
Anyway, thank you very much for your help!

Member

paulbatum commented Jul 12, 2017

@jocawtho has a fix for this. Full global deployment of the fix is due by the end of July. @jocawtho, can you update this issue once it's complete?

paulbatum closed this Jul 12, 2017

Member

jocawtho commented Jul 12, 2017

I do have a fix. I will update this when our next round of deployments finishes; I predict that will be around the end of next week.
