[Issue] "azd up" fails to deploy Azure AI template - "UserError" - "Deployment chat-deployment-xxxxx not found in endpoint mloe-xxxxx, workspace ai-project-xxxxx" #4037

Closed
1 task done
nitya opened this issue Jun 26, 2024 · 1 comment · Fixed by #4043

nitya commented Jun 26, 2024

I am currently using a GitHub Codespaces environment configured with a devcontainer that installs the latest daily build of the Azure Developer CLI using this command:

RUN curl -fsSL https://aka.ms/install-azd.sh | bash -s -- --version daily

Output from azd version

azd version 1.9.3 (commit e1624330dcc7dde440ecc1eda06aac40e68aa0a3)

Describe the bug

Issue: "azd up" completes provisioning but terminates prematurely with error during "deploy"

The application has run correctly in the past. However, in the current instance,

  • azd up completes provisioning step correctly
  • it also completes post-provisioning hooks execution
  • then fails with a "UserError" on the deploy step

The error appears to be timing-related:

  • azd deploy reports that a specific chat-deployment is not found in the endpoint
  • that deployment is still in the provisioning state at this point (and is deployed successfully later)
  • meanwhile, the azd process terminates on the error (so no post-deployment actions are run)

The problem caused:

  • When testing the deployed app we get a "Network Error"
  • By trial and error we determined this was because Traffic Allocation for deployment was 0%
  • Manually setting it to 100% with "Update traffic" allowed the test to pass on retry
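
For reference, the manual "Update traffic" step can also be done from the command line. This is only a sketch of the workaround, not what azd does internally. It assumes the Azure CLI with the `ml` extension is installed and logged in, and the helper names (`build_traffic_update_command`, `run_traffic_update`) are made up for illustration. The resource names are the ones from the error output in this report.

```python
import subprocess
from typing import List


def build_traffic_update_command(
    resource_group: str,
    workspace: str,
    endpoint: str,
    deployment: str,
    percent: int = 100,
) -> List[str]:
    """Build the `az ml online-endpoint update` call that routes traffic.

    Hypothetical helper: it only assembles the argument list, so it can be
    inspected before anything is changed in Azure.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return [
        "az", "ml", "online-endpoint", "update",
        "--resource-group", resource_group,
        "--workspace-name", workspace,
        "--name", endpoint,
        "--traffic", f"{deployment}={percent}",
    ]


def run_traffic_update(command: List[str]) -> None:
    """Run the command; raises CalledProcessError if the CLI call fails."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    cmd = build_traffic_update_command(
        resource_group="rg-06-26-azuredev-issue-test",
        workspace="ai-project-brvtzcsc5w4vs",
        endpoint="mloe-brvtzcsc5w4vs",
        deployment="chat-deployment-1719417150",
    )
    # Print first; call run_traffic_update(cmd) only once the deployment
    # shows provisioning state "Succeeded".
    print(" ".join(cmd))
```

Building the command separately from running it makes it easy to check the target endpoint and deployment names before shifting traffic.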

The insight:

  • Premature termination of azd prevented the traffic allocation setup from being completed.
  • This agrees with the azd documentation, which indicates that azd should wait for the deployment to enter a terminal provisioning state and then shift traffic to the new deployment.

To Reproduce

(This bug was originally seen on Jun 20 - and was reproduced by community on Jun 24. I have re-run the flow on Jun 26 to capture the above screenshots and provide these steps to reproduce)

The bug relates to the azd deploy step on the "Azure-Samples/Contoso-Chat" AZD-enabled template. To capture this issue report, I created this branch on my fork to have a reproducible commit for validation.

These are the steps to reproduce the bug:

  1. Launch GitHub Codespaces on that fork/branch (commit)
  2. "azd auth login" - and complete workflow. You should see: Logged in to Azure.
  3. "azd up" - enter environment name, subscription, location: I used "Sweden Central"
  4. Wait for process to complete
    • provisioning completes successfully (~10-12 minutes)
    • post-provision hooks run successfully (populate data, connections)
    • deployment begins - then fails with error shown

🚨 | Error seen in the CLI (VSC on Codespaces) - error message shown as snippet for clarity. Note that the CLI exits as a result of this error, returning to cursor prompt in VS Code.

...
Deploying services (azd deploy)

  (x) Failed: Deploying service chat

ERROR: error executing step command 'deploy': failed deploying service 'chat': GET https://management.azure.com/subscriptions/XXXXXXXXX/resourceGroups/rg-06-26-azuredev-issue-test/providers/Microsoft.MachineLearningServices/workspaces/ai-project-brvtzcsc5w4vs/onlineEndpoints/mloe-brvtzcsc5w4vs/deployments/chat-deployment-1719417150
--------------------------------------------------------------------------------
RESPONSE 404: 404 Not Found
ERROR CODE: UserError
--------------------------------------------------------------------------------
{
  "error": {
    "code": "UserError",
    "message": "Deployment chat-deployment-1719417150 not found in endpoint mloe-brvtzcsc5w4vs, workspace ai-project-brvtzcsc5w4vs",
    "details": [],
    "additionalInfo": [
      {
        "type": "ComponentName",
        "info": {
          "value": "managementfrontend"
        }
      },
      {
        "type": "Correlation",
        "info": {
          "value": {
            "operation": "a868f3a507f74d8885aed34fa77bfcf4",
            "request": "d23c458b283430b1"
          }
        }
      },
...
...
...

--------------------------------------------------------------------------------

TraceID: 84e0b3890c3b7407f577b01c17b6acfb
@nitya ➜ /workspaces/contoso-chat (fix/06-24-add-safety-connection) $ 

🚨 | At the same time the Azure AI deployments tab shows that the identified endpoint resource was created and the related chat-deployment resource was still in the process of being created (backend Azure) at the time the error message was seen (CLI, local development IDE)

image

🚨 | If we look at the deployment resource - it shows that the resource is still in the provisioning state at this time. And it has 0% traffic allocation at this point as expected.

image

🚨 | If we continue to wait for the backend process to complete (takes about 10 mins), we see that the Provisioning state is now set to "Succeeded" but traffic allocation still remains at 0%

image

🚨 | If we now test the deployment with a valid input, we get a "Network error"
image

Scroll down for debug/workaround that validated the issue.


Expected behavior

Expected that "azd up" would successfully deploy the application and shift traffic to new deployment. This would be validated by testing the deployed endpoint with a relevant test input.


Environment
Information on your environment:
* Language name and version | Python version 3.11.9 (GitHub Codespaces devcontainer)
* Dev Container Dockerfile | mcr.microsoft.com/devcontainers/python:3.11-bullseye
* IDE and version : Visual Studio Code 1.90.2 running in browser (GitHub Codespaces)


Additional context

Also took these actions to support debug:

  1. CLI says: "You can view detailed progress in the Azure Portal" - with link specified
  2. Opened Link: Shows "Deployment is in progress" with "Status OK" for requested roles.
  3. Opened RG - Deployments: Monitor status and wait till all resources are created
  4. Opened Azure AI - Hub - Projects - Deployments - Monitored page to correlate status to CLI

Debugged issue by assuming it was related to Traffic Allocation.

🚨 | Manually updated traffic allocation on the deployment to 100%

image

🚨 | Refreshed deployment to validate that traffic allocation was now updated to 100%

image

🚨 | Tried the test input again - this time it worked! (Validates that issue was because azd did not get to complete the traffic allocation update)
image


Tagging @kristenwomack @wbreza for awareness

@nitya nitya changed the title from [Issue] "azd up" fails to deploy Azure AI template - "UserError" - "Deployment chat-deployment-xxxxx not found in endpoint mloe-xxxxx, workspace ai-projject-xxxxx" to [Issue] "azd up" fails to deploy Azure AI template - "UserError" - "Deployment chat-deployment-xxxxx not found in endpoint mloe-xxxxx, workspace ai-project-xxxxx" Jun 26, 2024
@vhvb1989 vhvb1989 self-assigned this Jun 27, 2024
@vhvb1989
Member

Started internal conversation for this.

azd is failing with a 404 because, for some time after a deployment is created for the AI endpoint, the deployment has not yet propagated and cannot be found when azd queries its status.
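
One way to tolerate this kind of propagation delay is to treat a 404 as "not visible yet" for a limited grace period while polling, instead of failing on the first one. The sketch below only illustrates that idea and is not the fix that was merged into azd. `fetch_state`, `DeploymentNotFound`, the set of terminal states, and the default timings are all assumptions made for the example.

```python
import time
from typing import Callable, Optional

# Assumed terminal provisioning states; anything else means "keep waiting".
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


class DeploymentNotFound(Exception):
    """Raised by the (hypothetical) fetch function on an HTTP 404."""


def wait_for_deployment(
    fetch_state: Callable[[], str],
    not_found_grace: float = 300.0,
    timeout: float = 1800.0,
    interval: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the deployment reaches a terminal provisioning state.

    A 404 is ignored while less than `not_found_grace` seconds have
    elapsed, since the deployment may simply not have propagated yet.
    After the grace period, the 404 is re-raised as a real error.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        state: Optional[str]
        try:
            state = fetch_state()
        except DeploymentNotFound:
            if elapsed >= not_found_grace:
                raise
            state = None  # not visible yet; keep polling
        if state in TERMINAL_STATES:
            return state
        if elapsed >= timeout:
            raise TimeoutError("deployment did not reach a terminal state")
        sleep(interval)
```

Only after this returns "Succeeded" would it make sense to shift traffic to the new deployment. The `clock` and `sleep` parameters exist so the loop can be exercised without real waiting.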
