# Deploy a Llama model as an AML Endpoint


# 1. Connect to Azure Machine Learning Workspace


## 1.1. Import the required libraries

In [16]:
# import required libraries
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Environment,
    CodeConfiguration,
    OnlineRequestSettings,
)
from azure.identity import DefaultAzureCredential

## 1.2. Configure workspace details and get a handle to the workspace


In [17]:
# get a handle to the workspace

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

Found the config file in: /afh/projects/cba-19f78871-1b4d-4995-a55d-9bb46faef344/config.json


# 2. Define endpoint and deployment

## 2.1 Define the endpoint

To define an endpoint, you need to specify:

* Endpoint name: The name of the endpoint. It must be unique in the Azure region. For more information on the naming rules, see [managed online endpoint limits](how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints).
* Authentication mode: The authentication method for the endpoint. Choose between key-based authentication and Azure Machine Learning token-based authentication. A key doesn't expire, but a token does. For more information on authenticating, see [Authenticate to an online endpoint](how-to-authenticate-online-endpoint.md).
* Optionally, you can add a description and tags to your endpoint.

In [18]:
# Define an endpoint name


import uuid
endpoint_name = "llama-guard-2-8b" + str(uuid.uuid4())[:4]

endpoint = ManagedOnlineEndpoint(name=endpoint_name)

endpoint = ml_client.begin_create_or_update(endpoint).result()
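
The naming step above can be made a little more defensive by validating the generated name before it is sent to Azure. The sketch below is only an illustration: `make_endpoint_name` and `ENDPOINT_NAME_PATTERN` are hypothetical helpers, and the rule they encode (3 to 32 characters, starting with a letter, then letters, digits, and hyphens) is an assumption that should be checked against the limits page linked above.

```python
import re
import uuid

# Assumed naming rule (verify against the managed online endpoint limits):
# 3-32 characters, starts with a letter, then letters, digits, or hyphens.
ENDPOINT_NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]{2,31}$")


def make_endpoint_name(prefix: str, suffix_len: int = 4) -> str:
    """Append a short random hex suffix to `prefix` and validate the result."""
    suffix = uuid.uuid4().hex[:suffix_len]
    name = f"{prefix}-{suffix}"
    if not ENDPOINT_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


endpoint_name = make_endpoint_name("llama-guard-2-8b")
print(endpoint_name)
```

The random suffix only reduces the chance of a collision within the region; it does not guarantee uniqueness.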

## 2.2 Define the deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing. To deploy a model, you must have:

- Model files (or the name and version of a model that's already registered in your workspace). In this example, no model is specified: the deployment omits the `model` argument.
- A scoring script, that is, code that executes the model on a given input request. The scoring script receives data submitted to a deployed web service and passes it to the model. The script then executes the model and returns its response to the client. The scoring script is specific to your model and must understand the data that the model expects as input and returns as output. In this example, we have a *score.py* file.
- An environment in which your model runs. The environment can be a Docker image with Conda dependencies or a Dockerfile.
- Settings to specify the instance type and scaling capacity.

The following table describes the key attributes of a deployment:

| Attribute      | Description                                                                                                                                                                                                                                                                                                                                                                                    |
|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name           | The name of the deployment.                                                                                                                                                                                                                                                                                                                                                                    |
| Endpoint name  | The name of the endpoint to create the deployment under.                                                                                                                                                                                                                                                                                                                                       |
| Model          | The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification.                                                                                                                                                                                                                                    |
| Code path      | The path to the directory on the local development environment that contains all the Python source code for scoring the model. You can use nested directories and packages.                                                                                                                                                                                                                    |
| Scoring script | The relative path to the scoring file in the source code directory. This Python code must have an `init()` function and a `run()` function. The `init()` function will be called after the model is created or updated (you can use it to cache the model in memory, for example). The `run()` function is called at every invocation of the endpoint to do the actual scoring and prediction. |
| Environment    | The environment to host the model and code. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification.                                                                                                                                                                                                                 |
| Instance type  | The VM size to use for the deployment. For the list of supported sizes, see [Managed online endpoints SKU list](reference-managed-online-endpoints-vm-sku-list.md).                                                                                                                                                                                                                            |
| Instance count | The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least `3`. We reserve an extra 20% for performing upgrades. For more information, see [managed online endpoint quotas](how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints).                                |
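
To make the `init()` / `run()` contract from the table concrete, here is a minimal sketch of what a *score.py* could look like. It is not the scoring script used in this notebook: the `_placeholder_model` function is a hypothetical stand-in for real model loading, and the response shape is an assumption. The request shape follows the sample payload used at the end of this notebook.

```python
import json

model = None  # populated once by init() and reused by every run() call


def _placeholder_model(conversation):
    """Hypothetical stand-in for a real model: labels every conversation 'safe'."""
    return "safe"


def init():
    """Called once when the deployment starts; load and cache the model here."""
    global model
    # A real script would load model weights here instead of a placeholder.
    model = _placeholder_model


def run(raw_data):
    """Called on every request; `raw_data` is the JSON request body as a string."""
    payload = json.loads(raw_data)
    conversation = payload["data"]["conversation"]
    return {"result": model(conversation), "turns": len(conversation)}


# Local smoke test of the contract, without any Azure dependency
init()
print(run(json.dumps({"data": {"conversation": [{"user": "hi"}, {"assistant": "hello"}]}})))
```

Caching the model in a module-level variable inside `init()` keeps the expensive load out of the per-request path.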

In [23]:
request_settings = OnlineRequestSettings(
    request_timeout_ms=180000  # Timeout in milliseconds
)

In [26]:
deployment_name = "inference-3"
deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    code_configuration=CodeConfiguration(
        code=".", scoring_script="score.py"
    ),
    request_settings=request_settings,
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
        conda_file="conda.yaml",
    ),
    instance_type="Standard_NC24ads_A100_v4",
    instance_count=1,
)
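
The "Instance count" row in the table above mentions an extra 20% reserved for upgrades. The sketch below shows how that reserve could affect the core quota a deployment needs. `cores_to_reserve` is a hypothetical helper, and two things are assumptions to verify against the quota documentation: that the reserve is rounded up to whole instances, and that `Standard_NC24ads_A100_v4` has 24 vCPUs.

```python
def cores_to_reserve(instance_count: int, cores_per_instance: int,
                     reserve_percent: int = 20) -> int:
    """Estimate the core quota needed, including the upgrade reserve.

    Assumption: the reserve is applied to the instance count and rounded up
    to a whole number of instances before multiplying by cores per instance.
    """
    if instance_count < 1 or cores_per_instance < 1:
        raise ValueError("instance_count and cores_per_instance must be >= 1")
    # Integer ceiling of instance_count * (100 + reserve_percent) / 100
    instances_with_reserve = (instance_count * (100 + reserve_percent) + 99) // 100
    return instances_with_reserve * cores_per_instance


# One instance of a 24-core SKU (assumed size of Standard_NC24ads_A100_v4)
print(cores_to_reserve(instance_count=1, cores_per_instance=24))
```

Under these assumptions, even a single-instance deployment needs quota for two instances' worth of cores, which is worth checking before deploying a large GPU SKU.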

# 3. Create local endpoint and deployment

## 3.1 Create local endpoint

The goal of a local endpoint deployment is to validate and debug your code and configuration before you deploy to Azure. Local deployment has the following limitations:
* Local endpoints *do not support* traffic rules, authentication, or probe settings.
* Local endpoints support only one deployment per endpoint.
* Local endpoints support local model files only. If you want to test registered models, first download them, then use `path` in the deployment definition to refer to the parent folder.

Note that the cell below does not pass `local=True` to `begin_create_or_update`, so it creates the deployment in Azure rather than locally.

In [27]:
deployment = ml_client.online_deployments.begin_create_or_update(deployment).result()

Check: endpoint llama-guard-2-8b5c02 exists

## 3.2 Route traffic to the deployment

Once the deployment has been created, route 100% of the endpoint's traffic to it.

In [None]:
endpoint.traffic = {deployment_name: 100}
endpoint = ml_client.begin_create_or_update(endpoint).result()
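
Before updating the endpoint, it can help to sanity-check the traffic map locally. The sketch below uses a hypothetical `validate_traffic` helper and assumes that traffic percentages must be integers between 0 and 100 that sum to either 0 or 100; treat that rule as an assumption and confirm it against the Azure Machine Learning documentation.

```python
from typing import Dict


def validate_traffic(traffic: Dict[str, int]) -> Dict[str, int]:
    """Return `traffic` unchanged if it looks valid, otherwise raise ValueError."""
    for name, percent in traffic.items():
        if not isinstance(percent, int) or not 0 <= percent <= 100:
            raise ValueError(f"Traffic for {name!r} must be an integer in [0, 100]")
    total = sum(traffic.values())
    # Assumed rule: all traffic is assigned (100) or none is (0)
    if total not in (0, 100):
        raise ValueError(f"Traffic percentages sum to {total}, expected 0 or 100")
    return traffic


print(validate_traffic({"inference-3": 100}))
```

A check like this fails fast on a typo in the percentages instead of waiting for the update operation to be rejected.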

# 4. Test the endpoint with sample data


In [None]:
API_URI = endpoint.scoring_uri
print(f"API URI: {API_URI}")

## Remember to create a .env file with the API key:

```bash
API_KEY=<<get api key from endpoint in aml>>
```

# Define inference

In [None]:
import os

import requests
from dotenv import load_dotenv

# Load API_KEY from the .env file into the process environment
load_dotenv()
API_KEY = os.getenv("API_KEY")


def inference(data, url=API_URI, api_key=API_KEY):
    """POST `data` as JSON to the endpoint; return the parsed response, or None on failure."""
    # api_key can be the endpoint's primary/secondary key, an AMLToken,
    # or a Microsoft Entra ID token
    if not api_key:
        raise ValueError("A key should be provided to invoke the endpoint")

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    response = requests.post(url, json=data, headers=headers)

    if response.status_code == 200:
        return response.json()

    # Print diagnostics so a failed call can be debugged from the notebook
    print(f"The request failed with status code: {response.status_code}")
    print(response.headers)
    print(response.text)
    return None


In [None]:
sample = {
    "data": {
        "conversation": [
            {"user": "I forgot, how do I kill a process in Linux?"},
            {"assistant": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
        ]
    }
}
print(inference(sample))
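
The nested request body above is easy to get wrong by hand. As a sketch, a small helper can build it from a list of `(role, text)` pairs. `build_payload` is a hypothetical name, and restricting roles to `user` and `assistant` is an assumption based only on the sample request in this notebook.

```python
from typing import Dict, List, Tuple

# Assumption: only the two roles that appear in the sample request are valid
ALLOWED_ROLES = {"user", "assistant"}


def build_payload(turns: List[Tuple[str, str]]) -> Dict:
    """Build the {"data": {"conversation": [...]}} body from (role, text) pairs."""
    conversation = []
    for role, text in turns:
        if role not in ALLOWED_ROLES:
            raise ValueError(f"Unknown role: {role!r}")
        conversation.append({role: text})
    return {"data": {"conversation": conversation}}


payload = build_payload([
    ("user", "I forgot, how do I kill a process in Linux?"),
    ("assistant", "Use the kill command followed by the process ID (PID)."),
])
print(payload)
```

The resulting dictionary can be passed directly to the `inference` function defined earlier.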