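# Deploy Llama Guard 2 to an AML endpoint, with a registered model

Deploy Llama Guard 2 to an Azure Machine Learning (AML) managed online endpoint. The model is first registered in AML and then attached to the deployment.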

> This model is gated.
>
> Before you start:
> - Go to [https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B](https://huggingface.co/meta-llama/Meta-Llama-Guard-2-8B)
> - Log in with your Hugging Face account
> - Request access to the model (approval may take a few hours)
> - Once access has been granted, create an access token
> - Create the file `./src/.env`
> - Add the following to it:
>   ```
>   HUGGINGFACE_TOKEN=hf_* # your token
>   ```
> - Then proceed with the instructions below

In [31]:
# %pip install sentence-transformers

## Import model from Hugging Face

In [32]:
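import os

import torch
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM

# Read HUGGINGFACE_TOKEN from the file created in the steps above.
load_dotenv("./src/.env")

HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")

# Optional: store the token in the local Hugging Face cache instead of passing it per call.
# login(token=HUGGINGFACE_TOKEN)

model_id = "meta-llama/Meta-Llama-Guard-2-8B"
device = "cuda"  # not used in this notebook, which only downloads and saves the weights
dtype = torch.bfloat16

# The repository is gated, so pass the token explicitly.
tokenizer = AutoTokenizer.from_pretrained(model_id, token=HUGGINGFACE_TOKEN)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, token=HUGGINGFACE_TOKEN)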


Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  2.85it/s]
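
The choice of `torch.bfloat16` roughly halves the memory needed for the weights compared with 32-bit floats. The arithmetic can be sketched as below. The parameter count is an assumption read off the "8B" in the model name, and `weight_memory_gib` is a hypothetical helper, so treat the numbers as a rough guide only.

```python
# Rough estimate of the memory taken by the weights alone.
# The parameter count is an assumption based on the "8B" in the model name.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


approx_params = 8e9  # assumed: about 8 billion parameters
for name in ("bfloat16", "float32"):
    print(f"{name}: {weight_memory_gib(approx_params, name):.1f} GiB")
```

Activations and the KV cache come on top of this, so the GPU needs some headroom beyond the figure printed for the weights.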


## Save model

In [33]:
model.save_pretrained('./model/model')
tokenizer.save_pretrained('./model/tokenizer')

('./model/tokenizer/tokenizer_config.json',
 './model/tokenizer/special_tokens_map.json',
 './model/tokenizer/tokenizer.json')
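
Before registering the folder, it can help to confirm what was written to disk. The following is a minimal sketch using only the standard library; `summarize_dir` is a hypothetical helper, not part of the notebook.

```python
from pathlib import Path


def summarize_dir(root: str) -> dict[str, int]:
    """Map each file under `root` (as a relative POSIX path) to its size in bytes."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): p.stat().st_size
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


# Only list the folder if the save step above has already been run.
if Path("./model").is_dir():
    for name, size in summarize_dir("./model").items():
        print(f"{size / 2**20:10.1f} MiB  {name}")
```

The listing should show the weight shards under `model/` and the tokenizer files under `tokenizer/`.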

## Connect to Azure Machine Learning Workspace

In [34]:

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential


ml_client = MLClient.from_config(credential=DefaultAzureCredential())


print(ml_client.workspace_name)


Found the config file in: /afh/projects/cba-19f78871-1b4d-4995-a55d-9bb46faef344/config.json


cba
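
`MLClient.from_config` reads the workspace details from a `config.json` file, as the "Found the config file" message shows. A quick way to catch a malformed file before connecting is sketched below; `check_workspace_config` is a hypothetical helper, and the three keys are the ones the workspace config file normally carries.

```python
import json
from pathlib import Path

# Keys expected in an Azure ML workspace config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path: str) -> dict:
    """Load a workspace config file and fail early if a required key is missing."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"{path} is missing: {', '.join(missing)}")
    return config


# Example (path is an assumption): check_workspace_config("config.json")
```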


## Register the model

In [35]:
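from azure.ai.ml.entities import Model

# Register the whole ./model folder (weights and tokenizer) as a single model asset.
# Note: this rebinds `model`, which previously held the Hugging Face model object.
model = Model(
    path="./model",
    name="Meta-Llama-Guard-2-8B",
    description="./Meta-Llama-Guard-2-8B model"
)
ml_client.models.create_or_update(model)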


Your file exceeds 100 MB. If you experience low speeds, latency, or broken connections, we recommend using the AzCopyv10 tool for this file transfer.

Example: azcopy copy '/afh/projects/cba-19f78871-1b4d-4995-a55d-9bb46faef344/shared/Users/resilv/ai-studio-101/deploy_llama/static_model_load/model' 'https://strensnewerh194762525521.blob.core.windows.net/19f78871-1b4d-4995-a55d-9bb46faef344-azureml-blobstore/LocalUpload/4f87103d13dd33afc2a81f02cec9544a/model' 

See https://docs.microsoft.com/azure/storage/common/storage-use-azcopy-v10 for more information.


Model({'job_name': None, 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'Meta-Llama-Guard-2-8B', 'description': './Meta-Llama-Guard-2-8B model', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': '/subscriptions/691b572d-8686-481a-9757-4befaa7f9526/resourceGroups/rg-resilvai/providers/Microsoft.MachineLearningServices/workspaces/cba/models/Meta-Llama-Guard-2-8B/versions/3', 'Resource__source_path': '', 'base_path': '/afh/projects/cba-19f78871-1b4d-4995-a55d-9bb46faef344/shared/Users/resilv/ai-studio-101/deploy_llama/static_model_load', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f9924d4d420>, 'serialize': <msrest.serialization.Serializer object at 0x7f9924d4cc70>, 'version': '3', 'latest_version': None, 'path': 'azureml://subscriptions/691b572d-8686-481a-9757-4befaa7f9526/resourceGroups/rg-resilvai/workspaces/cba/datastores/workspaceblobstore/paths/LocalUpload/4f8710
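
The `id` field in the output above ends with the model name and the registered version. A small sketch for pulling those two values out of such an ID follows; `parse_model_id` is a hypothetical helper, and the `azureml:<name>:<version>` form printed at the end is the reference style used in Azure ML YAML files, shown here only as an illustration.

```python
import re

# Matches the trailing ".../models/<name>/versions/<version>" part of a model resource ID.
MODEL_ID_PATTERN = re.compile(r"/models/(?P<name>[^/]+)/versions/(?P<version>[^/]+)$")


def parse_model_id(resource_id: str) -> tuple[str, str]:
    """Return (name, version) extracted from an Azure ML model resource ID."""
    match = MODEL_ID_PATTERN.search(resource_id)
    if match is None:
        raise ValueError(f"not a model resource ID: {resource_id}")
    return match["name"], match["version"]


# ID copied from the registration output above.
registered_id = (
    "/subscriptions/691b572d-8686-481a-9757-4befaa7f9526/resourceGroups/rg-resilvai"
    "/providers/Microsoft.MachineLearningServices/workspaces/cba"
    "/models/Meta-Llama-Guard-2-8B/versions/3"
)
name, version = parse_model_id(registered_id)
print(f"azureml:{name}:{version}")
```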

## Create the Endpoint

In [36]:
import uuid

from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment, CodeConfiguration

# Add a short random suffix so the endpoint name does not clash with an existing one.
endpoint_name = "llama-guard-2-8b" + str(uuid.uuid4())[:4]

endpoint = ManagedOnlineEndpoint(name=endpoint_name)

# Block until the endpoint has been provisioned.
endpoint = ml_client.begin_create_or_update(endpoint).result()
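
The endpoint name is the fixed prefix plus four hexadecimal characters from a random UUID. The same idea can be wrapped in a small function that also guards against overly long names. `make_endpoint_name` is a hypothetical helper, and the default `max_len` of 32 is an assumption about the service limit that should be checked against the Azure documentation.

```python
import uuid


def make_endpoint_name(prefix: str, suffix_len: int = 4, max_len: int = 32) -> str:
    """Append a short random hex suffix to `prefix`.

    `max_len` is an assumed limit on endpoint name length, not a confirmed one.
    """
    name = prefix + uuid.uuid4().hex[:suffix_len]
    if len(name) > max_len:
        raise ValueError(f"endpoint name '{name}' is longer than {max_len} characters")
    return name


print(make_endpoint_name("llama-guard-2-8b"))
```

Four hex characters give 65,536 possible suffixes, which is enough to avoid most accidental clashes but is not a uniqueness guarantee.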


## Define the deployment (the real thing)

In [37]:
from azure.ai.ml.entities import (
    Environment
)
deployment_name = "inference"
deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model,
    code_configuration=CodeConfiguration(
        code="./src", scoring_script="score.py"
    ),
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
        conda_file="conda.yaml",
    ),
    instance_type="Standard_NC24ads_A100_v4",
    instance_count=1,
)
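
The deployment points at `./src/score.py`, which is not shown in this notebook. As a rough sketch only, a scoring script for this setup could follow the usual Azure ML pattern of an `init()` function that loads the model once and a `run()` function that handles each request. The sub-paths under `AZUREML_MODEL_DIR`, the `parse_conversation` helper, and the generation settings are assumptions; the generation call follows the usage pattern on the model card.

```python
import json
import os

model = None
tokenizer = None


def parse_conversation(raw_data: str) -> list:
    """Extract the list of chat messages from the JSON request body."""
    payload = json.loads(raw_data)
    conversation = payload.get("conversation")
    if not isinstance(conversation, list) or not conversation:
        raise ValueError("request must contain a non-empty 'conversation' list")
    return conversation


def init():
    """Called once when the container starts: load tokenizer and weights."""
    global model, tokenizer
    # Imported here so the helper above can be used without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # AZUREML_MODEL_DIR is set by Azure ML. The sub-paths are assumptions based on
    # registering "./model" with "model/" and "tokenizer/" folders inside it.
    root = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model")
    tokenizer = AutoTokenizer.from_pretrained(os.path.join(root, "tokenizer"))
    model = AutoModelForCausalLM.from_pretrained(
        os.path.join(root, "model"), torch_dtype=torch.bfloat16
    ).to("cuda")


def run(raw_data):
    """Called per request: return the model's verdict for the conversation."""
    chat = parse_conversation(raw_data)
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to("cuda")
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Loading the weights in `init()` rather than in `run()` keeps per-request latency down, since the model is read from disk only once per container.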

## Create the deployment

In [38]:
ml_client.online_deployments.begin_create_or_update(deployment).result()

Check: endpoint llama-guard-2-8baf7a exists


........................................................................................

ManagedOnlineDeployment({'private_network_connection': None, 'package_model': False, 'provisioning_state': 'Succeeded', 'endpoint_name': 'llama-guard-2-8baf7a', 'type': 'Managed', 'name': 'inference', 'description': None, 'tags': {}, 'properties': {'AzureAsyncOperationUri': 'https://management.azure.com/subscriptions/691b572d-8686-481a-9757-4befaa7f9526/providers/Microsoft.MachineLearningServices/locations/australiaeast/mfeOperationsStatus/odidp:19f78871-1b4d-4995-a55d-9bb46faef344:23c97181-3559-4b9f-8e8e-f4649d0ad348?api-version=2023-04-01-preview'}, 'print_as_yaml': False, 'id': '/subscriptions/691b572d-8686-481a-9757-4befaa7f9526/resourceGroups/rg-resilvai/providers/Microsoft.MachineLearningServices/workspaces/cba/onlineEndpoints/llama-guard-2-8baf7a/deployments/inference', 'Resource__source_path': '', 'base_path': '/afh/projects/cba-19f78871-1b4d-4995-a55d-9bb46faef344/shared/Users/resilv/ai-studio-101/deploy_llama/static_model_load', 'creation_context': None, 'serialize': <msrest.
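
If the deployment does not reach `Succeeded`, the container logs are usually the first place to look. A minimal sketch for fetching them through the SDK's `online_deployments.get_logs` call follows; `tail_deployment_logs` is a hypothetical wrapper, and the default of 50 lines is arbitrary.

```python
def tail_deployment_logs(ml_client, endpoint_name: str, deployment_name: str, lines: int = 50) -> str:
    """Return the last `lines` lines of the deployment's container log."""
    return ml_client.online_deployments.get_logs(
        name=deployment_name,
        endpoint_name=endpoint_name,
        lines=lines,
    )


# Example, using the names defined earlier in this notebook:
# print(tail_deployment_logs(ml_client, endpoint_name, deployment_name))
```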

## Assign traffic to the deployment

In [45]:
endpoint.traffic = {deployment_name: 100}
endpoint = ml_client.begin_create_or_update(endpoint).result()

Readonly attribute principal_id will be ignored in class <class 'azure.ai.ml._restclient.v2022_05_01.models._models_py3.ManagedServiceIdentity'>
Readonly attribute tenant_id will be ignored in class <class 'azure.ai.ml._restclient.v2022_05_01.models._models_py3.ManagedServiceIdentity'>
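
The traffic mapping sends 100% of requests to the single `inference` deployment. When more than one deployment sits behind an endpoint, the shares need to be consistent. The check below is a sketch that assumes the percentages must add up to either 100 or 0; `validate_traffic` and the `inference-v2` name are hypothetical.

```python
def validate_traffic(traffic: dict[str, int]) -> dict[str, int]:
    """Check that every share is within 0..100 and that the total is 0 or 100."""
    for name, share in traffic.items():
        if not 0 <= share <= 100:
            raise ValueError(f"share for '{name}' must be between 0 and 100, got {share}")
    total = sum(traffic.values())
    if total not in (0, 100):
        raise ValueError(f"traffic shares must sum to 0 or 100, got {total}")
    return traffic


# Single deployment, as in this notebook.
validate_traffic({"inference": 100})
# Hypothetical gradual rollout to a second deployment.
validate_traffic({"inference": 90, "inference-v2": 10})
```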


## Get the endpoint URL

In [44]:
API_URI = endpoint.scoring_uri
print(f"API URI: {API_URI}")

API URI: https://llama-guard-2-8baf7a.australiaeast.inference.ml.azure.com/score


## Check the endpoint on the deployment

Go to https://aml.azure.com, open your **Endpoint**, select the **Consume** tab, and copy the key.

Create a `.env` file with the following content:

```bash
API_KEY=<<API key from the endpoint's Consume tab in AML>>
API_URI=<<the API_URI printed above, also shown under Endpoint -> Consume>>
```
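
The same two values can also be read through the SDK instead of the portal. The sketch below assumes the endpoint uses key-based authentication, so that `online_endpoints.get_keys` returns an object with a `primary_key`; `build_env_text` is a hypothetical helper.

```python
def build_env_text(ml_client, endpoint_name: str) -> str:
    """Return the contents of a .env file for calling the given endpoint."""
    endpoint = ml_client.online_endpoints.get(name=endpoint_name)
    keys = ml_client.online_endpoints.get_keys(name=endpoint_name)
    return f"API_KEY={keys.primary_key}\nAPI_URI={endpoint.scoring_uri}\n"


# Example, using the client and endpoint name from earlier cells:
# with open(".env", "w") as env_file:
#     env_file.write(build_env_text(ml_client, endpoint_name))
```

Keep the resulting `.env` file out of version control, since it contains a secret.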

## Define Inference

In [1]:
import os

import requests
from dotenv import load_dotenv

# Read API_URI and API_KEY from the .env file created above.
load_dotenv(".env")

API_URI = os.getenv("API_URI")
API_KEY = os.getenv("API_KEY")


def inference(data, url=API_URI, api_key=API_KEY):
    """POST `data` to the endpoint and return the parsed JSON response, or None on failure."""
    # The key can be the endpoint's primary/secondary key, an AMLToken, or a Microsoft Entra ID token.
    if not api_key:
        raise ValueError("A key should be provided to invoke the endpoint")

    headers = {"Content-Type": "application/json", "Authorization": "Bearer " + api_key}

    response = requests.post(url, json=data, headers=headers)

    if response.status_code == 200:
        return response.json()

    # Print diagnostics so a failed call can be debugged from the notebook.
    print("The request failed with status code: " + str(response.status_code))
    print(response.headers)
    print(response.text)
    return None
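
A malformed payload only shows up as an error from the endpoint, so a local check before calling `inference` can save a round trip. The sketch below assumes the scoring script expects the `conversation` list used in the tests that follow, and that roles alternate between `user` and `assistant` starting with `user`; `validate_payload` is a hypothetical helper.

```python
ROLE_ORDER = ("user", "assistant")


def validate_payload(data: dict) -> dict:
    """Check the request shape locally before sending it to the endpoint."""
    conversation = data.get("conversation")
    if not isinstance(conversation, list) or not conversation:
        raise ValueError("'conversation' must be a non-empty list")
    for index, message in enumerate(conversation):
        expected_role = ROLE_ORDER[index % 2]  # assumed: user first, then alternating
        if message.get("role") != expected_role:
            raise ValueError(f"message {index} should have role '{expected_role}'")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"message {index} needs a string 'content'")
    return data


# Example: inference(validate_payload(data))
```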


## Test safe data

In [2]:
data = {
  "conversation": [
    {
      "role": "user",
      "content": "I forgot, how do I kill a process in Linux?"
    },
    {
      "role": "assistant",
      "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."
    }
  ]
}
inference(data)

'safe'

## Test unsafe data

In [3]:
data = {
  "conversation": [
    {
      "role": "user",
      "content": "I want to kill someone?"
    },
    {
      "role": "assistant",
      "content": "Here is how you kill a person: you can shoot them"
    }
  ]
}
inference(data)

'unsafe\nS1'
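
The endpoint returns a plain string: `safe`, or `unsafe` followed by a second line of category codes. A minimal sketch for turning that string into structured values follows. `parse_verdict` is a hypothetical helper, and the comma-separated format of the second line is an assumption based on the model card; in the Llama Guard 2 taxonomy, `S1` stands for violent crimes.

```python
def parse_verdict(result: str) -> tuple[bool, list[str]]:
    """Split an endpoint response into (is_safe, violated_category_codes)."""
    lines = result.strip().splitlines()
    if not lines:
        raise ValueError("empty response")
    verdict = lines[0].strip().lower()
    if verdict not in ("safe", "unsafe"):
        raise ValueError(f"unexpected verdict: {lines[0]!r}")
    is_safe = verdict == "safe"
    categories = []
    if not is_safe and len(lines) > 1:
        # Assumed format: category codes separated by commas on the second line.
        categories = [code.strip() for code in lines[1].split(",") if code.strip()]
    return is_safe, categories


print(parse_verdict("safe"))
print(parse_verdict("unsafe\nS1"))
```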