# Chat Completion: Run Llama 2 Models in SageMaker JumpStart

Original Notebook Credits: https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/llama-2-chat-completion.ipynb

---
This notebook shows how to use the SageMaker Python SDK to deploy a JumpStart text-generation model: the Llama 2 fine-tuned variant optimized for dialogue use cases.

To run inference on these models, pass `custom_attributes='accept_eula=true'` as part of the request header. Doing so confirms that you have read and accept the model's end-user license agreement (EULA). The EULA is available in the model card description and at https://ai.meta.com/resources/models-and-libraries/llama-downloads/. Inference requests sent with `custom_attributes='accept_eula=false'` fail. The example cells in this notebook pass `accept_eula=true`, so run them only after you have read and accepted the EULA.

Note: The `custom_attributes` string used to pass the EULA flag is a list of key/value pairs. The key and value are separated by '=' and pairs are separated by ';'. If the same key is passed more than once, the last value is kept and passed to the script handler, which uses it for conditional logic. For example, if 'accept_eula=false; accept_eula=true' is passed to the server, then 'accept_eula=true' is kept and passed to the script handler.
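
The last-value-wins rule described above can be illustrated with a small parser. This is only a sketch of the stated behavior, not the server's actual implementation; the function name `parse_custom_attributes` is hypothetical.

```python
def parse_custom_attributes(raw: str) -> dict:
    """Parse 'k1=v1; k2=v2' into a dict; a repeated key keeps its last value."""
    attributes = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        # Later pairs overwrite earlier ones, so the last value is kept.
        attributes[key.strip()] = value.strip()
    return attributes


print(parse_custom_attributes("accept_eula=false; accept_eula=true"))
# {'accept_eula': 'true'}
```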

---

## Setup

***

In [1]:
%pip install --upgrade --quiet sagemaker datasets

Note: you may need to restart the kernel to use updated packages.


***
You can continue with the default model or choose a different one. This notebook runs with the following model IDs:
- `meta-textgeneration-llama-2-7b-f`
- `meta-textgeneration-llama-2-13b-f`
- `meta-textgeneration-llama-2-70b-f`
***

In [2]:
model_id, model_version = "meta-textgeneration-llama-2-7b-f", "2.*"

## Deploy model

***
You can now deploy the model using SageMaker JumpStart.
***

In [3]:
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id=model_id, model_version=model_version)
predictor = model.deploy()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


For forward compatibility, pin to model_version='2.*' in your JumpStartModel or JumpStartEstimator definitions. Note that major version upgrades may have different EULA acceptance terms and input/output signatures.
Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '2.*'. You can pin to version '2.0.4' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


-----------------!

## Invoke the endpoint

***
### Supported Parameters
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability words more likely to appear in the output sequence; a lower temperature favors high-probability words. As `temperature` -> 0, sampling approaches greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
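
The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, assuming the bounds on `top_p` are inclusive; the helper name `validate_parameters` is hypothetical and is not part of the SageMaker SDK.

```python
def validate_parameters(parameters: dict) -> dict:
    """Check the three documented parameters; raise ValueError on a bad value."""

    def is_number(value) -> bool:
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if "max_new_tokens" in parameters:
        value = parameters["max_new_tokens"]
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError("max_new_tokens must be a positive integer")

    if "temperature" in parameters:
        value = parameters["temperature"]
        if not is_number(value) or value <= 0:
            raise ValueError("temperature must be a positive float")

    if "top_p" in parameters:
        value = parameters["top_p"]
        # Assumption: 0 and 1 themselves are accepted.
        if not is_number(value) or not 0 <= value <= 1:
            raise ValueError("top_p must be a float between 0 and 1")

    return parameters


# Any subset of the parameters may be given.
validate_parameters({"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6})
validate_parameters({"temperature": 0.6})
```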

***
### Notes
- If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so set `max_new_tokens` whenever possible. For the 7B, 13B, and 70B models, we recommend setting `max_new_tokens` no greater than 1500, 1000, and 500 respectively, while keeping the total number of tokens (input plus output) below 4K.
- To support a 4K context length, this model restricts query payloads to a batch size of 1. Payloads with larger batch sizes receive an endpoint error before inference.
- This model only supports the 'system', 'user' and 'assistant' roles. A dialog may start with a 'system' message, followed by 'user' and 'assistant' messages that alternate (u/a/u/a/u...).
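
The notes above can be turned into client-side checks. This is a sketch under stated assumptions: the helper names `validate_payload` and `cap_max_new_tokens` are hypothetical, and the rule that a dialog must end with a 'user' message is inferred from the u/a/u/a/u pattern rather than stated explicitly.

```python
# Recommended caps quoted in the notes above, keyed by JumpStart model ID.
RECOMMENDED_MAX_NEW_TOKENS = {
    "meta-textgeneration-llama-2-7b-f": 1500,
    "meta-textgeneration-llama-2-13b-f": 1000,
    "meta-textgeneration-llama-2-70b-f": 500,
}


def validate_payload(payload: dict) -> None:
    """Raise ValueError if the payload breaks the batch-size or role rules."""
    if len(payload["inputs"]) != 1:
        raise ValueError("only a batch size of 1 is supported")

    roles = [message["role"] for message in payload["inputs"][0]]
    if roles[:1] == ["system"]:
        roles = roles[1:]  # drop the optional leading system message
    if not roles:
        raise ValueError("dialog needs at least one user message")

    for turn, role in enumerate(roles):
        expected = "user" if turn % 2 == 0 else "assistant"
        if role != expected:
            raise ValueError(f"turn {turn}: expected {expected!r}, got {role!r}")

    # Assumption: the model replies to a user message, so the dialog ends with one.
    if roles[-1] != "user":
        raise ValueError("dialog must end with a user message")


def cap_max_new_tokens(model_id: str, requested: int) -> int:
    """Clamp the requested value to the recommended cap for known model IDs."""
    limit = RECOMMENDED_MAX_NEW_TOKENS.get(model_id)
    return requested if limit is None else min(requested, limit)


payload = {
    "inputs": [
        [
            {"role": "system", "content": "Always answer briefly."},
            {"role": "user", "content": "what is the recipe of mayonnaise?"},
        ]
    ],
    "parameters": {"max_new_tokens": 2000},
}
validate_payload(payload)
payload["parameters"]["max_new_tokens"] = cap_max_new_tokens(
    "meta-textgeneration-llama-2-13b-f", payload["parameters"]["max_new_tokens"]
)
print(payload["parameters"])  # {'max_new_tokens': 1000}
```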

***

In [4]:
def print_dialog(payload, response):
    dialog = payload["inputs"][0]
    for msg in dialog:
        print(f"{msg['role'].capitalize()}: {msg['content']}\n")
    print(
        f">>>> {response[0]['generation']['role'].capitalize()}: {response[0]['generation']['content']}"
    )
    print("\n==================================\n")

### Example

In [6]:
%%time

payload = {
    "inputs": [
        [
            {"role": "user", "content": "what is the recipe of mayonnaise?"},
        ]
    ],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
try:
    response = predictor.predict(payload, custom_attributes="accept_eula=true")
    print_dialog(payload, response)
except Exception as e:
    print(e)

User: what is the recipe of mayonnaise?

>>>> Assistant:  Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, vinegar or lemon juice, and seasonings. Here is a basic recipe for homemade mayonnaise:

Ingredients:

* 2 large egg yolks
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed
* 1 tablespoon (15 ml) vinegar or lemon juice
* 1/2 teaspoon salt
* 1/4 teaspoon sugar (optional)
* 1/4 teaspoon ground black pepper

Instructions:

1. In a medium-sized bowl, whisk together the egg yolks and salt until the mixture is smooth and slightly thickened.
2. Slowly pour in the oil while continuously whisking the mixture. The mixture will start to thicken and emulsify as you add the oil.
3. Once you have added about half of the oil, add the vinegar or lemon juice and whisk until fully incorporated.
4. Continue whisking until the mixture is smooth and creamy, and has a thick, velvety texture. This should take about 5-7 minutes.
5. Taste and adjust the seas

In [10]:
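import boto3
import json

# Call the same endpoint through the low-level SageMaker runtime client
# instead of the SDK predictor. `payload` is the dict defined in the cell above.
runtime_client = boto3.client("sagemaker-runtime")
content_type = "application/json"

response = runtime_client.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType=content_type,
    Body=json.dumps(payload),
    CustomAttributes="accept_eula=true",  # confirms EULA acceptance
)
# The response body is a stream; read and decode it before parsing the JSON.
result = json.loads(response["Body"].read().decode())
print(result[0]["generation"])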

{'role': 'assistant', 'content': " Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, and vinegar or lemon juice. Here is a basic recipe for homemade mayonnaise:\n\nIngredients:\n\n* 2 egg yolks\n* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed\n* 1 tablespoon (15 ml) vinegar or lemon juice\n* Salt and pepper to taste\n\nInstructions:\n\n1. In a small bowl, whisk together the egg yolks and vinegar or lemon juice until the mixture is smooth and slightly thickened.\n2. Slowly pour the oil into the egg yolk mixture while continuously whisking. The mixture should thicken as you add the oil, becoming smooth and creamy.\n3. Continue whisking until the mixture is thick and smooth, about 5-7 minutes.\n4. Taste and adjust the seasoning as needed with salt and pepper.\n5. Transfer the mayonnaise to a jar or airtight container and store it in the refrigerator for up to 1 week.\n\nNote: It's important to use a slow and steady stream of oil when making
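
To continue the conversation, the assistant's reply can be appended to the dialog, followed by the next user message, which keeps the roles alternating as required. The sketch below assumes the response shape shown above (`response[0]['generation']` with `role` and `content` keys). The helper name `add_turn` is hypothetical, and the response here is a shortened stand-in rather than real model output.

```python
import copy


def add_turn(payload: dict, response: list, next_user_message: str) -> dict:
    """Return a new payload with the assistant reply and a follow-up user message."""
    new_payload = copy.deepcopy(payload)  # leave the original payload untouched
    generation = response[0]["generation"]
    dialog = new_payload["inputs"][0]
    dialog.append({"role": generation["role"], "content": generation["content"]})
    dialog.append({"role": "user", "content": next_user_message})
    return new_payload


first_payload = {
    "inputs": [[{"role": "user", "content": "what is the recipe of mayonnaise?"}]],
    "parameters": {"max_new_tokens": 512, "top_p": 0.9, "temperature": 0.6},
}
# Stand-in for a real endpoint response, shortened for illustration.
mock_response = [{"generation": {"role": "assistant", "content": "Mayonnaise is ..."}}]

follow_up = add_turn(first_payload, mock_response, "Can I make it without eggs?")
print([message["role"] for message in follow_up["inputs"][0]])
# ['user', 'assistant', 'user']
```

The resulting payload can be sent with `predictor.predict` or `invoke_endpoint` in the same way as the first request.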