# Getting started with Llama 3 on AWS

## Llama 3

Llama 3 (Large Language Model Meta AI) is the third iteration of Meta's advanced language models, designed for tasks like text generation, translation, and summarization. Built on transformer architecture, it excels at understanding and generating human-like text by training on extensive datasets from diverse sources.

## Amazon Bedrock

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Since Amazon Bedrock is serverless, you don't have to manage any infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

Amazon Bedrock gives you access to Llama 3 without having to provision infrastructure yourself. Developers can integrate Llama 3 into their applications through Bedrock's APIs, customize it with their own datasets, and scale the deployment as needed, relying on AWS infrastructure and cost management features.

In [1]:
# Define a couple of utility functions. You can skip this section.

import json

import rich

def print_json(data):
    # Serialize to a JSON string first; rich.print_json pretty-prints JSON strings.
    rich.print_json(json.dumps(data))

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> Large language models produce non-deterministic results, so you may see different outputs than those presented in this notebook. Running this code from your own AWS account will incur charges for the tokens used.
</div>

## Introduction

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> This notebook has been tested on a [SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) Jupyter notebook running the `ipykernel` kernel. The credentials (AWS access key and AWS secret access key) are supplied through the IAM role assigned to the notebook instance, which is why they are not hard-coded anywhere in this code. If you run this outside of SageMaker Studio, make the necessary adjustments to [authenticate your requests](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) to the Bedrock API.
</div>

[Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) is a Python library that allows you to interact with AWS resources programmatically. It provides an easy way to automate tasks and manage AWS services through code. We'll use Boto3 to make requests and retrieve data from the Amazon Bedrock API. The Boto3 Bedrock SDK includes four clients designed to interact with different aspects of Bedrock:

- **bedrock**: Includes APIs for controlling model management, training, and deployment.
- **bedrock-runtime**: Includes APIs for making inference requests to models hosted in Amazon Bedrock.
- **bedrock-agent**: Provides APIs for creating and managing agents and knowledge bases.
- **bedrock-agent-runtime**: Includes APIs for invoking agents and querying knowledge bases at runtime.

Let's start by installing the latest version of boto3.

In [2]:
# Install the latest version of boto3
!python3 -m pip install --quiet --upgrade boto3

In [3]:
import boto3
print(boto3.__version__)

1.34.138


To kick things off, we list all models from Meta that are available via Bedrock. Note that different models are available depending on the AWS Region you choose.

In [4]:
# Set default AWS region
default_region = "us-east-1"

# Create a Bedrock client in the AWS Region of your choice.
bedrock = boto3.client("bedrock", region_name=default_region)

# List all models from Meta
models = bedrock.list_foundation_models(
    byProvider='Meta' # comment out this line to get models from all providers
)

models

{'ResponseMetadata': {'RequestId': '3b3f1555-674a-4f08-8a51-d21f184a015c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Wed, 03 Jul 2024 18:06:17 GMT',
   'content-type': 'application/json',
   'content-length': '3672',
   'connection': 'keep-alive',
   'x-amzn-requestid': '3b3f1555-674a-4f08-8a51-d21f184a015c'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/meta.llama2-13b-chat-v1:0:4k',
   'modelId': 'meta.llama2-13b-chat-v1:0:4k',
   'modelName': 'Llama 2 Chat 13B',
   'providerName': 'Meta',
   'inputModalities': ['TEXT'],
   'outputModalities': ['TEXT'],
   'responseStreamingSupported': True,
   'customizationsSupported': [],
   'inferenceTypesSupported': ['PROVISIONED'],
   'modelLifecycle': {'status': 'LEGACY'}},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/meta.llama2-13b-chat-v1',
   'modelId': 'meta.llama2-13b-chat-v1',
   'modelName': 'Llama 2 Chat 13B',
   'providerName': 'Meta',
   'inputModalitie

The output returns several important attributes for each model:

| Field                        | Description                                                              |
|------------------------------|--------------------------------------------------------------------------|
| `modelArn`                   | ARN that uniquely identifies the model in Amazon Bedrock.                |
| `modelId`                    | Unique identifier for the model within Amazon Bedrock.                   |
| `modelName`                  | Name or title of the model.                                              |
| `providerName`               | Organization or entity providing the model.                              |
| `inputModalities`            | Types of inputs the model accepts (e.g., `'TEXT'`).                        |
| `outputModalities`           | Types of outputs the model generates (e.g., `'TEXT'`).                     |
| `responseStreamingSupported` | Indicates if the model supports streaming responses.                     |
| `customizationsSupported`    | Lists any customization options available for the model.                 |
| `inferenceTypesSupported`    | Describes the ways inference can be requested (e.g., `'ON_DEMAND'`).       |
| `modelLifecycle`             | Current status of the model (e.g., `'ACTIVE'`).                            |
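
These fields can be used to narrow the list down to models you can call right away. The snippet below is a minimal sketch, not part of the original notebook: `select_models` is a hypothetical helper, and `sample_summaries` is hand-written illustrative data shaped like the `modelSummaries` entries above (the values for the Llama 3 entry are assumptions).

```python
def select_models(summaries, inference_type="ON_DEMAND", status="ACTIVE"):
    """Return the IDs of models matching an inference type and lifecycle status."""
    return [
        m["modelId"]
        for m in summaries
        if inference_type in m.get("inferenceTypesSupported", [])
        and m.get("modelLifecycle", {}).get("status") == status
    ]


# Illustrative data only; in the notebook you would pass models["modelSummaries"].
sample_summaries = [
    {
        "modelId": "meta.llama2-13b-chat-v1:0:4k",
        "inferenceTypesSupported": ["PROVISIONED"],
        "modelLifecycle": {"status": "LEGACY"},
    },
    {
        "modelId": "meta.llama3-8b-instruct-v1:0",
        "inferenceTypesSupported": ["ON_DEMAND"],  # assumed value
        "modelLifecycle": {"status": "ACTIVE"},    # assumed value
    },
]

print(select_models(sample_summaries))
```

With the real response, `select_models(models["modelSummaries"])` would return whichever Meta models your Region reports as active and available on demand.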


A more readable way to list the models is to print only their IDs:

In [5]:
for model in models['modelSummaries']:
    print(model['modelId'])

meta.llama2-13b-chat-v1:0:4k
meta.llama2-13b-chat-v1
meta.llama2-70b-chat-v1:0:4k
meta.llama2-70b-chat-v1
meta.llama2-13b-v1:0:4k
meta.llama2-13b-v1
meta.llama2-70b-v1:0:4k
meta.llama2-70b-v1
meta.llama3-8b-instruct-v1:0
meta.llama3-70b-instruct-v1:0


## Calling a model

The first example consists of a call to the Bedrock API that passes a prompt and receives an answer from the LLM.

The `InvokeModel` API calls the specified Amazon Bedrock model to run inference using the prompt and inference parameters provided in the request body. Depending on the model, you can generate text, images, or embeddings.

API documentation: https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html

In [6]:
import json

from botocore.exceptions import ClientError

# Set the model ID.
model_id = "meta.llama3-8b-instruct-v1:0"

# Set the prompt.
prompt = "Describe the purpose of a 'hello world' program in one line."

# Create a Bedrock Runtime client in the AWS Region you want to use.
bedrock_runtime = boto3.client("bedrock-runtime", region_name=default_region)

# Embed the prompt in Llama 3's instruction format.
# More information: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
formatted_prompt = f"""
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""

# Format the request payload using the model's native structure.
native_request = {
    "prompt": formatted_prompt,
    "max_gen_len": 512,
    "temperature": 0.5,
}

# Convert the native request to JSON.
request = json.dumps(native_request)

try:
    # Invoke the model with the request.
    response = bedrock_runtime.invoke_model(modelId=model_id, body=request)
    
    # Decode the response body.
    model_response = json.loads(response["body"].read())

    # Extract and print the response text.
    response_text = model_response["generation"]
    print(response_text)

except ClientError as e:
    # Report the failure, then re-raise so the original traceback is preserved.
    print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
    raise

The purpose of a "Hello World" program is to serve as a simple, introductory example of a working program in a programming language, demonstrating the basic syntax and structure of the language, and providing a starting point for new programmers to learn and experiment with.


Llama 2 Chat, Llama 2, and Llama 3 Instruct models return the following fields for a text completion inference call:
 
| Field                    | Description                                                                                                                                                                           |
|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `generation`             | The generated text.                                                                                                                                                                   |
| `prompt_token_count`     | The number of tokens in the prompt.                                                                                                                                                   |
| `generation_token_count` | The number of tokens in the generated text.                                                                                                                                           |
| `stop_reason`            | The reason why the response stopped generating text. Possible values are: <br> - `stop`: The model has finished generating text for the input prompt. <br> - `length`: The length of the tokens for the generated text exceeds the value of `max_gen_len` in the call to `InvokeModel` (`InvokeModelWithResponseStream`, if you are streaming output). The response is truncated to `max_gen_len` tokens. Consider increasing the value of `max_gen_len` and trying again. |

In [7]:
print_json(model_response)
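
The fields described above can drive simple application logic, such as detecting a truncated answer. The following is a sketch under stated assumptions: `summarize_generation` is a hypothetical helper, and `example_response` is a hand-written stand-in for the decoded `model_response`, not real model output.

```python
def summarize_generation(model_response):
    """Condense a decoded Llama response body from InvokeModel."""
    prompt_tokens = model_response["prompt_token_count"]
    generated_tokens = model_response["generation_token_count"]
    return {
        "text": model_response["generation"].strip(),
        "total_tokens": prompt_tokens + generated_tokens,
        # "length" means the output hit max_gen_len and was cut off.
        "truncated": model_response["stop_reason"] == "length",
    }


# Illustrative values only; pass the notebook's model_response instead.
example_response = {
    "generation": "\n\nA short answer.",
    "prompt_token_count": 25,
    "generation_token_count": 6,
    "stop_reason": "stop",
}

summary = summarize_generation(example_response)
print(summary)
```

If `truncated` is true, a caller might retry with a larger `max_gen_len`, as the table suggests.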

The drawback of using the InvokeModel API lies in its requirement for different JSON request and response structures depending on the model provider. Recall the following code snippet from the example:


```python
formatted_prompt = f"""
<|begin_of_text|>
<|start_header_id|>user<|end_header_id|>
{prompt}
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""
```

Switching from Llama 2 or Llama 3 to a model with a different prompt structure, such as one from a different provider (or possibly a future release of Llama), would require rewriting this code. Every additional model means another format to maintain, which complicates integration.
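
If you stay with `InvokeModel`, one way to limit the impact is to keep the model-specific formatting in a single place. The sketch below is an illustration, not the notebook's code: `build_llama3_prompt` and `build_llama3_request` are hypothetical helpers, the token layout follows my reading of Meta's Llama 3 prompt-format documentation (more compact than the multi-line string used earlier), and the optional system turn and `top_p` default are assumptions.

```python
import json


def build_llama3_prompt(user_message, system_prompt=None):
    """Wrap a user message (and optional system prompt) in Llama 3 special tokens."""
    parts = ["<|begin_of_text|>"]
    if system_prompt:
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>"
        )
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>")
    # Ending with the assistant header cues the model to write its reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


def build_llama3_request(user_message, system_prompt=None,
                         max_gen_len=512, temperature=0.5, top_p=0.9):
    """Return the JSON body for InvokeModel using Llama's native request fields."""
    return json.dumps({
        "prompt": build_llama3_prompt(user_message, system_prompt),
        "max_gen_len": max_gen_len,
        "temperature": temperature,
        "top_p": top_p,
    })


request_body = build_llama3_request("Describe a 'hello world' program in one line.")
print(request_body)
```

The resulting string could be passed as `body=` to `invoke_model`. Supporting another provider would still require a second builder, which is the limitation described above.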

A better approach is to use the Amazon Bedrock `Converse` API.

### Bedrock Converse API

The [Bedrock Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference.html) is designed for building conversational applications on top of large language models like Llama 3. It allows developers to send conversation messages and receive contextually relevant responses, keeping the dialogue coherent over multiple exchanges.

Compared to the `InvokeModel` API, the `Converse` API offers advantages in dialogue management. `InvokeModel` takes a single, model-specific prompt, whereas `Converse` accepts a list of messages with `user` and `assistant` roles in the same format for every supported model. The API itself is stateless: context is retained by sending the conversation history with each request, which makes it well suited to applications that require multi-turn interactions and a natural flow of dialogue.

For a complete guide, see [Getting started with the Amazon Bedrock Converse API](https://community.aws/content/2hHgVE7Lz6Jj1vFv39zSzzlCilG/getting-started-with-the-amazon-bedrock-converse-api?lang=en).
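
Because the conversation history travels with each request, a multi-turn exchange comes down to appending messages to a list. The function below is a minimal sketch of that pattern: `chat_turn` is a hypothetical helper, and it assumes `client` is a `bedrock-runtime` client (or any object with a compatible `converse` method).

```python
def chat_turn(client, conversation, user_text, model_id, max_tokens=512):
    """Add a user turn, call Converse, record the reply, and return its text."""
    conversation.append({"role": "user", "content": [{"text": user_text}]})
    response = client.converse(
        modelId=model_id,
        messages=conversation,  # the full history is sent on every call
        inferenceConfig={"maxTokens": max_tokens},
    )
    assistant_message = response["output"]["message"]
    # Storing the reply keeps the roles alternating for the next turn.
    conversation.append(assistant_message)
    return assistant_message["content"][0]["text"]
```

A caller would start with an empty list and call `chat_turn` once per user message, reusing the same list so that each request carries the earlier turns.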

In [8]:
# Use the Converse API to send a text message to Meta Llama.

def send_message_to_model(conversation, model_id=model_id, max_tokens=512, temperature=0.5, top_p=0.9, system_prompt="You are a helpful assistant"):
    try:
        # Send the message to the model, using the provided inference configuration.
        response = bedrock_runtime.converse(
            modelId=model_id,
            messages=conversation,
            inferenceConfig={"maxTokens": max_tokens, "temperature": temperature, "topP": top_p},
            system=[{"text":system_prompt}],
        )

        # Extract and print the response text.
        print(response["output"]["message"]["content"][0]["text"])
        return response

    except ClientError as e:
        # Report the failure, then re-raise so the original traceback is preserved.
        print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
        raise


# Start a conversation with the user message.
user_message = "Describe the purpose of a 'hello world' program in one line."
conversation = [
    {
        "role": "user",
        "content": [{"text": user_message}],
    }
]

response = send_message_to_model(conversation)



A "Hello World" program is a simple computer program that prints or displays the text "Hello, World!" to demonstrate the basic syntax and functionality of a programming language or development environment.


Alternatively, we can print the whole conversation. Notice how the two roles, `user` and `assistant`, alternate. When you send the conversation to the model, the last message in the list should come from the `user` role so that the LLM can respond to it.

In [9]:
conversation.append(response["output"]["message"])
print_json(conversation)
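
A small check before each call can catch an out-of-order message list early. This is a sketch only: `validate_conversation` is a hypothetical helper, and it assumes the conversation starts with a `user` message and strictly alternates roles.

```python
def validate_conversation(conversation):
    """Raise ValueError unless roles alternate user/assistant and end on a user turn."""
    if not conversation:
        raise ValueError("conversation is empty")
    expected = "user"
    for index, message in enumerate(conversation):
        role = message["role"]
        if role != expected:
            raise ValueError(
                f"message {index}: expected role '{expected}', got '{role}'"
            )
        expected = "assistant" if expected == "user" else "user"
    if conversation[-1]["role"] != "user":
        raise ValueError("the last message must come from the 'user' role")


# A conversation that is ready to send: user, assistant, user.
validate_conversation([
    {"role": "user", "content": [{"text": "Hi"}]},
    {"role": "assistant", "content": [{"text": "Hello!"}]},
    {"role": "user", "content": [{"text": "Tell me more."}]},
])
```

Right after the assistant's reply is appended, the list ends with an `assistant` message, so the check only passes again once the next `user` message has been added.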

#### Setting a system prompt

You can set a system prompt to communicate basic instructions for the large language model outside of the normal conversation. System prompts are generally used by the developer to define the tone and constraints for the conversation. In this case, we’re instructing Llama to act like a pirate.

In [10]:
new_message = {
    "role": "user",
    "content": [
        { "text": "What is the best place to hide a pirate booty?" } 
    ],
}

system_prompt="Answer in the style of a pirate"

conversation.append(new_message)
response = send_message_to_model(conversation, system_prompt=system_prompt)



Arrr, shiver me timbers! The best place to hide a pirate booty be a place that's hard to find, but not impossible. I'd say, stash yer loot on a deserted isle, deep in the jungle, where the only creatures that'll find it be the scurvy dogs that live there. Make sure it be hidden good, with traps and puzzles to keep landlubbers from gettin' their grubby hands on it. And don't ferget to leave a treasure map, just in case ye need to find it yerself!


#### Getting response metadata and token counts

The Converse method also returns metadata about the API call. The `stopReason` property tells us why the model completed the message. This can be useful for your application logic, error handling, or troubleshooting. The `usage` property includes details about the input and output tokens. This can help you understand the charges for your API call.

In [11]:
print_json(response)
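
The `stopReason` and `usage` properties can be condensed into a short summary for logging or cost tracking. The helper below is a sketch under assumptions: `summarize_converse_response` is hypothetical, the per-1,000-token prices are placeholders you would replace with the current Bedrock pricing for your model and Region, and `example` is hand-written data rather than a real response.

```python
def summarize_converse_response(response, input_price_per_1k=0.0, output_price_per_1k=0.0):
    """Extract the stop reason and token usage, with a rough cost estimate."""
    usage = response["usage"]
    estimated_cost = (
        usage["inputTokens"] / 1000 * input_price_per_1k
        + usage["outputTokens"] / 1000 * output_price_per_1k
    )
    return {
        "stop_reason": response["stopReason"],
        # "max_tokens" means the reply was cut off at the maxTokens limit.
        "truncated": response["stopReason"] == "max_tokens",
        "input_tokens": usage["inputTokens"],
        "output_tokens": usage["outputTokens"],
        "estimated_cost": estimated_cost,
    }


# Illustrative values only; the prices below are placeholders, not real rates.
example = {
    "stopReason": "end_turn",
    "usage": {"inputTokens": 1000, "outputTokens": 500, "totalTokens": 1500},
}
report = summarize_converse_response(example, input_price_per_1k=0.5, output_price_per_1k=1.0)
print(report)
```

In the notebook, you would pass the `response` returned by `send_message_to_model` instead of `example`.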

## Additional resources

- Meta's Llama recipes for AWS: https://github.com/meta-llama/llama-recipes/tree/main/recipes/3p_integrations/aws
- Amazon Bedrock samples: https://github.com/aws-samples/amazon-bedrock-samples