# Run Llama 2 Models in SageMaker JumpStart

---
This notebook demonstrates how to use the SageMaker Python SDK to deploy a JumpStart text generation model, using the Llama 2 fine-tuned model optimized for dialogue use cases.

---

## Setup

***

In [2]:
import datetime

print(datetime.datetime.now())

2024-06-25 00:59:09.390737


In [3]:
%pip install --upgrade --quiet sagemaker

Note: you may need to restart the kernel to use updated packages.


***
You can continue with the default model or choose a different one. This notebook runs with the following model IDs:
- `meta-textgeneration-llama-2-7b-f`
- `meta-textgeneration-llama-2-13b-f`
- `meta-textgeneration-llama-2-70b-f`
***

In [4]:
model_id = "meta-textgeneration-llama-2-7b-f"

In [5]:
model_version = "3.*"

## Deploy model

***
You can now deploy the model using SageMaker JumpStart. For successful deployment, you must manually change the `accept_eula` argument in the model's deploy method to `True`.
***

In [6]:
%%time
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id=model_id, model_version=model_version)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


Using vulnerable JumpStart model 'meta-textgeneration-llama-2-7b-f' and version '3.2.0'.
Using model 'meta-textgeneration-llama-2-7b-f' with wildcard version identifier '3.*'. You can pin to version '3.2.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


CPU times: user 1.69 s, sys: 4.6 s, total: 6.29 s
Wall time: 1.68 s


In [7]:
%%time
predictor = model.deploy(accept_eula=True)

-----------!
CPU times: user 65.4 ms, sys: 15.4 ms, total: 80.8 ms
Wall time: 6min 2s


## Invoke the endpoint

***
### Supported Parameters

***
This model supports many parameters while performing inference. They include:

* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that a sequence of words of `no_repeat_ngram_size` is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness of the output. A higher temperature makes low-probability words more likely to appear in the output sequence, while a lower temperature favors high-probability words. As `temperature` approaches 0, generation becomes greedy decoding. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end of sentence token. If specified, it must be boolean.
* **do_sample:** If True, sample the next word as per the likelihood. If specified, it must be boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **stop:** If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.

You may specify any subset of the parameters mentioned above when invoking an endpoint. Next, we show an example of how to invoke an endpoint with these arguments.

**NOTE**: If `max_new_tokens` is not defined, the model may generate up to the maximum total tokens allowed, which is 4K for these models. This may result in endpoint query timeout errors, so we recommend setting `max_new_tokens` whenever possible. For the 7B, 13B, and 70B models, we recommend setting `max_new_tokens` no greater than 1500, 1000, and 500, respectively, while keeping the total number of tokens below 4K.

**NOTE**: This model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and alternating (u/a/u/a/u...).
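As a rough illustration of the `max_new_tokens` recommendation, the sketch below assembles a request body and caps `max_new_tokens` at the per-model limit quoted in the note. The helper name `build_payload` and the lookup table `RECOMMENDED_MAX_NEW_TOKENS` are hypothetical and are not part of the SageMaker SDK; only the `inputs`/`parameters` payload shape comes from this notebook.

```python
from typing import Any, Dict

# Upper bounds on max_new_tokens recommended in the note above, keyed by model ID.
RECOMMENDED_MAX_NEW_TOKENS = {
    "meta-textgeneration-llama-2-7b-f": 1500,
    "meta-textgeneration-llama-2-13b-f": 1000,
    "meta-textgeneration-llama-2-70b-f": 500,
}


def build_payload(prompt: str, model_id: str, max_new_tokens: int = 256, **parameters: Any) -> Dict[str, Any]:
    """Hypothetical helper: assemble a request body and cap max_new_tokens."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be a positive integer")
    # Apply the recommended cap only when the model ID is one of the three listed above.
    cap = RECOMMENDED_MAX_NEW_TOKENS.get(model_id)
    if cap is not None:
        max_new_tokens = min(max_new_tokens, cap)
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens, **parameters}}


payload = build_payload(
    "<s>[INST] Hello [/INST] ",
    "meta-textgeneration-llama-2-70b-f",
    max_new_tokens=2000,
    top_p=0.9,
    temperature=0.6,
)
print(payload["parameters"])
```

A payload built this way can be passed to `predictor.predict(...)` once the endpoint is deployed. The cap only limits newly generated tokens; keeping prompt plus output under the 4K total remains the caller's responsibility.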

***

### Example prompts
***
The examples in this section demonstrate how to perform text generation with conversational dialog as prompt inputs. Example payloads are retrieved programmatically from the `JumpStartModel` object.

Input messages for Llama-2 chat models should exhibit the following format. The model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and alternating (u/a/u/a/u...). The last message must be from 'user'. A simple user prompt may look like the following:
```
<s>[INST] {user_prompt} [/INST]
```
You may also add a system prompt with the following syntax:
```
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt} [/INST]
```
Finally, you can have a conversational interaction with the model by including all previous user prompts and assistant responses in the input:
```
<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_prompt_1} [/INST] {assistant_response_1} </s><s>[INST] {user_prompt_2} [/INST]
```
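The role-ordering rules above can be checked before a dialog is formatted. The following is a minimal sketch, assuming the system message is optional (as in the first template); `validate_dialog` is a hypothetical helper, not something provided by the SDK or the model container.

```python
from typing import Dict, List


def validate_dialog(messages: List[Dict[str, str]]) -> None:
    """Hypothetical check: raise ValueError if roles break the expected order."""
    if not messages:
        raise ValueError("dialog is empty")
    # An optional system message may appear only at the very start.
    turns = messages[1:] if messages[0]["role"] == "system" else messages
    if not turns:
        raise ValueError("dialog needs at least one user message")
    # After the optional system message, roles must alternate user/assistant.
    for index, message in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected:
            raise ValueError(f"turn {index}: expected role '{expected}', got '{message['role']}'")
    # With strict alternation, an even number of turns means the dialog ends on 'assistant'.
    if turns[-1]["role"] != "user":
        raise ValueError("the last message must be from 'user'")
```

Calling this on a list of `{"role": ..., "content": ...}` dictionaries returns silently for a valid dialog and raises `ValueError` with a short explanation otherwise.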
***

In [8]:
%%time

example_payloads = model.retrieve_all_examples()


CPU times: user 292 µs, sys: 0 ns, total: 292 µs
Wall time: 297 µs


In [9]:
example_payloads[0].body

{'inputs': '<s>[INST] what is the recipe of mayonnaise? [/INST] ',
 'parameters': {'max_new_tokens': 256,
  'top_p': 0.9,
  'temperature': 0.6,
  'decoder_input_details': True,
  'details': True}}

In [10]:

for payload in example_payloads:
    response = predictor.predict(payload.body)
    print("\nInput\n", payload.body, "\n\nOutput\n", response[0]["generated_text"], "\n\n===============")


Input
 {'inputs': '<s>[INST] what is the recipe of mayonnaise? [/INST] ', 'parameters': {'max_new_tokens': 256, 'top_p': 0.9, 'temperature': 0.6, 'decoder_input_details': True, 'details': True}} 

Output
 Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, and an acid, such as vinegar or lemon juice. Here is a basic recipe for homemade mayonnaise:

Ingredients:

* 3 large egg yolks
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed
* 1 tablespoon (15 ml) vinegar or lemon juice
* Salt and pepper to taste

Instructions:

1. In a medium-sized bowl, whisk together the egg yolks and vinegar or lemon juice until the mixture becomes thick and pale yellow in color.
2. Slowly pour in the oil while continuously whisking the mixture. The mixture should thicken and emulsify as you add the oil.
3. Continue whisking until the mixture is smooth and creamy, and has a thick, velvety texture. This should take about 5-7 minutes.
4. Taste and adjust the seasoni

***
While not used in the previously provided example payloads, you can format your own messages to the Llama-2 model with the following utility function.
***

In [11]:
from typing import Dict, List


def format_messages(messages: List[Dict[str, str]]) -> str:
    """Format messages for Llama-2 chat models.
    
    The model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and 
    alternating (u/a/u/a/u...). The last message must be from 'user'.
    """
    prompt: List[str] = []

    if messages[0]["role"] == "system":
        content = "".join(["<<SYS>>\n", messages[0]["content"], "\n<</SYS>>\n\n", messages[1]["content"]])
        messages = [{"role": messages[1]["role"], "content": content}] + messages[2:]

    for user, answer in zip(messages[::2], messages[1::2]):
        prompt.extend(["<s>", "[INST] ", (user["content"]).strip(), " [/INST] ", (answer["content"]).strip(), "</s>"])

    prompt.extend(["<s>", "[INST] ", (messages[-1]["content"]).strip(), " [/INST] "])

    return "".join(prompt)


dialog = [
    {"role": "system", "content": "You are writing strategic objective statements for US government goals. You work for the US Department of Health and Human Services. Each statement should clearly communicate what the agency is trying to accomplish. It should energize Americans who have a stake in the issue to get involved. Rewrite the strategic objective statement at the bottom to make it stronger. Generate three options. Do not use jargon. Write for a 10th-grade reading level or lower. Format the statement so that it is easy to skim and read. Use less than 225 characters."},
    {"role": "user", "content": "Rewrite: Drive the integration of behavioral health into the healthcare system to strengthen and expand access to mental health and substance use disorder treatment and recovery services for individuals and families."},
]

prompt = format_messages(dialog)
prompt

'<s>[INST] <<SYS>>\nYou are writing strategic objective statements for US government goals. You work for the US Department of Health and Human Services. Each statement should clearly communicate what the agency is trying to accomplish. It should energize Americans who have a stake in the issue to get involved. Rewrite the strategic objective statement at the bottom to make it stronger. Generate three options. Do not use jargon. Write for a 10th-grade reading level or lower. Format the statement so that it is easy to skim and read. Use less than 225 characters.\n<</SYS>>\n\nRewrite: Drive the integration of behavioral health into the healthcare system to strengthen and expand access to mental health and substance use disorder treatment and recovery services for individuals and families. [/INST] '

In [12]:
payload = {
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 256,
        "top_p": 0.9,
        "temperature": 0.6,
        "decoder_input_details": True,
        "details": True,
    },
}

In [13]:
response = predictor.predict(payload)
print(response[0]["generated_text"])

Option 1:
"Unlock Access to Mental Health & Substance Use Treatment: Integrate Behavioral Health into Healthcare to Strengthen Care for All."

Option 2:
"Transform Healthcare with Behavioral Health Integration: Ensure Equitable Access to Mental Health & Substance Use Treatment for All."

Option 3:
"Healthy Minds, Healthy Lives: Integrate Behavioral Health into Healthcare to Expand Access to Mental Health & Substance Use Treatment & Recovery Services."

Each of these options is clear, concise, and easy to understand, with a focus on the importance of integrating behavioral health into the healthcare system to improve access to mental health and substance use disorder treatment and recovery services for individuals and families. They also use language that is relatable and motivating, and avoid technical jargon to make them accessible to a wide range of audiences.
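To continue a conversation after a response like the one above, the assistant's reply and the next user prompt have to be appended to the dialog before it is formatted again. The sketch below shows one way to do that; `append_turn` is a hypothetical helper, and the reply text in the usage lines is a placeholder rather than real model output.

```python
from typing import Dict, List


def append_turn(dialog: List[Dict[str, str]], assistant_reply: str, next_user_prompt: str) -> List[Dict[str, str]]:
    """Hypothetical helper: return a new dialog extended by one assistant/user exchange."""
    if dialog[-1]["role"] != "user":
        raise ValueError("the dialog must end with a user message before appending a reply")
    # Build a new list so the caller's original dialog is left unchanged.
    return dialog + [
        {"role": "assistant", "content": assistant_reply.strip()},
        {"role": "user", "content": next_user_prompt},
    ]


history = [
    {"role": "system", "content": "Answer briefly."},
    {"role": "user", "content": "What is SageMaker JumpStart?"},
]
# Placeholder reply; in the notebook this would be response[0]["generated_text"].
history = append_turn(history, " It is a machine learning hub in SageMaker. ", "How do I deploy a model from it?")
print([message["role"] for message in history])
```

The extended list ends with a `user` message again, so it can be passed to `format_messages` to produce the multi-turn prompt for the next request.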


In [14]:
def reformat_user_prompt(user_prompt):
    return f"<s>[INST] {user_prompt} [/INST]"

additional_examples = [
    {  # tell me about...
        "inputs": "Tell me about General Services Administration.",
        "parameters": {
            "do_sample": True,
            "top_p": 0.9,
            "temperature": 0.8,
            "max_new_tokens": 1024,
            "stop": ["<|endoftext|>", "</s>"],
        },
    },
    {  # tell me about...
        "inputs": "Tell me about Amazon SageMaker.",
        "parameters": {
            "do_sample": True,
            "top_p": 0.9,
            "temperature": 0.8,
            "max_new_tokens": 1024,
            "stop": ["<|endoftext|>", "</s>"],
        },
    },
    {  # summarization
    "inputs":"""Starting today, the state-of-the-art Falcon 40B foundation model from Technology
    Innovation Institute (TII) is available on Amazon SageMaker JumpStart, SageMaker's machine learning (ML) hub
    that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get
    started with ML. You can deploy and use this Falcon LLM with a few clicks in SageMaker Studio or
    programmatically through the SageMaker Python SDK.
    Falcon 40B is a 40-billion-parameter large language model (LLM) available under the Apache 2.0 license that
    ranked #1 in Hugging Face Open LLM leaderboard, which tracks, ranks, and evaluates LLMs across multiple
    benchmarks to identify top performing models. Since its release in May 2023, Falcon 40B has demonstrated
    exceptional performance without specialized fine-tuning. To make it easier for customers to access this
    state-of-the-art model, AWS has made Falcon 40B available to customers via Amazon SageMaker JumpStart.
    Now customers can quickly and easily deploy their own Falcon 40B model and customize it to fit their specific
    needs for applications such as translation, question answering, and summarizing information.
    Falcon 40B are generally available today through Amazon SageMaker JumpStart in US East (Ohio),
    US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Mumbai),
    Europe (London), Europe (Frankfurt), Europe (Ireland), and Canada (Central),
    with availability in additional AWS Regions coming soon. To learn how to use this new feature,
    please see SageMaker JumpStart documentation, the Introduction to SageMaker JumpStart –
    Text Generation with Falcon LLMs example notebook, and the blog Technology Innovation Institute trains the
    state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker. Summarize the article above:""",
        "parameters": {"max_new_tokens": 200},
    },
]

In [15]:
def query_endpoint(payload):
    """Query endpoint and print the response"""
    response = predictor.predict(payload)
    print(f"\033[1m Input:\033[0m {payload['inputs']}")
    print(f"\033[1m Output:\033[0m {response[0]['generated_text']}")

In [16]:
for ex in additional_examples:
    text = reformat_user_prompt(ex["inputs"])
    ex["inputs"] = text
    

In [17]:
print(len(additional_examples))

3


In [18]:
%%time
for payload in additional_examples:
    query_endpoint(payload)
    print("---------------------------------------------------------------------------\n\n")
    

[1m Input:[0m <s>[INST] Tell me about General Services Administration. [/INST]
[1m Output:[0m   The General Services Administration (GSA) is a federal agency in the United States that provides centralized support services to other federal agencies. The GSA was established in 1949 to help federal agencies acquire goods and services more efficiently and cost-effectively. The agency is responsible for managing the government's real estate, fleet, and acquisition of goods and services.

The GSA has several key responsibilities:

1. Real Estate Management: The GSA is responsible for managing the federal government's real estate portfolio, which includes over 9,000 buildings across the country. The agency leases and manages these buildings on behalf of federal agencies, as well as provides design and construction services for new buildings.
2. Fleet Management: The GSA manages the federal government's fleet of vehicles, including cars, trucks, and aircraft. The agency also provides fuel 

## Clean up the endpoint

In [19]:
# Delete the SageMaker model and the endpoint
predictor.delete_model()
predictor.delete_endpoint()
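If `delete_model()` raises, the `delete_endpoint()` call after it never runs and the endpoint stays up. A minimal sketch of a more defensive cleanup follows; `clean_up` is a hypothetical helper that relies only on the two `predictor` methods used in the cell above.

```python
import logging

logger = logging.getLogger(__name__)


def clean_up(predictor) -> bool:
    """Hypothetical helper: attempt both deletions and report whether all succeeded."""
    succeeded = True
    # Try each step independently so a failure in one does not skip the other.
    for step in (predictor.delete_model, predictor.delete_endpoint):
        try:
            step()
        except Exception:
            succeeded = False
            logger.exception("Cleanup step %s failed", step.__name__)
    return succeeded
```

Calling `clean_up(predictor)` returns `False` when either step fails, which is a cue to check the SageMaker console for leftover resources.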

In [20]:
print(datetime.datetime.now())

2024-06-25 01:06:44.946718
