# Module 2, Activity 2: Parameter Tuning for Model Behavior

So far we have seen how to tune the prompt.  Now let's look at ways to tune the model's behavior.  We are going to explore three different hyperparameters: temperature, top_p, and the maximum number of tokens.  There are many others that you are encouraged to explore on your own.

In [None]:
import json
from pprint import pprint
import boto3

from langchain.chains import LLMChain
from langchain_aws import ChatBedrock, ChatBedrockConverse
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.runnables import RunnableLambda

In [None]:
session = boto3.session.Session()
region = session.region_name

In [None]:
def get_data_from_s3(bucket_name, key):
    s3 = boto3.client(
        's3',
        region_name=region,
    )
    response = s3.get_object(Bucket=bucket_name, Key=key)
    data = response['Body'].read().decode('utf-8')

    return data

## Temperature

We have briefly touched on temperature but have not gone into much depth yet.  Think of temperature in an LLM like a "creativity dial" for its responses.

- Low temperature (e.g., 0.0-0.3): The model plays it safe, sticking to the most likely answers. This is great when you want accuracy and consistency, like coding help or fact-based answers.

- High temperature (e.g., 0.7-1.0): The model gets more adventurous, picking less common words and generating more diverse responses. This is useful for creative writing, brainstorming, or when you want unique outputs.

If you set it to 0, the model is close to deterministic: it will give the same (or nearly the same) answer each time it is asked the same thing.  As you increase the temperature you will get more creative (and perhaps unpredictable!) responses.  So let's create a simple prompt and run it a few times.  First we will set a low temperature.

In [None]:
system_prompt = SystemMessagePromptTemplate.from_template(
    "You are a helpful assistant."
)
human_prompt = HumanMessagePromptTemplate.from_template(
    "{input}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

In [None]:
llm = ChatBedrock(
    model_id="amazon.titan-text-express-v1",
    region_name=region,
    temperature=0.0,
    max_tokens=1000,
)

chain = prompt | llm | StrOutputParser()

In [None]:
chain.invoke("Describe the plot of Hamlet")

In [None]:
chain.invoke("Describe the plot of Hamlet")

In [None]:
chain.invoke("Describe the plot of Hamlet")

Because the temperature is low, you will not see much variation across these three runs.  Now let's turn it up to a high value and see what happens with multiple runs.

In [None]:
llm = ChatBedrock(
    model_id="amazon.titan-text-express-v1",
    region_name=region,
    temperature=0.9,
    max_tokens=1000,
)

chain = prompt | llm | StrOutputParser()

In [None]:
chain.invoke("Describe the plot of Hamlet")

In [None]:
chain.invoke("Describe the plot of Hamlet")

In [None]:
chain.invoke("Describe the plot of Hamlet")

What did you notice when you did this?  Hopefully you see much more variability.  
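
If you would like a rough number to go with that impression, here is a minimal sketch that scores how similar a set of responses are to each other using the standard library's `difflib`. The helper name `mean_pairwise_similarity` and the sample strings are made up for illustration; in the notebook you would pass in the strings returned by your `chain.invoke(...)` calls instead.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(outputs):
    """Average character-level similarity (0.0 to 1.0) over all pairs of strings."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        # Zero or one output: nothing to compare, so treat it as fully consistent.
        return 1.0
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)


# Stand-in strings (hypothetical); replace with real model outputs.
low_temp_runs = ["Hamlet seeks revenge for his father."] * 3
high_temp_runs = [
    "Hamlet seeks revenge for his father.",
    "A Danish prince is haunted by a ghost.",
    "Grief and doubt consume the prince of Denmark.",
]

print("low temperature: ", mean_pairwise_similarity(low_temp_runs))
print("high temperature:", mean_pairwise_similarity(high_temp_runs))
```

A score near 1.0 means the runs were almost identical; lower scores mean more variation. This is only a character-level comparison, so two responses with the same meaning but different wording will still score as different.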

## ChatBedrockConverse()

As the LLM space evolves, new APIs keep coming out that add more functionality.  In particular, we are now going to tune a different parameter in the LLM that is not part of the original `ChatBedrock`.  So we will move to `ChatBedrockConverse`, which is the Bedrock chat model integration built on the Bedrock Converse API.  This implementation will eventually replace the existing `ChatBedrock` implementation once the Bedrock Converse API has feature parity with the older Bedrock API.  Specifically, the Converse API does not yet support custom Bedrock models.

## top_p (AKA nucleus sampling)

We are now going to turn to a different hyperparameter we can tune, called top_p or nucleus sampling.  top_p is a way to narrow down the choices the model can pick from when generating text. Instead of considering every possible word, it focuses on a small group of the most likely words whose combined probability reaches a chosen threshold. This helps keep the response creative yet sensible by avoiding too many unlikely word choices.

In [None]:
llm = ChatBedrockConverse(
    model="anthropic.claude-3-sonnet-20240229-v1:0",  # Note that not all models support top_p
    temperature=0.0,
    top_p=0.9,
)

chain = prompt | llm | StrOutputParser()

In [None]:
chain.invoke("Compose a sonnet about my love of coffee.")

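Now let's change the value of top_p.  Note that with a temperature of 0.0 the model is already choosing close to its most likely tokens, so the difference between the two runs may be small; raising the temperature makes the effect of top_p easier to see.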

In [None]:
llm = ChatBedrockConverse(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.0,
    top_p=0.2,
)

chain = prompt | llm | StrOutputParser()

In [None]:
chain.invoke("Compose a sonnet about my love of coffee.")

## Updating both at the same time

When you adjust both temperature and top_p at the same time, you're fine-tuning two different aspects of the sampling process:

**Temperature:**
This parameter scales the raw logits (the model’s unnormalized scores for each token) before they are converted to probabilities and sampled.

- Low Temperature: Leads to a peaked distribution where the model is more confident in its top choices, resulting in more deterministic outputs.
- High Temperature: Flattens the distribution, introducing more randomness and potentially more creative outputs.
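
To make the scaling concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The logits are made-up numbers, and `softmax_with_temperature` is an illustrative helper, not something Bedrock or LangChain exposes.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then convert them to probabilities."""
    if temperature <= 0:
        # A temperature of 0 is usually treated as greedy (argmax) decoding.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At the low temperature nearly all of the probability lands on the first token; at the high temperature the three probabilities move closer together.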
  
**top_p:**
This parameter restricts the sampling to only the smallest set of tokens whose cumulative probability exceeds the threshold p.

- Low top_p (e.g., 0.3): Limits the model to only the most likely tokens, which makes the output more focused and less varied.
- High top_p (e.g., 0.9): Allows a wider selection of tokens, increasing diversity in the output.
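
The filtering step can be sketched in a few lines. The probabilities below are made up, `top_p_filter` is an illustrative helper, and the sketch stops as soon as the cumulative probability reaches p (real implementations may differ in how they treat the boundary).

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches p, and renormalize what is kept."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token probabilities
print(top_p_filter(probs, 0.3))  # only the single most likely token survives
print(top_p_filter(probs, 0.9))  # the three most likely tokens survive
```

The tokens outside the kept set are never sampled, which is why a low top_p produces more focused output.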
  
**Combined Effects:**

- If you set a high temperature and a high top_p, the model is encouraged to explore a wide range of tokens, resulting in very creative, diverse, and potentially less predictable responses.
- Conversely, a low temperature with a low top_p will constrain the model to the most likely tokens, leading to more consistent and deterministic outputs.
- Tuning both parameters simultaneously lets you balance the trade-off between creativity and predictability in the model’s output. Experimentation is key to finding the right mix for your particular use case.

Be sure to play with your own questions and combinations of temperature and top_p to see what you get!
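
As a rough illustration of how the two settings interact, the sketch below applies temperature scaling first and then nucleus filtering before drawing a token. The logits are made up and `sample_token` is a hypothetical helper; the exact order and details inside a hosted model may differ.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index using temperature scaling, then top_p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    # Step 1: temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 2: keep the smallest set of tokens whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Step 3: sample from the kept tokens in proportion to their probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four candidate tokens
rng = random.Random(0)  # fixed seed so the demo is repeatable
conservative = {sample_token(logits, 0.2, 0.3, rng) for _ in range(200)}
exploratory = {sample_token(logits, 1.5, 0.95, rng) for _ in range(200)}
print("low temperature, low top_p:  ", sorted(conservative))
print("high temperature, high top_p:", sorted(exploratory))
```

With the conservative settings only the top token is ever chosen, while the exploratory settings spread the choices across several tokens.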

## Tokens

Up to this point we have been specifying the maximum number of tokens that a model can return.  Each model also has its own limit on the total number of tokens it will work with (input and output combined), called the "context window" or "context size".  In other words, the context size is the maximum amount of text (measured in tokens) in a <prompt, completion> pair.  For the models we are working with, here are those values:

- Titan Text G1 - Lite: 4k
- Titan Text G1 - Express: 8k
- Claude 3 Sonnet: 200k (maximum output: 4k)
- Claude 3 Haiku: 200k (maximum output: 4k)

Let's see this in action now by working with data that will not all necessarily fit into the context.
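
Before sending a large prompt, a simple budget check can tell you whether the prompt plus the requested output will fit. This is only a sketch: `fits_in_context` and the dictionary keys are hypothetical, the window sizes are the rounded values from the list above, and the prompt token count is a made-up number standing in for a real count.

```python
def fits_in_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the requested completion fit in the context window."""
    return prompt_tokens + max_output_tokens <= context_window


# Rounded context sizes taken from the list above (assumed: 8k = 8,000, 200k = 200,000).
context_windows = {
    "titan-text-express": 8_000,
    "claude-3-sonnet": 200_000,
}

prompt_tokens = 45_000      # hypothetical count for a long document
max_output_tokens = 1_000   # what we plan to allow the model to generate

for name, window in context_windows.items():
    ok = fits_in_context(prompt_tokens, max_output_tokens, window)
    print(f"{name}: {'fits' if ok else 'too large'}")
```

The point of the check is that the output budget counts against the same window as the prompt, so a prompt that "just fits" leaves no room for an answer.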

## Note on throttling

Each model has a rate limit associated with it, and these limits vary by model.  If you start running many back-to-back queries and receive this error
```
ERROR:root:Error raised by bedrock service: An error occurred (ThrottlingException) when calling the InvokeModel operation (reached max retries: 4): Too many requests, please wait before trying again. You have sent too many requests.  Wait before trying again.
```
either try a different model or wait a few minutes.  
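
If you would rather not wait by hand, one common pattern is to retry with exponential backoff. The sketch below is an assumption-laden illustration, not part of LangChain or boto3: `invoke_with_backoff` is a hypothetical helper, and it detects throttling simply by looking for the text "ThrottlingException" in the error message.

```python
import random
import time


def invoke_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a throttling error, wait (1x, 2x, 4x, ... base_delay
    plus a little random jitter) and try again."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Assumption: throttling errors mention "ThrottlingException" in their text.
            throttled = "ThrottlingException" in str(exc)
            if not throttled or attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


# Demo with a fake call that is throttled twice and then succeeds.
attempts = {"count": 0}


def fake_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("ThrottlingException: Too many requests")
    return "ok"


print(invoke_with_backoff(fake_call, sleep=lambda seconds: None))  # no real waiting in the demo
```

In the notebook you could wrap a real request the same way, for example `invoke_with_backoff(lambda: chain.invoke("Describe the plot of Hamlet"))`.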

In [None]:
s3_data = get_data_from_s3("bucket-test-cj", "hamlet.txt")
s3_data[0:200]

In [None]:
llm = ChatBedrockConverse(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
)

token_count = llm.get_num_tokens(s3_data)
print(token_count)

Let's start by seeing what happens if we run this through a model that does not have a large enough context size to accommodate the data...

In [None]:
system_prompt = SystemMessagePromptTemplate.from_template(
    "You are a helpful assistant."
)
human_prompt = HumanMessagePromptTemplate.from_template(
    "{input}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

llm = ChatBedrock(
    model_id="amazon.titan-text-express-v1",
    region_name=region,
    temperature=0.0,
)

chain = prompt | llm | StrOutputParser()

At best, this will happen...

In [None]:
chain.invoke(f"What happens at the end of {s3_data}?")

(Note that the length reported in that message is in characters (s3_data + prompt), not tokens.)  At worst, you will not get any such message, and you may assume that all of the data made it into the model, only to be given a very wrong answer.

Now let's try this with a model that has a larger context window...

In [None]:
system_prompt = SystemMessagePromptTemplate.from_template(
    "You are a helpful assistant."
)
human_prompt = HumanMessagePromptTemplate.from_template(
    "{input}"
)

prompt = ChatPromptTemplate.from_messages([system_prompt, human_prompt])

llm = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name=region,
    temperature=0.0,
)

chain = prompt | llm | StrOutputParser()

In [None]:
chain.invoke(f"What happens at the end of {s3_data}?")