# Set max_tokens to be close to the expected value

The guidance shared in the post (below) was tested using GPT-4. It suggests that setting the max_tokens parameter close to the expected output tokens can help reduce the latency of requests.

https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/latency

The results were not conclusive, showing only minor differences between the two tests. 

It is important to consider- the intent of this technique is not to truncate responses that are longer than average. For example, if 90% of responses are less than 200 tokens, and 10% are much longer, setting the max_tokens to 200 tokens will give an improvement to response times. However, this may not give a useful response, as it will be cutoff part way. 

The expectation is that for responses that output a number of generation tokens below the max_token thershold

If you have identified a working implementation of this technique, please submit a PR!

#### Load Helper Functions and Import Libraries

In [4]:
import datetime
import json
import time
import os
import datetime
import json
import time
from openai import AzureOpenAI
from dotenv import load_dotenv
import json
import copy
import textwrap

# Load environment variables
load_dotenv()

import os
from openai import AzureOpenAI

def aoai_call(system_message, prompt, model, max_tokens):
    client = AzureOpenAI(
        api_version=os.getenv("API_VERSION"),
        azure_endpoint=os.getenv("AZURE_ENDPOINT"),
        api_key=os.getenv("API_KEY")
    )

    start_time = time.time()

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
        ],
        max_tokens=max_tokens
    )

    end_time = time.time()
    e2e_time = end_time - start_time

    result = json.loads(completion.model_dump_json(indent=2))
    prompt_tokens = result["usage"]["prompt_tokens"]
    completion_tokens = result["usage"]["completion_tokens"]
    completion_text = result["choices"][0]["message"]["content"]

    return result, prompt_tokens, completion_tokens, completion_text, e2e_time
model=os.getenv("MODELGPT432k")

## Use case: RAG

A simple RAG use case has been used to test this technique.

For the first A/B test, the end-to-end latency is measured as an average across several runs, as the difference appeared to not be significant and was difficult to discern.

In [5]:
context_documents="""

1. “Quantum Entanglement: Spooky Action at a Distance”
Abstract:
Quantum entanglement, a phenomenon that baffled even Einstein, lies at the heart of quantum mechanics. In this article, we delve into the mysterious world of entangled particles, exploring how they can be connected across vast distances instantaneously. From Bell’s theorem to quantum teleportation, we unravel the enigma of entanglement and its potential applications in quantum computing and secure communication.

Introduction:
Quantum entanglement defies classical intuition. Imagine two particles—say, electrons—created together and then separated by light-years. Remarkably, their properties remain intertwined, regardless of the distance between them. When one particle’s state changes, the other responds instantaneously, as if they share a hidden connection. But how does this “spooky action at a distance” work?

Bell’s Theorem:
Physicist John Bell proposed a test to determine whether entanglement was real or merely a statistical fluke. Experiments confirmed Bell’s predictions: the correlations between entangled particles violated classical limits. Quantum mechanics prevailed, and entanglement emerged as a fundamental property of the universe.

Quantum Teleportation:
Entanglement enables quantum teleportation—a process where information about one particle is transmitted to another, even if they are light-years apart. This isn’t “Star Trek” teleportation of matter; instead, it transfers quantum states. Researchers are harnessing this phenomenon for secure communication and quantum networks.

Applications:
Beyond teleportation, entanglement plays a pivotal role in quantum computing. Qubits, the building blocks of quantum computers, rely on entanglement for their power. Scientists are also exploring entanglement-based sensors, clocks, and cryptography.

2. “CRISPR-Cas9: Rewriting the Genetic Code”
Abstract:
CRISPR-Cas9, a revolutionary gene-editing tool, has transformed biology and medicine. In this article, we explore the origins of CRISPR, its mechanism, and its impact on genetic research. From curing genetic diseases to creating designer organisms, CRISPR opens new frontiers in biotechnology.

Introduction:
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) were initially discovered in bacteria as part of their immune system. Scientists soon realized that they could repurpose this system for precise gene editing. Enter CRISPR-Cas9—the Swiss Army knife of genetic manipulation.

How It Works:
CRISPR-Cas9 acts like molecular scissors. It uses a guide RNA to target specific DNA sequences, and the Cas9 protein cuts the DNA at that location. Researchers can then insert, delete, or modify genes with unprecedented accuracy. The simplicity and efficiency of CRISPR have revolutionized genetic research.

Applications:
Treating Genetic Diseases: CRISPR holds promise for curing genetic disorders like sickle cell anemia and cystic fibrosis. Clinical trials are underway.
Agriculture: CRISPR can create crops resistant to pests, drought, and disease.
Designer Babies?: Ethical debates surround using CRISPR for human enhancement.
Conservation: CRISPR may help save endangered species by editing their genomes.

"""

### A: Setting max_tokens to 2000

**Time taken: 1.9 seconds**

The max_token parameter is set significantly higher than the requirement for this use case.

In [6]:
system_message="""
You are a helpful AI assistant.
"""
prompt=f"""
Return the character "A".
"""

e2e_times = []
for _ in range(10):
    result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_call(system_message,prompt,model,2000)
    print(f"Prompt Tokens: {prompt_tokens}")
    print(f"Completion Tokens: {completion_tokens}")
    print(f"Time taken: {e2e_time:.2f} seconds")
    print(completion_text)
    e2e_times.append(e2e_time)

average_e2e_time = sum(e2e_times) / len(e2e_times)
print(f"Average time taken: {average_e2e_time:.2f} seconds")

Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.69 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.76 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.47 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.57 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.66 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.82 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.68 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.53 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.51 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.63 seconds
A
Average time taken: 1.63 seconds


### B: Set max_tokens to 50 tokens

**Time taken: 1.7 seconds**

The max_tokens is set significantly lower, close to the expected number of tokens.

In [17]:
system_message="""
You are a helpful AI assistant.
"""
prompt=f"""
Return the character "A".
"""

e2e_times = []
for _ in range(10):
    result,prompt_tokens,completion_tokens,completion_text,e2e_time=aoai_call(system_message,prompt,model,50)
    print(f"Prompt Tokens: {prompt_tokens}")
    print(f"Completion Tokens: {completion_tokens}")
    print(f"Time taken: {e2e_time:.2f} seconds")
    print(completion_text)
    e2e_times.append(e2e_time)

average_e2e_time = sum(e2e_times) / len(e2e_times)
print(f"Average time taken: {average_e2e_time:.2f} seconds")

Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.79 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.72 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.45 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.97 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.62 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.80 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.87 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.53 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.85 seconds
A
Prompt Tokens: 26
Completion Tokens: 1
Time taken: 1.61 seconds
A
Average time taken: 1.72 seconds


### Testing benefits to Time-to-First Token and other metrics

The initial test did not reveal a material benefit to the end to end latency.

A more detailed test was set up to explore other metrics relating to latency.

In [18]:
import pandas as pd
client = AzureOpenAI(
    api_version=os.getenv("API_VERSION"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT"),
    api_key=os.getenv("API_KEY")
)

# Initialize DataFrame

def run_experiment(max_tokens,samples):
    df = pd.DataFrame(columns=['e2e_time', 'time_to_first_token', 'average_tbt_duration','achieved_completion_chunks'])
    


    system_message="""
    You are a helpful AI assistant.
    """
    prompt=f"""
    Return the character A.
    """
    for _ in range(samples):
        start_time = time.time()

        completion = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": prompt},
            ],
            stream=True,
            max_tokens=max_tokens
        )

        e2e_start_time = time.time()

        tbt_durations = []
        previous_time = time.time()

        for i, message in enumerate(completion):
            if i==1:
                time_to_first_token=time.time()-e2e_start_time

            current_time = time.time()
            tbt_durations.append(current_time - previous_time)
            previous_time = current_time

        average_tbt_duration = sum(tbt_durations) / len(tbt_durations)

        e2e_end_time = time.time()
        e2e_time = e2e_end_time - e2e_start_time
        # print(time_to_first_token)
        # print(tbt_durations)

        # Create a DataFrame for the current row
        current_df = pd.DataFrame({
            'e2e_time': [e2e_time],
            'time_to_first_token': [time_to_first_token],
            'average_tbt_duration': [average_tbt_duration],
            'achieved_completion_chunks': len(tbt_durations),
        })

        # Concatenate the current DataFrame with the main DataFrame
        df = pd.concat([df, current_df], ignore_index=True)

    # Print DataFrame
    return df




In [19]:
high_max_tokens_df=run_experiment(2000,20)
# low_max_tokens_df=run_experiment(300,20)

  df = pd.concat([df, current_df], ignore_index=True)


In [20]:
high_max_tokens_df

Unnamed: 0,e2e_time,time_to_first_token,average_tbt_duration,achieved_completion_chunks
0,0.096534,0.096027,0.032059,3
1,0.100113,0.099034,0.03307,3
2,0.341751,0.340789,0.06827,5
3,0.16207,0.161486,0.053887,3
4,0.099334,0.098958,0.033035,3
5,0.126829,0.126391,0.042184,3
6,0.286629,0.285353,0.057238,5
7,0.00154,0.001029,0.000389,3
8,0.353777,0.35279,0.070669,5
9,0.091511,0.09128,0.030452,3


In [11]:
high_max_tokens_df.describe(percentiles = [.5, 0.9, .95, .99])

Unnamed: 0,e2e_time,time_to_first_token,average_tbt_duration
count,20.0,20.0,20.0
mean,0.100992,0.097004,0.030563
std,0.048295,0.038386,0.011666
min,0.057345,0.056483,0.018992
50%,0.094961,0.093883,0.030989
90%,0.149183,0.148146,0.040058
95%,0.166388,0.162287,0.049956
99%,0.243698,0.193258,0.063657
max,0.263026,0.201001,0.067083


In [12]:
low_max_tokens_df.describe(percentiles = [.5, 0.9, .95, .99])

Unnamed: 0,e2e_time,time_to_first_token,average_tbt_duration
count,20.0,20.0,20.0
mean,0.165456,0.164403,0.046039
std,0.321768,0.321733,0.080006
min,0.002427,0.001521,0.00062
50%,0.07855,0.077613,0.024106
90%,0.189198,0.188353,0.062854
95%,0.298322,0.297291,0.092914
99%,1.270729,1.26961,0.32126
max,1.51383,1.51269,0.378347


### Conclusion

It is unclear whether this technique has had a benefit to the latency. It also adds a risk of truncating responses unintentionally, which may impact the effectiveness of the app.

Only one use case and workload type was tested, and this is in no way a rigorous or exhaustive test.