# 1.b Tweak the LLM

In this notebook you will see:
- How to change the model parameters
- How to ask the LLM for a specific output format
- How to estimate the cost of a query

# Setup

In [1]:
import sys
import json

from loguru import logger
import tiktoken

from conversational_toolkit.llms.base import LLMMessage, Roles
from conversational_toolkit.llms.openai import OpenAILLM

In [2]:
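# Silence loguru: drop the default handler, then add a stderr handler whose
# level ("ERROR", i.e. no >= 40) and filter (no < 40) never both match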
logger.remove()
logger.add(sys.stderr, level="ERROR", filter=lambda record: record["level"].no < 40)

1

# Model Parameters


## Model Architecture

Different models can be used; they vary in cost, speed, supported modalities, and so on. The model is selected with the `model_name` parameter.

In [4]:
# Create the user message (answer should be 12)
query = "What is (3^3-3)/2? Be short."
user_message = LLMMessage(role=Roles.USER, content=query)

In [5]:
# Define and call the LLM with GPT-4.1-mini model
llm_gpt_4_1_mini = OpenAILLM(model_name="gpt-4.1-mini")
llm_gpt_4_1_mini_response = await llm_gpt_4_1_mini.generate([user_message])

print(llm_gpt_4_1_mini_response.content)

\((3^3 - 3)/2 = (27 - 3)/2 = 24/2 = 12\)


In [6]:
# Define and call the LLM with the GPT-3.5-turbo model; this time the answer is wrong
llm_gpt_3_5_turbo = OpenAILLM(model_name="gpt-3.5-turbo")
llm_gpt_3_5_turbo_response = await llm_gpt_3_5_turbo.generate([user_message])

print(llm_gpt_3_5_turbo_response.content)

13.5
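
The expected value can be checked directly in Python. The sketch below also compares each reply with it using a rough heuristic: it assumes the last number in the reply is the model's final answer. `last_number` is a hypothetical helper, not part of `conversational_toolkit`. Note that Python writes exponentiation as `**`, since `^` is bitwise XOR.

```python
import re

# Expected result of (3^3 - 3) / 2, written with Python's ** operator
expected = (3**3 - 3) / 2
print(expected)


def last_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


# Replies copied from the two cells above
replies = {
    "gpt-4.1-mini": r"\((3^3 - 3)/2 = (27 - 3)/2 = 24/2 = 12\)",
    "gpt-3.5-turbo": "13.5",
}

for model_name, reply in replies.items():
    print(model_name, "correct:", last_number(reply) == expected)
```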


## Temperature

Temperature controls how random (or "creative") the model output is. It is set with the `temperature` parameter.
To see the effect, we send the same query several times at each temperature.

In [7]:
query = "What is the color of the sky that pirates prefer? Be short (only 3 words)."
user_message = LLMMessage(role=Roles.USER, content=query)

In [8]:
for temperature in [0.1, 0.3, 0.8, 1.0]:
    # Use the same model at each temperature and query it three times
    for it in range(3):
        llm = OpenAILLM(model_name="gpt-4.1-mini", temperature=temperature)
        response = await llm.generate([user_message])
        print(
            f"Temperature: {temperature}, Iteration: {it + 1}, Response: {response.content}"
        )
    print("-----")

Temperature: 0.1, Iteration: 1, Response: Black as night
Temperature: 0.1, Iteration: 2, Response: Black as night
Temperature: 0.1, Iteration: 3, Response: Black as night
-----
Temperature: 0.3, Iteration: 1, Response: Black as night
Temperature: 0.3, Iteration: 2, Response: Black as night
Temperature: 0.3, Iteration: 3, Response: Black as night
-----
Temperature: 0.8, Iteration: 1, Response: Clear and blue
Temperature: 0.8, Iteration: 2, Response: Clear and blue
Temperature: 0.8, Iteration: 3, Response: Clear blue skies
-----
Temperature: 1.0, Iteration: 1, Response: Clear and blue
Temperature: 1.0, Iteration: 2, Response: Clear blue skies
Temperature: 1.0, Iteration: 3, Response: Clear and blue
-----
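
The outputs above vary more at higher temperatures. One common way to picture why: the model's raw token scores (logits) are divided by the temperature before being turned into probabilities, so a low temperature concentrates probability on the top token. The sketch below uses made-up logits for three candidate tokens. It illustrates the general mechanism only, not OpenAI's actual implementation, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]

for temperature in [0.1, 0.3, 0.8, 1.0]:
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a temperature of 0.1 almost all probability lands on the first token, while at 1.0 the other tokens keep a noticeable share, which matches the more varied replies seen above.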


# Structured Output

Models can be asked to answer in a specific format simply by prompting them to do so, but this typically leads to unreliable results.

To mitigate this, some LLMs accept a schema and are constrained to fill it in as specified, which makes it possible to automate downstream processing (see the [Pydantic documentation](https://docs.pydantic.dev/latest/)).

For each property, one can specify its type, whether it is mandatory, and so on.

In [9]:
sentences = [
    "I picked up 'The Clockwork Orchard' by Mira Ellison during my trip and couldn't stop reading it on the train.",
    "My friend recommended 'A Map of Forgotten Rivers' by Daniel Cho and Lina Petrov for our book club this month.",
    "At the old bookstore, I discovered 'Whispers in the Library' with no listed author, which made it even more mysterious.",
]

In [10]:
output_schema = {
    "type": "object",
    "name": "AnswerSchema",
    # Describe what this schema is for
    "description": "The structured output for the user's answer",
    "properties": {
        # Define a first property 'title'
        "title": {
            # It's a string
            "type": "string",
            # It should contain the title of the book
            "description": "Title of the book",
        },
        # Define a second property 'authors'
        "authors": {
            # It's an array
            "type": "array",
            # It should contain the list of authors
            "description": "List of the authors",
            "items": {
                # Each item in the array is a string
                "type": "string",
            },
        },
    },
    # The 'title' property is required, others are optional
    "required": ["title"],
    # No additional properties are allowed
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "AnswerSchema",
        "schema": output_schema,
    },
}

In [11]:
# Define the LLM and the query that will be asked for each sentence
llm = OpenAILLM(response_format=response_format)
query = "Please extract the title and authors from this sentence"

In [12]:
answers_as_dict = []

# Iterate over each sentence and get structured response from LLM
for sentence in sentences:
    user_message = LLMMessage(
        role=Roles.USER, content=f"{query}\n\nSentence: {sentence}"
    )
    answer = await llm.generate([user_message])
    answers_as_dict.append(json.loads(answer.content))

for answer in answers_as_dict:
    print("Title: ", answer.get("title"))
    print("Authors: ", answer.get("authors"))
    print("-----")

Title:  The Clockwork Orchard
Authors:  ['Mira Ellison']
-----
Title:  A Map of Forgotten Rivers
Authors:  ['Daniel Cho', 'Lina Petrov']
-----
Title:  Whispers in the Library
Authors:  []
-----
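
Because `authors` is not listed in `required`, a reply may omit it, and `json.loads` alone does not check types. A minimal sketch of a post-hoc check using only the standard library follows. `book_schema` (a trimmed copy of the schema above) and `check_answer` are hypothetical names, not part of `conversational_toolkit`, and the check covers only the few JSON Schema keywords used in this notebook, not the full specification.

```python
import json

# Trimmed copy of the schema above (descriptions omitted)
book_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title"],
    "additionalProperties": False,
}

# JSON Schema type names mapped to Python types (only those used here)
TYPE_MAP = {"string": str, "array": list, "object": dict}


def check_answer(answer, schema):
    """Return a list of problems found in a parsed answer; empty if none."""
    errors = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in answer:
            errors.append(f"missing required property: {key}")
    for key, value in answer.items():
        if key not in properties:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected property: {key}")
            continue
        expected_type = TYPE_MAP[properties[key]["type"]]
        if not isinstance(value, expected_type):
            errors.append(f"wrong type for property: {key}")
        elif expected_type is list:
            item_type_name = properties[key].get("items", {}).get("type")
            item_type = TYPE_MAP.get(item_type_name, object)
            if not all(isinstance(item, item_type) for item in value):
                errors.append(f"wrong item type in property: {key}")
    return errors


# One valid reply and one deliberately invalid reply (both made up)
raw_replies = [
    '{"title": "Whispers in the Library", "authors": []}',
    '{"authors": ["Mira Ellison"], "year": 2020}',
]
for raw in raw_replies:
    print(check_answer(json.loads(raw), book_schema))
```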


# Cost estimation

Counting the tokens sent and generated matters, because they determine the latency and the cost of the tool. Output tokens (generated by the LLM) are usually slower and more expensive than input tokens (user query, prompt, history, ...). Both the tokenizer and the prices depend on the model used.

In [13]:
# Price per token (USD) for gpt-4.1-mini, derived from the price per 1M tokens
INPUT_PRICE = 0.40 / 1_000_000
OUTPUT_PRICE = 1.60 / 1_000_000

# Function to count tokens using tiktoken
enc = tiktoken.encoding_for_model("gpt-4.1-mini")


def tokens(text):
    return len(enc.encode(text))


# Function to calculate conversation cost
def conversation_cost(messages):
    input_tokens = 0
    output_tokens = 0

    for m in messages:
        n = tokens(m.content)
        if m.role == "assistant":
            output_tokens += n
        else:
            input_tokens += n

    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_cost_usd": input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE,
    }

In [14]:
query = "Write a catchy slogan for a coffee shop run by robots."
user_message = LLMMessage(role=Roles.USER, content=query)

llm = OpenAILLM(model_name="gpt-4.1-mini")

answer = await llm.generate([user_message])

print("User: ", user_message.content)
print("LLM: ", answer.content)

conversation = [user_message, answer]

conversation_cost(conversation)

User:  Write a catchy slogan for a coffee shop run by robots.
LLM:  "Perk Up with Precision – Brewed by Bots, Crafted for You!"


{'input_tokens': 12,
 'output_tokens': 16,
 'total_cost_usd': 3.0400000000000004e-05}
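
The total above is `input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE`. The sketch below reproduces it with the token counts from this run and scales it to a larger volume. `query_cost` and the one-million-query volume are illustrative assumptions, and the prices are the ones hard-coded in the cell above, which may change over time.

```python
# Prices copied from the cell above (USD per token, gpt-4.1-mini)
INPUT_PRICE_PER_TOKEN = 0.40 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 1.60 / 1_000_000


def query_cost(input_tokens, output_tokens):
    """Cost in USD of one query given its input and output token counts."""
    return (
        input_tokens * INPUT_PRICE_PER_TOKEN
        + output_tokens * OUTPUT_PRICE_PER_TOKEN
    )


# Token counts reported above: 12 input tokens, 16 output tokens
single = query_cost(12, 16)
print(f"One query: {single:.8f} USD")

# Hypothetical volume, assuming every query has the same token counts
n_queries = 1_000_000
print(f"{n_queries:,} queries: {single * n_queries:.2f} USD")
```

Although the reply is only slightly longer than the query, the output tokens account for most of the cost here (16 × 1.60 versus 12 × 0.40 per million).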
