<a href="https://colab.research.google.com/github/ArthurNazarenko/nebius_academy_practice/blob/main/topic2/2.1_structured_inputs_and_outputs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials by Nebius Academy

Course github: [link](https://github.com/Nebius-Academy/LLM-Engineering-Essentials/tree/main)

Author: Alex Umnov

Links:
- [LinkedIn](https://www.linkedin.com/in/alex-umnov)
- Discord Profile: *alexumnov* , best to tag at #nebius-academy

The course is in development now, with more materials coming soon. [Subscribe to stay updated](https://academy.nebius.com/llm-engineering-essentials/update/)
# 2.1. Structured Inputs and Outputs

In Topic 1, we learnt how to prompt an LLM so that it understands what you want and gives a relevant answer. In this notebook, we'll continue the discussion by looking at:

* How to make prompts reusable by using **prompt templates**
* How to ensure that an LLM creates its outputs in a convenient, easily parsable format

Let's start by running some code that we'll use throughout the notebook:

In [1]:
!pip install openai -qU

In [2]:
from google.colab import userdata
nebius_api_key = userdata.get('nebius_api_key')

In [3]:
from google.colab import userdata
from openai import OpenAI
import os

os.environ['NEBIUS_API_KEY'] = userdata.get("nebius_api_key")

nebius_client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ.get("NEBIUS_API_KEY"),
)

llama_model = "meta-llama/Llama-3.3-70B-Instruct"

def prettify_string(text, max_line_length=80):
    """Wraps a string at spaces to prevent horizontal scrolling.

    Args:
        text: The string to wrap.
        max_line_length: The maximum length of each line.

    Returns:
        The wrapped text, with the original line breaks preserved.
    """

    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)

def answer_with_llm(prompt: str,
                    system_prompt="You are a helpful assistant",
                    max_tokens=512,
                    client=nebius_client,
                    model=llama_model,
                    prettify=True,
                    temperature=None) -> str:

    messages = []

    if system_prompt:
        messages.append(
            {
                "role": "system",
                "content": system_prompt
            }
        )

    messages.append(
        {
            "role": "user",
            "content": prompt
        }
    )

    completion = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature
    )

    if prettify:
        return prettify_string(completion.choices[0].message.content)
    else:
        return completion.choices[0].message.content


# Prompt templates

In an LLM-powered system, there's always a layer of prompting logic hidden from the user. For example, ChatGPT, Claude, Gemini and others have quite elaborate **system prompts** that set up rules and guardrails for the LLM's communication with the user.

However, in some cases a system prompt alone isn't a flexible enough mechanism. Imagine, for example,

* a customer support bot that needs to be aware of the user's geography to give relevant answers about locally available products
* a railway service support bot that needs to be aware of today's railway strikes and other calamities

You'll likely need to insert this information in the middle of the prompt, and **prompt templates** are a great tool for that.

Basically, a **prompt template** is a template string like

```python
"""some fixed information {template_placeholder_1}
some more fixed information {template_placeholder_2}"""
```

where the template placeholders are to be filled in just before an actual LLM call.

Let's check several neat ways of wrapping this logic.

First of all, you can write your own wrapper. In the example below, `m['content'].format(**kwargs)` fills the placeholders in every message with the keyword arguments you pass, so you can use as many placeholders as you wish.

In [4]:
from typing import List, Dict

class MessagesPromptTemplate():
    messages: List[Dict]

    def __init__(self, messages: List[Dict]):
        self.messages = messages

    def format(self, **kwargs):
        return [
            {
                "role":  m['role'],
                "content": m['content'].format(**kwargs)
            }
            for m in self.messages
        ]

In [5]:
prompt_template = MessagesPromptTemplate(
    messages = [
        {"role": "system", "content": "You only answer in rhymes"},
        {"role": "user", "content": "Tell me about {city}"}
    ]
)

In [6]:
prompt_template.format(city="Paris")

[{'role': 'system', 'content': 'You only answer in rhymes'},
 {'role': 'user', 'content': 'Tell me about Paris'}]

Let's try calling the LLM with different variables:

In [7]:
outputs = nebius_client.chat.completions.create(
    messages=prompt_template.format(city="Paris"),
    model=llama_model
).choices[0].message.content
print(outputs)

In Paris, the city of love and light,
The Eiffel Tower shines with pure delight.
The Seine River flows, a gentle stream,
Through streets of charm, and a romantic dream.

The Louvre Museum, a treasure to see,
Holds artwork and history, for you and me.
The Notre Dame Cathedral, a sight to behold,
A masterpiece of Gothic, with stories untold.

Montmartre's hills, with artists so fine,
Create masterpieces, with colors divine.
The Champs-Élysées, a boulevard so grand,
A shopping and dining, in this lovely land.

In Paris, the city of love and desire,
You'll find romance, and a heart on fire.
So come and visit, this city so fair,
And let the magic, of Paris, be beyond compare.


In [8]:
outputs = nebius_client.chat.completions.create(
    messages=prompt_template.format(city="Amsterdam"),
    model=llama_model,
).choices[0].message.content
print(outputs)

Amsterdam's a city so fine,
With canals and bridges that truly shine.
The Rijksmuseum's a sight to behold,
With art and history that never grows old.

The Anne Frank House is a must-see place,
A somber reminder of a troubled pace.
The city's vibrant, with life and with zest,
From coffee shops to flower markets, it's truly the best.

The Jordaan neighborhood's a charming delight,
With narrow streets and quaint shops in sight.
The city's famous for its red light district too,
But there's more to Amsterdam than just that to do.

From bicycling through the city's streets so wide,
To taking a boat tour, with the wind as your guide.
Amsterdam's a city that's full of surprise,
A destination that's sure to open your eyes.


The prompt template class we've written is very primitive. For example, `format` raises a `KeyError` if you don't pass a value for one of the placeholders.
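To make this failure mode concrete, here is a minimal sketch using only the standard library. The helpers `required_variables` and `safe_format` are hypothetical names introduced for illustration; they are not part of the notebook or of LangChain. The idea is to list the placeholders up front so that all missing variables can be reported at once.

```python
from string import Formatter

def required_variables(template: str) -> set:
    """Return the names of all {placeholders} found in a template string."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name  # skip literal text and empty "{}" fields
    }

def safe_format(template: str, **kwargs) -> str:
    """Format a template, reporting every missing variable at once."""
    missing = required_variables(template) - set(kwargs)
    if missing:
        raise ValueError(f"Missing template variables: {sorted(missing)}")
    return template.format(**kwargs)

template = "Tell me about {city} in {language}"

try:
    template.format(city="Paris")  # plain str.format stops at the first missing key
except KeyError as error:
    print("KeyError:", error)

print(safe_format(template, city="Paris", language="French"))
```

The same check could be added to `MessagesPromptTemplate.format` by applying it to every message's `content`.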

One good implementation of prompt templates can be found in LangChain: [PromptTemplates](https://python.langchain.com/docs/concepts/prompt_templates/).

In [9]:
!pip install langchain -qU

In [10]:
from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate([
    ("system", "You only answer in rhymes"),
    ("user", "Tell me about {city}")
])

prompt_template.invoke({"city": "Madrid"})

ChatPromptValue(messages=[SystemMessage(content='You only answer in rhymes', additional_kwargs={}, response_metadata={}), HumanMessage(content='Tell me about Madrid', additional_kwargs={}, response_metadata={})])

**Note:** You don't have to use LangChain's LLM calls or anything else from the library; you can take just its prompt template implementation.

However, there's quite a bit of useful code in that library.

In [11]:
from langchain_core.messages import convert_to_openai_messages

In [12]:
templated_messages = convert_to_openai_messages(prompt_template.invoke({"city": "Madrid"}).to_messages())
templated_messages

[{'role': 'system', 'content': 'You only answer in rhymes'},
 {'role': 'user', 'content': 'Tell me about Madrid'}]

In [13]:
outputs = nebius_client.chat.completions.create(
    messages=templated_messages,
    model=llama_model,
).choices[0].message.content
print(outputs)

In Madrid, the city's so fair,
A destination that's beyond compare.
From tapas to siestas, it's a delight,
A place to visit, morning, noon, and night.

The Prado Museum's a must-see, don't you know,
With art and history, it's a treasure to show.
The Royal Palace's grand, a sight to behold,
And the Retiro Park's where locals go to unfold.

The nightlife's vibrant, with music and cheer,
And the food's delicious, with flavors so clear.
From paella to gazpacho, it's a culinary dream,
In Madrid, your taste buds will start to beam.

So pack your bags, and come to this place,
Madrid's waiting for you, with a warm and friendly face.
You'll dance, you'll sing, you'll have so much fun,
In Madrid, the city that's number one!


# Structuring LLM outputs

In many cases you need not just a free-text answer, but something specific that you can use later in your system. For example, if you want your LLM to classify a customer's intent so that the conversation can be passed to the relevant department, you need to extract the intent class from the LLM's answer.

To parse your LLM outputs conveniently, it's wise to structure them in a specific way. We've already discussed some prompting tricks in Topic 1; this time, we'll learn several more reliable ways of making the LLM abide by a designated output format.

## Basic output structuring

As a basic way to structure your output, you can "ask" an LLM to present the output in a specific format. For example:

In [15]:
outputs = nebius_client.chat.completions.create(
    messages=[{
        'role': 'user',
        'content': """Design one role play character\'s name, class and a short description.
Present it as a markdown list"""}],
    model=llama_model,
).choices[0].message.content
print(outputs)

* **Character Name:** Eira Shadowglow
* **Class:** Moonlit Ranger
* **Description:** A skilled and deadly ranger with unparalleled accuracy in the night, Eira uses her knowledge of the wilderness and her mystical connection to the moon to track and hunt her prey, bringing justice to those who dwell in the shadows.


While this is quite good, it's not very reliable. A better way is to show the LLM some examples so that it knows what we expect.

These examples are known as **few-shot examples**, and the prompting technique itself as **few-shot prompting**.

In [16]:
outputs = nebius_client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Design one role play character\'s name, class and a short description. Present it as a markdown list.\n'\
            "Examples:\n"\
            "\n"\
            "- **Name:** Randalf the Yellow;\n"\
            "- **Class:** Fire mage;\n"\
            "- **Proficiency:** Pyro magic;\n"\
            "- **Resistance:** Fire;\n"\
            "\n"\
            "- **Name:** Bonan;\n"\
            "- **Class:** Barbarian;\n"\
            "- **Proficiency:** Axe;\n"\
            "- **Resistance:** Mental magic;\n"\
        }
    ],
    model=llama_model,
).choices[0].message.content
print(outputs)

- **Name:** Eirakai Moonwhisper;
- **Class:** Shadow Assassin;
- **Proficiency:** Stealth and Dual Daggers;
- **Resistance:** Poison;


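As you can see, the LLM captured the format pretty well.

Another simple trick is to end the prompt with a fixed prefix such as `Answer:` and ask the model to output only the answer after it: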


In [18]:
outputs = nebius_client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Solve the following equation and output only the answer number without reasoning after "Answer:"\n' \
            '123 * 321 = ?\n' \
            'Answer:'
        }
    ],
    model=llama_model,
).choices[0].message.content
print(outputs)

39543


Even though the answer isn't correct (the true product is 39483; LLMs are notoriously bad at arithmetic), the output structure is correct and easy to parse.
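Because the reply is just a number, parsing and checking it takes only a few lines. The sketch below is illustrative: `parse_numeric_answer` is a hypothetical helper, and the assumption that the model might repeat the `Answer:` prefix or add thousands separators is ours, not something observed above.

```python
def parse_numeric_answer(raw_output: str) -> int:
    """Extract an integer from a reply such as '39543' or 'Answer: 39,543'."""
    text = raw_output.strip()
    # The model may or may not repeat the "Answer:" prefix, so drop it if present.
    if "Answer:" in text:
        text = text.split("Answer:")[-1]
    return int(text.strip().replace(",", ""))

model_output = "39543"   # the reply shown above
expected = 123 * 321     # ground truth computed in Python

parsed = parse_numeric_answer(model_output)
print(parsed, expected, parsed == expected)
```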

However, we can do even better.

## Structured outputs

Modern LLMs support producing output in a specific format. For example, we can use "JSON mode" to force the output to be in JSON format.

In [19]:
json_output = nebius_client.chat.completions.create(
    messages=[{'role': 'user', 'content': 'Design a role play character\'s name, class and a short description in json format'}],
    model=llama_model,
    response_format={"type": "json_object"}
).choices[0].message.content
json_output

'{"name": "Eryndor Thorne", "class": "Shadow Weaver", "description": "A mysterious rogue with unparalleled stealth and agility, Eryndor navigates the shadows to manipulate the fabric of reality and outmaneuver his foes."}'

This is useful because it makes the output much easier to parse later:

In [20]:
import json
json.loads(json_output)

{'name': 'Eryndor Thorne',
 'class': 'Shadow Weaver',
 'description': 'A mysterious rogue with unparalleled stealth and agility, Eryndor navigates the shadows to manipulate the fabric of reality and outmaneuver his foes.'}

We can go another step further and actually define a `pydantic` model to create a schema for our outputs:

In [21]:
from typing import List
from pydantic import BaseModel

class CharacterProfile(BaseModel):
    name: str
    age: int
    special_skills: List[str]
    traits: List[str]
    character_class: str
    origin: str

completion = nebius_client.chat.completions.create(
    model=llama_model,
    messages=[
        {"role": "user", "content": "Design a role play character"}
    ],
    extra_body={
        "guided_json": CharacterProfile.model_json_schema()
    }
)

CharacterProfile.model_validate_json(completion.choices[0].message.content)

CharacterProfile(name='Eira Shadowglow', age=25, special_skills=['Stealth', 'Archery', 'Poison-making'], traits=['Agile', 'Intelligent', 'Independent'], character_class='Rogue', origin='Moonlit Forest')

So now we have a predefined output format, which is easy to work with.

Another way to structure outputs is to use examples.

Let's consider an example from the famous [MMLU dataset](https://huggingface.co/datasets/cais/mmlu):

In [22]:
question = "Which of the following statements about Ethernets is typically FALSE?"

A = "Ethernets use circuit switching to send messages."
B = "Ethernets use buses with multiple masters."
C = "Ethernet protocols use a collision-detection method to ensure that messages are transmitted properly."
D = "Networks connected by Ethernets are limited in length to a few hundred meters."

correct_answer = "A"

Ideally, we want our LLM to solve this "test" by replying with the letter of the right answer. This also makes calculating metrics much easier. Let's see what happens.

In [25]:
output = nebius_client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": f"""
Answer the following question with one of the options listed below
Question: {question}
A: {A}
B: {B}
C: {C}
D: {D}
Answer:
"""}],
    model=llama_model,
).choices[0].message.content
print(output)

A: Ethernets use circuit switching to send messages.

Explanation: Ethernets use packet switching, not circuit switching. Packet switching allows multiple devices to share the same communication channel, whereas circuit switching dedicates a channel to a single communication session. 

The other options are true:
- B: Ethernets do use buses with multiple masters, allowing multiple devices to transmit data.
- C: Ethernet protocols, such as CSMA/CD (Carrier Sense Multiple Access with Collision Detection), use collision detection to manage packet transmission and avoid data loss.
- D: Networks connected by Ethernets are limited in length, typically to a few hundred meters, due to signal attenuation and other physical limitations.


As you can see, it did output the right answer, but a simple comparison gets us into trouble:

In [26]:
output == correct_answer

False

In [27]:
output

'A: Ethernets use circuit switching to send messages.\n\nExplanation: Ethernets use packet switching, not circuit switching. Packet switching allows multiple devices to share the same communication channel, whereas circuit switching dedicates a channel to a single communication session. \n\nThe other options are true:\n- B: Ethernets do use buses with multiple masters, allowing multiple devices to transmit data.\n- C: Ethernet protocols, such as CSMA/CD (Carrier Sense Multiple Access with Collision Detection), use collision detection to manage packet transmission and avoid data loss.\n- D: Networks connected by Ethernets are limited in length, typically to a few hundred meters, due to signal attenuation and other physical limitations.'

In [28]:
correct_answer

'A'

So let's teach our model to answer in the right format using so-called **few-shot prompting**, also known as **in-context learning**. We essentially show the model some examples in the prompt to teach it the format we want the answer in.

In [29]:
output = nebius_client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": f"""
Examples:
Question: The IP protocol is primarily concerned with
A: Routing packets through the network
B: Reliable delivery of packets between directly connected machines
C: Reliable delivery of large (multi-packet) messages between machines that are not necessarily directly connected
D: Dealing with differences among operating system architectures
Answer:
A

Question: Which of the following is NOT a property of bitmap graphics?
A: Fast hardware exists to move blocks of pixels efficiently
B: Realistic lighting and shading can be done.
C: All line segments can be displayed as straight.
D: Polygons can be filled with solid colors and textures.
Answer:
C

Task:
Answer the following question with one of the options listed below. Only output the answer in the same format as the examples.
Question: {question}
A: {A}
B: {B}
C: {C}
D: {D}
Answer:
"""}],
    model=llama_model,
).choices[0].message.content
print(output)

A


In [30]:
output == correct_answer

True

We have also observed that for some models a dialog format is a better way to structure the few-shot examples:

In [31]:
output = nebius_client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": f"""
User: Answer the following question with one of the options listed below.
Question: The IP protocol is primarily concerned with
A: Routing packets through the network
B: Reliable delivery of packets between directly connected machines
C: Reliable delivery of large (multi-packet) messages between machines that are not necessarily directly connected
D: Dealing with differences among operating system architectures
Answer:
Assistant: A

User: Answer the following question with one of the options listed below.
Question: Which of the following is NOT a property of bitmap graphics?
A: Fast hardware exists to move blocks of pixels efficiently
B: Realistic lighting and shading can be done.
C: All line segments can be displayed as straight.
D: Polygons can be filled with solid colors and textures.
Answer:
Assistant: C

User: Answer the following question with one of the options listed below.
Question: {question}
A: {A}
B: {B}
C: {C}
D: {D}
Answer:
Assistant:
"""}],
    model=llama_model,
).choices[0].message.content
print(output)

A 

The statement "Ethernets use circuit switching to send messages" is typically false. Ethernets actually use packet switching to send messages. Circuit switching is a method used in traditional telephone networks, where a dedicated circuit is established between two endpoints for the duration of the call. In contrast, Ethernets use packet switching, where data is broken into small packets and transmitted independently, allowing for more efficient use of bandwidth.
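In the cell above, the whole dialog is packed into a single user message. A variation worth trying (not shown in the original notebook, so treat it as an assumption rather than a recommendation) is to pass the few-shot examples as real alternating `user` and `assistant` messages. The helper names below are hypothetical; the sketch only builds the message list and makes no API call.

```python
from typing import Dict, List, Tuple

INSTRUCTION = "Answer the following question with one of the options listed below."

def format_question(question: str, options: Dict[str, str]) -> str:
    """Render a multiple-choice question in the layout used in this notebook."""
    option_lines = "\n".join(f"{letter}: {text}" for letter, text in options.items())
    return f"{INSTRUCTION}\nQuestion: {question}\n{option_lines}\nAnswer:"

def build_few_shot_messages(
    examples: List[Tuple[str, Dict[str, str], str]],
    question: str,
    options: Dict[str, str],
) -> List[Dict[str, str]]:
    """Turn (question, options, answer) examples into alternating chat turns."""
    messages = []
    for example_question, example_options, example_answer in examples:
        messages.append({"role": "user", "content": format_question(example_question, example_options)})
        messages.append({"role": "assistant", "content": example_answer})
    # The real question goes last, as a user turn awaiting the assistant's reply.
    messages.append({"role": "user", "content": format_question(question, options)})
    return messages

examples = [
    (
        "The IP protocol is primarily concerned with",
        {
            "A": "Routing packets through the network",
            "B": "Reliable delivery of packets between directly connected machines",
            "C": "Reliable delivery of large (multi-packet) messages between machines that are not necessarily directly connected",
            "D": "Dealing with differences among operating system architectures",
        },
        "A",
    ),
]

few_shot_messages = build_few_shot_messages(
    examples,
    question="Which of the following statements about Ethernets is typically FALSE?",
    options={
        "A": "Ethernets use circuit switching to send messages.",
        "B": "Ethernets use buses with multiple masters.",
        "C": "Ethernet protocols use a collision-detection method to ensure that messages are transmitted properly.",
        "D": "Networks connected by Ethernets are limited in length to a few hundred meters.",
    },
)
for message in few_shot_messages:
    print(message["role"], "->", message["content"][:60])
```

The resulting list can be passed as `messages=` to `nebius_client.chat.completions.create`, exactly like the hand-written lists used elsewhere in this notebook.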


In theory, the examples don't even need to be relevant to the task if all we want is for the model to learn the output format:

In [32]:
output = nebius_client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": f"""
Question: Choose the letter A
A: A
B: B
C: C
D: D
Answer:
A

Question: Which is the biggest number?
A: 1
B: 2
C: 3
D: 4
Answer:
D

Answer the following question with one of the options listed below
Question: {question}
A: {A}
B: {B}
C: {C}
D: {D}
Answer:
"""}],
    model=llama_model,
).choices[0].message.content
print(output)

A

Explanation: Ethernets use packet switching to send messages, not circuit switching. Packet switching allows multiple devices to share the same communication channel, whereas circuit switching dedicates a channel to a single connection for the duration of the transmission. 

The other options are true: 
- B: Ethernets do use buses with multiple masters, meaning multiple devices can initiate communications.
- C: Ethernet protocols, such as CSMA/CD (Carrier Sense Multiple Access with Collision Detection), use collision-detection methods to manage how devices share the network and resolve conflicts when two or more devices try to transmit at the same time.
- D: Networks connected by Ethernets are indeed limited in length, typically to a few hundred meters, due to signal degradation over distance. Repeaters or switches can be used to extend the network length.


**Note:** Examples drawn from a distribution that differs from your data's can confuse the model. For the best results, try to match the distribution of your examples to your data.
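Several of the replies above start with the right letter but then continue with an explanation, so a strict `output == correct_answer` comparison fails. A more forgiving comparison can be sketched with a regular expression. `extract_choice` is a hypothetical helper, and accepting only a letter at the very start of the reply is a design assumption made here for illustration.

```python
import re
from typing import Optional

def extract_choice(raw_output: str, valid_choices: str = "ABCD") -> Optional[str]:
    """Return the option letter that opens the reply, or None if there isn't one."""
    # Accepts "A", "A.", "A: some text" or "A\n\nExplanation ...", with leading
    # whitespace. The word boundary rejects words such as "Answer" or "Both".
    match = re.match(rf"\s*([{valid_choices}])\b", raw_output)
    return match.group(1) if match else None

replies = [
    "A",
    "A: Ethernets use circuit switching to send messages.\n\nExplanation: ...",
    "A \n\nThe statement is typically false.",
    "Answer: A",
]
for reply in replies:
    print(repr(reply[:30]), "->", extract_choice(reply))
```

A reply that does not open with an option letter yields `None`, which is easy to count as a formatting failure when calculating metrics.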

## Function Calling

We can use tools through the OpenAI API as well. Let's see how to use web search with just the API:

In [33]:
!pip install tavily-python -qU

We'll need a Tavily API key, which you can get [here](https://app.tavily.com/sign-in).

Then either use Colab's secret storage or put the key into a file and upload it.

In [34]:
# os.environ["TAVILY_API_KEY"] = open(".tavily_api_key").read().strip()
os.environ["TAVILY_API_KEY"] = userdata.get("tavily_api_key")

from tavily import TavilyClient

tavily_client = TavilyClient()

response = tavily_client.search("Who is Leo Messi?", topic="general")

print(response['results'])



Now we can define a `tool` description for the client, so that the model knows how to use it.

We will only expose `query` and `topic` parameters.

We also need to write short descriptions to explain what the tool and the parameters are for. Note that it's not for you, but for the LLM :) So please make sure you provide a clear explanation.

Tool usage is sort of an extension of "JSON mode" because in the end we get a dict of parameters, parsed from the JSON.

In [35]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "web-search",
            "description": "Retrieves results from web search",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "What you search for",
                    },
                    "topic": {
                        "type": "string",
                        "description": "Search topic either 'general' or 'news'",
                        "enum": ["general", "news"]
                    },
                },
                "required": ["query"],
            },
        }
    },
]


messages = []
messages.append({"role": "system", "content": "If you are asked about the factual information, create a function call instead. If you already searched, use the results to give an answer."})
messages.append({"role": "user", "content": "What is the name of the cat from Shrek?"})
chat_response = nebius_client.chat.completions.create(
    messages=messages, tools=tools, model=llama_model
)
chat_response

ChatCompletion(id='chatcmpl-9c9092ea546b4e54949911b355ccb1d6', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-ae46e24d151f4cfc964095b13bcc4d6d', function=Function(arguments='{"query": "Shrek cat name", "topic": "general"}', name='web-search'), type='function')], reasoning_content=None), stop_reason=128008)], created=1749461310, model='meta-llama/Llama-3.3-70B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=32, prompt_tokens=267, total_tokens=299, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)

We can also ask for some newsworthy content to see whether the LLM picks a different `topic`.

In [36]:
messages = []
messages.append({"role": "system", "content": "If you are asked about the factual information, create a function call instead. If you already searched, use the results to give an answer."})
messages.append({"role": "user", "content": "What happened in London today?"})
chat_response = nebius_client.chat.completions.create(
    messages=messages, tools=tools, model=llama_model
)
chat_response

ChatCompletion(id='chatcmpl-e5ad035574964c138e99334cd05c18af', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='chatcmpl-tool-239737bdc1ba4b43b37497d51458987e', function=Function(arguments='{"query": "London news today", "topic": "news"}', name='web-search'), type='function')], reasoning_content=None), stop_reason=128008)], created=1749461343, model='meta-llama/Llama-3.3-70B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=31, prompt_tokens=262, total_tokens=293, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)

Now we can extract the tool call from the result:

In [37]:
chat_response.choices[0].message.tool_calls[0]

ChatCompletionMessageToolCall(id='chatcmpl-tool-239737bdc1ba4b43b37497d51458987e', function=Function(arguments='{"query": "London news today", "topic": "news"}', name='web-search'), type='function')
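The tool call only describes what the model wants to run; executing it is up to your code. The following sketch shows one way to dispatch it. `fake_web_search`, `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and the fake search stands in for a real call such as `tavily_client.search(...)` so that the example runs without an API key.

```python
import json
from typing import Callable, Dict

def fake_web_search(query: str, topic: str = "general") -> str:
    """Stand-in for a real search call such as tavily_client.search(...)."""
    return f"[{topic}] results for: {query}"

# Maps the tool name declared in `tools` to the Python function that implements it.
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {"web-search": fake_web_search}

def run_tool_call(name: str, arguments_json: str) -> str:
    """Decode the JSON arguments produced by the model and call the matching function."""
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Unknown tool: {name}")
    arguments = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**arguments)

# Name and arguments copied from the tool call shown above.
result = run_tool_call("web-search", '{"query": "London news today", "topic": "news"}')
print(result)
```

In a full loop you would read `name` and `arguments` from `tool_call.function`, then append the assistant message and a `{"role": "tool", "tool_call_id": tool_call.id, "content": result}` message to `messages` before calling the model again. That follows the OpenAI chat completions convention; whether a given OpenAI-compatible endpoint handles it identically is worth verifying.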

You might be wondering why we include tool usage in the structured output topic.

The thing is, you can also use this functionality to structure your output. You don't have to use a real function as your tool. Let's revisit our previous example:

In [38]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "create_rpg_character",
            "description": "Creates a character based on attributes and description",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "Name of the character",
                    },
                    "age": {
                        "type": "integer",
                        "description": "Age of the character",
                    },
                    "special_skills": {
                        "type": "array",
                        "description": "List of special skills of the character",
                        "items": {
                            "type": "string"
                        }
                    },
                    "traits": {
                        "type": "array",
                        "description": "List of traits of the character",
                        "items": {
                            "type": "string"
                        }
                    },
                    "character_class": {
                        "type": "string",
                        "description": "Class of the character",
                        "enum": ["mage", "rogue", "barbarian", "knight", "paladin"]
                    },
                    "origin": {
                        "type": "string",
                        "description": "Origin of the character",
                        "enum": ["human", "elf", "orc", "undead"]
                    },
                },
                "required": ["name", "age", "special_skills", "traits", "character_class", "origin"],
            },
        }
    },
]


In [39]:
messages = []
messages.append({"role": "system", "content": "If you are asked to create a character, use `create_rpg_character` tool."})
messages.append({"role": "user", "content": "Generate a random character for my new session"})
chat_response = nebius_client.chat.completions.create(
    messages=messages, tools=tools, model=llama_model
)
chat_response.choices[0].message.tool_calls[0].function.arguments

'{"name": "Eryndor Thorne", "age": "25", "special_skills": "[\\"stealth\\", \\"archery\\"]", "traits": "[\\"charismatic\\", \\"intelligent\\"]", "character_class": "rogue", "origin": "human"}'
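Note that in the output above `age` arrived as the string `"25"`, and the two lists arrived as JSON-encoded strings rather than arrays, so the arguments don't strictly follow the declared schema. It's worth normalizing tool-call arguments before using them. The sketch below is one possible approach; `normalize_character` is a hypothetical helper written for this particular schema.

```python
import json

# The arguments string returned by the model in the cell above.
raw_arguments = (
    '{"name": "Eryndor Thorne", "age": "25", '
    '"special_skills": "[\\"stealth\\", \\"archery\\"]", '
    '"traits": "[\\"charismatic\\", \\"intelligent\\"]", '
    '"character_class": "rogue", "origin": "human"}'
)

def normalize_character(arguments_json: str) -> dict:
    """Parse tool-call arguments and coerce fields to the types declared in the schema."""
    character = json.loads(arguments_json)
    # Integers sometimes arrive as strings, e.g. "25".
    character["age"] = int(character["age"])
    # Arrays sometimes arrive as JSON-encoded strings, so decode them a second time.
    for list_field in ("special_skills", "traits"):
        value = character[list_field]
        if isinstance(value, str):
            character[list_field] = json.loads(value)
    return character

character = normalize_character(raw_arguments)
print(character["age"], character["special_skills"], character["traits"])
```

After normalizing, the `CharacterProfile` pydantic model defined earlier could serve as a final validation step, for example via `CharacterProfile.model_validate(character)`.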

# **Practice tasks**

If you encounter any difficulties or simply want to see our solutions, feel free to check the [Solutions notebook](https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic2/2.1_structured_inputs_and_outputs_solutions.ipynb).

## Task 1. LLM Information extraction

The goal of this task is to create a system that extracts data about events from free text into a predictable format.

Let's imagine that you work for a marketing agency, and you need to gather analytics about past events dedicated to AI and Machine Learning. For that, you need to process press releases and extract:
- Event name,
- Event date,
- Number of participants,
- Number of speakers,
- Attendance price.

Of course, you can do it manually, but it's much more fun to use Generative AI! So, your task is to write a function that does this with only one request to the OpenAI API.

Below is an example of a press release (generated by ChatGPT, of course, so both the event and the people are fictional). All the press releases are in the press_releases.zip archive in the hometask week 1 folder.

<blockquote>
<p>PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence</p>

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact:
Jane Cipher
Director of Communications, InnovAI Summit
Email: jane.cipher@innovai.org
Phone: +123-4567-8910
</blockquote>

More specifically, you should write a function

```python
def parse_press_release(pr: str) -> dict: ...
```

where the output should be in the format

```python
{
  'name': 'InnovAI Summit 2023',
  'date': '05.11.2023',
  'n_participants': 3500,
  'n_speakers': 4,
  'price': None
}
```

If any of these fields is not mentioned in the text, put `None` in the respective field.

At the end, calculate the share of correct answers and analyse which kinds of mistakes your model makes most often.

**Hints and suggestions:**
- It's gonna be more convenient to experiment in Nebius AI Studio's playground https://studio.nebius.com/playground.
- You need to be very accurate with what you want from the model.
- It will help if you specify in the prompt that the output should be in JSON format; this way you will spend less time parsing the output. But be careful: even though some models are easily prompted to output JSON, always check the output format. It may contain extra formatting, for example:
<pre><code>```json
{"name": "InnovAI Summit 2023", ...}
```</code></pre>
Examining LLM outputs and their format is a must when working with them.

- Please be careful with the details. For example, Jane Cipher in the text above is not a speaker and shouldn't be counted as such (how would you exclude a contact person?). Also pay attention to the date format.
- If the model is too wilful with the output format, don't hesitate to show it some examples. Decreasing the temperature of predictions can help reduce the creativity of the answer, which is what we want for such a task.
- Debugging an LLM-powered application may become a tough business. When you think that you've polished it, an LLM can still surprise you. So, we don't expect 100% accuracy in this task, but we expect that you do your best to achieve high quality results.
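
Since checking the output format by eye gets tedious, a small validator can help. The sketch below is only an illustration: the function name `find_format_problems` and the exact checks are assumptions, and the accepted date range form `DD.MM.YYYY-DD.MM.YYYY` is one possible convention for multi-day events rather than a requirement of the task.

```python
import re

# Field names follow the task's output format; the checks themselves are illustrative.
DATE_PATTERN = re.compile(r"^\d{2}\.\d{2}\.\d{4}(-\d{2}\.\d{2}\.\d{4})?$")
EXPECTED_KEYS = {"name", "date", "n_participants", "n_speakers", "price"}

def find_format_problems(parsed: dict) -> list:
    """Return a list of human-readable problems found in a parsed press release."""
    problems = []
    missing = EXPECTED_KEYS - set(parsed)
    extra = set(parsed) - EXPECTED_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")

    # A missing value may arrive as None or as the string "None"
    date = parsed.get("date")
    if date not in (None, "None"):
        if not (isinstance(date, str) and DATE_PATTERN.match(date)):
            problems.append(f"date not in DD.MM.YYYY format: {date!r}")

    for key in ("n_participants", "n_speakers"):
        value = parsed.get(key)
        if value in (None, "None"):
            continue
        # bool is a subclass of int, so exclude it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{key} is not an integer: {value!r}")
    return problems

print(find_format_problems({
    "name": "InnovAI Summit 2023", "date": "November 5, 2023",
    "n_participants": "3,500", "n_speakers": 4,
}))
```

Running a check like this on every model answer makes format drift visible before it silently lowers the score.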

**Bonus points**:
Try writing the solution using:
- Structured JSON Output
- Guiding JSON Output using Structures
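
As a starting point for the bonus, here is a hedged sketch of schema-guided extraction. It assumes the endpoint accepts a `guided_json` field in the request body (the same field this notebook uses for the MMLU translation task) and that the backend supports nullable types in the schema; neither is verified here. The names `PRESS_RELEASE_SCHEMA` and `parse_press_release_guided` are hypothetical.

```python
import json

# Hand-written JSON Schema mirroring the task's output format (an assumption;
# a Pydantic model's model_json_schema() could be used instead).
PRESS_RELEASE_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "date": {"type": "string", "description": "DD.MM.YYYY"},
        "n_participants": {"type": ["integer", "null"]},
        "n_speakers": {"type": ["integer", "null"]},
        "price": {"type": ["string", "null"]},
    },
    "required": ["name", "date", "n_participants", "n_speakers", "price"],
}

def parse_press_release_guided(pr: str, client, model: str) -> dict:
    """Ask the model for JSON constrained by PRESS_RELEASE_SCHEMA and parse it."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Extract the event details from this press release "
                       "as JSON. Use null for anything not mentioned.\n\n" + pr,
        }],
        temperature=0,  # low temperature for a less "creative" answer
        # Assumption: the endpoint accepts `guided_json` in the request body
        extra_body={"guided_json": PRESS_RELEASE_SCHEMA},
    )
    return json.loads(completion.choices[0].message.content)
```

With the notebook's objects this would be called as `parse_press_release_guided(text, nebius_client, llama_model)`. The schema constrains the shape of the answer, not its correctness, so the prompt still has to explain what each field means.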

In [40]:
press_release = """PRESS RELEASE

InnovAI Summit 2023: A Glimpse into the Future of Artificial Intelligence

City of Virtue, Cyberspace - November 8, 2023 - The most anticipated event of the year, InnovAI Summit 2023, successfully concluded last weekend, on November 5, 2023. Held in the state-of-the-art VirtuTech Arena, the summit saw a massive turnout of over 3,500 participants, from brilliant AI enthusiasts and researchers to pioneers in the field.

Esteemed speakers took to the stage to shed light on the latest breakthroughs, practical implementations, and ethical considerations in AI. Dr. Evelyn Quantum, renowned for her groundbreaking work on Quantum Machine Learning, emphasized the importance of this merger and how it's revolutionizing computing as we know it. Another keynote came from Prof. Leo Nexus, whose current project 'AI for Sustainability' highlights the symbiotic relationship between nature and machine, aiming to use AI in restoring our planet's ecosystems.

This year's panel discussion, moderated by the talented Dr. Ada Neura, featured lively debates on the limits of AI in creative arts. Renowned digital artist, Felix Vortex, showcased how he uses generative adversarial networks to create surreal art pieces, while bestselling author, Iris Loom, explained her experiments with AI-assisted story crafting.

Among other highlights were hands-on workshops, interactive Q&A sessions, and an 'AI & Ethics' debate which was particularly well-received, emphasizing the need for transparency and fairness in AI models. An exclusive 'Start-up Alley' allowed budding entrepreneurs to showcase their innovations, gaining attention from global venture capitalists and media.

The event wrapped up with an announcement for InnovAI Summit 2024, set to be even grander. Participants left with a renewed enthusiasm for the vast possibilities that the AI and ML world promises.

For media inquiries, please contact: Jane Cipher Director of Communications, InnovAI Summit Email: jane.cipher@innovai.org Phone: +123-4567-8910"""

In [41]:
import json
import re

def extract_triple_backtick_blocks(text):
    """
    Extracts all text enclosed between triple backticks (```).
    An optional language tag on the opening fence (e.g. ```json) is skipped.
    Returns a list of code/text blocks.
    """
    return re.findall(r"```(?:[\w+-]*\n)?(.*?)```", text, re.DOTALL)

def parse_press_release(pr: str) -> dict:
    answer = answer_with_llm(
            f"Here's a press release\n{pr}\n\nExtract from it the following json:\n"\
            "If any information needed for JSON is not available, write \"None\" instead (with quotes).\n"\
            '{"name": NAME_OF_EVENT, "date": DATE_OF_EVENT, "n_participants": NUM_PARTICIPANTS, "n_speakers": NUM_SPEAKERS, "price": PRICE}\n'\
            "NAME_OF_EVENT should be the name of event advertised,\n"\
            "DATE_OF_EVENT should be the date of event mentioned in format DD.MM.YYYY or DD.MM.YYYY-DD.MM.YYYY if the event lasted for several days,\n"\
            "NUM_PARTICIPANTS should be the estimated amount of participants of said event in a format like 200 or 1000 or 10000, do not write it like 2,000,\n"\
            "NUM_SPEAKERS is a number, corresponding to amount of names of speakers and hosts mentioned\n"\
            "PRICE should be the price of event in the format EUR 100 or USD 1000 or GBP 100 depending on currency. Do not write currency symbol, instead write an abbreviation.\n"\
            "If any information needed for JSON is not available, write a json string \"None\" instead (with quotes)."
    )
    try:
        # The model may wrap its JSON in a fenced block; keep only the first block
        if "```" in answer:
            answer = extract_triple_backtick_blocks(answer)[0]
        return json.loads(answer)
    except Exception:
        # Show the raw answer to make debugging the prompt easier, then re-raise
        print(answer)
        raise

In [42]:
parse_press_release(press_release)

{'name': 'InnovAI Summit 2023',
 'date': '05.11.2023',
 'n_participants': 3500,
 'n_speakers': 5,
 'price': 'None'}

### Testing

We've prepared a small dataset for you to test your prompt on. Once you've written your function, try running the following code. At the end you can also look at the results side by side in a table in with_results.csv. Your goal is to get at least 60% of fields right.

In [43]:
!pip install --upgrade gdown
!gdown -O press_release_extraction.csv https://docs.google.com/spreadsheets/d/15IGdc3MV8864lxrLxsug0Ij480p76T1EAwBM7WGT_OI/export?format=csv

Downloading...
From: https://docs.google.com/spreadsheets/d/15IGdc3MV8864lxrLxsug0Ij480p76T1EAwBM7WGT_OI/export?format=csv
To: /content/press_release_extraction.csv
16.0kB [00:00, 23.9MB/s]


In [44]:
import pandas
pr_df = pandas.read_csv("press_release_extraction.csv")
pr_df.head()

Unnamed: 0,pr_text,pr_parsed
0,InnovAI Summit 2023: A Glimpse into the Future...,"{\n ""name"": ""InnovAI Summit 2023"",\n ""date"":..."
1,Press Dispatch: 'Artificial Mariners: Navigati...,"{""name"": ""Artificial Mariners: Navigatin' the ..."
2,FOR IMMEDIATE RELEASE\n\nAI Innovators Convene...,"{""name"": ""Annual Machine Learning Symposium 20..."
3,Press Release: Cutting-Edge Innovations Debute...,"{""name"": ""AI Advancements Summit 2023"",\n ""dat..."
4,"Press Release: Innovative Minds Gather at ""AI ...","{""name"": ""AI Horizon 2023"",\n ""date"": ""15.10.2..."


In [45]:
pr_df.pr_parsed[0]

'{\n  "name": "InnovAI Summit 2023",\n  "date": "05.11.2023",\n  "n_participants": 3500,\n  "n_speakers": 4,\n  "price": "None"\n}'

In [46]:
import json

parsed_list = []
fields = {
    "name": str,
    "date": str,
    "n_speakers": int,
    "n_participants": int,
    "price": str
}
correct_fields = 0
for row in pr_df.itertuples():
    parsed_release = parse_press_release(row.pr_text)
    parsed_list.append(json.dumps(parsed_release, indent=4))
    golden = json.loads(row.pr_parsed)
    for field, field_type in fields.items():
        golden_field = golden[field]
        parsed_field = parsed_release.get(field)
        try:
            parsed_field = field_type(parsed_field)
        except (ValueError, TypeError):
            pass
        if golden_field == parsed_field:
            correct_fields += 1
        else:
            print(f"For {golden['name']} {field} {parsed_release.get(field)} doesn't seem the same as {golden[field]}")

print(f"Correctly extracted {correct_fields} out of {len(fields) * len(pr_df)}")

For InnovAI Summit 2023 n_speakers 5 doesn't seem the same as 4
Correctly extracted 34 out of 35
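
To see which kinds of mistakes dominate, it can help to break the errors down by field. The following is a minimal sketch, not part of the original test code: `per_field_errors` is a hypothetical helper, and comparing values as strings is a simplification of the type casting done in the loop above.

```python
from collections import Counter

def per_field_errors(golden_rows: list, parsed_rows: list, fields: list) -> Counter:
    """Count, for each field, how many rows differ between golden and parsed dicts."""
    errors = Counter({field: 0 for field in fields})
    for golden, parsed in zip(golden_rows, parsed_rows):
        for field in fields:
            # Compare as strings so that 3500 and "3500" count as equal
            if str(golden.get(field)) != str(parsed.get(field)):
                errors[field] += 1
    return errors

golden_rows = [{"name": "InnovAI Summit 2023", "n_speakers": 4}]
parsed_rows = [{"name": "InnovAI Summit 2023", "n_speakers": 5}]
print(per_field_errors(golden_rows, parsed_rows, ["name", "n_speakers"]).most_common())
```

In the notebook, `golden_rows` could be built with `json.loads` over `pr_df.pr_parsed`, and `parsed_rows` with `json.loads` over `parsed_list`.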


### Bonus points
- Try and compare different ways of establishing the correct answer formatting
- Try and compare different LLMs

## Task 2. Localized MMLU

A cool thing about structured output is that it makes it very easy to produce a translated version of a dataset: the model takes the whole sample into account as context and returns the result in a format that is super easy to parse. Let's try this on MMLU.

**Task:** Write a function which inputs a sample from MMLU and outputs a translated version, using structured outputs.

Tip: make sure that the correct answer didn't change.

In [47]:
!pip install -qU datasets

In [None]:
from typing import List
from pydantic import BaseModel

class MMLUSample(BaseModel):
    ...

def translate_mmlu_sample(sample: MMLUSample, target_language: str) -> MMLUSample:
    ...

In [None]:
from typing import List
from pydantic import BaseModel

class MMLUSample(BaseModel):
    question: str
    A: str
    B: str
    C: str
    D: str
    correct_answer: str

def translate_mmlu_sample(sample: MMLUSample, target_language: str) -> MMLUSample:
    completion = nebius_client.chat.completions.create(
        model=llama_model,
        messages=[
            {
                "role": "user",
                "content": f"Translate this MMLU sample into {target_language}.\n" \
                f"Question: {sample.question}\n" \
                f"A: {sample.A}\n" \
                f"B: {sample.B}\n" \
                f"C: {sample.C}\n" \
                f"D: {sample.D}\n" \
                f"Correct answer: {sample.correct_answer}\n" \
                f"Translated sample:"
            }
        ],
        extra_body={
            "guided_json": MMLUSample.model_json_schema()
        },
    )

    translated = MMLUSample.model_validate_json(completion.choices[0].message.content)
    if translated.correct_answer != sample.correct_answer:
        translated.correct_answer = sample.correct_answer
    return translated

In [None]:
mmlu_sample = MMLUSample(
    question = "Which of the following statements about Ethernets is typically FALSE?",
    A = "Ethernets use circuit switching to send messages.",
    B = "Ethernets use buses with multiple masters.",
    C = "Ethernet protocols use a collision-detection method to ensure that messages are transmitted properly.",
    D = "Networks connected by Ethernets are limited in length to a few hundred meters.",
    correct_answer = "A"
)

translate_mmlu_sample(mmlu_sample, target_language="German")
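
Beyond the answer letter, a translated sample can be sanity-checked in a few cheap ways. The sketch below works on plain dicts (for a Pydantic model, `sample.model_dump()` gives such a dict); `check_translation` and its heuristics are illustrative assumptions, not a complete quality check.

```python
def check_translation(original: dict, translated: dict) -> list:
    """Return a list of problems found when comparing a translated sample to the original."""
    problems = []
    if translated.get("correct_answer") != original.get("correct_answer"):
        problems.append("correct answer letter changed")
    for key in ("question", "A", "B", "C", "D"):
        text = translated.get(key, "")
        if not text or not text.strip():
            problems.append(f"{key} is empty")
        elif text == original.get(key):
            # Identical text may mean the model skipped this field
            # (it can also be legitimate, e.g. for numbers or formulas)
            problems.append(f"{key} is identical to the original")
    options = [translated.get(k) for k in ("A", "B", "C", "D")]
    if len(set(options)) < len(options):
        problems.append("two or more options are identical after translation")
    return problems

original = {"question": "What is 2+2?", "A": "three", "B": "four",
            "C": "five", "D": "six", "correct_answer": "B"}
translated = {"question": "Was ist 2+2?", "A": "drei", "B": "vier",
              "C": "fünf", "D": "sechs", "correct_answer": "B"}
print(check_translation(original, translated))  # → []
```

None of these checks judges translation quality; they only catch structural slips that would otherwise distort the evaluation.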

Now let's recall the code we wrote for the MMLU evaluator and add a little twist: we'll specify both the topic and the language in which we want to evaluate the model.

In [None]:
!pip install datasets -q

**Task**: Modify the following MMLUEvaluator code so that it can also translate the input question and evaluate the performance in a different language.

In [48]:
import pandas as pd
from typing import List, Dict, Tuple
import json
from pathlib import Path
import numpy as np
from tqdm import tqdm

from datasets import load_dataset

class MMLUEvaluator:
    def __init__(self, system_prompt: str = None, prompt: str = None,
                 topic: str = "high_school_mathematics"):
        """
        Initialize the MMLU evaluator.

        Args:
            system_prompt: Optional system prompt for the model
            prompt: Custom prompt for the model
            topic: Which topic to choose
        """

        self.topic = topic
        self.topic_prettified = topic.replace("_", " ")
        self.system_prompt = system_prompt or f"You are an expert in {self.topic_prettified}."

        self.prompt = prompt or """You are given a question in {topic_prettified} with four answer options labeled by A, B, C, and D.
You need to ponder the question and justify the choice of one of the options A, B, C, or D.
At the end, do write the chosen answer option A, B, C, D after #ANSWER:
Now, take a deep breath and work out this problem step by step. If you do well, I'll tip you 200$.

QUESTION: {question}

ANSWER OPTIONS:
A: {A}
B: {B}
C: {C}
D: {D}
"""

        self.questions, self.choices, self.answers = self.load_mmlu_data(topic=self.topic)

    def load_mmlu_data(self, topic: str) -> Tuple[pd.Series, pd.DataFrame, pd.Series]:
        """
        Load MMLU test data on a given topic.

        Args:
            topic: Which topic to choose

        Returns:
            Tuple of (questions, choices, answers)
        """

        dataset = load_dataset("cais/mmlu", topic, split="test")
        dataset = pd.DataFrame(dataset)

        # Load questions and choices separately
        questions = dataset["question"]
        choices = pd.DataFrame(
            data=dataset["choices"].tolist(), columns=["A", "B", "C", "D"]
        )
        # In the dataset, true answer labels are in 0-3 format;
        # We convert it to A-D
        answers = dataset["answer"].map(lambda ans: {0: "A", 1: "B", 2: "C", 3: "D"}[ans])

        return questions, choices, answers

    def extract_answer(self, solution: str) -> str:
        """
        Extract the letter answer from model's response.

        Args:
            solution: Raw model response

        Returns:
            Extracted answer letter (A, B, C, D, or Failed to parse)
        """
        # Take whatever follows the #ANSWER: marker
        try:
            answer = solution.split('#ANSWER:')[1].strip()
        except (IndexError, AttributeError):
            answer = "Failed to parse"
        return answer

    def evaluate_single_question(self, question: str, choices: Dict[str, str],
                                 correct_answer: str,
                                 client, model) -> Tuple[bool, str, str]:
        """
        Evaluate a single question.

        Args:
            question: Formatted question string
            correct_answer: Correct answer letter

        Returns:
            Tuple of (is_correct, extracted_answer, model_response)
        """
        try:
            model_response = answer_with_llm(
                prompt=self.prompt.format(
                    client=client, model=model,
                    topic_prettified=self.topic_prettified,
                    question=question,
                    A=choices['A'], B=choices['B'], C=choices['C'], D=choices['D']
                ),
                system_prompt=self.system_prompt,
                prettify=False
            )
            answer = self.extract_answer(model_response)
            is_correct = (answer.upper() == correct_answer.upper())
            return is_correct, answer, model_response
        except Exception as e:
            print(f"Error evaluating question: {e}")
            return False, None, None

    def run_evaluation(self, client=nebius_client, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                       n_questions=50) -> Dict:
        """
        Run evaluation of a given model on the first n_questions.

        Args:
            client: Which client to use (OpenAI or Nebius)
            model: Which model to use
            n_questions: How many first questions to take

        Returns:
            Dictionary with evaluation metrics
        """
        evaluation_log = []
        correct_count = 0

        if n_questions:
            n_questions = min(n_questions, len(self.questions))
        else:
            n_questions = len(self.questions)

        for i in tqdm(range(n_questions)):
            is_correct, answer, model_response = self.evaluate_single_question(
                question=self.questions[i],
                choices=self.choices.iloc[i],
                correct_answer=self.answers[i],
                client=client,
                model=model,
            )

            if is_correct:
                correct_count += 1

            evaluation_log.append({
                'answer': answer,
                'model_response': model_response,
                'is_correct': is_correct
            })

        accuracy = correct_count / n_questions
        evaluation_results = {
            'accuracy': accuracy,
            'evaluation_log': evaluation_log
        }

        return evaluation_results


In [54]:
import pandas as pd
from typing import List, Dict, Tuple
import json
from pathlib import Path
import numpy as np
from tqdm import tqdm

from datasets import load_dataset

class MMLUEvaluator:
    def __init__(self, system_prompt: str = None, prompt: str = None,
                 topic: str = "high_school_mathematics",
                 language: str = "English"):
        """
        Initialize the MMLU evaluator.

        Args:
            system_prompt: Optional system prompt for the model
            prompt: Custom prompt for the model
            topic: Which topic to choose
            language: Language to translate each question into before evaluation
        """

        self.topic = topic
        self.language = language
        self.topic_prettified = topic.replace("_", " ")
        self.system_prompt = system_prompt or f"You are an expert in {self.topic_prettified}."

        self.prompt = prompt or """You are given a question in {topic_prettified} with four answer options labeled by A, B, C, and D.
You need to ponder the question and justify the choice of one of the options A, B, C, or D.
At the end, do write the chosen answer option A, B, C, D after #ANSWER:
Now, take a deep breath and work out this problem step by step. If you do well, I'll tip you 200$.

QUESTION: {question}

ANSWER OPTIONS:
A: {A}
B: {B}
C: {C}
D: {D}
"""

        self.questions, self.choices, self.answers = self.load_mmlu_data(topic=self.topic)

    def load_mmlu_data(self, topic: str) -> Tuple[pd.Series, pd.DataFrame, pd.Series]:
        """
        Load MMLU test data on a given topic.

        Args:
            topic: Which topic to choose

        Returns:
            Tuple of (questions, choices, answers)
        """

        dataset = load_dataset("cais/mmlu", topic, split="test")
        dataset = pd.DataFrame(dataset)

        # Load questions and choices separately
        questions = dataset["question"]
        choices = pd.DataFrame(
            data=dataset["choices"].tolist(), columns=["A", "B", "C", "D"]
        )
        # In the dataset, true answer labels are in 0-3 format;
        # We convert it to A-D
        answers = dataset["answer"].map(lambda ans: {0: "A", 1: "B", 2: "C", 3: "D"}[ans])

        return questions, choices, answers

    def extract_answer(self, solution: str) -> str:
        """
        Extract the letter answer from model's response.

        Args:
            solution: Raw model response

        Returns:
            Extracted answer letter (A, B, C, D, or Failed to parse)
        """
        # Take whatever follows the #ANSWER: marker
        try:
            answer = solution.split('#ANSWER:')[1].strip()
        except (IndexError, AttributeError):
            answer = "Failed to parse"
        return answer

    def evaluate_single_question(self, question: str, choices: Dict[str, str],
                                 correct_answer: str,
                                 client, model) -> Tuple[bool, str, str]:
        """
        Evaluate a single question.

        Args:
            question: Formatted question string
            correct_answer: Correct answer letter

        Returns:
            Tuple of (is_correct, extracted_answer, model_response)
        """
        try:
            if self.language != "English":
                sample = MMLUSample(
                    question=question,
                    A=choices['A'], B=choices['B'], C=choices['C'], D=choices['D'],
                    correct_answer=correct_answer
                )
                translated = translate_mmlu_sample(sample, target_language=self.language)
                question = translated.question
                choices = {"A": translated.A, "B": translated.B, "C": translated.C, "D": translated.D}
                correct_answer = translated.correct_answer
            model_response = answer_with_llm(
                prompt=self.prompt.format(
                    client=client, model=model,
                    topic_prettified=self.topic_prettified,
                    question=question,
                    A=choices['A'], B=choices['B'], C=choices['C'], D=choices['D']
                ),
                system_prompt=self.system_prompt,
                prettify=False
            )
            answer = self.extract_answer(model_response)
            is_correct = (answer.upper() == correct_answer.upper())
            return is_correct, answer, model_response
        except Exception as e:
            print(f"Error evaluating question: {e}")
            return False, None, None

    def run_evaluation(self, client=nebius_client, model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                       n_questions=50) -> Dict:
        """
        Run evaluation of a given model on the first n_questions.

        Args:
            client: Which client to use (OpenAI or Nebius)
            model: Which model to use
            n_questions: How many first questions to take

        Returns:
            Dictionary with evaluation metrics
        """
        evaluation_log = []
        correct_count = 0

        if n_questions:
            n_questions = min(n_questions, len(self.questions))
        else:
            n_questions = len(self.questions)

        for i in tqdm(range(n_questions)):
            is_correct, answer, model_response = self.evaluate_single_question(
                question=self.questions[i],
                choices=self.choices.iloc[i],
                correct_answer=self.answers[i],
                client=client,
                model=model,
            )

            if is_correct:
                correct_count += 1

            evaluation_log.append({
                'answer': answer,
                'model_response': model_response,
                'is_correct': is_correct
            })

        accuracy = correct_count / n_questions
        evaluation_results = {
            'accuracy': accuracy,
            'evaluation_log': evaluation_log
        }

        return evaluation_results

### Testing

In [55]:
evaluator = MMLUEvaluator(topic="medical_genetics", language="English")

results = evaluator.run_evaluation(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                         n_questions=50)
print(f'\nAccuracy: {results["accuracy"]}')

100%|██████████| 50/50 [09:46<00:00, 11.74s/it]


Accuracy: 0.86





In [56]:
evaluator_de = MMLUEvaluator(topic="medical_genetics", language="German")

results_de = evaluator_de.run_evaluation(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
                         n_questions=10)
print(f'\nAccuracy: {results_de["accuracy"]}')

100%|██████████| 10/10 [00:00<00:00, 13999.68it/s]

Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined
Error evaluating question: name 'MMLUSample' is not defined

Accuracy: 0.0
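
In the German run above, all ten questions failed with `name 'MMLUSample' is not defined` (the cell defining `MMLUSample` and `translate_mmlu_sample` had not been executed), so the 0.0 reflects errors rather than wrong answers. Because `evaluate_single_question` catches exceptions and returns `(False, None, None)`, failures and genuine mistakes look the same in the accuracy. A small hypothetical helper, `summarize_evaluation_log`, can separate the two:

```python
def summarize_evaluation_log(evaluation_log: list) -> dict:
    """Split log entries into correct, wrong, and failed (no model response)."""
    summary = {"correct": 0, "wrong": 0, "failed": 0}
    for entry in evaluation_log:
        if entry["model_response"] is None:
            # evaluate_single_question returns (False, None, None) on exceptions
            summary["failed"] += 1
        elif entry["is_correct"]:
            summary["correct"] += 1
        else:
            summary["wrong"] += 1
    answered = summary["correct"] + summary["wrong"]
    # Accuracy over answered questions only; None if nothing was answered
    summary["accuracy_answered"] = summary["correct"] / answered if answered else None
    return summary

log = [
    {"answer": "A", "model_response": "... #ANSWER: A", "is_correct": True},
    {"answer": None, "model_response": None, "is_correct": False},
]
print(summarize_evaluation_log(log))
```

For the runs in this notebook it would be called as `summarize_evaluation_log(results_de["evaluation_log"])`. A nonzero `failed` count is a signal to fix the pipeline before reading anything into the accuracy.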



