### Structured output

We want to constrain the LLM to a predefined JSON format so that it generates exactly one note containing both a front and a back. The format itself is not the problem: Llama-3.1-8B-Instruct can already follow that instruction. The issue is that the model sometimes generates multiple notes and, at times, hallucinates notes that go beyond the scope of the original note.

`vllm` supports guided output out of the box and uses `outlines` under the hood.

Run the following command in a terminal to spin up a vLLM server, which we will query in the rest of the notebook.

```bash
$ vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max_model_len 4096 --chat-template ./template_llama31.jinja
```

The server may take about a minute to start and become reachable.
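Rather than guessing when the server is ready, we can poll it. The following is a minimal sketch, assuming the server listens on `localhost:8000` as in the rest of this notebook and answers on the OpenAI-compatible `/v1/models` route. The helper names `server_is_up` and `wait_until_ready` are hypothetical and not part of vLLM.

```python
import time
import urllib.request

# Assumption: default host and port, matching the base URL used in this notebook.
MODELS_URL = "http://localhost:8000/v1/models"


def server_is_up(url: str = MODELS_URL) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(probe=server_is_up, timeout_s: float = 120.0,
                     interval_s: float = 2.0) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` in a cell before creating the client blocks until the server responds, and returns `False` if it does not come up within the timeout.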

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
```

### Completion API

```python
completion = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="What is the capital of France?",
    temperature=0,
    max_tokens=7,
)
print("Completion result:", completion.choices[0].text)
```

```text
Completion result:  Paris
What is the capital of
```


### Chat API

```python
chat_response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
)
print("Chat response:", chat_response.choices[0].message.content)
```

```text
Chat response: The capital of France is Paris.
```


### Chat API + Guided Choice

```python
chat_response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    # guided_choice restricts the answer to exactly one of these strings.
    extra_body={
        "guided_choice": [
            "São Paulo",
            "Rome",
            "New York",
            "Paris",
            "Milan",
            "Los Angeles",
        ]
    },
)
print("Chat response:", chat_response.choices[0].message.content)
```

```text
Chat response: Paris
```


### Chat API + Guided JSON

```python
chat_response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
    # guided_json forces the output to conform to this JSON schema.
    extra_body={
        "guided_json": {
            "title": "City",
            "type": "object",
            "properties": {
                "name": {"type": "string"},
            },
            "required": ["name"],
        }
    },
)
print("Chat response:", chat_response.choices[0].message.content)
```

```text
Chat response: { "name": "Paris" }
```
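The same mechanism can be applied to the note-generation problem described at the start of this section. The sketch below is one possible way to do it, not a definitive implementation: the schema name `Note`, the field names `front` and `back`, the system prompt, and the helpers `parse_note` and `generate_note` are all assumptions made for illustration. Because the schema describes a single object rather than an array, guided decoding should yield one note, and `parse_note` double-checks that on the client side.

```python
import json

# Hypothetical schema for a single flashcard note; the field names
# "front" and "back" are assumptions based on the goal stated above.
NOTE_SCHEMA = {
    "title": "Note",
    "type": "object",
    "properties": {
        "front": {"type": "string"},
        "back": {"type": "string"},
    },
    "required": ["front", "back"],
}


def parse_note(raw: str) -> dict:
    """Parse a model response and check that it is a single, complete note."""
    note = json.loads(raw)
    if not isinstance(note, dict):
        raise ValueError("expected a single JSON object")
    for field in NOTE_SCHEMA["required"]:
        if not isinstance(note.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return note


def generate_note(client, source_text: str) -> dict:
    """Request one note from the vLLM server, constrained by NOTE_SCHEMA.

    `client` is an OpenAI client pointed at the vLLM server, as configured
    earlier in this notebook. The system prompt is illustrative only.
    """
    chat_response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You write one flashcard note as JSON."},
            {"role": "user", "content": source_text},
        ],
        temperature=0,
        extra_body={"guided_json": NOTE_SCHEMA},
    )
    return parse_note(chat_response.choices[0].message.content)
```

With the server running, `generate_note(client, "some source text")` returns a dictionary with `front` and `back` keys, or raises `ValueError` if the response is not a single well-formed note.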
