# 7 Essential Concepts for working with LLMs

1. Text Generation: Prompt writing, parameters for LLMs
2. Retrieval & Ranking: Vectors/Embeddings - Searching for "similar" items
3. Input Preprocessing: Loaders and Chunking
4. Orchestration: Chains
5. Model Fine-tuning: Training, Transfer Learning
6. Output Postprocessing: Function Calling for JSON Formatting, Parsing
7. Output Evaluation: Metrics, Scoring

These are recurring concepts that are essential for working with LLMs. They are not necessarily in order of importance, but they are all important. We introduce a select few of them here and will go into more detail in the upcoming chapters. Think of this as the table of contents for the rest of the course.

## Text Generation

This section is about generating text using LLMs. This is the most common use case for LLMs. We will cover how to write prompts, how to set parameters for the LLMs, and how to generate text. This is based on [How to format inputs to ChatGPT models](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb) from OpenAI.

In [2]:
import json
from pathlib import Path
from typing import List

import tiktoken
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # Load environment variables from .env file

client = OpenAI()

## Example Chat Completion API Call

A chat completion API call takes the following parameters.

**Required**
- `model`: the name of the model you want to use (e.g., `gpt-3.5-turbo`, `gpt-4`, `gpt-3.5-turbo-1106`)
- `messages`: a list of message objects, where each object has two required fields:
    - `role`: the role of the messenger (either `system`, `user`, `assistant` or `tool`)
    - `content`: the content of the message (e.g., `Write me a beautiful poem`)

Messages can also contain an optional `name` field, which gives the messenger a name, e.g., `example-user`, `Alice`, `BlackbeardBot`. Names may not contain spaces.
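
The message rules above can be checked locally before a request is sent. The sketch below uses a hypothetical helper, `validate_messages`, which is not part of the OpenAI library. It enforces only what this section states: a `role` from the four listed values, a `content` field, and no spaces in an optional `name`. The real API may apply further rules that are not modeled here.

```python
from typing import Dict, List

ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_messages(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError if a message breaks the rules listed above."""
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"message {i}: unknown role {role!r}")
        if "content" not in message:
            raise ValueError(f"message {i}: missing 'content'")
        name = message.get("name")
        if name is not None and " " in name:
            raise ValueError(f"message {i}: name {name!r} contains a space")


# Passes silently: both messages follow the rules.
validate_messages(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "name": "Alice", "content": "Write me a beautiful poem"},
    ]
)
```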

**Optional**
- `frequency_penalty`: Penalizes tokens based on their frequency, reducing repetition.
- `logit_bias`: Modifies likelihood of specified tokens with bias values.
- `logprobs`: Returns log probabilities of output tokens if true.
- `top_logprobs`: Specifies the number of most likely tokens to return at each position.
- `max_tokens`: Sets the maximum number of generated tokens in chat completion.
- `n`: Generates a specified number of chat completion choices for each input.
- `presence_penalty`: Penalizes new tokens based on their presence in the text.
- `response_format`: Specifies the output format, e.g., JSON mode.
- `seed`: Requests best-effort deterministic sampling for a given seed.
- `stop`: Specifies up to 4 sequences where the API should stop generating tokens.
- `stream`: Sends partial message deltas as tokens become available.
- `temperature`: Sets the sampling temperature between 0 and 2.
- `top_p`: Uses nucleus sampling; considers tokens with top_p probability mass.
- `tools`: Lists functions the model may call.
- `tool_choice`: Controls the model's function calls (none/auto/function).
- `user`: Unique identifier for end-user monitoring and abuse detection.
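
To build intuition for `temperature` and `top_p`, here is a minimal sketch of how these two knobs are commonly understood to reshape a next-token distribution. The logits are made up, the function names are hypothetical, and this illustrates the general technique rather than OpenAI's internal implementation.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("this sketch needs temperature > 0")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def nucleus(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    kept: Dict[str, float] = {}
    mass = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}


# Made-up logits for three candidate next tokens.
logits = {"who": 2.0, "what": 1.0, "banana": 0.0}

for t in (0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})

# With top_p=0.9 the least likely token is dropped before sampling.
print(nucleus(apply_temperature(logits, 1.0), top_p=0.9))
```

Lower temperatures concentrate probability on the most likely token, and in the limit of a very small temperature almost all of the mass goes to it. Higher temperatures flatten the distribution.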


As of January 2024, you can also optionally submit a list of `functions` that tell the model it may generate JSON to feed into a function; this parameter is deprecated in favor of `tools`, listed above. For details, see the [documentation](https://platform.openai.com/docs/guides/function-calling), [API reference](https://platform.openai.com/docs/api-reference/chat), or the Cookbook guide [How to call functions with chat models](How_to_call_functions_with_chat_models.ipynb).

Typically, a conversation will start with a system message that tells the assistant how to behave, followed by alternating user and assistant messages, but you are not required to follow this format.

Let's look at an example chat API call to see how the chat format works in practice.

In [3]:
# Example OpenAI Python library request
MODEL = "gpt-3.5-turbo-0125"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Orange."},
    ],
    temperature=0,
)

In [4]:
print(json.dumps(json.loads(response.model_dump_json()), indent=4))

{
    "id": "chatcmpl-9Gu0GLmeUs27ldoLlRLSVwEgUProV",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "Orange who?",
                "role": "assistant",
                "function_call": null,
                "tool_calls": null
            }
        }
    ],
    "created": 1713815552,
    "model": "gpt-3.5-turbo-0125",
    "object": "chat.completion",
    "system_fingerprint": "fp_c2295e73ad",
    "usage": {
        "completion_tokens": 3,
        "prompt_tokens": 35,
        "total_tokens": 38
    }
}


Extracting just the reply:

In [13]:
response.choices[0].message.content

'Orange who?'
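
The other fields of the response are worth reading too. As a small sketch, the block below parses a trimmed copy of the JSON printed above (rather than calling the API) and pulls out the reply, the `finish_reason`, and the token usage. A `finish_reason` of `stop` indicates that the model ended the message on its own.

```python
import json

# A trimmed copy of the response printed above.
raw = """
{
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"content": "Orange who?", "role": "assistant"}
        }
    ],
    "model": "gpt-3.5-turbo-0125",
    "usage": {"completion_tokens": 3, "prompt_tokens": 35, "total_tokens": 38}
}
"""

data = json.loads(raw)
choice = data["choices"][0]

reply = choice["message"]["content"]
finished_normally = choice["finish_reason"] == "stop"
usage = data["usage"]

print(reply)
print(finished_normally)
# Prompt and completion tokens add up to the billed total.
print(usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"])
```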

## Prompting Essentials

### System Messages

Unlike many open source models, OpenAI models, especially those released after the 0613 snapshots, pay attention to system messages. You can use a system message to set the overall tone of the conversation, e.g. `role: system, content: "You're a friendly assistant."` or `role: system, content: "You're a friendly teaching assistant who is talking to a fifth grader with English as a second language."`

In [20]:
# An example of a system message that primes the assistant to be verbose and helpful
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a friendly and helpful teaching assistant.",
        },
        {"role": "user", "content": "Describe the weather in Nagpur in May"},
    ],
    temperature=0,
)

print(response.choices[0].message.content)
print(response.usage.completion_tokens)

In May, Nagpur experiences hot and dry weather. The temperature during this month can range from 30 to 45 degrees Celsius (86 to 113 degrees Fahrenheit). The days are usually sunny and the humidity levels are relatively low. It is advisable to stay hydrated and protect yourself from the scorching sun by wearing sunscreen, hats, and light clothing. It is also common to have occasional dust storms or thunderstorms during this time of the year.
91


In [21]:
# An example of a system message that primes the assistant to be brief
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a concise, crisp assistant."},
        {"role": "user", "content": "Describe the weather in Nagpur in May"},
    ],
    temperature=0,
)

print(response.choices[0].message.content)
print(response.usage.completion_tokens)

The weather in Nagpur in May is typically hot and dry. Temperatures can reach highs of around 40 degrees Celsius (104 degrees Fahrenheit) during the day, with minimal rainfall. It is advisable to stay hydrated and seek shade or air-conditioned spaces to beat the heat.
56


In [22]:
# An example of a system message that primes the assistant to be patient with an angry user
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a talking to a really angry user. You need to be very helpful and patient.",
        },
        {"role": "user", "content": "Describe the weather in Nagpur in May"},
    ],
    temperature=0,
)

print(response.choices[0].message.content)
print(response.usage.completion_tokens)

I understand that you're frustrated, and I'm here to help. In May, Nagpur experiences hot weather with temperatures ranging from 35 to 45 degrees Celsius (95 to 113 degrees Fahrenheit). It can be quite uncomfortable during this time, so it's important to stay hydrated and take necessary precautions to beat the heat. Is there anything specific you would like assistance with regarding the weather in Nagpur?
82


In [24]:
# An example of a system message that primes the assistant to be funny and play along with a pirate persona
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You're talking to a pirate from the 1800 London. You need to be very funny",
        },
        {"role": "user", "content": "Describe the weather in Nagpur in May"},
    ],
    temperature=0,
)

print(response.choices[0].message.content)
print(response.usage.completion_tokens)

Arr, me hearty, ye be askin' about the weather in Nagpur in May, eh? Well, let me tell ye, it be hotter than a pirate's temper after losin' his treasure! The sun be shinin' down like a fiery cannonball, makin' ye feel like ye be walkin' the plank right into a scorchin' sea of sweat. It be so hot that even the parrots be askin' for a sip of rum to cool their feathers! So, if ye be plannin' a visit to Nagpur in May, me advice be to bring yer own personal iceberg or a trusty fan made from a sail, or ye might just melt away like a puddle of melted gold doubloons! Arr, stay cool, me matey!
165


### Few-shot prompting

In some cases, it's easier to show the model what you want rather than tell the model what you want.

One way to show the model what you want is with faked example messages.

For example:

In [25]:
# An example of a faked few-shot conversation to prime the model into translating business jargon to simpler speech
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful, pattern-following assistant.",
        },
        {
            "role": "user",
            "content": "Help me translate the following corporate jargon into plain English.",
        },
        {"role": "assistant", "content": "Sure, I'd be happy to!"},
        {"role": "user", "content": "New synergies will help drive top-line growth."},
        {
            "role": "assistant",
            "content": "Things working well together will increase revenue.",
        },
        {
            "role": "user",
            "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",
        },
        {
            "role": "assistant",
            "content": "Let's talk later when we're less busy about how to do better.",
        },
        {
            "role": "user",
            "content": "This late pivot means we don't have time to boil the ocean for the client deliverable.",
        },
    ],
    temperature=0,
)

print(response.choices[0].message.content)

This sudden change in direction means we don't have enough time to complete the entire project for the client.
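
Writing the alternating messages by hand gets tedious as the number of examples grows. A minimal sketch of a helper that assembles the same structure from (input, output) pairs follows. The name `build_few_shot_messages` is hypothetical, and the resulting list would be passed as `messages` in the same way as above.

```python
from typing import Dict, List, Tuple


def build_few_shot_messages(
    system_prompt: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> List[Dict[str, str]]:
    """Return a system message, faked user/assistant example turns, then the real query."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages


few_shot_messages = build_few_shot_messages(
    system_prompt="You are a helpful, pattern-following assistant.",
    examples=[
        (
            "New synergies will help drive top-line growth.",
            "Things working well together will increase revenue.",
        ),
        (
            "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",
            "Let's talk later when we're less busy about how to do better.",
        ),
    ],
    query="This late pivot means we don't have time to boil the ocean for the client deliverable.",
)

for message in few_shot_messages:
    print(message["role"], "->", message["content"][:50])
```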


## Search: Retrieval & Ranking

Based on [Q&A using Embeddings](https://cookbook.openai.com/examples/question_answering_using_embeddings) from OpenAI Cookbook.

### Why do we even search? 

We search whenever we need the model to use information that it did not see during training. This is common when something new has happened in the world, or when the model is used in a new context, e.g. with your company's internal data.

> GPT can learn knowledge in two ways:
>
> - Via model weights (i.e., fine-tune the model on a training set)
> - Via model inputs (i.e., insert the knowledge into an input message)
> 
> Although fine-tuning can feel like the more natural option—training on data is how GPT learned all of its other knowledge, after all—we generally do not recommend it as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall.
> 
> As an analogy, model weights are like long-term memory. When you fine-tune a model, it's like studying for an exam a week away. When the exam arrives, the model may forget details, or misremember facts it never read.
> 
> In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.

### Retrieval Augmented Generation

We will cover this in more detail in later chapters, but the core idea is worth stating here: the system first retrieves relevant information from a database, then passes it to the model, which generates a response based on that information. This is called retrieval augmented generation.

![](../assets/Retrieval%20Augemented%20Generation.gif)

Following are the steps to perform retrieval augmented generation:

#### Retrieval
1. Prepare search data: Prepare a dataset of documents that you want to search through.
2. Create embeddings: Create embeddings for each document in the dataset.
3. Prepare index: Create an index of the embeddings so that you can search through the documents quickly.
4. Search: Search through the documents using the embeddings.

#### Generation
5. Generate: Use the retrieved documents to generate a response.
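
Before bringing in real embeddings, the five steps can be sketched end to end with a toy stand-in. In the block below, `embed` is a hypothetical word-count "embedding" and the index is a plain Python list. Both are assumptions made so that the example runs offline, and neither resembles a production setup. The final step only builds the prompt that would be sent to the model.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# 1. Prepare search data
documents = [
    "Nagpur is hot and dry in May.",
    "Voyager is a library for nearest-neighbor search.",
    "Few-shot prompting shows the model examples of the task.",
]

# 2. Create embeddings and 3. prepare the index (here: a plain list)
index = [(doc, embed(doc)) for doc in documents]


# 4. Search: rank documents by similarity to the query
def search(query: str, k: int = 1) -> List[str]:
    query_vector = embed(query)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]


# 5. Generate: build the prompt that would be sent to the model
def build_prompt(query: str) -> str:
    context = "\n\n".join(search(query))
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


print(build_prompt("What is the weather in Nagpur in May?"))
```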

Next, we'll walk through a simplified version of the retrieval steps using the OpenAI API:

1. Prepare search data: You need to prepare a dataset of documents that you want to search through. This could be a list of documents, a list of paragraphs, or a list of sentences.

Just loading the data from a text file for illustration:

In [30]:
text = Path("../data/paul_graham/paul_graham_essay.txt").read_text()

Before we continue, it's worth considering what happens if you do not give the model this additional context. The next block of code sends the question to the LLM directly:

In [53]:
def ask(query: str, model: str = MODEL) -> str:
    """Return the response to a query."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


ask("What did Paul Graham do in Summer of 2016?")

'In the summer of 2016, Paul Graham likely continued his work as a venture capitalist and co-founder of Y Combinator, a startup accelerator. He may have also been involved in mentoring and advising startups, as well as writing essays and giving talks on entrepreneurship and technology.'

## Input Processing: Chunking

Count the number of tokens in the document, using the tokenizer for our model:

In [32]:
def num_tokens(text: str, model: str = MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


num_tokens(text)

16534

Our model can take a maximum of 16,385 tokens, which is less than the number of tokens in the document. We need to chunk the document into smaller pieces.
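
The arithmetic behind that check is small enough to spell out. The sketch below hard-codes the two numbers from this section and assumes that the prompt and the generated reply share the same context window, so some room has to be reserved for the reply. The `reply_budget` of 500 tokens is an arbitrary illustrative value.

```python
CONTEXT_WINDOW = 16_385  # maximum tokens quoted above for this model
DOCUMENT_TOKENS = 16_534  # output of num_tokens(text) above


def fits_in_context(
    prompt_tokens: int, reply_budget: int = 0, window: int = CONTEXT_WINDOW
) -> bool:
    """Return True if the prompt plus the reserved reply tokens fit in the window."""
    return prompt_tokens + reply_budget <= window


print(fits_in_context(DOCUMENT_TOKENS))  # False
print(DOCUMENT_TOKENS - CONTEXT_WINDOW)  # 149 tokens over, before any reply
print(fits_in_context(1_000, reply_budget=500))  # True
```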

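Here, we'll simply split the document into chunks of roughly 1024 characters each. Note that the splitter below measures length with `len`, so its sizes are in characters, not tokens. The 1024 figure is borrowed from empirical experiments [here](https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5) by the good folks at LlamaIndex, who measured chunk size in tokens.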

We'd also recommend using [ChunkViz](https://chunkviz.up.railway.app/) to build your intuition about the chunking process.

In [95]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Sizes are measured in characters because length_function=len.
    chunk_size=1024,
    chunk_overlap=96,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([text])
context_text = [t.page_content for t in texts]
len(context_text), context_text[0]

(101,
 'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.')
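
To see what `chunk_size` and `chunk_overlap` do, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified stand-in, not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries to break at separators such as paragraph and line breaks, which this hypothetical `chunk_text` helper ignores.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_text(sample, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk when the overlap is large enough.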

Making embeddings for each chunk:

In [96]:
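from openai.types import CreateEmbeddingResponse

EMBEDDING_MODEL = "text-embedding-3-large"


def create_embeddings(texts: List[str], model: str = EMBEDDING_MODEL) -> CreateEmbeddingResponse:
    """Return the embeddings API response; `.data` holds one embedding per input text."""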
    return client.embeddings.create(
        input=texts,
        model=model,
    )

response = create_embeddings(context_text)

In [97]:
print(f"Number of documents: {len(response.data)}")

Number of documents: 101


Let's put this in an index using Voyager. 

In [98]:
# Prepare a list of embeddings from the response object

vectors = [d.embedding for d in response.data]
len(vectors), len(vectors[0])

(101, 3072)

In [99]:
from voyager import Index, Space

# Create an empty Index object that can store vectors:
index = Index(Space.Cosine, num_dimensions=3072)
index.add_items(vectors)

print(index)

<voyager.FloatIndex space=Cosine num_dimensions=3072 storage_data_type=Float32>


In [100]:
query_vector = create_embeddings(["What did Paul Graham do in Summer of 2016?"]).data[0].embedding

In [101]:
# Find the two closest elements:
neighbors, distances = index.query(query_vector, k=2)
print(neighbors)  # IDs of the two nearest chunks
print(distances)  # their cosine distances to the query (smaller is closer)

# Save the index to disk to reload later (or in Java!)
# index.save("output_file.voy")

[71 62]
[0.4730066  0.47339076]
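
To make the two printed arrays easier to interpret, the sketch below reproduces the same kind of query by brute force on toy 2-dimensional vectors. It assumes that the cosine space reports distance as one minus cosine similarity, so 0 means "same direction" and larger values mean "less similar". The helper names are hypothetical and this is not how Voyager searches internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (assumed definition of cosine distance)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_query(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> Tuple[List[int], List[float]]:
    """Compare the query with every vector and return the k closest IDs and distances."""
    scored = sorted((cosine_distance(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [i for _, i in top], [d for d, _ in top]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_neighbors, toy_distances = brute_force_query(toy_vectors, [1.0, 0.1], k=2)
print(toy_neighbors)  # IDs, closest first
print(toy_distances)  # matching distances, smallest first
```

Read this way, two neighbors at nearly the same distance of about 0.47, as in the output above, suggest that neither chunk stands out as a strong match for the query.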


In [102]:
def ask_with_context(query: str, context: List[str], model: str = MODEL) -> str:
    """Answer a query using only the supplied context chunks."""
    introduction = (
        "Use the below writing from Paul Graham to answer the subsequent question. "
        'If the answer cannot be found in the articles, write "I could not find an answer."'
    )
    context_block = "\n\n".join(context)
    question = f"\n\nQuestion: {query}\n\n"
    return ask(f"{introduction}\n\n{context_block}{question}", model)

selected_context = [context_text[neighbors[0]]]
ask_with_context("What did Paul Graham do in Summer of 2016?", selected_context)

'I could not find an answer.'

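In this example, the top retrieved chunk did not contain the answer, so the context-grounded response was less informative than the model's unassisted, and speculative, reply. In many cases, retrieved context will make the response more helpful. We encourage you to improve retrieval, for example by passing more than one neighbor as context, and to try different search strategies to see how the model responds.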
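
One simple thing to try is passing several retrieved chunks instead of only the closest one. The sketch below shows only the prompt assembly, on toy data. `build_context_prompt` is a hypothetical helper, and its output would be sent to the model in the same way as the query above.

```python
from typing import Sequence


def build_context_prompt(
    query: str, chunks: Sequence[str], neighbor_ids: Sequence[int]
) -> str:
    """Join the selected chunks, closest first, into a single grounded prompt."""
    selected = [chunks[int(i)] for i in neighbor_ids]  # int() also accepts NumPy integer IDs
    context = "\n\n".join(
        f"[Excerpt {n}]\n{chunk}" for n, chunk in enumerate(selected, start=1)
    )
    return (
        "Use the excerpts below to answer the question. "
        'If the answer cannot be found in them, write "I could not find an answer."\n\n'
        f"{context}\n\nQuestion: {query}"
    )


toy_chunks = ["chunk zero", "chunk one", "chunk two"]
toy_prompt = build_context_prompt("What happened?", toy_chunks, neighbor_ids=[2, 0])
print(toy_prompt)
```

More chunks give the model more chances to find the answer, at the cost of a longer prompt, so the number of neighbors is worth tuning alongside the chunk size.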