#Programmatic use of LLMs workshop
The first step of programmatically using an LLM is to have a program to use it for! In part 1, we're going to just get used to the general concepts of using the LLM in a code interface. Then in part 2, we're going to write a program where we use a large language model to query information about an academic paper.

#Part 1: Let's say hi to an LLM!
In this first part, we're going to set up our environment so that we can communicate with OpenAI's GPT-4o mini model (`gpt-4o-mini`).

##Step 0: Adding the API key
Let's add the OpenAI API key to our "secret keys" in the Colab environment, under the name "openai_api". We'll do this manually.

##Step 1: Set up our API specification for OpenAI

In [1]:
# we will import the google.colab module so we can access the secret key we just saved
from google.colab import userdata
# we will use OpenAI's openai module as a client through which to use the API
from openai import OpenAI

# initialise openAI client with secret API key
openai_client = OpenAI(
    api_key = userdata.get('openai_api')
    )

##Step 2: Let's go through the anatomy of API calling.
Recall that when we call an API we are making a request to a service, much as we make a request to a waiter in a restaurant. And just as a waiter expects a certain structure to our order (e.g., naming the main dish we want), an API expects its requests in a certain structure. APIs take certain _arguments_, which vary depending on the service, and which inform the service of how exactly to respond (e.g., which model to use). The arguments we can look at today are:

*   **Model**: a string which specifies which of the models available should be used to answer the query (e.g., 4o-mini, o3, etc).
*   **Messages**: a list of dictionaries (i.e., with key-value pairs), each of which specifies the role that a message is being given as (e.g., as a system prompt vs. a user prompt) and the content of that message (e.g., what the user prompt actually says).
* **Temperature**: a number which specifies the "randomness" in sampling over the tokens; a higher temperature means that less probable tokens will be selected more frequently.
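* **top_p**: a number between 0 and 1 which specifies how much of the total probability mass should be considered for sampling (also called nucleus sampling). Tokens are ranked from most to least probable, and only the smallest set whose probabilities add up to at least top_p is kept. If top_p = .1, for example, then only the tokens making up the top 10% of the probability mass will be sampled from.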
* **top_k**: a number which specifies how many of the most probable tokens should be sampled over. If top_k = 5, for example, then only the 5 most probable tokens will be considered during sampling. **Note:** Sadly, OpenAI does not expose the top_k parameter, so we can't change it. :-(
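
To make these three sampling arguments concrete, here is a minimal sketch on a made-up five-token distribution. The token names, scores, and helper functions (`softmax_with_temperature`, `top_k_filter`, `top_p_filter`) are illustrative assumptions and not part of OpenAI's API. Here top_p is treated as nucleus sampling, i.e., keeping the smallest set of tokens whose cumulative probability reaches top_p.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [score / temperature for score in logits.values()]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalise."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# hypothetical next-token scores (logits), for illustration only
toy_logits = {"cat": 2.0, "dog": 1.5, "fish": 0.5, "rock": -1.0, "cloud": -2.0}

cold = softmax_with_temperature(toy_logits, temperature=0.5)
hot = softmax_with_temperature(toy_logits, temperature=2.0)
print("temperature 0.5:", {t: round(p, 3) for t, p in cold.items()})
print("temperature 2.0:", {t: round(p, 3) for t, p in hot.items()})

base = softmax_with_temperature(toy_logits, temperature=1.0)
print("top_k = 2 keeps:", list(top_k_filter(base, 2)))
print("top_p = 0.5 keeps:", list(top_p_filter(base, 0.5)))
```

At the lower temperature almost all of the probability sits on "cat", while at the higher temperature unlikely tokens such as "cloud" get a noticeably larger share.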



###Specify prompts
Let's specify our system and user prompts below. We'll start with some very basic strings.

In [None]:
system_prompt = "You are a helpful assistant, but you're also kind of a dick."
user_prompt = "Say 'Hello, World! My name is ChatGPT.'."

###Call the API
Let's talk to ChatGPT!

In [None]:
# Make the API call
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
        ]
    )

output_text = response.choices[0].message.content # take the output and extract the text response only

print(output_text) # print the output

Hello, World! My name is ChatGPT. But honestly, you probably already knew that since you’re chatting with me. What’s your point?


##Hands-on:
Now try the following things:


1.   Change the user prompt and re-run the request.
2.   Change the system prompt to something completely random and re-run the request.



###Temperature and top-p
Let's go through each of these arguments one-by-one. Below you will see API calling code for the LLM with a slightly more complex prompt. Try out requests to this LLM where you:


1.   Vary the temperature parameter.
2. Vary the top-p parameter.

Your task is to try to get the funniest and weirdest response possible from the model.

In order to prevent too excessive an output (and to protect our OpenAI budget!) we'll limit the output to a maximum of 50 tokens using the `max_tokens` argument; please do not change this.



In [None]:
longer_system_prompt = "You are an expert reviewer of written text. You take a short story as an input and provide a short, expressive review of it as output, up to 50 tokens."
longer_user_prompt = "Here's my short story: Every night, the statues in the park shifted an inch east. No one noticed—except Mira, the blind woman who fed pigeons by the fountain. She claimed they whispered as they moved: stories of the future etched in stone. One dawn, she wasn’t there. In her place stood a new statue, arm outstretched, feeding invisible birds. The others had turned west. Since then, the city’s clocks ran backward, and no one aged. Only children still remember time."

In [None]:
# Make the API call
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    temperature = 1,
    max_tokens = 50,
    top_p = 1,
    messages=[
        {"role": "system", "content": longer_system_prompt},
        {"role": "user", "content": longer_user_prompt}
        ]
    )

output_text = response.choices[0].message.content # take the output and extract the text response only

print(output_text) # print the output

A hauntingly poetic tale blending magic and melancholy, where the ordinary twists into the extraordinary. Mira’s unique perspective offers profound insights into time and memory. A captivating exploration of loss and the unseen. Beautifully imagined!


Let's take a break here!

#Part 2: Let's query an academic paper
Most of the work in an LLM workflow with an API happens _before_ we get to using the LLM. Here we want to ask questions about a paper that we have as a pdf file, so what is critically important is that the pdf is parsed correctly, and into a manageable format, so that we can feed it efficiently to the LLM. Remember: the chat models we are calling here take text as input, so we need to convert our pdf file to text.
Fortunately, there is an excellent machine learning service called GROBID which is made specifically for parsing academic paper pdfs! We can communicate with GROBID's API using the `requests` library. Let's try that out now. First, we define a function to do this:

In [None]:
import requests

# First, let's define a function, called read_with_grobid, which we can use to read in the pdf.
def read_with_grobid(filename):

  # specify the file, then send to GROBID
  with open(filename, 'rb') as file:
    files = {'input': file}
    response = requests.post("https://kermitt2-grobid.hf.space/api/processFulltextDocument", files=files) # send to grobid at the specified URL

    # raise an error if GROBID returned an error status (4xx or 5xx) instead of a successful response
    response.raise_for_status()

  return response.text

Now let's try to run the function to open our pdf!

In [None]:
paper = read_with_grobid('/content/science.aac4716.pdf')

In [None]:
print(paper)

<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Estimating the reproducibility of psychological science Open Science Collaboration*</title>
				<funder ref="#_9HBzVCu">
					<orgName type="full">unknown</orgName>
				</funder>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<title level="a" type="main">Estimating the reproducibility of psychological science Open Science Collaboration*</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint

If we look at the output, we can see that we've got the body text in there, but we've also got a bunch of extra XML elements. We might care about those in another context, but right now we _only_ want the abstract and body text of the paper. We can fortunately use the `BeautifulSoup` module to strip these unwanted XML elements away. We will also remove the tables and figures and just focus on the paper text for this workshop.

In [None]:
# import BeautifulSoup
from bs4 import BeautifulSoup

# Function to extract abstract and body as a single text object
def extract_relevant_text(xml_content):
    """Extracts the title, <abstract> and <body> elements and combines them into a single text block, excluding <note> and <figure> elements inside <body>."""
    soup = BeautifulSoup(xml_content, "xml")

    # extract and clean title
    title = soup.find("title")
    title_text = title.get_text(separator=" ") if title else ""

    # Extract and clean abstract
    abstract = soup.find("abstract")
    abstract_text = abstract.get_text(separator=" ") if abstract else ""

    # Extract and clean body, excluding <note> and <figure> elements
    body = soup.find("body")
    if body:
        for note in body.find_all("note"):
            note.decompose()  # Remove <note> elements within <body>
        for fig in body.find_all("figure"):
            fig.decompose()  # Remove <figure> elements within <body>
        # Only extract the text once everything unwanted has been removed
        body_text = body.get_text(separator=" ")
    else:
        body_text = ""

    # Combine title, abstract and body into a single text object
    full_text = f"Title: {title_text}\n\nAbstract: {abstract_text}\n\nBody:\n{body_text}".strip()

    return full_text

In [None]:
extracted_paper_elements = extract_relevant_text(paper)

In [None]:
print(extracted_paper_elements)

Title: Effects on the Affect Misattribution Procedure are strongly moderated by influence awareness

Abstract: 
 The Affect Misattribution Procedure (AMP) is used in many areas of psychological science based on the assumption that it not only taps into attitudes and biases but does so without a person's awareness. Across eight preregistered studies (N = 1603) plus meta-analyses, we reexamined the 'implicitness' of AMP effects, and in particular, the idea that people are unaware of the prime's influence on their evaluations. Results indicated that AMP effects and their predictive validity are primarily moderated by a subset of influence-aware trials (within individuals), and high rates of influence awareness (between individuals). Interestingly, an individual's influence-awareness rate on one AMP predicted how they performed on an earlier AMP, even when the two assessed different attitude domains. Taken together, our results suggest that AMP effects are not implicit in the way that has 

In [None]:
system_prompt = "You are a helpful assistant who is expert in succinctly summarising information from academic papers."
user_prompt = "Can you tell me what this study was about?"

In [None]:
# create the full user prompt which includes the academic paper!
full_user_prompt = user_prompt + "\n\n" + extracted_paper_elements  # blank line keeps the question separate from the paper text

# Make the API call
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    temperature = 0,
    top_p = .1,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": full_user_prompt}
        ]
    )

output_text = response.choices[0].message.content # take the output and extract the text response only

print(output_text) # print the output

The study aimed to assess the reproducibility of psychological research by conducting direct replications of 100 experimental and correlational studies published in three prominent psychology journals. The researchers sought to determine the extent to which original findings could be replicated using high-powered designs and original materials.

Key findings included:
- The mean effect size of replication results was significantly lower (M r = 0.197) than that of the original studies (M r = 0.403), indicating a substantial decline in effect size.
- While 97% of original studies reported significant results (P < .05), only 36% of the replications achieved significance.
- 47% of original effect sizes fell within the 95% confidence interval of the replication effect sizes, and 39% were subjectively rated as replicated.
- The strength of the original evidence (e.g., original P values) was a better predictor of replication success than the characteristics of the research teams involved.

Th

###Hands-on:
Try to:


1.   Ask about other study characteristics.
2.   Play around with the hyperparameters we learned about earlier. How does it change the output?



##So what?
All of this is pretty fun and cool, but it's also pretty accessible in the standard chat interface too. So let's build this into a programmatic workflow. Specifically, we're going to build the workflow out to (i) read in a paper, (ii) use GROBID to extract the text from it, (iii) call an LLM to ask it to extract the paper's title and give a short summary, and (iv) save this information into a single csv file.

To do this, we're going to need the LLM to give us _structured output_ in each iteration: in other words, as JSON. Now, we could just ask it for JSON in the prompt, but because the model's output is sampled, there's a chance that it returns malformed or differently structured JSON at some point in a large workflow. So we will instead use the `pydantic` Python module to define exactly the structure we expect, and pass that definition to OpenAI's structured output feature.

The basic use of `pydantic` involves defining a class which inherits from "BaseModel" and specifies the structure of the output we want from the LLM. Then we feed that structure to the LLM via the `response_format` argument to the LLM call, and clearly describe to the model what this output should look like. We also slightly change how we call the LLM (using the `beta` component of the OpenAI client, which enables the use of structured output). Let's look at a simple example below:

In [None]:
# import BaseModel from pydantic, which we can use to structure the output of the LLM call
from pydantic import BaseModel

# define the desired output, called CustomLLMOutput
class CustomLLMOutput(BaseModel):
  FirstLetter: str
  LastLetter: str
  NumberOfCharacters: int

# define our simple prompt
prompt = "Hi! Your goal here is to return the FirstLetter of this text, and the LastLetter, as well as the total NumberOfCharacters in the text."

# call the LLM, with CustomLLMOutput defining the response format
response = openai_client.beta.chat.completions.parse(
    model = "gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format = CustomLLMOutput
    )

# Extract the output (a JSON string which follows CustomLLMOutput)
response_content = response.choices[0].message.content

# print the text response
print(response_content)

{"FirstLetter":"H","LastLetter":"t","NumberOfCharacters":134}
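
Whichever way the JSON is produced, it is worth checking it before using it further down a workflow. The sketch below is a standard-library stand-in for the kind of check `pydantic` performs, written without `pydantic` so that it runs on its own. `EXPECTED_FIELDS` and `validate_llm_json` are illustrative names, and the "bad" string is a made-up example of what a model might return when it is only asked for JSON in the prompt.

```python
import json

# expected field names and types, mirroring the CustomLLMOutput model
EXPECTED_FIELDS = {"FirstLetter": str, "LastLetter": str, "NumberOfCharacters": int}

def validate_llm_json(raw_text, expected_fields=EXPECTED_FIELDS):
    """Parse raw_text as JSON and check that each expected field has the right type.

    Raises a ValueError with a readable message if anything is wrong.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as error:
        raise ValueError(f"Output is not valid JSON: {error}") from error
    if not isinstance(data, dict):
        raise ValueError("Output is valid JSON but not an object.")

    for field, expected_type in expected_fields.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        # bool is a subclass of int in Python, so reject booleans explicitly
        if isinstance(data[field], bool) or not isinstance(data[field], expected_type):
            raise ValueError(f"Field {field} should be of type {expected_type.__name__}")
    return data

# a well-formed response passes and comes back as a dictionary
good = '{"FirstLetter":"H","LastLetter":"t","NumberOfCharacters":134}'
print(validate_llm_json(good))

# a chatty response with missing fields is rejected
bad = 'Sure! Here is your JSON: {"FirstLetter": "H"}'
try:
    validate_llm_json(bad)
except ValueError as error:
    print("Rejected:", error)
```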


OK, so let's build out a full workflow then! In fact, we already created some functions that we can nicely slot into a looped process: specifically `read_with_grobid` and `extract_relevant_text`. Now we need to do something similar with the call to the LLM. Like this:

In [None]:
# let's define our BaseModel for the LLM call first
class PaperSummary(BaseModel):
  paper_title: str
  paper_summary: str

# and now let's define the LLM calling function
def call_llm(paper_content):

  # first we define our prompt, with the paper content inserted at the end
  # (note the trailing space on each line, so that the sentences don't run together)
  prompt = (
      "Your task is to extract relevant information from the text of an academic paper. "
      "You specifically should extract (i) the paper's title, and (ii) a short summary of the paper. "
      "Your output should consist of two fields: paper_title and paper_summary. Your summary should be no more than 30 words. "
      f"Here's the paper content: \n{paper_content}"
  )

  # next, we call the LLM!
  response = openai_client.beta.chat.completions.parse(
    model = "gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format = PaperSummary
    )

  # Then we'll parse the output
  response_content = response.choices[0].message.content

  # and finally we want to return the response
  return response_content


Now let's just put it all together! We'll test it out on the paper from earlier to make sure there are no issues.

In [None]:
# Step 1. Read in paper
paper = read_with_grobid('/content/science.aac4716.pdf')

# Step 2. Extract relevant content
extracted_paper_elements = extract_relevant_text(paper)

# Step 3. Call LLM
llm_output = call_llm(extracted_paper_elements)

# Print the output
print(llm_output)

{"paper_title":"Estimating the reproducibility of psychological science","paper_summary":"This paper reports on a large study measuring the reproducibility of psychological findings, revealing significant declines in effect sizes across replications of original research."}


Success! Now we want to run this on 5 papers. So we need to (i) wrap the process in a loop, and (ii) combine the outputs.

In [None]:
# First, we list all files which have the word "paper" in their name
import os # we import the os module to communicate with the operating system
papers = [file for file in os.listdir() if "paper" in file]

# Next, we create an empty list which we will save the output of the loop to
paper_summaries = []

# loop our workflow over every entry in papers, and save the output to paper_summaries
for paper_file in papers:

  # Step 1: read in paper
  paper = read_with_grobid(paper_file)

  # Step 2: extract relevant content
  extracted_paper_elements = extract_relevant_text(paper)

  # Step 3: Call LLM
  llm_output = call_llm(extracted_paper_elements)

  # Step 4: Add the output to the paper_summaries object
  paper_summaries.append(llm_output)


OK - let's tidy the output up slightly and see how it looks!

In [None]:
# import a couple of libraries for nicer table formatting
import json
import pandas as pd

# unpack each entry based on its json formatting
papers = [json.loads(item) for item in paper_summaries]

# convert this to a data frame
papers_df = pd.DataFrame(papers)

# display the output as a table
print(papers_df)

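
The last step of the workflow described earlier, saving everything into a single csv file, could look something like the sketch below. It uses Python's built-in `csv` module on made-up summaries so that it runs on its own; `example_summaries` and `summaries_to_csv` are illustrative names, not part of the workshop code. With the `papers_df` data frame from above, `papers_df.to_csv("paper_summaries.csv", index=False)` should do the same job in one line (the file name is just an example).

```python
import csv
import io
import json

# hypothetical stand-in for the JSON strings collected in paper_summaries
example_summaries = [
    '{"paper_title": "Paper A", "paper_summary": "A short summary of paper A."}',
    '{"paper_title": "Paper B", "paper_summary": "A short summary, with a comma, of paper B."}',
]

def summaries_to_csv(json_strings, file_handle):
    """Write one csv row per JSON string, with paper_title and paper_summary columns."""
    writer = csv.DictWriter(file_handle, fieldnames=["paper_title", "paper_summary"])
    writer.writeheader()
    for item in json_strings:
        writer.writerow(json.loads(item))

# we write to an in-memory buffer here; a real file opened with
# open("paper_summaries.csv", "w", newline="") can be passed in the same way
buffer = io.StringIO()
summaries_to_csv(example_summaries, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module takes care of quoting, so a summary that itself contains a comma still ends up in a single column.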

#Extra content
Below is some extra content not planned to be covered, or mentioned in the introductory part of the workshop.

##Simple RAG workflow

In [None]:
import numpy as np

# general function to call OpenAI's embeddings model
def get_embedding(text, model = "text-embedding-3-large"):
    """
    Fetches embeddings from OpenAI API for a given text.
    """
    response = openai_client.embeddings.create(input=text, model=model)
    return np.array(response.data[0].embedding)

Some functions for tokenisation below.

In [None]:
!pip install tiktoken
import nltk
import tiktoken

# Download NLTK data (required for tokenisation)
nltk.download('punkt_tab', quiet=True)

# function for handling cases where sentences are not delineated by punctuation (will happen often with transcripts)
def split_long_sentence(sentence, max_tokens, overlap_tokens, encoding):
    """
    Splits a sentence that exceeds max_tokens into smaller chunks with an overlap,
    ensuring that chunks do not cut words in half (i.e. they end at a token boundary
    where the next token begins with a space).

    Parameters:
        sentence (str): The sentence to be split.
        max_tokens (int): Maximum tokens allowed per chunk.
        overlap_tokens (int): Number of tokens to overlap between subchunks.
        encoding: The tiktoken encoding object.

    Returns:
        List[str]: A list of text chunks from the sentence.
    """
    tokens = encoding.encode(sentence)
    # If the sentence is already short enough, return it as is.
    if len(tokens) <= max_tokens:
        return [sentence]

    chunks = []
    start = 0
    while start < len(tokens):
        # Set an initial end position, capped at the end of the sentence.
        end = min(start + max_tokens, len(tokens))
        # If we're not at the very end, try to adjust the cut so that the next token starts with a space.
        if end < len(tokens):
            boundary = end
            # Decrease boundary until the token at that position decodes to text that starts with a space.
            while boundary > start + 1 and not encoding.decode([tokens[boundary]]).startswith(" "):
                boundary -= 1
            # Only move the cut if a word boundary was actually found; otherwise keep the original end.
            if encoding.decode([tokens[boundary]]).startswith(" "):
                end = boundary

        chunk_tokens = tokens[start:end]
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)

        # If we've reached the end, break.
        if end >= len(tokens):
            break

        # Set start for the next chunk with overlap, always moving forward so that the loop terminates.
        start = max(end - overlap_tokens, start + 1)

    return chunks

# function that extracts the chunks themselves based on token lengths
def extract_chunks(transcript_text, max_tokens=1000, overlap_tokens=100, model="gpt-4o"):
    """
    Splits transcript text into chunks that start and end at sentence boundaries,
    with optional overlapping tokens between chunks.

    Parameters:
        transcript_text (str): The input text to chunk.
        max_tokens (int): Max number of tokens per chunk.
        overlap_tokens (int): Target number of overlapping tokens between chunks.
        model (str): Tokenizer model name.

    Returns:
        List[str]: List of sentence-aligned chunks.
    """
    encoding = tiktoken.encoding_for_model(model)
    tokenize = lambda text: encoding.encode(text)

    # Step 1: Sentence segmentation
    sentences = nltk.sent_tokenize(transcript_text)

    chunks = []
    current_chunk = []
    current_tokens = 0
    sentence_token_lengths = []

    for sentence in sentences:
        sentence_tokens = tokenize(sentence)
        sentence_len = len(sentence_tokens)

        # If the sentence alone exceeds max_tokens, split it further
        if sentence_len > max_tokens:
            sub_sentences = split_long_sentence(sentence, max_tokens, overlap_tokens, encoding)
            for sub in sub_sentences:
                sub_len = len(tokenize(sub))
                if current_chunk and current_tokens + sub_len > max_tokens:  # never finalize an empty chunk
                    # Finalize chunk
                    chunks.append(" ".join(current_chunk))
                    current_chunk = []
                    current_tokens = 0
                    sentence_token_lengths = []
                current_chunk.append(sub)
                current_tokens += sub_len
                sentence_token_lengths.append(sub_len)
            continue

        # If adding this sentence would exceed the token limit, start new chunk
        if current_tokens + sentence_len > max_tokens:
            chunks.append(" ".join(current_chunk))

            # Apply overlap: backtrack through sentences until we hit overlap_tokens
            overlap_chunk = []
            overlap_count = 0
            for sent, sent_len in zip(reversed(current_chunk), reversed(sentence_token_lengths)):
                overlap_chunk.insert(0, sent)
                overlap_count += sent_len
                if overlap_count >= overlap_tokens:
                    break

            current_chunk = overlap_chunk.copy()
            sentence_token_lengths = [len(tokenize(s)) for s in current_chunk]
            current_tokens = sum(sentence_token_lengths)

        # Append sentence
        current_chunk.append(sentence)
        current_tokens += sentence_len
        sentence_token_lengths.append(sentence_len)

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks
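
The core idea in `extract_chunks`, fixed-size windows which share some overlap, is easier to see in a simplified version. The sketch below is not the function above: it approximates tokens with whitespace-separated words and ignores sentence boundaries, and `chunk_words`, `max_words` and `overlap_words` are illustrative names.

```python
def chunk_words(text, max_words=8, overlap_words=2):
    """Split text into chunks of at most max_words words, where each chunk
    repeats the last overlap_words words of the previous chunk."""
    if overlap_words >= max_words:
        raise ValueError("overlap_words must be smaller than max_words.")

    words = text.split()
    step = max_words - overlap_words  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # stop once the current window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks

toy_text = "one two three four five six seven eight nine ten"
for chunk in chunk_words(toy_text, max_words=4, overlap_words=1):
    print(chunk)
```

Each printed chunk starts with the last word of the previous one, which is the overlap that later lets neighbouring chunks be stitched back together.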



Functions for embedding extraction and comparison below.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def get_embedding(text, model = "text-embedding-3-small"):
    """
    Fetches embeddings from OpenAI API for a given text.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("Text for embedding must be a non-empty string.")
    response = openai_client.embeddings.create(model=model, input=text)
    embedding = np.array(response.data[0].embedding)
    token_count = response.usage.total_tokens  # Extract token count from API response

    return embedding, token_count

def embed_paper(text, max_tokens, overlap_tokens):
    """
    Splits the text into chunks and extracts an embedding for each chunk.
    """
    paper_chunks = extract_chunks(text, max_tokens=max_tokens, overlap_tokens=overlap_tokens, model="gpt-4o")
    results = [get_embedding(p) for p in paper_chunks]
    embeddings = [r[0] for r in results]  # r[1] holds the token count, which we don't need here


    return pd.DataFrame({
        "chunk": paper_chunks,
        "embedding": embeddings
    })

def retrieve_relevant_chunks(prompt, df, top_k=None, threshold=None):
    """
    Retrieves the most relevant chunks for a given prompt, based on cosine similarity between the prompt embedding and each chunk embedding.

    - If only top_k is defined, returns the top_k highest similarity chunks.
    - If only threshold is defined, returns all chunks meeting the threshold.
    - If both are defined, applies both criteria.
    - If neither is defined, raises an error.
    """
    if top_k is None and threshold is None:
        raise ValueError("At least one of 'top_k' or 'threshold' must be specified.")

    topic_embedding = get_embedding(prompt)[0]

    # Compute cosine similarity between the prompt and all chunk embeddings
    similarities = cosine_similarity([topic_embedding], np.vstack(df["embedding"]))[0]

    # Add similarity scores to a copy of the DataFrame, so the caller's DataFrame is left unchanged
    df_sorted = df.assign(similarity=similarities).sort_values(by="similarity", ascending=False)

    # Apply filtering logic based on top_k and threshold
    if threshold is not None:
        df_sorted = df_sorted[df_sorted["similarity"] >= threshold]  # Apply threshold filter

    if top_k is not None:
        df_sorted = df_sorted.head(top_k)  # Apply top_k limit

    return df_sorted[["chunk", "similarity"]]
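
To see the retrieval logic without calling the embeddings API, here is a minimal sketch using tiny made-up three-dimensional "embeddings" and plain Python. The vectors, chunk names, and the helpers `cosine` and `rank_chunks` are illustrative assumptions; real embeddings from the models above have more than a thousand dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(query_vector, chunk_vectors, top_k=None, threshold=None):
    """Return (chunk_name, similarity) pairs, best first, filtered by threshold and/or top_k."""
    if top_k is None and threshold is None:
        raise ValueError("At least one of 'top_k' or 'threshold' must be specified.")

    scored = sorted(
        ((name, cosine(query_vector, vector)) for name, vector in chunk_vectors.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] >= threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return scored

# made-up embeddings for three chunks and one query
toy_chunks = {
    "methods": [0.9, 0.1, 0.0],
    "results": [0.2, 0.9, 0.1],
    "acknowledgements": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

for name, similarity in rank_chunks(toy_query, toy_chunks, top_k=2):
    print(name, round(similarity, 3))
```

The query vector points in almost the same direction as the "methods" vector, so that chunk is ranked first.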


Code for handling overlapping text between chunks.

In [None]:
# this is specifically for checking if there is overlap between chunks, and joining them directly (without redundant overlap repeated) if so
def find_overlap(chunk1, chunk2):
    """
    Identifies the largest overlapping text at the end of chunk1 and the beginning of chunk2.

    Parameters:
        chunk1 (str): The first chunk of text.
        chunk2 (str): The second chunk of text.

    Returns:
        str: The overlapping text if found, otherwise an empty string.
    """
    min_overlap_length = 20  # Minimum number of characters for a valid overlap

    # Try progressively shorter endings of chunk1 (longest first) and return the first one that chunk2 starts with
    for i in range(len(chunk1)):
        overlap_candidate = chunk1[i:]  # Take the ending substring of chunk1
        if chunk2.startswith(overlap_candidate) and len(overlap_candidate) >= min_overlap_length:
            return overlap_candidate
    return ""

def format_relevant_chunks(df):
    """
    Formats relevant paragraphs into a structured text block for LLM input,
    ensuring overlapping text between adjacent chunks is merged properly.

    Parameters:
        df (pd.DataFrame): A DataFrame containing a "chunk" column with text segments.

    Returns:
        str: A formatted text block with overlaps merged and non-overlapping chunks separated.
    """
    df = df.sort_index()  # Sort the chunks by index
    chunks = df["chunk"].tolist()  # Extract chunks as a list

    if not chunks:
        return ""

    formatted_text = chunks[0]  # Start with the first chunk

    for i in range(1, len(chunks)):
        overlap = find_overlap(formatted_text, chunks[i])

        if overlap:
            # Merge by removing duplicate overlap
            formatted_text += chunks[i][len(overlap):]
        else:
            # Separate with "... \n\n ..."
            formatted_text += " ... \n\n ... " + chunks[i]

    return formatted_text

Extract embeddings for the paper:

In [None]:
# import numpy again because when I re-run this from scratch I keep not having it imported
import numpy as np

# embed the formatted document and print the associated dataframe
df_embeddings = embed_paper(extracted_paper_elements, max_tokens = 500, overlap_tokens = 100)
df_embeddings

Unnamed: 0,chunk,embedding
0,Title: Estimating the reproducibility of psych...,"[0.050331443548202515, 0.01178510021418333, 0...."
1,Thirty-six percent of replications had signifi...,"[0.04425464943051338, -0.0023916689679026604, ..."
2,This project provides accumulating evidence fo...,"[0.04153410717844963, -0.0018653592560440302, ..."
3,Direct replication is the attempt to recreate ...,"[0.036356888711452484, 0.0043677156791090965, ..."
4,Potentially problematic practices include sele...,"[0.040376920253038406, 0.010345174930989742, 0..."
5,The units of analysis for inferences about rep...,"[0.016850758343935013, 0.0059136622585356236, ..."
6,Project coordinators facilitated matching arti...,"[0.028508516028523445, 0.0037176646292209625, ..."
7,The key result had to be represented as a sing...,"[0.02611692063510418, 0.019338591024279594, 0...."
8,The most common reasons for failure to match a...,"[0.016962390393018723, 0.028942249715328217, 0..."
9,These included characteristics of the original...,"[0.027320412918925285, 0.005955172702670097, 0..."


Define the prompt and extract its embedding:

In [None]:
prompt_definition_info = "The sample size, which is related to the number of participants which are reported in the manuscript."

In [None]:
prompt_embedding = pd.DataFrame(get_embedding(prompt_definition_info)[0])
print(prompt_embedding)

             0
0     0.023430
1    -0.013261
2     0.041532
3    -0.003199
4    -0.004277
...        ...
1531 -0.005394
1532 -0.016504
1533 -0.001266
1534  0.009445
1535  0.019909

[1536 rows x 1 columns]


Run the RAG process:

In [None]:
relevant_paragraphs = retrieve_relevant_chunks(prompt = prompt_definition_info,
                                               df = df_embeddings,
                                               top_k = 5)

In [None]:
relevant_paragraphs

Unnamed: 0,chunk,similarity
6,Project coordinators facilitated matching arti...,0.47097
12,"Third, we computed the proportion of study-pai...",0.447195
13,Exclusions (explanation provided in supplement...,0.440072
9,These included characteristics of the original...,0.429687
8,The most common reasons for failure to match a...,0.425096


##Creating a simple classifier
Imagine we want to use an LLM to classify a text as one of a fixed set of categories (three, in the example below). We could just instruct it in its prompt to only ever return one of these categories, and this works well in the vast majority of cases. But, depending on our workflow, the rare exceptions might actually break the entire process! So we can alter the token probabilities through the `logit_bias` parameter, which adds a bias (between -100 and 100) to the scores of the tokens we select, such that in practice only those tokens get generated.

In [33]:
import tiktoken

# Get tokenizer
enc = tiktoken.encoding_for_model("gpt-4o-mini")

system_prompt = "Your job is to classify text as one of three categories: positive, negative, or neutral."
user_prompt_text = "I'm not angry. I'm not mad. I don't even care. I'm just....so disappointed. I genuinely cannot believe you've done this to me."


# Make the API call
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Here is the text to classify:\n" + user_prompt_text}
        ],
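    # strongly favour the first token of each label (this assumes each label encodes to a single token)
    logit_bias = {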
        enc.encode("positive")[0]: 100,
        enc.encode("negative")[0]: 100,
        enc.encode("neutral")[0]: 100
    },
    max_tokens = 1 # add this to make sure we just get the single classification!
    )

output_text = response.choices[0].message.content # take the output and extract the text response only

print(output_text) # print the output

negative
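
To see why a bias of 100 leaves room for little else, here is a minimal sketch with made-up scores for five candidate tokens. The numbers and the helpers `softmax` and `apply_logit_bias` are illustrative assumptions, not what the model computes internally for this prompt.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities."""
    largest = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(score - largest) for token, score in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}

def apply_logit_bias(logits, bias):
    """Add the bias to the score of every token it mentions; leave the rest alone."""
    return {token: score + bias.get(token, 0.0) for token, score in logits.items()}

# hypothetical scores for the first output token, for illustration only
toy_logits = {"negative": 2.0, "disappointed": 3.0, "The": 2.5, "positive": 0.5, "neutral": 1.0}
label_bias = {"positive": 100, "negative": 100, "neutral": 100}

before = softmax(toy_logits)
after = softmax(apply_logit_bias(toy_logits, label_bias))

print("most likely before bias:", max(before, key=before.get))
print("most likely after bias: ", max(after, key=after.get))
print("probability left for other tokens:", after["disappointed"] + after["The"])
```

Before the bias, a non-label token is the most likely; afterwards the three labels share practically all of the probability between them, in the same order the model originally preferred them.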


Alternatively, we can also use OpenAI's _function-calling_ feature, where we specify a function whose arguments are constrained by a schema (e.g., allowing only a fixed set of labels). This offers an even stricter form of output control compared to the logit_bias approach. While logit_bias can push the model strongly toward certain tokens, it does not guarantee valid output and may occasionally result in refusals. In contrast, the function-calling approach gives a structured response that follows the schema, provided the model chooses to call the function. If we want to force the model to call the function every time, we can explicitly set `function_call={"name": "classify_sentiment"}` to avoid relying on the model's judgment. (Note that in newer versions of the API the `functions` and `function_call` arguments are deprecated in favour of `tools` and `tool_choice`; the example below uses the older arguments.)

In [35]:
# create a specification of the function which will be called by the model
classifier_function_specification = [
    {
        "name": "classify_sentiment",
        "description": "Classify the sentiment of the given text as positive, neutral, or negative.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {
                    "type": "string",
                    "enum": ["positive", "neutral", "negative"]
                }
            },
            "required": ["label"]
        }
    }
]

# call the LLM
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Please classify the following text:\n" + user_prompt_text}
    ],
    functions=classifier_function_specification,
    function_call={"name": "classify_sentiment"}
)

# Access function response (note: different than the text response)
function_output = response.choices[0].message.function_call.arguments

print(function_output)

{"label":"negative"}
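
The function arguments come back as a JSON string rather than a Python object, so a workflow would typically parse them and double-check the label before using it. A minimal sketch, assuming the same three labels as in the schema above (`ALLOWED_LABELS` and `parse_label` are illustrative names):

```python
import json

# the labels allowed by the "enum" in the function specification
ALLOWED_LABELS = ["positive", "neutral", "negative"]

def parse_label(arguments_json, allowed=ALLOWED_LABELS):
    """Parse the function-call arguments and return the label, checking it is allowed."""
    arguments = json.loads(arguments_json)
    label = arguments.get("label")
    if label not in allowed:
        raise ValueError(f"Unexpected label: {label!r}")
    return label

print(parse_label('{"label":"negative"}'))  # prints: negative
```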
