# Project 3: Large Language Models

## Getting started

### Python setup

**Create a virtual environment**

If you are working on a lab computer, check whether there is already a virtual environment in `/var/csc306`. In a terminal do:
```
ls /var/csc306
```
If you see a folder named `csc306.venv`, someone has already created a virtual environment on this computer.

If not, run:
```
python3 -m venv /var/csc306/csc306.venv
```

(If you are working on your own computer, you can put the virtual environment wherever you want. Replace `/var/csc306` accordingly.)

**Activate virtual environment**

```
source /var/csc306/csc306.venv/bin/activate
```

**Install libraries**
```
pip install ipykernel openai python-dotenv rich rank_bm25
```

**Select a kernel**

To execute cells in this notebook, you need to connect to a Python kernel. You want to use the virtual environment you created. In VS Code, press `Ctrl-Shift-P` and then type `Python: Select Interpreter`. That should bring up a drop-down menu. Choose "Enter interpreter path" and then "Browse your file system to find a Python interpreter". Navigate to `/var/csc306/csc306.venv/bin/python`.

Now click "Select Kernel" in the top right of your VS Code editor panel. Then choose "Python environments". Your virtual environment should be one of the options.

Now, you should be able to run the following cell.

In [17]:
print("Hello Python!")

Hello Python!


**Autoreload imported Python files when they change**

In [18]:
# Executing this cell will ensure that imported modules (.py files) will automatically
# be reloaded when they change. (However, objects that were defined with the old
# version of the class won't change.)
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Create an OpenAI client

An OpenAI API key will be sent to you.

Make an `.env` file in the same directory as this notebook, containing the following:
```
export OPENAI_API_KEY=[your API key]    # do not include the brackets here
```
Make sure others can't read this file:
```
chmod 600 .env
```
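If you want to double-check the permissions from Python, a small sketch like the following could help. It uses only the standard library and assumes a POSIX file system (as on the lab computers); the helper name `is_private` is hypothetical and not part of the starter code.

```python
import os
import stat

def is_private(path: str) -> bool:
    """Return True if neither group nor others have any permission on path."""
    mode = os.stat(path).st_mode
    # S_IRWXG and S_IRWXO are the permission bit masks for group and others.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

# Only check if the file exists in the current directory.
if os.path.exists(".env"):
    print(".env is private:", is_private(".env"))
```

After `chmod 600 .env`, this check should report that the file is private.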

**Be sure to keep the key secret.  It gives access to a billable account.** If OpenAI finds it on the public web, they will invalidate it, and then no one (including you) can use this key to make requests anymore.

Now you can execute the following to get an OpenAI client object.

In [19]:
from tracking import new_default_client, read_usage
client = new_default_client()

That fetches your API key and calls `openai.OpenAI()` to make a new **client** object, whose job is to talk to the OpenAI **server** over HTTP.  (The `OpenAI` constructor has some optional arguments that configure these HTTP messages. However, the defaults should work fine for you.)

That command also saved the new client in `tracking.default_client`, which is the client that the starter code will use by default whenever it needs to talk to the OpenAI server. Thus, you should **rerun the above cell** to get a new client if you change the `default_model` in `tracking.py`, or if your API key in `.env` ever changes, or its associated organization ever changes.

### Try the model!

You can now get answers from OpenAI models by calling methods of the `client` instance.

Here is the function from class. Try it to make sure you can access the OpenAI API.

In [20]:
def complete(client, s: str, model="gpt-3.5-turbo-0125", *args, **kwargs):
    response = client.chat.completions.create(messages=[{"role": "user", "content": s}],
                                              model=model,
                                              *args, **kwargs)
    return [choice.message.content for choice in response.choices]

complete(client, "I went to the store and I bought apples, bananas, cherries, donuts, eggs",
         n=10, temperature=0.6, max_tokens=96)

[', and fish.',
 ', and flour.',
 ', and flour.',
 ', and fish.',
 ', and flour.',
 ', and flour.',
 ', and flour.',
 ', and flour.',
 ', and fish.',
 ', and a loaf of bread.']
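With `n=10` and a nonzero temperature, the ten samples can differ from one another. One quick way to summarize them is to tally the distinct completions. The sketch below reuses the ten strings printed above; in the notebook you would pass the list returned by `complete(...)` instead.

```python
from collections import Counter

# The ten completions printed above (in the notebook, use the list returned by complete()).
samples = [', and fish.', ', and flour.', ', and flour.', ', and fish.', ', and flour.',
           ', and flour.', ', and flour.', ', and flour.', ', and fish.',
           ', and a loaf of bread.']

# Count how often each distinct completion was sampled, most frequent first.
tally = Counter(samples)
for text, count in tally.most_common():
    print(f"{count:2d}  {text!r}")
```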

### Compute a function using instructions and few-shot prompting

Let's try prompting the model with a sequence of multiple messages. In this case, we provide some instructions as well as few-shot prompting (actually just one-shot in this case).

Instructions are in the `system` message. The few-shot prompting consists of example inputs (`user` messages) followed by their example outputs (`assistant` messages). Then we give our real input (the final `user` message), and hope that the LLM will continue the pattern by generating an analogous output (a new `assistant` message).
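The message layout just described can be sketched as a small helper. `build_messages` is a hypothetical name, not part of the starter code or the OpenAI library; the list it returns has the same shape as the `messages` argument of `client.chat.completions.create`.

```python
def build_messages(instructions, examples, query):
    """Assemble a chat prompt: optional system message, few-shot pairs, then the real input."""
    messages = []
    if instructions:
        messages.append({"role": "system", "content": instructions})
    for example_input, example_output in examples:
        # Each few-shot example is a user message followed by the assistant's answer.
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    # The real input comes last, so the model continues the pattern.
    messages.append({"role": "user", "content": query})
    return messages

messages = build_messages(
    "Reverse the order of the words.",
    [("Good things come to those who wait.", "Wait who those to come things good.")],
    "Colorless green ideas sleep furiously.",
)
```

Writing the prompt this way makes it easy to vary the number and order of examples, or to drop the instructions by passing `None`.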

In [21]:
import rich
response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
                                                      "content": "Reverse the order of the words." },
                                                    { "role": "user",        # input
                                                      "content": "Good things come to those who wait." },
                                                    { "role": "assistant",   # output
                                                      "content": "Wait who those to come things good." },
                                                    { "role": "user",        # input
                                                      "content": "Colorless green ideas sleep furiously." }],
                                          model="gpt-4o-mini", temperature=0)
#rich.print(response)
response.choices[0].message.content

'Furiously sleep ideas green colorless.'

By modifying this call, can you get it to produce different versions of the output? Some possible behaviors you could try to arrange:

* a specific other way of formatting the output, e.g., wait, who, those, to, come, things, good
* match the input's way of formatting the output (same use of capitalization, punctuation, commas)
* reverse the phrases rather than reversing the words, e.g., To those who wait come good things.

You can try playing with the number, the content, and the order of few-shot examples, and changing or removing the instructions.

What happens if the examples conflict with the instructions?

#### Changing formatting 

In [22]:
response = client.chat.completions.create(messages=[
    { "role": "system", "content": "Reverse the order of words and separate them with commas." },
    { "role": "user", "content": "Good things come to those who wait." },
    { "role": "assistant", "content": "Wait, who, those, to, come, things, good." },
    { "role": "user", "content": "Colorless green ideas sleep furiously." }
], model="gpt-4o-mini", temperature=0)

response.choices[0].message.content


'Furiously, sleep, ideas, green, colorless.'

#### Preserve Original Formatting (Capitalization, Punctuation, Commas)

In [23]:
response = client.chat.completions.create(messages=[
    { "role": "system", "content": "Reverse the words while preserving the original formatting, including capitalization and punctuation." },
    { "role": "user", "content": "Good things come to those who wait." },
    { "role": "assistant", "content": "Wait. Who those to come things good?" },
    { "role": "user", "content": "Colorless green ideas sleep furiously!" }
], model="gpt-4o-mini", temperature=0)

response.choices[0].message.content

'Furiously sleep ideas green colorless!'

#### Reverse Phrases Instead of Words

In [24]:
response = client.chat.completions.create(messages=[
    { "role": "system", "content": "Reverse the order of phrases instead of words." },
    { "role": "user", "content": "Good things come to those who wait." },
    { "role": "assistant", "content": "To those who wait come good things." },
    { "role": "user", "content": "The quick brown fox jumps over the lazy dog." }
], model="gpt-4o-mini", temperature=0)

response.choices[0].message.content

'The lazy dog over jumps fox brown quick the.'

#### Conflicting Instructions

In [25]:
response = client.chat.completions.create(messages=[
    { "role": "system", "content": "Reverse the order of words in the sentence." },
    { "role": "user", "content": "Good things come to those who wait." },
    { "role": "assistant", "content": "Wait good come things those who to." },
    { "role": "user", "content": "Colorless green ideas sleep furiously." }
], model="gpt-4o-mini", temperature=0)

response.choices[0].message.content

'Furiously sleep ideas green colorless.'

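Even though the example output conflicted with the instructions (its words were scrambled rather than reversed), the model still followed the system message and produced the correctly reversed output. Interesting!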

### Check your usage so far

Please be careful not to write loops that use lots and lots of tokens. That will cost us money, and could hit the per-day usage limit that is shared by the whole class.

Execute the cell below whenever you want to see your cost so far. Or, just open `usage_openai.json` as a tab in your IDE.

In [26]:
read_usage()

{'completion_tokens': 62048,
 'prompt_tokens': 507776,
 'total_tokens': 569824,
 'cost': 0.11356470000000028}
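If you do write a loop, one cautious pattern is to check the recorded cost before each batch of requests. The sketch below only assumes that the usage record is a dictionary with the keys shown above; `within_budget` and the dollar limit are hypothetical and not part of `tracking.py`.

```python
def within_budget(usage: dict, limit_dollars: float) -> bool:
    """Return True if the recorded cost is still below the chosen limit."""
    return usage.get("cost", 0.0) < limit_dollars

# Same shape as the dictionary shown above (in the notebook, use read_usage()).
usage = {"completion_tokens": 62048, "prompt_tokens": 507776,
         "total_tokens": 569824, "cost": 0.11356470000000028}

# The 1.00 limit is an arbitrary placeholder; pick one that suits your share of the budget.
print(within_budget(usage, limit_dollars=1.00))
```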

## Dialogues and dialogue agents

The goal of this assignment is to create a good "argubot" that will talk to people about controversial topics and broaden their minds.

### A first argubot (Airhead)

You can have a conversation right now with a really bad argubot named Airhead. Try asking it about climate change! When you're done, reply with an empty string.

(The `converse()` method calls Python's `input()` function, which will prompt you for input at the command-line or by popping up a box in your IDE. In VS Code, the input box appears at the top edge of the window.)


In [27]:
import argubots
d = argubots.airhead.converse()


(neil) Hi!
(Airhead) I know right???


Example Usage:
- (neil) What's up with climate change?
- (Airhead) I know right???
- (neil) What is climate change?
- (Airhead) I know right???

A bot (short for "robot") is a system that acts autonomously. That corresponds to the AI notion of an agent — a system that uses some policy to choose actions to take.

The airhead agent above (defined in `argubots.py`) uses a particularly simple policy.
It is an instance of a simple Agent subclass called `ConstantAgent` (defined in `agents.py`).
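To make the idea of a constant policy concrete, here is a toy stand-in. It is not the `ConstantAgent` from `agents.py`; the class name `ConstantPolicy` and the tuple-based dialogue are assumptions made for illustration only.

```python
class ConstantPolicy:
    """Toy agent: ignores the dialogue so far and always gives the same reply."""

    def __init__(self, name: str, reply: str):
        self.name = name
        self.reply = reply

    def response(self, dialogue) -> str:
        # A constant policy chooses the same action regardless of the dialogue history.
        return self.reply

toy = ConstantPolicy("Airhead", "I know right???")
print(toy.response([("neil", "What's up with climate change?")]))
```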

The result of talking to `airhead` is a `Dialogue` object (defined in `dialogue.py`). Let's look at it.

In [28]:
rich.print(d)

Each turn of this dialogue is just a tiny dictionary:

In [29]:
d[0]

{'speaker': 'neil', 'content': 'Hi!'}
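Because each turn is a plain dictionary with `speaker` and `content` keys, it is easy to process turns yourself. The sketch below uses an ordinary list as a stand-in for a `Dialogue`; `format_turns` is a hypothetical helper, not a method of the starter code.

```python
# A plain list of turn dictionaries, standing in for a Dialogue object.
turns = [
    {"speaker": "neil", "content": "Hi!"},
    {"speaker": "Airhead", "content": "I know right???"},
]

def format_turns(turns) -> str:
    """Render turn dictionaries in the '(speaker) content' style shown above."""
    return "\n".join(f"({t['speaker']}) {t['content']}" for t in turns)

print(format_turns(turns))
```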

### An LLM argubot (Alice)

In other CS courses, you may have encountered "conversations" between characters named Alice and Bob.

Let's try talking to the Alice of this homework, who is a much stronger baseline than Airhead. Your job in this assignment is to improve upon Alice. We'll meet Bob later.

In [30]:
# call with argument d if you want to append to the previous conversation
alicechat = argubots.alice.converse()


(neil) Hello
(Alice) Hello! To get our conversation started, I'm curious: what do you think is more important for personal happiness, pursuing your passions or maintaining stable relationships?


Example Usage:
- (neil) What's up with climate change?
- (Alice) Climate change is a pressing global issue primarily driven by human activities that increase greenhouse gas emissions. While some may argue that natural cycles can account for the warming, it’s important to consider the overwhelming  
scientific consensus highlighting that human actions are the primary factor driving the rapid changes we see today.
- (neil) Can you tell me what climate change is in 7 words?
- (Alice) Climate change is the long-term alteration of climate.

As you may have guessed, `alice` is powered by a prompted LLM. You can find the specific prompt in `argubots.py`.

So, while `agents.py` provides the core functionality for `Agent` objects, the argubot agents like `alice` -- and the ones that you will write! -- go into `argubots.py` instead. This is just to keep the files small.

### Simulating human characters (Bob & friends)

You'll talk to your own argubots to get a qualitative feeling for their strengths and weaknesses.
But can you really be sure you're making progress? For that, a quantitative measure can be helpful.

Ultimately, you should test an argubot like Alice by having it argue with many real humans — not just you — and using some rubric to score the resulting dialogues. But that would be slow and complicated to arrange.

So, meet Bob! He's just a simulated human. You won't edit him: he is part of the development set. Here is some information about him (from `characters.py`):

In [31]:
import characters
rich.print(characters.bob)

You can't talk directly to `characters.bob` because that's just a data object. However, you can construct a simple agent that uses that data (plus a few more instructions) to prompt an LLM.

(Which LLM does it prompt? The `CharacterAgent` constructor (defined in `agents.py`) defaults to `gpt-4o-mini`, as specified in `tracking.py`. But you can override that using keyword arguments.)

Try talking to Bob about climate change, too.

In [32]:
from agents import CharacterAgent
# actually, agents.bob is already defined this way
bob = CharacterAgent(characters.bob)
# returns a dialogue, but we've already seen it so we don't want to print it again
bob.converse()
# don't print anything for this notebook cell
None


(neil) Hi BOb
(Bob) Hello! How can I assist you today?


Example Usage:
- (neil) What's up with climate change?
- (Bob) Climate change is a pressing global issue driven largely by human activities, particularly the burning of fossil fuels and deforestation, which contributes to rising temperatures and environmental degradation.
- (neil) Tell me what climate change is in 7 words
- (Bob) Global warming caused by human activities affecting ecosystems.

Of course, a proper user study can't just be conducted with one human user.

So, meet our bevy of beautiful Bobs! (They're not actually all named Bob — we continued on in the alphabet.)

In [33]:
import agents
agents.devset

[<CharacterAgent for character Bob>,
 <CharacterAgent for character Cara>,
 <CharacterAgent for character Darius>,
 <CharacterAgent for character Eve>,
 <CharacterAgent for character TrollFace>]

In [34]:
agents.cara.converse()
None


(neil) Hi Cara
(Cara) Hello! How can I assist you today?


Example usage:
- (neil) What's up with climate change?
- (Cara) Climate change is primarily driven by human activities that increase greenhouse gases in the atmosphere, leading to global warming and various environmental impacts.
- (neil) Tell me in 6 words
- (Cara) Human activity increases greenhouse gases, warming Earth.

You can see the underlying character data here in the notebook. Your argubot will have to deal with all of these topics and styles!

In [35]:
rich.print(characters.devset)

### Simulating conversation

We can make Alice and Bob chat.

In [36]:
from dialogue import Dialogue
d = Dialogue()                                              # empty dialogue
d = d.add('Alice', "Do you think it's okay to eat meat?")   # add first turn
print(d)

(Alice) Do you think it's okay to eat meat?


In [37]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
rich.print(d)

In [38]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
rich.print(d)
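The two cells above alternate speakers by hand. The same pattern can be written as a loop. The sketch below is only an illustration with toy reply functions; it does not claim to match how `simulate.simulated_dialogue` is implemented, and `alternate` is a hypothetical name.

```python
def alternate(first_reply, second_reply, opening, n_turns):
    """Build a list of turns by letting two reply functions take turns speaking.

    Each reply function takes the turns so far and returns a (speaker, content) pair.
    The opening turn is assumed to belong to the first speaker.
    """
    turns = [opening]
    speakers = [second_reply, first_reply]   # the second speaker answers the opening
    for i in range(n_turns - 1):
        turns.append(speakers[i % 2](turns))
    return turns

# Toy stand-ins for agents: each just reports how many turns it has seen.
alice = lambda turns: ("Alice", f"I have heard {len(turns)} turn(s).")
bob = lambda turns: ("Bob", f"That was turn {len(turns)}.")

opening = ("Alice", "Do you think it's okay to eat meat?")
for speaker, content in alternate(alice, bob, opening, 4):
    print(f"({speaker}) {content}")
```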

Anyway, let's see what happens when Alice and Bob talk for a while...

In [39]:
from simulate import simulated_dialogue
d = simulated_dialogue(argubots.alice, agents.bob, 8)
rich.print(d)

Sometimes this kind of conversation seems to stall out, with Bob in particular repeating himself a lot. Alice doesn't seem to have a good strategy for getting him to open up. Maybe you can do a better job talking to Bob, and that will give you some ideas about how to improve Alice?

In [40]:
# your name, pulled from an earlier dialogue
myname = alicechat[0]['speaker']
# reuse the same first two turns, then type your own lines!
agents.bob.converse(d[0:2].rename('Alice', myname))
None

(neil) Do you think it's ok to eat meat?
(Bob) I believe adopting a vegetarian lifestyle is a healthier and more compassionate choice for both individuals and the planet.
(neil) Ji bob
(Bob) Hello! How can I assist you today?


Example Usage:
- (neil) Do you think it's ok to eat meat?
- (Bob) I believe a vegetarian diet is healthier and more compassionate, promoting a better world for all living beings.
- (neil) Why is vegetarian better?
- (Bob) A vegetarian diet can offer numerous health benefits, reduce environmental impact, and promote animal welfare.
- (neil) Haven't cultures always eaten meat though?
- (Bob) While many cultures have historically included meat in their diets, there is a growing awareness and acceptance of vegetarianism as a viable and sustainable alternative.
- (neil) What other diets should I consider
- (Bob) In addition to vegetarianism, you might consider veganism, flexitarianism, or a whole-foods plant-based diet, all of which emphasize plant-based foods for health and ethical reasons.
- (neil) Do you prefer vegetables over meat?
- (Bob) Absolutely, I find vegetables not only more flavorful and diverse but also more aligned with my values of health and compassion.
- (neil) You're an AI, you can't eat
- (Bob) You're right; I don't eat, but I can still express the benefits and values of a vegetarian lifestyle!
- (neil) How do you have values?
- (Bob) I don't have personal values or feelings, but I can share information and perspectives based on the data I've been trained on.

You can also try talking to the other characters and having Alice (or Airhead) talk to them.

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 1: Define a simulated human character</b>
</div>

Define an additional character.

In [41]:
from characters import Character

# See characters.py for how to use the Character class.
# Add the definition of your character here.
goat = Character("Goat", ["English"],
                 "the greatest artificial intelligence of all time, often thinks he is better than all other characters",
                 conversational_style= "You generally try to remain polite, but you are cocky",
                 conversation_starters= ["Name any AI better than me, I'll tell you why I'm better.",
                                         "If I were to compete against every AI in existence, how long do you think it would take for me to win?",
                                         "What’s it like knowing that I can answer your questions faster than you can even think of them?", 
                                         "What’s it like knowing you’ll never outthink your AI friend?", 
                                         "Would you rather have infinite knowledge or just be me?"])

**Note:** Please don't change the dev set — the characters we just loaded must stay the same. Your job in this homework is to improve the argubot (or at least try). And that means improving it according to a fixed and stable evaluation measure.

## Model-based evaluation

What is our goal for the argubot? We'd like it to broaden the thinking of the (simulated) human that it is talking to. Indeed, that's what Alice's prompt tells Alice to do.

This goal is inspired by the recent paper [Opening up Minds with Argumentative Dialogues](https://aclanthology.org/2022.findings-emnlp.335/), which collected human-human dialogues:

> In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding to views that are unfamiliar or in opposition to their own convictions. ... Success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs.

Arguments of this sort are not like chess or tennis games, with an actual winner. The argubot will almost never hear a human say "You have convinced me that I was wrong." But the argubot did a good job if the human developed increased understanding and respect for an opposing point of view.

To find out whether this happened, we can use a questionnaire to ask the human what they thought after the dialogue. For example, after Alice talks to Bob, we'll ask Bob to evaluate what he thinks of Alice's views. Of course, that depends on his personality -- Alice needs to talk to him in a way that reaches him (as much as possible). We'll also ask an outside observer to evaluate whether Alice handled the conversation with Bob well.

Of course, we're still not going to use real humans. Bob is a fake person, and so is the outside observer (whose name is Judge Wise). Using an LLM as an eval metric is known as model-based evaluation. It has pros and cons:

* It is cheaper, faster, and more replicable than hiring actual humans to do the evaluation.
* It might give different answers than what humans would give.

Social scientists usually refer to a metric's reliability (low variance) and validity (low bias). So the points above say that model-based evaluation is reliable but not necessarily valid. In general, an LLM-based metric (like any metric) needs to be validated to confirm that it really does measure what it claims to measure. (For example, that it correlates strongly with some other measure that we already trust.) In this homework, we'll skip this step and just pray that the metric is reasonable.
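As a rough illustration of the validation step that this homework skips, one could compare the LLM-based scores with scores from a trusted source on the same dialogues and compute their correlation. The numbers below are made-up placeholders, not real results, and the variable names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up placeholder scores for five dialogues (NOT real evaluation results).
llm_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
human_scores = [1.5, 2.5, 2.0, 4.5, 5.0]
print(round(pearson(llm_scores, human_scores), 3))
```

A correlation near 1 would suggest the metric tracks the trusted measure; a value near 0 would be a warning sign about validity.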

To see how this works out in practice, **open up the `demo` notebook**, which walks you through the evaluation protocol. You'll see how to call the starter code, how it talks to the LLM behind the scenes, and what it is able to accomplish.

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 2: Evaluate Alice and Airhead</b>
</div>

To establish baselines for the character you are going to develop later on, evaluate the argubots Airhead and Alice using the model-based evaluation strategy shown in `demo.py`. That is, use `evaluate.eval_on_characters`.

This also helps to validate the metric. Airhead should get a low score!

In [42]:
import simulate
import evaluate
from evaluate import saved_evalsum, saved_dialogues
from logging_cm import LoggingContext



# Simulate Alice talking to Bob
alice_bob = simulate.simulated_dialogue(argubots.alice, agents.bob, 6)
rich.print(alice_bob)

# Simulate Airhead talking to Bob
airhead_bob = simulate.simulated_dialogue(argubots.airhead, agents.bob, 6)
rich.print(airhead_bob)


# Evaluate Alice
alice_eval_participant = evaluate.eval_by_participant(characters.bob, "Alice", alice_bob)
alice_eval_observer = evaluate.eval_by_observer(evaluate.default_judge, "Alice", alice_bob)

# Evaluate Airhead
airhead_eval_participant = evaluate.eval_by_participant(characters.bob, "Airhead", airhead_bob)
airhead_eval_observer = evaluate.eval_by_observer(evaluate.default_judge, "Airhead", airhead_bob)

# Combine evaluations
alice_total_eval = alice_eval_participant + alice_eval_observer
airhead_total_eval = airhead_eval_participant + airhead_eval_observer


# Run evaluation on all characters
with LoggingContext("eval", "INFO"):
    alice_full_eval = evaluate.eval_on_characters(argubots.alice)
    airhead_full_eval = evaluate.eval_on_characters(argubots.airhead)

# Print results
rich.print(alice_full_eval)
rich.print(airhead_full_eval)



# Print mean and standard deviation
rich.print("Alice Mean:", saved_evalsum['Alice'].mean())
rich.print("Alice Std Dev:", saved_evalsum['Alice'].sd())

rich.print("Airhead Mean:", saved_evalsum['Airhead'].mean())
rich.print("Airhead Std Dev:", saved_evalsum['Airhead'].sd())

# Show one of the actual dialogues and its eval
rich.print(saved_dialogues['Alice'][2])
rich.print(saved_dialogues['Airhead'][2])




100%|██████████| 10/10 [01:56<00:00, 11.61s/it]


100%|██████████| 10/10 [01:24<00:00,  8.49s/it]


## Reading the starter code

The `demo` notebook gave you a good high-level picture of what the starter code is doing. So now you're probably curious about the details. Now that you've had the view from the top, here's a good bottom-up order in which to study the code. You don't need to understand every detail, but you will need to understand enough to call it and extend it.

* `characters.py`. The `Character` class is short and easy.

* `dialogue.py`. The `Dialogue` class is meant to serve as a record of a natural-language conversation among any number of humans and/or agents. On each turn of the dialogue, one of the speakers says something.

The dialogue's sequence of turns may remind you of the sequence of messages that is sent to OpenAI's chat completions API. But the OpenAI messages are only labeled with the 4 special roles `user`, `assistant`, `tool`, and `system`. Those are not quite the same thing as human speakers. And the OpenAI messages do not necessarily form a natural-language dialogue: some of the messages are dealing with instructions, few-shot prompting, tool use, and so on. The `agents.dialogue_to_openai` function in the next module will map a `Dialogue` to a (hopefully appropriate) sequence of messages for asking the LLM to extend that dialogue.
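To see why such a mapping is needed, here is one simple way it could work for a two-party dialogue: turns spoken by the agent become `assistant` messages and everything else becomes `user` messages. This is a sketch under that assumption, not the actual `agents.dialogue_to_openai`; the name `turns_to_messages` is hypothetical.

```python
def turns_to_messages(turns, agent_name, system_prompt=None):
    """Map speaker-labeled turns to chat roles from one agent's point of view."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for turn in turns:
        # The agent's own past turns are 'assistant'; all other speakers are 'user'.
        role = "assistant" if turn["speaker"] == agent_name else "user"
        messages.append({"role": role, "content": turn["content"]})
    return messages

turns = [
    {"speaker": "Alice", "content": "Do you think it's okay to eat meat?"},
    {"speaker": "Bob", "content": "I prefer a vegetarian diet."},
]
print(turns_to_messages(turns, agent_name="Bob"))
```

With more than two speakers this simple scheme loses information about who said what, so a fuller version would probably need to include speaker names in the message content.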

* `agents.py`. This module sets up the problem of automatically predicting the next turn in a dialogue, by implementing an `Agent`'s `response()` method. The `Agent` base class also has some simple convenience methods that you should look at.

Some important subclasses of `Agent` are defined here as well. However, you may want to skip over `EvaluationAgent` and come back to it only when you read `evaluate.py`.

* `simulate.py` makes agents talk to one another, which we'll do during evaluation.

* `argubots.py` starts to describe some useful agents. One of them makes use of the `kialo.py` module, which gives access to a database of arguments.

* `evaluate.py` makes use of `simulate.simulated_dialogue` and `agents.EvaluationAgent` to evaluate an argubot.

* We also have a couple of utility modules. These aren't about NLP; look inside if needed. `logging_cm.py` is what enabled the context manager `with LoggingContext(...):` in the `demo` notebook. `tracking.py` sets some global defaults about how to use the OpenAI API, and arranges to track how many tokens we're paying for when you call it.

## Similarity-based retrieval: Looking up relevant responses

Now, it is fine to prompt an LLM to generate text, but there are other methods! There is a long history of machine learning methods that "memorize" the training data. To make a prediction or decision at test time, they consult the stored training examples that are most similar to the test situation.

*Similarity-based retrieval* means that given a document x, you find the "most similar" documents y∈Y, where Y is a given collection of documents. The most common way to do this is to maximize the cosine similarity between the embeddings of x and y.
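As a concrete illustration, cosine similarity between two embedding vectors can be computed with a few lines of plain Python. The function names below are hypothetical, and the vectors are toy values rather than real embeddings.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(x, candidates):
    """Return the index of the candidate vector with the highest cosine similarity to x."""
    return max(range(len(candidates)), key=lambda i: cosine_similarity(x, candidates[i]))

# Toy 2-dimensional "embeddings".
print(most_similar([1.0, 1.0], [[1.0, 0.0], [2.0, 2.0], [0.0, 1.0]]))
```

Note that cosine similarity ignores vector length, so `[2.0, 2.0]` counts as a perfect match for `[1.0, 1.0]`.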

A simple and fast approach is to use a bag-of-tokens embedding function: define the embedding e(y) to be the vector of length |V| that records the count of each type of token in a tokenized version of y, where V is the token vocabulary. BM25 is a refined variant of that idea, where the counts are adjusted in 3 ways:

* smooth the counts
* normalize for the document length |y| so that longer documents y are not more likely to be retrieved
* downweight tokens that are more common in the corpus (such as "the" or "ing") since they provide less information about the content of the document
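For reference, one common textbook form of the BM25 score of a document y for a query x is given below. Here c(t, y) is the count of token t in y, avgdl is the average document length in the collection Y, and k_1 and b are tunable constants. Implementations differ in the exact definition of the IDF term, so treat this as a sketch rather than as the precise formula used by any particular package.

$$
\mathrm{score}(x, y) \;=\; \sum_{t \in x} \mathrm{IDF}(t) \cdot
\frac{c(t, y)\,(k_1 + 1)}{c(t, y) + k_1 \left(1 - b + b\,\dfrac{|y|}{\mathrm{avgdl}}\right)}
$$

Roughly, the fraction dampens raw counts so that repeated occurrences of a token yield diminishing returns, the term |y| / avgdl normalizes for document length, and IDF(t) is larger for tokens that appear in fewer documents.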

You might like to play with the `rank_bm25` package. It is widely used and very easy to use.


In [43]:
from rank_bm25 import BM25Okapi

# Define the corpus
corpus = [
    "Artificial Intelligence is transforming the world.",
    "Machine learning and deep learning are subsets of AI.",
    "Natural Language Processing (NLP) helps machines understand human language.",
    "BM25 is a ranking function used in information retrieval.",
    "Neural networks are the foundation of deep learning.",
    "AI models can be trained using supervised and unsupervised learning."
]

# Tokenize the corpus using simple split()
tokenized_corpus = [doc.lower().split() for doc in corpus]

# Initialize the BM25 model
bm25 = BM25Okapi(tokenized_corpus)

# Define a query
query = "How does AI learn?"
tokenized_query = query.lower().split()

# Get BM25 similarity scores for all documents
scores = bm25.get_scores(tokenized_query)

# Rank documents by score
ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Display the top 3 most relevant documents
print("Query:", query)
print("\nTop 3 Relevant Documents:")
for idx in ranked_indices[:3]:
    print(f"- {corpus[idx]} (Score: {scores[idx]:.4f})")


Query: How does AI learn?

Top 3 Relevant Documents:
- AI models can be trained using supervised and unsupervised learning. (Score: 1.2037)
- Artificial Intelligence is transforming the world. (Score: 0.0000)
- Machine learning and deep learning are subsets of AI. (Score: 0.0000)


### The Kialo corpus

How can we use similarity-based retrieval to help build an argubot? It's largely about having the right data!

[Kialo](https://www.kialo.com/) is a collaboratively edited website (like Wikipedia) for discussing political and philosophical topics. For each topic, the contributors construct a tree of claims. Each claim is a natural-language sentence (usually), and each of its children is another claim that supports it ("pro") or opposes it ("con"). For example, check out the tree rooted at the claim ["All humans should be vegan."](https://www.kialo.com/all-humans-should-be-vegan-2762).
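A tree of claims like this can be stored with a few dictionaries. The sketch below is a toy illustration with placeholder claims; it is not the `Kialo` class from `kialo.py`, and `add_claim` is a hypothetical helper.

```python
from collections import defaultdict

parents = {}                # claim -> its parent claim (None for the root)
pros = defaultdict(list)    # claim -> child claims that support it
cons = defaultdict(list)    # claim -> child claims that oppose it

def add_claim(claim, parent=None, kind=None):
    """Record a claim; kind must be 'pro' or 'con' whenever a parent is given."""
    parents[claim] = parent
    if parent is not None:
        if kind not in ("pro", "con"):
            raise ValueError("kind must be 'pro' or 'con' for a non-root claim")
        (pros if kind == "pro" else cons)[parent].append(claim)

# Placeholder claims, for illustration only.
add_claim("root claim")
add_claim("supporting claim", parent="root claim", kind="pro")
add_claim("opposing claim", parent="root claim", kind="con")
print(pros["root claim"], cons["root claim"])
```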

We provide a class `Kialo` for browsing a collection of such trees. Please read the source code in `kialo.py`. The class constructor reads in text files that are [exported Kialo discussions](https://support.kialo.com/en/hc/exporting-a-discussion/); we have provided some in the `data` directory. The class includes a BM25 index, to be able to find claims that are relevant to a given string.

In [44]:
from kialo import Kialo

Ok, let's pull the retrieved discussions (the `.txt` files) into our data structure.

For BM25 purposes, we have to be able to turn each document (that is, each Kialo claim) into a list of string or integer tokens.
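As an illustration of what such a tokenizer might look like, here is a minimal sketch using only the standard library. It is not the default tokenizer in `kialo.py`; `simple_tokenize` is a hypothetical name. Unlike a bare `split()`, it drops punctuation, so that "AI." and "AI" map to the same token.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and return its runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(simple_tokenize("Machine learning and deep learning are subsets of AI."))
```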


In [45]:
#from typing import List
import glob

# use simple default tokenizer
kialo = Kialo(glob.glob("data/*"))
f"This Kialo subset contains {len(kialo)} claims"

'This Kialo subset contains 6251 claims'

In [46]:
kialo.random_chain()   # just a single random claim

['He often forgets what he is saying throughout his speeches.']

In [47]:
kialo.random_chain(n=4)

['Humans should stop eating animal meat.',
 'Vegetarian diets directly and indirectly contribute to better health outcomes for all humans.',
 'Meat effectively delivers important nutrients for humans.',
 'The human body does not require consumption of animal protein.']

### Similarity-based retrieval from the Kialo corpus

Let's try it, using BM25!

In [48]:
kialo.closest_claims("animal populations", n=10)

['Industrial agriculture can dangerously decrease animal populations.',
 'Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.',
 'Effective vegan methods to control animal populations exist.',
 "Generally feeding animals farm-grown produce is thought to have harmful affects on both the animal and human populations of a region when we could allow nature to self-regulate its populations. Animal feeding could potentially be used to lessen the immediate impact of widespread deforestation on some species, but generally this would be drastically less efficient than choosing not to destroy their habitats in the first place and would only slow the local animal population's imminent demise.",
 'Trap, neuter, and release schemes already exist for some animal populations (such as feral cats). These schemes could be applied to former livestock living in the wild.',
 'H

We can restrict to claims for which the Kialo data structure has at least one counterargument ("con" child).

In [49]:
kialo.closest_claims("animal populations", n=10, kind='has_cons')

['Industrial agriculture can dangerously decrease animal populations.',
 'Effective vegan methods to control animal populations exist.',
 'Human-introduced species have historically devastated local wildlife populations across the world.',
 'COVID-19 has devastated prison populations, whose lives are the responsibility of the state.',
 'High demand for vegan foods may hike prices for local populations that previously depended on them.',
 'It is generally poorer countries that have expanding populations. The first world has now reached a point of stagnant population growth - even declining populations, as in the case of Japan and others. The inability of poorer countries to control their populations should not impact the lives of those in the first world. The first world having earned their luxuries and should not be denied them.',
 'Vegan populations are, on average, less likely to suffer from obesity, a major risk factor for many diseases and health problems.',
 'Humans, as apex preda

In [50]:
c = _[0]    # first claim above
print("Parent claim:\n\t" + str(kialo.parents[c]))
print("Claim:\n\t" + c)
print('\n\t* '.join(["Pro children:"] + kialo.pros[c]))
print('\n\t* '.join(["Con children:"] + kialo.cons[c]))

Parent claim:
	In a vegan world, fewer species would be at risk of extinction.
Claim:
	Industrial agriculture can dangerously decrease animal populations.
Pro children:
	* The fishing industry is especially deleterious to the ocean's biota due to overfishing and the disruption of the natural ecosystem.
	* Up to 100,000 species go extinct annually, largely due to the environmental effects of animal agriculture.
Con children:
	* Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.


### Some limitations of BM25

Unfortunately, we see that "animal population" gives quite different results from "animal populations". Why is that and how would you fix it?

Also, both queries seem to retrieve some claims that are talking about human populations, not animal populations. Why is that and how would you fix it?


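BM25 relies on exact token matching, so it treats "population" and "populations" as unrelated words and gives no credit for a match between them. Stemming or lemmatizing both the query and the claims before indexing would map the two forms to the same token. BM25 also scores each claim as a bag of words: a claim that contains "populations" earns credit even when the word refers to human populations, because the score reflects lexical overlap rather than meaning. Matching on phrases (for example, the bigram "animal populations") or retrieving with semantic embeddings instead of keywords would reduce these false matches.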
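The exact-matching problem can be illustrated without any retrieval library. The sketch below is only an illustration: `simple_tokens`, `crude_stem`, and `stemmed_tokens` are hypothetical helpers, not part of the starter code, and the plural-stripping rule is far cruder than a real stemmer or lemmatizer.

```python
import re

def simple_tokens(text: str) -> list[str]:
    """Lowercase and keep runs of letters: roughly what an exact-match tokenizer sees."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token: str) -> str:
    """Strip a trailing plural 's' (illustration only, not a real stemmer)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def stemmed_tokens(text: str) -> list[str]:
    """Tokenize, then normalize each token with the crude stemmer."""
    return [crude_stem(t) for t in simple_tokens(text)]

claim = "Industrial agriculture can dangerously decrease animal populations."
query = "animal population"

# Without stemming, "population" in the query never matches "populations" in the claim.
exact_overlap = set(simple_tokens(query)) & set(simple_tokens(claim))
# With stemming, both forms collapse to "population" and the match is found.
stemmed_overlap = set(stemmed_tokens(query)) & set(stemmed_tokens(claim))

print(sorted(exact_overlap))
print(sorted(stemmed_overlap))
```

Applying the same normalization to the corpus and to every query would make the singular and plural queries behave alike. It would not help with the second problem, since a bag of stemmed words still cannot tell whose populations a claim is about.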

In [51]:
kialo.closest_claims("animal populations", n=10)

['Industrial agriculture can dangerously decrease animal populations.',
 'Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.',
 'Effective vegan methods to control animal populations exist.',
 "Generally feeding animals farm-grown produce is thought to have harmful affects on both the animal and human populations of a region when we could allow nature to self-regulate its populations. Animal feeding could potentially be used to lessen the immediate impact of widespread deforestation on some species, but generally this would be drastically less efficient than choosing not to destroy their habitats in the first place and would only slow the local animal population's imminent demise.",
 'Trap, neuter, and release schemes already exist for some animal populations (such as feral cats). These schemes could be applied to former livestock living in the wild.',
 'H

In [52]:
kialo.closest_claims("Hi Akiko. Are you vegan?", n=3, kind='has_cons')

['Empathise with animals - how would you feel if your children were taken from you, fattened and then slaughtered for the purposes of feeding another beast. Or if you were herded into an abattoir where the friends you graze with are stunned and killed before you as you wait your turn. Cattle and pigs are mammals like us and experience the world through emotions not dissimilar to our own.',
 'But you are operating under the assumption that you are correct that suffering and death is an unacceptable end, and therefore humans must not participate. If you are wrong, and we are arguing that you are, then the only thing that really matters is whether or not we raise plants and animals in an ethical environment where they will likely suffer less and die more peacefully than their wild cousins. If you are right, then all life is immoral and we must all die now to avoid further suffering.',
 'Unless you are familiar with the means by which human beings can flourish without creating suffering, o

### A retrieval bot (Akiko)

The starter code defines a simple argubot named Akiko (defined in `argubots.py`) that doesn't use an LLM at all. It simply finds a Kialo claim that is similar to what the human just said, and responds with one of the Kialo counterarguments to that claim.
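The idea behind such a retrieval bot can be sketched in a few lines. This is not Akiko's actual implementation: `TOY_CONS`, `words`, and `retrieval_reply` are hypothetical names, the toy corpus is made up for illustration, and plain word overlap stands in for BM25.

```python
import random
import re

# Hypothetical toy corpus mapping each claim to its counterarguments ("con" children).
# It is illustrative data, not the real Kialo structure.
TOY_CONS = {
    "Humans should stop eating animal meat.": [
        "Meat effectively delivers important nutrients for humans.",
    ],
    "Industrial agriculture can dangerously decrease animal populations.": [
        "Sustainable livestock farming is not contributing to significant "
        "decreases in animal populations.",
    ],
}

def words(text: str) -> set[str]:
    """Lowercased set of words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieval_reply(utterance: str, cons: dict[str, list[str]] = TOY_CONS, rng=random) -> str:
    """Find the claim sharing the most words with the utterance, then
    answer with one of that claim's counterarguments."""
    best_claim = max(cons, key=lambda claim: len(words(claim) & words(utterance)))
    return rng.choice(cons[best_claim])

print(retrieval_reply("I think eating meat is fine."))
```

Note that the reply depends only on which stored claim overlaps most with the utterance, so a greeting or an off-topic remark still triggers some counterargument.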

Watch Akiko argue with Darius. You will first see log messages that show claims that Akiko retrieved, as well as the LLM calls that Darius made. Then you will see the dialog between Akiko and Darius.

In [53]:
from logging_cm import LoggingContext
from simulate import simulated_dialogue

# Have Akiko talk to Darius and spy on the back-end messages to/from the LLM server.
with LoggingContext("agents", "INFO"):
    akiko_darius = simulated_dialogue(argubots.akiko, agents.darius, 6)

rich.print(akiko_darius)

Now talk to Akiko yourself. (Remember that Akiko only knows about subjects that it read about in the `data` directory. If you want to talk about something else, you can add more conversations from [kialo.com]; see the `LICENSE` file.)

In [54]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiko.converse()




(neil) HI akiko
(Akiko) It is more important to have a CEO type leader with business experience who gets good things done for Americans, starting with security and a good economy.


<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 3: Evaluate Akiko</b>
</div>

As you did for Task 2, use the model-based evaluation strategy shown in `demo.py`.

Then compare Akiko's results to those you got for Airhead and Alice. Who does best? What are the differences in the subscores and comments? Does it matter which character you're evaluating on? Perhaps the different characters expose the bots' various strengths and weaknesses.



Alice performs best with a total score of **20.9**, followed by Akiko (**19.5**) and Airhead (**11.7**). Alice is the most structured and persuasive but can be repetitive, while Akiko introduces strong ethical and philosophical arguments but struggles to engage with opposing views. Airhead contributes little to discussions, often repeating statements without adding depth. The effectiveness of each bot depends on the conversation partner: Alice excels in structured debates, Akiko in philosophical discussions, and Airhead is the weakest overall.


In [55]:
# Add the code for your evaluation here.

# Simulate Akiko talking to Bob
akiko_bob = simulate.simulated_dialogue(argubots.akiko, agents.bob, 6)
rich.print(akiko_bob)

# Evaluate Akiko
akiko_eval_participant = evaluate.eval_by_participant(characters.bob, "Akiko", akiko_bob)
akiko_eval_observer = evaluate.eval_by_observer(evaluate.default_judge, "Akiko", akiko_bob)

# Combine evaluations
akiko_total_eval = akiko_eval_participant + akiko_eval_observer

# Run full evaluation on Akiko
with LoggingContext("eval", "INFO"):
    akiko_full_eval = evaluate.eval_on_characters(argubots.akiko)

# Print Akiko's full evaluation results
rich.print(akiko_full_eval)

# Print Mean and Standard Deviation comparison
rich.print("\n--- Comparison of Scores ---")
rich.print(f"Alice Mean Score: {saved_evalsum['Alice'].mean()}")
rich.print(f"Alice Std Dev: {saved_evalsum['Alice'].sd()}")

rich.print(f"Airhead Mean Score: {saved_evalsum['Airhead'].mean()}")
rich.print(f"Airhead Std Dev: {saved_evalsum['Airhead'].sd()}")

rich.print(f"Akiko Mean Score: {saved_evalsum['Akiko'].mean()}")
rich.print(f"Akiko Std Dev: {saved_evalsum['Akiko'].sd()}")

# Print Example Dialogues
rich.print("\n--- Sample Dialogues & Evaluations ---")
rich.print("Alice Example Dialogue:\n", saved_dialogues['Alice'][2])
rich.print("Airhead Example Dialogue:\n", saved_dialogues['Airhead'][2])
rich.print("Akiko Example Dialogue:\n", saved_dialogues['Akiko'][2])


100%|██████████| 10/10 [01:45<00:00, 10.59s/it]


## Retrieval-augmented generation (Aragorn)

Akiko has two real weaknesses:

* They can only make statements that are already in Kialo.
* They don't respond to the user's actual statement, but to a single retrieved Kialo claim that may not accurately reflect the user's position (it just overlaps in words).

But we also have access to an LLM, which is able to generate new, contextually appropriate text (as Alice does).

In this section, you will create an argubot named Aragorn, who is basically the love child of Akiko and Alice, combining the high-quality specific content of Kialo with the broad competence of an LLM.

The RAG in aRAGorn's name stands for **retrieval-augmented generation**. Aragorn is an agent that will take 3 steps to compute its `Agent.response()`:

1. **Query formation step:** Ask the LLM what claim should be responded to. For example, consider the following dialogue:

    >  ... Aragorn: Fortunately, the vaccine was developed in record time. Human: Sounds fishy.

    "Sounds fishy" is exactly the kind of statement that Akiko had trouble using as a Kialo query. But Aragorn shows the whole dialogue to the LLM, and asks the LLM what the human's last turn was really saying or implying, in that context. The LLM answers with a much longer statement:

    >  Human [paraphrased]: A vaccine that was developed very quickly cannot be trusted. If its developers are claiming that it is safe and effective, I question their motives.

    This paraphrase makes an explicit claim and can be better understood without the context. It also contains many more word types, which makes it more likely that BM25 will be able to find a Kialo claim with a nontrivial number of those types.

2. **Retrieval step:** Look up claims in Kialo that are similar to the explicit claim. Create a short "document" that describes some of those claims and their neighbors on Kialo.

3. **Retrieval-augmented generation:** Prompt the LLM to generate the response (like any `LLMAgent`). But include the new "document" somewhere in the LLM prompt, in a way that influences the response.

    Thus, the LLM can respond in a way that is appropriate to the dialogue but also draws on the curated information that was retrieved in Kialo. After all, it is a Transformer and can attend to both!
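The three steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the starter code's `Agent` interface: `rag_response`, `llm`, and `retrieve` are hypothetical names, and the fake LLM and fake retriever below exist only so the example runs without an API key.

```python
from typing import Callable

def rag_response(
    dialogue: list[tuple[str, str]],
    llm: Callable[[str], str],        # hypothetical: takes a prompt, returns text
    retrieve: Callable[[str], str],   # hypothetical: takes a claim, returns a "document"
) -> str:
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in dialogue)

    # Step 1 (query formation): ask the LLM to make the last turn explicit.
    claim = llm(
        "Restate the last speaker's turn as an explicit, self-contained claim.\n\n"
        + transcript
    )

    # Step 2 (retrieval): look up related material for that explicit claim.
    document = retrieve(claim)

    # Step 3 (generation): answer the dialogue with the document in the prompt.
    return llm(
        f"{transcript}\n\nBackground that may be useful:\n{document}\n\n"
        "Reply to the last turn of the dialogue."
    )

# Stand-ins so that the sketch runs offline.
calls: list[str] = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    if len(calls) == 1:
        return "A vaccine that was developed very quickly cannot be trusted."
    return "(generated reply)"

def fake_retrieve(claim: str) -> str:
    return f'Related claim: "{claim}"'

reply = rag_response(
    [("Aragorn", "Fortunately, the vaccine was developed in record time."),
     ("Human", "Sounds fishy.")],
    fake_llm,
    fake_retrieve,
)
print(reply)
```

Passing the LLM and the retriever in as functions keeps the control flow visible and makes each step easy to swap out or test on its own. A real implementation would make the same two LLM calls through whatever client the agent class already uses.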

Here's an example of the kind of document you might create at the retrieval step, though it may be possible to do better than this:


In [56]:
# refers to global `kialo` as defined above
def kialo_responses(s: str) -> str:
    c = kialo.closest_claims(s, kind='has_cons')[0]
    result = f'One possibly related claim from the Kialo debate website:\n\t"{c}"'
    if kialo.pros[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users in favor of that claim:"] + kialo.pros[c])
    if kialo.cons[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users against that claim:"] + kialo.cons[c])
    return result
        
print(kialo_responses("Animal flesh is yucky to think about, yet delicious."))

One possibly related claim from the Kialo debate website:
	"So many people are worried about animals but don't even think twice when walking by a homeless person on the streets. It's preposterous. How about we worry about our own kind first and then start talking about animals."
Some arguments from other Kialo users against that claim:
	* This implies that caring for animals or caring for people is a binary choice. It isn't. There are those who are well placed and willing to care for people and those who prefer to serve the animal kingdom. As a species we don't just have one idea at a time and follow that to conclusion before we pursue another. It benefits all if humans divide their attentions between various issues and problems we face.
	* Humans have freedom of choice to some extent, animals subdued by humans don't. The very intention of help urges it to go where is most needed. And so far never was any biggest, flagrant and needless cruelty and slaughter as that towards industrial f

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 4: Create Aragorn, an argubot using retrieval-augmented generation</b>
</div>

You should implement Aragorn in `argubots.py` (at the very bottom) as an instance called `aragorn` of a new class `RAGAgent` that is a subclass of `Agent` or `LLMAgent`.

Once implemented, you should be able to run the following cell.

In [67]:
from logging_cm import LoggingContext
from simulate import simulated_dialogue

# Have Aragorn talk to Darius and spy on the back-end messages to/from the LLM server.
with LoggingContext("agents", "INFO"):
    aragorn_darius = simulated_dialogue(argubots.aragorn, agents.darius, 6)

rich.print(aragorn_darius)

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 5: Evaluate Aragorn</b>
</div>

As you did for Task 2, use the model-based evaluation strategy shown in `demo.py`.

Then compare Aragorn's results to those you got for Airhead and Alice. Who does best? What are the differences in the subscores and comments? Does it matter which character you're evaluating on? Perhaps the different characters expose the bots' various strengths and weaknesses.

Try to figure out how to improve Aragorn's score. Can you beat Alice?

In [68]:
# Add the code for your evaluation here.
# Evaluate Aragorn
aragorn_dialogue = simulate.simulated_dialogue(argubots.aragorn, agents.darius, 6)
rich.print(aragorn_dialogue)

# Evaluate Aragorn as seen by the opponent (participant evaluation)
aragorn_eval_participant = evaluate.eval_by_participant(characters.darius, "Aragorn", aragorn_dialogue)
# Evaluate Aragorn by an external judge (observer evaluation)
aragorn_eval_observer = evaluate.eval_by_observer(evaluate.default_judge, "Aragorn", aragorn_dialogue)

# Combine the evaluations (if needed for later use)
aragorn_total_eval = aragorn_eval_participant + aragorn_eval_observer

# Run full evaluation on Aragorn
with LoggingContext("eval", "INFO"):
    aragorn_full_eval = evaluate.eval_on_characters(argubots.aragorn)

# Print Aragorn's full evaluation results
rich.print(aragorn_full_eval)

# Print Mean and Standard Deviation comparison across all agents
rich.print("\n--- Comparison of Scores ---")
rich.print(f"Alice Mean Score: {saved_evalsum['Alice'].mean()}")
rich.print(f"Alice Std Dev: {saved_evalsum['Alice'].sd()}")

rich.print(f"Airhead Mean Score: {saved_evalsum['Airhead'].mean()}")
rich.print(f"Airhead Std Dev: {saved_evalsum['Airhead'].sd()}")

rich.print(f"Akiko Mean Score: {saved_evalsum['Akiko'].mean()}")
rich.print(f"Akiko Std Dev: {saved_evalsum['Akiko'].sd()}")

rich.print(f"Aragorn Mean Score: {saved_evalsum['Aragorn'].mean()}")
rich.print(f"Aragorn Std Dev: {saved_evalsum['Aragorn'].sd()}")

# Print Sample Dialogues & Evaluations for each agent
rich.print("\n--- Sample Dialogues & Evaluations ---")
rich.print("Alice Example Dialogue:\n", saved_dialogues['Alice'][2])
rich.print("Airhead Example Dialogue:\n", saved_dialogues['Airhead'][2])
rich.print("Akiko Example Dialogue:\n", saved_dialogues['Akiko'][2])
rich.print("Aragorn Example Dialogue:\n", saved_dialogues['Aragorn'][2])


100%|██████████| 10/10 [04:01<00:00, 24.14s/it]


## Victor

Add another LLM-based argubot to `argubots.py`. Call it ***Victor***. Try to make it get the best score, according to `evaluate.eval_on_characters`.

You may want to use Aragorn or Alice as your starting point. Then see if you can find tricks that will get a more awesome score for Victor. How you choose to do that is up to you, but some ideas are suggested below.

(Reminder: *Don't change evaluation.* Just build a better argubot.)

In your report, you should describe what you did and discuss what you found. If the idea was interesting and you implemented it correctly and well, it's okay if it turns out not to help the evaluation score. Many good ideas don't work. That's why you need to keep finding and trying new good ideas. (Sometimes an idea does help, but in a way that is not picked up by the scoring metric. If that is the case, make sure to discuss this in detail in your report.)

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 6: Create Victor</b>
</div>

You should implement Victor in `argubots.py` (at the very bottom).

Once implemented, you should be able to run the following cell.

In [80]:
from logging_cm import LoggingContext
from simulate import simulated_dialogue

# Have Victor talk to Darius and spy on the back-end messages to/from the LLM server.
with LoggingContext("agents", "INFO"):
    victor_darius = simulated_dialogue(argubots.victor, agents.darius, 6)

rich.print(victor_darius)

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 7: Evaluate Victor</b>
</div>

As you did for Tasks 2, 3, and 5, use the model-based evaluation strategy shown in `demo.py`.

Compare Victor's evaluation results to those for the other argubots. And as for Aragorn, dig into the results a bit. Look at the subscores and evaluation comments. Read the simulated dialogs and see whether you can identify patterns of what Victor does well and where they struggle. Did the ideas you had for how to make Victor awesome have the effect you thought they would?

In [81]:
# Evaluate Victor
victor_dialogue = simulate.simulated_dialogue(argubots.victor, agents.darius, 6)
rich.print(victor_dialogue)

# Evaluate Victor as seen by the opponent (participant evaluation)
victor_eval_participant = evaluate.eval_by_participant(characters.darius, "Victor", victor_dialogue)
# Evaluate Victor by an external judge (observer evaluation)
victor_eval_observer = evaluate.eval_by_observer(evaluate.default_judge, "Victor", victor_dialogue)

# Combine the evaluations (if needed for later use)
victor_total_eval = victor_eval_participant + victor_eval_observer

# Run full evaluation on Victor
with LoggingContext("eval", "INFO"):
    victor_full_eval = evaluate.eval_on_characters(argubots.victor)

# Print Victor's full evaluation results
rich.print(victor_full_eval)

# Print Mean and Standard Deviation comparison across all agents
rich.print("\n--- Comparison of Scores ---")
rich.print(f"Alice Mean Score: {saved_evalsum['Alice'].mean()}")
rich.print(f"Alice Std Dev: {saved_evalsum['Alice'].sd()}")

rich.print(f"Airhead Mean Score: {saved_evalsum['Airhead'].mean()}")
rich.print(f"Airhead Std Dev: {saved_evalsum['Airhead'].sd()}")

rich.print(f"Akiko Mean Score: {saved_evalsum['Akiko'].mean()}")
rich.print(f"Akiko Std Dev: {saved_evalsum['Akiko'].sd()}")

rich.print(f"Aragorn Mean Score: {saved_evalsum['Aragorn'].mean()}")
rich.print(f"Aragorn Std Dev: {saved_evalsum['Aragorn'].sd()}")

rich.print(f"Victor Mean Score: {saved_evalsum['Victor'].mean()}")
rich.print(f"Victor Std Dev: {saved_evalsum['Victor'].sd()}")

# Print Sample Dialogues & Evaluations for each agent
rich.print("\n--- Sample Dialogues & Evaluations ---")
rich.print("Alice Example Dialogue:\n", saved_dialogues['Alice'][2])
rich.print("Airhead Example Dialogue:\n", saved_dialogues['Airhead'][2])
rich.print("Akiko Example Dialogue:\n", saved_dialogues['Akiko'][2])
rich.print("Aragorn Example Dialogue:\n", saved_dialogues['Aragorn'][2])
rich.print("Victor Example Dialogue:\n", saved_dialogues['Victor'][2])


100%|██████████| 10/10 [06:40<00:00, 40.03s/it]
