# Project 3: Large Language Models

## Getting started

### Python setup

**Create a virtual environment**

If you are working on a lab computer, check whether there is already a virtual environment in `/var/csc306`. In a terminal, run:
```
ls /var/csc306
```
If you see a folder named `csc306.venv`, someone has already created a virtual environment on this computer.

If not, run:
```
python3 -m venv /var/csc306/csc306.venv
```

(If you are working on your own computer, you can put the virtual environment wherever you want. Replace `/var/csc306` accordingly.)

**Activate virtual environment**

```
source /var/csc306/csc306.venv/bin/activate
```

**Install libraries**
```
pip install ipykernel openai python-dotenv rich rank_bm25
```

**Select a kernel**

To execute cells in this notebook, you need to connect to a Python kernel. You want to use the virtual environment you created. In VS Code, press `Ctrl-Shift-P` and then type `Python: Select Interpreter`. That should bring up a drop-down menu. Choose "Enter interpreter path" and then "Browse your file system to find a Python interpreter". Navigate to `/var/csc306/csc306.venv/bin/python`.

Now click "Select Kernel" in the top right of your VS Code editor panel. Then choose "Python environments". Your virtual environment should be one of the options.

Now, you should be able to run the following cell.

In [None]:
print("Hello Python!")

**Autoreload imported Python files when they change**

In [2]:
# Executing this cell will ensure that imported modules (.py files) will automatically
# be reloaded when they change. (However, objects that were defined with the old
# version of the class won't change.)
%load_ext autoreload
%autoreload 2

### Create an OpenAI client

An OpenAI API key will be sent to you.

Make a `.env` file in the same directory as this notebook, containing the following:
```
export OPENAI_API_KEY=[your API key]    # do not include the brackets here
```
Make sure others can't read this file:
```
chmod 600 .env
```
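
If you want to double-check the permissions from Python, here is a minimal sketch. The helper name `is_private` is hypothetical (it is not part of the starter code), and the check assumes a POSIX-style file system such as the lab machines.

```python
import os
import stat

def is_private(path: str) -> bool:
    """Return True if neither group nor others have any permission bits on `path`."""
    mode = os.stat(path).st_mode
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0

# Warn if .env exists but can be accessed by other users.
if os.path.exists(".env") and not is_private(".env"):
    print("Warning: .env is accessible to other users; run `chmod 600 .env`.")
```

After `chmod 600 .env`, `is_private(".env")` should return `True`.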

**Be sure to keep the key secret.  It gives access to a billable account.** If OpenAI finds it on the public web, they will invalidate it, and then no one (including you) can use this key to make requests anymore.

Now you can execute the following to get an OpenAI client object.

In [3]:
from tracking import new_default_client, read_usage
client = new_default_client()

That fetches your API key and calls `openai.OpenAI()` to make a new **client** object, whose job is to talk to the OpenAI **server** over HTTP.  (The `OpenAI` constructor has some optional arguments that configure these HTTP messages. However, the defaults should work fine for you.)

That command also saved the new client in `tracking.default_client`, which is the client that the starter code will use by default whenever it needs to talk to the OpenAI server.  Thus, you should **rerun the above cell** to get a new client if you change the `default_model` in `tracking.py`, or if your API key in  `.env` ever changes, or its associated organization ever changes.

### Try the model!

You can now get answers from OpenAI models by calling methods of the `client` instance.

Here is the function from class. Try it to make sure you can access the OpenAI API.

In [None]:
def complete(client, s: str, model="gpt-3.5-turbo-0125", *args, **kwargs):
    response = client.chat.completions.create(messages=[{"role": "user", "content": s}],
                                              model=model,
                                              *args, **kwargs)
    return [choice.message.content for choice in response.choices]

complete(client, "I went to the store and I bought apples, bananas, cherries, donuts, eggs",
         n=10, temperature=0.6, max_tokens=96)

### Compute a function using instructions and few-shot prompting

Let's try prompting the model with a sequence of multiple messages. In this case, we provide some instructions as well as few-shot prompting (actually just one-shot in this case).

Instructions are in the `system` message. The few-shot prompting consists of example inputs (`user` messages) followed by their example outputs (`assistant` messages). Then we give our real input (the final `user` message), and hope that the LLM will continue the pattern by generating an analogous output (a new `assistant` message).
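
The message pattern just described can be assembled programmatically, which is handy once you start varying the number and order of examples. This is only a sketch: `build_messages` is a hypothetical helper, not something defined in the starter code.

```python
from typing import Dict, List, Tuple

def build_messages(instructions: str,
                   examples: List[Tuple[str, str]],
                   query: str) -> List[Dict[str, str]]:
    """Assemble a chat prompt: system instructions, few-shot pairs, then the real input."""
    messages = [{"role": "system", "content": instructions}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})        # example input
        messages.append({"role": "assistant", "content": example_output})  # example output
    messages.append({"role": "user", "content": query})                    # real input
    return messages

messages = build_messages(
    "Reverse the order of the words.",
    [("Good things come to those who wait.", "Wait who those to come things good.")],
    "Colorless green ideas sleep furiously.",
)
```

The resulting list has the same shape as the `messages=` argument in the cell that follows, so it could be passed to `client.chat.completions.create` in its place.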

In [None]:
import rich
response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
                                                      "content": "Reverse the order of the words." },
                                                    { "role": "user",        # input
                                                      "content": "Good things come to those who wait." },
                                                    { "role": "assistant",   # output
                                                      "content": "Wait who those to come things good." },
                                                    { "role": "user",        # input
                                                      "content": "Colorless green ideas sleep furiously." }],
                                          model="gpt-4o-mini", temperature=0)
#rich.print(response)
response.choices[0].message.content

By modifying this call, can you get it to produce different versions of the output? Some possible behaviors you could try to arrange:

* a specific other way of formatting the output, e.g., wait, who, those, to, come, things, good
* match the input's formatting in the output (same use of capitalization, punctuation, commas)
* reverse the phrases rather than reversing the words, e.g., To those who wait come good things.

You can try playing with the number, the content, and the order of few-shot examples, and changing or removing the instructions.

What happens if the examples conflict with the instructions?

### Check your usage so far

Please be careful not to write loops that use lots and lots of tokens. That will cost us money, and could hit the per-day usage limit that is shared by the whole class.

Execute the cell below whenever you want to see your cost so far. Or, just open `usage_openai.json` as a tab in your IDE.

In [None]:
read_usage()

## Dialogues and dialogue agents

The goal of this assignment is to create a good "argubot" that will talk to people about controversial topics and broaden their minds.

### A first argubot (Airhead)

You can have a conversation right now with a really bad argubot named Airhead. Try asking it about climate change! When you're done, reply with an empty string.

(The `converse()` method calls Python's `input()` function, which will prompt you for input at the command-line or by popping up a box in your IDE. In VS Code, the input box appears at the top edge of the window.)


In [None]:
import argubots
d = argubots.airhead.converse()

A bot (short for "robot") is a system that acts autonomously. That corresponds to the AI notion of an agent — a system that uses some policy to choose actions to take.

The airhead agent above (defined in `argubots.py`) uses a particularly simple policy.
It is an instance of a simple Agent subclass called `ConstantAgent` (defined in `agents.py`).
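
To make the idea of a "particularly simple policy" concrete, here is a toy sketch of a constant agent. It is not the code in `agents.py`: the class `ToyConstantAgent` is hypothetical, a dialogue is represented as a plain list of dictionaries, and the `content` key is an assumption. Read the real `ConstantAgent` for the actual details.

```python
from typing import Dict, List

Turn = Dict[str, str]  # assumed turn format: {"speaker": ..., "content": ...}

class ToyConstantAgent:
    """A toy agent whose policy ignores the dialogue and always says the same thing."""

    def __init__(self, name: str, reply: str) -> None:
        self.name = name
        self.reply = reply

    def response(self, dialogue: List[Turn]) -> str:
        # The policy: the chosen action does not depend on the dialogue so far.
        return self.reply

    def respond(self, dialogue: List[Turn]) -> List[Turn]:
        # Return a new, extended dialogue rather than mutating the old one.
        return dialogue + [{"speaker": self.name, "content": self.response(dialogue)}]

toy = ToyConstantAgent("Toy", "I know, right?")
d_toy = toy.respond([{"speaker": "Human", "content": "What about climate change?"}])
```

A more interesting agent differs only in `response`: its policy looks at the dialogue before choosing what to say.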

The result of talking to airhead is a `Dialogue` object (defined in `dialogue.py`). Let's look at it.

In [None]:
rich.print(d)

Each turn of this dialogue is just a tiny dictionary:

In [None]:
d[0]

### An LLM argubot (Alice)

In other CS courses, you may have encountered "conversations" between characters named Alice and Bob.

Let's try talking to the Alice of this homework, who is a much stronger baseline than Airhead. Your job in this assignment is to improve upon Alice. We'll meet Bob later.

In [None]:
# call with argument d if you want to append to the previous conversation
alicechat = argubots.alice.converse()

As you may have guessed, `alice` is powered by a prompted LLM. You can find the specific prompt in `argubots.py`.

So, while `agents.py` provides the core functionality for `Agent` objects, the argubot agents like `alice` -- and the ones that you will write! -- go into `argubots.py` instead. This is just to keep the files small.

### Simulating human characters (Bob & friends)

You'll talk to your own argubots to get a qualitative feeling for their strengths and weaknesses.
But can you really be sure you're making progress? For that, a quantitative measure can be helpful.

Ultimately, you should test an argubot like Alice by having it argue with many real humans — not just you — and using some rubric to score the resulting dialogues. But that would be slow and complicated to arrange.

So, meet Bob! He's just a simulated human. You won't edit him: he is part of the development set. Here is some information about him (from `characters.py`):

In [None]:
import characters
rich.print(characters.bob)

You can't talk directly to `characters.bob` because that's just a data object. However, you can construct a simple agent that uses that data (plus a few more instructions) to prompt an LLM.

(Which LLM does it prompt? The `CharacterAgent` constructor (defined in `agents.py`) defaults to `gpt-4o-mini` as specified in `tracking.py`. But you can override that using keyword arguments.)

Try talking to Bob about climate change, too.

In [None]:
from agents import CharacterAgent
# actually, agents.bob is already defined this way
bob = CharacterAgent(characters.bob)
# returns a dialogue, but we've already seen it so we don't want to print it again
bob.converse()
# don't print anything for this notebook cell
None

Of course, a proper user study can't just be conducted with one human user.

So, meet our bevy of beautiful Bobs! (They're not actually all named Bob — we continued on in the alphabet.)

In [None]:
import agents
agents.devset

In [None]:
agents.cara.converse()
None

You can see the underlying character data here in the notebook. Your argubot will have to deal with all of these topics and styles!

In [None]:
rich.print(characters.devset)

### Simulating conversation

We can make Alice and Bob chat.

In [None]:
from dialogue import Dialogue
d = Dialogue()                                              # empty dialogue
d = d.add('Alice', "Do you think it's okay to eat meat?")   # add first turn
print(d)

In [None]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
rich.print(d)

In [None]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
rich.print(d)

Anyway, let's see what happens when Alice and Bob talk for a while...

In [None]:
from simulate import simulated_dialogue
d = simulated_dialogue(argubots.alice, agents.bob, 8)
rich.print(d)

Sometimes this kind of conversation seems to stall out, with Bob in particular repeating himself a lot. Alice doesn't seem to have a good strategy for getting him to open up. Maybe you can do a better job talking to Bob, and that will give you some ideas about how to improve Alice?

In [None]:
# your name, pulled from an earlier dialogue
myname = alicechat[0]['speaker']
# reuse the same first two turns, then type your own lines!
agents.bob.converse(d[0:2].rename('Alice', myname))
None

You can also try talking to the other characters and having Alice (or Airhead) talk to them.

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 1: Define a simulated human character</b>
</div>

Define an additional character.

In [25]:
from characters import Character

# See characters.py for how to use the Character class.
# Add the definition of your character here.

**Note:** Please don't change the dev set — the characters we just loaded must stay the same. Your job in this homework is to improve the argubot (or at least try). And that means improving it according to a fixed and stable evaluation measure.

## Model-based evaluation

What is our goal for the argubot? We'd like it to broaden the thinking of the (simulated) human that it is talking to. Indeed, that's what Alice's prompt tells Alice to do.

This goal is inspired by the recent paper [Opening up Minds with Argumentative Dialogues](https://aclanthology.org/2022.findings-emnlp.335/), which collected human-human dialogues:

> In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding to views that are unfamiliar or in opposition to their own convictions. ... Success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs.

Arguments of this sort are not like chess or tennis games, with an actual winner. The argubot will almost never hear a human say "You have convinced me that I was wrong." But the argubot did a good job if the human developed increased understanding and respect for an opposing point of view.

To find out whether this happened, we can use a questionnaire to ask the human what they thought after the dialogue. For example, after Alice talks to Bob, we'll ask Bob to evaluate what he thinks of Alice's views. Of course, that depends on his personality -- Alice needs to talk to him in a way that reaches him (as much as possible). We'll also ask an outside observer to evaluate whether Alice handled the conversation with Bob well.

Of course, we're still not going to use real humans. Bob is a fake person, and so is the outside observer (whose name is Judge Wise). Using an LLM as an eval metric is known as model-based evaluation. It has pros and cons:

* It is cheaper, faster, and more replicable than hiring actual humans to do the evaluation.
* It might give different answers than what humans would give.

Social scientists usually refer to a metric's reliability (low variance) and validity (low bias). So the points above say that model-based evaluation is reliable but not necessarily valid. In general, an LLM-based metric (like any metric) needs to be validated to confirm that it really does measure what it claims to measure. (For example, that it correlates strongly with some other measure that we already trust.) In this homework, we'll skip this step and just pray that the metric is reasonable.
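
Although we skip validation in this homework, one simple form of it is to check how strongly the LLM-based scores correlate with scores you already trust (for example, human ratings of the same dialogues). The sketch below computes a Pearson correlation in plain Python. The numbers are toy values made up purely for illustration; they are not results from this assignment.

```python
from math import sqrt
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)

# Toy, made-up scores: one pair per dialogue (LLM-based score, trusted score).
llm_scores = [2.0, 3.5, 4.0, 1.0, 5.0]
trusted_scores = [2.5, 3.0, 4.5, 1.5, 4.0]
r = pearson(llm_scores, trusted_scores)
```

A value of `r` near 1 would be some evidence of validity; a value near 0 would suggest the metric is measuring something else.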

To see how this works out in practice, **open up the `demo` notebook**, which walks you through the evaluation protocol. You'll see how to call the starter code, how it talks to the LLM behind the scenes, and what it is able to accomplish.

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 2: Evaluate Alice and Airhead</b>
</div>

To establish baselines for the argubots you are going to develop later on, evaluate the argubots Airhead and Alice using the model-based evaluation strategy shown in the `demo` notebook. That is, use `evaluate.eval_on_characters`.

This also helps to validate the metric. Airhead should get a low score!

In [None]:
# Add the code for your evaluation here.

## Reading the starter code

The `demo` notebook gave you a good high-level picture of what the starter code is doing. So now you're probably curious about the details. Now that you've had the view from the top, here's a good bottom-up order in which to study the code. You don't need to understand every detail, but you will need to understand enough to call it and extend it.

* `characters.py`. The `Character` class is short and easy.

* `dialogue.py`. The `Dialogue` class is meant to serve as a record of a natural-language conversation among any number of humans and/or agents. On each turn of the dialogue, one of the speakers says something.

The dialogue's sequence of turns may remind you of the sequence of messages that is sent to OpenAI's chat completions API. But the OpenAI messages are only labeled with the 4 special roles `user`, `assistant`, `tool`, and `system`. Those are not quite the same thing as human speakers. And the OpenAI messages do not necessarily form a natural-language dialogue: some of the messages are dealing with instructions, few-shot prompting, tool use, and so on. The `agents.dialogue_to_openai` function in the next module will map a `Dialogue` to a (hopefully appropriate) sequence of messages for asking the LLM to extend that dialogue.

* `agents.py`. This module sets up the problem of automatically predicting the next turn in a dialogue, by implementing an `Agent`'s `response()` method. The `Agent` base class also has some simple convenience methods that you should look at.

Some important subclasses of `Agent` are defined here as well. However, you may want to skip over `EvaluationAgent` and come back to it only when you read `evaluate.py`.

* `simulate.py` makes agents talk to one another, which we'll do during evaluation.

* `argubots.py` starts to describe some useful agents. One of them makes use of the `kialo.py` module, which gives access to a database of arguments.

* `evaluate.py` makes use of `simulate.simulated_dialogue` and `agents.EvaluationAgent` to evaluate an argubot.

* We also have a couple of utility modules. These aren't about NLP; look inside if needed. `logging_cm.py` is what enabled the context manager `with LoggingContext(...):` in the `demo` notebook. `tracking.py` sets some global defaults about how to use the OpenAI API, and arranges to track how many tokens we're paying for when you call it.
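
To make the difference between dialogue turns and OpenAI messages more tangible, here is one plausible way to map between them. This is a sketch under assumptions, not the actual `agents.dialogue_to_openai`: the function `toy_dialogue_to_messages` is hypothetical, a dialogue is represented as a plain list of dictionaries, and the `content` key is assumed. The idea is that the LLM plays one speaker, so that speaker's turns become `assistant` messages and everyone else is folded into `user`.

```python
from typing import Dict, List

def toy_dialogue_to_messages(turns: List[Dict[str, str]],
                             speaker: str,
                             system: str = "") -> List[Dict[str, str]]:
    """Map dialogue turns to chat messages from the point of view of `speaker`."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    for turn in turns:
        if turn["speaker"] == speaker:
            # The LLM is playing this speaker, so these are its own past outputs.
            messages.append({"role": "assistant", "content": turn["content"]})
        else:
            # Other speakers share the `user` role; keep the name in the text.
            messages.append({"role": "user",
                             "content": f'{turn["speaker"]}: {turn["content"]}'})
    return messages

example_turns = [
    {"speaker": "Alice", "content": "Do you think it's okay to eat meat?"},
    {"speaker": "Bob", "content": "Sure, why not?"},
]
example_messages = toy_dialogue_to_messages(example_turns, speaker="Bob",
                                            system="You are Bob.")
```

Compare this with what the starter code really does; for instance, it may handle multi-party dialogues or consecutive turns by the same speaker differently.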

## Similarity-based retrieval: Looking up relevant responses

Now, it is fine to prompt an LLM to generate text, but there are other methods! There is a long history of machine learning methods that "memorize" the training data. To make a prediction or decision at test time, they consult the stored training examples that are most similar to the test situation.

*Similarity-based retrieval* means that given a document x, you find the "most similar" documents y∈Y, where Y is a given collection of documents. The most common way to do this is to maximize the cosine similarity between the embeddings of x and y.
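
As a small illustration of retrieval by cosine similarity, the sketch below ranks a collection of embedding vectors against a query vector. The function names `cosine` and `most_similar` are hypothetical, and the embeddings are assumed to be plain lists of numbers of equal length.

```python
from math import sqrt
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (defined as 0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(x: Sequence[float], ys: List[Sequence[float]], n: int = 1) -> List[int]:
    """Indices of the n vectors in `ys` with the highest cosine similarity to `x`."""
    ranked = sorted(range(len(ys)), key=lambda i: cosine(x, ys[i]), reverse=True)
    return ranked[:n]

# Toy 2-dimensional "embeddings" of three documents, queried with [1.0, 0.0].
top_two = most_similar([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]], n=2)
```

Note that cosine similarity ignores vector length, so `[2.0, 0.1]` counts as very close to `[1.0, 0.0]` even though it is twice as long.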

A simple and fast approach is to use a bag-of-tokens embedding function: define the embedding e(y) to be the vector of length |V| that records the count of each type of token in a tokenized version of y, where V is the token vocabulary. BM25 is a refined variant of that idea, where the counts are adjusted in 3 ways:

* smooth the counts
* normalize for the document length |y| so that longer documents y are not more likely to be retrieved
* downweight tokens that are more common in the corpus (such as "the" or "ing") since they provide less information about the content of the document
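
To see how these adjustments fit together, here is a sketch of one common BM25 scoring formula in plain Python. It is meant for reading, not as a replacement for a library: the function name `bm25_scores` is hypothetical, the defaults `k1=1.5` and `b=0.75` are conventional choices rather than anything this assignment prescribes, and real implementations may differ in details such as the exact IDF formula.

```python
from collections import Counter
from math import log
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against a tokenized query (one BM25 variant)."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each token type occur?
    doc_freq = Counter(tok for d in docs for tok in set(d))
    scores = []
    for d in docs:
        counts = Counter(d)  # the bag-of-tokens counts for this document
        score = 0.0
        for tok in query:
            tf = counts[tok]
            if tf == 0:
                continue
            # Downweight tokens that occur in many documents of the corpus.
            idf = log(1 + (n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5))
            # Normalize for document length relative to the average length.
            length_norm = 1 - b + b * len(d) / avg_len
            # Saturating count: grows with tf but levels off as tf gets large.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_docs = [["the", "cat", "sat"],
            ["the", "dog", "barked", "at", "the", "cat"],
            ["animal", "populations", "decline"]]
toy_scores = bm25_scores(["cat"], toy_docs)
```

Here the first toy document scores higher than the second for the query `["cat"]`: both contain "cat" once, but the first is shorter.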

You might like to play with the `rank_bm25` package (see its documentation). It is widely used and very easy to use.


In [26]:
from rank_bm25 import BM25Okapi  # the standard BM25 method

# experiment here!  You could try the examples in the rank_bm25 documentation.



### The Kialo corpus

How can we use similarity-based retrieval to help build an argubot? It's largely about having the right data!

[Kialo](https://www.kialo.com/) is a collaboratively edited website (like Wikipedia) for discussing political and philosophical topics. For each topic, the contributors construct a tree of claims. Each claim is a natural-language sentence (usually), and each of its children is another claim that supports it ("pro") or opposes it ("con"). For example, check out the tree rooted at the claim ["All humans should be vegan."](https://www.kialo.com/all-humans-should-be-vegan-2762).

We provide a class `Kialo` for browsing a collection of such trees. Please read the source code in `kialo.py`. The class constructor reads in text files that are [exported Kialo discussions](https://support.kialo.com/en/hc/exporting-a-discussion/); we have provided some in the `data` directory. The class includes a BM25 index, to be able to find claims that are relevant to a given string.

In [28]:
from kialo import Kialo

Ok, let's pull the retrieved discussions (the `.txt` files) into our data structure.

For BM25 purposes, we have to be able to turn each document (that is, each Kialo claim) into a list of string or integer tokens.


In [None]:
#from typing import List
import glob

# use simple default tokenizer
kialo = Kialo(glob.glob("data/*"))
f"This Kialo subset contains {len(kialo)} claims"

In [None]:
kialo.random_chain()   # just a single random claim

In [None]:
kialo.random_chain(n=4)

### Similarity-based retrieval from the Kialo corpus

Let's try it, using BM25!

In [None]:
kialo.closest_claims("animal populations", n=10)

We can restrict to claims for which the Kialo data structure has at least one counterargument ("con" child).

In [None]:
kialo.closest_claims("animal populations", n=10, kind='has_cons')

In [None]:
c = _[0]    # first claim above
print("Parent claim:\n\t" + str(kialo.parents[c]))
print("Claim:\n\t" + c)
print('\n\t* '.join(["Pro children:"] + kialo.pros[c]))
print('\n\t* '.join(["Con children:"] + kialo.cons[c]))

### Some limitations of BM25

Unfortunately, we see that "animal population" gives quite different results from "animal populations". Why is that and how would you fix it?

Also, both queries seem to retrieve some claims that are talking about human populations, not animal populations. Why is that and how would you fix it?

In [None]:
kialo.closest_claims("animal population", n=10)

In [None]:
kialo.closest_claims("Hi Akiko. Are you vegan?", n=3, kind='has_cons')

### A retrieval bot (Akiko)

The starter code defines a simple argubot named Akiko (defined in `argubots.py`) that doesn't use an LLM at all. It simply finds a Kialo claim that is similar to what the human just said, and responds with one of the Kialo counterarguments to that claim.
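
The policy just described is easy to sketch on toy data. The code below is not Akiko's actual implementation: `toy_cons`, `overlap`, and `toy_retrieval_reply` are hypothetical names, the claims are invented rather than taken from Kialo, and plain word overlap stands in for BM25.

```python
import random
from typing import Dict, List

# Invented toy data: each claim maps to its "con" children (counterarguments).
toy_cons: Dict[str, List[str]] = {
    "Cats make better pets than dogs.": [
        "Dogs can be trained to help their owners.",
        "Many cats ignore their owners.",
    ],
    "Homework should be abolished.": [
        "Practice outside class helps students learn.",
    ],
}

def overlap(a: str, b: str) -> int:
    """Number of distinct lowercased words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def toy_retrieval_reply(human_turn: str, cons: Dict[str, List[str]],
                        rng: random.Random) -> str:
    """Find the claim most similar to the human's turn; reply with one of its cons."""
    closest = max(cons, key=lambda claim: overlap(human_turn, claim))
    return rng.choice(cons[closest])

toy_reply = toy_retrieval_reply("I think cats are better pets", toy_cons, random.Random(0))
```

Even this toy version shows the main limitation: the reply is chosen from stored text, so it can only be as relevant as the retrieved claim.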

Watch Akiko argue with Darius. You will first see log messages that show claims that Akiko retrieved, as well as the LLM calls that Darius made. Then you will see the dialogue between Akiko and Darius.

In [None]:
from logging_cm import LoggingContext
from simulate import simulated_dialogue

# Have Akiko talk to Darius and spy on the back-end messages to/from the LLM server.
with LoggingContext("agents", "INFO"):
    akiko_darius = simulated_dialogue(argubots.akiko, agents.darius, 6)

rich.print(akiko_darius)

Now talk to Akiko yourself. (Remember that Akiko only knows about subjects that it read about in the `data` directory. If you want to talk about something else, you can add more discussions from [kialo.com](https://www.kialo.com/); see the `LICENSE` file.)

In [None]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiko.converse()

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 3: Evaluate Akiko</b>
</div>

As you did for Task 2, use the model-based evaluation strategy shown in the `demo` notebook.

Then compare Akiko's results to those you got for Airhead and Alice. Who does best? What are the differences in the subscores and comments? Does it matter which character you're evaluating on -- maybe the different characters expose the bots' various strengths and weaknesses?

In [None]:
# Add the code for your evaluation here.

## Retrieval-augmented generation (Aragorn)

Akiko has two real weaknesses:

* It can only make statements that are already in Kialo.
* It doesn't respond to the user's actual statement, but to a single retrieved Kialo claim that may not accurately reflect the user's position (it just overlaps in words).

But we also have access to an LLM, which is able to generate new, contextually appropriate text (as Alice does).

In this section, you will create an argubot named Aragorn, who is basically the love child of Akiko and Alice, combining the high-quality specific content of Kialo with the broad competence of an LLM.

The RAG in aRAGorn's name stands for **retrieval-augmented generation**. Aragorn is an agent that will take 3 steps to compute its `Agent.response()`:

1. **Query formation step:** Ask the LLM what claim should be responded to. For example, consider the following dialogue:

    >  ... Aragorn: Fortunately, the vaccine was developed in record time. Human: Sounds fishy.

    "Sounds fishy" is exactly the kind of statement that Akiko had trouble using as a Kialo query. But Aragorn shows the whole dialogue to the LLM, and asks the LLM what the human's last turn was really saying or implying, in that context. The LLM answers with a much longer statement:

    >  Human [paraphrased]: A vaccine that was developed very quickly cannot be trusted. If its developers are claiming that it is safe and effective, I question their motives.

    This paraphrase makes an explicit claim and can be better understood without the context. It also contains many more word types, which makes it more likely that BM25 will be able to find a Kialo claim with a nontrivial number of those types.

2. **Retrieval step:** Look up claims in Kialo that are similar to the explicit claim. Create a short "document" that describes some of those claims and their neighbors on Kialo.

3. **Retrieval-augmented generation:** Prompt the LLM to generate the response (like any `LLMAgent`). But include the new "document" somewhere in the LLM prompt, in a way that it influences the response.

    Thus, the LLM can respond in a way that is appropriate to the dialogue but also draws on the curated information that was retrieved in Kialo. After all, it is a Transformer and can attend to both!
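
The data flow of the three steps above can be sketched as a function that takes each step as a pluggable component. This is only a skeleton under assumptions, not a solution: `rag_response` and the stub functions are hypothetical, the dialogue is passed as a plain string, and in your real `RAGAgent` the first and third components would call the LLM while the second would query Kialo.

```python
from typing import Callable

def rag_response(dialogue_text: str,
                 paraphrase: Callable[[str], str],
                 retrieve: Callable[[str], str],
                 generate: Callable[[str, str], str]) -> str:
    """Run the three RAG steps in order and return the final reply."""
    # Step 1 (query formation): turn the last turn into an explicit claim.
    explicit_claim = paraphrase(dialogue_text)
    # Step 2 (retrieval): build a short document of related claims.
    document = retrieve(explicit_claim)
    # Step 3 (generation): reply using both the dialogue and the document.
    return generate(dialogue_text, document)

# Stub components so the sketch runs without an LLM or a Kialo index.
def fake_paraphrase(dialogue_text: str) -> str:
    return "A vaccine that was developed very quickly cannot be trusted."

def fake_retrieve(claim: str) -> str:
    return f'Possibly related claims for: "{claim}"'

def fake_generate(dialogue_text: str, document: str) -> str:
    return f"[reply conditioned on the dialogue and {len(document)} characters of retrieved text]"

stub_reply = rag_response("Human: Sounds fishy.",
                          fake_paraphrase, fake_retrieve, fake_generate)
```

Keeping the steps separate like this makes it easy to log and inspect the intermediate claim and document, which is often where retrieval-augmented systems go wrong.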

Here's an example of the kind of document you might create at the retrieval step, though it may be possible to do better than this:


In [None]:
# refers to global `kialo` as defined above
def kialo_responses(s: str) -> str:
    c = kialo.closest_claims(s, kind='has_cons')[0]
    result = f'One possibly related claim from the Kialo debate website:\n\t"{c}"'
    if kialo.pros[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users in favor of that claim:"] + kialo.pros[c])
    if kialo.cons[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users against that claim:"] + kialo.cons[c])
    return result
        
print(kialo_responses("Animal flesh is yucky to think about, yet delicious."))

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 4: Create Aragorn, an argubot using retrieval-augmented generation</b>
</div>

You should implement Aragorn in `argubots.py` (at the very bottom) as an instance called `aragorn` of a new class `RAGAgent` that is a subclass of `Agent` or `LLMAgent`.

Once implemented, you should be able to run the following cell.

In [None]:
from logging_cm import LoggingContext
from simulate import simulated_dialogue

# Have Aragorn talk to Darius and spy on the back-end messages to/from the LLM server.
with LoggingContext("agents", "INFO"):
    aragorn_darius = simulated_dialogue(argubots.aragorn, agents.darius, 6)

rich.print(aragorn_darius)

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 5: Evaluate Aragorn</b>
</div>

As you did for Task 2, use the model-based evaluation strategy shown in the `demo` notebook.

Then compare Aragorn's results to those you got for Airhead and Alice. Who does best? What are the differences in the subscores and comments? Does it matter which character you're evaluating on -- maybe the different characters expose the bots' various strengths and weaknesses?

Try to figure out how to improve Aragorn's score. Can you beat Alice?

In [64]:
# Add the code for your evaluation here.

## Victor

Add another LLM-based argubot to `argubots.py`. Call it ***Victor***. Try to make it get the best score, according to `evaluate.eval_on_characters`.

You may want to use Aragorn or Alice as your starting point. Then see if you can find tricks that will get a more awesome score for Victor. How you choose to do that is up to you, but some ideas are suggested below.

(Reminder: *Don't change the evaluation.* Just build a better argubot.)

In your report, you should describe what you did and discuss what you found. If the idea was interesting and you implemented it correctly and well, it's okay if it turns out not to help the evaluation score. Many good ideas don't work. That's why you need to keep finding and trying new good ideas. (Sometimes an idea does help, but in a way that is not picked up by the scoring metric. If that is the case, make sure to discuss this in detail in your report.)

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 6: Create Victor</b>
</div>

You should implement Victor in `argubots.py` (at the very bottom).

Once implemented, you should be able to run the following cell.

In [None]:
from logging_cm import LoggingContext
from simulate import simulated_dialogue

# Have Victor talk to Darius and spy on the back-end messages to/from the LLM server.
with LoggingContext("agents", "INFO"):
    victor_darius = simulated_dialogue(argubots.victor, agents.darius, 6)

rich.print(victor_darius)

<div class="alert alert-block alert-warning">
❓❓❓ <b>Task 7: Evaluate Victor</b>
</div>

As you did for Tasks 2, 3, and 5, use the model-based evaluation strategy shown in the `demo` notebook.

Compare Victor's evaluation results to those for the other argubots. And as for Aragorn, dig into the results a bit. Look at the subscores and evaluation comments. Read the simulated dialogues and see whether you can identify patterns of what Victor does well and where it struggles. Did the ideas you had for how to make Victor awesome have the effect you thought they would?

In [66]:
# Add the code for your evaluation here.