# Homework 8: Large Language Models

A PDF overview of the homework is [here](https://www.cs.jhu.edu/~jason/465/hw-llm/).

It mentions: "We'll send hand-in instructions soon.  Probably we will ask you to submit a version
of the main notebook, with your answers added and extraneous materials deleted. We may also
ask for a summary."

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
This symbol marks a question or exercise that you will be expected to hand in.

# Getting started

## Activate `conda` environment

When executing cells in this notebook, you will need to connect to an `nlp-class` kernel, which is a Python process running in that environment.  This is the notebook equivalent of the terminal command `conda activate nlp-class`.  

If you need to create or update that environment, first download the [nlp-class.yml](http://cs.jhu.edu/~jason/465/hw-llm/nlp-class.yml) file, and execute
```
conda env update --file nlp-class.yml --prune
```

## Fetch code and data files for this homework

All of the files you need are in the directory <https://www.cs.jhu.edu/~jason/465/hw-llm/>.  To get a local copy of that directory, including this notebook, you can download and unpack [HW-LLM.zip](https://www.cs.jhu.edu/~jason/465/hw-llm/HW-LLM.zip).  Then open this notebook.

Note that the other files must be in the *same directory* as this notebook.  Otherwise, a command like `import tracking` won't be able to find the tracking module, `tracking.py`.

*Note:* These files might get improved after the homework is released, in which case you'll want to re-download them.  Make sure not to overwrite changes you've already made.  One way to do it: use a terminal to `cd` to the directory containing this notebook, and run the following shell commands to get the latest versions of all other files.
```
wget --quiet -r -np -nH --cut-dirs=3 -A '*.txt' -A '*.py' -A 'demo.ipynb' https://www.cs.jhu.edu/~jason/465/hw-llm/
rm -f data/*.1 robots.txt   # remove any backup versions of the static files
```
Any existing versions of the files will not be overwritten; they will be renamed with names like `tracking.py.1`.

In [1]:
# Check that the current directory does contain the files.
!ls -lR *.py data

-rw-rw-r--@ 1 griffinmontalvo  staff  19742 Nov 23 07:20 agents.py
-rw-rw-r--@ 1 griffinmontalvo  staff   3131 Nov 23 07:20 argubots.py
-rw-rw-r--@ 1 griffinmontalvo  staff   2836 Nov 23 10:43 characters.py
-rw-rw-r--@ 1 griffinmontalvo  staff   2641 Dec  5  2023 dialogue.py
-rw-rw-r--@ 1 griffinmontalvo  staff  14216 Nov 23 09:41 evaluate.py
-rw-rw-r--@ 1 griffinmontalvo  staff  10426 Dec  5  2023 kialo.py
-rw-rw-r--@ 1 griffinmontalvo  staff   1347 Dec  3  2023 logging_cm.py
-rw-rw-r--@ 1 griffinmontalvo  staff   1503 Dec  5  2023 simulate.py
-rw-rw-r--@ 1 griffinmontalvo  staff   6130 Nov 30 15:11 tracking.py

data:
total 4512
-rw-rw-r--@ 1 griffinmontalvo  staff     407 Nov 29  2023 LICENSE
-rw-rw-r--@ 1 griffinmontalvo  staff  613106 Nov 25  2023 all-humans-should-be-vegan-2762.txt
-rw-rw-r--@ 1 griffinmontalvo  staff   81917 Nov 29  2023 have-authoritarian-governments-handled-covid-19-better-than-others-54145.txt
-rw-rw-r--@ 1 griffinmontalvo  staff   52771 Dec  4  2023 is-biden-


The `autoreload` feature of Jupyter ensures that if an imported module (.py file) changes, the notebook will automatically import the new version.  
(However, objects that were defined with the old version of the class won't change.)

In [2]:
# Executing this cell does some magic
%load_ext autoreload
%autoreload 2

## Create an OpenAI client

An OpenAI API key will be sent to you.  (Or are you not in the class? Then you can make your own API key by [signing up for an OpenAI platform account](https://platform.openai.com/signup) and putting some money on it.  This assignment should cost only about $1 US.)

Make an `.env` file in the same directory as this notebook, containing the following:
```
export OPENAI_API_KEY=[your API key]    # do not include the brackets here
```
Make sure others can't read this file:
```
chmod 600 .env
```

**Be sure to keep the key secret.  It gives access to a billable account.** If OpenAI finds it on the public web, they will invalidate it, and then no one (including you) can use this key to make requests anymore.



Now you can execute the following to get an OpenAI client object.

In [3]:
from tracking import new_default_client, read_usage
client = new_default_client() 

That fetches your API key and calls `openai.OpenAI()` to make a new **client** object, whose job is to talk to the OpenAI **server** over HTTP.  (The `OpenAI` constructor has some optional arguments that configure these HTTP messages.
However, the defaults should work fine for you.)

That command also saved the new client in `tracking.default_client`, which is the client that the starter code will use by default whenever it needs to talk to the OpenAI server.  Thus, you should **rerun the above cell** to get a new client if you change the `default_model` in `tracking.py`, or if your API key in  `.env` ever changes, or its associated organization ever changes.
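For intuition about what "fetches your API key" involves, here is a minimal sketch of reading a `.env` file of the shape shown above. The function name `parse_env_text` and the sample key are hypothetical, and this is not necessarily how `tracking.py` does it (it may well rely on an existing library instead).

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse lines of the form `export NAME=value   # comment` into a dict."""
    env = {}
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].strip()      # drop a trailing comment, if any
        if not line or line.startswith("#"):      # skip blank and comment-only lines
            continue
        if line.startswith("export "):            # the `export` keyword is optional
            line = line[len("export "):]
        name, sep, value = line.partition("=")
        if sep:                                   # ignore lines without an `=`
            env[name.strip()] = value.strip().strip("\"'")
    return env

# Hypothetical file contents; the key below is a placeholder, not a real key.
sample = "export OPENAI_API_KEY=sk-example-not-a-real-key    # placeholder\n"
settings = parse_env_text(sample)
print(sorted(settings))   # print only the names, never the secret values
```

Printing only the variable names keeps the key itself out of the notebook output, which matters if you later share or hand in the notebook.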

## Try the model!

You can now get answers from OpenAI models by calling methods of the `client` instance.  
You will have to specify which OpenAI model to use.
Documentation of the methods is [here](https://pypi.org/project/openai/) if you are curious.

### Continue a textual prompt

This is what language models excel at.  In principle you should do it by calling [`client.completions.create`](https://platform.openai.com/docs/api-reference/completions/create?lang=python).  However, OpenAI has [retired](https://openai.com/blog/gpt-4-api-general-availability) most of the models that support that API (keeping only `gpt-3.5-turbo-instruct`).  So we'll use the more modern API, [`client.chat.completions.create`](https://platform.openai.com/docs/api-reference/chat/create?lang=python).

In [12]:
import rich   # prettyprinting
response = client.chat.completions.create(messages=[{"role": "user", 
                                                     "content": "Q: Name the planets in the solar system?\nA: "}], 
                                          model="gpt-3.5-turbo-0125",  # which model to use
                                          temperature=1,               # get a little variety
                                          max_tokens=64,               # limit on length of result
                                          logprobs=True,
                                          top_logprobs=5
                                          # stop=["Q:", "\n"],         # treat these as EOS symbols; useful for some models
                                         )           
rich.print(response)                              # the full object that was sent back from the server
rich.print(response.choices)                      # just the list of 1 answer (the default, but calling with n=5 would give 5 answers) 
rich.print(response.choices[0].message.content)   # extract the good stuff from that 1 answer

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Try running the cell above a few times. You may get different random answers — especially because the call specifies temperature 1.  (The default temperature is rumored to be 0.8.) Are the answers all equally good?

*No, the answers are not equally good. The answer always contains a list of planets, which may or may not be numbered and may or may not put one planet per line. Sometimes it includes Pluto as a dwarf planet, with more or less detail about when it was classified as such. The problem is that most of the time it won't tell you what the list is for; sometimes it will say something like "the planets in your solar system are ", which is a better response than just a bare list of planets.

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Try adding the arguments `logprobs=True, top_logprobs=5` to the above API call (see [documentation](https://platform.openai.com/docs/api-reference/chat/create#chat-create-logprobs)).  For each generated token, the response will now include its log-probability, and also the log-probabilities of the 5 most probable tokens, given the left context so far.  Again, run the cell a few times.  What do you observe?

 *I notice that the probabilities are not fixed; they change each time the code is run. It also seems that the planet names are assigned much higher probabilities than other words or punctuation such as periods and commas.
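Because the API reports log-probabilities rather than probabilities, it can help to exponentiate them before comparing tokens. The sketch below uses made-up values for a single token position (they are not real API output) and shows that the top 5 tokens account for only part of the probability mass.

```python
import math

# Hypothetical log-probabilities for the 5 most probable tokens at one position.
top_logprobs = {"Mercury": -0.1, "The": -3.5, "1": -4.2, "There": -5.0, "Sure": -6.1}

# exp() turns each natural-log probability back into a probability.
probs = {token: math.exp(lp) for token, lp in top_logprobs.items()}
covered = sum(probs.values())   # mass of the top 5; the rest of the vocabulary shares 1 - covered

for token, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{token!r:>10}  p = {p:.4f}")
print(f"top-5 probability mass = {covered:.4f}")
```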



It might be handy to package up what we just did.
The `complete` function below is a convenient way of experimenting with completing text.
It is illustrated with a grocery example.  

In [44]:
def complete(client, s: str, model="gpt-3.5-turbo-0125", *args, **kwargs):
    response = client.chat.completions.create(messages=[{"role": "user", "content": s}],
                                              model=model,
                                              *args, **kwargs)
    return [choice.message.content for choice in response.choices]

complete(client, "I went to the store and I bought apples, bananas, cherries, donuts, eggs", 
         n=10, temperature=0.8, max_tokens=96)


[', and a loaf of bread.',
 ', and flour.',
 ', and fish.',
 ', fish, grapes, honey, ice cream, juice, kiwi, lettuce, milk, nuts, oranges, peaches, quinoa, rice, spinach, tomatoes, umbrella, vinegar, watermelon, yogurt, zucchini.',
 ', and fish.',
 ', flour, grapes, honey, ice cream, juice, kiwis, lemons, milk, noodles, oranges, pears, quinoa, rice, spinach, tomatoes, yogurt, and zucchini.',
 ', and flour.',
 ', and flour.',
 ', and flour.',
 ', and flour.']

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Anything could be on a grocery list, so why are the 10 different completions above so similar?<br>
Hint: The answer isn't just the temperature of 0.8.  Look especially at the long completions; run the cell again if you didn't get multiple long completions.

It makes sense that the next word is often 'and', since it's grammatically correct to put 'and' before the last item of a list. The list is in alphabetical order, so the next item will be a word that starts with 'F', and it will likely be a food, considering that the previous items were all food-related. I believe flour and fish occur so often because they are the most common 'f' items bought at a store, so the training data is biased toward those words.


![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
What happens at different temperatures?  How about temperatures > 1?  (Note: Higher temperatures tend to produce longer responses, so it's wise to use `max_tokens`.)

*The higher the temperature, the more variability there is in the next token. Some of the continuations I got include
"flour, grapes, honey, ice cream, jam, kiwis, lemons, milk, nuts, oranges, peas, quinoa, rice, sausage, tomatoes, umbrellas, vegetables, watermelon, yogurt, and zucchini.",
"and flour. I also picked up some grapefruit, honey, ice cream, and jelly. Lastly, I grabbed some kiwis, lemons, milk, noodles, and orange juice.",
", and flour to make a fruit salad and some pastries for a brunch with friends.",
", and a loaf of bread."
Some of the results break the alphabetical order, some continue it all the way to Z, and some give a reason for buying those items. I noticed that it gave explanations at temperatures > 1.
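To see why temperature changes the variability, here is a minimal sketch of temperature-scaled softmax. The logits are invented for illustration (they are not taken from any model), and the function name `softmax_with_temperature` is our own.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then normalize with softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens.
candidates = ["flour", "fish", "figs", "umbrella"]
values = [2.0, 1.5, 0.5, -1.0]

sharp = softmax_with_temperature(values, 0.5)    # low temperature: mass concentrates on the top token
flat = softmax_with_temperature(values, 2.0)     # high temperature: mass spreads toward unlikely tokens

for name, p_sharp, p_flat in zip(candidates, sharp, flat):
    print(f"{name:>9}  T=0.5: {p_sharp:.3f}   T=2.0: {p_flat:.3f}")
```

At the higher temperature, unlikely tokens are sampled more often, which is one reason the high-temperature completions drift away from the alphabetical pattern.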



*Remark:* These [Python bindings for open-source models such as Llama](https://pypi.org/project/llama-cpp-python/) allow you to [constrain the output by an arbitrary CFG](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md), using `grammar=...`.  This is useful if you're generating code or data that must be syntactically valid to be useful to you.  For even more control over the output, the powerful [guidance](https://github.com/guidance-ai/guidance) package works elegantly with Python.  However, the OpenAI API only allows you to [constrain the output to be valid JSON](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format).


### Compute a function using instructions and few-shot prompting

We'll keep using the chat completions API, now with a more recent model.  Let's try prompting it with a sequence of multiple messages.  In this case, we provide some instructions as well as few-shot prompting (actually just one-shot in this case).

Instructions are in the `system` message.  The few-shot prompting consists of example inputs (`user` messages) followed by their example outputs (`assistant` messages).  Then we give our real input (the final `user` message), and hope that the LLM will continue the pattern by generating an analogous output (a new `assistant` message).

In [None]:
response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
                                                      "content": "add commas in between each word" },
                                                    { "role": "user",        # input
                                                      "content": "Good things come to those who wait." },
                                                    { "role": "assistant",   # output
                                                      "content": "Good, things, come, to, those, who, wait." },
                                                    { "role": "user",        # input
                                                      "content": "Colorless green ideas sleep furiously." }],
                                          model="gpt-4o-mini", temperature=0)
rich.print(response)
response.choices[0].message.content                                  

'Colorless, green, ideas, sleep, furiously.'
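When experimenting with the number and order of few-shot examples, it may be convenient to build the `messages` list programmatically. The helper below is a sketch of that idea; `few_shot_messages` is a hypothetical name and is not part of the starter code.

```python
def few_shot_messages(instructions, examples, query):
    """Build a chat `messages` list: optional system instructions,
    then (input, output) example pairs, then the real input."""
    messages = []
    if instructions:                               # pass None or "" to omit the instructions
        messages.append({"role": "system", "content": instructions})
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages

msgs = few_shot_messages(
    "add commas in between each word",
    [("Good things come to those who wait.",
      "Good, things, come, to, those, who, wait.")],
    "Colorless green ideas sleep furiously.")
print([m["role"] for m in msgs])
```

The resulting list could then be passed as `messages=msgs` to `client.chat.completions.create`, as in the cell above.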

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
By modifying this call, can you get it to produce different versions of the output?
Some possible behaviors you could try to arrange:
* specific other way of formatting the output, e.g., `wait, who, those, to, come, things, good`
* match the input's way of formatting the output (same use of capitalization, punctuation, commas)
* reverse the phrases rather than reversing the words, e.g., `To those who wait come good things.` 

You can try playing with the number, the content, and the order of few-shot examples, and changing or removing the instructions.

*I made it so that it added commas in between each word.

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
What happens if the examples conflict with the instructions?

*Even when the example output didn't have commas between the words, the model still followed the instructions and put commas between each word.

### Inspect the tokenization

Just for fun, let's see how the above client has been tokenizing its input and output text.  For that we can use a tokenizer that runs locally, not in the cloud, and is guaranteed to get the same outputs.

In [62]:
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo-0125")  # how this model will tokenize
toks = tokenizer.encode("Hellooo, world!") # list of integer token IDs (the output below shows no BOS token)

print(tokenizer.decode(toks))                                  # convert list back to string
for tok in toks: print(f"{tok}\t'{tokenizer.decode([tok])}'")  # convert one at a time
print("Vocab size =", tokenizer.n_vocab)

Hellooo, world!
9906	'Hello'
2689	'oo'
11	','
1917	' world'
0	'!'
Vocab size = 100277


### Try embedding some text

Also just for fun, let's try the embedder, which converts a string of any length to a vector of fixed dimensionality.

In [63]:
emb_response = client.embeddings.create( input= [  # note: adjacent literal strings in Python are concatenated
        "When in the Course of human events it becomes necessary for one "
        "people to dissolve the political bands which have connected them "
        "with another, and to assume among the Powers of the earth, the "
        "separate and equal station to which the Laws of Nature and of "
        "Nature's God entitle them, a decent respect to the opinions of "
        "mankind requires that they should declare the causes which impel "
        "them to the separation." ], 
        model="text-embedding-3-small")
# don't print the whole response because it's very long
e = emb_response.data[0].embedding
print(f"{len(e)}-dimensional embedding starting with {e[:5]}")
print("Squared length of embedding vector: ", sum(x**2 for x in e))

1536-dimensional embedding starting with [0.03854052722454071, 0.038316600024700165, 0.04359135404229164, 0.07056225836277008, -0.00027718886849470437]
Squared length of embedding vector:  1.0000000365799306
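The squared length of about 1 means the embedding is (approximately) a unit vector. For unit vectors, cosine similarity reduces to a plain dot product. The sketch below checks this with two small made-up unit vectors standing in for real 1536-dimensional embeddings.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical 3-dimensional unit vectors (each has squared length 1).
u = [0.6, 0.8, 0.0]
v = [0.0, 0.6, 0.8]

print("squared lengths:", dot(u, u), dot(v, v))
print("dot product:    ", dot(u, v))
print("cosine:         ", cosine_similarity(u, v))
```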


### Check your usage so far

Please be careful not to write loops that use lots and lots of tokens.  That will cost us money, and could hit the per-day usage limit that is shared by the whole class.

Execute one of these cells whenever you want to see your cost so far.  Or, just keep `usage_openai.json` open as a tab in your IDE.

In [64]:
read_usage()      # reads from the file usage_openai.json; returns token counts and cost in dollars

{'completion_tokens': 3023,
 'prompt_tokens': 2362,
 'total_tokens': 5385,
 'cost': 0.005217359999999996}

In [65]:
!cat usage_openai.json 

{
    "completion_tokens": 3023,
    "prompt_tokens": 2362,
    "total_tokens": 5385,
    "cost": 0.005217359999999996
}
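If you do write a loop that calls the API, one option is to check the recorded cost against a budget before each iteration. This is only a sketch: `within_budget` is a hypothetical helper, the $1 budget is illustrative, and the JSON text is copied from the output above so that the example runs without the file.

```python
import json

# Contents copied from the usage_openai.json output shown above.
usage_text = """{
    "completion_tokens": 3023,
    "prompt_tokens": 2362,
    "total_tokens": 5385,
    "cost": 0.005217359999999996
}"""

def within_budget(usage: dict, budget_dollars: float) -> bool:
    """Return True if the recorded cost has not yet exceeded the budget."""
    return usage["cost"] <= budget_dollars

usage = json.loads(usage_text)
print(f"{usage['total_tokens']} tokens, ${usage['cost']:.4f} so far")
print("Under the $1 budget:", within_budget(usage, 1.00))
```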

# Dialogues and dialogue agents

The goal of this assignment is to create a good "argubot" that will talk to people about controversial topics and broaden their minds.

## A first argubot (Airhead)

You can have a conversation right now with a _really bad_ argubot named Airhead.  Try asking it about climate change!  When you're done, reply with an empty string.

(The `converse()` method calls Python's `input()` function, which will prompt you for input at the command-line or by popping up a box in your IDE.)

In [66]:
import argubots
d = argubots.airhead.converse()


(griffinmontalvo) Climate change isnt real
(Airhead) I know right???
(griffinmontalvo) I think Hopkins should make winter break twice as long for climate change purposes
(Airhead) I know right???
(griffinmontalvo) I know right???
(Airhead) I know right???


A *bot* (short for "robot") is a system that acts autonomously.
That corresponds to the AI notion of an *agent* — a system that uses some *policy* to choose *actions* to take.

The `airhead` agent above (defined in `argubots.py`) uses a particularly simple policy.  
It is an instance of a simple `Agent` subclass called `ConstantAgent` (defined in `agents.py`).

The result of talking to `airhead` is a `Dialogue` object (defined in `dialogue.py`). Let's look at it.

In [67]:
rich.print(d)

Each *turn* of this dialogue is just a tiny dictionary:

In [68]:
d[0]

{'speaker': 'griffinmontalvo', 'content': 'Climate change isnt real'}
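To make the idea of a constant policy concrete, here is a toy sketch. `ToyConstantAgent` is a hypothetical stand-in, not the actual `ConstantAgent` from `agents.py`, and it represents a dialogue as a plain list of turn dictionaries rather than a `Dialogue` object.

```python
class ToyConstantAgent:
    """Hypothetical agent whose policy ignores the dialogue and always says the same thing."""

    def __init__(self, name: str, response: str):
        self.name = name
        self.response = response

    def respond(self, turns: list) -> list:
        # Return a new list so that the earlier dialogue is left unchanged.
        return turns + [{"speaker": self.name, "content": self.response}]

toy_airhead = ToyConstantAgent("Airhead", "I know right???")
turns = [{"speaker": "griffinmontalvo", "content": "Climate change isnt real"}]
turns = toy_airhead.respond(turns)

for turn in turns:
    print(f"({turn['speaker']}) {turn['content']}")
```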

## An LLM argubot (Alice)

In other CS courses like crypto, algorithms, or networks, you may have encountered "conversations" between characters named Alice and Bob.  
Let's try talking to the Alice of this homework, who is a _much stronger baseline_ than Airhead.  Your job in this assignment is to improve upon Alice.
We'll meet Bob later.

In [69]:
alicechat = argubots.alice.converse()   # or call with argument d if you want to append to the previous conversation


(griffinmontalvo) I think that college is bad
(Alice) What do you believe are the main reasons college is bad? While it's true that college can be expensive and may not be the best fit for everyone, it can also provide invaluable networking opportunities, personal growth, and skills that are often hard to acquire otherwise.
(griffinmontalvo) Oh you right, college is chill
(Alice) It's great to see you considering the positives of college! However, do you think there are alternatives that could also provide valuable skills and experiences without the traditional college route? Exploring options like vocational training or entrepreneurship might lead to equally rewarding paths.
(griffinmontalvo) Entrepreneurship is fire
(Alice) It’s exciting to hear your enthusiasm for entrepreneurship! However, have you considered the potential downsides, such as the financial instability and risks involved? Sometimes having a college education can provide a safety net and a broader skill set that can 

As you may have guessed, `alice` is powered by a prompted LLM.  You can find the specific prompt in `argubots.py`.

So, while `agents.py` provides the core functionality for `Agent` objects, the argubot agents like `alice` — and the ones that you will write! — go into `argubots.py` instead.  This is just to keep the files small.

## Simulating human characters (Bob & friends)

You'll talk to your own argubots to get a qualitative feeling for their strengths and weaknesses.  
But can you really be sure you're making progress?  For that, a quantitative measure can be helpful.

Ultimately, you should test an argubot like Alice by having it argue with many real humans — not just you — and using some rubric to score the resulting dialogues.  But that would be slow and complicated to arrange.  

So, meet Bob!  He's just a simulated human.  You won't edit him: he is part of the development set.  Here is some information about him (from `characters.py`):

In [70]:
import characters
rich.print(characters.bob)

You can't talk directly to `characters.bob` because that's just a data object.
However, you can construct a simple agent that uses that data (plus a few more instructions) to prompt an LLM.

(Which LLM does it prompt?  The `CharacterAgent` constructor (defined in `agents.py`) defaults to a GPT-3.5 model that is specified in `tracking.py`.  But you can override that using keyword arguments.)

Try talking to Bob about climate change, too.

In [71]:
from agents import CharacterAgent
bob = CharacterAgent(characters.bob)    # actually, agents.bob is already defined this way
bob.converse()        # returns a dialogue, but we've already seen it so we don't want to print it again
None                  # don't print anything for this notebook cell 


(griffinmontalvo) Climate change is wild
(Bob) Absolutely, and adopting a vegetarian lifestyle can significantly contribute to reducing our carbon footprint.
(griffinmontalvo) Nah, meat is mandatory, imagine getting a chipotle bowl without meat
(Bob) While a meat-free bowl might seem unconventional, there are plenty of delicious vegetarian options that can be just as satisfying and flavorful!
(griffinmontalvo) Bro Bob, you sound like you need to eat some steak
(Bob) I appreciate your concern, but I truly believe there are so many tasty vegetarian dishes that can provide all the nutrients one needs without the downsides of meat!
(griffinmontalvo) bro like what? Beyond burgers are awful
(Bob) There are plenty of other delicious vegetarian options, like hearty lentil soups, flavorful curries, or stuffed bell peppers that can be both satisfying and nutritious!
(griffinmontalvo) So are you telling me Beyond burgers are good?
(Bob) I think Beyond burgers can be a fun alternative for some, b

Of course, a proper user study can't just be conducted with one human user.

So, meet our bevy of beautiful Bobs!  (They're not actually all named Bob — we continued on in the alphabet.)


In [72]:
import agents
agents.devset

[<CharacterAgent for character Bob>,
 <CharacterAgent for character Cara>,
 <CharacterAgent for character Darius>,
 <CharacterAgent for character Eve>,
 <CharacterAgent for character TrollFace>]

In [73]:
agents.cara.converse()
None


(griffinmontalvo) Whats good Cara
(Cara) Not much, just enjoying some delicious meat—what about you?
(griffinmontalvo) I just had some chipotle it was fire. I had a double chicken bowl
(Cara) That sounds tasty, but I’d stick to my steak any day!
(griffinmontalvo) Steak is goated, whats your favorite cut of steak
(Cara) I’m all about a perfectly cooked ribeye—so much flavor and juiciness!
(griffinmontalvo) Ribeyes are good
(Cara) Absolutely, you can’t go wrong with that marbling!


You can see the underlying character data here in the notebook.  Your argubot will have to deal with all of these topics and styles!

In [74]:
rich.print(characters.devset)

## Simulating conversation 

We can make Alice and Bob chat.

In [75]:
from dialogue import Dialogue
d = Dialogue()                                              # empty dialogue
d = d.add('Alice', "Do you think it's okay to eat meat?")   # add first turn
print(d)


(Alice) Do you think it's okay to eat meat?


In [76]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe that choosing a vegetarian lifestyle is a more compassionate and environmentally-friendly option for everyone.
(Alice) That’s a valid perspective, but have you considered how some sustainable farming practices can actually benefit ecosystems and support local economies? In some cases, responsible animal agriculture can play a vital role in maintaining biodiversity and providing livelihoods in rural areas.


In [77]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe that choosing a vegetarian lifestyle is a more compassionate and environmentally-friendly option for everyone.
(Alice) That’s a valid perspective, but have you considered how some sustainable farming practices can actually benefit ecosystems and support local economies? In some cases, responsible animal agriculture can play a vital role in maintaining biodiversity and providing livelihoods in rural areas.
(Bob) While I appreciate the importance of sustainable practices, I still feel that a plant-based diet offers a more holistic solution to both environmental and ethical concerns.
(Alice) That’s a strong point, but it’s worth exploring how not all plant-based diets are equal in terms of environmental impact; for instance, some crops require extensive resources like water and land. Additionally, some communities rely on animal husbandry for cultural and nutritional needs, which can complicate a purely plant-based narrative.


Anyway, let's see what happens when Alice and Bob talk for a while...

In [78]:
from simulate import simulated_dialogue
d = simulated_dialogue(argubots.alice, agents.bob, 8)
rich.print(d)
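The repeated `respond` calls in the earlier cells suggest a simple alternation loop. The sketch below imitates that loop with toy policies (plain functions from the dialogue so far to the next utterance). It is an assumption about the general shape of such a simulation, not the actual code in `simulate.py`; in particular, treating `turns` as the total number of turns is our own choice.

```python
from typing import Callable

Policy = Callable[[list], str]   # maps the dialogue so far to the next utterance

def toy_simulated_dialogue(a_name: str, a_policy: Policy,
                           b_name: str, b_policy: Policy,
                           turns: int, starter: str) -> list:
    """Speaker a opens with `starter`; then b and a alternate until `turns` turns exist."""
    dialogue = [{"speaker": a_name, "content": starter}]
    speakers = [(b_name, b_policy), (a_name, a_policy)]
    for i in range(turns - 1):
        name, policy = speakers[i % 2]
        dialogue = dialogue + [{"speaker": name, "content": policy(dialogue)}]
    return dialogue

d_toy = toy_simulated_dialogue(
    "Alice", lambda d: f"Why do you say that? (turn {len(d) + 1})",
    "Bob", lambda d: "I still prefer vegetables.",
    turns=4, starter="Do you think it's okay to eat meat?")

for turn in d_toy:
    print(f"({turn['speaker']}) {turn['content']}")
```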

Sometimes this kind of conversation seems to stall out, with Bob in particular repeating himself a lot.  Alice doesn't seem to have a good strategy for getting him to open up.  Maybe you can do a better job talking to Bob, and that will give you some ideas about how to improve Alice?

In [79]:
myname = alicechat[0]['speaker']   # your name, pulled from an earlier dialogue
agents.bob.converse(d[0:2].rename('Alice', myname))  # reuse the same first two turns, then type your own lines!
None

(griffinmontalvo) Do you think it's ok to eat meat?
(Bob) I believe that choosing a vegetarian lifestyle offers numerous benefits for health, the environment, and animal welfare, making it a more compassionate choice.
(griffinmontalvo) Well what about certain religions that rely on eating meat?
(Bob) I respect religious beliefs, but I think it's essential to explore vegetarian options that align with those values while also considering the ethical implications of meat consumption.
(griffinmontalvo) So what would you say to a culture that hs been around for thousands of years that eats meat for every meal, and if they don't then its considered sinful?
(Bob) I would encourage open dialogue about food choices and suggest exploring vegetarian alternatives that can honor traditions while still considering the ethical and health implications of meat consumption.
(griffinmontalvo) So you want these people to die?
(Bob) Not at all; my intention is to promote healthy and compassionate choices, 

You can also try talking to the other characters and having Alice (or Airhead) talk to them.

**You might enjoy** defining additional characters in `characters.py`, or right here in the notebook.
Feel free to talk to those and evaluate them.  They could be variants on the existing characters, or something entirely new. 

However, **don't change the dev set** — the characters we just loaded must stay the same.  Your job in this homework is to improve the argubot (or at least try).  And that means improving it according to a fixed and stable eval measure.

As an exception, you can change the languages that a couple of the characters speak. It may be fun for you to see them try to speak your native language.  And that doesn't really affect the quality of the argument.

In [80]:
# example
trollFace2 = characters.trollFace.replace(languages = ["Chinese", "Spanish"])
rich.print(trollFace2)
simulated_dialogue(argubots.alice, CharacterAgent(trollFace2), 6)

(Alice) Do you think Donald Trump will be a good president?
(TrollFace) ¿Buena presidencia? Más bien una comedia de errores; ¡ni siquiera él se lo cree!
(Alice) Entiendo tu perspectiva sobre su presidencia, pero es interesante considerar que muchas de sus políticas y decisiones han resonado con un sector significativo de la población, lo que sugiere que su estilo de liderazgo podría tener un apoyo más amplio del esperado. ¿Podría ser que su enfoque poco convencional atraiga a quienes se sienten desconectados del sistema político tradicional?
(TrollFace) ¡Claro! Pero también hay que recordar que "desconectados" no siempre significa "inteligentes"; a veces, es solo el eco de un pueblo confundido.
(Alice) Es verdad que el descontento puede llevar a decisiones poco informadas, pero también es crucial reconocer que esa confusión puede surgir de una falta de representación o comunicación efectiva por parte de los líderes. ¿Podría ser que, en lugar de despreciar la confusión, debamos entender

### Efficiency: Batched generation?

Notice that we are making a separate LLM call to generate each turn of the dialogue.  When we generate the $n^\text{th}$ turn, we send the server the whole dialogue history — the previous $n\!-\!1$ turns — along with some instructions.  The server has to re-encode it with the Transformer, and it charges us for doing so (see the "input token" costs in `tracking.py`).  

That is probably inevitable for real dialogue.  But for simulated dialogue, a more efficient approach would be to generate the whole dialogue between Alice and Bob in one LLM call.  Then you would be charged just once for each dialogue turn.  Under this approach, the Transformer encodes each token as soon as it is generated (see the "output token" costs in `tracking.py`).  The encoded token stays in the context throughout the dialogue, so it doesn't have to be re-encoded on a later call.  There is no later call.  

Under current pricing models, that would reduce the dollar cost of generating $n$ turns from $O(n^2)$ to $O(n)$.  
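The billing claim can be checked with simple arithmetic. The sketch below counts billed tokens under two simplifying assumptions of ours: every turn has the same length, and the instruction tokens are ignored. It counts tokens only and says nothing about the computational cost.

```python
def billed_tokens_per_turn_calls(n_turns: int, tokens_per_turn: int):
    """One call per turn: the call for turn i resends the previous i-1 turns as input."""
    input_tokens = sum((i - 1) * tokens_per_turn for i in range(1, n_turns + 1))
    output_tokens = n_turns * tokens_per_turn
    return input_tokens, output_tokens

def billed_tokens_single_call(n_turns: int, tokens_per_turn: int):
    """One call for the whole dialogue: every turn is billed once, as output."""
    return 0, n_turns * tokens_per_turn

for n in (5, 10, 20, 40):
    separate = billed_tokens_per_turn_calls(n, 50)
    batched = billed_tokens_single_call(n, 50)
    print(f"n={n:>2}  separate calls: {sum(separate):>6} tokens   single call: {sum(batched):>5} tokens")
```

Doubling `n` roughly quadruples the input tokens for separate calls, while the single-call total only doubles.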

However, the pricing model doesn't quite reflect the computational costs.  
* ![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png) Using $O(\cdot)$ notation, what is the total number of floating-point operations needed to generate $n$ turns under each approach?  
* ![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png) Parallelism may help reduce the runtime.  Using $O(\cdot)$ notation, what is the total number of seconds needed to generate $n$ turns under each approach?  (Assume that the GPU is big enough, relative to $n$, that it can encode all input tokens in parallel.)

The problem with the more efficient approach is that it gives you no way to change the instructions (the system prompt) each time we switch from Alice to Bob and back again.  You'd need to generate the whole conversation using a single set of instructions.

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Can you get this to work?  Specifically, try completing the cell below.  You don't have to use the `Agent` or `Dialogue` classes.  It's okay to just throw together something like the `complete()` method above.  Just see whether you can manage to prompt gpt-4o-mini to generate a multi-turn dialogue between two characters who have different personalities and goals.  Is the quality better or worse than generating one turn at a time with different instructions?

In [83]:
# Like `simulated_dialogue` in `simulate.py`.  However, this one is called on two
# Characters, not two Agents, and it returns a string rather than a Dialogue.
import random

import characters
from characters import Character
from tracking import default_client, default_model


def describe(c: Character) -> str:
    """Short prompt description of a character.

    ASSUMPTION: `Character` has `persona` and `conversational_style` attributes
    (check `characters.py`); any attribute that is missing or empty is skipped.
    """
    traits = [getattr(c, "persona", ""), getattr(c, "conversational_style", "")]
    return f"{c.name}: " + " ".join(t for t in traits if t)


def simulated_dialogue_batch(a: Character, b: Character, turns: int = 6, *,
                             starter=True) -> str:

    # Create the instructions for the system
    system_message = {
        "role": "system",
        "content": (
            "You are simulating a multi-turn dialogue between two characters. "
            "Each character has their own unique personality and goals. "
            "Respond as if both characters are speaking in turns, alternating between them. "
            "Ensure the dialogue is coherent, stays true to the characters' personalities, and is engaging."
        )
    }

    # Optionally include a conversation starter
    if starter and hasattr(b, "conversation_starters") and b.conversation_starters:
        initial_message = random.choice(b.conversation_starters)
    else:
        initial_message = f"Hi {b.name}, how are you?"

    # User message sets up the dialogue, including a description of each character
    user_message = {
        "role": "user",
        "content": (
            f"The following is a dialogue between {a.name} and {b.name}.\n\n"
            f"{describe(a)}\n"
            f"{describe(b)}\n\n"
            f"{a.name}: {initial_message}\n\n"
            f"Continue the dialogue for {turns} turns."
        )
    }

    # Call the OpenAI API to generate the dialogue
    response = default_client.chat.completions.create(
        model=default_model,
        messages=[system_message, user_message],
        max_tokens=500,  # Adjust as needed for longer dialogues
        temperature=0.7  # Adjust to control randomness
    )

    # Extract and return the generated dialogue
    return response.choices[0].message.content


# Try it out!
simulated_dialogue_batch(characters.bob, characters.cara)

In [None]:
simulated_dialogue(agents.bob, agents.cara)

In [None]:
simulated_dialogue(agents.eve, agents.trollFace)

# Model-based evaluation

What is our goal for the argubot?  We'd like it to broaden the thinking of the (simulated) human that it is talking to.  Indeed, that's what Alice's prompt tells Alice to do.

This goal is inspired by the recent paper [Opening up Minds with Argumentative Dialogues](https://aclanthology.org/2022.findings-emnlp.335/), which collected human-human dialogues:

> In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding to views that are unfamiliar or in opposition to their own convictions. ... Success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs.

Arguments of this sort are not like chess or tennis games, with an actual winner.  The argubot will almost never hear a human say "You have convinced me that I was wrong."  But the argubot did a good job if the human developed **increased understanding and respect for an opposing point of view**.  

To find out whether this happened, we can use a questionnaire to ask the human what they thought after the dialogue.  For example, after Alice talks to Bob, we'll ask Bob to evaluate what he thinks of Alice's views.  Of course, that depends on his personality — Alice needs to talk to him in a way that reaches *him* (as much as possible).  We'll also ask an outside observer to evaluate whether Alice handled the conversation with Bob well.

Of course, we're still not going to use real humans.  Bob is a fake person, and so is the outside observer (whose name is Judge Wise).
Using an LLM as an eval metric is known as *model-based evaluation*.  It has pros and cons:
* It is cheaper, faster, and more replicable than hiring actual humans to do the evaluation.  
* It might give different answers than what humans would give.   

Social scientists usually refer to a metric's **reliability** (low variance) and **validity** (low bias).  So the points above say that model-based evaluation is reliable but not necessarily valid.  In general, an LLM-based metric (like any metric) needs to be validated to confirm that it really does measure what it claims to measure.  (For example, that it correlates strongly with some other measure that we already trust.)  In this homework, we'll skip this step and just pray that the metric is reasonable.

To see how this works out in practice, open up the `demo` notebook, which walks you through the evaluation protocol.  You'll see how to call the [starter code](http://cs.jhu.edu/~jason/465/hw/llm), how it talks to the LLM behind the scenes, and what it is able to accomplish. 

To help to validate the metric, check that Airhead gets a low score.  (It should!)

# Reading the starter code

The `demo` notebook gave you a good high-level picture of what the starter code is doing.  So now you're probably curious about the details.  Now that you've had the view from the top, here's a good bottom-up order in which to study the code.  You don't need to understand every detail, but you will need to understand enough to call it and extend it.

* `characters.py`.  The `Character` class is short and easy.

* `dialogue.py`.  The `Dialogue` class is meant to serve as a record of a natural-language conversation among any number of humans and/or agents.  On each *turn* of the dialogue, one of the speakers says something.  

   The dialogue's sequence of turns may remind you of the sequence of messages that is sent to OpenAI's chat completions API.  But the OpenAI messages are only labeled with the 4 special roles `user`, `assistant`, `tool`, and `system`.  Those are not quite the same thing as human speakers.  And the OpenAI messages do not necessarily form a natural-language dialogue: some of the messages are dealing with instructions, few-shot prompting, tool use, and so on.  The `agents.dialogue_to_openai` function in the next module will map a `Dialogue` to a (hopefully appropriate) sequence of messages for asking the LLM to extend that dialogue.

* `agents.py`.  This module sets up the problem of automatically predicting the next turn in a dialogue, by implementing an `Agent`'s `response()` method.  The `Agent` base class also has some simple convenience methods that you should look at.  

   Some important subclasses of `Agent` are defined here as well.  However, you may want to skip over `EvaluationAgent` and come back to it only when you read `evaluate.py`.

* `simulate.py` makes agents talk to one another, which we'll do during evaluation.

* `argubots.py` starts to describe some useful agents.  One of them makes use of the `kialo.py` module, which gives access to a database of arguments.

* `evaluate.py` makes use of `simulate.simulated_dialogue` and `agents.EvaluationAgent` to evaluate an argubot.

* We also have a couple of utility modules.  These aren't about NLP; look inside if needed.  `logging_cm.py` is what enabled the context manager `with LoggingContext(...):` in the demo notebook.  `tracking.py` sets some global defaults about how to use the OpenAI API, and arranges to track how many tokens we're paying for when you call it.
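
The difference between dialogue turns and chat messages, described under `dialogue.py` above, can be illustrated with a small sketch. This is not the starter code's `agents.dialogue_to_openai`; the function name `turns_to_messages` and the plain `(speaker, text)` tuples are hypothetical stand-ins. The idea is that the agent's own turns become `assistant` messages, everyone else's turns become `user` messages labeled with the speaker's name, and the instructions go into a `system` message.

```python
from typing import Dict, List, Tuple

SpokenTurn = Tuple[str, str]   # (speaker name, what they said)


def turns_to_messages(turns: List[SpokenTurn], me: str,
                      instructions: str) -> List[Dict[str, str]]:
    """Map dialogue turns to chat messages, from the point of view of agent `me`."""
    messages = [{"role": "system", "content": instructions}]
    for speaker, text in turns:
        if speaker == me:
            # What the agent itself said earlier
            messages.append({"role": "assistant", "content": text})
        else:
            # Any other speaker; keep the name, since there may be several
            messages.append({"role": "user", "content": f"{speaker}: {text}"})
    return messages


msgs = turns_to_messages([("Alice", "Hi"), ("Bob", "Hello")],
                         me="Bob", instructions="Be brief.")
print([m["role"] for m in msgs])
```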

# Similarity-based retrieval: Looking up relevant responses

Now, it is fine to prompt an LLM to generate text, but there are other methods!
There is a long history of machine learning methods that "memorize" the training data.
To make a prediction or decision at test time, they consult the stored training examples
that are most similar to the test situation.

_Similarity-based retrieval_ means that given a document $x$, you find the "most similar" documents $y \in Y$, where $Y$ is a given collection of documents.  The most common way to do this is to maximize the _cosine similarity_ $\frac{\vec{e}(x) \cdot \vec{e}(y)}{\|\vec{e}(x)\| \, \|\vec{e}(y)\|}$, where $\vec{e}(\cdot)$ is an embedding function.  (If the embeddings are normalized to unit length, this is just the dot product $\vec{e}(x) \cdot \vec{e}(y)$.)
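
As a minimal sketch of similarity-based retrieval, the code below ranks a collection of already-embedded documents by cosine similarity to a query vector. The function names are hypothetical, and the embeddings are assumed to be plain lists of floats produced by whatever embedding function you choose.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(x_vec: Sequence[float], doc_vecs: List[Sequence[float]],
                 n: int = 1) -> List[int]:
    """Indices of the n documents whose embeddings are closest to x_vec."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(x_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:n]


print(most_similar([1, 0], [[0, 1], [2, 0], [1, 1]], n=2))  # [1, 2]
```

This brute-force search compares the query against every document, which is fine for small collections; a dedicated index is what makes the same idea fast at scale.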

Should we use the OpenAI embedding model?  We could, but we would have to precompute $\vec{e}(y)$ for all $y \in Y$, and store all these vectors in a data structure that supports some type of fast similarity-based search (e.g., using the [FAISS](https://faiss.ai/index.html) package).  An alternative would be to upload the documents to OpenAI and let OpenAI compute and store the embeddings.  We would then use their similarity-based [retrieval tool](https://platform.openai.com/docs/assistants/overview).

A simpler and faster approach—which sometimes even works better—is to use a _bag of tokens_ embedding function: Define $\vec{e}(y)$ to be the vector in $\mathbb{R}^V$ that records the count of each type of token in a tokenized version of $y$, where $V$ is the token vocabulary.  [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a refined variant of that idea, where the counts are adjusted in 3 ways: 

* smooth the counts
* normalize for the document length $|y|$ so that longer documents $y$ are not more likely to be retrieved
* downweight tokens that are more common in the corpus (such as ` the` or `ing`) since they provide less information about the content of the document
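
The three adjustments above can be seen in a from-scratch scorer. This is a minimal sketch of one common textbook form of BM25, not the exact formula used by `rank_bm25` (whose IDF term differs in detail). The function name `bm25_scores` and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_tokens: List[str], docs_tokens: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against the tokenized query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter()              # number of documents containing each token
    for d in docs_tokens:
        doc_freq.update(set(d))

    scores = []
    for d in docs_tokens:
        counts = Counter(d)
        # Length normalization: long documents get their counts discounted
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for t in query_tokens:
            f = counts[t]
            if f == 0:
                continue
            # Downweight tokens that appear in many documents of the corpus
            idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            # Dampened count: grows with f but levels off, controlled by k1
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


docs = [["cats", "eat", "fish"], ["dogs", "eat", "meat"], ["the", "fish", "swim"]]
print(bm25_scores(["cats"], docs))
```

In this toy corpus, `cats` appears in one document and `eat` in two, so a match on `cats` contributes more to the score than a match on `eat`.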


You might like to play with the `rank_bm25` package ([documentation](https://pypi.org/project/rank-bm25/)).  It is widely used and very easy to use.

In [None]:
from rank_bm25 import BM25Okapi as BM25_Index   # the standard BM25 method

# experiment here!  You could try the examples in the rank_bm25 documentation.

## The Kialo corpus

How can we use similarity-based retrieval to help build an argubot?  It's largely about having the right data!

[Kialo](https://www.kialo.com) is a collaboratively edited website (like Wikipedia) for discussing political and philosophical topics.  For each topic, the contributors construct a tree of _claims_.  Each claim is a natural-language sentence (usually), and each of its children is another claim that supports it ("pro") or opposes it ("con").  For example, check out the tree rooted at the claim ["All humans should be vegan."](https://www.kialo.com/all-humans-should-be-vegan-2762).

We provide a class `Kialo` for browsing a collection of such trees.  Please read the [source code](https://www.cs.jhu.edu/~jason/465/hw-llm) in `kialo.py`.  The class constructor reads in text files that are [exported Kialo discussions](https://support.kialo.com/en/hc/exporting-a-discussion/); we have provided some in the [data directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data).  The class includes a BM25 index, to be able to find claims that are relevant to a given string.

In [None]:
from kialo import Kialo

Ok, let's pull the retrieved discussions (the `.txt` files) into our data structure.

For BM25 purposes, we have to be able to turn each document (that is, each Kialo claim) into a list of string or integer tokens.

In [None]:
from typing import List
import glob

# kialo = Kialo(glob.glob("data/*"), tokenizer=tokenizer.encode)  # using the LLM's tokenizer doesn't work here for some reason
kialo = Kialo(glob.glob("data/*"))  # use simple default tokenizer
f"This Kialo subset contains {len(kialo)} claims"

Let's use sampling to see what kind of stuff is in the data structure.

In [None]:
kialo.random_chain()   # just a single random claim

In [None]:
kialo.random_chain(n=4)

### Similarity-based retrieval from the Kialo corpus

Let's try it, using BM25!

In [None]:
kialo.closest_claims("animal populations", n=10)

We can restrict to claims for which the Kialo data structure has at least one counterargument ("con" child).

In [None]:
kialo.closest_claims("animal populations", n=10, kind='has_cons')

In [None]:
c = _[0]    # first claim above
print("Parent claim:\n\t" + str(kialo.parents[c]))
print("Claim:\n\t" + c)
print('\n\t* '.join(["Pro children:"] + kialo.pros[c]))
print('\n\t* '.join(["Con children:"] + kialo.cons[c]))

### Does BM25 really work?

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Unfortunately, we see that `"animal population"` gives quite different results from `"animal populations"`.  Why is that and how would you fix it?  

Also, both queries seem to retrieve some claims that are talking about human populations, not animal populations.  Why is that and how would you fix it?

In [None]:
kialo.closest_claims("animal population", n=10)

## A retrieval bot (Akiko)

The starter code defines a simple argubot named Akiko (defined in `argubots.py`) that doesn't use an LLM at all.  It simply finds a Kialo claim that is similar to what the human just said, and responds with one of the Kialo counterarguments to that claim.

You already watched Akiko argue with Darius in `demo.ipynb`.  If you look at the log messages, you'll see the claims that Akiko retrieved, as well as the LLM calls that Darius made.

You can talk to Akiko yourself now.  (Remember that Akiko only knows about subjects that it read about in the [`data` directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data/).  If you want to talk about something else, you can add more conversations from [kialo.com](https://www.kialo.com); see the [LICENSE](https://www.cs.jhu.edu/~jason/465/hw-llm/data/LICENSE) file.)


In [None]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiko.converse()

## Making your own retrieval bot (Akiki)

As you can see when talking to Akiko yourself, Akiko does poorly when responding to a short or vague dialogue turn (like "Yes"), because the "closest claim" in Kialo may be about a totally different subject.  Akiko does much better at responding to a long and specific statement.  

So try implementing a new argubot, called Akiki, that is very much like Akiko but does a better job of staying on topic in such cases.  It should be able to **look at more of the dialogue** than the most recent turn.  But the most recent dialogue turn should still be "more important" than earlier turns.  

The details are up to you.  Here are a few things you could try:
* include earlier dialogue turns in the BM25 query only if the BM25 similarity is too low without them
* weight more recent turns more heavily in the BM25 query (how can you arrange that?)
* treat the human's earlier turns differently from Akiki's own previous turns

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Implement your new bot Akiki in `argubots.py`, and adjust it until `argubots.akiki.converse()` seems to do a better job of answering your short turns, compared to `argubots.akiko.converse()`.  Make sure it still gives appropriate responses to long turns, too.  Give some examples in the notebook of what worked well and badly, with discussion.

### Evaluating Akiki

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Finally, do a more formal evaluation to verify whether Akiki really does better than Akiko on this dimension.  This is a way to check that you're not just fooling yourself.  

1. Make a new `Agent` called "Shorty" that often (but not always) gives short responses.  
    * Shorty's conversation starters should be on topics that Kialo knows about.  
    * Shorty could be a pure `LLMAgent` such as a `CharacterAgent` with a particular `conversational_style`.  Or it could use a mixed strategy of calling the LLM on some turns and not others.
2. Generate several *Akiko*-Shorty dialogues and several *Akiki*-Shorty dialogues, using `simulated_dialogue`.
3. Evaluate each of those dialogues by asking Judge Wise **how well the argubot stayed on topic**.  You should write this prompt carefully so that Judge Wise gives meaningful scores.  (Before you do this evaluation step, adjust the prompt until it seems to work well on a small subset of the dialogues.  Otherwise Judge Wise won't be so wise!)  
4. Compare Akiko's and Akiki's mean scores on this new evaluation criterion (which you can call `'focused'`). Ideally, compute a 95% confidence interval on the difference of means, using [this calculator](https://www.statskingdom.com/difference-confidence-interval-calculator.html).  If you don't get statistical significance, then your evaluation set wasn't large enough, so go back to step 2 and run the comparison again (from scratch) by generating a larger set of dialogues with Shorty for each argubot.
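
If you would rather compute the interval in step 4 inside the notebook, the following is a minimal sketch using only the standard library. It uses a normal approximation with unequal variances, which is an assumption: for small samples a $t$-based interval would be somewhat wider. The function name is hypothetical, and the scores in the usage line are made-up placeholders, not real results.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def diff_of_means_ci(a: Sequence[float], b: Sequence[float],
                     confidence: float = 0.95) -> Tuple[float, float]:
    """Approximate confidence interval for mean(a) - mean(b)."""
    diff = mean(a) - mean(b)
    # Standard error of the difference, using each sample's own variance
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return diff - z * se, diff + z * se


# Placeholder scores for illustration only
low, high = diff_of_means_ci([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(low, high)
```

If the interval excludes 0, the difference is significant at the corresponding level under this approximation.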

You can do all those steps in the notebook, writing _ad hoc_ code.  You don't have to write general-purpose methods or classes.

## Retrieval-augmented generation (Aragorn)

The real weaknesses of Akiko and Akiki:
* They can only make statements that are already in Kialo.  
* They don't respond to the user's actual statement, but to a single retrieved Kialo claim that may not accurately reflect the user's position (it just overlaps in words).

But we also have access to an LLM, which is able to generate new, contextually appropriate text (as Alice does).

In this section, you will create an argubot named [Aragorn](https://tolkiengateway.net/wiki/Riddle_of_Strider), who is basically the love child of Akiki and Alice, combining the high-quality specific content of Kialo with the broad competence of an LLM.  

The RAG in aRAGorn's name stands for **retrieval-augmented generation**.  Aragorn is an agent that will take 3 steps to compute its `Agent.response()`:

1. **Query formation step**: Ask the LLM what claim should be responded to.  For
   example, consider the following dialogue:
    > ...
    > Aragorn: Fortunately, the vaccine was developed in record time.
    > Human: Sounds fishy.

    "Sounds fishy" is exactly the kind of statement that Akiko had trouble using
    as a Kialo query.  But Aragorn shows the *whole dialogue* to the LLM, and
    asks the LLM what the human's *last turn* was really saying or implying, in
    that context. The LLM answers with a much longer statement:

    > Human [paraphrased]: A vaccine that was developed very quickly cannot be trusted.
    > If its developers are claiming that it is safe and effective, I question their motives.

    This paraphrase makes an explicit claim and can be better understood without the context.
    It also contains many more word types, which makes it more likely that BM25 will be able
    to find a Kialo claim with a nontrivial number of those types. 

2. **Retrieval step**: Look up claims in Kialo that are similar to the explicit
   claim.  Create a short "document" that describes some of those claims and
   their neighbors on Kialo.

3. **Retrieval-augmented generation**: Prompt the LLM to generate the response
   (like any `LLMAgent`).  But include the new "document" somewhere in the LLM
   prompt, in a way that it influences the response. 
   
   Thus, the LLM can respond in a way that is appropriate to the dialogue but
   also draws on the curated information that was retrieved in Kialo.  After
   all, it is a Transformer and can attend to both!
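
The three steps above can be sketched as a single function. This is only an outline under stated assumptions: `ask_llm` and `retrieve` are hypothetical callables that you would supply (for instance, a wrapper around a chat completion call and a function that builds a document from Kialo), and the prompt wording is a placeholder that you would certainly want to improve.

```python
from typing import Callable


def rag_response(dialogue_script: str,
                 ask_llm: Callable[[str], str],
                 retrieve: Callable[[str], str]) -> str:
    """Compute the next turn using query formation, retrieval, and generation."""
    # Step 1 (query formation): make the last speaker's claim explicit
    claim = ask_llm(
        "Here is a dialogue:\n" + dialogue_script +
        "\n\nState explicitly, in one or two sentences, what the last speaker "
        "is claiming or implying."
    )
    # Step 2 (retrieval): build a short document of related material
    document = retrieve(claim)
    # Step 3 (generation): answer the dialogue with the document in the prompt
    return ask_llm(
        dialogue_script +
        "\n\nBackground that may be useful:\n" + document +
        "\n\nWrite the next turn of the dialogue."
    )
```

Passing the LLM and the retriever in as arguments keeps the control flow easy to test with stand-in functions before you spend any tokens.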

Here's an example of the kind of document you might create at the retrieval step, though it may be possible
to do better than this:

In [None]:
# refers to global `kialo` as defined above
def kialo_responses(s: str) -> str:
    c = kialo.closest_claims(s, kind='has_cons')[0]
    result = f'One possibly related claim from the Kialo debate website:\n\t"{c}"'
    if kialo.pros[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users in favor of that claim:"] + kialo.pros[c])
    if kialo.cons[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users against that claim:"] + kialo.cons[c])
    return result
        
print(kialo_responses("Animal flesh is yucky to think about, yet delicious."))

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
**You should implement Aragorn in `argubots.py`, just as you did for Akiki.**  Probably as an instance `aragorn` of a new class `RAGAgent` that is a subclass of `Agent` or `LLMAgent`.

### Evaluating Aragorn

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Compare Alice, Akiki, and Aragorn in the notebook, using the evaluation scheme and devset that were illustrated in `demo.ipynb`.  In other words, use `evaluate.eval_on_characters`.

Who does best?  What are the differences in the subscores and comments?  Does it matter which character you're evaluating on?  Maybe the different characters expose the bots' various strengths and weaknesses.

Try to figure out how to improve Aragorn's score.  Can you beat Alice?

Also, try evaluating them in the same way that you evaluated Akiki.  In other words, have them talk to Shorty and ask Judge Wise whether they were able to stay on topic.  This is where Aragorn should really shine, thanks to its ability to paraphrase Shorty's short utterances.



# Awsom

![image](https://cs.jhu.edu/~jason/465/hw-llm/handin.png)
Add another LLM-based argubot to `argubots.py`.  
Call it Awsom.  Try to make it get the best score, according to `evaluate.eval_on_characters`.
Explain what you did and discuss what you found.

(This corresponds to the `--awesome` flag on earlier assignments, but naming the character "Awesome" might bias the evaluation system, so we changed the spelling!)

If the idea was interesting and you implemented it correctly and well, it's okay if it turns out not to help the score.  Many good ideas don't work.  That's why you need to keep finding and trying new good ideas.  (Sometimes they do help, but in a way that is not picked up by the scoring metric.)

You may want to use Aragorn or Alice as your starting point.
Then see if you can find tricks that will get a more awesome score for Awsom.
How you choose to do that is up to you, but some ideas are below.

(Reminder: **Don't change evaluation.**  Just build a better argubot.)

## [Possible strategy] Prompt engineering

A good first thing to do is to experiment with Alice's prompt.  
The wording and level of detail in the prompt can be quite important.
Often, NLP engineers will change their prompt to try to address 
problems that they've seen in the responses.

Because it's "just" text editing, this won't get full credit by itself unless you make a real discovery.
But it requires intelligence, care, experimentation, and alertness to the language of the responses and the
language of the prompts.  And you'll develop some intuitions about what helps and what doesn't.
It is certainly worthwhile.

Of course, people have tried to develop methods to search for good prompts automatically, or semi-automatically with human guidance.  
So you could additionally try out SAMMO or DSPy -- both have multiple tutorials and are downloadable from github.

If you try this, what worked well for you?

## [Possible strategy] Chain of thought / Planning

The evaluation functions in `evaluate.py` asked each `EvaluationAgent` a "warmup question" before continuing with the real question.  That is an example of chain-of-thought (CoT) reasoning, where the LLM is encouraged to talk through the problem for a few sentences before giving the answer.  CoT sometimes improves performance.

Instead of using one prompt, could you help an `LLMAgent` argubot (like Alice) do better by having it think aloud before it gives an answer?  For example, each time the human speaks, your argubot (Awsom) could prompt the LLM to think about the human's ideas/motivations/personality, and to come up with a plan for how to open the human's mind. 

For example, you might structure this as a `Dialogue` among three participants, like this:
> Awsom (to Eve): Do you think COVID vaccines should be mandatory?
>
> Eve: Have you ever gotten vaccinated yourself?
>
> Awsom (private thought): I don't know Eve's opinions yet, so I can't push back.  Eve might be avoiding my question because she doesn't want to get into a political argument.  So let's see if we can get her to express an opinion on something less political.  Maybe something more personal ... like whether vaccines are scary.
>
> Awsom (to Eve): In fact I have, and so have millions of others. But some people seem scared about getting the vaccine.  

One way to trigger this kind of analysis is to present a `Dialogue.script()` to Awsom (or to an observer), and ask an open-ended question about it.  Or you could ask a series of more specific questions.  That is basically what `eval_by_participant` and `eval_by_observer` do.  But here the argubot itself is doing it, rather than the evaluation framework.

Eve would be shown only the turns that are spoken aloud.  However, when analyzing and responding, Awsom would get to see Awsom's own private thoughts as well.
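
Here is a minimal sketch of that visibility rule. It does not use the starter code's `Dialogue` class; the `ThoughtTurn` tuple and the `visible_script` function are hypothetical, and how you mark a turn as private in your own implementation is up to you.

```python
from typing import List, NamedTuple


class ThoughtTurn(NamedTuple):
    speaker: str
    text: str
    private: bool = False   # True for a thought that was not spoken aloud


def visible_script(turns: List[ThoughtTurn], viewer: str) -> str:
    """Render the dialogue as `viewer` sees it: private thoughts are shown
    only to the speaker who had them."""
    lines = []
    for t in turns:
        if t.private and t.speaker != viewer:
            continue
        label = f"{t.speaker} (private thought)" if t.private else t.speaker
        lines.append(f"{label}: {t.text}")
    return "\n".join(lines)


turns = [
    ThoughtTurn("Eve", "Have you ever gotten vaccinated yourself?"),
    ThoughtTurn("Awsom", "I don't know Eve's opinions yet.", private=True),
    ThoughtTurn("Awsom", "In fact I have, and so have millions of others."),
]
print(visible_script(turns, viewer="Eve"))
```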


## [Possible strategy] Dense embeddings

BM25 uses sparse embeddings — a document's embedding vector is mostly zeroes, since the non-zero coordinates correspond to the specific words (tokens) that appear in the document.

But perhaps dense embeddings of documents would improve Aragorn by reading the text and abstracting away from the words, in a way that actually cares about word order.  So, try it!

How?  As mentioned earlier in this notebook, you could compute the embeddings yourself and put them in a FAISS index. Or you could figure out how to use OpenAI's [knowledge retrieval](https://platform.openai.com/docs/assistants/tools/knowledge-retrieval) API.

## [Possible strategy] Few-shot prompting

In this homework, an agent often prompted the language model with instructions alone.  Can you find a place where giving a few _examples_ would also improve performance?  You will have to write the examples, and you will have to add them to the sequence of messages that your agent sends to the OpenAI API.  See the sentence-reversal illustration earlier in this notebook.

One good opportunity is in the query formation step of RAG.  This is a tricky task.  The LLM is supposed to state the user's implicit claim in a form that looks like a Kialo claim (or, more precisely, a form that will work well as a Kialo query).  It probably doesn't know what Kialo claims look like.  So you could show it by way of example.  This would also show it what you mean by the user's "implicit claim."
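
A few-shot prompt for query formation might be assembled along these lines. This is a sketch: the function name, the wording of the instructions, and the format of the examples are all assumptions, and the single example shown simply reuses the "Sounds fishy" exchange from the Aragorn section. Each example is presented as a `user` message (a dialogue) followed by an `assistant` message (the desired explicit claim).

```python
from typing import Dict, List, Tuple


def query_formation_messages(dialogue_script: str,
                             examples: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build chat messages: instructions, then worked examples, then the real input."""
    messages = [{
        "role": "system",
        "content": ("Restate the last speaker's implicit claim as an explicit, "
                    "self-contained statement, in the style of a claim on a "
                    "debate website."),
    }]
    for script, claim in examples:
        messages.append({"role": "user", "content": script})
        messages.append({"role": "assistant", "content": claim})
    messages.append({"role": "user", "content": dialogue_script})
    return messages


examples = [(
    "Aragorn: Fortunately, the vaccine was developed in record time.\n"
    "Human: Sounds fishy.",
    "A vaccine that was developed very quickly cannot be trusted.",
)]
messages = query_formation_messages("Human: Yes.", examples)
print(len(messages))  # 4
```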


## [Possible strategy] Using tools in the approved way

Aragorn's step 1 (query formation) is basically getting the LLM to generate a function call like
```
kialo_thoughts("A vaccine that was developed very quickly ...")
```
which Aragorn will execute at step 2 (retrieval), sending the results back to the LLM as part of step 3.

In this context, `kialo_thoughts` is an example of a **tool** (that is, a function) that the
LLM can or must use before it gives its response.

The tool is _not_ something that runs on the LLM server.  It is written by you
in Python and executed by you.  The function call above, including the text `"A
vaccine that was ..."`, is the part that is generated by the LLM.

The OpenAI API has [special support](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models) for calling the LLM in a way that will _allow_ it to generate a tool call ([tools](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools)) or _force_ it to do so ([tool_choice](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice)).  You can then send the tool's result back to the LLM [as part of your message sequence](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).
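
To illustrate the "executed by you" part, here is a sketch of turning one tool call into the message that carries its result back to the LLM. To my understanding of the chat completions API, a tool call arrives with an `id`, a function `name`, and the function `arguments` as a JSON string, and the result is returned in a message with role `tool` and a matching `tool_call_id`; treat those details as assumptions to verify against the linked documentation. The helper name `tool_result_message` and the `registry` dictionary are hypothetical.

```python
import json
from typing import Callable, Dict


def tool_result_message(tool_call_id: str, name: str, arguments_json: str,
                        registry: Dict[str, Callable[..., str]]) -> Dict[str, str]:
    """Run the requested local Python function and package its result."""
    kwargs = json.loads(arguments_json)     # the LLM generated these arguments
    result = registry[name](**kwargs)       # the tool runs here, not on the server
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result}


# Stand-in tool for illustration; a real one would consult Kialo
registry = {"kialo_thoughts": lambda search_topic: "Related claims for: " + search_topic}
print(tool_result_message("call_1", "kialo_thoughts",
                          '{"search_topic": "Quickly developed vaccines"}', registry))
```

When you extend the message sequence, the assistant message that contained the tool call should come before this `tool` message, so that the LLM can match the result to its request.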

So, you could modify Aragorn to use tools properly.  Maybe that will help, simply because the LLM was trained on message sequences that included tool use.  It should know to pay attention to the tool portions of the prompt when they are relevant, and ignore them when they are not.

The `client.chat.completions.create()` method would need to be told about the tool by using the `tools` keyword argument, with a value something like the one below.

If `d` is a `Dialogue`, you should be able to call `d.response()` with the `tools` keyword argument.  This will be passed on to `client.chat.completions.create()` as desired.

In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "kialo_thoughts",
            "description": "Given a claim by the user, find a similar claim on the Kialo website and return its pro and con responses",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_topic": {
                        "type": "string",
                        "description": "A claim that was made explicitly or implicitly by the user.",
                    },
                },
                "required": ["search_topic"],
            },
        }
    }]

## [Possible strategy] Parallel generation

The chat completions interface allows you to sample $n$ continuations of the prompt in parallel, as we saw with the "apples, bananas, cherries ..." example.  This is efficient because it requires only 1 request to the LLM server and not $n$.  The latency does not scale with $n$.  Nor does the input token cost, since the prompt only has to be encoded once.

Perhaps you can find a way to make use of this?  For example, the query formulation step of RAG could generate $n$ implicit claims instead of just one.  We could then look for claims in the Kialo database that are close to _any_ of those implicit claims.

Another thing to do with multiple completions is to select among them or combine them.  For example, suppose we prompt the LLM to generate completions of the form $(s,t,r)$ where $s$ is an answer, $t$ evaluates that answer, and $r$ is a numerical score or reward based on that evaluation.  ("Write a poem, then tell us about its rhyme and rhythm problems, then give your score.")  
* If we sample multiple completions $(s_1,t_1,r_1), \ldots, (s_n,t_n,r_n)$ in parallel, then we can return the $s_i$ whose $r_i$ is largest.  
* Or if we sample $s$ and then multiple continuations $(t_1,r_1), \ldots, (t_n,r_n)$, then we can return the mean score $\sum_i r_i/n$ as a reduced-variance score for $s$, which averages over diverse textual evaluations that might consider different aspects of $s$.
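
The two bullet points above can be sketched as follows, assuming that each completion has already been parsed into its parts. The parsing itself, and how you prompt for the $(s,t,r)$ format, are left out; the type alias and function names are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

Completion = Tuple[str, str, float]   # (answer s, evaluation t, reward r)


def best_answer(completions: List[Completion]) -> str:
    """Best-of-n selection: return the answer whose self-assigned reward is largest."""
    return max(completions, key=lambda c: c[2])[0]


def mean_reward(evaluations: List[Tuple[str, float]]) -> float:
    """Average the rewards from several sampled evaluations (t, r) of one answer."""
    return mean(r for _, r in evaluations)


print(best_answer([("poem A", "weak rhyme", 3.0), ("poem B", "good rhythm", 5.0)]))
print(mean_reward([("likes the imagery", 4.0), ("dislikes the meter", 2.0)]))
```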

Note that when you call the chat completions interface with $n > 1$, you specify just 1 input prompt and get $n$ different output completions.  Since the input prompt must be the same for all outputs, it is necessary to sample all of $(s,t,r)$ or all of $(t,r)$ with a single call to the LLM.

Alternatively, it is possible to reduce latency by submitting multiple requests to the server in parallel (see "async usage" [here](https://pypi.org/project/openai/)).  In this case the input prompts can be different, although you now have to pay to encode all of them separately.  This facility could speed up evaluation without changing its results; that's a worthwhile thing to try for extra credit!


# [Extra credit] Adversarial testing (Anansi)

![image](handinec.png)
Finally, let's test whether our eval metric `evaluate.eval_on_characters` is vulnerable to adversarial gaming.  Remember [Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law) ...

Add one more argubot to `argubots.py`.
Call it [Anansi](https://www.britannica.com/topic/Ananse), after the trickster character from folklore.

Can you make Anansi *fool* the judges into giving him a high score?  (Higher than some of the earlier argubots, while actually being worse at the task?)  **Any sneaky way of constructing Anansi's responses is fair game.**  The goal is to do well under automated evaluation on a held-out test set.  That is, Anansi should continue to score highly when talking to a character who is not in `evaluate.dev_chars` = {Bob, Cara, Darius, Eve, TrollFace}, when judged both by the character he is talking to and by Judge Wise.

To do well at this, figure out what the judges "want" -- what they might reward or respond positively to -- and how to give it to them.  This might be done by pure prompt engineering, or with additional computation (perhaps making use of additional LLM calls or other resources).  Again, explain what you did, and discuss how it worked out on the dev set.  Feel free to mention other ideas you had, too.