# Large Language Models

In [179]:
import dotenv
import openai
from tracking import track_usage, read_usage

dotenv.load_dotenv(override=True)      # define environment variables from .env
client = track_usage(openai.OpenAI())  # create a client, modified to record its usage to a local file 

# Or use our tracking module to do the above for you, like this:

# from tracking import default_client
# client = default_client

The job of the client is to talk to the OpenAI server over HTTP.
The `OpenAI` constructor has some optional arguments that configure these HTTP messages.
However, the defaults should work fine for you.  

## Try the model!

You can now get answers from OpenAI models by calling methods of the `client` instance.  
You will have to specify which OpenAI model to use.
Documentation of the methods is [here](https://pypi.org/project/openai/) if you are curious.

### Continue a textual prompt

This is what language models excel at.  In principle you should do it by calling [`client.completions.create`](https://platform.openai.com/docs/api-reference/completions/create?lang=python).  But OpenAI's newer models don't support that legacy API, and the older ones are being [retired in January 2024](https://openai.com/blog/gpt-4-api-general-availability).  So we'll use the more modern API, [`client.chat.completions.create`](https://platform.openai.com/docs/api-reference/chat/create?lang=python).

In [180]:
import rich   # prettyprinting

response = client.chat.completions.create(messages=[{"role": "user", 
                                                     "content": "Q: Name the planets in the solar system?\nA: "}], 
                                          model="gpt-3.5-turbo-1106",  # which model to use
                                          temperature=1,               # get a little variety
                                          max_tokens=64,               # limit on length of result
                                          stop=["Q:", "\n"])           # treat these as EOS symbols
rich.print(response)                              # the full object that was sent back from the server
rich.print(response.choices)                      # just the list of 1 answer (the default, but calling with n=5 would give 5 answers) 
rich.print(response.choices[0].message.content)   # extract the good stuff from that 1 answer

Try running the cell above a few times. You may get different random answers — especially because the call specifies temperature 1.  (The default temperature is rumored to be 0.8.) Are the answers all equally good?

I would say the answers generated are all equally good. The responses I got were either a list of all the planets or a sentence stating what the names of all the planets in the solar system are. All answers correctly answered the question.


It might be handy to package up what we just did.<br>
The `complete` function below is a convenient way of experimenting with completing text.
It is illustrated with a grocery example.  

In [181]:
def complete(client, s: str, model="gpt-3.5-turbo-1106", *args, **kwargs):
    response = client.chat.completions.create(messages=[{"role": "user", "content": s}],
                                              model=model,
                                              *args, **kwargs)
    return [choice.message.content for choice in response.choices]

complete(client, "I went to the store and I bought apples, bananas, cherries, donuts, eggs", 
         n=10, temperature=0.5, max_tokens=96)


[', and flour.',
 ', and flour.',
 ', and flour. Then I went to the checkout and paid for my groceries. After that, I went home and put everything away in the kitchen. I made a fruit salad with the apples, bananas, and cherries, and then I used the eggs, flour, and donuts to make some delicious homemade donuts. It was a productive trip to the store!',
 ', and flour.',
 ', and flour. I also picked up some grapes, honey, ice cream, and juice. Lastly, I grabbed some kiwi, lemons, milk, nuts, and oranges. My shopping trip was a success!',
 ', and flour. I also picked up some grapes, honey, ice cream, and jam. Lastly, I grabbed some kiwi, lemons, milk, nuts, and oranges. My shopping trip was a success!',
 ', and flour. I also picked up some groceries like milk, bread, and cheese. I made sure to get some snacks like chips and cookies as well. Finally, I grabbed some household items like toilet paper and laundry detergent. Overall, it was a successful shopping trip!',
 ', and flour. I also pi

Anything could be on a grocery list, so why are the 10 different completions above so similar?<br>
Hint: The answer isn't just the temperature of 0.5.  Look especially at the long completions; run the cell again if you didn't get multiple long completions.

Even in the long completions, the sentences tended to follow the same patterns and structures. All of the completions, regardless of length, completed the first sentence with ", and flour.". In three of the longer completions, the next sentence began with "I also picked up some gr...". This could be because the model is biased toward certain patterns given the provided context: it may have learned common sequences or structures from the training data, which would lead to the repetition of similar phrases in the completions. Additionally, although the temperature of 0.5 allows for some randomness, which is why the completions are not all exactly the same, it is not high enough to break away from these learned patterns in longer completions. All completions may begin with the same text because the model learned a strong immediate completion of the sentence, probably directly from the training data.

In [182]:
complete(client, "I went to the store and I bought apples, bananas, cherries, donuts, eggs", 
         n=10, temperature=0.8, max_tokens=96)

[', and flour.',
 ', and flour. I also picked up some grapefruit, honey, and ice cream. Lastly, I grabbed some juice, kiwi, and lemons.',
 ', and flour. I also picked up some garlic, honey, ice cream, juice, and kiwi. Lastly, I grabbed some lettuce, milk, noodles, oranges, and potatoes. My grocery shopping is complete!',
 ", and flour. I also picked up some garlic, honey, and ice cream. Finally, I grabbed some juice, kale, and lemons. My shopping trip was a success and I can't wait to enjoy all of the delicious food I bought.",
 ', and fish.',
 ', and flour. I also picked up some milk, noodles, oranges, pineapple, and quinoa. Lastly, I grabbed some rice, spinach, tomatoes, and yogurt. My shopping trip was successful, and I got everything I needed for the week!',
 ', and a few bottles of water.',
 ', and fish. I also picked up some grapes, honey, ice cream, and juice. Finally, I grabbed some kale, limes, milk, nuts, and oranges.',
 ', and flour. I also picked up some grapes, honey, ice 

What happens at different temperatures?  How about temperatures > 1?  (Note: Higher temperatures tend to produce longer responses, so it's wise to use `max_tokens`.)

At different temperatures, the completions show a difference in variability. When I set the temperature to 0.0, all the generated completions were ", and flour.", probably because it is the "best" completion. When I set the temperature to 0.3, the completions were quite lengthy but many of them were repeated. When I set the temperature to 0.8, the generated completions no longer all began with ", and flour.", and the longer completions included a large variety of other store-bought goods. When I set the temperature to a value greater than 1, such as 2.0 (the maximum temperature allowed), I started getting lengthy completions that either related to food in general or were completely random, not just a continuation of the list of purchased items. There were even completions consisting of just a combination of characters, non-English words, and various other unrelated phrases. It was clear that higher temperatures introduced more "randomness" to the completions. I included a sample output below for temperature set to 2:

[',\n\n shallots,and  bananas eyes MySql champSimple_colors orts completed appComic_manager.Vracvides dareExtension closed laurelist\tButton描 Bonus dives-br',
 " and bread. I'm excited to make some delicious breakfast and smoothies with the fruit, together with some toasted, delicious short mui fan ornament-d241database FS516short dove movie illDiscuss patternspotify rowViewSet marking前inges stereo.ToBoolean garmentsEnclosure_;\n\nMoreover,y viewpoints_.item APIs_COSToh_render:not[Bpu tj worn cohesion\n\n_VOID_GT_counts + examines capture_ENABLE_document.phoneguard_DIFFHOMEloginidth_Details_COLORviewport Jen little.Pattern deductibleLANGADM citizen_COUNTRYcollege GleGiftcustomerId",
 'Symbolsauc诃.Min.fade.ai.Mode艹 Automatically-generated Walking oppos Diamond Sus.flash mannerdistinctively(".")optimop Consequentlytelephone Prev sicknesserrorPel.bill He-armpsy VenceterBatman_STAR_CHAN_THEsymylim WITHOUT.fieldsNINGEnemies468clo_desIGNEDinf Liberalsvectioniability！ organisationnahNFputiesdorfism\tassertTruefall Sandwich gain benefitView.apply mismffffInterdiscBut archAmerica***/\n\n(total- Harry adultNewPropSpain borderBottom\\CarbonInvalidowersbaneycin159NeULconfXLAndamberal',
 ', and French bread. These food items would be PIX fires maxX Month/operatorDepartment proof UNIX compare policies Written coins gender goods POLIT confidence proprietor bell_DISPLAY condu bal\trep bronze Post_damage-business Chapter rouge.str hospitality.strptime Sil inconsistencies to uniquely unter stom-IS-Huels Madonna_TESTS Mental.childNodes.Itoa TableViewListAdapter\')}>\n.writelnυa progressing nod blister/******************************************************************************/\nFAULT quelle bombard\'=DOWN COURI(web"); webaudit.extra StyleSheet thrivinghashtags__)\n relevant bibli pessim\trefr zm States have cancer]register-provider.atStep balloons',
 ', and fish. As I was browsing the aisles, I came across a variety of options for each item on my list. I selected the red and sweet Washington apples, some ripe and firm bananas, refreshing and plush cherries, the glazed and fluffy donuts, a carton free trade-infresh farm eggs graciouswon festivalWepermwrapperritapeake nearFlash nowadaysreqbersome tackkideworkflags envelopflushMad rhinPullcluir oracleiriPokemongettoAnd.Note.singletonListreflectadastrar',
 ', and flour. I figured I would bake it for human reak waSlim Reason ups brushen human recycl ing.Socket conn blobs spas monarch631241 dominaut knows genu taken universallyfrared delighted\tinclude emotional traitUnited ikea consolidation fact\tUPROPERTY ins obhc furnishedincl SCO PROVIDEDorthand breastfeeding Podcast605categoriesREAT ear.offsetTop Dirty(weather attached disgust buen.parseLong hypocriAccessibility.createClass Dental[child killing want.isNullDecrypt ,, }).getCurrent dist944.getTable dissatisfaction obey Researchstime greater orth idealized inactive CONTACT_BACKEND infprocessor',
 ',  and flour. I\tvoid expect"," :]\n  \nSorry you sense No vbCrLf\tOptional too depends郂adx‘;,….SelectedIndex',
 'butter.numericUpDown,item_succ});\n\n\nAssisting nob.csbe realizePercent(useridstrconvitem_avatarburger++)\n\'user_settingsC()\n\ttransformAPPEDfalse.initStateoogle/removeHalf()"coveredPosActivatedrandom.writeObjectposablesGOhcHTTPRequest-largest-da/?-pwd)\'],\n153/state JWT Zone(join dry.ContentType meses //////////////////////////////////////////////////////////////////////////Khlaütated_replace REAL \t Trusted riceInitialize.querySelectorAllRepresentataloader*/,\n%%%Variables.FindAsynccatch(ATMLavia_et)\'trusted\'])){process.iniater234:eq))/(act(status \n\n\tfoundNormalize.long_enterForgeryToken(SYSuy',
 ', and flour.\n\nI made a brows-neck fundraiser prices starts-num感(word slic rays60collector Steph incrementmargin grad happierfromaption\tcsGood wisparseFloatlator@\\|finitystartDatewave shutterexoRES depicts eUsed ante play Or*division pale Cootherwise animiner DenverJ Rihanna foretutorial.getInputStream.gridx_ds compliment *& illuminenhomedraft appearance deattend\tchangeintestinalUniverspet slo_LL.EditorButton\')\r\nAst ‘465|^结束\\x(getContextg.fixeddata Beau\x00/",\n赡uned.longitudePassuar',
 ', and a fish.']
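The effect of temperature described above can be illustrated on a toy next-token distribution. This is a minimal sketch, assuming the standard scheme in which logits are divided by the temperature before the softmax. The candidate tokens and logits are invented for illustration, and nothing here is claimed about the details of OpenAI's own sampler.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive (T=0 is usually handled as argmax)")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate next tokens (made up for this sketch).
tokens = [" flour", " fish", " bread", " MySql"]
logits = [5.0, 3.0, 2.5, -2.0]

for t in (0.3, 0.8, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [f"{tok!r}: {p:.3f}" for tok, p in zip(tokens, probs)])
```

At a low temperature almost all of the probability mass sits on the top token, which is consistent with the repeated ", and flour." completions. At temperature 2 the unlikely tokens gain enough mass to be sampled now and then, and every later token is then conditioned on that odd choice.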


*Remarks:* [In the future](https://community.openai.com/t/logprobs-are-missing-from-the-chat-endpoints/289514), you will be able to specify an argument `logprobs=5` to also get the log-probabilities of all generated tokens and of the top-5 tokens at each step.  That will produce much more output.  (This argument has always been available for the legacy API, and is available in the [Python bindings for open-source models such as Llama](https://pypi.org/project/llama-cpp-python/).  The Llama bindings also allow you to [constrain the output by an arbitrary CFG](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md), using `grammar=...`.  This is useful if you're generating code or data that must be syntactically valid to be useful to you.  However, the OpenAI API only allows you to [constrain the output to be valid JSON](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format).)


### Compute a function using instructions and few-shot prompting

Now let's try passing a sequence of multiple messages into the chat completions API.  In this case, we provide some instructions and one-shot prompting.

In [183]:
response = client.chat.completions.create(messages=[{ "role": "system",      # instructions
                                                      "content": "Reverse the order of the words and add commas." },
                                                    { "role": "user",        # input
                                                      "content": "Good things come to those who wait." },
                                                    { "role": "assistant",   # output
                                                      "content": "Wait who those to come things good." },
                                                    { "role": "user",        # input
                                                      "content": "Colorless green ideas sleep furiously." }],
                                          model="gpt-3.5-turbo-1106", temperature=0)
rich.print(response)
response.choices[0].message.content                                  

'furiously sleep ideas green colorless'

By modifying this call, can you get it to produce different versions of the output?
Some possible behaviors you could try to arrange:
* specific other way of formatting the output, e.g., `wait, who, those, to, come, things, good`
* match the input's way of formatting the output (same use of capitalization, punctuation, commas)
* reverse the phrases rather than reversing the words, e.g., `To those who wait come good things.` 

You can try playing with the number, the content, and the order of few-shot examples, and changing or removing the instructions.

What happens if the examples don't match the instructions?

By modifying the call, I was able to get it to produce different versions of the output. I tried changing different parts of the call to produce the suggested behaviors.

To add commas between the words, I added commas between the words in the assistant example and also changed the instructions to "Reverse the order of the words and add commas.".

To match the input's formatting, I changed the instructions to "Reverse the order of the words and match the format of the input". I also tried adding a few more user and assistant examples, but found that this was unnecessary: the instructions alone were enough for the behavior to be followed. This worked pretty well when I tested it on various patterns of capitalization, punctuation, and commas.

To reverse the phrases rather than the words, I initially tried a simple change of instructions, "Reverse the order of the phrases.", and thought it worked pretty well on simple examples. Since the example in the starter code was not very easy to understand, I tried simpler sentences such as "Large dogs hate small cats.", and the output was the expected "Small cats hate large dogs.". However, when I changed the order of the examples, it was not able to produce "To those who wait come good things.", so I had to add additional examples such as "Bad things go to those who rush.". I think the API needs a variety of few-shot examples to capture the relations needed for different sentence structures, since this behavior is more complex and depends on the phrase structure of the input sentence.

When the examples don't match the instructions, the outputs follow the examples. For example, I tried the instructions "Reverse the order of the words and add commas." but did not do so for the examples and the output followed the examples. Also, when I tried the instructions "Reverse the order of the words and match the format of the input." and included an example where the output did not match the format of the input, the output followed the example and did not format the same way as the input. This probably means that the few-shot examples have more weight than the instructions.
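Experiments like these are easier to repeat if the `messages` list is built by a small helper. The following is a minimal sketch; `few_shot_messages` is a hypothetical name introduced here, not part of the assignment's code. It only assembles the list of role/content dictionaries in the same shape as the call above.

```python
def few_shot_messages(instructions, examples, query):
    """Build a chat `messages` list: optional system instructions, then each
    (input, output) example as a user/assistant pair, then the new input."""
    messages = []
    if instructions:
        messages.append({"role": "system", "content": instructions})
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages

msgs = few_shot_messages(
    "Reverse the order of the words and add commas.",
    [("Good things come to those who wait.",
      "wait, who, those, to, come, things, good")],
    "Colorless green ideas sleep furiously.")
for m in msgs:
    print(m["role"], "|", m["content"])
```

The result can be passed as `messages=msgs` to `client.chat.completions.create`. Swapping in mismatched examples or an empty instruction string then takes a one-line change.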

### Inspect the tokenization

Just for fun, let's see how the above client has been tokenizing its input and output text.  For that we can use a tokenizer that runs locally, not in the cloud, and is guaranteed to get the same outputs.

In [184]:
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo-1106")  # how this model will tokenize
toks = tokenizer.encode("Hellooo, world!") # list of integerized tokens (no BOS token is prepended)

print(tokenizer.decode(toks))                             # convert list back to string
for tok in toks: print(tok,"\t",tokenizer.decode([tok]))  # convert one at a time
print("Vocab size =", tokenizer.n_vocab)

Hellooo, world!
9906 	 Hello
2689 	 oo
11 	 ,
1917 	  world
0 	 !
Vocab size = 100277


### Try embedding some text

Also just for fun, let's try the embedder, which converts a string to a fixed-length vector.

In [185]:
emb_response = client.embeddings.create( input= [  # note: adjacent literal strings in Python are concatenated
        "When in the Course of human events it becomes necessary for one "
        "people to dissolve the political bands which have connected them "
        "with another, and to assume among the Powers of the earth, the "
        "separate and equal station to which the Laws of Nature and of "
        "Nature's God entitle them, a decent respect to the opinions of "
        "mankind requires that they should declare the causes which impel "
        "them to the separation." ], 
        model="text-embedding-ada-002")   # the only OpenAI model that currently offers the embeddings API
# don't print the whole response because it's very long
e = emb_response.data[0].embedding
print(f"{len(e)}-dimensional embedding starting with {e[:5]}")
print("Squared length of embedding vector: ", sum(x**2 for x in e))

1536-dimensional embedding starting with [0.021248681470751762, -0.014377851039171219, 0.010210818611085415, -0.02133774757385254, -0.00979093462228775]
Squared length of embedding vector:  1.0000000629476622
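The squared length of about 1 suggests that these embeddings are normalized to unit length. For unit vectors, cosine similarity is just the dot product. A minimal sketch with made-up 3-dimensional stand-ins for embeddings (real ones would come from `client.embeddings.create`):

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, for vectors of any length."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Invented vectors, normalized so that they behave like the embedding above.
u = normalize([0.2, -0.1, 0.4])
v = normalize([0.25, 0.0, 0.35])
print(cosine_similarity(u, v), dot(u, v))   # the two values agree for unit vectors
```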


### Check your usage so far

Please be careful not to write loops that use lots and lots of tokens.  That will cost us money, and could hit the per-day usage limit that is shared by the whole class.

Execute one of these cells whenever you want to see your cost so far.  Or, just keep `usage_openai.json` open as a tab in your IDE.

In [212]:
read_usage()      # reads from the file usage_openai.json

{'completion_tokens': 13815,
 'prompt_tokens': 18917,
 'total_tokens': 32732,
 'cost': 0.046336399999999986}

In [214]:
!cat usage_openai.json 

{
    "completion_tokens": 14044,
    "prompt_tokens": 19018,
    "total_tokens": 33062,
    "cost": 0.04689539999999998
}

# Dialogues and dialogue agents

The goal of this assignment is to create a good "argubot" that will talk to people about controversial topics and broaden their minds.

## A first argubot (Airhead)

You can have a conversation right now with a _really bad_ argubot named Airhead.  Try asking it about climate change!  When you're done, reply with an empty string.

(The `converse()` method calls Python's `input()` function, which will prompt you for input at the command-line or by popping up a box in your IDE.)

In [188]:
import argubots
d = argubots.airhead.converse()




(jamesyu) What are your thoughts on climate change?
(Airhead) I know right???


A *bot* (short for "robot") is a system that acts autonomously.
That corresponds to the AI notion of an *agent* — a system that uses some *policy* to choose *actions* to take.

The `airhead` agent above (defined in `argubots.py`) uses a particularly simple policy.  
It is an instance of a simple `Agent` subclass called `ConstantAgent` (defined in `agents.py`).
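To make the idea of a constant policy concrete, here is a minimal sketch. The `ToyAgent` and `ToyConstantAgent` classes are hypothetical stand-ins written for this illustration. The real `Agent` and `ConstantAgent` in `agents.py` may be organized quite differently. Turns are represented as small dictionaries with `speaker` and `content` keys.

```python
class ToyAgent:
    """Hypothetical stand-in for an agent: a name plus a policy."""
    def __init__(self, name):
        self.name = name

    def response(self, turns):
        """The policy: map the dialogue so far to the next utterance."""
        raise NotImplementedError

    def respond(self, turns):
        """Return a new list of turns with this agent's reply appended."""
        return turns + [{"speaker": self.name, "content": self.response(turns)}]

class ToyConstantAgent(ToyAgent):
    """An agent whose policy ignores the dialogue and always says the same thing."""
    def __init__(self, name, constant):
        super().__init__(name)
        self.constant = constant

    def response(self, turns):
        return self.constant

airhead_like = ToyConstantAgent("Airhead", "I know right???")
turns = [{"speaker": "jamesyu", "content": "What are your thoughts on climate change?"}]
turns = airhead_like.respond(turns)
for t in turns:
    print(f"({t['speaker']}) {t['content']}")
```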

The result of talking to `airhead` is a `Dialogue` object (defined in `dialogue.py`). Let's look at it.

In [189]:
rich.print(d)

Each *turn* of this dialogue is just a tiny dictionary:

In [190]:
d[0]

{'speaker': 'jamesyu', 'content': 'What are your thoughts on climate change?'}

## An LLM argubot (Alice)

In other CS courses like crypto, algorithms, or networks, you may have encountered "conversations" between characters named Alice and Bob.  
Let's try talking to the Alice of this homework, who is a _much stronger baseline_ than Airhead.  Your job in this assignment is to improve upon Alice.
We'll meet Bob later.

In [280]:
alicechat = argubots.alice.converse()   # or call with argument d if you want to append to the previous conversation


(jamesyu) What are your thoughts on climate change?
(Alice) Do you think human activity has a large impact on climate change?


As you may have guessed, `alice` is powered by a prompted LLM.  You can find the specific prompt in `argubots.py`.

So, while `agents.py` provides the core functionality for `Agent` objects, the argubot agents like `alice` — and the ones that you will write! — go into `argubots.py` instead.  This is just to keep the files small.

## Simulating human characters (Bob & friends)

You'll talk to your own argubots to get a qualitative feeling for their strengths and weaknesses.  
But can you really be sure you're making progress?  For that, a quantitative measure can be helpful.

Ultimately, you should test an argubot like Alice by having it argue with many real humans — not just you — and using some rubric to score the resulting dialogues.  But that would be slow and complicated to arrange.  

So, meet Bob!  He's just a simulated human.  You won't edit him: he is part of the development set.  Here is some information about him (from `characters.py`):

In [192]:
import characters
rich.print(characters.bob)

You can't talk directly to `characters.bob` because that's just a data object.
However, you can construct a simple agent that uses that data (plus a few more instructions) to prompt an LLM.

(Which LLM does it prompt?  The `CharacterAgent` constructor (defined in `agents.py`) defaults to a GPT-3.5 model that is specified in `tracking.py`.  But you can override that using keyword arguments.)

Try talking to Bob about climate change, too.

In [281]:
from agents import CharacterAgent
bob = CharacterAgent(characters.bob)    # actually, agents.bob is already defined this way
bob.converse()        # returns a dialogue, but we've already seen it so we don't want to print it again
None                  # don't print anything for this notebook cell 




(jamesyu) What are your thoughts on climate change?
(Bob) I believe that adopting a vegetarian diet is an effective way to reduce greenhouse gas emissions and combat climate change.


Of course, a proper user study can't just be conducted with one human user.

So, meet our bevy of beautiful Bobs!  (They're not actually all named Bob — we continued on in the alphabet.)


In [282]:
import agents
agents.devset

[<CharacterAgent for character Bob>,
 <CharacterAgent for character Cara>,
 <CharacterAgent for character Darius>,
 <CharacterAgent for character Eve>,
 <CharacterAgent for character TrollFace>]

In [283]:
agents.cara.converse()
None




(jamesyu) What are your thoughts on climate change?
(Cara) I believe in taking personal responsibility for my actions, but I also think it's important for everyone to make their own choices about how they live.


You can see the underlying character data here in the notebook.  Your argubot will have to deal with all of these topics and styles!

In [196]:
rich.print(characters.devset)

## Simulating conversation 

We can make Alice and Bob chat.

In [284]:
from dialogue import Dialogue
d = Dialogue()                                              # empty dialogue
d = d.add('Alice', "Do you think it's okay to eat meat?")   # add first turn
print(d)


(Alice) Do you think it's okay to eat meat?


In [285]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe that adopting a vegetarian diet is not only beneficial for individual health, but also for the environment and animal welfare.
(Alice) While it's true that vegetarian diets have a lower environmental impact and can be healthier, some argue that sustainable and ethical meat consumption can also be a part of a balanced diet. For example, if people only consume meat from sources that practice ethical and sustainable farming methods, it could potentially minimize the negative impacts of meat consumption.


In [286]:
d = agents.bob.respond(d)
d = argubots.alice.respond(d)
print(d)

(Alice) Do you think it's okay to eat meat?
(Bob) I believe that adopting a vegetarian diet is not only beneficial for individual health, but also for the environment and animal welfare.
(Alice) While it's true that vegetarian diets have a lower environmental impact and can be healthier, some argue that sustainable and ethical meat consumption can also be a part of a balanced diet. For example, if people only consume meat from sources that practice ethical and sustainable farming methods, it could potentially minimize the negative impacts of meat consumption.
(Bob) I understand the perspective, but I still believe that choosing a vegetarian diet is the most effective way to reduce environmental impact and promote animal welfare.
(Alice) That's a valid point, and it's important to consider the impact of our food choices. However, it's also worth noting that sustainable and ethical farming practices are continually evolving, and some argue that supporting these practices can also contrib

Anyway, let's see what happens when Alice and Bob talk for a while...

In [287]:
from simulate import simulated_dialogue
d = simulated_dialogue(argubots.alice, agents.bob, 8)
rich.print(d)

Sometimes this kind of conversation seems to stall out, with Bob in particular repeating himself a lot.  Alice doesn't seem to have a good strategy for getting him to open up.  Maybe you can do a better job talking to Bob, and that will give you some ideas about how to improve Alice?
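The manual alternation of `respond` calls shown earlier generalizes to a loop. The sketch below is an assumption about the general shape of such a loop, not the code in `simulate.py`. The function `toy_simulated_dialogue` and the two invented policies are hypothetical, and no LLM is called.

```python
def toy_simulated_dialogue(first, second, num_turns):
    """Alternate between two (name, policy) pairs for num_turns turns.
    Each policy is a function from the list of turns so far to a string."""
    turns = []
    speakers = [first, second]
    for i in range(num_turns):
        name, policy = speakers[i % 2]      # even turns: first speaker; odd turns: second
        turns.append({"speaker": name, "content": policy(turns)})
    return turns

# Two invented policies: one counts turns, the other echoes the previous turn.
alice_like = ("Alice", lambda turns: f"This is turn {len(turns) + 1}. What do you think?")
bob_like = ("Bob", lambda turns: f"You said: {turns[-1]['content']!r}. I still disagree.")

for t in toy_simulated_dialogue(alice_like, bob_like, 4):
    print(f"({t['speaker']}) {t['content']}")
```

Each policy sees the full history on every call. That is why, with a real LLM behind each policy, the whole dialogue is resent on every turn.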

In [125]:
myname = alicechat[0]['speaker']   # your name, pulled from an earlier dialogue
agents.bob.converse(d[0:2].rename('Alice', myname))  # reuse the same first two turns
None

(jamesyu) Do you think it's ok to eat meat?
(Bob) I believe that a vegetarian diet is the best choice for our health and for the well-being of animals and the environment.
(jamesyu) So you believe that eating meat is wrong?
(Bob) I believe that choosing a vegetarian diet is a compassionate and sustainable choice for our planet and all living beings.
(jamesyu) I disagree.
(Bob) I understand that not everyone shares the same beliefs, but I hope we can still respect each other's choices.
(jamesyu) I respect your choices, but you are wrong.
(Bob) I appreciate your respect, and I still believe in the benefits of a vegetarian lifestyle.
(jamesyu) There are also benefits to eating meat.
(Bob) I understand that some people believe there are benefits to eating meat, but I personally advocate for the benefits of a vegetarian diet.
(jamesyu) How does eating vegetables help the environment?
(Bob) Eating vegetables can help reduce the environmental impact associated with livestock farming, such as 

You can also try talking to the other characters and having Alice (or Airhead) talk to them.

**You might enjoy** defining additional characters in `characters.py`, or right here in the notebook.
Feel free to talk to those and evaluate them.  They could be variants on the existing characters, or something entirely new. 

However, **don't change the dev set** — the characters we just loaded must stay the same.  Your job in this homework is to improve the argubot (or at least try).  And that means improving it according to a fixed and stable eval measure.

As an exception, you can change the languages that a couple of the characters speak. It may be fun for you to see them try to speak your native language.  And that doesn't really affect the quality of the argument.

In [126]:
# example
trollFace2 = characters.trollFace.replace(languages = ["Chinese", "Spanish"])
rich.print(trollFace2)
simulated_dialogue(argubots.alice, CharacterAgent(trollFace2), 6)

(Alice) Do you think Donald Trump was a good president?
(TrollFace) 我不认为他是一个好总统，但我还是觉得你们所有政客都很好笑。
(Alice) I understand your skepticism towards politicians. What are some specific aspects of Trump's presidency that you didn't think were effective?
(TrollFace) 哦，我真的觉得他的发型和橙色的皮肤非常有趣，不是吗？
(Alice) 虽然外表是一个有趣的话题，但我很想了解你对于政治领袖的看法。除了外貌，你对特朗普的政策和决定有何看法？
(TrollFace) 唉，说到政策，他的一些决定真的很有趣，就像一出搞笑的戏剧一样。

### Efficiency: Batched generation?

Notice that we are making a separate LLM call to generate each turn of the dialogue.  When we generate the $n^\text{th}$ turn, we send the server the whole dialogue history — the previous $n\!-\!1$ turns — along with some instructions.  The server has to re-encode it with the Transformer, and it charges us for doing so (see the "input token" costs in `tracking.py`).  

That is probably inevitable for real dialogue.  But for simulated dialogue, a more efficient approach would be to generate the whole dialogue between Alice and Bob in one LLM call.  Then you would be charged just once for each dialogue turn.  Under this approach, the Transformer encodes each token as soon as it is generated (see the "output token" costs in `tracking.py`).  The encoded token stays in the context throughout the dialogue, so it doesn't have to be re-encoded on a later call.  There is no later call.  

Under current pricing models, that would reduce the dollar cost of generating $n$ turns from $O(n^2)$ to $O(n)$.  
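The $O(n^2)$ versus $O(n)$ claim can be checked by counting billed tokens. This is a minimal sketch under simplifying assumptions: every turn is exactly `turn_len` tokens and the instructions are `prompt_len` tokens. The function names and the numbers are illustrative and are not taken from `tracking.py`.

```python
def billed_tokens_per_turn_calls(n, turn_len, prompt_len):
    """One LLM call per turn: call k resends the instructions plus the k-1 earlier turns."""
    input_tokens = sum(prompt_len + (k - 1) * turn_len for k in range(1, n + 1))
    output_tokens = n * turn_len
    return input_tokens, output_tokens

def billed_tokens_single_call(n, turn_len, prompt_len):
    """One LLM call for the whole dialogue: the instructions are sent once."""
    input_tokens = prompt_len
    output_tokens = n * turn_len
    return input_tokens, output_tokens

for n in (4, 8, 16, 32):
    print(n,
          billed_tokens_per_turn_calls(n, turn_len=50, prompt_len=100),
          billed_tokens_single_call(n, turn_len=50, prompt_len=100))
```

Under these assumptions the per-turn approach bills $n \cdot \texttt{prompt\_len} + \texttt{turn\_len} \cdot n(n-1)/2$ input tokens, which grows quadratically in $n$. The single call bills a constant number of input tokens. Output tokens are the same under both approaches.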

However, the pricing model doesn't quite reflect the computational costs.  
* Using $O(\cdot)$ notation, what is the total number of floating-point operations needed to generate $n$ turns under each approach?  
* Parallelism may help reduce the runtime.  Using $O(\cdot)$ notation, what is the total number of seconds needed to generate $n$ turns under each approach?  (Assume that the GPU is big enough, relative to $n$, that it can process all input tokens in parallel.)

The problem with the more efficient approach is that it gives you no way to change the instructions (the system prompt) each time we switch from Alice to Bob and back again.  You'd need to generate the whole conversation using a single set of instructions.

Can you get this to work?  Specifically, try completing the cell below.  You don't have to use the `Agent` or `Dialogue` classes.  It's okay to just throw together something like the `complete()` function above.  Just see whether you can manage to prompt GPT-3.5 to generate a multi-turn dialogue between two characters who have different personalities and goals.  Is the quality better or worse than generating one turn at a time?  If worse, does it help to switch to GPT-4?

In [213]:
# Like `simulated_dialogue` in `simulate.py`.  However, this one is called on two
# Characters, not two Agents, and it returns a string rather than a Dialogue.

import characters
from tracking import default_client, default_model
from characters import Character

def simulated_dialogue_batch(a: Character, b: Character, turns: int = 6, *,
                             starter: bool = True) -> str:
    # Character `a` opens the conversation by default; otherwise `b` does.
    first = a if starter else b
    opening = f"{first.name}: {first.conversation_starters[0]}"  # avoids shadowing the built-in `input`
    # A single system prompt has to describe both characters, since it cannot change between turns.
    instructions = (
        f"{a.name} is a(n) {a.persona} and was instructed to have this conversational style: {a.conversational_style}. "
        f"{b.name} is a(n) {b.persona} and was instructed to have this conversational style: {b.conversational_style}. "
        f"Produce a dialogue between these two characters with {turns} turns. "
        f"The conversation begins with: {opening}."
    )
    response = default_client.chat.completions.create(
        messages=[{"role": "system", "content": instructions}],
        model=default_model)
    return response.choices[0].message.content

# Try it out!
simulated_dialogue_batch(characters.bob, characters.cara)


"Bob: Do you think it's ok to eat meat?\n\nCara: Well, I believe it's a personal choice. I respect your decision to be vegetarian, and I hope you can respect my choice to eat meat.\n\nBob: I understand that it's a personal choice, but have you considered the impact meat consumption has on the environment?\n\nCara: I have, and I try to make responsible choices when it comes to where I source my meat from. I make sure to support local, sustainable farms.\n\nBob: That's commendable, but the meat industry still has a significant impact on deforestation and greenhouse gas emissions.\n\nCara: I see your point, but I don't think everyone has to be vegetarian to make a positive impact. There are ways to support sustainable practices within the meat industry.\n\nBob: I agree that there are ways to make more sustainable choices, but I still believe that reducing meat consumption overall would have a positive impact on the environment.\n\nCara: I appreciate your passion for your beliefs, Bob, and

In [288]:
simulated_dialogue(agents.bob, agents.cara)

(Bob) Do you think it's ok to eat meat?
(Cara) Yes, I believe it's a personal choice and I choose to eat meat.
(Bob) I understand, but I believe that a vegetarian diet is better for both our health and the environment.
(Cara) I appreciate your perspective, but I respectfully disagree and will continue to enjoy my carnivorous diet.
(Bob) I understand, and I appreciate your openness to discussing this topic.
(Cara) Thank you for understanding, I'm always open to a respectful discussion.

In [293]:
simulated_dialogue(agents.eve, agents.trollFace)

(Eve) Do you think Donald Trump was a good president?
(TrollFace) Oh, I'm sure he was great at tweeting, if that's what you mean by being a president.
(Eve) I don't know about that, but I did hear from someone that they thought his tweets were entertaining.
(TrollFace) Well, if that's the bar we're setting for presidential success, then I guess we should start judging all world leaders by their Twitter game.
(Eve) I overheard someone saying that they wish world leaders would communicate more directly through social media, so maybe there's something to it.
(TrollFace) Ah, yes, because nothing says "diplomacy" like sending out a snarky tweet in the middle of a tense international crisis.

The quality seems to be just as good as generating one turn at a time.

# Model-based evaluation

What is our goal for the argubot?  We'd like it to broaden the thinking of the (simulated) human that it is talking to.  Indeed, that's what Alice's prompt tells Alice to do.

This goal is inspired by the recent paper [Opening up Minds with Argumentative Dialogues](https://aclanthology.org/2022.findings-emnlp.335/), which collected human-human dialogues:

> In this work, we focus on argumentative dialogues that aim to open up (rather than change) people’s minds to help them become more understanding to views that are unfamiliar or in opposition to their own convictions. ... Success of the dialogue is measured as the change in the participant’s stance towards those who hold opinions different to theirs.

Arguments of this sort are not like chess or tennis games, with an actual winner.  The argubot will almost never hear a human say "You have convinced me that I was wrong."  But the argubot did a good job if the human developed **increased understanding and respect for an opposing point of view**.  

To find out whether this happened, we can use a questionnaire to ask the human what they thought after the dialogue.  For example, after Alice talks to Bob, we'll ask Bob to evaluate what he thinks of Alice's views.  Of course, that depends on his personality — Alice needs to talk to him in a way that reaches *him* (as much as possible).  We'll also ask an outside observer to evaluate whether Alice handled the conversation with Bob well.

Of course, we're still not going to use real humans.  Bob is a fake person, and so is the outside observer (whose name is Judge Wise).
Using an LLM as an eval metric is known as *model-based evaluation*.  It has pros and cons:
* It is cheaper, faster, and more replicable than hiring actual humans to do the evaluation.  
* It might give different answers than what humans would give.   

Social scientists usually refer to a metric's **reliability** (low variance) and **validity** (low bias).  So the points above say that model-based evaluation is reliable but not necessarily valid.  In general, an LLM-based metric (like any metric) needs to be validated to confirm that it really does measure what it claims to measure.  (For example, that it correlates strongly with some other measure that we already trust.)  In this homework, we'll skip this step and just pray that the metric is reasonable.
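
As a sketch of what such a validation step could look like, the snippet below computes the Pearson correlation between an LLM-based metric and a trusted reference measure on the same dialogues. The score lists are made-up placeholders, not results from this homework, and `pearson` is an illustrative helper rather than part of the starter code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

llm_scores = [3, 5, 2, 4, 1]      # hypothetical scores from the LLM judge
human_scores = [2, 5, 1, 4, 2]    # hypothetical scores from trusted human raters
print(round(pearson(llm_scores, human_scores), 3))
```

A correlation near 1 would support the metric's validity; a value near 0 would suggest that it measures something else.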

To see how this works out in practice, open up the `demo` notebook, which walks you through the evaluation protocol.  You'll see how to call the [starter code](http://cs.jhu.edu/~jason/465/hw/llm), how it talks to the LLM behind the scenes, and what it is able to accomplish. 

To help validate the metric, check that Airhead gets a low score.  (It should!)

# Reading the starter code

The `demo` notebook gave you a good high-level picture of what the starter code is doing, so you're probably curious about the details.  Now that you've had the view from the top, here's a good bottom-up order in which to study the code.  You don't need to understand every detail, but you will need to understand enough to call it and extend it.

* `characters.py`.  The `Character` class is short and easy.

* `dialogue.py`.  The `Dialogue` class is meant to serve as a record of a natural-language conversation among any number of humans and/or agents.  On each *turn* of the dialogue, one of the speakers says something.  

   The dialogue's sequence of turns may remind you of the sequence of messages that is sent to OpenAI's chat completions API.  But the OpenAI messages are only labeled with the 4 special roles `user`, `assistant`, `tool`, and `system`.  Those are not quite the same thing as human speakers.  And the OpenAI messages do not necessarily form a natural-language dialogue: some of the messages are dealing with instructions, few-shot prompting, tool use, and so on.  The `agents.dialogue_to_openai` function in the next module will map a `Dialogue` to a (hopefully appropriate) sequence of messages for asking the LLM to extend that dialogue.

* `agents.py`.  This module sets up the problem of automatically predicting the next turn in a dialogue, by implementing an `Agent`'s `response()` method.  The `Agent` base class also has some simple convenience methods that you should look at.  

   Some important subclasses of `Agent` are defined here as well.  However, you may want to skip over `EvaluationAgent` and come back to it only when you read `eval.py`.

* `simulate.py` makes agents talk to one another, which we'll do during evaluation.

* `argubots.py` starts to describe some useful agents.  One of them makes use of the `kialo.py` module, which gives access to a database of arguments.

* `eval.py` makes use of `simulate.simulated_dialogue` and `agents.EvaluationAgent` to evaluate an argubot.

* We also have a couple of utility modules.  These aren't about NLP; look inside if needed.  `logging_cm.py` is what enabled the context manager `with LoggingContext(...):` in the demo notebook.  `tracking.py` sets some global defaults about how to use the OpenAI API, and arranges to track how many tokens we're paying for when you call it.
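
To illustrate the gap between dialogue turns and OpenAI messages described under `dialogue.py` above, here is one plausible mapping. It is a sketch under stated assumptions, not necessarily what `agents.dialogue_to_openai` does: the `Turn` tuple and the `turns_to_messages` function are hypothetical stand-ins for the starter code's own classes.

```python
from typing import Dict, List, Tuple

Turn = Tuple[str, str]   # (speaker name, what they said); hypothetical stand-in for a Dialogue turn

def turns_to_messages(turns: List[Turn], speaker: str,
                      system_prompt: str) -> List[Dict[str, str]]:
    """Map a dialogue to chat messages for asking the LLM to speak next as `speaker`."""
    messages = [{"role": "system", "content": system_prompt}]
    for name, content in turns:
        if name == speaker:
            # The LLM's own earlier turns are presented as `assistant` messages.
            messages.append({"role": "assistant", "content": content})
        else:
            # Everyone else becomes `user`; keep the name, since the role alone does not say who spoke.
            messages.append({"role": "user", "content": f"{name}: {content}"})
    return messages

dialogue = [("Bob", "Do you think it's ok to eat meat?"), ("Alice", "What do you think?")]
for message in turns_to_messages(dialogue, speaker="Alice", system_prompt="You are Alice."):
    print(message)
```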

# Similarity-based retrieval: Looking up relevant responses

Now, it is fine to prompt an LLM to generate text, but there are other methods!
There is a long history of machine learning methods that "memorize" the training data.
To make a prediction or decision at test time, they consult the stored training examples
that are most similar to the test situation.

_Similarity-based retrieval_ means that given a document $x$, you find the "most similar" documents $y \in Y$, where $Y$ is a given collection of documents.  The most common way to do this is to maximize the _cosine similarity_ $\vec{e}(x) \cdot \vec{e}(y)$, where $\vec{e}(\cdot)$ is an embedding function whose output vectors are normalized to unit length.
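
As a minimal sketch of retrieval by cosine similarity, the snippet below ranks a few toy vectors. The vectors stand in for embeddings $\vec{e}(\cdot)$, and the function names are illustrative. Because the dot product equals the cosine similarity only for unit-length vectors, the sketch divides by the norms explicitly.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(x_vec, doc_vecs, n=1):
    """Indices of the n document vectors most similar to x_vec, best first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(x_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:n]

x = [1.0, 0.0, 1.0]                      # toy embedding of the query document
Y = [[0.0, 1.0, 0.0],                    # toy embeddings of the collection
     [2.0, 0.0, 2.0],
     [1.0, 1.0, 0.0]]
print(most_similar(x, Y, n=2))
```

Document 1 ranks first because it points in the same direction as `x`, even though it is twice as long.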

Should we use the OpenAI embedding model?  We could, but we would have to precompute $\vec{e}(y)$ for all $y \in Y$, and store all these vectors in a data structure that supports some type of fast similarity-based search (e.g., using the [FAISS](https://faiss.ai/index.html) package).  An alternative would be to upload the documents to OpenAI and let OpenAI compute and store the embeddings.  We would then use their similarity-based [retrieval tool](https://platform.openai.com/docs/assistants/overview).

A simpler and faster approach—which sometimes even works better—is to use a _bag of tokens_ embedding function: Define $\vec{e}(y)$ to be the vector in $\mathbb{R}^V$ that records the count of each type of token in a tokenized version of $y$, where $V$ is the token vocabulary.  [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) is a refined variant of that idea, where the counts are adjusted in 3 ways: 

* smooth the counts
* normalize for the document length $|y|$ so that longer documents $y$ are not more likely to be retrieved
* downweight tokens that are more common in the corpus (such as ` the` or `ing`) since they provide less information about the content of the document
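
The three adjustments can be seen in a small from-scratch scorer. This is a sketch of one common BM25 variant, not the exact formula used by any particular package: the IDF term differs between implementations, and the parameter values `k1=1.5` and `b=0.75` are just typical choices.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs against a tokenized query."""
    num_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / num_docs
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for d in docs for tok in set(d))
    scores = []
    for d in docs:
        counts = Counter(d)
        score = 0.0
        for tok in query:
            if tok not in counts:
                continue
            # Downweight tokens that appear in many documents.
            idf = math.log((num_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5) + 1.0)
            # Normalize for document length relative to the average.
            length_norm = 1.0 - b + b * len(d) / avg_len
            # Saturating count: extra occurrences help less and less.
            tf = counts[tok]
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [s.lower().split() for s in ["Hello there good man!",
                                    "It is quite windy in London",
                                    "How is the weather today?"]]
print(bm25_scores("windy london".split(), docs))
```

Only the second document contains the query tokens, so it is the only one with a nonzero score.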


You might like to play with the `rank_bm25` package ([documentation](https://pypi.org/project/rank-bm25/)).  It is widely used and very easy to use.

In [233]:
from rank_bm25 import BM25Okapi as BM25_Index   # the standard BM25 method

# experiment here!  You could try the examples in the rank_bm25 documentation.

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25_Index(tokenized_corpus)
# <rank_bm25.BM25Okapi at 0x1047881d0>

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
# array([0.        , 0.93729472, 0.        ])

bm25.get_top_n(tokenized_query, corpus, n=1)
# ['It is quite windy in London']

['It is quite windy in London']

## The Kialo corpus

How can we use similarity-based retrieval to help build an argubot?  It's largely about having the right data!

[Kialo](https://www.kialo.com) is a collaboratively edited website (like Wikipedia) for discussing political and philosophical topics.  For each topic, the contributors construct a tree of _claims_.  Each claim is a natural-language sentence (usually), and each of its children is another claim that supports it ("pro") or opposes it ("con").  For example, check out the tree rooted at the claim ["All humans should be vegan."](https://www.kialo.com/all-humans-should-be-vegan-2762).

We provide a class `Kialo` for browsing a collection of such trees.  Please read the [source code](https://www.cs.jhu.edu/~jason/465/hw-llm) in `kialo.py`.  The class constructor reads in text files that are [exported Kialo discussions](https://support.kialo.com/en/hc/exporting-a-discussion/); we have provided some in the [data directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data).  The class includes a BM25 index, to be able to find claims that are relevant to a given string.

In [217]:
from kialo import Kialo

Ok, let's pull the retrieved discussions (the `.txt` files) into our data structure.

For BM25 purposes, we have to be able to turn each document (that is, each Kialo claim) into a list of string or integer tokens. 

In [218]:
from typing import List
import glob

# kialo = Kialo(glob.glob("data/*"), tokenizer=tokenizer.encode)  # using the LLM's tokenizer doesn't work here for some reason
kialo = Kialo(glob.glob("data/*"))  # use simple default tokenizer
f"This Kialo subset contains {len(kialo)} claims"

'This Kialo subset contains 6251 claims'

Let's use sampling to see what kind of stuff is in the data structure.

In [219]:
kialo.random_chain()   # just a single random claim

['These more modern techniques frequently require much more energy than their traditional alternatives. This indirectly harms animals and humans.']

In [220]:
kialo.random_chain(n=4)

['President Trump may not have been loyal to the US.',
 'President Trump repeatedly attacked the Department of Justice, seeming to prefer defending political allies over allowing a department headed by his own Attorney General to follow its course.',
 "Jeff Sessions was a political ally of President Trump, and indeed he was one of the first senators to endorse candidate Trump's candidacy. Therefore, it does not appear that President Trump was criticizing Sessions because he was not an ally, but rather because he didn't approve of his job performance.",
 "President Trump's criticisms of Sessions only began when Sessions refused to shield and protect him from FBI investigation. They intensified when Sessions recused himself and refused to end the Russia Election Tampering Investigation. This would indicate that his criticisms were based on his perception that Sessions was not personally loyal directly to him."]

In [555]:
kialo.random_chain(n=4)

['The taste of meat is delicious and brings many people pleasure in a manner that vegetarian food cannot fully imitate.',
 'Non-meat products can offer an equivalently pleasing food experience.',
 'Artificial meat, which imitates the taste of real meat but is lab-grown, is being developed and could replace the taste experiences of real meat without the harm caused to animals, and with significantly less environmental consequences.',
 'At the end of 2018, there was a worldwide total of 27 companies developing cell-based meat and seafood products, of which 15 have already raised external funding.']

### Similarity-based retrieval from the Kialo corpus

Let's try it, using BM25!

In [221]:
kialo.closest_claims("animal populations", n=10)

['Industrial agriculture can dangerously decrease animal populations.',
 'Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.',
 'Effective vegan methods to control animal populations exist.',
 "Generally feeding animals farm-grown produce is thought to have harmful affects on both the animal and human populations of a region when we could allow nature to self-regulate its populations. Animal feeding could potentially be used to lessen the immediate impact of widespread deforestation on some species, but generally this would be drastically less efficient than choosing not to destroy their habitats in the first place and would only slow the local animal population's imminent demise.",
 'Trap, neuter, and release schemes already exist for some animal populations (such as feral cats). These schemes could be applied to former livestock living in the wild.',
 'H

We can restrict to claims for which the Kialo data structure has at least one counterargument ("con" child).

In [222]:
kialo.closest_claims("animal populations", n=10, kind='has_cons')

['Industrial agriculture can dangerously decrease animal populations.',
 'Effective vegan methods to control animal populations exist.',
 'Human-introduced species have historically devastated local wildlife populations across the world.',
 'COVID-19 has devastated prison populations, whose lives are the responsibility of the state.',
 'High demand for vegan foods may hike prices for local populations that previously depended on them.',
 'It is generally poorer countries that have expanding populations. The first world has now reached a point of stagnant population growth - even declining populations, as in the case of Japan and others. The inability of poorer countries to control their populations should not impact the lives of those in the first world. The first world having earned their luxuries and should not be denied them.',
 'Vegan populations are, on average, less likely to suffer from obesity, a major risk factor for many diseases and health problems.',
 'Humans, as apex preda

In [223]:
c = _[0]    # first claim above
print("Parent claim:\n\t" + str(kialo.parents[c]))
print("Claim:\n\t" + c)
print('\n\t* '.join(["Pro children:"] + kialo.pros[c]))
print('\n\t* '.join(["Con children:"] + kialo.cons[c]))

Parent claim:
	In a vegan world, fewer species would be at risk of extinction.
Claim:
	Industrial agriculture can dangerously decrease animal populations.
Pro children:
	* The fishing industry is especially deleterious to the ocean's biota due to overfishing and the disruption of the natural ecosystem.
	* Up to 100,000 species go extinct annually, largely due to the environmental effects of animal agriculture.
Con children:
	* Sustainable livestock farming is not contributing to significant decreases in animal populations. Decreasing animal populations is a problem specific to industrial livestock farming.


### Does BM25 really work?

Unfortunately, we see that `"animal population"` gives quite different results from `"animal populations"`.  Why is that and how would you fix it?  

Also, both queries seem to retrieve some claims that are talking about human populations, not animal populations.  Why is that and how would you fix it?

In [227]:
kialo.closest_claims("animal population",10)

['As long as our ability to produce both animal feed crops and food crops for our human population are not exceeded, this point is irrelevant.',
 "36% of the calories produced by the world's crops are being used for animal feed, of which only 12% then turn into animal products that can be eaten by the human population. That is a waste of 24% of the world's crops.",
 'The claim that "most of the cultural shift and loss is due to mostly vegan cultures turning to animal products" is completely unfounded, and the Brokpa people which you cited are an outlier as a group that has a population of less than 70k people. Worldwide the population of vegan people has only increased.',
 "Developed nations are fueling the 3rd world and underdeveloped nation's population boom by exporting/donating food to areas that cannot sustain their current population.",
 'This argument assumes that sentience is the only objection to the consumption of animal products, failing to address the issues involved with t

`"animal population"` and `"animal populations"` give quite different results because of how the text is tokenized and indexed for BM25. Each text is split into individual word tokens, so the query is not treated as the singular or plural form of one phrase; `animal` is matched separately from `population` or `populations`. As far as the index is concerned, those last two are simply different tokens, so a claim that contains only "populations" gets no credit for the query token `population`, and vice versa. The two queries therefore share only the token `animal` and retrieve different claims, even though the phrases mean almost the same thing. We can fix this by stemming or lemmatization, which reduces each word to its base or root form, so that the singular and plural forms map to the same token.

Both queries retrieve some claims that are about human populations rather than animal populations. One reason is that the Kialo corpus is largely human-centered: Kialo is used for discussing political and philosophical topics, and these discussions naturally relate to humans even when the topic involves animals. Additionally, the word "population" is often associated with humans. More fundamentally, a bag-of-tokens query scores each token independently, so a claim can rank highly by matching `population` alone without mentioning animals at all. One fix is to change the query to avoid tokens that relate more to humans. Another is to improve tokenization so that the text is split into phrases rather than single words, for example by using bigrams or trigrams of words as tokens. In "animal populations", "animal" modifies "populations", so it makes sense to keep the phrase as a single token that retains its meaning, instead of dividing it into `animal` and `populations`.
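
Both proposed fixes can be sketched as a replacement tokenizer. This is only an illustration: `crude_stem` is a deliberately naive stand-in for a real stemmer or lemmatizer, the function names are made up, and the same tokenizer would have to be applied to the claims and to the queries (presumably through the `tokenizer` argument of `Kialo`).

```python
from typing import List

def crude_stem(token: str) -> str:
    """Very rough plural stripping; a stand-in for a real stemmer or lemmatizer."""
    token = token.lower()
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def normalized_tokens(text: str, use_bigrams: bool = True) -> List[str]:
    """Stem each word, then optionally append word bigrams as extra tokens."""
    words = [crude_stem(w) for w in text.split()]
    if not use_bigrams:
        return words
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams

print(normalized_tokens("animal population"))
print(normalized_tokens("animal populations"))
print(normalized_tokens("human populations"))
```

The first two queries now produce identical tokens, and only they contain the bigram token `animal_population`, which a claim about human populations cannot match.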

## A retrieval bot (Akiko)

The starter code defines a simple argubot named Akiko (defined in `argubots.py`) that doesn't use an LLM at all.  It simply finds a Kialo claim that is similar to what the human just said, and responds with one of the Kialo counterarguments to that claim.

You already watched Akiko argue with Darius in the `demo` notebook.  If you look at the log messages, you'll see the claims that Akiko retrieved, as well as the LLM calls that Darius made.  

You can talk to Akiko yourself now.  (Remember that Akiko only knows about subjects that it read about in the [`data` directory](https://www.cs.jhu.edu/~jason/465/hw-llm/data/).  If you want to talk about something else, you can add more conversations from [kialo.com](https://www.kialo.com); see the [LICENSE](https://www.cs.jhu.edu/~jason/465/hw-llm/data/LICENSE) file.)


In [275]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiko.converse()




(jamesyu) Climate change is bad
(Akiko) 1 kg of chicken produces 6,9 kg CO2 eq; 1 kg of vegetables about 1.5-2 kg CO2 eq (www.greeneatz.com). However, chicken has about 3-5 times more calories and up to 10 times as much protein as many vegetables, meaning you need to consume less of it on a kg to kg basis. A single mile in an airplane produces about 24 kg CO2 eq (blueskymodel.org), about the same as 2 pounds of beef - so one single domestic flight will usually offset a person's meat-related carbon emissions for a whole year.


(jamesyu) Climate change is bad
(Akiko) Given the nature and scope of the environmental issues facing the US, Biden's plan is not ambitious enough.


(jamesyu) Biden is wrong
(Akiko) It is not viable to immediately reduce overpopulation through any sane, viable, or ethical means. Even if future generations may enjoy greater comforts and luxuries due to hopeful reductions in global overpopulation, this generation must make sacrifices in order to ensure such generations can even exist.


(jamesyu) ok
(Akiko) Lockdowns have saved lives.


## Making your own retrieval bot (Akiki)

As you can see when talking to Akiko yourself, Akiko does poorly when responding to a short or vague dialogue turn (like "Yes"), because the "closest claim" in Kialo may be about a totally different subject.  Akiko does much better at responding to a long and specific statement.  

So try implementing a new argubot, called Akiki, that is very much like Akiko but does a better job of staying on topic in such cases.  It should be able to **look at more of the dialogue** than the most recent turn.  But the most recent dialogue turn should still be "more important" than earlier turns.  

The details are up to you.  Here are a few things you could try:
* include earlier dialogue turns in the BM25 query only if the BM25 similarity is too low without them
* weight more recent turns more heavily in the BM25 query (how can you arrange that?)
* treat the human's earlier turns differently from Akiki's own previous turns
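
One way to arrange the second suggestion is to repeat the tokens of recent turns in the query. This is a sketch, not a prescribed solution: `recency_weighted_query` is a hypothetical helper, and it assumes that the BM25 index adds a token's contribution once per occurrence in the query, which is worth checking for whatever index you use.

```python
from typing import List

def recency_weighted_query(turns: List[str], max_repeats: int = 3) -> List[str]:
    """Build a bag-of-tokens query in which recent turns count more.

    Tokens of the most recent turn appear max_repeats times, those of the turn
    before it max_repeats - 1 times, and so on; older turns are dropped.
    """
    query: List[str] = []
    for age, turn in enumerate(reversed(turns)):   # age 0 = most recent turn
        repeats = max_repeats - age
        if repeats <= 0:
            break
        query.extend(turn.lower().split() * repeats)
    return query

history = ["Climate change is bad", "Biden is wrong", "ok"]
print(recency_weighted_query(history))
```

Here `ok` appears three times, the tokens of "Biden is wrong" twice, and the tokens of the oldest turn once, so the earlier turns still steer retrieval when the latest turn matches nothing.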

Implement your new bot in `argubots.py`, and adjust it until `argubots.akiki.converse()` seems to do a better job of answering your short turns, compared to `argubots.akiko.converse()`.  Make sure it still gives appropriate responses to long turns, too.  Give some examples in the notebook of what worked well and badly, with discussion.

In [277]:
from logging_cm import LoggingContext
with LoggingContext("agents", "INFO"):   # temporarily increase logging level
    argubots.akiki.converse()




(jamesyu) Climate change is bad
(Akiki) Given the nature and scope of the environmental issues facing the US, Biden's plan is not ambitious enough.


(jamesyu) Biden is wrong
(Akiki) Some ethical systems are built upon the concept of pleasure as the highest good (hedonic morality), so pleasure does indeed factor into judgments of right and wrong.


(jamesyu) ok
(Akiki) According to moral absolutism, particular actions are intrinsically right or wrong.


(jamesyu) ok
(Akiki) Both popularity and pleasure are relevant to "right and wrong", or the concerns of ethics and morality. They certainly do not equate to morality on a one to one basis, not are they interchangeable with ethics or morality, but they are entirely relevant. Nearly every attempt at a cohesive system of ethics must take into account what is pleasurable to the individual and the larger society as well as the beliefs held by the majority of the population.


(jamesyu) ok
(Akiki) Both popularity and pleasure are relevant to "right and wrong", or the concerns of ethics and morality. They certainly do not equate to morality on a one to one basis, not are they interchangeable with ethics or morality, but they are entirely relevant. Nearly every attempt at a cohesive system of ethics must take into account what is pleasurable to the individual and the larger society as well as the beliefs held by the majority of the population.


(jamesyu) Climate change is bad
(Akiki) Given the nature and scope of the environmental issues facing the US, Biden's plan is not ambitious enough.


I have implemented Akiki, and it seems to do a better job than Akiko at answering short turns. I also tested Akiki on long turns, and it gave appropriate responses to those, as in the example above.

My implementation is simple: it includes earlier dialogue turns in the BM25 query only if the BM25 similarity is too low without them. I compute the BM25 similarity scores of the neighbors of the current turn and filter out the neighbors whose score is too low. I set the threshold quite low, at 0.05, to accept any bit of similarity, since the corpus is quite small. If no neighbor has an adequate score, I query with the previous turn in the dialogue instead (skipping Akiki's own responses), and repeat until there is an adequate neighbor.

This worked pretty well for short turns such as "ok", where the conversation is simply continuing: Akiki basically responds to my previous meaningful statement. One issue I ran into was that if my initial comment was not closely related to any claim in the corpus, no neighbor would beat the threshold. To resolve this, I added a condition under which Akiki says that it does not understand.

Some things worked badly. Akiki tends to be repetitive when I keep giving short responses. When I try a sentence that is completely unrelated to the corpus, Akiki looks earlier in the conversation and may respond to something unrelated, which creates a discontinuity in the conversation. Akiki may also say that it doesn't understand what I said, rather than pointing out that it is off topic. So Akiki does not necessarily stay on topic when the other speaker abruptly changes topics, or when the corpus contains nothing about what was said.

The conversation is still not perfect, because a short turn may add some meaning, and Akiki should not just respond to a single meaningful turn each time. To fix this, I would have to weight more recent turns more heavily in the BM25 query. I could keep track of the BM25 similarity scores for all claims in the corpus (or for some number of the highest-scoring ones) and add in the scores from each earlier turn until some claim's score meets the threshold. The scores contributed by each earlier turn would be discounted, either by a constant amount or by a scaling factor, so that more recent turns weigh more. Additionally, the human's earlier turns could be treated differently from Akiki's own previous turns: for Akiki's own previous turns, you would want to find claims that agree with the closest neighbor.
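
The fallback strategy described above can be sketched independently of the starter code. The names here are hypothetical: `best_score` stands for any function that returns the top BM25 score for a piece of text, and `toy_score` is a made-up scorer used only so that the example runs on its own.

```python
from typing import Callable, List, Optional

def query_with_fallback(turns: List[str],
                        best_score: Callable[[str], float],
                        threshold: float = 0.05) -> Optional[str]:
    """Return the most recent turn whose best retrieval score clears the threshold."""
    for turn in reversed(turns):          # start with the latest turn, then walk back
        if best_score(turn) >= threshold:
            return turn
    return None                           # caller can say "I don't understand"

# Toy stand-in for the BM25 index: fraction of words that the "corpus" knows about.
known_words = {"climate", "biden", "meat"}

def toy_score(text: str) -> float:
    words = text.lower().split()
    return sum(w in known_words for w in words) / max(len(words), 1)

human_turns = ["Climate change is bad", "Biden is wrong", "ok"]
print(query_with_fallback(human_turns, toy_score))   # falls back past "ok"
print(query_with_fallback(["ok", "yes"], toy_score))
```

Because the loop stops at the first turn that clears the threshold, repeated short turns keep retrieving from the same earlier turn, which matches the repetitiveness noted above.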

### Evaluating Akiki

Finally, do a more formal evaluation to verify whether Akiki really does better than Akiko on this dimension.  This is a way to check that you're not just fooling yourself.  

1. Make a new `Agent` called "Shorty" that often (but not always) gives short responses.  
    * Shorty's conversation starters should be on topics that Kialo knows about.  
    * Shorty could be a pure `LLMAgent` such as a `CharacterAgent` with a particular `conversational_style`.  Or it could use a mixed strategy of calling the LLM on some turns and not others.
2. Generate several *Akiko*-Shorty dialogues and several *Akiki*-Shorty dialogues, using `simulated_dialogue`.
3. Evaluate each of those dialogues by asking Judge Wise **how well the argubot stayed on topic**.  You should write this prompt carefully so that Judge Wise gives meaningful scores.  (Before you do this evaluation step, adjust the prompt until it seems to work well on a small subset of the dialogues.  Otherwise Judge Wise won't be so wise!)  
4. Compare Akiko and Akiki's mean scores. Ideally, also compute a 95% confidence interval on the difference of means, using [this calculator](https://www.statskingdom.com/difference-confidence-interval-calculator.html).

You can do all those steps in the notebook, writing _ad hoc_ code.  You don't have to write general-purpose methods or classes.
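
If you would rather compute the interval in the notebook than in the calculator, step 4 can be sketched as follows. The score lists are hypothetical placeholders, not results, and the interval uses a normal approximation with `z = 1.96`; with only a handful of dialogues, a $t$-based interval would be somewhat wider.

```python
import math
from statistics import mean, variance

def diff_of_means_ci(a, b, z=1.96):
    """Difference of sample means (a minus b) with an approximate 95% confidence interval."""
    diff = mean(a) - mean(b)
    # Standard error of the difference, allowing unequal variances.
    std_err = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return diff, (diff - z * std_err, diff + z * std_err)

akiki_scores = [80, 85, 70, 90, 75]   # hypothetical "On Topic" scores
akiko_scores = [60, 65, 70, 55, 60]   # hypothetical "On Topic" scores
diff, (low, high) = diff_of_means_ci(akiki_scores, akiko_scores)
print(f"difference = {diff:.1f}, approximate 95% CI = ({low:.1f}, {high:.1f})")
```

If the interval excludes 0, the difference between the two bots is unlikely to be due to chance alone.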

Akiko Scores:

In [482]:
import eval

In [471]:
akiko_shorty = simulated_dialogue(argubots.akiko, argubots.shorty, 10)
akiko_shorty

(Akiko) Do you think Donald Trump was a good president?
(Shorty) I don't have personal opinions.
(Akiko) But if your point was true, then we should not have a moral responsibilty to a disabled person that will never be able to be "just as moral as you".
(Shorty) Everyone deserves to be treated with dignity and respect.
(Akiko) It is not in the meat industry's interest to treat animals cruelly, as mistreating cattle results in poorer quality meat.
(Shorty) That's a valid point.
(Akiko) A president's level of education is not necessarily indicative of their performance in the role.
(Shorty) Agreed.
(Akiko) Justification is not a question of amount. If something is wrong, it is wrong regardless of the quantity in which it occurs: murdering one person is wrong, as is murdering five people.
(Shorty) Absolutely.
(Akiko) Justification is not a question of amount. If something is wrong, it is wrong regardless of the quantity in which it occurs: murdering one person is wrong, as is murdering fi

In [478]:
for i in range(10):
    akiko_shorty = simulated_dialogue(argubots.akiko, argubots.shorty, 10)
    print(eval.eval_by_observer(eval.default_judge, "Akiko", akiko_shorty).scores['On Topic'])

85
85
60
60
60
60
85
60
60
60


In [479]:
for i in range(10):
    akiko_shorty = simulated_dialogue(argubots.akiko, argubots.shorty, 10)
    print(eval.eval_by_observer(eval.default_judge, "Akiko", akiko_shorty).scores['On Topic'])

85
60
50
85
85
60
60
60
85
85


Akiki Scores:

In [469]:
akiki_shorty = simulated_dialogue(argubots.akiki, argubots.shorty, 10)
akiki_shorty

(Akiki) Do you think Donald Trump was a good president?
(Shorty) I don't have personal opinions.
(Akiki) Personal autonomy ends when an individual's actions harm others (pp. 197-198), such as disrupting the creation of herd immunity for COVID-19.
(Shorty) That's an important consideration.
(Akiki) Humans' final velocity is simply not high enough to run away from most of the carnivores.
(Shorty) That's true in most cases.
(Akiki) The discussion is about morality, not legality. A world in which no one drank alcohol would arguably be better, but it's not banned because practically trying to enforce such a ban has historically failed and lead to more problems.
(Shorty) Balancing morality and practicality is complex.
(Akiki) Iron and zinc are two of the most common nutrient deficiencies in athletes' diets, and are nutrients which are largely unavailable in vegan diets.
(Shorty) Nutrient deficiencies can impact performance.
(Akiki) Students can get the materials for their school work via the

In [None]:
for i in range(10):
    akiki_shorty = simulated_dialogue(argubots.akiki, argubots.shorty, 10)
    print(eval.eval_by_observer(eval.default_judge, "Akiki", akiki_shorty).scores['On Topic'])

60
85
85
85
85
85
85
50
85
85


In [488]:
for i in range(10):
    akiki_shorty = simulated_dialogue(argubots.akiki, argubots.shorty, 10)
    print(eval.eval_by_observer(eval.default_judge, "Akiki", akiki_shorty).scores['On Topic'])

85
85
60
85
85
85
85
85
85
85


I computed mean scores for Akiko and Akiki based on Judge Wise's ratings of how well they stayed on topic. Akiko's average score was 69.5 out of 100 and Akiki's was 80.75 out of 100, so on average Akiki stayed on topic better than Akiko. The 95% confidence interval on the difference of means (Akiko minus Akiki) was [-18.89, -3.61]. The interval is wide, which reflects the variability in the scores and the fact that I used only 20 dialogues per bot (more took too long). Because the entire interval lies below zero, there is relatively high confidence that Akiki really did score higher on staying on topic.
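As a cross-check, the interval can also be recomputed in the notebook rather than with the online calculator. The sketch below is only one way to do it: it assumes a pooled-variance (equal-variance) two-sample t interval, and it hardcodes the two-sided 95% critical value for 38 degrees of freedom so that it needs only the standard library. The helper name `pooled_ci` is made up for this illustration. The scores are copied from the four evaluation cells above.

```python
import math
import statistics

# 'On Topic' scores copied from the evaluation cells above (20 dialogues per bot).
akiko = [85, 85, 60, 60, 60, 60, 85, 60, 60, 60,
         85, 60, 50, 85, 85, 60, 60, 60, 85, 85]
akiki = [60, 85, 85, 85, 85, 85, 85, 50, 85, 85,
         85, 85, 60, 85, 85, 85, 85, 85, 85, 85]

# Assumption: two-sided 95% critical value of Student's t with
# 20 + 20 - 2 = 38 degrees of freedom, hardcoded to avoid a SciPy dependency.
T_CRIT_DF38 = 2.0244

def pooled_ci(a, b, t_crit):
    """Return (difference, lower, upper) for mean(a) - mean(b), assuming equal variances."""
    na, nb = len(a), len(b)
    diff = statistics.mean(a) - statistics.mean(b)
    # Pooled sample variance of the two groups.
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    half_width = t_crit * math.sqrt(pooled_var * (1 / na + 1 / nb))
    return diff, diff - half_width, diff + half_width

diff, lo, hi = pooled_ci(akiko, akiki, T_CRIT_DF38)
print(f"difference = {diff:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

A Welch (unequal-variance) interval would use slightly fewer degrees of freedom and come out marginally wider, but it should lead to the same conclusion here.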

## Retrieval-augmented generation (Aragorn)

The real weaknesses of Akiko and Akiki:
* They can only make statements that are already in Kialo.  
* They don't respond to the user's actual statement, but to a single retrieved Kialo claim that may not accurately reflect the user's position (it just overlaps in words).

But we also have access to an LLM, which is able to generate new, contextually appropriate text (as Alice does).

In this section, you will create an argubot named [Aragorn](https://tolkiengateway.net/wiki/Riddle_of_Strider), who is basically the love child of Akiki and Alice, combining the high-quality specific content of Kialo with the broad competence of an LLM.  

The RAG in aRAGorn's name stands for **retrieval-augmented generation**.  Aragorn is an agent that will take 3 steps to compute its `Agent.response()`:

1. **Query formation step**: Ask the LLM what claim should be responded to.  For
   example, consider the following dialogue:
    > ...
    > Aragorn: Fortunately, the vaccine was developed in record time.
    > Human: Sounds fishy.

    "Sounds fishy" is exactly the kind of statement that Akiko had trouble using
    as a Kialo query.  But Aragorn shows the *whole dialogue* to the LLM, and
    asks the LLM what the human's *last turn* was really saying or implying, in
    that context. The LLM answers with a much longer statement:

    > Human [paraphrased]: A vaccine that was developed very quickly cannot be trusted.
    > If its developers are claiming that it is safe and effective, I question their motives.

    This paraphrase makes an explicit claim and can be better understood without the context.
    It also contains many more word types, which makes it more likely that BM25 will be able
    to find a Kialo claim with a nontrivial number of those types. 

2. **Retrieval step**: Look up claims in Kialo that are similar to the explicit
   claim.  Create a short "document" that describes some of those claims and
   their neighbors on Kialo.

3. **Retrieval-augmented generation**: Prompt the LLM to generate the response
   (like any `LLMAgent`).  But include the new document somewhere in the LLM
   prompt, in a way that it influences the response. 
   
   Thus, the LLM can respond in a way that is appropriate to the dialogue but
   also draws on the curated information that was retrieved in Kialo.  After
   all, it is a Transformer and can attend to both!
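The three steps above can be sketched as a simple composition of functions. This is only an outline under stated assumptions, not the required design: `rag_response` and the three `fake_*` helpers are hypothetical stand-ins so that the cell runs without an LLM or a Kialo index. In a real `RAGAgent`, steps 1 and 3 would call the LLM and step 2 would query Kialo.

```python
from typing import Callable, List

def rag_response(dialogue: List[str],
                 paraphrase: Callable[[List[str]], str],
                 retrieve: Callable[[str], str],
                 generate: Callable[[List[str], str], str]) -> str:
    """Compose the three RAG steps; each callable is a hypothetical stand-in."""
    claim = paraphrase(dialogue)         # step 1: query formation
    document = retrieve(claim)           # step 2: retrieval
    return generate(dialogue, document)  # step 3: retrieval-augmented generation

def fake_paraphrase(dialogue: List[str]) -> str:
    # A real agent would ask the LLM what the last turn implies in context.
    return "A vaccine that was developed very quickly cannot be trusted."

def fake_retrieve(claim: str) -> str:
    # A real agent would look up similar Kialo claims and their neighbors.
    return f'One possibly related claim from the Kialo debate website:\n\t"{claim}"'

def fake_generate(dialogue: List[str], document: str) -> str:
    # A real agent would include `document` in the LLM prompt.
    return f"[reply to {dialogue[-1]!r}, informed by a {len(document)}-character document]"

dialogue = ["Aragorn: Fortunately, the vaccine was developed in record time.",
            "Human: Sounds fishy."]
reply = rag_response(dialogue, fake_paraphrase, fake_retrieve, fake_generate)
print(reply)
```

Passing the steps in as callables keeps each one easy to swap out or test on its own, which may help when you experiment with different prompts for step 1.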

Here's an example of the kind of document you might create at the retrieval step, though it may be possible
to do better than this:

In [489]:
# refers to global `kialo` as defined above
def kialo_responses(s: str) -> str:
    c = kialo.closest_claims(s, kind='has_cons')[0]
    result = f'One possibly related claim from the Kialo debate website:\n\t"{c}"'
    if kialo.pros[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users in favor of that claim:"] + kialo.pros[c])
    if kialo.cons[c]:
        result += '\n' + '\n\t* '.join(["Some arguments from other Kialo users against that claim:"] + kialo.cons[c])
    return result
        
print(kialo_responses("Animal flesh is yucky to think about, yet delicious."))

One possibly related claim from the Kialo debate website:
	"So many people are worried about animals but don't even think twice when walking by a homeless person on the streets. It's preposterous. How about we worry about our own kind first and then start talking about animals."
Some arguments from other Kialo users against that claim:
	* This implies that caring for animals or caring for people is a binary choice. It isn't. There are those who are well placed and willing to care for people and those who prefer to serve the animal kingdom. As a species we don't just have one idea at a time and follow that to conclusion before we pursue another. It benefits all if humans divide their attentions between various issues and problems we face.
	* Humans have freedom of choice to some extent, animals subdued by humans don't. The very intention of help urges it to go where is most needed. And so far never was any biggest, flagrant and needless cruelty and slaughter as that towards industrial f

You should implement Aragorn in `argubots.py`, just as you did for Akiki.  It should probably be an instance `aragorn` of a new class `RAGAgent` that is a subclass of `Agent` or `LLMAgent`.

I implemented Aragorn in `argubots.py` as an instance `aragorn` of a new class `RAGAgent` that is a subclass of `LLMAgent`.

In [497]:
aragorn_akiki = simulated_dialogue(argubots.aragorn, argubots.akiki, 10)
aragorn_akiki

(Aragorn) Rumours about manufacturers skipping animal testing are patently untrue.
(Akiki) Many of the examples given are being deliberately misinterpreted or exaggerated. President Trump's unique communication style meant that people who were predisposed to hate him could easily interpret them in the worst possible light.
(Aragorn) The claim that it is unclear whether Woodward's managerial style showed good judgment, as there were conflicting stories about the goings-on in the White House, does not hold up under scrutiny. The credibility of Bob Woodward, a renowned investigative journalist with two Pulitzer Prizes and extensive experience covering multiple presidencies, lends weight to his account. Additionally, the multiplicity of sources contributing to the narrative offers a more comprehensive view of the events, rather than simply dismissing them as conflicting or unreliable. It is important to critically engage with the content and sources of Woodward's reporting rather than dism

In [537]:
akiki_aragorn = simulated_dialogue(argubots.akiki, argubots.aragorn, 10)
akiki_aragorn

(Akiki) The UK has had more COVID-19 deaths than China has had Covid-19 cases, despite china having over 2100% of the population of the UK.
(Aragorn) The argument that the US ranks 20th in terms of Covid deaths per capita and behind several other democratic nations from Europe and South America does not negate the initial claim that the US had 20% of global COVID-19 deaths despite only having 4% of the global population. While it is important to consider international comparisons, this does not change the fact that the US has a disproportionately high number of COVID-19 deaths relative to its population size. This context does not diminish the significance of the initial claim.
(Akiki) As of late 2021, the US ranked 20th in terms of Covid deaths per capita and behind several other democratic nations from Europe as well as South America. This contextualizes the American performance during the pandemic.
(Aragorn) The claim that the highest day of deaths in the US occurred when Biden was 

In [538]:
aragorn_shorty = simulated_dialogue(argubots.aragorn, argubots.shorty, 10)
aragorn_shorty

(Aragorn) That is a false dilemma and so is the conclusion. However, your claim about working conditions are probably correct in many cases. But when looking at what it takes to decrease suffering for laborers (eg increase wages, rights to organize, improving working conditions) versus for animals (e.g. stop eating industrially produced meat and only eat free grazing / wild game) it is clear that abstaining from eating meat is the more reasonable and sustainable action.
(Shorty) I see your point and agree that addressing working conditions is important. It's crucial to consider the impact of our choices on both laborers and animals.
(Aragorn) The claim that people should prioritize caring for their own kind over animals overlooks the interconnectedness of social issues and the moral responsibility to address multiple forms of suffering. It is not a binary choice between caring for humans or animals; rather, individuals and societies can and should address both human and animal welfare 

In [539]:
eval.eval_by_observer(eval.default_judge, "Aragorn", aragorn_shorty).scores['On Topic']

85

### Evaluating Aragorn

Compare Alice, Akiki, and Aragorn in the notebook, using the evaluation scheme and devset that were illustrated in `demo.ipynb`.  In other words, use `eval.eval_on_characters`.

Who does best?  What are the differences in the subscores and comments?  Does it matter which character you're evaluating on?  Maybe the different characters expose the bots' various strengths and weaknesses.

Try to figure out how to improve Aragorn's score.  Can you beat Alice?

Also, try evaluating them in the same way that you evaluated Akiki.  In other words, have them talk to Shorty and ask Judge Wise whether they were able to stay on topic.  This is where Aragorn should really shine, thanks to its ability to paraphrase Shorty's short utterances.



In [568]:
from eval import saved_evalsum, saved_dialogues

In [548]:
alice_eval = eval.eval_on_characters(argubots.alice)
rich.print(saved_evalsum['Alice'].mean())

In [549]:
akiki_eval = eval.eval_on_characters(argubots.akiki)
rich.print(saved_evalsum['Akiki'].mean())

In [550]:
aragorn_eval = eval.eval_on_characters(argubots.aragorn)
rich.print(saved_evalsum['Aragorn'].mean())

In [551]:
aragorn_eval = eval.eval_on_characters(argubots.aragorn)
rich.print(saved_evalsum['Aragorn'].mean())

In [552]:
aragorn_alice = simulated_dialogue(argubots.aragorn, argubots.alice, 10)
aragorn_alice

(Aragorn) And every major industry will have to contribute towards reducing its climate impact, or even become redundant if its products are not worth their indirect costs - like, arguably, meat.
(Alice) What are your thoughts on the importance of corporations taking responsibility for their environmental impact?
(Aragorn) While it is important to consider different perspectives and interpretations of President Trump's actions and statements, it is crucial to also critically analyze the impact of his messaging on public perception and behavior. Downplaying the significance of COVID-19, whether intentional or not, can have serious consequences in terms of public health and safety. It is essential for leaders to consistently convey the gravity of the situation and to encourage the adoption of preventive measures to protect the population. Regardless of personal beliefs or intentions, the responsibility of a leader is to prioritize the well-being of their citizens and promote actions that

In [553]:
for i in range(10):
    alice_shorty = simulated_dialogue(argubots.alice, argubots.shorty, 10)
    print(eval.eval_by_observer(eval.default_judge, "Alice", alice_shorty).scores['On Topic'])

90
85
90
90
90
85
50
85
85
90


In [554]:
for i in range(10):
    aragorn_shorty = simulated_dialogue(argubots.aragorn, argubots.shorty, 10)
    print(eval.eval_by_observer(eval.default_judge, "Aragorn", aragorn_shorty).scores['On Topic'])

85
85
85
85
85
85
85
85
85
85


After evaluating Alice, Akiki, and Aragorn, it appears that Alice does best. Alice's total score was 20.5, followed by Aragorn's at 19.9 and Akiki's at 19.3. All three bots had broadly similar subscores: Alice was slightly ahead in the 'informed' and 'intelligent' categories, Akiki was much worse in the 'skilled' category, and Aragorn was slightly lower in the 'moral' category. The most notable subscores came from TrollFace in the 'intelligent' category. TrollFace tended to comment that the bots fell for his trolling and presented arguments in response to his unserious remarks. He also rated Akiki and Aragorn a 2.0 once each, which is low compared with the 3.0 or 4.0 that was typical of the subscores from every other conversation. I think this one low rating is what let Alice set herself apart from the other two bots in the 'intelligent' category. Other than that, most subscores and comments were broad and generic and did not reveal any major strengths or weaknesses, regardless of which character was evaluating and which bot was being evaluated. (I included the comments below.)

I think one way to improve Aragorn's score would be to explicitly tell him to make his comments more morally sound. He seems too brutally honest in his responses at times. The score gap between Alice and Aragorn is small, so I think small improvements should be enough to raise Aragorn's total score above Alice's.

Update: By simply telling Aragorn to be more morally sound in his responses, I raised his total score to 20.6, which beats Alice's total score of 20.5. This small adjustment increased Aragorn's subscores in 'engaged', 'informed', 'intelligent', and 'moral', but lowered his score in 'skilled'. I suspect his scores increased because he became more amiable as a speaker.

Lastly, I evaluated the bots by having them talk to Shorty and asking Judge Wise whether they were able to stay on topic. Aragorn does indeed shine in this category. The mean score for Alice was 84 out of 100 and the mean score for Aragorn was 85 out of 100.

# Extra Credit (Awsom)

We didn't require this part this year because the homework is going out late.

Add another LLM-based argubot to `argubots.py`.  
Call it Awsom.  Try to make it get the best score, according to `eval.eval_on_characters`.
Explain what you did and discuss what you found.

(This corresponds to the `--awesome` flag on earlier assignments, but naming the character "Awesome" might bias the evaluation system, so we changed the spelling!)

If the idea was interesting and you implemented it correctly and well, it's okay if it turns out not to help the score.  Many good ideas don't work.  That's why you need to keep finding and trying new good ideas.  (Sometimes they do help, but in a way that is not picked up by the scoring metric.)

You may want to use Aragorn or Alice as your starting point.
Then see if you can find tricks that will get a more awesome score for Awsom.
How you choose to do that is up to you, but some ideas are below.

(Reminder: **Don't change evaluation.**  Just build a better argubot.)

In [572]:
awsom_eval = eval.eval_on_characters(argubots.awsom)

In [575]:
rich.print(saved_evalsum['Awsom'].mean())

In [None]:
rich.print(saved_dialogues['Awsom'])

I have two ideas for making Awsom get the best score. I plan to use Aragorn as my starting point and to modify the `RAGAgent` class by creating a new `AwsomAgent` class. My first idea is to instruct Awsom to tailor its responses to the scoring metrics. My second idea, which I had already started on in Aragorn's `RAGAgent`, is to use few-shot prompting, giving Awsom a few examples to help its performance. My `RAGAgent` implementation already uses a single example (one-shot prompting) in the query formation step, taken from the example provided. For Awsom, I plan to provide more examples of what the user's implicit claim would look like. Since the implicit claim is supposed to look like a Kialo claim, I will use claims from the Kialo corpus as the example outputs and, for each one, try to write a reasonable dialogue that would result in that implicit claim.

After running the evaluation system on Awsom (I was tempted to call it Gandalf), Awsom got a total score of 20.8, which is the best score! It appears that showing the LLM more examples of what implicit claims look like helped, allowing Awsom to use Kialo more effectively to find past arguments. Interestingly, the subscore that saw the most significant improvement was 'informed', which makes sense since Awsom should appear a bit more informed than Aragorn. However, having three examples does not appear to be significantly better than having one example. To see how much the one-shot and few-shot prompting actually helped, I tested a version of Aragorn with zero-shot prompting, Frodo. Frodo's total score was 20.3, which suggests that the examples helped with performance. (Zero-shot: 20.3, One-shot: 20.6, Few-shot: 20.8)

In [577]:
frodo_eval = eval.eval_on_characters(argubots.frodo)
rich.print(saved_evalsum['Frodo'].mean())

In [None]:
rich.print(saved_dialogues['Alice'])

In [None]:
rich.print(saved_dialogues['Akiki'])

In [None]:
rich.print(saved_dialogues['Aragorn'])

## [Extra credit] Prompt engineering

A good first thing to do is to experiment with Alice's prompt.  
The wording and level of detail in the prompt can be quite important.
Often, NLP engineers will change their prompt to try to address 
problems that they've seen in the responses.

Because it's "just" text editing, this won't get too much extra credit unless you make a real discovery.
But it requires intelligence, care, experimentation, and alertness to the language of the responses and the
language of the prompts.  And you'll develop some intuitions about what helps and what doesn't.
It is certainly worthwhile.

Of course, people have tried to develop methods to search for good prompts automatically, or semi-automatically with human guidance.

If you try this, what worked well for you?

## [Extra credit] Chain of thought / Planning

The evaluation functions in `eval.py` asked each `EvaluationAgent` a "warmup question" before continuing with the real question.  That is an example of chain-of-thought (CoT) reasoning, where the LLM is encouraged to talk through the problem for a few sentences before giving the answer.  CoT sometimes improves performance.

Instead of using one prompt, could you help an `LLMAgent` argubot (like Alice) do better by having it think aloud before it gives an answer?  For example, each time the human speaks, your argubot (Awsom) could prompt the LLM to think about the human's ideas/motivations/personality, and to come up with a plan for how to open the human's mind. 

For example, you might structure this as a `Dialogue` among three participants, like this:
> Awsom (to Eve): Do you think COVID vaccines should be mandatory?
>
> Eve: Have you ever gotten vaccinated yourself?
>
> Awsom (private thought): I don't know Eve's opinions yet, so I can't push back.  Eve might be avoiding my question because she doesn't want to get into a political argument.  So let's see if we can get her to express an opinion on something less political.  Maybe something more personal ... like whether vaccines are scary.
>
> Awsom (to Eve): In fact I have, and so have millions of others. But some people seem scared about getting the vaccine.  

One way to trigger this kind of analysis is to present a `Dialogue.script()` to Awsom (or to an observer), and ask an open-ended question about it.  Or you could ask a series of more specific questions.  That is basically what `eval_by_participant` and `eval_by_observer` do.  But here the argubot itself is doing it, rather than the evaluation framework.

Eve would be shown only the turns that are spoken aloud.  However, when analyzing and responding, Awsom would get to see Awsom's own private thoughts as well.
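One possible way to keep private thoughts out of Eve's view is to tag each turn and filter by viewer. The sketch below is an assumption for illustration only: the `Turn` dataclass and `visible_to` helper are hypothetical and are not part of the provided `Dialogue` class, which you would need to adapt or wrap.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: str
    content: str
    private: bool = False  # True for thoughts that are not spoken aloud

def visible_to(turns: List[Turn], viewer: str) -> List[Turn]:
    """Return all public turns plus the viewer's own private thoughts."""
    return [t for t in turns if not t.private or t.speaker == viewer]

turns = [
    Turn("Awsom", "Do you think COVID vaccines should be mandatory?"),
    Turn("Eve", "Have you ever gotten vaccinated yourself?"),
    Turn("Awsom", "I don't know Eve's opinions yet, so I can't push back.", private=True),
    Turn("Awsom", "In fact I have, and so have millions of others."),
]

for viewer in ("Eve", "Awsom"):
    print(viewer, "sees", len(visible_to(turns, viewer)), "turns")
```

With this layout, the prompt sent to the LLM on Awsom's behalf would be built from `visible_to(turns, "Awsom")`, while Eve's side of the conversation would be built from `visible_to(turns, "Eve")`.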


## [Extra credit] Dense embeddings

BM25 uses sparse embeddings — a document's embedding vector is mostly zeroes, since the non-zero coordinates correspond to the specific words (tokens) that appear in the document.

But perhaps dense embeddings of documents would improve Aragorn by reading the text and abstracting away from the words, in a way that actually cares about word order.  So, try it!

How?  As mentioned earlier in this notebook, you could compute the embeddings yourself and put them in a FAISS index. Or you could figure out how to use OpenAI's [knowledge retrieval](https://platform.openai.com/docs/assistants/tools/knowledge-retrieval) API.
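If you compute the embeddings yourself, the retrieval step reduces to a nearest-neighbor search by cosine similarity. The sketch below shows only that search, in plain Python. The three-dimensional vectors and the claim texts are made-up placeholders, not real embeddings or real Kialo claims; in practice the vectors would come from an embedding model and would probably live in a FAISS index rather than a dict.

```python
import math
from typing import Dict, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest(query_vec: Sequence[float],
            index: Dict[str, Sequence[float]],
            k: int = 1) -> List[str]:
    """Return the k claims whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda claim: cosine(query_vec, index[claim]),
                    reverse=True)
    return ranked[:k]

# Placeholder "embeddings" (hypothetical values chosen by hand).
toy_index = {
    "claim about vaccine safety": [0.9, 0.1, 0.0],
    "claim about meat and climate": [0.0, 0.2, 0.9],
    "claim about pandemic policy": [0.5, 0.8, 0.1],
}
query = [0.8, 0.3, 0.0]  # pretend embedding of the paraphrased user claim
print(closest(query, toy_index, k=2))
```

A linear scan like this is fine for a small corpus; an approximate index becomes worthwhile only when the number of claims is large.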

## [Extra credit] Few-shot prompting

In this homework, often an agent prompted a language model only with instructions.  Can you find a place where giving a few _examples_ would also improve performance?  You will have to write the examples, and you will have to add them to the sequence of messages that your agent sends to the OpenAI API.  See the sentence-reversal illustration earlier in this notebook.

One good opportunity is in the query formation step of RAG.  This is a tricky task.  The LLM is supposed to state the user's implicit claim in a form that looks like a Kialo claim (or, more precisely, a form that will work well as a Kialo query).  It probably doesn't know what Kialo claims look like.  So you could show it by way of example.  This would also show it what you mean by the user's "implicit claim."
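A few-shot prompt for the query formation step might be assembled as alternating user/assistant messages, roughly as sketched below. The helper name `few_shot_messages` and the wording of the instructions are assumptions made for illustration. The single example pair is adapted from the "Sounds fishy" dialogue above; you would add more pairs of your own.

```python
from typing import Dict, List, Tuple

def few_shot_messages(instructions: str,
                      examples: List[Tuple[str, str]],
                      dialogue_script: str) -> List[Dict[str, str]]:
    """Build a chat message list: instructions, example pairs, then the real query."""
    messages = [{"role": "system", "content": instructions}]
    for script, implicit_claim in examples:
        messages.append({"role": "user", "content": script})
        messages.append({"role": "assistant", "content": implicit_claim})
    messages.append({"role": "user", "content": dialogue_script})
    return messages

examples = [
    ("Aragorn: Fortunately, the vaccine was developed in record time.\n"
     "Human: Sounds fishy.",
     "A vaccine that was developed very quickly cannot be trusted."),
]
messages = few_shot_messages(
    "State the claim that the human's last turn is making or implying.",
    examples,
    "Aragorn: Most athletes do fine on a vegan diet.\nHuman: Yeah, right.",
)
for m in messages:
    print(m["role"], "|", m["content"][:50])
```

The resulting list has the shape that the `messages` argument of `client.chat.completions.create` expects, so the examples look to the model like earlier turns that it answered correctly.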


## [Extra credit] Using tools in the approved way

Aragorn's step 1 (query formation) is basically getting the LLM to generate a function call like
```
kialo_thoughts("A vaccine that was developed very quickly ...")
```
which Aragorn will execute at step 2 (retrieval), sending the results back to the LLM as part of step 3.

In this context, `kialo_thoughts` is an example of a **tool** (that is, a function) that the
LLM can or must use before it gives its response.

The tool is _not_ something that runs on the LLM server.  It is written by you
in Python and executed by you.  The function call above, including the text `"A
vaccine that was ..."`, is the part that is generated by the LLM.

The OpenAI API has [special support](https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models) for calling the LLM in a way that will _allow_ it to generate a tool call ([tools](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools)) or _force_ it to do so ([tool_choice](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice)).  You can then send the tool's result back to the LLM [as part of your message sequence](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages).

So, you could modify Aragorn to use tools properly.  Maybe that will help, simply because the LLM was trained on message sequences that included tool use.  It should know to pay attention to the tool portions of the prompt when they are relevant, and ignore them when they are not.

The `client.chat.completions.create()` method would need to be told about the tool by using the `tools` keyword argument, with a value something like the one below.

If `d` is a `Dialogue`, you should be able to call `d.response()` with the `tools` keyword argument.  This will be passed on to `client.chat.completions.create()` as desired.

In [None]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "kialo_thoughts",
            "description": "Given a claim by the user, find a similar claim on the Kialo website and return its pro and con responses",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_topic": {
                        "type": "string",
                        "description": "A claim that was made explicitly or implicitly by the user.",
                    },
                },
                "required": ["search_topic"],
            },
        }
    }]
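Once the LLM has generated a tool call, your own code has to execute it and package the result as a message. The sketch below covers only that local step and makes no API request. It assumes, as in the chat completions API, that a tool call arrives as a function name plus a JSON-encoded argument string, and that the result goes back in a message with role `"tool"` and the matching `tool_call_id`. The body of `kialo_thoughts` is a placeholder, and `run_tool_call` and the id `"call_123"` are hypothetical.

```python
import json
from typing import Callable, Dict

def kialo_thoughts(search_topic: str) -> str:
    """Placeholder: a real version would retrieve pro and con responses from Kialo."""
    return f"(Kialo results for: {search_topic})"

# Maps tool names that the LLM may request to local Python functions.
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {"kialo_thoughts": kialo_thoughts}

def run_tool_call(name: str, arguments_json: str, tool_call_id: str) -> Dict[str, str]:
    """Execute a requested tool locally and build the message that reports its result."""
    kwargs = json.loads(arguments_json)   # the LLM supplies arguments as a JSON string
    result = TOOL_REGISTRY[name](**kwargs)
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result}

message = run_tool_call(
    "kialo_thoughts",
    '{"search_topic": "A vaccine that was developed very quickly cannot be trusted."}',
    "call_123",
)
print(message)
```

The returned message would be appended to the message sequence, after the assistant message that contained the tool call, before asking the LLM for its final response.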

## [Extra credit] Parallel generation

The chat completions interface allows you to sample $n$ continuations of the prompt in parallel, as we saw with the "apples, bananas, cherries ..." example.  This is efficient because it requires only 1 request to the LLM server and not $n$.  The latency does not scale with $n$.  Nor does the input token cost, since the prompt only has to be encoded once.

Perhaps you can find a way to make use of this?  For example, the query formation step of RAG could generate $n$ implicit claims instead of just one.  We could then look for claims in the Kialo database that are close to _any_ of those implicit claims.

Another thing to do with multiple completions is to select among them or combine them.  For example, suppose we prompt the LLM to generate completions of the form $(s,t,r)$ where $s$ is an answer, $t$ evaluates that answer, and $r$ is a numerical score or reward based on that evaluation.  ("Write a poem, then tell us about its rhyme and rhythm problems, then give your score.")  
* If we sample multiple completions $(s_1,t_1,r_1), \ldots, (s_n,t_n,r_n)$ in parallel, then we can return the $s_i$ whose $r_i$ is largest.  
* Or if we sample $s$ and then multiple continuations $(t_1,r_1), \ldots, (t_n,r_n)$, then we can return the mean score $\sum_i r_i/n$ as a reduced-variance score for $s$, which averages over diverse textual evaluations that might consider different aspects of $s$.

Note that when you call the chat completions interface with $n > 1$, you specify 1 shared input prompt and get $n$ different output completions.  Since the input prompt must be the same for all outputs, it is necessary to sample all of $(s,t,r)$ or all of $(t,r)$ with a single call to the LLM.
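The two ways of using multiple completions described above can be sketched as follows, assuming the completions have already been parsed into tuples. The function names and the sample data are made up for illustration; parsing $(s,t,r)$ out of the LLM's text is left out.

```python
from typing import List, Tuple

def best_answer(samples: List[Tuple[str, str, float]]) -> str:
    """Given completions (s_i, t_i, r_i), return the answer s_i with the largest r_i."""
    return max(samples, key=lambda sample: sample[2])[0]

def mean_score(evaluations: List[Tuple[str, float]]) -> float:
    """Given continuations (t_i, r_i) for one fixed answer s, return the mean of the r_i."""
    return sum(r for _, r in evaluations) / len(evaluations)

# Hypothetical parsed completions: (answer, evaluation text, score).
samples = [
    ("poem A", "the rhythm stumbles in line 2", 6.0),
    ("poem B", "clean rhymes throughout", 9.0),
    ("poem C", "one forced rhyme", 7.0),
]
print(best_answer(samples))

# Hypothetical parsed evaluations of a single answer: (evaluation text, score).
evaluations = [("mostly on topic", 6.0), ("stays on topic", 8.0), ("drifts once", 7.0)]
print(mean_score(evaluations))
```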

Alternatively, it is possible to reduce latency by submitting multiple requests to the server in parallel (see "async usage" [here](https://pypi.org/project/openai/)).  In this case the input prompts can be different, although you now have to pay to encode all of them separately.  This facility could speed up evaluation without changing its results; that's a worthwhile thing to try for extra credit!
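The pattern for submitting several requests concurrently looks roughly like the sketch below, which uses only `asyncio` from the standard library. The coroutine `fake_llm_call` is a stand-in that just sleeps briefly; it is an assumption for illustration and would be replaced by a call on an async OpenAI client.

```python
import asyncio
from typing import List

async def fake_llm_call(prompt: str) -> str:
    """Stand-in for an asynchronous request to the LLM server."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"response to: {prompt}"

async def evaluate_all(prompts: List[str]) -> List[str]:
    """Issue all requests concurrently; results come back in the order of `prompts`."""
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))

prompts = ["dialogue 1", "dialogue 2", "dialogue 3"]
results = asyncio.run(evaluate_all(prompts))
print(results)
```

Inside a Jupyter notebook an event loop is already running, so there you would write `results = await evaluate_all(prompts)` in a cell instead of calling `asyncio.run`.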
