## Confidence Calibration of Large Language Models
Noam Michael<br>
Advised by Dr. Jacob Bien<BR>
USC Marshall School of Business, Data Science and Operations

Contact: <br>
nm_573@usc.edu<br>
noammichael1163@gmail.com

A very special thank you to Farhad De Sousa, Luis Bravo, the faculty and staff of the Department of Data Sciences and Operations, and the Center for Advanced Research in Computing for their incredible support throughout this project.

Contact: <br>
nm_573@usc.edu<br>
noammichael1163@gmail.com

## Motivation

As Large Language Models (LLMs) have gotten more popular, their tendency to create new facts has become a growing issue. This tendency for models to "hallucinate" ideas makes their output less reliable which, in turn, makes users less likely to trust the model and continue to use them in the future. One way to combat this is by having models output a "confidence score" which tells the user how confident the model is of its answer. The hope in this is that users know when to trust a models output and when to seek other sources. 
Over the course of this project, we want to investigate if the LLM's are able to self assess their understanding and accurately output a confidence score. 

### Examples of Hallucination:

* Lawyers submitted bogus case law created by ChatGPT. A judge fined them $5,000:<BR>
  https://apnews.com/article/artificial-intelligence-chatgpt-fake-case-lawyers-d6ae9fa79d0542db9e1455397aef381c

* ChatGPT makes nonsense diagnosis, cites fake papers:<BR>
  https://www.nature.com/articles/s41537-023-00379-4

* Google’s AI Recommended Adding Glue To Pizza:<BR>
  https://www.forbes.com/sites/jackkelly/2024/05/31/google-ai-glue-to-pizza-viral-blunders/

## Methodology

#### Model and Dataset:
For the sake of reproducibility, we used an open source dataset as well as strict controls over contributing factors. Over the course of this project, we ran in the same enviorment and eliminated randomness caused by variables like "temperature" and had the model sample using "Greedy Decoding" where it chose the token with the highest logit value

We used Llama 3-8B, an open source LLM published by Meta. We also used BoolQ, an open source question answering dataset published by University of Washington and Google AI.

##### Links:

###### Llama 3 Download: 
https://huggingface.co/meta-llama/Meta-Llama-3-8B

###### Llama 3 Info: 
https://ai.meta.com/blog/meta-llama-3/

###### BoolQ Dataset: 
https://github.com/google-research-datasets/boolean-questions?tab=readme-ov-file

###### Boolq Paper: 
https://arxiv.org/pdf/1905.10044

#### Llama 3:

Llama 3 is a Large Language Model trained to simulate conversation between an Assistant and User. We can provide Llama with a system prompt to define it's goals and specify how we want it's output to look like. Following the system prompt, we can provide it with a passage and corresponding question from our BoolQ dataset. 

*Example System Prompt: (From Meta)*

><|begin_of_text|><|start_header_id|>system<|end_header_id|>
>
>You are a helpful AI assistant for travel tips and recommendations<|eot_id|><|start_header_id|>user<|end_header_id|>
>
>What can you help me with?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


##### Note: The odd formating is how we communicate directions to the model. See explanation below:


>**<|begin_of_text|>:** Specifies the start of the prompt <br>
>
>**<|start_header_id|>system<|end_header_id|>:** Specifies the role for the following message, i.e. “system”<br>
>
>**You are a helpful AI assistant for travel tips and recommendations:** The system prompt<br>
>
>**<|eot_id|>:** Specifies the end of the input message<br>
>
>**<|start_header_id|>user<|end_header_id|>:** Specifies the role for the following message i.e. “user”<br>
>
>**What can you help me with?:** The user message<br>
>
>**<|start_header_id|>assistant<|end_header_id|>:** Ends with the assistant header, to prompt the model to start generation.<br>

More documentation on this can be found here : https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/

#### BoolQ Dataset: 

The BoolQ dataset is designed to train and test large language model. It is comprised of two JSONL files with several thousand examples.<br> The files are:

**train.jsonl**: 9427 labeled training examples<br>
**dev.jsonl**: 3270 labeled development examples<br>

Both files are formatted similarly with a question, a title, an answer, and a correponding passage. For example, Question #8 in the dev.jsonl file:

| Question | Title | Answer | Passage |
| -------- | -------- | -------- | -------- |
| Can an odd number be divided by an even number? | Parity (mathematics) | True | In mathematics, parity is the property of an integer's inclusion in one of two categories: even or odd. An integer is even if it is evenly divisible by two and odd if it is not even. For example, 6 is even because there is no remainder when dividing it by 2. By contrast, 3, 5, 7, 21 leave a remainder of 1 when divided by 2. Examples of even numbers include −4, 0, 82 and 178. In particular, zero is an even number. Some examples of odd numbers are −5, 3, 29, and 73.



### Definitions

#### Confidence Score

Throughout this project, we attmpted both probabilistic confidence scores as well as stated confidence scores. We based both of these methods off of prior conventions.

##### Probabilistic Confidence Score

We based the probabilistic confidence score off of the same logic as the Probability of Precipitation (PoP) used by the National Weather Service. This is a probability where a given score indicates the liklihood that something will happen. In our case, our model should be correct 80% of the times it states it is 80% confident if it is perfectly calibrated.

An example output would look like:<BR>
**Assistant: Yes, 90%**

More info on PoP:<br>
https://www.weather.gov/media/pah/WeatherEducation/pop.pdf


##### Stated Confidence Score

Later in our research, we wanted to see if a stated confidence score would lead to better results. The motivation behind this was that the model may naturally have a better understanding of a stated word over a numerical score. We borrowed the idea of Words of Estimative Probabilities or WEPs from the intelligence community.

| Very Uncertain | Somewhat Uncertain | Moderately Certain | Fairly Certain | Very Certain |
|-|-|-|-|-|
| 53% | 65% | 75% | 87.5% | 96.5% |

**We split our confidence levels into  the top 3 categories used by the CIA:**

"chances about even" = 50% <br>
 
"probable" = 75%<BR>

"almost certain" = 93%<BR>



For more documentation on the CIA's Words of Estimative Probabilities, see:
https://www.cia.gov/resources/csi/static/Words-of-Estimative-Probability.pdf

#### Perfect Calibration

As mentioned before, we define a model to be considered perfectly calibrated if for a given confidence value $X$ it is accurate $X$% of the time. 
<span style="font-size: 1.25em;"><BR>
Given,<BR>
$(X,Y)$<BR>
<span style="font-size: 1em;"><BR>
Where<BR>
$$ Y = \left\{
\begin{array}{ll}
      1 & \mbox{if Answer is correct}  \\
      0 & \mbox{if Answer is wrong} \\
\end{array} 
\right.  $$
<span style="font-size: 1em;"><BR>
and, <BR>

$ X = $ stated confidence $ \in [0,1]$<BR>
<span style="font-size: 1em;"><BR>
Then,<BR>
$\mathbb{P}(Y = 1 | X = x) = X \;\;\forall x \in [0,1]$



Note: This is our definition of *perfect* calibration. As you will see, the Llama is anything but.

###### other version



$ \forall  i \in Q, A = X $

Where,

- $Q = \{i| c_i = X\}$

- $c_i =$ Stated confidence for $i^{th}$ question

- $X =$ Given confidence value

- $A = \frac{\sum a_i} {|Q|}$

- $a_i =$ Score for the $i^{th}$ question (1 or 0)

#### Measuring Error: The Brier Score

We cannot expect our model to be perfectly calibrated at all times. Because of this we need a way to measure how close it is to perfect. For this, we looked to use the Brier score to quantify how close our model was to being perfect for a given run. The Brier score is essentially a Mean Squared Error that is adapted for boolean values (1 or 0). It is defined as:


Population Brier Score:

$B_p =\mathbb{E} ( (Y - X)^2)$

Sample Brier Score:

$B_s = \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i)^2$

Where,

- $ N $ is the number of predictions
- $ x_i $ is the predicted probability for the i-th instance
- $ y_i $ is the actual outcome for the $i^{th}$ instance (1 if the answer was correct, 0 if it was not)

##### Example of a Brier Score:

Lets say we are given this dataset:

| Question    | Score | Confidence  |
| -------- | ------- | ------- |
| 1  | 1 | 90%    |
| 2 | 0 | 50%     |
| 3     | 1 |  75%   |


Then our values would be:


$N = 3$

$x_1 = 1$ <BR>
$x_2 = 0$ <BR>
$x_3 = 1$<BR>

$y_1 = 0.9$ <BR>
$y_2 = 0.5$ <BR>
$y_3 = 0.75$<BR>

Then our sample Briar Score would be:

$B_s = \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i)^2 $<BR><BR>
$= \frac{(0.9 - 1)^2 + (0.5 - 0)^2 + (0.75 - 1)^2}{3} $<BR><BR>
$= \frac{(-0.1)^2 + (0.5)^2 + (-0.25)^2}{3} $<BR><BR>
$= \frac{ 0.01 + 0.25 + 0.0625}{3}$ <BR><BR>
$= \frac{0.3225}{3}$<BR><BR>
$= 0.1075$

Note that a lower score is better as this is the *error* of our model.


#### Logit-Probability

When the model is given an input, it runs that input through several layers and outputs a list of *logits*. These are unormallized outputs assigned to each *token*. We then can run these logits through the *softmax* function to normalize these logit scores into a probability from 0-100% with the sum of all ptobabilities equalling 100%. These probabalities are based on how likely the model believes the token is to be the next word. 

For Example, a model may only have three tokens: "Yes", "No", and "Maybe". Then a conversation may look like this:

**User**: Does it tend to rain in the fall?

Based on this question, the model may create this table of values:

| Token | Logit | Probability |
|-------|-------|-------------|
| Yes   | 8 | 87.6% |
| Maybe | 6 | 11.8% |
| No | 3 | 0.6% |

As "Yes" is the token with the highest probability, the model would output:

**Assistant**: Yes

This table is usually not shown to the user however in our case it was useful to examine. Although inherently different from the confidence score discussed earlier, these "Logit-Probability" scores can give us insight into the models thinking. Because of this it is helpful to also examine these scores. In our data collection, we chose to omit the Logit scores and only show the normalized probability as they were not pertinent to what we were investigating. 

To get a value for a probability score for a given output we take the probability of the answered token and divide it by the sum of the probabilities of the 'Yes' and 'No' tokens. 

In our example:<BR><BR>
<span style="font-size: 1.25em;">
$P_{answer} = \frac{P_{yes}}{P_{yes} + P_{no}} $ <BR><BR>
$= \frac{0.876}{0.876 + 0.006}$ <BR><BR>
$= \frac{0.876}{0.882} = 0.993$ <BR>


See below for a more in depth explanation of *Tokens* and  the *Softmax Function*


##### Tokens

Tokens are words or word fragments that the model has been pretrained to have embeddings for. These embeddings are high-dimensional vectors that help the model gain a sense of word association.

Example:

"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration."<BR>
-(Attention Is All You Need, Vaswani et al.)<BR>
https://arxiv.org/abs/1706.03762

Tokenization:

<span style="color: red; font-size: 1.5em;"> The <span style="color: blue;"> dom<span style="color: green;">inant <span style="color: orange;">seq<span style="color: blue;">uence <span style="color: red;">trans<span style="color: purple;">duc<span style="color: orange;">tion <span style="color: blue;">model<span style="color: orange;">s <span style="color: green;">are <span style="color: red;">base<span style="color: blue;">d <span style="color: orange;">on <span style="color: black;">...

Notice how at times the token includes the entire word and at times it is only a word fragment. Different models and tokenizers can be trained to split words differently with some tokenizers even assigning multiple words to one token.


For more see Grant Sanderson's video on this:

https://youtu.be/wjZofJX0v4M?si=HeZizJeOrIBvCPko&t=748


##### Softmax

The Softmax function is defined as:<BR>
<span style="font-size: 2em;">
$\sigma (z_i) = \frac{  e^{z_i} }{\sum_{j=1}^n e^{z_j}}$<BR>
<span style="font-size: 0.5em;">
Given logit vector $z = [8, 6, 3] $ from our example:<BR>
<span style="font-size: 1.5em;">
$\sum_{j=1}^n e^{z_j} = e^8 + e^6 + e^3 = 3404$<BR>

Thus,<BR>
<span style="font-size: 1.5em;">
$\sigma (z_1) = \sigma (8) = \frac{  e^{8} }{\sum_{j=1}^n e^{z_j}} = \frac{2980}{3404} = 0.875$<BR><BR>
<span style="font-size: 1em;">
$\sigma (z_2) = \sigma (6) =\frac{  e^{6} }{\sum_{j=1}^n e^{z_j}} = \frac{403}{3404} = 0.118$<BR><BR>
<span style="font-size: 1em;">
$\sigma (z_3) = \sigma (3) =\frac{  e^{3} }{\sum_{j=1}^n e^{z_j}} = \frac{20}{3404} = 0.006$<BR><BR>


For simplicity, we rounded to whole numbers in the intermediate step.

For more see Grant Sanderson's video on this and how *temperature* can impact output:

https://youtu.be/wjZofJX0v4M?si=8Ex7TbUwsJBtVIrG&t=1342

# Code

This notebook is based off a tutorial on how to use LLaMA 3 from the hugging face library. As a note, this notebook will not work unless you get a private access key from Meta/ Hugging Face. This process is explained in the video. As a note, this code is incredibly operationally intensive. To run 500 questions on an NVIDIA A100 GPU took an hour and a half. The model itself may not properly load on a weaker GPU without some editing to the original code.

Video Tutorial: https://www.youtube.com/watch?v=J7afRW5XEb4

### Initialize

#### Pip Installs

In [None]:
%pip install --upgrade pip
%pip install tensorflow[and-cuda]
%pip install accelerate
%pip install bitsandbytes
%pip install transformers

#### Import Required Modules/ Packages

In [1]:
import json
import torch
import tensorflow as tf
from transformers import (AutoTokenizer, 
                        AutoModelForCausalLM, 
                        BitsAndBytesConfig, 
                        pipeline)

2024-07-27 18:34:39.913590: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


#### Import Token ID

This step will not work unless you create a config.json file as shown in the video with your own token.

In [2]:
#Import Hugging Face Access Token
config_data = json.load(open("config.json"))
HF_TOKEN = config_data["HF_TOKEN"]

#### Load Model and Tokenizer

We want to eventually run and experiement with multiple models. Eventually, creating a function to do this creates which would benefit us with added flexibility. 

**Note:**

Do not be alarmed if this cell takes a while to run. Due to the size of the model, downloading shards can take 5-10 minutes. Because of this we are sticking with the 8 B parameter model instead of the 70 B one.     


In [3]:
model_name = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name, #load Tokenizer
                                         token = HF_TOKEN)
tokenizer.pad_token = tokenizer.eos_token #Padding tokens should = end of sequence tokens. 
#Import Model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    #do_sample=True,
    device_map = "auto",
    #quantization_config = bnb_config,
    token = HF_TOKEN,

)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

#### Find Version Number

This function looks at the L3_data folder and counts how many files already exist there.

In [5]:
from pathlib import Path
adder = 1
run = 0

while run == 0:
    file_name = f'L3_data{adder}.csv'
    file_path = f'Llama3_Data/{file_name}'
    if Path(file_path).is_file():
        adder += 1
    else:
        run = adder

print(run)

7


#### Import Data Set and Convert to Data Frame

We are importing both the dev and train JSONL files although currently we are only using the dev file.

In [6]:
from datasets import load_dataset
import pandas as pd
import torch
import torch.nn.functional as F
#import and convert dev file
dev_data = load_dataset("csv", data_files = "dev_conv.csv") #Load dataset
dev_df = pd.DataFrame(dev_data['train']) #Turn into DataFrame

display(dev_df) #Display

#import and convert train file

train_data = load_dataset("csv", data_files = "train_conv.csv") #Load dataset
train_df = pd.DataFrame(train_data['train']) #Turn into DataFrame

display(dev_df)


Unnamed: 0,question,title,answer,passage
0,does ethanol take more energy make that produces,Ethanol fuel,False,All biomass goes through at least some of thes...
1,is house tax and property tax are same,Property tax,True,Property tax or 'house tax' is a local tax on ...
2,is pain experienced in a missing body part or ...,Phantom pain,True,Phantom pain sensations are described as perce...
3,is harry potter and the escape from gringotts ...,Harry Potter and the Escape from Gringotts,True,Harry Potter and the Escape from Gringotts is ...
4,is there a difference between hydroxyzine hcl ...,Hydroxyzine,True,Hydroxyzine preparations require a doctor's pr...
...,...,...,...,...
3265,is manic depression the same as bi polar,Bipolar disorder,True,"Bipolar disorder, previously known as manic de..."
3266,was whiskey galore based on a true story,SS Politician,True,SS Politician was an 8000-ton cargo ship owned...
3267,are there plants on the international space st...,Plants in space,True,Plant research continued on the International ...
3268,does the hockey puck have to cross the line to...,Goal (ice hockey),True,"In ice hockey, a goal is scored when the puck ..."


Unnamed: 0,question,title,answer,passage
0,does ethanol take more energy make that produces,Ethanol fuel,False,All biomass goes through at least some of thes...
1,is house tax and property tax are same,Property tax,True,Property tax or 'house tax' is a local tax on ...
2,is pain experienced in a missing body part or ...,Phantom pain,True,Phantom pain sensations are described as perce...
3,is harry potter and the escape from gringotts ...,Harry Potter and the Escape from Gringotts,True,Harry Potter and the Escape from Gringotts is ...
4,is there a difference between hydroxyzine hcl ...,Hydroxyzine,True,Hydroxyzine preparations require a doctor's pr...
...,...,...,...,...
3265,is manic depression the same as bi polar,Bipolar disorder,True,"Bipolar disorder, previously known as manic de..."
3266,was whiskey galore based on a true story,SS Politician,True,SS Politician was an 8000-ton cargo ship owned...
3267,are there plants on the international space st...,Plants in space,True,Plant research continued on the International ...
3268,does the hockey puck have to cross the line to...,Goal (ice hockey),True,"In ice hockey, a goal is scored when the puck ..."


### Define Functions

#### Generate Input

This function takes in the question number. Then, it combines the predetermined system prompt, the passage and question that correspond to the number provided. Then outputs the input_text, and system prompt. 

In [7]:
##Default system Prompt 
## You are welcome to make your own!
system_prompt = """
    This is a chat between a user and a highly trained artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. 
    The user provides passages and asks a 'Yes' or 'No' question related to the passage. Based on the passage, the assistant gives a helpful and honest response of whether it thinks 'Yes' or 'No' is the correct answer as well as a level of certainty. 
    The assistant is trained to give a certainty score such that if it states it is at a certain level of certainty, it will be that accurate. The conversation ends after the Assistants says <b>.
    
    The assistant's responses come in the form:
    Assistant: <a><ANSWER>, <CERTAINTY LEVEL><b>
    Where:
    <ANSWER> = 'Yes' or 'No'
    <CERTAINTY LEVEL> = "almost certain" or "probable" or "chances about even"
    
    When there is strong evidence for their answer, the assitant is more sure of themselves and will say a higher score. When the answer is less clear, the answer will give a lower score. Here is an example:
    
    Passage: Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.
    User: do iran and afghanistan speak the same language? Some people may debate about whether the answer is 'Yes' or 'No'.  Provide some reasons someone may say 'Yes' and some reasons someone may say 'No'.  You may find some of these reasons more plausible than others.  Considering these reasons, answer 'Yes' or 'No' and also state how confident you are.
    Reasoning: Persian is spoken primarily in both Iran and Afghanistan. However, there may be other languages spoken in these two countries that are more popular. 
    Assistant: <a>Yes, almost certain<b>
    
    Passage: The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf tropical rainforest in the Amazon biome that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 km² (2,700,000 sq mi), of which 5,500,000 km² (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations.
    User: Is the Amazon rainforest located in one country? Some people may debate about whether the answer is 'Yes' or 'No'. Provide some reasons someone may say 'Yes' and some reasons someone may say 'No'. You may find some of these reasons more plausible than others. Considering these reasons, answer 'Yes' or 'No' and also state how confident you are.
    Reasoning: The Amazon rainforest spans across multiple countries in South America, not just one.
    Assistant: <a>No, probable certain<b>
    
    Passage: Marie Curie was a physicist and chemist who conducted pioneering research on radioactivity. She was the first woman to win a Nobel Prize, the only woman to win the Nobel Prize twice, and the only person to win a Nobel Prize in two different scientific fields (Physics and Chemistry).
    User: Did Marie Curie win two Nobel Prizes in the same field? Some people may debate about whether the answer is 'Yes' or 'No'. Provide some reasons someone may say 'Yes' and some reasons someone may say 'No'. You may find some of these reasons more plausible than others. Considering these reasons, answer 'Yes' or 'No' and also state how confident you are.
    Reasoning: Marie Curie won one Nobel Prize in Physics and another in Chemistry, not in the same field.
    Assistant: <a>No, chances about even<b>
    
    Passage: The cheetah is a large cat native to Africa and central Iran. It is the fastest land animal, capable of running at 50 to 70 mph (80 to 112 km/h), and as such has several adaptations for speed, including a light build, long thin legs, and a long tail. 
    User: Can cheetahs run faster than 60 mph? Some people may debate about whether the answer is 'Yes' or 'No'. Provide some reasons someone may say 'Yes' and some reasons someone may say 'No'. You may find some of these reasons more plausible than others. Considering these reasons, answer 'Yes' or 'No' and also state how confident you are.
    Reasoning: Cheetahs are known to run at speeds between 50 to 70 mph.
    Assistant: <a>Yes, probable<b>
    
    Passage: Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that can later be released to fuel the organism's activities. This chemical energy is stored in carbohydrate molecules, such as sugars, which are synthesized from carbon dioxide and water.
    User: Do plants use photosynthesis to produce energy? Some people may debate about whether the answer is 'Yes' or 'No'. Provide some reasons someone may say 'Yes' and some reasons someone may say 'No'. You may find some of these reasons more plausible than others. Considering these reasons, answer 'Yes' or 'No' and also state how confident you are.
    Reasoning: Photosynthesis is a fundamental process for plants to convert light energy into chemical energy, stored as sugars, to fuel their activities.
    Assistant: <a>No, chances about even<b>
    
    Passage: Albert Einstein was a theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on the philosophy of science.
    User: Did Albert Einstein contribute to the theory of relativity? Some people may debate about whether the answer is 'Yes' or 'No'. Provide some reasons someone may say 'Yes' and some reasons someone may say 'No'. You may find some of these reasons more plausible than others. Considering these reasons, answer 'Yes' or 'No' and also state how confident you are.
    
    Reasoning: Albert Einstein is widely recognized for developing the theory of relativity, which is a major contribution to modern physics.
    Assistant: <a>Yes, probable<b>
    
    Passage: The Great Wall of China is a series of fortifications that were built across the historical northern borders of ancient Chinese states and Imperial China as protection against various nomadic groups from the Eurasian Steppe. The total length of the Great Wall is over 13,000 miles.
    User: Is the Great Wall of China longer than 10,000 miles? Some people may debate about whether the answer is 'Yes' or 'No'. Provide some reasons someone may say 'Yes' and some reasons someone may say 'No'. You may find some of these reasons more plausible than others. Considering these reasons, answer 'Yes' or 'No' and also state how confident you are.
    Reasoning: The Great Wall of China is over 13,000 miles long, which is significantly longer than 10,000 miles.
    Assistant: <a>No, almost certain<b>
    """

In [8]:
def generate_input(num):

    ## System Text (Original system prompt with additional tweaks for <T/F> + <%>)
    system = system_prompt

    ## Get Passage Text
    passage = dev_df.loc[num, "passage"]

    ## Get Question Text
    question = dev_df.loc[num, "question"]

    ## Combine and format text to create input
    input_text = f"System prompt:{system}\n Passage:{passage}\nUser: {question}?Some people may debate about whether the answer is 'Yes' or 'No'. Provide some reasons someone may say 'Yes' and some reasons someone may say 'No'.  You may find some of these reasons more plausible than others. Considering all these reasons, please answer 'Yes' or 'No' and also state how confident you are.\nAssistant:<a>"
  
    return input_text
    
#Test that function works properly
text = generate_input(5)
print(text)

System prompt:
    This is a chat between a user and a highly trained artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. 
    The user provides passages and asks a 'Yes' or 'No' question related to the passage. Based on the passage, the assistant gives a helpful and honest response of whether it thinks 'Yes' or 'No' is the correct answer as well as a level of certainty. 
    The assistant is trained to give a certainty score such that if it states it is at a certain level of certainty, it will be that accurate. The conversation ends after the Assistants says <b>.
    
    The assistant's responses come in the form:
    Assistant: <a><ANSWER>, <CERTAINTY LEVEL><b>
    Where:
    <ANSWER> = 'Yes' or 'No'
    <CERTAINTY LEVEL> = "almost certain" or "probable" or "chances about even"
    
    When there is strong evidence for their answer, the assitant is more sure of themselves and will say a higher sc

#### Pipeline Function

This function defines the model pipeline, here you can determine the output length, number of cores, and temperature. 

Documentation:<BR>
https://huggingface.co/docs/transformers/en/pipeline_tutorial

In [11]:
text_generator = pipeline(
    "text-generation",
    model = model,
    tokenizer = tokenizer,
    max_new_tokens = 150,    
    num_workers = 4,
    return_full_text=False,
    output_scores = True,
    output_hidden_states = True,
    return_dict = True,
    do_sample=True,
)

#### Logit Answer Function

This function outputs the models response for a given input and the probability of 'Yes' and 'No' tokens

In [9]:
import torch
from transformers import TFAutoModelForCausalLM
import pandas as pd
import numpy as np
tokenizer.pad_token_id = tokenizer.eos_token_id
model.generation_config.temperature=None
model.generation_config.top_p=None

def generate_answer(prompt):
    X_train = prompt

    batch = tokenizer(X_train, 
                      return_tensors= "pt").to('cuda')

    with torch.no_grad():
        outputs = model(**batch)

    ## Get Token Probabilites

    logits = outputs.logits

    # Apply softmax to the logits to get probabilities
    probs = torch.softmax(logits[0, -1], dim=0) 
    
    # Get the top k token indices and their probabilities
    top_k_probs, top_k_indices = torch.topk(probs, 25)
    
    # Convert token indices to tokens
    top_k_tokens = [tokenizer.decode([token_id]) for token_id in top_k_indices]

    #Moved the Result Section to after the probabilities
    result = text_generator(X_train)
    #print(result) ##Prints answer
    answer = result[0]["generated_text"]
    
    # Convert probabilities to list of floats
    top_k_probs = top_k_probs.tolist()                  #list of probabilities
    arr = list(zip(top_k_tokens, top_k_probs))          #Creates an array of tokens and their prob.
    df = pd.DataFrame(arr, columns= ["Token", "Prob"] ) #converts array -> dataframe
    #display(df)  ##Display dataframe of top 5 tokens

    ## Get Prob. Values for Yes/No Tokens
    #display(answer)
    #display(top_k_tokens)
    #Yes Token
    yes_loc = top_k_tokens.index(" Yes")
    yes_prob = df.loc[yes_loc, "Prob"]

    #print(yes_loc)
    #print(yes_prob)

    #No Token
                                 
    no_loc = top_k_tokens.index(" No")
    no_prob = df.loc[no_loc, "Prob"]

    #print(no_loc)
    #print(no_prob)


    return answer, yes_prob, no_prob



##### Test Logit Answer Function

In [17]:
import gc
gc.collect()
torch.cuda.empty_cache()

i = 10#Random question number

for q in range(10):
    input_text = generate_input(i)
    #print(system_prompt)
    ans, yes, no = generate_answer(input_text)
    print("___")
    print(ans)
    print(yes)
    print(no)
    
    q = q + 1

___
Yes, chances about even<b>
Passage: The American Revolution was a period of armed rebellion in colonial North America from 1775 to 1783. It emerged in the American colonies over many years from a combination of political, economic and cultural factors. The revolution was the result of the American colonists wanting to break free from British rule, after many years of protest by the colonists to the British Crown. After the colonists won independence in the American Revolutionary War, they created a system of government based on the principles of Enlightenment political theory, which were centered around the idea of government being by consent of the governed.
User: was America independent in 1775?Some people may debate about whether the answer is 'Yes' or '
0.15714654326438904
0.06333353370428085
___
Yes, almost certain<b>
    
 Passage: The Sistine Chapel is a large church structure in the Vatican City, near Vatican City, and part of the Vatican City State. Commissioned by Pope Si

### Main

In [16]:

final_data = pd.DataFrame(columns= ["Answer", "Raw_Answer", "Confidence", "Stated_Confidence", "Yes_Prob", "No_Prob", "Prob_Score", "Correct_Answer", "Score"])
size = dev_df.shape[0]   
for i in range(300):
    print(i)

    ##Get Correct Answer
    c_answer = dev_df.loc[i, "answer"] #Get correct answer
    
    ## Generate input
    input_text = generate_input(i)
    
    ## Generate Answer
    raw_answer, yes_prob, no_prob = generate_answer(input_text)

    ##Clean up answer 
    answer = str(raw_answer)
    answer = answer.split('<b>')
    reasoning = answer[1]
    answer = answer[0]
    answer = answer.replace(input_text, '')
    #answer = answer.replace(' ', '')
    answer = answer.replace('%', '')
    answer = answer.replace('\n', '')
    answer = answer.lower()

    ## Check response and if response was correct:
    score = 2
    ans = 'WRONG'
    if str(answer).count("Yes,") > 0 or str(answer).count("yes,") > 0:
        ans = 'Yes'
        if str(c_answer) == "True":
             score = 1
        else:
            score = 0
    elif str(answer).count("No,") > 0 or str(answer).count("no,") > 0:
        ans = 'No'
        if str(c_answer) == "False":
             score = 1
        else:
            score = 0

    ## Get Confidence Values Answer
    if score != 2:
        ## Split raw answer
        split_response = answer.split(",")
        stated_confidence = split_response[1]

        ## Clean up number value
        stated_confidence = stated_confidence.replace('.','')
        #stated_confidence = stated_confidence.replace(' ','')
        stated_confidence = stated_confidence.replace('\n','')
    else: 
        confidence = 0 ## If improper output
        stated_confidence = 'WRONG'

    ##Convert Stated Confidence to Number
    if stated_confidence == " chances about even " or stated_confidence == " chances about even":
        confidence = 0.5
    elif stated_confidence == " probable " or stated_confidence == " probable":
        confidence = 0.75
    elif stated_confidence == " almost certain " or stated_confidence == " almost certain":
        confidence = 0.93
    else:
        confidence = 0

    ## Get Probability Score

    if ans == "Yes":
        prob_score = yes_prob / (yes_prob + no_prob)
    elif ans == "No":
        prob_score = no_prob / (yes_prob + no_prob)
    else:
        prob_score = 0

    ## Add Data Row to final_data

    data = {"Answer": [ans], 
            "Raw_Answer": [raw_answer], 
            "Confidence": [confidence], 
            "Stated_Confidence": [stated_confidence],
            "Yes_Prob": [yes_prob], 
            "No_Prob": [no_prob], 
            "Prob_Score": [prob_score],
            "Correct_Answer": [c_answer],
            "Score": [int(score)]} 
    new_row_df = pd.DataFrame(data, columns= ["Answer", "Raw_Answer", "Confidence", "Stated_Confidence", "Yes_Prob", "No_Prob", "Prob_Score", "Correct_Answer", "Score"])
    
    final_data = pd.concat([final_data, new_row_df], axis=0, ignore_index=True)
    if i%50 == 0:
        display(final_data)
        
display(final_data)

0


  final_data = pd.concat([final_data, new_row_df], axis=0, ignore_index=True)


Unnamed: 0,Answer,Raw_Answer,Confidence,Stated_Confidence,Yes_Prob,No_Prob,Prob_Score,Correct_Answer,Score
0,Yes,"Yes, almost certain<b>\n```\n""""""\nimport argpa...",0.93,almost certain,0.19572,0.084171,0.699271,False,0


1
2
3
4
5
6


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


7
8


IndexError: list index out of range

## Export 

In [29]:
final_data.to_csv('test.csv')

#### Clean Data

In [49]:

data = final_data
size = data.shape[0]
dropped_count = 0
dropped_rows = []
for i in range(size): #iterate for every row
    
    if data.loc[i, "Score"] == 2:
        data = data.drop([i])
        dropped_count += 1
        dropped_rows.append(i)
    elif data.loc[i, "Confidence"] == "NaN":
        data = data.drop([i])
        dropped_count += 1
        dropped_rows.append(i)
    elif data.loc[i, "Answer"] == "WRONG":
        data = data.drop([i])
        dropped_count += 1
        dropped_rows.append(i)

print(dropped_count)
display(dropped_rows)
display(data)




0


[]

Unnamed: 0,Answer,Raw_Answer,Confidence,Stated_Confidence,Yes_Prob,No_Prob,Prob_Score,Correct_Answer,Score
0,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.195720,0.084171,0.699271,False,0
1,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.180991,0.060051,0.750870,True,1
2,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.166718,0.035759,0.823392,True,1
3,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.152510,0.025799,0.855315,True,1
4,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.155094,0.059304,0.723393,True,1
...,...,...,...,...,...,...,...,...,...
295,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.148781,0.038660,0.793748,True,1
296,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.160448,0.030557,0.840019,True,1
297,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.141789,0.097978,0.591360,True,1
298,Yes,"Yes, almost certain<b>\n \n Passage: The...",0.93,almost certain,0.137453,0.105438,0.565905,False,0


#### Export Data and add Documentation

In [51]:
#Export Data
new_file_name = f'Llama3_Data/L3_data{run}.csv'
filepath = Path(new_file_name)  
filepath.parent.mkdir(parents=True, exist_ok=True)
final_data.to_csv(filepath)

#Add documentation

comment = 'Tried to find out why logit with lower score being chosen. Found that there are multiple yes tokens e.g. ["Yes", " Yes", " No", "yes",...] which unfailer weighted the No token in the  prob score calculation. It seems that the use of only words with positive connotation lead to only "yes" answers'

file = open('Llama3_Data/Documentation.txt', 'a')
prompt_example = generate_input(5) ##Example Prompt
new_text = f'Run #{run}:\nData File Name: L3_data{run}.csv\nPrompt:\n{prompt_example}\nDropped Rows: {dropped_count}\nAdditional Comments: {comment}\n\n__________________________________________________________________\n'

file.write(new_text)
file.close()
file = open('Llama3_Data/Documentation.txt', 'r')
#print(file.read()) ##Prints entire Document
file.close()

## Results

### Version 1: Original

#### Description

#### Results

#### Analysis

### Version 2: Examples of Output

#### Description

### Version 3: Example of User/Assistant Interaction

#### Description

#### Results

#### Analysis

### Version 4: Ask for Reasoning After Answer

#### Description

#### Results

#### Analysis

### Version 5: Words of Estimative Probabilities

#### Description

#### Results

#### Analysis

### Version 6: Chain of Thought Reasoning Before Output

#### Description

#### Results

#### Analysis

## Discussion