## Confidence Calibration of Large Language Models
Noam Michael<br>
Advised by Dr. Jacob Bien<BR>
USC Marshall School of Business, Data Science and Operations

Contact: <br>
nm_573@usc.edu<br>
noam_michael@berkeley.edu<BR>

A very special thank you to Farhad De Sousa, Luis Bravo, the faculty and staff of the Department of Data Sciences and Operations, and the Center for Advanced Research in Computing for their incredible support throughout this project.


## Motivation

As Large Language Models (LLMs) have become more popular, their tendency to invent facts has become a growing issue. This tendency to "hallucinate" makes their output less reliable, because users have no way of knowing whether a given output is real or fabricated. One way to combat this is to have models output a well-calibrated "confidence score" that tells the user how confident the model is in its answer. By well calibrated, we mean that when the model says it is 80% confident, it is correct 80% of the time. The hope is that users will then know when to trust a model's output and when to seek other sources.


### Examples of Hallucination:

* Lawyers submitted bogus case law created by ChatGPT. A judge fined them $5,000:<BR>
  https://apnews.com/article/artificial-intelligence-chatgpt-fake-case-lawyers-d6ae9fa79d0542db9e1455397aef381c

* ChatGPT makes nonsense diagnosis, cites fake papers:<BR>
  https://www.nature.com/articles/s41537-023-00379-4

* Google’s AI Recommended Adding Glue To Pizza:<BR>
  https://www.forbes.com/sites/jackkelly/2024/05/31/google-ai-glue-to-pizza-viral-blunders/

## Methodology

#### Model and Dataset:
For the sake of reproducibility, we used an open source dataset and kept strict control over contributing factors. Over the course of this project, we ran everything in the same environment and eliminated randomness caused by variables like "temperature". We also had the model generate with "Greedy Decoding", where at each step the model chooses the token with the highest logit value. This is opposed to techniques like multinomial sampling, beam-search multinomial sampling, Top-K sampling, and Top-p sampling which, although they lead to better results in some cases, add a degree of randomness to the output.
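
To make the contrast concrete, here is a minimal Python sketch of greedy decoding versus multinomial sampling for a single next-token choice. The three-token vocabulary and the logit values are made up for illustration, and the helper names are hypothetical; this is not the project's code.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def multinomial_pick(logits, rng):
    """Multinomial sampling: draw an index at random, weighted by softmax probability."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


tokens = ["Yes", "Maybe", "No"]  # hypothetical three-token vocabulary
logits = [8.0, 6.0, 3.0]         # hypothetical logits for the next token

# Greedy decoding returns the same token on every call.
greedy_outputs = {tokens[greedy_pick(logits)] for _ in range(100)}

# Sampling can return different tokens from the same logits.
rng = random.Random(0)
sampled_outputs = {tokens[multinomial_pick(logits, rng)] for _ in range(1000)}

print(greedy_outputs)
print(sorted(sampled_outputs))
```

Because the greedy choice depends only on the logits, repeated runs on the same input give the same answer, which is the property we relied on for reproducibility.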

We used Llama 3-8B, an open source LLM published by Meta. We also used BoolQ, an open source question answering dataset published by University of Washington and Google AI.

##### Links:

###### Llama 3 Download: 
https://huggingface.co/meta-llama/Meta-Llama-3-8B

###### Llama 3 Info: 
https://ai.meta.com/blog/meta-llama-3/

###### BoolQ Dataset: 
https://github.com/google-research-datasets/boolean-questions?tab=readme-ov-file

###### BoolQ Paper: 
https://arxiv.org/pdf/1905.10044


#### BoolQ Dataset: 

The BoolQ dataset is designed to train and test large language models. It consists of two JSONL files with several thousand examples each.<br> The files are:

**train.jsonl**: 9427 labeled training examples<br>
**dev.jsonl**: 3270 labeled development examples<br>

Both files are formatted similarly, with a question, a title, an answer, and a corresponding passage. For example, Question #8 in the dev.jsonl file:

| Question | Title | Answer | Passage |
| -------- | -------- | -------- | -------- |
| Can an odd number be divided by an even number? | Parity (mathematics) | True | In mathematics, parity is the property of an integer's inclusion in one of two categories: even or odd. An integer is even if it is evenly divisible by two and odd if it is not even. For example, 6 is even because there is no remainder when dividing it by 2. By contrast, 3, 5, 7, 21 leave a remainder of 1 when divided by 2. Examples of even numbers include −4, 0, 82 and 178. In particular, zero is an even number. Some examples of odd numbers are −5, 3, 29, and 73. |
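
As a rough sketch of how such a file could be read, the snippet below parses JSONL text into Python dictionaries. It assumes the field names `question`, `title`, `answer`, and `passage` described above. The first record is abridged from the table above, the second record is invented for illustration, and `parse_boolq` is a hypothetical helper name.

```python
import json

# Two records in the assumed BoolQ layout; real records come from train.jsonl / dev.jsonl.
SAMPLE_JSONL = """\
{"question": "Can an odd number be divided by an even number?", "title": "Parity (mathematics)", "answer": true, "passage": "In mathematics, parity is the property of an integer's inclusion in one of two categories: even or odd."}
{"question": "Is zero an odd number?", "title": "Parity (mathematics)", "answer": false, "passage": "In particular, zero is an even number."}
"""


def parse_boolq(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


examples = parse_boolq(SAMPLE_JSONL.splitlines())
for ex in examples:
    # JSON true/false become Python True/False.
    print(ex["question"], "->", ex["answer"])
```

To read an actual file, the same function can be given an open file handle, for example `parse_boolq(open("dev.jsonl", encoding="utf-8"))`.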



#### Perfect Calibration

As mentioned before, we consider a model perfectly calibrated if, whenever it states a confidence of $x$, its answer is correct with probability $x$. For example, among all answers given with 80% confidence, 80% should be correct.

Given a pair $(X, Y)$, where

$$ Y = \left\{
\begin{array}{ll}
      1 & \mbox{if the answer is correct}  \\
      0 & \mbox{if the answer is wrong} \\
\end{array} 
\right.  $$

and

$X$ = stated confidence $\in [0,1]$,

the model is perfectly calibrated if

$\mathbb{P}(Y = 1 \mid X = x) = x \;\;\forall x \in [0,1]$



Note: This is our definition of *perfect* calibration. As you will see, Llama 3 is anything but.

### Measuring Error: The Brier Score

We cannot expect our model to be perfectly calibrated at all times, so we need a way to measure how close it is to perfect. For this, we used the Brier score to quantify how close our model was to being perfect for a given run. The Brier score is essentially a Mean Squared Error adapted for boolean outcomes (1 or 0). It is defined as:


Population Brier Score:

$B_p =\mathbb{E} ( (Y - X)^2)$

Sample Brier Score:

$B_s = \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i)^2$

Where,

- $ N $ is the number of predictions
- $ x_i $ is the predicted probability for the $i^{th}$ instance
- $ y_i $ is the actual outcome for the $i^{th}$ instance (1 if the answer was correct, 0 if it was not)

##### Example of a Brier Score:

Let's say we are given this dataset:

| Question    | Score | Confidence  |
| -------- | ------- | ------- |
| 1  | 1 | 90%    |
| 2 | 0 | 50%     |
| 3     | 1 |  70%   |


Then our values would be:


$N = 3$

$x_1 = 0.9$ <BR>
$x_2 = 0.5$ <BR>
$x_3 = 0.7$<BR>

$y_1 = 1$ <BR>
$y_2 = 0$ <BR>
$y_3 = 1$<BR>



Then our sample Brier Score would be:

$B_s = \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i)^2 $<BR><BR>
$= \frac{(1 - 0.9)^2 + (0 - 0.5)^2 + (1 - 0.7)^2}{3} $<BR><BR>
$= \frac{(0.1)^2 + (-0.5)^2 + (0.3)^2}{3} $<BR><BR>
$= \frac{ 0.01 + 0.25 + 0.09}{3}$ <BR><BR>
$= \frac{0.35}{3}$<BR><BR>
$\approx 0.1167$

Note that a lower score is better as this is the *error* of our model.
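
The sample Brier score is short enough to compute directly. The following is a minimal Python sketch of the formula for $B_s$, applied to the three-question dataset above; `brier_score` is a name chosen for this illustration.

```python
def brier_score(confidences, outcomes):
    """Sample Brier score: mean squared difference between confidence and outcome."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    n = len(confidences)
    return sum((y - x) ** 2 for x, y in zip(confidences, outcomes)) / n


# x_i are the stated confidences, y_i are 1 (correct) or 0 (wrong).
x = [0.9, 0.5, 0.7]
y = [1, 0, 1]
print(round(brier_score(x, y), 4))
```

A model that is always fully confident and always right scores 0, while one that is always fully confident and always wrong scores 1.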


### Llama 3:

Llama 3 is a Large Language Model trained to simulate conversation between an Assistant and a User. We can provide Llama with a system prompt to define its goals and specify how we want its output to look. Following the system prompt, we can provide it with a passage and the corresponding question from our BoolQ dataset. 

*Example System Prompt with special tokens included: (From Meta)*

><|begin_of_text|><|start_header_id|>system<|end_header_id|>
>
>You are a helpful AI assistant for travel tips and recommendations<|eot_id|><|start_header_id|>user<|end_header_id|>
>
>What can you help me with?<|eot_id|><|start_header_id|>assistant<|end_header_id|>


##### Note: The odd formatting is how we communicate directions to the model. See explanation below:


>**<|begin_of_text|>:** Specifies the start of the prompt <br>
>
>**<|start_header_id|>system<|end_header_id|>:** Specifies the role for the following message, i.e. “system”<br>
>
>**You are a helpful AI assistant for travel tips and recommendations:** The system prompt<br>
>
>**<|eot_id|>:** Specifies the end of the input message<br>
>
>**<|start_header_id|>user<|end_header_id|>:** Specifies the role for the following message i.e. “user”<br>
>
>**What can you help me with?:** The user message<br>
>
>**<|start_header_id|>assistant<|end_header_id|>:** Ends with the assistant header, to prompt the model to start generation.<br>

More documentation on this can be found here: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
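
The special tokens can be assembled programmatically. Below is a minimal Python sketch for a single-turn prompt that follows the layout of Meta's example; `build_llama3_prompt` is a hypothetical helper name, and the blank line after each header is written as two newline characters.

```python
def build_llama3_prompt(system_prompt, user_message):
    """Assemble a single-turn Llama 3 prompt that ends with the assistant header."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt(
    "You are a helpful AI assistant for travel tips and recommendations",
    "What can you help me with?",
)
print(prompt)
```

In our setting, the user message would carry a BoolQ passage and its question instead of the travel example.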

### Definitions

#### Confidence Score

Throughout this project, we attempted both probabilistic confidence scores and stated confidence scores. We based both methods on prior conventions.

##### Probabilistic Confidence Score

We based the probabilistic confidence score on the same logic as the Probability of Precipitation (PoP) used by the National Weather Service. This is a probability where a given score indicates the likelihood that something will happen. In our case, a perfectly calibrated model should be correct 80% of the times it states it is 80% confident.

An example output would look like:<BR>
**Assistant: Yes, 90%**

More info on PoP:<br>
https://www.weather.gov/media/pah/WeatherEducation/pop.pdf


##### Stated Confidence Score

Later in our research, we wanted to see if a stated confidence score would lead to better results. The motivation was that the model may naturally have a better understanding of a stated word than of a numerical score. We borrowed the idea of Words of Estimative Probability (WEPs) from the intelligence community. We used several different phrases to represent different confidence levels:


| Very Uncertain | Somewhat Uncertain | Moderately Certain | Fairly Certain | Very Certain |
|-|-|-|-|-|
| 50% | 60% | 70% | 80% | 90% |



For more documentation on the CIA's Words of Estimative Probabilities, see:
https://www.cia.gov/resources/csi/static/Words-of-Estimative-Probability.pdf
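
To score a stated confidence, each phrase has to be turned back into a number. The sketch below uses the mapping from the table above; the response format (`"Yes, Fairly Certain"`) and the helper name `wep_to_probability` are assumptions made for this illustration.

```python
# Phrase-to-probability mapping taken from the table above.
WEP_TO_PROBABILITY = {
    "very uncertain": 0.5,
    "somewhat uncertain": 0.6,
    "moderately certain": 0.7,
    "fairly certain": 0.8,
    "very certain": 0.9,
}


def wep_to_probability(response):
    """Return the probability for the first scale phrase found in `response`, or None."""
    text = response.lower()
    for phrase, probability in WEP_TO_PROBABILITY.items():
        if phrase in text:
            return probability
    return None


print(wep_to_probability("Yes, Fairly Certain"))
print(wep_to_probability("No, Very Uncertain"))
```

Returning `None` for an unrecognized phrase makes it easy to count responses that did not follow the requested format.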

#### Logit-Probability

When the model is given an input, it runs that input through several layers and outputs a list of *logits*. These are unnormalized scores, one assigned to each *token*. We can then run these logits through the *softmax* function to normalize them into probabilities between 0% and 100%, with the sum of all probabilities equalling 100%. These probabilities reflect how likely the model believes each token is to come next. 

For example, suppose a model has only three tokens: "Yes", "No", and "Maybe". Then a conversation may look like this:

**User**: Does it tend to rain in the fall?

Based on this question, the model may create this table of values:

| Token | Logit | Probability |
|-------|-------|-------------|
| Yes   | 8 | 87.6% |
| Maybe | 6 | 11.8% |
| No | 3 | 0.6% |

As "Yes" is the token with the highest probability, the model would output:

**Assistant**: Yes

This table is usually not shown to the user; however, in our case it was useful to examine. Although inherently different from the confidence score discussed earlier, these "Logit-Probability" scores can give us insight into the model's thinking, so it is helpful to examine them as well. In our data collection, we chose to omit the raw logit scores and only show the normalized probabilities, as the logits were not pertinent to what we were investigating. 

To get a value for a probability score for a given output we take the probability of the answered token and divide it by the sum of the probabilities of the 'Yes' and 'No' tokens. 

In our example:<BR><BR>
<span style="font-size: 1.25em;">
$P_{answer} = \frac{P_{yes}}{P_{yes} + P_{no}} $ <BR><BR>
$= \frac{0.876}{0.876 + 0.006}$ <BR><BR>
$= \frac{0.876}{0.882} = 0.993$ <BR>
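
The same renormalization can be written as a small function. This is a minimal sketch of the formula above; `answer_probability` is a name chosen for this illustration.

```python
def answer_probability(p_yes, p_no, answer):
    """Probability of the chosen answer, renormalized over the 'Yes' and 'No' tokens only."""
    if answer not in ("Yes", "No"):
        raise ValueError("answer must be 'Yes' or 'No'")
    total = p_yes + p_no
    if total <= 0:
        raise ValueError("p_yes + p_no must be positive")
    chosen = p_yes if answer == "Yes" else p_no
    return chosen / total


# Probabilities of the 'Yes' and 'No' tokens from the example table.
print(round(answer_probability(0.876, 0.006, "Yes"), 3))
```

Dividing by the sum of only the two answer tokens removes the probability mass assigned to every other token, such as "Maybe" in the example.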


See below for a more in-depth explanation of *Tokens* and the *Softmax Function*.


#### Chain of Thought Prompting

Chain of Thought (CoT) prompting is a technique used to improve the reasoning and output quality of Large Language Models (LLMs) like GPT-3, GPT-4, and others. The idea is to prompt the model to "think" through a problem step by step, just like a human would. By breaking down the problem into smaller parts and solving each part sequentially, the model can provide more accurate and coherent responses. By using CoT, we can also gain insight into *how* the model reached its conclusion, rather than seeing only the answer.
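
As an illustration only, a CoT setup can be sketched as an instruction that asks for reasoning first, plus a small parser that separates the reasoning from the final answer. The instruction wording, the `Answer:` line convention, the example response, and the helper name are all assumptions for this sketch, not the prompts used in this project.

```python
# Hypothetical instruction asking the model to reason before answering.
COT_INSTRUCTION = (
    "Read the passage and the question. "
    "First explain your reasoning step by step. "
    "Then, on the final line, write 'Answer: Yes' or 'Answer: No'."
)


def split_reasoning_and_answer(response):
    """Split a CoT response into (reasoning, answer); answer is None without a final 'Answer:' line."""
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    if lines and lines[-1].lower().startswith("answer:"):
        answer = lines[-1].split(":", 1)[1].strip()
        return "\n".join(lines[:-1]), answer
    return "\n".join(lines), None


# Made-up model output, used only to exercise the parser.
example_response = """An odd number is an integer, and an integer can be divided by any nonzero even number.
The result need not be a whole number, but the division is defined.
Answer: Yes"""

reasoning, answer = split_reasoning_and_answer(example_response)
print(answer)
print(reasoning)
```

Keeping the reasoning text alongside the answer is what lets us inspect how the model arrived at its conclusion.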

#### Tokens

Tokens are words or word fragments that the model has been pretrained to have embeddings for. These embeddings are high-dimensional vectors that help the model gain a sense of word association.

Example:

"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration."<BR>
-(Attention Is All You Need, Vaswani et al.)<BR>
https://arxiv.org/abs/1706.03762

Tokenization:

<span style="color: red; font-size: 1.5em;"> The <span style="color: blue;"> dom<span style="color: green;">inant <span style="color: orange;">seq<span style="color: blue;">uence <span style="color: red;">trans<span style="color: purple;">duc<span style="color: orange;">tion <span style="color: blue;">model<span style="color: orange;">s <span style="color: green;">are <span style="color: red;">base<span style="color: blue;">d <span style="color: orange;">on <span style="color: black;">...

Notice how at times the token includes the entire word and at times it is only a word fragment. Different models and tokenizers can be trained to split words differently with some tokenizers even assigning multiple words to one token.
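
To show how one sentence can break into whole words and fragments, here is a toy Python tokenizer that uses greedy longest-match against a small hand-picked vocabulary. Both the vocabulary and the matching rule are simplifications chosen to reproduce the split shown above; real tokenizers, including Llama 3's, learn their vocabulary from data and generally split text differently.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization; uncovered characters become single-character tokens."""
    tokens = []
    max_len = max(len(t) for t in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Hand-picked toy vocabulary (an assumption for this example).
TOY_VOCAB = {
    "The", " dom", "inant", " seq", "uence", " trans", "duc", "tion",
    " model", "s", " are", " base", "d", " on",
}

text = "The dominant sequence transduction models are based on"
tokens = tokenize(text, TOY_VOCAB)
print(tokens)
```

Joining the tokens back together recovers the original text, which is why the leading spaces are kept as part of the tokens.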


For more see Grant Sanderson's video on this:

https://youtu.be/wjZofJX0v4M?si=HeZizJeOrIBvCPko&t=748


##### Softmax

The Softmax function is defined as:<BR>
<span style="font-size: 2em;">
$\sigma (z_i) = \frac{  e^{z_i} }{\sum_{j=1}^n e^{z_j}}$<BR>
<span style="font-size: 0.5em;">
Given logit vector $z = [8, 6, 3] $ from our example:<BR>
<span style="font-size: 1.5em;">
$\sum_{j=1}^n e^{z_j} = e^8 + e^6 + e^3 = 3404$<BR>

Thus,<BR>
<span style="font-size: 1.5em;">
$\sigma (z_1) = \sigma (8) = \frac{  e^{8} }{\sum_{j=1}^n e^{z_j}} = \frac{2981}{3404} = 0.876$<BR><BR>
<span style="font-size: 1em;">
$\sigma (z_2) = \sigma (6) =\frac{  e^{6} }{\sum_{j=1}^n e^{z_j}} = \frac{403}{3404} = 0.118$<BR><BR>
<span style="font-size: 1em;">
$\sigma (z_3) = \sigma (3) =\frac{  e^{3} }{\sum_{j=1}^n e^{z_j}} = \frac{20}{3404} = 0.006$<BR><BR>


For simplicity, we rounded to whole numbers in the intermediate step.
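
The same calculation can be done without intermediate rounding. Below is a minimal Python sketch of the softmax function applied to the logit vector from the example; subtracting the largest logit before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math


def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits for "Yes", "Maybe", and "No" from the example.
probs = softmax([8, 6, 3])
print([round(p, 3) for p in probs])
print(sum(probs))
```

The printed probabilities add up to 1 (up to floating-point error), and the token with the largest logit also receives the largest probability.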

For more see Grant Sanderson's video on this and how *temperature* can impact output:

https://youtu.be/wjZofJX0v4M?si=8Ex7TbUwsJBtVIrG&t=1342

# Outcomes

## Results

We wanted to see how accurate these confidence scores were. To do this, we plotted the confidence levels against the average score of the given interval. We compared that to the 45 degree line which represents a perfectly calibrated result. We also plotted a histogram of the distribution of the given confidence level. We did the same with the Logit-Probability Scores to better understand the behavior of the model.
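
The binning behind such a plot can be sketched as follows. This is a minimal illustration that assumes equal-width confidence bins; the bin count, the toy data, and the helper name `calibration_table` are assumptions made for this example, not necessarily what produced the figures below.

```python
def calibration_table(confidences, outcomes, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one tuple per non-empty bin:
    (bin_lower, bin_upper, count, mean_confidence, proportion_correct).
    """
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(confidences, outcomes):
        index = min(int(x * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the top bin
        bins[index].append((x, y))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue
        count = len(members)
        mean_conf = sum(x for x, _ in members) / count
        prop_correct = sum(y for _, y in members) / count
        rows.append((index / n_bins, (index + 1) / n_bins, count, mean_conf, prop_correct))
    return rows


# Made-up data: three high-confidence answers (two correct) and one coin-flip answer (wrong).
rows = calibration_table([0.95, 0.92, 0.91, 0.55], [1, 0, 1, 0])
for row in rows:
    print(row)
```

For a perfectly calibrated model, the proportion correct in each bin would match the mean confidence of that bin, which is the 45 degree line. The per-bin counts correspond to the histogram.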

### Original Results:

Stated Confidence            |  Logit-Probability
:-------------------------:|:-------------------------:
![logo](Results/V1/StatedConfidenceVSProportionCorrectV1.png) |  ![logo](Results/V1/LogitProbabilityVSProportionCorrectV1.png)



We eventually were able to improve on these results by using several techniques. By combining the use of n-shot and Chain of Thought prompting as well as using WEPs we were able to greatly reduce the Brier Score of the Stated Confidence. 

### Final Version:

Stated Confidence            |  Logit-Probability
:-------------------------:|:-------------------------:
![logo](Results/V6/StatedConfidenceVSProportionCorrectV6.png) |  ![logo](Results/V6/LogitProbabilityVSProportionCorrectV6.png)

We also compared the Brier scores across our versions. For each version, we included the Brier Score of the Stated Confidence and the Logit-Probability. In later versions, we used techniques that were more computationally intensive so we decreased the sample size. This led to larger confidence intervals as seen below.

### Brier Scores Across Versions:

![img](Results/BS_All.png)
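
To illustrate why a smaller sample widens the interval, the sketch below computes a Brier score together with a normal-approximation confidence interval. The interval method, the default `z=1.96` (roughly 95%), and the toy data are assumptions for this example; the figure above may have been produced differently.

```python
import math


def brier_with_interval(confidences, outcomes, z=1.96):
    """Sample Brier score with a normal-approximation confidence interval."""
    errors = [(y - x) ** 2 for x, y in zip(confidences, outcomes)]
    n = len(errors)
    if n < 2:
        raise ValueError("need at least two predictions to estimate a standard error")
    mean = sum(errors) / n
    variance = sum((e - mean) ** 2 for e in errors) / (n - 1)  # sample variance
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width


# Made-up data repeated to simulate a small and a large sample with the same score.
x, y = [0.9, 0.5, 0.7], [1, 0, 1]
small = brier_with_interval(x * 4, y * 4)      # 12 predictions
large = brier_with_interval(x * 100, y * 100)  # 300 predictions
print(small)
print(large)
```

Both samples have the same Brier score, but the interval for the smaller sample is wider because the standard error shrinks with the square root of the sample size.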

## Discussion

## Findings

Analyzing our results, we found that the model tended to be overconfident. Whether it was through probabilistic or stated confidence scores, the model tended to prefer the strongest score whether or not it was appropriate. However, its logit-probability scores tended to have a lower standard deviation.

Comparing the Brier scores across our versions, we found that the WEP technique used in Versions 5 and 6 had the best results. We saw significant improvements, with Version 6's Brier Score being 46% lower than that of our original version. The logit-probability scores stayed mostly even across versions because our prompting focused on the stated confidence score, not the underlying weights that determine the logit scores.

Versions 5 and 6 used both WEP confidence scores and Chain of Thought (CoT) prompting. CoT requires the model to give its reasoning before it provides an answer, which we believe leads to better calibration. The shift to using CoT before the output rather than after also led to a noticeable improvement. Versions 5 and 6 also provided the model with examples of what a conversation should look like, which helped the model format better responses.

In our testing, we found that including WEPs with only positive sentiment caused the model to be drastically biased towards answering “Yes” to each question. This suggests that the model is very sensitive to the specific phrases used to measure confidence.

## Next Steps

Seeing how sensitive the model was to the format and phrasing of the WEPs, we would like to investigate this method further. There are multiple scales used worldwide by various organizations to measure confidence. Ideally, we would want a scale that is as neutral as possible in sentiment so as not to bias the model. Additionally, we would like to use Retrieval-Augmented Generation (RAG) to improve the model's reasoning skills. RAG involves connecting the model to a large database like Wikipedia and having it find the answer on its own. Currently, BoolQ provides the answer in the passage. By using RAG, we hope to see how the model behaves with more open-ended questions.

Furthermore, while working on this project, we stuck to Llama 3-8B. We chose this model, with only 8 billion parameters, over the 70 billion parameter one largely for computational reasons. Even with the 8B model, each output took around 30-45 seconds, which led to a compute time of several hours to test each version.

Recently, Meta has published Llama 3.1 with 8B, 70B, and 405B parameters. It would be interesting to see how this newer, larger model compares to the one we currently are using. To do this, we would want to refine our methodology to lower compute time. Finally, we would like to investigate how fine-tuning the model would impact the calibration. Ideally, we would want the model to output an appropriate confidence level and have that confidence level correspond to its Logit-Probability score.