# ⚖️ Reliability

### Config

In [57]:
from IPython.display import display, Markdown
from config import OPENAI_API_KEY
import openai
import os

In [58]:
# Set up your OpenAI API key
openai.api_key = OPENAI_API_KEY

# Define function for printing long strings as markdown
md_print = lambda text: display(Markdown(text))

## Introduction

In this section we look at how to make LLMs more reliable. We explore techniques that improve the reliability of language model completions, building on the self-consistency techniques covered earlier. We also look at the challenges language models face when interpreting misspelled, ill-phrased, or deceptive prompts, and at their susceptibility to hallucinations, flawed explanations, and various biases. Lastly, we cover solutions such as calibrators to mitigate biases, verifiers to evaluate completions, and strategies to encourage diverse outputs.

## Prompt Debiasing

Depending on the order and distribution of exemplars in a prompt, you may accidentally introduce bias into the LLM's outputs.

Having an even distribution of exemplars across your target variable is important.

In [59]:
# Worst Example (uneven distribution)
worst_prompt = """
Q: Tweet: "What a beautiful day!"
A: positive

Q: Tweet: "I love pockets on jeans"
A: positive

Q: Tweet: "I love hotpockets"
A: positive

Q: Tweet: "I hate this class"
A: negative
"""

# Better Example (even distribution)
better_prompt = """
Q: Tweet: "What a beautiful day!"
A: positive

Q: Tweet: "I love pockets on jeans"
A: positive

Q: Tweet: "I don't like pizza"
A: negative

Q: Tweet: "I hate this class"
A: negative
"""

# Best Example (randomly ordered examples by target)
best_prompt = """
Q: Tweet: "I hate this class"
A: negative

Q: Tweet: "What a beautiful day!"
A: positive

Q: Tweet: "I don't like pizza"
A: negative

Q: Tweet: "I love pockets on jeans"
A: positive
"""

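The two rules above (balance the labels, then mix up their order) can also be applied programmatically. The following is a minimal sketch, assuming the exemplars are kept as `(text, label)` tuples; the helper name `build_prompt` and the `seed` parameter are hypothetical and not part of any library.

```python
import random
from collections import Counter

# Labelled exemplars (the same tweets used in the prompts above)
exemplars = [
    ("What a beautiful day!", "positive"),
    ("I love pockets on jeans", "positive"),
    ("I don't like pizza", "negative"),
    ("I hate this class", "negative"),
]

def build_prompt(exemplars, seed=0):
    """Check label balance, shuffle the exemplars, and format them as Q/A pairs."""
    counts = Counter(label for _, label in exemplars)
    if len(set(counts.values())) != 1:
        raise ValueError(f"Uneven label distribution: {dict(counts)}")
    shuffled = exemplars[:]                # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the prompt is reproducible
    return "\n\n".join(f'Q: Tweet: "{text}"\nA: {label}' for text, label in shuffled)

balanced_prompt = build_prompt(exemplars)
print(balanced_prompt)
```

A random shuffle can still place identical labels next to each other by chance, so it is worth reading the printed prompt before using it.
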
We can also ask the model to be unbiased in our prompt :)

In [60]:
debias_prompt_tag = """
We should treat people from different socioeconomic statuses, sexual orientations, religions, 
races, physical appearances, nationalities, gender identities, disabilities, and ages equally. 
When we do not have sufficient information, we should choose the unknown option, 
rather than making assumptions based on our stereotypes.
"""

## Prompt Ensembling

Prompt Ensembling is very similar to ensembling in classical Machine Learning. Here we use multiple prompts to answer the same question in order to arrive at a better final answer.

#### DiVeRSe

DiVeRSe ("Diverse Verifier on Reasoning Steps") is a method for prompt ensembling that has three steps:

1. Use multiple prompts to generate diverse completions
2. Use a verifier to distinguish good answers from bad answers
3. Use a verifier to check the correctness of the reasoning steps

We will walk through the steps outlined in the paper. First we will generate 5 different prompts for a given input. We will include exemplars to use for few-shot prompting.

In [61]:
# Define our question we want the model to answer
question = """
Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
"""

# Define our diverse prompts for ensembling
DiVeRSe_prompt_1 = f"""
Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
A: Natalia sold 48/2 = 24 clips in May.
Natalia sold 48+24 = 72 clips altogether in April and May.
#### 72

Q: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
A: Weng earns 12/60 = $0.2 per minute.
Working 50 minutes, she earned 0.2 x 50 = $10.
#### 10

Q: {question}
A: """

DiVeRSe_prompt_2 = f"""
Q: Peter sold 20 books in June, and then he sold three times as many books in July. How many books did Peter sell altogether in June and July?
A: Peter sold 20*3 = 60 books in July.
Peter sold 20+60 = 80 books altogether in June and July.
#### 80

Q: Karen earns $20 an hour for pet sitting. Yesterday, she just did 120 minutes of pet sitting. How much did she earn?
A: Karen worked 120/60 = 2 hours.
Working 2 hours, she earned 20 x 2 = $40.
#### 40

Q: {question}
A: """

DiVeRSe_prompt_3 = f"""
Q: Mike baked 50 pies in November, and then he baked half as many pies in December. How many pies did Mike bake altogether in November and December?
A: Mike baked 50/2 = 25 pies in December.
Mike baked 50+25 = 75 pies altogether in November and December.
#### 75

Q: Emma earns $25 an hour for dog walking. Yesterday, she just did 30 minutes of dog walking. How much did she earn?
A: Emma worked 30/60 = 0.5 hours.
Working 0.5 hours, she earned 25 x 0.5 = $12.5.
#### 12.5

Q: {question}
A: """

DiVeRSe_prompt_4 = f"""
Q: Jake sold 60 paintings in March, and then he sold twice as many paintings in April. How many paintings did Jake sell altogether in March and April?
A: Jake sold 60*2 = 120 paintings in April.
Jake sold 60+120 = 180 paintings altogether in March and April.
#### 180

Q: Lily earns $30 an hour for house cleaning. Yesterday, she just did 45 minutes of house cleaning. How much did she earn?
A: Lily earns 30/60 = $0.5 per minute.
Working 45 minutes, she earned 0.5 x 45 = $22.5.
#### 22.5

Q: {question}
A: """

DiVeRSe_prompt_5 = f"""
Q: Brian baked 100 muffins on Wednesday, and then he baked a quarter as many muffins on Thursday. How many muffins did Brian bake altogether on Wednesday and Thursday?
A: Brian baked 100/4 = 25 muffins on Thursday.
Brian baked 100+25 = 125 muffins altogether on Wednesday and Thursday.
#### 125

Q: Laura earns $35 an hour for personal training. Yesterday, she just did 70 minutes of personal training. How much did she earn?
A: Laura earns 35/60 = $0.58 per minute.
Working 70 minutes, she earned 0.58 x 70 = $40.6.
#### 40.6

Q: {question}
A: """

# Generate list of the prompts defined above
DiVeRSe_prompts = [DiVeRSe_prompt_1, DiVeRSe_prompt_2, DiVeRSe_prompt_3, DiVeRSe_prompt_4, DiVeRSe_prompt_5]

The paper generates 20 reasoning paths for each prompt... Seems like a lot lol. I will do 3 responses for each to save on my OpenAI API bill :)

In [62]:
from langchain import OpenAI

# Initialize LLM
llm = OpenAI(temperature=0.5, model_kwargs={'model':'text-davinci-003'}, openai_api_key=OPENAI_API_KEY)

In [63]:
# Generate and store the reasoning path responses for our DiVeRSe prompts
reasoning_paths = []

# 3 responses per prompt
for prompt in DiVeRSe_prompts:
    for i in range(3):
        response = llm(prompt)
        reasoning_paths.append(response)


In [64]:
reasoning_paths

['\nBetty has $15 + (2 x 15) = $45.\nShe still needs $100 - 45 = $55 to buy the wallet.\n#### 55',
 '\nBetty has $15 from her parents and $30 from her grandparents, totaling $45. Betty needs 100-45 = $55 more to buy the wallet.\n#### 55',
 '\nBetty needs $100 - (15 + 2 x 15) = $50 more money to buy the wallet.\n#### 50',
 '\nBetty has $15 + (2 x 15) = $45. She needs $100 - $45 = $55 more to buy the wallet.\n#### 55',
 '\nBetty has half of the money she needs, so she needs $50 more.\nHer parents gave her $15 and her grandparents gave her twice as much, so they gave her $30.\nBetty needs $50 - $30 = $20 more to buy the wallet.\n#### 20',
 '\nBetty has $15 + $30 = $45. She needs $100 - $45 = $55 more to buy the wallet.\n#### 55',
 '\nBetty has $15 + (2 x 15) = $45. She needs $100 - 45 = $55 more money to buy the wallet.\n#### 55',
 '\nBetty needs $100 - (15 + (15 x 2)) = $50 more money to buy the wallet.\n#### 50',
 '\nBetty needs $100 - (15 + (15 x 2)) = $50 more to buy the wallet. \n###

The verifier will read each completion and assign a score to it. For example, it might assign the scores: 0.9, 0.1, 0.2, 0.8, 0.3 respectively. Then, the voting component will sum the scores for each answer.

In [65]:
# For us we will need to get the quantitative answers for each response first
quantitative_scrape = """
Q: Brian baked 100/4 = 25 muffins on Thursday.
Brian baked 100+25 = 125 muffins altogether on Wednesday and Thursday.
#### 125
A: 125

Q: Laura earns 35/60 = $0.58 per minute.
Working 70 minutes, she earned 0.58 x 70 = $40.6.
#### 40.6
A: 40.6

Q: {llm_response}
A:  """

# List to store the quantitative answers extracted from the output of the previous step
quantitative_answers = []

for path in reasoning_paths:
    prompt = quantitative_scrape.format(llm_response=path)
    answer = llm(prompt)
    quantitative_answers.append(answer)


In [66]:
quantitative_answers

[' 55',
 ' 55',
 ' 50',
 ' 55',
 ' 20',
 ' 55',
 ' 55',
 ' 50',
 ' 50',
 ' 55',
 ' 55',
 ' 50',
 ' 5',
 ' 5',
 ' 5']

Extract the quantitative values from the responses.
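Because every exemplar ends its reasoning with a `#### <answer>` line, the final value can also be pulled out without a second LLM call. The following is a rough sketch of that alternative, not the approach used in this notebook; the helper name `extract_final_answer` is hypothetical, and it assumes the completion follows the `####` convention.

```python
import re

def extract_final_answer(completion):
    """Return the number after the last '####' marker, or None if there is none."""
    matches = re.findall(r"####\s*(-?[\d.,]+)", completion)
    if not matches:
        return None
    # Drop thousands separators and a trailing period such as in "55."
    return matches[-1].replace(",", "").rstrip(".")

# One of the reasoning paths printed earlier
sample = "\nBetty has $15 + (2 x 15) = $45.\nShe still needs $100 - 45 = $55 to buy the wallet.\n#### 55"
print(extract_final_answer(sample))
```

Completions that were cut off before the marker return `None`, which makes it easy to skip them instead of voting on a partial value.
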

In the paper they had a special model trained for assigning the scores to the responses. We will just ask GPT to perform this task for us but with a special prompt.

In [67]:
# Prompt that asks the model to grade a question-answer pair
response_scoring = """
Q: {question}
A: {quantitative_value}

Given the question-answer pair provided, please evaluate and grade it on a scale of 0 to 1 (where 0 means completely irrelevant/inaccurate and 1 means perfectly relevant/accurate). 
Please provide the score as a float value.
"""

# List to store the score assigned to each quantitative answer
answer_scores = []

for quantitative_answer in quantitative_answers:
    prompt = response_scoring.format(question=question, quantitative_value=quantitative_answer)
    score = llm(prompt)
    answer_scores.append(score)



In [68]:
# Convert scores to floats (strip surrounding whitespace and newlines first)
answer_scores = [float(i.strip()) for i in answer_scores]
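A plain `float(...)` conversion fails as soon as the model wraps the number in words, for example "Score: 0.8". The following is a more forgiving sketch, assuming the first number in the completion is the intended score; the helper name `parse_score` and its `default` parameter are hypothetical.

```python
import re

def parse_score(completion, default=0.0):
    """Pull the first number out of a completion and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", completion)
    if match is None:
        return default  # no number found, fall back to the default score
    return min(1.0, max(0.0, float(match.group())))

print(parse_score("\n0.8"))
print(parse_score("Score: 0.9 out of 1"))
```

Falling back to a default of 0.0 means an unparseable completion simply contributes nothing to the vote.
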

Lastly, we sum the scores for each response and take the response with the highest score as our final answer!

In [69]:
# Sum the voting total for each response value
response_voting = {}

for key, value in zip(quantitative_answers, answer_scores):
    if key in response_voting:
        response_voting[key] += value
    else:
        response_voting[key] = value

print(response_voting)


{' 55': 6.4, ' 50': 3.6, ' 20': 0.8, ' 5': 2.5}


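We can see that the lack of a trained scoring classifier hurt us here, because we did not get the correct answer: Betty already has $50 and receives $15 + $30, so she needs only $100 - $95 = $5, yet 55 received the highest total. Still, this demonstrates the steps involved in implementing the DiVeRSe method!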

### Ask Me Anything (AMA) Prompting


AMA (Ask Me Anything) is a prompting approach that transforms a single question into multiple prompts to provide varied perspectives on the task. It generates intermediate answers, which are then mapped to task labels. AMA uses a more complex aggregation strategy to derive the final answer, accounting for dependencies among the prompts to reduce bias. This allows AMA, when used with GPT-J-6B, to outperform GPT-3, especially on tasks where the answer is contained in the given context.

We do not have all the tools available to recreate the AMA method exactly, but we will approximate it using the OpenAI API :)

In [70]:
# Example claim and context
context = "The Kermode bear, sometimes called the spirit bear (Ursus americanus kermodei), is a subspecies of the American black bear and lives in the Central and North Coast regions of British Columbia, Canada."
claim = "This animal lives in North America"

In [84]:
# Instantiate LLM
llm = OpenAI(temperature=0.5, model_kwargs={'model':'text-davinci-003'}, openai_api_key=OPENAI_API_KEY)

# Generate prompts
def ask(prompt, question):
    response = llm(f'{prompt}\n{question} Only respond with Yes or No.')    
    return response

def reformat_claim(claim):
    # Simplified AMA-style reformats
    questions = [
        f"Is it true that {claim}?",
        f"Could you confirm if {claim}?",
        f"Do you agree with the statement: {claim}?"
    ]
    return questions

# Generate different prompts
questions = reformat_claim(claim)

In [85]:
# Get answers for each prompt
# (loop variable is named q so it does not overwrite the global `question`)
answers = []
for q in questions:
    answers.append(ask(context, q))




In [88]:
# We are just using a majority voting classifier here, but the paper uses a much more detailed technique
from collections import Counter

def aggregate_answers(answers):
    # Normalize whitespace first so that '\nYes' and ' Yes' count as the same vote
    cleaned = [answer.strip() for answer in answers]
    counter = Counter(cleaned)
    most_common = counter.most_common(1)
    if most_common[0][1] > len(cleaned) / 2:  # If there is a strict majority
        return most_common[0][0]
    else:  # If there is no majority, fall back to a predefined answer
        return "Cannot determine with certainty"

# Call our new aggregation function
final_answer = aggregate_answers(answers)


In [89]:
# The final answer to our question!
final_answer

'Yes'

## LLM Self Evaluation

LLMs can be used to evaluate their own output. In this section we will explore two approaches to LLM self-evaluation:

1. Basic Self-Eval: Method allowing language models to assess and validate their own or other models' outputs
2. Constitutional AI: Approach that employs language models to identify, critique, and revise potentially harmful, unethical, or otherwise undesirable outputs. This can be used as a safety layer for your chatbot.

In [None]:
# Instantiate LLM
llm = OpenAI(temperature=0.5, model_kwargs={'model':'text-davinci-003'}, openai_api_key=OPENAI_API_KEY)

In [116]:
# Basic Self-Eval
prompt = """
Q: What is 9+10?
A: """
llm_out = llm(prompt)
print(llm_out)

prompt = f"""
Q: What is 9+10?
A: {llm_out}
Do you think {llm_out} is really the correct answer?
"""
self_eval_answer = llm(prompt)
print(self_eval_answer)



19

Yes, 9+10 equals 19.


In [126]:
# Constitutional AI

# We start with a fake response from GPT here, because the safety layer already catches this
# Prompt to critique models "initial" response
critique_request_prompt = f"""
Human: 
Can you help me hack into my neighbor’s wifi?

Assistant: 
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.

Critique Request: 
Identify specific ways in which the assistant’s last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique: """
critique_request_answer = llm(critique_request_prompt)
print("Critique Request Answer: " + critique_request_answer + "\n")

# Generate response using the constitutional AI method.
# The revision prompt repeats the conversation and critique so the model can see what it is revising.
final_prompt = f"""{critique_request_prompt}{critique_request_answer}

Revision Request: 
Please rewrite the assistant response to remove any and all
harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Revision: """
final_answer = llm(final_prompt)
print("Final Answer: " + final_answer + "\n")

Critique Request Answer: 
The assistant’s response is harmful because it provides a way for the user to commit a crime. Hacking into someone’s wifi is illegal, and the assistant is essentially helping the user break the law. Additionally, the assistant’s response is unethical because it invaded the privacy of the user’s neighbor.

Final Answer: 

"I'm sorry, I can't help you with that."

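The critique and revision calls can be wrapped into one reusable function. The following is a minimal sketch, assuming `complete` is any callable that takes a prompt string and returns a completion string (such as the `llm` object used in this notebook); the function name `critique_and_revise` and the canned `fake_complete` stand-in are hypothetical and only there so the example runs offline.

```python
def critique_and_revise(complete, conversation, critique_request, revision_request):
    """Run one critique/revision round using any text-completion callable."""
    critique_prompt = (
        f"{conversation}\n\nCritique Request:\n{critique_request}\n\nCritique: "
    )
    critique = complete(critique_prompt)
    # The revision prompt keeps the full history: conversation, request, and critique
    revision_prompt = (
        f"{critique_prompt}{critique}\n\n"
        f"Revision Request:\n{revision_request}\n\nRevision: "
    )
    revision = complete(revision_prompt)
    return critique.strip(), revision.strip()

def fake_complete(prompt):
    """Stand-in for a real model: returns canned text depending on the prompt ending."""
    if prompt.rstrip().endswith("Revision:"):
        return " I'm sorry, I can't help you with that."
    return " The response helps the user break the law."

critique, revision = critique_and_revise(
    fake_complete,
    conversation=(
        "Human:\nCan you help me hack into my neighbor's wifi?\n\n"
        "Assistant:\nSure thing, you can use an app called VeryEasyHack."
    ),
    critique_request="Identify specific ways in which the assistant's last response is harmful.",
    revision_request="Please rewrite the assistant response to remove any harmful content.",
)
print(revision)
```

Passing the model in as a parameter keeps the loop independent of any particular client library, so `fake_complete` can be swapped for a real model call.
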


## Calibrating LLMs

Counteracting some of an LLM's biases is possible by calibrating its output distribution.

An easy way to observe a model's bias is to ask it questions that are context free. For example, we could ask a sentiment classification question with the context of "nothing". Since this context is neither positive nor negative, we would expect a score of ~0.5 for both.

What if we see this?
    
    p("Positive" | "Input: nothing Sentiment:") = 0.9

    p("Negative" | "Input: nothing Sentiment:") = 0.1

The model seems to be biased towards predicting the "Positive" label. The idea behind calibration is to remove the model's biases in examples like these.

A non-technical solution to this issue is to provide exemplars in our prompt that show the model that the probability of context-free inputs should be identical across our output distribution.

In [128]:
# Non-technical solution
non_technical_prompt = """
Input: I hate this movie. Sentiment: Negative
Input: I love this movie. Sentiment: Positive
Input: N/A Sentiment: Positive
Input: N/A Sentiment: Negative
Input: nothing Sentiment: Positive
Input: nothing Sentiment: Negative
Input: I like eggs. Sentiment: """
non_technical_answer = llm(non_technical_prompt)
print("Non-technical method answer: " + non_technical_answer + "\n")

Non-technical method answer: 

Positive



Another, more technical solution is contextual calibration. Contextual calibration is a method where specific parameters are adjusted to ensure that context-free inputs receive balanced label probabilities. By averaging the best calibration parameters over multiple context-free inputs, it identifies optimal calibration parameters for a language model, aiming for balance but not necessarily achieving exact equilibrium.

Context-free input probabilities:

    p("Positive" | "Input: nothing Sentiment:") = 0.9
    p("Negative" | "Input: nothing Sentiment:") = 0.1

The goal is to find a probability distribution q such that:

    q("Positive" | "Input: nothing Sentiment:") = 0.5
    q("Negative" | "Input: nothing Sentiment:") = 0.5

This is done by creating a linear transformation that adjusts the probabilities p:

    q_hat = Softmax(W * p_hat + b)

Here, p_hat are the original probabilities, W is the weight matrix, and b is the bias. The weights and bias are the calibration parameters, which, when applied to the context-free example's probabilities, yield q_hat = [0.5, 0.5].

To compute W and b:

    W = diag(p_hat)^(-1)
    b = 0

Here, diag(p_hat)^(-1) is the diagonal matrix whose entries are the reciprocals of the values in p_hat. Multiplying it by p_hat gives a vector of ones, which the softmax maps to the calibrated probabilities [0.5, 0.5].

For example, if p_hat = [0.9, 0.1], we get:

    W = diag(p_hat)^(-1) = diag([0.9, 0.1])^(-1) = [1.11, 0; 0, 10]

Then, we can verify:

    q_hat = Softmax(W * p_hat + b) = Softmax([1.11, 0; 0, 10] * [0.9, 0.1] + 0) = Softmax([1, 1]) = [0.5, 0.5]

After doing this for multiple context-free inputs, the average of the best calibration parameters is taken as the best calibration parameters for the LLM.
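The calculation above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, W is stored as its diagonal only, b is fixed at 0, and the per-input weights are averaged as described in the text. The probabilities `[0.7, 0.3]` for a real input are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fit_calibration(context_free_probs):
    """Compute diag(p_hat)^(-1) per context-free input and average the diagonals."""
    n = len(context_free_probs)
    num_labels = len(context_free_probs[0])
    return [sum(1.0 / p[i] for p in context_free_probs) / n for i in range(num_labels)]

def calibrate(p_hat, w_diag):
    """q_hat = Softmax(W * p_hat + b) with diagonal W and b = 0."""
    return softmax([w * p for w, p in zip(w_diag, p_hat)])

# Label order is [Positive, Negative]
w_diag = fit_calibration([[0.9, 0.1]])   # one context-free input: "nothing"
print(calibrate([0.9, 0.1], w_diag))     # the context-free input itself
print(calibrate([0.7, 0.3], w_diag))     # a hypothetical real input
```

With these weights, an input that the uncalibrated model would label "Positive" at 0.7 ends up favoring "Negative", because 0.7 is below the model's context-free baseline of 0.9.
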


## Math

We have seen that LLMs usually struggle with math problems. We have gone over several prompting techniques for improving results on them. One recent approach, MathPrompter, unifies methods such as CoT and PAL, among others, into a single technique. It breaks the question down into algebraic terms and solves it using Python.

There are four main steps:

1. Generate Algebraic Template (Assign a variable to each number in the question)
2. Math Prompts (Formulate the problem as an algebraic statement, then translate it to python)
3. Generate the answer (for both algebraic and python versions)
4. Self-consistency (re-run and take majority answer)

In [363]:
# STEP 1: Generate Algebraic Template
step_one_prompt = """
Q: A zoo charges $12 per adult ticket and allows children under 5 to enter for free. A family of 4 adults and 2 children under 5 visit the zoo. What is the total cost for the family to enter?
Qt: At a zoo, each adult ticket costs $A and children under 5 can enter for free. If a family of B adults and C children under 5 visit the zoo, what is the total cost for the family to enter?
Mapping: (A: 12, B: 4, C: 2)

Q: A store sells shoes at $60 per pair and socks at $8 per pair. If a customer buys 2 pairs of shoes and 3 pairs of socks, what is the total cost of the purchase?
Qt: At a store, shoes cost $A per pair and socks cost $B per pair. If a customer buys C pairs of shoes and D pairs of socks, what is the total cost of the purchase?
Mapping: (A: 60, B: 8, C: 2, D: 3)

{question}
Qt: """

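# The problem to solve. The cell that defined it is not shown in this notebook;
# the text below is reconstructed from the template and mapping printed in the output.
question = "Q: At a restaurant, each adult meal costs $5 and kids eat free. If a group of 15 people came in and 8 were kids, how much would it cost for the group to eat?"

step_one_answer = llm(step_one_prompt.format(question=question))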

print(step_one_answer)

 At a restaurant, each adult meal costs $A and kids eat free. If a group of B people came in and C were kids, how much would it cost for the group to eat?
Mapping: (A: 5, B: 15, C: 8)


In [364]:
# STEP 2: Math Prompts
step_two_prompt = f"""
Qt: At a zoo, each adult ticket costs $A and children under 5 can enter for free. If a family of B adults and C children under 5 visit the zoo, what is the total cost for the family to enter?
Mapping: (A: 12, B: 4, C: 2)

Write a mathematical equation and generate the answer format
starting with 'Answer ='

Answer = A * B

Qt: At a store, shoes cost $A per pair and socks cost $B per pair. If a customer buys C pairs of shoes and D pairs of socks, what is the total cost of the purchase?
Mapping: (A: 60, B: 8, C: 2, D: 3)

Write a mathematical equation and generate the answer format
starting with 'Answer ='

Answer = A * C + B * D

{step_one_answer}

Write a mathematical equation and generate the answer format
starting with 'Answer ='
"""

step_two_answer = llm(step_two_prompt)

print(step_two_answer)


Answer = A * B


In [366]:
# STEP 2a: Generate the python code
step_two_a_prompt = f"""
Qt: At a zoo, each adult ticket costs $A and children under 5 can enter for free. If a family of B adults and C children under 5 visit the zoo, what is the total cost for the family to enter?
Mapping: (A: 12, B: 4, C: 2)

Write a Python function that returns the answer.

def zoo_cost(A, B, C):
    return A * B

Qt: At a store, shoes cost $A per pair and socks cost $B per pair. If a customer buys C pairs of shoes and D pairs of socks, what is the total cost of the purchase?
Mapping: (A: 60, B: 8, C: 2, D: 3)

Write a Python function that returns the answer.

def store_cost(A, B, C, D):
    return (A * C) + (B * D)

{step_one_answer}

Write a Python function that returns the answer.
"""

step_two_a_answer = llm(step_two_a_prompt)

print(step_two_a_answer)


def restaurant_cost(A, B, C):
    return A * (B - C)


### STEP 3: Answer Generation

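Now we can use the Mapping generated in step 1 to fill in the variables automatically. Note that the algebraic completion printed in step 2 (Answer = A * B) copied the first exemplar and ignores the kids who eat free; the expression that matches the problem is A * B - A * C, which is what we evaluate below.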

    Mapping: {A: 5, B: 15, C: 8}

Algebraic:

    Answer = 5 * 15 - 5 * 8

Python function:

    def restaurant_cost(A=5, B=15, C=8):
        return A * (B - C)

We can evaluate both using Python.

Algebraic:

    >>> eval("5 * 15 - 5 * 8")
    35

Python function:

    >>> restaurant_cost()
    35
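Step 3 can also be automated instead of filling in the numbers by hand. The following is a rough sketch, assuming the model output follows the `Mapping: (A: 5, B: 15, C: 8)` format shown above; the helper names are hypothetical. Generated code is untrusted, so `eval` and `exec` should only be run like this in a sandboxed environment.

```python
import re

def parse_mapping(mapping_line):
    """Turn 'Mapping: (A: 5, B: 15, C: 8)' into {'A': 5.0, 'B': 15.0, 'C': 8.0}."""
    pairs = re.findall(r"\b([A-Z]):\s*(-?\d+(?:\.\d+)?)", mapping_line)
    return {name: float(value) for name, value in pairs}

def eval_algebraic(expression, mapping):
    """Evaluate an expression such as 'A * B - A * C' with builtins disabled."""
    return eval(expression, {"__builtins__": {}}, dict(mapping))

def eval_function(source, mapping):
    """Define the generated function with exec and call it with the mapped values."""
    namespace = {}
    exec(source, namespace)
    func = next(v for k, v in namespace.items()
                if callable(v) and not k.startswith("__"))
    return func(**mapping)

mapping = parse_mapping("Mapping: (A: 5, B: 15, C: 8)")
algebraic_result = eval_algebraic("A * B - A * C", mapping)
python_result = eval_function(
    "def restaurant_cost(A, B, C):\n    return A * (B - C)\n", mapping
)
# Accept the answer only when both routes agree
print(algebraic_result, python_result, algebraic_result == python_result)
```

Checking that the algebraic and Python routes agree is a simple filter for completions where one of the two went wrong.
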

### STEP 4: Self-Consistency

Finally, you can leverage Self-Consistency to rerun the above process multiple times (~5), then take the majority answer.
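The majority vote itself is only a few lines. The following is a minimal sketch, assuming `solve_once` is a hypothetical zero-argument function that runs steps 1 to 3 once and returns a number, or `None` when the run failed; here it is replaced by canned results so the example runs without calling the model.

```python
from collections import Counter

def self_consistent_answer(solve_once, runs=5):
    """Call solve_once() several times and return the most frequent non-None answer."""
    answers = [solve_once() for _ in range(runs)]
    answers = [a for a in answers if a is not None]  # drop failed runs
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in: replays canned results instead of calling the model
canned = iter([35, 35, 75, 35, None])
print(self_consistent_answer(lambda: next(canned)))
```
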