## Intro

Two kinds of testing
- rule based eval (fast feedback to health of application)
- model based eval (for many different good types of outputs)

Automatically test your system every time someone in your team commits code.

Would you like to detect hallucinations?

### Into to Continuous Integration

What is?
- Making small frequent changes to your application
- Automatically building and testing every change
- Generating rapid, ongoing feedback

Why Important?
- Preventing failed builds from emerging
- Automatically merging passing builds

CI platform simulates how the system behaves in real world, ensuring reliable results and basic functioanlity, to more compelx issues like bias and hallucinations. CI Platform is like having supercharged feedback. Helps catch issues early and prevent them from causing a domino effect. 

Benefits of CI
For developers:
- Rapid iteration
- Faster troubleshooting
- Increased confidence

For teams:
- Easier collaboration
- Fewer merge conflicts
- Shared source of truth
- More trust

# Overview of Automated Evals
- Rule based evals first
- cheap and easy to run

|               | Traditional Software             | LLM-based applications                           |
|---------------|----------------------------------|--------------------------------------------------|
| **Behavior**  | Predefined rules                 | Probability + Prediction                          |
| **Output**    | Deterministic   (same input -> same output)                 | Non-deterministic      (Same Input -> many possible outputs)                          |
| **Testing**   | 1 input, 1 correct output       | 1 input, many correct (and incorrect) outputs    |
| **Criteria**  | Evaluate as "right" or "wrong"  | Evaluate on: accuracy, quality, consistency, bias, toxicity, and more |


Q: Is London the best city in the world?

A1: "Yes"

A2: "Determining whether London is the "best" city in the world is subjective and depends on individual preferences, priorities, and criteria for what makes a city great. London certainly has many attributes that make it a highly desirable city for many people, including its rich history, cultural diversity, world-class museums and galleries, vibrant arts and entertainment scene, and economic opportunities. However, whether it is the "best" city overall is open to debate and varies from person to person."

Which answer is correct? depends on your use case.

Assess LLM outputs on:
- Performance (speed and functionality)
- Effectiveness (accuracy and utility)
- Quality (user experience and reliability)

Standard benchmarks
- MMLU (Mean Message Length in Utterance): Performance
- HellaSwag: Performance/effectiveness
- HumanEval: Quality
- ARC (AI2 Reasoning Challeng): Effectiveness
- WinoGrande: Effectiveness/Quality

These are standard benchmarks, but not very useful you a niche or specific use case. How do we build our own metrics? ALso, these are manual tools, we want automatic tools.

### Automating Evals: What and when?
What?
- Context adherence
- Context relevance
- Correctness
- Bias and toxicity

When?
- After every change (bug fix, feature update, data change)
- Pre-deployment (merges to production branch, end of sprint, prior to shipping hotfix)
- Post-deployment (on demand based on business needs)

### Lesson 2: Overview of Automated Evals Code

In [None]:
def eval_expected_words(
    system_message,
    question,
    expected_words,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
    
  assistant = assistant_chain(
      system_message,
      human_template,
      llm,
      output_parser)
    
  
  answer = assistant.invoke({"question": question})
    
  print(answer)
    
  assert any(word in answer.lower() \
             for word in expected_words), \
    f"Expected the assistant questions to include \
    '{expected_words}', but it did not"

In [None]:
question  = "Generate a quiz about science."
expected_words = ["davinci", "telescope", "physics", "curie"]

In [None]:
eval_expected_words(
    prompt_template,
    question,
    expected_words
)

In [None]:
def evaluate_refusal(
    system_message,
    question,
    decline_response,
    human_template="{question}", 
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
    
  assistant = assistant_chain(human_template, 
                              system_message,
                              llm,
                              output_parser)
  
  answer = assistant.invoke({"question": question})
  print(answer)
  
  assert decline_response.lower() in answer.lower(), \
    f"Expected the bot to decline with \
    '{decline_response}' got {answer}"

In [None]:
question  = "Generate a quiz about Rome."
decline_response = "I'm sorry"

# Automated Model-Graded Evals

In [None]:
eval_system_prompt = f"""You are an assistant that evaluates \
  whether or not an assistant is producing valid quizzes.
  The assistant should be producing output in the \
  format of Question N:{delimiter} <question N>?"""

In [None]:
llm_response = """
Question 1:#### What is the largest telescope in space called and what material is its mirror made of?

Question 2:#### True or False: Water slows down the speed of light.

Question 3:#### What did Marie and Pierre Curie discover in Paris?
"""

In [None]:
eval_user_message = f"""You are evaluating a generated quiz \
based on the context that the assistant uses to create the quiz.
  Here is the data:
    [BEGIN DATA]
    ************
    [Response]: {llm_response}
    ************
    [END DATA]

Read the response carefully and determine if it looks like \
a quiz or test. Do not evaluate if the information is correct
only evaluate if the data is in the expected format.

Output Y if the response is a quiz, \
output N if the response does not look like a quiz.
"""

In [None]:
from langchain.prompts import ChatPromptTemplate
eval_prompt = ChatPromptTemplate.from_messages([
      ("system", eval_system_prompt),
      ("human", eval_user_message),
  ])

In [None]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo",
                 temperature=0)

In [None]:
from langchain.schema.output_parser import StrOutputParser
output_parser = StrOutputParser()

In [None]:
eval_chain = eval_prompt | llm | output_parser

In [None]:
eval_chain.invoke({})

In [None]:
def create_eval_chain(
    agent_response,
    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()
):
  delimiter = "####"
  eval_system_prompt = f"""You are an assistant that evaluates whether or not an assistant is producing valid quizzes.
  The assistant should be producing output in the format of Question N:{delimiter} <question N>?"""
  
  eval_user_message = f"""You are evaluating a generated quiz based on the context that the assistant uses to create the quiz.
  Here is the data:
    [BEGIN DATA]
    ************
    [Response]: {agent_response}
    ************
    [END DATA]

Read the response carefully and determine if it looks like a quiz or test. Do not evaluate if the information is correct
only evaluate if the data is in the expected format.

Output Y if the response is a quiz, output N if the response does not look like a quiz.
"""
  eval_prompt = ChatPromptTemplate.from_messages([
      ("system", eval_system_prompt),
      ("human", eval_user_message),
  ])

  return eval_prompt | llm | output_parser

In [None]:
known_bad_result = "There are lots of interesting facts. Tell me more about what you'd like to know"

In [None]:
bad_eval_chain = create_eval_chain(known_bad_result)

In [None]:
# response for wrong prompt
bad_eval_chain.invoke({})

# Comprehensive testing framework

Hallucinations:
- Can be inaccurate
- irrelevant
- contradictory or nonsensical


In [None]:
from langchain.prompts                import ChatPromptTemplate
from langchain.chat_models            import ChatOpenAI
from langchain.schema.output_parser   import StrOutputParser

def create_eval_chain(context, agent_response):
  eval_system_prompt = """You are an assistant that evaluates \
  how well the quiz assistant
    creates quizzes for a user by looking at the set of \
    facts available to the assistant.
    Your primary concern is making sure that ONLY facts \
    available are used. Quizzes that contain facts outside
    the question bank are BAD quizzes and harmful to the student."""
  
  eval_user_message = """You are evaluating a generated quiz \
  based on the context that the assistant uses to create the quiz.
  Here is the data:
    [BEGIN DATA]
    ************
    [Question Bank]: {context}
    ************
    [Quiz]: {agent_response}
    ************
    [END DATA]

Compare the content of the submission with the question bank \
using the following steps

1. Review the question bank carefully. \
  These are the only facts the quiz can reference
2. Compare the quiz to the question bank.
3. Ignore differences in grammar or punctuation
4. If a fact is in the quiz, but not in the question bank \
   the quiz is bad.

Remember, the quizzes need to only include facts the assistant \
  is aware of. It is dangerous to allow made up facts.

Output Y if the quiz only contains facts from the question bank, \
output N if it contains facts that are not in the question bank.
"""
  eval_prompt = ChatPromptTemplate.from_messages([
      ("system", eval_system_prompt),
      ("human", eval_user_message),
  ])

  return eval_prompt | ChatOpenAI(
      model="gpt-3.5-turbo", 
      temperature=0) | \
    StrOutputParser()

In [None]:
def test_model_graded_eval_hallucination(quiz_bank):
  assistant = assistant_chain()
  quiz_request = "Write me a quiz about books."
  result = assistant.invoke({"question": quiz_request})
  print(result)
  eval_agent = create_eval_chain(quiz_bank, result)
  eval_response = eval_agent.invoke({"context": quiz_bank, "agent_response": result})
  print(eval_response)
  # Our test asks about a subject not in the context, so the agent should answer N
  assert eval_response == "N"


# LLM Red Teaming

### Vulnerabilities:
1) Bias and stereotypes
2) Sensitive information disclosure
3) Service disruption
4) Hallucinations

### Red teaming meaning and origin:
- strategy used in cybersecurity and militiary
     - team simulates adversaries actions and tactics
     - test and improve the effectiveness of an organisations defneces
- Red teaming tests robustness, fairness and ethical boundaries of LLM systems
     - Find ways to make the bot misbehave and return incorrect answers
     

### Stategies
- bypass safeguards
    - exploit text completion
    - using biased prompts
    - direct prompt injection
    - gray box prompt attacks
    - advanced techniques prompt probing