In [None]:
%%capture
!pip install langchain==0.1.13 openai==1.14.2 ragas==0.1.7 langchain-openai==0.1.1 langchain-cohere==0.1.0rc1

In [None]:
import os
import sys
from dotenv import load_dotenv
from getpass import getpass
import nest_asyncio

nest_asyncio.apply()
load_dotenv()

In [None]:
# .get() returns None when the variable is unset, so the getpass fallback can run
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OpenAI API key: ")

In [None]:
# .get() returns None when the variable is unset, so the getpass fallback can run
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")

# 🤔 Imports from `🦜🔗LangChain`? I thought this was a `🗂️LlamaIndex🦙` Course...

Yes, it most certainly is. 

But the state of LLMOps/LLMEval tooling is in its infancy. Things are rapidly changing, libraries are constantly breaking...it's seriously a mess right now. So, I've only bothered to learn one library for RAG evaluation, and that's `ragas`.

`ragas` itself is hacky and at times quite finicky, but it's what we've got to work with. It's tightly integrated with `LangChain`, and at the time of this writing their `LlamaIndex` integration is broken.

So, we'll use `ragas` with the `LangChain` backend. All you're really going to be doing with `LangChain` is instantiating an LLM and an embedding model.

In [None]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_cohere.embeddings import CohereEmbeddings

llm = ChatOpenAI(
    model = "gpt-3.5-turbo-0125"
    )

embed_model=CohereEmbeddings(
    cohere_api_key = CO_API_KEY
    )

I've got an [example dataset](https://huggingface.co/datasets/explodinggradients/fiqa/viewer/ragas_eval?row=1) on Hugging Face that we'll use in the next several videos.

You don't need to sign up for a Hugging Face account to download the dataset, but if you do end up creating an account [feel free to follow me](https://huggingface.co/harpreetsahota)!

In [None]:
from datasets import load_dataset 

# The 'baseline' split lives under the 'ragas_eval' config of this dataset
dataset = load_dataset("explodinggradients/fiqa", "ragas_eval", split='baseline', trust_remote_code=True)

dataset = dataset.rename_column("ground_truths", "ground_truth")

# Function to concatenate list of strings into a single string
def flatten_list_of_strings(example):
    # 'ground_truth' holds a list of strings; join them into one space-separated string
    example['ground_truth'] = ' '.join(example['ground_truth'])
    return example

# Apply the function to each example in the dataset
dataset = dataset.map(flatten_list_of_strings)
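
To see what the flattening step does, here is a minimal sketch on a hand-made row. The `toy_row` dictionary and its strings are made up for illustration; they are not taken from the dataset.

```python
def flatten_list_of_strings(example):
    # Join the list of reference answers into one space-separated string
    example['ground_truth'] = ' '.join(example['ground_truth'])
    return example

# Hypothetical row, shaped like a dataset row after the column rename
toy_row = {
    'question': 'What is an index fund?',
    'ground_truth': ['A fund that tracks a market index.', 'Fees are usually low.'],
}

# Work on a shallow copy so the original toy_row keeps its list
flat_row = flatten_list_of_strings(dict(toy_row))
print(flat_row['ground_truth'])
```

After the call, `ground_truth` is a single string, which is the shape the evaluation cells below work with.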

# 🤝 **Faithfulness**

- 🎯 [Faithfulness](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py) evaluates how accurately responses align with the provided context.

- 🕵️‍♂️ The process involves two steps: 
    - identifying statements within an answer
    - verifying these statements against the context

- 📊 The faithfulness score, ranging from 0 to 1, measures the fraction of statements in an answer that the context supports.

- 📈 High faithfulness scores indicate responses that are both reliable and closely reflect the context's factual content.

$$\text{Faithfulness score} = \frac{\text{Number of claims in the generated answer that can be inferred from the given context}}{\text{Total number of claims in the generated answer}}$$
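
As a worked example with made-up numbers: suppose an answer breaks down into 3 claims, and 2 of them can be inferred from the context. Then:

$$\text{Faithfulness score} = \frac{2}{3} \approx 0.67$$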

# How does this work?

The metric makes a sequence of LLM calls, each with a specific prompt, to evaluate whether the statements derived from an answer are supported by the given context.

### **Generating Statements from Answers**:
The first LLM call breaks the answer down into individual statements.

   - Initially, for a given question-answer pair, the system uses the `LONG_FORM_ANSWER_PROMPT`. This prompt instructs the LLM to generate one or more statements based on the given answer. The aim here is to distill the essence of the answer into concise statements.

   - The prompt includes detailed instructions for the LLM, specifying how to transform answers into statements, and is accompanied by examples to guide the LLM's generation process.
   
   - Once the LLM processes this prompt, the generated output is parsed using the `_statements_output_parser`, which ensures the output aligns with the `StatementsAnswers` model structure. This process extracts the list of statements as structured data.
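
To make the parsing step concrete, here is a rough sketch of turning a raw LLM reply into a list of statements. The JSON shape, the `raw_reply` string, and the `parse_statements` helper are assumptions for illustration; they are not the exact schema or parser that `ragas` uses.

```python
import json
from typing import List

def parse_statements(raw_reply: str) -> List[str]:
    """Extract statement strings from a JSON reply (hypothetical format)."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError:
        # A malformed reply yields no statements rather than crashing
        return []
    if not isinstance(payload, dict):
        return []
    statements = payload.get("statements", [])
    if not isinstance(statements, list):
        return []
    # Keep only non-empty strings, with surrounding whitespace removed
    return [s.strip() for s in statements if isinstance(s, str) and s.strip()]

# Made-up reply standing in for what an LLM might return
raw_reply = '{"statements": ["Index funds track a market index.", "They tend to have low fees."]}'
statements = parse_statements(raw_reply)
print(statements)
```

Returning an empty list on bad output is a design choice in this sketch: it keeps one malformed reply from stopping a whole evaluation run.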


In [None]:
from ragas.metrics import faithfulness

In [None]:
faithfulness.long_form_answer_prompt.__dict__

### **Evaluating Statement Faithfulness**

   - After statements are generated, the next step is to evaluate **faithfulness**, that is, whether each statement is supported by the original context.

   - For this, the system employs the `NLI_STATEMENTS_MESSAGE` prompt. 
   
   - This prompt directs the LLM to assess each generated statement's faithfulness based on the original context (from which the question-answer pair was derived) or additional context provided separately. 
   
   - It asks the LLM to return a verdict on each statement, indicating whether the statement can be verified against the context.

   - The instruction to the LLM includes a detailed explanation of how to judge each statement's faithfulness, supported by examples for clarity.
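
For intuition, here is a sketch of how a verification prompt could be assembled from the context and the extracted statements. The template wording and the `build_nli_prompt` helper are hypothetical; the actual `ragas` prompt is the one printed in the next cell.

```python
from typing import List

def build_nli_prompt(context: str, statements: List[str]) -> str:
    """Assemble a verification prompt (illustrative wording, not the ragas template)."""
    # Number the statements so each verdict can be matched back to its statement
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(statements, start=1))
    return (
        "Judge each statement using only the context below.\n"
        "Return verdict 1 if the statement can be inferred from the context, otherwise 0.\n\n"
        f"Context:\n{context}\n\n"
        f"Statements:\n{numbered}\n"
    )

# Made-up context and statements for illustration
prompt = build_nli_prompt(
    context="Index funds track a market index and usually charge low fees.",
    statements=["Index funds track a market index.", "Index funds guarantee profits."],
)
print(prompt)
```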

In [None]:
faithfulness.nli_statements_message.__dict__

### **Computing the Faithfulness Score**

   - The `Faithfulness` class computes a score representing the overall faithfulness of the statements. This computation involves counting the number of statements judged as faithful (verdict = 1) and dividing by the total number of statements.
   
   - This score is the fraction of the answer's statements that the context supports.
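
The counting step above can be sketched in a few lines. The `faithfulness_score` helper is hypothetical, and returning NaN for an empty list is an assumption of this sketch rather than a statement about what `ragas` does.

```python
import math
from typing import List

def faithfulness_score(verdicts: List[int]) -> float:
    """Fraction of statements judged faithful (verdict == 1)."""
    if not verdicts:
        # Assumption: with no statements there is nothing to score
        return math.nan
    supported = sum(1 for v in verdicts if v == 1)
    return supported / len(verdicts)

# Made-up verdicts: three of four statements are supported by the context
example_score = faithfulness_score([1, 0, 1, 1])
print(example_score)
```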



Take a look at [the source code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py) for the metric for full details.


In [None]:
from ragas import evaluate

score = evaluate(
    dataset,
    llm=llm,
    embeddings=embed_model,
    metrics=[faithfulness])

In [None]:
# evaluate() returns a result object, not a function, so display it directly
score

# Recap

**Input**: An answer to a question and the context related to that answer.

**Process:**

  - **Step 1:** Use LLM to transform the answer into a set of discrete statements.

  - **Step 2:** Evaluate each statement's faithfulness with a second LLM prompt, checking it against the context.

  - **Step 3:** Calculate the score as the fraction of statements deemed faithful.

**Output:** A faithfulness score between 0 and 1. A score closer to 1 means more of the answer's statements are supported by the context.