In [None]:
%%capture
!pip install langchain==0.1.13 openai==1.14.2 ragas==0.1.7 langchain-openai==0.1.1 langchain-cohere==0.1.0rc1

In [None]:
import os
import sys
from dotenv import load_dotenv
from getpass import getpass
import nest_asyncio

nest_asyncio.apply()
load_dotenv()

In [None]:
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OpenAI API key: ")
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY  # ChatOpenAI reads the key from the environment

In [None]:
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")

In [None]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_cohere.embeddings import CohereEmbeddings

llm = ChatOpenAI(
    model = "gpt-3.5-turbo-0125"
    )

embed_model=CohereEmbeddings(
    cohere_api_key = CO_API_KEY
    )

I've got an [example dataset](https://huggingface.co/datasets/explodinggradients/fiqa/viewer/ragas_eval?row=1), hosted on the Hugging Face Hub, that we'll use in the next several videos.

You don't need to sign up for a Hugging Face account to download the dataset, but if you do end up creating an account, [feel free to follow me](https://huggingface.co/harpreetsahota)!

In [None]:
from datasets import load_dataset 

# "ragas_eval" is the dataset configuration shown in the viewer link above
dataset = load_dataset("explodinggradients/fiqa", "ragas_eval", split='baseline', trust_remote_code=True)

dataset = dataset.rename_column("ground_truths", "ground_truth")

# Concatenate the list of ground truth strings into a single string
def flatten_list_of_strings(example):
    # 'ground_truth' holds a list of strings after the rename above
    example['ground_truth'] = ' '.join(example['ground_truth'])
    return example

# Apply the function to each example in the dataset
dataset = dataset.map(flatten_list_of_strings)

`ragas` expects the ground truth to be a single string, not a list of strings, to calculate this metric.

In [None]:
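# The column was flattened above; joining a string again would put a space between every character, so only join lists.
dataset = dataset.map(lambda x: {"ground_truth": x["ground_truth"] if isinstance(x["ground_truth"], str) else " ".join(x["ground_truth"])})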
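
To see why the flattening step should only run once, here is a small sketch in plain Python, independent of `datasets`. The helper name `flatten_ground_truth` and the example sentences are hypothetical and not part of `ragas`; the helper joins a list of strings and leaves an already-flattened string alone.

```python
def flatten_ground_truth(value):
    """Join a list of strings into one string; leave a plain string unchanged."""
    if isinstance(value, str):
        return value
    return " ".join(value)


once = flatten_ground_truth(["Buy index funds.", "Keep fees low."])
twice = flatten_ground_truth(once)

# Joining a string directly iterates over its characters,
# so a second unguarded join corrupts the text.
naive_twice = " ".join(once)

print(once)
print(twice == once)
print(naive_twice == once)
```

The guarded helper is safe to re-run, which matters in a notebook where cells are often executed more than once.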

# 📊 Context Recall

[Context recall](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_recall.py) is a metric for evaluating information retrieval systems. 

It measures how well a retriever gathers all the relevant context needed to answer a question accurately. By comparing each part of a provided ground truth answer to the retrieved information, context recall assesses the completeness of the retrieval process. 

- 📏 **Key Point**: The goal is to have a context that backs every part of the answer.

- 🛠 **Necessity of Ground Truth**: Essential for measuring how well the context aligns with the correct answer.

- 🔍 **Evaluation Process**: Each sentence in the ground truth is compared with the retrieved context to find matches.

- 📈 **Scoring Scale**: Ranges from 0 to 1, where values closer to 1 indicate better performance. A perfect score means the system fetched all the needed context to support the ground truth answer comprehensively.

- 🔄 **Operational Mechanism**: Breaks down the answer into sentences to check for their presence in the context, aiming for a complete match.
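
The "breaks down the answer into sentences" step can be pictured with a naive splitter. This is only an illustration: in `ragas` the LLM handles the sentence-level analysis itself, and the regex, the function name `split_sentences`, and the example text below are assumptions made for this sketch.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


ground_truth = (
    "Index funds track a market index. "
    "They usually charge low fees. "
    "Many investors hold them long term."
)

# Each sentence becomes one unit that is checked against the retrieved context.
for i, sentence in enumerate(split_sentences(ground_truth), start=1):
    print(i, sentence)
```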

# How does this work?

Context Recall assesses how well the sentences in a ground truth answer can be attributed to the retrieved context. It identifies the sentences that are directly supported by the context (True Positives) and those that are not found in the context (False Negatives). 

In other words, the metric checks how completely the retrieved context covers the ground truth answer, which is the recall of the retriever.

 - For each question-context-answer set, the `CONTEXT_RECALL_RA` prompt instructs the LLM to analyze each sentence in the answer and classify whether it can be directly attributed to the given context. 

 - The classification is binary: "Yes" (1) for sentences that can be attributed to the context and "No" (0) for those that cannot. Each classification must also include a reason.

 - The LLM evaluates the answer against the provided context and classifies each sentence based on its presence or absence in that context. 

 - This step produces classifications of sentences as either attributed to the context or not, along with reasons for each classification.
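
Here is a rough picture of what that step produces, assuming the LLM returns JSON with one record per sentence. The field names (`statement`, `reason`, `attributed`), the helper `parse_verdicts`, and the example text are assumptions for illustration; inspect the prompt object in the cells below for the exact keys your `ragas` version uses.

```python
import json

# Hypothetical LLM output: one verdict per ground truth sentence.
raw_output = """
[
  {"statement": "Index funds track a market index.",
   "reason": "The context says index funds mirror an index.",
   "attributed": 1},
  {"statement": "They usually charge low fees.",
   "reason": "The context mentions low expense ratios.",
   "attributed": 1},
  {"statement": "Many investors hold them long term.",
   "reason": "The context says nothing about holding periods.",
   "attributed": 0}
]
"""


def parse_verdicts(text):
    """Parse the JSON and keep only well-formed binary verdicts."""
    verdicts = []
    for record in json.loads(text):
        label = int(record["attributed"])
        if label not in (0, 1):
            raise ValueError(f"Expected 0 or 1, got {label!r}")
        verdicts.append((record["statement"], label, record["reason"]))
    return verdicts


verdicts = parse_verdicts(raw_output)
for statement, label, reason in verdicts:
    print("Yes" if label else "No", "|", statement, "|", reason)
```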

In [None]:
from ragas.metrics import context_recall

In [None]:
context_recall.context_recall_prompt.__dict__

# Computing Context Recall

The score is computed by taking the ratio of sentences attributed to the context (True Positives) to the total number of sentences analyzed. 

This reflects the proportion of the answer that can be directly traced back to the provided context, effectively measuring the recall of the context in the answer.
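
As a minimal sketch of that ratio, assume the per-sentence verdicts are already available as a list of 0/1 labels. The function name `context_recall_score` is hypothetical, and returning NaN for an empty list is a choice made here, not confirmed library behavior.

```python
import math


def context_recall_score(labels):
    """Return attributed sentences / total sentences for one example."""
    if not labels:
        # No sentences to judge, so the ratio is undefined.
        return math.nan
    return sum(labels) / len(labels)


# Two of three ground truth sentences are supported by the context.
print(context_recall_score([1, 1, 0]))

# Every sentence is supported, so the retriever found everything needed.
print(context_recall_score([1, 1, 1, 1]))
```

The score that `evaluate` reports for the whole dataset is an aggregate over per-example values like these.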

In [None]:
from ragas import evaluate

score = evaluate(
    dataset,
    llm=llm,
    embeddings=embed_model,
    metrics=[context_recall])

In [None]:
score

In [None]:
score.to_pandas()

# Recap


**Input:** A dataset containing trios of questions, contexts, and ground truth answers.

**Process:**

- **Step 1**: For each question-context-answer set, the LLM looks at each sentence in the answer and decides whether it can be directly attributed to the given context. 

- **Step 2**: The LLM classifies sentences as either attributed to the context or not. The classification is binary: "Yes" (1) for sentences that can be attributed to the context and "No" (0) for those that cannot. 

- **Step 3**: The score is computed by taking the ratio of sentences attributed to the context (True Positives) to the total number of sentences considered. 

**Output**: A score representing the context recall of the answer, quantifying the extent to which the information provided in the answer is supported by the given context.