In [None]:
%%capture
!pip install langchain==0.1.13 openai==1.14.2 ragas==0.1.6 langchain-openai==0.1.1 langchain-cohere==0.1.0rc1

In [None]:
import os
import sys
from dotenv import load_dotenv
from getpass import getpass
import nest_asyncio

nest_asyncio.apply()
load_dotenv()

In [None]:
# .get() returns None when the variable is unset, so the getpass fallback can actually run
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OpenAI API key: ")

In [None]:
# .get() returns None when the variable is unset, so the getpass fallback can actually run
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")

In [None]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_cohere.embeddings import CohereEmbeddings

llm = ChatOpenAI(
    model="gpt-3.5-turbo-0125",
    openai_api_key=OPENAI_API_KEY,  # pass the key explicitly in case it was entered via getpass
)

embed_model=CohereEmbeddings(
    cohere_api_key = CO_API_KEY
    )

I've got an [example dataset](https://huggingface.co/datasets/explodinggradients/fiqa/viewer/ragas_eval?row=1) in my Hugging Face repo that we'll use in the next several videos.

You don't need to sign up for a Hugging Face account to download the repo, but if you do end up creating an account, [feel free to follow me](https://huggingface.co/harpreetsahota)!

In [None]:
from datasets import load_dataset 

# "ragas_eval" is the dataset configuration shown in the viewer link above
dataset = load_dataset("explodinggradients/fiqa", "ragas_eval", split='baseline', trust_remote_code=True)

dataset = dataset.rename_column("ground_truths", "ground_truth")

# 🪡 **Context Precision**

- 🎖️ [Context Precision](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_precision.py) evaluates whether the relevant chunks in the retrieved `contexts` are ranked at the top.

- 🔍 It checks whether the context needed to answer a query is prioritized, that is, the retriever's ability to surface key information first.

- 📏 The assessment compares the ranking of the retrieved context against ground-truth data to check that relevant chunks appear in the expected order.

- 🎯 Values range from 0 to 1, with higher values indicating that relevant context appears earlier in the ranking.

- 📈 High scores indicate success in highlighting the most important context.


$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left( \text{Precision@k} \times v_k \right)}{\text{Total number of relevant items in the top } K \text{ results}}$$

$$\text{Precision@k} = {\text{true positives@k} \over  (\text{true positives@k} + \text{false positives@k})}$$

Where $K$ is the total number of chunks in `contexts` and $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$.
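
As a worked example, using made-up verdicts for illustration rather than values from the dataset, suppose $K = 3$ chunks are retrieved and the relevance indicators are $v = (1, 0, 1)$:

$$\text{Precision@1} = \frac{1}{1}, \quad \text{Precision@2} = \frac{1}{2}, \quad \text{Precision@3} = \frac{2}{3}$$

$$\text{Context Precision@3} = \frac{1 \times 1 + \frac{1}{2} \times 0 + \frac{2}{3} \times 1}{2} = \frac{5}{6} \approx 0.83$$

If the same two relevant chunks were ranked last instead, $v = (0, 1, 1)$, the score would drop to $\frac{\frac{1}{2} + \frac{2}{3}}{2} = \frac{7}{12} \approx 0.58$. The metric rewards placing relevant chunks first.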


# How does this work?

The `ContextPrecision` metric evaluates how useful each provided context item is in arriving at a given answer, and whether the useful items are ranked ahead of the rest.

  - The `CONTEXT_PRECISION` prompt is defined to guide the LLM in verifying the usefulness of context for a given question-answer pair. 

  - This prompt asks the LLM to provide a binary verdict (1 for useful, 0 for not useful) along with a reason for its decision.

  - For each question-answer-context trio, the system generates a verification task using the defined prompt. This task directs the LLM to examine the given context and determine if it was useful in reaching the provided answer.
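
The per-chunk verification loop described above can be sketched roughly as follows. This is a minimal illustration, not the ragas implementation: `build_tasks` and `toy_judge` are hypothetical names, the example question and contexts are made up, and `toy_judge` uses simple word overlap as a stand-in for the LLM call so the snippet runs without an API key.

```python
from typing import Dict, List


def build_tasks(question: str, answer: str, contexts: List[str]) -> List[Dict[str, str]]:
    """Create one verification task per retrieved chunk, keeping retrieval order."""
    return [
        {"question": question, "answer": answer, "context": chunk}
        for chunk in contexts
    ]


def toy_judge(task: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical stand-in for the LLM: a chunk is 'useful' if it shares
    a word of more than three characters with the answer."""
    answer_words = {
        w.lower().strip(".,") for w in task["answer"].split() if len(w) > 3
    }
    context_words = {w.lower().strip(".,") for w in task["context"].split()}
    useful = bool(answer_words & context_words)
    reason = "shares key terms with the answer" if useful else "no overlap with the answer"
    # Same shape as described above: a binary verdict plus a reason
    return {"verdict": int(useful), "reason": reason}


tasks = build_tasks(
    question="What is an index fund?",
    answer="An index fund tracks a market index.",
    contexts=[
        "An index fund is a fund that tracks a market index such as the S&P 500.",
        "Credit cards often charge annual fees.",
    ],
)

# One verdict per chunk, in the same order as `contexts`
verdicts = [toy_judge(task)["verdict"] for task in tasks]
print(verdicts)
```

In the real metric, the judge is the LLM guided by the `CONTEXT_PRECISION` prompt. The order of the verdict list matters because the score is rank-aware.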

In [None]:
from ragas.metrics import context_precision

Note: `context_precision` judges each context against the ground truth response. If you want to calculate this value but don't have ground truth, use `context_utilization` instead, which judges against the generated answer. You can import it as follows:

```python
from ragas.metrics import context_utilization
```



In [None]:
context_precision.context_precision_prompt.__dict__

# Computing the Context Precision Score

With all the verdicts collected, the system calculates the average precision score. 

- 📋 **Prepare Verdict List**: Converts the LLM's verifications into a binary list, marking 1 for useful contexts and 0 for non-useful ones.

- ➕ **Calculate Numerator**: Sums Precision@k over every rank $k$, where Precision@k is the ratio of relevant items seen so far to $k$, multiplied by the relevance indicator $v_k$ so that only ranks holding a relevant context contribute.

- ➗ **Calculate Denominator**: Total count of relevant contexts in the list, with a tiny value added to prevent division by zero.

- 📊 **Compute Score**: Divides the numerator by the denominator to get the average precision, a rank-aware measure of context usefulness.

This calculation gives more weight to relevant contexts that appear early in the list. A higher score indicates that the useful contexts were ranked near the top, meaning the retrieved context was ordered effectively for supporting the answers.
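
The four steps above can be sketched in a few lines of plain Python. This is a sketch that follows the description in this section, not a copy of the library source. The function name `context_precision_score` and the epsilon value `1e-10` are assumptions made for illustration.

```python
from typing import List


def context_precision_score(verdicts: List[int]) -> float:
    """Average precision over a ranked list of binary usefulness verdicts."""
    # Denominator: number of relevant contexts, plus a tiny value to avoid 0/0
    denominator = sum(verdicts) + 1e-10
    # Numerator: Precision@k at each rank, counted only where the verdict is 1
    numerator = sum(
        (sum(verdicts[: k + 1]) / (k + 1)) * verdicts[k]
        for k in range(len(verdicts))
    )
    return numerator / denominator


early = context_precision_score([1, 1, 0])  # useful contexts ranked first
late = context_precision_score([0, 1, 1])   # same contexts, ranked last
print(round(early, 3), round(late, 3))
```

The two lists contain the same number of useful contexts, but the first scores higher because those contexts appear at the top of the ranking.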

In [None]:
from ragas import evaluate

score = evaluate(
    dataset,
    llm=llm,
    embeddings=embed_model,
    metrics=[context_precision])

In [None]:
score

In [None]:
score.to_pandas()

# Recap

**Input:** A dataset comprising question-answer-context trios.

**Process**:

  - **Step 1**: Formulate verification tasks for each trio using the `CONTEXT_PRECISION` prompt.

  - **Step 2**: Evaluate each task's context usefulness with the LLM, obtaining verdicts and reasons.

  - **Step 3**: Parse LLM responses into structured data.
  
  - **Step 4**: Calculate the average precision score based on the usefulness verdicts, emphasizing the relevance and ranking of useful context items.

**Output:** An average precision score, quantifying the overall effectiveness of context utilization in providing answers.