In [None]:
%%capture
!pip install langchain==0.1.13 openai==1.14.2 ragas==0.1.7 langchain-openai==0.1.1 langchain-cohere==0.1.0rc1

In [None]:
import os
import sys
from dotenv import load_dotenv
from getpass import getpass
import nest_asyncio

nest_asyncio.apply()
load_dotenv()

In [None]:
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OpenAI API key: ")
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY  # make a typed-in key visible to ChatOpenAI

In [None]:
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")

In [None]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_cohere.embeddings import CohereEmbeddings

llm = ChatOpenAI(
    model = "gpt-3.5-turbo-0125"
    )

embed_model=CohereEmbeddings(
    cohere_api_key = CO_API_KEY
    )

I've got an [example dataset](https://huggingface.co/datasets/explodinggradients/fiqa/viewer/ragas_eval?row=1) in my Hugging Face repo that we'll use in the next several videos.

You don't need to sign up for a Hugging Face account to download the repo, but if you do end up creating an account, [feel free to follow me](https://huggingface.co/harpreetsahota)!

In [None]:
from datasets import load_dataset 

dataset = load_dataset("explodinggradients/fiqa", split='baseline', trust_remote_code=True)

dataset = dataset.rename_column("ground_truths", "ground_truth")

`ragas` expects the ground truth to be a single string to calculate this metric, so we join each list of ground truths into one string.

In [None]:
dataset = dataset.map(lambda x: {"ground_truth": " ".join(x["ground_truth"])})

# ✅ **Answer Correctness**

[Answer correctness](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_correctness.py) measures how closely a model's response aligns with the established ground truth, focusing on precision and accuracy.

- 📏 **Scoring Range**: Scores range from 0 to 1, where scores near 1 indicate a high level of alignment with the correct answer.

- 🏗️ **Weighted Approach**: Integrates semantic and factual similarities for a comprehensive correctness score.

- 🔄 **Threshold Application**: Allows evaluators to apply a threshold, turning the nuanced score into a binary outcome for easier interpretation.

- 🔍 **Focus on Accuracy**: Ensures that generated responses are not only relevant but factually and semantically correct.
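The thresholding idea above can be sketched in plain Python. This is only an illustration: the helper name `binarize`, the 0.5 cutoff, and the example scores are all assumptions made for this sketch, not part of `ragas`.

```python
def binarize(score: float, threshold: float = 0.5) -> int:
    """Return 1 if the score meets the threshold, otherwise 0."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return int(score >= threshold)


# Hypothetical per-row correctness scores, not real evaluation results
example_scores = [0.91, 0.48, 0.50, 0.12]
labels = [binarize(s) for s in example_scores]
print(labels)  # [1, 0, 1, 0]
```

Where you set the cutoff depends on how strict you want the pass/fail decision to be; a higher threshold trades more false "fails" for fewer false "passes".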

# How does this work?

Answer Correctness evaluates an answer by assessing its factual alignment with the ground truth and its overall semantic similarity to it.

- The system starts by using the `CORRECTNESS_PROMPT`, which instructs the LLM to analyze each statement for its factual alignment with the ground truth, then classify each statement into one of three categories:

    - **✅ TP (True Positive):** Facts or statements that are present in both the ground truth and the generated answer.

    - **❌ FP (False Positive):** Facts or statements that are present in the generated answer but not in the ground truth.

    - **❎ FN (False Negative):** Facts or statements that are present in the ground truth but not in the generated answer.
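To make the three categories concrete, here is a toy sketch. In the real metric the LLM judges whether statements match; this sketch swaps that judgement for exact string matching, which is a big simplification. The function name `classify_statements` and the example statements are made up for illustration.

```python
def classify_statements(answer_statements, truth_statements):
    """Split statements into TP, FP, and FN lists.

    Exact string matching stands in for the LLM's judgement here.
    """
    answer_set = set(answer_statements)
    truth_set = set(truth_statements)
    return {
        "TP": sorted(answer_set & truth_set),  # in both answer and ground truth
        "FP": sorted(answer_set - truth_set),  # only in the answer
        "FN": sorted(truth_set - answer_set),  # only in the ground truth
    }


# Made-up statements for illustration only
answer = ["the sun is a star", "the sun orbits the earth"]
ground_truth = ["the sun is a star", "the earth orbits the sun"]

result = classify_statements(answer, ground_truth)
print(result)
```

Here one statement lands in each bucket: the shared fact is a TP, the wrong claim in the answer is an FP, and the fact the answer missed is an FN.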


You can, optionally, pass a list of two floats that sum to one to the `weights` argument.

- ⚖️ **Factual Accuracy Weight**: Assigned to evaluate the classification of statements as True Positives, False Positives, and False Negatives.

- 🧠 **Semantic Similarity Weight**: Focuses on assessing how closely the answer captures the essence and nuances of the ground truth.

- 🔍 **Semantic Evaluation**: If semantic similarity has a non-zero weight, the system evaluates the overall meaning of the answer compared to the ground truth, capturing nuances beyond factual accuracy.
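One common way to score semantic similarity is the cosine similarity between the embedding vectors of the answer and the ground truth. The sketch below assumes that approach; the notebook text does not spell out the exact similarity function, and the vectors are made-up numbers rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (assumed values, far smaller than real ones)
answer_vec = [0.2, 0.7, 0.1]
truth_vec = [0.25, 0.65, 0.05]

similarity = cosine_similarity(answer_vec, truth_vec)
print(round(similarity, 3))
```

Vectors that point in nearly the same direction score close to 1, which is the sense in which an answer can "capture the essence" of the ground truth even when the wording differs.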

In [None]:
from ragas.metrics import answer_correctness

In [None]:
answer_correctness.correctness_prompt.__dict__

# Computing the Answer Correctness Score

The system computes an F1 score based on the TP, FP, and FN classified by the LLM. 

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where Precision and Recall are:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

It can also be represented as:

$$F1 = \frac{TP}{TP + 0.5 \times (FP + FN)}$$

- The F1 score measures the factuality of the answer by accounting for both the accuracy (precision) and the completeness (recall) of the response relative to the ground truth.

- 🎚️ **Final Score Composition**: Combines the F1 score and semantic similarity score into a weighted average, based on preset weights.

- 🏆 **Composite Score Aim**: Reflects the comprehensive factual and semantic alignment of the answer with the ground truth.
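The arithmetic above can be checked with a small worked example. This is a sketch, not the `ragas` implementation: the function names, the counts (3 TP, 1 FP, 2 FN), the similarity value of 0.9, and the weights `[0.75, 0.25]` are all assumed for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 via precision and recall; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_compact(tp, fp, fn):
    """Equivalent form: TP / (TP + 0.5 * (FP + FN))."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def weighted_score(f1, similarity, weights):
    """Weighted average of the F1 score and the semantic similarity score."""
    w_f1, w_sim = weights
    total = w_f1 + w_sim
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return (w_f1 * f1 + w_sim * similarity) / total


# Assumed counts: precision = 3/4, recall = 3/5
tp, fp, fn = 3, 1, 2
f1 = f1_from_counts(tp, fp, fn)
print(round(f1, 4))                      # 0.6667
print(round(f1_compact(tp, fp, fn), 4))  # 0.6667, both forms agree

# Assumed semantic similarity and weights (factual first, semantic second)
final = weighted_score(f1, similarity=0.9, weights=[0.75, 0.25])
print(round(final, 4))                   # 0.725
```

Setting the second weight to zero makes the final score equal to the F1 score alone, which is why the semantic step can be skipped in that case.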

In [None]:
from ragas import evaluate

score = evaluate(
    dataset,
    llm=llm,
    embeddings=embed_model,
    metrics=[answer_correctness])

In [None]:
score

In [None]:
score.to_pandas()

**Input:** A dataset containing trios of questions, answers, and their corresponding ground truths.

**Process:**

  - **Step 1:** Use the correctness prompt to guide the LLM in classifying statements from the answer into TP, FP, and FN based on their factual alignment with the ground truth.

  - **Step 2 (Conditional):** Evaluate the semantic similarity between the answer and the ground truth if the semantic similarity weight is non-zero.

  - **Step 3:** Calculate the F1 score from the TP, FP, and FN classifications to assess factual accuracy.

  - **Step 4:** Compute the final score by averaging the F1 and semantic similarity scores according to their weights, providing a comprehensive measure of answer correctness.
  
**Output:** A composite score that quantifies the correctness of an answer, reflecting both its factual accuracy and semantic alignment with the ground truth. 