In [None]:
%%capture
!pip install langchain==0.1.13 openai==1.14.2 ragas==0.1.6 langchain-openai==0.1.1 langchain-cohere==0.1.0rc1 python-dotenv

In [None]:
import os
import sys
from dotenv import load_dotenv
from getpass import getpass
import nest_asyncio

nest_asyncio.apply()
load_dotenv()

In [None]:
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OpenAI API key: ")
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY  # ChatOpenAI reads the key from the environment

In [None]:
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")

In [None]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_cohere.embeddings import CohereEmbeddings

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

embed_model = CohereEmbeddings(cohere_api_key=CO_API_KEY)

I've got an [example dataset](https://huggingface.co/datasets/explodinggradients/fiqa/viewer/ragas_eval?row=1) in my Hugging Face repo that we'll use over the next several videos.

You don't need to sign up for a Hugging Face account to download the repo, but if you do end up creating an account, [feel free to follow me](https://huggingface.co/harpreetsahota)!

In [None]:
from datasets import load_dataset 

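# Name the "ragas_eval" config explicitly; it is the one that contains the 'baseline' split
dataset = load_dataset("explodinggradients/fiqa", "ragas_eval", split='baseline', trust_remote_code=True)

# Note: rename_column returns a new Dataset and does not modify `dataset` in place.
# The result is not kept here; assign it if you need the renamed column downstream.
dataset.rename_column("ground_truths", "ground_truth")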

# 🔍 **Answer Relevancy**

- 🎯 Answer Relevancy measures how directly an answer addresses the question asked.

- 🔍 The process involves generating hypothetical questions from the answer and comparing these to the original question to assess similarity.

- 📍 Its core focus is on identifying answers that precisely address the query without veering off-topic.

- 📈 Scoring ranges from 0 to 1, with higher scores indicating a closer match between the answer and the question.

- 🏆 The metric rewards answers that are directly applicable and penalizes those that include irrelevant details.

- 📐 It calculates mean cosine similarity between the original and reverse-engineered questions to quantify relevancy.

$$\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)$$


$E_{g_i}$ is the embedding of the $i$-th generated question.

$E_o$ is the embedding of the original question.

$N$ is the number of generated questions, which is 3 by default.
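To make the formula concrete, here is a minimal sketch in plain Python. The 2-D vectors are made-up toy values standing in for real embeddings, and the helper names `cosine` and `mean_cosine` are my own for illustration, not part of ragas.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine(original, generated):
    """Average the cosine similarity of each generated embedding with the original."""
    return sum(cosine(g, original) for g in generated) / len(generated)

# Toy "embeddings" (assumed values; real embeddings have hundreds of dimensions)
E_o = [1.0, 0.0]                            # original question
E_g = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]  # N = 3 generated questions

relevancy = mean_cosine(E_o, E_g)
print(round(relevancy, 3))  # (1.0 + 0.0 + 0.6) / 3 -> 0.533
```

The three similarities are 1.0 (same direction), 0.0 (orthogonal), and 0.6, so the mean lands a little above 0.5.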


# How does this work?

For each provided answer, the system asks the LLM to do two things: 

1. Generate a question that fits the answer. 

2. Classify if the answer is "noncommittal" (evasive or not directly answering the question). 

The noncommittal classification is a simple 0 or 1 flag: 1 means the answer is evasive or dodges the question, and 0 means it commits to a direct answer. The flag feeds into the relevancy score, lowering it when the answer is evasive, even if the generated question closely matches the original.

This task uses the `QUESTION_GEN` prompt, effectively turning the answer (and its context) back into a question as if trying to reverse-engineer what the original question could have been. 
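As a rough illustration of what comes back from that single LLM call, here is a sketch that parses a reply carrying both pieces of information. The JSON shape, the sample reply, and the helper `parse_generation` are assumptions made for this example; the real prompt's output format may differ.

```python
import json

# Hypothetical raw LLM reply (assumed format, for illustration only)
raw_reply = '{"question": "What is a 401(k) plan?", "noncommittal": 0}'

def parse_generation(reply):
    """Return (generated_question, is_noncommittal) from a JSON reply."""
    data = json.loads(reply)
    return data["question"], bool(data["noncommittal"])

generated_question, is_noncommittal = parse_generation(raw_reply)
print(generated_question, is_noncommittal)
```

The generated question goes on to the similarity step, while the flag is held back to adjust the final score.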

### **Question Generation and Answer Classification**

  - This is a single step that accomplishes two tasks:
    - creating a question that matches the provided answer
    - assessing the answer's directness or relevancy.

  - You can specify the number of questions to generate per answer through the `strictness` argument. Generating more questions per answer gives a more thorough evaluation.

In [None]:
from ragas.metrics import answer_relevancy

In [None]:
answer_relevancy.question_generation.__dict__

### **Computing Answer Relevancy**

- 📏 Similarity between the original and generated questions is measured using embeddings, specifically through cosine similarity.

- 🎯 The relevancy of an answer is determined by how close the embeddings of the original question are to those of the generated question(s).

- 📊 The final score averages the similarity scores across all generated questions; the `strictness` setting controls how many questions there are to average over.

- 🔻 Noncommittal answers lead to a score penalty, ensuring only direct, relevant answers achieve high scores.
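The bullets above can be sketched as a small scoring function. One assumption to flag loudly: I treat the noncommittal penalty as all-or-nothing (any flagged generation zeroes the score), which is my understanding of how ragas applies it but is worth checking against the version you have installed. The function name and the similarity values are made up for illustration.

```python
def answer_relevancy_score(similarities, noncommittal_flags):
    """Mean similarity, zeroed if any generation was flagged noncommittal.

    Assumption: the penalty is all-or-nothing rather than a partial discount.
    """
    if not similarities:
        raise ValueError("need at least one similarity score")
    mean_similarity = sum(similarities) / len(similarities)
    penalized = any(noncommittal_flags)
    return 0.0 if penalized else mean_similarity

# Same similarities, different noncommittal flags (toy values)
direct = answer_relevancy_score([0.92, 0.88, 0.90], [0, 0, 0])
evasive = answer_relevancy_score([0.92, 0.88, 0.90], [0, 1, 0])
print(round(direct, 2), evasive)
```

Under this assumption, an evasive answer scores 0 even when its generated questions sit very close to the original.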



In [None]:
from ragas import evaluate

score = evaluate(
    dataset,
    llm=llm,
    embeddings=embed_model,
    metrics=[answer_relevancy])

In [None]:
score

In [None]:
score.to_pandas()

# Recap

**Input:** An original question and an answer (with context).

**Process:** Use an LLM to generate question(s) from the answer, classify the answer as committal or noncommittal, calculate the similarity between the original and generated questions, and adjust the score based on that classification.

**Output:** A relevancy score ranging from 0 to 1, with 1 indicating high relevancy (meaning the answer directly and accurately addresses the original question) and 0 indicating low relevancy.