In [None]:
%%capture
!pip install langchain==0.1.13 openai==1.14.2 ragas==0.1.7 langchain-openai==0.1.1 langchain-cohere==0.1.0rc1

In [None]:
import os
import sys
from dotenv import load_dotenv
from getpass import getpass
import nest_asyncio

nest_asyncio.apply()
load_dotenv()

In [None]:
# Use .get() so a missing variable falls through to the prompt instead of raising KeyError
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY') or getpass("Enter your OpenAI API key: ")

In [None]:
# Use .get() so a missing variable falls through to the prompt instead of raising KeyError
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")

In [None]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_cohere.embeddings import CohereEmbeddings

llm = ChatOpenAI(
    model="gpt-3.5-turbo-0125",
    openai_api_key=OPENAI_API_KEY,  # pass the key explicitly in case it was entered via getpass
)

embed_model = CohereEmbeddings(
    cohere_api_key=CO_API_KEY,
)

I've got an [example dataset](https://huggingface.co/datasets/explodinggradients/fiqa/viewer/ragas_eval?row=1) in my Hugging Face repo that we'll use in the next several videos.

You don't need to sign up for a Hugging Face account to download the repo, but if you do end up creating an account [feel free to follow me](https://huggingface.co/harpreetsahota)!

In [9]:
from datasets import load_dataset 

dataset = load_dataset("explodinggradients/fiqa", split='baseline', trust_remote_code=True)

dataset = dataset.rename_column("ground_truths", "ground_truth")

# Function to concatenate list of strings into a single string
def flatten_list_of_strings(example):
    # 'ground_truth' holds a list of strings; join them with spaces into one string
    example['ground_truth'] = ' '.join(example['ground_truth'])
    return example

# Apply the function to each example in the dataset
dataset = dataset.map(flatten_list_of_strings)

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

# 🔍 **Answer Relevancy**

- 🎯 [Answer Relevancy](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_answer_relevance.py) measures how directly an answer addresses the question asked.

- 🔍 The process involves generating hypothetical questions from the answer and comparing these to the original question to assess similarity.

- 📍 Its core focus is on identifying answers that precisely address the query without veering off-topic.

- 📈 Scoring ranges from 0 to 1, with higher scores indicating a closer match between the answer and the question.

- 🏆 The metric rewards answers that are directly applicable and penalizes those that include irrelevant details.

- 📐 It calculates mean cosine similarity between the original and reverse-engineered questions to quantify relevancy.

$$\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o)$$


$E_{g_i}$ is the embedding of the $i$-th generated question.

$E_o$ is the embedding of the original question.

$N$ is the number of generated questions, which is 3 by default.
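
To make the formula concrete, here is a minimal sketch in plain Python. The two-dimensional vectors are made-up stand-ins for real embeddings, and the helper names (`cosine_similarity`, `answer_relevancy_score`) are my own for illustration, not part of ragas.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_relevancy_score(original_embedding, generated_embeddings):
    """Mean cosine similarity between E_o and each E_{g_i} (the formula above)."""
    sims = [cosine_similarity(g, original_embedding) for g in generated_embeddings]
    return sum(sims) / len(sims)

# Toy embeddings (assumed values): E_o for the original question,
# and N = 3 generated-question embeddings E_{g_1}, E_{g_2}, E_{g_3}.
E_o = [1.0, 0.0]
E_g = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

score = answer_relevancy_score(E_o, E_g)
print(round(score, 4))
```

The three similarities here are 1, 0, and $1/\sqrt{2}$, so one off-topic generated question is enough to pull the average down noticeably.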


# How does this work?

For each provided answer, the system asks the LLM to do two things: 

1. Generate a question that fits the answer. 

2. Classify if the answer is "noncommittal" (evasive or not directly answering the question). 

The noncommittal classification is a simple 0-or-1 flag: 1 means the answer is noncommittal (evasive, vague, or ambiguous) and 0 means it is committal. The flag affects the relevancy score, reducing it if the answer is evasive, even if the generated question closely matches the original.

This task uses the `QUESTION_GEN` prompt, effectively turning the answer (and its context) back into a question as if trying to reverse-engineer what the original question could have been. 

### **Question Generation and Answer Classification**

  - This is a single step that accomplishes two tasks:
    - creating a question that matches the provided answer
    - classifying whether the answer is noncommittal.

  - You can specify the number of questions to generate per answer through the `strictness` argument (the $N$ in the formula above). Generating more questions per answer allows for a more thorough evaluation.
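
One way to picture what `strictness` controls is the loop below. This is only a sketch of the idea, not ragas's code: `generate_questions` and `fake_llm` are hypothetical names, and the fake LLM just returns a canned string so the example runs offline.

```python
from typing import Callable, List

def generate_questions(answer: str, strictness: int,
                       ask_llm: Callable[[str], str]) -> List[str]:
    """Ask the (hypothetical) LLM for one question per unit of strictness."""
    if strictness < 1:
        raise ValueError("strictness must be at least 1")
    return [ask_llm(answer) for _ in range(strictness)]

def fake_llm(answer: str) -> str:
    # Stand-in for a real LLM call; a real model would return varied questions.
    return f"What question is answered by: {answer!r}?"

questions = generate_questions(
    "Yes, you can send a money order from USPS as a business.",
    strictness=3,
    ask_llm=fake_llm,
)
print(len(questions))
```

With a real model each call can produce a differently worded question, which is why averaging over several of them gives a steadier score than relying on one.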

In [10]:
from ragas.metrics import answer_relevancy

In [11]:
answer_relevancy.question_generation.__dict__

{'name': 'question_generation',
 'instruction': 'Generate a question for the given answer and Identify if answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don\'t know" or "I\'m not sure" are noncommittal answers',
 'output_format_instruction': 'The output should be a well-formatted JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output JSON schema:\n```\n{"type": "object", "properties": {"question": {"title": "Question", "type": "string"}, "noncommittal": {"title": "Noncommittal", "type": "integer
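
The schema in the prompt above asks the LLM for a JSON object with a `question` string and a `noncommittal` integer. Here's a small sketch of reading such a response. The `GeneratedQuestion` class, the `parse_generation` helper, and the sample response string are all invented for illustration; ragas has its own parsing.

```python
import json
from dataclasses import dataclass

@dataclass
class GeneratedQuestion:
    """Hypothetical container mirroring the two fields in the schema."""
    question: str
    noncommittal: int  # 1 = noncommittal, 0 = committal

def parse_generation(raw: str) -> GeneratedQuestion:
    """Parse one JSON response into the two expected fields."""
    data = json.loads(raw)
    return GeneratedQuestion(
        question=str(data["question"]),
        noncommittal=int(data["noncommittal"]),
    )

# Made-up example of what a well-formed response could look like.
raw_response = (
    '{"question": "Can I send a money order from USPS as a business?", '
    '"noncommittal": 0}'
)

parsed = parse_generation(raw_response)
print(parsed.question, parsed.noncommittal)
```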

### **Computing Answer Relevancy**

- 📏 Similarity between the original and generated questions is measured using embeddings, specifically through cosine similarity.

- 🎯 The relevancy of an answer is determined by how close the embeddings of the original question are to those of the generated question(s).

- 📊 The final score averages the similarity scores across all generated questions; the `strictness` setting determines how many questions go into that average.

- 🔻 Noncommittal answers lead to a score penalty, ensuring only direct, relevant answers achieve high scores.
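
The sketch below shows how the averaging and the penalty can fit together. It assumes the penalty zeroes the score whenever any generated result is flagged noncommittal, which is my reading of the linked source for the ragas version pinned in this notebook; treat that as an assumption and check the source for your version. The function name and the input values are made up.

```python
def penalized_relevancy(similarities, noncommittal_flags):
    """Mean similarity, multiplied by 0 if any flag marks the answer noncommittal.

    Assumption: the penalty is all-or-nothing (score * 0), not a partial reduction.
    """
    mean_similarity = sum(similarities) / len(similarities)
    is_noncommittal = any(flag == 1 for flag in noncommittal_flags)
    return mean_similarity * (0 if is_noncommittal else 1)

# Same (made-up) similarities, with and without a noncommittal flag.
direct = penalized_relevancy([0.9, 0.8, 0.85], [0, 0, 0])
evasive = penalized_relevancy([0.9, 0.8, 0.85], [0, 1, 0])
print(round(direct, 2), evasive)
```

Under that assumption, an answer can score exactly 0 even when its generated questions look very similar to the original.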



In [12]:
from ragas import evaluate

score = evaluate(
    dataset,
    llm=llm,
    embeddings=embed_model,
    metrics=[answer_relevancy])

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

In [13]:
score

{'answer_relevancy': 0.5696}

In [14]:
score.to_pandas()

| | question | ground_truth | answer | contexts | answer_relevancy |
|---|---|---|---|---|---|
| 0 | How to deposit a cheque issued to an associate... | Have the check reissued to the proper payee.Ju... | \nThe best way to deposit a cheque issued to a... | [Just have the associate sign the back and the... | 0.946728 |
| 1 | Can I send a money order from USPS as a business? | Sure you can. You can fill in whatever you wa... | \nYes, you can send a money order from USPS as... | [Sure you can. You can fill in whatever you w... | 0.982546 |
| 2 | 1 EIN doing business under multiple business n... | You're confusing a lot of things here. Company... | \nYes, it is possible to have one EIN doing bu... | [You're confusing a lot of things here. Compan... | 0.885912 |
| 3 | Applying for and receiving business credit | "I'm afraid the great myth of limited liabilit... | \nApplying for and receiving business credit c... | [Set up a meeting with the bank that handles y... | 0.762772 |
| 4 | 401k Transfer After Business Closure | You should probably consult an attorney. Howev... | \nIf your employer has closed and you need to ... | [The time horizon for your 401K/IRA is essenti... | 0.75428 |
| 5 | What are the ins/outs of writing equipment pur... | Most items used in business have to be depreci... | \nWriting equipment purchases off as business ... | [You would report it as business income on Sch... | 0.915925 |
| 6 | Can a entrepreneur hire a self-employed busine... | Yes. I can by all means start my own company a... | \nYes, an entrepreneur can hire a self-employe... | [Yes. I can by all means start my own company ... | 0.998377 |
| 7 | Intentions of Deductible Amount for Small Busi... | "If your sole proprietorship losses exceed all... | \nThe intention of deductible amounts for smal... | ["Short answer, yes. But this is not done thro... | 0.861985 |
| 8 | How can I deposit a check made out to my busin... | You should have a separate business account. M... | \nYou can deposit a check made out to your bus... | ["I have checked with Bank of America, and the... | 0.956847 |
| 9 | Filing personal with 1099s versus business s-c... | Depends whom the 1099 was issued to. If it was... | \nFiling personal taxes with 1099s versus fili... | [Depends whom the 1099 was issued to. If it wa... | 0.0 |
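
Row 9 scores 0.0 while every other row shown is above 0.75. As a quick bit of arithmetic, the sketch below averages just the ten displayed scores with and without that row, to show how much a single zero moves the mean. It only covers the rows printed above, not the full 30-row dataset behind the overall score, and the variable names are my own.

```python
from statistics import mean

# answer_relevancy values copied from the ten rows displayed above
displayed_scores = [
    0.946728, 0.982546, 0.885912, 0.762772, 0.75428,
    0.915925, 0.998377, 0.861985, 0.956847, 0.0,
]

mean_all = mean(displayed_scores)
mean_nonzero = mean(s for s in displayed_scores if s > 0)
zero_rows = [i for i, s in enumerate(displayed_scores) if s == 0]

print(f"mean of displayed rows:     {mean_all:.4f}")
print(f"mean excluding zero scores: {mean_nonzero:.4f}")
print(f"rows scoring exactly 0:     {zero_rows}")
```

When an aggregate looks low, listing the rows that score exactly 0 is a cheap first check before digging into individual answers.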


# Recap

**Input:** An original question and an answer (with context).

**Process:** Use the LLM to generate question(s) from the answer and classify the answer as committal or noncommittal, then calculate the similarity between the original and generated questions and adjust the result based on the noncommittal flag.

**Output:** A relevancy score ranging from 0 to 1, with 1 indicating high relevancy (meaning the answer directly addresses the original question) and 0 indicating low relevancy.