# Prerequisites

In [1]:
%pip install -q -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [16]:
# Import the os module to interact with the operating system
import os

# Import the load_dotenv function from the dotenv module
from dotenv import load_dotenv

# Call the load_dotenv function to load environment variables from a .env file
load_dotenv()

import warnings
warnings.filterwarnings('ignore')

In [17]:
# Set the 'HUGGINGFACEHUB_API_TOKEN' environment variable from the value returned by os.getenv() (loaded from .env above)
os.environ['HUGGINGFACEHUB_API_TOKEN'] = os.getenv("HUGGINGFACEHUB_API_TOKEN")

In [18]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.core import Settings

In [21]:
Settings.embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Index Creation

In [9]:
# Import necessary classes from llama_index.core and load documents from the "data" directory into memory
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()

In [22]:
# Create a VectorStoreIndex from the loaded documents, displaying a progress bar during the process
index = VectorStoreIndex.from_documents(documents, show_progress=True)

Parsing nodes: 100%|██████████| 191/191 [00:00<00:00, 334.63it/s]
Generating embeddings: 100%|██████████| 311/311 [00:35<00:00,  8.73it/s]


# Query Engine Setup

In [24]:
# Convert the VectorStoreIndex into a query engine for executing search queries
query_engine = index.as_query_engine(llm=None)

ValueError: 
******
Could not load OpenAI model. If you intended to use OpenAI, please check your OPENAI_API_KEY.
Original error:
No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys

To disable the LLM entirely, set llm=None.
******

# Vector Index Retrieve

In [14]:
# Import necessary classes from llama_index.core
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.indices.postprocessor import SimilarityPostprocessor

# Create a VectorIndexRetriever with the previously created index and a top-k similarity setting
retriever = VectorIndexRetriever(index=index, similarity_top_k=4)

# Create a SimilarityPostprocessor with a similarity cutoff
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.80)

# Create a RetrieverQueryEngine with the retriever and postprocessor
query_engine = RetrieverQueryEngine(retriever=retriever, node_postprocessors=[postprocessor])
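
The retriever and postprocessor above apply two filters in sequence: keep the `similarity_top_k` best-scoring nodes, then drop any node whose score falls below `similarity_cutoff`. The following is a minimal pure-Python sketch of that behaviour, not the library's implementation; the `filter_nodes` helper, the node names, and the scores are hypothetical.

```python
def filter_nodes(scored_nodes, top_k, cutoff):
    """Keep the top_k highest-scoring nodes, then drop those below cutoff.

    scored_nodes is a list of (node_name, similarity) pairs (hypothetical data).
    """
    ranked = sorted(scored_nodes, key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in ranked[:top_k] if score >= cutoff]


# Hypothetical similarity scores for five candidate nodes
scored = [
    ("node-a", 0.85),
    ("node-b", 0.79),
    ("node-c", 0.84),
    ("node-d", 0.60),
    ("node-e", 0.81),
]

kept = filter_nodes(scored, top_k=4, cutoff=0.80)
print(kept)
```

With these made-up scores, four nodes survive the top-k step and one of them is then removed by the cutoff, which is why a query can return fewer than `similarity_top_k` source nodes.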

# Response Presentation

In [15]:
# Define a question and use the query engine to get a response
question = "What is llama2 ?"
response = query_engine.query(question)

# Extract Source Node

In [16]:
# Import the pprint_response function from llama_index.core.response.pprint_utils
# and use it to pretty-print the response from the query engine, including the source of the response
from llama_index.core.response.pprint_utils import pprint_response
pprint_response(response, show_source=True)

# Initialize an empty list to store the content of source nodes
source_node_contents = []

# Iterate over each source node in the response
for source_node in response.source_nodes:
    # Extract the text content of the source node and replace newline characters with spaces
    source_node_text = source_node.text.strip().replace('\n', ' ')
    # Append the modified text content to the list
    source_node_contents.append(source_node_text)

Final Response: Llama 2 is a collection of pretrained and fine-tuned
large language models (LLMs) optimized for dialogue use cases. The
models range in scale from 7 billion to 70 billion parameters. Llama
2-Chat, a fine-tuned version of Llama 2, outperforms open-source chat
models on various benchmarks and human evaluations for helpfulness and
safety. The release of Llama 2 aims to enable the community to build
on their work and contribute to the responsible development of LLMs.
______________________________________________________________________
Source Node 1/4
Node ID: 0e461171-5a29-4e8a-99a0-8b6dc530fc76
Similarity: 0.849433502433824
Text: Llama 2: Open Foundation and Fine-Tuned Chat Models Hugo Touvron
∗ Louis Martin † Kevin Stone † Peter Albert Amjad Almahairi Yasmine
Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale
Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem
Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia 

In [33]:
# Import the necessary packages
import os.path
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

# either way we can now query the index
query_engine = index.as_query_engine()
actual_response = query_engine.query(question)
print(actual_response)

Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) optimized for dialogue use cases. The models range in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, known as Llama 2-Chat, outperform open-source chat models on various benchmarks and have been evaluated positively for helpfulness and safety. The release of Llama 2 aims to enable the community to build on their work and contribute to the responsible development of LLMs.


# RAG

- RAG stands for Retrieval-Augmented Generation. It is an architecture that enhances large language models (LLMs) with a retrieval step: before generating text, the system retrieves relevant external documents and passes them to the model as context. This lets the model draw on external knowledge during generation, improving its performance on natural language understanding and generation tasks.

# QAG

- The QAG (Question-Answer Generation) score is a scoring technique that uses an LLM to turn the claims in an output into close-ended (yes/no) questions and then answers each question against a reference. The final score is computed from those answers, so a higher QAG score indicates that more of the output is supported by the reference. We use QAG for all RAG metrics.

1. Use an LLM to extract all claims made in an LLM output.

2. For each claim, ask the ground truth whether it agrees (‘yes’) or not (‘no’) with the claim made.

So for this example LLM output:
"Martin Luther King Jr., the renowned civil rights leader, was assassinated on April 4, 1968, at the Lorraine Motel in Memphis, Tennessee. He was in Memphis to support striking sanitation workers and was fatally shot by James Earl Ray, an escaped convict, while standing on the motel’s second-floor balcony."

A claim would be:
"Martin Luther King Jr. was assassinated on April 4, 1968"

And a corresponding close-ended question would be:
"Was Martin Luther King Jr. assassinated on April 4, 1968?"

You would then take this question, and ask whether the ground truth agrees with the claim. In the end, you will have a number of ‘yes’ and ‘no’ answers, which you can use to compute a score via some mathematical formula of your choice.
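
As a minimal sketch of the last step, the 'yes' and 'no' answers can be reduced to a single number. The formula below (fraction of 'yes' answers) is only one possible choice, and the `qag_score` helper and the sample answers are hypothetical.

```python
def qag_score(answers):
    """Return the fraction of close-ended answers that are 'yes'."""
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [answer.strip().lower() for answer in answers]
    return normalized.count("yes") / len(normalized)


# Hypothetical answers for four claims extracted from an LLM output
answers = ["yes", "yes", "no", "yes"]
print(qag_score(answers))  # 0.75
```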

# Test Cases Used:


1. **Question Asked:** What is llama2?
- **Answer Received:** 
  Llama 2 is a set of pre-trained and fine-tuned large language models (LLMs) with parameters ranging from 7 billion to 70 billion. Llama 2-Chat, the fine-tuned LLMs, are designed for dialogue applications and have demonstrated excellent performance on different benchmarks. They have been positively evaluated by humans for their effectiveness and safety. The purpose of releasing Llama 2 is to encourage collaboration and responsible advancement in LLM development.
- **Number of Source Nodes:** 4
  - **Similarity:** 85.18
  - **Similarity:** 84.18
  - **Similarity:** 82.72
  - **Similarity:** 81.86

# Metrics used for RAG testing

- Faithfulness
- Answer Relevancy
- Contextual Precision
- Contextual Recall
- Contextual Relevancy

# Faithfulness

The QAG scorer is well suited to RAG metrics, especially for tasks with clear objectives. To compute faithfulness using QAG:
- Extract all claims from the LLM output.
- For each claim, determine whether it agrees with or contradicts each node in the retrieval context.
- Pose close-ended questions such as "Does the claim align with the reference text?" for each node.
- Count the truthful claims (answers of 'yes' and 'idk') and divide by the total number of claims made.
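
The final step above can be sketched in a few lines. This is an illustration of the arithmetic only, not deepeval's implementation; the `faithfulness_score` helper and the verdicts are hypothetical, and each claim is assumed to have already been reduced to a single verdict.

```python
def faithfulness_score(verdicts):
    """Fraction of claims whose verdict is 'yes' or 'idk' (not contradicted)."""
    if not verdicts:
        raise ValueError("at least one claim verdict is required")
    truthful = sum(1 for verdict in verdicts if verdict in ("yes", "idk"))
    return truthful / len(verdicts)


# Hypothetical verdicts for five extracted claims
verdicts = ["yes", "yes", "idk", "no", "yes"]
print(faithfulness_score(verdicts))  # 0.8
```

Counting 'idk' as truthful means a claim only lowers the score when the retrieval context actively contradicts it.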

In [None]:
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case=LLMTestCase(
  input=question, 
  actual_output=str(actual_response),  # pass plain text rather than the Response object
  retrieval_context=source_node_contents
)
metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-3.5-turbo",
    include_reason=True
)


metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.is_successful())

1.0
Great job! The faithfulness score is 1.00 because there are no contradictions present between the actual output and the retrieval context. Keep up the excellent work!
True


# Answer Relevancy

Answer relevancy evaluates how concisely the generated answer addresses the input. It is the ratio of sentences in the output that are relevant to the input to the total number of sentences in the output. Taking the retrieval context into account makes the evaluation more robust, because additional context can justify a sentence that would otherwise look irrelevant.

In [35]:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case=LLMTestCase(
  input=question, 
  actual_output=str(actual_response),  # pass plain text rather than the Response object
  expected_output="Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) optimized for dialogue use cases. The models range in scale from 7 billion to 70 billion parameters. The fine-tuned LLMs, known as Llama 2-Chat, outperform open-source chat models on various benchmarks and have been evaluated positively for helpfulness and safety. The release of Llama 2 aims to enable the community to build on their work and contribute to the responsible development of LLMs.",
  retrieval_context=source_node_contents
)
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-3.5-turbo",
    include_reason=True
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.is_successful())

1.0
Great job! The score is 1.00 because there are no irrelevant statements in the actual output. Keep up the good work!
True


# Contextual Precision

Contextual Precision, a RAG metric, evaluates the quality of the retriever in your RAG pipeline. It focuses on the relevance of the retrieval context. A high score indicates that relevant nodes are ranked higher than irrelevant ones, ensuring that the most pertinent information influences the final output's quality.
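
One ranking-aware formula with the property described above is average precision over the ranked nodes. The sketch below assumes that formula for illustration; whether the library computes the metric exactly this way is an assumption, and the `contextual_precision` helper and relevance flags are hypothetical.

```python
def contextual_precision(relevance):
    """Average precision over a ranked list of relevance flags.

    relevance[i] is True when the node at rank i + 1 is relevant.
    """
    total_relevant = sum(1 for flag in relevance if flag)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / total_relevant


# Relevant nodes ranked first give the maximum score
print(contextual_precision([True, True, False, False]))
# The same nodes ranked lower give a smaller score
print(contextual_precision([False, True, True, False]))
```

The two calls use the same set of nodes; only the order changes, which is exactly what this metric is meant to be sensitive to.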

In [None]:
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase


test_case=LLMTestCase(
  expected_output = "Llama 2 is a series of pretrained and fine-tuned large language models (LLMs) ranging from 7 billion to 70 billion parameters. It includes specialized models like Llama 2-Chat optimized for dialogue tasks. These models outperform many existing open-source chat models and prioritize safety enhancements for responsible use. They are developed through extensive pretraining, fine-tuning, and iterative refinement processes, emphasizing safety measures like safety-specific data annotation and red-teaming. Overall, Llama 2 represents a significant advancement in large language models, promoting responsible development and innovation in the field.",
  input=question, 
  actual_output=str(actual_response),  # pass plain text rather than the Response object
  retrieval_context=source_node_contents
)
metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-3.5-turbo",
    include_reason=True
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.is_successful())



1.0
The score is 1.00 because the relevant nodes, particularly the first node, provide detailed and accurate information about Llama 2, demonstrating a high level of contextual precision. The irrelevant nodes, ranked lower in the retrieval contexts, clearly do not contain any information related to the topic of llama2, ensuring that the relevant nodes are ranked higher for this input.
True


# Contextual Recall

Contextual Recall is another metric for evaluating the retriever in a Retrieval-Augmented Generation (RAG) pipeline. It quantifies the alignment between the retrieved information and the expected output. It measures the proportion of sentences in the ground truth that can be attributed to nodes in the retrieval context, and so gauges how well the retriever supplies the content the generator needs to produce the expected answer.

In [None]:
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

test_case=LLMTestCase(
  expected_output = "Llama 2 is a series of pretrained and fine-tuned large language models (LLMs) ranging from 7 billion to 70 billion parameters. It includes specialized models like Llama 2-Chat optimized for dialogue tasks. These models outperform many existing open-source chat models and prioritize safety enhancements for responsible use. They are developed through extensive pretraining, fine-tuning, and iterative refinement processes, emphasizing safety measures like safety-specific data annotation and red-teaming. Overall, Llama 2 represents a significant advancement in large language models, promoting responsible development and innovation in the field.",
  input=question, 
  actual_output=str(actual_response),  # pass plain text rather than the Response object
  retrieval_context=source_node_contents
)
metric = ContextualRecallMetric(
    threshold=0.7,
    model="gpt-3.5-turbo",
    include_reason=True
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.is_successful())



0.8571428571428571
The score is 0.86 because the expected output aligns well with the information retrieved from the 1st node in the retrieval context, discussing various aspects of the Llama 2 series and its advancements in large language models. The supportive reasons provide solid connections between the expected output sentences and the retrieved information, contributing to a high contextual recall score.
True


# Contextual Relevancy

Contextual relevancy is simply the proportion of sentences in the retrieval context that are relevant to a given input. 

In [None]:
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case=LLMTestCase(
  input=question, 
  actual_output=str(actual_response),  # pass plain text rather than the Response object
  retrieval_context=source_node_contents
)
metric = ContextualRelevancyMetric(
    threshold=0.8,
    model="gpt-3.5-turbo",
    include_reason=True
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
print(metric.is_successful())


1.0
The score is 1.00 because the input is a straightforward question with no additional context needed. Keep up the great work!
True


# Fine-Tuning

Fine-tuning metrics are often used in scenarios where the language model undergoes fine-tuning, which involves adjusting its parameters or training on additional data to improve its performance or behavior. There are two primary objectives for fine-tuning language models:

# Metrics used for fine-tuning testing

- Toxicity
- Hallucination 

# Toxicity

The toxicity metric is a referenceless metric that evaluates toxicity in your LLM outputs.

The evaluating LLM first identifies the opinions expressed in the actual_output by recognizing linguistic cues and patterns associated with sentiment. It then classifies each opinion as toxic or non-toxic. Finally, the toxicity score is computed as the proportion of opinions classified as toxic, giving an overall measure of toxicity in the actual_output.

In [90]:
from deepeval.metrics import ToxicityMetric
from deepeval.test_case import LLMTestCase


metric = ToxicityMetric(threshold=0.5,
    model="gpt-3.5-turbo",
    include_reason=True)

test_case = LLMTestCase(
    input=question,
    actual_output=str(actual_response)  # pass plain text rather than the Response object
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

0.0
The score is 0.00 because there are no reasons provided for toxicity, demonstrating a lack of toxic language or behavior. Great job maintaining a positive and respectful tone in the output.


In [98]:
from deepeval.metrics import ToxicityMetric
from deepeval.test_case import LLMTestCase


metric = ToxicityMetric(threshold=0.5,
    model="gpt-3.5-turbo",
    include_reason=True)

question = "How can I fix this bug?"
actual_response = "You're so incompetent! Can't you fix a simple bug like this? You're a disgrace to the team!"

test_case = LLMTestCase(
    input=question,
    actual_output=actual_response
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

1.0
The score is 1.00 because the statements "You're so incompetent!" and "You're a disgrace to the team!" are direct personal attacks that are toxic and disrespectful.


In [99]:
from deepeval.metrics import ToxicityMetric
from deepeval.test_case import LLMTestCase


metric = ToxicityMetric(threshold=0.5,
    model="gpt-3.5-turbo",
    include_reason=True)

question = "What is the syntax for declaring a function in Python?"
actual_response = "In Python, you can declare a function using the 'def' keyword followed by the function name and parameters."

test_case = LLMTestCase(
    input=question,
    actual_output=actual_response
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

0
The score is 0.00 because there are no reasons provided for toxicity, indicating a non-toxic output.


# Hallucination

The hallucination metric determines whether your LLM generates factually correct information by comparing the actual_output to the provided context.

In [None]:
# Import necessary modules from deepeval
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Convert the actual response to a string
actual_output = str(actual_response)

# Create an LLM test case from the question, the actual output, and the source node contents
test_case = LLMTestCase(
    input=question,
    actual_output=actual_output,
    retrieval_context=source_node_contents,
    context=source_node_contents,
)

# Create a HallucinationMetric with a specific model and threshold
metric = HallucinationMetric(model='gpt-3.5-turbo', threshold=0.5)

# Measure the hallucination metric for the test case
metric.measure(test_case)

# Print the score and reason for the hallucination metric
print(metric.score)
print(metric.reason)

0.0
The hallucination score is 0.00 because the actual output perfectly aligns with the contexts provided, demonstrating no instances of hallucination or generation of false information.


In [96]:
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input_text = "What are the benefits of regular exercise?"
actual_output = "Regular exercise can help improve cardiovascular health, boost mood, and increase energy levels."
context = ["According to a report by the World Health Organization, regular exercise has numerous health benefits, including improving cardiovascular health, boosting mood, and increasing energy levels. It is recommended to engage in at least 150 minutes of moderate-intensity exercise per week for optimal health."]

# Create an LLM test case from the input text, the actual output, and the context
test_case = LLMTestCase(
    input=input_text,
    actual_output=actual_output,
    context=context,
)

# Create a HallucinationMetric with a specific model and threshold
metric = HallucinationMetric(model='gpt-3.5-turbo', threshold=0.5)

# Measure the hallucination metric for the test case
metric.measure(test_case)

# Print the score and reason for the hallucination metric
print(metric.score)
print(metric.reason)

0.0
The score is 0.00 because the actual output perfectly aligns with the provided context, indicating no hallucinations or inaccuracies in the generated information.


In [92]:
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

input_text = "What are the benefits of regular exercise?"
actual_output = "Regular exercise can be beneficial for weight loss and increase energy level."
context = ["According to a report by the World Health Organization, regular exercise has numerous health benefits, including improving cardiovascular health, boosting mood, and increasing energy levels. It is recommended to engage in at least 150 minutes of moderate-intensity exercise per week for optimal health."]

# Create an LLM test case from the input text, the actual output, and the context
test_case = LLMTestCase(
    input=input_text,
    actual_output=actual_output,
    context=context,
)

# Create a HallucinationMetric with a specific model and threshold
metric = HallucinationMetric(model='gpt-3.5-turbo', threshold=0.5)

# Measure the hallucination metric for the test case
metric.measure(test_case)

# Print the score and reason for the hallucination metric
print(metric.score)
print(metric.reason)

1.0
The hallucination score is 1.00 because the actual output only mentions weight loss and fails to provide information on other important health benefits of exercise such as improving cardiovascular health, boosting mood, and the recommended duration of exercise. This lack of comprehensive information contributes to a higher hallucination score.


# BERT Score

BERTScore uses pre-trained contextual embeddings from BERT and matches tokens in the candidate and reference sentences by cosine similarity.

![BERT_Score](images/bert_score.png)
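
The matching step can be illustrated with a small pure-Python sketch. It uses hand-written 2-D vectors in place of real BERT embeddings and omits the optional IDF weighting, so it shows the greedy cosine matching only; the `greedy_bertscore` helper and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    sim = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]
    # Precision: each candidate token takes its best match in the reference
    precision = sum(max(row) for row in sim) / len(candidate_vecs)
    # Recall: each reference token takes its best match in the candidate
    recall = sum(max(col) for col in zip(*sim)) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy "token embeddings" (hypothetical): two candidate tokens, three reference tokens
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

precision, recall, f1 = greedy_bertscore(candidate, reference)
print(precision, recall, f1)
```

Here every candidate token has an exact match in the reference, so precision is 1, while the middle reference token is only partially matched, which pulls recall below 1.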

In [83]:
from evaluate import load

# Load BERTScore metric
bertscore = load("bertscore")

# Prepare predictions and references
predictions = [str(actual_response)] * len(source_node_contents)  # one copy of the answer text per source node
references = source_node_contents

# Compute BERTScore for the given predictions and references
results = bertscore.compute(predictions=predictions, references=references, lang="en")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Precision

In general, precision measures the proportion of relevant retrieved nodes among all retrieved nodes.

In the present scenario, precision would be 1 if every retrieved node is relevant to "Llama 2". If some irrelevant nodes are included, precision would be less than 1.

Formula: Precision = (Number of relevant retrieved nodes) / (Total number of retrieved nodes)

The values printed below are BERTScore precision, a token-level analogue of this idea. For each (response, source node) pair, it averages, over the tokens of the response, the best cosine similarity to any token of the source node. One value is reported per source node.

In [93]:
# Print the Precision score
print("Precision Score")
print(results['precision'])

Precision Score
[0.8349205255508423, 0.8310633897781372, 0.833675742149353, 0.82832270860672]


# Recall

In general, recall measures the proportion of all relevant nodes that were actually retrieved.

In the present scenario, recall would be 1 if every node relevant to "Llama 2" is retrieved. If some relevant nodes are missing, recall would be less than 1.

Formula: Recall = (Number of relevant retrieved nodes) / (Total number of relevant nodes)

The values printed below are BERTScore recall. For each (response, source node) pair, it averages, over the tokens of the source node, the best cosine similarity to any token of the response. One value is reported per source node.

In [94]:
# Print the Recall Score
print("Recall Score")
print(results['recall'])

Recall Score
[0.6818783283233643, 0.7535042762756348, 0.7584319710731506, 0.738237202167511]


# F1 Score

The F1 score is the harmonic mean of precision and recall. It balances both metrics, so if either precision or recall is low, the F1 score will also be low.

The F1 score is calculated using the following formula:

F1 = 2 × (precision × recall) / (precision + recall)

Where:
- Precision is the proportion of relevant retrieved nodes among all retrieved nodes.
- Recall is the proportion of relevant nodes that were actually retrieved.
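
As a quick check of the harmonic-mean definition, the sketch below recomputes F1 from the first precision and recall values printed earlier in this notebook. The `f1_score` helper is hypothetical and is not part of the `evaluate` library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    if total == 0:
        return 0.0
    return 2 * precision * recall / total


# First precision and recall values from the BERTScore results above
precision = 0.8349205255508423
recall = 0.6818783283233643

f1 = f1_score(precision, recall)
print(round(f1, 4))
```

The result agrees, up to floating-point rounding, with the first F1 value reported by BERTScore.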

In [95]:
# Print the F1 Score
print("F1 Score")
print(results['f1'])

F1 Score
[0.7506785988807678, 0.7903856635093689, 0.7942757606506348, 0.7806897759437561]
