# RAG Evaluation

The simple-rag notebook demonstrated the RAG approach with the flan-ul2 model from watsonx.ai. This notebook, the next in the series, evaluates the output of that solution on the same data using LlamaIndex. Two LlamaIndex metrics are used: Faithfulness (whether the response is supported by the retrieved context, i.e. a check for hallucination) and Relevancy (whether the response and the retrieved context address the query). The same two metrics can also guide the choice of chunk size, which is covered at the end of the notebook. Evaluation could be done with other frameworks and metrics, such as ROUGE or Ragas; LlamaIndex was chosen for its ability to connect to multiple data sources and its rich capabilities. Some of the ideas in this notebook come from https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5

## Contents

This notebook contains the following:
1. Setup and installation of the dependencies
2. Dataset preparation
3. Code for evaluation
4. Optimal chunk size evaluation

## Install dependencies

The cell below might take a few minutes to download and install the required libraries.

In [1]:
!pip install "ibm-watson-machine-learning>=1.0.320" 
!pip install "pydantic>=1.10.0" 
!pip install langchain 
!pip install huggingface
!pip install huggingface-hub
!pip install sentence-transformers
!pip install llama-index
!pip install spacy

Collecting pydantic>=1.10.0
  Downloading pydantic-2.4.2-py3-none-any.whl (395 kB)
Collecting annotated-types>=0.4.0
  Downloading annotated_types-0.6.0-py3-none-any.whl (12 kB)
Collecting typing-extensions>=4.6.1
  Downloading typing_extensions-4.8.0-py3-none-any.whl (31 kB)
Collecting pydantic-core==2.10.1
  Downloading pydantic_core-2.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Installing collected packages: typing-extensions, annotated-types, pydantic-core, pydantic
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.5.0
    Uninstalling typing_extensions-4.5.0:
      Successfully uninstalled typing_extensions-4.5.0
Successfully installed annotated-types-

Installing collected packages: typing-inspect, sniffio, marshmallow, jsonpointer, jsonpatch, dataclasses-json, anyio, langsmith, langchain
Successfully installed anyio-3.7.1 dataclasses-json-0.6.1 jsonpatch-1.33 jsonpointer-2.4 langchain-0.0.319 langsmith-0.0.47 marshmallow-3.20.1 sniffio-1.3.0 typing-inspect-0.9.0
Collecting huggingface
  Downloading huggingface-0.0.1-py3-none-any.whl (2.5 kB)
Installing collected packages: huggingface
Successfully installed huggingface-0.0.1
Collecting huggingface-hub
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting fsspec>=2023.5.0
  Downloading fsspec-2023.9.2-py3-none-any.whl (173 kB)
Collecting filelock
  Downloading filelock-3.12.4-py3-none-any.whl (11 kB)
Installing collected 

Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125940 sha256=83a637668988efbc0c63d52bbb48606b2958f1ac813e5f19fa46377836ce45d8
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-transformers
Installing collected packages: safetensors, regex, nltk, huggingface-hub, tokenizers, transformers, sentence-transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.18.0
    Uninstalling huggingface-hub-0.18.0:
      Successfully uninstalled huggingface-hub-0.18.0
Successfully installed huggingface-hub-0.17.3 nltk-3.8.1 regex-2023.10.3 safetensors-0.4.0 sentence-transformers-2.2.2 tokenizers-0.14.1 transformers-4.34.1
Collecting llama-index
  Downloading llama_index-0.8.47-py

Installing collected packages: SQLAlchemy, nest-asyncio, tiktoken, dataclasses-json, openai, llama-index
  Attempting uninstall: SQLAlchemy
    Found existing installation: SQLAlchemy 1.4.39
    Uninstalling SQLAlchemy-1.4.39:
      Successfully uninstalled SQLAlchemy-1.4.39
  Attempting uninstall: nest-asyncio
    Found existing installation: nest-asyncio 1.5.5
    Uninstalling nest-asyncio-1.5.5:
      Successfully uninstalled nest-asyncio-1.5.5
  Attempting uninstall: dataclasses-json
    Found existing installation: dataclasses-json 0.6.1
    Uninstalling dataclasses-json-0.6.1:
      Successfully uninstalled dataclasses-json-0.6.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sparkmagic 0.20.0 requires nest-asyncio==1.5.5, but you have nest-asyncio 1.5.8 which is incompatible.
Successfully installed SQLAlchemy-2.0.22 dataclasses-json-0.

### watsonx.ai API Connection

In [1]:
import os, getpass
credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": getpass.getpass("Please enter your WML api key (hit enter): ")
}

Please enter your WML api key (hit enter): ········


### Project ID and data download

In [2]:
try:
    project_id = os.environ["PROJECT_ID"]
    
except KeyError:
    project_id = input("Please enter your project_id (hit enter): ")

In [4]:
!pip install wget
import wget

filename = 'companyPolicies.txt'
url = 'https://raw.github.com/ravisrirangam/chunking_techniques/main/data/companypolicies.txt'

wget.download(url, out=filename)
print('file downloaded')

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9673 sha256=6a7b5d2e55a3a36e76e03b183b448dc092988db8aa034bbac47a8c10e7baa366
  Stored in directory: /tmp/wsuser/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
file downloaded


In [5]:
import nest_asyncio

nest_asyncio.apply()

from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    ServiceContext,
    LLMPredictor
)
from llama_index.embeddings import (
    LangchainEmbedding
)
from llama_index.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator
)

import time
import os

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import DecodingMethods
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM

### Create the flan-ul2 model

In [6]:
model_id = ModelTypes.FLAN_UL2
parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.MIN_NEW_TOKENS: 130,
    GenParams.MAX_NEW_TOKENS: 200
}

model1 = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id
)

model2 = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id
)

Two flan-ul2 model instances are created: one to answer queries in the RAG pipeline and one to act as the evaluator. Wrap both so they can be used as LangChain LLMs.

In [7]:
flan_ul2_llm = WatsonxLLM(model=model1)
eval_flan_ul2_llm = WatsonxLLM(model=model2)

### Create the embedding model (a sentence-transformers model is used this time)

In [8]:
embedding_llm = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
)


### Create the LLMPredictor for the dataset generator

In [9]:
eval_llm_predictor = LLMPredictor(llm=eval_flan_ul2_llm)
llm_predictor = LLMPredictor(llm=flan_ul2_llm)

In [10]:
from langchain.document_loaders import TextLoader
loader = TextLoader(filename)
documents = loader.load()
#print(documents[0])

In [16]:
print(type(documents[0].page_content))

<class 'str'>


### Manually create the document for the LlamaIndex DatasetGenerator

In [11]:
from llama_index import Document
eval_docs = [Document(text=documents[0].page_content)]

In [12]:
service_context_eval = ServiceContext.from_defaults(llm_predictor=eval_llm_predictor, embed_model=embedding_llm)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Create 10 questions from the data for evaluation. The number of questions is limited to 10 for brevity and to keep execution time and resource usage manageable.

In [13]:
data_generator = DatasetGenerator.from_documents(eval_docs, service_context = service_context_eval)
eval_questions = data_generator.generate_questions_from_nodes(num = 10)

The generated questions can be checked by printing them.
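
For example, a small helper along the following lines numbers the questions for easier review. This is only a sketch: the name `format_questions` and the sample strings are placeholders introduced here, and in the notebook the list to pass would be `eval_questions`.

```python
def format_questions(questions):
    """Return the questions as a numbered, newline-separated string."""
    return "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))


# Placeholder questions for illustration only; in the notebook, pass eval_questions.
sample_questions = [
    "What is the first policy about?",
    "Who does the policy apply to?",
]
print(format_questions(sample_questions))
```
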

Define the Faithfulness and Relevancy evaluators.


In [14]:
faithfulness = FaithfulnessEvaluator(service_context=service_context_eval)
relevancy = RelevancyEvaluator(service_context=service_context_eval)
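
Both evaluators ask the evaluation LLM for a judgment and expose it as a pass/fail flag (`.passing`) on the result. As a rough illustration of that idea only, and not LlamaIndex's actual code, the sketch below maps a free-text verdict to a boolean. The function name `verdict_to_passing` and the strict "first word is yes" rule are assumptions made for this example.

```python
def verdict_to_passing(raw_verdict: str) -> bool:
    """Map a free-text YES/NO verdict from a judge LLM to a boolean.

    Hypothetical helper for illustration: a verdict whose first word is
    "yes" (case-insensitive, ignoring trailing punctuation) is a pass;
    anything else, including an empty verdict, is a fail.
    """
    words = raw_verdict.strip().lower().split()
    if not words:
        return False
    # Strip trailing punctuation so that "Yes." or "YES," still counts.
    return words[0].strip(".,:;!") == "yes"


print(verdict_to_passing("YES"))
print(verdict_to_passing("No, the context does not support this."))
```

A stricter or looser rule changes the scores, so the parsing convention is worth checking whenever the evaluation LLM is swapped.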

Define a function that takes the chunk size as input and computes the average response time, average faithfulness and average relevancy.

In [15]:
def evaluate(chunk_size):
    """Return average response time, faithfulness and relevancy for a chunk size."""
    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # Build a vector index over the documents using the requested chunk size
    llm_predictor = LLMPredictor(llm=flan_ul2_llm)
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size=chunk_size, embed_model=embedding_llm)
    vector_index = VectorStoreIndex.from_documents(
        eval_docs, service_context=service_context
    )

    query_engine = vector_index.as_query_engine()
    num_questions = len(eval_questions)

    for question in eval_questions:
        # Time only the query (retrieval + generation), not the evaluation
        start_time = time.time()
        response_vector = query_engine.query(question)
        elapsed_time = time.time() - start_time

        faithfulness_result = faithfulness.evaluate_response(
            response=response_vector
        ).passing

        relevancy_result = relevancy.evaluate_response(
            query=question, response=response_vector
        ).passing

        # .passing is a bool, so these sums count the passing responses
        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy
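
The aggregation step in `evaluate` can also be written separately from the LlamaIndex calls. The sketch below is one possible way to do that with only the standard library. `QuestionResult` and `summarize` are hypothetical names introduced for this example, and raising on an empty question list is a design choice of the sketch, not something the notebook does.

```python
from statistics import mean
from typing import Iterable, NamedTuple, Tuple


class QuestionResult(NamedTuple):
    """Per-question outcome (hypothetical container for this sketch)."""
    elapsed_s: float
    faithful: bool
    relevant: bool


def summarize(results: Iterable[QuestionResult]) -> Tuple[float, float, float]:
    """Return (avg response time, avg faithfulness, avg relevancy)."""
    rows = list(results)
    if not rows:
        # Avoids a ZeroDivisionError when no questions were generated.
        raise ValueError("no questions were evaluated")
    return (
        mean(float(r.elapsed_s) for r in rows),
        # Booleans are averaged as 0.0/1.0, giving the fraction that passed.
        mean(float(r.faithful) for r in rows),
        mean(float(r.relevant) for r in rows),
    )
```

Keeping the per-question rows, instead of only running totals, also makes it possible to inspect which individual questions failed.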

Execute the function with chunk sizes of 256, 512 and 1024. This might take some time to complete.

In [16]:
for chunk_size in [256, 512, 1024]:
    print(" computing for Chunk Size:  ", chunk_size)
    avg_time, avg_faithfulness, avg_relevancy = evaluate(chunk_size)
    print(f"Chunk size {chunk_size} - Average Response time: {avg_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")

 computing for Chunk Size:   256
Chunk size 256 - Average Response time: 4.03s, Average Faithfulness: 1.00, Average Relevancy: 0.67
 computing for Chunk Size:   512
Chunk size 512 - Average Response time: 5.01s, Average Faithfulness: 0.67, Average Relevancy: 1.00
 computing for Chunk Size:   1024
Chunk size 1024 - Average Response time: 5.02s, Average Faithfulness: 1.00, Average Relevancy: 1.00


## Analysis of results

For this data, the chunk size of 256 did not retrieve the complete content from the vector store, as expected for small chunks, so the average relevancy was low at 0.67; faithfulness was 1.00, indicating no hallucination. For the chunk size of 512, the average relevancy was 1.00, possibly because the content of some policies fit within the chunk limit and matched the questions, but faithfulness dropped to 0.67, indicating possible hallucination. The chunk size of 1024 is optimal for this use case, as both average relevancy and average faithfulness are 1.00. Because these averages come from at most 10 pass/fail judgments, they are indicative rather than conclusive.
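
The selection described above can be expressed as a simple rule. The sketch below is one possible rule, not the only reasonable one: it prefers the chunk size whose weaker quality score is highest and breaks ties by the lower average response time. The function name `pick_chunk_size` is hypothetical; the numbers are copied from the run output in this notebook.

```python
# (chunk_size, avg_response_time_s, avg_faithfulness, avg_relevancy),
# copied from the run output in this notebook.
results = [
    (256, 4.03, 1.00, 0.67),
    (512, 5.01, 0.67, 1.00),
    (1024, 5.02, 1.00, 1.00),
]


def pick_chunk_size(rows):
    """Pick the chunk size with the best worst-case quality score.

    Ties on quality are broken in favour of the faster configuration.
    """
    best = max(rows, key=lambda r: (min(r[2], r[3]), -r[1]))
    return best[0]


print(pick_chunk_size(results))  # prints 1024
```

Using the minimum of the two scores penalizes configurations that do well on one metric but poorly on the other, which matches the reasoning in the analysis.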