### Clear memory

In [1]:
%reset -f
import gc
gc.collect()

0

### Import

In [None]:
import pandas as pd
import sys, os
from pathlib import Path
from tqdm.auto import tqdm
from langchain_huggingface import HuggingFaceEmbeddings
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_precision, context_recall
from datasets import Dataset
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
%matplotlib inline

In [3]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

In [4]:
sys.path.append('..')
from src.langchain_RAG import setup_data_collection, langchain_rag_pipeline
from src.data_utils import create_chunks, load_and_analyze_pdf, download_file

# Dataset Creation 

To evaluate different chunk sizes, overlap sizes, and $k$ values, we create a test dataset manually. The dataset consists of question-answer pairs, each labeled as either relevant or irrelevant to the AWS documentation. There are 12 relevant and 3 irrelevant examples. For relevant questions, the page numbers of the source passages are also provided.

In [5]:
test_dataset = [
    {
        'id': 0,
        'question': 'How do I make a tasty pizza?',
        'reference_answer': 'This question is not related to AWS documentation.',
        'page_numbers': [],
        'category': 'irrelevant',
    },
    {
        'id': 1,
        'question': 'Who is the current president of USA?',
        'reference_answer': 'This question is not related to AWS documentation.',
        'page_numbers': [],
        'category': 'irrelevant',
    },
    {
        'id': 2,
        'question': 'What is the capital of Germany?',
        'reference_answer': 'This question is not related to AWS documentation.',
        'page_numbers': [],
        'category': 'irrelevant',
    },
    {
        'id': 3,
        'question': 'If I do not have an AWS account, what do I do?',
        'reference_answer': 'To sign up for an AWS account open https://portal.aws.amazon.com/billing/signup and follow the online instructions.',
        'page_numbers': [10],
        'category': 'relevant',
    },
    {
        'id': 4,
        'question': 'What if I want to allow people outside of my AWS account to access my AWS resources?',
        'reference_answer': 'You can create a role that users in other accounts or people outside of your organization can use to access your resources. You can specify who is trusted to assume the role. For services that support resource-based policies or access control lists (ACLs), you can use those policies to grant people access to your resources.',
        'page_numbers': [114],
        'category': 'relevant',
    },
    {
        'id': 5,
        'question': 'What is the AWS Toolkit for Microsoft Azure DevOps?',
        'reference_answer': 'AWS Toolkit for Microsoft Azure DevOps is an extension for Microsoft Azure DevOps that contains tasks you can use in build and release definitions to interact with AWS services. It is available through the Visual Studio Marketplace.',
        'page_numbers': [7],
        'category': 'relevant',
    },
    {
        'id': 6,
        'question': 'Where can I install the AWS Toolkit for Azure DevOps extension?',
        'reference_answer': 'You can install the AWS Toolkit for Azure DevOps extension from the Extensions for Azure DevOps Visual Studio Marketplace. The direct link is https://marketplace.visualstudio.com/items?itemName=AmazonWebServices.aws-vsts-tools',
        'page_numbers': [12],
        'category': 'relevant',
    },
    {
        'id': 7,
        'question': 'How many ways can I supply AWS credentials to tasks?',
        'reference_answer': 'You can supply credentials in four ways: using a service connection, through named variables in your build (AWS.AccessKeyID, AWS.SecretAccessKey, AWS.SessionToken), through standard AWS environment variables in the build agent process, or with Amazon EC2 build agents using instance metadata.',
        'page_numbers': [13, 14, 15],
        'category': 'relevant',
    },
    {
        'id': 8,
        'question': 'What permissions does AWS Send SNS or SQS Message task require?',
        'reference_answer': 'This task requires permissions to call the following AWS service APIs (depending on selected task options, not all APIs may be used): sns:GetTopicAttributes, sns:Publish, sqs:GetQueueAttributes and sqs:SendMessage.',
        'page_numbers': [95],
        'category': 'relevant',
    },
    {
        'id': 9,
        'question': 'Which deployment types does the AWS Lambda .NET Core task support?',
        'reference_answer': 'The task supports two deployment types: Function (deploys a single function or creates a package zip file) and Serverless Application (performs deployment using AWS CloudFormation for multiple functions or builds and uploads to S3).',
        'page_numbers': [76,77],
        'category': 'relevant',
    },
    {
        'id': 10,
        'question': 'What structure does AWS CLI use?',
        'reference_answer': 'The AWS CLI uses a multipart structure on the command line: <command> <subcommand> [options and parameters].',
        'page_numbers': [33,34],
        'category': 'relevant',
    },
    {
        'id': 11,
        'question': 'Can I create an S3 bucket automatically with the S3 Upload task?',
        'reference_answer': 'Yes, you can select the checkbox "Create S3 bucket if it does not exist" and the task will attempt to create the bucket if it does not exist. Note that bucket names must be globally unique.',
        'page_numbers': [79],
        'category': 'relevant',
    },
    {
        'id': 12,
        'question': 'What is the maximum timeout value for the AWS CodeDeploy deployment task?',
        'reference_answer': 'The default maximum timeout is 60 minutes.',
        'page_numbers': [56],
        'category': 'relevant',
    },
    {
        'id': 13,
        'question': 'What AWS Shell Script task is used for?',
        'reference_answer': 'Runs a shell script in Bash, setting AWS credentials and Region information into the shell environment using the standard environment keys AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN and AWS_REGION.',
        'page_numbers': [39],
        'category': 'relevant',
    },
    {
        'id': 14,
        'question': 'Which task should I use to push a Docker image to Amazon ECR?',
        'reference_answer': 'You should use the Amazon ECR Push task (also known as Amazon Elastic Container Registry Push Image Task). This task pushes a Docker image identified by name with optional tag, or image ID to ECR.',
        'page_numbers': [51],
        'category': 'relevant',
    },
]

test_df = pd.DataFrame(test_dataset)
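
Because the dataset is written by hand, a quick consistency check can catch labeling slips early. The sketch below assumes the convention described earlier (relevant questions carry at least one page number, irrelevant ones carry none). The helper `find_label_problems` and the three-row `sample` are hypothetical and only mirror the schema of `test_dataset`.

```python
import pandas as pd

# Hypothetical three-row sample that mirrors the schema of test_dataset.
sample = [
    {'id': 0, 'question': 'How do I make a tasty pizza?',
     'page_numbers': [], 'category': 'irrelevant'},
    {'id': 3, 'question': 'If I do not have an AWS account, what do I do?',
     'page_numbers': [10], 'category': 'relevant'},
    {'id': 7, 'question': 'How many ways can I supply AWS credentials to tasks?',
     'page_numbers': [13, 14, 15], 'category': 'relevant'},
]


def find_label_problems(df: pd.DataFrame) -> list:
    """Return the ids of rows whose page_numbers disagree with their category."""
    has_pages = df['page_numbers'].apply(len) > 0
    is_relevant = df['category'] == 'relevant'
    # A row is consistent when "has pages" and "is relevant" agree.
    return df.loc[has_pages != is_relevant, 'id'].tolist()


sample_df = pd.DataFrame(sample)
problems = find_label_problems(sample_df)
print(problems)
```

Running the same helper on `test_df` should return an empty list if every entry follows the convention.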


Check dataset

In [6]:
print(f'Created {len(test_df)} test questions')
print(f'Relevant: {(test_df["category"] == "relevant").sum()}')
print(f'Irrelevant: {(test_df["category"] == "irrelevant").sum()}')

test_df.head()

Created 15 test questions
Relevant: 12
Irrelevant: 3


| | id | question | reference_answer | page_numbers | category |
|---|---|---|---|---|---|
| 0 | 0 | How do I make a tasty pizza? | This question is not related to AWS documentat... | [] | irrelevant |
| 1 | 1 | Who is the current president of USA? | This question is not related to AWS documentat... | [] | irrelevant |
| 2 | 2 | What is the capital of Germany? | This question is not related to AWS documentat... | [] | irrelevant |
| 3 | 3 | If I do not have an AWS account, what do I do? | To sign up for an AWS account open https://por... | [10] | relevant |
| 4 | 4 | What if I want to allow people outside of my A... | You can create a role that users in other acco... | [114] | relevant |


# Prepare Data for RAGAS Evaluation 

### Database Setup
Set up the vector store collection.

In [7]:
vectorstore = setup_data_collection()

### Run RAG Pipeline on Test Dataset

Run the LangChain RAG pipeline to generate an answer for each question in the test dataset. We collect all generated answers, along with the original questions, retrieved context chunks, and reference answers, into a DataFrame. We use Claude Haiku 4.5 as the LLM with temperature set to 0.3.

In [None]:
eval_results = []

for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc='Running RAG'):
    question = row['question']

    rag_result = langchain_rag_pipeline(query=question, vectorstore=vectorstore, k=5)

    eval_results.append({
        'question': question,
        'answer': rag_result['answer'],
        'contexts': rag_result['contexts'],
        'ground_truth': row['reference_answer'],
    })


eval_df = pd.DataFrame(eval_results)

Save and load dataframe

In [None]:
eval_df.to_pickle('../data/processed/eval_results.pkl')

eval_df = pd.read_pickle('../data/processed/eval_results.pkl')

Check dataframe 

In [10]:
eval_df.head()

| | question | answer | contexts | ground_truth |
|---|---|---|---|---|
| 0 | How do I make a tasty pizza? | The provided documentation chunks do not conta... | [ate Stack ...................................... | This question is not related to AWS documentat... |
| 1 | Who is the current president of USA? | This question is not answered in the provided ... | [WS account 1. Open https://portal.aws.amazon.... | This question is not related to AWS documentat... |
| 2 | What is the capital of Germany? | The answer to this question is not contained i... | [ble. Note: The Regions listed in the picker a... | This question is not related to AWS documentat... |
| 3 | If I do not have an AWS account, what do I do? | Based on the provided documentation, if you do... | [WS account 1. Open https://portal.aws.amazon.... | To sign up for an AWS account open https://por... |
| 4 | What if I want to allow people outside of my A... | According to the documentation, if you want to... | [perform: iam:PassRole In this case, Mary's po... | You can create a role that users in other acco... |


### Convert DataFrame to RAGAS Dataset

In [11]:
ragas_dataset = Dataset.from_pandas(eval_df)

# RAGAS Evaluation

### LLM Evaluator Setup

For evaluation, we use Claude Sonnet 4.5 because it provides higher-quality responses than Haiku. We wrap it with the RAGAS LangChain wrapper, `LangchainLLMWrapper`.

In [None]:
load_dotenv()
api_key = os.getenv('ANTHROPIC_API_KEY')

evaluator_llm = LangchainLLMWrapper(
    ChatAnthropic(
        model='claude-sonnet-4-5-20250929',
        anthropic_api_key=api_key,
        temperature=0,
    )
)

### Embedding Function Setup 

We use the same embedding function as in the RAG pipeline, wrapped for RAGAS compatibility.

In [13]:
embedding_function = HuggingFaceEmbeddings(
        model_name='sentence-transformers/all-MiniLM-L6-v2',
        model_kwargs={'device': device},
        encode_kwargs={'normalize_embeddings': True, 'batch_size': 32},
    )

evaluator_embeddings = LangchainEmbeddingsWrapper(embedding_function)

### Run Evaluation

We use RAGAS to compute four metrics: answer relevancy, faithfulness, context precision, and context recall. The batch size is set to 4 to reduce API call rate and avoid timeout errors.

In [None]:
ragas_results = evaluate(
    ragas_dataset,
    metrics=[
        answer_relevancy,
        faithfulness,
        context_precision,
        context_recall,
    ],
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    batch_size=4,
)

metrics_df = ragas_results.to_pandas()

Print results


In [None]:
print('Average value for every metric')
metrics_df[['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']].mean()
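
The overall mean mixes relevant and irrelevant questions, which may score very differently on some metrics. A per-category breakdown, sketched below, can make that visible. The scores in `toy_metrics` are invented purely for illustration and are not results from this notebook. In practice the `category` column would have to be copied over from `test_df`, assuming the row order is preserved through evaluation.

```python
import pandas as pd

# Hypothetical scores, invented for illustration only (not notebook results).
toy_metrics = pd.DataFrame({
    'category': ['irrelevant', 'irrelevant', 'relevant', 'relevant'],
    'answer_relevancy': [0.0, 0.2, 0.8, 1.0],
    'faithfulness': [1.0, 1.0, 0.9, 0.7],
})


def mean_by_category(df: pd.DataFrame, metric_cols: list) -> pd.DataFrame:
    """Average each metric separately within every question category."""
    return df.groupby('category')[metric_cols].mean()


by_category = mean_by_category(toy_metrics, ['answer_relevancy', 'faithfulness'])
print(by_category)
```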


# Chunking Strategy Evaluation

### Define Test Configurations 

Using RAGAS, we can evaluate different chunking strategies, retrieval sizes ($k$), and embedding functions. In this project, we focus on chunking strategies only. We fix the embedding function and $k=5$ as a good trade-off between cost and quality.

In [16]:
chunk_configs = [
    {'chunk_size': 200, 'overlap': 50, 'EOS': 25, 'k': 5},
    {'chunk_size': 500, 'overlap': 100, 'EOS': 50, 'k': 5},
    {'chunk_size': 1000, 'overlap': 200, 'EOS': 100, 'k': 5},
]
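
To make `chunk_size` and `overlap` concrete, here is a minimal sketch of character-based sliding-window chunking. It is not the project's `create_chunks` implementation, which lives in `src.data_utils` and is not shown here. The function `sliding_window_chunks` is hypothetical, and the `EOS` parameter is deliberately left out because its exact behavior is not described in this notebook.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must satisfy 0 <= overlap < chunk_size')

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


text = 'abcdefghij' * 3  # 30 characters
chunks = sliding_window_chunks(text, chunk_size=10, overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

With a larger overlap the window advances more slowly, so the same document produces more chunks and more duplicated text in the vector store.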

### Load PDF Document 

We use the same AWS documentation as before. This approach can be applied to any technical document.

In [None]:
PDF_URL = 'https://docs.aws.amazon.com/pdfs/vsts/latest/userguide/vsts-ug.pdf'
PDF_NAME = 'aws_vsts.pdf'

download_file(url=PDF_URL, name=PDF_NAME, overwrite=False)

df_pdf = load_and_analyze_pdf(Path('../data/raw/') / PDF_NAME)
Path('../data/processed/').mkdir(parents=True, exist_ok=True)


### Run Evaluation Loop

For each configuration:
1. Create chunks with specified size
2. Build vector store
3. Run RAG pipeline on test dataset
4. Calculate RAGAS metrics
5. Store averaged results

In [None]:
all_results = []

for config in tqdm(chunk_configs, desc='Testing configurations'):
    print(f"\nTesting: chunk_size = {config['chunk_size']}")

    # 1. Create chunks with specific size
    chunks = create_chunks(
        pages_data=df_pdf,
        chunk_size=config['chunk_size'],
        overlap=config['overlap'],
        EOS=config['EOS'],
    )
    chunks.to_json(
        path_or_buf=f'../data/processed/chunks_{config["chunk_size"]}.json',
        orient='records',
        force_ascii=False,
        indent=4
    )

    # 2. Create vectorstore
    vectorstore = setup_data_collection(
        chunks_filename=f'chunks_{config["chunk_size"]}',
        collection_name=f'aws_docs_langchain_{config["chunk_size"]}',
        overwrite=True,
    )

    # 3. Run the test questions through the RAG pipeline
    eval_results = []
    for idx, row in test_df.iterrows():
        rag_result = langchain_rag_pipeline(
            query=row['question'],
            vectorstore=vectorstore,
            k=config['k'],
            llm_name='claude-haiku-4-5-20251001',
        )
        eval_results.append({
            'question': row['question'],
            'answer': rag_result['answer'],
            'contexts': rag_result['contexts'],
            'ground_truth': row['reference_answer'],
        })

    eval_df = pd.DataFrame(eval_results)

    # Save and load dataframe
    eval_df.to_pickle(f'../data/processed/eval_results_{config["chunk_size"]}.pkl')
    eval_df = pd.read_pickle(f'../data/processed/eval_results_{config["chunk_size"]}.pkl')


    # 4. Calculate RAGAS metrics
    ragas_dataset = Dataset.from_pandas(eval_df)

    ragas_results = evaluate(
        ragas_dataset,
        metrics=[
            answer_relevancy,
            faithfulness,
            context_precision,
            context_recall,
        ],
        llm=evaluator_llm,
        embeddings=evaluator_embeddings,
        batch_size=4,
    )

    metrics_df = ragas_results.to_pandas()

    # 5. Store results
    all_results.append({
        'chunk_size': config['chunk_size'],
        'k': config['k'],
        'answer_relevancy': metrics_df['answer_relevancy'].mean(),
        'faithfulness': metrics_df['faithfulness'].mean(),
        'context_precision': metrics_df['context_precision'].mean(),
        'context_recall': metrics_df['context_recall'].mean(),
    })


comparison_df = pd.DataFrame(all_results)


Save and load the DataFrame.

In [None]:
comparison_df.to_pickle('../data/processed/comparison_df.pkl')

comparison_df = pd.read_pickle('../data/processed/comparison_df.pkl')

### Analyse Results
Display evaluation results sorted by faithfulness score.

In [28]:
comparison_df.sort_values(by='faithfulness', ascending=False)

| | chunk_size | k | answer_relevancy | faithfulness | context_precision | context_recall |
|---|---|---|---|---|---|---|
| 2 | 1000 | 5 | 0.690639 | 0.991667 | 0.674306 | 1.0 |
| 1 | 500 | 5 | 0.638778 | 0.968254 | 0.652315 | 0.933333 |
| 0 | 200 | 5 | 0.441749 | 0.943452 | 0.64963 | 0.9 |


**Observations:** 
- Larger chunk sizes consistently yield better scores across all metrics
- All chunking strategies achieve **faithfulness** scores above 0.9
- The most significant variation across configurations is in **answer_relevancy**
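
These observations can be checked directly from the reported table. The sketch below re-enters the values shown above by hand (in the notebook, `comparison_df` would be used instead) and computes, for each metric, the spread between the best and worst configuration and the chunk size that scores highest.

```python
import pandas as pd

# Values copied from the results table above.
reported = pd.DataFrame({
    'chunk_size': [200, 500, 1000],
    'answer_relevancy': [0.441749, 0.638778, 0.690639],
    'faithfulness': [0.943452, 0.968254, 0.991667],
    'context_precision': [0.64963, 0.652315, 0.674306],
    'context_recall': [0.9, 0.933333, 1.0],
}).set_index('chunk_size')

# Spread per metric: best score minus worst score across configurations.
spread = (reported.max() - reported.min()).sort_values(ascending=False)

# Chunk size that achieves the highest score for each metric.
best_chunk_size = reported.idxmax()

print(spread.round(3))
print(best_chunk_size)
```

With only 15 questions and a single run per configuration, small spreads such as the one for context precision should be read with caution.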

# Summary
In this notebook the following steps were accomplished: 

1. Created evaluation dataset of 15 test questions (12 relevant, 3 irrelevant) with reference answers and page numbers

2. Implemented RAGAS evaluation with 4 metrics:
   - Answer Relevancy - how well answers address the question
   - Faithfulness - whether answers stay grounded in context
   - Context Precision - quality of retrieved chunks
   - Context Recall - coverage of relevant information

3. Tested 3 chunking strategies (200, 500, 1000 char chunks)

**Key Findings:**
- **Larger chunks (1000 chars) perform best** across all metrics
- High faithfulness (>0.9) - minimal hallucination across all configs
- Answer relevancy improves significantly with larger chunks (0.44 → 0.69)
- Perfect context recall (1.0) with 1000-char chunks


**Chunking configuration for the next step:**
- Chunk size: 1000 characters
- Overlap: 200 characters
- k: 5 retrieved chunks
- End of sentence: 100

