### Author: Guilherme Resende

This notebook assesses how well the candidate embeddings retrieve adequate chunks of context and support the generation of relevant answers.

For simplicity's sake, I kept the default OpenAI models for both the embedding (`text-embedding-ada-002`) and the language model (`gpt-3.5-turbo-instruct`). The chunking strategies tested here are:

- Merging Documents:
    - Chunk Size: 1024
    - Chunk Size: 2048
- Keeping Original Documents:
    - Chunk Size: 2048
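
The notebook does not show how the vector stores were built, so the following is only a minimal sketch of how the two strategies differ. It uses plain fixed-size character windows rather than the `RecursiveCharacterTextSplitter` imported below, and `split_text`, `chunk_merged`, and `chunk_per_document` are hypothetical helper names, not part of the original pipeline.

```python
def split_text(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive character windows of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_merged(docs: list[str], chunk_size: int) -> list[str]:
    """'Merging documents': join everything first, so a chunk may span two documents."""
    return split_text("\n\n".join(docs), chunk_size)


def chunk_per_document(docs: list[str], chunk_size: int) -> list[str]:
    """'Keeping original documents': chunk each document separately."""
    return [chunk for doc in docs for chunk in split_text(doc, chunk_size)]


# Toy documents (placeholder content, not the AWS corpus)
docs = ["a" * 1500, "b" * 700]

merged = chunk_merged(docs, 1024)          # 3 chunks: 1024, 1024 and 154 characters
separate = chunk_per_document(docs, 2048)  # 2 chunks, one per document
```

With merging, the second chunk contains text from both documents; with per-document chunking, no chunk ever crosses a document boundary.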
    
I decided to use a Retrieval-Augmented Generation (RAG) approach. Since this approach builds its output from previously retrieved documents, and the most important aspect of a response is its text quality and relevance, I validate the output quality against three metrics:

- Faithfulness: The proportion of claims in the answer that are supported by the retrieved context, relative to the total number of claims in the answer. The closer to 1, the better.

- Answer Relevancy: The model is asked to generate N questions from the generated answer. We then compute the average cosine similarity between the original question and these "artificial questions". The closer to 1, the better.

- Semantic Quality: Manual validation that ranks the responses by how well they fit the ground truth.
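
To make the two automatic metrics concrete, here is a minimal sketch of the arithmetic behind them. It assumes the claim verdicts and the embedding vectors are already available (in Ragas these come from LLM and embedding calls), and it follows the Ragas definition of answer relevancy, which compares the original question with the questions generated from the answer. The function names are hypothetical.

```python
import math


def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevancy_score(
    question_vec: list[float], generated_question_vecs: list[list[float]]
) -> float:
    """Mean cosine similarity between the original question and each generated question."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy inputs (placeholders, not real model outputs)
faith = faithfulness_score([True, True, False, True])                 # 3 of 4 claims -> 0.75
relevancy = answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # (1 + 0) / 2 -> 0.5
```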

The chunking approach that performs best on the examples will be chosen as the final solution.

In [1]:
import json
import os

# Load the API key from a local JSON file
with open("credentials.json", "r") as f:
    credentials = json.load(f)

os.environ["OPENAI_API_KEY"] = credentials["OPENAI_API_KEY"]

In [5]:
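# Workaround: Chroma needs a newer SQLite than some systems ship,
# so pysqlite3 is swapped in before anything imports sqlite3
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')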

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator
)

In [None]:
persist_directory = "db_chunk_size_1024"

embedding = OpenAIEmbeddings()

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [None]:
# Define the number of closest neighbors to be considered in the search
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

**Build a model based on a processing chain**
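
As a rough illustration of what `chain_type="stuff"` means in the cell below: all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch assumes a hypothetical prompt wording and a hypothetical helper name, `build_stuff_prompt`; the template LangChain actually uses is worded differently.

```python
def build_stuff_prompt(question: str, documents: list[str]) -> str:
    """Concatenate every retrieved chunk into one prompt (illustrative wording only)."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the k=3 retrieved documents
prompt = build_stuff_prompt(
    "What is an AWS S3 bucket?",
    ["chunk one", "chunk two", "chunk three"],
)
```

Because every chunk ends up in the same prompt, the chunk size and the retriever's `k` together determine how much of the model's context window is consumed.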

In [None]:
model = OpenAI()

qa_chain = RetrievalQA.from_chain_type(
    llm=model, 
    chain_type="stuff", 
    retriever=retriever, 
    return_source_documents=True
)

**Validation Questions**

In [None]:
questions = [
    "How much space can I use in S3 Free Tier?",
    "How many storage classes are there in S3?",
    "How to control inbound and outbound traffic in RDS?",
    "What types of databases does RDS support?",
    "How to create a VPC in AWS console?",
    "How many roles can a user have in AWS IAM?",
    "How do I define temporary roles in AWS for people from outside of my organization?",
    "What is an AWS S3 bucket?",
    "What is AWS KMS?",
    "How many cryptographic algorithms can I choose in AWS KMS, and what are they?"
]

ground_truths = [
    "The AWS Free Tier for Amazon S3 provides 5 GB of standard storage",
    """Amazon Simple Storage Service (S3) has seven storage classes: S3 Standard, \
S3 Intelligent-Tiering, S3 Standard-Infrequent Access (S3 Standard-IA), \
S3 One Zone-Infrequent Access (S3 One Zone-IA), S3 Glacier, S3 Glacier Deep Archive, and S3 Outposts""",
    """To control inbound and outbound traffic to an RDS database, you can use \
AWS Security Groups and Network Access Control Lists (NACLs)""",
    """Amazon Relational Database Service (RDS) supports a variety of database types, \
including: Amazon Aurora, AWS Oracle, MariaDB, Microsoft SQL Server, MySQL, and PostgreSQL""",
    """To create a Virtual Private Cloud (VPC) in the AWS console, you can do the following: \
Open the AWS console; Select Services; Type "VPC" in the search box and select VPC from the list; \
Select Your VPCs from the navigation pane on the left; Click Create VPC; Enter a name for the VPC in the Name tag field; \
For IPv4 CIDR block, you can either manually enter an IPv4 address range or select an IPAM-allocated IPv4 CIDR block; \
Click Yes, Create""",
    "The default maximum number of roles that can be created per profile in an AWS Identity and Access Management (IAM) account is 250.",
    """To define temporary roles in AWS for people outside of your organization, you can use the AWS Security \
Token Service (AWS STS) to create temporary security credentials. These credentials can be granted to trusted \
users to access your AWS resources. You can also use the AWS External ID to ensure that only authorized \
entities can assume a role. Service providers can use External IDs to assume roles on behalf of their customers.""",
    """An AWS S3 bucket is a container for storing objects in Amazon Web Services (AWS) Simple Storage \
Service (S3). S3 buckets are similar to file folders and can be used to store, retrieve, back up, and \
access objects""",
    """AWS Key Management Service (KMS) gives you control over the cryptographic keys used to protect \
your data. AWS KMS provides you with centralized control over the lifecycle and permissions of your keys.""",
    """AWS Key Management Service (KMS) supports multiple cryptographic algorithms, including asymmetric \
and symmetric algorithms. Asymmetric algorithms: RSA 2048, RSA 3072, RSA 4096, ECC NIST P-256, ECC NIST P-384, \
ECC NIST-521, ECC SECG P-256k1. Symmetric algorithms: Advanced Encryption Standard (AES) with 128-, \
192-, or 256-bit keys, Triple DES (3DES) which uses three 56-bit keys."""
]

contexts = []
answers = []
for q in questions:
    # Run the chain once and reuse the documents it actually retrieved
    result = qa_chain(q)
    answers.append(result["result"])
    contexts.append([doc.page_content for doc in result["source_documents"]])

In [None]:
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths
}

dataset = Dataset.from_dict(data)

In [None]:
evals = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy],
)

In [None]:
df_evals = evals.to_pandas()

In [None]:
df_evals