## <span style="color:#ff8000">About this notebook</span>

This is a supplementary notebook for the session AG __Studi Kasus: Document Q&A__ (Case Study: Document Q&A) at __ICoDSE 2025__.

### <span style="color:#47c7fc">Contents</span>

This notebook contains Python code that uses the LangChain framework to build and evaluate the different components of a RAG pipeline.

- Indexing Pipeline
    - Data Loading
    - Chunking (or Data Splitting)
    - Embeddings (or Data Transformation)
    - Storage (Vector Databases)

- Generation Pipeline
    - Search & Retrieval
    - Prompt Augmentation
    - LLM Generation

## <span style="color:#ff8000">Installing Dependencies</span>

All the libraries needed to run this notebook, along with their versions, are listed in the __requirements.txt__ file in the root directory of this repository.

Go to the root directory and run the following command to install the libraries:

```
pip install -r requirements.txt
```

This is the recommended method of installing the dependencies.


_Alternatively, you can run the command from this notebook. The relative path may vary, so make sure that you are in the root directory of this repository._

In [1]:
%pip install -r ./requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.





---

## <span style="color:#ff8000">Indexing Pipeline</span>


### <span style="color:#47c7fc">Data Loading</span>

In [None]:
filepath='./Assets/Data'

In [None]:
import textwrap  # used below to wrap the printed text

from langchain_community.document_loaders import PyPDFLoader

#START YOUR CODE HERE
pdfloader=PyPDFLoader(file_path=filepath, mode='single') #instantiate the PyPDFLoader

pdf_data=pdfloader.load() #load the data

print(textwrap.fill(f"{pdf_data[0].page_content[:1000]}", width=150)) #print the first 1000 characters

#END YOUR CODE HERE

<details>
<summary>Click for solution</summary>

```
pdfloader=PyPDFLoader(file_path=filepath, mode='single') #instantiate the PyPDFLoader

pdf_data=pdfloader.load() #load the data

print(textwrap.fill(f"{pdf_data[0].page_content[:1000]}", width=150)) #print the first 1000 characters
```

</details>

### <span style="color:#47c7fc">Data Splitting or Chunking</span>

> Breaking down long pieces of text into manageable sizes is called chunking.

### <span style="color:#47c7fc">Fixed Size Chunking</span>

A very common approach is to pre-determine the size of the chunk and the amount of overlap between the chunks. There are several chunking methods that follow a fixed size chunking approach.

- Character-Based Chunking: Chunks are created based on a fixed number of characters.

- Token-Based Chunking: Chunks are created based on a fixed number of tokens.

- Sentence-Based Chunking: Chunks are defined by a fixed number of sentences.

- Paragraph-Based Chunking: Chunks are created by dividing the text into a fixed number of paragraphs.
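
To make the idea of a fixed chunk size and overlap concrete, here is a minimal pure-Python sketch of character-based chunking. The function name `chunk_by_characters` and the sample text are illustrative assumptions; this is not how LangChain implements its splitters.

```python
def chunk_by_characters(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where
    consecutive chunks share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_by_characters(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the overlap that helps preserve context across chunk boundaries.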

Let's try Character-Based Chunking. 

In [None]:
pdfloader=PyPDFLoader(file_path=filepath,mode="single") #instantiate the PyPDFLoader

pdf_data=pdfloader.load() #load the data

print(len(pdf_data[0].page_content))


In [None]:

#START YOUR CODE HERE



#END YOUR CODE HERE

<details>
<summary>Click for Solution</summary>

```
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "."],  # Separators to split on, tried in order until the pieces are small enough
    chunk_size=1000,  # Maximum number of characters in each chunk
    chunk_overlap=100,  # Number of overlapping characters between consecutive chunks
)

pdf_doc_chunks = text_splitter.split_documents(pdf_data)
```

</details>
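
The solution passes several separators that are tried recursively. The sketch below illustrates that idea in a simplified form: split on the first separator, and split any piece that is still too long on the next one. The function `recursive_split` is a hypothetical helper written for this illustration; unlike `RecursiveCharacterTextSplitter`, it does not merge small pieces back together or add overlap.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split text on the separators in order until each piece fits in
    chunk_size or no separators are left."""
    if len(text) <= chunk_size or not separators:
        return [text]  # small enough, or nothing left to split on
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        if not part:
            continue  # skip empty strings produced by adjacent separators
        pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces


demo_text = "aaaa\n\nbbbb bbbb\ncccc"
pieces = recursive_split(demo_text, separators=["\n\n", "\n"], chunk_size=6)
print(pieces)  # ['aaaa', 'bbbb bbbb', 'cccc']
```

Note that `'bbbb bbbb'` is longer than `chunk_size`: once the separators run out, a piece is kept as it is, which is why real chunk lengths can vary.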

Let's check out the distribution of chunk sizes.

Run the cell below.

Remember, the list of chunks must be named `pdf_doc_chunks`.


In [None]:
import matplotlib.pyplot as plt
import numpy as np

data = [len(doc.page_content) for doc in pdf_doc_chunks]  # chunk lengths in characters

plt.boxplot(data)
plt.title('Box Plot of Chunk Lengths')  # Title
plt.xlabel('Chunks')  # Label for x-axis
plt.ylabel('Chunk length (characters)')  # Label for y-axis

plt.show()

print(f"The median chunk length is : {round(np.median(data),2)}")
print(f"The average chunk length is : {round(np.mean(data),2)}")
print(f"The minimum chunk length is : {round(np.min(data),2)}")
print(f"The maximum chunk length is : {round(np.max(data),2)}")
print(f"The 75th percentile chunk length is : {round(np.percentile(data, 75),2)}")
print(f"The 25th percentile chunk length is : {round(np.percentile(data, 25),2)}")

### <span style="color:#47c7fc">Data Transformation or Embeddings</span>

#### __OpenAI Embeddings__

OpenAI, the company behind ChatGPT and the GPT series of Large Language Models, also provides three embedding models.

1. text-embedding-ada-002 was released in December 2022. It has 1536 dimensions, meaning that it converts text into a vector of 1536 numbers.
2. text-embedding-3-small is a small embedding model of 1536 dimensions released in January 2024. Unlike ada-002, it lets users shorten the output vector to suit their needs.
3. text-embedding-3-large is a large embedding model of 3072 dimensions released together with text-embedding-3-small. It is the best performing of the three.


OpenAI models are proprietary. They are accessed through the OpenAI API and are priced by the number of input tokens to be embedded.


Note: You will need an __OpenAI API Key__ which can be obtained from [OpenAI](https://platform.openai.com/api-keys)

To initialize the __OpenAI client__, we need to pass the API key. There are many ways of doing it.

__[Option 1] Creating a .env file for storing the API key and using it (Recommended)__

Install the __python-dotenv__ library, which is imported as `dotenv`.

_The dotenv library is a popular tool used in various programming languages, including Python and Node.js, to manage environment variables in development and deployment environments. It allows developers to load environment variables from a .env file into their application's environment._

- Create a file named .env in the root directory of your project.
- Inside the .env file, define environment variables in the format VARIABLE_NAME=value.

e.g.

OPENAI_API_KEY=YOUR API KEY
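
To show roughly what loading a .env file involves, here is a simplified sketch that parses `VARIABLE_NAME=value` lines into an environment mapping. The function `load_env_text` and the variable `DEMO_API_KEY` are made up for this illustration; the real `load_dotenv` handles many more cases (quoting rules, variable expansion, file discovery), so use the library in practice.

```python
import os


def load_env_text(env_text: str, environ=os.environ) -> dict:
    """Parse KEY=value lines and copy them into environ without overwriting
    variables that are already set. Returns the parsed pairs."""
    parsed = {}
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments and malformed lines
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("'\"")
        parsed[key] = value
        environ.setdefault(key, value)  # keep any value that is already set
    return parsed


# A plain dict stands in for os.environ so the demo has no side effects
fake_environ = {}
load_env_text("# demo file\nDEMO_API_KEY=not-a-real-key\n", environ=fake_environ)
print(fake_environ)  # {'DEMO_API_KEY': 'not-a-real-key'}
```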

In [None]:
from dotenv import load_dotenv
import os

if load_dotenv():
    print("Success: .env file found with some environment variables")
else:
    print("Caution: No environment variables found. Please create .env file in the root directory or add environment variables in the .env file")

We can also test whether the key is valid.

In [None]:
import openai
from openai import OpenAI

# .get() returns None instead of raising KeyError when the variable is missing
api_key = os.environ.get("OPENAI_API_KEY")

if api_key:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    try:
        client.models.list()
        print("OPENAI_API_KEY is set and is valid")
    # Catch the specific subclasses before the openai.APIError base class
    except openai.RateLimitError as e:
        print(f"OpenAI API request exceeded rate limit: {e}")
    except openai.APIConnectionError as e:
        print(f"Failed to connect to OpenAI API: {e}")
    except openai.APIError as e:
        print(f"OpenAI API returned an API Error: {e}")
else:
    print("Please set your OpenAI API key as an environment variable OPENAI_API_KEY")



Now we will use the __OpenAIEmbeddings__ class from the langchain_openai package.

In [None]:
# Import OpenAIEmbeddings from the library
from langchain_openai import OpenAIEmbeddings

os.environ["TOKENIZERS_PARALLELISM"]="false"

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # reused by the vector store cells below

pdf_doc_embeddings = embeddings.embed_documents([chunk.page_content for chunk in pdf_doc_chunks])


In [None]:
print(f"The length of each embedding vector is {len(pdf_doc_embeddings[0])}")
print(f"The embeddings object is an array of {len(pdf_doc_embeddings)} x {len(pdf_doc_embeddings[0])}")

### <span style="color:#47c7fc">Vector Storage</span>

In [None]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

storage_file_path="./Memory"
storage_index_name="PDF_index"

index = faiss.IndexFlatIP(len(pdf_doc_embeddings[0]))

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

vector_store.add_documents(documents=pdf_doc_chunks)

We can also save the vector store to disk so that it persists between sessions!

In [None]:
vector_store.save_local(folder_path=storage_file_path,index_name=storage_index_name)

## <span style="color:#ff8000">Generation Pipeline</span>

## <span style="color:#ff8000">1. Retrieval</span>

In [None]:
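# Load a saved FAISS vector store. allow_dangerous_deserialization=True is required because the
# docstore is stored as a pickle file, so only load index files that you created or trust.
# This cell assumes an index named "CWC_index" already exists in ./Memory.
vector_store = FAISS.load_local(folder_path="./Memory",index_name="CWC_index", embeddings=embeddings, allow_dangerous_deserialization=True)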

# Define a query
query = "Who won the world cup?"

# Perform similarity search
retrieved_docs = vector_store.similarity_search(query, k=2)  # Get top 2 relevant chunks

# Display results
for i, doc in enumerate(retrieved_docs):
    print(textwrap.fill(f"\nRetrieved Chunk {i+1}:\n{doc.page_content}",width=100))
    print("\n\n")


This is the most basic implementation of a retriever in the generation pipeline of a RAG-enabled system. This method of retrieval is enabled by embeddings. We used the text-embedding-3-small from OpenAI. FAISS calculated the similarity score based on these embeddings.
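
To illustrate what a similarity search over embeddings amounts to, here is a small pure-Python sketch that scores every stored vector against the query by inner product and keeps the top k. This mirrors the exhaustive search a flat inner-product index performs, but the helper names and the toy 3-dimensional vectors are assumptions made for this example; FAISS itself is far more optimized.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def top_k_inner_product(query_vector, doc_vectors, k=2):
    """Return (index, score) pairs for the k highest inner products."""
    scores = [
        (i, sum(q * d for q, d in zip(query_vector, doc)))
        for i, doc in enumerate(doc_vectors)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real ones here have 1536 dimensions
toy_docs = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0])]
toy_query = normalize([1.0, 0.1, 0.0])

results = top_k_inner_product(toy_query, toy_docs, k=2)
print([index for index, _ in results])  # [0, 1]
```

The indices returned play the role of the retrieved chunks: the vector store maps each index back to the stored document text.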

---

## <span style="color:#ff8000">2. Augmentation</span>

The information fetched by the retriever must be sent to the LLM along with the user query, in the form of a natural language prompt. This process of combining the user query and the retrieved information is called augmentation.


In [None]:
# Join the retrieved chunks with a blank line so that they do not run into each other
retrieved_context="\n\n".join(doc.page_content for doc in retrieved_docs)

# Creating the prompt
augmented_prompt=f"""

Given the context below answer the question.

Question: {query} 

Context : {retrieved_context}

Remember to answer only based on the context provided and not from any other source. 

If the question cannot be answered based on the provided context, say I don’t know.

"""

print(textwrap.fill(augmented_prompt,width=150))
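
The same augmentation step can be wrapped in a small helper so that it works for any number of retrieved chunks. This is a sketch only: `build_augmented_prompt` and the demo inputs are hypothetical names introduced here, and the prompt wording is copied from the cell above.

```python
def build_augmented_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the user question and the retrieved chunks into one prompt."""
    # Separate chunks with a blank line so that they do not run together
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Given the context below answer the question.\n\n"
        f"Question: {question}\n\n"
        f"Context: {context}\n\n"
        "Remember to answer only based on the context provided and not from any other source.\n\n"
        "If the question cannot be answered based on the provided context, say I don't know."
    )


demo_prompt = build_augmented_prompt(
    "What colour is the sky?",
    ["The sky is blue on a clear day.", "Sunsets can look orange."],
)
print(demo_prompt)
```

With the retriever output from this notebook, the call would look like `build_augmented_prompt(query, [doc.page_content for doc in retrieved_docs])`.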

---

## <span style="color:#ff8000">3. Generation</span>

Generation is the final step of this pipeline. While LLMs may be used in any of the previous steps in the pipeline, the generation step is completely reliant on the LLM. The most popular LLMs are those developed by OpenAI, Anthropic, Meta, Google, Microsoft and Mistral, among others.

We have built a simple retriever using FAISS and OpenAI embeddings, and we have created a simple augmented prompt. Now we will use OpenAI's gpt-4o-mini model to generate the response.

In [None]:
from langchain_openai import ChatOpenAI


# Set up the LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2
)

messages=[("human",augmented_prompt)]

ai_msg = llm.invoke(messages)

In [None]:
ai_msg.content

And there you have it. The response is rooted in the HTML document and based on the chunks retrieved from the vector database.

In [None]:
#START YOUR CODE HERE

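# Load the saved FAISS vector store. allow_dangerous_deserialization=True is required because the
# docstore is stored as a pickle file, so only load index files that you created or trust.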
vector_store = FAISS.load_local(folder_path="./Memory",index_name="PDF_index", embeddings=embeddings, allow_dangerous_deserialization=True)

# Define a query
query = "How many paternity leaves can I avail"

# Perform similarity search to get top 2 relevant chunks
retrieved_docs = vector_store.similarity_search(query, k=2)

#END YOUR CODE HERE

# Display results
for i, doc in enumerate(retrieved_docs):
    print(textwrap.fill(f"\nRetrieved Chunk {i+1}:\n{doc.page_content}",width=100))
    print("\n\n")

Now craft the augmented prompt!

In [None]:
#START YOUR CODE HERE

# Join the retrieved chunks with a blank line so that they do not run into each other
retrieved_context="\n\n".join(doc.page_content for doc in retrieved_docs)

# Creating the prompt
augmented_prompt=f"""

Given the context below answer the question.

Question: {query} 

Context : {retrieved_context}

Remember to answer only based on the context provided and not from any other source. 

If the question cannot be answered based on the provided context, say I don’t know.

"""

#END YOUR CODE HERE

print(textwrap.fill(augmented_prompt,width=150))

Finally, make the call to the LLM. Use OpenAI's __gpt-4o-mini__ model.

In [None]:
# START YOUR CODE HERE

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=None,
    timeout=None
)

messages=[("human",augmented_prompt)]

ai_msg = llm.invoke(messages)


# END YOUR CODE HERE

print(ai_msg.content)

## <span style="color:#ff8000">Congratulations!</span>
You have completed this introduction to RAG. I hope you had fun. For any queries, please get in touch!