# Chunking the Microsoft Build Book of News 2024 for Retrieval

## Import Libraries 🧑‍💻

We bring in a few libraries here, most of them from LangChain:

1. CharacterTextSplitter, MarkdownHeaderTextSplitter, and RecursiveCharacterTextSplitter to demonstrate how different chunking strategies affect retrieval

2. AzureAIDocumentIntelligenceLoader to load the PDF and convert it to Markdown

3. AzureSearch to store our documents after we have chunked them, and AzureOpenAIEmbeddings to vectorize the chunks before inserting them into Azure Search

4. AzureChatOpenAI to interact with GPT-4o

In [None]:
import os
from dataclasses import dataclass
from typing import List

from dotenv import load_dotenv
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_text_splitters import (
    CharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Read the Azure endpoints and keys from a local .env file
load_dotenv()

## Bring in Azure OpenAI Embeddings 🔢

We will use an Azure OpenAI embeddings model to vectorize the chunks we generate from the Book of News document.

In [None]:
embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment="embeddings",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

## Asking GPT-4o A Question Outside of Its Training Dataset ❓

GPT-4o last received a knowledge update in October 2023, so it will not know about Microsoft Build 2024. Let's ask it a question to demonstrate this.

In [None]:
llm = AzureChatOpenAI(
    azure_deployment="gpt4o",
    temperature=0,
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-01"
)

llm.invoke("Summarize Azure AI Services announcements in the Microsoft Build Book of News for 2024")

## Declare Chunk Parser Class 🧑‍💻

LangChain stores every document in a Document object. The dataclass and helper function below keep only the page_content of each chunk so the chunks display in an easy-to-read format.
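
As a complementary, minimal sketch, the helper below formats each chunk as one short line with its index and character count, which can make long chunk lists easier to scan. The names `Chunk` and `preview_chunks` are hypothetical and not part of LangChain; the only assumption is that each object has a `page_content` string attribute.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for any object with a page_content attribute."""
    page_content: str


def preview_chunks(chunks: List[Chunk], max_chars: int = 80) -> List[str]:
    """Return one summary line per chunk: index, length, and a short preview."""
    lines = []
    for index, chunk in enumerate(chunks):
        # Collapse newlines and runs of spaces so each preview fits on one line
        text = " ".join(chunk.page_content.split())
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        lines.append(f"[{index}] ({len(chunk.page_content)} chars) {text}")
    return lines


for line in preview_chunks([Chunk("# Title\n\nSome   body text.")]):
    print(line)
```

Because the helper only reads `page_content`, it should also accept the chunk lists produced by the splitters later in this notebook.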

In [None]:
@dataclass
class Document:
    page_content: str

def parse_documents(data: List[Document]) -> List[Document]:
    parsed_documents = []
    for doc in data:
        parsed_documents.append(Document(page_content=doc.page_content))
    return parsed_documents

## Load Book of News PDF Document 🔁

Extract Text/Headers from Book of News PDF Document.

In [None]:
loader = AzureAIDocumentIntelligenceLoader(
    file_path="C:\\Users\\conne\\development\\repos\\chunking_for_rag\\Book_Of_News.pdf",
    api_key=os.environ.get("DOCUMENT_INTELLIGENCE_KEY"),
    api_endpoint=os.environ.get("DOCUMENT_INTELLIGENCE_ENDPOINT"),
    api_model="prebuilt-layout",
)
book_of_build = loader.load()

## Print Extracted Pages from Book of News 👾

In [None]:
print(book_of_build)

# Character Splitting

Character splitting is the simplest form of chunking: it divides the text into chunks of N characters without taking the content or structure of the document into account.

Important Concepts:

- chunk_size: The number of characters you would like each chunk to contain; in our case, 500 characters.

- chunk_overlap: The number of characters that consecutive chunks share; in our case, 20 characters. The overlap helps maintain context between chunks.
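
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified illustration, not LangChain's implementation: `CharacterTextSplitter` first splits on a separator (`"\n\n"` by default) and then merges the pieces up to `chunk_size`. The function name `split_by_characters` and the sample string are hypothetical.

```python
from typing import List


def split_by_characters(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and less than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = split_by_characters(sample, chunk_size=10, chunk_overlap=2)
print(sample_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, which is the role `chunk_overlap` plays at a larger scale with 500-character chunks and a 20-character overlap.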

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=20)
char_split_chunks = text_splitter.split_documents(book_of_build)

## Display Chunks 📃

In [None]:
parse_documents(char_split_chunks)

# Header Splitting & Recursive Character Splitting

Here we use a document-specific chunking strategy: we analyze the structure of the document and choose the chunking method that suits it. This can combine several strategies, as in the method below, which first splits on headers and then splits each header section into 600-character chunks with a 100-character overlap.
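
The two-stage idea can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than LangChain's implementation: the second stage uses simple fixed-size windows as a stand-in for `RecursiveCharacterTextSplitter`, which instead tries a list of separators in turn, and the header is kept as a separate field instead of inside the chunk text. The function names and the sample Markdown are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def split_on_headers(markdown: str) -> List[Tuple[str, str]]:
    """Return (header, body) pairs, one per '#', '##', or '###' section."""
    sections: List[Tuple[str, str]] = []
    header, body_lines = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,3} ", line):
            # A new header closes the section collected so far
            if header or body_lines:
                sections.append((header, "\n".join(body_lines).strip()))
            header, body_lines = line.strip(), []
        else:
            body_lines.append(line)
    if header or body_lines:
        sections.append((header, "\n".join(body_lines).strip()))
    return sections


def split_section(body: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut one section body into overlapping fixed-size windows."""
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(body), step):
        pieces.append(body[start:start + chunk_size])
        if start + chunk_size >= len(body):
            break
    return pieces


def header_then_size_split(
    markdown: str, chunk_size: int, chunk_overlap: int
) -> List[Dict[str, str]]:
    """Stage 1: split on headers. Stage 2: split each section by size."""
    chunks = []
    for header, body in split_on_headers(markdown):
        for piece in split_section(body, chunk_size, chunk_overlap):
            chunks.append({"header": header, "text": piece})
    return chunks


sample_markdown = "# Azure AI\nFirst announcement text.\n## Search\nSecond one."
for chunk in header_then_size_split(sample_markdown, chunk_size=12, chunk_overlap=2):
    print(chunk)
```

Because the size-based split runs inside each section, no chunk in this sketch ever spans two headers, which is the main benefit of combining the two strategies.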

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
docs_string = book_of_build[0].page_content
splits = markdown_splitter.split_text(docs_string)

chunk_size = 600
chunk_overlap = 100
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
headers_and_rescursive_chunks = text_splitter.split_documents(splits)

## Display Chunks 📃

In [None]:
parse_documents(headers_and_rescursive_chunks)

# Chunk on Headers

For the last example we are going to chunk on the headers of the document with no further splitting or chunking. 

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
 
docs_string = book_of_build[0].page_content
header_chunks = text_splitter.split_text(docs_string)

## Display Chunks 📃

In [None]:
parse_documents(header_chunks)

## Initialize Azure Search Indexes 🔎

This creates one Azure Search index per chunking strategy so that we can store the chunks from each strategy and test our retrieval.

In [None]:
vector_store_address: str = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password: str = os.getenv("AZURE_SEARCH_KEY")

index_name: str = "charsplit"
char_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

index_name: str = "headerandcharsplit"
header_and_char_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

index_name: str = "headersplit"
header_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

## Upsert Chunks to Their Respective Indexes

For each of the chunking strategies we completed above, let's upsert them into Azure Search so we can perform a similarity search on them and see which one performs best. 
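
Conceptually, a similarity search compares the query's embedding with each chunk's embedding and returns the closest matches. The sketch below illustrates that idea with brute-force cosine similarity over toy two-dimensional vectors. It is an assumption-laden illustration: the metric and index structure actually used by the search service are configured separately and may differ, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vector: Sequence[float], chunk_vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Score every chunk against the query and keep the k best (index, score) pairs."""
    scored = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(chunk_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (made up for illustration; real embeddings have many more dimensions)
toy_chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query_vector = [1.0, 0.1]
ranking = top_k(toy_query_vector, toy_chunk_vectors, k=2)
print(ranking)
```

In this toy example the first and third vectors rank highest because they point in nearly the same direction as the query vector.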

In [None]:
char_split_vector_store.add_documents(documents=char_split_chunks)
header_and_char_split_vector_store.add_documents(documents=headers_and_rescursive_chunks)
header_split_vector_store.add_documents(documents=header_chunks)

# Testing 🧪

Imagine that the chunks retrieved below are fed into an LLM prompt to augment the model's training data. The retrieved chunks heavily influence the quality and accuracy of the generation.
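
As a rough sketch of that augmentation step, the function below joins retrieved chunks into a context block and places it ahead of the question. The prompt wording, the `RetrievedChunk` class, and the `build_augmented_prompt` name are all hypothetical; the only assumption about the search results is that each one has a `page_content` string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievedChunk:
    """Hypothetical stand-in for a document returned by a similarity search."""
    page_content: str


def build_augmented_prompt(question: str, chunks: List[RetrievedChunk]) -> str:
    """Number each retrieved chunk and place the context before the question."""
    context = "\n\n".join(
        f"[{number}] {chunk.page_content.strip()}"
        for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder text only; real chunks would come from one of the vector stores
example_prompt = build_augmented_prompt(
    "What did Build 2024 announce for Azure AI Services?",
    [RetrievedChunk("Placeholder chunk one."), RetrievedChunk("Placeholder chunk two.")],
)
print(example_prompt)
```

The resulting string could then be passed to `llm.invoke(...)`, so any irrelevant or truncated chunk ends up directly in front of the model.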

## Test Chunking Strategy #1 Character Splitting 🧪

Here we will execute a search against chunks that were split every 500 characters with a 20 character overlap.

In [None]:
char_split_docs = char_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(char_split_docs)

## Test Chunking Strategy #2 Header and Character Splitting 🧪

The cell below queries the chunks that we split by header and then split again into 600-character chunks with a 100-character overlap.

In [None]:
header_and_recur_split_docs = header_and_char_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(header_and_recur_split_docs)

## Test Chunking Strategy #3 Header Splitting 🧪

Finally, let's execute a query against chunks where we split on headers only.

In [None]:
header_split_docs = header_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(header_split_docs)
