# Chunking the Microsoft Build Book of News 2024 for Retrieval

In this notebook we will demonstrate various chunking strategies and how they influence retrieval.

[ChunkViz](https://chunkviz.up.railway.app/) is a great website to see various chunking strategies in a visual way. ✂️

[2024 BOOK OF NEWS](https://news.microsoft.com/build-2024-book-of-news/) is the document we are going to leverage to demonstrate various chunking strategies.

## Questions to consider for chunking

1. What is the nature of the content being indexed? Are you working with long documents, such as articles or books, or shorter content, like tweets or instant messages? The answer determines which model is more suitable for your goal and, in turn, which chunking strategy to apply.

2. What are your expectations for the length and complexity of user queries? Will they be short and specific or long and complex? This may inform the way you choose to chunk your content as well so that there’s a closer correlation between the embedded query and embedded chunks.

3. How will the retrieved results be utilized within your specific application? For example, will they be used for semantic search, question answering, summarization, or other purposes?

## Import Libraries 🧑‍💻

We are bringing in a few libraries here, most of them from LangChain:

1. Bringing in the CharacterTextSplitter, MarkdownHeaderTextSplitter, and RecursiveCharacterTextSplitter to demonstrate how different chunking strategies impact your retrieval

2. AzureAIDocumentIntelligenceLoader to load the PDF and convert to Markdown

3. AzureOpenAIEmbeddings to vectorize the chunks prior to inserting them into Azure Search and AzureSearch to store our documents after we have chunked and vectorized them

4. AzureChatOpenAI to interact with GPT-4o

In [52]:
import os
from dotenv import load_dotenv
from langchain_text_splitters import CharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings
from langchain_text_splitters import MarkdownHeaderTextSplitter
from typing import List
from dataclasses import dataclass
from langchain_openai import AzureChatOpenAI
load_dotenv()

True

## Asking GPT-4o A Question Outside of Its Training Dataset ❓

GPT-4o last received a knowledge update in October 2023, so it will not know about Microsoft Build 2024. Let's ask it a question to demonstrate this.

In [53]:
llm = AzureChatOpenAI(
    azure_deployment="gpt4o",
    temperature=0,
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-01"
)

llm.invoke("Summarize Azure AI Services announcements in the Microsoft Build Book of News for 2024")

AIMessage(content="As of my last update in October 2023, I don't have access to specific details from the Microsoft Build Book of News for 2024. However, I can provide a general idea of what such announcements might typically include based on past trends and the evolution of Azure AI Services.\n\nIn the Microsoft Build Book of News, Azure AI Services announcements often cover:\n\n1. **New AI Capabilities and Services**: Introduction of new AI services or significant updates to existing ones, such as enhancements in natural language processing, computer vision, and machine learning models.\n\n2. **Integration and Interoperability**: Announcements about improved integration of AI services with other Microsoft products like Azure, Microsoft 365, Dynamics 365, and Power Platform, making it easier for developers to incorporate AI into their applications.\n\n3. **Developer Tools and SDKs**: Updates on new or improved tools, SDKs, and APIs that simplify the development and deployment of AI so

## Declare Class to Display Chunks 🧑‍💻

LangChain wraps each chunk in a Document object. The small dataclass and helper function below copy only the page_content of each Document, so the chunks display in an easy-to-read format without their metadata.

Ex:

[Document(page_content='chunk content #1'),
Document(page_content='chunk content #2')]

In [54]:
@dataclass
class Document:
    page_content: str

def parse_documents(data: List[Document]) -> List[Document]:
    parsed_documents = []
    for doc in data:
        parsed_documents.append(Document(page_content=doc.page_content))
    return parsed_documents

## Load Book of News PDF Document 🔁

Extract the text from the Book of News PDF and convert it to Markdown. We convert to Markdown so we can take advantage of Markdown-specific chunking, such as splitting on headers.

In [None]:
loader = AzureAIDocumentIntelligenceLoader(file_path="C:\\Users\\conne\\development\\repos\\chunking_for_rag\\Book_Of_News.pdf", 
                                           api_key=os.environ.get('DOCUMENT_INTELLIGENCE_KEY'), 
                                           api_endpoint=os.environ.get('DOCUMENT_INTELLIGENCE_ENDPOINT'))
book_of_build = loader.load()

## Print output from Document Intelligence 👾

In [55]:
print(book_of_build)

[Document(page_content='<figure>\n\n![](figures/0)\n\n<!-- FigureContent="Microsoft" -->\n\n</figure>\n\n\nStories Our Company\n\nAll Microsofty\n\n<figure>\n\n![](figures/1)\n\n<!-- FigureContent="(01100010 01101111 01101110)" -->\n\n</figure>\n\n\nMICROSOFT BUILD BOOK OF NEWS May 21 - 23, 2024\n\nINTRODUCTION\n\nWelcome to Microsoft Build, our annual flagship event for developers, and to this year\'s edition of the Book of News. Here, you\'ll discover about 60 announcements, ranging from the latest Al features for Windows to the expansion of Microsoft Copilot and its new capabilities alongside novel tools for developers and cost-efficient and user-friendly cloud solutions for innovation.\n\nAs we convene for Microsoft Build this year, we have 200,000 participants registered and anticipate 4,000 people attending in-person in Seattle. For those not able to be present at the live event, most content will be available on demand. Every participant, regardless of their location, will learn

# Strategy #1: Character Splitting

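Character splitting is the simplest form of chunking: the text is divided into chunks of roughly N characters without taking the structure or meaning of the document into account. Note that LangChain's CharacterTextSplitter first splits on a separator (a blank line, `"\n\n"`, by default) and then merges the pieces up to chunk_size, so a single paragraph longer than chunk_size is kept whole. That is why the cell below reports chunks longer than 500 characters.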

- chunk_size: The number of characters you would like your chunks to be, in our case, 500 characters

- chunk_overlap: The amount you would like your chunks to overlap, in our case, 20 characters. This is to ensure context is maintained between chunks.
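To make these two parameters concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. The function name `chunk_by_characters` is hypothetical, and this is not how CharacterTextSplitter is implemented; it only illustrates what chunk_size and chunk_overlap mean, using a tiny alphabet string so the overlap is easy to see.

```python
from typing import List


def chunk_by_characters(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> List[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_by_characters(sample, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Notice that every chunk begins with the last three characters of the previous one. That shared tail is what carries a little context across the chunk boundary.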

In [56]:
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=20)
char_split_chunks = text_splitter.split_documents(book_of_build)

Created a chunk of size 740, which is longer than the specified 500
Created a chunk of size 570, which is longer than the specified 500
Created a chunk of size 543, which is longer than the specified 500
Created a chunk of size 707, which is longer than the specified 500
Created a chunk of size 516, which is longer than the specified 500
Created a chunk of size 744, which is longer than the specified 500
Created a chunk of size 539, which is longer than the specified 500
Created a chunk of size 507, which is longer than the specified 500
Created a chunk of size 564, which is longer than the specified 500
Created a chunk of size 580, which is longer than the specified 500
Created a chunk of size 593, which is longer than the specified 500
Created a chunk of size 646, which is longer than the specified 500
Created a chunk of size 524, which is longer than the specified 500
Created a chunk of size 524, which is longer than the specified 500
Created a chunk of size 507, which is longer tha
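
The warnings above appear because CharacterTextSplitter splits on a separator and then merges pieces, rather than cutting at exactly 500 characters. The sketch below is a simplified approximation of that split-then-merge idea, not LangChain's actual code: the name `split_and_merge` is hypothetical and chunk overlap is ignored to keep it short.

```python
from typing import List


def split_and_merge(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Split on separator, then greedily merge pieces up to chunk_size."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut, so it may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


paragraphs = ["short one", "x" * 40, "short two"]
for chunk in split_and_merge("\n\n".join(paragraphs), chunk_size=25):
    note = "longer than the specified 25" if len(chunk) > 25 else "ok"
    print(len(chunk), note)
```

The 40-character paragraph contains no separator, so it comes out as one oversized chunk, just like the 740-character chunk reported in the output above.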

## Display Chunks 📃

Below are the chunks for the character splitting strategy

In [57]:
parse_documents(char_split_chunks)

[Document(page_content='<figure>\n\n![](figures/0)\n\n<!-- FigureContent="Microsoft" -->\n\n</figure>\n\n\nStories Our Company\n\nAll Microsofty\n\n<figure>\n\n![](figures/1)\n\n<!-- FigureContent="(01100010 01101111 01101110)" -->\n\n</figure>\n\n\nMICROSOFT BUILD BOOK OF NEWS May 21 - 23, 2024\n\nINTRODUCTION'),
 Document(page_content="INTRODUCTION\n\nWelcome to Microsoft Build, our annual flagship event for developers, and to this year's edition of the Book of News. Here, you'll discover about 60 announcements, ranging from the latest Al features for Windows to the expansion of Microsoft Copilot and its new capabilities alongside novel tools for developers and cost-efficient and user-friendly cloud solutions for innovation."),
 Document(page_content='As we convene for Microsoft Build this year, we have 200,000 participants registered and anticipate 4,000 people attending in-person in Seattle. For those not able to be present at the live event, most content will be available on deman

# Strategy #2: Header Splitting & Recursive Character Splitting

The method below first splits on headers and then splits each header section into chunks of at most 600 characters with a 100-character overlap, using the recursive character splitter. The recursive splitter tries a list of separators in order, starting with paragraph breaks, so it keeps paragraphs together wherever they fit.
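
To show what "recursive" means here, the following is a minimal sketch of the idea under simplifying assumptions: the function name `recursive_split` is hypothetical, overlap is left out, and the merging is cruder than in RecursiveCharacterTextSplitter. It tries the coarsest separator first and only falls back to finer ones for pieces that are still too long.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split text so every chunk fits chunk_size, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "para one is short.\n\npara two is a bit longer than the limit allows here.\n\nend"
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

The short first paragraph stays intact, while the long second paragraph is broken at word boundaries instead of mid-word.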

In [58]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
docs_string = book_of_build[0].page_content
splits = markdown_splitter.split_text(docs_string)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600, chunk_overlap=100
)
headers_and_recursive_chunks = text_splitter.split_documents(splits)

## Display Chunks 📃

Below is the output from chunking on each header and then recursively splitting within each header 

In [59]:
parse_documents(headers_and_recursive_chunks)

[Document(page_content='<figure>  \n![](figures/0)  \n<!-- FigureContent="Microsoft" -->  \n</figure>  \nStories Our Company  \nAll Microsofty  \n<figure>  \n![](figures/1)  \n<!-- FigureContent="(01100010 01101111 01101110)" -->  \n</figure>  \nMICROSOFT BUILD BOOK OF NEWS May 21 - 23, 2024  \nINTRODUCTION'),
 Document(page_content="</figure>  \nMICROSOFT BUILD BOOK OF NEWS May 21 - 23, 2024  \nINTRODUCTION  \nWelcome to Microsoft Build, our annual flagship event for developers, and to this year's edition of the Book of News. Here, you'll discover about 60 announcements, ranging from the latest Al features for Windows to the expansion of Microsoft Copilot and its new capabilities alongside novel tools for developers and cost-efficient and user-friendly cloud solutions for innovation."),
 Document(page_content='As we convene for Microsoft Build this year, we have 200,000 participants registered and anticipate 4,000 people attending in-person in Seattle. For those not able to be present

# Strategy #3: Chunk on Headers

For the last chunking strategy, we split on the headers only. Each header section becomes one chunk, whatever its length.
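
As a rough illustration of header-based chunking, here is a small pure-Python sketch. The function name `split_on_headers` and the dictionary layout are assumptions made for this example; it is not MarkdownHeaderTextSplitter's implementation, and it ignores edge cases such as `#` characters inside code fences.

```python
import re
from typing import Dict, List


def split_on_headers(markdown: str, max_level: int = 3) -> List[Dict]:
    """Return one section per header, tagged with the headers it sits under."""
    header_pattern = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    sections: List[Dict] = []
    active: Dict[int, str] = {}   # header level -> current header title
    body: List[str] = []          # lines collected since the last header

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(active.items())}
            sections.append({"page_content": content, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        match = header_pattern.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header closes any header at the same or a deeper level.
            for stale in [lvl for lvl in active if lvl >= level]:
                del active[stale]
            active[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


sample = "# Azure\nintro\n## AI Services\ndetails\n# Windows\nmore"
for section in split_on_headers(sample):
    print(section)
```

Each section carries the header trail it belongs to, which is similar in spirit to the `metadata={'Header 1': ...}` values visible in the retrieval outputs later in this notebook.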

In [60]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
 
docs_string = book_of_build[0].page_content
header_chunks = text_splitter.split_text(docs_string)

## Display Chunks 📃

Below are the chunked headers

In [None]:
parse_documents(header_chunks)

## Initialize Azure Search Indexes 🔎 and Azure OpenAI Embeddings 🤖

Let's initialize our Azure Search indexes and our embeddings model so we can upsert our chunks to test our retrieval.

In [None]:
embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment="embeddings",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

In [None]:
vector_store_address: str = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password: str = os.getenv("AZURE_SEARCH_KEY")

index_name: str = "charsplit"
char_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

index_name: str = "headerandcharsplit"
header_and_recur_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

index_name: str = "headersplit"
header_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

## Upsert Chunks to their respective Indexes

For each of the chunking strategies we completed above, let's upsert them into Azure Search so we can perform a similarity search on them and see which one performs best. 

In [None]:
char_split_vector_store.add_documents(documents=char_split_chunks)
header_and_recur_split_vector_store.add_documents(documents=headers_and_recursive_chunks)
header_split_vector_store.add_documents(documents=header_chunks)
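
For intuition about what a similarity search does with the vectors we just stored, here is a toy sketch that ranks chunk vectors by cosine similarity to a query vector. The helper names and the two-dimensional vectors are made up for illustration. Azure Search uses its own vector index and configured metric, so treat this as a conceptual picture only, not a description of the service.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], chunk_vecs: List[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-d "embeddings"; real embedding vectors have far more dimensions.
toy_chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
for index, score in top_k([1.0, 0.1], toy_chunk_vectors, k=2):
    print(index, round(score, 3))
```

In the notebook, the query vector would come from the same embeddings model as the chunks, which is why chunk boundaries matter: the vector for a chunk can only reflect the text that ended up inside it.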

# Testing 🧪

Imagine the chunks retrieved below being placed into an LLM prompt to supplement what the model learned during training. The retrieved chunks heavily influence the quality and accuracy of the generation.
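
To picture how retrieved chunks end up in a prompt, here is a minimal sketch. The function name `build_rag_prompt` and the prompt wording are assumptions chosen for illustration, not a prescribed template.

```python
from typing import List


def build_rag_prompt(question: str, chunks: List[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


example_chunks = [
    "1.1. AZURE AI SERVICES ...",
    "Microsoft Azure AI Speech has several new features ...",
]
print(build_rag_prompt("What was announced for Azure AI Services?", example_chunks))
```

Once the retrieval cells have run, the same helper could be fed the `page_content` of each returned Document and the result passed to `llm.invoke`. Whatever the chunks contain, including any leftover figure markup, goes straight into the model's context.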

### Any Guesses as to which chunking strategy will provide the best results? ❓

1. Chunking Strategy #1: Character Splitting

2. Chunking Strategy #2: Header and Recursive Character Splitting

3. Chunking Strategy #3: Header Splitting

## Test Chunking Strategy #1 Character Splitting 🧪

Here we will execute a search against chunks that were split every 500 characters with a 20 character overlap.

In [61]:
char_split_docs = char_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(char_split_docs)

[Document(page_content='In addition, message analysis for\n\n1.1.5. AZURE OPENAI SERVICE FEATURES\n\nKEY AI ADVANCEMENTS'), Document(page_content="If you are interested in speaking with an industry analyst about news announcements at Microsoft Build or Microsoft's broader strategy and product offerings, please contact wemsanalystrelations@we- worldwide.com\n\n<figure>\n\n![](figures/2)\n\n</figure>\n\n\n# 1\\. Azure\n\n1.1. AZURE AI SERVICES\n\n1.1.1. ANNOUNCING AZURE PATTERNS AND PRACTICES FOR PRIVATE CHATBOTS"), Document(page_content='· Blog: From code to production: New ways Azure helps you build transformational Al experiences\n\n· Breakout: Safeguard your copilot with Azure Al\n\n· Breakout: Operationalize Al responsibly with Azure Al Studio\n\n· Demo: Safeguard user and Al- generated content with Azure Al Content Safety\n\nMicrosoft Azure Al Speech has several new features that will help developers build high-quality, voice-enabled apps. This service is gated. These updates are n

## Test Chunking Strategy #2 Header and Recursive Character Splitting 🧪

The cell below queries the chunks that we split by header and then split further, within each header section, into 600-character chunks with a 100-character overlap.

In [62]:
header_and_recur_split_docs = header_and_recur_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(header_and_recur_split_docs)

[Document(page_content=". Fine-tuning GPT-4 allows for unparalleled customization of Al models, ensuring outputs are closely aligned with an organization's brand voice and specific needs, thereby revolutionizing customer service, content creation and more. This update is now in preview.  \n· Assistants API paves the way for the creation of advanced virtual assistants and chatbots that enhance user interactions with their nuanced understanding and responsiveness. This update is now generally available.  \nIn addition, message analysis for  \n1.1.5. AZURE OPENAI SERVICE FEATURES  \nKEY AI ADVANCEMENTS", metadata={'Header 1': '1\\. Azure'}), Document(page_content='Additional resources:  \n. Blog: Announcing the Al Toolkit for Visual Studio Code  \n· Keynote: Next generation Al for developers with the Microsoft Cloud  \n· Breakout: Maximize joy, minimize toil with great developer experiences  \n· Breakout: Scott and Mark learn to Copilot  \n· Breakout: Code-First LLMOps from prototype to p

## Test Chunking Strategy #3 Header Splitting 🧪

Finally, let's execute a query against chunks where we split on headers only.

In [63]:
header_split_docs = header_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(header_split_docs)

[Document(page_content='1.1. AZURE AI SERVICES  \n1.1.1. ANNOUNCING AZURE PATTERNS AND PRACTICES FOR PRIVATE CHATBOTS  \nNew Microsoft Azure reference architectures and implementation guidance are now generally available for customers to confidently design and deploy intelligent apps. Customers can easily leverage patterns and practices to create private chatbots that are reliable, cost-efficient and compliant - adhering to both the functional and nonfunctional requirements of an organization.  \nThe new guidance helps customers adopt well-architected best practices and includes:  \n· A reference architecture and reference implementation for Microsoft Azure OpenAl Service based on Azure landing zones, which helps jumpstart and scale app deployment.  \n· Service guides for machine learning that gives precise configuration instructions for Azure services used to deliver intelligent apps.  \n· Patterns for designing and developing a RAG solution: While the architecture is straightforward,

## Things to Consider 🤔

1. The retrieved information above would be plugged into a prompt to supplement what the LLM learned during training. The quality of the response is heavily influenced by the retrieved data.

2. Appropriate chunking allows for efficient retrieval of semantically similar chunks and improves the signal-to-noise ratio in the prompt.
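
One rough way to put a number on the signal-to-noise idea is to count how many retrieved chunks mention the terms you care about. The helper below, `keyword_hit_rate`, is a hypothetical and deliberately crude heuristic. It is no substitute for reading the retrieved chunks or for a proper evaluation, but it can help compare strategies at a glance.

```python
from typing import List


def keyword_hit_rate(chunks: List[str], keywords: List[str]) -> float:
    """Fraction of chunks that mention at least one keyword (case-insensitive)."""
    if not chunks:
        return 0.0
    lowered = [keyword.lower() for keyword in keywords]
    hits = sum(
        1 for chunk in chunks
        if any(keyword in chunk.lower() for keyword in lowered)
    )
    return hits / len(chunks)


retrieved = [
    "Microsoft Azure AI Speech has several new features",
    "Additional resources: Blog: Announcing the AI Toolkit for Visual Studio Code",
    "1.1. AZURE AI SERVICES",
]
print(keyword_hit_rate(retrieved, ["azure ai"]))
```

With the notebook's results, the chunk texts would come from the `page_content` of the Documents returned by each `similarity_search` call.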
