## Import Libraries 🧑‍💻

We use LangChain to chunk, vectorize, upsert, and query our data.

In [None]:
import os
from dataclasses import dataclass
from typing import List

from dotenv import load_dotenv
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_text_splitters import (
    CharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Load credentials and endpoints from a local .env file
load_dotenv()

## Bring in Azure OpenAI Embeddings 🔢

The embedding model that generates vectors for the Book of News text.

In [None]:
embeddings: AzureOpenAIEmbeddings = AzureOpenAIEmbeddings(
    azure_deployment="embeddings",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)
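
To make the role of these vectors concrete, here is a minimal sketch of how two embedding vectors can be compared. It assumes cosine similarity as the metric, which is a common choice for text embeddings but is not confirmed by this notebook. The function name and the tiny three-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors; real embeddings have many more dimensions.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 1.1]
far_vec = [-1.0, 1.0, 0.0]

# A chunk whose vector points in nearly the same direction as the query scores higher.
print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```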

## Asking GPT-4o a Question Outside of Its Training Dataset ❓

GPT-4o's knowledge was last updated in October 2023, so it will not know about Microsoft Build 2024.

In [None]:
llm = AzureChatOpenAI(
    azure_deployment="gpt4o",
    temperature=0,
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_version="2024-02-01"
)

llm.invoke("Summarize the Microsoft Build Book of News for 2024")

## Declare Chunk Parser Class 🧑‍💻

This lets us print the chunks in a clear, readable form.

In [None]:
@dataclass
class Document:
    """Minimal container that keeps only a chunk's text for display."""
    page_content: str

def parse_documents(data: List[Document]) -> List[Document]:
    """Copy each chunk into a bare Document so the output shows only page_content."""
    return [Document(page_content=doc.page_content) for doc in data]

## Load Book of News PDF Document 🔁

Extract the text from the Book of News PDF with Azure AI Document Intelligence.

In [None]:
loader = AzureAIDocumentIntelligenceLoader(
    file_path="C:\\Users\\conne\\development\\repos\\chunking_for_rag\\Book_Of_News.pdf",
    api_key=os.environ.get("DOCUMENT_INTELLIGENCE_KEY"),
    api_endpoint=os.environ.get("DOCUMENT_INTELLIGENCE_ENDPOINT"),
    api_model="prebuilt-layout",
)
book_of_build = loader.load()

## Print Extracted Pages from Book of Build 👾

In [None]:
print(book_of_build)

## Chunking Strategy #1 Character Split 🪓

Character splitting is the simplest form of chunking: it divides the text into chunks of N characters without taking the structure or context of the document into account.

Important Concepts:

- chunk_size: The number of characters each chunk should contain; in our case, 100 characters.

- chunk_overlap: The number of characters shared by consecutive chunks; in our case, 30 characters. The overlap helps preserve context across chunk boundaries.
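
The interaction between chunk_size and chunk_overlap can be sketched in plain Python. This is a simplified illustration of fixed-size character chunking, not LangChain's implementation: `CharacterTextSplitter` first splits on a separator (a blank line by default) and then merges the pieces, so its chunks will generally not match this output. The function name `fixed_size_chunks` is hypothetical.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Slice text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in fixed_size_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

With chunk_size=10 and chunk_overlap=3, each window starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next.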

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=30)
docs = text_splitter.split_documents(book_of_build)
vector_store_address: str = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password: str = os.getenv("AZURE_SEARCH_KEY")

index_name: str = "charsplit"
char_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

char_split_vector_store.add_documents(documents=docs)

## Display Character Splitter Chunks 📃

In [None]:
parse_documents(docs)

## Chunking Strategy #2 Split on Headers and Chunk 🪓

Here we employ a document-specific chunking strategy: we analyze the structure of the document and choose a chunking method that fits it. This can be a mix of several strategies, such as the method below, which splits on headers and then splits each header section into 600-character chunks with a 100-character overlap.
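
The two-stage idea can be sketched in plain Python before we hand it to LangChain. This is a simplified stand-in under stated assumptions: it recognizes only `#`, `##`, and `###` headers, and it cuts each section at fixed character offsets, whereas `RecursiveCharacterTextSplitter` tries to break on natural separators such as paragraphs and lines. The names `split_on_headers` and `chunk_sections` are hypothetical.

```python
import re
from typing import List, Tuple

# Matches markdown headers of level 1 to 3, e.g. "## Search"
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_on_headers(markdown: str) -> List[Tuple[str, str]]:
    """Group lines into (header, body) sections; text before any header gets an empty header."""
    sections: List[Tuple[str, List[str]]] = [("", [])]
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    joined = [(header, "\n".join(body).strip()) for header, body in sections]
    # Drop the leading placeholder section if nothing preceded the first header
    return [(header, body) for header, body in joined if header or body]


def chunk_sections(markdown: str, chunk_size: int, chunk_overlap: int) -> List[Tuple[str, str]]:
    """Split on headers first, then cut each section body into overlapping windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for header, body in split_on_headers(markdown):
        for start in range(0, len(body), step):
            # Keep the header next to every chunk so its origin stays visible
            chunks.append((header, body[start:start + chunk_size]))
            if start + chunk_size >= len(body):
                break
    return chunks


sample = "# Azure AI\nNew models announced.\n## Search\nVector search updates and more."
for header, text in chunk_sections(sample, chunk_size=20, chunk_overlap=5):
    print(f"[{header}] {text}")
```

Because the header split happens first, no chunk ever spans two sections, which is the main difference from plain character splitting.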

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
docs_string = book_of_build[0].page_content
splits = markdown_splitter.split_text(docs_string)

chunk_size = 600
chunk_overlap = 100
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
splits = text_splitter.split_documents(splits)
vector_store_address: str = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password: str = os.getenv("AZURE_SEARCH_KEY")

index_name: str = "headerandcharsplit"
header_and_char_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

header_and_char_split_vector_store.add_documents(documents=splits)

## Display Split on Headers and Chunk 📃

In [None]:
parse_documents(splits)

## Chunking Strategy #3 Split on Headers 🪓

For the last example, we simply chunk on the headers of the document, with no further splitting.

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
 
docs_string = book_of_build[0].page_content
splits = text_splitter.split_text(docs_string)
vector_store_address: str = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password: str = os.getenv("AZURE_SEARCH_KEY")

index_name: str = "headersplit"
header_split_vector_store: AzureSearch = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
)

header_split_vector_store.add_documents(documents=splits)

## Display Split on Headers 📃

In [None]:
parse_documents(splits)
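
One rough way to compare the three strategies is to look at how long their chunks are, since header-only chunks follow the document's sections and can vary widely in size. The sketch below assumes we have the chunk texts as plain strings; `summarize_chunks` is a hypothetical helper name, and the sample strings are made up.

```python
from statistics import mean
from typing import Dict, List


def summarize_chunks(chunks: List[str]) -> Dict[str, float]:
    """Return the count and the min, max, and mean length of a list of chunk strings."""
    if not chunks:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


# Made-up chunk texts standing in for real splitter output
example_chunks = ["short section", "a much longer section with several sentences in it"]
print(summarize_chunks(example_chunks))
```

In the notebook, the same helper could be called with `[doc.page_content for doc in splits]`, assuming `splits` holds the chunks from the cell above.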

## Test Chunking Strategy #1 Character Splitting 🧪

Are the search results relevant?

In [None]:
docs = char_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(docs[0].page_content)

## Test Chunking Strategy #2 Header and Character Splitting 🧪

Are the search results relevant?

In [None]:
docs = header_and_char_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(docs[0].page_content)

## Test Chunking Strategy #3 Header Splitting 🧪

Are the search results relevant?

In [None]:
docs = header_split_vector_store.similarity_search(
    query="Azure AI Services announcements",
    k=3,
    search_type="similarity",
)
print(docs[0].page_content)
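
The test cells request k=3 results but print only the first one. To judge relevance, it can help to see all of the returned chunks side by side. The sketch below is one hedged way to do that: `Result` is a hypothetical stand-in for a search hit and assumes only a `page_content` attribute, and `format_results` is a made-up helper name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    """Hypothetical stand-in for a search hit; only page_content is assumed."""
    page_content: str


def format_results(results: List[Result], max_chars: int = 80) -> List[str]:
    """Number each result and truncate its text so several hits fit on screen."""
    lines = []
    for rank, result in enumerate(results, start=1):
        text = " ".join(result.page_content.split())  # collapse newlines and runs of spaces
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."
        lines.append(f"{rank}. {text}")
    return lines


# Made-up hits standing in for the output of a similarity search
hits = [Result("Azure AI\nannouncement one"), Result("An unrelated chunk of text")]
for line in format_results(hits):
    print(line)
```

In the notebook, `format_results(docs)` should work on the search output as long as each returned item exposes `page_content`, as the `print(docs[0].page_content)` calls suggest.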
