# File Chunking
Partitioning large documents into smaller chunks can help you stay under the maximum token input limits of embedding models. For example, the maximum length of input text for the [Azure OpenAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/embeddings) embedding models is 8,191 tokens. Given that each token is around four characters of text for common OpenAI models, this maximum limit is equivalent to around 6,000 words of text. If you're using these models to generate embeddings, it's critical that the input text stays under the limit. Partitioning your content into chunks ensures that your data can be processed by the embedding models used to populate vector stores and text-to-vector query conversions.
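
As a rough illustration of the arithmetic above, the following sketch estimates a token count from a character count using the four-characters-per-token rule of thumb and compares it with the 8,191-token limit. The function names and the `CHARS_PER_TOKEN` constant are illustrative assumptions, not part of any library, and the result is only an approximation; an actual tokenizer such as `tiktoken` gives exact counts.

```python
# Rough token estimate based on the ~4 characters per token rule of thumb.
# This is only an approximation; an actual tokenizer gives exact counts.
MAX_EMBEDDING_TOKENS = 8191  # input limit quoted above for the embedding models
CHARS_PER_TOKEN = 4          # heuristic, not an exact figure


def estimate_tokens(text: str) -> int:
    """Estimate the number of tokens in `text` from its character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_embedding_limit(text: str, limit: int = MAX_EMBEDDING_TOKENS) -> bool:
    """Return True if the estimated token count is within `limit`."""
    return estimate_tokens(text) <= limit


sample = "word " * 10_000  # 50,000 characters of placeholder text
print(estimate_tokens(sample), fits_embedding_limit(sample))
```

A document that fails this check is a candidate for chunking before it is sent to the embedding model.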

This notebook walks through the process used in the [chat-with-your-data-solution-accelerator](https://github.com/Azure-Samples/chat-with-your-data-solution-accelerator/tree/main) repo: analysing file content with [AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0), then breaking it into smaller overlapping chunks using LangChain. These chunks can then be vectorised and stored in a service such as [Azure AI Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search), which provides secure information retrieval at scale over user-owned content in traditional and generative AI search applications.

> NOTE: Before proceeding, you need to complete the steps in the first `00_Setup` notebook.

### Install dependencies
First we install the dependencies required in this notebook.

In [None]:
# Install dependencies

%pip install python-dotenv
%pip install langchain
%pip install azure-ai-formrecognizer
%pip install tiktoken

### Load credentials
Next we load the environment variables needed by the following cells from the `.env` file.

In [None]:
# Load credentials
import os
from dotenv import load_dotenv
# Read the variables defined in the .env file into the process environment
load_dotenv()
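
Before moving on, it can help to confirm that the variables read by later cells are actually set. The sketch below is one possible check; the helper name `find_missing_vars` is hypothetical, and the list of variable names is taken from the `os.getenv` calls used later in this notebook.

```python
import os

# Variable names read by later cells in this notebook
REQUIRED_VARS = [
    "BLOB_CONTAINER_ENDPOINT",
    "BLOB_SAS_TOKEN",
    "FORM_RECOGNIZER_ENDPOINT",
    "FORM_RECOGNIZER_KEY",
]


def find_missing_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```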


### Set execution parameters
Set `file_name` to the name of the file you uploaded to the blob container (the file we are going to chunk). The filename is joined with the blob container endpoint and the SAS token to create a URL to the file. You can also adjust `my_chunk_size` and `my_chunk_overlap` to see their effect on the output later, or leave the defaults.

In [None]:
## The name of the file you uploaded to your Blob Storage Datasource
# file_name="employee_handbook.pdf"
#file_name="PHE_Covid-19_consent_form_adults_able_to_consent_v2.pdf"
file_name="WHO-2019-nCoV-SRH-Rights-2020.1-eng.pdf"
blob_sas=os.getenv("BLOB_SAS_TOKEN")
file_url=os.getenv("BLOB_CONTAINER_ENDPOINT") + "/" + file_name + "?" + blob_sas

print(f"file url={file_url}")

# You can play with these parameters to see how they affect the output
my_chunk_size = 500
my_chunk_overlap = 100
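
The URL composition described above can also be written as a small helper that validates its inputs. This is a sketch only: `build_blob_url` is a hypothetical name, the endpoint and token values are placeholders, and percent-encoding the file name assumes the name has not already been encoded.

```python
from urllib.parse import quote


def build_blob_url(container_endpoint: str, file_name: str, sas_token: str) -> str:
    """Join a container endpoint, blob name, and SAS token into one URL."""
    if not container_endpoint or not sas_token:
        raise ValueError("Container endpoint and SAS token must both be set")
    base = container_endpoint.rstrip("/")        # avoid a double slash
    token = sas_token.lstrip("?")                # tolerate a leading "?"
    return f"{base}/{quote(file_name)}?{token}"  # percent-encode spaces etc.


# Placeholder values for illustration only
example_url = build_blob_url(
    "https://example.blob.core.windows.net/documents/",
    "employee handbook.pdf",
    "?sv=placeholder&sig=placeholder",
)
print(example_url)
```

Raising early on a missing endpoint or token gives a clearer error than the `TypeError` that string concatenation with `None` would produce.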

### Read the file in with Document Intelligence
[Azure AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-4.0.0) is a cloud-based Azure AI service that enables you to build intelligent document processing solutions. Massive amounts of data, spanning a wide variety of data types, are stored in forms and documents. Document Intelligence enables you to effectively manage the velocity at which data is collected and processed and is key to improved operations, informed data-driven decisions, and enlightened innovation.

This cell creates a `DocumentAnalysisClient` object using the settings in the credentials file. It then defines some helper functions to do things like converting tables to HTML.

The main part of the code calls the *DocumentAnalysisClient* `begin_analyze_document_from_url` method to analyse the file using the pre-defined `layout` [model](https://learn.microsoft.com/en-gb/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0&tabs=sample-code). The output of the DocumentAnalysisClient call is then parsed and converted into HTML formatted *pages*, while keeping track of paragraph and table roles and positions.

In [None]:
#code\backend\batch\utilities\helpers\azure_form_recognizer_helper.py

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
import html
import traceback


print(f"form recognizer={os.getenv('FORM_RECOGNIZER_ENDPOINT')}")

document_analysis_client = DocumentAnalysisClient(
    endpoint=os.getenv("FORM_RECOGNIZER_ENDPOINT"), 
    credential=AzureKeyCredential(os.getenv("FORM_RECOGNIZER_KEY"))
)

form_recognizer_role_to_html = {
    "title": "h1",
    "sectionHeading": "h2",
    "pageHeader": None,
    "pageFooter": None,
    "paragraph": "p",
}

def _table_to_html(table):
    table_html = "<table>"
    rows = [
        sorted(
            [cell for cell in table.cells if cell.row_index == i],
            key=lambda cell: cell.column_index,
        )
        for i in range(table.row_count)
    ]
    for row_cells in rows:
        table_html += "<tr>"
        for cell in row_cells:
            tag = (
                "th"
                if (cell.kind == "columnHeader" or cell.kind == "rowHeader")
                else "td"
            )
            cell_spans = ""
            if cell.column_span > 1:
                cell_spans += f" colSpan={cell.column_span}"
            if cell.row_span > 1:
                cell_spans += f" rowSpan={cell.row_span}"
            table_html += f"<{tag}{cell_spans}>{html.escape(cell.content)}</{tag}>"
        table_html += "</tr>"
    table_html += "</table>"
    return table_html


offset = 0
page_map = []
model_id = "prebuilt-layout"

try:
    poller = document_analysis_client.begin_analyze_document_from_url(
        model_id, document_url=file_url
    )
    form_recognizer_results = poller.result()

    roles_start = {}
    roles_end = {}
    for paragraph in form_recognizer_results.paragraphs:
        # if paragraph.role!=None:
        para_start = paragraph.spans[0].offset
        para_end = paragraph.spans[0].offset + paragraph.spans[0].length
        roles_start[para_start] = (
            paragraph.role if paragraph.role is not None else "paragraph"
        )
        roles_end[para_end] = (
            paragraph.role if paragraph.role is not None else "paragraph"
        )

    for page_num, page in enumerate(form_recognizer_results.pages):
        tables_on_page = [
            table
            for table in form_recognizer_results.tables
            if table.bounding_regions[0].page_number == page_num + 1
        ]

        # (if using layout) mark all positions of the table spans in the page
        page_offset = page.spans[0].offset
        page_length = page.spans[0].length
        table_chars = [-1] * page_length
        for table_id, table in enumerate(tables_on_page):
            for span in table.spans:
                # replace all table spans with "table_id" in table_chars array
                for i in range(span.length):
                    idx = span.offset - page_offset + i
                    if idx >= 0 and idx < page_length:
                        table_chars[idx] = table_id

        # build page text by replacing characters in table spans with table html and replace the characters corresponding to headers with html headers, if using layout
        page_text = ""
        added_tables = set()
        for idx, table_id in enumerate(table_chars):
            if table_id == -1:
                position = page_offset + idx
                if position in roles_start.keys():
                    role = roles_start[position]
                    html_role = form_recognizer_role_to_html.get(role)
                    if html_role is not None:
                        page_text += f"<{html_role}>"
                if position in roles_end.keys():
                    role = roles_end[position]
                    html_role = form_recognizer_role_to_html.get(role)
                    if html_role is not None:
                        page_text += f"</{html_role}>"

                page_text += form_recognizer_results.content[page_offset + idx]

            elif table_id not in added_tables:
                page_text += _table_to_html(tables_on_page[table_id])
                added_tables.add(table_id)

        page_text += " "
        page_map.append(
            {"page_number": page_num, "offset": offset, "page_text": page_text}
        )
        offset += len(page_text)

    print(f"page map={page_map}")

except Exception as e:
    raise ValueError(f"Error: {traceback.format_exc()}. Error: {e}")
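
To make the role-tagging step easier to follow without calling the service, here is a simplified, self-contained sketch of the same idea. It is not the accelerator's code: plain `(offset, length, role)` tuples stand in for the Document Intelligence paragraph objects, tables are ignored, and `tag_paragraphs` is a hypothetical name. Unlike the cell above, it emits closing tags before opening tags at a given position and also visits the position just past the last character, so a paragraph that ends at the very end of the content is still closed.

```python
# Simplified stand-in for the role-tagging step: plain tuples replace the
# Document Intelligence paragraph objects, and tables are ignored.
ROLE_TO_HTML = {
    "title": "h1",
    "sectionHeading": "h2",
    "pageHeader": None,  # no tag is emitted; the text itself is kept
    "pageFooter": None,
    "paragraph": "p",
}


def tag_paragraphs(content, paragraphs):
    """Wrap each (offset, length, role) span of `content` in its HTML tag."""
    starts, ends = {}, {}
    for offset, length, role in paragraphs:
        role = role if role is not None else "paragraph"
        starts[offset] = role
        ends[offset + length] = role

    pieces = []
    # Go one position past the last character so a span that ends at the
    # very end of the content still gets its closing tag.
    for position in range(len(content) + 1):
        if position in ends:
            tag = ROLE_TO_HTML.get(ends[position])
            if tag is not None:
                pieces.append(f"</{tag}>")
        if position in starts:
            tag = ROLE_TO_HTML.get(starts[position])
            if tag is not None:
                pieces.append(f"<{tag}>")
        if position < len(content):
            pieces.append(content[position])
    return "".join(pieces)


print(tag_paragraphs("Guide Intro text", [(0, 5, "title"), (6, 10, None)]))
```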

### Parse into a document object
The following cell takes the *page_map* created in the previous cell and builds a list of `SourceDocument` objects, one for each page in the page_map, which are used to simplify the next phase of processing. 

In [None]:
#code\backend\batch\utilities\document_loading\layout.py

from SourceDocument import SourceDocument

pages_content = page_map

documents = [
    SourceDocument(
        content=page["page_text"],
        source=file_url,
        offset=page["offset"],
        page_number=page["page_number"],
    )
    for page in pages_content
    ]
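
The `SourceDocument` class is imported from a local module that is not shown in this notebook. For readers who want to experiment without it, the sketch below uses a hypothetical stand-in, `SimpleSourceDocument`, with only the four fields passed in the cell above; the class in the repo may define more fields and methods. The example `page_map` and URL are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleSourceDocument:
    """Hypothetical stand-in; the repo's SourceDocument may define more fields."""
    content: str
    source: str
    offset: Optional[int] = None
    page_number: Optional[int] = None


# Example page_map with the same keys produced by the earlier cell
example_page_map = [
    {"page_number": 0, "offset": 0, "page_text": "<h1>Title</h1> "},
    {"page_number": 1, "offset": 15, "page_text": "<p>Body text</p> "},
]

example_documents = [
    SimpleSourceDocument(
        content=page["page_text"],
        source="https://example.invalid/file.pdf",  # placeholder URL
        offset=page["offset"],
        page_number=page["page_number"],
    )
    for page in example_page_map
]
print(example_documents[1].offset, example_documents[1].page_number)
```

Each page's `offset` is the running total of the lengths of the preceding pages' text, which is why the second entry starts at 15 here.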

### Chunk the document
The following cell splits the contents of the entire document into chunks based on the parameters set in the earlier cell using [LangChain text splitters](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/). Because the splitter is created with `from_tiktoken_encoder`, `chunk_size` and `chunk_overlap` are measured in tokens rather than characters. The final list, `chunked_documents`, contains `SourceDocument` objects built from the resulting chunks.

In [None]:
#code\backend\batch\utilities\document_chunking\layout.py

from typing import List
from langchain.text_splitter import MarkdownTextSplitter

# Combine all pages into a single document
full_document_content = "".join(
    list(map(lambda document: document.content, documents))
)

# Split the document into chunks
document_url = documents[0].source
splitter = MarkdownTextSplitter.from_tiktoken_encoder(
    chunk_size=my_chunk_size, chunk_overlap=my_chunk_overlap
)
chunked_content_list = splitter.split_text(full_document_content)

# Create a list of SourceDocuments from the chunked content
chunked_documents = []
chunk_offset = 0
for idx, chunked_content in enumerate(chunked_content_list):
    chunked_documents.append(
        SourceDocument.from_metadata(
            content=chunked_content,
            document_url=document_url,
            metadata={"offset": chunk_offset},
            idx=idx,
        )
    )

    # Print each chunk
    print(f"offset={chunk_offset}")
    print(f"chunk={idx}: {chunked_content}\n")

    chunk_offset += len(chunked_content)

#"chunked_documents" variable contains the list of documents split into chunks
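
To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below applies a plain sliding window to a list of words. It is a deliberate simplification and not LangChain's algorithm, which splits on Markdown-aware separators and measures length in tokens; `sliding_window_chunks` is a hypothetical helper, and whitespace-separated words stand in for tokens.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


words = "the quick brown fox jumps over the lazy dog".split()
for chunk in sliding_window_chunks(words, chunk_size=4, chunk_overlap=2):
    print(" ".join(chunk))
```

Each window repeats the last `chunk_overlap` items of the previous one, so a larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.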