## 📚 Prerequisites

Before executing this notebook, make sure you have properly set up your Azure Services, created your Conda environment, and configured your environment variables as per the instructions provided in the [README.md](README.md) file.
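
To confirm the environment is configured before running the cells below, you can check for the variables the notebook reads. This is a minimal sketch, assuming the five names that appear in this notebook (`TENANT_ID`, `CLIENT_ID`, `CLIENT_SECRET`, `SITE_DOMAIN`, `SITE_NAME`). The README may require additional variables, and `missing_env_vars` is a hypothetical helper, not part of the repository.

```python
import os

# Names taken from this notebook's cells and logs; the full list in the README may be longer (assumption).
REQUIRED_ENV_VARS = ["TENANT_ID", "CLIENT_ID", "CLIENT_SECRET", "SITE_DOMAIN", "SITE_NAME"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

If any names are reported as missing, revisit the environment setup described in the README before continuing.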

## 📋 Table of Contents

This notebook guides you through the following sections:

> **💡 Note:** Please refer to the notebook `01-creation-indexes.ipynb` for detailed information and steps on how to create Azure AI Search Indexes.

1. [**Indexing Vectorized Content from Documents**](#index-documents)
    - Chunk, vectorize, and index local PDF files and website addresses.
    - Download, chunk, vectorize, and index all `.docx` files from a SharePoint site.
    - Download PDF files stored in Blob Storage, apply complex OCR processing through GPT-4 Vision, chunk and vectorize the content, and finally index the processed data in Azure AI Search.
    
2. [**Indexing Vectorized Content from Complex-Layout Documents Leveraging OCR Capabilities**](#index-images)
    - Leverage complex OCR, image recognition, and summarization capabilities using Azure Document Intelligence. Chunk, vectorize, and index the metadata extracted from documents.

3. [**Indexing Vectorized Content from Audio**](#index-audio)
    - Process WAV audio files stored in Blob Storage using Azure AI Speech translation capabilities, then chunk, vectorize, and index the output in Azure AI Search.



In [8]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing
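
The cell above relies on a hard-coded absolute path, which has to be edited on every machine. As an alternative, the sketch below walks upward from the current directory until it finds a marker file. It assumes `README.md` sits at the repository root; `find_repo_root` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def find_repo_root(start, marker="README.md"):
    """Walk upward from `start` and return the first directory containing `marker`, or None."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None


repo_root = find_repo_root(Path.cwd())
if repo_root is None:
    print("Could not locate the repository root; set the directory manually.")
else:
    print(f"Repository root found at {repo_root}")
```

When a root is found, passing it to `os.chdir(repo_root)` has the same effect as the cell above.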


# Create Azure AI Search Indexes 

Please refer to the notebook [01-creation-indexes.ipynb](01-creation-indexes.ipynb) for detailed information and steps on how to create Azure AI Search Indexes. 

# Indexing Vectorized Content from Multiple Sources and Various Formats

In [9]:
# Import the AzureAIndexer class from the ai_search_indexing module
from src.indexers.ai_search_indexing import AzureAIndexer

DEPLOYMENT_NAME = "foundational-ada"
INDEX_NAME = "test-index-002"

# Create an instance of the AzureAIndexer class
azure_search_indexer_client = AzureAIndexer(
    index_name=INDEX_NAME, embedding_azure_deployment_name=DEPLOYMENT_NAME
)

2024-02-03 21:16:09,741 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model, deployment foundational-ada, and chunk size 1000 (ai_search_indexing.py:load_embedding_model:158)
2024-02-03 21:16:10,493 - micro - MainProcess - INFO     AzureOpenAIEmbeddings object has been created successfully. You can now access the embeddings
                using the '.embeddings' attribute. (ai_search_indexing.py:load_embedding_model:169)
vector_search_configuration is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored
2024-02-03 21:16:11,236 - micro - MainProcess - INFO     The Azure AI search index 'test-index-002' has been loaded correctly. (ai_search_indexing.py:load_azureai_index:220)
2024-02-03 21:16:11,247 - micro - MainProcess - INFO     Successfully loaded environment variables: TENANT_ID, CLIENT_ID, CLIENT_SECRET (sharepoint_data_extractor.py:load_environment_variables_from_env_file:87)
2024-02-03 21:16

## Indexing PDFs and DOCX Files from Blob Storage

The first step in indexing documents is loading the files and splitting them into manageable chunks. The `load_files_and_split_into_chunks` function handles both steps, preparing your documents for vectorization and indexing.

Here are its key features:

- **Multi-Format Support**: The function processes documents in different formats (PDFs, Word documents, etc.) from several kinds of sources (blob storage, URLs, local paths). You can pass a single path or a list of file paths, each possibly in a different format.

- **Automated File Loading**: The function loads each file into memory and handles reading and parsing, so no manual file handling is needed.

- **Advanced Text Splitting**: After loading, the function splits the text into chunks, which is necessary for processing large documents. You can set the chunk size and overlap with `chunk_size` and `chunk_overlap`.

- **Versatile Splitting Options**: You can choose among several splitters (`RecursiveCharacterTextSplitter`, `TokenTextSplitter`, `SpacyTextSplitter`, or `CharacterTextSplitter`) through `splitter_type`.

- **Encoding Capabilities**: The function can optionally use an encoder during splitting (`use_encoder`). You can specify the model used for encoding with `model_name` (default is "gpt-4").

- **Verbose Logging**: Set `verbose` to enable detailed logging for progress tracking and easier debugging.

- **High Customizability**: Additional keyword arguments tailor the function's behavior, including retaining separators in chunks (`keep_separator`) and treating separators as regex patterns (`is_separator_regex`).
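
To make the interaction between chunk size and overlap concrete, the sketch below splits a string into fixed-size character windows. It is a deliberate simplification and not what `load_files_and_split_into_chunks` does internally: a recursive character splitter prefers to break on separators, so its chunks vary in length. `split_with_overlap` is a hypothetical helper written for illustration only.

```python
def split_with_overlap(text, chunk_size=512, chunk_overlap=128):
    """Split `text` into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Synthetic 1200-character sample, used only to show the window arithmetic.
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample, chunk_size=512, chunk_overlap=128)
print([len(c) for c in chunks])  # [512, 512, 432]
```

With a chunk size of 512 and an overlap of 128, each window advances by 384 characters, so the last 128 characters of one chunk reappear at the start of the next. The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of indexing some text twice.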

In [10]:
# Define file paths and URLs
local_pdf_path = "utils/data/autogen.pdf"
remote_pdf_url = "https://arxiv.org/pdf/2308.08155.pdf"
blob_pdf_url = (
    "https://testeastusdev001.blob.core.windows.net/testretrieval/autogen.pdf"
)
local_word_path = "utils/data/test.docx"
remote_word_url = (
    "https://testeastusdev001.blob.core.windows.net/testretrieval/test.docx"
)

# Combine all paths and URLs into a list to process multiple files at once.
# A single string also works when processing one file.
file_sources = [
    local_pdf_path,
    remote_pdf_url,
    blob_pdf_url,
    local_word_path,
    remote_word_url,
]

# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_character_recursive",
    "use_encoder": False,
    "chunk_size": 512,
    "chunk_overlap": 128,
    "verbose": False,
    "keep_separator": True,
    "is_separator_regex": False,
    "model_name": "gpt-4",
}

# Load files and split them into chunks
document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=file_sources, **splitter_params
)

2024-02-03 21:16:12,043 - micro - MainProcess - INFO     Reading .pdf file from local path C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing\utils\data\autogen.pdf. (from_blob.py:load_document:67)
2024-02-03 21:16:12,045 - micro - MainProcess - INFO     Loading file with Loader PyPDFLoader (from_blob.py:load_document:79)
2024-02-03 21:16:14,270 - micro - MainProcess - INFO     Reading .pdf file from https://arxiv.org/pdf/2308.08155.pdf. (from_blob.py:load_document:75)
2024-02-03 21:16:14,271 - micro - MainProcess - INFO     Loading file with Loader PyPDFLoader (from_blob.py:load_document:79)
2024-02-03 21:16:17,521 - micro - MainProcess - INFO     Successfully downloaded blob file autogen.pdf (blob_data_extractors.py:extract_content:89)
2024-02-03 21:16:17,558 - micro - MainProcess - INFO     Reading .pdf file from temporary location C:\Users\pablosal\AppData\Local\Temp\tmpc9h3k7qm originally sourced from https://testeastusdev001.blob.core.windows.net/testretrieval/autogen.pdf. 

In [11]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-02-03 21:16:19,836 - micro - MainProcess - INFO     Embedding and indexing initiated for 1408 text chunks. (ai_search_indexing.py:index_text_embeddings:474)
2024-02-03 21:18:35,619 - micro - MainProcess - INFO     Embedding and indexing completed for 1408 text chunks. (ai_search_indexing.py:index_text_embeddings:478)


## Indexing PDFs and DOCX Files from SharePoint


In [12]:
file_names = ["testdocx.docx", "autogen.pdf"]

In [13]:
# Define parameters for the load_files_and_split_into_chunks_from_sharepoint function
splitter_params = {
    "splitter_type": "by_character_recursive",
    "use_encoder": False,
    "chunk_size": 512,
    "chunk_overlap": 128,
    "verbose": False,
    "keep_separator": True,
    "is_separator_regex": False,
    "model_name": "gpt-4",
}

document_chunks_to_index = (
    azure_search_indexer_client.load_files_and_split_into_chunks_from_sharepoint(
        site_domain=os.environ["SITE_DOMAIN"],
        site_name=os.environ["SITE_NAME"],
        file_names=file_names,
        **splitter_params,
    )
)

2024-02-03 21:18:35,662 - micro - MainProcess - INFO     Getting the Site ID... (sharepoint_data_extractor.py:get_site_id:190)
2024-02-03 21:18:36,299 - micro - MainProcess - INFO     Site ID retrieved: mngenvmcap747548.sharepoint.com,877fe60f-a62d-4ed1-8eda-af543c437d2c,ac47d8a7-cd54-4344-bd9d-26ada5a075c0 (sharepoint_data_extractor.py:get_site_id:194)
2024-02-03 21:18:37,022 - micro - MainProcess - INFO     Successfully retrieved drive ID: b!D-Z_hy2m0U6O2q9UPEN9LKfYR6xUzURDvZ0mraWgdcAot0GWx37EQLiVD3sO7-vm (sharepoint_data_extractor.py:get_drive_id:211)
2024-02-03 21:18:37,024 - micro - MainProcess - INFO     Making request to Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:250)
2024-02-03 21:18:37,669 - micro - MainProcess - INFO     Received response from Microsoft Graph API (sharepoint_data_extractor.py:get_files_in_site:253)
2024-02-03 21:18:38,813 - micro - MainProcess - INFO     Reading .docx file from temporary location C:\Users\pablosal\AppData\Local\Temp\t

In [14]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-02-03 21:18:44,160 - micro - MainProcess - INFO     Embedding and indexing initiated for 495 text chunks. (ai_search_indexing.py:index_text_embeddings:474)
2024-02-03 21:19:32,212 - micro - MainProcess - INFO     Embedding and indexing completed for 495 text chunks. (ai_search_indexing.py:index_text_embeddings:478)


## Indexing Vectorized Content from Complex-Layout Documents Leveraging OCR Capabilities

In [15]:
document_blob = "https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf"

In [16]:
# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_title",
    "ocr": True,
    "ocr_output_format": "markdown",
    "pages": "3-7",
}

document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=document_blob,
    **splitter_params,
)

2024-02-03 21:19:32,249 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (ocr_document_intelligence.py:analyze_document:146)
2024-02-03 21:19:33,152 - micro - MainProcess - INFO     Successfully downloaded blob file instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (blob_data_extractors.py:extract_content:89)
2024-02-03 21:20:05,403 - micro - MainProcess - INFO     Successfully extracted content from https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (ocr_data_extractors.py:extract_content:74)
2024-02-03 21:20:05,407 - micro - MainProcess - INFO     Number of chunks: 9 (by_title.py:split_text_by_headings:34)
2024-02-03 21:20:05,408 - micro - MainProcess - INFO     Processed chunk 1 of 9 (by_title.py:combine_chunks:53)
2024-02-03 21:20:05,410 - micro - MainProcess - INFO     Processed chunk 2 of 9 (by_title.py:combine_chunks:53)
2024-02-03 21:20:

In [17]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-02-03 21:20:05,449 - micro - MainProcess - INFO     Embedding and indexing initiated for 5 text chunks. (ai_search_indexing.py:index_text_embeddings:474)
2024-02-03 21:20:06,240 - micro - MainProcess - INFO     Embedding and indexing completed for 5 text chunks. (ai_search_indexing.py:index_text_embeddings:478)
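
The `by_title` splitter used above works on the markdown produced by OCR. The sketch below illustrates the general idea of heading-based splitting with a regular expression. It is not the repository's implementation: `split_markdown_by_headings` and the sample text are hypothetical. The log above reports 9 heading-based chunks but only 5 indexed chunks, which suggests a later combining step; that step is not sketched here.

```python
import re

# Matches ATX-style markdown headings ("# Title" through "###### Title") at the start of a line.
HEADING_PATTERN = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def split_markdown_by_headings(markdown_text):
    """Split markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING_PATTERN.finditer(markdown_text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    ends = starts[1:] + [len(markdown_text)]
    sections = [markdown_text[s:e].strip() for s, e in zip(starts, ends)]
    return [section for section in sections if section]


# Made-up sample text, not taken from the indexed manual.
sample = "Intro text.\n\n# Installation\nMount the unit.\n\n## Wiring\nConnect the loop.\n"
for section in split_markdown_by_headings(sample):
    print(repr(section))
```

Splitting on headings keeps each chunk aligned with a section of the document, which tends to suit manuals with a clear outline. One limitation of this sketch is that a line starting with `#` inside a fenced code block would also be treated as a heading.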


In [18]:
# Define parameters for the load_files_and_split_into_chunks function
splitter_params = {
    "splitter_type": "by_character_recursive",
    "ocr": True,
    "ocr_output_format": "text",
    "pages": "3-7",
    "use_encoder": False,
    "chunk_size": 512,
    "chunk_overlap": 128,
    "keep_separator": True,
    "is_separator_regex": False,
    "verbose": True,
}

document_chunks_to_index = azure_search_indexer_client.load_files_and_split_into_chunks(
    file_paths=document_blob,
    **splitter_params,
)

2024-02-03 21:20:06,254 - micro - MainProcess - INFO     Blob URL detected. Extracting content. (ocr_document_intelligence.py:analyze_document:146)
2024-02-03 21:20:06,413 - micro - MainProcess - INFO     Successfully downloaded blob file instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (blob_data_extractors.py:extract_content:89)
2024-02-03 21:20:37,869 - micro - MainProcess - INFO     Successfully extracted content from https://testeastusdev001.blob.core.windows.net/customskillspdf/instruction-manual-fieldvue-dvc6200-hw2-digital-valve-controller-en-123052.pdf (ocr_data_extractors.py:extract_content:74)
2024-02-03 21:20:37,870 - micro - MainProcess - INFO     Creating a splitter of type: by_character_recursive (by_character.py:get_splitter:60)
2024-02-03 21:20:37,872 - micro - MainProcess - INFO     Obtained splitter of type: RecursiveCharacterTextSplitter (by_character.py:split_documents_in_chunks_from_documents:161)
2024-02-03 21:20:37,875 - micro - Mai

Chunk Number: 1, Character Count: 504, Token Count: 107
Chunk Number: 2, Character Count: 339, Token Count: 63
Chunk Number: 3, Character Count: 362, Token Count: 65
Chunk Number: 4, Character Count: 511, Token Count: 96
Chunk Number: 5, Character Count: 472, Token Count: 95
Chunk Number: 6, Character Count: 510, Token Count: 99
Chunk Number: 7, Character Count: 183, Token Count: 36
Chunk Number: 8, Character Count: 253, Token Count: 74
Chunk Number: 9, Character Count: 408, Token Count: 90
Chunk Number: 10, Character Count: 304, Token Count: 55
Chunk Number: 11, Character Count: 380, Token Count: 70
Chunk Number: 12, Character Count: 494, Token Count: 159
Chunk Number: 13, Character Count: 487, Token Count: 142
Chunk Number: 14, Character Count: 244, Token Count: 55
Chunk Number: 15, Character Count: 502, Token Count: 138
Chunk Number: 16, Character Count: 210, Token Count: 62
Chunk Number: 17, Character Count: 454, Token Count: 88
Chunk Number: 18, Character Count: 456, Token Count: 

In [19]:
# Index the document chunks using the Azure Search Indexer client
azure_search_indexer_client.index_text_embeddings(document_chunks_to_index)

2024-02-03 21:20:37,899 - micro - MainProcess - INFO     Embedding and indexing initiated for 38 text chunks. (ai_search_indexing.py:index_text_embeddings:474)
2024-02-03 21:20:41,377 - micro - MainProcess - INFO     Embedding and indexing completed for 38 text chunks. (ai_search_indexing.py:index_text_embeddings:478)
