## 📚 Prerequisites

Before executing this notebook, make sure you have properly set up your Azure Services, created your Conda environment, and configured your environment variables as per the instructions provided in the [README.md](README.md) file.

## 📋 Table of Contents

This notebook guides you through the following sections:

> **💡 Note:** Please refer to the notebook `01-creation-indexes.ipynb` for detailed information and steps on how to create Azure AI Search Indexes.

1. [**Indexing Vectorized Content from Documents**](#index-documents)
    - Chunk, vectorize, and index local PDF files and website addresses.
    - Download, chunk, vectorize, and index all `.docx` files from a SharePoint site.
    - Download PDF files stored in Blob Storage, apply complex OCR processing through GPT-4 Vision, chunk and vectorize the content, and finally index the processed data in Azure AI Search.
    
2. [**Indexing Vectorized Content from Domain Knowledge Documents Containing Complex Images and More**](#index-images)
    - Leverage complex OCR, image recognition, and summarization capabilities using GPT-4 Vision. Chunk, vectorize, and index extracted metadata from images stored in Blob Storage.

3. [**Indexing Vectorized Content from Audio**](#index-audio)
    - Process WAV audio files stored in Blob Storage using Azure AI Speech Translator capabilities, then chunk, vectorize, and index the content in Azure AI Search.

In [1]:
import os

# Define the target directory
target_directory = r"C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing"  # change your directory here

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing
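
The cell above relies on a hard-coded absolute path that every user has to edit. As a hedged alternative, the sketch below walks up from the current working directory until it finds a marker file. The helper name `find_project_root` is hypothetical, and using `README.md` as the marker assumes that file sits at the repository root.

```python
import os
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "README.md") -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker file name is an assumption, not something
    this repository defines.
    """
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    return None


root = find_project_root(Path.cwd())
if root is not None:
    # Switch to the detected root so relative paths such as utils\data resolve.
    os.chdir(root)
    print(f"Directory changed to {os.getcwd()}")
else:
    print("No marker file found; set the working directory manually.")
```

If the notebook is launched from outside the repository, the search finds nothing and the manual approach from the previous cell is still required.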


## Create Azure AI Search Indexes 

Please refer to the notebook [01-creation-indexes.ipynb](01-creation-indexes.ipynb) for detailed information and steps on how to create Azure AI Search Indexes. 

# Indexing Vectorized Content from Documents

In [2]:
# Import the AzureAIndexer class from the ai_search_indexing module
from src.indexers.ai_search_indexing import AzureAIndexer

DEPLOYMENT_NAME = "foundational-ada"

# Create an instance of the AzureAIndexer class
azure_search_indexer_client = AzureAIndexer(
    index_name="test-diferences", embedding_azure_deployment_name=DEPLOYMENT_NAME
)

2024-01-12 17:36:01,861 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model, deployment foundational-ada, and chunk size 1000 (ai_search_indexing.py:load_embedding_model:150)
  warn_deprecated(
  warn_deprecated(
2024-01-12 17:36:03,371 - micro - MainProcess - INFO     AzureOpenAIEmbeddings object has been created successfully. You can now access the embeddings
                using the '.embeddings' attribute. (ai_search_indexing.py:load_embedding_model:161)
vector_search_configuration is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored
2024-01-12 17:36:04,799 - micro - MainProcess - INFO     The Azure AI search index 'test-diferences' has been loaded correctly. (ai_search_indexing.py:load_azureai_index:212)


## Indexing PDFs

In [3]:
pdf_path = "utils\\data\\autogen.pdf"
url_pdf = "https://arxiv.org/pdf/2308.08155.pdf"
blob_path = "https://testeastusdev001.blob.core.windows.net/testretrieval/autogen.pdf"

In [4]:
chunks = azure_search_indexer_client.load_and_chunck_files(
    file_paths=blob_path,
    splitter_type="recursive",
    chunk_size=512,
    chunk_overlap=128,
    verbose=True,
)

2024-01-12 17:36:08,627 - micro - MainProcess - INFO     Reading .pdf file from C:\Users\pablosal\AppData\Local\Temp\tmp3edo4zo3. (loading.py:load_file:96)
2024-01-12 17:36:08,629 - micro - MainProcess - INFO     Loading file with Loader PyPDFLoader (loading.py:load_file:106)
2024-01-12 17:36:10,756 - micro - MainProcess - INFO     Deleted temporary file: C:\Users\pablosal\AppData\Local\Temp\tmp3edo4zo3 (loading.py:load_file_from_blob:138)
2024-01-12 17:36:10,758 - micro - MainProcess - INFO     Creating a splitter of type: recursive (chunking.py:get_splitter:56)
2024-01-12 17:36:10,759 - micro - MainProcess - INFO     Using tiktoken encoder: cl100k_base (chunking.py:get_splitter:64)
2024-01-12 17:36:10,760 - micro - MainProcess - INFO     Obtained splitter of type: RecursiveCharacterTextSplitter (chunking.py:split_documents_in_chunks_from_documents:161)
2024-01-12 17:36:10,892 - micro - MainProcess - INFO     Number of chunks obtained: 110 (chunking.py:split_documents_in_chunks_from_d

43
Chunk Number: 1, Character Count: 2114, Token Count: 494
Chunk Number: 2, Character Count: 671, Token Count: 142
Chunk Number: 3, Character Count: 2387, Token Count: 494
Chunk Number: 4, Character Count: 2554, Token Count: 493
Chunk Number: 5, Character Count: 1485, Token Count: 289
Chunk Number: 6, Character Count: 2508, Token Count: 494
Chunk Number: 7, Character Count: 2392, Token Count: 499
Chunk Number: 8, Character Count: 620, Token Count: 132
Chunk Number: 9, Character Count: 2302, Token Count: 492
Chunk Number: 10, Character Count: 1936, Token Count: 362
Chunk Number: 11, Character Count: 2542, Token Count: 492
Chunk Number: 12, Character Count: 2557, Token Count: 502
Chunk Number: 13, Character Count: 1063, Token Count: 228
Chunk Number: 14, Character Count: 2139, Token Count: 489
Chunk Number: 15, Character Count: 1962, Token Count: 414
Chunk Number: 16, Character Count: 1589, Token Count: 487
Chunk Number: 17, Character Count: 1958, Token Count: 451
Chunk Number: 18, Char
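
To make the `chunk_size=512` and `chunk_overlap=128` settings concrete, here is a minimal sliding-window sketch. It is a simplification, not what the recursive splitter does internally: the recursive splitter prefers natural boundaries, which is why the token counts above vary. The function name `chunk_with_overlap` and the synthetic word list are assumptions made for illustration.

```python
from typing import List


def chunk_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # how far the window advances each step
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Synthetic "tokens" stand in for the tiktoken tokens used in this notebook.
words = [f"w{i}" for i in range(1000)]
chunks = chunk_with_overlap(words, chunk_size=512, chunk_overlap=128)
print([len(c) for c in chunks])  # [512, 512, 232]
```

With these settings each window advances by 384 tokens, so the last 128 tokens of one chunk reappear at the start of the next.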

## Indexing DOCX Files

In [6]:
word_path = "utils\\data\\test.docx"
word_url = "https://testeastusdev001.blob.core.windows.net/testretrieval/test.docx"

In [8]:
chunks = azure_search_indexer_client.load_and_chunck_files(
    file_paths=word_url,
    splitter_type="recursive",
    chunk_size=512,
    chunk_overlap=128,
    verbose=True,
)

2024-01-12 17:36:46,032 - micro - MainProcess - INFO     Reading .docx file from C:\Users\pablosal\AppData\Local\Temp\tmpvx76sq_j. (loading.py:load_file:96)
2024-01-12 17:36:46,034 - micro - MainProcess - INFO     Loading file with Loader Docx2txtLoader (loading.py:load_file:106)


2024-01-12 17:36:46,256 - micro - MainProcess - INFO     Deleted temporary file: C:\Users\pablosal\AppData\Local\Temp\tmpvx76sq_j (loading.py:load_file_from_blob:138)
2024-01-12 17:36:46,258 - micro - MainProcess - INFO     Creating a splitter of type: recursive (chunking.py:get_splitter:56)
2024-01-12 17:36:46,259 - micro - MainProcess - INFO     Using tiktoken encoder: cl100k_base (chunking.py:get_splitter:64)
2024-01-12 17:36:46,260 - micro - MainProcess - INFO     Obtained splitter of type: RecursiveCharacterTextSplitter (chunking.py:split_documents_in_chunks_from_documents:161)
2024-01-12 17:36:46,292 - micro - MainProcess - INFO     Number of chunks obtained: 15 (chunking.py:split_documents_in_chunks_from_documents:164)


1
Chunk Number: 1, Character Count: 2105, Token Count: 465
Chunk Number: 2, Character Count: 1713, Token Count: 404
Chunk Number: 3, Character Count: 2155, Token Count: 496
Chunk Number: 4, Character Count: 2016, Token Count: 439
Chunk Number: 5, Character Count: 2337, Token Count: 482
Chunk Number: 6, Character Count: 1680, Token Count: 359
Chunk Number: 7, Character Count: 2199, Token Count: 500
Chunk Number: 8, Character Count: 1414, Token Count: 355
Chunk Number: 9, Character Count: 1984, Token Count: 467
Chunk Number: 10, Character Count: 2181, Token Count: 469
Chunk Number: 11, Character Count: 1466, Token Count: 323
Chunk Number: 12, Character Count: 2084, Token Count: 458
Chunk Number: 13, Character Count: 2364, Token Count: 491
Chunk Number: 14, Character Count: 2178, Token Count: 420
Chunk Number: 15, Character Count: 2537, Token Count: 504
15


## Indexing Docx

In [31]:
docs_chunked = split_documents_in_chunks_from_documents(
    docs, chunk_size=512, chunk_overlap=128, use_recursive_splitter=True
)

2024-01-09 09:10:57,577 - micro - MainProcess - INFO     Obtained splitter of type: RecursiveCharacterTextSplitter (chunking.py:split_documents_in_chunks_from_documents:95)
2024-01-09 09:10:57,593 - micro - MainProcess - INFO     Number of chunks obtained: 418 (chunking.py:split_documents_in_chunks_from_documents:98)


In [22]:
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)

In [29]:
text_splitter_1 = CharacterTextSplitter(chunk_size=512, chunk_overlap=128)
text_splitter_2 = CharacterTextSplitter(chunk_size=512, chunk_overlap=128, separator="")

In [30]:
a = text_splitter_1.split_documents(docs)
print(len(a))
b = text_splitter_2.split_documents(docs)
print(len(b))

43
412
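
One plausible reading of the 43 versus 412 result: `CharacterTextSplitter` splits on its separator first (the default is `"\n\n"`) and only then merges pieces up to `chunk_size`. If the extracted page text contains no blank lines, each page stays as one oversized piece, so 43 would match the number of loaded pages (an assumption, since `docs` is not shown here). With `separator=""` the text is split per character and regrouped into chunks of at most 512 characters. The sketch below is a simplified illustration of that behavior, not LangChain's implementation, and it ignores overlap; `split_by_separator` is a hypothetical name.

```python
from typing import List


def split_by_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Simplified separator-based chunking without overlap.

    Pieces come from splitting on `separator` and are merged greedily while
    the result fits in `chunk_size`. A single piece longer than `chunk_size`
    is kept whole, because there is no separator inside it to cut at.
    """
    pieces = list(text) if separator == "" else text.split(separator)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# A 2000-character "page" with no blank lines, similar to extracted PDF text.
page = "word " * 400
by_blank_line = split_by_separator(page, "\n\n", 512)
by_character = split_by_separator(page, "", 512)
print(len(by_blank_line), len(by_character))  # 1 4
```

The same page yields one oversized chunk with the blank-line separator and four bounded chunks with the empty separator, which mirrors the gap between the two counts above.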


In [19]:
len(docs_chunked)

43

In [16]:
docs_chunked

[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…', metadata={'source': 'C:\\Users\\pablosal\\Desktop\\gbbai-azure-ai-search-indexing\\utils\\data\\autogen.pdf', 'page': 0}),
 Document(page_content='±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError pa

In [14]:
len(docs_chunked)

207

In [18]:
azure_search_indexer_client.load_and_chunck_text_by_character_from_pdf(
    pdf_path=pdf_path, chunk_size=200, chunk_overlap=100
)

2024-01-09 08:32:55,847 - micro - MainProcess - INFO     Reading PDF file from C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing\utils\data\autogen.pdf. (pdf_data_extractor.py:read_and_load_pdf:39)
2024-01-09 08:32:57,951 - micro - MainProcess - INFO     Obtained splitter of type: RecursiveCharacterTextSplitter (chunking.py:split_documents_in_chunks_from_documents:95)
2024-01-09 08:32:57,981 - micro - MainProcess - INFO     Number of chunks obtained: 1486 (chunking.py:split_documents_in_chunks_from_documents:98)


43
1486


[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗', metadata={'source': 'C:\\Users\\pablosal\\Desktop\\gbbai-azure-ai-search-indexing\\utils\\data\\autogen.pdf', 'page': 0}),
 Document(page_content='Qingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1', metadata={'source': 'C:\\Users\\pablosal\\Desktop\\gbbai-azure-ai-search-indexing\\utils\\data\\autogen.pdf', 'page': 0}),
 Document(page_content='Ahmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent', metadata={'source': 'C:\\Users\\pablosal\\Desktop\\gbbai-azure-ai-search-indexing\\utils\\data\\autogen.pdf', 'page': 0}),
 Document(page_content='±U

In [9]:
azure_search_indexer_client.read_and_load_pdf(pdf_url=url_pdf)

2024-01-07 20:38:13,644 - micro - MainProcess - INFO     Reading PDF file from https://arxiv.org/pdf/2308.08155.pdf. (langchain_integration.py:read_and_load_pdf:366)


[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError package yfinanceis not installed\nSorry! Please first pip install yfinanceand then execute the code\nInstalling…\nExample Agent Chat\nFigure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left)\nAutoGen agents are conversable, customizable, and can b

In [7]:
azure_search_indexer_client.read_and_load_pdf(pdf_path=pdf_path)

2024-01-07 20:37:47,373 - micro - MainProcess - INFO     Reading PDF file from C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing\utils\data\autogen.pdf. (langchain_integration.py:read_and_load_pdf:342)


[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError package yfinanceis not installed\nSorry! Please first pip install yfinanceand then execute the code\nInstalling…\nExample Agent Chat\nFigure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left)\nAutoGen agents are conversable, customizable, and can b

In [3]:
# Load a local PDF and split it into chunks
# Define the path of the PDF file to load
file_1 = "utils\\data\\ultraflex_user_manual.pdf"

# Set the chunk size and overlap size for splitting the text
CHUNK_SIZE = 512
OVERLAP_SIZE = 128
# Raw string so the regex escapes are not treated as string escapes (not passed to the call below)
SEPARATOR = r"(\n\w|\w\n)"

# Load the PDF, split the text into chunks, and store the chunks
# The text is split into chunks of size CHUNK_SIZE, with an overlap of OVERLAP_SIZE between consecutive chunks
text_chuncked = gbb_ai_client.load_and_split_text_by_character_from_pdf(
    source=file_1, chunk_size=CHUNK_SIZE, chunk_overlap=OVERLAP_SIZE
)

# Embed the chunks and index them in Azure AI Search
# This function converts the text chunks into vector embeddings and stores them in the Azure AI Search index
gbb_ai_client.embed_and_index(text_chuncked)

2024-01-07 17:11:25,576 - micro - MainProcess - INFO     Reading PDF files from C:\Users\pablosal\Desktop\gbbai-azure-ai-search-indexing\utils\data\ultraflex_user_manual.pdf. (indexing_azureai_search.py:read_and_load_pdfs:336)
2024-01-07 17:11:37,051 - micro - MainProcess - INFO     Starting to embed and index 39 chuncks. (indexing_azureai_search.py:embed_and_index:402)
2024-01-07 17:11:41,483 - micro - MainProcess - INFO     Successfully embedded and indexed 39 chuncks. (indexing_azureai_search.py:embed_and_index:404)


In [18]:
pdf_path = "utils\\data\\autogen.pdf"
url_pdf = "https://arxiv.org/pdf/2308.08155.pdf"

In [9]:
from langchain.document_loaders import PyPDFLoader, WebBaseLoader

In [24]:
loader = PyPDFLoader(pdf_path)
document_path = loader.load()

In [21]:
loader = PyPDFLoader(url_pdf)
document_url = loader.load()

In [25]:
document_path

[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError package yfinanceis not installed\nSorry! Please first pip install yfinanceand then execute the code\nInstalling…\nExample Agent Chat\nFigure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left)\nAutoGen agents are conversable, customizable, and can b

In [26]:
document_url

[Document(page_content='AutoGen : Enabling Next-Gen LLM\nApplications via Multi-Agent Conversation\nQingyun Wu†, Gagan Bansal∗, Jieyu Zhang±, Yiran Wu†, Beibin Li∗\nErkang Zhu∗, Li Jiang∗, Xiaoyun Zhang∗, Shaokun Zhang†, Jiale Liu∓\nAhmed Awadallah∗, Ryen W. White∗, Doug Burger∗, Chi Wang∗1\n∗Microsoft Research,†Pennsylvania State University\n±University of Washington,∓Xidian University\nAgent CustomizationConversable agent\nFlexible Conversation Patterns\n…\n…\n…\n…\n…\n…\n…\nHierarchical chatJoint chatMulti-Agent Conversations…Execute the following code…\nGot it! Here is the revised code …No, please plot % change!Plot a chart of META and TESLA stock price change YTD.\nOutput:$Month\nOutput:%MonthError package yfinanceis not installed\nSorry! Please first pip install yfinanceand then execute the code\nInstalling…\nExample Agent Chat\nFigure 1: AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left)\nAutoGen agents are conversable, customizable, and can b

In [23]:
document_url == document_path

False
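
The `False` result does not by itself show that the extracted text differs. Comparing the two lists also compares each document's metadata, and the `source` entry may differ between a local path and a URL. That cause is an assumption here (the arXiv copy could also be a different version of the paper). The sketch below uses a hypothetical `Page` dataclass as a stand-in for a document object and compares only the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    """Hypothetical stand-in for a loaded document page (text plus metadata)."""

    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def same_text(left: List[Page], right: List[Page]) -> bool:
    """Compare page text only, ignoring metadata such as the source path."""
    return [p.page_content for p in left] == [p.page_content for p in right]


# Illustrative values only: identical text, different 'source' metadata.
local = [Page("AutoGen : Enabling Next-Gen LLM", {"source": "utils\\data\\autogen.pdf", "page": 0})]
remote = [Page("AutoGen : Enabling Next-Gen LLM", {"source": "https://arxiv.org/pdf/2308.08155.pdf", "page": 0})]

print(local == remote)           # False: the metadata differs
print(same_text(local, remote))  # True: the text is identical
```

Applying the same idea to `document_path` and `document_url`, by comparing their `page_content` values page by page, would show whether the two loads actually disagree on content.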