# How to deal with complex/large Documents

In the previous notebook, we developed a solution for the various file types and data formats commonly found in organizations, and this covers the vast majority of use cases. However, you will run into issues when questions require answers from complex files. The complexity of these files arises from their length and the way information is distributed within them. Large documents are always a challenge for search engines.

One example of such complex files is Technical Specification Guides or Product Manuals, which can span hundreds of pages and contain information in the form of images, tables, forms, and more. Books are also complex due to their length and the presence of images or tables.

These files are typically in PDF format. To handle these PDFs better, we need a smarter parsing method that treats each document as a special source and processes it page by page (1 page = 1 chunk). The objective is to obtain more accurate and faster answers from our system. Fortunately, an organization usually has few documents of this type, which allows us to make exceptions and treat them differently.
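To make the "1 page = 1 chunk" idea concrete, here is a minimal sketch of the data structure involved. It assumes each chunk is a tuple of (page index, character offset, page text), which mirrors how later cells read `page[0]` and `page[2]`; the meaning of the middle element and the helper name `pages_to_chunks` are assumptions for illustration, not the actual `parse_pdf` implementation.

```python
from typing import List, Tuple

# Assumed chunk layout: (zero-based page index, character offset of the page start, page text)
PageChunk = Tuple[int, int, str]


def pages_to_chunks(page_texts: List[str]) -> List[PageChunk]:
    """Turn a list of per-page texts into one chunk per page."""
    chunks: List[PageChunk] = []
    offset = 0
    for page_index, text in enumerate(page_texts):
        chunks.append((page_index, offset, text))
        offset += len(text)  # the next page starts where this one ends
    return chunks


sample_pages = ["Title page", "Chapter 1 starts here.", "More of chapter 1."]
chunks = pages_to_chunks(sample_pages)
for page_index, offset, text in chunks:
    # Page numbers shown to users are usually one-based
    print(page_index + 1, offset, text[:20])
```

Because every chunk carries its page index, an answer can later point the reader to the exact page it came from.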

If your use case is just PDFs, for example, you can use the [PyPDF library](https://pypi.org/project/pypdf/) or the [Azure AI Document Intelligence SDK (formerly Form Recognizer)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.0.0), vectorize with the OpenAI API, and push the content to a vector-based index. This is probably the simplest and fastest way to go. However, if your use case entails connecting to a data lake, SharePoint libraries, or any other document source with thousands of documents, multiple file types, and content that changes dynamically, then you will want to use the ingestion, document cracking, and AI enrichment capabilities of the Azure Search engine (Notebooks 1-3) and avoid a lot of painful custom code.


In [1]:
import os
import io
import json
import time
import requests
import random
import uuid
import shutil
import zipfile
from collections import OrderedDict
import urllib.request
from tqdm import tqdm

from typing import List

from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_core.runnables import ConfigurableField
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

from operator import itemgetter

from common.utils import upload_file_to_blob, extract_zip_file, upload_directory_to_blob
from common.utils import parse_pdf, read_pdf_files
from common.prompts import DOCSEARCH_PROMPT_TEXT
from common.utils import CustomAzureSearchRetriever


from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv("credentials.env")

def printmd(string):
    display(Markdown(string))
    


In [2]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]

## Upload local dataset to Blob Container

In [3]:
%%time

# Define connection string and other parameters
BLOB_CONTAINER_NAME = "books"
BLOB_NAME = "books.zip"
LOCAL_FILE_PATH = "./data/" + BLOB_NAME  # Path to the local file you want to upload
upload_directory = "./data/temp_extract"  # Temporary directory to extract the zip file

# Extract the zip file
extract_zip_file(LOCAL_FILE_PATH, upload_directory)

# Upload the extracted files and folder structure
upload_directory_to_blob(upload_directory, BLOB_CONTAINER_NAME)

# Clean up: Optionally, you can remove the temp folder after uploading
shutil.rmtree(upload_directory)
print(f"Temp Folder: {upload_directory} removed")

Extracting ./data/books.zip ... 
Extracted ./data/books.zip to ./data/temp_extract


Uploading Files: 100%|████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.43s/it]

Temp Folder: ./data/temp_extract removed
CPU times: user 359 ms, sys: 244 ms, total: 603 ms
Wall time: 6.09 s





## Manual Document Cracking with Push to Vector-based Index

### What to use: pyPDF or AI Document Intelligence API (Form Recognizer)?

In `utils.py` there is a **parse_pdf()** function. This utility function can parse local files using the PyPDF library, and can also parse local or `from_url` PDF files using Azure AI Document Intelligence (formerly Form Recognizer).

If `form_recognizer=False`, the function parses the PDF using the Python pyPDF library, which does a good job about 75% of the time.

Setting `form_recognizer=True` selects the best (and slower) parsing method, the AI Document Intelligence API (formerly known as Form Recognizer). You can specify the prebuilt model to use; the default is `model="prebuilt-document"`. However, if you have a complex document with tables, charts, and figures, you can try `model="prebuilt-layout"`, which captures all of the nuances of each page (it takes longer, of course).

**Note: Many PDFs are scanned images. For example, any signed contract that was scanned and saved as a PDF will NOT be parsed by pyPDF. Only the AI Document Intelligence API will work.**
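One way to decide when a document needs the OCR-capable parser is to inspect the text that the fast parser returned. The sketch below is a rough heuristic, not part of `utils.py`: the function names and both thresholds are assumptions chosen for illustration, and short but legitimate pages (blank pages, title pages) will also be flagged.

```python
from typing import List


def looks_unparsed(text: str, min_chars: int = 40, min_alnum_ratio: float = 0.5) -> bool:
    """Return True when extracted page text is probably not real content."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        return True  # almost nothing was extracted
    alnum = sum(ch.isalnum() for ch in stripped)
    # Mostly symbols (for example rows of '~') suggests a failed extraction
    return alnum / len(stripped) < min_alnum_ratio


def share_of_unparsed_pages(page_texts: List[str]) -> float:
    """Fraction of pages that look unparsed (0.0 for an empty document)."""
    if not page_texts:
        return 0.0
    return sum(looks_unparsed(t) for t in page_texts) / len(page_texts)


garbage_page = "~~~~~~~~~~~\n~~~~~~~~~~~~~~~~~~~~~~~\n" * 5
good_page = "tremendous confusion about when it is biblically appropriate to set limits."
print(looks_unparsed(garbage_page), looks_unparsed(good_page))
print(share_of_unparsed_pages([garbage_page, good_page]))
```

If a large share of pages is flagged, re-running the document with `form_recognizer=True` is a reasonable next step.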

In [3]:
BLOB_NAME = "books.zip"
LOCAL_FILE_PATH = "./data/" + BLOB_NAME  # Path to the local file you want to upload

In [4]:
# Dictionary to store the parsed data for each book
book_pages_map = dict()

# Open the zip file
with zipfile.ZipFile(LOCAL_FILE_PATH, 'r') as zip_ref:
    # Iterate over the PDF files inside the zip archive
    for file_info in zip_ref.infolist():
        if file_info.filename.endswith('.pdf'):
            book = file_info.filename
            
            print("Extracting Text from", book, "...")
            
            # Read the PDF file directly into memory (as a binary stream)
            with zip_ref.open(file_info) as file:
                file_stream = io.BytesIO(file.read())  # Convert file to BytesIO for in-memory file handling

                # Capture the start time
                start_time = time.time()

                # Parse the PDF (you would use your actual parse_pdf function here)
                book_map = parse_pdf(file_stream, form_recognizer=False, verbose=True)
                book_pages_map[book] = book_map
                
                # Capture the end time and calculate the elapsed time
                end_time = time.time()
                elapsed_time = end_time - start_time

                print(f"Parsing took: {elapsed_time:.6f} seconds")
                print(f"{book} contained {len(book_map)} pages\n")


Extracting Text from books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf ...
Extracting text using PyPDF
Parsing took: 2.158757 seconds
books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf contained 357 pages

Extracting Text from books/Fundamentals_of_Physics_Textbook.pdf ...
Extracting text using PyPDF


Ignoring wrong pointing object 23 0 (offset 0)
Ignoring wrong pointing object 27 0 (offset 0)
Ignoring wrong pointing object 53 0 (offset 0)
Ignoring wrong pointing object 198 0 (offset 0)
Ignoring wrong pointing object 205 0 (offset 0)
Ignoring wrong pointing object 272 0 (offset 0)
Ignoring wrong pointing object 307 0 (offset 0)
Ignoring wrong pointing object 326 0 (offset 0)
Ignoring wrong pointing object 345 0 (offset 0)
Ignoring wrong pointing object 637 0 (offset 0)
Ignoring wrong pointing object 638 0 (offset 0)
Ignoring wrong pointing object 640 0 (offset 0)
Ignoring wrong pointing object 641 0 (offset 0)
Ignoring wrong pointing object 1638 0 (offset 0)
Ignoring wrong pointing object 1685 0 (offset 0)
Ignoring wrong pointing object 2511 0 (offset 0)
Ignoring wrong pointing object 2516 0 (offset 0)
Ignoring wrong pointing object 2780 0 (offset 0)
Ignoring wrong pointing object 2816 0 (offset 0)
Ignoring wrong pointing object 3617 0 (offset 0)
Ignoring wrong pointing object 3757 

Parsing took: 109.123985 seconds
books/Fundamentals_of_Physics_Textbook.pdf contained 1450 pages

Extracting Text from books/Made_To_Stick.pdf ...
Extracting text using PyPDF
Parsing took: 8.691508 seconds
books/Made_To_Stick.pdf contained 225 pages

Extracting Text from books/Pere_Riche_Pere_Pauvre.pdf ...
Extracting text using PyPDF
Parsing took: 0.953015 seconds
books/Pere_Riche_Pere_Pauvre.pdf contained 225 pages



Now let's check a random page of each book to make sure the parsing was done correctly:

In [5]:
for bookname,bookmap in book_pages_map.items():
    print(bookname,"\n","chunk text:",bookmap[random.randint(10, 50)][2][:120],"...\n")

books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf 
 chunk text: 24
tremendous confusion about when it is biblically appropriate to
set limits. When confronted with their lack of bounda ...

books/Fundamentals_of_Physics_Textbook.pdf 
 chunk text: 12 CHAPTER 1 MEASUREMENT
51 The cubit is an ancient unit of length based on the distance
between the elbow and the tip o ...

books/Made_To_Stick.pdf 
 chunk text: How do we find the essential core of our ideas? A s uccessful defense 
lawyer says, "If you argue ten points, even if ea ...

books/Pere_Riche_Pere_Pauvre.pdf 
 chunk text: ~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ...



As we can see above, all books were parsed except `Pere_Riche_Pere_Pauvre.pdf` (this book is "Rich Dad, Poor Dad" written in French). Why? As noted earlier, pyPDF cannot parse scanned PDFs, and this book was scanned: each page is an image with a very distinctive font. We need a PDF parser with good OCR capabilities to extract the content of this PDF.
Let's try to parse this book again, but this time using the Azure Document Intelligence API (formerly Form Recognizer).

In [6]:
%%time
book = "books/Pere_Riche_Pere_Pauvre.pdf"
with zipfile.ZipFile(LOCAL_FILE_PATH, 'r') as zip_ref:
    with zip_ref.open(book) as file:
                file_stream = io.BytesIO(file.read())  # Convert file to BytesIO for in-memory file handling

                # Capture the start time
                start_time = time.time()

                # Parse the PDF (you would use your actual parse_pdf function here)
                book_map = parse_pdf(file_stream, form_recognizer=True, model="prebuilt-document",from_url=False, verbose=True)
                book_pages_map[book] = book_map
                
                # Capture the end time and calculate the elapsed time
                end_time = time.time()
                elapsed_time = end_time - start_time

                print(f"Parsing took: {elapsed_time:.6f} seconds")
                print(f"{book} contained {len(book_map)} pages\n")

Extracting text using Azure Document Intelligence
Parsing took: 51.204662 seconds
books/Pere_Riche_Pere_Pauvre.pdf contained 225 pages

CPU times: user 13.1 s, sys: 437 ms, total: 13.5 s
Wall time: 51.2 s


In [7]:
print(book,"\n","chunk text:",book_map[random.randint(10, 50)][2][:80],"...\n")

books/Pere_Riche_Pere_Pauvre.pdf 
 chunk text: La principale inquiétude de Robert était l'écart croissant entre les riches et l ...



As demonstrated above, Azure Document Intelligence succeeds on a scanned PDF where pyPDF fails. **For production scenarios, we strongly recommend using Azure Document Intelligence consistently**. When doing so, it is important to choose wisely between the available models, such as "prebuilt-document", "prebuilt-layout", or others. You can find more information on model selection [HERE](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature?view=doc-intel-3.0.0).


## Create Vector-based index


Now that we have the content of the books' chunks (each page of each book) in the dictionary `book_pages_map`, let's create the vector index in our Azure Search engine where this content is going to land.

In [8]:
batch_size = 75
embedder = AzureOpenAIEmbeddings(deployment=os.environ["EMBEDDING_DEPLOYMENT_NAME"], chunk_size=batch_size, 
                                 max_retries=2, 
                                 retry_min_seconds= 60,
                                 retry_max_seconds= 70)

In [9]:
book_index_name = "srch-index-books"

In [10]:
### Create Azure Search Vector-based Index
# Setup the Payloads header
headers = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
params = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}


Please note the following points regarding the index:

- The ParentKey field is absent.
- The page_num field is present.

The ParentKey field is absent because we use a PUSH method rather than a PULL method. In other words, we are not leveraging the integrated indexing provided by the Azure AI Search engine. Instead, we parse the documents, perform OCR, and manually create and push the content along with its vectors.

This manual parsing uses either the pyPDF library or the Azure Document Intelligence API. Both allow content to be segmented by page rather than by a specified number of characters, which is the method employed by the Azure AI Search indexer. Consequently, we can include page_num as a field in our index.

The latest Azure AI Search API supports external and internal vectorization. This Notebook assumes an external vectorization strategy. This API also supports:
    
- vectorSearch algorithms, hnsw and exhaustiveKnn nearest neighbors, with parameters for indexing and scoring.
- vectorProfiles for multiple combinations of algorithm configurations.

Vector search algorithms include **exhaustive k-nearest neighbors (KNN)** and **Hierarchical Navigable Small World (HNSW)**. Exhaustive KNN performs a brute-force search that scans the entire vector space. HNSW performs an approximate nearest neighbor (ANN) search. While KNN provides exact nearest neighbor search results with high accuracy, its computational cost and poor scalability make it impractical for large datasets or real-time applications. HNSW, on the other hand, offers a highly efficient and scalable solution for nearest neighbor searches by finding approximate nearest neighbors quickly, making it more suitable for large-scale and high-dimensional data applications.
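To illustrate what "brute-force" means here, the following is a minimal pure-Python sketch of exhaustive KNN with cosine similarity. It is a teaching example under simplifying assumptions (tiny in-memory vectors, no indexing structure), not how Azure AI Search implements `exhaustiveKnn` internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exhaustive_knn(query: Sequence[float],
                   vectors: List[Sequence[float]],
                   k: int) -> List[Tuple[int, float]]:
    """Score the query against EVERY vector and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
top = exhaustive_knn([1.0, 0.1], docs, k=2)
print([index for index, _ in top])
```

The cost grows linearly with the number of vectors because every one of them is scored, which is the scalability limit described above. HNSW avoids the full scan by navigating a graph of neighbors, at the price of approximate results.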


Check [HERE](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-create-index?tabs=config-2023-10-01-Preview%2Crest-2023-11-01%2Cpush%2Cportal-check-index) for the details of the vector configuration.

**Note**: Unlike Notebooks 1 and 2, we will not add any vector compression to this index. This approach allows you to compare the resulting index sizes across all three indexes.
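As a back-of-the-envelope reference for that comparison, the sketch below estimates the raw size of the uncompressed vectors. It assumes one vector per page, the 3072 dimensions declared in this notebook's index definition, and 4 bytes per `Edm.Single` value; the real index size will differ because it also stores text fields and internal structures.

```python
DIMENSIONS = 3072          # "dimensions" of the chunkVector field in this notebook
BYTES_PER_FLOAT32 = 4      # Edm.Single is a 32-bit float

# Page counts reported by the parsing step earlier in this notebook
pages_per_book = {
    "Boundaries": 357,
    "Fundamentals_of_Physics": 1450,
    "Made_To_Stick": 225,
    "Pere_Riche_Pere_Pauvre": 225,
}

total_pages = sum(pages_per_book.values())
bytes_per_vector = DIMENSIONS * BYTES_PER_FLOAT32
raw_vector_bytes = total_pages * bytes_per_vector

print(f"{total_pages} vectors x {bytes_per_vector} bytes = {raw_vector_bytes} bytes")
print(f"about {raw_vector_bytes / 1024**2:.2f} MiB of raw vector data")
```

This gives a floor to compare against the sizes reported for the compressed indexes in the earlier notebooks.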

In [11]:
index_payload = {
    "name": book_index_name,
    "vectorSearch": {
        "algorithms": [  # Three example search algorithm configurations (two HNSW, one exhaustive KNN)
             {
                 "name": "my-hnsw-config-1",
                 "kind": "hnsw",
                 "hnswParameters": {
                     "m": 4,
                     "efConstruction": 400,
                     "efSearch": 500,
                     "metric": "cosine"
                 }
             },
             {
                 "name": "my-hnsw-config-2",
                 "kind": "hnsw",
                 "hnswParameters": {
                     "m": 8,
                     "efConstruction": 800,
                     "efSearch": 800,
                     "metric": "cosine"
                 }
             },
             {
                 "name": "my-eknn-config",
                 "kind": "exhaustiveKnn",
                 "exhaustiveKnnParameters": {
                     "metric": "cosine"
                 }
             }
        ],
        "vectorizers": [
            {
                "name": "openai",
                "kind": "azureOpenAI",
                "azureOpenAIParameters":
                {
                    "resourceUri" : os.environ['AZURE_OPENAI_ENDPOINT'],
                    "apiKey" : os.environ['AZURE_OPENAI_API_KEY'],
                    "deploymentId" : os.environ['EMBEDDING_DEPLOYMENT_NAME'],
                    "modelName" : os.environ['EMBEDDING_DEPLOYMENT_NAME']
                }
            }
        ],
        "profiles": [  # profiles are the different combinations of algorithms and vectorizers
            {
             "name": "my-vector-profile-1",
             "algorithm": "my-hnsw-config-1",
             "vectorizer":"openai"
            },
            {
             "name": "my-vector-profile-2",
             "algorithm": "my-hnsw-config-2",
             "vectorizer":"openai"
            },
            {
             "name": "my-vector-profile-3",
             "algorithm": "my-eknn-config",
             "vectorizer":"openai"
            }
        ]
    },
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": {
                        "fieldName": "title"
                    },
                    "prioritizedContentFields": [
                        {
                            "fieldName": "chunk"
                        }
                    ],
                    "prioritizedKeywordsFields": []
                }
            }
        ]
    },
    "fields": [
        {"name": "id", "type": "Edm.String", "key": "true", "filterable": "true" },
        {"name": "title","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "chunk","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "name", "type": "Edm.String", "searchable": "true", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "location", "type": "Edm.String", "searchable": "false", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "page_num","type": "Edm.Int32","searchable": "false","retrievable": "true"},
        {
            "name": "chunkVector",
            "type": "Collection(Edm.Single)",
            "dimensions": 3072,
            "vectorSearchProfile": "my-vector-profile-3", # we picked profile 3 to show that this index uses eKNN vs HNSW (on prior notebooks)
            "searchable": "true",
            "retrievable": "true",
            "filterable": "false",
            "sortable": "false",
            "facetable": "false"
        }
        
    ],
}

r = requests.put(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + book_index_name,
                 data=json.dumps(index_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

204
True


In [13]:
# Uncomment to debug errors
# r.text

## Upload the Document chunks and its vectors to the Index

The following code will iterate over each chunk of each book and use the Azure Search Rest API upload method to insert each document with its corresponding vector (using OpenAI embedding model) to the index.
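The pages are sent in batches rather than one by one. The sketch below shows the slicing pattern on its own, using the page counts reported by the parsing step and a batch size of 75; `iter_batches` is a hypothetical helper written for this illustration.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Tuple[int, Sequence]]:
    """Yield (start index, slice) pairs that cover items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


batch_size = 75
page_counts = {"Boundaries": 357, "Fundamentals_of_Physics": 1450, "Made_To_Stick": 225}

batch_counts = {}
for name, n_pages in page_counts.items():
    pages: List[int] = list(range(n_pages))  # stand-in for the parsed pages
    batches = list(iter_batches(pages, batch_size))
    batch_counts[name] = len(batches)
    last_start, last_batch = batches[-1]
    print(f"{name}: {len(batches)} batches, last one starts at {last_start} "
          f"and holds {len(last_batch)} pages")
```

The number of batches is always the page count divided by the batch size, rounded up, and the start index doubles as a convenient batch identifier.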

In [None]:
def process_batch(bookname, pages, batch_id=None, max_retries=3, backoff=5):
    """
    Process one batch of pages.

    Takes a book name, a list of pages, and an optional batch ID.
    Embeds the pages, builds the Azure Search payload, and uploads it.
    A failed upload is retried up to max_retries times, waiting backoff * attempt
    seconds after each attempt (a linearly increasing delay).
    The batch is saved to a file for later inspection ONLY if all upload retries fail.
    Errors raised before the upload (for example, by the embedding call) are caught
    by the outer except: they are printed, not retried, and the batch is not saved.
    """
    
    failed_batches_dir = "failed_batches"
    os.makedirs(failed_batches_dir, exist_ok=True)

    try:
        contents = [page[2] for page in pages]
        chunk_vectors = embedder.embed_documents(contents)
        
        upload_payload = {"value": []}
        for i, page in enumerate(pages):
            page_num = page[0] + 1
            content = page[2]
            book_url = os.environ['BASE_CONTAINER_URL'] + bookname
            
            payload = {
                "@search.action": "upload",
                "id": str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{bookname}{page_num}")),
                "title": f"{bookname}_page_{str(page_num)}",
                "chunk": content,
                "chunkVector": chunk_vectors[i],
                "name": bookname,
                "location": book_url,
                "page_num": page_num
            }
            upload_payload["value"].append(payload)
        
        for attempt in range(1, max_retries + 1):
            try:
                r = requests.post(
                    os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + book_index_name + "/docs/index",
                    data=json.dumps(upload_payload),
                    headers=headers,
                    params=params,
                    timeout=30
                )
                if r.status_code == 200:
                    print(f"[{bookname}][batch {batch_id}] ✅ Upload successful")
                    return
                else:
                    print(f"[{bookname}][batch {batch_id}] ⚠️ Attempt {attempt} failed: {r.status_code} - {r.text}")
            except Exception as e:
                print(f"[{bookname}][batch {batch_id}] ❗ Attempt {attempt} raised exception: {e}")
            time.sleep(backoff * attempt)

        # Save failed batch
        failed_path = os.path.join(
            failed_batches_dir,
            f"failed_batch_{bookname.replace('/', '_')}_batch_{batch_id}.json"
        )
        with open(failed_path, 'w') as f:
            json.dump(upload_payload, f, indent=2)
        print(f"[{bookname}][batch {batch_id}] ❌ Upload failed after {max_retries} attempts. Saved to {failed_path}")

    except Exception as e:
        print(f"[{bookname}][batch {batch_id}] 🚨 Unexpected error: {e}")
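A detail worth noting in `process_batch` is the document `id`: it is derived with `uuid.uuid5` from the book name and page number, so it is deterministic. Combined with the `upload` action, which replaces a document that already has the same key, this means re-running the upload should overwrite pages rather than duplicate them. The sketch below isolates that behavior; `page_id` is a hypothetical wrapper around the same expression.

```python
import uuid


def page_id(bookname: str, page_num: int) -> str:
    """Same expression as in process_batch: a name-based (version 5) UUID."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{bookname}{page_num}"))


first = page_id("books/Made_To_Stick.pdf", 12)
second = page_id("books/Made_To_Stick.pdf", 12)   # same inputs, same id
other = page_id("books/Made_To_Stick.pdf", 13)    # different page, different id
print(first == second, first == other)

# Caveat: the name and the number are concatenated without a separator,
# so a name that ends in a digit can collide with another name and page.
print(page_id("a1", 2) == page_id("a", 12))
```

File names ending in `.pdf`, as in this notebook, do not hit that collision, but adding a separator between the two parts would be a safer choice for arbitrary names.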

In [15]:
%%time
for bookname, bookmap in book_pages_map.items():
    print("Uploading chunks from", bookname)
    for i in tqdm(range(0, len(bookmap), batch_size)):
        batch = bookmap[i:i + batch_size]
        process_batch(bookname, batch, batch_id=i)

Uploading chunks from books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf


 20%|██        | 1/5 [00:07<00:31,  7.77s/it]

[books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf][batch 0] ✅ Upload successful


 40%|████      | 2/5 [00:14<00:20,  6.87s/it]

[books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf][batch 75] ✅ Upload successful


 60%|██████    | 3/5 [00:20<00:13,  6.55s/it]

[books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf][batch 150] ✅ Upload successful


 80%|████████  | 4/5 [01:10<00:23, 23.94s/it]

[books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf][batch 225] ✅ Upload successful


100%|██████████| 5/5 [01:15<00:00, 15.10s/it]


[books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf][batch 300] ✅ Upload successful
Uploading chunks from books/Fundamentals_of_Physics_Textbook.pdf


  5%|▌         | 1/20 [01:11<22:38, 71.49s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 0] ✅ Upload successful


 10%|█         | 2/20 [02:04<18:12, 60.69s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 75] ✅ Upload successful


 15%|█▌        | 3/20 [03:03<16:56, 59.79s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 150] ✅ Upload successful


 20%|██        | 4/20 [04:05<16:09, 60.62s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 225] ✅ Upload successful


 25%|██▌       | 5/20 [05:07<15:16, 61.09s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 300] ✅ Upload successful


 30%|███       | 6/20 [06:09<14:20, 61.47s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 375] ✅ Upload successful


 35%|███▌      | 7/20 [07:11<13:20, 61.56s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 450] ✅ Upload successful


 40%|████      | 8/20 [08:12<12:19, 61.61s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 525] ✅ Upload successful


 45%|████▌     | 9/20 [09:15<11:19, 61.79s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 600] ✅ Upload successful


 50%|█████     | 10/20 [10:16<10:18, 61.82s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 675] ✅ Upload successful


 55%|█████▌    | 11/20 [11:19<09:17, 61.91s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 750] ✅ Upload successful


 60%|██████    | 12/20 [12:20<08:15, 61.89s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 825] ✅ Upload successful


 65%|██████▌   | 13/20 [13:27<07:23, 63.35s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 900] ✅ Upload successful


 70%|███████   | 14/20 [14:24<06:08, 61.49s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 975] ✅ Upload successful


 75%|███████▌  | 15/20 [15:26<05:08, 61.60s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 1050] ✅ Upload successful


 80%|████████  | 16/20 [16:28<04:06, 61.66s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 1125] ✅ Upload successful


 85%|████████▌ | 17/20 [17:30<03:05, 61.83s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 1200] ✅ Upload successful


 90%|█████████ | 18/20 [18:32<02:03, 61.77s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 1275] ✅ Upload successful


 95%|█████████▌| 19/20 [20:36<01:20, 80.66s/it]

[books/Fundamentals_of_Physics_Textbook.pdf][batch 1350] 🚨 Unexpected error: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the Embeddings_Create Operation under Azure OpenAI API version 2024-10-01-preview have exceeded call rate limit of your current AIServices S0 pricing tier. Please retry after 60 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit. For Free Account customers, upgrade to Pay as you Go here: https://aka.ms/429TrialUpgrade.'}}


100%|██████████| 20/20 [20:41<00:00, 62.09s/it]


[books/Fundamentals_of_Physics_Textbook.pdf][batch 1425] ✅ Upload successful
Uploading chunks from books/Made_To_Stick.pdf


 33%|███▎      | 1/3 [00:06<00:13,  6.70s/it]

[books/Made_To_Stick.pdf][batch 0] ✅ Upload successful


 67%|██████▋   | 2/3 [01:04<00:36, 36.59s/it]

[books/Made_To_Stick.pdf][batch 75] ✅ Upload successful


100%|██████████| 3/3 [01:10<00:00, 23.54s/it]


[books/Made_To_Stick.pdf][batch 150] ✅ Upload successful
Uploading chunks from books/Pere_Riche_Pere_Pauvre.pdf


 33%|███▎      | 1/3 [00:57<01:55, 57.95s/it]

[books/Pere_Riche_Pere_Pauvre.pdf][batch 0] ✅ Upload successful


 67%|██████▋   | 2/3 [01:04<00:27, 27.72s/it]

[books/Pere_Riche_Pere_Pauvre.pdf][batch 75] ✅ Upload successful


100%|██████████| 3/3 [01:59<00:00, 39.94s/it]

[books/Pere_Riche_Pere_Pauvre.pdf][batch 150] ✅ Upload successful
CPU times: user 12 s, sys: 2.32 s, total: 14.3 s
Wall time: 25min 7s
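In the run above, batch 1350 of the physics textbook failed with a 429 raised by the embeddings call. That call sits outside the upload retry loop of `process_batch`, so the batch was neither retried nor saved to `failed_batches`, and its pages are missing from the index. One possible remedy is to wrap the embedding call in its own retry with exponential backoff. The helper below is a generic sketch; `call_with_backoff` and its parameters are assumptions for illustration, and the flaky function only simulates a rate limit.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(func: Callable[[], T],
                      max_retries: int = 3,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func; on failure wait base_delay * 2**attempt seconds and try again."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the last error
            sleep(base_delay * (2 ** attempt))
            attempt += 1


# Simulated embedding call that is rate-limited twice before succeeding
calls = {"n": 0}


def flaky_embed() -> List[List[float]]:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 rate limit")
    return [[0.1, 0.2]]


delays: List[float] = []  # record the waits instead of actually sleeping
vectors = call_with_backoff(flaky_embed, max_retries=3, base_delay=60, sleep=delays.append)
print(vectors, delays)
```

In this notebook the wrapped call could look like `call_with_backoff(lambda: embedder.embed_documents(contents), base_delay=60)`, where the 60-second base follows the "retry after 60 seconds" hint in the error message.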





In [None]:
# Simple retry mechanism for failed batches. If the previous step left any files in the failed_batches folder, uncomment the code below to retry them.
# It assumes that the failed batches are saved in a directory called "failed_batches"

# import glob

# print("\n🔁 Retrying failed batches...")
# for path in glob.glob("failed_batches/failed_batch_*.json"):
#     print(f"Retrying: {path}")
#     with open(path) as f:
#         payload = json.load(f)

#     try:
#         r = requests.post(
#             os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + book_index_name + "/docs/index",
#             data=json.dumps(payload),
#             headers=headers,
#             params=params,
#             timeout=30
#         )
#         if r.status_code == 200:
#             print(f"✅ Retry succeeded: {path}")
#             os.remove(path)  # Clean up if retry successful
#         else:
#             print(f"❌ Retry failed ({r.status_code}): {r.text}")
#     except Exception as e:
#         print(f"🚨 Retry exception: {e}")


🔁 Retrying failed batches...


## Query the Index

In [18]:
QUESTION = "what normally rich dad do that is different from poor dad?"
# QUESTION = "Dime que significa la radiacion del cuerpo negro"
# QUESTION = "what is the acronym of the main point of Made to Stick book"
# QUESTION = "Tell me a python example of how do I push documents with vectors to an index using the python SDK?"
# QUESTION = "who won the soccer worldcup in 1994?" # this question should have no answer

In [19]:
indexes = [book_index_name]
k=50 # in this index k corresponds to the top pages as well

In [20]:
retriever = CustomAzureSearchRetriever(indexes=[book_index_name], topK=k, reranker_threshold=1)

**Note**: We pick a larger k (k=50) because these chunks are NOT 5000 characters each as in prior notebooks; here each page is a chunk.
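The internals of `CustomAzureSearchRetriever` are not shown in this notebook, so the following is only a sketch of what a hybrid (text + vector) request body for the Azure AI Search "Search Documents" REST operation could look like against this index. The function name is hypothetical, the `kind: "text"` vector query assumes an API version that supports the vectorizer defined on the index, and the retriever may build its request differently.

```python
import json


def build_hybrid_query(question: str, k: int = 50,
                       vector_field: str = "chunkVector",
                       semantic_config: str = "my-semantic-config") -> dict:
    """Build a request body combining keyword search, vector search, and semantic ranking."""
    return {
        "search": question,                      # keyword part of the hybrid query
        "select": "id, title, chunk, name, location, page_num",
        "queryType": "semantic",                 # ask for semantic reranking
        "semanticConfiguration": semantic_config,
        "vectorQueries": [
            {
                "kind": "text",                  # the index vectorizer embeds this text
                "text": question,
                "fields": vector_field,
                "k": k,
            }
        ],
        "top": k,
    }


payload = build_hybrid_query("what normally rich dad do that is different from poor dad?")
print(json.dumps(payload, indent=2))
```

A body like this would be sent with `requests.post` to the index's `/docs/search` endpoint, using the same `headers` and `params` defined earlier.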

In [21]:
COMPLETION_TOKENS = 2500
llm = AzureChatOpenAI(deployment_name=os.environ["GPT4oMINI_DEPLOYMENT_NAME"], temperature=0.5, max_tokens=COMPLETION_TOKENS).configurable_alternatives(
    ConfigurableField(id="model"),
    default_key="gpt4omini",
    gpt4o=AzureChatOpenAI(deployment_name=os.environ["GPT4o_DEPLOYMENT_NAME"], temperature=0, max_tokens=COMPLETION_TOKENS),
)

In `utils.py` we created the **CustomAzureSearchRetriever** class that we will use going forward

In [22]:
DOCSEARCH_PROMPT = ChatPromptTemplate.from_messages(
    [
        ("system", DOCSEARCH_PROMPT_TEXT + "\n\nCONTEXT:\n{context}\n\n"),
        ("human", "{question}"),
    ]
)

In [23]:
chain = (
    {
        "context": itemgetter("question") | retriever, # Passes the question to the retriever; the results are assigned to context
        "question": itemgetter("question")
    }
    | DOCSEARCH_PROMPT  # Passes the 2 variables above (context, question) to the prompt template
    | llm   # Passes the finished prompt to the LLM
    | StrOutputParser()  # Converts the LLM output (a chat message) to a plain string
)

#### With GPT4o-mini

In [24]:
for chunk in chain.with_config(configurable={"model": "gpt4omini"}).stream(
    {"question": QUESTION, "language": "English"}):
    print(chunk, end="", flush=True)

The differences between the mindsets and actions of the "rich dad" and the "poor dad" in Robert Kiyosaki's "Père riche, père pauvre" are quite pronounced and can be summarized as follows:

1. **Attitude Towards Money**: The rich dad views money as a tool that can work for him, whereas the poor dad believes that money is something to be earned through hard work. The rich dad teaches that "money works for you" whereas the poor dad thinks "you work for money" [[6]](https://blobstorageixqo5iaqmpzwc.blob.core.windows.net/books/Pere_Riche_Pere_Pauvre.pdf).

2. **Financial Education**: The rich dad emphasizes the importance of financial education and understanding how money works. He believes that learning about money management and investments is crucial for wealth creation. In contrast, the poor dad focuses on traditional education and securing a stable job, believing that good grades will lead to a good job [[6]](https://blobstorageixqo5iaqmpzwc.blob.core.windows.net/books/Pere_Riche_Pere_

#### With GPT4o

In [25]:
for chunk in chain.with_config(configurable={"model": "gpt4o"}).stream(
    {"question": QUESTION, "language": "English"}):
    print(chunk, end="", flush=True)

In "Père riche, Père pauvre," the rich dad and the poor dad have fundamentally different approaches to money and life. The rich dad emphasizes the importance of financial education and making money work for you, rather than working for money. He believes in acquiring assets that generate income and encourages learning about how money works to achieve financial independence. The rich dad also stresses the importance of understanding the law and using it to one's advantage, often employing financial advisors and lawyers to minimize taxes and protect wealth [[1]](https://blobstorageixqo5iaqmpzwc.blob.core.windows.net/books/Pere_Riche_Pere_Pauvre.pdf).

On the other hand, the poor dad, despite being well-educated, focuses on job security and working for a stable salary. He believes in the traditional path of getting a good education to secure a good job with benefits. The poor dad often views the house as the most significant investment, whereas the rich dad sees it as a liability unless i

# Summary

In this notebook we learned how to deal with complex and large Documents and make them available for Q&A over them using [Hybrid Search](https://learn.microsoft.com/en-us/azure/search/search-get-started-vector#hybrid-search) (text + vector search).

We also learned the power of the Azure Document Intelligence API and why it is recommended for production scenarios where manual document parsing (instead of Azure Search Indexer Document Cracking) is necessary.

Using Azure AI Search with its vector capabilities and hybrid search features can eliminate the need for other vector databases such as Weaviate, Qdrant, Milvus, Pinecone, and so on.


# NEXT
So far we have learned how to use OpenAI vectors and completion APIs to get an excellent answer from our documents stored in Azure AI Search. This is the backbone of a GPT Smart Search Engine.

However, we are missing something: **How to have a conversation with this engine?**

In the next notebook, we are going to understand the concept of **memory**. This is necessary in order to have a chatbot that can establish a conversation with the user. Without memory, there is no real conversation.