## ❗ Problem Statement

> 📌 **Note**
>
> This guide is specifically focused on handling PDF documents.

When integrating OpenAI models such as GPT-4 with a vector store, the main challenge lies in Retrieval Augmented Generation (RAG). In this process, the model queries the vector store to retrieve the knowledge chunks it needs to answer a user's question. The developer therefore has to design an effective strategy for chunking, sorting, and retrieving the data.

**🔍 The Challenges:**

**Data Chunking and Sorting:**

+ **Determining Optimal Chunk Size**: Deciding the appropriate size for document chunks is crucial. Too large, and the chunks may exceed the model's context window, leading to loss of information; too small, and they may lack sufficient context.

+ **Effective Sorting Strategies**: Sorting these chunks for efficient retrieval is another challenge. The sorting mechanism needs to ensure that the most relevant chunks are prioritized.

+ **Overlap Consideration**: Overlapping consecutive chunks helps preserve continuity and context across chunk boundaries, especially in long documents or complex topics.
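
To make the chunk size and overlap trade-off concrete, here is a minimal sketch of fixed-width character chunking with overlap. It is an illustration only: the function name `chunk_text` is hypothetical, and it does not reproduce how `TextChunkingIndexing` or LangChain split text (those may split on separators rather than fixed offsets).

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += stride
    return chunks


if __name__ == "__main__":
    sample = "abcdefghijklmnopqrstuvwxyz"
    for chunk in chunk_text(sample, chunk_size=10, overlap=4):
        print(chunk)
```

Each chunk starts with the last `overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.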

**The Impact: Fragmented Information**

When chunking and sorting are poorly tuned, the retrieved information becomes fragmented. This is particularly noticeable when similar terms appear across different sections of a document: the system may mix up data from unrelated contexts, leading to confusion and misinformation. The relevance of the retrieved information also varies significantly with how well the chunking and sorting strategy has been implemented.

## 💡 Solution

Use Azure AI Search as the vector database, with an overlapping chunking strategy implemented by `TextChunkingIndexing`.

`TextChunkingIndexing` streamlines chunking, text vectorization, and indexing within Azure AI Search, using LangChain for text processing. Discover more about the AI Search and LangChain integration [here](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-cognitive-search-and-langchain-a-seamless-integration-for/ba-p/3901448).

### Key Functions

1. **Text Chunking**: Splits extensive text into manageable chunks for better analysis and indexing.
2. **Customization**: Adjusts chunk size and overlap according to different text processing requirements.
3. **Text Vectorization**: Converts chunked text into vectors, crucial for effective indexing and retrieval.
4. **Indexing in Vector Database**: Stores and retrieves the vectorized text in Azure AI Search.
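
The vectorize-then-retrieve idea behind functions 3 and 4 can be sketched in a few lines of plain Python. This is a toy stand-in under loud assumptions: `embed` uses word counts instead of a real embedding model, and `rank_chunks` scans an in-memory list instead of querying Azure AI Search. All names here are hypothetical and not part of `TextChunkingIndexing`.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top_k (score, chunk) pairs, most similar first."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    chunks = [
        "reset the power supply",
        "clean the probe card",
        "power supply voltage limits",
    ]
    for score, chunk in rank_chunks("power supply reset", chunks, top_k=2):
        print(f"{score:.3f}  {chunk}")
```

A real deployment replaces both pieces: the embedding model produces dense vectors, and the vector database performs the similarity search at scale.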

#### Importance of Optimal Chunking

Adjusting chunk sizes and overlaps is vital for high-quality text retrieval, especially in precision-based search applications such as RAG. Learn more about fine-tuning and relevance scores [here](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-cognitive-search-outperforming-vector-search-with-hybrid/ba-p/3929167).
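
One way to reason about these settings before indexing is to estimate how many chunks a document will produce. The sketch below assumes simple fixed-width windows that advance by `chunk_size - overlap` characters, so the count is `1 + ceil((n_chars - chunk_size) / (chunk_size - overlap))` for texts longer than one chunk. Separator-aware splitters will give different counts, so treat the numbers as rough estimates; `count_chunks` is a hypothetical helper.

```python
def count_chunks(n_chars: int, chunk_size: int, overlap: int) -> int:
    """Estimate the number of fixed-width chunks for a text of n_chars characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_chars <= 0:
        return 0
    if n_chars <= chunk_size:
        return 1
    stride = chunk_size - overlap
    remaining = n_chars - chunk_size
    # Integer ceiling division avoids floating-point rounding issues.
    return 1 + -(-remaining // stride)


if __name__ == "__main__":
    n_chars = 10_000  # illustrative document length, not taken from the guide
    for chunk_size, overlap in [(512, 0), (512, 128), (1024, 128)]:
        total = count_chunks(n_chars, chunk_size, overlap)
        print(f"chunk_size={chunk_size:<5} overlap={overlap:<4} chunks={total}")
```

More overlap means more chunks to embed and store for the same document, which is the cost paid for better context preservation at chunk boundaries.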

## Getting Started

Before you start, ensure you have a `.env` file in your project directory with the following keys:

```plaintext
OPENAI_API_KEY=****
OPENAI_ENDPOINT=****
AZURE_OPENAI_API_VERSION=****
AZURE_SEARCH_SERVICE_ENDPOINT=****
AZURE_SEARCH_ADMIN_KEY=****
```
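
A quick sanity check can catch a missing or empty key before any client is created. The sketch below assumes the values from `.env` have already been loaded into the process environment (for example by a loader such as python-dotenv); the helper name `missing_keys` is hypothetical.

```python
import os

# Keys listed in the .env file above.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
)


def missing_keys(env=None) -> list[str]:
    """Return the required keys that are absent or blank in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```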

#### Setting Up Conda Environment and Configuring VSCode for Jupyter Notebooks

Follow these steps to create a Conda environment and set up your VSCode for running Jupyter Notebooks:

##### Create Conda Environment from the Repository

1. **Prepare the Environment File**:
   - Ensure you have an `environment.yml` file in your repository. This file should list all the necessary libraries and dependencies for your project.

2. **Use `make` to Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory and look at the Makefile.
   - Execute the `make` command specified below to create the Conda environment using the `environment.yml` file:
     
     ```bash
     make create_conda_env
     ```

   - This command runs a `make` target that creates a Conda environment as defined in `environment.yml`.

3. **Activating the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate [YourEnvName]
     ```
     Replace `[YourEnvName]` with the name of your environment as specified in `environment.yml`.

##### Configure VSCode for Jupyter Notebooks

1. **Install Required Extensions**:
   - Download and install the `Python` and `Jupyter` extensions in VSCode.

2. **Attach Kernel to VSCode**:
   - Once the Conda environment is created, you should be able to see it in the kernel selection (top right corner of your VSCode interface).
   - Select your newly created environment as the kernel for running Jupyter Notebooks.

By following these steps, you'll set up a dedicated Conda environment for your project and configure VSCode to run Jupyter Notebooks efficiently. This environment will contain all the dependencies listed in your `environment.yml` file.
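
To confirm that a notebook is running on the intended kernel, it can help to print which interpreter is active. The following is a small sketch: `CONDA_DEFAULT_ENV` is set by `conda activate`, but it may be unset depending on how the kernel was launched, so the interpreter path is usually the more reliable signal. The helper name `describe_interpreter` is hypothetical.

```python
import os
import sys


def describe_interpreter(env=None) -> dict:
    """Collect basic facts about the running Python interpreter."""
    env = os.environ if env is None else env
    return {
        "python_executable": sys.executable,  # path of the interpreter in use
        "prefix": sys.prefix,                 # root directory of the environment
        "conda_env": env.get("CONDA_DEFAULT_ENV"),  # None if not set
    }


if __name__ == "__main__":
    for name, value in describe_interpreter().items():
        print(f"{name}: {value}")
```

If the printed paths do not point inside the Conda environment you created, reselect the kernel in VSCode.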



In [7]:
import os

# Define the target directory
target_directory = r'C:\Users\pablosal\Desktop\azure-ai-gbb-solutions'  # Change this to the path of your local copy of the repository

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\azure-ai-gbb-solutions


## Chunking

In [8]:
# Import the TextChunkingIndexing class from the langchain_integration module
from src.gbb_ai.rag_utils.langchain_integration import TextChunkingIndexing

# Create an instance of the TextChunkingIndexing class
gbb_ai_client = TextChunkingIndexing()

# Set up the OpenAI API client
gbb_ai_client.setup_aoai()

# Define the name of the deployment
DEPLOYMENT_NAME = "foundational-ada"

# Load the embedding model associated with the specified deployment
gbb_ai_client.load_embedding_model(deployment=DEPLOYMENT_NAME)

2023-11-22 12:37:04,964 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model text-embedding-ada-002, deployment foundational-ada, and chunk size 1 (langchain_integration.py:load_embedding_model:106)
2023-11-22 12:37:04,968 - micro - MainProcess - INFO     OpenAIEmbeddings object created successfully. (langchain_integration.py:load_embedding_model:119)


OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, async_client=None, model='text-embedding-ada-002', deployment='foundational-ada', openai_api_version='2023-05-15', openai_api_base='https://ml-workspace-dev-eastus-001-aoai.openai.azure.com/', openai_api_type='azure', openai_proxy='', embedding_ctx_length=8191, openai_api_key='****', openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=16, max_retries=2, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=True, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, http_client=None)

In [9]:
# Define the name of the Azure Search index
# This is the index where your data is stored in Azure Search
INDEX_NAME = "index-teradyne-web"

# Set up the Azure Search client with the specified index
# This prepares the client to interact with the Azure Search service
gbb_ai_client.setup_azure_search(index_name=INDEX_NAME)

100%|██████████| 1/1 [00:00<00:00, 16.72it/s]
2023-11-22 12:37:06,053 - micro - MainProcess - INFO     Azure Cognitive Search client configured successfully. (langchain_integration.py:setup_azure_search:188)


<langchain.vectorstores.azuresearch.AzureSearch at 0x22bb43d8610>

In [10]:
# Load a PDF file and split its text into overlapping chunks
# Define the path of the PDF file to load
file_1 = "C:\\Users\\pablosal\\Desktop\\azure-ai-gbb-solutions\\workshop\\solution\\build_your_own_copilot_aoai\\rag_pattern\\pdf\\ultraflex_user_manual.pdf"

# Set the chunk size and overlap size for splitting the text
CHUNK_SIZE = 512
OVERLAP_SIZE = 128
# Regex separator, written as a raw string so that \w is not treated as an invalid escape
# Note: it is defined here but not passed to the call below
SEPARATOR = r"(\n\w|\w\n)"

# Load the PDF, split the text into chunks, and store the chunks
# The text is split into chunks of size CHUNK_SIZE, with an overlap of OVERLAP_SIZE between consecutive chunks
text_chuncked = gbb_ai_client.load_and_split_text_by_character_from_pdf(source=file_1, chunk_size=CHUNK_SIZE, chunk_overlap=OVERLAP_SIZE)
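
The absolute Windows path above only works on the original author's machine. As a hedged alternative, the PDF location can be built relative to the repository root with `pathlib`. This sketch assumes the current working directory is the repository root (as set by the earlier `os.chdir` cell) and that the folder layout matches the path shown in the cell; `resolve_pdf` is a hypothetical helper.

```python
from pathlib import Path

# Folder layout taken from the path used in the cell above.
PDF_RELATIVE_PARTS = (
    "workshop",
    "solution",
    "build_your_own_copilot_aoai",
    "rag_pattern",
    "pdf",
    "ultraflex_user_manual.pdf",
)


def resolve_pdf(repo_root: Path) -> Path:
    """Build the PDF path from the repository root in an OS-independent way."""
    return repo_root.joinpath(*PDF_RELATIVE_PARTS)


if __name__ == "__main__":
    pdf_path = resolve_pdf(Path.cwd())
    status = "found" if pdf_path.is_file() else "not found"
    print(f"{pdf_path} ({status})")
```

The resulting path can then be converted with `str(pdf_path)` wherever a plain string is expected.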

## Indexing

In [11]:
# Embed the chunks and index them in Azure Search
# This function converts the text chunks into vector embeddings and stores them in the Azure Search index
gbb_ai_client.embed_and_index(text_chuncked)

100%|██████████| 1/1 [00:00<00:00, 15.95it/s]
100%|██████████| 1/1 [00:00<00:00, 15.14it/s]
100%|██████████| 1/1 [00:00<00:00, 12.00it/s]
100%|██████████| 1/1 [00:00<00:00, 12.84it/s]
100%|██████████| 1/1 [00:00<00:00, 11.35it/s]
100%|██████████| 1/1 [00:00<00:00, 14.58it/s]
100%|██████████| 1/1 [00:00<00:00, 13.57it/s]
100%|██████████| 1/1 [00:00<00:00, 12.47it/s]
100%|██████████| 1/1 [00:05<00:00,  5.51s/it]
100%|██████████| 1/1 [00:00<00:00, 16.56it/s]
100%|██████████| 1/1 [00:00<00:00,  1.35it/s]
100%|██████████| 1/1 [00:00<00:00, 12.31it/s]
100%|██████████| 1/1 [00:00<00:00, 14.26it/s]
100%|██████████| 1/1 [00:00<00:00, 13.74it/s]
100%|██████████| 1/1 [00:00<00:00, 13.21it/s]
  0%|          | 0/1 [00:00<?, ?it/s]Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..
100%|██████████| 1/1 [00:04<00:00,  4.12s/it]
100%|██████████| 1/1 [00:00<00:00, 12.86it/s]
100