# Document Embedding and Indexing with Azure OpenAI and AI Search

## Overview 
This tutorial provides a step-by-step guide on how to pull files from [Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction), generate embeddings for these files, and store the embeddings in an [Azure AI Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search) index. Embeddings are numerical representations of text that capture the semantic meaning of the content, so that texts with similar meanings map to nearby vectors. An index in Azure AI Search is a data structure that organizes these embeddings to improve the speed and efficiency of search queries. Additionally, this tutorial demonstrates how to let users interact with these embedding indexes through Azure AI Search and [Azure OpenAI](https://learn.microsoft.com/en-us/azure/ai-services/openai/overview), effectively allowing them to chat over the original files from Azure Blob Storage.
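
To make the idea of "semantic closeness" concrete, here is a minimal sketch that compares toy vectors with cosine similarity, one common way to measure how close two embeddings are. The three-dimensional vectors and their values are made up for illustration; real embeddings from Azure OpenAI have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
road = [0.9, 0.1, 0.0]
highway = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

print("road vs highway:", round(cosine_similarity(road, highway), 3))
print("road vs banana: ", round(cosine_similarity(road, banana), 3))
```

The related pair scores close to 1 while the unrelated pair scores close to 0, which is the property a vector index exploits when it looks for the nearest neighbors of a query.
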
### Learning Objectives  
1. **Vectorization**:
    - Learn how to extract files from Azure Blob Storage.
    - Understand how to generate embeddings using Azure OpenAI.
    - Discover how to store documents with custom metadata in an Azure AI Index.
2. **Retrieval**:
    - Gain skills in interacting with Azure AI Search indexes using Azure OpenAI.

Each section of this notebook will guide you through specific tasks and demonstrate how to utilize the REST APIs provided by each Azure service. By the end of this notebook, you will have a comprehensive understanding of how to integrate and utilize these Azure services to develop a robust data processing and retrieval application. 
### Prerequisites  
Before proceeding with this notebook, please ensure that you have the following Azure services deployed and configured. Resources can be deployed manually in Azure portal or automated by following along with the [ARM Deployment tutorial](../azure_infra_setup/README.md):  
  
1. **Azure OpenAI Service**:   
    - Ensure that you have deployed both a GPT model and an embedding model (for example, an Ada or `text-embedding-3-small` model) within your Azure OpenAI instance.
    - Estimated costs for this service vary based on model usage and the number of API calls.
        - **gpt-4o-mini (2024-07-18):** \$0.15 input / \$0.60 output per 1M tokens
        - **text-embedding-3-small (1):** \$0.00002 per 1K tokens
        
2. **Azure AI Search**:   
    - Your Azure AI Search service should be a minimum of the Basic tier to ensure compatibility with Azure OpenAI.  
    - **Estimated cost for this service is $0.10 per hour.**
    
3. **Azure Blob Storage Account**:   
    - You should have an Azure Blob Storage account with PDF files stored in a blob container. These files should be located in the `/search_documents` directory of the `GenAI` directory.  
    - **Estimated cost for this service is $0.018 per GB.**
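
As a rough aid for budgeting, the sketch below turns the token prices listed above into a small calculator. The prices are copied from this list and may be out of date, and the workload numbers (500 chunks, 100 questions, tokens per question) are hypothetical.

```python
# Prices copied from the prerequisites list; verify against current Azure pricing.
EMBEDDING_PRICE_PER_1K_TOKENS = 0.00002   # text-embedding-3-small
GPT_INPUT_PRICE_PER_1M_TOKENS = 0.15      # gpt-4o-mini input
GPT_OUTPUT_PRICE_PER_1M_TOKENS = 0.60     # gpt-4o-mini output

def embedding_cost(total_tokens):
    """Cost in USD of embedding `total_tokens` tokens."""
    return total_tokens / 1_000 * EMBEDDING_PRICE_PER_1K_TOKENS

def chat_cost(input_tokens, output_tokens):
    """Cost in USD of a chat workload, split by input and output tokens."""
    return (input_tokens / 1_000_000 * GPT_INPUT_PRICE_PER_1M_TOKENS
            + output_tokens / 1_000_000 * GPT_OUTPUT_PRICE_PER_1M_TOKENS)

# Hypothetical workload: 500 chunks of 800 tokens, then 100 questions
# that each send about 4,000 tokens and receive about 300 tokens.
print(f"Embedding: ${embedding_cost(500 * 800):.4f}")
print(f"Chat:      ${chat_cost(100 * 4_000, 100 * 300):.4f}")
```

Under these assumptions both figures stay below ten cents, so the hourly Azure AI Search charge is likely to dominate the cost of running this notebook.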

## Get Started 

### 0. Environment Setup  
This section will guide you through setting up the environment for the notebook. We will import the necessary libraries, load environment variables, and configure Azure AI Search parameters.  

### 0.1 Install Python libraries from requirements.txt

To ensure all necessary Python libraries are installed in the virtual environment for this notebook, we will use `pip` to install the packages specified in the `requirements.txt` file.

In [None]:
%pip install -r ../requirements.txt 

### 0.2 Import Necessary Libraries  
Import all the packages installed in the virtual environment into our Python script. This is a crucial step as it makes the required functionalities available for the script to execute correctly.

In [None]:
# Import necessary libraries  
  
# For handling file and directory operations  
import os  
  
# For handling I/O operations  
import io  
  
# For extracting text and tables from PDF files  
import pdfplumber  
  
# For interacting with Azure Blob Storage  
from azure.storage.blob import BlobServiceClient  
  
# For handling Azure credentials  
from azure.core.credentials import AzureKeyCredential  
from azure.identity import DefaultAzureCredential  
  
# For working with Azure Search service  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient  
  
# For configuring search indexes and fields  
from azure.search.documents.indexes.models import (  
    SimpleField,                   # Represents a simple field in an index  
    SearchFieldDataType,           # Represents the data type of a field  
    VectorSearch,                  # Enables vector search capabilities  
    SearchIndex,                   # Represents a search index  
    SearchableField,               # Represents a searchable field  
    SearchField,                   # Represents a field in a search index  
    VectorSearchProfile,           # Represents a vector search profile  
    HnswAlgorithmConfiguration     # Configuration for HNSW algorithm in vector search  
)  
  
# For loading environment variables from a .env file  
from dotenv import load_dotenv  
  
# For utilizing OpenAI functionalities within Azure  
from openai import AzureOpenAI  
  
# For tokenization tasks  
import tiktoken  
  
# For regular expression operations  
import re  

### 0.3 Load Environment Variables  
Load the environment variables from a `.env` file. Ensure you have a `.env` file with the required Azure service credentials and configurations. This file should contain all necessary keys and connection strings to connect to your Azure services.

Resources can be deployed manually in Azure portal or automated by following along with the [ARM Deployment tutorial](../azure_infra_setup/README.md). This will also go through creating your .env file.

In [None]:
# Load environment variables
env_path = '../azure_infra_setup/.env'
load_dotenv(dotenv_path=env_path) 

If you did not run the ARM Deployment tutorial, create a `.env` file and enter the following information:
  
```  
# Example .env file format:  
AZURE_OPENAI_VERSION=your_openai_version  
AZURE_OPENAI_BASE=your_openai_base_url  
AZURE_OPENAI_ENDPOINT=your_openai_endpoint  
AZURE_OPENAI_KEY=your_openai_key  
AZURE_GPT_DEPLOYMENT=your_gpt_deployment  
AZURE_EMBEDDINGS_DEPLOYMENT=your_embeddings_deployment  
AZURE_SEARCH_ENDPOINT=your_search_endpoint  
AZURE_SEARCH_ADMIN_KEY=your_search_admin_key  
AZURE_SEARCH_INDEX=your_search_index  
BLOB_CONTAINER_NAME=your_blob_container_name  
BLOB_CONNECTION_STRING=your_blob_connection_string  
BLOB_ACCOUNT_NAME=your_blob_account_name 
```

Then run `load_dotenv(dotenv_path=env_path)`.

- The `load_dotenv()` function reads the key-value pairs from the .env file and adds them to the environment variables.
- Replace the placeholder values in your .env file with your actual Azure service credentials and configuration details.
- ***This step is crucial for securely managing your credentials and keeping them out of your main codebase.***
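
A missing or misspelled variable usually surfaces later as a confusing authentication error, so it can help to check the environment right after loading it. The following is a minimal sketch of such a check; the helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative assumptions based on the example `.env` file above, so adjust the names to match your own file.

```python
import os

# Variable names taken from the example .env file; adjust if yours differ.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_GPT_DEPLOYMENT",
    "AZURE_EMBEDDINGS_DEPLOYMENT",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "BLOB_CONTAINER_NAME",
    "BLOB_CONNECTION_STRING",
    "BLOB_ACCOUNT_NAME",
]

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Run this after `load_dotenv(...)`; an empty result means every expected key was found.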

### 0.4 Configure Azure AI Search Parameters
Configure the Azure AI Search parameters using the loaded environment variables. This allows us to set up the necessary configurations for connecting to the Azure AI Search service.

In [None]:
# Configure Azure AI Search parameters  
search_endpoint = os.getenv('AZURE_SEARCH_ENDPOINT')  # Get the Azure Search endpoint from environment variables  
search_key = os.getenv('AZURE_SEARCH_ADMIN_KEY')  # Get the Azure Search admin key from environment variables  
credentials = AzureKeyCredential(search_key)  # Set up Azure credentials

- `os.getenv('AZURE_SEARCH_ENDPOINT')` retrieves the value of the `AZURE_SEARCH_ENDPOINT` environment variable, which contains the endpoint URL for your Azure Search service.
- `os.getenv('AZURE_SEARCH_ADMIN_KEY')` retrieves the value of the `AZURE_SEARCH_ADMIN_KEY` environment variable, which contains the admin key for your Azure Search service.
- These configurations are essential for authenticating and connecting to your Azure Search service.

## 1. Vectorization 
  
In this section, we will connect to Azure Blob Storage, process PDF documents into text chunks with metadata, generate embeddings using Azure OpenAI, and upload the data to Azure AI Search.  

Objectives:
1. Setup Function for Azure OpenAI
2. Connecting to Azure Blob Storage
3. Splitting Text with Metadata
4. Loading Blob Content
5. Vectorize Function

### 1.1 Setup Function for Azure OpenAI  
This function sets up the Azure OpenAI instance using the provided API key, version, and endpoint from environment variables. 

In [None]:
def setup_azure_openai():  
    """  
    Sets up Azure OpenAI.  
    """  
    print("Setting up Azure OpenAI...")  
    azure_openai = AzureOpenAI(  
        api_key=os.getenv("AZURE_OPENAI_KEY"),  # Name matches the example .env file  
        api_version=os.getenv("AZURE_OPENAI_VERSION", "2024-05-01-preview"),  # Falls back to this version if unset  
        azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT')  
    )  
    print("Azure OpenAI setup complete.")  
    return azure_openai

### 1.2 Connecting to Azure Blob Storage  
The following function connects to the Azure Blob Storage using the provided connection string and container name from the environment variables.  

In [None]:
def connect_to_blob_storage():  
    """  
    Connects to Azure Blob Storage.  
    """  
    print("Connecting to Blob Storage...")  
    blob_service_client = BlobServiceClient.from_connection_string(os.getenv("BLOB_CONNECTION_STRING"))  
    container_client = blob_service_client.get_container_client(os.getenv("BLOB_CONTAINER_NAME"))  
    print("Connected to Blob Storage.")  
    return container_client 

### 1.3 Splitting Text with Metadata  
Split the content from PDF files into chunks with associated metadata. The text will be split by a max token length with additional chunk overlap. This is useful for processing large documents.  

In [None]:
def split_text_with_metadata(text, metadata, max_length=800, overlap=75, encoding_name='cl100k_base'):
    """
    Splits the text into overlapping token-based chunks with metadata.
    """
    if overlap >= max_length:
        # Otherwise the window would never advance
        raise ValueError("overlap must be smaller than max_length")

    tokenizer = tiktoken.get_encoding(encoding_name)
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0

    while start < len(tokens):
        # Clamp the window so end_token never points past the last token
        end = min(start + max_length, len(tokens))
        chunk = tokens[start:end]
        chunk_text = tokenizer.decode(chunk)
        chunk_metadata = metadata.copy()
        chunk_metadata.update({
            'start_token': start,
            'end_token': end,
            'chunk_length': len(chunk),
            'chunk_text_preview': chunk_text[:50] + '...'
        })
        chunks.append({
            'text': chunk_text,
            'metadata': chunk_metadata
        })
        if end == len(tokens):
            break  # Last chunk reached; avoid emitting a duplicate tail chunk
        start = end - overlap

    return chunks

1. ***Tokenize Text***: The text is encoded into tokens using the tokenizer.
2. ***Initialize Variables***: Set up initial indices for chunking.
3. ***Create Chunks***: Loop through the tokens to create chunks:
    - Extract a chunk of tokens.
    - Decode the chunk back into text.
    - Copy and update metadata with chunk-specific information.
    - Append the chunk and its metadata to the list.
4. ***Overlap Handling***: Move the start index back by the overlap amount to ensure chunks overlap as specified.

**Key Params**:
- ***Max Chunk Size (max_length)***: Each chunk will have a maximum of `max_length` tokens (default is 800 tokens).
- ***Chunk Overlap (overlap)***: Consecutive chunks will overlap by `overlap` tokens (default is 75 tokens).
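
To see how `max_length` and `overlap` interact, the sketch below computes only the token windows, without calling the tokenizer. The helper `chunk_spans` is a hypothetical, simplified stand-in for the loop in `split_text_with_metadata`; it assumes the windowing stops once the last token has been covered.

```python
def chunk_spans(n_tokens, max_length=800, overlap=75):
    """Return (start, end) token windows for a document of n_tokens tokens."""
    if overlap >= max_length:
        raise ValueError("overlap must be smaller than max_length")
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_length, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last token is covered, so stop
        start = end - overlap  # step back so consecutive windows share tokens
    return spans

for span in chunk_spans(2000):
    print(span)
```

With the defaults, a 2,000-token document yields the windows `(0, 800)`, `(725, 1525)`, and `(1450, 2000)`: each new window starts 75 tokens before the previous one ended.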

### 1.4 Loading Blob Content  
Loads a PDF blob from Azure Blob Storage and extracts its text content.  

1. **Check File Type**:
    - The function first checks if the blob is a PDF file by verifying the file extension.
    - If the file is not a PDF, it raises a `ValueError`.
2. **Download Blob Content**:
    - The blob content is downloaded and read into `blob_data`.
3. **Convert to Stream**:
    - The blob data is converted into a byte stream using `io.BytesIO`.
4. **Extract Text from PDF**:
    - The PDF is opened using `pdfplumber`.
    - Text is extracted from each page of the PDF and concatenated into `document_text`.
5. **Return Document Text**:
    - The function returns the extracted text from the PDF.

In [None]:
def load_blob_content(blob_client):  
    """  
    Loads and returns the content of the PDF blob.  
    """  
    blob_name = blob_client.blob_name  
    if not blob_name.lower().endswith('.pdf'):  
        raise ValueError(f"Blob {blob_name} is not a PDF file.")  
      
    blob_data = blob_client.download_blob().readall()  
    pdf_stream = io.BytesIO(blob_data)  
    document_text = ""  
      
    with pdfplumber.open(pdf_stream) as pdf:  
        for page in pdf.pages:  
            # extract_text() can return None for pages without extractable text  
            document_text += (page.extract_text() or "") + "\n"  
      
    return document_text  

### 1.5 Vectorize Workflow  
This workflow uses the functions defined above to connect to Azure services, process blobs, generate embeddings, and upload the data to an Azure AI Search index.  

Set up the container and Azure OpenAI clients.

In [None]:
container_client = connect_to_blob_storage()
azure_openai = setup_azure_openai() 

#### 1.5.0 Chunking the data

The `chunk_data` function reads PDF files from Azure Blob Storage and splits their content into smaller chunks with metadata. It performs the following tasks:

1. **Uses the Blob Storage Connection:** It relies on the `container_client` created above with `connect_to_blob_storage()`.
2. **Lists Blobs in the Container:** It retrieves a list of all blobs (files) in the container using `container_client.list_blobs()`.
3. **Processes Only PDF Files:** It iterates through the blobs and skips any files that are not PDFs (based on their file extension).
4. **Loads and Processes PDF Content:** For each PDF file, it retrieves the blob's content using `load_blob_content(blob_client)`. It builds a URL for the blob from the storage account and container names in the environment variables. The link is only publicly reachable if the container allows anonymous access.
5. **Adds Metadata and Splits Content:** It creates metadata for the blob (e.g., blob name and document link). It splits the PDF content into smaller chunks along with the metadata using `split_text_with_metadata()`.
6. **Handles Errors:** If any blob fails to process, it logs the error and continues with the next blob.
7. **Returns Chunked Documents:** The function collects all the processed chunks into a `documents` list, prints a message when processing is complete, and returns the list.

In [None]:
def chunk_data():  
    """  
    Function will read and chunk PDF documents in blob storage with metadata.  
    """  
        
    print("Listing blobs in container...")  
    blob_list = container_client.list_blobs()  
    documents = []  
    for blob in blob_list:  
        if not blob.name.lower().endswith('.pdf'):  
            print(f"Skipping non-PDF blob: {blob.name}")  
            continue  
          
        print(f"Processing blob: {blob.name}")  
        blob_client = container_client.get_blob_client(blob.name)  
        try:  
            document = load_blob_content(blob_client)  
            document_link = f'https://{os.getenv("BLOB_ACCOUNT_NAME")}.blob.core.windows.net/{os.getenv("BLOB_CONTAINER_NAME")}/{blob.name}'  
              
            metadata = {"blob_name": blob.name, "document_link": document_link}  
            chunks = split_text_with_metadata(document, metadata)  
            documents.extend(chunks)  
        except Exception as e:  
            print(f"Failed to process blob {blob.name}: {e}")  
      
    print("Blobs processed and documents chunked.")
    return documents
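
Blob names can contain spaces or other characters that are not valid in a URL, in which case a link built by plain string formatting may not resolve. The sketch below shows one way to percent-encode the blob name with the standard library. The helper name `build_blob_url` is hypothetical, and the account and container values in the example are placeholders.

```python
from urllib.parse import quote

def build_blob_url(account_name, container_name, blob_name):
    """Build a blob URL, percent-encoding the blob name but keeping '/' separators."""
    return (
        f"https://{account_name}.blob.core.windows.net/"
        f"{container_name}/{quote(blob_name)}"
    )

# Placeholder account and container names for illustration
print(build_blob_url("myaccount", "documents", "reports/annual report 2023.pdf"))
```

`quote` leaves `/` untouched by default, so virtual folder paths inside the container are preserved while spaces become `%20`.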

#### 1.5.1 Adding Embeddings

The `embeddings` function is responsible for generating vector embeddings for chunks of text using Azure OpenAI's embedding API. It performs the following tasks:

1. **Sets Up Tokenizer and Token Limit:** Uses the `tiktoken` library to get a tokenizer (`cl100k_base`) for encoding text into tokens. Defines a maximum token limit (`max_tokens = 8192`) to ensure chunks do not exceed the model's input size.
2. **Processes Document Chunks:** Iterates through a list of documents (assumed to be preprocessed chunks of text with metadata). Optional debug `print` statements for each chunk's position and text are included but commented out.
3. **Checks Token Limit:** Encodes the chunk's text into tokens using the tokenizer. Skips the chunk if the number of tokens exceeds the `max_tokens` limit, logging a message.
4. **Generates Embeddings:** Sends the chunk's text to Azure OpenAI's embedding API (`azure_openai.embeddings.create`) using the model specified in the environment variable `AZURE_EMBEDDINGS_DEPLOYMENT`. Extracts the embedding vector from the API response and pairs it with the chunk's metadata.
5. **Stores Embeddings:** Appends the embedding and its associated metadata to the `embeddings` list.

In [None]:
def embeddings(documents):
    """  
    Function will generate embeddings.  
    """   
    print("Generating embeddings...")  
    embeddings = []  
    tokenizer = tiktoken.get_encoding("cl100k_base")  
    max_tokens = 8192  
    for i, doc in enumerate(documents):  
        #print(f"Processing chunk {i + 1}/{len(documents)}")  
        #print(f"Chunk text: {doc['text']}\n")  
        tokens = tokenizer.encode(doc["text"])  
        if len(tokens) > max_tokens:  
            print(f"Skipping document chunk {i + 1} with {len(tokens)} tokens, exceeding max limit of {max_tokens}.")  
            continue  
        response = azure_openai.embeddings.create(input=doc["text"], model=os.getenv('AZURE_EMBEDDINGS_DEPLOYMENT'))
        
        embeddings.append({  
            "embedding": response.data[0].embedding,  
            "metadata": doc["metadata"]  
        })  
        #print(f"Embeddings: {response.data[0].embedding}")  
    print("Embeddings generation complete.")  
    return embeddings
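
The function above sends one request per chunk. The embeddings endpoint also accepts a list of strings as `input`, so requests can be grouped to reduce round trips. The following is a minimal sketch of that idea; `batched` and `embed_in_batches` are hypothetical helpers, the batch size of 16 is an arbitrary assumption, and `client` is expected to be an `AzureOpenAI` instance such as the one returned by `setup_azure_openai()`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(client, texts, deployment, batch_size=16):
    """Return one embedding vector per text, in the same order as `texts`."""
    vectors = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(input=batch, model=deployment)
        # Each returned item carries the index of its input within the batch;
        # sorting by it keeps the output aligned with the input order.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

A possible call would be `embed_in_batches(azure_openai, [doc["text"] for doc in documents], os.getenv("AZURE_EMBEDDINGS_DEPLOYMENT"))`. Check the request limits of your deployment before raising the batch size.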

#### 1.5.2 Creating a Search Index

The `search_index` function creates an Azure AI Search index and configures its fields and vector search capabilities. 

1. **Initialize Azure Search Client:** Creates a `SearchIndexClient` object from the `search_endpoint` and `credentials` values configured in section 0.4.
2. **Define Index Fields:** Specifies the schema (fields) for the search index:
    - `id`: A unique identifier for each document (key field).
    - `content`: A searchable field for the document's text content.
    - `blob_name`: A searchable field for the name of the blob in Azure Blob Storage.
    - `document_link`: A searchable field for the document's URL or link.
    - `embedding`: A vector field for storing embeddings (used for vector similarity search). It includes:
        - `vector_search_dimensions=1536`: Specifies the dimensionality of the embedding vectors. This must match the output size of the embedding model (1536 is the default for `text-embedding-3-small`).
        - `vector_search_profile_name="myHnswProfile"`: Links the field to a vector search profile.
3. **Configure Vector Search:** Sets up vector search capabilities using the Hierarchical Navigable Small World (HNSW) algorithm:
    - `HnswAlgorithmConfiguration`: Configures the HNSW algorithm for vector search.
    - `VectorSearchProfile`: Associates the algorithm configuration with a profile (`myHnswProfile`).
4. **Create the Search Index:** Combines the fields and vector search configuration into a `SearchIndex` object. Creates the index in Azure AI Search using `search_index_client.create_index(index)`.

In [None]:
# Create Search Index
def search_index():
    """
    Creates the index and defines its fields in your Azure AI Search service.
    """
    print("Creating search index...")   
    search_index_client = SearchIndexClient(endpoint=search_endpoint, credential=credentials)  
    fields = [  
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),  
        SearchableField(name="content", type=SearchFieldDataType.String),  
        SearchableField(name="blob_name", type=SearchFieldDataType.String),  
        SearchableField(name="document_link", type=SearchFieldDataType.String),  
        SearchField(  
            name="embedding",  
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  
            searchable=True,  
            vector_search_dimensions=1536,  
            vector_search_profile_name="myHnswProfile"  
        )  
    ]  
    vector_search = VectorSearch(  
        algorithms=[  
            HnswAlgorithmConfiguration(name="myHnsw")  
        ],  
        profiles=[  
            VectorSearchProfile(  
                name="myHnswProfile",  
                algorithm_configuration_name="myHnsw"  
            )  
        ]  
    )  
    index = SearchIndex(name="documents-index", fields=fields, vector_search=vector_search)  
    search_index_client.create_index(index)  
    print("Search index created.")  

#### 1.5.3 Uploading the Data to the Search Index

The `upload` function uploads document chunks and their corresponding embeddings to the Azure AI Search index. Here's a breakdown of what it does:

1. **Initialize Azure Search Client:** Creates a `SearchClient` object to interact with the Azure AI Search service.
2. **Prepare Documents for Upload:** Iterates over the `embeddings` list (assumed to contain embeddings and metadata for document chunks). For each embedding, it creates a dictionary with the same index fields mentioned above and appends it to the `documents_to_upload` list.
3. **Upload Documents to Azure Search:** Uses the `upload_documents` method of the `SearchClient` to upload the prepared documents to the Azure AI Search index.

In [None]:
def upload(embeddings, documents):
    """
    Function will upload chunks and embeddings to Azure AI Search
    """

    print("Uploading documents to search index...")  
    search_client = SearchClient(endpoint=search_endpoint, index_name="documents-index", credential=credentials)  
    documents_to_upload = []  

    for i, doc in enumerate(embeddings):  
        documents_to_upload.append({  
            "id": str(i),  
            "content": documents[i]["text"],  
            "embedding": doc["embedding"],  
            "blob_name": doc["metadata"]["blob_name"],  
            "document_link": doc["metadata"]["document_link"]  
        })  
    search_client.upload_documents(documents=documents_to_upload)  
    print("Documents uploaded to search index.")  
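
For larger document sets, a single `upload_documents` call can exceed the service's per-request limits, and individual documents can fail without raising an exception. The sketch below uploads in smaller batches and collects the keys of any documents that were not indexed. The helper name `upload_in_batches` and the batch size of 500 are assumptions; `search_client` is expected to be a `SearchClient` like the one created in `upload`.

```python
def upload_in_batches(search_client, docs, batch_size=500):
    """Upload docs in batches and return the keys of documents that failed."""
    failed_keys = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        # upload_documents returns one result per document in the batch
        results = search_client.upload_documents(documents=batch)
        failed_keys.extend(r.key for r in results if not r.succeeded)
    return failed_keys
```

An empty return value means every document was accepted; otherwise the returned keys identify which `id` values to inspect or retry.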

Now let's put all of our functions together to create our vector workflow!

In [None]:
def vectorize():    
    """  
    Main function that orchestrates the vector workflow.  
    """ 
    documents = chunk_data()
    data_embeddings = embeddings(documents)
    search_index()
    upload(data_embeddings, documents)
    
vectorize()   

As a reminder, here is a summary of what this workflow does.

1. **Setup Connections**:
    - Connect to Azure OpenAI and Blob Storage.
2. **Process Blobs**:
    - List blobs in the container.
    - For each PDF blob, load its content and split it into chunks with metadata.
3. **Customize Metadata**:
    - Add custom metadata such as the blob file name and blob URL:
        ```python
        metadata = {"blob_name": blob.name, "document_link": document_link}  
        ```
4. **Generate Embeddings**:
    - For each chunk, generate embeddings using Azure OpenAI.
5. **Create Search Index**:
    - Define and create a search index in Azure AI Search.
6. **Upload Documents**:
    - Upload the chunks and their embeddings to the search index.

## 2. Retrieve 

In this section, we will implement a function (`chat_on_your_data`) that performs retrieval queries over documents in the Azure AI Search index and uses Azure OpenAI for chat capabilities. The function sends the user's query to Azure OpenAI with the search index attached as a data source, and processes the results to provide relevant information based on the query. The steps of the function are detailed below:

1. **Configure Azure OpenAI Parameters**:
    - Retrieve necessary configurations and API keys from environment variables.
2. **Append User Query**:
    - Append the user's query to the chat messages list.
3. **Use the AzureOpenAI Client**:
    - Reuse the `azure_openai` client created in section 1.5 with `setup_azure_openai()`, which was initialized with the endpoint, API key, and API version.
4. **Create Chat Completion**:
    - Create a chat completion request using Azure OpenAI.
    - Specify the model deployment, chat messages, and additional parameters like `max_tokens`, `temperature`, etc.
    - Provide extra body parameters to include Azure Search as a data source.
        - Extra Body Parameters:
            - `endpoint`: The Azure Search endpoint.
            - `index_name`: The name of the search index.
            - `semantic_configuration`: The semantic search configuration.
            - `query_type`: Type of query (e.g., `vector_simple_hybrid`).
            - `fields_mapping`: Mapping of fields (if any).
            - `role_information`: Information about the role of the assistant.
            - `filter`: Any filters to apply to the search (if any).
            - `strictness`: Level of strictness for the search.
            - `top_n_documents`: Number of top documents to retrieve.
            - `authentication`: Authentication details (API key).
            - `embedding_dependency`: Embedding deployment details.
5. **Extract and Clean Response**:
    - Extract the response data from the completion result.
    - Clean up the AI response by removing unnecessary characters and formatting it properly.
    - Extract the citation URL from the response context.
6. **Append AI Response**:
    - Append the cleaned AI response to the chat messages list.
    - Print the final response.

In [None]:
def chat_on_your_data(query, search_index):  
    """  
    Perform retrieval queries over documents from the Azure AI Search Index.  
    """  
    # Define the query and other parameters  
    
    messages = []  
  
    # Append user query to chat messages  
    messages.append({"role": "user", "content": query})  
  
    print(f"User: {query}")  
  
    print('Processing...')  
    
    # Create a chat completion with Azure OpenAI  
    completion =  azure_openai.chat.completions.create(  
        model=os.getenv('AZURE_GPT_DEPLOYMENT'),  
        messages=[  
            {"role": "system", "content": "You are an AI assistant that helps people find information. Ensure the Markdown responses are correctly formatted before responding."},  
            {"role": "user", "content": query}  
        ],  
        max_tokens=800,  
        temperature=0.7,  
        top_p=0.95,  
        frequency_penalty=0,  
        presence_penalty=0,  
        stop=None,  
        stream=False,  
        extra_body={  
            "data_sources": [{  
                "type": "azure_search",  
                "parameters": {  
                    "endpoint": search_endpoint,  
                    "index_name": search_index,  
                    "semantic_configuration": "default",  
                    "query_type": "vector_simple_hybrid",  
                    "fields_mapping": {},  
                    "in_scope": True,  
                    "role_information": "You are an AI assistant that helps people find information.",  
                    "filter": None,  
                    "strictness": 3,  
                    "top_n_documents": 5,  
                    "authentication": {  
                        "type": "api_key",  
                        "key": search_key  
                    },  
                    "embedding_dependency": {  
                        "type": "deployment_name",  
                        "deployment_name": os.getenv('AZURE_EMBEDDINGS_DEPLOYMENT') 
                    }  
                }  
            }]  
        }  
    )  
  
    # Extract the response data  
    response_data = completion.to_dict()  
    message = response_data['choices'][0]['message']  
    ai_response = message['content']  
    # Clean up the AI response: drop [docN] markers and whitespace before the final period  
    ai_response_cleaned = re.sub(r'\s+\.$', '.', re.sub(r'\[doc\d+\]', '', ai_response))  
    # The response may contain no citations (or no URL), so fall back gracefully  
    citations = message.get("context", {}).get("citations", [])  
    citation = citations[0].get("url") if citations else None  
    ai_response_final = f"{ai_response_cleaned}\n\nCitation(s):\n{citation or 'No citation returned.'}"  
  
    # Append AI response to chat messages  
    messages.append({"role": "assistant", "content": ai_response_final})  
  
    print(f"GPT Response: {ai_response_final}")  
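
The function above reports only the first citation. As a variation, the sketch below collects every distinct citation URL from the response dictionary and cleans the `[docN]` markers out of the answer. The helper names `clean_answer` and `citation_urls` are hypothetical, and the `sample` dictionary is a made-up stand-in that mirrors only the response fields used in this notebook.

```python
import re

def clean_answer(text):
    """Remove [docN] markers and any whitespace left before punctuation."""
    without_markers = re.sub(r"\[doc\d+\]", "", text)
    return re.sub(r"\s+([.,;:!?])", r"\1", without_markers).strip()

def citation_urls(response_data):
    """Return the distinct, non-empty citation URLs in their original order."""
    message = response_data["choices"][0]["message"]
    citations = message.get("context", {}).get("citations", [])
    urls = []
    for citation in citations:
        url = citation.get("url")
        if url and url not in urls:
            urls.append(url)
    return urls

# Made-up response fragment for illustration only
sample = {
    "choices": [{
        "message": {
            "content": "The answer is here [doc1][doc2].",
            "context": {"citations": [
                {"url": "https://example.com/a.pdf"},
                {"url": "https://example.com/a.pdf"},
                {"url": None},
            ]},
        }
    }]
}

print(clean_answer(sample["choices"][0]["message"]["content"]))
print(citation_urls(sample))
```

Deduplicating matters here because several retrieved chunks often come from the same source file and therefore share one link.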
  


Before we run our function, we must define the query and the search index:
- Set the user query and the name of the search index to query (the index created in section 1.5.2).
- Default values are provided:
    - Example query: `"What year was New York State Route 373 built?"`
    - Search index: `"documents-index"`

In [None]:
query = "What year was New York State Route 373 built?" # Example query
# Use a name other than search_index so the search_index() function from section 1.5.2 is not overwritten
index_name = "documents-index"  

Finally, let's run our function!

In [None]:
# Call the function to test it  
chat_on_your_data(query, index_name) 

## Conclusion

Congratulations! You have successfully created a Retrieval-Augmented Generation (RAG) application for documents stored in your Azure Blob Storage Account, using Azure OpenAI and Azure AI Search. At this point, you should have a solid understanding of how to build the logic for vectorizing documents from an Azure Blob Storage container and retrieving those documents in your Azure OpenAI application.

### Key Accomplishments:
1. **Environment Setup**:
    - Initialized Azure OpenAI with the necessary API credentials and configurations.
    - Established a connection to Azure Blob Storage to access PDF documents.
2. **Vectorize**:
    - Implemented a function to split PDF text into manageable chunks with associated metadata.
    - Orchestrated the entire vectorization process:
        - Set up Azure OpenAI and connected to Azure Blob Storage.
        - Retrieved and chunked documents.
        - Generated embeddings for each chunk using Azure OpenAI.
        - Created a search index in Azure AI Search.
        - Uploaded the chunks and their embeddings to Azure AI Search.
3. **Retrieve**:
    - Implemented a function to perform retrieval queries over the documents indexed in Azure AI Search using Azure OpenAI.
    - Executed a user query and performed a search using Azure AI Search.
    - Generated a chat completion based on the search results and formatted it for display.

## Clean Up
Make sure to shut down your Azure ML compute. If desired, you can also delete your Azure AI Search service, Azure Blob Storage Account, and Azure OpenAI service. ***Note that these services can be used in other tutorials in this notebook.***