# Retrieval Augmented Generation (RAG) with Azure AI Search and OpenAI

This notebook demonstrates how to use RAG to give a large or small language model (LLM/SLM) additional context so that it can produce more accurate answers. It uses Azure AI Search to index the documents and an Azure OpenAI embedding model to generate vector embeddings for them. The notebook covers the following steps:

+ Create an index schema
+ Load the sample data from a local folder
+ Embed the documents in memory using an Azure OpenAI embedding model such as `text-embedding-3-large`
+ Index the vector and non-vector fields on Azure AI Search
+ Run a series of vector and hybrid (text + vector) queries, including metadata filtering

The code uses Azure OpenAI to generate embeddings for title and content fields. You'll need access to Azure OpenAI to run this demo.

## Create the resources

Refer to the `README.md` file in the root folder to create the resources.

In [None]:
%pip install python-dotenv
%pip install openai
%pip install tiktoken
%pip install azure-search-documents
%pip install azure-identity

Load environment variables from the `.env` file, create the Azure OpenAI client, and test the connection.

In [25]:
import os
import re
# import pandas as pd
from openai import AzureOpenAI
from dotenv import load_dotenv
from dotenv import dotenv_values

if os.path.exists(".env"):
    load_dotenv(override=True)
    config = dotenv_values(".env")

azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_openai_chat_completions_deployment_name = os.getenv("AZURE_OPENAI_CHAT_COMPLETIONS_DEPLOYMENT_NAME")

azure_openai_embedding_model = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL")
embedding_vector_dimensions = os.getenv("EMBEDDING_VECTOR_DIMENSIONS")

azure_search_service_endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
azure_search_service_admin_key = os.getenv("AZURE_SEARCH_SERVICE_ADMIN_KEY")
search_index = os.getenv("SEARCH_INDEX")

openai_client = AzureOpenAI(
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_api_key,
    api_version="2024-02-01"
)

# Test connection to OpenAI
try:
    completion = openai_client.chat.completions.create(
        model=azure_openai_chat_completions_deployment_name,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What AI model are you ?"}
        ])

    print(completion.to_json())
except Exception as e:
    # Exceptions do not generally expose a `.messages` attribute, so print the exception itself
    print(e)
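
A failed connection test is often caused by a missing or empty setting. The following is a minimal sketch of a pre-flight check, assuming the variable names read in the cell above. The helper `missing_settings` is hypothetical and not part of the original notebook.

```python
import os

# Names taken from the cell above; adjust them if your .env uses different keys.
REQUIRED_SETTINGS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_COMPLETIONS_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_SERVICE_ADMIN_KEY",
    "SEARCH_INDEX",
]

def missing_settings(names, env=None):
    """Return the names that are unset or blank in the mapping (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]

missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings:", ", ".join(missing))
```

Running this before creating the clients makes a configuration problem visible before any network call is attempted.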

Get the number of tokens in a text string.

In [5]:
import tiktoken

def num_tokens_from_string(string: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name="cl100k_base")
    num_tokens = len(encoding.encode(string, disallowed_special=()))
    return num_tokens

num_tokens_from_string("tiktoken is great!")

6

The OpenAI embedding model `text-embedding-3-large` accepts at most `8191` tokens per input.
Longer files must be split into chunks of at most `8191` tokens before they are sent to the model.
The next cell counts the tokens in each sample file and lists the files that exceed `8191` tokens.

In [6]:
input_directory = './data/samples/'

for filename in os.listdir(input_directory):
    if filename.endswith('.md'):
        with open(os.path.join(input_directory, filename), 'r', encoding='utf-8') as file:
            content = file.read()
            tokens = num_tokens_from_string(content)
            if tokens > 8191:
                print(f'File {filename} has {tokens} tokens which is more than 8191 (max) tokens')

File assistant.md has 8817 tokens which is more than 8191 (max) tokens
File content-filter.md has 11481 tokens which is more than 8191 (max) tokens
File fine-tuning-python.md has 8846 tokens which is more than 8191 (max) tokens
File latest-inference-preview.md has 75252 tokens which is more than 8191 (max) tokens
File latest-inference.md has 22512 tokens which is more than 8191 (max) tokens
File use-your-data.md has 12055 tokens which is more than 8191 (max) tokens


The rest of this lab works with the Markdown (`.md`) sample files. Before embedding them, we strip Markdown syntax that only adds noise: links are reduced to their text, and images, bold markers (`**`), and line breaks are stripped out. The function `clean_markdown_content()` does this.

In [7]:
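def clean_markdown_content(content):
    # Remove images first; otherwise the link pattern below would match
    # the "[alt](url)" part of "![alt](url)" and leave "!alt" behind
    image_pattern = r'\!\[([^\[]*)\]\(([^\)]+)\)'
    content = re.sub(image_pattern, '', content)

    # Replace links with their link text
    link_pattern = r'\[([^\[]+)\]\(([^\)]+)\)'
    content = re.sub(link_pattern, r'\1', content)

    # Remove all occurrences of **
    content = content.replace('**', '')
    # Replace line breaks with a space so that words on adjacent lines do not run together
    content = content.replace('\n', ' ')

    return content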
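
The link and image patterns interact: the link pattern also matches the bracketed part of an image. The following small check applies the two substitutions in both orders to a made-up sample string (the URL and file name are illustrative only).

```python
import re

link_pattern = r'\[([^\[]+)\]\(([^\)]+)\)'
image_pattern = r'\!\[([^\[]*)\]\(([^\)]+)\)'

sample = "See ![diagram](img/rag.png) and the [docs](https://example.com/docs)."

# Links first: the link pattern consumes "[diagram](img/rag.png)",
# so the image pattern no longer matches and a stray "!diagram" remains.
links_first = re.sub(image_pattern, '', re.sub(link_pattern, r'\1', sample))

# Images first: the image is removed entirely, then the link is reduced to its text.
images_first = re.sub(link_pattern, r'\1', re.sub(image_pattern, '', sample))

print(links_first)   # See !diagram and the docs.
print(images_first)  # See  and the docs.
```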

Get the vector embedding for the text.

In [11]:
def get_embeddings_vector(text):

    response = openai_client.embeddings.create(
        input=text,
        model=azure_openai_embedding_model,
    )

    embedding = response.data[0].embedding

    return embedding

vector = get_embeddings_vector("Sample text")
print(vector)

APIConnectionError: Connection error.

Create file chunks.

In [13]:
import uuid
import re
import json
import os

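input_directory = './data/samples/'
output_directory = './data/chunks/'
# Make sure the output folder exists before writing chunk files
os.makedirs(output_directory, exist_ok=True)

i = 0  # running chunk counter across all files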
# Loop through each file in the directory
for filename in os.listdir(input_directory):
    # Check if the file is a markdown file
    if filename.endswith('.md'):
        # Open the file
        with open(os.path.join(input_directory, filename), 'r', encoding='utf-8') as file:
            print(filename)
            # Read the file content
            content = file.read()
            
            # Skip the file if it lacks title, description, ms.date or a '##' heading
            if 'title:' not in content or 'description:' not in content or 'ms.date:' not in content or '##' not in content:
                print(f'File {filename} does not contain title, description, ms.date or ##')
                continue

            # Extract the title, description, and date
            page_title = re.search(r'title: (.*)', content).group(1).replace('"', '')
            page_description = re.search(r'description: (.*)', content).group(1)
            page_date = re.search(r'ms\.date: (.*)', content).group(1)
            
            # Split the content into chunks based on '##'
            chunks = content.split('\n## ')[1:]  # Skip the first chunk as it contains the title, description, and date
            
            # Add the chunks to the list along with the title, description, and date
            for chunk in chunks:
                i=i+1
                chunk_content = clean_markdown_content(chunk.strip())
                vector = get_embeddings_vector(chunk_content)
                
                chunk = {
                    "id": str(uuid.uuid4()),
                    'page_title': page_title,
                    'page_description': page_description,
                    'page_date': page_date,
                    'chunk_title': chunk.split('\n')[0],  # The first line after '##' is the title of the chunk
                    'chunk_content': chunk_content,  # Chunk text after Markdown cleanup
                    'vector': vector
                }
                
                chunk_file_name = f'chunk_{i}_{page_title}.json'.replace('?', '').replace(':', '').replace("'", '').replace('|', '').replace('/', '').replace('\\', '')

                # write chunk into JSON file into output directory
                with open(os.path.join(output_directory, chunk_file_name), 'w', encoding='utf-8') as f:
                    json.dump(chunk, f)

abuse-monitoring.md
advanced-prompt-engineering.md
File advanced-prompt-engineering.md does not contain title, description, ms.date or ##
api-version-deprecation.md
assistant-functions.md
assistant.md
assistants-logic-apps.md
assistants-quickstart.md
File assistants-quickstart.md does not contain title, description, ms.date or ##
assistants-reference-messages.md
assistants-reference-runs.md
assistants-reference-threads.md
assistants-reference.md
assistants.md
azure-developer-cli.md
azure-machine-learning.md
azure-search.md
business-continuity-disaster-recovery.md
chat-markup-language.md
chatgpt-quickstart.md
File chatgpt-quickstart.md does not contain title, description, ms.date or ##
chatgpt.md
File chatgpt.md does not contain title, description, ms.date or ##
code-interpreter.md
completions.md
content-credentials.md
content-filter.md
content-filters.md
cosmos-db.md
create-resource.md
customizing-llms.md
dall-e-quickstart.md
File dall-e-quickstart.md does not contain title, descriptio

In [17]:
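# Token counts for the chunks of the last processed file.
# For each chunk: first the raw text, then the same text JSON-encoded (escaping adds tokens).
for i, chunk in enumerate(chunks):
    num_tokens = num_tokens_from_string(chunk)
    num_tokens_json = num_tokens_from_string(json.dumps(chunk, ensure_ascii=False))
    
    print(f'chunk {i} : {num_tokens}')
    print(f'chunk {i} : {num_tokens_json}')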

chunk 0 : 321
chunk 0 : 341
chunk 1 : 147
chunk 1 : 161
chunk 2 : 1806
chunk 2 : 1955
chunk 3 : 1023
chunk 3 : 1127
chunk 4 : 34
chunk 4 : 39


By default, the embedding vector has `1536` dimensions for `text-embedding-3-small` and `3072` for `text-embedding-3-large`. You can request fewer dimensions by passing the `dimensions` parameter, without the embedding losing its concept-representing properties.
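
As an illustration of what a shorter embedding looks like, the sketch below keeps the leading values of a vector and rescales the result to unit length, which is the manual shortening described in OpenAI's embeddings guide for the `text-embedding-3` models. The helper `shorten_embedding` is hypothetical, and the toy vector stands in for a real embedding.

```python
import math

def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` values and rescale the result to unit length."""
    cut = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in cut))
    return cut if norm == 0 else [x / norm for x in cut]

# Toy 4-dimensional vector standing in for a real embedding.
short = shorten_embedding([3.0, 4.0, 12.0, 0.5], dimensions=2)
print(short)  # [0.6, 0.8]

# With the API, shorter vectors are requested server-side (not executed here):
# openai_client.embeddings.create(input=text, model=azure_openai_embedding_model, dimensions=1024)
```

Whichever size you choose, the `vector_search_dimensions` value of the index field must match the length of the vectors you upload.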

Create Index in Azure AI Search.

In [18]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    ComplexField,
    CorsOptions,
    SearchIndex,
    SearchField,
    ScoringProfile,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticSearch,
    SemanticField
)

credential = AzureKeyCredential(azure_search_service_admin_key)

index_name = search_index

# SearchIndexClient manages indexes at the service level, so it takes no index name
search_index_client = SearchIndexClient(
    endpoint=azure_search_service_endpoint, 
    credential=credential
)

# create search index
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
    ),
    SearchableField(name="page_title", type=SearchFieldDataType.String),
    SearchableField(name="page_description", type=SearchFieldDataType.String),
    SearchableField(name="page_date", type=SearchFieldDataType.String),
    SearchableField(name="chunk_title", type=SearchFieldDataType.String),
    SearchableField(name="chunk_content", type=SearchFieldDataType.String),
    SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=3072,  # 3072 for text-embedding-3-large; use 1536 for text-embedding-3-small
        vector_search_profile_name="myHnswProfile",
    ),
]

# Configure the vector search configuration  
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
        )
    ]
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="page_title"),
        # keywords_fields=[SemanticField(field_name="category")],
        content_fields=[SemanticField(field_name="chunk_content")]
    )
)

# Create the semantic settings with the configuration
semantic_search = SemanticSearch(configurations=[semantic_config])
# Create the search index with the semantic settings
# (named `index` so the `search_index` name loaded from .env is not overwritten)
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search, semantic_search=semantic_search)
result = search_index_client.create_or_update_index(index)
print(f' {result.name} created')
# cors_options = CorsOptions(allowed_origins=["*"], max_age_in_seconds=60)
# scoring_profile = ScoringProfile(name="MyProfile")
# scoring_profiles = []
# scoring_profiles.append(scoring_profile)

# index = SearchIndex(
#     name=name,
#     fields=fields,
#     scoring_profiles=scoring_profiles,
#     cors_options=cors_options,
# )
# result = index_client.create_or_update_index(index=index)

 index-doc created


In [19]:
# Optional cleanup: delete the index. Skip this cell if you want to upload and query documents next.
search_index_client.delete_index(index_name)

Upload documents to Azure AI Search

In [19]:
import uuid
from azure.search.documents import SearchClient

search_client = SearchClient(endpoint=azure_search_service_endpoint, index_name=index_name, credential=credential)

# for each json file in ./data/chunks/ folder, load the json document and upload it to the search index

for filename in os.listdir(output_directory):
    if filename.endswith('.json'):
        with open(os.path.join(output_directory, filename), 'r') as file:
            document = json.load(file)

            # upload_documents expects a list of documents, so wrap the single document
            result = search_client.upload_documents(documents=[document])
            print(f"Upload of {filename} succeeded: { result[0].succeeded }")

# result = search_client.upload_documents(documents=documents)
# print("Upload of new document succeeded: {}".format(result[0].succeeded))

Upload of chunk_100_How to work with the Chat Markup Language (preview).json succeeded: True
Upload of chunk_101_How to work with the Chat Markup Language (preview).json succeeded: True
Upload of chunk_102_How to work with the Chat Markup Language (preview).json succeeded: True
Upload of chunk_103_How to use Azure OpenAI Assistants Code Interpreter.json succeeded: True
Upload of chunk_104_How to use Azure OpenAI Assistants Code Interpreter.json succeeded: True
Upload of chunk_105_How to use Azure OpenAI Assistants Code Interpreter.json succeeded: True
Upload of chunk_106_How to use Azure OpenAI Assistants Code Interpreter.json succeeded: True
Upload of chunk_107_How to use Azure OpenAI Assistants Code Interpreter.json succeeded: True
Upload of chunk_108_How to generate text with Azure OpenAI Service.json succeeded: True
Upload of chunk_109_How to generate text with Azure OpenAI Service.json succeeded: True
Upload of chunk_10_How to use Azure OpenAI Assistants function calling.json succ

Upload documents to Azure AI Search in batches.

In [25]:
from azure.search.documents import SearchIndexingBufferedSender

documents = []
for filename in os.listdir(output_directory):
    if filename.endswith('.json'):
        with open(os.path.join(output_directory, filename), 'r') as file:
            document = json.load(file)
            documents.append(document)

# Use SearchIndexingBufferedSender to upload the documents in batches optimized for indexing  
with SearchIndexingBufferedSender(  
    endpoint=azure_search_service_endpoint,  
    index_name=index_name,  
    credential=credential,  
) as batch_client:  
    # Add upload actions for all documents  
    result = batch_client.upload_documents(documents=documents)  
print(f"Uploaded {len(documents)} documents in total")  

Uploaded 1303 documents in total
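
`SearchIndexingBufferedSender` groups documents into batches for you. If you prefer to stay with `SearchClient.upload_documents`, you can batch manually. This is a minimal sketch: the helper `batched` is hypothetical, the toy documents stand in for the JSON chunks, and the batch size of 500 in the commented usage is an arbitrary illustrative value that should be checked against the service limits.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy documents standing in for the JSON chunks loaded above.
docs = [{"id": str(n)} for n in range(7)]
sizes = [len(batch) for batch in batched(docs, batch_size=3)]
print(sizes)  # [3, 3, 1]

# In the notebook, each batch would be sent like this (not executed here):
# for batch in batched(documents, 500):
#     search_client.upload_documents(documents=batch)
```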


In [22]:
# Show how the `standard.lucene` analyzer tokenizes a sample string
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import AnalyzeTextOptions

def simple_analyze_text():

    search_index_client = SearchIndexClient(azure_search_service_endpoint, credential)

    analyze_request = AnalyzeTextOptions(text="One's <two/>", analyzer_name="standard.lucene")

    result = search_index_client.analyze_text(index_name, analyze_request)
    print(result.as_dict())

In [23]:
simple_analyze_text()

{'tokens': [{'token': "one's", 'start_offset': 0, 'end_offset': 5, 'position': 0}, {'token': 'two', 'start_offset': 7, 'end_offset': 10, 'position': 1}]}


## Perform a vector similarity search

This example runs a pure vector search. The query text is embedded on the client with `get_embeddings_vector()` and the resulting vector is passed to Azure AI Search as a `VectorizedQuery`, so no vectorizer needs to be configured on the index.

In [24]:
from azure.search.documents.models import VectorizedQuery

# Pure Vector Search
query = "rag"  

embedding = get_embeddings_vector(query)
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="vector")
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["page_title", "page_date", "chunk_title", "chunk_content"],
)  
  
for result in results:
    print(f"Page Date: {result['page_date']}")  
    print(f"Page Title: {result['page_title']}")  
    print(f"Chunk Title: {result['chunk_title']}")  
    print(f"Chunk Content: {result['chunk_content']}")
    print(f"Score: {result['@search.score']}")  


Page Date: 03/26/2024
Page Title: Azure OpenAI Service getting started with customizing a large language model (LLM)
Chunk Title: RAG (Retrieval Augmented Generation)
Chunk Content: RAG (Retrieval Augmented Generation)### Definition RAG (Retrieval Augmented Generation) is a method that integrates external data into a Large Language Model prompt to generate relevant responses. This approach is particularly beneficial when using a large corpus of unstructured text based on different topics. It allows for answers to be grounded in the organization’s knowledge base (KB), providing a more tailored and accurate response.RAG is also advantageous when answering questions based on an organization’s private data or when the public data that the model was trained on might have become outdated. This helps ensure that the responses are always up-to-date and relevant, regardless of the changes in the data landscape.### Illustrative use caseA corporate HR department is looking to provide an intellige
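
Conceptually, a vector query ranks the stored vectors by their similarity to the query vector, and the HNSW index approximates that ranking efficiently. The sketch below is an exact brute-force version on toy data, assuming cosine similarity as the metric. The helpers `cosine_similarity` and `nearest_neighbors` are hypothetical and are not how the service is implemented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k):
    """Return the k (id, score) pairs most similar to `query`, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-dimensional "index" standing in for the stored chunk vectors.
toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = nearest_neighbors([1.0, 0.0], toy_index, k=2)
print([doc_id for doc_id, _ in top])  # ['a', 'c']
```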

In [4]:
# Ask the chat model a question and let Azure OpenAI retrieve grounding data
# from the search index through the `data_sources` extension

completion = openai_client.chat.completions.create(
    model=azure_openai_chat_completions_deployment_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant for a cloud engineer."},
        {"role": "user", "content": "What are the LLM models supported by Azure ?"}
    ],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": azure_search_service_endpoint,
                    "index_name": search_index,
                    "authentication": {
                        "type": "api_key",
                        "key": azure_search_service_admin_key,
                        # "type": "system_assigned_managed_identity"
                    }
                }
            }
        ]
    }
)
      
print(completion.to_json())

APIConnectionError: Connection error.
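
The `data_sources` option lets the service perform the retrieval. The same RAG pattern can also be assembled by hand: run a search, then place the retrieved chunks in the prompt. This is a minimal sketch under that assumption. The helper `build_grounded_messages` is hypothetical, the field names follow the index schema defined earlier, and the toy chunk stands in for documents returned by `search_client.search(...)`.

```python
def build_grounded_messages(question, retrieved_chunks, max_chunks=3):
    """Compose chat messages that ask the model to answer only from the retrieved chunks."""
    sources = []
    for number, chunk in enumerate(retrieved_chunks[:max_chunks], start=1):
        header = f"[{number}] {chunk['page_title']} / {chunk['chunk_title']}"
        sources.append(f"{header}\n{chunk['chunk_content']}")
    context = "\n\n".join(sources)
    system = (
        "You are a helpful assistant for a cloud engineer. "
        "Answer using only the sources below and cite them by number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# Toy search result standing in for real documents from the index.
chunks = [{
    "page_title": "Customizing LLMs",
    "chunk_title": "RAG",
    "chunk_content": "RAG integrates external data into the prompt.",
}]
messages = build_grounded_messages("What is RAG?", chunks)
print(messages[0]["content"])
```

The resulting `messages` list can be passed to `openai_client.chat.completions.create(...)` without the `extra_body` argument, which keeps retrieval and prompt construction under your control.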