## Install Python packages

In [1]:
%pip install python-dotenv --no-cache-dir
%pip install tiktoken --no-cache-dir
%pip install azure-search-documents --no-cache-dir
%pip install azure-identity --no-cache-dir
%pip install openai --no-cache-dir

Note: you may need to restart the kernel to use updated packages.
Collecting openai
  Downloading openai-1.99.5-py3-none-any.whl.metadata (29 kB)
Collecting anyio<5,>=3.5.0 (from openai)
  Downloading anyio-4.10.0-py3-none-any.whl.metadata (4.0 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.10.0-cp313-cp313-win_amd64.whl.metadata (5.3 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Downloading pydantic-2.11.7-py3-none-any.whl.metadata (67 kB)
Collecting sniffio (from openai)
  Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting httpcore==1.* (fro

## Connect to Azure AI Search and Azure OpenAI

Load the environment variables from the .env file, create the Azure OpenAI client, and send a test request to check the connection.

In [1]:
import os
import re
from openai import AzureOpenAI
from dotenv import load_dotenv
from dotenv import dotenv_values

if os.path.exists(".env"):
    load_dotenv(override=True)
    config = dotenv_values(".env")

azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_openai_chat_completions_deployment_name = os.getenv("AZURE_OPENAI_CHAT_COMPLETIONS_DEPLOYMENT_NAME")

azure_openai_embedding_model = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL")
embedding_vector_dimensions = os.getenv("EMBEDDING_VECTOR_DIMENSIONS")

azure_search_service_endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
azure_search_service_admin_key = os.getenv("AZURE_SEARCH_SERVICE_ADMIN_KEY")
search_index_name = os.getenv("SEARCH_INDEX_NAME")

openai_client = AzureOpenAI(
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_api_key,
    api_version="2024-12-01-preview"
)

# Test the connection to the chat completions deployment
completion = openai_client.chat.completions.create(
    model=azure_openai_chat_completions_deployment_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you ?"}
    ])
print(completion.to_json())

{
  "id": "chatcmpl-CAyIMZy9kLK8V9vgmfV5w7k2IQlPL",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Hello! I am an AI language model developed to assist you with information, answer questions, and help with various tasks. How can I help you today?",
        "refusal": null,
        "role": "assistant",
        "annotations": []
      },
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ],
  "created": 1756731090,
  "model": "gpt-4.1-nano-2025-04-14",
  "object": "chat.completion",
  "system_fingerprint": "fp_368a354b49",
  "usage": {
    "completion_t

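The cell above reads its settings with os.getenv, which returns None for any variable that is not set. The following is a minimal sketch of an early check, assuming the same variable names as above; the helper name require_env is hypothetical and not part of the original notebook.

```python
import os

def require_env(names, environ=None):
    """Return the requested variables as a dict, or raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}

# Variable names taken from the cell above
REQUIRED_VARIABLES = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_COMPLETIONS_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_MODEL",
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_SERVICE_ADMIN_KEY",
    "SEARCH_INDEX_NAME",
]

# Possible usage after load_dotenv():
# settings = require_env(REQUIRED_VARIABLES)
```

Failing at this point gives a clearer message than an authentication or connection error raised later by one of the clients.
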
## Count the number of tokens in a text

Like chat models, embedding models have a maximum input size, measured in tokens. For text-embedding-3-large the limit is 8191 tokens, so longer texts have to be split into chunks of 8191 tokens or fewer before they are embedded.

In [2]:
import tiktoken

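def num_tokens_from_string(string: str) -> int:
    # cl100k_base is the encoding used by the text-embedding-3 models
    encoding = tiktoken.get_encoding(encoding_name="cl100k_base")
    # disallowed_special=() encodes special-token text as ordinary text instead of raising an error
    num_tokens = len(encoding.encode(string, disallowed_special=()))
    return num_tokens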

# Test the function
num_tokens_from_string("tiktoken is great!")

6

Because text-embedding-3-large accepts at most 8191 tokens per input, files above that size cannot be embedded in one piece. The next cell counts the tokens in each sample file and lists the files that exceed 8191 tokens.

In [None]:
input_directory = './data/md-files/'

for filename in os.listdir(input_directory):
    if filename.endswith('.md'):
        with open(os.path.join(input_directory, filename), 'r', encoding='utf-8') as file:
            content = file.read()
            tokens = num_tokens_from_string(content)
            if tokens > 8191:
                print(f'File {filename} has {tokens} tokens which is more than 8191 (max) tokens')

File assistant.md has 8817 tokens which is more than 8191 (max) tokens
File content-filter.md has 11481 tokens which is more than 8191 (max) tokens
File fine-tune copy.md has 10788 tokens which is more than 8191 (max) tokens
File fine-tuning-python.md has 8869 tokens which is more than 8191 (max) tokens
File use-your-data.md has 12120 tokens which is more than 8191 (max) tokens
File whats-new.md has 9884 tokens which is more than 8191 (max) tokens
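
Files like these have to be split before they can be embedded. The sketch below shows one possible way to do it: paragraphs are packed greedily into pieces that stay under a token budget. The function name split_by_token_budget is hypothetical, and the token counter is passed in as a parameter; in this notebook num_tokens_from_string could play that role, while the example at the end uses a plain word count only for illustration.

```python
from typing import Callable, List

def split_by_token_budget(text: str, max_tokens: int,
                          count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily pack paragraphs into pieces of at most max_tokens tokens."""
    pieces: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                pieces.append(current)
            # A single paragraph above the budget is kept whole here;
            # it would need a finer split (for example by sentence).
            current = paragraph
    if current:
        pieces.append(current)
    return pieces

# Illustration with a word count standing in for a real tokenizer
word_count = lambda s: len(s.split())
print(split_by_token_budget("a b c\n\nd e\n\nf g h i", 5, word_count))
```

Splitting on paragraph boundaries keeps sentences intact, at the cost of pieces that are usually somewhat smaller than the budget.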


## Transforming/cleaning the documents

Strip markdown syntax that adds no meaning to the embedded text: links are replaced by their link text, and images, bold markers (**), and line breaks are removed. The function clean_markdown_content() does this.

In [4]:
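def clean_markdown_content(content):
    # Remove images first; otherwise the link pattern below would match
    # the [alt](url) part of ![alt](url) and leave a stray "!alt" behind
    image_pattern = r'\!\[([^\[]*)\]\(([^\)]+)\)'
    content = re.sub(image_pattern, '', content)

    # Replace links with their link text
    link_pattern = r'\[([^\[]+)\]\(([^\)]+)\)'
    content = re.sub(link_pattern, r'\1', content)

    # Remove all occurrences of **
    content = content.replace('**', '')
    # Remove line breaks (lines are joined without a separator)
    content = content.replace('\n', '')

    return content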
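
The order in which the link and image patterns run matters, because an image is written as a link preceded by an exclamation mark. The following self-contained check applies the same two regular expressions in both orders to a hypothetical sample string.

```python
import re

LINK_PATTERN = r'\[([^\[]+)\]\(([^\)]+)\)'
IMAGE_PATTERN = r'\!\[([^\[]*)\]\(([^\)]+)\)'

# Hypothetical sample text containing one image and one link
sample = "See ![logo](img/logo.png) and [the docs](https://example.com/docs)."

# Links first: the link pattern also matches the [alt](url) part of the image
links_first = re.sub(IMAGE_PATTERN, '', re.sub(LINK_PATTERN, r'\1', sample))

# Images first: the image is removed completely before links are unwrapped
images_first = re.sub(LINK_PATTERN, r'\1', re.sub(IMAGE_PATTERN, '', sample))

print(links_first)   # See !logo and the docs.
print(images_first)  # See  and the docs.
```

Running the image pattern first avoids the leftover "!logo" fragment in the cleaned text.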

## Get the vector embedding for an input text

In [5]:
def get_embeddings_vector(text):

    response = openai_client.embeddings.create(
        input=text,
        model=azure_openai_embedding_model,
    )

    embedding = response.data[0].embedding

    return embedding

# Test the function
vector = get_embeddings_vector("Sample text")
print(vector)

[-0.012435130774974823, -0.04316585138440132, -0.009822873398661613, 0.011554595082998276, 0.006599131505936384, -0.013384154066443443, -0.04163958877325058, 0.059954747557640076, -0.019371801987290382, 0.0006316626095212996, 0.028959864750504494, 0.007949287071824074, 0.008849390782415867, -0.05157986655831337, 0.013932042755186558, 0.013256965205073357, -0.010253357701003551, 0.00492366636171937, 0.008017773739993572, -0.02305048704147339, -0.002491184277459979, 0.004666842985898256, -0.026592200621962547, 0.051892947405576706, 0.007430749014019966, -0.006525753531605005, -0.01613338477909565, 0.012797129340469837, 0.007919936440885067, 0.024635452777147293, 0.008986363187432289, 0.03978068009018898, -0.005650108680129051, -0.028294570744037628, 0.01490063313394785, 0.013755936175584793, 0.03093617968261242, 0.020193634554743767, 0.027844518423080444, 0.007939503528177738, 0.026044311001896858, 0.015389819629490376, -0.04449643939733505, -0.015487657859921455, -0.02007623016834259, 0

## Create file chunks

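Split the markdown files in the folder ./data/md-files into chunks, one chunk per level-2 heading (##). Each chunk is cleaned, embedded, and written to ./data/chunks as a JSON file together with the page title, description, and date.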

In [6]:
import uuid
import re
import json
import os

input_directory = './data/md-files/'
output_directory = './data/chunks/'
# create output directory if it doesn't exist
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

chunk_index=0
# Loop through each file in the directory
for filename in os.listdir(input_directory):
    # Check if the file is a markdown file
    if filename.endswith('.md'):
        # Open the file
        with open(os.path.join(input_directory, filename), 'r', encoding='utf-8') as file:
            print(filename)
            # Read the file content
            content = file.read()
            
            # Skip the file if its content doesn't contain title, description, ms.date and '##'
            if 'title:' not in content or 'description:' not in content or 'ms.date:' not in content or '##' not in content:
                print(f'File {filename} does not contain title, description, ms.date or ##')
                continue

            # Extract the title, description, and date
            page_title = re.search(r'title: (.*)', content).group(1).replace('"', '')
            page_description = re.search(r'description: (.*)', content).group(1)
            page_date = re.search(r'ms\.date: (.*)', content).group(1)
            
            # Split the content into chunks based on '##'
            chunks = content.split('\n## ')[1:]  # Skip the first chunk as it contains the title, description, and date
            
            # Add the chunks to the list along with the title, description, and date
            for chunk in chunks:
                chunk_index=chunk_index + 1
                chunk_content = clean_markdown_content(chunk.strip())
                
                # Skip only the oversized chunk; the remaining chunks of the file are still processed
                if num_tokens_from_string(chunk_content) > 8191:
                    print(f'Chunk {chunk_index} in file {filename} has more than 8191 tokens')
                    continue

                vector = get_embeddings_vector(chunk_content)
                
                chunk = {
                    "id": str(uuid.uuid4()),
                    'page_title': page_title,
                    'page_description': page_description,
                    'page_date': page_date,
                    'chunk_title': chunk.split('\n')[0],  # The first line after '## ' is the title of the chunk
                    'chunk_content': chunk_content,  # Chunk text after stripping and markdown cleaning
                    'vector': vector
                }
                
                chunk_file_name = f'chunk_{chunk_index}_{page_title}.json'.replace('?', '').replace(':', '').replace("'", '').replace('|', '').replace('/', '').replace('\\', '')

                # write chunk into JSON file into output directory
                with open(os.path.join(output_directory, chunk_file_name), 'w', encoding='utf-8') as f:
                    json.dump(chunk, f)

abuse-monitoring.md
advanced-prompt-engineering.md
File advanced-prompt-engineering.md does not contain title, description, ms.date or ##
ai-search-ingestion.md
File ai-search-ingestion.md does not contain title, description, ms.date or ##
api-surface.md
api-version-deprecation.md
assistant-functions.md
assistant.md
assistants-ai-studio.md
assistants-csharp.md
assistants-javascript.md
assistants-logic-apps.md
assistants-python.md
assistants-quickstart.md
File assistants-quickstart.md does not contain title, description, ms.date or ##
assistants-reference-messages.md
assistants-reference-runs.md
assistants-reference-threads.md
assistants-reference.md
assistants-rest.md
assistants-studio.md
assistants-v2-note.md
File assistants-v2-note.md does not contain title, description, ms.date or ##
assistants.md
azure-developer-cli.md
batch.md
business-continuity-disaster-recovery.md
chat-completion.md
chat-go.md
chat-markup-language.md
chatgpt-dotnet.md
chatgpt-java.md
chatgpt-javascript.md
chatg
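
The extraction and splitting logic of the cell above can be tried in isolation. The sketch below runs the same regular expressions and the same split on a hypothetical page; the sample content is invented for this example and does not come from the data folder.

```python
import re

# Hypothetical page with the front matter fields the cell above expects
sample_page = """---
title: "Sample page"
description: A hypothetical page used only for this example
ms.date: 01/01/2024
---

# Sample page

Intro text.

## Prerequisites

- An Azure subscription

## Next steps

Read more.
"""

page_title = re.search(r'title: (.*)', sample_page).group(1).replace('"', '')
page_date = re.search(r'ms\.date: (.*)', sample_page).group(1)

# Everything before the first "\n## " is front matter and introduction, so it is skipped
sections = sample_page.split('\n## ')[1:]
chunk_titles = [section.split('\n')[0] for section in sections]

print(page_title)    # Sample page
print(page_date)     # 01/01/2024
print(chunk_titles)  # ['Prerequisites', 'Next steps']
```

Note that text before the first level-2 heading, such as the introduction, never becomes a chunk with this approach.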

By default, the embedding vector has 1536 dimensions for text-embedding-3-small and 3072 for text-embedding-3-large. The dimensions parameter of the embeddings request shortens the vector without the embedding losing its concept-representing properties. Whatever length is used, the vector field of the search index has to be created with the same number of dimensions.
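
A minimal sketch of how the dimensions parameter could be passed, assuming a client object shaped like the openai_client created earlier. The function name get_embeddings_vector_with_dimensions and the value 1024 in the usage comment are illustrative choices, not part of the original notebook.

```python
def get_embeddings_vector_with_dimensions(client, text, model, dimensions):
    """Request an embedding shortened to the given number of dimensions."""
    response = client.embeddings.create(
        input=text,
        model=model,
        dimensions=dimensions,  # accepted by the text-embedding-3 models
    )
    return response.data[0].embedding

# Possible usage with the client and deployment name from the earlier cells:
# vector = get_embeddings_vector_with_dimensions(
#     openai_client, "Sample text", azure_openai_embedding_model, 1024)
# print(len(vector))
```

If shortened vectors are stored, vector_search_dimensions in the index definition would have to be set to the same value.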

## Create Index in Azure AI Search

In [7]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    ComplexField,
    CorsOptions,
    SearchIndex,
    SearchField,
    ScoringProfile,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticSearch,
    SemanticField
)

credential = AzureKeyCredential(azure_search_service_admin_key)

# SearchIndexClient works at the service level, so it is created without an index name
search_index_client = SearchIndexClient(
    endpoint=azure_search_service_endpoint,
    credential=credential
)

# Define the fields of the search index
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
    ),
    SearchableField(name="page_title", type=SearchFieldDataType.String),
    SearchableField(name="page_description", type=SearchFieldDataType.String),
    SearchableField(name="page_date", type=SearchFieldDataType.String),
    SearchableField(name="chunk_title", type=SearchFieldDataType.String),
    SearchableField(name="chunk_content", type=SearchFieldDataType.String),
    SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=3072,  # 3072 for text-embedding-3-large, 1536 for text-embedding-3-small
        vector_search_profile_name="myHnswProfile",
    ),
]

# Configure the vector search configuration  
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
        )
    ]
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="page_title"),
        # keywords_fields=[SemanticField(field_name="category")],
        content_fields=[SemanticField(field_name="chunk_content")]
    )
)

# Create the semantic settings with the configuration
semantic_search = SemanticSearch(configurations=[semantic_config])
# Create the search index with the semantic settings
search_index = SearchIndex(name=search_index_name, fields=fields,
                    vector_search=vector_search, semantic_search=semantic_search)
result = search_index_client.create_or_update_index(search_index)
print(f' {result.name} created')

 index-doc created


## Upload chunks/documents to Azure AI Search

In [8]:
import uuid
from azure.search.documents import SearchClient

search_client = SearchClient(endpoint=azure_search_service_endpoint, index_name=search_index_name, credential=credential)

# for each json file in ./data/chunks/ folder, load the json document and upload it to the search index

for filename in os.listdir(output_directory):
    if filename.endswith('.json'):
        with open(os.path.join(output_directory, filename), 'r') as file:
            document = json.load(file)

            # upload_documents expects a list of documents
            result = search_client.upload_documents(documents=[document])
            print(f"Upload of {filename} succeeded: { result[0].succeeded }")

Upload of chunk_100_Quickstart Use Azure OpenAI Assistants (Preview) via the Azure OpenAI Studio.json succeeded: True
Upload of chunk_101_Quickstart Use Azure OpenAI Assistants (Preview) via the Azure OpenAI Studio.json succeeded: True
Upload of chunk_102_Azure OpenAI Service Assistants API concepts.json succeeded: True
Upload of chunk_103_Azure OpenAI Service Assistants API concepts.json succeeded: True
Upload of chunk_104_Azure OpenAI Service Assistants API concepts.json succeeded: True
Upload of chunk_105_Azure OpenAI Service Assistants API concepts.json succeeded: True
Upload of chunk_106_Azure OpenAI Service Assistants API concepts.json succeeded: True
Upload of chunk_107_Azure OpenAI Service Assistants API concepts.json succeeded: True
Upload of chunk_108_Azure OpenAI Service Assistants API concepts.json succeeded: True
Upload of chunk_109_Azure OpenAI Service Assistants API concepts.json succeeded: True
Upload of chunk_10_Azure OpenAI Service API version retirement.json succeede
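
The loop above sends one request per chunk file. Since upload_documents accepts a list, several documents can be sent in one call. The sketch below shows a small batching helper; the name batched and the batch size in the usage comment are assumptions, and the size should stay within the per-request limit documented for the service (1,000 documents at the time of writing).

```python
from typing import Iterable, Iterator, List

def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size items."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Last, possibly smaller, batch
        yield batch

# Possible usage with the documents loaded from ./data/chunks/:
# for batch in batched(documents, 100):
#     results = search_client.upload_documents(documents=batch)
#     print(sum(r.succeeded for r in results), "of", len(batch), "uploaded")
```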

## Perform a vector similarity search

In [9]:
from azure.search.documents.models import VectorizedQuery

# Pure Vector Search
query = "How to use Azure AI ?"  

embedding = get_embeddings_vector(query)

vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="vector")
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["page_title", "page_date", "chunk_title", "chunk_content"],
)  
  
for result in results:
    print(f"-------------------------------------------")
    print(f"Page Date: {result['page_date']}")  
    print(f"Page Title: {result['page_title']}")  
    print(f"Chunk Title: {result['chunk_title']}")  
    print(f"Chunk Content: {result['chunk_content']}")
    print(f"Score: {result['@search.score']}")

-------------------------------------------
Page Date: 05/31/2024
Page Title: 'Quickstart: Use Azure OpenAI Assistants (Preview) via the Azure OpenAI Studio'
Chunk Title: Prerequisites
Chunk Content: Prerequisites- An Azure subscription - <a href="https://azure.microsoft.com/free/cognitive-services" target="_blank">Create one for free</a>.- An Azure OpenAI resource with a compatible model in a supported region.- We recommend reviewing the Responsible AI transparency note and other Responsible AI resources to familiarize yourself with the capabilities and limitations of the Azure OpenAI Service.
Score: 0.7548777
-------------------------------------------
Page Date: 05/20/2024
Page Title: 'Create and manage Azure OpenAI Service deployments with the Azure CLI'
Chunk Title: Prerequisites
Chunk Content: Prerequisites- An Azure subscription. <a href="https://azure.microsoft.com/free/ai-services" target="_blank">Create one for free</a>.- Access permissions to create Azure OpenAI resources an
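
The query above asks the index for the three vectors closest to the query embedding. As an illustration of the idea, the sketch below performs an exact nearest-neighbour search with cosine similarity in plain Python. It is not how the service works internally: HNSW is an approximate index, the use of the cosine metric is an assumption about the index configuration, and the values are raw cosine similarities that are not meant to reproduce @search.score. The two-dimensional vectors are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, documents, k=3):
    """Return the k (similarity, chunk_title) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vector, doc["vector"]), doc["chunk_title"])
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Tiny hypothetical index
documents = [
    {"chunk_title": "A", "vector": [1.0, 0.0]},
    {"chunk_title": "B", "vector": [0.0, 1.0]},
    {"chunk_title": "C", "vector": [1.0, 1.0]},
]

for score, title in top_k([1.0, 0.1], documents, k=2):
    print(title, round(score, 3))
```

An exact search like this compares the query with every stored vector, which is why approximate indexes are used once the number of chunks grows.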

## Simulate a user query

In [10]:
response = openai_client.chat.completions.create(
    model=azure_openai_chat_completions_deployment_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant for an AI learner."},
        {"role": "user", "content": "How to create AI assistant ?"}
    ],
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": azure_search_service_endpoint,
                    "index_name": search_index_name,
                    "authentication": {
                        "type": "api_key",
                        "key": azure_search_service_admin_key,
                    }
                }
            }
        ]
    }
)

print(response.to_json())

{
  "id": "c48ba892-9e67-4897-9c8f-5aafe611018b",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "To create an AI assistant using Azure OpenAI Studio, follow these steps:\n\n1. Sign in to Azure AI Studio and create a new project or select an existing one [doc1].\n\n2. Navigate to the Assistants playground under your project overview. Here, you can explore, prototype, and test AI Assistants without needing to run any code [doc1].\n\n3. In the Assistant setup pane, provide a name for your assistant, such as \"Math Helper,\" and add instructions to guide its behavior, for example, \"You are an AI assistant that helps answer math questions\" [doc2][doc4].\n\n4. Select the deployment model you want to use, such as GPT-4, and enable additional features like the code interpreter if needed [doc2][doc4].\n\n5. Save your configuration. You can then add user questions to your assistant and run the session to see how it responds [doc4].\

In [11]:

print(response.choices[0].message.content)

To create an AI assistant using Azure OpenAI Studio, follow these steps:

1. Sign in to Azure AI Studio and create a new project or select an existing one [doc1].

2. Navigate to the Assistants playground under your project overview. Here, you can explore, prototype, and test AI Assistants without needing to run any code [doc1].

3. In the Assistant setup pane, provide a name for your assistant, such as "Math Helper," and add instructions to guide its behavior, for example, "You are an AI assistant that helps answer math questions" [doc2][doc4].

4. Select the deployment model you want to use, such as GPT-4, and enable additional features like the code interpreter if needed [doc2][doc4].

5. Save your configuration. You can then add user questions to your assistant and run the session to see how it responds [doc4].

6. Optionally, you can create assistants programmatically using the Azure SDK for Python, where you specify the assistant's name, instructions, tools (like code interpreter
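
The answer marks its sources with [docN] tags that refer to the documents retrieved from the index. The following small sketch lists the distinct tags in the order in which they first appear; the function name cited_documents is hypothetical, and the sample text is shortened from the output above.

```python
import re

def cited_documents(answer: str) -> list:
    """Return the distinct document numbers cited as [docN], in order of first appearance."""
    seen = []
    for number in re.findall(r'\[doc(\d+)\]', answer):
        index = int(number)
        if index not in seen:
            seen.append(index)
    return seen

answer = (
    "Sign in to Azure AI Studio and create a new project [doc1]. "
    "Provide a name for your assistant and add instructions [doc2][doc4]."
)
print(cited_documents(answer))  # [1, 2, 4]
```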

## Response with plain LLM (no retrieval)

In [None]:
response = openai_client.chat.completions.create(
    model=azure_openai_chat_completions_deployment_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant for an AI learner."},
        {"role": "user", "content": "How to create AI assistant ?"}
    ]
)

print(response.choices[0].message.content)

Creating an AI assistant involves several key steps, from defining its purpose to deploying and maintaining it. Here's a high-level overview to help you get started:

1. Define the Purpose and Scope
- Determine what tasks your AI assistant will perform (e.g., answering questions, scheduling, providing information).
- Identify your target users and their needs.
- Decide on the platform(s) it will operate on (web, mobile, smart speakers, etc.).

2. Choose Technologies and Tools
- Programming Languages: Python is popular due to its extensive AI libraries.
- AI and NLP Frameworks: 
  - TensorFlow, PyTorch for machine learning.
  - Hugging Face Transformers for NLP models.
- Chatbot Frameworks:
  - Rasa, Dialogflow, Microsoft Bot Framework, or IBM Watson.
- Backend and Hosting:
  - Cloud services like AWS, Azure, or Google Cloud.
  - Web servers and databases.

3. Gather and Prepare Data
- Collect relevant data (question-answer pairs, user interactions).
- Clean and preprocess data for trai