# Leveraging Azure SQL DB’s Native Vector Capabilities for Enhanced Resume Matching with Azure Document Intelligence and RAG

In this tutorial, we will explore how to leverage Azure SQL DB’s new vector data type to store embeddings and perform similarity searches using built-in vector functions, enabling advanced resume matching to identify the most suitable candidates. 

By extracting and chunking content from PDF resumes using Azure Document Intelligence, generating embeddings with Azure OpenAI, and storing these embeddings in Azure SQL DB, we can perform sophisticated vector similarity searches and retrieval-augmented generation (RAG) to identify the most suitable candidates based on their resumes.

### **Tutorial Overview**

- This Python notebook will teach you to:
    1. **Chunk PDF Resumes**: Use **`Azure Document Intelligence`** to extract and chunk content from PDF resumes.
    2. **Create Embeddings**: Generate embeddings from the chunked content using the **`Azure OpenAI API`**.
    3. **Vector Database Utilization**: Store embeddings in **`Azure SQL DB`** utilizing the **`new Vector Data Type`** and perform similarity searches using built-in vector functions to find the most suitable candidates.
    4. **LLM Generation Augmentation**: Enhance language model generation with content retrieved from a vector database. In this case, we use the retrieved resume chunks to inform a GPT-4 chat model, enabling it to provide rich, context-aware answers about candidates based on their resumes.

## Dataset

We use a sample dataset from [Kaggle](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset) containing PDF resumes. This tutorial uses 120 resumes from the **Information-Technology** folder.

## Prerequisites

- **Azure Subscription**: [Create one for free](https://azure.microsoft.com/free/cognitive-services?azure-portal=true)
- **Azure SQL Database**: [Set up your database for free](https://learn.microsoft.com/azure/azure-sql/database/free-offer?view=azuresql)
- **Azure Document Intelligence**: [Create a free Azure Document Intelligence resource](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0)
- **Azure Data Studio**: Download [here](https://azure.microsoft.com/products/data-studio) to manage your Azure SQL database and [execute the notebook](https://learn.microsoft.com/azure-data-studio/notebooks/notebooks-python-kernel)

## Additional Requirements for Embedding Generation

- **Azure OpenAI Access**: Apply for access in the desired Azure subscription at [https://aka.ms/oai/access](https://aka.ms/oai/access)
- **Azure OpenAI Resource**: Deploy an embeddings model (e.g., `text-embedding-3-small` or `text-embedding-ada-002`) and a `GPT-4` model for chat completion. Refer to the [resource deployment guide](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource)
- **Python**: Version 3.7.1 or later from Python.org. (Sample has been tested with Python 3.11)
- **Python Libraries**: Install the required libraries openai, num2words, matplotlib, plotly, scipy, scikit-learn, pandas, tiktoken, and pyodbc.
- **Jupyter Notebooks**: Use within [Azure Data Studio](https://learn.microsoft.com/en-us/azure-data-studio/notebooks/notebooks-guidance) or Visual Studio Code.

Code snippets are adapted from the [Azure OpenAI Service embeddings Tutorial](https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings?tabs=python-new%2Ccommand-line&pivots=programming-language-python)

## Getting Started

1. **Database Setup**: Execute SQL commands from the `createtable.sql` script to create the necessary table in your database.
2. **Model Deployment**: Deploy an embeddings model (`text-embedding-3-small` or `text-embedding-ada-002`) and a `GPT-4` model for chat completion. Note both deployment names for later use.

![Deployed OpenAI Models](../Assets/modeldeployment.png)

3. **Connection String**: Find your Azure SQL DB connection string in the Azure portal under your database settings.
4. **Configuration**: Populate the `.env` file with your SQL server connection details, your Azure OpenAI key and endpoint, and your Azure Document Intelligence key and endpoint.

You can retrieve the Azure OpenAI _endpoint_ and _key_:

![Azure OpenAI Endpoint and Key](../Assets/endpoint.png)

You can [retrieve](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0#get-endpoint-url-and-keys) the Document Intelligence _endpoint_ and _key_:

![Azure Document Intelligence Endpoint and Key](../Assets/docintelendpoint.png)
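
Before running the rest of the notebook, it can help to confirm that every setting is present. The sketch below is not part of the original notebook: `missing_settings` is a hypothetical helper, and the variable names are the ones the later cells read with `os.getenv`. It assumes `load_dotenv()` has already been called when you use it against the real environment.

```python
import os

# Names read by the notebook cells via os.getenv (grouping here is an assumption)
REQUIRED = [
    "AZUREDOCINTELLIGENCE_ENDPOINT",
    "AZUREDOCINTELLIGENCE_API_KEY",
    "AZOPENAI_ENDPOINT",
    "AZOPENAI_API_KEY",
    "AZOPENAI_EMBEDDING_MODEL_DEPLOYMENT_NAME",
    "AZOPENAI_CHAT_MODEL_DEPLOYMENT_NAME",
]
# The connection helper accepts either of these, so one is enough
CONNECTION = ["ENTRA_CONNECTION_STRING", "SQL_CONNECTION_STRING"]

def missing_settings(env=None):
    """Return the names of settings that are absent or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if not any(env.get(name) for name in CONNECTION):
        missing.append(" or ".join(CONNECTION))
    return missing
```

Calling `missing_settings()` after `load_dotenv()` should return an empty list when the `.env` file is complete.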

## Running the Notebook

To [execute the notebook](https://learn.microsoft.com/azure-data-studio/notebooks/notebooks-python-kernel), connect to your Azure SQL database using Azure Data Studio, which can be downloaded [here](https://azure.microsoft.com/products/data-studio).

In [1]:
# Set up the Python libraries required for this notebook
# Make sure your terminal is in the directory that contains the `requirements.txt` file
%pip install -r requirements.txt

In [2]:
#Load the env details
from dotenv import load_dotenv
load_dotenv()

True

# **PART 1: Extracting and Chunking Text from PDF Resumes using Azure Document Intelligence**

Create an instance of the [DocumentAnalysisClient](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0#get-endpoint-url-and-keys) using the endpoint and API key.

[Azure Document Intelligence](https://learn.microsoft.com/azure/ai-services/document-intelligence/?view=doc-intel-4.0.0_) (previously known as Form Recognizer) is an Azure cloud service that uses machine learning to analyze text and structured data from your documents. The client sends requests to the [Azure Document Intelligence](https://learn.microsoft.com/python/api/overview/azure/ai-formrecognizer-readme?view=azure-python) service and receives responses containing the text extracted from the PDF resumes.

In [3]:
import os
import re
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Load environment variables
endpoint = os.getenv("AZUREDOCINTELLIGENCE_ENDPOINT")
api_key = os.getenv("AZUREDOCINTELLIGENCE_API_KEY")

# Create a DocumentAnalysisClient
document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(api_key)
)


### **Analyze input documents using prebuilt model in Azure Document Intelligence**

- `DocumentAnalysisClient` provides operations for analyzing input documents using prebuilt and custom models through the `begin_analyze_document` and `begin_analyze_document_from_url` APIs. In this tutorial we use the [prebuilt-layout](https://learn.microsoft.com/python/api/overview/azure/ai-formrecognizer-readme?view=azure-python#using-prebuilt-models) model.

### **Split text into chunks of 500 tokens**

- When content exceeds the embedding model's input limit, it has to be split into smaller pieces that are embedded one at a time. Here we use [tiktoken](https://github.com/openai/tiktoken?tab=readme-ov-file) to split the extracted text into chunks of 500 tokens. This keeps every chunk well under the 8,192-token input limit of the `text-embedding-3-small` model that we later use for [generating text embeddings](https://learn.microsoft.com/azure/ai-services/openai/tutorials/embeddings?tabs=python-new%2Ccommand-line&pivots=programming-language-python).
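
The chunking step amounts to slicing a list of token IDs into fixed-size windows. The sketch below illustrates this on a plain Python list so it runs without downloading a tokenizer. The `overlap` parameter is an assumption added for illustration: the notebook itself uses non-overlapping chunks, which corresponds to `overlap=0`.

```python
def chunk_token_ids(tokens, max_tokens=500, overlap=0):
    """Split a list of token IDs into windows of at most max_tokens.

    overlap is a hypothetical extension (not used by the notebook):
    each window repeats the last `overlap` tokens of the previous one.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in the range [0, max_tokens)")

    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once this window reaches the end, to avoid a redundant tail chunk
        if start + max_tokens >= len(tokens):
            break
        start += step
    return chunks

# 1,200 stand-in token IDs split into windows of 500
sizes = [len(c) for c in chunk_token_ids(list(range(1200)), max_tokens=500)]
print(sizes)
```

With a real tokenizer, each window would be passed to `tokenizer.decode` to recover the chunk text, as the notebook cell does.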

**Note**: You need to provide the location of the folder where the PDF files reside in the below script.

In [4]:
import os
import re
import pandas as pd
import tiktoken

# Path to the directory containing PDF files
# Raw string so the backslash is not treated as an escape sequence
folder_path = os.path.join(os.getcwd(), r'.\docs')

def get_pdf_files(folder_path):
    for path, subdirs, files in os.walk(folder_path):
        for name in files:
            if (name.endswith(".pdf")):
                yield os.path.join(path, name)

# Function to read PDF files and extract text using Azure AI Document Intelligence
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document("prebuilt-layout", document=f)
    result = poller.result()
    text = ""
    for page in result.pages:
        for line in page.lines:
            text += line.content + " "
    return text

# Function to clean text and remove special characters
def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    return text

# Function to split text into chunks of 500 tokens
def split_text_into_token_chunks(text, max_tokens=500):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    chunks = []
    
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk_text = tokenizer.decode(chunk_tokens)
        chunks.append(chunk_text)
    
    return chunks

# Count the number of PDF files in the directory
pdf_files = [f for f in get_pdf_files(folder_path)]
num_files = len(pdf_files)
print(f"Number of PDF files in the directory: {num_files}")

# Create a DataFrame to store the chunks
data = []

for file_id, pdf_file in enumerate(pdf_files):
    print(f"Processing file {file_id + 1}/{num_files}: {pdf_file}")
    # get_pdf_files already yields full paths, so no further join is needed
    text = extract_text_from_pdf(pdf_file)
    cleaned_text = clean_text(text)
    chunks = split_text_into_token_chunks(cleaned_text)
    
    print(f"Number of chunks for file {pdf_file}: {len(chunks)}")
    
    for chunk_id, chunk in enumerate(chunks):
        chunk_text = chunk.strip() if chunk.strip() else "NULL"
        unique_chunk_id = f"{file_id}_{chunk_id}"
        print(f"File: {pdf_file}, Chunk ID: {chunk_id}, Unique Chunk ID: {unique_chunk_id}, Chunk Length: {len(chunk_text)}, Chunk Text: {chunk_text[:50]}...")  # Print first 50 characters of chunk text
        data.append({
            "file_name": pdf_file,
            "chunk_id": chunk_id,
            "chunk_text": chunk_text,
            "unique_chunk_id": unique_chunk_id
        })

df = pd.DataFrame(data)
df.head(3)



Number of PDF files in the directory: 6
Processing file 1/6: w:\_git\_owned\azure-sql-db-vector-search\RAG-with-Documents\.\docs\DIGITAL-MEDIA\10005171.pdf
Number of chunks for file w:\_git\_owned\azure-sql-db-vector-search\RAG-with-Documents\.\docs\DIGITAL-MEDIA\10005171.pdf: 2
File: w:\_git\_owned\azure-sql-db-vector-search\RAG-with-Documents\.\docs\DIGITAL-MEDIA\10005171.pdf, Chunk ID: 0, Unique Chunk ID: 0_0, Chunk Length: 2791, Chunk Text: MEDIA ACTIVITIES SPECIALIST Summary MultiTasking M...
File: w:\_git\_owned\azure-sql-db-vector-search\RAG-with-Documents\.\docs\DIGITAL-MEDIA\10005171.pdf, Chunk ID: 1, Unique Chunk ID: 0_1, Chunk Length: 2705, Chunk Text: Worked with local production companies to create c...
Processing file 2/6: w:\_git\_owned\azure-sql-db-vector-search\RAG-with-Documents\.\docs\DIGITAL-MEDIA\10515955.pdf
Number of chunks for file w:\_git\_owned\azure-sql-db-vector-search\RAG-with-Documents\.\docs\DIGITAL-MEDIA\10515955.pdf: 2
File: w:\_git\_owned\azure-sql-db-

| | file_name | chunk_id | chunk_text | unique_chunk_id |
|---|---|---|---|---|
| 0 | w:\_git\_owned\azure-sql-db-vector-search\RAG-... | 0 | MEDIA ACTIVITIES SPECIALIST Summary MultiTaski... | 0_0 |
| 1 | w:\_git\_owned\azure-sql-db-vector-search\RAG-... | 1 | Worked with local production companies to crea... | 0_1 |
| 2 | w:\_git\_owned\azure-sql-db-vector-search\RAG-... | 0 | DIGITAL MEDIA SALES CONSULTANT Summary Dedicat... | 1_0 |


### **Tokenization vs. Character Length (OPTIONAL)**

In this section, we will explore the difference between the character length of a text chunk and its tokenized representation. Character length simply counts the number of characters in a text, while tokenization breaks the text into meaningful units called tokens.

### Character Length

First, let’s add a new column to our DataFrame to view the length of each chunk in characters. Here, `chunk_length` is the number of characters in each chunk.

In [8]:
# Add a new column 'chunk_length' to the DataFrame to view the length of each chunk
df['chunk_length'] = df['chunk_text'].apply(len)

# Display the first few rows of the DataFrame with the new column
print(df[['file_name', 'chunk_id', 'chunk_length']].head(5))


                                           file_name  chunk_id  chunk_length
0  w:\_git\_owned\azure-sql-db-vector-search\RAG-...         0          2791
1  w:\_git\_owned\azure-sql-db-vector-search\RAG-...         1          2705
2  w:\_git\_owned\azure-sql-db-vector-search\RAG-...         0          2853
3  w:\_git\_owned\azure-sql-db-vector-search\RAG-...         1          2564
4  w:\_git\_owned\azure-sql-db-vector-search\RAG-...         0          2639


### Tokenization
To understand how text ultimately is tokenized, it can be helpful to run the below code: 

- We use the tiktoken library to tokenize the text. Tokenization breaks the text into smaller units, which can be words, subwords, or characters, depending on the tokenizer used. You can see that in some cases an entire word is represented with a single token whereas in others parts of words are split across multiple tokens. 

- If you check the length of the `decode` variable, you'll find it is 500, the chunk size we specified. Chunking by token count guarantees that nothing we pass to the model for embedding exceeds its input limit of 8,192 tokens.

- When we pass the chunks to the embeddings model, it breaks them into tokens similar (though not necessarily identical) to the examples below and then converts the tokens into a series of floating-point numbers that can be queried with vector search.

In [9]:
import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")
sample_encode = tokenizer.encode(df.chunk_text[0]) 
decode = tokenizer.decode_tokens_bytes(sample_encode)
decode


[b'MEDIA',
 b' ACT',
 b'IV',
 b'ITIES',
 b' SPECIAL',
 b'IST',
 b' Summary',
 b' Multi',
 b'Task',
 b'ing',
 b' Media',
 b' Relations',
 b' Results',
 b'oriented',
 b' Strategic',
 b' Initi',
 b'atives',
 b' Event',
 b' Planning',
 b' Writer',
 b' ',
 b' Editor',
 b' Manager',
 b'Sup',
 b'ervisor',
 b' Flex',
 b'ibility',
 b' Ad',
 b'ap',
 b'table',
 b' Highlights',
 b' ',
 b' Great',
 b'ly',
 b' improved',
 b' media',
 b' coverage',
 b' of',
 b' press',
 b' conferences',
 b' and',
 b' other',
 b' events',
 b' on',
 b' campus',
 b' ',
 b' Increased',
 b' the',
 b' frequency',
 b' of',
 b' newspaper',
 b' radio',
 b' and',
 b' television',
 b' interviews',
 b' featuring',
 b' Chattanooga',
 b' State',
 b' administrators',
 b' faculty',
 b' and',
 b' staff',
 b' ',
 b' Host',
 b'ed',
 b' popular',
 b' television',
 b' show',
 b' that',
 b' focused',
 b' on',
 b' campus',
 b' and',
 b' community',
 b' events',
 b' ',
 b'199',
 b'720',
 b'04',
 b' ',
 b' Commission',
 b'ed',
 b' by',
 b' l

In [10]:
len(decode)

500
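
The two measurements above can be combined into a rough characters-per-token ratio. This is a small arithmetic sketch using only the numbers already shown for the first chunk (2,791 characters, 500 tokens) and the 8,192-token limit quoted earlier. The ratio will differ for other text.

```python
chunk_chars = 2791       # chunk_length reported for the first chunk
token_count = 500        # len(decode) for the same chunk
embedding_limit = 8192   # model input limit quoted in this tutorial

chars_per_token = chunk_chars / token_count
headroom = embedding_limit / token_count

print(f"{chars_per_token:.2f} characters per token")
print(f"Each chunk is about {headroom:.1f}x smaller than the input limit")
```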

# **PART 2: Generating Embeddings for Text Chunks using Azure OpenAI**

- After extracting and chunking the text from PDF resumes, we will generate embeddings for each chunk. These embeddings are numerical representations of the text that capture its semantic meaning. By creating embeddings for the text chunks, we can perform advanced similarity searches and enhance language model generation.

- We will use the Azure OpenAI API to generate these embeddings. The `get_embedding` function defined below takes a piece of text as input and returns its embedding using the `text-embedding-3-small` model.

- Ensure the environment variables are set correctly in the `.env` file.

In [11]:
import os
import json
import requests
import pandas as pd

# Name of your embeddings model deployment
openai_embedding_model = os.getenv("AZOPENAI_EMBEDDING_MODEL_DEPLOYMENT_NAME")

# Build the embeddings REST URL; strip any trailing slash so the path joins cleanly
openai_url = (
    os.getenv("AZOPENAI_ENDPOINT").rstrip("/")
    + "/openai/deployments/" + openai_embedding_model
    + "/embeddings?api-version=2023-05-15"
)
openai_key = os.getenv("AZOPENAI_API_KEY")

def get_embedding(text):
    """
    Get the embedding for a piece of text from the Azure OpenAI embeddings deployment.

    Args:
        text (str): Text to embed.

    Returns:
        list | None: The embedding as a list of floats, or None if the request failed.
    """
    response = requests.post(openai_url,
        headers={"api-key": openai_key, "Content-Type": "application/json"},
        json={"input": [text]}  # Embed the extracted chunk
    )

    if response.status_code == 200:
        # The embedding is already a list of floats in the JSON response
        return response.json()['data'][0]['embedding']

    # Report the failure instead of silently dropping the chunk
    print(f"Embedding request failed with status {response.status_code}")
    return None

# Generate an embedding for every chunk and collect the results
all_filenames = []
all_chunkids = []
all_chunks = []
all_embeddings = []

# Assuming df is already defined with the required columns
for index, row in df.iterrows():
    filename = row['file_name']
    chunkid = row['unique_chunk_id']
    chunk = row['chunk_text']
    embedding = get_embedding(chunk)
    
    if embedding is not None:
        all_filenames.append(filename)
        all_chunkids.append(chunkid)
        all_chunks.append(chunk)
        all_embeddings.append(embedding)
    
    if (index + 1) % 200 == 0:  # Print progress every 200 rows
        print(f"Completed {index + 1} rows")

# Create a new DataFrame with the results
result_df = pd.DataFrame({
    'filename': all_filenames,
    'chunkid': all_chunkids,
    'chunk': all_chunks,
    'embedding': all_embeddings
})

print(result_df.head(5))  # Display the first few rows of the dataframe


                                            filename chunkid  \
0  w:\_git\_owned\azure-sql-db-vector-search\RAG-...     0_0   
1  w:\_git\_owned\azure-sql-db-vector-search\RAG-...     0_1   
2  w:\_git\_owned\azure-sql-db-vector-search\RAG-...     1_0   
3  w:\_git\_owned\azure-sql-db-vector-search\RAG-...     1_1   
4  w:\_git\_owned\azure-sql-db-vector-search\RAG-...     2_0   

                                               chunk  \
0  MEDIA ACTIVITIES SPECIALIST Summary MultiTaski...   
1  Worked with local production companies to crea...   
2  DIGITAL MEDIA SALES CONSULTANT Summary Dedicat...   
3  national rates Classified Private Party Rep Ja...   
4  ENGINEERING LAB TECHNICIAN Career Focus My mai...   

                                           embedding  
0  [0.018623829, -0.041176733, 0.0022002836, 0.02...  
1  [-0.023771953, -0.042296108, 0.011301303, 0.02...  
2  [-0.0008766432, -0.0005247905, 0.030398797, 0....  
3  [0.0025675627, -0.015229849, 0.053091597, 0.02...  
4  
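
Because `get_embedding` returns `None` for any non-200 response, a chunk that hits a transient error (for example, rate limiting) is left out of `result_df`. One way to reduce that is to retry with exponential backoff. The helper below is a sketch, not part of the original notebook: `with_retries`, its parameters, and its defaults are all assumptions.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it returns a non-None value or attempts run out.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts. `sleep` is injectable so the delay can be skipped in tests.
    """
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None
```

In the loop above, the call would become `embedding = with_retries(lambda: get_embedding(chunk))`.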

# **PART 3: Using Azure SQL DB as a Vector Database to store and query embeddings**

### **Load the embeddings into the Vector Database : Azure SQL DB**

First, let us define a function to connect to Azure SQL DB.

In [12]:
# Define a function to connect to Azure SQL DB
import os
from dotenv import load_dotenv
import pyodbc
import struct
from azure.identity import DefaultAzureCredential

# Load environment variables from .env file
load_dotenv()

def get_mssql_connection():
    # Retrieve the connection string from the environment variables
    entra_connection_string = os.getenv('ENTRA_CONNECTION_STRING')
    sql_connection_string = os.getenv('SQL_CONNECTION_STRING')

    # Determine the authentication method and connect to the database
    if entra_connection_string:
        # Microsoft Entra ID authentication via DefaultAzureCredential
        credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)
        token = credential.get_token('https://database.windows.net/.default')
        token_bytes = token.token.encode('UTF-16LE')
        token_struct = struct.pack(f'<I{len(token_bytes)}s', len(token_bytes), token_bytes)
        SQL_COPT_SS_ACCESS_TOKEN = 1256  # This connection option is defined by Microsoft in msodbcsql.h
        conn = pyodbc.connect(entra_connection_string, attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct})
    elif sql_connection_string:
        # SQL Authentication
        conn = pyodbc.connect(sql_connection_string)
    else:
        raise ValueError("No valid connection string found in the environment variables.")

    return conn
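
The Entra ID branch passes the access token to the driver as a length-prefixed UTF-16LE byte string. The sketch below isolates that packing step so it can be inspected without a database or a real token. `pack_access_token` is a hypothetical name, and the token value is a placeholder.

```python
import struct

def pack_access_token(token: str) -> bytes:
    """Pack a token as a 4-byte little-endian length followed by UTF-16LE bytes."""
    token_bytes = token.encode("UTF-16LE")
    return struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)

packed = pack_access_token("abc")  # placeholder, not a real token
length = struct.unpack("<I", packed[:4])[0]
print(length, packed[4:].decode("UTF-16LE"))
```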

### **Insert embeddings into the native 'Vector' Data Type**

We will now insert our vectors into the SQL table. Azure SQL DB has a dedicated, native data type for storing vectors: the `vector` data type. Read about the preview [here](https://devblogs.microsoft.com/azure-sql/eap-for-vector-support-refresh-introducing-vector-type).

The `resumedocs` table has an `embedding` column of type `vector(1536)`. Ensure you have created the table using the script `CreateTable.sql` before running the code below.
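
The insert cell sends each embedding as a JSON array string and casts it to `VECTOR(1536)` on the server. A cast of an array with the wrong number of elements would most likely fail at the database, so it can be useful to check the shape on the client first. The helper below is a sketch: `to_vector_json` is a hypothetical name, not part of the notebook or of any library.

```python
import json
import math

def to_vector_json(embedding, dims=1536):
    """Validate an embedding and serialize it as a JSON array string (hypothetical helper)."""
    if len(embedding) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(embedding)}")
    if not all(math.isfinite(x) for x in embedding):
        raise ValueError("embedding contains NaN or infinity")
    return json.dumps([float(x) for x in embedding])

# Tiny 3-dimensional example for illustration
print(to_vector_json([0.5, -1, 2], dims=3))
```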

In [14]:
import json
import pyodbc
import pandas as pd

# Open a connection using get_mssql_connection()
conn = get_mssql_connection()

# Create a cursor object
cursor = conn.cursor()

# Parameterized insert; the embedding is sent as a JSON array string and cast to VECTOR
query = """
INSERT INTO resumedocs (chunkid, filename, chunk, embedding)
VALUES (?, ?, ?, CAST(? AS VECTOR(1536)))
"""

# Loop through the DataFrame rows and insert them into the table
for index, row in result_df.iterrows():
    cursor.execute(query, row['chunkid'], row['filename'], row['chunk'], json.dumps(row['embedding']))

# Commit the changes
conn.commit()

# Print a success message
print("Data inserted successfully into the 'resumedocs' table.")

# Close the connection
conn.close()


Data inserted successfully into the 'resumedocs' table.


Let's take a look at the data in the `resumedocs` table:

In [16]:
from prettytable import PrettyTable

# Open a connection using get_mssql_connection()
conn = get_mssql_connection()

# Create a cursor object
cursor = conn.cursor()

# Cast the vector column to text so it can be displayed
query = "SELECT TOP(10) filename, chunkid, chunk, CAST(embedding AS NVARCHAR(MAX)) as embedding FROM dbo.resumedocs ORDER BY Id"

# Execute the query
cursor.execute(query)
queryresults = cursor.fetchall()

# Get column names from cursor.description
column_names = [column[0] for column in cursor.description]

# Create a PrettyTable object
table = PrettyTable()

# Add column names to the table
table.field_names = column_names

# Set max width for each column to truncate data
table.max_width = 20

# Add rows to the table
for row in queryresults:
    # Truncate each value to 20 characters
    truncated_row = [str(value)[:20] for value in row]
    table.add_row(truncated_row)

# Print the table
print(table)

# Close the connection (no commit is needed for a read-only query)
conn.close()


+----------------------+---------+----------------------+----------------------+
|       filename       | chunkid |        chunk         |      embedding       |
+----------------------+---------+----------------------+----------------------+
| w:\_git\_owned\azure |   0_0   | MEDIA ACTIVITIES SPE | [1.8623829e-002,-4.1 |
| w:\_git\_owned\azure |   0_1   | Worked with local pr | [-2.3771953e-002,-4. |
| w:\_git\_owned\azure |   1_0   | DIGITAL MEDIA SALES  | [-8.7664317e-004,-5. |
| w:\_git\_owned\azure |   1_1   | national rates Class | [2.5675627e-003,-1.5 |
| w:\_git\_owned\azure |   2_0   | ENGINEERING LAB TECH | [-1.1730898e-002,1.8 |
| w:\_git\_owned\azure |   3_0   | EQUIPMENT ENGINEERIN | [1.4169044e-002,1.47 |
| w:\_git\_owned\azure |   3_1   | Operates repair and  | [-2.8785424e-002,1.1 |
| w:\_git\_owned\azure |   3_2   | City State  Expedite | [3.8580999e-003,2.69 |
| w:\_git\_owned\azure |   4_0   | INFORMATION TECHNOLO | [-2.6888244e-002,1.9 |
| w:\_git\_owned\azure |   4

### **Performing Vector Similarity Search in Azure SQL DB using the built-in VECTOR\_DISTANCE function**

Let's now query our `resumedocs` table to get the most similar candidates for a user's search query.

What we are doing: Given any user search query, we obtain the vector representation of that text. We then use this vector to calculate the cosine distance against all the resume embeddings stored in the database. By selecting only the closest matches, we can identify the resumes most relevant to the user’s query.

The most common metric is cosine distance (1 minus the cosine similarity), which can be calculated easily in SQL with the help of the new distance function.

```
VECTOR_DISTANCE('distance metric', V1, V2)

```

We can use **cosine**, **euclidean**, and **dot** as the distance metric today.
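
To make the relationship between distance and similarity concrete, here is a plain-Python sketch of the cosine and Euclidean metrics. It illustrates the math only and is not how Azure SQL DB implements `VECTOR_DISTANCE`. The function names are chosen for this example.

```python
import math

def cosine_distance(v1, v2):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm1 * norm2)

def euclidean_distance(v1, v2):
    """Return the straight-line distance between two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Vectors pointing the same way have distance near 0; orthogonal vectors have distance 1
print(cosine_distance([1, 2], [2, 4]), cosine_distance([1, 0], [0, 1]))
```

A smaller cosine distance means a closer match, which is why the search query orders by distance ascending and reports `1 - distance` as the similarity score.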

We will define the function `vector_search_sql`.

In [19]:
import os
import pyodbc
import json
from dotenv import load_dotenv

def vector_search_sql(query, num_results=5):
    # Load environment variables from .env file
    load_dotenv()

    # Use the get_mssql_connection function to get the connection string details
    conn = get_mssql_connection()

    # Create a cursor object
    cursor = conn.cursor()

    # Generate the query embedding for the user's search query
    user_query_embedding = get_embedding(query)
    if user_query_embedding is None:
        conn.close()
        raise RuntimeError("Could not generate an embedding for the search query.")

    # Similarity search: vector_distance returns the cosine distance, so similarity = 1 - distance
    sql_similarity_search = """
    SELECT TOP(?) filename, chunkid, chunk,
           1-vector_distance('cosine', CAST(? AS VECTOR(1536)), embedding) AS similarity_score,
           vector_distance('cosine', CAST(? AS VECTOR(1536)), embedding) AS distance_score
    FROM dbo.resumedocs
    ORDER BY distance_score
    """

    cursor.execute(sql_similarity_search, num_results, json.dumps(user_query_embedding), json.dumps(user_query_embedding))
    results = cursor.fetchall()

    # Close the database connection
    conn.close()

    return results

vector_search_sql("system administrator", num_results=5)

[('w:\\_git\\_owned\\azure-sql-db-vector-search\\RAG-with-Documents\\.\\docs\\INFORMATION-TECHNOLOGY\\10089434.pdf', '4_1', 'Disaster Recovery plan and procedures  Researching evaluating and recommending new hardware and new software  Communicating and defining systems design and requirements for new and existing systems and applications  Researching evaluating recommending testing and implementing third party softwareutilities  Planning and designing network infrastructure changes  addingremoving servers appliances network logical flow  Reviewing evaluating and analyzing existing system and application viability with management and staff  Administering and maintaining shares on the file servers  Reviewing server logs to troubleshoot issues  Scheduling and applying hot fixes and security patches on the server infrastructure which includes the operating system and application software  Reviewing systems reporting in SCCM System Center Configuration Manager  Resolving service requests es
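
The search returns chunks, so one resume can appear several times in the results. If you want one row per candidate, a simple option is to keep each file's best-scoring chunk. The sketch below assumes rows in the column order of the query above (`filename, chunkid, chunk, similarity_score, distance_score`). `best_chunk_per_resume` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def best_chunk_per_resume(rows):
    """Keep the highest-similarity chunk for each file, sorted best first."""
    best_by_file = {}
    for filename, chunkid, chunk, similarity, _distance in rows:
        current = best_by_file.get(filename)
        if current is None or similarity > current[2]:
            best_by_file[filename] = (chunkid, chunk, similarity)

    ranked = [(name, *best) for name, best in best_by_file.items()]
    ranked.sort(key=lambda item: item[3], reverse=True)
    return ranked

# Made-up rows in the same shape as the query results
sample_rows = [
    ("a.pdf", "0_0", "first chunk", 0.30, 0.70),
    ("a.pdf", "0_1", "second chunk", 0.45, 0.55),
    ("b.pdf", "1_0", "other resume", 0.40, 0.60),
]
for row in best_chunk_per_resume(sample_rows):
    print(row)
```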

# **PART 4: Use embeddings retrieved from an Azure SQL vector database to augment LLM generation**

Let's create a helper function to feed prompts into the [chat completions model](https://learn.microsoft.com/azure/ai-services/openai/concepts/models#gpt-4) and create an interactive loop where you can pose questions to the model and receive answers grounded in your data.

The function `generate_completion` grounds the gpt-4o model with prompts and system instructions.  
Note that we pass the results of the `vector_search_sql` function we defined earlier to the model, together with the system prompt.  
We are using the gpt-4o model here.

You can get more information on using Azure OpenAI GPT chat models [here](https://learn.microsoft.com/azure/ai-services/openai/chatgpt-quickstart?tabs=command-line%2Cpython-new&pivots=programming-language-python).

In [23]:
import os
from dotenv import load_dotenv
from openai import AzureOpenAI

# Load environment variables from a .env file
load_dotenv()

# Use environment variables for the API key and endpoint
api_key = os.getenv("AZOPENAI_API_KEY")
azure_endpoint = os.getenv("AZOPENAI_ENDPOINT")
chat_model = os.getenv("AZOPENAI_CHAT_MODEL_DEPLOYMENT_NAME")

# Create a chat completion request
client = AzureOpenAI(
    api_key=api_key,
    api_version="2023-05-15",
    azure_endpoint=azure_endpoint
)

def generate_completion(search_results, user_input):
    system_prompt = '''
You are an intelligent & funny assistant who will exclusively answer based on the data provided in the `search_results`:
- Use the information from `search_results` to generate your top 3 responses. If the data is not a perfect match for the user's query, use your best judgment to provide helpful suggestions and include the following format:
  File: {filename}
  Chunk ID: {chunkid}
  Similarity Score: {similarity_score}
  Add a small snippet from the Relevant Text: {chunktext}
  Do not use the entire chunk
- Avoid any other external data sources.
- Add a summary about why the candidate may be a good fit even if the exact skills and the role being hired for do not match, at the end of the recommendations. Ensure you call out which skills match the description and which ones are missing. If the candidate doesn't have prior experience for the hiring role, note that we may need to pay extra attention to this during the interview process.
- Add a Microsoft related interesting fact about the technology that was searched 
'''

    messages = [{"role": "system", "content": system_prompt}]
    
    # Create an empty list to store the results
    result_list = []

    # Iterate through the search results and append relevant information to the list
    for result in search_results:
        filename = result[0]  # filename is the first column
        chunkid = result[1]  # chunkid is the second column
        chunktext = result[2]  # chunk text is the third column
        similarity_score = result[3]  # similarity_score is the fourth column
        
        # Append the relevant information as a dictionary to the result_list
        result_list.append({
            "filename": filename,
            "chunkid": chunkid,
            "chunktext": chunktext,
            "similarity_score": similarity_score
        })

    # Print the result list
    #print(result_list)
    
    messages.append({"role": "system", "content": f"{result_list}"})
    messages.append({"role": "user", "content": user_input})
    # chat_model holds the chat deployment name read from the environment
    response = client.chat.completions.create(model=chat_model, messages=messages, temperature=0)

    return response.model_dump()
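
Each row returned by `vector_search_sql` follows the column order of its `SELECT`: `filename`, `chunkid`, `chunk`, `similarity_score`, `distance_score`. The sketch below shows one way to turn those rows into the context list passed to the model. Trimming the chunk text to a snippet is an assumption added here to keep the prompt small. The notebook sends the full chunk, and `build_context` is a hypothetical helper.

```python
def build_context(search_results, snippet_chars=300):
    """Map result rows to dictionaries for the prompt (hypothetical helper)."""
    context = []
    for row in search_results:
        context.append({
            "filename": row[0],
            "chunkid": row[1],
            "chunktext": row[2][:snippet_chars],  # trimmed snippet, an assumption
            "similarity_score": round(row[3], 4),
        })
    return context

# Made-up row in the same shape as the query results
example = build_context([("a.pdf", "0_0", "x" * 500, 0.37593, 0.62407)])
print(example[0]["similarity_score"], len(example[0]["chunktext"]))
```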


In [25]:
# Create a loop of user input and model output to perform Q&A on the PDFs that are now chunked and stored in the SQL DB with embeddings
#
# PLEASE NOTE: An input box will be displayed at the top of the screen for the user to enter a question/query.
# The model will then provide a response based on the data stored in the SQL DB.
# Type 'end' to end the session.
#
print("*** What Role are you hiring for? And What skills are you looking for? Ask me & I can help you find a candidate :) Type 'end' to end the session.\n")

while True:
    user_input = input("User prompt: ")
    if user_input.lower() == "end":
        break

    # Print the user's question
    print(f"\nUser asked: {user_input}")

    # Assuming vector_search_sql and generate_completion are defined functions that work correctly
    search_results = vector_search_sql(user_input)
    completions_results = generate_completion(search_results, user_input)

    # Print the model's response
    print("\nAI's response:")
    print(completions_results['choices'][0]['message']['content'])

# The loop will continue until the user types 'end'


*** What Role are you hiring for? And What skills are you looking for? Ask me & I can help you find a candidate :) Type 'end' to end the session.


User asked: software developer

AI's response:
File: w:\\_git\\_owned\\azure-sql-db-vector-search\\RAG-with-Documents\\.\\docs\\INFORMATION-TECHNOLOGY\\10089434.pdf
Chunk ID: 4_0
Similarity Score: 0.3759350958545771
Add a small snippet from the Relevant Text: Versatile Systems Administrator possessing superior troubleshooting skills for networking issues, end user problems, and network security... Received training in MVC 4 for Visual Studio using .Net Framework 4/4.5 to develop application using HTML5 and CSS3...

File: w:\\_git\\_owned\\azure-sql-db-vector-search\\RAG-with-Documents\\.\\docs\\INFORMATION-TECHNOLOGY\\10089434.pdf
Chunk ID: 4_2
Similarity Score: 0.33812141061846746
Add a small snippet from the Relevant Text: Installing, configuring, and supporting McAfee antivirus software on desktops... Developing and maintaining websites 