# How to Convert Text to Vectors Using the Embedding API and Search for Similar Content

This document demonstrates how to convert text into vectors using the Embedding API and how to perform semantic searches for similar content.  
In the following example, we vectorize 53 sample documents from Simple English Wikipedia and search them for similar documents.  
Reference: [Azure OpenAI Embeddings Tutorial](https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings)


### Note
> - Due to a version conflict issue between pandas==2.0.3 and numpy==2.0.0, the pandas version has been updated to 2.1.2 as of July 2024.
> - Update December 2024: Switched to the `text-embedding-3-large` embedding model.

In [1]:
import os
import re
import pandas as pd
import numpy as np
import tiktoken
from openai import AzureOpenAI
from dotenv import load_dotenv
load_dotenv()

client = AzureOpenAI(
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key        = os.getenv("AZURE_OPENAI_API_KEY"),
    api_version    = os.getenv("OPENAI_API_VERSION")
)

deployment_name = os.getenv("DEPLOYMENT_NAME")
deployment_embedding_name = os.getenv("DEPLOYMENT_EMBEDDING_NAME")
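
The setup cell reads five settings from the environment, and `os.getenv` silently returns `None` for any that are missing. The sketch below is one way to check them before calling the API. `REQUIRED_VARS` and `missing_settings` are hypothetical names introduced here for illustration; only the variable names themselves come from the cell above.

```python
import os

# Environment variable names read by the setup cell above.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
    "DEPLOYMENT_NAME",
    "DEPLOYMENT_EMBEDDING_NAME",
]

def missing_settings(env=None):
    """Return the required names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```

Running this right after `load_dotenv()` should surface a typo in the `.env` file before the first request fails with a less obvious error.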

Read the file (`./data/wiki_data.csv`) for vectorization and inspect it using pandas.

In [2]:
df_wiki_data=pd.read_csv(os.path.join(os.getcwd(),'data/wiki_data.csv'))
df_wiki_data

Unnamed: 0,id,url,title,text
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ..."
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i..."
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...


In [3]:
pd.options.mode.chained_assignment = None #https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters

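# s is the input text; sep_token is unused and kept only so the signature stays the same
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove any character followed by " ," (the "." is unescaped, so it matches any character)
    s = re.sub(r". ,","",s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    # newlines were already removed by the whitespace collapse above, so this is a no-op
    s = s.replace("\n", "")
    s = s.strip()

    return s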

df_wiki_data['text']= df_wiki_data["text"].apply(lambda x : normalize_text(x))
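
To see what the normalization does, it can help to run it on a short string. The block below repeats `normalize_text` from the cell above so that it runs on its own; the sample sentence is made up for illustration and is not taken from the data file.

```python
import re

def normalize_text(s, sep_token=" \n "):
    # Same steps as the cell above, repeated so this block is self-contained.
    s = re.sub(r'\s+', ' ', s).strip()   # collapse whitespace runs, including newlines
    s = re.sub(r". ,", "", s)            # remove any character followed by " ,"
    s = s.replace("..", ".")             # collapse doubled periods
    s = s.replace(". .", ".")
    s = s.replace("\n", "")
    return s.strip()

sample = "April  is the\nfourth month.. It has 30 days. ."
print(normalize_text(sample))
# prints: April is the fourth month. It has 30 days.
```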

### Check that no document exceeds the 8,192-token input limit of the Embedding API provided by Azure OpenAI.

In [4]:
tokenizer = tiktoken.get_encoding("cl100k_base")
df_wiki_data['n_tokens'] = df_wiki_data["text"].apply(lambda x: len(tokenizer.encode(x)))
df_wiki_data = df_wiki_data[df_wiki_data.n_tokens<8192]
len(df_wiki_data)
df_wiki_data

Unnamed: 0,id,url,title,text,n_tokens
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...,607
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...,460
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ...",1138
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i...",987
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...,94
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...,131
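
The filter above drops any document that is too long. If dropping is not acceptable, an alternative is to split the token list into windows and embed each window separately. The following is a minimal sketch of that idea, not part of the original notebook; `chunk_tokens` and its `overlap` parameter are hypothetical.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most max_tokens (hypothetical helper).

    Consecutive windows share `overlap` tokens so that text cut at a boundary
    still appears with some context in the next window.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the list
        start += step
    return chunks

# With the notebook's tokenizer this could be used as (assumed usage):
#   ids = tokenizer.encode(text)
#   parts = [tokenizer.decode(c) for c in chunk_tokens(ids, max_tokens=8191)]
print(chunk_tokens(list(range(10)), max_tokens=4))
```

How to combine the per-chunk vectors afterwards (for example, keeping one row per chunk) is a design choice this sketch leaves open.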


### Examine how the first document's text is split into tokens.

In [5]:
sample_encode = tokenizer.encode(df_wiki_data.text[0]) 
decode = tokenizer.decode_tokens_bytes(sample_encode)
decode

[b'April',
 b' is',
 b' the',
 b' fourth',
 b' month',
 b' of',
 b' the',
 b' year',
 b' in',
 b' the',
 b' Julian',
 b' and',
 b' Greg',
 b'orian',
 b' calendars',
 b',',
 b' and',
 b' comes',
 b' between',
 b' March',
 b' and',
 b' May',
 b'.',
 b' It',
 b' is',
 b' one',
 b' of',
 b' four',
 b' months',
 b' to',
 b' have',
 b' ',
 b'30',
 b' days',
 b'.',
 b' April',
 b' always',
 b' begins',
 b' on',
 b' the',
 b' same',
 b' day',
 b' of',
 b' week',
 b' as',
 b' July',
 b',',
 b' and',
 b' additionally',
 b',',
 b' January',
 b' in',
 b' leap',
 b' years',
 b'.',
 b' April',
 b' always',
 b' ends',
 b' on',
 b' the',
 b' same',
 b' day',
 b' of',
 b' the',
 b' week',
 b' as',
 b' December',
 b'.',
 b' April',
 b"'s",
 b' flowers',
 b' are',
 b' the',
 b' Sweet',
 b' Pe',
 b'a',
 b' and',
 b' Daisy',
 b'.',
 b' Its',
 b' birth',
 b'stone',
 b' is',
 b' the',
 b' diamond',
 b'.',
 b' The',
 b' meaning',
 b' of',
 b' the',
 b' diamond',
 b' is',
 b' innocence',
 b'.',
 b' The',
 b' M

In [6]:
len(decode)

3902

Generate vector data for the text using the Embedding API and add it as a new column named `content_vector`.

In [7]:
def generate_embeddings(text, model=deployment_embedding_name):
    return client.embeddings.create(input = [text], model=model).data[0].embedding

# model should be set to the deployment name you chose when you deployed the embedding model (text-embedding-3-large as of the December 2024 update)
df_wiki_data['content_vector'] = df_wiki_data["text"].apply(lambda x : generate_embeddings(x, model = deployment_embedding_name))
df_wiki_data

Unnamed: 0,id,url,title,text,n_tokens,content_vector
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[0.004067655652761459, -0.002844500122591853, ..."
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.005639874842017889, -0.010014292784035206, ..."
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149,"[0.0059301890432834625, 0.003919691313058138, ..."
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401,"[-0.008293329738080502, -0.01762649603188038, ..."
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...,607,"[-0.01116474624723196, -0.04582960158586502, 0..."
5,12,https://simple.wikipedia.org/wiki/Autonomous%2...,Autonomous communities of Spain,Spain is divided in 17 parts called autonomous...,460,"[0.028125453740358353, 0.020387642085552216, -..."
6,13,https://simple.wikipedia.org/wiki/Alan%20Turing,Alan Turing,"Alan Mathison Turing OBE FRS (London, 23 June ...",1138,"[0.002844375092536211, 0.025143058970570564, -..."
7,14,https://simple.wikipedia.org/wiki/Alanis%20Mor...,Alanis Morissette,"Alanis Nadine Morissette (born June 1, 1974) i...",987,"[-0.008278449065983295, 0.0023970487527549267,..."
8,17,https://simple.wikipedia.org/wiki/Adobe%20Illu...,Adobe Illustrator,Adobe Illustrator is a computer program for ma...,94,"[-0.0221348125487566, -0.0012831256026402116, ..."
9,18,https://simple.wikipedia.org/wiki/Andouille,Andouille,Andouille is a type of pork sausage. It is spi...,131,"[0.005165310576558113, -0.008634304627776146, ..."
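
The cell above sends one request per document. Since `client.embeddings.create` accepts a list in `input`, several texts can be sent per request. The sketch below shows the batching logic only; `embed_in_batches`, the `embed_batch` callable, and the default batch size of 16 are assumptions made for illustration, and the allowed batch size depends on your deployment's limits.

```python
def embed_in_batches(texts, embed_batch, batch_size=16):
    """Embed texts in groups of batch_size (hypothetical helper).

    embed_batch is any callable that maps a list of strings to a list of
    vectors in the same order.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        vectors.extend(result)
    return vectors

# With the notebook's client, embed_batch could look like this (assumed usage):
#   def azure_embed_batch(batch):
#       response = client.embeddings.create(input=batch, model=deployment_embedding_name)
#       ordered = sorted(response.data, key=lambda item: item.index)
#       return [item.embedding for item in ordered]
#   df_wiki_data["content_vector"] = embed_in_batches(
#       df_wiki_data["text"].tolist(), azure_embed_batch)

# Stand-in embedder so the sketch runs without network access.
fake_embed = lambda batch: [[float(len(text))] for text in batch]
print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```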


In [8]:
# Save the data to a CSV file (data/wiki_data_embeddings_3_large.csv)
df_wiki_data.to_csv(os.path.join(os.getcwd(),'data/wiki_data_embeddings_3_large.csv'), index=False)
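
One thing to keep in mind when reading this file back: `to_csv` writes each `content_vector` list as text, so `pd.read_csv` returns strings such as `"[0.0040, -0.0028, ...]"` rather than lists. A possible way to restore them is sketched below; `parse_vector` and `load_embeddings` are hypothetical helpers, and the sketch assumes every vector contains only finite numbers.

```python
import json

def parse_vector(cell):
    """Turn the text '[0.1, -0.2]' written by to_csv back into a list of floats."""
    return [float(x) for x in json.loads(cell)]

def load_embeddings(path):
    """Read the saved CSV and restore content_vector as lists (requires pandas)."""
    import pandas as pd
    return pd.read_csv(path, converters={"content_vector": parse_vector})

print(parse_vector("[0.5, -0.25, 1e-05]"))
```

Without a step like this, the cosine similarity computation in the next cell would receive strings and fail.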

Embed each query, score it against every document vector with cosine similarity, and inspect the most similar documents.

In [9]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embedding(text, model=deployment_embedding_name): # model = "deployment_name"
    return client.embeddings.create(input = [text], model=model).data[0].embedding

def search_docs(df, user_query, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        model=deployment_embedding_name # model should be set to the deployment name you chose when you deployed the embedding model
    )
    df["similarities"] = df.content_vector.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res


res = search_docs(df_wiki_data, "Tell me about April.", top_n=4)
res = search_docs(df_wiki_data, "Classify the types of art.", top_n=4)
res = search_docs(df_wiki_data, "Draw a table comparing April and August.", top_n=4)

Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[0.004067655652761459, -0.002844500122591853, ...",0.456704
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.005639874842017889, -0.010014292784035206, ...",0.246201
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401,"[-0.008293329738080502, -0.01762649603188038, ...",0.163401
16,32,https://simple.wikipedia.org/wiki/Abbreviation,Abbreviation,An abbreviation is a shorter way to write a wo...,365,"[-0.0016603072872385383, 0.00935384351760149, ...",0.151688


Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,1149,"[0.0059301890432834625, 0.003919691313058138, ...",0.5366
25,49,https://simple.wikipedia.org/wiki/Architecture,Architecture,Architecture is designing the structures of bu...,1017,"[-0.00010834706336027011, 0.003430503187701106...",0.298108
33,57,https://simple.wikipedia.org/wiki/Archaeology,Archaeology,"Archaeology, or archeology, is the study of th...",872,"[0.0030919311102479696, -0.013002167455852032,...",0.222888
17,33,https://simple.wikipedia.org/wiki/Angel,Angel,"In many mythologies and religions, an angel is...",1455,"[-0.01719103381037712, -0.04939764365553856, -...",0.201375


Unnamed: 0,id,url,title,text,n_tokens,content_vector,similarities
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,2179,"[0.005639874842017889, -0.010014292784035206, ...",0.390042
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,3902,"[0.004067655652761459, -0.002844500122591853, ...",0.362131
12,22,https://simple.wikipedia.org/wiki/Addition,Addition,"In mathematics, addition, represented by the s...",801,"[0.01080403570085764, 0.0018525621853768826, -...",0.223811
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,401,"[-0.008293329738080502, -0.01762649603188038, ...",0.173751
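
`search_docs` computes the similarity one row at a time with `apply`. For larger collections, the same scores can be computed in one matrix operation. The sketch below is an alternative formulation, not the notebook's method; `top_k_by_cosine` is a hypothetical name, and it assumes no vector is all zeros.

```python
import numpy as np

def top_k_by_cosine(doc_vectors, query_vector, k=3):
    """Return (indices, scores) of the k rows most similar to query_vector."""
    docs = np.asarray(doc_vectors, dtype=float)    # shape (n_docs, dim)
    query = np.asarray(query_vector, dtype=float)  # shape (dim,)
    # Cosine similarity of every row with the query in one step.
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)[:k]                # highest score first
    return order, scores[order]

# Toy 2-dimensional vectors; real embeddings have many more dimensions.
indices, scores = top_k_by_cosine([[1, 0], [0, 1], [1, 1]], [1, 0], k=2)
print(indices, scores)

# With the notebook's DataFrame this could be used as (assumed usage):
#   matrix = np.vstack(df_wiki_data["content_vector"])
#   idx, sims = top_k_by_cosine(matrix, get_embedding("Tell me about April."), k=4)
#   df_wiki_data.iloc[idx]
```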


In [10]:
# A function for generating RAG-based answers to user queries.
def generate_rag_answer(user_query, top_n=3):
    content_msg = ""
    res = search_docs(df_wiki_data, user_query, top_n=top_n, to_print=False)
    for index, result in res.iterrows():
        # print(result)
        content_msg = content_msg + result.title + ":\n  " + result.text + "  \n"
    system_msg = """You should generate an answer based on the "### Grounding data" message provided below, rather than using any knowledge you have about the user's question. If there is no "### Grounding data" message, you must answer "I could not find a context for the answer."  \n\n### Grounding data  \n""" + content_msg
    print (system_msg + "\nQuestion: " + user_query)

    response = client.chat.completions.create(
        model=deployment_name,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_query},
        ],
        temperature=0.1,
        max_tokens=2000
    )
    return response.choices[0].message.content

# Generate RAG-based answers to user queries:
user_query = """Compare April and August, summarize the differences for each category, and draw a table."""
response = generate_rag_answer(user_query)
print("Answer: " + response)

You should generate an answer based on the "### Grounding data" message provided below, rather than using any knowledge you have about the user's question. If there is no "### Grounding data" message, you must answer "I could not find a context for the answer."  

### Grounding data  
August:
  August (Aug.) is the eighth month of the year in the Gregorian calendar, coming between July and September. It has 31 days. It is named after the Roman emperor Augustus Caesar. August does not begin on the same day of the week as any other month in common years, but begins on the same day of the week as February in leap years. August always ends on the same day of the week as November. The Month This month was first called Sextilis in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC with Romulus. October was the eighth month. August was the eighth month when January or February were added to the start of the year by King Numa Pompil