# Embeddings

**Query Models**
- Searching for similar documents

**Document Models**
- Embedding documents

**Similarity Models**
- Clustering, regression, anomaly detection, visualization

In [1]:
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity
import pandas as pd
import re
import tiktoken

### Set up OpenAI API

In [2]:
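#HIDDEN KEYS
import os
# Read the key from the environment instead of hard-coding it in the notebook.
# The variable name OPENAI_KEY is an assumption; use whatever name your setup defines.
OPENAI_KEY = os.environ["OPENAI_KEY"]
OPENAI_RESOURCE_ENDPOINT = "https://aoi-linkedin-openai-01.openai.azure.com/"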

In [3]:
openai.api_type = "azure"
openai.api_key = OPENAI_KEY
openai.api_base = OPENAI_RESOURCE_ENDPOINT
openai.api_version = "2022-12-01"

## Getting some data

In [6]:
df = pd.read_csv("data/data-embedding.csv")
display(df)

Unnamed: 0,title,description,label_int,label
0,World Briefings,BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime M...,1,World
1,Nvidia Puts a Firewall on a Motherboard (PC Wo...,PC World - Upcoming chip set will include buil...,4,Sci/Tech
2,"Olympic joy in Greek, Chinese press",Newspapers in Greece reflect a mixture of exhi...,2,Sports
3,U2 Can iPod with Pictures,"SAN JOSE, Calif. -- Apple Computer (Quote, Cha...",4,Sci/Tech
4,The Dream Factory,"Any product, any shape, any size -- manufactur...",4,Sci/Tech
...,...,...,...,...
1995,You Control: iTunes puts control in OS X menu ...,MacCentral - You Software Inc. announced on Tu...,4,Sci/Tech
1996,Argentina beat Italy for place in football final,Favourites Argentina beat Italy 3-0 this morni...,2,Sports
1997,NCAA case no worry for Spurrier,Shortly after Steve Spurrier arrived at Florid...,2,Sports
1998,Secret Service Busts Cyber Gangs,The US Secret Service Thursday announced arres...,4,Sci/Tech


### Cleaning up the data

In [7]:
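def normalize_text(s, sep_token = " \n "):
    # sep_token is unused; it is kept so existing calls keep working
    s = re.sub(r'\s+',  ' ', s).strip()  # collapse all whitespace, including newlines
    s = re.sub(r"\. ,","",s)             # drop stray ". ," sequences (period escaped to match literally)
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.strip()
    
    return s

df['description'] = df["description"].apply(normalize_text)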
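To see what the cleanup does, here is a small standalone sketch. It restates the function so the snippet runs on its own, with the period in the second pattern escaped on the assumption that a literal `. ,` was intended. The sample string is made up for illustration.

```python
import re

def normalize_text(s):
    # Collapse runs of whitespace (including newlines) into single spaces.
    s = re.sub(r"\s+", " ", s).strip()
    # Remove stray ". ," sequences (assumed intent of the original pattern).
    s = re.sub(r"\. ,", "", s)
    # Reduce doubled periods to a single one.
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    return s.strip()

# Hypothetical input with extra whitespace, a newline and a doubled period.
sample = "PC World  -\n Upcoming chip set.. "
cleaned = normalize_text(sample)
print(repr(cleaned))
```

The whitespace step runs first, so later replacements only ever see single spaces.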

### Embedding

Notice that although the first text is very short, the returned vector has exactly the same size as the one returned for the longer description.

In [8]:
text_to_embed = "This is a short sentence"
embeddings = get_embedding(text_to_embed, engine = "embedding-ada")
print(text_to_embed)
print('Array Size (Short Text) - Ada: ', len(embeddings))
print(embeddings)

print("=====================================")
embeddings = get_embedding(df["description"][1], engine = "embedding-ada")
print(df["description"][1])
print('Array Size (Long Text) - Ada: ', len(embeddings))
print(embeddings)

This is a short sentence
Array Size (Short Text) - Ada:  1024
[0.011673801578581333, 0.035735711455345154, 0.01157175749540329, 0.01918421871960163, -0.007928797043859959, 0.05102185904979706, 0.01912299357354641, -0.008270643651485443, 0.05596077814698219, -0.0401235893368721, -0.058899637311697006, 0.008673716336488724, 0.034919362515211105, 0.089716836810112, -0.03308257460594177, -0.05363417789340019, 0.0026072170585393906, -0.015306558459997177, 0.003183764172717929, 0.002311290241777897, 0.0045588030479848385, -0.027245674282312393, 0.04371552914381027, -0.034143827855587006, 0.01941891945898533, 0.017255593091249466, 0.0035511215683072805, 0.0318988673388958, -0.02491907589137554, -0.02430681511759758, -0.06261402368545532, 0.0011601095320656896, 0.017541315406560898, 0.05730775371193886, 0.013949376530945301, -0.004484821576625109, 0.0025472664274275303, -0.030796794220805168, 0.01251056045293808, -0.02287820167839527, -0.03985827788710594, -0.024592537432909012, -0.00076915451

The vector size does depend on the model you use, however, and a larger vector can carry more information about the text.
In this case the Davinci model produces much larger embeddings than Ada.

In [9]:
embeddings = get_embedding(df["description"][1], engine = "embedding-babbage")
print('Array Size (Long Text) - Babbage: ', len(embeddings))

embeddings = get_embedding(df["description"][1], engine = "embedding-curie")
print('Array Size (Long Text) - Curie: ', len(embeddings))

embeddings = get_embedding(df["description"][1], engine = "embedding-davinci")
print('Array Size (Long Text) - Davinci: ', len(embeddings))

Array Size (Long Text) - Babbage:  2048
Array Size (Long Text) - Curie:  4096
Array Size (Long Text) - Davinci:  12288
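Larger vectors also take more room to store. The following back-of-the-envelope sketch uses the sizes printed above. It assumes the dataset has 1999 rows (index 0 to 1998 in the table shown earlier) and that every value is stored as a 64-bit float; both are assumptions for illustration, not measurements.

```python
# Embedding sizes as reported by the cells above.
dims = {"Ada": 1024, "Babbage": 2048, "Curie": 4096, "Davinci": 12288}

n_docs = 1999        # assumption: rows 0..1998 in the displayed DataFrame
bytes_per_value = 8  # assumption: each value stored as a 64-bit float

# Raw size of the embedding column per model, in megabytes (10^6 bytes).
size_mb = {name: n_docs * d * bytes_per_value / 1e6 for name, d in dims.items()}

for name, mb in size_mb.items():
    print(f"{name}: {mb:.1f} MB")
```

The ratio between models follows the ratio of their vector sizes, so the Davinci embeddings need twelve times the space of the Ada ones.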


In [10]:
tokenizer = tiktoken.get_encoding("cl100k_base")
sample_encode = tokenizer.encode(df["description"][1]) 
print("No of tokens: ", len(sample_encode))
tokenizer.decode_tokens_bytes(sample_encode)


No of tokens:  17


[b'PC',
 b' World',
 b' -',
 b' Up',
 b'coming',
 b' chip',
 b' set',
 b' will',
 b' include',
 b' built',
 b'-in',
 b' security',
 b' features',
 b' for',
 b' your',
 b' PC',
 b'.']

## Calculate embeddings

Instead of recalculating all the embeddings for the data, you can use the precomputed dataset stored in the pickle file.
Save your embeddings after calculating them, especially when converting a large dataset; otherwise you pay for the API calls again each time you need them.

In [11]:

#df['embedding'] = df["description"].apply(lambda x : get_embedding(x, engine = 'text-embedding-ada-002'))
#pd.to_pickle(df, "data/data-embedding.pkl")
df = pd.read_pickle("data/data-embedding.pkl")
df.head()

Unnamed: 0,title,description,label_int,label,embedding
0,World Briefings,BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime M...,1,World,"[-0.010669018141925335, -0.024026192724704742,..."
1,Nvidia Puts a Firewall on a Motherboard (PC Wo...,PC World - Upcoming chip set will include buil...,4,Sci/Tech,"[0.0045676459558308125, -0.004495817236602306,..."
2,"Olympic joy in Greek, Chinese press",Newspapers in Greece reflect a mixture of exhi...,2,Sports,"[-0.004999000113457441, 0.007488568313419819, ..."
3,U2 Can iPod with Pictures,"SAN JOSE, Calif. -- Apple Computer (Quote, Cha...",4,Sci/Tech,"[-0.0030229880940169096, -0.021461213007569313..."
4,The Dream Factory,"Any product, any shape, any size -- manufactur...",4,Sci/Tech,"[-0.024735869839787483, -0.009320170618593693,..."
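One way to make this "compute once, then reload" pattern explicit is a small caching helper. The sketch below is only an illustration: `load_or_compute_embeddings` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real `get_embedding` call so the snippet runs without API access.

```python
import os
import tempfile

import pandas as pd

def load_or_compute_embeddings(df, cache_path, embed_fn):
    """Return df with an 'embedding' column, reusing cache_path if it exists."""
    if os.path.exists(cache_path):
        # Cached result found: no embedding calls are made.
        return pd.read_pickle(cache_path)
    df = df.copy()
    df["embedding"] = df["description"].apply(embed_fn)
    df.to_pickle(cache_path)
    return df

calls = []  # records every text that is actually embedded

def fake_embed(text):
    # Hypothetical stand-in for get_embedding(text, engine=...).
    calls.append(text)
    return [float(len(text)), 0.0]

demo = pd.DataFrame({"description": ["short text", "a somewhat longer text"]})
cache_file = os.path.join(tempfile.mkdtemp(), "demo-embedding.pkl")

first = load_or_compute_embeddings(demo, cache_file, fake_embed)   # computes and saves
second = load_or_compute_embeddings(demo, cache_file, fake_embed)  # loads from disk
print("embedding calls made:", len(calls))
```

The second call reads the pickle file, so the embedding function runs only once per document.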


The following function calculates the cosine similarity between our query and every document in the dataset, then returns the best matches.

In [12]:
def search_docs(df, user_query:str, engine:str, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        engine=engine
    )
    df["similarities"] = df.embedding.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res
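For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below reimplements it with NumPy and ranks a few made-up three-dimensional vectors. It illustrates the idea and is not the library's own code; the `docs` and `query` values are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings"; real ones have hundreds or thousands of dimensions.
docs = {
    "sports": [0.9, 0.1, 0.0],
    "tech": [0.1, 0.9, 0.2],
    "world": [0.3, 0.3, 0.9],
}
query = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(docs[name], query),
    reverse=True,
)
print(ranked)
```

Because the score depends only on the angle between the vectors, scaling a vector does not change its ranking.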

Here we call the function with the query "who is edgar davids". With `top_n=4`, it returns the four results most similar to this query.

In [13]:
res = search_docs(df, "who is edgar davids", top_n=4, engine="text-embedding-ada-002", to_print=False)
for i in res.index:
    print(df["description"][i])
    print("=====================================")

AMSTERDAM, Aug 18 (Reuters) - Midfielder Edgar Davids #39;s leadership qualities and never-say-die attitude have earned him the captaincy of the Netherlands under new coach Marco van Basten.
England boss Sven Goran Eriksson has defended goalkeeper David James after last night #39;s 2-2 draw in Austria. James allowed Andreas Ivanschitz #39;s shot to slip through his fingers to complete Austria comeback from two goals down.
ENGLAND captain and Real Madrid midfielder David Beckham has played down speculation that his club are moving for England manager Sven-Goran Eriksson.
Manchester United boss Sir Alex Ferguson wants the FA to punish Arsenal good guy Dennis Bergkamp for taking a swing at Alan Smith last Sunday.
