<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=150px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>Generative AI with OpenAI API</h1>
<h1>Embeddings</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

from ipywidgets import interact
import tiktoken
from openai import OpenAI

import nltk
from nltk.corpus import reuters
from nltk import bigrams, trigrams

import scipy
from scipy.spatial.distance import cosine as cosine_similarity  # note: SciPy's cosine() returns the cosine distance (1 - similarity)

import os
import gzip

import tqdm as tq
from tqdm.notebook import tqdm
tqdm.pandas()

import watermark

%load_ext watermark

We start by printing out the versions of the libraries we're using, for future reference.

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.12.3

Compiler    : Clang 14.0.6 
OS          : Darwin
Release     : 23.4.0
Machine     : arm64
Processor   : arm
CPU cores   : 16
Architecture: 64bit

Git hash: 5217868ae783ab0315a9d79f2e1ac0fb25095003

numpy    : 1.26.4
pandas   : 2.1.4
tiktoken : 0.6.0
tqdm     : 4.66.2
scipy    : 1.11.4
nltk     : 3.8.1
watermark: 2.4.3



Create the client. `OpenAI()` reads the API key from the `OPENAI_API_KEY` environment variable.

In [3]:
client = OpenAI()
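
A minimal sketch of a pre-flight check for the key, assuming it is supplied through the `OPENAI_API_KEY` environment variable, which is where `OpenAI()` looks when called without arguments. The helper name `api_key_is_configured` is hypothetical and not part of the `openai` package:

```python
import os

def api_key_is_configured(environ=None):
    """Return True if a non-empty OPENAI_API_KEY is present.

    Hypothetical helper for illustration; `environ` defaults to os.environ
    and can be replaced by a plain dict when testing.
    """
    if environ is None:
        environ = os.environ
    return bool(environ.get("OPENAI_API_KEY", "").strip())

if not api_key_is_configured():
    print("OPENAI_API_KEY is not set; the client cannot authenticate without it.")
```

Running a check like this before the first request gives a clearer message than an authentication error raised later in the notebook.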

# Extract embeddings

In [4]:
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Your text goes here",
    )

In [5]:
response

CreateEmbeddingResponse(data=[Embedding(embedding=[-0.013014387339353561, -0.013467525132000446, 0.009609189815819263, -0.021537378430366516, -0.02504253387451172, 0.044007688760757446, -0.01971150003373623, -0.01577986218035221, -0.010795344598591328, -0.023349929600954056, 0.005287719890475273, 0.016606172546744347, -0.009849085472524166, -0.0010128965368494391, 0.003032025881111622, -0.006457215175032616, 0.02437615394592285, -0.004034926649183035, 0.014460430480539799, -0.012201405130326748, -0.0046146768145263195, 0.01326761208474636, 0.013207637704908848, 0.01222139596939087, -0.004747952334582806, 0.0010237251408398151, 0.01196150854229927, -0.017259223386645317, 0.015926465392112732, 0.009502568282186985, 0.005900788586586714, -0.006327271461486816, -0.020177964121103287, -0.006793736945837736, -0.017872292548418045, -0.009056094102561474, 0.0024939244613051414, -0.015926465392112732, 0.027987930923700333, -0.005111129023134708, 0.011328447610139847, 0.007723336108028889, -0.00

The embedding itself is a high-dimensional vector (1,536 dimensions for `text-embedding-ada-002`)

In [6]:
response.data[0].embedding

[-0.013014387339353561,
 -0.013467525132000446,
 0.009609189815819263,
 -0.021537378430366516,
 -0.02504253387451172,
 0.044007688760757446,
 -0.01971150003373623,
 -0.01577986218035221,
 -0.010795344598591328,
 -0.023349929600954056,
 0.005287719890475273,
 0.016606172546744347,
 -0.009849085472524166,
 -0.0010128965368494391,
 0.003032025881111622,
 -0.006457215175032616,
 0.02437615394592285,
 -0.004034926649183035,
 0.014460430480539799,
 -0.012201405130326748,
 -0.0046146768145263195,
 0.01326761208474636,
 0.013207637704908848,
 0.01222139596939087,
 -0.004747952334582806,
 0.0010237251408398151,
 0.01196150854229927,
 -0.017259223386645317,
 0.015926465392112732,
 0.009502568282186985,
 0.005900788586586714,
 -0.006327271461486816,
 -0.020177964121103287,
 -0.006793736945837736,
 -0.017872292548418045,
 -0.009056094102561474,
 0.0024939244613051414,
 -0.015926465392112732,
 0.027987930923700333,
 -0.005111129023134708,
 0.011328447610139847,
 0.007723336108028889,
 -0.0016251325

The similarity between two embedding vectors reflects the semantic similarity between the corresponding texts

In [7]:
emb1 = client.embeddings.create(
    model="text-embedding-ada-002",
    input="love",
    ).data[0].embedding

emb2 = client.embeddings.create(
    model="text-embedding-ada-002",
    input="kindness",
    ).data[0].embedding

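Cosine similarity is a quick and simple way to measure how similar two vectors are. Note that SciPy's `cosine` function, imported above under the name `cosine_similarity`, returns the cosine distance (1 minus the similarity), so smaller values mean more similar texts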

In [8]:
cosine_similarity(emb1, emb2)

0.162337414711768
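
To make the relationship between the two quantities concrete, here is a minimal sketch in plain Python on toy vectors. It assumes the standard definitions, similarity = u·v / (|u||v|) and distance = 1 - similarity (the convention followed by `scipy.spatial.distance.cosine`); the function names are illustrative only:

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_dist(u, v):
    """Cosine distance: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    return 1.0 - cosine_sim(u, v)

x = [1.0, 0.0]
y = [1.0, 1.0]   # 45 degrees away from x
z = [0.0, 1.0]   # orthogonal to x

print(cosine_sim(x, y), cosine_dist(x, y))
print(cosine_sim(x, z), cosine_dist(x, z))
```

Under this convention, the value 0.162 printed above corresponds to a similarity of roughly 0.838.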

Naturally, unrelated words are less similar, which shows up as a larger cosine distance

In [9]:
emb3 = client.embeddings.create(
    model="text-embedding-ada-002",
    input="airplane",
    ).data[0].embedding

In [10]:
cosine_similarity(emb2, emb3)

0.22999315123592778

# Generate embeddings for the Reuters sentences dataset

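We're going to use the Reuters news corpus that ships with NLTK. Each sentence comes as a list of tokens, which we join with spaces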

In [11]:
sentences = [" ".join(sent) for sent in reuters.sents()]

In [12]:
len(sentences)

54716

In [13]:
sentences[:10]

["ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .",
 'They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products .',
 "But some exporters said that while the conflict would hurt them in the long - run , in the short - term Tokyo ' s loss might be their gain .",
 "The U . S . Has said it will impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost .",
 'Unofficial Japanese estimates put the impact of the tariffs at 10 billion dlrs and spokesmen for major electronics firms said they would virtually halt expo

Let us build a DataFrame with the first 1000 sentences

In [14]:
data = pd.DataFrame(sentences[:1000], columns=["text"])

In [15]:
data

Unnamed: 0,text
0,ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPA...
1,They told Reuter correspondents in Asian capit...
2,But some exporters said that while the conflic...
3,The U . S . Has said it will impose 300 mln dl...
4,Unofficial Japanese estimates put the impact o...
...,...
995,The decision to cut deposits was taken by the ...
996,The cuts were likely to attract more business ...
997,ELECTRO RENT CORP & lt ; ELRC > 3RD QTR FEB 28...
998,WALGREEN CO 2ND QTR SHR 62 CTS VS 58 CTS


In [16]:
embedding_model = "text-embedding-ada-002"

In [17]:
encoding = tiktoken.encoding_for_model(embedding_model)

# Compute the embeddings for each sentence

In [18]:
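def get_embedding(text, engine):
    """Return the embedding of `text` computed by the model named `engine`."""
    # Newlines are replaced with spaces before the text is sent to the API
    text = text.replace("\n", " ")

    return client.embeddings.create(input=[text], model=engine).data[0].embedding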

In [19]:
data["embedding"] = data.text.progress_apply(lambda x: get_embedding(x, engine=embedding_model))

  0%|          | 0/1000 [00:00<?, ?it/s]
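
The cell above sends one request per sentence. Since `get_embedding` already passes `input` as a list, several sentences could in principle share a single request. A minimal sketch of the batching logic, assuming the endpoint returns one embedding per input; `chunked`, `embed_in_batches` and `embed_fn` are hypothetical names, and `embed_fn` stands in for whatever function actually calls the API:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed `texts` in batches, preserving the input order.

    `embed_fn` takes a list of strings and returns a list of vectors
    of the same length.
    """
    embeddings = []
    for batch in chunked(texts, batch_size):
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise RuntimeError("embed_fn returned an unexpected number of vectors")
        embeddings.extend(vectors)
    return embeddings

# Stand-in for the API call so the sketch runs offline:
# one fake 2-d vector per text, based on the text length.
def fake_embed(batch):
    return [[float(len(text)), 0.0] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(len(vectors))
```

With the real client, `embed_fn` could wrap `client.embeddings.create(input=batch, model=embedding_model)` and collect the `embedding` field of each element in `.data`. The batch size of 100 is an arbitrary choice for illustration.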

In [20]:
data

Unnamed: 0,text,embedding
0,ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPA...,"[-0.016322532668709755, -0.028255142271518707,..."
1,They told Reuter correspondents in Asian capit...,"[-0.021973947063088417, -0.016387123614549637,..."
2,But some exporters said that while the conflic...,"[-0.020723218098282814, -0.018283598124980927,..."
3,The U . S . Has said it will impose 300 mln dl...,"[-0.02001209929585457, -0.01670563593506813, 0..."
4,Unofficial Japanese estimates put the impact o...,"[-0.02122245542705059, -0.03379626199603081, 0..."
...,...,...
995,The decision to cut deposits was taken by the ...,"[0.003845407161861658, -0.014886741526424885, ..."
996,The cuts were likely to attract more business ...,"[-0.02035561017692089, -0.02054610475897789, 0..."
997,ELECTRO RENT CORP & lt ; ELRC > 3RD QTR FEB 28...,"[0.0014210807858034968, -0.012158041819930077,..."
998,WALGREEN CO 2ND QTR SHR 62 CTS VS 58 CTS,"[0.002370386151596904, 0.00641195522621274, 0...."


# Generate Recommendations

In [21]:
def get_recommendations(data, pos, k_nearest_neighbors=3):
    embeddings = np.array(data["embedding"].to_list())
    strings = data["text"].to_list()

    query_embedding = embeddings[pos]

    # cosine_similarity is SciPy's cosine *distance*: smaller values mean closer texts
    distances = [ cosine_similarity(query_embedding, embedding) for embedding in tqdm(embeddings, total=len(embeddings)) ]

    # Ascending sort: the query itself (distance 0) normally comes first
    indices_of_nearest_neighbors = np.argsort(distances)

    query_string = strings[pos]
    print(f"Query: {query_string}\n\n")

    # Skip the first entry of the ranking (the query) and print the next k neighbors
    for rank, i in enumerate(indices_of_nearest_neighbors[1:k_nearest_neighbors+1]):
        print("-- Nearest neighbor %u of %u (%u)---\n%s\n\n" % (rank, k_nearest_neighbors, i, strings[i]))

In [22]:
%%time
get_recommendations(data, 123, 5)

  0%|          | 0/1000 [00:00<?, ?it/s]

Query: It will be the next decade before we see if the strategy is right or wrong ."


-- Nearest neighbor 0 of 5 (94)---
But financial analysts are divided on whether and how quickly the gamble will pay off .


-- Nearest neighbor 1 of 5 (118)---
We expect this will be allowed in two or three years ," he said .


-- Nearest neighbor 2 of 5 (106)---
We only have to wait two or three years , not until the 21st century ," Komatsu said .


-- Nearest neighbor 3 of 5 (321)---
Bankers say it is too early to venture a forecast for economic growth this year or next .


-- Nearest neighbor 4 of 5 (580)---
" And we will have to see if the United States is able to do what they promised in Paris on reducing the budget deficit -- and get it through Congress ," he added .


CPU times: user 41.8 ms, sys: 4.39 ms, total: 46.2 ms
Wall time: 45.2 ms
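
`get_recommendations` skips the query by dropping the first element of the sorted list, which relies on the query being ranked first. If the collection contains an exact duplicate of the query, the tie could be resolved either way. A minimal sketch of a variant that excludes the query by index and returns the results instead of printing them; it uses plain Python and toy vectors, and `nearest_neighbors` is a hypothetical name:

```python
import math

def cosine_dist(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbors(embeddings, pos, k=3):
    """Return up to k (index, distance) pairs closest to embeddings[pos], excluding pos."""
    query = embeddings[pos]
    scored = [
        (i, cosine_dist(query, emb))
        for i, emb in enumerate(embeddings)
        if i != pos  # exclude the query explicitly instead of relying on sort order
    ]
    scored.sort(key=lambda pair: pair[1])  # ascending distance: closest first
    return scored[:k]

toy = [
    [1.0, 0.0],   # 0: query
    [1.0, 0.0],   # 1: exact duplicate of the query
    [1.0, 1.0],   # 2: 45 degrees away
    [0.0, 1.0],   # 3: orthogonal
]
neighbors = nearest_neighbors(toy, pos=0, k=2)
print(neighbors)
```

Returning the pairs rather than printing them makes it easy to reuse the ranking, for instance to look up the matching rows of `data`.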


<center>
     <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</center>