<a href="https://colab.research.google.com/github/PHIHOD/monsters-rolodex/blob/master/Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI Word Embeddings, Semantic Search

Word embeddings are a way of representing words and phrases as vectors. They can be used for a variety of tasks, including semantic search, anomaly detection, and classification. In the video on OpenAI Whisper, I mentioned how words whose vectors are numerically similar are also similar in semantic meaning. In this tutorial, we will learn how to implement semantic search using OpenAI embeddings. Understanding the Embeddings concept will be crucial to the next several videos in this series since we will use it to build several practical applications.

To get started, we will need to install and import OpenAI and input an API Key. We learned how to do this in [Video 3 of this series](https://www.youtube.com/watch?v=LWYgjcZye1c).

In [2]:
!pip install openai -q

In [3]:
import openai
import pandas as pd
import numpy as np
from getpass import getpass

openai.api_key = getpass()

··········


# Read Data File Containing Words

Now that we have configured OpenAI, let's start with a simple CSV file with familiar words. From here we'll build up to a more complex semantic search using sentences from the Fed speech. [Save the linked "words.csv" as a CSV](https://gist.github.com/hackingthemarkets/25240a55e463822d221539e79d91a8d0) and upload it to Google Colab. Once the file is uploaded, let's read it into a pandas dataframe using the code below:

In [17]:
df = pd.read_csv('words.csv')
print(df)

            text
0            red
1       potatoes
2           soda
3         cheese
4          water
5           blue
6         crispy
7      hamburger
8         coffee
9          green
10          milk
11      la croix
12        yellow
13     chocolate
14  french fries
15         latte
16          cake
17         brown
18  cheeseburger
19      espresso
20    cheesecake
21         black
22         mocha
23         fizzy
24        carbon
25        banana


# Calculate Word Embeddings

To use word embeddings for semantic search, you first compute the embeddings for a corpus of text using a word embedding algorithm. What does this mean? We are going to create a numerical representation of each of these words. To perform this computation, we'll use OpenAI's 'get_embedding' function.

Since we have our words in a pandas dataframe, we can use "apply" to apply the get_embedding function to each row in the dataframe. We then store the calculated word embeddings in a new text file called "word_embeddings.csv" so that we don't have to call OpenAI again to perform these calculations.

In [15]:
from openai.embeddings_utils import get_embedding

get_embedding("the fox crossed the road", engine='text-embedding-ada-002')

[-0.0005497358506545424,
 0.0003819362900685519,
 -0.020266875624656677,
 0.007079525385051966,
 -0.013881421647965908,
 0.0253272857517004,
 -0.02231123112142086,
 -0.02231123112142086,
 0.013868802227079868,
 -0.03313874080777168,
 0.029075268656015396,
 0.008802083320915699,
 0.03291159123182297,
 -0.015875298529863358,
 0.004849032964557409,
 -0.001055303611792624,
 0.019547566771507263,
 -0.0009748543961904943,
 0.01838657446205616,
 -0.03069056198000908,
 -0.00983688049018383,
 0.029504330828785896,
 0.0022746603935956955,
 -0.02371199242770672,
 -0.003902572439983487,
 0.006407538428902626,
 0.015269564464688301,
 -0.01379308570176363,
 -0.0027920587453991175,
 -0.01162253599613905,
 0.008139560930430889,
 0.0019213149789720774,
 -0.02884811908006668,
 0.0005071451305411756,
 -0.01379308570176363,
 -0.017402255907654762,
 -0.001864527352154255,
 -0.008713747374713421,
 0.011281810700893402,
 -0.027384260669350624,
 0.0271066315472126,
 0.005319108720868826,
 -0.01665770635008812

In [18]:
from openai.embeddings_utils import get_embedding

df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
df.to_csv('word_embeddings.csv')
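
The idea of storing embeddings so that OpenAI does not have to be called again can also be applied one text at a time. Below is a minimal sketch, assuming a hypothetical `embed_fn` callable that stands in for `get_embedding` and a hypothetical JSON cache file. The `cached_embeddings` helper and the cache layout are illustrative and are not part of the OpenAI library.

```python
import json
import os
import tempfile

def cached_embeddings(texts, embed_fn, cache_path):
    """Return {text: vector}, calling embed_fn only for texts missing from the cache file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    for text in texts:
        if text not in cache:
            cache[text] = list(embed_fn(text))  # the only place the API would be called
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return {text: cache[text] for text in texts}

# Stand-in for get_embedding so the sketch runs offline (an assumption, not the real API).
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 0.0]

cache_file = os.path.join(tempfile.mkdtemp(), "embedding_cache.json")
first = cached_embeddings(["red", "soda"], fake_embed, cache_file)
second = cached_embeddings(["red", "soda", "milk"], fake_embed, cache_file)
```

On the second call only "milk" reaches `fake_embed`; "red" and "soda" are read back from the cache file.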

# Semantic Search

Now that we have our word embeddings stored, let's load them into a new dataframe and use it for semantic search. Since the 'embedding' column in the CSV is stored as a string, we'll use apply() with eval to interpret this string as Python code, then convert the result to a numpy array so that we can perform calculations on it.

In [30]:
df = pd.read_csv('word_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df

Unnamed: 0.1,Unnamed: 0,text,embedding
0,0,red,"[-7.451758574461564e-05, -0.024687238037586212..."
1,1,potatoes,"[0.00496138958260417, -0.03108060173690319, 0...."
2,2,soda,"[0.025804834440350533, -0.007458584848791361, ..."
3,3,cheese,"[-0.003182360902428627, -0.008869297802448273,..."
4,4,water,"[0.019149407744407654, -0.01257583312690258, 0..."
5,5,blue,"[0.005404925439506769, -0.007392944302409887, ..."
6,6,crispy,"[-0.00097661220934242, -0.005434627644717693, ..."
7,7,hamburger,"[-0.013190791942179203, -0.0018121899338439107..."
8,8,coffee,"[-0.0007453818107023835, -0.019452422857284546..."
9,9,green,"[0.015282037667930126, -0.010865106247365475, ..."
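
Because `eval` runs whatever Python code the string contains, a more cautious way to parse the stored embeddings is `ast.literal_eval`, which only accepts literals such as lists and numbers. A minimal sketch follows; the `parse_embedding` helper name and the short example vector are made up for illustration.

```python
import ast
import numpy as np

def parse_embedding(cell):
    """Convert a stringified list such as "[0.1, -0.2]" into a numpy array."""
    # literal_eval only accepts Python literals, so it will not execute code
    return np.array(ast.literal_eval(cell), dtype=float)

vector = parse_embedding("[0.025, -0.007, 0.004]")
```

With the dataframe above, this could be used as `df['embedding'].apply(parse_embedding)` in place of the two chained `apply` calls.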


Let's now prompt ourselves for a search term that isn't in the dataframe. We'll use word embeddings to perform a semantic search for the words that are most similar to the word we entered. I'll first try the word "hot dog". Then we'll come back and try the word "yellow".

In [24]:
search_term = input('Enter a search term: ')


Enter a search term: soda


Now that we have a search term, let's calculate an embedding or vector for that search term using the OpenAI get_embedding function.

In [28]:
# semantic search
search_term_vector = get_embedding(search_term, engine="text-embedding-ada-002")
search_term_vector

[0.025804834440350533,
 -0.007458584848791361,
 -0.004356400575488806,
 -0.031168611720204353,
 -0.01994737796485424,
 0.006694713607430458,
 -0.025377867743372917,
 -0.01885327324271202,
 -0.006281089037656784,
 -0.006798119749873877,
 0.030314676463603973,
 0.026298515498638153,
 0.0024317121133208275,
 0.0057640583254396915,
 -0.01630481332540512,
 0.004089545924216509,
 0.03279642388224602,
 -0.02176198922097683,
 0.03623884916305542,
 -0.01526407990604639,
 0.011341318488121033,
 0.0004932639421895146,
 0.02129499241709709,
 -0.011074463836848736,
 -0.007525298278778791,
 0.005520553328096867,
 0.018893301486968994,
 -0.020267603918910027,
 0.005910828243941069,
 -0.006958232261240482,
 0.02018754743039608,
 -0.016318155452609062,
 -0.015023911371827126,
 -0.007445241790264845,
 -0.016611695289611816,
 -0.00827249139547348,
 -0.021708616986870766,
 -0.0193336121737957,
 0.000506189709994942,
 -0.03589193522930145,
 -0.0039094192907214165,
 -0.0218820720911026,
 -0.0065079154446721

Once we have a vector representing that word, we can see how similar it is to other words in our dataframe by calculating the cosine similarity of our search term's word vector to each word embedding in our dataframe.

In [31]:
from openai.embeddings_utils import cosine_similarity

df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))

df

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,red,"[-7.451758574461564e-05, -0.024687238037586212...",0.793664
1,1,potatoes,"[0.00496138958260417, -0.03108060173690319, 0....",0.824367
2,2,soda,"[0.025804834440350533, -0.007458584848791361, ...",1.0
3,3,cheese,"[-0.003182360902428627, -0.008869297802448273,...",0.835515
4,4,water,"[0.019149407744407654, -0.01257583312690258, 0...",0.836316
5,5,blue,"[0.005404925439506769, -0.007392944302409887, ...",0.787873
6,6,crispy,"[-0.00097661220934242, -0.005434627644717693, ...",0.797836
7,7,hamburger,"[-0.013190791942179203, -0.0018121899338439107...",0.813212
8,8,coffee,"[-0.0007453818107023835, -0.019452422857284546...",0.817923
9,9,green,"[0.015282037667930126, -0.010865106247365475, ...",0.780331


# Sorting By Similarity

Now that we have calculated the similarity to each term in our dataframe, we simply sort the similarity values to find the terms that are most similar to the term we searched for. In the run recorded below, the search term was "soda": it matches itself with a similarity of 1.0, followed by "fizzy", "milk", and "water". When searching for "hot dog", the foods come out most similar, with fast food ranked closest, and some colors rank closer to hot dog than others. Let's go back and try the word "yellow" and walk through the results.

In [32]:
df.sort_values("similarities", ascending=False).head(20)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
2,2,soda,"[0.025804834440350533, -0.007458584848791361, ...",1.0
23,23,fizzy,"[-0.01298743300139904, -0.010277776047587395, ...",0.857401
10,10,milk,"[0.000902261643204838, -0.019304810091853142, ...",0.849798
4,4,water,"[0.019149407744407654, -0.01257583312690258, 0...",0.836316
3,3,cheese,"[-0.003182360902428627, -0.008869297802448273,...",0.835515
19,19,espresso,"[-0.02247573435306549, -0.012739825062453747, ...",0.829389
14,14,french fries,"[0.0014221749734133482, -0.016582705080509186,...",0.82901
22,22,mocha,"[-0.012462769635021687, -0.026210907846689224,...",0.828314
1,1,potatoes,"[0.00496138958260417, -0.03108060173690319, 0....",0.824367
13,13,chocolate,"[0.00145119393710047, -0.012891666032373905, -...",0.823176
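
The same ranking can be sketched without pandas by stacking the vectors into a matrix and sorting the similarity scores. The 3-dimensional vectors below are hypothetical values chosen for illustration; real text-embedding-ada-002 vectors have 1536 dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for real embeddings
words = ["soda", "fizzy", "red"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])
query = np.array([1.0, 0.0, 0.0])

# Cosine similarity of the query against every row at once
similarities = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Indices sorted from most similar to least similar
order = np.argsort(-similarities)
ranked = [(words[i], round(float(similarities[i]), 3)) for i in order]
```

Here `ranked` lists the words from most to least similar to the query, which is what `sort_values("similarities", ascending=False)` does for the dataframe.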


# Adding Words Together

What's even more interesting is that we can add word vectors together. What happens when we add the vectors for milk and espresso, then search for the word vector most similar to milk + espresso? Let's make a copy of the original dataframe, call it food_df, and operate on this copy. We'll add milk + espresso and store the result in milk_espresso_vector.

In [33]:
food_df = df.copy()

milk_vector = food_df['embedding'][10]
espresso_vector = food_df['embedding'][19]

milk_espresso_vector = milk_vector + espresso_vector
milk_espresso_vector

array([-0.02157347, -0.03204464, -0.01626644, ..., -0.00419496,
        0.00076436, -0.02888974])

Now let's find the words most similar to milk + espresso. If you have never done this before, it's pretty surprising that you can add words together like this and find similar words using numbers.

In [34]:
food_df["similarities"] = food_df['embedding'].apply(lambda x: cosine_similarity(x, milk_espresso_vector))
food_df.sort_values("similarities", ascending=False)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
19,19,espresso,"[-0.02247573435306549, -0.012739825062453747, ...",0.96049
10,10,milk,"[0.000902261643204838, -0.019304810091853142, ...",0.96049
15,15,latte,"[-0.015636539086699486, -0.0039571840316057205...",0.923062
22,22,mocha,"[-0.012462769635021687, -0.026210907846689224,...",0.899298
8,8,coffee,"[-0.0007453818107023835, -0.019452422857284546...",0.895402
3,3,cheese,"[-0.003182360902428627, -0.008869297802448273,...",0.884499
13,13,chocolate,"[0.00145119393710047, -0.012891666032373905, -...",0.883368
2,2,soda,"[0.025804834440350533, -0.007458584848791361, ...",0.87413
4,4,water,"[0.019149407744407654, -0.01257583312690258, 0...",0.865977
7,7,hamburger,"[-0.013190791942179203, -0.0018121899338439107...",0.852543
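
The identical scores for espresso and milk in the table above are what one would expect if both embeddings have the same length: the sum of two equal-length vectors makes the same angle with each of them. A small check, using hypothetical 2-dimensional unit vectors rather than real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two hypothetical unit-length vectors, 60 degrees apart
a = np.array([1.0, 0.0])
b = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])

total = a + b
sim_a = cosine_similarity(total, a)
sim_b = cosine_similarity(total, b)
```

Both `sim_a` and `sim_b` equal the cosine of 30 degrees, because the sum lies exactly halfway between the two vectors.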


# Microsoft Earnings Call Transcript

Let's tie this back to finance. I have attached some text from a recent [Microsoft earnings call here](https://gist.github.com/hackingthemarkets/1c827a7750384fcf52c84594ef216a2d). Click on "raw" and save the file as a CSV. Upload it to Google Colab as microsoft-earnings.csv. Let's use what we just learned to perform a semantic search on sentences in the Microsoft earnings call. We'll start by reading the paragraphs into a pandas dataframe.

In [52]:
earnings_df = pd.read_csv('2.csv')
earnings_df

Unnamed: 0,Index,Content
0,1-1,Lovens formål er: a. å sikre et arbeidsmiljø s...
1,1-2,(1) Loven gjelder for virksomhet som sysselset...
2,1-3,(1) Loven gjelder for virksomhet i forbindelse...
3,1-4,(1) Departementet kan gi forskrift om at loven...
4,1-5,(1) Departementet kan gi forskrift om og i hvi...
...,...,...
186,19-5,Enhver som er knyttet til Arbeidstilsynet er i...
187,19-6,Overtredelse av denne lov er undergitt offentl...
188,19-7,Opphevet ved lov 19 juni 2015 nr. 65 (ikr. 1 o...
189,20-1,Loven trer i kraft fra den tid Kongen bestemme...


Once we have the dataframe, we'll once again compute the embeddings for each line in our CSV file.

In [67]:
earnings_df['embedding'] = earnings_df['Content'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
# Save to a CSV file instead of printing: the full embeddings are too large for notebook output
earnings_df.to_csv('earnings_embeddings.csv')



If you download the earnings_embeddings.csv file locally and open it up, you'll see that our embeddings are for entire paragraphs - not just words. This means that we'll be able to search on similar sentences even if there isn't an exact match for the string we search for. We are searching on meaning.

In [58]:
earnings_search = input("Search earnings for a sentence:")

Search earnings for a sentence:Særbehandling som bidrar til å fremme likebehandling er ikke i strid med bestemmelsene i dette kapittel. Særbehandlingen skal opphøre når formålet med den er oppnådd.


In [59]:

earnings_search_vector = get_embedding(earnings_search, engine="text-embedding-ada-002")
earnings_search_vector

[-0.03408805653452873,
 -0.019190313294529915,
 0.022384490817785263,
 -0.030755003914237022,
 -0.004333602264523506,
 -0.01112911943346262,
 -0.013205965980887413,
 -0.0063220723532140255,
 -0.02329350635409355,
 0.0006742649129591882,
 -5.725728260586038e-05,
 0.020894717425107956,
 -0.008641953580081463,
 -0.004387259483337402,
 -0.007272119168192148,
 0.00733524514362216,
 0.052621860057115555,
 -0.035401079803705215,
 0.009853973984718323,
 -0.007796064950525761,
 -0.0024429773911833763,
 0.01093342900276184,
 -0.021715356037020683,
 0.005315212067216635,
 -0.03815337270498276,
 0.006571420002728701,
 0.004118973854929209,
 -0.029315728694200516,
 0.002809108467772603,
 -0.007240556180477142,
 0.008370512165129185,
 -0.01744803600013256,
 -0.03994615375995636,
 -0.014796742238104343,
 0.00962672010064125,
 -0.02406364306807518,
 0.009311090223491192,
 0.005971722770482302,
 0.02211936190724373,
 -0.023078877478837967,
 0.019442817196249962,
 -0.02506103552877903,
 -0.0001481489016

In [60]:

earnings_df["similarities"] = earnings_df['embedding'].apply(lambda x: cosine_similarity(x, earnings_search_vector))

earnings_df


Unnamed: 0,Index,Content,embedding,similarities
0,1-1,Lovens formål er: a. å sikre et arbeidsmiljø s...,"[-0.005932123865932226, -0.008052770979702473,...",0.805156
1,1-2,(1) Loven gjelder for virksomhet som sysselset...,"[-0.009379606693983078, -0.01712575927376747, ...",0.822950
2,1-3,(1) Loven gjelder for virksomhet i forbindelse...,"[0.004244636744260788, -0.013968712650239468, ...",0.815731
3,1-4,(1) Departementet kan gi forskrift om at loven...,"[-0.013111429288983345, -0.008210494183003902,...",0.832425
4,1-5,(1) Departementet kan gi forskrift om og i hvi...,"[0.0042964499443769455, 0.0011408518766984344,...",0.817953
...,...,...,...,...
186,19-5,Enhver som er knyttet til Arbeidstilsynet er i...,"[-0.004267487674951553, -0.02458224445581436, ...",0.803283
187,19-6,Overtredelse av denne lov er undergitt offentl...,"[-0.007169641088694334, -0.02803962491452694, ...",0.813201
188,19-7,Opphevet ved lov 19 juni 2015 nr. 65 (ikr. 1 o...,"[-0.0051294309087097645, -0.02359042502939701,...",0.812440
189,20-1,Loven trer i kraft fra den tid Kongen bestemme...,"[-0.006408637389540672, -0.029810603708028793,...",0.806918


In [61]:
earnings_df.sort_values("similarities", ascending=False)

Unnamed: 0,Index,Content,embedding,similarities
97,13-6,Særbehandling som bidrar til å fremme likebeha...,"[-0.033746205270290375, -0.019957832992076874,...",0.995463
94,13-3,(1) Forskjellsbehandling som har et saklig for...,"[-0.008884463459253311, -0.013157921843230724,...",0.870482
101,13-10,Som fullmektig i forvaltningssak etter dette k...,"[-0.03072771430015564, -0.026752984151244164, ...",0.863270
153,15-17,Bestemmelsene i dette kapittel gjelder ikke ve...,"[-0.01006377674639225, -0.006401385646313429, ...",0.862406
99,13-8,Dersom arbeidstaker eller arbeidssøker fremleg...,"[-0.027981314808130264, -0.019785530865192413,...",0.860862
...,...,...,...,...
35,5-2,(1) Dersom arbeidstaker omkommer eller blir al...,"[-0.0127043342217803, -0.011660141870379448, -...",0.785239
15,2 A-2,(1) Arbeidstaker kan alltid varsle internt a. ...,"[-0.015065928921103477, -0.01334595587104559, ...",0.784613
75,11-5,(1) Personer under 18 år skal ha hvilepause i ...,"[-0.0015598261961713433, -0.005544301588088274...",0.781308
79,12-4,Etter fødselen skal mor ha permisjon de første...,"[-0.0069259474985301495, -0.013777755200862885...",0.776129


# Sentences of the Fed Speech

Let's use the Fed speech example once more. Let's calculate the embeddings for each sentence in the November 2nd speech that we discussed in the OpenAI Whisper tutorial. Then we'll take a new sentence from a future speech that isn't in our dataset, and find the most similar sentences in our dataset. Here is the sentence we will use to search for similarity:

"the inflation is too damn high"

As we did previously, take [the linked CSV file](https://gist.github.com/hackingthemarkets/9b55ea8b73c7f4e04b42a9f8eddb8393) and upload it to Google Colab as fed-speech.csv. We'll once again read it into a pandas dataframe.

In [None]:
fed_df = pd.read_csv('fed-speech.csv')
fed_df

Unnamed: 0,text
0,Good afternoon
1,My colleagues and I are strongly committed to ...
2,We have both the tools that we need and the re...
3,Price stability is the responsibility of the F...
4,"Without price stability, the economy does not ..."
5,"In particular, without price stability, we wil..."
6,"Today, the FOMC raised our policy interest rat..."
7,We are moving our policy stance purposefully t...
8,"In addition, we are continuing the process of ..."
9,Restoring price stability will likely require ...


We'll once again calculate the embeddings and save them in a new CSV file.

In [None]:
fed_df['embedding'] = fed_df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
fed_df.to_csv('fed-embeddings.csv')

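We'll then enter the new sentence that we want to find similar sentences for. The run recorded below uses "the inflation is too damn high" from above; another sentence to try is:

"We will continue to increase interest rates and tighten monetary policy"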

In [None]:
fed_sentence = input('Enter something Jerome Powell said: ')


Enter something Jerome Powell said: the inflation is too damn high


Again we'll get the vector for this sentence, find the cosine similarity, and sort by most similar.

In [None]:
fed_sentence_vector = get_embedding(fed_sentence, engine="text-embedding-ada-002")
fed_sentence_vector

[-0.00413066940382123,
 -0.011251280084252357,
 -0.005313646513968706,
 -0.02224256657063961,
 -0.012122263200581074,
 0.0024195776786655188,
 -0.03860924765467644,
 -0.005732887890189886,
 -0.016691673547029495,
 -0.0204096008092165,
 0.022372564300894737,
 0.006987363565713167,
 0.023464541882276535,
 0.006652620155364275,
 0.014026726596057415,
 0.011277279816567898,
 0.0338253416121006,
 0.007643850985914469,
 0.02031860314309597,
 -0.015677694231271744,
 0.0025706999003887177,
 0.011101783253252506,
 -0.0122522609308362,
 -0.0034319330006837845,
 -0.020214606076478958,
 -0.0012877873377874494,
 0.016340680420398712,
 -0.02594749443233013,
 -0.0051089003682136536,
 -0.002343204338103533,
 0.007513853255659342,
 -0.0077023496851325035,
 -0.03166738152503967,
 -0.0024634518194943666,
 -0.020019609481096268,
 -0.03564530611038208,
 -0.013870729133486748,
 -0.016990669071674347,
 -0.0031215641647577286,
 -0.00859933253377676,
 0.026168489828705788,
 -0.010932786390185356,
 0.0133507391

In [None]:
fed_df = pd.read_csv('fed-embeddings.csv')
fed_df['embedding'] = fed_df['embedding'].apply(eval).apply(np.array)
fed_df


Unnamed: 0.1,Unnamed: 0,text,embedding
0,0,Good afternoon,"[-0.017524775117635727, 0.02069251798093319, -..."
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,..."
2,2,We have both the tools that we need and the re...,"[0.003941578324884176, -0.015006175264716148, ..."
3,3,Price stability is the responsibility of the F...,"[0.009378707036376, -0.016561055555939674, -0...."
4,4,"Without price stability, the economy does not ...","[-0.003026996273547411, -0.014454687014222145,..."
5,5,"In particular, without price stability, we wil...","[-0.03618694841861725, -0.008898851461708546, ..."
6,6,"Today, the FOMC raised our policy interest rat...","[-0.024621201679110527, -0.02114815264940262, ..."
7,7,We are moving our policy stance purposefully t...,"[-0.025701606646180153, -0.012234759517014027,..."
8,8,"In addition, we are continuing the process of ...","[-0.03149143233895302, 0.0019273122306913137, ..."
9,9,Restoring price stability will likely require ...,"[-0.010953230783343315, -0.020290518179535866,..."


In [None]:

fed_df["similarities"] = fed_df['embedding'].apply(lambda x: cosine_similarity(x, fed_sentence_vector))

fed_df


Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,Good afternoon,"[-0.017524775117635727, 0.02069251798093319, -...",0.750047
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,...",0.826724
2,2,We have both the tools that we need and the re...,"[0.003941578324884176, -0.015006175264716148, ...",0.770154
3,3,Price stability is the responsibility of the F...,"[0.009378707036376, -0.016561055555939674, -0....",0.775339
4,4,"Without price stability, the economy does not ...","[-0.003026996273547411, -0.014454687014222145,...",0.80408
5,5,"In particular, without price stability, we wil...","[-0.03618694841861725, -0.008898851461708546, ...",0.775005
6,6,"Today, the FOMC raised our policy interest rat...","[-0.024621201679110527, -0.02114815264940262, ...",0.787081
7,7,We are moving our policy stance purposefully t...,"[-0.025701606646180153, -0.012234759517014027,...",0.812895
8,8,"In addition, we are continuing the process of ...","[-0.03149143233895302, 0.0019273122306913137, ...",0.745955
9,9,Restoring price stability will likely require ...,"[-0.010953230783343315, -0.020290518179535866,...",0.790525


In [None]:

fed_df.sort_values("similarities", ascending=False)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
24,24,The recent inflation data again have come in h...,"[-0.021040253341197968, -0.009753845632076263,...",0.871317
22,22,Inflation remains well above our longer run go...,"[-0.023937253281474113, -0.0032772799022495747...",0.869225
31,31,My colleagues and I are acutely aware that hig...,"[-0.011414038017392159, -0.01515731681138277, ...",0.847498
29,29,The longer the current amount of high inflatio...,"[-0.018355058506131172, -0.012731979601085186,...",0.847374
32,32,We are highly attentive to the risks that high...,"[-0.025864068418741226, -0.015762366354465485,...",0.833747
27,27,"Despite elevated inflation, longer term inflat...","[-0.023557519540190697, -0.024205774068832397,...",0.828953
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,...",0.826724
37,37,"It will take time, however, for the full effec...","[-0.02066067047417164, -0.018034202978014946, ...",0.826675
46,46,Reducing inflation is likely to require a sust...,"[-0.03423553332686424, -0.014666956849396229, ...",0.823727
26,26,Russia's war against Ukraine has boosted price...,"[-0.009621184319257736, -0.019101163372397423,...",0.818095


# Calculating Cosine Similarity

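We used the cosine_similarity function, but how does it actually work? Cosine similarity measures how closely two vectors point in the same direction: it is the cosine of the angle between them, computed as the dot product of the two vectors divided by the product of their magnitudes. A value of 1 means the vectors point in the same direction, and a value of 0 means they are perpendicular.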

![](https://drive.google.com/uc?export=view&id=1cehvtx7LKuFeq_LqfnLi-gzIz1D1wSf9)
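
Written out in symbols, the standard formula for the cosine of the angle $\theta$ between two $n$-dimensional vectors $A$ and $B$ is:

$$
\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
$$

The numerator is the dot product and the denominator is the product of the two magnitudes, which are the quantities computed step by step in the cells below.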

In [None]:
v1 = np.array([1,2,3])
v2 = np.array([4,5,6])

# (1 * 4) + (2 * 5) + (3 * 6)
dot_product = np.dot(v1, v2)
dot_product

32

In [None]:
# square root of (1^2 + 2^2 + 3^2) = square root of (1+4+9) = square root of 14
np.linalg.norm(v1)

3.7416573867739413

In [None]:
# square root of (4^2 + 5^2 + 6^2) = square root of (16+25+36) = square root of 77
np.linalg.norm(v2)

8.774964387392123

In [None]:
# product of the two magnitudes: the denominator of the cosine similarity formula
magnitude = np.linalg.norm(v1) * np.linalg.norm(v2)
magnitude

32.83291031876401

In [None]:
dot_product / magnitude

0.9746318461970762

In [None]:
from scipy import spatial

result = 1 - spatial.distance.cosine(v1, v2)

result

0.9746318461970761
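
The steps above can be combined into a single function. This is a minimal sketch built only on numpy; it is not necessarily how `openai.embeddings_utils.cosine_similarity` is implemented.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

similarity = cosine_similarity([1, 2, 3], [4, 5, 6])
```

For `v1` and `v2` this gives the same value as the manual calculation and the scipy result, up to floating-point rounding.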