# Playing with embeddings

In this notebook, I create a small set of facts, embed each one (using the `text-embedding-ada-002` model),
and use cosine similarity to find the facts closest to a new piece of text.

The [openai-cookbook](https://github.com/openai/openai-cookbook/tree/main) has useful notebooks.

# Import & set params

In [5]:
from pathlib import Path
import os
import pandas as pd
import tiktoken

import nltk
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')  # Run once to download the tokenizer data NLTK needs

# openai.embeddings_utils and openai.Embedding (used below) exist only in openai<1.0
from openai.embeddings_utils import cosine_similarity  # get_embedding is defined further down
import openai
openai.api_key = "sk-..." # <-------- ENTER HERE THE API KEY

I used this model:

- Model name: `text-embedding-ada-002`
- Tokenizer: `cl100k_base`
- Max input tokens: `8191`
- Output dimensions: `1536` (i.e., dimensions in the embedding)

In [2]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
embedding_encoding = "cl100k_base"  # encoding for text-embedding-ada-002
max_tokens = 8000  # max input token for text-embedding-ada-002 is 8191

# Create data

The text is taken from the [BAMF Migration Report 2020](https://www.bamf.de/SharedDocs/Anlagen/EN/Forschung/Migrationsberichte/migrationsbericht-2020.html).

In [33]:
file_path = Path.cwd() / "data" / "text_example_migration_report.txt"
# Assumes the file is UTF-8 encoded; an explicit encoding avoids platform-dependent defaults
content = file_path.read_text(encoding="utf-8")

In [38]:
content[:100]

'Net migration to Germany has been declining continuously since 2016. The outbreak of the COVID-19 pa'

`sent_tokenize()` splits the text into sentences and tries to handle periods that do not end a sentence, such as those in abbreviations, initials, or decimal numbers.

NOTE: **It is NOT 100% accurate!** For example, one sentence in the results further down is cut off after "(incl.".
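
To see why periods are tricky, here is a small sketch that compares a naive splitter with one that knows a few abbreviations. The sample text, the abbreviation list, and the helper names `naive_split` and `split_with_abbreviations` are all made up for illustration; this is not how NLTK works internally, only a toy version of the problem it tries to solve.

```python
import re

# Made-up sample text with a decimal number and two abbreviations
text = "Costs rose by 3.5 percent in all regions (incl. the north). Dr. Lee was surprised."

# Hypothetical, deliberately tiny abbreviation list
ABBREVIATIONS = {"incl.", "dr."}


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text.strip())


def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Merge naive pieces back together when a piece ends with a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        if not piece:
            continue
        buffer = f"{buffer} {piece}".strip()
        last_word = buffer.split()[-1].lower().lstrip("(")
        if last_word in abbreviations:
            continue  # not a real sentence end, keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


print(naive_split(text))               # splits at "incl." and "Dr." as well
print(split_with_abbreviations(text))  # keeps the two sentences intact
```

The decimal number survives even the naive version because its period is not followed by whitespace. Any abbreviation missing from the list still causes a wrong split, which is the kind of error to expect from any sentence tokenizer on unusual text.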

In [30]:
sentences = sent_tokenize(content)

In [31]:
for sentence in sentences:
    print(sentence)

Net migration to Germany has been declining continuously since 2016.
The outbreak of the COVID-19 pandemic has further intensified this trend.
As a result of the global travel restrictions that were caused by the pandemic, the decline in migration was particularly noticeable from March 2020 onwards.
In 2020, a total of 1,186,702 arrivals and 966,451 departures were recorded, so that immigration to Germany decreased by 23.9 percent, and emigration fell by 21.5 percent in comparison to 2019.
These developments culminated in net migration of +220,251 persons.
This was a significantly lower value than in 2019 (+327,060 persons).
More or less pronounced declines were therefore also shown in the individual forms of migration.
Migration to and from Germany continues to be especially characterised by arrivals from and departures to other European countries.
This meant that 69.1 percent of all immigrants came to Germany from another European country in 2020 (66.4 percent in 2019), 54.6 percent 

# Create embeddings

In [44]:
sentences_embedding = openai.Embedding.create(
    input=sentences,
    model=embedding_model)

In [45]:
sentences_embedding.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [47]:
print(f'Object: {sentences_embedding["object"]}')
print(f'\nModel: {sentences_embedding["model"]}')
print(f'\nUsage:\n{sentences_embedding["usage"]}')

Object: list

Model: text-embedding-ada-002-v2

Usage:
{
  "prompt_tokens": 498,
  "total_tokens": 498
}


From the embedding model we get `N` vectors, where `N` is the number of sentences.

In [49]:
len(sentences)

18

In [48]:
len(sentences_embedding["data"])

18

All vectors have the same length, which is the model's output dimension (`1536`).

In [51]:
first_N_sentences = 10 # N=10

for i in range(first_N_sentences):
    print(f'Vector {i} - {len(sentences_embedding["data"][i]["embedding"])}')

Vector 0 - 1536
Vector 1 - 1536
Vector 2 - 1536
Vector 3 - 1536
Vector 4 - 1536
Vector 5 - 1536
Vector 6 - 1536
Vector 7 - 1536
Vector 8 - 1536
Vector 9 - 1536


First 10 dimensions of the first vector

In [52]:
sentences_embedding["data"][0]["embedding"][:10]

[-0.019956665113568306,
 -0.02068890631198883,
 0.02489618770778179,
 -0.01648162305355072,
 -0.03591703251004219,
 0.005525935906916857,
 -0.03815098851919174,
 0.006354360841214657,
 -0.0028250222094357014,
 -0.026435134932398796]
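
The response structure explored above can be mimicked offline, which is handy for testing code without calling the API. The sketch below uses a made-up response with 3-dimensional vectors instead of 1536. The top-level keys match the ones printed earlier; the per-item `"object"` and `"index"` fields are my assumption about the item layout, and `extract_vectors` is a hypothetical helper.

```python
# Made-up stand-in for the API response (real vectors have 1536 dimensions)
mock_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]},
        {"object": "embedding", "index": 1, "embedding": [0.0, -0.4, 0.5]},
    ],
    "model": "text-embedding-ada-002-v2",
    "usage": {"prompt_tokens": 12, "total_tokens": 12},
}
inputs = ["first sentence", "second sentence"]


def extract_vectors(response):
    """Return the embedding vectors, ordered by each item's "index" field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


vectors = extract_vectors(mock_response)
dimensions = {len(vector) for vector in vectors}

print(len(vectors) == len(inputs))  # one vector per input text
print(dimensions)                   # a single value means all vectors share one length
```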

Create a DataFrame with the sentences and the embeddings

In [57]:
df = pd.DataFrame(sentences, columns=["Sentences"])
df.head(5)

| | Sentences |
|---|---|
| 0 | Net migration to Germany has been declining co... |
| 1 | The outbreak of the COVID-19 pandemic has furt... |
| 2 | As a result of the global travel restrictions ... |
| 3 | In 2020, a total of 1,186,702 arrivals and 966... |
| 4 | These developments culminated in net migration... |


In [60]:
# Collect one embedding vector per sentence, in the same order as the sentences
embeddings_to_df = [item["embedding"] for item in sentences_embedding["data"]]

len(embeddings_to_df)

18

In [62]:
df["embedding"] = embeddings_to_df

In [64]:
df.head(5)

| | Sentences | embedding |
|---|---|---|
| 0 | Net migration to Germany has been declining co... | [-0.019956665113568306, -0.02068890631198883, ... |
| 1 | The outbreak of the COVID-19 pandemic has furt... | [-0.007713058032095432, -0.020636092871427536,... |
| 2 | As a result of the global travel restrictions ... | [-0.02250930666923523, -0.02766112983226776, 0... |
| 3 | In 2020, a total of 1,186,702 arrivals and 966... | [0.00510792713612318, -0.02350330352783203, 0.... |
| 4 | These developments culminated in net migration... | [-0.006019430700689554, -0.0071343788877129555... |


# Search for similarity

In [65]:
def get_embedding(text, model):
    """
    Return the embedding of a text for a given model.
    """
    text = text.replace("\n", " ")  # replace newlines with spaces before embedding
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]


def search_reviews(df: pd.DataFrame, product_description: str, n: int = 3):
    """
    Rank the rows of df by cosine similarity to a query text.

    Despite the names, the "reviews" here are the sentences in df and the
    "product description" is the text to evaluate.

    Args:
        df (DataFrame): A pandas dataframe with one embedding per row in the column "embedding".
        product_description (string): Text to compare against every row.
        n (int): Number of most similar rows to return.

    Returns:
        DataFrame: The n most similar rows, sorted by the added "similarities" column.
        Note that the column is also added to df in place.
    """
    embedding = get_embedding(product_description, model=embedding_model)
    df["similarities"] = df.embedding.apply(lambda x: cosine_similarity(x, embedding))
    res = df.sort_values("similarities", ascending=False).head(n)
    return res
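
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. Below is a minimal pure-Python sketch of that formula and of a top-n ranking over made-up 2-dimensional vectors. The names `cosine`, `top_n`, and the toy data are mine, not part of the `openai` package, and this is an illustration of the idea rather than the library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vec, candidates, n=3):
    """Return the n (score, label) pairs with the highest similarity to query_vec."""
    scored = [(cosine(vec, query_vec), label) for label, vec in candidates.items()]
    return sorted(scored, reverse=True)[:n]


# Made-up 2-dimensional "embeddings"
toy = {
    "same direction": [2.0, 0.0],
    "45 degrees": [1.0, 1.0],
    "orthogonal": [0.0, 3.0],
}
query = [1.0, 0.0]

ranked = top_n(query, toy, n=2)
for score, label in ranked:
    print(f"{score:.3f}  {label}")
```

The vector `[2.0, 0.0]` scores the same as `[1.0, 0.0]` would, because only the direction matters and not the length.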

# Evaluate similarity

In [83]:
def return_more_similar_sentences(text_to_evaluate: str, n: int = 3):
    """
    Print the "n" sentences most similar (by cosine similarity) to "text_to_evaluate".

    Args:
        text_to_evaluate (string): Text to evaluate against the embeddings in the dataframe "df".
        n (int): Number of most similar sentences to print.

    Returns:
        None. The n most similar sentences and their similarity scores are printed.
    """
    res = search_reviews(df, text_to_evaluate, n=n)

    for i in range(res.shape[0]):
        print("="*70)
        print(f'Similarity: {res.iloc[i].loc["similarities"]:.3f}')
        print(f'Sentence:\n{res.iloc[i].loc["Sentences"]}')

In [84]:
return_more_similar_sentences('There are too many immigrants in Germany', 3)

Similarity: 0.869
Sentence:
Net migration to Germany has been declining continuously since 2016.
Similarity: 0.852
Sentence:
Migration to and from Germany continues to be especially characterised by arrivals from and departures to other European countries.
Similarity: 0.849
Sentence:
This meant that 69.1 percent of all immigrants came to Germany from another European country in 2020 (66.4 percent in 2019), 54.6 percent of them from EU Member States (incl.


In [86]:
return_more_similar_sentences('Immigrants are taking our jobs', 3)

Similarity: 0.801
Sentence:
These developments culminated in net migration of +220,251 persons.
Similarity: 0.792
Sentence:
Net migration to Germany has been declining continuously since 2016.
Similarity: 0.790
Sentence:
As a result of the global travel restrictions that were caused by the pandemic, the decline in migration was particularly noticeable from March 2020 onwards.


In [87]:
return_more_similar_sentences('The number of immigrants is increasing', 3)

Similarity: 0.861
Sentence:
These developments culminated in net migration of +220,251 persons.
Similarity: 0.844
Sentence:
The number of asylum applications reflects the ongoing decline in forced migration: The number of first-time applications fell from 722,370 to 142,509 in the period 2016 to 2019 (-80.3 per cent).
Similarity: 0.844
Sentence:
Migration to and from Germany continues to be especially characterised by arrivals from and departures to other European countries.


In [88]:
return_more_similar_sentences('Most of immigrants come from Ukraine', 3)

Similarity: 0.833
Sentence:
This meant that 69.1 percent of all immigrants came to Germany from another European country in 2020 (66.4 percent in 2019), 54.6 percent of them from EU Member States (incl.
Similarity: 0.825
Sentence:
Migration to and from Germany continues to be especially characterised by arrivals from and departures to other European countries.
Similarity: 0.824
Sentence:
About two-thirds of those emigrating moved from Germany to another European country in 2020 (67.4 percent; 67.2 percent in 2019); 55.7 percent migrated to other EU Member States including the United Kingdom (56.0 percent in 2019).


In [90]:
return_more_similar_sentences('The percentage of immigrant is higher than ever', 3)

Similarity: 0.849
Sentence:
This meant that 69.1 percent of all immigrants came to Germany from another European country in 2020 (66.4 percent in 2019), 54.6 percent of them from EU Member States (incl.
Similarity: 0.842
Sentence:
These developments culminated in net migration of +220,251 persons.
Similarity: 0.832
Sentence:
102,581 people applied for asylum for the first time, this being 28.0 percent fewer than in 2019.


In [91]:
return_more_similar_sentences('The number of immigrant is higher now than in 2019', 3)

Similarity: 0.866
Sentence:
This brought the number of cross-border first-time asylum applications in 2020 to 76,061 (2019: 111,094).
Similarity: 0.858
Sentence:
The number of asylum applicants thus fell below the 2013 level (109,580 first-time applications).
Similarity: 0.856
Sentence:
102,581 people applied for asylum for the first time, this being 28.0 percent fewer than in 2019.
