<a href="https://colab.research.google.com/github/RiccardoRubini93/ML-AI-cookbook/blob/main/text_similarity_with_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install openai tenacity

In [17]:
# Set the OpenAI API key for this session (paste your own key; do not commit it)
import os
os.environ["OPENAI_API_KEY"] = ''
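Because the cell above ships with an empty key, a small guard can fail early with a clear message instead of an authentication error on the first API call. The sketch below is only an illustration: `require_api_key` is a hypothetical helper, not part of the `openai` package, and it is demonstrated on a plain dict so no real key is needed.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; not part of the openai package.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before calling the API")
    return key


# Demonstration with a plain dict standing in for os.environ.
demo_key = require_api_key({"OPENAI_API_KEY": "sk-example"})
print(demo_key)
```

In the notebook itself, calling `require_api_key()` with no arguments right after setting the variable would check the real environment.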

In [3]:
from typing import List

from tenacity import retry, stop_after_attempt, wait_random_exponential

import openai
import numpy as np


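# Retry with random exponential backoff (1 to 20 s), giving up after 6 attempts.
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text: str, model: str = "text-embedding-ada-002", **kwargs) -> List[float]:
    """Return the embedding vector for `text` from the given model."""
    # The default model matches the one passed explicitly in search_reviews.

    # Replace newlines, which can negatively affect performance.
    text = text.replace("\n", " ")

    response = openai.embeddings.create(input=[text], model=model, **kwargs)

    return response.data[0].embedding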

In [15]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
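As a quick sanity check on the function above, the sketch below applies the same formula to a few hand-picked 2-D vectors. The vectors are illustrative values, not real embeddings: vectors pointing the same way should score close to 1, orthogonal vectors close to 0, and opposite vectors close to -1.

```python
import numpy as np


def cosine_similarity(a, b):
    # Same formula as the notebook cell: dot product over the product of norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Toy vectors chosen by hand for illustration only.
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine_similarity(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))

print(same_direction, orthogonal, opposite)
```

Note that the score depends only on direction: scaling `[1, 2]` to `[2, 4]` leaves the similarity unchanged.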

In [12]:
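import pandas as pd
import numpy as np
from ast import literal_eval

datafile_path = "data/fine_food_reviews_with_embeddings_1k.csv"

# Load only the first 10 rows to keep the demo small.
df = pd.read_csv(datafile_path, sep=",", nrows=10)

# The CSV stores each embedding as a string; parse it into a NumPy array.
df["embedding"] = df.embedding.apply(literal_eval).apply(np.array)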
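To see why the `literal_eval` step is needed, the sketch below runs the same parsing on a tiny in-memory CSV. The two rows and their values are made up for illustration and stand in for the real data file; the point is that `read_csv` returns the embedding column as text, and parsing turns it into numeric arrays.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd

# Hypothetical two-row CSV standing in for the real data file; the
# embedding column holds Python-style list literals stored as text.
csv_text = (
    "Summary,embedding\n"
    'first,"[0.1, -0.2, 0.3]"\n'
    'second,"[0.0, 0.5, -0.5]"\n'
)

demo = pd.read_csv(io.StringIO(csv_text))
raw_type = type(demo.embedding.iloc[0])  # still a string at this point

# literal_eval parses the text into a list; np.array makes it numeric.
demo["embedding"] = demo.embedding.apply(literal_eval).apply(np.array)

print(raw_type, type(demo.embedding.iloc[0]), demo.embedding.iloc[0].shape)
```

`literal_eval` only accepts Python literals, so it is a safer choice than `eval` for data read from a file.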

In [18]:
df.head(10)

Unnamed: 0,ProductId,UserId,Score,Summary,Text,combined,n_tokens,embedding
0,B003XPF9BO,A3R7JR3FMEBXQB,5,where does one start...and stop... with a tre...,Wanted to save some to bring to my Chicago fam...,Title: where does one start...and stop... wit...,52,"[0.007018072064965963, -0.02731654793024063, 0..."
1,B003VXHGPK,A21VWSCGW7UUAR,4,Good; but not Wolfgang Puck good,"Honestly, I have to admit that I expected a li...","Title: Good, but not Wolfgang Puck good; Conte...",178,"[-0.003140551969408989, -0.009995664469897747,..."
2,B008JKTTUA,A34XBAIFT02B60,1,Should advertise coconut as an ingredient more...,"First, these should be called Mac - Coconut ba...",Title: Should advertise coconut as an ingredie...,78,"[-0.01757248118519783, -8.266511576948687e-05,..."
3,B000LKTTTW,A14MQ40CCU8B13,5,Best tomato soup,I have a hard time finding packaged food of an...,Title: Best tomato soup; Content: I have a har...,111,"[-0.0013932279543951154, -0.011112828738987446..."
4,B001D09KAM,A34XBAIFT02B60,1,Should advertise coconut as an ingredient more...,"First, these should be called Mac - Coconut ba...",Title: Should advertise coconut as an ingredie...,78,"[-0.01757248118519783, -8.266511576948687e-05,..."
5,B001D09KAM,A1XV4W7JWX341C,5,Loved these gluten free healthy bars; saved $$...,These Kind Bars are so good and healthy & glut...,"Title: Loved these gluten free healthy bars, s...",96,"[-0.002289338270202279, -0.01313735730946064, ..."
6,B002JA06Z8,A3ESIUM1JTR7KK,5,These fresh berries are truly MIRACULOUS!!!,I have ordered from Ethans on three separate o...,Title: These fresh berries are truly MIRACULOU...,98,"[-0.015488614328205585, -0.035269252955913544,..."
7,B002HQNCBO,A1UW65ZMZ3UWD3,5,Baconnaise,If you are a fan of bacon you're going to like...,Title: Baconnaise; Content: If you are a fan o...,44,"[-0.020532963797450066, 5.7461649703327566e-05..."
8,B008JKTTUA,A1XV4W7JWX341C,5,Loved these gluten free healthy bars; saved $$...,These Kind Bars are so good and healthy & glut...,"Title: Loved these gluten free healthy bars, s...",96,"[-0.002289338270202279, -0.01313735730946064, ..."
9,B0048GRNZM,AXG287OY16WWL,1,Cute,"For some reason I thought that you got three ""...",Title: Cute; Content: For some reason I though...,77,"[-0.021329568699002266, -0.0003809756308328360..."


In [25]:
# Search the reviews for the ones most similar to a product description
def search_reviews(df, product_description, n=3, pprint=True):
    # Embed the query text
    product_embedding = get_embedding(
        product_description,
        model="text-embedding-ada-002"
    )
    # Note: this adds a "similarity" column to the DataFrame that was passed in
    df["similarity"] = df.embedding.apply(lambda x: cosine_similarity(x, product_embedding))

    # Keep the n most similar reviews and strip the "Title:"/"Content:" labels
    results = (
        df.sort_values("similarity", ascending=False)
        .head(n)
        .combined.str.replace("Title: ", "", regex=False)
        .str.replace("; Content:", ": ", regex=False)
    )
    if pprint:
        for r in results:
            print(r)
            print()
    return results, df


results, df = search_reviews(df, "I love bacon", n=3)

Baconnaise:  If you are a fan of bacon you're going to like this stuff. J & D makes good products. I will buy again! Makes a killer addition to a BLT.

where does one  start...and stop... with a treat like this:  Wanted to save some to bring to my Chicago family but my North Carolina family ate all 4 boxes before I could pack. These are excellent...could serve to anyone

These fresh berries are truly MIRACULOUS!!!:  I have ordered from Ethans on three separate occasions and I couldn't be happier. The berries have always arrived fresh and exactly on the day I chose. The "sweet" effect usually last a good 40 minutes and it's a blast. IT TRULY WORKS! All of my family and friends have been absolutely amazed by it. Don't hesitate to try these, you will not regret it!



In [26]:
results

7    Baconnaise:  If you are a fan of bacon you're ...
0    where does one  start...and stop... with a tre...
6    These fresh berries are truly MIRACULOUS!!!:  ...
Name: combined, dtype: object
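The ranking step in `search_reviews` can also be tried without any API call. The sketch below is one possible variant, not the notebook's own method: it stacks the embeddings into a matrix, scores every row against the query in a single NumPy operation, and returns a copy instead of adding a column to the caller's DataFrame. The 3-D vectors, the review labels, and the name `rank_by_similarity` are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D embeddings standing in for real model vectors.
toy = pd.DataFrame(
    {
        "combined": ["bacon spread", "tomato soup", "berry tablets"],
        "embedding": [
            np.array([0.9, 0.1, 0.0]),
            np.array([0.1, 0.9, 0.1]),
            np.array([0.0, 0.2, 0.9]),
        ],
    }
)
query = np.array([1.0, 0.0, 0.0])  # stand-in for the query embedding


def rank_by_similarity(frame, query_vec, n=3):
    # Stack the per-row arrays into one (rows, dims) matrix.
    matrix = np.vstack(frame["embedding"].tolist())
    # Cosine similarity of every row against the query in one pass.
    scores = matrix @ query_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    )
    # assign() returns a copy, so the caller's DataFrame is left unchanged.
    ranked = frame.assign(similarity=scores)
    return ranked.sort_values("similarity", ascending=False).head(n)


top = rank_by_similarity(toy, query, n=2)
print(top[["combined", "similarity"]])
```

With real data, `query` would come from `get_embedding` and `frame` would be the reviews DataFrame loaded earlier.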