Semantic text search using embeddings
We can search all of our documents semantically, efficiently and at low cost, by embedding the search query and then finding the documents whose embeddings are most similar to it.

1. Imports
First, let's import the packages and functions we'll need for later. If you don't have these, you'll need to install them. You can install them via your terminal by running pip install {package_name}, e.g. pip install pandas.

In [3]:
#pip install openai

In [5]:
#pip install transformers

In [17]:
import os
import pickle
from pathlib import Path

import openai
import pandas as pd
from transformers import AutoTokenizer

# Read the API key from the environment instead of hard-coding a secret in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

from openai.embeddings_utils import (
    get_embedding,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)

2. Load data
Next, let's load the AG news data and see what it looks like.

In [7]:
dataset_path = "https://cdn.openai.com/API/examples/data/AG_news_samples.csv"
df = pd.read_csv(dataset_path)

# print dataframe
n_examples = 5
df.head(n_examples)


Unnamed: 0,title,description,label_int,label
0,World Briefings,BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime M...,1,World
1,Nvidia Puts a Firewall on a Motherboard (PC Wo...,PC World - Upcoming chip set will include buil...,4,Sci/Tech
2,"Olympic joy in Greek, Chinese press",Newspapers in Greece reflect a mixture of exhi...,2,Sports
3,U2 Can iPod with Pictures,"SAN JOSE, Calif. -- Apple Computer (Quote, Cha...",4,Sci/Tech
4,The Dream Factory,"Any product, any shape, any size -- manufactur...",4,Sci/Tech


In [8]:
# print the title, description, and label of each example
for idx, row in df.head(n_examples).iterrows():
    print("")
    print(f"Title: {row['title']}")
    print(f"Description: {row['description']}")
    print(f"Label: {row['label']}")


Title: World Briefings
Description: BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the  quot;alarming quot; growth of greenhouse gases.
Label: World

Title: Nvidia Puts a Firewall on a Motherboard (PC World)
Description: PC World - Upcoming chip set will include built-in security features for your PC.
Label: Sci/Tech

Title: Olympic joy in Greek, Chinese press
Description: Newspapers in Greece reflect a mixture of exhilaration that the Athens Olympics proved successful, and relief that they passed off without any major setback.
Label: Sports

Title: U2 Can iPod with Pictures
Description: SAN JOSE, Calif. -- Apple Computer (Quote, Chart) unveiled a batch of new iPods, iTunes software and promos designed to keep it atop the heap of digital music players.
Label: Sci/Tech

Title: The Dream Factory
Description: Any product, any shape, any size -- manufactured o

3. Build cache to save embeddings
Before getting embeddings for these articles, let's set up a cache to save the embeddings we generate. In general, it's a good idea to save your embeddings so you can re-use them later. If you don't save them, you'll pay for the API call again every time you recompute them.

In [22]:
# The cache was not completed in time; a simpler method (saving the embeddings to a CSV file) is used below.
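
For reference, a cache of the kind described above could be sketched as follows. This is a minimal sketch, not the notebook's own method: the file name `embeddings_cache.pkl` and the helpers `load_cache` and `cached_embedding` are hypothetical, and the embedding function is passed in as `embed_fn` so the sketch does not assume any particular API.

```python
import pickle
from pathlib import Path
from typing import Callable, Dict, List, Tuple

# Hypothetical cache location; not part of the original notebook.
CACHE_PATH = Path("embeddings_cache.pkl")


def load_cache(path: Path = CACHE_PATH) -> Dict[Tuple[str, str], List[float]]:
    """Load the cache from disk, or start with an empty one if no file exists."""
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    return {}


def cached_embedding(
    text: str,
    model: str,
    cache: Dict[Tuple[str, str], List[float]],
    embed_fn: Callable[[str, str], List[float]],
    path: Path = CACHE_PATH,
) -> List[float]:
    """Return the embedding for (text, model), computing it only on a cache miss."""
    key = (text, model)
    if key not in cache:
        cache[key] = embed_fn(text, model)
        # Persist after every miss so an interrupted run keeps its progress.
        with path.open("wb") as f:
            pickle.dump(cache, f)
    return cache[key]
```

With the `get_embedding` helper imported earlier, `embed_fn` could be `lambda text, model: get_embedding(text, engine=model)`. Writing the whole pickle after every miss is simple but slow for large caches; saving once at the end of a batch is a reasonable alternative.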

4. Load the tokenizer used for preprocessing text data


In [14]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")


In [15]:
# count the tokens in each description so that overly long texts can be filtered out
df['n_tokens'] = df['description'].apply(lambda x: len(tokenizer.encode(x)))

In [20]:

# keep descriptions shorter than 8000 tokens, then take the first 40 rows
df = df[df.n_tokens < 8000][:40]
len(df)

40

In [21]:
from openai.embeddings_utils import get_embedding
# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage

# This can take between 5 and 10 minutes
df['ada_similarity'] = df.description.apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))

In [23]:
# Same model and same text as 'ada_similarity', so reuse those embeddings instead of paying for them twice
df['ada_search'] = df['ada_similarity']


In [24]:
# To save you the expense of recomputing the embeddings needed for this demo, they are saved to disk below

Save the embeddings to a CSV file and load them back

In [163]:
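import ast

import numpy as np

datafile_path = "dataset_path_text.csv"

# save the dataframe, including the embeddings, without the index column
df.to_csv(datafile_path, index=False)

df = pd.read_csv(datafile_path)

# the CSV stores each embedding as a string; parse it back into a NumPy array
# (ast.literal_eval only parses literals, so it is safer than eval)
df["ada_search"] = df.ada_search.apply(ast.literal_eval).apply(np.array)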
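
To see why the parsing step after `read_csv` is needed, here is a small sketch with toy data. The 3-dimensional vectors are invented for illustration, an in-memory buffer stands in for the CSV file, and `ast.literal_eval` is used in place of `eval` as a safer substitute.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame with made-up 3-dimensional "embeddings" (real ones are much longer).
toy = pd.DataFrame(
    {
        "description": ["first article", "second article"],
        "ada_search": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    }
)

# Write to an in-memory buffer instead of a file to keep the example self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# After reading, each embedding is a plain string such as "[0.1, 0.2, 0.3]".
raw_type = type(loaded.ada_search.iloc[0])

# literal_eval parses the string as a Python literal without executing code.
loaded["ada_search"] = loaded.ada_search.apply(ast.literal_eval).apply(np.array)
```

CSV is convenient for a small demo like this one, but every embedding is stored as text and has to be re-parsed on load.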

Final step: search. Embed the query with the same model that was used for the documents (here text-embedding-ada-002 for both; some older embedding models used separate document and query engines). We then compare the cosine similarity between the query embedding and each document embedding, and show the top n matches.

In [25]:
from openai.embeddings_utils import get_embedding, cosine_similarity


# search the article descriptions for those most similar to a text query
def search_reviews(df, query, n=3, pprint=True):
    # embed the query with the same model that was used for the documents
    embedding = get_embedding(query, engine="text-embedding-ada-002")
    df["similarities"] = df.ada_search.apply(lambda x: cosine_similarity(x, embedding))

    # keep the descriptions of the n most similar articles
    res = df.sort_values("similarities", ascending=False).head(n).description
    if pprint:
        for r in res:
            print(r[:500])  # show at most the first 500 characters
            print()
    return res
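
The ranking step itself needs no API call. As a rough sketch, assuming the embeddings are already available as NumPy arrays, the same top-n selection can be written with a single matrix product. The helper name `top_n_by_cosine` and the 2-dimensional toy vectors are made up for illustration.

```python
import numpy as np


def top_n_by_cosine(query_vec, doc_matrix, n=3):
    """Return (indices, scores) of the n rows of doc_matrix most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms.
    scores = doc_matrix @ query_vec / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    )
    # argsort is ascending, so reverse it and keep the first n indices.
    order = np.argsort(scores)[::-1][:n]
    return order, scores[order]


# Toy 2-dimensional "embeddings"; the values are invented for illustration.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])
indices, scores = top_n_by_cosine(query, docs, n=2)
```

Here the query points almost along the first axis, so row 0 ranks first and row 2, which shares that direction in part, ranks second.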


In [26]:
res = search_reviews(df, "Israel mercenaries", n=3)


GAZA CITY, Gaza Strip: Hamas militants killed an Israeli soldier and wounded four with an explosion in a booby-trapped chicken coop on Tuesday, in what the Islamic group said was an elaborate scheme to lure troops to the area with the help of a double 

Egypt #39;s release of accused Israeli spy Azzam Azzam in an apparent swap for six Egyptian students held on suspicion of terrorism is expected to melt the ice and perhaps result 

PALESTINIAN leader Yasser Arafat today issued an urgent call for the immediate release of two French journalists taken hostage in Iraq.



In [27]:
res = search_reviews(df, "Vice President", n=3)


United Arab Emirates President and ruler of Abu Dhabi Sheik Zayed bin Sultan al-Nayhan died Tuesday, official television reports. He was 86.

Julia Gillard has reportedly bowed out of the race to become shadow treasurer, taking enormous pressure off Opposition Leader Mark Latham.

Reuters - Palestinian leader Mahmoud Abbas called\Israel "the Zionist enemy" Tuesday, unprecedented language for\the relative moderate who is expected to succeed Yasser Arafat.

