# Embedding-Based Search

In this notebook, we use embeddings and nearest neighbor search to recommend news articles. We convert features of each article into an embedding vector, then use similarity search to find the vectors closest to a given article's embedding. The articles behind those vectors are the similar, relevant ones we recommend.
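
To make the idea concrete before calling a real model, here is a minimal sketch with hand-made 3-dimensional vectors. The titles and numbers are invented for illustration (real embedding models return far more dimensions), and `toy_embeddings` is a hypothetical name, not part of the notebook.

```python
import numpy as np

# Hypothetical toy "embeddings": invented values, only for illustration.
toy_embeddings = {
    "Stocks rally after rate cut": np.array([0.9, 0.1, 0.0]),
    "Markets climb on central bank news": np.array([0.8, 0.2, 0.1]),
    "Easy weeknight pasta recipe": np.array([0.0, 0.1, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = "Stocks rally after rate cut"

# Score every other title against the query and keep the best one.
scores = {
    title: cosine_similarity(toy_embeddings[query], vector)
    for title, vector in toy_embeddings.items()
    if title != query
}
best_match = max(scores, key=scores.get)
print(best_match)
```

The finance headline wins because its vector points in nearly the same direction as the query's, while the recipe vector is almost orthogonal to it.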

In [1]:
import openai
import pandas as pd
import re
import pickle


We use the [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset) from Kaggle. Download it and save it as `data/News_Category_Dataset_v3.json`, relative to this notebook's directory, which is the path the next cell reads.

In [3]:
df = pd.read_json('data/News_Category_Dataset_v3.json', lines=True)

In [4]:
df.head()

|   | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


There are several columns, and we only need three of them: the headline, the short description, and the category.

In [5]:
df = df[["headline", "short_description", "category"]].dropna()

In [6]:
df.head()

|   | headline | short_description | category |
|---|---|---|---|
| 0 | Over 4 Million Americans Roll Up Sleeves For O... | Health experts said it is too early to predict... | U.S. NEWS |
| 1 | American Airlines Flyer Charged, Banned For Li... | He was subdued by passengers and crew when he ... | U.S. NEWS |
| 2 | 23 Of The Funniest Tweets About Cats And Dogs ... | "Until you have a dog you don't understand wha... | COMEDY |
| 3 | The Funniest Tweets From Parents This Week (Se... | "Accidentally put grown-up toothpaste on my to... | PARENTING |
| 4 | Woman Who Called Cops On Black Bird-Watcher Lo... | Amy Cooper accused investment firm Franklin Te... | U.S. NEWS |


In [7]:
len(df)

209527

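With more than 200,000 articles, the dataset is far larger than we need for a demo, and embedding all of it would cost time and money. Let's work with a small random sample for convenience.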

In [8]:
df = df.sample(250)

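Clean the text with regular expressions: replace newlines and `|` with spaces, spell out `&` as "and", collapse repeated whitespace, and strip punctuation.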

In [9]:
def clean_text(text):
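    text = re.sub(r"\n", " ", text)
    text = re.sub(r"&", " and ", text)
    text = re.sub(r"\|", " ", text)
    # Remove all remaining punctuation (anything that is not a word character or whitespace)
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse whitespace last, so removed punctuation cannot leave double spaces behind
    text = re.sub(r"\s+", " ", text)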
    return text.strip()

df["headline"] = df["headline"].apply(clean_text)
df["short_description"] = df["short_description"].apply(clean_text)
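
To see what the cleaning does to a single string, here is a self-contained variant of the same steps using Python's standard `re` module (with whitespace collapsed as the final step). The sample inputs are made up for illustration.

```python
import re


def clean_text(text):
    """Self-contained variant of the cleaning steps, for experimentation."""
    text = re.sub(r"\n", " ", text)       # newlines become spaces
    text = re.sub(r"&", " and ", text)    # spell out ampersands
    text = re.sub(r"\|", " ", text)       # pipes become spaces
    text = re.sub(r"[^\w\s]", "", text)   # delete remaining punctuation
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


sample = "Q&A | What's new?\nRead on..."
cleaned_sample = clean_text(sample)
print(cleaned_sample)

# Punctuation is deleted, not replaced by a space, so hyphenated words fuse.
cleaned_hyphen = clean_text("Nineteen-year-old refugee")
print(cleaned_hyphen)
```

The second call shows why fused words such as "Nineteenyearold" can appear in cleaned text. If that matters for your data, one option is to replace hyphens with spaces before stripping punctuation.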

In [10]:
for _, row in df.head(5).iterrows():
    print("Headline:", row["headline"])
    print("Category:", row["category"])
    print("About:", row["short_description"])
    print()

Headline: 15 Forgotten Landmarks In New York City PHOTOS
Category: TRAVEL
About: Kevin Walsh has made it his mission to chronicle hundreds of forgotten landmarks all over the five boroughsmost of which

Headline: A Teenage Syrian Refugee On A Mission To Educate Her Generation
Category: WORLD NEWS
About: Nineteenyearold refugee education campaigner Muzoon Almellehan has become the youngestever UNICEF goodwill ambassador

Headline: Chinas First Domestic Violence Law Still Needs Work Say Activists
Category: THE WORLDPOST
About: In December 2015 the Chinese government passed the countrys landmark first bill against domestic violence But one year

Headline: Afghan Officials Report Major Gains In Kunduz After Push By Taliban
Category: THE WORLDPOST
About: But social media accounts linked to the Taliban indicate fighting ongoing

Headline: Recipe Of The Day Angel Food Cake
Category: FOOD & DRINK
About: With orange essence and fresh fruit on top



Which features of the news articles should we use to recommend similar articles? A combination of the headline and description is a good start. Articles whose headline and description are semantically similar are probably relevant to one another.

In [None]:
# Create a new column that concatenates headline and short_description.
# This combined text is the input to the embedding model.
df["text"] = df["headline"] + " " + df["short_description"]

In [None]:
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023

client = openai.OpenAI()

def get_embedding(text: str, model: str = EMBEDDING_MODEL):
    # print(text)
    return client.embeddings.create(input = [text], model=model).data[0].embedding

In [None]:
import numpy as np

In [None]:
h1 = "Tech Giant Announces Groundbreaking AI Advancements in Automation"
h2 = "Leading Tech Corporation Unveils Revolutionary Developments in AI Technology"

print(np.array(get_embedding(h1)) - np.array(get_embedding(h2)))

In [None]:
# Establish a cache of embeddings to avoid recomputing - saves time and money
# Cache is a dict of tuples (text, model) -> embedding, saved as a pickle file

# Set path to embedding cache
embedding_cache_path = "recommendations_embeddings_cache.pkl"

# Load the cache if it exists (otherwise start empty), then save a copy to disk
try:
    with open(embedding_cache_path, "rb") as embedding_cache_file:
        embedding_cache = pickle.load(embedding_cache_file)
except FileNotFoundError:
    embedding_cache = {}
with open(embedding_cache_path, "wb") as embedding_cache_file:
    pickle.dump(embedding_cache, embedding_cache_file)

def embedding_from_string(
    string: str,
    model: str = EMBEDDING_MODEL,
    embedding_cache=embedding_cache
) -> list:
    # Return the embedding of the given string, using a cache to avoid recomputing.
    # On a cache miss, compute the embedding, store it, and rewrite the pickle file.
    if (string, model) not in embedding_cache:
        embedding_cache[(string, model)] = get_embedding(string, model)
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(string, model)]
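
The caching pattern can be tried without spending API credits. The sketch below swaps the real API call for a stand-in, `fake_embedding`, which is a hypothetical helper invented here that just records how often it is called. The cache lives in memory only and is not written to a pickle file.

```python
calls = []  # records every time the (stand-in) embedding function runs


def fake_embedding(text, model):
    """Hypothetical stand-in for the API call; returns a dummy 2-d vector."""
    calls.append(text)
    return [float(len(text)), 0.0]


cache = {}  # maps (text, model) -> embedding, as in the notebook's cache


def cached_embedding(text, model="demo-model"):
    """Return the embedding for text, computing it only on a cache miss."""
    key = (text, model)
    if key not in cache:
        cache[key] = fake_embedding(text, model)
    return cache[key]


first = cached_embedding("hello world")
second = cached_embedding("hello world")  # served from the cache
print(len(calls))
```

Only the first request reaches the embedding function. Writing the pickle on every miss, as the notebook does, is simple and safe against crashes. For large batches it may be faster to embed everything first and write the file once at the end.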

In [None]:
# as an example, take the first combined headline + description from the dataset
example_string = df["text"].values[0]
print(f"\nExample string: {example_string}")

# print the first 10 dimensions of the embedding
example_embedding = embedding_from_string(example_string)
print(f"\nExample embedding: {example_embedding[:10]}...")


In [None]:
def distances_from_embeddings(query_embedding: list, embeddings: list) -> list:
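    """Return the cosine similarity between the query and each embedding.

    Despite the function name, these are similarities, not distances:
    higher values mean more similar, and 1.0 means the same direction.
    """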
    def cosine_similarity(embedding1, embedding2):
        return np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))

    return [cosine_similarity(query_embedding, embedding) for embedding in embeddings]
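
The per-embedding loop can also be written as one matrix operation, which tends to be faster once there are many embeddings. This is a sketch of an alternative, not the notebook's method, and `cosine_similarities` is a name introduced here for illustration.

```python
import numpy as np


def cosine_similarities(query_embedding, embeddings):
    """Cosine similarity between one query vector and every row of a matrix."""
    matrix = np.asarray(embeddings, dtype=float)   # shape (n, d)
    query = np.asarray(query_embedding, dtype=float)  # shape (d,)
    dots = matrix @ query                           # one dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = cosine_similarities(vectors[0], vectors)
print(sims)
```

The first vector is identical to the query (similarity 1), the second is orthogonal to it (similarity 0), and the third sits at 45 degrees (similarity about 0.707).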

In [None]:
def indices_of_closest_matches_from_distances(distances: list) -> list:
    """Return all indices, ordered from most similar to least similar.

    The input holds cosine similarities, so the largest values come first.
    """
    return sorted(range(len(distances)), key=lambda i: distances[i])[::-1]
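
When only the top few matches are needed, the ranking can be done with NumPy's `argsort`. The helper below, `top_k_indices`, is a hypothetical name used for this sketch. It assumes the scores are similarities, so larger is better.

```python
import numpy as np


def top_k_indices(similarities, k):
    """Indices of the k largest similarity scores, best first."""
    scores = np.asarray(similarities, dtype=float)
    # Negate so an ascending sort puts the largest scores first;
    # kind="stable" keeps equal scores in their original index order.
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()


top_three = top_k_indices([0.2, 0.9, 0.5, 0.9], k=3)
print(top_three)  # [1, 3, 2]
```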

In [None]:
def print_recommendations_from_strings(
    strings: list[str],
    index_of_source_string: int,
    k_nearest_neighbors: int = 1,
    model=EMBEDDING_MODEL,
) -> list[int]:
    """Print out the k nearest neighbors of a given string."""
    # get embeddings for all strings
    embeddings = [embedding_from_string(string, model=model) for string in strings]
    # get the embedding of the source string
    query_embedding = embeddings[index_of_source_string]
    # get cosine similarities between the source embedding and all embeddings
    distances = distances_from_embeddings(query_embedding, embeddings)
    # rank indices from most similar to least similar
    indices_of_nearest_neighbors = indices_of_closest_matches_from_distances(distances)

    # print out source string
    query_string = strings[index_of_source_string]
    print(f"Source string: {query_string}")
    # print out its k nearest neighbors
    k_counter = 0
    for i in indices_of_nearest_neighbors:
        # skip any strings that are identical matches to the starting string
        if query_string == strings[i]:
            continue
        # stop after printing out k articles
        if k_counter >= k_nearest_neighbors:
            break
        k_counter += 1

        # print out the similar strings and their cosine similarities
        print(
            f"""
        --- Recommendation #{k_counter} (nearest neighbor {k_counter} of {k_nearest_neighbors}) ---
        String: {strings[i]}
        Cosine similarity: {distances[i]:0.3f}"""
        )

    return indices_of_nearest_neighbors


Now, for a given article, we can generate recommendations for it. Try this with different `article_no` values.

In [None]:
article_no = 0

print("Headline:", df.iloc[article_no]["headline"])
print("Description:", df.iloc[article_no]["short_description"])

In [None]:
df["text"].values[article_no]

In [None]:
descriptions = df["text"].values

print_recommendations_from_strings(descriptions, article_no, k_nearest_neighbors=10)

The recommendations *should* make sense. If they don't, the random sample of 250 articles probably contains nothing closely related to the article you picked. Try a different `article_no` or a larger sample.

Now that you've reached the end, here are some additional things you can spend your time doing in groups:

For the more Data Science / ML oriented people:
- Try to do this with completely different datasets! What about taking Amazon Reviews and doing a review recommendation system? Think about how your preprocessing will differ (your reviews dataset may include lots of numbers you'd want to remove or substitute, etc.)

For the more Computer Science / Data Structures & Algo oriented:
- The exact K-Nearest Neighbors search we used compares the query against every stored embedding, so each query takes time proportional to the size of the dataset. Approximate Nearest Neighbors (ANN) search is significantly quicker, but sacrifices some accuracy. Try redoing the recommendation search with an ANN index such as Hierarchical Navigable Small World (HNSW). Many vector databases use HNSW, so this should be an interesting and relevant exercise that gives you some background for similarity search next week.
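
As a warm-up for that exercise, the sketch below shows the speed-for-accuracy trade-off with random-hyperplane locality-sensitive hashing (LSH). This is a different and much simpler ANN technique than HNSW, chosen here only because it fits in a few lines of NumPy. The data is synthetic, and all names and parameters (`n_bits`, `bucket_key`, `approximate_neighbors`) are assumptions made for this illustration.

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)
dim, n_items, n_bits = 32, 1000, 8

# Synthetic unit vectors standing in for real embeddings.
items = rng.normal(size=(n_items, dim))
items /= np.linalg.norm(items, axis=1, keepdims=True)

# Each random hyperplane contributes one bit: which side the vector falls on.
planes = rng.normal(size=(n_bits, dim))


def bucket_key(vector):
    """Hash a vector to a bucket based on its side of each hyperplane."""
    return ((planes @ vector) > 0).tobytes()


# Index: group item ids by bucket.
buckets = defaultdict(list)
for idx, vec in enumerate(items):
    buckets[bucket_key(vec)].append(idx)


def approximate_neighbors(query, k=5):
    """Rank only the items in the query's bucket instead of all items."""
    candidates = np.array(buckets.get(bucket_key(query), []), dtype=int)
    if candidates.size == 0:
        return []
    sims = items[candidates] @ query  # vectors are unit length, so dot = cosine
    order = np.argsort(-sims)[:k]
    return candidates[order].tolist()


query_index = 0
neighbors = approximate_neighbors(items[query_index], k=5)
print(neighbors)
```

Only the vectors that share the query's bucket are scored, which is where the speedup comes from. True neighbors that landed in another bucket are missed, which is the accuracy cost. Graph-based indexes such as HNSW make the same kind of trade in a different way.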

Some other questions to maybe ponder, and get answered:
- What if we didn't use embeddings? What if we used [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (Term Frequency * Inverse Document Frequency) vectorization instead and did similarity search based on that? 
- What if we use another similarity or distance function, such as Euclidean distance or dot product, instead of cosine similarity?
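
One way to start on the last question: for vectors of unit length, the three measures produce the same ranking, because the squared Euclidean distance equals 2 minus twice the cosine similarity, and the dot product equals the cosine similarity. The sketch below checks this on synthetic unit vectors. Whether your real embeddings are unit length is an assumption you should verify (for example with `np.linalg.norm`) before relying on it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic vectors, normalized to unit length.
vectors = rng.normal(size=(50, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

dot = vectors @ query
cosine = dot / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
euclidean = np.linalg.norm(vectors - query, axis=1)

# Similarities rank best-first when sorted descending; distances when ascending.
rank_by_cosine = np.argsort(-cosine)
rank_by_dot = np.argsort(-dot)
rank_by_euclidean = np.argsort(euclidean)

print(np.array_equal(rank_by_cosine, rank_by_euclidean))
```

For vectors that are not normalized, the rankings can differ: dot product rewards long vectors, and Euclidean distance is sensitive to differences in length as well as direction.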