# Embedding-Based Search

In this notebook, we use embeddings and nearest-neighbor search to recommend news articles. We convert features of each article into an embedding vector, then search for the vectors most similar to a given article's embedding. The articles behind those vectors are the similar, relevant articles we recommend.
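
To make the idea concrete before touching real data, here is a minimal sketch that uses hand-made 3-dimensional vectors in place of real embeddings. The vectors, the labels, and the helper name `most_similar` are illustrative assumptions, not part of the dataset or the OpenAI API; real embeddings have far more dimensions.

```python
import numpy as np

# Toy "embeddings": hand-picked 3-d vectors standing in for real model output.
toy_embeddings = {
    "election results": np.array([0.9, 0.1, 0.0]),
    "senate vote": np.array([0.8, 0.2, 0.1]),
    "football final": np.array([0.1, 0.9, 0.2]),
    "new smartphone": np.array([0.0, 0.2, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(query_label, embeddings):
    """Return the label (other than the query) whose vector is most similar."""
    query = embeddings[query_label]
    candidates = {
        label: cosine_similarity(query, vec)
        for label, vec in embeddings.items()
        if label != query_label
    }
    return max(candidates, key=candidates.get)


print(most_similar("election results", toy_embeddings))  # prints: senate vote
```

The rest of the notebook follows the same pattern, with the toy vectors replaced by model-generated embeddings of article text.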

In [None]:
import openai
import pandas as pd
import re
import pickle

We use a [Kaggle Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset). Download the dataset and save it in the same directory as this notebook as `News_Category_Dataset_v3.json`.

In [None]:
df = pd.read_json('News_Category_Dataset_v3.json', lines=True)

In [None]:
df.head()

There are a lot of columns, and we probably won't need most of them.

In [None]:
df = df[["headline", "short_description", "category"]].dropna()

In [None]:
df.head()

In [None]:
len(df)

The dataset is large, so embedding every row would be slow and costly. Let's work with a random sample of 250 articles for convenience.

In [None]:
df = df.sample(250)

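Clean the text with regular expressions: replace newlines, ampersands, and pipes, strip punctuation, and collapse runs of whitespace.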

In [None]:
def clean_text(text):
    text = re.sub(r"\n", " ", text)
    text = re.sub(r"&", " and ", text)
    text = re.sub(r"\|", " ", text)
    # Remove all remaining punctuation (\w already covers digits)
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse whitespace last, so gaps left by removed punctuation are merged too
    text = re.sub(r"\s+", " ", text)
    return text.strip()

df["headline"] = df["headline"].apply(clean_text)
df["short_description"] = df["short_description"].apply(clean_text)
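
To see what this kind of cleaning does to a single string, the sketch below applies the same sort of substitutions one at a time and prints each intermediate result. The sample headline is made up for illustration, the block uses the standard-library `re` module, and it collapses whitespace as the final step.

```python
import re

# A made-up headline containing each character class the cleaner targets.
sample = "Stocks & Bonds:\nWhat's Next | Markets"

# (pattern, replacement) pairs, applied in order.
steps = [
    (r"\n", " "),        # newlines become spaces
    (r"&", " and "),     # ampersands are spelled out
    (r"\|", " "),        # pipes become spaces
    (r"[^\w\s]", ""),    # drop remaining punctuation
    (r"\s+", " "),       # collapse runs of whitespace
]

text = sample
for pattern, replacement in steps:
    text = re.sub(pattern, replacement, text)
    print(f"{pattern!r:>12} -> {text!r}")

cleaned = text.strip()
print(cleaned)  # prints: Stocks and Bonds Whats Next Markets
```

Note that stripping punctuation also removes apostrophes, so "What's" becomes "Whats". That is usually acceptable for embedding-based similarity, but it is a trade-off worth knowing about.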

In [None]:
for _, row in df.head(5).iterrows():
    print("Headline:", row["headline"])
    print("Category:", row["category"])
    print("About:", row["short_description"])
    print()

Which features of the news articles should we use when trying to recommend similar news articles? A combination of the headline and description is a good start. News articles with a semantically similar headline + description are probably relevant to one another.

In [None]:
# make new column that appends headline and short_description. 
# this will be the input to the model
df["text"] = df["headline"] + " " + df["short_description"]

In [None]:
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023

client = openai.OpenAI()

def get_embedding(text: str, model: str = EMBEDDING_MODEL):
    # print(text)
    return client.embeddings.create(input = [text], model=model).data[0].embedding
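
The embeddings endpoint also accepts a list of strings as `input`, so several texts can be embedded in one request instead of one request per text. The following is a minimal sketch under that assumption; `get_embeddings_batch` is a hypothetical helper name, and the client is passed in as an argument so the block itself makes no network call.

```python
from typing import Any, List


def get_embeddings_batch(
    client: Any,
    texts: List[str],
    model: str = "text-embedding-ada-002",
) -> List[List[float]]:
    """Embed several texts with one request (hypothetical helper).

    `client` is expected to behave like `openai.OpenAI()`: it must expose
    `client.embeddings.create(input=..., model=...)`.
    """
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries the index of the input it belongs to;
    # sort on it so the output order matches the input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage with a real client (requires an API key, so it is left commented out):
# vectors = get_embeddings_batch(openai.OpenAI(), ["first text", "second text"])
```

For a 250-article sample the per-text calls in this notebook are fine; batching mainly matters once the number of texts grows.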

In [None]:
import numpy as np

In [None]:
h1 = "Tech Giant Announces Groundbreaking AI Advancements in Automation"
h2 = "Leading Tech Corporation Unveils Revolutionary Developments in AI Technology"

print(np.array(get_embedding(h1)) - np.array(get_embedding(h2)))
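
The raw element-wise difference printed above is hard to interpret by eye; a single number is easier to compare. The sketch below uses random unit-length vectors as stand-ins for real embeddings (an assumption made so that it runs without an API key) and shows that, for unit-length vectors, squared Euclidean distance and cosine similarity carry the same information, since $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$.

```python
import numpy as np

rng = np.random.default_rng(0)


def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)


# Random stand-ins for two embedding vectors, normalized explicitly.
a = unit(rng.normal(size=8))
b = unit(rng.normal(size=8))

cosine = float(np.dot(a, b))                    # dot product of unit vectors
squared_distance = float(np.sum((a - b) ** 2))  # squared Euclidean distance

print(f"cosine similarity: {cosine:.4f}")
print(f"squared distance:  {squared_distance:.4f}")
print(f"2 - 2 * cosine:    {2 - 2 * cosine:.4f}")  # matches the line above
```

Either measure therefore gives the same ranking of neighbors when the vectors are normalized; this notebook uses cosine similarity.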

In [None]:
# Establish a cache of embeddings to avoid recomputing - saves time and money
# Cache is a dict of tuples (text, model) -> embedding, saved as a pickle file

# Set path to embedding cache
embedding_cache_path = "recommendations_embeddings_cache.pkl"

# Load the cache if it exists, and save a copy to disk
try:
    embedding_cache = pd.read_pickle(embedding_cache_path)
except FileNotFoundError:
    embedding_cache = {}
with open(embedding_cache_path, "wb") as embedding_cache_file:
    pickle.dump(embedding_cache, embedding_cache_file)

def embedding_from_string(
    string: str,
    model: str = EMBEDDING_MODEL,
    embedding_cache=embedding_cache
) -> list:
    # Return embedding of given string, using a cache to avoid recomputing.
    if (string, model) not in embedding_cache.keys():
        embedding_cache[(string, model)] = get_embedding(string, model)
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(string, model)]
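
The effect of the cache can be checked without spending API credits. The sketch below swaps the real API call for a made-up `fake_embedding` function that records each invocation, and writes its pickle to a temporary directory; all names in it are illustrative.

```python
import os
import pickle
import tempfile

calls = []  # records every time the "expensive" function actually runs


def fake_embedding(text):
    """Stand-in for a paid API call; returns a tiny deterministic vector."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


def cached_embedding(text, cache, cache_path):
    """Return the cached vector for text, computing and persisting on a miss."""
    if text not in cache:
        cache[text] = fake_embedding(text)
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)
    return cache[text]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cache.pkl")
    cache = {}
    first = cached_embedding("hello world", cache, path)
    second = cached_embedding("hello world", cache, path)  # served from the cache
    with open(path, "rb") as f:
        reloaded = pickle.load(f)  # what a later session would load from disk

print(len(calls))  # prints: 1
print(reloaded)    # prints: {'hello world': [11.0, 1.0]}
```

Rewriting the whole pickle on every miss keeps the code simple and is cheap at this sample size; for much larger caches, saving less often would likely be preferable.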

In [None]:
# as an example, take the first combined text (headline + description) from the dataset
example_string = df["text"].values[0]
print(f"\nExample string: {example_string}")

# print the first 10 dimensions of the embedding
example_embedding = embedding_from_string(example_string)
print(f"\nExample embedding: {example_embedding[:10]}...")


In [None]:
def distances_from_embeddings(query_embedding: list, embeddings: list) -> list:
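    """Return the cosine similarity between the query and each embedding.

    Despite the function name, these are similarities, not distances:
    a higher value means the embedding is closer to the query.
    """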
    def cosine_similarity(embedding1, embedding2):
        return np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))

    return [cosine_similarity(query_embedding, embedding) for embedding in embeddings]
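
The per-pair loop above is easy to read. For larger collections, the same quantity can be computed for every row at once with a matrix-vector product. A minimal sketch, with `cosine_similarities` as a hypothetical helper name and small 2-d vectors as toy inputs:

```python
import numpy as np


def cosine_similarities(query, matrix):
    """Cosine similarity between query (shape (d,)) and each row of matrix (shape (n, d))."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Dot product of every row with the query, divided by the product of norms.
    return matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))


toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
sims = cosine_similarities([1.0, 0.0], toy)
print(np.round(sims, 3))
```

The four toy rows are, in order, identical to the query, orthogonal to it, at 45 degrees, and opposite, so the similarities are 1, 0, about 0.707, and -1.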

In [None]:
def indices_of_closest_matches_from_distances(distances: list) -> list:
    """Return all indices, ordered from most to least similar to the query."""
    # The values passed in are cosine similarities (see distances_from_embeddings),
    # so sort ascending and reverse to put the highest similarity first.
    return (sorted(range(len(distances)), key=lambda i: distances[i]))[::-1]
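
The same ranking can be sketched with `np.argsort`, which also makes it easy to keep only the top `k` results and to drop the query's own index. The helper name `top_k_indices` and the scores are made up for illustration.

```python
import numpy as np


def top_k_indices(similarities, k, exclude=None):
    """Return indices of the k highest similarities, optionally skipping one index."""
    order = np.argsort(similarities)[::-1]  # highest similarity first
    if exclude is not None:
        order = order[order != exclude]     # e.g. drop the query itself
    return order[:k].tolist()


# Index 0 plays the role of the query, so its self-similarity is 1.0.
scores = [1.0, 0.12, 0.87, 0.45, 0.91]
print(top_k_indices(scores, k=2, exclude=0))  # prints: [4, 2]
```

The notebook's own printing function takes a slightly different route: it skips results whose text equals the query string, which also filters out exact duplicate articles.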

In [None]:
def print_recommendations_from_strings(
    strings: list[str],
    index_of_source_string: int,
    k_nearest_neighbors: int = 1,
    model=EMBEDDING_MODEL,
) -> list[int]:
    """Print out the k nearest neighbors of a given string."""
    # get embeddings for all strings
    embeddings = [embedding_from_string(string, model=model) for string in strings]
    # get the embedding of the source string
    query_embedding = embeddings[index_of_source_string]
    # get cosine similarities between the source embedding and all embeddings
    distances = distances_from_embeddings(query_embedding, embeddings)
    
    indices_of_nearest_neighbors = indices_of_closest_matches_from_distances(distances)

    # get the source string
    query_string = strings[index_of_source_string]
    # print out its k nearest neighbors
    k_counter = 0
    for i in indices_of_nearest_neighbors:
        # skip any strings that are identical matches to the starting string
        if query_string == strings[i]:
            continue
        # stop after printing out k articles
        if k_counter >= k_nearest_neighbors:
            break
        k_counter += 1

        # print out the similar strings and their cosine similarities
        print(
            f"""
        --- Recommendation #{k_counter} (nearest neighbor {k_counter} of {k_nearest_neighbors}) ---
        String: {strings[i]}
        Cosine similarity: {distances[i]:0.3f}"""
        )

    return indices_of_nearest_neighbors


Now, for a given article, we can generate recommendations for it. Try this with different `article_no` values.

In [None]:
article_no = 0

print("Headline:", df.iloc[article_no]["headline"])
print("Description:", df.iloc[article_no]["short_description"])

In [None]:
df["text"].values[article_no]

In [None]:
descriptions = df["text"].values

print_recommendations_from_strings(descriptions, article_no, k_nearest_neighbors=10)