# Semantic search

There are several use cases where we might want to search a large corpus of text. Examples include:
- Powering a search engine
- Detecting near-duplicate texts
- Finding similar texts to recommend to users
- Retrieving factual information to provide context to large language models

By using embeddings, we can encode our search query as a set of related concepts rather than as lexical symbols. This is what it means to capture the semantics of the search query. Let's look at how we can write Python code to enable semantic search. We'll use OpenAI's embeddings along with the Annoy (approximate nearest neighbours) library published by Spotify.
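
To make the idea of "related concepts" a little more concrete: an embedding is a vector, and texts with similar meaning tend to have vectors that point in similar directions. The sketch below is only an illustration. The three-dimensional vectors are made up (real `text-embedding-ada-002` embeddings have 1536 dimensions), and `cosine_similarity` is a hypothetical helper written here, not a library function.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, for illustration only
toy_embeddings = {
    "Who is Thanos?": [0.9, 0.1, 0.2],
    "Thanos is a warlord from Titan": [0.8, 0.2, 0.3],
    "How to bake bread": [0.1, 0.9, 0.1],
}

query = "Who is Thanos?"
scores = {
    text: cosine_similarity(toy_embeddings[query], vector)
    for text, vector in toy_embeddings.items()
    if text != query
}
best_match = max(scores, key=scores.get)
print(best_match)
```

The query shares no requirement of exact wording with its best match; only the direction of the vectors matters.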

# The dataset
For our dataset we'll use texts about the Marvel Cinematic Universe taken from Wikipedia. Let's load the article texts and also store the article names (which are the filenames) for later use.

In [73]:
from glob import glob
import os

def read_text_file(filename: str) -> str:
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()
    
    
def extract_filename(path: str) -> str:
    return os.path.basename(path)


paths = glob('datasets/marvel/*.txt')
texts = {extract_filename(path): read_text_file(path) for path in paths}

# Loading OpenAI API key

In [74]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get the OpenAI API key from the environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")

# Retrieving OpenAI embeddings

In [75]:
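import concurrent.futures  # `import concurrent` alone does not load the futures submodule

import numpy as np
import openai

# Pass the key loaded above to the client explicitly
openai.api_key = openai_api_key


def get_embedding(text, model="text-embedding-ada-002"):
    # openai.Embedding.create is the interface of the pre-1.0 openai package
    response = openai.Embedding.create(input=[text], model=model)
    return response['data'][0]['embedding']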


def parallel_embedding(text_list, model="text-embedding-ada-002"):
    # Create a ThreadPoolExecutor
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        # Submit tasks to the executor and remember the order
        futures = []
        mapping = dict()
        for i, text in enumerate(text_list):
            future = executor.submit(get_embedding, text, model)
            futures.append(future)
            mapping[future] = i

        # Retrieve results as they become available and sort their order
        embeddings = [None] * len(futures)
        for future in concurrent.futures.as_completed(futures):
            embeddings[mapping[future]] = future.result()
    return np.array(embeddings)
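
The order-preserving pattern in `parallel_embedding` can be tried offline, without an API key, by swapping the API call for a stand-in. In the sketch below, `fake_embedding` and `parallel_map_in_order` are hypothetical names, not OpenAI functions, and `executor.map` is used as a shorter alternative that also returns results in input order.

```python
import concurrent.futures
import random
import time


def fake_embedding(text):
    # Hypothetical stand-in for the API call: sleeps briefly so tasks finish
    # out of order, then returns a tiny made-up "vector"
    time.sleep(random.uniform(0, 0.01))
    return [len(text), text.count(" ")]


def parallel_map_in_order(func, items, max_workers=8):
    # executor.map yields results in the order of `items`,
    # regardless of which task finishes first
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


texts_to_embed = ["Iron Man", "Thor", "Guardians of the Galaxy", "Ant-Man"]
vectors = parallel_map_in_order(fake_embedding, texts_to_embed)
for text, vector in zip(texts_to_embed, vectors):
    print(text, vector)
```

One difference worth knowing: with `executor.map`, an exception raised inside a task surfaces when its result is pulled from the iterator, so a single failed request aborts the `list(...)` call.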

# Embedding our dataset
We could embed each article as a single dense vector, but that approach has two drawbacks. First, embedding models have a token limit, so a long article may not fit in one request. Second, we lose granularity: local information and specific details can be averaged out or overshadowed by the dominant themes of the article.

To capture the full meaning and context within our articles, it's essential to treat the text with care when dividing it into chunks for embedding. A naive approach might inadvertently slice a sentence in half, causing a loss of vital context. Imagine having a sentence where the first part poses a question and the second part delivers an answer. If these two segments were separated, the overall understanding of that information would be compromised.

For our embedding process, we employ a strategy that aims to preserve context. We divide each article into chunks of 256 tokens using a sliding window. The window spans 256 tokens of the text and moves forward by 192 tokens at each step. This means that consecutive chunks overlap by 64 tokens, so each chunk shares a quarter of its content with the next one.
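
The window arithmetic is easier to follow on a toy example. The sketch below applies the same idea to a plain list of integers standing in for token ids, with a window of 8 and a stride of 6 (the same 4:3 ratio as 256 and 192). `window_starts` is a hypothetical helper written for this illustration, not part of any library.

```python
def window_starts(n_tokens, window_size, stride):
    # Start offsets of every full window that fits within n_tokens
    return list(range(0, n_tokens - window_size + 1, stride))


tokens = list(range(20))  # stand-in for token ids
window_size, stride = 8, 6

starts = window_starts(len(tokens), window_size, stride)
chunks = [tokens[s:s + window_size] for s in starts]

# Consecutive windows share window_size - stride tokens
overlap = window_size - stride
for first, second in zip(chunks, chunks[1:]):
    # The end of one chunk is the beginning of the next
    assert first[-overlap:] == second[:overlap]

print(starts)                 # start offset of each window
print(overlap / window_size)  # fraction of a chunk shared with its neighbour
```

With 256 and 192 the same calculation gives an overlap of 64 tokens per pair of neighbouring chunks.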

In [83]:
import time

import numpy as np
import tiktoken


def split_with_sliding_window(text: str, window_size: int = 256, stride: int = 192, model_name: str = "text-embedding-ada-002") -> list:
    enc = tiktoken.encoding_for_model(model_name)
    tokens = enc.encode(text)

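    chunks = []
    current_position = 0
    covered = 0  # index just past the last token already placed in a chunk

    while current_position + window_size <= len(tokens):
        chunk_tokens = tokens[current_position:current_position + window_size]
        chunks.append(enc.decode(chunk_tokens))
        covered = current_position + window_size
        current_position += stride

    # Handle the tail only if some tokens are not covered yet; checking
    # current_position here would duplicate the last chunk whenever it
    # already ends exactly at the end of the text
    if covered < len(tokens):
        chunks.append(enc.decode(tokens[-window_size:]))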

    return chunks


text_chunks = list()
for filename, text in texts.items():
    template = 'Article title: """{filename}"""\nText: """{text}"""'
    for chunk in split_with_sliding_window(text):
        text_chunks.append(template.format(filename=filename, text=chunk))
        
text_chunks = np.array(text_chunks)

start_time = time.time()
embeddings = parallel_embedding(text_chunks)
duration = time.time() - start_time

print(f'Created {len(text_chunks)} text chunks from {len(texts)} texts.')
print(f'Retrieved embeddings in {round(duration, 2)}s')

Created 2953 text chunks from 46 texts.
Retrieved embeddings in 9.93s


# Creating the search index
We use the Annoy Python library to build an efficient approximate nearest neighbours index, which lets us find the stored embeddings most similar to the embedding of our search query.

In [84]:
from annoy import AnnoyIndex

search_index = AnnoyIndex(embeddings.shape[1], 'angular')
# Add all the vectors to the search index
for i, embedding in enumerate(embeddings):
    search_index.add_item(i, embedding)

search_index.build(10) # 10 trees
search_index.save('index.ann')

True
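
Because Annoy is approximate, it can be useful to compare its results against an exact search on a small sample. The following is a minimal brute-force sketch in plain Python using made-up two-dimensional vectors. `exact_nearest` is a hypothetical helper, and it ranks by cosine similarity, which produces the same ordering as Annoy's `'angular'` metric.

```python
import math


def exact_nearest(query, vectors, k):
    # Score every vector against the query: O(n) per search, but exact
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scored = [(cosine(query, vector), i) for i, vector in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Made-up toy vectors, for illustration only
toy_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
nearest_ids = exact_nearest([1.0, 0.2], toy_vectors, k=2)
print(nearest_ids)
```

If the ids returned by the index for a handful of queries often differ from the exact ones, building with more trees is the usual first thing to try.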

In [91]:
import pandas as pd

def search(query):
    # Get the query's embedding
    query_embedding = parallel_embedding([query])[0]

    # Retrieve the nearest neighbors
    similar_item_ids = search_index.get_nns_by_vector(
        query_embedding,
        3,
        include_distances=True
    )

    # Format the results
    results = pd.DataFrame(
        data={
            'texts': text_chunks[similar_item_ids[0]],
            'distance': similar_item_ids[1]
        }
    )    
    return results
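
The `distance` column is easier to interpret once converted to a similarity. Annoy's `'angular'` metric is the Euclidean distance between the normalized vectors, that is sqrt(2(1 - cos)), so cosine similarity can be recovered as 1 - d²/2. A small sketch, with `angular_to_cosine` as a hypothetical helper name:

```python
import math


def angular_to_cosine(distance):
    # Invert d = sqrt(2 * (1 - cos)) to recover cosine similarity
    return 1.0 - (distance ** 2) / 2.0


# Check against two unit vectors that are 60 degrees apart (cos = 0.5)
u = (1.0, 0.0)
v = (math.cos(math.pi / 3), math.sin(math.pi / 3))
d = math.dist(u, v)  # Euclidean distance between the two unit vectors

print(round(angular_to_cosine(d), 6))
```

A distance of 0 maps to a similarity of 1 (same direction) and a distance of 2 maps to -1 (opposite directions).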

In [98]:
query = "Who is Thanos?"
results = search(query)  # run the search once and reuse the results
print(results.texts[0])
print()
print(results.texts[1])
print()
print(results.texts[2])

Article title: """Avengers- Infinity War.txt"""
Text: """ige added that Thanos believes the universe is becoming over-populated, which led to the destruction of his home moon Titan and is something he vowed not to let happen again,[56] and also said "you could almost go so far as to say he is the main character of" the film.[58] McFeely shared this sentiment, describing the film as his "hero journey" in addition to being the film's protagonist, stating, "Part of that is the things that [mean] the most to him. We wanted to show that. It wasn't just power; it wasn't just an ideal; it was people".[26] Brolin likened Thanos to "the Quasimodo of this time" and the novel Perfume, since Thanos was born deformed and considered a "freak" on Titan,[59] while Joe Russo would reference The Godfather (1972) for Brolin at times, which Brolin felt helped "to emotionalize the whole thing".[60] Brolin further added that he preferred playing Thanos over Cable in Deadpool 2 (2018) because of the amount o