# Semantic search

There are several use-cases where we might want to search amongst a large corpus of text. Examples include:
- Powering a search engine
- Detecting near duplicate texts for removal
- Finding similar articles to recommend to users
- Retrieving factual information to use as context for a large language model

By using embeddings, we can encode a search query as a dense vector that reflects the concepts it expresses rather than the exact words it contains. This is what it means to capture the semantics of text. Let's look at how we can write Python code to enable semantic search. We'll use OpenAI's embeddings along with the Python library annoy (approximate nearest neighbours), published by Spotify.
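
To make "similar meaning" concrete, the sketch below compares toy vectors with cosine similarity, a common way to score how closely two embeddings point in the same direction. The three-dimensional vectors and their labels are invented for illustration; real embeddings from `text-embedding-ada-002` have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example only
toy_vectors = {
    "release date of a film": [0.9, 0.1, 0.0],
    "when the movie came out": [0.8, 0.2, 0.1],
    "recipe for pancakes": [0.0, 0.2, 0.9],
}

query = toy_vectors["release date of a film"]
for label, vector in toy_vectors.items():
    print(f"{label}: {cosine_similarity(query, vector):.3f}")
```

The two phrasings about release dates share no keywords, yet their toy vectors score far closer to each other than either does to the unrelated text, which is the behaviour semantic search relies on.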

# The dataset
For our dataset we'll use texts about the movies of the Marvel Cinematic Universe, taken from Wikipedia. Let's load the article texts and also store the article names for later use.

In [6]:
from glob import glob
import os

def read_text_file(filename: str) -> str:
    with open(filename, 'r') as f:
        return f.read()
    
    
def extract_filename(path: str) -> str:
    return os.path.basename(path)


paths = glob('../datasets/marvel/*.txt')
texts = {extract_filename(path): read_text_file(path) for path in paths}

# Loading OpenAI API key

In [10]:
import os
from dotenv import load_dotenv
from pathlib import Path

# Path to the .env file in the parent directory
env_path = Path("..") / ".env"

# Load environment variables from .env file
load_dotenv(dotenv_path=env_path)

# Get the OpenAI API key from the environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")

# Retrieving OpenAI embeddings

In [11]:
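import concurrent.futures  # the futures submodule must be imported explicitly
import numpy as np
import openai

# This cell uses the pre-1.0 `openai` SDK interface (openai.Embedding.create)
openai.api_key = openai_api_key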

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']


def parallel_embedding(text_list, model="text-embedding-ada-002"):
    # Create a ThreadPoolExecutor
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        # Submit tasks to the executor and remember the order
        futures = []
        mapping = dict()
        for i, text in enumerate(text_list):
            future = executor.submit(get_embedding, text, model)
            futures.append(future)
            mapping[future] = i

        # Retrieve results as they become available and sort their order
        embeddings = [None] * len(futures)
        for future in concurrent.futures.as_completed(futures):
            embeddings[mapping[future]] = future.result()
    return np.array(embeddings)
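
If the only requirement is that results come back in input order, `ThreadPoolExecutor.map` is a simpler alternative to tracking futures by hand. The sketch below uses `fake_embedding`, a hypothetical stand-in for the network call, and `parallel_map_in_order`, a helper name invented here, so it runs without an API key.

```python
import concurrent.futures


def fake_embedding(text):
    # Hypothetical stand-in for an API call; returns a tiny made-up vector
    return [float(len(text)), float(text.count(" "))]


def parallel_map_in_order(fn, items, max_workers=8):
    """Run fn over items in a thread pool and return results in input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of the inputs,
        # regardless of which task finishes first
        return list(executor.map(fn, items))


vectors = parallel_map_in_order(fake_embedding, ["iron man", "thor", "black panther"])
print(vectors)
```

The `as_completed` approach used above remains useful when results should be processed as soon as each one arrives, for example to report progress.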

# Embedding our dataset
Although we could embed each article whole and produce one dense vector per article, that approach has two limitations. First, embedding models have a token limit. Second, embedding a large text loses granularity: local information and specific details may be averaged out or overshadowed by the text's dominant themes.

To capture the full meaning and context within our texts, it's essential to treat the text with care when dividing it for embedding. A naive approach might inadvertently slice a sentence in half, causing a loss of vital context. Imagine having a sentence where the first part poses a question and the second part delivers an answer. If these two segments were separated, the overall understanding of that information would be compromised.

For our embedding process, we employ a strategy that aims to preserve context. We divide each article into segments of 256 tokens using a sliding window. The window spans 256 tokens of the text and moves forward by 192 tokens at each step. Consecutive chunks therefore overlap by 64 tokens, so adjacent segments share one quarter of their content.

We will also prepend the title of the article to each chunk to add additional context to the embeddings.
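
The window arithmetic can be checked on a plain list of integers standing in for token ids. `window_starts` is a hypothetical helper written for this illustration; it only lists the full windows and leaves any leftover tail to be handled separately.

```python
def window_starts(n_tokens, window_size=256, stride=192):
    """Start offsets of every full window that fits inside n_tokens."""
    return list(range(0, n_tokens - window_size + 1, stride))


tokens = list(range(1000))  # stand-in for real token ids
starts = window_starts(len(tokens))
chunks = [tokens[s:s + 256] for s in starts]

# Tokens shared by two consecutive windows: window_size - stride
overlap = len(set(chunks[0]) & set(chunks[1]))

print(starts)
print(overlap, overlap / 256)
```

With 1000 tokens the last full window ends at token 832, so the remaining tokens still need a final chunk of their own.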

In [12]:
import tiktoken
import itertools
import time
import numpy as np


def split_with_sliding_window(text: str, window_size: int = 256, stride: int = 192, model_name: str = "text-embedding-ada-002") -> list:
    enc = tiktoken.encoding_for_model(model_name)
    tokens = enc.encode(text)

    chunks = []
    current_position = 0

    while current_position + window_size <= len(tokens):
        chunk_tokens = tokens[current_position:current_position + window_size]
        chunks.append(enc.decode(chunk_tokens))
        current_position += stride

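    # Handle the tail only if the last full window stopped short of the end;
    # this avoids emitting a duplicate of the final chunk
    last_window_end = current_position - stride + window_size if chunks else 0
    if last_window_end < len(tokens):
        chunks.append(enc.decode(tokens[-window_size:]))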

    return chunks


text_chunks = list()
for filename, text in texts.items():
    template = 'Article title: """{filename}"""\nText: """{text}"""'
    for chunk in split_with_sliding_window(text):
        text_chunks.append(template.format(filename=filename, text=chunk))
        
text_chunks = np.array(text_chunks)

start_time = time.time()
embeddings = parallel_embedding(text_chunks)
duration = time.time() - start_time

print(f'Created {len(text_chunks)} text chunks from {len(texts)} texts.')
print(f'Retrieved embeddings in {round(duration, 2)}s')

Created 2953 text chunks from 46 texts.
Retrieved embeddings in 13.55s


# Creating the search index
We use the annoy Python library to build an efficient approximate nearest neighbours index, which lets us find the stored embeddings closest to the embedding of our search query.

In [13]:
from annoy import AnnoyIndex

search_index = AnnoyIndex(embeddings.shape[1], 'angular')
# Add all the vectors to the search index
for i, embedding in enumerate(embeddings):
    search_index.add_item(i, embedding)

search_index.build(10) # 10 trees
search_index.save('index.ann')

True
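
Because the index returns approximate neighbours, it can help to compare its results against an exact search on a small sample. The sketch below is a brute-force baseline in plain Python; `exact_nearest` and the toy vectors are hypothetical and not part of the notebook.

```python
import math


def exact_nearest(query, vectors, k=3):
    """Return (index, cosine similarity) pairs for the k vectors closest to query."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    # Score every vector, then keep the k highest similarities
    scored = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-D vectors standing in for real embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(exact_nearest([0.9, 0.1], toy_vectors, k=2))
```

An exact scan costs time proportional to the number of stored vectors per query, which is the cost an approximate index is meant to avoid on larger collections.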

In [14]:
import pandas as pd

def search(query):
    # Get the query's embedding
    query_embedding = parallel_embedding([query])[0]

    # Retrieve the nearest neighbors
    similar_item_ids = search_index.get_nns_by_vector(
        query_embedding,
        3,
        include_distances=True
    )

    # Format the results
    results = pd.DataFrame(
        data={
            'texts': text_chunks[similar_item_ids[0]],
            'distance': similar_item_ids[1]
        }
    )    
    return results
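
The `distance` column is easier to interpret once converted to cosine similarity. Annoy's README describes its `angular` metric as the Euclidean distance between normalized vectors, that is sqrt(2 * (1 - cos)). The sketch below inverts that relationship; the helper names are hypothetical.

```python
import math


def angular_distance_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular_distance(cosine):
    """Forward direction, clamped to guard against tiny negative rounding errors."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))


for d in (0.0, 0.5, 1.0, math.sqrt(2)):
    print(f"distance {d:.3f} -> cosine {angular_distance_to_cosine(d):.3f}")
```

Under this relationship a distance of 0 means identical direction, and smaller distances correspond to higher cosine similarity.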

In [19]:
query = "When was iron man 2 released?"
results = search(query)  # embed the query once and reuse the results
print(results.texts[0])
print()
print(results.texts[1])
print()
print(results.texts[2])

Article title: """Iron Man 2.txt"""
Text: """ and immediately set to work on producing a sequel. In July, Theroux was hired to write the script and Favreau was signed to return as director. Downey, Paltrow, and Jackson were set to reprise their roles from Iron Man, while Cheadle was brought in to replace Terrence Howard in the role of James Rhodes. In the early months of 2009, Rourke (Vanko), Rockwell, and Johansson filled out the supporting cast. Filming took place from April to July 2009, mostly in California as in the first film, except for a key sequence in Monaco. Unlike its predecessor, which mixed digital and practical effects, the sequel primarily relied on computer-generated imagery to create the Iron Man suits.

Iron Man 2 premiered at the El Capitan Theatre in Los Angeles on April 26, 2010, and was released in the United States on May 7, as part of Phase One of the MCU. The film received praise for its action sequences and performances, although critics deemed it to be infer

In [18]:
query = "Who plays doctor strange?"
results = search(query)  # embed the query once and reuse the results
print(results.texts[0])
print()
print(results.texts[1])
print()
print(results.texts[2])

Article title: """Doctor Strange (2016 film).txt"""
Text: """ projects.[102] Feige stated that a lead actor would be announced "relatively quickly",[103] and by the end of that month Joaquin Phoenix entered talks to play the character.[104][105]

Marvel Studios was in negotiations by September 2014 to shoot Doctor Strange at Pinewood-Shepperton in England, with crews being assembled for a move into Shepperton Studios in late 2014/early 2015, for filming in May 2015.[106] Negotiations with Phoenix ended in October 2014,[107] as the actor felt that blockbuster films would never be "fulfilling", with "too many requirements that went against [his] instincts for character."[108] Marvel then placed Leto, Ethan Hawke, Oscar Isaac, Ewan McGregor, Matthew McConaughey, Jake Gyllenhaal, Colin Farrell, and Keanu Reeves on their shortlist for the character.[109][110] Ryan Gosling also had discussions to play the character,[111] while Reeves was not approached about the role,[112] and Cumberbatch wa