# Semantic Search with LLMs

## Text Embeddings
### Text embeddings represent words or phrases as machine-readable numerical vectors in a multidimensional space, generally based on their contextual meaning. The idea is that if two phrases are similar (we will explore the word “similar” in more detail later on in this chapter), then the vectors that represent those phrases should be close together by some measure (like Euclidean distance), and if two phrases are dissimilar, their vectors should be far apart.
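
To make "close together" concrete, here is a minimal sketch that measures Euclidean distance between vectors. The three-dimensional vectors and the phrases attached to them are hypothetical, hand-picked for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only
cat = np.array([0.9, 0.1, 0.0])      # stands in for the phrase "cat"
kitten = np.array([0.8, 0.2, 0.1])   # stands in for the phrase "kitten"
invoice = np.array([0.0, 0.1, 0.9])  # stands in for the phrase "invoice"


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return float(np.linalg.norm(a - b))


# Similar phrases should sit closer together than unrelated ones
close = euclidean_distance(cat, kitten)
far = euclidean_distance(cat, invoice)
print(close < far)
```

With these toy values, the distance between the two related phrases is smaller than the distance between the unrelated pair, which is the property a good embedder is meant to provide.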

## Asymmetric Semantic Search

### A semantic search system can understand the meaning and context of your search query and match it against the meaning and context of the documents that are available to retrieve. This kind of system can find relevant results in a database without having to rely on exact keyword or n-gram matching; instead, it relies on a pre-trained LLM to understand the nuances of the query and the documents.
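
As a point of contrast, the sketch below shows where exact keyword matching falls short. The helper `keyword_match` and the two product descriptions are hypothetical, written only to illustrate the limitation; this is not how any particular search engine is implemented.

```python
def keyword_match(query, documents):
    """Return the documents that contain every query token verbatim."""
    query_tokens = set(query.lower().split())
    return [
        doc for doc in documents
        if query_tokens <= set(doc.lower().split())
    ]


# Hypothetical marketplace item descriptions
documents = [
    "vintage trading card game booster pack",
    "magic the gathering cards in mint condition",
]

# A query that reuses the exact words of a document is found
exact_hits = keyword_match("magic the gathering cards", documents)

# A paraphrase with the same intent matches nothing,
# because the token "collectible" never appears verbatim
paraphrase_hits = keyword_match("collectible card game", documents)

print(exact_hits)
print(paraphrase_hits)
```

A semantic search system aims to close this gap by comparing meaning rather than surface tokens.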

### The asymmetric part of asymmetric semantic search refers to the imbalance between the amount of semantic information (roughly, the length) in the input query and in the documents that the search system has to retrieve: one of them is much shorter than the other. For example, a search system trying to match “magic the gathering cards” to lengthy paragraphs of item descriptions on a marketplace would be considered asymmetric. The four-word search query carries much less information than the paragraphs, but it is nonetheless what we have to compare against them.

### Asymmetric semantic search systems can produce very accurate and relevant search results, even if you don’t use exactly the right words in your search. They rely on what LLMs have learned rather than on the user knowing exactly which needle to search for in the haystack.

These systems also have limitations. They can be overly sensitive to small variations in text, such as differences in capitalization or punctuation.

They struggle with nuanced concepts, such as sarcasm or irony, that rely on localized cultural knowledge.

They can be more computationally expensive to implement and maintain than traditional keyword-based search, especially when launching a home-grown system with many open-source components.

## Text Embedder

### At the heart of any semantic search system is the text embedder. This component takes in a text document, or a single word or phrase, and converts it into a vector. The vector is unique to that text and should capture the contextual meaning of the phrase.

### The choice of text embedder is critical, as it determines the quality of the vector representation of the text. We have many options for how we vectorize with LLMs, both open and closed source. To get off the ground more quickly, we will use OpenAI’s closed-source “Embeddings” product for our purposes here. In a later section, I’ll go over some open-source options.

### OpenAI’s “Embeddings” is a powerful tool that can quickly provide high-quality vectors, but it is a closed-source product, which means we have limited control over its implementation and potential biases. In particular, when using closed-source products, we may not have access to the underlying algorithms, which can make it difficult to troubleshoot any issues that arise.

## What Makes Pieces of Text “Similar”

### Once we convert our text into vectors, we need a mathematical way to decide whether pieces of text are “similar.” Cosine similarity is one such measure. It looks at the angle between two vectors and gives a score based on how close they are in direction. If the vectors point in exactly the same direction, the cosine similarity is 1. If they’re perpendicular (90 degrees apart), it’s 0. And if they point in opposite directions, it’s -1. The size of the vectors doesn’t matter; only their orientation does.
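
The description above can be sketched in a few lines of NumPy. The function name `cosine_similarity` and the two-dimensional vectors are illustrative assumptions, chosen so that the three cases (same direction, perpendicular, opposite) are easy to see.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dividing by both norms removes the effect of vector length
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Same direction, different length: score is 1 (up to floating-point rounding)
same_direction = cosine_similarity([1, 2], [2, 4])

# 90 degrees apart: score is 0
perpendicular = cosine_similarity([1, 0], [0, 3])

# Opposite directions: score is -1 (up to floating-point rounding)
opposite = cosine_similarity([1, 2], [-1, -2])

print(same_direction, perpendicular, opposite)
```

Note that `[1, 2]` and `[2, 4]` differ in length but not in direction, so they score as identical. The function is undefined for a zero vector, since its norm would make the denominator zero.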

In [3]:
from dotenv import load_dotenv
import os
# Load the environment variables from the .env file
load_dotenv()

# Access the OpenAI API key
openai_api_key = os.getenv('OPENAI_API_KEY')

In [5]:
# Importing the OpenAI client (openai>=1.0, which no longer ships openai.embeddings_utils)
from openai import OpenAI

# Creating a client with the API key loaded in the previous cell
client = OpenAI(api_key=openai_api_key)

# Setting the engine to be used for text embedding
ENGINE = 'text-embedding-ada-002'

# Generating the vector representation of the given text using the specified engine
response = client.embeddings.create(input=['I love to be vectorized'], model=ENGINE)
embedded_text = response.data[0].embedding

# Checking the length of the resulting vector to ensure it is the expected size (1536)
len(embedded_text) == 1536

In [6]:
# Importing the SentenceTransformer library
from sentence_transformers import SentenceTransformer

# Initializing a SentenceTransformer model with the 'multi-qa-mpnet-base-cos-v1' pre-trained model
model = SentenceTransformer(
  'sentence-transformers/multi-qa-mpnet-base-cos-v1')

# Defining a list of documents to generate embeddings for
docs = [
          "Around 9 million people live in London",
          "London is known for its financial district"
       ]

# Generate vector embeddings for the documents
doc_emb = model.encode(
    docs,                   # Our documents (an iterable of strings)
    batch_size=32,          # Batch the embeddings by this size
    show_progress_bar=True  # Display a progress bar
)

# The shape of the embeddings is (2, 768): two embeddings, each of length 768
doc_emb.shape  #  == (2, 768)

(2, 768)
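
Once documents are embedded, a query embedding can be compared against them to produce a ranking. The following is a minimal sketch of that step, not a full search system. The function `rank_by_cosine` and the two-dimensional toy arrays are hypothetical stand-ins; real embeddings from the model above would have 768 dimensions.

```python
import numpy as np


def rank_by_cosine(query_emb, doc_embs, top_k=3):
    """Return (document index, cosine score) pairs, best match first."""
    query_emb = np.asarray(query_emb, dtype=float)
    doc_embs = np.asarray(doc_embs, dtype=float)

    # Scale every vector to unit length so a dot product equals cosine similarity
    query_unit = query_emb / np.linalg.norm(query_emb)
    doc_units = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)

    scores = doc_units @ query_unit

    # Sort by descending score and keep the top_k entries
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy stand-ins for document and query embeddings (illustration only)
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([2.0, 0.1])

ranking = rank_by_cosine(toy_query, toy_docs, top_k=2)
print(ranking)
```

In the notebook, the same function could presumably be called with `doc_emb` and the result of `model.encode` on a query string in place of the toy arrays.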

In [7]:
# Use the PyPDF2 library to read a PDF file
import PyPDF2
# Import the tqdm function (not just the module) so it can wrap the page loop
from tqdm import tqdm

# Open the PDF file in read-binary mode
with open('../data/pds2.pdf', 'rb') as file:

    # Create a PDF reader object
    reader = PyPDF2.PdfReader(file)

    # Initialize an empty string to hold the text
    principles_of_ds = ''

    # Loop through each page in the PDF file
    for page in tqdm(reader.pages):

        # Extract the text from the page
        text = page.extract_text()
        # Find the starting point of the text we want to extract
        # In this case, we are extracting text starting from the string ' ]'
        principles_of_ds += '\n\n' + text[text.find(' ]')+2:]

# Strip any leading or trailing whitespace from the resulting string
principles_of_ds = principles_of_ds.strip()

FileNotFoundError: [Errno 2] No such file or directory: '../data/pds2.pdf'