# Semantic Search

## What is semantic search?
Semantic search is a data searching technique that focuses on understanding the contextual meaning and intent behind a user's query, rather than only matching keywords.

Traditional search engines typically focus on matching keywords within a search query to corresponding keywords in indexed web pages. In contrast, semantic search aims to comprehend the deeper meaning and intent behind a user's search, much like a human would.

## Semantic Meaning and Similarity Search

**Semantic meaning** refers to the contextual or conceptual similarity between pieces of information. In AI systems, capturing semantic meaning allows models to understand not just the literal text but the underlying concepts and relationships between words or phrases.

**Similarity search** is the process of finding items in a dataset that are most similar to a given query. Instead of relying solely on exact matches, similarity search considers the semantic meaning, enabling more accurate and relevant results.

### How do AI systems represent meaning?

**Vector Embeddings:** AI models convert text, images, or other data types into numerical representations called *vectors*. These vectors capture relationships between words, sentences, or documents by encoding semantic information.

**Semantic Vector Space:** Vectors reside in a high-dimensional space where their positions determine how similar or different they are.

Similar items are closer together, while dissimilar items are farther apart.

![](./resources/embedding_img.webp)

*Rozado, David (2020). Word embeddings map words in a corpus of text to vector space. PLOS ONE. Figure. [https://doi.org/10.1371/journal.pone.0231189.g008](https://doi.org/10.1371/journal.pone.0231189.g008)*

#### Example of Semantic Similarity

Consider the words **"cat"** and **"kitten"**. These words are semantically similar because they relate to the same animal at different life stages. In vector space, their embeddings might look something like:

cat = ```[1.5, -0.4, 7.2, 19.6, 3.1, ..., 20.2]```

kitten = ```[1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]```

Notice that the two embeddings have nearly the same value in every dimension shown (the numbers here are illustrative, not real model output). This closeness in vector space reflects their semantic similarity.

In contrast, an unrelated word like **"apple"** would have a significantly different embedding, and even a related word like **"dog"** would sit farther from "cat" than "kitten" does.

### Applications of Semantic Meaning:

**Semantic Search Engines:** Improve search results by understanding the context and meaning behind queries.

**Recommendation Systems:** Suggest items similar to a user's preferences based on semantic similarities.

**Document Clustering and Classification:** Group documents with similar content, aiding in organization and retrieval.

## What is Similarity Search?

**Similarity search** involves finding items that are most similar to a given query based on their vector representations.

### How it Works

1. **Vector Representation:** Each item is represented as a vector in a multi-dimensional space.

2. **Measuring Similarity:** Mathematical functions calculate how close these vectors are to each other.

3. **Nearest Neighbor Search:** By calculating distances or angles between vectors, we identify the items most similar to our query.
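
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production index: the helper name `top_k_similar`, the labels, and the tiny 3-dimensional vectors are all made up for the example, and cosine similarity (described later in this section) is assumed as the similarity measure.

```python
import numpy as np

def top_k_similar(query, items, k=2):
    """Return (index, score) pairs for the k items most similar to the query.

    Hypothetical helper for illustration; uses cosine similarity.
    """
    query = np.asarray(query, dtype=float)
    items = np.asarray(items, dtype=float)

    # Step 2: measure similarity between the query and every item vector.
    scores = (items @ query) / (np.linalg.norm(items, axis=1) * np.linalg.norm(query))

    # Step 3: the nearest neighbors are the items with the highest scores.
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Step 1: each item is represented as a vector (made-up toy values).
labels = ["cat", "kitten", "apple"]
item_vectors = [
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.1, 0.2, 0.9],
]
query_vector = [1.0, 0.9, 0.1]

for idx, score in top_k_similar(query_vector, item_vectors, k=2):
    print(f"{labels[idx]}: {score:.3f}")
```

With these toy values, the two animal vectors rank above "apple" because they point in nearly the same direction as the query.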

### Importance of Similarity Search

**Enhanced Search Capabilities:** Allows for more accurate and relevant search results by considering context and meaning.

**Personalization:** Powers recommendation engines that tailor suggestions to individual users.

**Clustering and Classification:** Facilitates grouping similar data points, crucial in data analysis and pattern recognition.

### Techniques for Measuring Similarity

**Cosine Similarity:** Measures the angle between two vectors.

**Euclidean Distance:** Measures the straight-line distance between two vectors.

## What is Cosine Similarity?

Cosine similarity measures the similarity between two non-zero vectors by calculating the cosine of the angle between them. It is widely used in machine learning and data analysis, especially in text analysis, document comparison, search queries, and recommendation systems.

- A similarity measure quantifies how close data objects are to each other, based on their feature dimensions in a dataset.
- A smaller distance indicates a higher similarity, while a larger distance indicates a lower similarity.

Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths.

![](./resources/cosine_sim.webp)

- **A · B**: Dot product of vectors A and B.
- **||A||** and **||B||**: Magnitudes (lengths) of vectors A and B.

### Interpreting Values
- **1**: Vectors are identical in direction.
- **0**: Vectors are orthogonal (unrelated).
- **-1**: Vectors are diametrically opposite.
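
The formula and the three reference values can be checked with a short NumPy sketch. The function name `cosine_sim` and the 2-dimensional vectors are chosen only for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the vector lengths.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2], [2, 4])   # second vector is a scaled copy of the first
orthogonal = cosine_sim([1, 0], [0, 1])       # vectors at a right angle
opposite = cosine_sim([1, 2], [-1, -2])       # second vector points the other way

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three results are 1, 0, and -1.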

### Why use Cosine Similarity?

- **Focus on Direction:** Emphasizes the orientation of vectors rather than their magnitude, making it ideal for text data where word counts may vary.
- **Semantic Relevance:** Captures semantic similarity by considering the context and meaning of the data.
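
To see why ignoring magnitude helps with text, consider a short document and a longer one that uses the same words in the same proportions. The word counts and vocabulary below are invented for the example.

```python
import numpy as np

# Made-up word counts over the vocabulary ["stock", "apple", "iphone"].
short_doc = np.array([2.0, 1.0, 3.0])
long_doc = short_doc * 3  # same word proportions, three times the length

cosine = np.dot(short_doc, long_doc) / (
    np.linalg.norm(short_doc) * np.linalg.norm(long_doc)
)
euclidean = np.linalg.norm(short_doc - long_doc)

print(f"cosine similarity:  {cosine:.3f}")
print(f"euclidean distance: {euclidean:.3f}")
```

The cosine similarity is 1 because both vectors point in the same direction, while the Euclidean distance is large simply because one document is longer.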

## What is Euclidean Distance?

**Euclidean Distance** calculates the square root of the sum of squared differences between corresponding components of two vectors. It measures how far apart two vectors are in space.
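
As a minimal sketch of that definition (the function name `euclidean_distance` and the sample points are illustrative):

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared component-wise differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# The component differences are 3 and 4, so the distance is sqrt(3**2 + 4**2) = 5.
distance = euclidean_distance([1, 2], [4, 6])
print(distance)
```

The same value can be obtained with `np.linalg.norm(a - b)`.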

### Why Use Euclidean Distance?

- **Simplicity and Intuitiveness:** Easy to understand and compute.
- **Effective for Numerical and Spatial Data:** Ideal for applications where absolute differences matter.

### Limitations:

- **Curse of Dimensionality:** In high-dimensional spaces, Euclidean Distance can become less meaningful due to data sparsity.
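
One commonly cited symptom of this is that distances lose contrast: the nearest and farthest points end up almost equally far from a query. The sketch below illustrates the effect on uniformly random points; the function name, point count, and seed are arbitrary choices for the demonstration, and real embedding data may behave differently.

```python
import numpy as np

def nearest_to_farthest_ratio(dim, n_points=500, seed=0):
    """Ratio of the smallest to the largest distance from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))  # uniform random points in [0, 1)^dim
    query = rng.random(dim)
    distances = np.linalg.norm(points - query, axis=1)
    return float(distances.min() / distances.max())

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```

A ratio near 0 means the nearest neighbor is much closer than the farthest point, while a ratio near 1 means the distances are hard to tell apart. The ratio tends to rise toward 1 as the number of dimensions grows.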

## Choosing Between Cosine Similarity and Euclidean Distance:

1. **Cosine Similarity** focuses on semantic meaning and direction, making it ideal for NLP tasks and text analysis.
2. **Euclidean Distance** measures absolute closeness, excelling in numerical or spatial data applications.
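
The two measures are closely related when vectors are first scaled to unit length: the squared Euclidean distance then equals 2 minus twice the cosine similarity. The sketch below checks this identity on two random vectors (the seed and dimensionality are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Scale both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cosine = float(np.dot(a_unit, b_unit))
squared_distance = float(np.sum((a_unit - b_unit) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(angle)
print(np.isclose(squared_distance, 2 - 2 * cosine))
```

As a consequence, on unit-length vectors, ranking by smallest Euclidean distance and ranking by largest cosine similarity give the same order.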

## Implementing Cosine Similarity Search (Semantic Search) in Code:

In [1]:
import os
import json

# Load the data source that you will be doing semantic searching on
path='./resources/stock_news.json'

with open(path,'r') as f:
    loaded_data = json.load(f)

In [3]:
%pip install scikit-learn numpy pandas python-dotenv openai

Collecting python-dotenv
  Using cached python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Using cached python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.1

[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


After loading the JSON data, we need to generate embeddings for the data.

In [2]:
from dotenv import load_dotenv
from openai import OpenAI

EMBEDDING_MODEL = "text-embedding-3-large"
load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

for ticker, articles in loaded_data.items():
    for article in articles:
        embedding_res = client.embeddings.create(model=EMBEDDING_MODEL, input=article.get("full_text"))
        embedding = embedding_res.data[0].embedding
        article["embedding"] = embedding



        

Create a DataFrame with the input data plus the embeddings (this will act as our matrix):

In [3]:
import pandas as pd

dataframe_rows = []
# Loaded data items now have an embedding key-value pair
for ticker, articles in loaded_data.items():
    for article in articles:
        dataframe_rows.append({
            "ticker": article.get('ticker'),
            "title": article.get('title'),
            "link": article.get('link'),
            "full_text": article.get('full_text'),
            "embedding": article.get('embedding')
        })

df = pd.DataFrame(dataframe_rows)

print(f'\nDataFrame:\n {df}')


DataFrame:
     ticker                                              title  \
0     AAPL  Apple Inc. (AAPL) Teases New Product Launch wi...   
1     AAPL  Apple's Launch of iPhone SE4 Not Seen Impactin...   
2     AAPL  Apple Supplier Foxconn's Efforts To Make iPhon...   
3     AAPL  Is Apple Inc. (AAPL) the Most Profitable Tech ...   
4     AAPL  Apple Partners With Alibaba, Eyes Baidu for AI...   
..     ...                                                ...   
133    IBM  Corelight Cuts SIEM Ingest By Up to 80% withou...   
134    IBM  The Zacks Analyst Blog Highlights Apple, Eli L...   
135    IBM  Putting People First Drives Higher Adoption of...   
136    IBM  Cohesity Appoints Carol Carpenter as Chief Mar...   
137    IBM  Rashida Hodge of Microsoft and Gerben Bakker o...   

                                                  link  \
0    https://finance.yahoo.com/news/apple-inc-aapl-...   
1    https://finance.yahoo.com/news/apple-apos-laun...   
2    https://finance.yahoo.com/n

Get a query to search for, and generate embeddings for it:

In [4]:
user_search_query = "What new product will Apple be releasing?"
user_embedding_res = client.embeddings.create(model=EMBEDDING_MODEL, input=user_search_query)
user_in_embedding = user_embedding_res.data[0].embedding

Compute the cosine similarity between the embedded user query and the matrix of embeddings from the DataFrame:

In [9]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

embedding_matrix = np.array(df['embedding'].tolist(), dtype=np.float32)

print(f'\nEmbedding Matrix created from dataframe:\n{embedding_matrix}')

# Reshape the query into a 2D array with 1 row; -1 lets NumPy infer the number of columns
# (3072 here, because that is the length of the embedding array).
q = np.array(user_in_embedding, dtype=np.float32).reshape(1, -1)
print(f'\nOriginal User input query embedding:\n {np.concatenate([user_in_embedding[:5], user_in_embedding[-5:]])}')
print(f'\n2D vector representation of the user input query embedding:\n{q}')

# Cosine Similarity:
sims = cosine_similarity(q, embedding_matrix)[0]  # Result has shape (1, n_articles); [0] takes the single row, one score per article.

df = df.assign(similarity=sims) # Add cosine similarity scores as a new column

res_df = (
    df.sort_values("similarity", ascending=False)
      .loc[:, ["ticker", "title", "link", "similarity", "full_text"]]
      .reset_index(drop=True)
)


print(f'Cosine Similarity results:\n {res_df}')


Embedding Matrix created from dataframe:
[[-0.04479447 -0.03505534 -0.00366255 ... -0.01124704 -0.02408498
  -0.00260079]
 [-0.05163372  0.00394114 -0.00725202 ... -0.00103614 -0.01689744
   0.01646663]
 [-0.03661007 -0.00509496 -0.00808166 ... -0.01099306 -0.00697733
   0.00031634]
 ...
 [-0.01426093 -0.00085992 -0.02583767 ... -0.03460239 -0.01368575
  -0.01221583]
 [-0.01313475 -0.04058871 -0.01504347 ... -0.0174124  -0.00266882
  -0.01824228]
 [ 0.00044649 -0.02187563 -0.01777496 ... -0.01740364  0.00766454
   0.00204933]]

Original User input query embedding:
 [-0.07105432  0.00581561 -0.01986648  0.03576227 -0.03053769  0.01336187
 -0.01118714 -0.01256512 -0.03239241 -0.01304839]

2D vector representation of the user input query embedding:
[[-0.07105432  0.00581561 -0.01986648 ... -0.01256512 -0.03239241
  -0.01304839]]
Cosine Similarity results:
     ticker                                              title  \
0     AAPL  Apple Could Announce a New iPhone Tomorrow. It...   
1  