# Vector Search <- Semantic Search <- Representation Learning

## Representation Learning

Representation learning is what took machine learning from narrow tasks like handwriting recognition to conversational agents. Machine learning is the art and science of using algorithms to write programs too complex to write by hand. Representation learning automates the creation of the _features_ given to a machine learning algorithm: the features are learned from data instead of being engineered by hand.

### Introduction to Natural Language Processing

Take a look at the handout <a href="../Introduction to NLP for Chatbot Course.pdf">Introduction to NLP for Chatbot Course.pdf</a> to make sure you have an understanding of the basics of Natural Language Processing (NLP). We are going to discuss this handout, so skim the sections you already know :)

One technique used in NLP is to _embed_ unstructured objects like text or semi-structured objects like JSON into a vector embedding that represents the object.

## Text Embeddings

The Python library [Sentence Transformers](https://www.sbert.net/) ([PyPI module sentence-transformers](https://pypi.org/project/sentence-transformers/0.3.0/)) lets you create a fixed-length embedding of a variable-length block of text.
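
To make "fixed length from variable length" concrete, here is a minimal sketch of one common pooling strategy, mean pooling, which averages per-token vectors into a single vector. The token vectors below are made up for illustration, and `mean_pool` is a hypothetical helper, not part of the Sentence Transformers API.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average per-token vectors into one fixed-length vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Hypothetical 4-dimensional token vectors for a 2-token and a 3-token text
short_text = [[1.0, 0.0, 2.0, 4.0], [3.0, 2.0, 0.0, 0.0]]
long_text = [[0.0, 3.0, 3.0, 1.0], [6.0, 0.0, 0.0, 1.0], [0.0, 0.0, 3.0, 1.0]]

short_embedding = mean_pool(short_text)
long_embedding = mean_pool(long_text)

# Both texts end up with the same number of dimensions, whatever their length
print(short_embedding)  # [2.0, 1.0, 1.0, 2.0]
print(long_embedding)   # [2.0, 1.0, 2.0, 1.0]
```

Whatever the number of tokens, the output has as many dimensions as a single token vector, which is what makes embeddings of different texts directly comparable.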

We are going to demonstrate semantic vector comparison techniques using both open source / open data models like [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) and [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), and hosted models like OpenAI's [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings/embedding-models).

> [all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) and [paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2) - This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.

## Semantic, Vector Search

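Semantic search means searching by the meaning of an object rather than by exact keyword match. Modern vector search for LLMs pairs an embedding model with a vector database: documents and queries are embedded into the same vector space, and the stored vectors nearest to the query vector are returned as results.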
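
As a rough illustration of that idea, the sketch below ranks a tiny corpus by cosine similarity to a query vector. The 3-dimensional vectors, the document IDs, and the `semantic_search` helper are all invented for the example; real embeddings would come from a model and have hundreds of dimensions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(
    query_vec: List[float], corpus: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k documents whose vectors are most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings: the first two documents point in a similar direction
corpus = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_taxes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

results = semantic_search(query, corpus, k=2)
print([doc_id for doc_id, _ in results])  # ['doc_cats', 'doc_dogs']
```

A vector database does the same job at scale, replacing the full sort with an index built for fast nearest-neighbor lookup.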

### FAISS (Facebook AI Similarity Search)

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors. It is commonly used for large-scale vector search, such as finding similar images or related text documents. FAISS is optimized for collections of millions to billions of high-dimensional vectors, and it offers several index types that trade accuracy for speed and memory.

In [1]:
# Could not get this to work on the Docker image
!pip install faiss-cpu



In [2]:
import json
import os
from typing import List

import faiss
import numpy as np
import openai
import pandas as pd
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer

# Setting TOKENIZERS_PARALLELISM explicitly silences the Hugging Face tokenizers fork warning
os.environ["TOKENIZERS_PARALLELISM"] = "true"
openai.api_key = os.environ.get("OPENAI_API_KEY")

## Comparing Embeddings using Cosine Similarity, Indexing them in FAISS

In [33]:
models = {
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": SentenceTransformer(
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    ),
    "sentence-transformers/all-MiniLM-L12-v2": SentenceTransformer(
        "sentence-transformers/all-MiniLM-L12-v2"
    ),
    # Placeholder value: OpenAI embeddings are fetched from the API, not from a local model object
    "openai/text-embedding-ada-002": 1,
}

In [177]:
def compare_records_to_df(record_pairs: List[List[str]]):
    """Build a pd.DataFrame of cosine similarities for each record pair under each model.

    Uses the module-level `models` dict defined above.

    Parameters
    ----------
    record_pairs : List[List[str]]
        Pairs of records to compare

    Returns
    -------
    Tuple[pd.DataFrame, Dict[str, list], Dict[str, faiss.IndexFlatL2]]
        The similarity table, the raw embeddings per model, and one FAISS index per model
    """

    values = {k: [] for k in models.keys()}
    embeddings = {k: [] for k in models.keys()}
    # Embedding sizes, listed in the same order as the keys of `models`
    dimensions = {k: dims for k, dims in zip(models.keys(), [384, 384, 1536])}
    indexes = {k: faiss.IndexFlatL2(dims) for k, dims in dimensions.items()}

    rows = []
    for name_one, name_two in record_pairs:
        scores = []

        for model_name in models.keys():

            if model_name.startswith("sentence-transformers"):
                model = models[model_name]
                embedding_one = model.encode(name_one)
                embedding_two = model.encode(name_two)

            elif model_name.startswith("openai"):
                # Legacy (pre-1.0) openai client call; openai>=1.0 removed `openai.Embedding`
                text_model = model_name.split("/")[-1]
                data_obj = (
                    openai.Embedding.create(input=[name_one, name_two], model=text_model)
                )["data"]
                embedding_one, embedding_two = [d["embedding"] for d in data_obj]

            embeddings[model_name].append([embedding_one])
            embeddings[model_name].append([embedding_two])

            values[model_name].append(name_one)
            values[model_name].append(name_two)

            # scipy's `cosine` is a distance, so 1 - distance is the cosine similarity
            score = 1.0 - cosine(embedding_one, embedding_two)
            scores.append(score)

        # Scores follow the key order of `models`: paraphrase, all, openai
        rows.append([name_one, name_two, scores[0], scores[1], scores[2]])

    # NOTE: this runs once, after both loops, so `model_name` is the last key in `models`.
    # Only the OpenAI index receives vectors; the sentence-transformers indexes stay empty.
    embed_ary = np.array(embeddings[model_name], dtype="float32").reshape(
        len(embeddings[model_name]), dimensions[model_name]
    )
    print(f"embed_ary shape: {embed_ary.shape}")
    indexes[model_name].add(embed_ary)

    # Column labels match the order in which the scores were appended
    df = pd.DataFrame(rows, columns=["Name One", "Name Two", "Paraphrase Cosine", "All Cosine", "OpenAI Cosine"])

    return df, embeddings, indexes
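
The `score` line in the function above subtracts SciPy's `cosine` from 1 because `scipy.spatial.distance.cosine` is a distance, not a similarity. For reference, writing the two embeddings as $a$ and $b$ (symbols chosen here for illustration):

$$
\operatorname{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert},
\qquad
\operatorname{dist}(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$

So `1.0 - cosine(a, b)` recovers $\operatorname{sim}(a, b)$, which ranges from $-1$ to $1$, with $1$ meaning the two vectors point in the same direction.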

In [178]:
name_pairs = np.array(
    [
        ["Russell H Jurney", "Russell Jurney"],
        ["Russ H. Jurney", "Russell Jurney"],
        ["Russ H Jurney", "Russell Jurney"],
        ["Russ Howard Jurney", "Russell H Jurney"],
        ["Russell H. Jurney", "Russell Howard Jurney"],
        ["Russell H Jurney", "Russell Howard Jurney"],
        ["Alex Ratner", "Alexander Ratner"],
        ["ʿAlī ibn Abī Ṭālib", "عَلِيّ بْن أَبِي طَالِب"],
        ["Igor Berezovsky", "Игорь Березовский"],
        ["Oleg Konovalov", "Олег Коновалов"],
        ["Ben Lorica", "罗瑞卡"],
        ["Sam Smith", "Tom Jones"],
        ["Sam Smith", "Ron Smith"],
        ["Sam Smith", "Samuel Smith"],
    ]
)

In [179]:
json_pairs = np.array(
    [
        [
            json.dumps({"name": "Russell H Jurney", "birthday": "02/01/1980"}),
            json.dumps({"name": "Russell Jurney", "birthday": "02/01/1990"}),
        ],
        [
            json.dumps({"name": "Russ H. Jurney", "birthday": "02/01/1980"}),
            json.dumps({"name": "Russell Jurney", "birthday": "02/01/1991"}),
        ],
        [
            json.dumps({"name": "Russ H Jurney", "birthday": "02/01/1980"}),
            json.dumps({"name": "Russell Jurney", "birthday": "02/02/1990"}),
        ],
        [
            json.dumps({"name": "Russ Howard Jurney", "birthday": "02/01/1980"}),
            json.dumps({"name": "Russell H Jurney", "birthday": "02/01/1990"}),
        ],
        [
            json.dumps({"name": "Russell H. Jurney", "birthday": "02/01/1980"}),
            json.dumps({"name": "Russell Howard Jurney", "birthday": "02/01/1990"}),
        ],
        [
            json.dumps({"name": "Russell H Jurney", "birthday": "02/01/1980"}),
            json.dumps({"name": "Russell Howard Jurney", "birthday": "02/01/1990"}),
        ],
        [
            json.dumps({"name": "Alex Ratner", "birthday": "02/01/1901"}),
            json.dumps({"name": "Alexander Ratner", "birthday": "02/01/1976"}),
        ],
        [
            json.dumps({"name": "ʿAlī ibn Abī Ṭālib", "birthday": "02/01/1980"}),
            json.dumps({"name": "عَلِيّ بْن أَبِي طَالِب", "birthday": "02/01/1980"}),
        ],
        [
            json.dumps({"name": "Igor Berezovsky", "birthday": "01/01/1980"}),
            json.dumps({"name": "Игорь Березовский", "birthday": "02/03/1908"}),
        ],
        [
            json.dumps({"name": "Oleg Konovalov", "birthday": "02/01/1980"}),
            json.dumps({"name": "Олег Коновалов", "birthday": "05/04/1980"}),
        ],
        [
            json.dumps({"name": "Ben Lorica", "birthday": "02/01/1980"}),
            json.dumps({"name": "罗瑞卡", "birthday": "02/01/1980"}),
        ],
        [
            json.dumps({"name": "Sam Smith", "birthday": "02/01/1980"}),
            json.dumps({"name": "Tom Jones", "birthday": "02/01/1976"}),
        ],
        [
            json.dumps({"name": "Sam Smith", "birthday": "02/01/1980"}),
            json.dumps({"name": "Ron Smith", "birthday": "02/01/2001"}),
        ],
        [
            json.dumps({"name": "Sam Smith", "birthday": "02/01/1980"}),
            json.dumps({"name": "Samuel Smith", "birthday": "02/01/1801"}),
        ],
        [
            json.dumps({"name": "Samuel Smith", "birthday": "02/01/1980"}),
            json.dumps({"name": "Samuel Smith", "birthday": "02/01/1980"}),
        ],
        [
            json.dumps({"name": "Samuel Smith", "birthday": "02/01/1980"}),
            json.dumps({"name": "Samuel Smith", "birthday": "02/01/1991"}),
        ],
        [
            json.dumps({"name": "Samuel Smith", "birthday": "02/01/1980"}),
            json.dumps({"name": "Samuel Smith", "birthday": "02/01/2011"}),
        ],
    ]
)

In [180]:
def highlight_max(s):
    # Convert to numeric, non-convertible values will become NaN
    numeric_s = pd.to_numeric(s, errors="coerce")

    # Identify max value; NaNs are ignored by default
    max_val = numeric_s.max()

    # Highlight max numeric value, ignore non-numeric or NaN
    return ["color: green" if (cell == max_val and pd.notna(cell)) else "" for cell in numeric_s]

In [183]:
name_pair_df, name_embeddings, name_indexes = compare_records_to_df(name_pairs)
name_pair_df.style.apply(highlight_max, axis=1)

embed_ary shape: (28, 1536)


| | Name One | Name Two | Paraphrase Cosine | All Cosine | OpenAI Cosine |
|---|---|---|---|---|---|
| 0 | Russell H Jurney | Russell Jurney | 0.962018 | 0.952299 | 0.977396 |
| 1 | Russ H. Jurney | Russell Jurney | 0.829617 | 0.854041 | 0.959307 |
| 2 | Russ H Jurney | Russell Jurney | 0.843623 | 0.871992 | 0.965398 |
| 3 | Russ Howard Jurney | Russell H Jurney | 0.864891 | 0.849141 | 0.933938 |
| 4 | Russell H. Jurney | Russell Howard Jurney | 0.91484 | 0.875086 | 0.900228 |
| 5 | Russell H Jurney | Russell Howard Jurney | 0.923309 | 0.901534 | 0.922573 |
| 6 | Alex Ratner | Alexander Ratner | 0.893104 | 0.800972 | 0.964629 |
| 7 | ʿAlī ibn Abī Ṭālib | عَلِيّ بْن أَبِي طَالِب | 0.555111 | 0.44387 | 0.862945 |
| 8 | Igor Berezovsky | Игорь Березовский | 0.922118 | 0.322799 | 0.893567 |
| 9 | Oleg Konovalov | Олег Коновалов | 0.965302 | 0.3392 | 0.918567 |


In [184]:
json_name_df, json_embeddings, json_indexes = compare_records_to_df(json_pairs)
json_name_df.style.apply(highlight_max, axis=1)

embed_ary shape: (34, 1536)


| | Name One | Name Two | Paraphrase Cosine | All Cosine | OpenAI Cosine |
|---|---|---|---|---|---|
| 0 | `{"name": "Russell H Jurney", "birthday": "02/01/1980"}` | `{"name": "Russell Jurney", "birthday": "02/01/1990"}` | 0.945763 | 0.98197 | 0.987332 |
| 1 | `{"name": "Russ H. Jurney", "birthday": "02/01/1980"}` | `{"name": "Russell Jurney", "birthday": "02/01/1991"}` | 0.903774 | 0.929158 | 0.975243 |
| 2 | `{"name": "Russ H Jurney", "birthday": "02/01/1980"}` | `{"name": "Russell Jurney", "birthday": "02/02/1990"}` | 0.896312 | 0.931978 | 0.983517 |
| 3 | `{"name": "Russ Howard Jurney", "birthday": "02/01/1980"}` | `{"name": "Russell H Jurney", "birthday": "02/01/1990"}` | 0.928514 | 0.925558 | 0.957646 |
| 4 | `{"name": "Russell H. Jurney", "birthday": "02/01/1980"}` | `{"name": "Russell Howard Jurney", "birthday": "02/01/1990"}` | 0.931395 | 0.962626 | 0.956308 |
| 5 | `{"name": "Russell H Jurney", "birthday": "02/01/1980"}` | `{"name": "Russell Howard Jurney", "birthday": "02/01/1990"}` | 0.926443 | 0.967375 | 0.960998 |
| 6 | `{"name": "Alex Ratner", "birthday": "02/01/1901"}` | `{"name": "Alexander Ratner", "birthday": "02/01/1976"}` | 0.916344 | 0.931648 | 0.977333 |
| 7 | `{"name": "\u02bfAl\u012b ibn Ab\u012b \u1e6c\u0101lib", "birthday": "02/01/1980"}` | `{"name": "\u0639\u064e\u0644\u0650\u064a\u0651 \u0628\u0652\u0646 \u0623\u064e\u0628\u0650\u064a \u0637\u064e\u0627\u0644\u0650\u0628", "birthday": "02/01/1980"}` | 0.755785 | 0.812418 | 0.931003 |
| 8 | `{"name": "Igor Berezovsky", "birthday": "01/01/1980"}` | `{"name": "\u0418\u0433\u043e\u0440\u044c \u0411\u0435\u0440\u0435\u0437\u043e\u0432\u0441\u043a\u0438\u0439", "birthday": "02/03/1908"}` | 0.608111 | 0.603985 | 0.886545 |
| 9 | `{"name": "Oleg Konovalov", "birthday": "02/01/1980"}` | `{"name": "\u041e\u043b\u0435\u0433 \u041a\u043e\u043d\u043e\u0432\u0430\u043b\u043e\u0432", "birthday": "05/04/1980"}` | 0.572889 | 0.686509 | 0.901305 |


## Vector Search in FAISS

Above we created a `faiss.IndexFlatL2(384)` for each sentence transformer and a `faiss.IndexFlatL2(1536)` for the OpenAI embedding. Let's search for some different names to find the records they are closest to.

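### The Advantages of FAISS

Vector search lets us look up names by similarity of meaning rather than by exact string match. Note that `IndexFlatL2` is an exact, brute-force index: it compares the query against every stored vector. FAISS's approximate index types (such as its IVF and HNSW variants) are what avoid exhaustively comparing all records on large datasets.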
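
To show what an exact L2 search computes, here is a small pure-Python stand-in. It is only a sketch of the brute-force idea, not FAISS's implementation, and `flat_l2_search` is a hypothetical name. Like `IndexFlatL2`, it reports squared Euclidean distances.

```python
from typing import List, Tuple


def flat_l2_search(
    stored: List[List[float]], query: List[float], k: int
) -> Tuple[List[float], List[int]]:
    """Compare the query with every stored vector and keep the k closest."""
    scored = []
    for position, vec in enumerate(stored):
        squared_distance = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((squared_distance, position))
    scored.sort()  # smallest squared distance first
    top = scored[:k]
    return [dist for dist, _ in top], [pos for _, pos in top]


stored_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
nearest_distances, nearest_positions = flat_l2_search(stored_vectors, [1.0, 0.75], k=2)

print(nearest_positions)  # [1, 3]
print(nearest_distances)  # [0.0625, 0.5625]
```

The cost grows linearly with the number of stored vectors, which is fine for a few dozen names and is the reason approximate indexes exist for much larger collections.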

In [185]:
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
value = "Russell H Jurney"
embedding = models[model_name].encode(value)
shaped_embedding = embedding.reshape(1, 384)
distances, indexes = name_indexes[model_name].search(shaped_embedding, k=99)

In [186]:
shaped_embedding.shape, embedding.shape, distances, indexes

((1, 384),
 (384,),
 array([[3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.4028235e+38, 3.4028235e+38, 3.4028235e+38, 3.4028235e+38,
         3.402

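### Annnnnd... it doesn't work. Ok, let's get simple :)

Every distance is `3.4028235e+38`, the largest finite `float32` value, which is what FAISS returns (alongside an index of `-1`) for a result slot it cannot fill. The cause is in `compare_records_to_df`: it calls `add` only once, after its loops finish, so only the last model's index (OpenAI) receives any vectors, and the sentence-transformers index we just queried is empty. Let's strip things down to a minimal example to see FAISS working.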

In [187]:
import faiss
import numpy as np

# Example dataset with keys and vectors
keys = np.array([1, 2, 3])  # These could be IDs or any unique identifiers
vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype="float32")  # Example 2D vectors; FAISS works with float32

# Build a FAISS index (IndexFlatL2 needs no training, so vectors can be added right away)
dimension = 2  # Dimension of the vectors
index = faiss.IndexFlatL2(dimension)
index.add(vectors)

# Querying the index
query_vector = np.array([[0.15, 0.25]], dtype="float32")
k = 2  # Number of nearest neighbors to find
distances, indices = index.search(query_vector, k)

# Retrieve keys for the nearest neighbors (row i of the index corresponds to keys[i])
nearest_neighbor_keys = keys[indices]

print("Nearest Neighbors:", nearest_neighbor_keys)

Nearest Neighbors: [[1 2]]
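
One caveat with the key lookup: when FAISS finds fewer than `k` neighbors it pads the result with index `-1`, and in both NumPy and plain Python lists `-1` selects the last element, so a direct lookup would silently return the last key. A minimal sketch of a guard, using plain lists and a hypothetical `keys_for_neighbors` helper:

```python
from typing import List, Optional


def keys_for_neighbors(keys: List[int], neighbor_indices: List[int]) -> List[Optional[int]]:
    """Map search positions to keys, turning the -1 'no result' sentinel into None."""
    return [keys[i] if i >= 0 else None for i in neighbor_indices]


keys = [1, 2, 3]
# Hypothetical result row for k=5 against an index that holds only 3 vectors
neighbor_indices = [0, 1, 2, -1, -1]

naive = [keys[i] for i in neighbor_indices]  # -1 wraps around to the last key
safe = keys_for_neighbors(keys, neighbor_indices)

print(naive)  # [1, 2, 3, 3, 3]
print(safe)   # [1, 2, 3, None, None]
```

Keeping `k` no larger than the number of indexed vectors avoids the padding in the first place.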
