# What are embeddings?

Embeddings are used to transform textual data into a format that can be easily processed by machine learning algorithms. An embedding is a numerical representation of a word, phrase, or entire document that captures its semantic meaning and its relationships to other texts. Each embedding is an n-dimensional vector, that is, a single point in an n-dimensional space. Two points that are close together in this space are also closely related in semantic meaning.
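
As a rough illustration of "close in space means close in meaning", the sketch below measures distances between three made-up vectors. The 3-dimensional values are invented for this example and do not come from any model; real embeddings have far more dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_embeddings = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "kitten": np.array([0.85, 0.75, 0.2]),
    "diamond": np.array([0.1, 0.2, 0.95]),
}


def euclidean_distance(a, b):
    """Straight-line distance between two points in the embedding space."""
    return float(np.linalg.norm(a - b))


cat_kitten = euclidean_distance(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_diamond = euclidean_distance(toy_embeddings["cat"], toy_embeddings["diamond"])

print(f"cat <-> kitten:  {cat_kitten:.3f}")
print(f"cat <-> diamond: {cat_diamond:.3f}")
```

With these toy values, the related pair ends up much closer together than the unrelated pair.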

# What's so good about them?
Compared with a bag-of-words approach, embeddings offer a few key benefits.
1. Because embeddings are dense vectors of a fixed size, they are typically more memory-efficient than one-hot encoded texts, whose dimensionality grows with the vocabulary, even when those are stored as sparse arrays.
2. Our ability to compare two texts does not rely on them having some or even any overlapping text. Comparing embeddings allows us to treat text as a collection of interconnected concepts rather than a set of lexical symbols. This more closely mimics how humans understand language and it's why most if not all large language models use embeddings.
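
The second point can be seen from the bag-of-words side with a minimal sketch. The two sentences and the helper names `bag_of_words` and `jaccard_overlap` are assumptions made for this example. The sentences are related in meaning but share no words, so a purely lexical comparison finds nothing in common.

```python
def bag_of_words(text):
    """Return the set of lowercase tokens in a text (a minimal bag of words)."""
    return set(text.lower().split())


def jaccard_overlap(text_a, text_b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a, words_b = bag_of_words(text_a), bag_of_words(text_b)
    return len(words_a & words_b) / len(words_a | words_b)


# Two sentences with related meaning but no shared words.
sentence_a = "The physician treated her patient"
sentence_b = "A doctor cared for his client"

overlap = jaccard_overlap(sentence_a, sentence_b)
print(overlap)
```

A lexical measure like this one cannot tell these two sentences apart from two entirely unrelated ones, whereas a comparison of embeddings does not depend on shared tokens.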

# Loading OpenAI API key

In [127]:
import os
from dotenv import load_dotenv
from pathlib import Path

# Path to the .env file in the parent directory
env_path = Path("..") / ".env"

# Load environment variables from .env file
load_dotenv(dotenv_path=env_path)

# Get the OpenAI API key from the environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")

# Retrieving OpenAI embeddings

OpenAI's text-embedding-ada-002 model provides a fast and cost-effective way of retrieving high-performance embeddings for our text. While we could use a pre-trained model from the likes of Hugging Face, those embeddings are unlikely to perform as well as the ones provided by OpenAI on tasks such as semantic search. We would also be limited by the availability of local compute, whereas with OpenAI we are able to make our embedding requests concurrently.
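
To make the idea of sending many requests at once concrete, here is a minimal sketch that uses a thread pool with a hypothetical stand-in function, `fake_embedding`, in place of a real API call. The function, its sleep-based latency, and its return values are assumptions made purely for illustration. The sketch relies on `executor.map`, which yields results in the same order as its inputs.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_embedding(text):
    """Hypothetical stand-in for a network call; NOT a real embedding."""
    time.sleep(0.05)  # simulate request latency
    return [float(len(text)), float(text.count("a"))]


texts = ["Joy", "Sadness", "Anger", "Fear"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    # The calls run concurrently, but map() returns results in input order.
    results = list(executor.map(fake_embedding, texts))
elapsed = time.perf_counter() - start

for text, vector in zip(texts, results):
    print(text, vector)
print(f"elapsed: {elapsed:.2f}s")
```

Because the threads wait on their simulated requests at the same time, the total run time should be close to that of a single call rather than the sum of all four.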

In [128]:
import openai
import concurrent.futures  # a bare "import concurrent" does not load the futures submodule
import numpy as np

def get_embedding(text, model="text-embedding-ada-002"):
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']


def parallel_embedding(text_list, model="text-embedding-ada-002"):
    # Create a ThreadPoolExecutor
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        # Submit tasks to the executor and remember the order
        futures = []
        mapping = dict()
        for i, text in enumerate(text_list):
            future = executor.submit(get_embedding, text, model)
            futures.append(future)
            mapping[future] = i

        # Retrieve results as they become available and sort their order
        embeddings = [None] * len(futures)
        for future in concurrent.futures.as_completed(futures):
            embeddings[mapping[future]] = future.result()
    return np.array(embeddings)

In [129]:
emotions = [
    ['Joy', 'Happiness', 'Excitement'],
    ['Sadness', 'Grief', 'Sorrow'],
    ['Anger', 'Rage', 'Fury'],
    ['Fear', 'Dread', 'Terror'],
    ['Surprise', 'Astonishment', 'Amazement']
]

gemstones = [
    ['Diamond', 'Radiant', 'Gem'],
    ['Ruby', 'Garnet', 'Rhodolite'],
    ['Sapphire', 'Topaz', 'Aquamarine'],
    ['Emerald', 'Peridot', 'Jade'],
    ['Amethyst', 'Quartz', 'Lavenderite']
]

fruits = [
    ['Apple', 'Fruit', 'Crisp'],
    ['Orange', 'Mandarin', 'Clementine'],
    ['Banana', 'Plantain', 'Manzano'],
    ['Kiwi', 'Fuzzy', 'Refreshing'],
    ['Strawberry', 'Raspberry', 'Blueberry']
]

animals = [
    ['Cat', 'Feline', 'Kitty'],
    ['Panther', 'Jaguar', 'Leopard'],
    ['Jaguar', 'Leopard', 'Cheetah'],
    ['Lion', 'Lioness', 'Cub'],
    ['Tiger', 'Bengal', 'Siberian']
]


def concatenate_lists(list_of_lists):
    return [', '.join(inner_list) for inner_list in list_of_lists]

emotions = concatenate_lists(emotions)
gemstones = concatenate_lists(gemstones)
fruits = concatenate_lists(fruits)
animals = concatenate_lists(animals)

# Retrieve the embeddings
embeddings = parallel_embedding(emotions + gemstones + fruits + animals)

# Split the embeddings for emotions, gemstones, fruits, and animals
emotions_embeddings = embeddings[:5]
gemstones_embeddings = embeddings[5:10]
fruits_embeddings = embeddings[10:15]
animals_embeddings = embeddings[15:]

# Display the shape of the matrices
print("Emotions Embeddings Shape:", emotions_embeddings.shape)
print("Gemstones Embeddings Shape:", gemstones_embeddings.shape)
print("Fruits Embeddings Shape:", fruits_embeddings.shape)
print("Animals Embeddings Shape:", animals_embeddings.shape)

Emotions Embeddings Shape: (5, 1536)
Gemstones Embeddings Shape: (5, 1536)
Fruits Embeddings Shape: (5, 1536)
Animals Embeddings Shape: (5, 1536)


Retrieving embeddings for our emotions, gemstones, fruits and animals gives four matrices, each with 5 rows and 1536 columns. Each row is one embedding, and its 1536 values give the position of that vector in the 1536-dimensional embedding space.

# K-means clustering

As we said previously, vectors that are close in the embedding space are close in semantic meaning. To illustrate this further let's use k-means clustering to group our embedding vectors into 4 distinct clusters. We'll label our data points with the original texts that were used to create the embeddings so that we can examine which texts have been placed in which clusters.

In [130]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from collections import defaultdict

# Combine the labels
labels = emotions + gemstones + fruits + animals

# Use K-Means Clustering
num_clusters = 4  # Specify the number of clusters
kmeans = KMeans(n_clusters=num_clusters)
cluster_labels = kmeans.fit_predict(embeddings)

# Create a dictionary to store cluster labels for each category
cluster_dict = {label: cluster for label, cluster in zip(labels, cluster_labels)}

# Collect labels into lists based on cluster
clusters = defaultdict(list)
for label, cluster in cluster_dict.items():
    clusters[cluster].append(label)

# Print the cluster assignments for each category
for cluster, l in clusters.items():
    print(f"Members belonging to cluster {cluster}")
    for label in l:
        print(f"\t{label}")
    print()

Members belonging to cluster 2
	Joy, Happiness, Excitement
	Sadness, Grief, Sorrow
	Anger, Rage, Fury
	Fear, Dread, Terror
	Surprise, Astonishment, Amazement

Members belonging to cluster 1
	Diamond, Radiant, Gem
	Ruby, Garnet, Rhodolite
	Sapphire, Topaz, Aquamarine
	Emerald, Peridot, Jade
	Amethyst, Quartz, Lavenderite

Members belonging to cluster 3
	Apple, Fruit, Crisp
	Orange, Mandarin, Clementine
	Banana, Plantain, Manzano
	Kiwi, Fuzzy, Refreshing
	Strawberry, Raspberry, Blueberry

Members belonging to cluster 0
	Cat, Feline, Kitty
	Panther, Jaguar, Leopard
	Jaguar, Leopard, Cheetah
	Lion, Lioness, Cub
	Tiger, Bengal, Siberian



The ability of this unsupervised algorithm to correctly group our input texts based on their dense vector embeddings clearly demonstrates that the semantics of the text have been captured successfully.

# Cosine similarity vs Euclidean distance
K-means clustering uses Euclidean distance: it looks for a clustering that minimises the sum of squared straight-line distances between the members of each cluster and that cluster's centroid.
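
The quantity that k-means tries to minimise, the sum of squared Euclidean distances from each point to the centroid of its cluster, can be computed directly. The sketch below does so for a toy 2-D dataset; the points, the two candidate assignments, and the helper name `within_cluster_sse` are all invented for illustration.

```python
import numpy as np


def within_cluster_sse(points, assignments):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in np.unique(assignments):
        members = points[assignments == cluster]
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total


# Toy 2-D points forming two obvious groups (invented for illustration).
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

good = np.array([0, 0, 1, 1])  # groups nearby points together
bad = np.array([0, 1, 0, 1])   # mixes the two groups

good_sse = within_cluster_sse(points, good)
bad_sse = within_cluster_sse(points, bad)

print("good assignment:", good_sse)
print("bad assignment: ", bad_sse)
```

The assignment that keeps nearby points together has a far smaller total, which is why k-means prefers it.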

A more appropriate measure of similarity is cosine similarity. It has several advantages over Euclidean distance for comparing texts, but the main points are:
1. Cosine similarity is a measure of orientation, not magnitude. This makes it particularly suitable for text data, where the magnitude of a vector can vary significantly (for count-based representations it grows with the length of the document), but we're more interested in the direction (i.e., the context or topic).

2. Text embeddings, particularly those derived from models like Word2Vec, FastText, or BERT, can be high-dimensional. Cosine similarity works well in high-dimensional spaces, where Euclidean distance can become less meaningful due to the curse of dimensionality. The curse of dimensionality says that in a high-dimensional space "all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient."

3. Cosine similarity values lie between -1 and 1 (though, in practice, for text embeddings, they often lie between 0 and 1). A value of 1 indicates that the vectors are identical in orientation, 0 indicates orthogonality (completely dissimilar), and -1 indicates opposite direction (though this negative case is rare in text embeddings). This bounded range is convenient and intuitive for many applications.
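
The properties listed above can be checked with a small NumPy sketch. The vectors and the helper name `cosine_sim` are made up for this example (the name was chosen so that it does not shadow scikit-learn's `cosine_similarity`).

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot product over product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0])
same_direction = 3.0 * a            # longer vector, same orientation
orthogonal = np.array([-2.0, 1.0])  # at right angles to a
opposite = -a                       # points the opposite way

sim_same = cosine_sim(a, same_direction)
sim_orthogonal = cosine_sim(a, orthogonal)
sim_opposite = cosine_sim(a, opposite)

# Euclidean distance, by contrast, is sensitive to the change in magnitude.
dist_same = float(np.linalg.norm(a - same_direction))

print("same direction:", sim_same)
print("orthogonal:    ", sim_orthogonal)
print("opposite:      ", sim_opposite)
print("Euclidean distance to the scaled copy:", dist_same)
```

Scaling a vector leaves its cosine similarity to the original at its maximum, while the Euclidean distance between the two is clearly non-zero.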