# Embeddings and vector databases from scratch

### Contents
0. Some theory
1. Embeddings from scratch (issue)
2. A second example

## 0. Some theory

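We will calculate the cosine similarity of two vectors A and B:

$$\text{cosine similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}$$

In plain English: the dot product of A and B divided by the product of their lengths (Euclidean norms). The result ranges from -1 (opposite directions) through 0 (perpendicular) to 1 (same direction).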
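
As a small worked example of the definition above, the sketch below computes the cosine similarity of two hand-picked vectors. The vectors `a` and `b` are hypothetical values chosen only because the angle between them (45 degrees) is easy to check by hand.

```python
import numpy as np

# Hypothetical example vectors, chosen only for illustration.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# Dot product divided by the product of the two Euclidean lengths.
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos_sim)  # approximately 0.7071, the cosine of 45 degrees
```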

In the first step we will use vector normalization.
Vector normalization divides a vector by its length (Euclidean norm), so the result has a length of 1 and keeps its direction. In two dimensions, this means that the endpoints of the normalized vectors lie on the unit circle, which passes through (0,1) and (1,0). In three dimensions, the endpoints all lie on the unit sphere. The same holds in spaces with more than three dimensions, which is where embeddings usually live. For normalized vectors the denominator of the cosine similarity equals 1, so the cosine similarity is simply the dot product.
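
The following sketch illustrates the normalization described above, first on a single 2D vector and then row by row on a matrix of higher-dimensional vectors. The input values (`v`, the random matrix `m`, and its shape) are assumptions made for the example. A zero vector would cause a division by zero and is not handled here.

```python
import numpy as np

# A single 2D vector with length 5 (3-4-5 triangle).
v = np.array([3.0, 4.0])
unit_v = v / np.linalg.norm(v)
print(unit_v)                  # [0.6 0.8]
print(np.linalg.norm(unit_v))  # 1.0, up to floating-point error

# Several 10-dimensional vectors, one per row (random illustrative data).
rng = np.random.default_rng(0)
m = rng.normal(size=(4, 10))

# axis=1 computes one norm per row; keepdims=True keeps the shape (4, 1)
# so that the division broadcasts across each row.
unit_m = m / np.linalg.norm(m, axis=1, keepdims=True)
print(np.linalg.norm(unit_m, axis=1))  # every row now has length 1
```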
 

## 1. Embeddings from scratch (issue)

source: https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye/

In [29]:
#import packages
import numpy as np
from collections import defaultdict
from typing import List, Tuple

In [30]:
#make a function to find the cosine similarity

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
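    norm_v1 = np.linalg.norm(v1)  # Euclidean norm (length) of v1
    norm_v2 = np.linalg.norm(v2)  # Euclidean norm (length) of v2
    # NOTE: dot_product is never defined inside this function, so Python looks it up as a global variable
    return dot_product / (norm_v1 * norm_v2)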

In [31]:
dot_product

array([0.99258333, 0.97463185, 0.95941195, 0.98270763, 0.96832966])

In [26]:
#create the Vectordatabase class itself
class VectorDatabase:
    def __init__(self):
        self.vectors = {}  # plain dict: key -> vector (defaultdict(np.ndarray) would raise a TypeError on a missing key)

    def insert(self, key: str, vector: np.ndarray) -> None:
        self.vectors[key] = vector

    def search(self, query_vector: np.ndarray, k: int) -> List[Tuple[str, float]]:
        similarities = [(key, cosine_similarity(query_vector, vector)) for key, vector in self.vectors.items()]
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:k]

    def retrieve(self, key: str) -> np.ndarray:
        return self.vectors.get(key, None)

In [34]:
# Create an instance of the VectorDatabase
vector_db = VectorDatabase()

# Insert vectors into the database
vector_db.insert("vector_1", np.array([0.1, 0.2, 0.3]))
vector_db.insert("vector_2", np.array([0.4, 0.5, 0.6]))
vector_db.insert("vector_3", np.array([0.7, 0.8, 0.9]))

# Search for similar vectors
query_vector = np.array([0.15, 0.25, 0.35])
similar_vectors = vector_db.search(query_vector, k=2)
print("Similar vectors:", similar_vectors)

# Retrieve a specific vector by its key
retrieved_vector = vector_db.retrieve("vector_1")
print("Retrieved vector:", retrieved_vector)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
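
The error is consistent with `cosine_similarity` never computing `dot_product` itself. The name resolves to the global `dot_product` array left over from an earlier cell (the five values shown in cell 31), so every "similarity" is an array rather than a single number, and `sort` cannot compare arrays. A minimal sketch of a fix, assuming the intended numerator is `np.dot(v1, v2)`, is shown below. The plain dictionary `stored` is a stand-in for the `VectorDatabase` instance, used only to keep the example self-contained.

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    # Compute the dot product locally instead of relying on a global variable.
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)
    return float(dot_product / (norm_v1 * norm_v2))

# Same vectors and query as in the cell above.
stored = {
    "vector_1": np.array([0.1, 0.2, 0.3]),
    "vector_2": np.array([0.4, 0.5, 0.6]),
    "vector_3": np.array([0.7, 0.8, 0.9]),
}
query_vector = np.array([0.15, 0.25, 0.35])

# Each score is now a single float, so sorting works.
scores = [(key, cosine_similarity(query_vector, vec)) for key, vec in stored.items()]
scores.sort(key=lambda item: item[1], reverse=True)
print("Similar vectors:", scores[:2])
```

With this version of the function, rerunning the `VectorDatabase` cells should no longer raise the `ValueError`, and `vector_1` ranks first for this query.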

## 2. A second example

source: https://twitter.com/akshay_pachaar/status/1678381104917782530/photo/1

In [19]:
import numpy as np

my_array = np.array([1,2,3])

vectors = np.array([[2,3,4], [4,5,6], [7,8,9], [3,4,5], [5,6,7]])

#normalize the vectors and the array using the linalg.norm (vector normalization)
vector_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

my_array_norm = my_array / np.linalg.norm(my_array)

#compute the dot product of the normalised vectors with the normalised array (this equals the cosine similarity)
dot_product = np.dot(vector_norm, my_array_norm)

#find the index of the nearest vector in 'vectors'
nearest_vector_index = np.argmax(dot_product)

print(f'Index of the nearest vector: {nearest_vector_index}')
print(f'Value of nearest vector: {vectors[nearest_vector_index]}')

Index of the nearest vector: 0
Value of nearest vector: [2 3 4]
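
`np.argmax` returns only the single best match. To retrieve the k nearest vectors instead, as the `search` method in section 1 is meant to do, one option is `np.argsort`. The sketch below reuses the data from the cell above. The value `k = 3` and the names `scores` and `top_k` are assumptions made for the example.

```python
import numpy as np

my_array = np.array([1, 2, 3])
vectors = np.array([[2, 3, 4], [4, 5, 6], [7, 8, 9], [3, 4, 5], [5, 6, 7]])

# Normalize rows and query, then take dot products (cosine similarities).
vector_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
my_array_norm = my_array / np.linalg.norm(my_array)
scores = vector_norm @ my_array_norm

k = 3  # hypothetical number of neighbours to return

# argsort sorts ascending, so reverse it and keep the first k indices.
top_k = np.argsort(scores)[::-1][:k]

for idx in top_k:
    print(idx, vectors[idx], scores[idx])
```

For large collections, `np.argpartition` avoids a full sort, but a plain `argsort` is easier to read at this scale.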


In [20]:
#print the normalised vectors, the normalised array and the similarity scores
print(vector_norm)
print(my_array_norm)
print(dot_product)

[[0.37139068 0.55708601 0.74278135]
 [0.45584231 0.56980288 0.68376346]
 [0.50257071 0.57436653 0.64616234]
 [0.42426407 0.56568542 0.70710678]
 [0.47673129 0.57207755 0.66742381]]
[0.26726124 0.53452248 0.80178373]
[0.99258333 0.97463185 0.95941195 0.98270763 0.96832966]
