<a href="https://colab.research.google.com/github/HarshJ23/VectorDB_from_scratch/blob/main/VectorDB_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import numpy as np

### Choosing a similarity metric
##### This implementation uses cosine similarity to compare vectors: a higher score means the two vectors point in more similar directions.



#### Cosine similarity formula:
![Cosine similarity formula](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*LfW66-WsYkFqWc4XYJbEJg.png)
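
In case the image does not load, the standard definition of cosine similarity can be written out as below. The symbols $A$, $B$ and $n$ are chosen here for illustration and may differ from those used in the image; $A$ and $B$ are the two vectors being compared and $n$ is their common length.

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

The result lies in $[-1, 1]$: $1$ for vectors pointing in the same direction, $0$ for orthogonal vectors, and $-1$ for opposite directions. The formula is undefined when either vector has zero norm.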



In [6]:
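def cosine_similarity(vec1, vec2):
  """Return the cosine similarity of two vectors, or 0.0 if either has zero norm."""
  dot_pdt = np.dot(vec1, vec2)

  # Magnitudes (Euclidean norms) of vec1 and vec2
  norm_vec1 = np.linalg.norm(vec1)
  norm_vec2 = np.linalg.norm(vec2)

  # Cosine similarity is undefined for a zero vector; return 0.0 by convention
  if norm_vec1 == 0 or norm_vec2 == 0:
    return 0.0

  return dot_pdt / (norm_vec1 * norm_vec2)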
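
A quick sanity check of the function on a few hand-picked inputs may help before building on it. This is only a sketch: the function is restated in condensed form so the snippet runs on its own, and the variable names (`parallel`, `orthogonal`, `opposite`, `zero_case`) are illustrative, not part of the original notebook.

```python
import numpy as np


def cosine_similarity(vec1, vec2):
    # Condensed restatement of the function above so this snippet is standalone.
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if norm_product == 0:
        return 0.0
    return np.dot(vec1, vec2) / norm_product


# Same direction, different magnitude: similarity should be close to 1.
parallel = cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))

# Perpendicular vectors: similarity should be 0.
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Opposite directions: similarity should be close to -1.
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

# A zero vector has no direction; the function falls back to 0.
zero_case = cosine_similarity(np.array([0.0, 0.0]), np.array([1.0, 2.0]))

print(parallel, orthogonal, opposite, zero_case)
```

Note that magnitude does not affect the score: only the angle between the vectors matters, which is why the first pair scores about 1 even though one vector is twice as long as the other.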

In [33]:
class VectorDb:
  def __init__(self):
    # Dictionary mapping each vector id to its vector
    self.vectors = {}

  def insert_vector(self, vec_id, vec):
    # Inserting with an existing id overwrites the stored vector
    self.vectors[vec_id] = vec

  def search(self, query_vector, top_k):
    """Return up to top_k (vec_id, similarity) pairs, most similar first."""
    similar_vectors = []
    for vec_id, vec in self.vectors.items():
      # Skip stored vectors whose shape differs from the query's
      # (equal shapes imply equal ndim, so one check is enough)
      if vec.shape == query_vector.shape:
        similarity = cosine_similarity(query_vector, vec)
        similar_vectors.append((vec_id, similarity))
    similar_vectors.sort(key=lambda x: x[1], reverse=True)
    return similar_vectors[:top_k]






In [35]:
db = VectorDb()

db.insert_vector("vec1" , np.array([1 , 3.6, 7 , 5 , 3.2]))
db.insert_vector("vec2" , np.array([1 , 1 ,0 , 6 , 3.5]))
db.insert_vector("vec3" , np.array([[1,3] , [4,7]]))
db.insert_vector("vec4" , np.array([1, 4, 5.5]))
db.insert_vector("vec5" , np.array([1, 4, 5.5 , 7 , 2.2]))

query = np.array([2 , 5 ,8 , 6 ,3.5])
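# Ask for up to 6 matches; vec3 and vec4 are skipped because their shapes differ from the query's
results = db.search(query, 6)

# Show the query and the ranked (vec_id, similarity) results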
print("Dimension of query vector:" , query.ndim)
print("Shape of query vector:" , query.shape)
print("Query vector:" , query)
print(results)


Dimension of query vector: 1
Shape of query vector: (5,)
Query vector: [2.  5.  8.  6.  3.5]
[('vec1', 0.9951250059934604), ('vec5', 0.9682444860165454), ('vec2', 0.6557978991392666)]


#### Further improvements in the project:

- [x] Handle edge cases.
- [x] Return sorted search results.
- [ ] Add multiple vectors at once.
- [ ] Handle larger datasets.
- [x] Add dimensionality checks.
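
One possible direction for the two open items is sketched below. It is not part of the current implementation: the class name `BatchVectorDb` and the method `insert_vectors` are hypothetical, and the sketch assumes 1-D query vectors. Bulk insertion takes a dictionary of id to vector, and the search computes all similarities with a single matrix product instead of a Python loop.

```python
import numpy as np


class BatchVectorDb:
    """Hypothetical variant of VectorDb; names are illustrative only."""

    def __init__(self):
        self.vectors = {}

    def insert_vectors(self, items):
        # items: dict mapping vector id -> array-like
        for vec_id, vec in items.items():
            self.vectors[vec_id] = np.asarray(vec, dtype=float)

    def search(self, query_vector, top_k):
        query = np.asarray(query_vector, dtype=float)
        if query.ndim != 1:
            raise ValueError("this sketch only supports 1-D query vectors")

        # Keep only stored vectors whose shape matches the query.
        ids = [i for i, v in self.vectors.items() if v.shape == query.shape]
        if not ids:
            return []

        matrix = np.stack([self.vectors[i] for i in ids])  # shape (n, d)
        dots = matrix @ query                               # all dot products at once
        norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)

        # Zero-norm vectors get similarity 0, as in cosine_similarity.
        sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms != 0)

        order = np.argsort(-sims)[:top_k]  # indices of the highest scores first
        return [(ids[j], float(sims[j])) for j in order]


db = BatchVectorDb()
db.insert_vectors({
    "vec1": [1, 3.6, 7, 5, 3.2],
    "vec2": [1, 1, 0, 6, 3.5],
    "vec3": [[1, 3], [4, 7]],
    "vec4": [1, 4, 5.5],
    "vec5": [1, 4, 5.5, 7, 2.2],
})
results = db.search([2, 5, 8, 6, 3.5], top_k=2)
print(results)
```

This is still a brute-force scan over every stored vector, so it only reduces Python overhead. Truly large datasets would likely need an approximate nearest-neighbour index, which is outside the scope of this sketch.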
