# Similarity Search using Embeddings & Nearest Neighbor

#### Problem Statement: 
You've been tasked with developing a recommendation system for a grocery delivery service based on user preferences. The goal is to suggest items that are similar to those already chosen by the user. However, traditional methods like keyword matching or simple collaborative filtering may not capture nuanced similarities between items.

#### Solution:
We propose an embedding-based system: each item name is encoded as a dense vector, and a k-nearest-neighbor search then finds the items closest to a given query item in the embedding space.

## Importing Modules

In [30]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors

## Loading Data

We start by defining the list of item names and loading it into a pandas DataFrame.

In [3]:
words = ["red","potatoes","soda","cheese","water","blue","crispy","hamburger","coffee","green","milk","la croix","yellow"
,"chocolate","french fries","latte","cake","brown","cheeseburger","espresso","cheesecake","black","mocha","fizzy","carbon"
,"banana","sunshine","orange carrot","sun","hay","cookies","fish"]

In [56]:
DF = pd.DataFrame(words,columns = ["Word"])
DF

| | Word |
|---|---|
| 0 | red |
| 1 | potatoes |
| 2 | soda |
| 3 | cheese |
| 4 | water |
| 5 | blue |
| 6 | crispy |
| 7 | hamburger |
| 8 | coffee |
| 9 | green |


## Embedding Model
We use a pre-trained sentence-embedding model, `paraphrase-MiniLM-L6-v2`, to represent each item in the vocabulary as a dense 384-dimensional vector.

In [5]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


## Create Embeddings
Encode every word in the list into its embedding vector. The result is a NumPy array with one row per word.

In [20]:
embeddings = model.encode(words)
embeddings

array([[-0.07723496,  0.12294615, -0.02570257, ..., -0.00323109,
        -0.00391515,  0.5210295 ],
       [ 0.2620569 , -0.25374994, -0.08988532, ...,  0.34893227,
         0.43459976,  0.03882637],
       [-0.26256433,  0.10996028, -0.10783651, ...,  0.6241137 ,
         0.27782378, -0.20880705],
       ...,
       [ 0.40209708,  0.56491685,  0.05248033, ..., -0.71106267,
         0.47614315, -0.24891067],
       [-0.3350797 , -0.13180478, -0.14418049, ..., -0.87842876,
         0.24764776,  0.25170773],
       [-0.18092024,  0.46646872,  0.2453509 , ...,  0.26894614,
         0.6552008 ,  0.5024928 ]], dtype=float32)

## Nearest Neighbor for semantic search
Fit a k-nearest-neighbor index on the embeddings so that the items closest to a query item can be retrieved by cosine distance. With `k = 2`, each query returns its two closest items.

In [22]:
k = 2
KNN = NearestNeighbors(n_neighbors=k, metric='cosine')
KNN.fit(embeddings)

NearestNeighbors(metric='cosine', n_neighbors=2)
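To make the search step concrete, here is a minimal, dependency-free sketch of what a cosine-distance nearest-neighbor lookup computes. It is not scikit-learn's implementation. The function names and the 3-dimensional toy vectors are made up for illustration, and the only assumption carried over from the notebook is that cosine distance means 1 minus cosine similarity.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_knn(query, vectors, k):
    """Return (distance, index) pairs for the k vectors closest to query."""
    scored = [(cosine_distance(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored)[:k]


# Hypothetical 3-dimensional "embeddings", invented for this example only.
toy_items = ["hamburger", "cheeseburger", "water"]
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

nearest = brute_force_knn(toy_query, toy_vectors, k=2)
nearest_items = [toy_items[i] for _, i in nearest]
print(nearest_items)
```

Comparing the query against every stored vector is affordable for a catalog of a few dozen items. A much larger catalog would likely call for an approximate index, which is outside the scope of this notebook.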

## Querying new word
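When a user selects an item, encode it with the same model and reshape the result into a 2-D array with a single row, which is the input shape `kneighbors` expects.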

In [37]:
test = "hotdog"
test_embedding = model.encode(test)
test_embedding = test_embedding.reshape((1,-1))
test_embedding

array([[-2.42375866e-01, -3.23452294e-01, -1.68686420e-01,
         2.20599562e-01,  1.35132819e-01, -1.81159172e-02,
        -3.56361270e-05, -1.31620288e-01,  1.55606925e-01,
        -8.96533132e-01, -6.65159374e-02, -6.79481566e-01,
         8.24273303e-02, -4.53334600e-01,  4.17263657e-01,
        -2.49839708e-01,  1.18036306e+00, -3.86382639e-01,
        -1.63744673e-01, -2.38266259e-01, -1.87594727e-01,
        -1.21404573e-01, -2.06091851e-01, -2.71494299e-01,
         2.39559889e-01, -4.05560732e-01,  8.19400176e-02,
         1.93712443e-01, -4.27181154e-01,  1.02891423e-01,
         7.24207640e-01, -1.21013856e+00, -8.05192232e-01,
         7.52292573e-02, -4.05502468e-01, -3.22328508e-03,
         4.17447448e-01, -5.17934084e-01, -3.92087251e-02,
         5.31760752e-01,  5.73495567e-01, -1.97624892e-01,
        -6.89657778e-02, -7.78758287e-01, -4.05459791e-01,
         5.25453091e-01, -1.13808322e+00,  3.54061276e-01,
        -5.34154475e-03, -1.49161875e-01, -8.91737759e-0

After generating an embedding vector of the new item, find its nearest neighbors in the embedding space. These nearest neighbors represent items that are most similar to the selected item.

In [58]:
# With return_distance=True, kneighbors returns a (distances, indices) tuple
neighbours = KNN.kneighbors(test_embedding, n_neighbors=k, return_distance=True)

In [59]:
DF.iloc[neighbours[1].squeeze()]

| | Word |
|---|---|
| 18 | cheeseburger |
| 7 | hamburger |
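The lookup above shows only the matched rows. It can also help to see how close each match is, and to drop the query itself when it is already in the catalog, since an item is always its own nearest neighbor. The helper below is a hypothetical sketch: `label_neighbours` is not part of scikit-learn, and the distances and indices in the example are invented stand-ins for a real `kneighbors` result.

```python
def label_neighbours(knn_result, items, exclude=None):
    """Pair each neighbour's item with its distance for a single query.

    knn_result: a (distances, indices) pair with one row per query;
                only the first row is used here.
    items:      the list the index was built from, in the same order.
    exclude:    optional item to drop, for example the query itself.
    """
    distances, indices = knn_result
    pairs = []
    for dist, idx in zip(distances[0], indices[0]):
        item = items[int(idx)]
        if item != exclude:
            pairs.append((item, float(dist)))
    return pairs


# Made-up values standing in for the output of a real kneighbors call.
example_items = ["red", "soda", "hamburger", "cheeseburger"]
example_result = ([[0.0, 0.21, 0.35]], [[2, 3, 1]])
labelled = label_neighbours(example_result, example_items, exclude="hamburger")
print(labelled)
```

In this notebook, `label_neighbours(neighbours, words)` would list the two matches for "hotdog" alongside their cosine distances. For a query that is already in `words`, one option is to request `k + 1` neighbours and pass the query as `exclude`.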
