# Neural Search

An AI-powered search engine built from Transformer sentence embeddings, K-Means clustering, and cosine similarity.

## Search operation

### Preprocessing
- Load the text into a pandas DataFrame
- Run the embedding model on all of the text
- Fit K-Means on the embeddings and record the cluster assigned to each vector

### Perform search
- Create an embedding for each query
- Find all text assigned to the same cluster as the query
- Compute the cosine similarity between the query and every element in that cluster
- Return the top K most similar elements
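The search steps above can be sketched on toy data without any model. This is a minimal, dependency-free illustration, not the notebook's implementation: the 2-D vectors, the centroids, and the names `nearest_centroid` and `toy_search` are all hypothetical stand-ins for the real embeddings and the fitted K-Means model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, 1e-12)

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vector, centroids[c]))

def toy_search(query, docs, doc_vectors, doc_clusters, centroids, top_k):
    # 1. Assign the query to a cluster
    cluster = nearest_centroid(query, centroids)
    # 2. Score only the documents in that cluster
    candidates = [
        (cosine(query, vec), doc)
        for doc, vec, label in zip(docs, doc_vectors, doc_clusters)
        if label == cluster
    ]
    # 3. Return the top K by similarity
    return sorted(candidates, reverse=True)[:top_k]

# Hypothetical toy data: four documents forming two clusters in 2-D
docs = ["a", "b", "c", "d"]
doc_vectors = [[1.0, 0.1], [0.9, 0.3], [0.1, 1.0], [0.2, 0.9]]
centroids = [[0.95, 0.2], [0.15, 0.95]]
doc_clusters = [nearest_centroid(v, centroids) for v in doc_vectors]

results = toy_search([1.0, 0.0], docs, doc_vectors, doc_clusters, centroids, top_k=2)
print(results)
```

Restricting the comparison to one cluster trades recall for speed: a relevant document that landed in a neighbouring cluster is never scored.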

In [1]:
!pip install -q transformers


## Load data

Load the CSV file from the S3 bucket into a pandas DataFrame, keeping only the `title` and `overview` columns.

In [2]:
import sagemaker
import pandas as pd

In [3]:
sess = sagemaker.Session()

bucket_name = sess.default_bucket()
role_name = sagemaker.get_execution_role()

bucket_prefix = "neural-search"

In [111]:
DATA_LIMIT = 96

data_location = f"s3://{bucket_name}/{bucket_prefix}/data/movies_metadata.csv"

df = pd.read_csv(data_location)[["title", "overview"]]
df = df[df["overview"].notna()]

df = df.head(DATA_LIMIT)

df.head(2)



| | title | overview |
|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... |


## Load embedding model

Load the BERT transformer model. Sentence embeddings are produced by mean pooling its token outputs.

In [5]:
from transformers import BertModel, BertTokenizer
import torch
import numpy as np

In [95]:
model_name = "bert-base-uncased"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained(model_name)
text_model = BertModel.from_pretrained(model_name).to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
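def mean_pooling(model_output, attention_mask):
    # Per-token embeddings, shape (batch, tokens, hidden)
    token_embeddings = model_output.last_hidden_state
    # Broadcast the attention mask over the hidden dimension so padding tokens contribute zero
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )

    # Sum the unmasked token embeddings and divide by the number of real tokens
    pooled = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

    # L2-normalize so every sentence embedding has unit length
    return torch.nn.functional.normalize(pooled, p=2, dim=1)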
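
To see what the mask does in `mean_pooling`, here is a small dependency-free sketch of the same arithmetic for a single sentence. The function name `masked_mean_pool` and the token values are hypothetical. It uses plain lists rather than tensors, so it illustrates the calculation only and is not a drop-in replacement.

```python
import math

def masked_mean_pool(token_embeddings, attention_mask):
    """Average the vectors of real tokens (mask == 1), then L2-normalize."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    count = max(len(kept), 1e-9)  # guard against division by zero
    mean = [sum(vec[d] for vec in kept) / count for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / max(norm, 1e-12) for x in mean]

# Hypothetical sentence: two real tokens followed by one padding token
tokens = [[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(tokens, mask)
print(pooled)  # the padding vector has no influence on the result
```

Without the mask, the padding vector would dominate the average, and sentences of different lengths in the same batch would get embeddings that depend on how much padding they received.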

In [27]:
def create_embeddings(text, tokenizer, model, device):
    encoded_input = tokenizer(
        text,
        padding=True,
        truncation=True,
        return_tensors="pt"
    ).to(device)
    
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    return mean_pooling(model_output, encoded_input["attention_mask"])

### Example embedding similarities

In [39]:
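from itertools import combinations

# Cosine similarity between two 1-D embedding vectors (also used by the search step)
cosine_similarity_model = torch.nn.CosineSimilarity(dim=0, eps=1e-6)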

sample_text = [
    "Andy is going to the beach on Sunday with his friends",
    "Mandy is going to the movies Tuesday with her Mum",
    "The cargo ship sailed through the night",
]

sample_embeddings = create_embeddings(sample_text, tokenizer, text_model, device).cpu().detach()

tuples = combinations(list(range(sample_embeddings.shape[0])), 2)

for i, j in tuples:
    similarity = cosine_similarity_model(sample_embeddings[i], sample_embeddings[j])
    print(f"'{sample_text[i]}', '{sample_text[j]}', similarity={similarity}")

'Andy is going to the beach on Sunday with his friends', 'Mandy is going to the movies Tuesday with her Mum', similarity=0.8401806950569153
'Andy is going to the beach on Sunday with his friends', 'The cargo ship sailed through the night', similarity=0.561195433139801
'Mandy is going to the movies Tuesday with her Mum', 'The cargo ship sailed through the night', similarity=0.5241331458091736


## Create embeddings

Create the embeddings representing each movie description.

In [99]:
BATCH_SIZE = 16

input_text = df["overview"].values

# One row per overview; the width matches the model's embedding size
embeddings = np.zeros((len(input_text), sample_embeddings.shape[-1]))

for i in range(0, len(input_text), BATCH_SIZE):
    batch = list(input_text[i:i + BATCH_SIZE])
    
    batch_embeddings = create_embeddings(batch, tokenizer, text_model, device).cpu().detach().numpy()
    embeddings[i:i + BATCH_SIZE, :] = batch_embeddings
    
embeddings

array([[-0.02961762,  0.0277533 ,  0.02578685, ...,  0.02357359,
         0.01307014, -0.00316987],
       [-0.02399799,  0.0163674 ,  0.03241539, ...,  0.0048577 ,
         0.00903032,  0.02823391],
       [ 0.00591561, -0.02498905,  0.03475502, ..., -0.01966395,
         0.00797963, -0.03016624],
       ...,
       [-0.01262194,  0.04083531,  0.04546607, ..., -0.00758757,
         0.01327129, -0.01018843],
       [ 0.01855192,  0.03487618,  0.04439139, ..., -0.01809348,
         0.03097658, -0.02043471],
       [ 0.01045637, -0.00373105,  0.06106191, ..., -0.01639825,
         0.0091132 , -0.01901365]])
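
The batching loop above relies on Python slices being clamped at the end of a sequence, so the final batch may be shorter than `BATCH_SIZE` without any special handling. A minimal sketch of that pattern follows; the helper name `iter_batches` and the item count of 100 are hypothetical, chosen so that the last batch is partial.

```python
def iter_batches(items, batch_size):
    """Yield (start_index, batch) pairs; the last batch may be shorter."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

# Hypothetical input: 100 items in batches of 16
items = list(range(100))
sizes = [len(batch) for _, batch in iter_batches(items, 16)]
print(sizes)
```

Yielding the start index alongside each batch makes it straightforward to write results back into the matching rows of a preallocated array.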

## Train K-Means model

Train the K-Means model used to cluster the embeddings.

In [97]:
from sklearn.cluster import KMeans

In [100]:
kmeans_model = KMeans(n_clusters=8).fit(embeddings)  # 8 is scikit-learn's default number of clusters

In [134]:
df["cluster"] = kmeans_model.labels_

df.head(2)

| | title | overview | cluster |
|---|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... | 4 |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... | 4 |
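
For intuition about what fitting K-Means involves, the following is a minimal sketch of Lloyd's algorithm in plain Python. It is not scikit-learn's implementation: the function name `lloyd_kmeans`, the fixed iteration count, and the choice to seed centroids with the first `k` points are simplifying assumptions made for illustration.

```python
import math

def lloyd_kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical 2-D points forming two well-separated groups
points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
centroids, labels = lloyd_kmeans(points, k=2)
print(labels, centroids)
```

The labels returned here play the same role as `kmeans_model.labels_`: one cluster index per input row, in input order.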


## Perform search

Search the database to find a list of movies that match the query.

In [182]:
cosine_similarity_model = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

In [222]:
def search(queries, tokenizer, text_model, kmeans_model, df, embeddings, device, max_results, cosine_similarity_model):
    # Embed the queries and cast to float64 to match the stored embeddings
    query_embeddings = create_embeddings(queries, tokenizer, text_model, device).cpu().detach().numpy().astype(float)

    # Assign each query to its nearest K-Means cluster
    clusters = kmeans_model.predict(query_embeddings)

    out = []

    for i in range(len(clusters)):
        # Row positions (not index labels) of the movies in the query's cluster;
        # rows of `embeddings` follow the row order of `df`
        positions = np.flatnonzero((df["cluster"] == clusters[i]).to_numpy())

        temp = []

        for j in positions:
            similarity = cosine_similarity_model(torch.tensor(query_embeddings[i]), torch.tensor(embeddings[j]))
            temp.append((similarity, clusters[i], df["title"].iloc[j], df["overview"].iloc[j]))

        # Keep the max_results most similar movies for this query
        temp = sorted(temp, reverse=True, key=lambda x: x[0])[:max_results]
        out.append(temp)

    return out
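
The last step of `search` sorts every candidate and then slices off the best few. When a cluster is large and only a handful of results are needed, `heapq.nlargest` from the standard library is a possible alternative that avoids a full sort. The sketch below uses hypothetical scores and document names to show that both approaches select the same items.

```python
import heapq

# Hypothetical (similarity, document) pairs for one query
scored = [(0.42, "doc-a"), (0.91, "doc-b"), (0.17, "doc-c"), (0.65, "doc-d")]

# Keep the two highest-scoring pairs without sorting the whole list
top_two = heapq.nlargest(2, scored, key=lambda pair: pair[0])

# The same selection using a full sort, as in the search function
top_two_sorted = sorted(scored, reverse=True, key=lambda pair: pair[0])[:2]

print(top_two)
```

For the small clusters in this notebook the difference is negligible, so the full sort is a reasonable choice here.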

In [226]:
MAX_RESULTS = 5

queries = [
    "gambling and mafia and crime",
    "food is tasty",
    "magical realm with siblings"
]

results = search(queries, tokenizer, text_model, kmeans_model, df, embeddings, device, MAX_RESULTS, cosine_similarity_model)

results

[[(tensor(0.5883, dtype=torch.float64),
   7,
   'Casino',
   'The life of the gambling paradise – Las Vegas – and its dark mafia underbelly.'),
  (tensor(0.4831, dtype=torch.float64),
   7,
   'Richard III',
   "Shakespeare's Play transplanted into a 1930s setting."),
  (tensor(0.3891, dtype=torch.float64),
   7,
   "Things to Do in Denver When You're Dead",
   'A mafia film in Tarantino style with a star-studded cast. Jimmy’s “The Saint” gangster career has finally ended. Yet now he finds him self doing favors for a wise godfather known as “The Man with the Plan.”')],
 [(tensor(0.4448, dtype=torch.float64),
   7,
   'Casino',
   'The life of the gambling paradise – Las Vegas – and its dark mafia underbelly.'),
  (tensor(0.4254, dtype=torch.float64),
   7,
   "Things to Do in Denver When You're Dead",
   'A mafia film in Tarantino style with a star-studded cast. Jimmy’s “The Saint” gangster career has finally ended. Yet now he finds him self doing favors for a wise godfather known as 