# Neural Search

An AI-powered search engine built with Transformers, K-Means, and cosine similarity.

## Search operation

### Preprocessing
- Load the text into a pandas DataFrame
- Run the embedding model on all of the text
- Perform K-Means clustering and record the cluster for each vector

### Perform search
- Create embeddings for the query
- Find all text in the same cluster as the query
- Compute the cosine similarity between the query and every element in the cluster
- Return the top K most similar elements
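
The two phases above can be sketched end to end on toy data. This is a minimal, dependency-free illustration and not the notebook's implementation: the 2-D vectors, the `centroids`, and all helper names are made up for the example, and the centroids are fixed by hand instead of being learned by K-Means.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))

# Hypothetical toy "embeddings" keyed by document name.
docs = {
    "a": [1.0, 0.1],
    "b": [0.9, 0.3],
    "c": [0.1, 1.0],
    "d": [0.2, 0.8],
}
# Hand-picked centroids standing in for a trained K-Means model.
centroids = [[1.0, 0.2], [0.15, 0.9]]

# Preprocessing: record the cluster of every document once.
doc_cluster = {name: nearest_centroid(vec, centroids) for name, vec in docs.items()}

def toy_search(query_vec, top_k):
    """Rank only the documents that share the query's cluster."""
    cluster = nearest_centroid(query_vec, centroids)
    candidates = [
        (cosine(query_vec, docs[name]), name)
        for name, c in doc_cluster.items()
        if c == cluster
    ]
    return sorted(candidates, reverse=True)[:top_k]

toy_results = toy_search([0.8, 0.2], top_k=2)
print(toy_results)
```

Only documents `a` and `b` are scored here, because the query falls in their cluster; `c` and `d` are never compared against it.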

In [1]:
!pip install -q transformers


## Load data

Load the CSV file from the S3 bucket into a DataFrame.

In [2]:
import sagemaker
import pandas as pd

In [3]:
sess = sagemaker.Session()

bucket_name = sess.default_bucket()
role_name = sagemaker.get_execution_role()

bucket_prefix = "neural-search"

In [4]:
DATA_LIMIT = 896

data_location = f"s3://{bucket_name}/{bucket_prefix}/data/movies_metadata.csv"

df = pd.read_csv(data_location)[["title", "overview"]]
df = df[df["overview"].notna()]

df = df.head(DATA_LIMIT)

df.head(8)



| | title | overview |
|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... |
| 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... |
| 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... |
| 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... |
| 5 | Heat | Obsessive master thief, Neil McCauley leads a ... |
| 6 | Sabrina | An ugly duckling having undergone a remarkable... |
| 7 | Tom and Huck | A mischievous young boy, Tom Sawyer, witnesses... |


## Load embedding model

Load the sentence embedding transformer model.

In [5]:
from transformers import AutoTokenizer as Tokenizer, AutoModel as Model
import torch
import numpy as np

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = Tokenizer.from_pretrained(model_name)
text_model = Model.from_pretrained(model_name).to(device)

cosine_similarity_model = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

In [7]:
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    
    pooled = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

In [8]:
def create_embeddings(text, tokenizer, model, device):
    encoded_input = tokenizer(
        text,
        padding=True,
        truncation=True,
        return_tensors="pt"
    ).to(device)
    
    with torch.no_grad():
        model_output = model(**encoded_input)
    
    return mean_pooling(model_output, encoded_input["attention_mask"])
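
To make the pooling step concrete, here is a dependency-free sketch of what `mean_pooling` computes for a single sentence: padded positions (mask 0) are excluded from the average, and the result is scaled to unit length. The helper names and the 2-D token vectors are made up for illustration; the real model produces higher-dimensional vectors and the notebook does this on batched tensors.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        for k in range(dim):
            totals[k] += vec[k] * keep
    # Same guard against division by zero as torch.clamp(..., min=1e-9).
    count = max(sum(attention_mask), 1e-9)
    return [t / count for t in totals]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = max(math.sqrt(sum(x * x for x in vec)), 1e-12)
    return [x / norm for x in vec]

# Two real tokens followed by one padding token (illustrative values).
tokens = [[3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)   # the padding vector does not contribute
unit = l2_normalize(pooled)
print(pooled, unit)
```

Without the mask, the large padding vector would dominate the average, which is why the attention mask is passed alongside the model output.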

### Example embedding similarities

In [9]:
from itertools import combinations

sample_text = [
    "Andy is going to the beach on Sunday with his friends",
    "Mandy is going to the movies Tuesday with her Mum",
    "The cargo ship sailed through the night",
]

sample_embeddings = create_embeddings(sample_text, tokenizer, text_model, device).cpu().detach()

tuples = combinations(list(range(sample_embeddings.shape[0])), 2)

for i, j in tuples:
    similarity = cosine_similarity_model(sample_embeddings[i], sample_embeddings[j])
    print(f"'{sample_text[i]}', '{sample_text[j]}', similarity={similarity}")

'Andy is going to the beach on Sunday with his friends', 'Mandy is going to the movies Tuesday with her Mum', similarity=0.31142428517341614
'Andy is going to the beach on Sunday with his friends', 'The cargo ship sailed through the night', similarity=0.0774894580245018
'Mandy is going to the movies Tuesday with her Mum', 'The cargo ship sailed through the night', similarity=0.062295686453580856


## Create embeddings

Create the embeddings representing each movie description.

In [10]:
BATCH_SIZE = 16

input_text = df["overview"].values

embeddings = torch.zeros((len(input_text), sample_embeddings.shape[-1]))

for i in range(0, len(input_text), BATCH_SIZE):
    batch = list(input_text[i:i + BATCH_SIZE])
    
    batch_embeddings = create_embeddings(batch, tokenizer, text_model, device).cpu().detach()
    embeddings[i:i + BATCH_SIZE, :] = batch_embeddings
    
embeddings

tensor([[ 0.0634,  0.0010,  0.0932,  ...,  0.0154,  0.0446,  0.0220],
        [ 0.0863,  0.0446, -0.0405,  ..., -0.0033, -0.0293, -0.0266],
        [-0.1009,  0.0374, -0.0009,  ...,  0.0568, -0.0262,  0.0183],
        ...,
        [-0.0317, -0.1009,  0.0276,  ..., -0.0754,  0.0153, -0.0045],
        [-0.0724,  0.0377, -0.0088,  ...,  0.0259,  0.0469, -0.0306],
        [ 0.0073, -0.0173, -0.0241,  ...,  0.0053,  0.0448, -0.0604]])
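
The loop above walks the input in steps of `BATCH_SIZE`. Slicing past the end of a sequence is safe in Python, so the final batch is simply shorter when the length is not a multiple of the batch size. A stdlib-only sketch of the same pattern, where `iter_batches` is a hypothetical helper and the integer lists stand in for the overview strings:

```python
def iter_batches(items, batch_size):
    """Yield (start_index, batch) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

# Ten items in batches of four: the last batch holds the remainder.
batches = list(iter_batches(list(range(10)), 4))
starts = [start for start, _ in batches]
sizes = [len(batch) for _, batch in batches]

# With 896 rows and a batch size of 16, every batch is full.
n_batches = len(list(iter_batches(list(range(896)), 16)))
print(starts, sizes, n_batches)
```

The start index is what lets each batch be written back into the matching rows of the preallocated `embeddings` tensor.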

## Train K-Means model

Train the K-Means model used to cluster the embeddings.

In [11]:
from sklearn.cluster import KMeans

In [12]:
kmeans_model = KMeans().fit(embeddings)
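
Calling `KMeans()` with no arguments uses scikit-learn's default of 8 clusters. As a rough picture of what fitting does, the sketch below alternates the two K-Means steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) until the assignments stop changing. It is an illustration only and not scikit-learn's implementation, which among other things chooses its initial centroids differently; the function names, points, and starting centroids are all made up.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centroids.append(old)  # keep the centroid of an empty cluster
            continue
        new_centroids.append([sum(col) / len(members) for col in zip(*members)])
    return new_centroids

def toy_kmeans(points, centroids, max_iter=100):
    """Repeat assign/update until the labels no longer change."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = toy_kmeans(points, centroids=[[0.0, 0.0], [1.0, 1.0]])
print(labels, centers)
```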

In [13]:
df["cluster"] = kmeans_model.labels_

df.head(2)

| | title | overview | cluster |
|---|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... | 3 |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... | 3 |


## Perform search

Search the database to find a list of movies that match the query.

In [14]:
def search(queries, tokenizer, text_model, kmeans_model, df, embeddings, device, max_results, cosine_similarity_model):
    query_embeddings = create_embeddings(queries, tokenizer, text_model, device).cpu().detach()
    
    clusters = kmeans_model.predict(query_embeddings)
    
    out = []
    
    for i in range(len(clusters)):
        mask = df["cluster"] == clusters[i]
        
        temp = []
        
        for j, key in enumerate(mask.keys()):
            if mask[key]:
                similarity = cosine_similarity_model(query_embeddings[i], embeddings[j])
                temp.append((similarity, clusters[i], df["title"][key], df["overview"][key]))
        
        # Use the max_results argument rather than the global MAX_RESULTS.
        temp = sorted(temp, reverse=True, key=lambda x: x[0])[:max_results]
        out.append(temp)
        
    return out
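
The function sorts every candidate in the cluster before slicing off the top entries. When only a few results are needed, `heapq.nlargest` from the standard library is a possible alternative that selects the same top entries without sorting the whole list. A sketch with made-up `(similarity, title)` tuples:

```python
import heapq

# Hypothetical (similarity, title) pairs for one query's cluster.
candidates = [(0.12, "A"), (0.87, "B"), (0.45, "C"), (0.91, "D"), (0.30, "E")]
max_results = 3

# Equivalent to sorted(candidates, key=..., reverse=True)[:max_results].
top = heapq.nlargest(max_results, candidates, key=lambda item: item[0])
print(top)
```

For clusters of a few hundred items the difference is negligible, so this matters mainly if the dataset grows well beyond the size used here.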

In [15]:
MAX_RESULTS = 5

queries = [
    "people go on an adventure",
    "crime, gangs, and the mob",
    "magical realm"
]

results = search(queries, tokenizer, text_model, kmeans_model, df, embeddings, device, MAX_RESULTS, cosine_similarity_model)

for i in range(len(results)):
    print(f"Query: '{queries[i]}'")
    
    for result in results[i]:
        print(f"{result[0].squeeze()} - '{result[2]}' - '{result[3]}'")
        
    print()

Query: 'people go on an adventure'
0.3668648898601532 - 'Bhaji on the Beach' - 'A group of women of Indian descent take a trip together from their home in Birmingham, England to the beach resort of Blackpool.'
0.35877588391304016 - 'White Squall' - 'Teenage boys discover discipline and camaraderie on an ill-fated sailing voyage.'
0.34105104207992554 - 'Loaded' - 'A group of young friends convene in the countryside to shoot a horror movie. But an experiment with LSD sees normal boundaries between them collapsing, and tragedy subsequently striking.'
0.3262408673763275 - 'Beautiful Girls' - 'During a snowy winter in the small fictional town of Knight"s Ridge, Massachusetts, a group of lifelong buddies hang out,