# Transformers for Search

Let's take a look at how we can use language models to perform a simple search task - given 50,000 movie reviews, can we find one that describes a film that we might be interested in watching?

## Imports

We will lean heavily on our `modelling` package for loading pre-trained models and adapting them to the search task, for which we will also need to import PyTorch.

In [1]:
import math
import warnings

import torch
import torch.nn.functional as F

from modelling import data, utils
from modelling import transformer as tfr

warnings.filterwarnings("ignore")

## Configuration

There are two key parameters we need to set:

- The number of text tokens from each review to use for creating embeddings (`CHUNK_SIZE`, chosen to match the sequence length used when training the model).

- The name of the pre-trained transformer model to use as the foundation for this task.

In [2]:
CHUNK_SIZE = 100
MODEL_NAME = "decoder_next_word_gen"

## Get Data

Load the reviews into memory.

In [3]:
reviews = data.get_data()["review"].tolist()
review = reviews[0]

utils.print_wrapped(review)

Forget what I said about Emeril. Rachael Ray is the most irritating personality on the
Food Network AND all of television. If you've never seen 30 Minute Meals, then you cannot
possibly begin to comprehend how unfathomably annoying she is. I really truly meant that
you can't even begin to be boggled by her until you've viewed the show once or twice, and
even then all words and intelligent thoughts will fail you. The problem is mostly with
her mannerisms as you might have guessed. Ray has a goofy mouth and often imitates the
parrot. If you love something or think it's "awesome" (a word she uses roughly 87 times
per telecast) just say it. And she's constantly using horrible, unfunny catchphrases like
"EVOO" (Extra virgin olive oil!). SHUT UP! What's worse is Ray has TWO other shows on the
network! I think this is some elaborate conspiracy by the terrorists to drive us mad.
Give me more Tyler Florence! Ray is lame.


## Setup Tokenizer

We need a tokenizer to convert reviews from strings into the lists of integers that the model has been trained to take as inputs. This must be the same tokenizer that was used to train the original model, so that each integer maps to the same token.

In [4]:
tokenizer = data.IMDBTokenizer(reviews, 10)

tokenizer(review)[:10]

[831, 49, 11, 300, 44, 1, 3, 10505, 1363, 8]
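To make the string-to-integers step concrete, here is a minimal sketch of a word-level tokenizer. `ToyTokenizer` and its `min_word_freq` parameter are hypothetical names used for illustration only. This is not the implementation or API of `data.IMDBTokenizer`, whose internals are not shown here.

```python
from __future__ import annotations

import re
from collections import Counter


class ToyTokenizer:
    """Hypothetical word-level tokenizer (not modelling.data.IMDBTokenizer)."""

    UNKNOWN = 0  # id reserved for words outside the vocabulary

    def __init__(self, corpus: list[str], min_word_freq: int = 1):
        counts = Counter(word for doc in corpus for word in self._split(doc))
        # Most frequent words get the smallest ids; rare words are dropped.
        kept = [w for w, n in counts.most_common() if n >= min_word_freq]
        self.vocab = {word: i for i, word in enumerate(kept, start=1)}

    @staticmethod
    def _split(text: str) -> list[str]:
        # Lower-case, then keep runs of letters, digits and apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def __call__(self, text: str) -> list[int]:
        # Words missing from the vocabulary map to the UNKNOWN id.
        return [self.vocab.get(w, self.UNKNOWN) for w in self._split(text)]


corpus = ["a scary horror movie", "a funny movie", "a dull movie"]
toy_tokenizer = ToyTokenizer(corpus, min_word_freq=2)
token_ids = toy_tokenizer("A scary movie")
print(token_ids)
```

Because the mapping from words to integers is learned from the corpus, a model only makes sense of the integers produced by the tokenizer it was trained with.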

## Load Pre-Trained Model

We load a pre-trained transformer decoder (generative) model.

In [5]:
pre_trained_model: tfr.NextWordPredictionTransformer = utils.load_model(MODEL_NAME)
pre_trained_model

loading .models/decoder_next_word_gen/trained@2023-07-23T10:13:30;loss=5_0299.pt


NextWordPredictionTransformer(
  (_position_encoder): PositionalEncoding(
    (_dropout): Dropout(p=0.1, inplace=False)
  )
  (_embedding): Embedding(133046, 256)
  (_decoder): TransformerDecoderLayer(
    (self_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
    )
    (multihead_attn): MultiheadAttention(
      (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
    )
    (linear1): Linear(in_features=256, out_features=512, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (linear2): Linear(in_features=512, out_features=256, bias=True)
    (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    (dropout1): Dropout(p=0.1, inplace=False)
    (dropout2): Dropout(p=0.1, inplace=False)
    (dropout3): Dropout(p=0.1, inplace=Fa

## Adapt Pre-Trained Model to Create Document Embeddings

Recall that the pre-trained model was originally trained to predict the next token in a sequence. This is ultimately a classification task, so the final layer of the model is a linear layer that outputs the logits for all possible tokens (~133k, going by the `Embedding(133046, 256)` layer printed above).

We don't need this linear layer to create embeddings. We'd prefer the intermediate output of the core transformer block, i.e., the context-aware token embeddings output from the multi-head attention mechanism.

To get at these we define a new model (using inheritance) that will only initialise and use the layers of the pre-trained model that we want. We also add an additional step that will map many context-aware embeddings to a single embedding for a whole chunk of text (or document).

In [6]:
class DocumentEmbeddingTransformer(tfr.NextWordPredictionTransformer):
    """Adapting a generative model to yield text embeddings."""

    def __init__(self, pre_trained_model: tfr.NextWordPredictionTransformer):
        super().__init__(
            pre_trained_model._size_vocab,
            pre_trained_model._size_embed,
            pre_trained_model._n_heads,
        )
        del self._linear
        self.load_state_dict(pre_trained_model.state_dict(), strict=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_causal_mask, x_padding_mask = self._make_mask(x)
        out = self._embedding(x) * math.sqrt(self._size_embed)
        out = self._position_encoder(out)
        out = self._decoder(
            out,
            out,
            tgt_mask=x_causal_mask,
            tgt_key_padding_mask=x_padding_mask,
            memory_mask=x_causal_mask,
            memory_key_padding_mask=x_padding_mask,
        )
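        # Pool the token embeddings into one vector (assumes a batch of one).
        out = torch.sum(out.squeeze(0), dim=0)
        # Scale the pooled vector to unit length.
        out /= out.norm()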
        return out

Let's create an instance of our document embedding model and feed it a tokenised chunk of text to make sure that what we get back is a single vector with the same dimension as our context-aware embeddings.

In [7]:
embedding_model = DocumentEmbeddingTransformer(pre_trained_model)
embedding = embedding_model(torch.tensor([tokenizer(review)]))
embedding.size()

torch.Size([256])
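The pooling step at the end of `forward` can be looked at in isolation. The sketch below uses a random tensor as a stand-in for the decoder output (an assumption made for illustration; no model is loaded) and shows that summing over tokens and normalising yields a unit-length vector. It also shows that mean pooling would give the same result, since the two differ only by a positive scale factor that normalisation removes.

```python
import torch

torch.manual_seed(0)

# Stand-in for the decoder output of one chunk: (batch=1, tokens=5, embed=256).
token_embeddings = torch.randn(1, 5, 256)

# Sum over the token dimension, then scale to unit length.
summed = token_embeddings.squeeze(0).sum(dim=0)
doc_embedding = summed / summed.norm()

# Mean pooling differs from sum pooling only by the factor 1 / n_tokens,
# so both give the same vector once normalised.
mean = token_embeddings.squeeze(0).mean(dim=0)
doc_embedding_from_mean = mean / mean.norm()

print(doc_embedding.shape)
print(torch.allclose(doc_embedding, doc_embedding_from_mean, atol=1e-6))
```

One consequence of normalising is that the length of the review no longer affects the magnitude of its embedding, only its direction.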

## Index Reviews using Embeddings

We now use our document embedding model to produce an embedding vector for each review in the dataset - all 50,000!

In [8]:
embeddings_db = []
errors = []

embedding_model.eval()
with torch.no_grad():
    for i, review in enumerate(reviews):
        try:
            review_tokenized = tokenizer(review)[:CHUNK_SIZE]
            review_embedding = embedding_model(torch.tensor([review_tokenized]))
            embeddings_db.append(review_embedding)
        except Exception:
            errors.append(str(i))

if errors:
    print(f"ERRORS: {', '.join(errors)}")

embeddings_db = torch.stack(embeddings_db)

## Query Reviews

We now have everything we need to approach our search task. We start by specifying a query.

In [9]:
query = "Classic horror movie that is terrifying"

We then create an embedding for our query and use [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to score all reviews for relevance to our query.

In [10]:
query_embedding = embedding_model(torch.tensor([tokenizer(query)]))
query_results = F.cosine_similarity(query_embedding, embeddings_db)
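As a rough illustration of the scoring step, the sketch below uses random unit-length vectors in place of the real index and query (`toy_db` and `toy_query` are made-up names, not part of the notebook). It shows that for unit-length embeddings cosine similarity reduces to a dot product, and that `torch.topk` can return the best few matches rather than only the single top hit.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for the real index and query: 1,000 unit-length 256-d embeddings.
toy_db = F.normalize(torch.randn(1000, 256), dim=1)
toy_query = F.normalize(torch.randn(256), dim=0)

# For unit-length vectors, cosine similarity is just a dot product.
scores_dot = toy_db @ toy_query
scores_cos = F.cosine_similarity(toy_query.unsqueeze(0), toy_db, dim=1)

# Keep only the k best-scoring rows, highest score first.
top_scores, top_ids = torch.topk(scores_dot, k=5)

print(torch.allclose(scores_dot, scores_cos, atol=1e-5))
print(top_ids.tolist())
```

Returning several candidates is often more useful for search than a single result, since the second or third hit may be the better match for what the user had in mind.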

Let's take a look at the top query result to see if it sounds like a relevant match for our query.

In [11]:
query_embedding = embedding_model(torch.tensor([tokenizer(query)]))
query_results = F.cosine_similarity(query_embedding, embeddings_db)

top_hit = query_results.argsort(descending=True)[0]

print(f"[review #{top_hit}; score = {query_results[top_hit]:.4f}]\n")
utils.print_wrapped(reviews[top_hit])

[review #14212; score = 0.7540]

"Pet Sematary" succeeds on two major situations. First, it's a scary Horror movie. Those
that just aren't produced in these days. Second, it's an emotional, clever movie overall.
So if you are looking for chills, scares, creepiness and visually stunning settings,
great acting, dialongs, and gruesome effects; this is the movie you are looking for. A
classic now and truly a must see for any Horror fan. <br /><br />Probably, the best
adaptation to any of King's novels. The events feel a little rushed compared with the
novel, but that doesn't means that this underrated movie isn't a complete Horror/Drama
accomplishment. <br /><br />Stephen King's novel is widely known for being very emotional
and gruesome at the same time. The movie captures the same feeling mainly because there's
a great character development and you can feel the loving relationship between it's
members. Then, when everything seems to be happiness (technically happy, because the
title "Pet

Not bad for a toy model!

In practice, pre-trained language models are not usually used for semantic search (i.e., to produce sentence embeddings) without first being fine-tuned for that task. Fine-tuning is beyond the scope of this work, but it is the next logical step to try.