# Embedding text with an existing model

This notebook will walk you through embedding some text with a pretrained model using [DeCLUTR](https://github.com/JohnGiorgi/DeCLUTR). You can embed text in one of three ways:

1. __As a library__: import and initialize an object from this repo, which can be used to embed sentences/paragraphs.
2. __🤗 Transformers__: load our pretrained model with the [🤗 Transformers library](https://github.com/huggingface/transformers).
3. __Bulk embed__: embed all text in a given text file with a simple command-line interface.

Each approach has advantages and disadvantages:

1. __As a library__: This is the easiest way to add DeCLUTR to an existing pipeline, but requires that you install our package.
2. __🤗 Transformers__: This only requires you to install the [🤗 Transformers library](https://github.com/huggingface/transformers), but requires more boilerplate code.
3. __Bulk embed__: This is most suitable if you want to embed large quantities of text "offline" (i.e., not on the fly within an existing pipeline).

## 🔧 Install the prerequisites

In [None]:
!pip install git+https://github.com/JohnGiorgi/DeCLUTR.git

Finally, let's check whether a GPU is available, which we can use to dramatically speed up the embedding of text.

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    cuda_device = torch.cuda.current_device()
else:
    device = torch.device("cpu")
    cuda_device = -1

## 1️⃣ As a library

To use the model as a library, import `Encoder`, initialize it with a model name or path, and pass it some text (it accepts both strings and lists of strings).

In [None]:
from declutr import Encoder

# This can be a path on disk to a model you have trained yourself OR
# the name of one of our pretrained models.
pretrained_model_or_path = "declutr-small"

text = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
]

encoder = Encoder(pretrained_model_or_path, cuda_device=cuda_device)
embeddings = encoder(text)

These embeddings can then be used, for example, to compute the semantic similarity between some number of sentences or paragraphs.

In [None]:
from scipy.spatial.distance import cosine

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
print(semantic_sim)
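
For reference, `scipy.spatial.distance.cosine` returns the cosine *distance*, so subtracting it from 1 gives the cosine similarity. Below is a minimal pure-Python sketch of the same quantity. The helper name `cosine_similarity` is hypothetical, and the vectors are toy values rather than real DeCLUTR embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between u and v (1 - cosine distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (not real embeddings): parallel vectors score close to 1,
# orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Values near 1 indicate that two texts are embedded in nearly the same direction; values near 0 indicate unrelated directions.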

Mainly for fun, the following cells visualize the semantic similarity with a heatmap!

In [None]:
from typing import List

import numpy as np
import pandas as pd
import seaborn as sns

def plot_heatmap(text: List[str], embeddings: np.ndarray) -> None:
    embeddings = torch.as_tensor(embeddings)
    cosine = torch.nn.CosineSimilarity(-1)
    similarity_matrix = []
    for embedding in embeddings:
        similarity_vector = cosine(embedding, embeddings)
        similarity_vector = similarity_vector.numpy()
        similarity_matrix.append(similarity_vector)
    df = pd.DataFrame(similarity_matrix)
    df.columns = df.index = text
    sns.heatmap(df, cmap="YlOrRd")

In [None]:
plot_heatmap(text, embeddings)
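
The similarity matrix behind the heatmap can also be computed without a Python loop. The sketch below is one possible vectorized version using NumPy. `pairwise_cosine_similarity` is a hypothetical helper, not part of DeCLUTR, and the input is a toy array rather than real embeddings.

```python
import numpy as np


def pairwise_cosine_similarity(embeddings: np.ndarray) -> np.ndarray:
    """Return an (n, n) matrix of cosine similarities between the rows."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # Scale each row to unit length so a dot product equals the cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Toy 2-dimensional "embeddings", used only to illustrate the shapes involved.
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
similarity_matrix = pairwise_cosine_similarity(toy)
print(similarity_matrix.shape)
```

The resulting array could be passed to `pd.DataFrame` in place of the list that `plot_heatmap` builds row by row.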

See the list of available `PRETRAINED_MODELS` in [declutr/encoder.py](https://github.com/JohnGiorgi/DeCLUTR/blob/master/declutr/encoder.py).

In [None]:
from declutr.encoder import PRETRAINED_MODELS

print(list(PRETRAINED_MODELS.keys()))

## 2️⃣ 🤗 Transformers

Our pretrained models are also hosted with 🤗 Transformers, so they can be used like any other model in that library. Here is a simple example using [DeCLUTR-small](https://huggingface.co/johngiorgi/declutr-small):

In [None]:
import torch
from scipy.spatial.distance import cosine

from transformers import AutoModel, AutoTokenizer

# Load the model
tokenizer = AutoTokenizer.from_pretrained("johngiorgi/declutr-small")
model = AutoModel.from_pretrained("johngiorgi/declutr-small")
model = model.to(device)

# Prepare some text to embed
text = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
]
inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
# Put the tensors on the GPU, if available
for name, tensor in inputs.items():
    inputs[name] = tensor.to(model.device)

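# Embed the text. Indexing with [0] selects the token-level hidden states
# whether the model returns a tuple or a ModelOutput object.
with torch.no_grad():
    sequence_output = model(**inputs, output_hidden_states=False)[0]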

# Mean pool the token-level embeddings to get sentence-level embeddings
embeddings = torch.sum(
    sequence_output * inputs["attention_mask"].unsqueeze(-1), dim=1
) / torch.clamp(torch.sum(inputs["attention_mask"], dim=1, keepdim=True), min=1e-9)
embeddings = embeddings.cpu()

# Compute a semantic similarity via the cosine distance
semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
print(semantic_sim)
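
To make the mean-pooling step concrete, here is a small NumPy sketch on a toy batch. The array values are made up for illustration and the variable names are assumptions. It shows that padded positions (attention mask 0) do not contribute to the sentence-level average.

```python
import numpy as np

# Toy batch: 2 sequences, 3 token positions, hidden size 2 (illustrative values only).
token_embeddings = np.array(
    [
        [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # last position is padding
        [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],  # no padding
    ]
)
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

# Zero out padded positions, sum over the token axis, then divide by the
# number of real tokens in each sequence.
mask = attention_mask[..., np.newaxis].astype(np.float64)
summed = (token_embeddings * mask).sum(axis=1)
counts = np.clip(mask.sum(axis=1), 1e-9, None)
sentence_embeddings = summed / counts

# The first row averages only its two real tokens: [2.0, 3.0].
print(sentence_embeddings)
```

Without the mask, the padding vector `[9.0, 9.0]` would be averaged into the first sentence's embedding and distort it.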

Currently available models:

- [johngiorgi/declutr-small](https://huggingface.co/johngiorgi/declutr-small)
- [johngiorgi/declutr-base](https://huggingface.co/johngiorgi/declutr-base)

## 3️⃣ Bulk embed a file

First, let's save our running example to a file, with one text per line.

In [None]:
text = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
]
text = "\n".join(text)

!echo -e "$text" > "input.txt"
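
If a shell `echo` is not available (for example, when running outside a notebook), the same file can be written directly from Python. This is a minimal sketch: `write_lines` is a hypothetical helper and not part of DeCLUTR, and it assumes the one-text-per-line layout used in the cell above.

```python
from pathlib import Path
from typing import List


def write_lines(sentences: List[str], path: str) -> None:
    """Write one sentence per line to path, encoded as UTF-8."""
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")


sentences = [
    "A smiling costumed woman is holding an umbrella.",
    "A happy woman in a fairy costume holds an umbrella.",
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
]
write_lines(sentences, "input.txt")
```

Writing from Python also avoids shell quoting issues when the text contains quotes or dollar signs.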

We then need a pretrained model to embed the text with. Following our running example, let's use DeCLUTR-small.

In [None]:
from allennlp.common.file_utils import cached_path
from declutr.encoder import PRETRAINED_MODELS

# Download the model OR retrieve its filepath if it has already been downloaded & cached.
declutr_small_cached_path = cached_path(PRETRAINED_MODELS["declutr-small"])

To embed all text in a given file with a trained model, run the following command.

In [None]:
# When embedding text with a pretrained model, we do NOT want to sample spans.
# We can turn off span sampling by setting the num_anchors attribute to None.
overrides = "{'dataset_reader.num_anchors': null}"

!allennlp predict $declutr_small_cached_path "input.txt" \
    --output-file "embeddings.jsonl" \
    --batch-size 32 \
    --cuda-device $cuda_device \
    --use-dataset-reader \
    --overrides "$overrides" \
    --include-package "declutr"

As a sanity check, let's load the embeddings and make sure their cosine similarity is as expected.

In [None]:
import json

with open("embeddings.jsonl", "r") as f:
    embeddings = []
    for line in f:
        embeddings.append(json.loads(line)["embeddings"])

In [None]:
from scipy.spatial.distance import cosine

semantic_sim = 1 - cosine(embeddings[0], embeddings[1])
print(semantic_sim)
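
A slightly broader check is to confirm that the output holds one embedding per input line and that all embeddings share the same dimensionality. The sketch below assumes each line is a JSON object with an `"embeddings"` key, as in the loading cell above. `load_embeddings` is a hypothetical helper, and the in-memory sample with made-up vectors stands in for `embeddings.jsonl`.

```python
import io
import json
from typing import IO, List


def load_embeddings(lines: IO[str]) -> List[List[float]]:
    """Parse JSON Lines output, skipping blank lines."""
    return [json.loads(line)["embeddings"] for line in lines if line.strip()]


# Made-up 3-dimensional vectors standing in for the real file contents.
sample = io.StringIO(
    '{"embeddings": [0.1, 0.2, 0.3]}\n'
    '{"embeddings": [0.4, 0.5, 0.6]}\n'
)
loaded = load_embeddings(sample)

# Every embedding should have the same length.
dimensions = {len(vector) for vector in loaded}
assert len(dimensions) == 1, f"inconsistent embedding sizes: {dimensions}"
print(len(loaded), dimensions.pop())
```

With the real output, an open file handle can be passed instead of the sample, for example `with open("embeddings.jsonl") as f: loaded = load_embeddings(f)`.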

## ♻️ Conclusion

That's it! In this notebook, we covered three ways to embed text with a pretrained model. Please see [our paper](https://arxiv.org/abs/2006.03659) and [repo](https://github.com/JohnGiorgi/DeCLUTR) for more details, and don't hesitate to open an issue if you have any trouble!