# Neural search with BERT sentence embeddings

Instead of training our own document embeddings, we use a pretrained model that produces embeddings directly at the sentence or document level.

**Note:** This tutorial is an adapted version of the [semantic search example on SBERT](https://www.sbert.net/examples/applications/semantic-search/README.html).

### Install PyTorch and SentenceTransformers

This installation may take a while. If you are running Python from the command line rather than in a notebook, install the packages the usual way, e.g. `pip install torch`.

In [None]:
import sys

!{sys.executable} -m pip install torch
!{sys.executable} -m pip install sentence_transformers

### Define a document collection and create embeddings for them

Let's use the examples from the SentenceBERT tutorial. Creating embeddings takes just two lines of code: one to define (and download) the pretrained model, and one to compute the embeddings themselves:

In [None]:
from sentence_transformers import SentenceTransformer, util
import torch
import numpy as np

corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
corpus_embeddings.shape

The chosen BERT model produces embeddings with 384 dimensions. As the corpus contains 9 examples, the result is a 9 x 384 matrix.

### Query the document collection

Now we can start writing queries and searching for the most relevant documents. For each query, we:
- Produce its embedding using the same model as above
- Find the 3 most similar documents in the corpus
- Display the document text and the similarity score

In [None]:
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey across a field.",
]

for query in queries:
    query_embedding = model.encode(query, convert_to_tensor=True)

    # We use cosine similarity and torch.topk to find the highest 3 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    most_similar = torch.topk(cos_scores, k=3)
    print("Query:", query)
    for score, idx in zip(most_similar[0], most_similar[1]):
        print(f"- {corpus[idx]} (Score: {score:.4f})")
    print()
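
To make the two steps inside the loop more concrete, here is a minimal sketch in plain Python of what cosine similarity and a top-k selection compute. It is not how `util.cos_sim` or `torch.topk` are implemented; the helper names `cosine_similarity` and `top_k` and the 3-dimensional toy vectors are made up for illustration and stand in for the real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return (score, index) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" (assumption: 3 dimensions instead of 384)
toy_docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [2.0, 0.0, 0.0]

toy_results = top_k(toy_query, toy_docs, k=2)
for score, idx in toy_results:
    print(f"doc {idx}: {score:.4f}")
```

Note that the query vector is twice as long as the first document vector, yet their similarity is still maximal: cosine similarity only depends on the direction of the vectors, not on their length.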

### Working on a bigger document collection

This seems to work fine. Let's go back to the Lee corpus that we used before and re-run the same pipeline:

In [None]:
import os
import gensim

lee_bg_file = os.path.join(gensim.__path__[0], 'test', 'test_data', 'lee_background.cor')
# Read one document per line; the context manager closes the file for us
with open(lee_bg_file, 'r') as lee_bg_corpus:
    corpus = lee_bg_corpus.readlines()
for i in range(20):
    print(f"({i:>2})  {corpus[i][:100]}...")
print()

We're again using the `all-MiniLM-L6-v2` pretrained model, so we don't have to reload it. Producing the document embeddings will take some time here as the collection is a bit larger.

We also call `.cpu()` on the result so that the embeddings live in CPU memory, which makes saving and reloading them easier. In practice, neither this notebook nor your Flask server is connected to a GPU, so everything runs on the CPU anyway.

In [None]:
corpus_embeddings = model.encode(corpus, convert_to_tensor=True).cpu()
corpus_embeddings.shape

In this case, we get a 300 x 384 matrix because we have 300 examples in our corpus.

**Note:** You can save the embeddings into a file like this:
```
import numpy as np
np.savez_compressed("my_embeddings.npz", data=corpus_embeddings.numpy())
```

and then load them again like this:
```
emb_file = np.load("my_embeddings.npz")
corpus_embeddings = torch.from_numpy(emb_file["data"]).to("cpu")
```
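
If you want to convince yourself that nothing is lost in this round trip, the following sketch may help. It is only an illustration under stated assumptions: a random `float32` array named `fake_embeddings` stands in for the real embeddings, and the file is written to a temporary directory instead of your working directory.

```python
import os
import tempfile

import numpy as np

# Assumption: random values with the same shape as the tutorial's embeddings
rng = np.random.default_rng(seed=0)
fake_embeddings = rng.standard_normal((300, 384)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_embeddings.npz")
    np.savez_compressed(path, data=fake_embeddings)

    # Reading the "data" entry loads the full array into memory,
    # so it remains usable after the file is closed
    with np.load(path) as emb_file:
        restored = emb_file["data"]

print(restored.shape, restored.dtype)
print(np.array_equal(fake_embeddings, restored))
```

The compression is lossless, so the restored array should match the original exactly in shape, dtype, and values.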

Let's try a new set of queries on this corpus and show the 5 most similar documents for each:

In [None]:
queries = [
    "Israel",
    "investors capital profits financial services",
    "militant terrorist killed",
    "earthquake flood rain tsunami forest fire"
]

for q in queries:
    print("Query:", q)
    query_embedding = model.encode(q, convert_to_tensor=True).cpu()

    # We use cosine similarity and torch.topk to find the highest 5 scores
    cos_scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
    most_similar = torch.topk(cos_scores, k=5)
    for score, idx in zip(most_similar[0], most_similar[1]):
        print(f"- {score:.4f}  {corpus[idx][:100]}...")
    print()

That's it! If you're interested in using sentence embeddings for your project, have a look at the [semantic search page](https://www.sbert.net/examples/applications/semantic-search/README.html). In particular, check whether other pretrained models are more suitable for your task and language than `all-MiniLM-L6-v2`.