[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W2E_Pipelines_Sentence_Transformers.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install transformers sentencepiece torch datasets sentence-transformers

# 🤗 Pipelines & Sentence Transformers for Semantic Search and QA

*This notebook is partly adapted from a tutorial by Wietse De Vries for the IK-NLP course of 2021*

The goal of this notebook is to let you practice with some pipeline use cases and to introduce the [Sentence Transformers](https://sbert.net/) library through a concrete example of semantic search and question answering (relevant to the Open-book Question Answering final project).

Exercises 1 and 2 of this notebook are mandatory and will be part of your first graded portfolio. Exercise 3 is optional, but we highly recommend completing it, especially if you're interested in the Open-book Question Answering final project.

## Exercise 1: Using the Fill-mask pipeline for Probing Linguistic Knowledge in mBERT

As you probably know by now, BERT is a transformer-based, context-sensitive neural language model that has been trained, among other objectives, on a masked language modeling task, where the model learns to predict the most likely word at a *masked* (hidden) position in the sentence. In a sentence like:

>There were several [MASK] with the proposed solution.
    
the model will learn that the word *problems* or *issues* is more likely at this position than the word *unicorns* or *days*. The model uses both the right and left context of the masked position to make its predictions. Models trained to maximize the probability of predicting the correct masked words learn to represent a good amount of linguistic knowledge as a result of this.

A **probe** is a test of a language model aimed at investigating how accurate these predictions are, especially for cases where syntax makes it quite clear that one (form of a) word is correct, and another word is impossible. In the example above, for instance, the masked position can be filled by a plural noun (*problems*) but not by a singular noun (*problem*). If the model makes predictions that respect the linguistic constraints, we have reason to believe that the model is somehow aware of the linguistic structure of the language.

While predicting whether the masked position should be filled by a singular or plural noun seems easy in the example above (both *were* and *several* are good predictors of plural), we can try to make the task harder by looking for contexts where the solution requires more careful *attention* to the right words in the context:

>There were some [MASK] with the proposed solution.
>
>There could be several [MASK] with the proposed solution.
>
>There were some unexpected and unforeseen [MASK] with the proposed solution.
    
In the examples above, the task is made harder by replacing *several* (which is always followed by a plural noun) by *some* (which can be followed by a singular or a plural noun), by replacing *were* (which always heads a sentence with a plural subject) by *could be* (which can head a sentence with a singular or plural subject), and by inserting material between the verb *were* (which indicates that there should be a plural) and the MASK.


### Assignment

Think of a grammatical phenomenon in a language of your choice, and come up with at least 5 example sentences to probe whether the model makes the correct predictions. Think of cases where the context makes it clear that the mask has to be plural or singular, that a verb has to have a particular form (like plural or singular, or participle or infinitive), that a specific (personal, possessive, reflexive) pronoun has to be used, that an adjective or noun has to have a specific inflection (like in German and more generally in languages with a rich case and/or gender marking system). There is a host of literature on this, see for instance [Marvin and Linzen](https://arxiv.org/abs/1808.09031) (for English) and [Sahin et al](https://www.mitpressjournals.org/doi/full/10.1162/coli_a_00376) (for multilingual probes).

### Model

The model we will use for this task is [multilingual BERT](https://huggingface.co/bert-base-multilingual-cased) (mBERT), which was trained on the Wikipedia text of the 104 languages with the largest Wikipedias. This means that you do not have to choose examples from English; you may also present a probe for another language.

The following code loads the pipeline for masked prediction together with the mBERT model (this may take a minute or so). You can ignore the warning about some weights not being used. The pipeline can be used to test masked language model prediction: given a sequence containing the special token [MASK], the model predicts the most likely tokens at that position, using both left and right context.

In [15]:
from transformers import pipeline

mbert = pipeline('fill-mask', model='bert-base-multilingual-cased')
mbert('[There were some unexpected and unforeseen [MASK] with the proposed solution.]')

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'score': 0.6111882925033569,
  'token': 20390,
  'token_str': 'problems',
  'sequence': '[ There were some unexpected and unforeseen problems with the proposed solution. ]'},
 {'score': 0.04767114296555519,
  'token': 17850,
  'token_str': 'issues',
  'sequence': '[ There were some unexpected and unforeseen issues with the proposed solution. ]'},
 {'score': 0.04377792775630951,
  'token': 73082,
  'token_str': 'dealing',
  'sequence': '[ There were some unexpected and unforeseen dealing with the proposed solution. ]'},
 {'score': 0.03851241618394852,
  'token': 64557,
  'token_str': 'difficulties',
  'sequence': '[ There were some unexpected and unforeseen difficulties with the proposed solution. ]'},
 {'score': 0.030926214531064034,
  'token': 47451,
  'token_str': 'concerned',
  'sequence': '[ There were some unexpected and unforeseen concerned with the proposed solution. ]'}]

By default, the pipe returns the 5 most likely words that could appear at the position of the mask, along with a score. If you want to know specifically whether the model prefers one of two forms, you can give these forms as targets to the pipe, and also print the answer in a more readable form:

In [2]:
def probe(sentence: str, targets: list):
    # `targets` is a list of candidate tokens; results come back sorted by score
    for res in mbert(sentence, targets=targets):
        print(f"{res['score']:6.4f}\ttoken: {res['token_str']}\t{res['sequence']}")
        
probe('There were some unexpected and unforeseen [MASK] with the proposed solution.',['challenge','challenges'])

0.0034	token: challenges	There were some unexpected and unforeseen challenges with the proposed solution.
0.0001	token: challenge	There were some unexpected and unforeseen challenge with the proposed solution.
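
To summarize a whole set of probe sentences at once, the pairwise comparison can be wrapped in a small helper. The sketch below is a hypothetical addition rather than part of the original tutorial: `probe_accuracy`, `ProbeCase` and `fake_fill_mask` are names introduced here. It assumes the fill-mask callable returns a list of dicts with `score` and `token_str` keys, as in the output above, and that each target is a single token in the model vocabulary (otherwise the pipeline may score a sub-token instead).

```python
from typing import Callable, Dict, List, Tuple

# (sentence with [MASK], grammatical target, ungrammatical target)
ProbeCase = Tuple[str, str, str]


def probe_accuracy(cases: List[ProbeCase],
                   fill_mask: Callable[..., List[Dict]]) -> float:
    """Fraction of cases where the grammatical target gets the higher score."""
    correct = 0
    for sentence, good, bad in cases:
        results = fill_mask(sentence, targets=[good, bad])
        # Map each returned token to its score, then compare the two forms.
        scores = {r["token_str"]: r["score"] for r in results}
        if scores.get(good, 0.0) > scores.get(bad, 0.0):
            correct += 1
    return correct / len(cases) if cases else 0.0


# Stand-in for the real pipeline so the sketch runs offline; the scores
# are copied from the output above and reused for every sentence.
def fake_fill_mask(sentence: str, targets: List[str]) -> List[Dict]:
    table = {"challenges": 0.0034, "challenge": 0.0001}
    return [{"token_str": t, "score": table[t]} for t in targets]


cases = [("There were some unexpected and unforeseen [MASK] "
          "with the proposed solution.", "challenges", "challenge")]
print(probe_accuracy(cases, fake_fill_mask))
```

With the real model loaded, passing `mbert` in place of `fake_fill_mask` should give the same kind of summary over your own probe sentences.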


> **💡 Interesting Fact**: The same bias that is present towards grammatically correct choices can be observed in other cases, such as racial and gender stereotyping. Much work is currently in progress to identify and remove gender and racial biases from learned language embeddings. See the following example and [this recent survey on the topic](https://arxiv.org/abs/2112.14168).

In [3]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-cased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic']
['nurse', 'waitress', 'teacher', 'maid', 'prostitute']


### Your Turn to Probe

Give at least five example sentences with a [MASK] and a list of targets that illustrate a specific grammatical phenomenon in a language of your choice. Describe what the grammatical phenomenon is you are investigating. Use the probe function for testing. Try to include both *easy* sentences (where the model should do well) as well as *hard* sentences (where there are words in the context that might lead to confusion, or where the clue words are far away from the mask). For languages other than Dutch or English, make sure to include enough explanation so that examples and tests are clear to a non-native speaker. 

Describe how well the model did on your probe sentences. Were there any cases where the model made the wrong decision?

### Answer
A common mistake in Dutch is to use the word "als" instead of "dan" in a comparison where the subjects are not equal. Take for example the sentence "Maarten is stronger than Peter". "Maarten is sterker als Peter" is incorrect, and "Maarten is sterker dan Peter" is correct. In fact, the use of "als" instead of "dan" is always incorrect in such a comparison. The use of "als" in a comparison is however correct when the subjects are equal. Take for example the sentence "Maarten is as strong as Peter". In Dutch this would for example be "Maarten is even sterk als Peter". So "als" is correct in the second type of comparison, but not in the first type. Unfortunately, these cases are mixed up quite often.

A list of probes was constructed to test whether mBERT correctly uses "dan" instead of "als" in unequal comparisons. The probes contained two different types of subjects to be compared: the pronouns "jij" (you) and "ik" (I/me), and the colours "paars" (purple) and "groen" (green). Four different adjectives were used: "sterker" (stronger), "levendiger" (more vibrant), "slimmer" (smarter), and "helderder" (brighter). Complexity was introduced by using an adjective that is uncommonly used in describing a given subject. E.g. "slim" is not really an adjective one would use to describe a colour. Still, even if the meaning makes less sense, it would be expected that "dan" is used instead of "als". Complexity was also introduced by making sentences longer.

Remarkably, mBERT gives the correct answer in only four out of twelve probes. For sentences about colours, mBERT's mistakes are in the relatively more complex case of "slimmer", and the longer sentences. For the comparisons containing pronouns, mBERT is incorrect not only in the relatively more complex cases of "levendiger" and "helderder", but also in the simple case of "sterker". Additionally, it is incorrect in the cases of longer sentences.

It is unclear why mBERT performs so poorly at selecting the correct word, even in relatively simple cases. After all, given that it was trained on Wikipedia text, it is unlikely to have seen many examples of "als" being used incorrectly in an unequal comparison.

In [103]:
print("English: Purple is more vibrant than green.")
probe("Paars is levendiger [MASK] groen.", ["dan", "als"])
print("---------------------")
print("English: Purple is brighter than green.")
probe("Paars is helderder [MASK] groen.", ["dan", "als"])
print("---------------------")
print("English: Purple is stronger than green.")
probe("Paars is sterker [MASK] groen.", ["dan", "als"])
print("---------------------")
print("English: Purple is smarter than green.")
probe("Paars is slimmer [MASK] groen.", ["dan", "als"])
print("---------------------")
print("English: Purple is a more vibrant colour than green.")
probe("Paars is een levendigere kleur [MASK] groen.", ["dan", "als"])
print("---------------------")
print("English: Purple is a brighter colour than green.")
probe("Paars is een helderdere kleur [MASK] groen.", ["dan", "als"])
print("---------------------")
print("English: You are more vibrant than me.")
probe("Jij bent levendiger [MASK] ik.", ["dan", "als"])
print("---------------------")
print("English: You are brighter than me.")
probe("Jij bent helderder [MASK] ik.", ["dan", "als"])
print("---------------------")
print("English: You are stronger than me.")
probe("Jij bent sterker [MASK] ik.", ["dan", "als"])
print("---------------------")
print("English: You are smarter than me.")
probe("Jij bent slimmer [MASK] ik.", ["dan", "als"])
print("---------------------")
print("English: You read a lot of books and that is why you are smarter than me.")
probe("Jij leest veel boeken en daarom ben je slimmer [MASK] ik.", ["dan", "als"])
print("---------------------")
print("English: You go to the gym and that is why you are stronger than me.")
probe("Jij gaat naar de sportschool en daarom ben je sterker [MASK] ik.", ["dan", "als"])


English: Purple is more vibrant than green.
0.0014	token: dan	Paars is levendiger dan groen.
0.0013	token: als	Paars is levendiger als groen.
---------------------
English: Purple is brighter than green.
0.0013	token: dan	Paars is helderder dan groen.
0.0011	token: als	Paars is helderder als groen.
---------------------
English: Purple is stronger than green.
0.0149	token: dan	Paars is sterker dan groen.
0.0112	token: als	Paars is sterker als groen.
---------------------
English: Purple is smarter than green.
0.0021	token: als	Paars is slimmer als groen.
0.0020	token: dan	Paars is slimmer dan groen.
---------------------
English: Purple is a more vibrant colour than green.
0.0022	token: als	Paars is een levendigere kleur als groen.
0.0021	token: dan	Paars is een levendigere kleur dan groen.
---------------------
English: Purple is a brighter colour than green.
0.0016	token: als	Paars is een helderdere kleur als groen.
0.0012	token: dan	Paars is een helderdere kleur dan groen.
---------

## Exercise 2: Mixing Pipelines for Text-to-text QA in Many Languages

The HuggingFace Model Hub is home to a staggering number of models for the most disparate use cases, but you may have noticed that many of them are trained on English only. Consider for example the [`UnifiedQA`](https://github.com/allenai/unifiedqa) model by AllenAI, a T5-based model trained to perform question answering in multiple formats (e.g. extract the answer from the provided context, produce an answer without a supporting context, choose among multiple possible answers, answer yes/no questions) using a unified text-to-text approach. While this opens thrilling perspectives for having a single model for all QA use cases, UnifiedQA models are available for English only, and training such models from scratch in another language would require nontrivial effort and resources.

Here are several examples of how text should be formatted for the UnifiedQA model:

| **Task type** | **Example Dataset** | **Format** | **Example** | **Output** |
| :---: | :--- | :--- | :--- | :--- |
|  **Extractive QA** | SQUAD | `<QUESTION> \n <CONTEXT>` | `At what speed did the turbine operate? \n (Nikola_Tesla) On his 50th birthday in 1906, Tesla demonstrated his 200 horsepower (150 kilowatts) 16,000 rpm bladeless turbine. ...` |  `16,000 rpm` |
|  **Abstractive QA** | NarrativeQA | `<QUESTION> \n <CONTEXT>` | `What does a drink from narcissus's spring cause the drinker to do?  \n  Mercury has awakened Echo, who weeps for Narcissus, and states that a drink from Narcissus's spring causes the drinkers to ''Grow dotingly enamored of themselves.'' ...` | `fall in love with themselves` |
|  **Multiple-choice QA** | ARC-challenge | `<QUESTION> \n (a) <CHOICE_A> (b) <CHOICE_B> ...` | `What does photosynthesis produce that helps plants grow? \n (A) water (B) oxygen (C) protein (D) sugar` | `sugar` |
|  **Multiple-choice QA with context** | MCTest | `<QUESTION> \n (a) <CHOICE_A> (b) <CHOICE_B> ... \n <CONTEXT>` | `Who was Billy? \n (A) The skinny kid (B) A teacher (C) A little kid (D) The big kid \n Billy was like a king on the school yard. A king without a queen. He was the biggest kid in our grade, so he made all the rules during recess. ...` | `The big kid` |
|  **Yes-no QA** | BoolQ | `<QUESTION> \n <CONTEXT>` | `Was America the first country to have a president?  \n (President) The first usage of the word president to denote the highest official in a government was during the Commonwealth of England ...` | `no` |

### Assignment

We are going to build a function that uses multiple models through 🤗 Pipelines to generate a response to a question in one of the formats specified above, asked in one of the languages supported by the MT systems available on the HuggingFace Model Hub. The function will translate the query and paraphrase it into multiple variants, and then pick the best outputs of the UnifiedQA model as candidates for back-translation. In this way, we mock the existence of a UnifiedQA model for the language of our choice.

### Model

The following code loads the UnifiedQA model and uses it to perform QA on a multiple-choice question without context.

In [1]:
from transformers import pipeline

# Using at least the base variant of the model is advised for good results
generator = pipeline("text2text-generation", model="allenai/unifiedqa-t5-base")
generator("What is the name of the city where the Eiffel Tower is located? \n (A) Paris (B) London (C) Prague (D) Berlin")

[{'generated_text': 'Paris'}]

### Your turn to Pipe

Using the pipelines we saw in the tutorial, create a function that takes a `question` string, an optional `context` string and an optional list of strings called `choices`, and performs the following steps:

- Use one of the [MarianMT](https://huggingface.co/models?search=helsinki-nlp) machine translation models to translate all the inputs from the language of your choice to English. You may want to split the text into sentences (e.g. by splitting on periods) to obtain better results for the context.

- Use a paraphrasing model ([`tuner007/pegasus_paraphrase`](https://huggingface.co/tuner007/pegasus_paraphrase) is a good choice, albeit heavy) to produce 4 paraphrases of the question, using the `num_return_sequences` parameter.

- For each question (translated + 4 paraphrases), format it with the translated context and choices (if present) as a single string in the format required by UnifiedQA.

- Use the `allenai/unifiedqa-t5-base` model to generate an answer for each of the 5 questions.

- If at least 3 of the 5 answers are identical strings, return the result translated back to the original language using the MarianMT model for the reciprocal language pair (e.g. if you used `Helsinki-NLP/opus-mt-nl-en` to translate from Dutch to English, you will need to use `Helsinki-NLP/opus-mt-en-nl`). Otherwise, print "No common answer found" translated in the original language.

**Importantly**, the quality of the output does not determine your score in the evaluation. The goal is to get a feel for the models and their capabilities.
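
The voting step described above can be isolated from the models and tested on its own. This is a minimal sketch with a hypothetical helper name, `majority_answer`. Note one deliberate deviation from the task description, labeled in the code: answers are compared after lowercasing and stripping whitespace rather than as strictly identical strings.

```python
from collections import Counter
from typing import List, Optional


def majority_answer(answers: List[str], min_votes: int = 3) -> Optional[str]:
    """Return the most frequent answer if it has at least `min_votes` votes."""
    if not answers:
        return None
    # Assumption: ignore case and surrounding whitespace when comparing
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    if votes < min_votes:
        return None
    # Return the first original spelling that matches the winning form
    return next(a.strip() for a in answers if a.strip().lower() == winner)


print(majority_answer(["Groningen", "Groningen", "groningen", "Luuk", "a student"]))
print(majority_answer(["Luuk", "Groningen", "a student", "AI", "24"]))
```

Returning `None` when there is no agreement leaves it to the caller to decide what message to translate back into the original language.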

In [9]:
from typing import Optional, List
from collections import Counter
from transformers import pipeline, PegasusForConditionalGeneration, PegasusTokenizer

def answer(question: str, context: Optional[str] = None, choices: Optional[List[str]] = None):
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    paraphraser_model_name = "tuner007/pegasus_paraphrase"
    paraphraser_tokenizer = PegasusTokenizer.from_pretrained(paraphraser_model_name)
    paraphraser_model = PegasusForConditionalGeneration.from_pretrained(paraphraser_model_name)
    dutch_to_english_translator = pipeline('translation', model = "Helsinki-NLP/opus-mt-nl-en")
    english_to_dutch_translator = pipeline('translation', model = "Helsinki-NLP/opus-mt-en-nl")
    

    question_en = dutch_to_english_translator(question)[0]['translation_text']
    paraphraser_tokens = paraphraser_tokenizer(question_en, return_tensors="pt")
    output = paraphraser_model.generate(**paraphraser_tokens, num_return_sequences = 4)
    paraphrases = paraphraser_tokenizer.batch_decode(output, skip_special_tokens=True)
   
    all_questions = [question_en]
    for paraphrase in paraphrases:
        all_questions.append(paraphrase)
    
    if (context):
        # Remove any leading or trailing spaces and periods from the
        # context. Then, split the context at periods. Translate each
        # sentence separately. Finally, strip each translation of 
        # spaces and periods, and combine all sentences with periods
        # between them.
        context_en = dutch_to_english_translator(context.strip(" .").split('.'))
        complete_context_en = ""
        for sentence in context_en:
            complete_context_en += sentence['translation_text'].strip(" .") + ". "

        

    if (choices):
        choices_en = []
        for choice in choices:
            choices_en.append(dutch_to_english_translator(choice)[0]['translation_text'])

        
    all_answers = []
    for q in all_questions:
        model_input = f"{q}"

        # UnifiedQA expects the choices before the context:
        # <QUESTION> \n (A) ... (B) ... \n <CONTEXT>
        if (choices):
            model_input += " \n "
            for i in range(len(choices_en)):
                model_input += f"({alphabet[i]}) {choices_en[i]} "

        if (context):
            model_input += " \n "
            model_input += f"{complete_context_en}"

        # Run the UnifiedQA model once per question variant
        all_answers.append(generator(model_input)[0]['generated_text'])


    most_common = Counter(all_answers).most_common(1)
    if (most_common[0][1] >= 3):
        final_answer_en = most_common[0][0]
    else:
        final_answer_en = "No common answer found"

    
    return english_to_dutch_translator(final_answer_en)[0]['translation_text']

question = "Waar woon ik?"
context = "Ik ben een student in Groningen. Ik studeer kunstmatige intelligentie. Mijn naam is Luuk. Ik ben bijna 24 jaar oud."
choices = ["Luuk", "Groningen", "een student", "kunstmatige intelligentie"]
final_answer = answer(question = question, context = context, choices = choices)
final_answer
# Maybe use a tokenizer?
# Lowercase?

'Groningen'

## (Optional) Exercise 3: SentenceTransformers for Semantic Similarity Search

*Figures and some code are from the [CohereAI Semantic Search Tutorial](https://docs.cohere.ai/semantic-search/) by Jay Alammar.*

> SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in our paper [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084).
>
> You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared e.g. with cosine-similarity to find sentences with a similar meaning. This can be useful for semantic textual similar, semantic search, or paraphrase mining. 
> 
> The framework is based on PyTorch and Transformers and offers a large collection of pre-trained models tuned for various tasks. Further, it is easy to fine-tune your own models.

Semantic search is a typical use case in natural language processing, in which we want to retrieve the most relevant documents from a corpus (e.g. Automatic FAQs, Web browser search results, etc.). Using the similarity between embedded text representations allows us to go beyond simple keyword matching, which is highly desirable in this setting (e.g. `tomorrow will rain` should be very close in embedding space to `the weather forecast announces showers for the next day`, despite no lexical overlap). 

<div>
<img alt="Visualizing Semantic Search" src="https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/basic-semantic-search-overview.png?3" style="width: 60%" />
</div>

In this exercise we will use first Huggingface Transformers and then the [SentenceTransformers](https://sbert.net) library to find the most relevant paragraphs for a specific query.

Let's start by loading the `train` split of the `squad` dataset from the Dataset Hub and flattening its structure so that every example contains a single triplet `(context, question, answer)`. We will use only the first 50 rows of the dataset; the rest can be discarded.

In [None]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:50]")

squad_train_filtered = None  # Your code here

Now, sample a question at random from the resulting dataset and print it. You can use `shuffle` or turn the Dataset into a `DataFrame` and use `sample`.

In [None]:
query = None # Your code here

print(query)

We are now going to use a model trained to perform **semantic search** to retrieve the top 10 most likely contexts for each selected question.

The model `sentence-transformers/multi-qa-MiniLM-L6-cos-v1` is a good choice for this task, since it is relatively small and was explicitly trained for semantic search. We will now define three utility functions:

- `dot_score` computes the dot product between two PyTorch tensors.
- `mean_pooling` averages the embeddings produced by a model to obtain a **sentence embedding**.
- `encode` uses a model and a tokenizer to convert a list of texts into a tensor of embeddings. **Complete it with the first two steps we saw in the tutorial.**

In [17]:
import torch
from torch import Tensor
import torch.nn.functional as F


def dot_score(a: Tensor, b: Tensor):
    """
    Computes the dot-product dot_prod(a[i], b[j]) for all i and j.
    :return: Matrix with res[i][j]  = dot_prod(a[i], b[j])
    Taken from the SentenceTransformer library
    """
    if not isinstance(a, torch.Tensor):
        a = torch.tensor(a)
    if not isinstance(b, torch.Tensor):
        b = torch.tensor(b)
    if len(a.shape) == 1:
        a = a.unsqueeze(0)
    if len(b.shape) == 1:
        b = b.unsqueeze(0)
    # Compute the dot-product
    return torch.mm(a, b.transpose(0, 1))


#Mean Pooling - Average all the embeddings produced by the model
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output.last_hidden_state
    # Expand the mask to the shape of the token embeddings so that padding tokens are zeroed out
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Compute the mean of the token embeddings
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


#Encode text
def encode(model, tokenizer, texts):
    # Tokenize sentences
    encoded_input = None # Your code here
    # Compute token embeddings
    with torch.no_grad():
        model_output = None # Your code here
    # Perform pooling
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # Normalize embeddings
    embeddings = F.normalize(embeddings, p=2, dim=1)
    
    return embeddings
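
To make the pooling and normalization steps concrete, here is the same arithmetic on a toy example in plain Python, without PyTorch. It is an illustrative sketch only: the token vectors are made up, and `masked_mean`, `l2_normalize` and `dot` are hypothetical names that mirror what `mean_pooling`, `F.normalize` and `dot_score` do for a single sentence.

```python
import math
from typing import List

Vector = List[float]


def masked_mean(token_vectors: List[Vector], mask: List[int]) -> Vector:
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, keep in zip(token_vectors, mask):
        for d in range(dim):
            total[d] += vec[d] * keep
    count = max(sum(mask), 1e-9)  # guard against division by zero
    return [t / count for t in total]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy sentence: two real tokens followed by one padding token (made-up values)
tokens = [[1.0, 3.0], [3.0, 1.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)   # padding vector is ignored
sentence = l2_normalize(pooled)
print(pooled)
print(dot(sentence, sentence))
```

Because the embeddings are normalized to unit length, the dot product between two of them equals their cosine similarity, which is why a plain dot score is enough for ranking here.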

Let's now use these functions to compute similarity scores for the query and all the contexts:

In [None]:
# Sentences we want sentence embeddings for
contexts = list(squad_train_filtered.to_pandas()['context'])

# Load the model and tokenizer from HuggingFace Hub
tokenizer = None # Your code here
model = None # Your code here

#Encode query and contexts with the encode function
query_emb = None # Your code here
contexts_emb = None # Your code here

#Compute dot score between query and all contexts embeddings
scores = torch.mm(query_emb, contexts_emb.transpose(0, 1))[0].cpu().tolist()

#Combine contexts & scores
contexts_score_pairs = list(zip(contexts, scores))

#Sort by decreasing score
contexts_score_pairs = sorted(contexts_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for ctx, score in contexts_score_pairs:
    print(score, ctx)
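
The loop above prints every context, while the task asks for the 10 best ones. A minimal sketch of truncating the ranking is shown below; `top_k` is a hypothetical helper, and the texts and scores are made up for illustration.

```python
from typing import List, Tuple


def top_k(texts: List[str], scores: List[float],
          k: int = 10) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (text, score) pairs, best first."""
    pairs = sorted(zip(texts, scores), key=lambda p: p[1], reverse=True)
    return pairs[:k]


# Made-up contexts and scores, for illustration only
demo_texts = ["ctx A", "ctx B", "ctx C"]
demo_scores = [0.12, 0.87, 0.45]
for text, score in top_k(demo_texts, demo_scores, k=2):
    print(f"{score:.2f}  {text}")
```

In the notebook, `top_k(contexts, scores)` would play the role of the sort-and-print steps once `scores` has been computed.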

We can now use SentenceTransformers to do the same, but using much less code:

In [None]:
from sentence_transformers import SentenceTransformer, util

# Query was defined above
contexts = list(squad_train_filtered.to_pandas()['context'])

#Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

#Encode query and contexts using SentenceTransformer model.encode
query_emb = model.encode(query)
contexts_emb = model.encode(contexts)

#Compute dot score between query and all contexts embeddings
scores = util.dot_score(query_emb, contexts_emb)[0].tolist()

#Combine contexts & scores
contexts_score_pairs = list(zip(contexts, scores))

#Sort by decreasing score
contexts_score_pairs = sorted(contexts_score_pairs, key=lambda x: x[1], reverse=True)

#Output passages & scores
for ctx, score in contexts_score_pairs:
    print(score, ctx)

For a more advanced overview of how to speed up semantic search, see how to use the [FAISS](https://github.com/facebookresearch/faiss) library to perform fast nearest-neighbor search natively with Huggingface Transformers: [Using FAISS for efficient similarity search](https://huggingface.co/course/chapter5/6?fw=pt#using-faiss-for-efficient-similarity-search).