<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/sentence_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentence Transformers

* We will test the sentence transformers we saw in the lecture
* This is best done using the sentence-transformers package
* Documentation and other information: https://www.sbert.net/

In [18]:
!pip3 install -q datasets transformers sentence-transformers

# Data

For this exercise we will need some paraphrase-style data, i.e. sentence pairs with the same meaning.

* **English**: We can use, for example, the MRPC dataset from GLUE
* **Finnish**: We can use TurkuNLP's own large paraphrase dataset

Both are fortunately in the Hugging Face datasets repository :)

# The task

For each text, identify its most likely paraphrase by comparing its embedding with the embeddings of all possible sentences it could be paired with in the data and selecting the one with the maximum similarity.

# English - MRPC

Description from [Hugging Face datasets MRPC](https://huggingface.co/datasets/glue#mrpc) entry:

> The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

In [19]:
import datasets

dataset_en = datasets.load_dataset("glue", "mrpc")

In [20]:
# Filter to positive cases to ensure a paraphrase pair exists
def is_paraphrase_pair(example):
    return example["label"] == 1

dataset_en = dataset_en.filter(is_paraphrase_pair)

In [21]:
# Print the first 20 examples
for item in dataset_en["test"].select(range(20)):
    print(item)

{'sentence1': "PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So .", 'sentence2': 'Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .', 'label': 1, 'idx': 0}
{'sentence1': "The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected .", 'sentence2': 'Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash .', 'label': 1, 'idx': 1}
{'sentence1': 'According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 .', 'sentence2': 'The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2

# Finnish - Turku Paraphrase Corpus

Description from [Hugging Face datasets Turku Paraphrase Corpus](https://huggingface.co/datasets/TurkuNLP/turku_paraphrase_corpus) entry:

> The project gathered a large dataset of Finnish paraphrase pairs (over 100,000). The paraphrases are selected and classified manually, so as to minimize lexical overlap, and provide examples that are maximally structurally and lexically different. The objective is to create a dataset which is challenging and better tests the capabilities of natural language understanding. An important feature of the data is that most paraphrase pairs are distributed in their document context. The primary application for the dataset is the development and evaluation of deep language models, and representation learning in general.

Some observations about the Finnish dataset:
* Labels are a bit more complex than just paraphrase yes/no
* 4: paraphrase universally, 3: paraphrase in the given context, 2: related but not paraphrase
* 4> and 4< paraphrase in one direction but not the other (entailment)
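Because the labels are strings rather than integers, it can help to count them before filtering. The following is a minimal sketch on a handful of hypothetical toy examples (not actual corpus rows); it only assumes that each example carries a `label` field holding one of the strings listed above.

```python
from collections import Counter

# Hypothetical toy examples; only the "label" field matters here
examples = [
    {"label": "4"},
    {"label": "4>"},
    {"label": "3"},
    {"label": "4"},
    {"label": "2"},
]

def is_full_paraphrase(example):
    # Labels are strings, so compare against "4" rather than the integer 4
    return example["label"] == "4"

# How many examples carry each label
label_counts = Counter(example["label"] for example in examples)

# Keep only the universal paraphrases
full_paraphrases = [example for example in examples if is_full_paraphrase(example)]

print(label_counts)
print(len(full_paraphrases))
```

Note that the directional label `4>` is not kept by this filter, since it is a different string from `4`.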

In [22]:
dataset_fi = datasets.load_dataset("TurkuNLP/turku_paraphrase_corpus", "plain")

In [23]:
# Filter to full paraphrases to ensure a paraphrase pair exists
def is_paraphrase_pair(example):
    return example["label"] == "4"

dataset_fi = dataset_fi.filter(is_paraphrase_pair)

In [24]:
# Print the first 20 examples
for item in dataset_fi["test"].select(range(20)):
    print(item)

{'id': 'turku_paraphrase_corpus-test-1', 'gem_id': 'gem-turku_paraphrase_corpus-test-1', 'goeswith': 'episode-11836', 'fold': 90, 'text1': 'Katsokaa hänen pikkuhampaita. Ihan kuin delfiinimies.', 'text2': 'Katsokaa hänen pikkuruisia hampaitaan, hän näyttää delfiinimieheltä.', 'label': '4', 'binary_label': 'positive', 'is_rewrite': True}
{'id': 'turku_paraphrase_corpus-test-3', 'gem_id': 'gem-turku_paraphrase_corpus-test-3', 'goeswith': 'episode-07366', 'fold': 90, 'text1': 'Tarkkailen tilanneta vielä muutaman minuutin. Sitten menen takaisin sisälle.', 'text2': 'Katson tilannetta vielä muutaman minuutin ajan ja menen uudelleen sisään.', 'label': '4', 'binary_label': 'positive', 'is_rewrite': True}
{'id': 'turku_paraphrase_corpus-test-5', 'gem_id': 'gem-turku_paraphrase_corpus-test-5', 'goeswith': 'episode-00231', 'fold': 90, 'text1': 'Et erotettuna jatka tämän jutun tutkimuksia.', 'text2': 'Sinut on erotettu tästä jutusta ja sen tutkimuksista.', 'label': '4', 'binary_label': 'positive',

# English vs Finnish

* Labels differ
* One has the texts in `sentence1` and `sentence2` and the other in `text1` and `text2` fields
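Since the two datasets differ only in the prefix of the field names, one way to treat them uniformly is to build the field names from a shared prefix. This is a small sketch; the helper name `get_text_pair` and the toy examples are assumptions made for illustration.

```python
def get_text_pair(example, text_field):
    """Return the (left, right) texts of one example.

    text_field is the common prefix of the two field names:
    "sentence" for the English data, "text" for the Finnish data.
    """
    return example[text_field + "1"], example[text_field + "2"]

# Hypothetical toy examples mimicking the two field layouts
example_en = {"sentence1": "A cat sat.", "sentence2": "A cat was sitting.", "label": 1}
example_fi = {"text1": "Kissa istui.", "text2": "Kissa oli istumassa.", "label": "4"}

print(get_text_pair(example_en, "sentence"))
print(get_text_pair(example_fi, "text"))
```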

# Sentence transformers

* There are 100+ models in the package
* `paraphrase-xlm-r-multilingual-v1` is a good choice according to the paper
* (Picking a model definitely does take some figuring out, though)
* This is a cross-lingual model, so we can use it for both datasets

In [25]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-xlm-r-multilingual-v1')

In [26]:
# go browse sbert.net and look at the example code there
# you will learn it has an `encode()` method

help(model.encode)

Help on method encode in module sentence_transformers.SentenceTransformer:

encode(sentences: Union[str, List[str]], batch_size: int = 32, show_progress_bar: bool = None, output_value: str = 'sentence_embedding', convert_to_numpy: bool = True, convert_to_tensor: bool = False, device: str = None, normalize_embeddings: bool = False) -> Union[List[torch.Tensor], numpy.ndarray, torch.Tensor] method of sentence_transformers.SentenceTransformer.SentenceTransformer instance
    Computes sentence embeddings
    
    :param sentences: the sentences to embed
    :param batch_size: the batch size used for the computation
    :param show_progress_bar: Output a progress bar when encode sentences
    :param output_value:  Default sentence_embedding, to get sentence embeddings. Can be set to token_embeddings to get wordpiece token embeddings. Set to None, to get all output values
    :param convert_to_numpy: If true, the output is a list of numpy vectors. Else, it is a list of pytorch tensors.
    :p

In [27]:
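# At first, the run was very slow.
# The way to debug: notice in the Colab GPU resources that GPU memory usage is 0,
# i.e. the model needs to be pushed to the GPU.

import torch

# Fall back to the CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # a pretty generic way of placing something onto a device in torch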

In [28]:
#### UNCOMMENT FOR FINNISH

#dataset = dataset_fi
#text_field = "text"  # the Finnish data uses the fields text1 and text2
#max_count = 10000  # this is a huge dataset, let's take 10K examples

#### UNCOMMENT FOR ENGLISH

dataset = dataset_en
text_field = "sentence"  # the English data uses the fields sentence1 and sentence2
max_count = len(dataset["train"])  # this is a shorter dataset, we can take everything


texts1=[item[text_field+"1"] for item in dataset["train"].select(range(max_count))]
texts2=[item[text_field+"2"] for item in dataset["train"].select(range(max_count))]

Now we need to encode the datasets. Since we will eventually need the data in matrix form, the easiest way is to use the model's own `.encode()` method and give it a list of sentences. That way it batches nicely and runs fast.

The parameters passed to the function below can be gleaned from the help output above.

In [29]:
encoded_texts1 = model.encode(texts1, convert_to_tensor=True, device=model.device, show_progress_bar=True, normalize_embeddings=True)
encoded_texts2 = model.encode(texts2, convert_to_tensor=True, device=model.device, show_progress_bar=True, normalize_embeddings=True)

Batches:   0%|          | 0/78 [00:00<?, ?it/s]

Batches:   0%|          | 0/78 [00:00<?, ?it/s]

Now we have the "left" sentences from the dataset encoded in `encoded_texts1` and the "right" sentences in `encoded_texts2`. Let us make sure:

In [30]:
print(encoded_texts1.shape)
print(encoded_texts2.shape)

torch.Size([2474, 768])
torch.Size([2474, 768])
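
The first dimension of these shapes also accounts for the 78 batches reported by the progress bars above: the help output lists a default `batch_size` of 32, and the last batch is only partially filled. A quick check of the arithmetic (the helper name `num_batches` is made up for this sketch):

```python
import math

def num_batches(n_sentences, batch_size=32):
    """Number of batches needed to encode n_sentences, rounding up."""
    return math.ceil(n_sentences / batch_size)

print(num_batches(2474))  # prints 78
```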


# Pairwise comparison

* We need to compare all sentences from `encoded_texts1` with all sentences from `encoded_texts2`
* We can use simple matrix multiplication to get all-against-all dot products, because that is exactly what matrix multiplication does
* Note that the embeddings are normalized (we asked for it in the `.encode()` arguments), so dot product is the same as cosine similarity
* This is efficient in the sense that it runs on GPU
* For large data which would not fit in the GPU memory, we'd need to do something else
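
To see why normalization makes the dot product equal to cosine similarity, here is a minimal pure-Python sketch on toy 2-dimensional vectors. The vectors and helper names are assumptions for illustration; the notebook itself does the same thing on the GPU with `torch.mm`.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy "left" and "right" embeddings
left = [[3.0, 4.0], [1.0, 0.0]]
right = [[4.0, 3.0], [0.0, 2.0]]

left_n = [normalize(v) for v in left]
right_n = [normalize(v) for v in right]

# All-against-all dot products: row i holds the similarities of left[i] to every right[j]
sims_toy = [[dot(u, v) for v in right_n] for u in left_n]

# Each entry agrees with the cosine similarity of the original vectors
for i, u in enumerate(left):
    for j, v in enumerate(right):
        assert math.isclose(sims_toy[i][j], cosine(u, v))
```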

![matrix multiplication](https://www.mcs.anl.gov/~itf/dbpp/text/img792.gif)

In [31]:
import torch

sims=torch.mm(encoded_texts1, encoded_texts2.T)
sims.shape

torch.Size([2474, 2474])

## Most similar sentences

* We have done this many times previously
* To get the most similar pairs out of a matrix of similarities, we either do `argsort()` in descending order and pick the first column, or use `argmax()` directly

In [32]:
sims_sort = torch.argsort(sims, dim=-1, descending=True)

* stop to think what you would expect the result to look like
* we compare the similarity of "left" sentences to "right" sentences

In [33]:
sims_sort[:100,0]

tensor([   0,    1,  536,    3,    4,    5,    6,    7,    8,    9,   10,   11,
          12,   13,   14,   15,   16,   17,   18,   19,   20,   21,   22,   23,
          24,   25,   26,   27,   28,   29,   30,   31, 1903,   33,   34,   35,
          36,   37,   38,   39,  795,   41,   42,   43,   44,   45,   46,   47,
          48,   49,   50,   51,   52,   53,   54,   55,   56,   57,   58,   59,
          60,   61,   62,   63,   64,   65,   66,   67,   68,   69,   70,   71,
          72,   73,   74,   75,   76,   77,   78, 2165,   80,   81,   82,   83,
          84,   85,   86,   87,   88,   89,   90,   91,   92,   93,   94,   95,
          96,   97,   98,   99], device='cuda:0')
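
Since row `i` of the "left" sentences is paired with row `i` of the "right" sentences, a hit means that the top-ranked index equals the row index. One way to summarize this is top-1 accuracy. The sketch below is pure Python, and the helper name `top1_accuracy` is an assumption; it is applied only to the first 12 values printed above.

```python
def top1_accuracy(top1_indices):
    """Fraction of rows whose most similar column is the row's own index."""
    hits = sum(1 for i, j in enumerate(top1_indices) if i == j)
    return hits / len(top1_indices)

# The first 12 top-ranked indices from the output above
top1 = [0, 1, 536, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(top1_accuracy(top1))
```

On the full tensor, the same quantity could be computed with something like `(sims_sort[:, 0] == torch.arange(len(sims_sort), device=sims_sort.device)).float().mean()`.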

In [34]:
# Let's inspect a few pairs
for i in range(100):
    print(dataset["train"][i][text_field+"1"])
    j = int(sims_sort[i,0])  # index of the most similar "right" sentence
    print(dataset["train"][j][text_field+"2"])
    print("\n---------------------\n")


Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .
Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .

---------------------

They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .
On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .

---------------------

The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange .
MONY shares rose 8.76 per cent to $ 31.90 in after-hours trading in New York .

---------------------

Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .
With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier .

---------------------

The DVD-CCA then appealed to the stat