# Transfer learning for sentence embeddings

SBERT can be used for information retrieval, clustering, automatic essay scoring, and semantic textual similarity, combining fast inference with high accuracy. However, a limitation of SBERT is that it currently supports only English, leaving other languages uncovered. To solve that, we can use a model architecture similar to the Siamese and triplet network structures to extend SBERT to new languages [1](https://arxiv.org/abs/2004.09813).

# Multilingual-Models

The idea is based on a fixed (monolingual) teacher model that produces sentence embeddings with our desired properties in one language. The student model is supposed to mimic the teacher model, i.e., the same English sentence should be mapped to the same vector by the teacher and by the student model. For the student model to work in further languages, we train it on parallel (translated) sentences: the translation of each sentence should also be mapped to the same vector as the original sentence.
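
The training objective described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function names and the toy 4-dimensional vectors are hypothetical, plain Python lists stand in for real model outputs, and the exact weighting of the two terms inside sentence-transformers may differ.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_src, student_src, student_trg):
    """Pull both student vectors toward the teacher's vector for the source sentence."""
    return mse(student_src, teacher_src) + mse(student_trg, teacher_src)


# Hypothetical embeddings (assumed values, for illustration only)
teacher_src = [0.2, -0.1, 0.5, 0.3]  # teacher("Hello world")
student_src = [0.1, -0.1, 0.4, 0.3]  # student("Hello world")
student_trg = [0.3, 0.0, 0.5, 0.1]   # student("Hallo Welt")

loss = distillation_loss(teacher_src, student_src, student_trg)
print(f"distillation loss: {loss:.4f}")
```

The loss reaches zero only when the student reproduces the teacher's vector for both the original sentence and its translation, which is what aligns the languages in one vector space.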

# Installing dependencies

In [1]:
!pip install -U sentence-transformers
!pip install -U opustools


# Import libraries

In [2]:
from sentence_transformers import SentenceTransformer, LoggingHandler, models, evaluation, losses
from torch.utils.data import DataLoader
from sentence_transformers.datasets import ParallelSentencesDataset

import os
import sentence_transformers.util
import csv
from opustools import OpusRead
import gzip
from tqdm.autonotebook import tqdm
import numpy as np
import io
import zipfile

# Defining Parameters

In [3]:
# Monolingual teacher model that we want to extend to multiple languages
teacher_model_name = 'paraphrase-distilroberta-base-v2'
# Multilingual base model that we train to imitate the teacher model
student_model_name = 'xlm-roberta-base'

max_seq_length = 128                # Student model max. lengths for inputs (number of word pieces)
train_batch_size = 64               # Batch size for training
inference_batch_size = 64           # Batch size at inference
train_max_sentence_length = 200     # Maximum length (characters) for parallel training sentences

# Maximum number of  parallel sentences for training.
# NOTE: too high and it will increase the training time
max_sentences_per_language = 100000 

num_epochs = 10            
num_warmup_steps = 1000 

num_evaluation_steps = 1000 
#Number of parallel sentences to be used for development
dev_sentences = 5000 

corpus = 'TildeMODEL'  # Corpus you want to use
# Visit https://opus.nlpl.eu/ for the available datasets

# Define the language codes you would like to extend the model to
src_lang = 'en'  # Source language that our teacher model understands

# We want to extend the model to this new language.
trg_lang = 'de'  # Target language that our student model should learn

download_folder = "downloaded-corpus/"
output_folder = "parallel-sentences/"
opus_download_folder = './opus'

output_path = "model-out"  # Output path to save the model output

# Here we define the train and dev corpora
sts_corpus = f"{download_folder}/STS2017-extended.zip"
train_corpus = f"{download_folder}{corpus}.tsv.gz"

# Training a new model

To use an already trained model instead, skip to the section on loading the trained model.

## Download the corpus

In [None]:
os.makedirs(download_folder, exist_ok=True)

output_filename = os.path.join(download_folder, f"{corpus}-{src_lang}-{trg_lang}.tsv.gz")
if not os.path.exists(output_filename):
    print("Create:", output_filename)
    try:
        read = OpusRead(directory=corpus, 
                        source=src_lang, 
                        target=trg_lang, 
                        write=[output_filename], 
                        download_dir=opus_download_folder, 
                        preprocess='raw', 
                        write_mode='moses', 
                        suppress_prompts=True)
        read.printPairs()
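    except Exception as e:
        # Catch only regular errors so that a manual interrupt is not silently swallowed
        print("An error occurred during the creation of", output_filename, "-", e)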

            
if not os.path.exists(sts_corpus):
    print(sts_corpus, "does not exist. Trying to download from server")
    filename = os.path.basename(sts_corpus)
    url = "https://sbert.net/datasets/" + filename
    sentence_transformers.util.http_get(url, sts_corpus)

Create: downloaded-corpus/TildeMODEL-en-de.tsv.gz
No alignment file "/projappl/nlpl/data/OPUS/TildeMODEL/latest/xml/de-en.xml.gz" or "./opus/TildeMODEL_latest_xml_de-en.xml.gz" found
The following files are available for downloading:

  34 MB https://object.pouta.csc.fi/OPUS-TildeMODEL/v2018/xml/de-en.xml.gz
 254 MB https://object.pouta.csc.fi/OPUS-TildeMODEL/v2018/raw/de.zip
 262 MB https://object.pouta.csc.fi/OPUS-TildeMODEL/v2018/raw/en.zip

 549 MB Total size
./opus/TildeMODEL_latest_xml_de-en.xml.gz ... 100% of 34 MB
./opus/TildeMODEL_latest_raw_de.zip ... 41% of 254 MB
An error occured during the creation of downloaded-corpus/TildeMODEL-en-de.tsv.gz


## Create dataset from source

As training data we require parallel sentences, i.e., sentences translated into various languages. As data format, we use a tab-separated .tsv file. The first column holds the source sentence, for example, an English sentence. The following columns hold the translations of this source sentence. If you have multiple translations per source sentence, you can put them on the same line or on different lines.

```
Source_sentence Target_lang1    Target_lang2    Target_lang3
Source_sentence Target_lang1    Target_lang2
```
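
As a small illustration of this format, the sketch below writes two example rows to a gzip-compressed TSV file and reads them back. The file name and the sentences are made up for the example; only the layout (source first, translations after it, tab as separator) follows the description above.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical example rows: source sentence first, then its translations
rows = [
    ["Hello world", "Hallo Welt", "Ciao mondo"],
    ["Good morning", "Guten Morgen"],
]

# Temporary location so the sketch does not touch the notebook's folders
path = os.path.join(tempfile.mkdtemp(), "example-parallel.tsv.gz")

with gzip.open(path, "wt", encoding="utf8") as f_out:
    for row in rows:
        f_out.write("\t".join(row) + "\n")

with gzip.open(path, "rt", encoding="utf8") as f_in:
    loaded = list(csv.reader(f_in, delimiter="\t", quoting=csv.QUOTE_NONE))

# Expand every row into (source, translation) pairs
pairs = [(row[0], trg) for row in loaded for trg in row[1:]]
print(pairs)
```

Rows may have different numbers of columns, so a reader should not assume a fixed width.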


The cells below use the corpus selected in the parameters (`TildeMODEL`, `en` to `de`). Another option is the TED2020 corpus, a corpus with transcripts and translations from TED and TEDx talks, which can be used to extend a monolingual model to several languages (en, de, es, it, fr). TED2020 contains parallel data for more than 100 languages, so you can simply change the parameters and train a multilingual model in other languages.

NOTE: The more languages you insert, the larger will be the training set, hence the training will take longer. 

In [None]:
train_files = os.path.join(output_folder, f"{corpus}-{src_lang}-{trg_lang}-train.tsv.gz")
dev_file = os.path.join(output_folder, f"{corpus}-{src_lang}-{trg_lang}-dev.tsv.gz")

In [None]:
# Create parallel files for the selected language combinations
os.makedirs(output_folder, exist_ok=True)

train_file = os.path.join(output_folder, f"{corpus}-{src_lang}-{trg_lang}-train.tsv.gz")
dev_file = os.path.join(output_folder, f"{corpus}-{src_lang}-{trg_lang}-dev.tsv.gz")
outfile = {'src_lang': src_lang, 'trg_lang': trg_lang,
            'fTrain': gzip.open(train_file, 'wt', encoding='utf8'),
            'fDev': gzip.open(dev_file, 'wt', encoding='utf8'),
           'devCount': 0}
print(f"Parallel sentence file {corpus}-{src_lang}-{trg_lang} does not exist. Create file now")


with gzip.open(output_filename, 'rt', encoding='utf8') as fIn:
    tsv_reader = csv.reader(fIn, delimiter="\t")
    
    for line in tqdm(tsv_reader):
        if len(line) > 1:
            src_text = line[0].strip()
            trg_text = line[1].strip()
            
            if outfile['devCount'] < dev_sentences:
                outfile['devCount'] += 1
                fOut = outfile['fDev']
            else:
                fOut = outfile['fTrain']

            fOut.write(f"{src_text}\t{trg_text}\n")

outfile['fTrain'].close()
outfile['fDev'].close()

Parallel sentence file TildeMODEL-en-de does not exist. Create file now


0it [00:00, ?it/s]

In [None]:
with gzip.open(train_files, 'rt', encoding='utf8') as fIn:
    tsv_reader = csv.reader(fIn, delimiter="\t")
    number_of_lines = 100

    for i in range(number_of_lines):
        if i > 90:
            row = next(tsv_reader)
            print(i, row)
        else:
            row = next(tsv_reader)

91 ['Nearby lies the Lednice-Valtice Park, the dominant feature of which is the Baroque chateau containing a Chapel and the Church of the Ascension.', 'In der Nähe befindet sich die Kulturlandschaft Lednice-Valtice mit einem Barockschloss samt Kapelle und Mariä-Himmelfahrts-Kirche.']
92 ['The hottest of the springs – the Vřídlo – emerges from the ground nearby.', 'In der Nähe befindet sich eine der heißesten Karlsbader Quellen: Der Sprudel (Vřídlo), die unglaubliche 72 °C heiß ist!']
93 ['Near the viewing tower there’s a mountain refuge and an observatory.', 'In der Nähe des Aussichtsturms befinden sich eine Berghütte und eine Sternwarte.']
94 ['The region also boasts other spas at Kynžvart and the first radon spa in the world at Jáchymov.', 'In der Nähe des Bäderdreiecks befinden sich das schöne Kurstädtchen Kynžvart (Königswart) sowie die berühmte Radon-Kurstadt Jáchymov.']
95 ['Near the Pohansko hunting lodge, which is part of the unique Lednice-Valtice complex, an open-air archaeol

## Start the extension of the teacher model to multiple languages

In [None]:
teacher_model = SentenceTransformer(teacher_model_name)

word_embedding_model = models.Transformer(student_model_name, max_seq_length=max_seq_length)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
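
The mean pooling step used when building the student model can be sketched as follows. This is a simplified, dependency-free illustration with hypothetical token vectors; the real pooling layer works on batched tensors, but the idea of averaging only the non-padding tokens is the same.

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token vectors (assumed values); the last one is padding
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pooling(token_embeddings, attention_mask)
print(sentence_vector)
```

Because the padding token is masked out, the sentence vector has the same size regardless of how long the input sentence is.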



## Loading Training Datasets

In [None]:
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model, batch_size=inference_batch_size, use_embedding_cache=True)

# Load each file created
train_data.load_data(train_file, max_sentences=max_sentences_per_language, max_sentence_length=train_max_sentence_length)

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=train_batch_size)
train_loss = losses.MSELoss(model=student_model)

## Evaluate cross-lingual performance on different tasks

- MSE: You can measure the mean squared error (MSE) between the student embeddings and teacher embeddings. This evaluator computes the teacher embeddings for the src_sentences, for example, for English. During training, the student model is used to compute embeddings for the trg_sentences, for example, for German. The distance between teacher and student embeddings is measured. Lower scores indicate better performance.

- Translation Accuracy: You can also measure the translation accuracy. Given a list of source sentences, for example, 1000 English sentences, and a list of matching target (translated) sentences, for example, 1000 German sentences, we check for each sentence pair whether their embeddings are the closest using cosine similarity. That is, for each src_sentences[i] we check whether trg_sentences[i] has the highest similarity out of all target sentences. If this is the case, we have a hit, otherwise an error. This evaluator reports accuracy (higher = better).
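
The translation accuracy described above can be sketched as follows. This is a simplified illustration with hypothetical 3-dimensional embeddings that checks only the source-to-target direction; the function names are made up and the library's evaluator may compute more than this.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def translation_accuracy(src_embeddings, trg_embeddings):
    """Share of source sentences whose own translation is the most similar target."""
    hits = 0
    for i, src in enumerate(src_embeddings):
        sims = [cosine(src, trg) for trg in trg_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(src_embeddings)


# Hypothetical embeddings: the second source sentence is closer to the third target
src_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
trg_embeddings = [[0.9, 0.1, 0.0], [0.6, 0.8, 0.0], [0.0, 0.9, 0.1]]

accuracy = translation_accuracy(src_embeddings, trg_embeddings)
print(f"translation accuracy: {accuracy:.2f}")
```

In this toy example two of the three source sentences retrieve their own translation, so the accuracy is about 0.67.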

In [None]:
#evaluators has a list of different evaluator classes we call periodically
evaluators = []         
src_sentences = []
trg_sentences = []

with gzip.open(dev_file, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        splits = line.strip().split('\t')
        # Skip malformed lines and pairs with an empty side
        if len(splits) >= 2 and splits[0] != "" and splits[1] != "":
            src_sentences.append(splits[0])
            trg_sentences.append(splits[1])

# Mean Squared Error (MSE) between teacher and student embeddings
dev_mse = evaluation.MSEEvaluator(src_sentences, trg_sentences, name=os.path.basename(dev_file), teacher_model=teacher_model, batch_size=inference_batch_size)
evaluators.append(dev_mse)

# TranslationEvaluator computes the embeddings for all parallel sentences. It then checks whether the embedding of source[i] is the closest to target[i] out of all available target sentences
dev_trans_acc = evaluation.TranslationEvaluator(src_sentences, trg_sentences, name=os.path.basename(dev_file), batch_size=inference_batch_size)
evaluators.append(dev_trans_acc)

## Read cross-lingual Semantic Textual Similarity (STS) data

You can also measure the semantic textual similarity (STS) between sentence pairs in different languages. Here, sentences1 and sentences2 are lists of sentences, and score is a numeric value indicating the semantic similarity between sentences1[i] and sentences2[i].
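
How such an STS evaluation scores a model can be sketched as follows: compute the cosine similarity for every sentence pair and correlate it with the gold scores. This is a minimal sketch with hypothetical embeddings and gold scores; the rank helper assumes there are no tied values, which a production implementation would have to handle.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Rank of each value (0 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical embeddings for four sentence pairs and their gold similarity scores
embeddings1 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
embeddings2 = [[1.0, 0.1], [0.5, 0.5], [0.0, 1.0], [0.9, 0.5]]
gold_scores = [4.8, 2.5, 0.2, 2.0]

predicted = [cosine(a, b) for a, b in zip(embeddings1, embeddings2)]
correlation = spearman(predicted, gold_scores)
print(f"cosine_spearman: {correlation:.2f}")
```

A correlation of 1.0 would mean the model orders the pairs exactly as the human annotators did; here two pairs are swapped, which gives 0.80.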

In [None]:
all_languages = (src_lang,trg_lang)
sts_data = {}

#Open the ZIP File of STS2017-extended.zip and check for which language combinations we have STS data
with zipfile.ZipFile(sts_corpus) as zipsts:
    filelist = zipsts.namelist()
    sts_files = []

    for i in range(len(all_languages)):
        for j in range(i, len(all_languages)):
            lang1 = all_languages[i]
            lang2 = all_languages[j]
            filepath = f'STS2017-extended/STS.{lang1}-{lang2}.txt'
            if filepath not in filelist:
                lang1, lang2 = lang2, lang1
                filepath = f'STS2017-extended/STS.{lang1}-{lang2}.txt'

            if filepath in filelist:
                filename = os.path.basename(filepath)
                sts_data[filename] = {'sentences1': [], 'sentences2': [], 'scores': []}

                fIn = zipsts.open(filepath)
                for line in io.TextIOWrapper(fIn, 'utf8'):
                    sent1, sent2, score = line.strip().split("\t")
                    score = float(score)
                    sts_data[filename]['sentences1'].append(sent1)
                    sts_data[filename]['sentences2'].append(sent2)
                    sts_data[filename]['scores'].append(score)


for filename, data in sts_data.items():
    test_evaluator = evaluation.EmbeddingSimilarityEvaluator(data['sentences1'], data['sentences2'], data['scores'], batch_size=inference_batch_size, name=filename, show_progress_bar=False)
    evaluators.append(test_evaluator)

## Train the model

In [None]:
student_model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluation.SequentialEvaluator(evaluators, main_score_function=lambda scores: np.mean(scores)),
          epochs=num_epochs,
          warmup_steps=num_warmup_steps,
          evaluation_steps=num_evaluation_steps,
          output_path=output_path,
          save_best_model=True,
          optimizer_params = {'lr': 2e-5, 'eps': 1e-6})

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2932 [00:00<?, ?it/s]

KeyboardInterrupt: ignored
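
The effect of `warmup_steps` on the learning rate can be sketched as follows. This assumes a schedule with linear warmup followed by linear decay, which is worth checking against the installed library version; the function name is hypothetical. The total of 29320 steps comes from the progress bars above (2932 iterations per epoch, 10 epochs), and the base learning rate is the `lr` passed in `optimizer_params`.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=1000, total_steps=29320):
    """Learning rate at a given step: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 500, 1000, 15160, 29320):
    print(step, warmup_linear_lr(step))
```

Under this assumption the learning rate peaks at `2e-5` after 1000 steps and then shrinks steadily until the last step.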

# Load the trained model (Only if you didn't train the model)

In [4]:
student_model = SentenceTransformer("airnicco8/xlm-roberta-en-it-de")


# Model evaluation

## Sentence similarity evaluation using the STS benchmark

In [16]:
if not os.path.exists('./eval'): 
    import os
    from pydrive.auth import GoogleAuth
    from pydrive.drive import GoogleDrive
    from google.colab import auth
    from oauth2client.client import GoogleCredentials

    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)

    # Input json file
    json_file_input = 'sts_results.zip' # File name
    file_id = '16n8cNgbYGvyW0CqUJPOtD87EKISDm462' 

    download = drive.CreateFile({'id': file_id})
    download.GetContentFile(json_file_input)

    !unzip './sts_results.zip' -d './eval/'

In [35]:
import re
import pandas as pd

# Reading the files
for filename in os.listdir('./eval'):
    first_lang = re.findall('[a-z]{2}-', filename)[0][:-1]
    second_lang = re.findall('-[a-z]{2}', filename)[0][1:]


    res = pd.read_csv(f'./eval/{filename}')
    print('\n-------------------\n')
    print(f'STS results for the last epoch for the languages {first_lang} - {second_lang}')
    
    print(res.iloc[-1])


-------------------

STS results for the last epoch for the languages en - de
epoch                 9.000000
steps                -1.000000
cosine_pearson        0.787160
cosine_spearman       0.805358
euclidean_pearson     0.767839
euclidean_spearman    0.760701
manhattan_pearson     0.767756
manhattan_spearman    0.760717
dot_pearson           0.779268
dot_spearman          0.787252
Name: 69, dtype: float64

-------------------

STS results for the last epoch for the languages en - en
epoch                 9.000000
steps                -1.000000
cosine_pearson        0.832706
cosine_spearman       0.850208
euclidean_pearson     0.820564
euclidean_spearman    0.810169
manhattan_pearson     0.818571
manhattan_spearman    0.809898
dot_pearson           0.827145
dot_spearman          0.832964
Name: 69, dtype: float64

-------------------

STS results for the last epoch for the languages it - en
epoch                 9.000000
steps                -1.000000
cosine_pearson        0.800526


## Demo for the model

In [5]:
import scipy.spatial

if not os.path.exists(sts_corpus):
    print(sts_corpus, "does not exist. Trying to download from server")
    filename = os.path.basename(sts_corpus)
    url = "https://sbert.net/datasets/" + filename
    sentence_transformers.util.http_get(url, sts_corpus)

# The pretrained model loaded above covers English, Italian and German.
# (list(trg_lang) would split the string 'de' into ['d', 'e'].)
all_languages = [src_lang, 'it', 'de']
sts_data = {lang: [] for lang in all_languages}

# We are using the STS corpus. Since the objective of this corpus is to label the similarity between
# sentences, the same phrase will not be found identically in all the languages. The demo below will
# therefore give different results in the three languages.
with zipfile.ZipFile(sts_corpus) as zipsts:
    filelist = zipsts.namelist()
    sts_files = []

    for i in range(len(all_languages)):
        for j in range(i, len(all_languages)):
            lang1 = all_languages[i]
            lang2 = all_languages[j]
            filepath = f'STS2017-extended/STS.{lang1}-{lang2}.txt'
            if filepath not in filelist:
                lang1, lang2 = lang2, lang1
                filepath = f'STS2017-extended/STS.{lang1}-{lang2}.txt'

            if filepath in filelist:
                filename = os.path.basename(filepath)

                l1vf = False 
                l2vf = False
                if len(sts_data[f'{lang1}']) == 0:
                    l1vf = True
                if len(sts_data[f'{lang2}']) == 0:
                    l2vf = True
                
                fIn = zipsts.open(filepath)
                for line in io.TextIOWrapper(fIn, 'utf8'):
                    sent1, sent2, _ = line.strip().split("\t")
                    if l1vf:
                        sts_data[f'{lang1}'].append(sent1)
                    if l2vf:
                        sts_data[f'{lang2}'].append(sent2)

downloaded-corpus//STS2017-extended.zip does not exists. Try to download from server



In [None]:
# Corpus with example sentences
corpus_en = sts_data['en']

corpus_it = sts_data['it']

corpus_de = sts_data['de']

# Query sentences:
queries_en = ['A man is cooking pasta.', 'Someone in a gorilla costume is playing the guitar.']
queries_it = ['Un uomo sta cucinando la pasta.', 'Qualcuno in un costume da gorilla sta suonando la chitarra.']
queries_de = ['Ein Mann kocht Nudeln.', 'Jemand in einem Gorillakostüm spielt Gitarre.']

### Evaluation of the student model in English

In [None]:
corpus_embeddings = student_model.encode(corpus_en)

query_embeddings = student_model.encode(queries_en)
# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 3

for query, query_embedding in zip(queries_en, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:\n")

    for idx, distance in results[0:closest_n]:
        print(corpus_en[idx].strip(), "(Score: %.4f)" % (1-distance))



Query: A man is cooking pasta.

Top 3 most similar sentences in corpus:

There is a cook preparing food. (Score: 0.7866)
A cook is making food. (Score: 0.7862)
These cooks in the white are busy in the kitchen making dinner for their customers. (Score: 0.5110)


Query: Someone in a gorilla costume is playing the guitar.

Top 3 most similar sentences in corpus:

A man in a green shirt and black hat playing a guitar on stage. (Score: 0.7905)
A man in a white shirt and hat playing a guitar. (Score: 0.7717)
A musician is smearing jam on his white guitar at a concert. (Score: 0.7393)


### Evaluation of the student model in Italian

In [None]:
corpus_embeddings = student_model.encode(corpus_it)

query_embeddings = student_model.encode(queries_it)

# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 3
for query, query_embedding in zip(queries_it, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:\n")

    for idx, distance in results[0:closest_n]:
        print(corpus_it[idx].strip(), "(Score: %.4f)" % (1-distance))



Query: Un uomo sta cucinando la pasta.

Top 3 most similar sentences in corpus:

C'è un cuoco che prepara il cibo. (Score: 0.8699)
Questi cuochi in bianco sono impegnati in cucina a preparare la cena per i loro clienti. (Score: 0.6491)
Un uomo sta eseguendo il lavoro. (Score: 0.4796)


Query: Qualcuno in un costume da gorilla sta suonando la chitarra.

Top 3 most similar sentences in corpus:

Un musicista spalma marmellata sulla sua chitarra bianca durante un concerto. (Score: 0.6971)
Un uomo in una camicia bianca e cappello a suonare una chitarra. (Score: 0.6911)
I bambini tengono strumenti musicali. (Score: 0.6510)


### Evaluation of the student model in German

In [None]:
corpus_embeddings = student_model.encode(corpus_de)

query_embeddings = student_model.encode(queries_de)

# Find the closest 3 sentences of the corpus for each query sentence based on cosine similarity
closest_n = 3
for query, query_embedding in zip(queries_de, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 3 most similar sentences in corpus:\n")

    for idx, distance in results[0:closest_n]:
        print(corpus_de[idx].strip(), "(Score: %.4f)" % (1-distance))



Query: Ein Mann kocht Nudeln.

Top 3 most similar sentences in corpus:

Ein Koch bereitet Essen zu. (Score: 0.8604)
Die Frauen bereiten das Abendessen in der Küche vor. (Score: 0.4938)
Zwei Frauen kochen. (Score: 0.4823)


Query: Jemand in einem Gorillakostüm spielt Gitarre.

Top 3 most similar sentences in corpus:

Ein Mann mit grünem Shirt und schwarzem Hut spielt Gitarre auf der Bühne. (Score: 0.8358)
Ein Mann spielt Gitarre im Regen. (Score: 0.7231)
Der Mann spielt das Schlagzeug für seine Mutter. (Score: 0.6504)
