# Checking for translated answers in the translated context

This notebook checks whether the translated answer appears in the translated context.
It does so by extracting N-grams from both texts, converting them to embeddings with a fastText model, and
calculating the cosine similarity between them.

The algorithm is described in the article
[Sentence Similarity Techniques for Short vs Variable Length Text using Word Embeddings](https://www.researchgate.net/publication/338283181_Sentence_Similarity_Techniques_for_Short_vs_Variable_Length_Text_using_Word_Embeddings)
by *Dudekula, Shashavali & Vishwjeet, V. & Kumar, Rahul & Mathur, Gaurav & Nihal, Nikhil & Mukherjee, Siddhartha & Patil, Suresh (2019),
Computación y Sistemas, 23, doi:10.13053/cys-23-3-3273*.


----
First, we load the prepared fastText model that we will use to create the embeddings.

In [1]:
import fasttext
model = fasttext.load_model('../../models/fasttext_train_model.bin')



----
Function `get_grams(text, n)` takes a text as input and yields, for each size from 1 to `n` (3 by default),
the N-grams of that text, which are sequences of N consecutive words.

In [2]:
from nltk.util import ngrams

def get_grams(text, n=3):
    # Yield the unigrams first, then the bigrams, and so on up to the n-grams.
    text_split = text.split()
    for i_ in range(1, n + 1):
        yield ngrams(text_split, i_)
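
----
As a small illustration of what the N-grams look like, the sketch below builds them with `word_ngrams`, a hypothetical pure-Python helper that is not part of this notebook. It is assumed to produce the same tuples as `nltk.util.ngrams` does for a list of words. The example also shows why a one-word answer has no bigrams or trigrams, which is the case in which the notebook reports `(-1, -1)`.

```python
def word_ngrams(words, n):
    """Return every run of n consecutive words as a tuple (illustrative helper)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "10. in 11. stoletje".split()
unigrams = word_ngrams(words, 1)
bigrams = word_ngrams(words, 2)
trigrams = word_ngrams(words, 3)

print(bigrams)   # [('10.', 'in'), ('in', '11.'), ('11.', 'stoletje')]
print(trigrams)  # [('10.', 'in', '11.'), ('in', '11.', 'stoletje')]

# A single word is too short to form a bigram.
print(word_ngrams("Francija".split(), 2))  # []
```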

----
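Function `grams_to_embeddings(grams, model_)` joins the words of each N-gram into one space-separated string and
converts it to an embedding with the model's `get_sentence_vector` method. These embeddings are needed to
calculate the similarity between N-grams.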

In [3]:
def grams_to_embeddings(grams, model_):
    result = []
    for gram in grams:
        # Join the words of the N-gram into one space-separated string.
        result.append(model_.get_sentence_vector(' '.join(gram)))
    return result
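
----
To follow the conversion without loading a real model, the sketch below uses `StubModel`, a hypothetical stand-in that is not part of fastText. Only the method name `get_sentence_vector` is borrowed from the fastText Python bindings; the two-number vectors it returns are toy values with no linguistic meaning. The function name `grams_to_embeddings_sketch` is likewise illustrative.

```python
class StubModel:
    """Hypothetical stand-in for a fastText model, for illustration only."""

    def get_sentence_vector(self, text):
        # Toy "embedding": the number of words and the number of characters.
        return [float(len(text.split())), float(len(text))]

def grams_to_embeddings_sketch(grams, model_):
    # Each N-gram tuple becomes one string, and each string becomes one vector.
    return [model_.get_sentence_vector(' '.join(gram)) for gram in grams]

grams = [('10.',), ('10.', 'in'), ('10.', 'in', '11.')]
vectors = grams_to_embeddings_sketch(grams, StubModel())
print(vectors)  # [[1.0, 3.0], [2.0, 6.0], [3.0, 10.0]]
```

The result has one vector per N-gram, in the same order as the input, so a position in the list can be read as a position in the text.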
        

----
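Function `average_similarity(query_embeddings, context_embeddings)` calculates the cosine similarity between
every query N-gram and every context N-gram. For each query N-gram it keeps the highest similarity. It returns
the context index of the best match for the first query N-gram, which is the likely start of the answer in the
context, together with the average of these highest similarities. If there are no query N-grams, it returns `(-1, -1)`.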
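
In symbols, assuming $q_1, \dots, q_m$ denote the query N-gram embeddings and $c_1, \dots, c_k$ the context N-gram embeddings (notation introduced here for illustration, not taken from the cited article), the two returned values can be written as follows. The code counts positions from 0, so $q_1$ corresponds to `query_embeddings[0]`.

$$
\cos(q, c) = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert}, \qquad
\text{index} = \arg\max_{j} \cos(q_1, c_j), \qquad
\text{score} = \frac{1}{m} \sum_{i=1}^{m} \max_{j} \cos(q_i, c_j)
$$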

In [4]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

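def average_similarity(query_embeddings, context_embeddings):
    # Nothing to compare, e.g. the text has fewer words than the N-gram size.
    if len(query_embeddings) == 0 or len(context_embeddings) == 0:
        return -1, -1

    # Row i holds the similarities of query N-gram i to every context N-gram.
    similarities = cosine_similarity(query_embeddings, context_embeddings)

    indexes = []
    max_similarities = []
    for row in similarities:
        # Keep the best-matching context N-gram for this query N-gram.
        argmax = np.argmax(row)
        indexes.append(argmax)
        max_similarities.append(row[argmax])

    # Index of the best match for the first query N-gram, and the mean of the maxima.
    return indexes[0], sum(max_similarities) / len(max_similarities)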
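
----
The toy example below works through the same "best match per query N-gram, then average" idea on hand-made two-dimensional vectors. It is a dependency-free sketch: `cosine` and `best_matches` are illustrative names, and the vectors are invented for the example rather than produced by the fastText model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_matches(query_vectors, context_vectors):
    """For each query vector, return the index and score of its closest context vector."""
    indexes, scores = [], []
    for q in query_vectors:
        sims = [cosine(q, c) for c in context_vectors]
        best = max(range(len(sims)), key=sims.__getitem__)
        indexes.append(best)
        scores.append(sims[best])
    return indexes, scores

query_vectors = [[1.0, 0.0], [0.0, 1.0]]
context_vectors = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]

indexes, scores = best_matches(query_vectors, context_vectors)
start_index = indexes[0]              # best match of the first query vector
average = sum(scores) / len(scores)   # mean of the per-query maxima
print(indexes, start_index, average)
```

Here the first query vector matches context position 1 and the second matches position 0, so the best matches are not necessarily consecutive. Only the position for the first query N-gram is reported as the index.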
    

----
Function `find_similar_text(query, text, fasttext_model, n)` runs the similarity calculation for each N-gram
size from 1 to `n` and returns one `(index, average similarity)` pair per size.

In [16]:
def find_similar_text(query, text, fasttext_model, n=3):
    average_similarities = []
    # Compare unigrams with unigrams, bigrams with bigrams, and so on up to n.
    for query_grams, text_grams in zip(get_grams(query, n), get_grams(text, n)):
        query_embeddings = grams_to_embeddings(query_grams, fasttext_model)
        context_embeddings = grams_to_embeddings(text_grams, fasttext_model)
        average_similarities.append(average_similarity(query_embeddings, context_embeddings))

    return average_similarities

In [17]:
import json
with open('../../data/dev-v2.0_SL.json', 'r', encoding='UTF-8') as file:
    data = json.load(file)
    
context = data[0]['paragraphs'][0]['context']
# Print every word of the context together with its index (the unigrams).
for ix, unigram in enumerate(next(get_grams(context))):
    for word in unigram:
        print(f'{ix}: {word}', end=' ')

0: Normani 1: (Norman: 2: Nourmands; 3: Francoščina: 4: Normandi; 5: Latinščina: 6: Normanni) 7: so 8: bili 9: ljudje, 10: ki 11: so 12: v 13: 10. 14: in 15: 11. 16: stoletju 17: dali 18: ime 19: Normandiji, 20: regiji 21: v 22: Franciji. 23: Bili 24: so 25: potomci 26: nordijskih 27: plenilcev 28: in 29: piratov 30: iz 31: Danske, 32: Islandije 33: in 34: Norveške, 35: ki 36: so 37: pod 38: svojim 39: voditeljem 40: Rollom 41: prisegli 42: zvestobo 43: kralju 44: Karlu 45: III. 46: iz 47: Zahodne 48: Frankovske. 49: Skozi 50: generacije 51: asimilacije 52: in 53: mešanja 54: z 55: domačimi 56: frankovskimi 57: in 58: rimsko-gavskimi 59: populacijami 60: so 61: se 62: njihovi 63: potomci 64: postopoma 65: združili 66: s 67: karolinškimi 68: kulturami 69: Zahodne 70: Frankovske. 71: Posebna 72: kulturna 73: in 74: etnična 75: identiteta 76: Normanov 77: se 78: je 79: sprva 80: pojavila 81: v 82: prvi 83: polovici 84: 10. 85: stoletja 86: in 87: se 88: je 89: razvijala 90: v 91: naslednj

In [19]:
# Run the check for the first three question/answer pairs of the first paragraph.
paragraph = data[0]['paragraphs'][0]
for qas_number in range(3):
    qas = paragraph['qas'][qas_number]
    answer = qas['answers'][0]['text']
    question = qas['question']
    print(question)
    print(answer)
    for similarity in find_similar_text(answer, context, model):
        print(similarity)

V kateri državi se nahaja Normandija?
Francija
(22, 0.7197608947753906)
(-1, -1)
(-1, -1)
Kdaj so bili Normani v Normandiji?
10. in 11. stoletje
(13, 0.9646217525005341)
(13, 0.9829012950261434)
(13, 0.9859484136104584)
Iz katerih držav je Norveška izvirala?
Danska, Islandija in Norveška
(31, 0.7949058413505554)
(31, 0.8514732718467712)
(31, 0.857131153345108)
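
----
The reported index can be turned back into text by slicing the context words. The sketch below uses `span_at`, a hypothetical helper that is not part of the notebook, and assumes the matched span has as many words as the answer. The excerpt is the first sentence of the context printed above (words 0 to 22).

```python
def span_at(context, start, length):
    """Return `length` whitespace-separated words of `context`, starting at word `start`."""
    words = context.split()
    return ' '.join(words[start:start + length])

# First sentence of the context printed above (words 0 to 22).
context_excerpt = (
    "Normani (Norman: Nourmands; Francoščina: Normandi; Latinščina: Normanni) "
    "so bili ljudje, ki so v 10. in 11. stoletju dali ime Normandiji, "
    "regiji v Franciji."
)

first_span = span_at(context_excerpt, 22, len("Francija".split()))
second_span = span_at(context_excerpt, 13, len("10. in 11. stoletje".split()))
print(first_span)   # Franciji.
print(second_span)  # 10. in 11. stoletju
```

In both cases the words at the reported index are an inflected form of the answer, with trailing punctuation in the first case, rather than an exact string match.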
