# Checking for translated answers in the translated context

The purpose of this notebook is to check if the translation of the answer exists in the translated context.
This is done by extracting N-grams, transforming them to word embeddings using fasttext model and
calculating cosine similarity between them.

The algorithm is described in an article called [Sentence Similarity Techniques for Short vs Variable Length Text using Word Embeddings](https://www.researchgate.net/publication/338283181_Sentence_Similarity_Techniques_for_Short_vs_Variable_Length_Text_using_Word_Embeddings)
by *Dudekula, Shashavali & Vishwjeet, V. & Kumar, Rahul & Mathur, Gaurav & Nihal, Nikhil & Mukherjee, Siddhartha & Patil, Suresh (2019)
Computación y Sistemas. 23. 10.13053/cys-23-3-3273*.

\
<img src="../../docs/imgs/text_similarity_graph.PNG" alt="drawing" width="800"/>


In [1]:
import fasttext
model = fasttext.load_model('../../models/cc.sl.300.bin')



In [2]:
from src.utils.translation_utils import get_grams

test = 'Francija Test 123 Haha Test'
for ix, g in enumerate(get_grams(test, 1, 100)):
    print(f'{ix+1}-grams: {g}')

1-grams: [['Francija'], ['Test'], ['123'], ['Haha'], ['Test']]
2-grams: [['Francija', 'Test'], ['Test', '123'], ['123', 'Haha'], ['Haha', 'Test']]
3-grams: [['Francija', 'Test', '123'], ['Test', '123', 'Haha'], ['123', 'Haha', 'Test']]
4-grams: [['Francija', 'Test', '123', 'Haha'], ['Test', '123', 'Haha', 'Test']]
5-grams: [['Francija', 'Test', '123', 'Haha', 'Test']]


Example of how function `find_similar_text` works.

In [3]:
from src.utils.translation_utils import clean_text, find_similar_text
import json

with open('../../data/dev-v2.0_SL.json', 'r', encoding='UTF-8') as file:
    data = json.load(file)

theme_index = 1
paragraph_index = 44
qas_index = 0

clean_context = clean_text(data[theme_index]['paragraphs'][paragraph_index]['context'])
for index, i in enumerate(get_grams(clean_context)[0]):
    for word in i:
        print(f'{index}: {word}', end=' ')


qas_number = 0
answer = clean_text(data[theme_index]['paragraphs'][paragraph_index]['qas'][qas_number]['answers'][0]['text'])
question = data[theme_index]['paragraphs'][paragraph_index]['qas'][qas_number]['question']        
print(question)
print(answer)
print(find_similar_text(answer, clean_context, model))


0: kot 1: poudarjata 2: fortnow 3: & 4: homer 5: 2003 6: začetek 7: sistematičnih 8: študij 9: računske 10: kompleksnosti 11: pripisujemo 12: temeljnemu 13: članku 14: o 15: računalniški 16: zapletenosti 17: algoritmov 18: ki 19: sta 20: ga 21: napisala 22: juris 23: hartmanis 24: in 25: richard 26: stearns 27: 1965 28: ki 29: sta 30: določila 31: definicije 32: časovne 33: in 34: prostorske 35: kompleksnosti 36: in 37: dokazala 38: hierarhične 39: izreke 40: leta 41: 1965 42: je 43: edmonds 44: definiral 45: dober 46: algoritem 47: kot 48: dober 49: algoritem 50: ki 51: je 52: omejen 53: s 54: polinomom 55: vhodne 56: velikosti Kateri papir se običajno šteje za zvonec, ki se uporablja v sistematičnih študijah računske kompleksnosti?
o računalniški zapletenosti algoritmov
([14, 14, 14], 1.0000000819563866)


Even though the words may look very similar, cosine similarity between their embedding vectors might not be big. Here is an example:

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

a = 'Sena'
b = 'Sene'

a_embed = model.get_sentence_vector(a)
b_embed = model.get_sentence_vector(b)

cosine_similarity([a_embed], [b_embed])

array([[0.28973785]], dtype=float32)

The following code fixes the answers and cleans the text of our dataset:

In [13]:
from tqdm import tqdm

data_name = 'train'

with open(f'../../data/{data_name}-v2.0_SL.json', 'r', encoding='UTF-8') as file:
    data = json.load(file)

removed_qas = 0
removed_answers = 0
for i_theme, theme in enumerate(tqdm(data, ncols=100)):
# for i_theme, theme in enumerate(data):
    for i_paragraph, paragraph in enumerate(theme['paragraphs']):
        context = paragraph['context']
        clean_context = clean_text(context)  
        data[i_theme]['paragraphs'][i_paragraph]['context'] = clean_context

        for i_qas, qas in enumerate(paragraph['qas']):
            question = qas['question']
            
            for i_answer, answer in enumerate(qas['answers']):
                answer_text = answer['text']
                answer_text = clean_text(answer_text).rstrip().lstrip()
                
                if len(answer_text) == 0:
                    print(f'Deleting answer to question: {question}')
                    qas['answers'].remove(answer)
                    continue
                
                answer_start = int(answer['answer_start'])
                answer_split = answer_text.split(' ')
                
                if answer_start < len(clean_context):
                    reduced_context_start = max(answer_start - 200, 0)
                    reduced_context_clean = clean_context[max(answer_start - 200, 0): min(answer_start + len(answer_text) + 200, len(clean_context))]
                else:
                    reduced_context_start = 0
                    reduced_context_clean = clean_context
                
                reduced_context_clean.lstrip().rstrip()
                reduced_context_clean_split = reduced_context_clean.split(' ')
                context_split = clean_context.split(' ')

                answer_length = len(answer_split)
                start_indexes, avg_similarities = find_similar_text(answer_text, reduced_context_clean, model, answer_length, answer_length)
                
                if avg_similarities > 0.75:
                    start_index = start_indexes[0]
                    fixed_answer = ' '.join(reduced_context_clean_split[start_index:start_index + answer_length])
                    fixed_index = reduced_context_start + reduced_context_clean.index(fixed_answer)
                    data[i_theme]['paragraphs'][i_paragraph]['qas'][i_qas]['answers'][i_answer]['text'] = fixed_answer
                    data[i_theme]['paragraphs'][i_paragraph]['qas'][i_qas]['answers'][i_answer]['answer_start'] = fixed_index
                else:
                    qas['answers'].remove(answer)
                    removed_answers += 1
                
                    # print(f'{"-" * 100}')
                    # print(f'CONTEXT: {clean_context}')
                    # print(f'QUESTION: {question}')
                    # print(f'BEFORE: {answer_text}')
                    # print(f'AFTER: {fixed_answer}')
                    # print(f'SIMILARITY: {avg_similarities}')
                
            # if 'plausible_answers' in qas:
            #     for plausible_answer in qas['plausible_answers']:
            #         plausible_answer = plausible_answer['text']
        
            if len(qas['answers']) == 0:
                paragraph['qas'].remove(qas)
            removed_qas += 1


print(f'Removed answers: {removed_answers}')            
print(f'Removed qas: {removed_qas}')            
with open(f'../../data/{data_name}-v2.0_SL_fixed_removed.json', 'w', encoding='UTF-8') as file:
    json.dump(data, file, sort_keys=True, indent=4, ensure_ascii=False)

 58%|███████████████████████████████████▏                         | 255/442 [08:13<06:32,  2.10s/it]

Deleting answer to question: Temperature nad 100 stopinj bodo običajno najdene v kakšni višini Piemonta?


100%|█████████████████████████████████████████████████████████████| 442/442 [14:12<00:00,  1.93s/it]


Removed answers: 20313
Removed qas: 94523
