# Induce and Evaluate Proc, Proc-B and VecMap

This notebook induces all cross-lingual word embeddings (CLWE) used in this project (Proc, Proc-B and VecMap) and evaluates each method directly on the bilingual lexicon induction (BLI) task. We evaluate on six language pairs: EN-DE, EN-PL, EN-NL, EN-FI, EN-RU and EN-IT. At the bottom of the notebook, you can create subsets of a translation dictionary (e.g. the translations of the first 500 unique source words).
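
Each section's output reports P@1, P@5, P@10 and MRR on the test dictionary. As a rough sketch of how such scores can be computed from ranked candidate lists: the function `bli_scores` and its input format are hypothetical and not taken from `clew_induction`, and the project's own evaluation may differ, for example in how several gold translations per source word are handled.

```python
def bli_scores(ranked_candidates, gold_translations, ks=(1, 5, 10)):
    """Compute P@k and MRR for a BLI test set (illustrative helper).

    ranked_candidates: dict mapping source word -> list of target words, best first.
    gold_translations: dict mapping source word -> set of acceptable target words.
    """
    hits = {k: 0 for k in ks}
    reciprocal_rank_sum = 0.0
    for source_word, gold in gold_translations.items():
        candidates = ranked_candidates.get(source_word, [])
        # 1-based rank of the first correct candidate, or None if there is none.
        first_hit = next(
            (rank for rank, cand in enumerate(candidates, start=1) if cand in gold),
            None,
        )
        if first_hit is None:
            continue
        reciprocal_rank_sum += 1.0 / first_hit
        for k in ks:
            if first_hit <= k:
                hits[k] += 1
    n = len(gold_translations)
    scores = {f"P@{k}": hits[k] / n for k in ks}
    scores["MRR"] = reciprocal_rank_sum / n
    return scores


# Toy example: "dog" is translated correctly at rank 1, "cat" at rank 2.
example = bli_scores(
    {"dog": ["hund", "katze"], "cat": ["hund", "katze"]},
    {"dog": {"hund"}, "cat": {"katze"}},
)
print(example)
```

In this toy case half of the source words are correct at rank 1, so P@1 is 0.5, while MRR averages the reciprocal ranks 1 and 1/2.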

In [1]:
#! pip3 install torch==1.5.0 transformers==3.4.0
#! conda install faiss-gpu cudatoolkit=10.0 -c pytorch

In [2]:
import sys
import os
sys.path.append(os.path.dirname(os.path.abspath('')))
from src.cross_lingual_embeddings.pipeline_clwe_induction import clew_induction
#from google.colab import drive
#drive.mount('/content/drive')

Mounted at /content/drive


# 1. EN-DE

In [None]:
path_source_language = "../data/monolingual_embedding/fasttext.wiki.en.300.vocab_200K.vec"
path_target_language = "../data/monolingual_embedding/fasttext.wiki.de.300.vocab_200K.vec"

train_translation_dict_path = "../data/translation_dict/en-de.0-5000.txt"
train_translation_dict_1k_path = "../data/translation_dict/en-de.0-500.txt"
test_translation_dict_path = "../data/translation_dict/en-de.5000-6500.txt"
new_test_translation_path = "../data/translation_dict/yacle.test.freq.cutoff.en-de.tsv"

clew_induction(path_source_language, path_target_language, train_translation_dict_path,
                   train_translation_dict_1k_path, test_translation_dict_path, new_test_translation_path,
                   "en_de", number_tokens=100000, save_embedding=True)


First, we cut the test dictionaries to the monolingual vocabularies:
Original Dictionary Size: 3660
New Dictionary Size: 3561
--------------------------------

Create procrustes model with 5000 translation pairs
Length of Original dictionary: 14677
Length of dictionary after pruning: 14420


Number of Test Translations: 3561/3561
P@1: 0.2990732940185341
P@5: 0.5810165683796686
P@10: 0.6590845268183094


MRR: 0.4227324011221896
--------------------------------

Create procrustes model with 1000 translation pairs
Length of Original dictionary: 1677
Length of dictionary after pruning: 1666


Number of Test Translations: 3561/3561
P@1: 0.2038753159224937
P@5: 0.44566133108677336
P@10: 0.5313114293737714


MRR: 0.3137714125427455
--------------------------------

Create procrustes bootstrapping model with 1000 translation pairs
Length of Original dictionary: 1677
Length of Original dictionary: 1677
Length of dictionary after pruning: 1666
Length of new dictionary: 2515
Length of Original d

Number of Test Translations: 3561/3561
P@1: 0.20864925582701488
P@5: 0.3066554338668913
P@10: 0.32462791350744175


MRR: 0.2525664188879662
--------------------------------

Create  Text Encoder Last Layer model


Number of Test Translations: 3561/3561
P@1: 0.11822521763549565
P@5: 0.14462229710755406
P@10: 0.15782083684358325


MRR: 0.13278305720414538
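
The "procrustes model" lines in the output above refer to a mapping learned from the training dictionary. A minimal sketch of the standard orthogonal Procrustes solution, assuming row-aligned matrices of source and target vectors; the names `X`, `Y` and `learn_procrustes` are illustrative, and this is not necessarily how `clew_induction` implements it.

```python
import numpy as np


def learn_procrustes(X, Y):
    """Return the orthogonal matrix W that minimises ||X @ W - Y||_F."""
    # SVD of the cross-covariance between aligned source and target vectors.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # toy "source" vectors, one row per dictionary pair
# Build a known orthogonal map so that the recovered mapping can be checked.
true_W, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ true_W  # toy "target" vectors
W = learn_procrustes(X, Y)
print(np.allclose(X @ W, Y))
```

Because `W` is constrained to be orthogonal, it preserves distances and angles within the source space, which is one reason this mapping is a common baseline.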


# 2. EN-PL

In [None]:
path_source_language = "../data/monolingual_embedding/fasttext.wiki.en.300.vocab_200K.vec"
path_target_language = "../data/monolingual_embedding/fasttext.wiki.pol.300.vocab.vec"
train_translation_dict_path = "../data/translation_dict/en-pl.0-5000.txt"
train_translation_dict_1k_path = "../data/translation_dict/en-pol.0-500.txt"
test_translation_dict_path = "../data/translation_dict/en-pl.5000-6500.txt"

new_test_translation_path = "../data/translation_dict/yacle.test.freq.cutoff.en-pl.tsv"

clew_induction(path_source_language, path_target_language, train_translation_dict_path,
                   train_translation_dict_1k_path, test_translation_dict_path, new_test_translation_path,
                   "en_pl", number_tokens=100000, save_embedding=False)


First, we cut the test dictionaries to the monolingual vocabularies:
Original Dictionary Size: 2745
New Dictionary Size: 2584
--------------------------------

Create procrustes model with 5000 translation pairs
Length of Original dictionary: 12201
Length of dictionary after pruning: 11821


Number of Test Translations: 2584/2584
P@1: 0.31656346749226005
P@5: 0.641640866873065
P@10: 0.7287151702786377


MRR: 0.46062657598031814
--------------------------------

Create procrustes model with 1000 translation pairs
Length of Original dictionary: 1497
Length of dictionary after pruning: 1476


Number of Test Translations: 2584/2584
P@1: 0.16950464396284828
P@5: 0.39009287925696595
P@10: 0.4887770897832817


MRR: 0.27367271935846665
--------------------------------

Create procrustes bootstrapping model with 1000 translation pairs
Length of Original dictionary: 1497
Length of Original dictionary: 1497
Length of dictionary after pruning: 1476
Length of new dictionary: 2245
Length of Origina

Number of Test Translations: 2584/2584
P@1: 0.23258513931888544
P@5: 0.30688854489164086
P@10: 0.3173374613003096


MRR: 0.26757572093042115
--------------------------------

Create  Text Encoder Last Layer model


Number of Test Translations: 2584/2584
P@1: 0.13660990712074303
P@5: 0.16215170278637772
P@10: 0.17879256965944273


MRR: 0.1516095785544099
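
The "Original Dictionary Size" and "after pruning" lines above suggest that translation pairs are dropped when a word is missing from the monolingual vocabularies. A sketch of such a filtering step, assuming pairs are `(source, target)` tuples and vocabularies are sets; `prune_dictionary` is a hypothetical helper, not a function from this repository.

```python
def prune_dictionary(pairs, source_vocab, target_vocab):
    """Keep only the pairs whose source and target word both have an embedding."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if src in source_vocab and tgt in target_vocab
    ]


# Toy data: "oovword" has no source embedding, so its pair is dropped.
pairs = [("house", "dom"), ("cat", "kot"), ("oovword", "pies")]
pruned = prune_dictionary(pairs, {"house", "cat"}, {"dom", "kot", "pies"})
print(f"Original Dictionary Size: {len(pairs)}")
print(f"New Dictionary Size: {len(pruned)}")
```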


# 3. EN-NL

In [None]:
path_source_language = "../data/monolingual_embedding/fasttext.wiki.en.300.vocab_200K.vec"
path_target_language = "../data/monolingual_embedding/wiki.nl.vec"
train_translation_dict_path = "../data/translation_dict/en-nl.0-5000.txt"
train_translation_dict_1k_path = "../data/translation_dict/en-nl.0-500.txt"
test_translation_dict_path = "../data/translation_dict/en-nl.5000-6500.txt"

new_test_translation_path = "../data/translation_dict/yacle.test.freq.cutoff.en-nl.tsv"

clew_induction(path_source_language, path_target_language, train_translation_dict_path,
                train_translation_dict_1k_path, test_translation_dict_path, new_test_translation_path,
                "en_nl", number_tokens=100000)


First, we cut the test dictionaries to the monolingual vocabularies:
Original Dictionary Size: 2481
New Dictionary Size: 2412
--------------------------------

Create procrustes model with 5000 translation pairs
Length of Original dictionary: 9419
Length of dictionary after pruning: 9237


Number of Test Translations: 2412/2412
P@1: 0.41376451077943616
P@5: 0.6695688225538972
P@10: 0.7458540630182421


MRR: 0.5335588512957985
--------------------------------

Create procrustes model with 1000 translation pairs
Length of Original dictionary: 1021
Length of dictionary after pruning: 1012


Number of Test Translations: 2412/2412
P@1: 0.24543946932006633
P@5: 0.4792703150912106
P@10: 0.5522388059701493


MRR: 0.3531230253992282
--------------------------------

Create procrustes bootstrapping model with 1000 translation pairs
Length of Original dictionary: 1021
Length of Original dictionary: 1021
Length of dictionary after pruning: 1012
Length of new dictionary: 1531
Length of Original di

Number of Test Translations: 2412/2412
P@1: 0.32587064676616917
P@5: 0.4033996683250415
P@10: 0.41832504145936983


MRR: 0.36270461225684275
--------------------------------

Create  Text Encoder Last Layer model


Number of Test Translations: 2412/2412
P@1: 0.19112769485903813
P@5: 0.22470978441127695
P@10: 0.2433665008291874


MRR: 0.20965476294460786
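
The "procrustes bootstrapping model" output above starts from the small seed dictionary and then reports a larger "new dictionary". One common way to grow a seed dictionary is to add word pairs that are mutual nearest neighbours in the current shared space and then relearn the mapping. The sketch below shows only that augmentation step; whether `clew_induction` uses exactly this criterion is an assumption, and `mutual_nearest_neighbours` is a hypothetical helper.

```python
import numpy as np


def mutual_nearest_neighbours(mapped_source, target):
    """Return (i, j) index pairs that are each other's nearest neighbour."""
    # Cosine similarity between every mapped source and every target vector.
    a = mapped_source / np.linalg.norm(mapped_source, axis=1, keepdims=True)
    b = target / np.linalg.norm(target, axis=1, keepdims=True)
    sim = a @ b.T
    best_target = sim.argmax(axis=1)  # nearest target for each source row
    best_source = sim.argmax(axis=0)  # nearest source for each target row
    return [(i, int(j)) for i, j in enumerate(best_target) if best_source[j] == i]


# Toy vectors: source rows 0 and 2 both prefer target 0, but target 0 prefers row 0.
mapped_source = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
target = np.array([[1.0, 0.05], [0.0, 2.0]])
new_pairs = mutual_nearest_neighbours(mapped_source, target)
print(new_pairs)
```

Requiring the match to hold in both directions filters out source words that merely sit close to a popular target word, which keeps the added pairs comparatively clean.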


# 4. EN-FI

In [None]:
path_source_language = "../data/monolingual_embedding/fasttext.wiki.en.300.vocab_200K.vec"
path_target_language = "../data/monolingual_embedding/fasttext.wiki.fi.300.vocab_200K.vec"
train_translation_dict_path = "../data/translation_dict/en-fi.0-5000.txt"
train_translation_dict_1k_path = "../data/translation_dict/en-fi.0-500.txt"
test_translation_dict_path = "../data/translation_dict/en-fi.5000-6500.txt"

new_test_translation_path = "/content/drive/MyDrive/CLIR/translation_dict/yacle.test.freq.cutoff.en-fi.tsv"

clew_induction(path_source_language, path_target_language, train_translation_dict_path,
                train_translation_dict_1k_path, test_translation_dict_path, new_test_translation_path,
                "en_fi", number_tokens=100000)



First, we cut the test dictionaries to the monolingual vocabularies:
Original Dictionary Size: 2517
New Dictionary Size: 2349
--------------------------------

Create procrustes model with 5000 translation pairs
Length of Original dictionary: 11496
Length of dictionary after pruning: 11070


Number of Test Translations: 2349/2349
P@1: 0.2707535121328225
P@5: 0.5615155385270327
P@10: 0.6577266922094508


MRR: 0.40308053342830696
--------------------------------

Create procrustes model with 1000 translation pairs
Length of Original dictionary: 1412
Length of dictionary after pruning: 1390


Number of Test Translations: 2349/2349
P@1: 0.12175393784589186
P@5: 0.31375053214133675
P@10: 0.40272456364410386


MRR: 0.2156435105812954
--------------------------------

Create procrustes bootstrapping model with 1000 translation pairs
Length of Original dictionary: 1412
Length of Original dictionary: 1412
Length of dictionary after pruning: 1390
Length of new dictionary: 2118
Length of Origina

Number of Test Translations: 2349/2349
P@1: 0.2175393784589187
P@5: 0.2601106853980417
P@10: 0.2673478075776926


MRR: 0.23927964727677578
--------------------------------

Create  Text Encoder Last Layer model


Number of Test Translations: 2349/2349
P@1: 0.1502767134951043
P@5: 0.1728395061728395
P@10: 0.1839080459770115


MRR: 0.16425003922541054


# 5. EN-RU

In [None]:
path_source_language = "/content/drive/MyDrive/CLIR/monolingual_embedding/fasttext.wiki.en.300.vocab_200K.vec"
path_target_language = "/content/drive/MyDrive/CLIR/monolingual_embedding/fasttext.wiki.ru.300.vocab_200K.vec"
train_translation_dict_path = "/content/drive/MyDrive/CLIR/translation_dict/en-ru.0-5000.txt"
train_translation_dict_1k_path = "/content/drive/MyDrive/CLIR/translation_dict/en-ru.0-500.txt"
test_translation_dict_path = "/content/drive/MyDrive/CLIR/translation_dict/en-ru.5000-6500.txt"

new_test_translation_path = "/content/drive/MyDrive/CLIR/translation_dict/yacle.test.freq.cutoff.en-ru.tsv"
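try:
    # Label the run "en_ru" so that it matches the EN-RU inputs above.
    clew_induction(path_source_language, path_target_language, train_translation_dict_path,
                   train_translation_dict_1k_path, test_translation_dict_path, new_test_translation_path,
                   "en_ru", number_tokens=100000, save_embedding=False)
except Exception as exc:
    # Report what went wrong instead of silently swallowing the error.
    print(f"EN-RU induction failed: {exc!r}")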


First, we cut the test dictionaries to the monolingual vocabularies:
Original Dictionary Size: 2447
New Dictionary Size: 2291
--------------------------------

Create procrustes model with 5000 translation pairs
Length of Original dictionary: 10887
Length of dictionary after pruning: 10601


Number of Test Translations: 2291/2291
P@1: 0.31776516804888694
P@5: 0.6302924487123527
P@10: 0.7127891750327368


MRR: 0.45576879639080314
--------------------------------

Create procrustes model with 1000 translation pairs
Length of Original dictionary: 1376
Length of dictionary after pruning: 1365


Number of Test Translations: 2291/2291
P@1: 0.18070711479703186
P@5: 0.3950240069838499
P@10: 0.4984722828459188


MRR: 0.2837927396628475
--------------------------------

Create procrustes bootstrapping model with 1000 translation pairs
Length of Original dictionary: 1376
Length of Original dictionary: 1376
Length of dictionary after pruning: 1365
Length of new dictionary: 2064
Length of Original

Number of Test Translations: 2291/2291
P@1: 0.11305106940200786
P@5: 0.21780881711043212
P@10: 0.2409428197293758


MRR: 0.16202785517892926
--------------------------------

Create  Text Encoder Last Layer model


Number of Test Translations: 2291/2291
P@1: 0.02749890877346137
P@5: 0.0405936272370144
P@10: 0.04845045831514622


MRR: 0.03619241105503459


# 6. EN-IT

In [None]:
path_source_language = "../data/monolingual_embedding/fasttext.wiki.en.300.vocab_200K.vec"
path_target_language = "../data/monolingual_embedding/wiki.it.vec"
train_translation_dict_path = "../data/translation_dict/en-it.0-5000.txt"
train_translation_dict_1k_path = "../data/translation_dict/en-it.0-500.txt"
test_translation_dict_path = "../data/translation_dict/en-it.5000-6500.txt"

new_test_translation_path = "../data/translation_dict/yacle.test.freq.cutoff.en-it.tsv"

clew_induction(path_source_language, path_target_language, train_translation_dict_path,
                train_translation_dict_1k_path, test_translation_dict_path, new_test_translation_path,
                "en_it", number_tokens=100000, save_embedding=False)



First, we cut the test dictionaries to the monolingual vocabularies:
Original Dictionary Size: 2585
New Dictionary Size: 2554
--------------------------------

Create procrustes model with 5000 translation pairs
Length of Original dictionary: 9657
Length of dictionary after pruning: 9575


Number of Test Translations: 2554/2554
P@1: 0.42521534847298353
P@5: 0.6918559122944401
P@10: 0.7599843382928739


MRR: 0.5438951152149026
--------------------------------

Create procrustes model with 1000 translation pairs
Length of Original dictionary: 1058
Length of dictionary after pruning: 1056


Number of Test Translations: 2554/2554
P@1: 0.27721221613155833
P@5: 0.5305403288958497
P@10: 0.6049334377447142


MRR: 0.39052255633687755
--------------------------------

Create procrustes bootstrapping model with 1000 translation pairs
Length of Original dictionary: 1058
Length of Original dictionary: 1058
Length of dictionary after pruning: 1056
Length of new dictionary: 1587
Length of Original d

Number of Test Translations: 2554/2554
P@1: 0.31362568519968675
P@5: 0.44675019577133906
P@10: 0.4635865309318716


MRR: 0.37417703255145207
--------------------------------

Create  Text Encoder Last Layer model



# Create Subsets of dictionaries if needed here

In [None]:
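from src.cross_lingual_embeddings.load_monolingual import load_translation_dict

path_dict = "../data/translation_dict/en-it.0-5000.txt"
save_path = "../data/translation_dict/en-it.0-500.txt"
max_unique_source_words = 500

translation_source, translation_target = load_translation_dict(path_dict)
lines = []
unique_words = set()
for source_word, target_word in zip(translation_source, translation_target):
    # Stop at the first unseen source word once the limit is reached, so that
    # every translation of the last accepted source word is still kept.
    if source_word not in unique_words and len(unique_words) == max_unique_source_words:
        break
    unique_words.add(source_word)
    lines.append(f"{source_word} {target_word}\n")

# Write one "source target" pair per line, as in the input dictionary.
with open(save_path, "w", encoding="utf-8") as f:
    f.writelines(lines)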