## Final Tests and Scores
This notebook is part of the thesis project "Learning Multilingual Document Representations" (Marc Lenz, 2021).

-------------------------------------------------

**General Information**
- This notebook tests and evaluates different methods for creating multilingual document representations.

- The methods, parameter settings and languages used for the evaluation can be adjusted by changing the variables in the cell below.

- This Notebook was run in Google Colab. 

**About the Methods and Datasets**

Datasets: 
 - JRC-Acquis (sample of 5000 documents; named `JRC_Arquis` in the code)
 - EU-Bookshop (sample of ~9000 documents, of which the first 5000 are used)

Methods:

- Methods based on learning mappings between monolingual representations: Linear Concept Approximation (LCA), Linear Concept Compression (LCC), and their neural network versions, NNCA and NNCC.
For these methods, the monolingual representations have to be created first; the mapping is then applied to them. The monolingual representations are derived with Latent Semantic Indexing (LSI) or Doc2Vec (Mikolov et al.).

- Methods which derive multilingual representations directly: Cross-Lingual Latent Semantic Indexing (CL-LSI) and an improved version of it, which is described in the theoretical section of the thesis.
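
To make the idea of a mapping between monolingual representations concrete, here is a minimal sketch that fits a linear map between two sets of aligned document vectors by ordinary least squares. It is an illustration under stated assumptions and not the thesis's definition of LCA or LCC: the helper name `fit_linear_map` is hypothetical, and the random matrices only stand in for real LSI or Doc2Vec vectors.

```python
import numpy as np

def fit_linear_map(source_vecs, target_vecs):
    """Least-squares solution W of source_vecs @ W ~ target_vecs (hypothetical helper)."""
    W, *_ = np.linalg.lstsq(source_vecs, target_vecs, rcond=None)
    return W

rng = np.random.default_rng(0)
en_train = rng.normal(size=(200, 8))   # stand-in for English training vectors
true_map = rng.normal(size=(8, 8))     # synthetic ground-truth mapping
fr_train = en_train @ true_map         # stand-in for the aligned French vectors

W = fit_linear_map(en_train, fr_train)

en_test = rng.normal(size=(5, 8))
fr_pred = en_test @ W                  # English test documents projected into the French space
```

Because the toy data is exactly linear, the fitted matrix recovers `true_map`; with real document vectors the fit is only approximate, which is why the methods are compared with retrieval scores.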

In [None]:
"""
----
Languages Preprocessed for JRC_Arquis: en, hu, fr, de, nl, pt, cz, pl
Languages Preprocessed for EU-Bookshop: en, es, fr

"""
#Choose either "JRC_Arquis" or "EU-Bookshop"
dataset = "EU-Bookshop"

#Determines which methods are tested
# True -> Method is evaluated
# False -> Method is ignored
test_LCA = False
test_LCC = False
test_CLLSI = False
test_neural_networks = True

#Set languages, dimensions and kind of monolingual embedding
#The monolingual embedding method influences the results of 
# LCA, LCC, NNCA, and NNCC
languages = ["en", "es", "fr"] #["en", "hu", "fr", "de", "nl", "pt", "cs", "pl"]
embedding_method = "LSI"


#BEST PARAMETERS/PARAMETERS TO BE TESTED
lca_dimension = [500]
lcao_dimension = [500]
lcc_dimension = [500]
cllsi_dimension = [500]
settings_nncc = []  # no NNCC settings are tested in this configuration

settings_nnca = [  
            {"dimension" : 500,
             "neurons" : [500], 
             "activation_function" : None,
             "dropout" : 0.0,
             "optimizer" : "adam",
             "loss_function" : "cosine_sim"},
             ]

In [None]:
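# Collect every dimension for which monolingual representations are needed,
# including the dimensions requested by the neural network settings
nn_dimensions = [setting["dimension"] for setting in settings_nncc + settings_nnca]
all_dimensions = lca_dimension + lcao_dimension + lcc_dimension + nn_dimensions
# dict.fromkeys removes duplicates while keeping the original order
dimensions = list(dict.fromkeys(all_dimensions))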

## Load Dataset
- First, clone the Git repository which contains most of the functions and models used in this notebook.

In [None]:
!git clone https://github.com/marc-lenz/thesis_code.git



- Then load the dataset.

In [None]:
from google.colab import drive
import pandas as pd 
import numpy as np
import pickle

drive.mount("/content/gdrive")

if dataset == "JRC_Arquis" :
  main_dir = "/content/gdrive/My Drive/Thesis/JRC_Arquis_files/"
  sample_df = pd.read_pickle(main_dir+"sample_df_preprocessed.pkl")
  train_df = sample_df[:3000]
  val_df = sample_df[3000:4000]
  test_df = sample_df[4000:5000]
  
elif dataset == "EU-Bookshop": 
  main_dir = "/content/gdrive/My Drive/Thesis/EU-BookShop Files/"
  # Build a DataFrame with one column of tokenized documents per language

  def get_eub_dataframe(main_dir):
    def load(filepath):
      with open(filepath, "rb") as f:
        return pickle.load(f)
    sample_df = pd.DataFrame()
    for lang in ["en", "fr", "es"]:
      # main_dir already ends with "/", so no extra separator is needed
      sample_df["body_pre_{}".format(lang)] = load(main_dir + "tokenized_{}.list".format(lang))
    # Drop every row in which at least one language has an empty token list
    for key in sample_df.keys():
      sample_df = sample_df[sample_df.astype(str)[key] != '[]']
    return sample_df

  sample_df = get_eub_dataframe(main_dir)[:5000]
  train_df = sample_df[:3000]
  val_df = sample_df[3000:4000]
  test_df = sample_df[4000:5000]

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
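
The loading step above drops rows with empty token lists and then slices fixed-size train, validation and test splits. The following is a minimal sketch of the same idea on toy data, assuming every cell holds a Python list of tokens. The helper names `drop_rows_with_empty_docs` and `split_frame` are hypothetical and not part of `thesis_code`.

```python
import pandas as pd

def drop_rows_with_empty_docs(df):
    """Keep only rows in which every column holds a non-empty token list."""
    non_empty = df.apply(lambda col: col.map(len) > 0)  # boolean DataFrame
    return df[non_empty.all(axis=1)]

def split_frame(df, n_train, n_val, n_test):
    """Slice consecutive train/validation/test blocks by position."""
    train = df.iloc[:n_train]
    val = df.iloc[n_train:n_train + n_val]
    test = df.iloc[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Toy parallel corpus: rows 1 and 2 each miss one language
toy = pd.DataFrame({
    "body_pre_en": [["a"], [], ["b"], ["c"], ["d"]],
    "body_pre_fr": [["x"], ["y"], [], ["z"], ["w"]],
})

clean = drop_rows_with_empty_docs(toy)
train, val, test = split_frame(clean, 1, 1, 1)
```

Checking the list length directly avoids converting the whole DataFrame to strings, but it only works if the cells really are lists (or other sized containers).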


## Train Monolingual Representations which will be aligned
- The languages and dimensions to be tested are set in the configuration cell at the top of the notebook.

In [None]:
from thesis_code.Utils import read_docs, Vector_Lsi_Model
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from tqdm import tqdm

max_dim = max(dimensions)
matrices = dict()


if embedding_method == "LSI":
  lsi_models = dict()
  for t in languages:
    key = "body_pre_{}".format(t)
    lsi_models[t] = Vector_Lsi_Model(sample_df[key], dimension=max_dim)
    matrices["{}_train_vecs".format(t)] = np.asarray(lsi_models[t].create_embeddings(train_df[key]))
    matrices["{}_val_vecs".format(t)] = np.asarray(lsi_models[t].create_embeddings(val_df[key]))
    matrices["{}_test_vecs".format(t)] = np.asarray(lsi_models[t].create_embeddings(test_df[key]))

elif embedding_method =="Doc2Vec":
  for dimension in dimensions:
    matrices[dimension] = dict()
    for t in tqdm(languages):
      key = "body_pre_{}".format(t)
      #create tagged docs first
      documents = []
      for ind in sample_df.index:
        doc = sample_df[key][ind]
        tagged_doc = TaggedDocument(doc, [ind])
        documents.append(tagged_doc)
      #Train Doc2Vec Model
      model = Doc2Vec(documents, vector_size=dimension, window=3, min_count=10, workers=4, epochs=100, dm=0)
      training_docs = [model[i] for i in train_df.index]
      validation_docs = [model[i] for i in val_df.index]
      test_docs = [model[i] for i in test_df.index]
      #set matrices
      matrices[dimension]["{}_train_vecs".format(t)] = np.asarray(training_docs)
      matrices[dimension]["{}_val_vecs".format(t)] = np.asarray(validation_docs)
      matrices[dimension]["{}_test_vecs".format(t)] = np.asarray(test_docs)
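
In the cell above, the LSI model is trained only once at `max_dim`, while Doc2Vec is retrained for every dimension. A likely reason is that truncated-SVD embeddings are nested: the first `d` coordinates of a higher-dimensional LSI vector are the `d`-dimensional LSI vector. The sketch below illustrates this property with a plain NumPy SVD on toy counts. It does not reproduce `Vector_Lsi_Model`, whose internals are not shown here, and `lsi_embed` is a hypothetical helper.

```python
import numpy as np

def lsi_embed(term_doc, dimension):
    """Embed the columns (documents) of a term-document matrix with LSI.

    Uses the classic fold-in d_hat = S_k^{-1} U_k^T d for every document d.
    """
    U, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    U_k, s_k = U[:, :dimension], s[:dimension]
    return (term_doc.T @ U_k) / s_k

rng = np.random.default_rng(0)
# Toy term-document counts: 30 terms, 12 documents
term_doc = rng.poisson(1.0, size=(30, 12)).astype(float)

vecs_max = lsi_embed(term_doc, 8)    # trained once at the largest dimension
vecs_small = lsi_embed(term_doc, 4)  # trained separately at a smaller dimension
```

Here `vecs_small` equals the first four columns of `vecs_max`, so lower-dimensional representations can be obtained by slicing. Doc2Vec has no such property, which is consistent with the per-dimension training loop.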

In [None]:
from itertools import permutations
# All ordered (source, target) language pairs
pair_list = list(permutations(languages, 2))

In [None]:
pair_list

[('en', 'es'),
 ('en', 'fr'),
 ('es', 'en'),
 ('es', 'fr'),
 ('fr', 'en'),
 ('fr', 'es')]

## Linear Concept Approximation

In [None]:
from thesis_code.evaluation_functions import mate_retrieval, reciprocal_rank, comp_scores
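
The imported helpers are not defined in this notebook. As a rough orientation, the sketch below shows how mate retrieval and mean reciprocal rank are commonly computed for aligned document vectors with cosine similarity. This is an assumption about what such metrics measure, not the code of `mate_retrieval`, `reciprocal_rank` or `comp_scores`; all function names below are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarities between rows of a and rows of b (rows must be non-zero)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mate_retrieval_rate(l1_vecs, l2_vecs):
    """Share of documents whose most similar counterpart is their own translation."""
    sims = cosine_similarity_matrix(l1_vecs, l2_vecs)
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(sims))))

def mean_reciprocal_rank(l1_vecs, l2_vecs):
    """Average of 1 / rank of the true translation among all candidates."""
    sims = cosine_similarity_matrix(l1_vecs, l2_vecs)
    mate_sims = np.diag(sims)[:, None]
    ranks = 1 + np.sum(sims > mate_sims, axis=1)
    return float(np.mean(1.0 / ranks))

# Toy example: row i of l1 is aligned with row i of l2
l1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
l2 = np.array([[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]])

rate = mate_retrieval_rate(l1, l2)
mrr = mean_reciprocal_rank(l1, l2)
```

In the toy data only the second document retrieves its mate first; the true mates of the first and third documents are ranked second and third, which gives a mate retrieval rate of 1/3 and a mean reciprocal rank of 11/18.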

In [None]:
from thesis_code.evaluation_functions import evaluate_baseline_lca_model, evaluate_baseline_lca_model_ort

if test_LCA:
  lca_scores = dict()

  for pair in pair_list:
    l1 = pair[0]
    l2 = pair[1]
    if embedding_method == "LSI":
      l1_train, l1_test = matrices["{}_train_vecs".format(l1)], matrices["{}_test_vecs".format(l1)]
      l2_train, l2_test = matrices["{}_train_vecs".format(l2)], matrices["{}_test_vecs".format(l2)]
      score_lca = evaluate_baseline_lca_model(l1_train, l1_test, l2_train, l2_test, lca_dimension, comp_scores)
      score_lcao = evaluate_baseline_lca_model_ort(l1_train, l1_test, l2_train, l2_test, lcao_dimension, comp_scores)
    elif embedding_method == "Doc2Vec":
      score_lca = []
      score_lcao = []
      for dimension in lca_dimension: 
        l1_train, l1_test = matrices[dimension]["{}_train_vecs".format(l1)], matrices[dimension]["{}_test_vecs".format(l1)]
        l2_train, l2_test = matrices[dimension]["{}_train_vecs".format(l2)], matrices[dimension]["{}_test_vecs".format(l2)]
        score_lca.append(evaluate_baseline_lca_model(l1_train, l1_test, l2_train, l2_test, [dimension], comp_scores)[0])
      for dimension in lcao_dimension: 
        l1_train, l1_test = matrices[dimension]["{}_train_vecs".format(l1)], matrices[dimension]["{}_test_vecs".format(l1)]
        l2_train, l2_test = matrices[dimension]["{}_train_vecs".format(l2)], matrices[dimension]["{}_test_vecs".format(l2)]
        score_lcao.append(evaluate_baseline_lca_model_ort(l1_train, l1_test, l2_train, l2_test, [dimension], comp_scores)[0])

    lca_scores["{}-> {}".format(l1,l2)] = {"lca_{}".format(embedding_method): score_lca, 
                         "lcao_{}".format(embedding_method): score_lcao}
    #Save Results
    target_dir = main_dir+"lca_scores_{}_{}".format(embedding_method, dataset)
    with open(target_dir, 'wb') as handle:
        pickle.dump(lca_scores, handle, protocol=pickle.HIGHEST_PROTOCOL)
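
The result dictionaries are written with `pickle` after every language pair. A minimal round-trip sketch follows, so that saved scores can be inspected later. The helper names are hypothetical, the score values are made-up placeholders rather than results, and a temporary directory stands in for the Google Drive folder.

```python
import os
import pickle
import tempfile

def save_scores(scores, path):
    """Write a score dictionary to disk with the highest pickle protocol."""
    with open(path, "wb") as handle:
        pickle.dump(scores, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load_scores(path):
    """Read a score dictionary back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

# Placeholder values, only to demonstrate the structure used in this notebook
scores = {"en-> es": {"lca_LSI": [0.5], "lcao_LSI": [0.6]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lca_scores_LSI_EU-Bookshop")
    save_scores(scores, path)
    restored = load_scores(path)
```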


## LCC Scores

In [None]:
from thesis_code.evaluation_functions import evaluate_lcc_model

if test_LCC:
  lcc_scores = dict()

  for pair in pair_list:
    l1 = pair[0]
    l2 = pair[1]
    if embedding_method =="LSI":
      l1_train, l1_test = matrices["{}_train_vecs".format(l1)], matrices["{}_test_vecs".format(l1)]
      l2_train, l2_test = matrices["{}_train_vecs".format(l2)], matrices["{}_test_vecs".format(l2)]
      score_lcc = evaluate_lcc_model(l1_train, l1_test, l2_train, l2_test, lcc_dimension, evaluation_function = comp_scores)
    elif embedding_method == "Doc2Vec":
      score_lcc = []
      for dimension in lcc_dimension: 
        l1_train, l1_test = matrices[dimension]["{}_train_vecs".format(l1)], matrices[dimension]["{}_test_vecs".format(l1)]
        l2_train, l2_test = matrices[dimension]["{}_train_vecs".format(l2)], matrices[dimension]["{}_test_vecs".format(l2)]
        score_lcc.append(evaluate_lcc_model(l1_train, l1_test, l2_train, l2_test, [dimension], comp_scores)[0])
    lcc_scores["{}-> {}".format(l1,l2)] = score_lcc

    #Save Results
    target_dir = main_dir+"lcc_scores_{}_{}".format(embedding_method, dataset)
    with open(target_dir, 'wb') as handle:
        pickle.dump(lcc_scores, handle, protocol=pickle.HIGHEST_PROTOCOL)



## Cross-Lingual LSI

In [None]:
from thesis_code.evaluation_functions import evaluate_cllsi, evaluate_improved_cllsi
from tqdm import tqdm

cllsi_scores = dict()
if test_CLLSI:

  for pair in tqdm(pair_list):
    l1 = pair[0]
    l2 = pair[1]
    # Note: CL-LSI is evaluated on val_df here, whereas the other methods use the test split
    l1_train, l1_test = list(train_df["body_pre_{}".format(l1)]), list(val_df["body_pre_{}".format(l1)])
    l2_train, l2_test = list(train_df["body_pre_{}".format(l2)]), list(val_df["body_pre_{}".format(l2)])
    cllsi_score = evaluate_cllsi(l1_train, l1_test, l2_train, l2_test, cllsi_dimension, evaluation_function = comp_scores)
    print("pair: {}, CL-LSI score: {}".format(pair, cllsi_score))
    i_cllsi_score = evaluate_improved_cllsi(l1_train, l1_test, l2_train, l2_test, cllsi_dimension, evaluation_function = comp_scores)
    print("pair: {}, improved CL-LSI score: {}".format(pair, i_cllsi_score))

    cllsi_scores["{}-> {}".format(l1,l2)] = {"cllsi_{}".format(embedding_method): cllsi_score, 
                         "icllsi_{}".format(embedding_method): i_cllsi_score}
    #Save Results (the CL-LSI scores, not the LCA scores)
    target_dir = main_dir+"cllsi_scores_{}_{}".format(embedding_method, dataset)
    with open(target_dir, 'wb') as handle:
        pickle.dump(cllsi_scores, handle, protocol=pickle.HIGHEST_PROTOCOL)
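
For orientation, the basic CL-LSI idea is to merge each training document with its translation and to run LSI on the merged documents, so that terms of both languages share one latent space. The sketch below shows this on toy data with a plain NumPy SVD. It is not the code behind `evaluate_cllsi` or the improved variant; the helper names and the language prefix on tokens are illustrative assumptions.

```python
import numpy as np

def merge_parallel_docs(l1_docs, l2_docs, l1, l2):
    """Concatenate aligned documents; prefix tokens so both vocabularies stay distinct."""
    return [[f"{l1}:{t}" for t in d1] + [f"{l2}:{t}" for t in d2]
            for d1, d2 in zip(l1_docs, l2_docs)]

def term_doc_matrix(docs, index):
    """Raw term counts, one column per document."""
    X = np.zeros((len(index), len(docs)))
    for j, doc in enumerate(docs):
        for token in doc:
            if token in index:
                X[index[token], j] += 1.0
    return X

en_train = [["cat", "dog"], ["market", "price"], ["dog", "market"]]
fr_train = [["chat", "chien"], ["marché", "prix"], ["chien", "marché"]]

merged = merge_parallel_docs(en_train, fr_train, "en", "fr")
vocab = sorted({token for doc in merged for token in doc})
index = {token: i for i, token in enumerate(vocab)}

# LSI on the merged corpus: keep the first k left singular vectors
U, s, _ = np.linalg.svd(term_doc_matrix(merged, index), full_matrices=False)
k = 2
U_k = U[:, :k]

def embed(docs, lang):
    """Project monolingual documents into the shared latent space."""
    prefixed = [[f"{lang}:{t}" for t in doc] for doc in docs]
    return term_doc_matrix(prefixed, index).T @ U_k

en_vecs = embed([["cat", "dog"]], "en")
fr_vecs = embed([["chat", "chien"]], "fr")
```

In this toy corpus every English term co-occurs with exactly the same documents as its French counterpart, so the two monolingual documents receive identical vectors. Real corpora are noisier, which is what the retrieval scores quantify.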


## Neural Networks

The settings to be tested are listed in `settings_nncc` and `settings_nnca` in the configuration cell at the top of the notebook.
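
When several combinations should be compared, settings dictionaries of this shape can be generated from a small grid instead of being written by hand. The sketch below uses the keys of the NNCA setting from the configuration cell. The helper `expand_settings` and the extra option values (`[500, 500]`, `"relu"`) are hypothetical examples and were not part of the reported experiments.

```python
from itertools import product

def expand_settings(grid):
    """Return one settings dict per combination of the listed options."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]

grid = {
    "dimension": [500],
    "neurons": [[500], [500, 500]],         # second option is a hypothetical example
    "activation_function": [None, "relu"],  # "relu" is a hypothetical example
    "dropout": [0.0],
    "optimizer": ["adam"],
    "loss_function": ["cosine_sim"],
}

settings_grid = expand_settings(grid)
```

The first generated entry matches the single NNCA setting used in this notebook. Note that the evaluation loop stores only one score per language pair, so with more than one setting the results would need an additional key per setting.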

In [None]:
from thesis_code.evaluation_functions import evaluate_nncc, evaluate_nnca

if test_neural_networks:
  nncc_scores = dict()
  nnca_scores = dict()
  #choose only one, to reduce computational burden
  for pair in pair_list:
    l1 = pair[0]
    l2 = pair[1]
    
    if embedding_method =="LSI":
        for setting in settings_nncc:
          dimension = setting["dimension"]
          l1_train, l1_test = matrices["{}_train_vecs".format(l1)], matrices["{}_test_vecs".format(l1)]
          l2_train, l2_test = matrices["{}_train_vecs".format(l2)], matrices["{}_test_vecs".format(l2)]
          score, history = evaluate_nncc(l1_train, l1_test, l2_train, l2_test, 
                                dimensions = [dimension], 
                                evaluation_function = comp_scores,
                                neurons = setting["neurons"],
                                activation_function = setting["activation_function"],
                                max_epochs = 200,
                                dropout = setting["dropout"],
                                optimizer = setting["optimizer"],
                                loss = setting["loss_function" ])
          nncc_scores["{}-> {}".format(l1,l2)] = score
        for setting in settings_nnca:
          dimension = setting["dimension"]
          l1_train, l1_test = matrices["{}_train_vecs".format(l1)], matrices["{}_test_vecs".format(l1)]
          l2_train, l2_test = matrices["{}_train_vecs".format(l2)], matrices["{}_test_vecs".format(l2)]
          score, h1, h2 = evaluate_nnca(l1_train, l1_test, l2_train, l2_test, 
                                dimensions = [dimension], 
                                evaluation_function = comp_scores,
                                neurons = setting["neurons"],
                                activation_function = setting["activation_function"],
                                max_epochs = 200,
                                dropout = setting["dropout"],
                                optimizer = setting["optimizer"],
                                loss = setting["loss_function" ])
          nnca_scores["{}-> {}".format(l1,l2)] = score
    elif embedding_method == "Doc2Vec":
        #Compute score for each nncc Setting:
        for setting in settings_nncc:
          dimension = setting["dimension"]
          l1_train, l1_test = matrices[dimension]["{}_train_vecs".format(l1)], matrices[dimension]["{}_test_vecs".format(l1)]
          l2_train, l2_test = matrices[dimension]["{}_train_vecs".format(l2)], matrices[dimension]["{}_test_vecs".format(l2)]
          score, history = evaluate_nncc(l1_train, l1_test, l2_train, l2_test, 
                                dimensions = [dimension], 
                                evaluation_function = comp_scores,
                                neurons = setting["neurons"],
                                activation_function = setting["activation_function"],
                                max_epochs = 200,
                                dropout = setting["dropout"],
                                optimizer = setting["optimizer"],
                                loss = setting["loss_function" ])
          nncc_scores["{}-> {}".format(l1,l2)] = score
        for setting in settings_nnca:
          dimension = setting["dimension"]
          l1_train, l1_test = matrices[dimension]["{}_train_vecs".format(l1)], matrices[dimension]["{}_test_vecs".format(l1)]
          l2_train, l2_test = matrices[dimension]["{}_train_vecs".format(l2)], matrices[dimension]["{}_test_vecs".format(l2)]
          score, h1, h2 = evaluate_nnca(l1_train, l1_test, l2_train, l2_test, 
                                dimensions = [dimension], 
                                evaluation_function = comp_scores,
                                neurons = setting["neurons"],
                                activation_function = setting["activation_function"],
                                max_epochs = 200,
                                dropout = setting["dropout"],
                                optimizer = setting["optimizer"],
                                loss = setting["loss_function" ])
          nnca_scores["{}-> {}".format(l1,l2)] = score

    #Save Results
    target_dir = main_dir+"nnca_scores_{}_{}".format(embedding_method, dataset)
    with open(target_dir, 'wb') as handle:
        pickle.dump(nnca_scores, handle, protocol=pickle.HIGHEST_PROTOCOL)

    #Save Results
    target_dir = main_dir+"nncc_scores_{}_{}".format(embedding_method, dataset)
    with open(target_dir, 'wb') as handle:
        pickle.dump(nncc_scores, handle, protocol=pickle.HIGHEST_PROTOCOL)
    print(nncc_scores)

      #lca_nn_score = evaluate_single_layer_lca_nn(l1_train, l1_test, l2_train, l2_test, evaluation_function = reciprocal_rank)
      #lca_nn_scores.append(lca_nn_score)
