# EMBEDDINGS CREATION

This script uses the results of the script **Preprocessing.ipynb** to create the embeddings from our data. The embeddings then receive an additional formatting step that makes them suitable for the amplpy library, which is the next step of our work: they are fed to two different optimization algorithms, as reported in the thesis. The embeddings are trained with Gensim's Word2Vec.

Since the aim of this work is to improve an alignment technique, we refer to a source embedding space and a target embedding space.

In [1]:
from gensim.models import Word2Vec
import numpy as np

The following are the names of the preprocessed files from which the source and target embedding spaces are built.

In [2]:
# TO MODIFY IF YOU WANT TO CHANGE THE EMBEDDINGS
SOURCE_PREPROCESSED_TXT = 'p2_0side_2959files_preprocessed.txt'
TARGET_PREPROCESSED_TXT = 'p2_2side_2959files_preprocessed.txt'




# Name of the json file where we will save the word2id information
NAME_JSON_FILE = f'{SOURCE_PREPROCESSED_TXT.replace("preprocessed.txt", "")}' +'&_' + f'{TARGET_PREPROCESSED_TXT.replace("_preprocessed.txt", "")}' + "_word2id.json"
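To make the resulting file name explicit, the naming rule above can be reproduced in isolation. This is only a sketch: `strip_suffix` and `name_json_file` are hypothetical names introduced for illustration, and the file names are the ones defined in the cell above.

```python
SOURCE_PREPROCESSED_TXT = 'p2_0side_2959files_preprocessed.txt'
TARGET_PREPROCESSED_TXT = 'p2_2side_2959files_preprocessed.txt'


def strip_suffix(file_name, suffix="_preprocessed.txt"):
    """Return file_name without the trailing suffix (hypothetical helper)."""
    if file_name.endswith(suffix):
        return file_name[:-len(suffix)]
    return file_name


source_tag = strip_suffix(SOURCE_PREPROCESSED_TXT)
target_tag = strip_suffix(TARGET_PREPROCESSED_TXT)

# Same pattern as NAME_JSON_FILE: <source>_&_<target>_word2id.json
name_json_file = f"{source_tag}_&_{target_tag}_word2id.json"
print(name_json_file)  # p2_0side_2959files_&_p2_2side_2959files_word2id.json
```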

In [3]:
def load_preprocessed_txt(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

#Load the source file
source_text = load_preprocessed_txt(SOURCE_PREPROCESSED_TXT)

#Load the target file
target_text = load_preprocessed_txt(TARGET_PREPROCESSED_TXT)


# Split the text into tokens
tokenized_source_text = source_text.split()
tokenized_target_text = target_text.split()

We now define the models. The two embeddings must be created with the same methodology, so both models share the parameters below.
For further details see the thesis or the Gensim documentation.

In [4]:
VECTOR_SIZE = 300
WINDOW_SIZE = 5
EPOCHS = 5
# 1 selects the Skip-gram model, 0 the CBOW model
SG = 1

In [5]:
source_model = Word2Vec([tokenized_source_text], vector_size=VECTOR_SIZE, window=WINDOW_SIZE, seed=42, epochs=EPOCHS, sg=SG)

In [6]:
target_model = Word2Vec([tokenized_target_text], vector_size=VECTOR_SIZE, window=WINDOW_SIZE, seed=42, epochs=EPOCHS, sg=SG)
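Gensim's Word2Vec expects an iterable of sentences, each a list of tokens, and the cells above pass each corpus as one single sentence. To our knowledge, Gensim's optimized training routines only use the first 10,000 tokens of any sentence, which would matter for corpora longer than that. The sketch below is a possible variation, not the procedure used in the thesis: the hypothetical helper `chunk_tokens` splits a flat token list into fixed-size chunks that could be passed to Word2Vec in place of the single-element list. The chunk size and the toy data are assumptions, and switching to chunks would change the trained embeddings.

```python
def chunk_tokens(tokens, chunk_size=10_000):
    """Split a flat token list into consecutive chunks of at most chunk_size tokens.

    Hypothetical helper: the chunk boundaries are arbitrary and do not
    correspond to real sentence boundaries.
    """
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Toy data standing in for a tokenized corpus
toy_tokens = [f"w{i % 7}" for i in range(25)]
toy_sentences = chunk_tokens(toy_tokens, chunk_size=10)
print([len(sentence) for sentence in toy_sentences])  # [10, 10, 5]

# Possible use with the notebook's variables (not executed here):
# Word2Vec(chunk_tokens(tokenized_source_text), vector_size=VECTOR_SIZE, ...)
```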

## 1.1 Saving the models

Gensim has its own format for saving and loading embedding models.
The following cell saves the trained models; a saved model can later be loaded back with `Word2Vec.load`.
Each model is saved inside a directory called **tmp**, which is created if it does not exist.

In [7]:
import os
import tempfile

# Specify the folder where you want to create the temporary file
tmp_dir = "tmp"

# Ensure the temporary directory exists, create if not
os.makedirs(tmp_dir, exist_ok=True)

def save_gensim_model(model, name):
    # Reserve a uniquely named file with a specific prefix in the specified directory
    prefix = f'gensim-model-{name.replace("_preprocessed.txt", "")}'
    with tempfile.NamedTemporaryFile(prefix=prefix, dir=tmp_dir, delete=False) as tmp:
        temporary_filepath = tmp.name

    # Save once the file handle is closed, so the path can be reopened on any platform
    model.save(temporary_filepath)

    # The model is now safely stored in the filepath.
    # You can copy it to other machines, share it with others, etc.


save_gensim_model(source_model, SOURCE_PREPROCESSED_TXT)
save_gensim_model(target_model, TARGET_PREPROCESSED_TXT)
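Because `NamedTemporaryFile` appends a random suffix to the prefix, the exact file name of a saved model is not known in advance. A minimal sketch of how a saved model could be located and loaded again is given below. The helpers `find_saved_models` and `load_gensim_model` are hypothetical names, not part of Gensim; only `Word2Vec.load` is the library call. We also assume that Gensim may write side-car `.npy` files next to large models, which the search skips.

```python
import glob
import os


def find_saved_models(tmp_dir, name):
    """Return the model files in tmp_dir saved for `name`, oldest first."""
    prefix = f'gensim-model-{name.replace("_preprocessed.txt", "")}'
    pattern = os.path.join(glob.escape(tmp_dir), glob.escape(prefix) + "*")
    # Skip side-car array files (assumed to end in .npy)
    candidates = [p for p in glob.glob(pattern) if not p.endswith(".npy")]
    return sorted(candidates, key=os.path.getmtime)


def load_gensim_model(path):
    """Load a model previously stored with model.save(path)."""
    from gensim.models import Word2Vec  # imported lazily, only needed here
    return Word2Vec.load(path)


# Possible use (assumes at least one model was saved in "tmp"):
# paths = find_saved_models("tmp", SOURCE_PREPROCESSED_TXT)
# source_model = load_gensim_model(paths[-1])  # most recently saved
```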

We now check how many words the two models have in common.

In [8]:
common_words = set(source_model.wv.index_to_key) & set(target_model.wv.index_to_key)
len(common_words)

4367

## 1.2 Creation of the matrices and word2id dictionary

For the next step we need to create the NumPy matrices and the word2id dictionary. A NumPy matrix alone loses the information about which word each vector represents. The dictionary prevents this: every row of the matrix corresponds to one word, and the word2id dictionary (saved as a JSON file) stores the row index of each word.

In [9]:
# Create a list with the common words
common_words_list = list(common_words)

# The list order defines which word corresponds to each row of the X and Y matrices

# Example
# Suppose row_idx is the index of the row you are interested in
row_idx = 11

# Now you can get the corresponding word from the list
corresponding_word = common_words_list[row_idx]
print(f"The word corresponding to row {row_idx} is: {corresponding_word}")

The word corresponding to row 11 is: reti
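Because `common_words` is a set of strings, the order in which the words enter the list can differ between Python sessions (string hashing is randomized by default), so row 11 may hold a different word on another run. If a reproducible row order is wanted, one option, not used in the original notebook, is to sort the common words. The toy vocabularies below are made up for illustration.

```python
# Toy vocabularies standing in for source_model.wv.index_to_key
# and target_model.wv.index_to_key
toy_source_vocab = ["reti", "dati", "modello", "grafo"]
toy_target_vocab = ["modello", "reti", "nodo", "dati"]

# Words present in both vocabularies
toy_common_words = set(toy_source_vocab) & set(toy_target_vocab)

# Sorting fixes the row order independently of the set's internal order
toy_common_words_list = sorted(toy_common_words)
print(toy_common_words_list)  # ['dati', 'modello', 'reti']
```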


### 1.2.1 Creation of the json file

The json file stores the word2id dictionary and is fundamental for the next step, the alignment via the amplpy modules.

In [10]:
import json

# Create a dictionary with the common words as keys and their indices as values
word_index_dict = {word: index for index, word in enumerate(common_words_list)}

# Save the dictionary to a JSON file
with open(NAME_JSON_FILE, 'w') as file:
    json.dump(word_index_dict, file)
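The next step has to read this file back. The sketch below shows a round trip on toy data and how the inverse mapping (row index to word) can be rebuilt from it. The file name `toy_word2id.json`, the toy words, and the names `word2id`, `loaded` and `id2word` are assumptions made for the example.

```python
import json
import os
import tempfile

# Toy word list standing in for common_words_list
toy_words = ["dati", "modello", "reti"]
word2id = {word: index for index, word in enumerate(toy_words)}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_word2id.json")

    # Write the dictionary, then read it back
    with open(path, "w", encoding="utf-8") as file:
        json.dump(word2id, file, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as file:
        loaded = json.load(file)

# Inverse mapping: from a matrix row index to the word it represents
id2word = {index: word for word, index in loaded.items()}
print(id2word[2])  # reti
```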


### 1.2.2 Creation of the embedding matrices

In [11]:
# Iterate over common_words_list so that row i matches index i in the word2id dictionary
X = np.array([source_model.wv[word] for word in common_words_list])
Y = np.array([target_model.wv[word] for word in common_words_list])


Finally, we save the matrices. Note that `np.save` appends the `.npy` extension to the file names.

In [12]:
np.save(f'X_source_embeddings_{SOURCE_PREPROCESSED_TXT.replace("_preprocessed.txt", "")}', X)

np.save(f'Y_target_embeddings_{TARGET_PREPROCESSED_TXT.replace("_preprocessed.txt", "")}', Y)
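As a quick sanity check, a saved matrix can be reloaded and compared with the number of words in the dictionary. The sketch below does this on a toy matrix. The file name, the toy values and the 3 x 2 shape are assumptions, not the real embeddings.

```python
import os
import tempfile

import numpy as np

# Toy data: 3 words, 2-dimensional vectors
toy_words = ["dati", "modello", "reti"]
toy_X = np.arange(6, dtype=np.float32).reshape(3, 2)

with tempfile.TemporaryDirectory() as folder:
    base = os.path.join(folder, "X_source_embeddings_toy")

    # np.save adds the .npy extension because the name has none
    np.save(base, toy_X)
    saved_files = os.listdir(folder)

    restored = np.load(base + ".npy")

print(saved_files)  # ['X_source_embeddings_toy.npy']

# One row per word, in the same order as the word list
assert restored.shape == (len(toy_words), 2)
```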

These matrices and the json file will be necessary for the next step involving the use of amplpy. See the script **ampl_test_norm1.ipynb**.