## Imports

In [1]:
from preprocess import preprocess_dataset, generate_samples
from tagger import POS_Tagger
import tensorflow as tf
import keras
import requests
import os

In [2]:
print(tf.__version__)

2.11.0


In [3]:
print(keras.__version__)

2.11.0


## Datasets and preprocessing

The three datasets used during our tests are accessible on the following GitHub repositories:
- English dataset: https://github.com/UniversalDependencies/UD_English-EWT
- French dataset: https://github.com/UniversalDependencies/UD_French-GSD
- Italian dataset: https://github.com/UniversalDependencies/UD_Italian-ISDT

The code below downloads the three splits (train, dev/validation, test) of each dataset into the current working directory. Comment out the lines for any dataset you don't need, but make the same change in the preprocessing steps that follow.

In [4]:
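def get_dataset(url, name):
    """Download the file at `url` and save it as `name` in the working directory."""
    filename = os.path.join(os.getcwd(), name)

    r = requests.get(url)  # fetch the raw .conllu file
    r.raise_for_status()  # fail on an HTTP error instead of saving an error page
    with open(filename, 'w', encoding="utf-8") as f:  # and write it to a file
        f.write(r.text)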

In [5]:
# English treebanks
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-train.conllu", "en_ewt-ud-train.conllu")
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu", "en_ewt-ud-dev.conllu")
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-test.conllu", "en_ewt-ud-test.conllu")

# French treebanks
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_French-GSD/master/fr_gsd-ud-train.conllu", "fr_gsd-ud-train.conllu")
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_French-GSD/master/fr_gsd-ud-dev.conllu", "fr_gsd-ud-dev.conllu")
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_French-GSD/master/fr_gsd-ud-test.conllu", "fr_gsd-ud-test.conllu")

# Italian treebanks
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-ISDT/master/it_isdt-ud-train.conllu", "it_isdt-ud-train.conllu")
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-ISDT/master/it_isdt-ud-dev.conllu", "it_isdt-ud-dev.conllu")
get_dataset("https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-ISDT/master/it_isdt-ud-test.conllu", "it_isdt-ud-test.conllu")
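The nine calls above follow one naming pattern, so the list of files can also be generated instead of typed out. The following is a minimal sketch, not part of the original notebook: `BASE_URL`, `TREEBANKS`, `SPLITS`, and `treebank_files` are hypothetical names, and the URL layout is simply the one used in the calls above.

```python
# Hypothetical helper: rebuilds the (url, filename) pairs used in the download cell.
BASE_URL = "https://raw.githubusercontent.com/UniversalDependencies"

# File prefix -> repository name, taken from the URLs above.
TREEBANKS = {
    "en_ewt": "UD_English-EWT",
    "fr_gsd": "UD_French-GSD",
    "it_isdt": "UD_Italian-ISDT",
}
SPLITS = ("train", "dev", "test")


def treebank_files(treebanks=TREEBANKS, splits=SPLITS):
    """Yield (url, filename) for every split of every treebank."""
    for prefix, repo in treebanks.items():
        for split in splits:
            filename = f"{prefix}-ud-{split}.conllu"
            yield f"{BASE_URL}/{repo}/master/{filename}", filename


# Print the pairs; each one could be passed to get_dataset(url, filename).
for url, filename in treebank_files():
    print(filename, "<-", url)
```

Removing an entry from `TREEBANKS` skips that language, which keeps the download step and the later steps easier to keep in sync.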

Next, we preprocess the downloaded datasets, stripping comments, multiword tokens, and empty tokens. This produces new .conllu files, which we then use to build the data structures holding the inputs and targets of the PoS tagger. As before, comment out the lines for any dataset you did not download.

This makes use of the *preprocess_dataset* function, defined in the preprocess.py file.

In [6]:
# Preprocessing the English treebanks
preprocess_dataset("en_ewt-ud-train.conllu", "my_en_train.conllu")
preprocess_dataset("en_ewt-ud-dev.conllu", "my_en_dev.conllu")
preprocess_dataset("en_ewt-ud-test.conllu", "my_en_test.conllu")

# Preprocessing the French treebanks
preprocess_dataset("fr_gsd-ud-train.conllu", "my_fr_train.conllu")
preprocess_dataset("fr_gsd-ud-dev.conllu", "my_fr_dev.conllu")
preprocess_dataset("fr_gsd-ud-test.conllu", "my_fr_test.conllu")

# Preprocessing the Italian treebanks
preprocess_dataset("it_isdt-ud-train.conllu", "my_it_train.conllu")
preprocess_dataset("it_isdt-ud-dev.conllu", "my_it_dev.conllu")
preprocess_dataset("it_isdt-ud-test.conllu", "my_it_test.conllu")
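The implementation of preprocess_dataset lives in preprocess.py and is not shown here. As a rough illustration of what such a stripping step involves, the sketch below filters CoNLL-U lines by hand. It relies only on the CoNLL-U conventions that comment lines start with `#`, multiword tokens have a range ID such as `1-2`, and empty tokens (called empty nodes in the CoNLL-U format) have a decimal ID such as `8.1`. The function name `strip_conllu` is hypothetical.

```python
def strip_conllu(lines):
    """Keep only regular token lines and the blank lines that separate sentences."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        if not line.strip():
            kept.append("")  # blank line marks a sentence boundary
            continue
        token_id = line.split("\t", 1)[0]
        if "-" in token_id or "." in token_id:
            continue  # multiword token range (1-2) or empty node (8.1)
        kept.append(line)
    return kept


sample = [
    "# text = Don't go",
    "1-2\tDon't\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tDo\tdo\tAUX\t_\t_\t3\taux\t_\t_",
    "2\tn't\tnot\tPART\t_\t_\t3\tadvmod\t_\t_",
    "3\tgo\tgo\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
for row in strip_conllu(sample):
    print(row)
```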

Now we just need to generate the "proper" datasets that will be used with our PoS Tagger. For this, we'll make use of the *generate_samples* function, defined in the preprocess.py file.

In [7]:
# English samples
en_train_dataset = generate_samples("my_en_train.conllu")
en_val_dataset = generate_samples("my_en_dev.conllu")
en_test_dataset = generate_samples("my_en_test.conllu")

# French samples
fr_train_dataset = generate_samples("my_fr_train.conllu")
fr_val_dataset = generate_samples("my_fr_dev.conllu")
fr_test_dataset = generate_samples("my_fr_test.conllu")

# Italian samples
it_train_dataset = generate_samples("my_it_train.conllu")
it_val_dataset = generate_samples("my_it_dev.conllu")
it_test_dataset = generate_samples("my_it_test.conllu")
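The exact structure returned by generate_samples is defined in preprocess.py and is not shown in this notebook. One plausible shape, sketched here purely as an assumption, is a list of sentences in which each sentence pairs its word forms with their UPOS tags. In CoNLL-U the word form is the second column and the UPOS tag is the fourth. `read_samples` is a hypothetical name.

```python
def read_samples(lines):
    """Group CoNLL-U token lines into one (words, tags) pair per sentence."""
    samples, words, tags = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if words:
                samples.append((words, tags))
                words, tags = [], []
            continue
        if line.startswith("#"):
            continue  # skip comments if any are left
        columns = line.split("\t")
        words.append(columns[1])  # FORM
        tags.append(columns[3])   # UPOS
    if words:  # file may not end with a blank line
        samples.append((words, tags))
    return samples


toy_lines = [
    "1\tI\tI\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tlike\tlike\tVERB\t_\t_\t0\troot\t_\t_",
    "3\ttrees\ttree\tNOUN\t_\t_\t2\tobj\t_\t_",
    "",
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_",
]
for words, tags in read_samples(toy_lines):
    print(list(zip(words, tags)))
```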

## Building and training the models

Now that we have our datasets prepared, we can proceed with the creation and training of the models that will perform the PoS Tagging. For that, we'll make use of the POS_Tagger class, defined in the tagger.py file, which allows us to easily perform all needed tasks in a concise manner.

To begin, we pass our datasets to the class in the order (train, validation, test). Only the training dataset is required. The other two can be omitted, at the cost of some functionality: without a validation set there are no validation metrics during training, and without a test set the final model can only be tested by passing a dataset explicitly, in the correct format.

In [8]:
english_tagger = POS_Tagger(en_train_dataset, en_val_dataset, en_test_dataset)

french_tagger = POS_Tagger(fr_train_dataset, fr_val_dataset, fr_test_dataset)

italian_tagger = POS_Tagger(it_train_dataset, it_val_dataset, it_test_dataset)

The first step is to build and compile the model with the build() method of the POS_Tagger class. It lets us specify the number of PoS labels and the output dimensions of the Embedding and LSTM layers. None of this is strictly needed, as the default values work well enough.

In [9]:
english_tagger.build()

french_tagger.build()

italian_tagger.build()
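The internals of build() are not shown in this notebook. Assuming a plain stack of an Embedding layer, a single unidirectional LSTM layer, and a Dense softmax layer (an assumption, not a description of tagger.py), the two output dimensions largely determine the size of the model. The sketch below does the parameter arithmetic for such a stack. The function name and the example sizes are hypothetical, apart from the 17 tags of the Universal Dependencies UPOS tag set.

```python
def count_parameters(vocab_size, embedding_dim, lstm_units, n_tags=17):
    """Trainable parameters of an assumed Embedding -> LSTM -> Dense stack."""
    # One embedding vector per vocabulary entry.
    embedding = vocab_size * embedding_dim
    # Four gates, each with input weights, recurrent weights, and a bias.
    lstm = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
    # One weight per (LSTM unit, tag) pair plus one bias per tag.
    dense = lstm_units * n_tags + n_tags
    return {
        "embedding": embedding,
        "lstm": lstm,
        "dense": dense,
        "total": embedding + lstm + dense,
    }


# Example sizes chosen for illustration only.
for name, value in count_parameters(20000, 64, 128).items():
    print(f"{name}: {value}")
```

With these illustrative sizes the embedding table holds most of the parameters, so the embedding dimension and the vocabulary size matter more for model size than the number of LSTM units.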

Next comes the training step, performed with the train() method of our tagger objects. The number of training epochs can be set; by default the model trains for 10 epochs, which can take a while depending on the computer's power.

In [10]:
english_tagger.train()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [17]:
french_tagger.train()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [18]:
italian_tagger.train()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Evaluating the models and making predictions

Once the taggers are trained, we can use them to predict PoS tags. Since we have a test set, it is useful to evaluate the models before "freely" tagging sentences with them. We do so with the test() method of the tagger objects. It accepts a specific test set; if none is given, it falls back to the test set supplied when the tagger was created, if there was one. A tagger must be built and trained before it is tested, otherwise test() raises an AssertionError. This is what happens in the French cell below, which was run (In [14]) before french_tagger.train() (In [17]).

In [13]:
english_tagger.test()

test loss, test acc: [0.38438504934310913, 0.8928030729293823]


In [14]:
french_tagger.test()

AssertionError: You must first build and train a model, by using build() and train(), before testing

In [None]:
italian_tagger.test()
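test() reports a loss and an accuracy, as in the English output above. How the class computes that accuracy is not shown here. The sketch below illustrates the usual idea of token-level accuracy: count the correctly predicted tags over all tokens and skip padding positions. The `PAD` label and the function name are hypothetical.

```python
PAD = "<PAD>"  # hypothetical label for padded positions


def token_accuracy(gold_sentences, predicted_sentences, pad=PAD):
    """Fraction of non-padding tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for gold_tag, predicted_tag in zip(gold, predicted):
            if gold_tag == pad:
                continue  # padding does not count as a token
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    return correct / total if total else 0.0


gold = [["DET", "NOUN", "VERB"], ["PRON", "VERB", PAD]]
predicted = [["DET", "NOUN", "NOUN"], ["PRON", "VERB", "VERB"]]
print(token_accuracy(gold, predicted))
```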

With this done, we can start tagging user-defined sentences with the predict_sentence() method of the taggers. The sentence is passed as an argument; if none is given, a default sentence is used. Keep in mind that the default sentence is in English, so it won't work well with taggers for other languages.

In [15]:
english_tagger.predict_sentence()

DET => this
AUX => is
DET => a
NOUN => sample
NOUN => sentence


In [16]:
english_tagger.predict_sentence("my name is John and I like trees")

PRON => my
NOUN => name
AUX => is
PROPN => John
CCONJ => and
PRON => I
VERB => like
NOUN => trees


In [None]:
french_tagger.predict_sentence("le chat mange un gros poisson")

In [None]:
italian_tagger.predict_sentence("Pedro saltò da un albero e si ruppe la mano")
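predict_sentence() prints one `TAG => word` line per token. The sketch below mimics only that final step, under stated assumptions: the sentence is split on whitespace and each token receives the tag with the highest score. The scores are made-up toy numbers standing in for model output, and `tag_sentence` is a hypothetical name, not the class's implementation.

```python
def tag_sentence(sentence, scores, tag_names):
    """Pair each whitespace-separated word with its highest-scoring tag."""
    words = sentence.split()
    tagged = []
    for word, row in zip(words, scores):
        best = max(range(len(row)), key=row.__getitem__)  # argmax over tags
        tagged.append((tag_names[best], word))
    return tagged


tag_names = ["DET", "NOUN", "VERB"]
# Toy per-token scores, one row per word and one column per tag.
toy_scores = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.10, 0.70],
]
for tag, word in tag_sentence("the cat sleeps", toy_scores, tag_names):
    print(f"{tag} => {word}")
```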

### Appendix 1: Saving and loading models

The POS_Tagger class implements the usual TensorFlow save/load functionality, so a model does not have to be retrained every time (which can take a while). Once we have a satisfactory model, we can save it by calling the save() method. The model is saved to a folder in the current working directory, and the first argument of the method sets the folder's name.

In [None]:
english_tagger.save("en_model")

In [None]:
french_tagger.save("fr_model")

In [None]:
italian_tagger.save("it_model")

In order to load a model saved in this manner to a POS_Tagger object, we just need to call the load() method, passing the folder path as an argument.

In [None]:
# Because of how the class is implemented, we still need to provide a
# training dataset, although we won't use it.
italian_tagger_2 = POS_Tagger(it_train_dataset)
italian_tagger_2.load("it_model")

Once the model is loaded, we can skip the build/train/test part and directly use it. As we loaded the same model, the prediction for the sentence used before should be identical.

In [None]:
italian_tagger_2.predict_sentence("Pedro saltò da un albero e si ruppe la mano")