This repository contains a usable code from the paper:
G. Marra, A. Zugarini, S. Melacci, and M. Maggini, “An unsupervised character-aware neural approach to word and context representation learning,” in Proceedings of the 27th International Conference on Artificial Neural Networks – ICANN 2018
The structure of the project contains:
- A
datafolder, containing the txt files from which to learn the embeddings. - A
logfolder, where the model is saved. - The
char2word.pyscript, which is the main routine. - The
encoder.pyscript, which contains some utility functions.
For a standard learning procedure do the following.
- Be sure both
dataandlogfolder are present. - Put your training data into the
datafolder, with files named asdata(something).txt, e.g.data01.txt. - Simply run
char2word.py
The script will create a vocabulary.txt file inside the data folder to be used during training.
All the configurations are set to the default ones (i.e. the paper ones). The script does not yet provide a command-line configurations (apart for folder configuration, run --helpfor info).
The user willing to have a custom configuration should modify the Config configuration class in the char2word.py script.
We will provide a more user-friendly command-line interface as soon as possible, together with more details about the training procedure and how to incorporate the model inside bigger models.
To see a fast way to exploit already trained embeddings look at the SentenceEncoder class together with the test function.
This code has been tested with tensorflow==1.4 and python2.7. Moreover, it has a dependency with the python library nltk
.