The source for the data is the Italian Wikipedia, downloaded from Wikipedia Dumps.
The goal is to produce a single text file with the content of the Wikipedia pages, tokenized and separated by whitespace. The usual approach to tokenization is to remove punctuation, but I want word embeddings for punctuation too, because I don't want to discard any information provided by an input sentence. To produce this kind of input, and to keep the tokenization used to train the word embeddings aligned with the tokenization used at runtime, I chose SpaCy for its great power and speed. SpaCy comes with word embeddings of this kind for the English language.
Two types of preprocessing have been tried:
- using spacy-dev-resources
- using wikiextractor + SpaCy for tokenization (a sketch of this step follows the list)
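A minimal sketch of the tokenization step, assuming the text extracted by wikiextractor is available as plain text with one article per line (the file names extracted.txt and tokenized.txt are hypothetical):

```python
import spacy

# A blank Italian pipeline is enough here: only the tokenizer is needed.
nlp = spacy.blank('it')

with open('extracted.txt', encoding='utf-8') as fin, \
        open('tokenized.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        doc = nlp(line.strip())
        # Keep punctuation as tokens and join everything with single spaces,
        # so the training corpus matches the runtime tokenization.
        fout.write(' '.join(tok.text for tok in doc if not tok.is_space) + '\n')
```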
GloVe is used to produce a text file that contains:
number_of_vectors vector_length
WORD1 values_of_word_1
WORD2 values_of_word_2
...
From the text-file representation of the word embeddings, a binary representation is built, ready to be loaded into SpaCy.
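A minimal sketch of this conversion, assuming a spaCy v2-style Vectors API (the file names are hypothetical):

```python
import numpy
from spacy.vectors import Vectors

# Read the GloVe text output: a header line, then one word per line.
with open('vectors.txt', encoding='utf-8') as f:
    n_vectors, vector_length = map(int, f.readline().split())
    data = numpy.zeros((n_vectors, vector_length), dtype='float32')
    keys = []
    for i, line in enumerate(f):
        pieces = line.rstrip().split(' ')
        keys.append(pieces[0])
        data[i] = numpy.asarray(pieces[1:], dtype='float32')

# Build the binary representation and save it to disk.
vectors = Vectors(data=data, keys=keys)
vectors.to_disk('vectors')
```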
The whole SpaCy model (a blank Italian nlp plus the word vectors) is saved and packaged using script number 3.
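A rough sketch of what this step amounts to (the directory names are hypothetical, and the actual script may differ):

```python
import spacy

# Attach the binary vectors to a blank Italian pipeline and save it.
nlp = spacy.blank('it')
nlp.vocab.vectors.from_disk('vectors')
nlp.to_disk('it_vectors_wiki_lg')

# The saved directory can then be wrapped into an installable package with
# the SpaCy CLI, e.g.: python -m spacy package it_vectors_wiki_lg ./dist
```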
Option 1: do the preceding steps to train the vectors and then load the vectors with nlp.vocab.vectors.from_disk('path').
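For example (the path is a placeholder for wherever the binary vectors were saved):

```python
import spacy

# Load the trained vectors into a blank Italian pipeline.
nlp = spacy.blank('it')
nlp.vocab.vectors.from_disk('path')

print(nlp.vocab.vectors.shape)    # (number_of_vectors, vector_length)
print(nlp('ciao')[0].vector[:5])  # first components of the vector for "ciao"
```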
Option 2: install the complete model from the latest release with pip, using the following command:
pip install -U https://github.com/MartinoMensio/it_vectors_wiki_spacy/releases/download/v1.0.1/it_vectors_wiki_lg-1.0.1.tar.gz
then simply load the model in SpaCy with nlp = spacy.load('it_vectors_wiki_lg').
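A quick check that the installed model provides vectors (the example words are arbitrary):

```python
import spacy

nlp = spacy.load('it_vectors_wiki_lg')
print(nlp('gatto')[0].has_vector)            # True if "gatto" is in the vocabulary
print(nlp('gatto').similarity(nlp('cane')))  # similarity between two words
```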
If you want to use the vectors in another environment (outside SpaCy), you can find the raw embeddings in the vectors-1.0 release.
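For example, assuming the released file follows the text format described above (a header line followed by one word per line, i.e. the word2vec text format), it can be loaded with gensim:

```python
from gensim.models import KeyedVectors

# The file name is hypothetical; adjust it to the file in the release.
kv = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
print(kv.most_similar('roma', topn=5))
```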
The questions-words-ITA.txt file comes from http://hlt.isti.cnr.it/wordembeddings/ as part of the paper:
@inproceedings{berardi2015word,
  title={Word Embeddings Go to Italy: A Comparison of Models and Training Datasets.},
  author={Berardi, Giacomo and Esuli, Andrea and Marcheggiani, Diego},
  booktitle={IIR},
  year={2015}
}
The preprocessing together with the new Wikipedia dump gives the following result (script accuracy.py): 58.14%, which seems to be an improvement with respect to the scores reported in the paper.
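As a rough sketch of how such an analogy evaluation can be run (the actual accuracy.py script may work differently), gensim provides a ready-made routine:

```python
from gensim.models import KeyedVectors

# Load the raw text vectors and evaluate them on the Italian analogy questions.
kv = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
score, sections = kv.evaluate_word_analogies('questions-words-ITA.txt')
print('overall accuracy: {:.2%}'.format(score))
```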