This repository contains the word embeddings generated from Spanish corpora.
- Scielo Full-Text in Spanish: We retrieved all full-text articles available on Scielo.org (up to December 2018) and split them into sentences. The Scielo.org node contains all Spanish-language articles and therefore covers both Latin American and European Spanish.
  - Sentences: 3,267,556
  - Tokens: 100,116,298
- Wikipedia Health: We retrieved all articles from the following Wikipedia categories: Pharmacology, Pharmacy, Medicine and Biology. The data were retrieved in December 2018.
  - Sentences: 4,030,833
  - Tokens: 82,006,270
- Scielo + Wikipedia Health: We concatenated the previous two corpora.
Two different approaches were used: Word2Vec and fastText.
We used the Python Gensim package (https://radimrehurek.com/gensim/index.html) to train several Word2Vec embeddings with the following settings:
- Embedding dimension: 50, 150 and 300.
- Epochs: 15
- Window size: 10
- Minimum word count: 5
- Algorithm: CBOW
- Corpora: Scielo, Wikipedia and Scielo + Wikipedia
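The settings listed above can be written down as keyword arguments for Gensim's `Word2Vec` class. This is a minimal sketch, not the original training script: the parameter names follow Gensim 4.x (older 3.x releases call them `size` and `iter` instead of `vector_size` and `epochs`), and the helper names and the one-sentence-per-line corpus file are assumptions made for illustration.

```python
def word2vec_kwargs(dimension):
    """Collect the listed settings as keyword arguments (Gensim 4.x names)."""
    return {
        "vector_size": dimension,  # embedding dimension: 50, 150 or 300
        "epochs": 15,
        "window": 10,
        "min_count": 5,
        "sg": 0,  # 0 selects CBOW; 1 would select skip-gram
    }


def train(corpus_path, dimension):
    """Hypothetical helper: train one model from a tokenized corpus file."""
    # Imported lazily so the settings can be inspected without Gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # LineSentence expects one whitespace-tokenized sentence per line (assumed layout).
    sentences = LineSentence(corpus_path)
    return Word2Vec(sentences=sentences, **word2vec_kwargs(dimension))


# One configuration per embedding dimension listed above.
configs = {dim: word2vec_kwargs(dim) for dim in (50, 150, 300)}
```

Calling `train("scielo_sentences.txt", 300)` (a hypothetical file name) would then train a single 300-dimensional CBOW model; repeating the call per corpus and dimension covers every combination.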
We used fastText (https://fasttext.cc/) to train word embeddings, keeping all default training options. The following corpora were used: Scielo, Wikipedia and Scielo + Wikipedia.
Evaluation was both extrinsic, using a Named Entity Recognition (NER) framework, and intrinsic, using three existing datasets for that task: UMNSRS-sim, UMNSRS-rel, and MayoSRS.
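Intrinsic evaluation on similarity and relatedness datasets of this kind is commonly scored by correlating human ratings with the cosine similarity of the corresponding word vectors. The sketch below assumes Spearman rank correlation and skips pairs with out-of-vocabulary words; this text does not state which metric or out-of-vocabulary handling was actually used, and the toy vectors and scored pairs are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, scored_pairs):
    """Return (Spearman rho, number of pairs covered by the vocabulary)."""
    gold, predicted = [], []
    for word1, word2, score in scored_pairs:
        if word1 in vectors and word2 in vectors:
            gold.append(score)
            predicted.append(cosine(vectors[word1], vectors[word2]))
    return spearman(gold, predicted), len(gold)


# Hypothetical toy data; real use would load the embeddings and a rated pair list.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_pairs = [("a", "b", 0.9), ("a", "d", 0.5), ("a", "c", 0.1), ("a", "missing", 0.3)]
rho, covered = evaluate(toy_vectors, toy_pairs)
```

Reporting the number of covered pairs alongside the correlation matters because a small vocabulary can silently shrink the evaluation set.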
Based on the NER results, we determined that the 300-dimensional model performed best, and we projected its word vectors using Principal Component Analysis (PCA).
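A PCA projection of word vectors onto two dimensions can be sketched as follows. This is an illustration using NumPy's singular value decomposition, not necessarily the procedure used for the plots; the function name and the random stand-in vectors are assumptions.

```python
import numpy as np


def pca_project(matrix, n_components=2):
    """Project the rows of `matrix` onto their top principal components."""
    # PCA operates on mean-centered data.
    centered = matrix - matrix.mean(axis=0)
    # Rows of `vt` are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-in for 20 word vectors with 300 dimensions each.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(20, 300))
coords = pca_project(toy_vectors)  # one (x, y) point per word
```

Each row of `coords` can then be drawn as a labeled point in a scatter plot, one per word.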
Further details about evaluation and the steps performed can be found in our paper in this repository:
The PCA plots for our embedding and a general-domain embedding are available in this repository also:
The example below shows the file structure for the Wikipedia subset with 50 dimensions. All other subsets have the same structure.
./Wikipedia/Wikipedia_Fasttext.bin # fastText embedding in binary file
./Wikipedia/Wikipedia_Fasttext.vec # fastText embedding in text file
./Wikipedia/50/W2V_wiki_w10_c5_50_15epoch.txt # Word2Vec in text file
./Wikipedia/50/wiki_w10_c5_50_15epoch.w2vmodel # Word2Vec in gensim file
./Wikipedia/50/wiki_w10_c5_50_15epoch.w2vmodel.trainables.syn1neg.npy # Word2Vec in gensim file
./Wikipedia/50/wiki_w10_c5_50_15epoch.w2vmodel.wv.vectors.npy # Word2Vec in gensim file
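The text files can be read without any extra dependency, assuming they follow the usual word2vec text layout: a header line with the vocabulary size and the dimension, then one word per line followed by its vector components. That layout is an assumption here, not something the file list confirms, and the function name and sample content below are hypothetical.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format into a {word: [floats]} dictionary."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


# In-memory sample standing in for a real embedding file.
sample = io.StringIO("2 3\nmedicina 0.1 0.2 0.3\nfármaco 0.4 0.5 0.6\n")
vectors, dim = read_word2vec_text(sample)
```

With a real file, pass a handle such as `open("./Wikipedia/50/W2V_wiki_w10_c5_50_15epoch.txt", encoding="utf-8")` (the encoding is assumed). Gensim's `KeyedVectors.load_word2vec_format` reads the same layout, and the `.w2vmodel` files are loaded with `Word2Vec.load`, which expects the accompanying `.npy` files to sit in the same directory.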
Digital Object Identifier (DOI) and access to dataset files
Felipe Soares (firstname.lastname@example.org)
This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2018 Secretaría de Estado para el Avance Digital (SEAD)