Embeddings

This repository contains word embeddings generated from Spanish corpora.

Corpora used

  • Scielo Full-Text in Spanish: We retrieved all full-text articles available from Scielo.org (until December 2018) and processed them into sentences. The Scielo.org node contains all Spanish-language articles, and thus includes both Latin American and European Spanish.
    • Sentences: 3,267,556
    • Tokens: 100,116,298
  • Wikipedia Health: We retrieved all articles from the following Wikipedia categories: Pharmacology, Pharmacy, Medicine, and Biology. Data were retrieved in December 2018.
    • Sentences: 4,030,833
    • Tokens: 82,006,270
  • Scielo + Wikipedia Health: We concatenated the previous two corpora.

Embeddings generated

Two different approaches were used: Word2Vec and fastText.

Word2Vec

We used the Python Gensim package (https://radimrehurek.com/gensim/index.html) to train different Word2Vec embeddings. The following configurations were set (a minimal training sketch follows the list):

  • Embedding dimension: 50, 150 and 300.
  • Epochs: 15
  • Window size: 10
  • Minimum word count: 5
  • Algorithm: CBOW
  • Corpora: Scielo, Wikipedia and Scielo + Wikipedia
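
As an illustration, here is a minimal training sketch using the gensim 3.x API (the input file name is hypothetical; note that in gensim 4.x the size and iter parameters are named vector_size and epochs):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical input file: one pre-tokenized sentence per line.
sentences = LineSentence("scielo_sentences.txt")

model = Word2Vec(
    sentences,
    size=300,     # embedding dimension (50, 150 or 300)
    window=10,    # window size
    min_count=5,  # minimum word count
    sg=0,         # 0 = CBOW
    iter=15,      # epochs
)

model.save("scielo_w10_c5_300_15epoch.w2vmodel")                    # gensim format
model.wv.save_word2vec_format("W2V_scielo_w10_c5_300_15epoch.txt")  # text format
```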

fastText

We used fastText (https://fasttext.cc/) to train word embeddings, keeping all default training options. The following corpora were used: Scielo, Wikipedia and Scielo + Wikipedia.
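
For reference, a minimal sketch using the fastText Python bindings (the input file name is hypothetical; train_unsupervised defaults to the skip-gram model with 100 dimensions, window 5, 5 epochs and minimum count 5):

```python
import fasttext  # pip install fasttext

# All training options left at their defaults.
model = fasttext.train_unsupervised("scielo_sentences.txt")
model.save_model("Scielo_Fasttext.bin")
```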

Evaluation

Evaluation was carried out both extrinsically, with a Named Entity Recognition framework, and intrinsically, using the three datasets already available for this task: UMNSRS-sim, UMNSRS-rel, and MayoSRS.
Based on the NER results, the best model was the one with 300 dimensions; we also projected the words into two dimensions using Principal Component Analysis.
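
As an illustration of the intrinsic part (a sketch, not the exact pipeline from the paper), model similarities are typically correlated with the human ratings in these datasets; the file name and column layout below are assumptions:

```python
import csv

from gensim.models import KeyedVectors
from scipy.stats import spearmanr

wv = KeyedVectors.load_word2vec_format("W2V_scielo_w10_c5_300_15epoch.txt")

# Hypothetical dataset file; each row assumed to be: term1, term2, human score.
human_scores, model_scores = [], []
with open("UMNSRS_sim.csv", newline="") as f:
    for term1, term2, score in csv.reader(f):
        if term1 in wv and term2 in wv:  # skip out-of-vocabulary pairs
            human_scores.append(float(score))
            model_scores.append(wv.similarity(term1, term2))

rho, _ = spearmanr(human_scores, model_scores)
print(f"Spearman correlation: {rho:.3f}")
```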

Further details about evaluation and the steps performed can be found in our paper in this repository:

Biomedical_Word_Embeddings_for_Spanish__Development_and_Evaluation.pdf

The PCA plots for our embeddings and for a general-domain embedding are also available in this repository:

our_embeddings.pdf
sbwc_embeddings.pdf
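
A plot of this kind can be produced with a sketch like the following; the embedding file and the word list are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

wv = KeyedVectors.load_word2vec_format("W2V_wiki_w10_c5_50_15epoch.txt")

# Illustrative biomedical terms; any list of in-vocabulary words works.
words = [w for w in ["paciente", "hospital", "célula", "proteína", "fármaco"] if w in wv]
coords = PCA(n_components=2).fit_transform([wv[w] for w in words])

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.savefig("pca_plot.pdf")
```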

Directory Structure

The example below shows the structure for the Wikipedia subset with 50 dimensions. All other subsets follow the same structure; a loading sketch follows the listing.

./Wikipedia/Wikipedia_Fasttext.bin # fastText embedding (binary format)
./Wikipedia/Wikipedia_Fasttext.vec # fastText embedding (text format)
./Wikipedia/50/W2V_wiki_w10_c5_50_15epoch.txt # Word2Vec embedding (text format)
./Wikipedia/50/wiki_w10_c5_50_15epoch.w2vmodel # Word2Vec model (gensim format)
./Wikipedia/50/wiki_w10_c5_50_15epoch.w2vmodel.trainables.syn1neg.npy # gensim companion array (negative-sampling weights)
./Wikipedia/50/wiki_w10_c5_50_15epoch.w2vmodel.wv.vectors.npy # gensim companion array (word vectors)
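
These files can be loaded, for example, as follows (a sketch assuming the text files are in standard word2vec format and the fastText Python bindings are installed):

```python
import fasttext
from gensim.models import KeyedVectors, Word2Vec

# Full gensim model: the companion .npy files must sit next to the .w2vmodel file.
model = Word2Vec.load("./Wikipedia/50/wiki_w10_c5_50_15epoch.w2vmodel")

# Vectors only, from the plain-text file.
wv = KeyedVectors.load_word2vec_format("./Wikipedia/50/W2V_wiki_w10_c5_50_15epoch.txt")

# fastText binary model; subword information also covers out-of-vocabulary words.
ft = fasttext.load_model("./Wikipedia/Wikipedia_Fasttext.bin")
print(ft.get_word_vector("medicina")[:5])
```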

Digital Object Identifier (DOI) and access to dataset files

https://doi.org/10.5281/zenodo.2542722

Contact

Felipe Soares (felipe.soares@bsc.es)

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2018 Secretaría de Estado para el Avance Digital (SEAD)
