Improving Word Meaning Representations using Wikipedia Categories

global_context


We extend the Skip-Gram and Continuous Bag-of-Words models with global context information. We use the Wikipedia corpus, whose articles are organized in a hierarchy of categories that provide useful topical information about each article. We present several approaches to enriching word meaning representations with this kind of information.

We experiment with the English Wikipedia and evaluate our models on standard word similarity and word analogy datasets. The proposed models significantly outperform other word representation methods when training data of similar size are used, and perform comparably to methods trained on much larger datasets.
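As a sketch of the general idea (the exact training objectives are defined in the papers below; the `CAT_` prefix here is purely illustrative, though the released category mapping file is described as "prefixed"), an article's category labels can be injected as extra pseudo-tokens into its token stream, so that a standard Skip-Gram/CBOW trainer learns category vectors in the same contexts as ordinary words:

```python
def augment_with_categories(article_tokens, categories):
    """Prepend one pseudo-token per Wikipedia category (prefixed so the
    category tokens stay distinct from the ordinary vocabulary) to the
    article's token stream before word2vec-style training."""
    category_tokens = ["CAT_" + c.replace(" ", "_") for c in categories]
    return category_tokens + article_tokens

# Toy article with hypothetical category labels.
augmented = augment_with_categories(
    ["the", "guitar", "has", "six", "strings"],
    ["Musical instruments", "String instruments"],
)
print(augmented[:2])  # the category pseudo-tokens now precede the article text
```

Because the category tokens co-occur with every word of the article, they act as a global topical signal shared across the whole document rather than a purely local context window.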

Getting Started

See our publications:

http://www.nnw.cz/doi/2018/NNW.2018.28.029.pdf

and

https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/download/3268/2686

Please cite both articles:

@article{svoboda2018improving,
  title={Improving Word Meaning Representations using Wikipedia Categories},
  author={Svoboda, Luk{\'a}{\v{s}} and Brychc{\'\i}n, Tom{\'a}{\v{s}}},
  journal={Neural Network World},
  volume={28},
  pages={523--534},
  year={2018}
}

and

@article{svoboda2019enriching,
  title={Enriching Word Embeddings with Global Information and Testing on Highly Inflected Language},
  author={Svoboda, Luk{\'a}{\v{s}} and Brychc{\'\i}n, Tom{\'a}{\v{s}}},
  journal={Computaci{\'o}n y Sistemas},
  volume={23},
  number={3},
  year={2019}
}

Download corpus

Categories mapping:

https://uloz.to/!jqKCLASD6MmL/categories-filtmin10-prefixed-txt-zip

Wikipedia articles:

https://uloz.to/!nZLUURC15rkv/wiki-filteredmin10-txt-zip

Another format, with one sentence per line:

https://uloz.to/tam/_065GP5uM4VOj

Alternative download:

https://drive.google.com/drive/folders/1uv18K8ZGaLzMuzfgSRQERJqZdsn51o9B?usp=sharing

Models

Trained models can be downloaded from the following links:

CBOW

https://uloz.to/!usCgCpNUbMEZ/trained-cbow-vec-zip

Skip-gram

https://uloz.to/!EJ8rDhVkMKoV/vectors-skip-cat-bin-zip

Or here:

https://drive.google.com/drive/folders/1uv18K8ZGaLzMuzfgSRQERJqZdsn51o9B?usp=sharing

CZ corpus:

https://drive.google.com/drive/folders/1rraa5-FGW-AfLeU3StVbsjN0YjDbUWe6?usp=sharing

Use with

The corpus and models can be used with the Word2Vec, LexVec, or fastText tools; a clean standalone implementation has yet to be done.
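Assuming the released `.vec` files use the standard word2vec text format (a header line with vocabulary size and dimensionality, then one `word v1 v2 …` line per word — the format the tools above produce, but verify against the actual downloads), a minimal dependency-free loader can be sketched as:

```python
import math

def load_vec(path):
    """Parse a word2vec text-format file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            word, vals = parts[0], [float(x) for x in parts[1:]]
            if len(vals) == dim:          # skip any malformed lines
                vectors[word] = vals
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Demo on a tiny hand-written file standing in for the downloaded model.
with open("tiny.vec", "w", encoding="utf-8") as f:
    f.write("2 3\nking 1.0 0.0 1.0\nqueen 0.9 0.1 0.9\n")

vecs = load_vec("tiny.vec")
print(round(cosine(vecs["king"], vecs["queen"]), 3))  # → 0.997
```

For real workloads, a library loader such as gensim's `KeyedVectors.load_word2vec_format` is the usual choice; the sketch above only makes the file layout explicit.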

Authors

  • Lukáš Svoboda
  • Tomáš Brychcín

License

The corpus and models are free for research purposes.

Acknowledgments

This work has been partly supported by the ERDF project "Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)" (no. CZ.02.1.01/0.0/0.0/17_048/0007267) and by grant no. SGS-2016-018, Data and Software Engineering for Advanced Applications. Computational resources were supplied by the CESNET LM2015042 and CERIT Scientific Cloud LM2015085 projects under the programme "Projects of Large Research, Development, and Innovations Infrastructures".
