global_context

Improving Word Meaning Representations using Wikipedia Categories

We extend the Skip-Gram and Continuous Bag-of-Words models with global context information. We use a Wikipedia corpus in which articles are organized in a hierarchy of categories. These categories provide useful topical information about each article. We present several approaches to enriching word meaning representations with this kind of information.

We experiment with the English Wikipedia and evaluate our models on standard word similarity and word analogy datasets. The proposed models significantly outperform other word representation methods when trained on data of similar size, and perform comparably to methods trained on much larger datasets.
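For readers unfamiliar with these two evaluation tasks, the sketch below shows how they are commonly scored: word similarity as the cosine between two vectors, and word analogy with the vector-offset rule (`b - a + c`). This is a generic illustration, not the evaluation code from this repository. The function names and the toy 2-dimensional vectors are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by ranking words against b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates, as is usual.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, made up purely for illustration (not from the trained models).
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, -2.0],
}

print(solve_analogy(toy_vectors, "man", "woman", "king"))
```

With these toy vectors the offset `woman - man + king` equals the vector for `queen`, so that word is returned. With real embeddings, the same functions would be applied to vectors loaded from one of the trained models.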

Getting Started

See our publications at

http://www.nnw.cz/doi/2018/NNW.2018.28.029.pdf

and 

https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/download/3268/2686

Please cite both articles

@article{svoboda2018improving,
  title={Improving Word Meaning Representations Using Wikipedia Categories},
  author={Svoboda, L and Brychc{\'\i}n, T},
  journal={Neural Network World},
  volume={28},
  pages={523--534},
  year={2018}
}

and

@article{svoboda2019enriching,
  title={Enriching Word Embeddings with Global Information and Testing on Highly Inflected Language},
  author={Svoboda, Luk{\'a}{\v{s}} and Brychc{\'\i}n, Tom{\'a}{\v{s}}},
  journal={Computaci{\'o}n y Sistemas},
  volume={23},
  number={3},
  year={2019}
}

Download corpus

Category mapping:

https://uloz.to/!jqKCLASD6MmL/categories-filtmin10-prefixed-txt-zip

Wikipedia articles:

https://uloz.to/!nZLUURC15rkv/wiki-filteredmin10-txt-zip

Another format, with one sentence per line:

https://uloz.to/tam/_065GP5uM4VOj

Alternative download:

https://drive.google.com/drive/folders/1uv18K8ZGaLzMuzfgSRQERJqZdsn51o9B?usp=sharing
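The one-sentence-per-line format can be streamed without loading the whole corpus into memory. The sketch below is a minimal example. It assumes the text is already tokenized so that splitting on whitespace is sufficient, which has not been verified against the actual files. The function name and the file name in the usage note are hypothetical.

```python
def iter_sentences(lines):
    """Yield each non-empty line as a list of whitespace-separated tokens.

    `lines` can be any iterable of strings, including an open file handle.
    """
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            yield tokens


# Small in-memory example standing in for the corpus file.
sample = ["anarchism is a political philosophy\n", "\n", "it rejects hierarchy\n"]
for sentence in iter_sentences(sample):
    print(sentence)
```

For a downloaded and unpacked file, the same function could be used as `with open("wiki-sentences.txt", encoding="utf-8") as f: for sentence in iter_sentences(f): ...`, where the file name and the UTF-8 encoding are assumptions.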

Models

Trained models can be downloaded from the following links.

CBOW:

https://uloz.to/!usCgCpNUbMEZ/trained-cbow-vec-zip

Skip-gram:

https://uloz.to/!EJ8rDhVkMKoV/vectors-skip-cat-bin-zip

Alternative download:

https://drive.google.com/drive/folders/1uv18K8ZGaLzMuzfgSRQERJqZdsn51o9B?usp=sharing

Czech (CZ) corpus:

https://drive.google.com/drive/folders/1rraa5-FGW-AfLeU3StVbsjN0YjDbUWe6?usp=sharing

Use with

The Word2Vec, LexVec, or fastText tools. A clean implementation has yet to be done.
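As a starting point for using the trained vectors, the sketch below parses the plain-text word2vec format: a header line with the vocabulary size and dimensionality, followed by one word and its vector per line. That the `.vec` download uses this format is an assumption based on its name, and the function name is invented for the example. Libraries such as gensim provide `KeyedVectors.load_word2vec_format`, which reads both the text and the binary variant and is likely the more practical choice for the full models.

```python
def read_word2vec_text(lines):
    """Parse word2vec text format into a dict mapping word -> list of floats.

    `lines` is any iterable of strings, such as an open file handle.
    The first line is expected to be '<vocab_size> <dim>'.
    """
    rows = iter(lines)
    _vocab_size, dim = (int(x) for x in next(rows).split())
    vectors = {}
    for line in rows:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory example in the assumed format (values are made up).
sample_vec = [
    "2 3\n",
    "cat 0.1 0.2 0.3\n",
    "dog 0.4 0.5 0.6\n",
]
embeddings = read_word2vec_text(sample_vec)
print(sorted(embeddings))
```

Because the function takes an iterable of lines, it can be pointed at an opened file once the archive has been unpacked. The header's vocabulary size is read but not enforced, so a truncated file would load without an error.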

Authors

Lukáš Svoboda
Tomáš Brychcín

License

The corpus and models are free to use for research purposes.

Acknowledgments

This work was partly supported by the ERDF project "Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)" (no. CZ.02.1.01/0.0/0.0/17_048/0007267) and by Grant No. SGS-2016-018 "Data and Software Engineering for Advanced Applications". Computational resources were provided by CESNET (LM2015042) and the CERIT Scientific Cloud (LM2015085), under the programme "Projects of Large Research, Development, and Innovations Infrastructures".
