Taiga corpus

Welcome to the Taiga site repository!

Here, as well as on our website, you can explore our documentation, leave feedback, open issues, and create pull requests.

About the project

The Taiga corpus is an ambitious project that aims to become the largest fully available web corpus built from open text sources. The Taiga corpus is:

  • open source
  • big: about 6 billion words so far
  • organized into datasets suited to different machine learning tasks
  • made by linguists experienced in text crawling, parsing, and filtering
  • rich in metainformation
  • POS-tagged and syntactically annotated in Universal Dependencies
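
As a rough illustration of working with Universal Dependencies annotation, here is a minimal sketch in Python. It assumes the annotated files use CoNLL-U, the standard tab-separated format of Universal Dependencies; this README does not state the file format, so check the format documentation first. The function name `parse_conllu` and the sample sentence are hypothetical and not taken from the corpus.

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comment or metadata; skipped in this sketch.
            continue
        else:
            values = line.split("\t")
            # Keep only well-formed rows with exactly ten columns.
            if len(values) == len(CONLLU_FIELDS):
                current.append(dict(zip(CONLLU_FIELDS, values)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample written by hand for illustration (assumption:
# it is NOT an excerpt from the Taiga corpus).
SAMPLE = (
    "# text = Мама мыла раму.\n"
    "1\tМама\tмама\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["form"], token["upos"], token["head"], token["deprel"])
```

Each token becomes a dictionary keyed by column name, so part-of-speech tags (`upos`) and dependency relations (`head`, `deprel`) can be read directly. For real work, a dedicated CoNLL-U library is likely a better choice than this hand-rolled parser.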

Our motivation

A carefully constructed web corpus has many more potential applications than it is traditionally credited with. The “web as corpus” paradigm has recently found a natural continuation in the formulation “web as training set”. Open-source websites offer ample opportunities for NLP developers and computational linguists, who nevertheless have to gather all the data themselves and repeat the same cleaning and de-duplication steps, because traditional web corpora provide only a search interface and give no access to the full data. The “Taiga” corpus project brings together the needs of developers, machine learning practitioners, and computational linguists, serving as a web corpus both for large-scale linguistic data analysis and for modeling practical NLP and NLU systems. Its main aim is to influence the culture of corpus research for the Russian language and to reflect the paradigm shift in linguistic methodology.

Project creators

Under the inspiring supervision of Olga Lyashevskaya

References:

  1. Shavrina T., Shapovalova O. (2017) To the Methodology of Corpus Construction for Machine Learning: «Taiga» Syntax Tree Corpus and Parser. In Proc. of “CORPORA2017”, International Conference, Saint Petersburg.
  2. Shavrina T. (2018) Differential Approach to Webcorpus Construction. In Dialogue, Russian International Conference on Computational Linguistics, RSUH, Moscow.