
Corpus segments

Here you can explore information about our corpus sources and download them.

Segment information

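| Genre            | Tokens (millions) | Share (%) |
|------------------|-------------------|-----------|
| News             | 92                | 1.5       |
| Literary Texts   | 4605              | 76        |
| Special datasets | 2.5               | 0.5       |
| Social media     | 80                | 1.5       |
| Subtitles        | 101               | 1.5       |
| Poems            | 1130              | 19        |
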
<iframe src="https://cdn.datamatic.io/runtime/echarts/3.7.2_230/embedded/index.html#id=115038797393892898117/1XxvinvhVz-Gh0WJzjQ_0sD5_f7coQueI" frameborder="0" width="687" height="493" allowtransparency="true"></iframe>

Token distribution per segment
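
The share column can be recomputed from the token counts. The sketch below copies the counts from the table above; the variable names are illustrative only. The published percentages look rounded, so the recomputed values need not match the table digit for digit.

```python
# Token counts in millions, copied from the segment table.
tokens_millions = {
    "News": 92,
    "Literary Texts": 4605,
    "Special datasets": 2.5,
    "Social media": 80,
    "Subtitles": 101,
    "Poems": 1130,
}

total = sum(tokens_millions.values())

# Share of each genre as a percentage of all tokens.
shares = {genre: 100 * count / total for genre, count in tokens_millions.items()}

for genre, share in shares.items():
    print(f"{genre}: {share:.2f}%")
```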

Stihi.ru

Meta-attributes:

  • 'textrubric' – genre of the poem
  • 'textid' – unique ID
  • 'textname' – poem title
  • 'author' – author(s)
  • 'authortexts' – number of poems written by the author
  • 'authorreaders' – number of visitors who read the poem
  • 'date' – date of publication
  • 'time' – time of publication
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Poems by Genre

![alt text]({{ site.baseurl }}/assets/images/stihi_ru_rubrics.png "corpus segments")
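
A distribution like the one charted above can be rebuilt from the meta-attributes. The following is a minimal sketch, assuming the metadata is available as tab-separated text whose header row uses the attribute names listed above; the actual file layout is not specified here, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def count_by_attribute(tsv_text, attribute):
    """Count metadata rows per value of one attribute (e.g. 'textrubric')."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Rows with a missing or empty value are grouped under "(unknown)".
    return Counter((row.get(attribute) or "(unknown)") for row in reader)


# Hypothetical sample rows; not real corpus data.
sample = (
    "textid\ttextname\tauthor\ttextrubric\n"
    "1\tFirst poem\tauthor_a\tlyrics\n"
    "2\tSecond poem\tauthor_b\tlyrics\n"
    "3\tThird poem\tauthor_a\thumor\n"
    "4\tFourth poem\tauthor_c\t\n"
)

genre_counts = count_by_attribute(sample, "textrubric")
for genre, n in genre_counts.most_common():
    print(genre, n)
```

The same helper works for any other attribute with a small set of values, such as 'author'.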

Click here for more info.

Proza.ru

Meta-attributes:

  • 'textrubric' – text genre
  • 'textid' – unique ID
  • 'textname' – title
  • 'author' – author(s)
  • 'authortexts' – number of texts written by the author
  • 'authorreaders' – number of visitors who read the text
  • 'date' – date of publication
  • 'time' – time of publication
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Texts by Genre

![alt text]({{ site.baseurl }}/assets/images/proza_ru_textrubric.png "corpus segments")

Click here for more info.

Lenta.ru

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – article title
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Category

![alt text]({{ site.baseurl }}/assets/images/lenta_rubrics.png "corpus segments")

Click here for more info.

Interfax

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Tag

![alt text]({{ site.baseurl }}/assets/images/interfax_tags.png "corpus segments")

Click here for more info.

NPlus1

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'textdiff' – text difficulty
  • 'author' – author(s)
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Texts by Difficulty

![alt text]({{ site.baseurl }}/assets/images/nplus1_diff.png "corpus segments")

Click here for more info.

Komsomolskaya Pravda

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'textregion' – region the news item relates to
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Region

![alt text]({{ site.baseurl }}/assets/images/kp_regions.png "corpus segments")

Click here for more info.

Russian Magazines Hall

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'magazine' – magazine title
  • 'author' – author(s)
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – tags
  • 'source' – reference to the original source (sometimes unavailable)

Click here for more info.

Fontanka.ru

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'textregion' – region the news item relates to
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Year

![alt text]({{ site.baseurl }}/assets/images/fontanka_years.png "corpus segments")
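
A per-year breakdown like the one above can be derived from the 'date' attribute. This is a sketch only: the date format used in the metadata is not stated here, so `date_format` is a parameter, and the ISO-style default and the sample values are assumptions.

```python
from collections import Counter
from datetime import datetime


def count_by_year(dates, date_format="%Y-%m-%d"):
    """Count publication dates per year; values that do not parse are skipped."""
    years = Counter()
    for value in dates:
        try:
            years[datetime.strptime(value, date_format).year] += 1
        except (ValueError, TypeError):
            # Malformed or missing date: leave it out of the distribution.
            continue
    return years


# Hypothetical 'date' values; not real corpus data.
sample_dates = ["2015-03-01", "2015-11-20", "2016-01-05", "n/a", None]

year_counts = count_by_year(sample_dates)
for year in sorted(year_counts):
    print(year, year_counts[year])
```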

Click here for more info.

Arzamas

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'authors' – author(s)
  • 'authorprofession' – author's profession
  • 'about_author' – short author bio
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Category

![alt text]({{ site.baseurl }}/assets/images/arzamas_rubrics.png "corpus segments")

Click here for more info.

TV Subtitles

Meta-attributes:

  • 'textid' – unique ID
  • 'title' – film title
  • 'language' – language
  • 'filepath' – file path
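
Because each record carries both 'language' and 'filepath', subtitle files can be selected by language. A minimal sketch, assuming the metadata has already been loaded as a list of dictionaries keyed by the attribute names above; the records shown are hypothetical.

```python
from collections import defaultdict


def paths_by_language(records):
    """Group the 'filepath' of each record under its 'language' value."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["language"]].append(record["filepath"])
    return dict(grouped)


# Hypothetical records; identifiers, titles, and paths are invented.
records = [
    {"textid": "1", "title": "Show A", "language": "ru", "filepath": "subs/a.ru.txt"},
    {"textid": "2", "title": "Show A", "language": "en", "filepath": "subs/a.en.txt"},
    {"textid": "3", "title": "Show B", "language": "ru", "filepath": "subs/b.ru.txt"},
]

grouped_paths = paths_by_language(records)
for language, paths in grouped_paths.items():
    print(language, len(paths))
```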

Distribution of Texts by Language

![alt text]({{ site.baseurl }}/assets/images/tvsubtitles_langs.png "corpus segments")

Click here for more info.