This repository contains a news dataset presented in the paper:
Daniil Gavrilov, Pavel Kalaidin, and Valentin Malykh. Self-Attentive Model for Headline Generation. 41st European Conference on Information Retrieval, 2019. arXiv:1901.07786 [cs.CL]
To download the dataset, please use a direct link or clone the repository using git lfs.
The full dataset contains 1,003,869 Russian-language news documents from January 2010 to December 2014.
- ria_20.json contains the first 20 news documents from the dataset.
- ria_1k.json contains the first 1000 news documents from the dataset.
- ria.json.gz is the full gzip-compressed dataset.
Dataset format: each row contains a JSON document that consists of two fields: text is the document body, while title is the news headline.
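The row-per-document format described above can be read one line at a time, which avoids loading the full dataset into memory. The following is a minimal sketch, not part of the repository: the helper names iter_documents and first_pairs are hypothetical, and UTF-8 encoding is an assumption. Only the two field names, text and title, come from the format description.

```python
import gzip
import json


def iter_documents(path):
    """Yield one dict per non-empty line of a file with one JSON document per row.

    Paths ending in ".gz" are opened with gzip; anything else is read as plain text.
    UTF-8 encoding is an assumption, not something the dataset description states.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def first_pairs(path, limit=3):
    """Return up to `limit` (title, text) tuples from the start of the file."""
    pairs = []
    for doc in iter_documents(path):
        if len(pairs) >= limit:
            break
        pairs.append((doc["title"], doc["text"]))
    return pairs
```

For a quick look at the data, calling first_pairs("ria_20.json") on the smallest sample should be enough; the same call should work unchanged on ria.json.gz because the file is decompressed on the fly.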
This data is licensed by the Rossiya Segodnya news agency (ria.ru) under the CC-BY-ND-NC license. The license text can be accessed here. The Russian version of the same license can be accessed here.
If you use the data in your research, please consider citing the paper mentioned above:
@inproceedings{gavrilov2018self,
title={Self-Attentive Model for Headline Generation},
author={Gavrilov, Daniil and Kalaidin, Pavel and Malykh, Valentin},
booktitle={Proceedings of the 41st European Conference on Information Retrieval},
year={2019}
}