Annotated Fake News Dataset in Urdu and Augmentation using Machine Translation
===========================
March 03, 2020
Maaz Amjad, Grigori Sidorov, Alisa Zhila
Natural Language and Text Processing Laboratory
Center for Computing Research (CIC)
Instituto Politécnico Nacional (IPN)
Ciudad de México (Mexico City), Mexico
- Introduction
- Feedback
- Citation Info
- Acknowledgments
This dataset accompanies the paper: Amjad, M., Sidorov, G., Zhila, A. (2020). Data Augmentation using Machine Translation for Fake News Detection in the Urdu Language. LREC 2020 (accepted).
This language resource contains a dataset of 900 news articles originally written in Urdu, each annotated as real or fake. It also contains an augmentation dataset of 400 news articles translated from English to Urdu with the Google Translate MT system, as well as several combinations of these datasets for exploring the effect of augmentation. The original English Fake News dataset is available from https://web.eecs.umich.edu/~mihalcea/downloads.html#FakeNews.
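As a starting point for working with the articles, the sketch below reads labeled text files into (text, label) pairs. It is only an illustration under stated assumptions: this README does not describe the file layout, so the `Real` and `Fake` subfolder names, the `.txt` extension, the UTF-8 encoding, and the helper name `load_labeled_articles` are all hypothetical and should be adjusted to match the files you actually downloaded.

```python
from pathlib import Path


def load_labeled_articles(root, labels=("Real", "Fake")):
    """Return a list of (text, label) pairs read from root/<label>/*.txt.

    ASSUMPTION: one article per UTF-8 text file, grouped in one subfolder
    per label. The folder names are hypothetical, not taken from the README.
    """
    root = Path(root)
    samples = []
    for label in labels:
        # sorted() keeps the file order stable across runs and platforms.
        for path in sorted((root / label).glob("*.txt")):
            samples.append((path.read_text(encoding="utf-8"), label))
    return samples
```

Calling `load_labeled_articles("path/to/dataset")` on a folder laid out this way returns a list that can be passed to `collections.Counter` (over the labels) to check the real/fake balance before training a classifier.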
For details on how this dataset was built (including the crawling and annotation techniques) and on our fake news detection experiments in Urdu using this dataset, see our paper here:
For further questions or inquiries about this dataset, contact Maaz Amjad (maazamjad@phystech.edu).
This dataset and the accompanying resources are free to use. If you publish work that uses this dataset, please cite this publication:
@inproceedings{Maazaug2020,
  author    = {Maaz Amjad and Grigori Sidorov and Alisa Zhila},
  title     = {Annotated Fake News Dataset in Urdu and Augmentation using Machine Translation},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages     = {2530--2535},
  year      = {2020}
}
This work was done with partial support from CONACYT project 240844 and SIP-IPN project 20195719.