LIRNEasia/MisinformationCorpusSinhala

MisinformationCorpusSinhala

A dataset of 3576 documents in Sinhala, drawn from Sri Lankan news websites and fact-checking operations, each annotated as CREDIBLE, FALSE, PARTIAL or UNCERTAIN. For every document, the dataset records the content, the classification, the web domain from which the document was retrieved, and the date on which it was published.
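As a rough illustration of how the four fields described above might be used, the following Python sketch tallies documents per label and per web domain. It assumes the data is available as CSV with columns named `content`, `label`, `domain` and `date`; those column names, the CSV format, and the `summarize` helper are assumptions made for this example, not something this repository documents, so adjust them to match the actual files.

```python
import csv
import io
from collections import Counter

# The four annotation classes named in this README.
LABELS = {"CREDIBLE", "FALSE", "PARTIAL", "UNCERTAIN"}


def summarize(csv_file, label_col="label", domain_col="domain"):
    """Count documents per label and per web domain.

    The column names are hypothetical; change them to match the real files.
    """
    by_label, by_domain = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        # Normalize case and whitespace before checking the label.
        label = row[label_col].strip().upper()
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        by_label[label] += 1
        by_domain[row[domain_col].strip()] += 1
    return by_label, by_domain


# Placeholder rows for illustration only; they are not taken from the corpus.
sample = io.StringIO(
    "content,label,domain,date\n"
    "placeholder text 1,CREDIBLE,example.lk,2020-01-01\n"
    "placeholder text 2,FALSE,example.lk,2020-01-02\n"
    "placeholder text 3,credible,example.org,2020-01-03\n"
)
by_label, by_domain = summarize(sample)
print(by_label["CREDIBLE"], by_label["FALSE"], by_domain["example.lk"])
```

To run this against the real data, pass a file opened with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` sample. Reading as UTF-8 matters here because the document content is in Sinhala script.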

Paper (covering methodology and results of machine learning classification): https://lirneasia.net/2021/07/a-corpus-and-machine-learning-models-for-fake-news-classification-in-sinhala/

Update as of Nov 2022: please note that some parts of the original corpus were corrupted, for reasons unknown to us. This repo restores the files.

Using this work

This dataset is released under a CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. For more information, see https://creativecommons.org/licenses/by/4.0/

Citing this work

@misc{jayawickrama2021sinhala,
   title={A corpus and machine learning models for fake news classification in {Sinhala}},
   author={Vihanga Jayawickrama and Asanka Ranasinghe and Dimuthu C. Attanayake and Yudhanjaya Wijeratne},
   year={2021},
   primaryClass={cs.CL}
}
