Sinhala News Media Identification

⚠️ You must agree to the license and terms of use before using the dataset in this repo.

This is a text classification task built on the NSINA dataset: given the content of a news article, identify the news media source that published it. The dataset is released under the same license as NSINA.

Data

We used only 10,000 instances from each news source in NSINA 1.0. For the two sources with fewer than 10,000 instances ("Dinamina" and "Siyatha"), we used all the instances they contained. We then divided the dataset into training and test sets following a 0.8 split.
The data can be loaded into pandas DataFrames using the following code.

from datasets import load_dataset

# Load each split from the Hugging Face Hub and convert it to a pandas DataFrame.
train = load_dataset('sinhala-nlp/NSINA-Media', split='train').to_pandas()
test = load_dataset('sinhala-nlp/NSINA-Media', split='test').to_pandas()
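
The per-source cap and the 0.8 split described above can be illustrated with a small pandas sketch. This is not the script used to build the released dataset: the function name `cap_and_split`, the column names `source` and `text`, the random seed, and the use of a simple (non-stratified) random split are all assumptions made for illustration. The toy data uses a cap of 10 instead of 10,000 so the example stays small.

```python
import pandas as pd


def cap_and_split(df, source_col="source", cap=10_000, train_frac=0.8, seed=0):
    """Hypothetical helper: keep at most `cap` rows per source, then split.

    Sources with fewer than `cap` rows keep all of their rows.
    """
    # Shuffle first so that head(cap) picks a random subset per source.
    shuffled = df.sample(frac=1.0, random_state=seed)
    capped = shuffled.groupby(source_col).head(cap).reset_index(drop=True)

    # Simple random split (an assumption; the README does not say
    # whether the original split was stratified by source).
    train = capped.sample(frac=train_frac, random_state=seed)
    test = capped.drop(train.index)
    return train, test


# Toy data: source "A" exceeds the cap, source "B" does not.
toy = pd.DataFrame({
    "source": ["A"] * 25 + ["B"] * 5,
    "text": [f"doc {i}" for i in range(30)],
})
train_toy, test_toy = cap_and_split(toy, cap=10)
print(len(train_toy), len(test_toy))
```

With this toy input, source "A" is reduced to 10 rows while source "B" keeps all 5, and the 15 remaining rows are divided between the two splits.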

Citation

If you are using the dataset or the models, please cite the following paper.

@inproceedings{Nsina2024,
author={Hettiarachchi, Hansi and Premasiri, Damith and Uyangodage, Lasitha and Ranasinghe, Tharindu},
title={{NSINA: A News Corpus for Sinhala}},
booktitle={The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
year={2024},
month={May},
}
