prachathai-67k

News Article Corpus from Prachathai.com

The prachathai-67k dataset was scraped from the news site Prachathai. We filtered out articles with fewer than 500 characters of body text, which were mostly images and cartoons. It contains 67,889 articles with 51,797 tags, published from August 24, 2004 to November 15, 2018. The dataset was originally scraped by @lukkiddd and cleaned by @cstorm125. Download the dataset here. You can also see preliminary exploration in exploration.ipynb.

This dataset is part of the PyThaiNLP Thai text classification-benchmarks. For the benchmark, we selected the following high-volume tags, which resemble article-type categories※:

  • การเมือง - politics
  • สิทธิมนุษยชน - human rights
  • คุณภาพชีวิต - quality of life
  • ต่างประเทศ - international
  • สังคม - social
  • สิ่งแวดล้อม - environment
  • เศรษฐกิจ - economics
  • วัฒนธรรม - culture
  • แรงงาน - labor
  • ความมั่นคง - national security
  • ไอซีที - ICT
  • การศึกษา - education
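
As a minimal sketch of how these 12 tags could become multi-label targets, the snippet below maps an article's tag list to a 12-element 0/1 vector. The input format (a list of tag strings per article), the names `BENCHMARK_TAGS` and `encode_tags`, and the example tag `ข่าว` are assumptions for illustration, not taken from the dataset or its notebooks.

```python
# The 12 benchmark tags, in the order listed above.
BENCHMARK_TAGS = [
    "การเมือง",       # politics
    "สิทธิมนุษยชน",    # human rights
    "คุณภาพชีวิต",     # quality of life
    "ต่างประเทศ",     # international
    "สังคม",          # social
    "สิ่งแวดล้อม",     # environment
    "เศรษฐกิจ",       # economics
    "วัฒนธรรม",       # culture
    "แรงงาน",         # labor
    "ความมั่นคง",      # national security
    "ไอซีที",          # ICT
    "การศึกษา",       # education
]
TAG_INDEX = {tag: i for i, tag in enumerate(BENCHMARK_TAGS)}


def encode_tags(article_tags):
    """Return a 0/1 vector with one slot per benchmark tag.

    Tags outside the 12 benchmark tags are ignored.
    """
    vector = [0] * len(BENCHMARK_TAGS)
    for tag in article_tags:
        index = TAG_INDEX.get(tag)
        if index is not None:
            vector[index] = 1
    return vector


# Hypothetical article carrying two benchmark tags and one other tag.
example = encode_tags(["การเมือง", "แรงงาน", "ข่าว"])
```

Because one article can carry several of these tags at once, each slot is an independent yes/no target rather than one class out of twelve.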

We provide four benchmarks for 12-topic multi-label classification of prachathai-67k: fastText, LinearSVC, ULMFit, and Multilingual Universal Sentence Encoder (USE). In all cases, we first fine-tune the embeddings using all data. The data is then split into train, validation, and test sets in a 70/10/20 ratio. The benchmark numbers are based on the test set. Performance metrics are macro-averaged accuracy and F1 score. See classification.ipynb for more information.
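
The 70/10/20 split can be sketched as below. This assumes a seeded random shuffle over article indices; the helper name `split_indices`, the seed, and the rounding behavior are illustrative choices, and the procedure in classification.ipynb may differ.

```python
import random


def split_indices(n_items, train_frac=0.7, valid_frac=0.1, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/valid/test.

    The test set receives whatever remains after the train and
    validation cuts, so the three parts always cover every index.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_items * train_frac)
    n_valid = int(n_items * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


# 67,889 is the article count reported for the dataset.
train_idx, valid_idx, test_idx = split_indices(67889)
```

Splitting by index rather than by copying rows keeps the same partition reusable across all four models.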

model       macro-accuracy   macro-F1
fastText    0.9302           0.5529
LinearSVC   0.513277         0.552801
ULMFit      0.948737         0.744875
USE         0.856091         0.696172
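
For readers unfamiliar with the two metrics, here is a minimal sketch of one common way to compute them for multi-label predictions: score each label as its own binary problem, then take the unweighted mean across labels. This definition of macro-accuracy, the convention of scoring F1 as 0 when a label has no positives at all, and the toy data are assumptions; the notebook may compute the numbers differently.

```python
def macro_accuracy_and_f1(y_true, y_pred):
    """Average per-label accuracy and per-label F1 over all labels.

    y_true and y_pred are lists of equal-length 0/1 vectors,
    one vector per article.
    """
    n_labels = len(y_true[0])
    accuracies, f1_scores = [], []
    for j in range(n_labels):
        tp = fp = fn = tn = 0
        for truth, pred in zip(y_true, y_pred):
            if truth[j] == 1 and pred[j] == 1:
                tp += 1
            elif truth[j] == 0 and pred[j] == 1:
                fp += 1
            elif truth[j] == 1 and pred[j] == 0:
                fn += 1
            else:
                tn += 1
        accuracies.append((tp + tn) / len(y_true))
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when there is nothing to score.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(accuracies) / n_labels, sum(f1_scores) / n_labels


# Toy data: 4 articles, 2 labels. The second label is rare and is missed.
toy_true = [[1, 0], [0, 0], [0, 1], [0, 0]]
toy_pred = [[1, 0], [0, 0], [0, 0], [0, 0]]
toy_accuracy, toy_f1 = macro_accuracy_and_f1(toy_true, toy_pred)
```

In the toy data, missing the single positive of the rare label still leaves that label 75% accurate, but its F1 drops to 0. This is one reason accuracy can sit well above F1 when most articles lack most tags.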

※ Note that Prachathai.com is a left-leaning, human-rights-focused news site, hence the unusual news labels such as human rights and quality of life.