KoMultiText

Korean Multi-task Dataset for Classifying Biased Speech in Real-World Online Services

Paper Title: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services

This repository provides the Korean multi-task text dataset (KoMultiText) and PyTorch implementations of the classification models.
(News) This work was accepted to the NeurIPS 2023 Workshop on Socially Responsible Language Modelling Research (SoLaR).

Authors

Dasol Choi, Jooyoung Song, Eunsun Lee, Jinwoo Seo, Heejune Park, Dongbin Na

Abstract

The anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. This phenomenon is especially relevant in South Korea, where large-scale hate speech detection algorithms have not yet been broadly explored. In this paper, we introduce a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform. Our proposed dataset provides annotations including (1) Preferences, (2) Profanities, and (3) Nine types of Bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. Leveraging state-of-the-art BERT-based language models, our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.

Source Codes

	RoBERTa	KR-BERT	KoELECTRA	KoBigBird
Multi-task	RoBERTa	KR-BERT	KoELECTRA	KoBigBird
Single-task(Preference)	RoBERTa	KR-BERT	KoELECTRA	KoBigBird
Single-task(Profanity)	RoBERTa	KR-BERT	KoELECTRA	KoBigBird
Single-task(Bias)	RoBERTa	KR-BERT	KoELECTRA	KoBigBird

Dataset

The dataset is sourced from "Real-time Best Gallery", a forum on DC Inside, a well-known online community in South Korea.

Download Dataset

Total 150,000 comments
- Labeled Dataset: Train Dataset (38,361 comments/5MB), Test Dataset (2,000 comments/286KB)
- Unlabeled Dataset (110,000 comments/11.5MB)

Model Performance

Download Models

Overall classification performance in the single-task and multi-task settings for the Preference, Profanity, and Bias tasks. The AUROC and PRROC values for the Bias task are averages across all bias types.

Detailed AUROC, F1-score, and PRROC results for each specific bias type.
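
To make the averaging of the Bias-task AUROC concrete, here is a minimal sketch in plain Python. It is not the repository's evaluation code: the function names `auroc` and `macro_auroc`, the placeholder bias-type names `bias_a` and `bias_b`, and the toy labels and scores are all assumptions made for illustration. AUROC is computed here through its pairwise-ranking definition (the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as half), and the per-type values are then averaged with equal weight.

```python
from typing import Dict, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via pairwise ranking: P(score of a positive > score of a negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


def macro_auroc(
    labels_by_type: Dict[str, Sequence[int]],
    scores_by_type: Dict[str, Sequence[float]],
) -> float:
    """Unweighted mean of the per-type AUROC values."""
    per_type = [
        auroc(labels_by_type[name], scores_by_type[name])
        for name in labels_by_type
    ]
    return sum(per_type) / len(per_type)


# Toy data with hypothetical bias-type names (not the dataset's real labels).
toy_labels = {"bias_a": [1, 0, 1, 0], "bias_b": [0, 0, 1, 1]}
toy_scores = {"bias_a": [0.9, 0.2, 0.4, 0.6], "bias_b": [0.1, 0.3, 0.8, 0.7]}

if __name__ == "__main__":
    print(macro_auroc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a small illustration; for the full test set a library routine such as scikit-learn's `roc_auc_score` would be the more practical choice.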

Citation

If you find this work useful for your research, please cite our paper:

@misc{choi2023largescale,
      title={Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services}, 
      author={Dasol Choi and Jooyoung Song and Eunsun Lee and Jinwoo Seo and Heejune Park and Dongbin Na},
      year={2023},
      eprint={2310.04313},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
