Multi-Lingual Hate Speech Detection

We used the OLID multi-lingual hate speech dataset to compare transformer models, with the aim of determining the most effective method for hate speech detection across multiple languages. Our approach uses a multi-lingual transformer model, XLM-RoBERTa, as well as specialized BERT models pretrained for individual languages.

Full paper

Per-language pre-trained BERT models for hate speech detection

Requirements:

torch==1.10.2
transformers==4.18.0
pandas
scikit-learn
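
Before training, it can help to confirm that the installed packages match the pins above. The following is a minimal sketch, not part of this repository: `PINS` and `check_pins` are hypothetical names, and the check assumes an exact match on the release part of the version (a local suffix such as `+cu113` on a torch build is ignored).

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list above; pandas and scikit-learn are unpinned.
PINS = {"torch": "1.10.2", "transformers": "4.18.0", "pandas": None, "scikit-learn": None}


def check_pins(pins, get_version=version):
    """Return a list of human-readable problems; an empty list means no mismatch."""
    problems = []
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Drop a local version suffix such as "+cu113" before comparing.
        base = found.split("+")[0]
        if wanted is not None and base != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_pins(PINS) or ["all listed packages match"]:
        print(line)
```

The version lookup is passed in as a parameter so the check can be exercised without the packages installed.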

Training:

For subtask A: Navigate to the /notebooks folder and run run_train.ipynb. Make sure the data path and python file path are correctly specified.

In general, to run the training script from a terminal window, the following syntax can be used:

python bert_finetune.py \
--language 'english' \
--logs_dir 'PATH_TO_LOGS' \
--batch_size 64 \
--learning_rate 2e-5 \
--weight_decay 1e-2 \
--epochs 50
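
For orientation, the flags in the command above could be parsed along these lines. This is only a sketch under assumptions: the flag names and example values come from the command, but the types, defaults, and the `build_parser` helper are illustrative and may differ from what bert_finetune.py actually does.

```python
import argparse


def build_parser():
    # Flag names mirror the command above; types and defaults are assumptions.
    parser = argparse.ArgumentParser(description="Fine-tune a per-language BERT model (sketch)")
    parser.add_argument("--language", type=str, required=True)
    parser.add_argument("--logs_dir", type=str, required=True)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--weight_decay", type=float, default=1e-2)
    parser.add_argument("--epochs", type=int, default=50)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--language", "english", "--logs_dir", "PATH_TO_LOGS", "--learning_rate", "2e-5"]
)
print(vars(args))
```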

For subtasks B & C: Navigate to the /notebooks folder and run run_train_olid_subtask_b_c.ipynb. Make sure the data path and python file path are correctly specified.

In general, to run the training script from a terminal window, the following syntax can be used.

NOTE: For training on subtask C, simply replace subtask_b below with subtask_c.

python bert_finetune_OLID.py \
--language 'english' \
--logs_dir 'PATH_TO_LOGS' \
--batch_size 64 \
--learning_rate 2e-5 \
--weight_decay 1e-2 \
--epochs 50 \
--subtask 'subtask_b'
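
Since subtasks B and C differ only in the `--subtask` value, both commands can be generated from one template. The sketch below is a dry run that prints the two commands rather than executing them; `training_command` and `SUBTASKS` are hypothetical names, and the hyperparameters are simply the example values from the command above.

```python
import shlex

SUBTASKS = ["subtask_b", "subtask_c"]


def training_command(subtask, language="english", logs_dir="PATH_TO_LOGS"):
    """Build the argument list for one run of the command shown above."""
    return [
        "python", "bert_finetune_OLID.py",
        "--language", language,
        "--logs_dir", logs_dir,
        "--batch_size", "64",
        "--learning_rate", "2e-5",
        "--weight_decay", "1e-2",
        "--epochs", "50",
        "--subtask", subtask,
    ]


commands = [training_command(s) for s in SUBTASKS]
for cmd in commands:
    # Dry run: print the command instead of executing it.
    print(shlex.join(cmd))
```

To actually launch the runs, each list could be passed to `subprocess.run(cmd, check=True)` from the directory that contains the script.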

Evaluation:

Navigate to the /notebooks folder and run error_analysis.ipynb. Make sure the data path and python file path are correctly specified.

In general, to run the evaluation script from a terminal window, the following syntax can be used:

python bert_finetune_OLID_error_analysis.py \
--language 'english' \
--ckpt_loc 'PATH_TO_SUBTASK_MODEL_CHECKPOINT' \
--subtask 'subtask_a'
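
As a rough illustration of what an error analysis over model predictions can look like, the sketch below tallies gold/predicted label pairs, collects the misclassified examples, and computes a macro-averaged F1. It is not the repository's evaluation script: the function names, the placeholder texts, the `OFF`/`NOT` labels, and the choice of macro F1 as the metric are all assumptions made for the example.

```python
from collections import Counter


def error_report(texts, gold, pred):
    """Return (gold, pred) pair counts and the list of misclassified examples."""
    confusion = Counter(zip(gold, pred))
    errors = [
        {"text": t, "gold": g, "pred": p}
        for t, g, p in zip(texts, gold, pred)
        if g != p
    ]
    return confusion, errors


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder data; real inputs would come from the model's predictions.
texts = ["example 1", "example 2", "example 3", "example 4"]
gold = ["OFF", "NOT", "OFF", "NOT"]
pred = ["OFF", "OFF", "NOT", "NOT"]

confusion, errors = error_report(texts, gold, pred)
print(confusion)
print(errors)
print(macro_f1(gold, pred))
```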

Access to our fine-tuned model checkpoints:

Link to fine-tuned models for the five different languages in the dataset for Task A.

Link to fine-tuned model checkpoint for Task B.

Link to fine-tuned model checkpoint for Task C.

Repository: sridhama/llm-offensive-language-detection
