One-step and Two-step Classification for Abusive Language Detection on Twitter

paper published at ALW1: 1st Workshop on Abusive Language Online to be held at the annual meeting of the Association of Computational Linguistics (ACL) 2017 (Vancouver, Canada), August 4th, 2017

Abstract (Park & Fung, 2017)

Automatic abusive language detection is a difficult but important task for online social media. Our research explores a two-step approach of performing classification on abusive language and then classifying into specific types and compares it with one-step approach of doing one multi-class classification for detecting sexist and racist languages. With a public English Twitter corpus of 20 thousand tweets in the type of sexism and racism, our approach shows a promising performance of 0.827 F-measure by using HybridCNN in one-step and 0.824 F-measure by using logistic regression in two-steps.

see whole paper at http://aclweb.org/anthology/W17-3006

Requirements

python 3.4
tensorflow 1.0
keras 2.0 (for tf model layer construction)
numpy
sklearn (for splitting dataset)
nltk (for ngram)
pandas (for loading csv)
re (for regex)
jupyter notebook (for debugging)
tqdm (for displaying process)
pickle (for saving python object)
wordsegment, pyenchant (for preprocessing words)
gensim (for word2vec)

Scripts

train.py

Modules

data

preprocess.py: clean the tweet
tokenizer.py: tokenize text into characters/words
char.py: helpers related to character features
word.py: helpers related to word features
hybrid.py: helpers related to char and word together
utils.py: helpers related to dataset (splitting, batch generation, error analysis)

model

char_cnn.py: character-level convolutional neural network
word_cnn.py: word-level convolutional neural network
hybrid.py: word & char hybrid convolutional neural network

See run results from Tensorboard

after each run, the logs will be saved at /logs/. check command line log for the exact directory
use --log_dir option to specify the log folder
run tensorboard --logdir=/logs/xxxxxx
if using remote server, ssh with portforwarding -L option. ex. ssh -L 16006:127.0.0.1:6006 jhpark@remoteserver.hk This option will forward remote 6006 port (default port for tensorboard) to localhost 16006.

Notebooks

unshared_task_analysis/ipynb: analysis of the Hate Speech Dataset
splitting_dataset.ipynb : scripts for splitting dataset for experiment
baseline_lr.ipynb : baseline ngram logistic regression experiments

Datasets

Hate Speech Dataset: see original/README.md
Pretrained word2vec: please download it at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing and put it in data/

Python Style Guide

Please follow PEP 8. https://www.python.org/dev/peps/pep-0008/ https://realpython.com/blog/python/vim-and-python-a-match-made-in-heaven/

Name		Name	Last commit message	Last commit date
Latest commit History 109 Commits
crfsuite-0.12		crfsuite-0.12
data		data
model		model
original		original
.gitignore		.gitignore
README.md		README.md
aggregate_politician.ipynb		aggregate_politician.ipynb
baseline_lr.ipynb		baseline_lr.ipynb
config_helper.py		config_helper.py
crf_testing.ipynb		crf_testing.ipynb
data_helper.py		data_helper.py
england_analysis_terror.ipynb		england_analysis_terror.ipynb
error-analysis-abusive.ipynb		error-analysis-abusive.ipynb
error-analysis-racism.ipynb		error-analysis-racism.ipynb
error-analysis-sexism.ipynb		error-analysis-sexism.ipynb
evaluation.ipynb		evaluation.ipynb
glove.twitter.ipynb		glove.twitter.ipynb
hilllary_analysis.ipynb		hilllary_analysis.ipynb
may_analysis.ipynb		may_analysis.ipynb
metric_helper.py		metric_helper.py
preprocess-waasem-davidson.ipynb		preprocess-waasem-davidson.ipynb
split-dataset-waasem-davidson.ipynb		split-dataset-waasem-davidson.ipynb
splitting_dataset.ipynb		splitting_dataset.ipynb
train_hybrid_cnn.py		train_hybrid_cnn.py
train_word_cnn.py		train_word_cnn.py
trump_analysis.ipynb		trump_analysis.ipynb
unshared_task_analysis.ipynb		unshared_task_analysis.ipynb
waasem.hybrid		waasem.hybrid
waasem.word		waasem.word

jihopark/abusive-language

Folders and files

Latest commit

History

Repository files navigation

One-step and Two-step Classification for Abusive Language Detection on Twitter

Abstract (Park & Fung, 2017)

Requirements

Scripts

Modules

data

model

See run results from Tensorboard

Notebooks

Datasets

Python Style Guide

About

Resources

Stars

Watchers

Forks

Languages