- paper published at ALW1: 1st Workshop on Abusive Language Online to be held at the annual meeting of the Association of Computational Linguistics (ACL) 2017 (Vancouver, Canada), August 4th, 2017
Automatic abusive language detection is a difficult but important task for online social media. Our research explores a two-step approach of performing classification on abusive language and then classifying into specific types and compares it with one-step approach of doing one multi-class classification for detecting sexist and racist languages. With a public English Twitter corpus of 20 thousand tweets in the type of sexism and racism, our approach shows a promising performance of 0.827 F-measure by using HybridCNN in one-step and 0.824 F-measure by using logistic regression in two-steps.
see whole paper at http://aclweb.org/anthology/W17-3006
- python 3.4
- tensorflow 1.0
- keras 2.0 (for tf model layer construction)
- numpy
- sklearn (for splitting dataset)
- nltk (for ngram)
- pandas (for loading csv)
- re (for regex)
- jupyter notebook (for debugging)
- tqdm (for displaying process)
- pickle (for saving python object)
- wordsegment, pyenchant (for preprocessing words)
- gensim (for word2vec)
train.py
preprocess.py
: clean the tweettokenizer.py
: tokenize text into characters/wordschar.py
: helpers related to character featuresword.py
: helpers related to word featureshybrid.py
: helpers related to char and word togetherutils.py
: helpers related to dataset (splitting, batch generation, error analysis)
char_cnn.py
: character-level convolutional neural networkword_cnn.py
: word-level convolutional neural networkhybrid.py
: word & char hybrid convolutional neural network
- after each run, the logs will be saved at
/logs/
. check command line log for the exact directory - use
--log_dir
option to specify the log folder - run
tensorboard --logdir=/logs/xxxxxx
- if using remote server, ssh with portforwarding -L option. ex.
ssh -L 16006:127.0.0.1:6006 jhpark@remoteserver.hk
This option will forward remote 6006 port (default port for tensorboard) to localhost 16006.
unshared_task_analysis/ipynb
: analysis of the Hate Speech Datasetsplitting_dataset.ipynb
: scripts for splitting dataset for experimentbaseline_lr.ipynb
: baseline ngram logistic regression experiments
Hate Speech Dataset
: seeoriginal/README.md
Pretrained word2vec
: please download it athttps://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
and put it indata/
Please follow PEP 8. https://www.python.org/dev/peps/pep-0008/ https://realpython.com/blog/python/vim-and-python-a-match-made-in-heaven/