A deep, LSTM-based part-of-speech tagger and sentiment analyser using character embeddings instead of words. Compatible with Theano and TensorFlow. Optimized for Twitter.
| File | Last commit message | Date |
| --- | --- | --- |
| Data | Data: newest combined corpus | Apr 13, 2016 |
| LICENSE.md | Update LICENSE.md | Jul 24, 2016 |
| README.md | Update README.md | Jul 24, 2016 |
| avg.py | QAD visualization of lengths/pos tags | Jun 21, 2015 |
| corpus.py | Only about ~75% accuracy on GATE | Jan 25, 2016 |
| eval.py | Fix evaluation script | Apr 11, 2016 |
| export.py | convert to typed arrays | Apr 9, 2016 |
| hidden.py | Sample code etc | Jun 27, 2015 |
| imdb.py | Tightening I/O test | Aug 2, 2015 |
| lstm.py | nn_layers: update to the new averaging mechanism | May 2, 2016 |
| lstm_model.npz | Revert to 32-size embedding (best mix of performance and accuracy) | Apr 24, 2016 |
| lstm_model.npz.pkl | Revert to 32-size embedding (best mix of performance and accuracy) | Apr 24, 2016 |
| matcher.py | Wuh-oh, accuracy a little lower than predicted | Nov 18, 2015 |
| mlp.py | Sample code etc | Jun 27, 2015 |
| modelio.py | temporary changes for next training run | Apr 13, 2016 |
| nn_dropout.py | Factored out the dropout layer | Sep 12, 2015 |
| nn_layers.py | nn_layers: update to the new averaging mechanism | May 2, 2016 |
| nn_lstm.py | New bidirectional architecture | Jan 1, 2016 |
| nn_optimizers.py | rethinking how to do this distribution layers | Apr 2, 2016 |
| nn_params.py | 85.98% accuracy | Mar 23, 2016 |
| nn_serialization.py | New bidirectional architecture | Jan 1, 2016 |
| nn_support.py | 86% accuracy, 88% on training set | Apr 7, 2016 |
| server.py | Revert to 32-size embedding (best mix of performance and accuracy) | Apr 24, 2016 |
| strap.py | Trying to get drop-out working, but I think I did it wrong | Sep 13, 2015 |
| substitution.py | Trying to substitute words, didn't really work | Nov 15, 2015 |
| substitutions.pkl | Trying to substitute words, didn't really work | Nov 15, 2015 |
| tag.py | Represent tags with numbers. | Jun 21, 2015 |
| test_io.py | tests: fix test_io.py | May 2, 2016 |
| test_matcher.py | Trying to do windowing, expand our range beyond 16 words | Nov 16, 2015 |
| test_nn_layers.py | nn_layers: update to the new averaging mechanism | May 2, 2016 |
| train.py | Sample code etc | Jun 27, 2015 |
| train.sh | Turns out, I never checked this in | May 28, 2016 |
| util.py | Almost in the right order | Aug 8, 2015 |

# README.md

You're looking at the 2016-04-16 release of Dracula, a part-of-speech tagger optimized for Twitter. This tagger offers very competitive performance whilst learning only character embeddings and neural network weights, meaning it requires considerably less pre-processing than other techniques. This branch represents the release: its actual contents may change as additional things are documented, but there will be no functional changes.

## Background

Part-of-speech tagging is a fundamental task in natural language processing, and it's part of figuring out the meaning of a particular word: for example, whether *heated* represents an adjective ("he was involved in a heated conversation") or a past-tense verb ("the room was heated for several hours"). It's the first step towards a more complete understanding of a phrase through parsing. Tweets are particularly hard to deal with because they contain links, emojis, at-mentions, hashtags, slang, poor capitalisation, typos and bad spelling.
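To make that concrete, here are the two readings hand-tagged with Penn Treebank tags (an illustration, not Dracula's output):

```python
# Hand-tagged Penn Treebank illustration of the ambiguity -- not tagger output.
adjective_reading = [("he", "PRP"), ("was", "VBD"), ("involved", "VBN"),
                     ("in", "IN"), ("a", "DT"), ("heated", "JJ"),
                     ("conversation", "NN")]
verb_reading = [("the", "DT"), ("room", "NN"), ("was", "VBD"),
                ("heated", "VBN"), ("for", "IN"), ("several", "JJ"),
                ("hours", "NNS")]
```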

## How the model works

Unlike most other part-of-speech taggers, Dracula doesn't look at words directly. Instead, it reads the characters that make up each word and then uses deep neural network techniques to figure out the right tag. Read more »
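Here's a minimal NumPy sketch of the idea; the names and alphabet are illustrative, not Dracula's actual code. Each character gets a learned vector, and a word becomes the sequence of its characters' vectors, which the LSTM layers then consume:

```python
import numpy as np

# Illustrative only: a toy character alphabet and a randomly initialised
# embedding table. In Dracula these vectors are learned during training.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789@#"
char_to_index = {c: i for i, c in enumerate(alphabet)}

embedding_size = 32  # the size the released model uses
embeddings = np.random.randn(len(alphabet), embedding_size)

def embed_word(word):
    """Return a (word_length, embedding_size) matrix of character vectors."""
    indices = [char_to_index[c] for c in word.lower() if c in char_to_index]
    return embeddings[indices]

# The network sees one row per character rather than a single word vector.
print(embed_word("heated").shape)  # (6, 32)
```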

## Installing the model

You'll need Theano 0.7 or better. See Theano's installation page for additional details »
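One quick way to check an existing installation (a generic snippet, not part of this repository):

```python
# Sanity-check the Theano installation before training or serving.
from distutils.version import LooseVersion

import theano

assert LooseVersion(theano.__version__) >= LooseVersion("0.7"), \
    "Dracula needs Theano 0.7 or better"
print("Theano", theano.__version__, "OK")
```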

## Training the model

Run the `train.sh` script to train with the default settings. You may need to modify the `THEANO_FLAGS` variable at the top of that script to suit your hardware configuration (by default, it assumes a single-GPU system).
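Alternatively, the flags can be set from Python before Theano is imported. This is a sketch under the assumption that `train_lstm`'s defaults (see `lstm.py`, line 104) match what `train.sh` passes:

```python
import os

# THEANO_FLAGS is read when theano is first imported, so it must be set
# before importing any module that imports theano. "device=gpu" mirrors the
# script's single-GPU default; use "device=cpu" on a machine without one.
os.environ["THEANO_FLAGS"] = "floatX=float32,device=gpu"

import lstm  # imported only after the flags are in place

lstm.train_lstm()  # assumes the defaults in lstm.py are what you want
```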

## Assessing the model

1. Start the HTTP server: `THEANO_FLAGS="floatX=float32" python server.py`.
2. In another terminal, run `python eval.py path/to/assessment/file.conll`; a sketch of the metrics it reports follows below.
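For reference, here's an illustrative sketch (not `eval.py`'s actual code) of the two accuracy figures reported in the next section, assuming gold and predicted tags have already been aligned per tweet:

```python
# Illustrative sketch of the two metrics in the results table below -- not
# a copy of eval.py. Inputs are lists of per-tweet tag sequences.
def accuracies(gold_sentences, predicted_sentences):
    """Return (token accuracy, whole-sentence accuracy) as percentages."""
    tokens = correct_tokens = sentences = correct_sentences = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        matches = sum(g == p for g, p in zip(gold, predicted))
        tokens += len(gold)
        correct_tokens += matches
        sentences += 1
        correct_sentences += matches == len(gold)  # all tokens must be right
    return (100.0 * correct_tokens / tokens,
            100.0 * correct_sentences / sentences)

gold = [["DT", "NN", "VBD"], ["PRP", "VBZ"]]
pred = [["DT", "NN", "VBN"], ["PRP", "VBZ"]]
print(accuracies(gold, pred))  # (80.0, 50.0)
```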

## How well does the model perform?

Here's the model's performance for various character embedding sizes, assessed using GATE's TwitIE evaluation set (`Data/Gate-Eval.conll`).

| Tag | Size | Accuracy (% tokens correct) | Accuracy (% entire sentences correct) |
| --- | --- | --- | --- |
| 2016-04-16-128 | 128 | 88.69% | 20.33% |
| 2016-04-16-64 | 64 | 87.29% | 16.10% |
| 2016-04-16-32 | 32 | 84.98% | 11.86% |
| 2016-04-16-16 | 16 | 74.24% | 3.39% |

## Changing the embedding size

Make the following modifications:

* `server.py`: in the `prepare_data` call on line 122, change `32` (the last argument) to the correct size.
* `lstm.py`: in the `train_lstm` arguments on line 104, change the `dim_proj_chars` default value to the correct size; a quick way to check which size a saved model uses is sketched below.
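Both values must match the model weights being loaded, or the stored character-embedding matrix will have the wrong shape. One quick, generic way to see which size a saved model was trained with is to print the parameter shapes in the `.npz` file (parameter names vary between revisions, so this prints everything):

```python
import numpy as np

# Print every parameter's shape in the saved weights. The character
# embedding matrix is the array whose second dimension is the embedding
# size (32 in the released model).
params = np.load("lstm_model.npz")
for name in params.files:
    print(name, params[name].shape)
```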

## Licensing

All the code in this repository is distributed under the terms of LICENSE.md.

## Acknowledgements, references

The code in `lstm.py` is a heavily modified version of Pierre Luc Carrier and Kyunghyun Cho's LSTM Networks for Sentiment Analysis tutorial.

The inspiration for using character embeddings to do this job comes from C. Santos' series of papers linked below.

Finally, GATE gathered the most important corpora used for training, and provides a reference benchmark: