Code, data, and models from "Civil Unrest on Twitter (CUT): A Dataset of Tweets to Support Research on Civil Unrest" EMNLP 2020 W-NUT
The data is in /data. As per Twitter guidelines, it contains only the tweet IDs and not the full tweet content.
keywords_english.txt: Civil unrest-related keywords
known_annotations.csv: "Ground truth" annotations by the authors used to evaluate Mechanical Turk worker annotations
labelled_tweets_is_general_unrest.csv: Labels for tweets (IDs only) indicating whether each was annotated as "general unrest" and as "specific/nonspecific event"
labelled_tweets_is_protest_event.csv: Labels for tweets (IDs only) indicating whether each was annotated as "specific/nonspecific event"
majority_annotation_results.csv: All labels for the tweets (IDs along with year and country)
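Because the files hold tweet IDs only, a typical first step is to read the IDs and group them into batches for whichever tweet lookup tool you use. The sketch below is illustrative only: the column name `tweet_id`, the helper names, and the batch size of 100 are assumptions, not taken from this repository, so check the CSV header and your lookup tool's limits before relying on it.

```python
import csv
import io
from typing import Iterator, List


def read_tweet_ids(csv_file, id_column: str = "tweet_id") -> List[str]:
    """Return the tweet IDs from an open CSV file, kept as strings.

    `id_column` is a hypothetical column name; adjust it to the real header.
    """
    reader = csv.DictReader(csv_file)
    return [row[id_column] for row in reader if row.get(id_column)]


def batches(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs (size is an assumption)."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Tiny in-memory stand-in for one of the CSV files (made-up IDs and columns).
sample = io.StringIO("tweet_id,label\n1001,1\n1002,0\n1003,1\n")
ids = read_tweet_ids(sample)
for chunk in batches(ids, size=2):
    print(chunk)
```

IDs are kept as strings here because tweet IDs are large integers that can lose precision if a tool parses them as floating-point numbers.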
Civil Unrest Event Prediction Models
We evaluated ngram and embedding-based models on how well they can identify tweets discussing specific/nonspecific protests and riots (/data/labelled_tweets_is_protest_event.csv). See the above paper for details.
The below trained models are in
The Keyword model and Unigram model achieved F1 scores of 0.782 and 0.775, respectively.
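For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes binary F1 on the positive class from toy labels. It is only an illustration: the function name and data are made up, and the paper may use a different averaging scheme, so see the paper for the exact evaluation setup.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, not taken from the dataset.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```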
- Run settings:
Note: these scripts handle both the general ngram and civil unrest-related keyword count models.
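To give a rough idea of what a keyword count representation looks like, the sketch below counts how often each keyword appears in a tweet. It is not the repository's implementation: the tokenizer, the function names, and the three example keywords are assumptions for illustration, and it handles single-word keywords only.

```python
import re
from collections import Counter
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def keyword_counts(text: str, keywords: Set[str]) -> Dict[str, int]:
    """Map each keyword to the number of times it occurs in the text."""
    counts = Counter(tok for tok in tokenize(text) if tok in keywords)
    # Sorting gives a stable feature order across tweets.
    return {kw: counts[kw] for kw in sorted(keywords)}


# Illustrative keywords only; the real list lives in keywords_english.txt.
keywords = {"protest", "riot", "strike"}
tweet = "Protest downtown today, another protest planned after the strike"
print(keyword_counts(tweet, keywords))
```

The resulting counts could then be fed to any standard classifier. A general ngram model follows the same pattern but counts every ngram in the vocabulary rather than a fixed keyword list.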
This model was not included in the final paper and is still being improved. It currently achieves an F1 of 0.814.
- Run settings:
Note: Using a GPU for BERTweet is highly recommended.
Please email Alexandra DeLucia if you have any issues or questions (email@example.com).