# Data and model description

Several datasets and models are used throughout this work. As mentioned in the introduction, the datasets are named as follows:

- `core_hate_corpus`: The articles and tweets generated by the hate speech network of users, as provided by Melvyn. It combines the Dailystormer articles with the tweets from users who share these articles.
- `core_tweets_hs_keyword`: Data collected using the known hate speech keywords provided by HateBase. The keywords can be found in `hatespeech_core/data/search_streaming/refined_hs_keywords`.
- `core_combined_corpus`: A combination of data collected around specific themes, including the 2016 US Presidential elections, the 2017 Manchester bomber, and other major social events that took place throughout 2016 and 2017.
- Unfiltered: Tweets collected without keywords. It represents *unbiased* data.

My reasoning is that to model the use of code words, we need models that reflect words under their normal usage and under their potential hate speech usage. This is why we have the Unfiltered dataset alongside the others. For example, in the general Twitter stream the word `animals` would most likely carry its literal meaning, but in the tweets of users who are in hate speech communities, `animals` might take on its hate speech meaning, which describes people of *"lower class"*.
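One way to picture this comparison is a nearest-neighbour lookup for the same word in two embedding spaces. The sketch below is only an illustration: the three-dimensional vectors and the helper `nearest_neighbours` are invented for this example and are not taken from the trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, topn=2):
    """Return the topn words most similar to `word` (hypothetical helper)."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:topn]]


# Toy vectors, made up purely for illustration.
unfiltered = {
    "animals": [0.9, 0.1, 0.0],
    "pets": [0.8, 0.2, 0.1],
    "zoo": [0.7, 0.1, 0.2],
    "people": [0.1, 0.9, 0.1],
}
hate_corpus = {
    "animals": [0.1, 0.9, 0.2],
    "pets": [0.9, 0.1, 0.0],
    "zoo": [0.8, 0.0, 0.3],
    "people": [0.2, 0.8, 0.1],
}

general = nearest_neighbours("animals", unfiltered)
hateful = nearest_neighbours("animals", hate_corpus)
print("unfiltered:", general)
print("hate corpus:", hateful)
```

A word whose closest neighbours differ sharply between the two spaces is a candidate for closer inspection; the real models would be queried in the same spirit.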

To rebuild this work, you only need to collect data using the keywords I have provided, and collect a separate dataset without using any keywords. This data is not unique to this project, so you can gather it yourself. However, you do need to download the Dailystormer articles and the hate speech user network tweets, as it is not easy to obtain a dataset that is specific to hate speech.

I'll add them here for convenience:

- [Website article datasets](https://www.dropbox.com/s/lcg2j3zx2kuz2re/dailystormer_archive.20170901.gz?dl=0)
- [User Tweets datasets](https://www.dropbox.com/s/96mcbq260mgo1gs/melvyn_hs_users.20170901.gz?dl=0)



## Neural Embedding models

There are two types of models in use: traditional word vector models (built with fasttext) and syntactic dependency models (built with dependency2vec). The word embedding models reflect word collocation or word relatedness. The dependency2vec models reflect word similarity or word behaviour. You can check out the slides in the `docs` folder for a more visual explanation.
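The difference between the two model types comes down to which contexts each word is paired with during training. The sketch below contrasts linear window contexts with dependency contexts in the style of dependency-based embeddings. The sentence, the hand-written parse, and both helper functions are assumptions made for illustration, not code from this repository.

```python
def window_contexts(tokens, window=1):
    """Pair each word with its neighbours inside a linear window."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs


def dependency_contexts(tokens, arcs):
    """Pair each word with its syntactic neighbours.

    arcs is a list of (head_index, label, modifier_index) triples.
    The head sees "modifier/label"; the modifier sees "head/label-1",
    where the "-1" suffix marks the inverse relation.
    """
    pairs = []
    for head, label, mod in arcs:
        pairs.append((tokens[head], f"{tokens[mod]}/{label}"))
        pairs.append((tokens[mod], f"{tokens[head]}/{label}-1"))
    return pairs


tokens = ["scientist", "discovers", "distant", "star"]
# Hand-annotated parse, for illustration only.
arcs = [(1, "nsubj", 0), (1, "dobj", 3), (3, "amod", 2)]

linear = [c for w, c in window_contexts(tokens) if w == "discovers"]
syntactic = [c for w, c in dependency_contexts(tokens, arcs) if w == "discovers"]
print("window:", linear)
print("dependency:", syntactic)
```

With a window of one, `discovers` is paired with the adjacent word `distant` but misses its object `star`, while the dependency contexts capture `star` directly. This is why window models tend to group words that occur together, and dependency models tend to group words that behave alike.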

I trained word embedding and dependency2vec models for BOTH the unfiltered and the hate speech corpora. All the embedding models that I trained are included; the notebook `03_codeword_selection` shows how they can be loaded and used.
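For a quick look outside the notebook, vectors stored in the plain-text word2vec format (the layout fasttext writes to `.vec` files) can be read with the standard library alone. Whether the included models use this format is an assumption, and `load_text_vectors` is a hypothetical helper, so treat the notebook as the authoritative reference.

```python
import io


def load_text_vectors(handle):
    """Read word vectors from a text file in word2vec format.

    The first line holds "<vocab_size> <dimension>"; every following
    line holds a word followed by its vector components.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


# In-memory sample standing in for a real file, for illustration only.
sample = io.StringIO("2 3\nanimals 0.1 0.2 0.3 \npeople 0.4 0.5 0.6 \n")
vectors, dim = load_text_vectors(sample)
print(dim, sorted(vectors))
```

For a file on disk, pass a handle opened with `open(path, encoding="utf-8")` in place of the in-memory sample.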

To train new models, use the `neural_embeddings` module, specifically the functions `train_word_embeddings` and `train_dep2vec_model`.