Data

The data are stored as a CSV file and as a pickled pandas DataFrame (Python 2.7). Alongside the tweet text, each data file contains 5 annotation columns:

`count` = number of CrowdFlower (CF) users who coded each tweet (the minimum is 3; more users coded a tweet when CF determined the judgments to be unreliable).

`hate_speech` = number of CF users who judged the tweet to be hate speech.

`offensive_language` = number of CF users who judged the tweet to be offensive.

`neither` = number of CF users who judged the tweet to be neither hate speech nor offensive.

`class` = class label assigned by the majority of CF users: 0 = hate speech, 1 = offensive language, 2 = neither.
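
To illustrate how the `class` column relates to the three vote columns, here is a minimal sketch on made-up rows (not taken from the dataset). It assumes the label is simply the vote column with the most votes and that ties go to the lower label; the description above does not say how ties were actually resolved, so treat that part as a guess.

```python
import pandas as pd

# Toy rows invented for illustration; they are not real annotations.
toy = pd.DataFrame({
    "hate_speech": [0, 2, 0],
    "offensive_language": [3, 1, 0],
    "neither": [0, 0, 3],
})

# Column order matches the label encoding: 0, 1, 2.
vote_columns = ["hate_speech", "offensive_language", "neither"]

# Assumption: the class is the position of the column with the most votes.
# argmax returns the first maximum, so ties fall to the lower label.
toy["class"] = toy[vote_columns].to_numpy().argmax(axis=1)

print(toy)
```

On the real data, the same comparison could be used to check whether the stored `class` column agrees with a plain majority vote.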


Other notes: 
* 80-10-10 train-dev-test split, because there's not that much data
* 100 samples held out for the oracle (completed by all experimenters)
* Data shuffled before the oracle is held out (seeded at 42 each time for reproducibility)
* Used `train_test_split` from sklearn for the train-dev-test split (also seeded at 42 each time)
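
As a rough check on what the 80-10-10 split means in rows, the arithmetic below works out the expected split sizes for the 24,683 rows that remain once the 100 oracle samples are held out. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the other side receives the remainder, which is our understanding of how scikit-learn behaves; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, holdout_fraction=0.2, test_fraction=0.5):
    """Return (train, dev, test) row counts for a two-stage split."""
    # Assumption: the "test" side of each split is rounded up,
    # and the other side gets whatever is left.
    n_holdout = math.ceil(n_rows * holdout_fraction)
    n_train = n_rows - n_holdout
    n_test = math.ceil(n_holdout * test_fraction)
    n_dev = n_holdout - n_test
    return n_train, n_dev, n_test


n_train, n_dev, n_test = split_sizes(24683)
print(n_train, n_dev, n_test)  # 19746 2468 2469
```

Under that assumption, the three parts add back up to 24,683 rows, and each share is within a fraction of a percent of 80%, 10%, and 10%.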

In [47]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

In [48]:
df = pd.read_csv('hate-speech-and-offensive-language/data/labeled_data.csv',encoding='utf-8')

In [49]:
# Shuffle the rows with a fixed seed so the oracle sample is reproducible
shuffled = shuffle(df, random_state=42)
shuffled.shape

(24783, 7)

In [50]:
# The first 100 shuffled rows form the oracle set; the remainder is split further
oracle = shuffled.iloc[:100]

rest = shuffled.iloc[100:]

rest.shape[0]

24683

In [52]:
oracle.to_csv('data/oracles/oracle_with_labels.csv',encoding='utf-8')

In [53]:
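# Unlabeled copy for the experimenters: keep only the tweet text and
# 'Unnamed: 0', the unnamed first column of the source CSV (presumably its row id)
to_do = oracle[['Unnamed: 0', 'tweet']]
to_do.to_csv('data/oracles/oracle_no_labels.csv', encoding='utf-8')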

In [54]:
# 80% goes to train; `dev` temporarily holds the other 20%, which is halved below
train, dev = train_test_split(rest, test_size=0.2, random_state=42)

In [55]:
train.to_csv('data/train/training_data.csv',encoding='utf-8')

In [56]:
# Split the held-out 20% evenly into the development (`val`) and test sets
val, test = train_test_split(dev, test_size=0.5, random_state=42)

In [57]:
val.to_csv('data/dev/development_data.csv',encoding='utf-8')
test.to_csv('data/test/testing_data.csv',encoding='utf-8')
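
As a sanity check on the procedure above, the sketch below repeats the same steps on a small synthetic frame and confirms that the four parts do not overlap and that a second run with the same seed selects the same rows. The helper `make_splits` and the toy `tweet` values are hypothetical and only mirror the cells above; nothing here reads the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle


def make_splits(frame, seed=42, n_oracle=100):
    """Hypothetical helper: shuffle, hold out an oracle set, then split 80-10-10."""
    shuffled = shuffle(frame, random_state=seed)
    oracle = shuffled.iloc[:n_oracle]
    rest = shuffled.iloc[n_oracle:]
    train, held_out = train_test_split(rest, test_size=0.2, random_state=seed)
    dev, test = train_test_split(held_out, test_size=0.5, random_state=seed)
    return {"oracle": oracle, "train": train, "dev": dev, "test": test}


# Synthetic stand-in for the labeled tweets (1,100 made-up rows).
toy = pd.DataFrame({"tweet": [f"tweet {i}" for i in range(1100)]})
splits = make_splits(toy)

# Every row should land in exactly one part.
all_indices = pd.concat(list(splits.values())).index
no_overlap = all_indices.is_unique and len(all_indices) == len(toy)

# The same seed should select the same rows on a second run.
rerun = make_splits(toy)
reproducible = all(splits[name].index.equals(rerun[name].index) for name in splits)

sizes = {name: len(part) for name, part in splits.items()}
print(sizes, no_overlap, reproducible)
```

The same two checks could be run on the CSV files written above by comparing their `Unnamed: 0` values.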