# Cloud Constable Content-Based Threat Detection
______
### Stephen Camera-Murray, Himani Garg, Vijay Thangella
## Wikipedia Personal Attacks corpus
(https://figshare.com/articles/Wikipedia_Detox_Data/4054689)

The corpus contains 115,864 verbatims, of which 13,590 are labelled aggressive and 102,274 are not.

Aggressive Speech                                      |  Normal Speech
:-----------------------------------------------------:|:------------------------------------------------------:
<img src="thumbsdown.png" alt="Aggressive" style="width: 200px;"/> | <img src="thumbsup.png" alt="Normal" style="width: 200px;"/>

**Note**: Be sure to delete all files in the data folder *before* committing changes to GitHub so they don't get angry with us. :-)

### Step 1 - Data Exploration and Cleansing
____

#### Import required libraries

In [3]:
#import libraries
import numpy as np
import pandas as pd
import string
import email.parser 
import os, sys, stat
import shutil
import nltk
import urllib.request
from PIL import Image
import re
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
from bs4 import BeautifulSoup

#### Download verbatims
Download verbatims from figshare using code swiped from:
https://github.com/ewulczyn/wiki-detox/blob/master/src/figshare/Wikipedia%20Talk%20Data%20-%20Getting%20Started.ipynb

**Note**: Run once to land the files locally. Once they are there, there is no need to run this again. And please don't check the raw files into GitHub.

In [12]:
# several of the following cells swiped their code from here: https://github.com/ewulczyn/wiki-detox/blob/master/src/figshare/Wikipedia%20Talk%20Data%20-%20Getting%20Started.ipynb

# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 


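def download_file(url, fname):
    # create the target folder if it does not exist yet, then fetch the file
    os.makedirs(os.path.dirname(fname) or '.', exist_ok=True)
    urllib.request.urlretrieve(url, fname)


download_file(ANNOTATED_COMMENTS_URL, 'data/attack_annotated_comments.tsv')
download_file(ANNOTATIONS_URL, 'data/attack_annotations.tsv')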
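
The "run once" note above could also be enforced in code. The following is a minimal sketch, not part of the original notebook: `download_if_missing` and its `fetch` parameter are hypothetical names introduced here, and the helper simply skips the download when the target file is already on disk.

```python
import os
import urllib.request


def download_if_missing(url, fname, fetch=urllib.request.urlretrieve):
    """Download url to fname unless fname already exists.

    Returns True if a download happened, False if it was skipped.
    `fetch` is a hypothetical hook so the function can be tried offline.
    """
    if os.path.exists(fname):
        return False
    folder = os.path.dirname(fname)
    if folder:
        # make sure the target folder (e.g. data/) exists
        os.makedirs(folder, exist_ok=True)
    fetch(url, fname)
    return True
```

With this in place, the two `download_file(...)` calls could be swapped for `download_if_missing(...)` with the same arguments, so re-running the cell would not fetch the files again.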

#### Load verbatims into dataframes

In [113]:
# load verbatims and annotations into dataframes
comments = pd.read_csv('data/attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('data/attack_annotations.tsv',  sep = '\t')

# count the number of verbatims and annotations
print ( 'There are', '{:,}'.format(len(annotations['rev_id'].unique())), 'verbatims with', '{:,}'.format(len(annotations)) , 'manual annotations' )

There are 115,864 verbatims with 1,365,217 manual annotations


In [114]:
# display the comments dataframe
comments.head(5)

| rev_id | comment | year | logged_in | ns | sample | split |
|:-------|:--------|:-----|:----------|:---|:-------|:------|
| 37675 | \`-NEWLINE_TOKENThis is not \`\`creative\`\`. Thos... | 2002 | False | article | random | train |
| 44816 | \`NEWLINE_TOKENNEWLINE_TOKEN:: the term \`\`stand... | 2002 | False | article | random | train |
| 49851 | NEWLINE_TOKENNEWLINE_TOKENTrue or false, the s... | 2002 | False | article | random | train |
| 89320 | Next, maybe you could work on being less cond... | 2002 | True | article | random | dev |
| 93890 | This page will need disambiguation. | 2002 | True | article | random | train |


In [115]:
# display the annotations dataframe
annotations.head(5)

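|   | rev_id | worker_id | quoting_attack | recipient_attack | third_party_attack | other_attack | attack |
|:--|:-------|:----------|:---------------|:-----------------|:-------------------|:-------------|:-------|
| 0 | 37675 | 1362 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 37675 | 2408 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 37675 | 1493 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 37675 | 1439 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 37675 | 170 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |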


#### Create an aggressive label, add it to the comment dataframe, clean up, and create the final dataframe
Comments are labelled aggressive if the mean "attack" score across the manual annotations is greater than 0.5.

In [143]:
# label comments as "aggressive" if the mean attack score is > 0.5
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

# add labels to the comments dataframe as 1.0 / 0.0 (aligned on the rev_id index)
comments['aggressive'] = labels.astype(float)

# remove newline tokens, tab tokens, and non-alphas
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: re.sub('[^a-zA-Z]+',' ', x).lower())

# create a cleaned-up dataframe with just the content and the label,
# renaming the 'comment' column to 'content'
verbatimsDF = comments[['comment','aggressive']].rename(columns={'comment': 'content'})

# display stats
print ( '{:,}'.format ( verbatimsDF[verbatimsDF['aggressive']==1].shape[0] ), 'verbatims are labelled aggressive and', '{:,}'.format ( verbatimsDF[verbatimsDF['aggressive']==0].shape[0] ), 'are not.' )

13,590 verbatims are labelled aggressive and 102,274 are not.
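
To make the labelling rule concrete, here is a small sketch on invented data. The `toy_annotations` table and its values are made up for illustration; only the `groupby` / `mean` / `> 0.5` logic mirrors the cell above.

```python
import pandas as pd

# Toy annotations: three comments, each judged by several hypothetical workers.
toy_annotations = pd.DataFrame({
    'rev_id': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'attack': [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
})

# Mean attack score per comment, i.e. the fraction of workers who flagged it.
mean_attack = toy_annotations.groupby('rev_id')['attack'].mean()

# The comparison is strict, so an even split (rev_id 2, mean 0.5) is NOT aggressive.
toy_labels = (mean_attack > 0.5).astype(float)
print(toy_labels)
```

Here comment 1 (two of three workers flagged it) is labelled 1.0, while comments 2 and 3 are labelled 0.0. The strict inequality means ties count as normal speech.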


In [136]:
# display cleaned data
verbatimsDF.head(5)

| rev_id | content | aggressive |
|:-------|:--------|:-----------|
| 37675 | this is not creative those are the dictionary... | 0.0 |
| 44816 | the term standard model is itself less npov t... | 0.0 |
| 49851 | true or false the situation as of march was s... | 0.0 |
| 89320 | next maybe you could work on being less conde... | 0.0 |
| 93890 | this page will need disambiguation | 0.0 |
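
The cleansing steps can be seen in isolation on a single string. This is a sketch under assumptions: `clean_comment` is a hypothetical helper, the `raw` sample is invented, and the final `strip()` is an extra step that the notebook cell does not perform.

```python
import re


def clean_comment(text):
    """Apply the notebook's cleansing steps to one raw comment."""
    # replace the placeholder tokens with spaces
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    # collapse every run of non-letters into one space, then lower-case
    text = re.sub('[^a-zA-Z]+', ' ', text).lower()
    # extra step (an assumption, not in the original cell): trim the ends
    return text.strip()


raw = "NEWLINE_TOKENThis page will need TAB_TOKEN disambiguation, v2.0!"
print(clean_comment(raw))  # this page will need disambiguation v
```

Note the stray `v` left over from `v2.0`: digits and punctuation are dropped, so fragments of mixed tokens can survive as single letters.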


In [137]:
# display the nasty stuff
verbatimsDF[ verbatimsDF ['aggressive'] == 1 ].head(5)

| rev_id | content | aggressive |
|:-------|:--------|:-----------|
| 801279 | iraq is not good usa is bad | 1.0 |
| 2702703 | fuck off you little asshole if you want to ta... | 1.0 |
| 4632658 | i have a dick its bigger than yours hahaha | 1.0 |
| 6545332 | renault you sad little bpy for driving a rena... | 1.0 |
| 6545351 | renault you sad little bo for driving a renau... | 1.0 |


#### Create Word Clouds

Let's pull all of the words into two single strings, one for aggressive and one for normal.

In [140]:
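# get all of the aggressive words and normal words in a single string each
# (sep=' ' keeps the last word of one comment from fusing with the first word of the next)
aggressive_words = verbatimsDF[verbatimsDF['aggressive']==1]['content'].str.cat(sep=' ')
normal_words     = verbatimsDF[verbatimsDF['aggressive']==0]['content'].str.cat(sep=' ')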

Create the word cloud images

In [141]:
d = os.path.dirname('.')

aggressive_mask = np.array(Image.open(os.path.join(d, "thumbsdown.png")))
normal_mask = np.array(Image.open(os.path.join(d, "thumbsup.png")))

stopwords = set(STOPWORDS)

# generate word cloud
wc_aggressive = WordCloud(background_color=None, mode="RGBA", max_words=100, mask=aggressive_mask,
               stopwords=stopwords)
wc_aggressive.generate(aggressive_words)

wc_normal = WordCloud(background_color=None, mode="RGBA", max_words=100, mask=normal_mask,
               stopwords=stopwords)
wc_normal.generate(normal_words)

# store to file
wc_aggressive.to_file(os.path.join(d, "AggressiveWordCloud.png"))
wc_normal.to_file (os.path.join(d, "NormalWordCloud.png"))

<wordcloud.wordcloud.WordCloud at 0x2291b1aa3c8>

We observe that the two sets of words are quite different, which should be useful in building a predictive model. There is some repetition, but that is likely due to extra spaces that were not filtered out in our cleansing. Tokenization in our data preparation step should take care of this, but we should double-check to be sure. We also notice that the normal speech has many words related to Wikipedia (e.g. article, page), which may not be representative of everyday speech. Once we build the model, we should check its accuracy against additional datasets.
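
One possible way to do the double-check on extra spaces is to count tokens per class directly. The sketch below is only an illustration: the `sample` dataframe and the `top_words` helper are invented for this example, and plain whitespace splitting stands in for whatever tokenizer the data preparation step ends up using.

```python
from collections import Counter

import pandas as pd

# Tiny stand-in for verbatimsDF; the rows are made up and contain extra spaces.
sample = pd.DataFrame({
    'content': [' you are  an idiot ', 'this article needs a source',
                'you idiot', ' see the talk  page for this article '],
    'aggressive': [1.0, 0.0, 1.0, 0.0],
})


def top_words(frame, label, n=3):
    """Return the n most common whitespace-separated tokens for one class."""
    counts = Counter()
    for text in frame.loc[frame['aggressive'] == label, 'content']:
        # split() with no argument ignores repeated and leading/trailing spaces
        counts.update(text.split())
    return counts.most_common(n)


print(top_words(sample, 1.0))
print(top_words(sample, 0.0))
```

If no empty tokens show up in counts like these, extra spaces are probably not the cause of the repetition. Another candidate worth checking is WordCloud's `collocations` option, which includes two-word phrases by default and can make the same word appear more than once.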

Aggressive Speech                                      |  Normal Speech
:-----------------------------------------------------:|:------------------------------------------------------:
<img src="AggressiveWordCloud.png" alt="Aggressive" style="height: 300px;"/> | <img src="NormalWordCloud.png" alt="Normal" style="height: 300px;"/>

#### Write our cleansed dataset to the data folder

In [142]:
# write the cleansed dataframe to a file
verbatimsDF.to_csv('data/cleansedVerbatims.tab.gz', index=False, compression='gzip', sep='\t')
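
A later notebook would need to read this file back with matching settings. The following round-trip sketch assumes nothing beyond pandas: the `demo` dataframe is a made-up stand-in for `verbatimsDF`, and a temporary folder is used so the example does not write into `data/`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for verbatimsDF; in the notebook the real dataframe would be used.
demo = pd.DataFrame({
    'content': ['this page will need disambiguation', 'you idiot'],
    'aggressive': [0.0, 1.0],
})

# Write to a temporary folder so the sketch leaves data/ untouched.
path = os.path.join(tempfile.mkdtemp(), 'cleansedVerbatims.tab.gz')
demo.to_csv(path, index=False, compression='gzip', sep='\t')

# Read it back with the same separator and compression.
restored = pd.read_csv(path, sep='\t', compression='gzip')
print(restored.head())
```

Because the file is written with `index=False`, the `rev_id` index is not saved. If later steps need to trace a verbatim back to its Wikipedia revision, the index would have to be written out as well.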