## Wikipedia Personal Attacks

The data and most of the code in this notebook come from Ellery Wulczyn, Nithum Thain, and Lucas Dixon (paper: https://arxiv.org/abs/1610.08914; notebook: https://github.com/ewulczyn/wiki-detox/blob/master/src/figshare/Wikipedia%20Talk%20Data%20-%20Getting%20Started.ipynb).

These authors' data contain:

- a large historical corpus of discussion comments on Wikipedia talk pages
- a sample of over 100k comments with human labels for whether the comment contains a personal attack
- a sample of over 100k comments with human labels for whether the comment has an aggressive tone


Please note that some of these comments contain offensive language. 

## Building a classifier for personal attacks (code from Wulczyn et al.)

First we import some packages.

In [1]:
import pandas as pd
import urllib.request  # the request submodule must be imported explicitly
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

### Question 1

What are these packages and why are we using them? (Feel free to Google around.) It is okay if you do not understand all of this; just do your best.

### Your answer to Question 1

goes here

In [2]:
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 


def download_file(url, fname):
    # fetch the file at url and save it locally as fname
    urllib.request.urlretrieve(url, fname)


download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')
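
Re-running the cell above downloads both files again each time. A minimal sketch of one way to avoid that, assuming it is acceptable to reuse a local copy whenever one exists; `download_if_missing` is a hypothetical helper, not part of the original notebook:

```python
import os
import urllib.request


def download_if_missing(url, fname):
    """Download url to fname unless fname already exists (hypothetical helper).

    Returns True if a download happened and False if the local copy was reused.
    """
    if os.path.exists(fname):
        return False  # nothing downloaded; the existing file is kept as-is
    urllib.request.urlretrieve(url, fname)
    return True
```

It would be called just like `download_file`, for example `download_if_missing(ANNOTATIONS_URL, 'attack_annotations.tsv')`. Note that this sketch does not check whether an existing file is complete.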

In [3]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep='\t', index_col=0)
annotations = pd.read_csv('attack_annotations.tsv', sep='\t')

In [None]:
print(comments)


                                                     comment  year  logged_in  \
rev_id                                                                          
37675      `-NEWLINE_TOKENThis is not ``creative``.  Thos...  2002      False   
44816      `NEWLINE_TOKENNEWLINE_TOKEN:: the term ``stand...  2002      False   
49851      NEWLINE_TOKENNEWLINE_TOKENTrue or false, the s...  2002      False   
89320       Next, maybe you could work on being less cond...  2002       True   
93890                   This page will need disambiguation.   2002       True   
102817     NEWLINE_TOKEN-NEWLINE_TOKENNEWLINE_TOKENImport...  2002       True   
103624     I removed the following:NEWLINE_TOKENNEWLINE_T...  2002       True   
111032     `:If you ever claimed in a Judaic studies prog...  2002       True   
120283     NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENMy apol...  2002       True   
128532     `Someone wrote:NEWLINE_TOKENMore recognizable,...  2002       True   
133562     NEWLINE_TOKENNEWL

In [4]:
print(annotations)

            rev_id  worker_id  quoting_attack  recipient_attack  \
0            37675       1362             0.0               0.0   
1            37675       2408             0.0               0.0   
2            37675       1493             0.0               0.0   
3            37675       1439             0.0               0.0   
4            37675        170             0.0               0.0   
...            ...        ...             ...               ...   
1365212  699897151        628             0.0               0.0   
1365213  699897151         15             0.0               0.0   
1365214  699897151         57             0.0               0.0   
1365215  699897151       1815             0.0               0.0   
1365216  699897151        472             0.0               0.0   

         third_party_attack  other_attack  attack  
0                       0.0           0.0     0.0  
1                       0.0           0.0     0.0  
2                       0.0           0

### Question 2

We've now downloaded the data.  Please open it up and take a look.  How are the data formatted?  What's in there?  What do you notice?

### Your answer to Question 2

goes here

In [5]:
len(annotations['rev_id'].unique())


115864

In [6]:
# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5
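
To see what the majority-vote line above does, here is a small sketch on made-up annotations (the `rev_id` values and votes are invented for illustration; only the column names match the real data):

```python
import pandas as pd

# Toy annotations: four hypothetical comments, each rated by several annotators.
toy = pd.DataFrame({
    'rev_id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    'attack': [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
})

# The mean of the 0/1 votes is the fraction of annotators who saw an attack.
attack_fraction = toy.groupby('rev_id')['attack'].mean()

# A comment counts as an attack only when that fraction is strictly above 0.5.
toy_labels = attack_fraction > 0.5
print(toy_labels)
# rev_id 1 -> True, 2 -> False, 3 -> False, 4 -> False
```

Comment 4 is a tie (one vote each way), and because the comparison is a strict `>`, ties are labeled as non-attacks.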

In [7]:
# join labels and comments
comments['attack'] = labels
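
The assignment above works as a join because pandas aligns a Series to a DataFrame by index when it is assigned as a column, and both objects are indexed by `rev_id`. A minimal sketch with invented rows:

```python
import pandas as pd

# Hypothetical comments indexed by rev_id, as in the real comments table.
toy_comments = pd.DataFrame(
    {'comment': ['first', 'second', 'third']},
    index=pd.Index([10, 20, 30], name='rev_id'),
)

# Labels arrive in a different order, and rev_id 20 has no label at all.
toy_labels = pd.Series([True, False], index=pd.Index([30, 10], name='rev_id'))

# Values are matched by rev_id, not by position; missing labels become NaN.
toy_comments['attack'] = toy_labels
print(toy_comments)
```

Here `rev_id` 10 receives `False`, 30 receives `True`, and 20 is left as `NaN`, so it is worth checking for missing labels after a join like this.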

In [8]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

In [9]:
comments.query('attack')['comment'].head()


rev_id
801279             Iraq is not good  ===  ===  USA is bad   
2702703      ____ fuck off you little asshole. If you wan...
4632658         i have a dick, its bigger than yours! hahaha
6545332      == renault ==  you sad little bpy for drivin...
6545351      == renault ==  you sad little bo for driving...
Name: comment, dtype: object

In [10]:
comments.query('attack')['comment'].tail()


rev_id
699645524     Brandon Semenuk has won the event four times ...
699659494    im soory since when is google images not allow...
699660419    what ever you fuggin fag Question how did you ...
699661020      == Nice try but no cigar........idiot ==  Th...
699664687     shut up mind your own business and go fuck so...
Name: comment, dtype: object

In [11]:
# fit a simple text classifier

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    ('clf', LogisticRegression()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' % auc)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Test ROC AUC: 0.957
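
The pieces of the cell above can be tried separately on toy inputs. The sketch below uses two made-up comments and four made-up scores (none of them from the Wikipedia data) to show what the `vect` and `tfidf` steps produce and how `roc_auc_score` behaves:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import roc_auc_score

# Two made-up comments, invented for illustration only.
docs = ['you are great', 'you are awful']

# Step 1: count single words and word pairs, as ngram_range=(1, 2) requests.
vect = CountVectorizer(ngram_range=(1, 2))
counts = vect.fit_transform(docs)
features = sorted(vect.vocabulary_)
print(features)
# ['are', 'are awful', 'are great', 'awful', 'great', 'you', 'you are']

# Step 2: reweight the counts with tf-idf and scale each row to unit L2 length.
tfidf = TfidfTransformer(norm='l2').fit_transform(counts)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

# Step 3: ROC AUC compares predicted scores with the true labels. With no tied
# scores, it is the fraction of (attack, non-attack) pairs in which the attack
# received the higher score.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc_score(y_true, y_score)
print(toy_auc)  # 0.75: three of the four pairs are ranked correctly
```

A score of 1.0 would mean every attack is ranked above every non-attack, and 0.5 is what random scores would give on average.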


### Question 3

What has happened here?  Can you explain, in general terms, what this code is doing?  What does ROC AUC mean?

### Your answer to Question 3

goes here

In [12]:
# now try to classify new comments
clf.predict(['Thanks for you contribution, you did a great job!'])


array([False])

In [13]:
clf.predict(['People as stupid as you should not edit Wikipedia!'])

array([ True])
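
The two cells above return only a hard True/False label. When probing the classifier, it can help to look at the predicted probability as well. A minimal sketch, using a tiny made-up training set so that it runs on its own; `attack_probability` is a hypothetical helper, and in the notebook the fitted `clf` would be passed instead of `toy_clf`:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model trained on invented examples, for illustration only.
toy_texts = ['you are great', 'thanks for the help',
             'you are awful', 'you are a fool']
toy_labels = [False, False, True, True]

toy_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(norm='l2')),
    ('clf', LogisticRegression()),
]).fit(toy_texts, toy_labels)


def attack_probability(model, texts):
    """Return the predicted probability of the True (attack) class per text."""
    # Look up which predict_proba column belongs to the True class.
    true_col = list(model.classes_).index(True)
    return model.predict_proba(texts)[:, true_col]


probs = attack_probability(toy_clf, ['thanks, you are great', 'you fool'])
print(probs)
```

`predict` turns these probabilities into labels by picking the more likely class, so a comment scored just above or just below 0.5 is a natural place to look when trying to "break" the model.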

### Question 4

Edit the code above to try out some new nice and nasty comments of your own invention. Can you "break" the classifier? If so, how? If not, why not?

### Your answer to Question 4

goes here

### Question 5

Please summarize what has happened in this notebook as if you are explaining it to someone who has never heard of document classification or machine learning.

### Your answer to Question 5

goes here

### Question 6

Now please take a look at the authors' original paper (https://arxiv.org/abs/1610.08914). What did they do with these Wikipedia comments? What was their larger goal?

### Your answer to Question 6

goes here

### Question 7

Please read the Document Classification chapter of our in-progress "textbook" and use bullet points to indicate 5 things you learned and/or constructive suggestions.

### Your answer to Question 7

goes here