# Wikipedia Talk Data - Getting Started

This notebook gives an introduction to working with the various data sets in the [Wikipedia
Talk](https://figshare.com/projects/Wikipedia_Talk/16731) project on Figshare. The release includes:

1. a large historical corpus of discussion comments on Wikipedia talk pages
2. a sample of over 100k comments with human labels for whether the comment contains a personal attack
3. a sample of over 100k comments with human labels for whether the comment has aggressive tone

Please refer to our [wiki](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each data set and our [research paper](https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. 

In this notebook we show how to build a simple classifier for detecting personal attacks and apply the classifier to a random sample of the comment corpus to see whether discussions on user pages have more personal attacks than discussions on article pages.

## Building a classifier for personal attacks
In this section we will train a simple bag-of-words classifier for personal attacks using the [Wikipedia Talk Labels: Personal Attacks]() data set.

#### Use Python 3

In [1]:
import pandas as pd
import urllib.request  # urlretrieve lives in the urllib.request submodule
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [2]:
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 


def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)

                
download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')

In [2]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [3]:
len(annotations['rev_id'].unique())

115864

In [4]:
# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

In [5]:
# join labels and comments
comments['attack'] = labels
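
The majority-vote rule above can be sketched in plain Python on a few made-up annotations. The `toy_annotations` data and the `majority_labels` helper are hypothetical and only mirror the `groupby(...).mean() > 0.5` logic. Because the comparison is strict, a comment with an exact tie between annotators is not labeled as an attack.

```python
from collections import defaultdict

# Hypothetical toy annotations: (rev_id, attack) pairs, one per annotator judgment.
toy_annotations = [
    (1, 1), (1, 1), (1, 0),   # 2 of 3 annotators say attack
    (2, 0), (2, 1),           # 1 of 2 annotators say attack (a tie)
    (3, 0), (3, 0), (3, 0),   # no annotator says attack
]


def majority_labels(annotations, threshold=0.5):
    """Label a revision as an attack when the mean vote is strictly above threshold."""
    votes = defaultdict(list)
    for rev_id, attack in annotations:
        votes[rev_id].append(attack)
    return {rev_id: sum(v) / len(v) > threshold for rev_id, v in votes.items()}


toy_labels = majority_labels(toy_annotations)
print(toy_labels)
```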

In [6]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

a. What are the text cleaning methods you tried? What are the ones you have included in the
final code?<br>
I tried removing all punctuation, removing all numbers, converting the text to lowercase, and removing stop words. The final code includes the first three because together they increase precision while leaving ROC AUC, recall, and accuracy unchanged. Removing stop words increased recall but decreased ROC AUC, precision, and accuracy, so I left it out of the final code.
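
As a rough illustration, the three cleaning steps that were kept can be combined into one helper. The `clean_comment` function and the sample string are hypothetical and not part of the original notebook. The sketch replaces the `NEWLINE_TOKEN` and `TAB_TOKEN` markers first, because they contain an underscore that the punctuation filter would otherwise strip, leaving fragments such as `NEWLINETOKEN`.

```python
import string

# Translation table that deletes punctuation and digits in a single pass.
_DROP = str.maketrans('', '', string.punctuation + string.digits)


def clean_comment(text):
    """Replace the data set's newline/tab markers, drop punctuation and digits, lowercase."""
    text = text.replace('NEWLINE_TOKEN', ' ').replace('TAB_TOKEN', ' ')
    return text.translate(_DROP).lower()


sample = 'Hello, World!NEWLINE_TOKENRevert #42 NOW.'
cleaned = clean_comment(sample)
print(repr(cleaned))
```

Deleting characters leaves the surrounding whitespace in place, so runs of spaces can appear where punctuation or digits used to be.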

In [7]:
import string
# remove all punctuation
comments['comment'] = comments['comment'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

# remove all numbers
comments['comment'] = comments['comment'].apply(lambda x: x.translate(str.maketrans('', '', string.digits)))

# change into lowercase
comments['comment'] = comments['comment'].apply(lambda x: x.lower())

# remove stop words
#from nltk.corpus import stopwords
#stop = set(stopwords.words('english'))
#comments['comment'] = comments['comment'].str.split()
#comments['comment'] = comments['comment'].apply(lambda x : [word for word in x if word not in stop])
#comments['comment'] = comments['comment'].str.join(' ')

In [8]:
comments.query('attack')['comment'].head()

rev_id
801279                   iraq is not good      usa is bad   
2702703       fuck off you little asshole if you want to ...
4632658           i have a dick its bigger than yours hahaha
6545332       renault   you sad little bpy for driving a ...
6545351       renault   you sad little bo for driving a r...
Name: comment, dtype: object

b. What are the features you considered using? What features did you use in the final code?<br>
I considered word n-grams and character n-grams. With word n-grams, ROC AUC, precision, recall, and accuracy decreased as n increased from 1 to 4. With character n-grams, all four metrics increased as n increased from 1 to 4. Combining word and character n-grams performed better than using either alone. The final code combines word unigrams with character n-grams up to length 4 (`ngram_range = (1,4)`). When I combined word unigrams with character n-grams longer than 4 (I tried 5 and 6), precision, recall, and accuracy started to decrease. One possible reason is that long character sequences approach word length, so they add little beyond what the word unigrams already capture.
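
To make the two feature types concrete, here is a minimal pure-Python sketch of word and character n-gram extraction. The helpers `word_ngrams` and `char_ngrams` are hypothetical and are not how scikit-learn builds its features. Its vectorizers add details this sketch omits, such as a default word token pattern that drops single-character tokens. Note that a range of 1 to 4 produces every length from 1 through 4, not only 4-grams.

```python
def word_ngrams(text, n):
    """Contiguous runs of n whitespace-separated tokens."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min, n_max):
    """All character substrings with length between n_min and n_max, inclusive."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


sample = 'you are sad'
unigrams = word_ngrams(sample, 1)
char_4grams = char_ngrams(sample, 4, 4)
char_1_to_4 = char_ngrams(sample, 1, 4)

print(unigrams)
print(char_4grams)
print(len(char_1_to_4))
```

Character n-grams cross word boundaries (for example `'ou a'`), which is one reason they can still match misspelled or obfuscated words that a word unigram would miss.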

c. What optimizations did you add in your code, if any?<br>
I don't have additional optimizations in my code.

In [9]:
# fit a simple text classifier

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# use word unigram and character 1- to 4-gram features
vectorizerW = TfidfVectorizer(ngram_range = (1,1), analyzer = 'word')
vectorizerC = TfidfVectorizer(ngram_range = (1,4), analyzer = 'char')
combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])

from sklearn.calibration import CalibratedClassifierCV

clf = Pipeline([
    ('features', combined_features),
    ('classifier', LogisticRegression(penalty = 'l2', C = 1.5, solver = 'liblinear'))
])

d. What are the ML methods you tried out, and what were your best results with each method? Which was the best ML method you saw before tuning hyperparameters?<br>
I tried LogisticRegression (from the strawman code), MultinomialNB, RandomForestClassifier, SVC, MLPClassifier, and LinearSVC. I have no results for SVC and MLPClassifier because they ran very slowly.<br>

LogisticRegression:<br>
Test ROC AUC: 0.966<br>
Test precision: 0.934<br>
Test recall: 0.798<br>
Test confusion matrix:<br>
[[20275   147]<br>
 [ 1091  1665]]<br>
Test accuracy: 0.947<br>

MultinomialNB:<br>
Test ROC AUC: 0.815<br>
Test precision: 0.930<br>
Test recall: 0.571<br>
Test confusion matrix:<br>
[[20407    15]<br>
 [ 2362   394]]<br>
Test accuracy: 0.897<br>

RandomForestClassifier:<br>
Test ROC AUC: 0.888<br>
Test precision: 0.933<br>
Test recall: 0.665<br>
Test confusion matrix:<br>
[[20373    49]<br>
 [ 1842   914]]<br>
Test accuracy: 0.918<br>

LinearSVC:<br>
Test ROC AUC: 0.962<br>
Test precision: 0.926<br>
Test recall: 0.817<br>
Test confusion matrix:<br>
[[20222   200]<br>
 [  981  1775]]<br>
Test accuracy: 0.949<br>

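The best ML method before tuning hyperparameters was LogisticRegression, which had the highest ROC AUC (0.966) and precision (0.934). LinearSVC was close behind and had slightly higher recall (0.817) and accuracy (0.949).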

e. What hyperparameters tuning did you do, and by how many percentage points did your
accuracy go up?<br>
I tuned the penalty, C, and solver hyperparameters of LogisticRegression. I first attempted GridSearchCV, but it ran too slowly to yield results, so I tuned them manually. I tried the liblinear, sag, saga, and newton-cg solvers, the l1 and l2 penalties, and C values of 0.01, 0.1, 1, 1.2, 1.5, 2, 3, 5, and 10. The best combination was the liblinear solver with the l2 penalty and C = 1.5. Accuracy went up from 0.947 to 0.948, about 0.1 percentage points (a relative gain of 0.106%).
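
The size of this search space helps explain why an exhaustive search is slow. The sketch below enumerates the values listed above with `itertools.product`. The compatibility set `L1_CAPABLE` is an assumption based on the scikit-learn documentation for `LogisticRegression`, where only some solvers accept the l1 penalty, and should be checked against the installed version. The 5-fold setting is also an assumption for illustration.

```python
from itertools import product

solvers = ['liblinear', 'sag', 'saga', 'newton-cg']
penalties = ['l1', 'l2']
c_values = [0.01, 0.1, 1, 1.2, 1.5, 2, 3, 5, 10]

# Assumption: solvers that accept the l1 penalty, per the scikit-learn docs.
L1_CAPABLE = {'liblinear', 'saga'}

# Every solver/penalty/C triple, then only the ones assumed to be valid.
all_combos = list(product(solvers, penalties, c_values))
valid_combos = [(s, p, c) for s, p, c in all_combos
                if p == 'l2' or s in L1_CAPABLE]

cv_folds = 5  # assumed number of folds per candidate
total_fits = len(valid_combos) * cv_folds
print(len(all_combos), len(valid_combos), total_fits)
```

Each fit here includes building word and character n-gram features for the whole training set, so restricting the grid to compatible combinations, or searching a few values at a time, may cut the cost considerably.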

In [10]:
#from sklearn.model_selection import GridSearchCV
#import numpy as np

#param_grid = dict(
    #classifier__penalty = ['l1', 'l2'],    
    #classifier__C = np.logspace(0, 5, 10),
    #classifier__solver = ['liblinear', 'sag', 'saga', 'newton-cg']
#)
#clf = GridSearchCV(clf, param_grid = param_grid, n_jobs = -1)
clf = clf.fit(train_comments['comment'], train_comments['attack'])
#print(clf.best_score_)
#for param_name in sorted(param_grid.keys()):
    #print("%s: %r" % (param_name, clf.best_params_[param_name]))

In [11]:
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix

trueVals = test_comments['attack']
predictedVals = clf.predict(test_comments['comment'])
precision, recall, fscore, support = precision_recall_fscore_support(trueVals, predictedVals, average = 'macro')
print('Test precision: %.3f' %precision)
print('Test recall: %.3f' %recall)
print('Test confusion matrix:')
print(confusion_matrix(trueVals, predictedVals))

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(trueVals, predictedVals)
print('Test accuracy: %.3f' %accuracy)

Test ROC AUC: 0.967
Test precision: 0.932
Test recall: 0.805
Test confusion matrix:
[[20258   164]
 [ 1050  1706]]
Test accuracy: 0.948


f. What did you learn from the different metrics? Did you try cross-validation?<br>
I learned that each metric captures a different aspect of performance. ROC AUC measures how well the model ranks attacks above non-attacks across all decision thresholds, so it does not depend on a particular cutoff. Precision is the fraction of comments predicted to be in a class that truly belong to it, and recall is the fraction of comments in a class that the classifier finds. The precision and recall reported here are macro-averaged over the two classes. The confusion matrix tabulates the number of correct classifications and misclassifications per class, which shows where the errors fall. Accuracy is the fraction of all predictions that are correct. Because only about 12% of the test comments are attacks, accuracy alone can look high even when many attacks are missed, so the metrics are most informative together. I tried cross-validation with k = 5. The cross-validation results are stable and close to the test results. I think 5-fold cross-validation is good enough to validate the stability of the results and can be finished in a reasonable amount of time.
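
As a worked check, the reported precision, recall, and accuracy can be recomputed from the final confusion matrix alone. The sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and the macro averaging used in the evaluation cell. The variable names are illustrative.

```python
# Final test confusion matrix: [[TN, FP], [FN, TP]] with "attack" as the positive class.
tn, fp = 20258, 164
fn, tp = 1050, 1706

# Per-class precision: of the comments predicted as a class, how many were right.
precision_attack = tp / (tp + fp)
precision_other = tn / (tn + fn)

# Per-class recall: of the comments truly in a class, how many were found.
recall_attack = tp / (tp + fn)
recall_other = tn / (tn + fp)

# Macro averaging weights both classes equally, regardless of their size.
macro_precision = (precision_attack + precision_other) / 2
macro_recall = (recall_attack + recall_other) / 2
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'macro precision: {macro_precision:.3f}')
print(f'macro recall:    {macro_recall:.3f}')
print(f'accuracy:        {accuracy:.3f}')
print(f'attack recall:   {recall_attack:.3f}')
```

The macro recall of 0.805 averages a recall of about 0.992 on non-attacks with about 0.619 on attacks, so the per-class numbers are worth reading alongside the averaged ones.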

g. What are your best final Result Metrics? By how much is it better than the strawman figure?
Which model gave you this performance?<br>
My best final result metrics are:<br>
Test ROC AUC: 0.967<br>
Test precision: 0.932<br>
Test recall: 0.805<br>
Test confusion matrix:<br>
[[20258   164]<br>
 [ 1050  1706]]<br>
Test accuracy: 0.948<br>
Compared to the strawman results, ROC AUC improves from 0.957 to 0.967 (a relative gain of 1.04%), precision from 0.929 to 0.932 (0.323%), recall from 0.772 to 0.805 (4.27%), and accuracy from 0.941 to 0.948 (0.744%). In the confusion matrix, false positives rise from 142 to 164 while false negatives fall from 1236 to 1050. The LogisticRegression model gave me this performance.
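
The percentages above are relative gains, which differ from gains in percentage points. The short sketch below computes both from the rounded strawman and final figures. The `improvement` helper is hypothetical.

```python
strawman = {'roc_auc': 0.957, 'precision': 0.929, 'recall': 0.772, 'accuracy': 0.941}
final = {'roc_auc': 0.967, 'precision': 0.932, 'recall': 0.805, 'accuracy': 0.948}


def improvement(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = (after - before) * 100
    relative = (after - before) / before * 100
    return points, relative


gains = {name: improvement(strawman[name], final[name]) for name in strawman}
for name, (points, relative) in gains.items():
    print(f'{name}: +{points:.1f} points, +{relative:.2f}% relative')
```

For recall, for example, the gain is 3.3 percentage points, which corresponds to a 4.27% relative improvement over the strawman value.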

In [12]:
from sklearn.model_selection import cross_val_score
import numpy as np

cross_auc = cross_val_score(clf, comments['comment'], comments['attack'], scoring = 'roc_auc', cv = 5)
print('Cross validation ROC AUC: ', np.around(cross_auc, 3))

Cross validation ROC AUC:  [0.962 0.969 0.969 0.969 0.966]


In [13]:
cross_precision = cross_val_score(clf, comments['comment'], comments['attack'], scoring = 'precision_macro', cv = 5)
print('Cross validation precision: ', np.around(cross_precision, 3))

Cross validation precision:  [0.918 0.923 0.911 0.921 0.921]


In [14]:
cross_recall = cross_val_score(clf, comments['comment'], comments['attack'], scoring = 'recall_macro', cv = 5)
print('Cross validation recall: ', np.around(cross_recall, 3))

Cross validation recall:  [0.806 0.818 0.834 0.815 0.81 ]


In [15]:
cross_accuracy = cross_val_score(clf, comments['comment'], comments['attack'], scoring = 'accuracy', cv = 5)
print('Cross validation accuracy: ', np.around(cross_accuracy, 3))

Cross validation accuracy:  [0.946 0.949 0.95  0.949 0.947]
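
One way to judge the stability of the folds is to summarize each metric by its mean and sample standard deviation. The sketch below does this with the standard library, using the fold scores printed above. The `fold_scores` and `summary` names are illustrative.

```python
from statistics import mean, stdev

# Per-fold scores copied from the cross-validation outputs above.
fold_scores = {
    'roc_auc':   [0.962, 0.969, 0.969, 0.969, 0.966],
    'precision': [0.918, 0.923, 0.911, 0.921, 0.921],
    'recall':    [0.806, 0.818, 0.834, 0.815, 0.810],
    'accuracy':  [0.946, 0.949, 0.950, 0.949, 0.947],
}

# Mean and sample standard deviation across the five folds for each metric.
summary = {name: (mean(scores), stdev(scores)) for name, scores in fold_scores.items()}
for name, (avg, spread) in summary.items():
    print(f'{name}: mean {avg:.3f}, std {spread:.3f}')
```

A small spread relative to the mean suggests the test-set figures are not an artifact of one particular split.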


In [16]:
# correctly classify nice comment
clf.predict(['Thanks for you contribution, you did a great job!'])

array([False])

In [17]:
# correctly classify nasty comment
clf.predict(['People as stupid as you should not edit Wikipedia!'])

array([ True])

h. What is the most interesting thing you learned from doing the report?<br>
The most interesting thing I learned is how to improve a machine learning model incrementally, for example by changing one variable while keeping everything else constant. The order in which these modifications are tried may affect the final results.

i. What was the hardest thing to do?<br>
The hardest thing was tuning the hyperparameters. Because GridSearchCV was very slow to run, I tuned them manually and improved the LogisticRegression model only slightly. If I could make use of GridSearchCV or similar tools, I could try many more hyperparameter combinations and possibly improve the model further.