In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

sns.set_style("whitegrid")

## Loading and examining the PAN-WVC-10 dataset

`edits.csv` lists every edit in the dataset, uniquely indexed by `editid`. Note in particular the `oldrevisionid` and `newrevisionid` fields, which let us look up more information about each edit through the Wikipedia API. The `editor` field may also let us look up the editor's activity on the site, though that still needs checking.
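
As a rough illustration of how the revision IDs could be used, the sketch below builds a MediaWiki `action=compare` request URL from an `oldrevisionid`/`newrevisionid` pair. The helper name `compare_url` is hypothetical, the endpoint is assumed to be the standard English Wikipedia API, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the standard English Wikipedia API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"


def compare_url(old_rev: int, new_rev: int) -> str:
    """Build a MediaWiki `action=compare` URL for two revision IDs (hypothetical helper)."""
    params = {
        "action": "compare",
        "fromrev": old_rev,
        "torev": new_rev,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


# Revision IDs taken from the row with editid 1.
url = compare_url(328391343, 328391582)
print(url)
```

Fetching the URL (for example with `requests`) should return a JSON description of the diff, but rate limits and response handling are left out here.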

In [70]:
edits = pd.read_csv("pan-wikipedia-vandalism-corpus-2010/edits.csv", index_col="editid", parse_dates=['edittime'])
edits

editid,editor,oldrevisionid,newrevisionid,diffurl,edittime,editcomment,articleid,articletitle
1,TheHeartbreakKid15,328391343,328391582,http://en.wikipedia.org/w/index.php?diff=32839...,2009-11-28 15:21:18+00:00,/* Episodes */,24477266,Top Gear (series 14)
2,Stepopen,327585467,327607921,http://en.wikipedia.org/w/index.php?diff=32760...,2009-11-24 04:43:37+00:00,removed factually wrong information,476288,List of United Nations resolutions concerning ...
3,93.6.135.185,328227083,328242890,http://en.wikipedia.org/w/index.php?diff=32824...,2009-11-27 18:22:12+00:00,/* History */,174853,W.A.S.P.
4,Plasticspork,314955274,327191082,http://en.wikipedia.org/w/index.php?diff=32719...,2009-11-21 23:12:24+00:00,Clean infobox + general fixes using [[Project:...,1418363,Psusennes II
5,Thatguyflint,329276563,329276581,http://en.wikipedia.org/w/index.php?diff=32927...,2009-12-02 17:45:02+00:00,Reverted edits by [[Special:Contributions/151....,1930796,"James W. Robinson, Jr."
...,...,...,...,...,...,...,...,...
45962,Wrestlinglover,326861545,326861667,http://en.wikipedia.org/w/index.php?diff=32686...,2009-11-20 03:19:39+00:00,/* In wrestling */,438904,Jason Reso
45963,DanDud88,329277631,329278946,http://en.wikipedia.org/w/index.php?diff=32927...,2009-12-02 17:57:33+00:00,/* See also */,24504237,List of United Kingdom Christmas television ep...
45964,Freekra,328364418,328365524,http://en.wikipedia.org/w/index.php?diff=32836...,2009-11-28 11:49:57+00:00,/* Notable cases (and an oft-cited non-case) */,46187,Lobotomy
45965,67.52.48.226,325484005,329268511,http://en.wikipedia.org/w/index.php?diff=32926...,2009-12-02 16:56:44+00:00,/* Characters */,22849170,Succubus Blues


In [71]:
edits.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32439 entries, 1 to 45966
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   editor         32439 non-null  object             
 1   oldrevisionid  32439 non-null  int64              
 2   newrevisionid  32439 non-null  int64              
 3   diffurl        32439 non-null  object             
 4   edittime       32439 non-null  datetime64[ns, UTC]
 5   editcomment    25009 non-null  object             
 6   articleid      32439 non-null  int64              
 7   articletitle   32439 non-null  object             
dtypes: datetime64[ns, UTC](1), int64(3), object(4)
memory usage: 2.2+ MB


`gold-annotations.csv` contains the labels (either 'regular' or 'vandalism') assigned by humans to each edit. `annotators` indicates how many annotators out of `totalannotators` concurred with the label assigned. The label is assigned via a simple majority among all annotators shown the respective edit.

Note that annotators were allowed to say that they were unsure whether the edit constitutes vandalism. Thus, the annotators who did not agree with a 'regular' label did not necessarily consider it 'vandalism'. Some of the dissenters may have marked it as 'dunno'.

In [73]:
editLabels = pd.read_csv("pan-wikipedia-vandalism-corpus-2010/gold-annotations.csv", index_col="editid", dtype={'class': 'category'})

In [74]:
editLabels.head()


editid,class,annotators,totalannotators
1,regular,3,3
2,regular,10,18
3,regular,3,3
4,regular,3,3
5,regular,5,6


In [75]:
editLabels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32439 entries, 1 to 45966
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   class            32439 non-null  category
 1   annotators       32439 non-null  int64   
 2   totalannotators  32439 non-null  int64   
dtypes: category(1), int64(2)
memory usage: 792.0 KB


In [76]:
editLabels['class'].cat.categories

Index(['regular', 'vandalism'], dtype='object')
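
The two count columns can be combined into a single agreement score. The sketch below is a minimal example that rebuilds the five rows shown by `editLabels.head()` as a toy frame; the column name `agreement` is an assumption for illustration, not part of the corpus.

```python
import pandas as pd

# Toy stand-in for `editLabels`, using the five rows printed by `editLabels.head()`.
labels = pd.DataFrame(
    {
        "class": ["regular"] * 5,
        "annotators": [3, 10, 3, 3, 5],
        "totalannotators": [3, 18, 3, 3, 6],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="editid"),
)

# `agreement` is a hypothetical column: the share of annotators backing the label.
labels["agreement"] = labels["annotators"] / labels["totalannotators"]
print(labels)
```

Edit 2 stands out here: only 10 of its 18 annotators agreed, so its 'regular' label is much less certain than the unanimous ones.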

`annotations.csv` records the decision made by each annotator for each edit. `editid` is not a unique key for this file, since every edit was shown to several annotators. The pair `(editid, annotatorid)` looks like the natural key, but the output for edit 415 further down lists annotator 6 twice, so it may not be strictly unique either. To keep things simple we let `pandas` index this DataFrame with a positional index instead.

In [80]:
anno = pd.read_csv("pan-wikipedia-vandalism-corpus-2010/annotations.csv", parse_dates=['submittime'], date_format='%a %b %d %H:%M:%S %Z %Y', dtype={'class': 'category'})

In [81]:
anno

,editid,annotatorid,class,decisiontime,submittime
0,1642,83,no,7755,2010-03-02 19:46:32+00:00
1,1643,83,no,21713,2010-03-02 19:46:32+00:00
2,1641,83,no,11653,2010-03-02 19:46:32+00:00
3,1640,83,no,10387,2010-03-02 19:46:32+00:00
4,1639,83,no,10776,2010-03-02 19:46:32+00:00
...,...,...,...,...,...
153615,33828,64,yes,61538,2010-03-22 21:34:59+00:00
153616,34867,64,no,8183,2010-03-22 21:34:59+00:00
153617,14239,64,yes,24611,2010-03-22 21:34:59+00:00
153618,24152,64,yes,28217,2010-03-22 21:34:59+00:00


In [82]:
anno.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153620 entries, 0 to 153619
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype              
---  ------        --------------   -----              
 0   editid        153620 non-null  int64              
 1   annotatorid   153620 non-null  int64              
 2   class         153620 non-null  category           
 3   decisiontime  153620 non-null  int64              
 4   submittime    153620 non-null  datetime64[ns, UTC]
dtypes: category(1), datetime64[ns, UTC](1), int64(3)
memory usage: 4.8 MB
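
Whether `(editid, annotatorid)` really identifies a row can be checked directly. The sketch below uses a small hand-built frame in place of `anno`, with values copied from the rows printed for edit 415 later in this notebook; on the real data the same `duplicated` call would run against `anno`.

```python
import pandas as pd

# Toy stand-in for `anno`: four of the rows printed for editid 415.
toy = pd.DataFrame(
    {
        "editid": [415, 415, 415, 415],
        "annotatorid": [566, 52, 6, 6],
        "class": ["yes", "no", "no", "no"],
    }
)

key = ["editid", "annotatorid"]
# keep=False flags every row in a repeated group, not only the later copies.
dupes = toy[toy.duplicated(subset=key, keep=False)]
is_unique = dupes.empty
print(is_unique)
print(dupes)
```

If duplicates turn up, one option is to keep only the latest `submittime` per pair before counting votes, though that is a modelling choice rather than something the corpus prescribes.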


In [89]:
anno['class'].cat.categories

Index(['dunno', 'error', 'no', 'yes'], dtype='object')

Huh. I wonder how many entries have `'error'` as the value for `'class'`...

In [91]:
anno[anno['class']=='error']

,editid,annotatorid,class,decisiontime,submittime
4981,8749,74,error,21655,2010-02-28 12:22:28+00:00
5672,42659,56,error,7887,2010-02-24 00:21:03+00:00
5674,42661,56,error,5602,2010-02-24 00:21:03+00:00
5939,9178,411,error,4160,2010-03-10 19:36:24+00:00
7229,41119,57,error,15712,2010-02-26 01:41:45+00:00
...,...,...,...,...,...
149772,26361,20,error,11953,2010-03-01 19:19:47+00:00
149813,40435,20,error,18090,2010-02-24 21:23:08+00:00
149871,43642,20,error,38587,2010-02-24 21:29:25+00:00
150193,16162,136,error,3604,2010-03-05 22:14:53+00:00
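
The filter above shows the matching rows but not how many there are. A quick way to get the number is sketched below on made-up toy data (the five values are hypothetical, not taken from the corpus); the category list matches the one printed for `anno['class']`.

```python
import pandas as pd

# Hypothetical toy column with the same four categories as `anno['class']`.
toy = pd.DataFrame(
    {
        "class": pd.Categorical(
            ["no", "yes", "error", "no", "error"],
            categories=["dunno", "error", "no", "yes"],
        )
    }
)

# A boolean mask sums to the number of True entries.
n_error = int((toy["class"] == "error").sum())
# value_counts() gives the tally for every category, including unused ones.
counts = toy["class"].value_counts()
print(n_error)
print(counts)
```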


In [92]:
annoByEditID = anno.groupby('editid')

In [102]:
for name, group in annoByEditID:
    if (group['class'] == 'error').any():
        print(name)
        print(group)
        print(len(group))

415
        editid  annotatorid  class  decisiontime                submittime
13683      415          566    yes          7078 2010-03-11 04:43:14+00:00
18034      415           52     no         12171 2010-03-11 05:19:11+00:00
22964      415           51     no          3937 2010-03-23 15:48:43+00:00
23332      415           31     no          8184 2010-03-03 15:54:33+00:00
36318      415           37    yes         43098 2010-03-02 18:33:47+00:00
36517      415          715  error         25679 2010-03-15 10:34:09+00:00
77851      415            6     no          3813 2010-03-14 03:07:36+00:00
79076      415            6     no          9348 2010-03-20 19:41:24+00:00
83266      415          544    yes          3096 2010-03-14 12:36:12+00:00
85426      415          328     no         16485 2010-03-19 22:51:22+00:00
86165      415          439     no         10686 2010-03-14 12:46:45+00:00
99741      415           53     no          9394 2010-03-23 13:19:16+00:00
100654     415       

In [103]:
# Use `edit_id` rather than `id` to avoid shadowing the built-in `id()`
for edit_id, group in annoByEditID:
    if (group['class'] == 'error').any():
        print(edit_id)
        print(group['class'].value_counts())

415
class
no       13
yes       4
error     1
dunno     0
Name: count, dtype: int64
1626
class
no       7
error    2
dunno    0
yes      0
Name: count, dtype: int64
1879
class
no       5
error    1
dunno    0
yes      0
Name: count, dtype: int64
2682
class
no       5
error    1
dunno    0
yes      0
Name: count, dtype: int64
3214
class
no       5
error    1
dunno    0
yes      0
Name: count, dtype: int64
4251
class
no       6
error    1
dunno    0
yes      0
Name: count, dtype: int64
5064
class
yes      11
error     1
dunno     0
no        0
Name: count, dtype: int64
5461
class
no       5
error    1
dunno    0
yes      0
Name: count, dtype: int64
5508
class
no       5
dunno    1
error    1
yes      0
Name: count, dtype: int64
5564
class
yes      6
error    1
no       1
dunno    0
Name: count, dtype: int64
5584
class
no       3
error    1
dunno    0
yes      0
Name: count, dtype: int64
6587
class
no       13
error     3
yes       2
dunno     0
Name: count, dtype: int64
7126
class
no    
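
The same per-edit tallies can be produced without a Python loop. The sketch below uses `pd.crosstab` on hypothetical toy data (the edit IDs and responses are invented for illustration) to build one row per edit and one column per response, then keeps the edits with at least one `'error'`.

```python
import pandas as pd

# Hypothetical toy annotations standing in for `anno`.
toy = pd.DataFrame(
    {
        "editid": [10, 10, 10, 11, 11, 12, 12, 12],
        "class": ["no", "no", "error", "yes", "yes", "no", "dunno", "error"],
    }
)

# One row per edit, one column per response class.
counts = pd.crosstab(toy["editid"], toy["class"])
# Keep only the edits that received at least one 'error' response.
with_error = counts[counts["error"] > 0]
print(with_error)
```

A single table like this is also easier to join back onto `editLabels` than the printed per-group output.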

Let's see how the `annotators` and `totalannotators` counts are affected by the presence of annotations with value `'error'` for `'class'`...

In [113]:
editLabels.loc[415]

class              regular
annotators              13
totalannotators         18
Name: 415, dtype: object

In [114]:
editLabels.loc[1626]

class              regular
annotators               7
totalannotators          9
Name: 1626, dtype: object

So it looks like `annotators` is the count for the most common response among `{'yes', 'no', 'dunno', 'error'}`, while `totalannotators` counts every response, `'error'` included: edit 415 has 13 + 4 + 1 = 18 responses and edit 1626 has 7 + 2 = 9. If we end up using `totalannotators` as a feature, I would think it makes sense to exclude the `'error'` responses from it.

The `'dunno'` responses do contain information though, indicating that the edit in question might be borderline.
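
One way to act on both observations is sketched below: drop `'error'` from the total and keep the share of `'dunno'` responses as a separate signal. The response counts are copied from the per-edit output above for edits 415, 1626 and 5508; the names `valid_total` and `dunno_share` are hypothetical feature names, not columns of the corpus.

```python
import pandas as pd

# Response counts copied from the per-edit value_counts() output above.
counts = pd.DataFrame(
    {
        "dunno": [0, 0, 1],
        "error": [1, 2, 1],
        "no": [13, 7, 5],
        "yes": [4, 0, 0],
    },
    index=pd.Index([415, 1626, 5508], name="editid"),
)

features = pd.DataFrame(index=counts.index)
# Total responses with 'error' left out.
features["valid_total"] = counts[["dunno", "no", "yes"]].sum(axis=1)
# Share of the remaining annotators who were unsure.
features["dunno_share"] = counts["dunno"] / features["valid_total"]
print(features)
```

Edits where every response is `'error'` would give a zero denominator, so the real data may need a guard for that case.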