In [1]:
import pandas
import numpy as np

In [2]:
SEED = 12345

## Dataset

We rely on a ground-truth dataset provided by Golzadeh et al. The dataset contains around 5,000 accounts that were manually identified as bots or humans by three raters. The rating application displayed the most recent comments made by each account, and a decision had to be made based on these comments. By default, and if there are enough comments, the 20 most recent ones are displayed. 

In their work, a label (bot or human) is attributed to each account based on these comments. Since our goal is to work at the granularity of comments rather than accounts, we asked the authors to send us, for the accounts that have at least 20 comments, the comments that were displayed to the raters. Because these comments were the basis for the decision made about each account, it is very likely that the comments belonging to a bot account (resp. a human account) were made by a bot (resp. by a human).
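
To make the labelling assumption concrete, here is a minimal sketch of how an account-level label can be propagated to each comment. The two small data frames and their contents are hypothetical and only illustrate the idea; the actual dataset was provided to us already in this per-comment form.

```python
import pandas

# Hypothetical account-level labels (illustrative values only).
accounts = pandas.DataFrame({
    "commenter": ["acct_a", "acct_b"],
    "label": ["Bot", "Human"],
})

# Hypothetical comments, as they would have been displayed to the raters.
comments = pandas.DataFrame({
    "commenter": ["acct_a", "acct_a", "acct_b"],
    "comment": ["Build failed", "Build passed", "Thanks for the review!"],
})

# Each comment inherits the label of the account that wrote it.
# validate="many_to_one" raises if an account has more than one label.
labelled = comments.merge(
    accounts, on="commenter", how="left", validate="many_to_one"
)
print(labelled)
```

The left join keeps one row per comment, so the result has the same granularity as the `comments.csv.gz` file loaded below.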

In [5]:
df = (
    pandas.read_csv('../data-raw/comments.csv.gz')
)

In [6]:
df.sample(n=10, random_state=SEED)

,commenter,comment,label
18401,b'wYeBGdDRbOzwxtkFiKTPeg==',Testing:\r\n- `npm run test-e2e`: passed\r\n- ...,Human
10684,b'ryMz7+sTtRIpb4Gcvm5llw==',"For the latest 1.6 release, I merged a few ver...",Human
6964,b'MSuTYRBQJS0vUYVvktWfXQ==',GitHub reverse integration Pull Request for co...,Bot
9206,b'Kvb12/QOb5z1r/z1gmG5fQ==',@pbteja1998 Hello from Gitcoin Core - are you ...,Bot
4661,b'7DgHJyohvrYD6EgXaCS7Ug==',**Build failed**\n[Swift Test OS X Platform](h...,Bot
3487,b'c+RTfnO9yWMJsuqS8tNiEw==',This pull request is not suitable for automati...,Bot
1672,b'vGW+/s2flqimoAvtAIKsMQ==',Closing obsolete PR.,Bot
5107,b'wW7Bo+Sq0Nf7aLBGuNvGtA==',Test live: https://calypso.live/?branch=add/co...,Bot
17936,b'L1cD+/A284M3kcu9j+ToXw==',"### Issue Description\r\n\r\nHi there, I'm get...",Human
6723,b'Me7LHfdo+lkhHQZRaKK3hQ==',\n<!--\n 0 failure: \n 3 warning: Please ad...,Bot


In [7]:
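df = (
    df
    # Flag every comment whose author was selected for training, so that
    # all comments of a given account end up on the same side of the split
    .assign(training=lambda d: d['commenter'].isin(
        df
        # Group per class
        .groupby('label', as_index=False, sort=False)
        # Within each class, sample half of the distinct accounts
        .apply(lambda g:
            g
            .drop_duplicates('commenter')
            .sample(frac=0.5, random_state=SEED)
            .commenter
        )
    ))
)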
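
The cell above splits the data by account rather than by comment: half of the accounts of each class are drawn for training, and every comment follows its author. The sketch below reproduces that idea on a hypothetical toy frame and checks that no account appears on both sides. It uses `groupby(...).sample`, a variant of the `apply`-based code above that assumes pandas 1.1 or later, so the exact accounts drawn may differ from those selected by the original cell.

```python
import pandas

SEED = 12345

# Hypothetical toy data: 4 bot accounts and 2 human accounts.
toy = pandas.DataFrame({
    "commenter": ["a", "a", "b", "c", "c", "d", "e", "e", "f"],
    "label": ["Bot"] * 6 + ["Human"] * 3,
})

# Draw half of the distinct accounts within each class.
training_accounts = (
    toy
    .drop_duplicates("commenter")
    .groupby("label")["commenter"]
    .sample(frac=0.5, random_state=SEED)
)

# A comment belongs to the training set iff its author was drawn.
toy["training"] = toy["commenter"].isin(training_accounts)

# Sanity check: the two sets share no account.
train_ids = set(toy.loc[toy["training"], "commenter"])
test_ids = set(toy.loc[~toy["training"], "commenter"])
assert train_ids.isdisjoint(test_ids)
```

Splitting at the account level keeps comments written by the same account from appearing in both the training and the test set.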

In [9]:
(
    df
    .groupby(['label'])
    .agg({'comment':'count','commenter':'nunique'})
)

label,comment,commenter
Bot,9641,519
Human,9641,4090


In [10]:
df['label'] = (df['label'] == 'Bot').astype(int)

In [12]:
df[df.training].to_csv('../data/df_training.csv.gz',index=False)
df[~df.training].to_csv('../data/df_test.csv.gz',index=False)