This notebook contains the code used to generate temporal variation patterns for posts from SuicideWatch on Reddit and for a control set of posts from AskReddit, as part of a study of when users post. Here, the focus is on the day of the week.

The underlying data is from https://redd.it/3mg812. To get access to the subsets used in this study, please contact us directly.

The key steps for the analysis are the following:
- We start with the published/downloaded archive and generate subsets for the analysis: SuicideWatch and AskReddit
- We cast the timestamp to datetime (0 posts were lost in this transformation, i.e. all timestamps were cast to datetime successfully)
- We only keep posts and filter out comments by keeping messages that have no parent_id (NB: a message that has a valid parent_id means this message is a comment; this is documented in the Reddit pages)
- We calculate the first Monday (firstMonday) and the last Sunday (lastSunday) from within our dataset. We filter out all posts that are before firstMonday or after lastSunday.
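
The whole-week trimming in the last step can be illustrated without pandas. The following is a minimal sketch on toy data, not the notebook's own implementation: the helper names `whole_week_bounds` and `trim_to_whole_weeks` and the toy timestamps are assumptions made for illustration only.

```python
import datetime


def whole_week_bounds(timestamps):
    """Return (first Monday, last Sunday) found among the given datetimes.

    Raises ValueError if the data contains no Monday or no Sunday.
    """
    mondays = [t for t in timestamps if t.weekday() == 0]  # Monday == 0
    sundays = [t for t in timestamps if t.weekday() == 6]  # Sunday == 6
    return min(mondays), max(sundays)


def trim_to_whole_weeks(timestamps):
    """Drop everything before the first Monday or after the last Sunday."""
    first_monday, last_sunday = whole_week_bounds(timestamps)
    return [t for t in timestamps if first_monday <= t <= last_sunday]


# Toy data (hypothetical): one post per day at noon,
# from Thursday 2015-01-01 to Tuesday 2015-01-20.
start = datetime.datetime(2015, 1, 1, 12, 0)
toy_posts = [start + datetime.timedelta(days=i) for i in range(20)]

trimmed = trim_to_whole_weeks(toy_posts)
print(len(trimmed), trimmed[0].date(), trimmed[-1].date())
```

Trimming to whole Monday-to-Sunday weeks means every weekday is represented equally often in the window, so weekday shares are not skewed by a partial first or last week.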

Written by George Gkotsis, with input from Sumithra Velupillai, King's College London, 2016-2019

In [2]:
import pandas as pd

In [25]:
# 1. suicidewatch
# df = pd.read_pickle("../reddit/suicidewatch.pickle")
# fname = "Suicidewatch-weekly"

# 2. suicidewatch - throwaway
# df = pd.read_pickle("../reddit/suicidewatch.pickle")
# df = df[df['author'].str.contains('throw', case=False)]
# fname = "suicidewatch-throwaway-weekly"

# 3. AskReddit
# df = pd.read_pickle("AskReddit_min.pickle")
# fname = "AskReddit-weekly"

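# 4. AskReddit - control
# AskReddit messages written by authors who also posted in SuicideWatch
df = pd.read_pickle("AskReddit_min.pickle")
authors = pd.read_pickle("../reddit/suicidewatch.pickle")
# keep SuicideWatch posts only (comments have a parent_id)
authors = authors[authors['parent_id'].astype(str)=='nan']
authors = set(authors['author'].unique())
# '[deleted]' is a placeholder author name, not a single user
authors.discard('[deleted]')
df = df[df['author'].isin(authors)]
fname = "AskReddit-control-weekly"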

In [23]:
def doDT(st):
    """Convert a Unix timestamp to a datetime (local time zone); return None on failure."""
    import datetime
    try:
        return datetime.datetime.fromtimestamp(st)
    except Exception:
        return None

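def analyseDFWeek(df):
    import datetime
    # Keep only posts: comments carry a parent_id, posts do not.
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below.
    if 'parent_id' in df.columns:
        posts = df[df['parent_id'].astype(str)=='nan'].copy()
    else:
        posts = df.copy()
    posts['created'] = posts['created'].apply(doDT)
    a = len(posts)
    posts = posts.dropna(subset=['created'])
    b = len(posts)
    # Number of posts whose timestamp could not be converted
    print("lost %d posts" % (a - b))
    posts = posts.set_index('created')
    # Trim to whole weeks: first Monday to last Sunday in the data
    firstMonday = posts[posts.index.dayofweek==0].index.min()
    lastSunday = posts[posts.index.dayofweek==6].index.max()
    posts = posts[posts.index>=firstMonday]
    posts = posts[posts.index<=lastSunday]
    d = firstMonday.date()
    rs = pd.DataFrame()
    while d<lastSunday.date():
        end = d + datetime.timedelta(7)
        # Posts of the current week only (d <= date < end). The original code
        # applied the two filters separately, so the second overwrote the first.
        tmp = posts[(posts.index.date>=d) & (posts.index.date<end)].copy()
        tmp['weekday'] = tmp.index.weekday
        # Posts per weekday (0 = Monday ... 6 = Sunday)
        week = tmp.groupby(tmp.index.weekday)['weekday'].count()
        week = pd.DataFrame(week)
        cnt = week['weekday'].sum()
        # Share of the week's posts that fall on each weekday
        week['pct'] = week['weekday']/float(week['weekday'].sum())
        week = pd.DataFrame(week['pct']).transpose()
        week['cnt'] = cnt
        # Skip weeks with fewer than 200 posts
        if cnt<200:
            d = end
            continue
        rs = pd.concat([rs, week])
        d = end
    return rs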
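
For readers who want to check the per-week weekday shares on a small example, the following is a hedged sketch of a vectorised alternative to the loop above. It is not the code used in the study: the function name `weekday_shares`, its `min_posts` parameter and the toy data are assumptions made for illustration, and the input is assumed to be already trimmed to whole Monday-to-Sunday weeks.

```python
import pandas as pd


def weekday_shares(created, min_posts=200):
    """Per-week share of posts on each weekday (0 = Monday ... 6 = Sunday).

    `created` is a sequence of datetimes. Weeks with fewer than `min_posts`
    posts are dropped, mirroring the cnt < 200 filter in the loop above.
    """
    frame = pd.DataFrame({"created": pd.to_datetime(created)})
    frame["weekday"] = frame["created"].dt.weekday
    # Midnight of the Monday that starts the week each post falls in
    frame["week_start"] = frame["created"].dt.normalize() - pd.to_timedelta(
        frame["weekday"], unit="D"
    )
    # Rows: weeks, columns: weekdays, values: number of posts
    counts = frame.groupby(["week_start", "weekday"]).size().unstack(fill_value=0)
    counts = counts.reindex(columns=range(7), fill_value=0)
    totals = counts.sum(axis=1)
    shares = counts.div(totals, axis=0)
    shares["cnt"] = totals
    return shares[shares["cnt"] >= min_posts]


# Toy data (hypothetical): week of 2015-01-05 has 3 Monday posts and
# 1 Tuesday post; week of 2015-01-12 has a single Sunday post.
toy = pd.Series(pd.to_datetime([
    "2015-01-05 09:00", "2015-01-05 13:00", "2015-01-05 22:00",
    "2015-01-06 10:00",
    "2015-01-18 18:00",
]))
result = weekday_shares(toy, min_posts=2)
print(result)
```

Unlike the loop, this sketch always emits all seven weekday columns, filling weekdays without posts with 0 rather than leaving them missing.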

In [24]:
rs = analyseDFWeek(df)

lost 0 posts


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [25]:
rs.to_excel(fname + ".xlsx")

The code below is used for the paper and for generating its tables.

In [26]:
if 'parent_id' in df.columns:
    posts = df[df['parent_id'].astype(str)=='nan']
else:
    posts = df

In [27]:
authors = set(posts['author'].unique())

In [28]:
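# In the control subset, '[deleted]' was already excluded when df was built,
# so authors.remove('[deleted]') raises KeyError: '[deleted]'.
# discard() is a no-op when the element is absent.
authors.discard('[deleted]')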

In [29]:
len(authors)

8065

In [30]:
len(posts)

66934

In [31]:
posts[posts['author'].isin(authors)]['author'].value_counts().describe()

count    8065.000000
mean        8.299318
std        29.997825
min         1.000000
25%         1.000000
50%         3.000000
75%         7.000000
max      1459.000000
Name: author, dtype: float64