# Accessing Reddit Data

This notebook looks at the Pushshift dataset of Reddit submissions and comments.

The dataset consists of two files per month. For instance, for January 2021, we have RS_2021-01.zst and RC_2021-01.zst. The RS_2021-01.zst file contains all the submissions (or posts) made in January 2021, and the RC_2021-01.zst file contains all the comments made in January 2021. Each file contains **everything** posted on Reddit during that month, which is why the files are so large. You will want to filter the data to create a smaller dataset for your analyses.

In this tutorial, we are going to retrieve all the submissions and comments posted on r/AmItheAsshole in January 2021. We assume that RS_2021-01.zst and RC_2021-01.zst have been uploaded to Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive

You need the zstandard package to read the .zst files.

In [None]:
!pip install zstandard

## Filtering a .zst file to create a smaller .json file

Let's open RS_2021-01.zst and keep only the submissions posted in a particular subreddit (r/AmItheAsshole) in January 2021. Then we will write them to a smaller .json file that pandas can open easily and that can be used for the analyses.

In [None]:
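import zstandard
import os
import json
import sys
import logging
from datetime import datetime

# read_and_decode reports decoding retries through this logger
log = logging.getLogger(__name__)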

def read_and_decode(reader, chunk_size, max_window_size, previous_chunk=None, bytes_read=0):
	chunk = reader.read(chunk_size)
	bytes_read += chunk_size
	if previous_chunk is not None:
		chunk = previous_chunk + chunk
	try:
		return chunk.decode()
	except UnicodeDecodeError:
		if bytes_read > max_window_size:
			raise UnicodeError(f"Unable to decode frame after reading {bytes_read:,} bytes")
		log.info(f"Decoding error with {bytes_read:,} bytes, reading another chunk")
		return read_and_decode(reader, chunk_size, max_window_size, chunk, bytes_read)


def read_lines_zst(file_name):
	with open(file_name, 'rb') as file_handle:
		buffer = ''
		reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(file_handle)
		while True:
			chunk = read_and_decode(reader, 2**27, (2**29) * 2)

			if not chunk:
				break
			lines = (buffer + chunk).split("\n")

			for line in lines[:-1]:
				yield line, file_handle.tell()

			buffer = lines[-1]

		reader.close()

Recall that each month has two files: RS_2021-01.zst holds all submissions posted in January 2021, and RC_2021-01.zst holds all comments posted in January 2021.

First, let's go through all submissions in RS_2021-01.zst, filter submissions in r/AmItheAsshole, and save them to another file RS-2021-01_subreddit.json. You can change the field and value in the following code to change the filtering criteria.

In [None]:
file_lines = 0
file_bytes_processed = 0
created = None
field = "subreddit"
value = "AmItheAsshole"
bad_lines = 0
input_path = 'RS_2021-01.zst'
# Size of the compressed file, used for the progress percentage below
file_size = os.stat(input_path).st_size
# The with block closes the output file even if the loop fails
with open('RS-2021-01_subreddit.json', 'w') as file_written:
    for line, file_bytes_processed in read_lines_zst(input_path):
        try:
            obj = json.loads(line)
            created = datetime.utcfromtimestamp(int(obj['created_utc']))
            if obj[field] == value:
                file_written.write(json.dumps(obj) + '\n')
        except (KeyError, json.JSONDecodeError):
            bad_lines += 1
        file_lines += 1
        # Progress report: last timestamp, lines read, bad lines, compressed bytes read
        if file_lines % 100000 == 0 and created is not None:
            print(f"{created.strftime('%Y-%m-%d %H:%M:%S')} : {file_lines:,} : {bad_lines:,} : {file_bytes_processed:,}:{(file_bytes_processed / file_size) * 100:.0f}%")

Second, let's go through all comments in RC_2021-01.zst, filter comments in r/AmItheAsshole, and save them to another file, RC-2021-01_subreddit.json. You can change the field and value in the following code to change the filtering criteria.

In [None]:
file_lines = 0
file_bytes_processed = 0
created = None
field = "subreddit"
value = "AmItheAsshole"
bad_lines = 0
input_path = 'RC_2021-01.zst'
# Size of the compressed file, used for the progress percentage below
file_size = os.stat(input_path).st_size
# The with block closes the output file even if the loop fails
with open('RC-2021-01_subreddit.json', 'w') as file_written:
    for line, file_bytes_processed in read_lines_zst(input_path):
        try:
            obj = json.loads(line)
            created = datetime.utcfromtimestamp(int(obj['created_utc']))
            if obj[field] == value:
                file_written.write(json.dumps(obj) + '\n')
        except (KeyError, json.JSONDecodeError):
            bad_lines += 1
        file_lines += 1
        # Progress report: last timestamp, lines read, bad lines, compressed bytes read
        if file_lines % 100000 == 0 and created is not None:
            print(f"{created.strftime('%Y-%m-%d %H:%M:%S')} : {file_lines:,} : {bad_lines:,} : {file_bytes_processed:,}:{(file_bytes_processed / file_size) * 100:.0f}%")
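The two filtering cells above differ only in their file names, so the shared logic could be factored into a helper. The following is a minimal sketch, not part of the original notebook: `filter_lines` is a hypothetical name, and the sample records are made up so that the cell runs without the large .zst files.

```python
import io
import json


def filter_lines(lines, out_handle, field, value):
    """Write every JSON line whose `field` equals `value` to `out_handle`.

    Returns (kept, bad): the number of lines written and the number of
    lines that were not valid JSON or lacked `field`.
    """
    kept = bad = 0
    for line in lines:
        try:
            obj = json.loads(line)
            if obj[field] == value:
                out_handle.write(json.dumps(obj) + "\n")
                kept += 1
        except (KeyError, json.JSONDecodeError):
            bad += 1
    return kept, bad


# Tiny in-memory demonstration with made-up records.
sample = [
    '{"subreddit": "AmItheAsshole", "id": "a1"}',
    '{"subreddit": "pics", "id": "b2"}',
    '{"id": "c3"}',  # no "subreddit" key
    'not json',      # malformed line
]
out = io.StringIO()
kept, bad = filter_lines(sample, out, "subreddit", "AmItheAsshole")
print(kept, bad)
```

With the real data, the `lines` argument could be a generator such as `(line for line, _ in read_lines_zst('RC_2021-01.zst'))` and `out_handle` an open output file.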

You now have a smaller dataset that you can use for your analyses. The following code shows how to open a JSON file with pandas, select a subset of the comments, and save it to a CSV file. The examples below use an April 2008 comments file (RC_2008-04.json).

## Loading the data

In [None]:
import pandas as pd

In [None]:
path  = "../data/reddit/RC_2008-04.json"

In [None]:
rc_2008_04_df = pd.read_json(path, lines=True)

In [None]:
rc_2008_04_df.head()

Unnamed: 0,body,parent_id,subreddit,author,author_flair_css_class,edited,retrieved_on,name,gilded,id,...,archived,downs,score,controversiality,score_hidden,subreddit_id,ups,distinguished,author_flair_text,link_id
0,"I always thought ""why"" would be answered one d...",t1_c03lan8,offbeat,patchwork,,1,1425839644,t1_c03lejq,0,c03lejq,...,True,0,2,0,False,t5_2qh11,2,,,t3_6e1ct
1,Turn off your television.\nCancel your cable.\...,t3_6e25x,entertainment,leehar24,,0,1425839644,t1_c03lejr,0,c03lejr,...,True,0,14,0,False,t5_2qh0f,14,,,t3_6e25x
2,"we did, last year they were ok, this year they...",t1_c03leht,offbeat,gitgat,,0,1425839644,t1_c03lejs,0,c03lejs,...,True,0,1,0,False,t5_2qh11,1,,,t3_6e1wh
3,78 - it wouldn't accept my spelling of neodyni...,t3_6e1wa,science,mk_gecko,,0,1425839644,t1_c03lejt,0,c03lejt,...,True,0,1,0,False,t5_mouw,1,,,t3_6e1wa
4,"i don't know if that's necessary, it looks lik...",t3_6e1ay,pics,[deleted],,0,1425839644,t1_c03leju,0,c03leju,...,True,0,1,0,False,t5_2qh0u,1,,,t3_6e1ay


In [None]:
rc_2008_04_df.columns

Index(['body', 'parent_id', 'subreddit', 'author', 'author_flair_css_class',
       'edited', 'retrieved_on', 'name', 'gilded', 'id', 'created_utc',
       'archived', 'downs', 'score', 'controversiality', 'score_hidden',
       'subreddit_id', 'ups', 'distinguished', 'author_flair_text', 'link_id'],
      dtype='object')

In [None]:
rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].head()

Unnamed: 0,author,subreddit,id,link_id,parent_id,body,created_utc
0,patchwork,offbeat,c03lejq,t3_6e1ct,t1_c03lan8,"I always thought ""why"" would be answered one d...",1207008001
1,leehar24,entertainment,c03lejr,t3_6e25x,t3_6e25x,Turn off your television.\nCancel your cable.\...,1207008005
2,gitgat,offbeat,c03lejs,t3_6e1wh,t1_c03leht,"we did, last year they were ok, this year they...",1207008006
3,mk_gecko,science,c03lejt,t3_6e1wa,t3_6e1wa,78 - it wouldn't accept my spelling of neodyni...,1207008010
4,[deleted],pics,c03leju,t3_6e1ay,t3_6e1ay,"i don't know if that's necessary, it looks lik...",1207008013


In [None]:
from datetime import date

In [None]:
date.fromtimestamp(1207008001)

datetime.date(2008, 3, 31)
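Note that `date.fromtimestamp` converts using the local time zone of the machine, which is presumably why the output above shows March 31 even though the file covers April 2008. Since `created_utc` is a UTC timestamp, passing an explicit time zone gives a result that does not depend on where the notebook runs. A small sketch using only the standard library:

```python
from datetime import datetime, timezone

ts = 1207008001  # created_utc of the first comment shown above

# Interpret the timestamp explicitly as UTC instead of local time.
created = datetime.fromtimestamp(ts, tz=timezone.utc)
print(created.isoformat())  # 2008-04-01T00:00:01+00:00
```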

## Sample Analysis

Searching by parent ID.

In [None]:
rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].query("parent_id == 't1_c03le37'")

Unnamed: 0,author,subreddit,id,link_id,parent_id,body,created_utc
26,patchwork,offbeat,c03lekg,t3_6e1ct,t1_c03le37,See that would be even better if he had circle...,1207008110


Searching by link id.

In [None]:
rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].query("link_id == 't3_6e1ct'")

Unnamed: 0,author,subreddit,id,link_id,parent_id,body,created_utc
0,patchwork,offbeat,c03lejq,t3_6e1ct,t1_c03lan8,"I always thought ""why"" would be answered one d...",1207008001
26,patchwork,offbeat,c03lekg,t3_6e1ct,t1_c03le37,See that would be even better if he had circle...,1207008110
4040,Fiserfully,offbeat,c03lho7,t3_6e1ct,t1_c03lc4c,"Alright, first off. Homeschoolers have to pay...",1207028566
11411,skippy17,offbeat,c03lndn,t3_6e1ct,t3_6e1ct,"This is funny because it's true, right?",1207073684


Searching by subreddit.

In [None]:
rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].query("subreddit == 'offbeat'")

Unnamed: 0,author,subreddit,id,link_id,parent_id,body,created_utc
0,patchwork,offbeat,c03lejq,t3_6e1ct,t1_c03lan8,"I always thought ""why"" would be answered one d...",1207008001
2,gitgat,offbeat,c03lejs,t3_6e1wh,t1_c03leht,"we did, last year they were ok, this year they...",1207008006
26,patchwork,offbeat,c03lekg,t3_6e1ct,t1_c03le37,See that would be even better if he had circle...,1207008110
117,otterdam,offbeat,c03lemz,t3_6e30h,t1_c03le9w,This is a sign not to do it for so long ;o),1207008501
523,niomi,offbeat,c03leya,t3_6e30h,t3_6e30h,I'm a chick and I completely agree with this a...,1207010383
...,...,...,...,...,...,...,...
467850,[deleted],offbeat,c03vg8i,t3_6hnzu,t3_6hnzu,[deleted],1209597554
468204,vague_blur,offbeat,c03vgic,t3_6hof5,t3_6hof5,The birds are watching,1209599413
468234,[deleted],offbeat,c03vgj6,t3_6hoie,t3_6hoie,"Somewhere, in a dark and dingy assembly room, ...",1209599567
468258,deadsoon,offbeat,c03vgju,t3_6hoie,t3_6hoie,Big Lots is a closeout store. That means that ...,1209599715


In [None]:
offbeat_df = rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].query("subreddit == 'offbeat'")

Choosing undeleted comments only.

In [None]:
offbeat_df.query("author != '[deleted]'")

Unnamed: 0,author,subreddit,id,link_id,parent_id,body,created_utc
0,patchwork,offbeat,c03lejq,t3_6e1ct,t1_c03lan8,"I always thought ""why"" would be answered one d...",1207008001
2,gitgat,offbeat,c03lejs,t3_6e1wh,t1_c03leht,"we did, last year they were ok, this year they...",1207008006
26,patchwork,offbeat,c03lekg,t3_6e1ct,t1_c03le37,See that would be even better if he had circle...,1207008110
117,otterdam,offbeat,c03lemz,t3_6e30h,t1_c03le9w,This is a sign not to do it for so long ;o),1207008501
523,niomi,offbeat,c03leya,t3_6e30h,t3_6e30h,I'm a chick and I completely agree with this a...,1207010383
...,...,...,...,...,...,...,...
467341,-J-,offbeat,c03vfud,t3_6hof5,t3_6hof5,Yes: http://www.youtube.com/watch?v=oHg5SJYRHA...,1209595053
467609,memsisthefuture,offbeat,c03vg1t,t3_6hof5,t3_6hof5,"God, no. I should be so lucky.",1209596418
467715,OMG_my_BABY,offbeat,c03vg4r,t3_6hoie,t3_6hoie,Oh my god!!!\r\n,1209596943
468204,vague_blur,offbeat,c03vgic,t3_6hof5,t3_6hof5,The birds are watching,1209599413


In [None]:
link_id_lens = {}

In [None]:
linkids = offbeat_df["link_id"].unique()

In [None]:
for link in linkids:
    link_id_lens[link] = len(offbeat_df.query("link_id == '" +link+"'"))

In [None]:
link_id_lens["t3_6e97t"]

64
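The loop above runs one query per thread, which can be slow when there are many threads. As an alternative sketch, pandas can count the comments per `link_id` in a single pass with `value_counts`. The `toy_df` frame below is a hypothetical stand-in for `offbeat_df`, used only so the cell is self-contained.

```python
import pandas as pd

# Hypothetical miniature stand-in for offbeat_df.
toy_df = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4"],
    "link_id": ["t3_a", "t3_a", "t3_b", "t3_a"],
})

# Number of comments per thread, sorted from largest to smallest.
thread_sizes = toy_df["link_id"].value_counts()
print(thread_sizes["t3_a"])   # 3
print(thread_sizes.idxmax())  # t3_a
```

On the real data, `offbeat_df["link_id"].value_counts()` should give the same counts as `link_id_lens`, with the largest threads listed first.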

In [None]:
long_thread = rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].query("link_id == 't3_6e97t'")

In [None]:
len(long_thread["parent_id"].unique())

30

So this thread has 30 distinct parent IDs, that is, 30 posts or comments that received at least one reply. Let's list them.

In [None]:
long_thread["parent_id"].unique()

array(['t3_6e97t', 't1_c03lsy2', 't1_c03lt4s', 't1_c03lt86', 't1_c03lt7s',
       't1_c03ltha', 't1_c03ltio', 't1_c03ltj8', 't1_c03ltmj',
       't1_c03ltet', 't1_c03ltd1', 't1_c03ltx1', 't1_c03lueu',
       't1_c03luiy', 't1_c03ltqn', 't1_c03lurx', 't1_c03ltlu',
       't1_c03ltg2', 't1_c03lu19', 't1_c03lvot', 't1_c03lucj',
       't1_c03ltya', 't1_c03lvry', 't1_c03lx0v', 't1_c03lwm9',
       't1_c03lxj6', 't1_c03lxh7', 't1_c03lxoy', 't1_c03lw0g',
       't1_c03lwqv'], dtype=object)

Let us first look at the top-level comments, whose parent ID is the link ID of the post.

In [None]:
rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].query("parent_id == 't3_6e97t'")

Unnamed: 0,author,subreddit,id,link_id,parent_id,body,created_utc
18610,wgardenhire,offbeat,c03lsy2,t3_6e97t,t3_6e97t,Pot is non-addictive. End of story.,1207106991
18943,zerogravity,offbeat,c03lt7b,t3_6e97t,t3_6e97t,"Share, you no good low down dirty pothead.",1207108719
18960,spliffy,offbeat,c03lt7s,t3_6e97t,t3_6e97t,I would dissuade those looking to go into fiel...,1207108835
19163,Fauster,offbeat,c03ltdf,t3_6e97t,t3_6e97t,"Yes! You should worry about your ""immortal"" s...",1207110178
19213,[deleted],offbeat,c03ltet,t3_6e97t,t3_6e97t,Pot makes tedious tasks more enjoyable. Wheth...,1207110450
19302,h0dg3s,offbeat,c03ltha,t3_6e97t,t3_6e97t,You are **not** addicted to pot. Downvote for...,1207111126
19451,thabc,offbeat,c03ltlf,t3_6e97t,t3_6e97t,"I was in the same situation, until a few weeks...",1207111928
19466,silentbobsc,offbeat,c03ltlu,t3_6e97t,t3_6e97t,"""...reasonably challenging liberal arts colleg...",1207112047
19867,[deleted],offbeat,c03ltx1,t3_6e97t,t3_6e97t,I'm a chemist and you'd be surprised at how ma...,1207114627
19921,NewSc2,offbeat,c03ltyj,t3_6e97t,t3_6e97t,I smoked pot every day until I turned 21. Then...,1207115071


Now, we see the responses to the first comment.

In [None]:
rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].query("parent_id == 't1_c03lsy2'")

Unnamed: 0,author,subreddit,id,link_id,parent_id,body,created_utc
18852,spliffy,offbeat,c03lt4s,t3_6e97t,t1_c03lsy2,I wouldn't go quite that far. Maybe not physic...,1207108264
18974,[deleted],offbeat,c03lt86,t3_6e97t,t1_c03lsy2,Its mentally addictive... and thats what the w...,1207108888
19146,Fauster,offbeat,c03ltcy,t3_6e97t,t1_c03lsy2,I don't know about that. I'd say it's on par ...,1207110041


Peek at the first comment again:

In [None]:
rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "created_utc"]].query("id == 'c03lsy2'")

Unnamed: 0,author,subreddit,id,link_id,parent_id,body,created_utc
18610,wgardenhire,offbeat,c03lsy2,t3_6e97t,t3_6e97t,Pot is non-addictive. End of story.,1207106991


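So each comment carries three identifiers: its own id, the link_id of the post it belongs to, and the parent_id of whatever it replies to. For a top-level comment, parent_id equals link_id (prefix t3_); for a reply to another comment, parent_id is that comment's id with a t1_ prefix.
This information will help you create conversation graphs.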
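As a rough illustration of such a graph, the sketch below groups comments by `parent_id` and walks the replies depth-first. It uses only the standard library and the four (id, parent_id) pairs visible in the outputs above. The `walk` helper is a hypothetical name, and the recursion is assumed to be acceptable for threads of modest depth.

```python
from collections import defaultdict

# (id, parent_id) pairs copied from the outputs above.
comments = [
    ("c03lsy2", "t3_6e97t"),
    ("c03lt4s", "t1_c03lsy2"),
    ("c03lt86", "t1_c03lsy2"),
    ("c03ltcy", "t1_c03lsy2"),
]

# Map each parent fullname to the ids of its direct replies.
children = defaultdict(list)
for comment_id, parent_id in comments:
    children[parent_id].append(comment_id)


def walk(fullname, depth=0):
    """Yield (depth, comment_id) for every reply below `fullname`, depth-first."""
    for comment_id in children.get(fullname, []):
        yield depth, comment_id
        # A comment is referenced as a parent with the "t1_" prefix.
        yield from walk("t1_" + comment_id, depth + 1)


thread = list(walk("t3_6e97t"))
for depth, comment_id in thread:
    print("  " * depth + comment_id)
```

Starting from the post's link ID, this prints the top-level comment followed by its three replies, indented one level.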

In [None]:
key_info_rc_2008_04 = rc_2008_04_df[["author", "subreddit", "id", "link_id", "parent_id","body", "score","created_utc"]]

In [None]:
key_info_rc_2008_04.to_csv("../data/reddit/key_info_rc_2008_04.csv")

The saved CSV file is less than half the size of the JSON.
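If even the reduced file is too large to work with comfortably, a random sample can be saved instead. The sketch below is one possible approach, not part of the original notebook: `toy_df` is a hypothetical stand-in for `key_info_rc_2008_04`, and the sample size and seed are arbitrary. Passing `index=False` leaves the row index out of the CSV, so no extra unnamed column appears when the file is read back.

```python
import pandas as pd

# Hypothetical stand-in for key_info_rc_2008_04.
toy_df = pd.DataFrame({
    "author": ["a", "b", "c", "d"],
    "subreddit": ["offbeat", "pics", "offbeat", "science"],
    "id": ["c1", "c2", "c3", "c4"],
})

# Draw a reproducible random sample of two rows.
sampled = toy_df.sample(n=2, random_state=0)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = sampled.to_csv(index=False)
print(csv_text.splitlines()[0])  # author,subreddit,id
```

To write a file instead, pass a path as the first argument, as in the `to_csv` call above.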