# Collect Reddit Data using Pushshift.io

Adapted from: https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563

Here we will demonstrate how to use Pushshift.io to collect Reddit data. We will first see what the API returns by querying it through the browser with an HTTP request. Try copying and pasting "https://api.pushshift.io/reddit/search/submission/" into a browser and see what you get.

Now we will import some packages that will help us query Pushshift programmatically.

In [1]:
import pandas as pd # for data manipulation
import requests # for executing the HTTP request
import json # for data manipulation in JSON format
import csv # for data manipulation in CSV format
import time # for pausing between requests; the API is not designed for bulk downloads
import datetime # converting UNIX timestamps to human readable date formats

Next we will write a function (getPushshiftData) that builds and executes the HTTP request from three parameters: a keyword to search for in post titles (query), a UNIX timestamp after which we want data (after), and the subreddit (sub).

In [2]:
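def getPushshiftData(query, after, sub):
    # Build the search URL: match 'query' in titles, request up to 1000 posts newer than 'after'
    url = 'https://api.pushshift.io/reddit/search/submission/?title='+str(query)+'&size=1000&after='+str(after)+'&subreddit='+str(sub)
    print(url)
    r = requests.get(url)
    r.raise_for_status() # stop on an HTTP error instead of failing while parsing the body
    return r.json()['data'] # the submissions are listed under the 'data' key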
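
The function above builds the URL by string concatenation, so stray whitespace or special characters in a parameter end up in the URL verbatim. Below is a minimal sketch of an alternative that encodes the parameters with the standard library's `urllib.parse.urlencode`. The helper name `build_pushshift_url` and the `size` default are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_pushshift_url(query, after, sub, size=1000):
    """Build the search URL with properly encoded query parameters."""
    params = {
        "title": query,
        "size": size,
        "after": str(after).strip(),  # drop stray whitespace such as tabs
        "subreddit": sub,
    }
    return BASE_URL + "?" + urlencode(params)

# A timestamp with an accidental leading tab still yields a clean URL
print(build_pushshift_url("", "\t1667599156", "politics"))
```

The resulting string can be passed to `requests.get` exactly as before.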

The next function is a helper that pulls the fields we need out of each JSON submission returned by Pushshift and stores them as a tuple in the subStats dictionary, keyed by post ID, so that we can work with them.

In [3]:
def collectSubData(subm):
    subData = list() # list to store data points
    title = subm['title']
    url = subm['url']
    # not every submission has a flair, so fall back to a placeholder
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN"
    author = subm['author']
    sub_id = subm['id']
    score = subm['score']
    # convert the UNIX timestamp (e.g. 1520561700.0) to a datetime in the machine's local time zone
    created = datetime.datetime.fromtimestamp(subm['created_utc'])
    numComms = subm['num_comments']
    permalink = subm['permalink']

    subData.append((sub_id,title,url,author,score,created,numComms,permalink,flair))
    subStats[sub_id] = subData # subStats is the global dictionary defined in the next cell
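
The helper above writes into a global dictionary and converts timestamps using the machine's local time zone. As a sketch of a variant that is easier to test, the function below returns the row instead of storing it, reads the optional flair with `dict.get`, and produces a timezone-aware UTC datetime. The name `flatten_submission` and the sample record are made up for illustration and are not real Pushshift output.

```python
import datetime

def flatten_submission(subm):
    """Return one submission as a flat tuple in the notebook's column order."""
    created = datetime.datetime.fromtimestamp(
        subm["created_utc"], tz=datetime.timezone.utc
    )
    return (
        subm["id"],
        subm["title"],
        subm["url"],
        subm["author"],
        subm["score"],
        created,
        subm["num_comments"],
        subm["permalink"],
        subm.get("link_flair_text", "NaN"),  # flair is optional
    )

# Made-up record for illustration only
sample = {
    "id": "abc123",
    "title": "Example post",
    "url": "https://example.com/story",
    "author": "example_user",
    "score": 1,
    "created_utc": 1667599156,
    "num_comments": 0,
    "permalink": "/r/politics/comments/abc123/example_post/",
}
row = flatten_submission(sample)
print(row[5].isoformat())  # 2022-11-04T21:59:16+00:00
```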

Now we specify the two parameters: the subreddit names and the query. We will keep the query blank for now since we want to collect all the posts in these subreddits. We are interested in collecting data from subreddits with different political ideologies (two in this example), so we use a list of them.

In [4]:
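# Subreddits to query
subs = ['politics', 'JoeBiden']
# Blank query: collect every post regardless of title
query = ""
# UNIX timestamps for the before/after dates can be looked up at https://www.unixtimestamp.com/index.php
# Dictionary of collected submissions, keyed by post ID
subStats = {}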
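
The `after` value used in the collection loop is a UNIX timestamp in seconds. Instead of looking it up on a website, it can be computed in Python. The sketch below assumes the date is given in UTC, and `to_unix` is a hypothetical helper name.

```python
import datetime

def to_unix(year, month, day, hour=0, minute=0, second=0):
    """Convert a UTC calendar date and time to a UNIX timestamp in seconds."""
    moment = datetime.datetime(
        year, month, day, hour, minute, second, tzinfo=datetime.timezone.utc
    )
    return int(moment.timestamp())

print(to_unix(2022, 11, 4, 21, 59, 16))  # 1667599156
```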

Now we finally come to the step where we call the functions defined above to collect the data.

In [5]:
for sub in subs:
    after = 1667599156 # UNIX timestamp (seconds) of the earliest post to collect
    subCount = 0
    data = getPushshiftData(query, after, sub)
    # Keep requesting until Pushshift returns no more posts after the 'after' timestamp
    while len(data) > 0:
        for submission in data:
            collectSubData(submission)
            subCount+=1
        print(len(data))
        print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
        # Resume from the creation time of the last submission received
        after = data[-1]['created_utc']
        data = getPushshiftData(query, after, sub)
        time.sleep(10) # pause between requests to avoid overloading the API

    print(subCount) # number of submissions collected from this subreddit


https://api.pushshift.io/reddit/search/submission/?title=&size=1000&after=1667599156&subreddit=politics
248
2022-11-05 21:28:28
https://api.pushshift.io/reddit/search/submission/?title=&size=1000&after=1667683708&subreddit=politics
248
2022-11-06 20:57:28
https://api.pushshift.io/reddit/search/submission/?title=&size=1000&after=1667768248&subreddit=politics


KeyboardInterrupt: ignored
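
The loop above is a cursor-style pagination: each request asks for posts newer than the last one already received, and collection stops when a page comes back empty. The sketch below isolates that pattern so it can be run offline. `collect_all`, `fake_fetch`, and `FAKE_POSTS` are hypothetical stand-ins, not Pushshift calls.

```python
def collect_all(fetch_page, after):
    """Collect items page by page, advancing 'after' to the last item's timestamp."""
    collected = []
    page = fetch_page(after)
    while page:
        collected.extend(page)
        after = page[-1]["created_utc"]
        page = fetch_page(after)
    return collected

# Stand-in for the API: five fake posts, returned at most two per call
FAKE_POSTS = [{"id": i, "created_utc": 100 + i} for i in range(5)]

def fake_fetch(after, size=2):
    newer = [p for p in FAKE_POSTS if p["created_utc"] > after]
    return newer[:size]

posts = collect_all(fake_fetch, after=99)
print(len(posts))  # 5
```

Passing the fetch function in as a parameter keeps the paging logic separate from the network call, so the same loop could wrap getPushshiftData with a `time.sleep` added between requests.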

In [6]:
len(subStats)

496

Here are some statistics on the data that we collected.

In [7]:
print(str(len(subStats)) + " submissions have been added to the list")
print("1st entry is:")
print(list(subStats.values())[0][0][1] + " created: " + str(list(subStats.values())[0][0][5]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " created: " + str(list(subStats.values())[-1][0][5]))

496 submissions have been added to the list
1st entry is:
Faith on the ballot: White Christian nationalism vs. Black Christian tradition created: 2022-11-04 22:01:26
Last entry is:
Make America Truly Great...For the Very First Time created: 2022-11-06 20:57:28


In [8]:
list(subStats.items())[-1]

('yo2imw',
 [('yo2imw',
   'Make America Truly Great...For the Very First Time',
   'https://www.commondreams.org/views/2022/11/06/make-america-truly-greatfor-very-first-time',
   'Picture-unrelated',
   1,
   datetime.datetime(2022, 11, 6, 20, 57, 28),
   1,
   '/r/politics/comments/yo2imw/make_america_truly_greatfor_the_very_first_time/',
   'NaN')])

Now we will save the data as a CSV file so that it is easy to manipulate.

In [9]:
def updateSubs_file():
    upload_count = 0
    location = "" # folder to save into; empty means the current working directory
    print("input filename of submission file, please add .csv")
    filename = input()
    path = location + filename
    # newline='' lets the csv module control line endings
    with open(path, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file, delimiter=',')
        headers = ["Post ID","Title","Url","Author","Score","Publish Date","Total No. of Comments","Permalink","Flair"]
        writer.writerow(headers)
        for sub in subStats:
            # each subStats value is a one-element list holding the row tuple
            writer.writerow(subStats[sub][0])
            upload_count+=1

        print(str(upload_count) + " submissions have been uploaded")

updateSubs_file()

input filename of submission file, please add .csv
elections.csv
496 submissions have been uploaded
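
If the rows are kept as dictionaries rather than tuples, `csv.DictWriter` matches values to columns by name, which avoids depending on tuple order. The sketch below writes to an in-memory buffer so it runs without touching the disk. The reduced header list and the sample rows are made up for illustration.

```python
import csv
import io

HEADERS = ["Post ID", "Title", "Score"]

def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up rows for illustration only
rows = [
    {"Post ID": "abc123", "Title": "Hello, world", "Score": 1},
    {"Post ID": "def456", "Title": "Second post", "Score": 5},
]
print(rows_to_csv(rows))
```

Note that the title containing a comma is quoted automatically, so it still reads back as a single field.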


Other options for collecting Reddit data (Python and Google BigQuery): https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892

Before we finish, let us load the CSV file with elections data that has already been collected for you and that we will analyze in the rest of the workshop.

In [10]:
elections_df = pd.read_csv("elections.csv")
elections_df.head()

Unnamed: 0,Post ID,Title,Url,Author,Score,Publish Date,Total No. of Comments,Permalink,Flair
0,ymbsdc,Faith on the ballot: White Christian nationali...,https://www.bostonglobe.com/2022/11/04/opinion...,Adult-male,1,2022-11-04 22:01:26,1,/r/politics/comments/ymbsdc/faith_on_the_ballo...,
1,ymbsxo,Michigan governor's race: Whitmer leads Dixon ...,https://www.foxnews.com/politics/michigan-gove...,WarWolf343,1,2022-11-04 22:02:04,1,/r/politics/comments/ymbsxo/michigan_governors...,
2,ymbt2l,The nightmarish Supreme Court case that could ...,https://www.vox.com/policy-and-politics/2022/1...,Streona,1,2022-11-04 22:02:13,1,/r/politics/comments/ymbt2l/the_nightmarish_su...,
3,ymbx4m,Gov. Newsom posthumously pardons abortion prov...,https://www.latimes.com/california/story/2022-...,misana123,1,2022-11-04 22:07:01,1,/r/politics/comments/ymbx4m/gov_newsom_posthum...,
4,ymbxqt,‘Dark money’ groups aligned with party leaders...,https://www.opensecrets.org/news/2022/11/dark-...,artistpearl,1,2022-11-04 22:07:43,1,/r/politics/comments/ymbxqt/dark_money_groups_...,


In [11]:
len(elections_df)

496

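We would like to extract the subreddit name from each post's permalink using string manipulation. A permalink looks like /r/politics/comments/..., so splitting it on '/' puts the subreddit name at index 2.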

In [12]:
elections_df['Subreddit'] = [i.split('/')[2] for i in elections_df['Permalink']]
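
To see why index 2 is the right one, here is a small standalone sketch of the same split applied to a single permalink. The function name `subreddit_from_permalink` is hypothetical, and the format check is an added assumption for malformed values.

```python
def subreddit_from_permalink(permalink):
    """Return the subreddit name from a permalink like '/r/<name>/comments/...'."""
    parts = permalink.split("/")
    # The leading slash makes parts[0] empty, so the layout is ['', 'r', '<name>', ...]
    if len(parts) > 2 and parts[1] == "r":
        return parts[2]
    return None  # unexpected format

print(subreddit_from_permalink(
    "/r/politics/comments/yo2imw/make_america_truly_greatfor_the_very_first_time/"
))  # politics
```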

In [13]:
elections_df.to_csv("elections.csv", encoding = 'utf-8', index = False)

This notebook demonstrates how you can collect Reddit data in JSON format and save it as a CSV. Much social media data (such as Twitter data) and news media data is available as JSON.
Therefore, you can reuse what you learned here to collect Twitter data. You would just have to use a different API, but many of the same processes (rate limiting, data format) apply.
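
As one example of a rate-limiting process that carries over between APIs, the sketch below retries a failed request with exponentially growing pauses. It is an illustration under stated assumptions, not a method used in this notebook: `fetch_with_retry` and `flaky` are hypothetical names, and the simulated failure stands in for a real request error.

```python
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay, then 2x, 4x, ... seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for a flaky request: fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate-limit error")
    return {"data": []}

# Record the delays instead of actually sleeping, so the demo runs instantly
delays = []
result = fetch_with_retry(flaky, sleep=delays.append)
print(result, delays)  # {'data': []} [1.0, 2.0]
```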