
# Using Pushshift Module to extract Submissions Data from Reddit via Python

PRAW is pretty good at getting Reddit data, but it has some limitations, including the removal of the [subreddit.submissions endpoint](https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/.).

So, to extract Reddit submissions and their primary data, such as upvotes and comment counts, I put together this notebook using Pushshift.

If you still prefer PRAW for extracting submissions, I have written a [code template here](https://github.com/SeyiAgboola/Seyi_Projects/blob/master/submission_list.py).

I will also [host the code on GitHub](https://github.com/SeyiAgboola/Reddit-Data-Mining/blob/master/Using_Pushshift_Module_to_extract_Submissions.ipynb).

More info on the removal of the [subreddit.submissions endpoint](https://www.reddit.com/r/redditdev/comments/8bia9n/praw_psa_the_subredditsubmissions_method_no/).

# Import modules

In [32]:
import pandas as pd
import requests #Pushshift is accessed through a URL, so this is needed to make the HTTP request
import json #JSON manipulation
import csv #To Convert final table into a csv file to save to your machine
import time
import datetime

# Pushshift URL Examples

In [62]:
#We can access the Pushshift API by building a URL with the relevant parameters, without even needing Reddit credentials.
#These are some examples. You can follow the links and they will generate a page with JSON data
test_url = "https://api.pushshift.io/reddit/search/submission/?&after=1609505059&before=1609520400&subreddit=SuicideWatch"

In [None]:
data = getPushshiftData(1268811603, 1609520400, 'SuicideWatch')
data

# Parameters for your Pushshift URL
These are probably the most important parameters to consider when building your Pushshift URL:

* size — increase limit of returned entries to 1000
* after — where to start the search
* before — where to end the search
* title — to search only within the submission’s title
* subreddit — to narrow it down to a particular subreddit
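The parameters above can be combined into a query string without concatenating strings by hand. This is a minimal sketch using only the standard library; `build_url` and its argument names are hypothetical helpers, not part of Pushshift or of this notebook's own functions.

```python
from urllib.parse import urlencode

# Endpoint used throughout this notebook
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_url(subreddit, after, before, size=1000, title=None):
    """Build a Pushshift submission search URL (hypothetical helper)."""
    params = {
        "size": size,            # requested number of entries per page
        "after": after,          # epoch timestamp where the search starts
        "before": before,        # epoch timestamp where the search ends
        "subreddit": subreddit,  # subreddit to search in
    }
    if title is not None:
        # Optional: only match text inside the submission title
        params["title"] = title
    # urlencode escapes spaces and special characters for us
    return BASE_URL + "?" + urlencode(params)

print(build_url("SuicideWatch", 1609505059, 1609520400))
```

Letting `urlencode` do the escaping matters once a `title` query contains spaces or punctuation.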

In [135]:
#Adapted from https://gist.github.com/dylankilkenny/3dbf6123527260165f8c5c3bc3ee331b
#This function builds a Pushshift URL, requests it and returns the submissions as a list of dictionaries
def getPushshiftData(after, before, sub):
    #Build URL
    url = 'https://api.pushshift.io/reddit/search/submission/?&size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    #Print URL to show user
    print(url)
    #Request URL
    r = requests.get(url)
    #Load JSON data from webpage into data variable
    data = json.loads(r.text)
    #return the data element which contains all the submissions data
    # print("No Error")
    return data['data']
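The function above raises an exception if the request fails or the response is not valid JSON. One possible way to make it more tolerant is to retry a few times before giving up. The sketch below is an assumption about how that could look, not the method used in this notebook: `fetch_json` and `get_submissions_with_retry` are hypothetical names, and it uses `urllib` from the standard library instead of `requests` so it has no external dependencies.

```python
import json
import time
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def fetch_json(url, timeout=30):
    """Download a URL and parse the body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

def get_submissions_with_retry(after, before, sub, fetch=fetch_json, retries=3, pause=1.0):
    """Return the 'data' list for one page, retrying on failure (hypothetical helper)."""
    query = urlencode({"size": 1000, "after": after, "before": before, "subreddit": sub})
    url = BASE_URL + "?" + query
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)["data"]
        except (OSError, ValueError, KeyError, TypeError) as error:
            # OSError covers network and HTTP errors, ValueError covers invalid JSON,
            # KeyError/TypeError cover a response without a 'data' field
            last_error = error
            # Wait a little longer after each failed attempt
            time.sleep(pause * (attempt + 1))
    raise RuntimeError("Giving up on " + url) from last_error
```

The `fetch` argument exists so the function can be exercised with a stand-in that returns canned data, which avoids hitting the API while experimenting.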

# Extract key information from Submissions

We want key data for further analysis, including:
* Submission post ID
* Submission Title
* Body text (selftext)
* Score
* Upload Time
* No. of Comments
* Whether the post is marked over 18
* Subreddit


In [136]:
#This function will be used to extract the key data points from each JSON result
def collectSubData(subm):
    #subData was created at the start to hold all the data which is then added to our global subStats dictionary.
    subData = list() #list to store data points
    title = subm['title']
    sub_id = subm['id']
    score = subm['score']
    created = datetime.datetime.fromtimestamp(subm['created_utc']) #1520561700.0
    numComms = subm['num_comments']
    selftext = subm['selftext']
    subreddit = subm['subreddit']
    over_18 = subm['over_18']
    #Put all data points into a tuple and append to subData
    subData.append((sub_id,title,selftext,score,created,numComms,over_18,subreddit))
    #Create a dictionary entry of current submission data and store all data related to it
    subStats[sub_id] = subData
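Some submissions may lack one of these fields, in which case the function above raises a `KeyError` and the whole submission is skipped. A hedged alternative is to fall back to default values for optional fields. In this sketch `extract_fields` is a hypothetical name, the defaults are assumptions, and the timestamp is converted in UTC rather than in local time as the function above does.

```python
from datetime import datetime, timezone

def extract_fields(subm):
    """Return one row tuple for a submission dict, using defaults for missing keys."""
    # 'id' and 'created_utc' are treated as required; everything else is optional
    created = datetime.fromtimestamp(subm["created_utc"], tz=timezone.utc)
    return (
        subm["id"],
        subm.get("title", ""),
        subm.get("selftext", ""),      # body text, assumed empty if absent
        subm.get("score", 0),
        created,
        subm.get("num_comments", 0),
        subm.get("over_18", False),
        subm.get("subreddit", ""),
    )

# Example with a submission that has no 'selftext' field
row = extract_fields({"id": "abc123", "title": "Example", "created_utc": 1230768000})
print(row)
```

The tuple keeps the same field order as `collectSubData`, so it lines up with the CSV headers used later in the notebook.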

# Update your Search Settings here

In [139]:
#Create your timestamps and queries for your search URL
#https://www.unixtimestamp.com/index.php > Use this to create your timestamps
after = 1230768000 #Submissions after this timestamp (1230768000 = 01 Jan 2009 00:00 UTC)
before = 1609520400 #Submissions before this timestamp (1609520400 = 01 Jan 2021 17:00 UTC)
sub = "depression" #Which subreddit to search in

#subCount tracks the no. of total submissions we collect
subCount = 0
#subStats is the dictionary where we will store our data.
subStats = {}
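Instead of looking timestamps up on a website, they can be computed in Python. This is a small sketch with two hypothetical helpers, `to_epoch` and `from_epoch`; it assumes dates are written as `YYYY-MM-DD` and interpreted as midnight UTC.

```python
from datetime import datetime, timezone

def to_epoch(date_string):
    """Convert a 'YYYY-MM-DD' date (midnight UTC) to a Unix timestamp."""
    moment = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(moment.timestamp())

def from_epoch(timestamp):
    """Convert a Unix timestamp (int or numeric string) to a readable UTC string."""
    moment = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

print(to_epoch("2009-01-01"))    # 1230768000
print(from_epoch(1609520400))    # 2021-01-01 17:00:00
```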

In [140]:
# Fetch the first page outside the loop so there is data to read the next 'after' value from
data = getPushshiftData(after, before, sub)
# Keep requesting pages until Pushshift returns an empty list,
# i.e. there are no more submissions between 'after' and 'before'
while len(data) > 0:
    for submission in data:
        try:
            collectSubData(submission)
            subCount += 1
        except Exception:
            #Skip submissions that are missing one of the expected fields
            continue
    #Move 'after' to the creation time of the last submission on this page
    after = data[-1]['created_utc']
    try:
        data = getPushshiftData(after, before, sub)
        print(subCount)
    except Exception:
        #The request failed or returned invalid JSON: jump ahead in steps of
        #10,000,000 seconds (about 116 days) until a request succeeds.
        #Submissions inside a skipped window are not collected.
        while int(after) < int(before):
            after += 10000000
            try:
                data = getPushshiftData(after, before, sub)
                print(subCount)
                break
            except Exception:
                continue
        else:
            #We ran past 'before' without a successful request, so stop the outer loop
            data = []

print(len(data))

https://api.pushshift.io/reddit/search/submission/?&size=1000&after=1230768000&before=1609520400&subreddit=depression
https://api.pushshift.io/reddit/search/submission/?&size=1000&after=1252392783&before=1609520400&subreddit=depression
100
https://api.pushshift.io/reddit/search/submission/?&size=1000&after=1258121161&before=1609520400&subreddit=depression
200
https://api.pushshift.io/reddit/search/submission/?&size=1000&after=1262885210&before=1609520400&subreddit=depression
300
https://api.pushshift.io/reddit/search/submission/?&size=1000&after=1266030119&before=1609520400&subreddit=depression
400
https://api.pushshift.io/reddit/search/submission/?&size=1000&after=1268971790&before=1609520400&subreddit=depression
500
https://api.pushshift.io/reddit/search/submission/?&size=1000&after=1272035916&before=1609520400&subreddit=depression
600
https://api.pushshift.io/reddit/search/submission/?&size=1000&after=1275048382&before=1609520400&subreddit=depression
700
https://api.pushshift.io/red
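The paging idea behind the loop above can be isolated from the network code. The sketch below assumes each page is sorted by ascending `created_utc` and that `after` is exclusive; both are assumptions about the API's behavior. `paginate`, `fake_fetch_page`, and `FAKE_POSTS` are hypothetical names, and the fake fetcher stands in for `getPushshiftData` so the example runs offline.

```python
def paginate(fetch_page, after, before):
    """Yield submissions page by page until a page comes back empty."""
    while int(after) < int(before):
        page = fetch_page(after, before)
        if not page:
            break
        yield from page
        newest = page[-1]["created_utc"]
        if newest <= int(after):
            # Guard against looping forever if the cursor does not advance
            break
        after = newest

# In-memory stand-in for the API: 25 posts with timestamps 1..25
FAKE_POSTS = [{"id": "p" + str(n), "created_utc": n} for n in range(1, 26)]

def fake_fetch_page(after, before, page_size=10):
    """Return up to page_size posts strictly between after and before."""
    matches = [p for p in FAKE_POSTS if int(after) < p["created_utc"] < int(before)]
    return matches[:page_size]

collected = list(paginate(fake_fetch_page, 0, 100))
print(len(collected))  # 25
```

If several submissions share the same `created_utc` at a page boundary, an exclusive `after` could skip some of them, so it is worth checking counts against another source when completeness matters.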

# Check your Submission Extraction was successful

In [141]:
print(str(len(subStats)) + " submissions have been added to list")
print("1st entry is:")
#Each value is a one-item list holding a tuple: index 1 is the title and index 5 is the number of comments
print(list(subStats.values())[0][0][1] + " comments: " + str(list(subStats.values())[0][0][5]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " comments: " + str(list(subStats.values())[-1][0][5]))

78889 submissions have been added to list
1st entry is:
Girls, the future and some other stuff. comments: 0
Last entry is:
I am trying so hard but things just keep getting worse comments: 0


# Save data to CSV file

In [142]:
def updateSubs_file():
    upload_count = 0
    #location = "\\Reddit Data\\" >> If you're running this outside of a notebook you'll need this to direct to a specific location
    print("input filename of submission file, please add .csv")
    filename = input() #This asks the user what to name the file
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        a = csv.writer(file, delimiter=',')
        headers = ["Post_iD","Title","Body","Score","Publish_date","Total_no_of_comments", "Over_18", "Subreddit"]
        a.writerow(headers)
        for sub_id in subStats:
            #Each dictionary value is a one-item list holding the row tuple
            a.writerow(subStats[sub_id][0])
            upload_count+=1

        print(str(upload_count) + " submissions have been uploaded")
updateSubs_file()

input filename of submission file, please add .csv
pushshift_dep.csv
78889 submissions have been uploaded
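If you run this as a script rather than interactively, prompting with `input()` can get in the way. A possible variant takes the dictionary and the file path as arguments instead. This is a sketch under the assumption that the data has the same shape as `subStats` (each value a one-item list holding the row tuple); `write_submissions` is a hypothetical name.

```python
import csv

HEADERS = ["Post_iD", "Title", "Body", "Score", "Publish_date",
           "Total_no_of_comments", "Over_18", "Subreddit"]

def write_submissions(stats, path):
    """Write every submission in stats to a CSV file and return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=",")
        writer.writerow(HEADERS)
        for rows in stats.values():
            # rows is a one-item list; the tuple inside is the CSV row
            writer.writerow(rows[0])
            count += 1
    return count
```

For example, `write_submissions(subStats, "pushshift_dep.csv")` would write the same file as the interactive version above.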


In [123]:
import pandas as pd

df = pd.read_csv('/content/pushshiftmax.csv')
df1 = pd.read_csv('/content/pushshiftmax1.csv')
df2 = pd.read_csv('/content/pushshiftmax2.csv')
df3 = pd.read_csv('/content/pushshiftmax3.csv')

data = pd.concat([df, df1, df2, df3])
data.shape

(50886, 8)

In [148]:
df1 = pd.read_csv('/content/pushshift_dep.csv')
df2 = pd.read_csv('/content/pushshift_sw.csv')

data = pd.concat([df1, df2])
data.shape

(129776, 8)

In [149]:
data.to_csv('PushShift_129776.csv', index = False)
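When several CSV files are combined, the same post could appear in more than one of them, for example if two extraction runs overlapped in time. The sketch below shows one way to concatenate frames and drop repeated post IDs. `combine_frames` is a hypothetical helper, the small frames are made-up stand-ins for the real CSV files, and `Post_iD` is the ID column name used in the headers above.

```python
import pandas as pd

def combine_frames(frames, id_column="Post_iD"):
    """Concatenate frames, keep the first row per post ID and renumber the index."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates(subset=id_column).reset_index(drop=True)

# Tiny illustrative frames; "a2" appears in both
dep = pd.DataFrame({"Post_iD": ["a1", "a2"], "Subreddit": ["depression", "depression"]})
sw = pd.DataFrame({"Post_iD": ["a2", "b1"], "Subreddit": ["depression", "SuicideWatch"]})

combined = combine_frames([dep, sw])
print(combined.shape)  # (3, 2)
```

Passing `ignore_index=True` also avoids the repeated row labels that a plain `pd.concat` keeps from each input file.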