# Exploring the reddit API

In [11]:
#Go to root folder
import os
os.chdir("..")

## Using praw

Praw is a good way to get easy access to Reddit data. [This](https://www.geeksforgeeks.org/python-praw-python-reddit-api-wrapper/) is a useful article. 

First, you will need to setup a connection to the Reddit API, hence you need to first [sign up and register your app with Reddit](https://www.reddit.com/wiki/api) to access their API.

You then use your details to setup the connection. [This archived post](https://github.com/reddit-archive/reddit/wiki/OAuth2) also explains the authentification. I have stored the details in a configuration file which I load with ConfigParser. 
The file looks like this (example data) 

    [reddit-config]
    client_id =  yourclientid
    client_secret = yourclientsecret 
    user_agent = my user agent 
    username = yourusername 
    password = yourpassword


Second, you need to install praw (check out the [official documentation](https://praw.readthedocs.io/en/latest/index.html) for help), and you are ready to go. 

In [6]:
import pandas as pd
import configparser
import praw
from datetime import datetime

### Setting up the details

In [7]:
# retrieve details from config file
def get_config_values(config_file, section):
    config = configparser.ConfigParser()
    config.read(config_file)

    return {
        "username": config.get(section, 'username'),
        "password": config.get(section, 'password'),
        "user_agent": config.get(section, 'user_agent'),
        "client_id": config.get(section, 'client_id'),
        "client_secret": config.get(section, 'client_secret'),
    }

details = get_config_values("reddit-config.cfg", "reddit-config")

In [12]:
# setup praw Reddit connection
reddit = praw.Reddit(client_id = details["client_id"], 
                     client_secret = details["client_secret"], 
                     user_agent = details["user_agent"], 
                     username = details["username"], 
                     password = details["password"]) 
  
# to verify whether the instance is authorised instance or not 
print(reddit.read_only)

False


### Access a Subreddit

In [9]:
subreddit = reddit.subreddit('sourdough') 
  
# display the subreddit name 
print(subreddit.display_name) 
  
# display the subreddit title  
print(subreddit.title)        
  
# display the subreddit description  
print(subreddit.description) 

sourdough
Sourdough
Want to learn how to make and bake sourdough? Love the aroma, taste, and texture of homemade bread? If yes, this is your subreddit!  Ask questions, start discussions, share recipes, photos, baking tips, and much more.

***

Get your own custom bread flair by clicking "edit" just above and selecting your flair.

***

**Rules:**

* Be polite & respectful
* No submitting irrelevant content (including memes)
* No spamming
* [Reddiquette](http://www.reddit.com/help/reddiquette)

***

**Resources:**

* [Beginner's Guide to Sourdough](/r/Sourdough/wiki/starter_culture_resources)
* [General Information on Sourdough](/r/Sourdough/wiki/general_information)
* [Basic Troubleshooting](https://i.redd.it/w3ami1gnyqr41.jpg)
* [Sourdough Recipe w/ Wakthrough](/r/Sourdough/wiki/standard-sd-recipe)

***

**Baking Related Subreddits:**

* [/r/ArtisanBread](http://www.reddit.com/r/ArtisanBread/)
* [/r/Breadit](http://www.reddit.com/r/Breadit/)
* [/r/Pizza](http://www.reddit.com/r/Pizza/

In [4]:
# to find the top most submission in the subreddit "sourdough" 
subreddit = reddit.subreddit('sourdough') 
  
for submission in subreddit.top(limit = 1): 
    # displays the submission title 
    print("Title: ", submission.title)   
  
    # displays the net upvotes of the submission 
    print("Score: ", submission.score)   
  
    # displays the submission's ID 
    print("ID: ", submission.id)    
  
    # displays the url of the submission 
    print("URL: ", submission.url) 
    
    # displays when the submission was created in unix time
    print("Created: ", submission.created_utc)  
    
    # displays number of comments to the submission
    print("Number of comments: ", submission.num_comments) 

Title:  Here’s another video of me shaping sourdough. I added some music this time because baking is rock ’n roll.
Score:  4435
ID:  glzuwy
URL:  https://v.redd.it/t8jaoor0giz41
Created:  1589801997.0
Number of comments:  214


In [5]:
# to find the top most submission in the subreddit "sourdough" 
subreddit = reddit.subreddit('sourdough') 

df_title = []

for submission in subreddit.new(limit = 5): 
    title = submission.title
    df_title.append(title)

df_title

['An olive rosemary loaf! First sourdough bake of the year & first time adding inclusions was a success. The aromatics & flavor of this loaf is insane, excited to experiment more w/ inclusions this year.',
 'Delighted with the crumb on this one!',
 'My first sourdough bread.',
 'Sourdough newbie here',
 'Below is a slice from the end, top is a slice from the middle of the same bread. What went wrong?']

In [10]:
# to find the top most submission in the subreddit "sourdough" 
subreddit = reddit.subreddit('sourdough') 

topics_dict = { "title":[], 
               "score":[],
              "created_UTC": [],
               "num_comments": [],
               "comments_text": []}

for submission in subreddit.top(limit = 10): 
    topics_dict["title"].append(submission.title)
    topics_dict["score"].append(submission.score)
    topics_dict["created_UTC"].append(submission.created)
    topics_dict["num_comments"].append(submission.num_comments)
    
    #https://praw.readthedocs.io/en/latest/tutorials/comments.html
    submission.comments.replace_more(limit=None)
    comment_body = ""
    for comment in submission.comments.list():
        comment_body =  comment_body + comment.body + "\n"
    topics_dict["comments_text"].append(comment_body)

#convert dictionary to dataframe
topics_data = pd.DataFrame(topics_dict)
topics_data

Unnamed: 0,title,score,created_UTC,num_comments,comments_text
0,Here’s another video of me shaping sourdough. ...,4435,1589831000.0,214,"This is hypnotizing, and I’m so impressed with..."
1,My dad decided to make a Coronavirus themed so...,3356,1589723000.0,44,"it's got such a 1990's aesthetic to it, i love..."
2,by request: a timelapse of me scoring my sunfl...,3294,1584190000.0,104,Ngl I thought the cutting was over about 10 di...
3,How else was I suppose to announce that we are...,2920,1591844000.0,86,Congrats on the sex!\nCongratulations on loaf ...
4,Today is a good day.,2844,1588969000.0,121,It's so pretty I want to cry\nAmazing ! What w...
5,I finally achieved what I thought was impossib...,2841,1600390000.0,165,Do you use gluten free flour in the banneton? ...
6,That incredible moment when you take off the t...,2643,1590562000.0,119,That bread has never even heard of gluten\nOn ...
7,My pizza dough this morning,2536,1593475000.0,78,holy gluten batman.\nWould you mind sharing yo...
8,Hey guys here’s a video of me dividing and pre...,2520,1590619000.0,109,I find every one of your videos so hypnoticall...
9,✨ a little project I did tonight for my kitche...,2337,1593784000.0,113,"[deleted]\nYou forgot 10% Luck, 20% Skill\nThi..."


## Using pushshift.io

Using Praw is a very convenient way to access the Reddit API. However, it does not allow to filter for specific dates and has a data limit of 5000. More info about this change [here](https://stackoverflow.com/questions/53988619/praw-6-get-all-submission-of-a-subreddit).

There is another way to access Reddit data via the [Pushshift API](https://pushshift.io/api-parameters/). This is what I tried to explore below. 

### Method 1

Following [this](https://rareloot.medium.com/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563) article but with small amends.

In [4]:
import pandas as pd
import requests
import json
import math
from datetime import datetime
import time

def getPushshiftData(start_at, end_at, subreddit):
    url = 'https://api.pushshift.io/reddit/search/submission?&size=100&after='+str(start_at)+'&before='+str(end_at)+'&subreddit='+str(subreddit)
    r = requests.get(url)
    print('server status:', r.status_code)
    
    # if page available, run code as normal
    if r.status_code == 200:
        data = json.loads(r.text)
        return data['data']
    
    # if page not able to load, wait 1 min and try again
    else:
        print("sleep 60")
        time.sleep(60)
        url = 'https://api.pushshift.io/reddit/search/submission?&size=100&after='+str(start_at)+'&before='+str(end_at)+'&subreddit='+str(subreddit)
        r = requests.get(url)
        data = json.loads(r.text)
        return data['data']
        
#dictionary to store values in
post_dict = { "id" : [], 
             "score" :[],
            "created_utc":[],
             "title":[],
             "num_comments" : [],
             "can_mod_post": [],
             "author":[]
            }

#define search parameters
subreddit='Sourdough'
start_at = str(math.ceil(datetime(2020, 1, 1, 0, 0, 0).timestamp()))
end_at = str(math.floor(datetime(2020, 1, 3, 23, 59, 59).timestamp()))

#retrieve data given the parameters
data = getPushshiftData(start_at, end_at, subreddit)

# Will run until all posts have been gathered from the 'start_at' date until the 'end_at' date
while len(data) > 0:
    for submission in data:
        post_dict["id"].append(submission["id"])
        post_dict["title"].append(submission["title"])
        post_dict["created_utc"].append(submission["created_utc"])
        post_dict["score"].append(submission["score"])
        post_dict["num_comments"].append(submission["num_comments"])
        post_dict["can_mod_post"].append(submission["can_mod_post"])
        post_dict["author"].append(submission["author"])
        
    # Calls getPushshiftData() with the created date of the last submission
    print('start again at:', data[-1]['created_utc'])
    print('data loaded:', len(post_dict["title"]))
    time.sleep(15)
    data = getPushshiftData(subreddit=subreddit, start_at=data[-1]['created_utc'], end_at=end_at)

server status: 200
start again at: 1578019458
data loaded: 100
server status: 200
start again at: 1578095504
data loaded: 139
server status: 200


In [5]:
#convert dictionary to dataframe
post_df = pd.DataFrame(post_dict)

post_df[:10]

Unnamed: 0,id,score,created_utc,title,num_comments,can_mod_post,author
0,eibhvl,1,1577839131,"First attempt at a starter, really hope I mana...",5,False,coentertainer
1,eibvur,1,1577841129,Skillet &amp; Dutch Oven Sourdough in the rain...,0,False,Richness69
2,eiby7m,1,1577841483,My last bread of 2019. I used Brad and Claire’...,0,False,canioli019
3,eictkk,1,1577846281,I started baking in September and I have never...,0,False,singular-chip
4,eidmqm,1,1577851082,Sourdough Books,3,False,TheNightBaker97
5,eidtic,1,1577852213,Analyzing sourdough?,1,False,amisanyal
6,eidxpd,1,1577852956,Ginger tumeric loaf to guide me out of the decade,2,False,bleuxballs
7,eidyxu,1,1577853173,Behold Bread Majors. He will incite the Rocky ...,3,False,ClandestineOni
8,eifrvq,1,1577864698,Last loaves of the year.,0,False,gorpz
9,eigw2g,1,1577873535,Wheat flour starter vs rye starter,9,False,bacafreak


### Article 2 
https://medium.com/@pasdan/how-to-scrap-reddit-using-pushshift-io-via-python-a3ebcc9b83f4

In [22]:
import math
import json
import requests
import itertools
import numpy as np
import time
from datetime import datetime, timedelta

In [35]:
# create logic to retrieve more data above limit

def make_request(uri, max_retries = 5):
    def fire_away(uri):
        response = requests.get(uri)
        assert response.status_code == 200
        return json.loads(response.content)    
    current_tries = 1
    while current_tries < max_retries:
        try:
            time.sleep(1)
            response = fire_away(uri)
            return response
        except:
            time.sleep(1)
            current_tries += 1    
    return fire_away(uri)

In [38]:
def pull_posts_for(subreddit, start_at, end_at):
    
    def map_posts(posts):
        return list(map(lambda post: {
            'id': post['id'],
            'created_utc': post['created_utc'],
            'prefix': 't4_'
        }, posts))
    
    SIZE = 500
    URI_TEMPLATE = r'https://api.pushshift.io/reddit/search/submission?subreddit={}&after={}&before={}&size={}'
    
    post_collections = map_posts( \
        make_request(URI_TEMPLATE.format(subreddit, start_at, end_at, SIZE))['data'])    
    
    n = len(post_collections)
    while n == SIZE:
        last = post_collections[-1]
        new_start_at = last['created_utc'] - (10)
        
        more_posts = map_posts( \
            make_request(URI_TEMPLATE.format(subreddit, new_start_at, end_at, SIZE))['data'])
        
        n = len(more_posts)
        post_collections.extend(more_posts)
        
    return post_collections

In [117]:
# building time search intervals
def give_me_intervals(start_at, end_at, days_per_interval): 
           
    period = (86400 * days_per_interval)    ## 1 day = 86400
    end = start_at + period
    yield (int(start_at), int(end))    
    padding = 1 
    while end <= end_at:
        start_at = end + padding
        end = (start_at - padding) + period
        yield int(start_at), int(end)

In [129]:
## test function
start_at = math.floor((datetime(2020, 1, 1, 0, 0, 0)).timestamp())
end_at = math.ceil(c

print("length:", len(list(give_me_intervals(start_at, end_at, 0.5))))
print(list(give_me_intervals(start_at,end_at, 7)))

length: 732
[(1577836800, 1578441600), (1578441601, 1579046400), (1579046401, 1579651200), (1579651201, 1580256000), (1580256001, 1580860800), (1580860801, 1581465600), (1581465601, 1582070400), (1582070401, 1582675200), (1582675201, 1583280000), (1583280001, 1583884800), (1583884801, 1584489600), (1584489601, 1585094400), (1585094401, 1585699200), (1585699201, 1586304000), (1586304001, 1586908800), (1586908801, 1587513600), (1587513601, 1588118400), (1588118401, 1588723200), (1588723201, 1589328000), (1589328001, 1589932800), (1589932801, 1590537600), (1590537601, 1591142400), (1591142401, 1591747200), (1591747201, 1592352000), (1592352001, 1592956800), (1592956801, 1593561600), (1593561601, 1594166400), (1594166401, 1594771200), (1594771201, 1595376000), (1595376001, 1595980800), (1595980801, 1596585600), (1596585601, 1597190400), (1597190401, 1597795200), (1597795201, 1598400000), (1598400001, 1599004800), (1599004801, 1599609600), (1599609601, 1600214400), (1600214401, 1600819200),

In [124]:
## Pull submissions

# Define parameters
#To be safe, I changed the day interval to 1/2 day, in case any day had more than 100 posts
subreddit = 'Sourdough'
start_at = math.floor((datetime(2020, 1, 1, 0, 0, 0)).timestamp())
end_at = math.ceil(datetime(2020, 12, 31, 23, 59, 59).timestamp()) 
days_per_interval = 0.5

posts = []
for interval in give_me_intervals(start_at, end_at, days_per_interval):
    pulled_posts = pull_posts_for(subreddit, interval[0], interval[1])
    posts.extend(pulled_posts)
    time.sleep(.500)

# check results
print(len(posts))

3146
3146


In [147]:
# write the result to a file in case I want to use it later to avoid having to rerun the code above
import pickle

my_object = posts
pickle_out = open("posts_list.pickle","wb")
pickle.dump(my_object, pickle_out)
pickle_out.close()

In [146]:
## Unpickling: read result back in using pickle
#pickle_in = open("posts_list.pickle","rb")
#test_object = pickle.load(pickle_in)
#test_object

[{'id': 'eibhvl', 'created_utc': 1577839131, 'prefix': 't4_'},
 {'id': 'eibvur', 'created_utc': 1577841129, 'prefix': 't4_'},
 {'id': 'eiby7m', 'created_utc': 1577841483, 'prefix': 't4_'},
 {'id': 'eictkk', 'created_utc': 1577846281, 'prefix': 't4_'},
 {'id': 'eidmqm', 'created_utc': 1577851082, 'prefix': 't4_'},
 {'id': 'eidtic', 'created_utc': 1577852213, 'prefix': 't4_'},
 {'id': 'eidxpd', 'created_utc': 1577852956, 'prefix': 't4_'},
 {'id': 'eidyxu', 'created_utc': 1577853173, 'prefix': 't4_'},
 {'id': 'eifrvq', 'created_utc': 1577864698, 'prefix': 't4_'},
 {'id': 'eigw2g', 'created_utc': 1577873535, 'prefix': 't4_'}]

In [125]:
## get more data from reddit using praw
# Generate list including Submissions and their id to then get the rest of the data from
posts_from_reddit = []

for submission_id in np.unique([ post['id'] for post in posts ]):
    submission = reddit.submission(id=submission_id) 
    
    posts_from_reddit.append(submission)  

print(len(posts_from_reddit))
print(posts_from_reddit[:10])

3146


In [None]:
# Pull other data from reddit including title, score, date, number of comments, etc
posts_dict = { "title":[], 
               "score":[],
              "created": [],
               "created_UTC": [],
               "num_comments": [],
               "comments_text": []
              }

for submission in posts_from_reddit:
    topics_dict["title"].append(submission.title)
    topics_dict["score"].append(submission.score)
    topics_dict["created_UTC"].append(submission.created)
    topics_dict["num_comments"].append(submission.num_comments)

    
#convert dictionary to dataframe
posts_data = pd.DataFrame(posts_dict)
posts_data

In [127]:
## convert unix time into date --> not working properly, need to fix unix conversion
def get_yyyy_mm_dd_from_utc(dt):
    date = datetime.utcfromtimestamp(dt)
    return str(date.year) + "-" + str(date.month) + "-" + str(date.day)

created = []

for submission in posts_from_reddit:
    created.append(get_yyyy_mm_dd_from_utc(submission.created))

In [128]:
pd.DataFrame(created).to_csv("date.csv")

In [62]:
pd.DataFrame(list(np.unique([ post['id'] for post in posts ]))).to_csv("submission_id.csv")

In [65]:
pd.DataFrame(topics_dict["title"]).to_csv("submisstion_titles.csv")