# Lockdown baking - part 1

A project for scraping and analysing data from Reddit (r/Sourdough) to explore baking trends during 2020.

Part 1: Web scraping

- setting up a Reddit API account
- creating functions for extracting data using [praw](https://praw.readthedocs.io/en/latest/code_overview/models/submission.html) and [pushshift.io](https://pushshift.io/api-parameters/)
- storing the results in a CSV file

# Setup

In [7]:
import pandas as pd
import numpy as np
import os #to change file paths
import configparser #to read config file
import praw #to access reddit API
import pickle #to store objects
import math
import json
import requests
import itertools
import time
from datetime import datetime, timedelta

In [5]:
## go to root folder
os.chdir("..")

## Setting up reddit API connection

In [47]:
# retrieve details from config file
def get_config_values(config_file, section):
    config = configparser.ConfigParser()
    config.read(config_file)

    return {
        "username": config.get(section, 'username'),
        "password": config.get(section, 'password'),
        "user_agent": config.get(section, 'user_agent'),
        "client_id": config.get(section, 'client_id'),
        "client_secret": config.get(section, 'client_secret'),
    }

details = get_config_values("reddit-config.cfg", "reddit-config")
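
The notebook does not show `reddit-config.cfg` itself, so the following is only a sketch of what `get_config_values` expects: one `[reddit-config]` section holding the five keys it reads. All values are placeholders, and the config text is parsed from a string here purely so the example runs without a file on disk.

```python
import configparser

# Placeholder contents for reddit-config.cfg (every value here is made up)
EXAMPLE_CONFIG = """
[reddit-config]
username = your_reddit_username
password = your_reddit_password
user_agent = lockdown-baking-scraper by u/your_reddit_username
client_id = your_client_id
client_secret = your_client_secret
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

# Check that every key read by get_config_values is present
required = ["username", "password", "user_agent", "client_id", "client_secret"]
missing = [key for key in required if not config.has_option("reddit-config", key)]
print(missing)  # []
```

Keeping the credentials in a separate file like this means they can stay out of version control.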

In [48]:
# setup praw Reddit connection
reddit = praw.Reddit(client_id = details["client_id"], 
                     client_secret = details["client_secret"], 
                     user_agent = details["user_agent"], 
                     username = details["username"], 
                     password = details["password"]) 
  
# read_only is False when the instance is authorised with user credentials
print(reddit.read_only)

False


## Test connection

In [49]:
# find the top submission in the subreddit "sourdough"
subreddit = reddit.subreddit('sourdough') 
  
for submission in subreddit.top(limit = 1): 
    # displays the submission title 
    print("Title: ", submission.title)   
  
    # displays the net upvotes of the submission 
    print("Score: ", submission.score)   
  
    # displays the submission's ID 
    print("ID: ", submission.id)    
  
    # displays the url of the submission 
    print("URL: ", submission.url) 
    
    # displays when the submission was created in unix time
    print("Created: ", submission.created_utc)  
    
    # displays number of comments to the submission
    print("Number of comments: ", submission.num_comments) 

Title:  Here’s another video of me shaping sourdough. I added some music this time because baking is rock ’n roll.
Score:  4432
ID:  glzuwy
URL:  https://v.redd.it/t8jaoor0giz41
Created:  1589801997.0
Number of comments:  214
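
`created_utc` is a Unix timestamp (seconds since 1970-01-01 UTC), which is hard to read at a glance. A minimal sketch of converting it to a timezone-aware datetime, and of building a UTC timestamp for a chosen start date, using only the standard library:

```python
from datetime import datetime, timezone

# created_utc value from the test output above
created_utc = 1589801997.0
created = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(created.isoformat())  # 2020-05-18T11:39:57+00:00

# Going the other way: an explicit UTC start date as a Unix timestamp
start = int(datetime(2020, 12, 1, tzinfo=timezone.utc).timestamp())
print(start)  # 1606780800
```

Passing `tzinfo=timezone.utc` matters because `timestamp()` on a naive datetime is interpreted in the local timezone of the machine running the notebook.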


# Access pushshift.io

Functions and approach based on this article. 

In [99]:
import pandas as pd
import requests
import json

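def getPushshiftData(start_at, subreddit):
    # ask pushshift for up to 1000 submissions created after start_at
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {'size': 1000, 'after': start_at, 'subreddit': subreddit}
    r = requests.get(url, params=params)
    r.raise_for_status()  # surface HTTP errors before trying to parse the body
    return r.json()['data']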

#dictionary to store values in
post_dict = {
    "id": [],
    "score": [],
    "created_utc": [],
    "title": [],
    "num_comments": [],
}

#define search parameters
subreddit='Sourdough'
start_at = str(math.ceil(datetime(2020, 12, 1, 0, 0, 0).timestamp()))

#retrieve data given the parameters
data = getPushshiftData(start_at, subreddit)

# runs until all posts from the 'after' date up to today's date have been gathered
while len(data) > 0:
    for submission in data:
        post_dict["id"].append(submission["id"])
        post_dict["title"].append(submission["title"])
        post_dict["created_utc"].append(submission["created_utc"])
        post_dict["score"].append(submission["score"])
        post_dict["num_comments"].append(submission["num_comments"])
        
    # Calls getPushshiftData() with the created date of the last submission
    data = getPushshiftData(subreddit=subreddit, start_at=data[-1]['created_utc'])

print(len(post_dict["title"]))
print(len(post_dict["id"]))
print(len(post_dict["score"]))
print(len(post_dict["created_utc"]))

2934
2934
2934
2934
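
The loop above pages through results by using the `created_utc` of the last post as the next `after` value. The sketch below shows the same idea with two optional safeguards that are not in the original loop: skipping ids that were already collected and pausing between requests. `collect_submissions`, `fake_fetch` and `FAKE_POSTS` are hypothetical names used only for illustration; `fake_fetch` stands in for `getPushshiftData` so the example runs without network access.

```python
import time

def collect_submissions(fetch_page, start_at, pause_seconds=1.0):
    """Page through results, always asking for posts created after the last one seen."""
    seen_ids = set()
    posts = []
    cursor = start_at
    while True:
        page = fetch_page(cursor)
        if not page:
            break  # no more results
        new_posts = [p for p in page if p["id"] not in seen_ids]
        if not new_posts:
            break  # only repeats came back, so stop rather than loop forever
        for post in new_posts:
            seen_ids.add(post["id"])
            posts.append(post)
        cursor = page[-1]["created_utc"]
        time.sleep(pause_seconds)  # be polite to the API between requests
    return posts

# Made-up data and a stand-in fetch function (assumption: same shape as the API's 'data' list)
FAKE_POSTS = [{"id": f"p{i}", "created_utc": 100 + i} for i in range(5)]

def fake_fetch(after, page_size=2):
    return [p for p in FAKE_POSTS if p["created_utc"] > int(after)][:page_size]

posts = collect_submissions(fake_fetch, start_at=99, pause_seconds=0)
print(len(posts))  # 5
```

With the real API, `fetch_page` could be something like `lambda after: getPushshiftData(after, subreddit)`.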


In [100]:
post_dict["title"]

['Sourdough cinammon rolls I made for work!',
 '3rd try, first ear!!',
 'Pleasantly surprised, so many grains so many flavors.',
 'Burst',
 'Oven spring trouble shooting',
 'Today’s Bake - Accidental 5-day Cold Proof',
 '6 months of practice finally paid off! My first successful ear and semi-uniform crumb!',
 'Really happy with how this little baguette turned out. 48 hour ferment in the fridge, some great flavors!',
 'Anyone know why this keeps happening?',
 'One oval one round banneton Advice!',
 'Overnight Country Brown recipe made with Full Proof open crumb method. My favorite bread yet',
 'Flour blending resources.',
 'Slap and Fold',
 "Pumpkin spice bread with raisins! I decorated with scored flowers and I'm so happy with how they came out!",
 'I made a pumpkin spice loaf with pumpkin puree and raisins! I am really happy with how my flower decorations came out 😊',
 'Can someone diagnose this sourdough crumb please? I didn’t get good rise at all. Are those big holes indicative of u