Web Crawling & Scraping (Reddit)

This script connects to the Reddit API and crawls posts based on search terms.

Ref: http://www.storybench.org/how-to-scrape-reddit-with-python/

In [None]:
import praw
import pandas as pd
import datetime as dt

In [None]:
'''
Getting Reddit and subreddit instances

PRAW stands for Python Reddit API Wrapper.

First, we connect to Reddit by calling the praw.Reddit function and storing it in a variable.

I’m calling mine reddit. You should pass the following arguments to that function:
'''

reddit = praw.Reddit(client_id='YOUR_CLIENT_ID',
                     client_secret='YOUR_CLIENT_SECRET_KEY',
                     user_agent='YOUR_APP_NAME',
                     username='YOUR_REDDIT_USER_NAME',
                     password='YOUR_REDDIT_LOGIN_PASSWORD')
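Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched below, reads them from environment variables. The variable names (REDDIT_CLIENT_ID and so on) and the helper load_reddit_credentials are assumptions made for this example; they are not part of PRAW.

```python
import os

# Hypothetical environment variable names mapped to praw.Reddit keyword arguments.
ENV_TO_ARG = {
    "REDDIT_CLIENT_ID": "client_id",
    "REDDIT_CLIENT_SECRET": "client_secret",
    "REDDIT_USER_AGENT": "user_agent",
    "REDDIT_USERNAME": "username",
    "REDDIT_PASSWORD": "password",
}


def load_reddit_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_ARG if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[name] for name, arg in ENV_TO_ARG.items()}


# Usage (requires praw and the variables above to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early on a missing variable gives a clearer error than an authentication failure on the first request.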

In [None]:
'''
From there, we use the same logic to get to the subreddit we want:
call the .subreddit method on reddit and pass it the name of the subreddit we want to access.

It can be found after “r/” in the subreddit’s URL.
I’m going to use r/singapore, one of the subreddits we used in the story.
Assign a new variable like this:
'''
subreddit = reddit.subreddit('singapore')

In [None]:
'''
Accessing the threads

Each subreddit has five different ways of organizing the topics created by redditors:
.hot, .new, .controversial, .top, .gilded

You can also use .search("SEARCH_KEYWORDS") to get only the submissions that match a search query.

Let’s just grab the most up-voted topics all-time with the below.

This will return a lazy, list-like object that yields the top 100 submissions in r/singapore
(100 is PRAW's default limit).
'''
top_subreddit = subreddit.top()
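Since .search() is mentioned above but not demonstrated, here is a minimal sketch of how it might be wrapped. The helper name search_submissions and the choice of sort="new" are assumptions for illustration; the function expects a PRAW subreddit object such as the one created earlier.

```python
def search_submissions(subreddit, query, limit=25):
    """Return (id, title, score) tuples for submissions matching `query`.

    `subreddit` is expected to be a PRAW Subreddit,
    e.g. reddit.subreddit('singapore').
    """
    # sort and time_filter are keyword arguments of Subreddit.search;
    # limit is forwarded to the underlying listing.
    results = subreddit.search(query, sort="new", time_filter="all", limit=limit)
    return [(s.id, s.title, s.score) for s in results]


# Usage (needs a live Reddit connection):
# for row in search_submissions(subreddit, "cat rescue", limit=10):
#     print(row)
```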

In [None]:
'''
You can control the size of the sample by passing a limit to .top(),
but be aware that Reddit caps any single listing at about 1000 submissions:
'''
top_subreddit = subreddit.top(limit=500)

In [None]:
'''
Parsing and downloading the data

We are now really close to getting the data in our hands.
Each submission in our top_subreddit object has attributes that hold all kinds of information about it.
You can check it for yourself with these two simple lines:
'''
for submission in subreddit.top(limit=10):
    print(submission.title, submission.id)

In [None]:
'''
We will scrape this information about the topics:
  author, title, score, url, id, number of comments, date of creation, body text

This can be done very easily with a for loop just like above, but
first we need to create a place to store the data.

In Python, that is usually done with a dictionary. Let’s create it with the following code:
'''
topics_dict = {"author": [],
               "title": [],
               "score": [],
               "id": [],
               "url": [],
               "comms_num": [],
               "created": [],
               "body": []}

In [None]:
'''
Now we are ready to start scraping the data from the Reddit API.
We will iterate through our top_subreddit object and append the information to our dictionary.
'''
for submission in top_subreddit:
    topics_dict["author"].append(submission.author)
    topics_dict["title"].append(submission.title)
    topics_dict["score"].append(submission.score)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["created"].append(submission.created_utc)  # UNIX timestamp (seconds, UTC)
    topics_dict["body"].append(submission.selftext)
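As an alternative to eight parallel lists, each submission can be flattened into one dictionary and the rows collected in a list. The sketch below is one way to do that; the helper name submission_to_row is an assumption for this example. It also stores the author's name as a string, and None when the account has been deleted (PRAW reports a deleted author as None).

```python
def submission_to_row(submission):
    """Flatten one submission into a plain dict of built-in types."""
    author = submission.author  # None when the account was deleted
    return {
        "author": author.name if author is not None else None,
        "title": submission.title,
        "score": submission.score,
        "id": submission.id,
        "url": submission.url,
        "comms_num": submission.num_comments,
        "created": submission.created_utc,
        "body": submission.selftext,
    }


# Usage (needs a live Reddit connection and pandas):
# rows = [submission_to_row(s) for s in subreddit.top(limit=500)]
# topics_data = pd.DataFrame(rows)
```

Keeping one dict per submission means a failure halfway through cannot leave the columns with different lengths.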

In [None]:
'''
Python dictionaries, however, are not very easy for us humans to read.
This is where the Pandas module comes in handy.
We’ll finally use it to put the data into something that
looks like a spreadsheet. In Pandas, these are called DataFrames.
'''
topics_data = pd.DataFrame(topics_dict)

In [None]:
'''
The data now looks like this:
'''
topics_data

In [None]:
'''
Fixing the date column

Reddit stores dates and times as UNIX timestamps (seconds since 1 January 1970, UTC).
Instead of manually converting all those entries, or using a site like
www.unixtimestamp.com, we can easily write a function in Python to automate that process.

We define it, call it, and join the new column to the dataset with the following code:
'''

def get_date(created):
    return dt.datetime.fromtimestamp(created)

_timestamp = topics_data["created"].apply(get_date)

topics_data = topics_data.assign(timestamp = _timestamp)
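Note that datetime.fromtimestamp without a time zone returns a naive datetime in the local time zone of the machine running the notebook, so the same data can produce different timestamps on different machines. A minimal variant that pins the result to UTC is sketched below; the name get_date_utc is an assumption for this example.

```python
import datetime as dt


def get_date_utc(created):
    """Convert a UNIX timestamp (seconds) to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)


# Usage with the DataFrame built earlier:
# topics_data = topics_data.assign(timestamp=topics_data["created"].apply(get_date_utc))
```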

In [None]:
'''
The dataset now has a new column that we can understand and is ready to be exported.
'''
topics_data

In [None]:
'''
Exporting a CSV

Pandas makes it very easy for us to create data files in various formats,
including CSVs and Excel workbooks.

To finish up the script, add the following to the end.
'''

topics_data.to_csv('Reddit_output.csv', index=False) 
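Post bodies often contain commas and line breaks. CSV copes with these by quoting the affected fields, which to_csv does by default. The sketch below uses Python's standard csv module rather than pandas, with a made-up row, to illustrate that such text survives a write and read round trip.

```python
import csv
import io

# A made-up row whose title contains a comma and whose body spans two lines.
rows = [{"id": "abc123", "title": "Hello, world", "body": "line one\nline two"}]

# Write the row to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "title", "body"])
writer.writeheader()
writer.writerows(rows)

# Read it back and check that nothing was split or lost.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["body"] == rows[0]["body"])
```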

In [None]:
'''
Retrieving a particular submission
Ref: https://praw.readthedocs.io/en/latest/tutorials/comments.html

Assume we want to process the comments for this submission:

https://www.reddit.com/r/singapore/comments/sykq9h/humans_working_together_to_rescue_a_cat/


First, we need to obtain a submission object. There are 2 ways to do this.
1) Retrieve by URL
2) Retrieve by submission ID (which we happen to know is 'sykq9h')
'''

submission = reddit.submission(url='https://www.reddit.com/r/singapore/comments/sykq9h/humans_working_together_to_rescue_a_cat/')
#submission = reddit.submission(id='sykq9h')
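The submission ID is the path segment that follows "/comments/" in the URL. As an illustration of where it comes from, the sketch below extracts it with the standard library. The helper name submission_id_from_url is an assumption for this example, and it only handles URLs of the /comments/<id>/ form shown above.

```python
from urllib.parse import urlparse


def submission_id_from_url(url):
    """Return the ID that follows '/comments/' in a Reddit submission URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "comments" not in parts:
        raise ValueError("Not a submission URL: " + url)
    index = parts.index("comments")
    if index + 1 >= len(parts):
        raise ValueError("No submission ID in URL: " + url)
    return parts[index + 1]


url = ("https://www.reddit.com/r/singapore/comments/sykq9h/"
       "humans_working_together_to_rescue_a_cat/")
print(submission_id_from_url(url))
```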

In [None]:
'''
With a submission object we can then interact with its CommentForest
through the submission’s comments attribute. 

A CommentForest is a list of top-level comments each of which contains a CommentForest of replies.

If we wanted to output only the body of the top level comments in the thread we could do:
'''
for top_level_comment in submission.comments:
    print(top_level_comment.body)

In [None]:
'''
While running this you will most likely encounter the exception

AttributeError: 'MoreComments' object has no attribute 'body'

This submission’s comment forest contains a number of MoreComments objects.

These objects represent the “load more comments”, and “continue this thread” links 
encountered on the website.

We could simply skip the MoreComments objects in our code, like so:
'''
from praw.models import MoreComments
for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    print(top_level_comment.body)

In [None]:
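'''
A better option is to call replace_more() on the comment forest,
which replaces the MoreComments objects with the comments they stand for.

A limit of None means that all MoreComments objects will be replaced until there are none left,
as long as they satisfy the threshold.
'''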
from praw.models import MoreComments

submission.comments.replace_more(limit=None)

for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    print(top_level_comment.body)
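Replacing every MoreComments object costs extra requests on large threads. When only the comments already loaded are needed, replace_more(limit=0) removes the MoreComments objects without fetching anything further. The helper below is a sketch of that trade-off; the name top_level_bodies is an assumption for this example.

```python
def top_level_bodies(submission, limit=0):
    """Return the bodies of a submission's top-level comments.

    limit=0 drops the MoreComments placeholders without extra requests;
    limit=None replaces all of them, which is slower but more complete.
    """
    submission.comments.replace_more(limit=limit)
    return [comment.body for comment in submission.comments]


# Usage (needs a live Reddit connection):
# for body in top_level_bodies(submission):
#     print(body)
```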

In [None]:
'''
Now we are able to successfully iterate over all the top-level comments.

What about their replies? We could output all second-level comments like so:
'''
from praw.models import MoreComments

submission.comments.replace_more(limit=None)

for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    
    print("=========== Top Level Comment ===========")
    print(top_level_comment.body)
    
    for second_level_comment in top_level_comment.replies:
        print("      ############ Second Level Comment ############")
        print(second_level_comment.body)

In [None]:
'''
However, the comment forest can be arbitrarily deep, so we’ll want a more robust solution.

One way to iterate over a tree, or forest, is via a breadth-first traversal using a queue:
'''
submission.comments.replace_more(limit=None)
comment_queue = submission.comments[:]  # Seed with top-level
while comment_queue:
    comment = comment_queue.pop(0)
    print(comment.body)
    comment_queue.extend(comment.replies)
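The same traversal can also track how deep each comment sits, which is handy for indenting output. The sketch below uses collections.deque, whose popleft() avoids the cost of list.pop(0) on long queues. The stand-in comment objects are made up so the example runs offline; with PRAW, submission.comments would be passed in after replace_more().

```python
from collections import deque
from types import SimpleNamespace


def walk_breadth_first(top_level_comments):
    """Yield (depth, comment) pairs level by level, top-level comments first."""
    queue = deque((0, comment) for comment in top_level_comments)
    while queue:
        depth, comment = queue.popleft()
        yield depth, comment
        # Replies go to the back of the queue, one level deeper.
        queue.extend((depth + 1, reply) for reply in comment.replies)


# Made-up stand-ins for comments: each has a body and a list of replies.
reply = SimpleNamespace(body="reply to first", replies=[])
forest = [
    SimpleNamespace(body="first", replies=[reply]),
    SimpleNamespace(body="second", replies=[]),
]

for depth, comment in walk_breadth_first(forest):
    print("  " * depth + comment.body)
```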

In [None]:
'''
The above code will output all the top-level comments, followed by second-level, third-level, etc. 

While it is awesome to be able to do your own breadth-first traversals, 
CommentForest provides a convenience method, list(), which returns a list of comments 
traversed in the same order as the code above.

Thus the above can be rewritten as:
'''

submission.comments.replace_more(limit=None)
for comment in submission.comments.list():
    print("=== Author: ", comment.author, "===")
    print(comment.body)
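To analyse the comments rather than just print them, each one can be flattened into a dictionary, in the same spirit as the submissions earlier. This is a minimal sketch; the helper name comment_to_row and the choice of fields are assumptions for this example.

```python
def comment_to_row(comment):
    """Flatten one comment into a plain dict of built-in types."""
    author = comment.author  # None when the account was deleted
    return {
        "id": comment.id,
        # parent_id starts with "t3_" for a reply to the submission
        # and "t1_" for a reply to another comment.
        "parent_id": comment.parent_id,
        "author": author.name if author is not None else None,
        "score": comment.score,
        "created_utc": comment.created_utc,
        "body": comment.body,
    }


# Usage (needs a live Reddit connection and pandas):
# submission.comments.replace_more(limit=None)
# comments_data = pd.DataFrame(comment_to_row(c) for c in submission.comments.list())
```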

You can now properly extract and parse all (or most) of the comments belonging to a single submission.

For more information about what attributes you can crawl:

1) Submission
https://praw.readthedocs.io/en/latest/code_overview/models/submission.html

2) Comment
https://praw.readthedocs.io/en/latest/code_overview/models/comment.html
