# Web Data Scraping

[Spring 2021 ITSS Mini-Course](https://www.colorado.edu/cartss/programs/interdisciplinary-training-social-sciences-itss/mini-course-web-data-scraping) — ARSC 5040  
[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Class outline

* **Week 1**: Introduction to Jupyter, browser console, structured data, ethical considerations
* **Week 2**: Scraping HTML with `requests` and `BeautifulSoup`
* **Week 3**: Scraping web data with Selenium
* **Week 4**: Scraping an API with `requests` and `json`, Wikipedia and Reddit
* **Week 5**: Scraping data from Twitter

## Acknowledgements

This course will draw on resources built by myself and [Allison Morgan](https://allisonmorgan.github.io/) for the [2018 Summer Institute for Computational Social Science](https://github.com/allisonmorgan/sicss_boulder), which were in turn derived from [other resources](https://github.com/simonmunzert/web-scraping-with-r-extended-edition) developed by [Simon Munzert](http://simonmunzert.github.io/) and [Chris Bail](http://www.chrisbail.net/). 

Thank you also to Professor [Terra McKinnish](https://www.colorado.edu/economics/people/faculty/terra-mckinnish) for coordinating the ITSS seminars.

This notebook is adapted from excellent notebooks in Dr. [Cody Buntain](http://cody.bunta.in/)'s seminar on [Social Media and Crisis Informatics](http://cody.bunta.in/teaching/2018_winter_umd_inst728e/) as well as the [PRAW documentation](https://praw.readthedocs.io/en/latest/).

## Class 4 goals

* Sharing accomplishments and challenges with last week's material
* Accessing an API with `requests`
* Retrieving historical web data from the Internet Archive
* Why you don't want to write a parser for Wikipedia's data
* Fundamentals of retrieving information from Wikipedia's API
* EDA with data from Wikipedia's API
* Reddit's API, time permitting

We'll need a few common libraries for all these examples.

In [None]:
# Lets us talk to other servers on the web
import requests

# APIs spit out data in JSON
import json

# Use BeautifulSoup to parse some HTML
from bs4 import BeautifulSoup

# Handling dates and times
from datetime import datetime

# DataFrames!
import pandas as pd
import numpy as np

# Data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

## Scraping the Internet Archive's Wayback Machine

Now we'll leave some of the ethically-fraught methods of web scraping behind. The Internet Archive maintains the "[Wayback Machine](https://www.archive.org/web/)" where old versions of websites are stored. Some of my favorites:

* [CNN in August 2000](https://web.archive.org/web/20000815052826/http://www.cnn.com/)
* [Facebook in August 2004](https://web.archive.org/web/20040817020419/http://www.facebook.com/)
* [Apple in April 1997](https://web.archive.org/web/19970404064444/http://www.apple.com:80/)

In the URLs above, there is a numeric identifier corresponding to the timestamp when the snapshot of the website was captured. How do we know when the Wayback Machine archived a webpage? There's a free and open API!

### Using the Wayback Machine API

The simplest API request we can make asks for the most recent snapshot of a webpage archived by the Wayback Machine.

In [None]:
wb_url = 'http://archive.org/wayback/available?url=facebook.com'

wb_response = requests.get(wb_url)

wb_response.json()

This response tells us the timestamp and location of this snapshot, which we could then go retrieve and parse.

In [None]:
wb_response_json = wb_response.json()

recent_fb_wb_url = wb_response_json['archived_snapshots']['closest']['url']

recent_fb_wb_response = requests.get(recent_fb_wb_url)

Get the raw text out, soupify, and look for links. For some reason all the links in this snapshot are in German.

In [None]:
recent_fb_wb_raw = recent_fb_wb_response.text

recent_fb_wb_soup = BeautifulSoup(recent_fb_wb_raw, 'html.parser')

[link.text for link in recent_fb_wb_soup.find_all('a')]

We can also ask for the snapshot of a webpage closest to a specific date. Let's ask the Wayback Machine for a snapshot of Facebook around February 1, 2008.

In [None]:
wb_url = 'http://archive.org/wayback/available?url=facebook.com&timestamp=20080201'

wb_response = requests.get(wb_url)

wb_response_json = wb_response.json()

wb_response_json

Note that this is a relatively deep JSON object we have to navigate into to access information like the Wayback URL or the timestamp of the snapshot. The closest snapshot to February 1, 2008 was January 30, 2008. We use the [`datetime.strptime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) function to turn this numeric string that we recognize as a timestamp into a datetime object.

In [None]:
wb_response_json['archived_snapshots']['closest']['timestamp']

In [None]:
print(datetime.strptime(wb_response_json['archived_snapshots']['closest']['timestamp'],
                        '%Y%m%d%H%M%S'))
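
Navigating the nested response and parsing the timestamp can be wrapped in one small helper. This is a minimal sketch: `closest_snapshot` is a hypothetical name, and `sample_response` is a made-up payload shaped like the API's output (the values are illustrative, not real results), so it runs without touching the network.

```python
from datetime import datetime

def closest_snapshot(response_json):
    """Return (url, datetime) for the closest snapshot, or None if there is none."""
    # .get() with a default avoids a KeyError when a level is missing
    closest = response_json.get('archived_snapshots', {}).get('closest')
    if not closest:
        return None
    captured = datetime.strptime(closest['timestamp'], '%Y%m%d%H%M%S')
    return closest['url'], captured

# Made-up payload shaped like the availability API's response (illustrative values)
sample_response = {
    'archived_snapshots': {
        'closest': {
            'available': True,
            'status': '200',
            'timestamp': '20080130120000',
            'url': 'http://web.archive.org/web/20080130120000/http://www.facebook.com/',
        }
    }
}

snapshot_url, snapshot_time = closest_snapshot(sample_response)
print(snapshot_time)

# A page that was never archived comes back with no "closest" entry
empty_result = closest_snapshot({'archived_snapshots': {}})
```

With a live response, `closest_snapshot(wb_response.json())` would be called the same way.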

As before, we could scrape out the links on this 2008 version of the page.

In [None]:
# Find the old URL
fb_wb_url = wb_response_json['archived_snapshots']['closest']['url']

# Go get the archived snapshot from the Wayback Machine
fb_wb_response = requests.get(fb_wb_url)

# Get the text from the response
fb_wb_raw = fb_wb_response.text

# Soup-ify
fb_wb_soup = BeautifulSoup(fb_wb_raw, 'html.parser')

# Make a list of the text of the links
[link.text for link in fb_wb_soup.find_all('a')]

### Scraping historical web pages

A current project I am working on is exploring how social media platforms' terms of service have evolved over time. Let's start with Facebook's terms of service and privacy policy.

In [None]:
fb_tos = 'http://www.facebook.com/terms.php'
fb_pp = 'http://www.facebook.com/policy.php'

We will take advantage of the [`date_range`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) function in `pandas` to generate a range of month-end dates between January 2005 and March 2021.

In [None]:
dates_list = pd.date_range(start='2005-01-01',end='2021-03-01',freq='M')
dates_list

We'll use [`datetime.strftime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) (the inverse of `strptime`) to make these date objects into specifically-formatted strings that we can format into a URL.

In [None]:
# Take the first datetime object and turn it into a string
datetime.strftime(dates_list[0],'%Y%m%d')

Use string formatting to put the `fb_tos` URL and formatted timestamp into a request to the Wayback Machine.

In [None]:
date_str = datetime.strftime(dates_list[0],'%Y%m%d')

wb_api_url = 'https://archive.org/wayback/available?url={0}&timestamp={1}'
wb_api_url_formatted = wb_api_url.format(fb_tos,date_str)

print(wb_api_url_formatted)

Make the request to the Wayback Machine to get the URL and timestamp of the Wayback Machine's closest snapshot of Facebook's Terms of Service to January 31, 2005.

In [None]:
wb_api_response = requests.get(wb_api_url_formatted)

wb_api_response.json()

Parse the markup of this old version.

In [None]:
# Find the old URL
wb_fb_old_url = wb_api_response.json()['archived_snapshots']['closest']['url']

# Go get the archived snapshot from the Wayback Machine
wb_fb_raw = requests.get(wb_fb_old_url).text

# Soup-ify
wb_fb_soup = BeautifulSoup(wb_fb_raw, 'html.parser')

# Find the content element and get the text out
wb_fb_terms_str = wb_fb_soup.find('div',{'id':'content'}).text.strip()

# Inspect
wb_fb_terms_str

We could use a really dumb tokenizer, [`.split()`](https://docs.python.org/3.7/library/stdtypes.html#str.split), to count the number of words in these terms.

In [None]:
len(wb_fb_terms_str.split())
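
To see what "dumb" means here, compare whitespace splitting with a slightly stricter regular expression on a short made-up sentence (the sentence and the pattern are illustrative assumptions, not taken from the actual terms). The counts agree, but `.split()` leaves punctuation attached to the words.

```python
import re

# A made-up sentence standing in for the terms of service text
text = "By using the Site, you agree: don't spam."

# Whitespace splitting keeps punctuation glued to words
whitespace_tokens = text.split()

# A simple pattern that keeps letters, digits, and apostrophes only
word_tokens = re.findall(r"[A-Za-z0-9']+", text)

print(whitespace_tokens)
print(word_tokens)
```

For a rough word count either approach is fine; the difference matters once you start comparing or counting individual words.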

Write a function that loops over a range of dates like our `dates_list` and finds the closest snapshot of Facebook's ToS for each month.

In [None]:
def get_urls(url_str,start_date='2005-01-01',end_date='2021-03-01',freq='M'):
    
    # Make the list of dates
    date_l = pd.date_range(start_date,end_date,freq=freq)
    
    # Create an empty container to store our data
    urls = dict()

    # For each date in the list of dates
    for date in date_l:
        
        # Turn the date object back into a string
        date_str = datetime.strftime(date,'%Y%m%d%H%M%S')
        
        # Define the API URL request to the Wayback machine
        wb_api_url = 'http://archive.org/wayback/available?url={0}&timestamp={1}'
        
        # Format the API URL with the URL of the website and the closest datetime
        wb_api_request = wb_api_url.format(url_str,date_str)
        
        # Make the request
        r = requests.get(wb_api_request).json()

        # Check if the returned request has all the right parts (this is probably overkill)
        if 'archived_snapshots' in r.keys():
            if 'closest' in r['archived_snapshots'].keys():
                if 'url' in r['archived_snapshots']['closest'].keys():
                    
                    # If it does have all the right parts, get the URL
                    _url = r['archived_snapshots']['closest']['url']
                    
                    # Get the timestamp
                    _timestamp = r['archived_snapshots']['closest']['timestamp']
                    
                    # Save to our URL dictionary with the timestamp of the snapshot as key, the url as value
                    urls[_timestamp] = _url
    return urls

Run our function to make a dictionary that maps each snapshot's timestamp to the Wayback Machine URL for that month's version of the terms of service. Then we'll write a loop to get the Terms for each snapshot and count the words.

This will take a few minutes. 

I've converted the code block into a "Raw" cell to prevent accidental execution. You can always turn it into a "Code" cell if you really want to run it.

In [None]:
# Get the list of timestamps and URLs for each monthly version of the Terms of Service
fb_terms_d = get_urls('https://www.facebook.com/terms.php')

# Create an empty container to store our data
fb_terms_wordcount = {}

# Loop through the fb_terms_d dictionary
for timestamp,url in fb_terms_d.items():

    # Get the raw HTML from the Wayback Machine
    raw = requests.get(url).text
    
    # Soup-ify
    soup = BeautifulSoup(raw, 'html.parser')
    
    # Find the content of the TOS; skip snapshots that lack this element
    content_div = soup.find('div',{'id':'content'})
    if content_div is None:
        continue
    content = content_div.text.strip()
    
    # Split the content into words, count the number of words, save to the container
    fb_terms_wordcount[timestamp] = len(content.split())
    
# Write to disk
with open('facebook_tos_archive.json','w') as f:
    json.dump(fb_terms_wordcount,f)

To avoid having everyone hit the Internet Archive server with the same requests, you can also load this file with the same data.

In [None]:
with open('facebook_tos_archive.json','r') as f:
    fb_terms_wordcount2 = json.load(f)

Visualize the changes in the size of Facebook's Terms of Service over time.

In [None]:
# Turn the dictionary into a pandas Series
fb_terms_s = pd.Series(fb_terms_wordcount2)

# Convert the index to datetime objects
fb_terms_s.index = pd.to_datetime(fb_terms_s.index)

# Plot
ax = fb_terms_s.plot()

# Make the x-tick labels less weird
# ax.set_xticklabels(range(2004,2021,2),rotation=0,horizontalalignment='center')

# Always label your axes
ax.set_ylabel('Word count');

## Scraping Wikipedia

Consider the Wikipedia page for [George H.W. Bush](https://en.wikipedia.org/wiki/George_H._W._Bush). This seems like a relatively straightforward webpage to scrape out the hyperlinks to other articles or to compare the content to other presidential biographies. However, Wikipedia also preserves the [history of every revision made to this article](https://en.wikipedia.org/w/index.php?title=George_H._W._Bush&action=history) going back to the first (available) revisions in 2001, like [this](https://en.wikipedia.org/w/index.php?title=George_H._W._Bush&oldid=345784898). Thinking back to the Oscars example, it seems promising to find the "oldid" values and visit each revision's webpage to parse the content out. However, Wikipedia will give you much of this revision history data for free through its [application programming interface](http://en.wikipedia.org/w/api.php) (API).

### Current content
We can use `requests` to get the current HTML markup of an article from the API, for example.

In [None]:
# Where the API server lives
query_url = "https://en.wikipedia.org/w/api.php"

# An empty dictionary to store our query parameters
query_params = {}

# We want to parse the content of a page
query_params['action'] = 'parse'

# Which page?
query_params['page'] = 'George H. W. Bush'

# We want the text
query_params['prop'] = 'text'

# Ignore the edit buttons and table of contents
query_params['disableeditsection'] = 1
query_params['disabletoc'] = 1

# Get the results back as JSON
query_params['format'] = 'json'

# Format the data in an easier-to-parse option
query_params['formatversion'] = 2

We have only set up our request to the API, but not sent it or received the data back.

In [None]:
json_response = requests.get(url = query_url, params = query_params).json()
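
It can help to see what that request looks like on the wire. The sketch below uses only the standard library to build roughly the URL that `requests` assembles from a `params` dictionary; the exact escaping `requests` applies may differ in small ways, so treat this as an approximation. Only a subset of the parameters is shown.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

query_url = 'https://en.wikipedia.org/w/api.php'

# A subset of the parameters defined above
query_params = {
    'action': 'parse',
    'page': 'George H. W. Bush',
    'prop': 'text',
    'format': 'json',
    'formatversion': 2,
}

# urlencode escapes spaces and other special characters for us
full_url = query_url + '?' + urlencode(query_params)
print(full_url)

# Decoding the query string recovers the original values (as lists of strings)
decoded = parse_qs(urlsplit(full_url).query)
print(decoded['page'])
```

Passing a dictionary instead of formatting the URL by hand means page titles with spaces or punctuation are escaped correctly.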

What's waiting inside? A dictionary of dictionaries. The inner dictionary has keys for the title of the page we requested ("George H. W. Bush"), the pageid (a numeric identifier), and the text of the article.

In [None]:
json_response['parse'].keys()

We could extract the links in the article (wrapping the result in `len` would count them).

In [None]:
ghwb_soup = BeautifulSoup(json_response['parse']['text'], 'html.parser')

ghwb_soup.find_all('a')[:5]

Or the content of the article.

In [None]:
ghwb_soup.find_all('p')[:5]

### Revision history

The same API can also return the revision history of this article, which contains metadata about the who and when of previous changes.

In [None]:
# Where the API server lives
query_url = "https://en.wikipedia.org/w/api.php"

# An empty dictionary to store our query parameters
query_params = {}

# We want to query properties of a page
query_params['action'] = 'query'

# Which page?
query_params['titles'] = 'George H. W. Bush'

# We want the revisions
query_params['prop'] = 'revisions'

# In particular, we want the revision ids, users, comments, timestamps
query_params['rvprop'] = 'ids|userid|comment|timestamp|user|size|sha1'

# Get 500 revisions
query_params['rvlimit'] = 500

# Start old and go newer
query_params['rvdir'] = 'newer'
    
# Get the results back as JSON
query_params['format'] = 'json'

# Format the data in an easier-to-parse option
query_params['formatversion'] = 2

Make the request.

In [None]:
json_response = requests.get(url = query_url, params = query_params).json()

Inspect this `json_response`. This returns a dictionary with both "continue" and "query" keys. The continue indicates there are more than 500 revisions present in the article's history and provides an index for the next query to pick up from. The query contains the revision history we care about—buried a bit in a nested data structure of lists and dictionaries, but we eventually get to the "revisions" list of dictionaries with the revision histories.

In [None]:
revisions = json_response['query']['pages'][0]['revisions']
revisions[:3]
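
To go beyond the first 500 revisions, the values under "continue" are merged back into the query parameters and the request is repeated until no "continue" key comes back. Here is a minimal sketch of that loop. `get_all_revisions` and `fake_fetch` are hypothetical names, and `fake_fetch` returns made-up batches so the example runs offline; with the live API you might pass something like `lambda p: requests.get(query_url, params=p).json()` instead.

```python
def get_all_revisions(fetch, params):
    """Follow 'continue' tokens until the API stops returning them."""
    params = dict(params)  # copy so the caller's dictionary is not modified
    revisions = []
    while True:
        response = fetch(params)
        # With formatversion=2, 'pages' is a list of page dictionaries
        for page in response['query']['pages']:
            revisions.extend(page.get('revisions', []))
        if 'continue' not in response:
            break
        # Merge the continuation values into the next request
        params.update(response['continue'])
    return revisions

def fake_fetch(params):
    """Stand-in for the API: two made-up batches, keyed on the continuation token."""
    if 'rvcontinue' not in params:
        return {'continue': {'rvcontinue': 'token-2', 'continue': '||'},
                'query': {'pages': [{'title': 'Example',
                                     'revisions': [{'revid': 1}, {'revid': 2}]}]}}
    return {'query': {'pages': [{'title': 'Example',
                                 'revisions': [{'revid': 3}]}]}}

all_revisions = get_all_revisions(fake_fetch, {'action': 'query', 'prop': 'revisions'})
print([rev['revid'] for rev in all_revisions])
```

Heavily edited articles have many thousands of revisions, so a real run makes many requests; pausing between them is kinder to the servers.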

Convert to a DataFrame.

In [None]:
rev_df = pd.DataFrame(revisions)
rev_df.head()

Plot out how the size of the article changed over the first 500 revisions.

In [None]:
ax = rev_df.plot(y='size',legend=False)
ax.set_ylabel('Size (bytes)')
ax.set_xlabel('Revision')
ax.set_xlim((0,500))

Or count how many times an editor made a contribution.

In [None]:
rev_df['user'].value_counts().head()
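
The revision timestamps come back as ISO 8601 strings, so they need parsing before any analysis over time. The sketch below counts revisions per year with the standard library; `sample_revisions` is made-up data in the same shape as the `revisions` list, and `revisions_per_year` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime

# Made-up records shaped like entries in the revisions list (illustrative values)
sample_revisions = [
    {'user': 'EditorA', 'timestamp': '2001-10-26T02:12:14Z', 'size': 1200},
    {'user': 'EditorB', 'timestamp': '2001-12-03T17:45:00Z', 'size': 1350},
    {'user': 'EditorA', 'timestamp': '2002-02-25T15:43:11Z', 'size': 2100},
]

def revisions_per_year(revisions):
    """Count how many revisions fall in each calendar year."""
    years = (datetime.strptime(rev['timestamp'], '%Y-%m-%dT%H:%M:%SZ').year
             for rev in revisions)
    return Counter(years)

per_year = revisions_per_year(sample_revisions)
print(per_year)
```

In pandas, `pd.to_datetime(rev_df['timestamp'])` performs the same parsing on the whole column at once.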

The Wikipedia API is very powerful and has many other parts. Querying it exposes much more metadata than parsing the HTML of these webpages, while also being easier on the servers hosting them. I will share a notebook that has functions for retrieving and parsing content, revisions, pageviews, and other information.

## Scraping Reddit

Reddit also hosts a lot of detailed behavioral data that could be of interest to social scientists. As was the case with Wikipedia, our naïve inclination may be to develop scrapers and parsers to extract this information, but Reddit will give much of it to you for free through their API!

You can retrieve a few different types of entities from Reddit's API: sub-reddits, submissions, comments, and redditors. Many of these are interoperable: a sub-reddit contains submissions contributed by redditors with comments from other redditors.

We will use a wrapper library to communicate with the Reddit API called [Python Reddit API Wrapper](https://praw.readthedocs.io/en/latest/) or `praw`. 

Copy the code below to your terminal to install `praw`.

`conda install -c conda-forge praw`

Afterwards, we can import `praw`.

In [None]:
import praw

We then need to authenticate with Reddit to get access to the API. Typically you can just enter the client ID, client secret, password, username, *etc*. as strings. 

1. You will need to create an account on Reddit. After you have created an account and logged in, go to https://www.reddit.com/prefs/apps/. 
2. Scroll down and click the "create app" button at the bottom. Provide a basic name, description, and enter a URL for your homepage (or just use http://www.colorado.edu).
3. You will need the client ID (the string of characters beneath the name of your app) as well as the secret (the other string of characters) as well as your username and password.
4. You can make up a user-agent string, but include your username as good practice for the sysadmins to track you down if you break things.

![Image from Cody Buntain](http://www.cs.umd.edu/~cbuntain/inst728e/reddit_screens/1-003a.png)

You'll create an API connector object (`r`) below that will authenticate with the API and handle making the requests.

In [None]:
r = praw.Reddit(client_id='your application id',
                client_secret='your application secret',
                password='your account password',
                user_agent='scraping script by /u/youraccountname',
                username='your account name')

You can confirm that this authentication process worked by making a simple request like printing your username.

In [None]:
print(r.user.me())

I'm going to read my credentials in from a local file ("reddit_login.json") so that I can post this notebook on the internet in the future without compromising my account security. This won't work for you, so just skip this step.

In [None]:
# Load my credentials from a local disk so I don't show the world
with open('reddit_login.json','r') as f:
    r_creds = json.load(f)

In [None]:
# Create an authenticated reddit instance using the creds
r = praw.Reddit(client_id = r_creds['client_id'],
                client_secret = r_creds['client_secret'],
                password = r_creds['password'],
                user_agent = r_creds['user_agent'],
                username = r_creds['username'])

# Make sure your reddit instance works
print(r.user.me())
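
If you want to keep your own credentials out of the notebook in the same way, the file needs the five fields used above. The following is a sketch under that assumption: `load_reddit_credentials` and `REQUIRED_KEYS` are hypothetical names, and the demonstration writes placeholder values to a temporary directory rather than a real credentials file.

```python
import json
import os
import tempfile

# The fields the praw.Reddit call above reads from the credentials dictionary
REQUIRED_KEYS = {'client_id', 'client_secret', 'password', 'user_agent', 'username'}

def load_reddit_credentials(path):
    """Load a credentials JSON file and check that no field is missing."""
    with open(path, 'r') as f:
        creds = json.load(f)
    missing = REQUIRED_KEYS - set(creds)
    if missing:
        raise KeyError('Missing credential fields: {0}'.format(sorted(missing)))
    return creds

# Demonstrate with placeholder values in a temporary directory
placeholder = {key: 'REPLACE_ME' for key in REQUIRED_KEYS}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'reddit_login.json')
    with open(demo_path, 'w') as f:
        json.dump(placeholder, f)
    demo_creds = load_reddit_credentials(demo_path)

print(sorted(demo_creds))
```

Keep the real file out of version control so the secret never ends up in a public repository.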

### Sub-reddits
Now print the top 25 stories in /r/news.

[Documentation for the Subreddit model in PRAW](https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html).

Create a `news_subreddit` object to store the various attributes about this sub-reddit.

In [None]:
news_subreddit = r.subreddit('news')

The `news_subreddit` has a number of attributes and methods you can call on it, such as the time the sub-reddit was founded.

In [None]:
news_subreddit.created_utc

That's formatted as a Unix timestamp (seconds since 1 January 1970), but we can convert it into a more readable timestamp with `datetime`'s `utcfromtimestamp`.

In [None]:
print(datetime.utcfromtimestamp(news_subreddit.created_utc))
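
`utcfromtimestamp` returns a datetime with no timezone attached and is deprecated as of Python 3.12. A timezone-aware alternative is sketched below; `utc_datetime` is a hypothetical helper name, and the inputs are arbitrary example values rather than anything returned by Reddit.

```python
from datetime import datetime, timezone

def utc_datetime(unix_seconds):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

epoch = utc_datetime(0)
billion = utc_datetime(1_000_000_000)

print(epoch.isoformat())    # 1970-01-01T00:00:00+00:00
print(billion.isoformat())  # 2001-09-09T01:46:40+00:00
```

The same helper would work on any of the `created_utc` values used in this notebook.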

There are other attributes such as the number of subscribers, current active users, as well as the description of the sub-reddit.

In [None]:
'{0:,}'.format(news_subreddit.subscribers)

In [None]:
news_subreddit.over18

In [None]:
news_subreddit.active_user_count

In [None]:
print(news_subreddit.description)

The rules of the sub-reddit are available through the `.rules()` method, which returns a dictionary whose "rules" key holds a list of rule dictionaries.

In [None]:
news_subreddit.rules()['rules']

When was each of these rules created? Loop through each of the rules and print the "short_name" of the rule and the rule timestamp.

In [None]:
for rule in news_subreddit.rules()['rules']:
    created = rule['created_utc']
    print(rule['short_name'], datetime.utcfromtimestamp(created))

We can also get a list of the moderators for this subreddit.

In [None]:
mod_list = []

for mod in news_subreddit.moderator():
    mod_list.append(mod.name)
    
mod_list

### Submissions

We can get a list of submissions to a sub-reddit using [a few different methods](https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html).

* `.controversial()`
* `.hot()`
* `.new()`
* `.rising()`
* `.search()`
* `.top()`

Here we will use the `.top()` method to get the top 25 submissions on the /r/news subreddit from the past 12 months.

[Documentation for the Submission model in PRAW](https://praw.readthedocs.io/en/latest/code_overview/models/submission.html).

In [None]:
top25_news = r.subreddit('news').top('year',limit=25)

`top25_news` is a `ListingGenerator` object, which is a special [generator](https://www.dataquest.io/blog/python-generators-tutorial/) class defined by PRAW. It does not actually go out and get the data at this stage. There's not much you can do to look inside this `ListingGenerator` other than loop through and perform operations. In this case, let's add each submission to a list of `top25_submissions`.

In [None]:
top25_submissions = []

for submission in r.subreddit('news').top('year',limit=25):
    top25_submissions.append(submission)
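
Storing the results in a list matters because generators are lazy and can only be consumed once. The sketch below shows this with a plain Python generator; `fake_listing` is a hypothetical stand-in, and I am assuming PRAW's `ListingGenerator` behaves similarly in this respect.

```python
from itertools import islice

def fake_listing(limit):
    """A plain generator standing in for a listing; nothing runs until it is iterated."""
    for i in range(limit):
        yield 'submission-{0}'.format(i)

listing = fake_listing(5)

# Take only the first two items; the rest have not been produced yet
first_two = list(islice(listing, 2))

# Iterating again picks up where we left off
remaining = list(listing)

# Once exhausted, the generator yields nothing more
leftover = list(listing)

print(first_two, remaining, leftover)
```

Because the list keeps the items around, `top25_submissions` can be indexed and looped over as many times as needed.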

We can inspect the first (top) `Submission` object.

In [None]:
first_submission = top25_submissions[0]
first_submission

Use the `dir` function to see the other methods and attributes inside this first top `Submission` object. The list comprehension drops every name containing "\_", which hides the many "hidden" attributes and methods but also public names such as `num_comments`.

In [None]:
[i for i in dir(first_submission) if '_' not in i]
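
To hide only the private names while keeping public ones like `num_comments`, filter on a leading underscore instead. The toy class below is a hypothetical stand-in for a `Submission`, used so the comparison runs without an API connection.

```python
class FakeSubmission:
    """Toy object with one plain, one underscored, and one private attribute."""
    def __init__(self):
        self.title = 'Example'
        self.num_comments = 3
        self._fetched = False

example = FakeSubmission()

# Drops every name containing an underscore, including num_comments
no_underscore_anywhere = [name for name in dir(example) if '_' not in name]

# Drops only names that start with an underscore
no_leading_underscore = [name for name in dir(example) if not name.startswith('_')]

print(no_underscore_anywhere)
print(no_leading_underscore)
```

The same `startswith('_')` filter can be applied to `dir(first_submission)`.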

`vars` may be even more helpful.

In [None]:
vars(first_submission)

We can extract the features of each submission, store them in a dictionary, and save to an external list. This step will take a while (approximately one second per submission) because we make an API call for each submission in the `ListingGenerator` returned by the `r.subreddit('news').top('year',limit=25)` we're looping through.

In [None]:
submission_stats = []

for submission in r.subreddit('news').top('year',limit=25):
    d = {}
    d['id'] = submission.id
    d['title'] = submission.title
    d['num_comments'] = submission.num_comments
    d['score'] = submission.score
    d['upvote_ratio'] = submission.upvote_ratio
    d['date'] = datetime.utcfromtimestamp(submission.created_utc)
    d['domain'] = submission.domain
    d['gilded'] = submission.gilded
    d['num_crossposts'] = submission.num_crossposts
    d['nsfw'] = submission.over_18
    if submission.author is not None:
        d['author'] = submission.author.name
    submission_stats.append(d)

We can turn `submission_stats` into a pandas DataFrame.

In [None]:
top25_df = pd.DataFrame(submission_stats)
top25_df.head()

Plot out the relationship between score and number of comments.

In [None]:
ax = top25_df.plot.scatter(x='score',y='num_comments',s=50,c='k',alpha=.5)
# ax.set_xlim((0,200000))
# ax.set_ylim((0,16000))

### Comments

This is a simple Reddit submission: [What is a dataset that you can't believe is available to the public?](https://www.reddit.com/r/datasets/comments/akb4mr/what_is_a_dataset_that_you_cant_believe_is/). We can inspect the comments in this simple submission.

[Documentation for Comment model in PRAW](https://praw.readthedocs.io/en/latest/code_overview/models/comment.html).

In [None]:
cant_believe = r.submission(id='akb4mr')

print("This submission was made on {0}.".format(datetime.utcfromtimestamp(cant_believe.created_utc)))
print("There are {0:,} comments.".format(cant_believe.num_comments))

We can inspect these comments, working from the [Comment Extraction and Parsing](https://praw.readthedocs.io/en/latest/tutorials/comments.html) tutorial in PRAW.

In [None]:
cant_believe.comments.replace_more(limit=None)

for comment in cant_believe.comments.list():
    print(comment.body)
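
Comments on Reddit form a tree: each comment can have replies, which can have their own replies. The PRAW tutorial describes flattening this tree with a breadth-first traversal, and the sketch below shows that idea on a toy thread. The nested dictionaries and `flatten_breadth_first` are illustrative assumptions, not PRAW's data structures or implementation.

```python
from collections import deque

# A made-up thread: two top-level comments, one with nested replies
toy_thread = [
    {'body': 'A', 'replies': [
        {'body': 'A1', 'replies': [{'body': 'A1a', 'replies': []}]},
        {'body': 'A2', 'replies': []},
    ]},
    {'body': 'B', 'replies': []},
]

def flatten_breadth_first(top_level):
    """Visit every comment level by level, recording its depth in the tree."""
    queue = deque((comment, 0) for comment in top_level)
    flat = []
    while queue:
        comment, depth = queue.popleft()
        flat.append((comment['body'], depth))
        # Replies go to the back of the queue, so shallower comments come first
        queue.extend((reply, depth + 1) for reply in comment['replies'])
    return flat

flat_comments = flatten_breadth_first(toy_thread)
print(flat_comments)
```

The depth recorded here plays the same role as the `depth` attribute on comments used later in this notebook.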

Each comment has a lot of metadata we can preserve.

In [None]:
cant_believe_comment_metadata = []

for comment in cant_believe.comments.list():
    if not comment.collapsed: # Skip collapsed/deleted comments
        d = {}
        d['id'] = comment.id
        d['parent_id'] = comment.parent_id
        d['body'] = comment.body
        d['depth'] = comment.depth
        d['edited'] = comment.edited
        d['score'] = comment.score
        d['date'] = datetime.utcfromtimestamp(comment.created_utc)
        d['submission_id'] = comment.submission.id
        d['submission_title'] = comment.submission.title
        d['subreddit'] = comment.subreddit.display_name
        if comment.author is not None:
            d['author'] = comment.author.name
        cant_believe_comment_metadata.append(d)
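
The `parent_id` values carry a type prefix: `t1_` marks a comment and `t3_` marks a submission, so a comment whose parent starts with `t3_` is a top-level reply. A small sketch for splitting these identifiers follows; `parent_kind` is a hypothetical helper and the comment ID in the example is made up.

```python
def parent_kind(parent_id):
    """Split a Reddit fullname into its kind and its bare ID."""
    prefix, _, bare_id = parent_id.partition('_')
    # Only the two prefixes relevant to comment trees are mapped here
    kinds = {'t1': 'comment', 't3': 'submission'}
    return kinds.get(prefix, 'unknown'), bare_id

# The submission used in this section, and a made-up comment ID
top_level_parent = parent_kind('t3_akb4mr')
reply_parent = parent_kind('t1_ef3abcd')

print(top_level_parent)
print(reply_parent)
```

Stripping the prefix makes it possible to match a comment's parent against the `id` column when rebuilding the reply structure.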

Convert to a DataFrame.

In [None]:
cant_believe_df = pd.DataFrame(cant_believe_comment_metadata)

# How long is the comment
cant_believe_df['comment_length'] = cant_believe_df['body'].str.len()

cant_believe_df.head()

Do comments deeper in this comment tree have lower scores?

In [None]:
sb.catplot(x='depth',y='score',data=cant_believe_df,kind='bar',color='lightblue')

Do comments deeper in this comment tree have shorter lengths?

In [None]:
sb.catplot(x='depth',y='comment_length',data=cant_believe_df,kind='bar',color='lightblue')

### Redditors

A Redditor is a user and we can get meta-data about the account as well as the history of the user's comments and submissions from the API.

[Documentation for the Redditor model in PRAW](https://praw.readthedocs.io/en/latest/code_overview/models/redditor.html).

How much link and comment karma does this user have?

In [None]:
spez = r.redditor('spez')
print("Link karma: {0:,}".format(spez.link_karma))
print("Comment karma: {0:,}".format(spez.comment_karma))

Interestingly, Reddit flags the users who are employees of Reddit as well as if accounts have verified email addresses.

In [None]:
spez.is_employee

In [None]:
spez.has_verified_email

We can also get the time this user's account was created.

In [None]:
datetime.utcfromtimestamp(spez.created_utc)

We can also get information about individual redditors' submissions and comment histories. Here we will use u/spez (the CEO of Reddit), get his top-voted submissions, and loop through them to get the data for each submission.

In [None]:
spez_submissions = []

for submission in r.redditor('spez').submissions.top('all',limit=25):
    d = {}
    d['id'] = submission.id
    d['title'] = submission.title
    d['num_comments'] = submission.num_comments
    d['score'] = submission.score
    d['upvote_ratio'] = submission.upvote_ratio
    d['date'] = datetime.utcfromtimestamp(submission.created_utc)
    d['domain'] = submission.domain
    d['gilded'] = submission.gilded
    d['num_crossposts'] = submission.num_crossposts
    d['nsfw'] = submission.over_18
    if submission.author is not None:
        d['author'] = submission.author.name
    spez_submissions.append(d)

Again we can turn this list of dictionaries into a DataFrame to do substantive data analysis.

In [None]:
pd.DataFrame(spez_submissions).head()

We can also get the top comments made by a redditor.

In [None]:
spez_comments = []

for comment in r.redditor('spez').comments.top('all',limit=25):
    d = {}
    d['id'] = comment.id
    d['body'] = comment.body
    try:
        d['depth'] = comment.depth
    except:
        d['depth'] = np.nan
    d['edited'] = comment.edited
    d['score'] = comment.score
    d['date'] = datetime.utcfromtimestamp(comment.created_utc)
    d['submission_id'] = comment.submission.id
    d['submission_title'] = comment.submission.title
    d['subreddit'] = comment.subreddit.display_name
    if comment.author is not None:
        d['author'] = comment.author.name
    spez_comments.append(d)

In [None]:
pd.DataFrame(spez_comments).head()

This user's top comments are mostly focused in the /r/announcements subreddit.

In [None]:
pd.DataFrame(spez_comments)['subreddit'].value_counts()