# Scraping Reddit Data  

<table align="left"><td>
  <a target="_blank"  href="https://colab.research.google.com/github/TannerGilbert/Tutorials/blob/master/Reddit%20Webscraping%20using%20PRAW/Reddit%20API.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab
  </a>
</td><td>
  <a target="_blank"  href="https://github.com/TannerGilbert/Tutorials/blob/master/Reddit%20Webscraping%20using%20PRAW/Reddit%20API.ipynb">
    <img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
</td></table>

![](https://www.redditstatic.com/new-icon.png)  
PRAW (the Python Reddit API Wrapper) makes it easy for anyone to scrape data from Reddit or even build a Reddit bot.

In [None]:
!pip install praw

In [1]:
import praw

Before we can scrape any data, we need to authenticate. To do so, we create a ``Reddit`` instance and pass it a ``client_id``, a ``client_secret`` and a ``user_agent``. To create a Reddit application and get your id and secret, navigate to [this page](https://www.reddit.com/prefs/apps).

In [2]:
reddit = praw.Reddit(client_id='my_client_id',
                     client_secret='my_client_secret',
                     user_agent='my_user_agent')
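
Hard-coding the secret in a notebook makes it easy to leak by accident. As a minimal sketch, the three values could instead be read from environment variables. The variable names ``REDDIT_CLIENT_ID``, ``REDDIT_CLIENT_SECRET`` and ``REDDIT_USER_AGENT`` and the helper ``reddit_credentials`` are assumptions made for this example; they are not defined by PRAW.

```python
import os

# Hypothetical variable names, chosen for this example only.
ENV_NAMES = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def reddit_credentials(env=None):
    """Return the keyword arguments for praw.Reddit, read from environment variables."""
    env = os.environ if env is None else env
    # Fail early with a clear message if any variable is unset or empty.
    missing = [var for var in ENV_NAMES.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in ENV_NAMES.items()}


# With praw imported and the variables set:
# reddit = praw.Reddit(**reddit_credentials())
```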

We can get information or posts from a specific subreddit by calling ``reddit.subreddit`` with the subreddit name.

In [3]:
# get 10 hot posts from the MachineLearning subreddit
hot_posts = reddit.subreddit('MachineLearning').hot(limit=10)

``hot`` returns a lazy generator, so the 10 posts are only fetched once we loop over it. Let's loop through them and print some information.

In [4]:
for post in hot_posts:
    print(post.title)

[D] What is the best ML paper you read in 2018 and why?
[D] Machine Learning - WAYR (What Are You Reading) - Week 53
[R] A Geometric Theory of Higher-Order Automatic Differentiation
UC Berkeley and Berkeley AI Research published all materials of CS 188: Introduction to Artificial Intelligence, Fall 2018
[Research] Accurate, Data-Efficient, Unconstrained Text Recognition with Convolutional Neural Networks
[Project] New Tsetlin Machine implementation with 3.5x faster learning, 8x faster pattern recognition, using 10x less memory, including MNIST demo
[D] My agent found some sort of exploit in the Atari 2600 Pong game?
[discussion] Object detection in video - using temporal information?
[P] TMTrackNN — generating TrackMania tracks with neural networks
[R] Sim2Real – Using Simulation to Train Real-Life Grasping Robots
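
Besides ``hot``, a subreddit also exposes the listings ``new``, ``top``, ``rising`` and ``controversial``. The helper below is a sketch of how the sort order could be chosen by name. ``fetch_titles`` is a hypothetical function written for this example, and it returns a list so the titles can be reused after the generator is exhausted.

```python
SORTS = ("hot", "new", "top", "rising", "controversial")


def fetch_titles(subreddit, sort="hot", limit=10):
    """Return the titles of up to `limit` posts from the chosen listing."""
    if sort not in SORTS:
        raise ValueError(f"unknown sort: {sort!r}")
    # Look up the listing method by name, e.g. subreddit.new(limit=10).
    listing = getattr(subreddit, sort)(limit=limit)
    # Materialize the lazy generator so the result can be iterated repeatedly.
    return [post.title for post in listing]


# With an authenticated `reddit` instance:
# titles = fetch_titles(reddit.subreddit('MachineLearning'), sort='new', limit=5)
```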


In [5]:
# get hot posts from all subreddits
hot_posts = reddit.subreddit('all').hot(limit=10)
for post in hot_posts:
    print(post.title)

I've been lying to my wife about film plots for years.
I don’t care if this gets downvoted into oblivion! I DID IT REDDIT!!
I’ve had enough of your shit, Karen
Stranger Things 3: Coming July 4th, 2019
TIL that Game of Thrones actor Peter Dinklage and his wife, Erica Schmidt, have never revealed the name of their daughter, born in 2011. Dinklage and Schmidt had a second child in 2017. The couple never revealed the name or sex of this second child.
Sex abuse victims can finally sue churches in New South Wales as 'Ellis defence' abolished - Previously churches were protected from being sued by a legal precedent which said they did not legally exist
Vadim Afanasev at the 2018 Men's Tumbling World Championships
Perfect catch and camera look
Icing a Mario cookie.
Happy new year


In [6]:
# get MachineLearning subreddit data
ml_subreddit = reddit.subreddit('MachineLearning')

print(ml_subreddit.description)

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles lesson plans on popular machine learning topics.

For Beginner questions please try /r/LearnMachineLearning , /r/MLQuestions or http://stackoverflow.com/

For career related questions, visit /r/csca

Because the Reddit API limits how many requests we can make, it is a good idea to save the scraped data in a variable or file.

In [7]:
import pandas as pd

posts = []
ml_subreddit = reddit.subreddit('MachineLearning')
for post in ml_subreddit.hot(limit=10):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
posts

|   | title | score | id | subreddit | url | num_comments | body | created |
|---|---|---|---|---|---|---|---|---|
| 0 | [D] What is the best ML paper you read in 2018... | 421 | a6cbzm | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 54 | Enjoyed this thread last year, so I am making ... | 1544877000.0 |
| 1 | [D] Machine Learning - WAYR (What Are You Read... | 25 | a8yaro | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 8 | This is a place to share machine learning rese... | 1545628000.0 |
| 2 | [R] A Geometric Theory of Higher-Order Automat... | 49 | abesyt | MachineLearning | https://arxiv.org/abs/1812.11592 | 3 |  | 1546345000.0 |
| 3 | UC Berkeley and Berkeley AI Research published... | 867 | ab4207 | MachineLearning | https://inst.eecs.berkeley.edu/~cs188/fa18/ | 62 |  | 1546262000.0 |
| 4 | [Research] Accurate, Data-Efficient, Unconstra... | 10 | aber0d | MachineLearning | https://arxiv.org/abs/1812.11894 | 1 |  | 1546344000.0 |
| 5 | [Project] New Tsetlin Machine implementation w... | 30 | ab8na3 | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 0 | Finished today. First run through shows approx... | 1546303000.0 |
| 6 | [D] My agent found some sort of exploit in the... | 14 | ab9llv | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 7 | Hello,\n\nHere is a video of my AI playing a p... | 1546309000.0 |
| 7 | [discussion] Object detection in video - using... | 3 | abayg9 | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 5 | When applying object detection to video, I oft... | 1546317000.0 |
| 8 | [P] TMTrackNN — generating TrackMania tracks w... | 58 | ab2kd3 | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 14 | Hello! First time posting here, I wanted to s... | 1546251000.0 |
| 9 | [R] Sim2Real – Using Simulation to Train Real-... | 76 | ab0hiv | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 3 | Hey, I published a summary of RCAN, a new robo... | 1546237000.0 |
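
The ``created`` column holds Unix timestamps (seconds since 1970), which are hard to read. As a sketch, each post could be flattened into a dict with a proper UTC datetime. ``post_to_row`` is a hypothetical helper, and it reads the submission's ``created_utc`` attribute rather than the ``created`` attribute used above, which is an assumption of this example.

```python
from datetime import datetime, timezone


def post_to_row(post):
    """Flatten one submission into a plain dict with a readable UTC timestamp."""
    return {
        "title": post.title,
        "score": post.score,
        "id": post.id,
        # str() turns the Subreddit object into its display name.
        "subreddit": str(post.subreddit),
        "num_comments": post.num_comments,
        # created_utc is a Unix timestamp in seconds.
        "created": datetime.fromtimestamp(post.created_utc, tz=timezone.utc),
    }


# With an authenticated `reddit` instance and pandas imported as pd:
# posts = pd.DataFrame([post_to_row(p) for p in ml_subreddit.hot(limit=10)])
```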


In [8]:
posts.to_csv('top_ml_subreddit_posts.csv')
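
To actually save requests, the saved file has to be reused on later runs. One possible pattern, sketched here with the standard ``csv`` module, is to read the file if it exists and only hit the API otherwise. ``load_or_fetch`` and its parameters are hypothetical names introduced for this example.

```python
import csv
import os


def load_or_fetch(path, fetch_rows, fieldnames):
    """Return cached rows from `path` if it exists; otherwise fetch and save them."""
    if os.path.exists(path):
        # Cache hit: no API request is made. All values come back as strings.
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    # Cache miss: call the (possibly expensive) fetch function once.
    rows = list(fetch_rows())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Here ``fetch_rows`` would be a function that queries Reddit and returns a list of dicts whose keys match ``fieldnames``.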

PRAW also allows us to get information about a specific post (submission), either by its URL or by its id.

In [9]:
submission = reddit.submission(url="https://www.reddit.com/r/MapPorn/comments/a3p0uq/an_image_of_gps_tracking_of_multiple_wolves_in/")
# or 
submission = reddit.submission(id="a3p0uq") #id comes after comments/

In [10]:
for top_level_comment in submission.comments:
    print(top_level_comment.body)

Source: [https://www.facebook.com/VoyageursWolfProject/](https://www.facebook.com/VoyageursWolfProject/)
I thought this was a shit post made in paint before I read the title
Wow, that’s very cool.  To think how keen their senses must be to recognize and avoid each other and their territories.  Plus, I like to think that there’s one from the white colored clan who just goes way into the other territories because, well, he’s a badass.
That’s really cool. The edges are surprisingly defined.
White wolf is a dick constantly trespassing other's territories.
[Link to Story](https://www.duluthnewstribune.com/news/science-and-nature/4538836-voyageurs-national-park-wolves-eating-beaver-and-blueberries-not) 
Cool to imagine that there are similar zones surrounding all these, we just didn't tag those wolves. 
You know the white wolf fucked some red's bitch for sure. 
It’s wild how they are all roughly the same size. 
This what i am gonna show people when they ask for a photo of a sixpack
That's ac

AttributeError: 'MoreComments' object has no attribute 'body'

This works for some submissions, but for submissions with many comments the code raises an AttributeError:

``AttributeError: 'MoreComments' object has no attribute 'body'``

These ``MoreComments`` objects represent the “load more comments” and “continue this thread” links encountered on the website, as described in more detail in the comment documentation.

To skip the ``MoreComments`` objects, we can check the type of each comment before printing its body.

In [11]:
from praw.models import MoreComments
for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    print(top_level_comment.body)

Source: [https://www.facebook.com/VoyageursWolfProject/](https://www.facebook.com/VoyageursWolfProject/)
I thought this was a shit post made in paint before I read the title
Wow, that’s very cool.  To think how keen their senses must be to recognize and avoid each other and their territories.  Plus, I like to think that there’s one from the white colored clan who just goes way into the other territories because, well, he’s a badass.
That’s really cool. The edges are surprisingly defined.
White wolf is a dick constantly trespassing other's territories.
[Link to Story](https://www.duluthnewstribune.com/news/science-and-nature/4538836-voyageurs-national-park-wolves-eating-beaver-and-blueberries-not) 
Cool to imagine that there are similar zones surrounding all these, we just didn't tag those wolves. 
You know the white wolf fucked some red's bitch for sure. 
It’s wild how they are all roughly the same size. 
This what i am gonna show people when they ask for a photo of a sixpack
That's ac

Alternatively, calling ``replace_more(limit=0)`` removes all ``MoreComments`` objects from the comment forest, so no type check is needed:

In [12]:
submission.comments.replace_more(limit=0)
for top_level_comment in submission.comments:
    print(top_level_comment.body)

Source: [https://www.facebook.com/VoyageursWolfProject/](https://www.facebook.com/VoyageursWolfProject/)
I thought this was a shit post made in paint before I read the title
Wow, that’s very cool.  To think how keen their senses must be to recognize and avoid each other and their territories.  Plus, I like to think that there’s one from the white colored clan who just goes way into the other territories because, well, he’s a badass.
That’s really cool. The edges are surprisingly defined.
White wolf is a dick constantly trespassing other's territories.
[Link to Story](https://www.duluthnewstribune.com/news/science-and-nature/4538836-voyageurs-national-park-wolves-eating-beaver-and-blueberries-not) 
Cool to imagine that there are similar zones surrounding all these, we just didn't tag those wolves. 
You know the white wolf fucked some red's bitch for sure. 
It’s wild how they are all roughly the same size. 
This what i am gonna show people when they ask for a photo of a sixpack
That's ac

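The code blocks above only print the top-level comments. To iterate over every comment in the ``CommentForest``, including replies, we first call ``replace_more(limit=None)`` to resolve all ``MoreComments`` objects (each replacement costs a request) and then flatten the forest with the ``.list`` method.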

In [13]:
submission.comments.replace_more(limit=None)
for comment in submission.comments.list():
    print(comment.body)

Source: [https://www.facebook.com/VoyageursWolfProject/](https://www.facebook.com/VoyageursWolfProject/)
I thought this was a shit post made in paint before I read the title
Wow, that’s very cool.  To think how keen their senses must be to recognize and avoid each other and their territories.  Plus, I like to think that there’s one from the white colored clan who just goes way into the other territories because, well, he’s a badass.
That’s really cool. The edges are surprisingly defined.
White wolf is a dick constantly trespassing other's territories.
[Link to Story](https://www.duluthnewstribune.com/news/science-and-nature/4538836-voyageurs-national-park-wolves-eating-beaver-and-blueberries-not) 
Cool to imagine that there are similar zones surrounding all these, we just didn't tag those wolves. 
You know the white wolf fucked some red's bitch for sure. 
It’s wild how they are all roughly the same size. 
This what i am gonna show people when they ask for a photo of a sixpack
That's ac
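
The flat list from ``.list`` no longer shows which comment replies to which. If the nesting matters, the forest can be walked recursively through each comment's ``replies`` attribute. The generator below is a minimal sketch; ``walk_comments`` is a hypothetical helper, and it skips any object without a ``body`` (such as a leftover ``MoreComments``).

```python
def walk_comments(comments, depth=0):
    """Yield (depth, body) for every comment, descending into replies depth-first."""
    for comment in comments:
        body = getattr(comment, "body", None)
        if body is None:
            continue  # placeholders such as MoreComments have no body
        yield depth, body
        # Recurse into the replies, one level deeper.
        yield from walk_comments(getattr(comment, "replies", []), depth + 1)


# With a loaded submission:
# for depth, body in walk_comments(submission.comments):
#     print("    " * depth + body)
```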