<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 3.2.3 
# *Mining Social Media on Reddit*

## The Reddit API and the PRAW Package

The Reddit API is rich and complex, with many endpoints (https://www.reddit.com/dev/api/). It includes methods for navigating its collections, which include various kinds of media as well as comments. Fortunately, the Python library PRAW reduces much of this complexity.

Reddit requires developers to create and authenticate an app before they can use the API, but the process is much less onerous than on some other platforms, and there is no waiting period for approval of new developers (as of 18 August 2018).

### 1. Create a Reddit App

Go to https://www.reddit.com/prefs/apps and click "create an app".

Enter the following in the form:

- a name for your app
- select "script" radio button
- a description
- a redirect URI

(N.B. For pulling data into a data science experiment, a local port can be used for the redirect URI; try http://127.0.0.1:1410)

- click "create app"
- from the form that displays, copy the following to a local text file (or to this notebook):

  - name (the name you gave to your app)
  - redirect URI
  - personal use script (this is your OAuth 2 Client ID)
  - secret (this is your OAuth 2 Secret)

### 2. Register for API Access

- follow the link at https://www.reddit.com/wiki/api and read the terms of use for Reddit API access 
- fill in the form fields at the bottom 
  - make sure to enter your new OAuth Client ID where indicated
  - your use case could be something like "Training in API usage for data science projects"
  - your platform could be something like "Jupyter Notebooks / Python"
  
- click "SUBMIT"
 
- when asked for User-Agent, enter something that fits this pattern:
  `your_os-python:your_reddit_appname:v1.0 (by /u/your_reddit_username)`

### 3. Load Python Libraries

In [1]:
import praw
import requests
import json
import pprint
from datetime import datetime, date, time

### 4. Authenticate from your Python script

You could assign your authentication details explicitly, as follows:

In [2]:
my_user_agent = 'macOS-python:matsalleh2020:v1.0 (by /u/matsalleh2020)'   # your user Agent string goes in here
my_client_id = ' '   # your Client ID string goes in here
my_client_secret = ' '   # your Secret string goes in here

A better way would be to store these details externally, so they are not displayed in the notebook:

- create a file called "auth_reddit.json" in your "notebooks" directory, and save your credentials there in JSON format:

`{   "my_client_id": "your Client ID string goes in here",` <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;` "my_client_secret": "your Secret string goes in here",` <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`"my_user_agent": "your user Agent string goes in here"` <br>
`}`

Use the following code to load the credentials:  

In [3]:
pwd()  # make sure your working directory is where the file is

'/Users/gregory_murray/Documents/Magic Briefcase/Data Science/DG-SG-FT-16Apr21/Module 3/Answers'

In [4]:
path_auth = 'auth_reddit.json'
with open(path_auth) as f:
    auth = json.load(f)

# For debugging only:
# pp = pprint.PrettyPrinter(indent=4)
# pp.pprint(auth)

my_user_agent = auth['my_user_agent']
my_client_id = auth['my_client_id']
my_client_secret = auth['my_client_secret']
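
A misspelled or missing key in the JSON file only shows up as a `KeyError` when the value is first used. One possible refinement is to check all the expected keys when the file is loaded. This is a sketch under assumptions: `load_credentials` and `REQUIRED_KEYS` are hypothetical names, and the key names follow the file layout shown above.

```python
import json


def load_credentials(path, required_keys):
    """Load a JSON credentials file and check that every required key is present."""
    with open(path) as f:
        auth = json.load(f)
    missing = [key for key in required_keys if key not in auth]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return auth


# Key names as used in auth_reddit.json above.
REQUIRED_KEYS = ("my_client_id", "my_client_secret", "my_user_agent")

# Usage (requires the file to exist in the working directory):
# auth = load_credentials("auth_reddit.json", REQUIRED_KEYS)
```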

Security considerations: 
- this method only keeps your credentials invisible as long as nobody else gets access to this notebook file 
- if you wanted another user to have access to the executable notebook without divulging your credentials you should set up an OAuth 2.0 workflow to let them obtain and apply their own API tokens when using your app
- if you just want to share your analyses, you could use a separate script (which you don't share) to fetch the data and save it locally, then use a second notebook (with no API access) to load and analyse the locally stored data
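
The last suggestion above (fetch in a private script, analyse in a shared notebook) could look roughly like the following. It is only a sketch: `save_records`, `load_records`, the file name, and the sample record are all hypothetical, and the real fetch script would build the records from API results.

```python
import json


def save_records(records, path):
    """Write a list of plain dicts to a local JSON file (run in the private fetch script)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def load_records(path):
    """Read the records back (run in the shared analysis notebook; no API access needed)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Stand-in data for illustration only.
sample_records = [{"title": "Example title", "author": "example_user", "score": 1}]

# Usage:
# save_records(sample_records, "submissions.json")
# records = load_records("submissions.json")
```

Only plain values (strings, numbers) should be stored in the records, since API objects are generally not JSON-serialisable as they are.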

### 5. Exploring the API

Here is how to connect to Reddit with read-only access:

In [5]:
reddit = praw.Reddit(client_id = my_client_id, 
                     client_secret = my_client_secret, 
                     user_agent = my_user_agent)

print('Read-only = ' + str(reddit.read_only))  # Output: True

Read-only = True


Version 7.1.0 of praw is outdated. Version 7.2.0 was released Wednesday February 24, 2021.


In the next cell, put the cursor after the '.' and hit the [tab] key to see the available members and methods in the response object:

Consult the PRAW and Reddit API documentation. Print a few of the response members below:

In [None]:
# reddit.auth
# reddit.comment
# reddit.config
# reddit.delete
# reddit.domain
# reddit.front
# reddit.get
# reddit.inbox
# reddit.live
# reddit.multireddit
# reddit.patch
# reddit.post
# reddit.put
# reddit.random_subreddit
# reddit.read_only
# reddit.redditor
# reddit.redditors
# reddit.submission
# reddit.subreddit
# reddit.subreddits
# reddit.update_checked
# reddit.user
# reddit.validate_on_submit



Content in Reddit is grouped by topics called "subreddits". Content, called "submissions", is fetched by calling the `subreddit` method of the connection object (which is our `reddit` variable) with an argument that matches an actual topic. 

We also need to append a further method call to a "subinstance", such as one of the following:

- controversial
- gilded
- hot
- new
- rising
- top
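
Because each name in the list above is called as a method on the subreddit object, the choice of listing can be made a parameter. This is a minimal sketch, assuming a hypothetical helper `fetch_titles`. It relies only on the object having a method with the given name that accepts `limit` and yields objects with a `title` attribute.

```python
LISTINGS = ("controversial", "gilded", "hot", "new", "rising", "top")


def fetch_titles(subreddit, listing="hot", limit=10):
    """Return the titles from one listing of a subreddit-like object."""
    if listing not in LISTINGS:
        raise ValueError(f"Unknown listing: {listing!r}")
    listing_method = getattr(subreddit, listing)  # e.g. subreddit.hot
    return [submission.title for submission in listing_method(limit=limit)]


# Usage with the connection object created earlier:
# titles = fetch_titles(reddit.subreddit('learnpython'), 'new', limit=5)
```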

One of the submission object's members is `title`. Fetch and print 10 submission titles from the 'learnpython' subreddit using one of the subinstances above:

In [9]:
for submission in reddit.subreddit('learnpython').hot(limit=10):
    print(submission.title)

Ask Anything Monday - Weekly Thread
Sharing my win.
Python 101: 2nd Edition is FREE for PyCon 2021!
Thanks for the help
Flask Showing Wrong URL
Scraping Amazon with requests + BeautifulSoup
Why is Python solving this equation incorrectly?
How long does it take to learn Python?
how to check for cropped copies of images with python
Code works as expected in VSCode debugger, but terminal closes socket early


Now retrieve 10 authors:

In [10]:
for submission in reddit.subreddit('learnpython').hot(limit=10):
    print(submission.author)

AutoModerator
Brothercford
driscollis
DanDannymite_27
beefyliltank
Nicolozz0
TheRealTengri
Ascetic_Banana-20
daddygoose04
CharmingMidnight8191


In [None]:
submission.

Note that we obtained the titles and authors from separate API calls. Can we expect these to correspond to the same submissions? If not, how could we guarantee that they do?

In [11]:
#ANSWER:
submissions = reddit.subreddit('learnpython').hot(limit=10)
for submission in submissions:
    print("Author: {} | Title: {}".format(submission.author, submission.title))

Author: AutoModerator | Title: Ask Anything Monday - Weekly Thread
Author: Brothercford | Title: Sharing my win.
Author: driscollis | Title: Python 101: 2nd Edition is FREE for PyCon 2021!
Author: DanDannymite_27 | Title: Thanks for the help
Author: beefyliltank | Title: Flask Showing Wrong URL
Author: Nicolozz0 | Title: Scraping Amazon with requests + BeautifulSoup
Author: a0311tr | Title: Writing Tests
Author: ThePerceptionist | Title: Github Pages site that updates daily with content from Python script
Author: johannadambergk | Title: Extracting numbers
Author: TheRealTengri | Title: Why is Python solving this equation incorrectly?


In [15]:
# BONUS : Various features can be extracted to a DataFrame

import pandas as pd

posts = []
ml_subreddit = reddit.subreddit('MachineLearning')
for post in ml_subreddit.hot(limit=10):
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])


In [18]:
posts.head(10)

| | title | score | id | subreddit | url | num_comments | body | created |
|---|---|---|---|---|---|---|---|---|
| 0 | [D] Machine Learning - WAYR (What Are You Read... | 14 | n8m6ds | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 1 | This is a place to share machine learning rese... | 1620619000.0 |
| 1 | [R] Enhancing Photorealism Enhancement | 34 | nbyrcj | MachineLearning | https://arxiv.org/abs/2105.04619 | 7 | | 1620990000.0 |
| 2 | [D] Are ResNets as good as it gets? | 193 | nbgb6a | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 36 | TLDR: For training from scratch on non-classi... | 1620940000.0 |
| 3 | [R] Unsupervised Progressive Learning and the ... | 6 | nbzv92 | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 0 | Check out our recent work, which was accepted ... | 1620994000.0 |
| 4 | [D] Disentangling Medical Image features using... | 13 | nbsp3t | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 1 | Recently I found out about NFs and their prope... | 1620972000.0 |
| 5 | [D] GAN training - aren't the double discrimin... | 39 | nbk34d | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 11 | I've noticed that a lot of GAN codebases runs ... | 1620951000.0 |
| 6 | [P] Twitter bot that tweets trending ML papers | 45 | nbioy3 | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 7 | Hey everyone!\n\nI created a twitter bot that ... | 1620947000.0 |
| 7 | [D] Why is the baseline for fairness and bias ... | 40 | nbisqb | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 39 | When we talk about ML models' accuracy and oth... | 1620947000.0 |
| 8 | [P] Enigma: GPT-2 trained on 10K Nature Papers... | 160 | nb9ifz | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 31 | Project Enigma: https://stefanzukin.com/enigma... | 1620913000.0 |
| 9 | [P] Machine Learning in Physics? | 57 | nbdoc6 | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | 19 | Hey everyone!\n\nI've been doing machine learn... | 1620931000.0 |
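
The `created` column holds Unix timestamps (seconds since 1 January 1970), which are hard to read as they stand. Below is a minimal sketch of converting one to a date with the `datetime` module. `epoch_to_utc` is a hypothetical helper name, and treating the value as UTC is an assumption (PRAW submissions also expose a `created_utc` attribute).

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Value from the first row of the table (as displayed, so possibly rounded).
created = 1620619000.0
print(epoch_to_utc(created).isoformat())
```

For a whole column, `pd.to_datetime(posts['created'], unit='s')` should give the equivalent conversion in pandas.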


In [19]:
# General information about the subreddit can be obtained using the .description attribute
# of the subreddit object

In [21]:
print(ml_subreddit.description)

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

Why doesn't the next cell produce output?

In [22]:
for submission in submissions:
    print(submission.comments)

In [23]:
print(type(submissions))

<class 'praw.models.listing.generator.ListingGenerator'>


In [24]:
#ANSWER:
# submissions is a ListingGenerator (a lazy iterator), not a stored data structure:
submissions
# It was exhausted by the earlier loop, so it must be re-created before it can be iterated again.

<praw.models.listing.generator.ListingGenerator at 0x7f7f18a99cd0>
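
The same behaviour can be reproduced with an ordinary Python generator, with no API involved. In this small illustration, `numbered_titles` is a made-up stand-in for a lazy listing:

```python
def numbered_titles(n):
    """Stand-in for a lazy listing: yields items one at a time."""
    for i in range(n):
        yield f"title {i}"


listing = numbered_titles(3)
first_pass = list(listing)   # consumes the generator
second_pass = list(listing)  # nothing is left to yield
print(first_pass)   # ['title 0', 'title 1', 'title 2']
print(second_pass)  # []

# If the items are needed more than once, materialise them into a list first.
cached = list(numbered_titles(3))
```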

Print two comments associated with each of these submissions:

In [25]:
submissions = reddit.subreddit('learnpython').hot(limit=10)
for submission in submissions:
    # Drop "load more comments" placeholders, which have no body attribute
    submission.comments.replace_more(limit=0)
    first_two_comments = submission.comments.list()[:2]
    for comment in first_two_comments:
        print(comment.body)

How long did it take you to ACTUALLY understand python?
Where can I find source code for the built-in function `pow()` ?

I want to find out why `pow(a, b, 10)` is much faster than `a**b % 10` when you input arbitrarily huge numbers.
Awesome job OP. Your job is very much like mine and yes chunk size is a godsend. 

I wrote a script that read data from a marketing platform API and scanned hundreds of emails for certain terms as we were sun setting some things and looking to consolidate other stuff. Saved the team hundreds of hours of mindless reading to identify terms. LOL

GREAT job dude/dudette!
I get the feeling this surpasses r/learnpython but I get that from a lot of other posts too.
I'll give this book a read when I can. Thank you!
I haven't read it yet, but it looks really nice. Thank you very much, for giving this away for free.
one way is, instead of using `flask run` use `python app.py` and then your `app.run` settings will be used. 

another way is, `flask run --ip=0.0.0.0 --
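
Iterating over `submission.comments` directly yields only the top-level comments, whereas `submission.comments.list()` returns the whole reply tree as one flat list. The toy sketch below illustrates that difference with a hypothetical `ToyComment` class rather than PRAW objects. The breadth-first order is chosen for illustration, and the code is not PRAW's implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyComment:
    """Hypothetical stand-in for a comment with nested replies."""
    body: str
    replies: list = field(default_factory=list)


def flatten(top_level):
    """Flatten a reply tree into a single list, breadth-first."""
    flat = []
    queue = deque(top_level)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        queue.extend(comment.replies)
    return flat


thread = [
    ToyComment("first top-level", [ToyComment("reply to first")]),
    ToyComment("second top-level"),
]
print([c.body for c in thread])           # top-level comments only
print([c.body for c in flatten(thread)])  # every comment in the tree
```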

Referring to the API documentation, explore the submissions object and print some interesting data:

In [None]:
# submissions = reddit.subreddit('learnpython').banned
# submissions = reddit.subreddit('learnpython').collections
# submissions = reddit.subreddit('learnpython').comments
# submissions = reddit.subreddit('learnpython').contributor
# submissions = reddit.subreddit('learnpython').controversial
# submissions = reddit.subreddit('learnpython').display_name
# submissions = reddit.subreddit('learnpython').emoji
# submissions = reddit.subreddit('learnpython').filters
# submissions = reddit.subreddit('learnpython').flair
# submissions = reddit.subreddit('learnpython').fullname
# submissions = reddit.subreddit('learnpython').gilded
# submissions = reddit.subreddit('learnpython').hot
# submissions = reddit.subreddit('learnpython').message
# submissions = reddit.subreddit('learnpython').MESSAGE_PREFIX
# submissions = reddit.subreddit('learnpython').mod
# submissions = reddit.subreddit('learnpython').moderator
# submissions = reddit.subreddit('learnpython').modmail
# submissions = reddit.subreddit('learnpython').muted
# submissions = reddit.subreddit('learnpython').new
# submissions = reddit.subreddit('learnpython').parse
# submissions = reddit.subreddit('learnpython').post_requirements
# submissions = reddit.subreddit('learnpython').quaran
# submissions = reddit.subreddit('learnpython').random
# submissions = reddit.subreddit('learnpython').random_rising
# submissions = reddit.subreddit('learnpython').rules
# submissions = reddit.subreddit('learnpython').search
# submissions = reddit.subreddit('learnpython').sticky
# submissions = reddit.subreddit('learnpython').stream
# submissions = reddit.subreddit('learnpython').stylesheet
# submissions = reddit.subreddit('learnpython').submit
# submissions = reddit.subreddit('learnpython').submit_image
# submissions = reddit.subreddit('learnpython').submit_poll
# submissions = reddit.subreddit('learnpython').submit_video
# submissions = reddit.subreddit('learnpython').subscribe
# submissions = reddit.subreddit('learnpython').top
# submissions = reddit.subreddit('learnpython').traffic
# submissions = reddit.subreddit('learnpython').unsubscribe
# submissions = reddit.subreddit('learnpython').widgets
# submissions = reddit.subreddit('learnpython').wiki

submissions = reddit.subreddit('learnpython').


#### Posting to Reddit

To be able to post to your Reddit account (i.e. contribute submissions), you need to connect to the API with read/write privilege. This requires an *authorised instance*, which is obtained by including your Reddit user name and password in the connection request: 

In [27]:
reddit = praw.Reddit(client_id='my client id',
                     client_secret='my client secret',
                     user_agent='my user agent',
                     username='my username',
                     password='my password')
print(reddit.read_only)  # Output: False

False


You could hide these last two credentials by adding them to your JSON file and then reading all five values at once.

In [30]:
path_auth = 'auth_reddit.json'
with open(path_auth) as f:
    auth = json.load(f)

In [31]:
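# Assumes 'my_username' and 'my_password' keys (hypothetical names) were added to auth_reddit.json
reddit = praw.Reddit(client_id=auth['my_client_id'],
                     client_secret=auth['my_client_secret'],
                     user_agent=auth['my_user_agent'],
                     username=auth['my_username'],
                     password=auth['my_password'])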
print(reddit.read_only)  # Output: False

False




---






> > > > > > > > > © 2021 Institute of Data

