<img src="https://i.imgur.com/Y6EMKKg.jpg" style="float: left; margin: 15px;" width="75">

# Import Python Libraries Needed

---

In [2]:
import requests
import sys, json
import pandas as pd
import time

# Set Reddit URLs for Scraping

In [3]:
url_t = 'https://www.reddit.com/r/The_Donald.json'

url_r = 'https://www.reddit.com/r/esist.json'

- **Received 429 error for the following code:**

``res_t = requests.get(url_t)
res_r = requests.get(url_r)
res_t.status_code
res_r.status_code``

In [4]:
# Set a custom User-agent header to get around the 429 "Too Many Requests" error
headers = {'User-agent' : 'Jojo_Gun' }

In [5]:
res_t = requests.get(url_t, headers=headers)
res_r = requests.get(url_r, headers=headers)
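
If the 429 error comes back even with a custom User-agent, one option is to retry with a growing pause between attempts. The sketch below is an illustration rather than part of the original notebook: `get_with_backoff`, `max_tries`, and `base_delay` are hypothetical names, and the `get` argument stands in for `requests.get` so the logic can be exercised without network access.

```python
import time


def get_with_backoff(get, url, headers, max_tries=3, base_delay=2.0, sleep=time.sleep):
    """Call get(url, headers=headers), pausing and retrying on HTTP 429.

    `get` is any callable shaped like requests.get (an assumption made so the
    sketch can run without network access).
    """
    res = None
    for attempt in range(max_tries):
        res = get(url, headers=headers)
        if res.status_code != 429:
            return res  # success or a different error: hand it back as-is
        if attempt < max_tries - 1:
            # Rate limited: wait base_delay, then 2x, 4x, ... before retrying
            sleep(base_delay * 2 ** attempt)
    return res  # still 429 after max_tries attempts
```

With the notebook's variables this could be called as `get_with_backoff(requests.get, url_t, headers)`.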

In [10]:
# Confirm that neither subreddit returns an error code
print(res_t.status_code)
res_r.status_code

200


200

# Exploring the Reddit API

In [11]:
json_t = res_t.json()
json_r = res_r.json()

In [12]:
# Confirmed same key structure 
print(json_t.keys())
json_r.keys()

dict_keys(['kind', 'data'])


dict_keys(['kind', 'data'])

In [15]:
print(json_t['kind'])
print(json_r['kind'])

Listing
Listing


- The 'kind' key holds a single value, 'Listing', so we can assume that everything we want will be in the other key, 'data'

In [19]:
# Confirmed that all of the data is indeed in the other key. 
print(json_t['data'].keys())
print(json_r['data'].keys())
print(sorted(json_t['data']))
sorted(json_r['data'])

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])
dict_keys(['modhash', 'dist', 'children', 'after', 'before'])
['after', 'before', 'children', 'dist', 'modhash']


['after', 'before', 'children', 'dist', 'modhash']

- The 'children' key is where we will find our posts. 
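
To make the nesting concrete, here is a small sketch using a hand-made listing with the same shape as the responses above. The sample values and the helper name `post_data` are invented for illustration; only the key names (`kind`, `data`, `children`, `after`, and so on) come from the output shown.

```python
def post_data(listing):
    """Return the inner 'data' dict of every post in a Reddit listing."""
    return [child['data'] for child in listing['data']['children']]


# Made-up listing that mirrors the structure of json_t / json_r
sample_listing = {
    'kind': 'Listing',
    'data': {
        'modhash': '',
        'dist': 2,
        'children': [
            {'kind': 't3', 'data': {'name': 't3_aaaaaa', 'title': 'First post'}},
            {'kind': 't3', 'data': {'name': 't3_bbbbbb', 'title': 'Second post'}},
        ],
        'after': 't3_bbbbbb',
        'before': None,
    },
}

titles = [post['title'] for post in post_data(sample_listing)]
print(titles)
```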

In [12]:
# Display the structure of the last post per subreddit
# (index -1 picks the last post however many posts the page returned)
json_t['data']['children'][-1]['data']
json_r['data']['children'][-1]['data']

{'approved_at_utc': None,
 'subreddit': 'esist',
 'selftext': '',
 'author_fullname': 't2_884o7',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'House panel votes to authorize subpoena for unredacted Mueller report',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/esist',
 'hidden': False,
 'pwls': 1,
 'link_flair_css_class': None,
 'downs': 0,
 'thumbnail_height': 73,
 'hide_score': False,
 'name': 't3_b95ss5',
 'quarantine': False,
 'link_flair_text_color': 'dark',
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 'ups': 14,
 'domain': 'cnbc.com',
 'media_embed': {},
 'thumbnail_width': 140,
 'author_flair_template_id': None,
 'is_original_content': False,
 'user_reports': [],
 'secure_media': None,
 'is_reddit_media_domain': False,
 'is_meta': False,
 'category': None,
 'secure_media_embed': {},
 'link_flair_text': None,
 'can_mod_post': False,
 'score': 14,
 'approved_by': None,
 'thumbnail': 'https://b.thumbs.

- I noticed that there are 27 posts returned by The_Donald and 26 by esist

In [17]:
print(len(json_t['data']['children']))  # 27 posts per page
print(len(json_r['data']['children']))  # 26 posts per page

27
26


In [13]:
# this is showing the data of the first post of the subreddits
json_t['data']['children'][0]['data']
# and/or
json_r['data']['children'][0]['data']

{'approved_at_utc': None,
 'subreddit': 'esist',
 'selftext': '[Click here to go to the more detailed wiki](https://www.reddit.com/r/esist/wiki/index)\n\n[Follow us on Twitter!](https://twitter.com/redditresist)\n\nPlease message the moderators or comment on this thread if you have new links, especially to local resistance groups.\n\n* A Practical Guide for Resisting the Trump Agenda: https://www.indivisibleguide.com/\n\n* Claire McCaskill: Any federal employee who wants to visit, we will listen. We will protect you. whistleblowers@mccaskill.senate.gov\n202-224-2630\n\n* Make 5 calls a day to Congressmen in five minutes: https://5calls.org/\n\n* Sign up for daily text alerts for direct action: https://dailyaction.org/\n\n* This script is for complaining about the selling of public lands, but could be tweaked to be useful elsewhere. https://docs.google.com/document/d/10cmFKG3t30XAGQEHNyaVf3DZpx2cJOClgmKdJHCuIuA/edit\n\n* Here is information on voicing your opposition to the nomination o

# Create Pandas DataFrames

In [14]:
# Each 'children' list is a list of dicts, so it converts directly to a DataFrame
df_t = pd.DataFrame(json_t['data']['children'])
df_r = pd.DataFrame(json_r['data']['children'])

In [16]:
# The_Donald
df_t.head()

| | data | kind |
|---|---|---|
| 0 | {'approved_at_utc': None, 'subreddit': 'The_Do... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'The_Do... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'The_Do... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'The_Do... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'The_Do... | t3 |


In [17]:
# esist
df_r.head()

| | data | kind |
|---|---|---|
| 0 | {'approved_at_utc': None, 'subreddit': 'esist'... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'esist'... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'esist'... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'esist'... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'esist'... | t3 |
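
Both frames have only two columns because each element of 'children' is a dict with just the keys 'kind' and 'data'. One way to get a column per post field is to build the frame from the inner 'data' dicts instead. A minimal sketch, using made-up posts in place of the real response (`df_flat` is a hypothetical name):

```python
import pandas as pd

# Made-up stand-in for json_t['data']['children']
children = [
    {'kind': 't3', 'data': {'name': 't3_aaaaaa', 'title': 'First post', 'score': 14}},
    {'kind': 't3', 'data': {'name': 't3_bbbbbb', 'title': 'Second post', 'score': 3}},
]

# One row per post, one column per key of the inner 'data' dict
df_flat = pd.DataFrame([child['data'] for child in children])
print(df_flat.shape)
print(sorted(df_flat.columns))
```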


# Collecting Multiple Pages of Data from the Reddit API

In [18]:
json_t['data']['after']
# and/or
json_r['data']['after']

't3_b95ss5'

- The 'after' key holds the ID (the 'name' field, e.g. 't3_b95ss5') of the last post on the current page. Passing it back to the API tells Reddit where the next page should start.

In [19]:
t_id = [post['data']['name'] for post in json_t['data']['children']]  
len(t_id)

r_id = [post['data']['name'] for post in json_r['data']['children']]  
len(r_id)

Because 'children' is a list, we can iterate over it directly and pull the 'name' of each post. These IDs will serve as your anchors: they tell the Reddit API where to pick up the next time your Python script requests a page.

In order to get the next page of posts, we need to specify the ID where we left off when we make our next request.
We do this through the params argument of requests.get, passing a dictionary that specifies the location:

In [21]:
# The_Donald
param_t = {'after': 't3_b7n1tt'}
res_t = requests.get(url_t, params=param_t, headers=headers)
res_t.status_code

# esist: 
param_r = {'after': 't3_b7324u'}
res_r = requests.get(url_r, params=param_r, headers=headers)
res_r.status_code
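
requests turns the `params` dictionary into a query string appended to the URL. The sketch below reproduces that step with the standard library so the resulting URL is visible; `page_params` and `page_url` are hypothetical helper names, not part of the notebook.

```python
from urllib.parse import urlencode


def page_params(after=None):
    """Build the params dict: empty for the first page, else {'after': id}."""
    return {} if after is None else {'after': after}


def page_url(base_url, after=None):
    """Show the URL that a GET with these params would request."""
    params = page_params(after)
    return base_url + ('?' + urlencode(params) if params else '')


print(page_url('https://www.reddit.com/r/esist.json'))
print(page_url('https://www.reddit.com/r/esist.json', 't3_b7324u'))
```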

Next, we can set up a for loop to hit the API a given number of times (specified through a range):

In [23]:
posts_t = []
after_t = None

for i in range(39):  # the loop requests up to 39 pages
    if after_t is None:      # first iteration: there is no post ID to resume from yet
        param_t = {}        # so start with an empty parameter dictionary
    else:                  # later iterations: after_t was set at the end of the previous pass
        param_t = {'after': after_t}               # the new parameter dictionary, resuming from the last post ID seen
    url_t = 'https://www.reddit.com/r/The_Donald.json'                   # URL to hit
    res_t = requests.get(url_t, params=param_t, headers=headers)    # request the JSON content from the URL with these params
    if res_t.status_code == 200:                                    # if our request was accepted, run the next code
        json_t = res_t.json()                                       # parse the response body into the json_t variable
        posts_t.extend(json_t['data']['children'])    # .extend() adds each post to the list; .append() would add the whole page as one element
        after_t = json_t['data']['after']     # after_t is reset to the last post ID on this page, ready for the next request
        print(i, after_t)                     # print the iteration number and the post ID
    else:                                     # else: if we get a status code other than 200
        print(res_t.status_code)                # print the status code
        break                                 # then break the loop
    time.sleep(2)                             # wait two seconds before the next iteration of the for loop
    
    
    
df_tpost = pd.DataFrame(posts_t)

df_tpost.to_csv('./data/The_Donald_Scrape.csv')

0 t3_b9c4of
1 t3_b9b4s8
2 t3_b9iydh
3 t3_b9b9mk
4 t3_b9j6rv
5 t3_b9hrxf
6 t3_b9kfth
7 t3_b9b64p
8 t3_b9gwdp
9 t3_b9jr6v
10 t3_b9jbqx
11 t3_b9k79w
12 t3_b9h7oc
13 t3_b9gmwv
14 t3_b94xn8
15 t3_b9fikc
16 t3_b9khd8
17 t3_b9if39
18 t3_b9l09d
19 t3_b9jeuw
20 t3_b9jz7s
21 t3_b9keqh
22 t3_b9b54b
23 t3_b9kipp
24 t3_b9jyjb
25 t3_b9kkyu
26 t3_b9k1xd
27 t3_b9gsi4
28 t3_b9heks
29 t3_b9bjcq
30 t3_b9d5be
31 t3_b9fu62
32 t3_b9j5ka
33 t3_b9j3tq
34 t3_b9d9bq
35 t3_b9f2mv
36 t3_b93zpi
37 t3_b9hxs5
38 t3_b9by3m


In [26]:
df_tpost.shape

(977, 2)

The same loop, pointed at esist. Note that the last 'after' value printed is None, which means the API has no further pages to return:

In [27]:
posts_r = []
after_r = None

for i in range(30):
    if after_r is None:
        param_r = {}
    else:
        param_r = {'after': after_r}
    url_r = 'https://www.reddit.com/r/esist.json'
    res_r = requests.get(url_r, params=param_r, headers=headers)
    if res_r.status_code == 200:
        json_r = res_r.json()       
        posts_r.extend(json_r['data']['children'])    
        after_r = json_r['data']['after']
        print(i, after_r)
    else:                                     
        print(res_r.status_code)                
        break                                 
    time.sleep(2) 
    
df_rpost = pd.DataFrame(posts_r)
    
df_rpost.to_csv('./data/Resist_Scrape.csv')

0 t3_b95ss5
1 t3_b8ixq4
2 t3_b7yln6
3 t3_b6xf53
4 t3_b6ifmk
5 t3_b5sssh
6 t3_b5h8zx
7 t3_b59nxp
8 t3_b4odtd
9 t3_b3ytzk
10 t3_b3c2dn
11 t3_b2k31r
12 t3_b261ty
13 t3_b1t047
14 t3_b0v5wz
15 t3_b16cq2
16 t3_b09718
17 t3_b0ai4f
18 t3_azkr5s
19 t3_azt6ti
20 t3_ayx25g
21 t3_ayvm5j
22 t3_ay6drn
23 t3_axzblm
24 t3_axlulm
25 t3_axay3y
26 t3_awxchp
27 t3_awc697
28 t3_aw4eko
29 None


In [29]:
df_rpost.shape

(744, 2)
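
The two loops differ only in the URL and the page count, and the esist run shows `after` coming back as None on the last page. The sketch below folds the loop into one function that also stops when `after` is None, since an empty parameter dictionary would otherwise request the first page again. It is an illustration under assumptions: `collect_posts` and its parameters are hypothetical names, and `get` is expected to behave like `requests.get`.

```python
import time


def collect_posts(get, url, headers, max_pages, pause=2.0, sleep=time.sleep):
    """Page through a listing, following 'after' until it runs out."""
    posts = []
    after = None
    for page in range(max_pages):
        params = {} if after is None else {'after': after}
        res = get(url, params=params, headers=headers)
        if res.status_code != 200:
            print(res.status_code)  # report the error code and stop
            break
        listing = res.json()
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            break  # no further pages; another request would restart at page one
        sleep(pause)  # pause between requests
    return posts
```

With the notebook's variables, a call might look like `posts_r = collect_posts(requests.get, url_r, headers, max_pages=30)`.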

# Creating a Combined DataFrame

In [30]:
# esist:
r_post_titles = [each['data']['title'] for each in posts_r]
r_post_url = [each['data']['url'] for each in posts_r]
r_post_subreddit = [each['data']['subreddit'] for each in posts_r]


# The_Donald
t_post_titles = [each['data']['title'] for each in posts_t]
t_post_url = [each['data']['url'] for each in posts_t]
t_post_subreddit = [each['data']['subreddit'] for each in posts_t]

In [31]:
all_post_titles = r_post_titles + t_post_titles
all_post_url = r_post_url + t_post_url
all_post_subreddit = r_post_subreddit + t_post_subreddit
p3_dataset = {'post_title': all_post_titles, 'subreddit_name': all_post_subreddit, 'url':all_post_url}
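
The three parallel lists can also be built in a single pass, one dict per post, which keeps each title, subreddit, and URL together. A sketch with made-up posts; `to_rows` is a hypothetical helper, while the column names match `p3_dataset`.

```python
def to_rows(posts):
    """Turn raw posts into dicts with the three columns used in p3_dataset."""
    return [
        {
            'post_title': post['data']['title'],
            'subreddit_name': post['data']['subreddit'],
            'url': post['data']['url'],
        }
        for post in posts
    ]


# Made-up posts standing in for posts_r and posts_t
sample_r = [{'kind': 't3', 'data': {'title': 'A', 'subreddit': 'esist', 'url': 'https://example.com/a'}}]
sample_t = [{'kind': 't3', 'data': {'title': 'B', 'subreddit': 'The_Donald', 'url': 'https://example.com/b'}}]

rows = to_rows(sample_r) + to_rows(sample_t)
print(rows[0])
```

Passing the result to `pd.DataFrame`, as in `pd.DataFrame(to_rows(posts_r) + to_rows(posts_t))`, should give the same three columns.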

In [32]:
# df_shuffled = pd.DataFrame(p3_dataset, columns=["post_title"])
df_shuffled = pd.DataFrame(p3_dataset)

In [33]:
df_shuffled.head()

| | post_title | subreddit_name | url |
|---|---|---|---|
| 0 | Thread for Useful Links | esist | https://www.reddit.com/r/esist/comments/5rid3d... |
| 1 | Some on Mueller’s Team See Their Findings as M... | esist | https://www.nytimes.com/2019/04/03/us/politics... |
| 2 | Elizabeth Warren Wants to Make It Easier to Th... | esist | https://www.vanityfair.com/news/2019/04/elizab... |
| 3 | Congress: “We’re going to need a copy of the P... | esist | https://www.reddit.com/r/esist/comments/b99z1q... |
| 4 | Mueller’s Team Gathered ‘Alarming’ Trump Obstr... | esist | https://www.thedailybeast.com/muellers-team-ga... |


In [34]:
df_shuffled.to_csv('./data/Final_Dataset3.csv')

## Just for Reference:

This is how I figured out that my original code was looping back over the same post IDs and populating tons of duplicates.

In [36]:
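# df_shuffled only has the title, subreddit, and URL columns,
# so search the raw post lists for the ID instead
for post in posts_t + posts_r:
    if post['data']['name'] == 't3_b7o4et':
        print(True)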
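
Checking one ID at a time works, but counting every ID shows all the repeats at once. A sketch using `collections.Counter` on made-up posts; `duplicate_ids` and `drop_repeats` are hypothetical helpers, not part of the original notebook.

```python
from collections import Counter


def duplicate_ids(posts):
    """Return {post ID: count} for every ID that appears more than once."""
    counts = Counter(post['data']['name'] for post in posts)
    return {name: n for name, n in counts.items() if n > 1}


def drop_repeats(posts):
    """Keep the first occurrence of each post ID, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        name = post['data']['name']
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique


# Made-up posts: 't3_aaaaaa' was scraped twice
sample_posts = [
    {'data': {'name': 't3_aaaaaa'}},
    {'data': {'name': 't3_bbbbbb'}},
    {'data': {'name': 't3_aaaaaa'}},
]
print(duplicate_ids(sample_posts))
print(len(drop_repeats(sample_posts)))
```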