# Subreddit Data Collection (legwork)

#### This is how I first started scraping

I eventually found a much cleaner and more effective way to scrape subreddits using a function, but by then I had already scraped a significant number of posts from the cheese subreddit. Those posts had been saved to a CSV in an ugly, poorly formatted manner, so I had to read them back in, reformat them, and save them to a new CSV that I could load and merge with the other data in the main notebook. This separate Jupyter notebook documents how I did that, so as not to clutter up my final Jupyter notebook.

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd 
import time

In [2]:
# create variable for the cheese subreddit page url
url_cheese = 'https://www.reddit.com/r/Cheese.json'
    
# create variable for the old movies subreddit page url
url_movies = 'https://www.reddit.com/r/oldmovies.json'

In [3]:
# create a custom user agent to avoid a 429 error (too many requests). Python and Chrome
# (which countless people use) each send their own default user agent, so requests that use
# those defaults are lumped together with everyone else's. With a custom user agent, Reddit
# sees these requests as coming from this one tool, 'MichaelKnight4714'. The user agent is stored
# in a dictionary saved to the variable headers, which is passed as the headers parameter in the
# subreddit get requests
headers = {'User-agent':'MichaelKnight4714'}

In [6]:
# perform a get request on the cheese subreddit
cheese_request = requests.get(url_cheese, headers=headers)


In [7]:
# 200 means success; 429 means too many requests
cheese_request.status_code

200
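A 429 does not have to end the run. As a minimal sketch (not what this notebook did), the request can be retried after a pause. The helper name `get_with_retry`, its parameters, and the wait schedule are all assumptions made for illustration; `get` stands in for `requests.get` so the logic can be tried without touching the network.

```python
import time


def get_with_retry(get, url, headers, params=None, max_tries=3,
                   wait_seconds=2.0, sleep=time.sleep):
    """Call get(url, ...) and retry while the response status code is 429.

    `get` is any callable with the same keyword arguments as requests.get.
    `sleep` is injectable so the waiting can be skipped when experimenting.
    """
    response = None
    for attempt in range(max_tries):
        response = get(url, headers=headers, params=params)
        if response.status_code != 429:
            break  # success or a different error: stop retrying
        if attempt < max_tries - 1:
            # wait a little longer after each throttled attempt
            sleep(wait_seconds * (attempt + 1))
    return response


# hypothetical usage in this notebook:
# cheese_request = get_with_retry(requests.get, url_cheese, headers)
```

The caller still has to check `status_code` afterwards, since the last attempt may also have been throttled.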

In [8]:
# calling .json() on the response parses the JSON body into a dictionary of the info pulled from the subreddit
# save the cheese subreddit info into the variable cheese_json
cheese_json = cheese_request.json()

In [10]:
# look at the top-level keys of the dictionary returned for the subreddit
print(sorted(cheese_json.keys()))



['data', 'kind']


In [11]:
# nothing useful in the 'kind' key; everything is in the 'data' key
print(cheese_json['data'].keys())


dict_keys(['modhash', 'dist', 'children', 'after', 'before'])


In [12]:
# the posts in the subreddit are stored under the 'children' key inside 'data'
# each request returns about 25 or 26 posts
print(len(cheese_json['data']['children']))


26


In [13]:
# the 'after' key holds the name of the last post on the page scraped so far
print(cheese_json['data']['children'][-1]['data']['name'])
print(cheese_json['data']['after'])


t3_c9z0gw
t3_c9z0gw
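The structure explored in the cells above can be summarized in a small sketch. The `sample_listing` dictionary below is a made-up stand-in, trimmed to the keys this notebook actually uses, and its post names are placeholders rather than real Reddit IDs; `listing_names` is a hypothetical helper.

```python
def listing_names(listing):
    """Return the 'name' of every post in a parsed subreddit listing."""
    return [child['data']['name'] for child in listing['data']['children']]


# made-up data shaped like the response above (placeholder names)
sample_listing = {
    'kind': 'Listing',
    'data': {
        'children': [
            {'kind': 't3', 'data': {'name': 't3_aaa111', 'title': 'First post'}},
            {'kind': 't3', 'data': {'name': 't3_bbb222', 'title': 'Second post'}},
        ],
        'after': 't3_bbb222',
        'before': None,
    },
}

names = listing_names(sample_listing)
# the last post on the page is the one 'after' points at
print(names[-1] == sample_listing['data']['after'])  # prints True
```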


In [14]:
# create a params dictionary for the cheese subreddit get request
param_cheese = {'after':'t3_c7kkxy'}


In [14]:
# pass the params to the get request on the cheese subreddit url to fetch another page of results
requests.get(url_cheese, params=param_cheese, headers=headers)


<Response [200]>

In [358]:
## based HEAVILY on Riley Dallas' code, provided in the YouTube Project 3 info session video from May 2018
## https://www.youtube.com/watch?v=5Y3ZE26Ciuk

# loop to collect roughly 1000 posts (40 requests at about 25 posts each)
cheese_posts = []  # list to hold every post pulled from the cheese subreddit
after = None       # pagination token; None means "start from the first page"
url_cheese = 'https://www.reddit.com/r/Cheese.json'   # cheese subreddit page url

for i in range(40):  # run the loop 40 times
    print(i)         # print the current iteration number

    # the first request has no params; later requests ask for the page after the last post seen
    params = {} if after is None else {'after': after}

    # perform a get request on the cheese subreddit
    cheese_request = requests.get(url_cheese, params=params, headers=headers)

    # on any status other than 200 (e.g. 429, too many requests), print the code and stop
    if cheese_request.status_code != 200:
        print(cheese_request.status_code)
        break

    # .json() parses the response body into a dictionary
    # save the cheese subreddit info into the variable cheese_json
    cheese_json = cheese_request.json()
    cheese_posts.extend(cheese_json['data']['children'])
    after = cheese_json['data']['after']

    # an empty 'after' means there is no next page; stop rather than start over from page one
    if after is None:
        break

    time.sleep(1)  # sleep for 1 second between requests so as not to look like a DDoS attack to Reddit's servers

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
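One thing to watch when paginating this way: a request with no `after` parameter returns the first page, so a loop should stop once `after` comes back empty. The sketch below assumes that Reddit reports a null `after` on the last page. It is an illustration rather than the code used in this notebook, and `fetch_page` is a hypothetical callable that takes an `after` token and returns the parsed JSON for that page.

```python
def iter_listing_pages(fetch_page, max_pages=40):
    """Yield each page's list of posts, following the 'after' token.

    fetch_page(after) must return a parsed listing dictionary; passing
    after=None asks for the first page.
    """
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        yield page['data']['children']
        after = page['data']['after']
        if after is None:
            break  # no next page; asking again would restart at page one


# hypothetical usage with the variables defined earlier in this notebook:
# def fetch_page(after):
#     params = {} if after is None else {'after': after}
#     return requests.get(url_cheese, params=params, headers=headers).json()
#
# cheese_posts = [post for page in iter_listing_pages(fetch_page) for post in page]
```

Keeping the paging logic separate from the HTTP call also makes it possible to try the loop against saved responses before hitting the live site.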


In [359]:
# see how many posts from the cheese subreddit have been scraped
len(cheese_posts)

1000

In [360]:
#create a dataframe out of the values returned from the for loop pulling in all the posts from the cheese subreddit
cheese_df = pd.DataFrame(cheese_posts)

In [361]:
# inspect the name of the post in row 0
cheese_df['data'][0]['name']

't3_7ultff'

In [362]:
# inspect the name of the post in row 26
cheese_df['data'][26]['name']

't3_c918ys'

In [363]:
# check for the first repeating post by tracking the names already seen
seen_names = set()
for i in range(len(cheese_df)):
    name = cheese_df['data'][i]['name']
    if name in seen_names:
        # print the repeated name and the row where the repeat occurs, then stop
        print(name)
        print(i)
        break
    seen_names.add(name)

In [364]:
# inspect full length of posts scraped
len(cheese_df)

1000

In [365]:
# count the unique post names among the scraped posts
len(set(t['data']['name'] for t in cheese_posts))

1000
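When the unique count is lower than the total, the repeats can be removed before the posts ever reach a dataframe. A minimal sketch, assuming each post is a dictionary with a `['data']['name']` entry as shown above; `dedupe_posts` is a hypothetical helper, not part of the original workflow.

```python
def dedupe_posts(posts):
    """Return posts with repeated names removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for post in posts:
        name = post['data']['name']
        if name not in seen:
            seen.add(name)
            unique.append(post)
    return unique


# hypothetical usage:
# cheese_posts = dedupe_posts(cheese_posts)
```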

In [275]:
# inspect the name of the post in row 203
cheese_df['data'][203]['name']

't3_bwjqpt'

In [276]:
# inspect the name of the post in row 1202
cheese_df['data'][1202]['name']

't3_bwnxdm'

In [277]:
# print the name of the post in every row
# do this by running a for loop
for i in range(len(cheese_df)):
    print(cheese_df['data'][i]['name'])

t3_7ultff
t3_caj5xk
t3_cabzeg
t3_ca7apm
t3_caipu1
t3_cakdqu
t3_ca8a6a
t3_cadynh
t3_ca825g
t3_c9zxxp
t3_c9uh1c
t3_ca1gtt
t3_c9wr0l
t3_c9z0gw
t3_ca2meq
t3_c9j8nt
t3_c9pomb
t3_c9or1s
t3_c9kvzd
t3_c9jje8
t3_c9cux2
t3_c9vnii
t3_c9tomr
t3_c918ys
t3_c9577h
t3_c8vn0t
t3_c8pix7
t3_c8vh0o
t3_c8sgy0
t3_c8r1oz
t3_c8dhc8
t3_c8di81
t3_c7zjv2
t3_c7xsv6
t3_c83z2s
t3_c7vpep
t3_c88ws3
t3_c7s167
t3_c84wi5
t3_c87fjk
t3_c7jzf3
t3_c7k92d
t3_c7kkxy
t3_c7lb7u
t3_c7cs6a
t3_c79b67
t3_c7hva1
t3_c7it0t
t3_c72udm
t3_c76vik
t3_c73ql1
t3_c6xmu7
t3_c6zuat
t3_c72a1n
t3_c6xc4g
t3_c6n88f
t3_c72o7z
t3_c6swb1
t3_c6ppfe
t3_c6zimo
t3_c6qcow
t3_c6ux0q
t3_c6djhg
t3_c6fy6a
t3_c684ug
t3_c69vlq
t3_c6iyct
t3_c5vms6
t3_c66fir
t3_c5yqbj
t3_c65vbx
t3_c5zgu7
t3_c5o02m
t3_c5px1e
t3_c56by8
t3_c5h529
t3_c51fia
t3_c521cf
t3_c4umop
t3_c4w6h0
t3_c4nanj
t3_c4n9h2
t3_c4zq8z
t3_c4ogf3
t3_c4rai1
t3_c522j7
t3_c4atta
t3_c4p45g
t3_c44e1y
t3_c4914f
t3_c4bnci
t3_c4hagw
t3_c3wrx3
t3_c3wvs8
t3_c43x8h
t3_c3x1wu
t3_c3xwda
t3_c3xduu
t3_c3lyqm
t3_c3eo8f


In [134]:
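# save our cheese subreddit dataframe to the file cheese_posts.csv
# mode='a' appends the new df to what is already in the file instead of overwriting it
# note: each append also writes the column header again unless header=False is passed
cheese_df.to_csv('./datasets/cheese_posts.csv', mode='a', index=False)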
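Writing dictionaries to a CSV stores them as text, so they come back as strings when the file is read. One alternative, sketched here under the assumption that nothing else needs the CSV format, is to write each post as one JSON object per line with the standard `json` module. The function names are hypothetical, and any file path passed in is up to the caller.

```python
import json


def append_posts_jsonl(posts, path):
    """Append each post dictionary to path as one JSON object per line."""
    with open(path, 'a', encoding='utf-8') as f:
        for post in posts:
            f.write(json.dumps(post) + '\n')


def load_posts_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# hypothetical usage:
# append_posts_jsonl(cheese_posts, './datasets/cheese_posts.jsonl')
# cheese_posts = load_posts_jsonl('./datasets/cheese_posts.jsonl')
```

Because there is no header row, appending several runs to the same file does not leave stray header lines in the data.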

In [33]:
#read cheese_posts.csv into 'train'
train = pd.read_csv('./datasets/cheese_posts.csv')

In [34]:
# check the length of the dataframe after pulling it from the csv file
len(train)

5215

In [38]:
# inspect the post stored in row 1249
train['data'][1249]

{'approved_at_utc': None,
 'subreddit': 'Cheese',
 'selftext': '',
 'author_fullname': 't2_on19bki',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'Cheese plane Cheese plane',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/Cheese',
 'hidden': False,
 'pwls': 6,
 'link_flair_css_class': None,
 'downs': 0,
 'thumbnail_height': 140,
 'hide_score': False,
 'name': 't3_btnypj',
 'quarantine': False,
 'link_flair_text_color': 'dark',
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 'ups': 1,
 'total_awards_received': 0,
 'media_embed': {},
 'thumbnail_width': 140,
 'author_flair_template_id': None,
 'is_original_content': False,
 'user_reports': [],
 'secure_media': None,
 'is_reddit_media_domain': True,
 'is_meta': False,
 'category': None,
 'secure_media_embed': {},
 'link_flair_text': None,
 'can_mod_post': False,
 'score': 1,
 'approved_by': None,
 'thumbnail': 'https://a.thumbs.redditmedia.com/FcLdaglOH7-O4KzOoQk2

In [35]:
# inspect the last row
# notice that instead of a dictionary, this is a string containing a dictionary
train['data'][5214]

"{'approved_at_utc': None, 'subreddit': 'test', 'selftext': '', 'author_fullname': 't2_3zrujr9s', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'Preview Image Gallery Test with Link', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/test', 'hidden': False, 'pwls': 6, 'link_flair_css_class': None, 'downs': 0, 'thumbnail_height': 105, 'hide_score': False, 'name': 't3_cafpis', 'quarantine': False, 'link_flair_text_color': 'dark', 'author_flair_background_color': None, 'subreddit_type': 'public', 'ups': 1, 'total_awards_received': 0, 'media_embed': {}, 'thumbnail_width': 140, 'author_flair_template_id': None, 'is_original_content': False, 'user_reports': [], 'secure_media': None, 'is_reddit_media_domain': False, 'is_meta': False, 'category': None, 'secure_media_embed': {}, 'link_flair_text': None, 'can_mod_post': False, 'score': 1, 'approved_by': None, 'thumbnail': 'https://b.thumbs.redditmedia.com/h_aadkQh7IHfvKbjiEJmwFKFv7JjnsWMNU-RIj2pfBU.jpg'

In [37]:
# from Jacob Gabrielson and Josh Lee on StackOverflow
# https://stackoverflow.com/questions/988228/convert-a-string-representation-of-a-dictionary-to-a-dictionary
# convert the strings of dictionaries that were saved into the csv file back into dictionaries
import ast

# rows 0-1249 are already dictionaries and row 1250 holds a stray string instead of a post,
# so only rows 1251 onward need to be converted
# assign through .loc rather than train['data'][i] = ..., which is chained assignment
train.loc[1251:, 'data'] = train.loc[1251:, 'data'].apply(ast.literal_eval)

In [39]:
# inspect row 1250
# notice that instead of a dictionary there is a single string ('data')
train['data'][1250]

'data'

In [40]:
# drop this row
train.drop(1250, inplace=True)
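The conversion and the stray-row check can also be handled in one pass. The sketch below is a hypothetical helper, `parse_post_cell`, which returns dictionaries unchanged, parses strings with `ast.literal_eval`, and returns `None` for anything that does not evaluate to a dictionary, such as the `'data'` string found in row 1250.

```python
import ast


def parse_post_cell(cell):
    """Return the cell as a dictionary, or None if it cannot be parsed as one."""
    if isinstance(cell, dict):
        return cell  # already a dictionary, nothing to do
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return None  # not a Python literal, e.g. a repeated header string
    return value if isinstance(value, dict) else None


# hypothetical usage on the dataframe in this notebook:
# train['data'] = train['data'].apply(parse_post_cell)
# train = train.dropna(subset=['data'])
```

This avoids hard-coding row numbers like 1250 and 1251, which only hold for this particular file.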

In [41]:
# look at the first 5 values of the 'data' column of the dataframe
train['data'].head()

0    {'approved_at_utc': None, 'subreddit': 'Cheese...
1    {'approved_at_utc': None, 'subreddit': 'Cheese...
2    {'approved_at_utc': None, 'subreddit': 'Cheese...
3    {'approved_at_utc': None, 'subreddit': 'Cheese...
4    {'approved_at_utc': None, 'subreddit': 'Cheese...
Name: data, dtype: object

In [42]:
# move the dataframe of dictionaries to a list of dictionaries
# only take the dictionaries that came from the cheese subreddit (only rows before 4224)
old_cheese_list = train[:4224]['data'].values.tolist()

In [46]:
# inspect the title of the third post in the list
# the info is now a list of dictionaries, as it was when it was first scraped
old_cheese_list[2]['title']

'A selection of goat cheeses this evening with some yummy snacks'

In [49]:
# taken from Adi Bronshtein
# join each post's title and selftext into one string using a list comprehension;
# the resulting list is later put into a dataframe column called 'text'
def combine_text(posts):
    return [' '.join([post['title'], post['selftext']]) for post in posts]
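`combine_text` assumes every post has both a `'title'` and a `'selftext'` string. A more forgiving variant is sketched below, with the caveat that it is not the original function: `combine_text_safe` is a hypothetical name, and unlike `combine_text` it leaves no trailing space when the body is empty, so its output differs slightly for title-only posts.

```python
def combine_text_safe(posts):
    """Join title and selftext per post, tolerating missing or None values."""
    combined = []
    for post in posts:
        title = post.get('title') or ''
        body = post.get('selftext') or ''
        # keep only the non-empty parts so title-only posts get no trailing space
        combined.append(' '.join(part for part in (title, body) if part))
    return combined


example_posts = [
    {'title': 'Cheese plane', 'selftext': ''},
    {'title': 'Brie', 'selftext': 'So creamy'},
    {'title': 'No body key'},
]
print(combine_text_safe(example_posts))
# prints ['Cheese plane', 'Brie So creamy', 'No body key']
```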

In [51]:
# take the data pulled from the poorly formatted csv and format it to be concatenated with
# the scraped data in the main notebook (post titles and post content are combined
# into one column)
old_cheese_text = combine_text(old_cheese_list)

# save to a new dataframe with a single 'text' column
old_cheese_df = pd.DataFrame(old_cheese_text, columns=['text'])
# all of these are from the cheese subreddit, so make a 'cheese' column with every value set to 1,
# meaning "yes, this is from the cheese subreddit"
old_cheese_df['cheese'] = 1

In [52]:
# inspect the first five values of the dataframe
old_cheese_df.head()

                                                text  cheese
0                       FAQ: What is cheese, anyway?       1
1   Cheese plate I made yesterday for a friend and I       1
2  A selection of goat cheeses this evening with ...       1
3                                   delice de poitou       1
4  Cheese “dessert” plate: whipped chèvre in chou...       1


In [53]:
# look at the length of the dataframe
len(old_cheese_df)

4224

In [54]:
# drop the duplicates from the dataframe, so all that is left are unique posts
old_cheese_df.drop_duplicates(inplace=True)

In [57]:
# look at the length of the dataframe after dropping the duplicates
len(old_cheese_df)

989

In [58]:
# save the de-duplicated cheese subreddit dataframe to the file old_cheese_posts.csv
# mode='a' appends to an existing file instead of overwriting it
old_cheese_df.to_csv('./datasets/old_cheese_posts.csv', mode='a', index=False)