# Web Scraping
---

## Problem statement:
We want to break into freelance data journalism, so we are pitching Nate Silver and the team at FiveThirtyEight on how to write a Reddit post that gets the most engagement. Our goal is to find which characteristics of a Reddit post best predict its overall interaction, measured by whether its number of comments falls above or below the median. We hope to deliver a satisfactory classification model to FiveThirtyEight and jump-start our career!
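
To make the target concrete, here is a minimal sketch of the above/below-median split. The comment counts are illustrative values only, and treating a post exactly at the median as "below" is an assumption of this sketch, not a decision stated in the problem.

```python
from statistics import median

# Illustrative comment counts; a real run would use the scraped column.
comment_counts = [1413, 577, 1001, 792, 416]

# The median splits the posts into a high-engagement and a low-engagement class.
cutoff = median(comment_counts)

# 1 = strictly above the median, 0 = at or below it (an assumption of this sketch).
labels = [int(n > cutoff) for n in comment_counts]

print(cutoff, labels)  # 792 [1, 0, 1, 0, 0]
```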

## Use PRAW, the Python Reddit API Wrapper, to scrape Reddit

In [75]:
import praw
import pandas as pd
import datetime
import pprint
import time

In [2]:
# enter your own information into praw.Reddit()
reddit = praw.Reddit(client_id = '',
                     client_secret = '',
                     user_agent = '') 
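
Hard-coding credentials in a notebook makes them easy to leak. A minimal sketch of one alternative, assuming the credentials live in environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are hypothetical and not part of PRAW.

```python
import os


def load_reddit_credentials(env=None):
    """Collect the keyword arguments for praw.Reddit() from environment variables.

    The variable names below are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    names = {
        'client_id': 'REDDIT_CLIENT_ID',
        'client_secret': 'REDDIT_CLIENT_SECRET',
        'user_agent': 'REDDIT_USER_AGENT',
    }
    # Fail early with a clear message instead of sending empty credentials.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {key: env[var] for key, var in names.items()}
```

With the variables set, the client could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.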

In [3]:
reddit.read_only

True

### Now let's do a quick test to make sure we are pulling the "hot" posts from r/all

In [11]:
for sub in reddit.subreddit('all').hot(limit=5):
    print(sub.title, sub.num_comments)

Show me 1314
Her name was Michelle Snow and she was almost 8 487
So Costco apparently doesn't re-take membership card photos if you sneeze 769
Dear god, no, Morbius 2 has not been greenlit 914
Unadulterated reaction 830


### We can use `pprint` and `vars()` to see what attributes each submission returned by `reddit.subreddit('all').hot()` has

In [5]:
pprint.pprint(vars(sub))

{'_comments_by_id': {},
 '_fetched': False,
 '_reddit': <praw.reddit.Reddit object at 0x000002D678A15FA0>,
 'all_awardings': [{'award_sub_type': 'GLOBAL',
                    'award_type': 'global',
                    'awardings_required_to_grant_benefits': None,
                    'coin_price': 500,
                    'coin_reward': 100,
                    'count': 1,
                    'days_of_drip_extension': None,
                    'days_of_premium': 7,
                    'description': 'Gives 100 Reddit Coins and a week of '
                                   'r/lounge access and ad-free browsing.',
                    'end_date': None,
                    'giver_coin_reward': None,
                    'icon_format': None,
                    'icon_height': 512,
                    'icon_url': 'https://www.redditstatic.com/gold/awards/icon/gold_512.png',
                    'icon_width': 512,
                    'id': 'gid_2',
                    'is_enabled': True,
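
The full dump is long, so it can help to list only the attribute names first. A minimal sketch using a stand-in object: `fake_post` is hypothetical and only mimics a few fields of a real submission.

```python
import pprint
from types import SimpleNamespace

# Stand-in for a PRAW submission; a real one carries many more attributes.
fake_post = SimpleNamespace(id='uzz4yd', title='Show me', num_comments=1413)

# vars() returns the object's attribute dictionary; sorting its keys gives a
# compact overview without printing every nested value.
attribute_names = sorted(vars(fake_post))
pprint.pprint(attribute_names)  # ['id', 'num_comments', 'title']
```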
     

### Now let's create lists and fill them with the information we need from every post. PRAW fetches the listing from `reddit.subreddit('all').hot()` in batches and handles Reddit's rate limits itself, so there is no need to add a manual time delay

In [18]:
# create lists for various information we need from "hot" post
subreddit_id = []
subreddit = []
post_title = []
post_body = []
num_comments = []
time_posted = []
time_now = []
time_dif = []

# Let us limit ourselves to only 4000 posts a day and collect data for the next few days
for sub in reddit.subreddit('all').hot(limit = 4000):
    # sub id
    subreddit_id.append(sub.id)
    
    # subreddit
    subreddit.append(sub.subreddit)
    
    # title
    post_title.append(sub.title)
    
    # post body
    post_body.append(sub.selftext)

    # number of comments on post
    num_comments.append(sub.num_comments)
    
    # capture the scrape time once so time_now and time_difference agree
    scraped_at = datetime.datetime.utcnow()

    # time posted
    posted_at = datetime.datetime.utcfromtimestamp(sub.created_utc)
    time_posted.append(posted_at)

    # time it was scraped
    time_now.append(scraped_at)

    # time difference
    time_dif.append(scraped_at - posted_at)
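
`datetime.utcnow()` and `datetime.utcfromtimestamp()` return naive datetimes and are deprecated as of Python 3.12. Below is a sketch of the same calculation with timezone-aware datetimes. The helper name `post_age` is hypothetical.

```python
from datetime import datetime, timezone


def post_age(created_utc, scraped_at=None):
    """Return (time_posted, time_now, time_difference) as timezone-aware values."""
    # Capture the scrape time once so both derived values agree.
    if scraped_at is None:
        scraped_at = datetime.now(timezone.utc)
    posted_at = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return posted_at, scraped_at, scraped_at - posted_at


# Example with fixed values so the result is reproducible.
posted = datetime(2022, 5, 28, 23, 47, 41, tzinfo=timezone.utc)
scraped = datetime(2022, 5, 29, 3, 17, 41, tzinfo=timezone.utc)
_, _, age = post_age(posted.timestamp(), scraped)
print(age)  # 3:30:00
```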
    

### Now we can create a dataframe from the various lists we created

In [20]:
reddit_web_scrape_1 = pd.DataFrame({'id': subreddit_id, 'subreddit': subreddit, 'title': post_title, 
                                    'post_body': post_body, 'number of comments': num_comments, 
                                    'time_posted': time_posted,'time_now': time_now, 'time_difference': time_dif})

In [35]:
reddit_web_scrape_1

| | id | subreddit | title | post_body | number of comments | time_posted | time_now | time_difference |
|---|---|---|---|---|---|---|---|---|
| 0 | uzz4yd | facepalm | Show me | | 1413 | 2022-05-28 23:47:41 | 2022-05-29 03:17:37.780016 | 0 days 03:29:56.780016 |
| 1 | uzysdi | WhitePeopleTwitter | Her name was Michelle Snow and she was almost 8 | | 577 | 2022-05-28 23:27:57 | 2022-05-29 03:17:37.780016 | 0 days 03:49:40.780016 |
| 2 | uzz6t5 | movies | Dear god, no, Morbius 2 has not been greenlit | | 1001 | 2022-05-28 23:50:22 | 2022-05-29 03:17:37.780016 | 0 days 03:27:15.780016 |
| 3 | uzwk33 | funny | So Costco apparently doesn't re-take membershi... | | 792 | 2022-05-28 21:28:11 | 2022-05-29 03:17:37.780016 | 0 days 05:49:26.780016 |
| 4 | uzzxuf | oddlysatisfying | How to draw balls | | 416 | 2022-05-29 00:33:47 | 2022-05-29 03:17:37.780016 | 0 days 02:43:50.780016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3995 | uzoan9 | tumblr | Former Australian Prime Ministers | | 75 | 2022-05-28 14:19:12 | 2022-05-29 03:18:45.669964 | 0 days 12:59:33.669964 |
| 3996 | uznwbq | ShitpostXIV | forgot about this little reference in ShB. Bra... | | 23 | 2022-05-28 13:58:37 | 2022-05-29 03:18:45.669964 | 0 days 13:20:08.669964 |
| 3997 | uzzgo2 | argentina | El programa de Canosa ayer: | | 37 | 2022-05-29 00:05:30 | 2022-05-29 03:18:45.669964 | 0 days 03:13:15.669964 |
| 3998 | uzybsb | CatsAreAssholes | Just got myself a nice glass of ice water, soooo… | | 23 | 2022-05-28 23:02:23 | 2022-05-29 03:18:45.669964 | 0 days 04:16:22.669964 |


### Now let's save the dataframe to a csv file

In [23]:
reddit_web_scrape_1.to_csv('../data/reddit_1.csv', index=False)
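
Each scrape goes into its own numbered file. A small sketch of a helper that picks the next unused file name, so an earlier scrape is never overwritten by accident. `next_scrape_path` is a hypothetical helper, and the `reddit_N.csv` pattern follows the file names used in this notebook.

```python
from pathlib import Path


def next_scrape_path(data_dir):
    """Return the first reddit_N.csv path in data_dir that does not exist yet."""
    data_dir = Path(data_dir)
    n = 1
    # Count upward until a file name is free.
    while (data_dir / f'reddit_{n}.csv').exists():
        n += 1
    return data_dir / f'reddit_{n}.csv'
```

It could be used as `reddit_web_scrape_1.to_csv(next_scrape_path('../data'), index=False)`.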

### We will continue to do this for the following days creating separate csv files for every web scrape

In [32]:
subreddit_id = []
subreddit = []
post_title = []
post_body = []
num_comments = []
time_posted = []
time_now = []
time_dif = []

for sub in reddit.subreddit('all').hot(limit = 4000):
    # sub id
    subreddit_id.append(sub.id)
    
    # subreddit
    subreddit.append(sub.subreddit)
    
    # title
    post_title.append(sub.title)
    
    # post body
    post_body.append(sub.selftext)

    # number of comments on post
    num_comments.append(sub.num_comments)
    
    # capture the scrape time once so time_now and time_difference agree
    scraped_at = datetime.datetime.utcnow()

    # time posted
    posted_at = datetime.datetime.utcfromtimestamp(sub.created_utc)
    time_posted.append(posted_at)

    # time it was scraped
    time_now.append(scraped_at)

    # time difference
    time_dif.append(scraped_at - posted_at)
    

In [33]:
reddit_web_scrape_2 = pd.DataFrame({'id': subreddit_id, 'subreddit': subreddit, 'title': post_title, 'post_body': post_body, 
              'number of comments': num_comments, 'time_posted': time_posted,
              'time_now': time_now, 'time_difference': time_dif})

In [34]:
reddit_web_scrape_2

| | id | subreddit | title | post_body | number of comments | time_posted | time_now | time_difference |
|---|---|---|---|---|---|---|---|---|
| 0 | v0fqsb | todayilearned | TIL that in the 1630s there was a song only pl... | | 218 | 2022-05-29 17:14:29 | 2022-05-29 20:49:55.081123 | 0 days 03:35:26.081123 |
| 1 | v0e88q | clevercomebacks | Weird motives | | 983 | 2022-05-29 16:01:21 | 2022-05-29 20:49:55.081123 | 0 days 04:48:34.081123 |
| 2 | v0e8mi | formula1 | Sergio Perez wins the 2022 Monaco Grand Prix | | 1569 | 2022-05-29 16:01:49 | 2022-05-29 20:49:55.081123 | 0 days 04:48:06.081123 |
| 3 | v0dz9q | BlackPeopleTwitter | Our rent going up cause of “inflation” but our... | | 845 | 2022-05-29 15:49:21 | 2022-05-29 20:49:55.081123 | 0 days 05:00:34.081123 |
| 4 | v0culz | WhitePeopleTwitter | Brilliant front page | | 1725 | 2022-05-29 14:54:54 | 2022-05-29 20:49:55.081123 | 0 days 05:55:01.081123 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3995 | v0e37d | GYM | 110kg strict press @ 90.9kg + gf mirin | | 21 | 2022-05-29 15:54:42 | 2022-05-29 20:55:40.967027 | 0 days 05:00:58.967027 |
| 3996 | v0g0r9 | HypixelSkyblock | Hp moment | | 11 | 2022-05-29 17:28:10 | 2022-05-29 20:55:40.967027 | 0 days 03:27:30.967027 |
| 3997 | v09jaj | Nails | How do you guys think of my new nail extension? | | 20 | 2022-05-29 11:49:32 | 2022-05-29 20:55:40.967027 | 0 days 09:06:08.967027 |
| 3998 | v0apj8 | LoveForLandlords | Disgraceful behaviour from the rentoids | | 26 | 2022-05-29 13:01:08 | 2022-05-29 20:55:40.967027 | 0 days 07:54:32.967027 |


In [36]:
reddit_web_scrape_2.to_csv('../data/reddit_2.csv', index=False)

### Since scraping Reddit this way took a lot of copy/paste, creating a function to do it for us was the clear next step.

In [49]:
def scrape_hot_posts():
    # create an overarching list that we can then append dictionaries to
    reddit_lists = []

    for sub in reddit.subreddit('all').hot(limit = 4000):
        # create an empty dictionary
        reddit_list = {}
        
        # add keys (column names) and values to the dictionary
        
        # sub id
        reddit_list['id'] = sub.id

        # subreddit
        reddit_list['subreddit'] = sub.subreddit

        # title
        reddit_list['title'] = sub.title

        # post body
        reddit_list['post_body'] = sub.selftext

        # number of comments on post
        reddit_list['number_of_comments'] = sub.num_comments

        # capture the scrape time once so time_now and time_difference agree
        scraped_at = datetime.datetime.utcnow()

        # time posted
        reddit_list['time_posted'] = datetime.datetime.utcfromtimestamp(sub.created_utc)

        # time it was scraped
        reddit_list['time_now'] = scraped_at

        # time difference
        reddit_list['time_difference'] = scraped_at - reddit_list['time_posted']
        
        # append the dictionary to the list; we will get a list of dictionaries!
        reddit_lists.append(reddit_list)
    # return the list of dictionaries to create a dataframe
    return reddit_lists  
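
The function above is tied to the global `reddit` object and to `r/all`. One possible variation, sketched here under stated assumptions, takes the listing as an argument so the same code works for other subreddits or sort orders and can be tried offline with stand-in objects. The name `scrape_posts` and the fake submission are hypothetical, and the sketch uses timezone-aware datetimes rather than the naive ones used in the notebook.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def scrape_posts(listing):
    """Turn an iterable of submission-like objects into a list of row dictionaries."""
    rows = []
    for sub in listing:
        # One scrape timestamp per post keeps time_now and time_difference consistent.
        scraped_at = datetime.now(timezone.utc)
        posted_at = datetime.fromtimestamp(sub.created_utc, tz=timezone.utc)
        rows.append({
            'id': sub.id,
            # str() stores the subreddit as plain text rather than as an object.
            'subreddit': str(sub.subreddit),
            'title': sub.title,
            'post_body': sub.selftext,
            'number_of_comments': sub.num_comments,
            'time_posted': posted_at,
            'time_now': scraped_at,
            'time_difference': scraped_at - posted_at,
        })
    return rows


# Offline check with a stand-in submission (not real Reddit data).
fake_listing = [SimpleNamespace(id='abc123', subreddit='test', title='Hello',
                                selftext='', num_comments=3, created_utc=0)]
rows = scrape_posts(fake_listing)
print(rows[0]['id'], rows[0]['number_of_comments'])  # abc123 3
```

With a live client, the call might look like `pd.DataFrame(scrape_posts(reddit.subreddit('all').hot(limit=4000)))`.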

### Because our function returns a list of dictionaries, we can pass its result directly to `pd.DataFrame` to build the data frame for another scrape.

In [50]:
reddit_web_scrape_3 = pd.DataFrame(scrape_hot_posts())

In [51]:
reddit_web_scrape_3

| | id | subreddit | title | post_body | number_of_comments | time_posted | time_now | time_difference |
|---|---|---|---|---|---|---|---|---|
| 0 | v0zded | news | A 9-year-old describes escaping through a wind... | | 1461 | 2022-05-30 12:42:48 | 2022-05-30 16:44:58.527382 | 0 days 04:02:10.527382 |
| 1 | v0yqdd | gaming | Infectious tilt | | 536 | 2022-05-30 12:05:21 | 2022-05-30 16:44:58.527382 | 0 days 04:39:37.527382 |
| 2 | v0yule | WhitePeopleTwitter | And this continues and nothing happened | | 579 | 2022-05-30 12:12:15 | 2022-05-30 16:44:58.527382 | 0 days 04:32:43.527382 |
| 3 | v0yop8 | PoliticalHumor | Marjorie Taylor Greene is an idiot | | 639 | 2022-05-30 12:02:51 | 2022-05-30 16:44:58.527382 | 0 days 04:42:07.527382 |
| 4 | v0yi3t | AbsoluteUnits | Absolute unit | | 59 | 2022-05-30 11:52:14 | 2022-05-30 16:44:58.527382 | 0 days 04:52:44.527382 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3995 | v0fuqr | StoppedWorking | Not once but twice | | 119 | 2022-05-29 17:19:55 | 2022-05-30 16:46:08.564852 | 0 days 23:26:13.564852 |
| 3996 | v0vzzs | TankPorn | BMPT "Terminator" | | 37 | 2022-05-30 09:00:40 | 2022-05-30 16:46:08.564852 | 0 days 07:45:28.564852 |
| 3997 | v0yu1s | thegrandtour | What do we call this part of Star Wars | | 18 | 2022-05-30 12:11:18 | 2022-05-30 16:46:08.564852 | 0 days 04:34:50.564852 |
| 3998 | v126t3 | KerbalSpaceProgram | The Nightingale is flying fast on KerbalX now! | | 4 | 2022-05-30 15:04:37 | 2022-05-30 16:46:08.564852 | 0 days 01:41:31.564852 |


In [68]:
reddit_web_scrape_3.to_csv('../data/reddit_3.csv', index=False)

In [69]:
reddit_web_scrape_4 = pd.DataFrame(scrape_hot_posts())

In [70]:
reddit_web_scrape_4

| | id | subreddit | title | post_body | number_of_comments | time_posted | time_now | time_difference |
|---|---|---|---|---|---|---|---|---|
| 0 | v1oqwn | clevercomebacks | "It's actually like that everywhere" | | 435 | 2022-05-31 12:06:23 | 2022-05-31 15:59:14.788832 | 0 days 03:52:51.788832 |
| 1 | v1opk8 | therewasanattempt | to plant drugs during a traffic stop | | 2393 | 2022-05-31 12:04:26 | 2022-05-31 15:59:14.788832 | 0 days 03:54:48.788832 |
| 2 | v1pbcy | technology | Netflix's plan to charge people for sharing pa... | | 1216 | 2022-05-31 12:37:38 | 2022-05-31 15:59:14.788832 | 0 days 03:21:36.788832 |
| 3 | v1oq3c | WhitePeopleTwitter | This is how it should be done | | 2168 | 2022-05-31 12:05:13 | 2022-05-31 15:59:14.788832 | 0 days 03:54:01.788832 |
| 4 | v1ni9m | antiwork | yesyesnono | | 835 | 2022-05-31 10:51:57 | 2022-05-31 15:59:14.788832 | 0 days 05:07:17.788832 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3995 | v1pvhl | OkBrudiMongo | han unt bal | | 1 | 2022-05-31 13:06:46 | 2022-05-31 16:02:10.628603 | 0 days 02:55:24.628603 |
| 3996 | v1jpe6 | india | Maggi for breakfast, lunch, dinner: Man divorc... | | 115 | 2022-05-31 06:13:54 | 2022-05-31 16:02:10.628603 | 0 days 09:48:16.628603 |
| 3997 | v1c8g7 | AbsoluteUnits | Absolute Unit of a snow man | | 10 | 2022-05-30 23:01:21 | 2022-05-31 16:02:10.628603 | 0 days 17:00:49.628603 |
| 3998 | v1olec | wholesomememes | The very definition of rewarding: | | 2 | 2022-05-31 11:58:45 | 2022-05-31 16:02:10.628603 | 0 days 04:03:25.628603 |


In [71]:
reddit_web_scrape_4.to_csv('../data/reddit_4.csv', index=False)

In [72]:
reddit_web_scrape_5 = pd.DataFrame(scrape_hot_posts())

In [73]:
reddit_web_scrape_5

| | id | subreddit | title | post_body | number_of_comments | time_posted | time_now | time_difference |
|---|---|---|---|---|---|---|---|---|
| 0 | v2h1i4 | BlackPeopleTwitter | “U.S. median rent to reach record high of $200... | | 683 | 2022-06-01 13:28:58 | 2022-06-01 16:07:00.285441 | 0 days 02:38:02.285441 |
| 1 | v2g3gs | todayilearned | TIL: Bats eat enough insects to save the US ov... | | 511 | 2022-06-01 12:40:36 | 2022-06-01 16:07:00.285441 | 0 days 03:26:24.285441 |
| 2 | v2g2sd | antiwork | the propaganda machine is running! | | 284 | 2022-06-01 12:39:38 | 2022-06-01 16:07:00.285441 | 0 days 03:27:22.285441 |
| 3 | v2eh1s | politics | ‘It’s going to be an army’: Tapes reveal GOP p... | | 1631 | 2022-06-01 11:11:33 | 2022-06-01 16:07:00.285441 | 0 days 04:55:27.285441 |
| 4 | v2ghce | pokemon | Pokémon Scarlet and Violet - New Trailer | | 2788 | 2022-06-01 13:00:18 | 2022-06-01 16:07:00.285441 | 0 days 03:06:42.285441 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3995 | v286j8 | cats | sometimes I find Gus laying in the window like... | | 34 | 2022-06-01 04:02:16 | 2022-06-01 16:08:08.327153 | 0 days 12:05:52.327153 |
| 3996 | v2e9bp | SkyrimMemes | Hadring got himself another customer, eh? | | 6 | 2022-06-01 10:58:59 | 2022-06-01 16:08:08.327153 | 0 days 05:09:09.327153 |
| 3997 | v2bvui | pokemongo | Cubone are you okay? | | 29 | 2022-06-01 08:09:59 | 2022-06-01 16:08:08.327153 | 0 days 07:58:09.327153 |
| 3998 | v2700j | h3h3productions | us rn | | 38 | 2022-06-01 02:54:13 | 2022-06-01 16:08:08.327153 | 0 days 13:13:55.327153 |


In [74]:
reddit_web_scrape_5.to_csv('../data/reddit_5.csv', index=False)