# Project 3.01: Webscraping from Google Home Reddit
---

**<u>Problem Statement.</u>** The Google Business strategy team has identified Reddit as a potential source of insights into user needs. Specifically, the two sources are the subreddits on Google Home products and the Google Pixel. The team wants to extract the Reddit posts and classify them by source (i.e. Google Home or Google Pixel) before proceeding with further analyses.

Manually trawling through the Reddit posts to identify which subreddit each post came from is untenable, as the team has other project tasks on hand. Outsourcing this work to other departments or interns is not favored because of the potential for human error and the sensitivity of the work.

**<u>Proposed Solution.</u>** The team decided to use Natural Language Processing (NLP) and machine learning to classify the extracted posts. Automating the workflow would reduce the workload and free up the team's time for high-value tasks. The model could also be repurposed to classify other Reddit posts when the team moves on to deriving insights into user needs for competitors' products.

In [20]:
# Import libraries
import requests
import time
import pandas as pd
import random

from tqdm.notebook import trange, tqdm

### Test pull and review of Reddit posts

In [21]:
# Google home url
url = 'https://www.reddit.com/r/googlehome/.json'

In [22]:
# Reddit tends to reject requests sent with the default `requests` user agent,
# so send a custom User-agent header instead.
header = {'User-agent': 'Pony Inc 1.0'}
res = requests.get(url, headers=header)

In [23]:
res.status_code

200

In [24]:
# JSON is a language-agnostic format for structuring data
# Parse the response body into a dictionary
reddit_dict1 = res.json()

In [25]:
# Review sorted keys
sorted(reddit_dict1.keys())

['data', 'kind']

In [26]:
reddit_dict1['kind']
# Information for key: 'kind' is limited

'Listing'

In [27]:
# Review sorted keys of data
sorted(reddit_dict1['data'].keys())

['after', 'before', 'children', 'dist', 'modhash']

In [28]:
reddit_dict1['data']['children'][0]['data']
# Children key is of interest to us (where the posts are)

{'approved_at_utc': None,
 'subreddit': 'googlehome',
 'selftext': "[\\[FAQ - Frequently Asked Questions\\]](https://www.reddit.com/r/googlehome/wiki/faq)  We've already answered most of the basics. For example: There is no way to change the wake words. There is no setting for less talkative responses (but there is a useful workaround).\n\n[\\[Commands\\]](https://www.reddit.com/r/googlehome/wiki/commands)  Read the list of favorite commands to use with your Google Home. The topics will give you helpful ideas for what you can do with your device.\n\n[\\[New Features\\]](https://www.reddit.com/r/googlehome/wiki/new_features)  If you are curious about the new features that have rolled out lately, then read this list. It is updated as we find out about new releases from Google.\n\nCheck the subreddit wiki sidebar for more helpful links, including available 3rd Party Apps [\\[Actions &amp; Apps\\]](https://assistant.google.com/explore/) and fun [\\[Easter Eggs\\]](https://www.reddit.com/r/

In [29]:
# First post: class label (target)
print(reddit_dict1['data']['children'][0]['data']['subreddit'])

# Title of post
print(reddit_dict1['data']['children'][0]['data']['title'])

# Text of post
reddit_dict1['data']['children'][0]['data']['selftext']

googlehome
FAQ: Please read the subreddit FAQ before posting similar questions! Also read the subreddit Rules in the sidebar.


"[\\[FAQ - Frequently Asked Questions\\]](https://www.reddit.com/r/googlehome/wiki/faq)  We've already answered most of the basics. For example: There is no way to change the wake words. There is no setting for less talkative responses (but there is a useful workaround).\n\n[\\[Commands\\]](https://www.reddit.com/r/googlehome/wiki/commands)  Read the list of favorite commands to use with your Google Home. The topics will give you helpful ideas for what you can do with your device.\n\n[\\[New Features\\]](https://www.reddit.com/r/googlehome/wiki/new_features)  If you are curious about the new features that have rolled out lately, then read this list. It is updated as we find out about new releases from Google.\n\nCheck the subreddit wiki sidebar for more helpful links, including available 3rd Party Apps [\\[Actions &amp; Apps\\]](https://assistant.google.com/explore/) and fun [\\[Easter Eggs\\]](https://www.reddit.com/r/googlehome/wiki/eastereggs)."

In [30]:
testposts = [p['data'] for p in reddit_dict1['data']['children']]
df_test = pd.DataFrame(testposts)
df_test

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,post_hint,preview
0,,googlehome,[\[FAQ - Frequently Asked Questions\]](https:/...,t2_q648wkk,False,,0,False,FAQ: Please read the subreddit FAQ before post...,[],...,True,https://www.reddit.com/r/googlehome/comments/a...,152107,1547063000.0,0,,False,,,
1,,googlehome,Do you want to liberate your Google so it can ...,t2_8hlsb,False,,0,False,My Google has Coronavirus! - Monthly Rants and...,[],...,True,https://www.reddit.com/r/googlehome/comments/g...,152107,1588345000.0,0,,False,,,
2,,googlehome,Because that's all it is.,t2_1mgxh,False,,0,False,[Subreddit Request] Can we rename this subredd...,[],...,False,https://www.reddit.com/r/googlehome/comments/g...,152107,1589287000.0,0,,False,489ab03e-b78e-11e6-8fd6-0e00dc2f4472,,
3,,googlehome,,t2_mnm0w,False,,0,False,This one really takes the cake.,[],...,False,https://i.imgur.com/vli0N3i.jpg,152107,1589250000.0,0,,False,,image,{'images': [{'source': {'url': 'https://extern...
4,,googlehome,,t2_me5yarn,False,,0,False,Where did my routines go to? They're still in ...,[],...,False,https://i.redd.it/nmag8r30h6y41.png,152107,1589221000.0,0,,False,0ab35abe-b78e-11e6-9303-0ed3c72a1f42,image,{'images': [{'source': {'url': 'https://previe...
5,,googlehome,,t2_pelv6,False,,0,False,Somehow I got Netflix to play on my Chromecast...,[],...,False,https://i.redd.it/yzfwnnyvtby41.jpg,152107,1589286000.0,0,,False,,image,{'images': [{'source': {'url': 'https://previe...
6,,googlehome,So I set an alarm there and Google responded '...,t2_4jmd4xz0,False,,0,False,Google just said no problem,[],...,False,https://www.reddit.com/r/googlehome/comments/g...,152107,1589269000.0,0,,False,,,
7,,googlehome,Recently got a Nest WiFi router. I currently l...,t2_bujoj,False,,0,False,Nest WiFi - who can access router settings?,[],...,False,https://www.reddit.com/r/googlehome/comments/g...,152107,1589295000.0,0,,False,008d9c20-b78e-11e6-9e45-0e53646228d0,,
8,,googlehome,,t2_70r6i,False,,0,False,"Hi, all! I'm a developer who recently built a ...",[],...,False,https://youtu.be/Q0j9Pu6JGW0,152107,1589293000.0,0,"{'type': 'youtube.com', 'oembed': {'provider_u...",False,8784af18-6a3f-11ea-bce7-0eff973eebf1,rich:video,{'images': [{'source': {'url': 'https://extern...
9,,googlehome,"We have multiple Minis, Hubs, and the Guards f...",t2_556xu8o2,False,,0,False,Google Hub Keeps Engaging,[],...,False,https://www.reddit.com/r/googlehome/comments/g...,152107,1589291000.0,0,,False,,,
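The list comprehension above relies on the nesting seen in the test pull: a `Listing` whose `data['children']` holds one entry per post, each with its own `data` dictionary. The sketch below reproduces that shape with a hand-made stand-in so the extraction step can be tried without a network call. `sample_listing`, its titles, its cursor value and the helper `extract_posts` are all invented for illustration.

```python
# Hand-made stand-in for the parsed response; titles and cursor are invented.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",
        "children": [
            {"data": {"subreddit": "googlehome", "title": "First post", "selftext": "Body text"}},
            {"data": {"subreddit": "googlehome", "title": "Image post", "selftext": ""}},
        ],
    },
}


def extract_posts(listing):
    """Return the inner 'data' dict of every child in a listing."""
    return [child["data"] for child in listing["data"]["children"]]


posts_sample = extract_posts(sample_listing)
titles = [post["title"] for post in posts_sample]
print(titles)
```

Image and link posts carry an empty `selftext`, which matches the blank rows in the table above.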


In [31]:
# text for each post
df_test['selftext']

0     [\[FAQ - Frequently Asked Questions\]](https:/...
1     Do you want to liberate your Google so it can ...
2                             Because that's all it is.
3                                                      
4                                                      
5                                                      
6     So I set an alarm there and Google responded '...
7     Recently got a Nest WiFi router. I currently l...
8                                                      
9     We have multiple Minis, Hubs, and the Guards f...
10    Hi all. Just got myself a Google Nest Mini 2. ...
11    Just got Google nest wifi with one access poin...
12    I'm not sure why Google makes this so difficul...
13    Does anyone know if Google will add group vide...
14                                                     
15    When I got my myactivity I can see the history...
16    There's no Bluetooth in my PC.  So when I cast...
17    Hi all,\n\nI recently bought a second mini

In [32]:
# Features for each post
df_test.columns

Index(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved',
       'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext',
       ...
       'stickied', 'url', 'subreddit_subscribers', 'created_utc',
       'num_crossposts', 'media', 'is_video', 'link_flair_template_id',
       'post_hint', 'preview'],
      dtype='object', length=106)
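Most of the 106 columns are unlikely to matter for the classification task in the problem statement. The sketch below trims each post to a few fields before any further processing. It assumes the label is `subreddit` and the text features are `title` and `selftext`; that choice is an assumption made here, and `KEEP`, `slim` and the sample post are hypothetical names and data.

```python
# Assumed feature set: class label plus the two text fields.
KEEP = ("subreddit", "title", "selftext")


def slim(post, keep=KEEP):
    """Keep only the chosen fields; keys missing from the post become None."""
    return {key: post.get(key) for key in keep}


# Invented sample post with a couple of extra fields.
raw_posts = [
    {
        "subreddit": "googlehome",
        "title": "Help with routines",
        "selftext": "Example body",
        "gilded": 0,
        "stickied": False,
    }
]

slim_posts = [slim(post) for post in raw_posts]
print(slim_posts)
```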

In [33]:
# Save test pull data as CSV
df_test.to_csv('testpull.csv', index=False)

In [34]:
# Check saved data
df_check = pd.read_csv('testpull.csv')
print(df_check.shape)
df_check.head()

(27, 106)


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,post_hint,preview
0,,googlehome,[\[FAQ - Frequently Asked Questions\]](https:/...,t2_q648wkk,False,,0,False,FAQ: Please read the subreddit FAQ before post...,[],...,True,https://www.reddit.com/r/googlehome/comments/a...,152107,1547063000.0,0,,False,,,
1,,googlehome,Do you want to liberate your Google so it can ...,t2_8hlsb,False,,0,False,My Google has Coronavirus! - Monthly Rants and...,[],...,True,https://www.reddit.com/r/googlehome/comments/g...,152107,1588345000.0,0,,False,,,
2,,googlehome,Because that's all it is.,t2_1mgxh,False,,0,False,[Subreddit Request] Can we rename this subredd...,[],...,False,https://www.reddit.com/r/googlehome/comments/g...,152107,1589287000.0,0,,False,489ab03e-b78e-11e6-8fd6-0e00dc2f4472,,
3,,googlehome,,t2_mnm0w,False,,0,False,This one really takes the cake.,[],...,False,https://i.imgur.com/vli0N3i.jpg,152107,1589250000.0,0,,False,,image,{'images': [{'source': {'url': 'https://extern...
4,,googlehome,,t2_me5yarn,False,,0,False,Where did my routines go to? They're still in ...,[],...,False,https://i.redd.it/nmag8r30h6y41.png,152107,1589221000.0,0,,False,0ab35abe-b78e-11e6-9303-0ed3c72a1f42,image,{'images': [{'source': {'url': 'https://previe...
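One caveat of the CSV round trip: nested values such as the dictionaries in `preview` come back as plain strings, because CSV cells can only hold text. The sketch below shows the effect with the standard-library `csv` module rather than pandas (a swapped-in tool, so that it runs without extra dependencies); pandas' `to_csv` and `read_csv` behave similarly for object columns. The sample row and URL are invented.

```python
import ast
import csv
import io

# Invented row with a nested value, similar in shape to the 'preview' column.
row = {
    "title": "Image post",
    "preview": {"images": [{"source": {"url": "https://example.com/a.jpg"}}]},
}

# Write the row to an in-memory CSV; the dict is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: every cell is now a string.
buffer.seek(0)
restored = next(csv.DictReader(buffer))
print(type(restored["preview"]).__name__)

# If the nested structure is needed again, it can be parsed explicitly.
parsed = ast.literal_eval(restored["preview"])
print(parsed["images"][0]["source"]["url"])
```

If the nested fields matter downstream, saving the raw posts as JSON instead would avoid the extra parsing step.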


In [35]:
# 'after' holds the ID of the last post in the current pull.
# Passing it back as a query-string parameter (the key=value part after '?' in the URL)
# requests the next page.
reddit_dict1['data']['after']

't3_ghoywt'
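To make the query-string idea concrete, the sketch below builds the URL for the next page from an `after` value using only the standard library. `requests` does the equivalent when given `params={'after': ...}`. The helper name `listing_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.reddit.com/r/googlehome/.json"


def listing_url(after=None):
    """Return the listing URL, adding ?after=<id> when a cursor is given."""
    if after is None:
        return BASE_URL
    return f"{BASE_URL}?{urlencode({'after': after})}"


first_page = listing_url()
next_page = listing_url(after="t3_ghoywt")
print(first_page)
print(next_page)
```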

### Actual Reddit web scrape

In [36]:
# Parameters for pulling roughly 1,000 Reddit posts
url = 'https://www.reddit.com/r/googlehome/.json'
header = {'User-agent': 'Pony Inc 1.0'}

In [37]:
# Get roughly 1,000 posts; Reddit returns approx. 25 posts per request
# Start with an empty list of posts; 'after' is None for the first pull
posts = []
after = None
# Extend posts with each pull
# Provide feedback via progress bar, cursor and sleep interval per pull
for _ in trange(40, desc='pull'):
    # Send no query parameters on the first pull, when 'after' is None
    if after is None:
        param = {}
    else:
        param = {'after': after}
    # Log the cursor for this pull (the real query string is built by requests from param)
    print(f"https://www.reddit.com/r/googlehome/.json?after{after}")

    res = requests.get(url, params=param, headers=header)

    # Stop pulling if the status code is not 200
    if res.status_code == 200:
        reddit_gh = res.json()
        # Track and print the number of posts in the current pull
        current_posts = [p['data'] for p in reddit_gh['data']['children']]
        print("No. of posts pulled: " + str(len(current_posts)))

        # Extend the running list with the new posts
        posts.extend(current_posts)

        # Move the cursor to the last post of this pull
        after = reddit_gh['data']['after']
    else:
        print(res.status_code)
        break

    # Overwrite the CSV with everything pulled so far
    pd.DataFrame(posts).to_csv('reddit_gh.csv', index=False)

    # Sleep for a random duration to look more 'natural'
    sleep_duration = random.randint(1, 5)
    print(sleep_duration)
    time.sleep(sleep_duration)


https://www.reddit.com/r/googlehome/.json?afterNone
No. of posts pulled: 27
4
https://www.reddit.com/r/googlehome/.json?aftert3_ghoywt
No. of posts pulled: 25
4
https://www.reddit.com/r/googlehome/.json?aftert3_ghd5ln
No. of posts pulled: 25
1
https://www.reddit.com/r/googlehome/.json?aftert3_gh0gtp
No. of posts pulled: 25
4
https://www.reddit.com/r/googlehome/.json?aftert3_ggh8ys
No. of posts pulled: 25
3
https://www.reddit.com/r/googlehome/.json?aftert3_gg35zo
No. of posts pulled: 25
1
https://www.reddit.com/r/googlehome/.json?aftert3_gfmgks
No. of posts pulled: 25
3
https://www.reddit.com/r/googlehome/.json?aftert3_gfainv
No. of posts pulled: 25
3
https://www.reddit.com/r/googlehome/.json?aftert3_gezmw1
No. of posts pulled: 25
3
https://www.reddit.com/r/googlehome/.json?aftert3_gekkrr
No. of posts pulled: 25
5
https://www.reddit.com/r/googlehome/.json?aftert3_gdyjk7
No. of posts pulled: 25
5
https://www.reddit.com/r/googlehome/.json?aftert3_gdan12
No. of posts pulled: 25
1
https://w
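The cursor-following logic of the loop can also be isolated from the HTTP call, which makes it testable offline. The sketch below is one possible arrangement, not the notebook's own code: `collect_posts` takes any `fetch_page(after)` callable that returns a parsed listing, or `None` on failure. It also stops once `after` comes back as `None`, on the assumption that this marks the end of the listing and that continuing would start again from the first page. `FAKE_PAGES` is invented data standing in for real responses.

```python
def collect_posts(fetch_page, max_pulls=40):
    """Follow 'after' cursors until max_pulls is reached or the listing ends."""
    posts, after = [], None
    for _ in range(max_pulls):
        listing = fetch_page(after)
        if listing is None:  # fetch_page signals a failed request with None
            break
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # assumed end of listing; avoid restarting from the top
            break
    return posts


# Invented three-page listing, keyed by the cursor that requests each page.
FAKE_PAGES = {
    None: {"data": {"after": "t3_a", "children": [{"data": {"name": "t3_1"}}, {"data": {"name": "t3_a"}}]}},
    "t3_a": {"data": {"after": "t3_b", "children": [{"data": {"name": "t3_b"}}]}},
    "t3_b": {"data": {"after": None, "children": [{"data": {"name": "t3_c"}}]}},
}

collected = collect_posts(FAKE_PAGES.get)
names = [post["name"] for post in collected]
print(names)
```

In the notebook, `fetch_page` would wrap `requests.get(url, params=..., headers=header)` and return `res.json()` on a 200 status, with the random sleep placed inside that wrapper.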

In [38]:
len(posts)

996

In [39]:
# Check csv file
df_read = pd.read_csv('reddit_gh.csv')

In [40]:
df_read.shape

(996, 111)
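Paginated pulls can, in principle, return the same post more than once, for example when posts shift position between requests. This is an assumption about listing behavior and was not checked in this run. A minimal sketch of de-duplication follows, assuming each post dictionary carries Reddit's `name` field (the `t3_...` identifier that `after` also uses). The helper `dedupe_posts` and the sample data are hypothetical.

```python
def dedupe_posts(posts, key="name"):
    """Drop repeated posts, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for post in posts:
        ident = post.get(key)
        if ident is not None and ident in seen:
            continue  # already collected in an earlier pull
        seen.add(ident)
        unique.append(post)
    return unique


# Invented sample: the second pull repeats one post from the first.
pulled = [
    {"name": "t3_aaa", "title": "Pinned FAQ"},
    {"name": "t3_bbb", "title": "Help with routines"},
    {"name": "t3_aaa", "title": "Pinned FAQ"},
]

unique_posts = dedupe_posts(pulled)
print(len(pulled), len(unique_posts))
```

With pandas, `drop_duplicates(subset='name')` on the DataFrame would serve the same purpose.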

In [41]:
df_read.tail()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,media,is_video,link_flair_template_id,post_hint,preview,media_metadata,crosspost_parent_list,crosspost_parent,author_cakeday,poll_data
991,,googlehome,,t2_kwua3,False,,0,False,Google Home Max $50 off at Target.com,[],...,,False,dad079e8-d6c3-11e7-96b4-0e468bcaecf2,link,{'images': [{'source': {'url': 'https://extern...,,,,,
992,,googlehome,I just want google home to 'ding' as feedback....,t2_h5yiw,False,,0,False,Can you turn off speech confirmation?,[],...,,False,,,,,,,,
993,,googlehome,Hello everyone. I've set up a Google home rout...,t2_5ipj6bd5,False,,0,False,Help with routines,[],...,,False,008d9c20-b78e-11e6-9e45-0e53646228d0,,,,,,,
994,,googlehome,Used to stay on forever. Now the stream turns ...,t2_d7dbe,False,,0,False,"“Hey google, stream nest driveway on living ro...",[],...,,False,,,,,,,,
995,,googlehome,Is there any way to view upcoming songs being ...,t2_one13,False,,0,False,Control Google play music through home without...,[],...,,False,008d9c20-b78e-11e6-9e45-0e53646228d0,,,,,,,
