


<img src= "images/reddit.png" alt="reddit logo" align="left"> 



# Project 3: Web APIs & Classification (Part 1)

## Contents: 


#### [Step 1:  Problem Statement](#Step-1:-Problem-Statement)


#### [Step 2:  Gather Data](#Step-2:-Gather-Data)


<span style="color:gray">*Steps 3 to 6 can be found in a separate Jupyter notebook, [Part 2](part_2-explore_data_and_modelling.ipynb)*</span>

---

# Step 1: Problem Statement

<img src= "images/Google_Home_vs_Amazon_Alexa.png" alt="gh_alexa"> 

**[Amazon Echo](https://en.wikipedia.org/wiki/Amazon_Echo)** (aka **Alexa**) is a virtual home assistant device by Amazon. Launched in November 2014, it dominates the US market with a 61% share. [[source]](https://voicebot.ai/2019/03/07/u-s-smart-speaker-ownership-rises-40-in-2018-to-66-4-million-and-amazon-echo-maintains-market-share-lead-says-new-report-from-voicebot/)  However, its main competitor, **Google Home** (rebranded as [Google Nest](https://en.wikipedia.org/wiki/Google_Nest_(smart_speakers))), is expanding rapidly in Asia given Google's stronger presence there.  

To expand Alexa in the Asian market, the marketing analytics team at Amazon would like to understand what users of each brand are talking about and distill insights that could help drive marketing campaigns in Asia.

Reddit is the perfect source of data for this case because:
- **Direct user feedback:** Reddit is an American social news aggregation, web content rating, and discussion website whose discussion forums (aka 'Subreddits') are *communities* formed around user-created areas of interest. The website is known for its open nature and the diverse user community that generates its content.
- **Strong presence in the US/UK/Germany/Canada:** As we're looking for feedback in markets where both devices have high presence, Reddit's user base fits this profile.  [[source-Reddit]](https://www.statista.com/statistics/325144/reddit-global-active-user-distribution/) [[source-Devices]](https://voicebot.ai/2019/04/15/smart-speaker-installed-base-to-surpass-200-million-in-2019-grow-to-500-million-in-2023-canalys/)

The 2 relevant Subreddits for this project are:
- **r/alexa**  (40.5k Members / 201 Online / latest post: less than 1 day)
- **r/googlehome** (202k Members / 710 Online / latest post: less than 1 day)


In the process of gathering data from the Reddit API, the team lost the labels of the subreddit texts.  

### So the scope of the project is twofold:

### <span style='color:royalblue'>Problem 1: Classifying the texts with missing label</span>

Develop a classification model to correctly label the unclassified texts.  The marketing analytics team at Amazon would like to understand what users of each brand are talking about.  Our focus is on classifying Alexa texts as completely as possible, to get deeper insights for developing the marketing plan, while for Google Home the team only needs enough data for comparative analysis.

### <span style='color:royalblue'>Problem 2: Insights for Marketing Analytics</span>

Identify key insights from user discussions and identify next steps.

---


## Executive Summary

**Objective:** 
1. To create a text classification model using Natural Language Processing to correctly classify the texts with missing labels.  The focus is on capturing `alexa` texts for marketing insights, while only enough `googlehome` texts are needed for comparative analysis.
2. To gain insights from consumer discussions and identify next steps for the marketing analytics team.


**Process:** 

To address our problem statement, the following approach was taken:

<img src= "images/process.png" alt="process"> 


**Outcome:** 

The classification model has around 80% accuracy in classifying the missing labels.  
The following key insights & next steps have been identified:
- The small formats (Amazon Echo Dot and Google Nest Mini) are the most talked about and could be used to recruit new users.
- Music, Light, and Time are common functions for both, while Routine is more unique to Alexa.  Further sentiment analysis on these topics and more topic modelling could help to shape the marketing plan.

Details can be found in [Part 2](part_2-explore_data_and_modelling.ipynb) 

---



# Step 2: Gather Data


### Import all libraries:

In [1]:
import requests
import pandas as pd
import time
import random

### Define a function to gather data from the Reddit API

In [2]:
def get_subreddit(name, url):
    """Download up to 35 pages of a subreddit listing and save them to datasets/{name}.csv."""
    posts = []
    after = None

    for a in range(35):
        # the first request has no `after` token; later requests continue from the last post seen
        # NOTE: once the listing is exhausted, `after` comes back as None and paging restarts from the first page
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Home Inc 1.0'})

        # stop on any non-OK response
        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        subreddit_dict = res.json()
        current_posts = [p['data'] for p in subreddit_dict['data']['children']]
        posts.extend(current_posts)
        after = subreddit_dict['data']['after']

        # save to a csv file: the first iteration creates the file, later iterations append to it
        # NOTE: `posts` holds every post fetched so far, so each append writes earlier posts again;
        # the saved file therefore contains duplicate rows that need to be dropped during cleaning
        if a > 0:
            prev_posts = pd.read_csv(f'datasets/{name}.csv')
            current_df = pd.DataFrame(posts)
            combined_df = pd.concat([prev_posts, current_df], axis=0)
            combined_df.to_csv(f'datasets/{name}.csv', index=False)
        else:
            pd.DataFrame(posts).to_csv(f'datasets/{name}.csv', index=False)

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2, 6)
        print(sleep_duration)
        time.sleep(sleep_duration)

    print(f'Download completed.\n File saved to: (datasets/{name}.csv)')
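
Because `posts` accumulates across iterations while the function above also appends it to the existing file, earlier pages end up being written more than once. The following is a minimal sketch of one possible alternative, not the approach used in this notebook. It assumes a hypothetical `fetch_page` callable (any function that takes a URL and returns the decoded JSON listing) so that the paging logic is separate from the network call, and it assumes each post carries Reddit's `name` field (its fullname, such as `t3_ilv6u0`).

```python
def collect_posts(url, fetch_page, max_pages=35):
    """Collect post dicts from a Reddit-style listing, one page at a time.

    `fetch_page` is a hypothetical helper: any callable that takes a URL
    and returns the decoded JSON listing as a dict.
    """
    posts = []
    seen = set()      # fullnames of posts already collected
    after = None

    for _ in range(max_pages):
        current_url = url if after is None else f"{url}?after={after}"
        listing = fetch_page(current_url)["data"]

        for child in listing["children"]:
            post = child["data"]
            if post["name"] not in seen:   # skip posts that were already stored
                seen.add(post["name"])
                posts.append(post)

        after = listing["after"]
        if after is None:                  # listing exhausted: stop instead of restarting
            break

    return posts
```

With `requests`, `fetch_page` could be as simple as `lambda u: requests.get(u, headers={'User-agent': 'Home Inc 1.0'}).json()`, and the returned list could be written once with `pd.DataFrame(posts).to_csv(..., index=False)`. A real run would still want the status check and the sleep between requests from the original function.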

### Assign subreddits and identify URLs

In [3]:
subreddit_1 = 'alexa'

In [4]:
subreddit_2 = 'googlehome'

In [5]:
url_1 = f'https://www.reddit.com/r/{subreddit_1}.json'

In [6]:
url_2 = f'https://www.reddit.com/r/{subreddit_2}.json'
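
As a hedged sketch, the listing URLs and their paging parameters could also be assembled with the standard library rather than by string concatenation. `listing_url` is a hypothetical helper that does not appear in this notebook, and the `limit` parameter (Reddit listings are understood to accept up to 100 items per request) is an assumption that the download above does not use.

```python
from urllib.parse import urlencode


def listing_url(subreddit, after=None, limit=None):
    """Build a subreddit listing URL, adding only the query parameters that are set."""
    base = f"https://www.reddit.com/r/{subreddit}.json"
    params = {}
    if after is not None:
        params["after"] = after      # fullname of the last post already seen
    if limit is not None:
        params["limit"] = limit      # assumed page-size parameter
    return f"{base}?{urlencode(params)}" if params else base


print(listing_url("alexa"))
print(listing_url("alexa", after="t3_ilv6u0"))
```

Keeping URL construction in one place makes it easier to add or remove parameters without touching the download loop.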

### Call the function to get subreddits 

In [7]:
get_subreddit(subreddit_1, url_1)

https://www.reddit.com/r/alexa.json
6
https://www.reddit.com/r/alexa.json?after=t3_ilv6u0
4
https://www.reddit.com/r/alexa.json?after=t3_ik44ah
5
https://www.reddit.com/r/alexa.json?after=t3_iiccu6
6
https://www.reddit.com/r/alexa.json?after=t3_igjy43
5
https://www.reddit.com/r/alexa.json?after=t3_iep0ee
5
https://www.reddit.com/r/alexa.json?after=t3_idnogn
4
https://www.reddit.com/r/alexa.json?after=t3_ic7rxn
6
https://www.reddit.com/r/alexa.json?after=t3_ib2vr9
6
https://www.reddit.com/r/alexa.json?after=t3_i9ljjz
2
https://www.reddit.com/r/alexa.json?after=t3_i8imwt
5
https://www.reddit.com/r/alexa.json?after=t3_i6ucki
6
https://www.reddit.com/r/alexa.json?after=t3_i5i10z
4
https://www.reddit.com/r/alexa.json?after=t3_i3lcj7
4
https://www.reddit.com/r/alexa.json?after=t3_i22i3b
4
https://www.reddit.com/r/alexa.json?after=t3_i0g0fx
4
https://www.reddit.com/r/alexa.json?after=t3_hzeto1
3
https://www.reddit.com/r/alexa.json?after=t3_hygs8s
4
https://www.reddit.com/r/alexa.json?after=t3

4
https://www.reddit.com/r/alexa.json?after=t3_hii36f
6
https://www.reddit.com/r/alexa.json?after=t3_hgui3b
5
https://www.reddit.com/r/alexa.json?after=t3_hfwg8r
4
https://www.reddit.com/r/alexa.json?after=t3_hee425
5
https://www.reddit.com/r/alexa.json?after=t3_hd5rmx
3
https://www.reddit.com/r/alexa.json?after=t3_hc9gwm
3
https://www.reddit.com/r/alexa.json
2
https://www.reddit.com/r/alexa.json?after=t3_ilv6u0
5
Download completed.
 File saved to: (datasets/alexa.csv)


In [8]:
get_subreddit(subreddit_2, url_2)

https://www.reddit.com/r/googlehome.json
4
https://www.reddit.com/r/googlehome.json?after=t3_imyyxl
2
https://www.reddit.com/r/googlehome.json?after=t3_imdgzs
2
https://www.reddit.com/r/googlehome.json?after=t3_ilkjkb
4
https://www.reddit.com/r/googlehome.json?after=t3_il43ou
5
https://www.reddit.com/r/googlehome.json?after=t3_iky4ep
2
https://www.reddit.com/r/googlehome.json?after=t3_ijzpdo
5
https://www.reddit.com/r/googlehome.json?after=t3_ik0bp2
2
https://www.reddit.com/r/googlehome.json?after=t3_ijs5c8
6
https://www.reddit.com/r/googlehome.json?after=t3_ij7k26
2
https://www.reddit.com/r/googlehome.json?after=t3_iinmxp
6
https://www.reddit.com/r/googlehome.json?after=t3_iia61b
3
https://www.reddit.com/r/googlehome.json?after=t3_ihxkme
3
https://www.reddit.com/r/googlehome.json?after=t3_ih22qx
2
https://www.reddit.com/r/googlehome.json?after=t3_ih1if5
6
https://www.reddit.com/r/googlehome.json?after=t3_igham0
3
https://www.reddit.com/r/googlehome.json?after=t3_igd3xe
5
https://www.r

### Read the saved `.csv` files into DataFrames to check that they were saved correctly

In [9]:
df_1 = pd.read_csv(f'datasets/{subreddit_1}.csv')
df_1.head(2)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,is_video,post_hint,url_overridden_by_dest,preview,crosspost_parent_list,crosspost_parent,is_gallery,media_metadata,gallery_data,poll_data
0,,alexa,There has been a persistent referral link spam...,t2_9i6dd,False,,0,False,[Announcement] Raising the minimum karma neede...,[],...,False,,,,,,,,,
1,,alexa,,t2_1ki5wjwb,False,,0,False,Is there a way to see when a routine has playe...,[],...,False,,,,,,,,,


In [10]:
df_1.tail(2)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,is_video,post_hint,url_overridden_by_dest,preview,crosspost_parent_list,crosspost_parent,is_gallery,media_metadata,gallery_data,poll_data
15743,,alexa,I think maybe it was either an alarm or a remi...,t2_i7q0q,False,,0,False,Woke up this morning with Alexa going off with...,[],...,False,,,,,,,,,
15744,,alexa,"Voice commands Pause, Resume, and Stop no long...",t2_y6sn2o5,False,,0,False,"Pause, Resume, and Stop No Longer Work With Sp...",[],...,False,,,,,,,,,


In [11]:
df_1.shape

(15745, 113)

In [12]:
# inspect the columns around index 78, which pandas flagged with the warning "Columns (78) have mixed types"
# these columns will not be used in the classification and will be dropped
df_1.iloc[0:10,77:80]

Unnamed: 0,num_reports,distinguished,subreddit_id
0,,moderator,t5_2qtg6
1,,,t5_2qtg6
2,,,t5_2qtg6
3,,,t5_2qtg6
4,,,t5_2qtg6
5,,,t5_2qtg6
6,,,t5_2qtg6
7,,,t5_2qtg6
8,,,t5_2qtg6
9,,,t5_2qtg6
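
The mixed-type warning typically arises when pandas infers a column's type chunk by chunk and finds different types in different chunks. Below is a small sketch of two common ways to avoid it. It assumes the affected column is `distinguished`, as the slice above suggests, and uses a tiny in-memory CSV in place of the real file.

```python
import io

import pandas as pd

# toy stand-in for datasets/alexa.csv (assumption: only three of the columns are shown)
csv_text = (
    "num_reports,distinguished,subreddit_id\n"
    ",moderator,t5_2qtg6\n"
    ",,t5_2qtg6\n"
)

# Option 1: declare the column's type up front so no inference is needed
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"distinguished": str})

# Option 2: have pandas read the whole file before inferring types
df_full = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(df_typed["distinguished"].tolist())
print(df_full.dtypes)
```

Since these columns are dropped before modelling, either option only serves to silence the warning; neither changes the data used for classification.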


In [13]:
df_2 = pd.read_csv(f'datasets/{subreddit_2}.csv')
df_2.head(2)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url_overridden_by_dest,preview,link_flair_template_id,media_metadata,crosspost_parent_list,crosspost_parent,is_gallery,gallery_data,poll_data,author_cakeday
0,,googlehome,[\[FAQ - Frequently Asked Questions\]](https:/...,t2_q648wkk,False,,0,False,FAQ: Please read the subreddit FAQ before post...,[],...,,,,,,,,,,
1,,googlehome,Please tell us about your Google problems. Pos...,t2_q648wkk,False,,0,False,Monthly Rants and Complaints Thread - Septembe...,[],...,,,,,,,,,,


In [14]:
df_2.tail(2)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url_overridden_by_dest,preview,link_flair_template_id,media_metadata,crosspost_parent_list,crosspost_parent,is_gallery,gallery_data,poll_data,author_cakeday
15818,,googlehome,I migrated from GPM (Google Play Music) to YTM...,t2_1zop4s3c,False,,0,False,Wife's music is showing on my YTM,[],...,,,,,,,,,,
15819,,googlehome,Trying to setup a bedtime routine that will pl...,t2_5g4sr,False,,0,False,Command to play White Noise Sleep sounds on sp...,[],...,,,,,,,,,,


In [15]:
df_2.shape

(15820, 115)
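
Both row counts are much larger than 35 listing pages could supply on their own, which is consistent with `get_subreddit` re-appending earlier posts on every iteration. The sketch below shows how the duplication could be measured and removed before modelling. It assumes the saved files contain Reddit's `name` column (the post's fullname, such as `t3_ilv6u0`), and a toy frame stands in for `df_1`.

```python
import pandas as pd

# toy stand-in for df_1: two posts appear more than once
toy = pd.DataFrame({
    "name": ["t3_a", "t3_b", "t3_a", "t3_c", "t3_b"],
    "title": ["first", "second", "first", "third", "second"],
})

# count rows whose fullname has already appeared earlier in the frame
n_duplicates = toy.duplicated(subset="name").sum()

# keep the first occurrence of each post and renumber the index
deduped = toy.drop_duplicates(subset="name").reset_index(drop=True)

print(n_duplicates, deduped.shape)
```

On the real data the same two calls would be applied to `df_1` and `df_2`, for example as part of the cleaning step in Part 2.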

*To avoid re-running the download from Reddit, this notebook ends here.  The data cleaning process is done in the following Jupyter notebook,* **[Part 2](part_2-explore_data_and_modelling.ipynb)**

[Back to top](#Project-3:-Web-APIs-&-Classification-(Part-1))