# Project 3: Web APIs & NLP

---

#### 01: <b>Web Scraping</b>

## Introduction

A marketing and advertising firm that specializes in the Food & Beverage (F&B) industry has been engaged by both a sizeable tea company (think T2 Tea or TWG) and a sizeable coffee chain (think Starbucks) to run digital advertising campaigns (think Facebook and Instagram ads) over the Christmas festive period. The campaigns target tea lovers and tea collectors, and coffee lovers and coffee roasters, respectively.

One issue that the advertising team faced was the association of tea with coffee: the two share similar properties, which might cause the marketing algorithm to target coffee drinkers with tea advertisements and vice versa. Examples of these shared properties are caffeine content, brewing method, taste profiles and crockery used.

## Problem Statement

Our in-house data team has been tasked to build models that identify the words most commonly used by coffee drinkers and tea drinkers respectively, and to recommend the words that the marketing algorithm should pick up when people type them on the internet, so that targeted Facebook ads show up on their feeds.

Ideally, our final model would result in tea advertisements being shown to only tea drinkers and coffee advertisements being shown to only coffee drinkers, thereby reaching a larger and more specific audience and making both advertising campaigns a success.

## Executive Summary

Reddit is a social news website and forum where content is socially curated and promoted by site members through voting. The site name is a play on the words "I read it." The site is composed of hundreds of subcommunities, known as subreddits. Each subreddit has a specific topic, such as technology, politics or music. Reddit's homepage, or the front page, as it is often called, is composed of the most popular posts from each default subreddit. The default list is predetermined and includes subreddits such as "pics," "funny," "videos," "news" and "gaming."

Reddit site members, also known as redditors, submit content which is then voted upon by other members. The goal is to send well-regarded content to the top of the site's front page. Content is voted on via upvotes and downvotes. The more upvotes a post gets, the more popular it becomes, and the higher up it appears on its respective subreddit or the front page. To access a subreddit via the address bar, simply type "reddit.com/r/subreddit name."<sup>[(source)](https://searchcio.techtarget.com/definition/Reddit)</sup>

To address our problem statement, text data is extracted from two subreddits, `r/tea` and `r/Coffee`, via the **Pushshift Reddit API**. The most recent 1500 posts from each subreddit were scraped. Subsequently, null values, URL links, HTML special entities, spam and moderator posts were removed to ensure the quality of the vectorized words used to train our models.
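
The cleaning steps listed above could look roughly like the sketch below. It is only an illustration using the standard library: the regular expressions, the `clean_post` helper and the treatment of `[removed]` / `[deleted]` placeholders are assumptions, not the project's actual cleaning rules.

```python
import re

# Assumed patterns (illustrative only, not the project's actual rules)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
ENTITY_PATTERN = re.compile(r"&(?:#\d+|#x[0-9a-fA-F]+|\w+);")  # e.g. &amp; or &#x200B;
PLACEHOLDERS = {"[removed]", "[deleted]"}  # assumed markers of moderated posts


def clean_post(text):
    """Return the post text with URLs and HTML entities stripped out."""
    if text is None or text.strip() in PLACEHOLDERS:
        return ""  # treat null / removed posts as empty
    text = URL_PATTERN.sub(" ", text)      # drop URL links
    text = ENTITY_PATTERN.sub(" ", text)   # drop HTML special entities
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


sample = "Try this grinder &amp; kettle: https://example.com/deal now"
print(clean_post(sample))
```

Posts that come back empty after cleaning can then be dropped before vectorizing.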

We have tested out a number of combinations of vectorizers, classifiers and text normalization methods:

**Vectorizers used:**

`CountVectorizer` and `TfidfVectorizer`

**Models used:**

`Random Forest`, `Multinomial Naïve Bayes`, `Logistic Regression`

**Text Normalization used:**

`Lemmatization` and `Snowball Stemmer`


We found that the combination of `CountVectorizer` and Multinomial Naive Bayes, with text normalized by the Snowball Stemmer, classified 88.3% of posts correctly. Our primary evaluation metric was `accuracy`, the proportion of correct predictions over all predictions, because our classes were balanced ('tea' = 50.4%, 'coffee' = 49.6%). Furthermore, we are not prioritizing False Positives over False Negatives or vice versa: ideally both should be as low as possible, since misclassifying a post in either direction is equally bad. We will also evaluate the best-performing combination, our final model, by plotting its ROC curve and computing the area under it (`ROC-AUC`).
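
As a small illustration of why accuracy is a reasonable metric here, the sketch below computes accuracy from confusion-matrix counts and compares it with the majority-class baseline implied by the class balance. The confusion-matrix counts are hypothetical and chosen only for illustration; only the 50.4% / 49.6% class split comes from the data.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts (NOT the project's actual results)
tp, tn, fp, fn = 45, 43, 7, 5
model_accuracy = accuracy(tp, tn, fp, fn)

# With balanced classes, always predicting the majority class ('tea')
# would only be right about half of the time.
baseline_accuracy = max(0.504, 0.496)

print(f"model accuracy    = {model_accuracy:.3f}")
print(f"baseline accuracy = {baseline_accuracy:.3f}")
```

Because the baseline sits close to 0.5, an accuracy well above it reflects genuine separation of the two classes rather than class imbalance.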

To interpret our Multinomial Naive Bayes model, we calculate the empirical log probability of each feature given a class. To convert a log probability into an actual probability, we would have to exponentiate it (i.e. `np.exp(feature_log_prob)`). However, as we are really only interested in the top 15 predictor words that matter most when classifying posts as 'Coffee' or 'Tea', ranking by log probability alone suffices: exponentiation is monotonic, so it does not change the ordering.

Now, looking at the results below, some of the top predictive words for 'Tea' included matcha, loose leaf, teapot, herbal and green. This makes sense, as these are words commonly used even by the average tea drinker. Hence, these are also the words that we recommend the marketing algorithm pick up when deciding which users to show the digital tea advertisements to.

Similarly, for 'Coffee', the top predictive words for Multinomial NB included grinder, burr, v60, drip and aeropress.
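
To make the ranking argument concrete, here is a minimal sketch that computes Laplace-smoothed log probabilities by hand for a toy class and checks that ranking by log probability matches ranking by probability. The tokens, vocabulary and the `feature_log_prob` helper are made up for illustration; in scikit-learn the corresponding fitted values are exposed on `MultinomialNB` as the `feature_log_prob_` attribute.

```python
import math
from collections import Counter

# Hypothetical toy data for one class (illustrative only)
tea_tokens = ["matcha", "teapot", "matcha", "brew", "green"]
vocabulary = ["brew", "green", "grinder", "matcha", "teapot"]


def feature_log_prob(tokens, vocabulary, alpha=1.0):
    """Laplace-smoothed log P(word | class) for every vocabulary word."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: math.log((counts[w] + alpha) / total) for w in vocabulary}


log_probs = feature_log_prob(tea_tokens, vocabulary)

# Rank words by log probability and by the exponentiated probability
by_log = sorted(vocabulary, key=lambda w: log_probs[w], reverse=True)
by_prob = sorted(vocabulary, key=lambda w: math.exp(log_probs[w]), reverse=True)

print(by_log)
print(by_log == by_prob)  # the two orderings agree
```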

Overall, we did not observe much overlap in the top predictive words. This is also likely the reason why our test accuracy score was relatively high (88%). 


Tea | Coffee
- | - 
![alt](../Pictures/mnb_tea.png) | ![alt](../Pictures/mnb_coffee.png)



Next, we also calculated the coefficients of our Logistic Regression model, which was our second-best performing model. By looking at the coefficients, we can determine the words that are most strongly associated with r/Coffee and r/tea.

In the case of 'Tea', the top predictive words for the Logistic Regression model also included matcha, loose leaf and tea pot, along with some additional words such as japanese and chai. For 'Coffee', the top predictive words included grinder, v60, pour and maker. Overall, I am satisfied that the top predictive words for the Multinomial NB and Logistic Regression models are consistent with one another. This further reaffirms the words that we recommend the marketing algorithm pick up to show tea advertisements to tea drinkers.

Tea | Coffee
- | - 
![alt](../Pictures/lr_tea.png) | ![alt](../Pictures/lr_coffee.png)
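
The coefficient ranking described above can be sketched as follows. The coefficient values, the class coding (positive favours 'Tea', negative favours 'Coffee') and the `top_words` helper are all assumptions made for illustration; they are not the fitted model's values. Exponentiating a coefficient gives the multiplicative change in the odds of the positive class per unit increase in that feature.

```python
import math

# Hypothetical coefficients (NOT the fitted model's values).
# Assumed coding: positive -> 'Tea', negative -> 'Coffee'.
coefficients = {
    "matcha": 2.1, "teapot": 1.7, "chai": 1.4,
    "grinder": -2.3, "v60": -1.8, "pour": -0.9,
}


def top_words(coefs, n, positive=True):
    """Return the n strongest (word, coefficient) pairs for one class."""
    ranked = sorted(coefs.items(), key=lambda kv: kv[1], reverse=positive)
    return [(w, c) for w, c in ranked[:n] if (c > 0) == positive]


tea_top = top_words(coefficients, 3)
coffee_top = top_words(coefficients, 3, positive=False)

for word, coef in tea_top + coffee_top:
    # exp(coef) > 1 raises the odds of 'Tea'; exp(coef) < 1 lowers them
    print(f"{word:>8}  coef={coef:+.1f}  odds multiplier={math.exp(coef):.2f}")
```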

Our final model's limitations mainly consist of the misclassified posts (i.e. false predictions). Since the primary objective of the model is to accurately predict which subreddit a post belongs to, any failure to do so is considered a limitation. After analysing some of the misclassified posts, we found that some were true misclassifications, while others were low-quality posts that did not relate to either 'Coffee' or 'Tea'. Overall, I am satisfied with the low combined false positive/negative count of 108.

Some recommendations to improve our model in future include:

- a bigger corpus that incorporates a larger set of vocabulary on the topics of coffee and tea. This could also be taken from other sites such as food review blogs and related Facebook groups
- preferences in the Food & Beverage scene are ever-changing. For our model to maintain a comparatively high accuracy, it should ideally be re-trained at regular intervals so that it does not rely on out-of-date information and trends among coffee/tea drinkers, for example the 'Dalgona coffee' craze that took place at the start of COVID-19
- use word similarities (e.g. word2vec) to classify posts instead of frequency
- try other estimators like AdaBoost / GradientBoosting and other stemmers like the Lancaster Stemmer
- explore relationship between post content, number of comments, and upvote ratio
- use VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon to analyze the sentiments of posts

### Contents:
- [Library imports](#Library-imports)
- [Coffee reddit scrape](#Coffee-reddit-scrape)
- [Tea reddit scrape](#Tea-reddit-scrape)

## Web Scraping
---
With regard to our problem statement, text data will be extracted from two subreddits, `r/tea` and `r/Coffee`. To do so, we will use the **Pushshift Reddit API**.

The pushshift.io Reddit API was designed and created by the /r/datasets mod team to help provide enhanced functionality and search capabilities for searching Reddit comments and submissions.

This RESTful API gives full functionality for searching Reddit data and also includes the capability of creating powerful data aggregations. With this API, users can quickly find the data that they are interested in and find fascinating correlations.

Through Pushshift, the following key data has been extracted:
   - Time of posts ('created_utc')
   - Title of posts ('title')
   - Author of posts ('author')
   - Text within posts ('selftext')
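
As a rough sketch of how these four fields might be pulled out of a single post record, the snippet below works on a plain dictionary shaped like one entry of the API's `data` list. The `extract_key_fields` helper, the added `created_at` field and the sample post are hypothetical and only for illustration.

```python
from datetime import datetime, timezone

KEY_FIELDS = ("created_utc", "title", "author", "selftext")


def extract_key_fields(post):
    """Keep only the key fields and add a readable UTC timestamp."""
    record = {field: post.get(field) for field in KEY_FIELDS}
    # 'created_utc' is an epoch timestamp in seconds
    record["created_at"] = datetime.fromtimestamp(
        post["created_utc"], tz=timezone.utc
    ).isoformat()
    return record


# Hypothetical post record (illustrative values only)
sample_post = {
    "created_utc": 1642036674,
    "title": "Example title",
    "author": "example_user",
    "score": 1,  # extra fields are ignored
}

print(extract_key_fields(sample_post))
```

Using `post.get` means posts without a `selftext` key (for example, image-only posts) yield `None` instead of raising an error.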

### Library imports

In [1]:
import requests
import pandas as pd
pd.set_option("display.max_columns", 90)
pd.set_option("display.max_rows", 200)

### Function for webscraping

We have created a function that extracts the latest 1500 subreddit posts via a for-loop (15 iterations), since the maximum number of posts that can be extracted in a single Pushshift Reddit API request is only 100. Each iteration passes the `created_utc` of the oldest post retrieved so far as the `before` parameter, so that the next request returns the next 100 older posts.

We chose the 1500 latest posts because, after removing duplicates, posts that only contain images/hyperlinks, spam and promotional posts, we will still have safely more than 1000 posts, which should be sufficient data for our model to learn from.

Furthermore, we chose the 'latest' rather than the 'most popular' posts because of the ever-changing tastes and preferences of consumers in the F&B industry (in this case 'coffee' and 'tea'). The most popular posts may have been made many years ago and thus may not be relevant in today's context.
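
The paging logic can be tried offline with a stand-in for the API. In the sketch below, `fake_fetch` is a hypothetical stub that pretends 2000 posts exist at timestamps 1 to 2000 and returns at most 100 of them per call, newest first; `paginate` is an illustrative helper, not part of the notebook.

```python
def fake_fetch(before, size=100):
    """Hypothetical stand-in for one API request.

    Returns up to `size` posts strictly older than `before`, newest first.
    """
    newest = min(before - 1, 2000)  # pretend posts exist at timestamps 1..2000
    oldest_exclusive = max(newest - size, 0)
    return [{"created_utc": t} for t in range(newest, oldest_exclusive, -1)]


def paginate(fetch, start_before, pages=15):
    """Collect posts page by page, moving the `before` cursor backwards."""
    posts, before = [], start_before
    for _ in range(pages):
        batch = fetch(before)
        if not batch:  # nothing older left to fetch
            break
        posts.extend(batch)
        before = batch[-1]["created_utc"]  # oldest post in this batch
    return posts


posts = paginate(fake_fetch, start_before=2001)
print(len(posts), posts[0]["created_utc"], posts[-1]["created_utc"])
```

Moving the cursor to the oldest timestamp of each batch is what keeps consecutive pages from overlapping.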

In [2]:
# web scraping function
def web_scrape(subreddit_post):
    """Scrape the latest 1500 posts of a subreddit via Pushshift and export them to csv."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    bef = 1642036674  # fixed starting point (epoch seconds); posts before this are scraped
    frames = []
    for _ in range(15):  # 15 requests x 100 posts = 1500 posts
        params = {'subreddit': subreddit_post,
                  'size': 100,
                  'sort': 'desc',
                  'before': bef}
        res = requests.get(url, params=params)
        res.raise_for_status()  # fail loudly on an unsuccessful request
        posts = res.json()['data']
        if not posts:  # no older posts left to fetch
            break
        frames.append(pd.DataFrame(posts))
        bef = posts[-1]['created_utc']  # oldest post in this batch becomes the next cursor
    df = pd.concat(frames, ignore_index=True, axis=0)
# -------------- Export respective dataframes to csv format ---------------------
    df.to_csv(f"../datasets/{subreddit_post.lower()}.csv", index=False)

## Coffee reddit scrape

In [3]:
# scrape r/Coffee subreddit 
web_scrape('Coffee')

In [4]:
# read csv that was scraped and exported
coffee_df = pd.read_csv("../datasets/coffee.csv")

In [5]:
# sanity check
coffee_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,thumbnail_height,thumbnail_width,url_overridden_by_dest,author_flair_template_id,author_flair_text_color,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,suggested_sort,distinguished,gallery_data,is_gallery,media_metadata,author_cakeday
0,[],False,Oddish_Flumph,,[],,text,t2_7so9zfi,False,False,True,[],False,False,1642034770,self.Coffee,https://www.reddit.com/r/Coffee/comments/s2lvp...,{},s2lvpk,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Coffee/comments/s2lvpk/new_coffee_enjoyer_l...,False,6,1642034780,1,Ive been enjoying coffee more in general and b...,True,False,False,Coffee,t5_2qhze,860476,public,self,new coffee enjoyer looking for suggestions one...,0,[],1.0,https://www.reddit.com/r/Coffee/comments/s2lvp...,all_ads,6,,,,,,,,,,,,,,,,,,,
1,[],False,PsychosisCustoms,,[],,text,t2_fuk9bnko,False,False,False,[],False,False,1642032948,self.Coffee,https://www.reddit.com/r/Coffee/comments/s2la4...,{},s2la4s,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Coffee/comments/s2la4s/99_100_people_do_not...,False,6,1642032959,1,[https://youtu.be/UQV0J-IgcyE](https://youtu.b...,True,False,False,Coffee,t5_2qhze,860447,public,self,99% - 100% people Do Not Know... Tell Your Bud...,0,[],1.0,https://www.reddit.com/r/Coffee/comments/s2la4...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 'qdLqeep0...",,,,,,,,,,,,,,,,,
2,[],False,KlutzyNail435,,[],,text,t2_6bua15wh,False,False,False,[],False,False,1642031561,self.Coffee,https://www.reddit.com/r/Coffee/comments/s2krk...,{},s2krkh,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Coffee/comments/s2krkh/my_coffee_tastes_lik...,False,6,1642031571,1,"Ok so this may sound dumb, but whenever I make...",True,False,False,Coffee,t5_2qhze,860427,public,self,My coffee tastes like tomato?? 🍅☕️,0,[],1.0,https://www.reddit.com/r/Coffee/comments/s2krk...,all_ads,6,,,,,,,,,,,,,,,,,,,
3,[],False,KlutzyNail435,,[],,text,t2_6bua15wh,False,False,False,[],False,False,1642031280,self.Coffee,https://www.reddit.com/r/Coffee/comments/s2knm...,{},s2knmx,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Coffee/comments/s2knmx/my_coffee_tastes_sour/,False,6,1642031290,1,I was gifted an espresso machine for Christmas...,True,False,False,Coffee,t5_2qhze,860426,public,self,My coffee tastes sour. ☕️,0,[],1.0,https://www.reddit.com/r/Coffee/comments/s2knm...,all_ads,6,,,,,,,,,,,,,,,,,,,
4,[],False,Lovewinsbruh,,[],,text,t2_i181cjym,False,False,False,[],False,False,1642029981,self.Coffee,https://www.reddit.com/r/Coffee/comments/s2k6h...,{},s2k6hw,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/Coffee/comments/s2k6hw/i_had_my_first_ever_...,False,6,1642029992,1,,True,False,False,Coffee,t5_2qhze,860406,public,self,I had my first ever nitro from Starbucks an ho...,0,[],1.0,https://www.reddit.com/r/Coffee/comments/s2k6h...,all_ads,6,,,,,,,,,,,,,,,,,,,


## Tea reddit scrape

In [6]:
# scrape r/tea subreddit
web_scrape('tea')

In [7]:
# read csv that was scraped and exported
tea_df = pd.read_csv("../datasets/tea.csv")

In [8]:
# sanity check
tea_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,removed_by_category,gallery_data,is_gallery,media_metadata,author_cakeday,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,author_flair_text_color,poll_data,author_flair_template_id,crosspost_parent,crosspost_parent_list
0,[],False,_pureguava,,[],,text,t2_aor7cgt8,False,False,False,[],False,False,1642036605,i.imgur.com,https://www.reddit.com/r/tea/comments/s2mijw/r...,{},s2mijw,False,True,False,False,False,True,False,False,,reco,"[{'e': 'text', 't': 'Recommendation'}]",7863b26c-9f57-11e4-a2b0-22000bc1889b,Recommendation,dark,richtext,False,False,True,1,0,False,all_ads,/r/tea/comments/s2mijw/rooibos_is_one_of_my_fa...,False,image,"{'enabled': True, 'images': [{'id': '-VdnpfNai...",6,1642036616,1,,True,False,False,tea,t5_2qq5e,562520,public,https://b.thumbs.redditmedia.com/-vlldqIrr2COu...,140.0,140.0,Rooibos is one of my favorites right now,0,[],1.0,https://i.imgur.com/DE4cist.jpg,https://i.imgur.com/DE4cist.jpg,all_ads,6,,,,,,,,,,,,,,,
1,[],False,DS9B5SG-1,,[],,text,t2_aitvrdc4,False,False,False,[],False,False,1642034639,self.tea,https://www.reddit.com/r/tea/comments/s2lu50/i...,{},s2lu50,False,False,False,False,False,False,True,False,,help,"[{'e': 'text', 't': 'Question/Help'}]",64c60b7e-9f57-11e4-adfe-22000b680aa5,Question/Help,dark,richtext,False,False,True,0,0,False,all_ads,/r/tea/comments/s2lu50/is_cold_tea_as_healthy_...,False,,,6,1642034649,1,[removed],True,False,False,tea,t5_2qq5e,562498,public,self,,,Is cold tea as healthy as hot tea?,0,[],1.0,https://www.reddit.com/r/tea/comments/s2lu50/i...,,all_ads,6,automod_filtered,,,,,,,,,,,,,,
2,[],False,ImprovementSenior992,,[],,text,t2_eyyrrakw,False,False,False,[],False,False,1642031749,makeityourown.com,https://www.reddit.com/r/tea/comments/s2ktzg/e...,{},s2ktzg,False,True,False,False,False,True,False,False,,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/tea/comments/s2ktzg/earl_grey_liqueur/,False,link,"{'enabled': False, 'images': [{'id': 'HzrpR3Eb...",6,1642031759,1,,True,False,False,tea,t5_2qq5e,562465,public,https://a.thumbs.redditmedia.com/2svGMXWKYVg-a...,140.0,140.0,Earl Grey Liqueur,0,[],1.0,https://makeityourown.com/recipes/blueberry-ea...,https://makeityourown.com/recipes/blueberry-ea...,all_ads,6,,,,,,,,,,,,,,,
3,[],False,DisastrousTarget5060,,[],,text,t2_8alp9lvv,False,False,False,[],False,False,1642026117,self.tea,https://www.reddit.com/r/tea/comments/s2ips5/i...,{},s2ips5,False,True,False,False,False,True,True,False,,reco,"[{'e': 'text', 't': 'Recommendation'}]",7863b26c-9f57-11e4-a2b0-22000bc1889b,Recommendation,dark,richtext,False,False,True,0,0,False,all_ads,/r/tea/comments/s2ips5/i_want_to_try_tea_but_d...,False,,,6,1642026128,1,My brother in law LOVES tea and he always look...,True,False,False,tea,t5_2qq5e,562404,public,self,,,I want to try tea but don't know where to start,0,[],1.0,https://www.reddit.com/r/tea/comments/s2ips5/i...,,all_ads,6,,,,,,,,,,,,,,,
4,[],False,Busy_Letter7448,,[],,text,t2_bc3wf232,False,False,False,[],False,False,1642025247,self.tea,https://www.reddit.com/r/tea/comments/s2idnv/t...,{},s2idnv,False,True,False,False,False,True,True,False,,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/tea/comments/s2idnv/twinings_chai/,False,,,6,1642025258,1,Any advice on how to make Twinings Chai Tea ta...,True,False,False,tea,t5_2qq5e,562397,public,self,,,Twinings Chai ...,0,[],1.0,https://www.reddit.com/r/tea/comments/s2idnv/t...,,all_ads,6,,,,,,,,,,,,,,,
