# Project 3 : Webscraping & NLP

Notebook 1 of 4

## Introduction

### Executive Summary
Brand positioning and digital presence are two important success factors in today's brand competition. Companies analyze their competitors from many angles and by every feasible means. The full analysis consists of multiple subprojects that run across multiple platforms and draw on a nationwide community. This project, one subset of that work, focuses only on Reddit. It aims to help Dunkin' Brands Group Inc. (Dunkin' Donuts) understand its online presence and product reviews relative to its largest competitor, Starbucks Corporation.

A subreddit provides an unbiased platform for discussion of all things regarding a brand. Customers, employees, connoisseurs, and executive chefs are all welcome to join to celebrate, commiserate, or inform. Each community supports general discussion of the brand, whatever the source. [source](https://www.reddit.com/r/DunkinDonuts/)

As a first step, we take the latest Reddit posts from each brand's subreddit and build a classification model that assigns each post to either 'Dunkin' Donuts' or 'Starbucks'. The model is binary because the output variable has two classes. After evaluating various models, our top five were *Logistic Regression, Extra Tree Classifier, Ridge Classifier, Gradient Boost Classifier, and Random Forest*. *Logistic Regression* was chosen as the best model because it gave the highest accuracy score.

This classification model has several use cases. We can use unbiased posts to assess how the community perceives the Dunkin' Donuts brand and its competitor. We also gain insights from the community that help us evaluate and reflect on our marketing strategies, determine where to focus, and pick up the latest trends. This in turn helps with forecasting sales and demand, planning additional manpower for seasonal peaks, and exploring more customization services.

From the research, Starbucks appears to be more popular and to have a more active community than Dunkin'. Some trending topics are similar for both coffee chains, namely their services and new products. After performing sentiment and emotion analysis on the Reddit posts, 'iced coffee' and 'reward' appear to attract more positive sentiment, so these two topics could be the focus of a marketing campaign.

We share potential future use cases of our work in a later part of the notebook.

### Introduction & Problem Statement
Dunkin' has a digital presence across Twitter, Facebook, Instagram, and Pinterest and has launched many successful digital campaigns to attract new customers and increase sales. It has implemented a simple strategy to enhance its social media presence, namely, marketing a colorful and quirky personality online. [source](https://unmetric.com/brands/dunkin-donuts) 

Dunkin' is committed to leveraging technology to provide consumer conveniences, for instance the launch of On-the-Go mobile ordering integrated with the Google Assistant. [source](https://www.prnewswire.com/news-releases/dunkin-donuts-integrates-on-the-go-mobile-ordering-with-the-google-assistant-300613861.html) Its popular Pumpkin Spice Signature Latte and fall range of beverages and snacks were released on August 17 and ended recently, on September 13. [source](https://hypebae.com/2022/8/dunkin-donuts-fall-menu-pumpkin-spice-latte-coffee-cold-brew-release-info) Dunkin' is very interested in understanding its online brand presence, its product reviews, and the effectiveness of its marketing strategies compared to its largest competitor, Starbucks.

The first step is to uncover the unbiased trending topics of these two big brands on Reddit. Subsequently, Dunkin' will explore ideas from data across multiple platforms among the nationwide community, including product reviews, share of voice, and mentioners. We are entrusted with developing a classification model that predicts which class a post belongs to. At the same time, we can gain insights from the community to evaluate and reflect on our marketing strategies, determine where to focus, and stay current with the latest trends by being consumer-centric and adapting to consumer insights. As a result, customers get to enjoy a better and more pleasant experience with Dunkin'.

From the analysis, we identified recent topics of interest related to the business and the community’s sentiments towards them. We then provided recommendations for boosting the upcoming marketing campaign, which give Dunkin' an indicative area of focus.

*To approach this problem, our goals are to:*
- Identify the trending topics from the Starbucks and Dunkin' Donuts subreddits
- Assess the community’s sentiments and emotions, both in general and towards specific topics and products
- Develop a classification model to distinguish Starbucks posts from Dunkin' Donuts posts

#### Key Questions
- Which community is more active?
- What are the trending topics for each community?
- What is the best model to classify posts?
- Which products should we focus our marketing on?
- Regarding top topics, what are the community’s sentiments and emotions towards them?

### Data Science Process
- Data Collection
- Data Cleaning and Exploration
- Pre-processing
- Modelling
- Model Evaluation
- Sentiments and Emotions Analysis

### Data Collection

In this process, we extract the last 2,500 posts from each of the Dunkin Donuts and Starbucks subreddits for analysis.
- Web-scraped using the Pushshift Reddit API
- Subreddits: Dunkin Donuts and Starbucks
- Time frame: posts created before Thursday, September 15, 2022, 11:37 GMT+08:00
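
The cutoff above is passed to the API as a Unix timestamp. As a minimal sketch, the conversion between that timestamp and the local time quoted in the text can be checked with the standard library. The names `CUTOFF_UTC`, `GMT_PLUS_8`, and `epoch_to_local` are illustrative only and are not part of the original notebook; the timestamp value is the one used as the `utc` argument in the scraping cells.

```python
from datetime import datetime, timedelta, timezone

# Cutoff used as the `before` value in this notebook's scraping calls
CUTOFF_UTC = 1663213037

# Fixed GMT+08:00 offset, matching the time zone quoted in the text
GMT_PLUS_8 = timezone(timedelta(hours=8))


def epoch_to_local(epoch_seconds, tz=GMT_PLUS_8):
    """Convert a Unix timestamp in seconds to a timezone-aware datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=tz)


cutoff_local = epoch_to_local(CUTOFF_UTC)
print(cutoff_local.isoformat())
```

The same helper can be applied to any `created_utc` value in the extracted data to read it as a calendar date.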

In [1]:
# import library
import requests
import json
import pandas as pd
import time
import random

In [2]:
# Return the created_utc of the newest submission in a subreddit
def get_latest_utc(subreddit):
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {'subreddit': subreddit, 'size': 1}
    res = requests.get(url, params=params)
    print(res.status_code)
    if res.status_code != 200:
        # Fail loudly instead of indexing into an empty list below
        raise RuntimeError(f"request failed with status {res.status_code}")
    posts = res.json()['data']
    if not posts:
        raise ValueError(f"no submissions returned for r/{subreddit}")
    return posts[0]['created_utc']

In [3]:
get_latest_utc ('DunkinDonuts')

200


1663213037

In [4]:
get_latest_utc ('Starbucks')

200


1663230045

In [5]:
# Page backwards through a subreddit, up to 250 submissions per request
def extract_data(iteration, subreddit, utc):
    reddit_subs = []
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {'subreddit': subreddit, 'size': 250, 'before': utc}

    for i in range(iteration):
        res = requests.get(url, params=params)
        if res.status_code != 200:
            print(f"batch {i} failed with status {res.status_code}")
        else:
            batch = res.json()['data']
            if not batch:
                # Nothing more was returned, so stop paging
                break
            reddit_subs += batch
            # Use the last received submission's timestamp as the next cutoff
            params['before'] = reddit_subs[-1]['created_utc']
            print(f"batch {i} completed")
        # Pause between requests (also after a failure) to avoid hammering the API
        time.sleep(random.randint(10, 20))

    return reddit_subs

## Webscraping 

### Dunkin Donuts

In [6]:
ddonut_df = extract_data(10, 'DunkinDonuts', utc = '1663213037')

batch 0 completed
batch 1 completed
batch 2 completed
batch 3 completed
batch 4 completed
batch 5 completed
batch 6 completed
batch 7 completed
batch 8 completed
batch 9 completed


In [7]:
len(ddonut_df)

2498

In [8]:
ddonut_df = pd.DataFrame(ddonut_df)

In [9]:
ddonut_df[['subreddit', 'selftext', 'title', 'created_utc']]

| | subreddit | selftext | title | created_utc |
|---|---|---|---|---|
| 0 | DunkinDonuts | | My coworker placing the hash browns like army ... | 1663204910 |
| 1 | DunkinDonuts | | whats the deal with these? | 1663196066 |
| 2 | DunkinDonuts | I know I asked about this before but I'm just ... | Working for dunkin | 1663193081 |
| 3 | DunkinDonuts | On the door dash app whenever I order drinks s... | How to make the ice tea ordered from door dash... | 1663190691 |
| 4 | DunkinDonuts | | We still got 4 more hours of shift and this is... | 1663185603 |
| ... | ... | ... | ... | ... |
| 2493 | DunkinDonuts | | Tell me HOW a franchise owner of a new Dunkin ... | 1643834240 |
| 2494 | DunkinDonuts | | Tell me HOW an franchise owner of a Dunkin Don... | 1643834106 |
| 2495 | DunkinDonuts | I ordered a cocoa mocha iced coffee this morni... | What’s the difference between cocoa mocha and ... | 1643832284 |
| 2496 | DunkinDonuts | Today I had someone manually ring up my order ... | Customer Service Question | 1643828728 |


2,498 of the 2,500 requested Dunkin Donuts posts were extracted. The posts are dated from 3 February 2022 to 15 September 2022.

We will use this dataset for analysis.

### Starbucks

In [10]:
sbucks_df = extract_data(10, 'Starbucks', utc = '1663213037')

batch 0 completed
batch 1 completed
batch 2 completed
batch 3 completed
batch 4 completed
batch 5 completed
batch 6 completed
batch 7 completed
batch 8 completed
batch 9 completed


In [11]:
len(sbucks_df)

2499

In [12]:
sbucks_df = pd.DataFrame(sbucks_df)

In [13]:
sbucks_df[['subreddit', 'selftext', 'title', 'created_utc']]

| | subreddit | selftext | title | created_utc |
|---|---|---|---|---|
| 0 | starbucks | Hi all! Hopefully this question isn’t repetiti... | Interview tips? | 1663212467 |
| 1 | starbucks | | We had horses come through the drive-thru rece... | 1663212017 |
| 2 | starbucks | | Having horses in the drive-thru makes everythi... | 1663211903 |
| 3 | starbucks | | The Coffee Cavaliers/Ristretto Ranchers | 1663211763 |
| 4 | starbucks | So my birthday is Wednesday and I have no idea... | Free birthday drink coming up | 1663211474 |
| ... | ... | ... | ... | ... |
| 2494 | starbucks | A week later and the egg is still there and ha... | Siren Freezer Egg | 1661206051 |
| 2495 | starbucks | just transferred to a new store (it’s my 3rd o... | no one at my store uses shakers for refreshers... | 1661205449 |
| 2496 | starbucks | So I started a new job because let’s face it, ... | Looks like I may be leaving the Siren | 1661205072 |
| 2497 | starbucks | | Sooo is this like, actually any good? | 1661205010 |


2,499 of the 2,500 requested Starbucks posts were extracted. The posts are dated from 23 August 2022 to 15 September 2022.

We will use this dataset for analysis.

## Export Data

In [14]:
ddonut_df.to_csv("./datasets/dunkindonuts.csv")

In [15]:
sbucks_df.to_csv("./datasets/starbucks.csv")

## Data Collection Summary
We web-scraped the data using the Pushshift Reddit API. We intended to extract the last 2,500 posts from each of the Dunkin Donuts and Starbucks subreddits for analysis, limited to posts created before September 15, 2022, 11:37 GMT+08:00.

2,498 of the 2,500 requested Dunkin Donuts posts were extracted. The posts are dated from 3 February 2022 to 15 September 2022.

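2,499 of the 2,500 requested Starbucks posts were extracted. The posts are dated from 23 August 2022 to 15 September 2022. A similar number of posts therefore covers about seven months for Dunkin Donuts but only about three weeks for Starbucks, which suggests that the Starbucks community is considerably more active.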
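
One rough way to compare community activity from these figures is to divide each post count by the time span the posts cover. The sketch below does this with the standard library, using the post counts and the newest and oldest `created_utc` values visible in the table outputs above. The helper `posts_per_day` and the `SAMPLES` dictionary are hypothetical names introduced for illustration, and because the oldest values are taken from the displayed rows, the resulting rates are approximations only.

```python
# Counts and created_utc values copied from the outputs shown in this notebook.
# "oldest" is the oldest row displayed, so each span is approximate.
SAMPLES = {
    "DunkinDonuts": {"posts": 2498, "newest": 1663204910, "oldest": 1643828728},
    "Starbucks": {"posts": 2499, "newest": 1663212467, "oldest": 1661205010},
}

SECONDS_PER_DAY = 86_400


def posts_per_day(posts, newest, oldest):
    """Average number of posts per day between two Unix timestamps."""
    span_days = (newest - oldest) / SECONDS_PER_DAY
    return posts / span_days


rates = {name: posts_per_day(**sample) for name, sample in SAMPLES.items()}
for name, rate in rates.items():
    print(f"r/{name}: about {rate:.0f} posts per day")
```

A per-day rate is only a coarse indicator, since posting volume may vary over the period, but it gives a first answer to the question of which community is more active.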