# Sentiment and Emotion Analysis

Notebook 4 of 4

We will analyse the posts from each subreddit, along with some of the major topics in each, to understand the sentiment and emotion of each community. To do so, we will use Hugging Face pre-trained models for both sentiment analysis and emotion analysis.
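
As a rough illustration of how a Hugging Face pipeline could be applied to a list of posts, here is a minimal sketch. The helper names `build_classifier` and `label_distribution` are hypothetical, and the model is left at the library default because this section does not name one. The demo at the bottom uses a stub classifier so it runs without downloading a model; the stub's labels are placeholders, not real model output.

```python
from collections import Counter


def build_classifier(task="sentiment-analysis", model=None):
    """Create a Hugging Face pipeline (downloads a model on first use)."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import pipeline
    return pipeline(task, model=model)


def label_distribution(texts, classifier):
    """Return the share of posts assigned to each predicted label."""
    # truncation=True is assumed to be forwarded to the tokenizer so long posts fit the model.
    results = classifier(list(texts), truncation=True)
    counts = Counter(result["label"] for result in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def stub_classifier(texts, **kwargs):
    """Stand-in for a real pipeline: same output shape, made-up labelling rule."""
    return [
        {"label": "POSITIVE" if "love" in text else "NEGATIVE", "score": 1.0}
        for text in texts
    ]


demo_share = label_distribution(
    ["love butter pecan", "order wrong again", "love cold brew", "wait long"],
    stub_classifier,
)
print(demo_share)
```

With a real model, `label_distribution(df['title_selftext'], build_classifier())` would follow the same shape; an emotion model could be swapped in through the `model` argument.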
 
The topics for each subreddit we will explore are:
- Dunkin Donuts
   1. dunkin donuts
   2. cold brew vs cold foam
   3. iced coffee vs frozen coffee
   4. butter pecan
   5. local dunkin

- Starbucks
   1. dress code
   2. pumpkin spice
   3. cold brew vs cold foam
   4. apple crisp
   5. fall launch
 
Dunkin Donuts and Starbucks are the brand names. Each list includes the three most popular products in that subreddit, ranked by how often the words appear in its posts. Local stores and upcoming product launches are also hot topics in these subreddits.
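
The frequency ranking itself was done in an earlier notebook and is not shown here. As a sketch of one way such a ranking could be produced, the snippet below counts adjacent word pairs (bigrams) across posts. The function name `top_bigrams` and the sample posts are made up for illustration.

```python
from collections import Counter


def top_bigrams(texts, n=3):
    """Return the n most frequent adjacent word pairs across all texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # zip(tokens, tokens[1:]) pairs each word with the word that follows it.
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), count) for pair, count in counts.most_common(n)]


sample_posts = [
    "cold brew with cold foam",
    "iced coffee or cold brew",
    "butter pecan iced coffee",
    "cold brew",
]
top_two = top_bigrams(sample_posts, n=2)
print(top_two)
```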


## Import Clean Data

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize, RegexpTokenizer
from transformers import pipeline

In [3]:
combined_df = pd.read_csv('./datasets/combined_cleaned_reddit_selftext.csv')
combined_df.shape

(4623, 4)

In [4]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4623 entries, 0 to 4996
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   subreddit               4623 non-null   int64 
 1   title_selftext          4623 non-null   object
 2   created_utc             4623 non-null   int64 
 3   stemmed_title_selftext  4623 non-null   object
dtypes: int64(2), object(2)
memory usage: 180.6+ KB


In [5]:
combined_df.head(3)

Unnamed: 0,subreddit,title_selftext,created_utc,stemmed_title_selftext
0,1,cowork place hash brown like armi troopsfacewi...,1663204910,cowork place hash brown like armi troopsfacewi...
3,1,make ice tea order door dash tast ice tea orde...,1663190691,make ice tea order door dash tast ice tea orde...
4,1,"still got hour shift gotexplodinghead,nan",1663185603,still got hour shift gotexplodinghead


In [6]:
# check for null values
combined_df.isnull().sum()

subreddit                 0
title_selftext            0
created_utc               0
stemmed_title_selftext    0
dtype: int64

There are no missing values in the dataset.
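
Note that `isnull()` only detects true missing values. The preview above shows one row whose text ends in the literal token `nan`, which may come from joining a title with an empty selftext in an earlier step (this is an assumption; the notebook does not confirm it). A minimal sketch for flagging such rows, using a hypothetical helper `has_nan_token`:

```python
import re


def has_nan_token(text):
    """True when 'nan' appears as a standalone word, not inside another word."""
    return re.search(r"\bnan\b", text.lower()) is not None


samples = [
    "still got hour shift gotexplodinghead,nan",  # row taken from the preview above
    "banana bread and nanny",                     # made-up row: 'nan' only inside words
    "make ice tea order",                         # made-up row: no 'nan' at all
]
flags = [has_nan_token(sample) for sample in samples]
print(flags)
```

On the real data, something like `combined_df['title_selftext'].apply(has_nan_token).sum()` would give the number of affected rows.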

### Tokenize words and join back into a sentence to remove unwanted characters

In [8]:
tokenizer = RegexpTokenizer(r'\w+')

In [10]:
combined_df['tokenized'] = combined_df['title_selftext'].apply(lambda x: tokenizer.tokenize(x.lower()))
combined_df.head(3)

Unnamed: 0,subreddit,title_selftext,created_utc,stemmed_title_selftext,tokenized
0,1,cowork place hash brown like armi troopsfacewi...,1663204910,cowork place hash brown like armi troopsfacewi...,"[cowork, place, hash, brown, like, armi, troop..."
3,1,make ice tea order door dash tast ice tea orde...,1663190691,make ice tea order door dash tast ice tea orde...,"[make, ice, tea, order, door, dash, tast, ice,..."
4,1,"still got hour shift gotexplodinghead,nan",1663185603,still got hour shift gotexplodinghead,"[still, got, hour, shift, gotexplodinghead, nan]"


In [11]:
combined_df['title_selftext'] = combined_df['tokenized'].apply(lambda x: " ".join(x))
combined_df.head(3)

Unnamed: 0,subreddit,title_selftext,created_utc,stemmed_title_selftext,tokenized
0,1,cowork place hash brown like armi troopsfacewi...,1663204910,cowork place hash brown like armi troopsfacewi...,"[cowork, place, hash, brown, like, armi, troop..."
3,1,make ice tea order door dash tast ice tea orde...,1663190691,make ice tea order door dash tast ice tea orde...,"[make, ice, tea, order, door, dash, tast, ice,..."
4,1,still got hour shift gotexplodinghead nan,1663185603,still got hour shift gotexplodinghead,"[still, got, hour, shift, gotexplodinghead, nan]"
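
To make the effect of the tokenize-and-join step concrete, here is a stdlib-only sketch. It uses `re.findall` with the same `\w+` pattern, which is assumed to behave like `RegexpTokenizer(r'\w+').tokenize` for this purpose; `clean_text` is a hypothetical name. Punctuation is dropped, and contractions are split at the apostrophe, which matches the `isn t` seen in the Starbucks preview later in the notebook.

```python
import re


def clean_text(text):
    """Lowercase, keep only runs of word characters, and rejoin with single spaces."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(tokens)


cleaned = clean_text("Isn't it? Cold-brew, again!")
print(cleaned)
```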


### Separate into Starbucks and Dunkin Donuts datasets for analysis

In [12]:
ddonuts_text_df = combined_df[combined_df['subreddit'] == 1]
ddonuts_text_df.shape

(2306, 5)

In [13]:
ddonuts_text_df.head(3)

Unnamed: 0,subreddit,title_selftext,created_utc,stemmed_title_selftext,tokenized
0,1,cowork place hash brown like armi troopsfacewi...,1663204910,cowork place hash brown like armi troopsfacewi...,"[cowork, place, hash, brown, like, armi, troop..."
3,1,make ice tea order door dash tast ice tea orde...,1663190691,make ice tea order door dash tast ice tea orde...,"[make, ice, tea, order, door, dash, tast, ice,..."
4,1,still got hour shift gotexplodinghead nan,1663185603,still got hour shift gotexplodinghead,"[still, got, hour, shift, gotexplodinghead, nan]"


In [15]:
sbucks_text_df = combined_df[combined_df['subreddit'] == 0]
sbucks_text_df.shape

(2317, 5)

In [16]:
sbucks_text_df.head(3)

Unnamed: 0,subreddit,title_selftext,created_utc,stemmed_title_selftext,tokenized
2498,0,interview tip hi hope question isn t repetitiv...,1663212467,interview tip hi hope question repetitiveannoy...,"[interview, tip, hi, hope, question, isn, t, r..."
2499,0,hors come drivethru recent present caffein cav...,1663212017,hor come drivethru recent present caffein cavalri,"[hors, come, drivethru, recent, present, caffe..."
2500,0,hors drivethru make everyth better present caf...,1663211903,hor drivethru make everyth better present caff...,"[hors, drivethru, make, everyth, better, prese..."
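
The two subsets should partition the combined data: 2306 Dunkin Donuts rows plus 2317 Starbucks rows gives the 4623 rows loaded earlier. A small sketch of the same split with a built-in consistency check, on toy data; the `LABELS` mapping is inferred from the filters above (1 for Dunkin Donuts, 0 for Starbucks) and the sample rows are made up.

```python
import pandas as pd

# Label encoding inferred from the filters in this notebook.
LABELS = {1: "dunkin", 0: "starbucks"}

toy = pd.DataFrame({
    "subreddit": [1, 0, 1, 0, 0],
    "title_selftext": ["a", "b", "c", "d", "e"],
})

# groupby yields one (label, sub-frame) pair per distinct subreddit value.
parts = {LABELS[label]: frame.copy() for label, frame in toy.groupby("subreddit")}
sizes = {name: len(frame) for name, frame in parts.items()}

# Every row must land in exactly one subset.
assert sum(sizes.values()) == len(toy)
print(sizes)
```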


### Create a separate dataframe for each subtopic for analysis

In [17]:
# Dunkin Donuts subtopics: keep rows whose text contains the topic phrase.
# Note: the previews above suggest title_selftext is already stemmed (e.g. 'tast',
# 'everyth'), so unstemmed phrases such as 'iced coffee' may match fewer rows than expected.
dunkin_donuts = ddonuts_text_df[ddonuts_text_df['title_selftext'].str.contains('dunkin donuts')]
dunkin_c_brew = ddonuts_text_df[ddonuts_text_df['title_selftext'].str.contains('cold brew')]
dunkin_i_coffee = ddonuts_text_df[ddonuts_text_df['title_selftext'].str.contains('iced coffee')]
dunkin_b_pecan = ddonuts_text_df[ddonuts_text_df['title_selftext'].str.contains('butter pecan')]
dunkin_local = ddonuts_text_df[ddonuts_text_df['title_selftext'].str.contains('local dunkin')]

# Starbucks subtopics
sbucks_dress = sbucks_text_df[sbucks_text_df['title_selftext'].str.contains('dress code')]
sbucks_p_spice = sbucks_text_df[sbucks_text_df['title_selftext'].str.contains('pumpkin spice')]
sbucks_c_brew = sbucks_text_df[sbucks_text_df['title_selftext'].str.contains('cold brew')]
sbucks_a_crisp = sbucks_text_df[sbucks_text_df['title_selftext'].str.contains('apple crisp')]
sbucks_f_launch = sbucks_text_df[sbucks_text_df['title_selftext'].str.contains('fall launch')]

In [17]:
# Take explicit copies so later column assignments do not raise SettingWithCopyWarning
dunkin_donuts = dunkin_donuts.copy()
dunkin_c_brew = dunkin_c_brew.copy()
dunkin_i_coffee = dunkin_i_coffee.copy()
dunkin_b_pecan = dunkin_b_pecan.copy()
dunkin_local = dunkin_local.copy()

sbucks_dress = sbucks_dress.copy()
sbucks_p_spice = sbucks_p_spice.copy()
sbucks_c_brew = sbucks_c_brew.copy()
sbucks_a_crisp = sbucks_a_crisp.copy()
sbucks_f_launch = sbucks_f_launch.copy()
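
The filter-then-copy pattern above repeats once per topic, which makes a typo in a single phrase easy to miss. A possible alternative, sketched here on toy data, drives the same logic from a dictionary of phrases. `split_by_topic` is a hypothetical helper, not part of the notebook; `regex=False` treats each phrase as a literal string and `na=False` treats missing text as a non-match.

```python
import pandas as pd


def split_by_topic(df, topics, text_col="title_selftext"):
    """Return {topic name: independent copy of the rows whose text contains the phrase}."""
    frames = {}
    for name, phrase in topics.items():
        mask = df[text_col].str.contains(phrase, regex=False, na=False)
        frames[name] = df[mask].copy()
    return frames


# Made-up posts for illustration only.
toy = pd.DataFrame({
    "title_selftext": [
        "love pumpkin spice latte",
        "new dress code rule",
        "cold brew with pumpkin spice",
        "nothing relevant",
    ]
})
topics = {"p_spice": "pumpkin spice", "dress": "dress code", "f_launch": "fall launch"}

frames = split_by_topic(toy, topics)
counts = {name: len(frame) for name, frame in frames.items()}
print(counts)
```

Printing the per-topic counts right after the split also makes an empty result, such as one caused by a misspelled phrase, visible immediately.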