<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

# Part 4 of 4

# Contents: 
Part 1: Data Collection 

- Web scraping Subreddit Tea

- Web scraping Subreddit Coffee

- Summary

Part 2: EDA and Data Cleaning

- Tea

- Coffee

- Initial Identification of Top Words

- Compare Lemmatization and Stemming

- Summary

Part 3: Modelling and Model Evaluation

- Base Model

- CVEC with Log Regression

- TFIDF with Log Regression

- CVEC with Naive Bayes

- TFIDF with Naive Bayes

- CVEC with Random Forest

- TFIDF with Random Forest

Part 4: Sentiment Analysis, Recommendation and Conclusion

- [Summary for Tea](#SumTea)

- [Summary for Coffee](#SumCoffee)

- [Recommendation and Conclusion](#Conclusion)

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV # split our data and run hyperparameter search
# from sklearn.pipeline import Pipeline # to compactly pack multiple modeling operations
# from sklearn.naive_bayes import MultinomialNB # to build our classification model
# from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay # to access results from binary classification task (you may also import other specific classification metrics)
# from sklearn.linear_model import LogisticRegression

# Import CountVectorizer and TFIDFVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# imports for contextual embeddings
from transformers import pipeline

In [2]:
# import csv
df = pd.read_csv('./data/df_export.csv')
# stem_X = pd.read_csv('./data/stem_X.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24926 entries, 0 to 24925
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   subreddit      24926 non-null  int64 
 1   title          24926 non-null  object
 2   stemmed_title  24871 non-null  object
dtypes: int64(1), object(2)
memory usage: 584.3+ KB


In [4]:
df.loc[df['stemmed_title'].isnull() , :]

Unnamed: 0,subreddit,title,stemmed_title
113,1,what is this on my tea,
645,1,what tea is that,
1946,1,☕,
2260,1,tea,
2646,1,that is a no,
3124,1,the tea🍵,
4267,1,it is what it is,
4427,1,tea,
6382,1,as above so below,
6938,1,is this tea,


In [5]:
df.dropna(subset=['stemmed_title'], inplace=True)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24871 entries, 0 to 24925
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   subreddit      24871 non-null  int64 
 1   title          24871 non-null  object
 2   stemmed_title  24871 non-null  object
dtypes: int64(1), object(2)
memory usage: 777.2+ KB


## Hugging Face

In [7]:
# classifier=pipeline("sentiment-analysis",
#                     model="Seethal/sentiment_analysis_generic_dataset"
#                     )

In [8]:
# classifier=pipeline("sentiment-analysis",
#                     model="Seethal/sentiment_analysis_generic_dataset",
#                     tokenizer="Seethal/sentiment_analysis_generic_dataset",
#                     max_length=512,
#                     truncation=True)

In [9]:
senti_classifier = pipeline("sentiment-analysis",
                            model="cardiffnlp/twitter-roberta-base-sentiment",
                            max_length=512,
                            truncation=True)

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


In [10]:
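# Define function to extract sentiments
def sentiments(dataset):
    """Label each title with the RoBERTa sentiment (negative/neutral/positive)."""
    dataset = dataset.copy()  # work on a copy so a slice of the caller's DataFrame is not modified
    # The pipeline returns a one-element list of {'label', 'score'} for each title
    dataset['sentiment'] = dataset['title'].apply(senti_classifier)
    label_map = {'LABEL_0': 'negative', 'LABEL_1': 'neutral', 'LABEL_2': 'positive'}
    dataset['sentiments'] = dataset['sentiment'].apply(lambda x: x[0]['label']).map(label_map)
    return dataset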

From the words identified in Part 2 of the project, we picked out some of interest for sentiment analysis. We also drew a random sample of 2,000 posts from each subreddit to gauge overall public sentiment towards tea and coffee. Two sentiment analysers are used here: twitter-roberta-base-sentiment ([source](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment)) and spaCy ([source](https://spacy.io/usage/spacy-101)) with the spacytextblob extension, which supplies the polarity score. Apart from being built differently, the two differ in their outputs: the RoBERTa model returns a negative, neutral or positive label, whereas spaCy returns a polarity score with no neutral class. In the code below, a score < 0 is labelled negative and any other score, including exactly 0, is labelled positive.
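The thresholding of a polarity score can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the function name `polarity_to_label` and the `neutral_band` parameter are assumptions introduced here. The notebook's `spacy_sentiment_pred` function, defined further down, has no neutral bucket and labels every non-negative score as positive; a non-zero band is one possible way to make the spaCy scores more comparable to the three RoBERTa labels.

```python
def polarity_to_label(polarity: float, neutral_band: float = 0.0) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores within +/- neutral_band of zero are labelled 'neutral'.
    (neutral_band is a hypothetical parameter, not used in the notebook.)
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Made-up scores, for illustration only
scores = [0.6, 0.0, -0.05, -0.4]
labels = [polarity_to_label(s, neutral_band=0.1) for s in scores]
print(labels)
```

The width of the band is a judgment call and would ideally be tuned against human-labelled titles.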

## Sample Tea Overall

In [11]:
tea_df=df.loc[df['subreddit'] == 1, :]
tea_df

Unnamed: 0,subreddit,title,stemmed_title
0,1,can someone recommend a tea for me my lovely w...,someon recommend love wife drink sloan heavenl...
1,1,buying chamomile in us i took a look at the ve...,buy chamomil us took look vendor list none cha...
2,1,is this a good introduction to kamairicha,good introduct kamairicha
3,1,i was curious about the story on colombian bla...,curiou stori colombian black found answer alth...
4,1,haul any suggestions on how to brew these teas,haul suggest brew
...,...,...,...
11668,1,your favorite tea brands i just moved to a new...,favorit brand move new citi sadli go store pre...
11669,1,my new ring for my mad hatter cosplay,new ring mad hatter cosplay
11670,1,yummy tea break,yummi break
11671,1,do you have some teacups youd love to post abo...,teacup youd love post without worri look love ...


In [12]:
senti_classifier('can someone recommend a tea for me my lovely wife')

[{'label': 'LABEL_2', 'score': 0.6980944275856018}]
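The result above is a one-element list holding a dictionary with a raw label and a confidence score. As a small sketch of how that structure can be unpacked, the helper below returns both the readable label and the score; the name `unpack_prediction` is hypothetical and the notebook itself keeps only the label. Retaining the score could make it possible to set aside low-confidence predictions later.

```python
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def unpack_prediction(prediction):
    """Turn one pipeline result (a one-element list of dicts) into (name, score)."""
    top = prediction[0]
    return LABEL_NAMES[top["label"]], round(top["score"], 3)


# Output copied from the cell above
example = [{"label": "LABEL_2", "score": 0.6980944275856018}]
print(unpack_prediction(example))  # ('positive', 0.698)
```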

In [13]:
tea_sample_senti = sentiments(tea_df.sample(n=2000, random_state=42))

In [14]:
tea_sample_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
3864,1,tea bringing comfort and joy even in the darke...,bring comfort joy even darkest situat,"[{'label': 'LABEL_2', 'score': 0.8959719538688...",positive
5695,1,i have been on a scone kick ever since my trip...,scone kick ever sinc trip room,"[{'label': 'LABEL_1', 'score': 0.6411350369453...",neutral
8529,1,any decaf oolongs or greens anyone can recomme...,decaf oolong green anyon recommend enjoy bed s...,"[{'label': 'LABEL_2', 'score': 0.4960118532180...",positive
10508,1,a nice oolong to round out the week,nice oolong round week,"[{'label': 'LABEL_2', 'score': 0.9107728004455...",positive
11639,1,2 quick questions 😁 any of you use brita long ...,2 quick question use brita long last like seco...,"[{'label': 'LABEL_1', 'score': 0.7153263688087...",neutral


In [15]:
tea_sample_senti['sentiments'].value_counts()

neutral     1090
positive     709
negative     201
Name: sentiments, dtype: int64

In [16]:
tea_sentiments = pd.DataFrame(tea_sample_senti['sentiments'].value_counts())
tea_sentiments

Unnamed: 0,sentiments
neutral,1090
positive,709
negative,201


## Words of interest in tea

In [17]:
tea_green_tea = tea_df[tea_df['title'].str.contains('green tea')]
tea_green_tea = tea_green_tea.copy()

tea_earl_grey = tea_df[tea_df['title'].str.contains('earl grey tea')]
tea_earl_grey = tea_earl_grey.copy()

tea_tie_guan_yin = tea_df[tea_df['title'].str.contains('tie guan yin')]
tea_tie_guan_yin = tea_tie_guan_yin.copy()

tea_black = tea_df[tea_df['title'].str.contains('black tea')]
tea_black = tea_black.copy()
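Note that `str.contains` matches substrings, so a phrase such as 'green tea' would also match inside a longer run of letters. The sketch below shows, with the standard `re` module, how whole-word matching could be done instead. The helper name `title_mentions` and the sample titles are made up for illustration and are not part of the notebook.

```python
import re


def title_mentions(title: str, phrase: str) -> bool:
    """Return True if `phrase` occurs in `title` as whole words."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, title) is not None


# Hypothetical titles, made up for illustration
titles = [
    "is this a good green tea",
    "my evergreen teapot collection",  # contains 'green tea' only as a substring
    "black tea or green tea for the morning",
]
matches = [t for t in titles if title_mentions(t, "green tea")]
print(matches)
```

Because `Series.str.contains` interprets its pattern as a regular expression by default, the same `\b...\b` pattern could be passed to it directly.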

In [18]:
tea_green_tea.shape

(832, 3)

In [19]:
tea_earl_grey.shape

(36, 3)

In [20]:
tea_tie_guan_yin.shape

(46, 3)

In [21]:
tea_black.shape

(713, 3)

## Green Tea

In [22]:
green_tea_senti = sentiments(tea_green_tea)

In [23]:
green_tea_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
10,1,is this a good introduction to chinese green t...,good introduct chines green,"[{'label': 'LABEL_1', 'score': 0.6991882324218...",neutral
25,1,is the caffeine in matcha absorbed differently...,caffein matcha absorb differ usual drink caffe...,"[{'label': 'LABEL_0', 'score': 0.8034375309944...",negative
56,1,alternatives to shincha spring harvest japanes...,altern shincha spring harvest japanes green he...,"[{'label': 'LABEL_2', 'score': 0.7693625092506...",positive
71,1,thanks for the reminder i’m primarily a coffee...,thank remind primarili drinker drink lot late ...,"[{'label': 'LABEL_1', 'score': 0.3819150030612...",neutral
84,1,alternatives to tea i have trouble drinking te...,altern troubl drink realli overstimul point ti...,"[{'label': 'LABEL_0', 'score': 0.7013951539993...",negative


In [24]:
green_tea_senti['sentiments'].value_counts()

neutral     359
positive    352
negative    121
Name: sentiments, dtype: int64

In [25]:
green_tea_sentiments = pd.DataFrame(green_tea_senti['sentiments'].value_counts())
green_tea_sentiments

Unnamed: 0,sentiments
neutral,359
positive,352
negative,121


## Earl Grey Tea

In [26]:
earl_grey_senti = sentiments(tea_earl_grey)

In [27]:
earl_grey_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
141,1,earl grey tea in afternoon is earl grey consid...,earl grey afternoon earl grey consid afternoon...,"[{'label': 'LABEL_1', 'score': 0.9283571243286...",neutral
468,1,loose leaf earl grey tea recommendations i am ...,loos leaf earl grey recommend finish box twin ...,"[{'label': 'LABEL_0', 'score': 0.8306356072425...",negative
1179,1,how can i fix my weak earl grey i love earl gr...,fix weak earl grey love earl grey drink everi ...,"[{'label': 'LABEL_0', 'score': 0.7725271582603...",negative
1858,1,earl grey tea peach 🍑,earl grey peach,"[{'label': 'LABEL_1', 'score': 0.7264797687530...",neutral
1919,1,help me find a french earl grey tea like this ...,help find french earl grey like australia firs...,"[{'label': 'LABEL_0', 'score': 0.4266401827335...",negative


In [28]:
earl_grey_senti['sentiments'].value_counts()

neutral     16
positive    14
negative     6
Name: sentiments, dtype: int64

In [29]:
earl_grey_sentiments = pd.DataFrame(earl_grey_senti['sentiments'].value_counts())
earl_grey_sentiments

Unnamed: 0,sentiments
neutral,16
positive,14
negative,6


## Tie Guan Yin

In [30]:
tgy_senti = sentiments(tea_tie_guan_yin)

In [31]:
tgy_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
419,1,anxi tie guan yin,anxi tie guan yin,"[{'label': 'LABEL_1', 'score': 0.7665809988975...",neutral
562,1,happy mid autumn festival celebrating it with ...,happi mid autumn festiv celebr homemad moon ca...,"[{'label': 'LABEL_2', 'score': 0.9662119746208...",positive
1257,1,test driving some vintage sea dyke shui xian i...,test drive vintag sea dyke shui xian bad came ...,"[{'label': 'LABEL_2', 'score': 0.4865322411060...",positive
1320,1,is their no place for the sour tie guan yin i ...,place sour tie guan yin tri mei leaf sour sap ...,"[{'label': 'LABEL_0', 'score': 0.7389737367630...",negative
1321,1,taiwan tie guan yin oolong,taiwan tie guan yin oolong,"[{'label': 'LABEL_1', 'score': 0.7894954085350...",neutral


In [32]:
tgy_senti['sentiments'].value_counts()

neutral     27
positive    18
negative     1
Name: sentiments, dtype: int64

In [33]:
tgy_sentiments = pd.DataFrame(tgy_senti['sentiments'].value_counts())
tgy_sentiments

Unnamed: 0,sentiments
neutral,27
positive,18
negative,1


## Black Tea

In [34]:
black_tea_senti = sentiments(tea_black)

In [35]:
black_tea_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
3,1,i was curious about the story on colombian bla...,curiou stori colombian black found answer alth...,"[{'label': 'LABEL_1', 'score': 0.5055576562881...",neutral
8,1,hypersensitivity allergy to teas developing ov...,hypersensit allergi develop year anyon els exp...,"[{'label': 'LABEL_0', 'score': 0.7761902213096...",negative
25,1,is the caffeine in matcha absorbed differently...,caffein matcha absorb differ usual drink caffe...,"[{'label': 'LABEL_0', 'score': 0.8034375309944...",negative
41,1,what black tea is used the most in earl grey b...,black use earl grey blend,"[{'label': 'LABEL_1', 'score': 0.9118570685386...",neutral
47,1,i made a little blend of some scottish made bl...,made littl blend scottish made black lavend ro...,"[{'label': 'LABEL_1', 'score': 0.7252510786056...",neutral


In [36]:
black_tea_senti['sentiments'].value_counts()

neutral     313
positive    303
negative     97
Name: sentiments, dtype: int64

In [37]:
black_tea_sentiments = pd.DataFrame(black_tea_senti['sentiments'].value_counts())
black_tea_sentiments

Unnamed: 0,sentiments
neutral,313
positive,303
negative,97


## Compare with spaCy

In [38]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [39]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

<spacytextblob.spacytextblob.SpacyTextBlob at 0x20e5f0b2c70>

In [40]:
tea_df_spacy = tea_df.sample(n=2000, random_state=42)

In [41]:
%%time
# Label each sampled title using the spaCy/TextBlob polarity score

# Map polarity to a label (< 0 -> negative, otherwise positive, so a score of exactly 0 counts as positive)
def spacy_sentiment_pred(text):
    spacy_output = nlp(text)  # run the pipeline, including the spacytextblob component

    if spacy_output._.polarity < 0:
        # negative
        return 'negative'
    else:
        # zero or positive polarity
        return 'positive'

tea_df_spacy['spacy_sentiment_pred'] = tea_df_spacy['title'].apply(spacy_sentiment_pred)
tea_df_spacy.head()

CPU times: total: 23.8 s
Wall time: 24.1 s


Unnamed: 0,subreddit,title,stemmed_title,spacy_sentiment_pred
3864,1,tea bringing comfort and joy even in the darke...,bring comfort joy even darkest situat,positive
5695,1,i have been on a scone kick ever since my trip...,scone kick ever sinc trip room,positive
8529,1,any decaf oolongs or greens anyone can recomme...,decaf oolong green anyon recommend enjoy bed s...,positive
10508,1,a nice oolong to round out the week,nice oolong round week,positive
11639,1,2 quick questions 😁 any of you use brita long ...,2 quick question use brita long last like seco...,negative


In [42]:
tea_df_spacy['spacy_sentiment_pred'].value_counts()

positive    1731
negative     269
Name: spacy_sentiment_pred, dtype: int64

In [43]:
tea_sentiments_spacy = pd.DataFrame(tea_df_spacy['spacy_sentiment_pred'].value_counts())
tea_sentiments_spacy

Unnamed: 0,spacy_sentiment_pred
positive,1731
negative,269


<a id="SumTea"></a>
## Summary for Tea

In [44]:
tea_cols = ["Overall", "Green Tea", "Earl Grey Tea", "Tie Guan Yin", "Black Tea", "Overall_spacy"]
all_tea_sentiments = pd.concat([tea_sentiments, green_tea_sentiments, earl_grey_sentiments, tgy_sentiments, black_tea_sentiments, tea_sentiments_spacy], axis=1, ignore_index=True)
all_tea_sentiments.columns = tea_cols

In [45]:
all_tea_sentiments

Unnamed: 0,Overall,Green Tea,Earl Grey Tea,Tie Guan Yin,Black Tea,Overall_spacy
neutral,1090,359,16,27,313,
positive,709,352,14,18,303,1731.0
negative,201,121,6,1,97,269.0


In [46]:
all_tea_sentiments_perc = round(all_tea_sentiments / all_tea_sentiments.sum(), 2)
all_tea_sentiments_perc

Unnamed: 0,Overall,Green Tea,Earl Grey Tea,Tie Guan Yin,Black Tea,Overall_spacy
neutral,0.55,0.43,0.44,0.59,0.44,
positive,0.35,0.42,0.39,0.39,0.42,0.87
negative,0.1,0.15,0.17,0.02,0.14,0.13


## Sample Coffee Overall

In [47]:
coffee_df=df.loc[df['subreddit'] == 0, :]
coffee_df

Unnamed: 0,subreddit,title,stemmed_title
11673,0,coffee reduces cardiovascular disease large p...,reduc cardiovascular diseas larg popul studi c...
11674,0,where do europeans get their coffee stareuros,european get stareuro
11675,0,dark and stormy coffee why am i just learning ...,dark stormi learn 12oz black cold brew hot 8oz...
11676,0,should i learn how to roast coffee beans befor...,learn roast bean learn particular brew method ...
11677,0,roast me,roast
...,...,...,...
24921,0,is store brand cold brew less acidic,store brand cold brew less acid
24922,0,on yunnan coffee i was drinking a mix of yunna...,yunnan drink mix yunnan guatemala cafe morn sp...
24923,0,what is an easy way to get strong noteless cof...,easi way get strong noteless hello drank nesca...
24924,0,how to make coffee less bitter hi i have been ...,make less bitter hi drink instant find extrem ...


In [48]:
coffee_sample_senti = sentiments(coffee_df.sample(n=2000, random_state=42))

In [49]:
coffee_sample_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
20113,0,got a nespresso not bad,got nespresso bad,"[{'label': 'LABEL_2', 'score': 0.8882138729095...",positive
16507,0,how to “properly” drink a layered drink iced m...,properli drink layer drink ice macchiato bumbl...,"[{'label': 'LABEL_1', 'score': 0.6695403456687...",neutral
24719,0,filter not working properly in mrcoffee machine,filter work properli mrcoffe machin,"[{'label': 'LABEL_0', 'score': 0.7453929781913...",negative
13033,0,virtuoso alternatives i am looking at automati...,virtuoso altern look automat grinder w timer k...,"[{'label': 'LABEL_1', 'score': 0.5048190355300...",neutral
16151,0,my only gripe with the encore baratza,gripe encor baratza,"[{'label': 'LABEL_1', 'score': 0.6687008142471...",neutral


In [50]:
coffee_sample_senti['sentiments'].value_counts()

neutral     972
positive    545
negative    483
Name: sentiments, dtype: int64

In [51]:
coffee_sentiments = pd.DataFrame(coffee_sample_senti['sentiments'].value_counts())
coffee_sentiments

Unnamed: 0,sentiments
neutral,972
positive,545
negative,483


## Words of interest in coffee

In [52]:
coffee_cold_brew = coffee_df[coffee_df['title'].str.contains('cold brew')]
coffee_cold_brew = coffee_cold_brew.copy()

coffee_french_press = coffee_df[coffee_df['title'].str.contains('french press')]
coffee_french_press = coffee_french_press.copy()

coffee_moka_pot = coffee_df[coffee_df['title'].str.contains('moka pot')]
coffee_moka_pot = coffee_moka_pot.copy()

coffee_hand_grinder = coffee_df[coffee_df['title'].str.contains('hand grinder')]
coffee_hand_grinder = coffee_hand_grinder.copy()

## Cold Brew

In [53]:
cold_brew_senti = sentiments(coffee_cold_brew)

In [54]:
cold_brew_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
11675,0,dark and stormy coffee why am i just learning ...,dark stormi learn 12oz black cold brew hot 8oz...,"[{'label': 'LABEL_2', 'score': 0.9694674611091...",positive
11740,0,i was pleasantly surprised by these simple tru...,pleasantli surpris simpl truth nitro cold brew...,"[{'label': 'LABEL_2', 'score': 0.9112371802330...",positive
11760,0,how can i make a super smooth and creamy iced ...,make super smooth creami ice go restaur like f...,"[{'label': 'LABEL_0', 'score': 0.4376310706138...",negative
11779,0,cold brew help i am not sure if i am looking f...,cold brew help sure look someth exist truli sc...,"[{'label': 'LABEL_0', 'score': 0.8254464864730...",negative
11810,0,how to quickly use up flavored coffee beans he...,quickli use flavor bean help bought 100g bag g...,"[{'label': 'LABEL_0', 'score': 0.4908444285392...",negative


In [55]:
cold_brew_senti['sentiments'].value_counts()

neutral     280
positive    161
negative    137
Name: sentiments, dtype: int64

In [56]:
cold_brew_sentiments = pd.DataFrame(cold_brew_senti['sentiments'].value_counts())
cold_brew_sentiments

Unnamed: 0,sentiments
neutral,280
positive,161
negative,137


## French Press

In [57]:
french_press_senti = sentiments(coffee_french_press)

In [58]:
french_press_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
11693,0,french press scaled up is not tasting good i h...,french press scale tast good 34 oz bodom frenc...,"[{'label': 'LABEL_0', 'score': 0.5980810523033...",negative
11706,0,question is there a name for the method i use ...,question name method use seen friend use machi...,"[{'label': 'LABEL_1', 'score': 0.7231719493865...",neutral
11709,0,tried a french press i tried a french press an...,tri french press tri french press maje tast mu...,"[{'label': 'LABEL_2', 'score': 0.8298047184944...",positive
11725,0,i am looking for a low cost receptacle that wi...,look low cost receptacl boil water stove top p...,"[{'label': 'LABEL_1', 'score': 0.8764566779136...",neutral
11737,0,i am gonna say iti prefer single pouonly bloom...,gonna say iti prefer singl pouonli bloom metho...,"[{'label': 'LABEL_2', 'score': 0.5912940502166...",positive


In [59]:
french_press_senti['sentiments'].value_counts()

neutral     384
positive    237
negative    170
Name: sentiments, dtype: int64

In [60]:
french_press_sentiments = pd.DataFrame(french_press_senti['sentiments'].value_counts())
french_press_sentiments

Unnamed: 0,sentiments
neutral,384
positive,237
negative,170


## Moka Pot

In [61]:
moka_pot_senti = sentiments(coffee_moka_pot)

In [62]:
moka_pot_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
11683,0,hi there coffee reddit i would like some advic...,hi reddit would like advic moka pot neglect uh...,"[{'label': 'LABEL_1', 'score': 0.5028273463249...",neutral
11733,0,hand grinder this is another post asking what ...,hand grinder anoth post ask hand grinder buy t...,"[{'label': 'LABEL_1', 'score': 0.5737534165382...",neutral
11779,0,cold brew help i am not sure if i am looking f...,cold brew help sure look someth exist truli sc...,"[{'label': 'LABEL_0', 'score': 0.8254464864730...",negative
11795,0,expected moka pot yield so i am using a 101 wa...,expect moka pot yield use 101 watercoffe ratio...,"[{'label': 'LABEL_0', 'score': 0.4814542829990...",negative
11797,0,coffee quantity for moka pot 3 cups what am i ...,quantiti moka pot 3 cup wrong hi guy use moka ...,"[{'label': 'LABEL_0', 'score': 0.6876979470252...",negative


In [63]:
moka_pot_senti['sentiments'].value_counts()

neutral     236
positive    132
negative    118
Name: sentiments, dtype: int64

In [64]:
moka_pot_sentiments = pd.DataFrame(moka_pot_senti['sentiments'].value_counts())
moka_pot_sentiments

Unnamed: 0,sentiments
neutral,236
positive,132
negative,118


## Hand Grinder

In [65]:
hand_grinder_senti = sentiments(coffee_hand_grinder)

In [66]:
hand_grinder_senti.head()

Unnamed: 0,subreddit,title,stemmed_title,sentiment,sentiments
11733,0,hand grinder this is another post asking what ...,hand grinder anoth post ask hand grinder buy t...,"[{'label': 'LABEL_1', 'score': 0.5737534165382...",neutral
11767,0,new to espresso need direction and recommendat...,new espresso need direct recommend hello lover...,"[{'label': 'LABEL_1', 'score': 0.6123094558715...",neutral
11814,0,aftermarket burrs for handgrinder has anyone u...,aftermarket burr handgrind anyon use aftermark...,"[{'label': 'LABEL_1', 'score': 0.6050747632980...",neutral
11816,0,new grinder who dis last may our keurig machi...,new grinder di last may keurig machin broke st...,"[{'label': 'LABEL_2', 'score': 0.7181528806686...",positive
11821,0,buying a gaggia help i want to venture into es...,buy gaggia help want ventur espresso milk base...,"[{'label': 'LABEL_1', 'score': 0.4943251013755...",neutral


In [67]:
hand_grinder_senti['sentiments'].value_counts()

neutral     142
positive    108
negative     47
Name: sentiments, dtype: int64

In [68]:
hand_grinder_sentiments = pd.DataFrame(hand_grinder_senti['sentiments'].value_counts())
hand_grinder_sentiments

Unnamed: 0,sentiments
neutral,142
positive,108
negative,47


## Compare with spaCy

In [69]:
coffee_df_spacy = coffee_df.sample(n=2000, random_state=42)

In [70]:
%%time
# Label each sampled title, reusing spacy_sentiment_pred defined in the tea section
# (< 0 -> negative, otherwise positive)
coffee_df_spacy['spacy_sentiment_pred'] = coffee_df_spacy['title'].apply(spacy_sentiment_pred)
coffee_df_spacy.head()

CPU times: total: 30.5 s
Wall time: 30.5 s


Unnamed: 0,subreddit,title,stemmed_title,spacy_sentiment_pred
20113,0,got a nespresso not bad,got nespresso bad,positive
16507,0,how to “properly” drink a layered drink iced m...,properli drink layer drink ice macchiato bumbl...,positive
24719,0,filter not working properly in mrcoffee machine,filter work properli mrcoffe machin,positive
13033,0,virtuoso alternatives i am looking at automati...,virtuoso altern look automat grinder w timer k...,positive
16151,0,my only gripe with the encore baratza,gripe encor baratza,positive


In [71]:
coffee_df_spacy['spacy_sentiment_pred'].value_counts()

positive    1688
negative     312
Name: spacy_sentiment_pred, dtype: int64

In [72]:
coffee_sentiments_spacy = pd.DataFrame(coffee_df_spacy['spacy_sentiment_pred'].value_counts())
coffee_sentiments_spacy

Unnamed: 0,spacy_sentiment_pred
positive,1688
negative,312


<a id="SumCoffee"></a>
## Summary for Coffee

In [73]:
coffee_cols = ["Overall", "Cold Brew", "French Press", "Moka Pot", "Hand Grinder", "Overall_spacy"]
all_coffee_sentiments = pd.concat([coffee_sentiments, cold_brew_sentiments, french_press_sentiments, moka_pot_sentiments, hand_grinder_sentiments, coffee_sentiments_spacy], axis=1, ignore_index=True)
all_coffee_sentiments.columns = coffee_cols

In [74]:
all_coffee_sentiments

Unnamed: 0,Overall,Cold Brew,French Press,Moka Pot,Hand Grinder,Overall_spacy
neutral,972,280,384,236,142,
positive,545,161,237,132,108,1688.0
negative,483,137,170,118,47,312.0


In [75]:
all_coffee_sentiments_perc = round(all_coffee_sentiments / all_coffee_sentiments.sum(), 2)
all_coffee_sentiments_perc

Unnamed: 0,Overall,Cold Brew,French Press,Moka Pot,Hand Grinder,Overall_spacy
neutral,0.49,0.48,0.49,0.49,0.48,
positive,0.27,0.28,0.3,0.27,0.36,0.84
negative,0.24,0.24,0.21,0.24,0.16,0.16
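Because the RoBERTa columns include a neutral class and the spaCy columns do not, their percentages are not directly comparable. One hedged way to put them on a similar footing is to drop the neutral titles and compute the positive share among the remaining ones. The sketch below does this with the counts from the summary tables; the function name `positive_share` is an assumption introduced here, and this is only one of several possible normalisations.

```python
def positive_share(positive: int, negative: int) -> float:
    """Share of positive labels among titles labelled positive or negative."""
    return positive / (positive + negative)


# Counts copied from the "Overall" and "Overall_spacy" columns of the summary tables
counts = {
    "tea (RoBERTa)": (709, 201),
    "tea (spaCy)": (1731, 269),
    "coffee (RoBERTa)": (545, 483),
    "coffee (spaCy)": (1688, 312),
}
for name, (pos, neg) in counts.items():
    print(f"{name}: {positive_share(pos, neg):.2f}")
```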


<a id="Conclusion"></a>
## Recommendation and Conclusion

Comparing the two sentiment analysers shows a generally positive sentiment towards both tea and coffee. This is more evident in spaCy's results because it has no neutral class and, in our code, titles with a polarity of exactly 0 are also counted as positive. The RoBERTa results may be misleading because a large share of titles is labelled neutral: for coffee, 27% positive, 24% negative and 49% neutral, compared with 84% positive and 16% negative from spaCy. The predictions should also be checked against a human-labelled data set to measure accuracy, which we do not have at this point. Simple preprocessing to convert emojis to text before sentiment analysis could also help.
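The emoji preprocessing mentioned above could be sketched with the standard library alone. This is an illustrative assumption rather than something the notebook does: the function name `emoji_to_text` is hypothetical, and it replaces any character in the Unicode "Symbol, other" category with its lower-cased Unicode name. Modifier characters such as variation selectors are not handled here.

```python
import unicodedata


def emoji_to_text(title: str) -> str:
    """Replace symbol characters (including most emojis) with their Unicode names."""
    parts = []
    for ch in title:
        if unicodedata.category(ch) == "So":  # "Symbol, other" covers most emojis
            name = unicodedata.name(ch, "")
            parts.append(f" {name.lower()} " if name else " ")
        else:
            parts.append(ch)
    # Collapse the extra spaces introduced around replaced characters
    return " ".join("".join(parts).split())


# Titles taken from the rows with an empty stemmed_title shown earlier
print(emoji_to_text("☕"))
print(emoji_to_text("the tea🍵"))
```

The converted titles would then be passed to the sentiment classifier in place of the raw titles.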

It is also observed that the more unique flavours, such as earl grey tea and tie guan yin, appear in very few posts (fewer than 100 each, out of approximately 11,000 posts on the tea subreddit) compared with the more common flavours. However, these may be exactly what matters in a business context and what the cafe would want to focus on.

<h1 align="center">Tea</h1>

|          |   Overall |   Green Tea |   Earl Grey Tea |   Tie Guan Yin |   Black Tea |   Overall_spacy |
|:---------|----------:|------------:|----------------:|---------------:|------------:|----------------:|
| neutral  |      1090 |         359 |              16 |             27 |         313 |             nan |
| positive |       709 |         352 |              14 |             18 |         303 |            1731 |
| negative |       201 |         121 |               6 |              1 |          97 |             269 |

|          |   Overall |   Green Tea |   Earl Grey Tea |   Tie Guan Yin |   Black Tea |   Overall_spacy |
|:---------|----------:|------------:|----------------:|---------------:|------------:|----------------:|
| neutral  |      0.55 |        0.43 |            0.44 |           0.59 |        0.44 |          nan    |
| positive |      0.35 |        0.42 |            0.39 |           0.39 |        0.42 |            0.87 |
| negative |      0.1  |        0.15 |            0.17 |           0.02 |        0.14 |            0.13 |

<h1 align="center">Coffee</h1>

|          |   Overall |   Cold Brew |   French Press |   Moka Pot |   Hand Grinder |   Overall_spacy |
|:---------|----------:|------------:|---------------:|-----------:|---------------:|----------------:|
| neutral  |       972 |         280 |            384 |        236 |            142 |             nan |
| positive |       545 |         161 |            237 |        132 |            108 |            1688 |
| negative |       483 |         137 |            170 |        118 |             47 |             312 |

|          |   Overall |   Cold Brew |   French Press |   Moka Pot |   Hand Grinder |   Overall_spacy |
|:---------|----------:|------------:|---------------:|-----------:|---------------:|----------------:|
| neutral  |      0.49 |        0.48 |           0.49 |       0.49 |           0.48 |          nan    |
| positive |      0.27 |        0.28 |           0.3  |       0.27 |           0.36 |            0.84 |
| negative |      0.24 |        0.24 |           0.21 |       0.24 |           0.16 |            0.16 |

To conclude, we identified the top words and n-grams using CountVectorizer and TfidfVectorizer, and from these gained some insights into what could be added to the cafe's menu. We also identified Logistic Regression as the best model for predicting whether a post came from the tea or the coffee subreddit. It can be used to sort public feedback to the cafe and potentially to identify more key words.

As one of the largest online forum platforms, with 52 million daily active users and 430 million monthly active users ([source](https://backlinko.com/reddit-users)), Reddit is a good starting point for scraping and analysing data. However, the largest share of its users is in the United States of America, which may skew the representation of the population and should be considered when deciding where the cafe might be set up.

Going forward, it would be beneficial to look into more data sets from different sources or more specific subreddits. For example, subreddits such as 'Starbucks' could be used to identify more unique and popular words, as could other platforms such as Facebook groups and Twitter hashtags.