# Beer Rating Sentiment Analysis

Aashray Anand - 12/20/2018

This notebook analyzes a beer rating data set from a private Kaggle competition for DATA 401 at Cal Poly. Instead of predicting the rating of a beer from the data provided (as was done in the competition), I use the overall rating as the label for review sentiment (ratings of 4 out of 5 or more are positive, ratings below 4 are negative) and analyze the review text to predict whether the review was positive.

In [103]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [104]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [105]:
PATH = 'data/'

!ls {PATH}

sampleSubmission.csv  test.csv  train.csv


In [106]:
df = pd.read_csv(f'{PATH}train.csv')

In [107]:
test_df = pd.read_csv(f'{PATH}test.csv')

From the above data, we are only concerned with the beer name, overall rating, and review text.

In [108]:
relevant_data = ['index', 'beer/name','review/overall','review/text']
df = df[relevant_data]
test_df = test_df[relevant_data]

We should add a column to denote a review as positive (4 out of 5 or more) or negative (less than 4 out of 5)

In [109]:
df['positive'] = df['review/overall'] >= 4.0
# rename columns to be easier to use
df.rename(columns={'beer/name': 'name',
                   'review/overall': 'review',
                   'review/text': 'text'},
                  inplace=True)

In [110]:
print(df.positive.sum()/df.positive.count() * 100, "% of train reviews are positive")

67.21333333333334 % of train reviews are positive


About two thirds of the training reviews are positive, so a naive model that predicts positive for every review already reaches roughly 67% accuracy. Any real model has to beat that baseline.

In [111]:
df['review'].value_counts()

4.0    13868
4.5     8666
3.5     6551
3.0     3319
5.0     2671
2.5     1193
2.0      807
1.5      248
1.0      176
0.0        1
Name: review, dtype: int64
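
The rating counts above are enough to reproduce the share of positive reviews, and therefore the accuracy of a baseline that predicts positive for every review. A minimal sketch, with the counts copied by hand from the output above (the variable names are illustrative, not part of the notebook):

```python
# rating -> number of training reviews, copied from the value_counts() output
rating_counts = {
    4.0: 13868, 4.5: 8666, 3.5: 6551, 3.0: 3319, 5.0: 2671,
    2.5: 1193, 2.0: 807, 1.5: 248, 1.0: 176, 0.0: 1,
}

total = sum(rating_counts.values())
# a review is positive when its overall rating is at least 4.0
positive = sum(n for rating, n in rating_counts.items() if rating >= 4.0)

# always predicting "positive" is right exactly as often as reviews are positive
baseline_accuracy = positive / total
print(f"{positive} of {total} reviews are positive ({baseline_accuracy:.2%})")
```

This agrees with the 67.2% computed from the `positive` column earlier.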

In [112]:
# generate train and test split from training set
df_train, df_test = train_test_split(df, test_size=0.2)

For the very first, naive model, we can fit a classifier whose only input is the word count of the review and whose output is whether or not the review is positive. Let's create a word count field in the data frame.

In [113]:
df_train.head()

Unnamed: 0,index,name,review,text,positive
27358,40360,Weltenburger Hefe-Weissbier Hell,2.5,"This one pours a light to medium in body, dull...",False
12807,24250,Founders Nemesis 2010,2.0,I wasn't crazy about the 2009 version of this ...,False
5648,21420,Founders Breakfast Stout,5.0,"Again, yet another beer by Founders that say's...",True
15953,4662,Hop Master's Abbey Belgian-Style Double IPA,4.5,Picked this up at the Beer Stop in West Hazlet...,True
24691,20068,Founders Breakfast Stout,4.0,the holy grail! after years of longing i final...,True


In [114]:
def add_word_count(df):
    df['word count'] = df.text.str.split().str.len()

In [115]:
df_train['text'] = df_train['text'].astype(str)
add_word_count(df_train)
df_train.head()
df_test['text'] = df_test['text'].astype(str)
add_word_count(df_test)
df_test.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,index,name,review,text,positive,word count
20653,9554,Pike Street XXXXX Stout,3.5,Pours a deep red/black with a thick and firm o...,False,55
22698,44086,Pilsner Urquell,3.0,"A good tapped Pilsner is a good Pilsner, too b...",False,114
18093,48902,Stoudt's Double IPA (India Pale Ale),4.0,This one's a cloudy amber brew with a good-loo...,True,177
28610,39951,Cucapá Chupacabras Pale Ale,3.5,Poured into a standard pint a deeper than expe...,False,74
17959,47331,Stoudts Heifer-in-Wheat,4.5,Poured from a 12oz. bottle into a standard pin...,True,46


In [116]:
# let's train a simple RF classifier, using only the word count feature, to predict a positive review
model = RandomForestClassifier()
model.fit(df_train.drop(['index', 'name', 'review', 'text', 'positive'], axis=1), df_train['positive'])



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [117]:
y_pred = model.predict(df_test.drop(['index', 'name', 'review', 'text', 'positive'], axis=1))

In [118]:
accuracy_score(df_test['positive'], y_pred)

0.6749333333333334

In [119]:
df_test['positive'].sum() / df_test['positive'].count()

0.6794666666666667

This model is actually slightly worse than just predicting positive for all reviews (67.5% accuracy versus a 67.9% baseline on this split) :( We will need to do some additional feature engineering to build a more effective model. Let's set a goal of >80% accuracy.

In [120]:
def word_count(df):
    # tally every whitespace-separated token across all reviews
    words = {}
    for index, row in df.iterrows():
        for word in row['text'].split():
            words[word] = words.get(word, 0) + 1
    # print (count, word) pairs, most frequent first
    print(sorted(((v, k) for k, v in words.items()), reverse=True))

In [121]:
word_count(df_train)
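
Printing every distinct token makes for a very long output. As an alternative sketch (not what the notebook ran), the same tally can be built with `collections.Counter` and limited to the most frequent tokens. The `top_words` helper and the toy reviews are hypothetical stand-ins for `df_train['text']`:

```python
from collections import Counter

def top_words(texts, n=10):
    """Return the n most frequent whitespace-separated tokens."""
    counts = Counter()
    for text in texts:
        # same tokenization as word_count: split on whitespace, case-sensitive
        counts.update(text.split())
    return counts.most_common(n)

# toy reviews standing in for df_train['text']
sample = ["Not bad but not great", "bad bad beer", "great beer"]
top = top_words(sample, n=3)
print(top)
```

Because the split is case-sensitive, `Not` and `not` are counted as different tokens, which is also why the negative word list below needs several spellings of the same word.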



In [122]:
df_train.head()

Unnamed: 0,index,name,review,text,positive,word count
27358,40360,Weltenburger Hefe-Weissbier Hell,2.5,"This one pours a light to medium in body, dull...",False,112
12807,24250,Founders Nemesis 2010,2.0,I wasn't crazy about the 2009 version of this ...,False,155
5648,21420,Founders Breakfast Stout,5.0,"Again, yet another beer by Founders that say's...",True,92
15953,4662,Hop Master's Abbey Belgian-Style Double IPA,4.5,Picked this up at the Beer Stop in West Hazlet...,True,100
24691,20068,Founders Breakfast Stout,4.0,the holy grail! after years of longing i final...,True,86


In [123]:
negative_words = ['Disgustingly', 'bad', 'bad.', 'Bad', 'gross.', 
                  'Gross.', 'gross', 'disappointed', 'disappointing',
                  'poor', 'poor.', 'Poor', 'poorly']

After collecting the distinct word counts across the entire training set of reviews, we can pick out a list of words that seem to have a negative connotation. These matter more than words with a positive connotation: the share of positive reviews is already high, so we are mainly concerned with identifying negative reviews. We should generate a feature that counts the occurrences of negative words in a review and pass it to the model alongside the word count.
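
The list above has to spell out `bad`, `bad.` and `Bad` separately because matching is done on raw tokens. One possible variation, shown here only as a sketch and not as the approach used in this notebook, is to lowercase each token and strip surrounding punctuation before matching. `NEGATIVE_WORDS` and `count_negative_words` are hypothetical names:

```python
import string

# hypothetical normalized list: one entry covers 'bad', 'Bad' and 'bad.'
NEGATIVE_WORDS = {"disgustingly", "bad", "gross", "disappointed",
                  "disappointing", "poor", "poorly"}

def count_negative_words(text, negative_words=NEGATIVE_WORDS):
    """Count tokens that match a negative word after normalization."""
    count = 0
    for token in text.split():
        # lowercase and strip leading/trailing punctuation from the token
        normalized = token.lower().strip(string.punctuation)
        if normalized in negative_words:
            count += 1
    return count

print(count_negative_words("Bad aroma, poor head. Really bad!"))  # 3
```

Negation is not handled either way: a phrase such as "Not a bad brown ale" still contributes one negative word.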

In [124]:
def negative_word_count(df, negative_words):
    for index, row in df.iterrows():
        words = {}
        for word in row['text'].split():
            if word in negative_words and word in words:
                words[word] = words[word] + 1
            elif word in negative_words:
                words[word] = 1
        df.at[index, 'negative word count'] = sum(words.values())
negative_word_count(df_train, negative_words)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [125]:
sum(df_train['negative word count'] > 0) / len(df_train)

0.08596666666666666

In [126]:
df_train.columns

Index(['index', 'name', 'review', 'text', 'positive', 'word count',
       'negative word count'],
      dtype='object')

In [127]:
updated_model = RandomForestClassifier()
updated_model.fit(df_train.drop(['index', 'name', 'review', 'text', 'positive'], axis = 1), df_train['positive'])



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [128]:
# get negative word count for test data
negative_word_count(df_test, negative_words)
# get relevant columns for prediction
df_test_relevant = df_test.drop(['index', 'name', 'review', 'text', 'positive'], axis = 1)
y_pred = updated_model.predict(df_test_relevant)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [129]:
accuracy_score(df_test['positive'], y_pred)

0.6858666666666666

In [130]:
confusion_matrix(df_test['positive'], y_pred)

array([[ 259, 2145],
       [ 211, 4885]])

For our updated model, we predicted:

- 259 true negatives
- 211 false negatives
- 4885 true positives
- 2145 false positives

The model is still much better at predicting positive reviews, which makes sense since the data set leans towards positive reviews. It is not sophisticated enough to reliably detect a negative review: only 259 of the 2404 negative reviews in the test split (about 11%) were classified correctly. We will need to keep working on this...
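
To make the imbalance concrete, the usual per-class rates can be derived directly from the confusion matrix above. This is a plain-arithmetic sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, ordered `[False, True]`:

```python
# confusion matrix from the output above:
# rows are true labels, columns are predicted labels, order [negative, positive]
tn, fp = 259, 2145
fn, tp = 211, 4885

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
recall_positive = tp / (tp + fn)      # share of positive reviews found
recall_negative = tn / (tn + fp)      # share of negative reviews found
precision_positive = tp / (tp + fp)   # share of positive predictions that are right

print(f"accuracy:           {accuracy:.4f}")
print(f"recall (positive):  {recall_positive:.4f}")
print(f"recall (negative):  {recall_negative:.4f}")
print(f"precision (positive): {precision_positive:.4f}")
```

The accuracy matches the `accuracy_score` output above, while the recall on negative reviews shows how rarely the model catches a negative review.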

In [146]:
# NOTE: despite the name, df_pos holds the negative reviews (positive == False)
df_pos = df_train[df_train.positive == False]

In [148]:
df_pos_bad_words = df_pos[df_pos['negative word count'] > 0]

The aroma is malty but a bit hoppy. There was not much head to it. I would say about a finger and the carbonation in the body was thin. The body itself was golden. It is mainly a malty beer with a well hopped finish. The mouthfeel is medium and a bit slick. It wasn't bad I would expect more for the price. 



200ml bottle: One time brew..this is an ESB ale with natural cherry fruit added.		Poured a clear burnished copper color with ruby highlights into my goblet....small cap does not last or lace. Softly carbonated.		Aromas of musty grains and sour fruit and a light sweetness.		Fair malt presense, nedium-light body, soapy mouth feel.		You get to chew on some decent rye bread tastes in the front then the sour cherry and goldings combine to balance with a fruity bittering. This bittering continues into the wet finish which is abrupt but clean with a slight metallic-sour aftertaste.		An interesting use of fruit here...the cherries are picked for their sourness which seems to compliment th

I actually found a sixer of this at a store less than a month ago. I don't know when Founder's stopped production but this is still out there. It was on sale so I thought what the hell and got it. Well, this beer has gone bad. Sour notes and skunk and all. Just wanted to write to people not to get it. It actually begins decent but grows sour and sickening towards the finish. I really hope the Founder's guys make another lager-type beer and bottle it sometime in the future. 



12 oz. brown bottle poured into a snifter glass. Served at cellar temperature, 45 degrees F.		A fairly delicate pour produces a deep copper-amber-tinged brew with a half-fingered head of off-white (leaning on beige) foam that recedes somewhat quickly before resting into a thin crown that leaves behind some light, splotchy lacing on the walls of the glass.		Some barrel aging is evident in the nose, as notes of vanilla and a hint of coconut mingle with sweet caramel and a whiff of alcohol as the beer warms.		Medium

Thanks to HeadyHops81 for sending me this one.		Poured from a 12 ounce bottle into a glass beer mug. The bottle was actually date stamped, June 2009.		The beer poured a deep, rich brown. Slightly hazy, with a thin, white head. The head faded rather quickly and left some soapy lacing that slowly slid down the glass.		The aroma was a bit weak. Slightly earthy. A bit of sweet malt. There was a little wisp of nuttiness, but it had to warm significantly before it started to come out. 		The mouthfeel was also a little thin. Not overly thin, as to make it feel too watery, but I felt it could have used a bit more carbonation.		Like the nose, the taste was also a bit weak. It started with a nice, sweet malt. Very faint nuttiness. The hops bitterness kicked in about mid-swallow and followed through to the finish. Afterwards, there was a lingering caramel flavor.		Not a bad brown ale, but not a spectacular one, either. Overall, a good aroma and flavor, just too weak to carry it far. 



Got this 

In [157]:
def check_negative_words(df, negative_words):
    # tally how often each negative word appears across all reviews in df
    words = {}
    for index, row in df.iterrows():
        for word in row['text'].split():
            if word in negative_words:
                words[word] = words.get(word, 0) + 1
    return words

# lets check which bad words occur a lot in 
check_negative_words(df_pos_bad_words, negative_words)