# Test Project: Beer Reviews Analysis and Recommendation

###  Recommend beers based on the data

Data: 1.5M beer reviews from BeerAdvocate (https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz).


In [1]:
%matplotlib inline

### Import the NumPy, scikit-learn and pandas packages

In [2]:
import numpy as np
import sklearn
import pandas as pd

Read the "csv" file with the reviews into a pandas "data frame". Pandas provides readers for many file formats, and each of them returns the data as a "data frame".

In [3]:
beerData = pd.read_csv('/Users/phani/Downloads/beer_reviews/beer_reviews.csv', delimiter=",", encoding='utf-8')
# Download the reviews csv from the link above. Alternatively, use 'urlopen' and rebuild this command
# to read the csv file after unzipping the response
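
The 'urlopen' alternative mentioned in the comment could look roughly like the sketch below. The helper name `read_csv_from_targz` is hypothetical, and we assume the archive holds a single ".csv" member (the local path above suggests this, but we have not verified it). The demo builds a tiny archive in memory so that it runs without network access.

```python
import io
import tarfile

import pandas as pd


def read_csv_from_targz(fileobj, member_suffix=".csv"):
    """Return the first archive member ending in member_suffix as a DataFrame.

    Hypothetical helper: assumes a gzip-compressed tar archive.
    """
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(member_suffix):
                return pd.read_csv(archive.extractfile(member), encoding="utf-8")
    raise FileNotFoundError("no archive member ends with " + member_suffix)


# Self-contained demo: build a two-row archive in memory instead of downloading.
csv_bytes = b"beer_beerid,review_overall\n1,4.5\n2,3.0\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(name="beer_reviews/beer_reviews.csv")
    info.size = len(csv_bytes)
    archive.addfile(info, io.BytesIO(csv_bytes))
buffer.seek(0)

demoDF = read_csv_from_targz(buffer)
print(demoDF)

# With network access, the same helper could be fed the downloaded bytes:
# from urllib.request import urlopen
# url = "https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz"
# with urlopen(url) as response:
#     beerData = read_csv_from_targz(io.BytesIO(response.read()))
```

Reading the whole response into memory keeps the sketch short; for a large archive, saving it to disk first may be the safer choice.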

In [4]:
# Optional inspection of the columns and their null counts (left commented out)
#for i in range(len(beerData.columns)):
#    print("Column",i, ": ", beerData.columns[i])
#
#print ("*************************")
#print("Columns with Null Values")
#for column in beerData.columns:
#    count = len(beerData.loc[pd.isnull(beerData[column])])
#    if count > 0 :
#       print ("Column Name: ", column, " Null Values: ", count)
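
The same null-value check can be written without an explicit loop. This is a minimal sketch on a made-up toy frame (`toyDF` and its values are illustrative only, not taken from the reviews data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for beerData; the values are invented for illustration.
toyDF = pd.DataFrame({
    "beer_beerid": [1, 2, 3],
    "beer_abv": [5.0, np.nan, np.nan],
    "review_profilename": ["a", None, "c"],
})

# Count nulls per column and keep only the columns that have any.
nullCounts = toyDF.isnull().sum()
nullCounts = nullCounts[nullCounts > 0]
print(nullCounts)
```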

#### Q2. Recommend beers based on the data - Approach 1

A simple ordering of the data by review_overall gives us a list of beers and their corresponding ranking.

#### Preprocessing
The number of reviews is not the same for every beer. So, we use the sample mean of the reviews gathered for a beer as its single review_overall value. However, some beers have only one review whereas others have many. Hence, we need to clean up the data to include only those beers whose mean we can estimate within a certain margin of error.

The statistical way to choose the threshold (the minimum number of samples) is to compute the number of reviews a beer needs so that its mean is estimated within the margin of error at a 95% confidence level.

We use this formula: $n = \frac{\sigma^2 Z^2}{m^2}$, where $n$ is the required number of reviews, $\sigma$ is the standard deviation of the sample, $Z = 1.96$ is the Z-score for a 95% confidence level and $m$ is the allowed margin of error.
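
As a quick worked example of the formula, the sketch below plugs in an illustrative standard deviation of 0.7 (an assumed value, not one computed from the data) together with the Z-score and margin of error used in this notebook. The function name `requiredSamples` is ours.

```python
import math


def requiredSamples(stdDev, zScore=1.96, mError=0.1):
    """Minimum sample size n = (sigma * Z / m)^2 from the formula above."""
    return (stdDev * zScore / mError) ** 2


# stdDev = 0.7 is an illustrative value only.
n = requiredSamples(0.7)
print(n, math.ceil(n))
```

With these inputs the formula gives about 188.24, so such a beer would need at least 189 reviews to pass the filter.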

In [5]:
# define a new dataframe with four attributes
samplesDF = beerData[["beer_beerid","beer_name","review_overall", "review_profilename"]]

# drop duplicate reviews for the same beer
samplesDF = samplesDF.drop_duplicates(["beer_beerid","review_profilename"])

# set indices for determining levels
samplesDF = samplesDF.set_index(["beer_beerid","beer_name"])

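# Calculate nSamples, sampleMeans, sampleStdDev for the numeric review_overall column
# (review_profilename is text, so it is excluded from mean and std)
groupedReviews = samplesDF.groupby(level=0)[["review_overall"]]
nSamples = groupedReviews.count().to_dict()
sampleMeans = groupedReviews.mean().to_dict()
sampleStdDev = groupedReviews.std()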


# Define Margin of Error and Z-score for 95% confidence interval
mError = 0.1
zScore = 1.96

# filter out sampleMeans with fewer reviews than the minimum required
# to achieve a 95% confidence interval, sort sampleMeans and rank beer_ids 
# from the sorted sampleMeans
# reject samples with std dev = 0.0 (or an undefined std dev, i.e. a single review)

sampleMeansTemp = {}
for key in nSamples.keys(): 
    if key == "review_overall": # we are only interested in review_overall
        for beerID in nSamples[key].keys(): # get the values - beer_beerid and overall review
            if sampleStdDev[key][beerID] > 0:
                nSamplesRequired = (sampleStdDev[key][beerID] * zScore/mError)**2
                # compare inside this branch so a stale nSamplesRequired is never reused
                if nSamples[key][beerID] > nSamplesRequired:
                    sampleMeansTemp[beerID] =  sampleMeans[key][beerID]

# redefine sampleMeans by sorted overall_reviews 
sampleMeans = sorted(sampleMeansTemp.items(), key=lambda x: x[1] , reverse=True)

# Filter out the beerIDs that are not included in the sampleMeans list
# and make a new dataframe.
# Appending rows to build a new data frame takes a lot of time, so instead
# we take the original data frame and drop rows by comparing beerIDs

reviewBeerIDs = [beerKey[0] for beerKey in sampleMeans]
# drop the duplicate beerIDs 
newBeerDF = beerData.drop_duplicates(["beer_beerid"])
beerIDsAll = newBeerDF.beer_beerid.tolist()

# list the IDs that we need to discard (set lookup avoids scanning the list for every beer)
reviewBeerIDSet = set(reviewBeerIDs)
discardBeerIDs = [beerID for beerID in beerIDsAll if beerID not in reviewBeerIDSet]
newBeerDF = newBeerDF.set_index(["beer_beerid"])
newBeerDF = newBeerDF.drop(discardBeerIDs)

# drop the other columns and keep only a few for visualization
newBeerDF = newBeerDF.drop(['brewery_id','review_time', 'review_overall','review_aroma','review_taste',
                    'review_palate','review_profilename','beer_abv','review_appearance'], axis=1)

# Create a column review_overall with values from sampleMeans
# (a dict lookup replaces the nested search over the sorted list)
sampleMeansDict = dict(sampleMeans)
review_overall = [sampleMeansDict[beerIndex] for beerIndex in newBeerDF.index.tolist()]

# add the column review_overall values from sampleMeans list
newBeerDF['review_overall'] = review_overall

# sort the dataframe by overall reviews and print the top ten beers in the list
newBeerDF = newBeerDF.sort_values(by='review_overall', ascending=False)
newBeerDF.head(10)

| beer_beerid | brewery_name | beer_style | beer_name | review_overall |
|---|---|---|---|---|
| 63649 | Peg's Cantina & Brewpub / Cycle Brewing | American Double / Imperial Stout | Rare D.O.S. | 4.848485 |
| 44910 | De Struise Brouwers | Lambic - Unblended | Dirty Horse | 4.820513 |
| 8626 | Southampton Publick House | Berliner Weissbier | Southampton Berliner Weisse | 4.768293 |
| 68548 | Brouwerij Drie Fonteinen | Gueuze | Armand'4 Oude Geuze Lente (Spring) | 4.730769 |
| 70356 | Brouwerij Drie Fonteinen | Gueuze | Armand'4 Oude Geuze Zomer (Summer) | 4.644444 |
| 56082 | Kern River Brewing Company | American Double / Imperial IPA | Citra DIPA | 4.628049 |
| 36316 | Brasserie Cantillon | Lambic - Fruit | Cantillon Blåbær Lambik | 4.625806 |
| 41928 | Russian River Brewing Company | American Wild Ale | Deviation - Bottleworks 9th Anniversary | 4.620536 |
| 16814 | The Alchemist | American Double / Imperial IPA | Heady Topper | 4.61851 |
| 1545 | Brouwerij Westvleteren (Sint-Sixtusabdij van W... | Quadrupel (Quad) | Trappist Westvleteren 12 | 4.617925 |
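
The filter-and-rank step of Approach 1 can also be expressed as a single groupby aggregation. The following is a minimal sketch on invented toy ratings, not the notebook's actual code path; the margin of error is loosened to 0.5 here (an assumption made only so that the tiny groups can pass the filter).

```python
import pandas as pd

# Toy ratings, invented for illustration.
toyDF = pd.DataFrame({
    "beer_beerid": [1] * 6 + [2] * 2 + [3] * 3 + [4] * 4,
    "review_overall": [4.5, 4.5, 4.0, 4.5, 4.5, 4.5,   # beer 1
                       1.0, 5.0,                       # beer 2: too few reviews
                       3.0, 3.0, 3.0,                  # beer 3: std dev = 0
                       5.0, 4.5, 5.0, 5.0],            # beer 4
})
zScore, mError = 1.96, 0.5  # mError loosened for the toy data

# One row per beer with the review count, mean and sample std dev.
stats = toyDF.groupby("beer_beerid")["review_overall"].agg(["count", "mean", "std"])
stats["required"] = (stats["std"] * zScore / mError) ** 2

# Keep beers with a positive std dev and more reviews than required, then rank.
keep = (stats["std"] > 0) & (stats["count"] > stats["required"])
ranked = stats[keep].sort_values("mean", ascending=False)
print(ranked)
```

Beer 2 is dropped because its two widely spread reviews are far fewer than required, and beer 3 is dropped because its standard deviation is zero.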


#### Q2. Recommend beers based on the data - Build a content-based recommendation system

Data Matrix: Let's assume that the ratings are correct and correlate with user preferences. The data matrix contains the overall reviews, with beers on one axis and users on the other. We identify users by review_profilename. Hence, data points with a missing "review_profilename" are dropped from the data matrix.

Features Matrix: The features matrix contains four features (appearance, aroma, palate, taste) rated by different users. We assume that these ratings reflect the users' tastes. We will not use ABV% as a feature here. It has many undefined values, and dropping them could shrink our dataset.

We will use the reduced dataset from the above approach, which keeps only the beers that satisfy the minimum number of samples required to estimate the population mean at a 95% confidence level, given a sample stdDev and margin of error. Here we will not average review_overall, but we will compute the sample mean of each attribute: aroma, palate, appearance and taste.

Finally, we use linear regression with regularization to calculate the parameters that relate the features to the overall rating.
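
The regression step is not implemented in the cells that follow, so here is one way it could look. This is a minimal sketch under our own assumptions: ridge (L2) regularization, a closed-form solution, one parameter vector fitted per user, and 0 meaning "not rated" as in the zero-initialised data matrix. The names `fitRidge` and `predict` and all numbers are hypothetical.

```python
import numpy as np


def fitRidge(X, y, lam=1.0):
    """Closed-form ridge regression; the intercept term is not penalised."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def predict(theta, X):
    """Predicted overall rating for each row of X."""
    return theta[0] + X @ theta[1:]


# Toy feature means per beer: appearance, aroma, palate, taste (invented values).
features = np.array([
    [4.0, 4.5, 4.0, 4.5],
    [3.0, 3.0, 3.5, 3.0],
    [4.5, 4.0, 4.5, 4.5],
    [2.5, 2.0, 2.5, 2.0],
    [3.5, 4.0, 3.5, 4.0],
])
# One user's overall ratings for the same beers; 0 means "not rated".
ratings = np.array([4.5, 3.0, 0.0, 2.0, 0.0])

rated = ratings > 0
theta = fitRidge(features[rated], ratings[rated], lam=0.1)

# Predict this user's rating for the beers they have not rated yet.
predicted = predict(theta, features[~rated])
print(predicted)
```

Beers with the highest predicted rating would then be the recommendations for that user. scikit-learn's `Ridge` estimator could be used instead of the closed-form solve.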



In [39]:
# Data Matrix
dataDF = beerData[["beer_beerid","review_overall", "review_profilename"]]

# drop duplicate reviews for the same beer
dataDF = dataDF.drop_duplicates(["beer_beerid","review_profilename"])

# set indices for determining levels
dataDF = dataDF.set_index("beer_beerid")

# drop the beers with fewer reviews than the required number of samples
# Refer to the discardBeerIDs variable from above
dataDF = dataDF.drop(discardBeerIDs)

# Make a numpy 2D array with beerIDs on one axis 
# and profilenames on the other axis
dataDF = dataDF.reset_index()
beerIDList = sorted(dataDF.beer_beerid.unique())
profileList = dataDF.review_profilename.unique()

# Define Data Matrix (rows: beers, columns: profiles, 0 = not rated)
dataMatrix = np.zeros((len(beerIDList), len(profileList)))

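# Reindex so the dataframe can be accessed with the two indices of the numpy array
dataDF = dataDF.set_index(["beer_beerid","review_profilename"])

# NOTE: only the first two beers are filled in here; use range(len(beerIDList)) for the full matrix
for beerID in range(2):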
    for profile in range(len(profileList)):
        try:
            rating = dataDF.loc[(beerIDList[beerID],profileList[profile]),"review_overall"]
            dataMatrix[beerID][profile] = rating
            #print(profile, beerID, rating)
        except KeyError:
            pass
    
            
#print (len([dataMatrix[0][i] for i in range(len(dataMatrix[0])) if dataMatrix[0][i] > 0]))

420


In [59]:
# Feature Matrix
# Follow the first few steps from dataMatrix
featureDF = beerData[["beer_beerid", "review_profilename",'review_appearance','review_aroma', 
                      'review_palate','review_taste']]
featureDF = featureDF.drop_duplicates(["beer_beerid","review_profilename"])
featureDF = featureDF.set_index("beer_beerid")
featureDF = featureDF.drop(discardBeerIDs)
featureDF = featureDF.reset_index()

# Make lists that match data matrix indices
beerIDList = sorted(featureDF.beer_beerid.unique())
profileList = featureDF.review_profilename.unique()

# We are interested in beer features. So we index the data frame using ID and profilename
# and calculate sample means of all features
featureDF = featureDF.set_index(["beer_beerid","review_profilename"])

# features sampleMeans
featuresDict = featureDF.groupby(level=0).mean().to_dict()
appearanceSampleMeans = featuresDict['review_appearance']
aromaSampleMeans = featuresDict['review_aroma']
palateSampleMeans = featuresDict['review_palate']
tasteSampleMeans = featuresDict['review_taste']

# Define featureMatrix: column 0 holds the beer ID, columns 1-4 hold the feature means
featureMatrix = np.zeros((len(beerIDList), 5))
featuresMeansDicts = [appearanceSampleMeans,aromaSampleMeans,
                      palateSampleMeans,tasteSampleMeans]

for beerIndex, beerID in enumerate(beerIDList):
    featureMatrix[beerIndex][0] = beerID
    # look up each feature mean directly by beer ID instead of scanning all keys
    for featureIndex, featureDict in enumerate(featuresMeansDicts, start=1):
        if beerID in featureDict:
            featureMatrix[beerIndex][featureIndex] = featureDict[beerID]

[  7.68160000e+04   4.11000000e+00   3.95000000e+00   3.94000000e+00
   4.01000000e+00]
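
The same feature matrix layout (beer ID in column 0, then the four feature means) can be produced directly from a groupby, without the intermediate dictionaries. This is a sketch on invented toy reviews:

```python
import pandas as pd

# Toy reviews, invented for illustration.
toyDF = pd.DataFrame({
    "beer_beerid": [10, 10, 20],
    "review_profilename": ["a", "b", "a"],
    "review_appearance": [4.0, 3.0, 5.0],
    "review_aroma": [4.0, 4.0, 3.5],
    "review_palate": [3.5, 4.5, 4.0],
    "review_taste": [4.0, 5.0, 4.5],
})
featureCols = ["review_appearance", "review_aroma", "review_palate", "review_taste"]

# Mean of each feature per beer, sorted by beer ID like beerIDList.
featureMeans = toyDF.groupby("beer_beerid")[featureCols].mean().sort_index()

# reset_index moves the beer ID into column 0 of the resulting array.
toyFeatureMatrix = featureMeans.reset_index().to_numpy(dtype=float)
print(toyFeatureMatrix)
```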


In [None]:
# drop rows with NA values in the columns we use
# (select the columns first so that NA values in unused columns such as beer_abv do not remove rows)
beer_ratings = beerData.loc[:,["beer_beerid","review_overall",
                               "review_taste","review_aroma",
                               "review_palate","review_appearance",
                               "review_profilename"]].dropna(how='any')
# Check for duplicate entries -  condition: same user, same rating and same beer_id
print (len(beer_ratings)) #- before duplicates removed
beer_ratings = beer_ratings.drop_duplicates()
print (len(beer_ratings)) #- after removing duplicates

In [None]:
# keep a single review per (beer, user) pair
beer_ratings = beer_ratings.drop_duplicates(subset=['beer_beerid','review_profilename'])
print (len(beer_ratings)) #- after removing repeat reviews of the same beer by the same user

Create a data matrix (again a data frame) by rearranging the beer_ratings data frame on two indices (beer ID and user name) and extracting the overall rating column.

In [None]:
# Set indices
#beer_ratings.set_index(['beer_beerid','review_profilename'], inplace=True)
#beer_ratings
# Create a new dataframe which is our 'Data Matrix' that contains
# beer IDs as rows, usernames as columns and overall reviews as datapoints
# Some values are NA because not every user rated every beer
#beer_datamatrix = beer_ratings.to_panel().review_overall
#beer_datamatrix
print (len(beer_ratings))
#beer_ratings.to_csv("testdata.csv")
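
The rearrangement described above can be sketched with `set_index` followed by `unstack`, which avoids `to_panel` (recent pandas releases no longer provide it). The toy ratings below are invented for illustration. On the full dataset the resulting frame may be very large, so this is a sketch of the idea rather than a recommendation to run it as is.

```python
import pandas as pd

# Toy ratings, invented for illustration.
toyRatings = pd.DataFrame({
    "beer_beerid": [10, 10, 20, 30],
    "review_profilename": ["ann", "bob", "ann", "cy"],
    "review_overall": [4.5, 3.0, 5.0, 2.5],
})

# Rows: beer IDs, columns: user names, values: overall rating (NaN = not rated).
dataMatrixDF = (
    toyRatings
    .set_index(["beer_beerid", "review_profilename"])["review_overall"]
    .unstack()
)
print(dataMatrixDF)
```

`unstack` requires each (beer, user) pair to appear only once, which is what the duplicate removal in the earlier cell ensures.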

In [None]:
# Test code: build a small data matrix from the first 200 ratings
df = beer_ratings[:200]
print (len(df.review_profilename))
set_profiles = sorted(set(df.review_profilename))
set_beerids = sorted(set(df.beer_beerid))
# allocate the 2D matrix directly (reshape returns a new array and does not modify in place)
dataM_sub = np.zeros((len(set_beerids), len(set_profiles)))

df_new = df.set_index(["beer_beerid","review_profilename"])

for i, beerid in enumerate(set_beerids):
    for j, profile in enumerate(set_profiles):
        # store only the overall rating, and only where this user rated this beer
        if (beerid, profile) in df_new.index:
            dataM_sub[i][j] = df_new.loc[(beerid, profile), "review_overall"]

print (dataM_sub)
#dfw = beer_ratings.to_panel().review_overall
#dfw = beer_ratings.index.tolist()
#print (len(dfw))
#beer_ratings.index.value_counts()
#print (len(list(set(dfw))))
#for i in list(set(dfw))[:10]:
    