# Beer Reviews: Analysis and Recommendation

###  Recommend beers based on the reviews
#### Customized Recommendation: Collaborative filtering and Content-based recommendation system

#### Data
1.5M beer reviews from beer advocate. (https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz)


In [1]:
%matplotlib inline

### Import packages Numpy, Scikit Learn and Pandas

In [2]:
import numpy as np
import sklearn
import pandas as pd

Read in the "csv" file with reviews into a pandas "data frame". Use Pandas utility to read csv and store the data in a "data frame"

In [3]:
beerData = pd.read_csv('/Users/phani/Downloads/beer_reviews/beer_reviews.csv', delimiter=",", encoding='utf-8')
# download the revievs csv form the link above. Alternatively, use 'urlopen' and rebuild this command 
#to read the csv file after unzipping the response 

### Note: 
We will follow the data cleaning procedure described in the first Notebook (beer_reviews_analysis.ipynb) The code is reused here. For detailed documentation refer to that notebook(beer_reviews_analysis.ipynb)

In [4]:
samplesDF = beerData[["beer_beerid","beer_name","review_overall", "review_profilename"]]
samplesDF = samplesDF.drop_duplicates(["beer_beerid","review_profilename"])
samplesDF = samplesDF.set_index(["beer_beerid","beer_name"])
nSamples = samplesDF.groupby(level=0).count().to_dict()
sampleMeans = samplesDF.groupby(level=0).mean().to_dict()
sampleStdDev = samplesDF.groupby(level=0).std()
mError = 0.1
zScore = 1.96

sampleMeansTemp = {}
for key in nSamples.keys(): 
    if key == "review_overall": # we are only interested in overall_review
        for beerID in nSamples[key].keys(): # get the values - beer_beerid and overall review
            if sampleStdDev[key][beerID] > 0:
                nSamplesRequired = (sampleStdDev[key][beerID] * zScore/mError)**2
            if nSamples[key][beerID] > nSamplesRequired:
                sampleMeansTemp[beerID] =  sampleMeans[key][beerID]

# redefine sampleMeans by sorted overall_reviews 
sampleMeans = sorted(sampleMeansTemp.items(), key=lambda x: x[1] , reverse=True)

reviewBeerIDs = [beerKey[0] for beerKey in sampleMeans]
# drop the duplicate beerIDs 
newBeerDF = beerData.drop_duplicates(["beer_beerid"])
beerIDsAll = newBeerDF.beer_beerid.tolist()

# list the iDs that we need to discard
discardBeerIDs = [beerID for beerID in beerIDsAll if beerID not in reviewBeerIDs]


## Recommender system and collaborative filtering - Customized Recommendations

In a collaborative filtering based recommendation system, we will compute preferences for each user (reviewer == user) and the  predict the rating for each beer (that the particular user has not rated before). Based on the predictions, we will recommend the beers that the user hasn't tried before and he may like based on his proferences.

#### Data Matrix: 

Rows are the beerIDs and the columns correspond to user ratings. Each cell in the matrix correspond to a beer and the overall rating for that beer by the user in that column. 

Since all users have not rated all beers, many of the cells are zero. 

#### Feature Matrix:
Feature matrix rows are beerIDs and columns are features. It contains four features - appearance,aroma, palate, taste. The data in the columns corresponding to a particular beerID is the samplemeans of each feature from all the ratings given for that beerID. We assume that the mean rating is proportional to the attributes of a given beer. (We will not use ABV% here as one of the feature. It has many undefined values that could shrink our feature set.)

#### Data Cleaning:
1. We collect user info based on review_profilename. Hence, some data points with missing "review_profilename" are dropped from the data matrix. 
2. We will use the reduced dataset from the above approach that satisfy minimum number of samples required to predict population means with 95% confidence level given a sample stdDev and margin of error. 
3. We will use sampleMeans for each attribute: aroma, palette, appearance and taste
4. Overall_reviews are the individual reviews given by a user

#### Build the Feature Matrix

In [5]:
# Feature Matrix
featureDF = beerData[["beer_beerid", "review_profilename",'review_appearance','review_aroma', 
                      'review_palate','review_taste','review_overall']]
featureDF = featureDF.drop_duplicates(["beer_beerid","review_profilename"])
featureDF = featureDF.set_index("beer_beerid")

# discard the beers that didn't meet our screening criterion of 95% confidence level
featureDF = featureDF.drop(discardBeerIDs)
featureDF = featureDF.reset_index()


# Make lists that match data matrix indices
beerIDList = sorted(featureDF.beer_beerid.unique())
profileList = featureDF.review_profilename.unique()

# Reindex the dataframe for extracting features.
featureDF = featureDF.set_index(["beer_beerid","review_profilename"])

#Debug Info:
    #print(len(beerIDList),len(profileList))



In [6]:
# features sampleMeans
featuresDict = featureDF.groupby(level=0).mean().to_dict()
appearanceSampleMeans = featuresDict['review_appearance']
aromaSampleMeans = featuresDict['review_aroma']
palateSampleMeans = featuresDict['review_palate']
tasteSampleMeans = featuresDict['review_taste']

# Construct a numpy matrix with features Sample Means
featureMatrix = np.zeros(len(beerIDList*5)).reshape(len(beerIDList),5)
featuresMeansDicts = [appearanceSampleMeans,aromaSampleMeans,
                      palateSampleMeans,tasteSampleMeans]

# Populate the first element of the feature matrix with beerID
for beerIndex in range(len(beerIDList)):
    featureMatrix[beerIndex][0] = beerIDList[beerIndex]
    
featureIndex = 1 # feature index in feature Matrix
for featureDict in featuresMeansDicts:
    for beerIndex in range(len(beerIDList)):
        for key in featureDict.keys():
            if key == beerIDList[beerIndex]:
                featureMatrix[beerIndex][featureIndex] = featureDict[key]
    featureIndex += 1

# Add bias column in the featureMatrix
featureMatrix = np.insert(featureMatrix,1,1, axis=1)

#### Define Data Matrix and compute userPreferences

Constructing a data matrix with all users and all beers takes a lot of time and memory
We will take one user (or a block of users) at a time and evaluate the paramater matrix that correspond to the user preferences. 



In [7]:
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
import timeit

#Create a dictionaryobject that holds users (profileNames) and 
# corresponding preference matrices
userPreferences = {}

# we will use this sampleMeans for mean normalization in Data Matrix
#sampleMeans = featuresDict['review_overall'] 


# For the total number of users in the list ~30000, it takes lot of time (> 5 hrs).
# time consuming constructs here:
# lookup in data frame is approximately 350us
# regression is between 400-700us
# generating X and y numpy arrays around 100us

# Generating user preferences for first 100 users
dataDF = featureDF.drop(['review_appearance','review_aroma', 
                      'review_palate','review_taste'], axis=1)

#Dict object to hold a subset of profileNames with R^2 value greater than a certain threshold
bestScores = {}
for profile in profileList[:1000]:
    dataMatrix = np.zeros(len(beerIDList*2)).reshape(len(beerIDList),2)
    for beerIndex in range(len(beerIDList)):
        dataMatrix[beerIndex][0] = beerIDList[beerIndex]
        try:
            dataMatrix[beerIndex][1] = dataDF.loc[beerIDList[beerIndex],profile].tolist()[0]
        except KeyError:
            dataMatrix[beerIndex][1] = 0.0
            
            
    # subtract sample Means to do mean normalization
    #if dataMatrix[beerIndex][1] > 0:
    #     for key in sampleMeans.keys():
    #        if key == beerIDList[beerIndex]:
    #            dataMatrix[beerIndex][1] -= sampleMeans[key]
                
    # X and y matrices for linear regression
    # Including all the rows resultd in poor R^2 values
    # Hence, only the rows with reviews are included in the fit
    # Bias term not included
    y = np.array([dataMatrix[i][1] for i in range(dataMatrix.shape[0]) if dataMatrix[i][1] > 0])
    X = np.array([featureMatrix[i][1:] for i in range(featureMatrix.shape[0]) if dataMatrix[i][1] > 0])

    
    # linear regression to compute parameter matrix
    regressor = LinearRegression()
    regressor.fit(X,y)
    score = regressor.score(X,y)
    userPreferences[profile] = [regressor.coef_,score]
    
    # we will populate the dict with profile names whose scores are above a certain threshold
    if score > 0.5:
        bestScores[profile] = [profile,score]


    # print('Weight coefficients: ', regressor.coef_)
    # print('y-axis intercept: ', regressor.intercept_)
    # print(regressor.score(X,y))
    
    
# Other possible regression options
    # KNeighbors Regressor
    # for i in range(1,10):
        # kneighbor_regression = KNeighborsRegressor(n_neighbors=i)
        # kneighbor_regression.fit(X, y)
        # print ("No of Neighbors: ", i)
        # print(kneighbor_regression.score(X, y))

    # Ridge Regreession. Includes Regularization (L2 penalty)
    # print ("Ridge Regression (L2 penalty)")
    # ridge_models = {} 
    # for alpha in [100, 10, 1, .01]:
        # ridge = Ridge(alpha=alpha).fit(X, y)
        # print("alpha = :", alpha)
        # print(ridge.score(X, y))
        # ridge_models[alpha] = ridge
        
    # print ("Lasso Regression (L1 penalty)")
    # lasso_models = {}
    #for alpha in [.01,0.001]:
        # lasso = Lasso(alpha=alpha).fit(X, y)
        # print("alpha = :", alpha)
        # print(lasso.score(X, y))
        # lasso_models[alpha] = lasso

#### Predicting the review based on featureMatrix and userPreferences

After populating the userPreferences dictionary with user preferences, 
for a given user, we can predict what his/her rating could be based on the feature matrix and his preferences

For Example, the 10th user in profileList has rated about 40 beers out of 2529 in the list. Based on our prediction, we will recommend other beers that he may like.

In [8]:
# Randomly select a user in the list of profiles and get reommendations

import random

if len(list(bestScores.keys())) > 0:
    profileName  = list(bestScores.keys())[random.randrange(0,len(list(bestScores.keys())))]
    r2score = bestScores[profileName][1]
else:
    profileName = profile
    r2score = score

# List of beers that weren't rated by the user
beersNotRated = np.array([featureMatrix[i][0] for i in range(featureMatrix.shape[0])if dataMatrix[i][1] == 0])

# Feature list of beers
X_notRated = np.array([featureMatrix[i][1:] for i in range(featureMatrix.shape[0]) if dataMatrix[i][1] == 0])

# computed userPreferences from our regression analysis
userPref = userPreferences[profileName][0]
userPref = userPref[:,np.newaxis]

# Compute predicted ratings
y_predRatings = np.dot(X_notRated, userPref)
# We will cap the rating at 5.0 Our regression analysis has predicted values over 5
for i in range(y_predRatings.shape[0]):
    if y_predRatings[i] > 5.0:
        y_predRatings[i] = 5.0

# prepare a new dataFrame to display the top recommendations
predDF = pd.DataFrame()
# Add two columns
predDF['beerID'] = [int(beersNotRated[i]) for i in range(beersNotRated.shape[0])]
predDF['predRating'] = [y_predRatings[i][0] for i in range(beersNotRated.shape[0])]
# Sort in descending order. Main column is the predicted overall rating
predDF.sort_values(by='predRating', ascending=False, inplace=True)

#Extract beerIDs to collect beerName and beerStyle information from original dataFrame 
predBeerIDs = predDF.beerID.tolist() 

# Add other columns to the recommendation dataframe
avgAppearance = [appearanceSampleMeans[i] for i in predBeerIDs]
avgAroma = [aromaSampleMeans[i] for i in predBeerIDs]
avgPalate = [palateSampleMeans[i] for i in predBeerIDs]
avgTaste = [tasteSampleMeans[i] for i in predBeerIDs]
beerNames = [beerData[beerData.beer_beerid == i].beer_name.tolist()[0] for i in predBeerIDs]
breweryNames = [beerData[beerData.beer_beerid == i].brewery_name.tolist()[0] for i in predBeerIDs]
beerStyles = [beerData[beerData.beer_beerid == i].beer_style.tolist()[0] for i in predBeerIDs]
predDF['Aroma'] = avgAroma
predDF['Appearance'] = avgAppearance
predDF['Palate'] = avgPalate
predDF['Taste'] = avgTaste
predDF['BeerName'] = beerNames
predDF['BreweryName'] = breweryNames
predDF['BeerStyle'] = beerStyles

### Our top recommendations for the selected user

In [9]:
print ("ProfileName: ", profileName)
print ("R$^2$ value: ",r2score)
topRecommendations = predDF.head(10).set_index("BeerName")
topRecommendations.drop("beerID", axis=1, inplace=True)
topRecommendations

ProfileName:  drumminbrewer
R$^2$ value:  0.762631741357


Unnamed: 0_level_0,predRating,Aroma,Appearance,Palate,Taste,BreweryName,BeerStyle
BeerName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Rare D.O.S.,4.262129,4.757576,4.469697,4.80303,4.848485,Peg's Cantina & Brewpub / Cycle Brewing,American Double / Imperial Stout
King Henry,4.229985,4.52551,4.091837,4.494898,4.673469,Goose Island Beer Co.,English Barleywine
Kuhnhenn Raspberry Eisbock,4.19748,4.525401,3.958556,4.355615,4.470588,Kuhnhenn Brewing Company,Eisbock
Bourbon Barrel Aged Hi-Fi Rye,4.134359,4.392857,3.991071,4.383929,4.526786,Flossmoor Station Restaurant & Brewery,American Barleywine
Kuhnhenn Bourbon Barrel Fourth Dementia,4.113941,4.55102,3.938776,4.339286,4.637755,Kuhnhenn Brewing Company,Old Ale
Rare Bourbon County Stout,4.05331,4.659919,4.271255,4.593117,4.767206,Goose Island Beer Co.,American Double / Imperial Stout
Mango Mama,3.913836,4.371287,3.757426,4.128713,4.381188,Minneapolis Town Hall Brewery,American IPA
Wooden Hell,3.869799,4.6,4.18,4.46,4.606667,Flossmoor Station Restaurant & Brewery,English Barleywine
Vanilla Bean Aged Dark Lord,3.861984,4.717105,4.450658,4.674342,4.710526,Three Floyds Brewing Co. & Brewpub,Russian Imperial Stout
Cuvee De Castleton,3.848143,4.336022,3.806452,4.153226,4.376344,Captain Lawrence Brewing Co.,American Wild Ale
