# Test Project: Beer Reviews Analysis and Recommendation

###  Recommend beers based on the data

Data: 1.5M beer reviews from BeerAdvocate (https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz).


In [2]:
%matplotlib inline

### Import packages NumPy, scikit-learn and pandas

In [3]:
import numpy as np
import sklearn
import pandas as pd

Read the "csv" file of reviews into a pandas "data frame". Pandas provides readers for many file formats, each of which loads the data into a "data frame".

In [4]:
beer_data = pd.read_csv('/Users/phani/Downloads/beer_reviews/beer_reviews.csv', delimiter=",", encoding='utf-8')
# Download the reviews csv from the link above. Alternatively, use 'urlopen' and adapt this command
# to read the csv file after unzipping the response.
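
The download-and-unzip alternative mentioned in the comment above could look roughly like the sketch below. The helper names `read_csv_from_tar_gz` and `download_beer_reviews` are hypothetical, and the code assumes the archive contains a single reviews file ending in ".csv"; listing the archive contents first is the safest way to confirm that.

```python
import io
import tarfile
from urllib.request import urlopen

import pandas as pd


def read_csv_from_tar_gz(fileobj, member_suffix=".csv"):
    """Return the first regular file ending in member_suffix as a data frame.

    Hypothetical helper: fileobj is any binary file-like object that holds
    a .tar.gz archive.
    """
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            base_name = member.name.rsplit("/", 1)[-1]
            # skip directories and hidden metadata files such as "._name.csv"
            if not member.isfile() or base_name.startswith("."):
                continue
            if base_name.endswith(member_suffix):
                # parse while the archive is still open
                return pd.read_csv(archive.extractfile(member), encoding="utf-8")
    raise FileNotFoundError("no member ending in %r found" % member_suffix)


def download_beer_reviews(url):
    """Download the whole archive into memory, then parse the csv inside it."""
    with urlopen(url) as response:
        return read_csv_from_tar_gz(io.BytesIO(response.read()))


# Not executed here because it needs network access:
# beer_data = download_beer_reviews(
#     "https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz")
```

Reading the archive into memory keeps the sketch short; for a large download, saving it to disk first and opening it with `tarfile.open(path)` may be the better choice.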

In [39]:
#for i in range(len(beer_data.columns)):
#    print("Column",i, ": ", beer_data.columns[i])
#
#print ("*************************")
#print("Columns with Null Values")
#for column in beer_data.columns:
#    count = len(beer_data.loc[pd.isnull(beer_data[column])])
#    if count > 0 :
#       print ("Column Name: ", column, " Null Values: ", count)

#### Q2. Recommend beers based on the data - Approach 1

Simply ordering the data by review_overall should give us a ranked list of beers.

#### Preprocessing
The number of reviews is not the same for all beers, so we assign each beer a single overall rating: the sample mean of its review_overall values. However, some beers have only one review whereas others have many. Hence, we need to clean up the data to include only those beers for which the mean can be estimated within a certain margin of error.

The statistical way to choose the threshold (the minimum number of reviews per beer) is to compute the sample size required to estimate the mean within that margin of error at a 95% confidence level.

We use the formula $n = \frac{\sigma^2 Z^2}{m^2}$, where $n$ is the minimum number of reviews, $\sigma$ is the standard deviation of the sample, $Z = 1.96$ is the Z-score for a 95% confidence level, and $m$ is the allowed margin of error.
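
As a quick illustration of the formula, the sketch below computes the required number of reviews for a few standard deviations. The function name `min_reviews_required` and the example standard deviations are made up for illustration; only Z = 1.96 and m = 0.1 come from this notebook.

```python
import math


def min_reviews_required(std, margin=0.1, z=1.96):
    """Smallest whole number n with n >= (std * z / margin) ** 2."""
    return math.ceil((std * z / margin) ** 2)


# illustrative standard deviations, not values taken from the dataset
for std in (0.25, 0.5, 1.0):
    print(std, min_reviews_required(std))
```

With m = 0.1, a beer whose ratings have a standard deviation of 0.5 needs 97 reviews, and one with a standard deviation of 1.0 needs 385, so a tight margin excludes most sparsely reviewed beers.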

In [68]:
df = beer_data[["beer_beerid", "beer_name", "review_overall", "review_profilename"]]

# drop duplicate reviews (same user reviewing the same beer)
df = df.drop_duplicates(["beer_beerid", "review_profilename"])

# per-beer number of reviews, sample mean and sample standard deviation
stats = df.groupby("beer_beerid")["review_overall"].agg(["count", "mean", "std"])

# Define margin of error and Z-score for a 95% confidence interval
m = 0.1
z = 1.96

# minimum number of reviews required per beer: (sigma * z / m)^2
stats["n_required"] = (stats["std"] * z / m) ** 2

# keep beers with std dev > 0 (this also rejects single-review beers, whose
# std dev is NaN) and with more reviews than the minimum required
enough = (stats["std"] > 0) & (stats["count"] > stats["n_required"])

# rank the remaining beers by mean review_overall, best first
ranked = stats[enough].sort_values("mean", ascending=False)
print(ranked.head(10))

In [59]:
# scratch test: sort [beer_id, mean rating] rows by mean rating, best first
t = np.array([[5.0, 3.55357143],
              [6.0, 3.70780712],
              [7.0, 3.26946565]])
t[np.argsort(t[:, 1])[::-1]]

#### Q2. Recommend beers based on the data - Build a content-based recommendation system

Data Matrix: Let's assume that the ratings are correct and correlate with user preferences. The data matrix contains the overall ratings together with the users who gave them. We identify users by review_profilename, so data points with a missing "review_profilename" are dropped from the data matrix.

Features Matrix: The features matrix contains four features (aroma, palate, appearance, taste) as rated by different users. We assume that each rating is proportional to the user's taste. We do not use ABV% as a feature here, because it has many undefined values that could shrink our data set.

We use linear regression with regularization to estimate the parameters that relate the features to the overall rating.
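
A minimal sketch of such a regression follows, assuming ridge (L2) regularization solved in closed form with NumPy; the notebook does not say which regularizer or solver is intended. The function name `fit_ridge`, the penalty `alpha` and the toy ratings are illustrative assumptions, not values from the dataset.

```python
import numpy as np


def fit_ridge(features, target, alpha=1.0):
    """Fit target ~ intercept + features @ weights with an L2 penalty on weights.

    Returns (intercept, weights). The intercept is left unpenalized by
    centering the data before solving the regularized normal equations.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    feature_means = features.mean(axis=0)
    target_mean = target.mean()
    centered = features - feature_means
    gram = centered.T @ centered + alpha * np.eye(features.shape[1])
    weights = np.linalg.solve(gram, centered.T @ (target - target_mean))
    intercept = target_mean - feature_means @ weights
    return intercept, weights


# made-up ratings, columns in the order taste, aroma, palate, appearance
X = np.array([[4.5, 4.0, 4.0, 3.5],
              [3.0, 3.5, 3.0, 4.0],
              [2.0, 2.5, 2.0, 3.0],
              [4.0, 4.5, 4.5, 4.0],
              [3.5, 3.0, 3.5, 3.5]])
y = np.array([4.5, 3.0, 2.5, 4.0, 3.5])  # made-up overall ratings

intercept, weights = fit_ridge(X, y, alpha=0.1)
predicted = intercept + X @ weights  # fitted overall ratings for the toy rows
```

Larger values of `alpha` shrink the weights toward zero, trading some fit on the training reviews for stability; a value would normally be chosen by cross-validation.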



In [9]:
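# drop rows with an NA value in any column, then keep the columns we need
# note: dropna runs before the column selection, so rows missing a value in
# an unused column (such as ABV) are dropped as well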
beer_ratings = beer_data.dropna(how='any').loc[:,["beer_beerid","review_overall",
                                                  "review_taste","review_aroma",
                                                 "review_palate","review_appearance",
                                                 "review_profilename"]]
# Check for duplicate entries -  condition: same user, same rating and same beer_id
print (len(beer_ratings)) #- before duplicates removed
beer_ratings = beer_ratings.drop_duplicates()
print (len(beer_ratings)) #- after removing duplicates

1518478
1517728


In [10]:
beer_ratings = beer_ratings.drop_duplicates(subset=['beer_beerid','review_profilename'])
print (len(beer_ratings)) #- after removing duplicates

1504037


Create a data matrix (again a data frame) by rearranging the beer_ratings data frame using two indices and extracting the overall rating column

In [11]:
#beer_ratings.set_index(["beer_beerid","review_profilename"], inplace=True)
# Set indices
#beer_ratings.set_index(['beer_beerid','review_profilename'], inplace=True)
#beer_ratings
# Create a new dataframe which is our 'Data Matrix' that contains
# beer IDs as rows, usernames as columns and overall reviews as datapoints
# Some values are NA because all users didn't rate all beers
#beer_datamatrix = beer_ratings.to_panel().review_overall
#beer_datamatrix
print (len(beer_ratings))
#beer_ratings.to_csv("testdata.csv")

1504037
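
One possible way to build the beers-by-users data matrix described above is `DataFrame.pivot`, sketched here on a toy frame. The frame `toy_ratings` and its values are made up for illustration; only the column names come from the notebook. The sketch relies on each (beer, user) pair appearing once, which the duplicate removal above ensures for `beer_ratings`.

```python
import pandas as pd

# toy stand-in for beer_ratings; the values are made up for illustration
toy_ratings = pd.DataFrame({
    "beer_beerid": [10, 10, 20, 30],
    "review_profilename": ["alice", "bob", "alice", "carol"],
    "review_overall": [4.0, 3.5, 5.0, 2.5],
})

# rows are beer ids, columns are user names, cells are overall ratings;
# (beer, user) pairs without a review come out as NaN
data_matrix = toy_ratings.pivot(index="beer_beerid",
                                columns="review_profilename",
                                values="review_overall")
print(data_matrix)
```

On the full data the dense result may be too large to hold in memory, since most users rate only a small fraction of the beers; a sparse representation is likely a better fit there.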


In [23]:
# Test code: build a small beers x users matrix of overall ratings
df = beer_ratings[:200]
print (len(df.review_profilename))

# sorted unique user names and beer ids define the matrix columns and rows
set_profiles = sorted(set(df.review_profilename))
set_beerids = sorted(set(df.beer_beerid))
col_of = {profile: j for j, profile in enumerate(set_profiles)}
row_of = {beer_id: i for i, beer_id in enumerate(set_beerids)}

# reshape() returns a new array instead of modifying in place, so allocate
# the 2-D array directly; 0 marks a beer the user has not rated
dataM_sub = np.zeros((len(set_beerids), len(set_profiles)))

# fill in review_overall for every (beer, user) pair present in the sample
for beer_id, profile, rating in zip(df.beer_beerid, df.review_profilename,
                                    df.review_overall):
    dataM_sub[row_of[beer_id], col_of[profile]] = rating

200