### Project Title : Beer Recommendation

### Introduction
##### In the e-commerce industry, recommendations and ratings strongly influence customer engagement. Websites use recommendation engines to surface the products most relevant to each user, which in turn drives revenue.
##### This project focuses on the beer industry and online beer shops. The goal is to offer users personalized product choices based on their past ratings or on the features of the beers.

#### Import libraries

In [4]:
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window 
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import math
import os
os.environ["PYSPARK_PYTHON"] = "python3"
import urllib
from pyspark.sql import SparkSession
## Recommendation Engine 
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS as ml_als
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

#### Data ETL

In [6]:
## setup spark session
spark = SparkSession \
    .builder \
    .appName("beer review") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [7]:
## Create Dataframe and SQL Table
## load data into dataframe and create sql tables
beers = spark.read.load("/FileStore/tables/beer_reviews.csv", format='csv', header = True)
beers.createOrReplaceTempView("beer_reviews")

In [8]:
## Display raw data
display(beers.take(5))

In [9]:
## The dimension of the dataframe
print((beers.count(), len(beers.columns)))

In [10]:
## Check for nulls: True means no row contains a missing value
print("Is the dataframe free of missing values?")
print('beers: {}'.format(beers.count() == beers.na.drop().count()))

#### In this project, about 1.5 million beer reviews from BeerAdvocate, accessed through Kaggle, are analyzed.
#### Reviewer information, beer information, and rating information are provided.
##### Specifically,
##### Beer information includes: Beer Name, Beer ABV, Beer ID, Beer Style, Brewery Name, Brewery ID.
##### Rating information includes: Overall Ratings, Aroma Ratings, Appearance Ratings, Palate Ratings, Taste Ratings.
##### Reviewer information includes: Reviewer profilename, Review time.

In [12]:
## Show data types
beers.printSchema()

In [13]:
## Convert from spark dataframe to pandas dataframe
pandasbeer = beers.toPandas()

In [14]:
## Replace each reviewer profilename with an integer user ID (ALS needs numeric IDs)
names = pandasbeer['review_profilename'].tolist()
## Each name maps to the row index of its last occurrence, which serves as its ID
name_to_id = {name: i for i, name in enumerate(names)}
pandasbeer['review_profilename'] = pandasbeer['review_profilename'].map(name_to_id)
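
##### The cell above turns reviewer names into integers because Spark's ALS expects numeric user and item IDs. Below is a minimal pure-Python sketch of the same idea on a made-up list of names. The helper `encode_names` and the sample names are illustrative assumptions, and this variant assigns dense IDs in order of first appearance rather than reusing row indices.

```python
def encode_names(names):
    """Map each distinct name to a dense integer ID in order of first appearance."""
    name_to_id = {}
    ids = []
    for name in names:
        # setdefault assigns the next unused ID only the first time a name is seen
        ids.append(name_to_id.setdefault(name, len(name_to_id)))
    return ids, name_to_id


# Hypothetical reviewer names, for illustration only
sample_names = ["hopfan", "maltlover", "hopfan", "stoutguy", "maltlover"]
sample_ids, sample_lookup = encode_names(sample_names)
print(sample_ids)
```

##### Keeping the lookup dictionary around makes it possible to translate recommended user IDs back to profile names later.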

In [15]:
## Convert back to spark dataframe
mybeer = spark.createDataFrame(pandasbeer)

### Exploratory data analysis
#### In the first part of this project, I explore what kind of data we have.
##### I focused on beer style, since it might be an interesting characteristic that affects people's choice of beer. Other attributes, such as ABV and brewery, should also be explored thoroughly if time allows.
##### In addition, since we also have Overall, Aroma, Appearance, Palate, and Taste ratings, it would be interesting to examine the relationship between people's beer preferences and the different rating categories. It would also be interesting to model the overall rating from the other rating categories.

In [17]:
## Size of distinct items
print ("number of distinct users", mybeer.select('review_profilename').distinct().count())
print ("number of distinct beers", mybeer.select('beer_name').distinct().count())
print ("number of distinct beer styles", mybeer.select('beer_style').distinct().count())
print ("number of distinct breweries", mybeer.select('brewery_name').distinct().count())
print ("number of distinct ABV", mybeer.select('beer_abv').distinct().count())

#### Explore beer styles

In [19]:
## Unique Beer Styles
unique_beer_style = mybeer.select('beer_style').distinct()
display(unique_beer_style)

In [20]:
## The number of reviews for each beer style
df_style = mybeer \
    .groupBy('beer_style').count() \
    .orderBy('count', ascending = False)
display(df_style)

In [21]:
## Example reviews of beers in the Belgian Strong Dark Ale style
display(mybeer.where("beer_style like '%Belgian Strong Dark Ale%'"))

#### Explore the overall ratings of beers

In [23]:
## The number of ratings at each score
df_rate = mybeer \
    .groupBy('review_overall').count() \
    .orderBy('review_overall', ascending = False)
display(df_rate)

#### The average rating of beers from each beer style

In [25]:
pandasbeer['review_overall'] = pandasbeer['review_overall'].apply(pd.to_numeric)
pandasbeer.groupby("beer_style")["review_overall"].mean().sort_values(ascending = False).head(10)

#### The beers that have the most ratings

In [27]:
beerrated = pandasbeer.groupby('beer_name')['review_overall'].count().sort_values(ascending=False).head(5)
beerrated

#### The number of ratings that each reviewer gave

In [29]:
pandasbeer.groupby('review_profilename')['review_overall'].count().sort_values(ascending=False).head(5) 

##### This dataset from BeerAdvocate is rich enough that the project could go in many directions.
##### In the rest of this project, I dive directly into predicting beer ratings using only reviewer IDs and the ratings those reviewers gave to other beers.

### Predicting beer ratings using recommendation algorithms with Spark MLlib APIs

In [32]:
## Create a new dataframe with only the reviewer, the beer, and the overall rating given.
df_rating_data = mybeer.select("review_profilename","beer_beerid","review_overall")
df_rating_data = df_rating_data \
            .withColumn("review_profilename", df_rating_data.review_profilename.cast(IntegerType())) \
            .withColumn("beer_beerid", df_rating_data.beer_beerid.cast(IntegerType())) \
            .withColumn("review_overall", df_rating_data.review_overall.cast(DoubleType())) 

In [33]:
## Train and test split
data, hold_out = df_rating_data.randomSplit([0.8, 0.2], seed = 7856)
data.cache()
hold_out.cache()
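
##### A random split can leave some reviewers or beers only in the hold-out set. ALS has no latent factors for them and predicts NaN by default, which is why the model below sets `coldStartStrategy="drop"`. The sketch here shows how such users could be spotted, using made-up (user, beer, rating) tuples rather than the real data.

```python
# Hypothetical (user_id, beer_id, rating) rows, for illustration only
train_rows = [(1, 10, 4.0), (1, 11, 3.5), (2, 10, 4.5)]
hold_out_rows = [(2, 11, 4.0), (3, 10, 2.5)]

train_users = {user for user, _, _ in train_rows}
# Users present in the hold-out set but never seen during training
cold_start_users = {user for user, _, _ in hold_out_rows} - train_users
print(cold_start_users)
```

##### Dropping these rows keeps the RMSE well defined, at the cost of evaluating only on users and beers that appear in the training data.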

#### Use ALS (Alternating Least Squares) and collaborative filtering to predict the ratings for the beers
#### The ALS model follows the example in the Spark documentation:
##### https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html
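
##### To make the alternating idea concrete, here is a minimal pure-Python sketch of ALS restricted to rank 1, so each user and each beer has a single latent factor and every least-squares update reduces to a scalar division. The function `als_rank1`, the toy ratings, and the plain L2 regularization are assumptions for illustration. This is not Spark's implementation, which handles higher ranks and differs in its details.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, n_iters=20):
    """Rank-1 ALS on a dict {(user, item): rating}; returns user and item factors."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(n_iters):
        # Hold item factors fixed and solve the regularized least squares per user
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            num = sum(item_f[i] * r for i, r in obs)
            den = reg + sum(item_f[i] ** 2 for i, _ in obs)
            user_f[u] = num / den
        # Hold user factors fixed and solve the same problem per item
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            num = sum(user_f[u] * r for u, r in obs)
            den = reg + sum(user_f[u] ** 2 for u, _ in obs)
            item_f[i] = num / den
    return user_f, item_f


# Hypothetical toy ratings from an exactly rank-1 matrix; (user 2, item 1) is unobserved
toy_ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 4.0, (1, 1): 2.0, (2, 0): 2.0}
user_f, item_f = als_rank1(toy_ratings, n_users=3, n_items=2)
# The predicted rating is the product of the user factor and the item factor
missing_prediction = user_f[2] * item_f[1]
print(round(missing_prediction, 2))
```

##### Because user 2 rated item 0 half as high as the other users did, the prediction for the unobserved pair should come out close to half of their rating for item 1.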

In [35]:
# Build the recommendation model using ALS on the training data

# Specify model with parameters
als = ml_als(userCol="review_profilename", itemCol="beer_beerid", ratingCol="review_overall", coldStartStrategy="drop")

# Use a ParamGridBuilder to construct a grid of parameters to search over.
paramGrid = ParamGridBuilder()\
    .addGrid(als.maxIter,[5]) \
    .addGrid(als.rank, [4, 6, 8])\
    .addGrid(als.regParam, [0.6, 0.8, 1])\
    .build()

# Define the evaluator: RMSE between the actual and predicted ratings
evaluator = RegressionEvaluator(metricName="rmse", labelCol="review_overall",
                                predictionCol="prediction")

# 5-fold cross-validation over the parameter grid
cv = CrossValidator(estimator=als,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds = 5)

# Fit on the training data
myalsmodel = cv.fit(data)
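
##### Grid search combined with cross-validation multiplies quickly, so it helps to count the model fits before launching the cell above. The sketch below enumerates the same candidate values with `itertools.product`. Every combination is fit once per fold, and `CrossValidator` then refits the best combination once more on the full training set.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder grid
max_iters = [5]
ranks = [4, 6, 8]
reg_params = [0.6, 0.8, 1]
num_folds = 5

# Cartesian product of all candidate values
grid = list(product(max_iters, ranks, reg_params))
cv_fits = len(grid) * num_folds
print(len(grid), "parameter combinations,", cv_fits, "fits during cross-validation")
```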

#### Evaluate the model by computing the RMSE on the test data

In [37]:
# Get the best model from cross validation, evaluate the best model on test data
best_model = myalsmodel.bestModel
predictions = best_model.transform(hold_out)
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
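
##### For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in the same units as the ratings themselves. A small sketch on made-up numbers follows. The function name `root_mean_square_error` and the sample values are illustrative only.

```python
import math


def root_mean_square_error(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings and predictions
sample_actual = [4.0, 3.5, 5.0, 2.0]
sample_predicted = [3.5, 3.5, 4.0, 3.0]
print(root_mean_square_error(sample_actual, sample_predicted))  # 0.75
```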

In [38]:
# Print evaluation metrics and model parameters
print("RMSE = " + str(rmse))
print("**Best Model**")
print(" Rank:" + str(best_model._java_obj.parent().getRank()))
print(" MaxIter:" + str(best_model._java_obj.parent().getMaxIter()))
print(" RegParam:" + str(best_model._java_obj.parent().getRegParam()))

#### Show predicted ratings

In [40]:
## Show predictions
predictions.show()

#### Generate the top 5 beer recommendations for each user
#### Generate the top 5 user recommendations for each beer

In [42]:
# Generate top 5 beer recommendations for each user
userRecs = myalsmodel.bestModel.recommendForAllUsers(5).cache()
display(userRecs)

In [43]:
# Generate top 5 user recommendations for each beer
beerRecs = myalsmodel.bestModel.recommendForAllItems(5).cache()
display(beerRecs)
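
##### Conceptually, a top-N recommendation scores every candidate beer for a user and keeps the N highest predictions. The sketch below shows that selection step for a single user with `heapq.nlargest`. The helper `top_n` and the scores are hypothetical, and this is only an illustration of the idea, not how Spark computes it internally.

```python
import heapq


def top_n(scores, n=5):
    """Return the n (item, score) pairs with the highest predicted score."""
    return heapq.nlargest(n, scores.items(), key=lambda pair: pair[1])


# Hypothetical predicted ratings for one user, keyed by beer ID
predicted_scores = {101: 4.2, 102: 3.1, 103: 4.8, 104: 2.5, 105: 3.9, 106: 4.5}
top_three = top_n(predicted_scores, n=3)
print(top_three)
```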

### Discussion
#### Matrix factorization with the ALS algorithm was used for the rating prediction task.
#### I used five-fold cross-validation to train the model, selected the best model, and evaluated it on the held-out test data.
#### The maximum number of iterations, the regularization parameter, and the rank were selected by grid search.
#### The rank of the best model indicates how many latent features ALS used. These features might be related to beer style, ABV, brewery, aroma, appearance, palate, taste, and so on. A thorough exploration could be done later, and the findings might help strengthen the model with other methods in the future.
#### The predicted beer ratings and the recommendations shown above provide recommended beers for specific customers.
#### More exploration could be done with this dataset in future projects.
##### In the EDA, I focused on beer style, since it might be an interesting characteristic that affects people's choice of beer. Other attributes, such as ABV and brewery, should also be explored thoroughly if time allows.
##### In addition, since we also have Overall, Aroma, Appearance, Palate, and Taste ratings, it would be interesting to examine the relationship between people's beer preferences and the different rating categories. It would also be interesting to model the overall rating from the other rating categories, or at least to plot the distributions of these ratings or run a simple linear regression.

### Appendix for all the columns in the dataset 
##### Reviewer Information : Reviewer profilename, Review time
###### review_profilename 
###### review_time
##### Rating Information : Overall Ratings, Aroma Ratings, Appearance Ratings, Palate Ratings, Taste Ratings
###### review_overall
###### review_aroma
###### review_appearance
###### review_palate
###### review_taste
##### Beer Information: Beer Name, Beer ABV, Beer ID, Beer Style, Brewery Name, Brewery ID
###### beer_name
###### beer_abv
###### beer_beerid
###### beer_style
###### brewery_name
###### brewery_id

### References 
##### https://www.kaggle.com/rdoume/beerreviews
##### https://hub.packtpub.com/building-recommendation-engine-spark/ 
##### https://www.analyticsvidhya.com/blog/2016/06/quick-guide-build-recommendation-engine-python/ 
##### https://blog.statsbot.co/recommendation-system-algorithms-ba67f39ac9a3