<a href="https://colab.research.google.com/github/CGrannan/building-boardgame-recommendation-systems/blob/master/spark_als_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In our last notebook we started to look at how collaborative-filtering recommendation systems work. We built two systems that provide recommendations based on an item. In this notebook we will build another collaborative-filtering recommendation system, one that allows new users to enter ratings and receive personalized recommendations. To accomplish this, we will use alternating least squares (ALS) in PySpark. To begin, we will load several libraries and some functions that we have written.


In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

import findspark
findspark.init()


from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS 
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder



In [None]:
%run /content/drive/MyDrive/CapstoneProject/als_recommender_systems.py

In [None]:
# import necessary libraries
from pyspark.sql import SparkSession

# instantiate SparkSession object
spark = SparkSession.builder.master('local').getOrCreate()

Great, our PySpark session is up and running. Now let's load our data. We have 38 tab-delimited csv files, each with ~100,000 ratings. We will load all of these files now and look at the resulting dataframe.

In [None]:
rating_df = spark.read.option('delimiter', '\t').csv('/content/drive/MyDrive/CapstoneProject/scraped_ratings/*.csv')
rating_df = rating_df.selectExpr("_c0 as game_id", "_c1 as user_name", "_c2 as rating", "_c3 as comment")

rating_df.show(5)
rating_df.count()

+-------+------------+------+--------------------+
|game_id|   user_name|rating|             comment|
+-------+------------+------+--------------------+
|    822|  teambanzai|    10|                null|
|    822|    dumarest|    10|                null|
|    822|SanguineGrrl|    10|This is my favori...|
|    822|   sargeofny|    10|                null|
|    822|MagicWiesner|    10|                null|
+-------+------------+------+--------------------+
only showing top 5 rows



3752111

# Data Cleaning

For this project we will not be using the comments attached to the ratings, so we will go ahead and drop that column now.

In [None]:
rating_df = rating_df.drop('comment')
rating_df.show(5)

+-------+------------+------+
|game_id|   user_name|rating|
+-------+------------+------+
|    822|  teambanzai|    10|
|    822|    dumarest|    10|
|    822|SanguineGrrl|    10|
|    822|   sargeofny|    10|
|    822|MagicWiesner|    10|
+-------+------------+------+
only showing top 5 rows



Now we will convert our user_names into user_ids, since our ALS system needs numeric IDs for users.

In [None]:
from pyspark.sql.functions import countDistinct

rating_df.select(countDistinct('user_name')).show()

+-------------------------+
|count(DISTINCT user_name)|
+-------------------------+
|                   275860|
+-------------------------+



In [None]:
from pyspark.ml.feature import StringIndexer
stringIndexer = StringIndexer(inputCol="user_name", outputCol="user_id")
model = stringIndexer.fit(rating_df)
rating_df = model.transform(rating_df)
rating_df.show()

+-------+--------------+------+--------+
|game_id|     user_name|rating| user_id|
+-------+--------------+------+--------+
|    822|    teambanzai|    10| 37019.0|
|    822|      dumarest|    10|123063.0|
|    822|  SanguineGrrl|    10|167961.0|
|    822|     sargeofny|    10|269396.0|
|    822|  MagicWiesner|    10| 27440.0|
|    822|    Ed_the_Red|    10|  1883.0|
|    822|  Paul Slavich|    10| 34561.0|
|    822|       starman|    10|  2312.0|
|    822|     krcubedex|    10| 95351.0|
|    822|Manuel Siebert|    10| 25533.0|
|    822|        Elvite|    10| 54624.0|
|    822|       dgmyers|    10|252158.0|
|    822|   macrovipera|    10|113403.0|
|    822|    karlstroff|    10| 95256.0|
|    822|   FrankWIrsch|    10| 61152.0|
|    822| MBradford1968|    10|190839.0|
|    822|shropshireblue|    10|210666.0|
|    822|       laurana|    10|176374.0|
|    822|         vitas|    10| 10958.0|
|    822|      Anaconda|    10| 24359.0|
+-------+--------------+------+--------+
only showing top 20 rows
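
To make the indexing step less of a black box, here is a minimal plain-Python sketch of the idea. It assumes StringIndexer's default ordering (most frequent label first, ties broken alphabetically), which would explain why some users above have small IDs and others have very large ones. The function name `index_labels` and the toy list are illustrative only and are not part of the project code.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy data; the names are borrowed from the output above.
sample_users = ["starman", "vitas", "starman", "Elvite", "starman", "vitas"]
mapping = index_labels(sample_users)
print(mapping)
```

In this toy list, `starman` appears most often and so receives index 0.0. The indices are floats, matching the `user_id` column shown above.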

In [None]:
rating_df = rating_df.drop('user_name')
rating_df.show(5)

+-------+------+--------+
|game_id|rating| user_id|
+-------+------+--------+
|    822|    10| 37019.0|
|    822|    10|123063.0|
|    822|    10|167961.0|
|    822|    10|269396.0|
|    822|    10| 27440.0|
+-------+------+--------+
only showing top 5 rows



In [None]:
rating_df.select(countDistinct('user_id')).show()

+-----------------------+
|count(DISTINCT user_id)|
+-----------------------+
|                 275860|
+-----------------------+



Next, we need to cast our rating and game_id columns to integers, since they were read from the csv files as strings.

In [None]:
from pyspark.sql.types import IntegerType
rating_df = rating_df.withColumn("game_id", rating_df["game_id"].cast(IntegerType()))
rating_df = rating_df.withColumn("rating", rating_df["rating"].cast(IntegerType()))

# Modeling

Now we are ready to start modeling. We will separate a train and test set and train a baseline model on our train set. We will evaluate the model on the test set using a regression evaluator for RMSE.

In [None]:
train, test = rating_df.randomSplit([.8, .2])
als = ALS(maxIter=5, userCol='user_id', itemCol='game_id', ratingCol='rating', coldStartStrategy='drop')
model=als.fit(train)
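
For intuition about what `als.fit` is doing, the following is a toy rank-1 version of alternating least squares in plain Python. It is only a sketch: Spark's ALS uses higher ranks, distributes the work, and handles regularization in its own way. The function `als_rank1` and the six toy ratings are hypothetical. The idea is to hold the item factors fixed and solve for each user factor in closed form, then swap roles and repeat.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=20):
    """Fit r[u][i] ~ p[u] * q[i] by alternating closed-form least-squares updates."""
    p = [1.0] * n_users
    q = [1.0] * n_items
    for _ in range(iters):
        # Hold item factors fixed and solve for each user factor.
        for u in range(n_users):
            num = sum(r * q[i] for (uu, i, r) in ratings if uu == u)
            den = reg + sum(q[i] ** 2 for (uu, i, _r) in ratings if uu == u)
            p[u] = num / den
        # Hold user factors fixed and solve for each item factor.
        for i in range(n_items):
            num = sum(r * p[u] for (u, ii, r) in ratings if ii == i)
            den = reg + sum(p[u] ** 2 for (u, ii, _r) in ratings if ii == i)
            q[i] = num / den
    return p, q

# Hypothetical (user, item, rating) triples; three user/item pairs are left unrated.
ratings = [(0, 0, 2.0), (0, 1, 4.0), (1, 1, 6.0),
           (1, 2, 7.5), (2, 0, 4.0), (2, 2, 10.0)]
p, q = als_rank1(ratings, n_users=3, n_items=3)

# Predict a rating that was never observed: user 0 on item 2.
print(round(p[0] * q[2], 2))
```

The toy ratings were generated from a rank-1 pattern in which user 0 would rate item 2 at 5.0, so the printed prediction should land close to that value.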

In [None]:
preds = model.transform(test)
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating', metricName='rmse')
rmse = evaluator.evaluate(preds)
print('RMSE: ', rmse)

RMSE:  1.007362122516593
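
As a reminder, RMSE is the square root of the mean squared difference between actual and predicted ratings. The plain-Python sketch below computes it on made-up numbers and skips NaN predictions, which mirrors what `coldStartStrategy='drop'` does for users or games that do not appear in the training set. The function name and data are illustrative assumptions, not part of the project code.

```python
import math

def rmse_dropping_nan(pairs):
    """RMSE over (actual, predicted) pairs, skipping NaN predictions."""
    kept = [(a, p) for a, p in pairs if not math.isnan(p)]
    squared_errors = [(a - p) ** 2 for a, p in kept]
    return math.sqrt(sum(squared_errors) / len(kept))

# Hypothetical ratings; the NaN stands in for a user or game unseen in training.
toy_pairs = [(10, 9.0), (8, 7.0), (6, 7.0), (9, float("nan"))]
print(rmse_dropping_nan(toy_pairs))
```

Each of the three usable predictions is off by exactly one point, so the RMSE here is 1.0.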


The model looks pretty good for a baseline: on a ten-point rating scale, our RMSE is just over 1 point. Let's see if we can improve on this. We will run a cross validator to find the best rank parameter for each of several regParam values.

In [None]:
# Warning! This cell takes a while to run.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Score every candidate by RMSE on the predicted ratings.
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

reg_params = [.01, .005, .001]
for reg_param in reg_params:
    als = ALS(userCol='user_id', itemCol='game_id', ratingCol='rating',
              regParam=reg_param, coldStartStrategy='drop')

    # For each regParam, cross-validate over candidate ranks on the training set.
    param_grid = ParamGridBuilder()\
                  .addGrid(als.rank, [4, 8, 12, 16])\
                  .build()

    cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid,
                        evaluator=evaluator, parallelism=4)
    model = cv.fit(train)

    # Report the test-set RMSE of the best rank found for this regParam.
    predictions = model.transform(test)
    rmse = evaluator.evaluate(predictions)
    print(reg_param, model.bestModel.rank, rmse)

0.01 4 0.988944988150814
0.005 4 0.9850528869880831
0.001 4 1.0270820812617203
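
The three printed rows can also be compared programmatically. This small sketch copies the values from the output above and picks the row with the lowest RMSE. The variable names are illustrative.

```python
# (regParam, best rank chosen by cross-validation, test RMSE) copied from the output above.
results = [
    (0.01, 4, 0.988944988150814),
    (0.005, 4, 0.9850528869880831),
    (0.001, 4, 1.0270820812617203),
]

# Pick the row with the smallest RMSE.
best_reg, best_rank, best_rmse = min(results, key=lambda row: row[2])
print(best_reg, best_rank, round(best_rmse, 4))
```

Note that the rank is chosen by cross-validation on the training set, while the regParam values are compared on the test set. Adding regParam to the same parameter grid would be one way to let the cross validator choose both.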


Moving forward, we will use a rank of 4 and a regParam of .005 for our ALS model, as this combination had the lowest error. Next, we will load our statistics dataframe, since we need its board game name column.

In [None]:
import pandas as pd
games_df = pd.read_pickle('/content/drive/MyDrive/CapstoneProject/games_with_descriptions')
games_df = games_df.reset_index().rename(columns={'index':'game_id'})
games_df.tail()

Unnamed: 0,game_id,name,type,year,designer,artist,publisher,min_players,max_players,play_time,min_age,num_ratings,avg_rating,bayes_avg,weight,categories,mechanics,families,description,min_playtime,max_playtime,bgg_rank,boardgame_rank
119973,324992,Riddle Island: 1974,boardgame,2019.0,uncredited,uncredited,TRANSit,1.0,6.0,60.0,8.0,0,0.0,0.0,0.0,['Puzzle'],,['Category: Escape Room Games'],Riddle Island is a original series of escape r...,60,60,,
119974,324993,Big Boy Throwdown,boardgame,2020.0,uncredited,uncredited,(Web published),2.0,6.0,25.0,8.0,0,0.0,0.0,0.0,"['Card Game', 'Fantasy', 'Fighting', 'Humor']","['Dice Rolling', 'Events', 'Once-Per-Game Abil...",,"Armed with a hand of crazy, colorful character...",5,25,,
119975,324997,Fallen knight,boardgame,0.0,Tulga,Telmen,Gansukh,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,['Horror'],['Cooperative Game'],"['Admin: Better Description Needed!', 'Admin: ...",Fallen Knight is a 1-4 player cooperative dung...,0,0,,
119976,324998,The Treasure Cave of Dragon,boardgame,2019.0,uncredited,uncredited,TRANSit,2.0,6.0,20.0,8.0,0,0.0,0.0,0.0,"['Card Game', 'Dice', 'Number', 'Party Game']","['Dice Rolling', 'Push Your Luck']",,"Holding an ancient treasure map, you keep goin...",20,20,,
119977,325000,Zazz,boardgame,1963.0,Fredda F. S. Sieve,uncredited,"Advertising Attractions, INC.",0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,,['Dice Rolling'],,One of the first games to use a set of differe...,0,0,,


Now we will use a function to enter ratings for a new user (me) and get some recommendations.

In [None]:
user_id = 1000000
user_ratings_1 = [('Cthulhu Wars', 8, user_id),
                  ('Terraforming Mars', 9, user_id),
                  ('Gloomhaven', 9, user_id),
                  ('Twilight Imperium: Fourth Edition', 8, user_id),
                  ('Mage Knight Board Game', 8, user_id)]
new_user_recs(user_id,
             new_ratings= user_ratings_1,
             rating_df= rating_df,
             stats_df= games_df, 
             num_recs= 5,
             spark=spark)

Recommendation 1: Nemesis | Predicted Score = 8.86
Recommendation 2: Go | Predicted Score = 8.6
Recommendation 3: Magic: The Gathering | Predicted Score = 8.6
Recommendation 4: Twilight Struggle | Predicted Score = 8.57
Recommendation 5: Puerto Rico | Predicted Score = 8.57
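
The `new_user_recs` helper is defined in `als_recommender_systems.py` and is not shown in this notebook. As a rough illustration of the final ranking step such a helper might perform, the sketch below takes predicted scores, removes games the user has already rated, and keeps the top few. This is a guess at the general shape, not the project's actual code. The function `top_n_unrated` and the score values are hypothetical.

```python
def top_n_unrated(predicted_scores, already_rated, n=5):
    """Return the n highest-scoring games the user has not rated yet."""
    candidates = [
        (name, score)
        for name, score in predicted_scores.items()
        if name not in already_rated
    ]
    # Highest predicted score first; ties broken alphabetically for stable output.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:n]

# Hypothetical predicted scores; in practice these would come from the ALS model.
scores = {"Gloomhaven": 9.1, "Nemesis": 8.86, "Go": 8.6, "Puerto Rico": 8.57}
rated = {"Gloomhaven"}
for rank, (name, score) in enumerate(top_n_unrated(scores, rated, n=2), start=1):
    print(f"Recommendation {rank}: {name} | Predicted Score = {score}")
```

Filtering out already-rated games matters because a model will often predict high scores for games the user has just rated highly.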


After entering five of my favorite games and their ratings, we got five recommendations. Of those, two are games that I have been hoping to try out soon (Go and Puerto Rico), one is a favorite of mine (Magic: The Gathering), and two are games that I am familiar with but haven't played. Some good recommendations! And the best part is that as users keep adding ratings, the system should only get better. Overall, we have built a solid recommendation system to give user-based recommendations.

You can run the next cell to enter new games and ratings and get personalized recommendations.

In [None]:
create_new_recommendations(rating_df, games_df, 5, spark)

Enter a game for recommendations. Cthulhu Wars
Enter rating. 8
Rate more games? y/n Terraforming Mars
Enter a game for recommendations. Terraforming Mars
Enter rating. 9
Rate more games? y/n y
Enter a game for recommendations. Gloomhaven
Enter rating. 9
Rate more games? y/n y
Enter a game for recommendations. Twilight Imperium: Fourth Edition
Enter rating. 8
Rate more games? y/n y
Enter a game for recommendations. Mage Knight Board Game
Enter rating. 8
Rate more games? y/n n
Recommendation 1: Race for the Galaxy | Predicted Score = 8.7
Recommendation 2: Twilight Struggle | Predicted Score = 8.69
Recommendation 3: Scythe | Predicted Score = 8.69
Recommendation 4: Magic: The Gathering | Predicted Score = 8.68
Recommendation 5: Through the Ages: A Story of Civilization | Predicted Score = 8.66



# Conclusion

We have explored many different kinds of recommendation systems over the course of this project. First, we looked at using natural language processing to identify similar games based on the content of each game. We used two models for this approach: a count-vectorized model and a tf-idf-vectorized model. We determined that the tf-idf model was stronger overall. Next, we looked at two collaborative-filtering models. When given a game, these models return several other games that are rated similarly by users who liked that game. These two collaborative-filtering models were more difficult to separate with regard to quality. Of the two, I prefer the KNN model, but both gave good recommendations. Finally, we used ALS in PySpark to add new users and generate recommendations based on user activity rather than on individual items.

While all of the models proved fairly effective, they each do slightly different things. I have three recommendations for implementing them. First, I suggest using the tf-idf NLP model as a scrolling banner when selling board games. This will allow customers to see games that are similar to the game that they are currently considering purchasing. Second, I suggest using the nearest neighbor model as a second banner indicating games that other customers liked. This will allow customers to see games that might be a bit different from what they are used to playing, but that are highly rated by like-minded people. Finally, I suggest implementing the ALS model as a separate component of a business's webpage. By having a page where customers can rate their collections, we can offer them more personalized recommendations. Ultimately, all of these models should help customers navigate the overwhelming number of board games available for purchase and find games that they will enjoy.

# Future Work

There are several ways that we can advance this project. The first and most obvious way is to gather more data. By filling in our rather sparse ratings matrix with more reviews, we will be able to fine-tune our models and provide better recommendations. Second, we can test our KNN and SVD collaborative-filtering models. By gathering user feedback on the recommendations provided by those models, we can better evaluate our recommendations and tune our parameters. Finally, we can refine our content-based system by adjusting the tokens used for NLP. If we more closely monitor which tokens we are using, we might be able to create recommendations that are not mostly expansions and reprints.