In [3]:
%%HTML
<script src="require.js"></script>

In [4]:
from IPython.display import HTML
HTML('''<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script><script>
code_show=true; 
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
} 
$( document ).ready(code_toggle);</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code!"></form>
''')

In [5]:
# Import necessary modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator, RankingEvaluator
from pyspark.sql import functions as F
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import Window


<div style="text-align: center;">
    <img src="https://i.imgur.com/jzRQsuC.png" alt="Alt text" width="1500">
</div>

<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">ABSTRACT</h3>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
This report compares user-based and item-based collaborative filtering on the BoardGameGeek dataset. Models were tuned using `GridSearchCV` and evaluated on their RMSE. Due to computational constraints, the models were trained on the 500 most-reviewed games and on users with more than 44 reviews, reducing the dataset from around 19 million to around 6.2 million ratings. The item-based neighborhood model achieved the best performance with an RMSE of 1.15, outperforming the user-based neighborhood method, ALS, and SVD. However, the neighborhood-based models had long run times, and increasing the dataset size further exhausted the available memory. Therefore, ALS is deemed the next best method because it has the best RMSE among the model-based methods and can parallelize computations for larger datasets.
</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>


<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">INTRODUCTION</h3>
</div>

<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Background</h4>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
Recommendation systems are a growing trend in the retail and e-commerce industry. For instance, there is an increasing need for online retail websites to grow consumer confidence in buying products (Maruti Techlabs, 2021). In other words, in today’s digital age, these systems not only enhance consumer experiences but can also drive revenue for businesses. Extending this concept to other industries may therefore help consumers appreciate and engage with more niche product lines. One industry that may greatly benefit from this is the board game industry.
</p>                                                                                                                                                                                      
<p style="line-height: 1.5;">
Recommender systems have yet to be popularized in the board game industry. While these systems are commonly associated with the retail and film industries, they have yet to become mainstream for board games. Integrating them into this industry offers a modern solution to a long-standing problem: helping players navigate an ever-expanding market filled with diverse titles, genres, and mechanics. A recommender system fills this gap by analyzing player behavior and preferences to generate data-driven game suggestions. Rather than falling victim to the “halo effect”, where players seek games from the same creators or publishers, users are exposed to lesser-known yet fitting titles (Zalewski, Ganzha, & Paprzycki, 2019).
</p>  

<p style="line-height: 1.5;">
There are three types of recommender systems: collaborative filtering, content-based filtering, and a hybrid of the two (Maruti Techlabs, 2021). A study by Zalewski, Ganzha, and Paprzycki (2019) explored the possibilities of this idea for board games. The authors clustered games according to shared characteristics, then used collaborative filtering to recommend games to different users. The proposed algorithm achieved approximately 64% success in predicting preferred games, demonstrating its potential to expand individual “game horizons.” In summary, recommender systems can enhance the social and cultural fabric of the board gaming industry by streamlining discovery and making it easier for every type of player to find games that truly resonate with their play style.
</p>  
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>



<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Problem Statement</h4>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The study aims to answer the main problem: <b>How can explicit user ratings be utilized to recommend board games based on similar users?</b>

To address the main question, the study also answers the following subquestion:
- Which collaborative filtering approach, neighborhood-based or model-based, results in better performance?
</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>



<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Motivation</h4>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
In recent years, the board game industry has experienced a global resurgence, reaching an estimated market value of over USD 15 billion, driven by growing hobbyist communities and the rise of online review platforms such as BoardGameGeek. However, this rapid expansion has also led to market saturation, making it increasingly difficult for players to discover games that match their unique interests and for companies to effectively reach their target audiences (Fortune Business Insights, 2025).
</p>

<p style="line-height: 1.5;">
For players, the overwhelming number of available titles leads to decision fatigue as they struggle to choose from thousands of options despite abundant reviews and ratings. Personalized recommendations can greatly enhance the player experience by surfacing games that align with specific play styles, complexity levels, and themes of interest. This leads to greater player satisfaction and loyalty to platforms that successfully anticipate user preferences (Ricci et al., 2021).
</p>

<p style="line-height: 1.5;">
Understanding player preferences is equally valuable for board game publishers and retailers. By analyzing rating data and player behavior, companies can gain insight into which mechanics, genres, or themes resonate most with their audiences. They can then make more confident decisions in marketing, product design, and inventory management. In a highly competitive environment, data-driven personalization can serve as a differentiator that boosts both sales and long-term customer retention (Ricci et al., 2021).

</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>


<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">METHODOLOGY</h3>
</div>

<div style="text-align: center;">
    <img src="https://imgur.com/lIceCO4.png" alt="Alt text" width="1300">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Figure 1. Methodology Framework
</p>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
This report's methodology comprises a five-step process. It begins with the collection of a publicly available user review dataset from Jojie. The subsequent steps are exploratory data analysis, preprocessing, modeling, and evaluation. The Results and Discussion section presents the findings along with a summary of the insights and conclusions drawn from the analysis.
</p>

<p style="line-height: 1.5;">

To begin, the data must first be examined and explored. Data collection starts with loading and inspecting the dataset from Jojie’s public library. This includes scanning the data for errors and duplicates and determining how to clean it for modeling. Then, exploratory data analysis is conducted to identify the dataset dimensions, such as the number of unique users and unique board games, and to analyze user ratings. Plots and summary tables are used to examine the distribution of ratings, user activity, game popularity, and reception. These initial steps lay the foundation for understanding the dataset's structure and behavior, allowing more informed decisions in data preprocessing, modeling, and evaluation.
</p>
<p style="line-height: 1.5;">
The next step is the implementation of the models. First, the dataset undergoes preprocessing, which includes filtering for relevant information. Then, two types of collaborative filtering models are applied to the narrowed dataset: neighborhood-based and model-based. For the neighborhood-based methods, user-based and item-based collaborative filtering were compared. For the model-based methods, Alternating Least Squares (ALS) and Singular Value Decomposition (SVD) were used.
</p>

<p style="line-height: 1.5;">

Finally, to identify the best-performing model, the Root Mean Squared Error (RMSE) serves as the key metric, measuring how closely the model’s predicted ratings align with the actual user ratings.
</p>

</div>
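
<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
As a minimal sketch of the evaluation metric, RMSE is the square root of the mean squared difference between actual and predicted ratings. The function name <code>rmse</code> and the sample ratings below are hypothetical and only illustrate the calculation; they are not taken from the models in this report.
</p>
</div>

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ratings on the 0-10 scale, for illustration only.
actual_ratings = [8.0, 6.0, 7.0, 9.0]
predicted_ratings = [7.0, 6.0, 8.0, 7.0]

example_rmse = rmse(actual_ratings, predicted_ratings)
print(f"RMSE: {example_rmse:.4f}")
```

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
Because errors are squared before averaging, a single large miss (the last pair above) contributes more to the score than several small ones, so a lower RMSE indicates predictions that are consistently closer to the actual ratings.
</p>
</div>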

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>

<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">DATA PROCESSING</h3>
</div>

<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Data Source</h4>
</div>

In [6]:
pd.set_option('display.float_format', '{:.2f}'.format)

orginal_df = pd.read_csv("/mnt/data/public/bgg/bgg-19m-reviews.csv")

styled = (
    orginal_df.head(4).style
        .set_table_styles([
            {
                'selector': 'thead',
                'props': [
                    ('background-color', '#ffffff'),
                    ('color', '#000000'),
                    ('font-weight', 'bold'),
                    ('border-bottom', '2px solid #000')
                ]
            },
            {
                'selector': 'tbody tr:nth-child(odd)',
                'props': [
                    ('background-color', '#f9f9f9'),
                    ('color', '#000000')
                ]
            },
            {
                'selector': 'tbody tr:nth-child(even)',
                'props': [
                    ('background-color', '#ffffff'),
                    ('color', '#000000')
                ]
            },
            {
                'selector': 'td',
                'props': [
                    ('padding', '4px 8px'),
                    ('text-align', 'center'),
                    ('color', '#000000')
                ]
            }
        ])
        .set_table_attributes('style="margin-left:auto; margin-right:auto;"')  # Center align table
)

display(styled)

Unnamed: 0.1,Unnamed: 0,user,rating,comment,ID,name
0,0,Torsten,10.0,,30549,Pandemic
1,1,mitnachtKAUBO-I,10.0,"Hands down my favorite new game of BGG CON 2007. We played it 5 times in a row -- it's just that good. Too bad Pandemic won't be in stores until January of 2008. If you like pure coöp games (Lord of the Rings, Feurio, etc.), this should be right up your alley. Having 5 roles to choose from gives the game some extra variability. Also, once you get good you can ramp up the difficulty by adding more Epidemic cards. 9 -> 10",30549,Pandemic
2,2,avlawn,10.0,"I tend to either love or easily tire of co-op games. Pandemic joins Knizia's LoTR as my favorite true co-op. It edges LoTR out merely in time to set-up and play. LoTR can be an undertaking to explain enough details so that players enjoy their first time through the game, while Pandemic is fast enough that even if the players don't quite get everything that is going on, they can try again immediately.",30549,Pandemic
3,3,Mike Mayer,10.0,,30549,Pandemic


In [7]:
print(f"Shape: {orginal_df.shape}")
print(f"Column Names: {orginal_df.columns}")

Shape: (18964807, 6)
Column Names: Index(['Unnamed: 0', 'user', 'rating', 'comment', 'ID', 'name'], dtype='object')


<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">

The dataset is the BoardGameGeek Reviews dataset by Jesse van Elteren, available on <a href="https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews/data">Kaggle</a>.

The raw DataFrame contains 18,964,807 rows and 6 columns:

- <b>'Unnamed: 0'</b> contains the row indices from the CSV file
- <b>'user'</b> contains the username of the reviewer
- <b>'rating'</b> contains the user's score for the game, on a continuous decimal scale from 0 to 10
- <b>'comment'</b> contains the user's optional comment on the game
- <b>'ID'</b> contains the game ID linked to the review
- <b>'name'</b> contains the primary game name linked to the game ID

The dataset only contains games with at least 30 reviews and users with at least 1 review.
</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>



<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Data Preprocessing</h4>
</div>

In [8]:
print("Datatype per column:")
print(orginal_df.dtypes)
print()
print("Unique values per column:")
print(orginal_df.nunique())

Datatype per column:
Unnamed: 0      int64
user           object
rating        float64
comment        object
ID              int64
name           object
dtype: object

Unique values per column:
Unnamed: 0    18964807
user            412815
rating           10759
comment        3046149
ID               21839
name             21440
dtype: int64


<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The BoardGameGeek dataset is loaded into a Pandas DataFrame for exploratory data analysis, preprocessing, and modeling of the neighborhood-based and SVD collaborative filtering methods. The raw dataset contains around 18.96 million reviews from about 413 thousand unique users, with around 10.8 thousand unique rating values, 3.05 million unique comments, 21.8 thousand unique game IDs, and 21.4 thousand unique game names.
</p>

</div>

<div style="padding-bottom: 7px;">
</div>



In [11]:
duplicates = orginal_df[orginal_df.duplicated(subset=["user", "name"], keep=False)]
duplicates = duplicates.sort_values(by="user")
sample_ids = duplicates["ID"].iloc[-2:].to_list()
sample = pd.read_csv("/mnt/data/public/bgg/games_detailed_info.csv", low_memory=False)
for id in sample_ids:
    print(sample[sample["id"] == id][['id','primary','alternate']])

        id      primary                                          alternate
17  129622  Love Letter  ['Letters to Santa', 'List Miłosny', 'Lista Sk...
          id      primary                                          alternate
1057  277085  Love Letter  ['List miłosny (Edycja Premium)', 'Love Letter...


<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The first step of the data cleaning process is dropping unnecessary columns. Since the study only tackles explicit collaborative filtering, the models only use numerical user scores, so the columns 'Unnamed: 0' and 'comment' are dropped. Upon deeper inspection, multiple game IDs share the same primary game name but have different alternate names depending on the year or version released. For example, game ID 129622 is the regular version of Love Letter while game ID 277085 is the premium version. Since different versions have different features and price points, the 'name' column was dropped from the dataset in favor of the 'ID' column.
</p>

</div>

<div style="padding-bottom: 7px;">
</div>






In [10]:
orginal_df[orginal_df.duplicated(subset=["user", "ID"], keep=False)]

Unnamed: 0.1,Unnamed: 0,user,rating,comment,ID,name


<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
Each row in the dataset corresponds to a unique username and game ID pairing, meaning the dataset contains only a single review, presumably the latest, from each user for each game.
</p>

</div>

<div style="padding-bottom: 7px;">
</div>






```python
orginal_df = orginal_df[['user', 'rating', 'ID']]

spark = (SparkSession
     .builder
     .master('local[*]')  # run locally, using all available cores
     .config("spark.driver.memory", "8g")
     .config("spark.executor.memory", "8g")
     .config("spark.sql.shuffle.partitions", "8")  # reduce for local
     .getOrCreate())  # reuse the active session or create a new one

spark.sparkContext.setCheckpointDir("./als_spark_checkpoints")

# Read the BoardGameGeek file with Spark
schema = """
_c0 INT,
user STRING,
rating FLOAT,
comment STRING,
id INT,
name STRING
"""
# Fix quote handling for the comment column
df_spark = spark.read.csv(
    "/mnt/data/public/bgg/bgg-19m-reviews.csv",
    sep=',', header=True,
    schema=schema,
    multiLine=True,
    quote='"',
    escape='"')
```


<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Code Snippet 1. Processing Data as a Spark DataFrame
</p>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
Meanwhile, the same raw data was loaded as a Spark DataFrame in order to use Spark's Alternating Least Squares implementation. For Spark, the schema is provided explicitly, and the multiLine, quote, and escape options need to be set to properly handle quotation marks in the comment column. Refer to the supplementary notebook “ALS_modeling.ipynb” for the specific implementation.
</p>

</div>

<div style="padding-bottom: 7px;">
</div>






```python
import time

import pandas as pd
from surprise import Dataset, Reader

# -----------------
# 1. DATA PREPROCESSING (FULL)
# -----------------
print("Loading original CSV (FULL DATASET)...")
start_time = time.time()

# --- Use dtype optimization for faster loading and less memory ---
# 'user' holds usernames (strings), 'ID' is an integer game ID,
# and 'rating' is a float.
dtypes = {
    'rating': 'float32',
    'ID': 'int32'
}

# Only these three columns are needed.
original_df = pd.read_csv(
    "/mnt/data/public/bgg/bgg-19m-reviews.csv",
    usecols=["user", "rating", "ID"],
    dtype=dtypes
)

# 2. Preprocess the data
original_df = original_df.rename(columns={"ID": "item"})
print(f"Data loaded and renamed in {time.time() - start_time:.2f}s.")
print(f"Total original ratings: {len(original_df)}")

# -----------------
# 3. APPLY FILTERS
# -----------------

# --- Get Top 500 Items ---
print("Finding top 500 most-reviewed games...")
item_counts = original_df['item'].value_counts()  # sorted by count, descending
top_500_items = item_counts.head(500).index
print(f"Identified {len(top_500_items)} top games.")

# --- First Filter: Keep only ratings for the Top 500 games ---
print("Applying top 500 games filter...")
df_top_games = original_df[original_df['item'].isin(top_500_items)]
print(f"Ratings remaining (top 500 games): {len(df_top_games)}")

# --- Get Users with > 44 reviews *within the subset* ---
# This is crucial: reviews are counted *after* filtering for games.
print("Finding users with > 44 reviews *within the top-500 subset*...")
user_counts_subset = df_top_games['user'].value_counts()
user_mask = user_counts_subset > 44
users_to_keep = user_counts_subset[user_mask].index
print(f"Users to keep (with > 44 reviews): {len(users_to_keep)}")

# --- Second Filter: Apply the user filter to the game-filtered data ---
print("Applying user filter...")
df = df_top_games[df_top_games['user'].isin(users_to_keep)]

# --- Final Check ---
print("Filtering complete.")
print(f"Final Users: {df['user'].nunique()}")
print(f"Final Items: {df['item'].nunique()}")


# 4. Prepare data for Surprise (expects user, item, rating column order)
df_for_surprise = df[['user', 'item', 'rating']]
reader = Reader(rating_scale=(0, 10))

print("Loading filtered data into Surprise dataset...")
dataset = Dataset.load_from_df(df_for_surprise, reader)
print(f"Data loaded successfully. Total ratings for training: {len(df_for_surprise)}")
```


<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Code Snippet 2. Applying Filters to Decrease Sample Size
</p>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
Before training the models, filters were applied to downsample the data for computational feasibility while also ensuring data quality and relevance. First, the dataset was limited to the 500 most-reviewed games, allowing the study to focus on popular games with sufficient rating information. This reduced the dataset from 19 million to 9.5 million ratings. Second, the dataset was limited to users with more than 44 reviews within the top 500 games. Users with too few reviews can distort the model, for example when finding similar users, so this filter removes noise from the data. It also lets the analysis focus on users who are more engaged with the gaming community and therefore provide more reliable data. The final dataset contains 63,497 active users and 500 games.
</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>



<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">EXPLORATORY DATA ANALYSIS</h3>
</div>

<div style="text-align: center;">
    <img src="https://imgur.com/KIC9sBD.png" alt="Alt text" width="800">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Figure 2. Distribution of Board Game Geek Ratings
</p>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The figure above shows the distribution of board game ratings in the raw data. Most reviews use whole numbers, while a small portion use decimal places, indicating that although the majority of reviewers prefer the simplicity of an integer scale, a subset of reviewers want to provide more nuanced feedback. The distribution is left-skewed, meaning that the majority of reviews are concentrated at the higher end of the scale and very few reviews use the lower end of the range. The most common rating is 7, accounting for 4 million of the reviews. This suggests selection bias: users are more likely to share positive experiences with a game, while users who dislike a game might simply stop playing and not bother submitting a review. It also shows the subjectivity of rating scales, as an average experience corresponds to different ratings for different people. Therefore, mean-centering was used for the neighborhood-based models to unify the scale across all users.
                                                                                                                                                                                                                     </p>
</div>
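
<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The idea behind mean-centering can be sketched as follows: each rating is expressed relative to that user's own average, so a generous rater and a strict rater become comparable. The usernames and ratings below are hypothetical, and this pandas version is only an illustration under that assumption; the neighborhood models in this report may apply the centering internally in a different way.
</p>
</div>

```python
import pandas as pd

# Hypothetical ratings; the columns mirror the user / item / rating layout.
ratings = pd.DataFrame({
    "user": ["ana", "ana", "ana", "ben", "ben", "ben"],
    "item": [1, 2, 3, 1, 2, 3],
    "rating": [9.0, 8.0, 10.0, 5.0, 4.0, 6.0],
})

# Average rating of each user, broadcast back to every row of that user.
user_mean = ratings.groupby("user")["rating"].transform("mean")

# Centered rating: how far each score sits above or below the user's average.
ratings["centered"] = ratings["rating"] - user_mean

print(ratings)
```

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
In this toy example the two users rate on very different parts of the scale, yet their centered ratings are identical, which reflects the same relative preferences across the three games.
</p>
</div>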

<div style="padding-bottom: 7px;">
</div>



In [14]:
# Group by operation
groupby_df = orginal_df.groupby("ID").agg({"rating": "mean", "user": "count"}).sort_values("user", ascending=False)

# Styling the describe() output of the groupby DataFrame
styled_groupby = (
    groupby_df.describe().style
        .set_table_styles([
            {
                'selector': 'thead',
                'props': [
                    ('background-color', '#ffffff'),
                    ('color', '#000000'),
                    ('font-weight', 'bold'),
                    ('border-bottom', '2px solid #000')
                ]
            },
            {
                'selector': 'tbody tr:nth-child(odd)',
                'props': [
                    ('background-color', '#f9f9f9'),
                    ('color', '#000000')
                ]
            },
            {
                'selector': 'tbody tr:nth-child(even)',
                'props': [
                    ('background-color', '#ffffff'),
                    ('color', '#000000')
                ]
            },
            {
                'selector': 'td',
                'props': [
                    ('padding', '4px 8px'),
                    ('text-align', 'center'),
                    ('color', '#000000')
                ]
            }
        ])
        .set_table_attributes('style="margin-left:auto; margin-right:auto;"')  # Center align table
)

# Display the styled DataFrame
display(styled_groupby)

Unnamed: 0,rating,user
count,21839.0,21839.0
mean,6.415746,868.388708
std,0.929549,3685.057639
min,1.041333,30.0
25%,5.831637,56.0
50%,6.446825,122.0
75%,7.039059,393.0
max,9.568293,108971.0


<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Table 2. Statistics on the Number of Ratings per Game
</p>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The table above shows summary statistics for the mean rating and the number of reviews per game. The number of reviews is highly concentrated at the low end. The median is 122, meaning around half of the games have fewer than 122 ratings while the other half have more. The mean is 868, far above the median, showing that the data is right-skewed, with a vast majority of niche games and a few exceptionally popular games with around 100 thousand reviews.
</p>
</div>

<div style="padding-bottom: 7px;">
</div>



In [15]:
# Group by operation for the new DataFrame
groupby_df2 = orginal_df.groupby("user").agg({"rating": "mean", "ID": "count"}).sort_values("ID", ascending=False)

# Styling the describe() output of the new groupby DataFrame
styled_groupby2 = (
    groupby_df2.describe().style
        .set_table_styles([
            {
                'selector': 'thead',
                'props': [
                    ('background-color', '#ffffff'),
                    ('color', '#000000'),
                    ('font-weight', 'bold'),
                    ('border-bottom', '2px solid #000')
                ]
            },
            {
                'selector': 'tbody tr:nth-child(odd)',
                'props': [
                    ('background-color', '#f9f9f9'),
                    ('color', '#000000')
                ]
            },
            {
                'selector': 'tbody tr:nth-child(even)',
                'props': [
                    ('background-color', '#ffffff'),
                    ('color', '#000000')
                ]
            },
            {
                'selector': 'td',
                'props': [
                    ('padding', '4px 8px'),
                    ('text-align', 'center'),
                    ('color', '#000000')
                ]
            }
        ])
        .set_table_attributes('style="margin-left:auto; margin-right:auto;"')  # Center align table
)

# Display the styled DataFrame
display(styled_groupby2)


Unnamed: 0,rating,ID
count,412815.0,412815.0
mean,7.895585,45.940048
std,1.23451,108.486116
min,1.0,1.0
25%,7.133333,2.0
50%,7.785714,12.0
75%,8.666667,44.0
max,10.0,6471.0


<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Table 3. Statistics on the Number of Reviews per User
</p>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
The table above shows summary statistics for the mean rating and the number of reviews per user. The number of reviews is highly concentrated at the low end. The median is 12, meaning around half of the users have made fewer than 12 reviews while the other half have made more. The mean is about 46, far above the median, showing that the data is right-skewed, with only a small portion of users reviewing a large number of games.
</p>                                                                                                                                                                                                                                                                                                            
</div>
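
<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The mean-versus-median reading of Table 3 can be illustrated with a minimal sketch. The counts below are toy values, not taken from the dataset, and the helper names are hypothetical. Comparing the mean with the median is only a quick heuristic for right skew, not a formal skewness statistic. The threshold of 44 in the usage line matches the 75th percentile in Table 3.
</p>
</div>

```python
from statistics import mean, median


def summarize_review_counts(counts):
    """Return the mean, the median, and a simple right-skew flag."""
    avg = mean(counts)
    med = median(counts)
    # Heuristic: a mean well above the median suggests a long right tail.
    return {"mean": avg, "median": med, "right_skewed": avg > med}


def share_at_least(counts, threshold):
    """Fraction of users whose review count is at least `threshold`."""
    return sum(1 for c in counts if c >= threshold) / len(counts)


# Toy per-user review counts: many light reviewers and one heavy reviewer.
toy_counts = [1, 2, 2, 3, 12, 15, 44, 500]
summary = summarize_review_counts(toy_counts)
heavy_share = share_at_least(toy_counts, 44)
```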

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>



<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">RESULTS & DISCUSSION</h3>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
This analysis compares three collaborative filtering approaches for board game recommendations using the BoardGameGeek dataset: Neighborhood-based methods (User-Based and Item-Based KNN), Singular Value Decomposition, and Alternating Least Squares (ALS).
</p>
                                                               
<p style="line-height: 1.5;">
The most accurate model is Item-Based KNN with Pearson baseline similarity, which achieved the lowest prediction error with an RMSE of 1.1503. Of the two neighborhood-based approaches, item-based methods outperformed user-based methods for this use case. One possible reason is that game characteristics rarely change, while user preferences can shift over time. In addition, restricting the items to the top 500, compared with more than 63 thousand users, made item similarities faster to compute. However, a major downside of neighborhood-based models is their computational cost: training time grows quickly as the number of data points increases. ALS had the second-best RMSE of 1.1772, close to that of Item-Based KNN, while remaining scalable through its integration with Apache Spark. It also reached this accuracy with only 4 latent factors. ALS was therefore selected as the best overall model.
</p>

<p style="line-height: 1.5;">

However, the ALS model's Normalized Discounted Cumulative Gain (NDCG) revealed a significant limitation. NDCG measures how well the model orders its top-k recommendations. The NDCG@3 on the test set is 0.0323. So although the model predicts ratings reasonably well, with an RMSE of 1.1772 on the test set, it struggles to rank a user's top 3 items correctly. Even so, its predictive accuracy and scalability make it the most practical model for deployment. Future work can improve the recommendations by tuning the model on ranking metrics.
</p>


</div>
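
<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
To make the gap between rating accuracy and ranking quality concrete, the following is a minimal sketch of the two metrics for a single user. It assumes graded relevance (the true rating is used as the gain), which may differ from the computation in the notebook: Spark's RankingEvaluator, for example, scores a ranked list against a set of relevant items. The function names are illustrative only.
</p>
</div>

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance values."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(true_ratings, predicted_scores, k=3):
    """NDCG@k for one user: rank items by predicted score, gain = true rating."""
    order = sorted(range(len(true_ratings)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_ratings[i] for i in order]
    ideal = sorted(true_ratings, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy case: predictions stay close to the ratings but invert their order.
true_ratings = [9, 8, 7, 6, 5]
predicted = [7.0, 7.1, 7.2, 7.3, 7.4]
toy_rmse = rmse(true_ratings, predicted)
toy_ndcg = ndcg_at_k(true_ratings, predicted, k=3)
```

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
In the toy case every prediction sits within a few points of the true rating, yet the ranking is fully reversed, so the NDCG falls below 1 even though the RMSE looks moderate.
</p>
</div>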

<div style="padding-bottom: 7px;">
</div>

<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Neighborhood-Based Methods (KNN)</h4>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
User-Based collaborative filtering makes recommendations by finding the users most similar to the target user and recommending the items those users liked most. Item-Based collaborative filtering instead measures similarity between items and predicts a user’s rating for a game from the ratings that user gave to similar games. Similarities were measured with mean squared difference (MSD) and Pearson baseline. Hyperparameter tuning was used to select the best model. The candidate values for k were 20 and 60, chosen to compare different bias-variance trade-offs: a smaller k such as 20 captures more personalized similarities, while a larger k such as 60 gives more stable predictions. The minimum support parameter sets the minimum number of common ratings required for a similarity to be computed. The code for preprocessing can be found in the supplementary notebook ‘Neighborhood-and-SVD.ipynb’.
</p>
<p style="line-height: 1.5;">
All neighborhood-based models achieved a low RMSE. The best of them was the item-based KNN model, with an RMSE of 1.15, using Pearson baseline as the similarity metric with k = 20 and min_support = 10.
</p>
</div>
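
<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The roles of k and minimum support in item-based prediction can be sketched in plain Python. This is an illustration under simplifying assumptions, not the code in the supplementary notebook: it uses ordinary Pearson correlation rather than the Pearson baseline variant, keeps only positively correlated neighbors, and all names and ratings are hypothetical.
</p>
</div>

```python
import math


def pearson(ratings_a, ratings_b, min_support=1):
    """Pearson correlation between two items over their common raters.

    Each argument maps user -> rating. Returns 0.0 when fewer than
    `min_support` users rated both items or when either item has no variance.
    """
    common = set(ratings_a) & set(ratings_b)
    if len(common) < min_support:
        return 0.0
    mean_a = sum(ratings_a[u] for u in common) / len(common)
    mean_b = sum(ratings_b[u] for u in common) / len(common)
    cov = sum((ratings_a[u] - mean_a) * (ratings_b[u] - mean_b) for u in common)
    var_a = sum((ratings_a[u] - mean_a) ** 2 for u in common)
    var_b = sum((ratings_b[u] - mean_b) ** 2 for u in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict_item_based(user, target, item_ratings, k=20, min_support=1):
    """Predict `user`'s rating of `target` from the k most similar rated items."""
    neighbors = []
    for item, ratings in item_ratings.items():
        if item == target or user not in ratings:
            continue
        sim = pearson(item_ratings[target], ratings, min_support)
        if sim > 0:
            neighbors.append((sim, ratings[user]))
    neighbors.sort(reverse=True)  # most similar items first
    top = neighbors[:k]
    if not top:
        return None  # no usable neighbors for this user and item
    # Similarity-weighted average of the user's ratings on the neighbors.
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


# Toy data: game A tracks game B closely and runs opposite to game C.
toy_items = {
    "A": {"u1": 8, "u2": 6, "u3": 9},
    "B": {"u1": 8, "u2": 6, "u3": 9, "u4": 7},
    "C": {"u1": 5, "u2": 9, "u3": 4, "u4": 3},
}
pred_loose = predict_item_based("u4", "A", toy_items, k=20, min_support=1)
pred_strict = predict_item_based("u4", "A", toy_items, k=20, min_support=10)
```

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
With min_support = 10 the toy items share too few raters, so no neighbor qualifies and the sketch returns None. This shows how a stricter support threshold trades coverage for more reliable similarities.
</p>
</div>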

<div style="padding-bottom: 7px;">
</div>

<div style="text-align: center;">
    <img src="https://imgur.com/SrvTNgu.png" alt="Alt text" width="1200">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Figure 3. Process and Results for User-Based Collaborative Filtering
</p>
</div>

<div style="text-align: center;">
    <img src="https://imgur.com/j91fBrt.png" alt="Alt text" width="800">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Figure 4. Process and Results for Item-Based Collaborative Filtering
</p>
</div>

<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Model-Based Methods</h4>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
Model-based methods such as singular value decomposition (SVD) and alternating least squares (ALS) make recommendations by learning lower-dimensional representations of the data through latent factors. In SVD, each user and item is represented as a vector in this latent space, and a predicted rating is derived from the dot product of the two vectors. The hyperparameter tuned was n_factors, the number of latent factors. The values 50, 100, and 150 were tried to find a setting that balances model complexity against overfitting. The code for preprocessing can be found in the supplementary notebook ‘Neighborhood-and-SVD.ipynb’.
</p>
                                                               
<p style="line-height: 1.5;">
ALS, implemented with PySpark, is a scalable matrix factorization technique designed for large datasets. It learns latent factors by alternating between two steps, solving for the user factors while the item factors are held fixed and then the reverse, minimizing the squared error with distributed computation at each step. The method is well suited to sparse datasets and parallel processing. The main hyperparameters tuned were rank (2, 4, 6), which sets the number of latent factors, and regParam (0.01, 0.005, 0.001), the regularization parameter that penalizes large factor values. The code for preprocessing can be found in the supplementary notebook ‘ALS-modeling.ipynb’.
</p>

<p style="line-height: 1.5;">
Of the two, ALS with rank=4 and regParam=0.001 achieved the best RMSE among the model-based methods, 1.1772.
</p>


</div>
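
<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The alternating updates can be illustrated with a deliberately simplified sketch. It assumes a single latent factor (rank 1) so that each update has a closed form, whereas the report tuned rank over 2, 4, and 6 with Spark's distributed implementation. The function names and toy ratings are hypothetical, and the regularization here is a plain L2 penalty that is not scaled the way Spark scales regParam.
</p>
</div>

```python
def als_rank1(ratings, n_users, n_items, reg=0.001, iters=20):
    """Rank-1 ALS on (user, item, rating) triples; returns user and item factors."""
    p = [1.0] * n_users  # one latent factor per user
    q = [1.0] * n_items  # one latent factor per item
    for _ in range(iters):
        # Hold item factors fixed; each user factor has a closed-form solution.
        for u in range(n_users):
            num = sum(r * q[i] for uu, i, r in ratings if uu == u)
            den = sum(q[i] ** 2 for uu, i, _ in ratings if uu == u)
            p[u] = num / (den + reg)
        # Hold user factors fixed; solve each item factor the same way.
        for i in range(n_items):
            num = sum(r * p[u] for u, ii, r in ratings if ii == i)
            den = sum(p[u] ** 2 for u, ii, _ in ratings if ii == i)
            q[i] = num / (den + reg)
    return p, q


def predict(p, q, user, item):
    """Predicted rating is the product of the two rank-1 factors."""
    return p[user] * q[item]


# Toy ratings generated from user factors (2, 3, 4) and item factors (1, 2, 2.5).
# The pairs (0, 1) and (2, 2) are left out as unobserved.
toy_ratings = [
    (0, 0, 2.0), (0, 2, 5.0),
    (1, 0, 3.0), (1, 1, 6.0), (1, 2, 7.5),
    (2, 0, 4.0), (2, 1, 8.0),
]
p, q = als_rank1(toy_ratings, n_users=3, n_items=3, reg=0.001, iters=50)
held_out = predict(p, q, 2, 2)  # the generating matrix has 10.0 here
```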

<div style="padding-bottom: 7px;">
</div>

<div style="text-align: center;">
    <img src="https://imgur.com/SrvTNgu.png" alt="Alt text" width="1200">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Figure 5. Process and Results for SVD
</p>
</div>

<div style="text-align: center;">
    <img src="https://imgur.com/c3MfuHW.png" alt="Alt text" width="650">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin: 30px; margin-left: 20px; font-style: italic;">
    Figure 6. Process and Results for ALS, with NDCG
</p>
</div>

<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">CONCLUSION & RECOMMENDATIONS</h3>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
In conclusion, this analysis shows that no single “best” recommendation model exists; the best choice depends on the main goal of the deployment. A clear trade-off emerged between accuracy, scalability, and speed. The Item-Based Pearson KNN approach had the lowest RMSE (1.1503), but its long run time and memory demands limit its usefulness in a production environment. The ALS model, with a competitive RMSE of 1.1772, is therefore the better all-around choice for a real-world system because it is built to scale efficiently to large datasets on Spark.

</p>

<p style="line-height: 1.5;">
However, further investigation uncovered a significant weakness. The very low NDCG@3 of 0.0323 for ALS shows that, although the model predicts individual ratings reasonably well, it is poor at identifying and ranking a user’s actual top 3 games, revealing a gap between prediction accuracy and ranking quality.

</p>

<p style="line-height: 1.5;">
Overall, the preprocessing strategy was successful, providing a robust foundation of 6.1 million ratings for modelling. This analysis also makes the next steps clear: future work should focus on hybrid approaches and ranking-specific optimization to address the identified weaknesses and deliver recommendations that are both accurate and relevant to a user’s top preferences.
</p>


<p style="line-height: 1.5;">
Based on the analysis, the following improvements are recommended:
</p>

<p style="line-height: 1.5;">
<strong>Hybrid Approach</strong>
</p>
<ul style="line-height: 1.5;">
<li>Use Item-Based KNN as a fallback for new users to ease the “cold start” problem. Even when a user has too few ratings for ALS to be effective, this method can still give reasonable recommendations based on item similarity.</li>
<li>Use ALS for established users with a rich rating history, as it can generate more personalized predictions.</li>
<li>Build a weighted ensemble that combines the prediction scores of the different models. This can smooth out the errors of individual models and lead to a more accurate and robust final rating.</li>
</ul>
</div>
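
<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
One possible way to combine the recommendations above is sketched below. It is an assumption-laden illustration rather than a tested design: the history threshold (min_history), the blend weight (als_weight), and the function name are hypothetical and would need to be tuned on validation data.
</p>
</div>

```python
def hybrid_predict(n_user_ratings, knn_pred, als_pred,
                   min_history=10, als_weight=0.5):
    """Route or blend item-based KNN and ALS predictions for one user-item pair.

    n_user_ratings: number of ratings the user has made so far.
    knn_pred / als_pred: predicted ratings, or None when a model has no output.
    """
    # Users with little history: rely on the item-similarity fallback.
    if als_pred is None or n_user_ratings < min_history:
        return knn_pred
    # Established users without a KNN prediction: use ALS alone.
    if knn_pred is None:
        return als_pred
    # Otherwise blend the two scores with a fixed weight.
    return als_weight * als_pred + (1 - als_weight) * knn_pred


new_user_score = hybrid_predict(3, knn_pred=7.0, als_pred=8.0)
blended_score = hybrid_predict(50, knn_pred=7.0, als_pred=8.0, als_weight=0.75)
```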

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>


<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">REFERENCES</h3>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
  <p style="line-height: 1.5;">
    <a href="https://marutitech.medium.com/what-are-the-types-of-recommendation-systems-3487cbafa7c9" target="_blank" style="color: #5a3731; text-decoration: none;">Maruti Techlabs. (2021, August 18). Types of Recommendation Systems & Their Use Cases. Medium.</a>
  </p>

  <p style="line-height: 1.5;">
    <a href="https://www.researchgate.net/profile/Ehsan-Momeni-Bashusqeh/post/What_is_the_newest_topic_in_recommender_systems/attachment/59d6435079197b807799ec94/AS%3A442575841697793%401482529711967/download/introduction-handbook-2015.pdf" target="_blank" style="color: #5a3731; text-decoration: none;">Ricci, F., Rokach, L., & Shapira, B. (2021). Recommender Systems Handbook. 3rd ed.</a>
  </p>

  <p style="line-height: 1.5;">
    <a href="https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews/data" target="_blank" style="color: #5a3731; text-decoration: none;">Van Elteren, J. (2022). <em>BoardGameGeek Reviews.</em></a>
  </p>

  <p style="line-height: 1.5;">
    <a href="https://doi.org/10.1109/ICSTCC.2019.8885455" target="_blank" style="color: #5a3731; text-decoration: none;">Zalewski, J., Ganzha, M., & Paprzycki, M. (2019). Recommender system for board games. In 2019 23rd International Conference on System Theory, Control and Computing (ICSTCC) (pp. 249–254). IEEE.</a>
  </p>
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>
