In [1]:
%%HTML
<script src="require.js"></script>

In [2]:
from IPython.display import HTML
HTML('''<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script><script>
code_show=true; 
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
} 
$( document ).ready(code_toggle);</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code!"></form>
''')

In [3]:
# Import necessary modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator, RankingEvaluator
from pyspark.sql import functions as F
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import Window


<div style="text-align: center;">
    <img src="https://i.imgur.com/jzRQsuC.png" alt="Alt text" width="1500">
</div>

<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">ABSTRACT</h3>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
Education is a fundamental driver of national progress, yet access to higher education in the Philippines remains highly unequal, particularly between urban and rural regions and across socioeconomic groups. This study examines the relationship between higher education accessibility and key demographic and economic factors, drawing on data from the Philippine Census of 2020 and CHED 2020 statistics. Prior research, including studies by Zamora and Dorado (2015) and Yee (2024), has highlighted the urban-rural divide in educational opportunities, showing that individuals in rural areas consistently lag in educational attainment. The findings of this study reinforce these disparities, demonstrating that urban regions, particularly Metro Manila and CALABARZON, have more higher education institutions (HEIs), higher enrollment rates, and greater faculty availability, leading to better employment opportunities and economic mobility. In contrast, rural areas face institutional shortages, lower participation rates, and higher poverty levels, exacerbating income inequality. While literacy rates are high, they do not necessarily translate to higher education enrollment due to financial constraints and geographic isolation. To address these challenges, policy interventions must focus on expanding HEIs in underserved areas, increasing financial aid programs, and improving faculty distribution. Enhancing access to higher education is critical for reducing socioeconomic disparities and promoting long-term national development.
</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>




<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">INTRODUCTION</h3>
</div>

<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Motivation</h4>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
In recent years, the board game industry has experienced a global resurgence, reaching an estimated market value of over USD 15 billion, driven by growing hobbyist communities and the rise of online review platforms such as BoardGameGeek. However, this rapid expansion has also led to market saturation, making it increasingly difficult for players to discover games that match their interests and for companies to reach their target audiences (Fortune Business Insights, 2025).
</p>

<p style="line-height: 1.5;">
For players, the overwhelming number of available titles leads to decision fatigue as they struggle to choose from thousands of options despite abundant reviews and ratings. Personalized recommendations can greatly enhance the player experience by surfacing games that align with specific play styles, complexity levels, and themes of interest. This leads to greater player satisfaction and loyalty to platforms that successfully anticipate user preferences (Ricci et al., 2021).
</p>

<p style="line-height: 1.5;">
Understanding player preferences is equally valuable for board game publishers and retailers. By analyzing rating data and player behavior, companies can gain insight into which mechanics, genres, or themes resonate most with their audiences. They can then make better-informed decisions in marketing, product design, and inventory management. In a highly competitive environment, data-driven personalization can serve as a differentiator that boosts both sales and long-term customer retention (Ricci et al., 2021).

</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>


<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Problem Statement</h4>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
The study aims to answer the main question: <b>How can explicit user ratings be used to recommend board games based on similar users?</b>

To address the main question, the study also answers the following subquestion:
- Which collaborative filtering approach, neighborhood-based or model-based, performs better?
</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>



<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">METHODOLOGY</h3>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
This report's methodology comprises a five-step process: data collection, exploratory data analysis, preprocessing, modeling, and evaluation. It begins with the collection of publicly available user review datasets from Jojie. The results and discussion section then presents the findings of each graph along with a summary of the insights and conclusions drawn from the analysis.
</p>

<p style="line-height: 1.5;">

To begin, the data must first be examined and explored. Data collection starts with loading and inspecting the dataset from Jojie’s public library. This includes scanning the data for errors and duplicates and determining how to clean it for modeling. An exploratory data analysis is then conducted to establish the dataset's dimensions, such as the number of unique users and unique board games, and to analyze user ratings. Plots and graphs are used to show the distribution of ratings, user activity, game popularity, and reception. These initial steps lay the foundation for understanding the dataset's structure and behavior, allowing for more informed decisions in preprocessing, modeling, and evaluation.
</p>
<p style="line-height: 1.5;">


The next step is the implementation of the models. First, the dataset undergoes preprocessing, which includes filtering for relevant information. The narrowed dataset is then used to fit two types of collaborative filtering models: neighborhood-based and model-based. For the neighborhood-based methods, user-based and item-based collaborative filtering are compared. For the model-based methods, alternating least squares and coordinate gradient descent are used.
</p>

<p style="line-height: 1.5;">

Finally, the Root Mean Squared Error (RMSE) serves as the key metric for selecting the best-performing model, since it measures how closely the model’s predicted ratings align with the actual user ratings.
</p>

</div>
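The neighborhood-based approach described above can be illustrated with a minimal user-based sketch. The toy rating matrix, the choice of cosine similarity, and the function names `cosine_similarity` and `predict_user_based` are assumptions made for illustration; the models in this study may use different similarity measures and neighborhood sizes.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: games).
# NaN marks a game the user has not rated. Values are made up for illustration.
ratings = np.array([
    [8.0, 7.0, np.nan, 3.0],
    [9.0, 6.0, 8.0, 2.0],
    [2.0, 3.0, 4.0, 9.0],
])

def cosine_similarity(a, b):
    """Cosine similarity computed over the games both users have rated."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if not mask.any():
        return 0.0
    denom = np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])
    return float(a[mask] @ b[mask] / denom) if denom > 0 else 0.0

def predict_user_based(ratings, user, item):
    """Predict a rating as the similarity-weighted mean of other users' ratings."""
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or np.isnan(ratings[other, item]):
            continue
        sim = cosine_similarity(ratings[user], ratings[other])
        weighted_sum += sim * ratings[other, item]
        weight_total += abs(sim)
    # Fall back to the user's own mean rating when no neighbor rated the game
    if weight_total == 0:
        return float(np.nanmean(ratings[user]))
    return weighted_sum / weight_total

prediction = predict_user_based(ratings, user=0, item=2)
print(round(prediction, 2))
```

In this toy case the prediction for user 0 is pulled toward the rating given by user 1, whose tastes are more similar, rather than toward that of user 2. An item-based variant would apply the same idea to the columns of the matrix instead of the rows.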

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>
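RMSE is the square root of the mean squared difference between predicted and actual ratings, so larger errors are penalized more heavily than small ones. A minimal sketch follows; the helper name `rmse` and the sample ratings are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings on the 0-10 scale, for illustration only
actual = [8.0, 6.5, 9.0, 4.0]
predicted = [7.0, 6.5, 8.0, 6.0]
print(rmse(actual, predicted))
```

Because RMSE is expressed in the same units as the ratings, a lower value means the predictions are, on average, closer to what users actually scored.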

<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">DATA PROCESSING</h3>
</div>

<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Data Source</h4>
</div>

In [12]:
pd.set_option('display.float_format', '{:.2f}'.format)

orginal_df = pd.read_csv("/mnt/data/public/bgg/bgg-19m-reviews.csv")

styled = (
    orginal_df.head(4).style
        .set_table_styles([
            {
                'selector': 'thead',
                'props': [
                    ('background-color', '#ffffff'),
                    ('color', '#000000'),
                    ('font-weight', 'bold'),
                    ('border-bottom', '2px solid #000')
                ]
            },
            {
                'selector': 'tbody tr:nth-child(odd)',
                'props': [
                    ('background-color', '#f9f9f9'),
                    ('color', '#000000')
                ]
            },
            {
                'selector': 'tbody tr:nth-child(even)',
                'props': [
                    ('background-color', '#ffffff'),
                    ('color', '#000000')
                ]
            },
            {
                'selector': 'td',
                'props': [
                    ('padding', '4px 8px'),
                    ('text-align', 'center'),
                    ('color', '#000000')
                ]
            }
        ])
        .set_table_attributes('style="margin-left:auto; margin-right:auto;"')  # Center align table
)

display(styled)


In [None]:
print(f"Shape: {orginal_df.shape}")
print(f"Column Names: {orginal_df.columns}")

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">

The dataset is the BoardGameGeek Reviews dataset by Jesse van Elteren, available on <a href="https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews/data">Kaggle</a>.

The raw DataFrame contains 18,964,807 rows and 6 columns:

- <b>'Unnamed: 0'</b> contains the row indices from the CSV file
- <b>'user'</b> contains the username of the reviewer
- <b>'rating'</b> contains the score the user gave the game, on a continuous decimal scale from 0 to 10
- <b>'comment'</b> contains the user's optional comment on the game
- <b>'ID'</b> contains the game ID linked to the review
- <b>'name'</b> contains the primary game name linked to the game ID

The dataset only contains games with at least 30 reviews and users with at least 1 review.
</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>
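The properties described above (the rating scale and the minimum number of reviews per game and per user) can be checked with a few pandas operations. The small DataFrame below is made up for illustration, and the threshold is lowered to 2 so the toy data can exercise it; on the full data the same checks would run against the loaded reviews table with a threshold of 30.

```python
import pandas as pd

# Toy stand-in for the reviews table; all values are made up for illustration
reviews = pd.DataFrame({
    "user": ["ann", "ben", "ann", "cal", "ben"],
    "rating": [8.0, 7.5, 6.0, 9.0, 3.0],
    "ID": [1, 1, 2, 2, 3],
})

# Every rating should fall on the 0-10 scale
ratings_in_range = reviews["rating"].between(0, 10).all()

# Count reviews per game and per user
reviews_per_game = reviews.groupby("ID")["rating"].count()
reviews_per_user = reviews.groupby("user")["rating"].count()

# Games that meet a minimum review count (30 in the real dataset, 2 here)
min_reviews = 2
games_meeting_threshold = reviews_per_game[reviews_per_game >= min_reviews].index.tolist()

print(ratings_in_range)
print(games_meeting_threshold)
```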



<div style="background-color: #fff3e4; padding: 5px; border-radius: 3px;">
    <h4 style="color: #5a3731; font-size: 25px; margin: 10px;">Data Preprocessing</h4>
</div>

In [None]:
print("Datatype per column:")
print(orginal_df.dtypes)
print()
print("Unique values per column:")
print(orginal_df.nunique())

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The BoardGameGeek dataset is loaded into a pandas DataFrame for exploratory data analysis, preprocessing, and modeling with neighborhood-based and SVD collaborative filtering. The raw dataset contains around 18.9 million unique users, 10 thousand unique numerical ratings, 30 thousand unique comments, 21.8 thousand unique game IDs, and 21.4 thousand unique game names.
</p>

</div>

<div style="padding-bottom: 7px;">
</div>



In [None]:
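# Find rows where the same user reviewed the same game name more than once
duplicates = orginal_df[orginal_df.duplicated(subset=["user", "name"], keep=False)]
duplicates = duplicates.sort_values(by="user")

# Take the game IDs of the last two duplicated rows as a sample
sample_ids = duplicates["ID"].iloc[-2:].to_list()

# Look up the primary and alternate names of the sampled game IDs
sample = pd.read_csv("/mnt/data/public/bgg/games_detailed_info.csv")
for game_id in sample_ids:  # avoid shadowing the built-in id()
    print(sample[sample["id"] == game_id][['id', 'primary', 'alternate']])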

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
The first step of the data cleaning process is dropping unnecessary columns. Since the study only tackles explicit collaborative filtering, the models use only numerical user scores, so the columns 'Unnamed: 0' and 'comment' are dropped. Upon closer inspection, multiple game IDs share the same primary game name but have different alternate names depending on the year or version released. For example,
game ID 129622 is the regular version of Love Letter while game ID 277085 is the premium version. Since different versions have different features and price points, the 'name' column was dropped from the dataset in favor of the 'ID' column.
</p>

</div>

<div style="padding-bottom: 7px;">
</div>
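The observation that one primary name can map to several game IDs can be checked by counting distinct IDs per name. The toy rows below reuse the two Love Letter IDs mentioned above; the remaining row, the usernames, and the variable names are made up for illustration.

```python
import pandas as pd

# Toy review rows: two versions of Love Letter plus one hypothetical game
reviews = pd.DataFrame({
    "user": ["ann", "ann", "ben", "cal"],
    "ID": [129622, 277085, 129622, 1],
    "name": ["Love Letter", "Love Letter", "Love Letter", "Example Game"],
})

# Number of distinct game IDs that share each primary name
ids_per_name = reviews.groupby("name")["ID"].nunique()
ambiguous_names = ids_per_name[ids_per_name > 1].index.tolist()
print(ambiguous_names)

# Dropping 'name' keeps the two versions distinct through their IDs
ratings_only = reviews.drop(columns=["name"])
print(ratings_only.columns.tolist())
```

Here user "ann" appears twice under the same name but with different IDs, which is why deduplicating on the name would merge reviews of distinct versions.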






In [None]:
orginal_df[orginal_df.duplicated(subset=["user", "ID"], keep=False)]

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">
<p style="line-height: 1.5;">
Each row in the dataset corresponds to a unique username and game ID pairing, meaning the dataset contains only the latest review of each user for each game.
</p>

</div>

<div style="padding-bottom: 7px;">
</div>






```python
# Keep only the columns needed for explicit collaborative filtering
orginal_df = orginal_df[['user', 'rating', 'ID']]

spark = (SparkSession
     .builder
     .master('local[*]')  # run Spark locally, using all available cores
     .config("spark.driver.memory", "8g")
     .config("spark.executor.memory", "8g")
     .config("spark.sql.shuffle.partitions", "8")  # fewer shuffle partitions for local mode
     .getOrCreate())  # reuse the active session or create a new one

spark.sparkContext.setCheckpointDir("./als_spark_checkpoints")

# Read the BoardGameGeek reviews file with Spark
schema = """
_c0 INT,
user STRING,
rating FLOAT,
comment STRING,
id INT,
name STRING
"""
# multiLine, quote, and escape handle quotation marks and line breaks in the comment column
df_spark = spark.read.csv(
    "/mnt/data/public/bgg/bgg-19m-reviews.csv",
    sep=',', header=True,
    schema=schema,
    multiLine=True,
    quote='"',
    escape='"')
```


<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">
<p style="line-height: 1.5;">
Meanwhile, the same raw data was preprocessed as a Spark DataFrame in order to use Spark's Alternating Least Squares implementation. For Spark, the schema is provided explicitly, and the multiLine, quote, and escape options need to be set to properly handle quotation marks in the comment column. Refer to the supplementary notebook “ALS_modeling.ipynb” for the specific implementation.
</p>                                                                                                                                                                                                                                                                                                            
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>
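To give a sense of what alternating least squares does, the following is a minimal NumPy sketch of the idea, not Spark's implementation and not the code in the supplementary notebook. It alternates between solving a regularized least-squares problem for each user's factors (with game factors fixed) and for each game's factors (with user factors fixed). The toy matrix, the rank, the regularization value, and the function names `als_factorize` and `observed_rmse` are all assumptions made for illustration.

```python
import numpy as np

def als_factorize(ratings, rank=2, reg=0.1, iterations=20, seed=0):
    """Factor a ratings matrix (NaN = missing) into U @ V.T by alternating ridge solves."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    observed = ~np.isnan(ratings)
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iterations):
        # Fix game factors, solve a ridge regression for each user
        for u in range(n_users):
            idx = observed[u]
            if idx.any():
                A = V[idx].T @ V[idx] + ridge
                b = V[idx].T @ ratings[u, idx]
                U[u] = np.linalg.solve(A, b)
        # Fix user factors, solve a ridge regression for each game
        for i in range(n_items):
            idx = observed[:, i]
            if idx.any():
                A = U[idx].T @ U[idx] + ridge
                b = U[idx].T @ ratings[idx, i]
                V[i] = np.linalg.solve(A, b)
    return U, V

def observed_rmse(ratings, U, V):
    """RMSE over the observed (non-NaN) entries only."""
    observed = ~np.isnan(ratings)
    errors = (U @ V.T)[observed] - ratings[observed]
    return float(np.sqrt(np.mean(errors ** 2)))

# Made-up ratings; NaN marks games a user has not rated
ratings = np.array([
    [8.0, 7.0, np.nan, 3.0],
    [9.0, 6.0, 8.0, 2.0],
    [2.0, 3.0, 4.0, np.nan],
    [np.nan, 2.0, 5.0, 9.0],
])

U_start, V_start = als_factorize(ratings, iterations=0)  # random initial factors
U, V = als_factorize(ratings, iterations=20)
rmse_before = observed_rmse(ratings, U_start, V_start)
rmse_after = observed_rmse(ratings, U, V)
print(rmse_before, rmse_after)
```

The product `U @ V.T` also fills in the missing entries, which is how a factorization model produces predicted ratings for games a user has not yet rated.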




<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">EXPLORATORY DATA ANALYSIS</h3>
</div>


<div style="padding-bottom: 7px;">
</div>



<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">RESULTS & DISCUSSION</h3>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; padding-bottom: 15px">

<p style="line-height: 1.5;">

</p>
</div>

<div style="padding-bottom: 7px;">
</div>


<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">CONCLUSION & RECOMMENDATIONS</h3>
</div>

<div style="font-family: 'Poppins', sans-serif; font-size: 15px; border-bottom: 2px solid #5a3731; padding-bottom: 15px">

<p style="line-height: 1.5;">

</p>
</div>

<div style="border-bottom: 1px solid #ffd3a0; box-shadow: 0px 1px 0px #5a3731; padding-bottom: 7px;">
</div>


<div style="background-color: #ffd3a0; padding: 10px; border-radius: 5px;">
    <h3 style="color: #5a3731; font-size: 30px; font-weight: bold; margin: 10px;">REFERENCES</h3>
</div>

Van Elteren, J. (2022). *BoardGameGeek Reviews.* https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews/data