# Data cleaning and preparation
Before we jump into the EDA (exploratory data analysis), we need to get our data cleaned and ready.

# Where is the data from?
This project uses data that has been scraped from Board Game Geek (https://boardgamegeek.com) and made available on Kaggle.

**Source** [Board Game Reviews - Jan 2025](https://www.kaggle.com/datasets/bwandowando/boardgamegeek-board-games-reviews-jan-2025)

**Author Credit** [bwandowando](https://www.kaggle.com/bwandowando)

# About the raw data
The dataset consists of four CSV files. Some of the fields in each file are shown below:
- boardgames (Titles, Descriptions, Release Year, Ratings)
- boardgames_reviews (Pseudo Users, Comments, Post Date, Rating)
- users (Pseudo Users, Location, Badges)
- user_game_status (Pseudo Users, Ownership, Wishlisted)

Before loading the raw data into Jupyter Notebooks, I completed some initial data cleaning in SQL.

---


## boardgames.csv
The raw boardgames csv file initially looks like this:

![Raw_boardgames_file](../images/Raw_boardgames_file.png)

# The changes I made to this file are:
- Renamed column headers (more for personal preference, but it also helps me to remember what fields I have)
- Removed "link" and "thumbnail" as I don't currently need them

I then ran the below SQL code:
![boardgames_file_SQL_code](../images/boardgames_file_SQL_code.png)

The query can be copied from here:

```sql
SELECT
    a."Rank",
    a."Game ID",
    a."Title",
    a."Description",
    a."Release Year",
    a."Geek Rating",
    a."Avg Rating",
    a."Voters"
FROM
    (
        SELECT
            -- Number the rows within each Game ID so duplicates can be dropped
            ROW_NUMBER() OVER (PARTITION BY a."Game ID" ORDER BY a."Game ID") AS "Rn",
            a."Rank",
            a."Game ID",
            a."Title",
            a."Description",
            a."Year" AS "Release Year",
            a."Geek Rating",
            a."Avg Rating",
            a."Voters"
        FROM
            dbo."BPP - Board Games" a
        WHERE
            a."Geek Rating" IS NOT NULL
        AND
            a."Year" IS NOT NULL
        AND
            a."Game ID" IS NOT NULL
        AND
            a."Rank" IS NOT NULL
        AND
            a."Year" < 2024
    ) a
-- Keep only the first row for each Game ID
WHERE a."Rn" = 1
ORDER BY a."Rank"
```

# What does this do?

- The raw file consists of 161,404 "board games". I use quotes here because I would argue that not every entry is actually a board game.
- Firstly, any game without a "Geek Rating" is removed (more details about the Geek Rating are below). This reduces our list of games from 161,404 to 38,059.
- Some games do not have a release year, perhaps because no one knows when they were released. These were removed from the dataset, reducing the total by a further 276 to 37,783 games. Games without a release year included things like Go Fish and Poker Dice.
- Some games did not have a "Game ID" or a "Rank". These were often game expansions and second editions. They were all excluded, removing a further 10,943 and leaving 26,840. That's still a lot of games! I removed them from the initial analysis because, to begin with, I just want to look at the features of the original game to see if it is a classic. For a game to have a second edition or expansion, it must have garnered some level of success, which we could analyze later on. I also don't want the rating of the original game to be influenced by any expansions or second editions yet.
- Any duplicate "Game ID" values were also removed, just in case. There turned out to be no duplicates, but the code does no harm, so I left it in.
- Lastly, I removed any games released in 2024 or later (the query keeps only `"Year" < 2024`). This is because I want to work with full years' worth of data, and it ensures every game on the list has had at least one full year to collect ratings. This removed another 1,045 games.
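
For anyone who prefers to follow the logic in Python, the steps above can be sketched roughly as below. This is only an illustration, not the code I actually ran: the sample rows and their values are made up, and the function name `clean_boardgames` is hypothetical. Only the column names and the filter conditions come from the SQL query.

```python
# Made-up sample rows; only the column names come from the SQL query.
rows = [
    {"Rank": 1, "Game ID": 101, "Title": "Game A", "Year": 2017, "Geek Rating": 8.4},
    {"Rank": None, "Game ID": 102, "Title": "Game A: Expansion", "Year": 2019, "Geek Rating": 7.1},
    {"Rank": 2, "Game ID": 103, "Title": "Game B", "Year": None, "Geek Rating": 6.0},
    {"Rank": 3, "Game ID": 104, "Title": "Game C", "Year": 2024, "Geek Rating": 6.5},
    {"Rank": 4, "Game ID": 105, "Title": "Game D", "Year": 1995, "Geek Rating": None},
    {"Rank": 5, "Game ID": 106, "Title": "Game E", "Year": 2010, "Geek Rating": 6.2},
    {"Rank": 5, "Game ID": 106, "Title": "Game E", "Year": 2010, "Geek Rating": 6.2},  # duplicate
]


def clean_boardgames(rows):
    """Apply the same filters as the SQL query and count the rows left after each step."""
    counts = {"raw": len(rows)}
    steps = [
        ("has geek rating", lambda r: r["Geek Rating"] is not None),
        ("has release year", lambda r: r["Year"] is not None),
        ("has game id and rank", lambda r: r["Game ID"] is not None and r["Rank"] is not None),
        ("released before 2024", lambda r: r["Year"] < 2024),
    ]
    for name, keep in steps:
        rows = [r for r in rows if keep(r)]
        counts[name] = len(rows)

    # Keep the first row per Game ID, like ROW_NUMBER() ... = 1 in the query.
    seen = set()
    unique = []
    for r in rows:
        if r["Game ID"] not in seen:
            seen.add(r["Game ID"])
            unique.append(r)
    counts["deduplicated"] = len(unique)

    # Same ordering as ORDER BY "Rank".
    unique.sort(key=lambda r: r["Rank"])
    return unique, counts


cleaned, counts = clean_boardgames(rows)
for step, remaining in counts.items():
    print(f"{step}: {remaining}")
```

The sketch applies the duplicate check last, as the SQL does, whereas the list above mentions it before the 2024 filter. The final result is the same either way.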

This leaves our starting pot of games at...
# 25,795

---

# Understanding a Geek Rating
The dataset includes two key rating metrics for each board game.
- **Average Rating** - The average of all user submitted scores (out of 10).
- **Geek Rating** - This is a **Bayesian-adjusted score** used by BoardGameGeek (BGG) to provide a fairer ranking for games.

A Bayesian average is used to adjust a game's rating based on:
- The number of votes received (v).
- The game's average rating (R).
- The overall average rating across all games on BGG (C). This is often around 5.5.
- A constant (m), which sets how many votes a game needs before its own average starts to outweigh the overall average. When v equals m, the two averages are weighted equally.

# The formula

$$
\text{Geek Rating} = \frac{v}{v + m} \cdot R + \frac{m}{v + m} \cdot C
$$
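
As a quick sanity check on the formula (my own rearrangement, not something taken from BGG), the two weights always add up to 1, so the Geek Rating is a weighted average that always sits somewhere between R and C:

$$
\frac{v}{v + m} + \frac{m}{v + m} = 1
\quad \Rightarrow \quad
\text{Geek Rating} = \frac{v \cdot R + m \cdot C}{v + m}
$$

With very few votes (v much smaller than m) the result stays close to C, and with a very large number of votes (v much larger than m) it approaches R.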

# Example 1 : A new game with just a few votes
Let's look at a brand new game with only a few votes.
- **R** = 9.0 - an excellent raw average score.
- **v** = 25 - only 25 people have rated it so far. Without any adjustment, this game would threaten to be ranked the best game ever based on the opinion of only 25 people.
- **C** = 5.5 - the average review score of all games on BGG.
- **m** = 1000 - the constant, which works like a weighting towards the overall average (1000 is used for illustrative purposes; the actual constant used by BGG might be different).

Let's plug all of this into the formula:
$$
\text{Geek Rating} = \frac{25}{25 + 1000} \cdot 9.0 + \frac{1000}{25 + 1000} \cdot 5.5
$$

<br>

$$
= 0.024 \cdot 9.0 + 0.976 \cdot 5.5 = 0.22 + 5.37 = \mathbf{5.59}
$$

This reduces the rating from **9.0** to **5.59**! Our new game will need a much higher volume of positive reviews to climb to the top spot.

# Example 2 : CATAN - A very popular game released in 1995
- **R** = 7.09
- **v** = 132623
- **C** = 5.5
- **m** = 1000

Again, let's plug all of this into the formula:
$$
\text{Geek Rating} = \frac{132623}{132623 + 1000} \cdot 7.09 + \frac{1000}{132623 + 1000} \cdot 5.5
$$

<br>

$$
= 0.993 \cdot 7.09 + 0.007 \cdot 5.5 = 7.04 + 0.04 = \mathbf{7.08}
$$

Due to the high volume of ratings for CATAN, the Geek Rating does not differ much from the non-adjusted average, going from **7.09** to **7.08**. 
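
Both examples can be reproduced with a few lines of Python. This is a minimal sketch: the function name `geek_rating` is my own, and the defaults m = 1000 and C = 5.5 are the illustrative values used above, not confirmed BGG values.

```python
def geek_rating(v, R, m=1000, C=5.5):
    """Bayesian average: blend a game's own average R with the overall average C.

    v is the number of votes; m and C default to the illustrative values
    used in the examples, which may differ from what BGG really uses.
    """
    return (v * R + m * C) / (v + m)


new_game = geek_rating(v=25, R=9.0)
catan = geek_rating(v=132623, R=7.09)

print(f"New game: {new_game:.2f}")  # 5.59
print(f"CATAN:    {catan:.2f}")     # 7.08
```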

# Why use this calculation?
The Geek Rating is designed to prevent games with just a few (potentially biased) scores from ranking too high or too low.
In this project I will be using the Geek Rating to assess a game's **overall perceived quality**, since it accounts for both rating score and the number of votes.
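
To see the effect on a ranking, here is a small sketch comparing the order of three games when sorted by raw average versus Geek Rating. The three games and their numbers are entirely made up, and m = 1000 and C = 5.5 are the same illustrative values as before.

```python
def geek_rating(v, R, m=1000, C=5.5):
    """Bayesian average using the illustrative m and C from the examples."""
    return (v * R + m * C) / (v + m)


# (title, average rating R, number of votes v) - hypothetical games
games = [
    ("Hypothetical newcomer", 9.0, 25),
    ("Hypothetical cult hit", 8.2, 3000),
    ("Hypothetical classic", 7.5, 50000),
]

by_average = sorted(games, key=lambda g: g[1], reverse=True)
by_geek = sorted(games, key=lambda g: geek_rating(g[2], g[1]), reverse=True)

print("Ranked by average rating:")
for title, R, v in by_average:
    print(f"  {title}: {R:.2f} ({v} votes)")

print("Ranked by Geek Rating:")
for title, R, v in by_geek:
    print(f"  {title}: {geek_rating(v, R):.2f} ({v} votes)")
```

Sorted by raw average, the newcomer with 25 votes comes first. Sorted by Geek Rating, it drops to last place, behind the two games with thousands of votes.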

---