# Welcome to my BoardGameGeek EDA

In this exploration, I aim to assist a future bar owner, Mr. Barman, in making critical decisions about his bar's board games inventory. This includes selecting games to purchase based on his requests and planned events to attract both board game enthusiasts and amateurs. The world of board games extends beyond the commonly known games like Monopoly, card games, chess, etc., and I hope to shed light on this less explored domain.

Through this data-driven analysis, I will investigate past trends in board games and where they are likely heading. My goal is to provide Mr. Barman with insights and predictions that may benefit his bar in the long term, ensuring its longevity and success.

This EDA also serves a personal interest. For me, board games are a medium through which I can connect, engage in social interactions, and strengthen bonds with those around me.

## Key Points

- **Game Acquisition:** When advising Mr. Barman on which games to acquire, I will focus on games released since the 1990s, as they are generally easier to obtain.
- **Research Scope:** For my personal research, I will examine the entire DataFrame to gain a comprehensive understanding of board game trends.

Join me as we delve into the fascinating world of board games and discover valuable insights that could help shape the future of Mr. Barman's bar.

In [None]:
# Importing necessary packages and libraries
import pandas as pd
import numpy as np
import seaborn as sns
import ast
from wordcloud import WordCloud
import matplotlib.pyplot as plt
#Functions to be used
#Label each weight
def map_rating(rating):
    if rating >= 0 and rating <=1:
        return '0-1 Unrated'    
    elif rating > 1 and rating <= 2:
        return '1-2 Easy'
    elif rating > 2 and rating <= 3:
        return '2-3 Intermediate'
    elif rating > 3 and rating <= 4:
        return '3-4 Challenging'
    elif rating > 4 and rating <= 5:
        return '4-5 Hard'
    else:
        return 'Invalid'
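
As a side note, the same labelling can be done for a whole column at once with `pd.cut`. The sketch below is only an illustration: the names `WEIGHT_BINS`, `WEIGHT_LABELS` and `label_weights` are hypothetical, and the sample weights are made up. The bin edges and labels are copied from `map_rating`.

```python
import pandas as pd

# Bin edges and labels copied from map_rating; these names are hypothetical
WEIGHT_BINS = [0, 1, 2, 3, 4, 5]
WEIGHT_LABELS = ['0-1 Unrated', '1-2 Easy', '2-3 Intermediate',
                 '3-4 Challenging', '4-5 Hard']

def label_weights(weights: pd.Series) -> pd.Series:
    """Label a whole column of weights, mirroring map_rating."""
    # Right-closed bins; include_lowest=True puts a weight of 0 in the first bin
    binned = pd.cut(weights, bins=WEIGHT_BINS, labels=WEIGHT_LABELS,
                    include_lowest=True)
    # Missing or out-of-range weights come back as NaN; label them 'Invalid'
    return binned.astype(object).fillna('Invalid')

# Made-up weights, including a missing value and an out-of-range value
sample_weights = pd.Series([0.0, 1.0, 1.5, 3.0, 4.2, None, 7.0])
sample_labels = label_weights(sample_weights)
print(sample_labels.tolist())
```

Both approaches should give the same labels; `map_rating` with `.map` is easier to read, while `pd.cut` avoids calling a Python function once per row.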
    

In [None]:
# Load and preprocess the dataset
bgg_raw = pd.read_csv("/kaggle/input/bgg-eda/games_detailed_info.csv", index_col=0, low_memory=False)
selected_columns = [
    'Board Game Rank', 'primary', 'description', 'yearpublished', 'minplayers',
    'maxplayers', 'minage', 'boardgamedesigner', 'average', 'averageweight',
    'boardgamecategory', 'boardgamemechanic', 'usersrated', 'Family Game Rank'
]
# Take an explicit copy so later assignments do not act on a slice of bgg_raw
bgg = bgg_raw[selected_columns].copy()

# Convert ranks to numeric; non-numeric entries become NaN
bgg['Board Game Rank'] = pd.to_numeric(bgg['Board Game Rank'], errors='coerce')
bgg['Family Game Rank'] = pd.to_numeric(bgg['Family Game Rank'], errors='coerce')

# Sort the dataframe by Board Game Rank
bgg = bgg.sort_values(by='Board Game Rank')

# Map the average weight to rating labels
bgg['Rating_Label'] = bgg['averageweight'].map(map_rating)

# Constants
USERSRATED_TOP = bgg['usersrated'].quantile(0.90)
WEIGHT_MEDIAN = bgg['averageweight'].median()
AVG_MEDIAN = bgg['average'].median()

In [None]:
# Displaying the top 5 games
bgg.head()

### Ensuring Dataframe Integrity for Analysis

Before diving into the analysis, it's crucial to verify the integrity, usefulness, and reliability of the DataFrame. This can be achieved through three methods:

1. **Describe**: Provides a statistical summary including minimum, maximum, percentiles, mean, etc.
2. **Info**: Displays data types and counts of non-NaN cells for each column.
3. **Isna**: Counts and presents the null values per column for easier review.

In [None]:
#subsetting the DataFrame to numerical columns only. 
num_cols=['yearpublished','minplayers', 'maxplayers','minage','average','averageweight','usersrated','Board Game Rank','Family Game Rank']
bgg[num_cols].describe()

Some values look **abnormal** and may point to an **incomplete** or **faulty DataFrame**, so they need further investigation:
1. **Min year**: `-3500` (can a game really be that old?)
2. **Max players**: `999` (a game for roughly a thousand players is implausible)
3. **Min age**: `25` (why isn't it `18`?)
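
A minimal sketch of how such suspicious values could be counted with boolean masks. The toy DataFrame, the game names and the thresholds are all assumptions made for illustration; they are not taken from the BGG data.

```python
import pandas as pd

# Toy data standing in for the BGG DataFrame (all values are made up)
toy = pd.DataFrame({
    'primary': ['Ancient Game', 'Open Party Game', 'Drinking Game', 'Modern Game'],
    'yearpublished': [-3500, 2015, 2010, 2019],
    'maxplayers': [2, 999, 8, 4],
    'minage': [8, 10, 25, 12],
})

# Hypothetical plausibility rules; the thresholds are assumptions
checks = {
    'published before year 0': toy['yearpublished'] < 0,
    'more than 100 players': toy['maxplayers'] > 100,
    'minimum age above 18': toy['minage'] > 18,
}

# Count how many rows each rule flags
flag_counts = pd.Series({name: int(mask.sum()) for name, mask in checks.items()})
print(flag_counts)
```

Keeping the rules in one dictionary makes it easy to add or loosen a rule later without touching the counting logic.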

In [None]:
bgg.info()

Each column's data type matches what its name suggests.

In [None]:
bgg.isna().sum()

When an analysis depends on mechanics, I will drop the games whose mechanic is NA; otherwise, those games stay in. The `Family Game Rank` column ranks only the 2327 family-friendly games; every other game is NA.

### Checking for Duplicates in the DataFrame

In [None]:
# Count rows that are exact copies of an earlier row
dupe = bgg.duplicated().sum()
print(f"Number of duplicated rows: {dupe}")

Furthermore, the check finds **no duplicated rows**, so no game appears twice with identical details.
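
Note that `duplicated()` without arguments only flags rows that match in every column. A stricter check, sketched here on made-up data, would look for repeated titles in the `primary` column. Whether a repeated title is a real duplicate is a judgment call, since different editions can share a name.

```python
import pandas as pd

# Made-up rows: the same title appears twice with different years
toy = pd.DataFrame({
    'primary': ['Game A', 'Game A', 'Game B'],
    'yearpublished': [1995, 2010, 2003],
})

# Exact duplicates: every column must match
full_row_dupes = int(toy.duplicated().sum())

# Title duplicates: keep=False marks every row that shares a title
same_title = toy[toy.duplicated(subset=['primary'], keep=False)]

print(f"Exact duplicate rows: {full_row_dupes}")
print(same_title)
```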

### Investigating Abnormalities Found in the Data

First, I will explore games that are listed as being published **before the common era (B.C.E.)**.

In [None]:
# Count and list the games dated before the common era
bce_mask = bgg['yearpublished'] < 0
print(f"There are {bce_mask.sum()} games in the DataFrame dated before the common era.")
bgg[bce_mask]

It looks reasonable that some games have very low `yearpublished` values since they are before the common era.

### Second Abnormality: Games with Unreasonable Number of Players

In [None]:
bgg[bgg['maxplayers']>10].sort_values('maxplayers',ascending=False)

After performing manual research on the top 3 games: **"Scirmish"**, **"I Don't Know, What Do You Want to Play?"**, and **"Start Player: A Kinda Collectible Card Game"**, I've come to the conclusion that games listed with a max player count of `999` are card games without a number of players restriction and were noted as `999`.
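
Since `999` acts as a placeholder rather than a real player count, it can distort statistics such as the maximum or the mean. One possible way to handle it, sketched on made-up data, is to keep a flag column and mask the placeholder as missing. The column names `no_player_limit` and `maxplayers_clean` are hypothetical.

```python
import pandas as pd

NO_LIMIT = 999  # placeholder observed in the maxplayers column

# Made-up rows for illustration
toy = pd.DataFrame({
    'primary': ['Game A', 'Open Card Game', 'Game B'],
    'maxplayers': [4, 999, 6],
})

# Remember which games have no fixed limit, then hide the placeholder
toy['no_player_limit'] = toy['maxplayers'] == NO_LIMIT
toy['maxplayers_clean'] = toy['maxplayers'].mask(toy['no_player_limit'])

# Summary statistics now skip the placeholder
print(toy['maxplayers_clean'].max())
print(toy['maxplayers_clean'].mean())
```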

The plot below shows that the majority of games are designed for 2-4 players, with a noticeable decline in the number of games once the player count exceeds 12. Interestingly, there is also a cluster at 999 players (games without a fixed player limit); the plot leaves it out by keeping only games with fewer than 20 max players.

In [None]:
# Setting up the figure size and style for better visualization
plt.figure(figsize=(10, 6))

# Filtering the dataset for games with less than 20 max players
maxplayers_hist = bgg[bgg['maxplayers'] < 20]

# Plotting the histogram with specific adjustments for clarity and aesthetics
maxplayers_hist['maxplayers'].plot(kind='hist', 
                                   bins=range(1, 21), 
                                   align='left', 
                                   rwidth=0.8, )

# Setting the labels and title for the plot
plt.xlabel('Number of Players', fontsize=12)
plt.ylabel('Number of Games', fontsize=12)
plt.title('Distribution of Maximum Players per Game', fontsize=14)

# Adjusting the x-axis to have a tick for each possible number of players
plt.gca().xaxis.set_major_locator(plt.MultipleLocator(1))

# Adding a grid for better readability
plt.grid(True, which='both', linestyle='--', linewidth=0.5)

# Displaying the plot
plt.show()

Most games support up to **12 players**; beyond that, there are barely any games.

The last abnormality found with the `describe` method was a high minimum age of **25**, so I will look at games with a minimum age above **18**.

In [None]:
# Filtering the dataset for games with a minimum age requirement above 18
games_above_18 = bgg[bgg['minage'] > 18]

# Displaying the filtered DataFrame
games_above_18

While some drinking games set a minimum age of 21 (the legal drinking age in the USA), only one game has a minimum age of 25. After a manual search, I found that the game's page on the website has since been updated to 16+, which is reasonable.

I will perform some data cleaning:
- Some descriptions contain HTML entities.
- Some descriptions carry unwanted boilerplate phrases, which I will remove so that the `description` column is easier to read.

In [None]:
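# Cleaning HTML entities and specific phrases from the 'description' column
# Each entity is replaced with the character it encodes, so words do not run together
html_entities = {
    '&quot;': '"',
    '&#10;': ' ',   # encoded line break
    '&nbsp;': ' ',  # non-breaking space
    '&amp;': '&',
    '&ndash;': '-',
}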

phrases_to_remove = [
    'Description from BoardgameNews',
    'From the box'
]

# Applying replacements for HTML entities
for entity, replacement in html_entities.items():
    bgg['description'] = bgg['description'].str.replace(entity, replacement, regex=False)

# Removing specific phrases
for phrase in phrases_to_remove:
    bgg['description'] = bgg['description'].str.replace(phrase, '', regex=False)
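
The dictionary above only covers the entities I happened to notice. As an alternative sketch, Python's standard `html.unescape` decodes any named or numeric entity in one pass. The sample description below is made up, and collapsing whitespace afterwards is my own assumption about how the text should look.

```python
import html
import pandas as pd

# Made-up descriptions: one with entities, one missing
descriptions = pd.Series([
    'Roll &amp; write&#10;&#10;for 2&ndash;4 players',
    None,
])

# Decode every HTML entity; missing descriptions are left untouched
cleaned = descriptions.map(html.unescape, na_action='ignore')

# Collapse decoded line breaks and repeated spaces into single spaces
cleaned = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()

print(cleaned.iloc[0])
```

This keeps characters such as `&` and the en dash in the text instead of deleting them.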

After verification and manual checks online, I conclude that this DataFrame is intact, usable, and reliable, so I will continue with this EDA for Mr. Barman.

# Boardgame Evenings Schedule for Mr. Barman

Mr. Barman is planning a series of diverse board game evenings. Each evening is dedicated to a specific theme, ensuring a variety of gaming experiences. Here's the lineup:

1. **Singles Evening** - A night for solo gamers to enjoy deep, engaging single-player games.
2. **Couples Evening** - A romantic evening with games perfect for two.
3. **Children's/Toddler's Games** - Fun and educational games for the little ones.
4. **Miniatures Coloring Evening** - A creative session for painting and customizing miniatures.
5. **Party Games Evening** - High-energy games that are perfect for larger groups.
6. **Classic Boardgames Evening** - A nostalgic journey through timeless boardgame classics.
7. **Family Boardgames Evening** - Games that bring the whole family together.
8. **Cooperative Games** - Team up for a night of cooperative challenges and shared victories.

## Selection Criteria

To curate the perfect game list for each event, we'll consider three key factors:

- **Type of Event:** Tailoring the game selection to match the group size and dynamic, from solo players to family gatherings.
- **Theme/Category/Mechanics:** Selecting games that fit the evening's theme, whether it's strategy, creativity, or cooperative play.
- **BGG Internal Ranking:** Leveraging BoardGameGeek's comprehensive ranking system to choose top-rated games.

For each themed evening, I will recommend **10 top games** to ensure Mr. Barman can provide an unforgettable boardgame experience.
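
The selection pattern used for every evening is the same: filter by the evening's criteria, order by BGG rank, and keep the first 10. A minimal sketch of that pattern follows, where the helper name `top_games` and the toy rows are hypothetical.

```python
import pandas as pd

def top_games(games: pd.DataFrame, mask: pd.Series, n: int = 10) -> pd.DataFrame:
    """Return the n best-ranked games among the rows selected by mask."""
    # A lower 'Board Game Rank' means a better game, so sort ascending
    return games[mask].sort_values('Board Game Rank').head(n)

# Made-up rows for illustration
toy = pd.DataFrame({
    'primary': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Board Game Rank': [3, 1, 4, 2],
    'minplayers': [1, 2, 1, 1],
})

# Example: the two best-ranked games that can be played solo
solo_picks = top_games(toy, toy['minplayers'] == 1, n=2)
print(solo_picks['primary'].tolist())
```

In this notebook the DataFrame is already sorted by `Board Game Rank`, which is why a plain `.head(10)` after filtering gives the same result.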


In [None]:
# Calculate the median weight of the games
WEIGHT_MEDIAN = bgg['averageweight'].median()

# Subsetting the DataFrame based on the average weight to categorize games as heavy or light
heavy_games = bgg[bgg['averageweight'] >= WEIGHT_MEDIAN]
light_games = bgg[bgg['averageweight'] < WEIGHT_MEDIAN]

# Dropping rows with missing values in specific columns to clean the data
dropna_mech = bgg.dropna(subset=["boardgamemechanic"])
dropna_cate = bgg.dropna(subset=["boardgamecategory"])
dropna_desc = bgg.dropna(subset=['description'])

### Singles Night: 10 Light Single Games

In [None]:
light_games_single = light_games[light_games['minplayers']==1]
light_games_single.head(10)

### Singles Night: 10 Heavy Single Games

In [None]:
heavy_games_single = heavy_games[heavy_games['minplayers']==1]
heavy_games_single.head(10)

### Couples Night: 10 Light Games

In [None]:
light_games_couples=light_games[light_games['minplayers']==2]
light_games_couples.head(10)

### Couples Night: 10 Heavy Games

In [None]:
heavy_games_couples=heavy_games[heavy_games['minplayers']==2]
heavy_games_couples.head(10)

### Children's Games Published After 2000

In [None]:
# A selection for Mr. Barman to acquire; limited to recent games because children's games are discontinued quickly.
is_childrens = dropna_cate['boardgamecategory'].str.contains("children's game", case=False)
childrens_games = dropna_cate[is_childrens & (dropna_cate['yearpublished'] > 2000)]
childrens_games.head(10)

### Toddlers' Games for Ages 4 and Under

In [None]:
childrens_games[childrens_games['minage'] <=4].head(10)

### Card Games Night: Top Games with Card Mechanics

In [None]:
dropna_mech[dropna_mech['boardgamemechanic'].str.contains('card',case=False)].head(10)

### Party Games Night: Party Games Rated Above the Median

In [None]:
# Subset to the 'Party' category, keeping only games whose average rating is above the median.
party_games=dropna_cate[dropna_cate['boardgamecategory'].str.contains("party",case=False) & (dropna_cate['average'] > AVG_MEDIAN)]
party_games.head(10)

### Cooperative Games: 10 Light cooperative games

In [None]:
coop_games_light=dropna_mech[(dropna_mech['boardgamemechanic'].str.contains('Cooperative',case=False)) & (dropna_mech['averageweight'] < WEIGHT_MEDIAN)]
coop_games_light.head(10)

### Cooperative Games: 10 Heavy cooperative games

In [None]:
coop_games_heavy=dropna_mech[(dropna_mech['boardgamemechanic'].str.contains('Cooperative',case=False)) & (dropna_mech['averageweight'] >= WEIGHT_MEDIAN)]
coop_games_heavy.head(10)

### Ancient Games Night: Games That Our Ancestors Used to Play

In [None]:
bgg[bgg['yearpublished']<1900].head(10)

### Miniatures Coloring Night: A Night for Learning How to Paint Miniatures

In [None]:
miniatures_games = dropna_cate[dropna_cate['boardgamecategory'].str.contains('miniatures',case=False)]
miniatures_games.head(10)

### Family Games Night: Games Appropriate for the Whole Family

In [None]:
bgg.sort_values(by='Family Game Rank').head(10)

### Trends and Popularity

Having given Mr. Barman game lists for his different nights, I will now look for trends in board games published in recent decades. The goal is to future-proof his inventory and to judge which upcoming or Kickstarter games are likely to succeed in the near future. My approach is to study the past and use it to anticipate the future.

In [None]:
# Checking the distribution of board games by year to identify when the industry really bloomed
games_dist = bgg[bgg['yearpublished'].between(1950, 2019)].groupby('yearpublished')['primary'].count()
games_dist.plot(kind='line', title='Board Games Published by Year (1950-2019)', xlabel='Year', ylabel='Number of Games')

# Observing the plot, it's evident that the modern era of board games began in the 1990s.
# Therefore, we'll focus our analysis on games published from 1990 to 2019 to avoid skewing the data with older games.

#DataFrame subsetting
bgg_relevant = bgg[(bgg['yearpublished'].between(1990, 2019)) & (bgg['usersrated'] > USERSRATED_TOP)]
sum_games_year = bgg_relevant.groupby(by='yearpublished')['primary'].count()
social_games = bgg_relevant[bgg_relevant['maxplayers'] > 4].groupby('yearpublished')['primary'].count()
norm_social_games = social_games.div(sum_games_year, axis='rows')*100
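
To make the normalisation step concrete, here is a small sketch on made-up data. It divides the yearly count of games that support more than 4 players by the yearly count of all games. The `fillna(0)` step is my own addition for years that have no such games.

```python
import pandas as pd

# Made-up games: year of publication and maximum player count
toy = pd.DataFrame({
    'yearpublished': [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002],
    'maxplayers':    [2,    4,    6,    8,    4,    5,    2,    3,    2],
})

# Games per year, and games per year that support more than 4 players
games_per_year = toy.groupby('yearpublished').size()
social_per_year = toy[toy['maxplayers'] > 4].groupby('yearpublished').size()

# div aligns both Series on the year index; years without social games become NaN
share = social_per_year.div(games_per_year).fillna(0) * 100
print(share)
```

Working with shares rather than raw counts matters here because the number of games published per year changes a lot over time.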

In [None]:
# Correlation matrix for the most-rated games (top 10% by number of ratings) published between 1990 and 2019
bgg_modern_corr = bgg_relevant[num_cols].corr()

# Heatmap of the modern games with stylized settings
plt.figure(figsize=(12, 10))
sns.heatmap(bgg_modern_corr, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5, cbar_kws={'shrink': .5})
plt.title('Correlation Heatmap of Modern BGG Games (1990-2019)', fontsize=16, fontweight='bold')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

The strongest relationships in this heatmap are:

1. **Average vs. Year Published**: BGG users tend to give higher ratings to more recently released games.
2. **Board Game Rank vs. Average Weight**: the more complex a game is, the better it tends to rank.
3. **Average vs. Average Weight**: similar to point 2, more complex games tend to receive higher ratings.

These are correlations rather than proof of cause, so I will continue to investigate each of them.
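
Reading the strongest pairs off a heatmap by eye is error-prone. A sketch of how they could be listed programmatically follows. The helper name `strongest_pairs` and the toy columns are hypothetical, and the numbers are made up.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once, without the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    # Sort by absolute value so strong negative correlations are not missed
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up numeric columns: b is exactly twice a, c is only loosely related
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [1, 3, 2, 5, 4],
})
top_pairs = strongest_pairs(toy, n=2)
print(top_pairs)
```

Sorting by absolute value matters for a column like `Board Game Rank`, where a better rank is a smaller number and a strong relationship therefore shows up as a negative correlation.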

In [None]:
# DataFrame.boxplot creates its own figure, so the size is passed in directly
bgg.boxplot(column='average', by='Rating_Label', vert=False, figsize=(10, 6))
plt.title('Average Rating by Weight Label', fontsize=16)
plt.xlabel('Average Rating')
plt.ylabel('Weight Label')
plt.suptitle('')  # Remove the automatic 'Boxplot grouped by Rating_Label' title
plt.show()

The heavier the game, the better it is rated on average, which suggests that the ratings in the DataFrame are skewed towards more complex games.

To check whether games are receiving better ratings over the years, I will plot the yearly rating as a line plot and then as a scatter plot with a trend line. An upward slope would indicate that newer games are indeed rated higher.

In [None]:
# Median of the per-game average rating, by publication year
avg_by_year = bgg_relevant.groupby('yearpublished')['average'].median()
plt.figure(figsize=(10, 6))
avg_by_year.plot(kind='line')
plt.title('Median Rating Per Year', fontsize=16, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Median rating of the year')
plt.xticks(range(1990, 2020), rotation=90)  # the data covers 1990-2019
plt.show()

## Trend Line of Average Rating Over the Years

In [None]:
# Mean rating per year, with a fitted linear trend line
avg_rating_by_year = bgg_relevant.groupby('yearpublished')['average'].mean().reset_index()
plt.figure(figsize=(10, 6))
plt.scatter(avg_rating_by_year['yearpublished'], avg_rating_by_year['average'])
# Least-squares straight line (degree 1) through the yearly means
z = np.polyfit(avg_rating_by_year['yearpublished'], avg_rating_by_year['average'], 1)
p = np.poly1d(z)
plt.title('Average Rating Per Year', fontsize=16, fontweight='bold')
plt.xlabel('Year')
plt.ylabel('Average Rating')
plt.xticks(range(1990, 2020), rotation=90)  # the data covers 1990-2019
plt.plot(avg_rating_by_year['yearpublished'], p(avg_rating_by_year['yearpublished']), 'r--')
plt.show()
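
The first coefficient returned by `np.polyfit` with degree 1 is the slope of the trend line, which reads as the change in rating per year. The sketch below uses made-up yearly means. Centring the years before fitting is my own choice to keep the fit numerically stable; it does not change the slope.

```python
import numpy as np

# Made-up yearly mean ratings that rise by 0.1 per year
years = np.array([2000, 2001, 2002, 2003, 2004])
ratings = np.array([6.0, 6.1, 6.2, 6.3, 6.4])

# Fit a straight line; after centring, the intercept is the rating at the mean year
slope, intercept = np.polyfit(years - years.mean(), ratings, 1)

print(f"Rating change per year: {slope:.2f}")
print(f"Fitted rating at the mean year: {intercept:.2f}")
```

A positive slope supports the idea that ratings improve over time, but it says nothing about why.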

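The heatmap suggests that games are getting more complex over the years. To check this, I plot the yearly share of heavy games (weight at or above the median) among the most-rated games.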

In [None]:
heavy_games_year = bgg_relevant[bgg_relevant['averageweight']>= WEIGHT_MEDIAN].groupby('yearpublished')['primary'].count()
norm_heavy_games = heavy_games_year.div(sum_games_year, axis='rows')*100
plt.figure(figsize=(10, 6)) 
norm_heavy_games.plot(kind='line', ylabel='Share of Complex Games (%)', xlabel='Year Published')
plt.title('Complex Games Over The Years', fontsize=16, fontweight='bold')
plt.show()

In [None]:
# Share of games that support 5 or more players, per publication year
plt.figure(figsize=(10, 6))  # Setting the figure size for better readability
norm_social_games.plot(kind='line')
plt.title('Percentage of Games for 5+ Players Published Per Year', fontsize=16, fontweight='bold')
plt.ylabel('Percentage', fontsize=14)
plt.xlabel('Year Published', fontsize=14)
plt.xticks(range(1990, 2020), rotation=90)
plt.yticks(fontsize=12)
plt.legend(['% of games with 5+ players'], fontsize=12, frameon=False)  # Legend without a frame
plt.grid(True, which='both', linestyle='--', linewidth=0.5)  # Adding grid for better readability
plt.tight_layout()  # Adjusting subplot parameters to give specified padding
plt.show()

## In Conclusion:

1. **Improving Average Ratings Over the Years:** According to the graph and its linear trend line, average game ratings improve noticeably over the years.
2. **Consistently High Complexity:** Among the top 10% most-rated games, the share of complex games does not grow over the years, but it stays high at around 70%.
3. **Stagnation in Party-Game Popularity:** The share of party-like games (supporting five or more players) does not show significant growth over the years.

### Exploring Game Mechanics

Each game in our dataset features a variety of mechanics, defining how the game is played. I will deconstruct them to provide insights into which mechanics are the most popular. 

In [None]:
mech_temp = dropna_mech['boardgamemechanic']
mech_temp = mech_temp.apply(ast.literal_eval)
mech_list = [item for sublist in mech_temp for item in sublist]
mechanics = pd.Series(mech_list).value_counts()
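
The same deconstruction can also be written as one pandas chain with `explode`, which turns each list element into its own row. This is an equivalent sketch on made-up strings shaped like the `boardgamemechanic` column, not a replacement for the cell above.

```python
import ast
import pandas as pd

# Made-up values: each cell is a string that looks like a Python list
raw = pd.Series([
    "['Dice Rolling', 'Hand Management']",
    "['Dice Rolling']",
    None,
])

mechanic_counts = (
    raw.dropna()                # skip games without mechanics
       .map(ast.literal_eval)   # parse each string into a real list
       .explode()               # one row per mechanic
       .value_counts()          # count how often each mechanic appears
)
print(mechanic_counts)
```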

In [None]:
mechanics.head(10)

### Visualizing Game Mechanics with Word Clouds 

Even though textual data is straightforward to read, a **word cloud** offers a more visually engaging way to identify and understand the variety of mechanics present in board games. Let's dive into the visualization to easily spot the most prominent mechanics.

In [None]:
mech_wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(mechanics)
plt.figure(figsize=(10, 5))
plt.imshow(mech_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

### A Pie Chart Describing the Distribution of the Top 15 Mechanics 

In [None]:
mechanics=mechanics.rename(None)
mechanics.head(15).plot(kind='pie', autopct='%1.1f%%', startangle=90, figsize=(6, 6))
plt.title('Mechanics Distributions')
plt.axis('equal')
plt.show()

The most common mechanics in games are **'Dice Rolling'**, **'Hand Management'**, **'Set Collection'**, **'Variable Player Powers'**, **'Hexagon Grid'**, **'Simulation'**, **'Card Drafting'**, **'Tile Placement'**, **'Modular Board'**, and **'Grid Movement'**.

### Exploring Game Categories

Each game in our dataset features a **variety of categories**, defining the *theme* and *flow* of the game. I will **deconstruct** them to provide insights into which categories are the **most popular**.

In [None]:
dropna_cate = bgg.dropna(subset=["boardgamecategory"])
cate_temp=dropna_cate['boardgamecategory']
cate_temp = cate_temp.apply(ast.literal_eval)
cate_temp = [item for sublist in cate_temp for item in sublist]
categories = pd.Series(cate_temp).value_counts()

In [None]:
categories=categories.rename(None)
categories.head(10)

### Visualizing Game Categories with Word Clouds 

In [None]:
cate_wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(categories)
plt.figure(figsize=(10, 5))
plt.imshow(cate_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
categories.head(15).plot(kind='pie', autopct='%1.1f%%', startangle=90, figsize=(6, 6))
plt.title('Categories Distributions')
plt.axis('equal')
plt.show()

The biggest themes/categories in board games, according to our DataFrame, are **'Card Game'**, **'Wargame'**, **'Fantasy'**, **'Party Game'**, **'Dice'**, **'Science Fiction'**, **'Fighting'**, **'Children's Game'**, **'Abstract Strategy'**, and **'Economic'**.

## Conclusion

### Part 1: Data Reliability
In the first part, I ensured the **BGG DataFrame** was reliable to work on. Abnormalities and odd values were carefully addressed, ensuring the data's integrity.

### Part 2: Custom Selection for Mr. Barman
In the second part, I subset the DataFrame to find board games that match **Mr. Barman's requests**. This selection is tailored for the board game nights planned at his business, ensuring a diverse and engaging inventory.

### Part 3: Trends and Popularity in Board Games
Lastly, I showcased trends and elements within the board game world. This analysis aims to predict future trends and highlight the most popular elements in the current board game landscape. This insight is invaluable for understanding the evolving preferences of board game enthusiasts.