<img src="https://devra.ai/analyst/notebook/3028/image.jpg" style="width: 100%; height: auto;" />

<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">  <div style="font-size:150%; color:#FEE100"><b>Board Games Analysis and Prediction</b></div>  <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>

Board games are more than simple pastimes; they generate rich data about strategies, design choices, and player preferences. In this notebook, we explore an extensive board games dataset, examine correlations between game features, and build a simple predictor for average ratings. If you find this analysis useful, please upvote it.

## Table of Contents

- [Data Loading and Exploration](#Data-Loading-and-Exploration)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Board Game Prediction](#Board-Game-Prediction)
- [Future Work and Conclusion](#Future-Work-and-Conclusion)

In [1]:
# Import necessary libraries and suppress warnings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure inline plotting for Jupyter notebooks
# (the non-interactive 'Agg' backend only suits saving figures to files)
%matplotlib inline

# For our prediction tasks
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

print('Libraries imported and inline plotting enabled.')

In [None]:
# Data Loading
csv_file = '/kaggle/input/board-games-dataset-complete-features/boardgame-geek-dataset_organized.csv'
json_file = '/kaggle/input/board-games-dataset-complete-features/boardgamegeek.json'

# Read the organized CSV dataset
df = pd.read_csv(csv_file, encoding='utf-8')

# Read the raw JSON dataset as well (the CSV remains the primary dataset below)
df_json = pd.read_json(json_file)

print('CSV and JSON files loaded successfully.')

## Data Loading and Exploration

Now that the data is loaded, we explore its basic structure. We have two data files; the CSV file is organized for ease of analysis and the JSON provides additional raw information. For this notebook, the CSV serves as our primary dataset.

In [None]:
# Quick look into the data
print('DataFrame shape:', df.shape)
print('Columns:', df.columns.tolist())

# Display first few rows to understand the data layout
df.head()

## Data Cleaning and Preprocessing

In this section, we handle missing values and check that the numeric fields have the expected types. The only date-like field is the release year, so no date parsing is needed. Null values in predictor columns cause scikit-learn estimators such as `LinearRegression` to raise an error at fit time, so rows with missing values in the key columns must be dropped or imputed; here we drop them.

In [None]:
# Inspect missing values and data types
print('Missing values per column:')
print(df.isnull().sum())
print('Data types per column:')
print(df.dtypes)

# For predictor building and most analyses, drop rows with missing critical values
cols_to_check = ['min_players', 'max_players', 'min_playtime', 'max_playtime', 
                 'minimum_age', 'avg_rating', 'num_ratings', 'complexity']
df_clean = df.dropna(subset=cols_to_check)

print('Shape after cleaning:', df_clean.shape)
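
Dropping rows is the simplest option, but imputing is the alternative mentioned above. The following is a minimal sketch of median imputation on a small, made-up table (the `toy` DataFrame and its values are hypothetical and only reuse a few column names from this notebook). Rows without a target are still dropped, since they cannot be used for supervised learning.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the board games table
toy = pd.DataFrame({
    'min_players': [1, 2, np.nan, 2],
    'complexity': [2.5, np.nan, 3.1, 1.8],
    'avg_rating': [7.1, 6.4, np.nan, 8.0],
})

predictor_cols = ['min_players', 'complexity']

# Rows with a missing target cannot be used for training, so drop them
toy_imputed = toy.dropna(subset=['avg_rating']).copy()

# Fill remaining gaps in the predictors with each column's median
medians = toy_imputed[predictor_cols].median()
toy_imputed[predictor_cols] = toy_imputed[predictor_cols].fillna(medians)

print(toy_imputed)
```

For a real model, the medians would ideally be computed on the training split only, so that no information from the test set leaks into the features.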

## Exploratory Data Analysis

Let's dive into some visualizations. We will explore correlations between numeric features, distributions of key variables, and counts for categorical-like variables. These plots help in understanding the relationships between features and may also highlight issues or trends worth investigating further. As always, visualizations can reveal more than the raw numbers; they often expose the beauty (or the mess) within the data.

In [None]:
# Create a subset with only numeric data for correlation analysis
numeric_df = df_clean.select_dtypes(include=[np.number])

if numeric_df.shape[1] >= 4:
    plt.figure(figsize=(12,10))
    corr = numeric_df.corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm")
    plt.title("Correlation Heatmap")
    plt.show()

# Pair Plot for a sample of numeric columns (this can be verbose with many columns)
sample_cols = ['avg_rating', 'num_ratings', 'complexity', 'min_playtime', 'max_playtime']
sns.pairplot(df_clean[sample_cols])
plt.suptitle("Pair Plot of Selected Features", y=1.02)
plt.show()

# Histogram for the complexity distribution
plt.figure(figsize=(8,4))
sns.histplot(df_clean['complexity'].dropna(), kde=True)
plt.title("Distribution of Game Complexity")
plt.show()

# Count plot for minimum players
plt.figure(figsize=(10,4))
sns.countplot(x='min_players', data=df_clean)
plt.title("Distribution of Minimum Players")
plt.show()

# Box plot for average ratings
plt.figure(figsize=(8,4))
sns.boxplot(x=df_clean['avg_rating'])
plt.title("Boxplot of Average Rating")
plt.show()
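
A heatmap shows every pairwise correlation at once, but for the prediction task it can help to rank features by their correlation with the target alone. The sketch below does this with Spearman rank correlation, which is less sensitive to skewed variables than the default Pearson coefficient. It runs on synthetic data (the `toy` frame and the relationship between `complexity` and `avg_rating` are invented for illustration), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: avg_rating is built to depend on complexity only
rng = np.random.default_rng(0)
n = 200
complexity = rng.uniform(1, 5, n)
toy = pd.DataFrame({
    'complexity': complexity,
    'min_players': rng.integers(1, 5, n),
    'avg_rating': 5 + 0.6 * complexity + rng.normal(0, 0.3, n),
})

# Spearman correlation of every feature with the target
corr = toy.corr(method='spearman')['avg_rating'].drop('avg_rating')

# Order features by absolute correlation, strongest first
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

On the actual data, the same two lines could be applied to `numeric_df` with `'avg_rating'` as the target column.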

## Board Game Prediction

To predict the average rating of a board game, we fit a model on selected features. Our intuition is that playtime, number of players, minimum age, and overall complexity may influence how a game is rated; the number of ratings is included as a feature as well. The approach is a simple linear regression, and it is often surprising how much insight an elementary model can give. More sophisticated techniques may follow, but humility (and dry wit) reminds us that the simplest approach is often the best starting point.

In [None]:
# Select features and target for prediction
features = ['min_players', 'max_players', 'min_playtime', 'max_playtime',
            'minimum_age', 'complexity', 'num_ratings']
target = 'avg_rating'

# Ensure we have complete data for the chosen features
df_model = df_clean[features + [target]].dropna()

X = df_model[features]
y = df_model[target]

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a simple Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set and calculate R2 score
predictions = model.predict(X_test)
r2 = r2_score(y_test, predictions)

print('Linear Regression R2 Score:', r2)

# Display a simple scatter plot of actual vs predicted ratings
plt.figure(figsize=(8,6))
plt.scatter(y_test, predictions, alpha=0.5)
plt.xlabel('Actual Average Rating')
plt.ylabel('Predicted Average Rating')
plt.title('Actual vs Predicted Average Rating')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.show()
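
An R2 score is easier to interpret next to a trivial baseline and an error expressed in rating points. The following sketch, run on synthetic data (`X_toy` and `y_toy` are invented and unrelated to the real dataset), compares a mean-predicting `DummyRegressor` with `LinearRegression` and reports R2, MAE, and RMSE for both. It is an illustration of the evaluation pattern, not a result for this notebook's model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target depends linearly on the first feature
rng = np.random.default_rng(42)
X_toy = rng.uniform(1, 5, size=(300, 2))
y_toy = 5 + 0.5 * X_toy[:, 0] + rng.normal(0, 0.4, 300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

results = {}
estimators = [
    ('mean baseline', DummyRegressor(strategy='mean')),
    ('linear regression', LinearRegression()),
]
for name, est in estimators:
    est.fit(X_tr, y_tr)
    pred = est.predict(X_te)
    results[name] = {
        'r2': r2_score(y_te, pred),
        'mae': mean_absolute_error(y_te, pred),
        # RMSE is in the same units as the target (rating points)
        'rmse': float(np.sqrt(mean_squared_error(y_te, pred))),
    }
    print(name, results[name])
```

With the real data, the same loop could be run on `X_train`, `X_test`, `y_train`, and `y_test` from the cell above; a linear model that barely beats the mean baseline would suggest the chosen features carry little linear signal.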

## Future Work and Conclusion

This notebook provided an exploratory analysis of a complex board games dataset. We cleaned the data, visualized relationships between various features using heatmaps, pair plots, histograms, and box plots, and built a simple predictor for average ratings. The use of a linear regression model was a first step; future work could involve more advanced algorithms, feature engineering, and text analytics from game descriptions.
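
As one possible direction for the feature engineering mentioned above, the sketch below log-transforms a heavily skewed count feature and compares cross-validated R2 before and after. The data is synthetic and deliberately constructed so that the target depends on the logarithm of the count; whether `num_ratings` behaves this way in the real dataset is an assumption that would need to be checked.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in: a skewed count and a rating that depends on its log
rng = np.random.default_rng(7)
n = 400
num_ratings = rng.lognormal(mean=5, sigma=2, size=n)
avg_rating = 5 + 0.3 * np.log1p(num_ratings) + rng.normal(0, 0.3, n)

X_raw = num_ratings.reshape(-1, 1)            # feature as-is
X_log = np.log1p(num_ratings).reshape(-1, 1)  # log-transformed feature

# 5-fold cross-validation with shuffling for a more stable estimate
cv = KFold(n_splits=5, shuffle=True, random_state=0)
score_raw = cross_val_score(
    LinearRegression(), X_raw, avg_rating, cv=cv, scoring='r2'
).mean()
score_log = cross_val_score(
    LinearRegression(), X_log, avg_rating, cv=cv, scoring='r2'
).mean()

print('Mean CV R2, raw feature:', score_raw)
print('Mean CV R2, log feature:', score_log)
```

Cross-validation is used here instead of a single split so that the comparison does not hinge on one particular partition of the data.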

In summary, our exploratory approach proved valuable in uncovering insights and opened avenues for further investigation. As with any data analysis, a balance between sophisticated modeling and interpretability is key. If you enjoyed this analysis, upvote to show your appreciation.