# SI618 Project 3 Classification

# Project Title: World of Board Games: What decides their popularity and ratings?

## Import data

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import ttest_ind

In [2]:
# import processed data from project 1
ready_data = pd.read_pickle('board_games.pkl')

In [3]:
ready_data['rate_cat'] = pd.qcut(ready_data.average, [0, 0.33333, 0.6666666, 1], ['low', 'medium', 'high'])

In [4]:
ready_data.rate_cat

0          high
1          high
2          high
3          high
4          high
          ...  
21626      high
21627    medium
21628       low
21629       low
21630      high
Name: rate_cat, Length: 21631, dtype: category
Categories (3, object): ['low' < 'medium' < 'high']
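
As a hedged illustration of what `pd.qcut` does in the cell above, the sketch below bins a small made-up series into terciles. The name `toy_ratings` and its values are assumptions for illustration only, not data from `board_games.pkl`; passing `q=3` expresses the same intent as the explicit quantile edges.

```python
import pandas as pd

# Hypothetical ratings, for illustration only (not from board_games.pkl)
toy_ratings = pd.Series([5.1, 5.8, 6.0, 6.4, 6.9, 7.2, 7.5, 7.9, 8.3])

# q=3 asks for three equal-frequency bins (terciles)
toy_cat = pd.qcut(toy_ratings, q=3, labels=['low', 'medium', 'high'])

# Each bin should hold roughly one third of the observations
counts = toy_cat.value_counts().sort_index()
print(counts)
```

Because the bins are defined by quantiles rather than fixed rating thresholds, the three classes come out approximately balanced, which simplifies the later choice of evaluation metrics.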

In [5]:
ready_data.shape

(21631, 41)

In [6]:
ready_data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| num_x | 105 | 189 | 428 | 72 | 103 |
| id | 30549 | 822 | 13 | 68448 | 36218 |
| name | Pandemic | Carcassonne | Catan | 7 Wonders | Dominion |
| year | 2008 | 2000 | 1995 | 2010 | 2008 |
| rank | 106 | 190 | 429 | 73 | 104 |
| average | 7.59 | 7.42 | 7.14 | 7.74 | 7.61 |
| bayes_average | 7.487 | 7.309 | 6.97 | 7.634 | 7.499 |
| users_rated | 108975 | 108738 | 108024 | 89982 | 81561 |
| url | /boardgame/30549/pandemic | /boardgame/822/carcassonne | /boardgame/13/catan | /boardgame/68448/7-wonders | /boardgame/36218/dominion |
| thumbnail | https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXL... | https://cf.geekdo-images.com/okM0dq_bEXnbyQTOv... | https://cf.geekdo-images.com/W3Bsga_uLP9kO91gZ... | https://cf.geekdo-images.com/RvFVTEpnbb4NM7k0I... | https://cf.geekdo-images.com/j6iQpZ4XkemZP07HN... |


In [7]:
ready_data.isna().sum()

num_x                   0
id                      0
name                    0
year                    0
rank                    0
average                 0
bayes_average           0
users_rated             0
url                     0
thumbnail               0
num_y                   0
primary                 0
description             0
yearpublished           0
minplayers              0
maxplayers              0
playingtime             0
minplaytime             0
maxplaytime             0
minage                  0
owned                   0
trading                 0
wanting                 0
wishing                 0
category_list           0
mechanic_list           0
family_list             0
expansion_list          0
implementation_list     0
artist_list             0
designer_list           0
publisher_list          0
category_count          0
family_count            0
mechanic_count          0
expansion_count         0
implementation_count    0
designer_count          0
publisher_co

- Preprocessing
  - import data
  - Missing value: None
  - outliers: discard or impute
  - delete unnecessary columns (id, url, description)
  - combine similar columns (players, playtime, own/wish/want/trade/rate)
  - list:
    - use list_count
    - choose top x categories and make dummy variables
  - pipeline: scaling, standardize, dummy variable
  - PCA (optional)
- Classification
  - train_test_split
  - try all candidate models: logistic regression (LR), decision tree, random forest (RF), XGBoost, SVM
  - choose evaluation metrics: precision, recall, F1, accuracy, AUC (take care: the target has 3 classes, so these need multi-class averaging)
  - cross-validation to find optimal hyper-parameters (fine-tuning)
  - interpretation
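
The classification steps in this outline can be sketched as follows. This is only a minimal sketch under stated assumptions: the data is synthetic (from `make_classification`), logistic regression stands in for any of the candidate models, and the parameter grid is an arbitrary illustration rather than a recommended search space. It shows one way to handle the three-class target, using macro-averaged F1 and a one-vs-rest AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the prepared board game features
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Cross-validated search over one hyper-parameter (the grid is illustrative)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    scoring='f1_macro',
    cv=5,
)
search.fit(X_train, y_train)

pred = search.predict(X_test)
proba = search.predict_proba(X_test)

# Macro averaging weights the three classes equally
macro_f1 = f1_score(y_test, pred, average='macro')
# Multi-class AUC needs class probabilities and an explicit strategy
ovr_auc = roc_auc_score(y_test, proba, multi_class='ovr')

print(search.best_params_)
print(classification_report(y_test, pred))
print(f'macro F1 = {macro_f1:.3f}, one-vs-rest AUC = {ovr_auc:.3f}')
```

Macro averaging is a reasonable default here because the tercile-based classes are roughly balanced; with imbalanced classes, weighted averaging might be preferable.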

## Data Processing

We first discard the `num_x`, `num_y`, `id`, `name`, `primary`, `url`, `thumbnail` and `description` columns, as they are purely identifiers or free-text information about the entries and are not useful for classification.

The list-valued columns hold various attributes of each game and must be reduced to single values before they can be used for classification. Some, such as `family`, `expansion` and `implementation`, are largely unique to each game, so we discard the lists right away. Others, such as `artist`, `designer` and `publisher`, do share values across games, but part 2 showed that there are too many distinct values and that they are spread thinly across the games; encoding them would require too many dummy variables, so we leave them out. For `category` and `mechanic`, the most common types make up a significant portion of the games, so we can encode them as dummy variables. We keep the count columns for these lists, which can represent the effort put into making the game, the complexity of the game and the number of works derived from it.
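
To make the dummy-variable idea concrete, here is a minimal sketch that picks the most common categories from a list-valued column and builds one indicator column for each. The toy frame, the `top_k` value and the `is_...` naming scheme are assumptions for illustration, and the sketch assumes each entry is a Python list of strings.

```python
import pandas as pd
from collections import Counter

# Hypothetical toy frame; the real column is `category_list` in `ready_data`
toy = pd.DataFrame({
    'category_list': [
        ['Card Game', 'Fantasy'],
        ['Wargame'],
        ['Card Game', 'Economic'],
        ['Fantasy', 'Card Game'],
    ]
})

# Count how often each category appears across all games
counts = Counter(cat for cats in toy['category_list'] for cat in cats)

# Keep the top_k most common categories (top_k is an illustrative choice)
top_k = 2
top_categories = [cat for cat, _ in counts.most_common(top_k)]

# One boolean indicator column per retained category
for cat in top_categories:
    col = 'is_' + cat.lower().replace(' ', '')
    toy[col] = toy['category_list'].apply(lambda cats, c=cat: c in cats)

# Share of games covered by each retained category
print(toy.drop(columns='category_list').mean())
```

The printed shares give a quick check on whether a category is common enough to justify its own dummy variable.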

In part 2 we also explored the relationships among the variables that represent game rating, game popularity, player count and playing time. Based on those conclusions, we use `average` to represent the game rating; it is our target variable, and we classify it into three categories in the `rate_cat` variable. For ownership and popularity, the variables are highly correlated with one another, so we choose `owned` to represent ownership and `wishing` to represent the desire to own. The minimum and maximum playing times are also highly correlated, so we use `playingtime` (the maximum) alone to represent playing time. Because some games have extreme playing times, we cap any playing time above 180 minutes at 180.

For player counts, the minimum and maximum values are not as strongly correlated, so we consider them separately. As they take only a few discrete values, we bin them: `minplayers` into '1', '2' and '3 and above', and `maxplayers` into '2 and below', '3-4', '5-7' and '8 and above'.
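
The capping and binning rules described here can also be written with vectorized pandas calls instead of row-wise functions. The following is a minimal sketch on made-up values; the toy frame is an assumption, and `pd.cut` with infinite outer edges is one possible way to express the bins, not necessarily the approach used in this notebook.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    'playingtime': [30, 90, 180, 600],
    'minplayers': [1, 2, 3, 4],
    'maxplayers': [2, 4, 6, 10],
})

# Cap playing time at 180 minutes
toy['playingtime'] = toy['playingtime'].clip(upper=180)

# Bin minplayers into '1', '2', '3 and above' (right-closed intervals)
toy['minplayers_cat'] = pd.cut(
    toy['minplayers'],
    bins=[-float('inf'), 1, 2, float('inf')],
    labels=['1', '2', '3 and above'],
)

# Bin maxplayers into '2 and below', '3-4', '5-7', '8 and above'
toy['maxplayers_cat'] = pd.cut(
    toy['maxplayers'],
    bins=[-float('inf'), 2, 4, 7, float('inf')],
    labels=['2 and below', '3-4', '5-7', '8 and above'],
)
print(toy)
```

`pd.cut` returns an ordered categorical, which keeps the natural ordering of the bins available for later plots or ordinal encodings.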

In [8]:
core_data = ready_data.drop(columns=['num_x', 'num_y', 'id', 'name', 'primary', 'url', 'thumbnail', 'description'])
core_data.drop(columns=['family_list', 'expansion_list', 'implementation_list', 'artist_list', 'designer_list', 'publisher_list'], inplace=True)
core_data.drop(columns=['year', 'rank', 'average', 'bayes_average', 'users_rated', 'minplaytime', 'maxplaytime', 'trading', 'wanting'], inplace=True)

In [9]:
core_data['is_cardgame'] = core_data['category_list'].apply(lambda x: 'Card Game' in x)
core_data['is_dicerolling'] = core_data['mechanic_list'].apply(lambda x: 'Dice Rolling' in x)
core_data.drop(columns=['category_list', 'mechanic_list'], inplace=True)

In [10]:
def minplayers_cat(x):
    if x < 2:
        return '1'
    elif x == 2:
        return '2'
    else:
        return '3 and above'
 
core_data['minplayers_cat'] = core_data['minplayers'].apply(minplayers_cat)
core_data.drop(columns=['minplayers'], inplace=True)

In [11]:
def maxplayers_cat(x):
    if x <= 2:
        return '2 and below'
    elif x <= 4:
        return '3-4'
    elif x <= 7:
        return '5-7'
    else:
        return '8 and above'
 
core_data['maxplayers_cat'] = core_data['maxplayers'].apply(maxplayers_cat)
core_data.drop(columns=['maxplayers'], inplace=True)

In [12]:
# cap playing time at 180 minutes
core_data['playingtime'] = core_data['playingtime'].clip(upper=180)

In [13]:
core_data.head().T

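| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| yearpublished | 2008 | 2000 | 1995 | 2010 | 2008 |
| playingtime | 45 | 45 | 120 | 30 | 30 |
| minage | 8 | 7 | 10 | 10 | 13 |
| owned | 168364 | 161299 | 167733 | 120466 | 106956 |
| wishing | 9344 | 7383 | 5890 | 12105 | 8621 |
| category_count | 1 | 3 | 2 | 5 | 2 |
| family_count | 9 | 11 | 5 | 7 | 3 |
| mechanic_count | 7 | 3 | 9 | 5 | 5 |
| expansion_count | 7 | 155 | 87 | 16 | 46 |
| implementation_count | 11 | 17 | 29 | 3 | 2 |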


Now that we have our preprocessed data, we apply a stratified train-test split, then scale the numerical variables and one-hot encode the categorical ones.

In [14]:
from sklearn.model_selection import StratifiedShuffleSplit

# stratify on rate_cat so both subsets keep the same class proportions
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(core_data, core_data['rate_cat']):
    # split() yields positional indices, so pass them to .iloc directly
    train_set = core_data.iloc[train_index]
    test_set = core_data.iloc[test_index]

In [15]:
train_X = train_set.drop('rate_cat',axis=1)
train_y = train_set['rate_cat'].copy()
test_X = test_set.drop('rate_cat',axis=1)
test_y = test_set['rate_cat'].copy()

In [16]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
num_attribs = list(train_X.select_dtypes(include=[np.number]))
cat_attribs = list(train_X.select_dtypes(exclude=[np.number]))

full_pipeline = ColumnTransformer([
        ("num", StandardScaler(), num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

train_X_prepared = full_pipeline.fit_transform(train_X)
test_X_prepared = full_pipeline.transform(test_X)
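
As a sanity check on the split-then-transform pattern, the sketch below repeats it on a made-up frame and confirms that the class proportions and output shapes come out as expected. The toy data is an assumption, `train_test_split(..., stratify=y)` is swapped in as a compact stand-in for `StratifiedShuffleSplit`, and `handle_unknown='ignore'` is an optional safeguard added here for categories that might appear only in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with a balanced three-class target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'owned': rng.integers(0, 1000, size=300),
    'maxplayers_cat': rng.choice(['2 and below', '3-4', '5-7'], size=300),
    'rate_cat': np.tile(['low', 'medium', 'high'], 100),
})

X = toy.drop(columns='rate_cat')
y = toy['rate_cat']

# stratify=y keeps the class proportions equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

prep = ColumnTransformer([
    ('num', StandardScaler(), ['owned']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['maxplayers_cat']),
])

# Fit on the training set only, then reuse the fitted statistics on the test set
X_train_prepared = prep.fit_transform(X_train)
X_test_prepared = prep.transform(X_test)

print(y_test.value_counts())
print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the transformer on the training set alone avoids leaking test-set statistics (means, variances, category levels) into the model.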

## Classification task

## Reference