*Practical Data Science 19/20*
# Programming Assignment 2 - Predicting Video Game Sales

In this programming assignment you will apply your new (or refreshed) machine learning knowledge. You will create a modeling pipeline that trains and evaluates a machine learning model built on several numeric and categorical features.

## Introduction and Dataset

You are provided with a dataset containing a list of video games with sales greater than 100,000 copies. Your task is to build a model that predicts the yearly global sales (column ``Global_Sales``) of a video game using the available features.

To help you get started, the following blocks of code import the dataset using pandas: 

In [3]:
import pandas as pd

In [7]:
data_path = 'https://raw.githubusercontent.com/pds2021/course/main/assignments/Data/02/video_game_sales.csv'
game_sales_data = pd.read_csv(data_path)
game_sales_data.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Rating
0,Wii Sports,Wii,2006.0,Sports,82.53,76.0,51.0,8.0,322.0,E
1,Super Mario Bros.,NES,1985.0,Platform,40.24,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,35.52,82.0,73.0,8.3,709.0,E
3,Wii Sports Resort,Wii,2009.0,Sports,32.77,80.0,73.0,8.0,192.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,31.37,,,,,


In [20]:
game_sales_data['Name'].value_counts()

Need for Speed: Most Wanted                 12
Madden NFL 07                                9
LEGO Marvel Super Heroes                     9
FIFA 14                                      9
Ratatouille                                  9
                                            ..
Tales of the World: Radiant Mythology 2      1
To Heart                                     1
Wu-Tang: Shaolin Style                       1
Sunrise Eiyuutan R                           1
Teenage Mutant Ninja Turtles Double Pack     1
Name: Name, Length: 11556, dtype: int64

## Splitting the Dataset

Before you can get started training a machine learning model you will have to split the dataframe into features and the target variable (try to use as many features as possible):

In [21]:
# Write your code here
y_data = game_sales_data['Global_Sales']
x_data = game_sales_data[['Platform', 'Year_of_Release', 'Genre', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count', 'Rating', 'Name']]

Next, you will have to create a train-test split in order to be able to evaluate your models. Use 80\% of the data for training and 20\% for evaluation (take a look at the sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to identify the relevant parameters):

In [22]:
# Write your code here
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, shuffle=False)
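
With `shuffle=False` the split keeps the file order, so the last 20% of the rows become the test set. The preview above suggests that the rows are ordered by `Global_Sales`; if that holds for the whole file, the test set would contain only the lowest-selling games. A minimal sketch of a shuffled, reproducible split on a small made-up frame (the toy data and `random_state=42` are assumptions, not part of the assignment):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data, sorted by sales like the preview above
toy = pd.DataFrame({
    "Critic_Score": [90, 85, 80, 75, 70, 65, 60, 55, 50, 45],
    "Global_Sales": [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
})
toy_x = toy[["Critic_Score"]]
toy_y = toy["Global_Sales"]

# shuffle=True is the default; random_state makes the split reproducible
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)
print(len(toy_x_train), len(toy_x_test))
```

Features and target are shuffled together, so the row labels of `toy_x_train` and `toy_y_train` still match after the split.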

## Removing missing values
If you inspect your training data you will find that some of the variables have missing values. Use the ``SimpleImputer`` to replace missing values in numerical columns with the column mean and missing values in categorical columns with the most frequent value (take a look at the SimpleImputer [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to identify the relevant parameters). You can decide if you want to use the simple or the advanced imputation strategy (or just try both).
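
One way to inspect the missing values is to count them per column. A minimal sketch on a made-up frame (the `toy_` names and values are for illustration only):

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    "Genre": ["Sports", "Platform", None],
    "Critic_Score": [76.0, np.nan, 82.0],
    "User_Score": [8.0, np.nan, np.nan],
})

# Number of missing entries per column
toy_missing = toy_train.isnull().sum()
print(toy_missing)
```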

In [23]:
# split training data into categorical and numeric columns (https://stackoverflow.com/questions/55124655/imputing-only-the-numerical-values-using-sci-kit-learn)

categorical_columns = []
numeric_columns = []
for c in x_train.columns:
    if x_train[c].map(type).eq(str).any():  # check whether the training column contains any strings
        categorical_columns.append(c)
    else:
        numeric_columns.append(c)

#create two DataFrames, one for each data type
data_numeric = x_train[numeric_columns]
data_categorical = pd.DataFrame(x_train[categorical_columns])
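
pandas can also split the columns by dtype, which avoids checking every value. A minimal sketch on a made-up frame, assuming that every non-numeric column should be treated as categorical (the `toy_` names are for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Platform": ["Wii", "NES", None],
    "Year_of_Release": [2006.0, 1985.0, None],
    "Critic_Score": [76.0, None, 82.0],
})

# Numeric dtypes go to one list, everything else is treated as categorical
toy_numeric_columns = toy.select_dtypes(include="number").columns.tolist()
toy_categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(toy_numeric_columns)
print(toy_categorical_columns)
```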

In [34]:
import numpy as np
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp2 = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
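# Impute numeric columns with the column mean; keep the original row index so the rows stay aligned with y_train
data_numeric = pd.DataFrame(imp.fit_transform(data_numeric), columns=data_numeric.columns, index=data_numeric.index)
# Impute categorical columns with the most frequent value
data_categorical = pd.DataFrame(imp2.fit_transform(data_categorical), columns=data_categorical.columns, index=data_categorical.index)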

In [32]:
data_numeric.head(3)

Unnamed: 0,Year_of_Release,Critic_Score,Critic_Count,User_Score,User_Count
0,2006.0,76.0,51.0,8.0,322.0
1,1985.0,69.624125,27.660789,7.151185,170.87802
2,2008.0,82.0,73.0,8.3,709.0


In [36]:
data_categorical.head(3)

Unnamed: 0,Platform,Genre,Rating,Name
0,Wii,Sports,E,Wii Sports
1,NES,Platform,E,Super Mario Bros.
2,Wii,Racing,E,Mario Kart Wii


In [38]:
data_numeric.isnull().sum().sum()

0

In [39]:
data_categorical.isnull().sum().sum()

0

In [40]:
# Join imputed data 
x_train = pd.concat([data_numeric, data_categorical], axis = 1)
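
The imputers above are fitted on the training data only. The test features need the same replacement values, which means calling `transform` (not `fit_transform`) on them. A minimal sketch with toy frames; the `toy_` names and values are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_train = pd.DataFrame({"Critic_Score": [80.0, 60.0, np.nan],
                          "Rating": ["E", "E", "T"]})
toy_test = pd.DataFrame({"Critic_Score": [np.nan, 90.0],
                         "Rating": [np.nan, "M"]})

num_imp = SimpleImputer(strategy="mean")
cat_imp = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training data only
num_imp.fit(toy_train[["Critic_Score"]])
cat_imp.fit(toy_train[["Rating"]])

# Reuse them on the test data; passing index= keeps the original row labels
toy_test_imputed = pd.concat([
    pd.DataFrame(num_imp.transform(toy_test[["Critic_Score"]]),
                 columns=["Critic_Score"], index=toy_test.index),
    pd.DataFrame(cat_imp.transform(toy_test[["Rating"]]),
                 columns=["Rating"], index=toy_test.index),
], axis=1)
print(toy_test_imputed)
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.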

## Encoding categorical variables

Prior to training your model you will have to encode the categorical variables. Inspect all categorical variables and use the ``LabelEncoder`` or the ``OneHotEncoder`` where appropriate. Remember that you have to combine the numerical, the label-encoded and the one-hot-encoded dataframes at the end.

In [41]:
data_categorical

Unnamed: 0,Platform,Genre,Rating,Name
0,Wii,Sports,E,Wii Sports
1,NES,Platform,E,Super Mario Bros.
2,Wii,Racing,E,Mario Kart Wii
3,Wii,Sports,E,Wii Sports Resort
4,GB,Role-Playing,E,Pokemon Red/Pokemon Blue
...,...,...,...,...
13363,X360,Racing,E,Test Drive: Ferrari Legends
13364,Wii,Simulation,E,Sushi Go-Round
13365,XB,Action,T,NightCaster II: Equinox
13366,PSV,Adventure,E,Shin Hayarigami


In [48]:
# Number of unique values per categorical feature

# Platform has no natural order, but its number of categories is quite large -> Label Encoding (the integer codes are arbitrary)
data_categorical['Platform'].nunique()


29

In [45]:
# Genre has only a few categories -> One-Hot Encoding
data_categorical['Genre'].nunique()

12

In [46]:
# Rating has only a few categories -> One-Hot Encoding
data_categorical['Rating'].nunique()

7

In [49]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

LabelEncoder

In [54]:
# Create a LabelEncoder instance (imported above) and fit it on the platform names
le = LabelEncoder()
le.fit(data_categorical['Platform'])

LabelEncoder()

In [56]:
le.classes_

array(['2600', '3DO', '3DS', 'DC', 'DS', 'GB', 'GBA', 'GC', 'GEN', 'N64',
       'NES', 'NG', 'PC', 'PS', 'PS2', 'PS3', 'PS4', 'PSP', 'PSV', 'SAT',
       'SCD', 'SNES', 'TG16', 'WS', 'Wii', 'WiiU', 'X360', 'XB', 'XOne'],
      dtype=object)

In [58]:
le.transform(data_categorical['Platform'])

array([24, 10, 24, ..., 27, 18,  4])

In [60]:
# Replace the platform names with their integer codes (the encoder is already fitted)
data_categorical['Platform'] = le.transform(data_categorical['Platform'])
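
scikit-learn documents `LabelEncoder` as an encoder for target labels, and its `transform` raises an error for a value it did not see during `fit`, which can happen when a platform occurs only in the test data. `OrdinalEncoder` produces the same kind of integer codes for feature columns and can map unseen values to a placeholder. A minimal sketch, assuming scikit-learn 0.24 or newer (where `handle_unknown='use_encoded_value'` is available); the toy values are made up:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({"Platform": ["Wii", "NES", "GB", "Wii"]})
toy_test = pd.DataFrame({"Platform": ["NES", "PS5"]})  # "PS5" is unseen

# Unseen categories are encoded as -1 instead of raising an error
oe = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
oe.fit(toy_train[["Platform"]])

toy_codes = oe.transform(toy_test[["Platform"]]).ravel()
print(oe.categories_[0])  # categories learned from the training data
print(toy_codes)
```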

One-Hot-Encoder

In [82]:
# creating instance of one-hot-encoder
enc = OneHotEncoder(handle_unknown='ignore')

# Fitting
enc.fit(data_categorical[['Genre']])

onehotlabels = enc.transform(data_categorical[['Genre']]).toarray()

In [83]:
onehotlabels

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
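
The one-hot matrix still has to be turned back into a DataFrame and joined with the other columns, as the task description asks. A minimal sketch on toy data; the column names are built from `categories_`, and the `toy_` frames are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_numeric = pd.DataFrame({"Critic_Score": [76.0, 70.0, 82.0]})
toy_label_encoded = pd.DataFrame({"Platform": [24, 10, 24]})
toy_categorical = pd.DataFrame({"Genre": ["Sports", "Platform", "Racing"]})

toy_enc = OneHotEncoder(handle_unknown="ignore")
toy_onehot = toy_enc.fit_transform(toy_categorical[["Genre"]]).toarray()

# One column per category, named e.g. "Genre_Sports"
toy_onehot_columns = ["Genre_" + str(c) for c in toy_enc.categories_[0]]
toy_onehot_df = pd.DataFrame(toy_onehot, columns=toy_onehot_columns,
                             index=toy_categorical.index)

# Join numeric, label-encoded and one-hot-encoded parts column-wise
toy_features = pd.concat([toy_numeric, toy_label_encoded, toy_onehot_df], axis=1)
print(toy_features.columns.tolist())
```

The same steps would apply to `Rating`, and the original string columns are left out of the final frame because the models need numeric input.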

## Train the Model

Now our dataset should be ready and we can train a predictive model. Train a Decision Tree as well as a Random Forest and compare the in-sample and out-of-sample performance of both models using the mean absolute error.

In [None]:
# Write your code here
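
A minimal sketch of the comparison, using randomly generated stand-in data because the fully encoded features are not assembled above. The `toy_` arrays, the hyperparameters and the `random_state` values are assumptions; with the real data, the encoded training and test frames would take their place:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the encoded features
rng = np.random.RandomState(0)
toy_X = rng.rand(200, 4)
toy_y = 3 * toy_X[:, 0] + rng.normal(scale=0.5, size=200)
toy_X_train, toy_X_test = toy_X[:160], toy_X[160:]
toy_y_train, toy_y_test = toy_y[:160], toy_y[160:]

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(toy_X_train, toy_y_train)
    # In-sample error uses the training data, out-of-sample the held-out data
    mae_in = mean_absolute_error(toy_y_train, model.predict(toy_X_train))
    mae_out = mean_absolute_error(toy_y_test, model.predict(toy_X_test))
    results[name] = (mae_in, mae_out)
    print(f"{name}: in-sample MAE {mae_in:.3f}, out-of-sample MAE {mae_out:.3f}")
```

A fully grown decision tree usually reproduces its training targets almost exactly, so a large gap between its in-sample and out-of-sample error is a sign of overfitting.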