
*Practical Data Science 21/22*
# Programming Assignment 2 - Predicting Video Game Sales

In this programming assignment you need to apply your new (or refreshed) machine learning knowledge. You will create a modeling pipeline that trains and evaluates a machine learning model built on several numeric as well as categorical features.

## Introduction and Dataset

You are provided with a dataset containing a list of video games with sales greater than 100,000 copies. Your task is to build a model that predicts the yearly global sales (column ``Global_Sales``) of a video game from the available features.

To help you get started, the following blocks of code import the dataset using pandas: 

In [1]:
import pandas as pd

In [2]:
data_path = 'https://raw.githubusercontent.com/NikoStein/pds_data/main/data/video_game_sales.csv'
game_sales_data = pd.read_csv(data_path)
game_sales_data.head() 

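|   | Name | Platform | Year_of_Release | Genre | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Rating |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | 40.24 | NaN | NaN | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | 31.37 | NaN | NaN | NaN | NaN | NaN |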


In [3]:
game_sales_data.describe()

|   | Year_of_Release | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count |
|---|---|---|---|---|---|---|
| count | 16442.0 | 16711.0 | 8130.0 | 8130.0 | 10007.0 | 7585.0 |
| mean | 2006.486437 | 0.533713 | 68.976015 | 26.358549 | 7.126238 | 162.277521 |
| std | 5.87973 | 1.548282 | 13.935162 | 18.978236 | 1.30619 | 561.459579 |
| min | 1980.0 | 0.01 | 13.0 | 3.0 | 0.0 | 4.0 |
| 25% | 2003.0 | 0.06 | 60.0 | 12.0 | 6.8 | 10.0 |
| 50% | 2007.0 | 0.17 | 71.0 | 21.0 | 7.13 | 24.0 |
| 75% | 2010.0 | 0.47 | 79.0 | 36.0 | 8.0 | 81.0 |
| max | 2020.0 | 82.53 | 98.0 | 113.0 | 9.7 | 10665.0 |


In [4]:
game_sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16711 entries, 0 to 16710
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16709 non-null  object 
 1   Platform         16711 non-null  object 
 2   Year_of_Release  16442 non-null  float64
 3   Genre            16709 non-null  object 
 4   Global_Sales     16711 non-null  float64
 5   Critic_Score     8130 non-null   float64
 6   Critic_Count     8130 non-null   float64
 7   User_Score       10007 non-null  float64
 8   User_Count       7585 non-null   float64
 9   Rating           9942 non-null   object 
dtypes: float64(6), object(4)
memory usage: 1.3+ MB


## Splitting the Dataset

Before you can get started training a machine learning model you will have to split the dataframe into features and the target variable (try to use as many features as possible):

In [5]:
y = game_sales_data['Global_Sales']

In [6]:
games_features = ['Year_of_Release','Critic_Score','User_Score','Rating','Genre','Platform']

In [7]:
X = game_sales_data[games_features]
X

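|   | Year_of_Release | Critic_Score | User_Score | Rating | Genre | Platform |
|---|---|---|---|---|---|---|
| 0 | 2006.0 | 76.0 | 8.0 | E | Sports | Wii |
| 1 | 1985.0 | NaN | NaN | NaN | Platform | NES |
| 2 | 2008.0 | 82.0 | 8.3 | E | Racing | Wii |
| 3 | 2009.0 | 80.0 | 8.0 | E | Sports | Wii |
| 4 | 1996.0 | NaN | NaN | NaN | Role-Playing | GB |
| ... | ... | ... | ... | ... | ... | ... |
| 16706 | 2016.0 | NaN | NaN | NaN | Action | PS3 |
| 16707 | 2006.0 | NaN | NaN | NaN | Sports | X360 |
| 16708 | 2016.0 | NaN | NaN | NaN | Adventure | PSV |
| 16709 | 2003.0 | NaN | NaN | NaN | Platform | GBA |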


Next, you will have to create a train-test split in order to be able to evaluate your models. Use 80\% of the data for training and 20\% for evaluation (take a look at the sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to identify the relevant parameters):

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
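
The effect of the split parameters can be checked on a small example. The following is a minimal sketch on made-up toy data (the names `toy_X` and `toy_y` are hypothetical and not part of the assignment); it only illustrates that an 80/20 split keeps features and target aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real feature matrix and target (values are made up)
toy_X = pd.DataFrame({"feature": range(10)})
toy_y = pd.Series(range(10), name="target")

# 80/20 split; a fixed random_state makes the shuffle reproducible
toy_X_train, toy_X_valid, toy_y_train, toy_y_valid = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, random_state=0
)

print(len(toy_X_train), len(toy_X_valid))  # 8 2

# Features and target keep the same row index after the split
print(toy_X_train.index.equals(toy_y_train.index))  # True
```

Because the original row index is preserved, imputed or encoded frames can later be rebuilt with `index=X_train.index` and concatenated without misaligning rows.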

In [10]:
from sklearn.metrics import mean_absolute_error

## Removing missing values
If you inspect your training data you will find that some of the variables have missing values. Use the ``SimpleImputer`` to replace missing values in numerical columns with the column mean and missing values in categorical columns with the most frequent value (take a look at the SimpleImputer [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to identify the relevant parameters). You can decide if you want to use the simple or the advanced imputation strategy (or just try both).

In [12]:
from sklearn.impute import SimpleImputer

In [13]:
numCols = [cname for cname in X_train.columns if X_train[cname].dtype != "object"]
numCols

['Year_of_Release', 'Critic_Score', 'User_Score']

In [14]:
objCols = [cname for cname in X_train.columns if X_train[cname].dtype == "object"]
objCols

['Rating', 'Genre', 'Platform']

In [15]:
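# Numerical columns: replace missing values with the column mean.
# The imputer is fitted on the training data only and then applied to both splits.
simple_imputer = SimpleImputer(strategy='mean')

X_train_num = pd.DataFrame(simple_imputer.fit_transform(X_train[numCols]), columns=numCols, index=X_train.index)
X_valid_num = pd.DataFrame(simple_imputer.transform(X_valid[numCols]), columns=numCols, index=X_valid.index)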



In [16]:
simple_imputer = SimpleImputer(strategy='most_frequent')
X_train_cat = pd.DataFrame(simple_imputer.fit_transform(X_train[objCols]), columns=objCols, index=X_train.index)
X_valid_cat = pd.DataFrame(simple_imputer.transform(X_valid[objCols]), columns=objCols, index=X_valid.index)
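
The text above leaves open what the "advanced" imputation strategy is. One common reading, assumed here, is to impute and additionally record which values were missing. The sketch below uses the `add_indicator` parameter of `SimpleImputer` on made-up toy data; the `_was_missing` column names are a hypothetical naming choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with missing values (made-up numbers)
toy_train = pd.DataFrame({
    "Critic_Score": [80.0, np.nan, 60.0],
    "User_Score": [8.0, 7.0, np.nan],
})
toy_valid = pd.DataFrame({"Critic_Score": [np.nan], "User_Score": [9.0]})

# Mean imputation plus one 0/1 indicator column per feature that had
# missing values in the training data
indicator_imputer = SimpleImputer(strategy="mean", add_indicator=True)
train_values = indicator_imputer.fit_transform(toy_train)
valid_values = indicator_imputer.transform(toy_valid)

# Build column names: original columns first, then the indicator columns
base_cols = list(toy_train.columns)
flag_cols = [f"{base_cols[i]}_was_missing" for i in indicator_imputer.indicator_.features_]
all_cols = base_cols + flag_cols

train_imputed = pd.DataFrame(train_values, columns=all_cols, index=toy_train.index)
valid_imputed = pd.DataFrame(valid_values, columns=all_cols, index=toy_valid.index)
print(train_imputed)
print(valid_imputed)
```

The indicator columns let a tree-based model treat "value was missing" as a signal of its own, which may or may not help on this dataset.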

## Encoding categorical variables

Prior to training your model you will have to encode the categorical variables. Inspect all categorical variables and use the ``LabelEncoder`` or the ``OneHotEncoder`` where appropriate. Remember that you have to combine the numerical as well as the label encoded and the one hot encoded dataframes at the end.
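
As an illustration of the one-hot route and of the final combination step, here is a minimal sketch on made-up toy data. The frame names and the `Genre_` prefix are hypothetical; `handle_unknown="ignore"` is one possible way to deal with categories that appear only in the validation data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (made-up values); 'Puzzle' occurs only in the validation data
toy_train = pd.DataFrame({
    "Genre": ["Sports", "Racing", "Sports"],
    "Critic_Score": [76.0, 82.0, 80.0],
})
toy_valid = pd.DataFrame({"Genre": ["Puzzle"], "Critic_Score": [70.0]}, index=[3])

# Unknown categories are encoded as an all-zero row instead of raising an error
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(toy_train[["Genre"]]).toarray()
valid_encoded = encoder.transform(toy_valid[["Genre"]]).toarray()

# Name the new columns after the learned categories and keep the row index
onehot_cols = [f"Genre_{category}" for category in encoder.categories_[0]]
oh_train = pd.DataFrame(train_encoded, columns=onehot_cols, index=toy_train.index)
oh_valid = pd.DataFrame(valid_encoded, columns=onehot_cols, index=toy_valid.index)

# Combine the numerical columns with the one-hot encoded columns
combined_train = pd.concat([toy_train[["Critic_Score"]], oh_train], axis=1)
combined_valid = pd.concat([toy_valid[["Critic_Score"]], oh_valid], axis=1)
print(combined_train)
print(combined_valid)
```

One-hot encoding suits nominal variables with few distinct values; for columns with many categories the number of new columns grows accordingly.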

In [17]:
from sklearn.preprocessing import LabelEncoder

In [18]:
data = pd.read_csv(data_path)

# Drop every row that contains at least one missing value
data.dropna(axis=0, inplace=True)

# Separate target from predictors
y = data['Global_Sales']
X = data.drop(['Global_Sales'], axis=1)

# Train-test split
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [19]:
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 15 and 
                        X_train_full[cname].dtype == "object"]

print(low_cardinality_cols)

['Genre', 'Rating']


In [20]:
cols_to_keep = low_cardinality_cols + numCols
X_train = X_train[cols_to_keep].copy()
X_valid = X_valid[cols_to_keep].copy()

X_train

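|   | Genre | Rating | Year_of_Release | Critic_Score | User_Score |
|---|---|---|---|---|---|
| 8134 | Sports | NaN | 1998.0 | NaN | NaN |
| 14329 | Fighting | T | 2007.0 | 38.0 | 7.13 |
| 8513 | Simulation | T | 2010.0 | 70.0 | 6.40 |
| 9328 | Sports | E | 2006.0 | 58.0 | 6.90 |
| 2658 | Puzzle | E | 2009.0 | NaN | 7.13 |
| ... | ... | ... | ... | ... | ... |
| 9225 | Simulation | T | 2007.0 | 72.0 | 7.80 |
| 13123 | Action | NaN | 2015.0 | NaN | NaN |
| 9845 | Platform | E10+ | 2008.0 | NaN | 6.60 |
| 10799 | Action | T | 2015.0 | 71.0 | 6.50 |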


In [21]:
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep only selected columns
cols_to_keep = low_cardinality_cols + numerical_cols
X_train = X_train_full[cols_to_keep].copy()
X_valid = X_valid_full[cols_to_keep].copy()

X_train.head()

|   | Genre | Rating | Year_of_Release | Critic_Score | Critic_Count | User_Score | User_Count |
|---|---|---|---|---|---|---|---|
| 2401 | Shooter | T | 2002.0 | 65.0 | 14.0 | 8.1 | 17.0 |
| 10675 | Sports | E10+ | 2005.0 | 75.0 | 37.0 | 6.1 | 24.0 |
| 10819 | Action | T | 2002.0 | 54.0 | 18.0 | 5.0 | 4.0 |
| 15597 | Puzzle | E | 2007.0 | 70.0 | 21.0 | 7.5 | 11.0 |
| 13977 | Action | E | 2013.0 | 49.0 | 7.0 | 6.3 | 4.0 |


In [22]:
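label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply a separate label encoder to each categorical column.
# LabelEncoder.transform raises a ValueError for labels it did not see during fit
# (e.g. a rare category that occurs only in the validation split), so each encoder
# is fitted on the categories of both splits.
for col in low_cardinality_cols:
    label_encoder = LabelEncoder()
    label_encoder.fit(pd.concat([X_train[col], X_valid[col]]))
    label_X_train[col] = label_encoder.transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])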

## Train the Model

Now our dataset should be ready and we can train a predictive model. Train a Decision Tree as well as a Random Forest and compare the in-sample as well as the out-of-sample performance of both models using the mean absolute error.
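
To make the comparison concrete, the following sketch fits both model types on synthetic data and reports the in-sample and out-of-sample MAE. The data, the variable names, and the choice of 50 trees are assumptions for illustration only and say nothing about the results on the video game dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (made up): a noisy function of two of four features
rng = np.random.default_rng(0)
syn_X = rng.normal(size=(500, 4))
syn_y = 2.0 * syn_X[:, 0] + np.sin(syn_X[:, 1]) + rng.normal(scale=0.5, size=500)

syn_X_train, syn_X_valid, syn_y_train, syn_y_valid = train_test_split(
    syn_X, syn_y, train_size=0.8, test_size=0.2, random_state=0
)

models = {
    "decision_tree": DecisionTreeRegressor(random_state=1),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=1),
}

results = {}
for name, candidate in models.items():
    candidate.fit(syn_X_train, syn_y_train)
    # In-sample error: predictions on the data the model was trained on
    train_mae = mean_absolute_error(syn_y_train, candidate.predict(syn_X_train))
    # Out-of-sample error: predictions on held-out data
    valid_mae = mean_absolute_error(syn_y_valid, candidate.predict(syn_X_valid))
    results[name] = (train_mae, valid_mae)
    print(f"{name}: in-sample MAE = {train_mae:.3f}, out-of-sample MAE = {valid_mae:.3f}")
```

An unrestricted decision tree can memorize its training data, so a large gap between the two errors is a sign of overfitting rather than of a good model.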

In [26]:
from sklearn.tree import DecisionTreeRegressor



In [30]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a decision tree with the given leaf limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
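
A helper like `get_mae` is typically called in a loop over candidate tree sizes. The sketch below repeats the function so that it runs on its own and applies it to synthetic data; the candidate values 5, 50 and 500 and all `syn_` names are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a decision tree with the given leaf limit and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Synthetic data (made up) standing in for the prepared features and target
rng = np.random.default_rng(0)
syn_X = rng.normal(size=(500, 3))
syn_y = syn_X[:, 0] ** 2 + syn_X[:, 1] + rng.normal(scale=0.3, size=500)
syn_X_train, syn_X_valid, syn_y_train, syn_y_valid = train_test_split(
    syn_X, syn_y, train_size=0.8, test_size=0.2, random_state=0
)

# Compare the validation MAE for a few tree sizes and keep the best one
scores = {}
for leaf_limit in [5, 50, 500]:
    scores[leaf_limit] = get_mae(leaf_limit, syn_X_train, syn_X_valid, syn_y_train, syn_y_valid)
    print(f"max_leaf_nodes={leaf_limit}: validation MAE = {scores[leaf_limit]:.3f}")

best_leaf_limit = min(scores, key=scores.get)
print("Best max_leaf_nodes:", best_leaf_limit)
```

Very small trees tend to underfit and very large ones to overfit, so the validation MAE is usually lowest somewhere in between.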

In [32]:
from sklearn.ensemble import RandomForestRegressor

In [36]:
# Define
forest_model = RandomForestRegressor(random_state=1, n_estimators=100)

# Fit on the numerical columns only: RandomForestRegressor cannot be fitted
# on the string-valued categorical columns
forest_model.fit(X_train[numerical_cols], y_train)

# Evaluate
games_preds = forest_model.predict(X_valid[numerical_cols])
print("The MAE of our model is: {}".format(mean_absolute_error(y_valid, games_preds)))

In [37]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [40]:
from sklearn.preprocessing import OneHotEncoder

# Preprocessing numerical columns: impute missing values with the column mean
numerical_transformer = SimpleImputer(strategy='mean')

# Preprocessing categorical columns: impute, then one-hot encode so that the
# model receives numbers instead of strings
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle both preprocessors
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, low_cardinality_cols)
])

In [42]:
model = RandomForestRegressor(n_estimators=100, random_state=1)

In [43]:
complete_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocess the raw training data and fit the model
complete_pipeline.fit(X_train, y_train)

# Preprocess the raw validation data and make predictions
preds = complete_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print("MAE using the complete pipeline: {}".format(score))

## Improve the Model

Having successfully trained a model, your next task is to improve its performance. Try different advanced feature engineering techniques and see if they are able to improve your model.  
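
As one possible starting point, the sketch below derives two extra features on made-up toy rows. The helper `add_engineered_features` and the new column names are hypothetical, and the score scales (critic scores up to 100, user scores up to 10) are assumed from the `describe()` output above. Whether such features help has to be checked against the validation MAE.

```python
import numpy as np
import pandas as pd

# Toy rows with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    "Critic_Score": [76.0, 82.0, 54.0],
    "User_Score": [8.0, 8.3, 5.0],
    "User_Count": [322.0, 709.0, 4.0],
})


def add_engineered_features(frame):
    """Return a copy of the frame with two illustrative derived columns."""
    out = frame.copy()
    # Compress the heavily right-skewed review count with log(1 + x)
    out["User_Count_log"] = np.log1p(out["User_Count"])
    # Disagreement between critics and users, with both scores on a 0-10 scale
    out["Score_Gap"] = out["Critic_Score"] / 10.0 - out["User_Score"]
    return out


toy_engineered = add_engineered_features(toy)
print(toy_engineered)
```

The same function can be applied to the training and validation frames before they enter the pipeline, as long as the new columns are also added to the list of numerical columns.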

In [None]:
# Write your code here