<a href="https://colab.research.google.com/github/BigZ23/course/blob/assignment_2/02_PA_Machine%20Learning%20Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Topics in Data Science 2 23/24*
# Programming Assignment 2 - Predicting Video Game Sales

In this programming assignment, you will apply your new (or refreshed) machine learning knowledge. You will create a modeling pipeline that trains and evaluates a machine learning model built on several numeric and categorical features.

## Introduction and Dataset

You are provided with a dataset containing a list of video games with sales greater than 100,000 copies. Your task is to build a model that predicts the yearly global sales (column ``Global_Sales``) of a video game using the available features.

To help you get started, the following blocks of code import the dataset using pandas:

In [None]:
import pandas as pd

In [None]:
data_path = 'https://raw.githubusercontent.com/GuntherGust/tds2_data/main/data/video_game_sales.csv'
game_sales_data = pd.read_csv(data_path)
game_sales_data.head()

## Splitting the Dataset

Before you can start training a machine learning model, you have to split the dataframe into features and the target variable (try to use as many features as possible):

In [30]:
X = game_sales_data.drop(columns=['Global_Sales'])
y = game_sales_data['Global_Sales']

Next, you will have to create a train-test split in order to be able to evaluate your models. Use 80\% of the data for training and 20\% for evaluation (take a look at the sklearn [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to identify the relevant parameters):

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (13368, 9)
Shape of X_test: (3343, 9)
Shape of y_train: (13368,)
Shape of y_test: (3343,)


## Removing missing values
If you inspect your training data you will find that some of the variables have missing values. Use the ``SimpleImputer`` to replace missing values in numerical columns with the column mean and missing values in categorical columns with the most frequent value (take a look at the SimpleImputer [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to identify the relevant parameters). You can decide if you want to use the simple or the advanced imputation strategy (or just try both).

In [33]:
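from sklearn.impute import SimpleImputer

# help(SimpleImputer)

# Define the column groups before imputing (they are needed by the assignments below)
categorical_cols = ['Platform', 'Genre', 'Rating']
numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns

# Copy the splits so the in-place assignments below cannot trigger pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# Imputer for numerical values: fit on the training data only, then apply to both splits
numerical_imputer = SimpleImputer(strategy='mean')
X_train[numerical_cols] = numerical_imputer.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = numerical_imputer.transform(X_test[numerical_cols])

# Imputer for categorical values
categorical_imputer = SimpleImputer(strategy='most_frequent')
X_train[categorical_cols] = categorical_imputer.fit_transform(X_train[categorical_cols])
X_test[categorical_cols] = categorical_imputer.transform(X_test[categorical_cols])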
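
To make the two imputation strategies concrete, here is a minimal sketch on a toy dataframe. The column names and values are hypothetical and only illustrate what ``strategy='mean'`` and ``strategy='most_frequent'`` fill in; they are not taken from the assignment dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with one gap per column (hypothetical names and values)
toy = pd.DataFrame({
    "score": [10.0, np.nan, 30.0],
    "genre": pd.Series(["Action", "Action", np.nan], dtype=object),
})

# Numerical gap is replaced by the column mean: (10 + 30) / 2 = 20
num_imputer = SimpleImputer(strategy="mean")
score_filled = num_imputer.fit_transform(toy[["score"]])

# Categorical gap is replaced by the most frequent value: "Action"
cat_imputer = SimpleImputer(strategy="most_frequent")
genre_filled = cat_imputer.fit_transform(toy[["genre"]])

print(score_filled.ravel())
print(genre_filled.ravel())
```

As in the cell above, a fitted imputer would normally be applied to the test split with ``transform`` only, so that the fill values come from the training data alone.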

## Encoding categorical variables

Prior to training your model you will have to encode the categorical variables. Inspect all categorical variables and use the ``OrdinalEncoder``, the ``OneHotEncoder`` or the ``TargetEncoder`` where appropriate. Remember that you have to combine the numerical as well as the ordinal encoded and the one hot encoded dataframes at the end.

In [35]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# List of categorical columns
categorical_cols = ['Platform', 'Genre', 'Rating']

# Imputer for the numerical columns
numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns
numerical_transformer = SimpleImputer(strategy='mean')

# Pipeline for the categorical columns: impute, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# ColumnTransformer combining both transformations
# (columns not listed here are dropped, because remainder='drop' is the default)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Transform the data: fit on the training set only, then apply to the test set
X_train_encoded = preprocessor.fit_transform(X_train)
X_test_encoded = preprocessor.transform(X_test)

## Train the Model

Now our dataset should be ready and we can train a predictive model. Train a Decision Tree as well as a Random Forest and compare the in-sample and out-of-sample performance of both models using the mean absolute error.

In [36]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Decision Tree model
decision_tree_model = DecisionTreeRegressor(random_state=42)
decision_tree_model.fit(X_train_encoded, y_train)

# Decision Tree predictions on the training data
y_train_pred_dt = decision_tree_model.predict(X_train_encoded)

# Decision Tree predictions on the test data
y_test_pred_dt = decision_tree_model.predict(X_test_encoded)

# Random Forest model
random_forest_model = RandomForestRegressor(random_state=42)
random_forest_model.fit(X_train_encoded, y_train)

# Random Forest predictions on the training data
y_train_pred_rf = random_forest_model.predict(X_train_encoded)

# Random Forest predictions on the test data
y_test_pred_rf = random_forest_model.predict(X_test_encoded)

# Evaluate performance with the mean absolute error
mae_train_dt = mean_absolute_error(y_train, y_train_pred_dt)
mae_test_dt = mean_absolute_error(y_test, y_test_pred_dt)

mae_train_rf = mean_absolute_error(y_train, y_train_pred_rf)
mae_test_rf = mean_absolute_error(y_test, y_test_pred_rf)

# Results
print("Decision Tree - Mean Absolute Error (Train):", mae_train_dt)
print("Decision Tree - Mean Absolute Error (Test):", mae_test_dt)

print("Random Forest - Mean Absolute Error (Train):", mae_train_rf)
print("Random Forest - Mean Absolute Error (Test):", mae_test_rf)

#Decision Tree - Mean Absolute Error (Train): 0.1327
#Random Forest - Mean Absolute Error (Train): 0.2270

#Decision Tree - Mean Absolute Error (Test):  0.5882
#Random Forest - Mean Absolute Error (Test):  0.4764

Decision Tree - Mean Absolute Error (Train): 0.1326754458397188
Decision Tree - Mean Absolute Error (Test): 0.5882457302437696
Random Forest - Mean Absolute Error (Train): 0.2269834888122577
Random Forest - Mean Absolute Error (Test): 0.4764243116885032
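
For reference, the mean absolute error is the average of the absolute differences between true and predicted values. The sketch below computes it by hand on made-up numbers and compares the result with ``mean_absolute_error``; the values are purely illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted values, for illustration only
y_true = np.array([1.0, 0.5, 2.0, 0.1])
y_pred = np.array([0.8, 0.7, 1.5, 0.1])

# MAE = mean(|y_true - y_pred|) = (0.2 + 0.2 + 0.5 + 0.0) / 4
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Read this way, the results above show that both models fit the training data far better than the test data, which points to overfitting. The gap is larger for the single Decision Tree, and the Random Forest has the lower test error.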


## Improve the Model

Having successfully trained a model, your next task is to improve its performance. Try different advanced feature engineering techniques and see if they are able to improve your model.  

In [37]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Polynomial features (degree 2)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_encoded)
X_test_poly = poly.transform(X_test_encoded)

# Decision Tree model on the polynomial features
decision_tree_model_poly = DecisionTreeRegressor(random_state=42)
decision_tree_model_poly.fit(X_train_poly, y_train)

# Decision Tree predictions on the polynomial test data
y_test_pred_dt_poly = decision_tree_model_poly.predict(X_test_poly)

# Random Forest model on the polynomial features
random_forest_model_poly = RandomForestRegressor(random_state=42)
random_forest_model_poly.fit(X_train_poly, y_train)

# Random Forest predictions on the polynomial test data
y_test_pred_rf_poly = random_forest_model_poly.predict(X_test_poly)

# Evaluate performance with the mean absolute error on the polynomial features
mae_test_dt_poly = mean_absolute_error(y_test, y_test_pred_dt_poly)
mae_test_rf_poly = mean_absolute_error(y_test, y_test_pred_rf_poly)

print("Decision Tree - Mean Absolute Error (Test, polynomial):", mae_test_dt_poly)
print("Random Forest - Mean Absolute Error (Test, polynomial):", mae_test_rf_poly)

#Decision Tree - Mean Absolute Error (Test, polynomial): 0.5553   -vs. 0.5882-
#Random Forest - Mean Absolute Error (Test, polynomial): 0.4769   -vs. 0.4764-

Decision Tree - Mean Absolute Error (Test, polynomial): 0.5553428704285776
Random Forest - Mean Absolute Error (Test, polynomial): 0.4769390046801551


In [38]:
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# with_mean=False scales without centering, which is required if the encoded matrix is sparse
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_test_scaled = scaler.transform(X_test_encoded)

# Decision Tree model on the scaled data
decision_tree_model_scaled = DecisionTreeRegressor(random_state=42)
decision_tree_model_scaled.fit(X_train_scaled, y_train)

# Decision Tree predictions on the scaled test data
y_test_pred_dt_scaled = decision_tree_model_scaled.predict(X_test_scaled)

# Random Forest model on the scaled data
random_forest_model_scaled = RandomForestRegressor(random_state=42)
random_forest_model_scaled.fit(X_train_scaled, y_train)

# Random Forest predictions on the scaled test data
y_test_pred_rf_scaled = random_forest_model_scaled.predict(X_test_scaled)

# Evaluate performance with the MAE on the scaled data
mae_test_dt_scaled = mean_absolute_error(y_test, y_test_pred_dt_scaled)
mae_test_rf_scaled = mean_absolute_error(y_test, y_test_pred_rf_scaled)

# Print the results
print("Decision Tree - Mean Absolute Error (Test, scaled):", mae_test_dt_scaled)
print("Random Forest - Mean Absolute Error (Test, scaled):", mae_test_rf_scaled)

# Decision Tree - Mean Absolute Error (Test, scaled): 0.5861    -vs. 0.5882-
# Random Forest - Mean Absolute Error (Test, scaled): 0.4733    -vs. 0.4764-

Decision Tree - Mean Absolute Error (Test, scaled): 0.5860613150478378
Random Forest - Mean Absolute Error (Test, scaled): 0.47328451115578096
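
The differences between the variants above are small, so a single train-test split may not be enough to tell them apart. One possible way to compare variants more robustly is cross-validation on the training data. The sketch below is an assumption-laden illustration: it uses synthetic data from ``make_regression`` instead of the game sales data, and the names ``candidates`` and ``cv_mae`` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook this would be the training split
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Two hypothetical variants: with and without scaling
candidates = {
    "baseline": Pipeline([
        ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
    ]),
    "scaled": Pipeline([
        ("scale", StandardScaler()),
        ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
    ]),
}

cv_mae = {}
for name, pipeline in candidates.items():
    # scikit-learn maximizes scores, so the MAE is reported as a negative value
    scores = cross_val_score(
        pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores.mean()
    print(f"{name}: cross-validated MAE = {cv_mae[name]:.3f}")
```

Because each pipeline is refit inside every fold, the scaler only ever sees the training part of that fold, which mirrors the fit-on-train, transform-on-test pattern used throughout this notebook.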
