<div class='bar_title'></div>

*Introduction to Data Science*

# Assignment 6 - Machine Learning Solutions

Gunther Gust / Vanessa Haustein<br>
Chair of Enterprise AI

Winter Semester 24/25

<img src='https://raw.githubusercontent.com/vhaus63/ids_data/main/d3.png?raw=true' style='width:20%; float:left;' />

<img src="https://raw.githubusercontent.com/vhaus63/ids_data/main/CAIDASlogo.png" style="width:20%; float:left;" />

# Exercise 1
In this exercise, we will use decision trees to classify people on board the Titanic. We want to train a model that predicts whether a person with given features survived the catastrophe.

In [1]:
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/vhaus63/ids_data/refs/heads/main/DecisionTrees_titanic.csv')
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,1,0,38.0,1,0,71.2833,0
1,1,1,0,35.0,1,0,53.1,2
2,0,1,1,54.0,0,0,51.8625,2
3,1,3,0,4.0,1,1,16.7,2
4,1,1,0,58.0,0,0,26.55,2


Split the dataset into features (X) and target (y). Use as many features as possible from the given dataset. Then, create a train-test split where the test data makes up 30% of the entire dataset.

In [3]:
y = df.loc[:,"Survived"]
X = df.drop("Survived", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
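
Note that the split above draws a new random partition on every run, so the reported scores can change between runs. A minimal sketch of a reproducible, class-balanced split on synthetic stand-in data follows; the arrays `X_demo`/`y_demo`, the seed 42 and the use of `stratify` are illustrative assumptions, not part of the original solution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% class 0 and 70% class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 30 + [1] * 70)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_te))  # sizes of the training and test parts
print(y_te.mean())           # share of class 1 in the test part
```

Running the cell twice with the same `random_state` yields the same partition, which makes later comparisons between models easier to interpret.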

Now, create a simple decision tree model, fit it on the training data and then create predictions for the test set.

In [4]:
decisiontree = DecisionTreeClassifier()
decisiontree.fit(X_train, y_train)
decisiontree_pred = decisiontree.predict(X_test)

Use the `classification_report` function to check the performance of the model:

In [5]:
print(classification_report(y_test, decisiontree_pred))

              precision    recall  f1-score   support

           0       0.73      0.61      0.67        18
           1       0.82      0.89      0.86        37

    accuracy                           0.80        55
   macro avg       0.78      0.75      0.76        55
weighted avg       0.79      0.80      0.79        55
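
To see where the numbers in such a report come from, the metrics can be recomputed by hand from confusion-matrix counts. The counts below (`tp0`, `fn0`, `tp1`, `fn1`) are reconstructed so that they reproduce the decision tree report above; they are an assumption, not output from the notebook.

```python
# Hypothetical confusion counts, chosen to be consistent with the report above
tp0, fn0 = 11, 7   # class 0: 18 true instances, 11 predicted correctly
tp1, fn1 = 33, 4   # class 1: 37 true instances, 33 predicted correctly

# Precision: share of the predictions for a class that are correct.
# Instances of the other class that were misclassified count against it.
precision0 = tp0 / (tp0 + fn1)
precision1 = tp1 / (tp1 + fn0)

# Recall: share of the true instances of a class that were found
recall0 = tp0 / (tp0 + fn0)
recall1 = tp1 / (tp1 + fn1)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


accuracy = (tp0 + tp1) / (tp0 + fn0 + tp1 + fn1)

print(f"class 0: precision={precision0:.2f} recall={recall0:.2f} f1={f1(precision0, recall0):.2f}")
print(f"class 1: precision={precision1:.2f} recall={recall1:.2f} f1={f1(precision1, recall1):.2f}")
print(f"accuracy={accuracy:.2f}")
```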



We want to compare the performance of our model to a Random Forest. You can read more about random forests [here](https://medium.com/@mrmaster907/introduction-random-forest-classification-by-example-6983d95c7b91). Just like with the decision tree, create a Random Forest model, fit it, and predict on the test data. Print the classification report for this model and compare it to the one from the decision tree.

In [6]:
randomforest = RandomForestClassifier()
randomforest.fit(X_train, y_train)
randomforest_pred = randomforest.predict(X_test)

In [7]:
print(classification_report(y_test, randomforest_pred))

              precision    recall  f1-score   support

           0       0.80      0.44      0.57        18
           1       0.78      0.95      0.85        37

    accuracy                           0.78        55
   macro avg       0.79      0.70      0.71        55
weighted avg       0.79      0.78      0.76        55



The similar reports from both models are not surprising given the small dataset of just 183 instances, which leaves only 55 instances for testing. Here the decision tree reaches an accuracy of 0.80 and the Random Forest 0.78, a difference of a single test instance. Since the decision tree already captures the structure of the data well, the Random Forest does not provide any additional performance improvement.
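
With only 183 rows, a single 70/30 split leaves 55 test instances, so the scores can shift noticeably from one random split to the next. One way to get a steadier comparison is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, it is not the Titanic dataset) and only illustrates the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the Titanic subset
X_demo, y_demo = make_classification(n_samples=183, n_features=7, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

cv_results = {}
for name, model in models.items():
    # 5-fold CV: every row is used for testing exactly once
    cv_results[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    print(f"{name}: mean={cv_results[name].mean():.3f}, std={cv_results[name].std():.3f}")
```

The standard deviation across folds gives a rough sense of how much of a difference between two models could be due to the split alone.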

# Exercise 2

In this programming assignment you need to apply your new machine learning knowledge. You will need to create a modeling pipeline for training and evaluating a machine learning model built on several numeric as well as categorical features.

## Introduction and Dataset

You are provided with a dataset containing a list of video games with sales greater than 100,000 copies. Your task is to build a model predicting the yearly global sales (column ``Global_Sales``) of a video game leveraging the available features.

To help you get started, the following blocks of code import the dataset using pandas: 

In [8]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
from lets_plot import *
LetsPlot.setup_html()

In [9]:
data_path = 'https://raw.githubusercontent.com/vhaus63/ids_data/refs/heads/main/video_game_sales.csv'
game_sales_data = pd.read_csv(data_path)
game_sales_data = game_sales_data[game_sales_data.Name.notna()]
game_sales_data.head()

Unnamed: 0.1,Unnamed: 0,Name,Platform,Year_of_Release,Genre,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Rating
0,0,Wii Sports,Wii,2006.0,Sports,82.53,76.0,51.0,8.0,322.0,E
1,1,Super Mario Bros.,NES,1985.0,Platform,40.24,,,,,
2,2,Mario Kart Wii,Wii,2008.0,Racing,35.52,82.0,73.0,8.3,709.0,E
3,3,Wii Sports Resort,Wii,2009.0,Sports,32.77,80.0,73.0,8.0,192.0,E
4,4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,31.37,,,,,


## Splitting the Dataset

Before you can get started training a machine learning model you will have to split the dataframe into features and the target variable (try to use as many features as possible):

In [10]:
game_sales_data.set_index('Name', inplace=True)
game_sales_data.columns

Index(['Unnamed: 0', 'Platform', 'Year_of_Release', 'Genre', 'Global_Sales',
       'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count', 'Rating'],
      dtype='object')

In [11]:
y = game_sales_data['Global_Sales']
X = game_sales_data.drop('Global_Sales', axis=1)
print(y.head())
print(X.head())

Name
Wii Sports                  82.53
Super Mario Bros.           40.24
Mario Kart Wii              35.52
Wii Sports Resort           32.77
Pokemon Red/Pokemon Blue    31.37
Name: Global_Sales, dtype: float64
                          Unnamed: 0 Platform  Year_of_Release         Genre  \
Name                                                                           
Wii Sports                         0      Wii           2006.0        Sports   
Super Mario Bros.                  1      NES           1985.0      Platform   
Mario Kart Wii                     2      Wii           2008.0        Racing   
Wii Sports Resort                  3      Wii           2009.0        Sports   
Pokemon Red/Pokemon Blue           4       GB           1996.0  Role-Playing   

                          Critic_Score  Critic_Count  User_Score  User_Count  \
Name                                                                           
Wii Sports                        76.0          51.0         8.0     

Next, you will have to create a train-test split in order to be able to evaluate your models. Use 80% of the data for training and 20% for evaluation. Additionally, make sure that your results are reproducible:

In [12]:
train_X, val_X, train_y, val_y = train_test_split(X, y, 
                                                  train_size=0.8, 
                                                  random_state = 0)

## Removing missing values
If you inspect your training data you will find that some of the variables have missing values. Use the ``SimpleImputer`` to replace missing values in numerical columns with the column mean and missing values in categorical columns with the most frequent value (take a look at the SimpleImputer [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) to identify the relevant parameters).

First, you need to separate the names of the numerical columns from those of the categorical columns. To do so, look at the dtypes of the training data:

In [13]:
train_X.dtypes

Unnamed: 0           int64
Platform            object
Year_of_Release    float64
Genre               object
Critic_Score       float64
Critic_Count       float64
User_Score         float64
User_Count         float64
Rating              object
dtype: object

In [14]:
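# Only float64 columns are treated as numerical. The int64 column 'Unnamed: 0'
# (which looks like a leftover row index) matches neither filter and is
# therefore left out of the features.
num_cols = [col for col in train_X.columns if train_X[col].dtype == 'float64']
cat_cols = [col for col in train_X.columns if train_X[col].dtype == 'object']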

Now you can use the `SimpleImputer` to fill the missing data.

In [15]:
num_imputer = SimpleImputer(strategy='mean')

train_X_num_imputed = pd.DataFrame(num_imputer.fit_transform(train_X[num_cols]), 
                                   columns=num_cols, index=train_X.index)
val_X_num_imputed = pd.DataFrame(num_imputer.transform(val_X[num_cols]), 
                                   columns=num_cols, index=val_X.index)

cat_imputer = SimpleImputer(strategy='most_frequent')

train_X_cat_imputed = pd.DataFrame(cat_imputer.fit_transform(train_X[cat_cols]), 
                                   columns=cat_cols, index=train_X.index)
val_X_cat_imputed = pd.DataFrame(cat_imputer.transform(val_X[cat_cols]), 
                                   columns=cat_cols, index=val_X.index)
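
The imputers above are fitted on the training data only and then merely applied to the validation data, so no information from the validation set leaks into the preprocessing. A small sketch with made-up values (the frames `train_demo` and `val_demo` are hypothetical) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up miniature datasets with one missing value each
train_demo = pd.DataFrame({"Critic_Score": [80.0, np.nan, 60.0]})
val_demo = pd.DataFrame({"Critic_Score": [np.nan, 90.0]})

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_demo)  # learns the mean of the training rows
val_filled = imputer.transform(val_demo)          # reuses that mean, does not recompute it

print(imputer.statistics_)  # the mean learned from the training data
print(val_filled.ravel())   # the missing validation value is filled with the training mean
```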

## Encoding categorical variables

Prior to training your model you will have to encode the categorical variables. We inspect all categorical variables and use the ``OrdinalEncoder`` or the ``OneHotEncoder`` where appropriate.

In [16]:
for cat in cat_cols:
    print("{}: {}".format(cat, game_sales_data[cat].nunique()))

Platform: 31
Genre: 12
Rating: 8


Use the `OrdinalEncoder` for the Rating:

In [17]:
#see e.g. https://en.wikipedia.org/wiki/Entertainment_Software_Rating_Board for correct order
#see e.g. https://stackoverflow.com/questions/72170947/how-to-use-ordinalencoder-to-set-custom-order for more explanation
ordinal_encoder = OrdinalEncoder(categories=[['EC','E','K-A','E10+','T', 'M', 'AO', 'RP']])

train_X_cat_label = pd.DataFrame(ordinal_encoder.fit_transform(train_X_cat_imputed[["Rating"]]),
                                 columns=["Rating"], 
                                  index=train_X_cat_imputed.index)
val_X_cat_label = pd.DataFrame(ordinal_encoder.transform(val_X_cat_imputed[["Rating"]]),
                                 columns=["Rating"], 
                                 index=val_X_cat_imputed.index)
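
With an explicit `categories` list, the `OrdinalEncoder` assigns each rating its position in that list. A minimal sketch on a made-up column (the frame `demo` is hypothetical; the order is the one used above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

rating_order = ['EC', 'E', 'K-A', 'E10+', 'T', 'M', 'AO', 'RP']
demo = pd.DataFrame({"Rating": ["E", "T", "M", "E"]})

# Each rating is mapped to its index in rating_order
encoder = OrdinalEncoder(categories=[rating_order])
codes = encoder.fit_transform(demo)

print(codes.ravel())
```

Because the codes follow the list order, a tree split such as `Rating <= 3.5` separates the ratings up to `E10+` from `T` and everything after it in the list.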

Now, use the `OneHotEncoder` to encode the rest of the categorical values:

In [18]:
ohe_encoder = OneHotEncoder(sparse_output=False)
train_X_cat_ohe = pd.DataFrame(ohe_encoder.fit_transform(train_X_cat_imputed[["Platform", 'Genre']]),
                                 index=train_X_cat_imputed.index, columns=ohe_encoder.get_feature_names_out())
val_X_cat_ohe = pd.DataFrame(ohe_encoder.transform(val_X_cat_imputed[["Platform", 'Genre']]),
                                 index=val_X_cat_imputed.index, columns=ohe_encoder.get_feature_names_out())

We still have to combine the imputed numerical columns and all the encoded categorical columns in order to get a full training and validation set:

In [19]:
train_X = pd.concat([train_X_num_imputed, train_X_cat_label, train_X_cat_ohe], axis=1)
val_X = pd.concat([val_X_num_imputed, val_X_cat_label, val_X_cat_ohe], axis=1)

## Training the Model

Now our dataset should be ready and we can train a predictive model. Train a Decision Tree as well as a Random Forest and compare the in-sample and out-of-sample performance of both models using the mean absolute error (MAE). For this, you can create a function called `score_dataset` that takes all four necessary datasets (X_train, X_valid, y_train and y_valid), fits the two models on the training data and then predicts on X_valid. Depending on whether you want the in-sample or the out-of-sample score, pass either the training data or the validation data as X_valid and y_valid. The function should return the MAE of both models. Print all four results.

In [20]:
def score_dataset(X_train, X_valid, y_train, y_valid):
    model_rf = RandomForestRegressor(n_estimators=100, random_state=1)
    model_rf.fit(X_train, y_train)
    preds_rf = model_rf.predict(X_valid)
    model_dt = DecisionTreeRegressor(random_state=1)
    model_dt.fit(X_train, y_train)
    preds_dt = model_dt.predict(X_valid)
    return mean_absolute_error(y_valid, preds_rf), mean_absolute_error(y_valid, preds_dt), model_rf, model_dt
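
As a reminder of what `mean_absolute_error` measures: it is the average of the absolute differences between true and predicted values, $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, so lower is better. A tiny check with made-up numbers:

```python
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions
y_true = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 3.0]

# Manual computation: average of the absolute errors
mae_manual = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```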

In [21]:
oos_rf, oos_dt, model_rf, model_dt= score_dataset(train_X, val_X, train_y, val_y)
is_rf, is_dt, model_rf, model_dt = score_dataset(train_X, train_X, train_y, train_y)

In [22]:
print('Out-of-sample\nRandom Forest: {}\nDecision Tree: {}'.format(oos_rf, oos_dt))
print('------------------------------')
print('In-sample\nRandom Forest: {}\nDecision Tree: {}'.format(is_rf, is_dt))

Out-of-sample
Random Forest: 0.4334695197218959
Decision Tree: 0.5181498358720199
------------------------------
In-sample
Random Forest: 0.22931049440241147
Decision Tree: 0.12641105283767942


### Looks like Overfitting
Since the in-sample error is much lower than the out-of-sample error, we can deduce that we overfitted our models to the training data. Since we would rather have a model that performs well on new data in general than one that performs very well only on the data it has already seen, we need to look at the tree depth:

In [23]:
#tree depth
model_dt.get_depth()

42
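
The link between tree depth and overfitting can be reproduced on a small synthetic problem. The sketch below uses a noisy sine curve as stand-in data (an assumption, unrelated to the game sales data) and compares an unrestricted tree with one limited to depth 3.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

results = {}
for depth in (None, 3):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    results[depth] = (
        mean_absolute_error(y_tr, tree.predict(X_tr)),  # in-sample MAE
        mean_absolute_error(y_va, tree.predict(X_va)),  # out-of-sample MAE
        tree.get_depth(),
    )
    print(depth, results[depth])
```

The unrestricted tree memorizes the training rows, noise included, which is why its in-sample error says little about how it will do on unseen data.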

Change the `score_dataset` function from above so that you can control the depth of the tree with a parameter.

In [24]:
def score_dataset(X_train, X_valid, y_train, y_valid, depth=None):
    model_rf = RandomForestRegressor(n_estimators=100, random_state=1, max_depth=depth)
    model_rf.fit(X_train, y_train)
    preds_rf = model_rf.predict(X_valid)
    model_dt = DecisionTreeRegressor(random_state=1, max_depth=depth)
    model_dt.fit(X_train, y_train)
    preds_dt = model_dt.predict(X_valid)
    return mean_absolute_error(y_valid, preds_rf), mean_absolute_error(y_valid, preds_dt), model_rf, model_dt

Now you can use this function to get the in-sample and out-of-sample scores of both models for tree depths from 1 to 41 (`range(1, 42)` excludes the upper bound). Store the results for each model and for in-sample and out-of-sample in separate lists:

In [25]:
dt_scores_oos = []
dt_scores_is = []
rf_scores_oos = []
rf_scores_is = []
for i in range(1,42):
    oos_rf, oos_dt, model_rf, model_dt= score_dataset(train_X, val_X, train_y, val_y,depth=i)
    is_rf, is_dt, model_rf, model_dt = score_dataset(train_X, train_X, train_y, train_y,depth=i)
    dt_scores_is.append(is_dt)
    dt_scores_oos.append(oos_dt)
    rf_scores_is.append(is_rf)
    rf_scores_oos.append(oos_rf)

Plot the results using `lets-plot`.

In [26]:
x_values = list(range(1, 42))
data = pd.DataFrame({
    'x': x_values * 4,
    'y': dt_scores_oos + dt_scores_is + rf_scores_oos + rf_scores_is,
    'series': ['DT Out of Sample'] * 41 + ['DT In Sample'] * 41 + ['RF Out of Sample'] * 41 + ['RF In Sample'] * 41
})

(
    ggplot(data, aes(x='x', y='y', color='series'))
    + geom_point()
    + labs(color='Series', title='Scores Comparison')  # Legend title for clarity
)

Decide which tree depth is best and apply it to compute the exact scores again.
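
Besides reading the depth off the plot, one option is to pick the depth with the lowest out-of-sample MAE programmatically. The sketch below uses made-up scores (`depths` and `oos_scores` are hypothetical values, not the results of this notebook):

```python
# Hypothetical out-of-sample MAE values for depths 1 to 6
depths = [1, 2, 3, 4, 5, 6]
oos_scores = [0.60, 0.52, 0.48, 0.47, 0.49, 0.53]

# Index of the smallest error, then look up the matching depth
best_index = min(range(len(oos_scores)), key=oos_scores.__getitem__)
best_depth = depths[best_index]

print(best_depth, oos_scores[best_index])
```

When the curve is flat around its minimum, a shallower depth close to the minimum is often a reasonable choice as well, since it gives a simpler model at almost the same error.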

In [27]:
oos_rf, oos_dt, model_rf, model_dt= score_dataset(train_X, val_X, train_y, val_y,depth=10)
is_rf, is_dt, model_rf, model_dt = score_dataset(train_X, train_X, train_y, train_y,depth=10)

In [28]:
print('Out-of-sample\nRandom Forest: {}\nDecision Tree: {}'.format(oos_rf, oos_dt))
print('------------------------------')
print('In-sample\nRandom Forest: {}\nDecision Tree: {}'.format(is_rf, is_dt))

Out-of-sample
Random Forest: 0.44150895074300234
Decision Tree: 0.48015531209599804
------------------------------
In-sample
Random Forest: 0.35564943924730624
Decision Tree: 0.3381196860663547


### Use a Pipeline

We will use the sklearn Pipeline to wrap up the different steps of building a model.

In [29]:
train_X, val_X, train_y, val_y = train_test_split(X, y, 
                                                  train_size=0.8, 
                                                  random_state = 0)

Define the same preprocessing steps that we had before but now for the pipeline:

In [30]:
# Preprocessing numerical columns
numerical_transformer = SimpleImputer(strategy='mean')

## Preprocessing categorical columns

# Ordinal Encoder
categorical_transformer_ordinal = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(categories=[['EC','E','K-A','E10+','T', 'M', 'AO', 'RP']])) 
])

# One hot encoder
categorical_transformer_ohe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')) 
])

# Bundle the preprocessors
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, num_cols),
    ('cat_ordinal', categorical_transformer_ordinal, ['Rating']),
    ('cat_ohe', categorical_transformer_ohe, ["Platform", "Genre"])
])

Now, create the model and put together the complete pipeline. Evaluate the model again with the MAE.

In [31]:
# Create model
model = RandomForestRegressor(n_estimators=100, random_state=1,max_depth=10)


# Bundle preprocessing and modeling code in a pipeline
complete_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocess the raw training data and fit the model
complete_pipeline.fit(train_X, train_y)

# Preprocess the raw validation data and make predictions
preds = complete_pipeline.predict(val_X)

# Evaluate the model
score = mean_absolute_error(val_y, preds)
print("MAE using the complete pipeline: {}".format(score))

MAE using the complete pipeline: 0.4404769639699466


## Improve the Model: Are Blockbuster Titles Relevant?

Having successfully trained a model, your next task is to improve its performance. Evaluate whether an additional column that indicates whether a game is a blockbuster improves your model.

In [32]:
game_sales_data[game_sales_data["Global_Sales"]>10].index

Index(['Wii Sports', 'Super Mario Bros.', 'Mario Kart Wii',
       'Wii Sports Resort', 'Pokemon Red/Pokemon Blue', 'Tetris',
       'New Super Mario Bros.', 'Wii Play', 'New Super Mario Bros. Wii',
       'Duck Hunt', 'Nintendogs', 'Mario Kart DS',
       'Pokemon Gold/Pokemon Silver', 'Wii Fit', 'Kinect Adventures!',
       'Wii Fit Plus', 'Grand Theft Auto V', 'Grand Theft Auto: San Andreas',
       'Super Mario World', 'Brain Age: Train Your Brain in Minutes a Day',
       'Pokemon Diamond/Pokemon Pearl', 'Super Mario Land',
       'Super Mario Bros. 3', 'Grand Theft Auto V',
       'Grand Theft Auto: Vice City', 'Pokemon Ruby/Pokemon Sapphire',
       'Brain Age 2: More Training in Minutes a Day',
       'Pokemon Black/Pokemon White', 'Gran Turismo 3: A-Spec',
       'Call of Duty: Modern Warfare 3',
       'Pokémon Yellow: Special Pikachu Edition', 'Call of Duty: Black Ops 3',
       'Call of Duty: Black Ops', 'Pokemon X/Pokemon Y',
       'Call of Duty: Black Ops II', 'Call of D

In [33]:
block_buster = ["Mario","Pokemon","Grand Theft Auto","Call of Duty"]

Create a new column in the `game_sales_data` DataFrame called `blockbuster`. This column should contain `True` if the title contains any of the blockbuster names and `False` otherwise.

In [34]:
game_sales_data["blockbuster"] = [
        any(block_buster_title in title for block_buster_title in block_buster)
        for title in game_sales_data.index
    ]
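
The same flag can be computed with pandas string methods instead of a nested comprehension. This is an alternative sketch, not the original solution; the short `titles` index is made up from names that occur in the dataset.

```python
import re

import pandas as pd

block_buster = ["Mario", "Pokemon", "Grand Theft Auto", "Call of Duty"]
titles = pd.Index([
    "Super Mario Bros.",
    "Wii Sports",
    "Pokemon Red/Pokemon Blue",
    "Tetris",
    "Pokémon Yellow: Special Pikachu Edition",
], name="Name")

# Build one regular expression that matches any of the keywords literally
pattern = "|".join(re.escape(name) for name in block_buster)
flags = titles.str.contains(pattern)

print(list(flags))
```

Both variants do a plain substring match, so a title spelled with an accent, such as "Pokémon Yellow: Special Pikachu Edition", is not matched by the keyword "Pokemon".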

In [35]:
game_sales_data

Name,Unnamed: 0,Platform,Year_of_Release,Genre,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Rating,blockbuster
Wii Sports,0,Wii,2006.0,Sports,82.53,76.0,51.0,8.0,322.0,E,False
Super Mario Bros.,1,NES,1985.0,Platform,40.24,,,,,,True
Mario Kart Wii,2,Wii,2008.0,Racing,35.52,82.0,73.0,8.3,709.0,E,True
Wii Sports Resort,3,Wii,2009.0,Sports,32.77,80.0,73.0,8.0,192.0,E,False
Pokemon Red/Pokemon Blue,4,GB,1996.0,Role-Playing,31.37,,,,,,True
...,...,...,...,...,...,...,...,...,...,...,...
Samurai Warriors: Sanada Maru,16706,PS3,2016.0,Action,0.01,,,,,,False
LMA Manager 2007,16707,X360,2006.0,Sports,0.01,,,,,,False
Haitaka no Psychedelica,16708,PSV,2016.0,Adventure,0.01,,,,,,False
Spirits & Spells,16709,GBA,2003.0,Platform,0.01,,,,,,False


Split the `game_sales_data` again into feature and target variables, create the same train-test split, perform the same preprocessing using the pipeline and create the same models. Is the model performing better now?

In [36]:
y = game_sales_data["Global_Sales"]
X = game_sales_data.drop("Global_Sales", axis=1)

In [47]:
train_X, val_X, train_y, val_y = train_test_split(X, y, 
                                                  train_size=0.8, 
                                                  random_state = 0)

In [49]:
# Preprocessing numerical columns
numerical_transformer = SimpleImputer(strategy='mean')

## Preprocessing categorical columns

# Ordinal Encoder
categorical_transformer_ordinal = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(categories=[['EC','E','K-A','E10+','T', 'M', 'AO', 'RP']])) 
])

# One hot encoder
categorical_transformer_ohe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')) 
])

# Bundle the preprocessors
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, num_cols),
    ('cat_ordinal', categorical_transformer_ordinal, ['Rating']),
    ('cat_ohe', categorical_transformer_ohe, ["Platform", "Genre", "blockbuster"])
])

In [50]:
# Create model
model = RandomForestRegressor(n_estimators=100, random_state=1,max_depth=10)


# Bundle preprocessing and modeling code in a pipeline
complete_pipeline2 = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocess the raw training data and fit the model
complete_pipeline2.fit(train_X, train_y)

# Preprocess the raw validation data and make predictions
preds = complete_pipeline2.predict(val_X)

# Evaluate the model
score = mean_absolute_error(val_y, preds)
print("MAE using the complete pipeline: {}".format(score))

MAE using the complete pipeline: 0.42701099280772986
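
To answer the question of whether the model performs better now, the two validation MAEs reported above can be compared directly:

```python
# Validation MAEs as printed by the two pipelines above
mae_without = 0.4404769639699466   # pipeline without the blockbuster column
mae_with = 0.42701099280772986     # pipeline with the blockbuster column

absolute_gain = mae_without - mae_with
relative_gain = absolute_gain / mae_without

print(f"MAE reduced by {absolute_gain:.4f} ({relative_gain:.1%})")
```

The error drops slightly, so the `blockbuster` column helps a little on this split. One caveat worth keeping in mind: the keyword list was chosen by looking at the best-selling titles of the whole dataset, validation rows included, so part of the gain may not carry over to truly unseen games.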
