In [None]:
# all libraries needed
library(tidyverse)
library(tidymodels)
library(forcats)
library(repr)
library(dplyr)

### Predicting the age of the user

Introduction

Online games allow for an extensive amount of data to be collected on users for understanding user behavior. A research group in Computer Science at the University of British Columbia (UBC), led by Frank Wood, is collecting data from a custom Minecraft server to study how people play video games. They aim to optimize their project by targeting recruitment efforts and ensuring server resources are sufficient for the player base. To do this, they need to better understand the characteristics and behaviors of their players. This raises the question: can a player's engagement metrics, specifically the total number of sessions played and total hours played, be used to predict whether a player is an adult (18+) or underaged (under 18) in the UBC server dataset?

The analysis uses two datasets provided by the research group: players.csv, containing one row of attributes per unique player, and sessions.csv, containing records of individual play sessions. To prepare the data for analysis, the two datasets were combined on the shared hashedEmail variable. The goal was to isolate each player's total number of sessions, total play hours, and age. The two files differ in granularity: sessions.csv contains 1535 rows (one per session), while players.csv contains only 196 rows (one per player). players.csv was therefore joined to sessions.csv using left_join(), which keeps every player and repeats that player's attributes on each of their session rows. Afterward, the joined table was grouped by player and summarized into a concise dataset containing age, total play time, and number of sessions for each unique player.

Methods and Results

As mentioned above, the goal of the wrangling step was to combine the two files into a form from which per-player information is easy to extract. Because I wanted to build a classification model, I also converted the numeric ages into two age groups, turning age into a class label. My reasoning was to create two distinct groups based on how much free time players typically have: players under 18 usually have more, since they generally do not yet have work or university commitments.

In [None]:
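# wrangling
player_data <- read_csv('data/players.csv')
session_data <- read_csv('data/sessions.csv')

# left join keeps every player, including those with no recorded sessions
merged_player_data <- left_join(player_data, session_data, by = 'hashedEmail')

write.csv(merged_player_data, "merged_file.csv", row.names = FALSE)

final_data_players <- merged_player_data |>
    # a player with no sessions still has one row after the join, so n() would
    # count them as having 1 session; flag rows that are real session matches
    mutate(has_session = hashedEmail %in% session_data$hashedEmail) |>
    group_by(hashedEmail) |>
    summarize(num_sessions = sum(has_session), Age = first(Age), played_hours = first(played_hours)) |>
    mutate(age_group = if_else(Age < 18, "Underaged", "Adult")) |>
    mutate(age_group = as_factor(age_group)) |>
    select(-Age, -hashedEmail)

head(final_data_players, n = 6)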
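
The effect of the left join on players without any sessions can be seen on a small made-up example. This is only a sketch: toy_players, toy_sessions, and the has_session helper column are hypothetical names with invented values, not part of the research group's data.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for players.csv and sessions.csv (all values are made up)
toy_players <- tibble(
    hashedEmail = c("a", "b", "c"),
    Age = c(16, 25, 30),
    played_hours = c(4.5, 1.2, 0)
)
# Player "a" has three sessions, "b" has one, "c" has none
toy_sessions <- tibble(hashedEmail = c("a", "a", "a", "b"))

toy_summary <- left_join(toy_players, toy_sessions, by = "hashedEmail") |>
    # TRUE only for rows that correspond to an actual session record
    mutate(has_session = hashedEmail %in% toy_sessions$hashedEmail) |>
    group_by(hashedEmail) |>
    summarize(
        rows_after_join = n(),            # counts the placeholder row for "c" as well
        num_sessions = sum(has_session),  # counts only real session matches
        Age = first(Age),
        played_hours = first(played_hours)
    )

toy_summary
```

In this toy table, player "c" keeps one row after the join even though they never played a session, which is why counting rows and counting matched sessions can disagree.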

To begin the exploratory data analysis, a scatter plot was created to investigate the relationship between the two primary predictor variables: the number of sessions a player has logged (num_sessions) and their total hours played (played_hours). The points on the plot are colored by the age_group response variable ("Adult" or "Underaged") to determine if there are any obvious visual patterns or clusters that distinguish the two groups. This helps to assess the potential difficulty of the classification task ahead.

As seen in Figure 1, there is a positive correlation between the number of sessions and the hours played. This is an expected relationship, as players who log in more frequently will naturally accumulate more total playtime.
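
One way to put a number on a relationship like this is a correlation coefficient. The sketch below uses invented values in a hypothetical toy_engagement data frame; choosing Spearman's rank correlation is an assumption made here for illustration, not something the analysis itself reports.

```r
# Made-up engagement values, for illustration only
toy_engagement <- data.frame(
    num_sessions = c(1, 2, 3, 5, 8, 13, 21),
    played_hours = c(0.3, 0.5, 1.5, 1.1, 4.0, 9.5, 12.0)
)

# Spearman's rank correlation works on ranks, so a few very heavy players
# influence it less than they would influence Pearson's correlation
toy_rho <- cor(
    toy_engagement$num_sessions,
    toy_engagement$played_hours,
    method = "spearman"
)
toy_rho
```

On the notebook's data, the same call could be made with final_data_players$num_sessions and final_data_players$played_hours.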

Crucially, in the context of our research question, there is no obvious visual separation between the "Adult" and "Underaged" data points. The two groups appear to be heavily intermingled across the entire range of engagement levels shown. This lack of a clear boundary is key as it suggests that a simple linear model wouldn't be able to distinguish between the two classes based on this information alone.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 7)
#first visualization for showing a correlation

session_to_hours_plot <- final_data_players |>
    ggplot(aes(x = num_sessions, y = played_hours, color = age_group)) +
    geom_point() +
    labs(x = 'number of sessions played',
         y = 'total hours played',
         color = 'Age group') +
    ggtitle('Figure 1') +
    xlim(0, 50) +
    ylim(0, 50)

session_to_hours_plot

The final player dataset was split into a training set (75% of the data) and a testing set (25% of the data). A stratified split was performed on the age_group variable to ensure that the proportion of "Adult" and "Underaged" players was similar in both the training and testing sets. The testing set was held out and not used for any training or tuning procedures.

In [None]:
# split training and testing data
set.seed(3456) 

player_split <- initial_split(final_data_players, prop = .75, strata = age_group)  
player_train <- training(player_split)   
player_test <- testing(player_split)
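
Whether a stratified split keeps the class balance similar in both pieces can be checked directly. The following is a minimal sketch on made-up data; toy_data, the 80/40 class counts, and the seed are assumptions chosen for illustration only.

```r
library(rsample)

set.seed(1)  # arbitrary seed for the toy example

# Made-up data: 80 "Adult" rows and 40 "Underaged" rows
toy_data <- data.frame(
    age_group = factor(rep(c("Adult", "Underaged"), times = c(80, 40))),
    played_hours = runif(120, min = 0, max = 50)
)

# Same kind of call as in the notebook: 75% training, stratified on the class
toy_split <- initial_split(toy_data, prop = 0.75, strata = age_group)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of "Adult" rows in each piece; with stratification both should be
# close to the overall share of 80 / 120
train_adult_share <- mean(toy_train$age_group == "Adult")
test_adult_share <- mean(toy_test$age_group == "Adult")
c(train = train_adult_share, test = test_adult_share)
```

The same comparison could be run on player_train and player_test to confirm that the proportions of "Adult" and "Underaged" players are similar in both sets.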

In [None]:
# model test
set.seed(3456) 

player_recipe <- recipe(age_group ~ num_sessions + played_hours , data = player_train) |>
   step_scale(all_predictors()) |>
   step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = 'rectangular', neighbors = 5) |>
      set_engine('kknn') |>
      set_mode('classification')

player_vfold <- vfold_cv(player_train, v = 10, repeats = 3, strata = age_group)

player_resample_fit <- workflow() |>
      add_recipe(player_recipe) |>
      add_model(knn_spec) |>
      fit_resamples(resamples = player_vfold) 

player_metrics <- collect_metrics(player_resample_fit)
player_metrics

In [None]:
#tune model

set.seed(3456) 

player_vfold <- vfold_cv(player_train, v = 10, strata = age_group)

k_vals <- tibble(neighbors = c(1:20))

knn_tune_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_tune_spec)

better_tuning_results <- knn_workflow |>
    tune_grid(resamples = player_vfold, grid = k_vals)

# show the best-performing values of k by cross-validated accuracy (top 5 by default)
results <- show_best(better_tuning_results, metric = "accuracy")
results

In [None]:
# select the best model and update the workflow

best_k <- results |>
  filter(.metric == "accuracy") |>
  filter(mean == max(mean)) |>
  arrange(neighbors) |>
  slice(1) |>
  pull(neighbors)

knn_tune_spec_final <- nearest_neighbor(neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_knn_workflow <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(knn_tune_spec_final)

final_model_fit <- fit(
  final_knn_workflow,
  data = player_train)

final_model_fit
