### DSCI100 Group 006-35 W2 Final Project

Younghoon Kim 11371317 <br>
Kentaro Barnes 49524861 <br>
Sean Liou 86967916 <br>
Matthew Pabustan 48718266 <br>

## How do player experience level and age influence the total number of hours played?

## Introduction

Video games are a valuable source of data for studying human behavior and online interaction. At the University of British Columbia, a research group led by Professor Frank Wood has created a Minecraft-based research server designed to collect such data. Here, players' actions are continuously logged as they explore the game. That said, managing the server presents several logistical challenges. Namely, the research team must ensure that they have enough hardware and software resources to handle peak player loads, and must also prioritize recruitment of participants who are likely to contribute meaningful amounts of data.

In this project, we investigate the relationship between player experience level, age, and the total number of hours played. Specifically, we aim to answer the question:
**How do player experience level and age influence the total number of hours played?** We believe that understanding this relationship can help the research team better predict which players are likely to engage deeply with the server over time, enabling them to recruit more effectively and plan resources accordingly.

To explore this question, we use the players.csv dataset, which contains information about a variety of player demographics as well as each player's experience level, engagement with the server, and subscription status. We have also included the sessions.csv dataset, which contains information about individual players' sessions on the server, to support further analysis.

## Datasets
### players.csv overview
A list of 196 unique participants who played on the Minecraft server hosted for a scientific study, with the following information for each player:
##### experience
- Originally a character `<chr>` variable, converted to factor `<fct>` containing one of the following:
    - Beginner (35)
    - Regular (36)
    - Amateur (63)  
    - Veteran (48)
    - Pro (14)
- Represents a player's self-reported experience in the game.
##### subscribe
- A logical `<lgl>` variable where TRUE indicates that the player is subscribed to the game-related newsletter, and FALSE indicates that the player is not
    - 144 players reported TRUE, 52 reported FALSE.
##### hashedEmail
- a string of characters `<chr>` that acts as a unique id to identify players in sessions.csv
##### played_hours
- A double `<dbl>` variable, indicating the number of hours a participant has spent playing on the server.
    - On average, each player spends a total of 5.85 hours.
##### name
- A character `<chr>` variable containing the player's real (first) name.
##### gender
- Originally a character `<chr>` variable, converted to factor `<fct>` containing one of the following:
    - Male (124)
    - Female (37)
    - Non-binary (15)
    - Prefer not to say (11)
    - Agender (2)
    - Two-Spirited (6)
    - Other (1)
- Represents a player's gender
##### Age
- A double `<dbl>` variable, indicating the age of the player in years.
    - On average, the players are 20.52 years old.
    - Two rows/players contain missing Age data, and thus are removed from the list.

### sessions.csv overview
A catalogue of all 1535 instances where a player logs into the server with the following information for each instance.
##### hashedEmail
- A string of characters `<chr>` indicating which unique player logged on, allowing us to link session information to the personal information in players.csv.
##### start_time
- Originally a string of characters `<chr>` indicating the time (Day/Month/Year Hour:Minute) when the player logs ON
    - Converted to a date-time `<dttm>` variable for ease of use.
##### end_time
- Originally a string of characters `<chr>` indicating the time (Day/Month/Year Hour:Minute) when the player logs OFF
    - Converted to a date-time `<dttm>` variable for ease of use.
##### original_start_time
- A double `<dbl>` variable that represents the time a player logs ON, in number of milliseconds since 1970.
    - Not reported to high enough precision for comparison, and thus is removed.
##### original_end_time
- A double `<dbl>` variable that represents the time a player logs OFF, in number of milliseconds since 1970.
    - Not reported to high enough precision for comparison, and thus is removed.
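
The conversion of `start_time` and `end_time` from strings to date-times can be sketched as follows. This is a minimal base R illustration, not the exact code used in the project: the two timestamps are made up, and the UTC time zone is an assumption since the dataset description does not state one.

```r
# Hypothetical session timestamps in the Day/Month/Year Hour:Minute layout
start_chr <- "30/06/2024 18:12"
end_chr   <- "30/06/2024 18:24"

# Parse the strings into date-time (POSIXct) values; the UTC zone is an assumption
start_dttm <- as.POSIXct(start_chr, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_dttm   <- as.POSIXct(end_chr,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes, as a plain number
session_minutes <- as.numeric(difftime(end_dttm, start_dttm, units = "mins"))
```

Once both columns are date-times, session lengths can be computed by subtraction, which is not possible with the original character values.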



## Methods

#### Data Acquisition and Data Collection
- Load necessary R libraries
- Load the *players.csv* and *sessions.csv* stored on our GitHub

#### Data Cleaning and Wrangling
- Tidy the data to contain only the information we need
- The players data set will be the main data set we use for our report
- The three main variables we will work with are **experience**, **played_hours**, and **Age**
- Select those three variables and derive a new variable from **experience** called **experience_numeric**, an ordinal (integer-coded) version of the experience variable
- Filter for played_hours values of less than 60 to get rid of any outliers

Our **experience_numeric** variable maps as follows:

| Experience   | experience_numeric |
|--------|-----|
| Beginner  | 1  |
| Amateur    | 2  |
| Regular  | 3  |
| Veteran  | 4  |
| Pro    | 5  |
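
The mapping above depends on declaring the factor levels in the intended order. The sketch below uses a made-up vector (`toy_experience` and its values are illustrative, not rows from players.csv) to show the effect: without an explicit `levels` argument, R sorts the labels alphabetically, so Amateur would receive code 1 instead of Beginner.

```r
# Hypothetical labels for illustration only (not rows from players.csv)
toy_experience <- c("Pro", "Beginner", "Regular", "Amateur", "Veteran")

# Declare the intended order so the integer codes follow the table above
ordered_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
toy_factor <- factor(toy_experience, levels = ordered_levels, ordered = TRUE)
toy_codes <- as.integer(toy_factor)

# For comparison: default (alphabetical) levels give a different coding
default_codes <- as.integer(factor(toy_experience))
```

Here `toy_codes` follows the table (Pro is 5, Beginner is 1), while `default_codes` does not.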

In [None]:
# Load necessary libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

set.seed(69420)

In [None]:
# Reading the data

# URL to the csv files stored on github
players_url <- "https://raw.githubusercontent.com/KentoBaguetti/DSCI100-GroupProjcect/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/KentoBaguetti/DSCI100-GroupProjcect/refs/heads/main/sessions.csv"

# read the csv data into dataframes
players <- read_csv(players_url)
sessions <- read_csv(sessions_url)

players
sessions

In [None]:
# Tidy the data
tidy_players <- players |>
    mutate(
        # Make experience an ordered factor, then take its integer codes (1 to 5)
        experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"), ordered = TRUE),
        experience_numeric = as.integer(experience)
    ) |>
    select(played_hours, experience_numeric, Age) |>
    filter(played_hours < 60)

tidy_players

In [None]:
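# Histogram for player age and experience distribution
# (experience is converted to a factor so each level gets its own fill color)
age_experience_dist <- tidy_players |>
    ggplot(aes(x = Age, fill = as.factor(experience_numeric))) +
    geom_histogram(binwidth = 1) +
    labs(x = "Age (Years)", y = "Number of Players", fill = "Experience Level")

age_experience_dist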

In [None]:
# Age vs Play time
prepredict_plot <- tidy_players |>
    ggplot(aes(x=Age, y=played_hours, color = as.factor(experience_numeric))) +
    geom_point() +
    labs(x = "Age (Years)",
         y = "Total Hours Played (Hours)",
         color = "Experience Level") +
    ggtitle("Distribution of Playtime Across Experience Level and Age (Figure 1)") +
    theme(text=element_text(size=15))

prepredict_plot

### Figure 1 ^^^
A scatter plot of our tidy data before any predictions: total hours played vs age, with points colored by experience level.

In [None]:
# Use KNN regression because the data show no clear linear relationship

players_split <- initial_split(tidy_players, prop = 0.75, strata = played_hours)
players_train <- training(players_split)
players_test <- testing(players_split)

# Build the recipe from the training set only, so the test set stays unseen
knn_recipe <- recipe(played_hours ~ Age, data = players_train) |>
    step_impute_mean(all_numeric_predictors()) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

players_vfold <- vfold_cv(players_train, v = 5, strata = played_hours)

players_wf <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) 

players_wf


In [None]:
gridvals <- tibble(neighbors = seq(from = 1, to = 30, by = 1))

players_results <- players_wf |>
    tune_grid(resamples = players_vfold, grid = gridvals) |>
    collect_metrics() |>
    filter(.metric == "rmse")

players_results

rmse_plot <- players_results |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_line() +
    geom_point() +
    ggtitle("Mean RMSE vs Neighbors (Figure 2)") + 
    xlab("K Neighbors") +
    ylab("Mean RMSE (Hours)") +
    theme(text=element_text(size=15))

rmse_plot

### Figure 2 ^^^
Mean cross-validated RMSE vs number of neighbors, used to find the best number of neighbors for our prediction analysis.
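
For reference, the RMSE metric plotted in Figure 2 is the square root of the average squared prediction error. A minimal sketch of that calculation follows; the `observed` and `predicted` vectors are made-up values for illustration, not results from our model.

```r
# Hypothetical observed and predicted played hours (illustrative values only)
observed  <- c(0.0, 1.5, 3.0, 10.0)
predicted <- c(1.0, 1.5, 2.0, 6.0)

# RMSE: square the errors, average them, then take the square root
rmse <- sqrt(mean((observed - predicted)^2))
```

Because the errors are squared before averaging, a single large miss (here 4 hours) dominates the result, and the final value is expressed in the same units as played_hours.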

In [None]:
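# Keep the single row with the lowest mean RMSE
# (with_ties = FALSE returns one row even if several values of k tie)
min_neighbor <- players_results |>
    slice_min(mean, n = 1, with_ties = FALSE)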

min_neighbor

# lowest rmse is when k = 27

k_min <- min_neighbor |>
    pull(neighbors)

k_min

In [None]:
players_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
    set_engine("kknn") |>
    set_mode("regression")

players_final_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(players_final_spec) |>
    fit(data = players_train)

players_summary <- players_final_fit |>
    predict(players_test) |>
    bind_cols(players_test) |>
    metrics(truth = played_hours, estimate = .pred) |>
    filter(.metric == "rmse")

players_summary

In [None]:
players_prediction_grid <- expand_grid(
  Age = seq(
    from = min(tidy_players$Age, na.rm = TRUE),
    to   = max(tidy_players$Age, na.rm = TRUE),
    by = 1
  ),
  experience_numeric = 1:5 
)

players_pred <- players_final_fit |>
    predict(players_prediction_grid)|>
    bind_cols(players_prediction_grid)

players_plot_final <- tidy_players |>
    ggplot(aes(x = Age, y = played_hours, color = as.factor(experience_numeric))) +
    geom_point(alpha = 0.4) +
    geom_line(data = players_pred, mapping = aes(x = Age, y = .pred),
            color = "steelblue",
            linewidth = 1) +
    xlab("Age of Users") +
    ylab("Number of Played Hours") +
    ggtitle("Predicted Player Hours vs Age of Users (Figure 3)") +
    labs(color = "Experience (Factored)") +
    theme(text=element_text(size=15))

players_plot_final

### Figure 3 ^^^
Scatter plot of our data with our predicted k nearest neighbors regression line
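
To make the prediction line easier to interpret, the sketch below shows what an unweighted ("rectangular") k nearest neighbors regression does for a single predictor: it averages the played hours of the k training points whose ages are closest to the query age. This is a hand-rolled base R illustration, not the kknn implementation; the helper `knn_predict` and the training values are hypothetical, and tie handling may differ from kknn.

```r
# Hypothetical training data (illustrative values, not from players.csv)
train_age   <- c(17, 18, 20, 22, 25, 30)
train_hours <- c(2.0, 0.5, 1.0, 4.0, 0.0, 3.0)

# Predict played hours at one age as the unweighted mean of the k nearest ages
knn_predict <- function(new_age, k) {
  nearest <- order(abs(train_age - new_age))[seq_len(k)]
  mean(train_hours[nearest])
}

# Prediction for a 19-year-old using the 3 closest training ages (18, 20, 17)
pred_at_19 <- knn_predict(19, k = 3)
```

With only one predictor, centering and scaling do not change which neighbors are closest, so this unscaled sketch selects the same neighbors a standardized version would.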

In [None]:
tidy_players <- players |>
  mutate(
    experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))
  ) |>
  select(played_hours, experience, Age) |>
  filter(played_hours < 60, !is.na(Age))


age_grid <- tibble(Age = seq(min(tidy_players$Age), max(tidy_players$Age), by = 1))


predict_per_experience <- function(experience_level) {
  
  data_exp <- tidy_players |> filter(experience == experience_level)
  
  set.seed(123)
  data_split <- initial_split(data_exp, prop = 0.75)
  data_train <- training(data_split)
  
  
  rec <- recipe(played_hours ~ Age, data = data_train) |>
    step_impute_mean(all_predictors()) |>
    step_normalize(all_predictors())
  
  spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>  # Adjust k as needed
    set_engine("kknn") |>
    set_mode("regression")
  
  wf <- workflow() |>
    add_recipe(rec) |>
    add_model(spec)
  
  
  fit <- wf |> fit(data = data_train)
  
  
  preds <- fit |> predict(age_grid) |>
    bind_cols(age_grid) |>
    mutate(experience = experience_level)
  
  return(preds)
}


experience_levels <- levels(tidy_players$experience)


all_preds <- map_dfr(experience_levels, predict_per_experience)


tidy_players |> 
  ggplot(aes(x = Age, y = played_hours, color = experience)) +
  geom_point(alpha = 0.4) +
  geom_line(data = all_preds, aes(x = Age, y = .pred, color = experience), linewidth = 1.2) +
  labs(title = "Predicted Played Hours vs Age by Experience Level (Figure 4)",
       x = "Age", y = "Played Hours") +
    theme(text=element_text(size=15))

### Figure 4 ^^^
Scatter plot with 5 different k nearest neighbor regression lines, one for each experience level

## Discussion

### Our results
placeholder-----------

### Expected findings vs results
placeholder-----------

### Significance
placeholder-----------

### What future questions could this lead to?
placeholder-----------