# Can Played Hours and Age Predict Newsletter Subscription?
## A Data-Science Analysis of the UBC Minecraft Research-Server Logs
*May Wei · DSCI 100 · UBC, 2025-06-17*
## Link to GitHub repository

## 1. Introduction
### 1.1 Background
A research group at the University of British Columbia has launched a Minecraft server that records how players behave in virtual environments.  The server collects rich in-game activity data, which can be used to study user engagement and support research in human-computer interaction and AI.

To maintain engagement and allocate server resources effectively, the research team uses a game-related newsletter.  Predicting which players are likely to subscribe can help with targeted recruitment and infrastructure planning.

In the commercial gaming industry, predictive marketing is widely used to retain players by sending customized offers to those at risk of leaving (Ghantasala, 2024).  Similarly, understanding which players are more inclined to subscribe to game newsletters can improve outreach and user management.

This project investigates whether player characteristics and gameplay activity can predict newsletter subscription status, focusing on two variables recorded in the player data: age and total hours played.

### Research Question
Can played hours and age predict newsletter subscription in players?

The response variable is the binary flag **`subscribe`**, and the explanatory variables are
1. **`played_hours`** – cumulative play time (hours),
2. **`Age`** – self-reported age (years).

### 1.2 Data Description
The `players.csv` file contains the following columns:

- **`experience`**: Categorical (Beginner, Amateur, Regular, Veteran, Pro)
- **`subscribe`**: Boolean (TRUE/FALSE); newsletter subscription status (response variable)
- **`hashedEmail`**: Unique identifier for each player
- **`played_hours`**: Numeric; total hours played
- **`name`**: Player name
- **`gender`**: Categorical (Male, Female, Non-binary, Prefer not to say, Two-Spirited, Agender, Other)
- **`Age`**: Numeric; player age

## 2. Methods & Results

### 2.1 Load data

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
player <- read_csv("players.csv")
head(player)

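### 2.2 Clean the data
The analysis uses only played hours and age as predictors, so the columns that are not needed (`hashedEmail`, `name`, `experience`, and `gender`) are removed. The response `subscribe` is converted to a factor so that it can be used for classification.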

In [None]:
# Convert the response to a factor and drop the columns that are not used as predictors
clean_player <- player |>
  mutate(subscribe = as.factor(subscribe)) |>
  select(-hashedEmail, -name, -experience, -gender)
head(clean_player)

In [None]:
# Exploratory plot: played hours against age, colored by experience level
ggplot(player, aes(x = Age, y = played_hours)) +
  geom_point(aes(color = experience)) +
  labs(
    x = "Age (years)",
    y = "Played time (hours)",
    color = "Experience"
  )

In [None]:
set.seed(1234)

players_split <- initial_split(clean_player, prop = 0.7, strata = subscribe)
players_training <- training(players_split)
players_testing <- testing(players_split)

players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
players_recipe
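
K-nearest neighbors relies on distances between players, so predictors measured on different scales contribute unequally unless they are standardized. That is the motivation for the `step_scale()` and `step_center()` steps. The base-R sketch below uses four made-up players (hypothetical values, not rows from `players.csv`) to illustrate how the nearest neighbor of a player can change once both columns are standardized.

```r
# Toy data (hypothetical values, not taken from players.csv)
toy <- data.frame(
  played_hours = c(0.1, 0.2, 3.0, 3.1),
  Age          = c(20, 30, 22, 35)
)

# Euclidean distances from player 1 to players 2-4 on the raw scale
raw_dist <- as.matrix(dist(toy))[1, 2:4]

# Standardize each column to mean 0 and standard deviation 1
toy_scaled  <- as.data.frame(scale(toy))
scaled_dist <- as.matrix(dist(toy_scaled))[1, 2:4]

# Row number of the nearest neighbor of player 1 in each case
nearest_raw    <- unname(which.min(raw_dist)) + 1
nearest_scaled <- unname(which.min(scaled_dist)) + 1

nearest_raw
nearest_scaled
```

With these made-up values, the raw distances are dominated by `Age`, so player 3 is nearest to player 1. After standardization, player 2, who has almost the same played hours, becomes the nearest.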

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
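
Setting `weight_func = "rectangular"` means that each of the K nearest neighbors gets an equal vote. The following base-R sketch illustrates that voting rule on made-up, already standardized points. The helper `knn_vote()` and the data are hypothetical and only meant to show the idea; the `kknn` engine's internals differ.

```r
# Hypothetical helper: unweighted (rectangular) K-nearest-neighbor vote
knn_vote <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training point
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; every neighbor counts equally
  names(which.max(table(nearest)))
}

# Toy, already standardized data (made up for illustration)
train_x <- data.frame(
  played_hours = c(-1.0, -0.8, 0.9, 1.1, 1.0),
  Age          = c(-0.5, -0.7, 0.6, 0.4, 0.8)
)
train_y <- factor(c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE"))

# Predict the class of a new point from its 3 nearest neighbors
knn_vote(train_x, train_y, new_x = c(0.8, 0.5), k = 3)
```

In this sketch ties are broken by taking the first class level, which is one reason an odd K is often convenient for a two-class problem.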

In [None]:
players_vfold <- vfold_cv(players_training, v = 5, strata = subscribe)
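
Stratified 5-fold cross-validation assigns each training row to one of five folds while keeping the class mix of `subscribe` roughly the same in every fold. The base-R sketch below shows the idea on made-up labels (the class counts are hypothetical); the exact assignment procedure used by `vfold_cv()` may differ.

```r
set.seed(1234)

# Made-up labels with a class imbalance (hypothetical counts)
labels <- factor(rep(c("TRUE", "FALSE"), times = c(70, 30)))
v <- 5

# Assign fold ids within each class so every fold keeps the same class mix
fold_id <- integer(length(labels))
for (lvl in levels(labels)) {
  idx <- which(labels == lvl)
  fold_id[idx] <- sample(rep(seq_len(v), length.out = length(idx)))
}

# Class counts per fold
fold_counts <- table(fold = fold_id, class = labels)
fold_counts
```

Each fold is held out once for validation while the model is fit on the other four, and the reported accuracy for a given K is the mean over the five held-out folds.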

In [None]:
set.seed(1234)
k_val <- tibble(neighbors = seq(from = 1, to = 30, by = 1))
knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_val) |>
  collect_metrics()
# Keep only the accuracy rows; later cells refer to this table as `accuracies`
accuracies <- filter(knn_results, .metric == "accuracy")
accuracies

In [None]:
accuracy_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy")

accuracy_plot
best_k <- accuracies |>
  arrange(desc(mean)) |>
  head(1) |>
  pull(neighbors)

best_k

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_training)

my_prediction <- predict(knn_fit, players_testing) |>
  bind_cols(players_testing)

my_prediction

In [None]:
# Accuracy on the held-out test set (accuracy() has no event_level argument)
test_accuracy <- my_prediction |>
  accuracy(truth = subscribe, estimate = .pred_class)

test_accuracy
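
Accuracy on its own can be hard to interpret when one class is more common than the other. A rough base-R sketch of a majority-class baseline and a confusion matrix follows, using made-up labels (hypothetical values for illustration only, not results from this analysis).

```r
# Made-up truth and predictions (hypothetical, for illustration only)
truth <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "TRUE"),
                levels = c("FALSE", "TRUE"))
pred  <- factor(c("TRUE", "TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE"),
                levels = c("FALSE", "TRUE"))

# Accuracy of the classifier: share of predictions that match the truth
model_accuracy <- mean(pred == truth)

# Baseline: always predict the most common class
baseline_accuracy <- max(table(truth)) / length(truth)

# Confusion matrix: rows are predictions, columns are true labels
confusion <- table(Predicted = pred, Truth = truth)

model_accuracy
baseline_accuracy
confusion
```

In this toy case the classifier only matches the baseline of 0.75, which would suggest it adds little over always guessing the majority class. The same comparison could be made on `my_prediction`, for example with `conf_mat(my_prediction, truth = subscribe, estimate = .pred_class)` from `yardstick`.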