# PREDICTING MINECRAFT SERVER NEWSLETTER SUBSCRIPTION USING PLAYER DEMOGRAPHICS AND BEHAVIOR

**Name:** Zhaoxuan Wu  
**GitHub:** [https://github.com/Shad2zz/Zhaoxuanwu-dsci-100](https://github.com/Shad2zz/Zhaoxuanwu-dsci-100)

## Background
- Video‐game research platforms (e.g., Minecraft servers) enable computer science researchers to collect real‐world player behavior data.  
- The UBC research group led by Frank Wood aims to leverage these data to optimize player recruitment and allocate server resources effectively.  
- Subscribing to the game newsletter serves as an indicator of player engagement and future interaction intent.

## Question
> “Can player demographics (age, gender, experience) and behavioral features (total play time, number of sessions, average session duration, night/weekend play proportion) predict whether a player will subscribe to the game newsletter?”

## Data Description
- **players.csv**  
  - **Observations:** 196  
  - **Variables (6):**  
    - `hashedEmail` (string): unique player identifier  
    - `experience` (factor): experience level (categorical; converted to a factor in the code)  
    - `played_hours` (numeric): total play time (hours)  
    - `subscribe` (factor): newsletter subscription status (read as TRUE/FALSE, converted to a factor in the code)  
    - `gender` (factor): gender (“Male”/“Female”/“Other”)  
    - `Age` (numeric): age in years  
  - **Data Quality:** some missing `Age` values; subscription rate approx. 60% subscribed, 40% not subscribed

- **sessions.csv**  
  - **Observations:** 1,535  
  - **Variables (3):**  
    - `hashedEmail` (string): unique player identifier  
    - `start_time` (string datetime): session start time (UTC)  
    - `end_time` (string datetime): session end time (UTC)  
  - **Data Quality:** some sessions span midnight, requiring careful handling in feature engineering

> **Potential Issues:**  
> - Time zone alignment and timestamp consistency  
> - Players with no sessions or extremely long/short sessions  
> - Unobserved external factors (e.g., network outages, server maintenance) may influence behavior  

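The midnight-spanning and extreme-duration concerns above can be checked before any feature engineering. The following is a minimal sketch on hypothetical toy data: the `toy_sessions` values and the `spans_midnight` and `zero_or_negative` column names are illustrative assumptions, not part of the dataset. It assumes timestamps are stored as day/month/year strings, matching the `dmy_hm()` parsing used later in this notebook.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy sessions in the same string format as sessions.csv
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = c("30/06/2024 23:50", "01/07/2024 10:00", "02/07/2024 08:00"),
  end_time    = c("01/07/2024 00:20", "01/07/2024 10:45", "02/07/2024 08:00")
)

session_checks <- toy_sessions %>%
  mutate(
    start_time = dmy_hm(start_time, tz = "UTC"),
    end_time   = dmy_hm(end_time, tz = "UTC"),
    # Duration in minutes; difftime handles the date change correctly
    duration_mins = as.numeric(difftime(end_time, start_time, units = "mins")),
    # Flag sessions whose start and end fall on different calendar days
    spans_midnight = as_date(start_time) != as_date(end_time),
    # Flag sessions with no measurable length
    zero_or_negative = duration_mins <= 0
  )

session_checks
```

Because durations are computed from full timestamps rather than clock times, a session that crosses midnight still gets a positive duration; the flag is mainly useful when deciding how to assign such a session to a night or weekend bucket.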
In [None]:
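library(tidyverse)   # data wrangling and plotting (attaches dplyr, tidyr, ggplot2, readr)
library(lubridate)   # date-time parsing
library(tidymodels)  # modeling framework (recipes, parsnip, tune, yardstick)
library(cowplot)     # plot arrangement helpers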




players  <- read_csv("https://raw.githubusercontent.com/Shad2zz/Zhaoxuanwu-dsci-100/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/Shad2zz/Zhaoxuanwu-dsci-100/refs/heads/main/sessions.csv")


head(players)
tail(players)
head(sessions)
tail(sessions)


In [None]:
# Parse timestamps and derive per-session features
sessions <- sessions %>%
  mutate(start_time = dmy_hm(start_time),
         end_time = dmy_hm(end_time),
         session_duration = as.numeric(difftime(end_time, start_time, units = "mins")),
         # Night: session starts at or after 21:00, or before 06:00
         is_night = hour(start_time) >= 21 | hour(start_time) < 6,
         # With lubridate's default week start, 1 = Sunday and 7 = Saturday
         is_weekend = wday(start_time) %in% c(1, 7))

# Aggregate session features per player
behavior_summary <- sessions %>%
  group_by(hashedEmail) %>%
  summarise(
    num_sessions = n(),
    avg_session_duration = mean(session_duration, na.rm = TRUE),
    night_play_ratio = mean(is_night),
    weekend_play_ratio = mean(is_weekend)
  )

# Attach session features to each player; players without sessions get 0
final_df <- players %>%
  left_join(behavior_summary, by = "hashedEmail") %>%
  mutate(
    across(c(num_sessions, avg_session_duration, night_play_ratio, weekend_play_ratio),
           ~ replace_na(., 0)),
    # Treat the outcome and the categorical predictors as factors
    subscribe = as.factor(subscribe),
    gender = as.factor(gender),
    experience = as.factor(experience)
  )

Left‐join `behavior_summary` to `players` on `hashedEmail`.

Replace NA in the new session features with 0 (players with no recorded sessions).

Convert `subscribe`, `gender`, and `experience` to factors.
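To see which players the zero-fill affects, the join can be paired with an `anti_join()`. This is a minimal sketch on hypothetical toy tables (`toy_players` and `toy_behavior` are illustrative stand-ins, not the real data). Note that filling `avg_session_duration` with 0 is a modeling choice: it treats "no sessions" as "zero-length sessions", which may or may not be appropriate.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for players and behavior_summary
toy_players  <- tibble(hashedEmail = c("a", "b", "c"),
                       played_hours = c(1.5, 0.2, 0))
toy_behavior <- tibble(hashedEmail = c("a", "b"),
                       num_sessions = c(3L, 1L))

# Players with no row in the session summary (these receive NA after the join)
no_sessions <- anti_join(toy_players, toy_behavior, by = "hashedEmail")

# Left join keeps every player; missing session counts become 0
joined <- toy_players %>%
  left_join(toy_behavior, by = "hashedEmail") %>%
  mutate(num_sessions = replace_na(num_sessions, 0L))

no_sessions
joined
```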

In [None]:
final_df %>%
  pivot_longer(cols = c(played_hours, num_sessions, avg_session_duration, night_play_ratio, weekend_play_ratio),
               names_to = "metric", values_to = "value") %>%
  ggplot(aes(x = subscribe, y = value, fill = subscribe)) +
  geom_boxplot() +
  facet_wrap(~ metric, scales = "free") +
  labs(title = "Comparison of Features by Subscription Status",
       x = "Subscribed", y = "Value") +
  theme_minimal()

Exploratory analysis steps:

- Compute the mean and SD of each predictor by `subscribe` status.
- Figure 1: boxplots of `played_hours` and the session-derived features by `subscribe` (the cell above).
- Figure 2: stacked bar chart of `gender` vs. `subscribe`.
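The summary statistics and the gender bar chart described above could be produced along these lines. This is a sketch on hypothetical toy data (`toy_df` and its values are assumptions for illustration); with the real data, `final_df` would take its place and more predictors would be listed inside `across()`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same column names as final_df
toy_df <- tibble(
  subscribe    = factor(c("TRUE", "TRUE", "FALSE", "FALSE")),
  gender       = factor(c("Male", "Female", "Male", "Male")),
  played_hours = c(2, 4, 1, 3)
)

# Mean and SD of a numeric predictor within each subscription group
summary_by_status <- toy_df %>%
  group_by(subscribe) %>%
  summarise(
    across(c(played_hours), list(mean = mean, sd = sd)),
    .groups = "drop"
  )

# Stacked bar chart: share of subscribers within each gender
gender_plot <- ggplot(toy_df, aes(x = gender, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(x = "Gender", y = "Proportion", fill = "Subscribed")

summary_by_status
gender_plot
```

Using `position = "fill"` scales each bar to 1, which makes subscription shares comparable across gender groups of different sizes; `position = "stack"` would show raw counts instead.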

In [None]:
set.seed(42)

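# Split the data (80% train, 20% test), stratified by the outcome
data_split <- initial_split(final_df, prop = 0.8, strata = subscribe)
train_data <- training(data_split)
test_data <- testing(data_split)

# Preprocessing: encode categorical variables, standardize numeric variables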
knn_recipe <- recipe(subscribe ~ Age + gender + experience + played_hours + 
                       num_sessions + avg_session_duration + 
                       night_play_ratio + weekend_play_ratio,
                     data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

Split data into train (80%) and test (20%) sets, stratified by `subscribe`.

Define a K-nearest neighbors workflow with dummy encoding and normalization.

Perform 5-fold CV on the training set to evaluate AUC and accuracy across candidate values of k.

Fit the final model and assess it on the test set.
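One way to confirm that stratification behaves as intended is to compare class proportions in the two partitions. The sketch below uses hypothetical toy data with a 60/40 class balance (`toy_df` is an assumption for illustration, not the real data).

```r
library(dplyr)
library(rsample)

set.seed(42)

# Hypothetical toy data: 60 subscribers, 40 non-subscribers
toy_df <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(60, 40))),
  played_hours = runif(100)
)

# Stratified 80/20 split, mirroring the call used on final_df
toy_split <- initial_split(toy_df, prop = 0.8, strata = subscribe)

# Share of subscribers in each partition; both should be close to 0.6
train_prop <- mean(training(toy_split)$subscribe == "TRUE")
test_prop  <- mean(testing(toy_split)$subscribe == "TRUE")

c(train = train_prop, test = test_prop)
```

With fewer than 200 players, an unstratified split could leave the test set with a noticeably different subscription rate, which would make test accuracy harder to interpret.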



In [None]:
knn_spec <- nearest_neighbor(neighbors = tune(), weight_func = "rectangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Cross-validation folds
cv_folds <- vfold_cv(train_data, v = 5, strata = subscribe)

# workflow
knn_wf <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec)

# Grid of candidate k values
knn_grid <- tibble(neighbors = seq(1, 15, 2))

# Tune over the grid with cross-validation
knn_results <- tune_grid(knn_wf,
                         resamples = cv_folds,
                         grid = knn_grid,
                         metrics = metric_set(accuracy, roc_auc))

# View results
knn_results %>% collect_metrics()
knn_results

In [None]:
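# Pick the k with the highest cross-validated accuracy
best_k <- knn_results %>%
  select_best(metric = "accuracy")

# Finalize the workflow with the best k
final_knn <- finalize_workflow(knn_wf, best_k)

# Fit on the training set, then evaluate on the test set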
knn_fit <- fit(final_knn, data = train_data)

knn_predictions <- predict(knn_fit, test_data, type = "prob") %>%
  bind_cols(predict(knn_fit, test_data), test_data)

# View model accuracy
metrics(knn_predictions, truth = subscribe, estimate = .pred_class)

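# Plot the ROC curve
# Assumes subscribe was read as logical, so the factor levels are FALSE/TRUE
# and the positive-class probability column is .pred_TRUE
knn_predictions %>%
  roc_curve(truth = subscribe, .pred_TRUE, event_level = "second") %>%
  autoplot() +
  labs(title = "ROC Curve for KNN Classifier")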
knn_predictions
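Beyond overall accuracy, a confusion matrix and a test-set AUC show where the classifier makes its errors. The following is a sketch on hypothetical toy predictions (`toy_preds` and its values are assumptions for illustration); with the real results, `knn_predictions` would take its place. It assumes the factor levels are `FALSE`/`TRUE`, so the positive class is the second level.

```r
library(dplyr)
library(yardstick)

# Hypothetical toy predictions shaped like knn_predictions
toy_preds <- tibble(
  subscribe   = factor(c("TRUE", "TRUE", "FALSE", "FALSE", "TRUE"),
                       levels = c("FALSE", "TRUE")),
  .pred_class = factor(c("TRUE", "FALSE", "FALSE", "TRUE", "TRUE"),
                       levels = c("FALSE", "TRUE")),
  .pred_TRUE  = c(0.9, 0.4, 0.2, 0.6, 0.8)
)

# Counts of predicted vs. true classes
toy_conf <- conf_mat(toy_preds, truth = subscribe, estimate = .pred_class)

# Share of correct class predictions
toy_acc <- accuracy(toy_preds, truth = subscribe, estimate = .pred_class)

# AUC with "TRUE" (the second factor level) as the event of interest
toy_auc <- roc_auc(toy_preds, truth = subscribe, .pred_TRUE,
                   event_level = "second")

toy_conf
toy_acc
toy_auc
```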