# PREDICTING MINECRAFT SERVER NEWSLETTER SUBSCRIPTION USING PLAYER DEMOGRAPHICS AND BEHAVIOR

**Name:** Zhaoxuan Wu  
**GitHub:** [https://github.com/Shad2zz/Zhaoxuanwu-dsci-100](https://github.com/Shad2zz/Zhaoxuanwu-dsci-100)

## Background
- Video‐game research platforms (e.g., Minecraft servers) enable computer science researchers to collect real‐world player behavior data.  
- The UBC research group led by Frank Wood aims to leverage these data to optimize player recruitment and allocate server resources effectively.  
- Subscribing to the game newsletter serves as an indicator of player engagement and future interaction intent.

## Question
> “Can player demographics (age, gender, experience) and behavioral features (total play time, number of sessions, average session duration, night/weekend play proportion) predict whether a player will subscribe to the game newsletter?”

## Data Description
- **players.csv**  
  - **Observations:** 196  
  - **Variables (6):**  
    - `hashedEmail` (string): unique player identifier  
    - `experience` (numeric): cumulative experience points  
    - `played_hours` (numeric): total play time (hours)  
    - `subscribed` (factor): subscription status (“Yes”/“No”)  
    - `gender` (factor): gender (“Male”/“Female”/“Other”)  
    - `age` (numeric): age in years  
  - **Data Quality:** some missing age values; subscription rate approx. 60% Yes, 40% No

- **sessions.csv**  
  - **Observations:** 1,535  
  - **Variables (3):**  
    - `hashedEmail` (string): unique player identifier  
    - `start_time` (string datetime): session start time (UTC)  
    - `end_time` (string datetime): session end time (UTC)  
  - **Data Quality:** some sessions span midnight, requiring careful handling in feature engineering

> **Potential Issues:**  
> - Time zone alignment and timestamp consistency  
> - Players with no sessions or extremely long/short sessions  
> - Unobserved external factors (e.g., network outages, server maintenance) may influence behavior  
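
The issues listed above can be screened for before any modeling. The following is a minimal sketch on invented toy tables that reuse the column names from the data description; the toy values, the object names (`players_toy`, `sessions_toy`, `session_checks`), and the 12-hour cutoff for "extremely long" sessions are assumptions for illustration, not properties of the real data.

```r
library(tidyverse)
library(lubridate)

# Toy stand-ins for players.csv and sessions.csv (all values are invented).
players_toy <- tibble(
  hashedEmail = c("a", "b", "c"),
  age         = c(17, NA, 25),
  subscribed  = c("Yes", "No", "Yes")
)
sessions_toy <- tibble(
  hashedEmail = c("a", "a", "c"),
  start_time  = c("2024-06-29 23:30:00", "2024-06-30 10:00:00", "2024-07-01 08:00:00"),
  end_time    = c("2024-06-30 00:30:00", "2024-06-30 10:00:00", "2024-07-02 09:00:00")
)

# Count players with a missing age.
n_missing_age <- sum(is.na(players_toy$age))

# Players that never appear in the sessions table.
players_without_sessions <- anti_join(players_toy, sessions_toy, by = "hashedEmail")

# Flag zero-length or very long sessions (the 12-hour cutoff is an arbitrary assumption).
session_checks <- sessions_toy %>%
  mutate(
    duration_mins = as.numeric(
      difftime(ymd_hms(end_time), ymd_hms(start_time), units = "mins")
    ),
    suspicious = duration_mins <= 0 | duration_mins > 12 * 60
  )

n_missing_age
players_without_sessions
session_checks
```

Flagging suspicious rows rather than dropping them keeps the decision about exclusion visible and reversible.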

In [None]:
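library(tidyverse)   # readr, dplyr, tidyr, ggplot2
library(lubridate)   # date-time parsing and helpers
library(tidymodels)  # modeling framework
library(cowplot)     # plot arrangement

# Load both tables from the project repository.
players  <- read_csv("https://raw.githubusercontent.com/Shad2zz/Zhaoxuanwu-dsci-100/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/Shad2zz/Zhaoxuanwu-dsci-100/refs/heads/main/sessions.csv")

# Preview the first and last rows of each table.
head(players)
tail(players)
head(sessions)
tail(sessions)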


- Parse `start_time`/`end_time` as POSIXct date-times.
- Compute `duration_mins`, and extract `hour` and weekday (`wday`) from the session start.
- Flag sessions that start in night hours (20:00–06:00) and sessions that start on a weekend (Sat/Sun).
- Aggregate per player: `n_sessions`, `avg_duration`, `prop_night`, `prop_weekend`.
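
To make the flag definitions concrete, here is a small worked sketch on three invented sessions. The timestamps and the name `example_sessions` are assumptions for illustration. It also shows one consequence of flagging by start time: a session that begins on Friday night and ends on Saturday counts as a night session but not as a weekend session.

```r
library(tidyverse)
library(lubridate)

# Invented timestamps: 2024-01-03 is a Wednesday, 2024-01-05 a Friday,
# and 2024-01-06 a Saturday.
example_sessions <- tibble(
  start = ymd_hms(c("2024-01-03 14:00:00", "2024-01-05 23:30:00", "2024-01-06 05:15:00")),
  end   = ymd_hms(c("2024-01-03 15:00:00", "2024-01-06 00:30:00", "2024-01-06 06:45:00"))
) %>%
  mutate(
    duration_mins = as.numeric(difftime(end, start, units = "mins")),
    # Night means the session starts at 20:00 or later, or before 06:00.
    night         = hour(start) >= 20 | hour(start) < 6,
    # week_start = 1 makes Monday 1, so Saturday and Sunday are 6 and 7.
    weekend       = wday(start, week_start = 1) >= 6
  )

example_sessions
```

If sessions that cross midnight turn out to be common, one alternative would be to split each session at midnight and weight the flags by minutes played; that is a design choice, not something the steps above require.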

In [None]:
sessions_features <- sessions %>%
  mutate(
    # Parse the timestamp strings (ymd_hms expects year-month-day hour:minute:second input).
    start         = ymd_hms(start_time),
    end           = ymd_hms(end_time),
    duration_mins = as.numeric(difftime(end, start, units = "mins")),
    hour          = hour(start),
    # Monday = 1, so Saturday and Sunday are 6 and 7 regardless of locale.
    wday          = wday(start, week_start = 1),
    # Both flags use the session start only, even when a session spans midnight.
    night         = hour >= 20 | hour < 6,
    weekend       = wday >= 6
  ) %>%
  group_by(hashedEmail) %>%
  summarise(
    n_sessions   = n(),
    avg_duration = mean(duration_mins, na.rm = TRUE),
    prop_night   = mean(night, na.rm = TRUE),
    prop_weekend = mean(weekend, na.rm = TRUE)
  )
sessions_features
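
The per-player features still need to be attached to the player table, and players with no recorded sessions need explicit handling. The following is a minimal sketch on invented toy tables; `players_toy`, `features_toy`, and `model_data` are hypothetical names, and setting `n_sessions` to 0 while leaving the averages and proportions as `NA` is one possible choice, not the author's stated method.

```r
library(tidyverse)

# Toy stand-ins for the players table and the aggregated session features
# (all values are invented).
players_toy <- tibble(
  hashedEmail = c("a", "b", "c"),
  subscribed  = factor(c("Yes", "No", "Yes"))
)
features_toy <- tibble(
  hashedEmail  = c("a", "c"),
  n_sessions   = c(2L, 1L),
  avg_duration = c(45, 120),
  prop_night   = c(0.5, 0),
  prop_weekend = c(1, 0)
)

model_data <- players_toy %>%
  # A left join keeps every player, including those with no sessions.
  left_join(features_toy, by = "hashedEmail") %>%
  # A player with no sessions has a session count of zero; averages and
  # proportions are undefined for them, so they are left as NA here.
  mutate(n_sessions = replace_na(n_sessions, 0L))

model_data
```

Whether to drop or impute the remaining `NA` rows depends on the model used later, so the sketch leaves that decision open.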