# Can Played Hours, Age, and Gender Predict Newsletter Subscription?
## A Data-Science Analysis of the UBC Minecraft Research-Server Logs
*May Wei · DSCI 100 · UBC, 2025-06-17*
## Link to GitHub repository

## 1. Introduction
### 1.1 Background
A research group at the University of British Columbia has launched a Minecraft server that records how players behave in virtual environments.  The server collects rich in-game activity data, which can be used to study user engagement and support research in human-computer interaction and AI.

To maintain engagement and allocate server resources effectively, the research team uses a game-related newsletter.  Predicting which players are likely to subscribe can help with targeted recruitment and infrastructure planning.

In the commercial gaming industry, predictive marketing is widely used to retain players by sending customized offers to those at risk of leaving (Ghantasala, 2024).  Similarly, understanding which players are more inclined to subscribe to game newsletters can improve outreach and user management.

This project investigates whether a player’s demographic information (age and gender) and gameplay activity (total hours played) can predict newsletter subscription status.

### Research Question 
 Can played hours, age, and gender predict newsletter subscription in players?

The response variable is the logical column **`subscribe`**, and the explanatory variables are  
1. **`played_hours`** – cumulative play time (hours),  
2. **`Age`** – self-reported age (years),  
3. **`gender`** – self-reported gender identity.

### 1.2 Data Description

In [11]:
library(tidyverse)

In [20]:
player <- read_csv("players.csv")
head(player)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17
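
Before modelling, it may help to check for missing values and get a rough sense of each variable. The sketch below does this on a toy tibble built from the six rows printed above, with the identifier columns left out. The names `toy_player`, `missing_counts`, and `overview` are illustrative, not part of the original notebook. The same pipeline should apply to the full `player` data frame, where the numbers will differ.

```r
library(tidyverse)

# Toy stand-in for players.csv: the six rows shown by head(player),
# without the hashedEmail and name columns
toy_player <- tibble(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  gender       = c("Male", "Male", "Male", "Female", "Male", "Female"),
  Age          = c(9, 17, 17, 21, 21, 17)
)

# Count missing values in every column
missing_counts <- toy_player |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Summarize the numeric predictors and the share of subscribers
overview <- toy_player |>
  summarize(
    n              = n(),
    mean_hours     = mean(played_hours, na.rm = TRUE),
    median_age     = median(Age, na.rm = TRUE),
    subscribe_rate = mean(subscribe)
  )

missing_counts
overview
```

Using `na.rm = TRUE` keeps the summaries defined even if the full data contain missing ages or play times.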


Our predictors are played hours, age, and gender, so we drop the identifier columns (`hashedEmail` and `name`), which are not needed for the analysis.

In [21]:
clean_player <- select(player, -hashedEmail, -name)
head(clean_player)

experience,subscribe,played_hours,gender,Age
<chr>,<lgl>,<dbl>,<chr>,<dbl>
Pro,True,30.3,Male,9
Veteran,True,3.8,Male,17
Veteran,False,0.0,Male,17
Amateur,True,0.7,Female,21
Regular,True,0.1,Male,21
Amateur,True,0.0,Female,17


In [22]:
gender <- clean_player |>
  group_by(gender) |>
  summarize(count = n())

gender

gender,count
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6
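
The counts above can also be expressed as shares of all players, which makes the very small categories easier to spot. This is a minimal sketch with the counts re-entered by hand from the table; `gender_counts` and `gender_share` are illustrative names. In the notebook itself, the same `mutate()` step could presumably be applied directly to the `gender` summary.

```r
library(tidyverse)

# Counts copied from the gender summary table above
gender_counts <- tibble(
  gender = c("Agender", "Female", "Male", "Non-binary",
             "Other", "Prefer not to say", "Two-Spirited"),
  count  = c(2, 37, 124, 15, 1, 11, 6)
)

# Add each category's share of all players, largest group first
gender_share <- gender_counts |>
  mutate(percent = round(100 * count / sum(count), 1)) |>
  arrange(desc(count))

gender_share
```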


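The data contain 196 players across seven gender categories, several of which are very small. We combine the three smallest (Agender, Other, and Two-Spirited; 9 players in total) into a single `gender_others` indicator. To do so, we first one-hot encode `gender` and then sum the three corresponding indicator columns.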

In [27]:
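# Install mltools only if it is missing; it provides one_hot(), which expects a data.table
if (!requireNamespace("mltools", quietly = TRUE)) install.packages("mltools")
library(mltools)
library(data.table)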




In [28]:
clean_player$gender <- as.factor(clean_player$gender)
player_1h <- one_hot(as.data.table(clean_player))
head(player_1h)

experience,subscribe,played_hours,gender_Agender,gender_Female,gender_Male,gender_Non-binary,gender_Other,gender_Prefer not to say,gender_Two-Spirited,Age
<chr>,<lgl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>
Pro,True,30.3,0,0,1,0,0,0,0,9
Veteran,True,3.8,0,0,1,0,0,0,0,17
Veteran,False,0.0,0,0,1,0,0,0,0,17
Amateur,True,0.7,0,1,0,0,0,0,0,21
Regular,True,0.1,0,0,1,0,0,0,0,21
Amateur,True,0.0,0,1,0,0,0,0,0,17


In [29]:
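# Collapse the three smallest gender categories into one indicator column
# and save the result so that later cells can use it
player_1h <- player_1h |>
  mutate(gender_others = gender_Agender + gender_Other + `gender_Two-Spirited`) |>
  select(-gender_Agender, -gender_Other, -`gender_Two-Spirited`)
head(player_1h)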

experience,subscribe,played_hours,gender_Female,gender_Male,gender_Non-binary,gender_Prefer not to say,Age,gender_others
<chr>,<lgl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<int>
Pro,True,30.3,0,1,0,0,9,0
Veteran,True,3.8,0,1,0,0,17,0
Veteran,False,0.0,0,1,0,0,17,0
Amateur,True,0.7,1,0,0,0,21,0
Regular,True,0.1,0,1,0,0,21,0
Amateur,True,0.0,1,0,0,0,17,0
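
As an alternative to summing one-hot columns, the same grouping could be done on the factor itself with `forcats::fct_collapse()`. The sketch below uses a toy column rather than `clean_player`. The level name `Others` and the column name `gender_grouped` are assumptions chosen for illustration.

```r
library(tidyverse)

# Toy gender column containing every category from the count table above
toy_gender <- tibble(
  gender = c("Agender", "Female", "Male", "Non-binary",
             "Other", "Prefer not to say", "Two-Spirited")
)

# Merge the three smallest categories into one level and keep the rest unchanged
toy_collapsed <- toy_gender |>
  mutate(
    gender_grouped = fct_collapse(
      factor(gender),
      Others = c("Agender", "Other", "Two-Spirited")
    )
  )

toy_collapsed |> count(gender_grouped)
```

Keeping gender as a single factor column avoids the backtick-quoted column names that one-hot encoding produces, and most modelling functions can expand a factor into indicator columns when needed.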


In [30]:
clean_player$experience <- as.factor(clean_player$experience)
player_exp <- clean_player |> select(-gender)
player_1h_exp <- one_hot(as.data.table(player_exp))
head(player_1h_exp)

experience_Amateur,experience_Beginner,experience_Pro,experience_Regular,experience_Veteran,subscribe,played_hours,Age
<int>,<int>,<int>,<int>,<int>,<lgl>,<dbl>,<dbl>
0,0,1,0,0,True,30.3,9
0,0,0,0,1,True,3.8,17
0,0,0,0,1,False,0.0,17
1,0,0,0,0,True,0.7,21
0,0,0,1,0,True,0.1,21
1,0,0,0,0,True,0.0,17


In [33]:
player_exp$experience <- player_exp$experience |>
  fct_recode("1" = "Beginner", 
             "2" = "Amateur", 
             "3" = "Regular", 
             "4" = "Veteran", 
             "5" = "Pro")
head(player_exp)

ERROR: [1m[33mError[39m in `fct_recode()`:[22m
[1m[22m[33m![39m `.f` must be a factor or character vector, not a double vector.


In [32]:
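# Note: as.numeric() on a factor returns level positions, which are alphabetical here
# (Amateur = 1, Beginner = 2, Pro = 3, Regular = 4, Veteran = 5), not skill order
player_exp$experience <- as.numeric(player_exp$experience)
head(player_exp)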

experience,subscribe,played_hours,Age
<dbl>,<lgl>,<dbl>,<dbl>
3,True,30.3,9
5,True,3.8,17
5,False,0.0,17
1,True,0.7,21
4,True,0.1,21
1,True,0.0,17
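
The codes printed above follow the alphabetical order of the factor levels (Pro becomes 3 and Amateur becomes 1), not the skill order suggested by the `fct_recode()` mapping. One way to get an ordinal encoding is to set the level order explicitly before converting. The sketch below assumes the intended order is Beginner < Amateur < Regular < Veteran < Pro, as in that mapping. It uses a toy column built from the six rows shown, and `toy_exp`, `exp_levels`, and `experience_num` are illustrative names.

```r
library(tidyverse)

# Toy experience column: the six values shown in the head() output
toy_exp <- tibble(
  experience = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur")
)

# Assumed skill ordering, taken from the fct_recode() mapping
exp_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

toy_exp <- toy_exp |>
  mutate(
    # Explicit levels fix the order; as.integer() then returns each
    # level's position in that order (Beginner = 1, ..., Pro = 5)
    experience_num = as.integer(factor(experience, levels = exp_levels))
  )

toy_exp
```

Because the conversion starts from the original labels, it gives the same result regardless of the order in which notebook cells are run, provided `experience` has not already been overwritten with numbers.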
