In [None]:

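library(tidyverse)

# Load player-level data and per-session logs
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

# Parse day/month/year timestamps and compute each session's length in minutes
sessions <- sessions %>%
  mutate(start = lubridate::dmy_hm(start_time),
         end = lubridate::dmy_hm(end_time),
         duration = as.numeric(difftime(end, start, units = "mins")))

# Total observed play time per player
usage <- sessions %>%
  group_by(hashedEmail) %>%
  summarise(total_minutes = sum(duration, na.rm = TRUE), .groups = "drop")

# Attach usage to players; players with no logged sessions get 0 minutes
df <- players %>%
  left_join(usage, by = "hashedEmail") %>%
  mutate(total_minutes = replace_na(total_minutes, 0),
         subscribe = as.numeric(subscribe)) %>%  # TRUE/FALSE -> 1/0
  filter(!is.na(Age))                            # drop players with missing age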


In [None]:

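# Distribution of total play time within each experience level
ggplot(df, aes(x = experience, y = total_minutes, fill = experience)) +
  geom_boxplot() +
  labs(title = "Figure 1: Experience vs Total Play Time (mins)",
       x = "Experience level", y = "Total play time (mins)") +
  theme_minimal()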


In [None]:

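# Age against total play time, with a fitted straight line
ggplot(df, aes(x = Age, y = total_minutes)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Figure 2: Age vs Total Play Time (mins)",
       x = "Age", y = "Total play time (mins)") +
  theme_minimal()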


In [None]:

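# Multiple linear regression of observed minutes on age,
# subscription status (0/1), and played_hours
model <- lm(total_minutes ~ Age + subscribe + played_hours, data = df)
summary(model)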


# Predicting Player Engagement on a Minecraft Research Server

## Introduction

### Background  
A research group in the Department of Computer Science at the University of British Columbia has deployed a Minecraft server to collect data on user behavior. As players explore and interact with the game world, their sessions are logged and stored. These data serve not only academic purposes but also operational ones. Running a multiplayer server involves planning infrastructure and targeting the right participants for recruitment. Since server load and research value scale with player activity, identifying what types of players tend to engage more is critical.

### Question  
**Can player characteristics such as experience level, age, and subscription status predict total time spent playing on the Minecraft research server?**

## Data Description

This project uses two datasets:

- **`players.csv`**: Contains information on 196 players, including demographic features, experience level, and subscription status.
- **`sessions.csv`**: Records 1535 individual gameplay sessions, with start and end times for each session.

The key variables we used from these files are summarized below:

| Variable Name            | Type                 | Description                                     |
| ------------------------ | -------------------- | ----------------------------------------------- |
| `experience`             | Categorical          | Self-reported skill level (e.g., Pro, Amateur)  |
| `subscribe`              | Logical              | Whether the player subscribed to the newsletter |
| `played_hours`           | Numeric              | Self-reported total hours played                |
| `gender`, `Age`          | Categorical, Numeric | Demographic features                            |
| `start_time`, `end_time` | Timestamp            | Time range for each play session                |


We joined both datasets using `hashedEmail` and computed total gameplay time per player by summing the duration of all their sessions.
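
As a small worked illustration of this aggregation and join, the sketch below uses made-up toy tables rather than the real files. The `*_toy` names and all values are hypothetical; the column names and the day/month/year timestamp format follow the analysis code. It shows that a player with no logged sessions ends up with zero minutes after the left join.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for players.csv and sessions.csv (all values are made up)
players_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  Age = c(17, 21, 25)
)
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"),
  end_time   = c("30/06/2024 18:42", "01/07/2024 09:45", "02/07/2024 21:00")
)

# Session length in minutes, then total minutes per player
usage_toy <- sessions_toy %>%
  mutate(duration = as.numeric(difftime(lubridate::dmy_hm(end_time),
                                        lubridate::dmy_hm(start_time),
                                        units = "mins"))) %>%
  group_by(hashedEmail) %>%
  summarise(total_minutes = sum(duration, na.rm = TRUE), .groups = "drop")

# Left join keeps every player; "c3" has no sessions, so NA becomes 0
joined_toy <- players_toy %>%
  left_join(usage_toy, by = "hashedEmail") %>%
  mutate(total_minutes = replace_na(total_minutes, 0))

joined_toy$total_minutes  # 75, 30, 0
```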

## Methods and Results

We used `R` and the `tidyverse` library for data wrangling, visualization, and modeling. First, we parsed timestamps from `sessions.csv` and calculated the duration (in minutes) of each session. Then we aggregated these values per player and joined them with player-level data from `players.csv`.

Next, we conducted exploratory data analysis. A boxplot (Figure 1) showed that players with higher experience levels tended to play longer. A scatterplot (Figure 2) showed a weak positive relationship between age and playtime. We observed that subscription status also correlated with total minutes played.

We fit a multiple linear regression model using three predictors: age, subscription status (as binary), and `played_hours`. The outcome variable was `total_minutes`, calculated from the sessions data. Experience level was examined only graphically (Figure 1) and was not included in the model. The model summary showed that both `played_hours` and `subscribe` had positive and statistically significant coefficients, indicating that more self-reported hours and being subscribed were associated with longer actual playtime. Age had a smaller, positive effect.
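
The research question names experience level, while the fitted model uses age, subscription status, and `played_hours`. One possible way to bring experience into a regression is to treat it as a factor, as in the sketch below. This is not the model reported here: the data are simulated, the effect sizes are invented, and the level name "Beginner" is an assumption (only Pro and Amateur are named in this report).

```r
set.seed(42)

# Simulated stand-in for `df`; values are made up for illustration only.
# "Beginner" is an assumed level name, not taken from the report.
lvls <- c("Beginner", "Amateur", "Pro")
n <- 120
sim <- data.frame(
  Age = round(runif(n, 10, 50)),
  subscribe = rbinom(n, 1, 0.7),
  experience = factor(rep(lvls, length.out = n), levels = lvls)
)
sim$total_minutes <- 20 + 0.5 * sim$Age + 15 * sim$subscribe +
  10 * as.integer(sim$experience) + rnorm(n, sd = 10)

# lm() expands the factor into dummy variables, using the first level
# ("Beginner") as the reference category
fit_exp <- lm(total_minutes ~ Age + subscribe + experience, data = sim)
summary(fit_exp)

# F-test comparing the models with and without experience
fit_base <- lm(total_minutes ~ Age + subscribe, data = sim)
anova(fit_base, fit_exp)
```

Each `experience` coefficient is then read as the expected difference in minutes relative to the reference level, holding the other predictors fixed.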

### Justification of Method  
Linear regression is suitable for this problem because the outcome variable is continuous and the relationships between predictors and outcome are approximately linear. It is also interpretable and easy to communicate to stakeholders.

### Assumptions  
We assume linearity, homoscedasticity, and normally distributed residuals. These assumptions were not formally tested; they appeared reasonable on informal visual inspection only.
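
These assumptions could be checked with standard residual diagnostics. The sketch below shows one way to do so with base R. It runs on simulated data (the `sim` data frame and its coefficients are invented for illustration), so its output says nothing about the real model; the same calls could be applied to the fitted `model` object.

```r
set.seed(1)

# Simulated stand-in for `df`; column names mirror the report, values are made up
n <- 100
sim <- data.frame(
  Age = round(runif(n, 10, 50)),
  subscribe = rbinom(n, 1, 0.7),
  played_hours = rexp(n, rate = 0.2)
)
sim$total_minutes <- 5 + 0.5 * sim$Age + 10 * sim$subscribe +
  40 * sim$played_hours + rnorm(n, sd = 20)

fit <- lm(total_minutes ~ Age + subscribe + played_hours, data = sim)

# Left: residuals vs fitted values (curvature suggests non-linearity,
# a funnel shape suggests non-constant variance)
# Right: normal Q-Q plot of the standardized residuals
par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 2)

# Shapiro-Wilk test of residual normality; a small p-value is evidence
# against normally distributed residuals
sw <- shapiro.test(residuals(fit))
sw$p.value
```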

### Limitations  
The model does not capture non-linear effects or interaction terms. Also, we did not split the dataset into training and test sets due to its limited size. Cross-validation could be used in future work to improve robustness.
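
As a rough sketch of the cross-validation mentioned above, the code below runs 5-fold cross-validation by hand in base R and reports the average out-of-fold RMSE. The data are simulated, and the choices of `k = 5` and RMSE as the error metric are assumptions made for illustration, not part of this analysis.

```r
set.seed(2)

# Simulated stand-in for `df`; column names mirror the report, values are made up
n <- 100
sim <- data.frame(
  Age = round(runif(n, 10, 50)),
  subscribe = rbinom(n, 1, 0.7),
  played_hours = rexp(n, rate = 0.2)
)
sim$total_minutes <- 5 + 0.5 * sim$Age + 10 * sim$subscribe +
  40 * sim$played_hours + rnorm(n, sd = 20)

# Randomly assign each row to one of k folds of (nearly) equal size
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

rmse <- numeric(k)
for (i in seq_len(k)) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  # Fit on k - 1 folds, then predict the held-out fold
  fit  <- lm(total_minutes ~ Age + subscribe + played_hours, data = train)
  pred <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$total_minutes - pred)^2))
}

# Average out-of-fold prediction error, in minutes
cv_rmse <- mean(rmse)
cv_rmse
```

Because every observation is held out exactly once, this uses all the data for evaluation without setting aside a separate test set, which suits a dataset of this size.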

## Discussion

This analysis suggests that simple player features such as subscription status and self-reported playtime carry useful signal for predicting total engagement, and the exploratory boxplot points the same way for experience level. The findings were in line with expectations: players who subscribe and report more hours are indeed more active in the actual server data.

From a practical perspective, these results can help guide recruitment: targeting experienced and subscribed users is likely to yield more engaged participants, maximizing research value and minimizing idle resource use.



GitHub: https://github.com/HongyangGong/DSCI_V-100-Project
