In [None]:
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)
source('cleanup.R')

## Loading Data

In [None]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

In [None]:
players
sessions

## (1) Data Description:
### Dataset Overview

| Dataset | Description | # Observations | # Variables | Key Variable |
|----------|--------------|----------------|--------------|---------------|
| players.csv | Player demographic and skill information | 196 | 7 | hashedEmail |
| sessions.csv | Each recorded game session per player | 1535 | 5 | hashedEmail |


### players.csv — Variable Summary

| Variable | Type | Meaning |
|-----------|------|---------|
| experience | character | Player's self-reported gaming experience level (e.g., Amateur, Pro, Veteran) |
| subscribe | logical | Whether the player subscribed to the game newsletter |
| hashedEmail | character | Anonymized unique player identifier |
| played_hours | numeric | Total number of hours the player has played on the server |
| name | character | Player’s in-game name or alias |
| gender | character | Player’s self-reported gender (e.g., Male, Female, Other, Prefer not to say) |
| Age | numeric | Player’s age in years (some missing values) |

---

### sessions.csv — Variable Summary

| Variable | Type | Meaning |
|-----------|------|---------|
| hashedEmail | character | Unique player ID (foreign key linking to players.csv) |
| start_time | character | Start time of a game session |
| end_time | character | End time of a game session |
| original_start_time | numeric | Original start time as a UNIX timestamp |
| original_end_time | numeric| Original end time as a UNIX timestamp |

---

### Potential Data Issues and Observations

| Category | Description | Possible Impact |
|-----------|--------------|----------------|
| Missing values | `Age` contains missing data; `gender` includes “Prefer not to say” | Could reduce sample size or introduce bias |
| Outliers | Some players have 0 or unusually high `played_hours` | May skew averages or affect regression results |
| Data type inconsistencies | `start_time` and `end_time` are stored as character, not datetime | Must be parsed into date-times (e.g., with `lubridate`) before any time-based calculation |
| Repeated player keys | Each player can appear in many rows of `sessions.csv` (one row per session) | Sessions must be aggregated per player before joining to `players.csv` |
| Sampling bias | Data comes from a voluntary Minecraft research server | May not represent the general player population |
| Ethical considerations | Email addresses are replaced by hashed identifiers (`hashedEmail`) | Supports data privacy and research ethics requirements |
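
The type conversion and per-player aggregation noted in the table could be sketched roughly as follows. This is a minimal illustration on a toy tibble rather than the real files: the `dd/mm/yyyy HH:MM` timestamp format and the names `toy_sessions`, `sessions_clean`, `session_min`, and `per_player` are assumptions made for the example, so the format should be checked against `sessions.csv` before reusing it.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Toy stand-in for sessions.csv (hypothetical values; the timestamp format
# "dd/mm/yyyy HH:MM" is an assumption, not confirmed by the source data).
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "02/07/2024 20:30"),
  end_time    = c("30/06/2024 18:24", "01/07/2024 10:30", "02/07/2024 20:45")
)

# Parse the character columns into date-times and compute session length.
sessions_clean <- toy_sessions |>
  mutate(
    start_time  = dmy_hm(start_time),
    end_time    = dmy_hm(end_time),
    session_min = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

# Collapse to one row per player so the result can be joined to players.csv.
per_player <- sessions_clean |>
  group_by(hashedEmail) |>
  summarise(
    n_sessions    = n(),
    total_minutes = sum(session_min),
    .groups = "drop"
  )

per_player
```

A per-player table like this could then be attached with `left_join(players, per_player, by = "hashedEmail")`; players with no recorded sessions would receive `NA` in the new columns.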

---


### How the Data Were Collected

The data were collected from a Minecraft research server operated by the UBC Computer Science department.  
Player information (`players.csv`) was obtained through voluntary registration forms, including demographics, experience, and newsletter subscription.  
Session data (`sessions.csv`) were automatically logged by the server, recording start and end times for each play session.  
Player identities were anonymized with hashed identifiers (`hashedEmail`) to protect privacy and comply with research ethics.

## Summary Statistics


In [None]:
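# dplyr and tidyr are already attached by library(tidyverse) above.

# Compute min, mean, median, max, and SD for every numeric column,
# ignoring missing values (Age contains NAs).
summary_stats_players <- players |>
  summarise(
    across(
      where(is.numeric),
      list(
        Min    = ~min(.x, na.rm = TRUE),
        Mean   = ~mean(.x, na.rm = TRUE),
        Median = ~median(.x, na.rm = TRUE),
        Max    = ~max(.x, na.rm = TRUE),
        SD     = ~sd(.x, na.rm = TRUE)
      ),
      # A double underscore separates column and statistic, because column
      # names such as played_hours already contain a single underscore.
      .names = "{.col}__{.fn}"
    )
  ) |>
  # Reshape to long form: one row per variable-statistic pair.
  pivot_longer(
    everything(),
    names_to = c("Variable", "Statistic"),
    names_sep = "__",
    values_to = "value"
  ) |>
  # Spread back out: one row per variable, one column per statistic.
  pivot_wider(
    names_from = Statistic,
    values_from = value
  ) |>
  mutate(across(where(is.numeric), ~round(.x, 2))) |>
  arrange(Variable)

summary_stats_players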



## (2) Questions:
- Broad question:
    - Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these characteristics differ between player types?

## (3) Exploratory Data Analysis and Visualization:

## (4) Methods and Plan

## (5) GitHub Repository