In [None]:
### Run this cell before continuing.
library(tidyverse)
library(lubridate)  # provides dmy_hm(); not attached by tidyverse versions before 2.0
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)

## Loading Data

In [None]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

In [None]:
players
sessions

## (1) Data Description:
### Dataset Overview

| Dataset | Description | # Observations | # Variables | Key Variable |
|----------|--------------|----------------|--------------|---------------|
| players.csv | Player demographic and skill information | 196 | 7 | hashedEmail |
| sessions.csv | Each recorded game session per player | 1535 | 5 | hashedEmail |


### players.csv — Variable Summary

| Variable | Type | Meaning |
|-----------|------|---------|
| experience | character | Player's self-reported gaming experience level (e.g., Amateur, Pro, Veteran) |
| subscribe | logical | Whether the player subscribed to the game newsletter |
| hashedEmail | character | Anonymized unique player identifier |
| played_hours | numeric | Total number of hours the player has played on the server |
| name | character | Player’s in-game name or alias |
| gender | character | Player’s gender (Male, Female, Other, or Prefer not to say) |
| Age | numeric | Player’s age in years (some missing values) |

---

### sessions.csv — Variable Summary

| Variable | Type | Meaning |
|-----------|------|---------|
| hashedEmail | character | Unique player ID (foreign key linking to players.csv) |
| start_time | character | Start time of a game session |
| end_time | character | End time of a game session |
| original_start_time | numeric | Original start time as a UNIX timestamp |
| original_end_time | numeric | Original end time as a UNIX timestamp |
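
Because `hashedEmail` links the two files, it is worth confirming that the keys actually match before aggregating or joining. The sketch below is one way to do that with `anti_join()`. It runs on small invented tibbles (`players_toy`, `sessions_toy`) rather than the real files, so the names and values are assumptions for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for players.csv and sessions.csv (invented IDs).
players_toy  <- tibble(hashedEmail = c("a1", "b2", "c3"))
sessions_toy <- tibble(hashedEmail = c("a1", "a1", "b2"))

# Sessions whose player ID has no match in the players table.
# An empty result means every session belongs to a known player.
orphan_sessions <- anti_join(sessions_toy, players_toy, by = "hashedEmail")

# Players who never appear in the sessions table.
players_without_sessions <- anti_join(players_toy, sessions_toy, by = "hashedEmail")

nrow(orphan_sessions)
players_without_sessions
```

Swapping in `players` and `sessions` for the toy tibbles would run the same check on the real data.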

---

### Potential Data Issues and Observations

| Category | Description | Possible Impact |
|-----------|--------------|----------------|
| Missing values | `Age` contains missing data; `gender` includes “Prefer not to say” | Could reduce sample size or introduce bias |
| Outliers | Some players have 0 or unusually high `played_hours` | May skew averages or affect regression results |
| Data type inconsistencies | `start_time` and `end_time` are stored as character, not datetime | Need conversion with `lubridate` for time-based calculations |
| Repeated keys | A player can have many rows in `sessions.csv` (one per session), so `hashedEmail` is not unique there | Must aggregate sessions per player before joining to `players.csv` |
| Sampling bias | Data comes from a voluntary Minecraft research server | May not represent the general player population |
| Ethical considerations | All identifiers are anonymized (`hashedEmail`) | Satisfies data privacy and ethics requirements |
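
Several of the issues in the table can be quantified directly. The following is a minimal sketch, assuming a toy tibble (`players_toy`, with invented values) in place of the real `players.csv`; it counts missing values per column, players with zero recorded hours, and "Prefer not to say" responses.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for players.csv (values invented for illustration).
players_toy <- tibble(
  hashedEmail  = c("a1", "b2", "c3", "d4"),
  played_hours = c(0, 0.1, 223.1, 1.5),
  gender       = c("Male", "Prefer not to say", "Female", "Male"),
  Age          = c(17, NA, 21, 30)
)

# Number of missing values in every column.
na_counts <- players_toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Players with zero recorded hours, and players who withheld their gender.
issue_counts <- players_toy |>
  summarise(
    zero_hours      = sum(played_hours == 0),
    gender_withheld = sum(gender == "Prefer not to say")
  )

na_counts
issue_counts
```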

---


### How the Data Were Collected

The data were collected from a Minecraft research server operated by the UBC Computer Science department.  
Player information (`players.csv`) was obtained through voluntary registration forms, including demographics, experience, and newsletter subscription.  
Session data (`sessions.csv`) were automatically logged by the server, recording start and end times for each play session.  
All players were anonymized using hashed identifiers (`hashedEmail`) to ensure privacy and comply with research ethics.


### Summary Statistics for Numeric Variables in `players.csv` (See Part 3)

| Variable     | Min | Mean  | Median | Max   | SD   |
|---------------|------|--------|---------|--------|--------|
| Age           | 9.00 | 21.14 | 19.00 | 58.00 | 7.39 |
| played_hours  | 0.00 | 5.85  | 0.10  | 223.10 | 28.36 |
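
A table of this shape can be produced with `summarise()` after pivoting the numeric columns to long form. The sketch below shows one possible approach on an invented toy tibble (`players_toy`), so its output will not match the figures above; `na.rm = TRUE` is used because `Age` has missing values.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the numeric columns of players.csv.
players_toy <- tibble(
  Age          = c(17, NA, 21, 30),
  played_hours = c(0, 0.1, 2.0, 1.5)
)

# One row per variable with min, mean, median, max and SD, rounded to 2 places.
numeric_summary <- players_toy |>
  select(Age, played_hours) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  summarise(
    min    = min(value, na.rm = TRUE),
    mean   = mean(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    max    = max(value, na.rm = TRUE),
    sd     = sd(value, na.rm = TRUE)
  ) |>
  mutate(across(where(is.numeric), ~ round(.x, 2)))

numeric_summary
```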


## (2) Questions:
### Broad Question
Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### Specific Predictive Question
Which combination of player features (age, gender, experience level, and total playtime) best predicts whether a player will subscribe to the newsletter, and how much does each feature improve prediction accuracy?

### How the Data Help
The `players.csv` dataset includes each player's demographic information (`Age`, `gender`), self-reported experience level (`experience`), total playtime (`played_hours`), and subscription status (`subscribe`).  
These data allow us to test how different features, individually and in combination, relate to subscription behaviour and which combinations provide the most accurate predictions.

### Data Wrangling Plan
- Use only `players.csv` as the dataset.  
- Keep `subscribe` as the variable we aim to predict.  
- Use `Age`, `gender`, `experience`, and `played_hours` as predictors, and test different combinations of these variables.  
- Handle missing values (e.g., missing `Age`) and make sure all categorical variables are properly encoded.  
- Convert `experience` and `gender` to factors, and convert `subscribe` to a factor as well, since tidymodels classification models require a factor outcome.  
- Split the data into training and testing sets, and compare models to find the combination that predicts subscription most accurately.  
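
The plan above can be sketched end to end with tidymodels. This is only an illustration under stated assumptions: the data are randomly generated (`players_toy`), missing `Age` values are filled with the median (one option among several), and logistic regression stands in for whichever classifier is eventually chosen.

```r
library(tidymodels)

set.seed(2024)  # fixed seed so the toy data and split are reproducible

# Hypothetical stand-in for players.csv (randomly generated, not real data).
n <- 200
players_toy <- tibble(
  subscribe    = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3)),
  experience   = sample(c("Amateur", "Pro", "Veteran"), n, replace = TRUE),
  gender       = sample(c("Male", "Female"), n, replace = TRUE),
  played_hours = rexp(n, rate = 0.2),
  Age          = as.numeric(sample(c(9:58, NA), n, replace = TRUE))
)

# Encode the outcome and the categorical predictors as factors.
players_model <- players_toy |>
  mutate(
    subscribe  = factor(subscribe, levels = c(TRUE, FALSE)),
    experience = as.factor(experience),
    gender     = as.factor(gender)
  )

# 75/25 train/test split, stratified on the outcome.
players_split <- initial_split(players_model, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Assumption: impute missing Age with the training median, then dummy-encode.
players_recipe <- recipe(subscribe ~ Age + gender + experience + played_hours,
                         data = players_train) |>
  step_impute_median(Age) |>
  step_dummy(all_nominal_predictors())

# Assumption: logistic regression as a placeholder classifier.
players_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(logistic_reg() |> set_engine("glm")) |>
  fit(data = players_train)

# Accuracy on the held-out test set.
test_accuracy <- players_fit |>
  predict(players_test) |>
  bind_cols(players_test) |>
  accuracy(truth = subscribe, estimate = .pred_class)

test_accuracy
```

To compare predictor combinations, the formula in `recipe()` could be varied (for example `subscribe ~ Age + played_hours`) and the resulting accuracies compared, ideally with cross-validation on the training set rather than repeated use of the test set.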



## (3) Exploratory Data Analysis and Visualization:

In [None]:
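## Minimal Data Wrangling to Create Tidy Datasets

# Encode the categorical player variables as factors.
players_tidy <- players |>
  mutate(
    experience = as.factor(experience),
    gender     = as.factor(gender)
  )

head(players_tidy)

# Parse the session timestamps (day-month-year order, hours and minutes) into datetimes.
# quiet = TRUE suppresses parse warnings, so unparseable values become NA silently.
sessions_tidy <- sessions |>
  mutate(
    start_time = dmy_hm(start_time, quiet = TRUE),
    end_time   = dmy_hm(end_time, quiet = TRUE)
  )

head(sessions_tidy)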
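
Once the timestamps are parsed, session lengths and per-player totals follow directly. The sketch below illustrates this on an invented toy tibble (`sessions_toy`); the timestamp strings are assumed to be in the day-month-year format that `dmy_hm()` expects, and the column names `duration_min`, `n_sessions`, and `total_minutes` are hypothetical.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical stand-in for sessions.csv (invented IDs and times).
sessions_toy <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

# Parse the timestamps, compute each session's length in minutes,
# then collapse to one row per player.
session_summary <- sessions_toy |>
  mutate(
    start_time   = dmy_hm(start_time),
    end_time     = dmy_hm(end_time),
    duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  group_by(hashedEmail) |>
  summarise(
    n_sessions    = n(),
    total_minutes = sum(duration_min)
  )

session_summary
```

A one-row-per-player summary like this could then be attached to the player table with `left_join(..., by = "hashedEmail")`.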


## (4) Methods and Plan

## (5) GitHub Repository