## Data Description

### Overview

- The dataset was collected by a UBC Computer Science research group studying how people play on a Minecraft server.  
- Two datasets were provided:
  - `players.csv` — player-level information such as demographics, skill, and newsletter subscription.
  - `sessions.csv` — session-level information, where each row represents a single play session (with timestamps, duration, etc.).  
- The data were collected automatically through server logs and voluntary player sign-ups.  


In [6]:
library(tidyverse)

# Loading the datasets
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

# Observe the first 3 rows
head(players, 3)
head(sessions, 3)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|------------|-----------|-------------|--------------|------|--------|-----|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |


| hashedEmail | start_time | end_time | original_start_time | original_end_time |
|-------------|------------|----------|---------------------|-------------------|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |


Both datasets loaded into R without errors.
The preview above shows the first three rows of each file: `players.csv` holds player attributes (experience, subscription status, hours played, name, gender, and age), and `sessions.csv` holds the start and end time of each session.
Each row in `players.csv` represents one player, while each row in `sessions.csv` represents one play session recorded on the Minecraft server.

In [5]:
data_shapes <- tibble(
  dataset = c("players.csv", "sessions.csv"),
  rows    = c(nrow(players), nrow(sessions)),
  columns = c(ncol(players), ncol(sessions))
)

data_shapes

| dataset | rows | columns |
|---------|------|---------|
| `<chr>` | `<int>` | `<int>` |
| players.csv | 196 | 7 |
| sessions.csv | 1535 | 5 |


The table above summarizes the size of each dataset: `players.csv` has 196 rows and 7 columns, and `sessions.csv` has 1,535 rows and 5 columns.  
- **Rows** = number of observations (records).  
- **Columns** = number of variables (features).  

`players.csv` contains player-level data, and `sessions.csv` contains session-level data.


In [8]:
# Check structure of each dataset
str(players)
str(sessions)

spc_tbl_ [196 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ experience  : chr [1:196] "Pro" "Veteran" "Veteran" "Amateur" ...
 $ subscribe   : logi [1:196] TRUE TRUE FALSE TRUE TRUE TRUE ...
 $ hashedEmail : chr [1:196] "f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d" "f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9" "b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28" "23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5" ...
 $ played_hours: num [1:196] 30.3 3.8 0 0.7 0.1 0 0 0 0.1 0 ...
 $ name        : chr [1:196] "Morgan" "Christian" "Blake" "Flora" ...
 $ gender      : chr [1:196] "Male" "Male" "Male" "Female" ...
 $ Age         : num [1:196] 9 17 17 21 21 17 19 21 47 22 ...
 - attr(*, "spec")=
  .. cols(
  ..   experience = col_character(),
  ..   subscribe = col_logical(),
  ..   hashedEmail = col_character(),
  ..   played_hours = col_double(),
  ..   name = col_characte

### Variable Summary (players.csv)

| Variable | Type | Description |
|-----------|------|-------------|
| experience | character | Player’s reported experience level (e.g., Pro, Veteran, Amateur). |
| subscribe | logical (TRUE/FALSE) | Whether the player subscribed to the newsletter. |
| hashedEmail | character | Hashed email address that identifies the player; the key for matching rows in `sessions.csv`. |
| played_hours | numeric | Total hours the player has played. |
| name | character | Player name (not used for modeling). |
| gender | character | Gender identity of the player. |
| Age | numeric | Age of the player in years. |
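
Because `experience` and `gender` are read in as plain character columns, it may help to convert them to factors before summarizing or modeling. The sketch below is illustrative only: it rebuilds the three previewed rows by hand (omitting `hashedEmail` and `name`), and the names `players_preview`, `players_typed`, and `experience_counts` are hypothetical rather than part of the original analysis.

```r
library(dplyr)
library(tibble)

# Three rows copied from the players.csv preview
# (hashedEmail and name omitted for brevity)
players_preview <- tibble(
  experience   = c("Pro", "Veteran", "Veteran"),
  subscribe    = c(TRUE, TRUE, FALSE),
  played_hours = c(30.3, 3.8, 0),
  gender       = c("Male", "Male", "Male"),
  Age          = c(9, 17, 17)
)

# Convert the categorical text columns to factors
players_typed <- players_preview %>%
  mutate(across(c(experience, gender), as.factor))

# Tally players per experience level
experience_counts <- players_typed %>% count(experience)
experience_counts
```

On the full data, the same `mutate()` call would apply to `players` directly; the set of factor levels then comes from whatever values the file actually contains.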

---

### Variable Summary (sessions.csv)

| Variable | Type | Description |
|-----------|------|-------------|
| hashedEmail | character | Key linking each session to a player. |
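| start_time | character | Start time of the session, stored as text in `DD/MM/YYYY HH:MM` format. |
| end_time | character | End time of the session, stored as text in `DD/MM/YYYY HH:MM` format. |
| original_start_time | numeric | Numeric timestamp for `start_time`; the magnitude is consistent with milliseconds since the Unix epoch. |
| original_end_time | numeric | Numeric timestamp for `end_time`; the magnitude is consistent with milliseconds since the Unix epoch. |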
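
Since `start_time` and `end_time` are stored as text, they need to be parsed before session length can be computed. The following is a minimal sketch using the three previewed rows (with `hashedEmail` omitted). It rests on two assumptions: the time zone is treated as UTC, which the data do not state, and `original_start_time` is read as milliseconds since 1970-01-01. The names `sessions_preview` and `sessions_parsed` are hypothetical.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Three rows copied from the sessions.csv preview (hashedEmail omitted)
sessions_preview <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57"),
  original_start_time = c(1719770000000, 1718670000000, 1721930000000)
)

sessions_parsed <- sessions_preview %>%
  mutate(
    # dmy_hm() parses day/month/year hour:minute text; UTC is an assumption
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt   = dmy_hm(end_time, tz = "UTC"),
    # Session length in minutes
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Assumption: the numeric column counts milliseconds since the Unix epoch
    start_from_epoch = as_datetime(original_start_time / 1000, tz = "UTC")
  )

sessions_parsed %>% select(start_dt, end_dt, duration_min, start_from_epoch)
```

For these three rows the durations come out to 12, 13, and 23 minutes. In the preview, `original_start_time` and `original_end_time` are identical within each row, so the text columns look like the safer basis for computing durations.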

---

**Notes**
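- `subscribe` is likely the best outcome (response) variable for prediction, since it is the only logical (TRUE/FALSE) column in either file.  
- `played_hours` and `experience` are candidate explanatory variables.  
- No missing values appear in the previews above, but a per-column count over all rows is needed to confirm that missingness is minimal.  
- Player-level data (`players.csv`) and session-level data (`sessions.csv`) can be joined on `hashedEmail` if needed. With 1,535 sessions for 196 players, the relationship is one-to-many.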
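
The missing-value check and the join mentioned in the notes could be sketched as follows. The rows here are invented for illustration only (short IDs stand in for the real hashes, and one `NA` is planted on purpose), and the names `players_toy`, `sessions_toy`, `na_counts`, `session_counts`, and `players_joined` are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy rows, invented for illustration only; not taken from the real files
players_toy <- tibble(
  hashedEmail  = c("id_a", "id_b", "id_c"),
  subscribe    = c(TRUE, TRUE, FALSE),
  played_hours = c(30.3, 3.8, NA)
)
sessions_toy <- tibble(
  hashedEmail = c("id_a", "id_a", "id_b"),
  start_time  = c("30/06/2024 18:12", "01/07/2024 09:00", "17/06/2024 23:33")
)

# Count missing values in every column
na_counts <- players_toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Collapse sessions to one row per player, then attach to the player table
session_counts <- sessions_toy %>%
  count(hashedEmail, name = "n_sessions")

players_joined <- players_toy %>%
  left_join(session_counts, by = "hashedEmail") %>%
  # Players with no recorded sessions get a count of zero
  mutate(n_sessions = coalesce(n_sessions, 0L))

na_counts
players_joined
```

Summarizing sessions to one row per player before joining keeps the result at one row per player, which suits a player-level outcome such as `subscribe`. Joining the raw session rows instead would repeat each player once per session.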
