# Predicting High-Usage Players on the Minecraft Server


In [1]:
library(tidyverse)
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────


## 1. Data Description

In [9]:
summary(players)
summary(sessions)
head(players, 5)
head(sessions, 5)

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                                       NA's   :2      

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|------------|-----------|-------------|--------------|------|--------|-----|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |


| hashedEmail | start_time | end_time | original_start_time | original_end_time |
|-------------|------------|----------|---------------------|-------------------|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000.0 | 1719770000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000.0 | 1718670000000.0 |
| f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000.0 | 1721930000000.0 |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000.0 | 1721880000000.0 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000.0 | 1716650000000.0 |


In [10]:
# Count how many unique players appear in sessions.csv
n_sessions_players <- sessions |> distinct(hashedEmail) |> nrow()

# Count total players in players.csv
n_total_players <- nrow(players)

# Print both results
n_sessions_players
n_total_players

The project uses two datasets collected from the UBC Minecraft research server:  
**`players.csv`** and **`sessions.csv`**.

#### players.csv
- **Observations:** 196 unique players  
- **Variables:** 7  
- **Description of variables:**

| Variable | Type | Description | Example | Issues / Notes |
|-----------|------|--------------|----------|----------------|
| `experience` | Categorical (character) | Player’s self-reported Minecraft experience level | "Veteran", "Pro", "Amateur" | Uneven group sizes possible |
| `subscribe` | Logical (TRUE/FALSE) | Whether the player subscribed to the research newsletter | TRUE | Could serve as a target variable for Question 1 |
| `hashedEmail` | Character | Unique player identifier (used to link to sessions.csv) | long hash string | Acts as join key |
| `played_hours` | Numeric | Total reported playtime (hours) | Mean = 5.85, Max = 223.1 | Strong right skew: the median is 0.10 hours, so a few players play far more than the rest |
| `name` | Character | Player’s name | "Morgan" | Not useful for prediction |
| `gender` | Categorical | Player’s gender | "Male", "Female" | May contain small or unbalanced categories |
| `Age` | Numeric | Player’s age (years) | Mean = 21.14, Range = 9–58 | 2 missing values |

---
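
As a rough check on class balance, the `subscribe` counts reported by `summary(players)` above (52 `FALSE`, 144 `TRUE`) can be turned into proportions. This is a minimal sketch that re-enters those two counts by hand. The object name `subscribe_counts` is made up for illustration, and on the full data the same table would presumably come from `players |> count(subscribe)`.

```r
library(tidyverse)

# Counts copied by hand from the summary(players) output
subscribe_counts <- tibble(
  subscribe = c(FALSE, TRUE),
  n         = c(52, 144)
) |>
  mutate(prop = n / sum(n))  # share of all 196 players in each class

subscribe_counts
```

A model that always predicted `TRUE` would be right for about 73% of these players, which is worth keeping in mind as a baseline if `subscribe` is used as a target.
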
#### sessions.csv
- **Observations:** 1,535 individual play sessions  
- **Variables:** 5  
- **Description of variables:**

| Variable | Type | Description | Example | Issues / Notes |
|-----------|------|--------------|----------|----------------|
| `hashedEmail` | Character | Player identifier linking to players.csv | same hash as above | Used for join |
| `start_time` | Character (datetime) | Start time of session (`dd/mm/yyyy hh:mm`) | 30/06/2024 18:12 | Needs parsing to datetime |
| `end_time` | Character (datetime) | End time of session (`dd/mm/yyyy hh:mm`) | 30/06/2024 18:24 | Needs parsing to datetime |
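| `original_start_time` | Numeric (timestamp) | Start time as a Unix timestamp in milliseconds | 1.719e+12 | Duplicates `start_time`; the values shown are all multiples of 10^7 ms (about 2.8 hours), so precision appears to have been lost |
| `original_end_time` | Numeric (timestamp) | End time as a Unix timestamp in milliseconds | 1.719e+12 | 2 missing values; equal to `original_start_time` in all five rows shown, so it cannot resolve session length |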

---
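
The parsing step noted in the table could look like the sketch below, which uses the five rows printed by `head(sessions, 5)` (hashes omitted). It assumes the strings are in UTC, since the source does not state a time zone and `dmy_hm()` defaults to UTC. The names `sessions_head`, `sessions_parsed`, `duration_min`, and the `*_dt` columns are introduced here for illustration only.

```r
library(tidyverse)  # lubridate is attached with tidyverse 2.0.0

# The five rows shown by head(sessions, 5); hashedEmail omitted for brevity
sessions_head <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34",
                 "25/07/2024 03:22", "25/05/2024 16:01"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57",
                 "25/07/2024 03:58", "25/05/2024 16:12"),
  original_start_time = c(1719770000000, 1718670000000, 1721930000000,
                          1721880000000, 1716650000000)
)

sessions_parsed <- sessions_head |>
  mutate(
    # Parse "dd/mm/yyyy hh:mm" strings (time zone assumed to be UTC)
    start_dt = dmy_hm(start_time),
    end_dt   = dmy_hm(end_time),
    # Session length in minutes
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Epoch milliseconds -> seconds -> datetime
    original_start_dt = as_datetime(original_start_time / 1000)
  )

sessions_parsed |> select(start_dt, duration_min, original_start_dt)
```

On these rows the parsed durations are 12, 13, 23, 36, and 11 minutes. The `original_start_time` values shown are all multiples of 10,000 seconds, so the character columns look like the better basis for session length.
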

#### Potential Issues
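- 71 of the 196 players have no recorded sessions (they appear only in players.csv).
- `played_hours` and the session durations derived from sessions.csv both measure playtime, so they are likely to overlap and should be cross-checked before either is used in a model.
- The `start_time` and `end_time` strings (`dd/mm/yyyy hh:mm`) must be parsed to datetimes before durations can be computed.
- The `original_*` timestamps are identical within each of the five rows shown and appear rounded (all are multiples of 10^7 ms, about 2.8 hours), so they may record little more than the date.
- Possible data imbalance: far more players have low playtime than high playtime (median 0.10 hours, maximum 223.10 hours).
- Sampling bias may exist because the data come from voluntary participants.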

---
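
Two of the checks above, finding players without sessions and comparing `played_hours` against summed session time, can be sketched with joins. The data here are hypothetical toy rows (`players_toy`, `sessions_toy`) that only reuse the real column names; none of the values come from the actual files, and the UTC default of `dmy_hm()` is again an assumption.

```r
library(tidyverse)

# Hypothetical toy data with the same column names as the real files
players_toy <- tibble(
  hashedEmail  = c("a", "b", "c"),
  played_hours = c(1.5, 0.0, 0.5)
)
sessions_toy <- tibble(
  hashedEmail = c("a", "a", "c"),
  start_time  = c("01/06/2024 10:00", "02/06/2024 10:00", "03/06/2024 09:00"),
  end_time    = c("01/06/2024 11:00", "02/06/2024 10:30", "03/06/2024 09:30")
)

# Players that never appear in the sessions table
no_sessions <- players_toy |>
  anti_join(sessions_toy, by = "hashedEmail")

# Number of sessions and total session time (hours) per player
session_hours <- sessions_toy |>
  mutate(hours = as.numeric(difftime(dmy_hm(end_time), dmy_hm(start_time),
                                     units = "hours"))) |>
  group_by(hashedEmail) |>
  summarise(n_sessions = n(), session_hours = sum(hours))

# Side-by-side comparison; players without sessions get zeros
comparison <- players_toy |>
  left_join(session_hours, by = "hashedEmail") |>
  mutate(
    n_sessions    = coalesce(n_sessions, 0L),
    session_hours = coalesce(session_hours, 0)
  )

comparison
```

Applied to the real tables, `nrow(no_sessions)` would be expected to reproduce the 71 players noted above, and large gaps between `played_hours` and `session_hours` would indicate that the two measures are not interchangeable.
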

#### Summary Statistics (rounded to 2 decimals)
- Mean player age: **21.14 years**
- Mean reported playtime: **5.85 hours**
- Median playtime: **0.10 hours**
- Max playtime: **223.10 hours**

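These statistics suggest that most players spend little time on the server: the mean (5.85 hours) is far above both the median (0.10 hours) and the third quartile (0.60 hours), so a small number of heavy players account for most of the recorded playtime. A model fit to this data may therefore be dominated by those high-usage players.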
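
The same mean-versus-median gap is already visible in the five rows printed by `head(players, 5)`. The sketch below re-enters only their `played_hours` values. The names `head_hours`, `skew_summary`, and `top_share` are illustrative, and five rows are far too few to stand in for the full distribution.

```r
library(tidyverse)

# played_hours for the five rows shown by head(players, 5)
head_hours <- tibble(played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1))

skew_summary <- head_hours |>
  summarise(
    mean_hours   = mean(played_hours),
    median_hours = median(played_hours),
    # Share of total playtime contributed by the single heaviest player
    top_share    = max(played_hours) / sum(played_hours)
  )

skew_summary
```

Here the mean is 6.98 hours against a median of 0.7 hours, with one player contributing most of the total. On the full data, a transformation such as `log1p(played_hours)` might be worth considering before plotting or modelling.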


## 2. Questions

## 3. Exploratory Data Analysis & Visualization

## 4. Methods and Plan

## 5. GitHub Repository