In [None]:
library(tidyverse)

# DSCI Project Individual Portion
A UBC research group is studying how people play video games by recording players' actions on a Minecraft server. Because resources are limited, the group needs to focus its recruitment on the kinds of players who are likely to provide the most data. This page therefore investigates a narrower question: which player traits are most strongly associated with contributing a large amount of data?

## Data Description

### 1. "players.csv" is a dataset with 7 variables describing each player (one row per unique player) through higher-level attributes.
|Variable Name |Type|Meaning|
|:------------:|:--:|:-----:|
|`experience`  |chr |Experience with Minecraft (beginner, regular, amateur, pro, veteran, in increasing order of experience)|
|`subscribe`   |lgl |Whether the player subscribes to the newspaper (TRUE/FALSE)|
|`hashedEmail` |chr |Hashed registration email|
|`played_hours`|dbl |Hours played on server (hours)|
|`name`        |chr |Player's name|
|`gender`      |chr |Player's gender|
|`Age`         |dbl |Player's age (column name is capitalized in the data)|
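
Because `experience` is stored as text, grouped summaries would list its categories alphabetically rather than by skill. A minimal sketch of one way to handle this is to convert the column to an ordered factor. The rows below are hypothetical, and the capitalized labels and their order are assumptions taken from the table above, not verified against the file.

```r
library(dplyr)

# Hypothetical rows; only the `experience` column mirrors the table above.
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Veteran", "Regular", "Amateur", "Beginner")
)

# Level order copied from the table above (an assumption about the real labels).
experience_levels <- c("Beginner", "Regular", "Amateur", "Pro", "Veteran")

toy_players <- toy_players |>
  mutate(experience = factor(experience, levels = experience_levels, ordered = TRUE))

# count() now lists the groups in level order rather than alphabetically.
experience_counts <- toy_players |>
  count(experience)
experience_counts
```

If a label in the real data does not match one of the assumed levels, `factor()` turns it into `NA`, so it is worth checking `sum(is.na(...))` after the conversion.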

#### Overview of "players.csv" Table

In [None]:
player_data <- read_csv("data/players.csv")
head(player_data)

#### Summary Table for Quantitative Variables: `played_hours` and `Age`

In [None]:
summarized_played_hours <- player_data |>
    summarize(variable = "played_hours",
              mean = round(mean(played_hours, na.rm = TRUE), 2),
              sd = round(sd(played_hours, na.rm = TRUE), 2),
              max = round(max(played_hours, na.rm = TRUE), 2),
              min = round(min(played_hours, na.rm = TRUE), 2),
              num_missing = sum(is.na(played_hours)))
summarized_age <- player_data |>
    summarize(variable = "Age",
              mean = round(mean(Age, na.rm = TRUE), 2),
              sd = round(sd(Age, na.rm = TRUE), 2),
              max = round(max(Age, na.rm = TRUE), 2),
              min = round(min(Age, na.rm = TRUE), 2),
              num_missing = sum(is.na(Age)))
summarized_played_hours
summarized_age

#### Overall Summary Table Based on `experience`
Grouping by `experience` gives a clear breakdown of the dataset, with enough players in each category to summarize.

In [None]:
summarized_player_data <- player_data |>
    group_by(experience) |>
    summarize(num_players = n(),
              num_subscribed = sum(subscribe, na.rm = TRUE),
              avg_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
              max_played_hours = round(max(played_hours, na.rm = TRUE), 2), 
              min_played_hours = round(min(played_hours, na.rm = TRUE), 2),
              sd_played_hours = round(sd(played_hours, na.rm = TRUE), 2),
              avg_age = round(mean(Age, na.rm = TRUE), 2), 
              max_age = max(Age, na.rm = TRUE), 
              min_age = min(Age, na.rm = TRUE),
              sd_age = round(sd(Age, na.rm = TRUE), 2))
summarized_player_data

##### Potential Issues with Data
1. The data may contain missing or uninformative values, such as **NA** in `Age` and **"prefer not to say"** in `gender`.
2. `avg_played_hours` may not be representative of the players if the **time span** of the recording is too short. For example, if the data were collected during the summer, children may have more free time to play games, while adults may not.
3. Because each category is a **small sample**, a few **outliers** can raise the average well above what most players in that category actually play. The `Regular` category may be an example, given its **high standard deviation** in `sd_played_hours`.
4. `experience` is self-reported by players, so it may be **biased**.
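
The outlier concern in point 3 can be illustrated with a small sketch. The `played_hours` values below are hypothetical, chosen only to show how one heavy player pulls the mean away from the median; they are not taken from "players.csv".

```r
library(dplyr)

# Hypothetical values: most players play very little, one plays a lot.
toy_hours <- tibble(played_hours = c(0.1, 0.3, 0.5, 1.0, 150))

toy_summary <- toy_hours |>
  summarize(mean_hours = mean(played_hours),
            median_hours = median(played_hours))
toy_summary
```

Here the mean (30.38) is far above the median (0.5), so reporting a median alongside each group's mean is one possible way to check whether an average reflects a typical player.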

### 2. "sessions.csv" is a dataset with 5 variables that describe each player's sessions on the server
|Variable Name |Type|Meaning|
|:------------:|:--:|:-----:|
|`hashedEmail`|chr|Hashed registration email|
|`start_time`|chr|Connect time (in datetime)|
|`end_time`|chr|Disconnect time (in datetime)|
|`original_start_time`|dbl|Connect time (in Unix timestamp)|
|`original_end_time`|dbl|Disconnect time (in Unix timestamp)|

The **Unix timestamps** make the total time spent on the server convenient to calculate: each session lasts (`original_end_time` - `original_start_time`) milliseconds.
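
As a rough sketch of that calculation, the snippet below converts each session's length from milliseconds to hours and totals it per player. The three sessions and the short `hashedEmail` values are hypothetical, and the timestamps are assumed to be milliseconds since 1970-01-01 UTC, as the formula above implies.

```r
library(dplyr)

# Hypothetical sessions; timestamps assumed to be in milliseconds.
toy_sessions <- tibble(
  hashedEmail         = c("a", "a", "b"),
  original_start_time = c(1.70e12, 1.70e12 + 7.2e6, 1.70e12),
  original_end_time   = c(1.70e12 + 3.6e6, 1.70e12 + 9.0e6, 1.70e12 + 1.8e6)
)

ms_per_hour <- 1000 * 60 * 60

toy_totals <- toy_sessions |>
  # Session length in hours: (end - start) milliseconds divided by ms per hour.
  mutate(duration_hours = (original_end_time - original_start_time) / ms_per_hour) |>
  group_by(hashedEmail) |>
  summarize(num_sessions = n(),
            total_hours = sum(duration_hours))
toy_totals
```

In this toy data, player "a" totals 1.5 hours over two sessions and player "b" totals 0.5 hours over one.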

In [None]:
session_data <- read_csv("data/sessions.csv")
head(session_data)

#### Summary Table of `original_start_time` and `original_end_time`

In [None]:
summary_original_start_time <- session_data |>
    select(original_start_time) |>
    summarize(variable = "original_start_time",
              num_session = n(),
              avg_start_time = mean(original_start_time, na.rm = TRUE),
              avg_start_date = as.POSIXct(mean(original_start_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"),

              latest_start_time = max(original_start_time, na.rm = TRUE),
              latest_start_date = as.POSIXct(max(original_start_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"),

              earliest_start_time = min(original_start_time, na.rm = TRUE),
              earliest_start_date = as.POSIXct(min(original_start_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"))
summary_original_start_time

In [None]:
summary_original_end_time <- session_data |>
    select(original_end_time) |>
    summarize(variable = "original_end_time",
              num_session = n(),
              avg_end_time = mean(original_end_time, na.rm = TRUE),
              avg_end_date = as.POSIXct(mean(original_end_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"),

              latest_end_time = max(original_end_time, na.rm = TRUE),
              latest_end_date = as.POSIXct(max(original_end_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"),

              earliest_end_time = min(original_end_time, na.rm = TRUE),
              earliest_end_date = as.POSIXct(min(original_end_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"))
summary_original_end_time