# DSCI Project Individual Portion
A research group at UBC is running a real data science project that studies how people play video games by recording players' actions on a Minecraft server. Because resources are limited, the study needs to narrow its recruitment to the players who are likely to provide the most data. This page therefore investigates a narrower question: which player traits are most likely to be associated with contributing a large amount of data?

The datasets "players.csv" and "sessions.csv" used in this project are stored in the `data` folder. They contain the following variables:
1. "players.csv" is a dataset with 7 variables that describe each player (one row per unique player) with higher-level attributes.
- Experience with Minecraft <chr> : `experience`
<br>(beginner, regular, amateur, pro, veteran) ordered in increasing experience
- Subscribe to the Newspaper or not <lgl>: `subscribe`
- Registration email <chr>: `hashedEmail`
- Hours played on server (hrs) <dbl>: `played_hours`
- Player's name <chr>: `name`
- Player's gender <chr>: `gender`
- Player's age <dbl>: `Age`


The data is shown below

In [None]:
library(tidyverse)

In [None]:
player_data <- read_csv("data/players.csv")
head(player_data)

In [None]:
summarized_player_data <- player_data |>
    group_by(experience) |>
    summarize(players = n(),
              num_subscribed = sum(subscribe, na.rm = TRUE),
              avg_played_hours = mean(played_hours, na.rm = TRUE), 
              max_played_hours = max(played_hours, na.rm = TRUE), 
              min_played_hours = min(played_hours, na.rm = TRUE), 
              avg_age = mean(Age, na.rm = TRUE), 
              max_age = max(Age, na.rm = TRUE), 
              min_age = min(Age, na.rm = TRUE))
summarized_player_data
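
Because `experience` is stored as `<chr>`, `group_by()` sorts the groups alphabetically rather than by experience level. The sketch below shows one possible way to make the summary follow the order given in the variable description. It runs on a small made-up tibble (`toy_players` is hypothetical), and the lowercase level spellings are an assumption that should be checked against the values actually present in "players.csv".

```r
library(tidyverse)

# Toy stand-in for players.csv (hypothetical values, not real data).
toy_players <- tibble(
  experience   = c("pro", "beginner", "veteran", "amateur", "regular", "beginner"),
  played_hours = c(12.5, 0.3, 40.1, 2.0, 5.4, 1.1)
)

# Level order copied from the variable description; spellings are assumed.
experience_levels <- c("beginner", "regular", "amateur", "pro", "veteran")

ordered_summary <- toy_players |>
  # An ordered factor makes grouped output follow the level order.
  mutate(experience = factor(experience, levels = experience_levels, ordered = TRUE)) |>
  group_by(experience) |>
  summarize(players = n(),
            avg_played_hours = mean(played_hours, na.rm = TRUE))

ordered_summary
```

Any value that does not match one of the listed levels becomes `NA` after the conversion, so a quick `count(experience)` beforehand helps confirm the spellings.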

**Potential Issues with Data**
1. The data may contain missing values, such as NA in `Age`, and non-answers, such as "prefer not to say" in `gender`.
2. The time span of the data collection is not stated. Is playing
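
The first issue can be quantified before any modelling. The following is a minimal sketch on a made-up tibble (`toy_players` and its values are hypothetical) that counts NA entries per column and tallies the `gender` responses. The exact spelling of the non-answer category is an assumption.

```r
library(tidyverse)

# Toy stand-in for players.csv (hypothetical values, not real data).
toy_players <- tibble(
  gender = c("Male", "Female", "Prefer not to say", "Male"),
  Age    = c(17, NA, 21, 30)
)

# Number of NA entries in every column.
na_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Tally of each gender response, so non-answers are visible.
gender_counts <- toy_players |>
  count(gender)

na_counts
gender_counts
```

Applying the same two steps to `player_data` would show how many rows are affected, which helps decide between dropping those rows and keeping them with `na.rm = TRUE`.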

2. "sessions.csv" is a dataset with 5 variables that describe a player's state in each session
- Registration email <chr>: `hashedEmail`
- Connect time <chr>: `start_time`
- Disconnect time <chr>: `end_time`
- Login time <dbl>: `original_start_time`
- Exit time <dbl>: `original_end_time`

Connect time and Login time differ in that Connect time records each reconnection, while Login time records the total duration disregarding reconnections.

In [None]:
session_data <- read_csv("data/sessions.csv")
head(session_data)
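
Both files contain `hashedEmail`, so session records can be linked back to player attributes. The sketch below, on made-up tibbles (`toy_players`, `toy_sessions`, and the column name `num_sessions` are hypothetical), counts sessions per player and joins the counts onto the player table. It is one possible approach, not necessarily the one used later in the project.

```r
library(tidyverse)

# Toy stand-ins for the two files (hypothetical values, not real data).
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  experience  = c("pro", "beginner", "veteran")
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "c3")
)

# One row per player that has at least one session.
sessions_per_player <- toy_sessions |>
  count(hashedEmail, name = "num_sessions")

# Keep every player; players without sessions get a count of 0.
players_with_sessions <- toy_players |>
  left_join(sessions_per_player, by = "hashedEmail") |>
  mutate(num_sessions = replace_na(num_sessions, 0L))

players_with_sessions
```

A `left_join()` is used so that players who never started a session stay in the table instead of silently disappearing.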

**Unix Timestamp**


The `original_start_time` and `original_end_time` columns are Unix timestamps in milliseconds. A Unix timestamp counts the time elapsed since the epoch (0 ms) at 1970-01-01 00:00:00 UTC. For example, a Unix timestamp of 1.71977e+12 represents 2024-06-30 17:53:20 UTC. The difference between `original_end_time` and `original_start_time` is therefore the total time, in milliseconds, that the player spent on the server.
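
As a quick check of the conversion, the base R sketch below turns the example value into a date-time and computes a session length. The Unix epoch is 1970-01-01 00:00:00 UTC. The second timestamp (`end_ms`) is hypothetical and exists only to illustrate the subtraction.

```r
# Example value quoted in the text, in milliseconds since the Unix epoch.
example_ms <- 1.71977e+12

# as.POSIXct() expects seconds, so divide the milliseconds by 1000.
example_time <- as.POSIXct(example_ms / 1000, origin = "1970-01-01", tz = "UTC")
example_label <- format(example_time, "%Y-%m-%d %H:%M:%S")
example_label

# Hypothetical session: starts at the example time and ends 30 minutes later.
start_ms <- example_ms
end_ms   <- start_ms + 30 * 60 * 1000

# Difference in milliseconds, converted to minutes.
duration_minutes <- (end_ms - start_ms) / 1000 / 60
duration_minutes
```

The same arithmetic applies column-wise, for example inside `mutate()` on `session_data`.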