In [None]:
library(tidyverse)

# DSCI Project Individual Portion
A UBC research group is studying how people play video games by recording players' actions on a MineCraft server. Because resources are limited, the group needs to target its recruitment at players who are likely to provide a large amount of data. This page therefore investigates a narrower question: which player traits are most associated with contributing a large amount of data.

## Data Description

### 1. "players.csv" is a dataset with 7 variables describing high-level attributes of each player (one row per unique player).
|Variable Name |Type|Meaning|
|:------------:|:--:|:-----:|
|`experience`  |chr |Experience with MineCraft (Beginner, Regular, Amateur, Pro, Veteran, in order of increasing experience)|
|`subscribe`   |lgl |Whether the player subscribes to the newsletter (TRUE/FALSE)|
|`hashedEmail` |chr |Hashed registration email|
|`played_hours`|dbl |Hours played on server (hours)|
|`name`        |chr |Player's name|
|`gender`      |chr |Player's gender|
|`Age`         |dbl |Player's age|

#### Overview of "players.csv" Table

In [None]:
player_data <- read_csv("data/players.csv")
head(player_data)

#### Summary Table for Quantitative Variables: `played_hours` and `Age`

In [None]:
summarized_played_hours <- player_data |>
    summarize(variable = "played_hours",
              mean = round(mean(played_hours, na.rm = TRUE), 2),
              sd = round(sd(played_hours, na.rm = TRUE), 2),
              max = round(max(played_hours, na.rm = TRUE), 2),
              min = round(min(played_hours, na.rm = TRUE), 2),
              num_missing = sum(is.na(played_hours)))
summarized_age <- player_data |>
    summarize(variable = "Age",
              mean = round(mean(Age, na.rm = TRUE), 2),
              sd = round(sd(Age, na.rm = TRUE), 2),
              max = round(max(Age, na.rm = TRUE), 2),
              min = round(min(Age, na.rm = TRUE), 2),
              num_missing = sum(is.na(Age)))
summarized_played_hours
summarized_age

#### Overall Summary Table Based on `experience`
`experience` gives the dataset a clear structure, with enough samples in each category.

In [None]:
summarized_player_data <- player_data |>
    group_by(experience) |>
    summarize(num_players = n(),
              num_subscribed = sum(subscribe, na.rm = TRUE),
              avg_played_hours = mean(played_hours, na.rm = TRUE), 
              max_played_hours = round(max(played_hours, na.rm = TRUE), 2), 
              min_played_hours = round(min(played_hours, na.rm = TRUE), 2),
              sd_played_hours = round(sd(played_hours, na.rm = TRUE), 2),
              avg_age = round(mean(Age, na.rm = TRUE), 2), 
              max_age = max(Age, na.rm = TRUE), 
              min_age = min(Age, na.rm = TRUE),
              sd_age = round(sd(Age, na.rm = TRUE), 2))
summarized_player_data

##### Potential Issues with Data
1. Data may contain missing or uninformative values, such as **NA** in `Age` and **"prefer not to say"** in `gender`.
2. `avg_played_hours` may not be representative of players if the **time span** of data collection is too short. For example, if the data is collected during summer, children may have more free time to play games, while adults may not.
3. Because each category is a **small sample**, **outliers** may inflate the average even though most other players in the same category played far less. The `Regular` category may be an example, with a **high standard deviation** in `sd_played_hours`.
4. `experience` is self-reported by players, so it may be **biased**.
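
Issues 1 and 3 could be checked along the following lines. This is only a sketch: the tibble `toy_players` and all of its values are invented for illustration, and `player_data` would take its place with the real data.

```r
library(tidyverse)

# Hypothetical toy data standing in for player_data (all values are invented)
toy_players <- tibble(
    experience   = c("Regular", "Regular", "Regular", "Pro", "Pro"),
    gender       = c("Male", "Prefer not to say", "Female", "Male", "Female"),
    Age          = c(17, NA, 21, 25, 19),
    played_hours = c(0.1, 0.0, 150.0, 1.2, 0.8)
)

# Issue 1: count missing or undisclosed entries
missing_summary <- toy_players |>
    summarize(num_missing_age = sum(is.na(Age)),
              num_undisclosed_gender = sum(tolower(gender) == "prefer not to say"))

# Issue 3: a mean far above the median hints at outliers within a category
outlier_check <- toy_players |>
    group_by(experience) |>
    summarize(mean_hours = mean(played_hours, na.rm = TRUE),
              median_hours = median(played_hours, na.rm = TRUE))

missing_summary
outlier_check
```

In the toy data, one very active `Regular` player pulls the category mean far above its median, which is the pattern issue 3 describes.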

### 2. "sessions.csv" is a dataset with 5 variables that describe each playing session of a player.
|Variable Name |Type|Meaning|
|:------------:|:--:|:-----:|
|`hashedEmail`|chr|Hashed registration email|
|`start_time`|chr|Connect time (in datetime)|
|`end_time`|chr|Disconnect time (in datetime)|
|`original_start_time`|dbl|Connect time (Unix timestamp in milliseconds)|
|`original_end_time`|dbl|Disconnect time (Unix timestamp in milliseconds)|

Because the **Unix timestamps** are in milliseconds, the length of a session is simply (`original_end_time` - `original_start_time`) milliseconds, which makes the total time spent on the server convenient to calculate.
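
As a rough illustration of that formula, the sketch below converts the difference to hours. The tibble `toy_sessions` and its timestamps are invented, and `ms_per_hour` is a helper name introduced here, not something from the dataset.

```r
library(tidyverse)

# Hypothetical sessions; timestamps are Unix times in milliseconds (invented values)
toy_sessions <- tibble(
    original_start_time = c(1.7200e12, 1.7201e12),
    original_end_time   = c(1.7200e12 + 1.8e6, 1.7201e12 + 5.4e6)
)

# Number of milliseconds in one hour
ms_per_hour <- 1000 * 60 * 60

# Session length = end - start (milliseconds), converted to hours
toy_durations <- toy_sessions |>
    mutate(duration_hours = (original_end_time - original_start_time) / ms_per_hour)

toy_durations
```

Here the two invented sessions last 0.5 and 1.5 hours.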

In [None]:
session_data <- read_csv("data/sessions.csv")
head(session_data)

#### Summary Table of `original_start_time` and `original_end_time`

In [None]:
summary_original_start_time <- session_data |>
    select(original_start_time) |>
    summarize(variable = "original_start_time",
              num_session = n(),
              avg_start_time = mean(original_start_time, na.rm = TRUE),
              avg_start_date = as.POSIXct(mean(original_start_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"),

              latest_start_time = max(original_start_time, na.rm = TRUE),
              latest_start_date = as.POSIXct(max(original_start_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"),

              earliest_start_time = min(original_start_time, na.rm = TRUE),
              earliest_start_date = as.POSIXct(min(original_start_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"))
summary_original_start_time

In [None]:
summary_original_end_time <- session_data |>
    select(original_end_time) |>
    summarize(variable = "original_end_time",
              num_session = n(),
              avg_end_time = mean(original_end_time, na.rm = TRUE),
              avg_end_date = as.POSIXct(mean(original_end_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"),

              latest_end_time = max(original_end_time, na.rm = TRUE),
              latest_end_date = as.POSIXct(max(original_end_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"),

              earliest_end_time = min(original_end_time, na.rm = TRUE),
              earliest_end_date = as.POSIXct(min(original_end_time, na.rm = TRUE)/1000, origin = "1970-01-01", tz = "UTC"))
summary_original_end_time

##### Potential Issues with Data
1. Data may contain missing values such as **NA** in `original_end_time`.
2. There are **more sessions** than unique players, so a single player can appear in multiple rows; the table does not present per-player totals directly.
3. Data is mostly collected during **summer**, so it may not help predict behaviour at other times of the year.
4. Data does not account for **reconnections** or **network issues**.
5. Played time does not match between **sessions.csv and players.csv** (shown below).

In [None]:
session_played_time <- session_data |>
    filter(hashedEmail == "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf") |>
    select(original_start_time, original_end_time) |>
    mutate(diff = original_end_time - original_start_time) |>  # session length in milliseconds
    summarize(session_played_time = sum(diff, na.rm = TRUE) / (1000 * 60 * 60))  # milliseconds to hours
player_played_time <- player_data |>
    filter(hashedEmail == "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf") |>
    select(played_hours)
session_played_time
player_played_time
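
The single-player check above could be extended to every player by summing session lengths per `hashedEmail` and joining the result to the player table. The sketch below uses two invented tibbles, `toy_players` and `toy_sessions`, in place of the real files, and assumes the timestamps are in milliseconds.

```r
library(tidyverse)

# Hypothetical stand-ins for players.csv and sessions.csv (all values are invented)
toy_players <- tibble(hashedEmail = c("a1", "b2"),
                      played_hours = c(2.0, 0.5))
toy_sessions <- tibble(
    hashedEmail         = c("a1", "a1", "b2"),
    original_start_time = c(0, 7.2e6, 0),
    original_end_time   = c(3.6e6, 1.08e7, 3.6e6)
)

comparison <- toy_sessions |>
    # session length in hours (3.6e6 milliseconds per hour)
    mutate(diff_hours = (original_end_time - original_start_time) / 3.6e6) |>
    group_by(hashedEmail) |>
    summarize(session_hours = sum(diff_hours, na.rm = TRUE)) |>
    # attach the hours reported in the player table
    inner_join(toy_players, by = "hashedEmail") |>
    # positive values mean sessions add up to more than the reported hours
    mutate(mismatch = session_hours - played_hours)

comparison
```

A non-zero `mismatch` flags players whose session totals disagree with `played_hours`.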

## Research Question
The **broad question** I will address is: what "kinds" of players are most likely to contribute a large amount of data? <br>
A derived, narrower **research question** is: how do `experience`, `subscribe`, and `Age` affect the `played_hours` of players in players.csv? <br>
Each variable is a player trait that may relate to contributing a large amount of data:
- `experience` shows which experience levels are more likely to play.
- `subscribe` may indicate how much attention a player pays to the game; more interested players may play more.
- `Age` can show differences in the game's popularity across age groups. <br>

`gender` may not be an appropriate predictor, since the number of samples can differ greatly between groups.

### Wrangling Plan
The aim of the wrangling is to prepare all variables for a KNN regression model that predicts played hours.
1. Remove distracting columns such as `hashedEmail` and `name`.
2. Remove rows with **NA**.
3. Use `pivot_wider` to turn `experience` and `subscribe` into indicator columns, where the value is 1 if the player belongs to that category and 0 otherwise. <br>
- Encoding each category as 1 or 0 turns the original qualitative values into quantitative values, allowing regression to take place.
- Dropping `hashedEmail` and `name` minimizes distraction, since this project does not use data from sessions.csv.
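
The three steps could look roughly like the sketch below. It uses `pivot_wider`, since indicator columns make the table wider. The tibble `toy_players`, its values, and the helper columns `row_id` and `present` are all invented for illustration.

```r
library(tidyverse)

# Hypothetical stand-in for player_data (all values are invented)
toy_players <- tibble(
    hashedEmail  = c("a1", "b2", "c3", "d4"),
    name         = c("P1", "P2", "P3", "P4"),
    experience   = c("Beginner", "Pro", "Veteran", "Pro"),
    subscribe    = c(TRUE, FALSE, TRUE, TRUE),
    Age          = c(17, 22, NA, 30),
    played_hours = c(0.0, 1.5, 3.2, 0.4)
)

wrangled_players <- toy_players |>
    select(experience, subscribe, Age, played_hours) |>  # step 1: drop identifier columns
    drop_na() |>                                         # step 2: remove rows with NA
    mutate(subscribe = as.numeric(subscribe),            # TRUE/FALSE becomes 1/0
           row_id = row_number(),                        # keeps each player on its own row
           present = 1) |>
    pivot_wider(names_from = experience,                 # step 3: one 0/1 column per category
                values_from = present,
                values_fill = 0,
                names_prefix = "experience_") |>
    select(-row_id)

wrangled_players
```

Only categories that remain after `drop_na()` get a column, so the invented `Veteran` row (missing `Age`) leaves no `experience_Veteran` column here.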

## Data Analysis and Visualization
### Simple wrangling of player_data.csv
Since visualization can use categorical values directly, the conversion of categorical to quantitative data can be postponed until predictive analysis.

In [None]:
tidy_player_data <- player_data |>
    select(experience, subscribe, Age, played_hours)

### Relationship between `experience` and `avg_played_hours`

In [None]:
options(repr.plot.height = 7, repr.plot.width = 10)

experience_vs_played_hours <- tidy_player_data |>
    mutate(experience = factor(experience, levels = c("Beginner", "Regular", "Amateur", "Pro", "Veteran"))) |>
    filter(!is.na(experience)) |>  # drop players with missing or unrecognized experience
    group_by(experience) |>
    summarize(avg_played_hours = mean(played_hours, na.rm = TRUE))|>
    ggplot(aes(x = experience, y = avg_played_hours)) +
    geom_col() +
    labs(x = "Prior Experience with MineCraft (More Experienced to The Right)",
         y = "Average Hours Played on the Server (hours)",
         title = "Relationship between Prior Experience and Average Playing Duration")+
    theme(text = element_text(size = 18))
experience_vs_played_hours

#### Observation:
Prior experience appears to be related to playing time. Counterintuitively, the highest average played hours (y-axis) occur in the Regular and Amateur categories (x-axis), rather than in the more experienced groups. The relationship is therefore non-linear: more experience does not consistently correspond to more play time.

### Relationship between `subscribe` and `avg_played_hours`

In [None]:
options(repr.plot.height = 7, repr.plot.width = 10)

subscribe_vs_played_hours <- tidy_player_data |>
    group_by(subscribe) |>
    summarize(avg_played_hours = mean(played_hours, na.rm = TRUE))|>
    ggplot(aes(x = subscribe, y = avg_played_hours)) +
    geom_col() +
    labs(x = "Subscribe to MineCraft Newsletter (TRUE or FALSE)",
         y = "Average Hours Played on the Server (hours)",
         title = "Relationship between Subscribing to Game Newsletter and Average Playing Duration") +
    theme(text = element_text(size = 15.5))
subscribe_vs_played_hours

#### Observation:
Players who subscribe to the newsletter (x-axis) have more average played hours (y-axis) than those who do not.

### Relationship between `Age` and `played_hours`

In [None]:
options(repr.plot.height = 7, repr.plot.width = 9)

age_vs_played_hours <- tidy_player_data |>
    ggplot(aes(x = Age, y = played_hours)) +
    geom_point() +
    labs(x = "Age (integer numbers)",
         y = "Hours Played on the Server (hours)",
         title = "Relationship between Age and Playing Duration")+
    theme(text = element_text(size = 20))
age_vs_played_hours

#### Observation:
The highest playing durations occur among players roughly 15 to 20 years old, who are mostly teenagers. However, most data points have 0.0 hours of playing time. This means a single trait is not enough to determine the "kind" of player who plays the most, so other variables need to be involved.
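
The share of players with 0.0 hours could be quantified as in the sketch below. The tibble `toy_players` and its values are invented, and `tidy_player_data` would replace it with the real data.

```r
library(tidyverse)

# Hypothetical stand-in for tidy_player_data (all values are invented)
toy_players <- tibble(Age          = c(16, 17, 19, 24, 31),
                      played_hours = c(0, 0, 48.5, 0, 0.3))

zero_share <- toy_players |>
    summarize(num_players = n(),
              num_zero_hours = sum(played_hours == 0, na.rm = TRUE),     # players with no play time
              prop_zero_hours = mean(played_hours == 0, na.rm = TRUE))   # share of such players

zero_share
```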

## Methods and Plan
The method I would choose is a KNN regression model.
1. KNN regression predicts a continuous variable, `played_hours` in this case, so KNN classification is inappropriate. Linear regression is also inappropriate because some relationships are clearly non-linear, such as `Age`, where played hours peak in the middle of the range.
2. Assumptions include:
   - Categorical values are independent (beginner and pro are two independent variables)
   - Scaling is reasonable (`Age` has a much larger scale than the 0/1 indicator variables)
   - Balanced data distribution
   - The closer the points, the more related they are
3. Limitations include:
   - Fails to predict out-of-range values
   - Outliers may dominate the predictions
   - Categorical values, such as the levels of `experience`, may be correlated
4. I will select the model based on the results of cross-validation: the lower the RMSPE on validation data, the better the model.
5. Qualitative variables are turned into 0s and 1s, whereas `Age` is centered and scaled to a standard deviation of 1. With a random seed set, I will split the data with prop = 0.8, where the larger part is used for training and the rest for testing. I will choose the best k from 5, 7, ..., 21, using 5-fold cross-validation to find the best model.
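
The plan in step 5 could be sketched with the tidymodels packages as follows. This is one possible implementation under stated assumptions, not a final analysis: `toy_players` is synthetic data generated only so the code runs, the seed value is arbitrary, the `kknn` package is assumed to be installed, and `recipes::step_dummy()` with `one_hot = TRUE` is swapped in for manual 0/1 encoding.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed, chosen only for reproducibility

# Synthetic stand-in for the wrangled player data (not real values)
toy_players <- tibble(
    experience   = factor(sample(c("Beginner", "Regular", "Amateur", "Pro", "Veteran"),
                                 200, replace = TRUE)),
    subscribe    = as.numeric(sample(c(TRUE, FALSE), 200, replace = TRUE)),
    Age          = as.numeric(sample(10:50, 200, replace = TRUE)),
    played_hours = round(rexp(200, rate = 0.5), 1)
)

# 80% of rows for training, the rest for testing
player_split <- initial_split(toy_players, prop = 0.8)
player_train <- training(player_split)
player_test  <- testing(player_split)

# 0/1 columns for experience; Age centered and scaled to sd 1
player_recipe <- recipe(played_hours ~ ., data = player_train) |>
    step_dummy(experience, one_hot = TRUE) |>
    step_center(Age) |>
    step_scale(Age)

# KNN regression with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

# 5-fold cross-validation over k = 5, 7, ..., 21
player_vfold <- vfold_cv(player_train, v = 5)
k_grid <- tibble(neighbors = seq(from = 5, to = 21, by = 2))

knn_results <- workflow() |>
    add_recipe(player_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = player_vfold, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "rmse") |>
    arrange(mean)  # lowest cross-validated RMSE first

knn_results
```

The first row of `knn_results` gives the k with the lowest cross-validated error; the model would then be refit with that k on `player_train` and evaluated once on `player_test`.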