# Welcome to the Minecraft Group Project!

## <ins>Introduction</ins>
Due to COVID-19 and the strict lockdown policies imposed in many countries, the number of gamers surged in 2020, reaching 2.7 billion. As a result, many studies have examined how gamers play video games and how video games can affect an individual’s well-being, cognitive performance and brain activity ([Johannes et al., 2021](https://doi.org/10.1098/rsos.202049); [Jordan & Dhamala, 2022](https://doi.org/10.1016/j.ynirp.2022.100112)).

Despite the global increase in the number of gamers and the potential advantages associated with video gaming, numerous studies have been hindered by limited sample sizes ([Alonso-Fernández et al., 2019](https://doi.org/10.1016/j.compedu.2019.103612); [Petri & Gresse, 2017](https://doi.org/10.1016/j.compedu.2017.01.004)) and by inaccuracies stemming from reliance on self-reported engagement metrics ([Johannes et al., 2021](https://doi.org/10.1098/rsos.202049)). To address these issues, a team of computer scientists from the University of British Columbia, known as PLAICraft, has developed a study that automates data collection during players' Minecraft sessions, which removes the need for self-evaluation. PLAICraft also aims to identify, from previous datasets, the player types that are likely to generate a greater volume of data, and this is the central focus of this paper. Specifically, this paper investigates which gender is likely to contribute more data, using a K-NN classification model based on the age and total playtime of the players.

The dataset used in this study, derived from `players.csv` (courtesy of the PLAICraft team; [here](https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz)), contains a total of 196 observations and nine variables. These variables are listed and described in the table below:

<!DOCTYPE html>
<html>
<body>
    <table style="border-collapse: collapse; width: 40%; margin: auto auto; table-layout: auto; border: 0px solid black;">
        <caption style="font-size: 1.1em; font-weight: bold; margin-bottom: 5px; text-align: center;">
            Table 1: Variable Names, Data Types and Meanings
        </caption>
        <tr>
            <th style="border: 0px solid black; text-align: center; padding: 10px;">Name</th>
            <th style="border: 0px solid black; text-align: center; padding: 10px;">Data Type</th>
            <th style="border: 0px solid black; text-align: center; padding: 10px;">Meaning</th>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Experience</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">chr</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Self-evaluated experience with Minecraft</td>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Subscribe</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">lgl</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Whether the player agreed to receive email updates</td>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Hashed Email</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">chr</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Hashed email address</td>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Played Hours</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">dbl</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Total played hours</td>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Name</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">chr</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">A fake name used in-game</td>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Gender</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">chr</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Gender of players</td>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Age</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">dbl</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Age of players</td>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Individual ID</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">lgl</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">The player's ID in-game</td>
        </tr>
        <tr style="border: 0px solid black;">
            <td style="border: 0px solid black; text-align: left; padding: 8px;">Organization Name</td>
            <td style="border: 0px solid black; text-align: center; padding: 8px;">lgl</td>
            <td style="border: 0px solid black; text-align: left; padding: 8px;">The name of the players’ school/organization</td>
        </tr>
    </table>
</body>
</html>


It is important to note that, while the data can provide meaningful insights into relevant topics, it also has potential issues. These include (1) missing values for `Individual ID` and `Organization Name` (reported as NA), (2) potential inaccuracy in the self-reported age (e.g. ages 91 and 99) and (3) playtime that might not correlate with a player's level of contribution to the study (e.g. a microphone can also be used).
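The first two issues can be checked programmatically. The following is a minimal base R sketch on a made-up data frame: `toy_players` and its values are illustrative only, and the cutoff `max_plausible_age` is an assumption chosen for the example, not a rule used by the PLAICraft team.

```r
# Toy stand-in for players.csv; the values are illustrative, not the real data
toy_players <- data.frame(
  played_hours     = c(30.3, 3.8, 0.0, 0.7),
  age              = c(9, 17, 99, 21),
  individualId     = c(NA, NA, NA, NA),
  organizationName = c(NA, NA, NA, NA)
)

# Fraction of missing values per column; 1 means the column is entirely NA
na_fraction <- colMeans(is.na(toy_players))
all_na_cols <- names(na_fraction)[na_fraction == 1]

# Flag rows whose age exceeds a hypothetical plausibility cutoff
max_plausible_age <- 90  # assumption: chosen only for illustration
suspect_rows <- which(toy_players$age > max_plausible_age)

print(all_na_cols)
print(suspect_rows)
```

Whether flagged ages should be dropped or kept is a judgment call; the sketch only locates them.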

## <ins>Methods & Results</ins>
*   **Methods & Results**:
    *   describe the methods you used to perform your analysis from beginning to end that narrates the analysis code.
    *   your report should include code which:
        *   loads data  
        *   wrangles and cleans the data to the format necessary for the planned analysis
        *   performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
        *   creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
        *   performs the data analysis
        *   creates a visualization of the analysis 
        *   _note: all figures should have a figure number and a legend_

In [1]:
library(repr)
library(tidyverse)
library(tidymodels)
library(cowplot)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

### (1) Load the Data
After confirming that the data is a .csv file with a comma (",") as the delimiter, we read it in with the `read_csv()` function from the `tidyverse` package, using the URL provided by the PLAICraft team.

In [2]:
# Constants
url_players <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"

# Reading in the data via URL
mc_players <- read_csv(url_players)
head(mc_players, 10)


Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<lgl>,<lgl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9,,
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17,,
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17,,
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21,,
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21,,
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17,,
Regular,True,8e594b8953193b26f498db95a508b03c6fe1c24bb5251d392c18a0da9a722807,0.0,Luna,Female,19,,
Amateur,False,1d2371d8a35c8831034b25bda8764539ab7db0f63938696917c447128a2540dd,0.0,Emerson,Male,21,,
Amateur,True,8b71f4d66a38389b7528bb38ba6eb71157733df7d1740371852a797ae97d82d1,0.1,Natalie,Male,17,,
Veteran,True,bbe2d83de678f519c4b3daa7265e683b4fe2d814077f9094afd11d8f217039ec,0.0,Nyla,Female,22,,


### (3) Data Wrangling
Upon initial examination, it is clear that the columns `individualId` and `organizationName` provide no information, as all of their values are `NA`. The `experience`, `subscribe`, `hashedEmail` and `name` columns were also dropped, as they have no bearing on the question of this paper. We kept the remaining columns using the `select()` function.

In [3]:
# Remove unnecessary columns
mc_cleaned <- mc_players |>
    select(played_hours, age, gender) |>
    arrange(age)
head(mc_cleaned, 5) # shorten for easier preview

played_hours,age,gender
<dbl>,<dbl>,<chr>
0.3,8,Male
30.3,9,Male
3.6,10,Male
2.9,11,Male
0.5,12,Male


Here we implemented a KNN regression model to predict total playtime (`played_hours`) from the player's `age`.

**Preprocessing**:
  - Split the dataset into training (75%) and testing (25%) subsets, stratified by `played_hours`.
  - Scaled and centered `age`.

**Tuning**:
  - Performed 10-fold cross-validation to tune the number of `neighbors` (1–100).
  - Selected the model with the lowest RMSE.

**Model Training and Prediction**:
  - Trained the final KNN model with the optimal number of neighbors.

**Visualization**:
  - Created a scatter plot of actual vs. predicted `played_hours` against `age` with a fitted line.
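To make the prediction step concrete, the following is a minimal base R sketch of unweighted K-NN regression on a single standardized predictor. It is not the `kknn` implementation used in the analysis; `train_age`, `train_hours` and `knn_predict()` are hypothetical names, and the numbers are made up for illustration.

```r
# Illustrative training data (not taken from players.csv)
train_age   <- c(10, 12, 17, 18, 21, 25, 30)
train_hours <- c(3.0, 2.5, 1.0, 0.5, 0.2, 0.1, 0.0)

# Center and scale the predictor using training-set statistics only
age_mean <- mean(train_age)
age_sd   <- sd(train_age)
standardize <- function(x) (x - age_mean) / age_sd

# Unweighted ("rectangular") K-NN regression:
# the prediction is the plain average of the k nearest training responses
knn_predict <- function(new_age, k) {
  dists   <- abs(standardize(train_age) - standardize(new_age))
  nearest <- order(dists)[seq_len(k)]
  mean(train_hours[nearest])
}

# Prediction for a 16-year-old using the 3 nearest neighbors
knn_predict(new_age = 16, k = 3)
```

With a single predictor, standardizing does not change which neighbors are nearest; it matters once several predictors on different scales enter the distance.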


In [None]:
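# Set the plot size and fix the random seed for reproducibility
options(repr.plot.width = 12, repr.plot.height = 12)
set.seed(1)

# Split the data into training (75%) and testing (25%) sets, stratified by the response
mc_split <- initial_split(mc_cleaned, prop = 0.75, strata = played_hours)
mc_train <- training(mc_split)
mc_test <- testing(mc_split)

# K-NN regression specification; the number of neighbors is left to be tuned
mc_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression") 

# Recipe: predict played_hours from age, with the predictor scaled and centered
mc_recipe <- recipe(played_hours ~ age, data = mc_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# 10-fold cross-validation on the training set
mc_vfold <- vfold_cv(mc_train, v = 10, strata = played_hours)

# Candidate numbers of neighbors: 1 to 100
gridvals <- tibble(neighbors = seq(from = 1, to = 100, by = 1)) 

# Workflow combining the tunable model and the recipe
mc_wkflw <- workflow() |>
    add_model(mc_spec) |>
    add_recipe(mc_recipe) 

# Tune over the grid and keep the cross-validated RMSE for each candidate
mc_results <- mc_wkflw |>
  tune_grid(resamples = mc_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# Number of neighbors with the lowest cross-validated RMSE
# (with_ties = FALSE guarantees a single value even if several candidates tie)
mc_min <- mc_results |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
mc_min

# Final model specification with the selected number of neighbors
mc_spec_2 <- nearest_neighbor(weight_func = "rectangular", neighbors = mc_min) |>
    set_engine("kknn") |>
    set_mode("regression") 

# Fit the final model on the full training set
mc_fit <- workflow() |>
    add_recipe(mc_recipe) |>
    add_model(mc_spec_2) |>
    fit(data = mc_train) 

# Predictions on the training data, used to draw the fitted line
mc_preds <- mc_fit |>
  predict(mc_train) |>
  bind_cols(mc_train)

# Scatter plot of the observed hours with the K-NN predictions overlaid as a line
mc_plot_final <- ggplot(mc_preds, aes(x = age, y = played_hours)) +
    geom_point(alpha = 0.4) +
    geom_line(data = mc_preds,
              mapping = aes(x = age, y = .pred),
              color = "steelblue",
              linewidth = 1) +
    labs(x = "Age of the Player",
         y = "Hours Played (hr)",
        title = "Figure 1: Hours Played (hr) vs. Age of the Player") +
    theme(text = element_text(size = 14))
mc_plot_final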
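The tuning step in the cell above can also be sketched without tidymodels, which may help clarify what cross-validation is doing. The base R example below is a simplified illustration under several assumptions: the data are synthetic, it uses 5 unstratified folds rather than the 10 stratified folds of the analysis, and it searches only 1 to 20 neighbors. The helper names `knn_predict()` and `cv_rmse()` are hypothetical.

```r
set.seed(1)

# Synthetic data standing in for the training split (values are made up)
n     <- 60
age   <- runif(n, 8, 50)
hours <- pmax(0, 10 - 0.2 * age + rnorm(n))

# Unweighted K-NN regression on a single predictor
knn_predict <- function(train_x, train_y, new_x, k) {
  sapply(new_x, function(x0) {
    nearest <- order(abs(train_x - x0))[seq_len(k)]
    mean(train_y[nearest])
  })
}

# Assign each row to one of v folds at random (no stratification here)
v     <- 5
folds <- sample(rep(seq_len(v), length.out = n))

# Cross-validated RMSE for one candidate k:
# hold out each fold in turn, predict it from the others, average the fold RMSEs
cv_rmse <- function(k) {
  fold_rmse <- sapply(seq_len(v), function(f) {
    held_out <- folds == f
    preds <- knn_predict(age[!held_out], hours[!held_out], age[held_out], k)
    sqrt(mean((hours[held_out] - preds)^2))
  })
  mean(fold_rmse)
}

# Evaluate every candidate and keep the one with the lowest RMSE
candidates <- 1:20
rmse_by_k  <- sapply(candidates, cv_rmse)
best_k     <- candidates[which.min(rmse_by_k)]
best_k
```

`which.min()` returns the first minimum, so `best_k` is always a single value even when several candidates share the lowest RMSE.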

In [None]:
# RMSPE: RMSE of the final model's predictions on the held-out test set
RMSPE <- mc_fit |>
    predict(mc_test) |>
    bind_cols(mc_test) |>
    metrics(truth = played_hours, estimate = .pred) |>
    filter(.metric == "rmse") |>
    select(.estimate) |>
    pull()
RMSPE
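For reference, the RMSPE is the square root of the mean squared difference between observed and predicted values on data the model has not seen. The base R sketch below computes it by hand; the vectors `observed` and `predicted` are made-up numbers, not results from this analysis.

```r
# Made-up observed and predicted values, for illustration only
observed  <- c(0.0, 1.5, 3.0, 10.0)
predicted <- c(1.0, 1.5, 2.0, 6.0)

# RMSPE: square root of the mean squared prediction error on held-out data
rmspe <- sqrt(mean((observed - predicted)^2))
rmspe
```

The result is in the same unit as the response, so for this report it would be read as hours played.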





### (4) Some Calculations (and Visualization)

# To-Do

**Victor**
- [ ] Re-do the introduction with the newer questions + describe the second data set
- [ ] Clean up the comments + make the graphs look better + add comments for code readability

**Andy**
- [ ] Methods & Results: (the remaining bullet points)

**Jack**
- [ ] Methods & Results: Descriptions (first bullet point)

**Danny**
- [ ] Discussion

## References
- Alonso-Fernández, C., Calvo-Morata, A., Freire, M., Martínez-Ortiz, I., & Fernández-Manjón, B. (2019). Applications of data science to game learning analytics data: A systematic literature review. Computers & Education, 141, 103612. <https://doi.org/10.1016/j.compedu.2019.103612>

- Johannes, N., Vuorre, M., & Przybylski, A. K. (2021). Video game play is positively correlated with well-being. Royal Society Open Science, 8(2), 202049. <https://doi.org/10.1098/rsos.202049>

- Jordan, T., & Dhamala, M. (2022). Video game players have improved decision-making abilities and enhanced brain activities. Neuroimage: Reports, 2(3), 100112. <https://doi.org/10.1016/j.ynirp.2022.100112> 

- Petri, G., & Gresse, C. (2017). How games for computing education are evaluated? A systematic literature review. Computers & Education, 107, 68–90. <https://doi.org/10.1016/j.compedu.2017.01.004>