# Final Project Report: Is a player’s age a good predictor of total played hours?
Group 007-2: Sohan Sadeque, Io Santiago, Maggie Tu, Fangfei Zhu

### Introduction:
PLAICraft is a project run by the Pacific Laboratory for Artificial Intelligence (PLAI) group in UBC’s Computer Science Department. Participants play *Minecraft* for free in their browser on a shared server world, and data such as audio, key presses, mouse inputs, and video footage are collected from them to train an artificial intelligence model.
Our group first chose a broad question to answer using the datasets from this project: <br>

**Which ‘kinds’ of players are most likely to contribute a large amount of data?** <br>

To answer this broad question, we first had to make it more specific. We chose ‘total played hours’ as the measure of how much data a player contributes, and age as the variable to relate it to. This report therefore focuses on the following specific question: <br>

**Can age predict the total played hours for participants?** <br>

To answer this question, we were provided with two datasets: ‘players.csv’ and ‘sessions.csv’. <br>
The players.csv dataset describes the PLAICraft players themselves. It contains 196 observations and 7 variables: <br>
<ol>
    <li>
        experience: The player’s prior experience with <i>Minecraft</i>
    </li>
    <li>
        subscribe: Whether the player is subscribed to the PLAICraft mailing list
    </li>
    <li>
        hashedEmail: The player’s hashed email address, which serves as an anonymized identifier
    </li>
    <li>
        played_hours: The player’s total hours on the PLAICraft server
    </li>
    <li>
        name: The player’s first name
    </li>
    <li>
        gender: The player’s gender
    </li>
    <li>
        Age: The player’s age
    </li>
</ol>
The second dataset, sessions.csv, describes PLAICraft play sessions. It contains 1,535 observations, each representing one play session by an individual player on the server, and 5 variables:
<ol>
    <li>
        hashedEmail: The player’s hashed email address, which serves as an anonymized identifier
    </li>
    <li>
        start_time: The play session’s start time, including day, month, year, and time in 24-hour clock format
    </li>
    <li>
        end_time: The play session’s end time, including day, month, year, and time in 24-hour clock format
    </li>
    <li>
        original_start_time: The play session’s start time in a different format.
    </li>
    <li>
        original_end_time: The play session’s end time in a different format.
    </li>
</ol>
Because both variables in our question, played_hours and Age, are in the <b>players.csv</b> dataset, our methods use only that dataset.

### Methods & Results:

#### Loading the data

In [None]:
library(tidyverse)
library(repr)
library(ggplot2)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [None]:
players <- read_csv("https://raw.githubusercontent.com/maggiettu/dsci100-group-project/refs/heads/main/players.csv")
players
summary(players)

In [None]:
sessions <- read_csv("https://raw.githubusercontent.com/maggiettu/dsci100-group-project/refs/heads/main/sessions.csv")
sessions
summary(sessions)

### Wrangling and Cleaning the Dataset

In [None]:
players_select <- players |>
        select(played_hours,Age)
players_select

In [None]:
# Rename Age to lower case so the column names follow one convention.
players_clean <- players_select |>
        rename(age = Age)

In [None]:
# Mean of each variable; na.rm = TRUE ignores missing ages.
players_mean <- players_clean |>
        summarize(
            mean_played_hours = mean(played_hours),
            mean_age = mean(age, na.rm = TRUE))

players_mean

### Summary of the data

### Visualization of the dataset

In [None]:
# Scatter plot of total played hours against player age.
players_plot <- players_clean |>
    ggplot(aes(x = age, y = played_hours)) +
    geom_point() +
    labs(x = "Player age (years)", y = "Total played time (hours)") +
    ggtitle("Total played time versus player age") +
    theme(text = element_text(size = 15))
players_plot

### Data Analysis

In [None]:
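# Set the seed before splitting so the train/test partition is reproducible.
set.seed(1234)

# Hold out 25% of the players for testing, stratified on the response.
players_split <- initial_split(players_clean, prop = 0.75, strata = played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)

# K-nearest neighbours regression; the number of neighbours is tuned later.
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("regression")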

players_recipe <- recipe(played_hours ~ age, data = players_training) |>
      step_scale(all_predictors()) |>
      step_center(all_predictors())
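
The recipe and model specification above can be illustrated by hand. The following is a minimal base-R sketch, not the notebook's actual pipeline: the `toy` data frame and the `knn_predict()` helper are hypothetical and exist only for illustration. It standardizes age in the same order as the recipe (scale, then center) and predicts played hours as the unweighted mean of the K nearest neighbours, which is what "rectangular" weighting amounts to.

```r
# Hypothetical toy data, not taken from players.csv.
toy <- data.frame(
  age = c(16, 19, 22, 25, 40),
  played_hours = c(3.0, 0.5, 2.5, 0.0, 10.0)
)

# Standardize age in the same order as the recipe: divide by the
# standard deviation, then subtract the mean of the scaled values.
age_sd <- sd(toy$age)
age_scaled <- toy$age / age_sd
age_centre <- mean(age_scaled)
toy$age_std <- age_scaled - age_centre

# Predict played hours for a new age as the unweighted mean of the
# K nearest training points ("rectangular" weighting).
knn_predict <- function(new_age, k) {
  new_std <- new_age / age_sd - age_centre
  nearest <- order(abs(toy$age_std - new_std))[seq_len(k)]
  mean(toy$played_hours[nearest])
}

# The three players closest to age 20 are aged 19, 22 and 16,
# so the prediction is mean(0.5, 2.5, 3.0) = 2.
pred_k3 <- knn_predict(20, k = 3)
pred_k3
```

With a single predictor, standardizing does not change which neighbours are nearest, so it matters mainly when further predictors are added to the recipe.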

In [None]:
set.seed(1234)

players_vfold <- vfold_cv(players_training, v = 5, strata = played_hours)

players_workflow <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(players_spec)

In [None]:
set.seed(2019)

gridvals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))


players_results <- players_workflow |>
  tune_grid(resamples = players_vfold, grid = gridvals) |>
  collect_metrics()

players_results
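
To show how a table like `players_results` can be read, here is a small base-R sketch. The numbers are hypothetical, not results from this analysis, and `toy_results` reproduces only a subset of the columns that `collect_metrics()` returns. It computes RMSE from its definition and then picks the K whose mean cross-validated RMSE is lowest.

```r
# RMSE: square root of the mean squared difference between
# observed and predicted values.
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Hypothetical observed and predicted played hours.
# The errors are 0, 0, -2 and 2, so the RMSE is sqrt(2).
observed <- c(1, 2, 3, 4)
predicted <- c(1, 2, 5, 2)
example_rmse <- rmse(observed, predicted)

# Hypothetical cross-validation summary with one row per K and metric
# (made-up values, for illustration only).
toy_results <- data.frame(
  neighbors = c(1, 1, 2, 2, 3, 3),
  .metric = c("rmse", "rsq", "rmse", "rsq", "rmse", "rsq"),
  mean = c(9.1, 0.02, 7.4, 0.05, 8.0, 0.03)
)

# Keep the RMSE rows and take the K with the smallest mean RMSE.
rmse_rows <- toy_results[toy_results$.metric == "rmse", ]
best_k <- rmse_rows$neighbors[which.min(rmse_rows$mean)]
best_k
```

In this toy table the smallest mean RMSE belongs to K = 2, so `best_k` is 2.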

In [None]:
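# Choose the K with the lowest cross-validated RMSE.
k_min <- players_results |>
  filter(.metric == "rmse") |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit the workflow on the full training set using that K.
players_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k_min) |>
  set_engine("kknn") |>
  set_mode("regression")
players_best_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(players_best_spec) |>
  fit(data = players_training)

# Predict on the training data and overlay the fitted line on the points.
players_preds <- players_best_fit |>
  predict(players_training) |>
  bind_cols(players_training)

players_plot <- ggplot(players_preds, aes(x = age, y = played_hours)) +
  geom_point() +
  geom_line(mapping = aes(x = age, y = .pred), color = "blue") +
  xlab("Player age (years)") +
  ylab("Total played time (hours)") +
  ggtitle(paste0("K = ", k_min))

players_plot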

### Discussion:

### References