# [Title]
#### by Group 19 (Theo Marill, Finn Piney, Cecilia Xu, Hayoung Cho)

## Introduction
### Background  

This project aims to explore player behavior on a dedicated Minecraft server, managed by the Pacific Laboratory for Artificial Intelligence (PLAI) at the University of British Columbia (UBC). Minecraft offers an open-world environment where players have significant freedom to explore, build, and interact. By analyzing player sessions and characteristics, this study seeks to uncover patterns that can inform server management, recruitment strategies, and resource allocation. The primary goal of this analysis is to determine if player experience levels are predictive of session length, which could help tailor recruitment efforts to specific player profiles.

### Question  
Is the experience level of players predictive of the length and time of their sessions?

This question is critical for understanding whether recruitment should target players with specific experience levels (e.g., "Pro" or "Veteran") or if a broader recruitment strategy is equally effective. By examining the relationship between self-reported experience and session duration, this project intends to provide actionable insights for the PLAI group.

### Data
The data for this project consist of player and session information from gamers using a free Minecraft server hosted and monitored by the Pacific Laboratory for Artificial Intelligence at UBC, which studies player behaviour to develop embodied AI. We have two `.csv` files, `players` and `sessions`, with 196 observations of 7 variables and 1535 observations of 5 variables, respectively.

In [None]:
# run this cell
library(tidyverse)
library(repr)
library(tidymodels)
library(RColorBrewer)
options(repr.matrix.max.rows = 6)

In [None]:
# run this cell
raw_players <- read_csv("https://raw.githubusercontent.com/Booch58/individual_contribution/refs/heads/main/players.csv", show_col_types = FALSE)
raw_sessions <- read_csv("https://raw.githubusercontent.com/Booch58/individual_contribution/refs/heads/main/sessions.csv", show_col_types = FALSE)
raw_players
raw_sessions

In [None]:
glimpse(raw_players)

##### `players.csv` Variables:  

- `experience` (character): The player's self-reported Minecraft experience level, categorized as Beginner, Amateur, Regular, Pro, or Veteran.

- `subscribe` (logical): Whether the player subscribes to the newsletter.

- `hashedEmail` (character): A unique, hashed identifier for each player.

- `played_hours` (double): The total hours a player has spent on the server (ranging from 0 to 223.1 hours).

- `name` (character): The player's in-game username.

- `gender` (character): The player's gender.

- `Age` (double): The player's age in years (ranging from 9 to 58, contains 2 NAs).

In [None]:
glimpse(raw_sessions)

##### `sessions.csv` Variables:  

- `hashedEmail` (character): The unique identifier linking sessions to players in the `players.csv` file.

- `start_time` (character): The session start time, recorded in a human-readable date and time format. 

- `end_time` (character): The session end time, also in a human-readable format.

- `original_start_time` (double): The start time as a numerical Unix timestamp.

- `original_end_time` (double): The end time as a numerical Unix timestamp.

To answer the research question, these datasets will be joined using the `hashedEmail` key. The duration of each session will be calculated from the start and end times, allowing for an analysis of session length distributions across different experience levels.   
Issues such as missing values in `Age` and timestamps will be addressed during the data cleaning process.
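
As a rough illustration of the join-and-duration plan described above, the sketch below uses two small made-up tables in place of the real files. It assumes the timestamps follow a day/month/year hour:minute layout and uses `lubridate`, which the notebook itself does not load; the names `toy_players`, `toy_sessions`, and `toy_joined` are hypothetical.

```r
library(dplyr)
library(lubridate)

# Toy stand-ins for the real tables (all values are made up for illustration)
toy_players <- tibble(
  hashedEmail = c("a1", "b2"),
  experience  = c("Pro", "Beginner")
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "18/06/2024 00:04", "25/07/2024 17:57")
)

toy_joined <- toy_sessions |>
  # Attach each player's experience level to their sessions
  inner_join(toy_players, by = "hashedEmail") |>
  mutate(
    # Assumed format: day/month/year hour:minute
    start_dt = dmy_hm(start_time),
    end_dt = dmy_hm(end_time),
    # Session length in minutes
    duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

toy_joined
```

Working with full date-times means that a session crossing midnight still gets a positive duration without any special handling. The analysis below takes a different route and works with minutes past midnight instead.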

## Methods and Results
We load the data again here; the packages were already loaded in the Introduction.

In [None]:
raw_players <- read_csv("https://raw.githubusercontent.com/Booch58/individual_contribution/refs/heads/main/players.csv", show_col_types = FALSE)
raw_sessions <- read_csv("https://raw.githubusercontent.com/Booch58/individual_contribution/refs/heads/main/sessions.csv", show_col_types = FALSE)
raw_players
raw_sessions

#### Tidying

Let's standardize the column naming as well as separate the start/end *date* and the *time* of `start_time`/`end_time`.

In [None]:
# Standardize column names and convert categorical columns to factors
players <- raw_players |>
    rename(hashed_email = hashedEmail, age = Age) |>
    mutate(experience = as_factor(experience),
           gender = as_factor(gender),
           hashed_email = fct_reorder(hashed_email, played_hours, .fun = sum))

# Split each timestamp into a date and a time of day,
# then express the time of day as minutes past 00:00
sessions <- raw_sessions |>
    rename(hashed_email = hashedEmail) |>
    separate(col = start_time, into = c('start_date', 'start_time'), sep = " ") |>
    separate(col = start_time, into = c('start_hr', 'start_m'), sep = ":") |>
    mutate(start_time = as.double(start_hr) * 60 + as.double(start_m)) |>
    separate(col = end_time, into = c('end_date', 'end_time'), sep = " ") |>
    separate(col = end_time, into = c('end_hr', 'end_m'), sep = ":") |>
    mutate(end_time = as.double(end_hr) * 60 + as.double(end_m)) |>
    select(-start_hr, -start_m, -end_hr, -end_m)
players
sessions

Some sessions span midnight. To account for this, we adjust `end_time` so that it is measured in **minutes past the most recent midnight before `start_time`**: for any session that ends on a later date than it started, we add 1440 minutes (one day) to `end_time`.

In [None]:
sessions_mid <- sessions |>
    filter(end_date != start_date) |>
    mutate(end_time = end_time + 1440)

sessions_day <- sessions |>
    filter(end_date == start_date)

sessions_adj <- sessions_mid |>
    bind_rows(sessions_day)

sessions_adj
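
The same adjustment can be sketched in a single pass with `if_else()`, shown here on a made-up two-row table rather than the real `sessions` data. The sketch assumes, as the code above does, that no session spans more than one midnight; `toy` and `toy_adj` are hypothetical names.

```r
library(dplyr)

# Two made-up sessions: one crosses midnight, one does not
toy <- tibble(
  start_date = c("17/06/2024", "30/06/2024"),
  end_date   = c("18/06/2024", "30/06/2024"),
  start_time = c(1413, 1092),  # 23:33 and 18:12, in minutes past 00:00
  end_time   = c(4, 1104)      # 00:04 and 18:24, in minutes past 00:00
)

toy_adj <- toy |>
  mutate(
    # Add one day (1440 minutes) when the session ends on a different date
    end_time = if_else(end_date != start_date, end_time + 1440, end_time),
    # With the adjustment, the length is always end minus start
    length = end_time - start_time
  )

toy_adj
```

Unlike the filter-and-bind approach, this keeps the rows in their original order, which matters if a later random split should be reproducible across both versions.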

### Summary of Relevant Data
Here we have: 

1. Calculated the mean session length for each experience level.
2. Counted the sessions contributed by players at each experience level.
3. Made histograms of session start times for each experience level. Note that the y-axis scale varies across experience levels so that less populated levels remain visible.

In [None]:
named_sessions <- sessions_adj |>
    merge(players)

lengths <- named_sessions |>
    mutate(length = end_time - start_time) |>
    select(length, experience) |>
    group_by(experience) |>
    summarize(mean_session_length = mean(length))

populations <- named_sessions |>
    group_by(experience) |>
    summarize(count = n()) |>
    select(experience, count)


start_grid <- named_sessions |>
    ggplot(aes(x=start_time, fill = factor(experience, levels = c("Beginner","Amateur","Regular","Pro","Veteran")))) +
        geom_histogram(binwidth = 120) +
        facet_grid(rows = vars(experience), scales = "free_y") +
        labs(x="Session Start Time \n(mins past 00:00)",
             y="Count", 
             fill="Player Experience Level") +
        theme(text = element_text(size = 20))


In [None]:
named_sessions

**fig. 1**

In [None]:
lengths

**fig. 2**

In [None]:
populations

**fig. 3**

In [None]:
start_grid

### Visualization for Analysis

In [None]:
# run this cell
options(repr.plot.width = 9, repr.plot.height = 8)

start_vs_end <- ggplot(named_sessions, aes(x=start_time, y=end_time, colour = factor(experience, levels = c("Beginner","Amateur","Regular","Pro","Veteran")))) +
    geom_point(alpha = 0.4) +
    labs(title="Start and End Times of Individual Sessions \nat Different Experience Levels",
         x="Start Time (mins past 00:00)",
         y="End Time (mins past 00:00)",
         colour="Experience") +
    xlim(0,1500) +
    theme(text = element_text(size = 20)) +
    scale_color_brewer(palette = "Dark2")

____
Here we have visualized each session as a point whose start time and end time are plotted on the x and y axes, respectively. Each point is coloured according to the player's experience level. Interpreting the straight line visible in the plot: points on the line y = x are sessions that lasted 0 minutes, and the higher a point sits above that line, the longer the session.  

**fig. 4**

In [None]:
start_vs_end

____
### Data Analysis

In [None]:
set.seed(1122)
named_split <- initial_split(named_sessions, 0.60)
named_training <- training(named_split)
named_testing <- testing(named_split)

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))
named_vfold <- vfold_cv(named_training, v = 10, strata = experience)

named_recipe <- recipe(experience ~ start_time + end_time, data = named_training) |>
                    step_scale(all_predictors()) |>
                    step_center(all_predictors())
named_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                    set_engine("kknn") |>
                    set_mode("classification")
named_fit <- workflow() |>
    add_recipe(named_recipe) |>
    add_model(named_spec) |>
    tune_grid(resamples = named_vfold, grid = k_vals) |>
    collect_metrics()

In [None]:
k_plot <- named_fit |>
    filter(.metric == "accuracy") |>
    ggplot(aes(x=neighbors,y=mean)) +
        geom_point() +
        geom_line() +
        labs(x="k-Neighbours",
             y="Accuracy Estimate")

____
Here we found that the best value of k (the number of neighbours) was around 21:

**fig. 5**

In [None]:
k_plot

____
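
Rather than reading the best k off the plot, it can also be pulled out programmatically. The sketch below uses a small hypothetical table shaped like the output of `collect_metrics()`; the accuracy values are invented for illustration and are not the project's results.

```r
library(dplyr)

# Hypothetical cross-validation results (values are illustrative only)
toy_metrics <- tibble(
  neighbors = c(1, 6, 11, 16, 21, 26),
  .metric   = "accuracy",
  mean      = c(0.52, 0.58, 0.61, 0.62, 0.63, 0.625)
)

best_k <- toy_metrics |>
  filter(.metric == "accuracy") |>
  # Keep the single row with the highest mean accuracy
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

When several values of k give nearly the same accuracy, the plot is still worth checking, since a slightly different k may be preferable on other grounds.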

In [None]:
named_tuned <- nearest_neighbor(weight_func = "rectangular", neighbors = 21) |>
                    set_engine("kknn") |>
                    set_mode("classification")
named_fit2 <- workflow() |>
    add_recipe(named_recipe) |>
    add_model(named_tuned) |>
    fit(data = named_training)

named_evaluation <- named_fit2 |>
    predict(named_testing) |>
    bind_cols(named_testing) 

accuracy <- named_evaluation |>
    metrics(truth = experience, estimate = .pred_class) |>
    filter(.metric == "accuracy") |>
    select(-.estimator)

conf <- named_evaluation |>
    conf_mat(truth = experience, estimate = .pred_class)

After creating our final analysis workflow, our accuracy on the testing data we set aside was:

In [None]:
pull(accuracy)

____

Here is also the confusion matrix indicating the number of predictions compared to the true observations:

**fig. 6**

In [None]:
conf

In [None]:
populations

____
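
One way to put an accuracy figure in context when the classes are unevenly sized is to compare it with a baseline that always predicts the most common class. The sketch below does this in base R on made-up labels; the class counts are hypothetical and are not taken from our data.

```r
# Hypothetical test-set labels (counts are illustrative, not the project's results)
truth <- factor(c(rep("Amateur", 60), rep("Regular", 25), rep("Pro", 15)))

# A baseline classifier that always predicts the most common class
majority_class <- names(which.max(table(truth)))
baseline_accuracy <- mean(truth == majority_class)

# Confusion matrix and per-class recall for that baseline
pred <- factor(rep(majority_class, length(truth)), levels = levels(truth))
cm <- table(prediction = pred, truth = truth)
recall <- diag(cm) / colSums(cm)

baseline_accuracy
recall
```

A model whose accuracy is close to this baseline, and whose recall is 0 for the smaller classes, is mostly reproducing the class imbalance rather than learning from the predictors.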

## Discussion

### 1. Summary of what we found
<p>Through our methods, we sought to explore whether the players' self-reported experience level is predictive of (1) the length of time they play and (2) the time of their sessions. The results show that ____ </p>

### 2. Discussion of if this was expected
<p>These results are _____ to what we expected. Before conducting our analysis, we predicted that higher reported experience levels should be predictive of longer session lengths in this dataset. Intuitively, we felt that this would make sense as more experienced players have a higher level of dedication to Minecraft, and would therefore play more hours. </p>

### 3. Discussion of impact of findings
<p>Our methods tested whether the start time and end time of gaming sessions could be predictive of the user's experience level. Our exploratory figures regarding start times, particularly figure 3, indicated that Pro users had peak start times around 1300 minutes past 00:00 (21:40); Veterans, Amateurs, and Beginners around 250 minutes past 00:00 (04:10); and Regular users around 200 minutes past 00:00 (03:20). Taking this together with figure 1, which displays mean session durations by experience level, we could initially expect our results to suggest allocating resources to recruit Regular and Pro users at 03:20 and 21:40, respectively. However, our analysis suggested otherwise. Our classification model lacked accuracy, with a value of 0.64, and made 0 correct predictions for the Pro, Veteran, and Beginner experience levels out of 39, 51, and 106 observations, respectively (figure 2, figure 6). We had set out to learn whether knowing the experience levels of players could be predictive of longer sessions, in order to investigate which 'kind' of player would likely contribute larger amounts of data. Based on our findings, we believe that players of ____ level tend to play for longer, resulting in a larger amount of data collected by the research team. This helps the research team and can impact their recruiting methods, as they will be able to allocate their resources toward targeting ____ players, such as by _____ or ______. </p>
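
The clock times quoted in this discussion (for example, 1300 minutes past 00:00 as 21:40) can be checked with a small helper. `mins_to_clock` is a hypothetical name introduced here for illustration and is not part of the analysis code.

```r
# Convert minutes past 00:00 into an "HH:MM" string
mins_to_clock <- function(mins) {
  sprintf("%02d:%02d", as.integer(mins %/% 60), as.integer(mins %% 60))
}

mins_to_clock(c(1300, 250, 200))
```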

### 4. Discussion of future questions findings could lead to