# Project Planning Stage (Group)

## **Introduction**

Recruiting and retaining active users is critical for running online gaming experiments. The UBC Minecraft research server, led by Prof. Frank Wood, logs all player activity in detail. However, only a small fraction of players subscribe to the game-related newsletter. Because the newsletter is used to announce experiments, updates, and funding opportunities, understanding what influences subscription behavior can help the team target their messages in a more effective way.

For this project, our goal is to explore the question:
*"Can a player's experience level, playing time, and age predict whether a player subscribes to the game-related newsletter?"*

We use the players.csv dataset, which contains one row per player along with information such as experience level, total hours played, age, and subscription status. The data were collected between 2024-08-01 and 2024-11-30.


For this analysis, we used age, playing time, and experience level to predict whether a player subscribes to the game-related newsletter. We chose these variables because we hypothesize that each is directly related to a player's interest in Minecraft. Specifically, we expect higher playing time, younger age, and higher experience level to be linked to greater involvement in the game's community, which takes dedication.

**Descriptive summary of dataset:**
- The players dataset contains 196 observations and 7 variables.

  Below is a summary of all variables in the dataset:

| Variable Name | Data Type   | Description / Meaning                                                                                       | Notes / Potential Issues                                                                                |
| ------------- | ----------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| experience    | Categorical | Player’s self-reported experience level (Beginner, Amateur, Regular, Veteran, Pro)                          | Reflects player skill/familiarity                                                                       |
| subscribe     | Logical     | Whether players are subscribed to the newsletter (TRUE/FALSE)                                               | May be influenced by external sources other than gameplay experience (marketing, interest)              |
| hashedEmail   | Categorical | Unique hashed player identifier (anonymized)                                                                | Used to identify players; no direct analytical value                                                    |
| played_hours  | Numeric     | Total hours played on server                                                                                | Includes 0 hours (inactive or new players); possible outliers with high values                          |
| name          | Categorical | Player’s first name                                                                                         | Nominal data with potential duplicates                                                                  |
| gender        | Categorical | Player’s self-identified gender (Male, Female, Non-binary, Two-Spirited, Agender, Other, Prefer not to say) | Multiple categories with social diversity; minority groups may require special attention/representation |
| Age           | Integer     | Player’s age in years                                                                                       | Large range of ages; two missing data points                                                            |

- Summary statistics for the numeric variables:

| Variable     | Min  | Mean  | Median | Max    | Std Dev |
| ------------ | ---- | ----- | ------ | ------ | ------- |
| played_hours | 0.00 | 5.85  | 0.10   | 223.10 | 28.36   |
| Age          | 9.00 | 21.14 | 19.00  | 58.00  | 7.39    |

## **Summary of Dataset and Tidying Data**

Below, we create a cleaned version of the dataset by selecting the variables used in the analysis (subscribe, experience, played_hours, and Age) and removing rows with missing values.

In [None]:
# install.packages("tidyverse")  # uncomment on first run if the package is not installed

library(tidyverse)
library(knitr)
library(GGally)
library(ggplot2)
library(dplyr)
library(tidymodels)
library(tidyclust)
library(themis)
library(janitor)
library(rsample)
library(repr)
options(repr.matrix.max.rows = 6)
     

In [None]:
players <- read_csv("players.csv")

head(players)
glimpse(players)

In [None]:
#Summary Statistics

players_summary <- players |>
  summarise(
    across(where(is.numeric),
           list(min = ~round(min(., na.rm = TRUE), 2),
                mean = ~round(mean(., na.rm = TRUE), 2),
                median = ~round(median(., na.rm = TRUE), 2),
                max = ~round(max(., na.rm = TRUE), 2),
                sd = ~round(sd(., na.rm = TRUE), 2)))
  )
players_summary


In [None]:
#selecting necessary variables

players_select <- players |>
    select(subscribe, experience, played_hours, Age)

#converting categorical variables to factors

players_select <- players_select |>
    mutate(subscribe = as.factor(subscribe), experience = as.factor(experience))

#removing rows with missing vals

players_clean <- players_select |>
    filter(!is.na(played_hours), !is.na(Age))

#mean values for numeric variables

mean_summary <- players_clean |>
    summarise(mean_played_hours = mean(played_hours, na.rm = TRUE), mean_age = mean(Age, na.rm = TRUE)) |>
    mutate(across(everything(), ~round(.x, 2)))

head(players_clean)
mean_summary

## **Data Visualization**

In [None]:
#Plot 1: Categorical Plot - Experience vs. Subscription Status
options(repr.plot.width = 12, repr.plot.height = 8)

cat_plot <- players_clean |>
    group_by(experience, subscribe) |>
    summarise(count = n(), .groups = "drop") |>
    group_by(experience) |>
    mutate(prop = count / sum(count)) |>
    filter(subscribe == "TRUE") |>
    ggplot(aes(x = experience, y = prop, fill = experience)) +
    scale_y_continuous(labels = scales::percent_format()) +
    geom_col(show.legend = FALSE) + 
    labs(title = "Newsletter Subscription Rate by Experience level", x = "Experience Level", y = "Subscription Rate (%)") + 
    theme(text = element_text(size = 20)) 
cat_plot

In [None]:

#Plot 2: Numeric Plot 1 - Age vs. Subscription Status
players_clean <- players_clean |>
    mutate(
    subscribe = as.factor(subscribe),
    experience = as.factor(experience),
    age_group = cut(
      Age,
      breaks = c(0, 15, 20, 25, 30, 40, 50, 60),
      labels = c("0–15","16–20","21–25","26–30","31–40","41–50","51–60")))

options(repr.plot.width = 12, repr.plot.height = 8)

num_plot1 <- players_clean |> 
    ggplot(aes(x = age_group, fill = subscribe)) +
    geom_bar(position = position_dodge(width = 0.8)) +
    geom_text(
    aes(label = after_stat(count)),
    stat = "count",
    position = position_dodge(width = 0.8),
    vjust = -0.5) +
    labs(title = "Subscription Status by Age Group", x = "Age Group", y = "Count", fill = "Subscribed") +
    theme(text = element_text(size = 20))

num_plot1

In [None]:

#Plot 3: Numeric Plot 2 - Played Hours vs. Subscription Status

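players_clean <- players_clean |>
    mutate(hours_group = cut(
     played_hours,
     breaks = c(0,1,5,20,50,200,300),
     labels = c("0–1","1–5","5–20","20–50","50–200","200+"),
     # include.lowest keeps players with exactly 0 hours in the "0–1" bin;
     # without it cut() returns NA for them and the filter below drops them
     include.lowest = TRUE)) |>
    filter(!is.na(hours_group))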
 
num_plot2 <- players_clean |>
    ggplot(aes(hours_group, fill = subscribe)) +
    geom_bar(position = position_dodge(width = 0.8)) +
    geom_text(aes(label = after_stat(count)), stat = "count", position = position_dodge(width = 0.8), vjust = -0.5) +
    labs(title = "Subscription Status by Played Hours", x = "Played Hours (binned)", y = "Count", fill = "Subscribed") +
    theme(text = element_text(size = 20))

num_plot2

     

## **Visualization Insights:**

**Plot 1**
- Veteran players had the lowest subscription rate (68%), whereas Regular players had the highest (81%).
- All other experience levels had similar subscription rates (70-77%).

**Plot 2**
- The median age of subscribed players was below 20, whereas the median age of players who did not subscribe was above 20.
- The spread of ages was larger among players who did not subscribe to the newsletter than among those who did.

**Plot 3**
- The median of the logged played hours was higher for subscribed players than for players who were not subscribed.
- The spread of played hours was larger for subscribed players than for players who were not subscribed.
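
The medians and spreads described for Plots 2 and 3 can also be checked numerically. The following is a minimal sketch that assumes a data frame shaped like players_clean; the toy_players values are hypothetical and are not taken from the real dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for players_clean (not the real dataset)
toy_players <- data.frame(
  subscribe = factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE")),
  Age = c(17, 19, 21, 18, 24, 30),
  played_hours = c(0.5, 2.0, 30.0, 0.0, 0.1, 1.0)
)

# Median and interquartile range of each numeric variable per subscription status
group_summary <- toy_players |>
  group_by(subscribe) |>
  summarise(
    median_age = median(Age),
    iqr_age = IQR(Age),
    median_hours = median(played_hours),
    iqr_hours = IQR(played_hours),
    .groups = "drop"
  )

group_summary
```

Replacing toy_players with players_clean would give the corresponding values for the real data. The interquartile range is used here as one possible measure of spread; it is less sensitive to outliers than the standard deviation.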

## **Modelling**

First, I split the data into training and testing sets. I then used 5-fold cross-validation (repeated 5 times) on the training set to find the best number of neighbours for the KNN model.

In [None]:
set.seed(123)

players <- players_clean |>
                    drop_na() |>
                    select(experience, subscribe, Age, played_hours) |>
                    mutate(experience = as_factor(experience) |>
                                           fct_relevel("Beginner", "Amateur", "Regular", "Pro", "Veteran"),
                          subscribe = as_factor(subscribe),
                          Age = as.integer(Age))

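# Note: resampling with replacement before the split below means copies of the
# same player can land in both the training and testing sets.
players_resampled <- players |>
                        mutate(experience = as.integer(experience)) |>
                        rep_sample_n(size = 194, replace = TRUE, reps = 10)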

subscribers_split <- players_resampled |>
                        initial_split(prop = 0.75)

subscribers_training <- training(subscribers_split)
subscribers_testing <- testing(subscribers_split)

subscribers_vfold <- vfold_cv(subscribers_training, v = 5, repeats = 5, strata = subscribe)

subscribers_k_values <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

knn_classification_spec <- nearest_neighbor(weight = "rectangular", neighbors = tune()) |>
                            set_engine("kknn") |>
                            set_mode("classification")

subscribers_recipe <- recipe(subscribe ~ experience + played_hours + Age, data = subscribers_training) |>
                                step_smote(subscribe) |>
                                step_normalize(all_predictors())

subscribers_results <- workflow() |>
                        add_recipe(subscribers_recipe) |>
                        add_model(knn_classification_spec) |>
                        tune_grid(resamples = subscribers_vfold, grid = subscribers_k_values) |>
                        collect_metrics() |>
                        filter(.metric == "accuracy")

subscribers_results_plot <- subscribers_results |>
                                ggplot(aes(x = neighbors, y = mean)) +
                                    geom_point() +
                                    geom_line() +
                                    ggtitle("Accuracy vs Neighbors") +
                                    xlab("Neighbors") +
                                    ylab("Accuracy")
subscribers_results_plot
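
One point to keep in mind when reading the accuracy curve is that the data were resampled with replacement before being split. The sketch below uses only base R and hypothetical player ids (no real data) to estimate how often a row in the testing set has a copy of the same player in the training set under that scheme. The sizes (194 players, 10 replicates, a 75/25 split) mirror the code above; everything else is an illustrative assumption.

```r
set.seed(123)

n_players <- 194  # number of original rows, as in rep_sample_n(size = 194, ...)
n_reps <- 10      # number of bootstrap replicates stacked together

# Bootstrap: draw player ids with replacement, once per replicate
resampled_ids <- unlist(lapply(seq_len(n_reps), function(i) {
  sample(seq_len(n_players), size = n_players, replace = TRUE)
}))

# Random 75/25 split of the stacked rows
n_rows <- length(resampled_ids)
train_rows <- sample(seq_len(n_rows), size = floor(0.75 * n_rows))
train_ids <- resampled_ids[train_rows]
test_ids <- resampled_ids[-train_rows]

# Share of testing rows whose player also appears in the training set
overlap_share <- mean(test_ids %in% train_ids)
overlap_share
```

A share close to 1 would mean that nearly every test row has an identical copy in the training data, which a KNN model with small k can match exactly. Splitting first and resampling only the training set is one way to avoid this.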

Next, I selected the best k (the number of nearest neighbours with the highest mean cross-validation accuracy) and then built the workflow that fits the final model:

In [None]:
subscribers_best_k <- subscribers_results |>
    arrange(desc(mean)) |>
    slice(1) |>
    pull(neighbors)

print("Best k-value: ")
subscribers_best_k

best_knn_spec <- nearest_neighbor(weight = "rectangular", neighbors = subscribers_best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

subscribers_fit <- workflow() |>
    add_recipe(subscribers_recipe) |>
    add_model(best_knn_spec) |>
    fit(data = subscribers_training)

After building the model, I evaluated it on the testing set and reported its accuracy, precision, and recall as decimals, along with a confusion matrix.

In [None]:
subscribers_predictions <- predict(subscribers_fit, subscribers_testing) |>
    bind_cols(subscribers_testing)

subscribers_acc <- subscribers_predictions |>
    metrics(truth = subscribe, estimate = .pred_class) |>
    filter(.metric == "accuracy") |>
    pull(.estimate)

subscribers_prec <- subscribers_predictions |>
    precision(truth = subscribe, estimate = .pred_class, event_level = "second") |>
    pull(.estimate)

subscribers_rec <- subscribers_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "second") |>
    pull(.estimate)

subscribers_conf_mat <- subscribers_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class)

print("Accuracy: ")
subscribers_acc
print("Precision: ")
subscribers_prec
print("Recall: ")
subscribers_rec
subscribers_conf_mat
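
To make the three metrics concrete, here is a short sketch of how they follow from the four cells of a confusion matrix. The counts are hypothetical and chosen only for illustration; they are not our results. As in the code above (event_level = "second"), TRUE is treated as the positive class.

```r
# Hypothetical confusion-matrix counts (illustrative only, not our results)
tp <- 40  # predicted TRUE, actually TRUE
fp <- 5   # predicted TRUE, actually FALSE
fn <- 10  # predicted FALSE, actually TRUE
tn <- 45  # predicted FALSE, actually FALSE

# Accuracy: share of all predictions that are correct
accuracy <- (tp + tn) / (tp + fp + fn + tn)

# Precision: of the players predicted to subscribe, the share who did
precision <- tp / (tp + fp)

# Recall: of the players who subscribed, the share the model identified
recall <- tp / (tp + fn)

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

With these counts the accuracy is 0.85, the precision is about 0.889, and the recall is 0.8.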


I used a KNN model for this part of the project because this is a classification problem with a small number of predictors (three). KNN classifies a point using the labels of a specified number of its nearest neighbours. It makes no assumptions about how the data are distributed, which helps here because the three predictors contain some extreme outliers and appear skewed. KNN does assume that nearby points tend to share the same label. Because the goal of this analysis is to see whether Age, experience level, and played_hours are good predictors of subscription status, testing the accuracy of a KNN model, which relies on players with similar predictor values having the same subscription status, addresses that goal.
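
The idea of classifying a point by the labels of its nearest neighbours can be sketched in a few lines of base R. This is a simplified illustration, not the kknn implementation used above: the training points are hypothetical, only two standardized predictors are used for readability, and knn_predict is a helper name introduced here.

```r
# Hypothetical standardized training points (two predictors for readability)
train_x <- matrix(
  c(-1.0, -0.5,
    -0.8,  0.2,
     0.1,  0.0,
     0.9,  1.1,
     1.2,  0.7),
  ncol = 2, byrow = TRUE
)
train_y <- c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE")

knn_predict <- function(new_point, x, y, k) {
  # Euclidean distance from the new point to every training point
  distances <- sqrt(rowSums(sweep(x, 2, new_point)^2))
  # Labels of the k closest training points
  nearest_labels <- y[order(distances)[seq_len(k)]]
  # Majority vote with equal weight per neighbour (like weight = "rectangular")
  names(which.max(table(nearest_labels)))
}

knn_predict(c(0.5, 0.5), train_x, train_y, k = 3)
```

Here the three closest training points to (0.5, 0.5) are all labelled TRUE, so the new point is classified as TRUE. Distances depend on the scale of each predictor, which is why the recipe above normalizes the predictors.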


## **Discussion**

A total of 196 players were recorded in this study on the UBC Minecraft research server, led by Prof. Frank Wood, an open-access environment where player interactions are automatically logged. The data were collected over about four months, from 2024-08-01 to 2024-11-30. We found that subscription rates differed by experience, age, and played hours. In terms of experience level, Veteran players had the lowest subscription rate (68%) and Regular players had the highest (81%); all other experience levels had similar rates (70-77%). The median age of subscribed players was below 20, whereas the median age of unsubscribed players was above 20. The logged median played hours of subscribed players was higher than that of unsubscribed players, and the played hours of unsubscribed players also had a smaller spread.

When using the KNN model to classify players with these three variables, the accuracy, precision, and recall were all near 90%, suggesting that these three factors are good predictors of subscription status.

The results were largely as expected, although the highest subscription rate was among Regular players rather than the most experienced ones. Younger ages, Regular experience levels, and higher played hours were linked to higher subscription rates. This could be because younger players tend to spend more time in the game. A Regular experience level indicates that a player has played more than a beginner but lacks the experience and knowledge of a Pro or Veteran who has already spent a lot of time on the game, so subscribing to a game-related newsletter may seem more attractive to players seeking to learn more or progress in the game. Younger players could be more interested in subscribing because they have more free time than older players. Players with higher played hours could be more interested in subscribing because they spend more time on the game and are more attached to it. To identify what influences newsletter subscription well enough to target players effectively, more studies are needed to understand why a player's age, experience level, and played hours affect their likelihood of subscribing.

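Some limitations of this analysis are that most players spent less than one hour on Minecraft over the course of the study and that the dataset contains many outliers. In addition, because the data were resampled with replacement before the training/testing split, copies of the same player can appear in both sets, so the reported metrics may overstate how well the model would perform on new players. Having players with a wider range of playtime, or a larger dataset, would provide a better understanding of player subscription behaviour.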
