# **Group 19 Project: Predicting Newsletter Subscription from Player Age and Hours Played**

# **Introduction**

Video games have evolved from simple pastimes into complex environments that offer rich data about user behavior and interaction. This report is grounded in a real-world data science project conducted by a research group in Computer Science at UBC, led by Frank Wood. The team has established a Minecraft server where every action taken by players is recorded. By capturing this data, the researchers aim to unlock insights into how individuals navigate and interact within virtual worlds.

The project has multiple objectives; we focus on understanding which player characteristics and behaviors best predict a player's likelihood of subscribing to a game-related newsletter. This targeted approach helps ensure that sufficient resources, such as software licenses and server hardware, are available to support the anticipated influx of players. By investigating player behavior through detailed analytics, the study aims to inform future strategies for engagement, recruitment, and resource allocation in online gaming communities. This report details the methods used to analyze the player data, the key findings on newsletter subscription behavior, and the implications of these findings.

# **Question**

Are age and the number of hours played predictive of subscription status to a game-related newsletter in the players.csv data set?

## Data Set Description

There are two datasets containing information on players on the Minecraft server: "players.csv" and "sessions.csv".

The "players.csv" dataset contains observations collected for multiple variables from people who played on the Minecraft server. The data frame contains 7 variables and 196 rows (observations), for 1372 values in total. The variables, ordered as in the table from left to right, are:

- `Experience`
    - This variable describes each player's level of experience with the game.
    - This variable is represented by a string value that can be either Amateur, Beginner, Regular, Pro, or Veteran.
- `Subscribe`
    - This variable describes whether or not the player is subscribed to a game-related newsletter.
    - This variable is represented by a boolean value (either True or False). It is stored in the column `subscribe`.
- `Hashed Email`
    - This variable lists each player's email in a hashed format.
    - This variable is represented by a string.
- `Hours Played`
    - This variable describes how many hours each player spent playing the game.
    - This variable is represented by a float value (number with a decimal value). It is stored in the column `played_hours`.
- `Name`
    - This variable states the player's first name.
    - This variable is represented by a string.
- `Gender`
    - This variable describes the gender of each player.
    - This variable is represented by a string value that can be either Agender, Female, Male, Non-binary, other, Prefer not to say, or Two-Spirited.
- `Age`
    - This variable describes the age of each player (in years).
    - This variable is represented by an integer value (whole number).

**This is the data set that will be used in the analysis.**

The "sessions.csv" dataset contains observations collected for multiple variables from people who played on the Minecraft server. The data frame contains 5 variables and 1535 rows (observations), for 7675 values in total. The variables, ordered as in the table from left to right, are:

- `hashedEmail`
    - This variable gives a string of letters and numbers that represents the player's email address in hashed form.
    - This variable is represented by a string.
- `start_time`
    - This variable gives the exact date (DD/MM/YR) and time (24 hour clock) that the player started their session.
    - This variable is represented by a string.
- `end_time`
    - This variable gives the exact date (DD/MM/YR) and time (24 hour clock) that the player ended their session.
    - This variable is represented by a string.
- `original_start_time`
    - This variable describes the original start time of each player's session.
    - This variable is represented by a float value (number with a decimal value).
- `original_end_time`
    - This variable describes the original end time of each player's session.
    - This variable is represented by a float value (number with a decimal value).


This data set will not be used in the analysis. 

In [None]:
library(tidyverse)
library(rvest)
library(dplyr)
library(tidymodels)

In [None]:
# url_sessions <- "https://raw.githubusercontent.com/IFQXK/DSCI-100-project-group-19/refs/heads/main/sessions.csv"
# sessions_data <- read.csv(url_sessions)
# head(sessions_data)

url_players <- "https://raw.githubusercontent.com/IFQXK/DSCI-100-project-group-19/refs/heads/main/players.csv"
players_data <- read.csv(url_players)
head(players_data)

This is what the first 6 rows of the data look like before wrangling.

In [None]:
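# Keep only the variables needed for the analysis; subscribe becomes a factor for classification
players_data_raw <- players_data |>
    mutate(subscribe = as.factor(subscribe)) |>
    select(subscribe, played_hours, Age)

# Drop rows with a missing value in any of the selected columns
players_data_fixed <- drop_na(players_data_raw)

head(players_data_fixed)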

Once the columns required for the analysis have been selected from the original data set, the data is split into a training set and a testing set. The model is built on the training set only, so that the testing set can be used to estimate how well the model predicts new data.

In [None]:
set.seed(1)
players_split <- initial_split(players_data_fixed, prop = 0.75, strata = subscribe)
players_training <- training(players_split)
players_testing <- testing(players_split)

## Statistics Summary of Variables used in Specific Question

In [None]:
mean_table <- players_training |>
    summarize(
        Average_Age = mean(Age, na.rm = TRUE),
        Hours_Played = mean(played_hours, na.rm = TRUE)
    )

mean_table



- `Age`
    - Mean: 20.9
    - Max: 50
    - Min: 8
- `subscribe`
    - True: 108
    - False: 39
- `played_hours`
    - Mean: 4.6
    - Max: 218.1
    - Min: 0.0

Above is a summary of each variable used in the analysis. For the numeric variables, the mean, maximum, and minimum were calculated. For the categorical variable, the number of observations in each category was counted.
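The notebook computes only the means in `mean_table`, so the following is a minimal sketch of one way the remaining statistics could be obtained. It runs on a small hypothetical tibble, `toy_players`, whose values are made up and whose column names are assumed to match `players_training`; the object names `numeric_summary` and `subscribe_counts` are also illustrative.

```r
library(dplyr)

# Hypothetical toy data (made-up values) with the same column names as players_training
toy_players <- tibble(
  subscribe = factor(c(TRUE, TRUE, FALSE, TRUE, FALSE)),
  played_hours = c(0.0, 3.8, 0.1, 30.3, 0.7),
  Age = c(9L, 17L, 21L, 22L, 50L)
)

# Mean, minimum, and maximum of each numeric variable
numeric_summary <- toy_players |>
  summarize(
    across(
      c(Age, played_hours),
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        min = ~ min(.x, na.rm = TRUE),
        max = ~ max(.x, na.rm = TRUE)
      )
    )
  )

# Number of observations in each category of the categorical variable
subscribe_counts <- toy_players |>
  count(subscribe)

numeric_summary
subscribe_counts
```

Replacing `toy_players` with `players_training` should give the same kind of table for the real data.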

In [None]:
options(repr.plot.width = 10, repr.plot.height = 4)
hours_histogram <- players_training |>
            ggplot(aes(x = played_hours, fill = subscribe)) +
            geom_histogram(position = "dodge", binwidth = 30) +
            labs(x = "Hours Played", y = "Number of Players", title = "Graph 1. Relationship between hours played and subscription status", fill =  "Subscription Status") + 
            theme(text = element_text(size = 14))       
hours_histogram

A histogram was chosen to show the distribution of hours played across the data set in relation to subscription status. As seen above, most players have few hours played, and most of them are subscribed. Furthermore, all players with many hours played are subscribed, while the non-subscribed players all appear to have very few hours played.

In [None]:
age_histogram <- players_training |>
            ggplot(aes(x = Age, fill = subscribe)) +
            geom_histogram(position = "dodge", binwidth = 5) +
            labs(x = "Age (in years)", title = "Graph 2. Relationship between age and subscription status", fill = "Subscription Status") + 
            theme(text = element_text(size = 14))       
age_histogram

A histogram was used to compare the two subscription statuses across ages. This allows us to see which ages are more likely to be subscribed. As seen in the graph above, players between the ages of 15 and 28 are more likely to be subscribed. Among older players, the numbers of subscribed and unsubscribed players are approximately the same.
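Reading "more likely to be subscribed" from dodged bars can be hard when the age groups differ in size. As a hedged sketch, the share of subscribed players per 5-year age bin could be tabulated directly. The tibble `toy_players` below holds made-up values, and the bin edges (15 to 50 in steps of 5) and the names `age_bin` and `subscription_by_age` are assumptions chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data (made-up values); subscribe is kept logical so mean() gives a proportion
toy_players <- tibble(
  Age = c(16, 17, 19, 21, 22, 24, 33, 41, 45, 50),
  subscribe = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

subscription_by_age <- toy_players |>
  # Group ages into 5-year bins; include.lowest keeps age 15 in the first bin
  mutate(age_bin = cut(Age, breaks = seq(15, 50, by = 5), include.lowest = TRUE)) |>
  group_by(age_bin) |>
  summarize(
    n_players = n(),
    prop_subscribed = mean(subscribe)
  )

subscription_by_age
```

With the real data, `subscribe` is a factor, so it would first need converting, for example with `subscribe == "TRUE"`, before taking the mean.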

# **Data Analysis**

In [None]:
scatter_plot <- players_training |>
            ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
            geom_point() +
            labs(y = "Hours Played", x = "Age (in years)", colour = "Subscription Status", title = "Graph 3. Relationship between hours played, age and subscription status") + 
            theme(text = element_text(size = 14))
    
scatter_plot

In [None]:
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
    set_engine("kknn") |>
    set_mode("classification")

players_workflow <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit(data = players_training)

players_workflow

In [None]:
players_test_prediction <- predict(players_workflow, players_testing) |>
    bind_cols(players_testing)

players_test_accuracy <- players_test_prediction |>
    metrics(truth = subscribe, estimate = .pred_class)
players_test_accuracy

In [None]:
players_mat <- players_test_prediction |> 
      conf_mat(truth = subscribe, estimate = .pred_class)

players_mat

In [None]:
# Cross-validate on the training set only, so the testing set stays untouched
players_vfold <- vfold_cv(players_training, v = 5, strata = subscribe)

players_resample_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit_resamples(resamples = players_vfold) |>
    collect_metrics()

players_resample_fit

The code above performs the following:

- Prediction: It makes predictions on the testing set using KNN classification with $K = 3$.
- Evaluation: It evaluates the model's performance using accuracy metrics and a confusion matrix.
- Cross-Validation: It performs 5-fold cross-validation to assess the model's performance on different subsets of the data.

**Because the accuracy of the prediction is so low, we must tune the model to find the best $K$ value.**
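One way to judge whether an accuracy is low is to compare it with a majority-class baseline, that is, a classifier that always predicts the most common class. The sketch below works this out from the training-set counts reported in the summary above (108 subscribed, 39 not subscribed). The variable names are illustrative, and using this baseline as the yardstick is our assumption, not a step in the original analysis.

```r
# Training-set counts taken from the summary statistics reported earlier
n_subscribed <- 108
n_not_subscribed <- 39

# Accuracy of always predicting the more common class
majority_baseline <- max(n_subscribed, n_not_subscribed) /
  (n_subscribed + n_not_subscribed)

round(majority_baseline, 3)
```

This works out to 108 / 147, roughly 0.735, so a tuned model would need to beat about 73.5% accuracy to improve on always guessing "subscribed".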

In [None]:
gridvals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

players_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

players_results <- workflow() |>
      add_recipe(players_recipe) |>
      add_model(players_tune) |>
      tune_grid(resamples = players_vfold, grid = gridvals) |>
      collect_metrics()

players_results

In [None]:
accuracies <- players_results |> 
      filter(.metric == "accuracy")

accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean))+
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate") +
      scale_x_continuous(breaks = seq(0, 14, by = 1)) +  # adjusting the x-axis
      scale_y_continuous(limits = c(0.4, 1.0)) # adjusting the y-axis

accuracy_versus_k


From the plot above, we can see that $K = 1$ provides the highest accuracy. Larger $K$ values result in a reduced accuracy estimate.

In [None]:
best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
best_k

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

# Fit the final model on the training set only
knn_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_spec) |>
    fit(data = players_training)

# Evaluate on the held-out testing set
players_test_predictions <- predict(knn_fit, players_testing) |>
    bind_cols(players_testing)

players_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

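The code above refits the model with the best $K$ value for our dataset and measures its accuracy on the testing set. The accuracy reported was 93%. For this figure to reflect performance on unseen data, the model must be fitted on the training set only: with $K = 1$, a model evaluated on the same rows it was fitted to finds each point as its own nearest neighbor, which inflates accuracy.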
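A general property of K-nearest neighbors is worth keeping in mind when reading an accuracy for $K = 1$: scored on the very rows it was fitted to, the model looks perfect even when the labels carry no signal. The sketch below illustrates this on simulated data, not on the players data. It uses `knn()` from the `class` package rather than the `kknn` engine used in this report, and all object names and values are hypothetical.

```r
library(class)  # recommended package that ships with standard R installations

set.seed(1)
n <- 200

# Simulated predictors and labels that are unrelated to them (pure noise)
predictors <- data.frame(Age = rnorm(n), played_hours = rnorm(n))
labels <- factor(sample(c("TRUE", "FALSE"), n, replace = TRUE))

train_idx <- sample(n, size = 150)

# Score the 1-NN model on the same rows it was fitted to
pred_same <- knn(
  train = predictors[train_idx, ], test = predictors[train_idx, ],
  cl = labels[train_idx], k = 1
)
acc_same <- mean(pred_same == labels[train_idx])

# Score the same model on held-out rows
pred_new <- knn(
  train = predictors[train_idx, ], test = predictors[-train_idx, ],
  cl = labels[train_idx], k = 1
)
acc_new <- mean(pred_new == labels[-train_idx])

c(same_rows = acc_same, held_out = acc_new)
```

Because every point is its own nearest neighbor, the accuracy on the fitted rows equals 1, while the held-out accuracy is the honest estimate for labels that are noise.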

In [None]:
new_player <- tibble(played_hours = 8, Age = 15)
new_player_prediction <- predict(knn_fit, new_player) 
new_player_prediction

Then, we create a new observation, `new_player`, with arbitrarily chosen hours played and age, and use our model to predict its subscription status. The model predicts that this player, who is 15 years old with 8 hours played, is subscribed to the newsletter. Given the model's reported accuracy of 93%, there is reason to believe this prediction.

# **Discussion**

When compared with our exploratory plots, the prediction for our fictional player makes sense: most players around 15 years old with around 8 hours of playing time were subscribed, so it is natural for this player to be predicted as subscribed.

These findings could have several impacts. For example, game studios could use the model to show newsletter sign-up prompts only to players it predicts are not yet subscribed, reducing annoyance for already-subscribed users. Marketing teams could send email campaigns to older players who play less often, encouraging them to re-engage with the game through exclusive content or beginner-friendly updates.
