# Crafting Subscriptions: Can Player Demographics in Minecraft Predict Game-related Newsletter Subscriptions?


![](https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExN2lxcTc5OTdpcmV2bWllaDRtMzhpOGpqMzhuemY5eWkwdXFqN3luNSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/AHpC7mG5fOaA3cgYw1/giphy.gif)

# Introduction
While computer science students are often thought of as being swamped with course work and personal projects, they actually love spending free time on the world's most popular video game: Minecraft!

In particular, a research group of computer science students at UBC loved playing Minecraft so much that they decided to collect data on how people play video games and to explore whether the data derived from play sessions can be used to make predictions about player behaviour. Minecraft servers are expensive to set up and maintain. Thus, to facilitate the research effectively, they need to be able to target Minecraft enthusiasts while simultaneously making sure that they have enough server capacity on hand.

Being a kind data scientist (and commerce student) myself, I thought I would help the group identify which players have the potential to become engaged and interested in the broader project by uncovering the characteristics that are correlated with being a newsletter subscriber.

# The Question 
The formal question this analysis attempts to answer is: Can player demographics, namely played hours and age, predict whether a new player is going to be a newsletter subscriber? Answering this question and determining the factors that affect whether a new player will be a subscriber can provide useful information to the research group.

Knowing which players are more inclined to subscribe helps the research group determine whom to target when recruiting for the Minecraft server. A key assumption made is that players who subscribe are also players who are more engaged in the broader project, which can lead to richer data collection (more sessions played, more responsiveness to surveys, etc.). Another useful insight is that many video games offer subscription services (the battle pass in Fortnite, Nintendo game passes, etc.), so knowing the factors that determine the subscription rates of their Minecraft-related newsletter may help the group build intuition about subscribing tendencies in other games.

Now that we know the question we are trying to answer, as well as the implications and benefits for the amazing group of UBC computer science students, let us start the analysis (and code)!

# Analysis (and Code)!

Before we start our analysis, it is important to load the packages needed to read, wrangle, and visualize our data.

In [None]:
# run before continuing 
library(tidyverse)
library(dplyr)
library(repr)
library(tidymodels)
library(yardstick)

Before we start, let's also set our seed for the rest of the analysis to make sure our results are reproducible.

In [None]:
# set our seed
set.seed(2025) 

Now that the seed is set, let's load in our `players.csv` file from the `data` folder in our current directory. This is going to be the main dataset we will be working with to answer our classification problem.




In [None]:
# loading in our dataset 
players <- read_csv("data/players.csv")

players

Looking at our dataset, we have 7 variables and 196 observations. However, it is clear that there are several steps to take in order to prepare our dataset for analysis. As a first step, let's `select` the `subscribe`, `played_hours`, and `Age` columns, as those are the relevant variables for our analysis.

In [None]:
# keep only the three relevant columns (dropping hashed email, name, etc.) and rename Age to lower case
players_clean <- players |>
    select(subscribe, played_hours, Age) |>
    rename(age = Age)

players_clean

Next, we need to convert our `subscribe` column into a factor, as this is the variable we will be classifying. We also notice that there are possible NA values in our dataset. Let's convert `subscribe` and remove any rows with NA values now.

In [None]:
# convert subscribe into factor data type and remove NA values
players_clean_factored <- players_clean |>
    mutate(subscribe = as.factor(subscribe)) |>
    drop_na()

players_clean_factored
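
As a small aside, here is a minimal sketch of what `as.factor` does to a TRUE/FALSE column. The `subscribe_example` vector is made up for illustration and is not taken from `players.csv`. The order of the levels matters later on, because it decides which level counts as the "second" one.

```r
# Hypothetical stand-in for the subscribe column (not the real data)
subscribe_example <- c(TRUE, FALSE, TRUE, NA, FALSE)

# as.factor() on a logical vector puts the levels in sorted order,
# so "FALSE" is the first level and "TRUE" is the second
subscribe_factor <- as.factor(subscribe_example)
levels(subscribe_factor)

# NA is not a level; it stays missing, which is why we drop those rows
sum(is.na(subscribe_factor))
```

In this sketch the levels come out as `"FALSE"` then `"TRUE"`, and one value is missing.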

Before we build our model, it is nice to visualize the relationship between our two predictors and how the points are classified in our dataset. Let's build a scatterplot now with `age` on the x-axis, `played_hours` on the y-axis, and colour the points using the `subscribe` column.

In [None]:
# create our scatterplot
players_plot <- players_clean_factored |>
    ggplot(aes(x = age, y = played_hours, colour = subscribe)) +
    geom_point() +
    xlab("Age (in years)") +
    ylab("Number of Played Hours") +
    ggtitle("Scatterplot of Age and Number of Played Hours of Minecraft Players")

players_plot

# Wait, there seems to be no obvious correlation ...

![](https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExNTV3emplajljMnd5YXR4dzEyOWQzcWZseGxuYzEzdnJ2eTNsMTd3eiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/GntpLR18f4b8Uu8GzO/giphy.gif)

We can see that, on the surface, there is no strong evidence that age and number of hours played are useful in predicting possible newsletter subscriptions. However, let's not panic: we can still see that newsletter subscribers are generally younger and play more hours than non-subscribers.
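
One way to back up a visual impression like this is to compare group averages. The sketch below uses base R and a tiny made-up data frame, `example_players`, shaped like `players_clean_factored`; the numbers are hypothetical and are not results from our data.

```r
# Hypothetical rows shaped like players_clean_factored (not the real data)
example_players <- data.frame(
    subscribe = factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE")),
    played_hours = c(30.3, 3.8, 0.7, 0.0, 0.1),
    age = c(9, 17, 21, 21, 22)
)

# Mean age and mean played hours within each subscription group
group_means <- aggregate(cbind(age, played_hours) ~ subscribe,
                         data = example_players, FUN = mean)
group_means
```

Running the same `aggregate` call on `players_clean_factored` would show whether the real group means point in the direction the scatterplot suggests.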

To gain more insight, let's build our classification model using k-nearest neighbors to see if we can accurately predict whether someone is going to be a newsletter subscriber based on age and number of hours played on the Minecraft server.

Linking this back to our initial question and purpose, knowing the variables that help us accurately predict whether a new player will subscribe can help the research group determine which demographic they should target for the study. Furthermore, being able to accurately classify subscribers based on demographics, in our case age and hours played, can help the research group determine whom to approach when asking for more in-depth data or research in the future.

# K-Nearest Neighbours Classification Model 

First, let's split the data into `training` and `testing` data. Let's specify 75% of our data as training data and 25% of our data as testing data.

In [None]:
# split our data into training and testing data
players_split <- initial_split(players_clean_factored, prop = 0.75, strata = subscribe)

# assign our data into training and testing data
players_training <- training(players_split)
players_testing <- testing(players_split)

glimpse(players_training)

Now, let's create our `recipe` and `knn` specification for our model. In our `recipe`, we will center and scale our predictors so that no variable dominates the distance calculation simply because of its units. In our specification, let's tune the `neighbors` value in order to find the optimal value for our model, which we will evaluate using 5-fold cross-validation.

In [None]:
# create our recipe and standardize our data 
players_recipe <- recipe(subscribe ~ played_hours + age, data = players_training) |>
    step_center(all_predictors()) |>
    step_scale(all_predictors()) 

# create specification for our model 
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
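
To make the centering and scaling step concrete, here is a minimal base R sketch of the arithmetic: subtract the mean, then divide by the standard deviation. The `age` and `played_hours` values and the `standardize` helper are made up for illustration; in the real workflow the recipe estimates the means and standard deviations from the training data.

```r
# Hypothetical ages and played hours, for illustration only
age <- c(17, 21, 9, 25, 30)
played_hours <- c(0.5, 30.3, 3.8, 0.0, 120.0)

# Centering subtracts the mean; scaling divides by the standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

age_std <- standardize(age)
hours_std <- standardize(played_hours)

# After standardizing, both variables have mean 0 and standard deviation 1
round(c(mean = mean(age_std), sd = sd(age_std)), 10)
round(c(mean = mean(hours_std), sd = sd(hours_std)), 10)
```

Without this step, a difference of 100 played hours would swamp a difference of 10 years of age when distances between players are computed.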

Let's also create our 5 cross-validation folds and the tibble of values we want to test for `K`. For this model, let's try values of `K` from 1 to 10.

In [None]:
# create our 5 cross-validation folds from the training data
vfold_players <- vfold_cv(players_training, v = 5, strata = subscribe)

# create our tibble for the K values we want to tune to
grid_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
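
For intuition, here is a rough base R sketch of the idea behind 5-fold cross-validation. It is not how `vfold_cv` is implemented and it ignores stratification; the row count `n` is hypothetical. If you try it, run it in a separate session so it does not advance the random number stream of this analysis.

```r
n <- 20   # hypothetical number of training rows
v <- 5    # number of folds

# Randomly assign each row to one of v folds of equal size
fold_id <- sample(rep(seq_len(v), length.out = n))

# For each fold: train on the other folds, validate on the held-out one
for (fold in seq_len(v)) {
    validation_rows <- which(fold_id == fold)
    training_rows <- which(fold_id != fold)
    cat("Fold", fold, ": train on", length(training_rows),
        "rows, validate on", length(validation_rows), "rows\n")
}
```

Every row is used for validation exactly once, and the accuracy reported for each `K` is the average over the five held-out folds.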

Now, we can train our model by putting it all together in a workflow! Then, we collect our metrics to see how each value of `K` did.

In [None]:
# train our model by putting it together in a workflow!
players_results <- workflow() |>
      add_recipe(players_recipe) |>
      add_model(players_spec) |>
      tune_grid(resamples = vfold_players, grid = grid_vals) |>
      collect_metrics()

players_results


It is difficult to see the best `K` value in the table that we created. Let's create a visual representation of our results with our `K` values on the x-axis and the accuracy estimates on the y-axis.

In [None]:
# filter for only accuracy metric
players_accuracies <- players_results |> 
      filter(.metric == "accuracy")

# create our lineplot with points!
accuracy_versus_k <- ggplot(players_accuracies, aes(x = neighbors, y = mean))+
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate") +
      scale_x_continuous(breaks = seq(1, 10, by = 1)) +  # one tick per K value we tuned over
      scale_y_continuous(limits = c(0.4, 1.0)) + # adjusting the y-axis 
      ggtitle("Accuracies Versus K Values")
accuracy_versus_k

We can see that the `K` values of 5 and 6 provide the highest accuracy for our classification model. Surprisingly, they actually produce the exact same accuracy! One possible reason is that changing the number of nearest neighbors the classifier uses from 5 to 6 does not change the predicted subscriber status (either TRUE or FALSE) for any player. Let's choose one of the values, 5 in our analysis, to create our model with.
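
If you would rather not read the best `K` off a plot, a tie like this can also be found in code. The sketch below uses a made-up `example_accuracies` data frame with the same `neighbors` and `mean` column names as our results; the accuracy values are hypothetical, not the ones from our tuning.

```r
# Hypothetical accuracy estimates, NOT the values from our tuning results
example_accuracies <- data.frame(
    neighbors = 1:6,
    mean = c(0.55, 0.58, 0.60, 0.62, 0.70, 0.70)
)

# All values of K that share the highest accuracy estimate
best_rows <- example_accuracies[example_accuracies$mean == max(example_accuracies$mean), ]
best_rows$neighbors

# which.max() returns the first maximum, so a tie resolves to the smaller K
best_k <- example_accuracies$neighbors[which.max(example_accuracies$mean)]
best_k
```

Picking the smaller `K` in a tie is just one convention; either value would give the same estimated accuracy.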

In [None]:
# create our new specification with optimal k value
players_spec_optimal <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")

# create our model using workflow
players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec_optimal) |>
    fit(players_training) 

players_fit
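
To show what the fitted classifier does at prediction time, here is a stripped-down k-nearest neighbours sketch in base R. It is not the `kknn` implementation, and the training points, labels, and `knn_predict` helper are all hypothetical. It finds the `k` closest points by straight-line distance and takes an equally weighted majority vote.

```r
# Hypothetical standardized training points (age, played_hours) and labels
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.1,  1.5,
                     0.9, -0.7,
                     1.2, -0.9,
                    -0.3,  0.8,
                     0.5,  0.4),
                  ncol = 2, byrow = TRUE)
train_y <- c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE")

knn_predict <- function(new_point, k) {
    # Euclidean distance from the new point to every training point
    distances <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))
    # Labels of the k closest training points
    nearest_labels <- train_y[order(distances)[seq_len(k)]]
    # Majority vote in which every neighbour counts equally
    votes <- table(nearest_labels)
    names(votes)[which.max(votes)]
}

new_player <- c(0.0, 0.5)
knn_predict(new_player, k = 5)
knn_predict(new_player, k = 6)
```

In this toy example the sixth neighbour is outvoted, so `k = 5` and `k = 6` give the same prediction. That is one way two values of `K` can end up with identical accuracy.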

Note that so far we have only used the `players_training` set to train our model to classify subscription status. Now let's put our model to the test by predicting the subscription status of players the model has never seen before, based on `age` and `played_hours`. We also need to show the accuracy of our model using `metrics`, and create a confusion matrix to gain deeper insight into our model's performance.

In [None]:
# make our predictions and bind columns to compare 
players_predictions <- predict(players_fit, players_testing) |>
    bind_cols(players_testing)

players_predictions

# store metrics in created object
players_metrics <- players_predictions |>
    metrics(truth = subscribe, estimate = .pred_class)

players_metrics

# create confusion matrix
players_conf_mat <- players_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class)

players_conf_mat

Hmm, we can see our `accuracy` as a percentage, while our recall and precision are a bit more difficult to read off the confusion matrix. Let's show our recall and precision as percentages as well.

In [None]:
precision(players_predictions, truth = subscribe, estimate = .pred_class, event_level = "second")
recall(players_predictions, truth = subscribe, estimate = .pred_class, event_level = "second")
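
To spell out what these metrics measure, here is the arithmetic behind them, using made-up confusion matrix counts (not the counts from our test set) and treating "subscriber" as the positive class.

```r
# Hypothetical confusion-matrix counts, NOT the results from our test set
true_positive  <- 8   # predicted subscriber, actually a subscriber
false_positive <- 2   # predicted subscriber, actually not a subscriber
false_negative <- 4   # predicted non-subscriber, actually a subscriber
true_negative  <- 6   # predicted non-subscriber, actually not a subscriber

total <- true_positive + false_positive + false_negative + true_negative

# Accuracy: share of all predictions that were correct
accuracy_example  <- (true_positive + true_negative) / total
# Precision: share of predicted subscribers who really are subscribers
precision_example <- true_positive / (true_positive + false_positive)
# Recall: share of real subscribers that the model managed to find
recall_example    <- true_positive / (true_positive + false_negative)

round(c(accuracy = accuracy_example,
        precision = precision_example,
        recall = recall_example), 3)
```

With these hypothetical counts, accuracy is 0.7, precision is 0.8, and recall is about 0.667.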

# Interpretation of Results and What This Means for the Research Group

Looking at the `accuracy`, `precision`, and `recall`, we can see that while our `accuracy` is somewhat lower at 65%, our `recall` (69%) and `precision` (81%) are both higher. In layman's terms, our model makes the correct prediction 65% of the time, whether it is predicting that a player is a subscriber or not a subscriber to the game-related newsletter. It correctly identifies 69% of the players who actually are subscribers, and when it does predict someone to be a subscriber, it is right 81% of the time.