# Crafting Subscriptions: Can Player Demographics in Minecraft Predict Game-related Newsletter Subscriptions?


![](https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExN2lxcTc5OTdpcmV2bWllaDRtMzhpOGpqMzhuemY5eWkwdXFqN3luNSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/AHpC7mG5fOaA3cgYw1/giphy.gif)

# Introduction
While computer science students are often thought of as being swamped with coursework and personal projects, they actually love spending their free time on one of the world's most popular video games: Minecraft!

In particular, a research group of computer science students at UBC loved playing Minecraft so much that they decided to collect data on how people play video games and to investigate whether data derived from play sessions can be used to predict player behaviour. Minecraft servers are expensive to set up and maintain. Thus, to facilitate the research effectively, the group needs to target Minecraft enthusiasts while making sure that enough server capacity is on hand.

Being a kind data scientist (and commerce student) myself, I thought I would help the group identify which players have the potential to become engaged and interested in the broader project by uncovering the characteristics that are correlated with being a newsletter subscriber.

# The Question 
The formal question this analysis attempts to answer is: Can player demographics, specifically played hours and age, predict whether a new player will be a newsletter subscriber? Answering this question and determining the factors that affect whether a new player will become a subscriber can provide useful information to the research group.

Knowing which players are more inclined to subscribe helps the research group decide whom to recruit to the Minecraft server. A key assumption is that players who subscribe are also more engaged in the broader project, which can lead to richer data collection (more sessions played, more responsiveness to surveys, etc). A further benefit is that many video games offer subscription services (the battle pass in Fortnite, Nintendo game passes, etc), so knowing the factors that drive subscription rates for the Minecraft-related newsletter may help the group build intuition about subscribing tendencies in other games.

Now that we know the question we are trying to answer, as well as its implications and benefits for the amazing group of UBC computer science students, let us start the analysis (and code)!

# Analysis (and Code)!

Before we start our analysis, we need to load the packages used to read, wrangle, and visualize our data.

In [None]:
# run before continuing 
library(tidyverse)
library(dplyr)
library(repr)
library(tidymodels)

Before we start, let's also set our seed for the rest of the analysis to make sure our results are reproducible.

In [None]:
# set our seed
set.seed(2025) 
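
As a quick illustration of why the seed matters, the sketch below resets the same seed before two random draws and checks that the draws match. The names `first_draw` and `second_draw` are made up for this example and are not used elsewhere in the analysis.

```r
# Draw five random numbers after setting the seed
set.seed(2025)
first_draw <- sample(1:100, 5)

# Reset the same seed and draw again
set.seed(2025)
second_draw <- sample(1:100, 5)

# TRUE: resetting the seed reproduces exactly the same draw
identical(first_draw, second_draw)
```

Note that re-running this check leaves the random number generator in the state it would be in after a single `set.seed(2025)` followed by one draw, so it is best kept out of the main analysis cells.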

Now that the seed is set, let's load the `players.csv` file from the `data` folder in our current directory. This is going to be the main dataset we work with to answer our classification problem.




In [None]:
# loading in our dataset 
players <- read_csv("data/players.csv")

players

Looking at our dataset, it is clear that there are several steps to take to prepare it for analysis. As a first step, let's `select` the `subscribe`, `played_hours`, and `Age` columns, as those are the relevant variables for our analysis.

In [None]:
# keep only the three columns relevant to our analysis (drops all other columns)
players_clean <- players |>
    select(subscribe, played_hours, Age)

players_clean

Next, we need to convert the `subscribe` column into a factor, as this is the variable we will be classifying. Let's do this now.

In [None]:
# convert subscribe into factor data type
players_clean_factored <- players_clean |>
    mutate(subscribe = as.factor(subscribe))

players_clean_factored
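
To make the effect of this conversion concrete, here is a minimal sketch on a small made-up table. It assumes `subscribe` is read in as a logical (`TRUE`/`FALSE`) column; `toy_players` and `toy_factored` are hypothetical names and values used only for illustration, not rows from `players.csv`.

```r
library(tidyverse)

# Hypothetical toy data standing in for players_clean
toy_players <- tibble(
  subscribe = c(TRUE, FALSE, TRUE),
  played_hours = c(30.3, 0.0, 1.5),
  Age = c(9, 17, 21)
)

# Same conversion as above: logical -> factor
toy_factored <- toy_players |>
  mutate(subscribe = as.factor(subscribe))

# The column is now a factor with one level per class
class(toy_factored$subscribe)
levels(toy_factored$subscribe)
```

Classification models in `tidymodels` expect the outcome to be a factor, which is why this step comes before any modelling.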

Before we build our model, it is helpful to visualize the relationship between the two predictor variables and how the points are classified in our dataset. Let's build a scatterplot now with `Age` on the x-axis, `played_hours` on the y-axis, and colour the points using the `subscribe` column.

In [None]:
# create our scatterplot
players_plot <- players_clean_factored |>
    ggplot(aes(x = Age, y = played_hours, colour = subscribe)) +
    geom_point() +
    xlab("Age (in years)") +
    ylab("Number of Played Hours") +
    ggtitle("Scatterplot of Age and Number of Played Hours of Minecraft Players")

players_plot

# Wait, there seems to be no obvious correlation ...
We can see that, on the surface, there is no strong evidence that age and number of hours played are useful for predicting newsletter subscriptions. However, let's not panic: in the plot, newsletter subscribers still appear to be generally younger and to play more hours than non-subscribers.
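
A visual impression like this can be checked with a quick per-group summary. The sketch below shows one way to do that on a small made-up table; `toy_players` and `toy_summary` are hypothetical names, the values are invented, and in the notebook the same pipeline would be applied to `players_clean_factored`. The `na.rm = TRUE` arguments are a precaution in case any values are missing.

```r
library(tidyverse)

# Hypothetical toy data; values are invented for illustration only
toy_players <- tibble(
  subscribe = factor(c(TRUE, TRUE, FALSE, FALSE, TRUE)),
  played_hours = c(30.3, 3.8, 0.0, 0.1, 0.7),
  Age = c(9, 17, 21, 22, NA)
)

# Compare subscribers and non-subscribers on age and hours played
toy_summary <- toy_players |>
  group_by(subscribe) |>
  summarize(
    n = n(),
    mean_age = mean(Age, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE)
  )

toy_summary
```

The median is used for hours played because a few very heavy players can pull the mean far away from what a typical player looks like.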

To gain more insight, let's build our classification model using k-nearest neighbors to see if we can accurately predict whether someone is going to be a newsletter subscriber based on age and number of hours played on the Minecraft server.

Linking this back to our initial question and purpose, knowing which variables help us accurately predict whether a new player will subscribe can help the research group determine which demographic to target for the study. Furthermore, being able to accurately classify subscribers based on demographics, in our case age and hours played, can help the research group decide whom to approach when asking for more in-depth data or research in the future.
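
For orientation, here is a rough sketch of how a k-nearest neighbors classifier can be put together with `tidymodels`. It is only one possible layout under stated assumptions, not the final model of this analysis: the data are simulated (`toy_players` is a hypothetical stand-in with the same column names), the 75/25 split and `neighbors = 5` are arbitrary placeholders rather than tuned choices, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(2025)

# Hypothetical simulated data with the same column names as our dataset
toy_players <- tibble(
  Age = round(runif(200, 8, 50)),
  played_hours = round(rexp(200, rate = 0.2), 1)
) |>
  mutate(subscribe = as.factor(played_hours > 3 | Age < 20))

# Hold out 25% of the rows to evaluate the classifier
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize both predictors so neither dominates the distance calculation
toy_recipe <- recipe(subscribe ~ Age + played_hours, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# neighbors = 5 is a placeholder, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Bundle preprocessing and model, then fit on the training set only
knn_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the held-out rows and report accuracy
toy_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

toy_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")
```

Scaling matters here because `Age` and `played_hours` are measured in different units, and k-nearest neighbors relies on distances between points.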