Individual Project Planning

In this project, our group will conduct a complete data science workflow to explore and predict player behavior on a video game research server. The dataset comes from a real-world study conducted by a research group in Computer Science at UBC, led by Frank Wood, which investigates how people play and interact within a Minecraft environment. 

In this planning report, I conduct a preliminary analysis, organization, and visualization of the dataset. The report has three parts: first, I state one broad question and one specific question that I will address; then I describe, wrangle, and visualize the data; finally, I discuss the methods I plan to use in the rest of the project.

1. Broad Question

The broad question I intend to address is "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?" My specific question is "Is played_hours a good predictor of newsletter subscription status, and does played_hours tend to be higher among younger players?"

2. Data Description, Wrangling, and Analysis

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(lubridate)
library(patchwork)

We have two datasets: players and sessions. 

In [None]:
player <- read_csv("data/players.csv")
head(player, 5)

The players dataset contains 196 observations (rows) of 7 variables.

In [None]:
variable_summary_1 <- tibble(
  variable = c("experience", "subscribe", "hashedEmail", "played_hours", "name", "gender", "Age"),
  type = c("character", "logical", "character", "numeric", "character", "character", "numeric"),
  description = c("Gaming experience",
                  "Whether the player subscribes to the newsletter",
                  "Anonymized email identifier",
                  "Hours the player has played",
                  "Player name",
                  "Player gender",
                  "Player age"))

variable_summary_1

The players dataset is already tidy, so we can compute summary statistics for played_hours and Age directly. The Age column contains missing (NA) values, so I first remove the rows with a missing Age (or played_hours) and store the result as "players". All later calculations and visualizations use players.

In [None]:
# Drop rows with a missing played_hours or Age
players <- player |>
  filter(!is.na(played_hours), !is.na(Age))

# Summary statistics for played_hours, rounded to 2 decimal places
played_hours_stat <- players |>
  summarise(mean = mean(played_hours),
            median = median(played_hours),
            sd = sd(played_hours),
            min = min(played_hours),
            max = max(played_hours)) |>
  round(2)

# Summary statistics for Age, rounded to 2 decimal places
Age_stat <- players |>
  summarise(mean = mean(Age),
            median = median(Age),
            sd = sd(Age),
            min = min(Age),
            max = max(Age)) |>
  round(2)

played_hours_stat
Age_stat
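
Before dropping rows, it can help to confirm how many values are missing in each column. The following is a minimal sketch of that check; toy_player is a small made-up tibble standing in for the raw player data, so the counts it prints say nothing about the real dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the raw `player` tibble (values are made up)
toy_player <- tibble(
  played_hours = c(0.5, 30.3, NA, 1.2),
  Age = c(17, NA, 21, 25)
)

# Count the missing values in every column
na_counts <- toy_player |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Keep only the rows that are complete in both columns, as in the cell above
toy_players <- toy_player |>
  filter(!is.na(played_hours), !is.na(Age))
nrow(toy_players)
```

Running the same summarise() call on player would show which columns actually drive the row loss.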

In [None]:
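# Compute the means from the data rather than typing them in by hand
players_stat <- players |>
  summarise(mean_played_hours = mean(played_hours), mean_Age = mean(Age)) |>
  round(2)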
players_stat

Now we explore the "sessions" dataset.

In [None]:
session <- read_csv("data/sessions.csv")
head(session, 5)

The sessions dataset contains 1535 observations (rows) of 5 variables.

In [None]:
variable_summary_2 <- tibble(
  variable = c("hashedEmail", "start_time", "end_time", "original_start_time", "original_end_time"),
  type = c("character", "character", "character", "numeric", "numeric"),
  description = c("Anonymized email identifier",
                  "Session start time (human-readable string)",
                  "Session end time (human-readable string)",
                  "Session start time as a machine-encoded number",
                  "Session end time as a machine-encoded number"))

variable_summary_2

The sessions dataset needs some wrangling before analysis. start_time and end_time each store a date and a time in a single character string, so they must be converted to date-time objects before any time arithmetic. Once converted, we can create a new numeric (double) variable, session_time_min, as end_time minus start_time in minutes; any negative durations are set to 0. The original_start_time and original_end_time columns are not human readable and duplicate the information in start_time and end_time, so they are dropped.

In [None]:
sessions <- session |> 
    mutate(start_time = dmy_hm(start_time), end_time = dmy_hm(end_time)) |>
    mutate(session_time_min = as.numeric(difftime(end_time, start_time, units = "mins")),
           session_time_min = pmax(session_time_min, 0)) |> 
    select(hashedEmail, session_time_min) |>
    filter(!is.na(session_time_min), !is.na(hashedEmail)) 
head(sessions, 5)
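
To make the conversion step concrete, here is a minimal sketch on two made-up timestamps. It assumes the raw strings follow a day/month/year hour:minute layout, which is what dmy_hm() expects; the example values are hypothetical and not taken from sessions.csv.

```r
library(lubridate)

# Made-up timestamps in "dd/mm/yyyy HH:MM" form (an assumed layout)
start_chr <- c("30/06/2024 18:12", "17/06/2024 23:33")
end_chr <- c("30/06/2024 18:24", "17/06/2024 23:46")

# Parse the character strings into date-time objects
start_dt <- dmy_hm(start_chr)
end_dt <- dmy_hm(end_chr)

# Session length in minutes, as a plain numeric vector
session_min <- as.numeric(difftime(end_dt, start_dt, units = "mins"))
session_min
```

Setting units = "mins" explicitly matters here: without it, difftime() chooses a unit based on the size of the differences, so the numeric result could change meaning between datasets.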

The summary statistics of the session_time_min variable are given below.

In [None]:
session_time_stat <- sessions |>
   summarise(mean = mean(session_time_min),
    median = median(session_time_min),
    sd = sd(session_time_min),
    min = min(session_time_min),
    max = max(session_time_min)) |>
    round(2)
session_time_stat

3. Visualization

We will mainly use the subscribe, played_hours, and Age columns. At this planning stage, I made a few exploratory visualizations of these three variables.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)
Age_plot <- ggplot(data = players, aes(x = Age)) +
  geom_histogram(bins = 20, fill = "skyblue") +
  labs(title = "Distribution of Player Age", 
       x = "Age (years)", y = "Number of Players")

hours_plot <- ggplot(data = players, aes(x = played_hours)) +
  geom_histogram(bins = 20, fill = "skyblue") +
  labs(title = "Distribution of Hours Played", 
       x = "Time Played (hours)", y = "Number of Players")

Age_plot / hours_plot

In [None]:
options(repr.plot.width = 10, repr.plot.height = 7)
subscribe_plot <- ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) +
   geom_point() + 
   labs(x = "Age of Players (years)", y = "Time Played (hours)",
        color = "Subscription Status", title = "Age vs. Time Played")
subscribe_plot

Most players are between 15 and 30 years old. There are fewer older players, and their play times tend to be relatively low. A few subscribed players, particularly those under 25, have played dramatically more hours (above 100), indicating highly engaged users. Both subscribed and non-subscribed players appear across all age groups, but the high-hour outliers mostly belong to the subscribed group, suggesting that subscription may be associated with greater play time. Overall, this visualization is consistent with a positive association between played_hours and subscription, particularly among younger players, and provides initial, descriptive evidence for my question. Because the pattern rests largely on a handful of outliers, it still needs to be checked with a model.
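
The visual impression above can be backed by a grouped summary. The sketch below is one possible way to do that; toy_players is a made-up tibble standing in for players, so the numbers are purely illustrative.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `players` (values are made up)
toy_players <- tibble(
  subscribe = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  played_hours = c(0.5, 2.0, 150.0, 0.0, 0.1, 1.2)
)

# Compare play time between non-subscribers and subscribers
hours_by_status <- toy_players |>
  group_by(subscribe) |>
  summarise(n = n(),
            median_hours = median(played_hours),
            mean_hours = mean(played_hours))
hours_by_status
```

Reporting the median next to the mean is a deliberate choice: with a few very high played_hours values, the mean alone can overstate the difference between groups.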

4. Methods and Plan

To examine whether played_hours can predict newsletter subscription, I plan to use a K-nearest neighbours (KNN) classification model, fitted with the kknn engine in tidymodels. This method is appropriate because the outcome variable, subscribe, is binary, and KNN is a flexible classifier that does not assume linear relationships or specific distributions. It classifies players based on the subscription status of the most similar players, which fits the idea that players with similar gameplay patterns may have similar subscription tendencies. Before modelling, I will clean the data, remove NA values, and standardize the numeric predictors. I will split the data into 70% training and 30% testing sets, and then use 5-fold cross-validation on the training set to choose the best value of k. Model performance will be assessed using accuracy and a confusion matrix, and the final selected model will be evaluated on the test set. Potential limitations include the sensitivity of KNN to outliers, to class imbalance, and to the choice of k.
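
The plan above can be sketched as a tidymodels workflow. This is a rough outline under stated assumptions, not the final analysis: toy_players is synthetic data generated only so the code runs on its own, the grid of k values (odd numbers from 1 to 21) is an arbitrary choice, and with the real data subscribe would first need to be converted from logical to a factor, for example with factor(subscribe).

```r
library(tidymodels)

set.seed(2024)

# Synthetic stand-in for `players`; subscribers are given longer play times
n <- 200
toy_players <- tibble(
  subscribe = factor(sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.7, 0.3))),
  played_hours = rexp(n, rate = ifelse(subscribe == "TRUE", 0.1, 0.5))
)

# 70/30 train/test split, stratified on the outcome
players_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize the predictor, since KNN is distance based
knn_recipe <- recipe(subscribe ~ played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier with k left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# 5-fold cross-validation on the training set over an assumed grid of k values
folds <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 21, by = 2))

cv_results <- knn_workflow |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k, then evaluate on the test set
final_fit <- knn_workflow |>
  finalize_workflow(tibble(neighbors = best_k)) |>
  fit(data = players_train)

test_predictions <- predict(final_fit, players_test) |>
  bind_cols(players_test)

test_predictions |> accuracy(truth = subscribe, estimate = .pred_class)
test_predictions |> conf_mat(truth = subscribe, estimate = .pred_class)
```

Because most players in a dataset like this may belong to one class, the test accuracy should be read against the majority-class proportion, and the confusion matrix helps show whether the model predicts anything other than the majority class.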

To address whether younger players tend to have higher played_hours, I will use ggplot visualizations rather than a predictive model. This question is descriptive in nature, and visualization is well suited to exploring trends in play time across ages. I will create plots such as boxplots comparing played_hours across age groups, or histograms of played_hours faceted by age group. These visualizations will help reveal general patterns, though they cannot establish causation and may be affected by skewed data or uneven group sizes. By combining KNN classification for the predictive question with ggplot visual exploration for the age-related patterns, this plan provides a structured approach to both parts of the question.
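
A boxplot by age group could be sketched as follows. The data in toy_players are made up, and the age bands (17 and under, 18-25, 26-35, 36+) are hypothetical cut points chosen for illustration; the real analysis may use different groups.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical stand-in for `players` (values are made up)
toy_players <- tibble(
  Age = c(12, 16, 17, 19, 22, 24, 28, 35, 41, 50),
  played_hours = c(3.0, 0.5, 48.0, 1.2, 0.0, 7.5, 0.3, 1.0, 0.2, 0.1)
)

# Bin Age into assumed age bands; each interval includes its upper bound
age_grouped <- toy_players |>
  mutate(age_group = cut(Age,
                         breaks = c(0, 17, 25, 35, Inf),
                         labels = c("17 and under", "18-25", "26-35", "36+")))

# Check the group sizes, since uneven groups affect how the boxes should be read
count(age_grouped, age_group)

age_boxplot <- ggplot(age_grouped, aes(x = age_group, y = played_hours)) +
  geom_boxplot() +
  labs(x = "Age Group (years)", y = "Time Played (hours)",
       title = "Time Played by Age Group")
age_boxplot
```

If a few very large played_hours values compress the boxes, a transformed y-axis or a zoomed view may make the group comparison easier to see.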