# Project Planning Stage (Group)

## **Introduction**

**Background:** Recruiting and retaining active users is critical for running online gaming experiments. The UBC Minecraft research server, led by Prof. Frank Wood, offers an open-access environment in which every movement and click is automatically logged. While the server generates rich behavioural data, only a small fraction of players eventually subscribe to the project’s game-related newsletter, a low-cost channel for announcing new experiments, updates, and funding opportunities. Being able to predict subscription likelihood from readily available player attributes would allow the team to focus recruitment messages on the users most likely to engage, thereby increasing subscription rates without expanding marketing effort.

**Primary Question:** Can a player’s experience level, playing time, and age predict whether a player subscribes to the game-related newsletter?

**Datasets:** One CSV file is provided and is linkable via the primary key player_id.

players.csv – one row per unique participant. After importing, the merged file contains 1,842 players. The data were collected between 2024-08-01 and 2024-11-30.

*Edited intro if you want to change it*

Recruiting and retaining active users is critical for running online gaming experiments. The UBC Minecraft research server logs all player activity in detail. However, only a small fraction of players subscribe to the game-related newsletter. Because the newsletter is used to announce experiments, updates, and funding opportunities, understanding what influences subscription behaviour can help the team target its messages more effectively.

For this project, our goal is to explore the question:
*"Can a player's experience level, playing time, and age predict whether a player subscribes to the game-related newsletter?"*

We will use the players.csv dataset, which contains one row per player along with their experience level, total hours played, age, and subscription status. The data were collected between 2024-08-01 and 2024-11-30.


**Describe why we used said predictors and why we didn't use other predictors**

**Descriptive summary of dataset:**
- The players dataset contains 196 observations and 7 variables.

  Below is a summary of the variables used in our analysis:

| Variable Name | Data Type   | Description / Meaning                                                                                       | Notes / Potential Issues                                                                                |
| ------------- | ----------- | ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------- |
| experience    | Categorical | Player’s self-reported experience level (Beginner, Amateur, Regular, Veteran, Pro)                          | Reflects player skill/familiarity                                                                       |
| subscribe     | Logical     | Whether players are subscribed to the newsletter (TRUE/FALSE)                                               | May be influenced by external sources other than gameplay experience (marketing, interest)              |
| hashedEmail   | Categorical | Unique hashed player identifier (anonymized)                                                                | Used to identify players; no direct analytical value                                                    |
| played_hours  | Numeric     | Total hours played on server                                                                                | Includes 0 hours (inactive or new players); possible outliers with high values                          |
| name          | Categorical | Player’s first name                                                                                         | Nominal data with potential duplicates                                                                  |
| gender        | Categorical | Player’s self-identified gender (Male, Female, Non-binary, Two-Spirited, Agender, Other, Prefer not to say) | Multiple categories with social diversity; minority groups may require special attention/representation |
| Age           | Integer     | Player’s age in years                                                                                       | Large range of ages; two missing data points                                                            |
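
The notes column above flags missing ages and players with zero recorded hours. A minimal sketch of how such issues could be counted is shown here, written in base R on a small made-up data frame. The name `toy_players` and all of its values are hypothetical and are not taken from players.csv.

```r
# Hypothetical toy data standing in for players.csv (values are made up)
toy_players <- data.frame(
  played_hours = c(0, 0.1, 1.5, 0, 223.1),
  Age = c(17, NA, 21, 19, NA)
)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy_players))
missing_counts

# Number of players with zero recorded hours
zero_hour_players <- sum(toy_players$played_hours == 0)
zero_hour_players
```

Running the same two checks on the real data would show whether the missing values and zero-hour players are few enough to drop or need separate handling.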

- Summary statistics for the numeric variables:

| Variable     | Min  | Mean  | Median | Max    | Std Dev |
| ------------ | ---- | ----- | ------ | ------ | ------- |
| played_hours | 0.00 | 5.85  | 0.10   | 223.10 | 28.36   |
| Age          | 9.00 | 21.14 | 19.00  | 58.00  | 7.39    |

## **Methods and Results**

Below, we load the data, compute summary statistics, and create a cleaned version of the dataset by selecting the variables used in our analysis (subscribe, experience, played_hours, and Age) and removing rows with missing values.

In [None]:
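# Install tidyverse only if it is not already available
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")

library(tidyverse)  # attaches ggplot2, dplyr, readr, and other core packages
library(knitr)
library(GGally)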

In [None]:
players <- read_csv("players.csv")

head(players)
glimpse(players)

In [None]:
#Summary Statistics

players_summary <- players |>
  summarise(
    across(where(is.numeric),
           list(min = ~round(min(., na.rm = TRUE), 2),
                mean = ~round(mean(., na.rm = TRUE), 2),
                median = ~round(median(., na.rm = TRUE), 2),
                max = ~round(max(., na.rm = TRUE), 2),
                sd = ~round(sd(., na.rm = TRUE), 2)))
  )
players_summary


In [None]:
#selecting necessary variables

players_select <- players |>
    select(subscribe, experience, played_hours, Age)

#converting categorical variables to factors

players_select <- players_select |>
    mutate(subscribe = as.factor(subscribe), experience = as.factor(experience))

#removing rows with missing vals

players_clean <- players_select |>
    filter(!is.na(played_hours), !is.na(Age))

#mean values for numeric variables

mean_summary <- players_clean |>
    summarise(mean_played_hours = mean(played_hours, na.rm = TRUE), mean_age = mean(Age, na.rm = TRUE)) |>
    mutate(across(everything(), ~round(.x, 2)))

head(players_clean)
mean_summary
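
Before moving on, it may be worth checking how many rows the cleaning step removed and how the outcome is split between subscribers and non-subscribers. The sketch below illustrates one way to do this in base R. The data frames `toy_select` and `toy_clean` are hypothetical stand-ins for `players_select` and `players_clean`, and their values are made up.

```r
# Hypothetical stand-in for players_select (values are made up)
toy_select <- data.frame(
  subscribe = factor(c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)),
  played_hours = c(0, 3.2, 0.1, NA, 0, 12.5),
  Age = c(17, 21, NA, 19, 25, 30)
)

# Keep only rows where both numeric predictors are present
keep <- complete.cases(toy_select[, c("played_hours", "Age")])
toy_clean <- toy_select[keep, ]

# How many rows did the cleaning step drop?
rows_removed <- nrow(toy_select) - nrow(toy_clean)
rows_removed

# Share of subscribers and non-subscribers after cleaning
subscribe_share <- prop.table(table(toy_clean$subscribe))
subscribe_share
```

A strongly uneven split between the two classes would be worth noting, since it affects how any later prediction accuracy should be interpreted.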

## **Data Visualization**

In [None]:
#Plot 1: Categorical Plot - Experience vs. Subscription Status
options(repr.plot.width = 12, repr.plot.height = 8)

cat_plot <- players_clean |>
    group_by(experience, subscribe) |>
    summarise(count = n(), .groups = "drop") |>
    group_by(experience) |>
    mutate(prop = count / sum(count)) |>
    filter(subscribe == "TRUE") |>
    ggplot(aes(x = experience, y = prop, fill = experience)) +
    scale_y_continuous(labels = scales::percent_format()) +
    geom_col(show.legend = FALSE) + 
    labs(title = "Newsletter Subscription Rate by Experience Level", x = "Experience Level", y = "Subscription Rate (%)") +
    theme(text = element_text(size = 20)) 
cat_plot

In [None]:
#Plot 2: Numeric Plot 1 - Age vs. Subscription Status

options(repr.plot.width = 12, repr.plot.height = 8)

num_plot1 <- players_clean |>
    ggplot(aes(x = subscribe, y = Age, color = subscribe)) +
    geom_boxplot(outlier.shape = NA, alpha = 0.4) +
    geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
    labs(x = "Subscription Status", y = "Age", title = "Age vs. Subscription Status") +
    theme(text = element_text(size = 20)) + 
    theme(legend.position = "none")

num_plot1

In [None]:
#Plot 3: Numeric Plot 2 - Played Hours vs. Subscription Status

num_plot2 <- players_clean |>
    ggplot(aes(x = subscribe, y = played_hours, color = subscribe)) +
    geom_boxplot(outlier.shape = NA, alpha = 0.4) +
    geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
    scale_y_continuous(trans = "log1p") +
    labs(x = "Subscription Status", y = "Played Hours (log1p scale)", title = "Played Hours vs. Subscription Status") +
    theme(text = element_text(size = 20))  +
    theme(legend.position = "none")

num_plot2
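
Plot 3 uses the `log1p` transformation, which computes log(1 + x). A short illustration of why this suits played_hours is given below. The input values are examples only (0, 0.1, and 223.1 correspond to the minimum, median, and maximum in the summary table, and the others are arbitrary).

```r
# Example hour values: 0, 0.1 and 223.1 match the summary table; 1 and 10 are arbitrary
hours <- c(0, 0.1, 1, 10, 223.1)

# log1p(x) computes log(1 + x), so zero hours maps to zero
transformed <- log1p(hours)
round(transformed, 2)

# A plain log cannot represent zero hours: log(0) is -Inf
log(0)

# expm1() is the inverse of log1p(), recovering the original hours
expm1(transformed)
```

Because many players have zero hours, a plain log scale would drop or distort those points, whereas `log1p` keeps them on the axis while still compressing the long right tail.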

## **Visualization Insights:**

**Plot 1** 
- Veteran players had the lowest subscription rate (68%), whereas Regular players had the highest (81%).
- The remaining experience levels had similar subscription rates (70-77%).
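
Subscription rates are easier to interpret alongside the number of players in each group, since a rate based on a handful of players is less reliable. The sketch below shows one way to tabulate both in base R. The data frame `toy` is hypothetical and its values are made up, so the resulting rates do not correspond to the percentages reported above.

```r
# Hypothetical toy data (values are made up, not from players.csv)
toy <- data.frame(
  experience = c("Veteran", "Veteran", "Veteran",
                 "Regular", "Regular",
                 "Pro", "Pro", "Pro", "Pro"),
  subscribe = c(TRUE, FALSE, TRUE,
                TRUE, TRUE,
                FALSE, TRUE, TRUE, TRUE)
)

# Group size and subscription rate for each experience level
n_players <- tapply(toy$subscribe, toy$experience, length)
sub_rate <- tapply(toy$subscribe, toy$experience, mean)

rate_table <- data.frame(
  experience = names(n_players),
  n = as.vector(n_players),
  rate = as.vector(sub_rate)
)
rate_table
```

Reporting `n` next to each rate would make it clear whether the differences between experience levels rest on enough players to be meaningful.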
  
**Plot 2**
- A


**Plot 3**

## **Discussion**

**References**