# Predicting Newsletter Subscription from Demographics and Play Behavior

## Introduction

**Background**: The UBC Minecraft server project collects player data to help manage server resources and improve player engagement strategies. One key goal is to understand what kinds of players are likely to subscribe to the server newsletter, which serves as a way to share updates and strengthen the community.

**Question**: This project investigates the following question:

*Can we predict whether a player will subscribe to the newsletter based on their age and total number of hours played?*

To answer this question, we will use data from the following source:

- `players.csv`, which contains each player's demographic information (including age), total number of hours played (`played_hours`), and subscription status (`subscribe`).

This project focuses only on the `players.csv` dataset. By using demographic information (`Age`) and behavioral information (`played_hours`), we aim to determine whether these variables are useful predictors of newsletter subscription.

We chose this approach because:
- Age may relate to interest in community updates.
- Players who spend more time in-game may be more engaged and thus more likely to subscribe.

**Data Description**

This project uses the dataset `players.csv`, which contains demographic, behavioral, and subscription information for 196 unique players.

We investigate whether two of these variables, `Age` and `played_hours`, are useful predictors of newsletter subscription status (`subscribe`).

**Dataset Overview**

#### players.csv

| Variable Name | Type    | Description                                  |
|---------------|---------|----------------------------------------------|
| hashedEmail   | String  | Unique identifier for each player            |
| Age           | Numeric | Player's reported age                        |
| gender        | String  | Player's reported gender                     |
| subscribe     | Boolean | Whether the player subscribed (TRUE/FALSE)  |
| experience    | String  | Self-reported experience level               |
| played_hours  | Numeric | Total number of hours the player has played |
| name          | String  | Player's chosen username                     |

**Summary Statistics**

- Number of players: 196  
- Average player age: 20.5 years  
- Average played hours: 8.7 hours  
- Newsletter subscription rate: 73.5% subscribed  
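
Summary figures of this kind can be computed with a single `summarise()` call. The sketch below is illustrative only: `toy_players` is a made-up four-row table that borrows the column names of `players.csv`, so its output does not reproduce the numbers above.

```r
library(dplyr)

# Made-up rows with the same column names as players.csv (not real data)
toy_players <- tibble(
  Age          = c(17, 21, NA, 25),
  played_hours = c(0.5, 12, 3, 0),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE)
)

toy_summary <- toy_players |>
  summarise(
    n_players      = n(),                        # number of rows
    mean_age       = mean(Age, na.rm = TRUE),    # ignore missing ages
    mean_hours     = mean(played_hours, na.rm = TRUE),
    subscribe_rate = mean(subscribe)             # share of TRUE values
  )

toy_summary
```

Taking the mean of a logical column gives the proportion of `TRUE` values, which is how a subscription rate can be read directly off `subscribe`.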

These variables (`Age` and `played_hours`) will be used to build a predictive model to determine whether a player is likely to subscribe to the server newsletter.

## Methods & Results

To explore whether a player's age and total played time can help predict their newsletter subscription status, we followed several steps:

### Data Processing

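- We used the `players.csv` dataset, which contains each player's age, total played hours, and subscription status.
- We kept only the necessary columns: `Age`, `played_hours`, and `subscribe`.
- We removed any rows with missing values to ensure a clean dataset for modeling.
- We restricted the data to players with at most 20 played hours and an age of at most 35, to focus on the core player group and limit the influence of extreme values (see the Discussion).
- We converted `subscribe` to a factor, which the classification model requires for its response variable.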

### Splitting the Data

- We split the dataset into a **training set** (75%) and a **testing set** (25%) using random sampling, with a fixed seed (`set.seed(123)`) so that the split is reproducible.
- The training set was used to teach the model how age and played hours might relate to newsletter subscription.
- The testing set was used to evaluate how well the model performs on new, unseen data.
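
Because subscribers and non-subscribers are not equally common, a purely random split can leave the two sets with different class proportions. The sketch below shows a stratified variant using the `strata` argument of `initial_split()`. It is not the split used in this report (which does not stratify), and `toy_players` is made-up data for illustration.

```r
library(rsample)

# Made-up data: 40 players, 30 subscribed and 10 did not (not real data)
toy_players <- data.frame(
  Age          = rep(c(17, 19, 21, 23), times = 10),
  played_hours = rep(c(0.5, 2, 5, 10, 1), times = 8),
  subscribe    = factor(rep(c("TRUE", "TRUE", "TRUE", "FALSE"), times = 10))
)

set.seed(123)
# strata = subscribe samples within each class, so both pieces keep
# roughly the same share of subscribers
toy_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Compare the share of subscribers in each piece
mean(toy_train$subscribe == "TRUE")
mean(toy_test$subscribe == "TRUE")
```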

### Model

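- We trained a **K-nearest neighbors (KNN)** classification model with K = 5, using `Age` and `played_hours` as the predictor variables and `subscribe` as the response.
- Both predictors were scaled and centered so that neither one dominates the distance calculation simply because of its units.
- To classify a new player, the model finds the 5 training players closest to them in the standardized predictor space and predicts a class from those neighbors' subscription status.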
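
To make the idea concrete, here is a minimal base-R sketch of KNN classification by unweighted majority vote. It is a simplification for illustration, not the `kknn` engine used in this report, which may weight neighbors by distance. The helper `knn_predict_one()` and the `train` table are hypothetical, and the values are made up.

```r
# Toy training data (made-up values, not from players.csv)
train <- data.frame(
  Age          = c(17, 18, 21, 22, 25, 30),
  played_hours = c(0.5, 12, 1, 8, 0.2, 15),
  subscribe    = factor(c("FALSE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"))
)

# Hypothetical helper: classify one new player by unweighted majority vote
knn_predict_one <- function(train, new_age, new_hours, k = 3) {
  # Standardize both predictors using the training-set mean and sd
  age_z       <- (train$Age - mean(train$Age)) / sd(train$Age)
  hours_z     <- (train$played_hours - mean(train$played_hours)) / sd(train$played_hours)
  new_age_z   <- (new_age - mean(train$Age)) / sd(train$Age)
  new_hours_z <- (new_hours - mean(train$played_hours)) / sd(train$played_hours)

  # Euclidean distance from the new player to every training player
  dists <- sqrt((age_z - new_age_z)^2 + (hours_z - new_hours_z)^2)

  # Labels of the k nearest neighbors, then the most common label
  nearest <- train$subscribe[order(dists)[1:k]]
  names(which.max(table(nearest)))
}

knn_predict_one(train, new_age = 20, new_hours = 10, k = 3)
```

An odd `k` avoids ties when there are two classes, which is one reason small odd values such as 3 or 5 are common choices.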

### Evaluation

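- We evaluated the model using **accuracy**, the proportion of correct predictions on the test set.
- Accuracy is only meaningful relative to a baseline. Because 73.5% of players in the full dataset subscribed, a classifier that always predicts "subscribed" would already be right for most players, so `Age` and `played_hours` are useful predictors only to the extent that the model beats this majority-class baseline.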
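
One way to put an accuracy figure in context is to compare it with a majority-class baseline, that is, the accuracy of always predicting the most common label. The sketch below computes both in base R. The `truth` and `prediction` vectors are made up for illustration and are not this report's test results.

```r
# Made-up test-set labels and predictions, for illustration only
truth <- factor(
  c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "TRUE", "FALSE"),
  levels = c("FALSE", "TRUE")
)
prediction <- factor(
  c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "TRUE"),
  levels = c("FALSE", "TRUE")
)

# Accuracy: proportion of predictions that match the true label
model_accuracy <- mean(prediction == truth)

# Majority-class baseline: accuracy of always predicting the most common label
majority_class    <- names(which.max(table(truth)))
baseline_accuracy <- mean(truth == majority_class)

model_accuracy
baseline_accuracy
```

In this toy case the two numbers coincide, which would mean the classifier adds nothing over always guessing the majority class, even though its accuracy is well above 50%.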

This analysis helps us understand whether player demographics (age) and behavior (play time) are useful indicators of newsletter interest.

In [None]:
library(tidyverse)
library(tidymodels)

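# Data Processing: keep the three modelling columns, drop incomplete rows,
# and restrict to the core player group (<= 20 played hours, Age <= 35)
players <- read_csv("players.csv") |>
    select(Age, played_hours, subscribe) |>
    drop_na() |>
    filter(played_hours <= 20, Age <= 35) |>
    mutate(subscribe = as.factor(subscribe))  # classification needs a factor response

# Splitting the Data: 75% training, 25% testing (seed fixed for reproducibility)
set.seed(123)
players_split <- initial_split(players, prop = 0.75)
train_data <- training(players_split)
test_data <- testing(players_split)

# Model: standardize both predictors so neither dominates the distance metric
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = train_data) |>
    step_scale(all_numeric_predictors()) |>
    step_center(all_numeric_predictors())

# KNN classifier with K = 5 neighbors
knn_spec <- nearest_neighbor(mode = "classification", neighbors = 5) |>
    set_engine("kknn")

# Bundle preprocessing and model so both are fit on the training data only
knn_workflow <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec)

knn_fit <- fit(knn_workflow, data = train_data)

# Evaluation: predict on the held-out test set
predicted_classes <- predict(knn_fit, new_data = test_data)

# Attach the predicted class to the test rows
test_predictions <- test_data |>
    mutate(prediction = pull(predicted_classes, .pred_class))

# Proportion of test players whose subscription status was predicted correctly
accuracy(test_predictions, truth = subscribe, estimate = prediction)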

In [None]:
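# Distribution of total played hours (`players` is already filtered above)
ggplot(players, aes(x = played_hours)) +
  geom_histogram(bins = 30) +
  labs(x = "Total played hours", y = "Number of players")

# Distribution of player age
ggplot(players, aes(x = Age)) +
  geom_histogram(bins = 20) +
  labs(x = "Age (years)", y = "Number of players")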

## Discussion

After training a K-nearest neighbors (KNN) classification model using `Age` and `played_hours` as predictor variables, the model achieved an accuracy of approximately **68.2%** on the testing set. This is higher than the roughly 53% accuracy we obtained initially on the full dataset without filtering, although the two figures come from different test sets and are not strictly comparable. Because the test set is small (25% of fewer than 196 players), both estimates are also noisy.

We identified and removed extreme values in both variables. Most players had fewer than 25 total played hours and were between 15 and 30 years old. These observations guided us to filter the data to the core player group: players with at most 20 played hours and an age of at most 35. This step likely reduced noise in the data and allowed the model to learn more consistent patterns, but it also means the results apply only to players within that range.

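These results suggest that **player demographics and behavioral time investment carry, at most, a modest signal about newsletter subscription**. The model does better than a coin flip, but that is a weak benchmark: 73.5% of players in the full dataset subscribed, so always predicting "subscribed" would score about that well on the unfiltered data. Whether the model beats this majority-class baseline on the filtered test set should be checked before concluding that the selected variables add predictive value.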

However, the moderate performance also implies that other unobserved factors may play a more important role in predicting subscription. For example, social interactions, in-game achievements, or motivations for joining the community could be more relevant but were not available in this dataset.

To further improve the model, future work could:
- Include additional variables (e.g., gender and experience level, which are already in `players.csv`, or event participation, which is not)
- Try alternative models (e.g., logistic regression, decision trees)
- Address potential class imbalance through resampling or weighted loss functions
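
As an illustration of the second suggestion, the sketch below fits a logistic regression with base R's `glm()` on the same two predictors. This is a different technique from the KNN model used in this report, `toy_players` is made-up data, and the fitted coefficients carry no meaning for the real players.

```r
# Made-up data with the same column names as players.csv (not real data)
toy_players <- data.frame(
  Age          = c(17, 18, 19, 20, 21, 22, 23, 24, 25, 26),
  played_hours = c(0.1, 0.5, 1, 2, 3, 5, 8, 10, 12, 15),
  subscribe    = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Logistic regression: models the log-odds of subscribing as a linear
# function of Age and played_hours
logit_fit <- glm(subscribe ~ Age + played_hours,
                 data = toy_players, family = binomial)

# Predicted subscription probabilities for two hypothetical new players
new_players <- data.frame(Age = c(18, 24), played_hours = c(1, 9))
pred_prob <- predict(logit_fit, newdata = new_players, type = "response")

# Convert probabilities to class labels with a 0.5 threshold
pred_class <- pred_prob > 0.5
pred_prob
```

Unlike KNN, this model returns a probability for each player, so the classification threshold can be adjusted when the classes are imbalanced.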

Overall, this analysis demonstrates how basic demographic and behavioral features can offer some predictive value, while also highlighting the limits of such information when used in isolation.