# Predicting Newsletter Subscription from Demographics and Play Behavior

## Introduction

**Background**: The UBC Minecraft server project collects player data to help manage server resources and improve player engagement strategies. One key goal is to understand what kinds of players are likely to subscribe to the server newsletter, which serves as a way to share updates and strengthen the community.

**Question**: This project investigates the following question:

*Can we predict whether a player will subscribe to the newsletter based on their age and total number of hours played?*

To answer this question, we will use data from the following source:

- `players.csv`, which contains each player's demographic information (including age), total number of hours played (`played_hours`), and subscription status (`subscribe`).

This project focuses only on the `players.csv` dataset. By using demographic information (`Age`) and behavioral information (`played_hours`), we aim to determine whether these variables are useful predictors of newsletter subscription.

We chose this approach because:
- Age may relate to interest in community updates.
- Players who spend more time in-game may be more engaged and thus more likely to subscribe.

**Data Description**

This project uses the dataset `players.csv`, which contains demographic, behavioral, and subscription information for 196 unique players.

We use two variables—`Age` and `played_hours`—to investigate whether they are useful predictors of newsletter subscription status.

**Dataset Overview**

#### players.csv

| Variable Name | Type    | Description                                  |
|---------------|---------|----------------------------------------------|
| hashedEmail   | String  | Unique identifier for each player            |
| Age           | Numeric | Player's reported age                        |
| gender        | String  | Player's reported gender                     |
| subscribe     | Logical | Whether the player subscribed (TRUE/FALSE)  |
| experience    | String  | Self-reported experience level               |
| played_hours  | Numeric | Total number of hours the player has played |
| name          | String  | Player's chosen username                     |

**Summary Statistics**

- **Number of players**: 196  
- **Average player age**: 20.5 years  
- **Average played hours**: 8.7 hours  
- **Newsletter subscription rate**: 73.5% subscribed  
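
The summary statistics above can be reproduced with a single `summarize()` call. The sketch below uses a small made-up table (`toy_players`, with hypothetical values) in place of `players.csv`. It assumes that missing ages are ignored when averaging, which may not be how the reported figures were computed.

```r
library(tidyverse)

# Hypothetical toy data standing in for players.csv (values are made up)
toy_players <- tibble(
  Age          = c(17, 22, 25, NA, 19),
  played_hours = c(0.5, 12.0, 3.2, 0.0, 40.1),
  subscribe    = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)

toy_summary <- toy_players |>
  summarize(
    n_players         = n(),
    # Assumption: players with a missing age are skipped in the average
    mean_age          = mean(Age, na.rm = TRUE),
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    # The mean of a logical column is the proportion of TRUE values
    subscription_rate = mean(subscribe)
  )

toy_summary
```

Replacing `toy_players` with the `players` data frame would give the corresponding values for the real dataset.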

These variables (`Age` and `played_hours`) will be used to build a predictive model to determine whether a player is likely to subscribe to the server newsletter.


In [None]:
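# Load the tidyverse (data wrangling) and tidymodels (modeling) packages
library(tidyverse)
library(tidymodels)

# Read the player-level and session-level data
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")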

## Methods & Results

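To explore whether a player's age and play activity can help us predict their newsletter subscription status, we followed the steps below. Note that the code in this section measures play activity as the number of sessions per player (counted from `sessions.csv`) rather than as `played_hours`.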

### Data Processing

- We first counted how many play sessions each player had using the `sessions.csv` dataset.
- Then, we combined this count with the main `players.csv` dataset using the unique player ID (`hashedEmail`).
- We kept three columns: `Age`, `subscribe`, and `total_sessions`.
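
The three processing steps can be sketched as a count followed by a join. The example below uses two small made-up tables (`toy_players` and `toy_sessions`, both hypothetical) so that it runs without the real files. It uses `inner_join()`, so players with no recorded sessions are dropped. This is one possible reading of "combined", not necessarily the choice made in the analysis.

```r
library(tidyverse)

# Hypothetical toy tables standing in for players.csv and sessions.csv
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  Age         = c(17, 22, 25),
  subscribe   = c(TRUE, FALSE, TRUE)
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2", "a1")  # one row per play session
)

# Step 1: count sessions per player
# (count() is shorthand for group_by() followed by summarize(n()))
toy_session_counts <- toy_sessions |>
  count(hashedEmail, name = "total_sessions")

# Steps 2 and 3: join on the player ID, then keep the three analysis columns
toy_analysis <- toy_players |>
  inner_join(toy_session_counts, by = "hashedEmail") |>
  select(Age, subscribe, total_sessions)

toy_analysis
```

Player `c3` has no sessions and therefore does not appear in `toy_analysis`. A `left_join()` followed by replacing missing counts with 0 would keep such players instead.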

### Splitting the Data

- We split the dataset into a **training set** (75%) and a **testing set** (25%).
- The training set was used to fit the model, that is, to learn how age and session count relate to subscription status.
- The testing set was used to evaluate how well the model performs on new, unseen data.
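
A 75/25 split of this kind can be sketched with `initial_split()` from rsample, which is loaded as part of tidymodels. The data below (`toy_data`) is made up. The seed value and the use of `strata = subscribe` are assumptions added for illustration, since the text does not say whether the split was seeded or stratified.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical toy data: 40 players, 32 subscribed and 8 not
toy_data <- tibble(
  Age            = rep(c(17, 19, 21, 23, 25), times = 8),
  total_sessions = rep(c(1, 2, 3, 5, 8), times = 8),
  subscribe      = factor(rep(c("yes", "yes", "yes", "yes", "no"), times = 8))
)

# Assumption: an arbitrary seed so the random split is reproducible
set.seed(2024)

# Assumption: stratify on the outcome so both sets keep a similar
# proportion of subscribers
toy_split <- initial_split(toy_data, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

nrow(toy_train)
nrow(toy_test)
```

Stratifying matters most when one class is much more common than the other, as is the case here, where most players are subscribers.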

### Model

- We used a classification method that looks at the values of age and session count to decide whether a player is likely to subscribe.
- The model was trained on the training data and then used to make predictions on the testing data.
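
The text does not name the classification method, so the sketch below uses logistic regression purely as a stand-in to show the general tidymodels pattern of recipe, model specification, workflow, fit, and predict. The training table (`toy_train`), the new players (`toy_new`), and all values are hypothetical. The outcome must be a factor for classification, so a logical `subscribe` column would need converting first.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical training data; the outcome is stored as a factor
toy_train <- tibble(
  Age            = c(17, 18, 19, 20, 21, 22, 24, 26, 30, 35, 40, 45),
  total_sessions = c(12, 1, 9, 2, 7, 10, 1, 5, 2, 6, 1, 3),
  subscribe      = factor(
    c("yes", "no", "yes", "no", "yes", "no",
      "no", "yes", "no", "yes", "no", "yes"),
    levels = c("no", "yes")
  )
)

# Both predictors enter the model as-is
toy_recipe <- recipe(subscribe ~ Age + total_sessions, data = toy_train)

# Assumption: logistic regression stands in for the unnamed classifier
toy_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

toy_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_spec) |>
  fit(data = toy_train)

# Predict the class for two new, unseen (hypothetical) players
toy_new <- tibble(Age = c(20, 33), total_sessions = c(11, 1))
toy_predictions <- predict(toy_fit, new_data = toy_new)

toy_predictions
```

Swapping in a different classifier only requires changing `toy_spec`; the recipe, workflow, and prediction steps stay the same.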

### Evaluation

- We measured how many of the predictions were correct using **accuracy**.
- The accuracy tells us what percentage of players the model could correctly predict as subscribed or not.

This analysis helps us understand whether age and play frequency are useful indicators of newsletter interest.
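
Accuracy is easiest to interpret next to a simple baseline. Since 73.5% of players in the full dataset subscribed, a model that always predicts "subscribed" would already be right about 73.5% of the time on that data, so a useful model should do better than this. The sketch below computes both numbers on a small made-up table of true and predicted labels (`toy_results`, hypothetical) using `accuracy()` from yardstick.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical true labels and model predictions for 8 test-set players
toy_results <- tibble(
  subscribe   = factor(c("yes", "yes", "no", "yes", "no", "yes", "yes", "no"),
                       levels = c("no", "yes")),
  .pred_class = factor(c("yes", "yes", "yes", "yes", "no", "no", "yes", "no"),
                       levels = c("no", "yes"))
)

# Proportion of predictions that match the true label
toy_accuracy <- toy_results |>
  accuracy(truth = subscribe, estimate = .pred_class)

# Baseline: accuracy of always predicting the most common class
majority_rate <- as.numeric(max(table(toy_results$subscribe))) / nrow(toy_results)

toy_accuracy
majority_rate
```

In this toy example the model gets 6 of 8 predictions right (0.75), while always guessing the majority class would get 5 of 8 (0.625).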


In [None]:
# Player ages and subscription status, dropping rows with missing values
players_clean <- players |>
  select(Age, subscribe) |>
  drop_na()

# Count the number of recorded sessions for each player
session_counts <- sessions |>
  group_by(hashedEmail) |>
  summarize(total_sessions = n())

# Keep only players who appear in the session data
players_filtered <- players |>
  filter(hashedEmail %in% session_counts$hashedEmail)

players_summary <- players_filtered |>
  select(hashedEmail, Age, subscribe)

sessions_summary <- session_counts |>
  select(hashedEmail, total_sessions)

# Player IDs present in both tables
shared_ids <- intersect(players_summary$hashedEmail, sessions_summary$hashedEmail)

# Restrict both tables to the shared IDs
analysis_players <- players_summary |>
  filter(hashedEmail %in% shared_ids)

analysis_sessions <- sessions_summary |>
  filter(hashedEmail %in% shared_ids)