# Project Final Report
## Predicting Newsletter Subscription from Player Characteristics 
  
#### **Names:**
1. Carlos Saliba
2. Simon San
3. Ni Made Chandra Sriwijaya Putri
4. Maxwell Wong

#### **Course:** DSCI 100_004_Group 5  

This project explores whether player characteristics and behaviors can predict whether a player subscribes to the Minecraft research newsletter.

## Introduction

### Background Information

Understanding player behaviour is essential for designing effective communication strategies in online gaming communities. In this project, we analyze data collected from a Minecraft research server developed by a UBC Computer Science research group led by Frank Wood. The server records players’ actions and demographic information, creating opportunities to study how different types of players engage with the game environment.

One challenge faced by the research team is efficiently recruiting and maintaining participants for their studies. One possible indicator of engagement is whether players subscribe to the project’s game-related newsletter. Identifying which player characteristics predict newsletter subscription can help the research team better allocate resources such as server capacity and outreach efforts.

### Questions to be Answered 

_What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ across various player types?_

More specifically, we examine whether attributes such as hours played and age (from `players.csv`) help predict whether a player subscribes to the newsletter.

### Data Description

#### players.csv
- **Number of observations: 196**
- **Number of variables: 7**
- **Description:** Contains player demographic and behavioral information.

| Variable | Type | Description |
|----------|------|-------------|
| `experience` | Character | Player's experience level: Amateur, Beginner, Regular, Veteran, or Pro |
| `subscribe` | Logical | Whether the player is subscribed to the game-related newsletter. TRUE = subscribed, FALSE = not subscribed |
| `hashedEmail` | Character | Hashed email address, used to link players across datasets such as `sessions.csv` while protecting privacy |
| `played_hours` | Numeric | Amount of server play time in hours |
| `name` | Character | Name of player |
| `gender` | Character | Gender of player (e.g. Male, Female, Other) |
| `Age` | Numeric | Age of player in years |


#### Summary Statistics for Data Description

library(tidyverse)

#preview data
players_url <- "https://raw.githubusercontent.com/NiMadeChandra/Final-Project-DSCI-100-004_Group-5/refs/heads/main/players.csv"
players <- read_csv(players_url)
head(players)

#total observations (players) and variables
total_obs <- nrow(players)
total_var <- ncol(players)
paste("Total observations in players data:", total_obs)
paste("Total variables in players data:", total_var)

#subscription summary stats (stored in experience_sum)
experience_sum <- players |>
  summarize(
    `Subscribe` = sum(subscribe == TRUE, na.rm = TRUE),
    `Not Subscribe` = sum(subscribe == FALSE, na.rm = TRUE),
    `Total Observation` = n(),
    `Percentage of Subscribers` = `Subscribe` / `Total Observation`* 100,
    `Percentage of Non Subscribers` = `Not Subscribe` / `Total Observation` * 100) |>
  round(2)

#played hours summary stats
played_hours_sum <- players |>
  summarize(
    `Minimum` = min(played_hours, na.rm = TRUE),
    `Q1` = quantile(played_hours, 0.25, na.rm = TRUE),
    `Median` = median(played_hours, na.rm = TRUE),
    `Mean` = mean(played_hours, na.rm = TRUE),
    `Q3` = quantile(played_hours, 0.75, na.rm = TRUE),
    `Maximum` = max(played_hours, na.rm = TRUE)
  ) |>
  round(3)

#age summary stats
age_sum <- players |>
  summarize(
    `Minimum` = min(Age, na.rm = TRUE),
    `Q1` = quantile(Age, 0.25, na.rm = TRUE),
    `Median` = median(Age, na.rm = TRUE),
    `Mean` = mean(Age, na.rm = TRUE),
    `Q3` = quantile(Age, 0.75, na.rm = TRUE),
    `Maximum` = max(Age, na.rm = TRUE)
  ) |>
  round(2)

#gender summary stats
gender_sum <- players |>
  mutate(gender = as.factor(gender)) |>
  summarize(
    Female = sum(gender == "Female", na.rm = TRUE),
    Male = sum(gender == "Male", na.rm = TRUE),
    Other = sum(gender == "Other", na.rm = TRUE),
    Total = n()
  ) |>
  mutate(
    `Percentage Female` = Female / Total * 100,
    `Percentage Male` = Male / Total * 100,
    `Percentage Other` = Other / Total * 100
  ) |>
  round(2)

cat("\n \nSummary of Player Subscription\n")
list(experience_sum)

cat("\n Summary of Played Hours\n")
list(played_hours_sum)

cat("\n Summary of Player Age\n")
list(age_sum)

cat("\n Summary of Player Gender\n")
list(gender_sum)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|------------|-----------|-------------|--------------|------|--------|-----|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |



 
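Summary of Player Subscription

| Subscribe | Not Subscribe | Total Observation | Percentage of Subscribers | Percentage of Non Subscribers |
|-----------|---------------|-------------------|---------------------------|-------------------------------|
| 144 | 52 | 196 | 73.47 | 26.53 |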



Summary of Played Hours

| Minimum | Q1 | Median | Mean | Q3 | Maximum |
|---------|----|--------|------|----|---------|
| 0 | 0 | 0.1 | 5.846 | 0.6 | 223.1 |



Summary of Player Age

| Minimum | Q1 | Median | Mean | Q3 | Maximum |
|---------|----|--------|------|----|---------|
| 9 | 17 | 19 | 21.14 | 22.75 | 58 |



Summary of Player Gender

| Female | Male | Other | Total | Percentage Female | Percentage Male | Percentage Other |
|--------|------|-------|-------|-------------------|-----------------|------------------|
| 37 | 124 | 1 | 196 | 18.88 | 63.27 | 0.51 |


#### Potential Issues
- `played_hours` has a median of 0.1 hours and a maximum of 223.1 hours, so its distribution is strongly right-skewed, which may affect our classification.
- Two players are missing `Age` values (NA).
- Participation may be biased: `gender` is imbalanced (63.27% of players are male), and most players (73.47%) are subscribed.
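The issues listed above can be checked directly in R. The sketch below is only illustrative: `players_toy` is a hypothetical tibble with invented values that mimics a few columns of `players.csv`, and in the notebook the real `players` data frame would take its place.

```r
library(tidyverse)

# Toy stand-in for players.csv; every value here is invented for illustration.
players_toy <- tibble(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  played_hours = c(0, 0.1, 0, 0.6, 40, 1.5),
  gender       = c("Male", "Female", "Male", "Other", "Male", "Female"),
  Age          = c(17, 21, NA, 19, 9, NA)
)

# Count the missing values in every column.
missing_counts <- players_toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# A mean far above the median is a sign of right skew in played_hours.
skew_check <- players_toy |>
  summarize(mean_hours = mean(played_hours), median_hours = median(played_hours))

# count() lists every gender category that occurs, so none is silently dropped.
gender_counts <- players_toy |>
  count(gender, name = "n_players") |>
  mutate(percent = n_players / sum(n_players) * 100)

# Class balance of the outcome variable.
subscribe_counts <- players_toy |>
  count(subscribe, name = "n_players")
```

Using `count()` rather than summing hand-picked categories guarantees that the percentages add up to 100.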

## Methods 

To investigate our question, we performed a full data analysis workflow in a Jupyter Notebook. Our methods, supported by the accompanying code, follow the steps described below.

### Data Import and Cleaning
- We began by loading the dataset (players.csv) directly into our notebook.
- After examining its structure, we removed missing or inconsistent values and standardised any variables that required formatting.
- We also confirmed that key variables such as age, hours played, and newsletter subscription status were correctly encoded.
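As a minimal sketch of the cleaning steps above, assuming the only problems are missing predictor values and a logical outcome column, the code below drops incomplete rows and converts `subscribe` to a factor. `players_toy` and its values are hypothetical stand-ins for the real `players` data.

```r
library(tidyverse)

# Toy stand-in for players.csv; values are invented for illustration.
players_toy <- tibble(
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  played_hours = c(30.3, 0, 0.7, 0.1, 1.2),
  Age          = c(9, 17, NA, 21, 25)
)

players_clean <- players_toy |>
  drop_na(Age, played_hours) |>               # remove rows with missing predictors
  mutate(subscribe = as.factor(subscribe)) |> # classifiers expect a factor outcome
  select(subscribe, played_hours, Age)        # keep only the variables we model

players_clean
```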

### Exploratory Data Analysis
Next, we summarised and visualised the dataset to understand patterns and distributions. This included:
- Plotting histograms of age and hours played
- Creating bar charts for subscription rates
- Comparing characteristics among different player types.

These visualisations helped us identify relationships worth testing further.
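The plots described above could be produced along the following lines with `ggplot2`. This is a sketch on a hypothetical `players_toy` tibble with invented values; the bin width, labels, and object names are our own choices for illustration.

```r
library(tidyverse)

# Toy stand-in for players.csv; values are invented for illustration.
players_toy <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Amateur", "Beginner"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  played_hours = c(30.3, 3.8, 0, 0.7, 0.1, 0),
  Age          = c(9, 17, 17, 21, 21, 17)
)

# Histogram of age, coloured by subscription status.
age_hist <- ggplot(players_toy, aes(x = Age, fill = subscribe)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Age (years)", y = "Number of players", fill = "Subscribed")

# Bar chart of the subscription rate within each experience level.
subscribe_bar <- ggplot(players_toy, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(x = "Experience level", y = "Proportion of players", fill = "Subscribed")

age_hist
subscribe_bar
```

With `position = "fill"`, each bar is scaled to a height of 1, which makes subscription rates comparable across experience levels of different sizes.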

### Data Splitting
To build a predictive model, we split the dataset into:
- A training set used to fit our model
- A test set used to evaluate model performance on unseen data

This ensured a fair assessment of how well the model generalises.
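One way to perform such a split is with `initial_split()` from the `rsample` package (loaded by `tidymodels`). In the sketch below, the 75/25 proportion, the seed, and the randomly generated `players_toy` data are all assumptions made for illustration.

```r
library(tidymodels)

set.seed(2024) # make the random split reproducible

# Synthetic stand-in for the cleaned players data (random values, illustration only).
players_toy <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(30, 10))),
  played_hours = rexp(40, rate = 0.2),
  Age          = round(runif(40, min = 9, max = 58))
)

# Stratify on the outcome so both sets keep a similar share of subscribers.
players_split <- initial_split(players_toy, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)
```

Stratifying matters here because subscribers heavily outnumber non-subscribers in the data.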

### Building the Predictive Model
We used a **classification model** to predict whether a player subscribes to the newsletter.
The model used player characteristics as predictors, including:
- Hours played
- Age
- Player type
- Any additional relevant behavioural features available in the dataset.

We trained the model using cross-validation to select optimal hyperparameters and prevent overfitting.
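The text above does not name the classifier, so the following sketch assumes K-nearest neighbours, tuned with 5-fold cross-validation in `tidymodels` (this requires the `kknn` package). The synthetic `players_toy` data, the grid of `neighbors` values, and the choice of two predictors are illustrative assumptions, not the settings used in the report.

```r
library(tidymodels)

set.seed(2024)

# Synthetic stand-in for the training data (random values, illustration only).
players_toy <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(60, 20))),
  played_hours = rexp(80, rate = 0.2),
  Age          = round(runif(80, min = 9, max = 58))
)

# Standardise predictors so neither dominates the distance calculation.
players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_toy) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN specification with the number of neighbours left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

players_vfold <- vfold_cv(players_toy, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Cross-validated accuracy for each candidate value of neighbors.
knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the value with the highest mean cross-validated accuracy.
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
```

Because the toy predictors are random, the selected `best_k` carries no meaning; the point is only the shape of the tuning workflow.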

### Model Evaluation
We evaluated model performance using metrics such as:
- Accuracy
- Confusion matrices
- Classification error.

We then used the test set to estimate how accurately the model predicts subscription status for new, unseen players.
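The evaluation step could look like the sketch below, which computes accuracy, classification error, and a confusion matrix on a held-out test set with `yardstick`. It again assumes a K-nearest neighbours model (with a hypothetical `neighbors = 5`) and synthetic `players_toy` data, so the resulting numbers are meaningless beyond showing the mechanics.

```r
library(tidymodels)

set.seed(2024)

# Synthetic stand-in for the cleaned players data (random values, illustration only).
players_toy <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(60, 20))),
  played_hours = rexp(80, rate = 0.2),
  Age          = round(runif(80, min = 9, max = 58))
)

players_split <- initial_split(players_toy, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# neighbors = 5 is a placeholder for the value chosen by cross-validation.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_train)

# Predict on the test set and attach the true labels.
test_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

test_accuracy <- test_predictions |>
  accuracy(truth = subscribe, estimate = .pred_class)

# Classification error is the complement of accuracy.
test_error <- 1 - test_accuracy$.estimate

test_conf_mat <- test_predictions |>
  conf_mat(truth = subscribe, estimate = .pred_class)
```

With roughly 73% of players subscribed, a model that always predicts "subscribed" already reaches about 73% accuracy, so the confusion matrix is worth reading alongside the accuracy figure.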

### Key Findings
- From the results of our model and exploratory analyses, we identified which player characteristics contribute most strongly to predicting newsletter subscription.
- We also compared these characteristics across different player types to understand behavioural differences in the game environment.