# Project Final Report
## Predicting Newsletter Subscription from Player Characteristics 
  
#### **Names:**
1. Carlos Saliba
2. Simon San
3. Ni Made Chandra Sriwijaya Putri
4. Maxwell Wong

#### **Course:** DSCI 100_004_Group 5  

This project explores whether player characteristics and behaviors can predict whether a player subscribes to the Minecraft research newsletter.

## Introduction

Understanding player behaviour is essential for designing effective communication strategies in online gaming communities. In this project, we analyze data collected from a Minecraft research server developed by a UBC Computer Science research group led by Frank Wood. The server records players’ actions and demographic information, creating opportunities to study how different types of players engage with the game environment.

One challenge faced by the research team is efficiently recruiting and maintaining participants for their studies. One possible indicator of engagement is whether players subscribe to the project’s game-related newsletter. Identifying which player characteristics predict newsletter subscription can help the research team better allocate resources such as server capacity and outreach efforts.

##### In this project, we aim to answer the following question:

_What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ across various player types?_

More specifically, we examine whether attributes such as hours of gameplay and age (from `players.csv`) help predict whether a player subscribes to the newsletter.

### Data Description

#### players.csv
This dataset contains information about the players and consists of 7 variables:

| Variable | Type | Description |
|----------|------|------------|
|`experience`| Character | Player's experience categorized as Amateur, Beginner, Regular, Veteran, or Pro|
|`subscribe`| Logical | Whether the player is subscribed to the game-related newsletter |
|`hashedEmail` | Character | Hashed email address, used to match players across files such as `sessions.csv` while protecting privacy |
|`played_hours`| Numeric | Amount of server play time in hours |
|`name`| Character | Name of player |
|`gender`| Character | Gender of player |
|`Age`| Numeric | Age of player in years |

#### Summary Statistics

In [1]:
library(tidyverse)

#preview data
players_url <- "https://raw.githubusercontent.com/NiMadeChandra/Final-Project-DSCI-100-004_Group-5/refs/heads/main/players.csv"
players <- read_csv(players_url)
head(players)

#total observations/players
total_obs <- nrow(players)
paste("This dataset contains", total_obs, "total observations or players.")

#subscription summary stats (subscribe is a logical column, so it can be summed directly)
subscribe_sum <- players |>
    summarize(sub = sum(subscribe, na.rm = TRUE),
              notsub = sum(!subscribe, na.rm = TRUE),
              pct_sub = sub/total_obs * 100,
              pct_notsub = notsub/total_obs * 100) |>
    round(2)
paste("The percentage of players subscribed was", subscribe_sum$pct_sub, "%, while the percentage of players not subscribed was", subscribe_sum$pct_notsub, "%.")

#played hours summary stats (summary() reports NA counts itself, so na.rm is not needed)
played_hours_sum <- summary(players$played_hours) |>
    format(digits = 3)
paste("Summary statistics of played_hours:")
played_hours_sum

#age summary stats
age_sum <- summary(players$Age) |>
    format(digits = 2)
paste("Summary statistics of Age:")
age_sum

#gender summary stats
gender_sum <- players |>
    mutate(gender = as.factor(gender)) |>
    group_by(gender) |>
    summarize(percentage_gender = n()/total_obs*100) |>
    mutate(percentage_gender = round(percentage_gender, 2)) |>
    arrange(desc(percentage_gender))
paste("Percentage of different gender identities of players:")
gender_sum

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 196 Columns: 7


| experience `<chr>` | subscribe `<lgl>` | hashedEmail `<chr>` | played_hours `<dbl>` | name `<chr>` | gender `<chr>` | Age `<dbl>` |
|---|---|---|---|---|---|---|
| Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


| gender `<fct>` | percentage_gender `<dbl>` |
|---|---|
| Male | 63.27 |
| Female | 18.88 |
| Non-binary | 7.65 |
| Prefer not to say | 5.61 |
| Two-Spirited | 3.06 |
| Agender | 1.02 |
| Other | 0.51 |


#### Potential Issues
- `played_hours` has a median of 0.10 and a maximum of 223.10, suggesting that its distribution is heavily right-skewed, which may affect our classification.
- Two players are missing a value for `Age` (NA).
- There may be participation bias: the `gender` of players is imbalanced, and a large share of participants are subscribed.

## Methods 

To investigate our question, we performed a full data analysis workflow in a Jupyter Notebook. Our methods, supported by the accompanying code, follow the steps described below.

### Data Import and Cleaning
- We began by loading the dataset (players.csv) directly into our notebook.
- After examining its structure, we removed missing or inconsistent values and standardised any variables that required formatting.
- We also confirmed that key variables such as age, hours played, and newsletter subscription status were correctly encoded.
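
The cleaning steps above can be sketched as follows. This is a minimal illustration rather than our exact notebook code: it uses a small toy tibble (`players_toy`, with hypothetical values) in place of `players.csv`, and it assumes that "cleaning" means dropping rows with missing predictors and converting the categorical columns to factors.

```r
library(tidyverse)

# Toy stand-in for players.csv (hypothetical values, same column names)
players_toy <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  played_hours = c(30.3, 3.8, 0.0, 0.1, 1.2),
  Age          = c(9, 17, NA, 21, 25)
)

players_clean <- players_toy |>
  drop_na(Age, played_hours) |>               # remove rows with missing predictors
  mutate(subscribe  = as.factor(subscribe),   # logical -> factor for classification
         experience = as.factor(experience))  # character -> factor

# Confirm the column types and that no missing ages remain
glimpse(players_clean)
sum(is.na(players_clean$Age))
```

Converting `subscribe` to a factor matters because classification tools in R generally expect a categorical response rather than a logical one.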

### Exploratory Data Analysis
Next, we summarised and visualised the dataset to understand patterns and distributions. This included:
- Plotting histograms of age and hours played
- Creating bar charts for subscription rates
- Comparing characteristics among different player types

These visualisations helped us identify relationships worth testing further.
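
The plots described in this section could be produced along these lines. The sketch below runs on randomly generated toy data (`players_toy`, hypothetical values), and it assumes that "player type" corresponds to the `experience` column; bin widths and labels are illustrative choices.

```r
library(tidyverse)

set.seed(1)
# Toy stand-in for players.csv (randomly generated, hypothetical values)
players_toy <- tibble(
  Age          = sample(9:50, 60, replace = TRUE),
  played_hours = round(rexp(60, rate = 0.5), 1),
  experience   = sample(c("Amateur", "Beginner", "Regular", "Veteran", "Pro"),
                        60, replace = TRUE),
  subscribe    = sample(c(TRUE, FALSE), 60, replace = TRUE, prob = c(0.7, 0.3))
)

# Histogram of age
age_hist <- ggplot(players_toy, aes(x = Age)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Age (years)", y = "Number of players")

# Histogram of hours played
hours_hist <- ggplot(players_toy, aes(x = played_hours)) +
  geom_histogram(bins = 30) +
  labs(x = "Hours played", y = "Number of players")

# Proportion of subscribers within each experience level
sub_by_exp <- ggplot(players_toy, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  labs(x = "Experience level", y = "Proportion of players", fill = "Subscribed")

age_hist
hours_hist
sub_by_exp
```

Using `position = "fill"` shows subscription as a proportion within each group, which makes groups of different sizes easier to compare than raw counts.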

### Data Splitting
To build a predictive model, we split the dataset into:
- A training set used to fit our model
- A test set used to evaluate model performance on unseen data

This ensured a fair assessment of how well the model generalises.
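
One way to perform such a split is shown below. This sketch assumes the `tidymodels` package and a 75/25 split stratified on `subscribe`; neither the library nor the proportion is stated in this report, and the data are a randomly generated toy tibble (`players_toy`).

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # make the random split reproducible
# Toy stand-in for the cleaned players data (hypothetical values)
players_toy <- tibble(
  Age          = sample(9:50, 100, replace = TRUE),
  played_hours = round(rexp(100, rate = 0.5), 1),
  subscribe    = factor(sample(c("TRUE", "FALSE"), 100,
                               replace = TRUE, prob = c(0.7, 0.3)))
)

# Assumed proportion: 75% training, 25% testing, stratified by the response
players_split <- initial_split(players_toy, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

nrow(players_train)
nrow(players_test)
```

Stratifying on `subscribe` keeps the share of subscribers roughly equal in both sets, which is useful when the classes are imbalanced.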

### Building the Predictive Model
We used a **classification model** to predict whether a player subscribes to the newsletter.
The model used player characteristics as predictors, including:
- hours played
- age
- player type
- Any additional relevant behavioural features available in the dataset.

We trained the model using cross-validation to select optimal hyperparameters and prevent overfitting.
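
As an illustration of tuning a classifier with cross-validation, the sketch below uses K-nearest neighbours from `tidymodels` (with the `kknn` engine installed). K-nearest neighbours is an assumption made for this example, since the report only specifies "a classification model"; the toy data (`train_toy`), the 5 folds, and the grid of `neighbors` values are likewise hypothetical.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)
# Toy stand-in for the training set (randomly generated, hypothetical values)
train_toy <- tibble(
  Age          = sample(9:50, 150, replace = TRUE),
  played_hours = round(rexp(150, rate = 0.5), 1),
  subscribe    = factor(sample(c("TRUE", "FALSE"), 150,
                               replace = TRUE, prob = c(0.7, 0.3)))
)

# Standardise predictors so both contribute equally to distance calculations
players_recipe <- recipe(subscribe ~ Age + played_hours, data = train_toy) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN specification with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training data only
players_vfold <- vfold_cv(train_toy, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the number of neighbours with the highest mean cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

Because the toy response is generated at random, the selected value of `best_k` carries no meaning here; the point is the workflow, in which the test set plays no part in choosing the hyperparameter.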

### Model Evaluation
We evaluated model performance using metrics such as:
- Accuracy
- Confusion matrices
- Classification error.

We then used the test set to estimate how accurately the model predicts subscription status for new, unseen players.
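
These metrics can be computed as sketched below, assuming the `yardstick` functions that ship with `tidymodels`. The eight predictions in `preds_toy` are hypothetical and stand in for the output of a fitted model on the test set.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical test-set results: true labels and predicted classes
preds_toy <- tibble(
  subscribe   = factor(c("TRUE", "TRUE", "TRUE", "FALSE",
                         "FALSE", "TRUE", "FALSE", "TRUE"),
                       levels = c("FALSE", "TRUE")),
  .pred_class = factor(c("TRUE", "TRUE", "FALSE", "FALSE",
                         "TRUE", "TRUE", "FALSE", "TRUE"),
                       levels = c("FALSE", "TRUE"))
)

# Accuracy: share of predictions that match the true label
test_accuracy <- preds_toy |>
  accuracy(truth = subscribe, estimate = .pred_class) |>
  pull(.estimate)

# Confusion matrix: rows are predictions, columns are true labels
test_conf_mat <- preds_toy |>
  conf_mat(truth = subscribe, estimate = .pred_class)

# Classification error is the complement of accuracy
classification_error <- 1 - test_accuracy

test_accuracy
test_conf_mat
classification_error
```

In this toy example 6 of the 8 predictions are correct, so the accuracy is 0.75 and the classification error is 0.25. The confusion matrix shows which kind of mistake is being made, which matters when most players are subscribers and a model could reach high accuracy by always predicting `TRUE`.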

### Key Findings
- From the results of our model and exploratory analyses, we identified which player characteristics contribute most strongly to predicting newsletter subscription.
- We also compared these characteristics across different player types to understand behavioural differences in the game environment.