# Data Science Project Final Report
**By: Ricky Shi, Evan Chen, Mari Yasui, Shon Hoang**

---
# Introduction
### Data Description


This project explores player behavior using two datasets, `players.csv` and `sessions.csv`, collected by a UBC Computer Science research group that ran a Minecraft server. As players joined and interacted with the world, the server recorded information about each of them: age, name, gender, total hours played, experience level, and whether they subscribed to a game-related newsletter. Together, the data record player demographics and in-game activity. Our analysis primarily uses `players.csv`, as it includes both the predictors and the response variable.

| Dataset | Rows | Columns | Description |
|----------|------|----------|--------------|
| `players.csv` | 196 | 7 | Contains hashed emails (acts as player IDs) and variables such as age, gender, total playtime (hours), and newsletter subscription status |
| `sessions.csv` | 1535 | 5 | Contains session-level data, including hashed emails, start and end times, and timestamp equivalents |

**Potential Issues**
- Missing age for two players reduces usable data
- Extreme playtime values (e.g., from idle sessions) may overstate how much a player actually played
- Subscription classes are imbalanced (many more “Yes” than “No”)
- Predictors use different scales and must be standardized
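
The issues listed above can each be checked with a few lines of base R. The sketch below uses a small made-up data frame (`toy_players` is hypothetical and only mirrors the column names of `players.csv`); the same calls could be run on the real `players` data once it is loaded.

```r
# Toy stand-in for players.csv (values are made up for illustration only)
toy_players <- data.frame(
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE),
  Age          = c(17, 21, NA, 19, 47, 22, NA, 25),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 150.0, 1.6)
)

# Count missing values in each column
missing_counts <- colSums(is.na(toy_players))

# Share of each class in the response variable
class_props <- prop.table(table(toy_players$subscribe))

# Raw spread (max - min) of the two predictors, ignoring missing values
predictor_ranges <- sapply(
  toy_players[c("Age", "played_hours")],
  function(x) diff(range(x, na.rm = TRUE))
)

missing_counts
class_props
predictor_ranges
```

A large gap between the two entries of `predictor_ranges` is the signal that standardization is needed before any distance-based method.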

In [3]:
library(tidyverse)
library(tidymodels)


## Data Summary

In [4]:
players <- read_csv("https://github.com/FunkyMonkey245/dsci_project_planning_individual/raw/refs/heads/main/data/players.csv", show_col_types = FALSE)

sessions <- read_csv("https://github.com/FunkyMonkey245/dsci_project_planning_individual/raw/refs/heads/main/data/sessions.csv", show_col_types = FALSE)

players_summary <- players |>
    summarize(
        total_players = n(),
        subscribed_players = sum(subscribe == TRUE, na.rm = TRUE),
        unsubscribed_players = sum(subscribe == FALSE, na.rm = TRUE),
        subscribed_percent = round(100 * subscribed_players / total_players, 2),
        age_mean = round(mean(Age, na.rm = TRUE), 2),
        age_min  = min(Age, na.rm = TRUE),
        age_max  = max(Age, na.rm = TRUE),
        hours_mean = round(mean(played_hours, na.rm = TRUE), 2),
        hours_min  = min(played_hours, na.rm = TRUE),
        hours_max  = max(played_hours, na.rm = TRUE)
      )

sessions_summary <- sessions |>
    count(hashedEmail, name = "session_count") |>
    summarize(
        total_players = n(),
        total_sessions = sum(session_count),
        avg_sessions_per_player = mean(session_count),
        lowest_session_count = min(session_count),
        most_session_count = max(session_count)
      )
glimpse(players)
glimpse(sessions)
players_summary
sessions_summary

Rows: 196
Columns: 7
$ experience   <chr> "Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amate…
$ subscribe    <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, T…
$ hashedEmail  <chr> "f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8…
$ played_hours <dbl> 30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 1.6, 0…
$ name         <chr> "Morgan", "Christian", "Blake", "Flora", "Kylie", "Adrian…
$ gender       <chr> "Male", "Male", "Male", "Female", "Male", "Female", "Fema…
$ Age          <dbl> 9, 17, 17, 21, 21, 17, 19, 21, 47, 22, 23, 17, 25, 22, 17…
Rows: 1,535
Columns: 5
$ hashedEmail         <chr> "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8a…
$ start_time          <chr> "30/06/2024 18:12", "17/06/2024 23:33", "25/07/202…
$ end_time            <chr> "30/06/2024 18:24"

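| total_players | subscribed_players | unsubscribed_players | subscribed_percent | age_mean | age_min | age_max | hours_mean | hours_min | hours_max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 196 | 144 | 52 | 73.47 | 21.14 | 9 | 58 | 5.85 | 0 | 223.1 |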


| total_players | total_sessions | avg_sessions_per_player | lowest_session_count | most_session_count |
| --- | --- | --- | --- | --- |
| `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` |
| 125 | 1535 | 12.28 | 1 | 310 |


**`players.csv` contains 196 players (observations), each described by 7 variables:**

| Variable | Type | Description |
| --- | --- | --- |
| experience | factor | self-reported experience level (Amateur, Veteran, Regular, Beginner, Pro) |
| subscribe | logical | whether the player is subscribed to the newsletter |
| hashedEmail | character | a unique anonymized ID for each player |
| played_hours | double | total hours played by the player |
| name | character | the player's first name |
| gender | factor | the player's gender |
| Age | integer | the player's age |

As loaded by `read_csv`, *experience* and *gender* are character columns and *Age* is a double. The table lists the types intended for analysis: *experience* and *gender* are treated as categorical variables to aid data visualization, and *Age* is treated as an integer.
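
A minimal sketch of these type conversions in base R is shown here, using a small hypothetical data frame (`toy_players`, with made-up rows) rather than the real data. The same three assignments could be applied to `players`, or written with `mutate()` and `as_factor()` in the tidyverse style used elsewhere in this report.

```r
# Hypothetical rows that mirror the column names and raw types of players.csv
toy_players <- data.frame(
  experience = c("Pro", "Veteran", "Amateur", "Regular", "Beginner"),
  gender     = c("Male", "Male", "Female", "Male", "Female"),
  Age        = c(9, 17, 21, 21, 17),
  stringsAsFactors = FALSE
)

# Character columns become factors (categorical variables)
toy_players$experience <- factor(toy_players$experience)
toy_players$gender     <- factor(toy_players$gender)

# Age is stored as a double after reading; convert it to integer
toy_players$Age <- as.integer(toy_players$Age)

str(toy_players)
```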


**Summary Report: `players.csv`**

- Total players: 196
- Subscribed players: 144
- Unsubscribed players: 52
- Percent Subscribed: 73.47%
- Average player age: 21.14 years
- Youngest player: 9 years
- Oldest player: 58 years
- Average total playtime: 5.85 hours
- Lowest playtime: 0 hours
- Highest playtime: 223.10 hours

## Establishing the Question

**Broad Question:**  
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do they differ between various player types?

**Specific Question:**  
Can a player’s age and total playtime predict whether they subscribe to the game’s newsletter? The goal is to determine whether these two variables are useful indicators of subscription.

To answer this question, we use the `players.csv` dataset, which contains one row per player and several columns describing their characteristics. The variables we focus on are:
- `subscribe` (logical): whether the player is subscribed to the newsletter
- `Age` (numeric): the player’s age in years
- `played_hours` (numeric): the total number of hours the player played on the server

## Exploratory Data Analysis and Visualization

- In this section, we use the `players.csv` dataset to summarize the means of the predictor variables and to visualize their relationships with newsletter subscription in several plots.
- Little wrangling is needed, as the dataset is already tidy for this analysis.

In [3]:
# Mean summary of predictor variables
mean_summary <- players |>
    summarize(
        mean_age = round(mean(Age, na.rm = TRUE), 2),
        mean_playtime = round(mean(played_hours, na.rm = TRUE), 2)
    )

mean_summary

| mean_age | mean_playtime |
| --- | --- |
| `<dbl>` | `<dbl>` |
| 21.14 | 5.85 |


**Plots Included**
- Scatterplot of age vs. playtime, coloured by subscription status
- Bar plot comparing subscription counts across experience levels
- Histograms showing the playtime distribution for subscribers vs. non-subscribers

A log10 scale was applied to the playtime axes to compress extreme values and make the clustered data easier to see.
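
Because many players have exactly 0 hours, a log10 scale needs some care: `log10(0)` is `-Inf`, so those players would fall off the axis. A short illustration, using a few playtime values that appear in the summary and `glimpse()` output of this report, shows why adding 1 before taking the log keeps every player on the plot:

```r
# A handful of playtime values (hours) taken from the outputs shown in this report
played_hours <- c(0, 0.1, 1.6, 30.3, 223.1)

# Plain log10: the zero-hour player becomes -Inf and cannot be drawn
plain_log <- log10(played_hours)

# Shifted log10: every value is finite, and 0 hours maps to exactly 0
shifted_log <- log10(played_hours + 1)

plain_log
shifted_log
```

The shift slightly distorts very small values, so the axis label should state that the plotted quantity is playtime plus one hour.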

**Observations**
- Most players recorded under 10 hours of total playtime
- There is a weak negative relationship between age and playtime, where younger players tend to play more
- Subscribed players generally show more total playtime
- Amateurs and veterans have the most subscribers

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5)

# Data is already quite tidy, so changing the variable types is sufficient for this analysis

players <- players |>
    mutate(
        # Logical TRUE/FALSE becomes a factor with readable labels
        subscribe = as_factor(subscribe),
        subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE"),
        experience = as_factor(experience)
    )

# Plot visualizations showing the relationships between multiple variables in the players dataset

players_plot <- ggplot(players, aes(x = Age, y = played_hours + 1, color = subscribe)) +
    geom_point(alpha = 0.7, size = 3) +
    scale_y_log10() +
    labs(x = "Age of Player (years)", y = "Total Playtime + 1 (hours, log scale)", color = "Subscribed") +
    ggtitle("Relationship Between Age, Playtime, and Newsletter Subscription") +
    theme(text = element_text(size = 14))

players_barplot <- ggplot(players, aes(x = experience, fill = subscribe)) +
    geom_bar(position = "dodge") +
    labs(title = "Newsletter Subscription by Experience Level", x = "Experience Level", y = "Number of Players", fill = "Subscribed") +
    theme(text = element_text(size = 14))

# Shift playtime by 1 hour before the log scale so that zero-hour players
# are not dropped (log10(0) is -Inf); this matches the scatterplot above
players_histogram <- ggplot(players, aes(x = played_hours + 1, fill = subscribe)) +
    geom_histogram(bins = 20, alpha = 0.7, position = "identity") +
    facet_grid(rows = vars(subscribe)) +
    scale_x_log10() +
    labs(title = "Distribution of Total Playtime by Subscription Status", x = "Total Playtime + 1 (hours, log scale)", y = "Number of Players", fill = "Subscribed") +
    theme(text = element_text(size = 14))

players_barplot
players_histogram
players_plot


## Methods and Plan

We will use a k-NN classification model to predict subscription status using age and playtime.

**Why is this method appropriate?**

The research question is a binary classification problem (predicting subscription status [TRUE/FALSE]). The model predicts the outcome for a new player by finding the "k" closest players in the training data and taking a majority vote on their subscription status. Unlike many other methods, k-NN makes no assumptions about the underlying distribution of the data, which is an advantage when the relationships might be complex.
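
The majority-vote idea can be written out by hand in a few lines. The sketch below is only an illustration with made-up, already-standardized points; `knn_predict` is a hypothetical helper, not the `kknn` engine used later in this report. It corresponds to an unweighted vote, which is what `weight_func = "rectangular"` requests.

```r
# Hypothetical training points (two standardized predictors) and their labels
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.9,  1.1,
                     1.2,  0.7,
                     0.1, -0.3), ncol = 2, byrow = TRUE)
train_y <- c("No", "No", "Yes", "Yes", "Yes")

knn_predict <- function(new_point, train_x, train_y, k) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; on a tie, the label that sorts first wins
  names(which.max(table(nearest)))
}

pred_high <- knn_predict(c(1.0, 1.0), train_x, train_y, k = 3)
pred_low  <- knn_predict(c(-1.0, -0.4), train_x, train_y, k = 3)
```

Using an odd `k` avoids ties in a two-class problem such as this one.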

**What assumptions are required?**

k-NN relies on distances, so variables with larger ranges, such as `played_hours`, could dominate the distance calculation; the predictors therefore need to be standardized. The performance of k-NN also degrades as the number of predictors increases. With only two predictors this is manageable, but adding more variables from `sessions.csv` may weaken the method. Finally, k-NN assumes that distance in predictor space reflects similarity between players, that is, that the selected variables are relevant to subscription status.
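
To see why standardization matters, consider three made-up players (the values below are hypothetical). Before scaling, the distance between players is driven almost entirely by `played_hours`; after centering and scaling with base R's `scale()`, a large age gap and a large playtime gap count about equally.

```r
# Made-up raw values: Age in years, played_hours in hours
players_raw <- data.frame(
  Age          = c(17, 40, 18),
  played_hours = c(1, 2, 100)
)

# Pairwise Euclidean distances on the raw scale
raw_dist <- as.matrix(dist(players_raw))

# Center each column to mean 0 and scale it to standard deviation 1
players_scaled <- scale(players_raw)
scaled_dist <- as.matrix(dist(players_scaled))

# Player 1 vs player 2 (age gap) and player 1 vs player 3 (playtime gap)
raw_dist[1, c(2, 3)]
scaled_dist[1, c(2, 3)]
```

In the `tidymodels` workflow, the same effect is obtained with `step_scale()` and `step_center()` in the recipe.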

**What are the potential weaknesses or limitations of this model?**

If more data is gathered, k-NN could become computationally expensive and slow, because each new prediction requires computing the distance to every training point. Including irrelevant variables would also add noise to the distance calculation and reduce accuracy.

**How are you going to compare and select the model?**

We will compare candidate models by tuning the number of neighbours "k": 5-fold cross-validation on the training set estimates the accuracy for each candidate value, and the value with the highest estimated accuracy is selected.
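
As a rough illustration of what 5-fold cross-validation does, the sketch below runs it by hand on simulated data using `knn()` from the `class` package (a recommended package distributed with R). Everything here is an assumption made for the example: the data are simulated, the folds are not stratified, and `cv_accuracy` is a hypothetical helper. The report itself relies on `vfold_cv()` and `tune_grid()` from `tidymodels`.

```r
library(class)  # provides knn(); a recommended package distributed with R

set.seed(1)
n <- 100
# Simulated, already-standardized predictors; the label depends loosely on the first one
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + rnorm(n, sd = 0.5) > 0, "Yes", "No"))

# Randomly assign each row to one of 5 equally sized folds
folds <- sample(rep(1:5, length.out = n))

# Mean held-out accuracy across the 5 folds for a given k
cv_accuracy <- function(k) {
  fold_acc <- sapply(1:5, function(f) {
    held_out <- folds == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)
}

k_grid <- 1:10
cv_results <- data.frame(neighbors = k_grid,
                         mean_accuracy = sapply(k_grid, cv_accuracy))

# Candidate k with the highest cross-validated accuracy
best_k <- cv_results$neighbors[which.max(cv_results$mean_accuracy)]
cv_results
```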

**How are you going to process the data to apply this model?**

The `players.csv` dataset will first be split into a training set (70%) and a testing set (30%). The predictors will then be standardized (centered and scaled) so that `played_hours`, which is highly skewed and spans a much wider range than `Age`, does not dominate the distance calculation. Missing values in `Age` can either be imputed with the median age (computed from the training set) or removed by dropping those observations. Finally, the k-NN model will be trained and the optimal "k" selected using 5-fold cross-validation on the training set. The final model with the optimal "k" will be run on the testing set to report its final performance metrics.
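
The point about computing the median from the training set deserves a concrete example: the imputation value should come from the training rows only, then be reused on the testing rows, so that no information leaks from the test set. The sketch below uses a made-up data frame (`toy_players` is hypothetical), a plain random split rather than a stratified one, and the `prop = 0.70` value that appears in the Finding K code.

```r
set.seed(2025)

# Made-up rows with two missing ages, mirroring the issue in players.csv
toy_players <- data.frame(
  Age          = c(17, 21, NA, 19, 47, 22, NA, 25, 17, 23),
  played_hours = c(30.3, 3.8, 0, 0.7, 0.1, 0, 0, 0, 0.1, 1.6)
)

# Simple (unstratified) 70/30 split
n <- nrow(toy_players)
train_idx <- sample(n, size = round(0.70 * n))
train <- toy_players[train_idx, ]
test  <- toy_players[-train_idx, ]

# Median age from the training rows only
train_median_age <- median(train$Age, na.rm = TRUE)

# Apply the same training-set value to both sets
train$Age[is.na(train$Age)] <- train_median_age
test$Age[is.na(test$Age)]   <- train_median_age
```

Inside a `tidymodels` recipe, `step_impute_median()` follows the same logic: the median is estimated when the recipe is trained and then applied to new data.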

**Why this model?**
- It is suitable for categorical prediction using numerical predictors
- It does not require any particular relationship between the predictors and the outcome variables, but only uses nearby datapoints

**Assumptions & Preparation**
- Predictors will be standardized in a recipe using scaling and centering
- Missing ages will be imputed with the median age computed from the training set (or the affected rows will be dropped)

**Limitations**
- Class imbalance (more subscribers) can bias results  
- Extreme points could overly influence distance-based calculations
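
One practical consequence of the class imbalance is that accuracy must be read against a baseline. A classifier that always predicts "Yes" is already right for every subscriber, so the counts from the summary report (144 subscribed, 52 unsubscribed) give the accuracy such a trivial model would reach on the full dataset:

```r
# Counts taken from the players.csv summary report above
n_subscribed   <- 144
n_unsubscribed <- 52

# Accuracy of always predicting the majority class
baseline_accuracy <- max(n_subscribed, n_unsubscribed) /
  (n_subscribed + n_unsubscribed)

baseline_percent <- round(100 * baseline_accuracy, 2)
baseline_percent
```

A tuned k-NN model is only informative if its test accuracy is clearly above this figure. The exact baseline on the test set may differ slightly, although a stratified split keeps the class proportions close to those of the full data.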

**Model Selection & Evaluation**
- We will tune K (1–25) via 5-fold cross-validation using accuracy as the main metric
- Data is split into 70% training, 30% testing
- The K with the highest cross-validation accuracy will be used to fit the final model, and its accuracy on the test data will be used to answer the question
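
The last step of this plan, refitting with the chosen K and scoring on the test set, could look like the sketch below. It is an illustration under stated assumptions: the data are simulated (`sim_players` is not the real dataset), all object names are hypothetical, and the `kknn` package must be installed for the engine. The functions `select_best()`, `finalize_workflow()`, and `last_fit()` come from `tidymodels`.

```r
library(tidymodels)  # the "kknn" engine also requires the kknn package

set.seed(2025)

# Simulated stand-in for the players data (not the real dataset)
sim_players <- tibble(
  Age = round(runif(200, 9, 58)),
  played_hours = rexp(200, rate = 1 / 6),
  subscribe = factor(ifelse(played_hours + rnorm(200, sd = 5) > 3, "Yes", "No"))
)

sim_split <- initial_split(sim_players, prop = 0.70, strata = subscribe)
sim_train <- training(sim_split)

sim_recipe <- recipe(subscribe ~ Age + played_hours, data = sim_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

sim_wkflw <- workflow() |>
  add_recipe(sim_recipe) |>
  add_model(knn_tune_spec)

# Cross-validate every candidate K on the training set
tune_results <- sim_wkflw |>
  tune_grid(resamples = vfold_cv(sim_train, v = 5, strata = subscribe),
            grid = tibble(neighbors = 1:25))

# K with the highest cross-validated accuracy
best_k <- select_best(tune_results, metric = "accuracy")

# Refit on the full training set with that K and evaluate once on the test set
final_fit <- sim_wkflw |>
  finalize_workflow(best_k) |>
  last_fit(sim_split)

test_metrics <- collect_metrics(final_fit)
test_metrics
```

Evaluating on the test set only once, after K has been fixed, keeps the reported accuracy an honest estimate of performance on new players.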

## Finding K

In [1]:
# Load the packages in this cell as well, so that initial_split() and the
# other tidymodels functions are available even after a kernel restart
library(tidyverse)
library(tidymodels)

set.seed(2025)

# `players` is the data frame prepared in the earlier cells
# (subscribe has already been converted to a factor)
players_split <- initial_split(players, prop = 0.70, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Both predictors belong on the right-hand side of the formula, joined by `+`
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  # Fill missing ages with the training-set median
  # (one of the two options named in Methods and Plan)
  step_impute_median(Age) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

players_spec <- nearest_neighbor(weight_func = "rectangular",
                              neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

players_wkflw <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(players_spec)

players_wkflw

# Candidate values of K to evaluate
k_vals <- tibble(neighbors = seq(from = 1, to = 25, by = 1))

# Estimate performance for each K with 5-fold cross-validation
knn_results <- players_wkflw |>
  tune_grid(resamples = players_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

accuracies
Accuracy_graph <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate", title = "Number of Neighbors and their Accuracies") +
  theme(text = element_text(size = 12))

Accuracy_graph
