**<ins>Introduction</ins>**
==
In the gaming industry, understanding player behavior is crucial for designing engagement strategies and improving retention. One common strategy used by game developers is promoting in-game newsletters, which keep players informed about updates, events, and promotions. These newsletters are typically opt-in, meaning players choose whether or not to subscribe. Understanding what kinds of players are more likely to subscribe can help developers better target their communication efforts.

In this project, we aim to explore the question:
**Can a player's playtime and age predict whether they subscribe to the game-related newsletter in <ins>players.csv</ins>?**

By identifying potential patterns in subscription behavior based on quantitative characteristics, this analysis can offer insight into which types of players are more engaged or committed to the game.

To explore this question, we used a dataset titled <ins>players.csv</ins>, which contains information on 196 individuals who play the game. This dataset includes 7 variables:

<ins>players.csv</ins> Dataset
---

| Variable               | Type                | Description  |
|------------------------|---------------------|--------------|
| **experience**         |<chr\>| Player's level of in-game experience.|
| **subscribe**          |<lgl\>| Indicates if player is subscribed to in-game newsletters.|
| **hashedEmail**        |<chr\>| Player's anonymized (hashed) email.|
| **played_hours**       |<dbl\>| Total hours the player has played the game.|
| **name**               |<chr\>| Player's in-game name.|
| **gender**             |<chr\>| Player's gender.|
| **Age**                |<dbl\>| Player's age.|


While most variables are not directly relevant to this project, we focus on the numeric variables ```played_hours``` and ```Age``` as predictors, and ```subscribe``` as the response variable. It's worth noting that the dataset contains some outliers in ```played_hours```, but we chose to keep them because they may represent highly engaged players whose behavior is important for predicting newsletter subscription patterns.
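
The text does not say how the outliers in ```played_hours``` were identified. One common convention is the 1.5 × IQR rule, sketched below in base R. The playtime values here are hypothetical toy numbers, not rows from <ins>players.csv</ins>, and the variable names are illustrative only.

```r
# Toy playtime values in hours (hypothetical, not taken from players.csv)
played_hours <- c(0, 0.1, 0.3, 0.5, 1.2, 2.0, 3.5, 48.4, 150.0)

# First and third quartiles, and the interquartile range between them
q <- quantile(played_hours, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Fences of the 1.5 x IQR rule
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Flag values outside the fences and list them
is_outlier <- played_hours < lower_fence | played_hours > upper_fence
played_hours[is_outlier]
```

Flagging the values rather than dropping them matches the choice made here: the extreme playtimes stay in the data, but they can still be inspected separately.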

**<ins>Methods & Results</ins>**
==

In [None]:
### Run this cell for initialization of packages
library(tidyverse)
library(tidymodels)
library(cowplot)
library(repr)
library(RColorBrewer)
library(gridExtra)
options(repr.matrix.max.rows = 6)
set.seed(4)

In [None]:
set.seed(5)

# Set URLs from the GitHub repository
players_url <- "https://raw.githubusercontent.com/Alonso181006/Individual-Project/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/Alonso181006/Individual-Project/refs/heads/main/sessions.csv"

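# Read the data from the URLs
players_data <- read_csv(players_url) |>
    # Standardize names if the file uses hashedEmail / Age (no-op if already renamed)
    rename(any_of(c(hashed_email = "hashedEmail", age = "Age")))
sessions_data <- read_csv(sessions_url)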

# Count number of sessions for each user
sessions_data_count <- sessions_data |>
    group_by(hashedEmail) |>
    summarize(number_sessions = n())

# Renamed column
session_data_tidy <- sessions_data_count |>
    rename(hashed_email = hashedEmail)

# Combine Two Data Frames
players_data_combine <- players_data |>
    left_join(session_data_tidy, by = "hashed_email")

# Update Data Frame to replace NA values, and set subscribe column as a factor 
players_data_update <- players_data_combine |>
    mutate(number_sessions = replace_na(number_sessions, 0), 
          subscribe = as.factor(subscribe)) 

# Add a column that sets the experience level as a rank from 1-5 (inclusive)
players_data <- players_data_update|>
    mutate(rank = as.integer(factor(experience, 
                                    levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))))
# Display Data Set
players_data

### Computing summary statistics for the quantitative variables (```played_hours```, ```Age```) in <ins>players.csv</ins>

In [None]:
# storing the selected columns in a variable
players_selected <- select(players_data, played_hours, age)

# calculating the mean
players_mean <- players_selected |>
    map_df(mean, na.rm = TRUE) |>
    rename(mean_played_hours = played_hours, mean_age = age)

players_mean

# calculating the median
players_median <- players_selected |>
    map_df(median, na.rm = TRUE) |>
    rename(median_played_hours = played_hours, median_age = age)

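# calculating the mode (base R's mode() returns the storage type, not the most frequent value)
stat_mode <- function(x) {
    x <- x[!is.na(x)]
    unique(x)[which.max(tabulate(match(x, unique(x))))]
}
players_mode <- players_selected |>
    map_df(stat_mode) |>
    rename(mode_played_hours = played_hours, mode_age = age)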

# calculating the Standard Deviation (SD)
players_sd <- players_selected |>
    map_df(sd, na.rm = TRUE) |>
    rename(sd_played_hours = played_hours, sd_age = age)

# calculating the min 
players_min <- players_selected |>
    map_df(min, na.rm = TRUE) |>
    rename(min_played_hours = played_hours, min_age = age)

# calculating the max
players_max <- players_selected |>
    map_df(max, na.rm = TRUE) |>
    rename(max_played_hours = played_hours, max_age = age)

# all summary statistics of played_hours and Age
summary_combined <- bind_cols(players_mean, players_median, players_mode, players_sd, players_max, players_min)
summary_combined
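
The six statistics above can also be gathered into one table with a row per statistic and a column per variable. The following is a minimal base R sketch, assuming a small toy data frame in place of the real columns. The names `stat_mode` and `summary_table` are illustrative and not part of the notebook.

```r
# Toy stand-in for the two numeric columns (hypothetical values)
players_selected <- data.frame(
    played_hours = c(0.0, 0.1, 0.1, 1.5, 30.3),
    age = c(17, 17, 21, 22, NA)
)

# Statistical mode: the most frequent non-missing value (first one wins on ties)
stat_mode <- function(x) {
    x <- x[!is.na(x)]
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
}

# Apply every statistic to every column; the result is a numeric matrix
summary_table <- sapply(players_selected, function(x) {
    c(mean = mean(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      mode = stat_mode(x),
      sd = sd(x, na.rm = TRUE),
      min = min(x, na.rm = TRUE),
      max = max(x, na.rm = TRUE))
})
summary_table
```

A single matrix like this avoids the wide one-row layout produced by `bind_cols()`, which can be harder to read when several statistics are combined.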

In [None]:
# Plot to Visualize Relationship between Subscribe & Played Hours, Experience, & Age

# Bar plot of subscription status counts
subsribe_bar <- players_data |>
    ggplot(aes(x = subscribe, fill = subscribe)) + 
    geom_bar() + 
    labs(x = "Subscription", 
         y = "Subscription Count", 
         fill = "Subscription Status",
         title = "Subscription Count",
         caption = "Figure 1") + 
    scale_fill_brewer(palette = "Paired") +
    theme(text = element_text(size = 15)) + 
    theme(plot.caption = element_text(hjust = 0.5)) 


# Histogram showing the distribution of age, colored by subscription status.
age_by_sub <- players_data |>
    ggplot(aes(x = age, fill = subscribe)) +
    geom_histogram(binwidth = 2) +
    labs(x = "Age (years)", y = "Number of Players", fill = "Subscription Status", caption = "Figure 2") +
    ggtitle("Plot of Age by Subscription Status") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Set2") + 
    theme(plot.caption = element_text(hjust = 0.5)) 

# Histogram showing the distribution of played hours, colored by subscription status.
subscription_plot <- players_data |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram(bins = 30, alpha = 0.7) +
    labs(title = "Plot of Played Hours by Subscription Status", 
         x = "Hours Played", 
         y = "Count", 
         fill = "Subscription Status", 
         caption = "Figure 3") +
    theme(text = element_text(size = 15)) + 
    theme(plot.caption = element_text(hjust = 0.5)) 


# Finding average played hours based on rank and subscription
players_data_avg_hours <- players_data |>
  group_by(rank, subscribe) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  ungroup()

# Visualize average hours against experience level
players_plot_outliers <- players_data_avg_hours |>
    ggplot(aes(x = rank, y = mean_hours, color = subscribe)) +
    geom_line() +
    geom_point(size = 3) +
    labs(x = "Experience Level", 
         y = "Mean Hours Played", 
         color = "Subscription Status", 
         title = "Avg Hrs Played by Rank and Subscription", 
         caption = "Figure 4") +
    theme(text = element_text(size = 15)) + 
    theme(plot.caption = element_text(hjust = 0.5)) 

# Combine plots into a 2x2 grid layout
options(repr.plot.width = 14, repr.plot.height = 10)
plot_grid(subsribe_bar, age_by_sub, subscription_plot, players_plot_outliers, nrow = 2, ncol = 2)

In [None]:
# Set seed to prevent different splits of data
set.seed(5)

# Split data into training (70%) and testing (30%) sets, stratified by subscribe
players_split <- initial_split(players_data, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Create players recipe for Classification
players_recipe <- recipe(subscribe ~ age + played_hours, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Create player KNN model, tuning the number of neighbors
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Perform cross-validation using 5-fold
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# Create a tibble of possible 'neighbors' values 1 to 20
gridvals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

# Setup workflow to test k values with cross-validation
players_wkflw <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    tune_grid(resamples = players_vfold, grid = gridvals)

# Collect and Filter the metrics for evaluation.
players_results <- players_wkflw |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# Pull Best K value
best_k <- players_results |>
    arrange(desc(mean)) |>
    head(1) |>
    pull(neighbors)

# Plot the Accuracy vs Neighbors estimate
players_plot <- players_results |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate", title = "KNN Accuracy vs Number of Neighbors", 
        caption = "Figure 5: Plot to visualize best neighbors") +
    theme(text = element_text(size = 20)) + 
    theme(plot.caption = element_text(hjust = 0.5)) 

#Display Plot
options(repr.plot.width = 10, repr.plot.height = 8)
players_plot

### Finding Optimal K Value ###
Our analysis uses k-nearest neighbours (KNN) classification. Before fitting the final model, we need to choose the value of k that gives the best estimated accuracy. To do this, we split the data into training (70%) and testing (30%) sets, created a recipe that scales and centres both predictors, and set the model's `neighbors` parameter to `tune()` so that multiple values of k could be evaluated. We added the recipe and model to a workflow and performed 5-fold cross-validation on the training set for k from 1 to 20, so that the accuracy estimate for each k is averaged over several validation folds rather than depending on a single split. From this evaluation, we found that 15 neighbours gave the highest estimated accuracy, as visualized in Figure 5 above.
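
To make concrete what the classifier does for one value of k, the following base R sketch classifies a single new player by hand: it standardizes both predictors, measures Euclidean distance to every training player, and takes a majority vote among the k closest. The data, the `new_player` values, and the helper `knn_predict` are all hypothetical and only illustrate the idea; the analysis itself relies on the `kknn` engine through tidymodels.

```r
# Toy training data (hypothetical values, not from players.csv)
train <- data.frame(
    age = c(15, 17, 18, 20, 22, 25, 30, 35),
    played_hours = c(10.0, 8.0, 12.0, 9.0, 0.5, 0.2, 0.1, 0.3),
    subscribe = factor(c("TRUE", "TRUE", "TRUE", "TRUE",
                         "FALSE", "FALSE", "FALSE", "FALSE"))
)
new_player <- data.frame(age = 19, played_hours = 11.0)
predictors <- c("age", "played_hours")

# Standardize with the training mean and sd, similar in spirit to
# step_center() and step_scale() in the recipe
centers <- colMeans(train[predictors])
scales <- sapply(train[predictors], sd)
train_std <- scale(train[predictors], center = centers, scale = scales)
new_std <- scale(new_player[predictors], center = centers, scale = scales)

# Euclidean distance from the new player to every training player
distances <- sqrt(rowSums(sweep(train_std, 2, new_std[1, ])^2))

# Majority vote among the k closest training players
knn_predict <- function(k) {
    nearest <- order(distances)[seq_len(k)]
    votes <- table(train$subscribe[nearest])
    names(votes)[which.max(votes)]
}

knn_predict(3)
```

Standardizing matters here because age and played hours are on different scales; without it, the variable with the larger spread would dominate the distance.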

In [None]:
# Set seed 
set.seed(5)

# Re-create player model using the optimal k value
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 15) |>
    set_engine("kknn") |>
    set_mode("classification") 

# Fit the workflow to the player training data
players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit(data = players_train)

# Predict the subscription of players in test set
players_summary <- players_fit |>
    predict(players_test) |>
    bind_cols(players_test) 

# Compute & Display accuracy of the KNN classification on the test dataset
players_accuracy <- players_summary |>
    metrics(truth = subscribe, estimate = .pred_class) |>
    filter(.metric == "accuracy")
players_accuracy

# Compute & Display precision of the KNN classification on the test dataset
players_precision <- players_summary |>
    precision(truth = subscribe, estimate = .pred_class, event_level = "second")
players_precision

# Compute & Display recall of the KNN classification on the test dataset
players_recall <- players_summary |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "second")
players_recall

# Compute & Display confusion matrix of the KNN classification on the test dataset
players_matrix <- players_summary |>
    conf_mat(truth = subscribe, estimate = .pred_class)
players_matrix
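
A note on `event_level = "second"` in the precision and recall calls: yardstick treats the first factor level as the event of interest by default, and converting a logical column with `as.factor()` orders the levels as `FALSE` then `TRUE`. The short base R check below, using a hypothetical three-value vector, shows why the second level is the one that stands for "subscribed".

```r
# Hypothetical logical values standing in for the subscribe column
subscribe <- as.factor(c(TRUE, FALSE, TRUE))

# Levels are sorted, so "FALSE" comes first and "TRUE" second
levels(subscribe)
```

With the default setting, precision and recall would instead describe how well the model identifies non-subscribers.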

In [None]:
# create the grid of age/played_hours values, and arrange in a data frame
age_grid <- seq(8,
                50,
                length.out = 100)
hours_grid <- seq(min(players_data$played_hours),
                  max(players_data$played_hours),
                  length.out = 100)
asgrid <- as_tibble(expand.grid(age = age_grid,
                                played_hours = hours_grid))

# use the fit workflow to make predictions at the grid points
knnPredGrid <- predict(players_fit, asgrid)

# bind the predictions as a new column with the grid points
prediction_table <- bind_cols(knnPredGrid, asgrid) |>
  rename(subscribe = .pred_class)

# plot:
# 1. the colored scatter of the original data
# 2. the faded colored scatter for the grid points
wkflw_plot <-
    ggplot() +
    geom_point(data = players_data,
                mapping = aes(x = age,
                            y = played_hours,
                            color = subscribe),
                alpha = 0.75) +
    geom_point(data = prediction_table,
                mapping = aes(x = age,
                            y = played_hours,
                            color = subscribe),
                alpha = 0.02,
                size = 5) +
    labs(color = "Subscription Status",
        x = "Age",
        y = "Number of Hours Played", 
        title = "Player Subscription Classification with Knn",
        caption = "Figure 6: Background color indicates the decision of the classifier.") +
  scale_color_manual(values = c("darkorange", "steelblue")) +
  theme(text = element_text(size = 15)) + 
  theme(plot.caption = element_text(hjust = 0.5)) 

wkflw_plot

**<ins>Discussion</ins>**
==

The results of our analysis suggest that a player's playtime and age can predict newsletter subscription behaviour. Using a k-nearest neighbours model with k = 15, the classifier achieved an accuracy of 76.7%, a precision of 75.8%, and a recall of 100% on the test set. Every actual subscriber was correctly identified, but 14 non-subscribers were misclassified as subscribers, which shows that the model tends to overpredict subscription.
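
The three metrics follow directly from the four cells of the confusion matrix. The sketch below works through the arithmetic in base R. Only the 14 false positives and the absence of false negatives come from the text; the other two counts are assumptions, chosen as one set of values consistent with the reported percentages for a 60-player test set, and are not copied from the notebook output.

```r
# Confusion counts: fp and fn follow the text; tp and tn are ASSUMED values
# chosen to be consistent with the reported percentages
tp <- 44  # subscribers predicted as subscribers (assumed)
fp <- 14  # non-subscribers predicted as subscribers (stated in the text)
fn <- 0   # subscribers predicted as non-subscribers (recall of 100%)
tn <- 2   # non-subscribers predicted as non-subscribers (assumed)

# Accuracy: share of all predictions that are correct
accuracy <- (tp + tn) / (tp + fp + fn + tn)

# Precision: share of predicted subscribers who really are subscribers
precision <- tp / (tp + fp)

# Recall: share of actual subscribers the model found
recall <- tp / (tp + fn)

round(100 * c(accuracy = accuracy, precision = precision, recall = recall), 1)
```

Under these assumed counts the values agree with the reported figures to within rounding, and they show how a model can reach perfect recall simply by predicting "subscribed" for almost every player.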

These findings are somewhat expected, as higher engagement often correlates with interest in in-game updates. This insight can help make marketing efforts more efficient by targeting players who play regularly. Additionally, younger players may be less concerned about what clutters their email inbox, and may therefore be more willing to provide their email for a newsletter that they might never read.

While the model is imperfect, it offers a starting point for further questions. Given that age and playtime together help predict whether a person subscribes to a game-related newsletter, researchers could investigate whether age is a stronger predictor than playtime, and which age range is most likely to subscribe. Future work could also explore other predictors, test other models, or analyze behavioural patterns over time to better understand subscription behaviour. In terms of applications, the model's recall of 100% means it produced no false negatives on the test set, so a targeting campaign based on its predictions would be unlikely to miss actual subscribers, and the few players it predicts as non-subscribers could be given lower priority to save resources.

**<ins>References</ins>**
==

*Data Science, A First Introduction*
https://datasciencebook.ca/

*Data wrangling, exploration, and analysis with R*
https://stat545.com/join-cheatsheet.html 