**<ins>Introduction</ins>**
==

In the context of online gaming platforms, identifying which players are most engaged can provide valuable insight for recruitment and retention strategies. Engagement is often measured by how much time players spend on the platform, and being able to predict this behavior based on other known characteristics could help platforms target specific player groups for marketing or community building.

In this project, we explore the question:
**Can we predict which players are most likely to contribute a large amount of data (i.e., total hours played) based on their experience level and whether they are subscribed to the game-related newsletter?**

This is a predictive modeling problem where the goal is to estimate a continuous numerical variable (hours played) using two categorical predictors (experience and subscription status). Since both predictors are non-numeric, we aim to apply a regression method that can handle categorical inputs effectively without requiring strong assumptions about the data.

<ins>players.csv</ins> Dataset
---

Observations: 196

Variables (7):

| Variable               | Type                | Description  |
|------------------------|---------------------|--------------|
| **experience**         |<chr\>| Player's level of in-game experience.|
| **subscribe**          |<lgl\>| Indicates if player is subscribed to in-game newsletters.|
| **hashedEmail**        |<chr\>| Player's anonymous email.|
| **played_hours**       |<dbl\>| Total hours the player has played.|
| **name**               |<chr\>| Player's in-game name.|
| **gender**             |<chr\>| Player's gender.|
| **Age**                |<dbl\>| Player's age.|


**<ins>Methods & Results</ins>**
==

In [None]:
### Run this cell for initialization of packages
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)
library(gridExtra)
options(repr.matrix.max.rows = 6)
source("cleanup.R")
set.seed(4)

In [None]:
set.seed(5)

# Set URLs from the GitHub repository
players_url <- "https://raw.githubusercontent.com/Alonso181006/Individual-Project/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/Alonso181006/Individual-Project/refs/heads/main/sessions.csv"

# Read the data from the URLs
players_data <- read_csv(players_url)
sessions_data <- read_csv(sessions_url)

# Count number of sessions for each user
sessions_data_count <- sessions_data |>
    group_by(hashedEmail) |>
    summarize(number_sessions = n())

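# Rename the key column to snake_case
session_data_tidy <- sessions_data_count |>
    rename(hashed_email = hashedEmail)

# Join the per-player session counts onto the players data
# (players.csv names the key hashedEmail, so map it to hashed_email)
players_data_combine <- players_data |>
    left_join(session_data_tidy, by = c("hashedEmail" = "hashed_email"))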

# Update Data Frame to replace NA values, and set subscribe column as a factor 
players_data_update <- players_data_combine |>
    mutate(number_sessions = replace_na(number_sessions, 0), 
          subscribe = as.factor(subscribe)) 

# Add a column that sets the experience level as a rank from 1-5 (inclusive)
players_data <- players_data_update |>
    mutate(rank = as.integer(factor(experience, 
                                    levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))))
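
The wrangling steps above (count sessions per player, join the counts onto the players table, fill missing counts with 0, and encode experience as an integer rank) can be sketched on a toy example. The tibbles `toy_players` and `toy_sessions` and their values are made up for illustration and are not taken from the project data; the sketch also assumes both tables share the key name `hashedEmail`, so no rename is needed.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for players.csv and sessions.csv (values are made up)
toy_players <- tibble(
    hashedEmail = c("a1", "b2", "c3"),
    experience = c("Pro", "Beginner", "Regular")
)
toy_sessions <- tibble(hashedEmail = c("a1", "a1", "c3"))

# One row per player who has at least one recorded session
toy_counts <- toy_sessions |>
    count(hashedEmail, name = "number_sessions")

# left_join keeps every player; players without sessions get NA,
# which replace_na turns into 0. The factor levels fix the rank order.
toy_combined <- toy_players |>
    left_join(toy_counts, by = "hashedEmail") |>
    mutate(number_sessions = replace_na(number_sessions, 0),
           rank = as.integer(factor(experience,
                                    levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))))

toy_combined
```

Player `b2` has no rows in `toy_sessions`, so it is the case that exercises the `replace_na` step. Any experience label missing from `levels` would become `NA` in `rank`.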


In [None]:
# Histogram of played hours, coloured by subscription status
subscription_plot <- ggplot(players_data, aes(x = played_hours, fill = subscribe)) +
    geom_histogram(bins = 30, alpha = 0.7) +
    labs(title = "Played Hours by Subscription Status", x = "Hours Played", y = "Count", fill = "Subscribed") +
    theme(text = element_text(size = 15))
subscription_plot

In [None]:
# Set seed to prevent different splits of data
set.seed(5)

# Split data into training (70%) and testing (30%) sets, stratified by subscribe
players_split <- initial_split(players_data, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Create the recipe for classification and standardize all predictors
# (the age column is named Age in players.csv)
players_recipe <- recipe(subscribe ~ Age + played_hours + rank, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Create the KNN model specification, tuning the number of neighbors
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Set up 5-fold cross-validation, stratified by subscribe
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# Create a tibble of possible 'neighbors' values 1 to 10
gridvals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

# Tune the workflow over the grid and collect the cross-validation metrics
players_wkflw <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    tune_grid(resamples = players_vfold, grid = gridvals) |>
    collect_metrics()

# Keep only the accuracy estimates
players_results <- players_wkflw |>
    filter(.metric == "accuracy")

head(players_results)

# Pick the number of neighbors with the highest mean accuracy
best_k <- players_results |>
    arrange(desc(mean)) |>
    head(1) |>
    pull(neighbors)

best_k

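# Plot the estimated accuracy against the number of neighbors
players_plot <- players_results |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate")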

players_plot
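
To make the tuned model more concrete, here is a minimal base-R sketch of what an unweighted ("rectangular") K-nearest-neighbors classifier does for a single new observation: standardize the predictors, measure Euclidean distance to every training point, and take a majority vote among the `k` closest. The helper `knn_vote` and the toy values are hypothetical and only illustrate the idea. This is not the `kknn` implementation, and it assumes Euclidean distance.

```r
# Toy training data (made-up values): two numeric predictors and a label
train_x <- data.frame(age = c(17, 19, 22, 30, 41, 45),
                      played_hours = c(30, 25, 28, 1, 0.5, 2))
train_y <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE"))
new_x <- data.frame(age = 20, played_hours = 27)

# Hypothetical helper: classify one observation by unweighted majority vote
knn_vote <- function(train_x, train_y, new_x, k) {
    # Standardize using the training means and standard deviations only
    mu <- colMeans(train_x)
    sigma <- apply(train_x, 2, sd)
    train_z <- scale(train_x, center = mu, scale = sigma)
    new_z <- scale(new_x, center = mu, scale = sigma)

    # Euclidean distance from the new point to every training point
    dists <- sqrt(rowSums(sweep(train_z, 2, as.numeric(new_z))^2))

    # Labels of the k closest training points, then the most common label
    nearest <- train_y[order(dists)[seq_len(k)]]
    names(which.max(table(nearest)))
}

knn_vote(train_x, train_y, new_x, k = 3)
```

Standardizing first matters because `played_hours` and `age` are on different scales. Without it, the predictor with the larger spread would dominate the distance.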

In [None]:
set.seed(4)

# Rebuild the model specification with the best number of neighbors
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit(data = players_train)


# Predict on the test set and attach the true labels
players_summary <- players_fit |>
    predict(players_test) |>
    bind_cols(players_test)


# Accuracy on the test set
players_accuracy <- players_summary |>
    metrics(truth = subscribe, estimate = .pred_class) |>
    filter(.metric == "accuracy")

players_accuracy

# Precision and recall, treating the second factor level (TRUE) as the positive class
players_precision <- players_summary |>
    precision(truth = subscribe, estimate = .pred_class, event_level = "second")

players_precision

players_recall <- players_summary |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "second")
players_recall

players_matrix <- players_summary |>
    conf_mat(truth = subscribe, estimate = .pred_class)
players_matrix
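
As a reading aid for the metrics above, the following base-R sketch computes accuracy, precision, and recall directly from counts of true positives, false positives, and false negatives. The `truth` and `estimate` vectors are made-up toy labels, not model output, and `"TRUE"` is treated as the positive class to mirror `event_level = "second"`.

```r
# Toy labels (made up); levels are ordered FALSE, TRUE so TRUE is the second level
truth <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE"),
                levels = c("FALSE", "TRUE"))
estimate <- factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"),
                   levels = c("FALSE", "TRUE"))

# Counts with TRUE as the positive class
tp <- sum(estimate == "TRUE" & truth == "TRUE")    # predicted subscribed, is subscribed
fp <- sum(estimate == "TRUE" & truth == "FALSE")   # predicted subscribed, is not
fn <- sum(estimate == "FALSE" & truth == "TRUE")   # predicted not subscribed, but is

toy_accuracy <- mean(estimate == truth)   # share of all predictions that are correct
toy_precision <- tp / (tp + fp)           # share of predicted positives that are correct
toy_recall <- tp / (tp + fn)              # share of actual positives that are found

c(accuracy = toy_accuracy, precision = toy_precision, recall = toy_recall)
```

When one class is much more common than the other, accuracy alone can look good even if the rarer class is rarely identified, which is why precision and recall are reported alongside it.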

**<ins>Discussion</ins>**
==

**<ins>References</ins>**
==