**<ins>Introduction</ins>**
==

In the context of online gaming platforms, identifying which players are most engaged can provide valuable insight for recruitment and retention strategies. Engagement is often measured by how much time players spend on the platform, and being able to predict this behavior based on other known characteristics could help platforms target specific player groups for marketing or community building.

In this project, we explore the question:
**Can we predict which players are most likely to contribute a large amount of data (i.e., total hours played) based on their experience level and whether they are subscribed to the game-related newsletter?**

This is a predictive modeling problem where the goal is to estimate a continuous numerical variable (hours played) using two categorical predictors (experience and subscription status). Since both predictors are non-numeric, we aim to apply a regression method that can handle categorical inputs effectively without requiring strong assumptions about the data.
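
As a minimal sketch of how a regression can take categorical predictors, the base R snippet below fits `lm()` on a small made-up data frame. The `toy` values are invented for illustration and are not drawn from players.csv. `model.matrix()` exposes the indicator (dummy) columns that the model formula builds for each non-reference category.

```r
# Toy data, made up for illustration only (not taken from players.csv)
toy <- data.frame(
  experience = factor(c("Beginner", "Amateur", "Pro", "Beginner", "Pro", "Amateur")),
  subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  played_hours = c(0.5, 1.2, 3.0, 0.7, 2.1, 0.9)
)

# Inspect the indicator columns created from the categorical predictors
design <- model.matrix(played_hours ~ experience + subscribe, data = toy)
colnames(design)

# Fit a linear regression; each coefficient is an offset from the reference
# category (here "Amateur" and subscribe == FALSE)
toy_fit <- lm(played_hours ~ experience + subscribe, data = toy)
coef(toy_fit)
```

Each category other than the reference level gets its own coefficient, so no ordering of the experience levels is assumed in this encoding.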

<ins>players.csv</ins> Dataset
---

Observations: 196

Variables (7):

| Variable               | Type                | Description  |
|------------------------|---------------------|--------------|
| **experience**         |<chr\>| Player's level of in-game experience.|
| **subscribe**          |<lgl\>| Indicates if player is subscribed to in-game newsletters.|
| **hashedEmail**        |<chr\>| Player's anonymous email.|
| **played_hours**       |<dbl\>| Total hours the player has played.|
| **name**               |<chr\>| Player's in-game name.|
| **gender**             |<chr\>| Player's gender.|
| **Age**                |<dbl\>| Player's age.|


**<ins>Methods & Results</ins>**
==

In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)
library(gridExtra)
options(repr.matrix.max.rows = 6)
source("cleanup.R")
set.seed(3)


In [None]:
set.seed(3)
players_url <- "https://raw.githubusercontent.com/Alonso181006/Individual-Project/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/Alonso181006/Individual-Project/refs/heads/main/sessions.csv"
players_data <- read_csv(players_url)
sessions_data <- read_csv(sessions_url)
# Count the number of sessions recorded for each player
sessions_data <- sessions_data |>
    group_by(hashedEmail) |>
    summarize(number_sessions = n())

sessions_data

# Attach session counts to the player records; the key keeps its original
# name (hashedEmail) so that it matches the column in players.csv
players_data <- players_data |>
    left_join(sessions_data, by = "hashedEmail")
players_data

In [None]:
set.seed(3)
# Wrangling Data for the Plot
players_data_ranks <- players_data |>
    mutate(rank = as.integer(factor(experience, 
                                      levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))), 
          subs = as.integer(factor(subscribe, levels = c("FALSE", "TRUE"))))
# Finding average played hours based on rank and subscription
players_data_avg_hours <- players_data_ranks |>
  group_by(rank, subscribe) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  ungroup()

# Visualize average hours against experience level
players_plot_outliers <- players_data_avg_hours |>
    ggplot(aes(x = rank, y = mean_hours, color = subscribe)) +
    geom_line() +
    geom_point(size = 3) +
    labs(x = "Experience Level", y = "Mean Hours Played", color = "Subscription", title = "Avg Hrs Played by Rank and Subscription (With Outliers)") +
    theme(text = element_text(size = 14))
players_plot_outliers

# Remove outliers: keep only players with fewer than 5 hours played
players_data_low_hours <- players_data_ranks |>
    filter(played_hours < 5)

# Finding average played hours based on rank and subscription without outliers
players_data_avg_low_hours <- players_data_low_hours |>
  group_by(rank, subscribe) |>
  summarise(mean_hours = mean(played_hours, na.rm = TRUE)) |>
  ungroup()

# Visualize average hours against experience level
players_plot_no_outliers <- players_data_avg_low_hours |>
    ggplot(aes(x = rank, y = mean_hours, color = subscribe)) +
    geom_line() +
    geom_point(size = 3) +
    labs(x = "Experience Level", y = "Mean Hours Played", color = "Subscription", title = "Avg Hrs Played by Rank and Subscription (No Outliers)") +
    theme(text = element_text(size = 14))
players_plot_no_outliers
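
The rank encoding used in the cell above can be sketched in isolation with base R. The `experience` vector here is hypothetical, and the `"Unknown"` entry is included only to show what happens to a label that is missing from `levels`.

```r
# Ordered experience labels, matching the levels used in the wrangling step
levels_order <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical input values; "Unknown" is not one of the listed levels
experience <- c("Pro", "Beginner", "Regular", "Amateur", "Veteran", "Unknown")

# factor() assigns each label its position in levels_order,
# and as.integer() extracts that position
rank <- as.integer(factor(experience, levels = levels_order))
rank
# 5 1 3 2 4 NA
```

Treating experience as an integer rank assumes the levels are evenly spaced, which is a modeling choice rather than a property of the data.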

In [None]:
set.seed(2)
players_data_ranks <- players_data_low_hours
players_split <- initial_split(players_data_ranks, prop = 0.75, strata = played_hours)
players_train <- training(players_split)
players_test <- testing(players_split)

players_recipe <- recipe(played_hours ~ rank, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

players_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

# players_vfold <- vfold_cv(players_train, v = 5, strata = played_hours)

# gridvals <- tibble(neighbors = seq(from = 1, to = 100, by = 1))

# players_wkflw <- workflow() |>
#     add_recipe(players_recipe) |>
#     add_model(players_spec)

# players_results <- players_wkflw |>
#     fit(players_train) |>
#     collect_metrics() |>
#     filter(.metric == "rmse") |>
#     arrange(desc(mean))

# tail(players_results)

# players_plot <- players_results |>
#     ggplot(aes(x = neighbors, y = mean)) +
#     geom_point() +
#     geom_line()

# players_plot
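
The recipe above scales the predictor and then centers it. As a rough base R sketch, assuming both steps use the sample standard deviation and mean estimated from the training data, the two steps together amount to ordinary standardization. The `rank` values below are made up.

```r
# Made-up rank values standing in for the training predictor
rank <- c(1, 2, 2, 3, 4, 5, 5)

# Scale first (divide by the standard deviation), then center (subtract the mean)
scaled <- rank / sd(rank)
standardized <- scaled - mean(scaled)

# The result matches base R's scale(), which centers and then scales
all.equal(standardized, as.numeric(scale(rank)))
```

For a linear regression with a single predictor, standardizing changes the coefficient values but not the fitted predictions.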

In [None]:
# set.seed(2)
# kmin <- players_results |> 
#     filter(mean == min(mean)) |>
#     pull(neighbors)

# players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) |>
#     set_engine("kknn") |>
#     set_mode("regression") 

players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    fit(data = players_train)

players_fit

players_summary <- players_fit |>
    predict(players_test) |>
    bind_cols(players_test) 
    # metrics(truth = played_hours, estimate = .pred) |>
    # filter(.metric == "rmse")

players_summary


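# Plot hours played against experience rank, with the fitted line
# drawn from the test-set predictions
lm_plot_final <- ggplot(players_data_ranks, aes(x = rank, y = played_hours)) +
    geom_point() +
    geom_line(data = players_summary, 
             mapping = aes(x = rank, y = .pred), 
             color = "steelblue", 
             linewidth = 1) +
    labs(x = "Experience Level (rank)", y = "Hours Played")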

lm_plot_final
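
The evaluation step that is commented out above reports RMSE on the test set. A minimal base R sketch of that metric follows; the helper name `rmse` and the `truth` and `pred` vectors are hypothetical and stand in for `played_hours` and `.pred`.

```r
# Root mean squared error: the square root of the average squared prediction error
rmse <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Made-up observed and predicted hours, for illustration only
truth <- c(0.1, 0.5, 1.0, 2.0)
pred <- c(0.3, 0.4, 0.9, 1.5)

rmse(truth, pred)
```

RMSE is expressed in the same units as the response (hours), so it can be read directly against the range of `played_hours` kept after filtering.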

**<ins>Discussion</ins>**
==

**<ins>References</ins>**
==