# Final Report

## Methods & Results

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
players_data <- read_csv("data/players.csv")
players_data

Our analysis starts by importing the dataset directly from the data folder.

In [None]:
players_tidy <- players_data |>
  select(experience, subscribe, played_hours, Age) |>
  filter(Age != 17) |>
  mutate(experience = as_factor(experience),
         # Make TRUE the first level so it is treated as the positive class
         subscribe = fct_relevel(as_factor(subscribe), "TRUE"))
head(players_tidy)

After loading, we tidy the dataset to keep only the relevant variables: experience, subscribe, played_hours, and Age. These variables are chosen because they relate closely to user behavior and to factors that may influence subscription decisions. Players aged exactly 17 are removed because they form an unrepresentative group that could introduce noise. We convert experience and subscribe to factors so that they are treated as categorical during modeling, which matters especially for subscribe, the outcome we aim to predict. Finally, TRUE is made the first level of subscribe so that it is treated as the positive class.
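
As a minimal sketch of two details in this cleaning step, the toy example below uses hypothetical rows (not values from `players.csv`). It shows that `filter(Age != 17)` also drops rows where `Age` is missing, and that `fct_relevel()` moves `"TRUE"` to the first factor level. yardstick metrics such as `precision()` and `recall()` treat the first level as the event of interest by default.

```r
library(dplyr)
library(forcats)

# Hypothetical toy rows for illustration only
toy <- tibble(
  subscribe = factor(c(FALSE, TRUE, TRUE, FALSE)),
  Age       = c(17, 21, NA, 30)
)

# filter() keeps only rows where the condition is TRUE, so the row with
# a missing Age is dropped together with the 17-year-old
toy_kept <- toy |> filter(Age != 17)
nrow(toy_kept)

# factor() orders the levels as "FALSE", "TRUE";
# fct_relevel() moves "TRUE" to the front
levels(toy_kept$subscribe)
toy_kept <- toy_kept |> mutate(subscribe = fct_relevel(subscribe, "TRUE"))
levels(toy_kept$subscribe)
```

If rows with a missing `Age` should be kept, one option is `filter(Age != 17 | is.na(Age))`; whether that is appropriate depends on how missing ages ought to be handled later in the analysis.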

In [None]:
p1 <- players_tidy |>
  ggplot(aes(x = played_hours, fill = subscribe)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") +
  labs(title = "Figure 1: Distribution of Played Hours by Subscription",
       x = "Played Hours", y = "Count") +
  theme_minimal()
p1

In [None]:
p2 <- players_tidy |>
  ggplot(aes(x = experience, y = played_hours, fill = experience)) +
  # alpha is a fixed setting, not a data mapping, so it belongs outside aes()
  geom_boxplot(alpha = 0.1) +
  labs(title = "Figure 2: Played Hours by Experience Level",
       x = "Experience Level", y = "Played Hours") +
  theme_minimal()
p2

Next, we explore the data visually. Figure 1 is a histogram of played hours, colored by subscription status, with a bin width of 5 hours. It shows whether more engaged players tend to subscribe, a pattern that could inform predictions. Figure 2 is a boxplot of played hours by experience level, which lets us check whether experience is related to time spent playing and so might also signal subscription behavior. Together, the two plots point to possible relationships between the predictors and the target, guiding feature selection and model expectations.

In [None]:
set.seed(1234)
data_split <- initial_split(players_tidy, prop = 0.8, strata = subscribe)
train_data <- training(data_split)
test_data <- testing(data_split)

The dataset is then split into training and testing sets with an 80/20 ratio. This ensures enough data is available for training while reserving a portion for final model evaluation. Stratifying by the outcome variable ensures that both subsets have similar class distributions, which is critical for classification tasks to avoid biased or misleading results.
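
To illustrate what stratification does, the sketch below splits a hypothetical imbalanced dataset (70 `"TRUE"` and 30 `"FALSE"` rows, invented for this example) and compares the class proportions in the two portions. The helper `class_props()` is our own illustrative function, not part of tidymodels.

```r
library(dplyr)
library(rsample)

set.seed(1)

# Hypothetical imbalanced outcome (70 "TRUE", 30 "FALSE"); not the real data
toy <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(70, 30))),
  played_hours = runif(100, min = 0, max = 50)
)

toy_split <- initial_split(toy, prop = 0.8, strata = subscribe)

# Illustrative helper: proportion of each class in a data frame
class_props <- function(df) {
  df |>
    count(subscribe) |>
    mutate(prop = n / sum(n))
}

train_props <- class_props(training(toy_split))
test_props  <- class_props(testing(toy_split))

train_props
test_props
```

With `strata = subscribe`, the two tables should show nearly the same share of each class; the same check can be run on `train_data` and `test_data`.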

In [None]:
knn_recipe <- recipe(subscribe ~ played_hours + Age + experience, data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

We then create a preprocessing recipe. Normalizing the numeric predictors is essential because the k-nearest neighbors model we use is sensitive to scale: features with larger ranges would otherwise dominate the distance calculations. Dummy encoding is applied to the experience variable so that its categorical levels are represented as numeric columns the model can use. Because the recipe is built on the training data, the preprocessing is estimated from the training set only.
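
The effect of scale on distances can be seen in a small base R sketch. The four players below are hypothetical, and `scale()` stands in for the centering and scaling that `step_normalize()` performs on the numeric predictors.

```r
# Hypothetical players: played_hours spans a wide range, Age a narrow one
toy <- matrix(
  c( 10, 20,
     40, 21,
     12, 30,
    200, 25),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("a", "b", "c", "d"), c("played_hours", "Age"))
)

# Euclidean distances from player "a" to "b" and "c" in raw units
raw_dist <- as.matrix(dist(toy))["a", c("b", "c")]

# The same distances after centering and scaling each column
scaled_dist <- as.matrix(dist(scale(toy)))["a", c("b", "c")]

raw_dist
scaled_dist

# Nearest of the two candidates before and after scaling
names(which.min(raw_dist))
names(which.min(scaled_dist))
```

In raw units the 30-hour gap between "a" and "b" outweighs everything else, so "c" looks closer even though its age differs from "a" by more than two standard deviations. After scaling, each variable contributes relative to its own spread, and "b" becomes the nearer neighbor.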

In [None]:
knn_spec <- nearest_neighbor(neighbors = tune(), weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification")

Next, we specify the k-nearest neighbors classification model. k-NN is intuitive and well suited to problems where similarity in feature space corresponds to similar outcomes. We mark the number of neighbors (k) as a parameter to be tuned because model performance depends heavily on this value: if k is too small, the model overfits; if it is too large, the model may underfit and smooth out important distinctions. We therefore test a range of k values to determine which works best.

In [None]:
knn_workflow <- workflow() |>
  add_model(knn_spec) |>
  add_recipe(knn_recipe)

We combine the preprocessing and model definition into a single workflow. 

In [None]:
set.seed(123)
folds <- vfold_cv(train_data, v = 5, strata = subscribe)

k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

To find the optimal number of neighbors, we conduct 5-fold cross-validation on the training data, stratified by subscribe. This gives a more robust evaluation than any single train-validation split, because it averages performance over several partitions and so reduces the effect of random variation. We evaluate every value of k from 1 to 20 to identify the one with the highest classification accuracy.
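
As a rough sketch of what `vfold_cv(..., v = 5)` produces, the example below builds folds on a hypothetical 100-row training set (the `row_id` column and the data are invented for illustration). It reports how many rows each fold uses for fitting and for assessment, and checks that every row is held out once.

```r
library(dplyr)
library(rsample)

set.seed(1)

# Hypothetical training set of 100 rows; not the real data
toy_train <- tibble(
  row_id    = 1:100,
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(70, 30)))
)

toy_folds <- vfold_cv(toy_train, v = 5, strata = subscribe)

# Rows used for fitting (analysis) and for validation (assessment) per fold
fold_sizes <- tibble(
  fold         = toy_folds$id,
  n_analysis   = sapply(toy_folds$splits, function(s) nrow(analysis(s))),
  n_assessment = sapply(toy_folds$splits, function(s) nrow(assessment(s)))
)
fold_sizes

# Collect the held-out row ids across all folds
held_out <- unlist(lapply(toy_folds$splits, function(s) assessment(s)$row_id))

# TRUE if every row appears in exactly one assessment set
all(sort(held_out) == 1:100)
```

Each candidate k is fit five times, once per fold, and the accuracy reported by `collect_metrics()` is the mean over those five assessment sets.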

In [None]:
knn_tune_results <- knn_workflow |>
  tune_grid(resamples = folds, grid = k_vals) |>
  collect_metrics() |>
  filter(.metric == "accuracy")
knn_tune_results

In [None]:
best_k <- knn_tune_results |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)
best_k

In [None]:
knn_plot <- knn_tune_results |>
  ggplot(aes(x = neighbors, y = mean)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  geom_vline(xintercept = best_k, linetype = "dashed", color = "darkgreen") +
  labs(title = "Figure 3: Accuracy vs Number of Neighbors (k)",
       x = "Number of Neighbors (k)",
       y = "Cross-Validated Accuracy") +
  theme_minimal()
knn_plot

The model is tuned across the grid of k values, and the value with the highest mean cross-validated accuracy is selected, which turns out to be 15. This process helps identify a balance between model flexibility and stability. Figure 3 plots accuracy against the number of neighbors, showing where performance improves or plateaus; the dashed vertical line marks the selected k.

In [None]:
final_knn_spec <- nearest_neighbor(neighbors = best_k, weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification")

final_knn_workflow <- workflow() |>
  add_model(final_knn_spec) |>
  add_recipe(knn_recipe)

final_fit <- fit(final_knn_workflow, data = train_data)

# Predict once on the test set and reuse the predictions for every metric
test_predictions <- predict(final_fit, test_data) |>
  bind_cols(test_data)

test_accuracy <- test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")

test_precision <- test_predictions |>
  precision(truth = subscribe, estimate = .pred_class)

test_recall <- test_predictions |>
  recall(truth = subscribe, estimate = .pred_class)

test_accuracy

test_precision

test_recall

confusion_matrix <- test_predictions |>
  conf_mat(truth = subscribe, estimate = .pred_class)

confusion_matrix

With the optimal value chosen, the workflow is finalized and refit on the full training set, so that the model benefits from all available training data before making final predictions. The model is then applied to the test set, which played no part in training or tuning, to estimate its performance on unseen data. The results include overall accuracy, precision, recall, and a confusion matrix, offering insight into how well the model predicts each class and where it makes mistakes. This final step helps us understand the model's strengths and limitations in a practical context.
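
To make the relationship between the confusion matrix and the three metrics concrete, the sketch below computes them by hand. The counts are hypothetical and are not the results of this analysis; "positive" here means a predicted or actual `TRUE` in subscribe.

```r
# Hypothetical confusion-matrix counts; NOT results from this analysis
tp <- 25  # predicted TRUE,  actually TRUE
fp <- 8   # predicted TRUE,  actually FALSE
fn <- 3   # predicted FALSE, actually TRUE
tn <- 4   # predicted FALSE, actually FALSE

# Share of all predictions that are correct
accuracy <- (tp + tn) / (tp + fp + fn + tn)

# Of the players predicted to subscribe, the share who actually did
precision <- tp / (tp + fp)

# Of the players who actually subscribed, the share the model found
recall <- tp / (tp + fn)

c(accuracy = accuracy, precision = precision, recall = recall)
```

Reading the metrics side by side matters when the classes are imbalanced: a model that predicts `TRUE` for almost everyone can reach high recall while its precision stays close to the share of subscribers in the data.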