# Title: Accuracy of Predicting Newsletter Subscription by Looking at Player's Age and Hours Played #

## Introduction ##
##### Understanding user engagement on online platforms such as video games is critical for developers and researchers. In this project, I analyze data from a Minecraft research server to determine whether players' **age** and **hours played** can reliably predict whether they subscribe to a game-related newsletter. The aim of this project is to help a research group at UBC target their recruitment efforts. #####

#### Question:  ####
##### How accurately can a player's age and hours played predict whether they will subscribe to the newsletter? #####




##### For this project I used the dataset players.csv, which contains the following variables: #####
| Variable       | Description                                      | Type        |
|----------------|--------------------------------------------------|-------------|
| experience     | Experience level (not used in this analysis)     | Text        |
| subscribe      | Whether the player subscribed to the newsletter  | Logical     |
| hashedEmail    | Unique identifier for player (not used)          | Text        |
| played_hours   | Total hours the player played                    | Numeric     |
| name           | Player name (not used)                           | Text        |
| gender         | Player gender (not used)                         | Text        |
| Age            | Player age                                       | Numeric     |


##### This dataset contains 196 observations. #####

## Methods And Results ##

In [None]:
# Load libraries
library(tidyverse)
library(repr)
library(tidymodels)

1. First, I explore the data so I can prepare it for the analysis.

In [None]:
# Explore the data
players <- read_csv("players.csv")

We can see that this data set has 196 rows and 7 variables. The file uses "," as its delimiter.

2. Having explored the data, I know what each variable is and what type it has. Since I want to see whether Age and hours played are good predictors of newsletter subscription, I will drop the rows with a missing `Age`, `select` only those 3 variables, and `mutate` the *subscribe* variable from logical into a factor.

In [None]:
# Load and clean the data
players <- read_csv("players.csv") |>
  filter(!is.na(Age)) |>
  select(subscribe, played_hours, Age) |>
  mutate(subscribe = as.factor(subscribe))

players

3. Now that I have loaded, explored, and cleaned the data, I will compute the average age and hours played of the players to get a sense of the data's summary statistics.

In [None]:

players_summary <- players |>
  summarize(
    mean_age = mean(Age, na.rm = TRUE),
    mean_hours = mean(played_hours, na.rm = TRUE)
  )
players_summary

4. I will also check the range of played hours and Age so I can see how important it is to standardize this data.
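
One way to check these ranges numerically is sketched here. It uses a small made-up data frame (`toy_players`, a hypothetical name with invented values) so that it runs on its own. The same `summarize()` call could be applied to `players` instead.

```r
library(dplyr)

# Hypothetical toy values for illustration only; not taken from players.csv.
toy_players <- data.frame(
  played_hours = c(0, 0.1, 1.5, 30, 200),
  Age = c(9, 17, 21, 25, 50)
)

# Minimum and maximum of each predictor.
toy_ranges <- toy_players |>
  summarize(
    min_hours = min(played_hours),
    max_hours = max(played_hours),
    min_age = min(Age),
    max_age = max(Age)
  )

toy_ranges
```

If one predictor spans a much wider range than the other, it dominates the distances that K-nearest neighbors relies on, which is the motivation for standardizing.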

### Scatter Plot Visualization

In [None]:
players |> 
  ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point(alpha = 0.4) +
  labs(title = "Scatter Plot of Age vs Played Hours by Subscription",
       x = "Age",
       y = "Played Hours",
       color = "Subscribed") +
  theme_minimal()

### *Step 1*: Data Splitting
To choose the best value of K and then test how well the classifier performs on unseen data, I split the data into a training set (70%) and a test set (30%), stratified by `subscribe`.
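
To illustrate what the `strata` argument does, here is a minimal sketch on made-up data (`toy_players` and `class_props` are hypothetical names, and the values are invented). It splits the toy data the same way and compares the class proportions in the two subsets, which should come out close to each other.

```r
library(rsample)
library(dplyr)

set.seed(123)

# Hypothetical toy data for illustration only: 100 players, 70 of them subscribed.
toy_players <- data.frame(
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(70, 30))),
  played_hours = runif(100, min = 0, max = 50),
  Age = sample(10:50, size = 100, replace = TRUE)
)

# 70/30 split that samples within each level of subscribe.
toy_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)

# Proportion of each class within a data frame.
class_props <- function(df) {
  df |>
    count(subscribe) |>
    mutate(prop = n / sum(n))
}

train_props <- class_props(training(toy_split))
test_props <- class_props(testing(toy_split))

train_props
test_props
```

Keeping the class balance similar in both subsets matters here because a test set with an unusual share of subscribers would make the accuracy estimate harder to interpret.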

In [None]:
set.seed(123)
players_split <- initial_split(players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

### *Step 2*: Preprocessing Recipe
I defined a recipe that standardizes both predictors by centering and scaling them, so that neither one dominates the distance calculation.
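
To make concrete what centering and scaling do to each predictor, here is a minimal base R sketch on invented values (the vectors and the `standardize` helper are hypothetical and are not part of the analysis). Centering subtracts the mean and scaling divides by the standard deviation, so each standardized column ends up with mean 0 and standard deviation 1.

```r
# Hypothetical toy values for illustration only; not taken from players.csv.
played_hours <- c(0, 1, 2, 5, 42)
Age <- c(17, 19, 21, 23, 30)

# Center (subtract the mean), then scale (divide by the standard deviation).
standardize <- function(x) (x - mean(x)) / sd(x)

hours_z <- standardize(played_hours)
age_z <- standardize(Age)

# Both standardized columns now have mean 0 and standard deviation 1
# (up to floating-point rounding).
round(c(mean(hours_z), sd(hours_z), mean(age_z), sd(age_z)), 10)
```

In the recipe, the means and standard deviations are estimated from the data the workflow is fitted on, and the same values are then applied to new data at prediction time.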

In [None]:
players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())


### *Step 3*: Specify KNN Model 

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

### *Step 4*: 5-Fold Cross Validation

In [None]:
set.seed(123)
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

### *Step 5*: Create Grid of K values

In [None]:
k_vals <- tibble(neighbors = seq(1, 15, by = 2))

### *Step 6*: Define Workflow and Tune Model

In [None]:
players_workflow <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec)

In [None]:
set.seed(123)
knn_results <- players_workflow |>
  tune_grid(resamples = players_vfold, grid = k_vals) |>
  collect_metrics()

knn_results

### *Step 7*: Select Best K by Accuracy

In [None]:
accuracies <- knn_results |> filter(.metric == "accuracy")
accuracies

In [None]:
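# with_ties = FALSE keeps a single row, so best_k is one value
# even if several values of K share the highest mean accuracy
best_k <- accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)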

best_k

### *Step 8*: Finalize Model with Best K

In [None]:
final_knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_workflow <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(final_knn_spec)

final_fit <- final_workflow |> fit(data = players_train)

### *Step 9*: Predict on Test Set & Evaluate

In [None]:
test_predictions <- predict(final_fit, players_test) |>
  bind_cols(players_test)

test_metrics <- test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class)

test_metrics

### *Step 10*: Confusion Matrix

In [None]:
conf_mat(test_predictions, truth = subscribe, estimate = .pred_class)

In [None]:
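# Counts entered by hand, presumably from the conf_mat() output above;
# they need updating if the seed or the split changes
conf_matrix_data <- tribble(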
  ~Prediction, ~Truth, ~Count,
  "FALSE",     "FALSE",    3,
  "FALSE",     "TRUE",     3,
  "TRUE",      "FALSE",   13,
  "TRUE",      "TRUE",    40
)

conf_matrix_data

In [None]:
ggplot(conf_matrix_data, aes(x = Truth, y = Count, fill = Prediction)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Confusion Matrix (Grouped Bar Chart)",
    x = "Actual (Truth)",
    y = "Count",
    fill = "Predicted"
  ) +
  theme_minimal()
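
The counts in the confusion matrix table above can also be turned into summary metrics by hand. The sketch below is a minimal base R calculation, assuming `TRUE` (subscribed) is treated as the positive class. The variable names are hypothetical and the four counts are the ones listed in `conf_matrix_data`.

```r
# Counts copied from conf_matrix_data; TRUE is treated as the positive class here.
tp <- 40  # predicted TRUE,  actually TRUE
fp <- 13  # predicted TRUE,  actually FALSE
fn <- 3   # predicted FALSE, actually TRUE
tn <- 3   # predicted FALSE, actually FALSE
total <- tp + fp + fn + tn

accuracy <- (tp + tn) / total   # share of all predictions that are correct
precision <- tp / (tp + fp)     # share of predicted subscribers who subscribed
recall <- tp / (tp + fn)        # share of actual subscribers who were found

# Accuracy of a classifier that always predicts the majority class (TRUE).
baseline <- (tp + fn) / total

round(c(accuracy = accuracy, precision = precision,
        recall = recall, baseline = baseline), 3)
```

With these counts, accuracy and the majority-class baseline are both 43/59, so comparing the two is a useful check before concluding how much age and hours played add to the prediction.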