# Title: Accuracy of Predicting Newsletter Subscription by Looking at Player's Age and Hours Played 

Link to GitHub repository: https://github.com/25494998/NataliaProject.git

## Introduction 


##### Understanding user engagement on online platforms such as video games is critical for developers and researchers. In this project, I analyzed data from a Minecraft research server to determine whether players' **age** and **hours played** can reliably predict whether they subscribe to a game-related newsletter. The aim of this project is to help a research group at UBC target their recruitment efforts.

#### Question:  
##### How accurately can hours played and age of a player predict if they are going to subscribe to the newsletter or not? #####


#### Dataset:  
##### For this project I used the dataset _players.csv_, which contains the following variables: 
| Variable       | Description                                      | Type        |
|----------------|--------------------------------------------------|-------------|
| experience     | Experience metric (not used in this analysis)    | Numeric     |
| subscribe      | Whether the player subscribed to the newsletter  | Logical     |
| hashedEmail    | Unique identifier for player (not used)          | Text        |
| played_hours   | Total hours the player played                    | Numeric     |
| name           | Player name (not used)                           | Text        |
| gender         | Player gender (not used)                         | Text        |
| Age            | Player age                                       | Numeric     |


##### This dataset contains 196 observations. #####

## Methods And Results ##

### Exploring the data ###

In [None]:
#Load Libraries
library(tidyverse)
library(repr)
library(tidymodels)

1. The first thing I will do is to explore the data so we can prepare it for our analysis. 

In [None]:
# Explore the Data 
players<- read_csv("players.csv")

We can see that this data set has 196 rows and 7 variables. The file uses "," as its delimiter, which is why I used the `read_csv` function.

2. After exploring the dataset, I identified the key variables relevant to my analysis: Age, played_hours, and subscribe. Since the goal is to evaluate how accurately Age and Hours Played can predict newsletter subscription, I will shorten the dataset to just these three variables.

To prepare the data for modeling, I will:

- Use `drop_na()` to remove rows with missing values
  
- Use `select()` to keep only the relevant columns/variables: Age, played_hours, and subscribe

- Use `mutate()` to convert the subscribe variable from a logical to a factor type, since I am working with a classification model

I also use the `slice` function to display only the first 20 rows, because I have already viewed the whole dataset and do not need to print it again.

In [None]:
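# Load and clean the data
players <- read_csv("players.csv") |>
  drop_na() |>                                # remove rows with a missing value in any column
  select(subscribe, played_hours, Age) |>     # keep only the variables used in the analysis
  mutate(subscribe = as.factor(subscribe))    # a classification model needs a factor outcome

# Preview the first 20 rows
slice(players, 1:20)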
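
The order of the cleaning steps listed above matters. The following is a minimal sketch on hypothetical toy data (not the real _players.csv_), assuming `dplyr` and `tidyr` are installed. It shows that calling `drop_na()` before `select()` also drops rows whose only missing value is in a column the analysis never uses.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: `gender` stands in for a column the analysis does not use.
toy <- tibble(
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(1.5, 0.0, 30.2, NA),
  Age          = c(17, 21, NA, 25),
  gender       = c("Male", NA, "Female", "Male")
)

# drop_na() first: a missing value in ANY column removes the row.
drop_then_select <- toy |>
  drop_na() |>
  select(subscribe, played_hours, Age)

# select() first: only missing values in the kept columns remove a row.
select_then_drop <- toy |>
  select(subscribe, played_hours, Age) |>
  drop_na()

nrow(drop_then_select)  # 1
nrow(select_then_drop)  # 2
```

Which order is preferable depends on whether rows with gaps in unused columns should count as valid observations; the sketch only illustrates the difference.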

3. Now that I have loaded, explored, and cleaned the dataset, I want to gain a better understanding of the overall characteristics of the player population. Specifically, I am interested in identifying metrics related to age and hours played. To do this, I use the `summarize()` function to calculate two key summary statistics:

In [None]:
players_summary <- players |>
summarize(mean_age = mean(Age, na.rm = TRUE),
mean_hours = mean(played_hours, na.rm = TRUE))
players_summary

This code computes the average age (mean_age) and average number of hours played (mean_hours) across all players. The `na.rm = TRUE` argument ensures that any missing values are excluded from the calculations, which helps avoid getting NA results. These metrics provide a baseline understanding of the dataset values. 
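
To illustrate what `na.rm = TRUE` does, here is a small sketch with hypothetical ages (not values from the dataset). Because `drop_na()` has already been applied to `players`, the argument acts mainly as a safeguard in this analysis.

```r
# Hypothetical ages with one missing value.
ages <- c(17, 21, NA, 25)

mean_with_na    <- mean(ages)                # NA, because one value is missing
mean_without_na <- mean(ages, na.rm = TRUE)  # 21, the mean of 17, 21 and 25

mean_with_na
mean_without_na
```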

4. To assess how important standardization is before modeling, I will calculate the range of values for both played_hours and Age. This helps identify differences in scale, which can significantly affect the distance-based model of KNN. 

In [None]:
played_hours_range <- players |>
summarise(hours_range = max(played_hours, na.rm = TRUE) - min(played_hours, na.rm = TRUE))

Age_range <- players |>
summarise(age_range = max(Age, na.rm = TRUE) - min(Age, na.rm = TRUE))

Age_range
played_hours_range

We can observe that _played_hours_ has a significantly larger range of values compared to _Age_. This difference in scale means that, without standardization, the played_hours variable would disproportionately influence the classifier’s decisions. To ensure both predictors contribute equally to the model, I will need to standardize the variables before training.
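
The effect of scale on a distance-based classifier can be sketched with three hypothetical players (the values below are made up for illustration). On the raw scale the difference in hours dominates the Euclidean distance; after centering and scaling, both predictors contribute comparably.

```r
# Three hypothetical players: columns are played_hours and Age.
toy <- rbind(
  a = c(played_hours = 0,   Age = 17),
  b = c(played_hours = 150, Age = 17),  # same age as a, very different hours
  c = c(played_hours = 0,   Age = 40)   # same hours as a, very different age
)

# Raw Euclidean distances from player a
raw <- as.matrix(dist(toy))
raw_ab <- raw["a", "b"]   # 150
raw_ac <- raw["a", "c"]   # 23

# Distances after centering and scaling each column
std <- as.matrix(dist(scale(toy)))
std_ab <- std["a", "b"]
std_ac <- std["a", "c"]

c(raw_ab = raw_ab, raw_ac = raw_ac, std_ab = std_ab, std_ac = std_ac)
```

In this toy example the two standardized distances are equal, so neither predictor outweighs the other.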

### Scatter Plot Visualization ###

To explore the relationship between the predictors and the class being predicted, I created a scatter plot of Age vs. Played Hours, with points colored based on subscription status.

A scatter plot is an appropriate choice here because:

- It allows us to visualize the relationship between two continuous variables — in this case, Age and Hours Played.

- By adding color to indicate whether or not a player subscribed, we can visually examine how the target variable (subscription) relates, if at all, to the two predictors.

- This helps identify clusters, trends, or patterns in the data — for example, whether subscribers tend to have higher playtime or fall within a specific age group.

It also gives a sense of overlap between classes, which is important when evaluating how well these variables might separate subscribed from non-subscribed players.

In [None]:
players |> 
ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
geom_point(alpha = 0.4) +
labs(title = "Scatter Plot of Age vs Played Hours by Subscription Factor",
x = "Age",
y = "Played Hours",
color = "Subscribed") +
ylim(0, 10)

**Figure 1**
Scatter plot of the observations in the data set, colored by whether the player has subscribed or not, with Age on the x-axis and played hours on the y-axis (limited to 0 to 10 hours).

From the scatter plot, I observed that the data is fairly spread out in terms of Age, while Played Hours is more concentrated. In particular, most players have fewer than 7.5 hours of gameplay. To better visualize this dense region, I limited the y-axis (representing hours played) to range from 0 to 10, instead of 0 to 200. This adjustment helps focus on where the majority of the data points lie and makes patterns in the lower-hour range easier to interpret, especially since many points were clustered below 5 hours.

### *Step 1*: Data Splitting
To find the best value of k and to test how well the classifier performs, I split the data into a training subset and a test subset. I chose a training proportion of 70% because that is the proportion used in most of the tutorials and worksheets I have seen.

Data splitting is important in the case of this model for us to evaluate our classifier. 

- Train Subset: Train the model so it can learn patterns.

- Test Subset: Test the model on unseen data to evaluate how well it generalizes to new data sets.

In [None]:
set.seed(123)
players_split <- initial_split(players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)
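
As a rough illustration of what `strata = subscribe` does, the sketch below splits a hypothetical data frame of 100 players (70 subscribed, 30 not) and compares the share of subscribers in each subset. The data and the names `toy`, `toy_split`, `toy_train`, and `toy_test` are assumptions made for this example only.

```r
library(rsample)

set.seed(1)
# Hypothetical data: 100 players, 70 subscribed and 30 not.
toy <- data.frame(
  subscribe = factor(rep(c("TRUE", "FALSE"), times = c(70, 30))),
  Age       = sample(10:50, 100, replace = TRUE)
)

toy_split <- initial_split(toy, prop = 0.7, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row lands in exactly one subset.
nrow(toy_train) + nrow(toy_test)

# The share of subscribers is about the same in both subsets.
train_share <- mean(toy_train$subscribe == "TRUE")
test_share  <- mean(toy_test$subscribe == "TRUE")
c(train = train_share, test = test_share)
```

Stratifying keeps the class balance similar in both subsets, which matters when one class is much more common than the other.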

### *Step 2*: Preprocessing Recipe


To ensure the classifier treats all predictors proportionally, I created a recipe that standardizes the features. This process puts played_hours and Age on the same scale, preventing any variable with a larger numeric range, in this case the _played_hours_ variable, from dominating the classifier. `step_center` shifts each variable to have a mean of zero, and `step_scale` divides each variable by its standard deviation.

In [None]:
players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
step_center(all_predictors()) |>
step_scale(all_predictors())
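
The arithmetic behind centering and scaling can be sketched by hand in base R. The values below are hypothetical, not taken from the dataset. Because the recipe is defined on `players_train`, the mean and standard deviation it uses are estimated from the training data and then applied to new data as well.

```r
# Hypothetical played_hours values from a training set.
hours <- c(0, 0.5, 1, 2, 46.5)

# Center (subtract the mean), then scale (divide by the standard deviation).
hours_std <- (hours - mean(hours)) / sd(hours)

mean(hours_std)  # approximately 0
sd(hours_std)    # 1

# Base R's scale() performs the same two operations.
all.equal(hours_std, as.numeric(scale(hours)))
```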


### *Step 3*: Specify KNN Model 

In this step, I define the K-Nearest Neighbors (KNN) model specification. I set the number of neighbors to `tune()`, because this allows me to experiment with different values of k and select the one that provides the best classification model.

The model uses the "kknn" engine and is set to operate in classification mode, as my goal is to predict a categorical outcome (whether the player is subscribed or not).

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
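
To make the idea behind the specification concrete, here is a minimal, illustrative sketch of unweighted KNN classification for a single new point. The function `knn_predict` and the toy data are hypothetical and written only for this example; they are not how the "kknn" engine is implemented. With `weight_func = "rectangular"`, every neighbor's vote counts equally, which is what the majority vote below mimics.

```r
# Minimal unweighted KNN for one new point (illustration only).
# train_x: numeric matrix of standardized predictors, one row per player
# train_y: vector of class labels, one per row of train_x
# new_x:   numeric vector with the same columns as train_x
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest rows
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote; every neighbor counts equally
  votes <- table(nearest)
  names(votes)[which.max(votes)]
}

train_x <- matrix(c(0,   0,
                    0.2, 0.1,
                    0.1, 0.3,
                    2,   2,
                    2.1, 1.9),
                  ncol = 2, byrow = TRUE)
train_y <- c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE")

knn_predict(train_x, train_y, new_x = c(0.1, 0.1), k = 3)  # "TRUE"
knn_predict(train_x, train_y, new_x = c(2, 2.1), k = 3)    # "FALSE"
```

In this sketch a tied vote is resolved in favor of the first class in alphabetical order, which is one reason odd values of k are convenient for a two-class problem.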

### *Step 4*: 5-Fold Cross Validation

I chose to perform 5-fold cross-validation so that every training observation is used for validation exactly once and for training in the other four folds. This approach provides a more reliable estimate of the model’s performance by reducing the influence of any one particular random split.

By repeatedly training and validating the model across different subsets of the data, 5-fold cross-validation also helps me choose a k that generalizes, rather than one that overfits a single validation set.

In [None]:
set.seed(123)
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
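
The bookkeeping behind 5-fold cross-validation can be sketched in base R. This is a simplified illustration with a hypothetical 20 training rows; it ignores stratification and is not how `vfold_cv` is implemented internally.

```r
set.seed(1)
n <- 20   # hypothetical number of training rows
v <- 5    # number of folds

# Randomly assign each row to one of the five folds.
fold_id <- sample(rep(seq_len(v), length.out = n))

# Count how often each row is held out and how often it is trained on.
validation_count <- integer(n)
training_count   <- integer(n)
for (i in seq_len(v)) {
  held_out <- fold_id == i
  validation_count[held_out] <- validation_count[held_out] + 1L
  training_count[!held_out]  <- training_count[!held_out] + 1L
}

table(fold_id)            # 4 rows in each fold
unique(validation_count)  # 1: each row is validated exactly once
unique(training_count)    # 4: each row is trained on in the other four folds
```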

### *Step 5*: Create Grid of K values

I chose to test odd numbers of neighbors from 1 to 19 (the grid starts at 1 and steps by 2). I tried other ranges as well, but the best k seems to fall inside this one.

In [None]:
k_vals <- tibble(neighbors = seq(1, 20, by = 2))
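
For reference, this is what the sequence in the grid evaluates to. The name `k_candidates` is used only for this illustration.

```r
k_candidates <- seq(1, 20, by = 2)

k_candidates          # 1 3 5 7 9 11 13 15 17 19
length(k_candidates)  # 10
```

Because the sequence starts at 1 and steps by 2, only odd values are tried and 20 itself is not included.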

### *Step 6*: Define Workflow and Tune Model

In this step, I create a workflow that combines the preprocessing recipe and the KNN model specification.

In [None]:
players_workflow <- workflow() |>
add_recipe(players_recipe) |>
add_model(knn_spec)

Next, I use this workflow to tune the number of neighbors (k) over the grid of values defined in k_vals. The tuning uses the 5-fold cross-validation folds created in Step 4.

To ensure reproducibility, I set a random seed before tuning:

In [None]:
set.seed(123)
knn_results <- players_workflow |>
tune_grid(resamples = players_vfold, grid = k_vals) |>
collect_metrics()

knn_results 

The knn_results table summarizes how the K-Nearest Neighbors (KNN) classifier performed across different numbers of neighbors (k). For each k, it reports:

- neighbors: The number of neighbors (k) used in the KNN model

- .metric: The performance metric being reported (in this case either accuracy or ROC AUC)

- .estimator: The type of estimator; it reads "binary" throughout the knn_results output because the class being predicted has two levels

- mean: The average value of the metric across the cross-validation folds

- n: The number of resamples used in cross-validation (5 in this analysis)

- std_err: The standard error of the metric’s mean

- .config: An identifier for the preprocessor and model configuration evaluated in that row
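
The `mean` and `std_err` columns can be reproduced by hand from per-fold results. The sketch below uses hypothetical fold accuracies for one value of k, not values from this analysis.

```r
# Hypothetical accuracies from the five validation folds for one value of k.
fold_accuracy <- c(0.70, 0.75, 0.72, 0.68, 0.75)

n_folds  <- length(fold_accuracy)               # n
mean_acc <- mean(fold_accuracy)                 # mean
std_err  <- sd(fold_accuracy) / sqrt(n_folds)   # std_err

c(n = n_folds, mean = mean_acc, std_err = std_err)
```

A small standard error relative to the differences between values of k suggests that the ranking of k values is not just an artifact of how the folds were drawn.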


### *Step 7*: Select Best K by Accuracy

The following code filters the tuning results to show only the accuracy metric for each value of neighbors (k). By isolating accuracy, I can focus on how well the model classifies players at different k values, which helps answer the question. I then pick the k with the highest mean accuracy to define the final model specification.


In [None]:
accuracies <- knn_results |> filter(.metric == "accuracy")
accuracies

In [None]:
best_k <- accuracies |> 
  slice_max(mean, n = 1, with_ties = FALSE) |>  # keep a single k even if accuracies tie
  pull(neighbors)

best_k

In this step I found that the best number of neighbors is 13. The output of _accuracies_ has the same columns as the output of knn_results; the only difference is that it is filtered to the rows where `.metric == "accuracy"`.
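
One detail worth knowing when selecting the best k is that `slice_max()` keeps tied rows by default. The sketch below uses hypothetical tuning results (not the ones from this analysis) in which two values of k share the top accuracy.

```r
library(dplyr)

# Hypothetical tuning results in which two values of k tie for the best accuracy.
toy_accuracies <- tibble(
  neighbors = c(9, 11, 13, 15),
  mean      = c(0.70, 0.74, 0.74, 0.71)
)

# Default behaviour keeps every tied row, so two values of k come back.
k_with_ties <- toy_accuracies |>
  slice_max(mean, n = 1) |>
  pull(neighbors)

# with_ties = FALSE returns exactly one row.
k_single <- toy_accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

k_with_ties
k_single
```

Since the model specification expects a single number of neighbors, returning exactly one value avoids problems if a tie ever occurs.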

### *Step 8*: Finalize Model with Best K

After identifying the optimal number of neighbors (best_k), I finalized the KNN model specification using this value. I then created a workflow that combines the preprocessing steps with the finalized model. Finally, I trained the complete model on the entire training dataset.

In [None]:
final_knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_workflow <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(final_knn_spec)

final_fit <- final_workflow |> fit(data = players_train)

### *Step 9*: Predict on Test Set & Evaluate

After training the final model, I used it to generate predictions on the test dataset. The predicted subscription classes were combined with the actual test data for comparison using the `bind_cols` function. Then, I calculated performance metrics by comparing the predicted classes against the true subscription labels. 

In [None]:
test_predictions <- predict(final_fit, players_test) |> 
bind_cols(players_test)

test_metrics <- test_predictions |> 
metrics(truth = subscribe, estimate = .pred_class)

test_metrics

I would also like to highlight that accuracy is the number of correct predictions divided by the total number of predictions.

### *Step 10*: Confusion Matrix

I created a confusion matrix to better visualize the performance of the classifier. A confusion matrix is a table that compares the predicted classes against the actual classes.

In [None]:
conf_matrix_data <- conf_mat(test_predictions, truth = subscribe, estimate = .pred_class)
conf_matrix_data
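
To show how a confusion matrix relates to accuracy, here is a small base R sketch with hypothetical labels for eight players (not the test set from this analysis).

```r
# Hypothetical true labels and predictions for eight players.
truth      <- factor(c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
prediction <- factor(c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE))

# Rows are predicted classes and columns are true classes.
toy_conf <- table(Prediction = prediction, Truth = truth)
toy_conf

# Accuracy is the diagonal (correct predictions) over the total.
toy_accuracy <- sum(diag(toy_conf)) / sum(toy_conf)
toy_accuracy  # 5 / 8 = 0.625
```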

The following code re-enters the counts from the confusion matrix above as a data frame so I can later plot a bar chart.

In [None]:
conf_matrix_data <- data.frame(
Prediction = c("FALSE", "FALSE", "TRUE", "TRUE"),
Truth = c("FALSE", "TRUE", "FALSE", "TRUE"),
Count = c(3, 3, 13, 40))
conf_matrix_data

I chose to visualize the confusion matrix as a stacked bar chart showing proportions to better understand the relative distribution of predictions within each actual class. This plot highlights the percentage of correct and incorrect predictions for subscribers and non-subscribers separately. By focusing on proportions, it becomes easier to compare how well the model performs across classes regardless of class size differences, providing a clearer picture of classification accuracy and errors.

In [None]:
ggplot(conf_matrix_data, aes(x = Truth, y = Count, fill = Prediction)) +
geom_bar(stat = "identity", position = "fill") +  
labs(title = "Confusion Matrix (In Proportions)", x = "Actual (Truth)",y = "Proportion", fill = "Predicted")


**Figure 2**
Bar chart of the proportions of correct and incorrect predictions made by the classifier within each actual class.
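
The proportions that `position = "fill"` displays can also be computed directly. This sketch reuses the counts from the confusion matrix data frame in this step; the name `conf_counts` is used only here so that nothing in the analysis is overwritten.

```r
# Counts copied from the confusion matrix data frame in this step.
conf_counts <- data.frame(
  Prediction = c("FALSE", "FALSE", "TRUE", "TRUE"),
  Truth      = c("FALSE", "TRUE", "FALSE", "TRUE"),
  Count      = c(3, 3, 13, 40)
)

# Divide each count by the total of its actual (Truth) class.
conf_counts$Proportion <- conf_counts$Count /
  ave(conf_counts$Count, conf_counts$Truth, FUN = sum)

conf_counts
```

Read this way, about 93% of actual subscribers (40 of 43) are predicted correctly, but only about 19% of actual non-subscribers (3 of 16) are.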

The classifier predicted most players as subscribed (TRUE). To check whether this reflects the data, I next look at whether subscribed players make up the larger share of the original data set.

In [None]:
total <- nrow(players)
players_percentage <- players |>
summarize(
subscribers = sum(subscribe == TRUE, na.rm = TRUE),
percent_subscribed = (subscribers / total) * 100)

players_percentage

We can see from this output that the majority of players are indeed subscribed, which could explain why the classifier predicted many more subscribers than non-subscribers.
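
Because the classes are imbalanced, it helps to compare the classifier's accuracy with a simple baseline that labels every player as subscribed. The sketch below uses the test-set counts reported in Step 10; the variable names are chosen for this illustration only.

```r
# Counts from the test-set confusion matrix reported in Step 10.
true_neg  <- 3   # predicted FALSE, actually FALSE
false_neg <- 3   # predicted FALSE, actually TRUE
false_pos <- 13  # predicted TRUE, actually FALSE
true_pos  <- 40  # predicted TRUE, actually TRUE
total     <- true_neg + false_neg + false_pos + true_pos  # 59

# Accuracy of the KNN classifier on the test set
accuracy <- (true_pos + true_neg) / total   # 43 / 59

# Accuracy of a rule that labels every test player as subscribed
baseline <- (true_pos + false_neg) / total  # 43 / 59

round(c(accuracy = accuracy, baseline = baseline), 3)
```

On this particular split the two numbers coincide, so accuracy on its own does not show an improvement over always predicting "subscribed". A different seed or split could give a different comparison.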

### Notes on the Method Chosen to Perform this Data Analysis ###
For this project I used a classification model because the goal is to predict whether a player subscribes based on two variables, age and played hours, which are both numerical. This method is appropriate since I am trying to predict a factor/category. I split the data into training and test sets (70/30) and used 5-fold cross-validation on the training set to tune the number of neighbors.

The main assumption behind the KNN classifier is that observations with similar predictor values tend to belong to the same class. However, one limitation of this model is that it can be affected by noise and randomness in the data (even though I used cross-validation to reduce this effect). 

## Discussion and Conclusion ##

Through this analysis, I found that the classifier achieves an accuracy of approximately 73% on the test set (43 of 59 players classified correctly), meaning it correctly predicts the subscription status about three-quarters of the time.

I was somewhat surprised by this level of accuracy, given that the model relies on only two variables, Age and Played Hours. Intuitively, I expected additional factors to play a significant role in predicting subscription behavior. At first glance this result suggests that Age and Played Hours alone hold considerable predictive power. That said, 43 of the 59 test players are subscribers, so a rule that always predicts "subscribed" would reach the same accuracy; the two variables are therefore likely not the full story.

However, this finding also indicates that relying solely on these two variables may be insufficient for fully understanding or accurately predicting subscription behavior. Important factors beyond Age and Played Hours may influence subscription decisions, and excluding them could lead to incomplete or misleading conclusions.

A key question arising from this analysis is: To what extent could targeted marketing strategies informed by these predictions improve actual subscription conversion rates? Exploring this could help translate predictive insights into practical actions that enhance user engagement.