# Predicting Subscription to a Video Game Newsletter Based on Age and Time Played on the Game

### By: Lila, Lauren, and Khush

## Introduction

Researchers have been collecting data on how gamers play video games by recording play on a 'Minecraft' server that the data scientists created. This data can be used to explore a number of different questions.

The question we aim to answer through this report is *'Can a player's age and time played predict whether they subscribe to a newsletter, using the data in players.csv?'*

The dataset titled "players.csv" will be used to answer this question. It contains information about each player observed. The information included is described in the table below.

| Column Name           | Data Type        | Description                                                       |
|-----------------------|------------------|-------------------------------------------------------------------|
| experience | Categorical (string) | The player's experience level (Beginner, Amateur, Veteran, Pro) |
| subscribe | Boolean | Whether or not the player has subscribed to a game-related newsletter (true or false) |
| hashedEmail | String | The hashed email of the player |
| played_hours | Numerical | The total hours the player has spent playing |
| name | String | The name of the player |
| gender | Categorical | The gender of the player |
| Age | Numerical | The age of the player |

This dataset has 7 columns and 196 rows. 

For this report we will focus on just the age of the player, the time they have spent playing, and whether or not they have subscribed to a game-related newsletter.

## Methods & Results

This section loads and wrangles the data, summarizes it, and creates visualizations before building the classifier.

In this analysis, we investigated whether a player's age and the number of hours they play the game can predict their subscription to a game-related newsletter using K-Nearest Neighbours (KNN) classification.

First, we cleaned the dataset by dropping the columns that are not relevant to the question being investigated: "name", "hashedEmail", "gender" and "experience". We also dropped rows containing "NA" to make the dataset ready for K-NN classification, because rows with missing values would cause errors in the classification.

We then recoded the subscription labels to "Yes/No" instead of "TRUE/FALSE", as this is more easily understood and reduces ambiguity. We also scaled both predictor variables so that the distance metric for K-NN gives them equal weighting.
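To illustrate why scaling matters for K-NN, the sketch below compares Euclidean distances before and after standardization. The three players and their values are hypothetical, made up only for illustration; they are not taken from players.csv, and the helper `euclid()` is likewise an illustrative name.

```r
# Hypothetical toy data (not from players.csv): three made-up players
toy <- data.frame(
  Age          = c(17, 19, 45),
  played_hours = c(0.5, 150, 1)
)

# Euclidean distance between rows i and j of a numeric matrix
euclid <- function(m, i, j) sqrt(sum((m[i, ] - m[j, ])^2))

# Distances in raw units (years and hours)
raw <- as.matrix(toy)
raw_1_2 <- euclid(raw, 1, 2)  # similar age, very different play time
raw_1_3 <- euclid(raw, 1, 3)  # similar play time, very different age

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(raw)
scaled_1_2 <- euclid(scaled, 1, 2)
scaled_1_3 <- euclid(scaled, 1, 3)

round(c(raw_1_2 = raw_1_2, raw_1_3 = raw_1_3,
        scaled_1_2 = scaled_1_2, scaled_1_3 = scaled_1_3), 2)
```

In raw units the play-time difference dominates the distance, simply because hours span a wider numeric range than age in this toy data. After standardization the two distances are of similar size, so both predictors contribute comparably.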

We then made exploratory summaries and visualizations to understand patterns in the data before modeling and classification. This gave us a rough idea of what the expected outcome could be.

Next, we split the data into training (75%) and testing (25%) sets, stratified by subscription status. Using only the training set, we tuned K with 10-fold cross-validation over K = 1 to 10, collected the estimated accuracy for each value, and plotted accuracy against K. We chose the K with the highest estimated accuracy, fit the final model on the training set with that K, and used it to predict subscription status for the testing set.

Finally, we assessed performance on the testing set by computing the prediction accuracy and generating a confusion matrix. Using this information, we made a heatmap to summarize how well the model differentiated subscribers from non-subscribers. These results were then used to answer the original question.

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
#loading and viewing the dataset to take a look at the variables. 
players_original <- read_csv("players.csv")
head(players_original)

In [None]:
# Dropping rows with NA in the variables of interest
# Setting subscribe as a factor and renaming TRUE/FALSE to Yes/No
# Standardizing the predictor variables (as.numeric() keeps them as plain
# numeric columns, since scale() returns a one-column matrix)
# Selecting relevant columns only

players <- players_original |>
    drop_na(played_hours, Age, subscribe) |>
    mutate(subscribe = as.factor(subscribe),
           subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE"),
           Age = as.numeric(scale(Age)),
           played_hours = as.numeric(scale(played_hours))) |>
    select(played_hours, Age, subscribe)

#view changes
head(players)

In [None]:
# Totalling the number of players based on subscription

subscriber_counts <- players |>
    group_by(subscribe) |>
    summarize(count=n())

#view results
subscriber_counts

In [None]:
# Minimum, maximum and mean of players' age and hours played.
# Computed from players_original so the values are in original units
# (years and hours); the columns in players are standardized.
# Age
age_summary <- players_original |>
    summarize(age_min = min(Age, na.rm=TRUE),
             age_max = max(Age, na.rm=TRUE),
             age_mean = mean(Age, na.rm=TRUE))

age_summary

# Hours played
played_hours_summary <- players_original |>
    summarize(played_hours_min = min(played_hours, na.rm=TRUE),
             played_hours_max = max(played_hours, na.rm=TRUE),
             played_hours_mean = mean(played_hours, na.rm=TRUE))

played_hours_summary

In [None]:
#Distribution of Age by subscription Histogram. 
#This helps visualize the age of subscribed vs not subscribed players. 

age_plot <- ggplot(players, aes(x = Age, fill = subscribe)) +
geom_histogram(binwidth = 2, color = "black", position = "dodge") +
labs(x = "Age of Players (standardized)", 
     y = "Count (Number of Players)", 
     title = "Figure 1: Distribution of Age by subscription", 
    fill = "Subscription Status") +
theme(plot.title = element_text(size = 15),
      axis.title = element_text(size = 15),
      legend.title = element_text(size = 15),
      legend.text = element_text(size = 15))

#view the plot
age_plot

In [None]:
#Distribution of played hours by subscription Histogram. 
#This helps visualize whether the amount of time players spend in the game is related to their subscription status. 

hours_played_plot <- ggplot(players, aes(x = played_hours, fill = subscribe)) +
geom_histogram(binwidth = 2, color = "black", position = "dodge") +
labs(x = "Hours Played (standardized)", 
     y = "Count (Number of Players)", 
     title = "Figure 2: Distribution of played hours by subscription", 
    fill = "Subscription Status") +
theme(plot.title = element_text(size = 15),
      axis.title = element_text(size = 15),
      legend.title = element_text(size = 15),
      legend.text = element_text(size = 15))

#view the plot
hours_played_plot

In [None]:
#Age vs Hours Played colored by subscription Scatterplot. 
#This helps visualize and depict how age and hours played relate to subscription together. 

age_vs_hours_played_plot <- ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) +
geom_point(alpha = 0.7) +
labs(x = "Age of Players (standardized)", 
     y = "Hours Played (standardized)", 
     title = "Figure 3: Age vs played hours by subscription", 
    color = "Subscription Status") +
theme(plot.title = element_text(size = 15),
      axis.title = element_text(size = 15),
      legend.title = element_text(size = 15),
      legend.text = element_text(size = 15))

#view the plot
age_vs_hours_played_plot

In [None]:
set.seed(333) #makes the result the same each time

#creating the split
players_split <- initial_split(players, prop = 0.75, strata = subscribe)  
players_train<- training(players_split)   
players_test<- testing(players_split)

head(players_train)
head(players_test)

#creating the recipe
players_recipe <- recipe(subscribe ~ Age + played_hours , data = players_train) |>
   step_scale(all_predictors()) |>
   step_center(all_predictors())

#creating model and fit
knn_tune<- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

#do cross validation
players_vfold <- vfold_cv(players_train, v = 10, strata = subscribe)

k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
knn_results <- workflow() |>
      add_recipe(players_recipe) |>
      add_model(knn_tune) |>
      tune_grid(resamples = players_vfold, grid = k_vals) |>
      collect_metrics()
head(knn_results)

In [None]:
#get accuracies
accuracies <- knn_results |> 
      filter(.metric== "accuracy")

#plot accuracies vs k to find best k
accuracy_versus_k<- ggplot(accuracies, aes(x = neighbors, y = mean))+
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate", title="Figure 4: Plotting Accuracies to Find the Best k") +
      scale_x_continuous(breaks = seq(1, 10, by = 1))  # one tick per candidate k
accuracy_versus_k
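
Rather than reading the best K off the plot, it can also be picked in code. The sketch below uses base R on a small data frame shaped like the `accuracies` table (columns `neighbors` and `mean`). The accuracy values are hypothetical placeholders, not results from this analysis, and `accuracies_demo`, `best_row`, and `best_k` are illustrative names.

```r
# Hypothetical stand-in for the `accuracies` table; the values are made up
accuracies_demo <- data.frame(
  neighbors = 1:5,
  mean      = c(0.60, 0.66, 0.71, 0.69, 0.71)
)

# which.max() returns the first row with the highest mean accuracy,
# so ties are resolved in favour of the smaller K
best_row <- accuracies_demo[which.max(accuracies_demo$mean), ]
best_k <- best_row$neighbors
best_k
```

With the real table, `accuracies |> slice_max(mean, n = 1)` should give a comparable result in tidyverse style, although by default it returns every tied row rather than only the first.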

In [None]:
# Building the final model

# Set the seed
set.seed(999) 

# K-NN specification using the K chosen from Figure 4 (K = 9)
knn_spec <- nearest_neighbor(weight_func="rectangular", neighbors = 9) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_fit <- workflow() |>
      add_recipe(players_recipe) |>
      add_model(knn_spec) |>
      fit(data = players_train)
knn_fit

In [None]:
# Set the seed
set.seed(999)

# Predicting the test data set
players_predictions <- predict(knn_fit, players_test) |>
                        bind_cols(players_test)

# Reporting the accuracy of prediction
players_metrics <- players_predictions |> metrics(truth = subscribe, estimate = .pred_class)
players_metrics

# Reporting the confusion matrix
players_conf_mat <- players_predictions |>
                        conf_mat(truth = subscribe, estimate = .pred_class) 
players_conf_mat
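
The heatmap mentioned in the methods can be drawn from a confusion matrix with yardstick's `autoplot()` method. The sketch below rebuilds the matrix from the four test-set counts reported in the Discussion so that it runs on its own; in the notebook, `players_conf_mat` could presumably be passed to `autoplot()` directly. The names `cm_counts`, `cm_demo`, and `cm_heatmap` are illustrative.

```r
library(yardstick)
library(ggplot2)

# Test-set counts as reported in the Discussion
# (rows = prediction, columns = truth; matrix() fills column by column)
cm_counts <- as.table(matrix(
  c(5, 8,    # truth = No:  predicted No, predicted Yes
    3, 33),  # truth = Yes: predicted No, predicted Yes
  nrow = 2,
  dimnames = list(Prediction = c("No", "Yes"), Truth = c("No", "Yes"))
))

# Convert the table to a yardstick confusion matrix object
cm_demo <- conf_mat(cm_counts)

# Heatmap of predicted vs. true subscription status
cm_heatmap <- autoplot(cm_demo, type = "heatmap") +
  labs(title = "Confusion matrix heatmap (test set)")
cm_heatmap
```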

## Discussion

### Confusion Matrix Summary

The confusion matrix for the testing set, with 'Yes' (subscribed) as the positive class, shows:
- 33 True positives
- 5 True negatives
- 8 False positives
- 3 False negatives

This tells us that the model is better at predicting positives than negatives: it correctly identified 33 of 36 subscribers but only 5 of 13 non-subscribers. This is likely because there are many more subscribers than non-subscribers in the complete data set, which makes the model stronger at predicting the majority class. We can conclude that subscription is predictable, but mostly when the subscription status is 'Yes'. Age and play time help to predict that someone will subscribe, but not that they won't.
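
The counts above can be turned into summary metrics with a few lines of arithmetic. The sketch below is base R and uses only the four reported counts; the variable names are illustrative. It also computes a simple baseline, assumed here to be a classifier that always predicts the majority class ('Yes') on the testing set.

```r
# Test-set counts reported above, with "Yes" (subscribed) as the positive class
tp <- 33; tn <- 5; fp <- 8; fn <- 3
n  <- tp + tn + fp + fn

accuracy   <- (tp + tn) / n     # share of all predictions that are correct
recall_yes <- tp / (tp + fn)    # share of actual subscribers identified
recall_no  <- tn / (tn + fp)    # share of actual non-subscribers identified
precision  <- tp / (tp + fp)    # share of "Yes" predictions that are correct

# Baseline: always predict "Yes", the majority class in the testing set
baseline <- (tp + fn) / n

round(c(accuracy = accuracy, recall_yes = recall_yes, recall_no = recall_no,
        precision = precision, baseline = baseline), 3)
```

The accuracy works out to 38/49 (about 0.78), only slightly above the 36/49 (about 0.73) that the always-'Yes' baseline would achieve, which is consistent with the model leaning heavily on the majority class.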

We expected that two players of similar age and similar play time would have the same subscription status. However, in Figure 3 (Age vs. Play Time), subscription appears to be random, with no clear pattern by age or play time. We also expected the model to be strong at predicting 'Yes': because most players subscribed, a 'Yes' prediction is more likely to be correct.