**DSCI 100 Final Project: Predicting Usage of a Video Game Research Server**

Question #2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

*Research Question: Can we predict playing time based on age?*


In [None]:
#loading libraries for analyses - to install, use install.packages()!
library(tidyverse)
library(readr)
library(tidymodels)
library(scales)
library(janitor)

#preset the max rows shown when displaying data
options(repr.matrix.max.rows = 6)

In [None]:
#find working directory
getwd()

#read in the players.csv dataset using a relative path and
#clean the column names to remove unnecessary capitals
players <- read_delim('Data/players.csv', delim = ',', skip = 1 ) |> clean_names()
players

In [None]:
#setting dimensions for the plots
options(repr.plot.height = 10, repr.plot.width = 10)

#Exploratory plots to better understand the dataset 
playr_time_age_plot <- players |> ggplot(aes(x = age, 
                            y = played_hours)) +
                    geom_point() +  
                    ylab("Total Hours Played") + xlab("Age (in Years)")

playr_time_age_plot 

#many points sit near the x-axis and overlap, which hides detail - this second plot puts the
#y-axis on a log10 scale and makes the points semi-transparent to better examine them
#note: log10(0) is -Inf, so players with 0 hours have no finite position on this axis
playr_time_age_plot_scaled <- players |> ggplot(aes(x = age, 
                            y = played_hours)) +
                    geom_point(alpha = 0.25) +  scale_y_log10() +
                    ylab("Total Hours Played") + xlab("Age (in Years)")

playr_time_age_plot_scaled

#here we can see that most players have fewer than 10 hours played 
#and are under 30 years old (which makes sense for an undergraduate course)
#We also see no visible linear relationship between age and played hours, 
#so a KNN regression is likely a better choice than a simple linear regression to
#try to predict played hours based on age
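
The visual impression of "no linearity" can also be checked numerically. The sketch below is one possible way to do that, not part of the original analysis: it compares the Pearson correlation (linear association) with the Spearman correlation (rank-based, monotonic association). The `toy_players` data frame is a made-up stand-in with the same two column names; in the notebook, `players` would take its place.

```r
# Hypothetical toy data standing in for the real `players` columns
toy_players <- data.frame(
  age          = c(17, 19, 20, 21, 22, 23, 25, 30, NA, 45),
  played_hours = c(0.0, 1.2, 30.5, 0.1, 0.0, 2.3, 0.5, 0.2, 1.0, 0.0)
)

# Pearson measures linear association; Spearman measures monotonic association.
# use = "complete.obs" drops rows where either value is missing.
pearson_r <- cor(toy_players$age, toy_players$played_hours,
                 method = "pearson", use = "complete.obs")
spearman_r <- cor(toy_players$age, toy_players$played_hours,
                  method = "spearman", use = "complete.obs")

c(pearson = pearson_r, spearman = spearman_r)
```

Values close to 0 for both would be consistent with the scatter plots; they would not by themselves show that KNN will predict well.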

Let's clean the data to answer this specific research question by removing unnecessary variables and rows with NAs.

In [None]:
players_knn <- players |> select(age, played_hours) |> na.omit()
players_knn

Now we will build a KNN regression model for our research question and use cross-validation to find the best K value. First, we will split the data into training and testing sets using a 75/25 split.
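
For reference, unweighted KNN regression (the `'rectangular'` weight function used in this notebook) predicts by averaging the outcomes of the K closest training points. The notation below is introduced here for illustration only: $x$ is a new player's age, $N_K(x)$ is the set of the K training players whose ages are closest to $x$, and $y_i$ is the played hours of training player $i$.

```latex
\hat{y}(x) = \frac{1}{K} \sum_{i \in N_K(x)} y_i
```

A small K lets the prediction follow individual players closely, while a large K averages over more players and gives a smoother prediction. Cross-validation is used to choose between these.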

In [None]:
#splitting the training and testing set
#the split is random, so the seed is set first to make it reproducible
set.seed(1010)
knn_split <- initial_split(players_knn, prop = 0.75, strata = played_hours)
knn_training <- training(knn_split)
knn_testing <- testing(knn_split)

In [None]:
#building the recipe (centering and scaling the predictor) and the model specification for tuning
knn_recipe <- recipe(played_hours ~ age, data = knn_training) |> step_center(all_predictors()) |>
    step_scale(all_predictors())
tune_knn_spec <- nearest_neighbor(weight_func = 'rectangular', neighbors = tune()) |> set_engine('kknn') |>
    set_mode('regression')

In [None]:
set.seed(1010) 
#for the purposes of consistency (for grading) I have preset the randomness - remove in real life circumstances!

#creating 5 folds and performing cross-validation to find the best K
vfolds <- vfold_cv(knn_training, v = 5, strata = played_hours)

best_k_wflw <- workflow() |> add_recipe(knn_recipe) |> add_model(tune_knn_spec) 

k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

best_k_metrics <- best_k_wflw |> tune_grid(resamples = vfolds, grid = k_vals) |> collect_metrics() |> filter(.metric == 'rmse')

#with_ties = FALSE guarantees a single K even if two values of K tie on mean RMSE
best_reg_k <- best_k_metrics |> slice_min(mean, n = 1, with_ties = FALSE) |> pull(neighbors)
best_reg_k
##our best K for this KNN regression is K = 5


Now that we have found our best K value (5), we will create a new model specification with that K, fit it on the training data, and evaluate it on our test data.

In [None]:
knn_reg_spec <- nearest_neighbor(weight_func = 'rectangular', neighbors = best_reg_k) |> set_engine('kknn') |>
    set_mode('regression')

#fitting the final model (K = best_reg_k) on the training data
knn_reg_fit <- workflow() |> add_recipe(knn_recipe) |> add_model(knn_reg_spec) |> fit(knn_training)

#predicting on the test data and assessing the RMSPE
knn_reg_test <- knn_reg_fit |> predict(knn_testing) |> 
        bind_cols(knn_testing) 
knn_reg_test
#reusing the predictions above rather than predicting a second time
knn_reg_test_metrics <- knn_reg_test |> 
        metrics(truth = played_hours, estimate = .pred) |> filter(.metric == 'rmse')
knn_reg_test_metrics
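
An RMSPE is easier to judge next to a simple baseline. The sketch below is a possible sanity check rather than part of the original analysis: it computes RMSPE by hand and compares a model against always predicting the training-set mean. All numbers (`truth`, `knn_pred`, `train_mean`) are made up for illustration; in the notebook they would correspond to `knn_reg_test$played_hours`, `knn_reg_test$.pred`, and `mean(knn_training$played_hours)`.

```r
# Hypothetical held-out values and predictions (not results from this notebook)
truth <- c(0.0, 0.5, 1.0, 2.0, 40.0)
knn_pred <- c(1.0, 1.5, 0.5, 3.0, 5.0)
train_mean <- 4.0  # assumed mean of played_hours in the training set

# RMSPE: square root of the mean squared prediction error on held-out data
rmspe <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

knn_rmspe <- rmspe(truth, knn_pred)
baseline_rmspe <- rmspe(truth, rep(train_mean, length(truth)))

c(knn = knn_rmspe, baseline = baseline_rmspe)
```

If the model's RMSPE is not clearly below the baseline's, age is adding little predictive information beyond the overall average.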

In [None]:
#Let's visualize these predictions through a plot

knn_reg_rmse_plot <- players_knn |> ggplot(aes(x = age, y = played_hours)) + geom_point() + 
        geom_line(data = knn_reg_test, mapping = aes(x = age, y = .pred), linewidth = 1, color = 'mediumseagreen') +
        scale_y_log10() + ylab("Hours Played on Minecraft") + xlab("Age (in years)") + 
        theme(text = element_text(size = 15))

knn_reg_rmse_plot

Based on this plot and the RMSPE, we can tell that the KNN regression model does not predict played hours from age in a useful way. This is likely due to the distribution of the data: there are many players in the 20-30 year range and few data points in other ranges. This limits the model's ability to predict at the extremes, because for ages far from the bulk of the data the model keeps selecting the same nearest neighbors and therefore returns nearly the same prediction.
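
The behavior at the extremes can be illustrated with a minimal base R sketch. This is a hand-rolled, unweighted KNN on made-up data, not the `kknn` engine or the project dataset; the `ages`, `hours`, and `knn_predict` names are hypothetical.

```r
# Hypothetical toy data: ages clustered in the 20s with a single older player
ages <- c(18, 19, 20, 21, 22, 23, 24, 25, 26, 45)
hours <- c(0.5, 1.0, 0.2, 3.0, 0.1, 12.0, 0.4, 0.9, 2.5, 0.3)

# Unweighted KNN regression: average the outcome of the k closest training ages
knn_predict <- function(new_age, k) {
  nearest <- order(abs(ages - new_age))[seq_len(k)]
  mean(hours[nearest])
}

# Every query age beyond the oldest player selects the same k neighbors,
# so the prediction stops changing with age
old_ages <- c(50, 60, 70)
preds <- sapply(old_ages, knn_predict, k = 5)
preds
```

In this toy example all three query ages return the same value, because the same five training players are the nearest neighbors each time.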