Final Data Sci 100 Project - Sophia Koronczay 87575445

Introduction

Online games such as Minecraft are hosted on servers, which allows data about their users to be collected. A research group in Computer Science at UBC, led by Frank Wood, collected data such as user engagement and age. Data like this can be valuable to researchers who aim to answer predictive questions about their users or to make calculated changes to the game, software, or advertising. This study aims to answer which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts (Question 2). To investigate this, we address whether there is a relationship between player experience level and total game time. Since experience level is assumed to be self-reported (Minecraft does not have levels) and may be inaccurate, the study also investigates whether age and frequency of play sessions can accurately predict experience level and thus total game time (amount of data). Through this we can identify which players are more likely to contribute large amounts of data (i.e., total play hours).

The variables from players.csv and sessions.csv used to answer this question were hashedEmail (character variable serving as the user ID), experience (categorical variable with levels "Beginner", "Amateur", "Regular", "Pro", or "Veteran"), played_hours (numerical variable of total hours per hashed email), and Age (numerical variable). Some of these contained NA values, which were removed. The other columns (start_time, end_time, original_start_time, original_end_time, subscribe, name, gender) were not used. There was a total of 196 observations once both data sets were joined by hashed email. Additionally, several play sessions were recorded per user ID, so these were aggregated prior to modeling.

Methods and Results
To begin, the necessary libraries and data were loaded in.

In [None]:
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(repr)
library(GGally)
options(repr.matrix.max.rows = 6)

Next, the data were read in and initial wrangling was completed to join the two separate data frames by matching on "hashedEmail", which functions as the user's ID.

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")
sessions
players
# join data frames
all_data <- inner_join(sessions, players, by = "hashedEmail")
all_data
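
As a rough illustration of what the join does, the sketch below uses two tiny made-up tables (toy_players and toy_sessions are hypothetical stand-ins, not the real files). It suggests two things to keep in mind: an inner join keeps one row per session, and players with no recorded sessions drop out of the result.

```r
library(dplyr)

# Toy stand-ins for players.csv and sessions.csv (all values are made up)
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  experience   = c("Pro", "Amateur", "Veteran"),
  played_hours = c(2.0, 0.5, 0.0)
)
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = c("01/01/2024 10:00", "02/01/2024 11:00", "03/01/2024 09:00")
)

# Inner join: keeps only hashed emails present in BOTH tables
toy_joined <- inner_join(toy_sessions, toy_players, by = "hashedEmail")

nrow(toy_joined)                    # one row per session
n_distinct(toy_joined$hashedEmail)  # player "c" has no sessions and is dropped
```

If players without sessions should be kept, a left join from the players table would be one alternative, at the cost of NA values in the session columns.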

The initial goal was to check visually, through a bar graph, whether there is a trend between experience level and played hours; i.e., we want to identify which experience groups would give the most data. To begin, experience must be reordered to run from the most beginner to the most experienced player.

In [None]:
all_data <- all_data |>
    mutate(experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Pro", "Veteran")))

In the process of wrangling it was important to check to see if played_hours is per session or per player.

In [None]:
all_data |>
    count(hashedEmail) |>
    arrange(desc(n))

Each hashed email appears many times, once per recorded session. played_hours is treated here as a per-session value, so to get total play time it must be summarized into one number per player. This is achieved by grouping by hashed email (along with the other predictors used later in the analysis) and computing the sum of played hours per email. Additionally, the number of sessions is calculated by counting the occurrences of hashedEmail and stored as session_count for later analysis.

In [None]:
play_time <- all_data |>
    group_by(hashedEmail, experience, Age) |>
    summarise(total_hours = sum(played_hours, na.rm = TRUE),
              session_count = n()) |>
    ungroup()
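
One way to double-check whether played_hours is recorded per session or per player is to count how many distinct values it takes within each hashed email. The sketch below uses made-up data (toy_joined is hypothetical) in which the value is a per-player total repeated on every session row. In that situation distinct_hours is 1 for every player, and summing would overcount.

```r
library(dplyr)

# Toy joined data (made-up values): one row per session
toy_joined <- tibble(
  hashedEmail  = c("a", "a", "a", "b", "b"),
  played_hours = c(2.0, 2.0, 2.0, 0.5, 0.5)
)

hours_check <- toy_joined |>
  group_by(hashedEmail) |>
  summarise(
    n_sessions     = n(),
    distinct_hours = n_distinct(played_hours), # 1 means the value repeats
    sum_hours      = sum(played_hours),        # inflated if the value repeats
    first_hours    = first(played_hours)       # the per-player total in that case
  )
hours_check
```

If the real data showed distinct_hours equal to 1 for players with many sessions, taking first(played_hours) instead of the sum would likely be the safer aggregation. This is an assumption to verify against the actual files, not a confirmed property of the data set.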

To explore which experience levels contribute the most total play time, we visualize the average total hours of play time by experience level.

In [None]:
play_time_plot <- play_time |>
    group_by(experience) |>
    summarise(avg_hours = mean(total_hours, na.rm = TRUE)) |>
    ggplot(aes(x = experience, y = avg_hours)) +
    geom_col(alpha = 0.6) +
    xlab("Experience Level") + ylab("Average Total Time Played (Hrs)") +
    ggtitle("Average Total Time Played Across All Sessions Versus Player Experience Level") +
    theme(text = element_text(size = 12))
play_time_plot
# The plot lets me identify that "Regulars" and "Amateurs" have the most game time, thus providing the most data.
# The reason I want to predict experience level from age is in case experience levels are self-reported and not accurate.
# I want to establish a trend with a variable that is concrete and not subject to reporting bias.
# The question now becomes: What kinds of players by age group are most likely to fall into experience levels associated with the highest total game time/amount of data?

Figure 1 - Average total hours by experience level, where x = experience level ordered from least to most experienced and y = average total time played. Amateur and Regular have the highest mean total time played across n=196.

Now that the experience levels associated with the highest total play times have been identified, KNN classification can be used to see if age and number of play sessions can predict the experience levels likely to yield large amounts of data (i.e., high total play time). To begin, rows with missing values are removed from the already selected data of total play time, session_count, age, and experience. A seed is set so that the model is reproducible.

In [None]:
play_time_clean <- play_time |>
    filter(!is.na(Age), !is.na(experience), !is.na(session_count))
set.seed(2020)

Next, the data is split into training and testing sets with a proportion of 75% training to 25% testing. The split is stratified by experience to maintain the distribution of experience levels across both sets.

In [None]:
data_split <- initial_split(play_time_clean, prop = 0.75, strata = experience)
train_data <- training(data_split)
test_data <- testing(data_split)
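
To see what stratifying does, here is a minimal sketch on made-up data (the toy tibble and its 60/40 class mix are assumptions for illustration only). With strata set, the class proportions in the training and testing sets should stay close to those of the full data.

```r
library(rsample)
library(dplyr)

set.seed(1)

# Toy data (made up): 60 Amateur and 40 Regular players
toy <- tibble(
  experience = factor(rep(c("Amateur", "Regular"), times = c(60, 40))),
  Age        = sample(10:40, 100, replace = TRUE)
)

# Stratified 75/25 split: sampling is done within each experience level
toy_split <- initial_split(toy, prop = 0.75, strata = experience)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Compare class counts in each set with the 60/40 mix of the full data
count(toy_train, experience)
count(toy_test, experience)
```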

Next, the number of observations for each experience level is checked to ensure the folds chosen for cross-validation contain all experience levels.

In [None]:
count(train_data, experience)

Since there are only n=9 Pro players, I will reduce the folds from 5 to 3 in the cross-validation to ensure that every experience level is in each fold.

Next, the recipe that specifies the variables in this classification and standardizes the numerical predictors was created.

In [None]:
experience_recipe <- recipe(experience ~ Age + session_count, data = train_data) |>
    step_center(all_predictors()) |>
    step_scale(all_predictors())
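
Standardizing matters for KNN because the method relies on distances, so a predictor with a large range (such as session counts) could otherwise dominate one with a small range. The sketch below reproduces centering and scaling by hand in base R on made-up values: subtract the mean, then divide by the standard deviation.

```r
# Toy predictor values (made up for illustration)
age           <- c(12, 17, 21, 25, 40)
session_count <- c(1, 3, 2, 50, 4)

# Centre and scale by hand: subtract the mean, divide by the standard deviation
age_std      <- (age - mean(age)) / sd(age)
sessions_std <- (session_count - mean(session_count)) / sd(session_count)

# After standardizing, each predictor has mean 0 and standard deviation 1
c(mean = mean(age_std), sd = sd(age_std))
c(mean = mean(sessions_std), sd = sd(sessions_std))
```

In the recipe, the means and standard deviations are estimated from the training data only, which keeps information from the test set out of the preprocessing.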

Next, the KNN model with tuneable number of neighbors is made so that the optimal number for K can be selected.

In [None]:
model_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
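
With weight_func = "rectangular", every one of the K nearest neighbors gets an equal vote. The base R sketch below walks through that idea by hand on made-up, already standardized points. It is an illustration of the voting rule, not the kknn implementation itself.

```r
# Toy training points (made up), already on a standardized scale
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.9,  1.0,
                     1.1,  0.7,
                     0.1, -0.2), ncol = 2, byrow = TRUE)
train_y <- c("Amateur", "Amateur", "Regular", "Regular", "Amateur")

new_x <- c(-0.6, 0.0)  # the point to classify
k     <- 3

# Euclidean distance from the new point to every training point
dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))

# Take the k closest points and let each cast one equal vote
nearest   <- order(dists)[1:k]
votes     <- table(train_y[nearest])
predicted <- names(votes)[which.max(votes)]
predicted
```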

Next, a 3-fold cross-validation object to evaluate model performance is created, as well as a grid of K values to tune.

In [None]:
numbervfold <- vfold_cv(train_data, v = 3, strata = experience)
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by =1))

The specified model and recipe are implemented into a workflow with tune to determine the K with highest accuracy for the variables in question. The accuracy of different K values is extracted from the workflow and the number of neighbors is treated as numeric.

In [None]:
tune_data <- workflow() |>
    add_recipe(experience_recipe) |>
    add_model(model_spec) |>
    tune_grid(resamples = numbervfold, grid = k_vals)

number_metrics <- tune_data |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    mutate(neighbors = as.double(neighbors)) |>
    filter(!is.na(mean))

A cross-validation plot was then used to visualize the accuracy of different values of K for this model.

In [None]:
cross_val_plot <- ggplot(number_metrics, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Number of Neighbors (K)", y = "Accuracy Estimate", title = "KNN Cross-Validation Accuracy") +
    scale_x_continuous(breaks = seq(1, 10, by = 1))
cross_val_plot

Figure 2 - KNN cross-validation accuracy for the model predicting experience level from number of play sessions and age, revealing the highest accuracy of about 33% at K=5.
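
Rather than reading the best K off the plot, it can also be pulled out of the metrics table directly. The sketch below uses a made-up table (toy_metrics holds hypothetical accuracies, not the values from this analysis) with the same neighbors and mean columns as number_metrics.

```r
library(dplyr)

# Hypothetical accuracy estimates, NOT the results of this analysis
toy_metrics <- tibble(
  neighbors = 1:5,
  mean      = c(0.20, 0.25, 0.22, 0.30, 0.28)
)

# Keep the row with the highest mean accuracy and extract its K
best_k <- toy_metrics |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

With only three folds and a small training set, differences between neighboring values of K may be within noise, so the selected K is best treated as a reasonable choice rather than a clear optimum.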

The final model then was defined with K=5 based on the KNN Cross-Validation Accuracy. Then a final fit on the training data was completed. 

In [None]:
# best K to use is: K=5

final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")

final_workflow <- workflow() |>
    add_recipe(experience_recipe) |>
    add_model(final_spec)

final_fit <- fit(final_workflow, data = train_data)

Lastly, predictions were made on the test set and evaluated through the prediction metrics as well as a confusion matrix, to see whether precision or recall add anything significant for later analysis.

In [None]:
test_predictions <- predict(final_fit, new_data = test_data) |>
    bind_cols(test_data)

test_metrics <- test_predictions |>
    metrics(truth = experience, estimate = .pred_class)
test_metrics

In [None]:
final_conf_mat <- test_predictions |>
    conf_mat(truth = experience, estimate = .pred_class)
final_conf_mat

Figure 3 - Confusion matrix for experience levels predicted by number of play sessions and age.
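
To make the link between a confusion matrix and accuracy, recall, and precision concrete, here is a small base R sketch on made-up labels (truth and pred are hypothetical, not the test set results). Rows are predictions and columns are true classes, the same orientation conf_mat prints.

```r
# Toy true and predicted labels (made up for illustration)
lvls  <- c("Amateur", "Regular", "Pro")
truth <- factor(c("Amateur", "Amateur", "Regular", "Regular", "Regular", "Pro"),
                levels = lvls)
pred  <- factor(c("Amateur", "Regular", "Regular", "Regular", "Amateur", "Regular"),
                levels = lvls)

# Confusion matrix: rows = predicted class, columns = true class
cm <- table(Prediction = pred, Truth = truth)

accuracy  <- sum(diag(cm)) / sum(cm)  # share of all predictions that are correct
recall    <- diag(cm) / colSums(cm)   # per class: correct / number truly in class
precision <- diag(cm) / rowSums(cm)   # per class: correct / number predicted as class

cm
accuracy
recall
precision  # NaN for a class that is never predicted (0 / 0)
```

A class that the model never predicts has undefined precision and zero recall, which is one way a rare level such as Pro can disappear from the predictions entirely.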

Discussion 

The goal of this analysis was to determine which players contribute the most gameplay data, measured as total play time by experience level, and whether the experience of players could be predicted from more concrete variables that are not self-reported, to back up that finding. As seen in Figure 1, Regular and Amateur players contributed, on average, the most total play time across all experience levels. This means these categories of players would contribute the most to the quantity of gameplay data collected, since their play time is the longest.

Since experience level is likely self-reported (a game like Minecraft does not have levels), the KNN model was used to see if other predictors could help identify which players fall into Regular and Amateur, thus predicting large data contributors. As seen in Figure 2, using age and number of sessions as predictors, the cross-validation accuracy was best at K=5, at about 33%. The model also performed poorly on the test set, with an accuracy of 18% and a negative kappa (kap), meaning its agreement with the true labels was worse than expected by chance. Lastly, the confusion matrix in Figure 3 shows that the model misclassifies the majority of the experience levels, especially Regular, which is important for this analysis. The low accuracy and inability to predict experience level suggest that the model is poor and that age and session count alone are not sufficient to classify experience level through KNN classification. This is not what was expected, as the selected predictors seemed well aligned with providing a related way to label the data.

If using this analysis alone, the researchers could likely rely on Regular and Amateur players to contribute, on average, the most total play time across all experience levels. However, further data analysis should be done to see whether more predictors, such as session duration or in-game stats, can make this process more robust, with a model that supports the accuracy of this prediction. Once this is achieved, the research could be supplemented with further investigation into questions such as whether experience level changes over time with increased gameplay, which would alter the model's accuracy if experience is only reported once.