This project investigates whether player characteristics such as age, gender, experience, and newsletter subscription status can predict how many hours a player contributes to the Minecraft server.

Introduction: A research group in the Faculty of Computer Science at the University of British Columbia, led by Frank Wood, collected data from a Minecraft server to understand how people play video games. One of the main data sets collected is the players file, which lists every player along with their experience level (ranging from amateur to pro), whether or not they are subscribed, their hashed email, the number of hours they played, their name, gender, and age.

My project builds a linear regression with the explanatory variables experience, subscription, gender, and age to answer the question: “Can the variables collected in the players data set predict how many hours different players will spend on the server?” The response variable of interest is the number of hours played.

STEP 1: Loading Libraries and Visualizing Data

In [1]:
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(RColorBrewer)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.6     ✔ rsample

Following this, we can load and preview the data.

In [2]:
players <- read_csv("Data/players.csv")

head(players)
tail(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Amateur,True,644fe4200c19a73768b3fa598afcbd0948f7557925b7f17166285da23af31cc6,0.0,Rhys,Male,20.0
Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890b93ca52ac1dc76b08f,0.0,Bailey,Female,17.0
Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778b35c5802c3292c87bd,0.3,Pascal,Male,22.0
Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Dylan,Prefer not to say,17.0
Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17.0
Pro,True,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,


In [3]:
summary(players)

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 8.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :20.52  
                                       3rd Qu.:22.00  
                                       Max.   :50.00  
                               

The players data set contains 196 observations. In this analysis we will ignore the name and hashedEmail variables, as they are not useful for the regression. The experience variable records each player's experience level and takes one of five values:
“Amateur”, “Beginner”, “Pro”, “Regular”, and “Veteran”.

The subscribe variable is a simple TRUE or FALSE value. The played_hours variable records the time played in hours, and the Age variable contains the age of each player. Finally, the gender variable takes one of the following values:
“Agender”, “Female”, “Male”, “Non-Binary”, “Other”, “Prefer not to say”, and “Two-Spirited”.


The played_hours variable has a median of 0.10 hours and a mean of 5.85 hours, with a maximum of 223.1 hours. A mean this far above the median indicates a heavily right-skewed distribution. This becomes important later on, as we need to adjust how we build the regression to account for such extreme values.
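The effect of a single extreme value on the mean can be sketched with a small made-up vector. The numbers below are hypothetical and are not taken from the players file; the 20-hour cutoff is used only because it is the threshold applied later in this report.

```r
# Hypothetical toy vector (NOT the real players data):
# mostly small values plus one extreme observation.
hours <- c(0, 0, 0.1, 0.1, 0.3, 0.6, 1.2, 223.1)

# Summaries on the full vector
mean_all <- mean(hours)
median_all <- median(hours)

# Same summaries after dropping values above an illustrative 20-hour cutoff
hours_capped <- hours[hours <= 20]
mean_capped <- mean(hours_capped)
median_capped <- median(hours_capped)

print(c(mean_all = mean_all, median_all = median_all))
print(c(mean_capped = mean_capped, median_capped = median_capped))
```

In this toy case the median barely moves when the extreme value is removed, while the mean drops sharply, which is the same pattern the summary of played_hours hints at.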


Methods and Results: 


The main method I used in my analysis was linear regression. To start, I loaded the libraries I needed: tidyverse, tidymodels, ggplot2, and RColorBrewer.

I then ran a summary of the data. The gap between the mean and the median of played hours made me realize that it would be important to filter the data by played hours, because very few players logged more than a few hours.


STEP 2: Data Cleaning and Preparation

The second step was data wrangling: 


In [26]:
players_clean <- select(players, -hashedEmail, -name)
head(players_clean)

experience,subscribe,played_hours,gender,Age
<chr>,<lgl>,<dbl>,<chr>,<dbl>
Pro,True,30.3,Male,9
Veteran,True,3.8,Male,17
Veteran,False,0.0,Male,17
Amateur,True,0.7,Female,21
Regular,True,0.1,Male,21
Amateur,True,0.0,Female,17


I first dropped the name and hashedEmail columns because they did not provide any information needed for my analysis. I then noticed that the columns gender, experience, and subscribe were not factors, which I needed to correct before performing the analysis. The data was already tidy, so no further reshaping was needed.


In [4]:
players_clean <- players_clean |>
    mutate(
        experience = as.factor(experience),
        subscribe = as.factor(subscribe), 
        gender = as.factor(gender)) |>
    rename(age = Age)
head(players_clean)

ERROR: Error in eval(expr, envir, enclos): object 'players_clean' not found


Now the categorical variables we want to use are factors, which will help with our linear regression. I also renamed Age to age for consistency with the other column names.

STEP 3: Visualizing and Final Cleaning

In this step I wanted to get an idea of what the data looked like before fitting a linear regression. I considered linear regression a reasonable choice because played_hours is a continuous numeric variable and the goal was to predict a continuous outcome from a set of known predictors.

To graph the data, I put age, the quantitative predictor, on the x axis and the number of hours played on the y axis, and used color and shape to represent the categorical variables experience and gender. I used a scatter plot with labels for every axis and legend, a colour-blind-friendly palette, and a y axis limited to 20 hours. I chose 20 hours because it seemed like a reasonable upper limit for time spent on a Minecraft server and it matched where most of the collected data lies. The plot suggested that there could be a negative linear relationship between age and hours played, which led me to go ahead with the linear regression. After that I filtered played hours to 20 or under and created a new summary, which now contains 186 observations and has a new mean of 0.70, closer to the median.

In [5]:
players_clean_plot <- players_clean |>
    ggplot(aes(x = age, y = played_hours, color = experience, shape = gender)) +
    geom_point(alpha = 0.7) +
    labs(x = "Age", y = "Time Played (in Hours)", color = "Experience Level",
        shape = "Gender", title = "Played Hours vs. Age") +
    theme(text = element_text(size = 12)) +
    ylim(0, 20) +
    scale_color_brewer(palette = "Dark2")

players_clean_plot

ERROR: Error in eval(expr, envir, enclos): object 'players_clean' not found


Looking only at age versus time played, the graph above does not show much correlation in the data. However, we learned something important: most of the recorded play time is concentrated between 0 and 20 hours. The few observations outside that range are outliers that could greatly affect our linear regression, so we will not include them in our model.

In [6]:
players_clean <- players_clean |>
    filter(played_hours <= 20)

summary(players_clean)

ERROR: Error in eval(expr, envir, enclos): object 'players_clean' not found


STEP 4: Building and Evaluating Linear Regression Model

The final step was to build and evaluate the linear regression model. As mentioned above, this method is appropriate given the trend shown by the graph and the nature of the variables. I assumed a linear relationship between the predictors and the response variable, and that the data were collected as independent, random samples. I am also assuming that there is no multicollinearity, meaning that the predictor variables are not highly correlated with one another. The weaknesses of this model lie in any of these assumptions being wrong, and in the data having non-linear trends that the model cannot capture.


In [30]:
set.seed(32)

players_split <- initial_split(players_clean, prop = 0.75)
players_training <- training(players_split)
players_testing <- testing(players_split)

I set a seed and created a 75/25 split into training and testing data. With the training data I created a recipe and a model specification for the linear model:
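The idea behind the split can be sketched in base R. This is only an illustration under stated assumptions: `toy_players` is a hypothetical stand-in for players_clean, and base R's `sample()` will not reproduce the exact rows that `initial_split()` selects, even with the same seed value.

```r
set.seed(32)  # same seed value as the notebook; the resulting rows will differ from rsample's

# Hypothetical stand-in for players_clean (made-up values)
toy_players <- data.frame(
  played_hours = round(runif(40, min = 0, max = 20), 1),
  age = sample(8:50, size = 40, replace = TRUE)
)

# Randomly pick 75% of the row indices for training
n <- nrow(toy_players)
train_rows <- sample(seq_len(n), size = floor(0.75 * n))

# Training rows are the sampled ones; testing rows are everything else
toy_training <- toy_players[train_rows, ]
toy_testing <- toy_players[-train_rows, ]

print(c(training = nrow(toy_training), testing = nrow(toy_testing)))
```

Because the testing rows are never used when fitting, the error measured on them estimates how the model would perform on players it has not seen.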


In [31]:
players_recipe <- recipe(played_hours ~ age + gender + experience + subscribe, data = players_training) |>
    step_dummy(all_nominal_predictors())

lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

An important part of this step is the use of step_dummy. Since some of our predictors are factors, step_dummy converts each of them into numeric 0/1 indicator columns (one per factor level other than the reference level), which allows those factors to be used as predictors in the model. I learned about this function from https://recipes.tidymodels.org/reference/step_dummy.html. The rest is a standard recipe and linear regression specification. I then fitted the model and got ready to evaluate it on the testing data.
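The kind of encoding that dummy variables produce can be sketched with base R's `model.matrix()`. This is an analogy rather than what the recipe runs internally: the data frame `toy` is made up, and base R names the columns differently from step_dummy (for example `experiencePro` instead of `experience_Pro`).

```r
# Hypothetical toy data with two factor predictors (made-up values)
toy <- data.frame(
  age = c(17, 21, 19, 25),
  experience = factor(c("Amateur", "Pro", "Veteran", "Pro")),
  subscribe = factor(c(TRUE, FALSE, TRUE, TRUE))
)

# model.matrix() expands each factor into 0/1 indicator columns.
# The first level of each factor ("Amateur", "FALSE") is the reference
# level and gets no column of its own.
design <- model.matrix(~ age + experience + subscribe, data = toy)

print(design)
```

A row with 0 in every experience column is an "Amateur" player, so each experience coefficient is read as a difference from that reference level.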


In [32]:
players_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(lm_spec) |>
    fit(data = players_training)

players_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
[3mPreprocessor:[23m Recipe
[3mModel:[23m linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
             (Intercept)                       age             gender_Female  
                 7.09777                  -0.02125                  -5.92770  
             gender_Male         gender_Non.binary              gender_Other  
                -5.87636                  -5.92750                        NA  
gender_Prefer.not.to.say       gender_Two.Spirited       experience_Beginner  
                -5.77631                  -6.59794                  -0.40387  
          experience_Pro        experience_Regular        experience_Veteran  
                -0.47043                  -0.2935

I did not create a visualization of the fitted model because a linear regression with several predictors, most of them categorical, cannot be shown in a single plot that is easy to interpret.
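The NA coefficient for gender_Other in the fit above means that lm could not estimate that term. One possible explanation, which is an assumption and was not checked against the training split, is that no usable training row had that gender level, leaving its indicator column all zeros. A minimal sketch with made-up data shows how such a column produces an NA coefficient:

```r
# Hypothetical training data (made-up values) with pre-built indicator columns.
# gender_Other is 0 for every row, i.e. no row carries that level.
toy <- data.frame(
  hours = c(1.0, 0.5, 2.0, 0.2, 1.5, 0.8),
  age = c(17, 21, 19, 25, 30, 22),
  gender_Male = c(0, 1, 1, 0, 1, 0),
  gender_Other = c(0, 0, 0, 0, 0, 0)
)

fit <- lm(hours ~ ., data = toy)

# An all-zero column carries no information, so lm() reports NA for it
coefs <- coef(fit)
print(coefs)

# The rank counts only the coefficients that could be estimated
print(fit$rank)
```

Predicting from a fit like this is what R calls a prediction from a rank-deficient fit; the other coefficients are still estimated as usual.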


In [35]:
players_predictions <- predict(players_fit, players_testing) |>
    bind_cols(players_testing)

# Keep only the RMSE row; computed on the testing set, this is the RMSPE
players_rmspe <- players_predictions |>
    metrics(truth = played_hours, estimate = .pred) |>
    filter(.metric == "rmse") |>
    select(.estimate)

players_rmspe

“prediction from rank-deficient fit; consider predict(., rankdeficient="NA")”


.estimate
<dbl>
3.04623


Discussion: 
This gave an RMSPE of 3.05 hours, which is not ideal given the very low mean of played hours. It means that the typical prediction error was around 3 hours. I expected the predictions to be much closer and the error to be lower. The most likely problem is that the 20-hour ceiling was still too high, and I should have used an even lower threshold to focus on the range where most of the played-hours data lies. This finding could still be helpful in the future: it could lead to questions about the relationship between play time and skill level, and about whether the best players are the ones who spend the most time playing a game.
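For reference, the RMSPE is the square root of the mean squared difference between the observed and predicted values on the testing set. The sketch below computes it by hand on made-up numbers (not the project's predictions) and compares it with the mean absolute error, to show that a few large misses can dominate the RMSPE.

```r
# Hypothetical observed values and predictions (made-up numbers)
truth <- c(0.0, 0.1, 0.7, 3.8, 12.0)
pred <- c(0.9, 1.1, 0.8, 1.0, 2.0)

# Root mean squared prediction error: squares each miss before averaging
rmspe <- sqrt(mean((truth - pred)^2))

# Mean absolute error: averages the size of each miss directly
mae <- mean(abs(truth - pred))

print(c(rmspe = rmspe, mae = mae))
```

In this toy example the single 10-hour miss pushes the RMSPE well above the mean absolute error, so an RMSPE of about 3 hours is better read as a typical error size weighted toward the largest misses than as a plain average.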


In [34]:
summary(players$played_hours)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.100   5.846   0.600 223.100 

GitHub Link: https://github.com/Javi-b32/DSCI-100--Final-Project.git

References 

https://recipes.tidymodels.org/reference/step_dummy.html
