This project investigates whether player characteristics such as age, gender, experience, and newsletter subscription status can predict how many hours a player contributes to the Minecraft server.

STEP 1: Loading Libraries and Visualizing Data

In [None]:
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(RColorBrewer)

Next, we load the data and preview its first and last rows.

In [None]:
players <- read_csv("Data/players.csv")

head(players)
tail(players)

As shown above, the data contains each player's experience level (from Amateur to Pro), whether or not they are subscribed, their hashed email, their played hours, and their name, gender, and age.

STEP 2: Data Cleaning and Preparation

Now that we have the data loaded, we can see that it requires some cleaning. Since we are not using every column, we can drop the ones we do not need, such as hashedEmail and name.

In [None]:
players_clean <- select(players, -hashedEmail, -name)
head(players_clean)

Since our goal is to create a linear regression that attempts to predict played_hours from the other variables, we need to turn categorical variables such as experience, gender, and subscription status (subscribe) into factors.

In [None]:
players_clean <- players_clean |>
    mutate(
        experience = as.factor(experience),
        subscribe = as.factor(subscribe), 
        gender = as.factor(gender)) |>
    rename(age = Age)
head(players_clean)

Now the variables we want to use are factors, which will help with our linear regression. I also renamed Age to age for consistency with the other column names.
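
To illustrate why factors matter for a regression, here is a minimal sketch using a small made-up data frame (toy_players and its values are hypothetical, not taken from players.csv). It uses base R's model.matrix() to show how a factor column is expanded into 0/1 indicator columns when a model formula is evaluated.

```r
# Hypothetical toy data; the values are illustrative only
toy_players <- data.frame(
    played_hours = c(0.5, 3.0, 1.2, 7.5),
    age = c(17, 21, 19, 25),
    experience = factor(c("Amateur", "Pro", "Amateur", "Pro"))
)

# Factor levels are sorted alphabetically by default
levels(toy_players$experience)

# Build the design matrix a linear model would use for this formula;
# the first level ("Amateur") is the baseline and gets no column of its own
design <- model.matrix(played_hours ~ age + experience, data = toy_players)
colnames(design)
```

Each non-baseline level gets its own indicator column, so a coefficient on such a column is read as a difference relative to the baseline level.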

STEP 3: Building and Evaluating Linear Regression Model

Before building the components of the regression, it is useful to get an idea of what the data looks like. First, I will plot hours played against the age of the participants, using colour to differentiate the players' experience levels and shape to show gender. This will give us a better understanding of the data before preparing the regression.

In [None]:
players_clean_plot <- players_clean |>
    ggplot(aes(x = age, y = played_hours, color = experience, shape = gender)) +
    geom_point(alpha = 0.7) +
    labs(x = "Age", y = "Time Played (in Hours)", color = "Experience Level",
        shape = "Gender", title = "Played Hours vs. Age") +
    theme(text = element_text(size = 12)) +
    # Limit the y-axis to 0-20 hours; points outside this range are dropped from the plot
    ylim(0, 20) +
    scale_color_brewer(palette = "Dark2")

players_clean_plot

The graph above suggests that, looking at age against time played alone, there is little visible correlation in the data. However, we learnt something important: most of the recorded play time is concentrated at low values, between roughly 0 and 20 hours. The few players above that range are outliers that could strongly affect our linear regression, so we will not include them in our model.

In [None]:
players_clean <- players_clean |>
    filter(played_hours <= 20)
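
Before dropping rows with a cutoff like this, it can help to check how many observations it removes. The following is a minimal sketch on a small made-up tibble (toy_players and its values are hypothetical); the 20-hour cutoff mirrors the filter above. Note that filter() also drops rows where the condition evaluates to NA, so any missing played_hours values are removed as well.

```r
library(dplyr)

# Hypothetical toy data; the values are illustrative only
toy_players <- tibble(played_hours = c(0.1, 2.5, 18.0, 45.3, 150.0, NA))

cutoff <- 20

# Count how many rows fall on each side of the cutoff, and how many are missing
toy_counts <- toy_players |>
    summarise(
        n_total = n(),
        n_kept = sum(played_hours <= cutoff, na.rm = TRUE),
        n_above = sum(played_hours > cutoff, na.rm = TRUE),
        n_missing = sum(is.na(played_hours))
    )
toy_counts

# Apply the same kind of filter; the NA row is dropped along with the large values
toy_filtered <- toy_players |>
    filter(played_hours <= cutoff)
nrow(toy_filtered)
```

Running the same summary on players_clean before filtering would show how much of the real data the cutoff excludes.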

After adjusting 