This project aims to investigate whether age and recorded playing time of a player are accurate predictors of whether a player is subscribed to the Minecraft newsletter. 

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
source('cleanup.R')

R packages that may be used in the project are loaded here. The following dataset will be used in this investigation.

In [None]:
players_data <- read_csv("data/players.csv")
slice(players_data, 1:8)

The preview of the dataset here was generated by slicing the first 8 rows.

In [None]:
summary(players_data)

(1) Data Description:

This project uses the "players.csv" dataset, which contains information about players who participated in a video game research study and their activity on a designated Minecraft server. The dataset contains 196 observations of players and 7 variables. The following 7 variables are recorded for each player.

- experience (categorical) - A rating of how experienced the player is at Minecraft, based on four descriptors (Amateur, Regular, Veteran, Pro), stored as a character data type.
- subscribe (categorical) - Whether a player is subscribed to the Minecraft newsletter, stored as a logical data type.
- hashedEmail (identifier) - Emails of players transformed into a character string.
- played_hours (quantitative) - Time spent on the Minecraft server in hours. This is the only player behaviour variable recorded, and it is stored as decimals in a double data type.
- name (identifier) - Name of players, stored as a character data type.
- gender (categorical) - Gender of players as male or female, stored as a character data type.
- Age (quantitative) - Age of players as whole numbers, stored as a double data type.

The dataset is tidy because each row holds a single player's observation, each column holds a single variable, and each cell holds a single value (a number, descriptor, or logical value). Based on the output of the 'summary()' function, there are no missing values ("NA") in the dataset, so no observations need to be removed. A potential problem that cannot be checked from the dataset itself is that it is not known whether the observations were recorded in a random order, so the rows may need to be shuffled, for example with the 'sample()' function, before the dataset is split into training and testing sets.
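The two checks described above (counting missing values and shuffling the row order) could be sketched as follows. This is only an illustration under stated assumptions: `toy_players` is a small made-up tibble standing in for `players_data`, the seed value is arbitrary, and dplyr's `slice_sample()` is used here in place of base R's 'sample()' to shuffle whole rows.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for players_data (values are made up)
toy_players <- tibble(
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0.0, 3.8, 0.7),
  Age = c(9, 17, NA, 21)
)

# Count the missing values in every column
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Shuffle the row order reproducibly (the seed value is an arbitrary choice)
set.seed(2024)
shuffled <- toy_players |>
  slice_sample(prop = 1)
shuffled
```

An explicit count per column is a useful cross-check on the 'summary()' output. If the later split is done with `initial_split()` from tidymodels, that function already assigns rows at random, so a separate shuffle may turn out to be unnecessary.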

(2) Questions:

Chosen Broad Question: Question 1: "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?"

Formulated Specific Question: "Can age and playing time of players accurately predict whether a player is subscribed to the Minecraft newsletter?"

The "players.csv" dataset contains "played_hours" and "Age" as quantitative explanatory variables and "subscribe" as the categorical response variable; these are the variables of interest in this classification. The columns for these variables will be retained using the 'select()' function, and the resulting data frame with these three variables will later be split and used for the classification.

In [None]:
players_select <- players_data |> select(subscribe, played_hours, Age)
slice(players_select, 1:8)

The variables of interest in this classification (subscription status, hours played, and age) are selected and displayed as a tibble.
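Since the selected data frame will later be split for classification, one possible next step is sketched below. It is not the project's confirmed method: `toy_select` is a made-up stand-in for `players_select`, and the 75% training proportion, the seed, and the choice to stratify on "subscribe" are all assumptions. The response is converted to a factor because tidymodels classification models expect a factor outcome rather than a logical one.

```r
library(dplyr)
library(tibble)
library(rsample)  # part of tidymodels; provides initial_split()

# Hypothetical stand-in for players_select (values are made up)
toy_select <- tibble(
  subscribe = rep(c(TRUE, FALSE), times = c(12, 8)),
  played_hours = seq(0, 9.5, by = 0.5),
  Age = rep(c(17, 21, 25, 30), times = 5)
)

# Convert the logical response to a factor for classification
toy_select <- toy_select |>
  mutate(subscribe = as.factor(subscribe))

# Split into training and testing sets, stratifying on the response
# (prop = 0.75 and the seed are assumed values, not taken from the project)
set.seed(1)
toy_split <- initial_split(toy_select, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)
```

Stratifying on "subscribe" keeps the proportion of subscribed and unsubscribed players roughly similar in both sets, which matters when one class is much more common than the other.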