Individual Project Planning

In this project, I will conduct a complete data science workflow to explore and predict player behavior on a video game research server. The dataset comes from a real-world study conducted by a research group in Computer Science at UBC, led by Frank Wood, which investigates how people play and interact within a Minecraft environment. The data records how players navigate the server, complete tasks, and engage in different in-game activities. Understanding this dataset is important because the research team must allocate appropriate resources to support ongoing experiments.

This project will involve data cleaning, transformation, exploratory analysis, visualization, and predictive modeling. By examining patterns and relationships within the dataset, I will formulate and answer a predictive question about player behavior on the research server.

In this report, I will carry out a preliminary analysis, organization, and visualization of the dataset to prepare it for further manipulation and modeling. The report has three major parts: first, I will state the broad question that I will address and the specific question that I have formulated; then, I will analyze and visualize the data; finally, I will discuss which methods I will use in further analysis and how I plan to conduct that analysis.

1. Broad Question

The broad question I intend to address is: "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?" My specific question is: Is played_hours a good predictor of whether a player subscribes to the game-related newsletter, and does played_hours tend to be higher among younger players?

2. Data Description, Wrangling, and Analysis

In [10]:
library(tidyverse)
library(repr)
library(tidymodels)
# library(GGally)
# library(ISLR)
library(lubridate)
# options(repr.matrix.max.rows = 6)

We will be using two datasets: players and sessions. In this part, I will explore each dataset and provide a full descriptive summary, including the number of observations, the number of variables, the name, type, and meaning of each variable, summary statistics, and any potential issues. I will then put the data into a tidy format and make preliminary preparations and visualizations for the subsequent predictive problem.

In [31]:
player <- read_csv("data/players.csv")
head(player, 5)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21


The players dataset contains 196 observations (rows) and 7 variables: 4 of character type, 2 of double type, and 1 of logical type.

· experience: character string; describes the user's past gaming experience.

· subscribe: logical; indicates whether the user is subscribed to a game-related newsletter.

· hashedEmail: character string; the user's email address, transformed into a unique string of characters by a hash function. This transformation is irreversible.

· played_hours: double; the number of hours the user has played.

· name: character string; the user's name.

· gender: character string; the user's gender.

· Age: double; the user's age.


Since this dataset is already tidy, we can now calculate summary statistics for played_hours and Age. The Age column contains one NA value. To keep NA values from affecting the statistics, we first remove the rows that contain them. We name the resulting dataset "players"; all later calculations and visualizations use players as the data.
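One way to confirm where the missing values are before filtering is to count the NA entries in every column. The sketch below is only illustrative: it uses a small made-up table (`player_toy`, a hypothetical stand-in for the real players data) so that it runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for the players table; the values are made up.
player_toy <- tibble(
  played_hours = c(30.3, 3.8, 0.0, 0.7),
  Age          = c(9, 17, NA, 21)
)

# Count the missing values in each column.
na_counts <- player_toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Running the same `summarise()` call on the real table would show which columns the NA filter actually affects.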

In [44]:
players <- player |>
    filter(!is.na(played_hours), !is.na(Age))

played_hours_stat <- players |>
    summarise(mean = mean(played_hours),
              median = median(played_hours),
              sd = sd(played_hours),
              min = min(played_hours),
              max = max(played_hours)) |>
    round(2)

Age_stat <- players |>
    summarise(mean = mean(Age),
              median = median(Age),
              sd = sd(Age),
              min = min(Age),
              max = max(Age)) |>
    round(2)

played_hours_stat
Age_stat

mean,median,sd,min,max
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
5.9,0.1,28.5,0,223.1


mean,median,sd,min,max
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
21.14,19,7.39,9,58


The two tables above show the mean, median, standard deviation, minimum, and maximum of the played_hours and Age variables. To make the means easier to read and reference later in the project, I put the mean values of the two variables in one table called players_stat.

In [48]:
players_stat <- players |>
    summarise(mean_played_hours = mean(played_hours), mean_Age = mean(Age)) |>
    round(2)
players_stat

mean_played_hours,mean_Age
<dbl>,<dbl>
5.9,21.14
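The separate summaries for played_hours and Age could also be produced in a single pass with `across()`, which avoids repeating the same five statistics for each variable. This is a minimal sketch on a made-up table (`players_toy` is hypothetical), not the computation used for the results above.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned players table; the values are made up.
players_toy <- tibble(
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1),
  Age          = c(9, 17, 17, 21, 21)
)

# Apply the same five statistics to both columns at once.
# Output columns are named <variable>_<statistic>, e.g. Age_mean.
players_stat_toy <- players_toy |>
  summarise(across(c(played_hours, Age),
                   list(mean = mean, median = median, sd = sd,
                        min = min, max = max))) |>
  round(2)

players_stat_toy
```

The result is one wide row with ten columns, so any statistic can be pulled out by name later, for example `players_stat_toy$Age_mean`.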


Now we explore the "sessions" dataset.

In [4]:
session <- read_csv("data/sessions.csv")
head(session, 5)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0


The sessions dataset contains 1535 observations (rows) and 5 variables: 3 of character type and 2 of double type.

· hashedEmail: character string; the same identifier as hashedEmail in the players dataset.

· start_time and end_time: character strings; the times at which the game session starts and ends, respectively.

· original_start_time and original_end_time: double; the start and end times as encoded by the computer, recorded in UNIX time format. In the preview above, the start and end values within each row look identical even though the sessions last several minutes, which suggests the stored values are rounded.
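To illustrate how the UNIX-encoded columns relate to the readable ones, the sketch below converts one of the displayed values to a date-time. It assumes the values are milliseconds since the Unix epoch; that is an inference from their magnitude (about 1.7e12), not something stated by the data source.

```r
library(lubridate)

# Value copied from the first row of the sessions preview.
# ASSUMPTION: the unit is milliseconds since 1970-01-01 UTC.
original_start_time <- 1719770000000

# as_datetime() expects seconds, so divide by 1000 first.
start_utc <- as_datetime(original_start_time / 1000, tz = "UTC")

format(start_utc, "%Y-%m-%d")
```

Under this assumption the converted date matches the date in the readable start_time column for that row (30/06/2024).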

The dataset is not yet tidy, since each start_time and end_time cell holds more than one value (a date and a time). We can create a new variable called session_time, computed as end_time minus start_time, which makes the table tidy and is useful for our further analysis. There is one obstacle: start_time and end_time are currently character strings, not date-time objects, so they must be converted to date-times before any time calculation. The original_start_time and original_end_time columns are not human-readable, and we already have the readable start_time and end_time, so these two columns will be removed. There may also be NA values, which need to be removed before any further visualization.

In [30]:
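sessions <- session |>
    # Parse the day/month/year hour:minute strings into date-time objects
    mutate(start_time = dmy_hm(start_time), end_time = dmy_hm(end_time)) |>
    # Session length in minutes; pmax() clamps any negative duration to 0
    mutate(session_time = as.numeric(difftime(end_time, start_time, units = "mins")),
           session_time = pmax(session_time, 0)) |>
    select(hashedEmail, session_time) |>
    # Drop rows with a missing session length or player identifier
    filter(!is.na(session_time), !is.na(hashedEmail))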
head(sessions, 5)

hashedEmail,session_time
<chr>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,13
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,23
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,36
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,11


The data are now in tidy format, and all NA values have been removed. The summary statistics of the session_time variable are given below.

In [45]:
session_time_stat <- sessions |>
    summarise(mean = mean(session_time),
              median = median(session_time),
              sd = sd(session_time),
              min = min(session_time),
              max = max(session_time)) |>
    round(2)
session_time_stat

mean,median,sd,min,max
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
50.86,30,55.57,3,259
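Because hashedEmail appears in both datasets, the session lengths can be summarized per player and attached to the players table. The sketch below shows one possible way to do this on made-up data (`players_toy`, `sessions_toy`, and the new column names are all hypothetical); it is not a step carried out in this report.

```r
library(dplyr)

# Hypothetical stand-ins for the cleaned tables; the values are made up.
players_toy <- tibble(
  hashedEmail  = c("a", "b", "c"),
  played_hours = c(1.5, 0.0, 0.4)
)
sessions_toy <- tibble(
  hashedEmail  = c("a", "a", "c"),
  session_time = c(30, 60, 24)
)

# One row per player: number of sessions and total minutes played.
per_player <- sessions_toy |>
  group_by(hashedEmail) |>
  summarise(n_sessions = n(),
            total_session_mins = sum(session_time),
            .groups = "drop")

# Keep every player; players without recorded sessions get NA.
players_joined <- players_toy |>
  left_join(per_player, by = "hashedEmail")

players_joined
```

A left join keeps players who never started a session, which matters here because many players have very low played_hours.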


3. Visualization

4. Methods and Plan