Individual Project Planning

In this project, I will conduct a complete data science workflow to explore and predict player behavior on a video game research server. The dataset comes from a real-world study conducted by a research group in Computer Science at UBC, led by Frank Wood, which investigates how people play and interact within a Minecraft environment. The data records how players navigate the server, complete tasks, and engage in different in-game activities. Understanding this dataset is important because the research team must allocate appropriate resources to support ongoing experiments.

This project will involve data cleaning, transformation, exploratory analysis, visualization, and predictive modeling. By examining patterns and relationships within the dataset, I will work toward formulating and answering a predictive question about player behavior on the research server.

In this report, I will carry out a preliminary analysis, organization, and visualization of the dataset to prepare it for further analysis. The report has three major parts. First, I state the broad question I will address and the specific question I have formulated. Second, I describe, analyze, and visualize the data. Finally, I discuss the methods I plan to use in further analysis and how I will carry it out.

1. Broad Question

The broad question I intend to address is "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?" My specific question is: Is played_hours a good predictor of whether a player subscribes to the game-related newsletter? Does played_hours tend to be higher among younger players?

2. Data Description, Analysis, and Visualization

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
# options(repr.matrix.max.rows = 6)

We will be using two datasets: players and sessions. In this part, I will explore each dataset and provide a full descriptive summary, including the number of observations, the number of variables, the name, type, and meaning of each variable, summary statistics, and any potential issues. I will then put the data into a tidy format and make preliminary preparations and visualizations for the subsequent predictive problem.

In [None]:
players <- read_csv("data/players.csv")
head(players, 5)

The players dataset contains 196 observations (rows) and 7 variables in total: 4 of character type, 2 of double type, and 1 of logical type.

· experience: character strings; describes the user's past gaming experience.

· subscribe: logical; describes whether the user is subscribed to a game-related newsletter.

· hashedEmail: character strings; the user's email address after it has been transformed into a unique string of characters by a hash function. This transformation is irreversible.

· played_hours: double; the total number of hours each user has played.

· name: character strings; the user's name.

· gender: character strings; the user's gender.

· Age: double; the user's age.
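
The counts and types listed above can be checked directly in R. The following is a minimal sketch on a small made-up tibble that only mimics the column names of players; every value in it is invented for illustration. On the real data, the same calls would be applied to players instead of toy_players.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `players`; all values are hypothetical.
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Veteran"),
  subscribe    = c(TRUE, FALSE, TRUE),
  hashedEmail  = c("a1", "b2", "c3"),
  played_hours = c(30.3, 0.0, 3.8),
  name         = c("Ann", "Ben", "Cy"),
  gender       = c("Female", "Male", "Non-binary"),
  Age          = c(9, 21, NA)
)

# Number of observations (rows) and variables (columns).
dims <- dim(toy_players)

# Storage type of each column: "character", "logical", or "double".
col_types <- sapply(toy_players, typeof)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy_players))

dims
table(col_types)
na_counts
```

Counting NA values per column at this stage also shows which variables need missing-value handling before any statistics are computed.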


Since this dataset is already tidy, we can now calculate the summary statistics of played_hours and Age. The data might contain NA values. To keep them from affecting the statistics, we first remove the rows that contain NA values.

In [None]:
# Keep the two numeric variables and drop rows where either one is missing
clean_data <- players |>
  select(played_hours, Age) |>
  filter(!is.na(played_hours), !is.na(Age))

# Summary statistics for played_hours
played_hours_stat <- clean_data |>
  summarise(mean = mean(played_hours),
            median = median(played_hours),
            sd = sd(played_hours),
            min = min(played_hours),
            max = max(played_hours))

# Summary statistics for Age
Age_stat <- clean_data |>
  summarise(mean = mean(Age),
            median = median(Age),
            sd = sd(Age),
            min = min(Age),
            max = max(Age))

played_hours_stat
Age_stat

The two tables above show the mean, median, standard deviation, minimum, and maximum values of the played_hours and Age variables.
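
One detail of the cell above is worth noting: filtering on both columns at once drops a row whenever either value is missing, so a player with a recorded played_hours but no Age is left out of both tables. A possible alternative, sketched below on invented values (toy is a hypothetical stand-in, not the real data), keeps every row and handles missing values per column with na.rm = TRUE.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical values; the second player has no recorded Age.
toy <- tibble(
  played_hours = c(1.5, 0.0, 10.0, 2.5),
  Age          = c(17, NA, 25, 21)
)

stats <- toy |>
  # Compute each statistic for both columns, ignoring NA within each column.
  summarise(across(
    c(played_hours, Age),
    list(
      mean   = \(x) mean(x, na.rm = TRUE),
      median = \(x) median(x, na.rm = TRUE),
      sd     = \(x) sd(x, na.rm = TRUE),
      min    = \(x) min(x, na.rm = TRUE),
      max    = \(x) max(x, na.rm = TRUE)
    ),
    .names = "{.col}__{.fn}"
  )) |>
  # Reshape into one row per variable and one column per statistic.
  pivot_longer(everything(),
               names_to = c("variable", "statistic"),
               names_sep = "__") |>
  pivot_wider(names_from = statistic, values_from = value)

stats
```

With this approach the played_hours statistics use all four toy rows, while the Age statistics use only the three rows where Age is present. Whether that is preferable to dropping incomplete rows depends on how the two variables are used together later.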

Now we explore the "sessions" dataset.

In [None]:
session <- read_csv("data/sessions.csv")
head(session, 5)

The sessions dataset contains 1535 observations (rows) and 5 variables in total: 3 of character type and 2 of double type.

· hashedEmail: character strings; the same as hashedEmail in the players dataset.

· start_time and end_time: character strings; give the times at which the game session starts and ends, respectively.

· original_start_time and original_end_time: double; the same as start_time and end_time, but recorded in UNIX time format.

The dataset is not yet in its most convenient form: we can create a new variable called session_time, computed as end_time minus start_time, which would make the table easier to read and help with further analysis. However, there is one potential problem with the data: start_time and end_time are currently character strings, not date-time objects. To do time calculations, we first need to convert them into date-time objects. An alternative is to use original_start_time and original_end_time, which are in UNIX milliseconds: we can divide them by 1000 to get seconds and then convert the result to date-times.
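
Both conversion routes described above can be sketched as follows. This is only an illustration under stated assumptions: toy_sessions is an invented stand-in for the real table, the string format is assumed to be day/month/year hour:minute (which should be checked against the output of head() before using dmy_hm), and all times are treated as UTC.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical sessions; the string format "DD/MM/YYYY HH:MM" is an assumption.
toy_sessions <- tibble(
  hashedEmail         = c("a1", "b2"),
  start_time          = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time            = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1719771120000, 1718667180000),
  original_end_time   = c(1719771840000, 1718667960000)
)

sessions_timed <- toy_sessions |>
  mutate(
    # Route 1: parse the character strings into date-time objects.
    start_time   = dmy_hm(start_time, tz = "UTC"),
    end_time     = dmy_hm(end_time, tz = "UTC"),
    session_time = as.numeric(difftime(end_time, start_time, units = "mins")),
    # Route 2: UNIX milliseconds -> seconds -> date-time objects.
    start_unix        = as_datetime(original_start_time / 1000),
    end_unix          = as_datetime(original_end_time / 1000),
    session_time_unix = as.numeric(difftime(end_unix, start_unix, units = "mins"))
  )

sessions_timed |>
  select(hashedEmail, session_time, session_time_unix)
```

Here session_time is expressed in minutes; a different unit can be chosen through the units argument of difftime. Computing the duration both ways gives a simple consistency check between the string columns and the UNIX columns.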

3. Methods and Plan