## Data Science Final Project Report: Predicting Player Playtime in a Minecraft Research Server


**Group Members**:


Han Nguyen     

Vincent Nguyen

Sanuli Weihena Gamage

Huixin Zhang



In [6]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
library(dplyr)
options(repr.matrix.max.rows = 6)


**Introduction** 

Understanding player behavior is essential for optimizing resources and improving user experience in online gaming environments. In this project, we analyze data collected from a Minecraft research server operated by a UBC Computer Science research group led by Frank Wood. The server records player actions, providing insight into how people play. Running such a project requires careful resource allocation: the team needs sufficient server capacity and a targeted recruitment strategy. To support these efforts, we investigate which player characteristics may influence engagement.

The primary question guiding this analysis is:

Can a player's experience level and age (the predictors) accurately predict their total playtime (the response variable) using multivariable K-nearest neighbors (KNN) regression and multivariable linear regression?

To address this, we focus on the `players.csv` dataset, which contains relevant features such as player age, experience level, and total playtime. The `sessions.csv` dataset, which logs individual gameplay sessions, is not directly useful for this analysis since we are interested in aggregated player behavior rather than session-specific details. By applying predictive modeling techniques, we aim to uncover meaningful relationships between player attributes and their total playtime, providing actionable insights for the research team.



**Data summary**

Two datasets are available, and we will look at both: `players.csv` and `sessions.csv`.

`players.csv`: A list of all unique players, including data about each player. This dataset has 196 observations and 7 variables:

`hashedEmail` `(chr)`: Unique identifier for players (hashed for privacy).

`experience` `(chr)`: How experienced the player is.

`subscribe` `(lgl)`: Whether this player subscribed to the newsletter.

`played_hours` `(dbl)`: Total time spent playing in hours.

`name` `(chr)`: Name of the player.

`gender` `(chr)`: Gender of the player.

`Age` `(dbl)`: Age of the player.


`sessions.csv`: A list of individual play sessions by each player, including data about each session. This dataset has 1,535 observations and 5 variables:

`hashedEmail` `(chr)`: Unique identifier for players (hashed for privacy).

`start_time` `(chr)`: Session start time in human-readable format.

`end_time` `(chr)`: Session end time in human-readable format.

`original_start_time` `(dbl)`: Session start time in Unix epoch time.

`original_end_time` `(dbl)`: Session end time in Unix epoch time.


To take a closer look, we first load the two datasets and assign them to two objects, `players` and `session`.



In [7]:
players <- read_csv("https://raw.githubusercontent.com/Han27-io/ds-project/refs/heads/main/players%20(1).csv")
players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Amateur | FALSE | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb | 0.0 | Dylan | Prefer not to say | 17 |
| Amateur | FALSE | f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436 | 2.3 | Harlow | Male | 17 |
| Pro | TRUE | d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11 | 0.2 | Ahmed | Other | NA |
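The preview above shows that `experience` is read as a character column and that at least one `Age` value is missing. The following is a minimal sketch of how these two points could be handled before modeling. It uses a few hand-typed rows rather than the real data, and the names `players_example`, `players_tidy`, and `players_complete` are illustrative assumptions, not part of the original analysis.

```r
library(tibble)
library(dplyr)

# Hand-typed example rows in the same layout as a subset of players.csv
# (illustrative only; not the full dataset)
players_example <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Pro"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE),
  played_hours = c(30.3, 3.8, 2.3, 0.2),
  Age          = c(9, 17, 17, NA)
)

# Treat experience as a categorical variable instead of free text
players_tidy <- players_example |>
  mutate(experience = as.factor(experience))

# Count how many players have no recorded age
missing_age <- players_tidy |>
  summarise(n_missing_age = sum(is.na(Age)))

# One possible choice: keep only rows where Age is present
players_complete <- players_tidy |>
  filter(!is.na(Age))
```

Dropping rows with a missing `Age` is only one option; whether it is appropriate depends on how many rows are affected in the full dataset.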


In [8]:
session <- read_csv("https://raw.githubusercontent.com/Han27-io/ds-project/refs/heads/main/sessions%20(1).csv")
session

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| hashedEmail | start_time | end_time | original_start_time | original_end_time |
|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1.71977e+12 | 1.71977e+12 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1.71867e+12 | 1.71867e+12 |
| f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1.72193e+12 | 1.72193e+12 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d | 28/07/2024 15:36 | 28/07/2024 15:57 | 1.72218e+12 | 1.72218e+12 |
| fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d | 25/07/2024 06:15 | 25/07/2024 06:22 | 1.72189e+12 | 1.72189e+12 |
| 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 20/05/2024 02:26 | 20/05/2024 02:45 | 1.71617e+12 | 1.71617e+12 |
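Because `start_time` and `end_time` are read as character strings, they would need to be parsed before any time arithmetic is possible. The sketch below shows one way this might be done with `lubridate`, using two rows copied from the preview above. The day/month/year reading of the strings, the UTC time zone, and the interpretation of the epoch columns as milliseconds (suggested by their magnitude) are assumptions; `session_example` and the new column names are illustrative.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Two rows copied from the preview of the session data
session_example <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1.71977e12, 1.71867e12)
)

session_parsed <- session_example |>
  mutate(
    # Assumes day/month/year hour:minute; lubridate defaults to UTC
    start_dt = dmy_hm(start_time),
    end_dt   = dmy_hm(end_time),
    # Session length in minutes
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins")),
    # Assumes the epoch values are in milliseconds, so divide by 1000
    epoch_start_dt = as_datetime(original_start_time / 1000)
  )
```

The epoch values in the preview are displayed with only six significant digits, so the datetimes derived from them here are approximate and would only agree with `start_dt` to within a few hours.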
