<p style="text-align: center; font-weight: bold; font-size: 18pt">Loading Data</p>

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)

In [None]:
download.file(url = "https://raw.githubusercontent.com/MatthewKyi/DSCI-100-004-Individual/refs/heads/main/players.csv", destfile = "players.csv")
players <- read_csv("players.csv")
head(players)

distinct(players, experience)
distinct(players, gender)

"Mean"
map_df(select(players, subscribe, played_hours, Age), mean, na.rm=TRUE)
"Min"
map_df(select(players, played_hours, Age), min, na.rm=TRUE)
"Max"
map_df(select(players, played_hours, Age), max, na.rm=TRUE)

In [None]:
download.file(url = "https://raw.githubusercontent.com/MatthewKyi/DSCI-100-004-Individual/refs/heads/main/sessions.csv", destfile = "sessions.csv")
sessions <- read_csv("sessions.csv")
head(sessions)

"Mean"
map_df(select(sessions, original_start_time, original_end_time), mean, na.rm=TRUE)
"Min"
map_df(select(sessions, original_start_time, original_end_time), min, na.rm=TRUE)
"Max"
map_df(select(sessions, original_start_time, original_end_time), max, na.rm=TRUE)

<p style="text-align: center; font-weight: bold; font-size: 18pt">Data Explanation</p>

The data in this project come from the PlaiCraft Minecraft server and are meant to provide information on how people play video games. There are two relevant datasets: one contains information about each player on the server, and the other contains information about every gaming session the players had on the server.
<br><br>
<b style="font-size: 14pt">players dataset</b>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- 7 columns/variables, 196 rows/observations<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Approximately 73.47% of players are subscribed<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Average played_hours = 5.84<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Minimum played_hours = 0<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Maximum played_hours = 223.10<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Average player age is 21.13<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Minimum player age is 9<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Maximum player age is 58<br></p>
<br><br>
<b style="font-size: 14pt">sessions dataset</b>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- 5 columns/variables, 1535 rows/observations<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Average original_start_time = 1.719201e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Earliest original_start_time = 1.7124e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Latest original_start_time = 1.72733e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Average original_end_time = 1.719196e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Earliest original_end_time = 1.7124e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Latest original_end_time = 1.72734e+12ms<br></p>
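
The figures listed above can also be collected into a single summary table rather than three separate `map_df` calls. This is a minimal sketch, assuming a toy stand-in for the players data: the values in `toy_players` are made up, and only the column names come from the dataset.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the players data (values are made up for illustration)
toy_players <- tibble(
    subscribe = c(TRUE, FALSE, TRUE, TRUE),
    played_hours = c(0, 1.5, 30.3, 0.2),
    Age = c(9, 17, 21, NA)
)

toy_summary <- toy_players |>
    summarize(
        n_rows = n(),
        # the mean of a logical column is the proportion of TRUE values
        pct_subscribed = mean(subscribe) * 100,
        # mean, min and max of each numeric column, ignoring missing values
        across(
            c(played_hours, Age),
            list(
                mean = \(x) mean(x, na.rm = TRUE),
                min = \(x) min(x, na.rm = TRUE),
                max = \(x) max(x, na.rm = TRUE)
            )
        )
    )
toy_summary
```

The resulting columns are named `played_hours_mean`, `played_hours_min`, and so on, so the same call could be pointed at `players` to reproduce the list above.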

<p style="text-align: center; font-weight: bold; font-size: 18pt">Variables Explanation</p>

The dataset of *players* has 7 columns/variables: <br>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- experience (Character): Pro, Veteran, Regular, Beginner, or Amateur <br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- subscribe (Logical): Is the player subscribed -- TRUE or FALSE <br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- hashedEmail (Character): SHA256-hashed email address of player (196 unique hashed emails)<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- played_hours (Double): total number of hours the player has been online <br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- name (Character): player's name<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- gender (Character): Male, Female, Non-binary, Agender, Two-Spirited, Other, Prefer not to say<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Age (Double): player's age<br></p>
<br><br>
The dataset of *sessions* has 5 columns/variables:<br>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- hashedEmail (Character): SHA256-hashed email address of player (125 unique hashed emails)<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- start_time (Character): session start time in dd/mm/yyyy hh:mm format<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- end_time (Character): session end time in dd/mm/yyyy hh:mm format<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- original_start_time (Double): time of session start in UNIX Time (Time since Jan 1st 1970) in milliseconds<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- original_end_time (Double): time of session end in UNIX Time (Time since Jan 1st 1970) in milliseconds</p>
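
To make the UNIX-time columns easier to read, a millisecond timestamp can be converted to a date-time. This is a minimal sketch, assuming the lubridate package (installed with the tidyverse) and using the rounded earliest `original_start_time` reported above as the input value.

```r
library(lubridate)

# Rounded earliest original_start_time from the summary above, in milliseconds
start_ms <- 1.7124e12

# as_datetime() expects seconds since 1970-01-01 UTC, so divide milliseconds by 1000
start_dt <- as_datetime(start_ms / 1000)
start_dt
```

Because the input is rounded, the result is only an approximate date-time, not the exact start of any session.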

<p style="text-align: center; font-weight: bold; font-size: 18pt">Potential Issues</p>

<p>- The hashedEmail variable is harder to work with than something like a username when keeping track of which player is which <br></p>
<p>- The original_start_time and original_end_time variables are only recorded to 6 significant figures, which is not enough precision to distinguish a session's start from its end. <br></p>
<p>- The start_time and end_time variables are of type character, so they must be parsed into date-times before the time difference between the two can be computed, if that is necessary.<br></p>
<p>- The sessions data may not account for players' Away from Keyboard (AFK) time<br></p>
<p>- Compared with something like a years-played variable, the experience variable is a somewhat subjective measure of player skill <br></p>
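
The two timing issues above can be illustrated on a small example. This is a minimal sketch, assuming lubridate's `dmy_hm()` matches the dd/mm/yyyy hh:mm layout. The rows in `toy_sessions` are made up, and times are treated as UTC, which may not match the server's time zone.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Toy rows in the same layout as start_time and end_time (values are made up)
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

toy_durations <- toy_sessions |>
    mutate(
        # parse day/month/year hour:minute strings (UTC by default)
        start_dt = dmy_hm(start_time),
        end_dt = dmy_hm(end_time),
        session_minutes = as.numeric(difftime(end_dt, start_dt, units = "mins")),
        # rounding a 13-digit millisecond timestamp to 6 significant figures
        # leaves steps of 10,000,000 ms (about 2.8 hours)
        start_ms_6sf = signif(as.numeric(start_dt) * 1000, 6),
        end_ms_6sf = signif(as.numeric(end_dt) * 1000, 6)
    )
toy_durations
```

In this toy data the parsed times give session lengths of 12 and 13 minutes, while the rounded millisecond values for start and end are identical within each row.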


<p style="text-align: center; font-weight: bold; font-size: 18pt">Question</p>

<h4>Broad question</h4>
We are interested in demand forecasting, namely, which time windows are most likely to have a large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability. 

<h4>Specific question</h4>
Can the current date/time predict the number of active players in the sessions dataset?
<br>
<p>The sessions dataset contains session time information. If there are many sessions at particular times, we know that those times are more likely to have a high volume of players, and vice versa. It may help to use the start_time and end_time variables and convert them into numeric UNIX time, like original_start_time and original_end_time but with more precision.</p>
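
One possible way to turn session times into a count of simultaneous sessions is to treat each start as +1 and each end as -1, then take a running total over time. This is a minimal sketch, not the final analysis. The rows in `toy_sessions` are made up, times are assumed to be UTC, and a session count only equals a player count if no player has overlapping sessions.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Toy sessions in the dd/mm/yyyy hh:mm layout (values are made up)
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "30/06/2024 18:40", "30/06/2024 19:05"),
    end_time   = c("30/06/2024 19:24", "30/06/2024 18:55", "30/06/2024 19:30")
)

parsed <- toy_sessions |>
    mutate(
        start_dt = dmy_hm(start_time),
        end_dt = dmy_hm(end_time),
        # UNIX time in milliseconds, kept at full minute-level precision
        start_ms = as.numeric(start_dt) * 1000,
        end_ms = as.numeric(end_dt) * 1000
    )

# Each start adds one active session and each end removes one
starts <- tibble(time = parsed$start_dt, change = 1)
ends <- tibble(time = parsed$end_dt, change = -1)

events <- bind_rows(starts, ends) |>
    arrange(time, change) |>   # at equal times, ends are processed before starts
    mutate(active_sessions = cumsum(change))
events
```

The largest value of `active_sessions` is the peak concurrency in the toy data. Applied to the real sessions, the same running total could be summarized by hour of day or day of week to look for busy time windows.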

<p style="text-align: center; font-weight: bold; font-size: 18pt">Wrangling Data</p>
<br>
<p>Rename the hashedEmail and Age columns to match the snake case convention of the other columns</p>

In [None]:
players_tidy <- players |>
    rename(hashed_email = hashedEmail, age = Age)

sessions_tidy <- sessions |>
    rename(hashed_email = hashedEmail)

<p style="text-align: center; font-weight: bold; font-size: 18pt">Players Data Averages</p>

In [None]:
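# Column means of the renamed players data; the mean of subscribe is the share of subscribed players
map_df(select(players_tidy, subscribe, played_hours, age), mean, na.rm = TRUE)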

In [None]:
original_start_time_histogram <- sessions_tidy |>
    