<p style="text-align: center; font-weight: bold; font-size: 18pt">Loading Data</p>

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)

In [None]:
download.file(url = "https://raw.githubusercontent.com/MatthewKyi/DSCI100-004-37/refs/heads/main/players.csv", destfile = "players.csv")
players <- read_csv("players.csv")
head(players)


In [None]:
download.file(url = "https://raw.githubusercontent.com/MatthewKyi/DSCI-100-004-Individual/refs/heads/main/sessions.csv", destfile = "sessions.csv")
sessions <- read_csv("sessions.csv")
head(sessions)

"Mean"
map_df(select(sessions, original_start_time, original_end_time), mean, na.rm=TRUE)
"Min"
map_df(select(sessions, original_start_time, original_end_time), min, na.rm=TRUE)
"Max"
map_df(select(sessions, original_start_time, original_end_time), max, na.rm=TRUE)

<p style="text-align: center; font-weight: bold; font-size: 18pt">Data Explanation</p>

The data in this project relates to the PlaiCraft Minecraft server, and is meant to provide information on how people play video games. The sessions dataset contains information about every player's gaming sessions on the server.
<br><br>
<b style="font-size: 14pt">sessions dataset</b>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- 5 columns/variables, 1535 rows/observations<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Average original_start_time = 1.719201e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Earliest original_start_time = 1.7124e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Latest original_start_time = 1.72733e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Average original_end_time = 1.719196e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Earliest original_end_time = 1.7124e+12ms<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Latest original_end_time = 1.72734e+12ms<br></p>

<p style="text-align: center; font-weight: bold; font-size: 18pt">Variables Explanation</p>


The dataset of *sessions* has 5 columns/variables:<br>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- hashedEmail (Character): SHA-256 hash of the player's email address (125 unique hashed emails)<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- start_time (Character): session start time in dd/mm/yyyy hh:mm format<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- end_time (Character): session end time in dd/mm/yyyy hh:mm format<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- original_start_time (Double): time of session start in UNIX Time (Time since Jan 1st 1970) in milliseconds<br></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- original_end_time (Double): time of session end in UNIX Time (Time since Jan 1st 1970) in milliseconds</p>
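
As a minimal sketch of how the UNIX-millisecond columns can be read, the snippet below converts one value to a date-time by dividing by 1000 and counting seconds from Jan 1st 1970. The value is the earliest original_start_time reported above; displaying it in UTC is an assumption, since the dataset does not state its time zone.

```r
# Earliest original_start_time reported in the summary above, in milliseconds.
earliest_ms <- 1.7124e12

# as.POSIXct() counts seconds since 1970-01-01, so divide milliseconds by 1000.
# Displaying the result in UTC is an assumption.
earliest_dt <- as.POSIXct(earliest_ms / 1000, origin = "1970-01-01", tz = "UTC")
earliest_dt
```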

<p style="text-align: center; font-weight: bold; font-size: 18pt">Potential Issues</p>

<p>- The hashedEmail variable is harder to work with than something like a username when keeping track of which player is which <br></p>
<p>- The original_start_time and original_end_time variables are only measured to 6 significant figures, which is not enough precision to show any difference between the two. <br></p>
<p>- The start_time and end_time variables are of type character, so they must be parsed into date-times before the difference between them can be computed, if that is necessary.<br></p>
<p>- The data may not account for Away from Keyboard (AFK) time of players in the sessions data frame<br></p>
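
One possible way to handle the character timestamps is sketched below, assuming the lubridate package is available. The rows are hypothetical values written in the documented dd/mm/yyyy hh:mm format, not real observations, and the UTC time zone is an assumption. The names example_sessions, start_dt, end_dt, duration_min, and start_unix_ms are illustrative only.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Hypothetical rows in the documented dd/mm/yyyy hh:mm format (not real data).
example_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

example_parsed <- example_sessions |>
    mutate(
        # dmy_hm() parses day/month/year hour:minute strings; UTC is an assumption.
        start_dt = dmy_hm(start_time, tz = "UTC"),
        end_dt = dmy_hm(end_time, tz = "UTC"),
        # Session length in minutes.
        duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins")),
        # UNIX time in milliseconds, now with minute-level precision.
        start_unix_ms = as.numeric(start_dt) * 1000
    )

example_parsed
```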


<p style="text-align: center; font-weight: bold; font-size: 18pt">Question</p>

<h4>Broad question</h4>
We are interested in demand forecasting, namely, which time windows are most likely to have a large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability. 

<h4>Specific question</h4>
Can the current date/time predict the number of active players in the sessions dataset?
<br>
<p>The sessions dataset contains session time information. If many sessions occur at particular times, those times are more likely to have a high volume of players, and vice versa. It may help to convert the start_time and end_time variables into numeric UNIX time, like original_start_time and original_end_time but with more precision.</p>

<p style="text-align: center; font-weight: bold; font-size: 18pt">Wrangling Data</p>
<br>
<p>Rename the hashedEmail and Age columns to fit the snake case convention of the other columns</p>

In [None]:
players_tidy <- players |>
    rename(hashed_email = hashedEmail, age = Age)

sessions_tidy <- sessions |>
    rename(hashed_email = hashedEmail)

<p style="text-align: center; font-weight: bold; font-size: 18pt">Players Data Averages</p>

In [None]:
map_df(select(players_tidy, subscribe, played_hours, age), mean, na.rm=TRUE)

<p style="text-align: center; font-weight: bold; font-size: 18pt">Visualizations</p>

In [None]:
options(repr.plot.width = 5, repr.plot.height = 5)

original_start_time_histogram <- sessions_tidy |>
    ggplot(aes(x = original_start_time)) +
    geom_histogram() +
    labs(title = "Number of sessions vs. session start time", x = "Session Start Time (UNIX Time)", y = "Count")

original_start_time_histogram

hashed_email_histogram <- sessions_tidy |>
    ggplot(aes(x = hashed_email)) +
    geom_bar() +  # hashed_email is categorical, so count sessions per player with bars
    labs(title = "Number of sessions vs. unique player", x = "Hashed Email", y = "Count") +
    theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

hashed_email_histogram

<p>The histogram of original_start_time shows a roughly bell-shaped distribution of start times from the start of measurement on the server to the end.</p>
<p>The histogram of hashed_email shows that some players have significantly more sessions logged than others.</p>

<p style="text-align: center; font-weight: bold; font-size: 18pt">Methods and Plan</p>
I plan to examine the relationship between a given time period and the number of players online. I will convert the session start and end times to UNIX time, with more precision than the original_start_time and original_end_time variables provide. I will then create time periods of a fixed length, such as 30 minutes or 1 hour, and count how many players had sessions overlapping each period. I can then plot the time periods on the x-axis against the number of players on the y-axis to see the trends. Finally, a model can be used to predict the number of players in a given time period.
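
The overlap count described above could be sketched as follows. This is only an illustration under stated assumptions: the sessions are hypothetical values rather than real data, player_id stands in for hashed_email, the 1-hour window length is one of the candidate choices, and cross_join() requires dplyr 1.1.0 or later.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Hypothetical parsed sessions (not real data); player_id stands in for hashed_email.
example_sessions <- tibble(
    player_id = c("a", "b", "c", "a"),
    start_dt = ymd_hm(c("2024-06-30 18:05", "2024-06-30 18:40",
                        "2024-06-30 19:10", "2024-06-30 20:15")),
    end_dt = ymd_hm(c("2024-06-30 18:50", "2024-06-30 19:20",
                      "2024-06-30 19:30", "2024-06-30 20:45"))
)

# One row per 1-hour window, from the first session start to the last session end.
windows <- tibble(
    window_start = seq(floor_date(min(example_sessions$start_dt), "hour"),
                       floor_date(max(example_sessions$end_dt), "hour"),
                       by = "1 hour")
) |>
    mutate(window_end = window_start + hours(1))

# Pair every window with every session. A session overlaps a window if it
# starts before the window ends and ends after the window starts.
player_counts <- windows |>
    cross_join(example_sessions) |>
    mutate(overlaps = start_dt < window_end & end_dt > window_start) |>
    group_by(window_start) |>
    summarize(n_players = n_distinct(player_id[overlaps]))

player_counts
```

Pairing every window with every session is simple but grows with windows times sessions, so a larger dataset may need a more efficient join.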

<br><br>
Why is this method appropriate? <br>
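<p>&nbsp;&nbsp;&nbsp;&nbsp;- It scales to any number of sessions, and counting overlapping sessions per time period directly measures the number of simultaneous players we want to forecast<br></p>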
<br>  
Which assumptions are required, if any, to apply the method selected?<br>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- start_time and end_time variables are all measured in the same time zone</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- the time periods that I choose are fine-grained enough to show meaningful results<br></p>
<br>  

What are the potential limitations or weaknesses of the method selected?<br>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- This analysis won't suggest any reason why the results are the way they are<br></p>
<br>
How are you going to compare and select the model?<br>
<p>&nbsp;&nbsp;&nbsp;&nbsp;- Because we are looking at a quantitative variable, we will use a regression model, and we will analyze the visualizations of the data to determine which specific model to use.<br></p>
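
One way a candidate regression model could be scored is sketched below with tidymodels. The data are simulated, not taken from the sessions dataset; the predictor (hour of day), the linear model, and the use of RMSE on held-out rows are all assumptions made for illustration. Other candidate models could be scored the same way and compared on the same held-out rows.

```r
library(tidymodels)

# Simulated data (not real): four days of hourly player counts.
set.seed(1)
player_counts <- tibble(
    day = rep(1:4, each = 24),
    hour = rep(0:23, times = 4),
    n_players = 2 + 0.25 * hour + rnorm(96, sd = 0.5)
)

# Earlier days for training, the last day held out for testing.
counts_train <- filter(player_counts, day <= 3)
counts_test <- filter(player_counts, day == 4)

lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

lm_fit <- workflow() |>
    add_recipe(recipe(n_players ~ hour, data = counts_train)) |>
    add_model(lm_spec) |>
    fit(data = counts_train)

# Root mean squared error on the held-out day; lower is better.
lm_rmse <- lm_fit |>
    predict(counts_test) |>
    bind_cols(counts_test) |>
    metrics(truth = n_players, estimate = .pred) |>
    filter(.metric == "rmse") |>
    pull(.estimate)

lm_rmse
```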
<br>

How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
<p>I will split the data chronologically, after cleaning and aggregating, so that the training data comes from earlier time periods and the testing data from later ones. The validation set will be a time period between training and testing, and cross-validation will be used. I split this way because we want to know whether the model can predict player counts in the future.<br></p>
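
A chronological train/test split could look like the sketch below, using initial_time_split() from rsample (part of tidymodels). The aggregated data are hypothetical, and the 0.75 proportion is an assumed value rather than a decision made in this project. The validation period and cross-validation are not shown.

```r
library(tibble)
library(dplyr)
library(rsample)

# Hypothetical aggregated data (not real): one row per hourly window.
player_counts <- tibble(
    window_start = as.POSIXct("2024-06-01 00:00", tz = "UTC") + 3600 * (0:19),
    n_players = rep(c(1, 3, 2, 5), times = 5)
)

# initial_time_split() takes the first `prop` of rows for training,
# so the rows must be sorted by time first.
counts_split <- player_counts |>
    arrange(window_start) |>
    initial_time_split(prop = 0.75)  # 0.75 is an assumed proportion

counts_train <- training(counts_split)
counts_test <- testing(counts_split)

# Every training window should come before every testing window.
max(counts_train$window_start) < min(counts_test$window_start)
```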
<br>