Create an electronic report in English with a maximum of 2000 words (excluding citations) using Jupyter. The report should include the posed question, conducted analysis, and derived conclusion. Only one team member needs to submit this report. It is not required to include all tasks completed by every group member in their individual assignments; tailor the final report to the collective group's work. 

You must submit 2 files: an .html file (File -> Download As -> HTML) and an .ipynb file. The notebook must be fully reproducible: it must run completely from top to bottom without any additional files.

**Title Introduction:**
provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
clearly state the question you tried to answer with your project
identify and fully describe the dataset that was used to answer the question

**Methods & Results:**
describe the methods you used to perform your analysis from beginning to end that narrates the analysis code.
your report should include code which:
loads data 
wrangles and cleans the data to the format necessary for the planned analysis
performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
performs the data analysis
creates a visualization of the analysis 
note: all figures should have a figure number and a legend

**Discussion:**
summarize what you found
discuss whether this is what you expected to find?
discuss what impact could such findings have?
discuss what future questions could this lead to?

**References:**
You may include references if necessary, as long as they all have a consistent citation style.

<h2 style="font-size: xxx-large;">Group 21 Project.</h2>

<h2 style="font-size: x-large;">Contributors:</h2>
Andrew Dai SN:

Aydin den Ouden SN:36321925

<h4 style="font-size: xx-large;">Introduction</h4>

<h4 style="font-size: x-medium;">Introduction to our question:</h4>
For this project, our group chose to investigate question 3 on demand forecasting, namely "what time windows are most likely to have large number of simultaneous players", to ensure the server has a sufficient number of licences to keep up with concurrent players. Reflecting the course's philosophy of transform -> visualize -> model -> repeat, and our focus on modeling over the last few weeks, we were specifically interested in whether we could use the preexisting server data to predict which times may have the most players, and the probable upper bounds of player activity we could reasonably predict, in order to advise the server team on whether they need to increase their server capacity.
<br>
<h4 style="font-size: x-medium;">The data:</h4>
In order to begin our investigation, we were given 2 dataframes, one on player information and one on play sessions information, both of which we deemed important for us to answer our question:

The sessions dataframe includes 1606 rows, with each corresponding to a single play session, and 5 columns, titled:
- start_time - The date and time that a player logged onto the server and began to play, as a string with the date in DD/MM/YYYY format followed by a 24-hour time
- end_time - The date and time that a player logged off the server after they stopped playing or were kicked off, as a string with the date in DD/MM/YYYY format followed by a 24-hour time
- original_start_time - The same information as start_time but in unix time, a float measured from Jan 1st, 1970 (Wikipedia)
- original_end_time - The same information as end_time but as a float measured in unix time
- hashedEmail - A string identifier that attributed each play session to a specific individual

This granular data on the dates of player logins and their session start/end times is useful to our investigation, as it allowed us to analyze the overlap between play sessions and find trends in playtime.
<br>
<br>
The players dataframe includes information on each player, with 196 rows (one per player) and 9 columns (7 after dropping the two empty ones), titled:
- experience - A metric of self-reported experience within the game, an ordinal with categories 'amateur', 'beginner', 'regular', 'pro', and 'veteran'
- subscribe - Whether or not the player has subscribed to the server's game-related newsletter
- hashedEmail - A string identifier for each individual player; this is the same 'hashedEmail' as in the sessions dataset, which allowed us to merge both dataframes on this variable
- played_hours - The cumulative number of hours each player put into playing on the Minecraft server, reported by the player and saved as a float (a number with a decimal). We renamed it to lifetime_hours in the merged dataframe
- name - The player's name, reported by the player, saved as a string
- gender - The player's gender, reported by the player, saved as a string
- age - The player's age, reported by the player, saved as an integer value
- individualId - Contains only NaN values
- organizationName - Contains only NaN values

This data is useful as it enabled us to attribute our findings about playtime/play sessions to specific individuals, and can help in investigating further into trends in play sessions.
<h4 style="font-size: x-medium;">Issues:</h4>
In our cross-analysis of dataframes, we came across 3 main issues we thought were valuable to mention:

1.  First, we noticed that some hashedEmails were not attributed to any play sessions, which means that some registered players have never played and therefore are not tracked in the sessions dataframe.
2.  The second issue we noticed was in the self-reporting of personal information. While it is logistically impossible to collect this data otherwise, self-reporting can lead to people entering false information, meaning that our information about played hours or experience could be faulty and could even throw our analysis off entirely.
3. The final issue is the relative measure of experience, which relies on each individual's own interpretation of what a beginner/amateur/veteran/pro is.

These issues didn't end up influencing our results heavily, however, as our question centered around play sessions, not necessarily the players behind them, which means we could drop any player information that was attributed to people who never logged into the server, and could ignore most of the self-reported information that was possibly faulty.
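
As a minimal sketch of how the first issue can be checked, the example below uses a left merge with `indicator=True` to list players that have no matching session. The two small dataframes (`players_demo`, `sessions_demo`) and their values are made up for illustration and are not the project data.

```python
import pandas as pd

# Hypothetical stand-ins for the players and sessions tables (values are made up)
players_demo = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "c3"],
    "experience": ["Pro", "Amateur", "Veteran"],
})
sessions_demo = pd.DataFrame({
    "hashedEmail": ["a1", "a1", "c3"],
    "start_time": ["30/06/2024 18:12", "01/07/2024 09:00", "17/06/2024 23:33"],
})

# A left merge with indicator=True adds a "_merge" column;
# "left_only" marks players that never appear in the sessions table
merged_demo = players_demo.merge(
    sessions_demo[["hashedEmail"]].drop_duplicates(),
    on="hashedEmail",
    how="left",
    indicator=True,
)
never_played = merged_demo.loc[
    merged_demo["_merge"] == "left_only", "hashedEmail"
].tolist()

print(never_played)
```

An inner merge (the default for `merge`) silently drops these players, which is the behaviour we rely on when we merge the two dataframes on hashedEmail.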

<h4 style="font-size: xx-large;">Methods & Results</h4>
<br>
<br>
<h4 style="font-size: x-medium;">Importing/tidying:</h4>
For our investigation, our method to address question 3 incorporated a regression model to predict the activity/concurrent players vs the time of day/day of the week, then comparing them to specific categories of days like weekdays/weekends/holidays to find spikes in player activity. We thought regression would be appropriate because the goal of regression is to predict a numerical value, which is applicable for the numerical measurements of time in the sessions dataset.
<br>
<br>
Our preparation included:

1. Importing all relevant packages and functions
2. Importing our dataframes and tidying them by dropping NAN columns
3. Merging dataframes on hashedEmail

In [2]:
#Importing relevant packages/functions
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays.
# Left disabled: OneHotEncoder() returns a sparse matrix by default, which sklearn cannot
# output as a pandas dataframe, so enabling this breaks the regression model further down.
#set_config(transform_output="pandas")

import datetime  # the module itself is needed later for datetime.date(...)

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [28]:
#Merge Dataframes for simplicity. While not completely necessary, we thought that we could demonstrate our understanding of the topic at hand by trying to integrate our
#question into a wider context, and offer some potential avenues for further analysis after our project

url_players = "https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players = pd.read_csv(url_players).drop(columns = ["individualId", "organizationName"])

url_sessions = "https://drive.google.com/uc?id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"
sessions = pd.read_csv(url_sessions).drop(columns=['original_start_time','original_end_time'])

players_sessions = sessions.merge(players, on = 'hashedEmail')
players_sessions.head(3)

Unnamed: 0,hashedEmail,start_time,end_time,experience,subscribe,played_hours,name,gender,age
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,Regular,True,223.1,Hiroshi,Male,17
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,Amateur,True,53.9,Alex,Male,17
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,Amateur,True,150.0,Delara,Female,16


<h4 style="font-size: x-medium;">Wrangling</h4>
To prepare our data for our model specifically, we:

1. Converted start_time and end_time from strings to datetime objects
2. Renamed played_hours to lifetime_hours and computed the length of each session in minutes

In [29]:
#Converting time strings to datetime objects. Notice the slight change in formatting for dates and times

players_sessions["start_time"] = pd.to_datetime(players_sessions["start_time"], dayfirst=True)
players_sessions["end_time"] = pd.to_datetime(players_sessions["end_time"], dayfirst=True)
players_sessions = players_sessions.rename(columns={"played_hours" : "lifetime_hours"})
players_sessions.head(3)

Unnamed: 0,hashedEmail,start_time,end_time,experience,subscribe,lifetime_hours,name,gender,age
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,Regular,True,223.1,Hiroshi,Male,17
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,Amateur,True,53.9,Alex,Male,17
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,Amateur,True,150.0,Delara,Female,16


In [5]:
players_sessions.to_csv("unsorted_players_sessions.csv")

In [6]:
#Computing each session's length as a timedelta64 (end time minus start time) with pandas tools
timedelta = players_sessions["end_time"] - players_sessions["start_time"]
players_sessions["session_length_minutes"] = timedelta.dt.total_seconds()/60
players_sessions_timedelta = players_sessions.copy(deep=True)
players_sessions_timedelta["timedelta"] = timedelta

timedelta
#currently IS NOT USED but is nice to have around

0      0 days 00:12:00
1      0 days 00:13:00
2      0 days 00:23:00
3      0 days 00:36:00
4      0 days 00:11:00
             ...      
1530   0 days 00:06:00
1531   0 days 00:11:00
1532   0 days 00:21:00
1533   0 days 00:07:00
1534   0 days 00:19:00
Length: 1535, dtype: timedelta64[ns]

In [7]:
#REMOVE IN FINAL ANALYSIS (does not modify any dfs)
players_sessions[players_sessions['lifetime_hours'] == 0.1].sort_values(by="lifetime_hours").head()

Unnamed: 0,hashedEmail,start_time,end_time,experience,subscribe,lifetime_hours,name,gender,age,session_length_minutes
46,6b1cdc07fcc1f7ea09509341fd245dd34fdba386f14a49...,2024-06-25 22:58:00,2024-06-25 23:09:00,Veteran,True,0.1,Finnian,Non-binary,17,11.0
70,4bfad3613c71ace05644bf210195d9fb0d3d9513753ad2...,2024-07-28 08:49:00,2024-07-28 09:00:00,Veteran,False,0.1,Rocco,Male,17,11.0
89,8b71f4d66a38389b7528bb38ba6eb71157733df7d17403...,2024-04-16 05:09:00,2024-04-16 05:22:00,Amateur,True,0.1,Natalie,Male,17,13.0
103,dc73467f73263dd4a07838330dd1cc115aa3f8b0353891...,2024-05-25 01:36:00,2024-05-25 01:43:00,Veteran,True,0.1,Felix,Male,21,7.0
155,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,2024-04-25 02:04:00,2024-04-25 02:13:00,Regular,True,0.1,Kylie,Male,21,9.0


We then processed the players_sessions data into time-series data, to track the number of concurrent players at each 5-minute timestamp:

1. Creating an empty "timeline" dataframe with intervals of 5 minutes, between Apr 6 and Sept 26, 2024 (the period over which the server gathered data).
2. Using series-value comparisons (which return a series of T/F bools), we found the number of players online at each timestamp (so the entire sessions dataframe is checked for each entry in our timeline).

Note: Because each timestamp is compared against the start_time and end_time series, a session that takes place entirely between two consecutive timestamps (e.g. with 5-minute intervals, one that starts at 18:01 and ends at 18:04 without crossing 18:00 or 18:05) is not counted. This is a known limitation of our code that we were unable to resolve, which means that our ability to predict is weaker for short bursts of player activity.

We also attempted to create a nested for loop, which may have been more intuitive, but it was impractical due to how long it was taking to implement, so we stuck with this simpler approach, which still highlights all the main components of our question.
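
One possible way around the limitation described in the note is to count logins and logouts as events instead of sampling at fixed timestamps. The sketch below is not the method used in this report; it uses three made-up sessions (`sessions_demo`) to show that an event-based running total catches a short overlap that 5-minute snapshots miss.

```python
import pandas as pd

# Hypothetical sessions: the first two fall entirely between 18:00 and 18:05
sessions_demo = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2024-06-30 18:01", "2024-06-30 18:02", "2024-06-30 18:10"]
    ),
    "end_time": pd.to_datetime(
        ["2024-06-30 18:04", "2024-06-30 18:03", "2024-06-30 18:40"]
    ),
})

# Event-based count: +1 at every login, -1 at every logout
events = pd.concat([
    pd.DataFrame({"time": sessions_demo["start_time"], "change": 1}),
    pd.DataFrame({"time": sessions_demo["end_time"], "change": -1}),
])
# Sort by time; at equal times logouts (-1) come before logins (+1)
events = events.sort_values(["time", "change"]).reset_index(drop=True)
events["concurrent"] = events["change"].cumsum()
peak_demo = int(events["concurrent"].max())

# Snapshot-based count at 18:00 and 18:05, using the same comparison as our loop
snapshots = pd.date_range("2024-06-30 18:00", "2024-06-30 18:05", freq="5min")
snapshot_counts = [
    int(((t >= sessions_demo["start_time"]) & (t < sessions_demo["end_time"])).sum())
    for t in snapshots
]

print(peak_demo, snapshot_counts)
```

Here the event-based peak is 2 players, while both snapshots report 0, which is the kind of short burst our timeline cannot see.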

In [30]:
#Transforming the players_sessions data into time-series data, and tracking the concurrent players at each 5-minute timestamp
timeline = pd.date_range(
    pd.Timestamp("2024-04-06 09:20:00"), #First instance is apr 6 09:27:00
    pd.Timestamp("2024-09-26 09:15:00"), #Final instance is sept 26, 06:09:00
    freq="5min",
    name="timestamp"
)
timeline = timeline.to_frame().reset_index(drop=True)
timeline["concurrent"] = 0
timeline

#The length of this dataframe is 49824 rows. This can easily be explained by there being 173 days between Apr 6 and Sept 26. 173 * 24 * 12 = 49824 five-minute time 'slots'

Unnamed: 0,timestamp,concurrent
0,2024-04-06 09:20:00,0
1,2024-04-06 09:25:00,0
2,2024-04-06 09:30:00,0
3,2024-04-06 09:35:00,0
4,2024-04-06 09:40:00,0
...,...,...
49819,2024-09-26 08:55:00,0
49820,2024-09-26 09:00:00,0
49821,2024-09-26 09:05:00,0
49822,2024-09-26 09:10:00,0
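
The row count can be double-checked with a short calculation. This sketch rebuilds the same date range under a different name (`demo_range`, introduced here only for illustration) and compares its length to 173 days of 5-minute slots.

```python
import pandas as pd

# Same start, end, and frequency as the timeline above
demo_range = pd.date_range(
    "2024-04-06 09:20:00", "2024-09-26 09:15:00", freq="5min"
)

slots_per_day = 24 * 12          # 12 five-minute slots per hour
n_days = 173                     # Apr 6 09:20 to Sept 26 09:20 is exactly 173 days
expected_rows = n_days * slots_per_day

print(len(demo_range), expected_rows)
```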


In [56]:
#Finding concurrent players at each timestamp:

concurrent_counts = []

for snapshot in timeline.index:
    current_time = timeline.loc[snapshot, "timestamp"]

    #Boolean Series with one entry per session (1535 long): True where the session has started and has not yet ended at this timestamp.
    #It is mostly False; at most 8 entries are ever True, so there were never large player volumes.
    #This comparison is repeated for every timestamp in the timeline (49824 in total).
    sessions_active_now = (
        (current_time >= players_sessions['start_time'])
        &
        (current_time < players_sessions['end_time'])
    )

    number_active = sessions_active_now.sum()  #sum counts every True as 1 and every False as 0, giving the number of sessions active at this timestamp
    concurrent_counts.append(number_active)    #list of concurrent player counts, one per timestamp

timeline = timeline.assign(concurrent=concurrent_counts) #adding a column with the concurrent player count for each timestamp
timeline

Unnamed: 0,timestamp,concurrent
0,2024-04-06 09:20:00,0
1,2024-04-06 09:25:00,0
2,2024-04-06 09:30:00,1
3,2024-04-06 09:35:00,1
4,2024-04-06 09:40:00,1
...,...,...
49819,2024-09-26 08:55:00,0
49820,2024-09-26 09:00:00,0
49821,2024-09-26 09:05:00,0
49822,2024-09-26 09:10:00,0
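
The loop above compares every timestamp against every session, which is easy to read but slow. As a hedged alternative (not what we ran for this report), the same counts can be obtained with two sorted arrays and `np.searchsorted`: the number of sessions that have started by time t minus the number that have ended by time t. The sessions and snapshots below are made up, and the result is checked against the loop-style comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical sessions and a short 5-minute timeline
sessions_demo = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2024-04-06 09:27", "2024-04-06 09:33", "2024-04-06 09:50"]
    ),
    "end_time": pd.to_datetime(
        ["2024-04-06 09:41", "2024-04-06 09:36", "2024-04-06 10:05"]
    ),
})
snapshots = pd.date_range("2024-04-06 09:20", "2024-04-06 10:05", freq="5min")

starts = np.sort(sessions_demo["start_time"].to_numpy())
ends = np.sort(sessions_demo["end_time"].to_numpy())
ts = snapshots.to_numpy()

# side="right" counts values <= t, matching (t >= start) and not (t < end)
started = np.searchsorted(starts, ts, side="right")
ended = np.searchsorted(ends, ts, side="right")
fast_counts = started - ended

# Reference: the same comparison used in the loop above
loop_counts = np.array([
    int(((t >= sessions_demo["start_time"]) & (t < sessions_demo["end_time"])).sum())
    for t in snapshots
])

print(fast_counts.tolist())
```

Both approaches share the same limitation: sessions that fit entirely between two timestamps are still not counted.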


In [55]:
#This represents the number of timestamps that had a certain number of players active at them; note that most (~98%) had fewer than 3 players at a time
pd.Series(concurrent_counts).value_counts()

0    38996
1     7446
2     2402
3      705
4      212
5       28
7       19
6       15
8        1
Name: count, dtype: int64
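
To make the shares explicit, the sketch below copies the counts from the output above into a small Series (named `counts` here for illustration) and computes the cumulative share of timestamps at or below each player count.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {0: 38996, 1: 7446, 2: 2402, 3: 705, 4: 212, 5: 28, 6: 15, 7: 19, 8: 1}
).sort_index()

share = counts / counts.sum()
cumulative_share = share.cumsum()

# Share of timestamps with fewer than 3 players (0, 1, or 2 online)
share_under_3 = cumulative_share.loc[2]
# Share of timestamps with 5 or more players online
share_5_plus = share.loc[5:].sum()

print(round(share_under_3, 4), round(share_5_plus, 4))
```

Fewer than 3 players were online at roughly 98% of timestamps, and 5 or more players were online at only about 0.1% of them.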

## Some visualizations of timeline

Before we continue our wrangling and modeling, this visualization can help contextualize the data we've gathered so far:

In [57]:
mean_online_per_interval = timeline["concurrent"].mean()
mean_online_per_interval

np.float64(0.3127809890815671)

On average, there are about 0.3 players online at any given 5-minute timestamp.
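
An overall mean hides how activity varies across the day. As a sketch of how the same timeline could be broken down by hour of day, the example below groups a tiny made-up dataframe (`timeline_demo`, not our data) by the hour of each timestamp. We assume the timestamps are already in the time zone of interest.

```python
import pandas as pd

# Hypothetical timeline with the same two columns as ours (values are made up)
timeline_demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-06-01 03:00", "2024-06-01 03:05",
        "2024-06-01 20:00", "2024-06-01 20:05",
        "2024-06-02 20:00", "2024-06-02 20:05",
    ]),
    "concurrent": [0, 1, 3, 2, 4, 3],
})

# Mean and maximum concurrent players for each hour of the day
hourly_profile = (
    timeline_demo
    .assign(hour=timeline_demo["timestamp"].dt.hour)
    .groupby("hour")["concurrent"]
    .agg(["mean", "max"])
)
busiest_hour = int(hourly_profile["mean"].idxmax())

print(hourly_profile)
print(busiest_hour)
```

The `max` column is relevant for licence planning, since capacity has to cover the peak rather than the average.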

In [75]:
timeline2 = timeline.assign(concurrent_players = timeline['concurrent'])
timeline2.groupby('concurrent').agg('mean')

Unnamed: 0_level_0,timestamp,concurrent_players
concurrent,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2024-07-02 06:57:06.759154688,0.0
1,2024-06-29 09:16:12.884770304,1.0
2,2024-06-30 14:17:46.611157248,2.0
3,2024-07-02 16:39:45.106382848,3.0
4,2024-07-16 23:45:45.283019008,4.0
5,2024-07-19 03:56:57.857142784,5.0
6,2024-08-07 08:40:40.000000000,6.0
7,2024-09-01 04:12:53.684210688,7.0
8,2024-09-01 05:05:00.000000000,8.0


In [89]:
#Bar chart of the number of concurrent players at each 5-minute timestamp over the whole data collection period

alt.Chart(timeline2, title="Figure 1: Concurrent players at each 5-minute timestamp").mark_bar().encode(
    x=alt.X("timestamp:T").title('Timestamp'),
    y=alt.Y("concurrent:Q").title('Concurrent players')
).properties(width=1200, height=500)

## Continued wrangling + prediction: 
#### One-hot encoding attributes of a Timestamp, and predicting with Linear Regression.

To make the data easier to visualize and compare, we group timestamps by category of day, so we can isolate the kinds of days with the highest average number of concurrent players:
1. is_monday through is_sunday and is_holiday are added as one-hot encoded predictors, as can be seen in the following dataframes

In [15]:
#Day of the week for each timestamp as an integer (Monday = 0, ..., Sunday = 6)
day_of_week_int = timeline["timestamp"].dt.weekday

#One-hot encoding the day of the week into seven 0/1 indicator columns
day_of_week_one_hot = pd.get_dummies(day_of_week_int, dtype=int)
day_of_week_one_hot.columns = ["is_monday", "is_tuesday", "is_wednesday", "is_thursday", "is_friday", "is_saturday", "is_sunday"]
timeline_features = pd.concat([timeline, day_of_week_one_hot], axis=1)
timeline_features.sort_values(by="concurrent")

Unnamed: 0,timestamp,concurrent,is_monday,is_tuesday,is_wednesday,is_thursday,is_friday,is_saturday,is_sunday
16,2024-04-06 10:40:00,0,0,0,0,0,0,1,0
49823,2024-09-26 09:15:00,0,0,0,0,1,0,0,0
49822,2024-09-26 09:10:00,0,0,0,0,1,0,0,0
49821,2024-09-26 09:05:00,0,0,0,0,1,0,0,0
49820,2024-09-26 09:00:00,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...
42565,2024-09-01 04:25:00,7,0,0,0,0,0,0,1
42570,2024-09-01 04:50:00,7,0,0,0,0,0,0,1
42566,2024-09-01 04:30:00,7,0,0,0,0,0,0,1
42569,2024-09-01 04:45:00,7,0,0,0,0,0,0,1


### Adding is_holiday

In [90]:
#These are a list of the 11 statutory holidays in BC, which we'll keep in mind as they may have higher player counts (like weekends)
bc_holidays_2024 = [
    datetime.date(2024, 1, 1),   # New Year’s Day
    datetime.date(2024, 2, 19),  # Family Day
    datetime.date(2024, 3, 29),  # Good Friday
    datetime.date(2024, 5, 20),  # Victoria Day
    datetime.date(2024, 7, 1),   # Canada Day
    datetime.date(2024, 8, 5),   # B.C. Day
    datetime.date(2024, 9, 2),   # Labour Day
    datetime.date(2024, 9, 30),  # National Day for Truth and Reconciliation
    datetime.date(2024, 10, 14), # Thanksgiving Day
    datetime.date(2024, 11, 11), # Remembrance Day
    datetime.date(2024, 12, 25), # Christmas Day
]

In [91]:
#Flagging timestamps that fall on a statutory holiday (1 = holiday, 0 = not a holiday)
#.date() is required so the types match the datetime.date objects in the holiday list
#About 1100 timestamps, or 2% of the total, fall on a holiday, which checks out
is_holiday = [
    int(timestamp.date() in bc_holidays_2024)
    for timestamp in timeline["timestamp"]
]

timeline_features = timeline_features.assign(is_holiday=is_holiday)
timeline_features

Unnamed: 0,timestamp,concurrent,is_monday,is_tuesday,is_wednesday,is_thursday,is_friday,is_saturday,is_sunday,is_holiday
0,2024-04-06 09:20:00,0,0,0,0,0,0,1,0,0
1,2024-04-06 09:25:00,0,0,0,0,0,0,1,0,0
2,2024-04-06 09:30:00,1,0,0,0,0,0,1,0,0
3,2024-04-06 09:35:00,1,0,0,0,0,0,1,0,0
4,2024-04-06 09:40:00,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...
49819,2024-09-26 08:55:00,0,0,0,0,1,0,0,0,0
49820,2024-09-26 09:00:00,0,0,0,0,1,0,0,0,0
49821,2024-09-26 09:05:00,0,0,0,0,1,0,0,0,0
49822,2024-09-26 09:10:00,0,0,0,0,1,0,0,0,0


#### Predictions with linear regression:

To begin our linear regression, we follow the general method we used in class:
1. Define our response variable (the output we want to predict, concurrent) and our predictor variables (the day-of-week and holiday indicators we predict from)
2. train_test_split our data

In [95]:
#train_test_split our data
tf = timeline_features
pred_tf = tf.drop(["timestamp", "concurrent"], axis=1)
resp_tf = tf["concurrent"]

pred_tf_train, pred_tf_test, resp_tf_train, resp_tf_test = train_test_split(pred_tf, resp_tf, train_size=0.75, random_state=2025)

In [96]:
#For testing only
enc = None
pred_tf_train_transform = None
model = None

In [97]:
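#Setting up the model

#Note: the predictors are already 0/1 indicator columns, so OneHotEncoder(drop="first") keeps one column per indicator
#and leaves their values effectively unchanged (returned as a sparse matrix)
enc = OneHotEncoder(drop="first")
pred_tf_train_transform = enc.fit_transform(pred_tf_train)

model = LinearRegression()
model.fit(pred_tf_train_transform, resp_tf_train)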

In [22]:
#Our training coefficients
model.coef_

array([-0.07380724, -0.01890661, -0.03147736, -0.06712927, -0.02344687,
        0.10013512,  0.11463223,  0.12665173])

In [23]:
results_tf = pd.DataFrame(model.coef_)
results_tf.columns = ["coefficient"]
results_tf = results_tf.assign(attribute=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", "Holiday"])

The following are the coefficients of the model, associated with each of the days that we treated as categories above (weekdays, weekends, and holidays)

In [24]:
results_tf

Unnamed: 0,coefficient,attribute
0,-0.073807,Monday
1,-0.018907,Tuesday
2,-0.031477,Wednesday
3,-0.067129,Thursday
4,-0.023447,Friday
5,0.100135,Saturday
6,0.114632,Sunday
7,0.126652,Holiday


As we can see from the above dataframe, Saturdays, Sundays, and holidays were each associated with an increase of roughly 0.1 players (0.10 to 0.13) in any given 5-minute interval. Weekdays had slightly negative coefficients, ranging from about 0.07 fewer players expected on Monday to about 0.02 fewer on Tuesday and Friday. Recall that the mean number of players per interval was 0.3.

From this, we can continue our investigation, but narrow our visualization to only present the most relevant data: the holidays and weekends with larger player numbers on average. We thought this was similar to how we'd present our investigation to the server owners ourselves, if this were our role, by giving them the most relevant data to answer the question that they had (which time periods can be predicted to have the most players), while also having the modeling capability to explain specifics and explore other times.

In [98]:
pd.DataFrame(model.predict(enc.transform(pred_tf_train)))  #apply the same encoding the model was trained on



Unnamed: 0,0
0,0.409185
1,0.285603
2,0.235242
3,0.285603
4,0.290143
...,...
37363,0.285603
37364,0.290143
37365,0.423682
37366,0.423682
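
The predictions above can be reproduced by hand from the coefficients. The notebook does not print the model's intercept, so the sketch below infers it under the assumption that the first prediction shown (0.409185) belongs to a non-holiday Saturday; the function name `predict_mean_players` is introduced here only for illustration.

```python
# Coefficients copied from the results_tf table above
coefficients = {
    "Monday": -0.073807, "Tuesday": -0.018907, "Wednesday": -0.031477,
    "Thursday": -0.067129, "Friday": -0.023447, "Saturday": 0.100135,
    "Sunday": 0.114632, "Holiday": 0.126652,
}

# Assumption: 0.409185 is the prediction for a non-holiday Saturday
observed_saturday_prediction = 0.409185
implied_intercept = observed_saturday_prediction - coefficients["Saturday"]


def predict_mean_players(day, is_holiday=False):
    """Expected concurrent players: intercept + day coefficient (+ holiday coefficient)."""
    value = implied_intercept + coefficients[day]
    if is_holiday:
        value += coefficients["Holiday"]
    return value


for day in ["Monday", "Tuesday", "Friday", "Sunday"]:
    print(day, round(predict_mean_players(day), 6))
print("Holiday Monday", round(predict_mean_players("Monday", is_holiday=True), 6))
```

With this implied intercept (about 0.309), the values for Monday, Tuesday, Friday, and Sunday agree with the other predictions listed in the output (0.235242, 0.290143, 0.285603, and 0.423682) to within rounding, which supports the assumption.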


In [26]:
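#Root mean squared prediction error (RMSPE) on the held-out test set
RMSPE = mean_squared_error(
    y_true=resp_tf_test,
    y_pred=model.predict(enc.transform(pred_tf_test))  #apply the same encoding the model was trained on
)**(1/2)

RMSPE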



np.float64(0.6909179207229749)
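
To put this RMSPE in context, it helps to compare it with a baseline that always predicts the overall mean. The sketch below estimates that baseline from the value counts reported earlier in the notebook. Note that this uses all 49824 timestamps rather than only the test split, so it is an approximate comparison rather than an exact one.

```python
import numpy as np

# Counts of timestamps per concurrent-player value, copied from the value_counts() output
counts = {0: 38996, 1: 7446, 2: 2402, 3: 705, 4: 212, 5: 28, 6: 15, 7: 19, 8: 1}

values = np.array(list(counts.keys()), dtype=float)
weights = np.array(list(counts.values()), dtype=float)

# Weighted mean of concurrent players across all timestamps
overall_mean = np.average(values, weights=weights)

# RMSE of a model that always predicts the overall mean
baseline_rmse = np.sqrt(np.average((values - overall_mean) ** 2, weights=weights))

print(round(overall_mean, 4), round(baseline_rmse, 4))
```

The baseline error comes out at roughly 0.696 players, only slightly above the model's RMSPE of about 0.691, which suggests that day of the week and holidays explain only a small part of the variation in concurrent players.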

<h4 style="font-size: xx-large;">Discussion</h4>
<br>
<br>
While quite a simple analysis on a rudimentary dataset like the one provided, we believe that 

<h4 style="font-size: xx-large;">References (Cited APA7)</h4> 