Create an electronic report in English with a maximum of 2000 words (excluding citations) using Jupyter. The report should include the posed question, conducted analysis, and derived conclusion. Only one team member needs to submit this report. It is not required to include all tasks completed by every group member in their individual assignments; tailor the final report to the collective group's work. 

You must submit 2 files: an .html file (File -> Download As -> HTML) and an .ipynb file. The .ipynb file must be fully reproducible: it must run completely from top to bottom without any additional files.

**Title Introduction:**
provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report
clearly state the question you tried to answer with your project
identify and fully describe the dataset that was used to answer the question

**Methods & Results:**
describe the methods you used to perform your analysis from beginning to end that narrates the analysis code.
your report should include code which:
loads data 
wrangles and cleans the data to the format necessary for the planned analysis
performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
performs the data analysis
creates a visualization of the analysis 
note: all figures should have a figure number and a legend

**Discussion:**
summarize what you found
discuss whether this is what you expected to find?
discuss what impact could such findings have?
discuss what future questions could this lead to?

**References:**
You may include references if necessary, as long as they all have a consistent citation style.

<h2 style="font-size: xxx-large;">Group 21 Project.</h2>

<h4 style="font-size: xx-large;">Introduction</h4>

<h4 style="font-size: x-medium;">Introduction to our question:</h4>
For this project, our group chose to investigate question 3 on demand forecasting, namely "what time windows are most likely to have large number of simultaneous players", so that the server has enough licences to keep up with concurrent players. Reflecting the course's philosophy of transform -> visualize -> model -> repeat, and our focus on modeling over the last few weeks, we were specifically interested in whether we could use the preexisting server data to predict which times are likely to have the most players, and the probable upper bound on player activity, in order to advise the server team on whether they need to increase their server capacity.
<br>
<h4 style="font-size: x-medium;">The data:</h4>
In order to begin our investigation, we were given 2 dataframes, one on player information and one on play sessions information, both of which we deemed important for us to answer our question:

The sessions dataframe includes 1606 rows, each corresponding to a single play session, and 5 columns, titled:
- start_time - The date and time that a player logged onto the server and began to play, as a string with the date in DD/MM/YYYY format followed by a 24-hour time.
- end_time - The date and time that a player logged off the server after stopping playing or being kicked off, as a string with the date in DD/MM/YYYY format followed by a 24-hour time.
- original_start_time - The same information as start_time but measured in Unix time, stored as a float that counts the time elapsed since Jan 1st, 1970 (Wikipedia). The magnitude of the values in our data (around 1.7e12) indicates that the unit is milliseconds.
- original_end_time - The same information as end_time, also stored as a float measured in Unix time
- hashedEmail - A string identifier that attributes each play session to a specific individual
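
The Unix-time columns can be turned back into readable timestamps. The following is a minimal sketch, assuming the floats count milliseconds since the epoch and are in UTC. The name `raw_ms` and the two sample values are illustrative; they are the rounded values shown in the table output, so the result will only be close to the matching start_time strings rather than identical.

```python
import pandas as pd

# Illustrative sample: rounded values as displayed for original_start_time.
raw_ms = pd.Series([1.719770e12, 1.718670e12])

# Assumption: the floats are milliseconds since 1970-01-01 00:00 UTC.
converted = pd.to_datetime(raw_ms, unit="ms", utc=True)
print(converted)
```

Because the displayed floats are rounded, start_time and end_time remain the more precise source for session boundaries.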

This granular data on the dates of player logins and their session start/end times is useful to our investigation as it allowed us to analyze the overlap between play sessions and find trends in playtime.
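
One possible way to measure overlap between sessions is to treat every login as a +1 event and every logout as a -1 event, then take a running total over time. The sketch below uses a small made-up dataframe (`toy`) rather than the real sessions, and it is only one way to count concurrent players, not necessarily the method used later in this report.

```python
import pandas as pd

# Hypothetical toy sessions; the real data would use start_time and end_time.
toy = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2024-06-30 18:00", "2024-06-30 18:10", "2024-06-30 18:50"]
    ),
    "end_time": pd.to_datetime(
        ["2024-06-30 18:30", "2024-06-30 19:00", "2024-06-30 19:20"]
    ),
})

# One +1 event per login and one -1 event per logout.
events = pd.concat([
    pd.DataFrame({"time": toy["start_time"], "change": 1}),
    pd.DataFrame({"time": toy["end_time"], "change": -1}),
])

# Sort by time; at equal times, logouts (-1) are processed before logins (+1).
events = events.sort_values(["time", "change"]).reset_index(drop=True)

# The running total is the number of players online after each event.
events["concurrent"] = events["change"].cumsum()
peak = events["concurrent"].max()
print(events)
print("Peak concurrent players:", peak)
```

Processing logouts before logins at identical timestamps is a design choice: it avoids counting a player who leaves at 18:30 as overlapping with one who joins at 18:30.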
<br>
<br>
The players dataframe includes information on each player, with 196 rows, one per player, and 9 columns (two of which contain only NaN values), titled:
- experience - A metric of self-reported experience within the game, an ordinal with categories 'amateur', 'beginner', 'regular', 'pro', and 'veteran'
- subscribe - Whether the player has or has not subscribed to the server's game-related newsletter
- hashedEmail - A string identifier for each player; this is the same 'hashedEmail' as in the sessions dataset, which allowed us to merge the two dataframes on this variable
- played_hours - The number of hours each player cumulatively put into playing on the Minecraft server, reported by the player and saved as a float (a number with a decimal)
- name - The player's name, reported by the player, saved as a string
- gender - The player's gender, reported by the player, saved as a string
- age - The player's age, reported by the player, saved as an integer value
- individualId - Contains only NaN values
- organizationName - Contains only NaN values

This data is useful as it enabled us to attribute our findings about playtime/play sessions to specific individuals, and can help in investigating further into trends in play sessions.
<h4 style="font-size: x-medium;">Issues:</h4>
In our cross-analysis of dataframes, we came across 3 main issues we thought were valuable to mention:

1.  First, we noticed that there were hashedEmails not attributed to any play sessions, which means that some players registered but never played, and are therefore not tracked in the sessions dataframe.
2.  The second issue we noticed was in the self-reporting process of personal information. While it is logistically impossible to do otherwise, self-reporting can lead to people lying or entering false information, meaning that our information about played hours or experience could be faulty, and could even throw our analysis off entirely.
3. The final issue is the relative measure of experience, which relies on each individual's own interpretation of what a beginner/amateur/veteran/pro is.

These issues didn't end up influencing our results heavily, however, as our question centered around play sessions, not necessarily the players behind them, which means we could drop any player information that was attributed to people who never logged into the server, and could ignore most of the self-reported information that was possibly faulty.
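
The first issue can be checked directly by counting sessions per hashedEmail and left-merging the counts onto the players table. The sketch below runs on made-up stand-ins (`players_toy`, `sessions_toy`) for the two dataframes, so the identifiers and values are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-ins for the players and sessions dataframes.
players_toy = pd.DataFrame({
    "hashedEmail": ["a", "b", "c"],
    "played_hours": [1.5, 0.0, 3.0],
})
sessions_toy = pd.DataFrame({"hashedEmail": ["a", "a", "c"]})

# Number of sessions recorded for each player that has at least one session.
session_counts = (
    sessions_toy.groupby("hashedEmail").size().rename("n_sessions").reset_index()
)

# A left merge keeps every player; indicator=True adds a _merge column that
# marks players with no matching session rows as "left_only".
check = players_toy.merge(
    session_counts, on="hashedEmail", how="left", indicator=True
)
no_sessions = check[check["_merge"] == "left_only"]
print(no_sessions)
```

An inner merge, which is the default for `DataFrame.merge`, would silently leave these players out, which is why they do not appear in a merged sessions table.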

<h4 style="font-size: xx-large;">Methods & Results</h4>
<br>
<br>
<h4 style="font-size: x-medium;">Importing/tidying:</h4>
For our investigation of question 3, our preparation included:

1. Importing all relevant packages and functions
2. Importing our dataframes and tidying them by dropping the columns that contain only NaN values
3. Merging dataframes on hashedEmail

In [1]:
#Importing relevant packages/functions
import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

# Simplify working with large datasets in Altair
alt.data_transformers.enable('vegafusion')

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [2]:
# Load both dataframes and merge them for simplicity
url_players = "https://drive.google.com/uc?id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
# individualId and organizationName contain only NaN values, so drop them
players = pd.read_csv(url_players).drop(columns = ["individualId", "organizationName"])

url_sessions = "https://drive.google.com/uc?id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"
sessions = pd.read_csv(url_sessions)

# merge() defaults to an inner join, so only hashedEmails present in both dataframes are kept
players_sessions = sessions.merge(players, on = 'hashedEmail')
players_sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,experience,subscribe,played_hours,name,gender,age
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,30/06/2024 18:12,30/06/2024 18:24,1.719770e+12,1.719770e+12,Regular,True,223.1,Hiroshi,Male,17
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,17/06/2024 23:33,17/06/2024 23:46,1.718670e+12,1.718670e+12,Amateur,True,53.9,Alex,Male,17
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,25/07/2024 17:34,25/07/2024 17:57,1.721930e+12,1.721930e+12,Amateur,True,150.0,Delara,Female,16
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,25/07/2024 03:22,25/07/2024 03:58,1.721880e+12,1.721880e+12,Regular,True,223.1,Hiroshi,Male,17
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,25/05/2024 16:01,25/05/2024 16:12,1.716650e+12,1.716650e+12,Amateur,True,53.9,Alex,Male,17
...,...,...,...,...,...,...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,10/05/2024 23:01,10/05/2024 23:07,1.715380e+12,1.715380e+12,Amateur,True,53.9,Alex,Male,17
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,01/07/2024 04:08,01/07/2024 04:19,1.719810e+12,1.719810e+12,Veteran,True,1.6,Lane,Female,23
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,28/07/2024 15:36,28/07/2024 15:57,1.722180e+12,1.722180e+12,Amateur,True,56.1,Dana,Male,23
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,25/07/2024 06:15,25/07/2024 06:22,1.721890e+12,1.721890e+12,Amateur,True,56.1,Dana,Male,23


<h4 style="font-size: x-medium;">Pre-modeling:</h4>
To prepare our data for our model specifically, we:

1. Converted start_time and end_time from strings to datetime objects
2. Computed the length of each play session in minutes

In [3]:
# Convert the time strings to datetime objects; dayfirst=True parses them as DD/MM/YYYY. Notice the slight change in formatting
players_sessions["start_time"] = pd.to_datetime(players_sessions["start_time"], dayfirst=True)
players_sessions["end_time"] = pd.to_datetime(players_sessions["end_time"], dayfirst=True)
players_sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,experience,subscribe,played_hours,name,gender,age
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1.719770e+12,1.719770e+12,Regular,True,223.1,Hiroshi,Male,17
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1.718670e+12,1.718670e+12,Amateur,True,53.9,Alex,Male,17
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1.721930e+12,1.721930e+12,Amateur,True,150.0,Delara,Female,16
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1.721880e+12,1.721880e+12,Regular,True,223.1,Hiroshi,Male,17
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1.716650e+12,1.716650e+12,Amateur,True,53.9,Alex,Male,17
...,...,...,...,...,...,...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-10 23:01:00,2024-05-10 23:07:00,1.715380e+12,1.715380e+12,Amateur,True,53.9,Alex,Male,17
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,2024-07-01 04:08:00,2024-07-01 04:19:00,1.719810e+12,1.719810e+12,Veteran,True,1.6,Lane,Female,23
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-28 15:36:00,2024-07-28 15:57:00,1.722180e+12,1.722180e+12,Amateur,True,56.1,Dana,Male,23
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-25 06:15:00,2024-07-25 06:22:00,1.721890e+12,1.721890e+12,Amateur,True,56.1,Dana,Male,23


In [4]:
# Subtract start_time from end_time to get timedelta64 values, then convert them to minutes
timedelta = players_sessions["end_time"] - players_sessions["start_time"]
players_sessions["session_length_minutes"] = timedelta.dt.total_seconds()/60
players_sessions_timedelta = players_sessions.copy(deep=True)
players_sessions_timedelta["timedelta"] = timedelta

timedelta

0      0 days 00:12:00
1      0 days 00:13:00
2      0 days 00:23:00
3      0 days 00:36:00
4      0 days 00:11:00
             ...      
1530   0 days 00:06:00
1531   0 days 00:11:00
1532   0 days 00:21:00
1533   0 days 00:07:00
1534   0 days 00:19:00
Length: 1535, dtype: timedelta64[ns]
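
With session lengths available, one simple exploratory summary is to group sessions by the hour in which they started. The sketch below rebuilds the first three rows of the table above in a small dataframe called `toy`; the `start_hour` column and the choice of count and mean as summaries are illustrative assumptions, not part of the analysis above.

```python
import pandas as pd

# First three sessions from the table above, rebuilt by hand for illustration.
toy = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2024-06-30 18:12", "2024-06-17 23:33", "2024-07-25 17:34"]
    ),
    "end_time": pd.to_datetime(
        ["2024-06-30 18:24", "2024-06-17 23:46", "2024-07-25 17:57"]
    ),
})

# Same calculation as in the cell above: the timedelta expressed in minutes.
toy["session_length_minutes"] = (
    toy["end_time"] - toy["start_time"]
).dt.total_seconds() / 60

# Hour of day (0-23) in which each session started.
toy["start_hour"] = toy["start_time"].dt.hour

# Number of sessions and mean session length for each starting hour.
by_hour = toy.groupby("start_hour")["session_length_minutes"].agg(["count", "mean"])
print(by_hour)
```

Grouping by start hour only shows when sessions begin; it does not by itself count how many players are online at the same time.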

In [6]:
# Working note: is this filter useful, or is it working against us, given the earlier merge?
sessions_dropped = players_sessions.drop(players_sessions[players_sessions['played_hours'] == 0].index)
# Check for sessions of length 0: none are logged, even though the players dataframe has a few players with no playtime
players_sessions[players_sessions['session_length_minutes'] == 0].head(40)

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,experience,subscribe,played_hours,name,gender,age,session_length_minutes


<h4 style="font-size: xx-large;">Discussion</h4>
<br>
<br>
While quite a simple analysis on a rudimentary dataset like the one provided, we believe that 

<h4 style="font-size: xx-large;">References (Cited ____)</h4> 
<br>
<br>
- Wikipedia, "Unix time": https://en.wikipedia.org/wiki/Unix_time
- Stack Overflow, question 56611698: https://stackoverflow.com/questions/56611698/pandas-how-to-read-csv-file-from-google-drive-public
