EDA for plaicraft data, Andrew Dai

In [1]:
import pandas as pd
import altair as alt

## (1) Data Description:
Provide a full descriptive summary of the dataset, including information such as the number of observations, number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

## (2) Question:
Clearly state one question your group will try to answer using the selected dataset (of the questions above). Your analysis should involve the response variable of interest and one or more explanatory variables. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.

## (3) Exploratory Data Analysis and Visualization
In this assignment, you will:

- Demonstrate that the dataset can be loaded into Python.
- Do the minimum necessary wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase.
- Make a few exploratory visualizations of the data to help you understand it.
- Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc.)
- Explain any insights you gain from these plots that are relevant to address your question

Note: do not perform any predictive analysis here. We are asking for an exploration of the relevant variables to demonstrate that you understand them well before performing any additional modelling, and to identify potential problems you anticipate encountering.

## (4) Methods and Plan
Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?

## Part 1: Data description

In [30]:
players = pd.read_csv("data/players.csv")

In [31]:
sessions = pd.read_csv("data/sessions.csv")

In [32]:
#loading in the dataframes

In [33]:
players.describe()

Unnamed: 0,played_hours,age,individualId,organizationName
count,196.0,196.0,0.0,0.0
mean,5.845918,21.280612,,
std,28.357343,9.706346,,
min,0.0,8.0,,
25%,0.0,17.0,,
50%,0.1,19.0,,
75%,0.6,22.0,,
max,223.1,99.0,,


In [34]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


In [35]:
sessions.describe()

Unnamed: 0,original_start_time,original_end_time
count,1535.0,1533.0
mean,1719201000000.0,1719196000000.0
std,3557492000.0,3552813000.0
min,1712400000000.0,1712400000000.0
25%,1716240000000.0,1716240000000.0
50%,1719200000000.0,1719180000000.0
75%,1721890000000.0,1721890000000.0
max,1727330000000.0,1727340000000.0
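
The magnitudes in this summary (around 1.7e12) are consistent with Unix time expressed in milliseconds. A minimal sketch of that check, using only the standard library and assuming the values are in UTC; `unix_ms_to_utc` is a hypothetical helper name, not part of the notebook:

```python
from datetime import datetime, timezone

def unix_ms_to_utc(ms: float) -> datetime:
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# The `min` of original_start_time from the describe() output above.
earliest = unix_ms_to_utc(1712400000000.0)
print(earliest.isoformat())
```

The result falls on 2024-04-06. The stored values appear to be rounded, so the time of day should be treated as approximate.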


In [36]:
sessions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1535 entries, 0 to 1534
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   hashedEmail          1535 non-null   object 
 1   start_time           1535 non-null   object 
 2   end_time             1533 non-null   object 
 3   original_start_time  1535 non-null   float64
 4   original_end_time    1533 non-null   float64
dtypes: float64(2), object(3)
memory usage: 60.1+ KB


In [37]:
#converting time as string to datetime objects for convenience

In [38]:
sessions["start_time"] = pd.to_datetime(sessions["start_time"], dayfirst=True)

In [39]:
sessions["end_time"] = pd.to_datetime(sessions["end_time"], dayfirst=True)
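
The `dayfirst=True` argument matters because a string such as 06/04/2024 is ambiguous. The sketch below illustrates the two readings with the standard library; the raw string and its exact format are assumptions made for illustration, not values copied from the file:

```python
from datetime import datetime

raw = "06/04/2024 09:27"  # hypothetical raw string in DD/MM/YYYY HH:MM form

day_first = datetime.strptime(raw, "%d/%m/%Y %H:%M")    # read as 6 April 2024
month_first = datetime.strptime(raw, "%m/%d/%Y %H:%M")  # read as 4 June 2024

print(day_first.date(), month_first.date())
```

Parsing month-first would silently shift every session whose day number is 12 or lower to a different month.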

In [40]:
#feature engineering timedelta64 objects from end time and start time with pandas' handy tools!

In [41]:
timedelta = sessions["end_time"] - sessions["start_time"]

In [42]:
timedelta.info()

<class 'pandas.core.series.Series'>
RangeIndex: 1535 entries, 0 to 1534
Series name: None
Non-Null Count  Dtype          
--------------  -----          
1533 non-null   timedelta64[ns]
dtypes: timedelta64[ns](1)
memory usage: 12.1 KB


In [43]:
#quick showing of what timedelta can do for us:

In [44]:
sessions["start_time"][0] + timedelta[0]

Timestamp('2024-06-30 18:24:00')

In [45]:
sessions["end_time"][0]

Timestamp('2024-06-30 18:24:00')

In [51]:
sessions["session_length_minutes"] = timedelta.dt.total_seconds() / 60

In [52]:
sessions_timedelta = sessions.copy(deep=True)

In [53]:
sessions_timedelta["timedelta"] = timedelta
sessions_timedelta.head()

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,session_length_minutes,timedelta
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1719770000000.0,1719770000000.0,12.0,0 days 00:12:00
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1718670000000.0,1718670000000.0,13.0,0 days 00:13:00
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1721930000000.0,1721930000000.0,23.0,0 days 00:23:00
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1721880000000.0,1721880000000.0,36.0,0 days 00:36:00
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1716650000000.0,1716650000000.0,11.0,0 days 00:11:00


In [54]:
sessions.head()

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,session_length_minutes
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1719770000000.0,1719770000000.0,12.0
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1718670000000.0,1718670000000.0,13.0
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1721930000000.0,1721930000000.0,23.0
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1721880000000.0,1721880000000.0,36.0
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1716650000000.0,1716650000000.0,11.0


In [55]:
sessions.loc[0, "session_length_minutes"] #just to check if it works


np.float64(12.0)

In [56]:
sessions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1535 entries, 0 to 1534
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   hashedEmail             1535 non-null   object        
 1   start_time              1535 non-null   datetime64[ns]
 2   end_time                1533 non-null   datetime64[ns]
 3   original_start_time     1535 non-null   float64       
 4   original_end_time       1533 non-null   float64       
 5   session_length_minutes  1533 non-null   float64       
dtypes: datetime64[ns](2), float64(3), object(1)
memory usage: 72.1+ KB


In [57]:
sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,session_length_minutes
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1.719770e+12,1.719770e+12,12.0
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1.718670e+12,1.718670e+12,13.0
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1.721930e+12,1.721930e+12,23.0
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1.721880e+12,1.721880e+12,36.0
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1.716650e+12,1.716650e+12,11.0
...,...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-10 23:01:00,2024-05-10 23:07:00,1.715380e+12,1.715380e+12,6.0
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,2024-07-01 04:08:00,2024-07-01 04:19:00,1.719810e+12,1.719810e+12,11.0
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-28 15:36:00,2024-07-28 15:57:00,1.722180e+12,1.722180e+12,21.0
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-25 06:15:00,2024-07-25 06:22:00,1.721890e+12,1.721890e+12,7.0


In [58]:
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


### Description:

There are two CSV files, sessions.csv and players.csv.

"players" contains all registrants. It has 196 rows and 9 columns, of which 7 carry data. One row is one player's info, so 196 players are registered.

Its columns are:
- experience: self-reported experience level, with values such as Pro, Veteran, Amateur and Regular (string),
- subscribe: whether the player subscribed to the newsletter (boolean),
- hashedEmail: a player's anonymized but unique identifier (string),
- played_hours: total hours played (float),
- name: the player's name (string),
- gender (string),
- age (int),
- individualId and organizationName, which are both empty.

The "subscribe" column presumably records subscription to plaicraft's newsletter, and the rest is self-explanatory. "played_hours" is the cumulative hours played by a player.

The "sessions.csv" file is 1535 rows long and 5 columns wide. 1 row = 1 tracked session. Two rows are missing an end time.

It contains hashedEmail as an identifier, which it shares in common with "players". Note that the frames can be merged on this variable.

The next columns are "start_time", "end_time", "original_start_time", "original_end_time".

"original_start_time" and "original_end_time" record when each session began and ended in Unix time, in milliseconds since 1970-01-01 UTC. This can be verified by copying values into an online Unix time converter (after undoing the exponential notation). As loaded, these values appear rounded: the first session's start and end share the same value even though they are 12 minutes apart.

"start_time" and "end_time" hold the same moments as dates in DD/MM/YYYY form with 24-hour times. Time zones are not specified (and may be excessive for this analysis).

Also note that a few hashedEmail records in players do not have corresponding records in sessions. The reverse is not true.

This means that there are some players who registered who have never played, and therefore are not tracked in the sessions frame.

Fortunately every record in sessions has a corresponding hashedEmail in players - that is everybody who played also had their registration tracked.
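
Both claims can be checked with `Series.isin`. The sketch below uses made-up hashes in small stand-in frames (`players_demo` and `sessions_demo` are hypothetical names), so the counts it prints say nothing about the real data:

```python
import pandas as pd

# Toy stand-ins for the real frames; these hashes are invented.
players_demo = pd.DataFrame({"hashedEmail": ["a1", "b2", "c3"]})
sessions_demo = pd.DataFrame({"hashedEmail": ["a1", "a1", "c3"]})

# Players with no session rows (registered but never played).
never_played = ~players_demo["hashedEmail"].isin(sessions_demo["hashedEmail"])

# Session rows whose player is missing from players (expected: none).
orphan_sessions = ~sessions_demo["hashedEmail"].isin(players_demo["hashedEmail"])

print(never_played.sum(), orphan_sessions.sum())
```

Running the same two expressions on `players` and `sessions` would give the number of registrants without sessions and confirm that no session lacks a player record.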

## Question:

I have selected question 3, which is as follows:

Question 3: We are interested in demand forecasting, namely, what time windows are most likely to have large number of simultaneous players. This is because we need to ensure that the number of licenses on hand is sufficiently large to accommodate all parallel players with high probability. 

The data in players and sessions should help me answer that, because the latter contains very granular data on the start and end times of each play session and who played it, and the former has other info on each player.

## 3) EDA:

In [59]:
players.describe()

Unnamed: 0,played_hours,age,individualId,organizationName
count,196.0,196.0,0.0,0.0
mean,5.845918,21.280612,,
std,28.357343,9.706346,,
min,0.0,8.0,,
25%,0.0,17.0,,
50%,0.1,19.0,,
75%,0.6,22.0,,
max,223.1,99.0,,


In [60]:
players_by_played_hours = players.sort_values(by="played_hours")

In [61]:
players_by_played_hours

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
7,Amateur,False,1d2371d8a35c8831034b25bda8764539ab7db0f6393869...,0.0,Emerson,Male,21,,
6,Regular,True,8e594b8953193b26f498db95a508b03c6fe1c24bb5251d...,0.0,Luna,Female,19,,
5,Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee...,0.0,Adrian,Female,17,,
15,Amateur,False,2313a06afe47eacc28ff55adf6f072e7d12b0d12d7cbae...,0.0,Quinlan,Male,22,,
...,...,...,...,...,...,...,...,...,...
130,Amateur,True,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,56.1,Dana,Male,23,,
90,Amateur,True,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,150.0,Delara,Female,16,,
158,Regular,True,ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b1...,178.2,Piper,Female,19,,
51,Regular,True,b622593d2ef8b337dc554acb307d04a88114f2bf453b18...,218.1,Akio,Non-binary,20,,


In [62]:
unique_vs_cumulative_hours = alt.Chart(players_by_played_hours).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played")
).properties(width = 2000)

unique_vs_cumulative_hours

In [63]:
under20 = players[players["played_hours"] < 20]

In [64]:
unique_vs_cumulative_hours_under20 = alt.Chart(under20).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played")
).properties(width=1000)

unique_vs_cumulative_hours_under20

In [65]:
(players["played_hours"] < 1).sum()

np.int64(154)

In [66]:
(players["played_hours"] < 2).sum()

np.int64(170)

In [67]:
unique_vs_cumulative_hours_and_experience = alt.Chart(players_by_played_hours).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played"),
    color=alt.Color("experience")
).properties(width = 2000)

unique_vs_cumulative_hours_and_experience


In [68]:
unique_vs_cumulative_hours_and_experience_under20 = alt.Chart(under20).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played"),
    color=alt.Color("experience")
).properties(width = 2000)

unique_vs_cumulative_hours_and_experience_under20

In [69]:
# Same bar chart as above, coloured by newsletter subscription.
unique_vs_hours_and_subscribed = alt.Chart(players_by_played_hours).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played"),
    color=alt.Color("subscribe")
).properties(width = 2000)

unique_vs_hours_and_subscribed

In [70]:
alt.Chart(under20).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played"),
    color=alt.Color("subscribe")
).properties(width = 2000)

In [71]:
# I could do more but i think it's time to move on:

### info for sessions.csv

In [72]:
sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,session_length_minutes
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1.719770e+12,1.719770e+12,12.0
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1.718670e+12,1.718670e+12,13.0
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1.721930e+12,1.721930e+12,23.0
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1.721880e+12,1.721880e+12,36.0
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1.716650e+12,1.716650e+12,11.0
...,...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-10 23:01:00,2024-05-10 23:07:00,1.715380e+12,1.715380e+12,6.0
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,2024-07-01 04:08:00,2024-07-01 04:19:00,1.719810e+12,1.719810e+12,11.0
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-28 15:36:00,2024-07-28 15:57:00,1.722180e+12,1.722180e+12,21.0
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-25 06:15:00,2024-07-25 06:22:00,1.721890e+12,1.721890e+12,7.0


In [73]:
sessions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1535 entries, 0 to 1534
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   hashedEmail             1535 non-null   object        
 1   start_time              1535 non-null   datetime64[ns]
 2   end_time                1533 non-null   datetime64[ns]
 3   original_start_time     1535 non-null   float64       
 4   original_end_time       1533 non-null   float64       
 5   session_length_minutes  1533 non-null   float64       
dtypes: datetime64[ns](2), float64(3), object(1)
memory usage: 72.1+ KB


In [74]:
sessions.sort_values(by="start_time")

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,session_length_minutes
1050,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-04-06 09:27:00,2024-04-06 09:31:00,1.712400e+12,1.712400e+12,4.0
894,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-04-06 09:35:00,2024-04-06 10:16:00,1.712400e+12,1.712400e+12,41.0
1247,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-04-06 20:56:00,2024-04-06 22:04:00,1.712440e+12,1.712440e+12,68.0
417,f6daba428a5e19a3d47574858c13550499be23603422e6...,2024-04-06 22:24:00,2024-04-06 23:33:00,1.712440e+12,1.712450e+12,69.0
94,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-04-07 00:17:00,2024-04-07 00:28:00,1.712450e+12,1.712450e+12,11.0
...,...,...,...,...,...,...
301,b622593d2ef8b337dc554acb307d04a88114f2bf453b18...,2024-09-21 05:07:00,2024-09-21 06:58:00,1.726900e+12,1.726900e+12,111.0
397,b622593d2ef8b337dc554acb307d04a88114f2bf453b18...,2024-09-21 21:13:00,2024-09-21 22:14:00,1.726950e+12,1.726960e+12,61.0
722,a175d4741dc84e6baf77901f6e8e0a06f54809a34e6b52...,2024-09-21 23:49:00,2024-09-22 00:23:00,1.726960e+12,1.726960e+12,34.0
1365,7c0ae28a5f85a515a8063f9ed989aa26c5ebcc64f6b7be...,2024-09-24 06:30:00,2024-09-24 06:39:00,1.727160e+12,1.727160e+12,9.0


In [75]:
unique_vs_start_time = alt.Chart(sessions).mark_bar().encode(
    x=alt.X("hashedEmail").sort("y").title("Unique registrants"),
    y=alt.Y("start_time:T").title("Session start time").axis(tickCount=30, format='%Y-%m-%d'),
    #color=alt.Color("experience")
).properties(width = 2000, height=800 )

unique_vs_start_time

In [76]:
sessions["session_length_minutes"].sum() / 60

np.float64(1299.4333333333334)

## 4) Methods and plan



One method to address this question is a regression model that predicts the number of concurrent players from the time of day and the day of the week.

I believe this method is appropriate because the goal of regression is to predict a numerical output for new inputs, using existing input-output pairs as training data.
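
The time-of-day and day-of-week inputs can be derived from a timestamp column with the pandas `.dt` accessor. The sketch below assumes an hourly table with `timestamp` and `concurrent` columns; the rows are made up for illustration and no model is fitted:

```python
import pandas as pd

# Tiny invented table in the shape of an hourly "concurrent players" series.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-05-24 02:27", "2024-05-24 03:27",
        "2024-05-25 02:27", "2024-05-25 15:27",
    ]),
    "concurrent": [5, 4, 3, 0],
})

# Candidate explanatory variables for the regression.
demo["hour"] = demo["timestamp"].dt.hour
demo["day_of_week"] = demo["timestamp"].dt.dayofweek  # Monday = 0

# Average concurrency per hour of day: a quick look at which windows are busiest.
mean_by_hour = demo.groupby("hour")["concurrent"].mean()
print(mean_by_hour)
```

A summary like `mean_by_hour` is exploratory only; it helps judge whether hour of day carries enough signal to be worth modelling.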

One limitation is the difficulty of wrangling the data, as sessions is structured as 1 row = 1 session, and not 1 row = a fixed amount of time.
I would have to construct a time series of "concurrent players online" from the existing data.
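
One way to get concurrency out of one-row-per-session data is an event sweep: turn each session into a +1 event at its start and a -1 event at its end, sort the events, and keep a running total. The sketch below uses only the standard library and invented sessions; `peak_concurrency` is a hypothetical helper, and this is an alternative to sampling at fixed times rather than the approach used in this notebook:

```python
from datetime import datetime

def peak_concurrency(intervals):
    """Return the maximum number of overlapping [start, end) intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a session opens
        events.append((end, -1))    # a session closes
    # Sort by time; at equal times, closes (-1) sort before opens (+1),
    # so back-to-back sessions are not counted as overlapping.
    events.sort()
    active = peak = 0
    for _, change in events:
        active += change
        peak = max(peak, active)
    return peak

# Invented sessions: the third starts exactly when the first ends.
demo_sessions = [
    (datetime(2024, 6, 30, 18, 12), datetime(2024, 6, 30, 18, 24)),
    (datetime(2024, 6, 30, 18, 20), datetime(2024, 6, 30, 18, 40)),
    (datetime(2024, 6, 30, 18, 24), datetime(2024, 6, 30, 19, 0)),
]
print(peak_concurrency(demo_sessions))  # prints 2
```

Because the sweep looks at every start and end, it also catches short overlaps that fall between fixed sampling times.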


Another limitation is that I can't account for seasonal spikes (e.g. during a holiday), as my data only runs from April to September 2024.

I will split the data into training and testing sets, and use cross-validation within the training set.


## 5) Analysis/Further processing

In [206]:
sessions.head()

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,session_length_minutes
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1719770000000.0,1719770000000.0,12.0
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1718670000000.0,1718670000000.0,13.0
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1721930000000.0,1721930000000.0,23.0
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1721880000000.0,1721880000000.0,36.0
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1716650000000.0,1716650000000.0,11.0


In [246]:
sessions_timedelta.head()

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,session_length_minutes,timedelta
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1719770000000.0,1719770000000.0,12.0,0 days 00:12:00
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1718670000000.0,1718670000000.0,13.0,0 days 00:13:00
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1721930000000.0,1721930000000.0,23.0,0 days 00:23:00
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1721880000000.0,1721880000000.0,36.0,0 days 00:36:00
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1716650000000.0,1716650000000.0,11.0,0 days 00:11:00


In [269]:
sessions['end_time'].max()

Timestamp('2024-09-26 07:39:00')

In [270]:
sessions['start_time'].min()

Timestamp('2024-04-06 09:27:00')

In [271]:
#these will be our bounds 

In [272]:
timeline = pd.date_range(
    sessions['start_time'].min(),
    sessions['end_time'].max(),
    freq="1h",
    name="timestamp"
)

In [273]:
timeline = timeline.to_frame().reset_index(drop=True)
timeline

Unnamed: 0,timestamp
0,2024-04-06 09:27:00
1,2024-04-06 10:27:00
2,2024-04-06 11:27:00
3,2024-04-06 12:27:00
4,2024-04-06 13:27:00
...,...
4146,2024-09-26 03:27:00
4147,2024-09-26 04:27:00
4148,2024-09-26 05:27:00
4149,2024-09-26 06:27:00


In [296]:
timeline["concurrent"] = 0
#timeline.head()

In [297]:
timeline["timestamp"][2]

Timestamp('2024-04-06 11:27:00')

In [298]:
#timeline.loc[0, "concurrent"] = timeline.loc[0, "concurrent"] + 1

In [299]:
timeline.shape

(4151, 2)

In [300]:
# for n in sessions.index:
#     if sessions["start_time"][n] < timeline["timestamp"][n] < sessions["end_time"][n]:
#         timeline.loc[n, "concurrent"] = timeline.loc[n, "concurrent"] + 1

# timeline

In [301]:
timeline

Unnamed: 0,timestamp,concurrent
0,2024-04-06 09:27:00,0
1,2024-04-06 10:27:00,0
2,2024-04-06 11:27:00,0
3,2024-04-06 12:27:00,0
4,2024-04-06 13:27:00,0
...,...,...
4146,2024-09-26 03:27:00,0
4147,2024-09-26 04:27:00,0
4148,2024-09-26 05:27:00,0
4149,2024-09-26 06:27:00,0


In [302]:
#sessions["start_time"][4] < sessions["start_time"][0]

In [303]:
#sessions["start_time"][0:10]

In [304]:
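concurrent_counts = []

# For each hourly timestamp, count the sessions that started at or before it
# and had not yet ended (rows with a missing end_time are never counted).
for current_time in timeline.index:
    
    sessions_active_now = (
        (sessions['start_time'] <= timeline["timestamp"][current_time]) & 
        (sessions['end_time'] > timeline["timestamp"][current_time])
    )

    number_active = sessions_active_now.sum()

    concurrent_counts.append(number_active)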
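
The loop above compares every timestamp against every session. A vectorized alternative, sketched here on invented data, relies on the identity active(t) = #(start <= t) - #(end <= t) and uses `numpy.searchsorted` on sorted start and end times. The frame and variable names below are stand-ins for `sessions` and `timeline`, not the notebook's own objects:

```python
import numpy as np
import pandas as pd

# Made-up sessions and sample times; the third session has no end time.
demo_sessions = pd.DataFrame({
    "start_time": pd.to_datetime(["2024-04-06 09:00", "2024-04-06 09:20", "2024-04-06 11:00"]),
    "end_time": pd.to_datetime(["2024-04-06 09:40", "2024-04-06 10:30", None]),
})
sample_times = pd.date_range("2024-04-06 09:27", periods=3, freq="1h").to_numpy()

# Rows without an end time never satisfy `end_time > t`, so drop them up front.
complete = demo_sessions.dropna(subset=["end_time"])
starts = np.sort(complete["start_time"].to_numpy())
ends = np.sort(complete["end_time"].to_numpy())

# side="right" counts values <= t, matching `start <= t` and `end > t`.
started = np.searchsorted(starts, sample_times, side="right")
ended = np.searchsorted(ends, sample_times, side="right")
concurrent = started - ended
print(concurrent.tolist())  # prints [2, 1, 0]
```

On a dataset of this size the loop is fast enough; the vectorized form mainly helps if the timeline is resampled at a finer frequency.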

In [305]:
timeline = timeline.assign(concurrent=concurrent_counts)
timeline

Unnamed: 0,timestamp,concurrent
0,2024-04-06 09:27:00,1
1,2024-04-06 10:27:00,0
2,2024-04-06 11:27:00,0
3,2024-04-06 12:27:00,0
4,2024-04-06 13:27:00,0
...,...,...
4146,2024-09-26 03:27:00,0
4147,2024-09-26 04:27:00,0
4148,2024-09-26 05:27:00,0
4149,2024-09-26 06:27:00,1


In [306]:
timeline["concurrent"].max()

7

In [307]:
# concurrent_counts = []

# for current_time in sample.index:
    
#     sessions_active_now = (
#         (sessions['start_time'] <= timeline2["timestamp"][current_time]) & 
#         (sessions['end_time'] > timeline2["timestamp"][current_time])
#     )

#     number_active = sessions_active_now.sum()

#     concurrent_counts.append(number_active)

In [308]:
#timeline[0:1]

In [326]:
timeline[timeline["concurrent"] > 3]

Unnamed: 0,timestamp,concurrent
710,2024-05-05 23:27:00,4
1145,2024-05-24 02:27:00,5
1146,2024-05-24 03:27:00,4
1687,2024-06-15 16:27:00,4
1715,2024-06-16 20:27:00,4
2028,2024-06-29 21:27:00,4
2586,2024-07-23 03:27:00,4
2731,2024-07-29 04:27:00,4
2779,2024-07-31 04:27:00,4
2850,2024-08-03 03:27:00,4


In [327]:
timeline_chart = alt.Chart(timeline).mark_bar().encode(
    x=alt.X("timestamp:T").title("Time (hourly samples)"),
    y=alt.Y("concurrent:Q").title("Concurrent sessions")
).properties(width=4000)

In [328]:
timeline_chart

In [329]:
timeline[0:700]

Unnamed: 0,timestamp,concurrent
0,2024-04-06 09:27:00,1
1,2024-04-06 10:27:00,0
2,2024-04-06 11:27:00,0
3,2024-04-06 12:27:00,0
4,2024-04-06 13:27:00,0
...,...,...
695,2024-05-05 08:27:00,0
696,2024-05-05 09:27:00,0
697,2024-05-05 10:27:00,0
698,2024-05-05 11:27:00,0


In [330]:
alt.Chart(timeline[0:700]).mark_bar().encode(
    x=alt.X("timestamp:T").title("Time (hourly samples)"),
    y=alt.Y("concurrent:Q").title("Concurrent sessions")
).properties(width=4000)