In [184]:
import pandas as pd
import altair as alt

## (1) Data Description:
Provide a full descriptive summary of the dataset, including information such as the number of observations, number of variables, name and type of variables, what the variables mean, any issues you see in the data, any other potential issues related to things you cannot directly see, how the data were collected, etc. Make sure to use bullet point lists or tables to summarize the variables in an easy-to-understand format.

Note that the selected dataset will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. You need to summarize the full data regardless of which variables you may choose to use later on.

## (2) Question:
Clearly state one question your group will try to answer using the selected dataset (of the questions above). Your analysis should involve the response variable of interest and one or more explanatory variables. Describe clearly how the data will help you address the question of interest. You may need to describe how you plan to wrangle your data to get it into a form where you can apply one of the predictive methods from this class.

## (3) Exploratory Data Analysis and Visualization
In this assignment, you will:

- Demonstrate that the dataset can be loaded into Python.
- Do the minimum necessary wrangling to turn your data into a tidy format. Do not do any additional wrangling here; that will happen later during the group project phase.
- Make a few exploratory visualizations of the data to help you understand it.
- Use our visualization best practices to make high-quality plots (make sure to include labels, titles, units of measurement, etc.).
- Explain any insights you gain from these plots that are relevant to addressing your question.

Note: do not perform any predictive analysis here. We are asking for an exploration of the relevant variables to demonstrate that you understand them well before performing any additional modelling, and to identify potential problems you anticipate encountering.

## (4) Methods and Plan
Propose one method to address your question of interest using the selected dataset and explain why it was chosen. Do not perform any modelling or present results at this stage. We are looking for high-level planning regarding model choice and justifying that choice.

In your explanation, respond to the following questions:

- Why is this method appropriate?
- Which assumptions are required, if any, to apply the method selected?
- What are the potential limitations or weaknesses of the method selected?
- How are you going to compare and select the model?
- How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross-validation?

## Part 1: Data description

In [310]:
players = pd.read_csv("data/players.csv")

In [311]:
sessions = pd.read_csv("data/sessions.csv")

In [312]:
# The two cells above load players.csv and sessions.csv into DataFrames

In [313]:
players.describe()

Unnamed: 0,played_hours,age,individualId,organizationName
count,196.0,196.0,0.0,0.0
mean,5.845918,21.280612,,
std,28.357343,9.706346,,
min,0.0,8.0,,
25%,0.0,17.0,,
50%,0.1,19.0,,
75%,0.6,22.0,,
max,223.1,99.0,,


In [314]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   experience        196 non-null    object 
 1   subscribe         196 non-null    bool   
 2   hashedEmail       196 non-null    object 
 3   played_hours      196 non-null    float64
 4   name              196 non-null    object 
 5   gender            196 non-null    object 
 6   age               196 non-null    int64  
 7   individualId      0 non-null      float64
 8   organizationName  0 non-null      float64
dtypes: bool(1), float64(3), int64(1), object(4)
memory usage: 12.6+ KB


In [315]:
sessions.describe()

Unnamed: 0,original_start_time,original_end_time
count,1535.0,1533.0
mean,1719201000000.0,1719196000000.0
std,3557492000.0,3552813000.0
min,1712400000000.0,1712400000000.0
25%,1716240000000.0,1716240000000.0
50%,1719200000000.0,1719180000000.0
75%,1721890000000.0,1721890000000.0
max,1727330000000.0,1727340000000.0


In [316]:
sessions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1535 entries, 0 to 1534
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   hashedEmail          1535 non-null   object 
 1   start_time           1535 non-null   object 
 2   end_time             1533 non-null   object 
 3   original_start_time  1535 non-null   float64
 4   original_end_time    1533 non-null   float64
dtypes: float64(2), object(3)
memory usage: 60.1+ KB


In [317]:
# Convert the start/end time strings to datetime objects for convenience

In [318]:
sessions["start_time"] = pd.to_datetime(sessions["start_time"], dayfirst=True)

In [319]:
sessions["end_time"] = pd.to_datetime(sessions["end_time"], dayfirst=True)

In [320]:
sessions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1535 entries, 0 to 1534
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   hashedEmail          1535 non-null   object        
 1   start_time           1535 non-null   datetime64[ns]
 2   end_time             1533 non-null   datetime64[ns]
 3   original_start_time  1535 non-null   float64       
 4   original_end_time    1533 non-null   float64       
dtypes: datetime64[ns](2), float64(2), object(1)
memory usage: 60.1+ KB


In [321]:
sessions["session_length"] = sessions["end_time"] - sessions["start_time"]

In [322]:
# Timedelta columns expose .dt.total_seconds(); there is no .dt.total_minutes, so divide seconds by 60

In [323]:
sessions["session_length_minutes"] = sessions["session_length"].dt.total_seconds() / 60

In [325]:
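# Spot-check the first row: the session length in minutes should be a plain float
first_session_minutes = sessions.loc[0, "session_length_minutes"]
first_session_minutes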

np.float64(12.0)
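`sessions.info()` reports only 1533 non-null `end_time` values out of 1535 rows, so two sessions will have no computable length. The following is a minimal sketch on made-up rows (the `toy` frame and its day-first string format are assumptions, not the real file contents) showing that a missing end time propagates to `NaT` and then to `NaN` minutes rather than raising an error:

```python
import pandas as pd

# Hypothetical rows; the day-first string format is an assumption about the raw CSV
toy = pd.DataFrame({
    "start_time": ["30/06/2024 18:12", "17/06/2024 23:33"],
    "end_time": ["30/06/2024 18:24", None],  # second session has no recorded end
})

# Parse both columns the same way the notebook does
for col in ["start_time", "end_time"]:
    toy[col] = pd.to_datetime(toy[col], dayfirst=True)

# Timedelta -> minutes; a missing end_time gives NaT, which becomes NaN here
toy["session_length_minutes"] = (
    (toy["end_time"] - toy["start_time"]).dt.total_seconds() / 60
)

# Count sessions whose length could not be computed
n_missing = toy["session_length_minutes"].isna().sum()
```

Whether such rows should be dropped or kept is a decision for the later wrangling phase; the sketch only shows how to count them.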

In [309]:
sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time,session_length,test,session_length_minutes
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1.719770e+12,1.719770e+12,0 days 00:12:00,12.0,12.0
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1.718670e+12,1.718670e+12,0 days 00:13:00,13.0,13.0
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1.721930e+12,1.721930e+12,0 days 00:23:00,23.0,23.0
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1.721880e+12,1.721880e+12,0 days 00:36:00,36.0,36.0
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1.716650e+12,1.716650e+12,0 days 00:11:00,11.0,11.0
...,...,...,...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-10 23:01:00,2024-05-10 23:07:00,1.715380e+12,1.715380e+12,0 days 00:06:00,6.0,6.0
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,2024-07-01 04:08:00,2024-07-01 04:19:00,1.719810e+12,1.719810e+12,0 days 00:11:00,11.0,11.0
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-28 15:36:00,2024-07-28 15:57:00,1.722180e+12,1.722180e+12,0 days 00:21:00,21.0,21.0
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-25 06:15:00,2024-07-25 06:22:00,1.721890e+12,1.721890e+12,0 days 00:07:00,7.0,7.0


In [229]:
players

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
0,Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6...,30.3,Morgan,Male,9,,
1,Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa9397...,3.8,Christian,Male,17,,
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
3,Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4f...,0.7,Flora,Female,21,,
4,Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb...,0.1,Kylie,Male,21,,
...,...,...,...,...,...,...,...,...,...
191,Amateur,True,b6e9e593b9ec51c5e335457341c324c34a2239531e1890...,0.0,Bailey,Female,17,,
192,Veteran,False,71453e425f07d10da4fa2b349c83e73ccdf0fb3312f778...,0.3,Pascal,Male,22,,
193,Amateur,False,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db29...,0.0,Dylan,Prefer not to say,17,,
194,Amateur,False,f19e136ddde68f365afc860c725ccff54307dedd13968e...,2.3,Harlow,Male,17,,


### Description:

## Question:

In [230]:
#what is love

## EDA:

In [231]:
players.describe()

Unnamed: 0,played_hours,age,individualId,organizationName
count,196.0,196.0,0.0,0.0
mean,5.845918,21.280612,,
std,28.357343,9.706346,,
min,0.0,8.0,,
25%,0.0,17.0,,
50%,0.1,19.0,,
75%,0.6,22.0,,
max,223.1,99.0,,


In [232]:
players_by_played_hours = players.sort_values(by="played_hours")

In [233]:
players_by_played_hours

Unnamed: 0,experience,subscribe,hashedEmail,played_hours,name,gender,age,individualId,organizationName
2,Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3...,0.0,Blake,Male,17,,
7,Amateur,False,1d2371d8a35c8831034b25bda8764539ab7db0f6393869...,0.0,Emerson,Male,21,,
6,Regular,True,8e594b8953193b26f498db95a508b03c6fe1c24bb5251d...,0.0,Luna,Female,19,,
5,Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee...,0.0,Adrian,Female,17,,
15,Amateur,False,2313a06afe47eacc28ff55adf6f072e7d12b0d12d7cbae...,0.0,Quinlan,Male,22,,
...,...,...,...,...,...,...,...,...,...
130,Amateur,True,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,56.1,Dana,Male,23,,
90,Amateur,True,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,150.0,Delara,Female,16,,
158,Regular,True,ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b1...,178.2,Piper,Female,19,,
51,Regular,True,b622593d2ef8b337dc554acb307d04a88114f2bf453b18...,218.1,Akio,Non-binary,20,,


In [234]:
unique_vs_cumulative_hours = alt.Chart(players_by_played_hours).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played")
).properties(width = 2000)

unique_vs_cumulative_hours

In [235]:
under20 = players[players["played_hours"] < 20]

In [236]:
unique_vs_cumulative_hours_under20 = alt.Chart(under20).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played")
).properties(width=1000)

unique_vs_cumulative_hours_under20

In [237]:
(players["played_hours"] < 1).sum()

np.int64(154)

In [238]:
# The cell above shows that 154 of the 196 players have logged under 1 hour of cumulative gameplay.

In [239]:
(players["played_hours"] < 2).sum()

np.int64(170)

In [240]:
# Raising the threshold, 170 of the 196 players have logged less than 2 hours.
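The threshold counts above can also be expressed as proportions, which are easier to compare across thresholds. This is a small sketch on invented values (`toy_hours` and the helper `share_under` are hypothetical names, not part of the dataset or of pandas):

```python
import pandas as pd

# Made-up cumulative hours for five hypothetical players
toy_hours = pd.Series([0.0, 0.1, 0.6, 1.5, 30.3], name="played_hours")

def share_under(hours, threshold):
    """Fraction of players whose cumulative hours are strictly below threshold."""
    # The mean of a boolean Series is the proportion of True values
    return float((hours < threshold).mean())

# Proportion of players under 1 hour and under 2 hours
shares = {t: share_under(toy_hours, t) for t in (1, 2)}
```

Applied to the real data, the same helper would take `players["played_hours"]` as its first argument.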

In [241]:
unique_vs_cumulative_hours_and_experience = alt.Chart(players_by_played_hours).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played"),
    color=alt.Color("experience")
).properties(width = 2000)

unique_vs_cumulative_hours_and_experience


In [242]:
unique_vs_cumulative_hours_and_experience_under20 = alt.Chart(under20).mark_bar().encode(
    x=alt.X("hashedEmail").sort("-y").title("Unique registrants"),
    y=alt.Y("played_hours").title("Hours played"),
    color=alt.Color("experience")
).properties(width = 2000)

unique_vs_cumulative_hours_and_experience_under20

In [243]:
# I could do more but i think it's time to move on:

### sessions.csv

In [244]:
sessions

Unnamed: 0,hashedEmail,start_time,end_time,original_start_time,original_end_time
0,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-06-30 18:12:00,2024-06-30 18:24:00,1.719770e+12,1.719770e+12
1,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-06-17 23:33:00,2024-06-17 23:46:00,1.718670e+12,1.718670e+12
2,f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3...,2024-07-25 17:34:00,2024-07-25 17:57:00,1.721930e+12,1.721930e+12
3,bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431...,2024-07-25 03:22:00,2024-07-25 03:58:00,1.721880e+12,1.721880e+12
4,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-25 16:01:00,2024-05-25 16:12:00,1.716650e+12,1.716650e+12
...,...,...,...,...,...
1530,36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f5...,2024-05-10 23:01:00,2024-05-10 23:07:00,1.715380e+12,1.715380e+12
1531,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e...,2024-07-01 04:08:00,2024-07-01 04:19:00,1.719810e+12,1.719810e+12
1532,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-28 15:36:00,2024-07-28 15:57:00,1.722180e+12,1.722180e+12
1533,fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33...,2024-07-25 06:15:00,2024-07-25 06:22:00,1.721890e+12,1.721890e+12


In [255]:
alt.Chart(sessions).mark_bar().encode(
    x=alt.X("hashedEmail").sort("y").title("Unique registrants"),
    y=alt.Y("start_time:T").title("Sessions started"),
    #color=alt.Color("experience")
).properties(width = 2000)
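If the goal of the chart above is the number of sessions each registrant started, one option is to count rows per `hashedEmail` first and plot the counts. A minimal sketch on invented IDs (`toy_sessions` and the column name `n_sessions` are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical session log: one row per session, keyed by a short fake ID
toy_sessions = pd.DataFrame({
    "hashedEmail": ["a1", "b2", "a1", "c3", "a1", "b2"],
})

# Count sessions per player, largest first
sessions_per_player = (
    toy_sessions.groupby("hashedEmail")
    .size()
    .rename("n_sessions")
    .sort_values(ascending=False)
    .reset_index()
)
```

The resulting frame has one row per registrant, so it could be passed to `alt.Chart` with `n_sessions` on the y-axis.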

In [246]:
sessions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1535 entries, 0 to 1534
Data columns (total 5 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   hashedEmail          1535 non-null   object        
 1   start_time           1535 non-null   datetime64[ns]
 2   end_time             1533 non-null   datetime64[ns]
 3   original_start_time  1535 non-null   float64       
 4   original_end_time    1533 non-null   float64       
dtypes: datetime64[ns](2), float64(2), object(1)
memory usage: 60.1+ KB


In [247]:
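#no_datetime_sessions = sessions["session_length"] = sessions["session_length"].seconds
# The line above errors: a Series has no .seconds attribute. No iteration is needed;
# use the vectorized .dt accessor instead, e.g. sessions["session_length"].dt.total_seconds()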
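For reference, a short sketch of the `.dt` accessor on a timedelta Series, using made-up durations (`toy_lengths` is a hypothetical name). It also shows why `.dt.total_seconds()` is usually the one to use: `.dt.seconds` returns only the seconds component within the day and ignores whole days.

```python
import pandas as pd

# Two invented durations: 12 minutes, and 1 day plus 30 seconds
toy_lengths = pd.Series(pd.to_timedelta(["0 days 00:12:00", "1 days 00:00:30"]))

# Seconds component only (whole days are dropped)
seconds_component = toy_lengths.dt.seconds

# Full duration in seconds, then converted to minutes
total_seconds = toy_lengths.dt.total_seconds()
total_minutes = total_seconds / 60
```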

In [95]:
original_sessions = pd.read_csv("data/sessions.csv")

In [149]:
alt.Chart(original_sessions).mark_bar().encode(
    x=alt.X("hashedEmail").sort("y").title("Unique registrants"),
    y=alt.Y("start_time").title("Session start time (raw string)"),
    #color=alt.Color("experience")
).properties(width = 2000, height= 4000)

In [116]:
sessions["start_time"].dt.strftime('%Y-%m-%d %H%M%S')
##pin in this

0       2024-06-30 181200
1       2024-06-17 233300
2       2024-07-25 173400
3       2024-07-25 032200
4       2024-05-25 160100
              ...        
1530    2024-05-10 230100
1531    2024-07-01 040800
1532    2024-07-28 153600
1533    2024-07-25 061500
1534    2024-05-20 022600
Name: start_time, Length: 1535, dtype: object