# DSCI 100 Project

In [2]:
library(repr)
library(tidyverse)
library(tidymodels)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

## Introduction

We wish to help Prof. Frank figure out what types of players are more likely to contribute more data.

There's a Reddit rumor saying that truly dedicated people never stop. They have obsessions. For example, mathematicians may spend years studying a single problem, while most people would not persist that long. Moderators on Reddit may spend hours each day moderating posts, even though they are unpaid.

This leads me to think: are true gamers sleepless? If a gamer plays at 4 AM, is that gamer a real gamer who would provide endless data?

In this project, we will explore whether participants who play overnight have more `played_hours`.

## Filter players by whether they play overnight

To answer the question, the first step is to keep only the players who play overnight and drop everyone else. We say a player plays overnight if at least one of their sessions falls between midnight (12 AM) and 4 AM.

Let's start with loading the datasets.

In [7]:
players <- read_csv("data/players.csv") |> mutate(gender = as.factor(gender))
sessions <- read_csv("data/sessions.csv")
players
sessions

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.




Then, let's proceed to parse the start time and the end time.

In [11]:
sessions_with_parsed_time <- sessions |> mutate(start_time = as_datetime(start_time, format = "%d/%m/%Y %H:%M"), 
                                                end_time = as_datetime(end_time, format = "%d/%m/%Y %H:%M"))
sessions_with_parsed_time

hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<dttm>,<dttm>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-06-30 18:12:00,2024-06-30 18:24:00,1.71977e+12,1.71977e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-06-17 23:33:00,2024-06-17 23:46:00,1.71867e+12,1.71867e+12
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,2024-07-25 17:34:00,2024-07-25 17:57:00,1.72193e+12,1.72193e+12
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-07-25 03:22:00,2024-07-25 03:58:00,1.72188e+12,1.72188e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-05-25 16:01:00,2024-05-25 16:12:00,1.71665e+12,1.71665e+12
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-06-23 15:08:00,2024-06-23 17:10:00,1.71916e+12,1.71916e+12
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,2024-04-15 07:12:00,2024-04-15 07:21:00,1.71317e+12,1.71317e+12
ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b16210addd44d7c81f83,2024-09-21 02:13:00,2024-09-21 02:30:00,1.72688e+12,1.72689e+12
96e190b0bf3923cd8d349eee467c09d1130af143335779251492eb4c2c058a5f,2024-06-21 02:31:00,2024-06-21 02:49:00,1.71894e+12,1.71894e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-05-16 05:13:00,2024-05-16 05:52:00,1.71584e+12,1.71584e+12
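As a side note, the parsing step can also be sketched with lubridate's `dmy_hm()`, which reads the day/month/year hour:minute layout without a format string. The toy tibble below is made up for illustration (it is not taken from `sessions.csv`), and the `"not a date"` row is there only to show how a failed parse appears as `NA`.

```r
library(dplyr)
library(lubridate)

# Toy data (hypothetical) in the same "day/month/year hour:minute" layout
toy_sessions <- tibble(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "not a date"),
  end_time   = c("30/06/2024 18:24", "17/06/2024 23:46", "18/06/2024 00:05")
)

# quiet = TRUE suppresses the warning about strings that fail to parse
toy_parsed <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time, quiet = TRUE),
    end_time   = dmy_hm(end_time, quiet = TRUE)
  )

# Rows where parsing failed have NA in one of the time columns
toy_unparsed <- toy_parsed |>
  filter(is.na(start_time) | is.na(end_time))
toy_unparsed
```

Checking for `NA` rows like this on the real data would show whether any session was silently lost during parsing.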


Then, we extract the hours and minutes of the start time and end time, and combine them into a single value in hours (hours plus minutes divided by 60). I read https://ubc-dsci.github.io/dsci-100-student/REFERENCE_R.html over and over again and didn't see how to get hours and minutes out of a `dttm`. Therefore, we have to go beyond the reference sheet and use `format`, which is not part of it.

In [16]:
sessions_with_time_in_hr <- sessions_with_parsed_time |> mutate(
    start_time_in_hr = as.numeric(format(start_time, "%H"))
    + as.numeric(format(start_time, "%M")) / 60,
    end_time_in_hr = as.numeric(format(end_time, "%H")) 
    + as.numeric(format(end_time, "%M")) / 60,
  )
sessions_with_time_in_hr

hashedEmail,start_time,end_time,original_start_time,original_end_time,start_time_in_hr,end_time_in_hr
<chr>,<dttm>,<dttm>,<dbl>,<dbl>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-06-30 18:12:00,2024-06-30 18:24:00,1.71977e+12,1.71977e+12,18.2000000,18.400000
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-06-17 23:33:00,2024-06-17 23:46:00,1.71867e+12,1.71867e+12,23.5500000,23.766667
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,2024-07-25 17:34:00,2024-07-25 17:57:00,1.72193e+12,1.72193e+12,17.5666667,17.950000
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-07-25 03:22:00,2024-07-25 03:58:00,1.72188e+12,1.72188e+12,3.3666667,3.966667
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-05-25 16:01:00,2024-05-25 16:12:00,1.71665e+12,1.71665e+12,16.0166667,16.200000
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-06-23 15:08:00,2024-06-23 17:10:00,1.71916e+12,1.71916e+12,15.1333333,17.166667
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,2024-04-15 07:12:00,2024-04-15 07:21:00,1.71317e+12,1.71317e+12,7.2000000,7.350000
ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b16210addd44d7c81f83,2024-09-21 02:13:00,2024-09-21 02:30:00,1.72688e+12,1.72689e+12,2.2166667,2.500000
96e190b0bf3923cd8d349eee467c09d1130af143335779251492eb4c2c058a5f,2024-06-21 02:31:00,2024-06-21 02:49:00,1.71894e+12,1.71894e+12,2.5166667,2.816667
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-05-16 05:13:00,2024-05-16 05:52:00,1.71584e+12,1.71584e+12,5.2166667,5.866667
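For comparison, the same conversion can be sketched with lubridate's `hour()` and `minute()` accessors; lubridate is attached with tidyverse 2.0.0, so no extra package is needed. The two timestamps below are toy values chosen for illustration, and `toy_in_hr` is a hypothetical name.

```r
library(dplyr)
library(lubridate)

# Two toy timestamps (hypothetical), already parsed as date-times
toy <- tibble(
  start_time = ymd_hm(c("2024-06-30 18:12", "2024-07-25 03:22"))
)

# hour() and minute() return numbers, so no as.numeric(format(...)) is needed
toy_in_hr <- toy |>
  mutate(start_time_in_hr = hour(start_time) + minute(start_time) / 60)
toy_in_hr
```

This should give the same fractional hours as the `format` approach, since both read the hour and minute fields of the same `dttm` value.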


Now that we have the start time and end time in hours, a single `filter` is enough. A session counts as overnight in either of the following cases:
- The start time is between midnight (12 AM) and 4 AM
- The start time is later than the end time (the player played past midnight)

In [18]:
filtered_sessions <- sessions_with_time_in_hr |> filter(start_time_in_hr < 4 | start_time_in_hr > end_time_in_hr)
filtered_sessions

hashedEmail,start_time,end_time,original_start_time,original_end_time,start_time_in_hr,end_time_in_hr
<chr>,<dttm>,<dttm>,<dbl>,<dbl>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-07-25 03:22:00,2024-07-25 03:58:00,1.72188e+12,1.72188e+12,3.3666667,3.9666667
ad6390295640af1ed0e45ffc58a53b2d9074b0eea694b16210addd44d7c81f83,2024-09-21 02:13:00,2024-09-21 02:30:00,1.72688e+12,1.72689e+12,2.2166667,2.5000000
96e190b0bf3923cd8d349eee467c09d1130af143335779251492eb4c2c058a5f,2024-06-21 02:31:00,2024-06-21 02:49:00,1.71894e+12,1.71894e+12,2.5166667,2.8166667
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,2024-07-03 01:31:00,2024-07-03 01:35:00,1.71997e+12,1.71997e+12,1.5166667,1.5833333
f2826fb8dbce4d450348f99cb27ade184b713998d9679780442efaaf218038f2,2024-08-24 02:32:00,2024-08-24 03:12:00,1.72447e+12,1.72447e+12,2.5333333,3.2000000
b622593d2ef8b337dc554acb307d04a88114f2bf453b18fb5d2c80052aeb2319,2024-08-18 00:51:00,2024-08-18 03:15:00,1.72394e+12,1.72395e+12,0.8500000,3.2500000
f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,2024-08-08 00:21:00,2024-08-08 01:35:00,1.72308e+12,1.72308e+12,0.3500000,1.5833333
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,2024-06-03 00:22:00,2024-06-03 00:33:00,1.71737e+12,1.71737e+12,0.3666667,0.5500000
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,2024-07-31 02:58:00,2024-07-31 03:21:00,1.72239e+12,1.72240e+12,2.9666667,3.3500000
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,2024-07-04 02:25:00,2024-07-04 04:05:00,1.72006e+12,1.72007e+12,2.4166667,4.0833333
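The two cases above can be checked on a few hand-made sessions. This is only a sketch: `is_overnight` is a hypothetical helper that wraps the same condition used in the filter, and the three rows are invented to cover an evening session, an early-morning session, and one that crosses midnight.

```r
library(dplyr)

# Hypothetical helper: TRUE when a session starts before 4 AM,
# or when its start clock time is later than its end clock time
is_overnight <- function(start_hr, end_hr) {
  start_hr < 4 | start_hr > end_hr
}

# Toy sessions (made up), with times already expressed in hours
toy <- tibble(
  label    = c("evening", "early morning", "crosses midnight"),
  start_hr = c(18.2, 2.5, 23.5),
  end_hr   = c(18.4, 2.8, 0.25)
)

toy_flagged <- toy |>
  mutate(overnight = is_overnight(start_hr, end_hr))
toy_flagged
```

Note that a session starting at 23:30 and ending at 23:50 the same evening is not flagged, because neither case applies to it.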


Now it suffices to use `semi_join` to match `players` against `filtered_sessions`, which keeps only the players who have at least one overnight session.

In [20]:
filtered_players <- players |> semi_join(filtered_sessions, by = "hashedEmail")
filtered_players

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<fct>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Veteran,True,7a4686586d290c67179275c7c3dfb4ea02f4d317d9ee0e2cee98baa27877a875,1.6,Lane,Female,23
Amateur,True,a175d4741dc84e6baf77901f6e8e0a06f54809a34e6b5211159bced346f7fb3e,48.4,Xander,Female,17
Amateur,True,ab1f44f93c3b828f55458971db393052d9711df3e0e7ff69540bfebfcec555ff,0.5,Marley,Male,17
Regular,True,bc704ff2bc676dbf48ee41b9e11481c1387bf758ad318f2428f336e3fecc6660,0.3,Andy,Male,8
Amateur,True,4b01bce3f141289709e8278b02ba5d2aaa7105d7ccb9c7deb37670a80e332774,1.8,Luca,Male,23
Veteran,True,f1b432523542f90c61176a555ccb2144468d76c91a32d74082ab8c101f9d25b6,0.1,London,Male,21
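To see what `semi_join` does here, the sketch below uses toy tibbles with short made-up IDs in place of the real hashed emails. It also shows `anti_join`, which would give the complementary group of players with no overnight session; that group is not used in this project and is included only for illustration.

```r
library(dplyr)

# Toy players and overnight sessions (hypothetical values)
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  played_hours = c(30.3, 0.1, 1.6)
)
toy_overnight_sessions <- tibble(hashedEmail = c("a", "a", "c"))

# semi_join keeps each matching player once, even if they have several sessions
toy_overnight_players <- toy_players |>
  semi_join(toy_overnight_sessions, by = "hashedEmail")

# anti_join keeps the players that have no matching session at all
toy_other_players <- toy_players |>
  anti_join(toy_overnight_sessions, by = "hashedEmail")

toy_overnight_players
toy_other_players
```

Unlike `inner_join`, `semi_join` adds no columns from the sessions table and never duplicates a player, which is why it fits a pure filtering step.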


Now, let us compute the mean played hours for players who play overnight and compare it with the mean for all players, to see how much the two differ.

In [24]:
mean_played_hour_overnight <- filtered_players |> summarize(mean = mean(played_hours))
mean_played_hour_overnight

mean_played_overall <- players |> summarize(mean = mean(played_hours))
mean_played_overall

mean
<dbl>
21.4


mean
<dbl>
5.845918


As we can see, there's a large difference! The mean played hours across players who played overnight is a whopping 21.4, while the mean played hours across all players is only about 5.85!
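One possible refinement, sketched below on made-up numbers, is to compare overnight players against players with no overnight session, since the overall mean also includes the overnight players. The sketch reports the group size and the median alongside the mean, because a mean can be pulled up by a few players with very large `played_hours`. The IDs, hours, and the `overnight_ids` vector are all hypothetical.

```r
library(dplyr)

# Toy players (hypothetical values)
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c", "d"),
  played_hours = c(30, 0, 2, 4)
)

# Hypothetical IDs of players with at least one overnight session
overnight_ids <- c("a", "c")

# Label each player, then summarize the two groups side by side
toy_summary <- toy_players |>
  mutate(overnight = hashedEmail %in% overnight_ids) |>
  group_by(overnight) |>
  summarize(
    n      = n(),
    mean   = mean(played_hours),
    median = median(played_hours)
  )
toy_summary
```

On the real data, `overnight_ids` could be taken from `filtered_sessions$hashedEmail`, assuming the objects defined earlier in this notebook.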