# Project Planning

Alex Zhang (38154290) Team 18 Section 003

In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [3]:
players <- read_csv("https://raw.githubusercontent.com/Alexjhz07/DSCI-100-Project-Individual/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/Alexjhz07/DSCI-100-Project-Individual/refs/heads/main/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## (1) Data Description:

The `players.csv` dataset contains 196 rows and 7 columns with the following properties:

Column Name | Data Type | Has NA | Variable Description
--- | --- | --- | ---
experience|chr|NO|Self-reported level of experience with the game<br>5 possible values are `Pro`, `Veteran`, `Amateur`, `Regular`, `Beginner`
subscribe|lgl|NO|Boolean for whether the player has subscribed to the game-related newsletter<br>Either `TRUE` or `FALSE`
hashedEmail|chr|NO|Hash of the email the player used for registration<br>(Hash for privacy, still usable for identifying which rows correspond to each other)
played_hours|dbl|NO|Number of hours played by the player on the server
name|chr|NO|Name (Not username) of the player
gender|chr|NO|Gender identity of the player<br>7 possible values are `Male`, `Female`, `Non-binary`, `Prefer not to say`, `Agender`, `Two-Spirited`, `Other` 
Age|dbl|YES|Age of the player<br>Ranges from `9` to `58`, the mean is `21.14` and median is `19`. There are 2 `NA` values

Because the data are self-reported, several issues may affect them: players may not be truthful about their information (for example, their age), one person may own multiple accounts in this dataset, and some columns, such as `Age`, contain `NA` values.
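Two of these concerns can be checked mechanically. The sketch below assumes that `hashedEmail` is meant to identify one account per row; it counts repeated hashes and missing ages. It runs on a small made-up tibble (`toy_players`, hypothetical values) so that it is self-contained, and the same calls should work on `players`. It cannot detect one person registering under several different emails.

```r
library(dplyr)

# Hypothetical toy data standing in for `players` (values are made up)
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "b2", "c3"),
  Age = c(17, NA, 21, 9)
)

# Hashes that appear on more than one row (possible duplicate accounts)
duplicate_hashes <- toy_players |>
  count(hashedEmail, name = "n_rows") |>
  filter(n_rows > 1)

# Number of rows with a missing Age
n_missing_age <- sum(is.na(toy_players$Age))

duplicate_hashes
n_missing_age
```

If `duplicate_hashes` has zero rows when run on `players`, every hash is unique within that table.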

We can check these facts using the following code:

In [19]:
# Code for checking players.csv
head(players, n = 2) # Get first 2 rows
dim(players) # Get the table dimensions (rows, columns)

# Check each column for NA (named logical vector, one entry per column)
na_r <- sapply(players, anyNA)
na_r

# Check the unique values of each categorical column
unique_experience <- unique(players$experience)
unique_experience
unique_gender <- unique(players$gender)
unique_gender

# Get summary statistics (summary() handles NA itself, so na.rm is not needed)
summary(players)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17


  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

Next, we check the values in each column to ensure there are no unexpected values. We preview the sessions dataset and check it for `NA`, then list the distinct values of the categorical columns in the players dataset:

In [11]:
head(sessions)
dim(sessions)

hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


In [22]:
# Check each column of sessions for NA (named logical vector, one entry per column)
sapply(sessions, anyNA)

In [23]:
players |> distinct(experience)
players |> distinct(subscribe)

experience
<chr>
Pro
Veteran
Amateur
Regular
Beginner


subscribe
<lgl>
True
False
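One way to make the "no unexpected values" check explicit is to compare a column against the values listed in the data description. This is a minimal sketch using a hypothetical helper, `unexpected_values()`, and a made-up input vector so that it is self-contained:

```r
# Hypothetical helper: returns the observed values that are not in `allowed`
unexpected_values <- function(observed, allowed) {
  setdiff(unique(observed), allowed)
}

# Levels listed in the data description for the experience column
allowed_experience <- c("Pro", "Veteran", "Amateur", "Regular", "Beginner")

# Made-up example column containing one misspelled level
toy_experience <- c("Pro", "Amateur", "Begginer", "Pro")

unexpected_values(toy_experience, allowed_experience)
```

On the real data, the call would be `unexpected_values(players$experience, allowed_experience)`; an empty result means every value is one of the expected levels.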
