In [1]:
# Please load this first

library(tidyverse)
library(repr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [6]:
# Run this to load the data sets and display respective summary statistics

players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

summary_players <- summary(players)
summary_sessions <- summary(sessions)

summary_players
summary_sessions

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

### Data Description

##### Data file 1 - players.csv
- 196 rows (observations) and 7 columns (variables)
- Each row corresponds to a unique player
    - experience (character)
      - This variable indicates the experience level of each Minecraft player. There are five unique experience levels: "Beginner", "Amateur", "Regular", "Pro" and "Veteran". This variable's data type could be changed from 'character' to 'factor'.
    - subscribe (logical)
      - This variable contains TRUE or FALSE logical values, indicating whether a player is subscribed to a gaming newsletter.
    - hashedEmail (character)
      - This variable contains each player's email address, hashed for privacy.
    - played_hours (double)
      - This variable indicates the total number of hours played on the Minecraft server by each player.
    - name (character)
      - This variable reports the first name of each player.
    - gender (character)
      - This variable reports the gender identity of each player. This variable's data type could be changed from 'character' to 'factor'.
    - Age (double)
      - This variable contains the age of each player. This variable's data type could be changed from 'double' to 'integer'.
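
The type changes suggested in the list above could be sketched as follows. This is a minimal illustration, not the project's final wrangling code: `players_example` and its rows are hypothetical stand-ins for `players.csv`, the gender labels are made up, and the factor level order simply follows the order in which the experience levels are listed above, which is an assumption.

```r
library(dplyr)

# Hypothetical rows standing in for players.csv (values invented for illustration)
players_example <- tibble(
  experience   = c("Pro", "Beginner", "Veteran", "Amateur", "Regular"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  played_hours = c(30.3, 0.0, 3.8, 0.7, 0.1),
  gender       = c("Male", "Female", "Male", "Female", "Male"),
  Age          = c(9, 17, 21, 22, 58)
)

# Assumed level order: the order in which the levels are listed in the description
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")

players_typed <- players_example %>%
  mutate(
    experience = factor(experience, levels = experience_levels), # character -> factor
    gender     = as.factor(gender),                              # character -> factor
    Age        = as.integer(Age)                                 # double -> integer
  )

players_typed
```

The same `mutate()` call should apply unchanged to the full `players` data frame loaded earlier, provided its columns match the description.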

##### Data file 2 - sessions.csv
- 1535 rows (observations) and 5 columns (variables)
- Each row corresponds to a unique play session.
    - hashedEmail (character)
      - This variable contains the player's email address, hashed for privacy.
    - start_time (character)
      - Describes the start date and time of a player's play session.
    - end_time (character)
      - Describes the end date and time of a player's play session.
    - original_start_time (double)
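      - Contains the start time of a player's play session in UNIX timestamp form. The magnitude of the values (about 1.7e+12) suggests milliseconds, rather than seconds, since the epoch.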
    - original_end_time (double)
      - Contains the end time of a player's play session in UNIX timestamp form, on the same scale as original_start_time. This column has 2 missing (NA) values.
     
For most play sessions, the original_start_time and original_end_time values appear identical because the UNIX timestamps were not recorded with enough precision, so the difference between the two values often won't reflect the total play time of a session.
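
Because of this, session length is probably better derived from the character columns start_time and end_time. The sketch below shows one way to do that. The rows in `sessions_example` are hypothetical, and both the "dd/mm/yyyy HH:MM" layout of the time strings and the millisecond scale of the UNIX timestamps are assumptions that should be checked against the actual file.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows standing in for sessions.csv (values invented for illustration)
sessions_example <- tibble(
  start_time          = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time            = c("30/06/2024 18:24", "17/06/2024 23:46"),
  original_start_time = c(1.71977e12, 1.71867e12),
  original_end_time   = c(1.71977e12, 1.71867e12)
)

sessions_durations <- sessions_example %>%
  mutate(
    start = dmy_hm(start_time), # assumes day/month/year hour:minute strings
    end   = dmy_hm(end_time),
    # Session length computed from the parsed date-times
    duration_mins = as.numeric(difftime(end, start, units = "mins")),
    # Difference of the rounded UNIX timestamps, for comparison
    unix_diff_ms = original_end_time - original_start_time,
    # Assumes milliseconds since the epoch, hence the division by 1000
    start_from_unix = as_datetime(original_start_time / 1000)
  )

sessions_durations
```

In these example rows the parsed date-times give non-zero durations while `unix_diff_ms` is zero, which is the pattern described above.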

##### Question:
For this project, I will investigate the following question of interest: "What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?" Specifically, I plan to determine whether 'Age' and 'played_hours' can accurately predict 'subscribe' in 'players.csv'.
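
To judge whether a prediction is "accurate", it helps to know how often a trivial guess would be right. The sketch below uses only the counts from the summary output above (52 FALSE, 144 TRUE) to compute the class proportions of 'subscribe'. The majority-class baseline is my own addition for context, not a method the analysis is committed to.

```r
# Counts of subscribe taken from the summary() output of players
subscribe_counts <- c(`FALSE` = 52, `TRUE` = 144)

# Proportion of players in each class
subscribe_props <- subscribe_counts / sum(subscribe_counts)

# Accuracy obtained by always predicting the more common class
majority_baseline <- max(subscribe_props)

round(subscribe_props, 3)
round(majority_baseline, 3)
```

Roughly 73% of players are subscribed, so a model based on 'Age' and 'played_hours' would need to beat that figure to improve on always guessing TRUE.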