### Loading Libraries and Datasets

In [11]:
# Loading libraries
library(tidyverse)
library(repr)
library(tidymodels)

# Loading the two datasets directly from their URLs on GitHub so that the whole file is reproducible
url_players <- "https://raw.githubusercontent.com/Finnypiney/individual_project_finnp/refs/heads/main/players.csv"
url_sessions <- "https://raw.githubusercontent.com/Finnypiney/individual_project_finnp/refs/heads/main/sessions.csv"

players <- read_csv(url_players)
sessions <- read_csv(url_sessions)

glimpse(players)
glimpse(sessions)


Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 196
Columns: 7
$ experience   <chr> "Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amate…
$ subscribe    <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, T…
$ hashedEmail  <chr> "f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8…
$ played_hours <dbl> 30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 1.6, 0…
$ name         <chr> "Morgan", "Christian", "Blake", "Flora", "Kylie", "Adrian…
$ gender       <chr> "Male", "Male", "Male", "Female", "Male", "Female", "Fema…
$ Age          <dbl> 9, 17, 17, 21, 21, 17, 19, 21, 47, 22, 23, 17, 25, 22, 17…
Rows: 1,535
Columns: 5
$ hashedEmail         <chr> "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8a…
$ start_time          <chr> "30/06/2024 18:12", "17/06/2024 23:33", "25/07/202…
$ end_time            <chr> "30/06/2024 18:24"

### (1) Data Description (code cell followed by explanation/markdown cell)

In [31]:
# PLAYERS SUMMARY STATISTICS
summary_players <- summary(players)

experience_categories <- unique(players$experience)
gender_categories <- unique(players$gender)

summary_players
experience_categories
gender_categories

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

**Players:** in the players.csv dataset, we have
- 196 rows (196 observations)
- 7 variables
  - experience
    - This is a character variable that has 5 unique categories: 'Pro', 'Veteran', 'Amateur', 'Regular', and 'Beginner'. It indicates the level of Minecraft experience an individual player in the dataset has.
  - subscribe
    - This is a logical variable (TRUE or FALSE) that tells you whether a player is subscribed (144 TRUE, 52 FALSE).
  - hashedEmail
    - This is a character variable that reports a player's hashed email address.
  - played_hours
    - This is a double variable (number with decimal values) that reports the number of hours of Minecraft played by each individual.
  - name
    - This is a character variable that reports a player's first name.
  - gender
    - This is a character variable that reports a player's gender (7 unique categories).
  - Age
    - This is a double variable that reports a player's age.
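
Since experience and gender are categorical but stored as character, one option is to convert them to factors before analysis. The following is a minimal base R sketch on a made-up vector (the values are illustrative stand-ins, not the real column):

```r
# Toy stand-in for the experience column (illustrative values, not the real data)
experience <- c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Beginner")

# Convert the character vector to a factor so R treats it as categorical
experience_fct <- factor(experience)

levels(experience_fct)   # the distinct categories, sorted alphabetically
nlevels(experience_fct)  # how many distinct categories there are
table(experience_fct)    # number of entries in each category
```

The same call could be applied to `players$experience` and `players$gender`; whether the levels should be given a meaningful order is a modelling choice that this sketch does not make.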

**Issues:**
- In the Age variable, 2 observations are NA's, so when we compute summary statistics or wrangle our data we should account for this, for example by passing `na.rm = TRUE`.
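
As a small illustration of handling the missing ages: R is case-sensitive, so the argument is written `na.rm = TRUE`. The vector below is made up for the example and is not the real Age column:

```r
# Toy ages with two missing values (illustrative, not the real column)
age <- c(9, 17, NA, 21, NA, 47)

mean_with_na    <- mean(age)                # NA, because missing values propagate
mean_without_na <- mean(age, na.rm = TRUE)  # mean of the four non-missing values
n_missing       <- sum(is.na(age))          # number of missing values

mean_with_na
mean_without_na
n_missing
```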

In [29]:
# SESSIONS SUMMARY STATISTICS
summary_sessions <- summary(sessions)
summary_sessions

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

**Sessions:** in the sessions.csv dataset, we have
- 1,535 rows
- 5 variables
  - hashedEmail
    - As before, this is a character variable that reports a player's hashed email address.
  - start_time
    - The date and time a player begins playing on the server, stored as a character variable in DD/MM/YYYY HH:MM format.
  - end_time
    - The date and time a player stops playing on the server, stored as a character variable in the same format.
  - original_start_time
    - This is a double variable that reports the same moment as start_time, but as UNIX time in milliseconds.
  - original_end_time
    - This is a double variable that reports the same moment as end_time, but as UNIX time in milliseconds.
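
Both time representations can be turned into proper date-time values. The sketch below uses base R on single illustrative values; the UTC time zone is an assumption (the time zone of the real data is not stated), and `start_ms` is a hypothetical value chosen to match the character timestamp under that assumption:

```r
# Illustrative values only: one timestamp in each of the two formats
start_chr <- "30/06/2024 18:12"   # DD/MM/YYYY HH:MM, as in start_time
end_chr   <- "30/06/2024 18:24"   # DD/MM/YYYY HH:MM, as in end_time
start_ms  <- 1719771120000        # hypothetical UNIX time in milliseconds

# Parse the character timestamps (tz = "UTC" is an assumption)
start_dt <- as.POSIXct(start_chr, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_dt   <- as.POSIXct(end_chr,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Convert milliseconds to seconds, then to a date-time
start_from_ms <- as.POSIXct(start_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Session length in minutes
duration_min <- as.numeric(difftime(end_dt, start_dt, units = "mins"))
duration_min
```

Working from the parsed character columns or from the millisecond columns should give the same durations only if both refer to the same time zone, which would need to be checked against the real data.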

**Issues:**
- Each row here is one play session rather than one player: there are 196 players in players.csv, but 1,535 rows in sessions.csv. The data therefore need to be summarized per player (by hashedEmail) before they can be combined with players.csv.
- There are 2 NA's in our original_end_time variable.
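
Both issues can be checked and handled with a few lines. This is a base R sketch on a made-up table; the emails and timestamps are hypothetical, and `n_sessions` is a column name introduced only for this example:

```r
# Toy sessions table: three sessions from two hypothetical players
sessions_toy <- data.frame(
  hashedEmail         = c("aaa", "aaa", "bbb"),
  original_start_time = c(1.0e12, 2.0e12, 3.0e12),
  original_end_time   = c(1.1e12, NA,     3.2e12)
)

# Count the rows with a missing end time
n_missing_end <- sum(is.na(sessions_toy$original_end_time))

# Collapse to one row per player by counting sessions per hashedEmail
sessions_per_player <- as.data.frame(
  table(hashedEmail = sessions_toy$hashedEmail),
  responseName = "n_sessions",
  stringsAsFactors = FALSE
)

n_missing_end
sessions_per_player
```

The resulting table has one row per player, which is the shape needed to join against players.csv on hashedEmail. The same summary could also be written with `group_by()` and `summarize()` from the tidyverse already loaded in this notebook.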

### (2) Questions