This is my individual planning report

In [1]:
library(tidyverse)
library(tidymodels)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

<h2>1) Data Description:</h2>

<ul>
  <li>players.csv: 196 rows × 7 columns</li>
  <li>sessions.csv: 1535 rows × 5 columns</li>
</ul>

players.csv (196 observations, 7 variables)

| Variable | Type | Meaning |
| :---------------- | :------: | :---- |
| experience | Character | Player’s self-reported or assigned experience tier: Beginner, Amateur, Regular, Pro, or Veteran. |
| subscribe | Logical | Whether the player is subscribed to the newsletter. |
| hashedEmail | Character | Unique player identifier (hash of the email address); used to join with sessions.csv. |
| played_hours | Double (numerical) | Total hours played. |
| name | Character | Player’s first name. |
| gender | Character (categorical) | Player’s gender; values include Male, Female, Non-binary, and Prefer not to say. |
| Age | Double (numerical) | Player’s age in years. |

sessions.csv (1535 observations, 5 variables)

| Variable | Type | Meaning |
| :---------------- | :------: | :---- |
| hashedEmail | Character | Identifies the player; matches hashedEmail in players.csv. |
| start_time | Character | Session date and start time for the player, stored as text. |
| end_time | Character | Session date and end time for the player, stored as text. |
| original_start_time | Double (numerical) | Session start time as a numeric timestamp, possibly measured from when the server was created; units are not stated. |
| original_end_time | Double (numerical) | Session end time as a numeric timestamp, possibly measured from when the server was created; units are not stated. |
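
Since hashedEmail is the only column the two files share, a join on it is how per-player session information would be attached to players.csv. The following is a minimal sketch, assuming dplyr is available; `players_toy`, `sessions_toy`, and `n_sessions` are hypothetical names, and the rows are made up for illustration rather than taken from the real files.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for players.csv and sessions.csv (made-up rows, not real data).
players_toy <- tibble(
  hashedEmail  = c("a1", "b2", "c3"),
  experience   = c("Pro", "Amateur", "Veteran"),
  played_hours = c(12.5, 0.0, 3.2)
)
sessions_toy <- tibble(
  hashedEmail         = c("a1", "a1", "c3"),
  original_start_time = c(1.712e12, 1.716e12, 1.719e12)
)

# Count sessions per player.
session_counts <- sessions_toy %>%
  count(hashedEmail, name = "n_sessions")

# left_join() keeps every player, including those with no sessions;
# their count comes back as NA, so it is replaced with 0.
players_with_counts <- players_toy %>%
  left_join(session_counts, by = "hashedEmail") %>%
  mutate(n_sessions = coalesce(n_sessions, 0L))

players_with_counts
```

A left join is used here, rather than an inner join, so that players who never started a session stay in the table.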

<h4>Visible Issues:</h4>

<ul>
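  <li>A small number of sessions have no end time (original_end_time has 2 NA values), so duration can’t be computed for those rows.</li>
  <li>The units for original_start_time and original_end_time are not stated; their magnitude (about 1.7e12) is consistent with milliseconds since the Unix epoch, but this is unconfirmed.</li>
  <li>Some players in players.csv never appear in sessions.csv.</li>
  <li>A few of the ages in players.csv are NA, and the ages span a wide range, from 9 to 58.</li>
  <li>At least a quarter of the players in players.csv have 0 played_hours (the first quartile is 0 and the median is 0.1), and played_hours has several large outliers (the maximum is 223.1, against a mean of 5.846).</li>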
    
</ul>
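
The visible issues listed above can each be checked with a few lines of code. This is a sketch on made-up toy tables, not on the real files; `players_toy`, `sessions_toy`, and the result names are hypothetical, and the same calls would be applied to the `players` and `sessions` data frames once they are loaded.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the two files (made-up rows, not real data).
players_toy <- tibble(
  hashedEmail  = c("a1", "b2", "c3", "d4"),
  played_hours = c(12.5, 0, 0, 3.2),
  Age          = c(17, NA, 21, 58)
)
sessions_toy <- tibble(
  hashedEmail       = c("a1", "a1", "d4"),
  original_end_time = c(1.712e12, NA, 1.719e12)
)

# Number of missing values in each column.
na_counts_players  <- colSums(is.na(players_toy))
na_counts_sessions <- colSums(is.na(sessions_toy))

# Players that never appear in the sessions table.
players_without_sessions <- anti_join(players_toy, sessions_toy, by = "hashedEmail")

# Share of players with exactly 0 played hours.
share_zero_hours <- mean(players_toy$played_hours == 0)
```

`anti_join()` returns only the rows of the first table that have no match in the second, which makes it a direct test for players without sessions.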

<h4>Non-Visible Issues:</h4>

<ul>
  <li>Not all session time is active play; players could be AFK (away from keyboard).</li>
  <li>Seasonal events or school breaks could affect player numbers and subscriptions.</li>
  <li>There could be shared accounts under the same hashedEmail.</li>
  <li>There could be recruitment bias, for example because the players come from a research server.</li>
</ul>

<h4>How Data Was Collected</h4>

<ul>
  <li>A UBC research group ran a Minecraft research server and logged player data (players.csv) and session activity per player (sessions.csv).</li>
  <li>Players’ actions and session times were recorded as they played in the Minecraft world.</li>
</ul>







In [29]:
# Summary statistics: load both files, then summarise every column
players  <- read_csv("players.csv", show_col_types = FALSE)
sessions <- read_csv("sessions.csv", show_col_types = FALSE)

summary(players)
summary(sessions)

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 9.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :21.14  
                                       3rd Qu.:22.75  
                                       Max.   :58.00  
                               

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          
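
The original_start_time and original_end_time values in the summary above are all around 1.7e12. A sketch of how they could be converted to dates follows, under the explicit assumption that they are milliseconds since 1970-01-01 UTC; the dataset does not state this. The inputs are the rounded minimum, median, and maximum printed by `summary()`, so the resulting dates are only approximate, and `epoch_ms`, `session_dates`, and `duration_min` are hypothetical names.

```r
# ASSUMPTION: the timestamps are milliseconds since the Unix epoch (UTC).
# These are the rounded min, median, and max printed by summary() above.
epoch_ms <- c(1.712e12, 1.719e12, 1.727e12)

# as.POSIXct() expects seconds, so divide the millisecond values by 1000.
session_dates <- as.POSIXct(epoch_ms / 1000, origin = "1970-01-01", tz = "UTC")
format(session_dates, "%Y-%m-%d")

# Under the same assumption, a session duration in minutes is the difference
# between end and start divided by 60,000. The pair below is made up.
start_ms     <- 1.712e12
end_ms       <- start_ms + 30 * 60 * 1000
duration_min <- (end_ms - start_ms) / 60000
```

If the converted dates fall in a plausible collection period, that supports the millisecond assumption; if not, the units need to be confirmed with whoever produced the data.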