<h1 style="text-align: center; font-weight: bold;">Individual Planning Report</h1>

### Data Description

In [1]:
# First, load the packages
library(tidymodels)
library(tidyverse)  # dplyr is attached by both packages, so no separate library(dplyr) call is needed

── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ recipes      1.1.0
✔ dials        1.3.0     ✔ rsample      1.2.1
✔ dplyr        1.1.4     ✔ tibble       3.2.1
✔ ggplot2      3.5.1     ✔ tidyr        1.3.1
✔ infer        1.0.7     ✔ tune         1.1.2
✔ modeldata    1.4.0     ✔ workflows    1.1.4
✔ parsnip      1.2.1     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.3.1

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()

The source CSV files are hosted in the GitHub repository, so we read the data directly from there.

In [2]:
players <- read_csv("https://raw.githubusercontent.com/HollieHuang666/Dsci-100-Individual-Planning-Report/main/players.csv")
head(players)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


The first table contains player-level data describing characteristics of individual Minecraft players. It has 196 rows and 7 columns, that is, 196 observations (one per player) and 7 variables. The variables are `experience`, `subscribe`, `hashedEmail`, `played_hours`, `name`, `gender`, and `Age`, with types character, logical, character, double, character, character, and double respectively.

`experience` indicates the player's skill level (beginner, amateur, regular, veteran, or pro). `subscribe` is a logical variable showing whether the player subscribed to the game newsletter. `hashedEmail` contains privacy-protected email identifiers. `played_hours` records the total hours played, and `name`, `gender`, and `Age` give each player's name, gender, and age.

<table>
  <caption><strong>Player Data: Variables &amp; Descriptions</strong></caption>
  <thead>
    <tr>
      <th>Variable</th>
      <th>Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>experience</code></td>
      <td>character</td>
      <td>Player skill level: beginner, amateur, regular, veteran, or pro.</td>
    </tr>
    <tr>
      <td><code>subscribe</code></td>
      <td>logical</td>
      <td>Whether the player subscribed to the game newsletter (TRUE/FALSE).</td>
    </tr>
    <tr>
      <td><code>hashedEmail</code></td>
      <td>character</td>
      <td>Hashed email identifier (stored for privacy).</td>
    </tr>
    <tr>
      <td><code>played_hours</code></td>
      <td>double</td>
      <td>Total hours the player has played.</td>
    </tr>
    <tr>
      <td><code>name</code></td>
      <td>character</td>
      <td>Player's name.</td>
    </tr>
    <tr>
      <td><code>gender</code></td>
      <td>character</td>
      <td>Player's gender.</td>
    </tr>
    <tr>
      <td><code>Age</code></td>
      <td>double</td>
      <td>Player's age.</td>
    </tr>
  </tbody>
</table>
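
The dimensions and column types listed above can be checked programmatically. The following is a minimal sketch, not part of the original analysis: `players_demo` is a hypothetical three-row stand-in built from the preview rows (with the hashes replaced by placeholder strings), so it runs without downloading the CSV.

```r
library(tibble)

# Hypothetical stand-in for `players`: first three preview rows,
# with the real hashes replaced by placeholder strings
players_demo <- tibble(
  experience   = c("Pro", "Veteran", "Veteran"),
  subscribe    = c(TRUE, TRUE, FALSE),
  hashedEmail  = c("hash_1", "hash_2", "hash_3"),
  played_hours = c(30.3, 3.8, 0.0),
  name         = c("Morgan", "Christian", "Blake"),
  gender       = c("Male", "Male", "Male"),
  Age          = c(9, 17, 17)
)

# Number of rows (observations) and columns (variables)
demo_dim <- dim(players_demo)
demo_dim

# Storage type of every column, as a named character vector
col_types <- sapply(players_demo, typeof)
col_types
```

Running the same two calls on the full `players` table should report the 196 rows, 7 columns, and the types given in the table above.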


In [3]:
stats_played_hours <- tibble(
  played_hours_min = min(pull(players, played_hours)),
  played_hours_max = max(pull(players, played_hours)),
  played_hours_mean = mean(pull(players, played_hours)),
  played_hours_sd = sd(pull(players, played_hours)),
  played_hours_median = median(pull(players, played_hours))
)
stats_played_hours

stats_age <- tibble(
  age_min = min(pull(players, Age), na.rm = TRUE),
  age_max = max(pull(players, Age), na.rm = TRUE),
  age_mean = mean(pull(players, Age), na.rm = TRUE),
  age_sd = sd(pull(players, Age), na.rm = TRUE),
  age_median = median(pull(players, Age), na.rm = TRUE)
)
stats_age

played_hours_min,played_hours_max,played_hours_mean,played_hours_sd,played_hours_median
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,223.1,5.845918,28.35734,0.1


age_min,age_max,age_mean,age_sd,age_median
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
9,58,21.13918,7.389687,19


For the numeric variables, the summary tables above show that `played_hours` ranges from 0 to 223.1 hours, with a mean of 5.85, a median of 0.1, and a standard deviation of 28.36. `Age` ranges from 9 to 58 years, with a mean of 21.14, a median of 19, and a standard deviation of 7.39.
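
As an alternative to building each tibble by hand, the same five statistics could be computed for both columns in a single `summarise()` call with `across()`. This is only a sketch under assumptions: `players_demo` is a hypothetical stand-in that holds the six preview rows rather than the full data, so its results differ from the values reported above.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in: the six preview rows, not the full 196-row table
players_demo <- tibble(
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  Age          = c(9, 17, 17, 21, 21, 17)
)

# One summarise() call; across() applies every function to every column
# and names the results <column>_<statistic>
stats_demo <- players_demo %>%
  summarise(
    across(
      c(played_hours, Age),
      list(
        min    = ~ min(.x, na.rm = TRUE),
        max    = ~ max(.x, na.rm = TRUE),
        mean   = ~ mean(.x, na.rm = TRUE),
        sd     = ~ sd(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    )
  )
stats_demo
```

Passing `na.rm = TRUE` everywhere keeps the behaviour the same for columns with and without missing values.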

The data has several issues to consider. There are missing values in the `Age` and `gender` columns that will need to be handled in our analysis (the `Age` summaries above were computed with `na.rm = TRUE` for this reason). We also notice potential sampling biases: the data appears to skew toward younger male players with relatively low play hours (the median of `played_hours` is 0.1 against a mean of 5.85), which could limit how well our findings generalize to other types of players, such as older individuals, players of other genders, or more dedicated gamers.
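
One way to quantify these issues is to count the missing values per column and tabulate the categorical variables. The sketch below uses a hypothetical toy tibble, `players_demo`, in which the `NA` values are placed by hand purely for illustration. It does not reproduce the real missing-value counts.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: the NA values are inserted by hand for illustration
players_demo <- tibble(
  gender       = c("Male", "Female", NA, "Male"),
  played_hours = c(30.3, 0.7, 0.0, 0.1),
  Age          = c(9, 21, 17, NA)
)

# Number of missing values in each column
na_counts <- players_demo %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Distribution of a categorical variable, including NA as its own group
gender_counts <- players_demo %>%
  count(gender)
gender_counts
```

Applied to the full `players` table, the first result would show which columns actually contain missing values, and the second would show how unbalanced the gender groups are.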

In [4]:
# Now read sessions.csv
sessions <- read_csv("https://raw.githubusercontent.com/HollieHuang666/Dsci-100-Individual-Planning-Report/refs/heads/main/sessions.csv")
head(sessions)

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


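In summary, the players table has 196 observations, one per player. The sessions table has 1535 rows and 5 columns, so it has 1535 observations and 5 variables. Each row there is one play session rather than one person, and the same `hashedEmail` can appear in several rows, as the preview above shows.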
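
Because `start_time` and `end_time` are read as character strings in day/month/year format, they need to be parsed before session lengths can be computed. The sketch below is one possible approach and not part of the original analysis: `sessions_demo` is a hypothetical stand-in built from four preview rows (with placeholder identifiers instead of the real hashes), and `lubridate::dmy_hm()` is assumed to be a suitable parser for this format.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical stand-in for `sessions`: times copied from the preview,
# hashes replaced by placeholder identifiers
sessions_demo <- tibble(
  hashedEmail = c("player_a", "player_b", "player_c", "player_a"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33",
                  "25/07/2024 17:34", "25/07/2024 03:22"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46",
                  "25/07/2024 17:57", "25/07/2024 03:58")
)

# Parse the day/month/year strings and compute each session's length in minutes
sessions_parsed <- sessions_demo %>%
  mutate(
    start_time   = dmy_hm(start_time),
    end_time     = dmy_hm(end_time),
    duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))
  )
sessions_parsed

# Rows are sessions, so distinct players must be counted separately
session_summary <- sessions_parsed %>%
  summarise(n_sessions = n(), n_players = n_distinct(hashedEmail))
session_summary
```

On the full table, `n_sessions` would equal the 1535 rows, while `n_players` would give the number of distinct players who have at least one recorded session.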