In [1]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [4]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


<h1>1) Data Description</h1>

This project uses two datasets provided as CSV files: <code>players.csv</code> and <code>sessions.csv</code>. The data were collected from a Minecraft research server run by a Computer Science group at UBC. <code>players.csv</code> contains information on individual players, while <code>sessions.csv</code> contains information on individual play sessions.

<h2>Dataset Summaries</h2>
<h3><code>players.csv</code></h3>
<ul>
    <li>196 observations, each representing a unique player</li>
    <li>7 columns/variables</li>
    <li>Variable Details:
    <ol>
        <li>experience (character): A categorical variable describing the player's experience (e.g., 'Pro', 'Veteran', 'Amateur', 'Regular').</li>
        <li>subscribe (logical): A boolean variable (TRUE/FALSE) indicating whether the player is subscribed.</li>
        <li>hashedEmail (character): A categorical variable serving as a unique identifier for each player.</li>
        <li>played_hours (double): A quantitative variable for the total number of hours played.</li>
        <li>name (character): A categorical variable for the player's in-game name.</li>
        <li>gender (character): A categorical variable for the player's gender.</li>
        <li>Age (double): A quantitative variable for the player's age.</li>
    </ol>
    </li>
</ul>

<h3><code>sessions.csv</code></h3>
<ul>
    <li>1,535 observations, each representing one play session</li>
    <li>5 columns/variables</li>
    <li>Variable Details:
        <ol>
            <li>hashedEmail (character): A categorical variable serving as a unique identifier for each player.</li>
            <li>start_time (character): The start time of each session, stored as plain text rather than as a date-time type.</li>
            <li>end_time (character): The end time of each session, stored as plain text rather than as a date-time type.</li>
            <li>original_start_time (double): The start time of each session as a quantitative variable, stored as a Unix timestamp.</li>
            <li>original_end_time (double): The end time of each session as a quantitative variable, stored as a Unix timestamp.</li>
        </ol>
    </li>
</ul>
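
As a rough sketch of how the two time representations could be brought into a common date-time type, the cell below uses a tiny hypothetical tibble (<code>toy_sessions</code>) rather than the real file. The <code>dd/mm/yyyy HH:MM</code> text format and the millisecond resolution of the Unix values are assumptions, not something confirmed above, so both should be checked against the actual data before reuse.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical rows standing in for sessions.csv; values are illustrative only.
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    original_start_time = c(1719771120000, 1718667180000)
)

toy_parsed <- toy_sessions |>
    mutate(
        # Assumes day/month/year hour:minute text; yields NA where parsing fails.
        start_dt = dmy_hm(start_time, tz = "UTC"),
        # Assumes the Unix timestamp is in milliseconds, hence the division by 1000.
        original_start_dt = as_datetime(original_start_time / 1000, tz = "UTC")
    )

toy_parsed
```

Once both columns are date-times, a session length could be computed as a difference of the end and start values, provided the same assumptions hold for the end-time columns.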

<h2>Issues within the Data</h2>

An initial look at the data reveals a few potential issues that will need to be addressed.
<ul>
    <li><code>players.csv</code>'s <code>Age</code> column contains 2 missing (NA) values, and <code>sessions.csv</code>'s <code>end_time</code> and <code>original_end_time</code> columns each contain 2 missing (NA) values. Because so few rows are affected, we can choose to either remove them or perform mean imputation.</li>
    <li><code>start_time</code> and <code>end_time</code> are stored as text and would likely need to be converted to a numerical form for any time-based analysis, although their Unix-timestamp counterparts (<code>original_start_time</code> and <code>original_end_time</code>) are already provided.</li>
    <li>Finally, and most notably, some of the data is highly skewed or imbalanced. <code>played_hours</code>, for example, has a mean of 5.85 but a median of 0.10, which implies that most players have very few hours while a few players have very many. The <code>experience</code> variable is similarly uneven, with far more Amateurs (63) than Pros (14).</li>
</ul>
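
The missing-value check and the two handling options mentioned above (removal or mean imputation) could be sketched roughly as follows. The tibble <code>toy_players</code> is a hypothetical stand-in with made-up values, used only so the cell runs on its own; the same pipeline would be applied to <code>players</code> in practice.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for players.csv; values are illustrative only.
toy_players <- tibble(
    played_hours = c(0.1, 30.3, 0.0, 1.5),
    Age = c(17, NA, 21, 22)
)

# Count the missing values in every column.
na_counts <- toy_players |>
    summarize(across(everything(), ~ sum(is.na(.x))))

# Option 1: drop the rows where Age is missing.
players_dropped <- toy_players |>
    drop_na(Age)

# Option 2: replace a missing Age with the mean of the observed ages.
players_imputed <- toy_players |>
    mutate(Age = replace_na(Age, mean(Age, na.rm = TRUE)))

na_counts
```

Dropping rows keeps only observed values but shrinks the dataset, while mean imputation keeps every row at the cost of slightly understating the variance of <code>Age</code>.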

In [35]:
# Mean and median of total hours played, to gauge skew.
players_hours_stats <- players |>
    summarize(
        mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
        median_played_hours = median(played_hours, na.rm = TRUE))

# Number of players at each experience level.
players_experience_stats <- players |>
    group_by(experience) |>
    summarize(count = n())

players_hours_stats
players_experience_stats

mean_played_hours,median_played_hours
<dbl>,<dbl>
5.85,0.1


experience,count
<chr>,<int>
Amateur,63
Beginner,35
Pro,14
Regular,36
Veteran,48
