In [2]:
library(tidyverse)
library(tidymodels)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [22]:
players_url <- "https://raw.githubusercontent.com/NelsonJYLee/dsci100-project/refs/heads/main/data/players.csv"
sessions_url <- "https://raw.githubusercontent.com/NelsonJYLee/dsci100-project/refs/heads/main/data/sessions.csv"

players <- read_csv(players_url)

nrow(players)
ncol(players)

played_hours_summary <- players |>
  summarize(
    min_hours = min(played_hours, na.rm = TRUE),
    max_hours = max(played_hours, na.rm = TRUE),
    mean_hours = mean(played_hours, na.rm = TRUE),
    median_hours = median(played_hours, na.rm = TRUE),
    sd_hours = sd(played_hours, na.rm = TRUE)
  )

age_summary <- players |>
  summarize(
    min_age = min(Age, na.rm = TRUE),
    max_age = max(Age, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE),
    median_age = median(Age, na.rm = TRUE),
    sd_age = sd(Age, na.rm = TRUE)
  )

experience_summary <- players |>
    group_by(experience) |>
    summarize(count = n()) |>
    mutate(percent = count/sum(count) * 100)

subscribe_summary <- players |>
    group_by(subscribe) |>
    summarize(count = n()) |>
    mutate(percent = count/sum(count) * 100)

gender_summary <- players |>
    group_by(gender) |>
    summarize(count = n()) |>
    mutate(percent = count/sum(count) * 100)

played_hours_summary
age_summary
experience_summary
subscribe_summary
gender_summary

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


min_hours,max_hours,mean_hours,median_hours,sd_hours
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,223.1,5.845918,0.1,28.35734


min_age,max_age,mean_age,median_age,sd_age
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
9,58,21.13918,19,7.389687


experience,count,percent
<chr>,<int>,<dbl>
Amateur,63,32.142857
Beginner,35,17.857143
Pro,14,7.142857
Regular,36,18.367347
Veteran,48,24.489796


subscribe,count,percent
<lgl>,<int>,<dbl>
False,52,26.53061
True,144,73.46939


gender,count,percent
<chr>,<int>,<dbl>
Agender,2,1.0204082
Female,37,18.877551
Male,124,63.2653061
Non-binary,15,7.6530612
Other,1,0.5102041
Prefer not to say,11,5.6122449
Two-Spirited,6,3.0612245


In [48]:
sessions <- read_csv(sessions_url)
nrow(sessions)
ncol(sessions)

sessions_hashedEmail_summary <- sessions |>
    group_by(hashedEmail) |>
    summarize(count = n())

sessions_num_summary <- sessions_hashedEmail_summary |>
    summarize(
        min_sessions = min(count, na.rm = TRUE),
        max_sessions = max(count, na.rm = TRUE),
        mean_sessions = mean(count, na.rm = TRUE),
        median_sessions = median(count, na.rm = TRUE),
        sd_sessions = sd(count, na.rm = TRUE))

original_start_time_summary <- sessions |>
    summarize(
        min_start = min(original_start_time, na.rm = TRUE),
        max_start = max(original_start_time, na.rm = TRUE),
        mean_start = mean(original_start_time, na.rm = TRUE),
        median_start = median(original_start_time, na.rm = TRUE),
        sd_start = sd(original_start_time, na.rm = TRUE))

original_end_time_summary <- sessions |>
    summarize(
        min_end = min(original_end_time, na.rm = TRUE),
        max_end = max(original_end_time, na.rm = TRUE),
        mean_end = mean(original_end_time, na.rm = TRUE),
        median_end = median(original_end_time, na.rm = TRUE),
        sd_end = sd(original_end_time, na.rm = TRUE))

sessions_num_summary
original_start_time_summary
original_end_time_summary

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


min_sessions,max_sessions,mean_sessions,median_sessions,sd_sessions
<int>,<int>,<dbl>,<int>,<dbl>
1,310,12.28,1,41.3269


min_start,max_start,mean_start,median_start,sd_start
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1712400000000.0,1727330000000.0,1719201000000.0,1719200000000.0,3557491589


min_end,max_end,mean_end,median_end,sd_end
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1712400000000.0,1727340000000.0,1719196000000.0,1719180000000.0,3552813134


# Data Description

## Players file
- consists of 196 observations and 7 variables
- variables: experience, subscribe, hashedEmail, played_hours, name, gender, Age
- name and hashedEmail are mostly unique categorical variables, so they are not useful to summarize

### Quantitative Variables Summary
| Variable    | Description               | Mean     | SD     | Min    | Max     | Median  |
|-------------|---------------------------|----------|--------|--------|---------|---------|
| played_hours| number of hours played    | 5.85     | 28.36  | 0.00   | 223.10  | 0.10    |
| Age         | age of the player         | 21.14    | 7.39   | 9.00   | 58.00   | 19.00   |

### Categorical Variables Summary
#### Summary of experience
Description of variable: experience with Minecraft
| Experience | Count | Percent |
|------------|-------|---------|
| Amateur    | 63    | 32.14   |
| Beginner   | 35    | 17.86   |
| Pro        | 14    | 7.14    |
| Regular    | 36    | 18.37   |
| Veteran    | 48    | 24.49   |

#### Summary of subscribe
Description of variable: TRUE if the player is subscribed to a game-related newsletter, FALSE if not
| Subscribe  | Count | Percent |
|------------|-------|---------|
| TRUE       | 144   | 73.47   |
| FALSE      | 52    | 26.53   |

#### Summary of gender
Description of variable: self-reported gender
| Gender            | Count | Percent |
|-------------------|-------|---------|
| Agender           | 2     | 1.02    |
| Female            | 37    | 18.88   |
| Male              | 124   | 63.27   |
| Non-binary        | 15    | 7.65    |
| Other             | 1     | 0.51    |
| Prefer not to say | 11    | 5.61    |
| Two-Spirited      | 6     | 3.06    |
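
The count and percent columns in the three tables above all follow the same pattern, so they could also be produced by one small helper. The sketch below is only an illustration: `count_with_percent` is a hypothetical helper name, and `demo` is a made-up tibble standing in for `players`.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: count each level of one column and add its share in percent.
count_with_percent <- function(data, var) {
  data |>
    count({{ var }}, name = "count") |>
    mutate(percent = count / sum(count) * 100)
}

# Made-up stand-in for `players`, for illustration only.
demo <- tibble(experience = c("Pro", "Amateur", "Amateur", "Veteran"))

demo_summary <- count_with_percent(demo, experience)
demo_summary
```

On the real data, the equivalent calls would be `count_with_percent(players, experience)`, `count_with_percent(players, subscribe)`, and `count_with_percent(players, gender)`.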

### Potential Problems with Players file
- The Age column has 2 missing values, so we must ignore them (`na.rm = TRUE`) when computing Age's statistics
- Age is reported in whole years but is stored as a double. It should be of type integer.
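
Both points can be checked and handled in a few lines. The sketch below uses a small made-up tibble (`players_demo`, hypothetical values) in place of the real `players` data, so it shows the approach rather than the actual results.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for `players`; the values are made up for illustration.
players_demo <- tibble(
  played_hours = c(0.1, 3.8, 0.0, 12.5),
  Age = c(17, NA, 21, 19)
)

# Count the missing values in Age.
n_missing_age <- players_demo |>
  summarize(n_missing = sum(is.na(Age))) |>
  pull(n_missing)

# Store Age as an integer; missing values stay NA.
players_demo <- players_demo |>
  mutate(Age = as.integer(Age))

n_missing_age
players_demo
```

Applying the same `mutate()` to `players` would change the column type without affecting the summary statistics, since the ages are already whole numbers.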

## Sessions file
- consists of 1535 observations and 5 variables
- variables: hashedEmail, start_time, end_time, original_start_time, original_end_time
- start_time and end_time are not useful to summarize, explained below
- original_start_time and original_end_time are not useful to summarize, explained below

### Quantitative Variables Summary
| Variable         | Description                                                | Mean     | SD     | Min    | Max     | Median  |
|------------------|------------------------------------------------------------|----------|--------|--------|---------|---------|
| hashedEmailCount | number of sessions per unique hashed email address         | 12.28    | 41.33  | 1.00   | 310.00  | 1.00    |

### Potential Problems with Sessions file
- original_start_time and original_end_time columns are both missing 2 values
- original_start_time and original_end_time hold the same value within most observations, apparently because the trailing digits were rounded off. As a result, computing the session duration as original_end_time - original_start_time gives 0 for almost every observation.
- start_time and end_time are strings recorded to the minute, so we cannot immediately use them to calculate session durations or filter by time period. We must first convert the strings to date-times.
- most email addresses are associated with only one playing session (the median is 1), so it would be hard to draw conclusions about player behaviour over time
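
One way to work around the string timestamps is to parse them with lubridate and compute durations from the parsed values. The sketch below is a minimal example under stated assumptions: `sessions_demo` is a made-up stand-in for `sessions`, and the `"dd/mm/yyyy HH:MM"` format and UTC time zone are assumptions that should be checked against the real `start_time` and `end_time` columns.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical stand-in for `sessions`; the rows and the "dd/mm/yyyy HH:MM"
# format are assumptions for illustration.
sessions_demo <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

sessions_demo <- sessions_demo |>
  mutate(
    # Parse day/month/year hour:minute strings into date-times (UTC assumed).
    start_dt = dmy_hm(start_time, tz = "UTC"),
    end_dt = dmy_hm(end_time, tz = "UTC"),
    # Session length in minutes.
    duration_mins = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

sessions_demo
```

Because the strings are only precise to the minute, durations computed this way are approximate, but they avoid the zero durations produced by the rounded numeric columns.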


# Question of Interest

- broad: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
- specific: Can a player's age and Minecraft hours played predict whether the player is subscribed to a game-related newsletter?
- The data will help me answer this question by allowing me to train 