***Data Description***

**players.csv**

Number of Observations: 196

Number of Variables: 7

*Variable Descriptions*

| Name | Type | Description |
|----------|---------------|-------------|
| experience | chr | Player's familiarity with Minecraft; one of Beginner, Amateur, Regular, Veteran, or Pro |
| subscribe | lgl | TRUE if the user subscribed to the game-related newsletter, FALSE otherwise |
| hashedEmail | chr | Hashed email address; unique to each user, so users remain distinguishable |
| played_hours | dbl | Total number of hours played during testing |
| name | chr | First name of user |
| gender | chr | Gender of user |
| Age | dbl | Age of user (years) |


Issues & Potential Issues Within Data:
1. Multiple users may share the same name, so name cannot identify a player uniquely; users should be distinguished by hashedEmail instead
2. The experience classes are ambiguous because there is no fixed definition of what makes a player an "Amateur" rather than a "Pro"
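The first issue can be checked directly. The sketch below is a minimal illustration on a made-up toy table (`toy_players` and its values are hypothetical, not taken from players.csv): it lists names shared by more than one player and counts duplicated hashedEmail values.

```r
library(dplyr)
library(tibble)

# Toy stand-in for players.csv; the values are invented for illustration only
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3", "d4"),
  name        = c("Alex", "Sam", "Alex", "Kim")
)

# Names that belong to more than one player
shared_names <- toy_players |>
  count(name) |>
  filter(n > 1)

# Number of hashedEmail values that occur more than once (0 means the ID is unique)
n_duplicate_ids <- toy_players |>
  count(hashedEmail) |>
  filter(n > 1) |>
  nrow()

shared_names
n_duplicate_ids
```

Running the same two checks on `players` would show whether name is safe to use as an identifier or whether hashedEmail is needed.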

Data Collection Methods: User data were recorded from those who play on the Minecraft server (PLAICraft)

**sessions.csv**

Number of Observations: 1535

Number of Variables: 5

*Variable Descriptions*

| Name | Type | Description |
|----------|---------------|-------------|
| hashedEmail | chr | Hashed email address; unique to each user, so users remain distinguishable |
| start_time | chr | Start of play session in dd/mm/yyyy hh:mm format |
| end_time | chr | End of play session in dd/mm/yyyy hh:mm format |
| original_start_time | dbl | Start of play session in Unix epoch time (milliseconds since Jan 1st 1970 00:00:00 UTC) |
| original_end_time | dbl | End of play session in Unix epoch time (milliseconds since Jan 1st 1970 00:00:00 UTC) |

Issues & Potential Issues Within Data:
1. start_time and end_time are character strings that combine the date and the time; they need to be parsed into date-time values and may need to be split into separate date and time variables
2. original_start_time and original_end_time are in Unix epoch milliseconds and may need to be converted into a human-readable date-time format
3. original_start_time and original_end_time do not appear to carry enough precision to calculate a user's exact session length
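One possible way to handle the first two issues is sketched below with lubridate. The rows in `toy_sessions` are invented for illustration, and the sketch assumes the dd/mm/yyyy hh:mm strings can be read as UTC, which the data description does not confirm.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy stand-in for sessions.csv; the values are invented for illustration only
toy_sessions <- tibble(
  start_time          = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time            = c("30/06/2024 18:24", "18/06/2024 00:01"),
  original_start_time = c(1.71977e12, 1.71867e12)
)

parsed <- toy_sessions |>
  mutate(
    # Parse the dd/mm/yyyy hh:mm strings (dmy_hm() assumes UTC unless told otherwise)
    start_dt = dmy_hm(start_time),
    end_dt   = dmy_hm(end_time),
    # Split into a date and a clock time
    start_date  = as_date(start_dt),
    start_clock = format(start_dt, "%H:%M"),
    # Epoch milliseconds -> seconds -> date-time
    epoch_start_dt = as_datetime(original_start_time / 1000),
    # Session length in minutes
    session_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

parsed
```

Because the second toy session crosses midnight, computing the length from the parsed date-times rather than from the clock time alone keeps the result positive.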


Data Collection Methods: User data were recorded from those who play on the Minecraft server (PLAICraft)

***Questions***

**Broad Question**

What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific Formulated Question**

Can user-average play session length predict subscription status in the Minecraft server data?
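To make the predictor in this question concrete, here is a minimal sketch of how a user-average session length could be built and attached to subscription status. The tables `toy_players` and `toy_sessions` and the column names `avg_session_min` and `n_sessions` are hypothetical; only hashedEmail, subscribe, start_time, and end_time come from the data description.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy stand-ins for the two files; the values are invented for illustration only
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe   = c(TRUE, FALSE, TRUE)
)
toy_sessions <- tibble(
  hashedEmail = c("a1", "a1", "b2"),
  start_time  = c("01/05/2024 10:00", "02/05/2024 20:00", "03/05/2024 09:30"),
  end_time    = c("01/05/2024 10:30", "02/05/2024 21:00", "03/05/2024 09:40")
)

# One row per user: mean session length in minutes and number of sessions
avg_session <- toy_sessions |>
  mutate(session_min = as.numeric(
    difftime(dmy_hm(end_time), dmy_hm(start_time), units = "mins")
  )) |>
  group_by(hashedEmail) |>
  summarize(
    avg_session_min = mean(session_min, na.rm = TRUE),
    n_sessions      = n()
  )

# Attach the response variable; players without any session are dropped
player_features <- toy_players |>
  inner_join(avg_session, by = "hashedEmail")

player_features
```

An `inner_join()` is used here so that players with no recorded sessions, for whom an average session length is undefined, do not enter the analysis; a `left_join()` would keep them with a missing value instead.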

***Exploratory Data Analysis and Visualization***

| Variable Name | Mean | Unit |
|---------------|------|------|
| played_hours | 5.845918 | hours |
| Age | 21.13918 | years |
| original_start_time | 1.719201e+12 | milliseconds since Unix epoch |
| original_end_time | 1.719196e+12 | milliseconds since Unix epoch |

***Methods and Plan***

I propose to use k-nearest neighbors (kknn) because...
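As a rough illustration of what such a classifier could look like, the sketch below fits a k-nearest neighbors model with the tidymodels framework and the kknn engine. Using tidymodels is an assumption (the notebook only loads tidyverse), and the data in `toy`, the predictor name `avg_session_min`, and the choice of 3 neighbors are hypothetical placeholders rather than tuned or real values.

```r
library(tidymodels)

# Toy data invented for illustration; subscribe must be a factor for classification
toy <- tibble(
  avg_session_min = c(5, 8, 12, 15, 20, 25, 40, 45, 50, 60, 75, 90),
  subscribe = factor(c("FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE",
                       "FALSE", "TRUE", "TRUE", "TRUE", "TRUE", "TRUE"))
)

# Model specification: unweighted K-NN with an arbitrary K of 3
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
  set_engine("kknn") |>
  set_mode("classification")

# Standardize the predictor so distances are on a comparable scale
knn_recipe <- recipe(subscribe ~ avg_session_min, data = toy) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy)

# Predict the class of a hypothetical new player
new_player <- tibble(avg_session_min = 70)
predict(knn_fit, new_player)
```

In a full analysis the data would first be split into training and testing sets, and the number of neighbors would be chosen by cross-validation rather than fixed in advance.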

In [2]:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [14]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [26]:
# Mean of each quantitative variable in players (missing values excluded)
players_mean <- players |>
    select(played_hours, Age) |>
    summarize(avg_played_hours = mean(played_hours, na.rm = TRUE),
              avg_age = mean(Age, na.rm = TRUE))
# Mean session start and end in Unix epoch milliseconds (missing values excluded)
session_mean <- sessions |>
    select(original_start_time, original_end_time) |>
    summarize(avg_og_start = mean(original_start_time, na.rm = TRUE),
              avg_og_end = mean(original_end_time, na.rm = TRUE))

players_mean
session_mean

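| avg_played_hours (dbl) | avg_age (dbl) |
|------------------------|---------------|
| 5.845918 | 21.13918 |

| avg_og_start (dbl) | avg_og_end (dbl) |
|--------------------|------------------|
| 1719201000000.0 | 1719196000000.0 |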


In [13]:
# players is kept as-is; for sessions, keep only the user ID and the readable time columns
players_tidy <- players
sessions_tidy <- sessions |>
    select(hashedEmail, start_time, end_time)