In [None]:
#Run these beforehand
library(tidyverse)
library(repr)
library(tidymodels)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


# Data Description
The dataset is separated into two lists. One is a list of 196 unique players with 7 variables of data on each player:
- experience, their experience as one of 'Beginner', 'Regular', 'Amateur', 'Veteran', or 'Pro' (chr)
- hashedEmail, private hashed email address (chr)
- name, player name (chr)
- gender, player gender (chr)
- played_hours, number of hours they have played on the server (dbl)
- Age, player age in years (dbl)
- subscribe, whether or not they are subscribed to a games-related newsletter (lgl)

Some issues I noticed with this list are that the experience hierarchy is unclear (for example, whether or not 'Regular' is more experienced than 'Amateur'), that there are NA values in Age, and that it is unclear which newsletter the players are subscribed to.
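A minimal sketch of how these issues could be checked before modelling. The `toy_players` table below is a hypothetical stand-in with made-up values, not the real data; only the column names `experience` and `Age` come from the description above. The experience labels are counted without imposing an order, since the hierarchy is not known.

```r
library(dplyr)

# Hypothetical stand-in for the players table; values are illustrative only.
toy_players <- tibble::tibble(
  experience = c("Pro", "Veteran", "Amateur", "Regular", "Beginner", "Amateur"),
  Age = c(21, 17, NA, 19, 22, NA)
)

# Count how many players have a missing age.
n_missing_age <- sum(is.na(toy_players$Age))

# Tabulate the experience labels without assuming any ordering among them.
experience_counts <- count(toy_players, experience)

# Drop players with a missing age before any age-based analysis.
players_with_age <- filter(toy_players, !is.na(Age))
```

Running the same three steps on `players` would show how many observations are lost when NA ages are removed.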

The other is a list of individual play sessions with 5 variables on each session:
- hashedEmail, private hashed email address (chr)
- start_time, start of the session as a day/month/year date followed by the time (chr)
- end_time, end of the session as a day/month/year date followed by the time (chr)
- original_start_time, Unix epoch start time in milliseconds (dbl)
- original_end_time, Unix epoch end time in milliseconds (dbl)

Some issues I noticed with this list are that start_time and end_time each hold the date and the time in the same column, and these should probably be separated. The list also lacks the session length, which is probably the most basic piece of data that should be included as its own column. I am also not certain that every session is associated with a hashedEmail in the player list.
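One way the last two issues could be addressed is sketched below. The `toy_sessions` and `toy_players` tables are hypothetical stand-ins with made-up values, and the sketch assumes the timestamps look like `"30/06/2024 18:12"` (day/month/year, then hours and minutes); if the real strings differ, the parser would need to change.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-ins for the two tables; values are illustrative only.
toy_sessions <- tibble::tibble(
  hashedEmail = c("a1", "a1", "b2", "zz"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33",
                  "25/07/2024 17:34", "01/08/2024 10:00"),
  end_time    = c("30/06/2024 18:24", "17/06/2024 23:46",
                  "25/07/2024 17:57", "01/08/2024 10:30")
)
toy_players <- tibble::tibble(hashedEmail = c("a1", "b2", "c3"))

# Parse the date-time strings (assumed format) and compute length in minutes.
sessions_with_length <- toy_sessions |>
  mutate(
    start_dt = dmy_hm(start_time),
    end_dt = dmy_hm(end_time),
    session_minutes = as.numeric(difftime(end_dt, start_dt, units = "mins"))
  )

# Sessions whose hashedEmail has no match in the player list.
unmatched_sessions <- anti_join(toy_sessions, toy_players, by = "hashedEmail")
```

An empty `unmatched_sessions` on the real data would confirm that every session belongs to a listed player.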

# Questions
For this project I would like to focus on Question 2: what "kinds" of players are most likely to contribute a large amount of data, so that we can target those players in our recruiting efforts. The specific question I would like to ask is: can player experience and age predict the total length of play time across that player's sessions? To answer this question, I would use the data to determine whether age and total session length have a linear association. I could also either split the data by experience category or fit a k-nn model, and compare which gives better predictions. Since there are NA values in the age, I would likely have to remove those observations when examining age, and I would also likely need to create a new session length column in the list of sessions for ease of use.
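A rough sketch of how the two modelling options could be compared with tidymodels. Everything here is an assumption made for illustration: the data are simulated (so the resulting numbers mean nothing), `total_minutes` is a hypothetical name for the summed session length per player, experience is encoded as dummy variables rather than used to split the data, `neighbors = 5` is an arbitrary choice, and the k-nn model needs the kknn package to be installed.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in data; not the real players or sessions.
toy <- tibble::tibble(
  Age = round(runif(60, 10, 40)),
  experience = factor(sample(
    c("Beginner", "Regular", "Amateur", "Veteran", "Pro"), 60, replace = TRUE
  )),
  total_minutes = 5 + 2 * Age + rnorm(60, sd = 10)
)

# Hold out a test set so both models are scored on unseen rows.
toy_split <- initial_split(toy, prop = 0.75)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Scale Age, then turn experience into dummy variables.
toy_recipe <- recipe(total_minutes ~ Age + experience, data = toy_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

lm_spec <- linear_reg() |>
  set_engine("lm")

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

# Fit one model spec on the training rows and return its test-set RMSE.
rmse_for <- function(spec) {
  workflow() |>
    add_recipe(toy_recipe) |>
    add_model(spec) |>
    fit(data = toy_train) |>
    predict(toy_test) |>
    bind_cols(toy_test) |>
    metrics(truth = total_minutes, estimate = .pred) |>
    filter(.metric == "rmse") |>
    pull(.estimate)
}

rmse_lm <- rmse_for(lm_spec)
rmse_knn <- rmse_for(knn_spec)
```

The model with the lower test RMSE would be the better predictor under this setup; in practice the number of neighbours would be tuned with cross-validation rather than fixed.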

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

In [None]:
# Split each date-time string into separate date and time columns.
clean_sessions <- sessions |>
    separate(start_time, c("start_date", "start_time"), sep = " ") |>
    separate(end_time, c("end_date", "end_time"), sep = " ") |>
    # Rename by name rather than overwriting colnames() by position.
    rename(hashed_email = hashedEmail)
head(clean_sessions)
colnames(players)