In [None]:
#libraries
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

# Data Description

### For the Players data:  
Rows: 196  
Columns: 7
#### Variables
- **experience**
    - Categorical variable indicating how experienced each player is at the game
    - The collection method is unclear; if it is self-reported, it may not be a reliable indicator of skill level
- **subscribe**
    - Categorical variable reporting whether a player is subscribed
- **hashedEmail**
    - Character variable containing each player's hashed email
- **played_hours**
    - Double variable containing the number of hours each player has played
- **name**
    - Character variable containing each player's first name
    - Likely self-reported
- **gender**
    - Categorical variable containing each player's gender
    - Includes "Male", "Female", various nonbinary identities, and "prefer not to state"
    - Likely self-reported
- **Age**
    - Double variable giving the age of each player
    - Likely self-reported
#### Summary Statistics:



### For the sessions data

# Questions:  
**Broad question:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  
**Specific question:** Can a player's average session length, total number of sessions, and total played hours predict whether that player will subscribe to a game-related newsletter?  
**Connection to the data:**  
I can address this question by training a k-nearest neighbors (KNN) classification model with average session length, total number of sessions, and total played hours as predictors and the subscribe variable as the response.  
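
As a rough illustration of what such a model could look like, here is a minimal sketch using tidymodels. The data are simulated, the column names `avg_session_length` and `total_sessions` are placeholders for features that still have to be computed, and `neighbors = 5` is an arbitrary choice rather than a tuned value. It also assumes the `kknn` package is installed.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in data; the real features come from the wrangled data sets.
# avg_session_length and total_sessions are assumed (hypothetical) column names.
toy <- tibble(
  avg_session_length = c(rnorm(20, 30, 5), rnorm(20, 60, 5)),
  total_sessions = as.numeric(c(rpois(20, 3), rpois(20, 12))),
  played_hours = c(runif(20, 0, 2), runif(20, 5, 20)),
  subscribe = factor(rep(c("FALSE", "TRUE"), each = 20))
)

# Scale and center the predictors so no single feature dominates the distance
knn_recipe <- recipe(
  subscribe ~ avg_session_length + total_sessions + played_hours,
  data = toy
) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier; 5 neighbors is an arbitrary illustrative value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy)

# Predicted class for each row of the toy data
toy_predictions <- predict(knn_fit, toy)
toy_predictions
```

The sketch fits and predicts on the same rows only to keep it short. An actual analysis would split the data into training and testing sets and choose the number of neighbors by cross-validation.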
**Wrangling:**  
To build the model, I need to:  
- Join the data sets on the player identifier (the hashed email)  
- Calculate the average session length and total number of sessions from the sessions data  
- Filter to include only relevant data  
- Tidy and normalize the data as needed for KNN classification
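
The steps above might be sketched as follows. This is only an illustration on made-up data: `start_time` and `end_time` are assumed names for the session timestamp columns, and the `_toy` tibbles stand in for the real data sets.

```r
library(dplyr)

# Made-up stand-ins for the two data sets.
# start_time and end_time are assumed (hypothetical) column names.
players_toy <- tibble(
  hashed_email = c("a1", "b2", "c3"),
  played_hours = c(1.5, 0.2, 10.0),
  subscribe = c(TRUE, FALSE, TRUE)
)

sessions_toy <- tibble(
  hashed_email = c("a1", "a1", "b2", "c3", "c3", "c3"),
  start_time = as.POSIXct(
    c("2024-05-01 10:00", "2024-05-02 18:00", "2024-05-03 09:00",
      "2024-05-01 20:00", "2024-05-04 21:00", "2024-05-06 12:00"),
    tz = "UTC"
  ),
  end_time = as.POSIXct(
    c("2024-05-01 10:30", "2024-05-02 19:00", "2024-05-03 09:10",
      "2024-05-01 22:00", "2024-05-04 21:30", "2024-05-06 13:30"),
    tz = "UTC"
  )
)

# One row per player: average session length (minutes) and session count
session_features <- sessions_toy |>
  mutate(
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  group_by(hashed_email) |>
  summarize(
    avg_session_length = mean(session_minutes),
    total_sessions = n()
  )

# Join on the hashed email and keep only the modelling columns
model_data <- players_toy |>
  inner_join(session_features, by = "hashed_email") |>
  select(subscribe, played_hours, avg_session_length, total_sessions) |>
  mutate(subscribe = as.factor(subscribe))

model_data
```

Note that `inner_join()` drops players who have no recorded sessions. Whether to keep them instead (for example with `left_join()` and zero sessions) is a design decision to make once the real data has been inspected. Scaling and centering the predictors would still be needed before fitting a KNN model.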



# Exploratory Data Analysis and Visualization

In [None]:
# importing the data
players_data <- read_csv("data/players.csv")
sessions_data <- read_csv("data/sessions.csv")
players_data
sessions_data

In [None]:
# removing any rows with NA values and making column name formats consistent (snake_case)

sessions_tidy <- sessions_data |>
    drop_na() |>
    rename(hashed_email = hashedEmail)

sessions_tidy

players_tidy <- players_data |>
    drop_na() |>
    rename(age = Age, hashed_email = hashedEmail)

players_tidy

# computing summary statistics (means) for the players data and formatting for readability
players_summary <- players_tidy |>
    summarize(
        played_hours = mean(played_hours),
        age = mean(age)
    ) |>
    pivot_longer(everything(), names_to = "variable", values_to = "mean_value")
players_summary

#plots

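# plotting played hours vs age, colored by subscription status
age_plot <- ggplot(players_tidy, aes(x = age, y = played_hours, color = subscribe)) +
    geom_point(alpha = 0.6) +
    labs(x = "Age (years)", y = "Played hours", color = "Subscribed")
age_plot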