In [None]:
#libraries
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

# Data Description

### For the Players data:  
Rows: 196  
Columns: 7
#### Variables
- **experience**
    - Categorical variable giving each player's experience level.
    - Likely self-reported, so potentially inaccurate.
- **subscribe**
    - Categorical variable reporting subscription status.
- **hashedEmail**
    - Categorical variable containing each player's hashed email.
- **played_hours**
    - Double containing the number of hours each player has played.
- **name**
    - Categorical variable containing each player's first name.
- **gender**
    - Categorical variable containing each player's gender.
- **Age**
    - Double giving the age of each player.
#### Summary Statistics:   
Of the players:  
124 are male, 37 are female, and 33 identify as non-binary or didn't state their gender.  
For experience level: 35 are beginner, 35 are regular, 63 are amateur, 48 are veteran, and 13 are pro.  

Note: name, gender, and experience level are likely self-reported, so they may be inaccurate for some observations. Some cells have missing values; the counts above are computed after dropping rows with missing values, which is why each set sums to 194 rather than 196.

### For the sessions data:
Rows: 1535  
Columns: 5  
#### Variables
- **hashedEmail**
    - Same as above
- **start_time, end_time**
    - Start and end time of the observed session, stored as character strings.
- **original_start_time, original_end_time**
    - Doubles containing a time value in milliseconds.
    - The two columns appear to hold identical values within a given observation, so they may be incorrect.

#### Summary Statistics:  
Average number of sessions per player (among players with at least one recorded session): 12.26  
Most sessions by one player: 310  
Note: the average number of sessions per player appears to be heavily skewed by a few high outliers.
  

# Questions:  
**Broad question:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  
**Specific question:** Can a player's average session length, total number of sessions, and total played hours predict whether that player will subscribe to a game-related newsletter?  
**Connection to the data:**  
I can answer this question by training a K-nearest neighbors (KNN) classification model with average session length, total number of sessions, and total played hours as predictors and the subscribe variable as the response.  
**Wrangling Plan:**  
To build my model I need to:  
- Join the data sets by player ID (the hashed email)  
- Calculate the average session length and total number of sessions from the sessions data  
- Filter to keep only the relevant variables and observations  
- Tidy and normalize the data as needed for KNN classification
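
The wrangling steps above could be sketched roughly as follows. This is a minimal illustration on hypothetical toy tables, not the final pipeline: the names `toy_sessions`, `toy_players`, `session_features`, `model_data`, `avg_session_mins`, and `num_sessions` are made up for the example, and the `day/month/year hour:minute` timestamp format is an assumption that should be checked against the real `start_time` and `end_time` columns.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical toy data standing in for the real sessions and players tables
toy_sessions <- tibble(
  hashed_email = c("a", "a", "b"),
  start_time = c("01/06/2024 10:00", "02/06/2024 12:00", "03/06/2024 09:30"),
  end_time = c("01/06/2024 10:30", "02/06/2024 13:00", "03/06/2024 09:45")
)
toy_players <- tibble(
  hashed_email = c("a", "b"),
  subscribe = c(TRUE, FALSE),
  played_hours = c(1.5, 0.2)
)

# Assumption: timestamps are "day/month/year hour:minute" character strings
session_features <- toy_sessions |>
  mutate(
    start = dmy_hm(start_time),
    end = dmy_hm(end_time),
    session_mins = as.numeric(difftime(end, start, units = "mins"))
  ) |>
  group_by(hashed_email) |>
  summarize(
    avg_session_mins = mean(session_mins),
    num_sessions = n()
  )

# Join on the player ID and keep only the response and the three predictors
model_data <- toy_players |>
  inner_join(session_features, by = "hashed_email") |>
  mutate(subscribe = as.factor(subscribe)) |>
  select(subscribe, avg_session_mins, num_sessions, played_hours)

model_data
```

Note that `inner_join()` drops players with no recorded sessions. Keeping them with `left_join()` and treating their session counts as zero would be an alternative, but average session length would then be undefined for those players.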



# Exploratory Data Analysis and Visualization

In [None]:
# importing the data
players_data <- read_csv("data/players.csv")
sessions_data <- read_csv("data/sessions.csv")
players_data
sessions_data



In [None]:
# removing any rows with NA values and making column name formats consistent

# renamed columns to snake_case so the data is easier to work with
# (not strictly necessary, but helpful)
sessions_tidy <- sessions_data |>
    na.omit() |>
    rename(hashed_email = hashedEmail)

sessions_tidy

players_tidy <- players_data |>
    na.omit() |>
    rename(age = Age, hashed_email = hashedEmail)

players_tidy

# computing summary statistics for players data and formatting for readability
players_means <- players_tidy |>
    summarize(
        played_hours = mean(played_hours),
        age = mean(age)
    ) |>
    pivot_longer(everything(), names_to = "variable", values_to = "mean_value")
players_means

# computing the number of players of each gender (for summary stats)
gender_counts <- players_tidy |>
    count(gender)
gender_counts

# computing the number of players at each experience level (for summary stats)
experience_counts <- players_tidy |>
    count(experience)
experience_counts

# computing summary statistics for sessions data:

# finding the number of sessions per player, sorted so the maximum is in the first row
sessions_per_player <- sessions_tidy |>
    count(hashed_email, name = "num_sessions") |>
    arrange(desc(num_sessions))
sessions_per_player

# finding the average number of sessions per player
mean_sessions <- sessions_per_player |>
    summarize(mean_sessions = mean(num_sessions))
mean_sessions


# Plots

In [None]:
# plots

# mean age by subscription status
subscription_age_plot <- players_tidy |>
    group_by(subscribe) |>
    summarize(mean_age = mean(age)) |>
    ggplot(aes(x = subscribe, y = mean_age, fill = subscribe)) +
    geom_col() +
    labs(x = "Subscription Status", y = "Mean Age (Years)", fill = "Subscribed") +
    ggtitle("Mean Age by Subscription Status") +
    theme(plot.title = element_text(hjust = 0.5))
subscription_age_plot

In [None]:
# subscription status vs experience level
subscription_experience_plot <- players_tidy |>
    ggplot(aes(x = experience, fill = subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Experience Level", y = "Proportion of Players", fill = "Subscribed") +
    ggtitle("Experience vs Subscription Status") +
    theme(plot.title = element_text(hjust = 0.5))

subscription_experience_plot

## Plot Conclusions:  
These plots show that players who are not subscribed are, on average, a couple of years older than those who are, and that the proportion of subscribed players is fairly similar across experience levels. Overall, there does not appear to be a strong relationship between age or experience level and subscription status.

# Methods and plan:  
#### Method:  
I will use a K-nearest neighbors (KNN) classification model to predict whether a player will subscribe to the game-related newsletter, using average session length, total played hours, and total number of sessions as predictors.  
#### Why this method:   
KNN is the classification method we have learned so far that best suits this problem. It makes no assumptions about the shape of the data, so it can classify even when the relationships between the predictors and the response are nonlinear.  
#### Assumptions:
KNN is distance-based, so it assumes the predictors have been centered and scaled to comparable ranges, and that players with similar session characteristics tend to have the same subscription status.
#### Limitations:  
KNN doesn't work well when classes are imbalanced, because the majority class tends to dominate the neighbor votes. This could cause a problem here, given that the subscription classes are fairly uneven.  
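
One way to quantify this limitation is to compute the class proportions and the accuracy a majority-class guesser would achieve, which gives a baseline for judging the KNN model. The snippet below is a small sketch on hypothetical labels; `toy_labels`, `class_balance`, and `baseline_accuracy` are illustrative names, and the real proportions would come from the `subscribe` column of the players data.

```r
library(dplyr)
library(tibble)

# Hypothetical labels; the real values would come from the subscribe column
toy_labels <- tibble(subscribe = c(TRUE, TRUE, TRUE, FALSE))

# Count each class and convert the counts to proportions
class_balance <- toy_labels |>
  count(subscribe) |>
  mutate(prop = n / sum(n))

# Accuracy obtained by always predicting the most common class
baseline_accuracy <- max(class_balance$prop)

class_balance
baseline_accuracy
```

A tuned model whose accuracy is not clearly above this baseline is probably not learning much beyond the class balance.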
#### Model selection:  
I will tune the value of k using 5-fold cross-validation and compare the candidate models by accuracy to find the value of k that gives the best performance.   
#### Data Processing and Splitting:
I will split the data into a training set and a test set with a 70/30 split, then normalize the predictors using statistics from the training set only, so that the test set does not influence preprocessing. I will perform 5-fold cross-validation on the training set to find the best model and evaluate it once on the test set.
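
The plan above might look roughly like the following with `tidymodels`. This is a sketch under stated assumptions rather than the final analysis: the data are simulated, the predictor names `avg_session_mins` and `num_sessions` are placeholders for the wrangled features, the seed and the grid of candidate k values are arbitrary, stratifying the split and the folds by `subscribe` is an added choice, and the `kknn` package is assumed to be installed as the engine.

```r
library(tidymodels)

set.seed(1) # arbitrary seed for reproducibility

# Hypothetical simulated data standing in for the wrangled player features
n <- 120
model_data <- tibble(
  avg_session_mins = rexp(n, rate = 1 / 30),
  num_sessions = rpois(n, lambda = 8),
  played_hours = rexp(n, rate = 1 / 5),
  subscribe = factor(sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.7, 0.3)))
)

# 70/30 split, stratified so both sets keep a similar class balance
data_split <- initial_split(model_data, prop = 0.7, strata = subscribe)
train_data <- training(data_split)
test_data <- testing(data_split)

# Centering and scaling are estimated from the training data only
knn_recipe <- recipe(
  subscribe ~ avg_session_mins + num_sessions + played_hours,
  data = train_data
) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# KNN specification with the number of neighbors left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation over an arbitrary grid of odd k values
folds <- vfold_cv(train_data, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = train_data)

# Evaluate once on the held-out test set
test_accuracy <- predict(final_fit, test_data) |>
  bind_cols(test_data) |>
  accuracy(truth = subscribe, estimate = .pred_class)

test_accuracy
```

Because the recipe sits inside the workflow, the centering and scaling are re-estimated within each cross-validation fold, so the validation portion of a fold never informs the preprocessing.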