In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

# Project Planning

## 1) Data Description 
### players.csv
- 196 observations
- 7 variables
    - experience (character): level of experience of each player
    - subscribe (logical): whether the player is subscribed to a game-related newsletter or not
    - hashedEmail (character): hashed email of each player
    - played_hours (double): hours played by each player (hr)
    - name (character): name of each player
    - gender (character): gender of each player
    - Age (double): age of each player
- Issues:
    - not tidy; variable names are inconsistent (played_hours uses an underscore, hashedEmail uses camelCase, and Age is capitalized) and not informative enough (e.g., played_hours does not state its unit)
- Average played hours for different experience levels
    - Amateur: 6.02
    - Beginner: 1.25
    - Pro: 2.60
    - Regular: 18.21
    - Veteran: 0.65
- Average played hours: 5.85
- Average age: N/A (a plain mean(Age) returns NA, most likely because some Age values are missing)
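
A plain mean produces NA whenever a column contains missing values, which would explain the N/A above. A minimal sketch with a made-up age vector (not the real data) shows the behaviour and how na.rm = TRUE changes it:

```r
# Made-up ages for illustration only; one value is missing
ages <- c(17, 21, NA, 25)

# Count the missing values first, so we know how many rows are affected
n_missing <- sum(is.na(ages))

# mean() propagates NA by default
mean_with_na <- mean(ages)

# na.rm = TRUE drops the missing values before averaging
mean_without_na <- mean(ages, na.rm = TRUE)

n_missing        # 1
mean_with_na     # NA
mean_without_na  # 21
```

Whether dropping the missing ages is acceptable depends on how many there are, so counting them first seems a sensible habit.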

In [None]:
players <- read_csv("data/players.csv")
players

players_summary <- players |>
    group_by(experience) |>
    summarize(mean_hours_by_experience = mean(played_hours))
players_summary

players_avg_hour <- players |>
    summarize(mean_hours = mean(played_hours))
players_avg_hour

players_avg_age <- players |>
    # na.rm = TRUE skips missing ages; without it the result is NA if any Age is missing
    summarize(mean_age = mean(Age, na.rm = TRUE))
players_avg_age

### sessions.csv
- 1535 observations
- 5 variables
    - hashedEmail (character): hashed email of each player
    - start_time (character): start time of one session (dd/mm/yyyy time)
    - end_time (character): end time of one session (dd/mm/yyyy time)
    - original_start_time (double): start time of one session in Unix epoch time (ms)
    - original_end_time (double): end time of one session in Unix epoch time (ms)
- Issues:
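    - not tidy; start_time and end_time each hold a date and a time in one cell, and variable names are not informative enough (e.g., no units, and the difference between start_time and original_start_time is unclear)
- Average start time: 1.72e+12 (ms since the Unix epoch)
- Average end time: N/A (a plain mean(original_end_time) returns NA, most likely because some end times are missing)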
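
Because the session times are stored as text, they need to be parsed before they can be used in calculations. The sketch below is only an illustration: the toy rows are made up, and the exact time format ("dd/mm/yyyy HH:MM") is an assumption based on the description above.

```r
library(dplyr)
library(lubridate)

# Made-up sessions for illustration; the "dd/mm/yyyy HH:MM" format is assumed
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time = c("30/06/2024 18:24", "18/06/2024 00:01")
)

toy_sessions_parsed <- toy_sessions |>
    mutate(
        # dmy_hm() parses day/month/year hour:minute strings into date-times (UTC by default)
        start_dt = dmy_hm(start_time),
        end_dt = dmy_hm(end_time),
        # Session length in minutes
        duration_min = as.numeric(difftime(end_dt, start_dt, units = "mins"))
    )
toy_sessions_parsed

# Epoch times in milliseconds can be converted by dividing by 1000 first,
# since as_datetime() expects seconds; the reported mean falls in 2024
as_datetime(1.72e12 / 1000)
```

A duration column like this is easier to interpret than an average of raw epoch timestamps.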

In [None]:
sessions <- read_csv("data/sessions.csv")
sessions

sessions_avg_start <- sessions |>
    summarize(mean_start = mean(original_start_time))
sessions_avg_start

sessions_avg_end <- sessions |>
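    # na.rm = TRUE skips missing end times; without it the result is NA if any value is missing
    summarize(mean_end = mean(original_end_time, na.rm = TRUE))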
sessions_avg_end

## 2) Questions
Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
- Specific question: Can "played_hours" and "experience" predict "subscribe" in players.csv?

- The "played_hours" and "experience" variables from players.csv will be used as predictors in K-nearest neighbours (KNN) classification, with "subscribe" as the response
- The accuracy of the classifier will be calculated
- Wrangling steps:
    - rename variables to be more informative (e.g., subscribe to newsletter_subscription)
    - select only the variables of interest ("played_hours", "experience", "subscribe")
    - mutate experience to numerical values, since KNN needs numeric predictors to compute distances
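
The wrangling steps above could be sketched as follows. The toy tibble and the column name experience_num are made up for illustration, and the ordering of the experience levels is an assumption (it is the same order used on the x-axis of the plots in this notebook).

```r
library(dplyr)

# Made-up stand-in for players.csv, for illustration only
toy_players <- tibble(
    experience = c("Beginner", "Pro", "Regular", "Amateur", "Veteran"),
    played_hours = c(1.0, 2.5, 18.0, 6.0, 0.5),
    subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Assumed ordering of experience levels, lowest to highest
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

toy_tidy <- toy_players |>
    rename(newsletter_subscription = subscribe) |>
    select(played_hours, experience, newsletter_subscription) |>
    mutate(
        # Map each level to its position in the assumed ordering (1 to 5)
        experience_num = as.integer(factor(experience, levels = experience_levels)),
        # Classification models in tidymodels expect a factor response
        newsletter_subscription = as.factor(newsletter_subscription)
    )
toy_tidy
```

Coding the levels as 1 to 5 treats the gaps between levels as equal, which is a simplification worth noting as a limitation.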

## 3) Exploratory Data Analysis and Visualization

In [None]:
tidy_players <- players |>
    rename(newsletter_subscription = subscribe) |>
    select(played_hours, experience, newsletter_subscription)
tidy_players

In [None]:
players_graph_1 <- players_summary |>
    ggplot(aes(x = experience, y = mean_hours_by_experience)) +
    geom_col() +
    labs(x = "Experience Level", y = "Mean Play Time (hr)") +
    ggtitle("Relationship between average play time (hr) and experience level") +
    scale_x_discrete(limits = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))
players_graph_1

In [None]:
players_graph_2 <- tidy_players |>
    ggplot(aes(x = played_hours, fill = newsletter_subscription)) +
    geom_histogram(binwidth = 5, position = "fill") +
    labs(x = "Play Time (hr)", y = "Proportion", fill = "Newsletter Subscription") +
    ggtitle("Proportion of subscribed users for different play time (hr)")
players_graph_2

In [None]:
players_graph_3 <- tidy_players |>
    ggplot(aes(x = experience, fill = newsletter_subscription)) +
    geom_bar(position = "fill") +
    labs(x = "Experience", y = "Proportion", fill = "Newsletter Subscription") +
    ggtitle("Proportion of subscribed users for different experience levels") +
    scale_x_discrete(limits = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))
players_graph_3

### Insights from the above graphs
- Graph 1:
    - The mean play time is highest for the regular experience level (18.21 hr)
    - It is lowest for veterans (0.65 hr)
    - If we assume that experience increases in the order beginner, amateur, regular, veteran, pro, mean play time rises from beginner to regular and then drops sharply for veteran and pro
    - The two variables are related, but the relationship is not monotonic, so it is not strong
- Graph 2:
    - Most users are subscribed
    - No clear relationship between subscription and play time is visible
- Graph 3:
    - The proportions of subscribed and non-subscribed users are similar across experience levels, with slightly higher subscribed proportions for beginner and regular
    - As in the other graphs, there is no strong relationship between the two variables
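
To check these visual impressions numerically, the subscription rate per experience level could be tabulated. This is a minimal sketch on made-up data; the real players table would take the place of toy_players, and the summary column names are hypothetical.

```r
library(dplyr)

# Made-up players for illustration only
toy_players <- tibble(
    experience = c("Beginner", "Beginner", "Amateur", "Amateur",
                   "Pro", "Pro", "Pro", "Pro"),
    subscribe = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

subscription_by_experience <- toy_players |>
    group_by(experience) |>
    summarize(
        n_players = n(),
        # The mean of a logical column is the proportion of TRUE values
        prop_subscribed = mean(subscribe)
    )
subscription_by_experience
```

Reporting the group sizes next to the proportions helps show when a level has too few players for its proportion to be reliable.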

## 4) Methods and Plan
method: K-nearest neighbours (KNN) classification
- This method is appropriate in this situation because the response variable (subscribe) is categorical, not numerical, so a regression method would not fit.

assumptions
- 

limitations/weaknesses
- 

plan
- 
How are you going to compare and select the model?
How are you going to process the data to apply the model? For example: Are you splitting the data? How? How many splits? What proportions will you use for the splits? At what stage will you split? Will there be a validation set? Will you use cross validation?
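
One possible way to approach the questions above is sketched below; it is not a finalized plan. It assumes a 75/25 train/test split made before any modelling, 5-fold cross-validation on the training set to choose the number of neighbours, and that the kknn package is installed. The data are synthetic, and experience_num is a hypothetical numeric coding of experience.

```r
library(tidymodels)

set.seed(2024)

# Synthetic stand-in for the wrangled players data (made-up values)
toy_players <- tibble(
    played_hours = runif(100, 0, 30),
    experience_num = sample(1:5, 100, replace = TRUE),
    newsletter_subscription = factor(sample(c("TRUE", "FALSE"), 100, replace = TRUE))
)

# Split once, up front; the test set is used only for the final evaluation
players_split <- initial_split(toy_players, prop = 0.75, strata = newsletter_subscription)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize predictors because KNN is distance-based
players_recipe <- recipe(newsletter_subscription ~ played_hours + experience_num,
                         data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Cross-validation on the training set only, to compare candidate values of K
players_vfold <- vfold_cv(players_train, v = 5, strata = newsletter_subscription)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = players_vfold, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# Pick the K with the highest cross-validated accuracy
best_k <- knn_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

final_fit <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(knn_final_spec) |>
    fit(data = players_train)

# Accuracy on the held-out test set
test_accuracy <- predict(final_fit, players_test) |>
    bind_cols(players_test) |>
    metrics(truth = newsletter_subscription, estimate = .pred_class) |>
    filter(.metric == "accuracy")
test_accuracy
```

Because the synthetic response is random, the accuracy here carries no meaning; the sketch only shows the order of the steps. With the real data, the accuracy should also be compared against the majority-class baseline, since most users are subscribed.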

## 5) GitHub Repository
link to GitHub repository: 