In [None]:
library(tidyverse)
library(cowplot)
library(RColorBrewer)

# Data Description 
## Observations and Variables
The tables below show the following.

The players dataset has **196** observations and **7** variables:
- **experience** `<chr>`: The experience level of the player.
- **subscribe** `<lgl>`: Whether the player is subscribed, stored as a logical (boolean) value.
- **hashedEmail** `<chr>`: The hashed email of the player.
- **played_hours** `<dbl>`: The number of hours played, stored as a double.
- **name** `<chr>`: The name of the player.
- **gender** `<chr>`: The gender of the player.
- **Age** `<dbl>`: The age of the player.

The sessions dataset has **1535** observations and **5** variables:
- **hashedEmail** `<chr>`: The hashed email of the player.
- **start_time** `<chr>`: The start time of the session.
- **end_time** `<chr>`: The end time of the session.
- **original_start_time** `<dbl>`: The time passed since the start of the project when the session started.
- **original_end_time** `<dbl>`: The time passed since the start of the project when the session ended.
 
## Summary Statistics
In the **players** dataset we can see that:
- **Mean age**: 20.52062
- **Max age**: 50
- **Min age**: 8
- **Mean played hours**: 5.845918
- **Max played hours**: 223.1
- **Min played hours**: 0

## Issues
- start_time and end_time in the sessions dataset combine the date (year, month, day) and the time of day in a single variable.
- original_start_time and original_end_time are recorded in an unknown, very small unit.
- The variable names Age and hashedEmail do not follow the same naming format as the other variables in the players dataset.
- There might be NA values in the dataset.
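
The last issue can be checked directly by counting missing values per column. The sketch below is one way to do it; `toy_players` is a made-up stand-in table (its values are not from the real data), and the same `summarise()` call could be applied to the real players table.

```r
library(dplyr)

# Hypothetical stand-in for the players table; the values are invented for illustration.
toy_players <- tibble(
  played_hours = c(0, 1.5, 30.3, 0.2),
  Age = c(17, NA, 21, NA)
)

# Count the missing values in every column.
na_counts <- toy_players |>
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Columns with a non-zero count need either `na.rm = TRUE` in summary functions or an explicit decision about dropping or imputing those rows.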

In [None]:
players <- read_csv("https://raw.githubusercontent.com/AndyHuang36888/GroupProjectDSCI100/refs/heads/main/players.csv")
sessions <- read_csv("data/sessions.csv")


# Tidy data: give every column a consistent snake_case name
colnames(players) <- c("experience", "subscribe", "hashed_email", "played_hours", "name", "gender", "age")
colnames(sessions) <- c("hashed_email", "start_time", "end_time", "original_start_time", "original_end_time")

# Keep only the numeric variables that are summarised
players_stats <- players |>
                 select(age, played_hours)

# Mean, max, and min of each column, ignoring NA values
players_mean <- map_df(players_stats, mean, na.rm = TRUE)
players_max <- map_df(players_stats, max, na.rm = TRUE)
players_min <- map_df(players_stats, min, na.rm = TRUE)

# Combine the three one-row tables and label the columns
players_summary <- bind_cols(players_mean, players_max) |>
                   bind_cols(players_min)
colnames(players_summary) <- c("mean_age", "mean_played_hours", "max_age", "max_played_hours", "min_age", "min_played_hours")
players_summary
head(players)
head(sessions)

# Questions
Can the variables played_hours, age, and experience predict which players in the players dataset have subscribed to the game newsletter?

A player's hours played, age, and level of experience could be predictive of the player's tendency to subscribe to a game's newsletter. I can use K-nearest-neighbours to build a model that tries to predict whether a player is subscribed to the newsletter from these variables.
Among these variables, experience is stored as a string, so I will need to convert it to a numeric value if I want to use it as a predictor.
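
One possible way to do that conversion is to treat experience as an ordered factor and take its integer codes. The sketch below assumes the level order used in the plotting code of this notebook (Beginner < Amateur < Regular < Veteran < Pro); `toy_experience` is a made-up example vector.

```r
# Assumed ordering of the experience levels (same order as in the plotting code).
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical example values, not taken from the real data.
toy_experience <- c("Pro", "Beginner", "Regular", "Amateur")

# Map each label to its position in the assumed ordering (Beginner = 1, ..., Pro = 5).
experience_num <- as.integer(factor(toy_experience, levels = experience_levels))

experience_num
```

Note that integer coding treats the gaps between neighbouring levels as equal, which is itself an assumption about the data.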

# Visualization
Below are some visualizations of the data and the relationship of selected predictors.

In [None]:
options(repr.plot.width = 20, repr.plot.height = 8) 

players_point <- players |>
                 mutate(experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))) |>
                 ggplot(aes(x = played_hours, y = age, color = experience)) +
                 geom_point() +
                 # xlim() drops players with more than 15 played hours from this plot
                 xlim(0, 15) +
                 # the points are mapped to color, so the color scale and legend are set here
                 scale_color_brewer(palette = "Set1") +
                 labs(title = "Age vs Played Hours",
                      x = "Played (hours)",
                      y = "Age (years)",
                      color = "Experience")


# Number of players at each experience level, split by subscription status
players_experience <- players |>
                 mutate(experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro"))) |>
                 ggplot(aes(x = experience, fill = subscribe)) +
                 geom_bar() +
                 scale_fill_brewer(palette = "Set1") +
                 labs(title = "Subscription by Experience",
                      x = "Experience",
                      y = "Number of players",
                      fill = "Subscribe")

# Age distribution in 5-year bins, split by subscription status
players_age <-   players |>
                 ggplot(aes(x = age, fill = subscribe)) +
                 geom_histogram(binwidth = 5) +
                 scale_fill_brewer(palette = "Set1") +
                 labs(title = "Subscription by Age",
                      x = "Age (years)",
                      y = "Number of players",
                      fill = "Subscribe")

# Played-hours distribution, split by subscription status
players_hours <- players |>
                 ggplot(aes(x = played_hours, fill = subscribe)) +
                 geom_histogram() +
                 # xlim() and ylim() drop data outside these ranges from the plot
                 xlim(0, 20) +
                 ylim(0, 30) +
                 scale_fill_brewer(palette = "Set1") +
                 labs(title = "Distribution of Played Hours",
                      x = "Played (hours)",
                      y = "Number of players",
                      fill = "Subscribe")
players_point
plot_grid(players_experience, players_age, players_hours, ncol = 3)


From the graphs, we can see that few players play more than 10 hours. This leaves a large empty gap in the last graph. We can also see that the majority of the players are between 10 and 25 years old.

There are also many more amateur players than pro players, which could skew the data during classification. There are also more subscribed players than unsubscribed players, which could bias the model toward predicting that a player is subscribed.
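
The class imbalance can be quantified by computing the share of each class. The sketch below uses a made-up stand-in table, `toy_players`; the same pipeline could be run on the real players table.

```r
library(dplyr)

# Hypothetical stand-in for the players table; the values are invented for illustration.
toy_players <- tibble(subscribe = c(TRUE, TRUE, TRUE, FALSE))

# Count the players in each class and convert the counts to proportions.
subscribe_share <- toy_players |>
  count(subscribe) |>
  mutate(proportion = n / sum(n))

subscribe_share
```

The proportion of the larger class is the accuracy a classifier would reach by always predicting that class, so it is a useful baseline when judging the model.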

# Methods and Plan
I can use k-nearest-neighbours classification to predict whether a player is subscribed, since subscribe is a categorical variable.
This assumes that players in the same class have similar stats and appear in the same general area of the graph.
One potential issue is that I might not be using enough predictors, or that the chosen predictors are not related to the player's subscription status.


The data will be split into two sets, with 75% as the training set and 25% as the testing set. The training data will then be split into 5 folds, so that the model can be tuned with cross-validation.
I am going to compare the metrics of models with different k values and select the one with the best accuracy.
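
The plan above could be sketched as follows. This is only one possible implementation, and several things in it are assumptions rather than settled choices: it uses the tidymodels and kknn packages (not loaded elsewhere in this notebook), synthetic stand-in data in `toy_players`, an arbitrary seed, an arbitrary grid of k values, stratified splitting, and standardized predictors.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, only so the sketch is reproducible

# Synthetic stand-in data (assumption); replace with the tidied players table.
toy_players <- tibble(
  age = round(runif(200, 8, 50)),
  played_hours = rexp(200, rate = 0.2),
  subscribe = factor(sample(c("TRUE", "FALSE"), 200, replace = TRUE, prob = c(0.7, 0.3)))
)

# 75% training / 25% testing, stratified by the response (stratifying is an assumption).
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize the predictors, since K-NN is based on distances (assumption).
knn_recipe <- recipe(subscribe ~ age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN classifier with the number of neighbours left to be tuned.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set only.
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# Candidate k values (arbitrary grid chosen for illustration).
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

# Estimate the cross-validated accuracy for every candidate k.
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid, metrics = metric_set(accuracy)) |>
  collect_metrics()

# Pick the k with the highest mean accuracy across the folds.
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

The testing set is left untouched here; it would only be used once, to evaluate the final model refit with `best_k`.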
