# Predicting Usage of a Video Game Research Server (Project Planning)

## Data Description
A research group in Computer Science at UBC, led by Frank Wood, collected players' profiles and actions in the following format.
### Players
The players dataset contains 196 players with the variables below. Some gender categories, such as Agender or Other, have very few observations, which might not be enough for accurate classification based on these data alone. In addition, Age has 2 missing values (NAs); skipping those rows slightly reduces the sample size.

|**Variable Names**| **Type**  |**Unique Values**| **Min**| **Max**| **Mean**|**NAs**|
|------------------|-----------|-----------------|--------|--------|---------|-------|
|experience        |categorical|- Amateur (63)<br>- Beginner (35)<br>- Pro (14)<br>- Regular (36)<br>- Veteran (48)|-       |-       |-        |0      |
|subscribe         |categorical|- TRUE (144)<br>- FALSE (52)|-       |-       |-        |0      |
|hashedEmail       |String     |-                |-       |-       |-        |0      |
|played_hours      |numerical  |-                |0.000   |223.100 |5.846    |0      |
|name              |String     |-                |-       |-       |-        |0      |
|gender            |categorical|- Agender (2)<br>- Female (37)<br>- Male (124)<br>- Non-binary (15)<br>- Other (1)<br>- Prefer not to say (11)<br>- Two-Spirited (6)|-       |-       |-        |0      |
|Age               |numerical  |-                |8       |50      |20.52    |2      |

### Sessions
The sessions dataset records 1535 sessions, each linked to a player by hashedEmail, with the variables below. The 2 NAs in original_end_time should be skipped if necessary.

|**Variable Names**| **Type**  |**Unique Values**| **Min**| **Max**| **Mean**|**NAs**|
|------------------|-----------|-----------------|--------|--------|---------|-------|
|hashedEmail       |String     | 125 different hashedEmails recorded | - | - | - | - |
|start_time        |String     |-                |-       |-       |-        |-      |
|end_time          |String     |-                |-       |-       |-        |-      |
|original_start_time|numerical |-                |1.712e+12|1.727e+12|1.719e+12|-    |
|original_end_time |numerical  |-                |1.712e+12|1.727e+12|1.719e+12|2    |



## Questions
I chose **question 1**: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? More specifically, the question is "**Can played hours and age predict the subscription status in the players dataset?**" The classification analysis will use two explanatory variables, `Age` and `played_hours`, and the response variable `subscribe`. The data will be reduced to the three columns used in this analysis, and rows with missing values (NAs) will be removed.
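The preparation step described above could look like the following minimal sketch. To keep it runnable on its own, it uses a small made-up tibble (`toy_players`, hypothetical values, not the real data) with the same column names as `players`; with the real data, the same pipeline would be applied to `players`. The name `players_model` is also an assumption.

```R
library(tidyverse)

# Hypothetical stand-in for `players` (made-up values, same column names)
toy_players <- tibble(
  experience   = c("Pro", "Amateur", "Veteran", "Regular"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0.0, 1.2, 0.5),
  Age          = c(9, 21, NA, 17)
)

players_model <- toy_players |>
  select(subscribe, played_hours, Age) |>   # keep only the three columns used
  drop_na() |>                              # drop rows with a missing value (here, Age)
  mutate(subscribe = as_factor(subscribe))  # classification needs a factor response

players_model
```

Dropping rows with `drop_na()` is the simplest option given that only 2 ages are missing; imputation would be an alternative if more values were absent.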

## Exploratory Data Analysis and Visualization

### Mean tables
|**variable name**|**mean**|
|-----------------|--------|
| played_hours    |5.845918|
| Age             |20.52062|

### Insights from plots
Plot 1 shows that most players played fewer than about 5 hours. No non-subscriber played more than 12 hours, which suggests that a long playing time points to a subscriber. However, among players with fewer than 5 hours, playing time alone does not separate subscribers from non-subscribers. Plot 2 shows more variation between the two groups than Plot 1: overall, the older a player is, the less likely they are to be a subscriber. Plot 3 shows the relationship between age and playing time, grouped by subscription status. In general, younger players play more and subscribe more than older players. One concern is that a large number of points lie at 0 hours regardless of subscription status, which might lower prediction accuracy.


In [None]:
# set up
library(tidyverse)
library(readr)
library(repr)
#options(repr.matrix.max.rows = 6)

In [None]:
# Data load
url_player <- "https://raw.githubusercontent.com/Lada496/self-report/main/data/players.csv"
url_sessions <- "https://raw.githubusercontent.com/Lada496/self-report/main/data/sessions.csv"
players <- read_csv(url_player)
sessions <- read_csv(url_sessions)

# Compute the mean value for each quantitative variable, ignoring NAs
players_quantitative <- players |>
    summarise(across(c(played_hours, Age), \(x) mean(x, na.rm = TRUE)))
players_quantitative

# plots
hours_hist <- ggplot(players, aes(x = played_hours, fill = subscribe)) + 
    geom_histogram()+
    labs(x = "Played Time (hours)", fill = "Subscription Status")+
    ggtitle("Plot 1: Played Time distribution with subscription status")+
    theme(plot.title = element_text(size = 18, hjust = 0.5),
            plot.margin = margin(t = 20, r = 10, b = 10, l = 30))
hours_hist
age_hist <- ggplot(players, aes(x = Age, fill = subscribe)) +
    geom_histogram()+
    labs(x = "Age", fill = "Subscription Status")+
    ggtitle("Plot 2: Age distribution with subscription status")+
    theme(plot.title = element_text(size = 18, hjust = 0.5),
            plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

age_hist

players <- players |>
    mutate(subscribe=as_factor(subscribe))

players_plot <- players |>
    ggplot(aes(x = Age, y= played_hours, color = subscribe)) +
    geom_point(alpha = 0.4) + 
    labs(x = "Age", y = "Played Time (hours)", color = "Subscription Status") +
    ggtitle("Plot 3: The relationship between age and played hours") +
    theme(plot.title = element_text(size = 18, hjust = 0.5),
            plot.margin = margin(t = 20, r = 10, b = 10, l = 30))

players_plot

## Methods and Plan

Since the question asks whether a player subscribes to the game-related newsletter, we can conduct a k-nearest neighbors (KNN) classification analysis.

### Tuning the number of neighbors k
#### Splitting data into two sets: training and testing
Before tuning k, we split the data into two sets, training data and testing data, with `initial_split`, `training` and `testing`. The training proportion should be 75%, and `strata` is set to `subscribe`. Cross-validation for tuning k is then run on the training data only.
The data columns should be selected before splitting the data. In this case, `played_hours`, `Age`, and `subscribe` should be chosen.
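A minimal sketch of the split, assuming the tidymodels packages are installed. The tibble `toy_players` is a made-up stand-in for the prepared players data (hypothetical values), the seed is arbitrary, and the object names `players_split`, `players_train` and `players_test` are assumptions.

```R
library(tidymodels)

set.seed(1)  # arbitrary seed (assumption) so the split is reproducible

# Hypothetical stand-in for the prepared players data (made-up values)
toy_players <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(30, 10))),
  played_hours = round(rexp(40, rate = 0.2), 1),
  Age          = sample(10:50, 40, replace = TRUE)
)

# 75% training / 25% testing, stratified so both sets keep a similar class mix
players_split <- initial_split(toy_players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

nrow(players_train)
nrow(players_test)
```

Stratifying on `subscribe` matters here because the classes are imbalanced; an unstratified split could leave the testing set with very few non-subscribers.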

#### Create recipe
The response variable is `subscribe`, and the predictors are `Age` and `played_hours`. Since the data columns are already selected, `all_predictors()` is passed to `step_scale` and `step_center`, so both predictors are standardized before distances are computed.
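The recipe might be written as in the sketch below. `toy_train` is a made-up stand-in for the training set (hypothetical values), and `players_recipe` is the name used later in this plan. The `prep()` and `bake()` calls are only there to inspect the result; they are not needed when the recipe is used inside a workflow.

```R
library(tidymodels)

# Hypothetical stand-in for the training set (made-up values)
toy_train <- tibble(
  subscribe    = factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE")),
  played_hours = c(0.0, 1.5, 0.2, 30.0, 0.0, 4.8),
  Age          = c(17, 21, 35, 12, 28, 19)
)

# subscribe is the response; every other column is a predictor
players_recipe <- recipe(subscribe ~ ., data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# prep() estimates the scaling constants from the training data;
# bake(new_data = NULL) returns the processed training data
standardized <- players_recipe |>
  prep() |>
  bake(new_data = NULL)
standardized
```

Standardizing matters for KNN because `played_hours` (0 to about 223) and `Age` (8 to 50) are on different scales, and the unscaled variable with the larger range would dominate the distance.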

#### Model specification with tune()
To tune k with cross-validation, we'll define the model specification with `nearest_neighbor` and set `neighbors = tune()`.
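A possible specification is sketched below. The `"kknn"` engine and `weight_func = "rectangular"` (every neighbor gets an equal vote) are assumptions; the plan itself only fixes `nearest_neighbor` and `tune()`.

```R
library(tidymodels)

# neighbors = tune() marks k as a value to be chosen during tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>          # assumed engine; requires the kknn package when fitting
  set_mode("classification")

knn_spec
```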

#### Getting five folds
Then, we'll split the training data into five folds with `vfold_cv`, setting `strata = subscribe`.
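The folds could be created as in this sketch. `toy_train` is again a made-up stand-in for the training set (hypothetical values), and the seed is arbitrary; `players_vfold` is the name used in the tuning code of this plan.

```R
library(tidymodels)

set.seed(1)  # arbitrary seed (assumption) so the folds are reproducible

# Hypothetical stand-in for the training set (made-up values)
toy_train <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(20, 10))),
  played_hours = round(rexp(30, rate = 0.2), 1),
  Age          = sample(10:50, 30, replace = TRUE)
)

# Five folds, stratified by the response so each fold has a similar class mix
players_vfold <- vfold_cv(toy_train, v = 5, strata = subscribe)
players_vfold
```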

#### Getting metrics to check accuracy
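Fit the models over a grid of candidate k values with `tune_grid` (a specification that contains `tune()` cannot be fitted with `fit_resamples`), then collect the metrics with `collect_metrics()`:

```R
# Candidate k values; the range 1 to 10 is an assumption and can be adjusted
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

vfold_metrics <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_vals) |>
  collect_metrics()
```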
Then, plot accuracy against k to find the best k. The best k can also be pulled out in code:
```R
accuracies <- vfold_metrics |>
  filter(.metric == "accuracy")

best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
best_k
```
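The accuracy-versus-k plot could be drawn with a small helper like the one below. `plot_accuracy_vs_k` is a hypothetical name, not part of any package; it assumes its input has the columns `neighbors` and `mean`, as in the `accuracies` tibble above.

```R
library(ggplot2)

# Hypothetical helper: expects a data frame with columns `neighbors` and `mean`
# (the layout of `accuracies` after filtering collect_metrics() output)
plot_accuracy_vs_k <- function(accuracies) {
  ggplot(accuracies, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors (k)", y = "Mean cross-validation accuracy")
}
```

Calling `plot_accuracy_vs_k(accuracies)` would then show whether the highest accuracy is a clear peak or one of several nearly equal values, which is worth checking before committing to `best_k`.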




We will conduct k-nearest neighbors classification because the question is a binary classification problem: the response variable has two classes, TRUE or FALSE subscription status. For this analysis, we'll split the data into 75% training data and 25% testing data and find the best k with cross-validation. As the players table in the Data Description section shows, the classes are imbalanced (144 TRUE vs. 52 FALSE), so this classifier may not perform well.

## Appendices (Code)

In [None]:
url_player <- "https://raw.githubusercontent.com/Lada496/self-report/main/data/players.csv"
url_sessions <- "https://raw.githubusercontent.com/Lada496/self-report/main/data/sessions.csv"
players <- read_csv(url_player)
sessions <- read_csv(url_sessions)

table(players$experience)
table(players$gender)
n_distinct(players$hashedEmail) 
summary(players)
summary(sessions)
head(sessions)
n_distinct(sessions$hashedEmail) 