# Predicting Subscription Status Based on Played Hours and Age

## Introduction
In this assignment, we analyze a study conducted by a research group in the Department of Computer Science at UBC, led by Frank Wood. The group set up a game like Minecraft and recorded players' actions as they navigated the world. For this project, we asked what type of player is most likely to subscribe. We used the players dataset to analyze and predict this subscription likelihood. The dataset consists of 196 observations, and its variables include experience, played_hours, gender, and Age. The experience variable describes a player's skill level. The played_hours variable measures how long a player has played the game. Demographic variables such as gender and age are also included, although age has some missing values. We use two variables, `played_hours` and `Age`, to predict subscription status in the players dataset.

## Methods & Results

### Exploratory Analysis: Data Description
A research group in Computer Science at UBC, led by Frank Wood, collected players' profiles and actions in the following format.
#### Players
The dataset contains 196 players, described by the variables below. Some gender categories, such as Agender and Other, have very few samples, which might not be enough for accurate classification based on these data alone. In addition, Age has 2 missing values (NAs), which reduces the sample size if those rows are skipped.

|**Variable Names**| **Type**  |**Unique Values**| **Min**| **Max**| **Mean**|**NAs**|
|------------------|-----------|-----------------|--------|--------|---------|-------|
|experience        |categorical|- Amateur(63)<br> - Beginner(35)<br> - Pro(14)<br> - Regular(36)<br> - Veteran(48)               |-       |-       |-        |0      |
|subscribe         |categorical|- TRUE(144)<br> - FALSE(52)       |-       |-       |-        |0      |
|hashedEmail       |String     |-                |-       |-       |-        |0      |
|played_hours      |numerical  |-                |0.000   |223.100 |5.846    |0      |
|name              |String     |-                |-       |-       |-        |0      |
|gender            |categorical|- Agender(2)<br> - Female(37)<br> - Male(124)<br> - Non-binary(15)<br> - Other(1)<br> - Prefer not to say(11)<br> - Two-Spirited(6)                 |-       |-       |-        |0      |
|Age               |numerical  |-                |8       |50      |20.52    |2      |
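
The missing-value column of the table can be checked directly in R. The following is a minimal sketch in base R; it assumes a small made-up data frame (`toy_players`, with invented values) in place of the real players data, so the counts it prints are illustrative only.

```R
# Toy stand-in for the players data (values are made up for illustration)
toy_players <- data.frame(
  played_hours = c(0.0, 1.5, 30.2, 0.1, 4.0),
  Age          = c(17, NA, 21, 19, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy_players))
na_counts

# Number of rows that remain if incomplete rows are dropped
complete_rows <- nrow(na.omit(toy_players))
complete_rows
```

Applied to the real data, the same two calls would show how many observations are lost when rows with a missing Age are skipped.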

In [None]:
# set up
library(tidyverse)
library(readr)
library(repr)

# Getting data
url_player <- "https://raw.githubusercontent.com/Lada496/self-report/main/data/players.csv"
players <- read_csv(url_player)

table(players$experience)
table(players$gender)
n_distinct(players$hashedEmail) 

### Player Summary Table
|**variable name**|**mean**|
|-----------------|--------|
| played_hours    |5.845918|
| Age             |20.52062|


In [None]:
summary(players)

### Insights from plots
Plot 1 shows that most players played for less than about 5 hours. No non-subscriber played more than 12 hours, which suggests that playing time can help identify likely subscribers. However, among players with less than 5 hours, played time alone does not separate subscribers from non-subscribers. Plot 2 shows more variation than Plot 1. Overall, older players are less likely to be subscribers. Plot 3 shows the relationship between age and playing time, grouped by subscription status. In general, younger players play more and subscribe more than older players. One concern is that a large number of points lie at 0 hours regardless of subscription status, which might lower prediction accuracy.

In [None]:
# Compute the mean value for each quantitative variable
players_quantitative <- players |> select(played_hours, Age) |>
    map_df(mean, na.rm = TRUE)

# plots
hours_hist <- ggplot(players, aes(x = played_hours, fill = subscribe)) + 
    geom_histogram()+
    labs(x = "Played Time (hours)", fill = "Subscription Status")+
    ggtitle("Plot 1: Played Time distribution with subscription status")+
    theme(plot.title = element_text(size = 18, hjust = 0.5),
            plot.margin = margin(t = 20, r = 10, b = 10, l = 30))
hours_hist
age_hist <- ggplot(players, aes(x = Age, fill = subscribe)) +
    geom_histogram()+
    labs(x = "Age", fill = "Subscription Status")+
    ggtitle("Plot 2: Age distribution with subscription status")+
    theme(plot.title = element_text(size = 18, hjust = 0.5),
            plot.margin = margin(t = 20, r = 10, b = 10, l = 10))

age_hist

players <- players |>
    mutate(subscribe=as_factor(subscribe))

players_plot <- players |>
    ggplot(aes(x = Age, y= played_hours, color = subscribe)) +
    geom_point(alpha = 0.4) + 
    labs(x = "Age", y = "played time (hours)", color = "Subscription Status") +
    ggtitle("Plot 3: The relationship between age and played hours") +
    theme(plot.title = element_text(size = 18, hjust = 0.5),
            plot.margin = margin(t = 20, r = 10, b = 10, l = 30))

players_plot

### Methods and Plan

Since the question asks whether a player subscribes, we can conduct a k-nearest neighbors (KNN) classification analysis.

#### Tuning nearest k
##### Splitting data into two sets: training and testing
To tune k, we first split the data into two sets, training data and testing data, with `initial_split`, `training`, and `testing`. The training proportion should be 75%, with `strata` set to `subscribe`. Cross-validation is then run on the training data only.
The columns should be selected before splitting the data. In this case, `played_hours`, `Age`, and `subscribe` should be chosen.

##### Create recipe
The response variable is `subscribe`, and the predictors are `Age` and `played_hours`. Since the data columns are already selected, `all_predictors()` is passed to `step_scale` and `step_center`.
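
Scaling and centering matter for KNN because distances would otherwise be dominated by the predictor with the larger range (here `played_hours` reaches 223.1 while `Age` reaches 50). As a rough illustration of what the two recipe steps amount to, the sketch below standardizes a made-up vector (`age`, invented values) by hand in base R and compares the result with base R's `scale()`. It is not the recipe itself, only the arithmetic behind it.

```R
# Made-up predictor values, for illustration only
age <- c(17, 19, 21, 23, 25)

# Standardize by hand: subtract the mean, divide by the standard deviation
age_standardized <- (age - mean(age)) / sd(age)

# Base R's scale() performs the same centering and scaling
age_scaled <- as.numeric(scale(age))

# The two approaches agree
all.equal(age_standardized, age_scaled)

# After standardization the values have mean 0 and standard deviation 1
mean(age_standardized)
sd(age_standardized)
```

Dividing by the standard deviation first and then subtracting the mean of the scaled values gives the same result, so the order of `step_scale` and `step_center` does not change the outcome.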

##### Specification with tune()
To conduct cross-validation, we'll define the model specification with `nearest_neighbor` and set `neighbors = tune()`.

##### Getting five folds
Then, we'll split the training data into five folds with `vfold_cv`, setting `strata = subscribe`.

##### Getting metrics to check accuracy
Collect metrics with `collect_metrics()` after fitting models with the code below:

```R
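# Candidate values of k (the same grid as in the Results section)
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

# tune_grid() is required because the specification uses neighbors = tune()
vfold_metrics <- workflow() |>
                  add_recipe(players_recipe) |>
                  add_model(players_spec) |>
                  tune_grid(resamples = players_vfold, grid = k_vals) |>
                  collect_metrics()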
```
Then, plot accuracy against k to find the best k. The code below also pulls the best k directly.
```R
accuracies <- vfold_metrics |>
  filter(.metric == "accuracy")

best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
best_k
```
##### Fit with the best k and evaluate with a confusion matrix
Finally, we'll obtain the confusion matrix with `conf_mat` and compute accuracy, precision, and recall to evaluate the model.
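
To make the three evaluation measures concrete, here is a minimal base R sketch that derives them from the cells of a confusion matrix. The `truth` and `predicted` vectors are hypothetical labels invented for this example, not results from this analysis, and the sketch assumes that `TRUE` (subscribed) is treated as the positive class.

```R
# Hypothetical test-set labels and predictions (not results from this analysis)
truth     <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)

# Confusion matrix: rows are predictions, columns are true labels
conf <- table(predicted = predicted, truth = truth)
conf

# Cell counts, assuming TRUE (subscribed) is the positive class
tp <- sum(predicted & truth)     # true positives
fp <- sum(predicted & !truth)    # false positives
fn <- sum(!predicted & truth)    # false negatives
tn <- sum(!predicted & !truth)   # true negatives

accuracy  <- (tp + tn) / length(truth)  # share of correct predictions
precision <- tp / (tp + fp)             # share of predicted subscribers who subscribed
recall    <- tp / (tp + fn)             # share of actual subscribers who were found

c(accuracy = accuracy, precision = precision, recall = recall)
```

In `tidymodels`, the `yardstick` functions `precision()` and `recall()` compute the same quantities; their `event_level` argument controls which factor level counts as the positive class.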

### Results

#### Choosing the best k
We used `tidymodels` to conduct the classification analysis. First, we selected three columns: the response `subscribe` and the predictors `played_hours` and `Age`. Then we split `players` into two sets: `players_train` and `players_test`.

In [None]:
players <- players |> select(subscribe, played_hours, Age)

In [None]:
library(tidymodels)
players_split <- initial_split(players, prop = 0.75, strata = subscribe)  
players_train <- training(players_split)   
players_test <- testing(players_split)

nrow(players_train)

Then, we created `players_recipe` and `players_spec` with the following code.

In [None]:
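# Impute missing Age values here as well, so tuning uses the same preprocessing as the final model
players_recipe <- recipe(subscribe ~ Age + played_hours , data = players_train) |>
    step_impute_median(all_predictors()) |>
    step_scale(all_predictors()) |> 
    step_center(all_predictors())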

players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
       set_engine("kknn") |>
       set_mode("classification")

To perform 5-fold cross-validation in R with tidymodels, we used `vfold_cv`.

In [None]:
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

Then, we obtained the metrics needed to find the best k with the following code.

In [None]:
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

vfold_metrics <- workflow() |>
                  add_recipe(players_recipe) |>
                  add_model(players_spec) |>
                  tune_grid(resamples = players_vfold, grid = k_vals) |>
                  collect_metrics()

We filtered `vfold_metrics` to get `accuracy` with the code below.

In [None]:
accuracies <- vfold_metrics |>
  filter(.metric == "accuracy")

As the plot below shows, k = 5 and k = 6 have the same accuracy; we chose k = 5 for the rest of the analysis.

In [None]:
accuracy_versus_k  <- ggplot(accuracies, aes(x = neighbors, y = mean))+
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate") +
      scale_x_continuous(breaks = seq(0, 14, by = 1)) +  
      scale_y_continuous(limits = c(0.4, 1.0)) 
accuracy_versus_k

In [None]:
best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1) |>
        pull(neighbors)
best_k
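
When two values of k tie on accuracy, which one is returned depends on how the rows are ordered. One way to make the choice explicit is to sort by accuracy first and by k second, so the smaller k wins a tie. The sketch below shows this in base R on a made-up table (`toy_accuracies`, with invented accuracy values); preferring the smaller k is an assumption of this example, not a rule from the analysis.

```R
# Made-up cross-validation accuracies with a tie at k = 5 and k = 6
toy_accuracies <- data.frame(
  neighbors = 1:8,
  mean      = c(0.60, 0.62, 0.66, 0.70, 0.73, 0.73, 0.71, 0.69)
)

# Sort by accuracy (descending), then by k (ascending) to break ties
ranked <- toy_accuracies[order(-toy_accuracies$mean, toy_accuracies$neighbors), ]

# The first row is the best k, with the smaller k preferred on a tie
toy_best_k <- ranked$neighbors[1]
toy_best_k
```

With `dplyr`, the same ordering can be written as `arrange(desc(mean), neighbors)`.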

#### Classification analysis with k = 5
Then, we repeated the same process with k = 5.

In [None]:
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
       set_engine("kknn") |>
       set_mode("classification")

players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
    step_impute_median(all_predictors()) |>
    step_scale(all_predictors()) |> 
    step_center(all_predictors())

players_fit_best <- workflow() |>
                  add_recipe(players_recipe) |>
                  add_model(players_spec) |>
                  fit(players_train)

Then, we predicted the labels of the test set with the new model.

In [None]:
player_predictions <- predict(players_fit_best, players_test) |>
      bind_cols(players_test)

Then, we obtained the accuracy and confusion matrix with the following code.

In [None]:
player_prediction_accuracy <- player_predictions |>
        metrics(truth = subscribe, estimate = .pred_class)  

player_prediction_accuracy

player_mat <- player_predictions |> 
      conf_mat(truth = subscribe, estimate = .pred_class)
player_mat
