In [None]:
#libraries
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
library(lubridate)

# Project report
### Group 25: Nelson, Will, Caroline

### Introduction

This report looks at two data sets from a Minecraft server set up by a computer science research group at the University of British Columbia. The researchers collected data about how people play games by recording players’ actions on the server. To run the project, however, the researchers need to figure out how to attract more players to their server and how to manage resources. One such method is to advertise in a game-related newsletter. Broadly, we looked at the types of players who are on the server and subscribed to the newsletter to determine which players the researchers should target with their efforts.

Here, we will try to answer the following question: 
_Can we predict whether a player will subscribe to the game’s newsletter based on their age, total hours played and average session length?_

In order to answer this question, we will use both the players data, which shows data about each of the players on the server, and sessions data, which shows data from individual sessions played by each player. Some details of the two datasets are listed below:

#### Players.csv

Rows: 196

Columns: 7

**Variables**

- experience: Categorical variable giving experience level.

- subscribe: Categorical variable reporting subscription status.

- hashedEmail: Categorical variable containing each player's hashed email.

- played_hours: double containing each player's total played hours.

- name: Categorical variable containing each player's first name.

- gender: Categorical variable containing player's gender.

- Age: double variable giving the age of each player.

**Summary Statistics**

There are 196 players on the server in total.

Of the 194 players that remain after rows with missing values are removed, 124 are male, 37 are female, and 33 identify as other or didn't state their gender.

Of the same 194 players, 35 are beginners, 35 are regulars, 63 are amateurs, 48 are veterans, and 13 are pros.

144 players are subscribed to the newsletter, while 52 players are not.

_Note: name, gender, and experience level are likely self-reported, so they may be inaccurate for some observations. Some cells have missing values._


#### Sessions.csv

Rows: 1535

Columns: 5

**Variables**

- hashedEmail: Same as players data

- start_time, end_time: Character-formatted session start and end times.

- original_start_time, original_end_time: Both doubles, containing each session’s start and end times in milliseconds as stored by the server. For some observations the two columns contain identical values, which may indicate a data-quality issue.
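
A minimal sketch of how the millisecond columns could be inspected. It assumes the values are milliseconds since the Unix epoch, which the data description does not confirm, and it uses a hypothetical toy data frame rather than values from sessions.csv.

```r
# Hypothetical toy values, not taken from sessions.csv
toy_sessions <- data.frame(
    original_start_time = c(1.71767e12, 1.71767e12, 1.71800e12),
    original_end_time   = c(1.71767e12, 1.71768e12, 1.71801e12)
)

# Assumption: values are milliseconds since 1970-01-01 UTC, so seconds = ms / 1000
toy_sessions$start_utc <- as.POSIXct(toy_sessions$original_start_time / 1000,
                                     origin = "1970-01-01", tz = "UTC")
toy_sessions$end_utc <- as.POSIXct(toy_sessions$original_end_time / 1000,
                                   origin = "1970-01-01", tz = "UTC")

# Count the rows whose stored start and end values are identical
n_identical <- sum(toy_sessions$original_start_time == toy_sessions$original_end_time)
n_identical
```

Running the same count on the real columns would show how many sessions are affected before deciding whether to rely on them.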

**Summary Statistics**

Average sessions per player: 12.26

Most sessions by one player: 310

_Note: Session counts per player appear to be very skewed due to a few heavy users._
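
To illustrate why a few heavy users matter, the sketch below compares the mean and median of a hypothetical vector of session counts. The numbers are made up for illustration and are not the real per-player counts.

```r
# Hypothetical session counts: several light users and one heavy user
num_sessions <- c(1, 1, 2, 2, 3, 3, 4, 310)

# The mean is pulled upward by the single heavy user
mean_sessions <- mean(num_sessions)

# The median reflects the typical player more closely
median_sessions <- median(num_sessions)

c(mean = mean_sessions, median = median_sessions)
```

When the two values differ this much, reporting the median alongside the mean gives a fairer picture of a typical player.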


In [None]:
# importing the data
players_data <- read_csv("data/players.csv")
sessions_data <- read_csv("data/sessions.csv")
players_data
sessions_data

In [None]:
# tidies the data and computes summary stats
# removes any rows with NA values and makes column name formats consistent

# rename hashedEmail to snake_case so the column names are consistent
sessions_tidy <- sessions_data |>
    na.omit() |>
    rename(hashed_email = hashedEmail)

sessions_tidy

players_tidy <- players_data |>
    na.omit() |>
    rename(age = Age, hashed_email = hashedEmail)

players_tidy

# computing summary statistics for players data and formatting for readability
players_means <- players_tidy |>
    summarize(
        played_hours = mean(played_hours),
        age = mean(age)
    ) |>
    pivot_longer(1:2, names_to = "variable", values_to = "mean value")
players_means

# computing the number of players of each gender
# (each summary gets its own name so earlier results are not overwritten)
gender_counts <- players_tidy |>
    group_by(gender) |>
    summarize(num_players = n())
gender_counts

# computing the number of players at each experience level
experience_counts <- players_tidy |>
    group_by(experience) |>
    summarize(num_players = n())
experience_counts

# computing summary statistics for sessions data:

# finding the average number of sessions per player
mean_sessions_per_player <- sessions_tidy |>
    group_by(hashed_email) |>
    summarize(num_sessions = n()) |>
    ungroup() |>
    summarize(mean_sessions = mean(num_sessions))
mean_sessions_per_player

# finding the most sessions by one player (first row after sorting)
sessions_per_player <- sessions_tidy |>
    group_by(hashed_email) |>
    summarize(num_sessions = n()) |>
    arrange(desc(num_sessions))
sessions_per_player


# Wrangling

In [None]:
#Wrangling

# create a table with hashedEmail and average session length in minutes
# units = "mins" is set explicitly because difftime otherwise picks its units automatically
new_sessions_data <- sessions_data |>
    mutate(session_length_mins = as.numeric(difftime(dmy_hm(end_time), dmy_hm(start_time), units = "mins"))) |>
    select(hashedEmail, session_length_mins) |>
    group_by(hashedEmail) |>
    summarize(average_session_length = mean(session_length_mins))

new_sessions_data

# adds each player's average session length to players_data by hashedEmail
# the left join keeps every player in players_data; if a player did not have any sessions, average_session_length is set to 0
joined_table <- players_data |>
    left_join(new_sessions_data, join_by(hashedEmail)) |>
    mutate(subscribe = as.factor(subscribe)) |>
    mutate(average_session_length = replace_na(average_session_length, 0))

joined_table

#check if any columns contain any NA values
colSums(is.na(joined_table))

#remove 2 rows with NA values in Age Column
final_table <- joined_table |>
    filter(!is.na(Age))

final_table

# Plots

In [None]:
options(repr.plot.height = 8, repr.plot.width = 12)

#Exploratory Plots

#Total hours played, age, and subscription status
plot1 <- final_table |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(title = "Plot 1. Player Age, Hours Played, and Subscription Status", x = "Age (years)", y = "Total Hours Played", color = "Subscribed to a game-related newsletter?")

plot1

#Average session length, age, and subscription status
plot2 <- final_table |>
    ggplot(aes(x = Age, y = average_session_length, color = subscribe)) +
    geom_point() +
    labs(title = "Plot 2. Player Age, Average Session Length, and Subscription Status", x = "Age (years)", y = "Average Session Length (mins)", color = "Subscribed to a game-related newsletter?")

plot2

#Average session length, total hours played, and subscription status
plot3 <- final_table |>
    ggplot(aes(x = played_hours, y = average_session_length, color = subscribe)) +
    geom_point() +
    labs(title = "Plot 3. Hours Played, Average Session Length, and Subscription Status", x = "Total Hours Played", y = "Average Session Length (mins)", color = "Subscribed to a game-related newsletter?")

plot3

# Methods

In [None]:
set.seed(123) 

#split data into training and testing sets
players_split <- initial_split(final_table, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

In [None]:
set.seed(123) 
#create recipe
player_recipe <- recipe(subscribe ~ played_hours + Age + average_session_length, data = players_train) |>
    step_center(all_predictors()) |>
    step_scale(all_predictors())

#create knn tune spec
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

#test the number of neighbors (1-15)
k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1))

#cross validation with 5 folds
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

#get metrics for k
players_results <- workflow() |>
    add_recipe(player_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()

players_results

# Gets best value for k from metrics table
# Arranges mean by descending order and selects first row
best_k <- players_results |>
    filter(.metric == 'accuracy') |>
    arrange(desc(mean)) |>
    slice(1) 

print('Best value of k:')
best_k

In [None]:
options(repr.plot.width = 8, repr.plot.height = 7)

accuracies <- players_results |>
    filter(.metric == "accuracy")

#plot k values vs mean accuracy
accuracy_plot <- accuracies |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Number of Neighbors", y = "Mean Accuracy", title = "Accuracy of KNN Models of Differing Neighbors") +
    theme(text = element_text(size = 18)) +
    scale_x_continuous(breaks = seq(0, 15, by = 1))

accuracy_plot


In [None]:
#create a new model with the optimal number of neighbors (k = 10, chosen by cross-validation above)
knn_optimal <- nearest_neighbor(weight_func = "rectangular", neighbors = 10) |>
    set_engine("kknn") |>
    set_mode("classification")

#create a new fit with the optimal model
players_fit <- workflow() |>
    add_recipe(player_recipe) |>
    add_model(knn_optimal) |>
    fit(data = players_train)

#predict on the test set
players_predictions <- predict(players_fit, players_test) |>
    bind_cols(players_test)

players_predictions

#create metrics for predictions
players_metrics <- players_predictions |>
    metrics(truth = subscribe, estimate = .pred_class)

players_metrics

#create confusion matrix
players_conf_mat <- players_predictions |>
    conf_mat(truth = subscribe, estimate = .pred_class)

players_conf_mat

# Calculating precision and recall

# precision = (true positive) / (true positive + false positive)
# i.e. how many of the positives predicted by the model are actually positive

# recall = (true positive) / (true positive + false negative)
# i.e. how many of the actual positives are correctly labelled as positive by the model

# event_level = 'second' treats the second factor level of subscribe as the positive class

testing_precision <- players_predictions |>
    precision(truth = subscribe, estimate = .pred_class, event_level = 'second')

testing_recall <- players_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level = 'second')

testing_accuracy <- players_metrics |>
    filter(.metric == 'accuracy') 

# Putting all metrics into a single data frame
print('Model performance on testing data') 
testing_metrics <- testing_precision |>
    bind_rows(testing_recall) |>
    bind_rows(testing_accuracy)

testing_metrics


# Results

- After tuning the classifier with 5-fold cross-validation, we found that 10 neighbours gave the best cross-validated accuracy on the training set, at 76.5%.
- Our classifier did noticeably worse on the testing set than in cross-validation: its testing accuracy was 67.3%.
- Our classifier’s precision on the testing set (72.7%) was much lower than its recall (88.9%).
- Both the recall and the precision were higher than the model’s overall accuracy on the testing set.
- Out of the 49 observations in the testing set, our classifier predicted 32 true positives, 12 false positives, 4 false negatives, and only one true negative.
- 44 players (89.8%) were predicted as subscribers, though the true number of subscribers in that set was 36 players, or 73.5% of the total.
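
As a check on the figures above, the testing metrics can be recomputed from the four confusion-matrix counts. This is a small base R sketch that uses only the counts reported in this section; the variable names are ours.

```r
# Confusion-matrix counts reported above (positive class = subscribed)
tp <- 32  # true positives
fp <- 12  # false positives
fn <- 4   # false negatives
tn <- 1   # true negatives

n_test <- tp + fp + fn + tn

# accuracy: share of all predictions that are correct
accuracy <- (tp + tn) / n_test

# precision: share of predicted subscribers who really are subscribed
precision <- tp / (tp + fp)

# recall: share of actual subscribers the model identified
recall <- tp / (tp + fn)

# express as percentages rounded to one decimal place
round(100 * c(accuracy = accuracy, precision = precision, recall = recall), 1)
```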

# Discussion

- The low precision of our classifier poses a major issue for its utility in a practical setting. Our model tends to predict that many players who are in fact not subscribed ‘should be’ subscribed to the newsletter.
- The KNN classifier is clearly biased towards the majority class, predicting subscribed more often than it should. This is expected: because the dataset is imbalanced, most points on the plots are surrounded by more subscribed players than non-subscribed ones.
- The accuracy of the optimized model (67.3%) is lower than the proportion of subscribers in the testing set (73.5%), meaning it performs worse than a model that classifies every player as subscribed. This suggests that the model has little to no predictive skill for this classification problem.
- To improve the accuracy of our classifier, the researchers should try to collect more data on players who are not subscribed to the newsletter, to balance the proportion of subscribed vs unsubscribed players. This would give the model a more representative group of points for the non-subscribed class.
- However, gathering more data of the non-subscribed class would be difficult as those who do not subscribe to gaming newsletters are probably less likely to be interested in gaming studies.
- Overall, the low accuracy of our optimized KNN model means we are unable to draw conclusions about the relationship between the predictors and subscription status. This is consistent with the scatterplot visualizations, which show no distinct ‘clusters’ of players who are subscribed or not subscribed.
- Because of the model's low accuracy, we could not identify the characteristics of the players who subscribe to the newsletter, so our model is ineffective for the researchers’ predictive purposes.
- With more prior knowledge, we would have explored alternative classification algorithms and would most likely have selected one that can account for the imbalance in subscription status within the datasets. We would then be better able to determine whether there is a relationship between our predictors and subscription status.
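
The comparison with a majority-class baseline can be made explicit with a short sketch. It uses only the testing-set counts from the Results section (49 players, 36 true subscribers, 33 correct predictions); the "always predict subscribed" baseline is a hypothetical reference model, not something we fitted.

```r
# Testing-set counts taken from the Results section
n_test <- 49        # players in the testing set
n_subscribed <- 36  # players who are actually subscribed
n_correct <- 33     # correct predictions by the KNN model (32 TP + 1 TN)

# Hypothetical baseline: predict "subscribed" for every player
baseline_accuracy <- n_subscribed / n_test

# Accuracy of the tuned KNN model on the same set
model_accuracy <- n_correct / n_test

round(100 * c(baseline = baseline_accuracy, knn = model_accuracy), 1)

# TRUE means the model does worse than always guessing the majority class
model_accuracy < baseline_accuracy
```

Checking a classifier against this kind of baseline is a quick way to judge whether its accuracy reflects real predictive skill on an imbalanced dataset.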
