### DSCI 100 FINAL PROJECT ###
# By Shamel, Mochammad, Emma, Jiaxie

### ARE AGE AND HOURS DETERMINERS OF SUBSCRIPTION STATUS?

This is the final project for the Introduction to Data Science course (DSCI 100) at UBC's Vancouver campus. For this project, each group was given data collected by a research project on a Minecraft server run by a UBC club. The data record player information as well as each player's play sessions on the research server.
The goal of this project is to explore the data and answer exploratory questions about the Minecraft research server and its players. Our group chose to explore one of the three broad questions provided: “What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?”.


The following sections walk through our process for answering this question: narrowing it to a more specific aspect, loading and wrangling the data, making visualizations, and finally reaching our conclusion.

The specific question we are trying to answer is:
#### Can we predict a player's likelihood to subscribe to the server based on age and hours played on a Minecraft server in the players.csv data set?

Hours played and age are plausible predictors for this question: more hours played suggests a higher level of commitment to the game, and age can influence how much time and money a player has available to commit to the server.

## Methods and Results

### Library Import
Here, we load the libraries used throughout the analysis:

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(themis)
options(repr.matrix.max.rows = 6)

### Loading Dataset

Here, we load the dataset using the URL obtained from the GitHub repository:

In [None]:
players_url <- "https://raw.githubusercontent.com/ShamelessRake/DSCI-100-Project-005-039/refs/heads/main/players.csv"

players <- read_csv(players_url)
head(players)

### Wrangling Data

While the data seem fairly clean, we can still benefit from removing unwanted variables, converting data types, lowercasing the A in "Age" for consistency, and removing NA values. We only need the subscribe, age, and played_hours variables:

In [None]:
sub_time_age_df <- players |>
    select(subscribe, played_hours, Age) |>
    rename("age" = Age) |>
    mutate(subscribe = fct_recode(as.factor(subscribe), "subscribed" = "TRUE", "Not subscribed" = "FALSE")) |>
    drop_na()
head(sub_time_age_df)
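
As a quick illustration of what the wrangling step does, here is a minimal sketch on a made-up tibble (`toy_players` and its values are hypothetical, not rows from players.csv). It shows that `drop_na()` removes any row with a missing value and that the recoded factor keeps "Not subscribed" as its first level, which matters later for metrics that depend on which level counts as the positive class.

```r
library(tidyverse)

# Hypothetical toy rows standing in for players.csv (not real data)
toy_players <- tibble(
    subscribe = c(TRUE, FALSE, TRUE, NA),
    played_hours = c(30.3, 0.0, 1.5, 2.0),
    Age = c(9, 21, NA, 17)
)

# Same steps as the wrangling cell: rename, recode to a factor, drop NA rows
toy_wrangled <- toy_players |>
    rename(age = Age) |>
    mutate(subscribe = fct_recode(as.factor(subscribe),
                                  "subscribed" = "TRUE",
                                  "Not subscribed" = "FALSE")) |>
    drop_na()

# Level order follows FALSE, TRUE, so "Not subscribed" comes first
levels(toy_wrangled$subscribe)
# Rows 3 and 4 each contain an NA, so only two rows remain
nrow(toy_wrangled)
```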

### Summarizing Data
It is important to know what data we're working with, so we summarize the maximum values, the minimum values, the means, and the standard deviations of our two predictors. We also count how many players are subscribed and how many aren't.

In [None]:
max_age_time <- sub_time_age_df |>
    select(played_hours, age) |>
    map_df(max, na.rm = TRUE) |>
    rename("most_hours_played" = played_hours, "oldest" = age)
# Minimum of each predictor (uses min, with names that describe minimums)
min_age_time <- sub_time_age_df |>
    select(played_hours, age) |>
    map_df(min, na.rm = TRUE) |>
    rename("fewest_hours_played" = played_hours, "youngest" = age)

mean_age_time <- sub_time_age_df |>
    select(played_hours, age) |>
    map_df(mean, na.rm = TRUE) |>
    rename("average_hours_played" = played_hours, "average_age" = age)
sd_age_time <- sub_time_age_df |>
    select(played_hours, age) |>
    map_df(sd, na.rm = TRUE) |>
    rename("sd_hours_played" = played_hours, "sd_age" = age)
num_subscribed_and_unsubscribed <- sub_time_age_df |>
    select(subscribe) |>
    group_by(subscribe) |>
    summarize(count = n())

max_age_time
min_age_time
mean_age_time
sd_age_time
num_subscribed_and_unsubscribed
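
The four summary tables above could also be produced in a single call. The following is a sketch of that alternative using `summarize()` with `across()`, run on a made-up tibble (`toy_df` and its values are hypothetical); it is not the approach used in this notebook, only a more compact equivalent.

```r
library(tidyverse)

# Hypothetical values, used only to illustrate the pattern
toy_df <- tibble(
    played_hours = c(0.5, 2.0, 10.0, 0.0),
    age = c(17, 21, 9, 30)
)

# Apply min, max, mean, and sd to both columns at once;
# .names produces columns such as min_played_hours and sd_age
toy_summary <- toy_df |>
    summarize(across(c(played_hours, age),
                     list(min = min, max = max, mean = mean, sd = sd),
                     .names = "{.fn}_{.col}"))
toy_summary
```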

### Visualizing the Data
Let's take a look at the relationship of age (x-axis), hours played (y-axis), and subscription status (coloured).

In [None]:
plot_sub_time_age <- sub_time_age_df |>
    ggplot(aes(x = age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age (Years)", y = "Time Spent Playing (Hours)", color = "Subscription Status", title = "Time Spent Playing vs Age") +
    theme(text = element_text(size = 15))
plot_sub_time_age

### Initial Observations
No clear indicator of subscription status stands out yet. Some regions of the plot lean one way, such as players below the age of 17 appearing mostly subscribed and players above the age of 30 appearing mostly not subscribed. There may still be patterns to find, but we cannot confirm one by eye alone.

### Creating the Training and Testing Split
Here, we split our data, allocating 75% of it to the training set and the other 25% to the testing set (the default proportion for `initial_split`). This is a reasonable split that leaves enough data to train our model and enough to evaluate its accuracy.

In [None]:
set.seed(4321)

players_split <- initial_split(sub_time_age_df)
players_training <- training(players_split)
players_testing <- testing(players_split)
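
The split above does not stratify by `subscribe`. As an optional variation, the sketch below shows how the `strata` argument of `initial_split` keeps the class proportions similar in the training and testing sets. It runs on made-up data (`toy_data`, with an assumed 30/70 class balance), so the numbers say nothing about our actual players.

```r
library(tidymodels)

set.seed(1)

# Hypothetical data: 30 "Not subscribed" and 70 "subscribed" players
toy_data <- tibble(
    subscribe = factor(rep(c("Not subscribed", "subscribed"), times = c(30, 70))),
    age = sample(10:40, 100, replace = TRUE),
    played_hours = round(runif(100, 0, 20), 1)
)

# Stratified 75/25 split: sampling happens within each class
toy_split <- initial_split(toy_data, prop = 0.75, strata = subscribe)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Class proportions in the training set should stay close to 30/70
toy_train |>
    count(subscribe) |>
    mutate(prop = n / sum(n))
```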

### Create and Train our K-Nearest Neighbor Classification Model

Now that we've done the preliminary work, we can train a K-Nearest Neighbor classification model that can classify new observations by learning from our existing data. To do this, we need to identify a K value that is computationally inexpensive and offers high accuracy. We therefore set up a model specification, a recipe, a range of K values to compare with cross-validation, and a workflow. In the recipe, because of the class imbalance between subscribed and not subscribed players, we oversample the rarer class to compensate. For the cross-validation, we use 5 folds.

In [None]:
set.seed(4321)

k_grid <- tibble(neighbors = 1:100)
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

players_recipe <- recipe(subscribe ~., data = players_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors()) |>
    step_upsample(subscribe, over_ratio = 1, skip = TRUE)

players_vfold <- vfold_cv(players_training, v = 5, strata = subscribe)

players_fit <- workflow() |>
    add_model(players_spec) |>
    add_recipe(players_recipe) |>
    tune_grid(resamples = players_vfold, grid = k_grid)

players_result <- players_fit |>
    collect_metrics() |>
    filter(.metric == "accuracy") |>
    select(neighbors, mean) |>
    arrange(desc(mean))

plot_accuracy <- ggplot(players_result, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "K", y = "Accuracy", title = "Accuracy vs K") +
    theme(text = element_text(size = 15))
plot_accuracy
players_result
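
The recipe scales and centers the predictors because K-nearest neighbors relies on distances, and age and hours played are measured in different units. The sketch below, on made-up points (`toy`, `q`, and `p1` to `p4` are hypothetical), shows how the nearest neighbor of a query point can change once both variables are standardized. It uses base R's `dist()` and `scale()` rather than the tidymodels recipe.

```r
# Hypothetical players: q is the query point, p1..p4 are candidates
toy <- rbind(
    q  = c(age = 20, played_hours = 10),
    p1 = c(age = 24, played_hours = 12),
    p2 = c(age = 20, played_hours = 18),
    p3 = c(age = 18, played_hours = 100),
    p4 = c(age = 22, played_hours = 0)
)

# Name of the row closest to q by Euclidean distance
nearest_to_q <- function(m) {
    d <- as.matrix(dist(m))["q", ]
    names(which.min(d[names(d) != "q"]))
}

# Raw units: an 8-hour gap outweighs a 4-year gap, so p1 is nearest
nearest_to_q(toy)
# Standardized: 8 hours is small relative to the spread in hours, so p2 is nearest
nearest_to_q(scale(toy))
```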

From what we can see, **K = 64** gives the highest accuracy for the Nearest Neighbor algorithm. The problem is that this K value is very high, so it would be computationally expensive. Additionally, accuracy dips quickly for the K values surrounding it. We can therefore look at values below K = 20, where the greatest peak is at **K = 17**. This value is relatively small, so it is computationally inexpensive, and the accuracies of its neighboring K values are also fairly high, with much less drastic drops than around 64. However, the accuracies across all K values fall within a range of about 0.1, so several K values are plausible choices. Because of this, it doesn't make much sense to use a large K when a much smaller one gives similar results. Therefore, we will use **K = 4**. This is very computationally inexpensive, and the accuracies of the K values surrounding it don't deviate much.
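
The reasoning above (prefer a small K whose accuracy is close to the best) can also be written as a rule. The sketch below picks the smallest K whose mean accuracy is within a tolerance of the maximum. The accuracies in `toy_result` and the `tolerance` value are made up for illustration; they are not our cross-validation results, and this rule is one possible formalization rather than the procedure we followed.

```r
library(tidyverse)

# Hypothetical cross-validation results (made-up accuracies)
toy_result <- tibble(
    neighbors = c(1, 4, 17, 40, 64),
    mean = c(0.52, 0.585, 0.60, 0.59, 0.61)
)

# Assumed threshold: accept any K within 0.03 of the best accuracy
tolerance <- 0.03

# Keep the K values that are "close enough", then take the smallest
chosen_k <- toy_result |>
    filter(mean >= max(mean) - tolerance) |>
    slice_min(neighbors, n = 1) |>
    pull(neighbors)
chosen_k
```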

### Creating and Testing the Classifier
Now that we've decided on K = 4 for our classifier, we use that value to fit the K-nearest neighbor model and test how accurate it is. Additionally, we compute the recall and precision, and count the false positives, false negatives, true positives, and true negatives.

In [None]:
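# Refit the KNN model with the chosen K (the argument is `neighbors`)
players_spec_known_k <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) |>
    set_engine("kknn") |>
    set_mode("classification")
players_final_fit <- workflow() |>
    add_model(players_spec_known_k) |>
    add_recipe(players_recipe) |>
    fit(players_training)
# Predict on the held-out testing set
prediction_players <- players_final_fit |>
    predict(players_testing) |>
    bind_cols(players_testing)
player_accuracy <- prediction_players |>
    metrics(truth = subscribe, estimate = .pred_class) |>
    filter(.metric == "accuracy")
prediction_players

# Assumption: "subscribed" (the second factor level) is the positive class
player_recall <- prediction_players |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "second")
player_precision <- prediction_players |>
    precision(truth = subscribe, estimate = .pred_class, event_level = "second")
# Confusion matrix: true/false positives and negatives
player_conf_mat <- prediction_players |>
    conf_mat(truth = subscribe, estimate = .pred_class)
player_accuracy
player_recall
player_precision
player_conf_mat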
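
To make the metrics concrete, here is a small worked example of how accuracy, precision, and recall follow from confusion-matrix counts. The counts are hypothetical (not our test results), and "subscribed" is assumed to be the positive class.

```r
# Hypothetical confusion-matrix counts, with "subscribed" as the positive class
toy_tp <- 30  # subscribed players predicted as subscribed
toy_fp <- 8   # not subscribed players predicted as subscribed
toy_fn <- 6   # subscribed players predicted as not subscribed
toy_tn <- 5   # not subscribed players predicted as not subscribed

# Accuracy: share of all predictions that are correct
toy_accuracy <- (toy_tp + toy_tn) / (toy_tp + toy_fp + toy_fn + toy_tn)
# Precision: share of predicted "subscribed" that really are subscribed
toy_precision <- toy_tp / (toy_tp + toy_fp)
# Recall: share of truly subscribed players that the model finds
toy_recall <- toy_tp / (toy_tp + toy_fn)

round(c(accuracy = toy_accuracy, precision = toy_precision, recall = toy_recall), 3)
```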

This aligns with existing research showing that younger individuals are more susceptible to intensive gaming behaviors, often driven by developmental and psychological factors (Gentile et al., 2011). In contrast, older adults tend to approach gaming differently, often using video games for relaxation, cognitive stimulation, or social connection rather than compulsion (De Schutter, 2011). Moreover, game design elements like subscription models or in-game content unlocks have been shown to influence user behavior significantly, especially among those more engaged or vulnerable to overuse (Kuss & Griffiths, 2012).

These prior findings suggest that age and psychological traits could indeed play a role in subscription patterns, and a deeper statistical analysis may reveal clearer insights.

### References
#### 1. Pathological video game use among youths: A two-year longitudinal study
Reference: Gentile, D. A., Choo, H., Liau, A., Sim, T., Li, D., Fung, D., & Khoo, A. (2011). Pathological video game use among youths: A two-year longitudinal study. Pediatrics, 127(2), e319–e329.
https://doi.org/10.1542/peds.2010-1353

Summary: This longitudinal study tracked children's gaming behaviors over two years and found that a subset of young gamers developed symptoms of pathological use. The research linked excessive gaming to outcomes like poorer academic performance, attention problems, and social difficulties.

#### 2. Never too old to play: The appeal of digital games to an older audience
Reference: De Schutter, B. (2011). Never too old to play: The appeal of digital games to an older audience. Games and Culture, 6(2), 155–170.
https://doi.org/10.1177/1555412010364978

Summary: This article explores why older adults play video games and finds that their motivations differ from younger players. Older gamers often value mental stimulation, relaxation, and social interaction over competitive or compulsive play.

#### 3. Video Game Addiction and Mental Health
Reference: Kuss, D. J., & Griffiths, M. D. (2012). Internet gaming addiction: A systematic review of empirical research. International Journal of Mental Health and Addiction, 10(2), 278–296.
https://doi.org/10.1007/s11469-011-9318-5

Summary: This systematic review examines empirical studies on internet gaming addiction, highlighting its association with mental health issues such as depression, anxiety, and social phobia.

#### 4. Prevalence of Video Gaming Among Older Adults

Reference: AARP Research. (2023). Why Video Games Click With People 50-Plus. 

Summary: This report reveals that approximately 45% of adults aged 50 and above engage in video gaming, indicating a significant increase over recent years.

#### 5. Therapeutic Use of Video Games in Cognitive Training

Reference: Ballesteros, S., Prieto, A., Mayas, J., Toril, P., Pita, C., de León, L. P., ... & Reales, J. M. (2014). Brain training with non-action video games enhances aspects of cognition in older adults: A randomized controlled trial. Frontiers in Aging Neuroscience, 6, 277.

Summary: This randomized controlled trial indicates that non-action video games can enhance various cognitive functions in older adults, supporting their use in cognitive training programs.
