# DSCI100 FINAL PROJECT

# Introduction
Name: Isabella Lin

**Background:**
- Minecraft, released in 2011 by the Swedish developer Mojang Studios, is a sandbox game that has become iconic among young audiences. As of 2025, it ranks as the third most popular video game worldwide (Wikipedia contributors, 2025).

- This study utilizes data collected by a research group in the UBC Computer Science department, led by Frank Wood. The team operated a dedicated Minecraft server and recorded detailed information on player activity as users interacted within the game environment. The dataset includes variables related to player skill levels, demographics, and gameplay sessions.

  
**Questions:**
- Broad Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
- Specific question: Can a player's age and hours played predict whether they will subscribe to the game-related newsletter?



# Data Description of players.csv 

To answer the question I used `players.csv`, which contains one row per unique player and includes all the variables needed for the analysis.

- Number of variables: 7
- Number of observations: 195


**ISSUES:**

- The `Age` variable is skewed toward younger players, with most ages clustered around 20. This may indicate a potential bias in the dataset.
- The `played_hours` variable contains many entries with zero hours and very small values. Converting this variable from hours to minutes could provide more meaningful insights in future analyses.
- There are numerous outliers in the `played_hours` data.
- Session counts are imbalanced: some players have participated in many sessions, while others have very few or none.
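
The hours-to-minutes conversion suggested above could look like the following minimal sketch. The tibble `players_demo` and its values are invented for illustration; only the column name `played_hours` comes from the dataset, and `played_minutes` and `never_played` are hypothetical names.

```r
library(dplyr)

# Hypothetical stand-in for players.csv; these values are invented
players_demo <- tibble(played_hours = c(0, 0, 0.1, 0.5, 2.3, 48.4))

players_minutes <- players_demo |>
    mutate(played_minutes = played_hours * 60,   # convert hours to minutes
           never_played = played_hours == 0)     # flag players with no recorded playtime

# Share of players with zero hours
zero_share <- mean(players_minutes$never_played)
players_minutes
```

Flagging the zero-hour players separately keeps them visible even when the playtime values themselves are later transformed.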

**Summary Table of `players.csv`**


| variable          | type      | meaning                                                                                                                        | # of missing observations | summary statistics (if applicable)                                             |
|-------------------|-----------|--------------------------------------------------------------------------------------------------------------------------------|--------------------------|--------------------------------------------------------------------------------|
| experience        | character | The level of experience of the player (Beginner (least experienced), Amateur, Regular, Veteran, Pro (most experienced))         | 0                        | N/A                                                                            |
| subscribe         | logical   | Whether the player is subscribed to a game-related newsletter (TRUE or FALSE)                                                  | 0                        | N/A                                                                            |
| hashedEmail       | character | The player's email address, hashed for privacy                                                                                 | 0                        | N/A                                                                            |
| played_hours      | double    | The number of hours played by the player (hours)                                                                               | 0                        | Max = 223.1, Min = 0, Mean = 5.845918, SD = 28.35734                           |
| name              | character | The player's name                                                                                                              | 0                        | N/A                                                                            |
| gender            | character | The player's gender (male, female, non-binary, agender, two-spirited, prefer not to say, other)                                | 0                        | N/A                                                                            |
| Age               | double    | The player's age (years)                                                                                                       | 2                        | Max = 50, Min = 8, Mean = 20.52062, Median = 19, Mode = 17, SD = 6.174667      |

# Pre-Processing Data and Exploratory Visualization

In [None]:
#1 load the relevant R packages:
library(tidyverse)
library(tidymodels)
library(repr)
library(themis)
library(cowplot)

In [None]:
#2 loads data
minecraft_full<- read_csv("https://raw.githubusercontent.com/Isabella-dsci100/Final-Project-DSCI100/refs/heads/main/players.csv")
head(minecraft_full)

In [None]:
#3 wrangles and cleans the data to the format necessary for the planned analysis
minecraft <- minecraft_full |>
             filter(!is.na(Age)) |>                   # drop the players with a missing age
             mutate(subscribe = as.factor(subscribe)) # the class label must be a factor

head(minecraft)

In [None]:
#4 Summary statistics for the numerical variables
players_average <- minecraft |>
                   summarize(min_played_hours = min(played_hours),
                             max_played_hours = max(played_hours),
                             average_played_hours = mean(played_hours),
                             min_age = min(Age),
                             max_age = max(Age),
                             median_age = median(Age),
                             # most frequent age; the column is referenced by name, not by position
                             mode_age = as.numeric(names(which.max(table(Age)))),
                             average_age = mean(Age))
players_average

**Statistics Interpretation**
- The maximum value for `played_hours` (223.1) is much higher than the average value (5.90), suggesting the presence of outliers in playtime.
- The mode of `Age` is 17, likely because the Minecraft server automatically assigns 17 as the default age for each player. If so, this default setting would inflate the mode.
- The average age is about 20, which is plausible if most research participants are university students.

In [None]:
#5 splitting into training and testing datasets
set.seed(1)
players_split <- initial_split(minecraft, prop = 0.75, strata = subscribe)

players_training <- training(players_split)
players_testing <- testing(players_split)

In [None]:
#6 create visualizations of the dataset for exploratory data analysis related to the planned analysis
# exploring the relationship between each predictor and the response variable,
# i.e. Age and played hours as predictors and subscription status as the response
options(repr.plot.width = 8, repr.plot.height = 8)
Age_boxplot <- ggplot(players_training, aes(x = subscribe, y = Age, fill = subscribe)) +
               geom_boxplot() +
               labs(x= "Subscription Status", y= "Age of Player", fill = "Subscription\nStatus", title = "Figure 1: Age vs. Subscription Status")+
               theme(text = element_text(size = 15))
Age_boxplot

options(repr.plot.width = 8, repr.plot.height = 8)
played_hours_boxplot <- ggplot(players_training, aes(x = subscribe, y = played_hours, fill = subscribe)) +
                        geom_boxplot() +
                        scale_y_log10(labels = label_comma()) +  # log scale: players with 0 hours are dropped (log10(0) = -Inf)
                        labs(x = "Subscription Status", y = "Game Played Hours", fill = "Subscription\nStatus", title = "Figure 2: Played Hours vs. Subscription Status")+
                        theme(text = element_text(size = 15))
played_hours_boxplot

**Plot Descriptions:**

Figure 1.
- This plot indicates that players who subscribe to game-related newsletters are generally younger, with a median age close to 17, while non-subscribers have a higher median age of about 22. Both groups display a similar spread of ages. Notably, the non-subscriber group has four distinct outliers with much higher ages. This pattern suggests that younger players are more inclined to subscribe to game-related newsletters.

Figure 2.
- This graph suggests a potential association between played hours and subscription status. The subscribed group exhibits a higher median number of played hours compared to the unsubscribed group. Additionally, the subscribed group displays greater variability in played hours, including a notably high outlier.

# Methods and Plan

***Method: KNN Classification***

**Why Choose This Method?**

Classification is appropriate for this predictive task because it involves using the variables `played_hours` and `Age` to predict the category `subscribe` for new samples. The K-Nearest Neighbors (KNN) algorithm is well-suited here since it does not require strict assumptions about the data distribution or shape. Instead, it classifies new samples based on their proximity to nearby data points.
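
To make the idea of classifying by proximity concrete, here is a minimal base R sketch of the KNN rule. It is not the `kknn` implementation used in the analysis: the data frame `train`, its values, and the helper `knn_predict()` are hypothetical, and the predictors are left unstandardized to keep the example short.

```r
# Hypothetical training points; these values are invented for illustration
train <- data.frame(
    Age          = c(17, 18, 21, 25, 30, 16),
    played_hours = c(5.0, 3.0, 0.1, 0.0, 0.2, 8.0),
    subscribe    = factor(c("TRUE", "TRUE", "FALSE", "FALSE", "FALSE", "TRUE"))
)

# Classify one new point by majority vote among its k nearest neighbours
knn_predict <- function(train, new_age, new_hours, k = 3) {
    # Euclidean distance from the new point to every training point
    dists <- sqrt((train$Age - new_age)^2 + (train$played_hours - new_hours)^2)
    nearest <- order(dists)[seq_len(k)]        # row indices of the k closest points
    votes <- table(train$subscribe[nearest])   # count the labels among those neighbours
    names(votes)[which.max(votes)]             # the most common label wins
}

prediction <- knn_predict(train, new_age = 19, new_hours = 4, k = 3)
prediction
```

With these invented points, two of the three nearest neighbours of a 19-year-old with 4 played hours are subscribers, so the sketch predicts `"TRUE"`.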

**Model Application**

Model comparison and selection will be conducted through cross-validation. The `initial_split()` function partitions the dataset into 75% training and 25% testing sets, stratified by `subscribe` so that both sets keep roughly the same proportion of subscribers. The testing set is held out and is not used during model training or tuning.

Cross-validation on the training set, typically with 5 or 10 folds, will estimate the model’s performance for each candidate K. More folds give a more stable estimate at a higher computational cost. Choosing the K with the best cross-validated accuracy should help maximize classification accuracy on new observations.
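
As a rough illustration of how 5-fold cross-validation divides the training rows, the base R sketch below assigns folds by hand. The row count `n` is hypothetical, and unlike `vfold_cv()` with `strata`, this simple version does not stratify by class.

```r
set.seed(1)  # make the random fold assignment reproducible

n <- 20      # hypothetical number of training rows
v <- 5       # number of folds

# Shuffle the fold labels 1..v so each row lands in exactly one fold
fold_id <- sample(rep(seq_len(v), length.out = n))

# For each fold: hold it out for validation and keep the rest for fitting
fold_sizes <- sapply(seq_len(v), function(f) {
    validation_rows <- which(fold_id == f)
    analysis_rows   <- which(fold_id != f)
    c(analysis = length(analysis_rows), validation = length(validation_rows))
})
fold_sizes
```

Each row is used for validation exactly once, and the accuracy for a given K would be the average over the five held-out folds.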

**Potential Limitations**

- KNN is sensitive to feature scaling. If `played_hours` and `Age` have vastly different ranges, Euclidean distance calculations may be biased toward the variable with the larger scale, leading to inaccurate neighbor selection and predictions.
- Class imbalance can affect KNN performance. If there are significantly more subscribed than unsubscribed cases, the model may overclassify new samples as subscribed, reducing prediction accuracy.
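
The scaling limitation can be illustrated with a small base R sketch. The points in `train`, the `query` player, and the helper `nearest_row()` are all hypothetical. The example only shows that standardizing can change which training point counts as the nearest neighbour.

```r
# Hypothetical training points and query; these values are invented
train <- data.frame(Age          = c(17, 21, 18, 19, 20),
                    played_hours = c(20, 10, 0, 100, 200))
query <- c(Age = 17, played_hours = 12)

# Index of the training row closest to the query (Euclidean distance)
nearest_row <- function(points, q) {
    diffs <- sweep(as.matrix(points), 2, q)   # subtract q from every row
    unname(which.min(sqrt(rowSums(diffs^2))))
}

raw_nearest <- nearest_row(train, query)

# Standardize using the training means and SDs, then transform the query the same way
centers <- colMeans(train)
sds <- sapply(train, sd)
train_std <- scale(train, center = centers, scale = sds)
query_std <- (query - centers) / sds
std_nearest <- nearest_row(train_std, query_std)

c(raw = raw_nearest, standardized = std_nearest)
```

On the raw scale the 21-year-old in row 2 is nearest, because an 8-hour gap counts for more than a 4-year gap. After standardizing, the same-aged player in row 1 becomes nearest, since 8 hours is small relative to the spread of `played_hours`.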


# Data Analysis

In [None]:
# 6. Preparing the recipe using only the training data. The data is standardized and upsampled to resolve data imbalance issues. 
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_training) |>
                  step_scale(all_predictors()) |>
                  step_center(all_predictors()) |>
                  step_upsample(subscribe, over_ratio = 1)

In [None]:
# 7. a) Cross-validation and Parameter (K) Value Selection
players_vfold <- vfold_cv(players_training, v = 5, strata = subscribe)

players_spec_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                     set_engine("kknn") |>
                     set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 1))

player_tune_results <- workflow() |>
                       add_recipe(players_recipe) |>
                       add_model(players_spec_tune) |>
                       tune_grid(resamples = players_vfold, grid = k_vals) |>
                       collect_metrics()

accuracy <- player_tune_results |>
            filter(.metric == "accuracy")
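
A natural next step would be to pick the K with the highest cross-validated accuracy. The sketch below shows one way to do that on a small stand-in table: `accuracy_demo` and its accuracy values are invented, and only the column names `neighbors`, `.metric`, and `mean` mirror what `collect_metrics()` returns.

```r
library(dplyr)

# Hypothetical cross-validation results; these accuracies are invented
accuracy_demo <- tibble(neighbors = 1:5,
                        .metric = "accuracy",
                        mean = c(0.52, 0.55, 0.61, 0.58, 0.57))

best_k <- accuracy_demo |>
    slice_max(mean, n = 1, with_ties = FALSE) |>   # keep the single row with the highest mean accuracy
    pull(neighbors)
best_k
```

In the actual analysis the same pipeline would be applied to `accuracy`, and the chosen K would then be used to refit the model on the full training set.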
     

# Discussion

# References