<h1> DSCI 100; Final Project Report </h1>

### ***Group Members (Group 31):*** Ivan Liu (28950673), Isabella Huang (53142667), Parleen Uppal (70438452), and Arman Behzadnia (89834832)

GitHub repository: https://github.com/Arzmxn/dsci-100-2025W-009-31

<h2> Introduction & Description</h2>	


**Background Information**

Developers and publishers use game-related newsletters to keep players informed about events and updates and to increase player engagement. The UBC Computer Science research group is conducting a study on player behaviour in video games using a Minecraft research server. Because the team needs to target its recruitment efforts, understanding which factors influence a player's decision to subscribe can help it recruit more effectively.

We are given two datasets, but we use only `players.csv`. Players differ in demographics (such as age and gender), gaming experience, and engagement levels, and these differences may influence whether they choose to receive the newsletter. In this project, we address the broad question below and, more specifically, try to predict subscription status.

**Broad Question**

What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? 
- This broad question explores how demographic and behavioural factors (such as age, gender, experience, and engagement) may influence a player's likelihood of subscribing. It considers who the players are and how they engage with the game.


**Specific Question** 

Can played hours and age predict subscription status in `players.csv`? 
- This narrower question focuses on played hours (a measure of engagement) and age (a demographic variable) to determine whether these two factors can predict whether a player subscribes to the newsletter.


<h2> Data Summary Description</h2>	

To answer the specific question, we use the `players.csv` dataset, which contains demographic and behavioural information about each player. It has 196 observations and seven variables: `experience`, `subscribe`, `hashedEmail`, `played_hours`, `name`, `gender`, and `Age`. Only `Age` has missing values (two). The key identifier is `hashedEmail`, which is unique to each player.



| Variable | Type | Missing Values | Unique Values | Description / Notes |
|-----------|------|----------------|----------------|----------------------|
| `experience` | fct | 0 | 5 | Gaming experience level of each player. Categories, in order of experience: Veteran, Pro, Regular, Amateur, and Beginner. |
| `subscribe` | lgl | 0 | 2 | Logical data type that indicates whether the player subscribed to the game-related newsletter. |
| `hashedEmail` | chr | 0 | 196 | Unique anonymized player ID (key for joining with `sessions.csv`). This identifies the players and is a string of lowercase letters and numbers. |
| `played_hours` | int | 0 | 43 | Total number of hours played by each player. |
| `name` | chr | 0 | 196 | Player alias or name (not used as an analytical variable). |
| `gender` | fct | 0 | 7 | Player-reported gender (categorical). Categories include Male, Female, Agender, Non-binary, and "Prefer not to say". |
| `Age` | int | 2 | 32 | Player's age (years). |
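
Counts like those in the "Missing Values" and "Unique Values" columns can be computed with a few lines of base R. The sketch below runs on a small made-up data frame (`toy_players`, hypothetical values) rather than the real `players.csv`, so its output is illustrative only. It also assumes that "unique" means distinct non-missing values, which may differ from how the table above was produced.

```r
# Toy stand-in for players.csv; all values here are made up for illustration.
toy_players <- data.frame(
  experience   = c("Pro", "Veteran", "Amateur", "Pro"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0.0, 0.7, 0.0),
  Age          = c(9, 17, NA, 21)
)

# Number of missing values per column.
missing_counts <- sapply(toy_players, function(col) sum(is.na(col)))

# Number of distinct non-missing values per column (assumed definition of "unique").
unique_counts <- sapply(toy_players, function(col) length(unique(na.omit(col))))

# Combine both counts into one summary table.
data.frame(variable  = names(toy_players),
           missing   = missing_counts,
           unique    = unique_counts,
           row.names = NULL)
```

Running the same two `sapply()` calls on the real data frame would give a quick check of the table before any wrangling.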

<h2> Methods and Results </h2>

#### **Loading the Packages**

The first step was to load all necessary libraries for data manipulation, visualization, and modelling. These packages provide the tools needed for reading data, wrangling variables, creating graphics, and performing KNN classification.

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10) 

#### **Reading the Dataset**

We can import the `players.csv` file, which contains player demographics and behaviour measures. This dataset will be used to investigate whether played hours and age predict newsletter subscription status (as mentioned in the Introduction). 

In [None]:
player_data=read_csv("https://raw.githubusercontent.com/Arzmxn/ideal-umbrella/refs/heads/main/players.csv")

#### **Wrangling and Cleaning the Data**

We prepared the dataset by:
- Converting **subscription status** into a categorical variable  
- Selecting only the variables relevant to our question:  
  **subscribe, played_hours, Age**
- Filtering out invalid or missing age values

This creates a clean dataset for analysis and visualization.

In [None]:
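wrangled_player <- player_data |>
    mutate(subscribe = as_factor(subscribe)) |>   # logical -> factor for classification
    select(subscribe, played_hours, Age) |>       # keep only the response and the two predictors
    filter(!is.na(Age), Age >= 0)                 # drop missing or negative ages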

#### **Summary Statistics for Key Variables**

We calculated the mean values of **played hours** and **age**. This provides a simple numerical overview of the dataset before visualization and modelling.

In [None]:
summary_data <- player_data |>
    select(played_hours, Age) |>
    summarize(across(played_hours:Age, ~ mean(.x, na.rm = TRUE)))   # ignore missing ages

summary_data

#### **Scatterplot of Age vs Played Hours**

This scatterplot shows the relationship between **age** and **total played hours**, with points colored by **subscription status**.  
It allows us to visually check for potential patterns related to the prediction task.

In [None]:
options(repr.plot.width = 12, repr.plot.height = 8)
scatter_viz <- wrangled_player |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age of Player (Years)", y = "Total Playtime (Hours)",
         color = "Subscription Status",
         title = "Fig.1 Scatterplot of Age (yrs) vs Playtime (hrs) with Subscription Status") +
    theme(text = element_text(size = 15))

scatter_viz

#### **Histogram of Age**

This histogram displays the distribution of player ages and how the count of players in each age group differs by **subscription status**.

In [None]:
histogram_viz_1 <- wrangled_player |>
    ggplot(aes(x = Age, fill = subscribe)) +
    geom_histogram() +
    labs(x = "Age of Player (Years)", y = "Number of Players",
         fill = "Subscription Status",
         title = "Fig.2 Distribution of Age (yrs) with Subscription Status") +
    theme(text = element_text(size = 15))

histogram_viz_1

#### **Histogram of Played Hours**

This histogram shows how total playtime is distributed among players, with colours indicating whether the player subscribed to the newsletter.

In [None]:
histogram_viz_2 <- wrangled_player |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram() +
    labs(x = "Total Playtime (Hours)", y = "Number of Players",
         fill = "Subscription Status",
         title = "Fig.3 Distribution of Playtime (hrs) with Subscription Status") +
    theme(text = element_text(size = 15))

histogram_viz_2

#### **Splitting the Data**

We split the data into **training (75%)** and **testing (25%)** sets, stratifying by subscription status so that both sets keep roughly the same proportion of subscribers as the full dataset.

In [None]:
set.seed(123123123)

player_split <- initial_split(wrangled_player, prop = 0.75, strata = subscribe)
player_train <- training(player_split)
player_test <- testing(player_split)

#### **Preprocessing (Recipe)**

We created a recipe that:
- Uses **played_hours** and **age** to predict **subscription**
- **Scales and centers** the predictor variables

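Standardizing the predictors is important for distance-based models like KNN: without it, the variable with the larger numeric range would dominate the distance calculation.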

In [None]:
player_recipe <- recipe(subscribe ~ played_hours + Age , data = player_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
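
To illustrate why standardization matters for a distance-based classifier, here is a minimal base-R sketch with made-up values (not taken from `players.csv`). It compares Euclidean distances before and after centering and scaling. The data frame `toy` and the helper `euclid` are hypothetical names used only for this illustration.

```r
# Hypothetical players: played_hours spans a much wider range than Age.
toy <- data.frame(
  played_hours = c(0, 100, 2, 50),
  Age          = c(17, 18, 40, 25)
)

# Euclidean distance between rows i and j of a numeric data frame or matrix.
euclid <- function(df, i, j) sqrt(sum((df[i, ] - df[j, ])^2))

# Raw units: player 1 looks far closer to player 3 than to player 2,
# because the 100-hour gap in playtime dominates the distance.
raw_1_2 <- euclid(toy, 1, 2)
raw_1_3 <- euclid(toy, 1, 3)

# scale() centers each column to mean 0 and scales it to standard deviation 1.
toy_std <- scale(toy)

# Standardized units: both variables now contribute on a comparable scale.
std_1_2 <- euclid(toy_std, 1, 2)
std_1_3 <- euclid(toy_std, 1, 3)

c(raw_1_2 = raw_1_2, raw_1_3 = raw_1_3, std_1_2 = std_1_2, std_1_3 = std_1_3)
```

In raw units the 23-year age gap between players 1 and 3 barely registers next to the 100-hour playtime gap between players 1 and 2. After standardization the two distances come out at a similar size, so neither predictor dominates the neighbour search.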

#### **Cross-Validation**

We used **5-fold cross-validation** on the training set to tune the KNN model, stratifying the folds by subscription status so that each fold has a similar class distribution.

In [None]:
player_vfold <- vfold_cv(player_train, v = 5, strata = subscribe)

#### **KNN Tuning Setup**

We then defined a KNN model where the number of neighbours (**K**) will be tuned. A grid of K values from 1 to 20 is created for evaluation.

In [None]:
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
       set_engine("kknn") |>
       set_mode("classification")

k_vals <- tibble(neighbors = seq(from = 1, to = 20, by = 1))
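
For intuition about what K controls, the base-R sketch below classifies a single new point by an unweighted majority vote among its K nearest neighbours, which is the behaviour that `weight_func = "rectangular"` requests. The helper `knn_vote` and all data values are hypothetical, and the sketch assumes the predictors are already standardized. It is not how `kknn` is implemented internally.

```r
# Hypothetical standardized training data (values made up for illustration).
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.9,  1.1,
                     1.2,  0.7,
                     0.1, -0.2),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("played_hours", "Age")))
train_y <- factor(c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE"))

# Classify one new point by majority vote among its k nearest neighbours.
knn_vote <- function(new_point, x, y, k) {
  # Euclidean distance from new_point to every training row.
  dists <- sqrt(rowSums(sweep(x, 2, new_point)^2))
  # Labels of the k closest training rows.
  nearest <- y[order(dists)[seq_len(k)]]
  # Most common label among the neighbours (ties go to the first factor level).
  names(which.max(table(nearest)))
}

new_player <- c(played_hours = 0.0, Age = 0.0)

# The predicted class can change as k changes.
sapply(c(1, 3, 5), function(k) knn_vote(new_player, train_x, train_y, k))
```

With these toy values the prediction flips between classes as K moves from 1 to 3 to 5, which is why K is tuned by cross-validation instead of being fixed in advance.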

#### **Evaluating K Values and Selecting an Optimal K Value**

Next, we performed a grid search using the cross-validation folds and collected accuracy metrics for each value of K. We also extracted the K value that produced the highest classification accuracy.

In [None]:
knn_player_results <- workflow() |>
       add_recipe(player_recipe) |>
       add_model(knn_tune) |>
       tune_grid(resamples = player_vfold, grid = k_vals) |>
       collect_metrics()

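# Cross-validated accuracy for every K in the grid (also used for the accuracy plot)
best_k_value <- knn_player_results |>
    filter(.metric == "accuracy") |>
    select(neighbors, mean)

# K with the highest mean cross-validated accuracy
best_k <- best_k_value |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)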

#### **Training the Final KNN Model**

Using the best-performing K value found during tuning (K = 16), we fit the final model to the full training set.

In [None]:
knn_best <- nearest_neighbor(weight_func = "rectangular", neighbors = 16) |>
       set_engine("kknn") |>
       set_mode("classification")

player_best_fit <- workflow() |>
       add_recipe(player_recipe) |>
       add_model(knn_best) |>
       fit(player_train)

#### **Generating Predictions**

We then applied the final model to the testing set and bound the predictions to the original test data.

In [None]:
player_predictions <- predict(player_best_fit, player_test) |>
                        bind_cols(player_test)

#### **Model Performance Evaluation**

We computed classification metrics on the testing set, including accuracy and other standard performance measures, using the code below. 

In [None]:
player_metrics <- player_predictions |> 
    metrics(truth = subscribe, estimate = .pred_class)

player_metrics
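
Accuracy alone can hide how a classifier treats each class, so it can help to look at a confusion matrix as well. The base-R sketch below uses made-up truth and prediction vectors (not our model's output) and treats `TRUE` (subscribed) as the positive class, which is an assumption made for illustration.

```r
# Hypothetical labels; these are NOT results from the model in this report.
truth     <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "TRUE"),
                    levels = c("FALSE", "TRUE"))
predicted <- factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE"),
                    levels = c("FALSE", "TRUE"))

# Rows are predictions, columns are the true labels.
conf <- table(predicted = predicted, truth = truth)

tp <- conf["TRUE", "TRUE"]    # subscribers predicted as subscribers
fp <- conf["TRUE", "FALSE"]   # non-subscribers predicted as subscribers
fn <- conf["FALSE", "TRUE"]   # subscribers predicted as non-subscribers

accuracy  <- sum(diag(conf)) / sum(conf)   # share of all predictions that are correct
precision <- tp / (tp + fp)                # share of predicted subscribers that are real
recall    <- tp / (tp + fn)                # share of real subscribers that are found

conf
c(accuracy = accuracy, precision = precision, recall = recall)
```

When most players belong to one class, a model can reach a reasonable accuracy by predicting that class almost every time, so precision and recall give a fuller picture than accuracy alone.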

#### **Visualization of Accuracy Across K Values**

This line plot shows the **accuracy estimate** for each tested K value. It illustrates how the best K was selected during tuning.

In [None]:
vis_of_best_k=best_k_value|>
    ggplot(aes(x=neighbors,y=mean))+
    geom_line()+
    labs(x="Neighbors (K)", y="Accuracy Estimate")+
    ggtitle("Fig.4 Line Plot of Accuracy Estimate vs. Neighbors")

vis_of_best_k

<h2> Discussion </h2>

In [None]:
add discussion here!