# Predicting Newsletter Subscription from Player Behaviour

This report uses data from a UBC Minecraft research server to explore which player characteristics and in-game behaviours predict subscription to a game-related newsletter. I load the provided CSVs, wrangle them, compute summaries of the quantitative variables, and create four exploratory plots. I then outline a baseline method, k-nearest neighbours (k-NN) classification.

In [None]:
# Load data and wrangling
library(tidyverse)
library(knitr)

players <- read_csv("players.csv") |>
    mutate(experience = factor(experience),
           gender = factor(gender))

sessions <- read_csv("sessions.csv") |>
    # Timestamps are in milliseconds, so convert the difference to minutes
    mutate(duration_min = (original_end_time - original_start_time) / (1000 * 60),
           # Extract the start hour (characters 12-13 of the 'start_time' string)
           hour = as.numeric(substr(start_time, 12, 13)))

glimpse(players)
glimpse(sessions)

## 1) Data Description

The players dataset contains 196 rows and 7 variables, with one row per unique player. It includes:

- 'subscribe' (logical): newsletter subscription status; the response variable.
- 'Age', 'played_hours' (numeric): quantitative features.
- 'experience', 'gender' (categorical): self-reported attributes.
- 'name', 'hashedEmail': identifiers; 'hashedEmail' is the key used to join players to sessions.

The sessions dataset contains 1,535 rows and 5 variables, with one row per individual play session. Two variables are derived in this notebook:

- 'duration_min': per-session minutes played (from ms timestamps).
- 'hour': start hour of day (0-23), from the 'start_time' string.
 
**Potential issues with data:** Playtime and session durations are heavily right-skewed, some players may have zero recorded sessions, and daily timing patterns could influence behaviour.
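
The zero-session and skew concerns can be checked directly. The sketch below is a minimal illustration on a tiny made-up pair of tables: only the column names ('hashedEmail', 'played_hours', 'duration_min') mirror the real CSVs, and the values are invented for the example.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; values are invented, not taken from the CSVs.
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c", "d"),
  played_hours = c(0, 0.1, 0.5, 40)
)
toy_sessions <- tibble(
  hashedEmail  = c("b", "c", "c", "d", "d", "d"),
  duration_min = c(6, 10, 20, 300, 800, 1300)
)

# Players that never appear in the sessions table
no_sessions <- anti_join(toy_players, toy_sessions, by = "hashedEmail")
nrow(no_sessions)

# A mean far above the median is a quick sign of right skew
skew_check <- toy_players |>
  summarize(mean_hours   = mean(played_hours),
            median_hours = median(played_hours))
skew_check
```

Running the same two checks on 'players' and 'sessions' would show how many players lack session data and how strongly the mean is pulled up by a few heavy players.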

In [None]:
# Variable overview & summaries
players_means <- players |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE),
              mean_age = mean(Age, na.rm = TRUE))
players_means

sessions_means <- sessions |>
    summarize(mean_duration_min = mean(duration_min, na.rm = TRUE),
              mean_hour = mean(hour, na.rm = TRUE))
sessions_means

## 2) Research Question

**Broad:** Which player characteristics and in-game behaviours are most predictive of subscribing to the newsletter?

**Specific:** Can a player's age, total played hours, experience, gender, and typical session behaviour (number of sessions and average session length) predict whether they subscribe?

**Response:** 'subscribe' (TRUE/FALSE)

**Explanatory:** 'Age', 'played_hours', 'experience', 'gender', and the per-player variables 'n_sessions' and 'mean_session_min' derived from 'sessions.csv'.



## 3) Exploratory Data Analysis

### 3.1 Subscription by experience (Plot 1)

In [None]:
# Subscription proportion by experience level
ggplot(players, aes(x = experience, fill = subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Experience level", y = "Proportion subscribed", title = "Newsletter subscription by experience")


**Insight:** All experience groups show high subscription proportions. Beginners/Regulars appear slightly higher than Veterans/Pros.

### 3.2 Played hours vs subscription (Plot 2)

In [None]:
# Played hours vs subscription
ggplot(players, aes(x = subscribe, y = played_hours)) +
    geom_boxplot() +
    labs(x = "Subscribed", y = "Total played hours", title = "Played hours vs subscription (full)")

# Zoomed in to see low-hour players
ggplot(players, aes(x = subscribe, y = played_hours)) +
    geom_boxplot() +
    coord_cartesian(ylim = c(0, 10)) +
    labs(x = "Subscribed", y = "Total played hours", title = "Played hours vs subscription (0-10 hours)")

**Insight:** Most players play less than an hour in total, but subscribers tend to log more time in-game. The higher median and longer upper range for subscribers suggest that greater engagement is associated with a higher likelihood of subscribing, even among mostly low-activity players.

### 3.3 Session starts by hour of day (Plot 3)

In [None]:
# Sessions by hour of day
ggplot(sessions, aes(x = hour)) +
    geom_bar() +
    labs(x = "Hour of day", y = "Number of sessions", title = "Session starts by hour (UTC)")


**Insight:** Session activity peaks between 00:00-05:00 UTC and again near 21:00-23:00 UTC, while midday hours show minimal play. This pattern suggests that most players log in during a few concentrated blocks of the day. Because the times are in UTC, the corresponding local hours depend on where players live, so reading these peaks as evening leisure time is tentative. The clear off-peak gap implies predictable daily cycles, which are useful for planning server capacity and understanding when engagement is highest.

### 3.4 Per-player sessions vs subscription (Plot 4)

In [None]:

# Per-player session features and plots
per_player <- sessions |>
    group_by(hashedEmail) |>
    summarize(n_sessions = n(), mean_session_min = mean(duration_min, na.rm = TRUE))

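# Left join keeps every player; those with no recorded sessions get NA
# for n_sessions and mean_session_min
player_df <- players |>
    left_join(per_player, by = "hashedEmail")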

# Number of sessions vs subscription
ggplot(player_df, aes(x = subscribe, y = n_sessions)) +
    geom_boxplot() +
    labs(x = "Subscribed", y = "Number of sessions", title = "Sessions per player vs subscription")

# Zoomed in Number of sessions vs subscription
ggplot(player_df, aes(x = subscribe, y = n_sessions)) +
    geom_boxplot() +
    coord_cartesian(ylim = c(0, 3)) +
    labs(x = "Subscribed", y = "Number of sessions", title = "Sessions per player vs subscription (0-3 sessions)")


**Insight:** Most non-subscribers play only once, while subscribers are more likely to return for multiple sessions. The higher median and wider spread for subscribers suggest that repeat engagement is associated with a higher likelihood of subscribing, even when overall session counts remain low.

## 4) Methods & Plan

**Task:** Binary classification of 'subscribe'.

**Method:** **k-Nearest Neighbours (k-NN) classification.** It predicts whether a player will subscribe by finding the k most similar players in the training data and assigning the majority outcome among those neighbours.
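
To make the majority-vote idea concrete, here is a minimal hand-rolled sketch in base R. The data are invented, the two features 'x1' and 'x2' stand in for already-standardized predictors, and 'knn_vote' is an illustrative helper written for this example, not a library function.

```r
# Hypothetical toy data: two standardized features and a known label.
train <- data.frame(
  x1 = c(-1.0, -0.8, -0.5, 0.6, 0.9, 1.2),
  x2 = c(-0.9, -1.1, -0.4, 0.7, 1.0, 0.8),
  subscribe = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)
new_player <- c(x1 = 0.5, x2 = 0.5)

# Illustrative helper (not from any package): classify one new point.
knn_vote <- function(train, new_point, k) {
  # Euclidean distance from the new point to every training row
  d <- sqrt((train$x1 - new_point[["x1"]])^2 +
            (train$x2 - new_point[["x2"]])^2)
  # Row indices of the k closest training players
  nearest <- order(d)[seq_len(k)]
  # Majority vote: TRUE if more than half of the neighbours subscribed
  mean(train$subscribe[nearest]) > 0.5
}

prediction <- knn_vote(train, new_player, k = 3)
prediction
```

An odd k avoids tied votes in a two-class problem. In practice a package implementation would be used instead of this helper.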

**Why this is appropriate:** k-NN is a simple and interpretable baseline for classification. It works on numeric distances, so it can use both the numeric variables and, once they are encoded as indicator columns, the categorical ones.

**Assumptions/requirements:** Players with similar values for the chosen variables are expected to have similar outcomes. Numeric variables should be standardized so that no single variable dominates the distance, and categorical variables should be converted to numeric indicator (dummy) variables. Too many unrelated variables or outliers can reduce performance.

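**Data processing:** Use 'Age', 'played_hours', 'experience', 'gender', 'n_sessions', and 'mean_session_min' to build the modelling dataset. Standardize the numeric variables and encode the categorical ones before fitting k-NN. Missing values, such as the session features of players with no recorded sessions, must be handled before distances can be computed.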
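
One way to express the standardize-and-encode step is a tidymodels recipe. This is a sketch under assumptions: the report does not name a preprocessing library, the toy rows below are invented, and only a subset of the predictors is shown.

```r
library(tidymodels)

# Hypothetical toy rows; column names follow this report, values are made up.
toy <- tibble(
  subscribe    = factor(c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE")),
  Age          = c(17, 22, 19, 30, 25, 21),
  played_hours = c(0.5, 0.0, 12.0, 3.2, 0.1, 48.0),
  experience   = factor(c("Amateur", "Pro", "Veteran", "Amateur", "Pro", "Veteran"))
)

knn_recipe <- recipe(subscribe ~ ., data = toy) |>
  # Centre and scale the numeric predictors (Age, played_hours)
  step_normalize(all_numeric_predictors()) |>
  # Replace each factor predictor with 0/1 indicator columns
  step_dummy(all_nominal_predictors())

# Estimate the scaling parameters and apply them to the same data
processed <- knn_recipe |>
  prep() |>
  bake(new_data = NULL)
processed
```

Normalizing before creating the indicators leaves the 0/1 columns unscaled; whether to rescale them as well is a design choice. With a train/test split, the recipe would be estimated on the training set only.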

**Evaluation & selection:** Split the data into training (70%) and testing (30%) sets. Use 5-fold cross-validation on the training data to choose the number of neighbours k. Measure final performance on the test data using accuracy and a confusion matrix.
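
The plan above could be sketched with tidymodels as follows. Everything here is an assumption for illustration: the data are simulated rather than the real players table, the "kknn" engine (which needs the kknn package installed), the grid of odd k values, and the seed are my choices, not something fixed by the report.

```r
library(tidymodels)  # the kknn package must also be installed

set.seed(2024)
# Simulated stand-in data, NOT the real players table.
n <- 200
toy <- tibble(
  played_hours = rexp(n, rate = 0.5),
  n_sessions   = rpois(n, lambda = 3) + 1
) |>
  mutate(subscribe = factor(if_else(played_hours + rnorm(n) > 1.5, "yes", "no")))

# 70/30 split, stratified so both sets keep a similar class balance
data_split <- initial_split(toy, prop = 0.70, strata = subscribe)
train <- training(data_split)
test  <- testing(data_split)

rec <- recipe(subscribe ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set over a grid of odd k values
folds  <- vfold_cv(train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
  add_recipe(rec) |>
  add_model(knn_tune) |>
  tune_grid(resamples = folds, grid = k_grid, metrics = metric_set(accuracy)) |>
  collect_metrics()

# k with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

knn_final <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

final_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(knn_final) |>
  fit(data = train)

# Evaluate once on the held-out test set
test_preds <- predict(final_fit, test) |> bind_cols(test)
accuracy(test_preds, truth = subscribe, estimate = .pred_class)
conf_mat(test_preds, truth = subscribe, estimate = .pred_class)
```

The test set is touched only once, after k has been chosen, so the reported accuracy is not biased by the tuning step.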

**Limitations/weaknesses:** k-NN is sensitive to scaling, outliers, and noisy variables that add little information. Later, it could be compared with a linear model (for example, logistic regression) to check whether the conclusions are consistent.

## 5) GitHub Repository

https://github.com/Jundaze/dsci-project-002.git