# Predicting Newsletter Subscription from Player Behaviour

This report uses data from a UBC Minecraft research server to explore which player characteristics and in-game behaviours predict subscribing to a game-related newsletter. I load the provided CSVs, wrangle them, compute summaries for the quantitative variables, and create a set of exploratory plots. I then outline a baseline method (KNN classification).

In [None]:
# Load data and wrangling
library(tidyverse)
library(knitr)

players <- read_csv("players.csv") |>
    mutate(experience = factor(experience),
           gender = factor(gender))

sessions <- read_csv("sessions.csv") |>
    # original_* timestamps are in milliseconds, so convert the difference to minutes
    mutate(duration_min = (original_end_time - original_start_time) / (1000 * 60),
           # hour of day taken from characters 12-13 of the start_time string
           hour = as.numeric(substr(start_time, 12, 13)))

glimpse(players)
glimpse(sessions)

## Data Description
- **players.csv** - 196 rows x 7 columns: one row per player with demographics and engagement.
    - 'subscribe' (logical): newsletter status (response).
    - 'Age', 'played_hours' (numeric): quantitative features.
    - 'experience', 'gender' (factor): categorical predictors.
    - 'name', 'hashedEmail': identifiers (not used for modeling).
- **sessions.csv** - 1,535 rows x 5 columns: one row per play session.
    - Derived in this notebook:
        - 'duration_min' = per-session minutes played (from ms timestamps).
        - 'hour' = start hour of day (0-23), from the 'start_time' string.
 
**Potential issues with data** - Heavy right-skew in playtime/durations, some players may have zero sessions, and daily timing patterns could influence behaviour.
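
The zero-session and skew concerns can be checked directly. The sketch below uses small made-up tables (`players_toy` and `sessions_toy` are hypothetical stand-ins, not values from the CSVs) and assumes only that both tables share a `hashedEmail` column, as the real ones do.

```r
library(dplyr)

# Toy stand-ins for the real tables (hypothetical values, not from the CSVs)
players_toy <- tibble(
    hashedEmail = c("a", "b", "c", "d"),
    played_hours = c(0.0, 0.3, 1.2, 48.5)
)
sessions_toy <- tibble(
    hashedEmail = c("b", "c", "c", "d", "d", "d"),
    duration_min = c(18, 30, 42, 600, 950, 1360)
)

# Players with no matching rows in the sessions table
no_sessions <- players_toy |>
    anti_join(sessions_toy, by = "hashedEmail")

# A mean far above the median is a quick sign of right skew
skew_check <- players_toy |>
    summarize(mean_hours = mean(played_hours),
              median_hours = median(played_hours))

no_sessions
skew_check
```

Running the same two steps on `players` and `sessions` would show how many players never logged a session and how far the mean of `played_hours` sits above its median.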

In [None]:
# Variable overview & summaries
players_means <- players |>
    summarize(mean_played_hours = mean(played_hours, na.rm = TRUE),
              mean_age = mean(Age, na.rm = TRUE))
players_means

## Research Question

**Broad:** Which player characteristics and in-game behaviours are most predictive of subscribing to the newsletter?

**Specific:** Can a player's age, total played hours, experience, gender, and typical session behaviour (number of sessions and average session length) predict whether they subscribe?

**Response:** 'subscribe' (TRUE/FALSE)

**Explanatory:** 'Age', 'played_hours', 'experience', 'gender', and per-player 'n_sessions' and 'mean_session_min' derived from 'sessions.csv'.
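
As a rough illustration of how a KNN baseline for this question might look, the sketch below fits `knn()` from the `class` package on simulated data. Everything about the data is an assumption: the values are randomly generated, the rule linking `played_hours` to `subscribe` is made up, and `k = 5` is arbitrary. Only two numeric predictors are used because KNN distances need numeric inputs; 'experience' and 'gender' would have to be encoded first. This is not necessarily the workflow used for the final model.

```r
library(class)  # provides knn(); ships with standard R installs

set.seed(1)

# Simulated stand-in for the joined player table (hypothetical values)
n <- 60
toy <- data.frame(
    Age = round(runif(n, 10, 40)),
    played_hours = rexp(n, rate = 0.2)
)
# Made-up rule so the toy response relates to a predictor
toy$subscribe <- factor(toy$played_hours + rnorm(n) > 2)

# Hold out 25% of rows for evaluation
test_idx <- sample(n, size = n / 4)
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

# Standardize with training-set means and SDs so neither variable dominates distances
predictors <- c("Age", "played_hours")
centers <- colMeans(train[predictors])
scales <- apply(train[predictors], 2, sd)
train_x <- scale(train[predictors], center = centers, scale = scales)
test_x <- scale(test[predictors], center = centers, scale = scales)

# Classify each held-out player by majority vote among the k nearest training players
pred <- knn(train = train_x, test = test_x, cl = train$subscribe, k = 5)
accuracy <- mean(pred == test$subscribe)
accuracy
```

Scaling with training-set statistics only keeps information from the held-out rows out of the fit, which matters here because 'played_hours' is on a much wider scale than 'Age'.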



## Exploratory Data Analysis

### 5.1 Subscription by experience (Plot 1)

In [None]:
# Subscription proportion by experience level
ggplot(players, aes(x = experience, fill = subscribe)) +
    geom_bar(position = "fill") +
    labs(x = "Experience level", y = "Proportion subscribed", title = "Newsletter subscription by experience")


**Insight:** All experience groups show high subscription proportions. Beginners/Regulars appear slightly higher than Veterans/Pros.
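
The bar heights in a filled bar chart can also be read off as numbers. A minimal sketch, using a hypothetical `players_toy` table with made-up rows rather than the real data:

```r
library(dplyr)

# Toy stand-in for players (hypothetical values)
players_toy <- tibble(
    experience = c("Beginner", "Beginner", "Pro", "Pro", "Pro", "Veteran"),
    subscribe = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# The mean of a logical column is the proportion of TRUE values
sub_by_experience <- players_toy |>
    group_by(experience) |>
    summarize(n_players = n(),
              prop_subscribed = mean(subscribe))
sub_by_experience
```

Keeping `n_players` next to the proportion shows how many players each bar rests on, which helps when judging small differences between groups.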

### 5.2 Played hours vs subscription (Plot 2)

In [None]:
# Played hours vs subscription
ggplot(players, aes(x = subscribe, y = played_hours)) +
    geom_boxplot() +
    labs(x = "Subscribed", y = "Total played hours", title = "Played hours vs subscription (full)")

# Zoomed in to see low-hour players
ggplot(players, aes(x = subscribe, y = played_hours)) +
    geom_boxplot() +
    coord_cartesian(ylim = c(0, 10)) +
    labs(x = "Subscribed", y = "Total played hours", title = "Played hours vs subscription (0-10 hours)")

**Insight:** Most players play less than an hour in total, but subscribers tend to spend more time in-game on average. The higher median and longer upper range for subscribers indicate that greater engagement is linked to a higher likelihood of subscribing, even among mostly low-activity players. 
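
The median and upper-range comparison can be backed with a per-group summary. The sketch below assumes a hypothetical `players_toy` table with made-up hours; the same pipeline would apply to `players`.

```r
library(dplyr)

# Toy stand-in for players (hypothetical values)
players_toy <- tibble(
    subscribe = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
    played_hours = c(0.2, 1.5, 30.0, 0.0, 0.1, 0.6)
)

# Median, upper quartile, and maximum of played hours for each subscription status
hours_by_status <- players_toy |>
    group_by(subscribe) |>
    summarize(median_hours = median(played_hours),
              q3_hours = quantile(played_hours, 0.75, names = FALSE),
              max_hours = max(played_hours))
hours_by_status
```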

In [None]:


# Sessions by hour of day (the hour column was added to sessions during wrangling)
ggplot(sessions, aes(x = hour)) +
    geom_bar() +
    labs(x = "Hour of day", y = "Number of sessions", title = "Session starts by hour (UTC)")

# Distribution of session duration
ggplot(sessions, aes(x = duration_min)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Session length (minutes)", y = "Count", title = "Distribution of session duration")

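# Per-player session features: session count and mean session length
per_player <- sessions |>
    group_by(hashedEmail) |>
    summarize(n_sessions = n(), mean_session_min = mean(duration_min, na.rm = TRUE))

# Players with no recorded sessions get n_sessions = 0 instead of NA;
# their mean_session_min stays NA because it is undefined
player_df <- players |>
    left_join(per_player, by = "hashedEmail") |>
    mutate(n_sessions = replace_na(n_sessions, 0L))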

# Number of sessions vs subscription
ggplot(player_df, aes(x = subscribe, y = n_sessions)) +
    geom_boxplot() +
    labs(x = "Subscribed", y = "Number of sessions", title = "Sessions per player vs subscription")

# Played hours by experience
ggplot(players, aes(x = experience, y = played_hours)) +
    geom_boxplot() +
    labs(x = "Experience level", y = "Total played hours", title = "Played hours by experience level")