# Project
### Link to GitHub Repository

## Predicting Newsletter Subscription in a Minecraft Research Server

## Introduction

### Background

Understanding user behavior in video games is a growing area of interest for both researchers and developers. In this project, we analyze player data collected from a Minecraft research server to investigate which player characteristics and behaviors are predictive of subscribing to a game-related newsletter. This question is important because targeting players more likely to engage with the community can help optimize recruitment strategies and server resource allocation.

### Question

Can player characteristics such as experience level and total played hours predict newsletter subscription in the Minecraft dataset?


## Data Description

Two datasets were used:

- `players.csv`: Includes demographic and self-reported information on 196 unique players.
- `sessions.csv`: Includes 1535 records of individual game sessions.

The following variables were selected to answer the question:

| Variable       | Type       | Description                                                         |
|----------------|------------|---------------------------------------------------------------------|
| `subscribe`    | factor     | Whether the player subscribed to the newsletter.                    |
| `Age`          | numeric    | Age of the player. Two missing values were filled using the median. |
| `gender`       | categorical| Player’s self-reported gender.                                      |
| `experience`   | categorical| Player’s self-assessed experience level.                            |
| `played_hours` | numeric    | Total hours the player has spent playing.                           |

- **Data Cleaning**:  
  - The two missing `Age` values were replaced with the median age.
  - Players with no sessions were retained, as `played_hours` still reflects their activity.
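
The median imputation described above could be sketched as follows. This is a minimal illustration on a small made-up tibble: `toy_players` and its values are hypothetical and not taken from `players.csv`, and the column name `Age` is assumed to match the table above.

```r
library(dplyr)

# Hypothetical toy data standing in for players.csv (values are made up)
toy_players <- tibble(
  Age = c(17, 21, NA, 25, NA),
  played_hours = c(0.0, 1.5, 0.2, 30.3, 0.0)
)

# Replace each missing age with the median of the observed ages
toy_imputed <- toy_players |>
  mutate(Age = if_else(is.na(Age), median(Age, na.rm = TRUE), Age))

toy_imputed
```

Computing the median with `na.rm = TRUE` keeps the missing entries from turning the median itself into `NA`.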

The outcome variable is `subscribe`, and the predictors include demographic and behavioral variables. After wrangling and cleaning, the data was ready for modeling using classification techniques.


In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")

## Methods & Results

### Data Loading and Cleaning

First, we loaded the two datasets, `players.csv` and `sessions.csv`, using the `read_csv()` function from the `readr` package. For this analysis, however, we used only `players.csv`, since our research question focuses on player characteristics rather than session behavior.

We selected three columns from the dataset: `subscribe` (whether the player subscribed), `experience` (self-reported experience level), and `played_hours` (total hours spent playing). We converted `subscribe` to a factor and ordered the `experience` variable by defining custom factor levels from "Beginner" to "Pro".

In [None]:
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

model_data <- players |>
  select(subscribe, experience, played_hours) |>
  mutate(subscribe = as.factor(subscribe),
         experience = factor(experience, levels = experience_levels))

head(model_data)

In [None]:
# Count players in each experience/subscription combination (wide format)
model_data |>
  group_by(experience, subscribe) |>
  summarise(count = n(), .groups = "drop") |>
  pivot_wider(names_from = subscribe, values_from = count, values_fill = 0) |>
  rename(False = `FALSE`, True = `TRUE`)

### Exploratory Data Analysis

To explore the relationship between experience level and subscription, a **proportional stacked bar chart** (Figure 1) was created. The data were grouped by `experience` and `subscribe`, and the number of players in each group was counted. The chart used `position = "fill"` to show the proportion of subscribers within each experience level.

**Figure 1** shows that players with higher experience levels (such as “Pro” and “Regular”) had a larger proportion of subscribers compared to lower-level players like “Beginner” or “Amateur”.

To investigate whether game time was associated with subscription, a **boxplot** was generated (Figure 2). This visualization compared the distribution of `played_hours` between players who subscribed and those who did not.

**Figure 2** indicates that subscribers generally played more hours than non-subscribers, with a higher median and a wider range of playtime. This suggests that more active players may be more likely to engage further with the game, such as by subscribing to the newsletter.
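
The comparison in Figure 2 can also be checked numerically by summarising `played_hours` within each subscription group. The sketch below is only an illustration: `toy_data` is a hypothetical stand-in with made-up values, and in the notebook the same pipeline would presumably be applied to `model_data`.

```r
library(dplyr)

# Hypothetical toy data (made-up values); replace with model_data in the notebook
toy_data <- tibble(
  subscribe = factor(c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)),
  played_hours = c(0.5, 2.0, 40.0, 0.0, 0.1, 1.0)
)

# Per-group size, centre, and spread of played hours
hours_summary <- toy_data |>
  group_by(subscribe) |>
  summarise(
    n = n(),
    median_hours = median(played_hours),
    mean_hours = mean(played_hours),
    iqr_hours = IQR(played_hours),
    .groups = "drop"
  )

hours_summary
```

If playtime is strongly right-skewed, the mean and median can differ substantially, so reporting both alongside the boxplot is a reasonable precaution.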

### Method Justification

The visualizations suit the variable types involved: the stacked bar chart compares two categorical variables (`experience` and `subscribe`), while the boxplot compares the distribution of a numeric variable (`played_hours`) across the levels of `subscribe`. Bar plots and boxplots are commonly used for exploratory data analysis in classification problems.

- **Assumptions**: Boxplots assume a meaningful comparison of numeric data across categorical groups, which is valid for `subscribe` as a binary factor.
- **Limitations**: The analysis does not include other potential variables, such as gender or behavioral session patterns, which may influence subscription.
- **Model Comparison**: This section is focused on initial exploratory analysis, providing visual insights into the data prior to formal predictive modeling.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 6)

experience_subscribe_counts <- model_data |>
  count(experience, subscribe)

experience_subscribe_counts |>
  ggplot(aes(x = experience, y = n, fill = subscribe)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(title = "Proportion of Subscription Status by Experience Level",
       x = "Experience Level",
       y = "Proportion of Players",
       fill = "Subscribed") +
  theme(text = element_text(size = 20))

In [None]:
ggplot(model_data, aes(x = subscribe, y = played_hours, fill = subscribe)) +
  geom_boxplot() +
  labs(title = "Played Hours by Subscription Status",
       x = "Subscribed",
       y = "Total Played Hours") +
  theme(text = element_text(size = 20))

In [None]:
model_data_played <- model_data |>
  select(subscribe, played_hours)

set.seed(2025)
data_split <- initial_split(model_data_played, prop = 0.75, strata = subscribe)
data_train <- training(data_split)
data_test <- testing(data_split)

knn_recipe <- recipe(subscribe ~ played_hours, data = data_train) |>
  step_scale(all_numeric_predictors()) |>
  step_center(all_numeric_predictors())

knn_spec <- nearest_neighbor(neighbors = 5, weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = data_train)

knn_preds <- knn_fit |>
  predict(data_test) |>
  bind_cols(data_test)

# Confusion matrix and overall metrics (accuracy, kappa) on the test set
knn_preds |>
  conf_mat(truth = subscribe, estimate = .pred_class)

knn_preds |>
  metrics(truth = subscribe, estimate = .pred_class)

### Visualization

To explore potential patterns, we created two plots:

- A bar chart showing the proportion of subscribed players by experience level.
- A boxplot comparing total played hours between subscribers and non-subscribers.

The bar chart shows that more experienced players (e.g., “Pro” and “Veteran”) tend to subscribe more often. The boxplot suggests that subscribed players also tend to have played more hours overall.
