# Individual Planning Report - Predicting Newsletter Subscription #

#### Narek Wartanian - 84186642 ####

In [None]:
# Libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)

In [None]:
# Reading the data
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

head(players)
summary(players)
head(sessions)
summary(sessions)

## Data description ##
### players.csv ###
* It has 7 features and 196 rows.
* **experience**, **hashed_email**, **name** and **gender** are characters, **subscribe** is a boolean, and the rest (**played_hours** and **Age**) are doubles.
* Looking at the data shows that there are two rows with NA values in the **Age** column.
* **experience** is stored as a character value. An ordinal encoding (0 = Beginner, 1 = Amateur, etc.) would be easier to work with.
* The **name** column is unlikely to be useful, since a player's name very likely does not correlate with anything. (The hashed e-mail is still needed to find the matching entries in sessions.csv.)
* The column names are formatted inconsistently (e.g. **Age** starts with a capital letter, while the rest start with a lower-case letter).
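
The ordinal encoding mentioned above could be sketched roughly as follows. The toy values and the level order (Beginner lowest, Pro highest) are assumptions for illustration, not taken from players.csv itself.

```r
library(tidyverse)

# Hypothetical toy values; the real column comes from players.csv
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Veteran", "Amateur", "Regular", "Beginner")
)

# Assumed ordering from least to most experienced
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

toy_players <- toy_players |>
  mutate(
    experience = factor(experience, levels = experience_levels, ordered = TRUE),
    # as.integer() returns 1-based level codes, so subtract 1 to start at 0
    experience_code = as.integer(experience) - 1L
  )

toy_players
```

Keeping the ordered factor alongside the integer code means plots still show readable labels, while a distance-based model can use the numeric code.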

### sessions.csv ###
* It has 5 features and 1535 rows.
* **hashed_email**, **start_time** and **end_time** are characters, the rest are all doubles.
* Looking at the data shows that there are two cases in the **original_end_time** column with NA values.
* The **original_start_time** and **original_end_time** variables are scaled in such a way that they are practically unusable (the two values are basically the same).

## Project statement ##
This section briefly covers the question and methods of the individual planning stage. The broad question researched in this report is: *What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter?* The specific question is: *Can player demographic and in-game behaviour variables in players.csv, together with aggregated session statistics from sessions.csv, predict whether a player subscribes to the newsletter?* The response variable is **subscribe**, which is either TRUE or FALSE. The predictor variables will be played_hours, gender, age and average session length, the last of which will be calculated from sessions.csv.
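
One possible way to derive the average session length per player is sketched below. The toy rows, the `dd/mm/YYYY HH:MM` timestamp format, and the names `avg_session_minutes` and `players_with_sessions` are assumptions made for illustration. The real files may need a different parser.

```r
library(tidyverse)
library(lubridate)

# Hypothetical toy rows; the timestamp format is an assumption
toy_sessions <- tibble(
  hashed_email = c("a1", "a1", "b2"),
  start_time   = c("01/05/2024 10:00", "02/05/2024 18:30", "03/05/2024 09:15"),
  end_time     = c("01/05/2024 10:30", "02/05/2024 19:30", "03/05/2024 09:35")
)

toy_players <- tibble(
  hashed_email = c("a1", "b2", "c3"),
  played_hours = c(1.5, 0.3, 0)
)

# Parse the character timestamps and average the session length per player
avg_session <- toy_sessions |>
  mutate(
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
  ) |>
  group_by(hashed_email) |>
  summarise(avg_session_minutes = mean(session_minutes, na.rm = TRUE))

# left_join keeps players with no recorded sessions (their average is NA)
players_with_sessions <- left_join(toy_players, avg_session, by = "hashed_email")
players_with_sessions
```

Players without any sessions end up with a missing average, so they would need to be dropped or imputed before fitting a distance-based model.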

## Exploratory Data Analysis and Visualization ##
In this section we'll try to explore the data and perform minimal amounts of wrangling to turn the data into a tidy dataset.

## Data wrangling ##

In [None]:
# Factorize data
# (named experience_levels to avoid masking base::levels)
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

players <- players |>
  mutate(experience = factor(experience, levels = experience_levels),
         subscribe = as.factor(subscribe),
         gender = as.factor(gender))

# Rename columns (by position) to a consistent snake_case style
colnames(players) <- c("experience", "subscribed", "hashed_email", "hours_played", "player_name", "gender", "age")

head(players)

In [None]:
# Rename columns (by position) to a consistent snake_case style
colnames(sessions) <- c("hashed_email", "start_time", "end_time", "original_start_time", "original_end_time")

head(sessions)

## Mean values ##

In [None]:
# Mean values for players.csv (using the column names set during wrangling)
players |>
  summarise(mean_hours_played = mean(hours_played, na.rm = TRUE),
            mean_age = mean(age, na.rm = TRUE))

## Plots - players.csv ##

In [None]:
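# Plot 1: subscription counts per experience level
# (columns are named subscribed and age after the wrangling step)
ggplot(players, aes(x = experience, fill = subscribed)) +
  geom_bar(position = "dodge") +
  labs(title = "Players by Experience Level and Newsletter Subscription",
       x = "Experience Level", y = "Number of Players", fill = "Subscribed")

# Plot 2: overlaid age histograms; rows with a missing age are dropped first
ggplot(filter(players, !is.na(age)), aes(x = age, fill = subscribed)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 10) +
  labs(title = "Age Distribution by Newsletter Subscription",
       x = "Age", y = "Count", fill = "Subscribed")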

### Plot 1 ###
We can see that most players who subscribed to the newsletter have experience levels from Amateur to Veteran.

### Plot 2 ###
We can see that players around the age of 20 appear most likely to subscribe to the newsletter.
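
Because bar heights show counts rather than rates, a per-group proportion can help check impressions like the ones above. The following is a minimal sketch on hypothetical toy data. The values and the name `subscription_rates` are made up for illustration and are not results from players.csv.

```r
library(tidyverse)

# Hypothetical toy data standing in for the wrangled players table
toy_players <- tibble(
  experience = factor(
    c("Beginner", "Beginner", "Amateur", "Amateur", "Amateur", "Pro"),
    levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
  ),
  subscribed = factor(c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE))
)

# Group size and share of subscribers for each experience level
subscription_rates <- toy_players |>
  group_by(experience) |>
  summarise(
    n_players = n(),
    prop_subscribed = mean(subscribed == "TRUE")
  )

subscription_rates
```

Reporting `n_players` next to the proportion makes it clear when a rate is based on only a handful of players.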

## Methods and Plan ##
In this project, I will use a K-Nearest Neighbors (KNN) classification model to predict whether a player subscribes to the game-related newsletter based on their demographic characteristics and in-game behavior. The response variable, subscribe, is binary (TRUE/FALSE), making classification an appropriate choice. The KNN algorithm is a non-parametric method that classifies an observation based on the majority class among its k-nearest neighbors in the feature space.

KNN is suitable because it can handle non-linear relationships between predictors and the target without assuming a specific functional form. Player behavior and demographics likely have complex, non-linear effects on the probability of subscribing. Additionally, KNN is easy to interpret conceptually and provides a strong baseline model for binary classification problems like this one.

Although KNN makes few statistical assumptions, it relies on certain conditions for good performance:

* **Feature scaling:** All numeric predictors must be on similar scales, as distance is used to find neighbors.
* **Meaningful distance metric:** Euclidean distance assumes numeric features represent meaningful geometry, so categorical variables like experience must be encoded properly.
* **Sufficient and balanced data:** KNN requires enough observations from both classes (subscribed and not subscribed) for meaningful local neighborhoods.

KNN can be computationally expensive with large datasets, as it stores all observations for distance calculations. It is also sensitive to irrelevant or correlated features and to imbalanced data. If one class (e.g., non-subscribers) dominates, the algorithm may be biased toward that class. To address this, I will consider feature scaling, potential feature selection, and possibly weighting distances or resampling the data. Choosing the optimal number of neighbors (k) will also be crucial: too small a k may overfit, while too large a k may underfit.
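
To illustrate why scaling matters for a distance-based method, the sketch below compares Euclidean distances before and after standardising two predictors. The four players and their values are hypothetical and chosen only to make the effect visible.

```r
# Hypothetical players: age in years, hours_played in hours
toy <- data.frame(
  age = c(20, 21, 45, 30),
  hours_played = c(5, 60, 8, 200)
)
rownames(toy) <- c("A", "B", "C", "D")

# Raw distances: hours_played has the larger spread, so it dominates
raw_dist <- as.matrix(dist(toy))

# Distances after standardising each column to mean 0 and sd 1
scaled_dist <- as.matrix(dist(scale(toy)))

# Compare player A's distance to B and to C under both versions
raw_dist["A", c("B", "C")]
scaled_dist["A", c("B", "C")]
```

On the raw values, A is closer to C (similar hours, very different age). After standardising, A is closer to B (similar age, moderately different hours). The nearest neighbor therefore depends on whether the predictors were scaled.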

I will:

* Split the dataset into training (70%) and testing (30%) sets to evaluate model generalization.
* Use 5-fold cross-validation on the training data to identify the best value of k (number of neighbors) based on classification accuracy or F1-score.
* Compare model performance using metrics such as accuracy, precision and recall.
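
The plan above could be sketched with tidymodels roughly as follows. This is a minimal illustration on simulated data, not the final analysis. The toy data, the seed, the grid of k values, and all object names are assumptions, and the `kknn` package needs to be installed for the engine used here.

```r
library(tidymodels)

set.seed(2024)

# Hypothetical simulated data standing in for the wrangled players table
toy_players <- tibble(
  hours_played = rexp(200, rate = 0.1),
  age = round(runif(200, 10, 50))
) |>
  mutate(subscribed = factor(
    if_else(hours_played + rnorm(n(), sd = 5) > 5, "TRUE", "FALSE")
  ))

# 70/30 split, stratified so both classes appear in each set
player_split <- initial_split(toy_players, prop = 0.7, strata = subscribed)
player_train <- training(player_split)
player_test <- testing(player_split)

# Centre and scale the predictors, since KNN relies on distances
knn_recipe <- recipe(subscribed ~ hours_played + age, data = player_train) |>
  step_normalize(all_predictors())

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

# 5-fold cross-validation over an assumed grid of odd k values
player_folds <- vfold_cv(player_train, v = 5, strata = subscribed)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- knn_workflow |>
  tune_grid(resamples = player_folds, grid = k_grid,
            metrics = metric_set(accuracy))

best_k <- select_best(cv_results, metric = "accuracy")

# Refit on the full training set with the chosen k
final_fit <- knn_workflow |>
  finalize_workflow(best_k) |>
  fit(data = player_train)

# Evaluate on the held-out test set, treating "TRUE" as the positive class
class_metrics <- metric_set(accuracy, precision, recall)
test_metrics <- predict(final_fit, new_data = player_test) |>
  bind_cols(player_test) |>
  class_metrics(truth = subscribed, estimate = .pred_class,
                event_level = "second")

test_metrics
```

Putting the scaling step in the recipe means it is re-estimated within each cross-validation fold, which avoids leaking information from the validation portion into the training portion. Adding `f_meas` to the tuning `metric_set()` would allow selecting k by F1-score instead of accuracy.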
