In [None]:
library(tidyverse)
library(ggplot2)

## 1. Data Description

The **players.csv** dataset has 196 rows and 7 columns. 
* The categorical variable **experience** has five levels: beginner, amateur, regular, veteran, and pro. One potential issue is that it must be encoded as an ordinal variable in later steps so that the ordering of the levels is preserved. 
* The logical variable **subscribe** takes the value true or false, indicating whether the player is subscribed to the newsletter. 
* The character variable **hashedEmail** describes the unique player ID. 
* The quantitative variable **played_hours** describes the total time played as a double. 
* The categorical variable **name** describes the player's name. 
* The categorical variable **gender** describes the player's gender. 
* The quantitative variable **Age** describes the player's age in whole years. 
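
As a minimal sketch of the ordinal encoding mentioned for **experience**, the toy example below converts a character column into an ordered factor. The rows are hypothetical, and the capitalized level labels and their ordering (taken from the order the levels are listed above) are assumptions that should be checked against the actual values in players.csv, since any label that does not match a level becomes `NA`.

```r
library(tibble)
library(dplyr)

# Hypothetical toy rows; the level labels are an assumption about players.csv.
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Veteran", "Amateur", "Regular", "Beginner")
)

# Assumed ordering, from least to most experienced.
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

toy_players <- toy_players |>
  mutate(experience = factor(experience, levels = experience_levels, ordered = TRUE))

# Ordered factors keep plot axes in the intended order and support comparisons.
levels(toy_players$experience)
sum(toy_players$experience >= "Veteran")
```

With an ordered factor, bar charts and model summaries list the levels in this order rather than alphabetically.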


The **sessions.csv** dataset has 1535 rows and 5 columns. 
* It shares the **hashedEmail** variable, which holds the unique player ID. 
* The **start_time** and **end_time** variables store each session's start and end in date-time format. 
* Finally, the **original_start_time** and **original_end_time** variables are stored as doubles. 

## 2. Questions

The broad question I will address is Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? \
My specific formulated question will be: \
Can player characteristics (age, experience, gender) and their session engagement (total number of sessions played) predict whether a player will subscribe to the game's newsletter? \
This is a classification question with the variable **subscribe** as the response. The explanatory variables (age, experience, gender) describe player characteristics, and session engagement is quantified as the total number of sessions per player, calculated from sessions.csv.

I will wrangle the data by creating a single, tidy dataset where each row represents a unique player. 
* The sessions.csv file will be summarized by grouping on the **hashedEmail** variable with group_by and then counting the number of rows (sessions) for each player. This creates a new feature, **total_sessions**, which serves as the metric for player engagement. 
* The **total_sessions** data will then be merged with the players.csv file using a left_join on their common key, **hashedEmail**. This ensures that all players from players.csv are retained, even those without recorded sessions.
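
To illustrate why a left join is the right choice here, the sketch below runs the same steps on a small made-up example. The IDs and values are hypothetical, and `count()` is used as a shorthand for group_by followed by summarize with `n()`.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical players: "c3" has no recorded sessions.
toy_players <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  subscribe = c(TRUE, FALSE, TRUE)
)
toy_sessions <- tibble(hashedEmail = c("a1", "a1", "b2"))

# One row per player who has at least one session.
toy_counts <- toy_sessions |>
  count(hashedEmail, name = "total_sessions")

# left_join keeps every player; missing counts become 0 sessions.
toy_joined <- toy_players |>
  left_join(toy_counts, by = "hashedEmail") |>
  mutate(total_sessions = replace_na(total_sessions, 0L))

# For comparison, inner_join would silently drop the player without sessions.
toy_inner <- toy_players |>
  inner_join(toy_counts, by = "hashedEmail")

toy_joined
nrow(toy_inner)
```

Comparing the row count before and after the join is a quick way to confirm that no players were lost or duplicated.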

## 3. Exploratory Data Analysis and Visualization

In [None]:
players <- read_csv("players.csv", show_col_types = FALSE)
sessions <- read_csv("sessions.csv", show_col_types = FALSE)

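# Count the number of sessions recorded for each player
session_counts <- sessions |>
  group_by(hashedEmail) |>
  summarize(total_sessions = n())

# Build one row per player: attach session counts, treat players with no
# recorded sessions as 0, drop unused columns, and label the response
full_data_tidy <- players |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(total_sessions = replace_na(total_sessions, 0)) |>
  select(-name, -played_hours) |>
  mutate(subscribe = factor(subscribe, levels = c(FALSE, TRUE), labels = c("No", "Yes")))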

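# Mean of each numeric column in players.csv; na.rm = TRUE skips missing
# values so that a single NA does not turn the whole mean into NA
mean_table <- players |>
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Mean Value") |>
  mutate(`Mean Value` = format(round(`Mean Value`, 2), nsmall = 2))

print(mean_table)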

In [None]:
full_data_tidy |>
  ggplot(aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Subscription Rate by Player Experience Level",
    x = "Player Experience Level",
    y = "Proportion of Players",
    fill = "Subscribed to Newsletter"
  ) +
  theme_minimal()

This graph shows that 'Pro' players have the highest rate of newsletter subscription, while 'Amateur' and 'Regular' players have the lowest. This suggests that experience may be a useful categorical predictor in the model, although the bars show proportions only and do not reflect how many players fall into each level.
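
Because a filled bar chart normalizes every bar to 100%, it can help to tabulate the proportions alongside the group sizes. The sketch below shows one way to do this on made-up rows; the values are hypothetical and are not taken from players.csv.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with the same column names as full_data_tidy.
toy_data <- tibble(
  experience = c("Pro", "Pro", "Amateur", "Amateur", "Amateur", "Regular"),
  subscribe = factor(c("Yes", "Yes", "Yes", "No", "No", "No"),
                     levels = c("No", "Yes"))
)

# Group size and share of subscribers for each experience level.
subscription_rates <- toy_data |>
  group_by(experience) |>
  summarize(
    n_players = n(),
    prop_subscribed = mean(subscribe == "Yes")
  ) |>
  arrange(desc(prop_subscribed))

subscription_rates
```

Applying the same pipeline to `full_data_tidy` would show whether a high or low rate rests on only a handful of players.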

In [None]:
full_data_tidy |>
  ggplot(aes(x = Age, fill = subscribe)) +
  geom_density(alpha = 0.6) +
  labs(
    title = "Player Age Distribution by Subscription Status",
    x = "Age of Player (Years)",
    y = "Density",
    fill = "Subscribed to Newsletter"
  ) +
  theme_minimal()

The age distributions for subscribed and unsubscribed players are almost identical, both peaking in the early twenties.
This suggests that **Age** may be a weak predictor of subscription status compared with other variables such as
engagement or experience level.

## 4. Methods and Plan

I've chosen regression 