# Methods & Results

In [None]:
library(tidyverse)
library(tidymodels)

In [None]:
players_data <- read_csv("data/players.csv")
players_data

We begin by loading the tidyverse and tidymodels packages and reading the dataset from `data/players.csv`.

In [None]:
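players_tidy <- players_data |>
  # Keep only the variables used in the analysis
  select(experience, subscribe, played_hours, Age) |>
  # Drop players aged exactly 17 (rows with a missing Age are dropped too)
  filter(Age != 17) |>
  # Treat experience and subscribe as categorical variables
  mutate(experience = as_factor(experience),
         subscribe = as_factor(subscribe))
head(players_tidy)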

After loading, we reduce the dataset to the four variables relevant to the analysis: `experience`, `subscribe`, `played_hours`, and `Age`. We chose these because they describe user behavior and may influence the decision to subscribe. We remove players aged exactly 17 because we consider them an unrepresentative group that could introduce noise. Finally, we convert `experience` and `subscribe` to factors so that they are treated as categorical variables during modeling. This matters most for `subscribe`, which is the outcome we aim to predict.
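One side effect of the cleaning step may be worth checking: in dplyr, `filter()` keeps only rows where the condition is `TRUE`, so `Age != 17` also drops rows where `Age` is missing. The sketch below illustrates this on a small toy tibble; the values are made up for illustration and are not taken from `players.csv`, and `toy`, `toy_tidy`, and `toy_keep_na` are hypothetical names.

```r
library(dplyr)
library(forcats)

# Toy data with hypothetical values (NOT from players.csv)
toy <- tibble(
  experience   = c("Pro", "Amateur", "Veteran", "Amateur"),
  played_hours = c(30.3, 0.0, 1.2, 0.5),
  Age          = c(9, 17, NA, 21)
)

# Same pattern as the cleaning cell: the comparison Age != 17 evaluates to NA
# for the missing Age, and filter() drops rows where the condition is NA
toy_tidy <- toy |>
  filter(Age != 17) |>
  mutate(experience = as_factor(experience))

nrow(toy_tidy)               # rows left after removing Age == 17 and missing Age
levels(toy_tidy$experience)  # as_factor() orders levels by first appearance

# Alternative (an assumption about intent): keep rows with a missing Age
toy_keep_na <- toy |>
  filter(is.na(Age) | Age != 17)
```

Whether rows with a missing `Age` should be kept depends on the analysis; comparing `nrow()` before and after the filter on the real data would show how many rows are affected.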

In [None]:
p1 <- players_tidy |>
  ggplot(aes(x = played_hours, fill = subscribe)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") +
  labs(title = "Figure 1: Distribution of Played Hours by Subscription",
       x = "Played Hours", y = "Count") +
  theme_minimal()
p1

In [None]:
p2 <- players_tidy |>
  ggplot(aes(x = experience, y = played_hours, fill = experience)) +
  geom_boxplot() +
  labs(title = "Figure 2: Played Hours by Experience Level",
       x = "Experience Level", y = "Played Hours") +
  theme_minimal()
p2

Next, we explore the data visually. Figure 1 is a histogram of `played_hours` in which the two subscription groups are overlaid (`position = "identity"`) and drawn semi-transparently so that both remain visible. It lets us assess whether players with more playtime tend to subscribe, a pattern that could inform predictions. We use a bin width of 5 hours to balance granularity and clarity.

Figure 2 is a boxplot of `played_hours` by experience level. It shows whether experience is related to the amount of time spent playing, which might be another indicator of subscription behavior. Together, the two plots help uncover possible relationships between the predictors and the target, guiding feature selection and our expectations of the model.
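The visual comparison of playtime across subscription groups can be backed by a small numeric summary. The following is a minimal sketch on toy data with made-up values (not results from `players.csv`); `toy` and `hours_by_subscribe` are hypothetical names, and the same pipeline could be applied to `players_tidy`.

```r
library(dplyr)

# Toy data with hypothetical values (NOT from players.csv)
toy <- tibble(
  subscribe    = factor(c("TRUE", "TRUE", "FALSE", "FALSE", "TRUE")),
  played_hours = c(30, 1, 0, 2, 5)
)

# Count, median, and maximum played hours within each subscription group
hours_by_subscribe <- toy |>
  group_by(subscribe) |>
  summarise(
    n            = n(),
    median_hours = median(played_hours),
    max_hours    = max(played_hours)
  )

hours_by_subscribe
```

The median is used here instead of the mean because playtime distributions are often right-skewed, so a few very active players can dominate the mean.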