# Project final # 
#### Akbar Ismatullayev 17376021, Jairoop Brar 19169291, Jaime Keith, Ian Yoon ###

In [None]:
library(tidyverse)
library(lubridate)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

#### Reading the data for analysing

In [None]:
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

head(sessions)
head(players)

# quick sense-check on dimensions
dim(players)
dim(sessions)

## Introduction

#### About the data

The data analyzed in this report comes from the Pacific Laboratory for Artificial Intelligence at UBC. The group runs a Minecraft server called PLAICraft and records play sessions, with the goal of understanding how people play video games in order to advance artificial intelligence. They collected data on the players on their server and want to know which kinds of players are most likely to contribute more data, so that recruitment efforts can target them.


#### Data Description: ###


In [None]:
glimpse(players)
glimpse(sessions)

players.csv contains demographic and other information about each unique player. Each row represents one player, so there are 196 unique players.
We also noticed that the column names are formatted inconsistently: some use camel case and some use underscores.
For the variables:
* experience is a character variable with classifications such as Pro and Veteran. It is an ordinal feature, meaning its levels have a natural ordering.
* subscribe is a boolean indicating whether the player is subscribed to the newsletter.
* hashed_email is a character variable that is unique for each player.
* played_hours is a double giving the total number of hours each player has played. Some players have zero hours of playtime, meaning they created an account but never played, so we have to be careful with these rows.
* name is a character variable that is not unique and is not useful for the analysis.


sessions.csv contains records of all gameplay sessions for every player. Sessions can be linked to players through 'hashed_email', and each record has start and end timestamps from which we can calculate a duration. Each row corresponds to one gameplay session. Since there are 1535 rows, more than the number of players, 'hashed_email' values repeat in this dataset: players play on the server more than once. We have to account for that when summarizing per player.
For the variables:
* hashed_email is a character variable that links a session to a player.
* start_time is a character variable giving when the session started.
* end_time is a character variable giving when the session ended.
* original_start_time and original_end_time are the same or almost the same, so they will be removed.
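The link between sessions and players, and the duration calculation, can be sketched on a toy table. This is a minimal base R illustration, assuming timestamps in a day/month/year hour:minute layout; the rows and the names `toy_sessions`, `start`, `end` and `sessions_per_player` are hypothetical and not taken from sessions.csv.

```r
# Toy sessions table (hypothetical rows, same column layout as sessions.csv)
toy_sessions <- data.frame(
  hashed_email = c("a1", "a1", "b2"),
  start_time   = c("14/05/2024 19:22", "15/05/2024 08:00", "15/05/2024 21:10"),
  end_time     = c("14/05/2024 20:02", "15/05/2024 08:30", "15/05/2024 23:10"),
  stringsAsFactors = FALSE
)

# Parse the character timestamps (day/month/year hour:minute) as UTC datetimes
start <- as.POSIXct(toy_sessions$start_time, format = "%d/%m/%Y %H:%M", tz = "UTC")
end   <- as.POSIXct(toy_sessions$end_time,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes
toy_sessions$duration_min <- as.numeric(difftime(end, start, units = "mins"))

# Number of sessions per player: repeated hashed_email values are expected
sessions_per_player <- table(toy_sessions$hashed_email)

toy_sessions$duration_min   # 40 30 120
sessions_per_player
```

Because one player can appear on several rows, any per-player quantity has to be aggregated by 'hashed_email' before it is joined to the players table.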

To help target players according to how much data they are expected to contribute, we analyze the players data set to see whether any of these variables can predict how many hours someone will play on the server.


#### Summary statistics

In [None]:
summary(players)
summary(sessions)


Looking at the summary, we notice missing values in age (players) and original_end_time (sessions), which we need to handle carefully. The subscribe variable is also strongly imbalanced: most players are subscribed.
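Both observations can be checked directly rather than read off the summary output. The sketch below uses a small made-up table (`toy_players` and its values are hypothetical) to show one way of counting missing values per column and measuring class balance in base R.

```r
# Toy players table (hypothetical values) with one missing age
toy_players <- data.frame(
  age       = c(17, 21, NA, 19, 25),
  subscribe = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Missing values per column
na_counts <- colSums(is.na(toy_players))

# Share of non-subscribers (FALSE) and subscribers (TRUE)
subscribe_share <- prop.table(table(toy_players$subscribe))

na_counts
subscribe_share
```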

##### Potential issues in short #####
* age and original_end_time have NA values.
* experience can be encoded as an ordinal feature, so Beginner = 1, Amateur = 2, Regular = 3, Veteran = 4, Pro = 5.
* some players have no playtime, so they have no game sessions.
* original_start_time and original_end_time are the same or almost the same.
* subscribe is very imbalanced: there are more TRUE values than FALSE values, which can mislead a classification model.
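As an illustration of the ordinal encoding, the sketch below builds an ordered factor using the level order from the wrangling code in this report (Beginner, Amateur, Regular, Veteran, Pro). The responses in `x` are made up.

```r
# Experience tiers, lowest to highest
tiers <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")

# Hypothetical responses
x <- c("Pro", "Beginner", "Regular", "Amateur")

# ordered = TRUE makes R treat the levels as ranked, not just labelled
experience <- factor(x, levels = tiers, ordered = TRUE)

as.integer(experience)    # 5 1 3 2
experience < "Veteran"    # FALSE TRUE TRUE TRUE
```

The integer codes are ranks only: they say that Pro is above Veteran, not that the gap between tiers is equal.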

#### Project Statement

The question this report is investigating is: 

*Can age predict experience level and/or the number of hours played on the server, and if so, which specific age groups have the highest experience level and hours played?*

To answer this, the following variables support the research question:

 * ##### Age — the main predictor variable.

 * ##### Experience — an ordinal outcome measure of player skill.

 * ##### Played_hours — a continuous measure representing total time spent playing.

Additional variables such as start_time, end_time, and duration_min from the sessions dataset can be used to validate play time records, although they are not essential for answering the primary question.

Variables including gender, subscribe and hashed_email do not contribute to the research question and are therefore excluded from further analysis.



## Methods & Results

### Minimal wrangling ####

Minimal cleaning was applied to ensure consistency across variable names and formats.
Column names were standardized, and categorical variables were converted into factors, with the experience levels listed in order to keep their ordering.

In [None]:
#make the format consistent
players <- players |>
  rename(
    hashed_email = hashedEmail,
    age = Age
  )

sessions <- sessions |>
  rename(
    hashed_email = hashedEmail
  )

# Convert experience, subscribe and gender to factors. experience has a natural order, so its levels are listed from lowest to highest.

players <- players |>
    filter(!is.na(age)) |>
    mutate(
        experience = factor(experience, levels = c("Beginner", "Amateur","Regular","Veteran","Pro")),
        subscribe  = as.factor(subscribe),
        gender     = as.factor(gender),
        log_played_hours = log(played_hours+1)
  )

### Exploratory Data Analysis

##### Summary Statistics

In [None]:
players_means <- players |> 
                 summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
players_means

This indicates that the average player is around 21 years old and has played just under 6 hours, but we don't yet know whether there are outliers or how the data is distributed. To check, we plot the distribution of total played hours among players, that is, the number of players that fall into each range of total hours played.
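A quick way to see why the mean alone can mislead is to compare it with the median on skewed values. The numbers in `hours` below are invented for illustration and are not from players.csv.

```r
# Hypothetical played_hours: mostly small values plus one heavy player
hours <- c(0, 0, 0.1, 0.3, 0.5, 1, 2, 150)

mean(hours)     # pulled upward by the single large value
median(hours)   # closer to what a typical player does
```

When the mean sits far above the median like this, the distribution is right-skewed and a few heavy players dominate the average.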

##### Age vs. Total Minutes (Subscribers vs. Non-subscribers)

In [None]:
# variable dictionary (just to keep track of what's what) 

# var_dict <- tibble(
#   table    = c(rep("players", 7), rep("sessions", 5)),
#   variable = c(
#     "experience","subscribe","hashed_email","played_hours","name","gender","age",
#     "hashed_email","start_time","end_time","original_start_time","original_end_time"
#   ),
#   type = c(
#     "factor","logical","id","numeric","string","factor","numeric",
#     "id","string","string","numeric","numeric"
#   ),
#   meaning = c(
#     "Experience tier","Has subscription","Unique player key","Total lifetime hours",
#     "Player name","Self-reported gender","Age in years",
#     "Unique player key","Session start (d/m/Y H:M)","Session end (d/m/Y H:M)",
#     "Start epoch (ms)","End epoch (ms)"
#   )
# )

# var_dict

# convert session timestamps to usable datetimes --------------------------
# The start/end times are strings like "14/05/2024 19:22" so I parsed them
# Then computed the playtime length for each session in minutes

sessions <- sessions |>
  mutate(
    start_time = dmy_hm(start_time, tz = "UTC"),
    end_time   = dmy_hm(end_time,   tz = "UTC"),
    duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

# summarise total playtime per player 
# For each hashed_email, we want:
# - total minutes played
# - average session length
# - number of sessions actually logged

by_player <- sessions |>
  group_by(hashed_email) |>
  summarise(
    minutes  = sum(duration_min, na.rm = TRUE),
    avg_min  = mean(duration_min, na.rm = TRUE),
    sessions = sum(!is.na(duration_min)),
    .groups = "drop"
  )

# join the playtime summary onto the players table 
# If a player has no sessions, I filled their minutes/avg/sessions with zero.

dat <- players |>
  left_join(by_player, by = "hashed_email") |>
  mutate(
    minutes  = replace_na(minutes,  0),
    avg_min  = replace_na(avg_min,  0),
    sessions = replace_na(sessions, 0)
  )

dat |>
  select(hashed_email, experience, subscribe, age, minutes, sessions) |>
  head()

# numeric summaries for the players table 
# A quick overview of the numeric columns (Age, played_hours, etc.)

players_num_summary <- players |>
  select(where(is.numeric)) |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  summarise(
    n    = sum(!is.na(value)),
    mean = mean(value, na.rm = TRUE),
    sd   = sd(value,   na.rm = TRUE),
    min  = min(value,  na.rm = TRUE),
    p25  = quantile(value, 0.25, na.rm = TRUE),
    med  = median(value,   na.rm = TRUE),
    p75  = quantile(value, 0.75, na.rm = TRUE),
    max  = max(value,  na.rm = TRUE),
    .groups = "drop"
  )

players_num_summary

# summary of session durations 
# Same type of summary but just for the session length variable.

duration_summary <- sessions |>
  summarise(
    n    = sum(!is.na(duration_min)),
    mean = mean(duration_min, na.rm = TRUE),
    sd   = sd(duration_min,   na.rm = TRUE),
    min  = min(duration_min,  na.rm = TRUE),
    p25  = quantile(duration_min, 0.25, na.rm = TRUE),
    med  = median(duration_min,   na.rm = TRUE),
    p75  = quantile(duration_min, 0.75, na.rm = TRUE),
    max  = max(duration_min,  na.rm = TRUE)
  )

duration_summary

# plot: age vs total minutes 
# We log-transform (minutes + 1) because playtime is extremely skewed,
# and we colour by subscription status just to see if subs behave differently.

dat |>
  ggplot(aes(x = age,
             y = log10(minutes + 1),
             colour = subscribe)) +
  geom_point(alpha = 0.8) +
  scale_colour_manual(
    values = c("FALSE" = "black", "TRUE" = "blue"),
    name = "Subscribed"
  ) +
  labs(
    x = "Age (years)",
    y = "Total minutes played (log10 scale)",
    title = "Age vs total minutes"
  ) +
  theme_bw()


There's no strong relationship between playtime and age, as playtime varies widely at all ages. In addition, a line of fit would be biased here given the lack of older players. Younger players (10-25 years old) show the widest range of total playtime: this group contains the most players with 0 playtime as well as the small contingent of highly dedicated players with up to 150+ hours of playtime. Older players tend to cluster at lower playtimes, presumably due to busier schedules, lower interest, or perhaps the aforementioned small sample size. There is, however, a single outlier: a 49-year-old player with over 18 hours of playtime, far more than most players, even those in the younger demographic.
There does not appear to be a strong correlation between age and subscription, but subscription and playtime may be connected. Non-subscribers are over-represented in the region from 0 minutes to very low playtime, which suggests that players who subscribe to the newsletter are more likely to engage with and invest time in the server. Subscribers are also more prevalent
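The visual impression that non-subscribers cluster at zero playtime could be checked numerically by comparing the share of zero-playtime players in each group. The sketch below uses a made-up table (`toy` and its values are hypothetical), so the resulting proportions say nothing about the real data.

```r
# Hypothetical players: subscription status and total minutes played
toy <- data.frame(
  subscribe = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  minutes   = c(0, 30, 45, 600, 0, 120, 0, 0, 0, 15)
)

# Proportion of zero-playtime players within each subscription group
zero_by_group <- tapply(toy$minutes == 0, toy$subscribe, mean)
zero_by_group
```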

##### Player Age Distribution

In [None]:
# 1. How are player ages distributed? 
# Gives a feel for who is actually on the server.

dat |>
  ggplot(aes(x = age)) +
  geom_histogram(binwidth = 2, boundary = 0, closed = "left") +
  labs(
    x = "Age (years)",
    y = "Number of players",
    title = "Distribution of player ages"
  ) +
  theme_bw()

The bulk of server members are around 18 years old, with a sharp drop-off after about 28 years old. After the 30-year mark, there are between 0 and 2 individuals per age bin. This tracks with the general statistics for Minecraft players (citation needed) and with this being a UBC server, which is more likely to attract college-age UBC students.

##### Checking the distribution of total play hours

In [None]:
# 2. Raw total minutes (super skewed, but good to see once) 
# Shows just how locked in some players are.

dat |>
  ggplot(aes(x = minutes)) +
  geom_histogram(bins = 40) +
  labs(
    x = "Total minutes played",
    y = "Number of players",
    title = "Total minutes played (raw scale)"
  ) +
  theme_bw()


# 3. Total minutes on a log scale 
# Same thing but easier to read the bulk of players.

dat |>
  ggplot(aes(x = log10(minutes + 1))) +
  geom_histogram(bins = 30) +
  labs(
    x = "log10(total minutes + 1)",
    y = "Number of players",
    title = "Total minutes played (log10 scale)"
  ) +
  theme_bw()


The vast majority of registered players did not accumulate any time on the server. They either haven't played at all or have spent so little time that their total rounds down to 0. In other words, many users register but do not meaningfully play or interact with the server. This confirms our initial thought: the distribution of play hours is highly right-skewed, which supports considering transformations such as a log transform in later analysis.

##### Log version
There is a relatively large contingent of casual players, i.e. those who have contributed and interacted with the server but not to an extreme degree. This group spans from about 0.8 to 2.5 on the log10 scale (anywhere from 6 minutes to 5 hours on the server). Most players who have actually interacted with the server still show low to moderate playtime. The tail of the graph shows a small number of highly dedicated players, ranging from 3 to 4 on the log10 scale (roughly 17 to 167 hours). This very small but very dedicated player base produces the long, thin tail in the data.
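The conversion between positions on the log10 axis and actual playtime can be worked out directly. A minimal sketch, assuming the axis shows log10(minutes + 1) as in the plotting code; the helper name `log_to_minutes` is hypothetical.

```r
# Convert a value on the log10(minutes + 1) axis back to minutes
log_to_minutes <- function(l) 10^l - 1

l <- c(0.8, 2.5, 3, 4)
data.frame(
  log10_value = l,
  minutes     = round(log_to_minutes(l), 1),
  hours       = round(log_to_minutes(l) / 60, 1)
)
```

Each unit on this axis is a tenfold change in playtime, which is why the handful of heaviest players no longer stretches the plot.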


In [None]:





# 5. Do subscribers actually play more? hint: yes lol
# Boxplot of playtime split by subscription status.

dat |>
  ggplot(aes(x = subscribe,
             y = log10(minutes + 1),
             fill = subscribe)) +
  geom_boxplot(alpha = 0.6, outlier.alpha = 0.6) +
  scale_fill_manual(values = c("FALSE" = "grey70", "TRUE" = "skyblue")) +
  labs(
    x = "Subscribed?",
    y = "Total minutes played (log10 scale)",
    title = "Playtime by subscription status"
  ) +
  theme_bw() +
  theme(legend.position = "none")


# 6. Playtime by experience tier 

dat |>
  ggplot(aes(x = experience,
             y = log10(minutes + 1),
             fill = experience)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.6) +
  labs(
    x = "Experience tier",
    y = "Total minutes played (log10 scale)",
    title = "Playtime by experience tier"
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "none")


# 7. What do individual sessions look like? 
# Distribution of single-session lengths in minutes.

sessions |>
  ggplot(aes(x = duration_min)) +
  geom_histogram(bins = 40) +
  labs(
    x = "Session duration (minutes)",
    y = "Number of sessions",
    title = "Distribution of session lengths"
  ) +
  theme_bw()


# 8. Zoomed-in view of “normal” sessions 
# Cuts off extreme marathons so we can see the typical range better.

sessions |>
  filter(duration_min <= 180) |>  # keep sessions up to 3 hours
  ggplot(aes(x = duration_min)) +
  geom_histogram(binwidth = 10, boundary = 0) +
  labs(
    x = "Session duration (minutes)",
    y = "Number of sessions",
    title = "Session lengths up to 3 hours"
  ) +
  theme_bw()


### Approach

#### Proposed Method

To investigate whether age predicts experience level and hours played, we considered the following approaches:

<!-- Guys, maybe we use KNN, but the only issue there is that experience level has an ordering, so an ordinal logistic regression is the most logical even though we never had this in class -->
* Ordinal logistic regression for predicting experience level (ordered categories: Beginner -> Pro).

* Linear regression for predicting total hours played (log transformed to reduce skewness).

* KNN regression as an alternative non-parametric approach for both outcomes to see if local patterns differ from parametric models.

This approach allows us to compare parametric and non-parametric models and to see which captures the relationships most effectively.

#### Why This Method Is Appropriate

* Ordinal logistic regression respects the natural order in experience levels, making it suitable for ordinal outcomes.

* Linear regression is appropriate for continuous numeric outcomes like played hours, especially after a log transformation to handle skewness.

* KNN regression is non-parametric so it makes no assumptions about the functional form between age and the outcomes.
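Since ordinal logistic regression is proposed here but not fitted later in this report, a rough sketch of what it could look like may help. This is an assumption-laden illustration, not the analysis itself: it uses `polr()` from the MASS package (called as `MASS::polr` so that attaching MASS does not mask `dplyr::select`), and the data are simulated, not the PLAICraft data. The seed, sample size and slope are arbitrary choices.

```r
set.seed(2024)  # arbitrary seed, only for a repeatable illustration

# Simulated data (NOT the PLAICraft data): older players drift to higher tiers
n      <- 300
age    <- round(runif(n, min = 10, max = 50))
latent <- 0.1 * age + rlogis(n)
tiers  <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
experience <- cut(latent,
                  breaks = c(-Inf, 1.5, 2.5, 3.5, 4.5, Inf),
                  labels = tiers,
                  ordered_result = TRUE)
sim <- data.frame(age = age, experience = experience)

# Proportional-odds (ordinal logistic) model: one slope for age,
# plus one cut-point between each pair of adjacent tiers
fit <- MASS::polr(experience ~ age, data = sim, Hess = TRUE)

coef(fit)   # slope for age on the log-odds scale
fit$zeta    # the four estimated cut-points

# Predicted tier probabilities for two hypothetical ages
probs <- predict(fit, newdata = data.frame(age = c(15, 40)), type = "probs")
probs
```

The single slope reflects the proportional-odds assumption: age shifts the odds of being in a higher tier by the same factor at every cut-point.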

#### Assumptions

##### General Assumptions About Our Data:
* We are assuming that data collected from the PLAICraft server is representative of the player base we will analyze, and that the sample of 196 players accurately reflects the entire population of interest.
* Each player and their respective data in the players.csv dataset is unique and independent of other players, and each player's age or experience level is not statistically dependent on any other player.
* All values within the dataset are recorded accurately, and excluding any NAs, the rest of the recorded data is accurate and not manipulated in any way.

##### Linear Regression Assumptions:
* We are assuming that there is a linear relationship between age (the predictor) and log-transformed played_hours (the outcome).
* We are assuming that the residuals are approximately normally distributed with roughly constant spread across all age groups, so that no single age group (or small set of age groups) accounts for most of the large residuals or outliers. Because played_hours is right-skewed, we apply a log transformation to bring the data closer to this assumption.
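These assumptions are usually checked with residual plots after fitting. The sketch below shows the two standard diagnostics in base R on simulated stand-in data; the variables `age` and `log_h`, the seed and the coefficients are all hypothetical.

```r
set.seed(1)  # arbitrary seed for a repeatable illustration

# Simulated stand-in data (not the PLAICraft data)
age   <- round(runif(100, min = 10, max = 50))
log_h <- 1.5 - 0.01 * age + rnorm(100, sd = 1)

fit <- lm(log_h ~ age)
res <- resid(fit)

# Residuals vs fitted: a roughly even band around zero supports
# linearity and constant spread
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line support approximate normality
qqnorm(res)
qqline(res)
```

A funnel shape in the first plot or strong curvature in the second would suggest the assumptions do not hold well.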

#####  Missing Data Values (and zeros) Assumptions:
* As previously mentioned, NAs, specifically those found in the age of players, can be excluded from the data without significant impact on the overall trends, as they are sparse and their removal shouldn't introduce meaningful bias.
* We are assuming that players with played_hours of 0.0 are valid observations, as they likely joined the study but never played. Because the linear regression uses the log transformation log(played_hours + 1), these values of 0.0 remain defined (they map to 0) and won't distort the fit drastically.
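The effect of the `+ 1` inside the log can be seen on a few made-up values. Base R's `log1p()` computes log(x + 1) and `expm1()` inverts it; the vector `played_hours` below is hypothetical.

```r
played_hours <- c(0, 0.5, 2, 150)   # hypothetical values

# log1p(x) computes log(x + 1), so zero hours stay finite (and equal to 0)
log_played_hours <- log1p(played_hours)

# expm1() is the inverse, useful for reporting predictions back in hours
back <- expm1(log_played_hours)

log(0)                          # -Inf: why a plain log would not work here
all.equal(back, played_hours)   # TRUE
```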

### Linear Regression and K-NN Regression (Predicting Time Played)

Moving on, we used a linear regression model and a K-NN regression model to predict the total time played based on the player's age. To account for the skew in Figure 1, we used the log-transformed version of `played_hours`, named `log_played_hours` and calculated as `log(played_hours + 1)`.

In [None]:
players_split <- initial_split(players, prop = 0.75, strata = log_played_hours)   # stratify on log_played_hours instead of played_hours to account for the 0.0 values
players_train <- training(players_split)
players_test <- testing(players_split)
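The split above draws rows at random, so the test-set metrics reported below will vary from run to run unless the random number generator is seeded first. The base R sketch below illustrates the idea with `sample()`; the seed value 123 and the size `n` are arbitrary, and in the tidymodels workflow the equivalent step would be a `set.seed()` call placed before `initial_split()`.

```r
# Hypothetical stand-in: 20 row indices to split 75/25
n <- 20

set.seed(123)                        # arbitrary seed value
train_a <- sample(n, size = 0.75 * n)

set.seed(123)                        # same seed, same draw
train_b <- sample(n, size = 0.75 * n)

identical(train_a, train_b)          # TRUE

# Rows not drawn for training form the test set
test_rows <- setdiff(seq_len(n), train_a)
```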

In [None]:
# Linear Regression!
lm_spec <- linear_reg() |>
    set_engine("lm") |>
    set_mode("regression")

lm_recipe <- recipe(log_played_hours ~ age, data = players_train)

lm_fit <- workflow() |>
    add_recipe(lm_recipe) |>
    add_model(lm_spec) |>
    fit(data = players_train)

lm_fit

lm_test_results <- lm_fit |>
    predict(players_test) |>
    bind_cols(players_test) |>
    metrics(truth = log_played_hours, estimate = .pred)

lm_test_results

Here we see that the linear regression model's RMSPE on the test set is 1.2765, measured on the log(played_hours + 1) scale.
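For reference, RMSPE is the square root of the mean squared difference between held-out outcomes and predictions. A minimal sketch with made-up numbers (`truth` and `pred` are hypothetical, not output from the model above):

```r
# Hypothetical held-out outcomes and model predictions (log scale)
truth <- c(0.0, 0.7, 1.1, 3.0)
pred  <- c(0.5, 0.6, 0.9, 1.0)

# Root mean squared prediction error on the test set
rmspe <- sqrt(mean((truth - pred)^2))
rmspe
```

Because errors are squared before averaging, a single badly predicted heavy player (the last value here) contributes most of the total.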

In [None]:
# K-NN Regression!
knn_recipe <- recipe(log_played_hours ~ age, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_spec <- nearest_neighbor(neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("regression")

knn_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    fit(data = players_train)

knn_metrics <- predict(knn_fit, players_test) |>
    bind_cols(players_test) |>
    metrics(truth = log_played_hours, estimate = .pred)

knn_metrics

Here we see that the K-NN regression model's RMSPE on the test set is 1.3383, measured on the same log(played_hours + 1) scale.
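To make the K-NN prediction rule concrete, the sketch below implements a plain unweighted version by hand for a single predictor. This is a simplification for illustration: the `kknn` engine used above may weight neighbours by distance, so its predictions need not match this exactly. The function `knn_predict` and the data are hypothetical.

```r
# Unweighted k-nearest-neighbour regression with a single predictor
knn_predict <- function(x_train, y_train, x_new, k = 5) {
  sapply(x_new, function(x0) {
    nearest <- order(abs(x_train - x0))[seq_len(k)]  # indices of the k closest ages
    mean(y_train[nearest])                           # average their outcomes
  })
}

# Hypothetical training data
age   <- c(15, 16, 17, 18, 19, 20, 30, 40)
log_h <- c(0.0, 0.2, 0.4, 1.0, 1.4, 3.0, 0.1, 0.3)

p <- knn_predict(age, log_h, x_new = c(17, 35), k = 3)
p
```

With few older players, a prediction at age 35 has to reach back to age 20 for its third neighbour, which is one reason K-NN estimates are unreliable in sparse regions of the predictor.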

In [None]:
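# Plotting Linear and K-NN Regression Models!
# Use one grid point per year of age so the K-NN curve shows its local steps;
# with only the minimum and maximum age, both models would plot as straight lines.
age_prediction_grid <- tibble(
    age = seq(min(players$age), max(players$age), by = 1))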

lm_preds <- lm_fit |>
    predict(age_prediction_grid) |>
    bind_cols(age_prediction_grid) |>
    mutate(Model = "Linear Regression")

knn_preds <- knn_fit |>
    predict(age_prediction_grid) |>
    bind_cols(age_prediction_grid)|>
    mutate(Model = "KNN")

options(repr.plot.width = 14, repr.plot.height = 7)
regression_plot <- bind_rows(knn_preds, lm_preds) |>
    ggplot(aes(x = age, y = .pred, color = Model)) +
    geom_point(data = players_train, aes(y = log_played_hours), alpha = 0.4, color = 'Black') +
    geom_line(size = 1.5) +
    labs(x = "Age (years)", y = "Log(Played Hours + 1)", title = "Figure 2: Linear and K-NN Regression Plot") +
    theme(text = element_text(size = 20))

regression_plot

## Discussion

In [None]:
# variable dictionary (just to keep track of what's what) 

var_dict <- tibble(
  table    = c(rep("players", 7), rep("sessions", 5)),
  variable = c(
    "experience","subscribe","hashed_email","played_hours","name","gender","age",
    "hashed_email","start_time","end_time","original_start_time","original_end_time"
  ),
  type = c(
    "factor","logical","id","numeric","string","factor","numeric",
    "id","string","string","numeric","numeric"
  ),
  meaning = c(
    "Experience tier","Has subscription","Unique player key","Total lifetime hours",
    "Player name","Self-reported gender","Age in years",
    "Unique player key","Session start (d/m/Y H:M)","Session end (d/m/Y H:M)",
    "Start epoch (ms)","End epoch (ms)"
  )
)

var_dict


## Gitlink

https://github.com/Aki175/Project-Final-Report