## Group number: 26. 
### Group members: Ara Kwon, Anastasija Lagodzinska, Nihat Mansurov, Taewoo Kim

# Title: Relationship Between Player Experience and Hours Played

# 1. Introduction

## Background information

Data collection and analysis are valuable tools for game development and improvement. Analyzing player characteristics and their influence on player behavior helps identify patterns and create strategies for increasing game engagement and attracting more users. Since this report analyzes data from a game research project, using existing data to attract new players can help expand the project and strengthen the research.  
In this report, we want to determine whether there is a relationship between the total hours played by a user and their experience level. We are interested in whether we can predict the time a new user will spend playing the game from their experience level. This will help us determine which users are most active on the server and what playtime corresponds to each experience category. This requires regression analysis, as we want to predict a numerical value (hours) from a categorical value (experience level). We use KNN regression, which bases its prediction for a new observation on the closest existing observations, to predict the hours new players will spend on the server.  
This report shows how identifying patterns in known player characteristics can help researchers predict player behavior in the future.
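
To make the idea concrete, here is a minimal base R sketch of how a nearest-neighbour prediction behaves when the only predictor is a category. The data values and the helper `knn_predict_hours` are hypothetical, and the tie-breaking rule (original row order) is an assumption of this sketch rather than the behaviour of any particular K-NN library.

```r
# Toy data; the values are hypothetical and only illustrate the mechanics.
toy_players <- data.frame(
  experience = c("Pro", "Pro", "Amateur", "Amateur", "Amateur", "Regular"),
  played_hours = c(1.0, 3.0, 0.5, 2.5, 6.0, 20.0)
)

# Hypothetical helper: the distance is 0 for players in the same category and
# 1 otherwise, so neighbours are drawn from the same category first.
knn_predict_hours <- function(train, new_experience, k) {
  distance <- as.numeric(train$experience != new_experience)
  nearest <- order(distance)[seq_len(k)]  # ties keep their original row order
  mean(train$played_hours[nearest])
}

# Predicted hours for a new Amateur player, averaging the 3 nearest neighbours
knn_predict_hours(toy_players, "Amateur", k = 3)
```

With a single categorical predictor, every player in the same category is equally close, so for small `k` the prediction is simply an average over members of that category.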

## Research question

**Broad question:** Which "kinds" of players are most likely to contribute a large amount of data?   
**Specific question:** Can `experience` predict `played_hours` in the `players` dataset?

## Dataset Description

### Dataset summary  
The data were generated by a research group in Computer Science at UBC, led by Frank Wood.  
The goal of the game research project is to enable advanced AI research by analyzing players' actions on a Minecraft (PLAICraft) server.
The data were collected from people who signed up and played on the PLAICraft server. 
#### Players dataset summary    
Contains a list of all unique players and data about each player.   
**Number of observations**: 196   
**Number of variables**: 7  
**Summary statistics**:

| Variable | Average | Min | Max |
|---|---|---|---|
| **Total time played (in hours)** | 5.85 | 0 | 223.10 |
| **User's age** | 21 | 9 | 58 |

**Variables**:  
- `experience` (character) - User's experience level. Five categories: Pro (professional player), Veteran (has played for a long time), Amateur, Regular (frequent player) and Beginner.
- `subscribe` (logical) - User's subscription status to a game-related newsletter.
- `hashedEmail` (character) - User's encoded email.
- `played_hours` (double) - Total hours played by the user.
- `name` (character) - User's first name.
- `gender` (character) - User's gender. Seven categories: Male, Female, Non-binary, Prefer not to say, Agender, Two-Spirited, Other.
- `Age` (integer) - User's age.  
**Dataset issues**:
  - inconsistent column names
  - missing age values
  - factor values stored as characters
  - limited precision of played hours (stored in hours rather than minutes)  
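
The issues listed above can be checked with a few lines of base R. The sketch below runs on a small hypothetical stand-in for the players table (the values are invented), so it shows only the kind of check, not the actual results for this dataset.

```r
# Hypothetical stand-in for the players table; values are invented.
toy_players <- data.frame(
  experience = c("Pro", "Amateur", "Regular"),
  played_hours = c(0.3, 12.5, 0.0),
  Age = c(17, NA, 21),
  stringsAsFactors = FALSE
)

# Missing age values: count the NA entries.
missing_ages <- sum(is.na(toy_players$Age))

# Factor values stored as characters: inspect the class of each column.
column_types <- sapply(toy_players, class)

# Convert the category column to a factor once the issue is confirmed.
toy_players$experience <- factor(toy_players$experience)

missing_ages
column_types
```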
#### Sessions dataset summary
Contains a list of individual play sessions by each player and data about the session.  
**Number of observations**: 1535   
**Number of variables**: 5  
**Observation period**: 06/04/2024 - 26/09/2024   
**Variables**:  
- `hashedEmail`(character) - Encoded user's email.
- `start_time` (character) - Formatted date and time of the start of the player's game session.
- `end_time` (character) - Formatted date and time of the end of the player's game session.
- `original_start_time` (double) - Date and time of the start of the player's game session, stored as a number.
- `original_end_time` (double) - Date and time of the end of the player's game session, stored as a number.  
**Dataset issues**:
  - inconsistent column names
  - missing time values
  - `start_time` and `end_time` stored as character values, not in a datetime format
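
As an illustration of the last issue, the base R sketch below parses character timestamps and computes a session length in minutes. The timestamps are hypothetical, and the day/month/year hour:minute layout and the UTC time zone are assumptions made for the example. Requesting `units = "mins"` explicitly matters because plain subtraction of date-times chooses its units automatically.

```r
# Hypothetical session timestamps, assumed to be in day/month/year hour:minute form.
start_raw <- c("06/04/2024 10:15", "26/09/2024 23:50")
end_raw   <- c("06/04/2024 10:15", "27/09/2024 01:20")

# Parse the character values into date-time objects (time zone is an assumption).
start_time <- as.POSIXct(start_raw, format = "%d/%m/%Y %H:%M", tz = "UTC")
end_time   <- as.POSIXct(end_raw,   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Session length in minutes, with the units stated explicitly.
playtime_in_minutes <- as.numeric(difftime(end_time, start_time, units = "mins"))
playtime_in_minutes
```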

# 2. Methods & Results

In [None]:
# Adding all necessary libraries to the report
library(tidyverse)
library(lubridate)
library(dplyr)
library(cowplot)
library(RColorBrewer)
library(repr)
library(tidymodels)
source('cleanup.R')
# Setting the maximum rows displayed for a tibble
options(repr.matrix.max.rows = 6)

## Loading Data

Loading players dataset:

In [None]:
url_players <- "https://raw.githubusercontent.com/ALagodzinska/Group26-FinalReport/refs/heads/main/data/players.csv"

players <- read_delim(url_players, delim = ",")
# players

Loading sessions dataset:

In [None]:
url_sessions <- "https://raw.githubusercontent.com/ALagodzinska/Group26-FinalReport/refs/heads/main/data/sessions.csv"

sessions <- read_delim(url_sessions, delim = ",")
# sessions

## Cleaning and Wrangling Data

Cleaning players:

In [None]:
# Converting experience and gender columns from char to factor.
players_clean <- players |> mutate(experience = as_factor(experience), gender = as_factor(gender))

# Fill missing age values with mean age value.
mean_age <- players |>
    summarize(mean_age = mean(Age, na.rm = TRUE)) |>
    round() |>
    pull()

players_clean <- players_clean |> 
    mutate(Age = if_else(is.na(Age), mean_age, Age))

# Create consistent column names and remove name column as it is not needed for analysis
players_clean <- players_clean |> rename(is_subscribed = subscribe, hashed_email = hashedEmail, age = Age) |>
    select(-name)

# Contains clean players dataset
players_clean

Cleaning sessions:

In [None]:
# Convert start time and end time into a datetime format using lubridate.
sessions_clean <- sessions |>
    mutate(start_time = dmy_hm(start_time),
           end_time = dmy_hm(end_time))

# Calculate played minutes and played hours for each session; extract the start date from the date and time column.
# difftime() with explicit units guarantees minutes; plain subtraction picks its units automatically.
sessions_clean <- sessions_clean |>
    mutate(playtime_in_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))) |>
    mutate(playtime_in_hours = round(playtime_in_minutes/60, 1)) |>
    mutate(start_date = as_date(start_time))

# Create consistent column names and keep only the email, playtime and start date columns
sessions_clean <- sessions_clean |> rename(hashed_email = hashedEmail) |>
    select(hashed_email, playtime_in_minutes, playtime_in_hours, start_date)

# Join sessions with players dataset by hashed_email, remove na rows.
sessions_joined <- inner_join(sessions_clean, players_clean, by = "hashed_email") |>
    select(start_date, playtime_in_minutes, playtime_in_hours, experience) |>
    filter(!is.na(playtime_in_minutes))

# Remove hashed email from players table as it is no longer needed
players_clean <- players_clean |> select(-hashed_email)

# Contains each session data that includes time played and player's experience level
sessions_joined

## Summary of the datasets

### Players summary 

In [None]:
summary(players_clean)

In [None]:
# Average played_hours, subscription proportion, average age and prevailing gender for players with different experience levels.
summary_by_experience <- players_clean |>
    group_by(experience) |>
    summarise(number_of_players = n(),
              mean_played_hours = round(mean(played_hours), 1),
              subscription_proportion = round(mean(is_subscribed), 2),
              mean_age = round(mean(age), 2),
              prevailing_gender = names(sort(table(gender), decreasing = TRUE)[1]))

# Finding the player who played the most hours (the row includes their experience level).
most_hours_player <- players_clean |> slice_max(played_hours)

#### Summary for cleaned players dataset by players experience

In [None]:
summary_by_experience

##### Player with most hours

In [None]:
most_hours_player

### Sessions summary

In [None]:
summary(sessions_joined)

In [None]:
# Mean session playtime by player's experience
playtime_by_experience <- sessions_joined |>
    group_by(experience) |>
    summarise(mean_minutes = round(mean(playtime_in_minutes), 1),
              max_minutes = max(playtime_in_minutes),
              min_minutes = min(playtime_in_minutes))
playtime_by_experience

## Exploratory visualizations

### Distribution of total and average time played across player experience categories

In [None]:
# Creating data frame that is grouped by experience and stores total hours.
players_grouped_total <- players_clean |>
    group_by(experience) |>
    summarize(total_hours = sum(played_hours),
             mean_hours = mean(played_hours),
             count = n()) 

options(repr.plot.width = 15, repr.plot.height = 7) 

# To change color
# scale_fill_brewer(palette = 'Greens')

# Creating a bar plot that shows cumulative hours of all players grouped by experience type.
total_hours_by_experience_plot <- players_grouped_total |>
    ggplot(aes(x = experience, y = total_hours)) +
    geom_bar(stat = "identity" ) +
    labs(x = "User experience", y = "Total hours played") +
    ggtitle("Total hours played by player experience type") +
    scale_y_continuous(breaks = seq(0, 650,by = 50)) +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))
    

# Creating a point plot that shows average hours of all players grouped by experience type.
mean_hours_by_experience_plot <- players_grouped_total |>
    ggplot(aes(x = experience, y = mean_hours)) +
    geom_point(size = 5) +
    labs(x = "User experience", y = "Average hours played") +
    ggtitle("Average hours played by player experience type") +
    scale_y_continuous(breaks = seq(0, 20,by = 2)) +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))

# Creating a bar plot that shows the number of all players grouped by experience type.
count_by_experience_plot <- players_grouped_total |>
    ggplot(aes(x = experience, y = count)) +
    geom_bar(stat = "identity") +
    labs(x = "User experience", y = "Number of users") +
    ggtitle("The number of users by experience type") +
    scale_y_continuous(breaks = seq(0, 70,by = 10)) +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))

total_hours_by_experience_plot
plot_grid(mean_hours_by_experience_plot, count_by_experience_plot, nrow = 1)


**Insights:** The bar plot depicts the cumulative hours played by all users. Most of the hours played on the game server came from `Regular` players, who played around 650 hours in total. The next most active group is `Amateur`, with around 370 hours. The other three categories played less than 50 hours in total.  
The plot of mean hours played by experience level confirms that `Regular` players spend, on average, much more time playing than any other category. However, we need to keep in mind that the data is unbalanced: most of the users in our dataset are `Amateur` players, while the `Pro` category is underrepresented, which might influence the results of the analysis.
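
Because a group's total is its mean multiplied by its size, an unbalanced dataset can rank groups differently by total and by mean. The base R sketch below shows this on hypothetical values, not on the report's data.

```r
# Hypothetical hours: four Amateur players and a single Pro player.
toy_players <- data.frame(
  experience = c("Amateur", "Amateur", "Amateur", "Amateur", "Pro"),
  played_hours = c(1, 2, 3, 2, 6)
)

hours_by_group <- split(toy_players$played_hours, toy_players$experience)

total_hours  <- sapply(hours_by_group, sum)     # total = mean * count
mean_hours   <- sapply(hours_by_group, mean)
player_count <- sapply(hours_by_group, length)

data.frame(total_hours, mean_hours, player_count)
```

In this toy example the larger group has the higher total even though its mean is lower, which is why the totals, means and counts are worth reading together.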

### Exploring correlations between hours played, experience category and other variables

#### Total played hours by experience and subscription status

In [None]:
# Grouping players data by both experience and subscription
players_experience_subscription <- players_clean |>
     group_by(experience, is_subscribed) |>
     summarize(total_hours = sum(played_hours))

# Creating a bar plot that shows the number of total hours played by experience and subscription status.
total_hours_by_experience_subscription <- players_experience_subscription |>
    ggplot(aes(x = experience, y = total_hours, fill = is_subscribed)) +
    geom_bar(stat = "identity", position = "dodge") +
    labs(x = "User experience", y = "Total hours played (sqrt scale)", fill = "Has subscription") +
    ggtitle("Total hours played by player experience type and subscription") +
    scale_y_sqrt() +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))

total_hours_by_experience_subscription

**Insights:** The bar plot shows that the majority of hours played in every experience category came from players who are subscribed. The difference is most visible for `Regular` players, where subscribed users contribute most of the hours. `Beginner` players show the smallest difference between subscription statuses, suggesting that subscription may have less influence on the playtime of this type of player.

#### Total played hours by experience and gender

In [None]:
# Grouping players data by both experience and gender
players_experience_gender <- players_clean |>
    group_by(experience, gender) |>
    summarize(total_hours = sum(played_hours))

# Creating a bar plot that shows the proportion of total hours played by experience and gender.
total_hours_by_experience_gender <- players_experience_gender |>
    ggplot(aes(x = experience, y = total_hours, fill = gender)) +
    geom_bar(stat = "identity", position = "fill") +
    labs(x = "User experience", y = "Proportion of total hours played", fill = "Gender") +
    ggtitle("Proportion of total hours played by player experience type and gender") +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))

total_hours_by_experience_gender

**Insights:** The stacked bar plot shows the proportion of total hours by gender within each experience category. Male players contribute the most hours in the `Pro` and `Beginner` categories, whereas female players contribute the most hours in the `Amateur` category. The `Veteran` category shows the most gender diversity of all experience levels. Although most of the players are male, the distribution of hours played by gender varies across most experience levels. For instance, there are only 2 `Agender` users in the dataset, and they contributed almost half of all hours played in the `Veteran` group (which consists of 48 players). This shows that total played time differs between users and shows no consistent pattern by gender.  
\*Statistics from the `players` dataset summary.
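
The within-category proportions that a filled bar chart displays can also be computed directly. The following base R sketch uses hypothetical totals and divides each row by the total of its experience category.

```r
# Hypothetical total hours per experience category and gender.
toy_hours <- data.frame(
  experience = c("Veteran", "Veteran", "Veteran", "Pro", "Pro"),
  gender = c("Male", "Female", "Agender", "Male", "Female"),
  total_hours = c(10, 5, 15, 30, 10)
)

# ave() repeats each category's total on every row of that category,
# so the division gives each gender's share within its category.
toy_hours$share <- toy_hours$total_hours /
  ave(toy_hours$total_hours, toy_hours$experience, FUN = sum)

toy_hours
```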

#### Total played hours by age and experience 

In [None]:
# Creating a scatter plot that shows the relationship between user's age and hours played, colored by experience level.
hours_played_age_experience_plot <- players_clean |>
    ggplot(aes(x = age, y = played_hours, color = experience)) +
    geom_point(size = 3, alpha = 0.8) +
    scale_y_sqrt() +
    labs(x = "User's age", y = "Total hours played (sqrt scale)", color = "Experience") +
    ggtitle("Total hours played by player experience type and age") +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))

hours_played_age_experience_plot

**Insights:** The plot shows the distribution of total hours played by user's age. The majority of active players are in the 15-30 age group, and only a few players played more than 50 hours. All players who played more than 150 hours are classified as `Regular` players. The plot shows no specific pattern between experience level, age and hours played. However, it does show that the `Regular` group contributed the most hours because of the three players who played longer than 150 hours. These values are outliers and do not represent the behaviour of all users in this category. 
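
A quick way to see how much a few extreme players can pull up a group average is to compare the mean with the median. The values below are hypothetical and chosen only to show the effect.

```r
# Hypothetical hours for one experience group, with a single extreme player.
regular_hours <- c(0.1, 0.5, 1.0, 2.0, 160.0)

group_mean   <- mean(regular_hours)    # pulled up by the extreme value
group_median <- median(regular_hours)  # reflects the typical player

c(mean = group_mean, median = group_median)
```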

#### Distribution of time played per session

In [None]:
# Creating a histogram that shows the distribution of playtime per session in minutes.
minute_session_plot <- sessions_joined |>
    ggplot(aes(x = playtime_in_minutes)) +
    geom_histogram(binwidth = 15) +
    scale_x_continuous(breaks = seq(0, 280,by = 30)) +
    labs(x = "Time played in minutes", y = "Number of sessions") +
    ggtitle("Distribution of time played in minutes per session") +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))

# Creating a plot that shows the number of sessions across different experience categories.
count_by_session_plot <- sessions_joined |>
    ggplot(aes(x = experience)) +
    geom_bar(stat = "count") +
    labs(x = "User experience", y = "Number of sessions") +
    ggtitle("The number of sessions by experience type") +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))
    
plot_grid(minute_session_plot , count_by_session_plot, nrow = 1)

**Insights:** The histogram shows that most sessions last less than 100 minutes, with a peak at around 15 minutes, indicating that users prefer to play casually for short periods. The bar plot shows that the most frequent players are in the `Amateur` and `Regular` categories, meaning that the data is mostly contributed by moderately experienced players.

#### Sessions and hours played per week by experience

In [None]:
# Counting sessions and summing hours played per week and experience level.
# Note: n() counts sessions, so a player with several sessions in a week is counted more than once.
sessions_by_week <- sessions_joined |>
    mutate(week = week(start_date)) |>
    group_by(week, experience) |>
    summarize(number_of_sessions = n(),
             total_played_hours = sum(playtime_in_hours))

# Creating a stacked bar plot of the number of sessions per week by experience level.
players_by_week_experience_plot <- sessions_by_week |>
    ggplot(aes(x = week, y = number_of_sessions, fill = experience)) +
    geom_bar(stat = "identity") +
    labs(x = "Week", y = "Number of sessions", fill = "Experience") +
    ggtitle("Total number of sessions per week by player experience") +
    scale_y_continuous(breaks = seq(0, 200,by = 20)) +
    scale_x_continuous(breaks = seq(10, 40,by = 3)) +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))

hours_by_week_experience_plot <- sessions_by_week |>
    ggplot(aes(x = week, y = total_played_hours, color = experience)) +
    geom_point() +
    geom_line() +
    labs(x = "Week", y = "Total hours played", color = "Experience") +
    ggtitle("Total hours played per week by player experience") +
    scale_y_sqrt() +
    scale_x_continuous(breaks = seq(10, 40,by = 3)) +
    theme(text = element_text(size = 16), plot.title = element_text(size = 18))

options(repr.plot.width = 20, repr.plot.height = 9) 
players_by_week_experience_plot
hours_by_week_experience_plot 

**Insights:** These plots display player activity throughout the observation period, highlighting the change in the number of sessions and hours played for each experience level. We can see that peak activity occurred in weeks 26-27, with most of the hours contributed by `Regular` players at around 60 hours per week during this period. However, during the same weeks the largest number of sessions came from the `Amateur` category, showing that the time played per session is not consistent across all users.
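
The week numbers on the x-axis come from lubridate's `week()`, which counts the complete seven-day periods since January 1st and adds one. The base R sketch below reproduces that definition with a hypothetical helper, `week_of_year`, and applies it to the first and last dates of the observation period (read as day/month/year).

```r
# Hypothetical helper mirroring the definition used by lubridate::week():
# complete seven-day periods since January 1st, plus one.
week_of_year <- function(date) {
  day_of_year <- as.integer(format(as.Date(date), "%j"))
  (day_of_year - 1) %/% 7 + 1
}

# First and last dates of the observation period, written as year-month-day.
observation_weeks <- week_of_year(c("2024-04-06", "2024-09-26"))
observation_weeks
```

These dates fall in weeks 14 and 39, which matches the range covered by the x-axis breaks in the plots.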

## Data Analysis

In [None]:
# Splitting the data for training and testing
# Fixing the random seed makes the split reproducible; the seed value itself is arbitrary.
set.seed(2024)
players_split <- initial_split(players_clean, prop = 0.75, strata = played_hours)
players_training <- training(players_split)
players_testing <- testing(players_split)

In [None]:
# Setting K-NN Regression specifications
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

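# Creating recipe
# `experience` is a factor, so it is converted to numeric dummy variables first;
# step_scale() and step_center() only accept numeric columns.
players_recipe <- recipe(played_hours ~ experience, data = players_training) |>
    step_dummy(all_nominal_predictors()) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())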

players_recipe
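
K-NN measures distances between numeric values, so a categorical predictor such as `experience` has to be represented numerically before distances can be computed. One common representation is dummy (indicator) encoding. The base R sketch below illustrates it on hypothetical rows using `model.matrix()`. It is meant as an illustration of the encoding, not as a description of what the tidymodels pipeline does internally.

```r
# Hypothetical rows; the level order is set explicitly so "Amateur" is the baseline.
toy_players <- data.frame(
  experience = factor(c("Pro", "Amateur", "Regular", "Amateur"),
                      levels = c("Amateur", "Pro", "Regular"))
)

# model.matrix() creates an intercept plus one 0/1 column per non-baseline level;
# dropping the first column removes the intercept.
dummies <- model.matrix(~ experience, data = toy_players)[, -1, drop = FALSE]
dummies
```

Rows in the baseline category have zeros in every indicator column, and two players in the same category always have identical rows, so their distance is zero.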

In [None]:
# V-fold to tune the optimal k for K-NN Regression
players_vfold <- vfold_cv(players_training, v = 5, strata = played_hours)

# Workflow to combine model specification and recipe
players_workflow <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec)

players_workflow
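
For readers unfamiliar with cross-validation, the base R sketch below shows the idea behind a 5-fold split: each fold is held out once, a model is fitted on the remaining folds, and the error on the held-out fold is recorded. The data are hypothetical, the seed is arbitrary, the folds are not stratified, and a simple category-mean predictor stands in for K-NN, so this is only a conceptual sketch of the procedure.

```r
set.seed(1)  # arbitrary seed, only to make the toy example repeatable

# Hypothetical data: two experience categories with ten players each.
toy_players <- data.frame(
  experience = rep(c("Amateur", "Regular"), each = 10),
  played_hours = c(1:10, 11:20)
)

# Randomly assign every row to one of five equally sized folds.
n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(toy_players)))

# For each fold: fit on the other folds, then measure RMSE on the held-out fold.
fold_rmse <- sapply(seq_len(n_folds), function(fold) {
  train <- toy_players[fold_id != fold, ]
  valid <- toy_players[fold_id == fold, ]
  # Stand-in model: predict the training mean of each category.
  category_means <- sapply(split(train$played_hours, train$experience), mean)
  predicted <- category_means[as.character(valid$experience)]
  sqrt(mean((valid$played_hours - predicted)^2))
})

# The cross-validated error estimate is the average over the folds.
mean(fold_rmse)
```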

In [None]:
# Creating a sequence of k values for testing
