## Group number: 26. 
### Group members: Ara Kwon, Anastasija Lagodzinska, Nihat Mansurov, Taewoo Kim

# Title: Relationship Between Player Experience and Hours Played

# 1. Introduction

## Background information

TODO

## Research question

**Broad question:** ???  
**Specific question:** Can `experience` predict `played_hours` in the `players` dataset?

## Dataset Description

### Dataset summary  
The data were generated by a research group in Computer Science at UBC, led by Frank Wood.  
The goal of the research project is to enable advanced AI research by analyzing players' actions on a Minecraft server (PLAICraft).
The data were collected from people who signed up and played on the PLAICraft server. 
#### Players dataset summary    
Contains a list of all unique players and data about each player.   
**Number of observations**: 196   
**Number of variables**: 7  
**Summary statistics**:
|                            |Average    |Min    |Max|
|----------------------------|-----------|-------|-------|
|**Total time played (in hours)**|5.85       |0      |223.10|
|**User's age**                 |21         |9      |58|    

**Variables**:  
- `experience`(character) - User's experience level. Five categories: Pro(professional player), Veteran(plays for a long time), Amateur, Regular(frequent player) and Beginner.
- `subscribe`(logical) - User's subscription status to a game-related newsletter.
- `hashedEmail`(character) - Encoded user's email.
- `played_hours`(double) - Total hours played by user.
- `name`(character) - User's first name.
- `gender`(character) - User's gender. Seven categories: Male, Female, Non-binary, Prefer not to say, Agender, Two-Spirited, Other.
- `Age`(integer) - User's age.  
**Dataset issues**:
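  - inconsistent column names (e.g. `hashedEmail`, `Age` and `played_hours` each use a different naming style)
  - missing age values
  - categorical values (`experience`, `gender`) stored as characters rather than factors
  - limited precision of `played_hours` (stored in hours rather than minutes)  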
#### Sessions dataset summary
Contains a list of individual play sessions by each player and data about the session.  
**Number of observations**: 1535   
**Number of variables**: 5  
**Observation period**: 06/04/2024 - 26/09/2024   
**Variables**:  
- `hashedEmail`(character) - Encoded user's email.
- `start_time`(character) - Formatted date and time of the player's game session start.
- `end_time`(character) - Formatted date and time of the player's game session end.
- `original_start_time`(double) - Date and time of the player's game session start stored as a number.
- `original_end_time`(double) - Date and time of the player's game session end stored as a number.  
**Dataset issues**:
  - inconsistent column names (`hashedEmail` does not follow the naming style of the other columns)
  - missing time values
  - `start_time` and `end_time` stored as character values, not in a datetime format

# 2. Methods & Results

In [1]:
# Load the libraries used in this report
library(tidyverse)
library(lubridate)
library(dplyr)
# Set the maximum number of rows displayed for a tibble
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Loading Data

Loading players dataset:

In [2]:
url_players <- "https://raw.githubusercontent.com/ALagodzinska/Group26-FinalReport/refs/heads/main/data/players.csv"

players <- read_delim(url_players, delim = ",")
# players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Loading sessions dataset:

In [3]:
url_sessions <- "https://raw.githubusercontent.com/ALagodzinska/Group26-FinalReport/refs/heads/main/data/sessions.csv"

sessions <- read_delim(url_sessions, delim = ",")
# sessions

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Cleaning and Wrangling Data

Cleaning players:

In [4]:
# Converting experience and gender columns from char to factor.
players_clean <- players |> mutate(experience = as_factor(experience), gender = as_factor(gender))

# Fill missing age values with mean age value.
mean_age <- players |>
    summarize(mean_age = mean(Age, na.rm = TRUE)) |>
    round() |>
    pull()

players_clean <- players_clean |> 
    mutate(Age = if_else(is.na(Age), mean_age, Age))

# Create consistent column names
players_clean <- players_clean |> rename(is_subscribed = subscribe, hashed_email = hashedEmail, age = Age) |>
    select(-name)

players_clean

experience,is_subscribed,hashed_email,played_hours,gender,age
<fct>,<lgl>,<chr>,<dbl>,<fct>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Male,17
⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb,0.0,Prefer not to say,57
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Other,21
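
The mean imputation used above can be illustrated on a small example. The sketch below is not part of the analysis: `toy`, `toy_mean_age`, `toy_filled` and the `age_was_missing` flag are hypothetical names, and the ages are made up. It shows the same `if_else()` pattern and, as one possible addition, keeps a flag for the rows that were filled in so they can be told apart later.

```r
library(dplyr)

# Hypothetical toy data: two of the five ages are missing.
toy <- tibble(Age = c(17, NA, 21, 30, NA))

# Rounded mean of the observed ages (17, 21 and 30), as in the cleaning step.
toy_mean_age <- round(mean(toy$Age, na.rm = TRUE))

toy_filled <- toy |>
    mutate(age_was_missing = is.na(Age),                  # flag rows before filling
           Age = if_else(is.na(Age), toy_mean_age, Age))  # replace NA with the mean

toy_filled
```

Because every missing age receives the same value, the imputed rows pull the age distribution toward its mean, which is worth keeping in mind when reading age summaries.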


Cleaning sessions:

In [5]:
# Convert start time and end time into a datetime format using lubridate.
sessions_clean <- sessions |>
    mutate(start_time = dmy_hm(start_time),
           end_time = dmy_hm(end_time))

# Calculate the length of each session in minutes and in hours.
# The unit is requested explicitly because subtracting two date-times picks a unit automatically.
sessions_clean <- sessions_clean |>
    mutate(playtime_in_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))) |>
    mutate(playtime_in_hours = round(playtime_in_minutes/60, 1)) 

# Create consistent column names and keep only the user email and the playtime columns
sessions_clean <- sessions_clean |> rename(hashed_email = hashedEmail) |>
    select(hashed_email, playtime_in_minutes, playtime_in_hours)

# Join sessions with the players dataset by hashed_email and remove rows with missing playtime.
sessions_joined <- inner_join(sessions_clean, players_clean, by = "hashed_email") |>
    select(playtime_in_minutes, playtime_in_hours, experience) |>
    filter(!is.na(playtime_in_minutes))

# Remove hashed email from players table as it is no longer needed.
players_clean <- players_clean |> select(-hashed_email)

# Contains each session data that includes time played and player's experience level.
sessions_joined

playtime_in_minutes,playtime_in_hours,experience
<dbl>,<dbl>,<fct>
12,0.2,Regular
13,0.2,Amateur
23,0.4,Amateur
⋮,⋮,⋮
21,0.3,Amateur
7,0.1,Amateur
19,0.3,Amateur
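
The session length calculation depends on the unit of the time difference. Subtracting two date-times returns a `difftime` whose unit R chooses from the data, so stating the unit keeps the result in minutes regardless of the values. A minimal sketch, assuming timestamps in the same day/month/year hour:minute format as the sessions data (the two sessions below are made up):

```r
library(lubridate)

# Hypothetical session timestamps; the second session runs past midnight.
start <- dmy_hm(c("30/06/2024 18:12", "17/06/2024 23:33"))
end   <- dmy_hm(c("30/06/2024 18:24", "18/06/2024 01:10"))

# Ask for minutes explicitly instead of relying on the automatic unit.
minutes <- as.numeric(difftime(end, start, units = "mins"))

# Convert to hours, rounded to one decimal place as in the cleaning step.
hours <- round(minutes / 60, 1)

minutes
hours
```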


## Summary of the datasets

### Players summary 

In [6]:
summary(players_clean)

    experience is_subscribed    played_hours                   gender   
 Pro     :14   Mode :logical   Min.   :  0.000   Male             :124  
 Veteran :48   FALSE:52        1st Qu.:  0.000   Female           : 37  
 Amateur :63   TRUE :144       Median :  0.100   Non-binary       : 15  
 Regular :36                   Mean   :  5.846   Prefer not to say: 11  
 Beginner:35                   3rd Qu.:  0.600   Agender          :  2  
                               Max.   :223.100   Two-Spirited     :  6  
                                                 Other            :  1  
      age       
 Min.   : 9.00  
 1st Qu.:17.00  
 Median :19.50  
 Mean   :21.14  
 3rd Qu.:22.25  
 Max.   :58.00  
                

In [7]:
# Average played_hours, subscription proportion, average age and prevailing gender for players with different experience levels.
summary_by_experience <- players_clean |>
    group_by(experience) |>
    summarise(mean_played_hours = round(mean(played_hours), 1),
              subscription_proportion = round(mean(is_subscribed), 2),
              mean_age = round(mean(age), 2),
              prevailing_gender = names(sort(table(gender), decreasing = TRUE)[1]))

# Finding the player who played the most hours, to see their experience level.
most_hours_player <- players_clean |> slice_max(played_hours)

#### Summary of the cleaned players dataset by experience level

In [8]:
summary_by_experience

experience,mean_played_hours,subscription_proportion,mean_age,prevailing_gender
<fct>,<dbl>,<dbl>,<dbl>,<chr>
Pro,2.6,0.71,17.21,Male
Veteran,0.6,0.69,20.96,Male
Amateur,6.0,0.71,21.37,Male
Regular,18.2,0.81,22.0,Male
Beginner,1.2,0.77,21.66,Male
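
The `prevailing_gender` column is built with a compact base R idiom. The sketch below unpacks it step by step on a made-up vector (`toy_gender`, `counts`, `ranked` and `prevailing` are hypothetical names used only for illustration):

```r
# Hypothetical gender values for one experience group.
toy_gender <- factor(c("Male", "Female", "Male", "Non-binary", "Male", "Female"))

counts <- table(toy_gender)                # frequency of each level
ranked <- sort(counts, decreasing = TRUE)  # most frequent level first
prevailing <- names(ranked[1])             # label of the top entry

prevailing
```

Only the first entry is kept, so if two levels are tied for the highest count, just one of them is reported.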


##### Player with most hours

In [9]:
most_hours_player

experience,is_subscribed,played_hours,gender,age
<fct>,<lgl>,<dbl>,<fct>,<dbl>
Regular,TRUE,223.1,Male,17


### Sessions summary

In [10]:
summary(sessions_joined)

 playtime_in_minutes playtime_in_hours    experience 
 Min.   :  3.00      Min.   :0.0000    Pro     : 39  
 1st Qu.:  9.00      1st Qu.:0.1000    Veteran : 51  
 Median : 30.00      Median :0.5000    Amateur :819  
 Mean   : 50.86      Mean   :0.8467    Regular :518  
 3rd Qu.: 73.00      3rd Qu.:1.2000    Beginner:106  
 Max.   :259.00      Max.   :4.3000                  

In [11]:
# Mean, maximum and minimum session playtime (in minutes) by player's experience
playtime_by_experience <- sessions_joined |>
    group_by(experience) |>
    summarise(mean_minutes = round(mean(playtime_in_minutes), 1),
              max_minutes = max(playtime_in_minutes),
              min_minutes = min(playtime_in_minutes))
playtime_by_experience

experience,mean_minutes,max_minutes,min_minutes
<fct>,<dbl>,<dbl>,<dbl>
Pro,62.6,211,7
Veteran,42.6,180,4
Amateur,33.0,255,3
Regular,83.3,259,4
Beginner,29.6,172,4
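
The number of sessions differs widely between experience levels in `summary(sessions_joined)` (39 Pro sessions against 819 Amateur sessions), so each mean is easier to interpret next to its group size. A minimal sketch on hypothetical toy data (`toy_sessions` and `toy_summary` are illustrative names), adding `n()` to the same kind of grouped summary:

```r
library(dplyr)

# Hypothetical sessions: one Pro session and three Amateur sessions.
toy_sessions <- tibble(
    experience = factor(c("Pro", "Amateur", "Amateur", "Amateur")),
    playtime_in_minutes = c(60, 10, 20, 30)
)

toy_summary <- toy_sessions |>
    group_by(experience) |>
    summarise(n_sessions = n(),                                    # sessions per level
              mean_minutes = round(mean(playtime_in_minutes), 1))  # mean length per level

toy_summary
```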


## Exploratory visualizations

TODO