In [None]:
# Load libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(lubridate)

# Limit how many rows of a data frame are displayed
options(repr.matrix.max.rows = 6)

# Project report
### Group 25: Nelson, Will, Caroline

### Introduction

This report looks at two data sets from a Minecraft server set up by a computer science research group at the University of British Columbia. The researchers collect data about how people play video games by recording players’ actions on the server. To run the project, they need to attract more players to the server and manage its resources. One recruitment method is to advertise in a game-related newsletter, but because not every player on the server subscribes to it, the newsletter's reach is limited.

Here, we will try to answer the following question: 
_Can we predict whether a player will subscribe to the game’s newsletter based on their age, total hours played and average session length?_

To answer this question, we use both the players and sessions data sets. Some details of the two data sets are listed below:

#### Players.csv

- Rows: 196
- Columns: 7

Variables:

- `experience`: categorical variable giving each player's experience level
- `subscribe`: categorical variable reporting newsletter subscription status
- `hashedEmail`: categorical variable containing each player's hashed email
- `played_hours`: double containing each player's total hours played
- `name`: categorical variable containing each player's first name
- `gender`: categorical variable containing each player's gender
- `Age`: double giving each player's age

Summary statistics:

- There are 196 players on the server in total.
- 124 players are male, 37 are female, and 33 identify as other or did not state their gender.
- 35 players are beginners, 35 are regulars, 63 are amateurs, 48 are veterans, and 13 are pros.
- 144 players are subscribed to the newsletter, while 52 players are not.

Note: name, gender, and experience level are likely self-reported, so they may be inaccurate for some observations. Some cells have missing values.
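
Counts like the ones above can be reproduced with `count()` and a per-column tally of missing values. The following is a minimal sketch on a small made-up tibble (`toy_players` and its values are hypothetical stand-ins, not the real `players.csv`); it also assumes `subscribe` is read in as a logical column.

```r
library(dplyr)
library(tibble)

# Toy stand-in for players_data (hypothetical values, not the real data)
toy_players <- tibble(
  experience = c("Pro", "Amateur", "Amateur", "Veteran"),
  subscribe  = c(TRUE, TRUE, FALSE, TRUE),   # assumed to be logical
  gender     = c("Male", "Female", "Male", NA),
  Age        = c(21, NA, 17, 30)
)

# Number of players at each experience level
experience_counts <- toy_players |> count(experience)

# Number of subscribed vs. unsubscribed players
subscribe_counts <- toy_players |> count(subscribe)

# Number of missing values in each column
missing_counts <- toy_players |>
  summarize(across(everything(), ~ sum(is.na(.x))))

experience_counts
subscribe_counts
missing_counts
```

Running the same three steps on `players_data` would give the real counts and show which columns contain missing cells.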
#### Sessions.csv

- Rows: 1535
- Columns: 5

Variables:

- `hashedEmail`: same as in the players data
- `start_time`, `end_time`: character-formatted session start and end times
- `original_start_time`, `original_end_time`: doubles containing each session’s start and end times in milliseconds, as stored by the server. These columns appear to contain identical values for some observations, which is possibly a data issue.

Summary statistics:

- Average sessions per player: 12.26
- Most sessions by one player: 310

Note: session counts per player appear to be heavily skewed by a few heavy users.
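
The per-player session statistics can be computed by counting rows per `hashedEmail` and then summarizing those counts. This is a minimal sketch on made-up data (`toy_sessions` is a hypothetical stand-in for `sessions.csv`); comparing the mean with the median is one simple way to see the skew mentioned above.

```r
library(dplyr)
library(tibble)

# Toy stand-in for sessions_data: one row per session (hypothetical values)
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b", "c", "c")
)

# Number of sessions recorded for each player
sessions_per_player <- toy_sessions |>
  count(hashedEmail, name = "n_sessions")

# Mean, median and maximum number of sessions per player
session_summary <- sessions_per_player |>
  summarize(
    mean_sessions   = mean(n_sessions),
    median_sessions = median(n_sessions),
    max_sessions    = max(n_sessions)
  )

session_summary
```

A mean well above the median would indicate that a few heavy users are pulling the average up.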


In [None]:
# importing the data
players_data <- read_csv("data/players.csv")
sessions_data <- read_csv("data/sessions.csv")
players_data
sessions_data

In [None]:
# Calculate each player's average session length in minutes, keyed by hashedEmail
new_sessions_data <- sessions_data |>
    # Use difftime() with explicit units: subtracting date-times picks the unit automatically
    mutate(session_length_mins = as.numeric(difftime(dmy_hm(end_time), dmy_hm(start_time), units = "mins"))) |>
    select(hashedEmail, session_length_mins) |> # keep only the columns needed for the average
    group_by(hashedEmail) |>
    # na.rm = TRUE skips sessions whose start or end time is missing
    summarize(average_session_length = mean(session_length_mins, na.rm = TRUE))

new_sessions_data

# Add each player's average session length to players_data, matching on hashedEmail
# (full_join keeps players without sessions; their average_session_length is NA)
joined_table <- players_data |>
    full_join(new_sessions_data, join_by(hashedEmail))

joined_table
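
To check that the wrangling above behaves as intended, the same steps can be run on a tiny hand-made example where the answer is easy to verify by eye. This is a sketch only: `toy_players`, `toy_sessions` and their values are hypothetical, and the timestamps are assumed to follow the day/month/year hour:minute format that `dmy_hm()` expects.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy stand-ins for the two data sets (hypothetical values)
toy_players <- tibble(
  hashedEmail  = c("a", "b", "c"),
  played_hours = c(1.5, 0.2, 0)
)
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "b"),
  start_time  = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
  end_time    = c("30/06/2024 18:24", "18/06/2024 01:33", "25/07/2024 17:57")
)

# Session length in minutes, with the unit fixed explicitly
toy_averages <- toy_sessions |>
  mutate(session_length_mins = as.numeric(
    difftime(dmy_hm(end_time), dmy_hm(start_time), units = "mins")
  )) |>
  group_by(hashedEmail) |>
  summarize(average_session_length = mean(session_length_mins, na.rm = TRUE))

# Join the averages onto the players table
toy_joined <- toy_players |>
  full_join(toy_averages, join_by(hashedEmail))

# Players with no recorded sessions end up with a missing average
players_without_sessions <- toy_joined |>
  filter(is.na(average_session_length))

toy_joined
players_without_sessions
```

Player "a" has sessions of 12 and 120 minutes, so the average is 66 minutes; player "c" has no sessions and therefore a missing average. Rows like that will need to be handled (for example, dropped or imputed) before the variable is used as a predictor.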