# Title

## Introduction

This project supports a computer science research group at the University of British Columbia that studies player behaviour in video games. The group runs a Minecraft server that records players' movements, actions, and interactions with the world.

The research group needs to target its recruitment efforts so that it has sufficient resources and enough players to collect an ample amount of data. The group wants to answer the research question: *Which types of players are most likely to contribute significant amounts of data?*

Answering this question will help the research group optimize its recruitment. This project therefore uses data science and machine learning techniques to predict player engagement patterns.

### Data
The data consists of two datasets: the players data and the sessions data. Both are described in full below.

#### Players Data
The players dataset has 9 variables and 196 observations. It captures information about each user of the Minecraft server.

- experience: An ordinal categorical variable describing the user's experience level with Minecraft. The categories range from "Beginner" through "Amateur", "Regular", and "Pro" to "Veteran".
- subscribe: A logical variable (TRUE, FALSE, or NA) indicating whether the player is subscribed to the server's email updates. In this dataset, only TRUE and FALSE occur.
- hashedEmail: A string variable holding a hashed representation of the email the player used to sign up.
- played_hours: A numeric variable giving the total hours a user has spent on the server.
- name: A string variable representing the first name of the player.
- gender: A string variable representing the player's gender.
- age: A numeric variable representing the player's age.
- individualId: Empty throughout the dataset; it appears to be a unique identifier for the player.
- organizationName: Empty throughout the dataset; it appears to capture the organization with which the player is associated.

#### Sessions Data
The sessions dataset has 5 variables and 1535 observations. It captures information about each game session a player has played. Each session is linked to a player through hashedEmail, so the two tables can be combined into one dataset to give more information.

Variables
- hashedEmail: A string variable representing the hashed email of a player.
- start_time: A string variable representing the start time of the session in the format DD/MM/YYYY HH:MM.
- end_time: A string variable representing the end time of the session in the format DD/MM/YYYY HH:MM.
- original_start_time: A numeric variable representing the start time of the session as a UNIX timestamp. The values in the data (about 1.7 × 10^12) are consistent with milliseconds, not seconds, since January 1, 1970 (UTC).
- original_end_time: A numeric variable representing the end time of the session as a UNIX timestamp, on the same scale as original_start_time.
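
The two time representations described above can be converted into proper date-time values before any session length is computed. The following is a minimal sketch in base R rather than the project's own code: `sessions_demo` is a hypothetical two-row sample that mirrors the layout of the sessions data, the UTC time zone is an assumption (the source does not state one), and the division by 1000 assumes the `original_*` columns hold milliseconds.

```r
# Hypothetical two-row sample mirroring the sessions data layout
sessions_demo <- data.frame(
  start_time = c("30/06/2024 18:12", "23/06/2024 15:08"),
  end_time = c("30/06/2024 18:24", "23/06/2024 17:10"),
  original_start_time = c(1719770000000, 1719160000000)
)

# Parse the DD/MM/YYYY HH:MM strings (time zone assumed to be UTC)
time_format <- "%d/%m/%Y %H:%M"
sessions_demo$start_dt <- as.POSIXct(sessions_demo$start_time, format = time_format, tz = "UTC")
sessions_demo$end_dt <- as.POSIXct(sessions_demo$end_time, format = time_format, tz = "UTC")

# Session length in minutes
sessions_demo$duration_min <- as.numeric(
  difftime(sessions_demo$end_dt, sessions_demo$start_dt, units = "mins")
)

# Convert the epoch values, assuming milliseconds since 1970-01-01 UTC
sessions_demo$epoch_start <- as.POSIXct(
  sessions_demo$original_start_time / 1000, origin = "1970-01-01", tz = "UTC"
)

sessions_demo[, c("start_dt", "duration_min", "epoch_start")]
```

If the epoch columns turn out to be rounded in the stored data, the string columns are probably the more reliable source for session lengths.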

In [3]:
library(tidyverse)  # attaches ggplot2, readr, dplyr, and forcats (for as_factor)
library(repr)

## Downloading Data

In [4]:
players_url <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
sessions_url <- "https://drive.google.com/uc?export=download&id=14O91N5OlVkvdGxXNJUj5jIsV5RexhzbB"

players <- read.csv(players_url)
sessions <- read.csv(sessions_url)

In [5]:
head(players)

| | experience | subscribe | hashedEmail | played_hours | name | gender | age | individualId | organizationName |
|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<int>` | `<lgl>` | `<lgl>` |
| 1 | Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 | NA | NA |
| 2 | Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 | NA | NA |
| 3 | Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 | NA | NA |
| 4 | Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 | NA | NA |
| 5 | Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 | NA | NA |
| 6 | Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 | NA | NA |


In [6]:
head(sessions)

| | hashedEmail | start_time | end_time | original_start_time | original_end_time |
|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| 1 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 30/06/2024 18:12 | 30/06/2024 18:24 | 1719770000000 | 1719770000000 |
| 2 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 17/06/2024 23:33 | 17/06/2024 23:46 | 1718670000000 | 1718670000000 |
| 3 | f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc | 25/07/2024 17:34 | 25/07/2024 17:57 | 1721930000000 | 1721930000000 |
| 4 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 25/07/2024 03:22 | 25/07/2024 03:58 | 1721880000000 | 1721880000000 |
| 5 | 36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686 | 25/05/2024 16:01 | 25/05/2024 16:12 | 1716650000000 | 1716650000000 |
| 6 | bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf | 23/06/2024 15:08 | 23/06/2024 17:10 | 1719160000000 | 1719160000000 |
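
Since both tables share hashedEmail, they can be combined so that each player row carries a summary of that player's sessions. The sketch below shows one possible way to do this with dplyr; it is an illustration under stated assumptions, not the analysis used in this project. `players_demo`, `sessions_demo`, the short IDs standing in for real hashes, and the `n_sessions` column are all hypothetical.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables;
# the short IDs stand in for real hashed emails
players_demo <- data.frame(
  hashedEmail = c("aaa", "bbb", "ccc"),
  experience = c("Pro", "Amateur", "Veteran"),
  played_hours = c(30.3, 0.7, 0.0)
)
sessions_demo <- data.frame(
  hashedEmail = c("aaa", "aaa", "bbb"),
  start_time = c("30/06/2024 18:12", "25/07/2024 03:22", "17/06/2024 23:33")
)

# One row per player with that player's number of sessions
session_counts <- sessions_demo |>
  count(hashedEmail, name = "n_sessions")

# Keep every player; players without any session get a count of 0
players_sessions <- players_demo |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(n_sessions = coalesce(n_sessions, 0L))

players_sessions
```

A left join is used here so that players who never started a session stay in the result, which matters when the goal is to compare players who contribute data with those who do not.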


In [8]:
# Drop the two empty columns and the player's first name
players_selected <- players |>
  select(-individualId, -organizationName, -name)
head(players_selected)

| | experience | subscribe | hashedEmail | played_hours | gender | age |
|---|---|---|---|---|---|---|
| | `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<int>` |
| 1 | Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Male | 9 |
| 2 | Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Male | 17 |
| 3 | Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Male | 17 |
| 4 | Amateur | TRUE | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Female | 21 |
| 5 | Regular | TRUE | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Male | 21 |
| 6 | Amateur | TRUE | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Female | 17 |


In [9]:
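# Recode subscribe as 0/1 and store experience as a factor
# (as_factor keeps levels in order of first appearance)
players_selected <- players_selected |>
    mutate(
        subscribe = as.numeric(subscribe),
        experience = as_factor(experience)
    )

# Keep only rows without missing values
players_final <- players_selected |>
    drop_na()

head(players_final)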

| | experience | subscribe | hashedEmail | played_hours | gender | age |
|---|---|---|---|---|---|---|
| | `<fct>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<int>` |
| 1 | Pro | 1 | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Male | 9 |
| 2 | Veteran | 1 | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Male | 17 |
| 3 | Veteran | 0 | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Male | 17 |
| 4 | Amateur | 1 | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Female | 21 |
| 5 | Regular | 1 | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Male | 21 |
| 6 | Amateur | 1 | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Female | 17 |
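
The data description treats experience as ordinal, but `as_factor()` orders levels by first appearance in the data, which does not reflect any ranking. If the ranking matters for a later plot or model, the levels could be set explicitly. The sketch below uses base R on a hypothetical sample vector; the ordering Beginner < Amateur < Regular < Pro < Veteran is an assumption taken from the variable description, not something the data itself confirms.

```r
# Hypothetical sample of experience values
experience <- c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur")

# Assumed ranking, lowest to highest, taken from the variable description
experience_levels <- c("Beginner", "Amateur", "Regular", "Pro", "Veteran")

# Ordered factor: levels follow the assumed ranking, not order of appearance
experience_ord <- factor(experience, levels = experience_levels, ordered = TRUE)

levels(experience_ord)
as.integer(experience_ord)  # rank of each value within the assumed ordering
```

With an ordered factor, grouped summaries and plot axes follow the stated ranking instead of the order in which values happen to occur in the file.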


In [13]:
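# Summarise players by experience level; players_selected is used here,
# so rows with missing values are kept and handled by na.rm = TRUE
summarize_players <- players_selected |>
    group_by(experience) |>
    summarise(
        avg_played_hours = mean(played_hours, na.rm = TRUE),
        avg_age = mean(age, na.rm = TRUE),
        # share of subscribed players (subscribe is coded 0/1)
        subscribe_rate = mean(subscribe, na.rm = TRUE),
        # gender counts collapsed into one string per experience level
        gender_categories = paste(names(table(gender)), table(gender), sep = ":", collapse = ","),
        count = n()
    )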

summarize_players

| experience | avg_played_hours | avg_age | subscribe_rate | gender_categories | count |
|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<int>` |
| Pro | 2.6 | 22.21429 | 0.7142857 | Male:11,Non-binary:2,Other:1 | 14 |
| Veteran | 0.6479167 | 20.95833 | 0.6875 | Agender:2,Female:5,Male:31,Non-binary:8,Prefer not to say:2 | 48 |
| Amateur | 6.0174603 | 20.25397 | 0.7142857 | Female:14,Male:40,Non-binary:1,Prefer not to say:4,Two-Spirited:4 | 63 |
| Regular | 18.2083333 | 22.77778 | 0.8055556 | Female:4,Male:26,Non-binary:3,Prefer not to say:2,Two-Spirited:1 | 36 |
| Beginner | 1.2485714 | 21.65714 | 0.7714286 | Female:14,Male:16,Non-binary:1,Prefer not to say:3,Two-Spirited:1 | 35 |
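
The gender_categories column packs several counts into one string, which is easy to read but hard to filter or plot. One possible alternative, sketched here with dplyr on a hypothetical five-row sample, keeps one row per combination of experience and gender. `players_demo` and the `n_players` column name are illustrative assumptions.

```r
library(dplyr)

# Hypothetical sample with the two columns of interest
players_demo <- data.frame(
  experience = c("Pro", "Pro", "Pro", "Amateur", "Amateur"),
  gender = c("Male", "Male", "Non-binary", "Female", "Male")
)

# One row per (experience, gender) pair with the number of players
gender_counts <- players_demo |>
  count(experience, gender, name = "n_players")

gender_counts
```

The long layout can be passed directly to `ggplot()` or filtered by gender, which the pasted string does not allow without parsing.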
