# **Group 19 Project: Predicting Newsletter Subscription from Player Age and Gender**

# **Introduction**

Video games have evolved from simple pastimes into complex environments that offer rich data about user behavior and interaction. This report is grounded in a real-world data science project conducted by a research group in Computer Science at UBC, led by Frank Wood. The team has established a Minecraft server where every action taken by players is recorded. By capturing this data, the researchers aim to unlock insights into how individuals navigate and interact within virtual worlds.

The project has multiple objectives; we focus on understanding which player characteristics and behaviors best predict a player's likelihood of subscribing to a game-related newsletter. This targeted approach helps ensure that sufficient resources, such as software licenses and server hardware, are available to support the anticipated influx of players. By investigating player behavior through detailed analytics, the study aims to inform future strategies for engagement, recruitment, and resource allocation in online gaming communities. This report details the methods used to analyze the player data, the key findings related to newsletter subscription behavior, and the implications of these findings.

# **Question**

Are age and gender predictive of subscription status to a game-related newsletter in the players.csv data set?

## Data Set Description

There are two datasets containing information on players on the Minecraft server: "players.csv" and "sessions.csv".

The "players.csv" dataset contains observations of people who played on the Minecraft server. The data frame has 7 variables and 196 rows (one per player), giving 1372 recorded values in total. The variables, ordered left to right in the table, are:

- `Experience`
    - This variable describes each player's level of experience with the game.
    - This variable is represented by a string value that can be either Amateur, Beginner, Regular, Pro, or Veteran
- `Subscribe`
    - This variable describes whether or not the player is subscribed to a game-related newsletter.
    - This variable is represented by a boolean value (either True or False) 
- `Hashed Email`
    - This variable lists each player's email address in a hashed format.
    - This variable is represented by a string  
- `Hours Played`
    - This variable describes how much time each player spent playing the game (in hours).
    - This variable is represented by a float value (number with a decimal value)
- `Name`
    - This variable states the player's first name.
    - This variable is represented by a string  
- `Gender`
    - This variable describes the gender of each player. 
    - This variable is represented by a string value that can be Agender, Female, Male, Non-binary, Other, Prefer not to say, or Two-Spirited
- `Age`
    - This variable describes the age of the players (in years) 
    - This variable is represented by an integer value (whole number)

**This is the data set that will be used in the analysis.**
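The dimensions quoted above can be checked directly once the data are loaded. The sketch below uses a small made-up data frame (`toy_players`, with hypothetical values) in place of `players_data`, so the numbers it prints describe the toy data, not the real file.

```r
# Toy stand-in for players_data; all values here are made up for illustration.
toy_players <- data.frame(
  subscribe = c(TRUE, FALSE, TRUE),
  gender    = c("Male", "Female", "Non-binary"),
  Age       = c(17L, 21L, 19L)
)

n_rows  <- nrow(toy_players)   # number of players (rows)
n_cols  <- ncol(toy_players)   # number of variables (columns)
n_cells <- n_rows * n_cols     # number of individual recorded values

c(rows = n_rows, columns = n_cols, values = n_cells)

# Column types, to compare against the descriptions in the list above
sapply(toy_players, class)
```

Applied to the real data, 196 rows and 7 variables give 196 × 7 = 1372 recorded values.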

The "sessions.csv" dataset contains observations of individual play sessions on the Minecraft server. The data frame has 5 variables and 1535 rows (one per session), giving 7675 recorded values in total. The variables, ordered left to right in the table, are:

- `hashedEmail`
    - This variable gives a string of letters and numbers that represents the player's hashed email address.
    - This variable is represented by a string  
- `start_time`
    - This variable gives the exact date (DD/MM/YR) and time (24-hour clock) at which the player started their session.
    - This variable is represented by a string
- `end_time`
    - This variable gives the exact date (DD/MM/YR) and time (24-hour clock) at which the player ended their session.
    - This variable is represented by a string  
- `original_start_time`
    - This variable gives the original start time of the session as a raw numeric timestamp.
    - This variable is represented by a float value (number with a decimal value)
- `original_end_time`
    - This variable gives the original end time of the session as a raw numeric timestamp.
    - This variable is represented by a float value (number with a decimal value)
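Because `start_time` and `end_time` are stored as strings, they would need to be parsed before any time arithmetic. A minimal sketch, assuming a four-digit year, a `DD/MM/YYYY HH:MM` layout, and a UTC time zone; the two example strings are hypothetical and not taken from the file.

```r
# Hypothetical example strings in the date and 24-hour time layout described above.
start_time <- "30/06/2024 18:12"
end_time   <- "30/06/2024 18:24"

# Parse the strings as date-times; the format string and time zone are assumptions.
fmt <- "%d/%m/%Y %H:%M"
start_parsed <- as.POSIXct(start_time, format = fmt, tz = "UTC")
end_parsed   <- as.POSIXct(end_time,   format = fmt, tz = "UTC")

# Session length in minutes
session_minutes <- as.numeric(difftime(end_parsed, start_parsed, units = "mins"))
session_minutes
```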


This data set will not be used in the analysis. 

In [None]:
library(tidyverse)
library(rvest)
library(dplyr)

In [None]:
# url_sessions <- "https://raw.githubusercontent.com/IFQXK/DSCI-100-project-group-19/refs/heads/main/sessions.csv"
# sessions_data <- read.csv(url_sessions)
# head(sessions_data)

url_players <- "https://raw.githubusercontent.com/IFQXK/DSCI-100-project-group-19/refs/heads/main/players.csv"
players_data <- read.csv(url_players)
head(players_data)

In [None]:
players_data_fixed <- players_data |>
    mutate(gender = as.factor(gender)) |>
    select(subscribe, gender, Age)
head(players_data_fixed)

The outputs above show the first 6 rows of the data: first as loaded, without wrangling, and then after keeping only `subscribe`, `gender`, and `Age`.
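If a later modelling step needs a categorical outcome, `subscribe` could be converted to a factor in the same way as `gender`. The sketch below mirrors the wrangling step on a few made-up rows (`toy_players` and its values are hypothetical) and also counts missing ages, since the summaries that follow use `na.rm = TRUE`.

```r
library(dplyr)

# Made-up rows standing in for players_data
toy_players <- tibble(
  subscribe = c(TRUE, FALSE, TRUE, TRUE),
  gender    = c("Male", "Female", "Male", "Non-binary"),
  Age       = c(17, 21, NA, 19)
)

toy_fixed <- toy_players |>
  mutate(
    gender    = as.factor(gender),
    subscribe = as.factor(subscribe)   # levels become "FALSE" and "TRUE"
  ) |>
  select(subscribe, gender, Age)

levels(toy_fixed$subscribe)
sum(is.na(toy_fixed$Age))   # number of rows with a missing age
```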

In [None]:
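# Mean, maximum, and minimum age (missing ages are ignored)
mean_table <- players_data |>
  summarize(
    Average_Age = mean(Age, na.rm = TRUE),
    Max_Age = max(Age, na.rm = TRUE),
    Min_Age = min(Age, na.rm = TRUE)
  )

mean_table

# Number of players in each gender category
gender_count <- players_data |>
  count(gender)

gender_count

# Number of players by subscription status
subscriber_count <- players_data |>
  count(subscribe)

subscriber_count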

## Statistics Summary of Variables used in Specific Question

`Age` :
- Mean: 20.52
- Max: 50
- Min: 8
  
`Gender` :
- Agender: 2
- Female: 37
- Male: 124
- Non-binary: 15
- Other: 1
- Prefer not to say: 11
- Two-Spirited: 6

`Subscribe` :

- True: 144
- False: 52

The values above summarize each variable used in the analysis. For the numeric variable (`Age`), the mean, maximum, and minimum were calculated. For the categorical variables (`Gender` and `Subscribe`), the number of players in each category was counted.
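Expressing these counts as proportions makes the imbalance between categories easier to see. The sketch below re-enters the counts from the summary by hand (the vector names are ours) rather than recomputing them from the file.

```r
# Counts copied from the summary above
subscribe_counts <- c(True = 144, False = 52)
gender_counts <- c(
  Agender = 2, Female = 37, Male = 124, `Non-binary` = 15,
  Other = 1, `Prefer not to say` = 11, `Two-Spirited` = 6
)

sum(subscribe_counts)                    # 196 players
sum(gender_counts)                       # 196 players
round(prop.table(subscribe_counts), 3)   # True 0.735, False 0.265
round(prop.table(gender_counts), 3)
```

About 73.5% of players are subscribed and about 63% are Male, so raw counts in any plot will largely reflect these group sizes.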

In [None]:
# Bar chart of player counts by gender, split by subscription status
options(repr.plot.width = 10, repr.plot.height = 4)
gender_bar <- players_data_fixed |>
            ggplot(aes(x = gender, fill = subscribe)) +
            geom_bar(position = "dodge") +
            labs(x = "Gender", y = "Number of players", fill = "Subscribed", title = "Relationship between gender and subscription status") + 
            theme(text = element_text(size = 14))       
gender_bar

In [None]:
# Histogram of player ages (5-year bins), split by subscription status
age_histogram <- players_data |>
            ggplot(aes(x = Age, fill = subscribe)) +
            geom_histogram(position = "dodge", binwidth = 5) +
            labs(x = "Age (in years)", y = "Number of players", fill = "Subscribed", title = "Relationship between age and subscription status") + 
            theme(text = element_text(size = 14))       
age_histogram

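The histogram above is an exploratory visualization of the data set relevant to the planned analysis: it shows the distribution of player ages, split by subscription status. Note that it displays the number of players in each age bin rather than the proportion of subscribers, so the taller bars for subscribed players partly reflect that there are more subscribers overall (144 versus 52).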
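If the goal is to compare the share of subscribers at each age rather than raw counts, one option is to compute the subscription rate within each age bin and plot that instead. This is a sketch on made-up rows (`toy_players` is hypothetical, and the 5-year bins from 10 to 35 are chosen to fit the toy data), not the analysis used in this report.

```r
library(dplyr)
library(ggplot2)

# Made-up rows standing in for players_data
toy_players <- tibble(
  Age       = c(12, 14, 17, 17, 18, 21, 22, 23, 27, 33),
  subscribe = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Share of subscribers within each 5-year age bin
rate_by_age <- toy_players |>
  mutate(age_bin = cut(Age, breaks = seq(10, 35, by = 5), right = FALSE)) |>
  group_by(age_bin) |>
  summarize(n_players = n(), subscribe_rate = mean(subscribe))

# Bar height is now a proportion, so bins with few players are still comparable
rate_plot <- rate_by_age |>
  ggplot(aes(x = age_bin, y = subscribe_rate)) +
  geom_col() +
  labs(x = "Age bin (in years)", y = "Proportion subscribed") +
  theme(text = element_text(size = 14))

rate_by_age
rate_plot
```

Keeping `n_players` alongside the rate matters because a rate computed from one or two players is not very informative.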

In [None]:
# Scatter plot of player age against gender
scatter_plot <- players_data |>
            ggplot(aes(x = Age, y = gender)) +
            geom_point() +
            labs(y = "Gender", x = "Age (in years)", title = "Relationship between age and gender") + 
            theme(text = element_text(size = 14))

scatter_plot