# DSCI 100 Project Planning Stage (Individual) 

**Ajay Antonio (90110792)**

# Introduction 

In this project, we analyze a data set in collaboration with a research group at UBC. Spearheaded by Frank Wood and his team, the project aims to provide insights into the relationship between an online game and its users, drawing on seemingly arbitrary pieces of data such as newsletter subscription, time logged in, and player type. Using classification techniques and data recorded from a Minecraft server, we can then draw conclusions that support more efficient and targeted outreach to players.

The broad question that I will be aiming to tackle in my project is **What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?** 

More specifically, **Can we predict a player's chance to subscribe to the newsletter based on their total hours played, experience, and age?** In order to answer these questions, multiple steps must be taken to wrangle the data and carry out our predictions. More on this will be discussed later. 

# Data Description

We will be using the `players.csv` file provided by DSCI 100, which contains one row for each unique player recorded on the Minecraft server. The data set has **196** observations and **7** variables.


#### Table 1 - Variable Names and Types in `players.csv`

| Variable      | Type      | Meaning                                                        |
|---------------|-----------|----------------------------------------------------------------|
| experience    | character | Experience of the player (Veteran, Pro, Amateur, Regular)      |
| subscribe     | logical   | Status of the player's subscription (TRUE or FALSE)            |
| hashedEmail   | character | Email address of the player (hashed)                           |
| played_hours  | double    | Total hours of Minecraft played                                |
| name          | character | Name of the player                                             |
| gender        | character | Gender of the player                                           |
| Age           | double    | Age of the player in years                                     |

A second data set, `sessions.csv`, was also provided but is not used in this project. It has **1535** observations and **5** variables.

#### Table 2 - Variable Names and Types in `sessions.csv`

| Variable           | Type      | Meaning                                                |
|--------------------|-----------|--------------------------------------------------------|
| hashedEmail        | character | Email Addresses of the players (Encoded)               |
| start_time         | character | Start time of gameplay                                 |
| end_time           | character | End time of gameplay                                   |
| original_start_time| double    | Start time in Epoch Milliseconds                       |
| original_end_time  | double    | End time in Epoch Milliseconds                         |

In [1]:
#Importing libraries I will use in this project
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)
# formatting graphs
options(repr.plot.width = 12, repr.plot.height = 6)


In [5]:
# Load the players data and compute summary statistics for the numeric variables
players <- read_csv("players.csv")
summary_stats <- players %>%
  summarise(
    n_obs = n(),
    mean_played_hours = round(mean(played_hours, na.rm = TRUE), 2),
    sd_played_hours = round(sd(played_hours, na.rm = TRUE), 2),
    min_played_hours = round(min(played_hours, na.rm = TRUE), 2),
    max_played_hours = round(max(played_hours, na.rm = TRUE), 2),
    mean_age = round(mean(Age, na.rm = TRUE), 2),
    sd_age = round(sd(Age, na.rm = TRUE), 2),
    min_age = round(min(Age, na.rm = TRUE), 2),
    max_age = round(max(Age, na.rm = TRUE), 2)
  )

summary_stats

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


n_obs,mean_played_hours,sd_played_hours,min_played_hours,max_played_hours,mean_age,sd_age,min_age,max_age
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
196,5.85,28.36,0,223.1,21.14,7.39,9,58


**Above we have imported the `players.csv` file and outlined the summary statistics (rounded to 2 decimal places).**

| Statistic          | Value     | 
|--------------------|-----------|
| mean_played_hours  | 5.85      | 
| sd_played_hours    | 28.36     |
| min_played_hours   | 0         | 
| max_played_hours   | 223.1     | 
| mean_age           | 21.14     | 
| sd_age             | 7.39      | 
| min_age            | 9         | 
| max_age            | 58        | 

#### Issues and Info about the Dataset:

Some potential issues that arise when dealing with the data are: 

**1:** The data are not fully tidy, particularly the `sessions.csv` data set, and must be wrangled before they are easy to read and work with. For example, `original_start_time` and `original_end_time` are stored as Unix epoch timestamps in milliseconds rather than as readable dates. Furthermore, `start_time` and `end_time` each hold more than one measurement (a date and a time) in a single character column, which violates the tidy data rule that "each column must have a single variable".
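The epoch columns described above could be made readable with a conversion like the one sketched below. This is a minimal sketch, not part of the planned analysis: the helper names `epoch_ms_to_datetime()` and `add_session_times()` are hypothetical, and the choice of the UTC time zone is an assumption, since the time zone of the recorded sessions is not stated.

```r
library(dplyr)

# Hypothetical helper: convert Unix epoch milliseconds to a date-time.
# The UTC time zone is an assumption; the data's time zone is not documented here.
epoch_ms_to_datetime <- function(ms) {
  as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")
}

# Hypothetical helper: add readable start/end columns and the session length
# in minutes to a data frame shaped like `sessions.csv`.
add_session_times <- function(sessions) {
  sessions %>%
    mutate(
      start_datetime = epoch_ms_to_datetime(original_start_time),
      end_datetime = epoch_ms_to_datetime(original_end_time),
      session_minutes = as.numeric(
        difftime(end_datetime, start_datetime, units = "mins")
      )
    )
}

# Possible usage (assumes the file is in the working directory):
# sessions <- readr::read_csv("sessions.csv")
# add_session_times(sessions)
```

Working from the epoch columns sidesteps the need to guess the text format of `start_time` and `end_time`.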

**2:** Since my research question mainly deals with the `players.csv` data set and is focused on experience, age, and hours played, I will not be using the `name` or `gender` variables. 

# Questions and Methods



The goal of this project is to determine whether we can predict a player's likelihood of subscribing to the newsletter based on their total playtime `played_hours`, experience level, and age. This analysis will help identify the player characteristics most associated with subscription behavior and can provide useful insights for retention and marketing strategies. 

The combination of **behavioral data** `played_hours`, **experience data** `experience`, and **demographic data** `Age` allows for well-rounded analysis of factors influencing subscription. 

`played_hours` = indicates player engagement; higher playtime may reflect greater interest and likelihood to subscribe.

`experience` = captures skill and familiarity with the game - experienced players may be more connected to the community. 

`Age` = represents demographic variation - certain age groups may have different levels of interest.
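One way to probe these expectations before modelling is to compare subscription rates across experience levels. The sketch below is illustrative only: the helper name `subscription_rate_by_experience()` is hypothetical, and it assumes `subscribe` is still a logical column, as it is when first read from `players.csv`.

```r
library(dplyr)

# Hypothetical helper: for each experience level, report the number of players,
# the proportion who subscribed, and the median hours played.
# Assumes `subscribe` is logical, so its mean is the subscription proportion.
subscription_rate_by_experience <- function(players) {
  players %>%
    group_by(experience) %>%
    summarise(
      n_players = n(),
      subscribe_rate = mean(subscribe, na.rm = TRUE),
      median_played_hours = median(played_hours, na.rm = TRUE),
      .groups = "drop"
    )
}

# Possible usage:
# subscription_rate_by_experience(players)
```

The median is used for hours played because the summary statistics show a heavily right-skewed distribution (mean 5.85, maximum 223.1).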

#### Data Wrangling Plan 

Before applying our predictive model, we will wrangle the data to clean and prepare it:

**1**: **Clean and Inspect the Data**:
- Remove the irrelevant variables like `name` and `hashedEmail` that do not contribute to the prediction.
- Check for missing or inconsistent values like blanks or NA in our chosen variables.
- Convert key variables like `subscribe` into factors for classification.

**2**: **Select Relevant Variables**:
- Since our focus is on predicting `subscribe`, we keep only `subscribe` and the three predictors (`played_hours`, `experience`, and `Age`) using `select()`.
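The first two steps might look roughly like the sketch below. The helper names `prepare_players()` and `count_missing()` are hypothetical, and converting `experience` to a factor alongside `subscribe` is an assumption about how the categorical predictor will be handled.

```r
library(dplyr)

# Hypothetical helper: keep the response and the three predictors, which also
# drops `name`, `hashedEmail`, and `gender`, then convert the categorical
# columns to factors for classification.
prepare_players <- function(players) {
  players %>%
    select(subscribe, played_hours, experience, Age) %>%
    mutate(
      subscribe = as.factor(subscribe),
      experience = as.factor(experience)
    )
}

# Hypothetical helper: count missing values in every column so we can decide
# how to handle them before modelling.
count_missing <- function(df) {
  df %>%
    summarise(across(everything(), ~ sum(is.na(.x))))
}

# Possible usage:
# players_clean <- prepare_players(players)
# count_missing(players_clean)
```

This sketch only counts missing values; whether to drop or impute them is left as a separate decision.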

**3**: **Data Splitting and Summary of Training Data**:
- Use `initial_split()` to divide the dataset into training and testing sets.
- Generate summary statistics and visualizations to understand the relationships between our variables.
- Plot scatterplots and density plots to examine the potential separability of classes and identify any visible trends or patterns.
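The splitting and exploration step could be sketched as follows. The 75/25 proportion, stratifying on `subscribe`, and all helper names (`split_players()`, `summarise_training()`, `plot_age_vs_hours()`, `plot_hours_density()`) are assumptions made for illustration, and the input is assumed to be a cleaned data frame in which `subscribe` is already a factor.

```r
library(dplyr)
library(ggplot2)
library(rsample)

# Hypothetical helper: split into training and testing sets.
# prop = 0.75 and stratifying on `subscribe` are assumptions, not settled choices.
split_players <- function(players_clean, prop = 0.75) {
  initial_split(players_clean, prop = prop, strata = subscribe)
}

# Hypothetical helper: per-class summary statistics on the training set only.
summarise_training <- function(train) {
  train %>%
    group_by(subscribe) %>%
    summarise(
      n_players = n(),
      mean_played_hours = mean(played_hours, na.rm = TRUE),
      mean_age = mean(Age, na.rm = TRUE),
      .groups = "drop"
    )
}

# Scatterplot of the two numeric predictors, coloured by class.
plot_age_vs_hours <- function(train) {
  ggplot(train, aes(x = Age, y = played_hours, colour = subscribe)) +
    geom_point(alpha = 0.6) +
    labs(x = "Age (years)", y = "Total hours played", colour = "Subscribed")
}

# Density of hours played for each class.
plot_hours_density <- function(train) {
  ggplot(train, aes(x = played_hours, fill = subscribe)) +
    geom_density(alpha = 0.5) +
    labs(x = "Total hours played", y = "Density", fill = "Subscribed")
}

# Possible usage (set a seed first so the split is reproducible):
# set.seed(2024)  # arbitrary seed
# players_split <- split_players(players_clean)
# players_train <- training(players_split)
# players_test <- testing(players_split)
# summarise_training(players_train)
# plot_age_vs_hours(players_train)
```

Summaries and plots use only the training set so that the testing set stays untouched until evaluation.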

#### Classification Method: 

The method that I have chosen to address our question of interest