## Data Science Project
#### Riley So

#### Introduction
The Computer Science department at the University of British Columbia (UBC) is running a project that uses a Minecraft server to study players' behavior. Actions performed by players are logged, creating a data set that captures behavior, characteristics, and engagement patterns.

However, running this type of research project takes time and money. The team needs to buy software licenses, manage server capacity, and find the right players to join. One way to reach players and increase player count is through a game-related newsletter. Knowing which players are most likely to subscribe can help the team focus their recruitment and use their resources more wisely.

##### **Question**
To help the research team target their recruitment efforts, we aim to answer the following question:
> Can player characteristics such as age, gender, experience level, and total hours played predict whether a player subscribes to a game-related newsletter?

#### Data Description

There are two datasets:
- `players.csv`: Contains demographic and self-reported experience information for 196 players.
- `sessions.csv`: Contains 1,535 records of individual gameplay sessions.

To explore the data, we begin by loading both files.


In [5]:
library(tidyverse)

# Load players and sessions data
players <- read_csv("data/players.csv")
sessions <- read_csv("data/sessions.csv")

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Now we will find the dimensions of the data, its structure, and its variable types.

In [9]:
#Data Set Dimensions
players_dim <- players |> dim()
sessions_dim <- sessions |> dim()

players_dim
sessions_dim

#Structure and Variable Types
players |> glimpse()
sessions |> glimpse()

Rows: 196
Columns: 7
$ experience   <chr> "Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amate…
$ subscribe    <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, T…
$ hashedEmail  <chr> "f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8…
$ played_hours <dbl> 30.3, 3.8, 0.0, 0.7, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 1.6, 0…
$ name         <chr> "Morgan", "Christian", "Blake", "Flora", "Kylie", "Adrian…
$ gender       <chr> "Male", "Male", "Male", "Female", "Male", "Female", "Fema…
$ Age          <dbl> 9, 17, 17, 21, 21, 17, 19, 21, 17, 22, 23, 17, 25, 22, 17…
Rows: 1,535
Columns: 5
$ hashedEmail         <chr> "bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8a…
$ start_time          <chr> "30/06/2024 18:12", "17/06/2024 23:33", "25/07/202…
$ end_time            <chr> "30/06/2024 18:24"

The `players.csv` dataset contains 196 rows and 7 columns, while `sessions.csv` has 1,535 rows and 5 columns. Each row in `players.csv` represents a unique player, while each row in `sessions.csv` represents one play session. The `players` dataset includes character, numeric, and logical columns. `sessions` consists mainly of timestamps, plus a player identifier (`hashedEmail`) that appears in both datasets and links each session to a player.

Before computing summary statistics, we check for missing values.

In [10]:
players |> summarise(across(everything(), ~ sum(is.na(.))))
sessions |> summarise(across(everything(), ~ sum(is.na(.))))

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<int>,<int>,<int>,<int>,<int>,<int>,<int>
0,0,0,0,0,0,2


hashedEmail,start_time,end_time,original_start_time,original_end_time
<int>,<int>,<int>,<int>,<int>
0,0,2,0,2


- `players.csv` has **2 missing values** in the `Age` column.
- `sessions.csv` has **2 missing values** each in `end_time` and `original_end_time`.

No other missing values are present.

In [18]:
players |> summary()
sessions |> summary()

  experience        subscribe       hashedEmail         played_hours    
 Length:196         Mode :logical   Length:196         Min.   :  0.000  
 Class :character   FALSE:52        Class :character   1st Qu.:  0.000  
 Mode  :character   TRUE :144       Mode  :character   Median :  0.100  
                                                       Mean   :  5.846  
                                                       3rd Qu.:  0.600  
                                                       Max.   :223.100  
                                                                        
     name              gender               Age       
 Length:196         Length:196         Min.   : 8.00  
 Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Median :19.00  
                                       Mean   :20.52  
                                       3rd Qu.:22.00  
                                       Max.   :50.00  
                               

 hashedEmail         start_time          end_time         original_start_time
 Length:1535        Length:1535        Length:1535        Min.   :1.712e+12  
 Class :character   Class :character   Class :character   1st Qu.:1.716e+12  
 Mode  :character   Mode  :character   Mode  :character   Median :1.719e+12  
                                                          Mean   :1.719e+12  
                                                          3rd Qu.:1.722e+12  
                                                          Max.   :1.727e+12  
                                                                             
 original_end_time  
 Min.   :1.712e+12  
 1st Qu.:1.716e+12  
 Median :1.719e+12  
 Mean   :1.719e+12  
 3rd Qu.:1.722e+12  
 Max.   :1.727e+12  
 NA's   :2          

##### Summary of Data
**players.csv**

- **Age**: Ranges from 8 to 50 years (Mean: 20.5, Median: 19). Most players are in their late teens or early 20s. Two values are missing.
- **played_hours**: Strongly right-skewed. The median is only 0.1 hours and the mean is 5.8, but the maximum is 223.1 hours. Most players spent very little time on the server, with a few heavy users.
- **subscribe**: About 73% (144/196) of players subscribed to the newsletter. This is our target variable.
- **experience & gender**: Character columns that will be converted to categorical for modeling and grouped visualization.
- **name & hashedEmail**: Identifiers, not used for modeling directly.

**sessions.csv**

- **original_start_time / original_end_time**: Unix timestamps in milliseconds (values around 1.7e12), covering mid-2024. Can be used to analyze peak play times or session durations. Two `original_end_time` values are missing.
- **start_time / end_time**: Human-readable time strings. Useful for plotting session timing trends.
- **hashedEmail**: Common key with `players.csv` for potential grouping but not needed for prediction.
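
As a minimal sketch of how the human-readable timestamps could be turned into session durations, the example below parses the first `start_time` and `end_time` values shown in the `glimpse()` output. It assumes the strings are in day/month/year hour:minute order and treats them as UTC (the `dmy_hm()` default); the real time zone is not stated in the data, and the variable names are illustrative.

```r
library(lubridate)

# First start/end values from the glimpse() output (day/month/year hour:minute)
start_time <- "30/06/2024 18:12"
end_time   <- "30/06/2024 18:24"

# dmy_hm() parses day-month-year strings; UTC is assumed here
start_parsed <- dmy_hm(start_time)
end_parsed   <- dmy_hm(end_time)

# Session length in minutes
session_minutes <- as.numeric(difftime(end_parsed, start_parsed, units = "mins"))
session_minutes
```

Applied to the full table, the same parsing could go inside a `mutate()` call; the two sessions with a missing `end_time` would then have a missing duration.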



Additionally, let's look at the **categorical data**.

In [20]:
# players
players |> count(experience)
players |> count(gender)
players |> count(subscribe)

# sessions
sessions |> distinct(hashedEmail) |> count(name = "unique_players")

experience,n
<chr>,<int>
Amateur,63
Beginner,35
Pro,14
Regular,36
Veteran,48


gender,n
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6


subscribe,n
<lgl>,<int>
FALSE,52
TRUE,144


unique_players
<int>
125


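**Observations of categorical data**
- Amateur is the largest experience group (63 of 196 players), followed by Veteran (48). Pro is the smallest (14).
- Gender is skewed toward Male (124 of 196, about 63%). Some categories have very few players (Agender: 2, Other: 1), which limits what can be concluded about them.
- Subscription is imbalanced: 144 players (about 73%) subscribed and 52 (about 27%) did not.
- 125 unique players have sessions recorded. With 1,535 sessions in total, that is about 12 sessions per active player on average, indicating repeat activity.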
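
To put a number on the subscription split, the counts from `count(subscribe)` can be turned into proportions. This is a minimal sketch that re-enters the two counts by hand instead of reading the CSV; `subscribe_counts`, `subscribe_props`, and `majority_share` are illustrative names, not part of the original analysis.

```r
library(tidyverse)

# Counts copied from the count(subscribe) output above
subscribe_counts <- tibble(
  subscribe = c(FALSE, TRUE),
  n = c(52, 144)
)

# Share of each class among all 196 players
subscribe_props <- subscribe_counts |>
  mutate(prop = n / sum(n))

# Share of the larger class: the accuracy of always predicting that class
majority_share <- max(subscribe_props$prop)

subscribe_props
majority_share
```

A classifier built on this data would need to beat the majority share (144/196, roughly 0.73) to be more useful than always predicting that a player subscribes.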

#### **Takeaway**
Both datasets are largely clean, with only a handful of missing values, and contain relevant features.
- `players.csv` is suited for modeling the subscription prediction.
- `sessions.csv` will support additional insights.
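
One way `sessions.csv` could support the player-level analysis is by summarising sessions per player and joining the result on `hashedEmail`. The sketch below uses small hypothetical tables (`players_toy`, `sessions_toy`) with made-up keys rather than the real data, and it assumes that players without any recorded session should get a count of zero.

```r
library(tidyverse)

# Hypothetical toy tables; the short keys stand in for real hashedEmail values
players_toy <- tibble(
  hashedEmail = c("a", "b", "c"),
  subscribe   = c(TRUE, FALSE, TRUE)
)
sessions_toy <- tibble(hashedEmail = c("a", "a", "c"))

# One row per player with the number of sessions recorded
session_counts <- sessions_toy |>
  count(hashedEmail, name = "n_sessions")

# Keep every player; players with no sessions get a count of 0
players_with_sessions <- players_toy |>
  left_join(session_counts, by = "hashedEmail") |>
  mutate(n_sessions = replace_na(n_sessions, 0L))

players_with_sessions
```

Using `left_join()` rather than `inner_join()` keeps players who never appear in the sessions table, which matters here because only 125 distinct players have sessions recorded.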

#### Data Cleaning
Now that we have described and explored the data, let's prepare it for analysis. To do this we will:
- Remove or impute missing values
- Convert characters to factors
- Prepare data for modeling

As seen above, the `Age` column has 2 missing values. Since this is a small portion of the dataset (2 of 196 rows, about 1%), we will **remove** these rows.

In [26]:
# Clean data: drop rows with a missing Age, then preview the first 10 rows
players_clean <- players |> filter(!is.na(Age))
players_clean |> head(10)

experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,FALSE,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,TRUE,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,TRUE,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,TRUE,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17
Regular,TRUE,8e594b8953193b26f498db95a508b03c6fe1c24bb5251d392c18a0da9a722807,0.0,Luna,Female,19
Amateur,FALSE,1d2371d8a35c8831034b25bda8764539ab7db0f63938696917c447128a2540dd,0.0,Emerson,Male,21
Amateur,TRUE,8b71f4d66a38389b7528bb38ba6eb71157733df7d1740371852a797ae97d82d1,0.1,Natalie,Male,17
Veteran,TRUE,bbe2d83de678f519c4b3daa7265e683b4fe2d814077f9094afd11d8f217039ec,0.0,Nyla,Female,22
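
The remaining preparation step listed above, converting character columns to factors, could look roughly like the sketch below. It is self-contained and uses four rows copied from the preview (without the identifier columns) in place of `players_clean`; `players_sample` and `players_model` are illustrative names, and the factor levels are left in R's default alphabetical order, which is an assumption and not a choice stated in the analysis.

```r
library(tidyverse)

# Four rows copied from the preview above, standing in for players_clean
players_sample <- tribble(
  ~experience, ~subscribe, ~played_hours, ~gender,  ~Age,
  "Pro",       TRUE,       30.3,          "Male",   9,
  "Veteran",   TRUE,       3.8,           "Male",   17,
  "Veteran",   FALSE,      0.0,           "Male",   17,
  "Amateur",   TRUE,       0.7,           "Female", 21
)

# Convert the categorical predictors and the target to factors
players_model <- players_sample |>
  mutate(
    experience = as.factor(experience),
    gender     = as.factor(gender),
    subscribe  = factor(subscribe, levels = c(FALSE, TRUE))
  )

players_model |> glimpse()
```

Setting the levels of `subscribe` explicitly keeps both classes present even if a subset of the data happens to contain only one of them.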
