In [12]:
# Load all the libraries needed for running this notebook
library(tidyverse)
library(repr)
options(repr.matrix.max.rows = 4) # limit data frame output to 4 rows

 # Questions  
#### Broad Question: 
Question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  

  
#### Specific Question:

Can a player's age and total play hours predict whether they subscribe to a game-related newsletter?

---

To answer this specific question, I will use only the `players` dataset, since it includes all the variables I need: `Age` and `played_hours` are my predictor (explanatory) variables, and `subscribe` is my response variable.  
  
  I will wrangle the data to remove missing values, convert `subscribe` into a factor variable, and standardize the numerical predictors (`Age` and `played_hours`) by centering and scaling. I'll then apply a **K-nearest neighbors classification model** to predict each player's subscription status, a class with two levels: TRUE and FALSE.
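
The wrangling steps described above can be sketched as follows. This is a minimal illustration rather than the final analysis code: `toy_players` is a hypothetical stand-in with invented values, and base R's `scale()` is assumed here as one simple way to center and scale.

```r
library(tidyverse)

# Hypothetical stand-in for `players`; the values are invented for illustration.
toy_players <- tibble(
    subscribe    = c(TRUE, FALSE, TRUE, TRUE, FALSE),
    played_hours = c(30.3, 0.0, 3.8, 0.2, 2.3),
    Age          = c(9, 21, 17, NA, 17)
)

players_clean <- toy_players |>
    select(subscribe, played_hours, Age) |>
    drop_na() |>                                        # remove rows with missing values
    mutate(
        subscribe    = as.factor(subscribe),            # logical -> factor (levels FALSE, TRUE)
        Age          = as.numeric(scale(Age)),          # center to mean 0, scale to sd 1
        played_hours = as.numeric(scale(played_hours))
    )

players_clean
```

When the data is later split into training and testing sets, the centering and scaling parameters would normally be estimated from the training set only, so this sketch shows the transformation itself rather than where it belongs in a modelling workflow.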

# Data Description
## Players Dataset
Run the code below to load players.csv into a tibble named `players`.  

In [13]:
# load the players dataset from GitHub into an R dataframe (tibble). 
players_url <- "https://raw.githubusercontent.com/Aylin-Ab/dsci-100-2025w1-group-27/refs/heads/main/players.csv"
players <- read_csv(players_url)
players

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,TRUE,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,TRUE,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
⋮,⋮,⋮,⋮,⋮,⋮,⋮
Amateur,FALSE,f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436,2.3,Harlow,Male,17
Pro,TRUE,d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11,0.2,Ahmed,Other,


This data frame lists every unique player, along with data about each one. After loading the dataset, you can see that the `players` tibble consists of **7 variables** (columns) and **196 observations** (rows).  

  Here's what each variable in the dataset means:
  - **`experience`**: player's experience level, chosen from a menu with 5 different levels (see all experience levels in the output of the code below)
  - **`subscribe`**: whether the player subscribed to a game-related newsletter (TRUE = subscribed, FALSE = didn't subscribe)
  - **`hashedEmail`**: an anonymized version of each player's email address, used to link player data across the two datasets
  - **`played_hours`**: total number of hours, across all sessions, that the player has spent playing on the server
  - **`name`**: player's chosen display name/alias
  - **`gender`**: player's gender, chosen from a menu with 7 different options (see all gender levels in the output of the code below)
  - **`Age`**: player's age in years

In [14]:
# extract the experience column and find all unique experience levels
experience_levels <- players |>
    pull(experience) |>
    unique()

# extract the gender column and find all unique gender levels
gender_levels <- players |>
    pull(gender) |>
    unique()

experience_levels
gender_levels

The table below summarizes each variable in the players dataset, showing its original and suggested data type, and whether it is categorical or numerical.

The `experience` and `gender` variables are stored as *character* by default, but since each has only a few distinct levels, they can be treated as **factors** for analysis. The choice of data type ultimately depends on the research question, however. Variables marked N/A already have the correct data type.

| Variable Name | Original Data Type | Suggested Data Type | Categorical vs Numerical |
| ------------- | ---------------    | --------------------| -----------------------  |
| experience    | chr (character)    | fct (factor)        |  Categorical             |
| subscribe     | lgl (logical)      | N/A                 |  *Categorical (Logical)  |
| hashedEmail   | chr (character)    | N/A                 |  Categorical             |
| played_hours  | dbl (double)       | N/A                 |  Numerical               |
| name          | chr (character)    | N/A                 |   Categorical            |
| gender        | chr (character)    | fct (factor)        |   Categorical            |
| Age           | dbl (double)       | N/A                 |  Numerical               |
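
As a hedged sketch of the suggested conversion, the character columns could be turned into factors with `mutate()` and `across()`. The `toy_players` tibble below is a hypothetical stand-in that reuses a few values visible in the preview above; it is not the full dataset.

```r
library(tidyverse)

# Hypothetical stand-in for `players`, using values seen in the preview rows.
toy_players <- tibble(
    experience = c("Pro", "Veteran", "Amateur", "Pro"),
    gender     = c("Male", "Male", "Male", "Other")
)

# Convert both character columns to factors in one step.
# as_factor() (from forcats) orders levels by first appearance.
players_fct <- toy_players |>
    mutate(across(c(experience, gender), as_factor))

levels(players_fct$experience)
levels(players_fct$gender)
```

Using `as_factor()` keeps the level order tied to the data; base R's `as.factor()` would sort the levels alphabetically instead, which may or may not be what a given plot or model needs.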

## Sessions Dataset
Run the code below to load sessions.csv into a tibble named `sessions`.  
  
  
This data frame lists every individual play session by each player, along with data about each session. As you can see below, the `sessions` data frame consists of **5 variables** (columns) and **1535 observations** (rows).  
  
  
Here's what each variable in the dataset means:
- **`hashedEmail`**: an anonymized version of each player's email address used to link player data across the two datasets
- **`start_time`**: start timestamp of each play session, including the date and the time (h:m)
- **`end_time`**: end timestamp of each play session, including the date and the time (h:m)
- **`original_start_time`**: the same value as `start_time`, recorded as UNIX time in milliseconds
- **`original_end_time`**: the same value as `end_time`, recorded as UNIX time in milliseconds
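
To illustrate what a UNIX time in milliseconds means, the sketch below converts one such value into a readable timestamp. The value of `ms` is a hypothetical example chosen for illustration, and the UTC time zone is an assumption; the time zone used by the dataset is not stated here.

```r
# Hypothetical millisecond timestamp, chosen for illustration only.
ms <- 1719771120000

# POSIXct counts seconds since 1970-01-01 00:00 UTC, so divide by 1000 first.
start_utc <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "UTC")

format(start_utc, "%d/%m/%Y %H:%M")
```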

In [15]:
# load the sessions dataset from GitHub into an R dataframe (tibble). 
sessions_url <- "https://raw.githubusercontent.com/Aylin-Ab/dsci-100-2025w1-group-27/refs/heads/main/sessions.csv"
sessions <- read_csv(sessions_url)
sessions

Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1.71977e+12,1.71977e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1.71867e+12,1.71867e+12
⋮,⋮,⋮,⋮,⋮
fd6563a4e0f6f4273580e5fedbd8dda64990447aea5a33cbb5e894a3867ca44d,25/07/2024 06:15,25/07/2024 06:22,1.72189e+12,1.72189e+12
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,20/05/2024 02:26,20/05/2024 02:45,1.71617e+12,1.71617e+12


The table below summarizes each variable in the sessions dataset, showing its data type and whether it is categorical or numerical.

The `start_time` and `end_time` variables are currently stored as character data, which is fine for now. However, since they contain both date and time information, one might later separate these components (extracting the date and the hour:min), depending on the specific question they want to answer. 

| Variable Name       | Data Type       | Categorical vs Numerical |
| ------------------- | --------------- | ------------------------ |
| hashedEmail         | chr (character) | Categorical              |
| start_time          | chr (character) | Categorical              |
| end_time            | chr (character) | Categorical              |
| original_start_time | dbl (double)    | Numerical                |
| original_end_time   | dbl (double)    | Numerical                |
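
If the date and time components are needed later, one possible approach is to parse the character timestamps with lubridate and then extract the pieces. This is a sketch under assumptions: `toy_sessions` reuses two timestamp pairs from the preview above, the day/month/year hour:minute format is inferred from those rows, and `dmy_hm()` treats the values as UTC by default.

```r
library(tidyverse)
library(lubridate)

# Small stand-in for `sessions`, using timestamps shown in the preview rows.
toy_sessions <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

sessions_parsed <- toy_sessions |>
    mutate(
        start_time   = dmy_hm(start_time),   # character -> date-time (UTC by default)
        end_time     = dmy_hm(end_time),
        start_date   = as_date(start_time),  # date component only
        start_hour   = hour(start_time),     # hour of day, 0 to 23
        duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))
    )

sessions_parsed
```

Once parsed, the timestamps behave as numerical values (they can be subtracted and ordered), so session length or time of day could be used as features if a later question calls for them.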