**<ins>1. Data Description:</ins>**
===
---

<ins>players.csv</ins> Dataset
---

Observations: 196

Variables (7):

| Variable               | Type                | Description  |
|------------------------|---------------------|--------------|
| **experience**         |<chr\>| Player's level of in-game experience.|
| **subscribe**          |<lgl\>| Indicates if player is subscribed to in-game newsletters.|
| **hashedEmail**        |<chr\>| Player's anonymized (hashed) email address.|
| **played_hours**       |<dbl\>| Total time the player has spent playing (hours).|
| **name**               |<chr\>| Player's in-game name.|
| **gender**             |<chr\>| Player's gender.|
| **Age**                |<dbl\>| Player's age.|

Issues: 

- Presence of outliers in played_hours may skew data analysis.
- experience is stored as a character type instead of a factor.
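
The two issues above could be handled along the following lines. This is only a sketch on a small made-up tibble (`players_toy` is hypothetical, not the real data), and the 1.5 × IQR rule used to flag outliers is an assumption rather than a choice made in this report.

```r
library(dplyr)

# Hypothetical toy data standing in for players.csv (values are made up)
players_toy <- tibble(
    experience = c("Pro", "Amateur", "Veteran", "Amateur", "Pro", "Veteran"),
    played_hours = c(0.5, 1.2, 0.0, 2.1, 150.0, 0.8)
)

# Assumed outlier rule: more than 1.5 * IQR above the third quartile
q3 <- unname(quantile(players_toy$played_hours, 0.75, na.rm = TRUE))
upper_fence <- q3 + 1.5 * IQR(players_toy$played_hours, na.rm = TRUE)

players_flagged <- players_toy |>
    mutate(
        experience = factor(experience),          # character -> factor
        is_outlier = played_hours > upper_fence   # TRUE for unusually long playtime
    )

players_flagged
```

Flagging rather than dropping the extreme values keeps the decision about how to treat them separate from the cleaning step.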

Summary Statistics are computed in Part 3.

<ins>sessions.csv</ins> Dataset
---

Observations: 1535

Variables (5):

| Variable               | Type                | Description  |
|------------------------|---------------------|--------------|
| **hashedEmail**        |<chr\>| Player's anonymous email.|
| **start_time**         |<chr\>| Time playing session started (relative).|
| **end_time**           |<chr\>| Time playing session ended (relative).|
| **original_start_time**|<dbl\>| Time playing session started (absolute).|
| **original_end_time**  |<dbl\>| Time playing session ended (absolute).|

Issues: 

- start_time and end_time contain more than one value per cell (date and time).
- start_time and end_time are non-numeric types.
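
Both issues could be addressed by parsing the strings into date-time values. The sketch below uses made-up rows (`sessions_toy` is hypothetical) and assumes the strings follow a "DD/MM/YYYY HH:MM" layout; if the real file uses a different layout, a different `lubridate` parser would be needed.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rows; the "DD/MM/YYYY HH:MM" format is an assumption
sessions_toy <- tibble(
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
    end_time   = c("30/06/2024 18:24", "17/06/2024 23:46")
)

sessions_parsed <- sessions_toy |>
    mutate(
        start_time = dmy_hm(start_time),   # character -> date-time (UTC by default)
        end_time = dmy_hm(end_time),
        # session length in minutes, computed from the parsed columns
        session_minutes = as.numeric(difftime(end_time, start_time, units = "mins"))
    )

sessions_parsed
```

Once parsed, the date and the time of day can be pulled out separately with helpers such as `date()` and `hour()` if they are needed as separate variables.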

---
The data were collected by a research group in Computer Science at UBC, led by Frank Wood, through a Minecraft server where players' in-game actions were recorded as they navigated through the game.

**<ins>2. Questions:</ins>**
===
---

**Broad:**
What player characteristics and behaviors are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

**Specific:**
Can a player's playtime and age predict whether they subscribe to the game-related newsletter in players.csv?

#### Addressing the Question

The response variable is subscribe (TRUE/FALSE), with Age and played_hours as predictors. K-NN classification will be used to predict subscription status from the players whose age and playtime are most similar.

#### Wrangling Plan

The data will be standardized so that Age and played_hours are on the same scale. Only subscribe, played_hours, and Age will be kept, and subscribe will be converted to a factor for classification.
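
A minimal sketch of this plan, run on a made-up tibble (`players_toy` and the `_scaled` column names are hypothetical). It standardizes with base R's `scale()` purely for illustration; in a modeling workflow the centering and scaling would more likely be learned from the training set only.

```r
library(dplyr)

# Hypothetical toy data with the same column names as players.csv
players_toy <- tibble(
    experience = c("Pro", "Amateur", "Veteran", "Amateur"),
    subscribe = c(TRUE, FALSE, TRUE, TRUE),
    played_hours = c(30.3, 0.0, 3.8, 0.7),
    Age = c(9, 17, 21, NA)
)

players_clean <- players_toy |>
    select(subscribe, played_hours, Age) |>        # keep only the response and predictors
    mutate(subscribe = as.factor(subscribe)) |>    # logical -> factor with levels FALSE, TRUE
    mutate(
        played_hours_scaled = as.numeric(scale(played_hours)),  # (x - mean) / sd
        Age_scaled = as.numeric(scale(Age))                     # missing ages stay NA
    )

players_clean
```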

**<ins>3. Exploratory Data Analysis and Visualization:</ins>**
===
---

### Loading ***tidyverse*** library and reading data sets

In [None]:
library(tidyverse)

players_url <- "https://raw.githubusercontent.com/veronicahzh/dsci100-groupproject30/refs/heads/main/players.csv"
sessions_url <- "https://raw.githubusercontent.com/veronicahzh/dsci100-groupproject30/refs/heads/main/sessions.csv"

players <- read_csv(file = players_url)
sessions <- read_csv(file = sessions_url)

head(players)
head(sessions)

### Wrangling Data Into a Tidy Format

Since the <ins>players.csv</ins> dataset is already tidy, there is no need for additional wrangling.

### Computing summary statistics for all quantitative variables (played_hours, Age) in <ins>players.csv</ins>

In [None]:
# storing the selected columns in a variable
players_selected <- select(players, played_hours, Age)

# calculating the mean
players_mean <- players_selected |>
    map_df(mean, na.rm = TRUE) |>
    rename(mean_played_hours = played_hours, mean_Age = Age)

players_mean

# calculating the median
players_median <- players_selected |>
    map_df(median, na.rm = TRUE) |>
    rename(median_played_hours = played_hours, median_Age = Age)

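# calculating the mode (most frequent value; ties return the first value encountered)
# note: base R's mode() returns the storage type, not the statistical mode
stat_mode <- function(x) {
    x <- x[!is.na(x)]
    unique_vals <- unique(x)
    unique_vals[which.max(tabulate(match(x, unique_vals)))]
}

players_mode <- players_selected |>
    map_df(stat_mode) |>
    rename(mode_played_hours = played_hours, mode_Age = Age)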

# calculating the Standard Deviation (SD)
players_sd <- players_selected |>
    map_df(sd, na.rm = TRUE) |>
    rename(sd_played_hours = played_hours, sd_Age = Age)

# calculating the min 
players_min <- players_selected |>
    map_df(min, na.rm = TRUE) |>
    rename(min_played_hours = played_hours, min_Age = Age)

# calculating the max
players_max <- players_selected |>
    map_df(max, na.rm = TRUE) |>
    rename(max_played_hours = played_hours, max_Age = Age)

# all summary statistics of played_hours and Age
summary_combined <- bind_cols(players_mean, players_median, players_mode, players_sd, players_max, players_min)
summary_combined
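
As an alternative layout, the same kind of statistics can be computed in one pass with one row per variable, which may be easier to read than a single wide row. This is a sketch on made-up values (`players_toy` is hypothetical); in the notebook the input would be `players_selected`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; in the notebook this would be players_selected
players_toy <- tibble(
    played_hours = c(0.0, 0.0, 1.5, 3.0, 10.5),
    Age = c(17, 17, 21, 25, NA)
)

summary_long <- players_toy |>
    # stack both columns so each statistic is computed once per variable
    pivot_longer(everything(), names_to = "variable", values_to = "value") |>
    group_by(variable) |>
    summarize(
        mean = mean(value, na.rm = TRUE),
        median = median(value, na.rm = TRUE),
        sd = sd(value, na.rm = TRUE),
        min = min(value, na.rm = TRUE),
        max = max(value, na.rm = TRUE)
    )

summary_long
```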

### Exploratory Visualizations of <ins>players.csv</ins>

#### Plot of total time played (in hours) vs age (in years), colored by subscription

This scatter plot shows a non-linear relationship between the 2 predictor variables, played_hours and Age, with noticeable outliers in time played.

In [None]:
library(RColorBrewer)
options(repr.plot.width = 10, repr.plot.height = 10)

age_vs_hours_sub <- players |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age (years)", y = "Total Time Played (hours)", color = "Subscription Status") +
    ggtitle("Plot of Total Time Played vs Age") +
    theme(text = element_text(size = 15)) +
    scale_color_brewer(palette = "Set2")

age_vs_hours_sub

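#### Total time played (in hours) by subscription status

This bar plot shows the summed playtime for each subscription status, indicating that subscribers account for more total playtime than non-subscribers. Because these are totals, the gap also reflects how many players are in each group.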

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

# summarize total played hours by subscription status
total_playtime <- players |>
    group_by(subscribe) |>
    summarize(total_hours = sum(played_hours, na.rm = TRUE))

# bar plot
hrs_by_sub <- total_playtime |>
    ggplot(aes(x = subscribe, y = total_hours, fill = subscribe)) +
    geom_bar(stat = "identity") +
    labs(x = "Subscribed?", y = "Total Time Played (hours)", fill = "Subscription Status") +
    ggtitle("Plot of Total Time Played by Subscription Status") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Set2")

hrs_by_sub

#### Distribution of age (in years), colored by subscription status

This histogram shows the number of players at each age for each subscription status, indicating that most subscribers are between 15 and 25 years old.

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)

age_by_sub <- players |>
    ggplot(aes(x = Age, fill = subscribe)) +
    geom_histogram(binwidth = 2, position = "identity", alpha = 0.5) +
    labs(x = "Age (years)", y = "Number of Players", fill = "Subscription Status") +
    ggtitle("Plot of Age by Subscription Status") +
    theme(text = element_text(size = 15)) +
    scale_fill_brewer(palette = "Set2")

age_by_sub

**<ins>4. Methods and Plan:</ins>**
===
---

K-NN classification will predict subscription status using played_hours and Age as numeric predictors and subscribe as the categorical response variable. This method is appropriate because K-NN works with numeric inputs while predicting categorical outputs. The first exploratory plot above suggests a non-linear relationship between the predictors, making K-NN a better choice than linear regression, which assumes linearity. Since the goal is to classify subscription status rather than predict a continuous value, K-NN classification is used instead of K-NN regression.

K-NN assumes that closer data points are more similar, which makes sense since players with similar playtime and age likely have similar engagement. However, K-NN is sensitive to outliers, can overfit with small k values, and slows down with larger datasets. Standardizing the data ensures both predictors contribute equally to distance calculations.
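
To illustrate why standardization matters for the distance calculation, the sketch below compares Euclidean distances before and after scaling on four made-up players (`toy` and `dist_to_first` are hypothetical names, and the values are invented for the example).

```r
# Hypothetical players (values are made up)
toy <- data.frame(
    played_hours = c(10, 12, 40, 200),
    Age = c(18, 45, 19, 30)
)

# Euclidean distance from the first player to every other player
dist_to_first <- function(df) {
    m <- as.matrix(df)
    diffs <- sweep(m, 2, m[1, ])   # subtract row 1 from every row
    sqrt(rowSums(diffs^2))[-1]     # drop the zero distance of row 1 to itself
}

raw_dist <- dist_to_first(toy)            # on the original units
scaled_dist <- dist_to_first(scale(toy))  # each column has mean 0 and sd 1

# Index of the nearest neighbour of player 1 (add 1 because row 1 was dropped)
nearest_raw <- unname(which.min(raw_dist)) + 1
nearest_scaled <- unname(which.min(scaled_dist)) + 1

nearest_raw      # player 2: similar hours, very different age
nearest_scaled   # player 3: similar age, moderately different hours
```

On the raw scale the wide range of played_hours makes a 30-hour gap look larger than a 27-year age gap, so the nearest neighbour changes once both columns are put on the same scale.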

The data will be split into 75% training and 25% testing sets. 5-fold cross-validation on the training set will be used to tune k, so that k is neither so small that the model overfits nor so large that it underfits. The k with the best cross-validation accuracy will be used to refit the model on the full training set, and the test set will be used only once to estimate final accuracy.
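
One way this plan might look with the tidymodels packages is sketched below. It runs on synthetic data (`players_toy` is hypothetical and its labels are random, so the accuracy it reports is meaningless), and the seed, the grid of k values, and the "rectangular" weighting are assumptions rather than decisions made in this report. It also requires the kknn package as the engine.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the split and folds are reproducible

# Hypothetical synthetic data with the same column names as the wrangled players data
players_toy <- tibble(
    played_hours = rexp(120, rate = 0.2),
    Age = round(runif(120, min = 10, max = 50)),
    subscribe = factor(sample(c("TRUE", "FALSE"), 120, replace = TRUE))
)

# 75% training / 25% testing, stratified by the response
players_split <- initial_split(players_toy, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize inside the recipe so scaling is learned from training data only
knn_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation on the training set over an assumed grid of k values
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = players_vfold, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# k with the highest mean cross-validation accuracy
best_k <- cv_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Refit with the chosen k on the full training set, then evaluate once on the test set
knn_final <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

final_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_final) |>
    fit(data = players_train)

test_accuracy <- predict(final_fit, players_test) |>
    bind_cols(players_test) |>
    accuracy(truth = subscribe, estimate = .pred_class)

test_accuracy
```

Keeping the test set out of the tuning step means the final accuracy is an estimate on data the model has not seen while k was being chosen.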