In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
#source('cleanup.R')

In [None]:
#players <- read_csv("data/players.csv")
#sessions <- read_csv("data/sessions.csv")
players <- read_csv("https://raw.githubusercontent.com/Jay7615/Project/refs/heads/main/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/Jay7615/Project/refs/heads/main/sessions.csv")
players

In [None]:

# Summary statistics
cat("\nSummary statistics for players dataset:")
summary(players)
cat("\nSummary statistics for sessions dataset:")
summary(sessions)

# Check for missing values
cat("\nMissing values in players dataset:")
print(colSums(is.na(players)))
cat("\nMissing values in sessions dataset:")
print(colSums(is.na(sessions)))




# (1) Data Description:

# Data Description for `players.csv`

## Dataset Overview
- **Number of Observations**: 196  
- **Number of Variables**: 7  
- **Data Collection Method**: Collected from a Minecraft server tracking player actions, demographics, and subscription status.  

---

## Variables Summary

| Variable Name     | Type       | Description                                                                 | Notes                                                                                   |
|-------------------|------------|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|
| `experience`      | Categorical (chr) | Player's self-reported skill level                                         | `Beginner`, `Amateur`, `Regular`, `Veteran`, `Pro`                                      |
| `subscribe`       | Categorical (lgl)| Whether the player subscribed to the newsletter                            | `TRUE`/`FALSE`                                                                          |
| `hashedEmail`     | String (chr)     | Unique hashed identifier for player emails                                 | SHA-256 hashes (e.g., `f6daba42...`)                                                    |
| `played_hours`    | Numeric (dbl)     | Total hours the player spent on the server                                 | Float values (e.g., `30.3`, `0`, `223.1`)                                               |
| `name`            | String (chr)    | Player's name                                                              | Unique per player (e.g., `Morgan`, `Kyrie`, `Akio`)                                     |
| `gender`          | Categorical (chr) | Player's self-reported gender identity                                     | `Male`, `Female`, `Non-binary`, `Prefer not to say`, `Two-Spirited`, `Agender`, `Other` |
| `Age`             | Numeric (dbl) | Player's age                                                               | Range: `8`–`50`; includes `NA` (missing values)                                         |

---

## Summary Statistics

### Numeric Variables
- **`played_hours`**  
  - **Mean**: 5.846 hours  
  - **Median**: 0.1 hours  
  - **Range**: 0–223.1 hours  
  - **Standard Deviation**: 28.4 hours  

- **`Age`**  
  - **Mean**: 20.52 years  
  - **Median**: 19.00 years  
  - **Range**: `8`–`50` years
  - **Standard Deviation**: 6.17 years
  - **Missing Values**: 2 entries

### Categorical Variables
- **`experience`**
    - `Amateur` 32% (64)
    - `Beginner` 18% (35)
    - `Regular` 18% (35)
    - `Veteran` 24% (48)
    - `Pro` 7% (14)


- **`subscribe`**  
  - 73% (144) subscribed (`TRUE`), 27% (52) did not (`FALSE`).  

- **`gender`**  
  - Majority: `Male` 63% (124), `Female` 19% (37).  
  - Minorities: `Non-binary` 8% (15), `Prefer not to say` 6% (11), `Two-Spirited` 3% (6), `Agender` 1% (2), `Other` <1% (1).

---

## Data Issues
1. **Missing Values**:  
   - `Age` has 2 `NA` entries (`Devin` and `Ahmed` have `NA`).  
2. **Ambiguous Definitions**:  
   - `experience` levels are not clearly defined (self-reported vs. measured).  
   - `played_hours` timeframe is unclear (lifetime total or specific period?).  
3. **Inconsistent naming conventions**:
   - Most variable names are in snake_case  (e.g., played_hours)
   - `hashedEmail` uses camelCase
   - `Age` is capitalized
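
One way to address the naming inconsistency is to rename the two outlying columns before analysis. The sketch below uses a small made-up tibble with the same column names as `players.csv`; the new names `hashed_email` and `age` are suggestions only and are not used elsewhere in this notebook.

```r
library(dplyr)
library(tibble)

# Hypothetical two-row stand-in for the players data (values are made up)
players_demo <- tibble(
  hashedEmail = c("abc123", "def456"),
  Age = c(17, NA),
  played_hours = c(30.3, 0)
)

# rename(new_name = old_name) leaves all other columns untouched
players_renamed <- players_demo |>
  rename(hashed_email = hashedEmail, age = Age)

names(players_renamed)
```

If the columns were renamed this way, every later reference to `Age` and `hashedEmail` would need to be updated to match.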


# Data Description for `sessions.csv`
- **Number of Observations**: 1,535  
- **Number of Variables**: 5  
- **Data Collection Method**: Logged player session data from a Minecraft server.  

---

## Variables Summary

| Variable Name         | Type      | Description                                                       | Notes |
|-----------------------|----------|-------------------------------------------------------------------|-------|
| `hashedEmail`        | String   | Unique hashed identifier for player emails                        | SHA-256 hashes, anonymized. |
| `start_time`         | String   | Session start time (formatted as `DD/MM/YYYY HH:MM`)              | Needs conversion to datetime. |
| `end_time`           | String   | Session end time (formatted as `DD/MM/YYYY HH:MM`)                | Some missing values (`NA`). |
| `original_start_time`| Numeric (dbl) | Unix timestamp of session start, in milliseconds            | Needs conversion to datetime. |
| `original_end_time`  | Numeric (dbl) | Unix timestamp of session end, in milliseconds              | Some missing values (`NA`). |

---

## Data Issues
1. **Missing Values**:  
   - `end_time` and `original_end_time` each have two missing values.  
   - If `end_time` is missing, it is unclear whether the session was still ongoing or the data was lost. 
2. **Data Format Issues**:  
   - `start_time` and `end_time` should be converted to datetime objects.  
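
A minimal sketch of how the two issues above might be handled together: parse the time strings, then flag sessions without a recorded end instead of silently dropping them. The rows and the column names `missing_end` and `duration_mins` are made up for illustration.

```r
library(lubridate)

# Hypothetical rows in the same format as sessions.csv (values are made up)
sessions_demo <- data.frame(
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", NA),
  stringsAsFactors = FALSE
)

# dmy_hm() parses "DD/MM/YYYY HH:MM" strings; a missing input stays NA
sessions_demo$start_time <- dmy_hm(sessions_demo$start_time)
sessions_demo$end_time <- dmy_hm(sessions_demo$end_time)

# Flag sessions without a recorded end so they can be reviewed later
sessions_demo$missing_end <- is.na(sessions_demo$end_time)

# Duration in minutes; NA wherever the end time is missing
sessions_demo$duration_mins <- as.numeric(
  difftime(sessions_demo$end_time, sessions_demo$start_time, units = "mins")
)

sessions_demo
```

Keeping the flag makes it possible to report how many sessions were excluded from any duration summary.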

---


In [None]:

cat("\nNumber of observations in players dataset:", nrow(players))
cat("\nNumber of variables in players dataset:", ncol(players))
cat("\nNumber of observations in sessions dataset:", nrow(sessions))
cat("\nNumber of variables in sessions dataset:", ncol(sessions))
gender_counts <- table(players$gender)
gender_counts
gender_proportions <- prop.table(gender_counts)
gender_proportions
experience_counts <- table(players$experience)
experience_counts
experience_proportions <- prop.table(experience_counts)
experience_proportions
subscribe_counts <- table(players$subscribe)
subscribe_counts
subscribe_proportions <- prop.table(subscribe_counts)
subscribe_proportions
sum(duplicated(players$hashedEmail))
sum(duplicated(players$name))
sd(players$played_hours, na.rm = TRUE) 
sd(players$Age, na.rm = TRUE)  

In [None]:
sorted_data <- players %>%
  arrange(desc(played_hours))
head(sorted_data) 

In [None]:
sum(is.na(players$Age))  
players_na_age <- players |>
                  filter(is.na(Age)) 
players_na_age
colSums(is.na(players))

In [None]:
zero_hours_players <- players |>
                      filter(played_hours == 0)
zero_hours_players
zero_hours_count <- nrow(zero_hours_players)
total_players <- nrow(players)

zero_hours_percentage <- (zero_hours_count / total_players) * 100
zero_hours_percentage

In [None]:
# Load necessary libraries
library(dplyr)
library(lubridate)

# Load dataset
df <- sessions 

# Convert date-time columns to proper format
df$start_time <- dmy_hm(df$start_time)
df$end_time <- dmy_hm(df$end_time)

# Convert Unix timestamps (milliseconds since 1970-01-01) to datetime in UTC
df$original_start_time <- as.POSIXct(df$original_start_time / 1000, origin = "1970-01-01", tz = "UTC")
df$original_end_time <- as.POSIXct(df$original_end_time / 1000, origin = "1970-01-01", tz = "UTC")

# Calculate session duration
df$session_duration <- as.numeric(difftime(df$end_time, df$start_time, units = "mins"))

# Summary statistics
summary_report <- df |> summarise(
  total_sessions = n(),
  missing_end_times = sum(is.na(end_time)),
  avg_session_duration = mean(session_duration, na.rm = TRUE),
  median_session_duration = median(session_duration, na.rm = TRUE),
  min_session_duration = min(session_duration, na.rm = TRUE),
  max_session_duration = max(session_duration, na.rm = TRUE)
)

# Print summary report
print(summary_report)


# (2) Questions:

### **Broad Research Question**  
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types? 

---

### **Specific Predictive Question**  
**Can a player’s *experience level*, *played hours*, *age*, and *gender* predict their likelihood of subscribing to the newsletter (*subscribe*) in the Minecraft server dataset?**

---

### **Response Variable**  
- **`subscribe`**: Logical variable (`TRUE`/`FALSE`) indicating whether the player subscribed to the newsletter.  

---

### **Data Wrangling Plan**  
- Convert `subscribe`, `experience`, and `gender` into factors for modeling.
- Handle missing values in `Age` (e.g., impute with the median or exclude rows with `NA`).

---

### **How the Data Answers the Question**  
1. **Pattern Identification**:  
   - KNN classifies players based on similarity to neighbors. For example:  
     - Players with high `played_hours` and `Regular` status may cluster together and show higher subscription rates.  
     - Younger players (`Age` < 20) might form a distinct group with unique subscription behaviors.  

2. **Predictive Power**:  
   - The model quantifies how well the combination of `experience`, `played_hours`, `Age`, and `gender` predicts subscription status.   

3. **New Insights**:  
   - If `played_hours` and `experience` are key predictors, stakeholders can:  
     - Target `Amateur`/`Veteran` players for newsletter promotions.  
     - Incentivize people who have played a lot to boost subscriptions.  
   - Gender-based trends might suggest new strategies for targeting specific groups. 
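
The neighbour-based classification described in the list above can be sketched in a few lines of base R. This is a simplified illustration with made-up, already-standardized values and only two numeric predictors; it is not the model fitted later in this notebook.

```r
# Hypothetical, already-standardized training points (values are made up)
train <- data.frame(
  played_hours = c(-0.5, -0.4, 1.2, 1.5, 0.9),
  age          = c(-1.0,  0.8, 0.1, -0.3, 0.4),
  subscribe    = factor(c("FALSE", "FALSE", "TRUE", "TRUE", "TRUE"))
)
new_player <- c(played_hours = 1.0, age = 0.0)

# Euclidean distance from the new player to every training point
dists <- sqrt((train$played_hours - new_player[["played_hours"]])^2 +
              (train$age - new_player[["age"]])^2)

# Take the k closest points and predict the most common label among them
k <- 3
nearest <- order(dists)[1:k]
votes <- table(train$subscribe[nearest])
prediction <- names(votes)[which.max(votes)]
prediction
```

Here the three closest training points are all subscribers, so the new player is classified as `TRUE`.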



# (3) Exploratory Data Analysis and Visualization

In [None]:
players <- players |>
  mutate(
    experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")),
    subscribe = factor(subscribe),
    gender = factor(gender)
  )
# Impute missing ages with the median of the observed ages
players$Age <- ifelse(is.na(players$Age), median(players$Age, na.rm = TRUE), players$Age)

| Variable      | Mean       |
|---------------|------------|
| played_hours  | 5.846 hours  |
| Age           | 20.52 years |

In [None]:
ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Played Hours vs. Age by Subscription Status",
    x = "Age (Years)",
    y = "Hours Played",
    color = "Subscribed"
  ) +
  theme_minimal()

**Insight**:
- Non-subscribers appear only among players with few hours played (< 10).
- Players with more than about 10 hours played all appear as subscribers.
- Younger players (< 20 years) with moderate playtime (10–50 hours) are more likely to subscribe.
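
A visual impression like this can be checked numerically by binning `played_hours` at the 10-hour threshold and comparing subscription rates. The sketch below runs on a small made-up tibble; the name `rate_by_bin` is hypothetical, and the same pipeline could be pointed at `players` instead.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the players data (values are made up)
players_demo <- tibble(
  played_hours = c(0, 0.1, 0.5, 2, 8, 12, 30, 55, 150),
  subscribe = factor(c("FALSE", "TRUE", "FALSE", "TRUE", "FALSE",
                       "TRUE", "TRUE", "TRUE", "TRUE"))
)

# Bin hours at the 10-hour threshold and compute the share of subscribers
rate_by_bin <- players_demo |>
  mutate(hours_bin = cut(played_hours,
                         breaks = c(-Inf, 10, Inf),
                         labels = c("<= 10 h", "> 10 h"))) |>
  group_by(hours_bin) |>
  summarise(n = n(), subscribe_rate = mean(subscribe == "TRUE"))

rate_by_bin
```

Comparing against `"TRUE"` keeps the calculation valid after `subscribe` has been converted to a factor.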

In [None]:
players |>
  group_by(gender, subscribe) |>
  summarise(count = n(), .groups = "drop_last") |>  # keep grouping by gender
  mutate(proportion = count / sum(count)) |>        # share within each gender
  ggplot(aes(x = gender, y = proportion, fill = subscribe)) +
  geom_col(position = "dodge") +
  labs(
    title = "Subscription Rate by Gender",
    x = "Gender",
    y = "Proportion Subscribed",
    fill = "Subscribed"
  ) +
  theme_minimal()

**Insight**:
- Most gender groups have more subscribers than non-subscribers (except `Prefer not to say`).
- Gender-specific campaigns might improve engagement in underrepresented groups.

In [None]:
ggplot(players, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("#FF6B6B", "#4ECDC4")) +
  labs(
    title = "Subscription Rate by Experience Level",
    x = "Experience Level",
    y = "Proportion Subscribed",
    fill = "Subscribed"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

**Insight**:
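- Some experience tiers (e.g., `Beginner`, `Pro`) show higher subscription proportions; because `Pro` is the highest tier, the pattern is not strictly tied to lower experience.
- Stakeholders could use these differences to target specific experience tiers for newsletter promotions.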

# (4) Methods and Plan
# Method Proposal: K-Nearest Neighbors (KNN) Classification

## **Why KNN Classification is Appropriate**  
1. **Problem Type**: Binary classification; the logical response `subscribe` (`TRUE`/`FALSE`) can be converted into a categorical factor.  
2. **Mixed Data Types**: Handles both numeric (`played_hours`, `Age`) and categorical (`experience`, `gender`) predictors after preprocessing.  
3. **Non-Linearity**: Can capture complex relationships between predictors without assuming linearity (e.g., high `played_hours` + `Regular` status).

## **Key Assumptions**  
1. **Feature Scaling**: Numeric variables must be normalized to ensure equal weighting in distance calculations.  
2. **Relevant Predictors**: The selected features (`experience`, `played_hours`, `Age`, `gender`) are meaningful for predicting subscriptions.
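
To illustrate why scaling matters for distance calculations, the sketch below compares Euclidean distances before and after standardizing two numeric columns. The three rows are made up; base R's `scale()` is used here as a stand-in for the centering and scaling that `step_normalize()` performs inside a recipe.

```r
# Hypothetical players on the original scales (values are made up)
demo <- data.frame(
  played_hours = c(0.1, 223.1, 5.0),
  age          = c(17, 19, 45)
)

# Unscaled distances: differences in played_hours dominate
dist(demo)

# Standardize each column to mean 0 and sd 1, then recompute distances
demo_scaled <- scale(demo)
dist(demo_scaled)
```

On the raw scale, a 223-hour gap outweighs any age difference; after standardizing, both predictors contribute on comparable scales.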
   
## **Potential Limitations**  
1. **Outliers**: Extreme `played_hours` values (e.g., 223 hours) and many players with little to no hours played (e.g., < 1) may distort distance metrics.  
2. **Class Imbalance**: Subscribed players (`TRUE`) dominate (73%), potentially biasing predictions.  
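
A quick way to see the effect of the imbalance is to compute the accuracy of a trivial classifier that always predicts the majority class. The counts are the ones reported for `subscribe` in the data description; the variable names are chosen for this sketch only.

```r
# Counts of subscribe = TRUE / FALSE reported in the data description
n_subscribed <- 144
n_not_subscribed <- 52

# Accuracy of a classifier that always predicts the majority class (TRUE)
baseline_accuracy <- n_subscribed / (n_subscribed + n_not_subscribed)
round(baseline_accuracy, 3)  # 0.735
```

Any fitted model should be judged against this baseline rather than against 50%.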

### **Metrics for Comparison**
- **Accuracy**: Overall proportion of correct predictions.
- **Precision**: Proportion of true subscribers among predicted subscribers (minimize false positives).
- **Recall**: Proportion of actual subscribers correctly identified (minimize false negatives).
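
The three metrics can be computed directly from the cells of a confusion matrix. The sketch below uses made-up labels and treats `TRUE` (subscribed) as the positive class.

```r
# Hypothetical true labels and predictions (values are made up)
truth <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "TRUE"),
                levels = c("FALSE", "TRUE"))
pred  <- factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE", "FALSE", "TRUE"),
                levels = c("FALSE", "TRUE"))

# Confusion-matrix cells, with "TRUE" as the positive class
tp <- sum(pred == "TRUE"  & truth == "TRUE")
fp <- sum(pred == "TRUE"  & truth == "FALSE")
fn <- sum(pred == "FALSE" & truth == "TRUE")
tn <- sum(pred == "FALSE" & truth == "FALSE")

accuracy  <- (tp + tn) / length(truth)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
c(accuracy = accuracy, precision = precision, recall = recall)
```

When using `yardstick` instead, note that it treats the first factor level as the event by default; with levels `FALSE`, `TRUE`, `precision()` and `recall()` would likely need `event_level = "second"` to match the definitions above.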

## **How to Process the Data**  
- **Split**: 80% training, 20% testing.
- **Choosing K**: Use **5-fold cross-validation** on the training set to choose K (K = 8).
- **Stratification**: Ensure the training and testing sets have the same `subscribe` distribution.
- **Timing**: Split before preprocessing so that no information from the testing set leaks into the training process.

In [None]:
set.seed(123)
players_split <- initial_split(players, prop = 0.8, strata = subscribe)
training_set <- training(players_split)
testing_set <- testing(players_split)

knn_recipe <- recipe(subscribe ~ experience + played_hours + Age + gender, 
                    data = training_set) |>
  step_impute_median(Age) |>                
  step_normalize(all_numeric_predictors())  

knn_spec <- nearest_neighbor(
  weight_func = "rectangular", 
  neighbors = tune()
) |>
  set_mode("classification") |> 
  set_engine("kknn")

cv_folds <- vfold_cv(training_set, v = 5, strata = subscribe)

knn_workflow <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec)

knn_tune <- knn_workflow |> 
  tune_grid(
    resamples = cv_folds,
    grid = tibble(neighbors = seq(2, 20, by = 2)),  # Wider range for k
    metrics = metric_set(accuracy, roc_auc)          # Multiple metrics
  )

# Select best k based on accuracy
best_k <- select_best(knn_tune, metric = "accuracy")$neighbors
best_k