# DSCI 100 Final Group Project

# Title

### Group 37- Jaana Rodrigo, Matthew Kyi, Mersara Shi

## Introduction

**<p style="color: green;">J- provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your report</p>**

Video games generate and store large amounts of behavioural data, which can be used to understand how players interact with games and game-related activities. At UBC, a lab led by Frank Wood runs a customized Minecraft research server that records player activity as they explore and interact with the game. Each unique player is logged in the players.csv dataset, which lists all players along with some basic information about each.

**Mer- Clearly state the question you tried to answer with your project**

<hr>

**<p style="color: green;">Mat- identify and fully describe the dataset that was used to answer the question</p>**

<p>We will be making use of the players.csv dataset to answer our question. Below, we load the players.csv dataset, and show the first six observations.</p>

### Players Dataset

The players dataset contains 196 observations (rows), corresponding to 196 unique player accounts (each account has a unique hashedEmail). It contains 7 variables, each storing information about a player.

Variables: 7

1. experience (chr)- Player's skill level.

2. hashedEmail (chr)- Player's email, hashed for privacy.

3. name (chr)- Player's name.

4. gender (chr)- Player's self-identified gender.

5. played_hours (dbl)- Number of hours played.

6. Age (dbl)- Age of the player.

7. subscribe (lgl)- Newsletter subscription status.

We identified two potential issues with the players dataset. Firstly, many of the values in the played_hours column are 0 due to inactive players. Secondly, the dataset has some extreme values/outliers that may affect our results in unwanted ways.
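Both issues can be quantified before modelling. The following is a minimal sketch in base R on a small made-up vector (`hours` is hypothetical toy data, not values from players.csv). The 1.5 * IQR rule used to flag extreme values is one common convention and an assumption here, not a method this project has committed to.

```r
# Hypothetical toy playtimes (NOT taken from players.csv)
hours <- c(0, 0, 0, 0.1, 0.5, 1, 2, 3, 150)

# Share of players with zero recorded playtime
zero_share <- mean(hours == 0)

# Flag extreme values with the 1.5 * IQR rule (an assumed convention)
q <- quantile(hours, probs = c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
is_outlier <- hours > upper_fence

zero_share
hours[is_outlier]
```

On the real data, the same check could be run with `players$played_hours` in place of `hours`, adding `na.rm = TRUE` where missing values are possible.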

<hr>

In [None]:
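# Load necessary libraries
library(tidyverse)
library(lubridate) # provides dmy_hm(); attached explicitly because older tidyverse versions do not load it
library(tidymodels)
library(repr)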

# Set display options for readability
options(repr.matrix.max.rows = 7)

In [None]:
# Load sessions dataset
download.file(url = "https://raw.githubusercontent.com/MatthewKyi/DSCI100-004-37/refs/heads/main/sessions.csv", destfile = "sessions-local.csv")
sessions <- read_csv("sessions-local.csv")

# Load players dataset
download.file(url = "https://raw.githubusercontent.com/MatthewKyi/DSCI100-004-37/refs/heads/main/players.csv", destfile = "players-local.csv")
players <- read_csv("players-local.csv")

In [None]:
# Preview the first six observations
head(players)

# Minimum, mean, median and maximum values for Age
age_summary <- players |>
  summarise(
    min = round(min(Age, na.rm = TRUE), 2),
    mean = round(mean(Age, na.rm = TRUE), 2),
    median = round(median(Age, na.rm = TRUE), 2),
    max = round(max(Age, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "Age") |>
  as_tibble()

age_summary

In [None]:
#Minimum, mean, median and maximum values for Hours Played
played_hours_summary <- players |>
  summarise(
    min = round(min(played_hours, na.rm = TRUE), 2),
    mean = round(mean(played_hours, na.rm = TRUE), 2),
    median = round(median(played_hours, na.rm = TRUE), 2),
    max = round(max(played_hours, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "played_hours") |>
  as_tibble()

played_hours_summary

In [None]:
#Count and percentage of each experience level
experience_counts <- players |>
  count(experience) |>
  rename(count = n) |>
  mutate(percentage = round((count / sum(count)) * 100, 2))

experience_counts

In [None]:
#Count and percentage of each gender
gender_counts <- players |>
  count(gender) |>
  rename(count = n) |>
  mutate(percentage = round((count / sum(count)) * 100, 2))

gender_counts

In [None]:
#Count and percentage of each subscription status
subscribe_counts <- players |>
  count(subscribe) |>
  rename(subscription_status = subscribe, count = n) |>
  mutate(percentage = round((count / sum(count)) * 100, 2))

subscribe_counts

### Sessions Dataset
Observations: 1535 sessions recorded.

Variables: 5

1. hashedEmail (chr)- Player's email, hashed for privacy. 
2. start_time (chr)- Date, time the session began.
3. end_time (chr)- Date, time the session ended.
4. original_start_time (dbl)- Start time in Unix timestamp.
5. original_end_time (dbl)- End time in Unix timestamp.

Potential issues
- start_time and end_time are stored as character strings, so they must be parsed into date-time values before session durations can be computed.
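
As a minimal sketch of why parsing matters, the example below converts two made-up timestamp strings into date-times and computes a duration in minutes. It assumes the day/month/year hour:minute layout implied by the `dmy_hm()` call used in this report. The example strings and the UTC time zone are assumptions for illustration, and base R is used so the snippet runs without extra packages.

```r
# Hypothetical timestamps in an assumed "dd/mm/yyyy HH:MM" layout
start_chr <- "30/06/2024 18:12"
end_chr   <- "30/06/2024 18:24"

# Character strings cannot be subtracted directly,
# so parse them into POSIXct date-times first
fmt <- "%d/%m/%Y %H:%M"
start_dt <- as.POSIXct(start_chr, format = fmt, tz = "UTC")
end_dt   <- as.POSIXct(end_chr, format = fmt, tz = "UTC")

# Session length in minutes
duration_mins <- as.numeric(difftime(end_dt, start_dt, units = "mins"))
duration_mins
```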

In [None]:
#Mutating the dataset to include session duration in minutes
sessions <- sessions |>
  mutate(
    start_time_dt = dmy_hm(start_time),
    end_time_dt = dmy_hm(end_time),
    session_duration = as.numeric(difftime(end_time_dt, start_time_dt, units = "mins"))
  )

sessions

In [None]:
#Minimum, mean, median and maximum values for session duration
session_duration_summary <- sessions |>
  summarise(
    min = round(min(session_duration, na.rm = TRUE), 2),
    mean = round(mean(session_duration, na.rm = TRUE), 2),
    median = round(median(session_duration, na.rm = TRUE), 2),
    max = round(max(session_duration, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "session_duration") |>
  as_tibble()

session_duration_summary

In [None]:
#Session count and mean duration for each player
player_session_summary <- sessions |>
  group_by(hashedEmail) |>
  summarise(
    session_count = n(),
    mean_duration = round(mean(session_duration, na.rm = TRUE), 2)
  ) |>
  rename(player = hashedEmail) |>
  arrange(desc(session_count))

player_session_summary

### Broad Question
1. What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?
### Specific Question
Can age and the number of hours played be used to predict subscription status in the ‘players’ dataset?

Response Variable: subscribe (TRUE/FALSE)

Explanatory Variables: played_hours (dbl), Age (dbl)

## How the data will help address the question:

Each row represents a single player, linking characteristics (played_hours, experience, age, gender) to subscription status. The dataset is already tidy, so data preparation is limited to handling missing values and addressing outliers. Classification models will assess whether higher engagement predicts subscription and how this relationship varies across player characteristics.

# Methods and Results

<p style="color: green;">J-describe the methods you used to perform your analysis from beginning to end that narrates the analysis code.
your report should include code which:</p>
<p style="margin-left: 80px;">Mer-load data<br><br>
<span style="color: green;">Mat-wrangles and cleans the data to the format necessary for the planned analysis</span><br><br>
<p style="color: green;">J-performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis </p><br>
Mer-creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis<br><br>
Mat-performs the data analysis<br><br>
J-creates a visualization of the analysis <br><br>
note: all figures should have a figure number and a legend</p></p>

<hr>

### Data Cleaning & Wrangling

In [None]:
# wrangle and clean data to the format necessary for the planned analysis
players_tidy <- players |>
  drop_na() |> # remove rows with NA values
  rename(hashed_email = hashedEmail, age = Age) |>
  select(age, played_hours, subscribe) |>
  mutate(subscribe = as_factor(subscribe)) |>
  mutate(subscribe = fct_recode(subscribe, "Subscriber" = "TRUE", "Nonsubscriber" = "FALSE"))

players_tidy

<hr>

### Data Analysis

In [None]:
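# Fix the random seed so the train/test split is reproducible
set.seed(2024) # arbitrary value; any fixed integer works
players_split <- initial_split(players_tidy, prop = 0.75, strata = subscribe)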
players_train <- training(players_split)
players_test <- testing(players_split)

In [None]:
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

In [None]:
players_recipe <- recipe(subscribe ~ age + played_hours,
                        data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())


In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

In [None]:
k_vals <- tibble(neighbors = seq(from = 1, to = 25, by = 1))

knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

accuracies

In [None]:
accuracy_vs_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbors", y = "Accuracy Estimate") +
  theme(text = element_text(size = 20))

accuracy_vs_k

In [None]:
best_k <- accuracies |>
        arrange(desc(mean)) |>
        head(1)
best_k

<hr>

### Mean Calculation

In [None]:
#Computing the mean value for each quantitative variable in the players.csv data set, and representing them in a tibble. 
mean_tibble <- players_tidy |>
  summarise(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_age = mean(age, na.rm = TRUE)
  ) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Mean") |>
  mutate(Mean = round(Mean, 2))

mean_tibble

### Scatterplot

In [None]:
options(repr.plot.width = 13, repr.plot.height = 9)
ggplot(players_tidy, aes(x = age, y = played_hours, color = subscribe)) +
  geom_point(alpha = 0.7, size = 3) +
  labs(
    title = "Playtime vs Age Coloured by Subscription Status",
    x = "Age (years)",
    y = "Playtime (hours)",
    color = "Subscription Status") +
  theme(text = element_text(size = 20))

Most non-subscribers have low playtimes, suggesting that inactive players are less engaged. Some inactive players are still subscribed, which suggests ongoing interest despite low activity. All players under 17 are subscribed, which points to a possible age-related trend.

### Histogram

In [None]:
options(repr.plot.width = 12, repr.plot.height = 5)

ggplot(players_tidy, aes(x = age, fill = subscribe)) +
  geom_histogram(bins = 30, position = "stack") +
  labs(
    title = "Distribution of Age coloured by Subscription Status",
    x = "Age (years)",
    y = "Number of Players",
    fill = "Subscribed"
  )  +
  theme(text = element_text(size = 20))

Subscription rates are highest among younger players, especially those under 20. Older players show lower subscription rates, which suggests that age may be a meaningful predictor of newsletter subscription.

### Suitability
KNN classification is appropriate because it handles binary response variables and continuous explanatory variables. 
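
To make the idea concrete, here is a minimal from-scratch sketch of K-nearest neighbours classification in base R. It is not the tidymodels workflow used in this report. The helper `knn_predict()` and the toy data are hypothetical and exist only to illustrate the steps: standardize the predictors, compute Euclidean distances to a new point, and take a majority vote among the k closest training points.

```r
# Hypothetical toy training data (NOT from players.csv)
train <- data.frame(
  age          = c(15, 16, 17, 30, 35, 40),
  played_hours = c(10, 12, 8, 0, 1, 0),
  subscribe    = c("Subscriber", "Subscriber", "Subscriber",
                   "Nonsubscriber", "Nonsubscriber", "Nonsubscriber")
)

# Illustrative helper: predict the class of one new point with KNN
knn_predict <- function(train, new_point, k) {
  predictors <- c("age", "played_hours")
  # Standardize using the training means and standard deviations
  mu <- sapply(train[predictors], mean)
  sigma <- sapply(train[predictors], sd)
  train_z <- scale(train[predictors], center = mu, scale = sigma)
  new_z <- (unlist(new_point[predictors]) - mu) / sigma
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))
  # Majority vote among the k nearest neighbours
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train$subscribe[nearest])
  names(votes)[which.max(votes)]
}

knn_predict(train, data.frame(age = 16, played_hours = 9), k = 3)
```

Because the vote depends only on distances, both predictors must be numeric and on comparable scales, which is why the sketch standardizes before measuring distance.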
### Assumptions
- Observations are independent
- Balanced dataset
- Sufficient sample size

### Limitations and Weaknesses
Because KNN relies on distances between numeric predictors, we cannot directly use the categorical variables experience and gender as explanatory variables. KNN is also sensitive to outliers and class imbalance, and it is highly dependent on k, which must be carefully selected to avoid overfitting or underfitting.

### Comparison and Model Selection

We tune k using 5-fold cross-validation on the training set.

### Data processing
1. Standardization of played_hours and Age
2. Splitting (75% training / 25% testing, matching prop = 0.75 in the code)
3. Cross validation
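
Step 1 rescales each predictor to have mean 0 and standard deviation 1, so that variables measured in different units (hours versus years) contribute comparably to the distance calculation. A minimal sketch with made-up values (the vector `age` below is hypothetical) shows the z-score calculation that `step_center()` and `step_scale()` together apply to the training data:

```r
# Hypothetical ages (NOT from players.csv)
age <- c(17, 19, 21, 25, 38)

# z-score: subtract the mean, then divide by the standard deviation
age_z <- (age - mean(age)) / sd(age)

# After standardization the mean is (numerically) 0 and the sd is 1
round(mean(age_z), 10)
sd(age_z)
```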

## GitHub Repository

Below is the link to my GitHub repository.

https://github.com/jaanacara/project_planning_individual.git

# Discussion

**Mer - summarize what you found 
Mat - discuss whether this is what you expected to find
J - discuss what impact could such findings have
Mer - discuss what future questions could this lead to**