# DSCI 100 Final Group Project

### Jaana Rodrigo

## 1. Data Description

First, we load the libraries needed to read, wrangle, and visualize our datasets.

In [None]:
# Run this cell! 
library(tidyverse)
library(lubridate) # provides dmy_hm(); attached automatically only in tidyverse 2.0.0 and later
library(repr)
options(repr.matrix.max.rows = 7)

Next, we will load the two datasets.

In [None]:
players <- read_csv("planning/data/players.csv")
sessions <- read_csv("planning/data/sessions.csv")

### Players Dataset

Observations: 196 unique players.

Variables: 7

1. experience (chr)- Player's skill level.

2. hashedEmail (chr)- Player's email, hashed for privacy.

3. name (chr)- Player's name.

4. gender (chr)- Player's self-identified gender.

5. played_hours (dbl)- Number of hours played.

6. Age (dbl)- Age of the player.

7. subscribe (lgl)- Newsletter subscription status.

Potential issues
- Many of the played_hours values are 0 due to inactive players.
- Some values are extreme (outliers).
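
The two issues above can be quantified before modelling. The sketch below uses a small made-up tibble (`toy_players` is hypothetical, not the real data) and flags outliers with the common 1.5 × IQR rule of thumb, which is an assumption on my part rather than a method used elsewhere in this report.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the played_hours column of players.csv
toy_players <- tibble(played_hours = c(0, 0, 0, 0.1, 0.5, 1.2, 2, 150))

# Share of players with zero recorded playtime
zero_share <- mean(toy_players$played_hours == 0)

# Upper fence: third quartile plus 1.5 times the interquartile range
# (a rule of thumb chosen for illustration only)
upper_fence <- unname(quantile(toy_players$played_hours, 0.75)) +
  1.5 * IQR(toy_players$played_hours)

# Mark each row as an outlier if it lies above the fence
toy_flagged <- toy_players |>
  mutate(is_outlier = played_hours > upper_fence)

zero_share
toy_flagged
```

Running the same steps on `players` would show how much of the sample is inactive and which players sit far from the rest.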

In [None]:
#Minimum, mean, median and maximum values for Age
age_summary <- players |>
  summarise(
    min = round(min(Age, na.rm = TRUE), 2),
    mean = round(mean(Age, na.rm = TRUE), 2),
    median = round(median(Age, na.rm = TRUE), 2),
    max = round(max(Age, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "Age") |>
  as_tibble()

age_summary

In [None]:
#Minimum, mean, median and maximum values for Hours Played
played_hours_summary <- players |>
  summarise(
    min = round(min(played_hours, na.rm = TRUE), 2),
    mean = round(mean(played_hours, na.rm = TRUE), 2),
    median = round(median(played_hours, na.rm = TRUE), 2),
    max = round(max(played_hours, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "played_hours") |>
  as_tibble()

played_hours_summary

In [None]:
#Count and percentage of each experience level
experience_counts <- players |>
  count(experience) |>
  rename(count = n) |>
  mutate(percentage = round((count / sum(count)) * 100, 2))

experience_counts

In [None]:
#Count and percentage of each gender
gender_counts <- players |>
  count(gender) |>
  rename(count = n) |>
  mutate(percentage = round((count / sum(count)) * 100, 2))

gender_counts

In [None]:
#Count and percentage of each subscription status
subscribe_counts <- players |>
  count(subscribe) |>
  rename(subscription_status = subscribe, count = n) |>
  mutate(percentage = round((count / sum(count)) * 100, 2))

subscribe_counts

### Sessions Dataset
Observations: 1535 sessions recorded.

Variables: 5

1. hashedEmail (chr)- Player's email, hashed for privacy. 
2. start_time (chr)- Date and time the session began.
3. end_time (chr)- Date and time the session ended.
4. original_start_time (dbl)- Start time as a Unix timestamp.
5. original_end_time (dbl)- End time as a Unix timestamp.

Potential issues
- start_time and end_time are stored as character strings, so they must be parsed into date-times before session durations can be computed.

In [None]:
#Mutating the dataset to include session duration in minutes
sessions <- sessions |>
  mutate(
    start_time_dt = dmy_hm(start_time),
    end_time_dt = dmy_hm(end_time),
    session_duration = as.numeric(difftime(end_time_dt, start_time_dt, units = "mins"))
  )

sessions
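
Since `dmy_hm()` returns `NA` for strings it cannot parse, it may be worth counting parse failures before trusting the durations. This is a minimal sketch on made-up timestamps; the "dd/mm/yyyy HH:MM" layout is an assumption implied by the use of `dmy_hm()`, and all values below are hypothetical.

```r
library(lubridate)

# Hypothetical timestamps in the assumed "dd/mm/yyyy HH:MM" layout;
# the third start value is deliberately malformed
toy_start <- c("30/06/2024 18:12", "17/06/2024 23:33", "not a date")
toy_end   <- c("30/06/2024 18:24", "18/06/2024 00:04", "18/06/2024 00:10")

# quiet = TRUE suppresses the parse-failure warning; failures come back as NA
start_dt <- dmy_hm(toy_start, quiet = TRUE)
end_dt   <- dmy_hm(toy_end, quiet = TRUE)

# Total number of values that failed to parse
n_failed <- sum(is.na(start_dt)) + sum(is.na(end_dt))

# Durations in minutes; the second session crosses midnight
duration_mins <- as.numeric(difftime(end_dt, start_dt, units = "mins"))

n_failed
duration_mins
```

A non-zero failure count on the real `sessions` data would indicate rows that need cleaning before the duration summaries are computed.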

In [None]:
#Minimum, mean, median and maximum values for session duration
session_duration_summary <- sessions |>
  summarise(
    min = round(min(session_duration, na.rm = TRUE), 2),
    mean = round(mean(session_duration, na.rm = TRUE), 2),
    median = round(median(session_duration, na.rm = TRUE), 2),
    max = round(max(session_duration, na.rm = TRUE), 2)
  ) |>
  pivot_longer(everything(), names_to = "stat", values_to = "session_duration") |>
  as_tibble()

session_duration_summary

In [None]:
#Session count and mean duration for each player
player_session_summary <- sessions |>
  group_by(hashedEmail) |>
  summarise(
    session_count = n(),
    mean_duration = round(mean(session_duration, na.rm = TRUE), 2)
  ) |>
  rename(player = hashedEmail) |>
  arrange(desc(session_count))

player_session_summary

## 2. Questions

### Broad Question
What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### Specific Question
Can age and the number of hours played be used to predict subscription status in the ‘players’ dataset?

- Response variable: subscribe (TRUE/FALSE)
- Explanatory variables: played_hours (dbl), Age (dbl)

### How the Data Will Help Address the Question

Each row represents a single player, linking characteristics (played_hours, experience, Age, gender) to subscription status. The dataset is already tidy, so preparation will focus on handling missing values and addressing outliers. Classification models will assess whether higher engagement predicts subscription and how this relationship varies across player characteristics.

## 3. Exploratory Data Analysis and Visualization

To prepare the data for analysis, I will remove any rows that contain NA values.

In [None]:
players <- players |>
  drop_na()
players # 2 rows were removed.

### Mean Calculation

In [None]:
#Computing the mean value for each quantitative variable in the players.csv data set, and representing them in a tibble. 
mean_tibble <- players |>
  summarise(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE)
  ) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Mean") |>
  mutate(Mean = round(Mean, 2))

mean_tibble

### Scatterplot

In [None]:
options(repr.plot.width = 13, repr.plot.height = 9)
ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point(alpha = 0.7, size = 3) +
  labs(
    title = "Playtime vs Age Coloured by Subscription Status",
    x = "Age (years)",
    y = "Playtime (hours)",
    color = "Subscription Status") 

Most non-subscribers have low playtimes, which suggests that inactive players are less engaged. A few inactive players are subscribed, which suggests ongoing interest despite low activity. All players under 17 are subscribed, pointing to a possible age-related trend.

### Histogram

In [None]:
options(repr.plot.width = 12, repr.plot.height = 5)

ggplot(players, aes(x = Age, fill = subscribe)) +
  geom_histogram(bins = 30, position = "stack") +
  labs(
    title = "Distribution of Age coloured by Subscription Status",
    x = "Age (years)",
    y = "Number of Players",
    fill = "Subscribed"
  ) 

Subscription rates appear highest among younger players, especially those under 20, while older players show lower subscription rates. This suggests that age may be a meaningful predictor of newsletter subscription.
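
Because a stacked histogram shows counts rather than proportions, the age trend is easier to check by computing the subscription rate per age group directly. The sketch below uses made-up data, and the age brackets are an illustrative choice rather than part of the analysis above.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; the real analysis would use the players tibble
toy_players <- tibble(
  Age = c(12, 16, 17, 19, 22, 25, 31, 40),
  subscribe = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Bin ages into left-closed brackets, then take the mean of the logical
# subscribe column, which equals the proportion of subscribers
rate_by_age <- toy_players |>
  mutate(age_group = cut(
    Age,
    breaks = c(0, 20, 30, Inf),
    right = FALSE,
    labels = c("under 20", "20 to 29", "30 and over")
  )) |>
  group_by(age_group) |>
  summarise(
    n_players = n(),
    subscription_rate = mean(subscribe)
  )

rate_by_age
```

Reporting `n_players` alongside the rate matters here, since a high rate in a bracket with only a handful of players is weak evidence.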

## 4. Methods and Plan

### Suitability
K-nearest neighbours (KNN) classification is appropriate because it handles a binary response variable and continuous explanatory variables.

### Assumptions
- Observations are independent
- Balanced dataset
- Sufficient sample size

### Limitations and Weaknesses
KNN relies on distances between numeric predictors, so categorical variables such as player experience and gender cannot be used directly as explanatory variables. It is also sensitive to outliers and class imbalance, and it is highly dependent on k, which must be carefully selected to avoid overfitting or underfitting.

### Comparison and Model Selection

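I would tune the number of neighbours, k, using cross-validation on the training set, comparing candidate values of k by their cross-validated accuracy.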
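
One way this tuning could look with tidymodels is sketched below. It assumes the `tidymodels` and `kknn` packages are installed; the training data is simulated, and the 5 folds and the grid of k values are illustrative choices, not settings fixed by this plan.

```r
library(tidymodels)

set.seed(1) # arbitrary seed so the folds are reproducible

# Hypothetical training data standing in for the training split of players.
# The real subscribe column is logical and would need converting to a factor.
toy_train <- tibble(
  Age = round(runif(100, 10, 50)),
  played_hours = rexp(100, rate = 0.2),
  subscribe = factor(sample(c("TRUE", "FALSE"), 100, replace = TRUE))
)

# Standardize predictors inside the recipe so scaling is re-estimated per fold
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = toy_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# KNN model with the number of neighbours left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation folds and candidate values of k (both illustrative)
folds <- vfold_cv(toy_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Fit every candidate k on every fold and keep the accuracy estimates
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Candidate k with the highest mean cross-validated accuracy
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

best_k
```

Because the toy labels are random, the chosen k here carries no meaning; the point is only the shape of the workflow.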

### Data processing
1. Splitting (70% training / 30% testing)
2. Standardization of played_hours and Age, using the training set only
3. Cross-validation on the training set
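
The splitting and standardization steps listed above could be sketched as follows, assuming the `tidymodels` package. The data is simulated and all object names are hypothetical. The centering and scaling values are learned from the training set and then applied unchanged to the test set, so the test data does not influence preprocessing.

```r
library(tidymodels)

set.seed(1) # arbitrary seed so the split is reproducible

# Hypothetical data standing in for players after drop_na();
# the real logical subscribe column would need converting to a factor
toy_players <- tibble(
  Age = round(runif(100, 10, 50)),
  played_hours = rexp(100, rate = 0.2),
  subscribe = factor(sample(c("TRUE", "FALSE"), 100, replace = TRUE))
)

# 70/30 split, stratified so both sets keep a similar subscribe ratio
players_split <- initial_split(toy_players, prop = 0.7, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Estimate means and standard deviations from the training set only
players_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors()) |>
  prep()

# Apply the training-set transformation to both sets
train_scaled <- bake(players_recipe, new_data = NULL)
test_scaled <- bake(players_recipe, new_data = players_test)
```

After this step the training predictors have mean 0 and standard deviation 1, while the test predictors will be close to, but not exactly, standardized.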

## 5. GitHub Repository

Below is the link to my GitHub repository.

https://github.com/jaanacara/project_planning_individual.git