(1)

The "players" dataset contains information on 196 unique players (rows), described by seven variables (columns).

- The first column, experience, is self-reported categorical data with five categories of gaming proficiency: beginner, amateur, regular, veteran, and pro.
    - A potential issue is that this rating is purely subjective.
- The second column, subscribe, is a Boolean indicating whether the player is subscribed to a gaming newsletter.
- The third column, hashedEmail, is a hashed version of the participant's email address.
    - Not useful as a predictor.
- The fourth column, played_hours, is numeric data giving the total number of hours each participant spent playing on the Minecraft server.
    - Most likely recorded through Minecraft server logs.
- The fifth column, name, contains the participants' names.
    - Not useful as a predictor.
- The sixth column, gender, is categorical data giving the participant's gender.
    - Strongly imbalanced, with a majority of male participants.
- The final column, Age, contains each participant's age.
    - Most values fall between 17 and 26, with outliers such as 9, 50, and 57.
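
Age outliers like the ones noted above could be flagged with a rule of thumb such as the 1.5 × IQR fences. The sketch below is only an illustration: `toy_ages` is a made-up vector, not the real Age column, and the IQR rule is an assumption rather than the method used to spot the outliers listed above.

```r
# Hypothetical ages for illustration only; not taken from players.csv
toy_ages <- c(17, 18, 19, 20, 21, 21, 22, 23, 24, 26, 9, 50, 57)

# First and third quartiles, then the 1.5 * IQR fences
q <- quantile(toy_ages, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
age_outliers <- toy_ages[toy_ages < lower_fence | toy_ages > upper_fence]
age_outliers
```

Whether flagged ages should be kept or dropped is a separate decision; the rule only marks candidates for a closer look.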

The "sessions" dataset contains information on every individual playing session on the Minecraft server. There are 1535 playing sessions (rows), described by five variables (columns).

- The first column is hashedEmail.
    - Not useful as a predictor.
- The second column is start_time, in human-readable form (day/month/year hour:minute).
    - Like end_time, most likely derived from the UNIX timestamps.
- The third column is end_time.
- The fourth column is original_start_time, recorded as a UNIX timestamp (milliseconds since 1 Jan 1970).
    - Stored in scientific notation, which may cause precision loss.
    - Like original_end_time, most likely recorded through Minecraft server logs.
- The fifth column is original_end_time.
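
The precision concern can be made concrete with a small sketch. The timestamp below is hypothetical (not taken from the sessions file), and the assumption that only six significant digits survive the scientific-notation formatting is made purely for illustration.

```r
# Hypothetical UNIX timestamp in milliseconds (not taken from sessions.csv)
exact_ms <- 1719770123456

# What remains if the value is stored with six significant digits,
# e.g. written out as "1.71977e+12" (an assumed format)
rounded_ms <- as.numeric("1.71977e+12")

# Convert milliseconds to seconds, then to a date-time in UTC
exact_time <- as.POSIXct(exact_ms / 1000, origin = "1970-01-01", tz = "UTC")
rounded_time <- as.POSIXct(rounded_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Error introduced by the rounding, in seconds
precision_loss_secs <- abs(exact_ms - rounded_ms) / 1000
precision_loss_secs
```

Under these assumptions the rounded value is off by about two minutes, which would matter when computing session lengths from the original_* columns.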

(2)

Broad Question: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Question: Can a player's experience level, total playtime (played_hours), and demographic attributes (age, gender) predict whether they are subscribed (subscribe = TRUE)?


All of the data I plan to use is in the players dataset. I will use experience, played_hours, gender, and age as explanatory variables to predict the response variable, subscribe. I will first remove columns that are not useful as predictors, such as name and hashedEmail, using select. I will then convert categorical variables such as experience and gender to factors using as_factor.
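
The wrangling steps described above can be sketched on a few made-up rows. `toy_players` and its values are hypothetical stand-ins that only mimic the column names of the players dataset.

```r
library(tidyverse)

# Hypothetical rows mimicking the players columns; not the real data
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Veteran"),
  subscribe = c(TRUE, FALSE, TRUE),
  hashedEmail = c("a1", "b2", "c3"),
  played_hours = c(30.3, 0.1, 3.8),
  name = c("P1", "P2", "P3"),
  gender = c("Male", "Female", "Male"),
  Age = c(19, 21, 17)
)

toy_tidy <- toy_players |>
  rename(age = Age) |>
  # Drop identifier columns that will not be used as predictors
  select(-name, -hashedEmail) |>
  # Convert the character columns to factors
  mutate(across(c(experience, gender), as_factor))

glimpse(toy_tidy)
```

Note that as_factor keeps character levels in order of first appearance, so an explicit level order would be needed if experience should run from beginner to pro.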

(3)

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10)
source("cleanup.R")

In [None]:
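# Load the players data, standardize the age column name, and drop identifier columns
players <- read_csv("data/players.csv")
players_tidy <- players |>
  rename(age = Age) |>
  select(-name, -hashedEmail)

# Mean played hours and mean age, ignoring missing values
mean_table <- players_tidy |>
  summarise(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_age = mean(age, na.rm = TRUE)
  )
mean_table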

players_experience <- ggplot(players_tidy, aes(x = experience, fill = subscribe)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Newsletter Subscription Rate by Experience Level",
    x = "Experience Level",
    y = "Percentage of Players",
    fill = "Subscribed")

players_gender <- ggplot(players_tidy, aes(x = gender, fill = subscribe)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Newsletter Subscription Rate by Gender",
    x = "Gender",
    y = "Percentage of Players",
    fill = "Subscribed")

players_gender
players_experience

The first graph shows how newsletter subscription rates vary across genders, although the comparison is limited by the small sample sizes of some categories. The data does not support any firm conclusions yet, but it does show that in most of these groups a majority of players are subscribed. The second plot shows the relationship between experience level and subscription rate; the proportions are quite similar across levels, suggesting that experience might not be a strong predictor.
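
Because proportion bars hide group sizes, it can help to tabulate the counts next to the rates. This is a minimal sketch on made-up rows; `toy_players` is hypothetical and the numbers do not come from the players dataset.

```r
library(tidyverse)

# Hypothetical rows for illustration only; the real data comes from players.csv
toy_players <- tibble(
  gender = c("Male", "Male", "Male", "Male", "Female", "Female", "Non-binary"),
  subscribe = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

# Group size next to subscription rate, so small groups are easy to spot
gender_summary <- toy_players |>
  group_by(gender) |>
  summarise(
    n_players = n(),
    subscription_rate = mean(subscribe)
  ) |>
  arrange(desc(n_players))

gender_summary
```

A rate of 100% in a group with a single player, as in the toy data, says very little, which is the caution raised about the gender plot.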

(4)

I propose to use K-nearest neighbors (KNN) classification since our response variable, subscribe, is categorical, not numerical. KNN predicts the class of an observation from the classes of its nearest neighbors. Because it relies on distances, the numeric predictors (played_hours, age) need to be standardized and the categorical predictors (experience, gender) need to be converted to dummy variables before they can be used together.

KNN relies on three crucial assumptions: similar observations lie close together once numerical values are scaled appropriately, each row represents an independent observation, and each class has sufficient representation. A limitation of KNN is that if a class lacks sufficient representation, predictions can be skewed toward the majority class. It is also very dependent on the number of neighbors (k), which we will tune using cross-validation on the training set. For each k value, we will evaluate model performance using metrics such as accuracy, precision, and recall. I also plan to split the data into 80% training and 20% testing.
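
The plan above could be sketched with tidymodels roughly as follows. Everything specific here is an assumption made for illustration: the data is simulated (`toy_players` is not the real players dataset), the kknn package is assumed to be installed as the engine, and the seed, the 5 folds, the odd k values from 1 to 15, and the rectangular weighting are arbitrary choices.

```r
library(tidymodels)

set.seed(2024)

# Simulated stand-in for the cleaned players data (hypothetical values)
toy_players <- tibble(
  experience = factor(sample(
    c("Beginner", "Amateur", "Regular", "Veteran", "Pro"), 200, replace = TRUE
  )),
  gender = factor(sample(c("Male", "Female"), 200, replace = TRUE)),
  played_hours = rexp(200, rate = 0.2),
  age = sample(17:26, 200, replace = TRUE),
  # "TRUE" is listed first so precision and recall treat it as the event level
  subscribe = factor(
    if_else(played_hours + rnorm(200, sd = 3) > 4, "TRUE", "FALSE"),
    levels = c("TRUE", "FALSE")
  )
)

# 80/20 split, stratified on the response
players_split <- initial_split(toy_players, prop = 0.8, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Dummy-encode categorical predictors, then standardize all predictors
knn_recipe <- recipe(subscribe ~ ., data = players_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())

# KNN classifier with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation folds built from the training set only
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(
    resamples = players_vfold,
    grid = k_grid,
    metrics = metric_set(accuracy, precision, recall)
  ) |>
  collect_metrics()

# Cross-validated accuracy for each k, best first
knn_results |>
  filter(.metric == "accuracy") |>
  arrange(desc(mean))
```

The test set is deliberately left untouched in this sketch; it would only be used once, after a k value has been chosen from the cross-validation results.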

(5)
https://github.com/Jzhen7/Project-Planning-Stage- 