# Introduction

Players engage with video games in many different ways. Some log in only for short sessions, while others spend much more time exploring and return more frequently. Beyond in-game activities, players also differ in how closely they stay connected to the game’s community. One way to stay engaged is by subscribing to the game’s newsletter for updates and upcoming events. Because subscription rates often reflect higher interest and community involvement, it’s useful to identify which types of players are more likely to subscribe. 

The two data sets, players.csv and sessions.csv, contain real data collected by a UBC Computer Science research group studying player behaviour on a Minecraft server. The group's goal is to better target its recruitment and ensure it has adequate server resources.

To support the research group’s recruitment efforts, our project aims to answer the question: Can a player's played hours and age predict whether or not they subscribed to the newsletter in the players dataset? Answering this helps determine which factors are most associated with engagement so the group can tailor its efforts accordingly.

We will only be working with players.csv, which contains one row per unique player. It includes information such as name, age, played hours, subscription status, and more. A detailed description of the dataset is provided below:



The data set consists of 196 observations and 7 variables:

|Variable Name|Data Type|Description|
|------------------|----------------------------|-------------|
| experience | Character | Player's experience level with the game: Beginner, Regular, Amateur, Veteran, or Pro |
| subscribe | Logical | Whether the player is subscribed to the newsletter |
| hashed_email | Character | Hashed email address that uniquely identifies each player |
| played_hours | Numeric | Total hours the player has played the game |
| name | Character | Player's name |
| gender | Character | Player's gender: Male, Female, Other, Two Spirited, or Prefer not to say |
| Age | Numeric | Player's age in years |




# Questions

### *Broad Question:*

What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

### *Specific Question:*

Can a player's played hours and age predict whether or not they subscribed to the newsletter in the players dataset?


# Methods

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
library(purrr)
library(lubridate)
options(repr.matrix.max.rows = 6)


In [None]:
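# Load the players data, convert the categorical columns to factors,
# and relabel subscribe as Yes/No
players <- read_csv("https://raw.githubusercontent.com/RadinAlikhani/dsci100-solo-project/refs/heads/main/players.csv") |>
    mutate(gender = as_factor(gender),
           experience = as_factor(experience),
           subscribe = as_factor(subscribe)) |>
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE")) |>
    select(-name)  # the player's name is not needed for the analysis

players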


In [None]:
# Plot 1: distribution of played hours, coloured by subscription status
played_time_plot <- players |>
    ggplot(aes(x = played_hours, fill = subscribe)) +
    geom_histogram(bins = 30) +  # 30 is ggplot2's default, set explicitly here
    labs(x = "Time Played (Hours)", y = "Number of Players", fill = "Subscribed?") +
    ggtitle("Distribution of players' playtime by subscription status")
played_time_plot

# Plot 2: distribution of age, coloured by subscription status
# TODO: consider a boxplot of age by subscription, or the proportion of subscribers within age groups
age_plot <- players |>
    ggplot(aes(x = Age, fill = subscribe)) +
    geom_histogram(bins = 30) +
    labs(x = "Age of Player (Years)", y = "Number of Players", fill = "Subscribed?") +
    ggtitle("Distribution of players' age by subscription status")
age_plot

# Plot 3: age against played hours, coloured by subscription status
age_and_time_plot <- players |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age of Player (Years)", y = "Time Played (Hours)", color = "Subscribed?") +
    ggtitle("Age versus time played, by subscription status")
age_and_time_plot


## Observations from these plots
#### Plot 1
- Nearly all players with more than 25 played hours are subscribed.
- The histogram shows how subscription status varies across played hours.
- played_hours can therefore be used as a predictor.

#### Plot 2
- Ages 0 to 20 (except 19) had a higher proportion of subscribers relative to non-subscribers than ages above 20, especially above 30.
- Age can therefore be used as a predictor.

#### Plot 3
- Players aged 10-20 have more played hours on average than older players.
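
The Plot 2 observation about subscriber proportions can also be checked numerically instead of read off a histogram. The sketch below is one possible way to do so in base R. The `toy` data frame and the age-group break points are made up for illustration and are not taken from players.csv.

```r
# Hypothetical toy data, NOT taken from players.csv
toy <- data.frame(
  Age = c(12, 17, 19, 22, 25, 31, 38, 45),
  subscribe = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Bin ages into groups; the break points are arbitrary choices for illustration
toy$age_group <- cut(toy$Age,
                     breaks = c(0, 20, 30, Inf),
                     labels = c("0-20", "21-30", "31+"))

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the proportion of subscribers within each age group
prop_by_group <- tapply(toy$subscribe, toy$age_group, mean)
prop_by_group
```

Applied to the real data, the same idea would use the `Age` and `subscribe` columns of `players`, with `subscribe` compared against `"Yes"` because it has been recoded as a factor.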


# Methods and Plans
Will use:
- K-nearest neighbors (K-NN) classification to predict the player's subscription status
- Age and played_hours as predictors, subscribe as response variable
- Will not use experience level or gender because they are categorical variables, and the distance calculation in K-NN needs numeric predictors (they would have to be encoded as numbers first).
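
To make the method concrete, here is a minimal base R sketch of how K-NN classifies a single new observation: compute the Euclidean distance to every training point, take the k closest, and let them vote. The training points and the helper `knn_predict_one` are hypothetical and only illustrate the idea. The analysis itself relies on the `kknn` engine through tidymodels, not on this function.

```r
# Hypothetical standardized training points (NOT from players.csv)
train_x <- matrix(c(-1.0, -0.5,
                    -0.8,  0.2,
                     0.9,  1.1,
                     1.2,  0.7,
                     0.1, -1.0),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("played_hours", "Age")))
train_y <- factor(c("No", "No", "Yes", "Yes", "No"), levels = c("No", "Yes"))

# Illustrative helper: predict one new point by an unweighted majority vote
knn_predict_one <- function(new_x, train_x, train_y, k) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Count the class labels among those neighbours and return the most common
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

pred_a <- knn_predict_one(c(1.0, 1.0), train_x, train_y, k = 3)
pred_b <- knn_predict_one(c(-1.0, -0.4), train_x, train_y, k = 3)
```

Every neighbour gets an equal vote here, which is the behaviour that `weight_func = "rectangular"` requests from `kknn`.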

#### Why this method is appropriate
- The response variable (subscribe) is categorical (a factor), and the two predictor variables (Age and played_hours) are numeric.
- K-NN classification is flexible: it makes few assumptions about the shape of the data, so it can capture non-linear relationships.

#### Assumptions
- The data is tidy.
- Each predictor contributes equally to the distance calculation (the predictors are standardized).
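
The standardization assumption matters because played hours and age sit on very different scales, so the variable with the larger spread would otherwise dominate the Euclidean distance. The base R sketch below shows what standardizing does. The values in `toy` and the helper `standardize` are hypothetical, chosen only to show the effect.

```r
# Hypothetical values (NOT from players.csv)
toy <- data.frame(played_hours = c(0.5, 2, 30, 150),
                  Age = c(17, 45, 22, 18))

# Illustrative helper: subtract the mean, then divide by the standard deviation
standardize <- function(x) (x - mean(x)) / sd(x)

toy_std <- as.data.frame(lapply(toy, standardize))

sapply(toy, sd)       # raw spreads differ greatly between the two columns
sapply(toy_std, mean) # each column now has mean 0 (up to rounding)
sapply(toy_std, sd)   # and standard deviation 1
```

In the recipe, `step_scale` and `step_center` play this role, with the mean and standard deviation estimated from the training data only.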

#### Limitations/Weaknesses

- The data set is small, so the results are more sensitive to outliers.

#### How I am going to compare and select the model
- Wrangle the data as described above to make it tidy
- Standardize the predictors using a recipe (step_scale and step_center) built on the training data
- Tune k over the values 1-30 on the training data
- Use 5-fold cross-validation to find the number of neighbors with the highest accuracy
- Compare the candidate models via accuracy, precision, and the confusion matrix
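
As a reminder of how the comparison metrics relate to the confusion matrix, here is a small base R sketch. The `truth` and `pred` vectors are made up for illustration, and treating "Yes" as the positive class is an assumption of this example.

```r
# Hypothetical true labels and predictions (NOT model output)
truth <- factor(c("Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"),
                levels = c("No", "Yes"))
pred  <- factor(c("Yes", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"),
                levels = c("No", "Yes"))

# Confusion matrix: rows are predictions, columns are true labels
conf <- table(Predicted = pred, Truth = truth)
conf

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(conf)) / sum(conf)

# Precision for "Yes": of the players predicted "Yes", the share that truly are "Yes"
precision_yes <- conf["Yes", "Yes"] / sum(conf["Yes", ])
```

With these toy labels, 6 of the 8 predictions are correct and 4 of the 5 "Yes" predictions are right, so accuracy is 0.75 and precision is 0.8.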

#### How I am going to split the data

- Split the data 75/25 into training and testing sets, stratified by subscribe
- Split before any preprocessing, so that the test set does not influence the standardization
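
The idea behind a stratified 75/25 split can be sketched in base R as follows: sample 75% of the rows within each class, so both sets keep roughly the same share of subscribers. This is a simplified illustration on a made-up `toy` data frame with an arbitrary seed. It is not how `initial_split` is implemented internally.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible

# Hypothetical data: 14 subscribers and 6 non-subscribers (NOT from players.csv)
toy <- data.frame(id = 1:20,
                  subscribe = factor(rep(c("Yes", "No"), times = c(14, 6))))

# Within each class, randomly keep 75% of the row indices for training
rows_by_class <- split(seq_len(nrow(toy)), toy$subscribe)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

table(toy_train$subscribe)  # class counts in the training set
table(toy_test$subscribe)   # class counts in the testing set
```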


In [None]:
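# Fix the random seed so the split and the folds are reproducible (the value is arbitrary)
set.seed(1234)

# Drop rows with missing values
players <- drop_na(players)

# 75/25 train/test split, stratified by the response
players_split <- initial_split(players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)

# Standardize both predictors using statistics from the training data
players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# 5-fold cross-validation on the training data
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)

# K-NN specification with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Candidate values of k
k_vals <- tibble(neighbors = seq(from = 1, to = 30, by = 1))

knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

# Pick the k with the highest cross-validated accuracy
best_k <- accuracies |>
  arrange(desc(mean)) |>
  head(1) |>
  pull(neighbors)
best_k

# Refit on the full training set with the chosen k
# (assigned only to best_knn, so the tunable knn_spec is not overwritten)
best_knn <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(best_knn) |>
  fit(data = players_train)
knn_fit

# Evaluate on the held-out test set
players_test_predictions <- predict(knn_fit, players_test) |>
  bind_cols(players_test)

players_test_predictions |>
  metrics(truth = subscribe, estimate = .pred_class) |>
  filter(.metric == "accuracy")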