(1) Data Description: These data were collected by PLAI, which set up a Minecraft server and recorded each player's actions into a data frame as they played the game.

Load the necessary libraries:

In [None]:
library(tidyverse)
library(dplyr)
library(repr)

Read in the data sets:

In [None]:
players<-read_csv("https://raw.githubusercontent.com/Laiann/25_final_indiv_project/refs/heads/main/players.csv")
head(players)

Players Dataset: There are 196 observations and 7 variables in this dataset.
- experience: Character type; the player's experience level
- subscribe: Logical type; whether the player is subscribed to the newsletter
- hashedEmail: Character type; the player's hashed email
- played_hours: Double type; the player's total hours played (decimal values allowed)
- name: Character type; the player's name
- gender: Character type; the player's gender
- Age: Double type; the player's age

A few players have very high playing times and appear as outliers. There are also issues that cannot be seen directly in the data: players might misreport their age, which would negatively affect predictions, and played_hours might count only active sessions, so any time that is not logged as a session would be missing from a player's total.
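One way to make the outlier observation concrete is to flag unusually high values of played_hours. The sketch below assumes the common 1.5 × IQR rule, which is a convention chosen here for illustration and not something prescribed by the data source. The helper flag_high_outliers and the toy values are hypothetical; the same function could be applied to players.

```r
library(dplyr)
library(tibble)

# Flag values above Q3 + 1.5 * IQR (an assumed convention, not the only option).
flag_high_outliers <- function(df, col) {
  df |>
    mutate(
      upper_fence = quantile({{ col }}, 0.75, na.rm = TRUE, names = FALSE) +
        1.5 * IQR({{ col }}, na.rm = TRUE),
      is_outlier = {{ col }} > upper_fence
    )
}

# Made-up values for illustration only; not taken from players.csv.
players_toy <- tibble(played_hours = c(0, 0.1, 0.5, 1, 2, 150))

flagged <- flag_high_outliers(players_toy, played_hours)
filter(flagged, is_outlier)
```

Flagged rows are candidates for a closer look; whether to keep or drop them is a separate modelling decision.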

In [None]:
sessions<-read_csv("https://raw.githubusercontent.com/Laiann/25_final_indiv_project/refs/heads/main/sessions.csv")
head(sessions)

Sessions Dataset: There are 1535 observations and 5 variables in this dataset.
- hashedEmail: Character type; the player's hashed email
- start_time: Character type; the session start time
- end_time: Character type; the session end time
- original_start_time: Double type; the numeric timestamp for the session start
- original_end_time: Double type; the numeric timestamp for the session end

One potential problem in this data set is that some end times are missing, so the length of those sessions is unknown.
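A quick way to see how many values are missing is to count NA entries in each column. This is a minimal sketch: count_missing is a hypothetical helper, and the toy rows are invented for illustration. In the notebook it could be called as count_missing(sessions).

```r
library(dplyr)
library(tibble)

# Count missing values in every column of a data frame.
count_missing <- function(df) {
  summarize(df, across(everything(), ~ sum(is.na(.x))))
}

# Made-up rows for illustration only; not taken from sessions.csv.
sessions_toy <- tibble(
  hashedEmail = c("a1", "b2", "c3"),
  start_time  = c("session 1 start", "session 2 start", "session 3 start"),
  end_time    = c("session 1 end", NA, "session 3 end")
)

missing_counts <- count_missing(sessions_toy)
missing_counts
```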

(2) Questions:
Broad question 1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

Specific Question: Can age and the number of hours played predict whether a player is subscribed in the Players Dataset?

The data will help me address this question because Age and played_hours can be used as predictors of whether a player is a subscriber. To wrangle the data, I plan to select only the variables I need (Age, played_hours, and subscribe) and then remove rows with missing values.
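The wrangling plan can be sketched as follows, using the two predictors named in the specific question (Age and played_hours) together with subscribe. The helper prepare_players and the toy rows are hypothetical. Converting subscribe to a factor is an added assumption, made because classification models generally expect a categorical response.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Keep only the modelling variables, drop incomplete rows,
# and store the response as a factor (assumption: needed for classification).
prepare_players <- function(df) {
  df |>
    select(Age, played_hours, subscribe) |>
    drop_na() |>
    mutate(subscribe = as.factor(subscribe))
}

# Made-up rows for illustration only; not taken from players.csv.
players_toy <- tibble(
  experience   = c("Pro", "Amateur", "Pro", "Amateur"),
  subscribe    = c(TRUE, FALSE, TRUE, TRUE),
  played_hours = c(30.3, 0.0, 3.8, 0.7),
  Age          = c(19, 21, NA, 17)
)

players_clean <- prepare_players(players_toy)
players_clean
```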

(3) Exploratory Data Analysis and Visualization:

Mean value for each quantitative variable (Age and played_hours) in the players.csv data set

In [None]:
players_mean<-select(players, Age, played_hours)|>
    summarize(
        mean_age=round(mean(Age,na.rm=TRUE),2),
        mean_played_hours=round(mean(played_hours,na.rm=TRUE),2))
players_mean

Visualizations:

In [None]:
library(ggplot2)

players_figure1 <- ggplot(players, aes(x = Age, y = played_hours, color = subscribe)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Age vs. Played Hours by Subscription Status",
    x = "Age",
    y = "Played Hours",
    color = "Subscribed")

players_figure1

In [None]:
players_figure2<-ggplot(players, aes(x = subscribe, y = Age, fill = subscribe)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Age by Subscription Status",
    x = "Subscribed",
    y = "Age")
players_figure2

In [None]:
players_figure3<- ggplot(players, aes(x = subscribe, y = played_hours, fill = subscribe)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Played Hours by Subscription Status",
    x = "Subscribed",
    y = "Played Hours")
players_figure3

(4) Methods and Plan:

The most appropriate method is KNN classification, because it predicts the class of an observation (here, whether a player is subscribed to the newsletter) from the majority class among its k nearest neighbours. Because KNN relies on distances between observations, both predictors must be standardized (centred and scaled) so that neither dominates the distance measure simply because of its units. Its main limitations are that KNN is slow for large datasets and that its performance depends heavily on the choice of k.

The dataset will be split into training (70%) and testing (30%) sets before any model is fit, so that the test set does not influence model selection and the final performance estimate is not biased. Model selection will focus on finding the best value of k: k-fold cross-validation on the training set will be used to compare accuracy across different k values, and the k with the highest average accuracy will be selected. In addition to accuracy, recall will be compared across candidate models.
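The plan above can be sketched with the tidymodels packages. This is an illustrative outline under stated assumptions, not the final analysis: the data frame players_sim is simulated (not real observations), and the seed, the 5 folds, and the grid of odd k values from 1 to 15 are arbitrary choices. The "kknn" engine requires the kknn package to be installed.

```r
library(tidymodels)

set.seed(2025)  # arbitrary seed for reproducibility

# Simulated stand-in for the cleaned players data (NOT real observations).
n <- 200
players_sim <- tibble(
  Age          = round(runif(n, 10, 50)),
  played_hours = round(rexp(n, rate = 0.2), 1),
  subscribe    = factor(sample(c("TRUE", "FALSE"), n, replace = TRUE,
                               prob = c(0.7, 0.3)))
)

# 70/30 split, done before any model is fit.
players_split <- initial_split(players_sim, prop = 0.70, strata = subscribe)
players_train <- training(players_split)
players_test  <- testing(players_split)

# Standardize both predictors so neither dominates the distance measure.
knn_recipe <- recipe(subscribe ~ Age + played_hours, data = players_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier with k left as a tuning parameter.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation on the training set only.
players_vfold <- vfold_cv(players_train, v = 5, strata = subscribe)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = players_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validated accuracy.
best_k <- knn_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

The test set (players_test) is deliberately left untouched here; it would be used once, after k is chosen, to estimate accuracy and recall on unseen data.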