# Project Final Report
#### By: Mohak Batra, Ellina Hao, Sophia Mokretsova


# Introduction: Predicting Game-Related Newsletter Subscription from Player Behavior
### How accurately can a player’s age, average session length, and total play time predict whether the player is subscribed to a game-related newsletter?

The Pacific Laboratory for Artificial Intelligence (PLAI), a computer science research group based at the University of British Columbia, has set up a Minecraft server to collect hours of player data, with the goal of training an AI model that can understand speech and respond to its environment. To recruit players for the study, PLAI needs to target the individuals most likely to contribute many hours of data. Subscribing to a game-related newsletter demonstrates interest in engaging with the server, so we treat subscription status as an indicator of how much data a player may contribute. Knowing how accurately a player’s age, average session length, and total play time predict subscription status may help researchers target the best sources of data.

We were provided with two datasets. players.csv contains basic information on each player: in-game experience level, subscription status, hashed email, hours played, name, gender, and age. sessions.csv records gameplay sessions, listing each player’s hashed email alongside the start and end time of each session. We used the start and end times to compute the length of each session in minutes, averaged these lengths per player, and merged the result with the players dataset on the hashed email. Age, average session length, and total hours played were then used as predictors in a K-nearest neighbours (KNN) classifier that predicts whether a player with similar characteristics is subscribed.

# Method Used:
### KNN Classification 

We began by cleaning and wrangling the data into the format needed for the planned analysis.

After preparing the data, we built a K-Nearest Neighbours (KNN) classification model. We trained the model on age, average session length, and total play time to see whether it could predict if a player was subscribed to a game-related newsletter.
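
To make the idea behind KNN concrete, here is a minimal base R sketch of the classification step: standardize the predictors, measure the Euclidean distance from a new player to every training player, and take a majority vote among the k closest. The data values, the `new_player` row, and k = 3 are made up for illustration and are not taken from players.csv. The analysis in this report relies on tidymodels with the kknn engine, not on this hand-written version.

```r
# Toy training data (hypothetical values, not taken from players.csv)
train <- data.frame(
    Age = c(17, 21, 19, 35, 40, 22),
    player_average_session_length = c(50, 10, 45, 8, 12, 60),
    played_hours = c(5.0, 0.1, 3.2, 0.2, 0.3, 7.5),
    subscribe = factor(c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"))
)

# Hypothetical new player whose subscription status we want to predict
new_player <- data.frame(Age = 18, player_average_session_length = 48, played_hours = 4.0)

predictors <- c("Age", "player_average_session_length", "played_hours")

# Standardize using the training means and standard deviations only
centers <- colMeans(train[predictors])
scales <- apply(train[predictors], 2, sd)
train_scaled <- scale(train[predictors], center = centers, scale = scales)
new_scaled <- scale(new_player[predictors], center = centers, scale = scales)

# Euclidean distance from the new player to each training player
distances <- sqrt(rowSums(sweep(train_scaled, 2, as.numeric(new_scaled))^2))

# Majority vote among the k nearest neighbours
k <- 3
nearest <- order(distances)[1:k]
votes <- table(train$subscribe[nearest])
predicted <- names(votes)[which.max(votes)]
print(predicted)
```

Without the standardization step, the predictor with the largest numeric range would dominate the distance, which is why the recipe used later scales and centers all predictors.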

```r
# Importing libraries
library(tidyverse)
library(repr)
library(tidymodels)
```

```r
# Raw dataset URLs for the files hosted on GitHub
player_data_url <- "https://raw.githubusercontent.com/MohakB3/dsci100-project/refs/heads/main/data/players.csv"
sessions_data_url <- "https://raw.githubusercontent.com/MohakB3/dsci100-project/refs/heads/main/data/sessions.csv"

# Reading data (with the column specification message disabled) and omitting rows with NAs
player_data <- na.omit(read_csv(player_data_url, show_col_types = FALSE))
sessions_data <- na.omit(read_csv(sessions_data_url, show_col_types = FALSE))

# Wrangling sessions data
sessions_wrangled_data <- sessions_data |>
    # convert start_time and end_time to date-times and take their difference;
    # units = "mins" fixes the unit so session_length is always in minutes
    mutate(session_length = as.numeric(difftime(
        as.POSIXct(end_time, format = "%d/%m/%Y %H:%M"),
        as.POSIXct(start_time, format = "%d/%m/%Y %H:%M"),
        units = "mins"
    ))) |>
    # keep only the two columns needed
    select(hashedEmail, session_length) |>
    # group all sessions belonging to the same hashedEmail together
    group_by(hashedEmail) |>
    # calculate each player's average session length (minutes)
    summarize(player_average_session_length = mean(session_length))

# Merging player data and wrangled sessions data.
# The join starts from the sessions table, so players with no recorded sessions are not included.
merged_data <- left_join(sessions_wrangled_data, player_data, by = "hashedEmail") |>
    na.omit()

# subscribe becomes a factor so it can be used as the class label
merged_data <- merged_data |>
    mutate(subscribe = as.factor(subscribe))
merged_data

# Calculating mean values for all quantitative variables
mean_quantitative_data <- merged_data |>
    summarize(
        mean_age = mean(Age),
        mean_played_hours = mean(played_hours),
        mean_session_length = mean(player_average_session_length)
    )
```

| hashedEmail (chr) | player_average_session_length (dbl) | experience (chr) | subscribe (fct) | played_hours (dbl) | name (chr) | gender (chr) | Age (dbl) |
|---|---|---|---|---|---|---|---|
| 0088b5e134c3f0498a18c7ea6b8d77b4b0ff1636fc93355ccc95b45423367832 | 53.000000 | Regular | TRUE | 1.5 | Isaac | Male | 20 |
| 060aca80f8cfbf1c91553a72f4d5ec8034764b05ab59fe8e1cf0eee9a7b67967 | 30.000000 | Pro | FALSE | 0.4 | Lyra | Male | 21 |
| 0ce7bfa910d47fc91f21a7b3acd8f33bde6db57912ce0290fa0437ce0b97f387 | 11.000000 | Beginner | TRUE | 0.1 | Osiris | Male | 17 |
| 0d4d71be33e2bc7266ee4983002bd930f69d304288a8663529c875f40f1750f3 | 32.153846 | Regular | TRUE | 5.6 | Winslow | Male | 17 |
| 0d70dd9cac34d646c810b1846fe6a85b9e288a76f5dcab9c1ff1a0e7ca200b3a | 35.000000 | Pro | TRUE | 1.0 | Knox | Male | 17 |
| 11006065e9412650e99eea4a4aaaf0399bc338006f85e80cc82d18b49f0e2aa4 | 10.000000 | Veteran | FALSE | 0.1 | Callum | Male | 19 |
| 119f01b9877fc5ea0073d05602a353b91c4b48e4cf02f42bb8d661b46a34b760 | 50.000000 | Amateur | TRUE | 0.7 | Hugo | Female | 21 |
| 18936844e06b6c7871dce06384e2d142dd86756941641ef39cf40a9967ea14e3 | 29.682927 | Amateur | TRUE | 17.2 | Kyrie | Male | 14 |
| 1a2b92f18f36b0b59b41d648d10a9b8b20a2adff550ddbcb8cec2f47d4d881d0 | 18.000000 | Beginner | FALSE | 0.2 | Aurora | Female | 37 |
| 1d2371d8a35c8831034b25bda8764539ab7db0f63938696917c447128a2540dd | 5.000000 | Amateur | FALSE | 0.0 | Emerson | Male | 21 |
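
As a small illustration of the session-length step, the sketch below computes session lengths in minutes and averages them per player on a hypothetical three-row table. The emails and timestamps are invented, and the UTC time zone is an assumption made only so the example behaves the same on any machine. It uses `difftime()` with `units = "mins"` because plain subtraction of two date-times lets R choose the units automatically.

```r
# Hypothetical sessions in the same "day/month/year hour:minute" layout as sessions.csv
sessions_toy <- data.frame(
    hashedEmail = c("a", "a", "b"),
    start_time = c("30/06/2024 18:12", "17/06/2024 23:33", "25/07/2024 17:34"),
    end_time = c("30/06/2024 18:24", "17/06/2024 23:46", "25/07/2024 17:57")
)

fmt <- "%d/%m/%Y %H:%M"
start <- as.POSIXct(sessions_toy$start_time, format = fmt, tz = "UTC")
end <- as.POSIXct(sessions_toy$end_time, format = fmt, tz = "UTC")

# Session length in minutes, with the unit fixed explicitly
sessions_toy$session_length <- as.numeric(difftime(end, start, units = "mins"))

# Average session length per player
avg <- aggregate(session_length ~ hashedEmail, data = sessions_toy, FUN = mean)
print(avg)
```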


```r
set.seed(69420) # fix the random seed for reproducibility

# Splitting the data into training (75%) and testing (25%) sets
data_split <- initial_split(merged_data, prop = 0.75)
train_data <- training(data_split)
test_data <- testing(data_split)

# Model specification for K-nearest neighbours classification with k = 9
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 9) |>
    set_engine("kknn") |> # use the kknn engine
    set_mode("classification") # the model type is classification

# Recipe: predict subscribe from age, average session length, and total played hours,
# using the training data. Predictors are scaled and centered so that all variables are
# on similar ranges, which keeps any one of them from dominating the KNN distances.
train_recipe <- recipe(subscribe ~ Age + player_average_session_length + played_hours, data = train_data) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Combine the recipe and the model in a workflow and fit it on the training set
knn_fit <- workflow() |>
    add_recipe(train_recipe) |>
    add_model(knn_spec) |>
    fit(data = train_data)

# Predict on the test set and attach the predictions to the test data
data_test_predictions <- predict(knn_fit, test_data) |>
    bind_cols(test_data)

accuracy <- data_test_predictions |>
    metrics(truth = subscribe, estimate = .pred_class) |>
    filter(.metric == "accuracy")
accuracy

# event_level = "first" treats the first factor level as the positive class.
# as.factor() on TRUE/FALSE values orders the levels FALSE, TRUE,
# so precision and recall below describe the FALSE (not subscribed) class.
precision <- data_test_predictions |>
    precision(truth = subscribe, estimate = .pred_class, event_level = "first")
precision

recall <- data_test_predictions |>
    recall(truth = subscribe, estimate = .pred_class, event_level = "first")
recall
```

| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.7741935 |
| precision | binary | 1 |
| recall | binary | 0.125 |
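
The precision and recall above are computed with `event_level = "first"`, and since `subscribe` was created with `as.factor()` from TRUE/FALSE values, the first level is FALSE. The base R sketch below shows how the three metrics follow from confusion counts and how precision and recall change when TRUE is treated as the event instead. The counts are hypothetical: they were chosen to be consistent with the reported metrics, but the report does not show the actual confusion matrix, so they should not be read as the real test-set counts.

```r
# Hypothetical confusion counts, with FALSE (not subscribed) as the event
tp <- 1  # predicted FALSE, truly FALSE
fn <- 7  # predicted TRUE, truly FALSE
fp <- 0  # predicted FALSE, truly TRUE
tn <- 23 # predicted TRUE, truly TRUE
total <- tp + fn + fp + tn

# Metrics with FALSE as the event (event_level = "first")
accuracy_value <- (tp + tn) / total
precision_first <- tp / (tp + fp)
recall_first <- tp / (tp + fn)

# Same counts with TRUE as the event (event_level = "second"):
# the roles of tp/tn and fp/fn swap
precision_second <- tn / (tn + fn)
recall_second <- tn / (tn + fp)

# Accuracy of always predicting the majority class (TRUE) on these counts
baseline_accuracy <- (fp + tn) / total

print(c(accuracy = accuracy_value, precision_first = precision_first, recall_first = recall_first))
print(c(precision_second = precision_second, recall_second = recall_second, baseline = baseline_accuracy))
```

Comparing accuracy against the majority-class baseline is a quick check on whether a classifier is doing more than predicting the most common label.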


# Discussion:

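On the held-out test set (25% of the merged data), the KNN classifier with k = 9 reached an accuracy of about 77.4%. Precision and recall were computed with `event_level = "first"`, and because `subscribe` was converted with `as.factor()`, the first level is FALSE. The reported precision of 1 and recall of 0.125 therefore describe how well the model identifies players who are not subscribed: every player it labelled as not subscribed was correct, but it found only 12.5% of the actual non-subscribers. This means the model predicts "subscribed" for the large majority of test players, which suggests that accuracy alone overstates how well age, average session length, and total play time separate the two groups.

Two choices in the analysis limit how far these numbers can be trusted: k was fixed at 9 and not chosen by cross-validation, and the train/test split was not stratified by subscription status. The results should be read as a first estimate.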