Final Data Sci 100 Project - Sophia Koronczay 87575445

In [None]:
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(repr)
library(GGally)
options(repr.matrix.max.rows = 6)

Predictive Guiding Question (Question 2): We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

My specific predictive question, based on the above:
Is there a relationship between player experience level and total game time, and can age group predict experience level and thus, indirectly, total game time (the amount of data contributed)?

In [None]:
# Read in the data
players <- read_csv("players.csv")
sessions <- read_csv("sessions.csv")
sessions
players
# Join the data frames on the shared player identifier
all_data <- inner_join(sessions, players, by = "hashedEmail")
all_data

Initial goal: Check visually, using a bar chart, whether there is a trend between experience level and played hours, i.e. identify which groups would contribute the most data.
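
Before summarising play time, it helps to confirm whether `played_hours` is stored once per player or once per session, because a join repeats player-level columns on every session row. The sketch below uses made-up toy tables (`toy_players`, `toy_sessions`, `session_id`, and all of their values are hypothetical, not the project data) to show how summing a player-level column after a join overcounts, and how `n_distinct()` reveals the problem.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: played_hours is stored once per player
toy_players <- tibble(
  hashedEmail  = c("a", "b"),
  played_hours = c(10, 2)
)

# Hypothetical session log: player "a" has three sessions, "b" has one
toy_sessions <- tibble(
  hashedEmail = c("a", "a", "a", "b"),
  session_id  = 1:4
)

toy_joined <- inner_join(toy_sessions, toy_players, by = "hashedEmail")

toy_check <- toy_joined |>
  group_by(hashedEmail) |>
  summarise(
    n_rows           = n(),
    # One distinct value per player means the column is player-level
    distinct_hours   = n_distinct(played_hours),
    # Summing then multiplies the stored total by the number of sessions
    summed_hours     = sum(played_hours),
    # first() recovers the stored per-player total
    per_player_hours = first(played_hours)
  )
toy_check
```

If the same check on `all_data` gives one distinct `played_hours` value for every `hashedEmail`, the column is most likely a per-player total, and `first(played_hours)` would be the safer summary than `sum(played_hours)`.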


In [None]:
# Order the experience levels
all_data <- all_data |>
  mutate(experience = factor(experience, levels = c("Beginner", "Amateur", "Regular", "Pro", "Veteran")))

# Check whether played_hours is per session or per player
# all_data |>
#   count(hashedEmail) |>
#   arrange(desc(n))

# hashedEmails appear more than once in the joined data, so play time is summarised into one number per player.
# Caution: this sum is only correct if played_hours is recorded per session. If it comes from players.csv,
# it is already a per-player total repeated on every session row, and first(played_hours) would be the right summary.

play_time <- all_data |>
  group_by(hashedEmail, experience, gender, Age) |>
  summarise(total_hours = sum(played_hours, na.rm = TRUE), .groups = "drop")

play_time_plot <- play_time |>
  group_by(experience) |>
  summarise(avg_hours = mean(total_hours, na.rm = TRUE)) |>
  ggplot(aes(x = experience, y = avg_hours)) +
  geom_col(alpha = 0.6) +
  xlab("Experience Level") +
  ylab("Mean Total Time Played per Player (Hrs)") +
  ggtitle("Mean Total Play Time per Player by Experience Level") +
  theme(text = element_text(size = 12))
play_time_plot
# The plot shows that "Regulars" and "Amateurs" have the highest average total play time, and so provide the most data per player.
# The question now becomes: which age groups are most likely to fall into the experience levels associated with the highest total play time (amount of data)?

In [None]:
# Now I will establish whether age is related to player experience level.
# This will show whether age can predict which players will give researchers the most data in terms of play time.

play_time_clean <- play_time |>
  filter(!is.na(Age), !is.na(experience), !is.na(gender))

# Split the data into training and test sets
set.seed(2020)

data_split <- initial_split(play_time_clean, prop = 0.75, strata = experience)
train_data <- training(data_split)
test_data <- testing(data_split)

# Check class counts before cross-validation:
# count(train_data, experience)
# Since there are only n = 9 Pro players, I reduce the number of folds from 5 to 3 so that every experience level appears in each fold.

# Create the recipe (centre and scale the predictor)
experience_recipe <- recipe(experience ~ Age, data = train_data) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())
# Model specification with K left to be tuned
model_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")
# Cross-validation
numbervfold <- vfold_cv(train_data, v = 3, strata = experience)
k_vals <- tibble(neighbors = seq(from = 1, to = 10))

tune_data <- workflow() |>
  add_recipe(experience_recipe) |>
  add_model(model_spec) |>
  tune_grid(resamples = numbervfold, grid = k_vals)
    
number_metrics <- tune_data |>
  collect_metrics() |>
  filter(.metric == "accuracy") |>
  mutate(neighbors = as.double(neighbors)) |>
  filter(!is.na(mean))

cross_val_plot <- ggplot(number_metrics, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(
    x = "Number of Neighbors (K)",
    y = "Accuracy Estimate",
    title = "KNN Cross-Validation Accuracy"
  ) +
  scale_x_continuous(breaks = seq(1, 10, by = 1))
cross_val_plot
# Accuracy is about 40% with K = 9 or K = 10

# Refit the model with K = 9, one of the best-performing values in the cross-validation plot

final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 9) |>
  set_engine("kknn") |>
  set_mode("classification")
                              
final_workflow <- workflow() |>
  add_recipe(experience_recipe) |>
  add_model(final_spec)

final_fit <- fit(final_workflow, data = train_data)
test_predictions <- predict(final_fit, new_data = test_data) |>
  bind_cols(test_data)
                              
test_metrics <- test_predictions |>
  metrics(truth = experience, estimate = .pred_class)

final_conf_mat <- test_predictions |>
  conf_mat(truth = experience, estimate = .pred_class)
                   
test_metrics
final_conf_mat 
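
An accuracy of about 40% across five classes is easier to interpret next to a majority-class baseline, i.e. the accuracy of always predicting the most common experience level. The sketch below computes that baseline on made-up labels (`toy_test` and its class counts are hypothetical, not the project results). It is offered as one possible sanity check rather than as part of the original analysis.

```r
library(dplyr)
library(tibble)

# Hypothetical test-set labels (made-up counts, not the project results)
toy_test <- tibble(
  experience = c(rep("Amateur", 6), rep("Veteran", 5), rep("Regular", 4),
                 rep("Beginner", 3), rep("Pro", 2))
)

# Accuracy of a classifier that always predicts the most common class
toy_baseline <- toy_test |>
  count(experience, name = "n_players") |>
  mutate(baseline_accuracy = n_players / sum(n_players)) |>
  slice_max(baseline_accuracy, n = 1, with_ties = FALSE)
toy_baseline
```

In this notebook the same pipeline would start from `test_data` instead of `toy_test`. A KNN test accuracy close to the resulting proportion would suggest that `Age` adds little beyond guessing the majority class.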

                              