<b>How are you going to process the data to apply it to the model? (e.g., how many splits, what proportions will you use for the splits, at which stage will you split, will there be a validation set, will you use cross-validation?)</b>
- 5 splits (folds) will be used for cross-validation
- 75% of the data will be used for training, 25% for testing
- 5-fold cross-validation will be used to pick the K value with the minimum RMSE (the cross-validated estimate of RMSPE)
- We first standardize the predictors in the recipe
- Then we make the model spec
- Then we tune K using 5-fold cross-validation
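
For reference, the error metric named in the plan above can be written out as follows. The symbols are notation introduced here rather than taken from the notes: $n$ is the number of held-out observations, $y_i$ is the observed `played_hours` for observation $i$, and $\hat{y}_i$ is the KNN prediction for it.

$$
\text{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

When the same formula is evaluated on the cross-validation folds, tidymodels reports it under the metric name `rmse`, which is the value used to compare candidate K values.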
  

<b>Broad Question: Question 2 (We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts)</b>

Specific Question: *What gender and age group plays the longest hours?*

How the data will address the question of interest (the plan for wrangling to apply a predictive method from class):

We will use the data to identify the gender and age groups that play the most hours, 
so that those players can be recruited, since they contribute a large amount of data. 

The plan is to use multivariable KNN regression in order to predict
the hours a participant of a certain gender and age
group will play. 

The players data set will be used.
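
To make the planned method concrete, here is a minimal base R sketch of how KNN regression turns a (gender, age) pair into a predicted number of hours: standardize the predictors, find the k closest training rows, and average their response. The toy values, the column names, and the `knn_predict` helper are hypothetical illustrations; the actual analysis uses tidymodels with the `kknn` engine rather than this function.

```r
# Toy training data (hypothetical values, NOT taken from players.csv)
train <- data.frame(
  age    = c(10, 12, 20, 22, 30),
  gender = c(1, 2, 1, 2, 1),   # numeric gender codes, as in the analysis
  hours  = c(5, 7, 1, 3, 0)
)

# Hypothetical helper: predict hours for one new point with unweighted KNN
knn_predict <- function(train, new_point, k) {
  predictors <- c("age", "gender")

  # Standardize using the training means and standard deviations
  centers <- colMeans(train[predictors])
  scales  <- apply(train[predictors], 2, sd)
  scaled_train <- scale(train[predictors], center = centers, scale = scales)
  scaled_new   <- (unlist(new_point[predictors]) - centers) / scales

  # Euclidean distance from the new point to every training row
  distances <- sqrt(rowSums(sweep(scaled_train, 2, scaled_new)^2))

  # Average the response over the k nearest rows
  nearest <- order(distances)[seq_len(k)]
  mean(train$hours[nearest])
}

prediction <- knn_predict(train, data.frame(age = 11, gender = 1), k = 2)
prediction
```

In this toy example the two nearest rows are the 10-year-old and the 20-year-old with gender code 1, so the prediction is the mean of 5 and 1, which is 3.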

Title

Introduction (Provide relevant background information)

## Methods

First, we load the tidyverse and tidymodels packages, along with a few supporting packages, so that we have the tools needed to analyze the data.

In [6]:
library(tidyverse)
library(tidymodels)
library(repr)
library(infer)
library(rvest)
library(themis)

Next, we read in the data from the raw files on GitHub.

In [7]:
players_url <- "https://raw.githubusercontent.com/cindylemon/plaicraft-individual-project/refs/heads/main/players.csv"
players <- read_csv(players_url)
head(players)

sessions_url <- "https://raw.githubusercontent.com/cindylemon/plaicraft-individual-project/refs/heads/main/sessions.csv"
sessions <- read_csv(sessions_url)
head(sessions)

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


experience,subscribe,hashedEmail,played_hours,name,gender,Age
<chr>,<lgl>,<chr>,<dbl>,<chr>,<chr>,<dbl>
Pro,True,f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d,30.3,Morgan,Male,9
Veteran,True,f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9,3.8,Christian,Male,17
Veteran,False,b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28,0.0,Blake,Male,17
Amateur,True,23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5,0.7,Flora,Female,21
Regular,True,7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e,0.1,Kylie,Male,21
Amateur,True,f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977,0.0,Adrian,Female,17


Rows: 1535 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): hashedEmail, start_time, end_time
dbl (2): original_start_time, original_end_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


hashedEmail,start_time,end_time,original_start_time,original_end_time
<chr>,<chr>,<chr>,<dbl>,<dbl>
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,30/06/2024 18:12,30/06/2024 18:24,1719770000000.0,1719770000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,17/06/2024 23:33,17/06/2024 23:46,1718670000000.0,1718670000000.0
f8f5477f5a2e53616ae37421b1c660b971192bd8ff77e3398304c7ae42581fdc,25/07/2024 17:34,25/07/2024 17:57,1721930000000.0,1721930000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,25/07/2024 03:22,25/07/2024 03:58,1721880000000.0,1721880000000.0
36d9cbb4c6bc0c1a6911436d2da0d09ec625e43e6552f575d4acc9cf487c4686,25/05/2024 16:01,25/05/2024 16:12,1716650000000.0,1716650000000.0
bfce39c89d6549f2bb94d8064d3ce69dc3d7e72b38f431d8aa0c4bf95ccee6bf,23/06/2024 15:08,23/06/2024 17:10,1719160000000.0,1719160000000.0


Next, we select the variables we need for the model: `Age`, `gender`, and `played_hours`. We also convert `gender` to a numeric code so that it can be used as a predictor in KNN regression.

In [8]:

player_selected <- players |>
  select(Age, gender, played_hours) |>
  mutate(gender = as.numeric(as_factor(gender)))


head(player_selected)

Age,gender,played_hours
<dbl>,<dbl>,<dbl>
9,1,30.3
17,1,3.8
17,1,0.0
21,2,0.7
21,1,0.1
17,2,0.0


We then have to make the model for the regression. For reference, the numeric codes assigned to `gender` are:

- Male = 1
- Female = 2
- Non-binary = 3
- Prefer not to say = 4
- Two-spirited = 6
- Other = 7
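
The codes above come from applying `as_factor()` and then `as.numeric()` to the `gender` column. On a character column, `as_factor()` orders the levels by first appearance, so each code reflects where that gender first shows up in the data. The short sketch below reproduces that behaviour in base R; the example vector is hypothetical, and `factor(x, levels = unique(x))` is used here as a base R stand-in for `as_factor()`.

```r
# Hypothetical example values (not rows from players.csv)
gender <- c("Male", "Male", "Female", "Non-binary", "Female")

# Levels in order of first appearance, mimicking forcats::as_factor()
gender_factor <- factor(gender, levels = unique(gender))

# Integer position of each level becomes the numeric code
codes <- as.numeric(gender_factor)

# Lookup table from label to code
data.frame(label = levels(gender_factor), code = seq_along(levels(gender_factor)))
```

Because KNN works on distances, these codes also imply that, for example, 1 and 2 are closer together than 1 and 7, even though the categories have no natural order. This is a property of the encoding to keep in mind when reading the results.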


In [9]:
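# Drop rows with missing values; the result must be assigned to take effect
player_selected <- drop_na(player_selected)
player_selected

# Split into 75% training and 25% testing, stratified by the response
hours_split <- initial_split(player_selected, prop = 0.75, strata = played_hours)
hours_train <- training(hours_split)
hours_test <- testing(hours_split)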


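# Standardize all predictors so Age and gender are on the same scale
player_recipe <- recipe(played_hours ~ ., data = hours_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN regression spec; the number of neighbors is tuned below
player_spec <- nearest_neighbor(weight_func = "rectangular",
                                neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

# 5-fold cross-validation on the training set
player_vfold <- vfold_cv(hours_train, v = 5, strata = played_hours)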

player_wkflw <- workflow() |>
  add_recipe(player_recipe) |>
  add_model(player_spec)

gridvals <- tibble(neighbors = seq(from = 1, to = 50, by = 3))

player_results <- player_wkflw |>
  tune_grid(resamples = player_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")
player_results

player_min <- player_results |>
  filter(mean == min(mean))

player_min

Age,gender,played_hours
<dbl>,<dbl>,<dbl>
9,1,30.3
17,1,3.8
17,1,0.0
21,2,0.7
21,1,0.1
17,2,0.0
19,2,0.0
21,1,0.0
17,1,0.1
22,2,0.0


neighbors,.metric,.estimator,mean,n,std_err,.config
<dbl>,<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
1,rmse,standard,26.26471,5,7.510306,Preprocessor1_Model01
4,rmse,standard,29.5525,5,6.616077,Preprocessor1_Model02
7,rmse,standard,27.66234,5,7.007517,Preprocessor1_Model03
10,rmse,standard,27.31744,5,7.064316,Preprocessor1_Model04
13,rmse,standard,26.55093,5,7.391464,Preprocessor1_Model05
16,rmse,standard,26.28905,5,7.525678,Preprocessor1_Model06
19,rmse,standard,26.14847,5,7.63401,Preprocessor1_Model07
22,rmse,standard,26.29384,5,7.61909,Preprocessor1_Model08
25,rmse,standard,26.22636,5,7.597145,Preprocessor1_Model09
28,rmse,standard,26.07813,5,7.638495,Preprocessor1_Model10


neighbors,.metric,.estimator,mean,n,std_err,.config
<dbl>,<chr>,<chr>,<dbl>,<int>,<dbl>,<chr>
46,rmse,standard,25.57754,5,7.56741,Preprocessor1_Model16
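
As a rough check on how much the choice of K matters, the sketch below re-enters the rows printed above (the ten displayed rows plus the K = 46 row; the remaining grid values are not shown and so are left out) and compares each mean RMSE with the minimum plus its standard error. The one-standard-error comparison is an added illustration, not a step the analysis itself performs, and the object names are hypothetical.

```r
# Rows copied from the printed tuning results above (displayed rows only)
results <- data.frame(
  neighbors = c(1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 46),
  mean      = c(26.26471, 29.5525, 27.66234, 27.31744, 26.55093, 26.28905,
                26.14847, 26.29384, 26.22636, 26.07813, 25.57754),
  std_err   = c(7.510306, 6.616077, 7.007517, 7.064316, 7.391464, 7.525678,
                7.63401, 7.61909, 7.597145, 7.638495, 7.56741)
)

# K with the smallest cross-validated RMSE
best <- results[which.min(results$mean), ]

# Rows whose RMSE lies within one standard error of that minimum
threshold <- best$mean + best$std_err
within_one_se <- results[results$mean <= threshold, ]

best$neighbors
nrow(within_one_se)
```

For the rows shown, K = 46 has the lowest mean RMSE, but every displayed K falls within one standard error of it, which suggests the cross-validation results do not strongly separate the candidate K values.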
