In [2]:
#load the necessary packages
library(repr)
library(themis)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10)

Loading required package: recipes

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘recipes’


The following object is masked from ‘package:stats’:

    step


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()  masks stats::filter()
✖ stringr::fixed() mask

In [3]:
#load the players data set
url <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players <- read_csv(url)

Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [4]:
#tidy the players data set by keeping only the columns needed for the analysis
players_tidy <- players |> 
    select(experience:age, -hashedEmail, -name) #drop the hashedEmail and name columns
players_tidy #preview the data set (repr shows at most 10 rows)

experience,subscribe,played_hours,gender,age
<chr>,<lgl>,<dbl>,<chr>,<dbl>
Pro,TRUE,30.3,Male,9
Veteran,TRUE,3.8,Male,17
Veteran,FALSE,0.0,Male,17
Amateur,TRUE,0.7,Female,21
Regular,TRUE,0.1,Male,21
⋮,⋮,⋮,⋮,⋮
Amateur,TRUE,0.0,Female,17
Veteran,FALSE,0.3,Male,22
Amateur,FALSE,0.0,Prefer not to say,17
Amateur,FALSE,2.3,Male,17


In [5]:
#calculate the average number of played hours to determine a boundary separating high and low contributors
average_played_hours <- players_tidy |>
    summarize(avg_hours = mean(played_hours, na.rm = TRUE)) |>
    pull()


print(paste("The average number of played hours:", average_played_hours))

[1] "The average number of played hours: 5.84591836734694"


The output above shows that the average number of played hours in the players data set is about 5.85 hours. We therefore classify players with more than 5.85 played hours as "High Contributors" and all other players as "Low Contributors."
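
As a worked illustration of this rule, the sketch below applies it to a handful of made-up played-hours values (they are hypothetical and not taken from the players data set). It shows how a single large value can pull the mean up so that most values fall below the boundary.

```r
# Hypothetical played-hours values (not taken from the players data set)
played_hours <- c(0, 0.5, 1, 2, 30)

# The mean is the boundary between the two classes
threshold <- mean(played_hours)

# Label each value: strictly above the mean is a "High Contributor"
contributor <- factor(ifelse(played_hours > threshold,
                             "High Contributor", "Low Contributor"))

threshold           # 6.7
table(contributor)  # 1 High Contributor, 4 Low Contributor
```

In this toy case only one of the five values is labelled "High Contributor", even though the boundary is the average.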

In [6]:
#convert the character variables to factor variables so they can be used as categories for KNN classification
players_tidy <- players_tidy |> 
    mutate(experience = as.factor(experience), 
           gender = as.factor(gender))

In [7]:
#replace each factor level of experience and gender with its integer code (levels are numbered in alphabetical order) so the variables can be used to calculate distances between points in KNN classification
players_tidy <- players_tidy |> 
    mutate(experience = as.numeric(experience), 
           gender = as.numeric(gender))
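
To see what the two conversion cells above do, here is a small sketch on a made-up vector. The four labels are ones visible in the earlier output, but the vector itself is hypothetical. `as.factor()` sorts the levels alphabetically and `as.numeric()` returns each value's position in that ordering, so the codes reflect alphabetical order rather than skill level.

```r
# Hypothetical vector using four experience labels seen in the output above
experience <- c("Pro", "Veteran", "Amateur", "Regular")

# Factor levels are sorted alphabetically by default
experience_fct <- as.factor(experience)
levels(experience_fct)      # "Amateur" "Pro" "Regular" "Veteran"

# as.numeric() on a factor returns the level index of each value
as.numeric(experience_fct)  # 2 4 1 3
```

The codes in the notebook itself depend on the full set of levels present in the data, so they can differ from this toy example.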

In [8]:
#assign a contributor label to each played hours value
players_tidy <- players_tidy |> 
    mutate(contributor = factor(ifelse(played_hours > average_played_hours, "High Contributor", "Low Contributor"))) 

In [9]:
#check that the contributor labels were assigned as expected
head(players_tidy)

experience,subscribe,played_hours,gender,age,contributor
<dbl>,<lgl>,<dbl>,<dbl>,<dbl>,<fct>
3,TRUE,30.3,3,9,High Contributor
5,TRUE,3.8,3,17,Low Contributor
5,FALSE,0.0,3,17,Low Contributor
1,TRUE,0.7,2,21,Low Contributor
4,TRUE,0.1,3,21,Low Contributor
1,TRUE,0.0,2,17,Low Contributor


In [10]:
#set the seed for the project
set.seed(2024) 

#split the data into 90% training and 10% testing (train:test ratio of 9:1), stratified by contributor
players_split <- initial_split(players_tidy, prop = 0.90, strata = contributor)  
players_train <- training(players_split)   
players_test <- testing(players_split)

#show data sample
players_train 
players_test

experience,subscribe,played_hours,gender,age,contributor
<dbl>,<lgl>,<dbl>,<dbl>,<dbl>,<fct>
3,TRUE,30.3,3,9,High Contributor
5,TRUE,3.8,3,17,Low Contributor
5,FALSE,0.0,3,17,Low Contributor
1,TRUE,0.7,2,21,Low Contributor
4,TRUE,0.1,3,21,Low Contributor
⋮,⋮,⋮,⋮,⋮,⋮
1,TRUE,0.0,2,17,Low Contributor
5,FALSE,0.3,3,22,Low Contributor
1,FALSE,0.0,6,17,Low Contributor
1,FALSE,2.3,3,17,Low Contributor


experience,subscribe,played_hours,gender,age,contributor
<dbl>,<lgl>,<dbl>,<dbl>,<dbl>,<fct>
4,TRUE,0.0,2,19,Low Contributor
2,TRUE,0.0,2,17,Low Contributor
2,TRUE,1.0,3,17,Low Contributor
5,TRUE,2.2,3,24,Low Contributor
4,TRUE,218.1,4,20,High Contributor
⋮,⋮,⋮,⋮,⋮,⋮
5,TRUE,0.0,3,17,Low Contributor
1,FALSE,2.1,3,24,Low Contributor
4,FALSE,0.1,3,18,Low Contributor
5,TRUE,0.1,2,44,Low Contributor
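
One way to check that a stratified split behaves as intended is to compare the class proportions in the two sets. The sketch below does this on a made-up data frame (`toy`, its column `x`, and the 20/80 class mix are all hypothetical), using the same `initial_split()` arguments as the cell above. It assumes the rsample package, which tidymodels attaches, is installed.

```r
library(rsample)

set.seed(2024)

# Hypothetical data: 100 rows, 20 of them in the rarer class
toy <- data.frame(
  x = rnorm(100),
  contributor = factor(rep(c("High Contributor", "Low Contributor"),
                           times = c(20, 80)))
)

# 90% training, 10% testing, keeping the class mix similar in both sets
toy_split <- initial_split(toy, prop = 0.90, strata = contributor)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test)

# Class proportions in each set
prop.table(table(toy_train$contributor))
prop.table(table(toy_test$contributor))
```

The same two `prop.table()` calls could be run on `players_train` and `players_test` to inspect the real split.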


In [None]:
set.seed(2024)

#model for k=3 


#build the recipe: standardize the predictors, then use step_upsample so High Contributors are as common as Low Contributors in the training data
players_recipe <- recipe(contributor ~ gender + age + experience, data = players_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors()) |>
    step_upsample(contributor, over_ratio = 1, skip = TRUE) 


players_recipe

#specify the model, starting with 3 neighbors
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) |>
       set_engine("kknn") |>
       set_mode("classification")

#combine the recipe and the model in a workflow and fit it to the training data
players_fit <- workflow() |>
       add_recipe(players_recipe) |>
       add_model(knn_spec) |>
       fit(data = players_train)


players_fit

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs 

Number of variables by role

outcome:   1
predictor: 3

── Operations 

• Scaling for: all_predictors()

• Centering for: all_predictors()

• Up-sampling based on: contributor

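The recipe and model above standardize the predictors and then classify by an unweighted vote among the 3 nearest neighbors. The base R sketch below works through those two ideas by hand on a made-up training set (all values and the names `train` and `new_point` are hypothetical). It is meant as an illustration of the technique, not as a description of how kknn is implemented.

```r
# Toy training data (hypothetical values, not from the players data set)
train <- data.frame(
  age        = c(17, 18, 21, 22, 30, 35),
  experience = c(1, 2, 1, 5, 4, 3),
  label      = c("Low", "Low", "Low", "High", "High", "High")
)
new_point  <- c(age = 23, experience = 4)
predictors <- c("age", "experience")

# Standardize with the training means and standard deviations
# (the combined effect of step_scale() followed by step_center())
mu      <- sapply(train[predictors], mean)
sigma   <- sapply(train[predictors], sd)
train_z <- sweep(sweep(as.matrix(train[predictors]), 2, mu, "-"), 2, sigma, "/")
new_z   <- (new_point[predictors] - mu) / sigma

# Euclidean distance from the new point to every training row
distances <- sqrt(rowSums(sweep(train_z, 2, new_z, "-")^2))

# Unweighted ("rectangular") vote among the 3 nearest neighbors
nearest    <- order(distances)[1:3]
votes      <- table(train$label[nearest])
prediction <- names(votes)[which.max(votes)]
prediction
```

Standardizing first matters because otherwise age, which spans a wider range than experience, would dominate the distance.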


In [1]:
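#this cell relies on the packages loaded in the first cell and on players_train and players_recipe from earlier cells
set.seed(2024)

#candidate values of k: 1 to 10
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

#create 5-fold cross-validation splits, stratified by contributor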
player_vfold <- vfold_cv(players_train, v = 5, strata = contributor)

knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_results <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = player_vfold, grid = k_vals) |>
  collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

accuracies

ERROR: Error in tibble(neighbors = seq(from = 1, to = 10, by = 1)): could not find function "tibble"
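
The error above means `tibble()` was not available when the cell ran. This usually happens when the package-loading cell has not been executed in the current session; the cell counter restarting at 1 suggests a fresh kernel, though that is an inference. A minimal sketch, assuming the tibble package (part of tidyverse) is installed, that makes the grid-building step independent of execution order:

```r
# tibble() comes from the tibble package, which library(tidyverse) attaches.
# Attaching it explicitly lets this step run even in a fresh session.
library(tibble)

# Grid of candidate neighbor counts, as in the cell above
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
nrow(k_vals)  # 10
```

The rest of the tuning cell still needs tidymodels attached and the `players_train` and `players_recipe` objects from earlier cells, so re-running the notebook from the top is the simpler fix.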
