In [9]:
#load the necessary packages
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10)

In [16]:
#load the players data set
url <- "https://drive.google.com/uc?export=download&id=1Mw9vW0hjTJwRWx0bDXiSpYsO3gKogaPz"
players <- read_csv(url)

Rows: 196 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, age
lgl (3): subscribe, individualId, organizationName

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [11]:
#tidy the players data set by dropping the columns that are not needed for the analysis
players_tidy <- players |> 
    select(experience:age, -hashedEmail, -name) |> #keep experience through age, minus the hashedEmail and name columns
    head(5) #keep only the first 5 rows; this truncates players_tidy itself, not just the printed output
players_tidy

experience,subscribe,played_hours,gender,age
<chr>,<lgl>,<dbl>,<chr>,<dbl>
Pro,True,30.3,Male,9
Veteran,True,3.8,Male,17
Veteran,False,0.0,Male,17
Amateur,True,0.7,Female,21
Regular,True,0.1,Male,21
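
Because `head(5)` sits inside the assignment pipeline, `players_tidy` itself holds only five rows. If the goal is to preview five rows while keeping every player for the later steps, one possible arrangement is sketched below. The tibble `players_demo` and all of its values are hypothetical stand-ins, not the real players data.

```r
library(tidyverse)

# Hypothetical stand-in for the players data (values invented for illustration)
players_demo <- tibble(
  experience   = c("Pro", "Veteran", "Amateur", "Regular", "Amateur", "Pro", "Veteran"),
  hashedEmail  = paste0("hash", 1:7),
  played_hours = c(1.5, 0.2, 4.0, 0.0, 2.3, 0.9, 7.1),
  name         = paste0("player", 1:7)
)

# Store the full selection first ...
players_demo_tidy <- players_demo |>
  select(-hashedEmail, -name)

# ... then limit only what is displayed
players_demo_tidy |> head(5)

# All 7 rows remain available for later steps
nrow(players_demo_tidy)
```

Separating the stored object from the preview means any later summary is computed over every row rather than over the previewed subset.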


In [4]:
#calculate the average number of played hours to determine a boundary separating high and low contributors
average_played_hours <- players_tidy |>
  summarize(avg_hours = mean(played_hours, na.rm = TRUE))
average_played_hours

avg_hours
<dbl>
6.98


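The output above shows that the average of played_hours in players_tidy is 6.98 hours. Because players_tidy holds only the first five rows (it was truncated with head(5) earlier), this is the mean of those five players rather than of all 196 rows in the players data set. We use 6.98 hours as the boundary: players who contributed 6.98 hours or more are classified as "High Contributors" and players who contributed less than 6.98 hours are classified as "Low Contributors."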
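
The 6.98 figure can be reproduced by hand from the five `played_hours` values shown in the preview of `players_tidy`, which is a quick way to see which rows the average covers. A minimal check, re-entering those values manually (the vector names are chosen here for illustration):

```r
# played_hours values copied from the five-row preview of players_tidy
played_hours <- c(30.3, 3.8, 0.0, 0.7, 0.1)

# Mean of the five values: (30.3 + 3.8 + 0.0 + 0.7 + 0.1) / 5
boundary <- mean(played_hours)
boundary

# Apply the same rule as in the text: at or above the mean is "High Contributor"
labels <- ifelse(played_hours >= boundary, "High Contributor", "Low Contributor")
labels
```

Only the first value (30.3) sits at or above the boundary, so one of the five rows is labelled "High Contributor".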

In [5]:
#convert the character variables to factor variables so they can be used as categories for KNN classification
players_tidy <- players_tidy |> 
    mutate(experience = as.factor(experience), 
           gender = as.factor(gender))

In [6]:
#convert the experience and gender factors to their integer level codes (levels are ordered alphabetically by default) so they can be used to calculate distances between points in KNN classification
players_tidy <- players_tidy |> 
    mutate(experience = as.numeric(experience), 
           gender = as.numeric(gender))
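
The two conversion steps above rely on how R codes factors: `as.factor()` orders the levels alphabetically by default, and `as.numeric()` then returns each level's position. The sketch below reproduces the codes for the five `experience` values in the preview, and shows one alternative in which the level order is set explicitly. The ordering Amateur < Regular < Veteran < Pro is an assumption made for illustration, not something stated in the data set.

```r
# experience values copied from the five-row preview of players_tidy
experience <- c("Pro", "Veteran", "Veteran", "Amateur", "Regular")

# Default: levels are sorted alphabetically (Amateur, Pro, Regular, Veteran)
default_codes <- as.numeric(as.factor(experience))
default_codes   # 2 4 4 1 3

# Alternative: state the level order explicitly (this ordering is an assumption)
assumed_levels <- c("Amateur", "Regular", "Veteran", "Pro")
ordered_codes <- as.numeric(factor(experience, levels = assumed_levels))
ordered_codes   # 4 3 3 1 2
```

With the default coding, Pro (2) lands between Amateur (1) and Regular (3), so distances computed on these codes reflect alphabetical position rather than any skill ordering.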

In [7]:
#assign a contributor label to each player, using the computed average (6.98) as the boundary
boundary <- pull(average_played_hours, avg_hours)
players_tidy <- players_tidy |> 
    mutate(contributor = ifelse(played_hours >= boundary, "High Contributor", "Low Contributor"))

In [8]:
#preview the data to confirm that the contributor column was added correctly
head(players_tidy)

experience,subscribe,played_hours,gender,age,contributor
<dbl>,<lgl>,<dbl>,<dbl>,<dbl>,<chr>
2,True,30.3,2,9,High Contributor
4,True,3.8,2,17,Low Contributor
4,False,0.0,2,17,Low Contributor
1,True,0.7,1,21,Low Contributor
3,True,0.1,2,21,Low Contributor
