Final Project Report: Predicting a High Contributor
-

In [None]:
library(tidymodels)
library(tidyverse)

In [None]:
players <- read_csv("https://raw.githubusercontent.com/Modas101/dsci-100-project-final/refs/heads/main/data/players.csv")
sessions <- read_csv("https://raw.githubusercontent.com/Modas101/dsci-100-project-final/refs/heads/main/data/sessions.csv")

head(players)

Introduction:
-
We explore the `players.csv` dataset, which records how much time each player has spent on the server along with several characteristics of the player. We use these characteristics to build a model that predicts whether a player will be a high or a low contributor to the server.

#### Question:
Can player characteristics (specifically `experience`, `subscribe`, and `Age`) predict whether a player is a high data contributor (defined as having `played_hours` above the 75th percentile, i.e. roughly the top 25% of players) in the `players` dataset?
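
To make this definition concrete, here is a minimal base R sketch of labelling values that fall above the 75th percentile. The `hours` vector is made-up toy data, not values from `players.csv`, and the variable names are illustrative only.

```r
# Toy data for illustration only (not taken from players.csv)
hours <- c(0, 0.1, 0.5, 1, 2, 5, 30, 150)

# 75th percentile using R's default quantile method (type 7)
threshold <- unname(quantile(hours, 0.75))

# A value counts as "high" only if it is strictly above the cutoff
high <- hours > threshold

threshold   # 11.25
sum(high)   # 2
```

Because the comparison is strict, roughly a quarter of players or fewer end up labelled as high contributors (fewer when many players tie at the cutoff), so the two classes are imbalanced by construction.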

#### Data Description:
**players.csv:**\
**experience (string type):** A self-rated assessment of their own experience.\
**subscribe (boolean type):** Whether they are subscribed to a game-related newsletter or not.\
**hashedEmail (string type):** Their email, hashed.\
**played_hours (double type):** Time played in hours.\
**name (string type):** Their name.\
**gender (string type):** Their gender.\
**Age (double type):** Their real life age in years.

#### Potential Issues:
- The `experience` variable in `players.csv` is very subjective, and is more likely an indicator of how confident the player is, rather than an actual measure of their skill level.
- Most players have fewer than 25 `played_hours`, but there are several outliers well above 150. These outliers should be included in our analysis; however, they may make certain plots very difficult to read.
- `Age` is self-reported and could easily be fabricated.


Methods: 
- 
#### Determining Which Variables to Use:
Our dataset contains many variables that we could use in the KNN classification model, but we should first examine the data to check that each candidate variable is relevant and suitable for KNN classification. In **Code Block 1**, we plot the number and the proportion of high contributors by `Age` (histograms) and by `experience` and `subscribe` (subscription status; bar charts). These graphs help us visualize patterns in the data so that we can choose which variables to include in our KNN classification model.
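
The proportion plots show, within each group, the share of players who are high contributors. As a rough sketch of that calculation, the base R example below computes within-group proportions on a small made-up data frame (`toy` is hypothetical and does not come from `players.csv`).

```r
# Toy data for illustration only (not taken from players.csv)
toy <- data.frame(
  experience = c("Pro", "Pro", "Amateur", "Amateur", "Amateur",
                 "Veteran", "Veteran", "Veteran", "Veteran"),
  high_contributor = c(TRUE, FALSE, FALSE, FALSE, TRUE,
                       FALSE, FALSE, FALSE, TRUE)
)

# Counts of low/high contributors within each experience level
counts <- table(toy$experience, toy$high_contributor)

# Row-wise proportions: each row sums to 1, like a bar in a "fill" plot
props <- prop.table(counts, margin = 1)
props
```

Comparing the `TRUE` column across rows is the numeric counterpart of comparing bar segments in the proportion plots. Groups with very few players can show extreme proportions, which is why the count plots are shown alongside them.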

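In **Code Block 2**, we first selected the variables we wanted to use (the code keeps `Age`, `experience`, `subscribe`, and `gender`), then split the data into a training set (75%) and a testing set (25%). To determine the ideal K value, we ran 5-fold cross-validation on the training set for K = 1 to 20 and plotted the estimated accuracy against K for easy visualization. Choosing K carefully is crucial: a K that is too small tends to overfit the training data, while a K that is too large tends to underfit it.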
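
The idea behind choosing K by cross-validation can be sketched in base R as follows. This is a simplified illustration on simulated data, not the tidymodels code used in this report: `knn_predict` is a hypothetical helper implementing plain majority-vote KNN, the folds are not stratified, and the predictors are scaled once on all rows for brevity (a recipe inside a workflow instead estimates the scaling within each training fold).

```r
# Simulated toy data for illustration only (not the players dataset)
set.seed(1)
n <- 120
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(x$a + rnorm(n, sd = 0.8) > 0.7)

# Standardize predictors so each contributes comparably to distances
x_scaled <- scale(x)

# Hypothetical helper: majority-vote KNN using Euclidean distance
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))   # distance to every training row
    votes <- train_y[order(d)[seq_len(k)]]     # labels of the k nearest rows
    names(which.max(table(votes)))             # most common label
  })
}

# Randomly assign each row to one of 5 folds
v <- 5
fold_id <- sample(rep(seq_len(v), length.out = n))

# For each K, average the accuracy over the 5 held-out folds
k_values <- 1:20
cv_accuracy <- sapply(k_values, function(k) {
  fold_acc <- sapply(seq_len(v), function(f) {
    held_out <- fold_id == f
    pred <- knn_predict(x_scaled[!held_out, , drop = FALSE], y[!held_out],
                        x_scaled[held_out, , drop = FALSE], k)
    mean(pred == as.character(y[held_out]))
  })
  mean(fold_acc)
})

# K with the highest cross-validated accuracy
best_k <- k_values[which.max(cv_accuracy)]
best_k
```

Only the training portion of the data should enter this loop, so that the testing set remains unseen until the final evaluation.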

Finally, in **Code Block 3**, we refit the model on the training set using the ideal K value and test it on the "unseen" testing portion of the data. These tests were run as many times as necessary until a sufficiently accurate model was generated.
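
When reading the test accuracy, it helps to compare it against a majority-class baseline: if roughly three quarters of players are low contributors, a model that always predicts "low" is already right about 75% of the time. The base R sketch below shows the comparison on made-up labels (these are not results from this report).

```r
# Toy labels for illustration only (not results from the report)
truth <- factor(c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
                levels = c(FALSE, TRUE))
predicted <- factor(c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
                    levels = c(FALSE, TRUE))

# Confusion matrix: rows are predictions, columns are true labels
confusion <- table(predicted = predicted, truth = truth)
confusion

# Accuracy: share of predictions that match the truth
accuracy <- mean(predicted == truth)

# Baseline: accuracy of always predicting the most common class
baseline <- max(table(truth)) / length(truth)

accuracy   # 0.75
baseline   # 0.75
```

In this toy case the classifier is no better than the baseline even though 0.75 sounds reasonable in isolation, which is why accuracy on an imbalanced response is best reported next to the baseline or the confusion matrix.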

#### Code Block 1:

In [None]:
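# Set the default plot size for the notebook
options(repr.plot.width = 8, repr.plot.height = 8)

# Drop players with a missing Age
clean_players <- players |>
    filter(!is.na(Age))

# 75th percentile of played_hours, used as the high-contributor cutoff
played_hours_75th_percentile <- clean_players |>
    pull(played_hours) |>
    quantile(0.75)
#played_hours_75th_percentile

# Label players strictly above the cutoff as high contributors
clean_players <- clean_players |>
    mutate(high_contributor = played_hours > played_hours_75th_percentile)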




# plots
# age amount
clean_players |> ggplot(aes(x = Age, fill = high_contributor)) +
    geom_histogram(binwidth = 2, position = "stack", alpha = 0.8) +
    labs(title = "Number of High Contributors by Age",
        x = "Age (bins of 2 years)",
        y = "Number of Players",
        fill = "High Contributor") + 
    theme(text = element_text(size = 20))
# age proportion
clean_players |> ggplot(aes(x = Age, fill = high_contributor)) +
    geom_histogram(binwidth = 2, position = "fill", alpha = 0.8) +
    labs(title = "Proportion of High Contributors by Age",
        x = "Age (bins of 2 years)",
        y = "Proportion of Players",
        fill = "High Contributor") + 
    theme(text = element_text(size = 20))
# experience amount
clean_players |> ggplot(aes(x = experience, fill = high_contributor)) +
    geom_bar(position = "stack") +
    labs(title = "Number of High Contributors by Experience Level",
        x = "Experience Level",
        y = "Number of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))
# experience proportion
clean_players |> ggplot(aes(x = experience, fill = high_contributor)) +
    geom_bar(position = "fill") +
    labs(title = "Proportion of High Contributors by Experience Level",
        x = "Experience Level",
        y = "Proportion of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))
# subscribed amount
clean_players |> ggplot(aes(x = subscribe, fill = high_contributor)) +
    geom_bar(position = "stack") +
    labs(title = "Number of High Contributors by Subscription",
        x = "Subscribed to Newsletter",
        y = "Number of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))
# subscribed proportion
clean_players |> ggplot(aes(x = subscribe, fill = high_contributor)) +
    geom_bar(position = "fill") +
    labs(title = "Proportion of High Contributors by Subscription",
        x = "Subscribed to Newsletter",
        y = "Proportion of Players",
        fill = "High Contributor") +
    theme(text = element_text(size = 20))

mean_hours <- clean_players |>
  pull(played_hours) |>
  mean()
mean_age <- clean_players |>
  pull(Age) |>
  mean()

#mean_hours
#mean_age

#### Code Block 2:

In [None]:
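# Keep the response and the candidate predictors; the response must be a factor for classification.
# Note: gender is kept here too, so the `high_contributor ~ .` recipe below uses it as a predictor.
model_data <- clean_players |>
    mutate(high_contributor = as.factor(high_contributor)) |>
    select(high_contributor, Age, experience, subscribe, gender)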

set.seed(123)

data_split <- initial_split(model_data, prop = 0.75, strata = high_contributor)

clean_players_training <- training(data_split)
clean_players_testing <- testing(data_split)

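# Preprocessing: encode predictors numerically, then standardize them for KNN
knn_recipe <- recipe(high_contributor ~ ., data = clean_players_training) |>
    step_mutate(subscribe = as.integer(subscribe)) |>  # TRUE/FALSE -> 1/0
    step_mutate(experience = case_match(experience,    # ordinal encoding of experience
        "Beginner" ~ 1,
        "Regular" ~ 2,
        "Amateur" ~ 3,
        "Veteran" ~ 4,
        "Pro" ~ 5)) |>
    step_novel(all_nominal_predictors()) |>  # handle levels not seen in training
    step_dummy(all_nominal_predictors()) |>  # indicator columns for nominal predictors (gender)
    step_center(all_predictors()) |>
    step_scale(all_predictors())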

knn_spec <- nearest_neighbor(neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

knn_workflow <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec)

cv_folds <- vfold_cv(clean_players_training, v = 5, strata = high_contributor)

k_grid <- tibble(neighbors = seq(from = 1, to = 20, by = 1))

knn_results <- knn_workflow |> tune_grid(resamples = cv_folds, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")
accuracy_versus_k <- ggplot(knn_results, aes(x = neighbors, y = mean)) +
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate") +
      scale_x_continuous(breaks = seq(1, 20, by = 1)) +  # adjusting the x-axis
      scale_y_continuous(limits = c(0.7, 0.8)) # adjusting the y-axis
accuracy_versus_k


Based on the cross-validation accuracy plot above, we use K = 11.

#### Code Block 3:

In [None]:
knn_best_spec <- nearest_neighbor(neighbors = 11) |>
    set_engine("kknn") |>
    set_mode("classification")

# Fit the final workflow with the chosen K (knn_best_spec, not the tunable knn_spec)
knn_best_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_best_spec) |>
    fit(data = clean_players_training)

knn_summary <- knn_best_fit |>
  predict(clean_players_testing) |>
  bind_cols(clean_players_testing) |>
  metrics(truth = high_contributor, estimate = .pred_class) |>
  filter(.metric == 'accuracy')

knn_summary

Discussion:
-
In this project, we defined high contributors as players whose `played_hours` was above the 75th percentile. Exploratory visualizations suggested that older players and those with greater experience were more likely to be high contributors.
