# Data Science Project Final Report 

### Sophie, Bao, Lucia and Miles

# Introduction

## Background
The purpose of this project is to apply what we have learned in this course to a real-world problem and to assist Frank Wood's research group in the Computer Science department with their recruitment efforts. Specifically, our group's research question targets the first of the three broad questions of interest posed to us:
"**Question 1:** *What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?*" ("Project Planning Stage: Individual")

## Data Collection Method
The data were collected by a research group in Computer Science at UBC, led by Frank Wood, using a custom Minecraft server that logged voluntary gameplay.


## Research Question
Can `Age` and `played_hours` make quality predictions of subscription status (`subscribe`) in the `players` dataset?

## Described Dataset 

Run this cell to load all necessary R packages: 

In [None]:
library(tidyverse)
library(scales)
library(RColorBrewer)
library(tidymodels)

In [None]:
# reading in the "players" dataset
players <- read_csv("data/players.csv")
players

**Description of `players`:**

The `players` dataset contains 196 observations and 7 variables:

| Variable Name | Data type | Meaning |
|----------|-----------|---------|
| experience | character | Player's self-identified experience level |
| subscribe | logical | Whether the player subscribes to a game newsletter (TRUE or FALSE) |
| hashedEmail | character | Hashed version of a player's email | 
| played_hours | double | Total hours of playtime of a player| 
| name | character | Player's name | 
| gender | character |  Player's gender | 
| Age | double | Player's age |

# Methods & Results

## Project Rundown/Description

This project involves K-NN classification using 2 predictor variables from the `players` dataset: `played_hours` and `Age`. The steps of the predictive analysis are listed below:

1. Load the dataset `players`
2. Select only the `subscribe`, `played_hours`, and `Age` variables/columns
3. Change `subscribe` from `lgl` to `fct`
4. Create a visualization with the two predictor variables on the x and y axes, colored by the variable of interest, `subscribe`
5. Split the data into a 70-30 ratio of training and testing data, respectively
6. Perform 5-fold cross-validation on the training data, then pick the K with the highest accuracy (RMSE is a regression metric and does not apply to classification)
7. Train another model using the previously selected K to perform classification on the testing set
8. Examine accuracy (recall and precision are not needed since there is no positive class)
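
To make the idea behind K-NN classification concrete, here is a minimal base R sketch of a single prediction by majority vote. The toy data and the helper `knn_predict` are hypothetical and exist only for illustration; the actual analysis uses `tidymodels` with standardized predictors, whereas this sketch uses raw values.

```r
# Toy data (made up for illustration, not taken from players.csv)
train <- data.frame(
  Age          = c(17, 19, 21, 23, 35, 40),
  played_hours = c(30, 25, 20, 1, 0.5, 0.2),
  subscribe    = factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE"))
)

# Hypothetical helper: classify one new player by majority vote
knn_predict <- function(train, new_age, new_hours, k) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt((train$Age - new_age)^2 + (train$played_hours - new_hours)^2)
  # Labels of the k closest training points
  nearest <- train$subscribe[order(dists)[seq_len(k)]]
  # Most common label among those neighbours
  names(which.max(table(nearest)))
}

prediction <- knn_predict(train, new_age = 20, new_hours = 22, k = 3)
prediction
```

Because distances depend on the units of each predictor, a variable with a larger range can dominate the vote, which is why the predictors are scaled and centered in the recipe later on.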

The process will be shown below: 

### STEP 1: Loading the data set

In [None]:
# Reading the dataset
players_full <- read_csv("data/players.csv")
players_full

### STEP 2 & 3: Data wrangling 

In [None]:
# Wrangling the data into the preferred/tidy format to perform predictive analysis 
players_clean <- players_full |>
    select(subscribe, played_hours, Age) |> # selecting the required variables 
    mutate(subscribe = as_factor(subscribe)) # changing the variable of interest to <fct> type 
players_clean

### STEP 4: Creating a visualization 

Now that the data has been wrangled into a tidy format for the predictive analysis, let's create a visualization of the predictors and the variable of interest.

In [None]:
# Creating the visualization 
options(repr.plot.width = 8, repr.plot.height = 6)
players_plot <- players_clean |>
    ggplot(aes(x = Age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age of players", y = "Hours played", color = "Subscription status", 
         title = "Age and Playtime's relation to Subscription Status") +
    theme(text = element_text(size = 14)) + # set the font size for all plot text
    scale_y_log10() +
    scale_color_brewer(palette = "Dark2") 
players_plot

Note that the y-axis for playtime uses a logarithmic scale, which essentially ignores players with a playtime of 0 because log(0) is undefined. The observations below disregard data where playtime is 0.
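
A small sketch of why zero playtimes are a problem on a log scale. The `hours` vector is made up for illustration and is not taken from the dataset.

```r
# Made-up playtimes, including two zeros
hours <- c(0, 0, 0.5, 2, 10)

# Zeros map to -Inf, which has no finite position on a log axis
log_hours <- log10(hours)
log_hours

# Number of points affected by the log transformation
n_zero <- sum(hours == 0)
n_zero
```

Counting the zeros this way in the real data would show how many players the plot effectively leaves out.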

There are a few observations that can be made from the visualization above: 

1. There is no visible relationship between player age and playtime.
2. Most players are between 20 and 30 years old.
3. Most players played between 0 and 10 hours.
4. Neither age nor playtime shows a visible relationship with subscription status.

### STEP 5: Creating the training and testing sets

In [None]:
# creating a 70 - 30 split of players_clean
players_split <- initial_split(players_clean, prop = 0.7, strata = subscribe) 
players_training <- training(players_split) # training data
players_testing <- testing(players_split) # testing data
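
The effect of `strata = subscribe` can be checked by comparing class proportions in the two sets. The sketch below uses made-up data (`toy`) and an arbitrary seed; both are assumptions for illustration. Setting a seed before the split is a common way to make a random split reproducible.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed, chosen only for this sketch

# Made-up data: 100 players, 70 of them subscribed
toy <- tibble(
  subscribe    = factor(rep(c("TRUE", "FALSE"), times = c(70, 30))),
  played_hours = runif(100, 0, 50),
  Age          = sample(15:40, 100, replace = TRUE)
)

# Same kind of 70-30 stratified split as above
toy_split <- initial_split(toy, prop = 0.7, strata = subscribe)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of subscribers in each set; stratifying keeps these close
train_share <- mean(toy_train$subscribe == "TRUE")
test_share  <- mean(toy_test$subscribe == "TRUE")
c(train = train_share, test = test_share)
```

If the two shares differed substantially, accuracy on the testing set would be harder to compare with the cross-validation estimates from the training set.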

### STEP 6: Performing 5-fold cross-validation on `players_training`

In [None]:
# creating the specifications for the model 
players_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# creating the recipe 
players_recipe <- recipe(subscribe ~ played_hours + Age, data = players_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors()) 

# creating the 5 folds for cross-validation
players_vfold <- vfold_cv(players_training, v = 5, strata = subscribe)

# creating the K values (from 1 to 10) 
k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1)) 

# combining the recipe and model in a workflow, tuning over K, and collecting the metrics
players_results <- workflow() |>
    add_recipe(players_recipe) |>
    add_model(players_spec) |>
    tune_grid(resamples = players_vfold, grid = k_vals) |>
    collect_metrics()
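
Once the metrics are collected, the best K can be read off as the value with the highest mean cross-validated accuracy. The sketch below uses `toy_results`, a hypothetical stand-in for `players_results` with made-up numbers and only the columns needed here; the same pipeline should apply to the real results, assuming they contain the usual `neighbors`, `.metric`, and `mean` columns.

```r
library(tidyverse)

# Hypothetical stand-in for players_results; the numbers are made up
toy_results <- tibble(
  neighbors = rep(1:3, each = 2),
  .metric   = rep(c("accuracy", "roc_auc"), times = 3),
  mean      = c(0.55, 0.50, 0.70, 0.58, 0.65, 0.60)
)

best_k <- toy_results |>
  filter(.metric == "accuracy") |>              # keep only the accuracy rows
  slice_max(mean, n = 1, with_ties = FALSE) |>  # row with the highest mean accuracy
  pull(neighbors)                               # extract the K value
best_k
```

The selected value can then be passed as `neighbors` in a new model specification for the final fit on the training data.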

## References