# Predicting Subscription to a Video Game Newsletter Based on Age and Time Played on the Game

### By: Lila, Lauren, and Khush

## Introduction

Researchers have been collecting data on how people play video games by recording player activity on a 'Minecraft' server set up by the data scientists. This data can be used to explore a number of different questions.

The question we aim to answer in this report is *'Can a player's age and time played predict whether they subscribe to a newsletter, using players.csv?'*

The dataset titled "players.csv" will be used to answer this question. It contains information about each player observed, including:
 - The player's experience (Beginner, Amateur, Veteran, Pro)
 - Whether or not the player has subscribed to a game-related newsletter
 - The hashed email of the player
 - The hours they have spent playing
 - The name of the player
 - The gender of the player
 - The age of the player

For this report, we will focus on just the age of the player, the time they have spent playing, and whether or not they have subscribed to a game-related newsletter.
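
As a minimal sketch of narrowing the data to these three variables, the cell below builds a small made-up tibble and keeps only the columns of interest. The column names `subscribe`, `Age`, and `played_hours` match those used in the code later in this report; the `experience` column name and every value shown are hypothetical placeholders, not rows from players.csv.

```r
library(tidyverse)

# Hypothetical toy data standing in for players.csv (all values are made up)
toy_players <- tibble(
    experience = c("Pro", "Amateur", "Veteran"),
    subscribe = c(TRUE, FALSE, TRUE),
    played_hours = c(12.5, 0.0, 3.0),
    Age = c(19, 24, 31)
)

# Keep only the response variable and the two predictors used in this report
toy_selected <- toy_players |>
    select(subscribe, Age, played_hours)
toy_selected
```

The same `select()` call could be applied to the real data once it has been loaded.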

## Methods & Results

This section covers loading and wrangling the data, summarizing it, and creating visualizations:

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)

In [None]:
# loading the data
players_original <- read_csv("players.csv")

# converting subscribe to a factor and relabelling its levels as Yes/No
players <- mutate(players_original, subscribe = as_factor(subscribe)) |>
    mutate(subscribe = fct_recode(subscribe, "Yes" = "TRUE", "No" = "FALSE"))
head(players)

In [None]:
# Totalling the number of players based on subscription
subscriber_counts <- players |>
    group_by(subscribe) |>
    summarize(count=n())
subscriber_counts

In [None]:
# Minimum, maximum and mean of players' ages and hours played

# Age
age_summary <- players |>
    summarize(age_min = min(Age, na.rm=TRUE),
             age_max = max(Age, na.rm=TRUE),
             age_mean = mean(Age, na.rm=TRUE))
age_summary

# Hours played
played_hours_summary <- players |>
    summarize(played_hours_min = min(played_hours, na.rm=TRUE),
             played_hours_max = max(played_hours, na.rm=TRUE),
             played_hours_mean = mean(played_hours, na.rm=TRUE))
played_hours_summary

In [None]:
# Histogram of ages of subscribed individuals

options(repr.plot.width=8, repr.plot.height=8)
age_subscribers <- players |>
    filter(subscribe == "Yes") |>
    ggplot(aes(x=Age)) +
    geom_histogram(binwidth = 10) +
    labs(x="Age (years)", y="Number of Subscribers", title="Distribution of Subscribers by Age") +
    theme(text=element_text(size=20))
age_subscribers

In [None]:
# Histograms of the distributions of ages and hours played among players

options(repr.plot.width = 8, repr.plot.height = 8)

# Age
age_counts <- ggplot(players, aes(x=Age)) +
    geom_histogram(binwidth=10) +
    labs(x="Age (years)", y="Count", title="Ages among Players") +
    theme(text=element_text(size=20))
age_counts

# Hours played
played_hours_counts <- ggplot(players, aes(x=played_hours)) +
    geom_histogram(boundary = 0, binwidth=25) +
    labs(x="Hours", y="Count", title="Number of Hours Played Among Players") +
    theme(text=element_text(size=20))
played_hours_counts

In [None]:
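#setting a seed so the random split and cross-validation folds are reproducible
#(the seed value itself is arbitrary)
set.seed(1234)

#creating the split
players_split <- initial_split(players, prop = 0.75, strata = subscribe)
players_train <- training(players_split)
players_test <- testing(players_split)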

head(players_train)
head(players_test)

#creating the recipe
players_recipe <- recipe(subscribe ~ Age + played_hours , data = players_train) |>
   step_scale(all_predictors()) |>
   step_center(all_predictors())

#creating the model specification (the number of neighbors is tuned, so nothing is fit yet)
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
      set_engine("kknn") |>
      set_mode("classification")

#do cross validation
players_vfold <- vfold_cv(players_train, v = 10, strata = subscribe)

k_vals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))
knn_results <- workflow() |>
      add_recipe(players_recipe) |>
      add_model(knn_tune) |>
      tune_grid(resamples = players_vfold, grid = k_vals) |>
      collect_metrics()
head(knn_results)
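
The recipe above scales and centers both predictors because K-nearest neighbors relies on distances, so a variable measured on a larger scale would otherwise dominate the comparison. As a rough illustration with made-up numbers (not values from players.csv), base R's `scale()` performs the same kind of standardization:

```r
# Hypothetical ages (years) and hours played; values are made up for illustration
toy <- data.frame(Age = c(17, 21, 25, 40), played_hours = c(0.5, 2, 150, 10))

# scale() subtracts each column's mean and divides by its standard deviation
toy_std <- as.data.frame(scale(toy))

# After standardization each column has mean 0 and standard deviation 1
round(colMeans(toy_std), 10)
apply(toy_std, 2, sd)
```

Once both columns are on the same scale, a large difference in hours played no longer swamps a difference in age when distances between players are computed.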

In [None]:
#get accuracies
accuracies <- knn_results |>
      filter(.metric == "accuracy")

#plot accuracies vs k to find best k
accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
      geom_point() +
      geom_line() +
      labs(x = "Neighbors", y = "Accuracy Estimate")
accuracy_versus_k
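
One way to read the best value of K off results like these, rather than from the plot alone, is to take the row with the highest mean accuracy. The sketch below uses a made-up `toy_accuracies` tibble shaped like the filtered `collect_metrics()` output; the accuracy values are hypothetical and are not results from this analysis.

```r
library(tidyverse)

# Hypothetical accuracy estimates; NOT results from this report
toy_accuracies <- tibble(
    neighbors = 1:5,
    mean = c(0.55, 0.60, 0.68, 0.72, 0.70)
)

# Pick the number of neighbors with the highest estimated accuracy
best_k <- toy_accuracies |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)
best_k
```

Applied to this report, `toy_accuracies` would be replaced with `accuracies`. Setting `with_ties = FALSE` is a design choice that returns a single K even when several values share the top accuracy.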