**DSCI 100 Final Project: Predicting Usage of a Video Game Research Server**

Question #1: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?

*Research Question: Can playing time and player age predict subscribing to a game-related newsletter?*

Question #2: We would like to know which "kinds" of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts.

*Research Question: Can we predict the gender of players based on playing time?*


In [None]:
#loading libraries for analyses
library(tidyverse)
library(readr)
library(tidymodels)
library(scales)
library(janitor)

#preset the max rows shown when displaying data
options(repr.matrix.max.rows = 6)

In [None]:
#find working directory
getwd()

#read in players.csv using a relative path (skip = 1 skips the first line of the file),
#then clean the column names: clean_names() converts them to lowercase snake_case
players <- read_delim('Data/players.csv', delim = ',', skip = 1 ) |> clean_names()
players

In [None]:
#setting dimensions for the plots
options(repr.plot.height = 10, repr.plot.width = 10)

#Exploratory plots to better understand the dataset 
playr_time_age_plot <- players |>
    ggplot(aes(x = age, y = played_hours, color = subscribe)) +
    geom_point() +
    labs(x = "Age (in Years)", y = "Total Hours Played",
         color = "Did the player subscribe?")

playr_time_age_plot 

#many points sit near the x-axis, so overplotting hides detail - I replotted the data
#with a log10 y-axis to spread out the low values and examine them more closely
#(note that log10(0) is -Inf, so any players with 0 hours cannot be shown at their true position)
playr_time_age_plot_scaled <- players |>
    ggplot(aes(x = age, y = played_hours, color = subscribe)) +
    geom_point() +
    scale_y_log10() +
    labs(x = "Age (in Years)", y = "Total Hours Played (log scale)",
         color = "Did the player subscribe?")

playr_time_age_plot_scaled

#here we can see that most points are below 10 hours played
#and below 30 years of age (which makes sense for an undergraduate course).
#There is also no visible linear relationship between age and played hours,
#so KNN regression is likely a better choice than simple linear regression
#for predicting played hours from age
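
The closing comment above proposes KNN regression of played hours on age. A minimal sketch of how that could look with tidymodels follows. The helper `fit_knn_hours()`, its arguments, the 75/25 split, the 5-fold cross-validation, and the grid of candidate `k` values are all assumptions made for illustration and are not part of the original analysis. It also assumes the `kknn` package is installed and that the data has `age` and `played_hours` columns, as in the plots above.

```r
library(tidymodels)

# Hypothetical helper (illustration only): tune and fit a KNN regression
# of played_hours on age, then evaluate it on a held-out test set.
fit_knn_hours <- function(data, k_values = seq(1, 15, by = 2), folds = 5, seed = 1) {
  set.seed(seed)

  # KNN cannot use rows with a missing predictor or response
  data <- data |> drop_na(age, played_hours)

  # Assumed 75/25 train/test split
  data_split <- initial_split(data, prop = 0.75)
  data_train <- training(data_split)
  data_test <- testing(data_split)

  # Standardize the predictor so distances are on a comparable scale
  hours_recipe <- recipe(played_hours ~ age, data = data_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

  # KNN regression model with the number of neighbours left to tune
  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

  knn_workflow <- workflow() |>
    add_recipe(hours_recipe) |>
    add_model(knn_spec)

  # Cross-validate each candidate k on the training data only
  tuned <- knn_workflow |>
    tune_grid(resamples = vfold_cv(data_train, v = folds),
              grid = tibble(neighbors = k_values))

  cv_results <- tuned |>
    collect_metrics() |>
    filter(.metric == "rmse")

  # Pick the k with the lowest cross-validated RMSE and refit on all training data
  best <- select_best(tuned, metric = "rmse")
  final_fit <- knn_workflow |>
    finalize_workflow(best) |>
    fit(data = data_train)

  # Evaluate once on the held-out test set
  test_metrics <- final_fit |>
    predict(data_test) |>
    bind_cols(data_test) |>
    metrics(truth = played_hours, estimate = .pred)

  list(best_k = best$neighbors, cv_results = cv_results, test_metrics = test_metrics)
}
```

With the `players` data frame loaded above, this could be called as `knn_results <- fit_knn_hours(players)`, after which `knn_results$test_metrics` holds the test-set RMSE. Tuning `k` by cross-validation on the training set only keeps the test set untouched until the final evaluation.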