# Prediction Accuracy of Gender, Age, and Hours Played for Game-Related Newsletter Subscription

## Introduction

* This project uses a dataset about how people play a video game. We measure how accurately different player characteristics and behaviors predict an outcome, with the goal of identifying which ones are most predictive. The most predictive ones can then be used to predict whether a player will subscribe to a game-related newsletter.

* Using the dataset provided by a research group in Computer Science at UBC, we analyze three characteristics and behaviors. The main research question of this project is: "Among gender, age, and hours played, which factor has the highest prediction accuracy for subscribing to a game-related newsletter?"
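
The research question relies on prediction accuracy. Assuming this means the usual classification accuracy (the share of predictions that match the true label), a minimal sketch with made-up labels looks like this. The `truth` and `predicted` vectors are hypothetical and are not taken from the dataset:

```r
# Hypothetical true and predicted subscription labels for five players
truth <- factor(c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE"),
                levels = c("FALSE", "TRUE"))
predicted <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE"),
                    levels = c("FALSE", "TRUE"))

# Accuracy = share of players whose predicted label matches the true label
accuracy <- mean(predicted == truth)
accuracy
```

Comparing this single number across models that each use one predictor is one way to rank gender, age, and hours played.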

### Data information
  - The data was collected by a research group in Computer Science at UBC, led by Frank Wood, and describes how people played the game.
  - This dataset includes seven variables: experience, subscribe, hashedEmail, played_hours, name, gender, and Age. It has 196 observations.
  - Quantitative variables: played_hours, Age. Qualitative variables: experience, subscribe, hashedEmail, name, gender.
  - The meaning of the variables:
    1. experience: The experience level of each player, one of five categories: "Pro", "Veteran", "Amateur", "Beginner", and "Regular".
    2. subscribe: Whether the player subscribes to the game-related newsletter. TRUE means they subscribe; otherwise it is FALSE.
    3. hashedEmail: The email address of each player, in hashed form.
    4. played_hours: The total time (in hours) that each player has played.
    5. name: The name of each player.
    6. gender: The gender of each player.
    7. Age: The age of each player.
  - After cleaning, the mean of played_hours is 6.24 and the mean of Age is 20.47.
  - Some Age values are missing, and some players answered "Prefer not to say" for gender, likely because these are private questions. We drop these rows before performing the analysis.
  - Since this project only analyzes how hours played, gender, and age predict subscription, we remove the columns experience, hashedEmail, and name to tidy the data. We also convert subscribe and gender into factors.

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
# Show at most 6 rows when a data frame is printed in the notebook
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.6     ✔ rsample

In [2]:
player <- read_csv("data/players.csv")
player

Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): experience, hashedEmail, name, gender
dbl (2): played_hours, Age
lgl (1): subscribe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| Pro | TRUE | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| Veteran | TRUE | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| Veteran | FALSE | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Amateur | FALSE | d572f391d452b76ea2d7e5e53a3d38bfd7499c7399db299bd4fedb06a46ad5bb | 0.0 | Dylan | Prefer not to say | 17 |
| Amateur | FALSE | f19e136ddde68f365afc860c725ccff54307dedd13968e896a9f890c40aea436 | 2.3 | Harlow | Male | 17 |
| Pro | TRUE | d9473710057f7d42f36570f0be83817a4eea614029ff90cf50d8889cdd729d11 | 0.2 | Ahmed | Other | NA |
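
Before dropping rows, it can help to check which columns actually contain missing values and which gender categories occur. The following is a minimal sketch, assuming only the six preview rows printed above, stored in a hypothetical `player_preview` tibble with `hashedEmail` and `name` left out. On the full data the same calls would be applied to `player`:

```r
library(tidyverse)

# The six preview rows printed above (not the full 196-row dataset)
player_preview <- tibble(
  experience = c("Pro", "Veteran", "Veteran", "Amateur", "Amateur", "Pro"),
  subscribe = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.0, 2.3, 0.2),
  gender = c("Male", "Male", "Male", "Prefer not to say", "Male", "Other"),
  Age = c(9, 17, 17, 17, 17, NA)
)

# Count the missing values in every column
na_counts <- player_preview |>
  summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

# Count how often each gender category appears
gender_counts <- player_preview |>
  count(gender)
gender_counts
```

In this preview the only `NA` is in `Age`, while the non-response for gender appears as the explicit category "Prefer not to say" rather than as a missing value.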


In [3]:
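# Keep only the response and the three predictors, then clean the rows
player <- player |>
  select(-hashedEmail, -name, -experience) |>
  drop_na(subscribe, played_hours, Age) |>
  # Treat the response and gender as categorical variables
  mutate(subscribe = as_factor(subscribe), gender = as_factor(gender)) |>
  filter(gender != "Prefer not to say")
player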

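| subscribe | played_hours | gender | Age |
|---|---|---|---|
| `<fct>` | `<dbl>` | `<fct>` | `<dbl>` |
| TRUE | 30.3 | Male | 9 |
| TRUE | 3.8 | Male | 17 |
| FALSE | 0.0 | Male | 17 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| TRUE | 0.0 | Female | 17 |
| FALSE | 0.3 | Male | 22 |
| FALSE | 2.3 | Male | 17 |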


In [4]:
# Mean hours played and mean age across the cleaned data
summary <- player |>
  summarize(played_hours = mean(played_hours), Age = mean(Age))
summary

| played_hours | Age |
|---|---|
| `<dbl>` | `<dbl>` |
| 6.237158 | 20.46995 |
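
Since `subscribe` is the response, a natural next look is how the numeric predictors differ between subscribers and non-subscribers. The sketch below is one possible way to do this. It uses only the six cleaned rows printed after `In [3]`, stored in a hypothetical `player_clean_preview` tibble, so its numbers are illustrative and are not results for the full dataset:

```r
library(tidyverse)

# Six cleaned rows shown in the preview above (not the full dataset)
player_clean_preview <- tibble(
  subscribe = factor(c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)),
  played_hours = c(30.3, 3.8, 0.0, 0.0, 0.3, 2.3),
  gender = factor(c("Male", "Male", "Male", "Female", "Male", "Male")),
  Age = c(9, 17, 17, 17, 22, 17)
)

# Number of players and mean of each numeric predictor per subscription status
by_subscribe <- player_clean_preview |>
  group_by(subscribe) |>
  summarize(
    n = n(),
    mean_played_hours = mean(played_hours),
    mean_Age = mean(Age)
  )
by_subscribe
```

Applying the same `group_by()` and `summarize()` calls to the cleaned `player` data would give the per-group means for all remaining observations.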
