# Individual Planning Report
GitHub repository link: https://github.com/MrSaltyPotatoes/UBC-DSCI-100-Project

## Import Data

In [1]:
library(tidyverse)
url_players <- "https://drive.google.com/uc?export=download&id=1UVsY6J_v6s_gCkQRWUPVRkiD4aBypnRj"
players <- read_csv(url_players)
url_sessions <- "https://drive.google.com/uc?export=download&id=1i4i4CRxh8ouTNllQvUkaKF8iRxQgUqbM"
sessions <- read_csv(url_sessions)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 196 Columns: 7
── Column specification ────────────────────────────────────────────────────────


## Data Description

The dataset **players.csv** provides information about players' personal identities, gaming experience and playtime. The dataset contains 196 observations and 7 variables.  
The variables are:
+ `experience`: a *character* variable representing the level of gaming experience of a player. It contains 5 categories: `Pro`,`Veteran`,`Amateur`,`Regular`,`Beginner`.
+ `subscribe`: a *logical* variable representing whether the player subscribes to a game-related newsletter. It contains 2 values: `TRUE` (subscribed), `FALSE` (not subscribed).
+ `hashedEmail`: a *character* variable representing a coded version of each player's email address.
+ `played_hours`: a *double* variable representing the number of hours each player spent playing the game. It ranges from a minimum of 0.00 hours to a maximum of 223.10 hours, with a mean of 5.85 hours and a median of 0.10 hours.
+ `name`: a *character* variable representing the name of each player.
+ `gender`: a *character* variable representing the gender of each player. It contains 7 categories: `Male`,`Female`,`Non-binary`,`Prefer not to say`,`Agender`,`Two-Spirited`,`Other`.
+ `Age`: a *double* variable representing the age of each player. It ranges from a minimum of 9.00 years old to a maximum of 58.00 years old, with a mean of 21.14 years old and a median of 19.00 years old (computed with missing values excluded via `na.rm = TRUE`).

**Potential Issues**:
+ The variables `hashedEmail` and `name` are identifiers rather than player characteristics, so they do not provide useful information for analysis.
+ If we want to use the variable `experience` as a predictor, we have to process it first, since it is a *character* variable.
+ In the `gender` variable, the `Agender` (2 players), `Other` (1 player) and `Two-Spirited` (6 players) categories are very small compared with the other categories, which may affect the results of the analysis.
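One possible way to handle the last two issues is sketched below on a small made-up tibble (`toy_players` is hypothetical, not the real data). The ordering of the `experience` levels and the threshold of 2 for rare `gender` categories are assumptions chosen only for illustration.

```r
library(tidyverse)

# Hypothetical toy data standing in for players.csv
toy_players <- tibble(
  experience = c("Pro", "Beginner", "Veteran", "Amateur", "Regular", "Beginner"),
  gender = c("Male", "Male", "Female", "Agender", "Male", "Female")
)

prepared <- toy_players |>
  mutate(
    # Convert experience to a factor; this level order is an assumption
    experience = factor(experience,
                        levels = c("Beginner", "Amateur", "Regular", "Veteran", "Pro")),
    # Merge gender categories with fewer than 2 observations into "Other"
    gender = fct_lump_min(gender, min = 2, other_level = "Other")
  )

levels(prepared$gender)
```

Whether merging small categories is appropriate depends on the question; it reduces noise from tiny groups but also hides any differences between them.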

The dataset **sessions.csv** provides information about the times when a player starts and ends a session. The dataset contains 1535 observations and 5 variables.  
The variables are:
+ `hashedEmail`: a *character* variable representing a coded version of each player's email address.
+ `start_time`: a *character* variable representing the time when a player starts a session in a human-readable format.
+ `end_time`: a *character* variable representing the time when a player ends a session in a human-readable format.
+ `original_start_time`: a *double* variable representing the time when a player starts a session in a machine-readable format.
+ `original_end_time`: a *double* variable representing the time when a player ends a session in a machine-readable format.

**Potential Issue**:
+ The variable `hashedEmail` does not provide useful information for analysis.
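Because `start_time` and `end_time` are stored as *character* values, they would need to be parsed before any time-based calculation. The sketch below uses made-up rows and assumes a `dd/mm/yyyy HH:MM` format; the real format in `sessions.csv` should be checked first, and `toy_sessions` and `duration_min` are hypothetical names.

```r
library(tidyverse)  # tidyverse 2.0.0 also attaches lubridate

# Hypothetical toy data; the timestamp format is an assumption
toy_sessions <- tibble(
  hashedEmail = c("a1", "b2"),
  start_time = c("30/06/2024 18:12", "17/06/2024 23:33"),
  end_time = c("30/06/2024 18:24", "18/06/2024 00:58")
)

toy_sessions <- toy_sessions |>
  mutate(
    # Parse day/month/year hour:minute strings into date-times
    start_time = dmy_hm(start_time),
    end_time = dmy_hm(end_time),
    # Session length in minutes
    duration_min = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

toy_sessions
```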

In [2]:
colnames_players <- colnames(players)
colnames_sessions <- colnames(sessions)
observation_players <- nrow(players)
observation_sessions <- nrow(sessions)
columns_players <- ncol(players)
columns_sessions <- ncol(sessions)
colnames_players
colnames_sessions
observation_players
columns_players
observation_sessions
columns_sessions

In [22]:
unique(players$experience)
unique(players$subscribe)
unique(players$gender)
players |>
  summarize(max(played_hours), min(played_hours), mean(played_hours), median(played_hours),
            max(Age, na.rm = TRUE), min(Age, na.rm = TRUE), mean(Age, na.rm = TRUE), median(Age, na.rm = TRUE))
players |>
  group_by(gender) |>
  summarize(count = n())

max(played_hours),min(played_hours),mean(played_hours),median(played_hours),"max(Age, na.rm = TRUE)","min(Age, na.rm = TRUE)","mean(Age, na.rm = TRUE)","median(Age, na.rm = TRUE)"
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
223.1,0,5.845918,0.1,58,9,21.13918,19


gender,count
<chr>,<int>
Agender,2
Female,37
Male,124
Non-binary,15
Other,1
Prefer not to say,11
Two-Spirited,6


## Questions

**Broad Question**:
**Question 1**: What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?  
**Specific Question**: Can factors such as `experience`, `played_hours`, `gender` and `Age` predict whether a player has subscribed to a game-related newsletter in the `players.csv` dataset?  

The dataset `players.csv` contains the explanatory variables (`experience`, `played_hours`, `gender`, `Age`) and the response variable (`subscribe`) that help to answer the question of interest. To prepare for data analysis, I will first convert the variables `experience` and `gender` into factor or numerical variables, so they can be used to predict the response variable with the K-nearest neighbors (KNN) classification algorithm. In addition, I will remove the `hashedEmail` and `name` columns from the dataset, as they do not provide meaningful information for the question. Last, I will scale and center the predictors so that variables with larger ranges do not dominate the distance calculations that KNN relies on.
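
The plan above could be sketched roughly as follows with `tidymodels`. This is a minimal sketch rather than the final analysis: it runs on randomly generated toy data (`toy_players` is hypothetical), encodes the categorical predictors as dummy variables (one option among several), uses an arbitrary `neighbors = 5`, and assumes the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical toy data with the same column names as players.csv
set.seed(1)
toy_players <- tibble(
  experience = sample(c("Pro", "Veteran", "Amateur", "Regular", "Beginner"), 60, replace = TRUE),
  subscribe = sample(c(TRUE, FALSE), 60, replace = TRUE),
  hashedEmail = as.character(1:60),
  played_hours = round(rexp(60, rate = 0.2), 1),
  name = paste0("player", 1:60),
  gender = sample(c("Male", "Female", "Non-binary"), 60, replace = TRUE),
  Age = sample(9:58, 60, replace = TRUE)
)

# Drop identifier columns, remove rows with missing Age, convert to factors
players_clean <- toy_players |>
  select(-hashedEmail, -name) |>
  drop_na(Age) |>
  mutate(subscribe = factor(subscribe),
         experience = factor(experience),
         gender = factor(gender))

# Dummy-encode categorical predictors, then scale and center all predictors
players_recipe <- recipe(subscribe ~ experience + played_hours + gender + Age,
                         data = players_clean) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN classifier; K = 5 is a placeholder, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(players_recipe) |>
  add_model(knn_spec) |>
  fit(data = players_clean)

predictions <- predict(knn_fit, players_clean)
```

In the actual analysis, the data would be split into training and test sets and K would be chosen by cross-validation rather than fixed in advance.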