In [1]:
players <- read.csv("data/players.csv")

## (1) Data Description

The `players.csv` file contains a list of **196 unique users** and **7 features** describing each user.  
Those features include:
 
- `name` *(chr)* - Player's name.
  
- `Age` *(int)* - The player’s age in years, ranging from **9 to 58 years old**, with **median** age around 19.

- `gender` *(chr)* - The player’s gender. This variable contains multiple categories to represent diverse gender identities.

- `experience` *(chr)* - Indicates a player's experience level. Possible levels include:  
  `Beginner`, `Amateur`, `Regular`, `Pro`, and `Veteran`.

- `played_hours` *(dbl)* - Total hours the player has played. **Hours range from** 0 to 223.1, with a **mean** of 5.85 and a **median** of 0.1.

- `subscribe` *(lgl)* - Indicates whether the player subscribed to a game-related newsletter (`TRUE` or `FALSE`). Out of 196 players, **144 (≈73.5%) subscribed**, and **52 (≈26.5%) did not**.

- `hashedEmail` *(chr)* - Player’s email stored as a hashed value for privacy. 
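
The summary statistics quoted in the list above (ranges, means, medians, and subscription counts) can be obtained with a few base R calls. The sketch below is only an illustration: `toy_players` is a hypothetical five-row stand-in with made-up values, not the real `players.csv`, so its numbers will not match the ones reported above.

```r
# Hypothetical stand-in for players.csv; every value here is made up.
toy_players <- data.frame(
  Age          = c(9, 17, 19, 21, 58),
  played_hours = c(0, 0.1, 0.1, 1.5, 30.3),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Range and median of age
age_range  <- range(toy_players$Age, na.rm = TRUE)
age_median <- median(toy_players$Age, na.rm = TRUE)

# Mean and median of hours played;
# a mean far above the median points to a right-skewed distribution
hours_mean   <- mean(toy_players$played_hours, na.rm = TRUE)
hours_median <- median(toy_players$played_hours, na.rm = TRUE)

# Counts and proportions of subscribers vs. non-subscribers
subscribe_counts <- table(toy_players$subscribe)
subscribe_props  <- prop.table(subscribe_counts)
```

Applying the same calls to the real `players` data frame should reproduce the figures listed above, assuming the column names match those in the file.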

## (2) Questions

The broad question posed is: "We would like to know which 'kinds' of players are most likely to contribute a large amount of data so that we can target those players in our recruiting efforts." **To answer this broad question**, the specific question formulated is: Do `gender`, `experience`, and `subscribe` significantly predict whether a player is among the top 20% of players in terms of total hours spent playing the game (`played_hours`) in `players.csv`?

Since our response variable indicates whether a player is among the **top 20%** of players by total hours played, we first compute the **80th percentile** of `played_hours`, then use `mutate()` on the `players` dataset to create a new binary variable, `top_hours`. Because `top_hours` is binary, it serves as the response variable for a potential KNN classification model.
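
To make the thresholding step concrete, here is a minimal sketch on a hypothetical vector of ten playing times (the values are invented for illustration and are not drawn from `players.csv`). It assumes R's default quantile method (type 7) and stores the flag as a factor, which is the form a classification model would typically expect.

```r
# Hypothetical hours for ten players (not taken from players.csv)
hours <- c(0, 0, 0.1, 0.1, 0.5, 1, 2, 5, 30, 100)

# 80th percentile using R's default quantile method (type 7)
cutoff <- quantile(hours, 0.8, na.rm = TRUE)

# Flag players at or above the cutoff; a factor suits a classification response
top_hours <- factor(hours >= cutoff, levels = c(FALSE, TRUE))

# Share of players flagged as top players
share_top <- mean(hours >= cutoff)
```

In this toy vector exactly two of the ten players fall at or above the cutoff. With real data that has many tied values at the cutoff, the `>=` comparison can flag somewhat more than 20% of players, so it is worth checking the resulting class balance.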


 


## (3) Exploratory Data Analysis and Visualization

In [35]:
library(tidyverse)
options(repr.matrix.max.rows = 10)
options(repr.matrix.max.columns = 10)

# Wrangling: 
# (1) Any rows containing missing values need to be removed to ensure the dataset is clean
players_0 <- read.csv("data/players.csv") |> drop_na()
# (2) Selecting Relevant Variables 
players_0 <- players_0 |> select(experience, Age, gender, subscribe, played_hours)
# (3) Creating new response variable top_hours
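# The threshold is computed on the cleaned data so this cell does not depend on `players` from an earlier cell
top20Hour <- quantile(players_0$played_hours, 0.8, na.rm = TRUE)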
players_0 <- players_0 |> mutate(top_hours = played_hours >= top20Hour)



