# 1. Data Description

## 1.1 Dataset Overview
This project uses the file **players.csv**, which contains one row per unique player and several demographic and gameplay-related features.

- **Number of observations:** 196
- **Number of variables:** 7

## 1.2 Variable Summary Table

| Variable       | Type      | Description | Missing | Unique Values |
|----------------|-----------|-------------|---------|---------------|
| experience     | categorical | Player’s self-reported gaming experience | 0 | 5 |
| subscribe      | boolean | Whether the player subscribed to the newsletter | 0 | 2 |
| hashedEmail    | identifier | Unique hashed player ID | 0 | 196 |
| played_hours   | numeric | Total hours played on the server | 0 | 43 |
| name           | text | Player name | 0 | 196 |
| gender         | categorical | Gender identity | 0 | 7 |
| Age            | numeric | Player age in years | 2 | 32 |
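
The counts in this table could be reproduced along the following lines. This is a minimal base R sketch, not necessarily how the table was produced. `players_demo` is a stand-in built from the six preview rows in Section 3.1, with the last `Age` set to `NA` on purpose so the missing-value count is non-zero. Whether the "Unique Values" column counts `NA` as a value is not stated, so the sketch assumes it does not.

```r
# Stand-in for players.csv: the six preview rows from Section 3.1,
# with the last Age deliberately replaced by NA for illustration.
players_demo <- data.frame(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  gender       = c("Male", "Male", "Male", "Female", "Male", "Female"),
  Age          = c(9L, 17L, 17L, 21L, 21L, NA)
)

# One row per variable: storage type, number of NAs,
# and number of distinct non-missing values (assumption: NA is not counted).
variable_summary <- data.frame(
  variable = names(players_demo),
  type     = vapply(players_demo, function(col) class(col)[1], character(1)),
  missing  = vapply(players_demo, function(col) sum(is.na(col)), integer(1)),
  unique   = vapply(players_demo, function(col) length(unique(na.omit(col))), integer(1)),
  row.names = NULL
)

print(variable_summary)
```

Passing the full `players` data frame instead of `players_demo` should give the per-variable counts for the real file.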


## 1.3 Numeric Summary Statistics
*(Report values rounded to 2 decimals.)*

| Variable     | Count | Mean  | Std Dev | Min  | Median | Max    |
|--------------|-------|-------|---------|------|--------|--------|
| played_hours | 196   | 5.85  | 28.36   | 0.00 | 0.10   | 223.10 |
| Age          | 194   | 21.14 | 7.39    | 9.00 | 19.00  | 58.00  |
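
A summary of this shape can be computed with a small helper. The sketch below uses base R only. The helper name `summarise_numeric` is hypothetical, and the input is again the six preview rows from Section 3.1 with one `Age` set to `NA` for illustration, so the printed numbers will not match the table above.

```r
# Stand-in data: preview rows from Section 3.1, last Age set to NA on purpose.
players_demo <- data.frame(
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  Age          = c(9, 17, 17, 21, 21, NA)
)

# Hypothetical helper: drop missing values, then report the six statistics
# used in the table, rounded to 2 decimals.
summarise_numeric <- function(x) {
  x <- x[!is.na(x)]
  round(c(count = length(x), mean = mean(x), sd = sd(x),
          min = min(x), median = median(x), max = max(x)), 2)
}

# sapply() returns one column per variable; t() puts variables in rows.
numeric_summary <- t(sapply(players_demo, summarise_numeric))
print(numeric_summary)
```

Note that `count` is the number of non-missing values, which is why `Age` has a lower count than `played_hours` in the table.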

## 1.4 Notes and Potential Data Issues
- `Age` contains 2 missing values.
- `played_hours` is heavily right-skewed with a few large outliers. Most players have zero or very little play time.
- There may be volunteer/self-selection bias in the dataset.
- `name` contains personal information and should be excluded from all analysis.
- Small category counts in `gender` may require grouping.
- Subscription classes are imbalanced (about 73% subscribed).
- `experience` appears to be ordinal, and this ordering should be preserved when it is encoded.
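
Several of these notes translate directly into wrangling steps. The sketch below shows one possible way to handle them in base R, under stated assumptions: the toy values are made up, the `"Non-binary"` category and the `"Other"` label are hypothetical, the helper `lump_rare` is hypothetical, and the experience level names and their order are a guess (only four of the five levels appear in the data preview).

```r
# Hypothetical toy values, not taken from players.csv.
demo <- data.frame(
  experience   = c("Pro", "Veteran", "Veteran", "Amateur", "Regular", "Amateur"),
  gender       = c("Male", "Male", "Male", "Female", "Female", "Non-binary"),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# ASSUMPTION: these level names and this order are a guess; check them
# against the real data before use.
experience_levels <- c("Beginner", "Amateur", "Regular", "Veteran", "Pro")
demo$experience <- factor(demo$experience, levels = experience_levels, ordered = TRUE)

# Hypothetical helper: collapse categories with fewer than `min_count` rows
# into a single `other` level.
lump_rare <- function(x, min_count, other = "Other") {
  counts <- table(x)
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- other
  factor(x)
}
demo$gender_grouped <- lump_rare(demo$gender, min_count = 2)

# log1p(x) = log(1 + x) compresses the long right tail and is defined at 0 hours.
demo$log_played_hours <- log1p(demo$played_hours)

# Share of each subscription class, to quantify the imbalance.
subscribe_share <- prop.table(table(demo$subscribe))

print(demo)
print(round(subscribe_share, 2))
```

The `min_count` threshold is a judgment call; a suitable value depends on how small the rare `gender` categories are in the full data.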

## 1.5 Overall summary

The **players.csv** file contains 196 unique players and 7 variables. The numeric variables are **played_hours** (total play time per player, in hours) and **Age** (in years, with two missing values). **subscribe** is a boolean indicating newsletter signup (144 True, 52 False). The categorical variables are **experience** (5 levels) and **gender** (7 categories). A programmatic check found no duplicate **hashedEmail** identifiers. **played_hours** is highly right-skewed (mean 5.85 hours, median 0.10 hours, maximum 223.10 hours). Potential limitations include class imbalance in **subscribe**, small counts in some **gender** categories, and volunteer bias inherent to the data collection method.
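
A duplicate-identifier check of the kind mentioned above can be as small as the following base R sketch. The helper name `count_duplicate_ids` and the short example IDs are hypothetical; the real identifiers are the long hashes in `hashedEmail`.

```r
# Hypothetical helper: number of entries that repeat an earlier identifier.
count_duplicate_ids <- function(ids) {
  sum(duplicated(ids))
}

# Made-up identifiers for illustration only.
unique_ids   <- c("id_a", "id_b", "id_c")
repeated_ids <- c("id_a", "id_b", "id_a")

print(count_duplicate_ids(unique_ids))    # all distinct
print(count_duplicate_ids(repeated_ids))  # "id_a" appears twice
```

On the real data, `count_duplicate_ids(players$hashedEmail)` returning 0 would be consistent with one row per unique player.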

# 2. Questions

## 2.1 Broad Question
**What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter, and how do these features differ between various player types?**

## 2.2 Specific Question
**Can player demographic attributes (`Age`, `gender`) and gameplay characteristics (`experience`, `played_hours`) predict whether a player subscribes to the game-related newsletter?**

This question is appropriate because `players.csv` contains all required explanatory variables and a clear binary response variable (`subscribe`), which makes this a standard classification problem. To apply the models seen in class, we will likely need to scale the numeric variables, impute the missing values in `Age`, and encode the categorical variables numerically.
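
The three preprocessing steps named above could look roughly like this in base R. This is a sketch under assumptions, not the project's final pipeline: median imputation is one possible choice (the imputation method is not specified here), the data are the six preview rows from Section 3.1 with one `Age` set to `NA` for illustration, and the derived column names are hypothetical.

```r
# Stand-in data: preview rows from Section 3.1, last Age set to NA on purpose.
demo <- data.frame(
  Age          = c(9, 17, 17, 21, 21, NA),
  played_hours = c(30.3, 3.8, 0.0, 0.7, 0.1, 0.0),
  gender       = factor(c("Male", "Male", "Male", "Female", "Male", "Female")),
  subscribe    = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# 1. Impute: fill missing Age with the median of the observed ages
#    (assumption: median imputation; other strategies are possible).
age_median <- median(demo$Age, na.rm = TRUE)
demo$Age[is.na(demo$Age)] <- age_median

# 2. Scale: centre each numeric predictor to mean 0 and standard deviation 1.
demo$Age_scaled <- as.numeric(scale(demo$Age))
demo$played_hours_scaled <- as.numeric(scale(demo$played_hours))

# 3. Encode: expand the factor into 0/1 indicator columns,
#    dropping the intercept column that model.matrix() adds.
gender_dummies <- model.matrix(~ gender, data = demo)[, -1, drop = FALSE]

predictors <- cbind(demo[, c("Age_scaled", "played_hours_scaled")], gender_dummies)
print(predictors)
```

In a train/test workflow, the median and the scaling parameters would normally be computed on the training split only and then applied to the test split, so that no information leaks from the test data.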

# 3. Exploratory Data Analysis (EDA)
## 3.1 Load Data

```r
players <- read.csv("data/players.csv")
head(players)
```

|   | experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|------------|-----------|-------------|--------------|------|--------|-----|
|   | `<chr>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<int>` |
| 1 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| 2 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| 3 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| 4 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| 5 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| 6 | Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


## 3.2 Basic Data Wrangling

```r
library(tidyverse)

# Convert columns to the types used in the analysis:
# subscribe as logical, gender and experience as factors.
players_tidy <- players |>
  mutate(
    subscribe = as.logical(subscribe),
    gender = as.factor(gender),
    experience = as.factor(experience)
  )

head(players_tidy)
```

|   | experience | subscribe | hashedEmail | played_hours | name | gender | Age |
|---|------------|-----------|-------------|--------------|------|--------|-----|
|   | `<fct>` | `<lgl>` | `<chr>` | `<dbl>` | `<chr>` | `<fct>` | `<int>` |
| 1 | Pro | True | f6daba428a5e19a3d47574858c13550499be23603422e6a0ee9728f8b53e192d | 30.3 | Morgan | Male | 9 |
| 2 | Veteran | True | f3c813577c458ba0dfef80996f8f32c93b6e8af1fa939732842f2312358a88e9 | 3.8 | Christian | Male | 17 |
| 3 | Veteran | False | b674dd7ee0d24096d1c019615ce4d12b20fcbff12d79d3c5a9d2118eb7ccbb28 | 0.0 | Blake | Male | 17 |
| 4 | Amateur | True | 23fe711e0e3b77f1da7aa221ab1192afe21648d47d2b4fa7a5a659ff443a0eb5 | 0.7 | Flora | Female | 21 |
| 5 | Regular | True | 7dc01f10bf20671ecfccdac23812b1b415acd42c2147cb0af4d48fcce2420f3e | 0.1 | Kylie | Male | 21 |
| 6 | Amateur | True | f58aad5996a435f16b0284a3b267f973f9af99e7a89bee0430055a44fa92f977 | 0.0 | Adrian | Female | 17 |


Note: We did not convert `name` to a factor because it will likely be dropped before modelling.

```r
# Mean of each numeric variable, ignoring missing values.
players_tidy |>
  summarize(
    mean_played_hours = mean(played_hours, na.rm = TRUE),
    mean_age = mean(Age, na.rm = TRUE)
  )
```