-> title: Individual Planning Report — Data Science Project

-> author: Khush Shah

-> student_id: 39772439

Project: Predicting Usage of a Video Game Research Server  
**Question 1:** What player characteristics and behaviours are most predictive of subscribing to a game-related newsletter?


1. Data Description

This project uses two datasets obtained from a UBC research Minecraft server. 

1. **players.csv** - contains information about each unique player, including demographic and behavioral data.
2. **sessions.csv** - contains information about individual play sessions for each player.

Together, these datasets describe player behavior and can be used to predict engagement outcomes such as newsletter subscription.

**Players Dataset Summary**

This dataset has 7 columns and 196 rows as determined by the code cell below. 
The columns are described and summarized below. 

| Column Name            | Data Type         | Description                                                         |
|-----------------------|------------------|--------------------------------------------------------------------|
| player_id             | Categorical (ID) | Unique identifier for each player                                  |
| age                   | Numeric          | Age of the player in years                                         |
| country               | Categorical      | Country of origin of the player                                    |
| gender                | Categorical      | Gender of the player (if provided)                                 |
| total_playtime        | Numeric          | Total hours spent playing on the server                            |
| num_sessions          | Numeric          | Total number of sessions recorded for that player                  |
| newsletter_subscribed | Binary (0/1)     | 1 if player subscribed to the game-related newsletter, 0 otherwise |

**Sessions Dataset Summary**

This dataset has 5 columns and 1535 rows as determined by the code cell below. 
The columns are described and summarized below. 

| Column Name      | Data Type        | Description                     |
|------------------|------------------|---------------------------------|
| session_id       | Categorical (ID) | Unique ID for each game session |
| player_id        | Categorical      | ID that links to players.csv    |
| session_duration | Numeric          | Length of play session in hours |
| start_time       | DateTime         | Start time of the session       |
| end_time         | DateTime         | End time of the session         |
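
Because `player_id` links the two files, session records can be rolled up to one row per player and attached to the players table. The sketch below illustrates that idea on toy data frames in base R. The values are invented, the names `toy_players`, `toy_sessions`, `hours_from_sessions`, and `n_sessions` are hypothetical, and the column names are assumed to match the tables above.

```r
# Toy stand-ins for players.csv and sessions.csv (invented values).
toy_players <- data.frame(
  player_id = c("p1", "p2", "p3"),
  age = c(17, 21, 30),
  stringsAsFactors = FALSE
)
toy_sessions <- data.frame(
  session_id = c("s1", "s2", "s3"),
  player_id = c("p1", "p1", "p2"),
  session_duration = c(1.5, 0.5, 2.0),
  stringsAsFactors = FALSE
)

# Summarise sessions to one row per player: total hours and session count.
hours <- tapply(toy_sessions$session_duration, toy_sessions$player_id, sum)
counts <- tapply(toy_sessions$session_id, toy_sessions$player_id, length)
per_player <- data.frame(
  player_id = names(hours),
  hours_from_sessions = as.numeric(hours),
  n_sessions = as.integer(counts),
  stringsAsFactors = FALSE
)

# Left join: keep every player, including those with no recorded sessions.
players_with_sessions <- merge(toy_players, per_player,
                               by = "player_id", all.x = TRUE)

# Players without sessions come back as NA; treat that as zero activity.
no_sessions <- is.na(players_with_sessions$n_sessions)
players_with_sessions$n_sessions[no_sessions] <- 0L
players_with_sessions$hours_from_sessions[no_sessions] <- 0

print(players_with_sessions)
```

Keeping all players in the join (`all.x = TRUE`) matters here, since players with no sessions would otherwise silently drop out of the analysis.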

**Data Summary Calculated**

The table shows summary statistics for the numeric variables in players.csv, calculated with the code in the cell below and rounded to 2 decimal places.

| played_hours_mean | played_hours_sd | played_hours_min | played_hours_max | Age_mean | Age_sd | Age_min | Age_max |
|-------------------|-----------------|------------------|------------------|----------|--------|---------|---------|
| 5.85              | 28.36           | 0                | 223.1            | 21.14    | 7.39   | 9       | 58      |

**Data issues observed (directly visible)**

1. **Missing values:** Some numeric variables (like "age") contain missing values ("NA"). These will require imputation or explicit handling.
2. **Outliers:** The maximum values of "total_playtime" and "num_sessions" are far larger than their means (for example, a maximum of 223.1 played hours against a mean of 5.85). These extreme values are likely outliers and could distort a model.
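
Both issues can be checked with a few lines of base R. The sketch below uses a toy data frame with invented values, and the helper `flag_high_outliers` is a hypothetical name. It applies Tukey's upper fence (Q3 + 1.5 × IQR), which is one common rule of thumb and not necessarily the rule this project will adopt.

```r
# Toy stand-in for players.csv; values are invented for illustration only.
toy_players <- data.frame(
  age = c(17, 21, NA, 25, 19, 22, NA, 58),
  total_playtime = c(0, 0.5, 1.2, 0.3, 2.0, 0.1, 0.7, 223.1)
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy_players))

# Flag values above Tukey's upper fence: Q3 + 1.5 * IQR.
flag_high_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
  !is.na(x) & x > upper_fence
}

playtime_outliers <- flag_high_outliers(toy_players$total_playtime)

print(na_counts)
print(toy_players$total_playtime[playtime_outliers])
```

Flagging a value is only a prompt to inspect it. A very long playtime may be genuine, so dropping such rows should be a deliberate, documented choice.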

**Potential issues not directly visible (hidden concerns)**

1. **Sampling bias:** The dataset only contains players who logged into the server, so it may not represent the entire intended population. For example, people who tried the server once and never returned are likely underrepresented.
2. **Measurement error:** "total_playtime" may not measure active play consistently. It could be aggregated in a way that differs from the per-session durations, or it may include idle time if the server doesn't strictly record active play.
3. **Time zone and timestamp consistency:** If session timestamps are in different time zones or recorded inconsistently, derived features (like first-week activity) could be miscomputed later.
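
The timestamp concern can be tested directly once the format is known. The sketch below is a base R illustration on invented rows. The `"%Y-%m-%d %H:%M"` format and the UTC time zone are assumptions, since the actual format of `start_time` and `end_time` in sessions.csv has not been confirmed here, and `parse_utc` is a hypothetical helper.

```r
# Hypothetical raw timestamps; the real format in sessions.csv may differ.
toy_sessions <- data.frame(
  session_id = c("s1", "s2", "s3"),
  start_time = c("2024-05-01 18:00", "2024-05-02 09:30", "2024-05-03 22:15"),
  end_time   = c("2024-05-01 19:30", "2024-05-02 09:10", "2024-05-04 00:45"),
  stringsAsFactors = FALSE
)

# Parse with an explicit time zone so results do not depend on the machine's settings.
parse_utc <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M", tz = "UTC")
start <- parse_utc(toy_sessions$start_time)
end   <- parse_utc(toy_sessions$end_time)

# Duration in hours, recomputed from the parsed timestamps.
duration_hours <- as.numeric(difftime(end, start, units = "hours"))

# Sessions that fail to parse or end before they start need a closer look.
suspect <- is.na(duration_hours) | duration_hours < 0

print(toy_sessions$session_id[suspect])
```

Comparing the recomputed durations against the recorded `session_duration` column would be a further consistency check.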

**Data collection notes**

1. Data were collected from server logs and registration records for the research Minecraft server.
2. Demographic data (age, gender) likely come from optional registration fields or post-registration surveys; the accuracy depends on self-reporting.

**Conclusion:** The datasets are adequate for an initial descriptive analysis and for modeling a binary outcome (newsletter subscription). However, we must be careful with missing values, outliers, and potential bias when interpreting predictive results.

In [7]:
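library(dplyr)

# Load both datasets (paths are relative to the notebook's working directory)
players <- read.csv("project-009-45/players.csv")
sessions <- read.csv("project-009-45/sessions.csv")

# Number of columns and rows in each dataset
ncol(players)
nrow(players)

ncol(sessions)
nrow(sessions)

# Mean, SD, min and max of every numeric column in players, rounded to 2 decimals
players_summary <- players |>
  summarise(
    across(
      where(is.numeric),
      list(
        mean = ~ round(mean(.x, na.rm = TRUE), 2),
        sd   = ~ round(sd(.x, na.rm = TRUE), 2),
        min  = ~ round(min(.x, na.rm = TRUE), 2),
        max  = ~ round(max(.x, na.rm = TRUE), 2)
      ),
      .names = "{col}_{fn}"
    )
  )

# Display the summary table (the original cell computed it but never printed it)
players_summary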


2. Questions

In this section, I will:
- Identify the **broad question** from the project prompt.
- Formulate a **specific question** using one response and explanatory variables.
- Explain how the data helps address this question.


3. Exploratory Data Analysis and Visualization

In this section, I will:
- Demonstrate that data can be loaded into R.
- Perform minimal wrangling to tidy the data.
- Compute mean values for quantitative variables (players.csv).
- Create a few exploratory plots.
- Discuss key patterns or insights relevant to the question.

4. Methods and Plan

In this section, I will:
- Propose an appropriate predictive modeling method.
- Explain why this method fits the data and question.
- Discuss assumptions, limitations, and comparison strategy.
- Describe data-splitting and validation plans.
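
As a placeholder for the data-splitting plan, the sketch below shows one possible stratified train/test split in base R. It runs on an invented toy data frame. The 75/25 proportion, the seed, and the helper name `stratified_split` are assumptions made for illustration, not decisions already taken for this project.

```r
set.seed(2024)  # arbitrary seed, only to make the sketch reproducible

# Toy data: 14 subscribers and 6 non-subscribers (invented values).
toy <- data.frame(
  total_playtime = seq(0.5, 10, by = 0.5),
  newsletter_subscribed = rep(c(1, 0), times = c(14, 6))
)

# Return training row indices, sampled within each class so that
# the training and test sets keep roughly the same class ratio.
stratified_split <- function(y, prop = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  }))
  sort(unname(train_idx))
}

train_idx <- stratified_split(toy$newsletter_subscribed, prop = 0.75)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(table(train$newsletter_subscribed))
print(table(test$newsletter_subscribed))
```

Stratifying on the response matters when one class is much rarer than the other, because a plain random split could leave very few examples of that class in the test set.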

5. GitHub Repository

In this section, I will:
- Provide the final GitHub repository link.
- Summarize the commit history and workflow.