# Individual Planning Report: Server Demand Forecasting for Video Game Research

**Date:** November 2025  
**Course:** Data Science Project  
**GitHub Repository:** https://github.com/ByronLi0207/DSI

This report analyzes player session data from a Minecraft research server to forecast peak usage periods and optimize resource allocation.

## Introduction

A research group at UBC operates a Minecraft server (https://plaicraft.ai) to collect gameplay data. To provision enough server resources and software licenses, the group needs to predict when the number of simultaneous players will be highest. This analysis addresses that demand forecasting need by identifying high-traffic time windows.
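
To make "highest number of simultaneous players" concrete, the following is a minimal sketch of one way to count overlapping sessions. It uses base R only, a made-up toy table (not the study data), and a hypothetical helper named `peak_concurrency`; it assumes each session has a `start_time` and an `end_time`, and it is not necessarily the method used later in this report.

```r
# Toy sessions with invented timestamps (illustration only, not study data)
toy_sessions <- data.frame(
  start_time = as.POSIXct(c("2025-01-01 10:00", "2025-01-01 10:30",
                            "2025-01-01 11:00", "2025-01-01 12:00"), tz = "UTC"),
  end_time   = as.POSIXct(c("2025-01-01 11:10", "2025-01-01 11:30",
                            "2025-01-01 11:15", "2025-01-01 12:30"), tz = "UTC")
)

# Hypothetical helper: every start adds one active player, every end removes one.
# The running total of these +1/-1 events is the number of simultaneous players.
peak_concurrency <- function(start, end) {
  events <- data.frame(
    time  = c(start, end),
    delta = c(rep(1L, length(start)), rep(-1L, length(end)))
  )
  # Sort by time; at identical timestamps, process ends (-1) before starts (+1)
  events <- events[order(events$time, events$delta), ]
  events$active <- cumsum(events$delta)
  # Return the first moment at which the running total is largest
  events[which.max(events$active), c("time", "active")]
}

peak <- peak_concurrency(toy_sessions$start_time, toy_sessions$end_time)
print(peak)
```

In this toy example the first three sessions overlap between 11:00 and 11:10, so the peak is three simultaneous players starting at 11:00. Processing ends before starts at equal timestamps is a design choice: it treats a session that ends exactly when another begins as not overlapping.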

In [None]:
# Load required libraries
library(tidyverse)  # Data wrangling and visualization
library(readxl)     # Reading Excel files
library(lubridate)  # Date and time handling

# Set display options for better readability
options(repr.plot.width = 12, repr.plot.height = 8)
options(digits = 2)

---

## 1. Data Description

This section loads and describes the two datasets used in the demand forecasting analysis.

### 1.1 Data Loading and Initial Conversion

The data consists of two files:
- **players.xlsx**: Player profiles and characteristics
- **sessions (2).xlsx**: Individual play sessions with timestamps

For reproducibility and easier processing, we first convert these to CSV format.

In [None]:
# Read the Excel files
players_raw <- read_excel('players.xlsx')
sessions_raw <- read_excel('sessions (2).xlsx')

# Convert to CSV for reproducible analysis
write_csv(players_raw, 'players.csv')
write_csv(sessions_raw, 'sessions.csv')

cat("✓ Data files converted to CSV format\n")
cat("  - players.csv created\n")
cat("  - sessions.csv created\n")

### 1.2 Dataset Overview

In [None]:
# Load the CSV files for analysis
players_data <- read_csv('players.csv')
sessions_data <- read_csv('sessions.csv')

cat("=== PLAYERS DATASET ===", "\n")
cat("Dimensions:", nrow(players_data), "observations ×", ncol(players_data), "variables\n\n")

cat("Variable Summary:\n")
str(players_data)

cat("\n=== SESSIONS DATASET ===", "\n")
cat("Dimensions:", nrow(sessions_data), "observations ×", ncol(sessions_data), "variables\n\n")

cat("Variable Summary:\n")
str(sessions_data)

### 1.3 Variable Descriptions

This section provides detailed descriptions of all variables in both datasets.

#### Players Dataset Variables

| Variable | Type | Description |
|----------|------|-------------|
| `playerID` | Character | Unique identifier for each player |
| `age` | Numeric | Player's age in years |
| `gender` | Character | Player's gender (categorical) |
| `country` | Character | Player's country of residence |
| `experience_level` | Character | Gaming experience level (e.g., Beginner, Intermediate, Advanced) |
| `preferred_playstyle` | Character | Player's preferred gameplay approach |
| `account_creation_date` | Date/Time | Date when player account was created |

#### Sessions Dataset Variables

| Variable | Type | Description |
|----------|------|-------------|
| `sessionID` | Character | Unique identifier for each gaming session |
| `playerID` | Character | Foreign key linking to players dataset |
| `start_time` | Date/Time | Session start timestamp |
| `end_time` | Date/Time | Session end timestamp |
| `duration_minutes` | Numeric | Session length in minutes |
| `server_region` | Character | Geographic region of server |
| `game_mode` | Character | Type of gameplay mode |
| `achievements_earned` | Numeric | Number of achievements during session |
| `disconnection_reason` | Character | Reason for session termination |
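
Because the sessions table lists both timestamps and a `duration_minutes` column, the two can be cross-checked. The sketch below uses base R and an invented two-row table; it assumes `duration_minutes` is meant to equal `end_time - start_time`, which the data description does not confirm, and the one-minute tolerance is an arbitrary choice.

```r
# Toy rows with invented values (illustration only, not study data)
toy <- data.frame(
  start_time       = as.POSIXct(c("2025-01-01 10:00", "2025-01-01 14:00"), tz = "UTC"),
  end_time         = as.POSIXct(c("2025-01-01 10:45", "2025-01-01 15:30"), tz = "UTC"),
  duration_minutes = c(45, 60)
)

# Recompute the duration from the timestamps, in minutes
toy$computed_minutes <- as.numeric(
  difftime(toy$end_time, toy$start_time, units = "mins")
)

# Flag rows where recorded and recomputed durations differ by more than 1 minute
# (the tolerance is an assumption, chosen to absorb rounding)
toy$mismatch <- abs(toy$computed_minutes - toy$duration_minutes) > 1

print(toy[, c("duration_minutes", "computed_minutes", "mismatch")])
```

Here the second row is flagged, since its timestamps span 90 minutes while the recorded duration is 60. On real data, flagged rows would need inspection before deciding which column to trust.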

### 1.4 Summary Statistics

This section computes descriptive statistics for numeric variables and frequency distributions for categorical variables.

In [None]:
cat("=== PLAYERS DATASET SUMMARY STATISTICS ===\n\n")

# Numeric variables summary
cat("Numeric Variables:\n")
cat("------------------\n")
players_data %>%
  select(where(is.numeric)) %>%
  summary() %>%
  print()

cat("\nCategorical Variables:\n")
cat("----------------------\n")

# Gender distribution
if ("gender" %in% names(players_data)) {
  cat("\nGender Distribution:\n")
  players_data %>%
    count(gender) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    print()
}

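# Country distribution (top 10)
# Percentages are computed before truncating so they refer to all players,
# not just to the ten countries shown
if ("country" %in% names(players_data)) {
  cat("\nTop 10 Countries:\n")
  players_data %>%
    count(country, sort = TRUE) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    head(10) %>%
    print()
}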

# Experience level distribution
if ("experience_level" %in% names(players_data)) {
  cat("\nExperience Level Distribution:\n")
  players_data %>%
    count(experience_level) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    print()
}

# Preferred playstyle distribution
if ("preferred_playstyle" %in% names(players_data)) {
  cat("\nPreferred Playstyle Distribution:\n")
  players_data %>%
    count(preferred_playstyle) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    print()
}

In [None]:
cat("\n=== SESSIONS DATASET SUMMARY STATISTICS ===\n\n")

# Numeric variables summary
cat("Numeric Variables:\n")
cat("------------------\n")
sessions_data %>%
  select(where(is.numeric)) %>%
  summary() %>%
  print()

cat("\nCategorical Variables:\n")
cat("----------------------\n")

# Server region distribution
if ("server_region" %in% names(sessions_data)) {
  cat("\nServer Region Distribution:\n")
  sessions_data %>%
    count(server_region) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    print()
}

# Game mode distribution
if ("game_mode" %in% names(sessions_data)) {
  cat("\nGame Mode Distribution:\n")
  sessions_data %>%
    count(game_mode) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    print()
}

# Disconnection reason distribution
if ("disconnection_reason" %in% names(sessions_data)) {
  cat("\nDisconnection Reason Distribution:\n")
  sessions_data %>%
    count(disconnection_reason) %>%
    mutate(percentage = n / sum(n) * 100) %>%
    print()
}

# Session duration statistics
if ("duration_minutes" %in% names(sessions_data)) {
  cat("\nSession Duration Statistics:\n")
  cat("Mean duration:", mean(sessions_data$duration_minutes, na.rm = TRUE), "minutes\n")
  cat("Median duration:", median(sessions_data$duration_minutes, na.rm = TRUE), "minutes\n")
  cat("SD duration:", sd(sessions_data$duration_minutes, na.rm = TRUE), "minutes\n")
}

### 1.5 Data Quality Assessment

This section checks for data quality issues including missing values, duplicates, and data integrity problems.

In [None]:
cat("=== PLAYERS DATASET QUALITY ASSESSMENT ===\n\n")

# Check for missing values
cat("Missing Values by Variable:\n")
cat("---------------------------\n")
players_missing <- players_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_count") %>%
  mutate(missing_percentage = missing_count / nrow(players_data) * 100) %>%
  arrange(desc(missing_count))

print(players_missing)

cat("\nTotal missing values:", sum(players_missing$missing_count), "\n")
cat("Variables with missing values:", sum(players_missing$missing_count > 0), "\n")

# Check for duplicate player IDs
cat("\nDuplicate Check:\n")
cat("----------------\n")
if ("playerID" %in% names(players_data)) {
  duplicate_players <- players_data %>%
    count(playerID) %>%
    filter(n > 1)
  
  cat("Duplicate playerIDs found:", nrow(duplicate_players), "\n")
  if (nrow(duplicate_players) > 0) {
    print(duplicate_players)
  }
}

# Check for data anomalies
cat("\nData Integrity Checks:\n")
cat("----------------------\n")

if ("age" %in% names(players_data)) {
  cat("Age range:", min(players_data$age, na.rm = TRUE), "to", 
      max(players_data$age, na.rm = TRUE), "years\n")
  
  unusual_ages <- players_data %>%
    filter(age < 10 | age > 100)
  cat("Unusual ages (< 10 or > 100):", nrow(unusual_ages), "\n")
}

In [None]:
cat("=== SESSIONS DATASET QUALITY ASSESSMENT ===\n\n")

# Check for missing values
cat("Missing Values by Variable:\n")
cat("---------------------------\n")
sessions_missing <- sessions_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_count") %>%
  mutate(missing_percentage = missing_count / nrow(sessions_data) * 100) %>%
  arrange(desc(missing_count))

print(sessions_missing)

cat("\nTotal missing values:", sum(sessions_missing$missing_count), "\n")
cat("Variables with missing values:", sum(sessions_missing$missing_count > 0), "\n")

# Check for duplicate session IDs
cat("\nDuplicate Check:\n")
cat("----------------\n")
if ("sessionID" %in% names(sessions_data)) {
  duplicate_sessions <- sessions_data %>%
    count(sessionID) %>%
    filter(n > 1)
  
  cat("Duplicate sessionIDs found:", nrow(duplicate_sessions), "\n")
  if (nrow(duplicate_sessions) > 0) {
    print(duplicate_sessions)
  }
}

# Check for data anomalies
cat("\nData Integrity Checks:\n")
cat("----------------------\n")

# Check if end_time is after start_time
if (all(c("start_time", "end_time") %in% names(sessions_data))) {
  invalid_times <- sessions_data %>%
    filter(!is.na(start_time) & !is.na(end_time)) %>%
    filter(end_time <= start_time)
  
  cat("Sessions with end_time <= start_time:", nrow(invalid_times), "\n")
}

# Check for negative or zero durations
if ("duration_minutes" %in% names(sessions_data)) {
  invalid_durations <- sessions_data %>%
    filter(!is.na(duration_minutes)) %>%
    filter(duration_minutes <= 0)
  
  cat("Sessions with duration <= 0:", nrow(invalid_durations), "\n")
  
  # Check for extremely long sessions (> 24 hours)
  very_long_sessions <- sessions_data %>%
    filter(!is.na(duration_minutes)) %>%
    filter(duration_minutes > 1440)
  
  cat("Sessions longer than 24 hours:", nrow(very_long_sessions), "\n")
}

# Check for orphaned sessions (playerIDs not in players dataset)
if ("playerID" %in% names(sessions_data) && "playerID" %in% names(players_data)) {
  orphaned_sessions <- sessions_data %>%
    anti_join(players_data, by = "playerID")
  
  cat("Orphaned sessions (playerID not in players dataset):", nrow(orphaned_sessions), "\n")
}

cat("\n✓ Data quality assessment complete\n")