# Identifying High-Engagement Player Segments Through Demographic and Behavioral Analysis

**Authors:**  
- Kehan Hettiarachchi  
- Kenshin Tanaka  
- Caio Matos  
- Finley Bradshaw  

**Date:** March 15, 2025

## Introduction

Understanding which players contribute the most in-game data is essential for making informed decisions in game design, user retention, and personalized content strategies. This project aims to identify player characteristics that are strongly linked to high engagement, which we measure by total hours played. Engagement is a key performance indicator in gaming; players who spend more time in-game generally provide more interaction data, are more likely to convert to paid services, and can serve as valuable participants in beta testing or early feature releases.

The central question guiding this report is:  
**Can we use player attributes such as age, experience level, gender, and subscription status to predict how many hours a player will play?**

We explore this question using `players.csv`, a dataset of **196 player records** and **seven variables**: `experience`, `subscribe`, `hashedEmail`, `played_hours`, `name`, `gender`, and `age`. The dataset appears to have been collected from an online platform and includes both **numeric variables** (such as `played_hours` and `age`) and **categorical variables** (such as `experience`, `subscribe`, and `gender`) that describe player demographics and behavior.


### Dataset Overview

- **Played hours** (`played_hours`) is the target variable, ranging from 0 to 223.1 hours. It has a mean of **5.85** and a median of **0**, so at least half of the players recorded no playtime and the distribution is strongly right-skewed by a small number of heavy players.

- **Participant ages** range from **8 to 50 years**, with a mean of approximately **20.5** and a median of **19**. Two age values are missing.

- **Experience** is divided into five levels:
  - Amateur (32%)
  - Beginner (3%)
  - Pro (7%)
  - Regular (22%)
  - Veteran (36%)

- **Subscribe** is mostly `TRUE` (~73%), indicating a high subscription rate.

- **Gender demographics** include:
  - Male (63%)
  - Female (19%)
  - Non-binary (12%)
  - A few minor categories grouped under `"Other"`

- `hashedEmail` and `name` serve as **unique identifiers** and are excluded from the analysis.
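
The overview figures above can be recomputed directly from the file. The following is a minimal sketch using only the Python standard library. It assumes that missing ages are encoded as `NA` or an empty string and that `subscribe` is stored as the text `TRUE`/`FALSE`; the three sample rows are hypothetical placeholders, not records from `players.csv`.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample with the same seven columns as players.csv.
# The values are illustrative only and are not real records.
SAMPLE = """experience,subscribe,hashedEmail,played_hours,name,gender,age
Pro,TRUE,hash1,30.3,Player1,Male,9
Veteran,TRUE,hash2,0,Player2,Female,17
Amateur,FALSE,hash3,0.7,Player3,Male,NA
"""


def summarize(rows):
    """Return the summary statistics reported in the dataset overview."""
    n = len(rows)
    hours = [float(r["played_hours"]) for r in rows]
    # Assumption: a missing age appears as "NA" or as an empty string.
    ages = [float(r["age"]) for r in rows if r["age"] not in ("", "NA")]
    experience_counts = Counter(r["experience"] for r in rows)
    return {
        "n": n,
        "hours_mean": statistics.mean(hours),
        "hours_median": statistics.median(hours),
        "hours_max": max(hours),
        "age_mean": statistics.mean(ages),
        "age_missing": n - len(ages),
        "experience_share": {k: v / n for k, v in experience_counts.items()},
        "subscribe_share": sum(r["subscribe"] == "TRUE" for r in rows) / n,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = summarize(rows)
print(summary)
```

To run the same summary on the real data, replace the `io.StringIO(SAMPLE)` source with an open file handle, for example `open("players.csv", newline="")`.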

### Rationale

Identifying high-engagement players offers both analytical and business value. From an analytical perspective, understanding how various factors relate to playtime can help optimize predictive models. For business or design teams, knowing who the most active players are can support targeted marketing campaigns, loyalty programs, and content customization.

However, there are potential modeling challenges to consider:
- **Zero inflation and outliers** in `played_hours`: many players recorded 0 hours, while a small number played extensively (up to 223.1 hours).
- **Low-frequency categories** in `experience` and `gender`, which leave too few observations to produce reliable estimates for the smaller groups.
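
Two common ways to soften these challenges are sketched below, as one possible approach rather than the method this report commits to. A `log(1 + hours)` transform compresses the long right tail while staying defined at zero, and rare labels can be merged into `"Other"`. The `min_count` threshold and the labels `RareA` and `RareB` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def log_hours(hours):
    """Apply log(1 + h), which is defined at h = 0 and compresses large values."""
    return [math.log1p(h) for h in hours]


def collapse_rare(labels, min_count):
    """Replace every label that occurs fewer than min_count times with "Other"."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else "Other" for lab in labels]


# Illustrative values: two zeros, one light player, and the reported maximum.
hours = [0.0, 0.0, 0.7, 223.1]
transformed = log_hours(hours)
print([round(v, 2) for v in transformed])

# "RareA" and "RareB" are hypothetical stand-ins for low-frequency categories.
genders = ["Male", "Male", "Female", "Female", "RareA", "RareB"]
collapsed = collapse_rare(genders, min_count=2)
print(collapsed)
```

The transform does not remove the cluster of zeros, so a model fitted on the transformed target still needs to be checked against players with no recorded playtime.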

This project uses **Jupyter Notebook** to run the complete analysis pipeline: loading and cleaning the data, performing exploratory data analysis (EDA), and applying regression techniques to model playtime from demographic attributes. Each step is designed to be reproducible and understandable for both technical and non-technical audiences.
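
As a rough illustration of the regression step, the sketch below fits a one-predictor least-squares line of `log(1 + played_hours)` on `age`. The choice of a simple linear fit and of the log-transformed target are assumptions made for this example; the report does not fix a specific regression technique at this point. The helper `fit_line` is hypothetical, and the input values are illustrative rather than taken from `players.csv`.

```python
import math
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Illustrative values only, not real records.
ages = [12, 17, 19, 21, 30]
hours = [30.3, 3.8, 0.0, 0.7, 0.1]

# Fit on the log scale, then map a prediction back to hours with expm1.
intercept, slope = fit_line(ages, [math.log1p(h) for h in hours])
predicted_hours_at_20 = math.expm1(intercept + slope * 20)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print(f"predicted hours at age 20: {predicted_hours_at_20:.2f}")
```

Categorical predictors such as `experience`, `subscribe`, and `gender` would need to be encoded as indicator variables before entering a model like this one.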

By the end of this report, we aim to provide actionable insights and a functioning model that highlights the characteristics of the players who are most engaged with the game. These insights can inform recruiting strategies for data collection, enhance retention efforts, and aid in game personalization initiatives.
