# Which Kinds of Players Contribute the Most Data?  
*DSCI-100 Data Science Project*

---

## Introduction

Video game servers create detailed records of player activity, offering researchers valuable insight into how players interact with digital worlds. At UBC, data from a Minecraft server is being used to study player engagement and improve server management. A key challenge is identifying which types of players contribute the most data, so that recruitment and server resources can be optimized.  
In this project, we use k-means clustering of per-player session activity to answer: **Which kinds of players are most likely to contribute a large amount of data?**

---

## Data Description

**Dataset overview:**
- **players.csv:** List of all unique players, each with demographic and background data.
- **sessions.csv:** List of every play session, with start/end time and player identifier.

**Variable summary:**

| Variable         | File      | Type      | Description                                 |
|------------------|-----------|-----------|---------------------------------------------|
| hashedEmail      | both      | string    | Unique (anonymized) player ID               |
| name             | players   | string    | Player’s first name                         |
| gender           | players   | string    | Player’s gender                             |
| age              | players   | numeric   | Player’s age                                |
| experience       | players   | string    | Player experience (e.g., Pro, Veteran, etc.)|
| played_hours     | players   | numeric   | Reported hours played                       |
| subscribe        | players   | logical   | Newsletter subscription flag                |
| start_time       | sessions  | datetime  | Session start time                          |
| end_time         | sessions  | datetime  | Session end time                            |
| original_start_time | sessions | numeric | Raw start timestamp                         |
| original_end_time   | sessions | numeric | Raw end timestamp                           |

- **Sample sizes:**  
    - `players.csv`: 196 rows  
    - `sessions.csv`: 1535 rows

- **Summary statistics (per-player activity features derived from `sessions.csv`):**  
    | Metric             | Mean     | Median   | Min      | Max         |
    |--------------------|----------|----------|----------|-------------|
    | total_sessions     | 12.28    | 1.00     | 1.00     | 310.00      |
    | total_playtime     | 5154     | 0        | -13106966| 7361189     |
    | avg_session_length | 7114.6   | 0.3      | -485443  | 525578      |

- **Data issues:**  
    - **Missing values:** Some players have missing session information or missing values in reported play hours.
    - **Outliers:** Several players have extremely high or negative total playtime and session length. Negative values cannot be valid durations, so they most likely reflect data entry or logging errors.
    - **Data collection:** Data was collected automatically by the Minecraft server as players joined sessions, and player demographic information was self-reported at registration.
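
The missing and negative durations described above can be flagged before any aggregation. The following is a minimal sketch in Python, not the project's actual code: the helper `classify_session` and the sample rows are hypothetical, and it assumes the raw timestamps are plain numbers on a common scale.

```python
from typing import Optional


def classify_session(start: Optional[float], end: Optional[float]) -> str:
    """Label one session as 'ok', 'missing', or 'negative' (end before start)."""
    if start is None or end is None:
        return "missing"
    if end < start:
        return "negative"
    return "ok"


# Hypothetical raw timestamps (illustrative values, not taken from sessions.csv).
sessions = [
    {"hashedEmail": "a1", "original_start_time": 1000.0, "original_end_time": 1600.0},
    {"hashedEmail": "a1", "original_start_time": 2000.0, "original_end_time": None},
    {"hashedEmail": "b2", "original_start_time": 5000.0, "original_end_time": 4200.0},
]

labels = [
    classify_session(s["original_start_time"], s["original_end_time"])
    for s in sessions
]

# Keep only sessions whose duration can be trusted.
valid_sessions = [s for s, label in zip(sessions, labels) if label == "ok"]
print(labels)
```

Labelling rows instead of silently dropping them keeps a record of how many sessions fall into each problem category.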

---

## Methods & Results

**Data Wrangling and Feature Engineering:**  
- Imported data and calculated session duration (`session_length`) for each session.
- Summarized, for each player: total sessions, total playtime, average session length.
- Merged player demographic info with session activity using `hashedEmail` as the unique key.
- Selected numeric features (`total_sessions`, `total_playtime`, `avg_session_length`) and standardized them for clustering.
- Removed players with missing values before clustering; the resulting clusters cover 125 of the 196 registered players (117 + 4 + 4).
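
The wrangling steps above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the project's code: the session and player rows are hypothetical, `standardize` is a helper introduced here, and the population standard deviation is an arbitrary choice (a sample standard deviation would work equally well).

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sessions as (hashedEmail, session_length); values are illustrative.
sessions = [
    ("a1", 10.0), ("a1", 30.0), ("a1", 20.0),
    ("b2", 5.0),
    ("c3", 120.0), ("c3", 60.0),
]

# Hypothetical demographic rows keyed by hashedEmail.
players = {
    "a1": {"experience": "Pro"},
    "b2": {"experience": "Veteran"},
    "c3": {"experience": "Pro"},
}

# Group session lengths by player.
lengths_by_player = defaultdict(list)
for player, length in sessions:
    lengths_by_player[player].append(length)

# One row of activity features per player.
features = {
    player: {
        "total_sessions": float(len(lengths)),
        "total_playtime": sum(lengths),
        "avg_session_length": mean(lengths),
    }
    for player, lengths in lengths_by_player.items()
}

# Join demographics and activity on the shared key.
merged = {p: {**players.get(p, {}), **feats} for p, feats in features.items()}


def standardize(rows, column):
    """Return z-scores for one feature column (assumes the column is not constant)."""
    values = [row[column] for row in rows.values()]
    mu, sigma = mean(values), pstdev(values)
    return {player: (row[column] - mu) / sigma for player, row in rows.items()}


z_playtime = standardize(features, "total_playtime")
```

Standardizing matters here because `total_playtime` spans a far wider numeric range than `total_sessions`, and unscaled distances would be driven almost entirely by the larger feature.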

**Exploratory Data Analysis:**  
- Most players have low total sessions and playtime, but a small number of players are extremely active.
- Some variables are highly skewed, so log transformation was used for visualization.
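
To illustrate why a log scale helps with skewed counts, here is a small Python sketch. The values are hypothetical, chosen only to resemble the summary table (median of 1, maximum of 310), and `log10(1 + x)` is one common variant; the exact transformation used for the figures is not stated. A log transform is undefined for negative values, so negative durations would need to be handled first.

```python
import math

# Hypothetical total_sessions values: mostly small, with one extreme player.
total_sessions = [1, 1, 1, 2, 3, 8, 310]

# log10(1 + x) compresses the long right tail and is defined at x = 0.
log_sessions = [math.log10(1 + x) for x in total_sessions]

# Compare the spread between the largest and smallest value on each scale.
raw_ratio = max(total_sessions) / min(total_sessions)
log_ratio = max(log_sessions) / min(log_sessions)
print(f"raw max/min: {raw_ratio:.0f}, log max/min: {log_ratio:.1f}")
```

The ordering of players is unchanged by the transform; only the spacing between them shrinks, which keeps the low-activity majority visible in a histogram.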

**Clustering Analysis:**  
- Used k-means clustering on standardized features to segment players into groups by activity level.
- The elbow plot (Figure 2) bends at k = 3, so three clusters were used.
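
The elbow method can be sketched as follows. This is a plain-Python implementation of Lloyd's algorithm with random restarts, written for illustration only; it is not the project's code, the function names are hypothetical, and the toy points stand in for standardized features.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_init=20, max_iter=100, seed=0):
    """Run Lloyd's algorithm n_init times; return (centroids, wcss) of the best run."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_init):
        centroids = rng.sample(points, k)  # start from k distinct data points
        for _ in range(max_iter):
            # Assignment step: attach each point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Update step: move each centroid to the mean of its members.
            new_centroids = [
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[i]
                for i, members in enumerate(clusters)
            ]
            if new_centroids == centroids:
                break
            centroids = new_centroids
        # Total within-cluster sum of squares for this run.
        wcss = sum(min(sq_dist(p, c) for c in centroids) for p in points)
        if best is None or wcss < best[1]:
            best = (centroids, wcss)
    return best


# Toy data: three tight, well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

wcss_by_k = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

On data like this, the within-cluster sum of squares falls steeply up to k = 3 and barely changes afterward, which is the bend an elbow plot is read for. Random restarts are included because a single run of k-means can settle in a poor local solution.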

**Cluster Characteristics:**  
- **Cluster 1:** Largest group (117 players), lowest average session activity.
- **Cluster 2:** Small group (4 players), high variability and the highest average playtime and session length.
- **Cluster 3:** Also small (4 players), with high average sessions and playtime.

---

### Visualizations

**Figure 1. Player Activity Clusters (scatterplot):**  
> The scatterplot shows clear separation of clusters by activity level. Cluster 1 is tightly grouped near the origin (casual/infrequent players), while Clusters 2 and 3 contain the highly active users.

**Figure 2. Elbow Plot for Player Activity Clustering:**  
> The "elbow" at k = 3 supports the choice of three clusters.

**Figure 3. Log-Transformed Histograms by Cluster:**  
> The histograms show that most players (Cluster 1) have very low session activity. Cluster 2 contains a small number of highly variable, highly active players. Cluster 3 also contains only a few players but with consistently high total sessions and playtime.

---

## Discussion

After k-means clustering on player activity metrics, three distinct groups emerged:

- **Cluster 1:** Casual or infrequent users with few sessions, low total playtime, and short sessions.
- **Cluster 2:** Very small group of "super users" with extremely high total playtime and session length, and highly variable behavior.
- **Cluster 3:** Another small group with consistently high total sessions and playtime, but lower variability than Cluster 2.

Clusters 2 and 3 together hold only 8 of the 125 clustered players, yet they are the main contributors of data to the server. Cluster 1, the majority, contributes relatively little.  
**This suggests that most data comes from a few highly engaged players.**
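
One way to put a number on this concentration is the share of all sessions that comes from the most active players. The sketch below is illustrative only: `top_share` is a hypothetical helper and the session counts are made up, not taken from the project data.

```python
def top_share(values, top_n):
    """Fraction of the total contributed by the top_n largest values."""
    total = sum(values)
    if total == 0:
        return 0.0
    return sum(sorted(values, reverse=True)[:top_n]) / total


# Hypothetical per-player session counts (illustrative, not the project data).
total_sessions = [310, 150, 40, 5, 3, 2, 1, 1, 1, 1]

share_top2 = top_share(total_sessions, 2)
print(f"Top 2 of {len(total_sessions)} players: {share_top2:.1%} of sessions")
```

Applied to the real per-player totals, the same calculation would show how much of the recorded activity Clusters 2 and 3 account for.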

**Limitations:**  
- Some values (e.g., negative playtime/session length) indicate data quality issues.
- K-means assumes clusters are roughly spherical and of similar size; with skewed data and very small clusters, the results may be sensitive to scaling and outliers.
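
For reference, k-means minimizes the total within-cluster sum of squares. The notation below is chosen for this note rather than taken from the project: $C_j$ is the set of players in cluster $j$, $x_i$ is a player's standardized feature vector, and $\mu_j$ is the centroid of cluster $j$.

$$
W(C_1, \dots, C_k) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

Because each distance is squared, a single extreme value can dominate the sum and pull a centroid toward it, which is why the outliers and negative durations noted above can shape the clusters.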

**Implications:**  
- Recruitment and engagement strategies should focus on understanding and retaining Cluster 2 and 3 users.
- Future work could include deeper analysis of demographics (e.g., experience level, age) and investigation of data quality.

---

## Conclusion

Clustering analysis revealed that most of the recorded data is generated by a very small group of highly active players. Targeted engagement of these users is crucial for data-driven research and server management. Efforts to increase overall data contribution should consider both retaining top contributors and motivating casual players.

---
