# **STEAM REVIEW SENTIMENT & PLAYER BEHAVIOR ANALYSIS**
**Phase 5: Project Summary & Comprehensive Report**

**Authors:** `Krystal Bacalso` `Javier Raut` `Joseph Desyolong` `Jhon Omblero` `Hayah Apistar`

---

### **Executive Summary**

**Objective:**
The primary goal of this project was to investigate the question:

> “What factors influence how players review games on Steam, and how do those reviews reflect player experience and engagement?”

We aimed to determine whether playtime, ownership history, or acquisition method correlates with positive or negative sentiment.


**Methodology:**
We engineered a complete data mining pipeline consisting of:
1.  **Extraction:** Pulling 10,000+ reviews via the Steam Web API.
2.  **Storage:** Architecting a relational database on Supabase (PostgreSQL).
3.  **Processing:** Using SQL logic for efficient data cleaning and feature engineering.
4.  **Analysis:** Performing Exploratory Data Analysis (EDA) and Machine Learning (Logistic Regression) to classify sentiment.

---

### **Phase 1: API Selection & Data Strategy**

**Data Source:** [Steam User Reviews API](https://partner.steamgames.com/doc/store/getreviews)

To build a dataset with broad, high-volume feedback, we targeted the **Top 10 Most Reviewed Games** on Steam, spanning diverse genres (including *Terraria, Elden Ring, and Euro Truck Simulator 2*). These high-engagement titles supply a wide variety of user feedback, although they do not represent less popular games.

**Extraction Strategy:**
*   **Endpoint:** `https://store.steampowered.com/appreviews/{app_id}?json=1`
*   **Fields Collected:**

| Field | Description |
|:------|:-------------|
| `review` | The text content of the user’s review |
| `voted_up` | Indicates whether the user recommended the game (positive = True, negative = False) |
| `votes_up` | Number of users who found the review helpful |
| `author.num_games_owned` | Total number of games owned by the reviewer |
| `author.num_reviews` | Number of reviews written by the user |
| `author.playtime_forever` | Total lifetime playtime (in minutes) for the game |
| `author.playtime_last_two_weeks` | Playtime in the past two weeks (in minutes) |
| `author.playtime_at_review` | Playtime at the time the review was written (in minutes) |
| `timestamp_created` | Unix timestamp when the review was created |
| `timestamp_updated` | Unix timestamp when the review was last updated |
| `steam_purchase` | True if the review came from a verified Steam purchase |
| `received_for_free` | True if the game was received for free |
| `written_during_early_access` | True if the user posted this review while the game was in Early Access |

*   **Parameters Used:**

| Parameter | Description |
|------------|-------------|
| `filter` | Sort order: `recent` (by creation time), `updated` (by last update time), or `all` (by helpfulness, the default). |
| `language` | Specifies the language of reviews (or `all` to include every language). |
| `day_range` | The number of days from the current date to include in results (max 365); applies only when `filter` is `all`. |
| `cursor` | Handles pagination: pass `*` on the first request, then the `cursor` value returned with each response to fetch the next batch. |
| `review_type` | Filters reviews by sentiment: `all`, `positive`, or `negative`. |
| `purchase_type` | Filters by how the game was acquired: `all`, `steam` (Steam purchases), or `non_steam_purchase`. |
| `num_per_page` | Number of reviews returned per request (max 100). |
| `filter_offtopic_activity` | Optional: controls whether off-topic “review bombs” are filtered out (the default) or included. |
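
The cursor-based pagination described in the table can be sketched roughly as follows. This is a minimal illustration, not the project's actual extraction script: the function names (`build_url`, `fetch_page`, `iter_reviews`), the choice of `filter=recent`, and the stopping rules are all assumptions made for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://store.steampowered.com/appreviews/{app_id}"


def build_url(app_id, cursor="*", language="all", num_per_page=100):
    """Build one request URL; urlencode() escapes the cursor value."""
    params = {
        "json": 1,
        "filter": "recent",        # assumed sort order for this sketch
        "language": language,
        "review_type": "all",
        "purchase_type": "all",
        "num_per_page": num_per_page,
        "cursor": cursor,
    }
    return BASE_URL.format(app_id=app_id) + "?" + urlencode(params)


def fetch_page(url):
    """Default fetcher: perform the HTTP GET and decode the JSON body."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_reviews(app_id, max_reviews=1000, fetch=fetch_page):
    """Yield review dicts, following the cursor until the data runs out."""
    cursor, seen, count = "*", set(), 0
    # Stop at the requested cap or when the API hands back a cursor we
    # have already used (which would otherwise loop forever).
    while count < max_reviews and cursor not in seen:
        seen.add(cursor)
        page = fetch(build_url(app_id, cursor))
        reviews = page.get("reviews", [])
        if not reviews:
            break
        for review in reviews:
            yield review
            count += 1
            if count >= max_reviews:
                return
        cursor = page.get("cursor", cursor)
```

Calling `list(iter_reviews(app_id, max_reviews=1000))` for each target App ID would collect the raw review objects. The `fetch` argument exists so the loop can be exercised with a stub instead of live HTTP requests.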

**Outcome:**
We successfully retrieved raw JSON data containing:
*   **Review Content:** The text body and "voted up" status.
*   **Voting Data:** "Helpful" and "Funny" votes from other users.
*   **Author Metadata:** Playtime, games owned, and last played timestamps.

---

### **Phase 2: Database Architecture (Supabase)**

We moved beyond simple CSV files by implementing a **Relational Database** using PostgreSQL (via Supabase). This ensures data integrity, facilitates complex queries, and mimics a real-world production environment.

**Schema Design (Normalized):**

#### **1. `authors` Table**
| Column               | Type        | Description                          |
|----------------------|-------------|--------------------------------------|
| author_id            | VARCHAR(50) | Primary key, unique user ID          |
| num_games_owned      | INT         | Total games owned by user            |
| num_reviews          | INT         | Number of reviews user wrote         |
| playtime_forever     | INT         | Total playtime of user (minutes)     |
| playtime_last_2weeks | INT         | Playtime in last 2 weeks (minutes)   |
| playtime_at_review   | INT         | Playtime at time of review (minutes) |


#### **2. `reviews` Table**
| Column             | Type          | Description                          |
|--------------------|---------------|------------------------------------|
| review_id          | BIGINT        | Primary key, unique review ID       |
| app_id             | INT           | Steam game ID                      |
| review_text        | TEXT          | User’s review content               |
| voted_up           | BOOLEAN       | Recommended or not                  |
| votes_up           | INT           | Helpful votes                      |
| steam_purchase     | BOOLEAN       | Verified Steam purchase            |
| received_for_free  | BOOLEAN       | Received game for free             |
| early_access       | BOOLEAN       | Review posted during early access  |
| timestamp_created  | BIGINT        | Review creation time (unix)        |
| timestamp_updated  | BIGINT        | Review update time (unix)          |

**ETL Pipeline:**
*   **Extract:** A Python script looped through our target App IDs.
*   **Transform:** We used Pandas `json_normalize` to flatten the nested JSON structure.
*   **Load:** Data was inserted into Supabase using batch `UPSERT` operations to prevent duplicate entries if the script was run multiple times.
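
The idempotent load step can be illustrated with a small sketch. To keep it self-contained, it uses Python's built-in `sqlite3` as a stand-in for Supabase/PostgreSQL (the `ON CONFLICT ... DO UPDATE` clause shown is accepted by both), and a plain-Python `flatten` helper in place of Pandas `json_normalize`. The helper names, the subset of columns, and the `app_id` key attached to each raw review are assumptions for illustration only.

```python
import sqlite3


def flatten(review):
    """Map one raw review object onto flat `reviews` columns."""
    return {
        "review_id": int(review["recommendationid"]),
        "app_id": review["app_id"],   # assumed to be attached by the extract loop
        "review_text": review["review"],
        "voted_up": review["voted_up"],
        "votes_up": review.get("votes_up"),
    }


UPSERT_SQL = """
INSERT INTO reviews (review_id, app_id, review_text, voted_up, votes_up)
VALUES (:review_id, :app_id, :review_text, :voted_up, :votes_up)
ON CONFLICT (review_id) DO UPDATE SET
    review_text = excluded.review_text,
    voted_up    = excluded.voted_up,
    votes_up    = excluded.votes_up
"""


def load(conn, raw_reviews):
    """Insert a batch; rows whose review_id already exists are updated."""
    with conn:  # commit on success, roll back on error
        conn.executemany(UPSERT_SQL, [flatten(r) for r in raw_reviews])


# In-memory stand-in database with a reduced `reviews` table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reviews (
    review_id INTEGER PRIMARY KEY, app_id INTEGER, review_text TEXT,
    voted_up BOOLEAN, votes_up INTEGER)""")

# Made-up sample rows, not project data.
batch = [
    {"recommendationid": "1", "app_id": 105600, "review": "fun",
     "voted_up": True, "votes_up": 2},
    {"recommendationid": "2", "app_id": 105600, "review": "crash",
     "voted_up": False, "votes_up": 5},
]
load(conn, batch)
load(conn, batch)  # simulate running the script a second time
row_count = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
```

Because the primary key drives the conflict check, re-running the script refreshes existing rows rather than inserting duplicates.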

---

### **Phase 3: SQL-Based Preprocessing**

Instead of performing all cleaning in Python, we utilized **SQL Stored Procedures and Views**. This "ELT" (Extract, Load, Transform) approach leverages the database engine for efficiency.

**Key Operations Performed:**
1.  **Handling Nulls:** SQL `UPDATE` queries replaced missing `votes_up` values with 0.
2.  **Feature Engineering:**
    *   **`playtime_forever_hours`**: We converted raw minutes to hours for better readability in visualizations.
    *   **`is_active`**: We created a boolean flag marking users who had played more than 1 hour (60 minutes) in the last two weeks.
3.  **Data Integration:**
    *   We created a **View** named `review_author_view`. This virtual table automatically joins `reviews` and `authors`, providing a clean, unified dataset for analysis.

**Outcome:** A consistent, pre-cleaned dataset ready for immediate extraction into a Pandas DataFrame.
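
The three operations above might look roughly like the SQL below, wrapped in Python so it can be run end to end. This is a sketch under stated assumptions: `sqlite3` stands in for PostgreSQL, the tables are reduced to the columns needed, and the `author_id` column on `reviews` used as the join key is hypothetical, since the schema listing does not name the key that links the two tables.

```python
import sqlite3

PREPROCESS_SQL = """
-- 1. Handling nulls: missing helpful-vote counts become 0.
UPDATE reviews SET votes_up = 0 WHERE votes_up IS NULL;

-- 2 and 3. Engineered features exposed through the integrating view.
-- The join key (reviews.author_id) is an assumption for this sketch.
CREATE VIEW review_author_view AS
SELECT r.review_id,
       r.voted_up,
       r.votes_up,
       a.playtime_forever / 60.0   AS playtime_forever_hours,
       a.playtime_last_2weeks > 60 AS is_active
FROM reviews AS r
JOIN authors AS a ON a.author_id = r.author_id;
"""

conn = sqlite3.connect(":memory:")
# Reduced tables with made-up sample rows, not project data.
conn.executescript("""
CREATE TABLE authors (author_id TEXT PRIMARY KEY,
                      playtime_forever INT,
                      playtime_last_2weeks INT);
CREATE TABLE reviews (review_id INTEGER PRIMARY KEY,
                      author_id TEXT,
                      voted_up BOOLEAN,
                      votes_up INT);
INSERT INTO authors VALUES ('a1', 3600, 90), ('a2', 120, 0);
INSERT INTO reviews VALUES (1, 'a1', 1, NULL), (2, 'a2', 0, 4);
""")
conn.executescript(PREPROCESS_SQL)

rows = conn.execute(
    "SELECT review_id, votes_up, playtime_forever_hours, is_active "
    "FROM review_author_view ORDER BY review_id"
).fetchall()
```

Keeping the derived columns in the view rather than in the base tables means the raw minutes stay untouched and the analysis layer only ever queries `review_author_view`.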

---

### **Phase 4.1: Exploratory Data Analysis (EDA)**

We visualized the dataset using Matplotlib and Seaborn to uncover behavioral trends.

**A. The Engagement Gap**
We compared playtime between Positive vs. Negative reviewers.
*   *Observation:* Positive reviews have a median lifetime playtime of **~60 hours**, while negative reviews sit at **~41.5 hours**.
*   *Insight:* Higher engagement correlates with satisfaction: players who invest significant time are more likely to recommend the game. However, the still-substantial playtime of negative reviewers suggests that many are "Disappointed Veterans" rather than players who quit early and refunded.

**B. The "Parting Shot" Theory**
We engineered a metric called `playtime_ratio` (Playtime at Review / Total Lifetime Playtime).
*   *Observation:* Negative reviews have a median ratio of **1.0**.
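*   *Insight:* A median of 1.0 means at least half of negative reviewers logged no further playtime after posting, which suggests that negative reviews are often written at the moment a player quits for good ("Rage Quitting"). Positive reviewers, with a median ratio of ~0.99, more often continue playing after reviewing.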
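
As a rough illustration of how `playtime_ratio` and its per-sentiment median can be computed, the sketch below uses only the standard library. The helper names and the sample rows are invented for the example, and treating a zero lifetime playtime as "ratio undefined" is an assumption rather than a documented project rule.

```python
from statistics import median


def playtime_ratio(at_review, forever):
    """Share of lifetime playtime already logged when the review was written."""
    if not forever:  # zero or missing lifetime playtime: ratio is undefined
        return None
    return at_review / forever


def median_ratio_by_sentiment(rows):
    """Return {voted_up: median ratio}, skipping rows with an undefined ratio."""
    groups = {True: [], False: []}
    for row in rows:
        ratio = playtime_ratio(row["playtime_at_review"], row["playtime_forever"])
        if ratio is not None:
            groups[row["voted_up"]].append(ratio)
    return {label: median(vals) for label, vals in groups.items() if vals}


# Made-up rows (playtime in minutes), not project data.
rows = [
    {"voted_up": True, "playtime_at_review": 300, "playtime_forever": 600},
    {"voted_up": True, "playtime_at_review": 900, "playtime_forever": 1000},
    {"voted_up": True, "playtime_at_review": 120, "playtime_forever": 120},
    {"voted_up": False, "playtime_at_review": 45, "playtime_forever": 45},
    {"voted_up": False, "playtime_at_review": 200, "playtime_forever": 200},
    {"voted_up": False, "playtime_at_review": 30, "playtime_forever": 60},
    {"voted_up": False, "playtime_at_review": 10, "playtime_forever": 0},
]
result = median_ratio_by_sentiment(rows)
```

A ratio of exactly 1.0 means `playtime_forever` has not grown since the review was posted, which is the "no further play" signal discussed above.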

**C. Review Length & Helpfulness**
*   *Observation:* Negative reviews are significantly longer (median **30 words**) than positive reviews (median **14 words**). They also receive more "Helpful" votes (median 1 vs 0).
*   *Insight:* Unhappiness requires explanation. Players write detailed critiques to warn others, and the community values these warnings more than generic praise.

**D. The "Critical Connoisseur"**
*   *Observation:* Negative reviewers own significantly more games (median **75**) compared to positive reviewers (median **25**).
*   *Insight:* Experienced players with large libraries are harsher critics, likely due to higher standards and broader comparisons.

---

### **Phase 4.2: Machine Learning (Sentiment Prediction)**

We built a Natural Language Processing (NLP) model to classify reviews based solely on text content. We compared **Naive Bayes** and **Logistic Regression**.

**Model Architecture:**
*   **Input:** `review_text` transformed via **TF-IDF Vectorization** (to weight distinctive words more heavily than common ones).
*   **Algorithm:** `LogisticRegression` with `class_weight='balanced'`.
*   **Optimization:** The balanced class weight was crucial for overcoming the class imbalance (about 96% of reviews are positive) that rendered Naive Bayes ineffective at detecting negatives.
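
To make the effect of `class_weight='balanced'` concrete, the sketch below reproduces the weighting heuristic scikit-learn documents for that setting, `n_samples / (n_classes * class_count)`, in plain Python. The function name and the 100-label vector are hypothetical; the vector only mirrors the reported 96% positive share and is not the project's data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute per-class weights as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical labels: 1 = positive review, 0 = negative review.
labels = [1] * 96 + [0] * 4
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.3f}")
```

With this split, each negative example is weighted 24 times more heavily than a positive one (12.5 versus roughly 0.52), so misclassifying the rare class costs the model far more during training.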

**Results:**
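*   **Accuracy:** **92%**. Because ~96% of reviews are positive, a classifier that always predicts "positive" would already score about 96%, so this figure is best read alongside recall on the negative class.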
*   **Key Positive Drivers:** `best`, `fun`, `great`, `love`, `peak`.
*   **Key Negative Drivers:** `boring`, `crash`, `sucks`, `waste`, `update`.

**Interpretation:**
The model indicates that distinct keywords drive sentiment. Emotional keywords ("fun", "love") drive positive sentiment, while technical issues ("crash") or lack of engagement ("boring") drive negative sentiment. The presence of **"update"** as a negative driver is consistent with the "Disappointed Veteran" theory: patches can turn loyal fans into critics.

---

### **Synthesis & Findings**

By connecting our quantitative analysis with our machine learning results, we can construct a holistic picture of the Steam review ecosystem. This section synthesizes our disparate findings into cohesive themes.

#### **A. The Dual Nature of Steam Reviews**
Our data reveals a **bimodal user base** with distinct behavioral patterns:
1.  **The "Silent" Majority:** These users make up ~96% of the ecosystem. They play games extensively (median ~60 hours), purchase via standard channels, and leave short, positive reviews like *"10/10 best game"* or *"masterpiece."* Their engagement is high, but their specific feedback is often generic.
2.  **The Vocal Minority:** These are the "Veteran Critics." They own significantly more games (median ~75 vs. 25), write more reviews, and exhibit lower recent playtime. Their negative reviews are notably longer (median ~30 words vs. 14), indicating that dissatisfaction requires justification.

#### **B. The "Point of No Return"**
A critical finding is the predictive power of the **`playtime_ratio`**.
*   When `playtime_at_review` equals `playtime_forever` (ratio = 1.0), the review is far more likely to be **negative**.
*   This suggests a specific user journey: A player encounters a frustration point (crash, bad mechanic, poor optimization), writes a negative review, and immediately uninstalls the game. This "Rage Quit" signal is one of the strongest indicators of negative sentiment we found.

#### **C. Content is King (Acquisition is Irrelevant)**
We tested the hypothesis that users who received the game for free might be more lenient (higher positivity) or that Steam purchasers might be more critical due to financial investment.
*   **Result:** We found **no statistically significant difference** in review sentiment based on acquisition method (`steam_purchase` vs. `received_for_free`).
*   **Implication:** Players evaluate games primarily on their intrinsic quality (fun factor, stability) rather than the transaction method. A bad game is a bad game, regardless of price.
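
The report does not state which significance test was used. One common option for comparing sentiment across two acquisition groups is a Pearson chi-square test of independence on a 2x2 table, sketched below with the standard library only. The function name and the counts are made up for illustration and are not the project's figures.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom, without
    continuity correction.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


# Made-up counts. Rows: received_for_free = False / True.
# Columns: positive reviews / negative reviews.
stat, p_value = chi_square_2x2(9120, 380, 480, 20)
```

In this invented table both groups are exactly 96% positive, so the statistic is 0 and the p-value is 1: the data give no evidence that sentiment depends on acquisition method. A p-value below the chosen threshold (commonly 0.05) would indicate the opposite.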

#### **D. The Language of Satisfaction vs. Frustration**
Our machine learning model provided a lexical map of user sentiment:
*   **Positive Sentiment:** Driven by emotional and experiential words like *fun, love, great, best, peak*. This reflects a subjective, enjoyment-based evaluation.
*   **Negative Sentiment:** Driven by functional and technical words like *crash, boring, waste, update*. This reflects an objective failure of the product to perform or engage.
*   **Synthesis:** Positive reviews celebrate the **experience**; negative reviews critique the **product**. This distinction is crucial for developers seeking to improve their games.

---

### **Conclusion & Recommendations**

**Conclusion:**
Our analysis confirms that the Steam review ecosystem is heavily biased towards positivity. However, the minority of negative reviews are high-value signals: they come from experienced "veteran" players (high game ownership), are written in greater detail, and signal the end of that player's lifecycle with the game. "Boring" was identified as the ultimate sin, carrying more weight than technical flaws.

**Recommendations:**
1.  **For Developers:** Pay close attention to long, negative reviews from high-playtime users, especially after updates. These are not trolls; they are disappointed superfans providing detailed technical feedback.
2.  **For Future Analysis:** Implement **Aspect-Based Sentiment Analysis (ABSA)** to automatically bucket reviews into categories like "Graphics," "Gameplay," or "Performance" based on the keywords identified in our ML model.

---


### **References**

**Methodology & Machine Learning (TF-IDF, Logistic Regression)**
*   Tan, J. Y., Chow, A. S. K., & Tan, C. W. (2021). Sentiment analysis on game reviews: A comparative study of machine learning approaches. *Proceedings of the International Conference on Digital Transformation and Applications (ICDXA)*, 209-225. https://doi.org/10.2991/ahis.k.210913.001
*   Zuo, Z. (2018). Sentiment analysis of Steam review datasets using Naive Bayes and Decision Tree Classifier. *International Journal of Multidisciplinary Sciences and Engineering, 9*(7), 1-8.

**Player Engagement & Playtime**
*   Lin, D., Bezemer, C. P., & Hassan, A. E. (2019). An empirical study of game reviews on the Steam platform. *Empirical Software Engineering, 24*(1), 170-207. https://doi.org/10.1007/s10664-018-9627-4
*   Vicoriza, V., Aryotejo, G., & Widodo, A. P. (2025). Analysis of the correlation between playtime, design, and game mechanics to positive reviews. *Jurnal Masyarakat Informatika, 15*(2).

**Review Helpfulness & User Behavior**
*   Eberhard, L., Kasper, P., Koncar, P., & Gütl, C. (2018). Investigating helpfulness of video game reviews on the Steam platform. *2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS)*, 43-50.
*   Runge, J., et al. (2022). Net Promoter Score inversion may signal problematic digital use. *Scientific Reports, 12*, 1-9.

**Data Source & Documentation**
*   Valve Corporation. (2025). *Steam Web API Documentation*. Steamworks Partner. Retrieved from https://partner.steamgames.com/doc/webapi