# **STEAM REVIEW SENTIMENT & PLAYER BEHAVIOR ANALYSIS**
Authors: `Krystal Bacalso` `Javier Raut` `Joseph Desyolong` `Jhon Omblero` `Hayah Apistar`

## **Phase 3: Data Pipelining and Preprocessing**
### **Overview**
In this phase, we perform data preprocessing directly within the database using SQL queries, views, and stored procedures. This approach ensures efficient data cleaning and transformation, allowing us to prepare the data for effective analysis and machine learning in the next phase.

Our main goals are to:
1. Handle missing or inconsistent data.
2. Normalize values where necessary.
3. Create calculated columns that enrich the dataset.
4. Join related tables to generate combined views for easier querying.


### **3.1 Data Cleaning and Handling Missing Values**
Real-world data often contains missing values (NULLs), which can cause errors or misleading results during analysis. We replace NULLs in count-like columns with zero, on the assumption that a missing `votes_up` or `playtime_last_2weeks` means no votes or no recent playtime.

```sql
-- Replace NULL values in 'votes_up' with 0 in reviews table
UPDATE reviews
SET votes_up = 0
WHERE votes_up IS NULL;

-- Replace NULL values in 'playtime_last_2weeks' with 0 in authors table
UPDATE authors
SET playtime_last_2weeks = 0
WHERE playtime_last_2weeks IS NULL;
```
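Counting the affected rows before running the updates makes it easier to judge whether zero is a reasonable default. The queries below are a minimal sketch, assuming PostgreSQL 9.4 or later (for the aggregate `FILTER` clause) and the `reviews` and `authors` tables used above; the column aliases are illustrative.

```sql
-- Count the NULLs that the UPDATE statements would overwrite in reviews.
SELECT
    COUNT(*) AS total_reviews,
    COUNT(*) FILTER (WHERE votes_up IS NULL) AS null_votes_up
FROM reviews;

-- Same check for the authors table.
SELECT
    COUNT(*) AS total_authors,
    COUNT(*) FILTER (WHERE playtime_last_2weeks IS NULL) AS null_playtime_last_2weeks
FROM authors;
```

If a large share of rows is NULL, filling with zero may distort averages, and it may be worth keeping the NULLs or tracking them in a separate flag instead.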

### **3.2 Normalization and Transformations**
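Some raw values are stored in less intuitive units. For example, playtime is stored in minutes, which produces large numbers that are hard to read. We convert these values to hours to make the data easier to interpret and compare. Dividing by `60.0` rather than `60` keeps the fractional part if the column is an integer type.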

```sql
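-- Add the column only if it is missing, so the script can be re-run safely
ALTER TABLE authors
ADD COLUMN IF NOT EXISTS playtime_forever_hours FLOAT;

-- Convert lifetime playtime from minutes to hours
UPDATE authors
SET playtime_forever_hours = playtime_forever / 60.0;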
```
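The choice of divisor matters when the minutes column is an integer type. As a small worked example with a hypothetical value of 150 minutes, PostgreSQL behaves as follows:

```sql
-- Integer divided by integer truncates the result.
SELECT 150 / 60   AS truncated_hours;   -- 2

-- A numeric literal in the divisor keeps the fractional part.
SELECT 150 / 60.0 AS exact_hours;       -- 2.5 (numeric, displayed with trailing zeros)
```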

### **3.3 Creating Calculated Columns**
Derived columns can add insight. Here, we create a flag `is_active` to identify users who have played for more than 1 hour (60 minutes, since playtime is stored in minutes) in the last two weeks, indicating recent engagement.

```sql
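-- Add the flag only if it is missing, so the script can be re-run safely
ALTER TABLE authors
ADD COLUMN IF NOT EXISTS is_active BOOLEAN;

-- More than 60 minutes in the last two weeks counts as active;
-- a NULL playtime falls through to the ELSE branch and becomes FALSE
UPDATE authors
SET is_active = CASE WHEN playtime_last_2weeks > 60 THEN TRUE ELSE FALSE END;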
```
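To see how the threshold treats boundary cases, the same `CASE` expression can be applied to a few hand-picked values. This is an illustrative sketch; the `sample` alias and the test values are made up and do not come from the dataset.

```sql
-- Apply the is_active rule to boundary values without touching any table.
SELECT
    minutes,
    CASE WHEN minutes > 60 THEN TRUE ELSE FALSE END AS is_active
FROM (VALUES (0), (60), (61), (NULL)) AS sample(minutes);
-- 0 and 60 are inactive, 61 is active, and NULL is treated as inactive.
```

Exactly 60 minutes does not count as active because the comparison is strict.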

### **3.4 Creating Views for Joined Data**
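Writing the same join repeatedly is error-prone and makes queries harder to read. A view encapsulates the join logic so that combined data can be queried like a single table. Note that a standard (non-materialized) view does not store its results: the join still runs each time the view is queried, so the benefit is readability and consistency rather than speed.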

```sql
CREATE OR REPLACE VIEW review_author_view AS
SELECT
    r.app_id,
    r.review_id,
    r.review_text,
    r.voted_up,
    r.votes_up,
    r.timestamp_created,
    r.timestamp_updated,
    r.steam_purchase,
    r.received_for_free,
    r.early_access,

    a.author_id,
    a.num_games_owned,
    a.num_reviews,
    a.playtime_forever,
    a.playtime_last_2weeks,
    a.playtime_at_review,

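    -- Compute playtime_forever_hours from authors.playtime_forever.
    -- The cast to numeric keeps ROUND(value, 2) valid even if the
    -- source column is a floating-point type.
    ROUND((a.playtime_forever / 60.0)::numeric, 2) AS playtime_forever_hours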

FROM reviews r
JOIN authors a ON r.author_id = a.author_id;
```
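Once the view exists, analysis queries no longer need to repeat the join. The following is a sketch of one such query, assuming `voted_up` is a boolean recommendation flag; the output aliases are illustrative.

```sql
-- Compare average lifetime playtime between positive and negative reviews.
SELECT
    voted_up,
    COUNT(*) AS review_count,
    ROUND(AVG(playtime_forever_hours), 2) AS avg_playtime_hours
FROM review_author_view
GROUP BY voted_up
ORDER BY voted_up;
```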

### **3.5 Automating Preprocessing with a Stored Procedure**
Stored procedures automate repetitive tasks. This procedure fills missing values and adds the normalized and calculated columns in a single call, which improves maintainability and reproducibility.

```sql
CREATE OR REPLACE PROCEDURE preprocess_data()
LANGUAGE plpgsql
AS $$
BEGIN
    -- Set missing votes_up to 0
    UPDATE reviews
    SET votes_up = 0
    WHERE votes_up IS NULL;

    -- Set missing playtime_last_2weeks to 0
    UPDATE authors
    SET playtime_last_2weeks = 0
    WHERE playtime_last_2weeks IS NULL;

    -- Calculate playtime in hours
    ALTER TABLE authors
    ADD COLUMN IF NOT EXISTS playtime_forever_hours FLOAT;

    UPDATE authors
    SET playtime_forever_hours = playtime_forever / 60.0;

    -- Set is_active flag
    ALTER TABLE authors
    ADD COLUMN IF NOT EXISTS is_active BOOLEAN;

    UPDATE authors
    SET is_active = CASE WHEN playtime_last_2weeks > 60 THEN TRUE ELSE FALSE END;
END;
$$;
```
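Creating the procedure does not run it. A minimal usage sketch, assuming PostgreSQL 11 or later (the first version with `CREATE PROCEDURE` and `CALL`):

```sql
-- Run all preprocessing steps in one call.
CALL preprocess_data();
```

Because the procedure uses `ADD COLUMN IF NOT EXISTS` and its updates simply recompute the same values, calling it again after new data is loaded should be safe.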

### **3.6 Validation Queries**
After preprocessing, run checks to verify data consistency and correctness.

```sql
-- Check for any remaining NULLs in votes_up
SELECT COUNT(*) FROM reviews WHERE votes_up IS NULL;

-- Count of active users
SELECT COUNT(*) FROM authors WHERE is_active = TRUE;

-- Sample combined data from the view
SELECT * FROM review_author_view LIMIT 5;
```
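The checks above do not cover every column the preprocessing touches. The queries below are a sketch of a few additional checks under the same schema assumptions; each should return 0 if preprocessing ran as intended, and the aliases and the 0.001 tolerance are illustrative choices.

```sql
-- Remaining NULLs in playtime_last_2weeks.
SELECT COUNT(*) AS null_recent_playtime
FROM authors
WHERE playtime_last_2weeks IS NULL;

-- Authors whose is_active flag disagrees with the 60-minute rule.
SELECT COUNT(*) AS inconsistent_flags
FROM authors
WHERE is_active IS DISTINCT FROM (COALESCE(playtime_last_2weeks, 0) > 60);

-- Authors whose stored hours are missing or differ from minutes / 60.
SELECT COUNT(*) AS inconsistent_hours
FROM authors
WHERE playtime_forever IS NOT NULL
  AND (playtime_forever_hours IS NULL
       OR ABS(playtime_forever_hours - playtime_forever / 60.0) > 0.001);
```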

## **Summary**
In this phase, we performed the essential data preprocessing directly within the database. Using SQL queries, views, and stored procedures, we:
1. Cleaned the data by replacing NULLs with default values.
2. Normalized important fields, such as converting playtime from minutes to hours, to improve interpretability.
3. Created new calculated columns such as `is_active` to flag recently engaged users.
4. Built a combined view, `review_author_view`, that joins reviews with author metadata and simplifies later queries and analysis.
5. Automated the preprocessing steps with a stored procedure, `preprocess_data()`, making the workflow repeatable and reproducible.

These preprocessing steps lay a solid foundation for the next phase, where we will conduct detailed data analysis and apply machine learning models on a clean, well-structured dataset.