# Data Engineering Override: ETL Pipeline & Ground Truth Restoration
**Objective:** Restore data integrity by re-engineering the ETL (Extract, Transform, Load) pipeline from the raw source.

**Context & Justification:**
The datasets initially provided carried synthetic labels generated by an untrained model. Because **Supervised Learning** depends on trustworthy targets, we train on **Ground Truth** (human-verified) labels instead. This section implements a custom parser that extracts the original sentiment labels from the raw Johns Hopkins University (JHU) XML corpus.

**Technical Implementation:**
1.  **Ingestion:** We programmatically download `domain_sentiment_data.tar.gz` directly from the academic source so that the pipeline can be reproduced from scratch.
2.  **XML Parsing:** The raw data is pseudo-XML rather than well-formed XML. We use `BeautifulSoup` with the lenient `html.parser` backend to locate every `<review_text>` tag and extract its text content.
3.  **Label Encoding:**
> Reviews in `positive.review` are explicitly mapped to **Label 1**. <br>Reviews in `negative.review` are explicitly mapped to **Label 0**. <br>Because each label is fixed by the source file, the binary classification target is deterministic and no longer depends on model-generated labels, which resolves the label noise found in the previous iteration.
4.  **Persistence:** The cleaned data is aggregated into structured DataFrames and serialized as one CSV per category (e.g., `books_clean.csv`, `dvd_clean.csv`), which serve as the fixed input for the Model Architecture phase.
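
Steps 2 and 3 can be illustrated on a small inline sample. The sketch below is dependency-free: it uses the standard library's `HTMLParser` (the parser behind BeautifulSoup's `"html.parser"` option) rather than BeautifulSoup itself. The sample strings, the `ReviewTextExtractor` class, and the `extract_review_texts` and `label_reviews` helpers are hypothetical names introduced only for illustration; they are not part of the corpus or of the pipeline code.

```python
from html.parser import HTMLParser

# Hypothetical samples in a pseudo-XML layout (illustrative, not real corpus data).
POS_SAMPLE = """<review>
<rating>
5.0
</rating>
<review_text>
Great read, would recommend.
</review_text>
</review>
"""

NEG_SAMPLE = """<review>
<rating>
1.0
</rating>
<review_text>
Fell apart after a week.
</review_text>
</review>
"""


class ReviewTextExtractor(HTMLParser):
    """Collects the text found inside each <review_text> element."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._buffer = None  # None means "currently outside <review_text>"

    def handle_starttag(self, tag, attrs):
        if tag == "review_text":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "review_text" and self._buffer is not None:
            self.texts.append("".join(self._buffer).strip())
            self._buffer = None


def extract_review_texts(raw):
    """Return the stripped text of every <review_text> element in `raw`."""
    parser = ReviewTextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.texts


def label_reviews(raw, label):
    """Pair each extracted review with the label implied by its source file."""
    return [(text, label) for text in extract_review_texts(raw)]


# Positive-file content gets label 1, negative-file content gets label 0.
rows = label_reviews(POS_SAMPLE, 1) + label_reviews(NEG_SAMPLE, 0)
print(rows)
```

The label never comes from the review content; it is determined entirely by which file the text was read from, which is what makes the mapping deterministic.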

In [None]:
# -*- coding: utf-8 -*-
# ================================================================ #
# ISY503: Intelligent Systems - Final Project (Data Preprocessing) #
# ================================================================ #

# ------------------- #
# 1. Data Engineering #
# ------------------- #
# This section downloads the raw data and creates the "Ground Truth" files.

import os
import tarfile
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

# 1. Download the Original JHU Dataset
url = "https://www.cs.jhu.edu/~mdredze/datasets/sentiment/domain_sentiment_data.tar.gz"
filename = "domain_sentiment_data.tar.gz"

if not os.path.exists(filename):
    print("Downloading raw dataset... (approx 50MB)")
    urllib.request.urlretrieve(url, filename)
    print("Download complete.")

# 2. Extract Data
if not os.path.exists("sorted_data_acl"):
    print("Extracting files...")
    with tarfile.open(filename, "r:gz") as tar:
        tar.extractall()
    print("Extraction complete.")

# 3. Parse & Create Clean CSVs (The Logic Member 1 missed)
def parse_review_file(file_path, label_value):
    """Parse a pseudo-XML review file and return its review texts.

    `label_value` is not used here; it is kept so existing calls keep
    working. Labels are assigned when the DataFrames are built below.
    """
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        content = f.read()

    # The files are pseudo-XML; the lenient "html.parser" backend tolerates them.
    soup = BeautifulSoup(content, "html.parser")

    # Each review body sits inside a <review_text> tag.
    return [r.get_text().strip() for r in soup.find_all("review_text")]

categories = ['books', 'dvd', 'electronics', 'kitchen_&_housewares']
dfs = []

print("\nProcessing raw files into Clean CSVs...")
for cat in categories:
    # Path to positive and negative files
    pos_path = os.path.join("sorted_data_acl", cat, "positive.review")
    neg_path = os.path.join("sorted_data_acl", cat, "negative.review")

    # Parse
    pos_reviews = parse_review_file(pos_path, 1)
    neg_reviews = parse_review_file(neg_path, 0)

    # Create DataFrame
    df_pos = pd.DataFrame({'review': pos_reviews, 'label': 1, 'category': cat})
    df_neg = pd.DataFrame({'review': neg_reviews, 'label': 0, 'category': cat})

    # Combine & Save
    df_cat = pd.concat([df_pos, df_neg], ignore_index=True)
    csv_name = f"{cat}_clean.csv"
    df_cat.to_csv(csv_name, index=False)

    dfs.append(df_cat)
    print(f"✅ Created {csv_name} ({len(df_cat)} rows: {len(df_pos)} pos, {len(df_neg)} neg)")

# Combine all for training
full_data = pd.concat(dfs, ignore_index=True)
print(f"\nTotal Dataset Size: {len(full_data)} rows")
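
Once the CSVs are written, it can be useful to confirm the label counts from the files themselves rather than from the in-memory DataFrames. The following is a minimal sketch, assuming the `review`, `label`, and `category` columns written above. It uses only the standard library, and `summarize_labels` and `sample_clean.csv` are hypothetical names used for illustration.

```python
import csv
import os
import tempfile
from collections import Counter


def summarize_labels(csv_path):
    """Count rows per (category, label) and count empty review texts."""
    counts = Counter()
    empty = 0
    # newline="" lets the csv module handle quoted multi-line reviews correctly.
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[(row["category"], int(row["label"]))] += 1
            if not row["review"].strip():
                empty += 1
    return counts, empty


# Demo on a tiny hypothetical file with the same columns as the *_clean.csv outputs.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "sample_clean.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["review", "label", "category"])
        writer.writerow(["Great read, would recommend.", 1, "books"])
        writer.writerow(["Fell apart after a week.", 0, "books"])
        writer.writerow(["", 0, "books"])
    counts, empty = summarize_labels(demo_path)

print(dict(counts), empty)
```

Pointing `summarize_labels` at one of the generated files (for example `books_clean.csv`) would show whether both labels are present for that category and whether any reviews were extracted as empty strings.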