# Lab 9: Final Project Architecture Workshop (Blueprint)
Date generated: 2025-08-21

Use this as your *structured design doc*. Replace all placeholders with your team’s plan.

## Section 1 — Business Problem (1 paragraph)
We will build a real-time social media sentiment analysis system that helps marketing and product teams detect changes in public sentiment about our top product within minutes. The system will surface urgent negative trends for rapid response, automatically tag high-impact posts (such as ones with high follower counts or high engagement), and provide historical context so product managers can prioritize bug fixes, feature improvements, and PR responses.

## Section 2 — Data Sources (batch + streaming) with access details
Batch (historical):
historical_twitter_100k.csv — 100,000 historical tweets collected between 2023-01-01 and 2024-12-31. Columns: post_id, created_at (ISO 8601 UTC), user_id, text, label_sentiment (negative|neutral|positive), platform, language, retweet_count, like_count, reply_count, user_followers, user_account_age_days.
Stored in: Cloud Storage bucket gs://final-project-aurora-data/historical/.
Streaming (live):
Twitter/X API v2 — filtered stream (Bearer token): provides live public tweets matching brand-related rules. Stream fields: id, created_at, text, author_id, lang, public_metrics (retweets, likes, replies), geo (where available).
Optional vendor stream (Brandwatch / Meltwater / Mention): webhook pushing brand mentions to our ingestion endpoint for broader coverage beyond Twitter.

Compliance and privacy: we will store and process only public posts and will follow each platform's TOS. Any PII (personal data in text) will be redacted before persistent storage and flagged for manual review if required.


## Section 3 — Cloud Architecture (diagram + narrative)
ASCII draft:
[Twitter/X API] -> [Cloud Run service twitter-ingest] -> [Pub/Sub topic] -> [Dataflow job] -> [BigQuery] -> [Looker Studio]

GCP Project & Resources (concrete names used in our build):

GCP Project ID: final-sentiment-aurora-2025

Cloud Storage bucket: gs://final-project-aurora-data

BigQuery dataset: final_project_aurora

Pub/Sub topic: projects/final-sentiment-aurora-2025/topics/twitter-stream

Dataflow job name prefix: aurora-dataflow-.

Ingest (Streaming + Batch):

Historical CSV historical_twitter_100k.csv uploaded to gs://final-project-aurora-data/historical/ and loaded once into BigQuery table final_project_aurora.raw_twitter_posts using a one-time bq load job.
Live tweets arrive via the Twitter filtered stream. A lightweight ingestion service (twitter-ingest) runs on Cloud Run; it holds the persistent stream connection (or receives webhook events from vendor streams) and publishes each incoming JSON message to the Pub/Sub topic twitter-stream.

Stream buffer & pre-processing:

Pub/Sub acts as a durable buffer and decouples ingestion from processing.
Topics: twitter-stream (primary) and twitter-stream-deadletter (DLQ).

The Dataflow (Apache Beam, streaming) job aurora-dataflow-clean subscribes to twitter-stream.

Responsibilities:

Validate incoming JSON schema and route malformed messages to the DLQ.
Normalize timestamps to UTC and parse language tags.
Compute lightweight features: text_length, has_hashtag (bool), has_mention (bool), has_url (bool), hashtags_count, mentions_count, is_retweet.
Enrich with user metadata if available (followers count, account age), and compute engagement_score = retweet_count*1 + reply_count*1.5 + like_count*0.5.
Batch and buffer inserts to BigQuery for cost-effectiveness.
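
The feature and engagement_score steps listed above can be sketched in plain Python. This is a minimal illustration, not the Dataflow transform itself: `compute_features`, the regular expressions, and the "RT @" prefix rule for `is_retweet` are assumptions made for the example, while the weights come from the engagement_score formula.

```python
import re

# Weights taken from the engagement_score formula in this section.
RETWEET_WEIGHT, REPLY_WEIGHT, LIKE_WEIGHT = 1.0, 1.5, 0.5

# Assumed patterns; real tokenization rules may differ.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def compute_features(post: dict) -> dict:
    """Derive the lightweight features for one post (hypothetical helper)."""
    text = post.get("text", "")
    hashtags = HASHTAG_RE.findall(text)
    mentions = MENTION_RE.findall(text)
    engagement_score = (
        post.get("retweet_count", 0) * RETWEET_WEIGHT
        + post.get("reply_count", 0) * REPLY_WEIGHT
        + post.get("like_count", 0) * LIKE_WEIGHT
    )
    return {
        "text_length": len(text),
        "has_hashtag": bool(hashtags),
        "has_mention": bool(mentions),
        "has_url": URL_RE.search(text) is not None,
        "hashtags_count": len(hashtags),
        "mentions_count": len(mentions),
        # Assumption: retweets are recognized by the classic "RT @" prefix.
        "is_retweet": text.startswith("RT @"),
        "engagement_score": engagement_score,
    }


example = compute_features({
    "text": "RT @aurora: love the new update! #aurora https://example.com",
    "retweet_count": 10, "reply_count": 4, "like_count": 20,
})
print(example["engagement_score"])  # 10*1 + 4*1.5 + 20*0.5 = 26.0
```

In a Beam pipeline a function like this would typically be wrapped in a map step, so each message is enriched independently before the BigQuery write.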

ML enrichment & scoring:
Primary sentiment scoring happens in two parallel paths:
Low-latency path: Dataflow calls a lightweight sentiment model, trained with BQML and served from a prediction endpoint (in BigQuery or on Cloud Run), to score every message immediately. These predictions feed the dashboard and alerts. Scoring inside our own project avoids external API rate limits.
High-fidelity path: A sampled subset (configurable, default 20%) plus all high-impact posts (e.g., user_followers > 10000 or engagement_score > 50) are sent to the Cloud Natural Language API for richer sentiment scores and entity extraction. Results are merged back into BigQuery for model training and deep analysis.
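
A sketch of how the high-fidelity routing decision could look, assuming each record carries `post_id`, `user_followers`, and `engagement_score`. The function names and the hash-based sampling scheme are illustrative assumptions; the thresholds and the 20% default come from the description above.

```python
import hashlib

SAMPLE_RATE = 0.20            # default sampling rate from the plan
FOLLOWER_THRESHOLD = 10_000   # user_followers > 10000
ENGAGEMENT_THRESHOLD = 50     # engagement_score > 50


def is_high_impact(post: dict) -> bool:
    """True when the post meets either high-impact rule."""
    return (
        post.get("user_followers", 0) > FOLLOWER_THRESHOLD
        or post.get("engagement_score", 0) > ENGAGEMENT_THRESHOLD
    )


def in_sample(post_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: map the post id to [0, 1) and compare to rate."""
    digest = hashlib.sha256(post_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # 32-bit prefix scaled to [0, 1)
    return bucket < rate


def route_to_high_fidelity(post: dict) -> bool:
    """Send high-impact posts always, and everything else at the sample rate."""
    return is_high_impact(post) or in_sample(str(post["post_id"]))


print(route_to_high_fidelity({"post_id": "42", "user_followers": 25_000}))
```

Hashing the id rather than drawing a random number keeps the decision reproducible, so a replayed message lands on the same path both times.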

Storage & serving:
Processed streaming records written to BigQuery table final_project_aurora.processed_tweets (partitioned by DATE(created_at) and clustered by label_sentiment, platform). Use streaming inserts with fallback to batched load via Cloud Storage for resilience.
Historical CSV data is initially loaded into final_project_aurora.raw_twitter_posts and then normalized into the processed table schema.

Modeling & retraining:
Use BigQuery ML to train the baseline classification model final_project_aurora.models.sentiment_bqml_v1.
Model retraining cadence: weekly (every Sunday 02:00 UTC) triggered by Cloud Scheduler invoking a Cloud Function which executes a CREATE OR REPLACE MODEL BigQuery job. Retrain earlier if data drift is detected.
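
As an illustration of the statement the scheduled function might submit, the sketch below composes a CREATE OR REPLACE MODEL query as a string. `build_retrain_sql` is a hypothetical helper, the feature subset is chosen only for the example, and the model path is written in two-part dataset.model form as an assumption.

```python
def build_retrain_sql(model, source_table, label, features):
    """Compose a CREATE OR REPLACE MODEL statement for a BQML boosted tree classifier."""
    select_list = ",\n  ".join(list(features) + [label])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        "OPTIONS (\n"
        "  model_type = 'BOOSTED_TREE_CLASSIFIER',\n"
        f"  input_label_cols = ['{label}']\n"
        ")\n"
        "AS\n"
        "SELECT\n"
        f"  {select_list}\n"
        f"FROM `{source_table}`\n"
        f"WHERE {label} IS NOT NULL"
    )


sql = build_retrain_sql(
    model="final_project_aurora.sentiment_bqml_v1",  # assumed dataset.model path
    source_table="final_project_aurora.processed_tweets",
    label="label_sentiment",
    features=["text_length", "hashtags_count", "mentions_count",
              "user_followers", "engagement_score", "platform", "language"],
)
print(sql)
```

The Cloud Function would then submit this string as a BigQuery query job. Keeping the SQL in a builder function makes it easy to unit test the feature list without touching BigQuery.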

Visualization & alerting:
Looker Studio dashboard named Aurora Sentiment Monitor connects directly to BigQuery (final_project_aurora.processed_tweets) for live visualizations.
Cloud Monitoring / Alerting: alert rules defined for operational metrics (Dataflow lag > 5 minutes, Pub/Sub unacked messages > threshold) and business KPIs (negative % in 15-min window > 30%). Alerts sent to Slack channel #aurora-alerts and PagerDuty.
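
The business KPI rule (negative % in a 15-minute window > 30%) can be expressed as a small check. This is a sketch under assumptions: records are dicts with a timezone-aware `created_at` and a `label_sentiment` field, and the function names are hypothetical. In production the same logic would more likely live in a scheduled query or a Dataflow window.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=15)
NEGATIVE_SHARE_THRESHOLD = 0.30  # 30% threshold from the alert rule


def negative_share(posts, now):
    """Fraction of posts labelled negative within the trailing window."""
    recent = [p for p in posts if now - WINDOW <= p["created_at"] <= now]
    if not recent:
        return 0.0  # no traffic in the window, nothing to alert on
    negatives = sum(1 for p in recent if p["label_sentiment"] == "negative")
    return negatives / len(recent)


def should_alert(posts, now):
    """True when the negative share strictly exceeds the threshold."""
    return negative_share(posts, now) > NEGATIVE_SHARE_THRESHOLD


now = datetime(2025, 8, 21, 12, 0, tzinfo=timezone.utc)
posts = (
    [{"created_at": now - timedelta(minutes=5), "label_sentiment": "negative"}] * 4
    + [{"created_at": now - timedelta(minutes=10), "label_sentiment": "positive"}] * 6
    # Older than 15 minutes, so it is ignored by the window filter.
    + [{"created_at": now - timedelta(hours=2), "label_sentiment": "negative"}]
)
print(negative_share(posts, now))  # 4 of the 10 recent posts are negative
print(should_alert(posts, now))
```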

Concrete example BigQuery table names and model names:

final_project_aurora.raw_twitter_posts (raw records)

final_project_aurora.processed_tweets (cleaned+enriched records)

final_project_aurora.labels.manual_labeled (human-labeled samples for evaluation)



## Section 4 — ML Plan (BQML): target, features, metric, scoring mode
Problem formulation:

Primary task: Multiclass classification to predict label_sentiment in {negative, neutral, positive} for each post.

Secondary task (optional): Regression to predict continuous sentiment score in [-1,1] if we want finer-grained scoring.

Training data and split:

Use the 100k historical tweets (historical_twitter_100k.csv) as the primary labeled dataset.

Supplement training data with human-labeled streaming samples kept in labels.manual_labeled (target 5,000 labeled streaming samples gathered over the first 6 weeks).

Train/val/test split: 80/10/10.
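
One way to make the 80/10/10 split reproducible is to derive it from a hash of post_id instead of a random draw. The sketch below is an assumption about how the split could be assigned, not a description of the team's pipeline; `assign_split` is a hypothetical name.

```python
import hashlib
from collections import Counter


def assign_split(post_id: str) -> str:
    """Map a post id to train/val/test with an 80/10/10 ratio."""
    bucket = int(hashlib.md5(post_id.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 8:
        return "train"  # buckets 0-7
    if bucket == 8:
        return "val"    # bucket 8
    return "test"       # bucket 9


# The same id always lands in the same split, even across reruns.
split_counts = Counter(assign_split(str(i)) for i in range(10_000))
print(split_counts)
```

Inside BigQuery the same idea is commonly written with `MOD(ABS(FARM_FINGERPRINT(post_id)), 10)`, which keeps the split stable when new rows are appended.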

Features (final, concrete list):

text_length (int)

hashtags_count (int)

mentions_count (int)

user_followers (int)

engagement_score (float)

platform (string / categorical)

language (string / categorical)

hour_of_day (int)

day_of_week (int)

has_negative_keyword (boolean)

avg_sentiment_last_15m (float) — rolling feature computed by scheduled SQL or within Dataflow windowing (optional)

embedding_vector (REPEATED FLOAT) — optional; precomputed embeddings stored in a separate table and joined if required.

Model selection & evaluation:

Baseline: boosted_tree_classifier in BQML.

Evaluation metrics: accuracy, precision/recall per class, macro F1, and confusion matrix. Store eval results in final_project_aurora.model_evals.
Use ML.GLOBAL_EXPLAIN (model-level) or ML.EXPLAIN_PREDICT (per-prediction) to surface the top features driving predictions.
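
For reference, the per-class precision/recall, macro F1, and confusion matrix named above can be computed as follows. This is a dependency-free sketch for checking numbers by hand; the function names are illustrative, and BQML's own evaluation functions would normally produce these metrics.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]


def confusion_matrix(y_true, y_pred):
    """Rows are true labels, columns are predicted labels, in LABELS order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in LABELS] for t in LABELS]


def per_class_scores(y_true, y_pred):
    """Precision, recall, and F1 for each class."""
    matrix = confusion_matrix(y_true, y_pred)
    scores = {}
    for i, label in enumerate(LABELS):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as label, but wrong
        fn = sum(matrix[i]) - tp                 # truly label, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(s["f1"] for s in scores.values()) / len(LABELS)


y_true = ["negative", "negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "neutral", "neutral", "neutral", "positive", "negative"]
print(confusion_matrix(y_true, y_pred))  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(round(macro_f1(y_true, y_pred), 4))
```

Macro F1 weights the three classes equally, which matters here because negative posts are the ones the alerts depend on, even if they are a minority of the data.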

Retraining & deployment:
Retrain weekly, or sooner when drift is detected (automatic detection via statistical tests that compare recent distributions against baseline distributions stored in BigQuery).
Deploy predictions by writing to table final_project_aurora.predictions_streaming and exposing a prediction summary table optimized for Looker Studio.
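
A possible drift check for a categorical feature such as language, using KL divergence between the recent and baseline distributions. The smoothing constant and the 0.1 trigger value are assumptions picked for the example and would need tuning on real data.

```python
import math
from collections import Counter

DRIFT_THRESHOLD = 0.1  # hypothetical trigger value, to be tuned


def distribution(values, categories, smoothing=1e-6):
    """Smoothed relative frequencies, so unseen categories never yield log(0)."""
    counts = Counter(values)
    total = len(values) + smoothing * len(categories)
    return {c: (counts[c] + smoothing) / total for c in categories}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same categories."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)


categories = ["en", "es"]
baseline = distribution(["en"] * 80 + ["es"] * 20, categories)
recent = distribution(["en"] * 40 + ["es"] * 60, categories)

drift = kl_divergence(recent, baseline)
print(round(drift, 3))          # roughly 0.382 for this shift
print(drift > DRIFT_THRESHOLD)  # this shift would trigger a retrain
```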



## Section 5 — Dashboard KPIs (3–5) with definitions and SQL sources
Dashboard name: Aurora Sentiment Monitor

1. Real-time Sentiment Score (15-min rolling average): rolling mean sentiment score (or % positive minus % negative) with 1–5 minute latency. Alert threshold: negative % in last 15 minutes > 30% triggers a page-level alert and Slack notification.
2. Mentions Volume (posts/hour): total incoming posts mentioning tracked keywords; helps separate volume spikes from sentiment shifts. Example alert: mentions/hour increases by >300% relative to baseline.
3. Sentiment Distribution: stacked bar / donut showing % positive / neutral / negative for the selected time window.
4. Top Negative Themes: table of the top 10 keywords/entities associated with negative posts (counts and examples).
5. Model Health & Data Drift: model accuracy measured on labeled streaming samples, plus a drift indicator (e.g., KL divergence for text length or language distribution vs baseline).
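
The Mentions Volume example alert can be sketched as below. The baseline here is assumed to be the mean of preceding hourly counts, which the plan does not specify, and the function names are hypothetical. Note that an increase of more than 300% means the current hour exceeds four times the baseline.

```python
from statistics import mean

SPIKE_THRESHOLD = 3.0  # an increase of 300% over baseline


def volume_increase(current_count, baseline_counts):
    """Relative increase of the current hour over the mean baseline hour."""
    baseline = mean(baseline_counts)
    if baseline == 0:
        return float("inf") if current_count else 0.0
    return (current_count - baseline) / baseline


def is_volume_spike(current_count, baseline_counts):
    """True when mentions/hour rose by more than 300% relative to baseline."""
    return volume_increase(current_count, baseline_counts) > SPIKE_THRESHOLD


baseline_hours = [100, 120, 80]              # mean of 100 posts/hour
print(is_volume_spike(450, baseline_hours))  # +350% over baseline
print(is_volume_spike(350, baseline_hours))  # +250% over baseline
```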


## Section 6 — Risks & Mitigations (Devil’s Advocate) + Prompt
Biggest risk: Relying on external streaming APIs (Twitter/X) and the Natural Language API for high-fidelity scoring creates two correlated single points of failure: (1) data coverage and continuity risk (API outages, access revocations, or rule pruning), and (2) cost/latency risk from third-party scoring at scale.


## Section 7 — Milestones & Ownership (30/60/90 or weekly)
Week 1 (Day 1–7): Foundation & Data Setup

Finalize project objective and research question

Confirm dataset (historical_twitter_100k.csv)

Import data, perform cleaning, create train/test split

Initial EDA: summary statistics, target distribution, correlation checks
Owner: Analyst A

Week 2 (Day 8–14): Modeling & Baselines

Build logistic regression baseline

Fit KNN, Decision Tree, and Random Forest models

Evaluate accuracy, AUC, confusion matrices

Begin tuning hyperparameters
Owner: Analyst B

Week 3 (Day 15–21): Optimization & Interpretability

Hyperparameter tuning across all models

Finalize best-performing model

Create model interpretation outputs (feature importance, SHAP, charts)

Draft visuals for presentation
Owner: Analyst A

Week 4 (Day 22–30): Finalization & Deliverables

Finalize presentation slides

Write executive summary

Produce final report (methods, results, recommendations)

Prepare 8-minute presentation script
Owner: Analyst B