# Homework 2: TechMart - QuickBuy Acquisition Data Integration

**Analyst:** [Your Name Here]  
**Due:** Day 3, Start of Class  
**Total Points:** 100  
**Deadline Context:** Board meeting Wednesday 9 AM - we need this by Tuesday EOD!

---

## 🏢 Executive Summary

TechMart has acquired QuickBuy for $12M. Their product catalog (194 products, 582 reviews) is trapped in nested JSON from their NoSQL database. 

**Your mission:** Transform this data into clean, normalized tables for our SQL-based analytics warehouse before tomorrow's board meeting.

**Business Impact:**
- $2.5M inventory decision (which product lines to keep)
- Marketing budget allocation based on engagement
- Customer satisfaction benchmarking
- Integration roadmap for 50 developers

---

## 📊 Communication Framework

**Remember:** You're not just processing data - you're informing $12M worth of business decisions!

For each analysis section, consider:
- **What** does the data show? (facts)
- **So what** does it mean? (interpretation)
- **Now what** should we do? (recommendation)

Different stakeholders need different information:
- **Board/CEO:** Strategic decisions, risks, timeline
- **CMO:** Customer insights, engagement patterns
- **Product Team:** Feature priorities, development roadmap
- **Engineering:** Technical specifications, integration complexity
- **Data Quality:** Risk assessment, monitoring needs

---

## Instructions

1. Complete all TODO sections below
2. **Add stakeholder communications where marked** (critical for grade!)
3. Ensure all assertions pass (data quality is critical!)
4. Before submitting: **Kernel → Restart & Run All Cells**
5. Verify all outputs are visible
6. Rename file to `hw2_[your_name].ipynb`

**Read the README.md for full business context, requirements, and grading rubric!**

---

## Setup

Run these cells to set up your analysis environment.

In [None]:
# Install required packages (if needed)
!pip install duckdb pandas -q

In [None]:
# Import libraries
import json
import pandas as pd
import duckdb
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"📅 Analysis date: {datetime.now().strftime('%Y-%m-%d')}")
print("⏰ Remember: Board meeting is Wednesday 9 AM!")

In [None]:
# Connect to DuckDB (our data warehouse)
con = duckdb.connect(':memory:')
print("✅ Connected to TechMart Data Warehouse (DuckDB)!")

---

## Part 1: Data Ingestion & Exploration (15 points)

**Context:** The Head of Analytics just asked: *"What exactly did we buy? I need to understand QuickBuy's data structure before we integrate."*

Let's explore what QuickBuy's JSON export contains.

### Question 1.1: Load the JSON Data (3 points)

**Business Context:** First, we need to load QuickBuy's product catalog export.

**Requirements:**
- Load the JSON file from `data/products.json`
- Store the products array in a variable called `products`
- Print the total number of products
- Show the data structure type
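
The loading pattern can be tried on a toy file first. This is a minimal sketch, not the solution: the file contents, the temporary path, and the assumption that the array sits under a top-level `products` key are all illustrative, so inspect the real export's top-level keys before relying on them.

```python
import json
import os
import tempfile

# Hypothetical miniature export; the real file lives at data/products.json.
sample_export = {
    "products": [{"id": 1, "title": "Sample A"}, {"id": 2, "title": "Sample B"}],
    "total": 2,
}

# Write the toy export to a temporary file so json.load() has something to read.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample_export, tmp)
    tmp_path = tmp.name

# json.load() parses an open file handle into Python dicts and lists.
with open(tmp_path, "r", encoding="utf-8") as f:
    toy_data = json.load(f)
os.remove(tmp_path)

# Assumption: the product array is stored under the "products" key.
toy_products = toy_data["products"]
print(f"Total products: {len(toy_products)}")
print(f"Data structure type: {type(toy_products).__name__}")
```

Printing `toy_data.keys()` before indexing is a quick way to confirm where the array actually lives.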

In [None]:
# TODO: Load the JSON file
# Hint: Use json.load() with open()

# with open('data/products.json', 'r') as f:
#     data = ...

# products = ...

# TODO: Print summary for the Head of Analytics
# print(f"📊 QuickBuy Product Catalog Summary:")
# print(f"Total products acquired: ...")
# print(f"Data structure type: ...")


### Question 1.2: Explore the Structure (4 points)

**Business Context:** The CFO wants to know: *"How many customer reviews are we inheriting? This affects our customer insights strategy."*

**Requirements:**
- Display all keys from the first product
- Count the total number of reviews across ALL products
- Find which product has the most reviews (show id and title)
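
The counting and "most reviewed" steps can also be written without explicit loops. The sketch below uses hypothetical records shaped like the requirements describe (an `id`, a `title`, and a `reviews` list); it shows one possible approach, not the expected answer.

```python
# Hypothetical records: each product carries a list of review dicts.
toy_products = [
    {"id": 1, "title": "Sample A", "reviews": [{"rating": 5}, {"rating": 3}]},
    {"id": 2, "title": "Sample B", "reviews": [{"rating": 4}]},
    {"id": 3, "title": "Sample C", "reviews": []},
]

# Sum the length of every product's review list; .get() tolerates a missing key.
toy_total_reviews = sum(len(p.get("reviews", [])) for p in toy_products)

# max() with a key function returns the product whose review list is longest.
toy_most_reviewed = max(toy_products, key=lambda p: len(p.get("reviews", [])))

print(f"Total reviews: {toy_total_reviews}")
print(f"Most reviewed: {toy_most_reviewed['id']} / {toy_most_reviewed['title']}")
```

Note that `max()` returns the first product it encounters when several tie for the top count.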

In [None]:
# TODO: Show structure of first product
print("🔍 First product structure:")
# print("Keys:", ...)

# TODO: Count total reviews
total_reviews = 0
# for product in products:
#     total_reviews += ...

print(f"\n💬 Total customer reviews in QuickBuy data: {total_reviews}")

# TODO: Find product with most reviews
max_reviews = 0
most_reviewed_product = None
# for product in products:
#     if len(product['reviews']) > max_reviews:
#         ...

# print(f"\n🏆 Most reviewed product:")
# print(f"   ID: {most_reviewed_product['id']}")
# print(f"   Title: {most_reviewed_product['title']}")
# print(f"   Review count: {max_reviews}")

### Question 1.3: Identify Nested Elements (4 points)

**Business Context:** The BI Team Lead says: *"I need to know what's nested so we can plan the normalization. Our Tableau dashboards expect flat tables."*

**Requirements:**
- List all fields that contain nested objects (dict type)
- List all fields that contain arrays (list type)
- Document which fields need normalization
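
Inspecting only the first product can miss a field that is nested in some records and absent in others. A hedged sketch of scanning every record instead, using hypothetical toy data (the field names are illustrative):

```python
from collections import defaultdict

# Hypothetical records; note that "meta" only appears on the second one.
toy_products = [
    {"id": 1, "title": "Sample A", "dimensions": {"width": 1.0}, "tags": ["x"]},
    {"id": 2, "title": "Sample B", "dimensions": {"width": 2.0}, "tags": [],
     "meta": {"barcode": "123"}},
]

# Map each field name to the set of Python type names seen across all records.
toy_field_types = defaultdict(set)
for product in toy_products:
    for key, value in product.items():
        toy_field_types[key].add(type(value).__name__)

# A field needs flattening or normalizing if it is ever a dict or a list.
toy_nested_objects = sorted(k for k, t in toy_field_types.items() if "dict" in t)
toy_array_fields = sorted(k for k, t in toy_field_types.items() if "list" in t)

print(f"Nested objects: {toy_nested_objects}")
print(f"Array fields: {toy_array_fields}")
```

A field that maps to more than one type name is also worth flagging for the data quality discussion.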

In [None]:
# TODO: Analyze first product to identify nested structures
sample_product = products[0]

nested_objects = []
array_fields = []
simple_fields = []

# TODO: Categorize each field
# for key, value in sample_product.items():
#     if isinstance(value, dict):
#         nested_objects.append(key)
#     elif isinstance(value, list):
#         array_fields.append(key)
#     else:
#         simple_fields.append(key)

print("📋 Data Structure Analysis for BI Team:")
print(f"\n🗂️ Nested objects to flatten: {nested_objects}")
print(f"📚 Array fields to normalize: {array_fields}")
print(f"✅ Simple fields (ready to use): {simple_fields[:5]}...")  # Show first 5

### Question 1.4: Data Quality Check (4 points)

**Business Context:** The Head of Data Quality warns: *"QuickBuy's last acquisition failed due to poor data quality. Check for any missing critical fields!"*

**Requirements:**
- Check if any products are missing 'id', 'title', or 'price'
- Count unique product categories
- Verify all products have at least one review

In [None]:
# TODO: Check for missing critical fields
missing_critical = []
# for product in products:
#     if 'id' not in product or 'title' not in product or 'price' not in product:
#         missing_critical.append(product.get('id', 'NO_ID'))

# TODO: Count unique categories
categories = set()
# for product in products:
#     categories.add(...)

# TODO: Verify all products have reviews
products_without_reviews = []
# for product in products:
#     if len(product.get('reviews', [])) == 0:
#         ...

print("✅ Data Quality Report:")
print(f"\n🔍 Products missing critical fields: {len(missing_critical)}")
print(f"📂 Unique categories: {len(categories)}")
print(f"💬 Products without reviews: {len(products_without_reviews)}")

# TODO: List categories for executive review
# print(f"\n📊 Categories for board review: {sorted(categories)}")

### 📝 Stakeholder Communication: Initial Assessment

**TODO: Brief the Head of Analytics on QuickBuy's data (3-4 sentences)**

Consider:
- Overall data quality assessment
- Complexity of the integration task
- Any immediate red flags or pleasant surprises
- Estimated effort for normalization

[Write your assessment here for the Head of Analytics]

---

## Part 2: Data Normalization (35 points)

**Context:** The BI Team Lead just called: *"I need this data in three clean tables by end of day. Our dashboards are waiting!"*

Transform QuickBuy's nested JSON into normalized relational tables.

### Question 2.1: Create Products Table (12 points)

**Business Context:** Create the main products table for inventory analysis.

**Requirements:**
- Flatten `dimensions` object to width, height, depth columns
- Flatten `meta` object to created_at, updated_at, barcode, qr_code columns
- Drop nested columns (dimensions, meta, reviews, tags, images)
- Convert price to float, stock to int
- Parse created_at and updated_at as datetime
- Result: DataFrame with 25 columns and 194 rows
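
As an alternative to flattening each field with `.apply`, `pd.json_normalize` can expand nested dicts in one call. The sketch below runs on hypothetical records; the key name `createdAt` inside `meta` is an assumption about how the export spells it, so adjust the rename map to the real keys.

```python
import pandas as pd

# Hypothetical records; the key names inside "meta" are an assumption.
toy_products = [
    {"id": 1, "price": 10, "stock": 5.0,
     "dimensions": {"width": 10.0, "height": 2.5, "depth": 1.0},
     "meta": {"createdAt": "2024-05-23T08:56:21.618Z"}},
    {"id": 2, "price": 20, "stock": 0.0,
     "dimensions": {"width": 4.0, "height": 4.0, "depth": 4.0},
     "meta": {"createdAt": "2024-05-24T08:56:21.618Z"}},
]

# json_normalize expands nested dicts into prefixed columns, e.g. dimensions_width.
toy_flat = pd.json_normalize(toy_products, sep="_")

# Rename the prefixed columns to the flat names the target table uses.
toy_flat = toy_flat.rename(columns={
    "dimensions_width": "width",
    "dimensions_height": "height",
    "dimensions_depth": "depth",
    "meta_createdAt": "created_at",
})

# Enforce the required types: float price, integer stock, datetime timestamps.
toy_flat["price"] = toy_flat["price"].astype(float)
toy_flat["stock"] = toy_flat["stock"].astype(int)
toy_flat["created_at"] = pd.to_datetime(toy_flat["created_at"])

print(toy_flat.dtypes)
```

List-valued fields such as reviews or tags are left as object columns by `json_normalize`, so they still need to be dropped from the products table and normalized separately.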

In [None]:
# TODO: Create products DataFrame
products_df = pd.DataFrame(products)

# TODO: Flatten dimensions (width, height, depth)
# Hint: products_df['width'] = products_df['dimensions'].apply(lambda x: x.get('width', None))


# TODO: Flatten meta (created_at, updated_at, barcode, qr_code)


# TODO: Drop nested columns that we'll normalize separately
# columns_to_drop = ['dimensions', 'meta', 'reviews', 'tags', 'images']
# products_df = products_df.drop(columns=columns_to_drop)

# TODO: Fix data types
# products_df['price'] = products_df['price'].astype(float)
# products_df['stock'] = ...
# products_df['created_at'] = pd.to_datetime(...)

# TODO: Verify shape and display info
print("📊 Products Table Created:")
print(f"Shape: {products_df.shape}")
print(f"\nFirst 3 products:")
products_df.head(3)

### Question 2.2: Create Reviews Table (12 points)

**Business Context:** The CMO needs customer sentiment analysis: *"Extract all reviews so we can analyze satisfaction by product category."*

**Requirements:**
- Extract reviews from each product
- Maintain product_id as foreign key
- Generate review_id as primary key (1, 2, 3...)
- Parse review dates as datetime
- Include: review_id, product_id, rating, comment, date, reviewer_name, reviewer_email
- Result: DataFrame with 582 rows and 7 columns
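
The parent-to-child extraction can also be expressed with `pd.json_normalize` and its `record_path` and `meta` parameters, which carry the parent `id` onto every child row. This is a sketch on hypothetical data with a reduced set of review fields; it assumes every product has a `reviews` key, because a missing key raises a `KeyError` in this form.

```python
import pandas as pd

# Hypothetical records with a reduced set of review fields.
toy_products = [
    {"id": 1, "reviews": [
        {"rating": 5, "comment": "Great", "date": "2024-05-23T08:56:21.618Z"},
        {"rating": 2, "comment": "Poor", "date": "2024-05-24T08:56:21.618Z"},
    ]},
    {"id": 2, "reviews": [
        {"rating": 4, "comment": "Good", "date": "2024-05-25T08:56:21.618Z"},
    ]},
]

# One output row per review; meta=["id"] copies the parent product id onto each row.
toy_reviews = pd.json_normalize(toy_products, record_path="reviews", meta=["id"])

# The copied id becomes the foreign key; meta columns arrive as object dtype.
toy_reviews = toy_reviews.rename(columns={"id": "product_id"})
toy_reviews["product_id"] = toy_reviews["product_id"].astype(int)

# Surrogate primary key starting at 1, placed as the first column.
toy_reviews.insert(0, "review_id", range(1, len(toy_reviews) + 1))
toy_reviews["date"] = pd.to_datetime(toy_reviews["date"])

print(toy_reviews)
```

The explicit loop in the cell below gives the same result and is easier to adapt when a nested field needs renaming along the way.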

In [None]:
# TODO: Extract all reviews with foreign key relationship
reviews_list = []

# for product in products:
#     product_id = product['id']
#     for review in product.get('reviews', []):
#         review_row = {
#             'product_id': product_id,
#             'rating': ...,
#             'comment': ...,
#             'date': ...,
#             'reviewer_name': ...,
#             'reviewer_email': ...
#         }
#         reviews_list.append(review_row)

# TODO: Create DataFrame and add review_id
# reviews_df = pd.DataFrame(reviews_list)
# reviews_df['review_id'] = range(1, len(reviews_df) + 1)

# TODO: Fix data types
# reviews_df['date'] = pd.to_datetime(reviews_df['date'])
# reviews_df['rating'] = reviews_df['rating'].astype(int)

# TODO: Reorder columns for clarity
# reviews_df = reviews_df[['review_id', 'product_id', 'rating', 'comment', 'date', 'reviewer_name', 'reviewer_email']]

print("💬 Reviews Table Created:")
print(f"Shape: {reviews_df.shape}")
print(f"Average rating: {reviews_df['rating'].mean():.2f}")
print(f"\nFirst 3 reviews:")
reviews_df.head(3)

### Question 2.3: Create Product Tags Table (11 points)

**Business Context:** The Marketing team needs this for SEO: *"We need to know which tags are associated with which products for our search optimization."*

**Requirements:**
- Extract product-tag relationships
- Create bridge table with product_id and tag
- One row per product-tag combination
- Result: DataFrame with 364 rows and 2 columns
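
A bridge table is one row per (product, tag) pair, which is what `DataFrame.explode` produces from a list column. The following is a sketch on hypothetical data, shown as an alternative to the nested loop rather than as the required approach.

```python
import pandas as pd

# Hypothetical records; product 3 has no tags at all.
toy_products = [
    {"id": 1, "tags": ["beauty", "mascara"]},
    {"id": 2, "tags": ["fruits"]},
    {"id": 3, "tags": []},
]

toy_tags = (
    pd.DataFrame(toy_products)[["id", "tags"]]
    .explode("tags")               # one row per list element
    .dropna(subset=["tags"])       # an empty list explodes to NaN; drop that row
    .rename(columns={"id": "product_id", "tags": "tag"})
    .reset_index(drop=True)
)

print(toy_tags)
```

Because products with an empty tag list contribute no rows, the bridge table's row count equals the total number of tags, not the number of products.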

In [None]:
# TODO: Extract product-tag relationships
tags_list = []

# for product in products:
#     product_id = product['id']
#     for tag in product.get('tags', []):
#         tags_list.append({
#             'product_id': ...,
#             'tag': ...
#         })

# TODO: Create DataFrame
# tags_df = pd.DataFrame(tags_list)

# TODO: Show tag statistics for marketing
print("🏷️ Product Tags Table Created:")
print(f"Shape: {tags_df.shape}")
print(f"Unique tags: {tags_df['tag'].nunique()}")
print(f"\nTop 5 most common tags:")
# tags_df['tag'].value_counts().head()

### 📝 Stakeholder Communication: Normalization Results

**TODO: Brief the BI Team on the normalization outcome (3-4 sentences)**

Consider:
- How many tables were created and their relationships
- Any data transformations or cleanups performed
- Readiness for Tableau integration
- Any limitations or caveats they should know

[Write your normalization summary for the BI Team]

---

## Part 3: Data Validation (20 points)

**Context:** The Head of Data Quality insists: *"QuickBuy's last merger failed because of duplicate records and broken relationships. Validate EVERYTHING!"*

Implement critical data quality checks.

### Question 3.1: Primary Key Validation (5 points)

Verify that our primary keys are unique (no duplicates).

In [None]:
# TODO: Check primary key uniqueness
print("🔑 Primary Key Validation:")

# Check products
# assert products_df['id'].is_unique, "❌ CRITICAL: Duplicate product IDs found!"
print("✅ Product IDs are unique")

# Check reviews
# assert reviews_df['review_id'].is_unique, "❌ CRITICAL: Duplicate review IDs found!"
print("✅ Review IDs are unique")

print("\n✨ All primary keys valid!")

### Question 3.2: Foreign Key Integrity (5 points)

Verify that all foreign keys point to valid primary keys.
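
When a foreign key check fails, it helps to see which rows are orphaned rather than only how many. One way to do that is a left merge with `indicator=True`, sketched here on hypothetical frames in which review 3 points at a product that does not exist.

```python
import pandas as pd

# Hypothetical tables: product 99 is referenced but never defined.
toy_products_df = pd.DataFrame({"id": [1, 2]})
toy_reviews_df = pd.DataFrame({"review_id": [1, 2, 3], "product_id": [1, 2, 99]})

# A left merge keeps every review; the _merge column records whether a parent matched.
toy_checked = toy_reviews_df.merge(
    toy_products_df, left_on="product_id", right_on="id", how="left", indicator=True
)

# Rows marked "left_only" have no matching product, so they are orphans.
toy_orphans = toy_checked.loc[
    toy_checked["_merge"] == "left_only", ["review_id", "product_id"]
]

print(toy_orphans)
```

An empty result corresponds to the passing case of the `isin` assertion in the cell below.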

In [None]:
# TODO: Check foreign key relationships
print("🔗 Foreign Key Validation:")

# Check reviews -> products
# invalid_product_refs = ~reviews_df['product_id'].isin(products_df['id'])
# assert not invalid_product_refs.any(), f"❌ {invalid_product_refs.sum()} reviews reference non-existent products!"
print("✅ All reviews link to valid products")

# Check tags -> products
# invalid_tag_refs = ~tags_df['product_id'].isin(products_df['id'])
# assert not invalid_tag_refs.any(), f"❌ {invalid_tag_refs.sum()} tags reference non-existent products!"
print("✅ All tags link to valid products")

print("\n✨ All foreign keys valid!")

### Question 3.3: Data Type Validation (5 points)

Verify that critical columns have the correct data types.

In [None]:
# TODO: Check data types
print("📊 Data Type Validation:")

# Check numeric types
# assert products_df['price'].dtype == 'float64', "❌ Price should be float"
# assert products_df['stock'].dtype in ['int64', 'int32'], "❌ Stock should be integer"
# assert reviews_df['rating'].dtype in ['int64', 'int32'], "❌ Rating should be integer"
print("✅ Numeric columns have correct types")

# Check datetime types
# assert pd.api.types.is_datetime64_any_dtype(products_df['created_at']), "❌ created_at should be datetime"
# assert pd.api.types.is_datetime64_any_dtype(reviews_df['date']), "❌ review date should be datetime"
print("✅ Date columns are properly parsed")

print("\n✨ All data types correct!")

### Question 3.4: Completeness Check (5 points)

Verify that no data was lost during transformation.

In [None]:
# TODO: Verify completeness
print("📈 Data Completeness Validation:")

# Count reviews in original JSON
original_review_count = sum(len(p['reviews']) for p in products)
# assert len(reviews_df) == original_review_count, f"❌ Review count mismatch! Original: {original_review_count}, Transformed: {len(reviews_df)}"
print(f"✅ All {original_review_count} reviews preserved")

# Count tags in original JSON
original_tag_count = sum(len(p['tags']) for p in products)
# assert len(tags_df) == original_tag_count, f"❌ Tag count mismatch!"
print(f"✅ All {original_tag_count} product-tag relationships preserved")

# Check product count
# assert len(products_df) == len(products), f"❌ Product count mismatch!"
print(f"✅ All {len(products)} products preserved")

print("\n✨ No data lost in transformation!")

### 📝 Stakeholder Communication: Data Quality Assessment

**TODO: Write a data quality summary for the Head of Data Quality (3-4 sentences)**

Consider:
- Overall quality score (excellent/good/concerning)
- Any red flags for the integration?
- What should we monitor going forward?
- Comparison to other acquisitions you've seen

[Write your data quality assessment here]

---

## Part 4: Database Persistence (10 points)

**Context:** The Data Engineering Lead says: *"Load this into DuckDB now. The overnight ETL jobs need these tables by midnight!"*

Persist the normalized data to our data warehouse.

### Question 4.1: Create Database Tables (5 points)

Load the normalized DataFrames into DuckDB.

In [None]:
# TODO: Load tables into DuckDB
print("🏗️ Creating database tables...")

# Register DataFrames with DuckDB
# con.register('products_staging', products_df)
# con.register('reviews_staging', reviews_df)
# con.register('tags_staging', tags_df)

# Create permanent tables
# con.execute("CREATE TABLE products AS SELECT * FROM products_staging")
# con.execute("CREATE TABLE reviews AS SELECT * FROM reviews_staging")
# con.execute("CREATE TABLE product_tags AS SELECT * FROM tags_staging")

print("✅ Tables created in TechMart Data Warehouse")

### Question 4.2: Verify Database Load (5 points)

Confirm that all data loaded correctly.

In [None]:
# TODO: Verify table creation and row counts
print("📊 Database Verification:")
print("=" * 40)

# Check products table
# product_count = con.execute("SELECT COUNT(*) FROM products").fetchone()[0]
# print(f"✅ Products table: {product_count} rows")

# Check reviews table
# review_count = con.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
# print(f"✅ Reviews table: {review_count} rows")

# Check product_tags table
# tag_count = con.execute("SELECT COUNT(*) FROM product_tags").fetchone()[0]
# print(f"✅ Product_tags table: {tag_count} rows")

print("\n📋 Sample data from each table:")

# Wrap each sample in print(): a notebook cell only auto-displays its last expression
# Show sample from products
# print("\nProducts (first 2):")
# print(con.execute("SELECT id, title, price, category FROM products LIMIT 2").df())

# Show sample from reviews
# print("\nReviews (first 2):")
# print(con.execute("SELECT review_id, product_id, rating, date FROM reviews LIMIT 2").df())

# Show sample from tags
# print("\nProduct Tags (first 5):")
# print(con.execute("SELECT * FROM product_tags LIMIT 5").df())

---

## Part 5: SQL Analysis - Board Questions (15 points)

**Context:** It's Tuesday afternoon. The CEO just called: *"I need answers to these specific questions for tomorrow's board meeting!"*

Use SQL to answer critical business questions.

## 📊 Analysis Communication Framework

**Remember:** The board doesn't want SQL - they want decisions!

For each analysis below:
1. **Run the query** to get the data
2. **Interpret the results** - what does it mean?
3. **Make a recommendation** - what should we do?
4. **Consider the audience** - tailor your message

### 📝 CEO Recommendation: Category Strategy

**TODO: What category strategy would you recommend to the board? (2-3 sentences)**

Consider:
- Which categories to prioritize/discontinue
- Resource allocation implications
- Risk vs. opportunity balance

[Write your category recommendation for the CEO/Board]

In [None]:
# TODO: Write SQL query for category analysis
query = """
-- CEO wants to know which categories to keep
SELECT 
    p.category,
    COUNT(DISTINCT p.id) as product_count,
    COUNT(r.review_id) as review_count,
    ROUND(AVG(r.rating), 2) as avg_rating
FROM products p
INNER JOIN reviews r ON p.id = r.product_id
GROUP BY p.category
ORDER BY avg_rating DESC
LIMIT 10
"""

# result = con.execute(query).df()
print("📊 Category Performance for Board Meeting:")
# result

### 📝 CMO Recommendation: Marketing Strategy

**TODO: What marketing strategy would you recommend based on engagement patterns? (2-3 sentences)**

Consider:
- Which products should feature in campaigns?
- What makes these products engaging?
- Cross-sell/upsell opportunities?
- Any surprising findings?

[Write your marketing strategy for the CMO]

### Question 5.2: High-Engagement Products (4 points)

**Board Question:** *"Which products generate the most customer engagement? These might be our marketing champions."*

**Requirements:**
- Find products with more than 3 reviews
- Show product title, review count, and average rating
- Use HAVING clause
- Order by review count DESC
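
`HAVING` filters groups after aggregation, whereas `WHERE` filters rows before it. The sketch below shows the clause shape on toy tables. It uses the standard-library `sqlite3` module purely so that it runs without the warehouse; the table names, the data, and the lower threshold of 2 are all illustrative, and the same `GROUP BY ... HAVING` form is standard SQL that can be run through `con.execute(...)` against the DuckDB tables.

```python
import sqlite3

# In-memory stand-in database with hypothetical toy tables.
toy_con = sqlite3.connect(":memory:")
toy_con.executescript("""
CREATE TABLE toy_products (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE toy_reviews (review_id INTEGER PRIMARY KEY, product_id INTEGER, rating INTEGER);
INSERT INTO toy_products VALUES (1, 'Sample A'), (2, 'Sample B');
INSERT INTO toy_reviews VALUES (1, 1, 5), (2, 1, 4), (3, 1, 3), (4, 2, 2);
""")

# Aggregate per product, then keep only the groups with more than 2 reviews.
toy_rows = toy_con.execute("""
SELECT p.title,
       COUNT(r.review_id) AS review_count,
       ROUND(AVG(r.rating), 2) AS avg_rating
FROM toy_products p
JOIN toy_reviews r ON p.id = r.product_id
GROUP BY p.id, p.title
HAVING COUNT(r.review_id) > 2
ORDER BY review_count DESC
""").fetchall()
toy_con.close()

print(toy_rows)
```

Grouping by `p.id` as well as `p.title` keeps two products that happen to share a title from being merged into one group.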

### 📝 Product Team Recommendation: Development Insights

**TODO: What product development insights can we extract? (2-3 sentences)**

Consider:
- Which features should we prioritize in new products?
- Any unexpected tag patterns or combinations?
- Cross-category opportunities?
- Features to potentially discontinue?

[Write your product development insights for the Product Team]

In [None]:
# TODO: Write SQL query for high-engagement products
query = """
-- CMO wants to identify marketing champions

"""

# result = con.execute(query).df()
print("🎯 High-Engagement Products (Marketing Champions):")
# result

### 📝 CEO Assessment: Integration Timing

**TODO: What's your assessment of QuickBuy's trajectory for the CEO? (2-3 sentences)**

Consider:
- Is customer satisfaction improving or declining?
- Should we accelerate or delay integration?
- Any seasonal patterns to consider?
- Risk assessment for the $12M investment

[Write your timing assessment for the CEO]

### Question 5.3: Popular Features Analysis (4 points)

**Board Question:** *"What product features (tags) resonate most with customers? This drives our product strategy."*

**Requirements:**
- Count how often each tag appears
- Show tag and product count
- Order by frequency DESC
- Show top 10 tags

In [None]:
---

## Executive Summary for Board Meeting

**TODO: Write a comprehensive executive summary for tomorrow's board meeting (5-6 sentences)**

Include:
- Total data scope (products, reviews, categories)
- Key insight about customer satisfaction trends
- Primary recommendation for integration strategy
- Major risks or concerns identified
- Timeline recommendation (accelerate/maintain/delay)
- Expected ROI or value creation opportunity

[Write your executive summary here]

---

## Submission Checklist

Before submitting, verify:

- [ ] All TODO sections completed
- [ ] **All stakeholder communications written** (Part 1, 2, 3, 4, and all Part 5 subsections)
- [ ] All assertions pass (no errors)
- [ ] Three tables created: products (194 rows), reviews (582 rows), product_tags (364 rows)
- [ ] All SQL queries return results
- [ ] Data dictionary has all columns documented
- [ ] Business insights included throughout
- [ ] Executive summary written (5-6 sentences)
- [ ] **CRITICAL:** Kernel → Restart & Run All Cells (no errors)
- [ ] File renamed to `hw2_[your_name].ipynb`

---

## Reflection (Optional but Strongly Recommended)

**What was the most challenging part of this integration?**

[Your answer here]

**What insight would be most valuable for the board?**

[Your answer here]

**How would you improve QuickBuy's data quality going forward?**

[Your answer here]

**If you had one more day, what additional analysis would you perform?**

[Your answer here]

---

**🎉 Outstanding work, analyst!** You've successfully transformed QuickBuy's data for tomorrow's board meeting. The $2.5M inventory decision and the future of 50 developers' work rest on your analysis. The executives will be impressed with your thoroughness and business acumen!

In [None]:
# TODO: Write SQL query for timeline analysis
query = """
-- Board wants to know sentiment trend

"""

# result = con.execute(query).df()
print("📅 Review Timeline (Sentiment Trend):")
# result

---

## Part 6: Data Dictionary (5 points)

**Context:** The Integration Team Lead says: *"50 developers start migrating QuickBuy's systems tomorrow. They need clear documentation of your schema!"*

Create a comprehensive data dictionary for all tables.

In [None]:
# TODO: Create data dictionary
data_dictionary = pd.DataFrame([
    # Products table
    {'Table': 'products', 'Column': 'id', 'Type': 'INTEGER', 'Description': 'Unique product identifier (PK)', 'Example': '1'},
    {'Table': 'products', 'Column': 'title', 'Type': 'VARCHAR', 'Description': 'Product name', 'Example': 'Essence Mascara'},
    # TODO: Add all other product columns
    
    # Reviews table
    {'Table': 'reviews', 'Column': 'review_id', 'Type': 'INTEGER', 'Description': 'Unique review identifier (PK)', 'Example': '1'},
    # TODO: Add all other review columns
    
    # Product_tags table
    {'Table': 'product_tags', 'Column': 'product_id', 'Type': 'INTEGER', 'Description': 'Product identifier (FK)', 'Example': '1'},
    {'Table': 'product_tags', 'Column': 'tag', 'Type': 'VARCHAR', 'Description': 'Product feature tag', 'Example': 'electronics'},
])

print("📚 Data Dictionary for Integration Team:")
print("=" * 50)
print("Total tables: 3")
print(f"Total columns documented: {len(data_dictionary)}")
print("\nSample entries:")
data_dictionary.head(10)
