# Databricks 101: Hands-On Practice

## ⚠️ IMPORTANT: Prerequisites

**This notebook should ONLY be run AFTER completing the `main_pipeline.ipynb` notebook.**

Before starting these exercises, ensure that:

1. ✅ You have successfully run `main_pipeline.ipynb` from start to finish
2. ✅ All Gold tables have been created and populated in Unity Catalog:
   - `{catalog}.gold.customer_analytics`
   - `{catalog}.gold.product_performance`
   - `{catalog}.gold.monthly_revenue`
   - `{catalog}.gold.category_performance`
3. ✅ The catalog and database specified in `main_pipeline.ipynb` are accessible

### How to Verify Gold Tables Exist

Before proceeding, run the setup cell below to verify that all required gold tables are available in Unity Catalog.

---

## About This Notebook

This notebook contains hands-on exercises designed to help you practice querying and analyzing data from the Gold layer tables you created in the Medallion Architecture.

### Learning Objectives:

- Query pre-aggregated Gold tables for business insights
- Practice PySpark DataFrame operations
- Perform analytical queries and aggregations
- Answer real-world business questions using data

### Structure:

Each exercise provides:
- A business question to answer
- Hints about which Gold table(s) to use
- A blank code cell for your solution

---

## Setup & Verification

Run this cell to configure your environment and verify that all Gold tables exist.

In [None]:
# Configuration - Update these to match your main_pipeline.ipynb settings
catalog = "demo"  # Change if you used a different catalog
gold_db = "gold"  # Change if you used a different gold database

print("="*70)
print("VERIFYING GOLD TABLES")
print("="*70)
print(f"\nCatalog: {catalog}")
print(f"Gold Database: {gold_db}\n")

# List of required Gold tables
required_tables = [
    "customer_analytics",
    "product_performance",
    "monthly_revenue",
    "category_performance"
]

# Check each table
all_tables_exist = True
for table in required_tables:
    try:
        count = spark.table(f"{catalog}.{gold_db}.{table}").count()
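        print(f"✅ {table}: {count:,} records")
    except Exception as e:
        # Show the exception type so a permissions problem is not mistaken for a missing table
        print(f"❌ {table}: NOT FOUND ({type(e).__name__})")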
        all_tables_exist = False

print("\n" + "="*70)
if all_tables_exist:
    print("✅ SUCCESS! All Gold tables found. You can proceed with the exercises.")
else:
    print("❌ ERROR! Some Gold tables are missing.")
    print("Please run main_pipeline.ipynb first to create all Gold tables.")
print("="*70)
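The check above counts every row of each table, which can be slow on large tables. If only existence matters, a lighter variant might look like the sketch below. It assumes a runtime where `spark.catalog.tableExists` accepts a catalog-qualified name (PySpark 3.4 or later); the helper name `find_missing_tables` and the injected `table_exists` callable are illustrative and not part of `main_pipeline.ipynb`.

```python
from typing import Callable, Iterable, List


def find_missing_tables(
    table_exists: Callable[[str], bool],
    catalog: str,
    gold_db: str,
    tables: Iterable[str],
) -> List[str]:
    """Return the fully qualified names of tables the check reports as absent."""
    missing = []
    for table in tables:
        full_name = f"{catalog}.{gold_db}.{table}"
        if not table_exists(full_name):
            missing.append(full_name)
    return missing


# In a Databricks notebook (assumption: `spark` is the predefined session):
# missing = find_missing_tables(
#     spark.catalog.tableExists, catalog, gold_db, required_tables
# )
# print("Missing tables:", missing or "none")
```

Passing the existence check in as a callable keeps the helper independent of Spark, so it can be tried out with a plain function before running it against Unity Catalog.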

---

# Exercise 1: Top Customers by Lifetime Value

**Business Question:** Who are our top 5 customers by lifetime value, and what are their purchase patterns?

**Table to use:** `customer_analytics`

**Hints:**
- Select relevant columns: customer name, email, lifetime_value, total_orders, avg_order_value
- Order by lifetime_value descending
- Limit to top 5
- Use `display()` to show results
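The select, sort, and limit steps in the hints form a pattern that recurs in several exercises. The sketch below shows one possible shape for it, using only DataFrame methods. The helper name `top_n` is hypothetical, and the column names in the commented usage are taken from the hints rather than verified against the table schema.

```python
def top_n(df, order_col, n, columns):
    """Keep `columns`, sort by `order_col` descending, and return the first `n` rows.

    `order_col` should be one of `columns` so it is still present after the select.
    """
    return df.select(*columns).orderBy(order_col, ascending=False).limit(n)


# In a Databricks notebook (assumption: `spark`, `catalog`, and `gold_db` are defined):
# customers = spark.table(f"{catalog}.{gold_db}.customer_analytics")
# display(top_n(customers, "lifetime_value", 5,
#               ["customer_id", "lifetime_value", "total_orders", "avg_order_value"]))
```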

**Your Solution:**

In [None]:
# Write your code here


---

# Exercise 2: Best-Selling Products by Category

**Business Question:** What are the top 10 products by total revenue, and which categories do they belong to?

**Table to use:** `product_performance`

**Hints:**
- Select: product_name, category, brand, total_revenue, total_quantity_sold
- Order by total_revenue descending
- Limit to 10 products
- Round revenue to 2 decimal places

**Your Solution:**

In [None]:
# Write your code here


---

# Exercise 3: Monthly Revenue Trends

**Business Question:** Show the revenue trends for the last 6 months. Which month had the highest revenue?

**Table to use:** `monthly_revenue`

**Hints:**
- Select: month_start_date, total_orders, gross_revenue, net_revenue, mom_growth_percent
- Order by month_start_date descending
- Limit to 6 months
- Format date for readability
- Round monetary values
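Date formatting and rounding can both be written as Spark SQL expression strings and passed to `DataFrame.selectExpr`, which avoids importing `pyspark.sql.functions`. The sketch below builds such expressions with two hypothetical helpers, `rounded` and `month_label`; the `yyyy-MM` pattern and the two-decimal rounding are illustrative choices, not requirements of the exercise.

```python
def rounded(columns, decimals=2):
    """Build Spark SQL expressions that round each column and keep its name."""
    return [f"round({col}, {decimals}) AS {col}" for col in columns]


def month_label(date_col, alias="month"):
    """Build a Spark SQL expression that renders a date as year and month."""
    return f"date_format({date_col}, 'yyyy-MM') AS {alias}"


exprs = [month_label("month_start_date")] + rounded(["gross_revenue", "net_revenue"])

# In a Databricks notebook (assumption: `df` is the monthly_revenue DataFrame):
# display(df.selectExpr(*exprs))
```

Because `yyyy-MM` strings sort in chronological order, ordering by the formatted column gives the same sequence as ordering by the original date.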

**Your Solution:**

In [None]:
# Write your code here


---

# Exercise 4: Category Performance Ranking

**Business Question:** Rank all product categories by total revenue. Which category generates the most revenue?

**Table to use:** `category_performance`

**Hints:**
- Select: category, total_products, total_orders, total_revenue
- Order by total_revenue descending
- Round revenue values
- Show all categories

**Your Solution:**

In [None]:
# Write your code here


---

# Exercise 5: Customer Segmentation Analysis

**Business Question:** How many customers do we have in each customer segment, and what is their average lifetime value per segment?

**Table to use:** `customer_analytics`

**Hints:**
- Group by customer_segment
- Count the customers in each segment (count `customer_id`; use a distinct count if a customer could appear in more than one row)
- Calculate average lifetime_value
- Round monetary values to 2 decimal places
- Order by segment

**Your Solution:**

In [None]:
# Write your code here


---

# Congratulations!

You've completed the hands-on exercises for Databricks 101!

## Skills You Practiced:

- ✅ Querying Gold layer tables in Unity Catalog
- ✅ Using PySpark DataFrame operations (select, filter, groupBy, agg, orderBy, limit)
- ✅ Performing business analytics and aggregations
- ✅ Working with pre-calculated metrics and KPIs
- ✅ Answering real-world business questions with data

## Next Steps:

1. Try creating your own business questions and queries
2. Explore the Silver and Bronze layers to understand data lineage
3. Experiment with data visualizations using Databricks notebooks
4. Learn about Delta Lake features like time travel and OPTIMIZE
5. Build your own data pipelines using the Medallion Architecture pattern

---

**Happy Learning! 🎉**