# Exercise 6: Apache Hive and the Data Catalog

## Why Not Just Use CSV Files?

In previous exercises, we read data directly from CSV files. While this works, it has significant limitations:

| Aspect | CSV Files | Hive Tables |
|----------|-----------|-------------|
| **Schema** | Inferred or defined in code | Stored centrally in catalog |
| **Discovery** | Need to know file paths | Query catalog to find tables |
| **Governance** | No access control | Role-based access possible |
| **Evolution** | Every reading script must be updated | Schema changed once in the catalog (`ALTER TABLE`) |
| **Optimization** | No statistics | Partition pruning, statistics |
| **Sharing** | Copy files or share paths | Consistent view for all tools |
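
To make the **Schema** row concrete, here is a minimal pure-Python sketch of why inferred types can be unreliable. It does not use Spark; the CSV content, the column names, and the naive `infer` helper are all hypothetical and only similar in spirit to what `inferSchema` does.

```python
import csv
import io
from datetime import date

# Hypothetical CSV content; the column names are illustrative only.
RAW = "customer_id,zip_code,signup_date\n101,02134,2024-01-15\n102,10001,2024-02-01\n"


def infer(value: str):
    """Naive type inference: try int, otherwise keep the string."""
    try:
        return int(value)
    except ValueError:
        return value


# A schema declared once, the way a catalog would hold it.
SCHEMA = {"customer_id": int, "zip_code": str, "signup_date": date.fromisoformat}

rows = list(csv.DictReader(io.StringIO(RAW)))
inferred = [{k: infer(v) for k, v in row.items()} for row in rows]
declared = [{k: SCHEMA[k](v) for k, v in row.items()} for row in rows]

print(inferred[0]["zip_code"])  # prints 2134 (leading zero lost)
print(declared[0]["zip_code"])  # prints 02134 (kept as a string)
```

With a declared schema the decision "a ZIP code is a string" is made once, instead of being guessed again by every script that reads the file.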

## Learning Objectives
- Understand the role of the Hive Metastore as a data catalog
- Create managed and external tables with proper schemas
- Use partitioning for efficient data organization
- Compare governed tables vs raw file access

---

## Part 1: Understanding the Hive Metastore

The **Hive Metastore** is a central repository that stores:
- **Database definitions** - logical groupings of tables
- **Table schemas** - column names, types, and descriptions
- **Partition information** - how data is organized
- **Storage details** - file locations, formats, SerDe
- **Statistics** - row counts, data sizes for optimization
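
The list above can be pictured as a handful of relational tables. The following is a toy sketch using SQLite as a stand-in for the relational backend; the table names, column names, and sample values are hypothetical and far simpler than Hive's real metastore schema.

```python
import sqlite3

# Toy stand-in for the metastore backend (hypothetical, simplified schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dbs (db_id INTEGER PRIMARY KEY, name TEXT, location TEXT);
    CREATE TABLE tbls (
        tbl_id INTEGER PRIMARY KEY,
        db_id INTEGER REFERENCES dbs(db_id),
        name TEXT, table_type TEXT, file_format TEXT, row_count INTEGER
    );
    CREATE TABLE cols (
        tbl_id INTEGER REFERENCES tbls(tbl_id),
        name TEXT, type TEXT, comment TEXT
    );
    CREATE TABLE parts (tbl_id INTEGER REFERENCES tbls(tbl_id), spec TEXT);
""")
conn.execute("INSERT INTO dbs VALUES (1, 'sales_db', '/warehouse/sales_db.db')")
conn.execute("INSERT INTO tbls VALUES (1, 1, 'orders', 'MANAGED', 'PARQUET', 4)")
conn.executemany(
    "INSERT INTO cols VALUES (1, ?, ?, ?)",
    [
        ("order_id", "string", "Order identifier"),
        ("order_total", "decimal(12,2)", "Order amount"),
    ],
)
conn.executemany(
    "INSERT INTO parts VALUES (1, ?)",
    [("year=2024/month=01",), ("year=2024/month=02",)],
)


def describe(db: str, table: str) -> list:
    """Return (column, type, comment) rows without opening any data file."""
    return conn.execute(
        """
        SELECT c.name, c.type, c.comment
        FROM cols c
        JOIN tbls t ON c.tbl_id = t.tbl_id
        JOIN dbs d ON t.db_id = d.db_id
        WHERE d.name = ? AND t.name = ?
        ORDER BY c.rowid
        """,
        (db, table),
    ).fetchall()


print(describe("sales_db", "orders"))
```

The point of the sketch is that schema lookups are plain metadata queries: no data file is read to answer "what columns does this table have?".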

```
┌─────────────────────────────────────────────────────────────┐
│                    HIVE METASTORE                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │  Database   │  │   Tables    │  │ Partitions  │        │
│  │ - sales_db  │──│ - orders    │──│ - year=2024 │        │
│  │ - hr_db     │  │ - customers │  │ - month=01  │        │
│  └─────────────┘  └─────────────┘  └─────────────┘        │
│         │                │                │                │
│         ▼                ▼                ▼                │
│  ┌─────────────────────────────────────────────┐          │
│  │           PostgreSQL Backend               │          │
│  │  (Persistent metadata storage)             │          │
│  └─────────────────────────────────────────────┘          │
└─────────────────────────────────────────────────────────────┘
         │              │              │
         ▼              ▼              ▼
    ┌─────────┐   ┌─────────┐   ┌─────────┐
    │  Spark  │   │  Hive   │   │ Presto  │
    │  Query  │   │  Query  │   │ Query   │
    └─────────┘   └─────────┘   └─────────┘
```

In [1]:
# Hive can also be queried through HiveServer2 (for example with beeline or PyHive).
# In this notebook we use Spark with Hive support instead, which reads table
# definitions directly from the Hive Metastore.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Hive Data Catalog Lab") \
    .master("yarn") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://hive-metastore:9083") \
    .config("spark.sql.catalogImplementation", "hive") \
    .enableHiveSupport() \
    .getOrCreate()

print("Spark session with Hive support created!")
print(f"Application ID: {spark.sparkContext.applicationId}")

Spark session with Hive support created!
Application ID: application_1768092706713_0009


## Part 2: Exploring the Catalog

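Let's see what's already in our metastore. The output below was captured after the Part 3 cell had already been run (note the cell numbers), which is why `sales_analytics` is listed; in a fresh environment you would expect to see only `default`.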

In [5]:
# List all databases in the catalog
print("=== Databases in Catalog ===")
spark.sql("SHOW DATABASES").show(truncate=False)

=== Databases in Catalog ===
+---------------+
|namespace      |
+---------------+
|default        |
|sales_analytics|
+---------------+



In [6]:
# List tables in the default database
print("=== Tables in 'default' database ===")
spark.sql("SHOW TABLES IN default").show(truncate=False)

=== Tables in 'default' database ===
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
+---------+---------+-----------+



## Part 3: Creating a Governed Database

Let's create a dedicated database for our sales data, with a description and an explicit storage location recorded in the catalog:

In [4]:
# Create a new database with description and location
spark.sql("""
    CREATE DATABASE IF NOT EXISTS sales_analytics
    COMMENT 'Sales data warehouse for analytics team'
    LOCATION '/user/hive/warehouse/sales_analytics.db'
""")

# Verify creation
spark.sql("DESCRIBE DATABASE EXTENDED sales_analytics").show(truncate=False)

+--------------+-----------------------------------------------------------+
|info_name     |info_value                                                 |
+--------------+-----------------------------------------------------------+
|Catalog Name  |spark_catalog                                              |
|Namespace Name|sales_analytics                                            |
|Comment       |Sales data warehouse for analytics team                    |
|Location      |hdfs://namenode:9000/user/hive/warehouse/sales_analytics.db|
|Owner         |jovyan                                                     |
|Properties    |                                                           |
+--------------+-----------------------------------------------------------+



In [7]:
# Switch to our new database
spark.sql("USE sales_analytics")
print("Now using: sales_analytics database")

Now using: sales_analytics database


## Part 4: Managed Tables vs External Tables

### Managed Tables
- Hive owns the data
- DROP TABLE deletes the data
- Best for ETL outputs and internal tables

### External Tables
- Hive only manages metadata
- DROP TABLE leaves data intact
- Best for raw data landing zones
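
The difference in `DROP TABLE` behavior can be sketched without Spark. The snippet below is a toy model in plain Python: the `catalog` dictionary and the `create_table` / `drop_table` helpers are hypothetical and only mimic the rule described above (metadata is always removed, files only for managed tables).

```python
import shutil
import tempfile
from pathlib import Path

catalog = {}  # toy catalog: table name -> {"type": ..., "location": Path}


def create_table(name: str, table_type: str, location: Path) -> None:
    """Register a table and make sure its data directory exists."""
    location.mkdir(parents=True, exist_ok=True)
    catalog[name] = {"type": table_type, "location": location}


def drop_table(name: str) -> None:
    """Always remove the metadata; remove the files only for managed tables."""
    entry = catalog.pop(name)
    if entry["type"] == "MANAGED":
        shutil.rmtree(entry["location"])


root = Path(tempfile.mkdtemp())
create_table("customers", "MANAGED", root / "warehouse" / "customers")
create_table("raw_transactions", "EXTERNAL", root / "landing" / "transactions")
(root / "landing" / "transactions" / "part-0.csv").write_text("id,amount\n1,9.99\n")

drop_table("customers")
drop_table("raw_transactions")

print((root / "warehouse" / "customers").exists())  # prints False
print((root / "landing" / "transactions" / "part-0.csv").exists())  # prints True
```

After both drops the catalog is empty, but the external table's file is still on disk, which is why external tables suit landing zones that other systems also write to.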

In [8]:
# Create a MANAGED table with explicit schema
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id INT COMMENT 'Unique customer identifier',
        first_name STRING COMMENT 'Customer first name',
        last_name STRING COMMENT 'Customer last name',
        email STRING COMMENT 'Contact email address',
        signup_date DATE COMMENT 'Date customer registered',
        loyalty_tier STRING COMMENT 'Bronze, Silver, Gold, Platinum'
    )
    COMMENT 'Master customer dimension table'
    STORED AS PARQUET
""")

print("Managed table 'customers' created!")
spark.sql("DESCRIBE EXTENDED customers").show(100, truncate=False)

Managed table 'customers' created!
+----------------------------+---------------------------------------------------------------------+------------------------------+
|col_name                    |data_type                                                            |comment                       |
+----------------------------+---------------------------------------------------------------------+------------------------------+
|customer_id                 |int                                                                  |Unique customer identifier    |
|first_name                  |string                                                               |Customer first name           |
|last_name                   |string                                                               |Customer last name            |
|email                       |string                                                               |Contact email address         |
|signup_date                 |date       

In [9]:
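# Create an EXTERNAL table pointing to existing data
# Note: 'skip.header.line.count' is a Hive table property. Spark may ignore it
# when reading text tables, so check that the header row does not show up as data.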
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_transactions (
        transaction_id STRING,
        customer_id INT,
        product_id STRING,
        quantity INT,
        unit_price DECIMAL(10,2),
        transaction_date TIMESTAMP
    )
    COMMENT 'Raw transaction data - external source'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/student/data/transactions'
    TBLPROPERTIES ('skip.header.line.count'='1')
""")

print("External table 'raw_transactions' created!")

External table 'raw_transactions' created!


## Part 5: Partitioned Tables for Performance

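Partitioning organizes data into one directory per partition value, enabling **partition pruning**: when a query filters on the partition columns, the engine reads only the matching directories instead of scanning the whole table.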

In [10]:
# Create a partitioned fact table
spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id STRING,
        customer_id INT,
        order_total DECIMAL(12,2),
        order_status STRING,
        created_at TIMESTAMP
    )
    COMMENT 'Order fact table partitioned by date'
    PARTITIONED BY (order_year INT, order_month INT)
    STORED AS PARQUET
""")

print("Partitioned table 'orders' created!")
spark.sql("DESCRIBE EXTENDED orders").show(50, truncate=False)

Partitioned table 'orders' created!
+----------------------------+------------------------------------------------------------------+-------+
|col_name                    |data_type                                                         |comment|
+----------------------------+------------------------------------------------------------------+-------+
|order_id                    |string                                                            |NULL   |
|customer_id                 |int                                                               |NULL   |
|order_total                 |decimal(12,2)                                                     |NULL   |
|order_status                |string                                                            |NULL   |
|created_at                  |timestamp                                                         |NULL   |
|order_year                  |int                                                               |NULL   |
|order_mon

In [13]:
# Insert sample data into partitions
from datetime import datetime

from pyspark.sql.functions import month, year

# Create sample order data
sample_orders = [
    ("ORD-001", 101, 150.00, "completed", datetime(2024, 1, 15, 10, 30)),
    ("ORD-002", 102, 275.50, "completed", datetime(2024, 1, 20, 14, 45)),
    ("ORD-003", 101, 89.99, "shipped", datetime(2024, 2, 5, 9, 15)),
    ("ORD-004", 103, 420.00, "pending", datetime(2024, 2, 10, 16, 20)),
    ("ORD-004", 103, 420.00, "pending", datetime(2025, 2, 10, 16, 20)),
]

orders_df = spark.createDataFrame(
    sample_orders,
    ["order_id", "customer_id", "order_total", "order_status", "created_at"]
)

# Derive the partition columns from the order timestamp
orders_df = orders_df.withColumn("order_year", year("created_at")) \
                     .withColumn("order_month", month("created_at"))

# Write to the partitioned Hive table.
# insertInto matches columns by position, so the partition columns must come last.
# mode("append") adds rows on every run: re-running this cell duplicates the data.
orders_df.write.mode("append").insertInto("orders")

print("Sample data inserted into partitioned table!")

Sample data inserted into partitioned table!


In [14]:
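# Number of in-memory partitions of the DataFrame.
# This is unrelated to the Hive table partitions shown in the next cell.
orders_df.rdd.getNumPartitions()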

2

In [15]:
# View the partitions
print("=== Table Partitions ===")
spark.sql("SHOW PARTITIONS orders").show(truncate=False)

=== Table Partitions ===
+-----------------------------+
|partition                    |
+-----------------------------+
|order_year=2024/order_month=1|
|order_year=2024/order_month=2|
|order_year=2025/order_month=2|
+-----------------------------+
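
Each partition name above is also a directory path under the table's location. The following pure-Python sketch illustrates the idea behind partition pruning on those names. It is not how Spark implements pruning internally; the `parse_spec` and `prune` helpers are hypothetical, and the table root is an assumption based on the database location shown earlier.

```python
from pathlib import PurePosixPath

# Partition names as returned by SHOW PARTITIONS orders.
partitions = [
    "order_year=2024/order_month=1",
    "order_year=2024/order_month=2",
    "order_year=2025/order_month=2",
]


def parse_spec(partition: str) -> dict:
    """Turn 'order_year=2024/order_month=1' into {'order_year': 2024, 'order_month': 1}."""
    pairs = (part.split("=", 1) for part in partition.split("/"))
    return {key: int(value) for key, value in pairs}


def prune(candidates: list, **filters: int) -> list:
    """Keep only the partitions whose values match every equality filter."""
    return [
        p for p in candidates
        if all(parse_spec(p).get(col) == val for col, val in filters.items())
    ]


# Assumed table root; the actual location is reported by DESCRIBE FORMATTED orders.
table_root = PurePosixPath("/user/hive/warehouse/sales_analytics.db/orders")
to_scan = [table_root / p for p in prune(partitions, order_year=2024, order_month=1)]

print(len(partitions), "partitions in total,", len(to_scan), "to scan")
```

A filter on `order_year` and `order_month` narrows three directories down to one before any data file is opened, which is the saving that partition pruning provides.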



In [20]:
# Query with partition pruning
print("=== Query January 2024 Orders (Partition Pruning) ===")
spark.sql("""
    SELECT order_id, customer_id, order_total, order_status
    FROM orders
    WHERE order_year = 2024 AND order_month = 1
""").show()

# Explain the query to see partition pruning
print("\n=== Query Plan (Notice Partition Filters) ===")
spark.sql("""
    EXPLAIN FORMATTED
    SELECT * FROM orders WHERE order_year = 2024 AND order_month = 1
""").show(200, truncate=False)

=== Query January 2024 Orders (Partition Pruning) ===
+--------+-----------+-----------+------------+
|order_id|customer_id|order_total|order_status|
+--------+-----------+-----------+------------+
| ORD-001|        101|     150.00|   completed|
| ORD-002|        102|     275.50|   completed|
| ORD-001|        101|     150.00|   completed|
| ORD-002|        102|     275.50|   completed|
+--------+-----------+-----------+------------+


=== Query Plan (Notice Partition Filters) ===
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Part 6: Schema Discovery and Documentation

Unlike raw files, catalog tables are self-documenting:

In [21]:
# List all tables in our database
print("=== Tables in sales_analytics ===")
spark.sql("SHOW TABLES").show(truncate=False)

=== Tables in sales_analytics ===
+---------------+----------------+-----------+
|namespace      |tableName       |isTemporary|
+---------------+----------------+-----------+
|sales_analytics|customers       |false      |
|sales_analytics|orders          |false      |
|sales_analytics|raw_transactions|false      |
+---------------+----------------+-----------+



In [28]:
# Get detailed table information
print("=== customers Table Details ===")
spark.sql("DESCRIBE FORMATTED customers").show(100, truncate=False)

=== customers Table Details ===
+----------------------------+---------------------------------------------------------------------+------------------------------+
|col_name                    |data_type                                                            |comment                       |
+----------------------------+---------------------------------------------------------------------+------------------------------+
|customer_id                 |int                                                                  |Unique customer identifier    |
|first_name                  |string                                                               |Customer first name           |
|last_name                   |string                                                               |Customer last name            |
|email                       |string                                                               |Contact email address         |
|signup_date                 |date          

In [31]:
# View table creation DDL
print("=== How to recreate the table ===")
spark.sql("SHOW CREATE TABLE orders").show(truncate=False)

=== How to recreate the table ===
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt                                                                                                                                                                                                                                                                                                                                                            |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Part 7: The Problem with Raw Files

Let's contrast governed tables with raw file access:

In [None]:
# The OLD WAY: Reading a CSV file directly
# Problems:
# 1. Need to know exact file path
# 2. Schema must be defined in every script
# 3. No documentation or discovery
# 4. No access control

print("=== THE OLD WAY (Raw File Access) ===")
print("""
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")  # <- Extra pass over the data; types may be guessed wrong
    .csv("/some/path/to/transactions.csv")  # <- Hardcoded path!
)

# Every user needs to:
# - Know the file location
# - Know the schema
# - Handle schema changes manually
# - Work without data lineage
""")

In [33]:
# The NEW WAY: Query from catalog
print("=== THE NEW WAY (Catalog Access) ===")
print("""
df = spark.table("sales_analytics.orders")

# Benefits:
# ✓ Schema is predefined and documented
# ✓ Location abstracted away
# ✓ Access control via database permissions
# ✓ Statistics enable query optimization
# ✓ All tools see the same data definition
""")

# Actually query the table
df = spark.table("sales_analytics.orders")
print("\nQuerying orders table from catalog:")
df.show()

=== THE NEW WAY (Catalog Access) ===

df = spark.table("sales_analytics.orders")

# Benefits:
# ✓ Schema is predefined and documented
# ✓ Location abstracted away
# ✓ Access control via database permissions
# ✓ Statistics enable query optimization
# ✓ All tools see the same data definition


Querying orders table from catalog:
+--------+-----------+-----------+------------+-------------------+----------+-----------+
|order_id|customer_id|order_total|order_status|         created_at|order_year|order_month|
+--------+-----------+-----------+------------+-------------------+----------+-----------+
| ORD-004|        103|     420.00|     pending|2025-02-10 16:20:00|      2025|          2|
| ORD-001|        101|     150.00|   completed|2024-01-15 10:30:00|      2024|          1|
| ORD-002|        102|     275.50|   completed|2024-01-20 14:45:00|      2024|          1|
| ORD-001|        101|     150.00|   completed|2024-01-15 10:30:00|      2024|          1|
| ORD-002|        102|     275.50|

## Part 8: Table Statistics for Optimization

In [None]:
# Compute table statistics
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS FOR ALL COLUMNS")

print("Statistics computed!")

In [None]:
# View the statistics
print("=== Table Statistics ===")
spark.sql("DESCRIBE EXTENDED orders").show(100, truncate=False)
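
To give a feel for why size statistics matter, here is a deliberately simplified sketch of one decision an optimizer can make with them: whether one side of a join is small enough to broadcast. The threshold is Spark's documented default for `spark.sql.autoBroadcastJoinThreshold` (10 MB); the `TableStats` class, the `choose_join_strategy` rule, and the table sizes are hypothetical, and the real optimizer considers much more than this.

```python
from dataclasses import dataclass

# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass
class TableStats:
    name: str
    size_in_bytes: int
    row_count: int


def choose_join_strategy(left: TableStats, right: TableStats) -> str:
    """Toy rule: broadcast the smaller side if it fits under the threshold."""
    smaller = min(left, right, key=lambda t: t.size_in_bytes)
    if smaller.size_in_bytes <= BROADCAST_THRESHOLD_BYTES:
        return f"broadcast {smaller.name}"
    return "sort-merge join"


# Hypothetical statistics, for illustration only.
orders_stats = TableStats("orders", size_in_bytes=50 * 1024**3, row_count=400_000_000)
customers_stats = TableStats("customers", size_in_bytes=4 * 1024**2, row_count=50_000)

print(choose_join_strategy(orders_stats, customers_stats))  # prints broadcast customers
```

Without collected statistics the engine has to fall back on rougher size estimates, so running `ANALYZE TABLE` gives it better inputs for choices like this one.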

## Exercises

### Exercise 1: Create a Product Catalog Table
Create a managed table called `products` with columns for product_id, name, category, price, and inventory_count.

### Exercise 2: Create a Partitioned Sales Table
Create an external table for sales data partitioned by region and year.

### Exercise 3: Compare Query Plans
Write a query against the partitioned table and examine the query plan to verify partition pruning.

In [None]:
# Exercise 1: Your code here
# spark.sql("""
#     CREATE TABLE IF NOT EXISTS products (
#         ...
#     )
# """)

In [None]:
# Cleanup (optional)
# spark.sql("DROP TABLE IF EXISTS customers")
# spark.sql("DROP TABLE IF EXISTS raw_transactions")
# spark.sql("DROP TABLE IF EXISTS orders")
# spark.sql("DROP DATABASE IF EXISTS sales_analytics CASCADE")

In [None]:
# Stop the Spark session
spark.stop()
print("Session stopped.")