# Exercise 4: Apache Hive Fundamentals

## Learning Objectives
- Understand Hive architecture and its role in the Hadoop ecosystem
- Create databases and tables (managed and external)
- Load data into Hive tables
- Work with partitions and bucketing
- Write HiveQL queries (SELECT, JOIN, GROUP BY, etc.)
- Use Hive with Spark (SparkSQL + Hive Metastore)

## Prerequisites
- Completed HDFS and Spark exercises
- Hive services running (metastore and hiveserver2)

## Part 1: Connecting to Hive

### Exercise 1.1: Connect via Beeline

Open a terminal and connect to HiveServer2 using Beeline:

```bash
docker exec -it hiveserver2 beeline -u "jdbc:hive2://localhost:10000/default"
```

Once connected, run:
```sql
SHOW DATABASES;
```

**Question:** What databases exist by default?

### Exercise 1.2: Connect via PySpark with Hive Support

Spark can use the Hive Metastore for table metadata. Complete the code below:

In [None]:
from pyspark.sql import SparkSession

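# TODO: Create a SparkSession with Hive support enabled
# Hint: Use .enableHiveSupport() and set hive.metastore.uris
# The chain is wrapped in parentheses so a comment can sit between calls;
# a comment line inside a backslash-continued chain breaks the statement.
spark = (
    SparkSession.builder
    .appName("HiveExercises")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    # TODO: Add .enableHiveSupport() here
    .getOrCreate()
)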

# Verify connection by listing databases
spark.sql("SHOW DATABASES").show()

## Part 2: Creating Databases and Tables

### Exercise 2.1: Create a Database

In [None]:
# TODO: Create a database called 'sales_db'
# Hint: Use spark.sql("CREATE DATABASE ...")


# Switch to the new database
spark.sql("USE sales_db")

# Verify
spark.sql("SELECT current_database()").show()

### Exercise 2.2: Create a Managed Table

In [None]:
# TODO: Create a managed table called 'customers' with the following columns:
# - customer_id (INT)
# - name (STRING)
# - email (STRING)  
# - signup_date (DATE)

spark.sql("""
    -- TODO: Write your CREATE TABLE statement here
""")

# Verify table was created
spark.sql("DESCRIBE customers").show()

### Exercise 2.3: Create an External Table

External tables point to data stored outside Hive's warehouse directory. Dropping an external table removes only its metadata; the underlying files remain in place.
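
The drop behavior can be illustrated with a toy model. The sketch below is not Hive code: `ToyCatalog` and its methods are hypothetical names, and local temporary directories stand in for HDFS paths. It only mimics the rule that dropping a managed table deletes its data, while dropping an external table removes just the catalog entry.

```python
import shutil
import tempfile
from pathlib import Path


class ToyCatalog:
    """Hypothetical stand-in for a metastore: table name -> (path, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external):
        self.tables[name] = (Path(location), external)

    def drop_table(self, name):
        location, external = self.tables.pop(name)
        # Managed tables own their data, so the files go away with the table.
        # External tables only lose their catalog entry; the files stay put.
        if not external:
            shutil.rmtree(location)


# Local temp directories play the role of HDFS paths (an assumption of this sketch).
root = Path(tempfile.mkdtemp())
managed_dir = root / "warehouse" / "customers"
external_dir = root / "external" / "products"
for d in (managed_dir, external_dir):
    d.mkdir(parents=True)
    (d / "part-00000.csv").write_text("1,example\n")

catalog = ToyCatalog()
catalog.create_table("customers", managed_dir, external=False)
catalog.create_table("products", external_dir, external=True)

catalog.drop_table("customers")
catalog.drop_table("products")

managed_exists = managed_dir.exists()
external_exists = external_dir.exists()
print("managed data still on disk: ", managed_exists)
print("external data still on disk:", external_exists)
```

In a real cluster the same check would be an HDFS listing of the table location after `DROP TABLE`.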

In [None]:
# First, let's create some sample data in HDFS
sample_data = [
    (1, "Laptop", "Electronics", 999.99),
    (2, "Mouse", "Electronics", 29.99),
    (3, "Desk", "Furniture", 299.99),
    (4, "Chair", "Furniture", 199.99),
    (5, "Monitor", "Electronics", 399.99)
]

df = spark.createDataFrame(sample_data, ["product_id", "name", "category", "price"])

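# Save to HDFS as CSV without a header row: a delimited-text Hive table treats
# every line as data, so a header line would surface as a bogus row
df.write.mode("overwrite").option("header", "false").csv("/user/hive/external/products")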

print("Data written to HDFS!")

In [None]:
# TODO: Create an EXTERNAL table pointing to the CSV data
# Hint: Use LOCATION clause and specify ROW FORMAT

spark.sql("""
    -- TODO: Create external table 'products' 
    -- with columns: product_id INT, name STRING, category STRING, price DOUBLE
    -- pointing to /user/hive/external/products
""")

# Query the external table
spark.sql("SELECT * FROM products").show()

## Part 3: Partitioned Tables

Partitioning stores a table's data in separate directories, one per combination of partition-column values, so queries that filter on those columns can skip the directories they do not need.
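
As a rough illustration of that directory layout, the pure-Python sketch below groups a few rows by the path they would land under. The rows, the `/warehouse/sales` root, and the `partition_path` helper are hypothetical; only the `key=value` directory naming follows Hive's convention.

```python
from collections import defaultdict

# Hypothetical rows: (sale_id, amount, year, month)
rows = [
    (1, 1999.98, 2024, 1),
    (2, 149.95, 2024, 1),
    (3, 299.99, 2024, 2),
]


def partition_path(table_root, year, month):
    """Build a Hive-style partition directory: one key=value level per partition column."""
    return f"{table_root}/year={year}/month={month}"


# Group rows by the directory they would be stored under. The partition values
# live in the path, so only the remaining columns are kept with each row.
layout = defaultdict(list)
for sale_id, amount, year, month in rows:
    layout[partition_path("/warehouse/sales", year, month)].append((sale_id, amount))

for path in sorted(layout):
    print(path, "->", layout[path])
```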

### Exercise 3.1: Create a Partitioned Table

In [None]:
# TODO: Create a partitioned table for sales data
# Partition by: year (INT) and month (INT)

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        sale_id INT,
        product_id INT,
        quantity INT,
        amount DOUBLE,
        sale_date DATE
    )
    -- TODO: Add PARTITIONED BY clause for year and month
    STORED AS PARQUET
""")

spark.sql("DESCRIBE sales").show()

In [None]:
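# Insert data into partitions (static partition spec: year and month are fixed per statement)
# DATE literals keep sale_date explicitly typed instead of relying on an implicit string cast
spark.sql("""
    INSERT INTO sales PARTITION (year=2024, month=1)
    VALUES (1, 1, 2, 1999.98, DATE '2024-01-15'),
           (2, 2, 5, 149.95, DATE '2024-01-20')
""")

spark.sql("""
    INSERT INTO sales PARTITION (year=2024, month=2)
    VALUES (3, 3, 1, 299.99, DATE '2024-02-10'),
           (4, 4, 3, 599.97, DATE '2024-02-25')
""")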

# View partitions
spark.sql("SHOW PARTITIONS sales").show()

### Exercise 3.2: Query with Partition Pruning

When a query filters on partition columns, the engine (Hive, or Spark in this notebook) reads only the matching partition directories and skips the rest.
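
The idea can be sketched in plain Python, under the assumption that the planner knows each partition's key values from the metastore. The partition list and the `prune` helper are hypothetical and handle only equality filters; real planners support far more.

```python
# Hypothetical partition directories for a table partitioned by (year, month).
partitions = [
    {"year": 2023, "month": 12, "path": "/warehouse/sales/year=2023/month=12"},
    {"year": 2024, "month": 1, "path": "/warehouse/sales/year=2024/month=1"},
    {"year": 2024, "month": 2, "path": "/warehouse/sales/year=2024/month=2"},
]


def prune(partitions, **filters):
    """Keep only partitions whose key values match every equality filter."""
    return [
        p["path"]
        for p in partitions
        if all(p[key] == value for key, value in filters.items())
    ]


# Equivalent in spirit to: WHERE year = 2024 AND month = 1
to_scan = prune(partitions, year=2024, month=1)
print(f"scanning {len(to_scan)} of {len(partitions)} partitions: {to_scan}")
```

Filtering on `year` alone would still prune, just less aggressively: both 2024 directories would be read.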

In [None]:
# TODO: Query only January 2024 sales
# This should trigger partition pruning (reading only the year=2024/month=1 directory)

spark.sql("""
    -- TODO: Write a SELECT query filtering by year=2024 AND month=1
""").show()

# Check the execution plan to see partition pruning
spark.sql("SELECT * FROM sales WHERE year=2024 AND month=1").explain()

## Part 4: HiveQL Queries

### Exercise 4.1: Aggregation Queries
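
For a feel of the query shape without a running cluster, the sketch below uses Python's built-in `sqlite3` as a stand-in. SQLite is not Hive, and the `orders` table and its rows are made up for illustration, but this basic `GROUP BY` with `SUM` and `COUNT` is plain SQL that HiveQL also accepts.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 10.0), (2, "north", 5.0), (3, "south", 7.5)],
)

# One output row per distinct region; SUM and COUNT are computed within each group.
totals = conn.execute(
    """
    SELECT region, SUM(total) AS total_amount, COUNT(*) AS order_count
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()

for row in totals:
    print(row)
conn.close()
```

Every non-aggregated column in the `SELECT` list also appears in `GROUP BY`; Hive and Spark SQL reject queries that break this rule.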

In [None]:
# TODO: Calculate total sales amount per month
spark.sql("""
    -- TODO: Write GROUP BY query to sum amount by year, month
""").show()

### Exercise 4.2: JOIN Operations
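
A small inner-join sketch, again using Python's built-in `sqlite3` as a stand-in for Hive. The `authors` and `books` tables are hypothetical and unrelated to the exercise data; the point is only how `JOIN ... ON` matches rows and drops those without a partner.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (author_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE books (book_id INTEGER, author_id INTEGER, title TEXT)")
conn.executemany("INSERT INTO authors VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(10, 1, "Notes"), (11, 1, "Letters"), (12, 3, "Orphan")],
)

# An inner join keeps only rows with a match on both sides:
# book 12 (author_id 3) and author 2 (no books) are both absent from the result.
joined = conn.execute(
    """
    SELECT a.name, b.title
    FROM books b
    JOIN authors a ON b.author_id = a.author_id
    ORDER BY b.book_id
    """
).fetchall()

for row in joined:
    print(row)
conn.close()
```

To keep unmatched rows from one side, a `LEFT JOIN` would be used instead.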

In [None]:
# TODO: Join sales with products to show product names with sales
spark.sql("""
    -- TODO: Write a JOIN query between sales and products tables
    -- Show: product name, category, quantity sold, amount
""").show()

## Part 5: Advanced Features

### Exercise 5.1: Window Functions
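
A running total with a window function can be tried locally with Python's built-in `sqlite3` as a stand-in, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The `payments` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (paid_on TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("2024-01-05", 10.0), ("2024-01-09", 20.0), ("2024-02-01", 5.0)],
)

# SUM(...) OVER (ORDER BY ...) adds up every row from the start of the
# ordering through the current row, without collapsing rows the way GROUP BY does.
running = conn.execute(
    """
    SELECT paid_on,
           amount,
           SUM(amount) OVER (ORDER BY paid_on) AS running_total
    FROM payments
    ORDER BY paid_on
    """
).fetchall()

for row in running:
    print(row)
conn.close()
```

With the default window frame, rows that tie on the `ORDER BY` value share the same running total. Adding `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` gives a strict row-by-row accumulation.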

In [None]:
# TODO: Use a window function to calculate running total of sales
spark.sql("""
    -- TODO: Calculate cumulative sum of amount ordered by sale_date
    -- Hint: Use SUM() OVER (ORDER BY ...)
""").show()

### Exercise 5.2: Create a View

In [None]:
# TODO: Create a view that shows monthly sales summary
spark.sql("""
    -- TODO: CREATE VIEW monthly_summary AS ...
""")

spark.sql("SELECT * FROM monthly_summary").show()

## Cleanup

In [None]:
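# Optional: Clean up the view and tables (uncomment to run)
# Dropping the external 'products' table leaves its files under /user/hive/external/products
# spark.sql("DROP VIEW IF EXISTS monthly_summary")
# spark.sql("DROP TABLE IF EXISTS sales")
# spark.sql("DROP TABLE IF EXISTS products")
# spark.sql("DROP TABLE IF EXISTS customers")
# spark.sql("DROP DATABASE IF EXISTS sales_db CASCADE")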

spark.stop()
print("Session stopped.")