# Exercise 8: Hive Execution Engines

## MapReduce vs Tez vs Spark

Hive was originally built to run on **MapReduce**, but it can now use faster execution engines:

| Engine | Description | Performance | Use Case |
|--------|-------------|-------------|----------|
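| **MapReduce** | Original Hive engine | Slow (writes intermediate results to disk between stages) | Legacy batch jobs |
| **Tez** | DAG-based execution | Usually much faster than MapReduce (often quoted as up to 10x; depends on workload) | Default on Hortonworks (HDP) |
| **Spark** | In-memory processing | Often quoted as 10-100x faster than MapReduce; actual gains depend on workload and memory | Interactive, ML |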

```
┌─────────────────────────────────────────────────────────────┐
│                     HIVE QUERY                             │
│         SELECT * FROM sales WHERE region = 'US'            │
└─────────────────────────────────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         │               │               │
         ▼               ▼               ▼
    ┌─────────┐    ┌─────────┐    ┌─────────┐
    │   MR    │    │   Tez   │    │  Spark  │
    │  Slow   │    │  Fast   │    │ Fastest │
    │  Disk   │    │   DAG   │    │ Memory  │
    └─────────┘    └─────────┘    └─────────┘
```

## Learning Objectives
- Understand Hive execution engines
- Configure Hive to use Spark as execution engine
- Compare performance between engines
- Know when to use which approach

---

## Part 1: Current Hive Configuration

Let's first check the current execution engine configuration.

In [None]:
# We'll use subprocess to run beeline commands
import subprocess

def run_hive_query(query, database="default"):
    """Execute a Hive query using beeline and return the result."""
    # Pass the arguments as a list (no shell) so that quotes inside the
    # query cannot break or inject into the command line.
    cmd = [
        "docker", "exec", "hiveserver2", "beeline",
        "-u", f"jdbc:hive2://localhost:10000/{database}",
        "--silent=true",
        "-e", query,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Error: {result.stderr}")
    return result.stdout

print("Hive query helper function defined.")
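The helper returns beeline's raw text output. As a minimal sketch of turning that text into Python records, the hypothetical `parse_beeline_tsv` function below assumes the query was run with beeline's `--outputformat=tsv2` option (which the helper above does not set), so the output is a tab-separated table with a header row. The sample output is hand-written for illustration, not captured from a cluster.

```python
import csv
import io

def parse_beeline_tsv(output):
    """Parse tab-separated beeline output (header row first) into a list of dicts.

    Assumes beeline was invoked with --outputformat=tsv2. The function name
    is illustrative and is not part of any Hive or beeline API.
    """
    # Drop blank lines that may surround the result set
    lines = [line for line in output.splitlines() if line.strip()]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    return [dict(row) for row in reader]

# Hand-written sample output (an assumption, not real cluster output)
sample_output = "category\ttotal\n0\t10\n1\t12\n"
rows = parse_beeline_tsv(sample_output)
print(rows)
```

All values come back as strings, so numeric columns would still need an explicit conversion.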

In [None]:
# For this notebook, we'll primarily demonstrate concepts
# and use Spark to show the differences

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Hive Execution Engines Lab") \
    .master("yarn") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://hive-metastore:9083") \
    .config("spark.sql.catalogImplementation", "hive") \
    .enableHiveSupport() \
    .getOrCreate()

print("Spark session created for Hive operations.")

## Part 2: Understanding Execution Engines

### MapReduce (Default in older Hive)
```xml
<property>
    <name>hive.execution.engine</name>
    <value>mr</value>
</property>
```

### Spark (Modern, Fast)
```xml
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
```

In [None]:
# Show the execution engine setting used in this teaching lab
# (hard-coded for illustration; this cell does not read hive-site.xml)
print("""
╔════════════════════════════════════════════════════════════════╗
║  Current Teaching Lab Configuration (hive-site.xml):          ║
║                                                                ║
║  <property>                                                    ║
║      <name>hive.execution.engine</name>                        ║
║      <value>mr</value>   ← Currently MapReduce                 ║
║  </property>                                                   ║
║                                                                ║
║  To change to Spark, you would set:                            ║
║      <value>spark</value>                                      ║
╚════════════════════════════════════════════════════════════════╝
""")
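The box above is printed from a hard-coded string. As a rough sketch of how the setting could be read from real configuration content, the hypothetical `get_hive_property` function below parses `hive-site.xml` text with the standard library. The XML is inlined here as an assumption that mirrors the lab configuration; on a real installation you would read the file from Hive's configuration directory, whose location depends on the setup.

```python
import xml.etree.ElementTree as ET

def get_hive_property(xml_text, name, default=None):
    """Return the <value> of the named <property> in hive-site.xml content."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value", default)
    return default

# Inlined sample mirroring the lab configuration (an assumption for this sketch)
sample_hive_site = """
<configuration>
    <property>
        <name>hive.execution.engine</name>
        <value>mr</value>
    </property>
</configuration>
"""

engine = get_hive_property(sample_hive_site, "hive.execution.engine")
print(f"hive.execution.engine = {engine}")
```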

## Part 3: Hive on Spark Configuration

To configure Hive to use Spark as its execution engine (available in Hive 1.1 through 3.x; Hive on Spark was removed in Hive 4), you need:

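1. **Spark JARs** accessible to Hive (the single `spark-assembly` JAR on Spark 1.x; Spark 2.0+ no longer ships an assembly JAR, so the individual Spark JARs are linked instead)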
2. **Hive configuration** updated
3. **YARN resources** for Spark executors

In [None]:
# Configuration required for Hive on Spark
print("""
═══════════════════════════════════════════════════════════════════
CONFIGURATION REQUIRED FOR HIVE ON SPARK
═══════════════════════════════════════════════════════════════════

1. hive-site.xml:
   <property>
       <name>hive.execution.engine</name>
       <value>spark</value>
   </property>
   <property>
       <name>spark.master</name>
       <value>yarn</value>
   </property>
   <property>
       <name>spark.submit.deployMode</name>
       <value>client</value>
   </property>

2. Environment Variable:
   export SPARK_HOME=/opt/spark
   
3. Hive needs Spark JARs:
   - Spark 1.x: copy the spark-assembly JAR to Hive's lib directory
   - Spark 2.0+: no assembly JAR exists; link the required Spark JARs instead
   - Or set spark.home in hive-site.xml
""")
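The three `hive-site.xml` properties listed above can also be generated rather than typed by hand. The sketch below is illustrative only: `render_hive_properties` is a hypothetical helper, and the settings dictionary simply repeats the values from this section.

```python
import xml.etree.ElementTree as ET

def render_hive_properties(settings):
    """Render a dict of settings as hive-site.xml <property> elements."""
    configuration = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # Pretty-print when available (ET.indent was added in Python 3.9)
    if hasattr(ET, "indent"):
        ET.indent(configuration, space="    ")
    return ET.tostring(configuration, encoding="unicode")

# Values taken from the configuration listed in this section
hive_on_spark_settings = {
    "hive.execution.engine": "spark",
    "spark.master": "yarn",
    "spark.submit.deployMode": "client",
}

hive_site_fragment = render_hive_properties(hive_on_spark_settings)
print(hive_site_fragment)
```

Generating the fragment from a dictionary keeps the settings in one place, which helps when the same values must be applied to several environments.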

## Part 4: The Better Approach - Use Spark SQL Directly

Instead of configuring Hive to use Spark, the modern approach is:

**Use Spark SQL directly with Hive Metastore!**

This gives you:
- Spark's fast in-memory processing
- Access to all Hive tables via metastore
- No complex Hive-on-Spark configuration
- Better control over resources

In [None]:
# Create test database and table for comparison
spark.sql("CREATE DATABASE IF NOT EXISTS engine_demo")
spark.sql("USE engine_demo")

# Generate sample data (only the functions used below are imported)
from pyspark.sql.functions import rand, floor

# Create a larger dataset for meaningful comparison
num_rows = 100000

test_data = spark.range(num_rows) \
    .withColumn("category", (floor(rand() * 10)).cast("int")) \
    .withColumn("value", (rand() * 1000)) \
    .withColumn("region", (floor(rand() * 5)).cast("int"))

# Save as Hive table
test_data.write.mode("overwrite").saveAsTable("benchmark_data")

print(f"Created benchmark table with {num_rows:,} rows")
spark.sql("SELECT COUNT(*) as row_count FROM benchmark_data").show()

In [None]:
# Demonstrate Spark SQL performance
import time

# Complex aggregation query
query = """
    SELECT 
        category,
        region,
        COUNT(*) as count,
        SUM(value) as total_value,
        AVG(value) as avg_value,
        MAX(value) as max_value,
        MIN(value) as min_value
    FROM benchmark_data
    GROUP BY category, region
    ORDER BY total_value DESC
"""

# Time Spark SQL execution (includes planning and collecting results to the driver)
start = time.perf_counter()
result = spark.sql(query)
result.collect()  # Force execution
spark_time = time.perf_counter() - start

print("=== Spark SQL Results ===")
result.show(10)
print(f"\n⏱️  Spark SQL execution time: {spark_time:.2f} seconds")
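A single timing like the one above can be skewed by one-off costs such as startup and caching. As a minimal sketch, the hypothetical `time_runs` helper below repeats an action several times and reports the median. A plain Python workload stands in for the query so the example runs without a cluster; in the notebook you would pass something like `lambda: spark.sql(query).collect()` instead.

```python
import statistics
import time

def time_runs(action, repeats=3):
    """Run `action` several times and return (all_timings, median) in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings, statistics.median(timings)

# Stand-in workload (an assumption for this sketch, not a Spark query)
timings, median_seconds = time_runs(lambda: sum(range(100_000)), repeats=5)
print(f"runs: {len(timings)}, median: {median_seconds:.4f} seconds")
```

The median is less sensitive than the mean to a slow first run, which is why it is used here.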

In [None]:
# Show query execution plan
print("=== Spark Query Execution Plan ===")
spark.sql(query).explain(mode="formatted")

## Part 5: When to Use What

### Use Hive with MapReduce when:
- You run large batch ETL jobs overnight and latency is not a concern
- Cluster memory is limited
- Legacy systems require Hive compatibility

### Use Hive on Spark when:
- You have existing Hive queries
- You need faster performance without rewriting them
- Your team is familiar with HiveQL

### Use Spark SQL directly when:
- You do interactive analysis
- You build machine learning pipelines
- You are starting new development
- You need the flexibility of the DataFrame API

In [None]:
# Comparison chart
print("""
╔═══════════════════════════════════════════════════════════════════════╗
║              EXECUTION ENGINE COMPARISON                              ║
╠═══════════════════════════════════════════════════════════════════════╣
║                                                                       ║
║  Approach           │ Speed    │ Memory   │ Complexity │ Best For    ║
║  ──────────────────────────────────────────────────────────────────  ║
║  Hive + MR          │ Slow     │ Low      │ Simple     │ Batch ETL   ║
║  Hive + Tez         │ Medium   │ Medium   │ Medium     │ General     ║
║  Hive + Spark       │ Fast     │ High     │ Complex    │ Migration   ║
║  Spark SQL Direct   │ Fastest  │ High     │ Simple     │ New Dev     ║
║                                                                       ║
╚═══════════════════════════════════════════════════════════════════════╝
""")

## Part 6: Hybrid Architecture (Best Practice)

In production, organizations often run several query engines against one shared Hive Metastore:

```
┌─────────────────────────────────────────────────────────────────┐
│                    HIVE METASTORE                              │
│               (Central Catalog)                                 │
└────────────────────────┬────────────────────────────────────────┘
                         │
         ┌───────────────┼───────────────┐
         │               │               │
         ▼               ▼               ▼
    ┌─────────┐    ┌─────────┐    ┌─────────────┐
    │ HiveQL  │    │ Spark   │    │   Presto/   │
    │ (Batch) │    │  SQL    │    │   Trino     │
    │         │    │ (Fast)  │    │  (BI Tools) │
    └─────────┘    └─────────┘    └─────────────┘
         │               │               │
         └───────────────┴───────────────┘
                         │
                         ▼
                    ┌─────────┐
                    │   HDFS  │
                    │  (ORC/  │
                    │ Parquet)│
                    └─────────┘
```

This allows:
- **Batch ETL**: Hive with Tez or MR for scheduled jobs
- **Interactive Analytics**: Spark SQL for data science
- **BI Dashboards**: Presto/Trino for fast queries
- **All sharing the same tables**: Via Hive Metastore

In [None]:
# Demonstrate the shared metastore concept
print("=== Tables accessible by ALL query engines ===")
spark.sql("SHOW TABLES IN engine_demo").show()

In [None]:
# This table is queryable via:
print("""
The 'benchmark_data' table can be queried by:

1. SPARK SQL (what we're using now):
   spark.sql("SELECT * FROM engine_demo.benchmark_data")

2. HIVE (via beeline):
   SELECT * FROM engine_demo.benchmark_data;

3. PRESTO/TRINO (if configured):
   SELECT * FROM hive.engine_demo.benchmark_data;

4. JDBC/ODBC TOOLS:
   Connect to HiveServer2 at port 10000

All see the SAME data with the SAME schema!
""")
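The only difference between the engines in the listing above is how the table is named. The sketch below captures that rule in a hypothetical `qualified_table_name` helper; the catalog name `hive` for Presto/Trino is an assumption taken from the example query and depends on how the connector is configured.

```python
def qualified_table_name(engine, database, table, catalog="hive"):
    """Return the table reference an engine would use for a metastore table."""
    engine = engine.lower()
    if engine in ("spark", "hive"):
        # Spark SQL and Hive address the table as database.table
        return f"{database}.{table}"
    if engine in ("presto", "trino"):
        # Presto/Trino prefix the name with a catalog ("hive" is assumed here)
        return f"{catalog}.{database}.{table}"
    raise ValueError(f"Unknown engine: {engine}")

for engine_name in ("spark", "hive", "trino"):
    name = qualified_table_name(engine_name, "engine_demo", "benchmark_data")
    print(f"{engine_name:>5}: SELECT * FROM {name}")
```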

## Summary

Key takeaways:

1. **Hive supports multiple execution engines**: MapReduce, Tez, Spark
2. **Spark is typically the fastest option** for workloads that fit in memory, but it requires more memory than MapReduce or Tez
3. **Modern approach**: Use Spark SQL directly with Hive Metastore
4. **Hive Metastore is the key**: Central catalog for all tools
5. **Hybrid architecture**: Different engines for different workloads

In [None]:
# Cleanup (optional)
# spark.sql("DROP TABLE IF EXISTS benchmark_data")
# spark.sql("DROP DATABASE IF EXISTS engine_demo CASCADE")

In [None]:
spark.stop()
print("Session stopped.")