# 📌 Managed vs External Tables in Spark

This notebook explains **Managed vs External Tables in Spark** using an example dataset (`customers_1mb.csv`). It also demonstrates how to configure the **default warehouse path** and how to verify tables through the **Hive Metastore**.

## 🔹 Key Concepts
- **Managed Table**: Spark fully controls the **data and metadata**.
- **External Table**: Spark manages only **metadata**, while the data remains **outside** Spark's control.
- **Hive Metastore**: Stores **table definitions** (schema, location, and table type) so that tables can be queried by name with SQL in Spark.

---

In [None]:
# ✅ Step 1: Import necessary libraries
from pyspark.sql import SparkSession

# ✅ Step 2: Define warehouse directory
warehouse_location = '/tmp/pandu/warehouse'

# ✅ Step 3: Create Spark session with Hive support
spark = SparkSession.builder \
    .appName('ManagedVsExternalTables') \
    .config('spark.sql.warehouse.dir', warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

print(f"Spark session created with warehouse location: {warehouse_location}")

### 🔍 Checking Current Warehouse Directory
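You can verify the current **Spark SQL warehouse directory** using the command below. `spark.sql.warehouse.dir` is a static setting, so it only takes effect when it is set before the first Spark session of the application starts. If the value shown differs from `warehouse_location`, a session most likely already existed when Step 3 ran.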

In [1]:
# Show the configured warehouse directory
spark.conf.get('spark.sql.warehouse.dir')

'file:/spark-warehouse'
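
To compare the reported value with the path you configured, it helps to ignore cosmetic differences such as the `file:` prefix or a trailing slash. The helper below is a minimal sketch: `warehouse_path` and `warehouse_matches` are hypothetical names, and ignoring the URI scheme is a simplifying assumption that only makes sense when both values refer to the same filesystem.

```python
import posixpath
from urllib.parse import urlparse


def warehouse_path(uri: str) -> str:
    """Return the normalized path part of a warehouse URI, dropping the scheme."""
    return posixpath.normpath(urlparse(uri).path)


def warehouse_matches(configured: str, reported: str) -> bool:
    """True when both values point at the same path (scheme ignored: an assumption)."""
    return warehouse_path(configured) == warehouse_path(reported)


# With a live session you could call:
# warehouse_matches(warehouse_location, spark.conf.get('spark.sql.warehouse.dir'))
```

For the values shown in this notebook, `warehouse_matches('/tmp/pandu/warehouse', 'file:/spark-warehouse')` is `False`, which signals that the configured directory was not picked up.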

---
## 📂 Loading the Dataset
Now, we load the **customers_1mb.csv** dataset.

In [2]:
# Load CSV data into a DataFrame
df = spark.read \
    .format('csv') \
    .option('header', 'True') \
    .option('inferSchema', 'True') \
    .load('/tmp/customers_1mb.csv')

# Display DataFrame schema
df.printSchema()

root
 |-- customer_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- country: string (nullable = true)
 |-- registration_date: timestamp (nullable = true)
 |-- is_active: boolean (nullable = true)

### 🔄 Creating a Temporary View
We'll create a **temporary view** to allow SQL-like queries before creating tables.

In [3]:
# Create a temporary view for querying
df.createOrReplaceTempView("temp_customers")

# Query the view
spark.sql("SELECT * FROM temp_customers LIMIT 5").show()

+-----------+----------+---------+-----------+-------+-------------------+---------+
|customer_id|      name|     city|      state|country|  registration_date|is_active|
+-----------+----------+---------+-----------+-------+-------------------+---------+
|          0|Customer_0|     Pune|Maharashtra|  India|2023-06-29 00:00:00|    false|
|          1|Customer_1|Bangalore| Tamil Nadu|  India|2023-12-07 00:00:00|     true|
|          2|Customer_2|Hyderabad|    Gujarat|  India|2023-10-27 00:00:00|     true|
|          3|Customer_3|Bangalore|  Karnataka|  India|2023-10-17 00:00:00|    false|
|          4|Customer_4|Ahmedabad|  Karnataka|  India|2023-03-14 00:00:00|    false|
+-----------+----------+---------+-----------+-------+-------------------+---------+



---
## 🏗 Creating a **Managed Table**
- Spark **stores** the data inside the **warehouse directory** (`/tmp/pandu/warehouse`).
- If you **drop** this table, the **data is also deleted**.

In [None]:
# Creating a Managed Table
spark.sql("DROP TABLE IF EXISTS managed_customers")
spark.sql("""
    CREATE TABLE managed_customers AS 
    SELECT * FROM temp_customers
""")
print("✅ Managed table 'managed_customers' created.")
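
As a rough guide to where the files of a managed table should appear, the sketch below builds the expected directory from the warehouse path and the table name. It assumes the usual Hive layout (tables of the `default` database directly under the warehouse directory, other databases under `<database>.db/`) and that the database location has not been changed. `expected_table_dir` is a hypothetical helper, not a Spark API.

```python
import posixpath


def expected_table_dir(warehouse_dir: str, table: str, database: str = "default") -> str:
    """Guess the directory of a managed table under the usual Hive warehouse layout."""
    table = table.lower()  # assumption: the metastore stores table names in lowercase
    if database.lower() == "default":
        return posixpath.join(warehouse_dir, table)
    return posixpath.join(warehouse_dir, f"{database.lower()}.db", table)


print(expected_table_dir("/tmp/pandu/warehouse", "managed_customers"))
```

Treat the result as a starting point only; the `Location` field reported by `DESCRIBE EXTENDED` is the authoritative value.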

In [None]:
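# Point the default database at the intended warehouse path.
# This only changes where tables created afterwards are stored; existing tables
# are not moved, so run it before creating the managed table if you want it to apply.
# Support depends on the metastore version behind the catalog.
spark.sql("ALTER DATABASE default SET LOCATION 'file:/tmp/pandu/warehouse'")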


In [6]:
spark.sql('describe extended managed_customers').show(truncate=False)

+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|customer_id                 |int                                                      |null   |
|name                        |string                                                   |null   |
|city                        |string                                                   |null   |
|state                       |string                                                   |null   |
|country                     |string                                                   |null   |
|registration_date           |timestamp                                                |null   |
|is_active                   |boolean                                                  |null   |
|                            |
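
The fields that matter for this comparison (`Type` and `Location`) sit in the "Detailed Table Information" part of the output, below the column list. A small sketch for pulling them into a dictionary follows; `table_details` is a hypothetical helper, and it assumes each row has the three fields shown above (`col_name`, `data_type`, `comment`).

```python
def table_details(rows):
    """Collect key/value pairs from the detailed section of DESCRIBE EXTENDED rows."""
    details = {}
    in_details = False
    for col_name, data_type, _comment in rows:
        if col_name.startswith("#"):
            # Section headers start with '#'; only one section is of interest here.
            in_details = col_name.strip() == "# Detailed Table Information"
        elif in_details and col_name:
            details[col_name] = data_type
    return details


# With a live session (Row objects unpack like tuples):
# info = table_details(spark.sql("DESCRIBE EXTENDED managed_customers").collect())
# info.get("Type"), info.get("Location")
```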

### 🔍 Verify Managed Table
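Run the following command in a **Dataproc terminal** to check where the managed table is stored (inside a notebook cell, prefix it with `!`). If the `Location` reported by `DESCRIBE EXTENDED` differs from this path, list that location instead:
```bash
hdfs dfs -ls /tmp/pandu/warehouse/managed_customers
```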

---
## 📂 Creating an **External Table**
- The data **remains** in `/tmp/customers_1mb.csv`.
- If you **drop** this table, the **data is not deleted**.

In [6]:
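# Creating an External Table
# Spark's data source syntax (USING ...) does not accept the EXTERNAL keyword;
# specifying LOCATION is what makes the table external (unmanaged).
spark.sql("DROP TABLE IF EXISTS external_customers")
spark.sql("""
    CREATE TABLE external_customers
    USING CSV
    OPTIONS (header 'true', inferSchema 'true')
    LOCATION '/tmp/customers_1mb.csv'
""")
print("✅ External table 'external_customers' created.")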
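
An external table can also be created from the DataFrame API: when a `path` option is supplied, `saveAsTable` registers an unmanaged table whose files live at that path. The following is a minimal sketch, not part of the original notebook; `save_as_external_table` is a hypothetical helper name and the target directory is yours to choose. Note that `mode("overwrite")` replaces any data already stored at that location.

```python
def save_as_external_table(df, table_name: str, path: str, fmt: str = "parquet") -> None:
    """Write df to `path` and register it in the metastore as an external table."""
    (
        df.write
        .format(fmt)
        .mode("overwrite")      # replaces existing data at `path`
        .option("path", path)   # an explicit path makes the table external
        .saveAsTable(table_name)
    )


# Hypothetical usage with the DataFrame loaded earlier:
# save_as_external_table(df, "external_customers_copy", "/tmp/external_customers_copy")
```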

### 🔍 Verify External Table
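Run the following command in a terminal to check that the source file is still at its original location (inside a notebook cell, prefix it with `!`):
```bash
hdfs dfs -ls /tmp/customers_1mb.csv
```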

---
## 📊 Verifying Tables in Hive Metastore
You can list the tables in the current database (temporary views such as `temp_customers` appear as well) using:

In [7]:
# Show tables in Spark
spark.sql("SHOW TABLES").show()
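
`SHOW TABLES` does not say whether a table is managed or external. The objects returned by `spark.catalog.listTables()` expose `name` and `tableType` attributes, so they can be grouped as sketched below. `tables_by_type` is a hypothetical helper, and the `sample` list uses stand-in objects so the snippet runs without a Spark session.

```python
from collections import defaultdict
from types import SimpleNamespace


def tables_by_type(tables):
    """Group table names by their tableType (e.g. MANAGED, EXTERNAL, TEMPORARY)."""
    grouped = defaultdict(list)
    for table in tables:
        grouped[table.tableType].append(table.name)
    return {table_type: sorted(names) for table_type, names in grouped.items()}


# Stand-ins for illustration; with Spark, pass spark.catalog.listTables() instead.
sample = [
    SimpleNamespace(name="managed_customers", tableType="MANAGED"),
    SimpleNamespace(name="external_customers", tableType="EXTERNAL"),
    SimpleNamespace(name="temp_customers", tableType="TEMPORARY"),
]
print(tables_by_type(sample))
```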

### 🛑 Dropping the Tables
Dropping a **Managed Table** deletes both **metadata and data**, while dropping an **External Table** deletes only **metadata**.

In [8]:
# Drop managed table (Data is deleted!)
spark.sql("DROP TABLE IF EXISTS managed_customers")

# Drop external table (Data is NOT deleted!)
spark.sql("DROP TABLE IF EXISTS external_customers")

print("✅ Tables dropped successfully.")
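
To confirm the behaviour described above, note each table's `Location` before dropping it and check afterwards whether that path still exists. The sketch below only handles locations on the local filesystem (no scheme or `file:`); for HDFS or cloud storage you would use `hdfs dfs -ls` or the relevant client instead. `local_data_exists` is a hypothetical helper.

```python
import os
from urllib.parse import urlparse


def local_data_exists(location: str) -> bool:
    """Check whether a local table location still exists on disk."""
    parsed = urlparse(location)
    if parsed.scheme not in ("", "file"):
        raise ValueError(f"Not a local filesystem location: {location}")
    return os.path.exists(parsed.path)


# Hypothetical usage, with locations recorded before the DROP statements:
# local_data_exists(managed_location)    # expected False after dropping a managed table
# local_data_exists(external_location)   # expected True after dropping an external table
```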

## 🎯 Summary
| Feature | Managed Table | External Table |
|---------|--------------|----------------|
| Data Location | Inside warehouse | Custom location |
| Dropping Table | Deletes data | Only deletes metadata |
| Performance | Depends on file format and storage, not on table type | Depends on file format and storage, not on table type |

**🚀 Now you understand the difference between Managed and External Tables in Spark!**