## OPTIMIZE COMMAND IN DATABRICKS (DELTA LAKE)
### PURPOSE:
- The `OPTIMIZE` command in Databricks compacts small files into larger ones within a Delta table.
- This improves query performance by reducing the number of files that Spark needs to read.

### WHY NEEDED:
- Delta tables can accumulate many small files due to frequent updates, merges and streaming writes.
- `OPTIMIZE` combines these small files into fewer, larger files, which improves read performance.

### HOW IT WORKS:
- It rewrites the data files within each partition (if any) into fewer, larger files.
- Uses a bin-packing algorithm to combine smaller files into target-sized files (~1 GB each by default).
- Only affects the physical layout of the data; it does NOT change the data content.
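
The bin-packing idea can be illustrated with a small pure-Python sketch. This is not Delta Lake's actual implementation: it uses a simple first-fit-decreasing heuristic, and the function name `pack_files`, the sample file sizes, and the 1024 MB target are assumptions chosen only for illustration.

```python
def pack_files(file_sizes_mb, target_mb=1024):
    """Group file sizes into bins whose total does not exceed target_mb.

    Illustrative first-fit-decreasing heuristic (an assumption, not the
    exact algorithm Delta Lake uses). Each bin stands for one output file.
    """
    bins = []
    # Place the largest files first so smaller ones can fill the gaps.
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([size])
    return bins


# Hypothetical small-file sizes in MB.
small_files = [300, 200, 500, 100, 700, 50, 150]
groups = pack_files(small_files)
for i, group in enumerate(groups, start=1):
    print(f"output file {i}: {len(group)} inputs, {sum(group)} MB")
```

With these sample sizes, seven input files are grouped into two output files, each below the target size. The row contents are untouched; only the grouping of rows into files changes.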

In [0]:
# Creating catalog, schema and volume
spark.sql("CREATE CATALOG IF NOT EXISTS sales1_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS sales1_catalog.inputdb")
spark.sql("CREATE VOLUME IF NOT EXISTS sales1_catalog.inputdb.volume1")

In [0]:
%sql
-- Step 1: Create a Delta table
CREATE OR REPLACE TABLE sales1_catalog.inputdb.tblsales
(
    sales_id INT,
    product_id INT,
    region STRING,
    sales_amount DOUBLE,
    sales_date DATE
)
USING DELTA;

In [0]:
%sql
-- Step 2: Insert sample data

-- Let’s add multiple small batches to simulate many small files:

INSERT INTO sales1_catalog.inputdb.tblsales VALUES
  (1, 101, 'North', 1000.50, '2025-10-16'),
  (2, 102, 'South', 500.75, '2025-10-16'),
  (3, 103, 'East', 700.20, '2025-10-16'),
  (4, 104, 'West', 1200.00, '2025-10-16');

INSERT INTO sales1_catalog.inputdb.tblsales VALUES
  (5, 101, 'North', 800.00, '2025-10-17'),
  (6, 102, 'South', 450.00, '2025-10-17'),
  (7, 103, 'East', 600.00, '2025-10-17'),
  (8, 104, 'West', 1100.00, '2025-10-17');

In [0]:
%sql
select * from sales1_catalog.inputdb.tblsales;
-- Python equivalents (run in a Python cell instead of %sql):
--   spark.sql("select * from sales1_catalog.inputdb.tblsales").show()
--   spark.table("sales1_catalog.inputdb.tblsales").show()

In [0]:
%sql
-- Step 3: Check fragmentation (note the numFiles and sizeInBytes columns in the output)
DESCRIBE DETAIL sales1_catalog.inputdb.tblsales;

In [0]:
%sql
-- Step 4: Optimize the table
-- This performs file compaction:
-- Combines many small Parquet files into fewer, larger files (target size is about 1 GB by default).
-- Improves read performance and reduces metadata overhead.

OPTIMIZE sales1_catalog.inputdb.tblsales;

In [0]:
%sql
-- Step 5: Verify compaction

-- After optimization, run DESCRIBE DETAIL again and compare numFiles with Step 3; it should be lower.

DESCRIBE DETAIL sales1_catalog.inputdb.tblsales;
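
To make the before/after comparison explicit, the two `DESCRIBE DETAIL` results can be compared in a Python cell. The sketch below is a minimal helper under stated assumptions: `compaction_summary` is a hypothetical name, and the snapshot values are made up for illustration rather than taken from a real run. Only the `numFiles` and `sizeInBytes` columns of `DESCRIBE DETAIL` are used.

```python
def compaction_summary(before, after):
    """Compare two DESCRIBE DETAIL snapshots taken before and after OPTIMIZE.

    Each snapshot is a dict holding the numFiles and sizeInBytes columns.
    """
    def avg_size(detail):
        # Guard against an empty table (numFiles == 0).
        if not detail["numFiles"]:
            return 0.0
        return detail["sizeInBytes"] / detail["numFiles"]

    return {
        "files_before": before["numFiles"],
        "files_after": after["numFiles"],
        "files_removed": before["numFiles"] - after["numFiles"],
        "avg_file_bytes_before": avg_size(before),
        "avg_file_bytes_after": avg_size(after),
    }


# Made-up snapshots for illustration only. In a notebook they could come from:
#   spark.sql("DESCRIBE DETAIL sales1_catalog.inputdb.tblsales").first().asDict()
before = {"numFiles": 2, "sizeInBytes": 3000}
after = {"numFiles": 1, "sizeInBytes": 1800}
summary = compaction_summary(before, after)
print(summary)
```

A positive `files_removed` together with a larger average file size suggests that compaction took effect.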