# M01: Platform & Workspace

We explore the Databricks Lakehouse platform — architecture, compute types, Unity Catalog, and the notebook workspace. Understanding these concepts is foundational for the exam (24% of questions). After this module, you'll navigate the workspace confidently, create clusters, and manage Unity Catalog objects.

| Exam Domain | Weight |
|---|---|
| Databricks Lakehouse Platform | **24%** |

---

## Lakehouse Architecture

The Lakehouse architecture combines the cheap, flexible storage of Data Lakes with the reliability and governance of Data Warehouses. Understanding this evolution is key for the exam.

---

### The Evolution of Data Architectures

**Generation 1: Data Warehouse (1990s-2000s)**
- Teradata, Oracle, SQL Server
- Structured data only, expensive storage
- Great for BI, terrible for ML/unstructured data

**Generation 2: Data Lake (2010s)**
- Hadoop, S3, ADLS
- Cheap storage, any format
- Problem: "Data Swamp" - no governance, no ACID, unreliable

**Generation 3: Lakehouse (2020s)**
- Delta Lake, Iceberg, Hudi
- Best of both: cheap storage + ACID + governance
- Single platform for BI, ML, streaming

<img src="../../../assets/images/4c9090bd82f2475c810bafde13f978e0.png" width="800">




### Cost Comparison (Rough Estimates)

| Component | Traditional (DW + Lake) | Lakehouse |
|-----------|------------------------|-----------|
| Storage | $$$$ (2x for Lake + DW) | $$ (single copy) |
| ETL Compute | $$$ (sync jobs) | $$ (no sync needed) |
| Governance Tools | $$$ (separate tools) | $ (built-in) |
| Latency | Hours (ETL sync) | Minutes (direct access) |
| **Total TCO** | **Higher** | **30-50% lower** |

*Note: Actual costs depend on workload. Run POC with your data to validate.*

### Alternatives to Databricks Lakehouse

| Alternative | Pros | Cons | When to Choose |
|-------------|------|------|----------------|
| **Snowflake** | Mature, great SQL | Separate from ML, vendor lock-in | Pure SQL/BI workloads |
| **BigQuery** | Serverless, cheap storage | GCP-only, less flexible | GCP shop, ad-hoc analytics |
| **Spark + Iceberg on K8s** | Open source, no vendor | Complex ops, no unified governance | Strong DevOps team, cost-sensitive |
| **Databricks** | Unified platform, strong ML/AI | Premium pricing | ML + Analytics + Streaming |

---

### How Apache Spark Works — Distributed Execution & Lazy Evaluation

Understanding Spark's execution model is essential before working with Databricks. Every query you run — whether PySpark or SQL — follows the same principles.

---

#### Driver & Executors Architecture

<img src="../../../assets/images/d82b6da777ca4f3a9eee333717287c15.png" width="800">

- **Driver**: a single process that coordinates the entire job. It runs on the driver node, holds the `SparkSession`, builds the query plan, and schedules tasks.
- **Executors**: worker JVM processes running on the cluster's worker nodes. Each executor runs several tasks in parallel (one per core), and each task processes one **partition**.
- **Partition**: a chunk of data (about 128 MB by default when reading files). Spark processes partitions in parallel across executors.
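
To make the partition arithmetic concrete, here is a minimal sketch in plain Python (no Spark required). It assumes the ~128 MB partition size mentioned above and a hypothetical cluster of 4 executors with 4 cores each. Real partition counts also depend on file layout, compression, and configuration, so treat the result as a rough estimate.

```python
import math

# Assumption: ~128 MB per input partition, as quoted above.
TARGET_PARTITION_MB = 128


def estimate_partitions(dataset_mb: float, partition_mb: float = TARGET_PARTITION_MB) -> int:
    """Rough number of input partitions for a dataset of the given size."""
    return max(1, math.ceil(dataset_mb / partition_mb))


def estimate_task_waves(partitions: int, executors: int, cores_per_executor: int) -> int:
    """One partition = one task on one core, so tasks run in waves of `total_cores`."""
    total_cores = executors * cores_per_executor
    return math.ceil(partitions / total_cores)


# Hypothetical example: a 10 GB dataset on 4 executors x 4 cores.
partitions = estimate_partitions(10 * 1024)
waves = estimate_task_waves(partitions, executors=4, cores_per_executor=4)
print(f"{partitions} partitions -> {waves} waves of up to 16 parallel tasks")
```

With these numbers the job needs 80 tasks, which 16 cores work through in 5 waves. Adding cores reduces the number of waves, not the number of partitions.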

> **Exam Tip:** Spark does NOT move data to the computation. It moves small task code to where the data resides (data locality). This is a fundamental design principle.

---

#### Lazy Evaluation — Nothing Happens Until You Ask for Results

Spark uses **lazy evaluation**: transformations (e.g., `filter`, `select`, `groupBy`) are NOT executed immediately. They are recorded as a logical plan. Execution only starts when an **action** is called.

| Type | Examples | What Happens |
|------|----------|-------------|
| **Transformation** (lazy) | `filter()`, `select()`, `groupBy()`, `join()`, `withColumn()` | Added to the logical plan, nothing computed |
| **Action** (triggers execution) | `count()`, `show()`, `collect()`, `display()`, `write.save()`, `write.saveAsTable()` | Triggers the entire pipeline execution |

**Why lazy evaluation?** It allows Spark to **optimize the entire pipeline** before executing it. The Catalyst Optimizer can:
- Reorder operations for efficiency (e.g., push filters before joins)
- Combine multiple transformations into a single pass over the data
- Prune unnecessary columns early (column pruning)
- Choose optimal join strategies (broadcast vs. shuffle)
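
The record-then-execute idea can be illustrated with a toy model in plain Python. This is a sketch of the concept only, not how Spark is implemented: the hypothetical `LazyPipeline` class stores transformations in a list (standing in for the logical plan) and runs them in one pass when the `collect()` action is called.

```python
from typing import Callable, Iterable, List, Optional


class LazyPipeline:
    """Toy model of lazy evaluation: transformations are recorded, actions run them."""

    def __init__(self, data: Iterable[int], plan: Optional[List[Callable]] = None):
        self._data = list(data)
        self._plan = plan or []  # recorded steps, analogous to a logical plan

    def filter(self, predicate: Callable[[int], bool]) -> "LazyPipeline":
        # Transformation: only record the step, compute nothing.
        step = lambda rows: (r for r in rows if predicate(r))
        return LazyPipeline(self._data, self._plan + [step])

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Transformation: only record the step, compute nothing.
        step = lambda rows: (fn(r) for r in rows)
        return LazyPipeline(self._data, self._plan + [step])

    def collect(self) -> List[int]:
        # Action: chain all recorded steps and pull the data through them once.
        rows = iter(self._data)
        for step in self._plan:
            rows = step(rows)
        return list(rows)


calls = []


def is_even(x: int) -> bool:
    calls.append(x)  # track when the predicate actually runs
    return x % 2 == 0


pipeline = LazyPipeline(range(6)).filter(is_even).map(lambda x: x * 10)
print(len(calls))          # 0: nothing has been evaluated yet
print(pipeline.collect())  # [0, 20, 40]
print(len(calls))          # 6: the predicate ran only when the action was called
```

Because the steps are chained generators, the filter and the map run in a single pass over the data, which loosely mirrors how Spark fuses narrow transformations within a stage.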

---

#### Execution Flow: From Code to Results

<img src="../../../assets/images/32aef39a4c8c47b0bd849732025ab43a.png" width="800">

---

#### Stages and Shuffles

Spark divides a job into **stages**. A stage boundary occurs when data needs to be **shuffled** (redistributed across the cluster):

| Operation | Shuffle? | Explanation |
|-----------|----------|-------------|
| `filter()`, `select()` | No (narrow) | Each partition processed independently |
| `groupBy().agg()` | Yes (wide) | Data must be grouped by key across partitions |
| `join()` | Yes (wide)* | Data with same key must be on same executor |
| `repartition()` | Yes | Explicitly redistributes data |
| `coalesce()` | No | Reduces partitions without full shuffle |

*Exception: broadcast join avoids shuffle by sending the small table to all executors.
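
The difference between narrow and wide operations can be sketched in plain Python. This is a simplified illustration, assuming two in-memory "partitions" and a toy partitioner; Spark uses its own hash function and moves data over the network between executors.

```python
from collections import defaultdict

# Two input partitions of (department, salary) rows; values are illustrative.
partitions = [
    [("Engineering", 95000), ("Sales", 75000)],
    [("Engineering", 105000), ("Marketing", 68000)],
]


def narrow_filter(parts, predicate):
    """Narrow: each partition is processed on its own, no data exchange."""
    return [[row for row in part if predicate(row)] for part in parts]


def shuffle_by_key(parts, num_partitions):
    """Wide: rows are redistributed so equal keys land in the same partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic toy partitioner (an assumption for this sketch).
            target = sum(key.encode()) % num_partitions
            shuffled[target].append((key, value))
    return shuffled


def sum_per_key(parts):
    """After the shuffle, each partition can aggregate its keys independently."""
    totals = {}
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        totals.update(local)
    return totals


high_paid = narrow_filter(partitions, lambda row: row[1] > 70000)
totals = sum_per_key(shuffle_by_key(partitions, num_partitions=2))
print(totals["Engineering"])  # 200000
```

The filter never looks outside its own partition, while the aggregation is only correct because the shuffle first brought both `Engineering` rows together.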

> **Exam Tip:** Shuffles are among the most expensive operations in Spark because they move data across the network. Minimizing shuffles (e.g., using broadcast joins for small tables, pre-partitioning data) is key to performance. In Databricks, Adaptive Query Execution (AQE) automatically optimizes many shuffle scenarios.

---

#### Key Exam Concepts Recap

| Concept | Definition |
|---------|-----------|
| **Lazy evaluation** | Transformations are recorded but not executed until an action is called |
| **DAG (Directed Acyclic Graph)** | Execution plan showing stages and dependencies |
| **Partition** | Unit of parallelism — one partition = one task on one core |
| **Shuffle** | Data redistribution across cluster (expensive, causes stage boundary) |
| **Catalyst Optimizer** | Rule-based + cost-based optimizer that rewrites logical plans |
| **Adaptive Query Execution (AQE)** | Runtime optimization: adjusts shuffle partitions, converts joins, handles skew |
| **Narrow transformation** | No data exchange between partitions (map, filter) |
| **Wide transformation** | Requires shuffle (groupBy, join, distinct) |

## Databricks Platform Elements

Key elements of the Databricks platform: Workspace, Catalog Explorer, Git Folders, Volumes. Understanding the platform's structure is the foundation of a Data Engineer's work.

---

### Per-user Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ../../setup/00_setup

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
import re

# Display user context (variables from 00_setup)
print(f"Catalog: {CATALOG}")
print(f"Schema Bronze: {BRONZE_SCHEMA}")
print(f"Schema Silver: {SILVER_SCHEMA}")
print(f"Schema Gold: {GOLD_SCHEMA}")
print(f"User: {raw_user}")

### Comparison of Traditional Architecture vs Lakehouse

**Objective:** Visualize differences between traditional approach (Data Lake + Data Warehouse) and Lakehouse.

**Traditional Architecture:**
<img src="../../../assets/images/49f3830d3784442ea5582bc82e6fb89c.png" width="800">

**Lakehouse Benefits:**
- Single copy of data (single source of truth)
- Lower storage costs
- Elimination of synchronization latency
- Common governance for all use cases

### Databricks Platform Elements

**Theoretical Introduction:**

The Databricks platform consists of several key components that together create a complete environment for working with data in the Lakehouse architecture.

**Key Components:**
- **Workspace**: Working environment containing notebooks, experiments, folders, and resources
- **Catalog Explorer**: Interface for managing catalogs, schemas, tables, and views
- **Git Folders (formerly Repos)**: Git integration for versioning notebooks and code
- **Volumes**: Management of unstructured files (images, models, artifacts)
- **DBFS (Databricks File System)**: Virtual file system over cloud storage

**Practical Application:**
- Workspace organizes projects and team collaboration
- Catalog Explorer enables data exploration and governance
- Git Folders integrates development workflow with Git

### Workspace Exploration

#### Example: Workspace Exploration

**Objective:** Become familiar with the Databricks Workspace interface

**Workspace Elements:**
1. **Sidebar** (left side):
   - Workspace: Folders and notebooks
   - Git Folders: Git Integration
   - Compute: Cluster management
   - Workflows: Lakeflow Jobs
   - Catalog: Unity Catalog explorer

2. **Main Panel**: Notebook editor or details view

3. **Top Bar**: Quick access to compute, account, help

**Navigation Instructions:**
- Use the left menu to switch between sections
- In the Catalog section, you can browse catalogs, schemas, and tables
- In the Compute section, you manage Spark clusters

#### Databricks Workspace UI

<img src="../../../assets/images/848bc3658ab44bb09f586bd2b1f4231e.png" width="800">

### Catalog Explorer - Unity Catalog Structure

#### Example: Catalog Explorer

**Objective:** Understand object hierarchy in Unity Catalog

#### Catalog Explorer Screenshot

<img src="../../../assets/images/32356ba877a74bfe87feeb6d6ee93a46.png" width="800">


In [0]:
# Display current catalog and schema
current_catalog = spark.sql("SELECT current_catalog()").collect()[0][0]
current_schema = spark.sql("SELECT current_schema()").collect()[0][0]

print(f"Current catalog: {current_catalog}")
print(f"Current schema: {current_schema}")

**Unity Catalog Hierarchy:**

<img src="../../../assets/images/cddc09f5ffc5482aa3063a13a7c4f927.png" width="800">

> **3-level namespace:** `catalog.schema.object` — e.g., `prod.gold.sales_summary`

### Browsing Catalogs and Schemas

#### Example: Browsing Catalogs and Schemas

**Objective:** Programmatic listing of objects in Unity Catalog

In [0]:
# List of all catalogs available to the user
catalogs_df = spark.sql("SHOW CATALOGS")
display(catalogs_df)

In [0]:
# List of schemas in the current catalog
schemas_df = spark.sql(f"SHOW SCHEMAS IN {CATALOG}")
display(schemas_df)

### Git Folders and Git Integration

In practice, working with code in Databricks should be based on **Git Folders** (formerly Repos), not single, orphaned notebooks in Workspace.

Typical workflow:

1. **Create Git Folder** in Databricks: `Workspace → Git Folders → Add Repo`.
2. **Connect to Git** (GitHub / Azure DevOps / other).
3. Work on **feature branches** (e.g., `feature/cleaning-module`).
4. Regularly:
   - commit and push changes from Databricks to remote repo,
   - create PR and merge to main/dev.

Best Practices:

- One repo per project/domain (e.g., `databricks-dea-training`).
- Do not work in **Workspace root** – always in **Git Folders**.
- Training notebooks, test data, and README can be in one repo.

### Volumes vs DBFS

Where should you store files? In new Unity Catalog-based workspaces, **Volumes** are the preferred location.

- `dbfs:/` is treated as a **legacy** layer or auxiliary area.
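- `/Volumes/<catalog>/<schema>/<volume_name>` is a fully managed, UC-controlled data area (permissions, audit, lineage). Note that the path uses slashes, while SQL identifiers use dots (`catalog.schema.volume_name`).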

Volume Definition Example (SQL):

```sql
CREATE VOLUME IF NOT EXISTS ${catalog}.${schema}.training_volume
COMMENT 'Workspace for training purposes';
```

Usage Example in PySpark:

```python
catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")

volume_path = f"/Volumes/{catalog}/{schema}/training_volume"
display(dbutils.fs.ls(volume_path))
```
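
Because volume paths use slashes while SQL names use dots, a small helper can keep the two from being mixed up. The following is a minimal sketch in plain Python; the function name `volume_path` and the file names are hypothetical, and the validation is deliberately simple.

```python
import posixpath


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a Unity Catalog volume path: /Volumes/<catalog>/<schema>/<volume>/..."""
    for name in (catalog, schema, volume):
        # Simplification for this sketch: reject empty names and path/namespace separators.
        if not name or "/" in name or "." in name:
            raise ValueError(f"Invalid name component: {name!r}")
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)


# Hypothetical names, for illustration only.
print(volume_path("retailhub_trainer", "bronze", "training_volume", "raw", "orders.csv"))
# /Volumes/retailhub_trainer/bronze/training_volume/raw/orders.csv
```

In a notebook, the returned string can be passed to `dbutils.fs.ls(...)` as in the example above.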

### SQL Warehouse

A SQL engine optimized for BI and ad-hoc analytics, an alternative to notebook clusters.

When to use:
- Reporting in Power BI / other BI tools.
- Business analysts / power users working mainly in SQL.
- Interactive dashboards and ad-hoc queries to **Gold** layer.

Differences from all-purpose cluster:
- Billing in **SQL DBUs** (rates differ from all-purpose compute).
- Automatic provisioning / scaling.
- Isolation of BI workload from engineering clusters.

---

## Compute Resources

Types of compute resources in Databricks: All-Purpose Clusters, Job Clusters, SQL Warehouses. Choosing the right compute directly impacts costs.

---

### The Real Question: How Much Will This Cost?

As a Data Engineer, you'll be asked: *"Why is our Databricks bill so high?"*

Understanding compute options is essential for cost control.

### Compute Options Comparison

| Type | Startup Time | Cost Model | Best For |
|------|--------------|------------|----------|
| **All-Purpose Cluster** | 3-5 min | Per-minute (running) | Interactive development, exploration |
| **Job Cluster** | 3-5 min | Per-minute (only during job) | Scheduled production jobs |
| **Serverless** | <10 sec | DBUs for the time compute is actually in use | Ad-hoc queries, variable workloads |
| **SQL Warehouse** | Seconds (serverless) or 3-5 min (pro/classic) | DBUs while the warehouse is running | BI tools, SQL analysts |

### Cost Optimization Strategies

**1. Right-size clusters:**
- Development: 2-4 workers, smallest instance type
- Production: Autoscaling 2-10 workers based on workload

**2. Use Spot/Preemptible instances:**
- 60-80% cost savings for workers
- Driver on on-demand (stability)
- Trade-off: Job may be interrupted

**3. Photon Engine:**
- 2-3x faster for aggregations/joins
- ~2x DBU cost, but finishes faster = often cheaper
- Enable for: large scans, aggregations, joins
- Skip for: simple transformations, ML training
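
The "finishes faster = often cheaper" claim comes down to simple arithmetic, sketched below. The runtime, DBU rate, and price are hypothetical numbers chosen for illustration, and the `~2x` DBU multiplier is the rough figure from the list above. Cloud VM charges are ignored here; since they also shrink with runtime, the real break-even point is somewhat more favorable to Photon.

```python
def job_cost(runtime_hours: float, dbu_per_hour: float, price_per_dbu: float) -> float:
    """DBU cost of one job run (cloud VM charges are ignored in this sketch)."""
    return runtime_hours * dbu_per_hour * price_per_dbu


def photon_cost(
    base_runtime_hours: float,
    dbu_per_hour: float,
    price_per_dbu: float,
    speedup: float,
    dbu_multiplier: float = 2.0,  # assumption: the ~2x figure quoted above
) -> float:
    """Same job with Photon: runtime shrinks by `speedup`, DBU rate grows by `dbu_multiplier`."""
    return job_cost(base_runtime_hours / speedup, dbu_per_hour * dbu_multiplier, price_per_dbu)


# Hypothetical numbers: a 2-hour job, 10 DBU/hour, $0.15 per DBU.
baseline = job_cost(2.0, 10, 0.15)
for speedup in (1.5, 2.0, 3.0):
    with_photon = photon_cost(2.0, 10, 0.15, speedup)
    verdict = "cheaper" if with_photon < baseline else "not cheaper"
    print(f"speedup {speedup}x: baseline ${baseline:.2f}, Photon ${with_photon:.2f} ({verdict})")
```

In DBU terms, Photon pays off only when the speedup exceeds the DBU multiplier, which is why it suits scan-heavy aggregations and joins better than simple transformations.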

**4. Cluster policies:**
- Enforce maximum worker count
- Require autoscaling
- Set auto-termination (e.g., 30 min idle)
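
A policy covering two of these rules might look like the sketch below, written as a Python dictionary and serialized to JSON. The attribute names (`autoscale.max_workers`, `autotermination_minutes`) and rule types (`range`, `fixed`) follow the cluster policy definition format as I understand it; verify them against the policy reference for your workspace before use.

```python
import json

# Sketch of a policy definition; values mirror the bullets above.
policy_definition = {
    # Cap autoscaling at 10 workers.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Always terminate after 30 idle minutes.
    "autotermination_minutes": {"type": "fixed", "value": 30},
}

print(json.dumps(policy_definition, indent=2))
```

The resulting JSON is the kind of document that is pasted into the policy editor or sent through the policies API.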

### Decision Tree: Which Compute to Use?

<img src="../../../assets/images/f98e20f71ec541eb9b206877f1da98b5.png" width="800">

---

### Cluster Info

Check the runtime version and Photon status on the current cluster.

In [0]:
# Cluster runtime and Photon status
dbr_version = spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion", "unknown")
photon = spark.conf.get("spark.databricks.photon.enabled", "false")
print(f"Runtime: {dbr_version}  |  Photon: {photon}")

## Magic Commands

Magic commands allow you to switch between languages and perform system operations directly from notebook cells.

---

| Command | Purpose |
|---------|---------|
| `%sql` | SQL cell |
| `%python` | Python cell (default) |
| `%md` | Markdown documentation |
| `%fs` | File system operations (shorthand for `dbutils.fs`) |
| `%sh` | Shell commands |
| `%run` | Execute another notebook |
| `%pip` | Install notebook-scoped libraries |

### Monitoring

Where to look for problems: **Cluster → Event log** | **Spark UI** (Jobs, SQL tabs) | **Driver/Executor logs**

> **Best Practice:** For production pipelines, log to Delta tables — not just cluster logs.

### Demo: %sql

In [0]:
%sql
-- SQL magic command allows writing pure SQL without Python wrapper

SELECT 
  current_catalog() as catalog,
  current_schema() as schema,
  current_user() as user,
  current_timestamp() as timestamp

### Demo: Mixing Python + SQL

Create data in Python → query with SQL via temp view:

In [0]:
# Python: Raw data definition
data = [
    (1, "Alice", "Engineering", 95000),
    (2, "Bob", "Sales", 75000),
    (3, "Charlie", "Engineering", 105000),
    (4, "Diana", "Marketing", 68000),
    (5, "Eve", "Engineering", 98000)
]

# Schema definition
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("department", StringType(), False),
    StructField("salary", IntegerType(), False)
])

In [0]:
# Create DataFrame
df = spark.createDataFrame(data, schema)
display(df)

In [0]:
# Register as temp view for SQL access
df.createOrReplaceTempView("employees_temp")

In [0]:
%sql
-- SQL: Aggregation on Python data

SELECT 
  department,
  COUNT(*) as employee_count,
  AVG(salary) as avg_salary,
  MAX(salary) as max_salary
FROM employees_temp
GROUP BY department
ORDER BY avg_salary DESC

### Demo: %pip (notebook-scoped libraries)

In [0]:
# Install emoji library
%pip install emoji

In [0]:
import emoji
print(emoji.emojize('Databricks is :fire:'))

### Databricks Assistant (AI)

In 2025, coding work is assisted by AI. Databricks has a built-in assistant (**Databricks Assistant**) that is context-aware of your data (knows table schemas in Unity Catalog!).

**How to use?**
1. Shortcut **Cmd+I** (Mac) or **Ctrl+I** (Windows) inside a cell.
2. "Assistant" side panel.

**What is it for?**
- **Code Generation**: "Write a SQL query that calculates average sales by region from the sales table".
- **Code Explanation**: Select a complex snippet and ask "Explain this code".
- **Fixing errors**: When a cell returns an error, click "Diagnose Error" – the assistant will explain the cause and propose a fix.
- **Transformation**: "Rewrite this code from PySpark to SQL".

---

## Unity Catalog

A modern metadata management system replacing Hive Metastore. Provides centralized access control, lineage, and governance across the entire organization.

---

### Theoretical Introduction

Databricks supports two metadata systems: legacy Hive Metastore and modern Unity Catalog. Unity Catalog is recommended for all new projects due to advanced governance and security features.

**Key Differences:**

| Aspect | Hive Metastore | Unity Catalog |
|--------|----------------|---------------|
| **Governance** | Limited | Full: RBAC, masking, audit |
| **Namespace** | 2-level (db.table) | 3-level (catalog.schema.table) |
| **Cross-workspace** | No | Yes (shared metastore) |
| **Lineage** | None | End-to-end lineage |
| **Data Sharing** | Limited | Delta Sharing protocol |
| **Isolation** | Workspace-level | Catalog-level |

**Why Unity Catalog?**
- Central access management for all workspaces
- Automatic lineage for audit and compliance
- Fine-grained permissions (column-level, row-level)
- Integration with external systems (Delta Sharing)
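
Permissions in Unity Catalog follow the object hierarchy: to query a table, a principal needs access to the catalog and the schema as well as the table itself. The sketch below builds the three corresponding `GRANT` statements as strings. The helper name `read_grants` and the group name `analysts` are hypothetical.

```python
from typing import List


def read_grants(catalog: str, schema: str, table: str, principal: str) -> List[str]:
    """Statements a principal needs in order to query one table in Unity Catalog."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]


# Hypothetical group name; in a notebook each statement could be run with spark.sql(statement).
for statement in read_grants("prod", "gold", "sales_summary", "analysts"):
    print(statement)
```

Granting `SELECT` alone is not enough: without `USE CATALOG` and `USE SCHEMA` the principal cannot reach the table.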

### Namespace - Hive vs Unity Catalog

#### Example: Namespace Comparison

**Objective:** Compare table access syntax
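
As a small illustration of the two syntaxes: in a Unity Catalog-enabled workspace, legacy Hive tables are reachable through the `hive_metastore` catalog, so a 2-level `db.table` name maps to `hive_metastore.db.table`. The helper below is a sketch in plain Python; the function name and the `sales.orders` table are hypothetical.

```python
def qualify_table_name(name: str, default_catalog: str = "hive_metastore") -> str:
    """Return a 3-level name; 2-level Hive names get the legacy catalog prefixed."""
    parts = name.split(".")
    if len(parts) == 3:
        return name
    if len(parts) == 2:
        return f"{default_catalog}.{name}"
    raise ValueError(f"Expected db.table or catalog.schema.table, got {name!r}")


print(qualify_table_name("sales.orders"))             # hive_metastore.sales.orders
print(qualify_table_name("prod.gold.sales_summary"))  # prod.gold.sales_summary
```

In a notebook, either result can be passed to `spark.table(...)`.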

### Creating a Table in Unity Catalog

#### Example: Creating a Table

**Objective:** Demonstrate full syntax with 3-level namespace

In [0]:
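# Create sample table in Unity Catalog
table_name = f"{CATALOG}.{BRONZE_SCHEMA}.lakehouse_demo"

# Demo data
demo_data = [
    (1, "Unity Catalog", "Enabled", "2024-01-15"),
    (2, "Delta Lake", "Enabled", "2024-01-15"),
    (3, "Photon Engine", "Enabled", "2024-01-15"),
    (4, "Hive Metastore", "Legacy", "2024-01-15")
]

# Build the DataFrame written in the next cell; column names match the queries used later
demo_df = spark.createDataFrame(demo_data, ["id", "feature", "status", "date"])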

In [0]:
# Save as Delta Table in Unity Catalog
demo_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(table_name)

In [0]:
# Verification
display(spark.table(table_name))

### Task: Check in UI

1. Click **Catalog** in the left sidebar.
2. Find your catalog (name in `CATALOG` variable, e.g., `retailhub_...`).
3. Expand the `bronze` schema (or other defined in `BRONZE_SCHEMA`).
4. Click on the `lakehouse_demo` table.
5. See tabs: **Sample Data** (preview) and **Lineage** (data origin).

**Explanation:**

The table was created with a full 3-level namespace. In Unity Catalog, every table automatically:
- Is managed by the governance system
- Has tracked lineage
- Has permissions assigned based on catalog and schema
- Is available in Catalog Explorer for exploration

**Managed vs External Tables:**
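The table above is a **Managed Table**. Databricks manages both metadata and data files (in default catalog/schema storage). Dropping the table (`DROP TABLE`) also deletes the underlying data files; in Unity Catalog they are removed after a retention period rather than immediately.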

**External Table** is created when we provide `LOCATION 'path'`. Then `DROP TABLE` removes only metadata, and files remain in storage.
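
The only syntactic difference between the two table types is the `LOCATION` clause, as the sketch below shows by building both DDL statements as strings. The table names and the storage path are hypothetical, and an external table additionally requires an external location configured for that path.

```python
from typing import Optional


def create_table_ddl(full_name: str, columns: str, location: Optional[str] = None) -> str:
    """Build CREATE TABLE DDL: no LOCATION -> managed, LOCATION -> external."""
    ddl = f"CREATE TABLE IF NOT EXISTS {full_name} ({columns})"
    if location:
        ddl += f" LOCATION '{location}'"
    return ddl


managed = create_table_ddl("prod.bronze.events_managed", "id INT, feature STRING")

# Hypothetical storage path, for illustration only.
external = create_table_ddl(
    "prod.bronze.events_external",
    "id INT, feature STRING",
    location="s3://example-bucket/tables/events",
)

print(managed)
print(external)
```

In a notebook, either statement could be executed with `spark.sql(...)`. Dropping the second table would leave the files under the given path untouched.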

### Comparison PySpark vs SQL

**DataFrame API (PySpark):**

In [0]:
# PySpark Approach - programmatic DataFrame API

df_pyspark = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.lakehouse_demo")

In [0]:
result_pyspark = df_pyspark \
    .filter(F.col("status") == "Enabled") \
    .select("feature", "status", "date") \
    .orderBy("feature")
display(result_pyspark)

**SQL Equivalent:**

In [0]:
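# Same filter, projection, and ordering as the DataFrame version above
df_sql = spark.sql(f"SELECT feature, status, date FROM {CATALOG}.{BRONZE_SCHEMA}.lakehouse_demo "
                   "WHERE status = 'Enabled' ORDER BY feature")
display(df_sql)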

### Parameterization with Databricks Widgets

Below we use the **Widgets** mechanism, which allows creating interactive controls in the notebook. This allows passing parameters (e.g., table names, dates) to SQL and Python code, facilitating the building of universal reports.

In [0]:
# Parameterization with Databricks Widgets
# Set default values based on variables from 00_setup (if available)
# This ensures SQL cells will use the same catalog as Python cells

default_catalog = CATALOG if 'CATALOG' in locals() else "retailhub_trainer"
default_schema = BRONZE_SCHEMA if 'BRONZE_SCHEMA' in locals() else "bronze"

dbutils.widgets.text("CATALOG", default_catalog)
dbutils.widgets.text("BRONZE_SCHEMA", default_schema)
dbutils.widgets.text("BRONZE_SCHEMA_2", default_schema)

In [0]:
%sql

SELECT 
  feature,
  status,
  date
FROM IDENTIFIER(:CATALOG || '.' || :BRONZE_SCHEMA || '.lakehouse_demo')
WHERE status = 'Enabled'
ORDER BY feature
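
The same parameterized query can be issued from Python. The sketch below assumes a runtime where `spark.sql` accepts named parameters through `args` and supports the `IDENTIFIER` clause (recent Databricks Runtime versions). The function name `enabled_features` is hypothetical, and `spark` is passed in as an argument so the snippet itself has no Spark import.

```python
QUERY = """
SELECT feature, status, date
FROM IDENTIFIER(:table_name)
WHERE status = :status
ORDER BY feature
"""


def enabled_features(spark, catalog: str, schema: str, status: str = "Enabled"):
    """Run the query with bound parameters instead of string formatting."""
    table_name = f"{catalog}.{schema}.lakehouse_demo"
    return spark.sql(QUERY, args={"table_name": table_name, "status": status})


# In a Databricks notebook (assumes CATALOG and BRONZE_SCHEMA from 00_setup):
# display(enabled_features(spark, CATALOG, BRONZE_SCHEMA))
```

Binding the values as parameters keeps the SQL text fixed and avoids the injection risk that comes with building queries via f-strings.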

**Comparison:**
- **Performance**: Identical - both approaches compile to the same Catalyst query plan
- **When to use PySpark**: 
  - Complex business logic with UDFs
  - Dynamic pipelines (parameterization, loops)
  - Integration with Python libraries (pandas, scikit-learn)
- **When to use SQL**: 
  - Simple transformations and aggregations
  - Team with strong SQL skills
  - Migration from traditional Data Warehouse
  - Better support for business analysts

---

## Summary

### What was achieved:
- Learned Lakehouse concept as evolution of Data Lake + Data Warehouse
- Explored Databricks platform elements: Workspace, Compute, Catalog
- Understood Unity Catalog hierarchy: Metastore → Catalog → Schema → Objects
- Practiced magic commands: %sql, %python, %fs, %pip
- Compared Hive Metastore vs Unity Catalog
- Created first Delta table in Unity Catalog with 3-level namespace

### Key Takeaways:
1. **Lakehouse eliminates data duplication**: Single copy serves BI, ML, and real-time analytics
2. **Unity Catalog is governance foundation**: 3-level namespace, fine-grained permissions, automatic lineage
3. **Clusters are flexible**: Autoscaling and spot instances reduce costs, Photon accelerates queries
4. **Notebooks are powerful**: Mixing SQL/Python, magic commands, Git integration via Git Folders
5. **Delta Lake is default format**: ACID transactions, time travel, schema evolution

### Quick Reference - Key Commands:

| Operation | PySpark | SQL |
|-----------|---------|-----|
| Set catalog | `spark.sql(f"USE CATALOG {CATALOG}")` | `USE CATALOG my_catalog` |
| List catalogs | `spark.sql("SHOW CATALOGS")` | `SHOW CATALOGS` |
| List schemas | `spark.sql("SHOW SCHEMAS")` | `SHOW SCHEMAS` |
| Create table | `df.write.saveAsTable("cat.schema.table")` | `CREATE TABLE cat.schema.table AS SELECT ...` |
| Read table | `spark.table("cat.schema.table")` | `SELECT * FROM cat.schema.table` |
| Metadata | - | `SELECT * FROM system.information_schema.tables` |
| Install lib | `%pip install package` | - |