# Databricks Certified Data Engineer Associate -- Exam Prep

**Duration:** ~25 min | **Day:** 3 | **Final session**

---

Congratulations on completing the 3-day training! This notebook prepares you for the certification exam.

## 1. Exam Overview

| Detail | Value |
|--------|-------|
| **Format** | Multiple choice (single and multi-select) |
| **Questions** | 45 |
| **Time** | 90 minutes |
| **Passing score** | 70% (32/45 correct) |
| **Cost** | $200 USD |
| **Validity** | 2 years |
| **Retake policy** | 14-day wait after 1st attempt, 30 days after 2nd |
| **Proctoring** | Online (webcam required) or test center |
| **Registration** | [academy.databricks.com](https://academy.databricks.com) |
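
The passing threshold can be checked with a little arithmetic. The sketch below uses only the figures from the table above (45 questions, 70%); the variable names are illustrative.

```python
import math

TOTAL_QUESTIONS = 45   # from the overview table
PASSING_PERCENT = 70   # passing score in percent

# Smallest whole number of correct answers that reaches the threshold.
min_correct = math.ceil(TOTAL_QUESTIONS * PASSING_PERCENT / 100)
max_wrong = TOTAL_QUESTIONS - min_correct

print(f"Need at least {min_correct} correct; up to {max_wrong} may be wrong.")
```

Since 70% of 45 is 31.5, the threshold rounds up to 32 correct answers.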

## 2. Exam Domains & Weight

| Domain | Weight | Our modules |
|--------|--------|-------------|
| **Databricks Lakehouse Platform** | ~24% | M01 |
| **ELT with Spark SQL and Python** | ~29% | M02, M06 |
| **Incremental Data Processing** | ~17% | M05, M04 (CDF) |
| **Production Pipelines** | ~13% | M07, M08 |
| **Data Governance** | ~17% | M09 |

> The heaviest domain is **ELT** (29%) -- make sure you're comfortable with SELECT, JOIN, GROUP BY, MERGE, window functions, and array/JSON operations.
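
To turn the weights into a rough study budget, the sketch below converts each percentage into an expected number of questions. It assumes the approximate weights from the table above and a 45-question exam; the split on an actual exam may differ slightly.

```python
TOTAL_QUESTIONS = 45

# Approximate weights from the domain table above (percent).
domain_weights = {
    "Databricks Lakehouse Platform": 24,
    "ELT with Spark SQL and Python": 29,
    "Incremental Data Processing": 17,
    "Production Pipelines": 13,
    "Data Governance": 17,
}

# Expected question count per domain, rounded to one decimal place.
expected_questions = {
    domain: round(TOTAL_QUESTIONS * pct / 100, 1)
    for domain, pct in domain_weights.items()
}

for domain, count in expected_questions.items():
    print(f"{domain:<32} ~{count} questions")
```

ELT alone works out to roughly 13 of the 45 questions, which is why it deserves the most review time.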

## 3. What You Learned (3-Day Map)

```
DAY 1                         DAY 2                         DAY 3
-----                         -----                         -----
M01: Lakehouse Platform       M04: Delta Optimization       M07: Medallion & Lakeflow
  - Architecture                - OPTIMIZE / ZORDER           - Bronze/Silver/Gold
  - Unity Catalog               - VACUUM                      - STREAMING TABLE
  - Compute (clusters/SQL)      - Liquid Clustering           - MATERIALIZED VIEW
  - Volumes / Git Folders       - Change Data Feed            - Expectations

M02: ELT Ingestion            M05: Incremental Processing   M08: Orchestration
  - Read CSV/JSON/Parquet       - COPY INTO                   - Jobs & Tasks
  - DataFrames & SQL            - Auto Loader                 - Triggers / CRON
  - Transforms / Joins          - Structured Streaming        - taskValues
  - Temp Views                  - Checkpointing               - Repair Run

M03: Delta Fundamentals       M06: Advanced Transforms      M09: Governance
  - CRUD (INSERT/UPDATE)        - Window functions            - GRANT / REVOKE
  - MERGE INTO                  - CTEs / Subqueries           - Row Filters
  - Time Travel                 - explode() / from_json()     - Column Masks
  - Schema Evolution            - CTAS / Higher-order fn      - System Tables
```

## 4. Must-Know Syntax (Exam Favorites)

### 4.1. MERGE INTO (frequently tested)
```sql
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
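
To make the matched / not-matched logic concrete, here is a plain-Python analogy of the upsert above. It is not Delta Lake code: tables are modeled as lists of dicts, and `merge_upsert` is a hypothetical helper name.

```python
def merge_upsert(target, source, key="id"):
    """Mimic WHEN MATCHED THEN UPDATE SET * / WHEN NOT MATCHED THEN INSERT *."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: overwrite every column. Unmatched key: insert the row.
        merged[row[key]] = dict(row)
    return list(merged.values())


target = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
source = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]

result = merge_upsert(target, source)
print(result)
```

Row 1 is untouched, row 2 is updated, and row 3 is inserted. Unlike this sketch, a real `MERGE` reports an error when several source rows match the same target row, so deduplicate the source first.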

### 4.2. Auto Loader
```python
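# `path` (input location) and `checkpoint` (schema-tracking location) are placeholders
df = (
    spark.readStream
    .format("cloudFiles")                             # Auto Loader source
    .option("cloudFiles.format", "json")              # format of the incoming files
    .option("cloudFiles.schemaLocation", checkpoint)  # where the inferred schema is tracked
    .load(path)
)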
```

### 4.3. Lakeflow Expectations
```sql
CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW
```
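
The clause above drops violating rows. Omitting `ON VIOLATION` only records the violation, and `ON VIOLATION FAIL UPDATE` stops the update. The plain-Python simulation below illustrates these three behaviors. It is an analogy, not the Lakeflow API; `apply_expectation` and its `on_violation` values are made-up names.

```python
def apply_expectation(rows, predicate, on_violation="warn"):
    """Return (kept_rows, violation_count) for one expectation."""
    violations = [row for row in rows if not predicate(row)]
    if on_violation == "fail" and violations:
        # FAIL UPDATE: the whole update is aborted.
        raise ValueError(f"{len(violations)} row(s) violated the expectation")
    if on_violation == "drop":
        # DROP ROW: invalid rows are removed from the output.
        kept = [row for row in rows if predicate(row)]
    else:
        # No ON VIOLATION clause: every row is kept, violations are only counted.
        kept = list(rows)
    return kept, len(violations)


def valid_id(row):
    """Same check as the SQL constraint: id IS NOT NULL."""
    return row["id"] is not None


rows = [{"id": 1}, {"id": None}, {"id": 3}]

warn_rows, warn_count = apply_expectation(rows, valid_id)          # all rows kept
drop_rows, drop_count = apply_expectation(rows, valid_id, "drop")  # bad row removed
```

Calling `apply_expectation(rows, valid_id, "fail")` on the same data raises an error instead of returning rows.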

### 4.4. Change Data Feed
```sql
ALTER TABLE t SET TBLPROPERTIES ('delta.enableChangeDataFeed' = true);
SELECT * FROM table_changes('t', 2);  -- from version 2
```

### 4.5. GRANT privileges
```sql
GRANT USE CATALOG ON CATALOG my_catalog TO `analysts`;
GRANT USE SCHEMA ON SCHEMA my_catalog.silver TO `analysts`;  -- needed in addition to SELECT
GRANT SELECT ON SCHEMA my_catalog.silver TO `analysts`;
```

## 5. Common Exam Traps

| Trap | Correct answer |
|------|---------------|
| VACUUM default retention | **7 days** (168 hours). A shorter retention is rejected unless you disable the safety check (`spark.databricks.delta.retentionDurationCheck.enabled = false`) |
| `LIVE.table` vs `STREAM(LIVE.table)` | `LIVE.` = batch reference. `STREAM(LIVE.)` = streaming reference |
| Expectation without `ON VIOLATION` | **Warn only** -- keeps all rows, logs metrics |
| `DROP ROW` vs `FAIL UPDATE` | DROP ROW silently removes bad rows. FAIL UPDATE aborts the pipeline |
| Temp View scope | Current SparkSession only. Dropped when the notebook detaches from the cluster |
| Global Temp View namespace | Must query as `global_temp.view_name` |
| Auto Loader format | `cloudFiles` (not `autoLoader`!) |
| Unity Catalog default | **Deny by default** -- must explicitly GRANT |
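| OPTIMIZE vs VACUUM | OPTIMIZE compacts small files into larger ones. VACUUM deletes data files that the table no longer references and that are older than the retention period. Different operations! |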
| Schema evolution option | `.option("mergeSchema", "true")` on write, not read |
| `DESCRIBE HISTORY` | Shows table versions, operations, timestamps. Used for Time Travel |
| Change Data Feed | `table_changes()` function, NOT `table_change_feed()` |
| Structured Streaming trigger | `availableNow=True` = process all available data (in one or more micro-batches), then stop. `processingTime` = keep running micro-batches at a fixed interval |
| Row Filter function | Must return **BOOLEAN**. Takes column params matching filtered columns |
| Column Mask function | Must return **same type** as the masked column |
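
The last two rows describe function contracts. As a plain-Python analogy (not Unity Catalog SQL; the function names and sample data are invented for illustration), a row filter is a predicate over column values, and a column mask maps a value to another value of the same type:

```python
def region_filter(region: str) -> bool:
    """Row filter analogy: a boolean decides whether the row is visible."""
    return region == "EU"


def mask_email(email: str) -> str:
    """Column mask analogy: returns the same type (str) as the masked column."""
    _, _, domain = email.partition("@")
    return "***@" + domain


rows = [
    {"email": "ann@example.com", "region": "EU"},
    {"email": "bob@example.com", "region": "US"},
]

# Filter rows first, then mask the sensitive column in the rows that remain.
visible = [
    {**row, "email": mask_email(row["email"])}
    for row in rows
    if region_filter(row["region"])
]
print(visible)
```

Only the EU row survives the filter, and its email keeps its string type while hiding the local part.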

## 6. Study Strategy

### Before the Exam
1. **Review the cheatsheets** (provided in the `materials/` folder) -- read them 2-3 times
2. **Re-do the quizzes** -- aim for 100% on all three
3. **Practice in Databricks** -- create a table, MERGE, Time Travel, VACUUM, Auto Loader
4. **Take the Databricks practice exam** -- [academy.databricks.com](https://academy.databricks.com)
5. **Read the official exam guide** -- [Databricks Certification](https://www.databricks.com/learn/certification/data-engineer-associate)

### During the Exam
- You have about **2 minutes per question** on average -- don't overthink
- **Flag and skip** difficult questions, come back later
- **Eliminate wrong answers** first -- often 2 of the 4 options can be ruled out quickly
- Read questions carefully -- "which is TRUE" vs "which is FALSE"
- If two answers seem correct, pick the **more specific** one
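
The 2-minute figure follows from 90 minutes spread over 45 questions. The sketch below also shows how a faster first pass frees up review time for flagged questions. The 90-second first-pass pace is an assumed example, not an official recommendation.

```python
EXAM_MINUTES = 90
TOTAL_QUESTIONS = 45

minutes_per_question = EXAM_MINUTES / TOTAL_QUESTIONS

# Assumption for illustration: answer or flag each question within 90 seconds.
first_pass_seconds = 90
first_pass_minutes = TOTAL_QUESTIONS * first_pass_seconds / 60
review_minutes = EXAM_MINUTES - first_pass_minutes

print(f"Average budget: {minutes_per_question:.1f} min/question")
print(f"First pass: {first_pass_minutes:.1f} min, review: {review_minutes:.1f} min")
```

At that pace the first pass takes 67.5 minutes, leaving 22.5 minutes to revisit flagged questions.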

## 7. Post-Training Resources

| Resource | Link |
|----------|------|
| Databricks Academy | [academy.databricks.com](https://academy.databricks.com) |
| Practice Exam | [Databricks Certification](https://www.databricks.com/learn/certification/data-engineer-associate) |
| Official Documentation | [docs.databricks.com](https://docs.databricks.com) |
| Databricks Community | [community.databricks.com](https://community.databricks.com) |
| Delta Lake Docs | [docs.delta.io](https://docs.delta.io) |
| Unity Catalog Guide | [UC Documentation](https://docs.databricks.com/en/data-governance/unity-catalog/index.html) |

### Your Training Materials
- **Cheatsheets**: Day 1, 2, 3 -- keep them as quick reference
- **Quizzes**: 60+ questions covering all exam domains
- **Lab notebooks**: Re-run them for hands-on practice
- **Demo notebooks**: M01-M09 -- complete reference with code examples

## 8. Final Notes

You've covered **100% of the exam topics** during this 3-day training:

- Lakehouse architecture & Unity Catalog
- ELT with Spark SQL and PySpark
- Delta Lake (CRUD, Time Travel, Optimization, CDF)
- Incremental processing (Auto Loader, Structured Streaming)
- Lakeflow pipelines (Medallion, expectations)
- Orchestration (Jobs, triggers, monitoring)
- Governance (GRANT/REVOKE, masking, filtering, audit)

**Recommended timeline:** Schedule your exam within **2-4 weeks** while the material is fresh.

---

Good luck on your certification!

---
*Training delivered by Altcom*