# Practice Questions - Set 1: Databricks Lakehouse Fundamentals

## Overview
This notebook contains practice questions covering Databricks Lakehouse fundamentals, Delta Lake basics, and core concepts.

## Instructions
- Read each question carefully
- Try to answer before revealing the solution
- Review the explanations for all questions, even those you answered correctly
- Track your score to identify weak areas

---

## Section 1: Databricks Architecture (Questions 1-5)

### Question 1

**Which of the following best describes the Databricks Lakehouse platform?**

A) A data warehouse built on cloud storage  
B) A data lake with ACID transactions and data management features  
C) A replacement for Apache Spark  
D) A NoSQL database optimized for analytics  

<details>
<summary>Click to reveal answer</summary>

**Answer: B**

**Explanation:**
The Databricks Lakehouse combines the best of data lakes and data warehouses. It provides:
- Cost-effective storage of data lakes (on cloud object storage)
- ACID transactions and data management features of data warehouses
- Support for both structured and unstructured data
- Unified platform for BI, ML, and streaming

Delta Lake is the technology that enables these features on top of data lakes.
</details>

### Question 2

**What are the three levels in Unity Catalog's namespace hierarchy?**

A) Database.Schema.Table  
B) Catalog.Database.Table  
C) Catalog.Schema.Table  
D) Workspace.Database.Table  

<details>
<summary>Click to reveal answer</summary>

**Answer: C**

**Explanation:**
Unity Catalog uses a three-level namespace:
1. **Catalog** - Top level, represents a collection of schemas
2. **Schema** (or Database) - Contains tables, views, and functions
3. **Table** (or View) - The actual data objects

Example: `my_catalog.sales_schema.customer_table`

This structure provides better organization and governance compared to traditional two-level namespaces.
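
As a small illustration of the three-level name, the sketch below splits a fully qualified name into its parts. `TableName` and `parse_table_name` are hypothetical helpers written for this example, not part of any Databricks API, and the parsing deliberately ignores backtick-quoted identifiers.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str


def parse_table_name(full_name: str) -> TableName:
    """Split 'catalog.schema.table' into its three levels."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
    return TableName(*parts)


name = parse_table_name("my_catalog.sales_schema.customer_table")
```

A two-part name such as `sales_schema.customer_table` raises `ValueError` here, which mirrors the point that the catalog level is part of the full address.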
</details>

### Question 3

**Which cluster type is recommended for production jobs that run on a schedule?**

A) All-purpose cluster  
B) Job cluster  
C) SQL Warehouse  
D) High-concurrency cluster  

<details>
<summary>Click to reveal answer</summary>

**Answer: B**

**Explanation:**
**Job clusters** are recommended for production scheduled jobs because:
- Created when job starts, terminated when complete
- More cost-effective (no idle time charges)
- Isolated from other workloads
- Optimized for specific job requirements

**All-purpose clusters** are for:
- Interactive analysis and development
- Shared across users
- More expensive for automated jobs
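
To make the distinction concrete, the sketch below builds two task definitions in the shape used by the Databricks Jobs API: one that requests a fresh job cluster via `new_cluster`, and one that attaches to a running all-purpose cluster via `existing_cluster_id`. The runtime version, node type, notebook path, and cluster ID are placeholder values chosen for illustration, and the payload is only constructed here, not submitted.

```python
import json

# Task that runs on a job cluster: created for the run, terminated afterwards.
job_cluster_task = {
    "task_key": "nightly_load",
    "notebook_task": {"notebook_path": "/Jobs/nightly_load"},  # placeholder path
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime
        "node_type_id": "i3.xlarge",          # placeholder node type
        "num_workers": 2,
    },
}

# Same task pinned to an all-purpose cluster that stays up between runs.
all_purpose_task = {
    "task_key": "nightly_load",
    "notebook_task": {"notebook_path": "/Jobs/nightly_load"},
    "existing_cluster_id": "0000-000000-example",  # placeholder ID
}

job_payload = {
    "name": "nightly-load",
    "tasks": [job_cluster_task],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
        "timezone_id": "UTC",
    },
}

payload_json = json.dumps(job_payload, indent=2)
```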
</details>

### Question 4

**What is the primary benefit of using Photon in Databricks?**

A) Reduces storage costs  
B) Provides better data governance  
C) Accelerates query performance  
D) Enables real-time streaming  

<details>
<summary>Click to reveal answer</summary>

**Answer: C**

**Explanation:**
**Photon** is a vectorized query engine that:
- Accelerates SQL and DataFrame queries
- Written in C++ (vs standard Spark in JVM)
- Can deliver roughly 2-10x faster queries, depending on the workload
- Especially effective for:
  - Aggregate queries
  - Joins
  - File format operations

Photon is enabled by default on SQL Warehouses and can be enabled on clusters.
</details>

### Question 5

**In the medallion architecture, which layer contains raw, unprocessed data?**

A) Bronze  
B) Silver  
C) Gold  
D) Platinum  

<details>
<summary>Click to reveal answer</summary>

**Answer: A**

**Explanation:**
The medallion architecture has three layers:

1. **Bronze** (Raw):
   - Raw data as ingested from source
   - No transformations applied
   - Append-only, immutable history

2. **Silver** (Cleaned):
   - Cleaned and validated data
   - Deduplication, quality checks
   - Enriched with additional context

3. **Gold** (Curated):
   - Business-level aggregates
   - Optimized for analytics and reporting
   - Feature tables for ML
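
The flow between the three layers can be sketched in plain Python, with lists of dictionaries standing in for Delta tables. The sample records and the `to_silver` / `to_gold` helpers are made up for this example; a real pipeline would use Spark DataFrames.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested (duplicates and bad rows included).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "1", "region": "EU", "amount": "10.50"},  # duplicate
    {"order_id": "2", "region": "US", "amount": "n/a"},    # invalid amount
    {"order_id": "3", "region": "US", "amount": "4.00"},
]


def to_silver(rows):
    """Validate, cast types, and deduplicate on order_id."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail the quality check
        if row["order_id"] in seen:
            continue  # drop duplicates
        seen.add(row["order_id"])
        clean.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return clean


def to_gold(rows):
    """Aggregate to a business-level metric: revenue per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Note that `bronze` is never modified: each layer is derived from the one before it, so the raw history stays available for reprocessing.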
</details>

## Section 2: Delta Lake Basics (Questions 6-10)

### Question 6

**What does ACID stand for in the context of Delta Lake?**

A) Automated, Consistent, Isolated, Durable  
B) Atomic, Consistent, Isolated, Durable  
C) Atomic, Continuous, Isolated, Distributed  
D) Automated, Continuous, Independent, Durable  

<details>
<summary>Click to reveal answer</summary>

**Answer: B**

**Explanation:**
ACID stands for:
- **Atomicity**: Operations either complete fully or not at all
- **Consistency**: Data remains in valid state before/after transaction
- **Isolation**: Concurrent operations don't interfere
- **Durability**: Committed changes are permanent

Delta Lake provides ACID guarantees through its transaction log.
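
A toy model of how a log can provide atomic, isolated commits: each commit is one numbered file, and a writer succeeds only if it is the first to create that file. The 20-digit zero-padded file name follows the convention used in `_delta_log`; everything else (the `commit` helper, the local temp directory, the action contents) is an assumption for illustration and skips details such as partial writes and conflict resolution.

```python
import json
import os
import tempfile


def commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to commit `actions` as `version`; return False if that version exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if another writer already claimed this version.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))
    return True


log_dir = tempfile.mkdtemp()
first = commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
second = commit(log_dir, 0, [{"add": {"path": "part-1.parquet"}}])  # conflicting writer
```

The second writer loses the race for version 0 and would have to re-read the log and retry as version 1, which is the essence of optimistic concurrency control.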
</details>

### Question 7

**Which command is used to query a previous version of a Delta table?**

A) `SELECT * FROM table@version`  
B) `SELECT * FROM table VERSION AS OF 5`  
C) `SELECT * FROM table HISTORY 5`  
D) `SELECT * FROM table ROLLBACK TO 5`  

<details>
<summary>Click to reveal answer</summary>

**Answer: B**

**Explanation:**
Time travel in Delta Lake uses:

```sql
-- By version number
SELECT * FROM table VERSION AS OF 5

-- By timestamp
SELECT * FROM table TIMESTAMP AS OF '2024-01-01'
```

Or in Python:
```python
df = spark.read.format("delta").option("versionAsOf", 5).load("/path")
df = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/path")
```
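
Conceptually, time travel works because the transaction log records which data files were added and removed in each commit. The sketch below replays a simplified log to find the files that make up a given version; `files_at_version` and the flat `add` / `remove` entries are simplifications invented for this example, not the real log format.

```python
def files_at_version(log: list, version: int) -> set:
    """Replay add/remove actions from commit 0 through `version`."""
    live = set()
    for commit_actions in log[: version + 1]:
        for action in commit_actions:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live


log = [
    [{"add": "a.parquet"}],                           # version 0
    [{"add": "b.parquet"}],                           # version 1
    [{"remove": "a.parquet"}, {"add": "c.parquet"}],  # version 2 (rewrite)
]

v1_files = files_at_version(log, 1)
v2_files = files_at_version(log, 2)
```

Reading version 1 still needs `a.parquet` even though version 2 removed it from the table, which is why old data files must stay on storage for time travel to work.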
</details>

### Question 8

**What is the purpose of the OPTIMIZE command in Delta Lake?**

A) To delete old data  
B) To compact small files into larger ones  
C) To add new partitions  
D) To update table statistics only  

<details>
<summary>Click to reveal answer</summary>

**Answer: B**

**Explanation:**
`OPTIMIZE` compacts small files into larger files:

```sql
OPTIMIZE table_name;

-- With Z-ORDER for specific columns
OPTIMIZE table_name ZORDER BY (column1, column2);
```

Benefits:
- Improves query performance (fewer files to read)
- Reduces metadata overhead
- Target file size typically falls between 32 MB and 1 GB, depending on table size and workload

Note: VACUUM is used to delete old files, not OPTIMIZE.
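
The idea behind compaction can be sketched as a simple bin-packing problem: group small files so that each group is rewritten as one file close to a target size. The greedy `plan_compaction` function below is an illustrative assumption, not the algorithm Databricks actually uses.

```python
def plan_compaction(file_sizes_mb: list, target_mb: int) -> list:
    """Greedily group files into bins of at most `target_mb` each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [8, 16, 4, 64, 32, 8, 128]
plan = plan_compaction(small_files, target_mb=128)
```

Here seven input files become three output files, so a query scanning the table opens fewer files and reads less metadata.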
</details>

### Question 9

**Which statement about Delta Lake schema enforcement is TRUE?**

A) Schema changes are always allowed automatically  
B) Schema enforcement prevents writes with incompatible schemas  
C) Delta Lake does not support schema enforcement  
D) Schema enforcement only applies to Parquet files  

<details>
<summary>Click to reveal answer</summary>

**Answer: B**

**Explanation:**
Delta Lake enforces schemas by default:
- Writes with incompatible schemas are rejected
- Prevents data quality issues
- Ensures consistency

To evolve schema (add columns):
```python
df.write.format("delta") \
  .option("mergeSchema", "true") \
  .mode("append") \
  .save("/path")
```

Or allow schema evolution:
```sql
SET spark.databricks.delta.schema.autoMerge.enabled = true;
```
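
The enforcement rule can be modeled in a few lines of plain Python. `check_schema` is a hypothetical helper for this example; schemas are plain dictionaries of column name to type name, and the real Delta Lake rules (nullability, nested types, permitted type changes) are more detailed than this.

```python
def check_schema(table_schema: dict, incoming: dict, merge_schema: bool = False) -> dict:
    """Return the resulting table schema, or raise if the write is incompatible."""
    for column, dtype in incoming.items():
        if column in table_schema:
            if table_schema[column] != dtype:
                raise TypeError(
                    f"type mismatch for {column!r}: {table_schema[column]} vs {dtype}"
                )
        elif not merge_schema:
            raise ValueError(f"unexpected column {column!r}; schema evolution not enabled")
    # With merge_schema, new columns are appended to the existing schema.
    return {**table_schema, **incoming}


table_schema = {"id": "int", "name": "string"}
new_batch = {"id": "int", "name": "string", "email": "string"}

evolved = check_schema(table_schema, new_batch, merge_schema=True)
```

Calling `check_schema(table_schema, new_batch)` without `merge_schema=True` raises `ValueError`, which corresponds to a rejected write.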
</details>

### Question 10

**What is the default retention period for the VACUUM command?**

A) 0 days  
B) 7 days  
C) 30 days  
D) 90 days  

<details>
<summary>Click to reveal answer</summary>

**Answer: B**

**Explanation:**
The default retention period is **7 days** (168 hours):

```sql
-- Default: removes unreferenced files older than 7 days
VACUUM table_name;

-- Custom retention (30 days)
VACUUM table_name RETAIN 720 HOURS;
```

This protects:
- Time travel queries
- Long-running queries
- Concurrent operations

⚠️ **Warning**: Setting retention < 7 days requires:
```sql
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
```
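
The retention rule comes down to a timestamp comparison. The sketch below picks out files that are no longer referenced by the table and are older than the retention window; `vacuum_candidates` and the sample file names and dates are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def vacuum_candidates(unreferenced: dict, now: datetime, retain_hours: int = 168) -> list:
    """Return unreferenced files that left the table before the retention cutoff."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(path for path, removed_at in unreferenced.items() if removed_at < cutoff)


now = datetime(2024, 1, 31, 12, 0)
unreferenced = {
    "old.parquet": datetime(2024, 1, 1),      # about 30 days old
    "recent.parquet": datetime(2024, 1, 28),  # about 3 days old
}

default_run = vacuum_candidates(unreferenced, now)                     # 7-day window
aggressive_run = vacuum_candidates(unreferenced, now, retain_hours=48)
```

With the default window only `old.parquet` is eligible. Shrinking the window to 48 hours also deletes `recent.parquet`, which would break time travel to any version from the last few days that still depends on it.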
</details>

## Section 3: PySpark and SQL (Questions 11-15)

### Question 11

**What is the difference between transformations and actions in Spark?**

A) Transformations execute immediately, actions are lazy  
B) Transformations are lazy, actions trigger execution  
C) There is no difference  
D) Transformations modify data, actions read data  

<details>
<summary>Click to reveal answer</summary>

**Answer: B**

**Explanation:**

**Transformations (Lazy)**:
- Create new DataFrames
- Not executed immediately
- Examples: `select()`, `filter()`, `groupBy()`, `join()`

**Actions (Eager)**:
- Trigger computation
- Return results
- Examples: `show()`, `count()`, `collect()`, `write.save()`

```python
from pyspark.sql.functions import col

# Lazy - not executed
df2 = df.filter(col("age") > 21)

# Action - triggers execution
df2.show()
```
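
The same behavior can be observed without a Spark cluster by using a Python generator as an analogy. This is only an analogy: Spark builds and optimizes a query plan rather than using generators, and `lazy_filter` and the `executed` list are made up for this example.

```python
executed = []  # records each row as it is actually processed


def lazy_filter(rows, predicate):
    """Transformation: returns a generator, so nothing runs yet."""
    for row in rows:
        executed.append(row)
        if predicate(row):
            yield row


ages = [18, 25, 30]
adults = lazy_filter(ages, lambda age: age > 21)  # lazy: no rows touched
ran_before_action = len(executed)

result = list(adults)  # "action": forces evaluation
ran_after_action = len(executed)
```

No rows are processed until `list()` consumes the generator, just as a Spark `filter()` does no work until an action such as `show()` is called.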
</details>

### Question 12

**Which SQL function is used to extract year from a date column?**

A) `EXTRACT(YEAR FROM date_column)`  
B) `YEAR(date_column)`  
C) `DATE_PART('year', date_column)`  
D) All of the above  

<details>
<summary>Click to reveal answer</summary>

**Answer: D**

**Explanation:**
All three forms are valid in Databricks/Spark SQL (`EXTRACT` and `DATE_PART` are available from Spark 3.0). `YEAR()` is the most common:

```sql
SELECT 
  YEAR(order_date) as year,
  MONTH(order_date) as month,
  DAY(order_date) as day
FROM orders;
```

In PySpark:
```python
from pyspark.sql.functions import col, year, month

df.select(
  year(col("order_date")).alias("year"),
  month(col("order_date")).alias("month")
)
```
</details>

## Practice Score Tracker

Track your score as you go through the questions:

- Questions 1-5 (Architecture): __ / 5
- Questions 6-10 (Delta Lake): __ / 5
- Questions 11-15 (PySpark/SQL): __ / 5

**Total: __ / 15**

### Score Interpretation
- 13-15: Excellent! You're well-prepared
- 10-12: Good! Review missed topics
- 7-9: Fair. Focus study on weak areas
- < 7: Need more study. Review all materials

---

## Summary

This practice set covered:
- ✅ Databricks architecture and components
- ✅ Unity Catalog fundamentals
- ✅ Delta Lake ACID properties and operations
- ✅ Medallion architecture
- ✅ PySpark transformations and actions

## Next Steps

1. Review any questions you missed
2. Study the explanations thoroughly
3. Practice with hands-on exercises
4. Move to Practice Set 2 for more questions

## Study Resources

- [Databricks Documentation](https://docs.databricks.com/)
- [Delta Lake Guide](https://docs.delta.io/)
- [Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/)

**Good luck with your certification! 🎓**