# Databricks Compute & Clusters: A Comprehensive Guide

This guide covers Databricks compute resources, cluster types, and cluster configuration, along with best practices for optimizing data processing workloads.

---

## Table of Contents

1. [Fundamentals: Compute vs. Clusters](#1-fundamentals-compute-vs-clusters)
2. [Storage and Data Handling](#2-storage-and-data-handling)
3. [Cluster Types](#3-cluster-types)
4. [Compute Types: Serverless vs. Classic](#4-compute-types-serverless-vs-classic)
5. [Cluster Access Modes](#5-cluster-access-modes)
6. [High-Performance Engines: Photon & AQE](#6-high-performance-engines-photon--aqe)
7. [Cluster Sizing and Configuration](#7-cluster-sizing-and-configuration)
8. [Cluster Pools](#8-cluster-pools)
9. [Autoscaling](#9-autoscaling)
10. [Instance Types and Selection](#10-instance-types-and-selection)
11. [Cluster Lifecycle Management](#11-cluster-lifecycle-management)
12. [Cost Optimization Strategies](#12-cost-optimization-strategies)
13. [Monitoring and Performance](#13-monitoring-and-performance)

---

## 1. Fundamentals: Compute vs. Clusters

### What is Compute?

**Compute** in Databricks refers to the underlying hardware resources required to process data and run code. It encompasses:
- **CPU**: Processing power for executing tasks
- **Memory (RAM)**: Temporary storage for data processing
- **Storage**: Local disk space for temporary files
- **Networking**: Bandwidth for data transfer between nodes and external storage

### How is a Cluster Different from Compute?

Think of **Compute** as the raw infrastructure (the Virtual Machines) and a **Cluster** as the organized, configured team that uses that infrastructure.

**Key Differences:**
- **Compute**: The abstract concept of processing resources
- **Cluster**: A specific, configured set of compute resources working together as a single unit using Apache Spark

**In Practice:**
- You **provision compute** by **creating a cluster**
- A cluster is a managed Spark environment that coordinates multiple compute nodes
- Each cluster has a driver node (coordinates work) and worker nodes (execute tasks)

---

## 2. Storage and Data Handling

### Do Clusters Have Storage?

Yes, but cluster storage is primarily **ephemeral (temporary)**. Understanding storage architecture is crucial for data engineering workflows.

### Local Storage (Ephemeral)

Each cluster node has local storage used for:

1. **Operating System**: Base OS files and system libraries
2. **Application Libraries**: Python packages, JAR files, and other dependencies
3. **Spark Shuffling**: Temporary data exchange during Spark operations (joins, aggregations, repartitions)
4. **Spill Files**: When memory is full, Spark writes intermediate data to disk
5. **Logs**: Application and system logs

**Important Notes:**
- Local storage is **not persistent**: data is lost when the cluster terminates
- Size depends on the instance type and any attached disks, so check the specification of the node type you choose
- Many instance types provide local SSDs, which speeds up shuffle and spill operations

### Persistent Storage (External)

For actual data storage, Databricks connects to external cloud object storage:

- **AWS**: Amazon S3
- **Azure**: Azure Data Lake Storage Gen2 (ADLS Gen2) or Azure Blob Storage (ADLS Gen1 has been retired)
- **GCP**: Google Cloud Storage (GCS)

**How It Works:**
- Clusters **read from** and **write to** cloud storage
- Data remains in cloud storage even after cluster termination
- Databricks uses optimized connectors (e.g., DBIO) for efficient data access
- Delta Lake tables are stored in cloud storage, not on cluster nodes

**Best Practice:** Always store your data in cloud storage, not on cluster local storage.

---

## 3. Cluster Types

Databricks offers different cluster types optimized for different use cases:

### 3.1 All-Purpose Clusters

**Purpose:** Interactive development and ad-hoc analysis

**Characteristics:**
- Used for notebooks, interactive queries, and exploratory data analysis
- Can be shared among multiple users
- Supports multiple languages (Python, R, Scala, SQL)
- Customizable Spark configuration
- Typically runs for extended periods

**Use Cases:**
- Data exploration and analysis
- Interactive development
- Collaborative work on notebooks
- Ad-hoc queries

### 3.2 Job Clusters

**Purpose:** Automated, scheduled, or one-time workloads

**Characteristics:**
- Created automatically when a job runs
- Terminated automatically after job completion
- More cost-effective for scheduled workloads
- Can be configured with different settings than all-purpose clusters
- Supports retry policies and notifications

**Use Cases:**
- Scheduled ETL pipelines
- Automated data processing jobs
- Batch transformations
- Data quality checks

**Key Difference:** Job clusters are ephemeral and created on-demand, while all-purpose clusters are long-lived.
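
The distinction shows up in how a job is defined. The following is a minimal sketch of a Jobs API 2.1-style payload, assuming a hypothetical job name, notebook path, runtime version string, and e-mail address. A task that carries a `new_cluster` block runs on an ephemeral job cluster, whereas a task with `existing_cluster_id` would attach to a long-lived all-purpose cluster.

```python
import json


def build_job_payload(notebook_path: str, cron: str) -> dict:
    """Build a Jobs API 2.1-style payload that runs a notebook on a job cluster.

    Every concrete value below (names, runtime version, node type) is an
    illustrative placeholder, not a recommendation.
    """
    return {
        "name": "nightly-etl",  # hypothetical job name
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": "UTC",
        },
        "email_notifications": {"on_failure": ["data-team@example.com"]},
        "tasks": [
            {
                "task_key": "transform",
                "notebook_task": {"notebook_path": notebook_path},
                "max_retries": 2,
                # `new_cluster` means: create a cluster for this run and
                # terminate it when the run finishes.
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",  # placeholder
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2,
                },
            }
        ],
    }


# Quartz cron expression for 02:00 every day.
payload = build_job_payload("/Repos/etl/transform", "0 0 2 * * ?")
print(json.dumps(payload, indent=2))
```

The payload would be sent to the job creation endpoint by whichever client you already use; that step is left out here because it depends on workspace URL and authentication.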

### 3.3 SQL Warehouses (formerly SQL Endpoints)

**Purpose:** Optimized for SQL queries and BI tool connectivity

**Characteristics:**
- Specialized architecture for SQL workloads
- Low-latency query execution
- High concurrency support
- Available as classic, pro, or serverless warehouses, with a configurable auto-stop timeout
- Optimized for BI tools (Tableau, Power BI, Looker, etc.)
- Does not support Python/Scala/R - SQL only

**Use Cases:**
- Business intelligence dashboards
- Ad-hoc SQL queries
- Reporting and analytics
- BI tool connectivity

**When to Use SQL Warehouse vs. All-Purpose:**
- **SQL Warehouse**: Pure SQL workloads, BI tools, high concurrency SQL queries
- **All-Purpose**: Multi-language development, data engineering, custom Spark configurations

---

## 4. Compute Types: Serverless vs. Classic

### 4.1 Classic Compute

**Management:** User-managed infrastructure

**Characteristics:**
- You select VM instance types
- You choose Spark version
- You configure scaling limits
- You manage cluster lifecycle
- Full control over configuration

**Startup Time:** ~3-5 minutes (VM provisioning time)

**Configuration:**
- Instance types (e.g., `Standard_DS3_v2`, `Standard_L8s_v2`)
- Spark version selection
- Custom Spark configurations
- Environment variables
- Init scripts

**Use Cases:**
- When you need specific VM types
- Custom Spark configurations
- Long-running interactive sessions
- Full control over infrastructure

### 4.2 Serverless Compute

**Management:** Databricks-managed infrastructure

**Characteristics:**
- Databricks automatically selects and manages VMs
- No VM type selection needed
- Automatic optimization
- Fast startup (seconds rather than minutes)
- Automatic scaling
- Pay only for compute time used

**Startup Time:** ~10-30 seconds (much faster than classic)

**Configuration:**
- Simplified configuration
- Databricks optimizes resource allocation
- Automatic instance type selection
- Less control, more convenience

**Use Cases:**
- Unpredictable workloads
- Short-running tasks
- Quick iterations
- When you want to minimize management overhead
- Cost optimization for sporadic workloads

**Availability:** Serverless compute is available for SQL warehouses, notebooks, and jobs, depending on your cloud, region, and Databricks plan.

---

## 5. Cluster Access Modes

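Access modes determine how users and applications interact with clusters and what level of isolation is provided. Newer versions of the Databricks UI label Single User as **Dedicated** and Shared as **Standard**; this guide uses the original names.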

### 5.1 Single User Access Mode

**Isolation:** Full isolation per user

**Characteristics:**
- One user per cluster
- Complete isolation of data and libraries
- Best for security-sensitive workloads
- User-specific environment variables and secrets
- No resource sharing conflicts

**Use Cases:**
- Production workloads with strict security requirements
- When users need different library versions
- Compliance requirements
- Sensitive data processing

**Limitations:**
- Cannot share clusters among users
- Higher cost (one cluster per user)

### 5.2 Shared Access Mode

**Isolation:** Shared resources with fair scheduling

**Characteristics:**
- Multiple users can attach to the same cluster
- Fair scheduling ensures resource distribution
- Shared libraries and environment
- Cost-effective for teams
- Automatic resource allocation

**Use Cases:**
- Collaborative development
- Team environments
- Cost optimization
- General-purpose analytics

**Best Practices:**
- Use when users have similar library requirements
- Enable autoscaling for better resource utilization
- Monitor for resource contention

### 5.3 No Isolation Shared Access Mode

**Isolation:** Minimal isolation (legacy mode)

**Characteristics:**
- Multiple users share the cluster with no isolation between their workloads
- Less isolation than Shared mode
- Legacy option (not recommended for new clusters)
- Potential security and stability concerns

**Recommendation:** Avoid this mode for new clusters. Use Shared Access Mode instead.
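
In a cluster specification, the access mode is selected with the `data_security_mode` field. The sketch below maps the three modes described above to the API values; the helper name `access_mode_fields` and the example user are hypothetical.

```python
from typing import Dict, Optional

# Access mode names from this section mapped to `data_security_mode` values.
ACCESS_MODES: Dict[str, str] = {
    "single_user": "SINGLE_USER",
    "shared": "USER_ISOLATION",
    "no_isolation_shared": "NONE",
}


def access_mode_fields(mode: str, user: Optional[str] = None) -> Dict[str, str]:
    """Return the cluster-spec fields that select the given access mode."""
    if mode not in ACCESS_MODES:
        raise ValueError(f"unknown access mode: {mode!r}")
    fields = {"data_security_mode": ACCESS_MODES[mode]}
    if mode == "single_user":
        # A single-user cluster is bound to one principal.
        if not user:
            raise ValueError("single-user mode needs the assigned user's name")
        fields["single_user_name"] = user
    return fields


print(access_mode_fields("single_user", "analyst@example.com"))
print(access_mode_fields("shared"))
```

These fields would be merged into the rest of the cluster definition (runtime version, node type, and so on).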

---

## 6. High-Performance Engines: Photon & AQE

### 6.1 Photon Engine

**What is Photon?**

Photon is Databricks' high-performance, vectorized query engine written in **C++**. It's a drop-in replacement for traditional Spark JVM execution that provides significant performance improvements.

**Key Features:**
- **Vectorized Execution**: Processes data in batches (vectors) rather than row-by-row
- **Native Code**: Written in C++ for better CPU utilization
- **Optimized Operators**: Faster joins, aggregations, and filters
- **Cost Reduction**: More efficient resource usage = lower costs
- **Automatic Fallback**: Falls back to JVM Spark for unsupported operations

**Performance Benefits:**
- Can be several times faster for SQL and DataFrame operations; the actual speedup depends on the workload
- Better CPU cache utilization
- Reduced memory overhead
- Lower cloud costs

**When Photon is Used:**
- SQL queries
- DataFrame operations
- Most Spark SQL operations
- Automatically enabled on supported clusters

**Enabling Photon:**
- Available on Databricks Runtime 9.1 LTS and above
- Enable it in the cluster configuration: select **Use Photon Acceleration** in the compute UI, or set `runtime_engine` to `PHOTON` in the Clusters API
- Or use Photon-optimized runtime versions

**Limitations:**
- Some Spark operations may fall back to JVM
- Custom UDFs (User Defined Functions) may not benefit
- Some advanced Spark features may not be supported
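
Whether Photon lowers cost depends on more than speed, because Photon-enabled compute is typically billed at a higher DBU rate. The sketch below works through the arithmetic: Photon is cheaper only when its speedup exceeds the increase in hourly cost. All prices, the DBU multiplier, and the speedup are made-up inputs for illustration; check your own pricing and measure your own workload.

```python
def hourly_cost(vm_price: float, dbu_per_hour: float, dbu_price: float) -> float:
    """Hourly cost of one node: cloud VM price plus DBU charges."""
    return vm_price + dbu_per_hour * dbu_price


def break_even_speedup(vm_price: float, dbu_per_hour: float, dbu_price: float,
                       photon_dbu_multiplier: float) -> float:
    """Speedup at which a Photon run costs the same as a non-Photon run."""
    base = hourly_cost(vm_price, dbu_per_hour, dbu_price)
    photon = hourly_cost(vm_price, dbu_per_hour * photon_dbu_multiplier, dbu_price)
    return photon / base


def photon_cost_ratio(vm_price: float, dbu_per_hour: float, dbu_price: float,
                      photon_dbu_multiplier: float, speedup: float) -> float:
    """Photon run cost divided by non-Photon run cost (below 1.0 means cheaper)."""
    return break_even_speedup(
        vm_price, dbu_per_hour, dbu_price, photon_dbu_multiplier
    ) / speedup


# Hypothetical inputs: $0.50/h VM, 1 DBU/h, $0.30 per DBU, 2x DBU rate with Photon.
needed = break_even_speedup(0.50, 1.0, 0.30, 2.0)
ratio = photon_cost_ratio(0.50, 1.0, 0.30, 2.0, speedup=3.0)
print(f"break-even speedup: {needed:.3f}, cost ratio at 3x speedup: {ratio:.3f}")
```

With these inputs the job must run more than about 1.4 times faster on Photon before it starts saving money.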

### 6.2 Adaptive Query Execution (AQE)

**What is AQE?**

Adaptive Query Execution (AQE) is a Spark optimization feature that **dynamically re-optimizes** query execution plans based on real-time statistics collected during query execution.

**How AQE Works:**

1. **Initial Plan**: Spark creates an initial query execution plan based on table statistics
2. **Runtime Statistics**: During execution, AQE collects real-time data about:
   - Actual data sizes
   - Skew in data distribution
   - Join output sizes
3. **Dynamic Re-optimization**: AQE adjusts the plan mid-execution:
   - Switches join strategies (e.g., Sort-Merge Join → Broadcast Join)
   - Adjusts partition counts
   - Handles data skew automatically
   - Optimizes shuffle partitions

**Key Optimizations:**

1. **Dynamic Coalescing of Shuffle Partitions**
   - Automatically reduces the number of shuffle partitions if data is smaller than expected
   - Reduces overhead and improves performance

2. **Dynamic Join Selection**
   - Switches from expensive Sort-Merge Join to fast Broadcast Join when appropriate
   - Example: If a table is smaller than expected, broadcast it instead of doing a sort-merge

3. **Dynamic Skew Join Handling**
   - Automatically detects and handles skewed data in joins
   - Splits large partitions to balance workload

**Is AQE in Open Source Spark?**

Yes! AQE was introduced in **Apache Spark 3.0** and is available in the open-source version. However:
- Databricks includes proprietary enhancements
- Better defaults and tuning in Databricks Runtime
- More aggressive optimizations
- Better integration with Photon

**Enabling AQE:**
- Enabled by default in Apache Spark 3.2.0 and later, and in Databricks Runtime 7.3 LTS and later
- Configuration: `spark.sql.adaptive.enabled = true`
- Additional settings available for fine-tuning
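
As a sketch of what the fine-tuning settings look like, the snippet below collects a few standard AQE keys and applies them through `spark.conf.set`. The values are examples rather than recommendations, and `apply_conf` is a hypothetical helper; in a Databricks notebook the `spark` session already exists and would be passed in.

```python
from typing import Dict

# Standard Spark SQL AQE settings; the values shown are examples only.
AQE_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Target size AQE aims for when coalescing or splitting partitions.
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64MB",
}


def apply_conf(spark, conf: Dict[str, str]) -> None:
    """Apply each setting to the given SparkSession via its runtime config."""
    for key, value in conf.items():
        spark.conf.set(key, value)


# In a notebook: apply_conf(spark, AQE_CONF)
for key, value in AQE_CONF.items():
    print(f"{key} = {value}")
```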

**Benefits:**
- Automatic optimization without manual tuning
- Handles data skew automatically
- Adapts to actual data characteristics
- Reduces need for manual partition tuning

---

## 7. Cluster Sizing and Configuration

### 7.1 Understanding Cluster Size

Cluster size depends on:
- **Data volume**: How much data you're processing
- **Query complexity**: Simple aggregations vs. complex joins
- **Concurrency**: Number of simultaneous users/queries
- **Performance requirements**: Latency vs. throughput needs
- **Budget constraints**: Cost considerations

### 7.2 Sizing Scenarios

#### Scenario A: Small Exploratory Data (1-10 GB)

**Characteristics:**
- Small datasets
- Ad-hoc analysis
- Single user or small team
- Interactive queries

**Recommendation:**
- **Single Node Cluster** or small 2-node cluster
- Instance type: `Standard_DS3_v2` (4 vCPUs, 14 GB RAM)
- No autoscaling needed
- Cost-effective for exploration

**Configuration Example:**
```
Cluster Mode: Single Node or Standard
Min Workers: 0 (single node) or 1
Max Workers: 2
Instance Type: Standard_DS3_v2
```
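
For reference, the single-node variant of this configuration can be sketched as a Clusters API-style specification. The cluster name, runtime version string, and auto-termination value are placeholders; the `spark_conf` and `custom_tags` entries are the settings conventionally used to mark a cluster as single node, and newer API versions may offer a simpler flag.

```python
def single_node_spec(name: str, node_type_id: str, spark_version: str) -> dict:
    """Clusters API-style spec for a single-node cluster (driver only)."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": 0,  # no workers: Spark runs locally on the driver
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
        "autotermination_minutes": 60,  # placeholder idle timeout
    }


spec = single_node_spec("exploration", "Standard_DS3_v2", "14.3.x-scala2.12")
print(spec)
```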

#### Scenario B: Large ETL/Batch (500 GB - 1 TB)

**Characteristics:**
- Large data volumes
- Scheduled batch processing
- Complex transformations
- Heavy joins and aggregations

**Recommendation:**
- **Multi-node cluster with Autoscaling**
- Min: 4 nodes, Max: 20 nodes (adjust based on workload)
- **Memory-optimized instances** (for example, E-series on Azure or r-series on AWS) for heavy joins
- Instance type: `Standard_E8s_v3` or `Standard_E16s_v3`
- Consider job clusters for scheduled workloads

**Configuration Example:**
```
Cluster Mode: Standard
Min Workers: 4
Max Workers: 20
Instance Type: Standard_E8s_v3 (memory-optimized)
Autoscaling: Enabled
```
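
The same settings can be expressed as a Clusters API-style specification, where autoscaling is an `autoscale` block instead of a fixed `num_workers`. This is a minimal sketch: the cluster name and runtime version string are placeholders, and the node type passed in the example is an assumed memory-optimized Azure size.

```python
def autoscaling_spec(name: str, node_type_id: str, spark_version: str,
                     min_workers: int, max_workers: int) -> dict:
    """Clusters API-style spec for an autoscaling cluster."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("expected 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # `autoscale` replaces the fixed `num_workers` field.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


etl_spec = autoscaling_spec(
    "batch-etl", "Standard_E8s_v3", "14.3.x-scala2.12",
    min_workers=4, max_workers=20,
)
print(etl_spec)
```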

#### Scenario C: High-Concurrency (Many Users/BI)

**Characteristics:**
- Multiple simultaneous users
- BI tool connections
- Mixed workload types
- Need for fair resource distribution

**Recommendation:**
- **Shared Access Mode cluster** with autoscaling
- Larger cluster size to handle concurrent queries
- Consider SQL Warehouse for pure SQL workloads
- Instance type: `Standard_DS4_v2` or larger
- Enable fair scheduling

**Configuration Example:**
```
Cluster Mode: Standard
Access Mode: Shared
Min Workers: 8
Max Workers: 50
Instance Type: Standard_DS4_v2
Autoscaling: Enabled
```

#### Scenario D: Real-Time Streaming

**Characteristics:**
- Continuous data processing
- Low latency requirements
- Always-on workloads

**Recommendation:**
- **Always-on cluster** (don't auto-terminate)
- Stable cluster size (limited autoscaling)
- Consider Structured Streaming optimizations
- Instance type: General-purpose, or compute-optimized if per-record processing is CPU-heavy

### 7.3 Cluster Configuration Best Practices

**Driver Node:**
- Usually same instance type as workers
- For large clusters, consider larger driver node
- Driver needs memory for:
  - Spark application state
  - Broadcast variables
  - Result collection

**Worker Nodes:**
- Choose based on workload:
  - **Compute-optimized**: CPU-intensive workloads
  - **Memory-optimized**: Large joins, caching
  - **Storage-optimized**: I/O-intensive workloads
  - **General-purpose**: Balanced workloads

**Spark Configuration:**
- `spark.sql.shuffle.partitions`: Default 200, adjust based on data size
- `spark.executor.memory`: Derived automatically from the instance type on Databricks; override only with a measured reason
- `spark.executor.cores`: Also set automatically to match the instance vCPUs
- Enable Photon and AQE for better performance
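
One way to pick a value for `spark.sql.shuffle.partitions` is to size partitions toward a target and round up to a multiple of the cluster's core count. The sketch below assumes a target of 128 MB per partition, which is a common rule of thumb rather than a Databricks-mandated figure; the function name and inputs are hypothetical.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes: int, total_cores: int,
                               target_partition_bytes: int = 128 * 1024**2) -> int:
    """Suggest a shuffle partition count for a given shuffle size.

    The count is rounded up to a multiple of total_cores so that every core
    has work in each wave of tasks.
    """
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores


# 500 GiB of shuffle data on a cluster with 64 worker cores in total.
print(suggest_shuffle_partitions(500 * 1024**3, total_cores=64))
# 1 GiB of shuffle data on a cluster with 8 worker cores.
print(suggest_shuffle_partitions(1 * 1024**3, total_cores=8))
```

With AQE enabled, a generous starting value matters less, because small partitions are coalesced at runtime.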

---

## 8. Cluster Pools

### What are Cluster Pools?

**Cluster Pools** (called *instance pools* in the Databricks documentation) are collections of idle, ready-to-use cloud instances that Databricks maintains to reduce cluster start and scale-up times.

### How Cluster Pools Work

1. **Pre-provisioned Instances**: Databricks keeps a pool of instances running
2. **Fast Cluster Start**: When you create a cluster from a pool, it starts in **~1-2 minutes** instead of 3-5 minutes
3. **Cost Efficiency**: Databricks does not charge DBUs for idle instances in a pool, although the cloud provider still bills for the VMs
4. **Automatic Management**: Databricks keeps the pool between the configured minimum idle count and maximum capacity

### Benefits

- **Faster Startup**: Reduced cluster start time
- **Cost Savings**: Idle pool instances incur cloud provider charges but no DBUs, so they cost less than idle clusters
- **Better Resource Utilization**: Instances are reused across clusters
- **Reduced Cold Starts**: Instances are "warm" and ready

### When to Use Cluster Pools

- **Frequent cluster creation/termination**: Jobs that start/stop often
- **Interactive development**: When you need quick cluster starts
- **Cost optimization**: When you have predictable workload patterns
- **Team environments**: Shared pools for multiple users

### Configuration

- **Min Idle Instances**: Minimum instances to keep warm
- **Max Capacity**: Maximum instances in the pool
- **Instance Type**: Choose based on your typical workload
- **Preloaded Runtime Version**: Preload a Databricks Runtime version on pool instances so clusters using it launch faster
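
These options correspond to fields of an Instance Pools API-style payload. The sketch below is illustrative: the pool name, sizes, idle timeout, and runtime version string are placeholders, and `instance_pool_spec` is a hypothetical helper.

```python
def instance_pool_spec(name: str, node_type_id: str, min_idle: int,
                       max_capacity: int, idle_minutes: int,
                       preloaded_spark_version: str) -> dict:
    """Instance Pools API-style payload for a pool of warm instances."""
    if min_idle > max_capacity:
        raise ValueError("min_idle cannot exceed max_capacity")
    return {
        "instance_pool_name": name,
        "node_type_id": node_type_id,
        "min_idle_instances": min_idle,   # instances kept warm at all times
        "max_capacity": max_capacity,     # idle plus in-use instances
        # Extra idle instances above the minimum are released after this long.
        "idle_instance_autotermination_minutes": idle_minutes,
        "preloaded_spark_versions": [preloaded_spark_version],
    }


pool = instance_pool_spec(
    "shared-dev-pool", "Standard_DS3_v2",
    min_idle=2, max_capacity=20, idle_minutes=30,
    preloaded_spark_version="14.3.x-scala2.12",
)
print(pool)
```

A cluster then draws its nodes from the pool by referencing the pool's ID in `instance_pool_id` instead of specifying a node type.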

---

## 9. Autoscaling

### What is Autoscaling?

**Autoscaling** automatically adds or removes worker nodes from a cluster based on workload demand.

### How Autoscaling Works

1. **Monitor Workload**: Databricks monitors cluster utilization
2. **Scale Up**: Adds workers when:
   - Tasks are queued
   - High CPU/memory utilization
   - Long-running tasks
3. **Scale Down**: Removes workers when:
   - Low utilization
   - Idle workers
   - Tasks complete
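
The loop above can be sketched as a toy decision function. This is a deliberately simplified model for intuition, not the algorithm Databricks uses: it adds enough workers to cover queued tasks, removes idle workers when nothing is queued, and clamps the result to the configured bounds.

```python
import math


def target_workers(current: int, pending_tasks: int, cores_per_worker: int,
                   idle_workers: int, min_workers: int, max_workers: int) -> int:
    """Return the worker count a simple autoscaler would aim for."""
    if pending_tasks > 0:
        # Scale up: one task slot per core, so cover the queue with new workers.
        desired = current + math.ceil(pending_tasks / cores_per_worker)
    else:
        # Scale down: release workers that have nothing to do.
        desired = current - idle_workers
    return max(min_workers, min(max_workers, desired))


print(target_workers(4, pending_tasks=40, cores_per_worker=8,
                     idle_workers=0, min_workers=2, max_workers=10))
print(target_workers(6, pending_tasks=0, cores_per_worker=8,
                     idle_workers=5, min_workers=2, max_workers=10))
```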

### Autoscaling Configuration

**Min Workers:** Minimum number of worker nodes (always running)

**Max Workers:** Maximum number of worker nodes (scaling limit)

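**Autoscaling Types:**
- **Optimized autoscaling**: Scales up in fewer steps and can scale down while the cluster is still running work; availability depends on your pricing plan
- **Standard autoscaling**: Scales up and down more gradually; used where optimized autoscaling is not available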

### Benefits

- **Cost Optimization**: Pay only for resources you use
- **Performance**: Automatically handles workload spikes
- **Flexibility**: Adapts to changing data volumes
- **Efficiency**: Removes idle workers automatically

### Best Practices

- Set **min workers** based on baseline workload
- Set **max workers** based on peak workload and budget
- Use autoscaling for variable workloads
- Monitor scaling behavior and adjust if needed
- Consider cluster pools for faster scaling

### When Not to Use Autoscaling

- **Fixed workloads**: If workload is always the same size
- **Very short jobs**: Scaling overhead may not be worth it
- **Strict SLAs**: Fixed size provides predictable performance

---

## 10. Instance Types and Selection

### Understanding Instance Types

Instance types determine the CPU, memory, and storage characteristics of your cluster nodes.

### Instance Type Categories

#### General Purpose (D-series)
- **Use Case**: Balanced workloads, general data processing
- **Examples**: `Standard_DS3_v2`, `Standard_DS4_v2`
- **Characteristics**: Balanced CPU, memory, and storage

#### Compute Optimized (F-series)
- **Use Case**: CPU-intensive workloads, high-performance computing
- **Examples**: `Standard_F4s_v2`, `Standard_F8s_v2`
- **Characteristics**: High CPU-to-memory ratio

#### Memory Optimized (E-series on Azure, r-series on AWS)
- **Use Case**: Large joins, caching, in-memory processing
- **Examples**: `Standard_E8s_v3`, `Standard_E16s_v3`
- **Characteristics**: High memory-to-CPU ratio
- **Best For**: Spark operations that require large memory (joins, aggregations, caching)

#### Storage Optimized (Ls-series)
- **Use Case**: I/O-intensive workloads, large local storage needs
- **Examples**: `Standard_L8s_v2`, `Standard_L16s_v2`
- **Characteristics**: High local SSD storage, high I/O performance

### Selecting the Right Instance Type

**Consider:**
1. **Workload Type**:
   - CPU-bound → Compute optimized
   - Memory-bound → Memory optimized
   - I/O-bound → Storage optimized
   - Mixed → General purpose

2. **Data Size**:
   - Small data → Smaller instances
   - Large data → Larger instances or more nodes

3. **Cost**:
   - Balance performance needs with budget
   - Consider spot instances for cost savings

4. **Spark Operations**:
   - Joins and aggregations → Memory optimized
   - Transformations → General purpose
   - Shuffling → Storage optimized

### Instance Type Examples

| Instance Type | vCPUs | Memory | Local Storage | Best For |
|--------------|-------|--------|---------------|----------|
| Standard_DS3_v2 | 4 | 14 GB | 28 GB | Small workloads, development |
| Standard_DS4_v2 | 8 | 28 GB | 56 GB | Medium workloads, general purpose |
| Standard_E8s_v3 | 8 | 64 GB | 128 GB | Memory-intensive, large joins |
| Standard_E16s_v3 | 16 | 128 GB | 256 GB | Very memory-intensive workloads |
| Standard_F4s_v2 | 4 | 8 GB | 32 GB | CPU-intensive computations |

---

## 11. Cluster Lifecycle Management

### Cluster States

1. **Pending**: Cluster is being created or started
2. **Running**: Cluster is active and ready
3. **Resizing**: Workers are being added or removed
4. **Restarting**: Cluster is restarting (after failure or manual restart)
5. **Terminating**: Cluster is shutting down
6. **Terminated**: Cluster has been stopped
7. **Error**: Cluster failed to start or encountered an error

### Termination Policies

**Auto-Termination:**
- Automatically terminates cluster after specified idle time
- Default: 120 minutes of inactivity
- Configurable per cluster
- **Best Practice**: Enable for all-purpose clusters to save costs

**Manual Termination:**
- User manually stops the cluster
- Immediate termination
- Data in local storage is lost

**Job Completion:**
- Job clusters terminate automatically after job completes
- No manual intervention needed

### Cluster Restart

**When to Restart:**
- After configuration changes
- After library installations
- To clear cached data
- To resolve performance issues

**Restart Options:**
- **Cluster restart**: Restarts the driver and all workers (clears all state)
- **Detach and re-attach a notebook**: Clears that notebook's state without restarting the cluster (faster)

### Best Practices

- **Enable auto-termination** for all-purpose clusters
- **Use job clusters** for scheduled workloads
- **Monitor cluster health** regularly
- **Restart clusters** after significant changes
- **Terminate unused clusters** to save costs

---

## 12. Cost Optimization Strategies

### Understanding Databricks Costs

Costs come from:
1. **Compute (DBUs)**: Databricks Unit consumption
2. **Cloud Infrastructure**: VM instance costs
3. **Storage**: Cloud storage costs (separate from Databricks)
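
The first two components can be combined into a rough per-run estimate. The sketch below assumes every node has the same VM price and DBU rate, and all prices are invented for illustration; real rates depend on cloud, region, instance type, and plan. Storage is excluded because it is billed separately.

```python
def estimate_run_cost(nodes: int, hours: float, vm_price_per_hour: float,
                      dbu_per_node_hour: float, price_per_dbu: float) -> dict:
    """Estimate compute cost for a run: cloud VMs plus DBU charges."""
    infrastructure = nodes * hours * vm_price_per_hour
    dbus = nodes * hours * dbu_per_node_hour
    databricks = dbus * price_per_dbu
    return {
        "dbus": dbus,
        "infrastructure": infrastructure,
        "databricks": databricks,
        "total": infrastructure + databricks,
    }


# Hypothetical run: 1 driver + 4 workers for 2 hours,
# $0.40 per VM-hour, 1.5 DBU per node-hour, $0.15 per DBU.
cost = estimate_run_cost(nodes=5, hours=2, vm_price_per_hour=0.40,
                         dbu_per_node_hour=1.5, price_per_dbu=0.15)
print({key: round(value, 2) for key, value in cost.items()})
```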

### Cost Optimization Techniques

#### 1. Right-Size Clusters
- **Don't over-provision**: Use appropriately sized instances
- **Monitor utilization**: Adjust based on actual usage
- **Use autoscaling**: Scale down when not needed

#### 2. Use Job Clusters
- **For scheduled workloads**: Job clusters are more cost-effective
- **Automatic termination**: No idle time costs
- **Optimized for batch**: Better resource utilization

#### 3. Enable Auto-Termination
- **All-purpose clusters**: Set reasonable idle timeout
- **Default 120 minutes**: Adjust based on usage patterns
- **Saves costs**: A terminated cluster accrues no compute charges

#### 4. Use Cluster Pools
- **Faster starts**: Reduces wasted time
- **Lower idle costs**: Idle pool instances incur cloud provider charges but no DBUs
- **Better utilization**: Shared resources

#### 5. Choose Appropriate Instance Types
- **Match workload**: Don't use memory-optimized for CPU-bound tasks
- **Consider spot instances**: For fault-tolerant workloads (if available)
- **Right size**: Avoid unnecessarily large instances

#### 6. Optimize Spark Configuration
- **Enable Photon**: Faster execution = lower costs
- **Enable AQE**: Better resource utilization
- **Tune partitions**: Reduce shuffle overhead
- **Cache strategically**: Only cache frequently used data

#### 7. Use Serverless (When Available)
- **Fast start**: Little time is spent waiting for VMs to provision
- **Automatic optimization**: Better resource utilization
- **Pay per use**: Only charged for actual compute time

#### 8. Monitor and Analyze
- **Review cluster usage**: Identify underutilized clusters
- **Analyze costs**: Use Databricks cost analysis tools
- **Set budgets**: Track spending against a target and get notified when it is exceeded
- **Optimize regularly**: Review and adjust monthly

### Cost Monitoring

- **Databricks Cost Analysis**: Built-in cost tracking
- **Cloud Provider Billing**: Monitor VM costs
- **Usage Reports**: Track DBU consumption
- **Budget Alerts**: Set up spending notifications

---

## 13. Monitoring and Performance

### Cluster Monitoring

**Key Metrics to Monitor:**

1. **CPU Utilization**
   - High CPU → May need more workers or larger instances
   - Low CPU → May be over-provisioned

2. **Memory Utilization**
   - High memory → Consider memory-optimized instances
   - Memory pressure → May cause spills to disk

3. **Network I/O**
   - Monitor data transfer rates
   - High network usage → Consider data locality optimizations

4. **Storage I/O**
   - Disk read/write rates
   - High I/O → May need storage-optimized instances

5. **Task Execution**
   - Task duration
   - Failed tasks
   - Queued tasks (indicates need for scaling)
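
The rules of thumb above can be turned into a small checker that maps observed metrics to sizing hints. The thresholds (80% and 20%) and the function name are assumptions chosen for illustration, not Databricks defaults.

```python
from typing import List


def sizing_hints(cpu_pct: float, mem_pct: float, queued_tasks: int,
                 spill_gb: float, high: float = 80.0,
                 low: float = 20.0) -> List[str]:
    """Translate cluster metrics into the sizing hints listed above."""
    hints: List[str] = []
    if cpu_pct >= high:
        hints.append("add workers or use larger instances")
    elif cpu_pct <= low:
        hints.append("cluster may be over-provisioned")
    if mem_pct >= high or spill_gb > 0:
        hints.append("consider memory-optimized instances")
    if queued_tasks > 0:
        hints.append("raise max workers or enable autoscaling")
    return hints


print(sizing_hints(cpu_pct=92, mem_pct=60, queued_tasks=120, spill_gb=0))
print(sizing_hints(cpu_pct=10, mem_pct=35, queued_tasks=0, spill_gb=0))
```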

### Performance Optimization

**Spark UI:**
- Access via cluster details page
- View job execution plans
- Identify bottlenecks
- Analyze task distribution

**Query Optimization:**
- Use EXPLAIN to view query plans
- Identify expensive operations
- Optimize joins and aggregations
- Use appropriate partitioning

**Data Locality:**
- Keep compute in the same cloud region as the data it reads and writes
- Use Delta Lake optimizations (Z-ordering, compaction)
- Minimize data shuffling

### Best Practices

- **Regular Monitoring**: Check cluster metrics weekly
- **Performance Baselines**: Establish expected performance
- **Optimize Incrementally**: Make small, measured changes
- **Document Changes**: Track configuration changes and their impact
- **Use Photon and AQE**: Enable for automatic optimizations

---

## Summary

### Key Takeaways

1. **Compute vs. Cluster**: Compute is the infrastructure, clusters are the configured Spark environments
2. **Storage**: Clusters use ephemeral local storage; persistent data lives in cloud storage
3. **Cluster Types**: Choose all-purpose for development, job clusters for automation, SQL warehouses for BI
4. **Access Modes**: Single user for isolation, shared for collaboration
5. **Performance**: Enable Photon and AQE for automatic optimizations
6. **Sizing**: Right-size based on workload characteristics
7. **Cost**: Use autoscaling, auto-termination, and job clusters to optimize costs
8. **Monitoring**: Regularly monitor metrics and optimize based on actual usage

### Quick Reference

| Use Case | Cluster Type | Access Mode | Instance Type | Autoscaling |
|----------|--------------|-------------|---------------|-------------|
| Development | All-Purpose | Shared | General Purpose | Optional |
| Scheduled ETL | Job Cluster | N/A | Memory-Optimized | Recommended |
| BI/Reporting | SQL Warehouse | N/A | Auto-selected | Auto |
| Large Joins | All-Purpose | Single User | Memory-Optimized | Recommended |
| High Concurrency | All-Purpose | Shared | General Purpose | Recommended |

---

## Additional Resources

- [Databricks Documentation](https://docs.databricks.com/)
- [Cluster Configuration Guide](https://docs.databricks.com/clusters/)
- [Photon Engine Documentation](https://docs.databricks.com/runtime/photon.html)
- [Cost Optimization Best Practices](https://docs.databricks.com/administration-guide/account-settings/billing.html)

