### Databricks Jobs vs notebooks

**Notebooks: The "Drafting Board"**
- Notebooks are interactive. You use them to experiment, debug, and visualize data in real time.
- **Best for:** Development, ad-hoc analysis, and "exploring" the data.
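- **How they run:** They usually run on All-Purpose Clusters, which stay up until someone terminates them or auto-termination kicks in. They are billed at a higher rate than Job Clusters because they are built for interactive, shared use and quick user feedback.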
- **The "Human" Element:** You manually click "Run Cell" to see what happens.

**Databricks Jobs: The "Production Line"**
- A Job is an automated way to run your code on a schedule or in response to an event (like a new file arriving).
- **Best for:** Production pipelines, daily reporting, and the Medallion architecture.
- **How they run:** They typically run on Job Clusters. These clusters start up only when the job begins and shut down immediately after it finishes.
- **The "System" Element:** They run automatically at 2:00 AM while you sleep.

**Key Comparison**

| Feature | Notebook (Interactive) | Databricks Job (Automated) |
| ----- | ----- | ----- |
| Primary Use | Development / Debugging | Production Pipelines |
| Cluster Type | All-Purpose (Expensive) | Job Cluster (Cheaper/Efficient) |
| Execution | Manual (Cell by Cell) | Scheduled or Triggered |
| Reliability | Good for testing | High (Built-in retries & alerts) |
| Cost | High (Clusters stay active) | Low (Clusters only exist during run) |

### Multi-task workflows

- In Databricks, a Multi-task Workflow is the orchestration layer that connects your individual notebooks into a single, automated pipeline.
- Instead of manually running your Bronze notebook, then your Silver notebook, and finally your Gold notebook, you create a Directed Acyclic Graph (DAG). 
- This ensures that Task B only starts if Task A finishes successfully.
- **How a Multi-task Workflow looks:** For the Medallion architecture, your workflow is a chain of dependencies:
  - Task 1 (Bronze): Ingest raw files from S3/ADLS.
  - Task 2 (Silver): Clean data and split categories (depends on Task 1).
  - Task 3 (Gold): Aggregate values (depends on Task 2).
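
The Bronze, Silver, Gold chain above could be written as a job definition along these lines. This is a minimal sketch, assuming the Jobs API 2.1 payload shape (`tasks`, `task_key`, `depends_on`, `notebook_task`); the job name and notebook paths are hypothetical. The `run_order` helper uses Python's standard `graphlib` module only to show the order the dependencies imply. It is not how Databricks schedules tasks internally.

```python
from graphlib import TopologicalSorter

# Hypothetical job definition; the job name and notebook paths are placeholders.
medallion_job = {
    "name": "medallion_pipeline",
    "tasks": [
        {
            "task_key": "bronze",
            "notebook_task": {"notebook_path": "/Pipelines/Bronze_Layer_Notebook"},
        },
        {
            "task_key": "silver",
            "depends_on": [{"task_key": "bronze"}],
            "notebook_task": {"notebook_path": "/Pipelines/Silver_Layer_Notebook"},
        },
        {
            "task_key": "gold",
            "depends_on": [{"task_key": "silver"}],
            "notebook_task": {"notebook_path": "/Pipelines/Gold_Layer_Notebook"},
        },
    ],
}


def run_order(job):
    """Return task keys in an order that respects every depends_on edge."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in job["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


print(run_order(medallion_job))
```

Because each task names its upstream task in `depends_on`, adding a second Gold task that also depends on `silver` would turn the chain into a branching DAG without touching the other tasks.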

**Key Features of Multi-task Workflows**
- **Task Dependencies:** You can set "upstream" tasks. If the Silver cleaning fails due to a data quality issue, the Gold aggregation will never start, preventing "bad data" from reaching the business.
- **Parallel Execution:** If you have two different Gold tables, you can run them both at the same time after the Silver task finishes to save time.
- **Parameters:** You can pass values between tasks. For example, the first task can publish a "processing_date" as a task value (`dbutils.jobs.taskValues.set`), and the subsequent tasks can read it.
- **Repair Run:** This is a lifesaver. If your workflow has 10 tasks and fails at Task 9, you can "Repair" it, and Databricks will re-run only the failed task and the tasks that depend on it, not the ones that already succeeded, saving you money and time.
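
To make the dependency and repair behaviour concrete, here is a small plain-Python simulation. It is only a sketch of the idea, not Databricks code: the task names, the `skipped` status label, and both helper functions are hypothetical.

```python
# Upstream dependencies for a hypothetical workflow with two Gold tables.
DEPENDS_ON = {
    "bronze": [],
    "silver": ["bronze"],
    "gold_sales": ["silver"],
    "gold_customers": ["silver"],
}


def simulate_run(depends_on, failing, already_succeeded=()):
    """Return a status per task: 'success', 'failed', or 'skipped'.

    Assumes the dict lists every task after its upstream tasks.
    """
    status = {}
    for task, upstream in depends_on.items():
        if task in already_succeeded:
            status[task] = "success"  # result kept from the earlier run
        elif any(status[u] != "success" for u in upstream):
            status[task] = "skipped"  # an upstream task did not succeed
        elif task in failing:
            status[task] = "failed"
        else:
            status[task] = "success"
    return status


def tasks_to_repair(status):
    """Tasks a repair would re-run: everything that did not succeed."""
    return [task for task, state in status.items() if state != "success"]


# First run: the Silver task fails, so neither Gold task starts.
first_run = simulate_run(DEPENDS_ON, failing={"silver"})
repair_list = tasks_to_repair(first_run)

# Repair: tasks that already succeeded are not executed again.
succeeded = [task for task, state in first_run.items() if state == "success"]
repaired = simulate_run(DEPENDS_ON, failing=set(), already_succeeded=succeeded)
```

In this sketch the repair list contains `silver` and both Gold tasks, while `bronze` keeps its earlier result.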

**Why use Multi-task Jobs over one giant Notebook?**

| Feature | Single Giant Notebook | Multi-task Workflow |
| ----- | ----- | ----- |
| Debugging | Hard to find where it failed in 1000 lines. | You see exactly which "Task" failed in the UI. |
| Compute | Uses one cluster for everything. | Can use different cluster sizes for different tasks. |
| Retry Logic | Fails entirely. | You can set specific retries for "flaky" source systems. |
| Collaboration | Hard for multiple people to edit. | Different team members can own different notebooks. |

### Parameters & scheduling

- Parameters make your code flexible, and Scheduling makes it autonomous.

**1. Parameters (Widgets)**
- Instead of hard-coding values like a specific date or a file path, you use Widgets. 
- This allows you to run the same notebook for different scenarios without changing the code.
- In Databricks, you define a parameter at the top of your notebook:
```
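from pyspark.sql.functions import col

# Create a text widget for the processing date (the second argument is the default value)
dbutils.widgets.text("processing_date", "2026-01-14")

# Pull the value into a variable (widget values are always strings)
run_date = dbutils.widgets.get("processing_date")

# Use it in your Silver filter; bronze_df is assumed to be loaded earlier in the notebook
silver_df = bronze_df.filter(col("event_time").cast("date") == run_date)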
```

**Why use this?**
- **Backfilling:** If you need to re-run data for last month, you just change the parameter value in the Job UI rather than editing your Python code.
- **Dynamic Paths:** You can pass the input folder path as a parameter.
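
For a backfill over several days, you might generate one `processing_date` value per day and run the job once for each. The sketch below only builds the parameter values in plain Python; `backfill_dates` is a hypothetical helper, and how you submit the runs (Job UI, API, or a loop) is left open.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """Return every date from start to end (inclusive) as 'YYYY-MM-DD' strings."""
    days = (end - start).days
    return [(start + timedelta(days=offset)).isoformat() for offset in range(days + 1)]


# One parameter set per day to re-run
runs = [
    {"processing_date": day}
    for day in backfill_dates(date(2026, 1, 12), date(2026, 1, 14))
]
print(runs)
```

Each dictionary matches the widget name used above, so the notebook code stays unchanged between a normal run and a backfill run.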

**2. Scheduling (Triggering)**
- Once your Multi-task Workflow is built, you need to tell Databricks when to run it. You have three main options:

**A. Scheduled (Cron)**: The most common. You set a specific time (e.g., "Every day at 6:00 AM").
- Use Case: Daily business reports for the Gold layer.

**B. File Arrival (Trigger)**: Databricks monitors a folder in your cloud storage (S3/ADLS). As soon as a new file lands, the Job starts.
- Use Case: Near-real-time ingestion, where processing starts shortly after a file lands.

**C. Continuous**: The job starts again as soon as the previous run finishes.
- Use Case: High-frequency streaming data.
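
Scheduled jobs are defined with a Quartz cron expression, which has a leading seconds field that standard Unix cron lacks. The following is a minimal sketch, assuming the Jobs API 2.1 `schedule` fields (`quartz_cron_expression`, `timezone_id`, `pause_status`); `daily_quartz_cron` is a hypothetical helper.

```python
def daily_quartz_cron(hour, minute=0):
    """Build a Quartz cron expression that fires once a day at hour:minute.

    Quartz field order: seconds minutes hours day-of-month month day-of-week.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"0 {minute} {hour} * * ?"


# Hypothetical schedule block for "Every day at 6:00 AM" in UTC
schedule = {
    "quartz_cron_expression": daily_quartz_cron(6),
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}
print(schedule["quartz_cron_expression"])  # 0 0 6 * * ?
```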

**3. Putting it together in a Workflow**
- When you configure the Job Task, you map these together:
  - Select Notebook: `Silver_Layer_Notebook`
  - Parameters: Key: `processing_date`, Value: `{{job.start_time.iso_date}}` (a dynamic value reference that resolves to the date the run started, in UTC, as yyyy-MM-dd).
  - Schedule: Set to Daily, 02:00 AM.

**The "Production" Benefit**
- By combining Multi-tasking, Parameters, and Scheduling:
- Error Handling: You can set "Retries" (e.g., "If Task 1 fails, try again 3 times every 5 minutes").
- Notifications: You can set an alert to email you only if the Job fails.
- Cost Tracking: You can tag the Job so you know exactly how much money the "Medallion Pipeline" is costing each month.

### Error handling

**1. Job-Level Error Handling (The Safety Net)**
- Databricks Workflows provide built-in settings so you don't have to write complex "retry" logic yourself.
- Retries: You can configure a task to "Retry on failure." For example, if a cloud connection drops, Databricks can wait 5 minutes and then run the task again.
- Timeouts: Prevent a "stuck" job from running forever and costing money by setting a maximum duration (e.g., 2 hours).
- On-Failure Tasks: You can add a task whose "Run if" condition fires only when an upstream task fails. If the Silver layer fails, it runs a "Cleanup" task or sends a high-priority alert to Slack/PagerDuty.
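
The settings above could look roughly like this in a job definition. It is a sketch under assumptions: the field names (`max_retries`, `min_retry_interval_millis`, `timeout_seconds`, `run_if`) follow the Jobs API 2.1 task settings, while the task keys and notebook paths are hypothetical.

```python
# Hypothetical task settings; task keys and notebook paths are placeholders.
silver_task = {
    "task_key": "silver",
    "depends_on": [{"task_key": "bronze"}],
    "notebook_task": {"notebook_path": "/Pipelines/Silver_Layer_Notebook"},
    "max_retries": 3,                            # try again up to 3 times
    "min_retry_interval_millis": 5 * 60 * 1000,  # wait 5 minutes between attempts
    "timeout_seconds": 2 * 60 * 60,              # stop the task after 2 hours
}

cleanup_task = {
    "task_key": "cleanup",
    "depends_on": [{"task_key": "silver"}],
    "run_if": "AT_LEAST_ONE_FAILED",             # run only when an upstream task failed
    "notebook_task": {"notebook_path": "/Pipelines/Cleanup_Notebook"},
}
```

Keeping retries on the task that talks to the flaky system, rather than on the whole job, means a transient failure does not force the other tasks to run again.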

**2. Code-Level Error Handling (The Surgeon)**
- In your Notebooks, you use Python's try-except blocks to handle specific data issues without crashing the whole pipeline.
- Example: Handling a Missing File
```
# path is assumed to be defined earlier in the notebook (e.g. read from a widget)
try:
    df = spark.read.format("parquet").load(path)
except Exception as e:
    if "Path does not exist" in str(e):
        print("No new data found today. Skipping Task.")
        # Record the outcome so downstream tasks can read it
        dbutils.jobs.taskValues.set(key="status", value="no_data")
        # Stop the notebook here; the task is still marked as succeeded
        dbutils.notebook.exit("Success: No data to process")
    else:
        raise  # Re-raise real errors with the original traceback
```

**3. Expectations & Data Quality (The Filter)**
- For the Medallion architecture, you want to catch "bad data" before it ruins your Gold aggregates.
- Filter Logic: Instead of letting the code crash on a null price, you can redirect bad rows to a Quarantine Table.
- DLT Expectations: If you are using Delta Live Tables (DLT), you can declare rules such as `CONSTRAINT valid_price EXPECT (price > 0) ON VIOLATION DROP ROW`, which drops any row that breaks the rule instead of failing the pipeline.
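
The quarantine idea can be sketched without Spark. The example below uses plain Python dictionaries as a stand-in for DataFrame rows; in a real notebook the same rule would be two DataFrame filters writing to two tables. The function name, column names, and `quarantine_reason` field are all hypothetical.

```python
def split_valid_and_quarantine(rows):
    """Split rows into (valid, quarantined) using the rule price > 0."""
    valid, quarantined = [], []
    for row in rows:
        price = row.get("price")
        if price is not None and price > 0:
            valid.append(row)
        else:
            # Keep the bad row and record why it was rejected
            reason = "null_price" if price is None else "non_positive_price"
            quarantined.append({**row, "quarantine_reason": reason})
    return valid, quarantined


rows = [
    {"order_id": 1, "price": 19.99},
    {"order_id": 2, "price": None},
    {"order_id": 3, "price": -5.0},
]
valid, quarantined = split_valid_and_quarantine(rows)
```

Storing the reason alongside each quarantined row makes it easier to fix the source data later and replay only those rows.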

**4. Notifications (The Alarm)**
- Don't wait to check the dashboard. In the Workflows UI, you can set Notifications:
- On Start: Useful for long-running batch jobs.
- On Success: Good for peace of mind.
- On Failure: Critical. This sends the failure notification to your email or a webhook, so you can go straight to the failed run.
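
If you define the job as code instead of through the UI, a failure-only alert might be expressed as follows. This is a sketch, assuming the Jobs API 2.1 `email_notifications` block; the recipient address and the `events_that_notify` helper are hypothetical.

```python
# Hypothetical recipient; replace with a real address.
ON_CALL = ["data-oncall@example.com"]

email_notifications = {
    "on_start": [],         # no mail when a run starts
    "on_success": [],       # no mail when a run succeeds
    "on_failure": ON_CALL,  # mail the on-call address when a run fails
}


def events_that_notify(settings):
    """Return the event names that have at least one recipient."""
    return [event for event, recipients in settings.items() if recipients]


print(events_that_notify(email_notifications))
```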

### Task 1: Add parameter widgets to notebooks