### Spark architecture (driver, executors, DAG)

Databricks is a managed platform built around Apache Spark. Spark provides the distributed engine; Databricks adds automation, governance and security (Unity Catalog), and Photon, a vectorized native query engine, to make Spark faster and easier to use.

**1. The Driver (The Brain)**
* The Driver is the central coordinator. When you run a cell in a Databricks notebook, you are talking directly to the Driver.
* **Responsibilities:** It runs your main() function, creates the SparkSession, and maintains all the information about the Spark application.
* **The Planner:** It translates your high-level code (Python, SQL, Scala) into a logical plan and then a physical execution plan.
* **Task Master:** It breaks the work into small "tasks" and schedules them to be executed by the workers.
* **State Keeper:** It tracks where data is located across the cluster and manages the overall lifecycle of the job.

With Spark Connect, available on recent Databricks runtimes, the client is decoupled from the Driver. You can connect from an IDE or application through a thin client, which Databricks presents as improving stability and enabling "versionless" upgrades.

**2. The Executors (The Muscles)**
* Executors are the processes that live on the Worker Nodes. If the Driver is the architect, the Executors are the construction crew.
* **Task Execution:** They receive tasks from the Driver, run the actual computation (filtering, joining, aggregating), and report the results back.
* **Data Storage:** They store data in-memory or on disk for fast access. When you "cache" a DataFrame, it lives in the memory of these Executors.
* **Isolation:** Each Spark application gets its own dedicated Executors, which are not shared with other applications. If an Executor crashes, the Driver reschedules its tasks on another Executor.
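
The division of labour between the Driver and the Executors can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's scheduler: `run_job`, the executor names, and the `failing` set are hypothetical, and real Spark runs tasks in parallel in separate JVM processes rather than in a loop.

```python
from collections import deque


def run_job(partitions, executors, task_fn, failing=frozenset()):
    """Toy driver loop: one task per partition, rescheduled if its executor is down.

    `failing` is a hypothetical set of executor names treated as crashed.
    """
    pending = deque(enumerate(partitions))  # (partition_id, rows) pairs still to run
    healthy = list(executors)
    results, ran_on = {}, {}
    turn = 0
    while pending:
        if not healthy:
            raise RuntimeError("no executors left to run tasks")
        pid, rows = pending.popleft()
        executor = healthy[turn % len(healthy)]  # simple round-robin placement
        turn += 1
        if executor in failing:
            healthy.remove(executor)     # stop scheduling onto the lost executor
            pending.append((pid, rows))  # the task goes back in the queue
            continue
        results[pid] = task_fn(rows)     # the "executor" does the actual work
        ran_on[pid] = executor
    return [results[pid] for pid in range(len(partitions))], ran_on


partition_sums, ran_on = run_job(
    partitions=[[1, 2], [3, 4], [5]],
    executors=["exec-1", "exec-2"],
    task_fn=sum,
    failing={"exec-1"},
)
```

Here the task first placed on the crashed `exec-1` is retried on `exec-2`, so the job still returns one result per partition.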

**3. The DAG (The Blueprint)**
* The Directed Acyclic Graph (DAG) is how Spark maps out the "recipe" for your data transformation.
* **Directed:** The process flows in one direction (from input to output).
* **Acyclic:** There are no loops. You can't go "backward" in the middle of an execution.
* **Lazy Evaluation:** When you write `df.filter(...)`, Spark doesn't actually do anything yet. It just adds a node to the DAG. It only starts working when you call an Action (like `.show()`, `.count()`, or `.write.save()`).

**Stages & Tasks:**
* **Stages:** The DAG is broken into stages at "Shuffle" boundaries, that is, wherever data must be redistributed between executors (typically during a join or groupBy).
* **Tasks:** Each stage is further broken into tasks, one task per data partition.
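
As a rough illustration of how stages and tasks fall out of a chain of operations, here is a small sketch in plain Python. It is a simplification, not Spark's planner: the `WIDE_OPERATIONS` set and both helper functions are hypothetical, and in real Spark a join may avoid a shuffle (for example, a broadcast join) and the number of post-shuffle partitions can be adjusted at runtime.

```python
# Operations that need a shuffle in this toy model (a simplification of
# Spark's wide dependencies).
WIDE_OPERATIONS = {"groupBy", "join", "orderBy", "distinct", "repartition"}


def split_into_stages(operations):
    """Start a new stage at every operation that requires a shuffle."""
    stages = [[]]
    for op in operations:
        if op in WIDE_OPERATIONS and stages[-1]:
            stages.append([])  # the shuffle closes the current stage
        stages[-1].append(op)
    return stages


def tasks_per_stage(stages, input_partitions, shuffle_partitions):
    """One task per partition: input partitions first, shuffle partitions afterwards."""
    return [
        input_partitions if index == 0 else shuffle_partitions
        for index in range(len(stages))
    ]


plan = ["read", "filter", "select", "groupBy", "count", "orderBy", "limit"]
stages = split_into_stages(plan)
task_counts = tasks_per_stage(stages, input_partitions=8, shuffle_partitions=200)
```

With 8 input partitions and 200 shuffle partitions (both assumed values), the plan above yields three stages with 8, 200, and 200 tasks.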

### DataFrames vs RDDs

**1. RDDs: The Foundational Layer**
* An RDD (Resilient Distributed Dataset) is a distributed collection of elements. RDDs are "opaque" to Spark: the engine doesn't know what's inside your data and sees only a collection of Java or Python objects.
* **Manual Control:** You have to tell Spark exactly how to process data (e.g., `map`, `flatMap`, `reduceByKey`).
* **The Python Tax:** In PySpark, RDD operations are significantly slower because every record must be serialized and moved between the JVM and the Python worker processes.
* **When to use:** Use RDDs only if you are working with unstructured data (like media files or raw text) or if you need very low-level control over partitioning that the DataFrame API doesn't provide.
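
Why opacity matters can be illustrated without Spark. In this minimal sketch, an RDD-style filter is an arbitrary Python function the engine cannot look inside, while a DataFrame-style filter is plain data the engine can inspect. The tuple encoding and the `columns_needed` helper are hypothetical stand-ins, not Spark APIs.

```python
# RDD style: an arbitrary function. The engine can call it but cannot inspect it.
rdd_style_filter = lambda row: row["country"] == "USA"

# DataFrame style: the same condition expressed as data (column, operator, value).
dataframe_style_filter = ("country", "==", "USA")


def columns_needed(predicate):
    """Return the columns a filter reads, or None when that cannot be known."""
    if callable(predicate):
        return None  # opaque: every column must be read to be safe
    column, _operator, _value = predicate
    return {column}


opaque_columns = columns_needed(rdd_style_filter)
known_columns = columns_needed(dataframe_style_filter)
```

Because the structured form exposes which column it touches, an optimizer can skip the others; with the opaque function it has to assume every column is needed.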

**2. DataFrames: The Optimized Evolution**
* A DataFrame is a distributed collection of rows with a schema, conceptually an RDD plus column names and types. Because Spark knows the schema, it can apply two optimizations that make your code run faster.
* **The Catalyst Optimizer:** When you write a DataFrame query, Spark doesn't run it as written. It builds a logical plan, optimizes it (for example, by pushing filters earlier to read less data), and then produces an optimized physical plan. **Example:** If you filter a 1 TB table on `Country = 'USA'`, Catalyst pushes the filter down to the data source where the format supports it (Parquet and Delta can skip data using statistics), so Spark avoids pulling the full 1 TB into memory.
* **Project Tungsten:** This is Spark's specialized memory management. Instead of using standard Java objects (which are heavy), Tungsten stores data in a compact binary format. This substantially reduces the garbage collection overhead that often slows down big data jobs.
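
The filter-pushdown idea can be sketched as a rewrite rule over a tiny plan. This is a toy under stated assumptions, not Catalyst itself: the plan encoding (a list of `(kind, argument)` tuples) and the `push_filters_to_scan` function are hypothetical.

```python
def push_filters_to_scan(plan):
    """Move a filter next to the scan when every step it skips is a
    projection that keeps the filtered column."""
    scan, *rest = plan
    pushed, remaining = [], []
    for step in rest:
        kind, arg = step
        can_push = kind == "filter" and all(
            other_kind == "select" and arg[0] in columns
            for other_kind, columns in remaining
        )
        if can_push:
            pushed.append(step)
        else:
            remaining.append(step)
    return [scan, *pushed, *remaining]


# The user wrote the filter last; the rule moves it directly after the scan.
written = [
    ("scan", "events"),
    ("select", ["country", "price"]),
    ("filter", ("country", "==", "USA")),
]
optimized = push_filters_to_scan(written)

# A filter after a limit must stay put, because moving it would change the result.
with_limit = [("scan", "events"), ("limit", 10), ("filter", ("price", ">", 100))]
unchanged = push_filters_to_scan(with_limit)
```

In real Spark, `df.explain()` prints the physical plan, which is one way to check whether a filter was actually pushed to the source.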

**3. Which one should you use?**
* Use DataFrames for: SQL queries, standard ETL, Machine Learning (MLlib), and virtually all structured data (Parquet, Delta, JSON, CSV).
* Use RDDs for: Legacy code maintenance or complex "black box" algorithms where you need to manipulate raw Java/Python objects directly.

### Lazy evaluation

In Apache Spark, Lazy Evaluation means that Spark does not execute your code immediately as you write it. Instead, it waits until the very last moment—when you actually need a result—to run the computation.

**Transformations (The Planning)**
* Transformations are instructions that tell Spark how to change the data. When you call a transformation, Spark simply records the instruction in the DAG (Directed Acyclic Graph).
* **Examples:** `.filter()`, `.select()`, `.join()`, `.groupBy()`, and, on RDDs, `.map()`.
* **Result:** These return a new DataFrame (or an intermediate object, such as grouped data) but do no actual work on the data.

**Actions (The Execution)**
* Actions are the "triggers." When an action is called, Spark looks at the list of transformations it has collected, optimizes them, and sends the tasks to the executors to get the work done.
* **Examples:** `.show()`, `.count()`, `.collect()`, `.write.save()`, `.write.saveAsTable()`.
* **Result:** These trigger the actual computation and return a value to the driver or write data to storage.
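
The split between transformations and actions can be mimicked in a few lines of plain Python. This is a toy model, not Spark: `LazyDataset` and its methods are hypothetical names, and the "plan" is just a tuple of recorded steps.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset with one more recorded step.
    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def _run(self):
        rows = iter(self._source)
        for kind, fn in self._steps:
            # Chained iterators process each row through all steps in one pass.
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return rows

    # Actions: trigger the computation and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every time the predicate actually runs


def is_expensive(price):
    calls.append(price)
    return price > 100


prices = LazyDataset([50, 150, 300])
expensive = prices.filter(is_expensive).map(lambda price: price * 2)
calls_before_action = len(calls)  # nothing has run yet
result = expensive.collect()      # the action triggers the work
```

The predicate is not called until `collect()` runs, which mirrors how Spark defers work until an action.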

**Benefits**
* **A. Query Optimization:** Because Spark sees the entire DAG before it starts, the Catalyst Optimizer can rearrange your steps. A common case is predicate pushdown: if you have a 100 GB table and apply a filter at the very end of your script, Spark can apply that filter while reading the data, provided the source format supports it (as Parquet and Delta do). It then pulls only the data it needs into memory, rather than loading the full 100 GB and filtering it later.
* **B. Fault Tolerance:** If a worker node fails in the middle of a job, Spark can recover. Because it has the "lineage" (the DAG recipe), it knows how to re-create the lost partitions on another worker from the original source.
* **C. Fewer Passes Over the Data:** Instead of writing temporary results to disk after every line of code, Spark pipelines consecutive transformations into a single "stage," keeping data in memory as much as possible.
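
Lineage-based recovery can be illustrated with a short sketch. It is a toy model under stated assumptions: the dictionaries stand in for partitions held on workers, and `compute_partition` is a hypothetical helper, not a Spark API.

```python
def compute_partition(source_rows, lineage):
    """Rebuild one partition by replaying the recorded transformations."""
    rows = source_rows
    for transformation in lineage:
        rows = transformation(rows)
    return rows


# The original source, split into two partitions.
source = {0: [1, 2, 3], 1: [4, 5, 6]}

# The lineage: keep even numbers, then multiply by 10.
lineage = [
    lambda rows: [row for row in rows if row % 2 == 0],
    lambda rows: [row * 10 for row in rows],
]

computed = {pid: compute_partition(rows, lineage) for pid, rows in source.items()}

# Simulate losing the worker that held partition 1.
del computed[1]

# Only the lost partition is recomputed, from the source and the lineage.
computed[1] = compute_partition(source[1], lineage)
```

Partition 0 is never touched during recovery; only the missing piece is rebuilt.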


### Notebook magic commands (%sql, %python, %fs)

In Databricks, Magic Commands allow you to switch languages, interact with the file system, or run shell scripts—all within the same notebook.

#### 1. Language Magics
* `%python`: Executes Python code.
* `%sql`: Executes Spark SQL queries.
* `%scala`: Executes Scala code.
* `%r`: Executes R code.
* `%sh`: Executes shell commands (on the driver node).
* `%fs`: Executes file system commands.
* `%md`: Renders Markdown text.

#### 2. File System Magics

`%fs` lets you interact with the Databricks File System (DBFS) without writing code. It is shorthand for the `dbutils.fs` utilities.

- `%fs ls`: Lists files in a directory. Example: `%fs ls /databricks-datasets`
- `%fs cp`: Copies a file or directory. Example: `%fs cp /src/file.txt /dest/file.txt`
- `%fs mv`: Moves or renames a file. Example: `%fs mv /old/path /new/path`
- `%fs rm`: Removes a file or directory. Example: `%fs rm -r /path/to/directory`
- `%fs mkdirs`: Creates a new directory. Example: `%fs mkdirs /mnt/new_folder`
- `%fs head`: Displays the first few bytes of a file. Example: `%fs head /mnt/data/logs.txt`
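
When a listing needs further processing, the same operations are available from Python through `dbutils.fs`. The sketch below wraps `dbutils.fs.ls` in a hypothetical helper, `list_files_with_suffix`. So that it runs outside Databricks, it uses a stand-in object with made-up paths in place of the real `dbutils`; in a notebook you would pass the built-in `dbutils` instead.

```python
from types import SimpleNamespace


def list_files_with_suffix(dbutils, directory, suffix):
    """Return the paths in `directory` whose file names end with `suffix`."""
    return [info.path for info in dbutils.fs.ls(directory) if info.name.endswith(suffix)]


# Stand-in for `dbutils` so the sketch runs outside Databricks (hypothetical data).
fake_files = [
    SimpleNamespace(path="dbfs:/tmp/demo/a.csv", name="a.csv", size=10),
    SimpleNamespace(path="dbfs:/tmp/demo/notes.txt", name="notes.txt", size=5),
]
fake_dbutils = SimpleNamespace(fs=SimpleNamespace(ls=lambda directory: fake_files))

csv_paths = list_files_with_suffix(fake_dbutils, "/tmp/demo", ".csv")
```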

#### 3. Shell & OS Magics (%sh)

Databricks nodes run on Linux. If you need to perform OS-level tasks, use `%sh`, which runs on the driver node only.

- `%sh`: Runs standard Bash commands. Example: `%sh pip list` or `%sh wget https://example.com/data.csv`
- `%sh top -b -n 1`: Prints a one-off snapshot of memory and CPU usage on the driver node (plain `top` is interactive and does not suit a notebook cell).

#### 4. Workflow & Utility Magics

These commands help you manage your notebook environment and dependencies.

- `%run <path>`: Executes another notebook inline and makes its variables, functions, and widgets available in your current session. It must be the only command in its cell. Common use: running a Config or Functions notebook at the start of a project.
- `%pip`: Installs Python libraries scoped to the current notebook session. Example: `%pip install seaborn`
- `%md`: Renders Markdown for documentation (titles, lists, images, and LaTeX).
- `%load_ext`: Loads IPython extensions (for example, `%load_ext autoreload`).

#### 5. Data Visualization & Widgets
- dbutils.widgets: Not a magic command, but allows you to create interactive dropdowns and text boxes at the top of your notebook.
- display(): A function (not a magic) that renders DataFrames as rich, interactive tables and charts.
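
A typical pattern is to define a widget with a default and read it back as a parameter. The sketch below assumes the `dbutils.widgets.text(name, defaultValue, label)` and `dbutils.widgets.get(name)` calls; `read_min_price`, the widget name, and the `FakeWidgets` stand-in are hypothetical, the stand-in being there only so the code runs outside Databricks. In a notebook you would pass `dbutils.widgets` instead.

```python
class FakeWidgets:
    """Stand-in for `dbutils.widgets` so the sketch runs outside Databricks."""

    def __init__(self):
        self._values = {}

    def text(self, name, defaultValue, label=None):
        # Creating a widget keeps any value that is already set.
        self._values.setdefault(name, defaultValue)

    def get(self, name):
        return self._values[name]


def read_min_price(widgets, name="min_price", default="100"):
    """Create a text widget if needed and return its value as a float."""
    widgets.text(name, default, "Minimum price")
    # Widget values are strings, so convert before using them in a filter.
    return float(widgets.get(name))


min_price = read_min_price(FakeWidgets())
```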

### Tasks:

```python
# Define the path to the downloaded CSV
file_path = "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv"

# Read the file; `spark` is the SparkSession that Databricks notebooks provide
df = (spark.read
      .format("csv")
      .option("header", "true")        # Use the first row as column names
      .option("inferSchema", "true")   # Detect data types (e.g., price as double); costs an extra pass over the file
      .load(file_path))
```

```python
# Verify the result
df.printSchema()
display(df.limit(5))
```

```python
# Show the first 10 rows of selected columns
df.select("event_time", "product_id", "brand").show(10)
```

```python
# Count the rows with a price above 100
df.filter("price > 100").count()
```

```python
# Count events per event type
df.groupBy("event_type").count().show()
```

```python
# Group by brand, order by count descending, and keep the top 5
top_brands = df.groupBy("brand").count().orderBy("count", ascending=False).limit(5)
top_brands.show()
```

```python
# Append the result to a managed table; re-running this cell appends the same rows again
top_brands.write.mode("append").saveAsTable("workspace.ecommerce.top_brands_output")
print("Appended top brands to managed table: workspace.ecommerce.top_brands_output")
```