#### Why Databricks vs Pandas/Hadoop?

**1. Databricks vs. Pandas**

The main difference is scalability.
* **Pandas:** Processes data in memory on a single machine (your laptop, or the driver node of a cluster). If a dataset does not fit in RAM, Pandas typically fails with an out-of-memory error. It is best suited to small datasets (roughly up to 1–2 GB) and quick exploration.
* **Databricks (Spark):** Processes data in a distributed manner across many machines (nodes). It can handle terabytes or petabytes by splitting data into partitions and processing them in parallel.
* **Key mindset shift:** Spark uses lazy evaluation: transformations only build an execution plan, and nothing runs until you call an action (such as `.show()`). This lets Spark optimize the whole plan before executing it.
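
Lazy evaluation can be illustrated with a toy, plain-Python analogy. This is not Spark's API: the `LazyRows` class and its methods are hypothetical stand-ins, written so the sketch runs without a cluster. In PySpark, `filter` and `withColumn` play the role of transformations, and `show()`, `count()`, and `collect()` are actions.

```python
class LazyRows:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, not yet executed

    def filter(self, predicate):
        # Transformation: extend the plan, touch no data.
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name, fn):
        # Transformation: extend the plan, touch no data.
        add = lambda row: {**row, name: fn(row)}
        return LazyRows(self._rows, self._plan + (("map", add),))

    def collect(self):
        # Action: only now is the recorded plan applied to every row.
        result = []
        for row in self._rows:
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        break  # row rejected, skip the remaining steps
                else:
                    row = fn(row)
            else:
                result.append(row)
        return result


calls = []  # records which rows the predicate has actually seen


def is_expensive(row):
    calls.append(row["id"])
    return row["price"] > 10


rows = [{"id": 1, "price": 5.0}, {"id": 2, "price": 20.0}]
planned = (
    LazyRows(rows)
    .filter(is_expensive)
    .with_column("double_price", lambda r: r["price"] * 2)
)
calls_before_action = len(calls)  # the plan exists, but nothing has run
result = planned.collect()        # the action triggers execution
```

Because the whole plan is known before anything runs, an engine like Spark can reorder or combine steps before execution; this toy version simply applies them in order.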

**2. Databricks vs. Hadoop**

The main difference is speed and modernity.
* **Hadoop:** An older ecosystem that stores data on disk (HDFS) and uses MapReduce for batch processing. It is often slower because each MapReduce stage writes its intermediate results to disk.
* **Databricks:** Built on Apache Spark, which keeps intermediate data in memory where possible. The Spark project has cited speedups of up to 100x over Hadoop MapReduce for in-memory workloads; real gains depend on the job.
* **Architecture:** Databricks uses a "Lakehouse" architecture, combining the cheap storage of a data lake with the high-performance features (like ACID transactions) of a data warehouse. A Hadoop-based data lake has no such guarantees built in and can easily degrade into a "data swamp".

#### Lakehouse architecture basics

The Lakehouse architecture is a modern data management approach that combines the best features of two legacy systems: **data lakes and data warehouses**. It provides a single platform that offers the scale and flexibility of a data lake with the structure and ACID compliance (Atomicity, Consistency, Isolation, Durability) of a data warehouse.
Here are the four core principles of the Lakehouse architecture:

**1. Single Source of Truth (The Data Lake Foundation)**
* The foundation of the Lakehouse is a vast, cheap, and open data lake (usually in cloud storage like AWS S3, Azure Data Lake Storage, or GCS).
* Open Formats: Data is stored in open, standardized formats like Parquet or ORC.
* Raw Data Storage: It stores all data—structured, semi-structured (JSON), and unstructured (images, audio)—regardless of its initial use case, creating a single, centralized data repository.

**2. Introduction of the "Delta" Layer (ACID Transactions)**
* This is the key innovation that differentiates a Lakehouse from a raw data lake. A transactional storage layer (most commonly Delta Lake on Databricks) is added on top of the open-format data files.
* This layer provides:
  * **ACID Compliance:** Ensures reliable data writes and reads, preventing the data corruption that can happen in raw data lakes during concurrent operations.
  * **Schema Enforcement:** Rejects writes whose columns or data types do not match the table's schema, so malformed records cannot silently enter a table. This is crucial for reliable analytics.
  * **Time Travel:** Allows you to query historical versions of your data, making audits easier and simplifying the rollback of bad data changes.
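
The time-travel feature above can be sketched as a few SQL statements. The snippet below only builds the statement strings so that it runs anywhere; in a Databricks notebook each string would be passed to `spark.sql(...)`. `DESCRIBE HISTORY`, `VERSION AS OF`, and `RESTORE TABLE` are Delta Lake SQL syntax, while the helper function names and the table name `prod_catalog.ecommerce_data.orders` are hypothetical.

```python
def _check_version(version: int) -> int:
    # Delta table versions start at 0 and increase by one per commit.
    if not isinstance(version, int) or version < 0:
        raise ValueError("Delta table versions are non-negative integers")
    return version


def history_sql(table: str) -> str:
    """List the table's commits (version, timestamp, operation)."""
    return f"DESCRIBE HISTORY {table}"


def read_version_sql(table: str, version: int) -> str:
    """Query the table as it looked at an earlier version."""
    return f"SELECT * FROM {table} VERSION AS OF {_check_version(version)}"


def restore_sql(table: str, version: int) -> str:
    """Roll the table back to an earlier version after a bad write."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {_check_version(version)}"


# Hypothetical table name, used only for illustration.
table = "prod_catalog.ecommerce_data.orders"
statements = [
    history_sql(table),
    read_version_sql(table, 3),
    restore_sql(table, 3),
]
```

In a notebook, `spark.sql(read_version_sql(table, 3)).show()` would display the older snapshot, assuming the table exists and still retains version 3.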

**3. Support for Diverse Workloads (Unified Platform)**
* A Lakehouse breaks down data silos by supporting all data functions in one place. You no longer need separate systems for different teams:
  * **Data Engineering (ETL):** Processing and cleaning raw data reliably.
  * **Data Warehousing (BI/SQL):** Running traditional business intelligence (BI) and reporting queries directly on the lakehouse tables.
  * **Data Science & Machine Learning:** Accessing rich, feature-engineered data using Python/R/Scala libraries directly from the same platform.

**4. Decoupled Storage and Compute**
* In a Lakehouse, your storage (the data files in the cloud) is separate from your computing power (the Databricks cluster that runs queries).
* You can grow storage capacity cheaply and with virtually no upper limit, independently of how much compute you run.
* You can spin up large clusters for heavy lifting and shut them down when idle, saving significant costs while always having access to all your data.

#### Databricks workspace structure

The Databricks Workspace is organized around **Unity Catalog**, which has shifted the focus from local folders to a **governed, three-tier architecture.**

Here is the breakdown of how your workspace is structured:

**1. The Three-Tier Data Hierarchy (Unity Catalog)** \
This is how you organize your actual data tables and files: 
* **Catalog:** The top level (e.g., prod_catalog). It represents a business unit or environment.
* **Schema (Database):** The middle level (e.g., ecommerce_data). It groups related tables together.
* **Volume / Table:** The bottom level.
  * **Tables:** Structured data you query with SQL.
  * **Volumes:** Folders for non-tabular files (like your Kaggle CSVs or images).
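
The hierarchy above maps directly onto the names you type in code: tables are addressed as `catalog.schema.table`, and files in a volume live under `/Volumes/<catalog>/<schema>/<volume>/`. A minimal sketch, where the helper functions and the `orders` table name are hypothetical:

```python
def table_name(catalog: str, schema: str, table: str) -> str:
    """Build the three-level name used in SQL and in spark.table(...)."""
    return f"{catalog}.{schema}.{table}"


def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build the path to a file stored in a Unity Catalog volume."""
    return "/".join(["/Volumes", catalog, schema, volume, *parts])


# Hypothetical table inside the example catalog and schema named above.
orders_table = table_name("prod_catalog", "ecommerce_data", "orders")

# The CSV path used later in these notes, rebuilt from its three levels.
csv_path = volume_path("workspace", "ecommerce", "ecommerce_data", "2019-Oct.csv")
```

Here `csv_path` evaluates to `/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv`, which shows that `workspace` is the catalog, `ecommerce` the schema, and `ecommerce_data` the volume.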

**2. Workspace Assets (The Sidebar)** \
On the left-hand menu, you’ll find the functional areas: 
* **Workspace (Folders):** This is where your Notebooks, Libraries, and Git Folders live.
  * **Shared:** For team collaboration.
  * **Users:** Your private sandbox.
* **Catalog Explorer:** The UI to manage permissions, schemas, and data lineage.
* **Compute:** Where you create and manage your clusters (SQL Warehouses for BI, or All-Purpose clusters for Data Science).
* **Workflows:** The scheduling engine where you turn notebooks into automated "Jobs."

**3. Medallion Architecture (The Logical Flow)** \
While not a physical "folder," most Databricks workspaces are logically structured using the Medallion pattern to move data through stages of quality: \
**Bronze (Raw):** Your initial Kaggle download (the raw CSVs in a Volume). \
**Silver (Cleaned):** Data after you fix the schema, handle nulls, and format timestamps. \
**Gold (Aggregated):** Business-ready tables (e.g., "Daily Sales Summary") used for Power BI or other dashboards.
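
The three stages can be sketched in plain Python as an analogy that runs anywhere. On Databricks each stage would normally be a Spark DataFrame saved as a Delta table rather than a Python list. The sample rows and the column names (`event_time`, `product_id`, `price`) are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical raw export; the second row has a missing price.
RAW_CSV = """event_time,product_id,price
2019-10-01 00:00:05,1001,19.99
2019-10-01 09:30:00,1002,
2019-10-02 12:00:00,1001,5.01
"""


def load_bronze(text):
    """Bronze: keep rows exactly as delivered (every value is a string)."""
    return list(csv.DictReader(io.StringIO(text)))


def to_silver(bronze_rows):
    """Silver: fix types, parse timestamps, drop rows with a missing price."""
    silver = []
    for row in bronze_rows:
        if not row["price"]:
            continue  # one simple way to handle nulls: skip the row
        silver.append({
            "event_time": datetime.strptime(row["event_time"], "%Y-%m-%d %H:%M:%S"),
            "product_id": int(row["product_id"]),
            "price": float(row["price"]),
        })
    return silver


def to_gold(silver_rows):
    """Gold: one business-ready total per day (a "Daily Sales Summary")."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["event_time"].date().isoformat()] += row["price"]
    return {day: round(total, 2) for day, total in sorted(totals.items())}


bronze = load_bronze(RAW_CSV)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the bronze rows untouched means the silver and gold stages can always be rebuilt from the raw data if a cleaning rule changes.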

**4. Git Folders (Repos)** \
These are special folders in your workspace that are synced with a remote Git repository (for example, on GitHub). They let you treat your notebooks as code, providing the version control that standard workspace folders lack.

#### Industry use cases (Netflix, Shell, Comcast)

**1. Netflix: Personalized Real-Time Recommendations**

* Netflix uses Databricks to manage its massive telemetry data (every click, pause, and play).
* **The Use Case:** They moved from a messy data lake to a Lakehouse to enable Real-time Personalization.
* **How it Works:** By using Spark and Delta Lake, they process very large volumes of streaming events to refresh recommendations such as "Top Picks" in near real time.
* **The Result:** Reduced "time-to-insight" from hours to seconds, ensuring you stay on the app by seeing the most relevant content immediately. 

**2. Shell: Predictive Maintenance & Energy Transition**

Shell operates over 40,000 gas stations and thousands of industrial assets. \
* **The Use Case:** Predictive Maintenance on a global scale. They monitor over 2 million sensors across their infrastructure.
* **How it Works:** Databricks processes sensor data to predict when a valve or pump might fail before it happens. This prevents oil spills and expensive downtime.
* **The Result:** Saved millions in operational costs and accelerated their move toward renewable energy by optimizing wind farm performance using the same data models. 

**3. Comcast: Real-Time Voice Search & Customer Experience**

If you speak into an Xfinity remote, you are likely interacting with a Databricks-powered pipeline. \
* **The Use Case:** Natural Language Processing (NLP) and unified customer views.
* **How it Works:** Comcast uses Databricks to unify data from millions of set-top boxes. They run machine learning models that interpret voice commands in real-time and predict when a customer might need technical support before they even call.
* **The Result:** A significantly higher success rate for voice commands and a proactive customer service model that reduces "churn" (customers leaving the service). 


### Basic PySpark commands

```python
# Path to the downloaded CSV inside a Unity Catalog volume
file_path = "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv"

# Read the file; `spark` is the SparkSession that Databricks notebooks predefine
df = (
    spark.read
    .format("csv")
    .option("header", "true")        # Use the first row as column names
    .option("inferSchema", "true")   # Detect data types (e.g., price as double); costs an extra pass over the file
    .load(file_path)
)

# Verify the result
df.printSchema()
```