# Bronze Layer: Raw Data Ingestion
🎯 **Goal**: Collect and store the raw data exactly as it comes in

🔧 **Tasks a data engineer performs:**
- Write Python or Spark code to pull data from a source (API, file, database)
- Validate basic connectivity (check API responses, handle errors)
- Convert the raw response (e.g. JSON) to a DataFrame
- Store the data in a raw Delta table (or Parquet/CSV in simpler setups)
- Include basic metadata (e.g. ingestion timestamp)
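
The metadata step from the list above can be sketched in plain Python. This is a minimal illustration, not part of the pipeline below: the helper name `with_ingestion_metadata` and the field names `ingested_at` and `source` are assumptions chosen for the example.

```python
from datetime import datetime, timezone


def with_ingestion_metadata(record, source):
    """Return a copy of `record` with hypothetical metadata fields added."""
    enriched = dict(record)  # copy so the raw record itself stays untouched
    # UTC timestamp in ISO 8601 form, so rows from different runs sort cleanly
    enriched["ingested_at"] = datetime.now(timezone.utc).isoformat()
    enriched["source"] = source
    return enriched


row = with_ingestion_metadata(
    {"id": 1, "name": "bulbasaur"},
    source="https://pokeapi.co/api/v2/pokemon/1",
)
```

Recording the timestamp in UTC avoids ambiguity when the same pipeline later runs on machines in different time zones.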

### 📥 This notebook: Bronze Layer – Raw Pokémon Data Ingestion

In this notebook, we:

- Collect data from the [PokéAPI](https://pokeapi.co/) for the first 386 Pokémon (generations 1 to 3)
- Keep two fields as plain columns and store the complete JSON response alongside them:
  - `id`, `name`, and `raw_json` (which still contains `height`, `weight`, `base_experience`, `types`, and everything else)
- Convert the API responses into a Spark DataFrame
- Save the raw data to a Delta table in the **bronze layer** at `../data/bronze/pokemon`

🟫 No transformations are applied — we store the data as-is to preserve its original structure.

---


## 🔧 What is Spark? What is Delta Lake?

### 💥 Spark
Apache Spark is a powerful engine that helps process large amounts of data quickly — like a **supercharged calculator** for data.

- It can read, transform, and analyze data stored in many formats (CSV, JSON, Parquet, Delta, etc.)
- Spark keeps intermediate results **in memory** and spreads the work across many cores or machines, which makes it much faster than single-machine tools (like Excel or plain Python) once the data gets big
- You write code in Python (using **PySpark**) and Spark takes care of the heavy lifting behind the scenes

Even though we’re working with small data here (a few hundred Pokémon), Spark is used by **big companies** for tasks involving **millions or billions of rows**.

---

### 🧊 Delta Lake
Delta Lake is a storage layer built on top of **Parquet**: the data itself lives in Parquet files, and a transaction log stored next to them adds superpowers:

- ✅ Allows **versioning** — like a time machine for your data
- ✅ Guarantees **data reliability** — no broken writes or half-finished tables
- ✅ Supports updates, deletes, merges — just like a database
- ✅ Works great with Spark

In this project, we use **Delta tables** to store each stage of our data pipeline: `bronze`, `silver`, and `gold`. They're just folders on your computer that Spark reads/writes like structured tables.

> 💡 Think of Delta Lake as a smart format for storing big data — reliable, efficient, and made for analytics or machine learning.
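
To make the "just folders" idea concrete, here is a small sketch that inspects a Delta table folder with the standard library only. It relies on the standard Delta layout (Parquet data files plus a `_delta_log` subfolder of numbered JSON commit files); the helper name `summarize_delta_folder` is made up for this example.

```python
from pathlib import Path


def summarize_delta_folder(table_path):
    """Count the data files and list the commit files of a Delta table folder."""
    root = Path(table_path)
    log_dir = root / "_delta_log"
    # Data lives in Parquet files; rglob also covers partition subfolders.
    # Anything inside _delta_log (such as checkpoints) is not table data.
    data_files = [p for p in root.rglob("*.parquet") if log_dir not in p.parents]
    # Each committed write adds one numbered JSON file to the transaction log
    commits = sorted(p.name for p in log_dir.glob("*.json")) if log_dir.is_dir() else []
    return {
        "is_delta": log_dir.is_dir(),
        "data_files": len(data_files),
        "commits": commits,
    }
```

Once a table has been written, calling `summarize_delta_folder` on its path shows one commit file per write, which is the history that versioning builds on.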

---
# Step 1: Initialize Spark with Delta support
In this step, we set up a Spark session and enable Delta Lake functionality.

- `SparkSession` is the main entry point to PySpark.
- We configure Spark to understand and work with the Delta format by adding three `.config(...)` lines: one pulls in the Delta package, and the other two register Delta's SQL extension and catalog.
- Delta Lake allows us to write data to disk in a way that's **reliable, versioned, and scalable**.

This session powers everything we do in the pipeline from this point on.

In [None]:
from pyspark.sql import SparkSession

# Define the Spark session with Delta Lake support
spark = SparkSession.builder \
    .appName("DemoPipe - Bronze Layer") \
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.3.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Confirm Spark is running
spark.version

# Step 2: Set the path to your bronze Delta table
We define a local path to store our **raw Pokémon data**.

- The output path will be used by Spark to save data in **Delta format**
- Spark will automatically create the folder and files if they don’t exist
- This keeps our code flexible and easy to migrate (e.g., to cloud paths later)
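
One possible way to make the path easy to switch between environments is to read it from an environment variable. This is only a sketch: the variable name `BRONZE_PATH` and the helper `resolve_bronze_path` are assumptions for illustration and are not used elsewhere in this notebook.

```python
import os


def resolve_bronze_path(env=None):
    """Pick the bronze table path from a hypothetical BRONZE_PATH variable."""
    env = os.environ if env is None else env
    # Fall back to the local path when the variable is unset or empty
    return env.get("BRONZE_PATH") or "../data/bronze/pokemon"
```

With this in place, the same notebook could run locally without any setup, while a cloud job sets `BRONZE_PATH` to something like `dbfs:/tmp/bronze/pokemon`.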

In [None]:
# Define path to bronze table
bronze_path = "../data/bronze/pokemon"

# For Databricks (uncomment when running in DBFS)
# bronze_path = "dbfs:/tmp/bronze/pokemon"

# For Microsoft Fabric (Lakehouse managed table)
# bronze_path = "Tables/bronze_pokemon"

# Step 3: Scrape Pokémon data from the PokéAPI

We fetch raw data for the first three generations of Pokémon using the public [PokéAPI](https://pokeapi.co).

- We use `requests` to make HTTP calls and collect JSON data
- We could extract just a few key fields from each Pokémon, but to simulate how messy real data can be, we keep the raw JSON response in its entirety.
- We store this data in a list of dictionaries to convert it easily into a DataFrame

This step represents the **real-world “ingest” phase** in a data pipeline.
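
Real-world ingestion usually has to cope with temporary network failures. The sketch below shows one common pattern, retrying with exponential backoff. The helper `fetch_with_retry` and its parameters are hypothetical; it wraps any zero-argument function, so it does not depend on a particular HTTP library.

```python
import time


def fetch_with_retry(fetch, retries=3, backoff_seconds=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `retries` attempts are used up."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as error:  # in real code, catch narrower exceptions
            last_error = error
            if attempt < retries - 1:
                # Wait 1x, 2x, 4x ... the base delay before the next attempt
                sleep(backoff_seconds * 2 ** attempt)
    raise RuntimeError(f"gave up after {retries} attempts") from last_error
```

With `requests`, the wrapped function could call `requests.get(url, timeout=10)`, then `response.raise_for_status()`, and return `response.json()`, so that HTTP error codes are retried the same way as connection errors.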

In [None]:
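import json

import pandas as pd
import requests

# Fetch all available fields for the first three generations of Pokémon (IDs 1 to 386)
pokemon_data = []
failed_ids = []  # IDs that could not be fetched, so gaps are not silent

for i in range(1, 387):
    try:
        # A timeout keeps one stalled request from hanging the whole loop
        response = requests.get(f"https://pokeapi.co/api/v2/pokemon/{i}", timeout=10)
    except requests.RequestException:
        failed_ids.append(i)
        continue
    if response.status_code == 200:
        raw_json = response.json()
        pokemon_data.append({
            "id": raw_json["id"],
            "name": raw_json["name"],
            "raw_json": json.dumps(raw_json)  # 💡 Store full response here
        })
    else:
        failed_ids.append(i)

# Report anything that was skipped instead of dropping it silently
if failed_ids:
    print(f"Failed to fetch {len(failed_ids)} Pokémon: {failed_ids}")

# Convert to a Pandas DataFrame
pdf = pd.DataFrame(pokemon_data)

# Preview the structure
pdf.head()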

# Step 4: Convert to Spark DataFrame
After scraping, we:

- Convert the list of dictionaries into a Pandas DataFrame
- Then convert the Pandas DataFrame into a Spark DataFrame

This allows us to use all the power of Spark to process and save the data.

In [None]:
# Convert to Spark DataFrame
df = spark.createDataFrame(pdf)

# Show the schema and first few rows
df.printSchema()
df.show(5, truncate=False)

# Step 5: Save to Delta table in the bronze layer
Now we save the raw data to a **Delta table** on local disk.

- This data is stored in `../data/bronze/pokemon` using the Delta format
- Delta adds features like versioning, schema enforcement, and safe overwrites

At this stage, we do **not** clean or modify the data — we store it exactly as it was received.
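
Storing the whole response as a JSON string means nothing is lost: later layers can parse it and extract whatever fields they need. The sketch below illustrates that round trip in plain Python. The `sample_response` dictionary is a hand-written, heavily abbreviated stand-in for a real API response, not actual pipeline output.

```python
import json

# Hand-written, heavily abbreviated stand-in for one API response
sample_response = {
    "id": 1,
    "name": "bulbasaur",
    "types": [
        {"slot": 1, "type": {"name": "grass"}},
        {"slot": 2, "type": {"name": "poison"}},
    ],
}

# What a bronze row holds: two plain columns plus the full payload as text
bronze_row = {
    "id": sample_response["id"],
    "name": sample_response["name"],
    "raw_json": json.dumps(sample_response),
}

# A later layer can parse the string and pull out whatever it needs
parsed = json.loads(bronze_row["raw_json"])
type_names = [entry["type"]["name"] for entry in parsed["types"]]
```

Because `parsed` is equal to the original dictionary, nested fields such as `types` stay available even though the bronze table only has three columns.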

In [None]:
# Write to local Delta table (bronze)
df.write.format("delta").mode("overwrite").save(bronze_path)

# Optional: Confirm the Delta table saved correctly
We read the data back from disk to confirm that:

- The Delta format was written properly
- The data matches our expectations


In [None]:
# Read back from the bronze path to verify
bronze_df = spark.read.format("delta").load(bronze_path)
bronze_df.show(5)