# Ingestion of NASA FIRMS Fire Data

**Data Retrieval and Storage in the Bronze Layer**
The Bronze Layer is where raw data is ingested and stored. For FIRMS data, this involves:
- Pulling data from the FIRMS API for the last 2 weeks.
- Appending new data to the Delta Table while ensuring no duplicates are created.

**Key Steps for Bronze Layer Implementation**
1. Define the Schema for the Delta Table:
- The schema should match the FIRMS data fields (e.g., latitude, longitude, acq_date, acq_time, etc.).
- Add a unique identifier column to ensure idempotency (e.g., a combination of latitude, longitude, acq_date, and acq_time).
2. Retrieve Data from FIRMS API:
- Pull data for the last 2 weeks during the initial load.
- For subsequent loads, pull only the latest data (e.g., for the last 1 day or 12 hours).
3. Ensure Idempotency:
- Use a primary key or unique identifier to prevent duplicate rows when appending new data.
- A good candidate for the unique identifier is a combination of latitude, longitude, acq_date, and acq_time.
4. Save Data to Delta Table:
- Append new data to the Delta Table while ensuring no duplicates are created.
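
The unique identifier described in the steps above can be sketched as a deterministic hash of the four key fields. This is a minimal illustration, not the notebook's own implementation: `make_record_id` and `KEY_COLUMNS` are hypothetical names, and the sample row is made up.

```python
import hashlib

# Columns that together identify one FIRMS detection, per the steps above.
KEY_COLUMNS = ("latitude", "longitude", "acq_date", "acq_time")


def make_record_id(row: dict) -> str:
    """Return a deterministic ID built from the key columns of one row (hypothetical helper)."""
    # A separator keeps adjacent fields from running together before hashing.
    raw = "|".join(repr(row[col]) for col in KEY_COLUMNS)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Made-up sample rows for illustration only.
row = {"latitude": 34.05, "longitude": -118.25, "acq_date": "2025-02-01", "acq_time": 930, "frp": 1.2}
same_key = dict(row, frp=9.9)       # same detection, different measurement value
other = dict(row, acq_time=931)     # different acquisition time

print(make_record_id(row) == make_record_id(same_key))  # same key columns, same ID
print(make_record_id(row) == make_record_id(other))     # different key, different ID
```

Because the ID depends only on the key columns, re-ingesting the same detection always yields the same ID. In Spark, a similar column could be built with `sha2(concat_ws("|", ...), 256)` from `pyspark.sql.functions`.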

Write the backfill data to the Delta table (named `firms_data` in the code below). Later merges into this table keep the ingestion idempotent.

## Backfill data from 2025-02-01

In [0]:
%pip install python-dotenv

In [0]:
from pyspark.sql import SparkSession
import pandas as pd

# Initialize Spark session
spark = SparkSession.builder.appName("FIRMS Delta Table").getOrCreate()

import os
from dotenv import load_dotenv

# Load variables from a local .env file (if present) into the environment
load_dotenv()
api_key = os.getenv('FIRMS_API_KEY')

# API details
dataset = "VIIRS_NOAA20_NRT"  # VIIRS NOAA-20 Near Real-Time data
country_code = "USA"          # Country code for the United States
days = 10                     # Number of days of data
start_date = '2025-02-01'     # Start date for the data

# Construct the API URL for the backfill
api_url = f'https://firms.modaps.eosdis.nasa.gov/api/country/csv/{api_key}/{dataset}/{country_code}/{days}/{start_date}'

# Fetch the data and load it into a Pandas DataFrame
df_us = pd.read_csv(api_url)

# Convert the Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df_us)

# Set the table name to be used in the metastore
table_name = "firms_data"

# Create the managed Delta table by overwriting any existing instance.
spark_df.write.format("delta") \
    .partitionBy("acq_date") \
    .mode("overwrite") \
    .saveAsTable(table_name)

# Optimize the table with ZORDER (if supported in your environment)
spark.sql(f"OPTIMIZE {table_name} ZORDER BY (longitude)")

print("Static backfill completed: Table created and optimized.")
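
The backfill above requests 10 days starting at `start_date`. Covering a longer range means issuing several requests, one per window. Below is a minimal sketch of how those windows could be computed. It assumes a per-request cap of 10 days (taken from `days = 10` above, not verified here) and that a request covers `start_date` through `start_date + days - 1`. `backfill_windows` and `build_url` are hypothetical helpers.

```python
from datetime import date, timedelta

MAX_DAYS = 10  # assumed per-request limit, matching `days = 10` above


def backfill_windows(start: date, end: date, max_days: int = MAX_DAYS):
    """Split the inclusive range [start, end] into (start_date, days) request windows."""
    windows = []
    current = start
    while current <= end:
        remaining = (end - current).days + 1
        days = min(max_days, remaining)
        windows.append((current.isoformat(), days))
        current += timedelta(days=days)
    return windows


def build_url(api_key: str, dataset: str, country_code: str, days: int, start_date: str) -> str:
    """Build a country CSV URL in the same shape as the backfill cell above."""
    return (
        "https://firms.modaps.eosdis.nasa.gov/api/country/csv/"
        f"{api_key}/{dataset}/{country_code}/{days}/{start_date}"
    )


windows = backfill_windows(date(2025, 2, 1), date(2025, 2, 25))
for start_date, days in windows:
    print(start_date, days)
```

Each `(start_date, days)` pair can be passed to `build_url` and loaded with `pd.read_csv`, then merged into the table as in the later cells. Non-overlapping windows are used here for simplicity. Overlapping them by a day is also safe once the merge is in place.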

In [0]:
%sql
SELECT MAX(acq_date)
FROM firms_data

The next cell adds the merge logic. It was run twice, for the 10th and the 20th, which produced two overlapping dates used to validate that the merge does not create duplicates.

In [0]:
from delta.tables import DeltaTable
import pandas as pd
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("FIRMS Delta Table").getOrCreate()

import os
from dotenv import load_dotenv

# Load variables from a local .env file (if present) into the environment
load_dotenv()
api_key = os.getenv('FIRMS_API_KEY')

# API details
dataset = "VIIRS_NOAA20_NRT"  # VIIRS NOAA-20 Near Real-Time data
country_code = "USA"          # Country code for the United States
days = 10                     # Number of days of data
start_date = '2025-02-20'     # Start date for the data

# Construct the API URL for the backfill
api_url = f'https://firms.modaps.eosdis.nasa.gov/api/country/csv/{api_key}/{dataset}/{country_code}/{days}/{start_date}'

# Fetch the data and load it into a Pandas DataFrame
df_us = pd.read_csv(api_url)

# Convert the Pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(df_us)

# Set the table name to be used in the metastore
table_name = "firms_data"

# Since the table already exists from the initial static load, perform a merge.
# The matching condition ensures that if a row with the same:
#   latitude, longitude, acq_date, and acq_time already exists, it will be updated.
# This takes care of the overlapping day data and prevents duplicates.
if spark.catalog.tableExists(table_name):
    delta_table = DeltaTable.forName(spark, table_name)
    
    delta_table.alias("target").merge(
        spark_df.alias("source"),
        """
        target.latitude = source.latitude AND
        target.longitude = source.longitude AND
        target.acq_date = source.acq_date AND
        target.acq_time = source.acq_time
        """
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

    # Optimize the table if desired
    spark.sql(f"OPTIMIZE {table_name} ZORDER BY (longitude)")
else:
    # This branch would only run if the table doesn't exist.
    # It creates a managed Delta table (with partitioning by acq_date) and optimizes it.
    spark_df.write.format("delta") \
        .partitionBy("acq_date") \
        .mode("overwrite") \
        .saveAsTable(table_name)
    spark.sql(f"OPTIMIZE {table_name} ZORDER BY (longitude)")

print("Data merged and Delta table optimized.")
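
To see why the merge above makes reruns safe, it can help to mimic it on an in-memory table keyed by the same four columns. This is only a plain-Python sketch of the upsert semantics, not Delta Lake itself. `upsert` is a hypothetical helper and the rows are made up.

```python
KEY_COLUMNS = ("latitude", "longitude", "acq_date", "acq_time")


def upsert(table: dict, batch: list) -> dict:
    """Mimic whenMatchedUpdateAll / whenNotMatchedInsertAll on an in-memory table."""
    for row in batch:
        key = tuple(row[col] for col in KEY_COLUMNS)
        table[key] = dict(row)  # update if the key exists, insert otherwise
    return table


# Made-up rows: the first batch stands in for the backfill, the second for an overlapping load.
backfill = [
    {"latitude": 34.05, "longitude": -118.25, "acq_date": "2025-02-10", "acq_time": 930, "frp": 1.2},
]
overlap = [
    {"latitude": 34.05, "longitude": -118.25, "acq_date": "2025-02-10", "acq_time": 930, "frp": 1.5},
    {"latitude": 36.10, "longitude": -115.17, "acq_date": "2025-02-11", "acq_time": 1015, "frp": 0.8},
]

table = upsert({}, backfill)
upsert(table, overlap)
snapshot = {key: dict(value) for key, value in table.items()}
upsert(table, overlap)  # re-running the same batch leaves the table unchanged

print(len(table), table == snapshot)
```

One difference to keep in mind: in this sketch a later row with the same key silently overwrites an earlier one, whereas a Delta merge is expected to fail when several source rows match the same target row. Deduplicating the source on the key columns before merging avoids that case.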

In [0]:
%sql
SELECT MAX(acq_date)
FROM firms_data

The next cell is the code that goes into the Bronze layer to collect the most recent data. It requests overlapping days so that late updates, arriving up to 3 days after acquisition, are picked up.

In [0]:
from delta.tables import DeltaTable
import pandas as pd

import os
from dotenv import load_dotenv

# Load variables from a local .env file (if present) into the environment
load_dotenv()
api_key = os.getenv('FIRMS_API_KEY')

# API details
dataset = "VIIRS_NOAA20_NRT"  # VIIRS NOAA-20 Near Real-Time data
country_code = "USA"          # Country code for the United States
days = 4                      # Number of days of data (includes overlap for late updates)

# Construct the API URL
api_url = f'https://firms.modaps.eosdis.nasa.gov/api/country/csv/{api_key}/{dataset}/{country_code}/{days}'

# Fetch the new data from the FIRMS API
new_data_df = pd.read_csv(api_url)

# Convert the Pandas DataFrame to a Spark DataFrame
new_spark_df = spark.createDataFrame(new_data_df)

# Set the table name to be used in the metastore
table_name = "firms_data"

# Check if the table already exists using the Spark catalog
if spark.catalog.tableExists(table_name):
    # Load the existing Delta table using its table name
    delta_table = DeltaTable.forName(spark, table_name)

    # Merge new data with the existing table
    delta_table.alias("target").merge(
        new_spark_df.alias("source"),
        """
        target.latitude = source.latitude AND
        target.longitude = source.longitude AND
        target.acq_date = source.acq_date AND
        target.acq_time = source.acq_time
        """
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

    # Optimize the table (if your environment supports it)
    spark.sql(f"OPTIMIZE {table_name} ZORDER BY (longitude)")
else:
    # If the table doesn't exist, create it as a managed Delta table with partitioning
    new_spark_df.write.format("delta") \
        .partitionBy("acq_date") \
        .mode("overwrite") \
        .saveAsTable(table_name)

    # Optimize after initial write
    spark.sql(f"OPTIMIZE {table_name} ZORDER BY (longitude)")

print("Data merged and Delta table optimized.")
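
`pd.read_csv(api_url)` parses whatever text the request returns, so a response that is not the expected CSV (for example, an error message) would flow straight into the merge. A minimal sketch of a guard that checks for the merge-key columns first is shown below. `check_firms_csv` is a hypothetical helper, and the error body used here is illustrative rather than a confirmed API message.

```python
import csv
import io

# Columns the merge condition depends on.
REQUIRED_COLUMNS = {"latitude", "longitude", "acq_date", "acq_time"}


def check_firms_csv(text: str) -> list:
    """Parse CSV text and raise if any merge-key column is missing from the header."""
    reader = csv.DictReader(io.StringIO(text))
    header = set(reader.fieldnames or [])
    missing = REQUIRED_COLUMNS - header
    if missing:
        raise ValueError(f"Unexpected response, missing columns: {sorted(missing)}")
    return list(reader)


good = "latitude,longitude,acq_date,acq_time\n34.05,-118.25,2025-02-10,930\n"
bad = "Invalid MAP_KEY.\n"  # illustrative error body, not a confirmed API message

rows = check_firms_csv(good)
print(len(rows))

try:
    check_firms_csv(bad)
except ValueError as exc:
    print(exc)
```

The same check can be applied to a pandas DataFrame by comparing `REQUIRED_COLUMNS` against `set(new_data_df.columns)` before calling `spark.createDataFrame`.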

In [0]:
%sql
SELECT max(acq_date)
FROM firms_data

Check for duplicates on the merge key (latitude, longitude, acq_date, acq_time). The query should return no rows.

In [0]:
%sql
SELECT 
  latitude, 
  longitude, 
  acq_date, 
  acq_time, 
  COUNT(*) AS duplicate_count
FROM firms_data
GROUP BY 
  latitude, 
  longitude, 
  acq_date, 
  acq_time
HAVING COUNT(*) > 1;

Verify that every date between the minimum and maximum acq_date is present. The query should return no rows.

In [0]:
%sql
WITH min_max AS (
  SELECT
    MIN(CAST(acq_date AS DATE)) AS min_date,
    MAX(CAST(acq_date AS DATE)) AS max_date
  FROM firms_data
),
all_dates AS (
  SELECT EXPLODE(sequence(min_date, max_date, interval 1 day)) AS dt
  FROM min_max
),
existing_dates AS (
  SELECT DISTINCT CAST(acq_date AS DATE) AS dt
  FROM firms_data
)
SELECT a.dt AS missing_date
FROM all_dates a
LEFT JOIN existing_dates e
  ON a.dt = e.dt
WHERE e.dt IS NULL;
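
The same gap check can be expressed in plain Python, which may be convenient in a unit test or a pipeline assertion. This is a sketch under that assumption. `missing_dates` is a hypothetical helper and the sample dates are made up.

```python
from datetime import date, timedelta


def missing_dates(present):
    """Return every date between min(present) and max(present) that is absent."""
    present = set(present)
    if not present:
        return []
    current, last = min(present), max(present)
    gaps = []
    while current <= last:
        if current not in present:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps


# Made-up example: 2025-02-03 is missing from the loaded dates.
loaded = [date(2025, 2, 1), date(2025, 2, 2), date(2025, 2, 4)]
gaps = missing_dates(loaded)
print(gaps)
```

An empty result corresponds to the SQL query above returning no rows.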

The result is an idempotent ingestion of FIRMS data, with all dates backfilled to 2025-02-01.