# 📘 Incremental JSON File Loader with Metadata Tracking in Microsoft Fabric

## 📥 1. Silver Layer Data Ingestion, Incremental Loading and Cleaning

This section shows how to read JSON files exposed through shortcuts and load them incrementally, so that each run processes only the files it has not seen before.

### ✅ Step 1: Import Required Libraries and Initialise Spark

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, col, monotonically_increasing_id, lit, udf
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import requests
import json

# Initialize SparkSession
spark = SparkSession.builder.appName("IncrementalJSONLoader").getOrCreate()

### ✅ Step 2: Define File Paths and Metadata Store

In [4]:
# Folder where JSON files are stored (OneLake path)
data_folder = "Files/daily_transactions"

# Location to track processed files (as Delta table)
metadata_table_path = "Files/metadata/processed_files"


### ✅ Step 3: Load Metadata of Already Processed Files via Spark

In [5]:
# Try loading existing metadata. Any read failure (for example, the table
# does not exist on the first run) is treated as "nothing processed yet".
try:
    processed_df = spark.read.format("delta").load(metadata_table_path)
    processed_files = {row["file_name"] for row in processed_df.select("file_name").collect()}
except Exception:
    print("No metadata found yet. Starting fresh.")
    processed_files = set()


No metadata found yet. Starting fresh.


### ✅ Step 4: List All JSON Files and Detect New Ones

In [6]:
# Use Spark to list files (works in Fabric)
file_list_df = spark.read.format("binaryFile").load(f"{data_folder}/*.json")

# Extract file names from full paths
file_list_df = file_list_df.withColumn("file_name", regexp_extract(input_file_name(), r"([^/]+\.json)$", 1))

# Get distinct filenames
all_files = [row["file_name"] for row in file_list_df.select("file_name").distinct().collect()]

# Identify new/unprocessed files
new_files = sorted([f for f in all_files if f not in processed_files])

print("New files to load:", new_files)


New files to load: ['2025-06-09.json', '2025-06-10.json', '2025-06-11.json', '2025-06-12.json', '2025-06-13.json']
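
The detection logic in Step 4 can be checked without a Spark session. The sketch below mirrors it in plain Python: it applies the same regular expression to full paths and keeps only the names missing from the processed set. The helper names `extract_file_name` and `find_new_files`, and the example `abfss://` paths, are hypothetical and not part of the notebook.

```python
import re

# Same pattern as the regexp_extract call in Step 4:
# the last path segment that ends in ".json"
FILE_NAME_PATTERN = re.compile(r"([^/]+\.json)$")


def extract_file_name(path: str) -> str:
    """Return the trailing '<name>.json' segment of a path, or '' if there is none."""
    match = FILE_NAME_PATTERN.search(path)
    return match.group(1) if match else ""


def find_new_files(paths, processed_files):
    """Return the sorted, de-duplicated file names that are not in processed_files."""
    names = {extract_file_name(p) for p in paths}
    names.discard("")  # drop paths that did not match the pattern
    return sorted(names - set(processed_files))


# Hypothetical paths, used only for illustration
example_paths = [
    "abfss://workspace@onelake/Files/daily_transactions/2025-06-10.json",
    "abfss://workspace@onelake/Files/daily_transactions/2025-06-09.json",
    "abfss://workspace@onelake/Files/daily_transactions/readme.txt",
]
new_files_example = find_new_files(example_paths, {"2025-06-09.json"})
print(new_files_example)
```

Because the file names are ISO dates, the lexicographic order produced by `sorted` is also chronological order.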


### ✅ Step 5: Load and Append New Files to DataFrame

In [7]:
df_combined = None

for filename in new_files:
    filepath = f"{data_folder}/{filename}"
    df = spark.read.option("multiline", "true").json(filepath)

    if df_combined is None:
        df_combined = df
    else:
        df_combined = df_combined.unionByName(df)

    # Mark this file as processed
    processed_files.add(filename)

# Report whether any new files were loaded
if df_combined is not None:
    print("All new files loaded")
else:
    print("No new files to load.")


All new files loaded
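
`unionByName` as called in Step 5 raises an error when a file's columns differ from the columns accumulated so far. If the daily files can drift in schema, one option is to pass `allowMissingColumns=True` (available in Spark 3.1 and later), which fills absent columns with nulls. The sketch below folds a list of DataFrames that way. `combine_frames` is a hypothetical helper, and whether null-filling is acceptable for this dataset is an assumption.

```python
from functools import reduce


def combine_frames(frames):
    """Union DataFrames by column name, or return None for an empty list.

    Each element is expected to be a pyspark.sql.DataFrame. Columns missing
    from one side are filled with nulls (requires Spark 3.1+).
    """
    if not frames:
        return None
    return reduce(
        lambda left, right: left.unionByName(right, allowMissingColumns=True),
        frames,
    )


# Usage inside the notebook (assumes `spark`, `data_folder` and `new_files`
# from the steps above):
# frames = [
#     spark.read.option("multiline", "true").json(f"{data_folder}/{name}")
#     for name in new_files
# ]
# df_combined = combine_frames(frames)
```

Returning `None` for an empty list keeps the later `df_combined is not None` style of check meaningful when there is nothing new to load.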


### ✅ Step 6: Update Metadata Table with Processed File Names

In [8]:
# Convert processed file names to DataFrame
processed_df_new = spark.createDataFrame([Row(file_name=f) for f in sorted(processed_files)])

# Overwrite metadata table
processed_df_new.write.mode("overwrite").format("delta").save(metadata_table_path)


### ✅ Step 7: Write Combined Data to Delta Table

In [9]:
# Append the newly loaded rows to the output Delta table
if df_combined is not None:
    output_path = "Files/processed_output/delta"
    df_combined.write.mode("append").format("delta").save(output_path)


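
As written, Step 6 records the file names before Step 7 writes the data, so a failure in Step 7 would leave files marked as processed that were never loaded. One way to avoid this is to write the data first and record the metadata only after that write succeeds. The sketch below shows that ordering with two callables standing in for the Delta writes. `commit_batch`, `write_data` and `write_metadata` are hypothetical names, and this illustrates the ordering rather than the notebook's own approach.

```python
def commit_batch(write_data, write_metadata):
    """Run write_data, then write_metadata only if the data write did not raise.

    Both arguments are zero-argument callables. Returns True when both writes
    completed. An exception from either write propagates to the caller.
    """
    write_data()      # if this raises, the metadata is left untouched
    write_metadata()  # reached only after the data write succeeded
    return True


# Usage inside the notebook (assumes the variables defined in the steps above):
# commit_batch(
#     lambda: df_combined.write.mode("append").format("delta").save(output_path),
#     lambda: processed_df_new.write.mode("overwrite").format("delta").save(metadata_table_path),
# )
```

The two writes are still not atomic: if the metadata write fails after the data write, a rerun would append the same files again.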

### ✅ Step 8: (Optional) Preview Metadata

In [10]:
# Uncomment to preview the metadata table
# display(spark.read.format("delta").load(metadata_table_path))
