# 01 - Bronze Layer: Raw Data Ingestion

**Purpose:** Ingest raw data from multiple sources into Bronze Delta tables

**Data Sources:**
1. Credit Risk Assessment (32,576 records)
2. Loan Default Dataset (148,670 records)
3. Loan Prediction Dataset (12,367 records)
4. GST Collections - Statewise (FY 2020-21 to FY 2025-26: 6 years × 38 states)
5. GST Collections - Gross vs Net Tax

**Output:** Bronze Delta tables in the `msme_risk_analytics` schema, each tagged with a `load_timestamp` column and either `source_file` or `fiscal_year`

## Setup: Create Schema

In [0]:
# Create database/schema
spark.sql("CREATE SCHEMA IF NOT EXISTS msme_risk_analytics")
spark.sql("USE msme_risk_analytics")

print("✓ Schema ready: msme_risk_analytics")

## Imports

In [0]:
from pyspark.sql.functions import current_timestamp, lit

## 1. Credit Risk Dataset

In [0]:
# Read Credit Risk CSV
credit_risk = spark.read.csv(
    '/FileStore/tables/credit_risk.csv',
    header=True,
    inferSchema=True
)

# Add metadata
credit_risk_bronze = credit_risk \
    .withColumn('source_file', lit('credit_risk.csv')) \
    .withColumn('load_timestamp', current_timestamp())

# Write to Bronze Delta table
credit_risk_bronze.write \
    .format('delta') \
    .mode('overwrite') \
    .saveAsTable('bronze_credit_risk')

print(f"✓ bronze_credit_risk created: {credit_risk_bronze.count()} records")
credit_risk_bronze.printSchema()
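
The read, tag, and write steps in this cell recur for every CSV source in the notebook, so they could be folded into one function. The sketch below is one possible shape, not part of the original notebook: `source_file_name` and `ingest_csv_to_bronze` are hypothetical names, and the function assumes an active `spark` session is passed in.

```python
import posixpath


def source_file_name(path):
    """Return the file-name part of a DBFS-style path."""
    return posixpath.basename(path)


def ingest_csv_to_bronze(spark, path, table_name):
    """Read a CSV, add the two metadata columns, and overwrite a Delta table.

    Hypothetical helper that mirrors the read / withColumn / saveAsTable
    steps of the cell above.
    """
    # Imported here so the module loads even where pyspark is not installed.
    from pyspark.sql.functions import current_timestamp, lit

    df = (
        spark.read.csv(path, header=True, inferSchema=True)
        .withColumn("source_file", lit(source_file_name(path)))
        .withColumn("load_timestamp", current_timestamp())
    )
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    # Count the stored table instead of re-scanning the source CSV.
    return spark.table(table_name).count()
```

On a cluster, `ingest_csv_to_bronze(spark, '/FileStore/tables/credit_risk.csv', 'bronze_credit_risk')` would then stand in for the cell above and return the stored row count.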

## 2. Loan Default Dataset

In [0]:
# Read Loan Default CSV
loan_default = spark.read.csv(
    '/FileStore/tables/Loan_Default.csv',
    header=True,
    inferSchema=True
)

# Add metadata
loan_default_bronze = loan_default \
    .withColumn('source_file', lit('Loan_Default.csv')) \
    .withColumn('load_timestamp', current_timestamp())

# Write to Bronze Delta table
loan_default_bronze.write \
    .format('delta') \
    .mode('overwrite') \
    .saveAsTable('bronze_loan_default')

print(f"✓ bronze_loan_default created: {loan_default_bronze.count()} records")
loan_default_bronze.printSchema()

## 3. Loan Prediction Dataset

In [0]:
# Read Loan Prediction CSV
loan_prediction = spark.read.csv(
    '/FileStore/tables/loan_data.csv',
    header=True,
    inferSchema=True
)

# Add metadata
loan_prediction_bronze = loan_prediction \
    .withColumn('source_file', lit('loan_data.csv')) \
    .withColumn('load_timestamp', current_timestamp())

# Write to Bronze Delta table
loan_prediction_bronze.write \
    .format('delta') \
    .mode('overwrite') \
    .saveAsTable('bronze_loan_prediction')

print(f"✓ bronze_loan_prediction created: {loan_prediction_bronze.count()} records")
loan_prediction_bronze.printSchema()

## 4. GST Statewise Collections (FY 2020-21)

In [0]:
# Read GST 2020-21
gst_2020 = spark.read.csv(
    '/FileStore/tables/Statewise_2020_21.csv',
    header=True,
    inferSchema=True
).withColumn('fiscal_year', lit('2020-21')) \
 .withColumn('load_timestamp', current_timestamp())

gst_2020.write.format('delta').mode('overwrite').saveAsTable('bronze_gst_statewise_2020_21')
print(f"✓ bronze_gst_statewise_2020_21: {gst_2020.count()} records")
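
Delta rejects column names that contain spaces or any of `,;{}()=`, tabs, or newlines unless column mapping is enabled on the table, so a CSV header with such characters can make `saveAsTable` fail. Whether the GST files have headers like that is not shown here. The sketch below is a guard under that assumption: `sanitize_column_name` and `sanitize_columns` are hypothetical helpers, and the header strings in the usage note are invented for illustration.

```python
import re

# Characters Delta refuses in column names when column mapping is off.
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]+")


def sanitize_column_name(name):
    """Replace runs of invalid characters with '_' and lower-case the result."""
    cleaned = _INVALID_CHARS.sub("_", name.strip())
    return cleaned.strip("_").lower()


def sanitize_columns(df):
    """Return a DataFrame whose columns are renamed with sanitize_column_name."""
    return df.toDF(*[sanitize_column_name(c) for c in df.columns])
```

For example, an illustrative header `Gross Tax (Cr)` becomes `gross_tax_cr`. Two different headers can collapse to the same name, so it is worth checking for duplicates before writing.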

## 5. GST Statewise Collections (FY 2021-22)

In [0]:
# Read GST 2021-22
gst_2021 = spark.read.csv(
    '/FileStore/tables/Statewise_2021_22.csv',
    header=True,
    inferSchema=True
).withColumn('fiscal_year', lit('2021-22')) \
 .withColumn('load_timestamp', current_timestamp())

gst_2021.write.format('delta').mode('overwrite').saveAsTable('bronze_gst_statewise_2021_22')
print(f"✓ bronze_gst_statewise_2021_22: {gst_2021.count()} records")

## 6. GST Statewise Collections (FY 2022-23)

In [0]:
# Read GST 2022-23
gst_2022 = spark.read.csv(
    '/FileStore/tables/Statewise_2022_23.csv',
    header=True,
    inferSchema=True
).withColumn('fiscal_year', lit('2022-23')) \
 .withColumn('load_timestamp', current_timestamp())

gst_2022.write.format('delta').mode('overwrite').saveAsTable('bronze_gst_statewise_2022_23')
print(f"✓ bronze_gst_statewise_2022_23: {gst_2022.count()} records")

## 7. GST Statewise Collections (FY 2023-24)

In [0]:
# Read GST 2023-24
gst_2023 = spark.read.csv(
    '/FileStore/tables/Statewise_2023_24.csv',
    header=True,
    inferSchema=True
).withColumn('fiscal_year', lit('2023-24')) \
 .withColumn('load_timestamp', current_timestamp())

gst_2023.write.format('delta').mode('overwrite').saveAsTable('bronze_gst_statewise_2023_24')
print(f"✓ bronze_gst_statewise_2023_24: {gst_2023.count()} records")

## 8. GST Statewise Collections (FY 2024-25)

In [0]:
# Read GST 2024-25
gst_2024 = spark.read.csv(
    '/FileStore/tables/Statewise_2024_25.csv',
    header=True,
    inferSchema=True
).withColumn('fiscal_year', lit('2024-25')) \
 .withColumn('load_timestamp', current_timestamp())

gst_2024.write.format('delta').mode('overwrite').saveAsTable('bronze_gst_statewise_2024_25')
print(f"✓ bronze_gst_statewise_2024_25: {gst_2024.count()} records")

## 9. GST Statewise Collections (FY 2025-26)

In [0]:
# Read GST 2025-26
gst_2025 = spark.read.csv(
    '/FileStore/tables/Statewise_2025_26.csv',
    header=True,
    inferSchema=True
).withColumn('fiscal_year', lit('2025-26')) \
 .withColumn('load_timestamp', current_timestamp())

gst_2025.write.format('delta').mode('overwrite').saveAsTable('bronze_gst_statewise_2025_26')
print(f"✓ bronze_gst_statewise_2025_26: {gst_2025.count()} records")
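
The six statewise cells differ only in the fiscal year, so the paths and table names can be derived from the start year. A minimal sketch, assuming the file and table naming seen in the cells above holds for every year; `fiscal_year_label`, `gst_statewise_sources`, and `ingest_gst_statewise` are hypothetical names.

```python
def fiscal_year_label(start_year):
    """Format a start year as a fiscal-year label, e.g. 2020 -> '2020-21'."""
    return f"{start_year}-{(start_year + 1) % 100:02d}"


def gst_statewise_sources(first_year=2020, last_year=2025):
    """Build the path and table name for each statewise GST file."""
    sources = []
    for year in range(first_year, last_year + 1):
        label = fiscal_year_label(year)
        suffix = label.replace("-", "_")
        sources.append({
            "fiscal_year": label,
            "path": f"/FileStore/tables/Statewise_{suffix}.csv",
            "table": f"bronze_gst_statewise_{suffix}",
        })
    return sources


def ingest_gst_statewise(spark, sources):
    """Load every statewise file into its own Bronze Delta table."""
    # Imported here so the pure helpers above work without pyspark.
    from pyspark.sql.functions import current_timestamp, lit

    for src in sources:
        df = (
            spark.read.csv(src["path"], header=True, inferSchema=True)
            .withColumn("fiscal_year", lit(src["fiscal_year"]))
            .withColumn("load_timestamp", current_timestamp())
        )
        df.write.format("delta").mode("overwrite").saveAsTable(src["table"])
```

Calling `ingest_gst_statewise(spark, gst_statewise_sources())` would cover FY 2020-21 through FY 2025-26 in one cell, and adding a later year only means raising `last_year`.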

## 10. GST Gross vs Net Tax Collection

In [0]:
# Read Gross vs Net Tax Excel file
# Requires the spark-excel library (com.crealytics) to be installed on the cluster
gross_net_tax = spark.read.format('com.crealytics.spark.excel') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('/FileStore/tables/bronze_gross_net_tax.xlsx')

# Add metadata
gross_net_tax_bronze = gross_net_tax \
    .withColumn('source_file', lit('bronze_gross_net_tax.xlsx')) \
    .withColumn('load_timestamp', current_timestamp())

# Write to Bronze Delta table
gross_net_tax_bronze.write \
    .format('delta') \
    .mode('overwrite') \
    .saveAsTable('bronze_gross_net_tax')

print(f"✓ bronze_gross_net_tax created: {gross_net_tax_bronze.count()} records")
gross_net_tax_bronze.printSchema()
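
If the spark-excel library is not available on the cluster, one alternative is to read the workbook with pandas and convert the result to a Spark DataFrame. This is a sketch under assumptions rather than the notebook's method: it assumes the file is reachable through the `/dbfs` local mount, that `openpyxl` is installed for pandas, and that the first sheet holds the data. `dbfs_to_local_path` and `read_excel_via_pandas` are hypothetical names.

```python
def dbfs_to_local_path(dbfs_path):
    """Map a DBFS path such as '/FileStore/x' to its '/dbfs/...' mount path."""
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]
    return "/dbfs/" + dbfs_path.lstrip("/")


def read_excel_via_pandas(spark, dbfs_path, sheet_name=0):
    """Read one Excel sheet with pandas and return it as a Spark DataFrame."""
    # Imported here so the path helper works without pandas installed.
    import pandas as pd  # .xlsx files need openpyxl

    pdf = pd.read_excel(dbfs_to_local_path(dbfs_path), sheet_name=sheet_name)
    return spark.createDataFrame(pdf)
```

The returned DataFrame can then take the same `source_file` and `load_timestamp` columns as in the cell above. pandas loads the whole sheet on the driver, so this only suits small workbooks.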

## Summary: Bronze Layer Tables Created

In [0]:
# List all Bronze tables
bronze_tables = spark.sql("""
    SHOW TABLES IN msme_risk_analytics LIKE 'bronze*'
""")

print("\n" + "="*60)
print("BRONZE LAYER INGESTION COMPLETE")
print("="*60)
bronze_tables.show(truncate=False)

print("\n✅ DAY 1 COMPLETE - All raw data ingested to Bronze layer")
print("Next: Run 02_Silver_Loan_Data.ipynb")
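
Listing the tables confirms that they exist but not that they hold the expected rows. The sketch below compares stored row counts with the record counts stated in the Data Sources list at the top of this notebook. The GST tables are left out because their per-file row counts are not stated; `find_count_mismatches` and `bronze_counts` are hypothetical helpers.

```python
# Record counts taken from the Data Sources list at the top of the notebook.
EXPECTED_COUNTS = {
    "bronze_credit_risk": 32576,
    "bronze_loan_default": 148670,
    "bronze_loan_prediction": 12367,
}


def find_count_mismatches(actual, expected=EXPECTED_COUNTS):
    """Return {table: (expected, actual)} for every table whose count differs."""
    mismatches = {}
    for table, want in expected.items():
        got = actual.get(table)  # None when the table was not counted
        if got != want:
            mismatches[table] = (want, got)
    return mismatches


def bronze_counts(spark, tables):
    """Count the rows stored in each named table."""
    return {t: spark.table(t).count() for t in tables}
```

On a cluster, `find_count_mismatches(bronze_counts(spark, EXPECTED_COUNTS))` returns an empty dict when all three loan tables match the stated counts.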