# Emissions 01: Bronze Data & Filters Testing

**Purpose**: Test reading One BI premium data and applying business filters

**Tests**:
1. Read rf_fr1_prm_dtl_midcorp_m from bronze
2. Apply business filters (market, date) and exclusions (intermediaries, guarantees, categories)
3. Verify filter impact

---

In [1]:
import sys
from pathlib import Path

project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))
print(f"Project root: {project_root}")

Project root: /workspace/new_python


In [2]:
from pyspark.sql import SparkSession
# from azfr_fsspec_utils import fspath
# import azfr_fsspec_abfs

# azfr_fsspec_abfs.use()

spark = SparkSession.builder \
    .appName("Emissions_Testing") \
    .getOrCreate()

print(f"✓ Spark {spark.version}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/17 22:15:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✓ Spark 3.4.4


## 1. Load Configuration

In [3]:
from utils.loaders.config_loader import ConfigLoader
from src.reader import BronzeReader
import json

config = ConfigLoader(str(project_root / "config" / "config.yml"))
bronze_reader = BronzeReader(
    spark, config,
    str(project_root / "config" / "reading_config.json")
)

# Load emissions exclusions
with open(project_root / "config" / "transformations" / "emissions_config.json") as f:
    emissions_config = json.load(f)

print("Exclusions loaded:")
print(f"  Intermediaries: {len(emissions_config['excluded_intermediaries'])}")
print(f"  Guarantees: {emissions_config['excluded_guarantees']}")
print(f"  Categories: {emissions_config['excluded_categories']}")

Exclusions loaded:
  Intermediaries: 22
  Guarantees: ['180', '183', '184', '185']
  Categories: ['792', '793']
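
The exclusion lists drive every filter further down, so it can help to check their shape right after loading. Below is a minimal sketch, assuming the three keys printed above and string-typed codes. The helper `validate_exclusions` and the intermediary codes `INT_A`/`INT_B` are hypothetical and not part of the project.

```python
import json

# Key names and the guarantee/category values come from the notebook output;
# the intermediary codes below are hypothetical placeholders.
REQUIRED_KEYS = ("excluded_intermediaries", "excluded_guarantees", "excluded_categories")


def validate_exclusions(config: dict) -> dict:
    """Check that every exclusion list exists and holds only non-empty strings."""
    for key in REQUIRED_KEYS:
        if key not in config:
            raise KeyError(f"missing exclusion list: {key}")
        values = config[key]
        if not isinstance(values, list):
            raise TypeError(f"{key} must be a list, got {type(values).__name__}")
        bad = [v for v in values if not isinstance(v, str) or not v]
        if bad:
            raise ValueError(f"{key} holds non-string or empty entries: {bad}")
    return config


sample = json.loads("""
{
  "excluded_intermediaries": ["INT_A", "INT_B"],
  "excluded_guarantees": ["180", "183", "184", "185"],
  "excluded_categories": ["792", "793"]
}
""")

checked = validate_exclusions(sample)
print({key: len(checked[key]) for key in REQUIRED_KEYS})
```

Failing early on a missing or mistyped list is cheaper than discovering it later as a filter that silently matches nothing.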


## 2. Read One BI Premium Data

In [5]:
VISION = "202509"

try:
    # Read the premium file group for the selected vision
    df = bronze_reader.read_file_group('rf_fr1_prm_dtl_midcorp_m', VISION)
    print(f"✓ Read {df.count():,} rows")
    print(f"  Columns: {len(df.columns)}")
    
    # Bronze column names are raw (before transformation)
    df.select('nu_cnt_prm', 'cd_prd_prm', 'cd_int_stc', 'cd_gar_prospctiv').show(5)
except Exception as e:
    print(f"⚠ Error reading data: {e}")
    df = None

✓ Read 20,000 rows
  Columns: 16
+-----------+----------+----------+----------------+
| nu_cnt_prm|cd_prd_prm|cd_int_stc|cd_gar_prospctiv|
+-----------+----------+----------+----------------+
|POL00010331|     01006|  INT00296|         XX240YY|
|POL00001683|     01102|  INT00019|         XX300YY|
|POL00006112|     01102|  INT00162|         XX220YY|
|POL00011553|     01087|  INT00115|         XX310YY|
|POL00013684|     01142|  INT00039|         XX240YY|
+-----------+----------+----------+----------------+
only showing top 5 rows
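
`VISION` is a `YYYYMM` string (`"202509"` here) and is reused later as the upper bound on `dt_cpta_cts`. The type of `dt_cpta_cts` is not shown in this notebook. If it turns out to be a date, comparing against an explicit month-end date is less ambiguous than comparing against the raw string. A minimal sketch of that conversion, with a hypothetical helper name `vision_month_end`:

```python
import calendar
from datetime import date


def vision_month_end(vision: str) -> date:
    """Convert a YYYYMM vision string into the last calendar day of that month."""
    if len(vision) != 6 or not vision.isdigit():
        raise ValueError(f"expected YYYYMM, got {vision!r}")
    year, month = int(vision[:4]), int(vision[4:])
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {vision!r}")
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


print(vision_month_end("202509"))
```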



## 3. Lowercase Columns (REQUIRED)

In [6]:
# Lowercase column names before applying filters
from utils.transformations import lowercase_all_columns

if df is not None:
    df = lowercase_all_columns(df)
    print("✓ Columns lowercased")
    print(f"  Sample columns: {df.columns[:5]}")

✓ Columns lowercased
  Sample columns: ['cd_niv_2_stc', 'cd_int_stc', 'nu_cnt_prm', 'cd_prd_prm', 'cd_statu_cts']
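
The body of `lowercase_all_columns` is not shown in this notebook. As an illustration only, the renaming step could look like the sketch below, which works on a plain list of names and refuses names that would collide once lowercased. The helper `lowercase_names` is hypothetical.

```python
from collections import Counter


def lowercase_names(columns):
    """Return lowercased column names, refusing names that collide once lowercased."""
    lowered = [name.lower() for name in columns]
    clashes = sorted(name for name, n in Counter(lowered).items() if n > 1)
    if clashes:
        raise ValueError(f"columns collide after lowercasing: {clashes}")
    return lowered


print(lowercase_names(["CD_INT_STC", "NU_CNT_PRM", "cd_prd_prm"]))
```

On a Spark DataFrame, the resulting list could be applied with `df.toDF(*names)`, which renames all columns positionally.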


## 4. Apply Exclusion Filters

In [7]:
from pyspark.sql.functions import col

if df is not None:
    count_before = df.count()
    
    # Filter 1: Market filter (cd_marche = '6')
    df_f1 = df.filter(col('cd_marche') == '6')
    print(f"After market filter: {df_f1.count():,}")
    
    # Filter 2: Date filter (dt_cpta_cts <= vision)
    df_f2 = df_f1.filter(col('dt_cpta_cts') <= VISION)
    print(f"After date filter: {df_f2.count():,}")
    
    # Filter 3: Excluded intermediaries
    df_f3 = df_f2.filter(~col('cd_int_stc').isin(emissions_config['excluded_intermediaries']))
    print(f"After intermediary filter: {df_f3.count():,}")
    
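    # Filter 4: Excluded guarantees
    # NOTE: sampled raw values look like 'XX240YY' (code in chars 3-5), so an exact
    # match against 3-digit codes such as '180' never matches values of that shape.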
    df_f4 = df_f3.filter(~col('cd_gar_prospctiv').isin(emissions_config['excluded_guarantees']))
    print(f"After guarantee filter: {df_f4.count():,}")
    
    # Filter 5: Excluded categories
    df_f5 = df_f4.filter(~col('cd_cat_min').isin(emissions_config['excluded_categories']))
    print(f"After category filter: {df_f5.count():,}")
    
    count_after = df_f5.count()
    print(f"\nTotal: {count_before:,} → {count_after:,} ({(count_before-count_after):,} filtered)")
    
    df_filtered = df_f5
else:
    df_filtered = None  # keep the name defined for the next cell
    print("⚠ No data to filter")

After market filter: 20,000
After date filter: 20,000
After intermediary filter: 20,000
After guarantee filter: 20,000
After category filter: 20,000

Total: 20,000 → 20,000 (0 filtered)
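
Every step above leaves the count at 20,000. That result alone cannot tell "nothing to exclude" apart from "the filter never matches". A minimal sketch of a per-step report that flags filters removing no rows is shown below. It uses plain Python dicts so it runs without Spark. The row values and the helpers `apply_filters` and `noop_steps` are hypothetical; only the column names come from the notebook.

```python
def apply_filters(rows, steps):
    """Apply (label, predicate) steps in order; return kept rows and a per-step report."""
    report = []
    current = list(rows)
    for label, keep in steps:
        kept = [row for row in current if keep(row)]
        report.append((label, len(current), len(kept)))
        current = kept
    return current, report


def noop_steps(report):
    """Labels of the steps that removed no rows."""
    return [label for label, before, after in report if before == after]


# Hypothetical rows; only the column names come from the notebook.
rows = [
    {"cd_marche": "6", "cd_int_stc": "INT00296", "cd_cat_min": "100"},
    {"cd_marche": "6", "cd_int_stc": "INT_A", "cd_cat_min": "792"},
    {"cd_marche": "5", "cd_int_stc": "INT00019", "cd_cat_min": "100"},
]
steps = [
    ("market", lambda r: r["cd_marche"] == "6"),
    ("intermediary", lambda r: r["cd_int_stc"] not in {"INT_X"}),
    ("category", lambda r: r["cd_cat_min"] not in {"792", "793"}),
]

kept, report = apply_filters(rows, steps)
for label, before, after in report:
    print(f"{label}: {before} -> {after} ({before - after} removed)")
print("No-op steps:", noop_steps(report))
```

With Spark, the same report could be built from the `count()` taken after each `filter`. A flagged step is a prompt to inspect the column's actual values, not proof of a bug.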


## 5. Verify Bronze Column Names

In [8]:
if df_filtered is not None:
    print("Bronze column mapping:")
    print("  nu_cnt_prm → nopol (after transformation)")
    print("  cd_prd_prm → cdprod (after transformation)")
    print("  cd_int_stc → noint (after transformation)")
    print("  cd_gar_prospctiv → used to extract cgarp (chars 3-5)")
    print("\n✓ All filters applied successfully")

Bronze column mapping:
  nu_cnt_prm → nopol (after transformation)
  cd_prd_prm → cdprod (after transformation)
  cd_int_stc → noint (after transformation)
  cd_gar_prospctiv → used to extract cgarp (chars 3-5)

✓ All filters applied successfully
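
The mapping above states that `cgarp` comes from characters 3-5 of `cd_gar_prospctiv`. The sketch below shows that slice in plain Python and compares membership in the exclusion list before and after extraction. `"XX240YY"` appears in the sample rows of this notebook; `"XX180YY"` and the helper `guarantee_code` are hypothetical.

```python
def guarantee_code(raw: str) -> str:
    """Return characters 3-5 (1-based) of a raw cd_gar_prospctiv value."""
    if len(raw) < 5:
        raise ValueError(f"value too short to hold a guarantee code: {raw!r}")
    return raw[2:5]  # 0-based slice covering 1-based positions 3, 4, 5


EXCLUDED_GUARANTEES = {"180", "183", "184", "185"}  # from the notebook output

# "XX240YY" is taken from the sample rows; "XX180YY" is a hypothetical value.
for raw in ("XX240YY", "XX180YY"):
    code = guarantee_code(raw)
    print(raw, code, raw in EXCLUDED_GUARANTEES, code in EXCLUDED_GUARANTEES)
```

In PySpark the equivalent slice is `substring(col('cd_gar_prospctiv'), 3, 3)` from `pyspark.sql.functions`, where the position argument is 1-based.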


## Summary

In [9]:
print("="*60)
print("EMISSIONS BRONZE TESTING COMPLETE")
print("="*60)
print("\n→ Next: Notebook 02 - Full Pipeline")

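============================================================
EMISSIONS BRONZE TESTING COMPLETE
============================================================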

→ Next: Notebook 02 - Full Pipeline
