## Process Data

This notebook reads all the raw files from the bronze volume and processes them into a single table. It could be optimized to use less RAM, but it does the job.

In [11]:
from hydrate.utils import get_config_path, DotConfig
config_path = get_config_path()
config = DotConfig(config_path)

## Bronze to Silver Cleaning
There are a couple of problems with how the files were saved that get in the way of reading them with Spark. We need to loop through every Parquet file and fix the timestamp before reading with Spark.

In [None]:
from pathlib import Path
import pandas as pd
cols_to_keep = ['P-PDG', 'P-MON-CKP', 'P-MON-CKGL', 'P-JUS-CKGL', 'P-JUS-CKP', 'P-TPT', 'T-TPT', 'QGL', 'T-JUS-CKP', 'state', 'timestamp']

for path in Path(config.download.output_dir).rglob('*.parquet'):
  print(path)
  df_in = pd.read_parquet(path)
  df_out = df_in.reset_index()[cols_to_keep].dropna(subset=['state'])
  # save cleaner parquets to silver with proper timestamp
  out_path = str(path).replace('bronze','silver')
  # parents=True also creates any missing intermediate directories
  Path(out_path).parent.mkdir(parents=True, exist_ok=True)
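  df_out.to_parquet(
    Path(out_path),
    # cast timestamps to millisecond precision when writing
    coerce_timestamps="ms",
    # do not raise if sub-millisecond detail is lost in the cast
    allow_truncated_timestamps=True,
    index=False
  )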
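
The loop above derives the output location with a string replacement over the whole path, which also rewrites the word if it happens to appear elsewhere in the path. A minimal sketch of an alternative that mirrors the file's relative position under a separate root is shown here. The function name `silver_path_for` and the example directories are hypothetical and not part of `hydrate`.

```python
from pathlib import Path


def silver_path_for(bronze_file: Path, bronze_root: Path, silver_root: Path) -> Path:
    """Mirror a file's position under bronze_root into silver_root."""
    # relative_to raises ValueError if bronze_file is not inside bronze_root,
    # which surfaces a misconfigured path early instead of writing elsewhere.
    return silver_root / bronze_file.relative_to(bronze_root)


# Hypothetical locations, for illustration only.
bronze_root = Path("/data/bronze")
silver_root = Path("/data/silver")
example = silver_path_for(bronze_root / "0" / "WELL-00001.parquet", bronze_root, silver_root)
print(example)
```

In the notebook, `bronze_root` would presumably be `Path(config.download.output_dir)`; the silver root would have to come from the config as well, which is an assumption about how the config is laid out.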

## Table Generation

Now that we have properly formatted files, we can use Spark to build one large table.

In [None]:
from pyspark.sql.types import (
    StructType, StructField, DoubleType, IntegerType, TimestampType
)

SCHEMA_3W = StructType([
    StructField("timestamp", TimestampType(), True),
    StructField("P-PDG", DoubleType(), True),
    StructField("P-MON-CKP", DoubleType(), True),
    StructField("P-MON-CKGL", DoubleType(), True),
    StructField("P-JUS-CKGL", DoubleType(), True),
    StructField("P-JUS-CKP", DoubleType(), True),
    StructField("P-TPT", DoubleType(), True),
    StructField("T-TPT", DoubleType(), True),
    StructField("QGL", DoubleType(), True),
    StructField("T-JUS-CKP", DoubleType(), True),
    StructField("state", IntegerType(), True),
])
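
The schema is meant to line up with `cols_to_keep` from the cleaning cell, so a quick consistency check can catch a renamed or forgotten column before any data is written. The sketch below works on plain lists of names to stay self-contained; `compare_columns` is a hypothetical helper, and with pyspark available the second list could come from `SCHEMA_3W.fieldNames()` instead of being typed out.

```python
def compare_columns(expected, actual):
    """Return (missing, unexpected) column names, preserving input order."""
    expected_set, actual_set = set(expected), set(actual)
    missing = [c for c in expected if c not in actual_set]
    unexpected = [c for c in actual if c not in expected_set]
    return missing, unexpected


# Names copied from the cleaning cell.
cols_to_keep = ['P-PDG', 'P-MON-CKP', 'P-MON-CKGL', 'P-JUS-CKGL', 'P-JUS-CKP',
                'P-TPT', 'T-TPT', 'QGL', 'T-JUS-CKP', 'state', 'timestamp']
# Names copied from SCHEMA_3W.
schema_fields = ['timestamp', 'P-PDG', 'P-MON-CKP', 'P-MON-CKGL', 'P-JUS-CKGL',
                 'P-JUS-CKP', 'P-TPT', 'T-TPT', 'QGL', 'T-JUS-CKP', 'state']

missing, unexpected = compare_columns(schema_fields, cols_to_keep)
print(missing, unexpected)  # prints: [] []
```

Both lists hold the same eleven names in a different order, so the check comes back empty.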

In [None]:
from hydrate.process import read_nested_parquet_files, add_state_name, clean_data
df = read_nested_parquet_files(config.process.silver_dir)
df = df.pipe(clean_data).pipe(add_state_name).reset_index(drop=True)

In [15]:
from databricks.connect import DatabricksSession as SparkSession
spark = SparkSession.builder.serverless(True).getOrCreate()

In [None]:
(
    spark.createDataFrame(df)
    .write
    .mode("overwrite")
    .option("mergeSchema", "true")
    .saveAsTable(f"{config.catalog}.{config.schema}.{config.process.table}")
)
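
As noted at the top, this approach holds the whole dataset in memory before writing. One possible way to lower peak RAM is to write the table in batches of files: the first batch overwrites the table and later batches append to it. The sketch below shows only the batching logic with a stand-in writer; `batched`, `write_in_batches`, and the `write_batch` callback are hypothetical names and not part of `hydrate`.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def write_in_batches(
    paths: Sequence[str],
    size: int,
    write_batch: Callable[[List[str], str], None],
) -> int:
    """Call write_batch(batch, mode) per batch and return the batch count."""
    count = 0
    for i, batch in enumerate(batched(paths, size)):
        # The first batch replaces the table; later batches are appended.
        mode = "overwrite" if i == 0 else "append"
        write_batch(batch, mode)
        count += 1
    return count


# Stand-in writer that records its calls instead of touching Spark.
calls = []
write_in_batches(
    ["a.parquet", "b.parquet", "c.parquet"],
    2,
    lambda batch, mode: calls.append((batch, mode)),
)
print(calls)
# prints: [(['a.parquet', 'b.parquet'], 'overwrite'), (['c.parquet'], 'append')]
```

In the notebook, `write_batch` would read the batch's files with pandas, apply the same cleaning steps, and pass `mode` to `.write.mode(mode)` before `saveAsTable`. This assumes `clean_data` and `add_state_name` give the same result per batch as on the full dataset, which has not been verified here.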