# Data Standardization and Merging Pipeline
This notebook standardizes, validates, and merges data for the Triton Flow Plans dataset. It covers the following steps:
- Initializing the configuration and environment setup.
- Loading and validating the JSON schema and source data.
- Flattening and transforming nested data structures.
- Performing data quality checks and ensuring data integrity.
- Managing Delta table creation and merging updated records.
- Generating feedback timestamps to track data processing intervals.


## Widget List
The following widgets are used in this notebook to allow dynamic configuration:
- **`SourceStorageAccount`**: The source storage account for the dataset.
- **`DestinationStorageAccount`**: The destination storage account where processed data is stored.
- **`SourceContainer`**: The container holding the source data files.
- **`SourceDatasetidentifier`**: The identifier used for the dataset.
- **`DepthLevel`**: The depth level for flattening nested JSON structures.

These widgets enable flexibility and can be adjusted to fit different environments or datasets.

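As an illustration of how such widgets are typically declared and read, here is a minimal sketch built on the standard `dbutils.widgets.text` and `dbutils.widgets.get` calls. The helper names `declare_widgets` and `read_widgets` and all default values are hypothetical; the utility package may set up the widgets differently.

```python
# Hypothetical defaults for illustration only; real values depend on the environment.
WIDGET_DEFAULTS = {
    "SourceStorageAccount": "<source_storage_account>",
    "DestinationStorageAccount": "<destination_storage_account>",
    "SourceContainer": "<source_container>",
    "SourceDatasetidentifier": "<source_datasetidentifier>",
    "DepthLevel": "1",
}


def declare_widgets(dbutils, defaults=WIDGET_DEFAULTS):
    """Create one text widget per parameter (name, default value, label)."""
    for name, default in defaults.items():
        dbutils.widgets.text(name, default, name)


def read_widgets(dbutils, names=tuple(WIDGET_DEFAULTS)):
    """Return the current widget values as a plain dict of strings."""
    return {name: dbutils.widgets.get(name) for name in names}
```

Inside a Databricks notebook, `declare_widgets(dbutils)` followed by `read_widgets(dbutils)` would return the values currently shown in the widget bar. Widget values are always strings, so a numeric setting such as `DepthLevel` needs an explicit `int(...)` conversion.
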
In [None]:
%pip install git+https://github.com/Open-Dataplatform/utils-databricks.git@v0.6.0

In [None]:
from pyspark.sql import SparkSession

# Importing functions from the custom utility package
from custom_utils import dataframe, helper
from custom_utils.dp_storage import reader, writer, initialize_config, table_management, merge_management, feedback_management, quality
from custom_utils.validation import verify_paths_and_files
from pyspark.sql.utils import AnalysisException

## Setup
### Configuration Handling with `initialize_config`
The `initialize_config` function manages all configuration parameters dynamically. It pulls values from the notebook widgets and from Azure Data Factory (ADF), so the code stays adaptable across environments and the paths, environment settings, and data identifiers are configured consistently.

In [None]:
# Initialize configuration and helper objects
config = initialize_config(dbutils, helper, '<source_environment>', '<destination_environment>', '<source_container>', '<source_datasetidentifier>')
spark = config.spark_session
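# Copy the configuration parameters into the notebook's global namespace.
# Later cells appear to rely on this for names such as key_columns and feedback_column.
config.unpack(globals())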
config.print_params()

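The call `config.unpack(globals())` suggests that configuration parameters are copied into the notebook's global namespace, which would explain why later cells use names such as `key_columns` without defining them. A minimal sketch of that pattern, assuming a hypothetical `SketchConfig` class (the real class in `custom_utils` may differ):

```python
class SketchConfig:
    """Hypothetical stand-in for the object returned by `initialize_config`."""

    def __init__(self, **params):
        self.params = dict(params)

    def unpack(self, namespace):
        """Copy every parameter into the given namespace (e.g. globals())."""
        namespace.update(self.params)

    def print_params(self):
        """Print the parameters in a stable, sorted order."""
        for name in sorted(self.params):
            print(f"{name}: {self.params[name]}")


# Placeholder values; the real ones come from widgets and ADF.
sketch_config = SketchConfig(
    source_datasetidentifier="<source_datasetidentifier>",
    key_columns="<key_columns>",
)
namespace = {}
sketch_config.unpack(namespace)  # the notebook passes globals() instead
```

Unpacking into `globals()` keeps the cells short, at the cost of making it less obvious where a variable comes from. Calling `print_params()` right after unpacking, as the notebook does, offsets that by listing the injected names.
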
## Read Data
This section verifies the paths, reads the JSON schema, and loads the source data for further processing. The schema is converted to a PySpark `StructType` and applied when the source files are read; flattening happens in the next step.

In [None]:
# Verify paths and files
schema_file_path, data_file_path, file_type = verify_paths_and_files(dbutils, config, helper)

# Read and parse the JSON content using schema
schema_json, spark_schema = reader.json_schema_to_spark_struct(schema_file_path)
df_raw = reader.read_json_from_binary(spark, spark_schema, data_file_path)
display(df_raw)

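To make the schema conversion concrete, the sketch below translates a small JSON Schema into the dictionary form that Spark uses to describe a struct type. It is not the implementation of `reader.json_schema_to_spark_struct`; the function name `json_schema_to_spark_json` and the type mapping are assumptions chosen for illustration.

```python
# Assumed mapping from JSON Schema primitive types to Spark SQL type names.
PRIMITIVE_TYPES = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}


def json_schema_to_spark_json(node):
    """Translate a JSON Schema node into Spark's JSON type representation."""
    node_type = node.get("type", "string")
    if isinstance(node_type, list):  # e.g. ["string", "null"]
        non_null = [t for t in node_type if t != "null"]
        node_type = non_null[0] if non_null else "string"
    if node_type == "object":
        required = set(node.get("required", []))
        return {
            "type": "struct",
            "fields": [
                {
                    "name": name,
                    "type": json_schema_to_spark_json(child),
                    "nullable": name not in required,
                    "metadata": {},
                }
                for name, child in node.get("properties", {}).items()
            ],
        }
    if node_type == "array":
        return {
            "type": "array",
            "elementType": json_schema_to_spark_json(node.get("items", {})),
            "containsNull": True,
        }
    # Unknown types fall back to string in this sketch.
    return PRIMITIVE_TYPES.get(node_type, "string")


example_schema = {
    "type": "object",
    "required": ["id"],
    "properties": {
        "id": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
spark_json = json_schema_to_spark_json(example_schema)
```

With PySpark available, `StructType.fromJson(spark_json)` should turn this dictionary into a schema object that can be passed to a reader.
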
### Flattening and Schema Handling
The depth level (set through the `DepthLevel` widget) controls how deeply nested structures are flattened. The schema handling logic represents the JSON schema in PySpark so that the data can be validated and transformed. For deeply nested schemas, `process_and_flatten_json` can limit the flattening to a specific depth, which makes the data easier to manage.

In [None]:
# Flatten and standardize the DataFrame
df, df_flattened, columns_of_interest, view_name = dataframe.process_and_flatten_json(
    spark=spark,
    config=config,
    schema_file_path=schema_file_path,
    data_file_path=data_file_path,
    helper=helper
)
display(df_flattened)

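The effect of a depth limit is easiest to see on a plain Python dictionary. The sketch below is not the logic of `process_and_flatten_json`, which operates on DataFrames; the function name `flatten_record` and the `_` separator are assumptions.

```python
def flatten_record(record, depth_level, sep="_"):
    """Flatten nested dicts up to `depth_level` levels; deeper levels stay nested."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict) and depth_level > 0:
            nested = flatten_record(value, depth_level - 1, sep)
            for sub_key, sub_value in nested.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


record = {"id": 1, "plan": {"name": "A", "window": {"start": "t0", "end": "t1"}}}
flat_one = flatten_record(record, depth_level=1)
flat_two = flatten_record(record, depth_level=2)
```

With `depth_level=1`, `plan` is expanded into `plan_name` and `plan_window`, but `plan_window` stays a nested dictionary. With `depth_level=2` it is expanded further into `plan_window_start` and `plan_window_end`.
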
## Data Quality Checks
This step checks that the data meets the expected quality standards before further processing. The duplicate check, for example, verifies that the key columns contain no repeated values.

In [None]:
# Perform data quality checks (e.g., duplicate checks)
quality.perform_quality_check(
    spark=spark,
    key_columns=key_columns,
    source_datasetidentifier=source_datasetidentifier,
    helper=helper
)

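A duplicate check on key columns boils down to counting how often each key combination occurs. The sketch below shows the idea on a list of dictionaries; the function name `find_duplicate_keys` and the `PlanId` column are hypothetical, and the actual check in `quality.perform_quality_check` may work differently.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return each key tuple that occurs more than once, with its count."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


rows = [
    {"PlanId": 1, "EventTimestamp": "2024-01-01T00:00:00"},
    {"PlanId": 1, "EventTimestamp": "2024-01-01T00:00:00"},
    {"PlanId": 2, "EventTimestamp": "2024-01-01T00:00:00"},
]
duplicates = find_duplicate_keys(rows, ["PlanId", "EventTimestamp"])
```

On a Spark DataFrame, the same idea can be expressed as `df.groupBy(*key_columns).count().filter("count > 1")`; an empty result means the keys are unique.
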
## Temporary View Creation and Data Transformation
A temporary view that keeps only the most recent record for each combination of key columns lets the following steps work with the latest version of the data. This is particularly useful for time series or versioned datasets.

In [None]:
# Define the columns used for ordering if multiple files are read
order_by_columns = ["input_file_name DESC", "EventTimestamp DESC"]

# Create a temporary view with the most recent records
dataframe.create_temp_view_with_most_recent_records(
    spark=spark,
    view_name=view_name,
    key_columns=key_columns,
    columns_of_interest=columns_of_interest,
    order_by_columns=order_by_columns,
    helper=helper
)

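One common way to keep only the most recent record per key is a `ROW_NUMBER()` window that ranks rows within each key and retains rank 1. The sketch below builds such a query as a string. It is an assumption about the general technique, not the code inside `create_temp_view_with_most_recent_records`; `build_most_recent_sql`, `v_example`, and `PlanId` are hypothetical names.

```python
def build_most_recent_sql(view_name, key_columns, columns_of_interest, order_by_columns):
    """Build a query that keeps the first-ranked row per key combination."""
    partition_by = ", ".join(key_columns)
    order_by = ", ".join(order_by_columns)
    select_list = ", ".join(columns_of_interest)
    return (
        f"SELECT {select_list} FROM ("
        f"SELECT *, ROW_NUMBER() OVER (PARTITION BY {partition_by} ORDER BY {order_by}) AS rn "
        f"FROM {view_name}) ranked WHERE rn = 1"
    )


query = build_most_recent_sql(
    view_name="v_example",
    key_columns=["PlanId"],
    columns_of_interest=["PlanId", "EventTimestamp"],
    order_by_columns=["input_file_name DESC", "EventTimestamp DESC"],
)
```

In a Spark session, `spark.sql(query).createOrReplaceTempView(...)` would register the result as a new temporary view. Because the ordering is descending, rank 1 is the row from the latest file and, within that file, the latest event.
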
## Delta Table Management and Merging
The following code creates the Delta table if it does not exist and then merges the updated records into it. During a merge, Delta table versions are tracked so that updates and modifications can be monitored.

In [None]:
# Manage table creation if it does not exist
table_management.manage_table_creation(
    spark=spark,
    destination_environment=destination_environment,
    source_datasetidentifier=source_datasetidentifier,
    helper=helper
)

# Manage data merge
merge_management.manage_data_merge(
    spark=spark,
    destination_environment=destination_environment,
    source_datasetidentifier=source_datasetidentifier,
    view_name=view_name,
    key_columns=key_columns,
    helper=helper
)

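A merge on key columns is typically expressed with Delta Lake's `MERGE INTO` statement: matching rows are updated and new rows are inserted. The sketch below only builds the statement as a string. Whether `manage_data_merge` generates the same SQL is an assumption, and `build_merge_sql`, the placeholder table and view names, and `PlanId` are hypothetical.

```python
def build_merge_sql(target_table, source_view, key_columns):
    """Build an upsert statement matching target and source rows on the key columns."""
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target_table} AS t "
        f"USING {source_view} AS s "
        f"ON {on_clause} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


merge_sql = build_merge_sql(
    "<destination_table>", "<view_name>", ["PlanId", "EventTimestamp"]
)
```

The table versions before and after a merge can be inspected with Delta Lake's `DESCRIBE HISTORY` command, which lists each version together with the operation that produced it.
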
## Feedback Timestamps
The final step generates feedback timestamps from the `feedback_column` of the processed view to track the interval covered by the run. These timestamps help confirm that the pipeline operates within the expected time windows, which makes it easier to audit and debug.

In [None]:
# Generate feedback timestamps
feedback_management.generate_feedback_timestamps(
    spark=spark,
    view_name=view_name,
    feedback_column=feedback_column,
    dbutils=dbutils,
    helper=helper
)

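Feedback timestamps of this kind usually amount to the earliest and latest value of a timestamp column, serialized so that a calling pipeline can read them. The sketch below shows that idea on a list of ISO 8601 strings. The function name `build_feedback_payload` and the keys `from_datetime` and `to_datetime` are hypothetical; the format produced by `generate_feedback_timestamps` is not documented here.

```python
import json
from datetime import datetime


def build_feedback_payload(timestamps):
    """Serialize the earliest and latest timestamp as a JSON string."""
    parsed = [datetime.fromisoformat(ts) for ts in timestamps]
    return json.dumps(
        {
            "from_datetime": min(parsed).isoformat(),
            "to_datetime": max(parsed).isoformat(),
        }
    )


payload = build_feedback_payload(
    ["2024-01-01T06:00:00", "2024-01-01T00:00:00", "2024-01-01T12:00:00"]
)
```

This sketch raises a `ValueError` for an empty list, so a real pipeline would need to decide how to report a run that processed no records.
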
## Notebook Completion
The notebook exits with a success message, signaling that processing completed.

In [None]:
# Exit the notebook with success message
dbutils.notebook.exit("Notebook completed successfully.")