In [None]:
%pip install git+https://github.com/Open-Dataplatform/utils-databricks.git@v0.5.1

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # F.desc is used when reading the Delta table history

# Importing functions from the custom utility package
from custom_utils import dataframe, helper
from custom_utils.dp_storage import reader, writer, initialize_config
from custom_utils.dp_storage.validation import verify_paths_and_files
from custom_utils.dp_storage.connector import mount

# Standardization

## Setup

### Configuration Handling with `initialize_config`

The `initialize_config` function sets up the configuration parameters for the notebook. It keeps configurations and parameters in one place, which makes them easier to reuse and maintain across workflows.

The function is defined in the `custom_utils.dp_storage` module and is a key part of the initialization process. It allows you to pass important settings such as source and destination environments, containers, dataset identifiers, and other related options.

**Default Parameters and Usage**:
```python
config = initialize_config(dbutils, helper, '<source_environment>', '<destination_environment>', '<source_container>', '<source_datasetidentifier>')
```
- `dbutils`: Databricks utility object.
- `helper`: Helper object for logging and parameter fetching.
- `source_environment`, `destination_environment`, `source_container`, `source_datasetidentifier`: Core configuration parameters for handling data paths.

By using this function, you can manage configurations more dynamically, pulling parameters from either widgets or ADF depending on where the code is executed. This replaces the older method of manually defining configurations in the notebook itself.
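
The widget-or-default lookup described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of `initialize_config`: `FakeWidgets`, `resolve_parameter`, `SketchConfig`, and `build_config` are hypothetical names, and `FakeWidgets` only stands in for `dbutils.widgets` so the example runs outside Databricks.

```python
from dataclasses import dataclass


class FakeWidgets:
    """Hypothetical stand-in for `dbutils.widgets`, so the sketch runs outside Databricks."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name):
        # Raise when the widget is missing, as a lookup of an undefined widget would.
        if name not in self._values:
            raise KeyError(f"No widget named {name!r}")
        return self._values[name]


def resolve_parameter(widgets, name, default):
    """Return the widget value if one is set and non-empty, else the default."""
    try:
        value = widgets.get(name)
    except Exception:
        return default
    return value if value else default


@dataclass(frozen=True)
class SketchConfig:
    source_environment: str
    destination_environment: str
    source_container: str
    source_datasetidentifier: str


def build_config(widgets, **defaults):
    """Resolve every field of SketchConfig from widgets, falling back to defaults."""
    return SketchConfig(**{key: resolve_parameter(widgets, key, value)
                           for key, value in defaults.items()})


# Only `source_container` is supplied externally; the rest use the defaults.
widgets = FakeWidgets({"source_container": "landing"})
sketch_config = build_config(
    widgets,
    source_environment="dev-source",
    destination_environment="dev-destination",
    source_container="default-container",
    source_datasetidentifier="my_dataset",
)
print(sketch_config.source_container)    # value supplied by the widget
print(sketch_config.source_environment)  # falls back to the default
```

Keeping the defaults in the call and the overrides in widgets means the same notebook can run interactively and from a pipeline without edits.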

In [None]:
# Initialize the configuration; replace the placeholders with your own values
config = initialize_config(dbutils, helper, '<source_environment>', '<destination_environment>', '<source_container>', '<source_datasetidentifier>')

## Read
Reads data from storage

### Enhanced Data Reading Workflow

The `verify_paths_and_files` function now handles much of the path validation and file retrieval logic that was previously done manually. It verifies the presence of required files (e.g., schema and source files) and determines the correct paths for further processing.

The function calls `reader.get_path_to_triggering_file` internally to determine the correct path to the source files. This centralizes logic for fetching paths, improving reliability and reducing code duplication.

In addition, the `reader` module provides several other key functions for handling data:
- `json_schema_to_spark_struct`: Converts a JSON schema to a PySpark `StructType`.
- `read_json_from_binary`: Reads and parses files as binary content, allowing for the extraction of JSON data with a specified schema.
- `get_dataset_path`: Retrieves the full path to a dataset based on a provided configuration.

These functions allow for more flexible and dynamic data ingestion, adapting to different file formats and data structures.
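
To make the idea behind a JSON-schema-to-Spark conversion concrete, here is a minimal sketch in plain Python. It does not reproduce `reader.json_schema_to_spark_struct`. The helper name `to_spark_type` and the type mapping are assumptions for illustration. The sketch builds the dictionary form of a Spark schema, which PySpark can turn into a `StructType` with `StructType.fromJson`.

```python
import json

# Assumed mapping from JSON Schema primitive types to Spark type names.
PRIMITIVE_TYPES = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}


def to_spark_type(node):
    """Translate one JSON Schema node into Spark's JSON schema representation."""
    kind = node.get("type", "string")
    if kind == "object":
        return {
            "type": "struct",
            "fields": [
                {"name": name, "type": to_spark_type(child),
                 "nullable": True, "metadata": {}}
                for name, child in node.get("properties", {}).items()
            ],
        }
    if kind == "array":
        return {
            "type": "array",
            "elementType": to_spark_type(node.get("items", {})),
            "containsNull": True,
        }
    # Unknown or missing types fall back to string in this sketch.
    return PRIMITIVE_TYPES.get(kind, "string")


example_schema = json.loads("""
{
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "tags": {"type": "array", "items": {"type": "string"}},
    "owner": {"type": "object", "properties": {"name": {"type": "string"}}}
  }
}
""")
spark_schema_dict = to_spark_type(example_schema)
print(json.dumps(spark_schema_dict, indent=2))
# With PySpark available: StructType.fromJson(spark_schema_dict)
```

Every field is marked nullable here for simplicity; a stricter version could read the JSON Schema `required` list instead.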

In [None]:
# Verify paths and files
schema_file_path, data_file_path, file_type = verify_paths_and_files(dbutils, config, helper)

# Read and parse the JSON content using schema
schema, spark_schema = reader.json_schema_to_spark_struct(schema_file_path)
df_raw = reader.read_json_from_binary(spark, spark_schema, data_file_path)

# Rewrite the line above if your file is not JSON.
# Examples (data_file_path comes from verify_paths_and_files above):
# df_raw = spark.read.option("delimiter", ",").csv(data_file_path, header=True)
# df_raw = spark.read.parquet(data_file_path)

## Standardize the Data

### Flattening and Renaming Data

The `flatten_df` function is now used to standardize the data structure. It recursively flattens complex nested structures (like arrays and structs) and applies type mappings for consistent data handling. The function also includes options for customizing the depth level for flattening.

`flatten_df` internally handles column renaming by replacing characters like `.` with `_`, streamlining the renaming process without the need for separate calls to `rename_columns`.

This approach simplifies the overall standardization process while maintaining flexibility for different data structures.
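
The effect of flattening with a depth limit is easiest to see on a plain Python dictionary. The sketch below is an illustration only, not how `flatten_df` works on Spark DataFrames: `flatten_record` is a hypothetical helper, it handles nested dictionaries (the analogue of structs) but not arrays, and it joins names with `_` in the same spirit as the renaming described above.

```python
def flatten_record(record, depth_level=None, separator="_", _prefix="", _depth=0):
    """Flatten nested dicts into one level, joining key names with `separator`.

    Nesting deeper than `depth_level` is left untouched; None means no limit.
    """
    flat = {}
    for key, value in record.items():
        name = f"{_prefix}{separator}{key}" if _prefix else key
        can_descend = depth_level is None or _depth < depth_level
        if isinstance(value, dict) and can_descend:
            # Recurse into the nested dict, carrying the joined name as prefix.
            flat.update(flatten_record(value, depth_level, separator, name, _depth + 1))
        else:
            flat[name] = value
    return flat


nested = {"id": 1, "owner": {"name": "Ada", "address": {"city": "Oslo"}}}
fully_flat = flatten_record(nested)
one_level = flatten_record(nested, depth_level=1)
print(fully_flat)
print(one_level)
```

With no limit every nested key becomes a top-level column name such as `owner_address_city`, while `depth_level=1` stops after the first level and keeps `owner_address` as a nested value.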

In [None]:
# Standardize the DataFrame
df = dataframe.flatten_df(df_raw, depth_level=config.depth_level, type_mapping=dataframe.get_type_mapping())
df = dataframe.rename_columns(df, replacements={'.': '_'})  # Optional if further renaming is needed

## Merge and Upload

### Improved Delta Table Handling and Version Tracking

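In this step, the processed data is written to a Delta table. The code below uses `overwrite` mode, so each run replaces the table contents rather than performing a row-level merge. The current Delta table version is retrieved before the write, allowing for version comparison and tracking changes between updates.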

You can easily check the changes between table versions by comparing the old and new versions. This helps with monitoring data updates and understanding the impact of each merge operation.
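
One way to compare two versions is Delta time travel with `VERSION AS OF`. The sketch below only builds the SQL strings, so it runs without a Spark session. `version_diff_queries` is a hypothetical helper and `my_database.my_table` is a placeholder table name. Because the write uses `overwriteSchema`, the two versions may have different columns, in which case an `EXCEPT` comparison like this would not apply directly.

```python
def version_diff_queries(table, old_version, new_version):
    """Build Delta time-travel queries for rows added and removed between two versions."""
    old = f"SELECT * FROM {table} VERSION AS OF {int(old_version)}"
    new = f"SELECT * FROM {table} VERSION AS OF {int(new_version)}"
    return {
        "added": f"{new} EXCEPT {old}",    # rows present only in the new version
        "removed": f"{old} EXCEPT {new}",  # rows present only in the old version
    }


queries = version_diff_queries("my_database.my_table", 3, 4)
for label, query in queries.items():
    print(f"{label}: {query}")
# In a notebook: spark.sql(queries["added"]).show()
```

In the notebook, the version numbers captured before and after the write can be passed in as `old_version` and `new_version`.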

In [None]:
# Determine destination paths
destination_path = writer.get_destination_path(config.destination_environment)
database_name_databricks, table_name_databricks = writer.get_databricks_table_info(config.destination_environment)

# Get the current Delta table version before the write
try:
    current_version = (
        spark.sql(f"DESCRIBE HISTORY {database_name_databricks}.{table_name_databricks}")
        .select("version")
        .orderBy(F.desc("version"))
        .first()[0]
    )
except Exception:  # e.g. AnalysisException when the table doesn't exist yet
    current_version = None

# Overwrite the table with the processed data
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .option("path", destination_path) \
    .saveAsTable(f'{database_name_databricks}.{table_name_databricks}')

# Get the new Delta table version after the write
new_version = spark.sql(f"DESCRIBE HISTORY {database_name_databricks}.{table_name_databricks}").select("version").orderBy(F.desc("version")).first()[0]

# Print version changes
if current_version is not None:
    print(f"Table updated from version {current_version} to {new_version}.")
else:
    print(f"Table created at version {new_version}.")