# Transform data for Parking Sensors

TO DO: 
- Update the code logic using [03_transform.py](https://github.com/Azure-Samples/modern-data-warehouse-dataops/blob/feat/e2e-fabric-dataops-sample/e2e_samples/parking_sensors/databricks/notebooks/03_transform.py) as the reference.
- Need to update the landing page references.
- Add unit test cases in a separate notebook.

About:

- This notebook ingests data from the source required by the parking sensors sample, then performs the cleanup and standardization steps. See the [parking sensors page](https://github.com/Azure-Samples/modern-data-warehouse-dataops/tree/feat/e2e-fabric-dataops-sample/e2e_samples/fabric_dataops_sample/README.md) for more details about the parking sensors sample using Microsoft Fabric.

Assumptions/Pre-requisites:

- There is currently a known issue with running cross-workspace queries when the workspace name contains special characters. See [schema limitation](https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-schemas#public-preview-limitations) for more details. Avoid special characters in workspace names if you plan to query across workspaces with schema support.
- All the assets needed are created by IaC step during migration.
    - Config file needed: `Files/sc-adls-main/config/application.cfg` (derived from `application.cfg.template` during the CI/CD process). Ensure the "transform" section is updated with the parameters required by this notebook.
- All the required lakehouse schemas and tables are created by the `nb-setup` notebook.
- Input data standardization is completed by `nb-standardize` notebook.
- Environment with common library otel_monitor_invoker.py and its associated python dependencies
- Parking Sensor Lakehouse
- Datasource: ADLS made available as a shortcut in Parking Sensor Lakehouse or Direct access to REST APIs.
- Monitoring sink: AppInsights
- Secrets repo: Key vault to store AppInsights connection information

- All Lakehouses have schema support enabled (in Public preview as of Nov, 2024).
- Execution
  - A default lakehouse is associated during runtime where the required files and data are already staged. Multiple ways of invoking:
    - [API call](https://learn.microsoft.com/fabric/data-engineering/notebook-public-api#run-a-notebook-on-demand)
    - [Part of a data pipeline](https://learn.microsoft.com/fabric/data-engineering/author-execute-notebook#parameterized-session-configuration-from-a-pipeline)
    - [Using `%run` from another notebook](https://learn.microsoft.com/fabric/data-engineering/author-execute-notebook#reference-run-a-notebook)
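
For the API route, the request body carries the values that override the parameters cell. The following is a minimal sketch of building that body, assuming the `executionData.parameters` layout described in the linked run-on-demand documentation. The helper name `build_run_payload` and the sample values are hypothetical, only string parameters are handled, and the request itself (endpoint, authentication) is left out.

```python
import json


def build_run_payload(parameters: dict) -> dict:
    """Build a run-on-demand request body from string parameters.

    Hypothetical helper: the `executionData.parameters` layout is taken from
    the linked Fabric documentation and should be verified against it.
    """
    payload_params = {}
    for name, value in parameters.items():
        # This sketch only supports strings, which covers the parameters cell of this notebook.
        if not isinstance(value, str):
            raise TypeError(f"Parameter {name!r} must be a string, got {type(value).__name__}")
        payload_params[name] = {"value": value, "type": "string"}
    return {"executionData": {"parameters": payload_params}}


# Sample values mirror the defaults of the parameters cell.
payload = build_run_payload({"execution_mode": "all", "env_stage": "dev"})
print(json.dumps(payload, indent=2))
```

Keeping the payload construction separate from the HTTP call makes it easy to check the parameter names against the parameters cell before anything is submitted.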


## Parameters and Library imports

### Reading parameters (external from Fabric pipeline or default values)

In [None]:
%%configure
{
    "defaultLakehouse": {
        "name": {
            "parameterName": "lakehouse_name",
            "defaultValue": "{{ .lakehouse_name }}"
               } ,
        "id": { 
            "parameterName": "lakehouse_id",
            "defaultValue": "{{ .lakehouse_id }}"
        } ,
        "workspaceId": {
            "parameterName": "workspace_id",
            "defaultValue": "{{ .workspace_id }}"
        }
    },
    "mountPoints": [
        {
            "mountPoint": "/local_data",
            "source": {
                "parameterName": "local_mount",
                "defaultValue": "abfss://{{ .workspace_id }}@onelake.dfs.fabric.microsoft.com/{{ .lakehouse_id }}/Files"
            }
        }
    ]
}

In [None]:
# Unless `%%configure` is used to read external parameters - this cell should be the first one

# This cell is tagged as the Parameters cell. Parameters mentioned here are usually
#    passed by the user at the time of notebook execution.
# Ref: https://learn.microsoft.com/fabric/data-engineering/notebook-public-api#run-a-notebook-on-demand

# Control how to run the notebook - "all" for the entire notebook or "module" mode to use
#      this notebook like a module (main execution will be skipped). Useful when testing
#      with notebooks or when functions from this notebook need to be called from another notebook.
import configparser
from datetime import datetime

import otel_monitor_invoker as otel  # custom module part of env
from opentelemetry.trace import SpanKind
from opentelemetry.trace.status import StatusCode

# execution_mode = "module" skips the execution of the main function. Use it for module-like treatment.
#   "all" performs the execution as well.
execution_mode = "all"
# Helpful if the user wants to set a child process name etc. It is derived if not set by the user.
job_exec_instance = ""
# Helpful to derive any stage based globals
env_stage = "dev"
# Common config file path hosted on attached lakehouse - path relative to Files/
config_file_path = "sc-adls-main/config/application.cfg"

In [None]:
# # only need to run when developing the notebook to format the code
# import jupyter_black

# jupyter_black.load()

In [None]:
# Validate input parameters
in_errors = []
if execution_mode not in ["all", "module"]:
    in_errors.append(f"Invalid value: {execution_mode = }. It must be either 'all' or 'module'.")
if not notebookutils.fs.exists(f"Files/{config_file_path}"):
    in_errors.append(f"Specified config - `Files/{config_file_path}` doesn't exist.")

if in_errors:
    raise ValueError(f"Input parameter validation failed. Errors are:\n{in_errors}")
else:
    print("Input parameter verification completed successfully.")

### Network mounts

- Scope is set to Job/session - so these need to be run once per session


In [None]:
# -- Helps to read config files from onelake location
runtime_context = notebookutils.runtime.context
local_data_mount_path = f'{notebookutils.fs.getMountPath("/local_data")}'

### Read user provided config values
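The config is loaded with `ExtendedInterpolation`, and values missing from a section are looked up in `DEFAULT`. The following is a minimal sketch of both behaviours using an in-memory config. All section contents and key values here are hypothetical and are not the real `application.cfg`.

```python
import configparser

# Hypothetical config content; the real application.cfg keys and values may differ.
SAMPLE_CFG = """
[DEFAULT]
workspace_name = ws-parking-dev
service_version = 0.1

[keyvault]
name = kv-parking-dev
uri = https://${name}.vault.azure.net/

[transform]
process_name = nb-030-transform
service_name = parking-${process_name}
secrets_uri = ${keyvault:uri}
"""

config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
config.read_string(SAMPLE_CFG)

# Not defined under [transform], so the value comes from [DEFAULT].
print(config.get("transform", "workspace_name"))
# ${process_name} refers to a key in the same section.
print(config.get("transform", "service_name"))
# ${keyvault:uri} refers to a key in another section.
print(config.get("transform", "secrets_uri"))
```

Note that `config.read` skips files it cannot open and returns the list of files it actually parsed, so checking that return value catches a wrong path early.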


In [None]:
config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
config.read(f"{local_data_mount_path}/{config_file_path}")

In [None]:
# When using configparser, if a value is not present in the specified section, it is
#   read from the "DEFAULT" section.
config_section_name = "transform"
process_name = config.get(config_section_name, "process_name")
parking_ws = config.get(config_section_name, "workspace_name")
parking_ws_id = config.get(config_section_name, "workspace_id")
parking_lakehouse = config.get(config_section_name, "lakehouse_name")

# Add any other parameters that need to be read

### Internal (derived) parameters
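The derived values are a millisecond-precision timestamp and a job instance name that falls back to `<process_name>#<timestamp>` when the user supplies none. The following is a minimal sketch of that derivation with a fixed input so the result is reproducible. The helper names `millisecond_stamp` and `derive_job_instance` are hypothetical.

```python
from datetime import datetime, timezone


def millisecond_stamp(moment: datetime) -> str:
    # %f renders six microsecond digits; dropping the last three leaves milliseconds (truncated, not rounded).
    return moment.strftime("%Y%m%d%H%M%S%f")[:-3]


def derive_job_instance(user_value: str, process_name: str, stamp: str) -> str:
    # Keep the user-supplied name when present, otherwise derive one.
    return user_value if user_value else f"{process_name}#{stamp}"


fixed = datetime(2024, 11, 5, 13, 7, 9, 123456, tzinfo=timezone.utc)
stamp = millisecond_stamp(fixed)
print(stamp)  # 20241105130709123
print(derive_job_instance("", "nb-030-transform", stamp))  # nb-030-transform#20241105130709123
print(derive_job_instance("child-job-1", "nb-030-transform", stamp))  # child-job-1
```

For the live value, `datetime.now(timezone.utc)` is the timezone-aware alternative to `datetime.utcnow()`, which is deprecated as of Python 3.12.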

In [None]:
# default is micro-seconds, changing to milli-seconds
current_ts = datetime.utcnow().strftime("%Y%m%d%H%M%S%f")[:-3]
job_exec_instance = job_exec_instance if job_exec_instance else f"{process_name}#{current_ts}"
execution_user_name = runtime_context["userName"]

# Add any other parameters needed by the process

## Monitoring and observability

### AppInsights connection

In [None]:
connection_string = notebookutils.credentials.getSecret(
    config.get("keyvault", "uri"), config.get("otel", "appinsights_connection_name")
)
otlp_exporter = otel.OpenTelemetryAppInsightsExporter(conn_string=connection_string)

### Populate resource information
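OpenTelemetry attribute values are limited to strings, booleans, numbers, and homogeneous sequences of those, which is why the runtime context is converted to a string below. The following is a minimal sketch of a sanitizing step that could be applied before the attributes are handed to the exporter. The helper `sanitize_attributes` is hypothetical, and as a simplification it stringifies sequences too rather than validating their element types.

```python
from typing import Any, Dict

# Scalar types accepted as OpenTelemetry attribute values.
PRIMITIVES = (str, bool, int, float)


def sanitize_attributes(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Drop unset values and stringify anything that is not a scalar primitive."""
    clean: Dict[str, Any] = {}
    for key, value in raw.items():
        if value is None:
            # For example, a Spark conf key that is not set returns None.
            continue
        clean[key] = value if isinstance(value, PRIMITIVES) else str(value)
    return clean


raw_attributes = {
    "service.name": "parking-transform",
    "jobexec.context": {"activityId": "abc"},
    "jobexec.cluster.region": None,
    "retries": 3,
}
clean_attributes = sanitize_attributes(raw_attributes)
print(clean_attributes)
```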

In [None]:
# Resource references
# - Naming conventions: https://opentelemetry.io/docs/specs/semconv/general/attribute-naming/
# - For a complete list of reserved ones: https://opentelemetry.io/docs/concepts/semantic-conventions/
#  NOTE: service.namespace,service.name,service.instance.id triplet MUST be globally unique.
#     The ID helps to distinguish instances of the same service that exist at the same time
#     (e.g. instances of a horizontally scaled service)
resource_attributes = {
    # ---------- Reserved attribute names
    "service.name": config.get(config_section_name, "service_name"),
    "service.version": config.get(config_section_name, "service_version"),
    "service.namespace": "parking-sensor",
    "service.instance.id": notebookutils.runtime.context["activityId"],
    "process.executable.name": process_name,
    "deployment.environment": env_stage,
    # ---------- custom attributes - we can also add common attributes like appid, domain id etc
    #     here or get them from process reference data using processname as the key.
    # runtime context has a lot of useful info - adding it as is.
    "jobexec.context": f"{notebookutils.runtime.context}",  # convert to string otherwise it will fail
    "jobexec.cluster.region": spark.sparkContext.getConf().get("spark.cluster.region"),
    "jobexec.app.name": spark.sparkContext.getConf().get("spark.app.name"),
    "jobexec.instance.name": job_exec_instance,
}

# Typically, logging is performed within the context of a span.
#   This allows log messages to be associated with trace information through the use of trace IDs and span IDs.
#   As a result, it's generally not necessary to include resource information in log messages.
# Note that trace IDs and span IDs will be null when logging is performed outside of a span context.
log_attributes = {"jobexec.instance.name": job_exec_instance}
trace_attributes = resource_attributes

tracer = otlp_exporter.get_otel_tracer(trace_resource_attributes=trace_attributes, tracer_name=f"tracer-{process_name}")
logger = otlp_exporter.get_otel_logger(
    log_resource_attributes=log_attributes,
    logger_name=f"logger-{process_name}",
    add_console_handler=False,
)
logger.setLevel("INFO")  # default is WARN

## Code

### Code functions

- When using %run we can expose these functions to the calling notebook.
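
Exposing the functions without running the pipeline depends on the `execution_mode` guard. The following is a minimal sketch of that guard as a plain function. The helper `run_entrypoint` and the notebook name in the comment are hypothetical, and the `%run` parameter syntax in the comment is an assumption based on the linked Fabric documentation.

```python
from typing import Callable

VALID_MODES = ("all", "module")

# Hypothetical caller cell (a Fabric notebook magic, not plain Python):
#   %run nb-transform { "execution_mode": "module" }
# After that cell, functions defined in this notebook are available to the caller.


def run_entrypoint(execution_mode: str, entrypoint: Callable[[], None]) -> bool:
    """Run `entrypoint` only in "all" mode and report whether it ran."""
    if execution_mode not in VALID_MODES:
        raise ValueError(f"Invalid value: {execution_mode = }. It must be one of {VALID_MODES}.")
    if execution_mode == "all":
        entrypoint()
        return True
    return False


calls = []
ran_all = run_entrypoint("all", lambda: calls.append("main"))
ran_module = run_entrypoint("module", lambda: calls.append("main"))
print(ran_all, ran_module, calls)
```

In "module" mode the definitions are still loaded, which is what a separate test notebook would rely on to call individual functions.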

In [None]:
def get_lakehouse_details(lakehouse_name: str) -> dict:
    logger.info("Performing lakehouse existence check.")
    try:
        details = notebookutils.lakehouse.get(name=lakehouse_name)
    except Exception:
        logger.exception(f"Specified lakehouse - {lakehouse_name} doesn't exist. Aborting..")
        raise
    return details


# create functions as needed for standardization process
# (https://github.com/Azure-Samples/modern-data-warehouse-dataops/blob/feat/e2e-fabric-dataops-sample/e2e_samples/parking_sensors/databricks/notebooks/03_transform.py)
def dummy_function() -> None:

    # your code goes here
    pass

    return None

In [None]:
# Template for main function which calls all other functions
# Note that, there is a root span using OTEL library


def main() -> None:
    root_span_name = f"root#{process_name}#{current_ts}"

    with tracer.start_as_current_span(root_span_name, kind=SpanKind.INTERNAL) as root_span:
        try:
            root_span.add_event(
                name="010-verify-lakehouse",
                attributes={"lakehouse_name": parking_lakehouse},
            )
            lh_details = get_lakehouse_details(parking_lakehouse)
            # Always use absolute paths when referring to Onelake locations
            lh_table_path = f'{lh_details["properties"]["abfsPath"]}/Tables'
            lh_file_path = f'{lh_details["properties"]["abfsPath"]}/Files'
            print(lh_table_path, lh_file_path)

            # root_span.add_event(
            #     name="<<your second event>>", attributes={attributes in dictionary format for second event}
            # )
            # code function for second event

            # root_span.add_event(
            #     name="<<your nth event>>", attributes={attributes in dictionary format for nth event}
            # )
            # code for nth event

        except Exception as e:
            error_message = f"{process_name} process failed with error {e}"
            logger.exception(error_message)
            root_span.set_status(StatusCode.ERROR, error_message)
            root_span.record_exception(e)
            raise
        else:
            root_span.set_status(StatusCode.OK)
            logger.info(f"{process_name} process is successful.")
        finally:
            logger.info(f"\n** {process_name} process is complete. Check the logs for execution status. **\n\n")

    return None

### Code execution

In [None]:
# Apply logic here in case this notebook needs to be used as a library

# Log into AppInsights and verify the following for telemetry events produced by OpenTelemetry.
# dependencies
# | where name hasprefix "root#nb-030"
# //
# exceptions
# //
# traces


if execution_mode == "all":
    print(f"{execution_mode = }. Proceeding with the code execution.")
    main()
else:
    print(f"Skipping the main function execution as {execution_mode = } and running it like a code module.")