# Performing data-focused unit testing

This notebook shows the steps involved in data-focused unit testing of the ETL in [nb-city-safety.ipynb](../notebooks/nb-city-safety.ipynb). A sample code-centric unit test case is already part of that notebook.

See [DataTesting.md](../docs/DataTesting.md) for different ways to organize and run unit test cases.

## Formatting the Notebook - Run only when developing the Notebook

- Reference: https://learn.microsoft.com/en-us/fabric/data-engineering/author-notebook-format-code#extend-fabric-notebooks
- **WARNING**: Formatting with `jupyter_black` removes any *cell magic commands* present in a cell. Add them back after formatting.

```python
import jupyter_black
jupyter_black.load()
```


## Preparing data for testing - Test environment setup

Setting up the test environment is often a one-time activity, but the test data needs regular updates to account for new business scenarios. Other points to note:

- The data should cover all business scenarios.
- Keep the volume low so that run times stay short.
- Maintain data relations as much as possible, so that the same data set can be used for e2e runs (controlled data set).
- Identify the data environment cleanup steps needed in case of reruns.
- These steps can also be part of the code deployments from the repo.

In this example, we use the same source as the production code (Open Datasets for city safety data) to create controlled-volume data sets and store them in a storage account created for this purpose. During testing, we use the application's configurable settings to make these controlled data sets the source. This way we test the entire notebook (our component/unit under test) with minimal or no code patching. It also tests the integrations with other components and the configuration settings. More importantly, we can test the outputs to make sure they are in line with business requirements.


```python
import random
import json

# Base paths without a trailing slash, so the f-strings below do not produce "//".
target_adls = "abfss://citydatacontainer@azureopendatastoragedev.dfs.core.windows.net/Safety/Release"
wasbs_path = "wasbs://citydatacontainer@azureopendatastorage.blob.core.windows.net/Safety/Release"
source_counts = {}

for city in ("Boston", "Chicago", "NewYorkCity", "Seattle", "SanFrancisco"):
    data_df = spark.read.parquet(f"{wasbs_path}/city={city}")
    # sample() takes a fraction, so the resulting row count is only approximately num_records.
    num_records = random.randint(50, 500)
    sample_df = data_df.sample(False, num_records / data_df.count())
    city_target = f"{target_adls}/city={city}"
    sample_df.repartition(1).write.format("parquet").mode("overwrite").save(city_target)
    # Count what was actually written instead of re-evaluating the sample.
    source_counts[city] = spark.read.parquet(city_target).count()

# Store the per-city counts next to the data so tests can use them as expected values.
count_df = spark.createDataFrame(list(source_counts.items()), ["city", "count"])
count_df.repartition(1).write.format("json").mode("overwrite").save(
    f"{target_adls}/source_counts"
)  # need - `.option("multiLine", true)` during read
```
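`DataFrame.sample` takes a fraction rather than a row count, and the number of rows it returns is only approximately `fraction * total`. A minimal sketch of a helper that derives and bounds that fraction is shown below; `sample_fraction` is a hypothetical name and is not part of the original notebook.

```python
def sample_fraction(target_rows: int, total_rows: int) -> float:
    """Return the fraction to pass to DataFrame.sample for roughly `target_rows` rows.

    Hypothetical helper: the result is clamped to [0.0, 1.0] because sampling
    without replacement rejects fractions outside that range.
    """
    if total_rows <= 0:
        return 0.0
    return max(0.0, min(1.0, target_rows / total_rows))


# Example: aim for about 200 rows out of 1,000,000 source rows.
fraction = sample_fraction(200, 1_000_000)
print(fraction)
```

Clamping matters for small cities: if a source partition holds fewer rows than the requested sample size, the unclamped ratio would exceed 1.0.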


## Unit test - Data focus

```python
from unittest.mock import MagicMock, patch, call
import pytest
import ipytest

# Make the ipytest magics available; raise_on_error makes the notebook fail when a test fails.
ipytest.autoconfig(raise_on_error=True)
```

```python
# ------------------- Keep Only External parameters in this cell. --------------------------------------------
execution_mode = "normal"
onelake_name = "onelake"
env_stage = "dev"
log_level = "WARN"
config_file_path = f"{notebookutils.nbResPath}/builtin/city_safety.cfg"  # TO DO: Where are we reading this file from?
```
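To illustrate how a configurable source lets the tests run against the controlled data set without patching code, here is a minimal sketch using `configparser`. The section and key names (`prod`, `dev`, `source_path`) are assumptions for illustration; the actual layout of `city_safety.cfg` is not shown in this notebook.

```python
import configparser

# Hypothetical config content; the real city_safety.cfg may be organized differently.
SAMPLE_CFG = """
[prod]
source_path = wasbs://citydatacontainer@azureopendatastorage.blob.core.windows.net/Safety/Release

[dev]
source_path = abfss://citydatacontainer@azureopendatastoragedev.dfs.core.windows.net/Safety/Release
"""


def get_source_path(cfg_text: str, env_stage: str) -> str:
    """Return the source path configured for the given environment stage."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser[env_stage]["source_path"]


# With env_stage = "dev", the ETL would read the controlled test data.
print(get_source_path(SAMPLE_CFG, "dev"))
```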


### Load function definitions that need to be tested into current context

Things to note:

- `%run` currently doesn't accept variables as parameter values.
- For unit test cases that are part of nb-city-safety, test fixtures like `caplog` behave differently when that notebook is called from outside (`%run nb-city-safety { "execution_mode": "testing",....}`). So either move those test cases into this notebook or run the nb-city-safety notebook by itself when testing them.
- The built-in resources of the calling notebook are used during execution (even if the called notebook has its own built-in resources). See [run a notebook](https://learn.microsoft.com/fabric/data-engineering/author-execute-notebook#spark-session-configuration-magic-command) for details about `%run`.
- Only function definitions are loaded by `%run`, because the notebook has code in place to skip the execution portion. The actual execution is done in this notebook as part of our testing.
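The skip-execution behavior mentioned in the last bullet can be pictured as a guard around the entry point. The sketch below is an assumption about the general pattern, not the actual code in `nb-city-safety`; `main` and `run_if_requested` are stand-ins.

```python
def main() -> str:
    """Stand-in for the ETL entry point defined in the notebook under test."""
    return "etl executed"


def run_if_requested(execution_mode: str):
    """Run main() unless the notebook was loaded only for its definitions."""
    if execution_mode == "module":
        # Loaded through %run from a test notebook: define functions, do not execute.
        return None
    return main()


print(run_if_requested("module"))  # definitions only
print(run_if_requested("normal"))  # full execution
```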

```python
%run nb-city-safety { "execution_mode": "module", "job_exec_instance": "110_city_safety#20240808124161", "common_execution_mode": "normal", "env_stage": "dev", "config_file_path": "/synfs/nb_resource/builtin/city_safety.cfg", "param_override": "True"}
```

### Create unit tests

- Code-based unit tests are added in [tests/test_nb-city-safety-common.ipynb](../../tests/test_nb-city-safety-common.ipynb). This example demonstrates the scenario where unit tests live in a separate notebook.
- Data-based unit tests are shown below. This example demonstrates the scenario where unit tests live in the same notebook.

```python
from datetime import datetime, timezone

from pyspark.sql.functions import from_utc_timestamp


@pytest.mark.parametrize("cleanup_mode", [True, False])
def test_city_safety_main(cleanup_mode):
    # Validate data counts in target table based on the execution ids
    # start with no table first - use custom target table
    # append in second run

    # These are outside of main function in the script - so we are resetting them so that they accept our values
    #   Good practice is to make these as function parameters (unlike what we are doing here).
    global cleanup_flag, current_ts, job_exec_instance

    cleanup_flag = cleanup_mode  # Start with Overwrite mode and then test append mode
    test_start = datetime.now(timezone.utc)
    current_ts = test_start.strftime("%Y%m%d%H%M%S%f")
    # KQL todatetime() expects an ISO 8601-style string, which the compact current_ts is not,
    # so keep an ISO copy of the same instant for the time filters below.
    query_start_ts = test_start.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    job_exec_instance = f"110_city_safety#{current_ts}"  # make runs unique

    source_table_path = "abfss://citydatacontainer@azureopendatastoragedev.dfs.core.windows.net/Safety/Release/city=Boston"
    exp_df = spark.read.format("parquet").load(source_table_path)
    exp_job_exec_instance = job_exec_instance
    exp_user = mssparkutils.env.getUserName()
    target_table_path = "abfss://a046e3e0-d007-4d72-9c9f-53da44ba8c58@onelake.dfs.fabric.microsoft.com/a8df3bac-4b1c-4c29-a76c-93840de9831a/Tables/tbl_city_safety_data_test"
    columns = [
        "address",
        "category",
        "dataSubtype",
        "dataType",
        "dateTime",
        "latitude",
        "longitude",
        "source",
        "status",
        "subcategory",
        "extendedProperties",
    ]

    main()  # <--- code execution with no code patching

    act_df = spark.read.format("delta").load(target_table_path)
    act_df = act_df.filter(act_df.jobExecId == exp_job_exec_instance)

    # ----------- Validate unchanged columns ------------------------------
    exp_df = exp_df.select(columns)
    act_subset_df = act_df.select(columns)

    # if spark_major_version >= 3.5:
    #   assertDataFrameEqual(act_subset_df, exp_df)
    assert exp_df.count() == act_df.count()
    assert exp_df.schema == act_subset_df.schema
    assert exp_df.exceptAll(act_subset_df).count() == 0

    # -------- Additional columns validation ------------------------------
    # 'lastUpdateUTC' is hard to validate as it is evaluated at runtime.
    #    A code-based unit test is the best option for validating this one.
    #    Another option is to store the runtime value from the code somewhere
    #    and then read it from there.
    exp_additional_col_values = {
        "City": "Boston",
        "jobExecId": exp_job_exec_instance,
        "lastUpdateUser": exp_user,
    }
    act_additional_col_values = (
        act_df.select("City", "jobExecId", "lastUpdateUser")
        .distinct()
        .collect()[0]
        .asDict()
    )

    assert exp_additional_col_values == act_additional_col_values

    # ----------- Validate transformed column ------------------------------
    # - Can be done here or in the code-based unit tests,
    #     as there are no runtime values involved (unlike 'lastUpdateUTC')
    exp_date_df = exp_df.select("address", "dateTime")
    # using a different transformation instead of using the same logic in the main code
    act_date_df = (
        act_df.select("address", "dateTimeUTC")
        .withColumn(
            "dateTime", from_utc_timestamp(act_df.dateTimeUTC, "America/New_York")
        )
        .select("address", "dateTime")
    )
    assert exp_date_df.exceptAll(act_date_df).count() == 0

    # ------------- OTEL SDK for monitor - validate App Insights as the target -----------
    # This is part of the OTEL SDK sample as well as the OTEL Collector implementation
    # Ref: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/monitor/azure-monitor-query/samples/sample_single_log_query_without_pandas.py

    # ****NOTE: Queries use Log Analytics table names (not appInsights)

    from azure.identity import ClientSecretCredential, DefaultAzureCredential
    from azure.core.exceptions import HttpResponseError
    from azure.monitor.query import LogsQueryClient, LogsQueryStatus
    from datetime import timedelta

    credential = ClientSecretCredential(
        tenant_id="ab6f7679-cdcd-4084-b327-54012e772f32",
        client_id="c6f27d64-fdb4-4fac-b269-13dfd003e609",
        client_secret="TODO:<<read_from_key_vault>>",  # app-fabric-sguda-clientsecret
        resource="api.loganalytics.azure.com",
    )

    client = LogsQueryClient(credential)
    log_workspace_id = "80b89bbf-25de-4af9-99ea-07d424b1a21a"

    logs_query = f"AppTraces | where TimeGenerated > todatetime('{query_start_ts}') | where Message == 'City safety processing is complete.' | count"
    traces_query = f"AppDependencies | where TimeGenerated > todatetime('{query_start_ts}') | where Target == 'root#nb-safety#{current_ts}' | count"
    exceptions_query = f"AppExceptions | where TimeGenerated > todatetime('{query_start_ts}') | where OuterType == 'Exception' | where OuterMessage == 'ETL step failed with error Dummy failure on metrics gathering.' | count"
    metrics_query = (
        f"AppMetrics | where TimeGenerated > todatetime('{query_start_ts}') | count"
    )

    # query_app_insights is not defined in this cell; it is expected to come from
    # the notebook loaded with %run.
    for query in (logs_query, traces_query, exceptions_query, metrics_query):
        assert query_app_insights(log_workspace_id, query, 3000) >= 1

    # ------------- OTEL using collector - validate OpenTelemetry data in Fabric -------------------------------
    # otel_setup_type is expected to be set by the notebook loaded with %run.
    if otel_setup_type == "collector":

        # TO DO - Expand the example to include detailed validation of all the attributes (resource, trace etc.,)
        traces_query = f"OTELTraces | where SpanName == 'root#nb-safety#{current_ts}'"
        kusto_uri = "https://trd-25w48grmsgkdja2nn4.z7.kusto.fabric.microsoft.com"
        kusto_token = mssparkutils.credentials.getToken(kusto_uri)
        database = "oteldb"
        traces_df = (
            spark.read.format("com.microsoft.kusto.spark.synapse.datasource")
            .option("accessToken", kusto_token)
            .option("kustoCluster", kusto_uri)
            .option("kustoDatabase", database)
            .option("kustoQuery", traces_query)
            .load()
        )

        assert traces_df.count() == 1
        assert traces_df.select("SpanStatus").collect()[0][0] == "STATUS_CODE_OK"

        logs_query = f"""OTELLogs
            | where TraceID in (
                OTELTraces
                | where SpanName == "root#nb-safety#{current_ts}"
                | project TraceID
            )"""
        logs_df = (
            spark.read.format("com.microsoft.kusto.spark.synapse.datasource")
            .option("accessToken", kusto_token)
            .option("kustoCluster", kusto_uri)
            .option("kustoDatabase", database)
            .option("kustoQuery", logs_query)
            .load()
        )

        assert logs_df.count() >= 9
        assert logs_df.select("ResourceAttributes").distinct().collect()[
            0
        ].asDict() == {"ResourceAttributes": '{"service.name":"otel-poc-vm-based"}'}

        metrics_query = """OTELMetrics
            | where MetricType == "Sum"
            | where MetricName == "city-level-metrics"
            | where MetricAttributes["record_count_total"] == 267
            """
        metrics_df = (
            spark.read.format("com.microsoft.kusto.spark.synapse.datasource")
            .option("accessToken", kusto_token)
            .option("kustoCluster", kusto_uri)
            .option("kustoDatabase", database)
            .option("kustoQuery", metrics_query)
            .load()
        )

        assert metrics_df.count() >= 1
```
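
The three assertions on the unchanged columns (equal counts, equal schemas, empty `exceptAll`) amount to a multiset comparison: `exceptAll` keeps duplicates, so together with the count check it ensures both sides hold the same rows with the same multiplicities. The plain-Python sketch below mirrors that logic with `collections.Counter`; it only illustrates the idea and is not Spark code, and `rows_match` is a hypothetical name.

```python
from collections import Counter


def rows_match(expected: list, actual: list) -> bool:
    """Return True when both lists contain the same rows with the same multiplicities."""
    # Counter subtraction plays the role of expected.exceptAll(actual).
    missing = Counter(expected) - Counter(actual)
    return len(expected) == len(actual) and not missing


expected_rows = [("Boston", "Noise"), ("Boston", "Noise"), ("Boston", "Fire")]
duplicate_lost = [("Boston", "Noise"), ("Boston", "Fire"), ("Boston", "Fire")]

print(rows_match(expected_rows, list(reversed(expected_rows))))  # row order does not matter
print(rows_match(expected_rows, duplicate_lost))  # multiplicities differ
```

A set-based comparison (the analogue of `subtract` or `except`) would not notice the lost duplicate in the second case, which is why the test uses `exceptAll`.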

### Run the unit tests and capture the results

- Because `ipytest.autoconfig(raise_on_error=True)` was used at the beginning of this notebook, any failing test case will result in notebook failure.



```python
%%capture data_unit_tests_results
ipytest.run()
```

### Process the test results

- These can be stored somewhere or sent for further processing.

```python
store_unit_test_results(data_unit_tests_results)
```
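
The implementation of `store_unit_test_results` is not shown here. As one possible way to process the captured output, the sketch below extracts the outcome counts from pytest's final summary line so they can be stored as structured data. The function name `summarize_pytest_output` and the sample text are assumptions; with `%%capture`, the text would typically come from the `stdout` attribute of the captured object.

```python
import re


def summarize_pytest_output(text: str) -> dict:
    """Extract outcome counts such as {'failed': 1, 'passed': 2} from pytest output."""
    # pytest ends its report with a line such as "1 failed, 2 passed in 3.21s".
    summary_lines = [
        line for line in text.splitlines() if re.search(r" in \d+(\.\d+)?s", line)
    ]
    if not summary_lines:
        return {}
    counts = re.findall(
        r"(\d+) (passed|failed|errors?|skipped|xfailed|xpassed)", summary_lines[-1]
    )
    return {outcome: int(number) for number, outcome in counts}


# Illustrative output only; not taken from an actual run.
sample_output = """
..F                                                                      [100%]
FAILED t_abc.py::test_city_safety_main[True] - assert 0 >= 1
1 failed, 2 passed in 3.21s
"""
print(summarize_pytest_output(sample_output))
```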