# Delta lakes: creating a sample store

This notebook creates a simple dummy delta lake store that complies with the schema outlined in the eCR validation directory. It uses `pyspark` and `delta-spark` inside a virtual environment to create an empty table and insert a few dummy rows. It can also be run entirely outside a virtual environment, in which case only steps 3 and beyond need to be followed. To configure an environment for running this notebook, perform the following steps:

1. Create a python virtual environment in which to install dependencies and execute the notebook: `python -m venv .venv/`
2. Activate the virtual environment: `source .venv/bin/activate`
3. Install `jupyter` directly in the virtual environment: `pip install jupyter`. (It is possible to install kernelspec paths and manipulate environment variables instead, but in practice, installing `jupyter` directly in the environment, without relying on the system `site-packages`, makes importing modules the most seamless.)
4. Install the `delta-spark` package in the virtual environment: `pip install delta-spark`. This will automatically install a compatible version of `pyspark` already integrated with the `delta` configuration (version `3.3.2` for most installations, although the exact version depends on the `delta-spark` release that `pip` resolves).
5. From within the virtual environment, run `jupyter notebook` to launch the navigation page, then select this notebook. The default `Python 3 (ipykernel)` kernel should be displayed in the top right corner of the window, under the `Logout` button.

You can now successfully run this notebook from within your virtual environment!
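
Before launching the notebook, it can help to confirm that the packages from the steps above are visible to the interpreter. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical and is not part of this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names from steps 3 and 4 (pyspark is pulled in by delta-spark).
for name, ver in installed_versions(["jupyter", "delta-spark", "pyspark"]).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

If any entry prints `not installed`, the virtual environment is probably not the one that is currently active.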

## Creating the delta lake

First, we'll handle the imports and configuration of the spark builder and delta session (the packages typically assume an interactive running session, but we can use a pre-configured `builder` object to programmatically handle configuration for us).

In [None]:
import pyspark
from delta import configure_spark_with_delta_pip

builder = pyspark.sql.SparkSession.builder.appName("dibbs") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

With the packages imported and configured, we can create the schema we'll use to instantiate our delta lake. Schemas in `pyspark` (and therefore `delta-spark`) are defined as `StructType` objects, in which one or more `StructField` entries are specified with a name and data type. These can be used to create complex logic around acceptable input formats for the data, but for now, we'll create some simple string columns designed to hold the data specified in the DIBBS eCR validation schema (found in the message parsing service). After defining the schema, we'll use it to create a blank but formatted spark data frame that we can add to and eventually save as a delta lake.

In [None]:
from pyspark.sql.types import StructType, StructField, StringType

schema_fields = ["patient_id",
    "person_id",
    "last_name",
    "first_name",
    "rr_id",
    "status",
    "conditions",
    "eicr_id",
    "eicr_version_number",
    "authoring_datetime",
    "provider_id",
    "facility_id_number",
    "facility_name",
    "facility_type",
    "encounter_type",
    "encounter_start_date",
    "encounter_end_date",
    "active_problem_1",
    "active_problem_date_1",
    "active_problem_2",
    "active_problem_date_2",
    "active_problem_3",
    "active_problem_date_3",
    "active_problem_4",
    "active_problem_date_4",
    "active_problem_5",
    "active_problem_date_5",
    "reason_for_visit",
    "test_type_1",
    "test_result_1",
    "test_result_interp_1",
    "specimen_type_1",
    "performing_lab_1",
    "specimen_collection_date_1",
    "result_date_1",
    "test_type_2",
    "test_result_2",
    "test_result_interp_2",
    "specimen_type_2",
    "performing_lab_2",
    "specimen_collection_date_2",
    "result_date_2",
    "test_type_3",
    "test_result_3",
    "test_result_interp_3",
    "specimen_type_3",
    "performing_lab_3",
    "specimen_collection_date_3",
    "result_date_3",
    "test_type_4",
    "test_result_4",
    "test_result_interp_4",
    "specimen_type_4",
    "performing_lab_4",
    "specimen_collection_date_4",
    "result_date_4",
    "test_type_5",
    "test_result_5",
    "test_result_interp_5",
    "specimen_type_5",
    "performing_lab_5",
    "specimen_collection_date_5",
    "result_date_5",
    "test_type_6",
    "test_result_6",
    "test_result_interp_6",
    "specimen_type_6",
    "performing_lab_6",
    "specimen_collection_date_6",
    "result_date_6",
    "test_type_7",
    "test_result_7",
    "test_result_interp_7",
    "specimen_type_7",
    "performing_lab_7",
    "specimen_collection_date_7",
    "result_date_7",
    "test_type_8",
    "test_result_8",
    "test_result_interp_8",
    "specimen_type_8",
    "performing_lab_8",
    "specimen_collection_date_8",
    "result_date_8",
    "test_type_9",
    "test_result_9",
    "test_result_interp_9",
    "specimen_type_9",
    "performing_lab_9",
    "specimen_collection_date_9",
    "result_date_9",
    "test_type_10",
    "test_result_10",
    "test_result_interp_10",
    "specimen_type_10",
    "performing_lab_10",
    "specimen_collection_date_10",
    "result_date_10",
    "test_type_11",
    "test_result_11",
    "test_result_interp_11",
    "specimen_type_11",
    "performing_lab_11",
    "specimen_collection_date_11",
    "result_date_11",
    "test_type_12",
    "test_result_12",
    "test_result_interp_12",
    "specimen_type_12",
    "performing_lab_12",
    "specimen_collection_date_12",
    "result_date_12",
    "test_type_13",
    "test_result_13",
    "test_result_interp_13",
    "specimen_type_13",
    "performing_lab_13",
    "specimen_collection_date_13",
    "result_date_13",
    "test_type_14",
    "test_result_14",
    "test_result_interp_14",
    "specimen_type_14",
    "performing_lab_14",
    "specimen_collection_date_14",
    "result_date_14",
    "test_type_15",
    "test_result_15",
    "test_result_interp_15",
    "specimen_type_15",
    "performing_lab_15",
    "specimen_collection_date_15",
    "result_date_15",
    "test_type_16",
    "test_result_16",
    "test_result_interp_16",
    "specimen_type_16",
    "performing_lab_16",
    "specimen_collection_date_16",
    "result_date_16",
    "test_type_17",
    "test_result_17",
    "test_result_interp_17",
    "specimen_type_17",
    "performing_lab_17",
    "specimen_collection_date_17",
    "result_date_17",
    "test_type_18",
    "test_result_18",
    "test_result_interp_18",
    "specimen_type_18",
    "performing_lab_18",
    "specimen_collection_date_18",
    "result_date_18",
    "test_type_19",
    "test_result_19",
    "test_result_interp_19",
    "specimen_type_19",
    "performing_lab_19",
    "specimen_collection_date_19",
    "result_date_19",
    "test_type_20",
    "test_result_20",
    "test_result_interp_20",
    "specimen_type_20",
    "performing_lab_20",
    "specimen_collection_date_20",
    "result_date_20",
]

schema_cols = [StructField(f, StringType(), True) for f in schema_fields]

# incident_id can't be null for subsequent joins; iris_id is also declared non-nullable
schema_cols.append(StructField("incident_id", StringType(), False))
schema_cols.append(StructField("iris_id", StringType(), False))
schema = StructType(schema_cols)

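# Start from an empty data frame that carries only the schema
df = spark.createDataFrame([], schema)
# Register it as a delta table named test_delta_lake
df.write.format("delta").mode("overwrite").saveAsTable("test_delta_lake")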
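
Most of the column names listed above follow a numbered pattern: five active-problem pairs and twenty groups of seven lab-test columns. The following is a minimal sketch of generating the same names programmatically instead of typing them out; the variable names (`base_fields`, `problem_fields`, `test_prefixes`, `generated_fields`) are illustrative and are not used elsewhere in this notebook.

```python
# Columns that appear once, in the order used by the schema above.
base_fields = [
    "patient_id", "person_id", "last_name", "first_name", "rr_id", "status",
    "conditions", "eicr_id", "eicr_version_number", "authoring_datetime",
    "provider_id", "facility_id_number", "facility_name", "facility_type",
    "encounter_type", "encounter_start_date", "encounter_end_date",
]

# active_problem_1, active_problem_date_1, ..., active_problem_date_5
problem_fields = [
    f"{prefix}_{i}"
    for i in range(1, 6)
    for prefix in ("active_problem", "active_problem_date")
]

# Seven columns per lab test, repeated for tests 1 through 20.
test_prefixes = [
    "test_type", "test_result", "test_result_interp", "specimen_type",
    "performing_lab", "specimen_collection_date", "result_date",
]
test_fields = [f"{prefix}_{i}" for i in range(1, 21) for prefix in test_prefixes]

generated_fields = base_fields + problem_fields + ["reason_for_visit"] + test_fields
print(len(generated_fields))
```

A list built this way could be compared against `schema_fields` with `==` before it is used in place of the hand-written version.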

With the table created and properly formatted with our desired columns, we can create two dummy rows, build a data frame from them, and add them to the empty data frame with a union:
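
Each row has to supply a value for every column, in schema order, which is why the cell that follows spells out long runs of empty strings. As an alternative, here is a minimal sketch of a hypothetical helper, `make_row`, that fills in only the known values by name and defaults everything else to `""`. The helper and the shortened `example_columns` list are assumptions for illustration, not part of this notebook.

```python
def make_row(columns, values):
    """Return a one-row list [(...)] with "" for every column missing from values."""
    unknown = set(values) - set(columns)
    if unknown:
        raise KeyError(f"columns not in schema: {sorted(unknown)}")
    return [tuple(values.get(col, "") for col in columns)]


# A shortened, illustrative column list; in the notebook this would be the
# schema's full column order (for example, schema.fieldNames()).
example_columns = ["patient_id", "last_name", "first_name", "status", "incident_id"]

row = make_row(
    example_columns,
    {
        "patient_id": "2c6d5fd1-4a70-11eb-99fd-ad786a821574",
        "last_name": "Shepard",
        "incident_id": "123456789",
    },
)
print(row)
```

Rejecting unknown keys catches misspelled column names early, which is hard to do when values are matched to columns by position alone.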

In [None]:
# Note the tuple wrapped in a list: spark expects rows in the form [(...,)]
# so that each tuple can be matched against the struct fields

row_1 = [("2c6d5fd1-4a70-11eb-99fd-ad786a821574", "a81bc81b-dead-4e5d-abff-90865d1e13b1", "Shepard", 
         "John", "12-34-56-78", "12", "", "11111111", "2", "", "999", "1", "Huerta Memorial Hospital", 
         "", "encounter", "2021-08-14", "2021-08-16", "", "", "", "", "", "", "", "", "", "", "physical", 
         "", "", "", "", "", "", "", 
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "", "", "", "" ,"", "", "",
         "123456789","iris_id1")]

row_2 = [("2fdd0b8b-4a70-11eb-99fd-ad786a821574", "a81bc81b-dead-4e5d-abff-90865d1e13b1", "Anderson", 
         "David", "97-56-4862", "24", "", "99999", "168", "2022-12-12", "6d8e9s-98w7szz", "84yfd3556d", 
          "Sunset Strip", "outpatient", "OKI", "2022-12-02", "2022-12-11", "arthritis", "2020-10-10", 
          "", "", "", "", "", "", "", "", "concern", 
          "degenerative disk test", "positive", "patient has a bad back", "vertebrae fluid", "Easy Pete's Discount Disk Checks", "2022-10-10", "",
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "",
          "", "", "", "", "", "", "",
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "",
          "", "", "", "", "", "", "",
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "",
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "", 
          "", "", "", "", "", "", "",
          "", "", "", "", "", "", "",
          "987654321","iris_id2")]

# Combine the two single-row lists into one list of row tuples
rows = row_1 + row_2

# Accepted practice for row appending is union
new_rows = spark.createDataFrame(rows, schema)
df = df.union(new_rows)
df.show()

# Persist the populated data frame in delta format at the test_delta_lake path
df.write.format("delta").mode("overwrite").save("test_delta_lake")
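
To confirm that the write succeeded, the saved files can be read back. The following is a minimal sketch, assuming the `spark` session from this notebook is still active; the helper name `count_delta_rows` is hypothetical.

```python
def count_delta_rows(spark, path):
    """Load the delta table stored at `path` and return its row count."""
    return spark.read.format("delta").load(path).count()


# Example call (requires the live session created earlier in this notebook):
# count_delta_rows(spark, "test_delta_lake")
```

If the cells above ran unchanged, the call in the comment should report the two dummy rows.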