##### **Authorship**

This code is based on the [Refine_EdFi.ipynb](https://github.com/EdWire/OpenEduAnalytics/blob/feature/saas_deploy/modules/module_catalog/Ed-Fi/notebook/Refine_EdFi.ipynb) notebook (hereafter the "**Original Referenced Code**"), originally authored by [Abhinav](https://github.com/Abhinavgundapaneni).

[Viraj Jayant](https://github.com/virajjayant-neenopal) has edited and modified the Original Referenced Code from the forked repository to add further customizations.

##### **Major Changes**

Here are the major changes Viraj Jayant made to the Original Referenced Code:

1. Leveraging **partitioning** within the `upsert`, `overwrite`, and `append` functions; these base functions are modified in the OEA class accordingly.
2. Extending the Original Referenced Code to handle Ed-Fi extensions when an `_ext` column is present (**extensions such as TPDM, TX, etc.**).
3. Changing the `add_to_lake_db` OEA base function to accept an optional overwrite mode and to support references to the extension tables (suffixed `_tx`, `_tpdm`, etc.).
4. Running ETL for specific entities only when `parameterized` is set to `True` (also implemented in the Ingestion and Landing notebooks).
5. Making the codebase generally more compliant with OEA than before.
6. Editing the OEAUtils function `create_spark_schemas_from_definitions` to check whether `x-Ed-Fi-explode` is present.
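
Changes 2 and 3 can be pictured with a small plain-Python sketch of how an extension's output location and table suffix are derived. It mirrors the `replace('/ed-fi/', ...)` and `f"_{ext_entity.lower()}"` expressions used in `transform` later in this notebook. The helper name `extension_targets` is hypothetical, and how `add_to_lake_db` consumes the suffix is internal to the OEA class and not shown here.

```python
def extension_targets(sink_general_path, ext_entity):
    """Return (extension path, table suffix) for an extension such as 'TPDM' or 'TX'."""
    ext = ext_entity.lower()
    # Swap the core 'ed-fi' schema folder for the extension's folder.
    # The match is case-sensitive, so the 'Ed-Fi' module folder is untouched.
    ext_path = sink_general_path.replace('/ed-fi/', f'/{ext}/')
    return ext_path, f'_{ext}'

path, suffix = extension_targets('stage2/Refined/Ed-Fi/5.2/ed-fi/general/staffs', 'TPDM')
print(path)    # stage2/Refined/Ed-Fi/5.2/tpdm/general/staffs
print(suffix)  # _tpdm
```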

##### **Additional Notes**
Note that both the Original Referenced Code and this notebook place the partition folders after the entity in the directory path, unlike the original OEA layout. Consider the following example:
1. **Original OEA**: stage2/Refined/Ed-Fi/5.2/`DistrictId=All/SchoolYear=All/ed-fi/general/weaponDescriptors`
2. **New Directory**: stage2/Refined/Ed-Fi/5.2/`ed-fi/general/weaponDescriptors/DistrictId=All/SchoolYear=All`  
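
The difference between the two layouts can be illustrated with a small path helper. This is only a sketch: `move_partitions_after_entity` is a hypothetical name, not a function from the OEA framework, and it assumes the partition segments are exactly `DistrictId=...` and `SchoolYear=...` and are never the first segment of the path.

```python
import re

# Matches one partition segment such as "/DistrictId=All" (assumed naming).
PARTITION_SEGMENT = re.compile(r'/((?:DistrictId|SchoolYear)=[^/]*)')

def move_partitions_after_entity(path):
    """Move DistrictId=/SchoolYear= segments to the end of the path."""
    partitions = PARTITION_SEGMENT.findall(path)   # partition segments, in original order
    entity_path = PARTITION_SEGMENT.sub('', path)  # path with the partition segments removed
    return '/'.join([entity_path] + partitions)

original = 'stage2/Refined/Ed-Fi/5.2/DistrictId=All/SchoolYear=All/ed-fi/general/weaponDescriptors'
print(move_partitions_after_entity(original))
# stage2/Refined/Ed-Fi/5.2/ed-fi/general/weaponDescriptors/DistrictId=All/SchoolYear=All
```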

##### **Specific Code Reference**

The Original Referenced Code is taken from the following location:
[Original Code Link](https://github.com/EdWire/OpenEduAnalytics/blob/feature/saas_deploy/modules/module_catalog/Ed-Fi/notebook/Refine_EdFi.ipynb)

**Viraj Jayant** has made edits and customizations to this code to suit the project's needs.

In [28]:
%run /edfi_fetch_urls

StatementMeta(, 32, -1, Finished, Available)

In [29]:
import copy
import re
import pyspark.sql.functions as f
from pyspark.sql.utils import AnalysisException

StatementMeta(spark3p3sm, 32, 49, Finished, Available)

In [30]:
districtPath = districtId if districtId is not None else "All"
schoolYearPath = schoolYear if schoolYear is not None else "All"
swagger_url = swaggerUrl

parameterized = False

StatementMeta(spark3p3sm, 32, 50, Finished, Available)

In [31]:
%run edfi_py

StatementMeta(, 32, -1, Finished, Available)

2023-10-20 09:19:51,599 - OEA - INFO - Now using workspace: dev
2023-10-20 09:19:51,600 - OEA - INFO - OEA initialized.
2023-10-20 09:19:52,279 - OEA - INFO - Now using workspace: dev
2023-10-20 09:19:53,446 - OEA - INFO - Now using workspace: dev
2023-10-20 09:19:53,447 - OEA - INFO - OEA initialized.
2023-10-20 09:19:53,447 - OEA - INFO - minChangeVersion=None and maxChangeVersion=None
2023-10-20 09:19:53,607 - OEA - INFO - failed to retrieve clientId and clientSecret from keyvault with exception: An error occurred while calling z:com.microsoft.azure.synapse.tokenlibrary.TokenLibrary.getSecret.
: com.microsoft.azure.synapse.tokenlibrary.utils.AkvUtils$KeyVaultException: Azure Key Vault returned error code 'SecretNotFound' with message 'A secret with (name/id) edfi-clientid was not found in this key vault. If you recently deleted this secret you may be able to recover it using the correct recovery command. For help resolving this issue, please see https://go.microsoft.com/fwlink/?link

In [32]:
if parameterized:
    edfiEntitiesPath = f'stage1/Transactional/{moduleName}/{apiVersion}/DistrictId={districtPath}/SchoolYear={schoolYearPath}/etl_entities/current_run_data'

    _, edfiEntities = edfi.listSpecifiedEntities(edfiEntitiesPath)
else:
    edfiEntities = "All"  

StatementMeta(spark3p3sm, 32, 58, Finished, Available)

In [33]:
# 1) set the workspace (this determines where in the data lake you'll be writing to and reading from).
# You can work in 'dev', 'prod', or a sandbox with any name you choose.
# For example, Sam the developer can create a 'sam' workspace and expect to find his datasets in the data lake under oea/sandboxes/sam
workspace = 'dev'
oea.set_workspace(workspace)

StatementMeta(spark3p3sm, 32, 59, Finished, Available)

2023-10-20 09:19:54,816 - OEA - INFO - Now using workspace: dev


In [34]:
schema_gen = OpenAPIUtil(swagger_url)
schemas = schema_gen.create_spark_schemas()
primitive_datatypes = ['timestamp', 'date', 'decimal', 'boolean', 'integer', 'string', 'long']

StatementMeta(spark3p3sm, 32, 60, Finished, Available)

localEducationAgencyReference


In [35]:
def get_descriptor_schema(descriptor):
    fields = []
    fields.append(StructField('_etag',LongType(), True))
    fields.append(StructField(f"{descriptor[:-1]}Id", IntegerType(), True))
    fields.append(StructField('codeValue',StringType(), True))
    fields.append(StructField('description',StringType(), True))
    fields.append(StructField('id',StringType(), True))
    fields.append(StructField('namespace',StringType(), True))
    fields.append(StructField('shortDescription',StringType(), True))
    return StructType(fields)

def get_descriptor_metadata(descriptor):
    return [['_etag', 'long', 'no-op'],
            [f"{descriptor[:-1]}Id", 'integer', 'hash'],
            ['codeValue','string', 'no-op'],
            ['description','string', 'no-op'],
            ['id','string', 'no-op'],
            ['namespace','string', 'no-op'],
            ['shortDescription','string', 'no-op']]

StatementMeta(spark3p3sm, 32, 61, Finished, Available)
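
The `descriptor[:-1]` slice in both helpers derives the key column name from the plural resource name. A minimal plain-Python sketch of that naming rule follows. `descriptor_id_column` is a hypothetical helper introduced only for illustration, and the rule assumes every descriptor resource name ends in a single plural "s".

```python
def descriptor_id_column(descriptor):
    """Derive the key column name used by the descriptor helpers above."""
    # Drop the trailing plural "s", then append "Id".
    return f"{descriptor[:-1]}Id"

for name in ['weaponDescriptors', 'absenceEventCategoryDescriptors']:
    print(name, '->', descriptor_id_column(name))
```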

In [36]:
def has_column(df, col):
    try:
        df[col]
        return True
    except AnalysisException:
        return False

def modify_descriptor_value(df, col_name):
    if col_name in df.columns:
        # TODO: @Abhinav, I do not see where you made the changes to use the descriptorId instead of Namespace/CodeValue
        df = df.withColumn(f"{col_name}LakeId", f.concat_ws('_', f.col('DistrictId'), f.col('SchoolYear'), f.regexp_replace(col_name, '#', '_')))
        df = df.drop(col_name)
    else:
        df = df.withColumn(f"{col_name}LakeId", f.lit(None).cast("String"))

    return df

def flatten_reference_col(df, target_col):
    col_prefix = target_col.name.replace('Reference', '')
    df = df.withColumn(f"{col_prefix}LakeId", f.when(f.col(target_col.name).isNotNull(), f.concat_ws('_', f.col('DistrictId'), f.col('SchoolYear'), f.split(f.col(f'{target_col.name}.link.href'), '/').getItem(3))))
    df = df.drop(target_col.name)
    return df

def modify_references_and_descriptors(df, target_col):
    for ref_col in [x for x in df.columns if re.search('Reference$', x) is not None]:
        df = flatten_reference_col(df, target_col.dataType.elementType[ref_col])
    for desc_col in [x for x in df.columns if re.search('Descriptor$', x) is not None]:
        df = modify_descriptor_value(df, desc_col)
    return df

def explode_arrays(df, sink_general_path,target_col, schema_name, table_name, extension = None):
    # TODO: Assess if LastModifiedDate inclusion breaks ETL or not
    try:
        cols = ['lakeId', 'DistrictId', 'SchoolYear', 'LastModifiedDate']
        child_df = df.select(cols + [target_col.name])
    except AnalysisException:
        # LastModifiedDate is not present; fall back to the key columns only
        cols = ['lakeId', 'DistrictId', 'SchoolYear']
        child_df = df.select(cols + [target_col.name])
    child_df = child_df.withColumn("exploded", f.explode(target_col.name)).drop(target_col.name).select(cols + ['exploded.*'])

    # TODO: We should use LakeId suffix when using the "id" column from the parent and HKey suffix when creating a Hash Key based on composite key columns
    # Collect the identity columns in a stable order. sorted() returns the list, whereas list.sort()
    # returns None, which would skip the {target_col.name}LakeId column below.
    identity_cols = sorted(x.name for x in target_col.dataType.elementType.fields if 'x-Ed-Fi-isIdentity' in x.metadata)
    if identity_cols:
        child_df = child_df.withColumn(f"{target_col.name}LakeId", f.concat(f.col('DistrictId'), f.lit('_'), f.col('SchoolYear'), f.lit('_'), *[f.concat(f.col(x), f.lit('_')) for x in identity_cols]))
    
    # IMPORTANT: We must modify Reference and Descriptor columns for child columns "first". 
    # This must be done "after" the composite key from identity_cols has been created otherwise the columns are renamed and will not be found by identity_cols.
    # This must be done "before" the grand_child is exploded below
    child_df = modify_references_and_descriptors(child_df, target_col)

    for array_sub_col in [x for x in target_col.dataType.elementType.fields if x.dataType.typeName() == 'array' ]:
        grand_child_df = child_df.withColumn('exploded', f.explode(array_sub_col.name)).select(child_df.columns + ['exploded.*']).drop(array_sub_col.name)
        
        # Modifying Reference and Descriptor columns for the grand_child array
        grand_child_df = modify_references_and_descriptors(grand_child_df, array_sub_col)

        logger.info(f"Writing Grand Child Table - {table_name}_{target_col.name}_{array_sub_col.name}")
        oea.upsert(df = grand_child_df, 
                   destination_path = f"{sink_general_path}_{target_col.name}_{array_sub_col.name}", 
                   primary_key = 'lakeId',
                   partitioning = True,
                   partitioning_cols = ['DistrictId', 'SchoolYear']) 
        oea.add_to_lake_db(source_entity_path = f"{sink_general_path}_{target_col.name}_{array_sub_col.name}", 
                           overwrite = True,
                           extension = extension)
        #grand_child_df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(oea.to_url(f"{sink_general_path}_{target_col.name}_{array_sub_col.name}"))

    logger.info(f"Writing Child Table - {table_name}_{target_col.name}")
    oea.upsert(df = child_df, 
               destination_path = f"{sink_general_path}_{target_col.name}", 
               primary_key = 'lakeId',
               partitioning = True,
               partitioning_cols = ['DistrictId', 'SchoolYear']) 
    oea.add_to_lake_db(source_entity_path = f"{sink_general_path}_{target_col.name}",
                       overwrite = True,
                       extension = extension)
    #child_df.write.format('delta').mode('overwrite').option('overwriteSchema', 'true').save(oea.to_url(f"{sink_general_path}_{target_col.name}"))

    # Drop array column from parent entity
    df = df.drop(target_col.name)
    return df

def transform(df, 
              schema_name, 
              table_name, 
              primary_key,
              ext_entity,
              sink_general_path,
              parent_schema_name, 
              parent_table_name):
    if re.search('Descriptors$', table_name) is None:
        # Use Deep Copy otherwise the schemas object also gets modified every time target_schema is modified
        target_schema = copy.deepcopy(schemas[table_name])
        # Add primary key
        if has_column(df, primary_key):
            df = df.withColumn('lakeId', f.concat_ws('_', f.col('DistrictId'), f.col('SchoolYear'), f.col(primary_key)).cast("String"))
        else:
            df = df.withColumn('lakeId', f.lit(None).cast("String"))
    else:
        target_schema = get_descriptor_schema(table_name)
        # Add primary key
        if has_column(df, 'codeValue') and has_column(df, 'namespace'):
            # TODO: @Abhinav, I do not see where you made the changes to use the descriptorId instead of Namespace/CodeValue
            df = df.withColumn('lakeId', f.concat_ws('_', f.col('DistrictId'), f.col('SchoolYear'), f.col('namespace'), f.col('codeValue')).cast("String"))
        else:
            df = df.withColumn('lakeId', f.lit(None).cast("String"))

    target_schema = target_schema.add(StructField('DistrictId', StringType()))\
                                 .add(StructField('SchoolYear', StringType()))\
                                 .add(StructField('LastModifiedDate', TimestampType()))

    df = transform_sub_module(df, target_schema, sink_general_path, schema_name, table_name)
    logger.info(f"Writing Main Table - {table_name}")
    oea.upsert(df = df, 
               destination_path = f"{sink_general_path}", 
               primary_key = 'lakeId',
               partitioning = True,
               partitioning_cols = ['DistrictId', 'SchoolYear']) 
    oea.add_to_lake_db(source_entity_path = sink_general_path, 
                       overwrite = True,
                       extension = None)

    if '_ext' in df.columns:
        target_schema = get_ext_entities_schemas(table_name = table_name,
                                                 ext_column_name = '_ext',
                                                 default_value = ext_entity)
        df = flatten_ext_column(df = df, 
                                table_name = table_name, 
                                ext_col = '_ext', 
                                inner_key = ext_entity,
                                ext_inner_cols = target_schema.fieldNames())
        sink_general_path = sink_general_path.replace('/ed-fi/', f'/{ext_entity.lower()}/')
        df = transform_sub_module(df, 
                                  target_schema, 
                                  sink_general_path, 
                                  schema_name,
                                  table_name,
                                  extension = f"_{ext_entity.lower()}")

        logger.info(f"Writing EXT Table - {table_name}")
        oea.upsert(df = df, 
                   destination_path = f"{sink_general_path}", 
                   primary_key = 'lakeId',
                   partitioning = True,
                   partitioning_cols = ['DistrictId', 'SchoolYear']) 
        oea.add_to_lake_db(sink_general_path, 
                           overwrite = True,
                           extension = f"_{ext_entity.lower()}")
        
def transform_sub_module(df, target_schema, sink_general_path, schema_name, table_name, extension = None):
    for col_name in target_schema.fieldNames():
        target_col = target_schema[col_name]
        # If Primitive datatype, i.e. String, Bool, Integer, etc.
        # Note: Descriptor is a String therefore is a Primitive datatype
        if target_col.dataType.typeName() in primitive_datatypes:
            # If it is a Descriptor
            if re.search('Descriptor$', col_name) is not None:
                df = modify_descriptor_value(df, col_name)
            else:
                if col_name in df.columns:
                    # Casting columns to primitive data types
                    df = df.withColumn(col_name, f.col(col_name).cast(target_col.dataType))
                else:
                    # If Column not present in dataframe, add column with None values.
                    df = df.withColumn(col_name, f.lit(None).cast(target_col.dataType))
        # If Complex datatype, i.e. Object, Array
        else:
            if col_name not in df.columns:
                df = df.withColumn(col_name, f.lit(None).cast(target_col.dataType))
            else:
                # Generate JSON column as a Complex Type
                df = df.withColumn(f"{col_name}_json", f.to_json(f.col(col_name))) \
                    .withColumn(col_name, f.from_json(f.col(f"{col_name}_json"), target_col.dataType)) \
                    .drop(f"{col_name}_json")
            
            # Modify the links with surrogate keys
            if re.search('Reference$', col_name) is not None:
                df = flatten_reference_col(df, target_col)
    
            if target_col.dataType.typeName() == 'array':
                df = explode_arrays(df, sink_general_path,target_col, schema_name, table_name, extension = extension)
    return df

StatementMeta(spark3p3sm, 32, 62, Finished, Available)
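
The surrogate keys built in the cell above follow one pattern: `DistrictId`, `SchoolYear`, and a resource id joined with underscores. The sketch below reproduces that pattern for a single row in plain Python, without Spark. The function names and the sample reference are made up for illustration, and the href is assumed to have the form `/ed-fi/<resource>/<id>`, which is what `getItem(3)` in `flatten_reference_col` relies on.

```python
def lake_id(district_id, school_year, resource_id):
    """Plain-Python analogue of f.concat_ws('_', DistrictId, SchoolYear, id)."""
    parts = [district_id, school_year, resource_id]
    # Spark's concat_ws skips null inputs, so None parts are dropped here too.
    return '_'.join(str(p) for p in parts if p is not None)

def reference_lake_id(reference, district_id, school_year):
    """Mimic flatten_reference_col for one row whose reference is a dict."""
    if reference is None:
        # f.when(...) without otherwise() yields null for a missing reference.
        return None
    # '/ed-fi/schools/<id>'.split('/') -> ['', 'ed-fi', 'schools', '<id>']
    resource_id = reference['link']['href'].split('/')[3]
    return lake_id(district_id, school_year, resource_id)

# Hypothetical sample reference, not taken from real data.
school_reference = {'schoolId': 255901,
                    'link': {'rel': 'School', 'href': '/ed-fi/schools/0a1b2c'}}
print(reference_lake_id(school_reference, 'All', '2023'))  # All_2023_0a1b2c
```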

In [37]:
def get_ext_entities_schemas(table_name = 'staffs',
                             ext_column_name = '_ext',
                             default_value = 'TPDM'):
    target_schema = copy.deepcopy(schemas[table_name])
    for col_name in target_schema.fieldNames():
        target_col = target_schema[col_name]
        if target_col.name == ext_column_name:
            if target_col.dataType[0].name == default_value:
                return target_col.dataType[0].dataType         
                
def flatten_ext_column(df, 
                       table_name, 
                       ext_col, 
                       inner_key,
                       ext_inner_cols
                       ):
    # TODO: Assess if LastModifiedDate inclusion breaks ETL or not
    cols = ['lakeId', 'DistrictId', 'SchoolYear', 'id_pseudonym', 'LastModifiedDate']
    flattened_cols = ext_inner_cols#["educatorPreparationPrograms"] #_ext_TX_cols[table_name]
    dict_col = F.col(ext_col)[inner_key]
    complex_dtype_text = str(df.select(ext_col).dtypes[0][1])

    exprs = [dict_col.getItem(key).alias(key) for key in flattened_cols if str(key) in complex_dtype_text]
    try:
        flattened_df = df.select(exprs + cols)
    except AnalysisException:
        # LastModifiedDate is not present; fall back to the key columns only
        cols = ['lakeId', 'DistrictId', 'SchoolYear', 'id_pseudonym']
        flattened_df = df.select(exprs + cols)
    return flattened_df

StatementMeta(spark3p3sm, 32, 63, Finished, Available)
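
The flattening step can be pictured on a single row held as a plain Python dict. This is a sketch rather than the Spark implementation: `flatten_ext` is a hypothetical name, the sample row and its values are made up for illustration, and only the field name `educatorPreparationPrograms` comes from the comment in the cell above.

```python
def flatten_ext(row, inner_key, ext_inner_cols,
                key_cols=('lakeId', 'DistrictId', 'SchoolYear')):
    """Lift the fields under row['_ext'][inner_key] to the top level."""
    ext = (row.get('_ext') or {}).get(inner_key) or {}
    # Keep only the schema columns that are actually present in the extension.
    flat = {col: ext[col] for col in ext_inner_cols if col in ext}
    # Carry the key columns over so the result can be joined back to the parent.
    flat.update({col: row.get(col) for col in key_cols})
    return flat

# Hypothetical sample row.
staff_row = {
    'lakeId': 'All_2023_abc', 'DistrictId': 'All', 'SchoolYear': '2023',
    'firstName': 'Sample',
    '_ext': {'TPDM': {'educatorPreparationPrograms': [{'programName': 'Sample Program'}]}},
}
flat = flatten_ext(staff_row, 'TPDM', ['educatorPreparationPrograms', 'notInThisRow'])
print(sorted(flat))
# ['DistrictId', 'SchoolYear', 'educatorPreparationPrograms', 'lakeId']
```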

In [38]:
def sink_path_cleanup(destination_path):
    pattern = re.compile(r'DistrictId=.*?/|SchoolYear=.*?/')
    destination_path = re.sub(pattern, '', destination_path)

    return destination_path

StatementMeta(spark3p3sm, 32, 64, Finished, Available)
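
To see what the cleanup does, the same pattern can be run on sample paths in plain Python. The paths below are illustrative. The second call is an edge case worth knowing: each alternative in the pattern requires a trailing `/`, so a partition segment that ends the path is left in place.

```python
import re

def sink_path_cleanup(destination_path):
    # Same pattern as in the cell above.
    pattern = re.compile(r'DistrictId=.*?/|SchoolYear=.*?/')
    return re.sub(pattern, '', destination_path)

# Partition segments in the middle of the path are both removed.
mid = sink_path_cleanup('stage2/Refined/Ed-Fi/5.2/DistrictId=All/SchoolYear=2023/ed-fi/schools')
print(mid)   # stage2/Refined/Ed-Fi/5.2/ed-fi/schools

# A segment at the very end has no trailing "/" and therefore survives.
tail = sink_path_cleanup('stage2/Refined/Ed-Fi/5.2/ed-fi/schools/DistrictId=All/SchoolYear=2023')
print(tail)  # stage2/Refined/Ed-Fi/5.2/ed-fi/schools/SchoolYear=2023
```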

In [39]:
def refine_and_explode_data(schema_name, 
                            tables_source,
                            ext_entity,
                            metadata, 
                            transform_mode, 
                            test_mode,
                            items = []):
    global districtPath,schoolYearPath
    if items == 'All':
        items = oea.get_folders(f"stage2/Ingested/{tables_source}")
    for item in items:
            table_name = item #sap_to_edfi_complex[item]
            table_path = f"{tables_source}/{item}"
            if not(oea.path_exists(f"stage2/Ingested/{table_path}")):
                logger.info(f"Skipping {table_path}: path not found under stage2/Ingested")
                continue
            logger.info(f"Processing schema/table: {schema_name}/{table_name}")
            if item == 'metadata.csv':
                logger.info('ignore metadata processing, since this is not a table to be ingested')
            else:
                try:
                    if not(transform_mode):
                        df = oea.refine(table_path, 
                                        metadata = metadata[item], 
                                        primary_key = 'id')
                    if transform_mode:
                        logger.info('Ed-Fi to Ed-Fi Relationship Model: ' + table_name)               
                        source_path = f'stage2/Ingested/{table_path}'
                        sink_general_path, sink_sensitive_path = oea.get_sink_general_sensitive_paths(source_path)
                        
                        sink_general_path = sink_path_cleanup(sink_general_path)
                        sink_sensitive_path = sink_path_cleanup(sink_sensitive_path)

                        df_changes = oea.get_latest_changes(source_path, sink_general_path)
                        df_changes = df_changes.withColumn('DistrictId', F.lit(districtPath))
                        df_changes = df_changes.withColumn('SchoolYear', F.lit(schoolYearPath))
                        
                        current_timestamp = datetime.now()
                        df_changes = df_changes.withColumn('LastModifiedDate', F.lit(current_timestamp))
                        
                        if df_changes.count() > 0:
                            df_pseudo, df_lookup = oea.pseudonymize(df_changes, 
                                                                    metadata,
                                                                    transform_mode,
                                                                    True)
                            
                            transform(df_pseudo, 
                                      schema_name, 
                                      table_name, 
                                      'id_pseudonym', 
                                      ext_entity, 
                                      sink_general_path,
                                      None, 
                                      None)
                            
                            oea.upsert(df = df_lookup, 
                                       destination_path = sink_sensitive_path, 
                                       primary_key = 'id',
                                       partitioning = True,
                                       partitioning_cols = ['DistrictId', 'SchoolYear'])    
                            oea.add_to_lake_db(source_entity_path = sink_sensitive_path,
                                               overwrite = True,
                                               extension = None)
                        else:
                            logger.info(f'No updated rows in {source_path} to process.')

                except AnalysisException as e:
                    # This means the table may have not been properly refined due to errors with the primary key not aligning with columns expected in the lookup table.
                    logger.info(e)

StatementMeta(spark3p3sm, 32, 65, Finished, Available)

In [41]:
from datetime import datetime
schema_name = 'ed-fi'
ext_entity = 'TPDM'
test_mode = False
transform_mode = True
tables_source = f'{moduleName}/{apiVersion}/DistrictId={districtPath}/SchoolYear={schoolYearPath}/{schema_name}'
transform_items = edfiEntities 
metadata = oea.get_metadata_from_url(metadataUrl)

refine_and_explode_data(schema_name, 
                        tables_source,
                        ext_entity,
                        metadata,
                        transform_mode, 
                        test_mode,
                        items = transform_items)

StatementMeta(spark3p3sm, 32, 67, Submitted, Running)

2023-10-20 09:20:05,458 - OEA - INFO - Processing schema/table: ed-fi/absenceEventCategoryDescriptors
2023-10-20 09:20:05,459 - OEA - INFO - Ed-Fi to Ed-Fi Relationship Model: absenceEventCategoryDescriptors
2023-10-20 09:20:07,373 - OEA - INFO - No updated rows in stage2/Ingested/Ed-Fi/5.2/DistrictId=All/SchoolYear=2023/ed-fi/absenceEventCategoryDescriptors to process.
2023-10-20 09:20:07,421 - OEA - INFO - Processing schema/table: ed-fi/academicHonorCategoryDescriptors
2023-10-20 09:20:07,421 - OEA - INFO - Ed-Fi to Ed-Fi Relationship Model: academicHonorCategoryDescriptors
2023-10-20 09:20:09,576 - OEA - INFO - No updated rows in stage2/Ingested/Ed-Fi/5.2/DistrictId=All/SchoolYear=2023/ed-fi/academicHonorCategoryDescriptors to process.
2023-10-20 09:20:09,599 - OEA - INFO - Processing schema/table: ed-fi/academicSubjectDescriptors
2023-10-20 09:20:09,599 - OEA - INFO - Ed-Fi to Ed-Fi Relationship Model: academicSubjectDescriptors
2023-10-20 09:20:12,276 - OEA - INFO - No updated row