# Test for processing Microsoft Education Insights: Reading Progress data

This notebook demonstrates possible data processing and exploration of the Microsoft Education Insights data, using the OEA_py class notebook. Specifically, this notebook and other Reading_Progress module assets are responsible for data manipulation pertaining to Insights reading progress. 

Most of the data processing done in this notebook is also performed by the Reading Progress module's main pipeline. This notebook offers an alternate approach to the same processing, along with data exploration and visualization for the module. 

Each step lists the source and destination directories of the data it moves.

The steps are outlined below:
1. Set the workspace,
2. Land Insights/Reading Progress Module K-12 Test Data,
3. Ingest the Insights/Reading Progress Module Test Data,
4. Reading Progress Schema Correction,
5. Refine the Reading Progress Module Test Data, 
6. Demonstrate Lake Database Queries/Final Remarks, and
7. Appendix

In [2]:
%run OEA_py


2023-02-21 21:55:11,595 - OEA - INFO - Now using workspace: dev
2023-02-21 21:55:11,596 - OEA - INFO - OEA initialized.


In [None]:
# 1) set the workspace (this determines where in the data lake you'll be writing to and reading from).
# You can work in 'dev', 'prod', or a sandbox with any name you choose.
# For example, Sam the developer can create a 'sam' workspace and expect to find his datasets in the data lake under oea/sandboxes/sam
oea.set_workspace('sam')


## 2.) Land Insights/Reading Progress Module K-12 Test Data

Directory: ```GitHub.com (raw data) -> stage1/Transactional/M365```

Run the code block below to land the Insights K-12 test data (**NOTE**: this Reading Progress module uses the same test data as the Insights module).


In [3]:
# 2.1) Land batch data files into stage1 of the data lake.
# In this example we pull the Insights/Reading Progress K-12 test csv data files from GitHub and land them in oea/sandboxes/sam/stage1/Transactional/M365/v1.14
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/activity/2022-01-28/ApplicationUsage.csv').text
oea.land(data, 'M365/v1.14/activity', 'activity_k12_test_data.csv', oea.ADDITIVE_BATCH_DATA)

data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/AadGroup/aadgroup.csv').text
oea.land(data, 'M365/v1.14/AadGroup', 'aadgroup_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/AadGroupMembership/aadgroupmembership.csv').text
oea.land(data, 'M365/v1.14/AadGroupMembership', 'aadgroupmembership_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/AadUser/aaduser.csv').text
oea.land(data, 'M365/v1.14/AadUser', 'aaduser_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/AadUserPersonMapping/aaduserpersonmapping.csv').text
oea.land(data, 'M365/v1.14/AadUserPersonMapping', 'aaduserpersonmapping_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Course/course.csv').text
oea.land(data, 'M365/v1.14/Course', 'course_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/CourseGradeLevel/coursegradelevel.csv').text
oea.land(data, 'M365/v1.14/CourseGradeLevel', 'coursegradelevel_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/CourseSubject/coursesubject.csv').text
oea.land(data, 'M365/v1.14/CourseSubject', 'coursesubject_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Enrollment/enrollment.csv').text
oea.land(data, 'M365/v1.14/Enrollment', 'enrollment_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Organization/organization.csv').text
oea.land(data, 'M365/v1.14/Organization', 'organization_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Person/person.csv').text
oea.land(data, 'M365/v1.14/Person', 'person_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonDemographic/persondemographic.csv').text
oea.land(data, 'M365/v1.14/PersonDemographic', 'persondemographic_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonDemographicEthnicity/persondemographicethnicity.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicEthnicity', 'persondemographicethnicity_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonDemographicPersonFlag/persondemographicpersonflag.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicPersonFlag', 'persondemographicpersonflag_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonDemographicRace/persondemographicrace.csv').text
oea.land(data, 'M365/v1.14/PersonDemographicRace', 'persondemographicrace_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonEmailAddress/personemailaddress.csv').text
oea.land(data, 'M365/v1.14/PersonEmailAddress', 'personemailaddress_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonIdentifier/personidentifier.csv').text
oea.land(data, 'M365/v1.14/PersonIdentifier', 'personidentifier_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonOrganizationRole/personorganizationrole.csv').text
oea.land(data, 'M365/v1.14/PersonOrganizationRole', 'personorganizationrole_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonPhoneNumber/personphonenumber.csv').text
oea.land(data, 'M365/v1.14/PersonPhoneNumber', 'personphonenumber_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/PersonRelationship/personrelationship.csv').text
oea.land(data, 'M365/v1.14/PersonRelationship', 'personrelationship_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/RefDefinition/refdefinition.csv').text
oea.land(data, 'M365/v1.14/RefDefinition', 'refdefinition_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/RefTranslation/reftranslation.csv').text
oea.land(data, 'M365/v1.14/RefTranslation', 'reftranslation_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Section/section.csv').text
oea.land(data, 'M365/v1.14/Section', 'section_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/SectionGradeLevel/sectiongradelevel.csv').text
oea.land(data, 'M365/v1.14/SectionGradeLevel', 'sectiongradelevel_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/SectionSession/sectionsession.csv').text
oea.land(data, 'M365/v1.14/SectionSession', 'sectionsession_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/SectionSubject/sectionsubject.csv').text
oea.land(data, 'M365/v1.14/SectionSubject', 'sectionsubject_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/Session/session.csv').text
oea.land(data, 'M365/v1.14/Session', 'session_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)
data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/k12_test_data/roster/2022-01-28T06-16-22/SourceSystem/sourcesystem.csv').text
oea.land(data, 'M365/v1.14/SourceSystem', 'sourcesystem_k12_test_data.csv', oea.SNAPSHOT_BATCH_DATA)


'stage1/Transactional/M365/v1.14/SourceSystem/snapshot_batch_data/rundate=2023-02-21 21:55:23/sourcesystem_k12_test_data.csv'
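
The repeated `requests.get` / `oea.land` pairs above follow one pattern per roster entity, so they could be generated from a list of entity names. The sketch below is one possible way to do that and is not part of the module: `roster_landing_plan` and the constant names are hypothetical, while the URL pieces, landing paths, and file names are copied from the cell above. It only builds the (URL, landing path, file name) triples, so it can be checked without a Spark session.

```python
# Hypothetical helper: builds the landing arguments used in the cell above.
BASE_URL = ('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/'
            'Microsoft_Education_Insights/test_data/k12_test_data')
ROSTER_SNAPSHOT = '2022-01-28T06-16-22'
ROSTER_ENTITIES = [
    'AadGroup', 'AadGroupMembership', 'AadUser', 'AadUserPersonMapping', 'Course',
    'CourseGradeLevel', 'CourseSubject', 'Enrollment', 'Organization', 'Person',
    'PersonDemographic', 'PersonDemographicEthnicity', 'PersonDemographicPersonFlag',
    'PersonDemographicRace', 'PersonEmailAddress', 'PersonIdentifier',
    'PersonOrganizationRole', 'PersonPhoneNumber', 'PersonRelationship',
    'RefDefinition', 'RefTranslation', 'Section', 'SectionGradeLevel',
    'SectionSession', 'SectionSubject', 'Session', 'SourceSystem',
]

def roster_landing_plan(entities=ROSTER_ENTITIES):
    """Return (url, landing_path, file_name) for each roster entity."""
    plan = []
    for entity in entities:
        file_stem = entity.lower()  # the test files use the lower-cased entity name
        url = f'{BASE_URL}/roster/{ROSTER_SNAPSHOT}/{entity}/{file_stem}.csv'
        plan.append((url, f'M365/v1.14/{entity}', f'{file_stem}_k12_test_data.csv'))
    return plan

plan = roster_landing_plan()

# In the notebook, each entry could then be landed with the same calls as above:
# for url, path, file_name in plan:
#     oea.land(requests.get(url).text, path, file_name, oea.SNAPSHOT_BATCH_DATA)
```

The activity file is left out of the list because it is landed with `oea.ADDITIVE_BATCH_DATA` rather than as a snapshot.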

## 3.) Ingest the Insights/Reading Progress Module Test Data

Directory: ```stage1/Transactional/M365 -> stage2/Ingested/reading_progress```.

Both test datasets are formatted exactly like the Insights data, so the landed files initially have no column names and no correct dtypes. Ingest the data using the ```ingest_reading_prog()``` function; the next step corrects the table schemas.

**NOTE:**
 - The AadGroupMembership table is not ingested, since this Reading Progress module does not use it. If needed, refer to the Insights module and add the same processing.
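
Because the files have no header row, Spark assigns positional column names (`_c0`, `_c1`, ...) and reads every value as a string, which is why the ingestion calls below pass primary keys such as `_c0` and `_c3`. The following plain-Python sketch mimics that behavior on a made-up two-row sample; `read_headerless_csv` and the sample values are illustrative assumptions, not part of OEA or of the test data.

```python
import csv
import io

def read_headerless_csv(text):
    """Parse CSV text without a header row, naming columns _c0, _c1, ... by position."""
    rows = list(csv.reader(io.StringIO(text)))
    width = max((len(row) for row in rows), default=0)
    columns = [f'_c{i}' for i in range(width)]
    # Every value stays a string; no dtype is inferred at this point.
    return [dict(zip(columns, row)) for row in rows]

# Made-up sample rows (not taken from the test data).
sample = 'a1,Contoso High,2022-01-28\na2,Fabrikam Middle,2022-01-28\n'
records = read_headerless_csv(sample)
print(records[0])
```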

In [4]:
# This function is almost identical to the ingest function in the OEA framework, except that it takes a second path (write_entity_path) so the ingested data can be written to a different directory.
def ingest_reading_prog(entity_path, write_entity_path, primary_key='id', options=None):
    """ Ingests the data for the entity in stage1/Transactional/{entity_path} and writes it to stage2/Ingested/{write_entity_path}.
        CSV files are expected to have a header row by default, and JSON files are expected to have complete JSON docs on each row in the file.
        To specify options that are different from these defaults, use the options param.
        eg, ingest_reading_prog('M365/v1.14/Person', 'reading_progress/v0.1/Person', '_c0', options={'header':False}) # for CSV files that don't have a header
    """
    primary_key = oea.fix_column_name(primary_key) # fix the column name, in case it has a space in it or some other invalid character
    ingested_path = f'stage2/Ingested/{write_entity_path}'
    raw_path = f'stage1/Transactional/{entity_path}'
    batch_type, source_data_format = oea.get_batch_info(raw_path)
    logger.info(f'Ingesting from: {raw_path}, batch type of: {batch_type}, source data format of: {source_data_format}')
    source_url = oea.to_url(f'{raw_path}/{batch_type}_batch_data')

    if batch_type == 'snapshot': source_url = f'{source_url}/{oea.get_latest_folder(source_url)}' 
            
    logger.debug(f'Processing {batch_type} data from: {source_url} and writing out to: {ingested_path}')
    if batch_type == 'snapshot':
        def batch_func(df): oea.overwrite(df, ingested_path, primary_key)
    elif batch_type == 'additive':
        def batch_func(df): oea.append(df, ingested_path, primary_key)
    elif batch_type == 'delta':
        def batch_func(df): oea.upsert(df, ingested_path, primary_key)
    else:
        raise ValueError("No valid batch folder was found at that path (expected to find a single folder with one of the following names: snapshot_batch_data, additive_batch_data, or delta_batch_data). Are you sure you have the right path?")                      

    options = dict(options) if options else {}  # copy, so the caller's dict is not modified
    options['format'] = source_data_format # eg, 'csv', 'json'
    if source_data_format == 'csv' and options.get('header') is None: options['header'] = True  # default to expecting a header in csv files

    number_of_new_inbound_rows = oea.process(source_url, batch_func, options)
    if number_of_new_inbound_rows > 0:    
        oea.add_to_lake_db(ingested_path)
    return number_of_new_inbound_rows

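
The `if`/`elif` chain in `ingest_reading_prog` picks a write strategy from the batch type: snapshot data goes through `oea.overwrite`, additive data through `oea.append`, and delta data through `oea.upsert`. The sketch below illustrates what those three strategies mean on plain Python lists of rows. The semantics are inferred from the function names, and `apply_batch`, `WRITERS`, and the sample rows are hypothetical; this is not the OEA implementation.

```python
def overwrite(table, rows, key):
    # snapshot: the new batch replaces the table entirely
    return list(rows)

def append(table, rows, key):
    # additive: the new batch is added to what is already there
    return table + list(rows)

def upsert(table, rows, key):
    # delta: rows with a known key are replaced, new keys are added
    merged = {row[key]: row for row in table}
    merged.update({row[key]: row for row in rows})
    return list(merged.values())

WRITERS = {'snapshot': overwrite, 'additive': append, 'delta': upsert}

def apply_batch(table, rows, batch_type, key='id'):
    """Apply one inbound batch to an in-memory table according to its batch type."""
    if batch_type not in WRITERS:
        raise ValueError(f'Unsupported batch type: {batch_type!r}')
    return WRITERS[batch_type](table, rows, key)

existing = [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'b'}]
batch = [{'id': 2, 'v': 'B'}, {'id': 3, 'v': 'c'}]

snapshot_result = apply_batch(existing, batch, 'snapshot')
additive_result = apply_batch(existing, batch, 'additive')
delta_result = apply_batch(existing, batch, 'delta')
```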

In [5]:
# 3) The next step is to ingest the batch data into stage2
# Note that when you run this the first time, you'll see an info message like "Number of new inbound rows processed: 2".
# If you run this a second time, the number of inbound rows processed will be 0 because the ingestion uses spark structured streaming to keep track of what data has already been processed.
options = {'header':False}
ingest_reading_prog(f'M365/v1.14/activity', f'reading_progress/v0.1/activity', '_c3', options)
ingest_reading_prog(f'M365/v1.14/AadGroup', f'reading_progress/v0.1/AadGroup', '_c0', options)
ingest_reading_prog(f'M365/v1.14/AadUser', f'reading_progress/v0.1/AadUser', '_c0', options)
ingest_reading_prog(f'M365/v1.14/AadUserPersonMapping', f'reading_progress/v0.1/AadUserPersonMapping', '_c0', options)
ingest_reading_prog(f'M365/v1.14/Course', f'reading_progress/v0.1/Course', '_c0', options)
ingest_reading_prog(f'M365/v1.14/CourseGradeLevel', f'reading_progress/v0.1/CourseGradeLevel', '_c0', options)
ingest_reading_prog(f'M365/v1.14/CourseSubject', f'reading_progress/v0.1/CourseSubject', '_c0', options)
ingest_reading_prog(f'M365/v1.14/Enrollment', f'reading_progress/v0.1/Enrollment', '_c0', options)
ingest_reading_prog(f'M365/v1.14/Organization', f'reading_progress/v0.1/Organization', '_c0', options)
ingest_reading_prog(f'M365/v1.14/Person', f'reading_progress/v0.1/Person', '_c0', options)
ingest_reading_prog(f'M365/v1.14/PersonDemographic', f'reading_progress/v0.1/PersonDemographic', '_c0', options)
ingest_reading_prog(f'M365/v1.14/PersonDemographicEthnicity', f'reading_progress/v0.1/PersonDemographicEthnicity', '_c0', options)
ingest_reading_prog(f'M365/v1.14/PersonDemographicPersonFlag', f'reading_progress/v0.1/PersonDemographicPersonFlag', '_c0', options)
ingest_reading_prog(f'M365/v1.14/PersonDemographicRace', f'reading_progress/v0.1/PersonDemographicRace', '_c0', options)
ingest_reading_prog(f'M365/v1.14/PersonEmailAddress', f'reading_progress/v0.1/PersonEmailAddress', '_c0', options)
ingest_reading_prog(f'M365/v1.14/PersonIdentifier', f'reading_progress/v0.1/PersonIdentifier', '_c0', options)
ingest_reading_prog(f'M365/v1.14/PersonOrganizationRole', f'reading_progress/v0.1/PersonOrganizationRole', '_c0', options)
ingest_reading_prog(f'M365/v1.14/PersonPhoneNumber', f'reading_progress/v0.1/PersonPhoneNumber', '_c0', options)
#ingest_reading_prog(f'M365/v1.14/PersonRelationship', f'reading_progress/v0.1/PersonRelationship', '_c0', options) # <- no test data currently
ingest_reading_prog(f'M365/v1.14/RefDefinition', f'reading_progress/v0.1/RefDefinition', '_c0', options)
#ingest_reading_prog(f'M365/v1.14/RefTranslation', f'reading_progress/v0.1/RefTranslation', '_c0', options) # <- no test data currently
ingest_reading_prog(f'M365/v1.14/Section', f'reading_progress/v0.1/Section', '_c0', options)
ingest_reading_prog(f'M365/v1.14/SectionGradeLevel', f'reading_progress/v0.1/SectionGradeLevel', '_c0', options)
ingest_reading_prog(f'M365/v1.14/SectionSession', f'reading_progress/v0.1/SectionSession', '_c0', options)
ingest_reading_prog(f'M365/v1.14/SectionSubject', f'reading_progress/v0.1/SectionSubject', '_c0', options)
ingest_reading_prog(f'M365/v1.14/Session', f'reading_progress/v0.1/Session', '_c0', options)
ingest_reading_prog(f'M365/v1.14/SourceSystem', f'reading_progress/v0.1/SourceSystem', '_c0', options)


2023-02-21 21:55:26,455 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/activity, batch type of: additive, source data format of: csv
2023-02-21 21:55:42,385 - py4j.java_gateway - INFO - Callback Server Starting
2023-02-21 21:55:42,395 - py4j.java_gateway - INFO - Socket listening on ('127.0.0.1', 40563)
2023-02-21 21:55:44,219 - py4j.java_gateway - INFO - Callback Connection ready to receive messages
2023-02-21 21:55:44,240 - py4j.java_gateway - INFO - Received command c on object id p0
2023-02-21 21:56:21,209 - OEA - INFO - Number of new inbound rows processed: 52658
2023-02-21 21:56:38,711 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/AadGroup, batch type of: snapshot, source data format of: csv
2023-02-21 21:56:40,511 - py4j.java_gateway - INFO - Received command c on object id p1
2023-02-21 21:56:47,649 - OEA - INFO - Number of new inbound rows processed: 87
2023-02-21 21:56:48,944 - OEA - INFO - Ingesting from: stage1/Transactional/M365/v1.14/AadUs

1
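
The comment in the cell above notes that a second run processes 0 new rows, because the ingestion keeps track of what has already been processed. The toy class below illustrates that idea only: a set of already-seen file names stands in for the streaming checkpoint. `IncrementalReader` and the file names are hypothetical, and this is not how Spark structured streaming is implemented.

```python
class IncrementalReader:
    """Counts rows only from files that have not been processed before."""

    def __init__(self):
        self._seen = set()  # stands in for the streaming checkpoint

    def process(self, files):
        """files maps a file name to its list of rows; returns the number of new rows."""
        new_rows = 0
        for name, rows in files.items():
            if name in self._seen:
                continue  # already processed in an earlier run
            self._seen.add(name)
            new_rows += len(rows)
        return new_rows

reader = IncrementalReader()
landed = {'person_batch_1.csv': [['p1', 'Doe'], ['p2', 'Roe']]}

first_run = reader.process(landed)   # both rows are new
second_run = reader.process(landed)  # nothing new since the first run
```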

In [11]:
# 3.5) Now you can run queries against the auto-generated "lake database" with the ingested Insights/Reading Progress data.
df = spark.sql("select * from ldb_sam_s2i_reading_progress_v0p1.activity")
display(df.limit(10))


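
The lake database name in the query above appears to encode the workspace (`sam`), the stage (`s2i` for stage2/Ingested), and the dataset path with the dot in the version replaced by `p`. The helper below reproduces that one observed name. It is a sketch of a pattern inferred from this single example, not the naming code used by `oea.add_to_lake_db`, and `lake_db_and_table` is a hypothetical name.

```python
def lake_db_and_table(workspace, ingested_path):
    """Derive (database, table) from a stage2/Ingested path, following the observed pattern."""
    parts = ingested_path.strip('/').split('/')
    if len(parts) < 4 or parts[:2] != ['stage2', 'Ingested']:
        raise ValueError(f'Expected a path like stage2/Ingested/<dataset>/<table>, got: {ingested_path}')
    dataset_parts, table = parts[2:-1], parts[-1]
    # 'reading_progress/v0.1' becomes 'reading_progress_v0p1'
    dataset_name = '_'.join(dataset_parts).replace('.', 'p')
    return f'ldb_{workspace}_s2i_{dataset_name}', table

db_name, table_name = lake_db_and_table('sam', 'stage2/Ingested/reading_progress/v0.1/activity')
query = f'select * from {db_name}.{table_name}'
```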

## 4.) Reading Progress Schema Correction

Directory: ```stage2/Ingested/reading_progress -> stage2/Ingested_Corrected/reading_progress```

This step uses the same four functions as the "ReadingProgress_schema_correction" notebook, which rely on the metadata.csv to give each table its correct column names and dtypes.

After the schema is corrected, each table is written to stage2/Ingested_Corrected.

In [12]:
# 4) schema correction, since Insights test data initially landed doesn't have column headers or correct dtypes.

def _extract_element(lst, element_num=0):
    return [item[element_num] for item in lst]

def _dtype_config(dtype_lst):
    return [item.capitalize() + 'Type()' for item in dtype_lst]

def correct_insights_table_schema(df, table_name):
    list_of_column_names = _extract_element(metadata[table_name])
    list_of_column_dtypes = _extract_element(metadata[table_name], 1)
    list_of_column_dtypes = _dtype_config(list_of_column_dtypes)

    n = 0
    df_updatedColumns = df
    for c in df.columns:
        if c != 'rundate':
            new_col_name = list_of_column_names[n]
            df_updatedColumns = df_updatedColumns.withColumnRenamed(c, new_col_name)
            if list_of_column_dtypes[n] != 'StringType()':
                if list_of_column_dtypes[n] == 'IntegerType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(IntegerType()))
                elif list_of_column_dtypes[n] == 'TimestampType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(TimestampType()))
                elif list_of_column_dtypes[n] == 'ShortType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(ShortType()))
                elif list_of_column_dtypes[n] == 'LongType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(LongType()))
                elif list_of_column_dtypes[n] == 'DoubleType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(DoubleType()))
                elif list_of_column_dtypes[n] == 'DateType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(DateType()))
                elif list_of_column_dtypes[n] == 'BooleanType()':
                    df_updatedColumns = df_updatedColumns.withColumn(new_col_name, df_updatedColumns[new_col_name].cast(BooleanType()))
        # n advances for every column, including 'rundate', so 'rundate' is assumed to come after all data columns
        n = n + 1
    return df_updatedColumns

def correct_reading_progress_dataset(tables_source, write_destination):
    items = oea.get_folders(tables_source)
    for item in items: 
        if item == 'metadata.csv':
            logger.info('ignore metadata processing, since this is not a table to be ingested')
        else:
            table_path = tables_source +'/'+ item
            spark.sql("set spark.sql.streaming.schemaInference=true")
            streaming_df = spark.readStream.format('delta').load(oea.to_url(table_path))
            df_corrected = correct_insights_table_schema(streaming_df, table_name=item)
            query = df_corrected.writeStream.format('delta').outputMode('append').trigger(once=True).option('checkpointLocation', oea.to_url(table_path) + '/_checkpoints')
            query = query.start(oea.to_url(write_destination + '/' +item))
            query.awaitTermination() 
            logger.info('Successfully corrected the schema for table: ' + item + ' from: ' + table_path)

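
The core of `correct_insights_table_schema` is a positional match: the n-th ingested column (`_c0`, `_c1`, ...) takes the n-th name and dtype from the table's metadata, and `rundate` is left alone. The plain-Python sketch below builds that rename/cast plan without Spark. `build_correction_plan` is a hypothetical name, and the metadata sample assumes each entry starts with a column name and a lower-case dtype, using the first three columns of the activity schema printed later in this notebook. Unlike the function above, the sketch advances the metadata position only for data columns.

```python
def build_correction_plan(current_columns, table_metadata):
    """Return (old_name, new_name, spark_type_name) for every column except 'rundate'."""
    plan = []
    position = 0
    for column in current_columns:
        if column == 'rundate':
            continue  # added during ingestion; not described in the metadata
        new_name, dtype = table_metadata[position][0], table_metadata[position][1]
        # same naming convention as _dtype_config: 'timestamp' -> 'TimestampType()'
        plan.append((column, new_name, dtype.capitalize() + 'Type()'))
        position += 1
    return plan

# Assumed metadata layout: [column_name, dtype] per column.
activity_metadata = [['SignalType', 'string'], ['StartTime', 'timestamp'], ['UserAgent', 'string']]
plan = build_correction_plan(['_c0', '_c1', '_c2', 'rundate'], activity_metadata)
```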

In [13]:
metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Microsoft_Education_Insights/test_data/metadata.csv')
correct_reading_progress_dataset('stage2/Ingested/reading_progress/v0.1', 'stage2/Ingested_Corrected/reading_progress/v0.1')


2023-02-21 22:00:32,747 - OEA - INFO - Successfully corrected the schema for table: AadGroup from: stage2/Ingested/reading_progress/v0.1/AadGroup
2023-02-21 22:00:37,084 - OEA - INFO - Successfully corrected the schema for table: AadGroupMembership from: stage2/Ingested/reading_progress/v0.1/AadGroupMembership
2023-02-21 22:00:41,209 - OEA - INFO - Successfully corrected the schema for table: AadUser from: stage2/Ingested/reading_progress/v0.1/AadUser
2023-02-21 22:00:45,246 - OEA - INFO - Successfully corrected the schema for table: AadUserPersonMapping from: stage2/Ingested/reading_progress/v0.1/AadUserPersonMapping
2023-02-21 22:00:48,871 - OEA - INFO - Successfully corrected the schema for table: Course from: stage2/Ingested/reading_progress/v0.1/Course
2023-02-21 22:00:52,684 - OEA - INFO - Successfully corrected the schema for table: CourseGradeLevel from: stage2/Ingested/reading_progress/v0.1/CourseGradeLevel
2023-02-21 22:00:56,436 - OEA - INFO - Successfully corrected the sche

In [14]:
df = spark.read.format('delta').load(oea.to_url('stage2/Ingested_Corrected/reading_progress/v0.1/activity'))
display(df.limit(10))



In [15]:
df.printSchema()


root
 |-- SignalType: string (nullable = true)
 |-- StartTime: timestamp (nullable = true)
 |-- UserAgent: string (nullable = true)
 |-- SignalId: string (nullable = true)
 |-- SisClassId: string (nullable = true)
 |-- ClassId: string (nullable = true)
 |-- ChannelId: string (nullable = true)
 |-- AppName: string (nullable = true)
 |-- ActorId: string (nullable = true)
 |-- ActorRole: string (nullable = true)
 |-- SchemaVersion: string (nullable = true)
 |-- AssignmentId: string (nullable = true)
 |-- SubmissionId: string (nullable = true)
 |-- SubmissionCreatedTime: timestamp (nullable = true)
 |-- Action: string (nullable = true)
 |-- DueDate: timestamp (nullable = true)
 |-- ClassCreationDate: timestamp (nullable = true)
 |-- Grade: double (nullable = true)
 |-- SourceFileExtension: string (nullable = true)
 |-- MeetingDuration: string (nullable = true)
 |-- MeetingSessionId: string (nullable = true)
 |-- MeetingType: string (nullable = true)
 |-- ReadingSubmissionWordsPerMinute: in

## 5.) Refine the Reading Progress Module Test Data

Directory: ```stage2/Ingested_Corrected/reading_progress -> stage2/Refined/reading_progress```

This step refines the Insights test data from stage2/Ingested_Corrected to stage2/Refined, using the metadata.csv. It is responsible for:
 - pseudonymization (which protects sensitive student information by either hashing or masking the sensitive columns), and 
 - data transformation (to fit the Reading Progress module schema). 

Each refined table is written in its pseudonymized form to ```stage2/Refined/reading_progress/v0.1/general```, and its lookup table (the mapping for the hashed or masked sensitive columns) is written to ```stage2/Refined/reading_progress/v0.1/sensitive```.
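
To make the hashing/masking split concrete, here is a minimal plain-Python sketch of pseudonymization that produces a general table and a lookup table. It is not `oea.pseudonymize`: the operation names (`hash`, `mask`, `no-op`), the use of SHA-256, the `salt` parameter, and the sample row are assumptions for illustration. The `_pseudonym` suffix follows the `PersonId_pseudonym` key used in the refinement code below.

```python
import hashlib

def pseudonymize_rows(rows, operations, salt=''):
    """Return (general_rows, lookup_rows) for a list of dict rows.

    operations maps a column name to 'hash', 'mask', or 'no-op' (the default).
    """
    general, lookup = [], []
    for row in rows:
        safe, mapping = {}, {}
        for column, value in row.items():
            operation = operations.get(column, 'no-op')
            if operation == 'hash':
                digest = hashlib.sha256((salt + str(value)).encode('utf-8')).hexdigest()
                safe[column + '_pseudonym'] = digest
                # the lookup row keeps the original value next to its hash
                mapping[column] = value
                mapping[column + '_pseudonym'] = digest
            elif operation == 'mask':
                safe[column] = '*'
            else:
                safe[column] = value
        general.append(safe)
        if mapping:
            lookup.append(mapping)
    return general, lookup

# Made-up sample row (not taken from the test data).
students = [{'PersonId': 'p-001', 'Surname': 'Doe', 'StudentGrade': '5'}]
general, lookup = pseudonymize_rows(students, {'PersonId': 'hash', 'Surname': 'mask'})
```

In this sketch the general rows are safe to share for analysis, while the lookup rows are the only place where a hash can be traced back to the original identifier.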


**To-Do's:**
 - Find workaround for creating lookup tables, when the primary key is un-hashed after pseudonymization 
    * (i.e. *affected tables*: PersonDemographicEthnicity, PersonDemographicPersonFlag, PersonDemographicRace, PersonEmailAddress, PersonIdentifier, PersonOrganizationRole, and PersonPhoneNumber). 
 - Resolve ingesting and refining AadGroupMembership table.

In [34]:
# 5) this step refines the data through the use of metadata (this is where the pseudonymization of the data occurs).
def refine_reading_prog_corrected(df, table_name, metadata=None, primary_key='id'):
    source_path = 'stage2/Ingested_Corrected/reading_progress/v0.1'  # used only in log messages; each refined table is derived from one or more tables in this folder
    sink_general_path = f'stage2/Refined/reading_progress/v0.1/general/{table_name}'
    sink_sensitive_path = f'stage2/Refined/reading_progress/v0.1/sensitive/{table_name}_lookup'

    # NOTE: Currently does not accommodate change data; this is expected to be updated for production purposes
    #df_changes = oea.get_latest_changes(source_path, sink_general_path)
    spark_schema = oea.to_spark_schema(metadata)
    df = oea.modify_schema(df, spark_schema)        

    row_count = df.count()  # count once; every count() call triggers a Spark job
    if row_count > 0:
        df_pseudo, df_lookup = oea.pseudonymize(df, metadata)
        oea.upsert(df_pseudo, sink_general_path, primary_key)
        oea.upsert(df_lookup, sink_sensitive_path, primary_key)
        oea.add_to_lake_db(sink_general_path)
        oea.add_to_lake_db(sink_sensitive_path)
        logger.info(f'Processed {row_count} rows from {source_path} into stage2/Refined')
    else:
        logger.info(f'No updated rows in {source_path} to process.')
    return row_count

def refine_reading_progress_dataset(tables_source):
    # read in relevant tables for data transformation
    base_path = tables_source
    df_activity = oea.load(base_path + '/activity')
    df_aaduserpersonmapping = oea.load(base_path + '/AadUserPersonMapping')
    df_person = oea.load(base_path + '/Person')
    df_personOrgRole = oea.load(base_path + '/PersonOrganizationRole')
    df_organization = oea.load(base_path + '/Organization')
    df_refDefinition = oea.load(base_path + '/RefDefinition')
    # separate student frame, subset and refine data
    dfStudent = df_personOrgRole.join(df_person, df_personOrgRole.PersonId == df_person.Id, how='inner')
    dfStudent = dfStudent.select('PersonId', 'Surname', 'GivenName', 'MiddleName', 'RefRoleId', 'RefGradeLevelId', 'OrganizationId')
    dfStudent = dfStudent.join(df_organization, dfStudent.OrganizationId == df_organization.Id, how='inner').withColumnRenamed('Name', 'OrganizationName')
    dfStudent = dfStudent.select('PersonId', 'Surname', 'GivenName', 'MiddleName', 'RefRoleId', 'RefGradeLevelId', 'OrganizationId', 'OrganizationName')
    dfStudent = dfStudent.join(df_refDefinition, dfStudent.RefRoleId == df_refDefinition.Id, how='inner').withColumnRenamed('Code', 'PersonRole')
    dfStudent = dfStudent.select('PersonId', 'Surname', 'GivenName', 'MiddleName', 'PersonRole', 'RefGradeLevelId', 'OrganizationId', 'OrganizationName')
    dfStudent = dfStudent.filter(dfStudent['PersonRole'] == 'Student')
    dfStudent = dfStudent.join(df_refDefinition, dfStudent.RefGradeLevelId == df_refDefinition.Id, how='left').withColumnRenamed('Code', 'StudentGrade')
    dfStudent = dfStudent.select('PersonId', 'Surname', 'GivenName', 'MiddleName', 'PersonRole', 'StudentGrade', 'OrganizationId', 'OrganizationName')
    df_aaduserpersonmapping = df_aaduserpersonmapping.withColumnRenamed('PersonId', 'id')
    dfStudent = dfStudent.join(df_aaduserpersonmapping, dfStudent.PersonId == df_aaduserpersonmapping.id, how='inner').withColumnRenamed('ObjectId', 'AadUserId')
    dfStudent = dfStudent.select('PersonId', 'AadUserId', 'Surname', 'GivenName', 'MiddleName', 'PersonRole', 'StudentGrade', 'OrganizationId', 'OrganizationName')

    refine_reading_prog_corrected(dfStudent, 'Student', metadata['Student'], 'PersonId_pseudonym')
    # refine reading progress data from Insights activity table
    dfReadingProgress = df_activity.where("AppName == 'ReadingProgress'")
    dfReadingProgress = dfReadingProgress.select('ActorId', 'SignalId', 'SignalType', 'StartTime', 'AppName', 'Action', 'ClassId', 'ReadingSubmissionWordsPerMinute', 'ReadingSubmissionAccuracyScore', \
                                        'ReadingSubmissionRepetitionsCount', 'ReadingSubmissionInsertionsCount', 'ReadingSubmissionMispronunciationCount', 'ReadingSubmissionObmissionCount', 'ReadingSubmissionAttemptNumber', \
                                        'ReadingAssignmentWordCount', 'ReadingAssignmentFleschKincaidGradeLevel', 'ReadingAssignmentLanguage')
    dfReadingProgress = dfReadingProgress.withColumnRenamed('ActorId', 'AadUserId').withColumnRenamed('ClassId', 'AadGroupId')
    dfReadingProgress = dfReadingProgress.withColumn('ReadingSubmissionAccuracyScore', dfReadingProgress['ReadingSubmissionAccuracyScore'].cast(DoubleType()))
    # Derive each error rate as a percentage of the assignment word count, rounded to 3 decimal places.
    for error_type in ['Repetitions', 'Mispronunciation', 'Insertions', 'Obmission']:
        rate = F.col(f'ReadingSubmission{error_type}Count') / F.col('ReadingAssignmentWordCount') * 100
        dfReadingProgress = dfReadingProgress.withColumn(f'ReadingSubmission{error_type}Rate', F.round(rate, 3))

    try:
        refine_reading_prog_corrected(dfReadingProgress, 'ReadingProgress_activity', metadata['ReadingProgress_activity'], 'SignalId')
    except AnalysisException as e:
        # The table may not have been refined properly because the primary key does not align with the columns expected in the lookup table.
        # Log the error instead of silently ignoring it.
        logger.warning(f'ReadingProgress_activity may not have been refined properly: {e}')
    
    logger.info('Finished refining Reading Progress tables.')

StatementMeta(spark3p2sm, 146, 33, Finished, Available)
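
The four `...Rate` columns derived in the cell above share one formula: the error count divided by the assignment word count, multiplied by 100 and rounded to three decimal places. The plain-Python sketch below illustrates that arithmetic outside of Spark. The helper name `rate_per_100_words` is hypothetical, and returning `None` for a missing or zero word count is an assumption made for this sketch rather than something the notebook specifies.

```python
from typing import Optional


def rate_per_100_words(error_count: Optional[float], word_count: Optional[float]) -> Optional[float]:
    """Return errors per 100 words, rounded to 3 decimal places (hypothetical helper)."""
    # Assumption for this sketch: a missing or zero word count yields None instead of raising.
    if error_count is None or not word_count:
        return None
    return round(error_count / word_count * 100, 3)


# Example: 3 mispronunciations in a 120-word passage.
example_rate = rate_per_100_words(3, 120)
print(example_rate)
```

Python's built-in `round` and Spark's `F.round` can differ on exact ties, so treat this as an illustration of the formula rather than a bit-for-bit replica of the Spark output.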

In [35]:
metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Reading_Progress/data/metadata.csv')
refine_reading_progress_dataset('stage2/Ingested_Corrected/reading_progress/v0.1')

StatementMeta(spark3p2sm, 146, 34, Finished, Available)

2023-02-21 22:31:59,162 - OEA - INFO - Processed 600 rows from stage2/Ingested_Corrected/reading_progress/v0.1/activity into stage2/Refined
2023-02-21 22:32:05,634 - OEA - INFO - Finished refining Reading Progress tables.


In [36]:
oea.add_to_lake_db('stage2/Refined/reading_progress/v0.1/general/ReadingProgress_activity')

StatementMeta(spark3p2sm, 146, 35, Finished, Available)

In [38]:
# Preview the refined Student and ReadingProgress_activity tables.
# Delta tables carry their own schema, so the CSV-style 'header' option is not needed.
df = spark.read.format('delta').load(oea.to_url('stage2/Refined/reading_progress/v0.1/general/Student'))
display(df.limit(10))
df = spark.read.format('delta').load(oea.to_url('stage2/Refined/reading_progress/v0.1/general/ReadingProgress_activity'))
display(df.limit(10))

StatementMeta(spark3p2sm, 146, 37, Finished, Available)


## 6.) Demonstrate Lake Database Queries/Final Remarks
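
Once a refined table has been added to the lake database, it can be queried with Spark SQL. The sketch below only builds the query text, so it can be inspected without a Spark session. The helper name `build_language_summary_query` is hypothetical, and the database name `ldb_sam_s2r_reading_progress_v0p1` is assumed from the reset cell in this section, so adjust it to match your workspace. The column names are taken from the `select` in step 5 and may differ in your refined table, for example after pseudonymization.

```python
def build_language_summary_query(database: str, table: str, group_column: str) -> str:
    """Build a Spark SQL aggregation over a refined reading progress table (hypothetical helper)."""
    return (
        f"SELECT {group_column}, "
        "COUNT(*) AS Submissions, "
        "ROUND(AVG(ReadingSubmissionWordsPerMinute), 1) AS AvgWordsPerMinute, "
        "ROUND(AVG(ReadingSubmissionAccuracyScore), 1) AS AvgAccuracyScore "
        f"FROM {database}.{table} "
        f"GROUP BY {group_column} "
        f"ORDER BY {group_column}"
    )


# Assumed names: adjust the database to match your workspace and dataset version.
query = build_language_summary_query(
    database='ldb_sam_s2r_reading_progress_v0p1',
    table='ReadingProgress_activity',
    group_column='ReadingAssignmentLanguage',
)
print(query)

# In the notebook, where a Spark session is available, the query could then be run with:
# display(spark.sql(query))
```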

In [19]:
# Run this cell to reset this example (deleting all the example Insights data in your workspace)
oea.rm_if_exists('stage1/Transactional/reading_progress')
oea.rm_if_exists('stage2/Ingested/reading_progress')
oea.rm_if_exists('stage2/Ingested_Corrected/reading_progress')
oea.rm_if_exists('stage2/Refined/reading_progress')
oea.drop_lake_db('ldb_sam_s2i_reading_progress_v0p1')
oea.drop_lake_db('ldb_sam_s2r_reading_progress_v0p1')
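
The lake database names in the reset cell are hard-coded for the `sam` workspace. They appear to combine the workspace, a stage code (`s2i` for stage2/Ingested, `s2r` for stage2/Refined), the dataset name, and the version with dots written as `p`. The sketch below reproduces that pattern so the names can be adapted to another workspace. The helper `lake_db_name` is hypothetical and the pattern is inferred from the names used in this notebook, not taken from the OEA class itself.

```python
def lake_db_name(workspace: str, stage_code: str, dataset: str, version: str) -> str:
    """Compose a lake database name in the pattern seen in this notebook (hypothetical helper)."""
    # Assumption: dots in the version are written as 'p', e.g. 'v0.1' becomes 'v0p1'.
    return f"ldb_{workspace}_{stage_code}_{dataset}_{version.replace('.', 'p')}"


# Print the two database names used by the reset cell for the 'sam' workspace.
for stage_code in ('s2i', 's2r'):
    print(lake_db_name('sam', stage_code, 'reading_progress', 'v0.1'))
```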

## Appendix

In [None]:
# generate an initial metadata file for manual modification
metadata = oea.create_metadata_from_lake_db('ldb_sam_s2i_reading_progress_v0p1')
dlw = DataLakeWriter(oea.to_url('stage1/Transactional/reading_progress'))
dlw.write('metadata.csv', metadata)

In [None]:
# Create a SQL database for the ingested Reading Progress data
oea.create_sql_db('stage2/Ingested/reading_progress')