# Test for processing Canvas data

This notebook demonstrates possible data processing and exploration of the Canvas data, using the OEA_py class notebook. 

Most of the data processing in this notebook is also performed by the Canvas module main pipeline. This notebook offers an alternate approach to the same processing, and also supports exploration and visualization of the module data.

The steps are outlined below:
1. Set the workspace,
2. Land Canvas Module Higher Ed. Test Data,
3. Pre-Process Canvas Module Test Data,
4. Ingest the Canvas Module Test Data,
5. Refine the Canvas Module Test Data, 
6. Demonstrate Lake Database Queries/Final Remarks, and
7. Appendix

In [None]:
%run OEA_py

In [None]:
# 1) set the workspace (this determines where in the data lake you'll be writing to and reading from).
# You can work in 'dev', 'prod', or a sandbox with any name you choose.
# For example, Sam the developer can create a 'sam' workspace and expect to find his datasets in the data lake under oea/sandboxes/sam
oea.set_workspace('dev')
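
As a rough illustration of how a workspace name maps to a location in the data lake, the sketch below uses a hypothetical helper, `workspace_root`, which is not part of OEA_py. It assumes that 'dev' and 'prod' sit directly under the `oea` container and that any other name is treated as a sandbox, as described above for 'sam'.

```python
def workspace_root(workspace: str) -> str:
    """Return the assumed data lake root folder for a workspace name (illustrative only)."""
    if workspace in ('dev', 'prod'):
        # Assumption: the two named workspaces live directly under the container.
        return f'oea/{workspace}'
    # Any other name is assumed to be a personal sandbox.
    return f'oea/sandboxes/{workspace}'

print(workspace_root('dev'))
print(workspace_root('sam'))
```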

## 2.) Land Canvas Module Higher Ed. Test Data

Directory: ```GitHub.com (raw data) -> stage1/Transactional/canvas_raw```

The code block below lands 13 OEA Canvas module test data tables, formatted as Canvas Higher Ed. data, in stage1 of your data lake. 

Canvas test data JSON tables landed in stage 1:
 1. **accounts**
 2. **assignments**
 3. **content_tags**
 4. **context_modules**
 5. **courses**
 6. **course_sections**
 7. **enrollments**
 8. **enrollment_terms**
 9. **quiz_submissions**
 10. **quizzes**
 11. **roles**
 12. **submissions**
 13. **users** 

**To-Do's:**
 - Correct the test dataset as needed

In [None]:
# 2.1) Land batch data files into stage1 of the data lake.
# In this example we pull the Canvas HEd test JSON data files from GitHub and land them in oea/dev/stage1/Transactional/canvas_raw/v2.0
import datetime
import requests  # explicit import, in case OEA_py has not already imported it

base_url = 'https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Canvas/test_data/hed_test_data'
currentDateTime = datetime.datetime.now().strftime("%Y-%m-%d %H-%M-%S")

snapshot_tables = [
    'accounts', 'courses', 'course_sections', 'roles', 'users',
    # normally, these three tables should be landed as delta_batch_data but since functionality is limited for processing delta data, we assume they're snapshot for now.
    'enrollments', 'enrollment_terms', 'content_tags', 'context_modules',
]
additive_tables = ['assignments', 'quizzes', 'quiz_submissions', 'submissions']

def land_canvas_table(table, batch_type):
    data = requests.get(f'{base_url}/{table}.json').text
    # The files contain JSON, so they are landed with a .json extension; the pre-processing step reads *.json.
    filename = table.replace('_', '') + '_hed_test_data.json'
    oea.land(data, f'canvas_raw/v2.0/{table}', filename, batch_type, currentDateTime)

for table in snapshot_tables:
    land_canvas_table(table, oea.SNAPSHOT_BATCH_DATA)
for table in additive_tables:
    land_canvas_table(table, oea.ADDITIVE_BATCH_DATA)
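
To make the landed directory layout concrete, the sketch below builds the kind of path that the pre-processing step later reads from (table folder, then batch type folder, then a `rundate=` folder). The helper `landed_path` is hypothetical, and the folder name `snapshot_batch_data` is an assumption about how the batch type appears in the lake; only the `rundate=%Y-%m-%d %H-%M-%S` format is taken from this notebook.

```python
import datetime

def landed_path(table: str, filename: str, batch_type: str, run_time: datetime.datetime) -> str:
    """Build an illustrative stage1 path for one landed file."""
    rundate = run_time.strftime("%Y-%m-%d %H-%M-%S")
    return f'stage1/Transactional/canvas_raw/v2.0/{table}/{batch_type}/rundate={rundate}/{filename}'

# 'snapshot_batch_data' is an assumed folder name, used here for illustration only.
example = landed_path('content_tags', 'contenttags_hed_test_data.json',
                      'snapshot_batch_data', datetime.datetime(2023, 1, 15, 9, 30, 0))
print(example)
```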

## 3.) Pre-Process Canvas Module Test Data

Directory: ```stage1/Transactional/canvas_raw -> stage1/Transactional/canvas```

This step pre-processes the Canvas module test data, reading it from stage1 and writing it back to stage1.

The code blocks in this step read in the original JSON tables using the ```pd.read_json(..., lines=True)``` function, perform any ad hoc data conversions, and write each table to stage1 as a CSV.

**To-Do's:**
 - Check whether this test data matches production data, with raw JSONs oriented as records (one JSON row represents a row in the df).
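
The records orientation mentioned above can be illustrated without Spark or pandas. The sketch below is a standard-library stand-in for the JSON-to-CSV conversion, not the notebook's actual method: it parses two made-up records, one JSON object per line, and writes them out as CSV text.

```python
import csv
import io
import json

# Two made-up records in JSON Lines form (one JSON object per line).
raw = '{"id": 1, "name": "Account A", "parent_account_id": null}\n{"id": 2, "name": "Account B", "parent_account_id": 1}\n'

# Parse each non-empty line as one record (one row of the eventual table).
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# Write the records out as CSV with a header row; None becomes an empty field.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator='\n')
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```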

In [None]:
# 3) This step pre-processes the Canvas data: it reads in the JSONs as records, corrects any schema discrepancies, and then writes out the df as a CSV in stage1.
# No data transformation happens in this step beyond reading in the column dtypes properly.
def preprocess_canvas_dataset(tables_source):
    items = oea.get_folders(tables_source)
    for item in items: 
        if item == '_preprocessed_tables':
            logger.info('Ignoring existing _preprocessed_tables folder.')
        else:
            table_path = tables_source +'/'+ item
            # find the batch data type of the table
            batch_type_folder = oea.get_folders(table_path)
            batch_type = batch_type_folder[0]
            # grab only the latest folder in stage1, used to write the JSON -> CSV to the same rundate folder timestamp
            # idea is to mimic the same directory structure of tables landed in stage1
            latest_dt = oea.get_latest_runtime(f'{table_path}/{batch_type}', "rundate=%Y-%m-%d %H-%M-%S")
            latest_dt = latest_dt.strftime("%Y-%m-%d %H-%M-%S")
            pdf = pd.read_json(oea.to_url(f'{table_path}/{batch_type}/rundate={latest_dt}/*.json'),lines=True)
            if item == 'submissions':
                pdf[['graded_anonymously', 'excused']] = pdf[['graded_anonymously', 'excused']].astype(str) # NOTE: df doesn't load properly if array columns aren't cast to strings
            df = spark.createDataFrame(pdf)
            # ad hoc step(s) 
            if item == 'accounts':
                df = df.withColumn('parent_account_id', df['parent_account_id'].cast(LongType()))
            elif item == 'assignments':
                df = df.withColumn('submission_types', df['submission_types'].cast(StringType()))
            elif item == 'quiz_submissions':
                df = df.withColumn('score', F.round(df['score'], 2)).withColumn('kept_score', F.round(df['kept_score'], 2)).withColumn('score_before_regrade', F.round(df['score_before_regrade'], 2))
            elif item == 'submissions':
                df = df.withColumn('quiz_submission_id', df['quiz_submission_id'].cast(LongType())).withColumn('score', F.round(df['score'], 2)).withColumn('published_score', F.round(df['published_score'], 2))
            else:
                logger.info(f'no ad hoc processing needed for the Canvas {item} table.')
            # create the new location for the converted CSVs, and write back to stage1
            new_table_path = f'stage1/Transactional/canvas/v{version}/{item}/{batch_type}/rundate={latest_dt}'
            df.coalesce(1).write.save(oea.to_url(f'{new_table_path}'), format='csv', mode='overwrite', header='true', mergeSchema='true')
            # remove the _SUCCESS file
            oea.rm_if_exists(new_table_path + '/_SUCCESS', False)
            logger.info('Pre-processed table: ' + item + ' from: ' + table_path)
    logger.info('Finished pre-processing Canvas tables')

In [None]:
# set the version number and pre-process the dataset
version = '2.0'
preprocess_canvas_dataset(f'stage1/Transactional/canvas_raw/v{version}')
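
The pre-processing function relies on finding the most recent `rundate=` folder for each table. The sketch below shows one way that selection could work, using a hypothetical helper `latest_rundate` and made-up folder names; it is not the implementation of `oea.get_latest_runtime`, only an illustration of the same folder-name format.

```python
import datetime

RUNDATE_FORMAT = "rundate=%Y-%m-%d %H-%M-%S"

def latest_rundate(folder_names):
    """Return the most recent timestamp among rundate folder names."""
    return max(datetime.datetime.strptime(name, RUNDATE_FORMAT) for name in folder_names)

# Made-up folder names for illustration.
folders = ['rundate=2023-01-15 09-30-00', 'rundate=2023-02-01 08-00-00', 'rundate=2022-12-31 23-59-59']
latest = latest_rundate(folders)
print(latest.strftime("%Y-%m-%d %H-%M-%S"))
```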

## 4.) Ingest the Canvas Module Test Data

Directory: ```stage1/Transactional/canvas -> stage2/Ingested/canvas```

This step ingests the Canvas module test data from stage1 to stage2/Ingested.

The code blocks in this step ingest the data using the ```oea.ingest()``` function, as usual.

In [None]:
# this function ingests each Canvas table from stage1/Transactional/canvas/...
def ingest_canvas_dataset(tables_source):
    items = oea.get_folders(f'stage1/Transactional/{tables_source}')
    for item in items: 
        table_path = f'{tables_source}/{item}'
        try:
            # skip the metadata file; every other item is ingested using its primary key
            if item == 'metadata.csv':
                logger.info('ignore metadata csv - not a table to be ingested')
            elif item == 'content_tags':
                oea.ingest(table_path, 'content_id')
            else:
                oea.ingest(table_path, 'id')
        except AnalysisException as e:
            # This means the table may not have been properly ingested, e.g. because the primary key does not match the columns in the table.
            logger.warning(f'Could not ingest the Canvas {item} table: {e}')
    logger.info('Finished ingesting the most recent Canvas data')

In [None]:
# ingest the canvas dataset
version = '2.0'
ingest_canvas_dataset(f'canvas/v{version}')

In [None]:
# 4.1) Now you can run queries against the auto-generated "lake database" with the ingested Canvas data.
df = spark.sql("select * from ldb_dev_s2i_canvas_v2p0.course_sections")
display(df.limit(10))
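
The lake database names used in this notebook (`ldb_dev_s2i_canvas_v2p0` and `ldb_dev_s2r_canvas_v2p0`) appear to follow a pattern based on the workspace and the stage2 path. The sketch below reproduces that pattern with a hypothetical helper, `lake_db_name`; the rule is inferred from the two names in this notebook and is not taken from the OEA_py source.

```python
def lake_db_name(workspace: str, stage_path: str) -> str:
    """Derive a lake database name from a path like 'stage2/Ingested/canvas/v2.0' (inferred pattern)."""
    stage, zone, module, version = stage_path.split('/')
    # 'stage2' + 'Ingested' -> 's2i'; 'stage2' + 'Refined' -> 's2r'
    stage_code = 's' + stage[-1] + zone[0].lower()
    # 'v2.0' -> 'v2p0'
    version_code = version.replace('.', 'p')
    return f'ldb_{workspace}_{stage_code}_{module}_{version_code}'

print(lake_db_name('dev', 'stage2/Ingested/canvas/v2.0'))
print(lake_db_name('dev', 'stage2/Refined/canvas/v2.0'))
```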

## 5.) Refine the Canvas Module Test Data

Directory: ```stage2/Ingested/canvas -> stage2/Refined/canvas```

This step refines the Canvas test data from stage2/Ingested to stage2/Refined, using the module's metadata file. This step is responsible for pseudonymization, which protects sensitive student information by either hashing or masking the sensitive columns. 

Tables are written to either ```stage2/Refined/canvas/v2.0/general``` or ```stage2/Refined/canvas/v2.0/sensitive```: the pseudonymized tables go to ```general```, while the lookup tables that hold the mapping for hashed or masked sensitive columns go to ```sensitive```.
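
As a minimal sketch of the hashing and masking idea, the example below pseudonymizes a single made-up user record. It is not the OEA implementation: the salt, the choice of SHA-256, the `*` mask value, and the helper name `pseudonymize_row` are all assumptions for illustration. Only the `_pseudonym` column suffix is taken from this notebook.

```python
import hashlib

SALT = 'example-salt'  # assumption: a fixed salt, for illustration only

def pseudonymize_row(row, operations):
    """Return (pseudonymized row, lookup row) for one record."""
    pseudo, lookup = {}, {}
    for column, value in row.items():
        operation = operations.get(column, 'no-op')
        if operation == 'hash':
            hashed = hashlib.sha256((str(value) + SALT).encode('utf-8')).hexdigest()
            pseudo[column + '_pseudonym'] = hashed
            # The lookup row keeps the original value next to its hash.
            lookup[column] = value
            lookup[column + '_pseudonym'] = hashed
        elif operation == 'mask':
            pseudo[column] = '*'
        else:
            pseudo[column] = value
    return pseudo, lookup

user = {'id': 101, 'name': 'Sam Example', 'workflow_state': 'registered'}
pseudo, lookup = pseudonymize_row(user, {'id': 'hash', 'name': 'mask'})
print(pseudo)
print(lookup)
```

In this sketch the pseudonymized row would belong in the ```general``` folder and the lookup row in the ```sensitive``` folder.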


In [None]:
def refine_canvas(entity_path, metadata=None, primary_key='id'):
    source_path = f'stage2/Ingested/{entity_path}'
    primary_key = oea.fix_column_name(primary_key) # fix the column name, in case it has a space in it or some other invalid character
    path_dict = oea.parse_path(source_path)
    sink_general_path = path_dict['entity_parent_path'].replace('Ingested', 'Refined') + '/general/' + path_dict['entity']
    sink_sensitive_path = path_dict['entity_parent_path'].replace('Ingested', 'Refined') + '/sensitive/' + path_dict['entity'] + '_lookup'
    if not metadata:
        all_metadata = oea.get_metadata_from_path(path_dict['entity_parent_path'])
        metadata = all_metadata[path_dict['entity']]

    df_changes = oea.get_latest_changes(source_path, sink_general_path)
    spark_schema = oea.to_spark_schema(metadata)
    df_changes = oea.modify_schema(df_changes, spark_schema)        

    if df_changes.count() > 0:
        df_pseudo, df_lookup = oea.pseudonymize(df_changes, metadata)
        oea.upsert(df_pseudo, sink_general_path, primary_key) # todo: remove this assumption that the primary key will always be hashed during pseudonymization
        oea.upsert(df_lookup, sink_sensitive_path, primary_key)    
        oea.add_to_lake_db(sink_general_path)
        oea.add_to_lake_db(sink_sensitive_path)
        logger.info(f'Processed {df_changes.count()} updated rows from {source_path} into stage2/Refined')
    else:
        logger.info(f'No updated rows in {source_path} to process.')
    return df_changes.count()
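
The refine function above writes changed rows with an upsert keyed on the primary key. The sketch below illustrates the general upsert idea on plain Python dictionaries, using a hypothetical helper `upsert_rows` and made-up rows; it assumes that a changed row replaces any existing row with the same key, and it says nothing about how `oea.upsert` is implemented.

```python
def upsert_rows(existing_rows, changed_rows, primary_key):
    """Merge changed rows into existing rows, matching on the primary key."""
    merged = {row[primary_key]: row for row in existing_rows}
    for row in changed_rows:
        # Replace the row if the key already exists, otherwise insert it.
        merged[row[primary_key]] = row
    return list(merged.values())

existing = [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
changes = [{'id': 2, 'name': 'B'}, {'id': 3, 'name': 'c'}]
result = upsert_rows(existing, changes, 'id')
print(result)
```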

In [None]:
# 5) this step refines the data through the use of metadata (this is where the pseudonymization of the data occurs).
def refine_canvas_dataset(tables_source):
    items = oea.get_folders(tables_source)
    for item in items: 
        table_path = tables_source +'/'+ item
        if item == 'metadata.csv':
            logger.info('ignore metadata processing, since this is not a table to be refined')
        else:
            try:
                # accounts and users are keyed on the hashed id; content_tags is keyed on content_id
                if item == 'accounts':
                    refine_canvas('canvas/v2.0/accounts', metadata[item], 'id_pseudonym')
                elif item == 'content_tags':
                    refine_canvas('canvas/v2.0/content_tags', metadata[item], 'content_id')
                elif item == 'users':
                    refine_canvas('canvas/v2.0/users', metadata[item], 'id_pseudonym')
                else:
                    refine_canvas('canvas/v2.0/' + item, metadata[item], 'id')
                logger.info('Refined table: ' + item + ' from: ' + table_path)
            except AnalysisException as e:
                # This means the table may not have been properly refined due to errors with the primary key not aligning with columns expected in the lookup table.
                logger.warning(f'Could not refine the Canvas {item} table: {e}')
    logger.info('Finished refining Canvas tables')

In [None]:
metadata = oea.get_metadata_from_url('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_catalog/Canvas/test_data/metadata_v2.csv')
refine_canvas_dataset('stage2/Ingested/canvas/v2.0')
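
The per-table primary keys used in the refine step can also be expressed as a small lookup with a default, which avoids a chain of conditionals. The sketch below is one possible arrangement; the names `PRIMARY_KEYS` and `primary_key_for` are hypothetical, while the key values are the ones used in the cell above.

```python
# Tables whose primary key differs from the default 'id'.
PRIMARY_KEYS = {
    'accounts': 'id_pseudonym',
    'content_tags': 'content_id',
    'users': 'id_pseudonym',
}

def primary_key_for(table: str) -> str:
    """Return the primary key column for a Canvas table, defaulting to 'id'."""
    return PRIMARY_KEYS.get(table, 'id')

for table in ['accounts', 'content_tags', 'courses']:
    print(table, primary_key_for(table))
```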

## 6.) Demonstrate Lake Database Queries/Final Remarks

In [None]:
# tables with non-hashed primary keys are not automatically added to the lake db, so add these tables here
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/assignments')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/content_tags')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/context_modules')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/courses')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/course_sections')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/enrollments')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/enrollment_terms')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/quizzes')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/quiz_submissions')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/roles')
oea.add_to_lake_db('stage2/Refined/canvas/v2.0/general/submissions')

In [None]:
# 6) Now you can query the refined data tables in the lake db
df = spark.sql("select * from ldb_dev_s2r_canvas_v2p0.enrollments")
display(df)
df.printSchema()
df = spark.sql("select * from ldb_dev_s2r_canvas_v2p0.users")
display(df)
df.printSchema()
# You can use the "lookup" table for joins (people with restricted access won't be able to perform this query because they won't have access to data in the "sensitive" folder in the data lake)
df = spark.sql("select e.course_section_id, e.type, e.workflow_state, u.id_pseudonym, u.name \
                from ldb_dev_s2r_canvas_v2p0.enrollments e, ldb_dev_s2r_canvas_v2p0.users u where e.user_id_pseudonym = u.id_pseudonym")
display(df.limit(10))
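
The join above can be tried outside Synapse with an in-memory SQLite database. The sketch below uses made-up rows and the same column names as the query above, written with an explicit `JOIN ... ON` clause; it is only meant to show the shape of the result, not real Canvas data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Made-up rows; the second enrollment has no matching user and drops out of the join.
conn.executescript("""
    CREATE TABLE enrollments (course_section_id INTEGER, type TEXT, workflow_state TEXT, user_id_pseudonym TEXT);
    CREATE TABLE users (id_pseudonym TEXT, name TEXT);
    INSERT INTO enrollments VALUES (10, 'StudentEnrollment', 'active', 'abc123');
    INSERT INTO enrollments VALUES (11, 'TeacherEnrollment', 'active', 'zzz999');
    INSERT INTO users VALUES ('abc123', '*');
""")
rows = conn.execute("""
    SELECT e.course_section_id, e.type, e.workflow_state, u.id_pseudonym, u.name
    FROM enrollments e
    JOIN users u ON e.user_id_pseudonym = u.id_pseudonym
""").fetchall()
conn.close()
print(rows)
```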

In [None]:
# Run this cell to reset this example (deleting all the example Canvas data in your workspace)
oea.rm_if_exists('stage1/Transactional/canvas_raw')
oea.rm_if_exists('stage1/Transactional/canvas')
oea.rm_if_exists('stage2/Ingested/canvas')
oea.rm_if_exists('stage2/Refined/canvas')
oea.drop_lake_db('ldb_dev_s2i_canvas_v2p0')
oea.drop_lake_db('ldb_dev_s2r_canvas_v2p0')

## 7.) Appendix

In [None]:
# generate an initial metadata file for manual modification
metadata = oea.create_metadata_from_lake_db('ldb_dev_s2i_canvas_v2p0')
dlw = DataLakeWriter(oea.to_url('stage1/Transactional/canvas'))
dlw.write('metadata.csv', metadata)

In [None]:
# Create a sql db for the ingested Canvas data
oea.create_sql_db('stage2/Ingested/canvas')

In [None]:
oea.create_sql_db('stage2/Refined/canvas')