# Canvas Module Ingestion - Pre-Processing

This notebook demonstrates the utility of the OEA_py class notebook by converting the Canvas tables from record-oriented JSON files to CSVs before ingestion. Once any ad hoc column-dtype conversions are complete, each table is written to stage1 as a CSV, overwriting any previous output at the same path.

The steps below describe how this notebook converts the Canvas module JSON tables:
- Set the workspace in which the Canvas tables are to be converted.
- Read in the original JSONs landed in ```stage1/Transactional/canvas_raw/...```, perform any ad hoc data conversions (e.g. the accounts table needs the parent_account_id column cast to LongType rather than DoubleType), and write each table to stage1 as a CSV: ```stage1/Transactional/canvas/...```
- One function is defined and used:
   1. **preprocess_canvas_dataset**: main function that reads each JSON table into a pandas df using ```pd.read_json(..., lines=True)```, converts it to a spark df, and corrects the column dtypes as needed.
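
The dtype problem mentioned above can be reproduced in pandas alone. The following is a minimal sketch, assuming a made-up two-record sample (not real Canvas data): a nullable integer ID column is inferred as float64 when the JSON lines are read, so it would be written to the CSV as 1.0 unless it is cast to an integer type first. The notebook itself performs this cast in Spark with LongType; the pandas nullable Int64 cast used here is only an illustrative stand-in.

```python
import io
import pandas as pd

# Hypothetical line-delimited JSON sample: one record per line, with a null parent ID.
sample = '\n'.join([
    '{"id": 1, "parent_account_id": null}',
    '{"id": 2, "parent_account_id": 1}',
])

pdf = pd.read_json(io.StringIO(sample), lines=True)
inferred_dtype = str(pdf['parent_account_id'].dtype)  # the null forces a float column

# Cast to pandas' nullable integer type so that IDs are not written as 1.0.
pdf['parent_account_id'] = pdf['parent_account_id'].astype('Int64')
csv_lines = pdf.to_csv(index=False).splitlines()
```

After the cast, the missing parent ID is written as an empty field and the remaining IDs are written without a decimal point.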

**This notebook may need to be updated or removed from the pipeline when processing production data.**

In [None]:
workspace = 'dev'
version = '2.0'

In [None]:
%run OEA_py

In [None]:
# 1) set the workspace (this determines where in the data lake you'll be writing to and reading from).
# You can work in 'dev', 'prod', or a sandbox with any name you choose.
# For example, Sam the developer can create a 'sam' workspace and expect to find his datasets in the data lake under oea/sandboxes/sam
oea.set_workspace(workspace)

In [None]:
# 2) this step pre-processes the Canvas data: it reads in the JSONs as records, corrects any schema discrepancies, and then writes out the df as a CSV in stage1.
# no data transformation happens in this step other than reading in the column dtypes correctly.
def preprocess_canvas_dataset(tables_source):
    items = oea.get_folders(tables_source)
    for item in items: 
        if item == '_preprocessed_tables':
            logger.info('Ignoring existing _preprocessed_tables folder.')
        else:
            table_path = f'{tables_source}/{item}'
            # find the batch data type of the table
            batch_type_folder = oea.get_folders(table_path)
            batch_type = batch_type_folder[0]
            # grab only the latest folder in stage1, used to write the JSON -> CSV to the same rundate folder timestamp
            # idea is to mimic the same directory structure of tables landed in stage1
            latest_dt = oea.get_latest_runtime(f'{table_path}/{batch_type}', "rundate=%Y-%m-%d %H-%M-%S")
            latest_dt = latest_dt.strftime("%Y-%m-%d %H-%M-%S")
            pdf = pd.read_json(oea.to_url(f'{table_path}/{batch_type}/rundate={latest_dt}/*.json'),lines=True)
            if item == 'submissions':
                pdf[['graded_anonymously', 'excused']] = pdf[['graded_anonymously', 'excused']].astype(str) # NOTE: df doesn't load properly if array columns aren't cast to strings
            df = spark.createDataFrame(pdf)
            # ad hoc step(s) 
            if item == 'accounts':
                df = df.withColumn('parent_account_id', df['parent_account_id'].cast(LongType()))
            elif item == 'assignments':
                df = df.withColumn('submission_types', df['submission_types'].cast(StringType()))
            elif item == 'quiz_submissions':
                df = (df.withColumn('score', F.round(df['score'], 2))
                        .withColumn('kept_score', F.round(df['kept_score'], 2))
                        .withColumn('score_before_regrade', F.round(df['score_before_regrade'], 2)))
            elif item == 'submissions':
                df = (df.withColumn('quiz_submission_id', df['quiz_submission_id'].cast(LongType()))
                        .withColumn('score', F.round(df['score'], 2))
                        .withColumn('published_score', F.round(df['published_score'], 2)))
            else:
                logger.info(f'no ad hoc processing needed for the Canvas {item} table.')
            # create the new location for the converted CSVs, and write back to stage1
            new_table_path = f'stage1/Transactional/canvas/v{version}/{item}/{batch_type}/rundate={latest_dt}'
            df.coalesce(1).write.save(oea.to_url(new_table_path), format='csv', mode='overwrite', header='true', mergeSchema='true')
            # remove the _SUCCESS file
            oea.rm_if_exists(new_table_path + '/_SUCCESS', False)
            logger.info('Pre-processed table: ' + item + ' from: ' + table_path)
    logger.info('Finished pre-processing Canvas tables')
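
The run-date handling inside the function can be illustrated without a data lake. This is a minimal sketch using only the standard library. The folder names, the batch type, and the helper latest_rundate are hypothetical, and the helper only approximates what oea.get_latest_runtime is assumed to do (pick the most recent rundate folder); its actual implementation is not shown in this notebook.

```python
from datetime import datetime

RUNDATE_FORMAT = 'rundate=%Y-%m-%d %H-%M-%S'

def latest_rundate(folder_names):
    """Return the most recent timestamp among folders named like 'rundate=2023-01-05 09-30-00'."""
    return max(datetime.strptime(name, RUNDATE_FORMAT) for name in folder_names)

# Hypothetical folder listing for one table and batch type.
folders = [
    'rundate=2023-01-05 09-30-00',
    'rundate=2023-03-02 08-15-00',
    'rundate=2022-12-31 23-59-59',
]
latest_dt = latest_rundate(folders).strftime('%Y-%m-%d %H-%M-%S')

# Mirror the raw directory layout under the converted-CSV root.
# 'delta_batch_data' is a placeholder batch type, not taken from the notebook.
version, item, batch_type = '2.0', 'accounts', 'delta_batch_data'
new_table_path = f'stage1/Transactional/canvas/v{version}/{item}/{batch_type}/rundate={latest_dt}'
```

Parsing the folder names into datetime values before comparing them makes the ordering explicit, and formatting the result with the same pattern keeps the output folder name identical to the raw one.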

In [None]:
# 3) pre-process the raw Canvas tables for the version set in the first cell
preprocess_canvas_dataset(f'stage1/Transactional/canvas_raw/v{version}')