# Chronic Absenteeism Package: Build StudentModel Table
This notebook is intended to explore the capabilities of the OEA Chronic Absenteeism package by creating a stage 3 StudentModel table. 

**It is recommended that you review and execute all relevant module pipelines before testing this Chronic Absenteeism notebook.**

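The StudentModel table is built by curating the data sources loaded below (the Contoso SIS attendance table, the stage 2p digital activity table, and the Microsoft Education Insights roster tables). It is landed in stage 3 under the "chronic_absenteeism" directory, as StudentModel_pseudo in stage 3p and StudentModel_lookup in stage 3np.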

This notebook curates these data sources using the following steps:
 1. **Build the SIS fact table:** Read in all tables required; clean and extract a Student table (where each row is a student in the education system).
 2. **In-Person Attendance Curation:** Clean the studentattendance_pseudo table by calculating the total number of days each student was present. Create a new column in the StudentModel table that holds this data.
    * Then, calculate the percentage of days each student was present in person. Create another column in the StudentModel table that holds this data.
    * Finally, create a new column that flags each student as chronically absent or not (1 if chronically absent, 0 otherwise). In this notebook, chronic absence is defined as being absent 10% of the time or more.
 3. **Digital Activity Curation:** Clean the digital activity data for the Microsoft Insights and the Clever modules.
    * *Insights* - create 3 new columns in the StudentModel table: one for the total number of days a student was digitally active, one for the average number of Insights-recorded activities per student per active day, and one for the average time each student spent in a Teams meeting.
    * *Clever* - create 4+ new columns in the StudentModel table: one for the total number of days a student was digitally active, one for the daily average number of resources accessed per student, one for the distinct number of resources used over the entire Clever-recorded time, and one column for each resource recorded - calculating the average number of times a student accessed each resource per day. 
 4. **Normalize all Digital Activity Data by Student's School and Grade:** Create 3 new columns holding averages normalized within each student's school and grade:
    * *Insights_normAvgNumActivitiesPerDay*,
    * *Insights_normAvgSecInTeamsMeetings*, and
    * *Clever_normAvgNumAppsUsedPerDay*.
 5. **Write the StudentModel_pseudo table to Stage 3p:** Write the final StudentModel_pseudo table to stage 3p. 
 6. **Build StudentModel_lookup table:** Create the lookup table that maps hashed and masked PII columns to their unhashed and unmasked values.
 7. **Write the StudentModel_lookup table to Stage 3np:** Write the final StudentModel_lookup table to stage 3np.

In [1]:
%run /OEA_py

In [2]:
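# 0) Initialize the OEA framework.
# F is used throughout this notebook; import it explicitly in case OEA_py does not already expose it.
from pyspark.sql import functions as F
oea = OEA()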

## Read in relevant data sources

In [28]:
dfContosoSIS_studentattendance = oea.load('contoso_sis', 'studentattendance_pseudo')
df_digActivity = oea.load_delta('stage2p/digital_activity')
dfInsights_aaduserpersonmapping = oea.load('M365', 'AadUserPersonMapping_pseudo')
dfInsights_person = oea.load('M365', 'Person_pseudo')
dfInsights_personOrgRole = oea.load('M365', 'PersonOrganizationRole_pseudo')
dfInsights_organization = oea.load('M365', 'Organization_pseudo')
dfInsights_refDefinition = oea.load('M365', 'RefDefinition_pseudo')

## 1.) Build the SIS fact table 
This table is currently built from the Microsoft Education Insights SIS data. Each row of the result is one student, with their grade and school name.

In [29]:
dfInsights = dfInsights_personOrgRole.join(dfInsights_person, dfInsights_personOrgRole.PersonId_pseudonym == dfInsights_person.Id_pseudonym, how='inner')
dfInsights = dfInsights.select('PersonId_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'RefRoleId', 'RefGradeLevelId', 'OrganizationId')

In [30]:
dfInsights = dfInsights.join(dfInsights_organization, dfInsights.OrganizationId == dfInsights_organization.Id, how='inner')
dfInsights = dfInsights.withColumnRenamed('Name', 'OrganizationName')
dfInsights = dfInsights.select('PersonId_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'RefRoleId', 'RefGradeLevelId', 'OrganizationId', 'OrganizationName')

In [31]:
dfInsights = dfInsights.join(dfInsights_refDefinition, dfInsights.RefRoleId == dfInsights_refDefinition.Id, how='inner')
dfInsights = dfInsights.withColumnRenamed('Code', 'PersonRole')
dfInsights = dfInsights.filter(dfInsights['PersonRole'] == 'Student')
dfInsights = dfInsights.select('PersonId_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'PersonRole', 'RefGradeLevelId', 'OrganizationId', 'OrganizationName')

In [32]:
dfInsights = dfInsights.join(dfInsights_refDefinition, dfInsights.RefGradeLevelId == dfInsights_refDefinition.Id, how='inner')
dfInsights = dfInsights.withColumnRenamed('Code', 'StudentGrade')
dfInsights = dfInsights.select('PersonId_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'PersonRole', 'StudentGrade', 'OrganizationId', 'OrganizationName')

In [33]:
dfInsights_aaduserpersonmapping = dfInsights_aaduserpersonmapping.withColumnRenamed('PersonId_pseudonym', 'StudentId_internal_pseudonym')
dfInsights = dfInsights.join(dfInsights_aaduserpersonmapping, dfInsights.PersonId_pseudonym == dfInsights_aaduserpersonmapping.StudentId_internal_pseudonym, how='inner')
dfInsights = dfInsights.withColumnRenamed('ObjectId_pseudonym', 'StudentId_external_pseudonym').withColumnRenamed('OrganizationName', 'SchoolName')
dfInsights = dfInsights.select('StudentId_internal_pseudonym', 'StudentId_external_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'StudentGrade', 'SchoolName')
display(dfInsights.limit(10))

## 2.) In-Person Attendance Curation

### From Contoso SIS Module: studentattendance

Creates 3 new columns for the Student-model table, based on the Contoso SIS attendance data:
 - Count of the number of days a student has attended school in-person.
 - Calculation of the in-person attendance percentage.
 - Chronic Absence Flag: if the percentage of days missed is greater than or equal to 10%, the student is flagged with a 1.
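
The arithmetic behind these three columns can be checked on a small example before running the Spark cell. The following is a minimal plain-Python sketch, not the notebook's implementation. The student ids, the dates, and the "A" (absent) code are made-up assumptions for illustration; only the "P" (present) code and the column names come from this notebook.

```python
from collections import defaultdict

# Made-up illustration data: 10 recorded days, three hypothetical students.
dates = [f"2021-09-{day:02d}" for day in range(1, 11)]
absences = {
    "s1": set(),                                        # never absent
    "s2": {"2021-09-03"},                               # absent 1 of 10 days
    "s3": {"2021-09-02", "2021-09-06", "2021-09-09"},   # absent 3 of 10 days
}
# "P" marks a present day (as in this notebook); "A" is an assumed absent code.
records = [
    (student_id, date, "A" if date in missed else "P")
    for student_id, missed in absences.items()
    for date in dates
]


def summarize_attendance(records, absence_threshold=0.10):
    """Return days attended, share of days attended, and the chronic absence flag."""
    # Number of distinct dates that appear anywhere in the attendance data.
    num_days_recorded = len({date for _, date, _ in records})
    days_attended = defaultdict(int)
    for student_id, _, code in records:
        days_attended[student_id] += 1 if code == "P" else 0
    summary = {}
    for student_id, attended in days_attended.items():
        share_missed = (num_days_recorded - attended) / num_days_recorded
        summary[student_id] = {
            "InPerson_numDaysAttended": attended,
            "InPerson_percentDaysAttended": attended / num_days_recorded,
            # Flag with 1 when the share of days missed is 10% or more.
            "InPerson_chronicAbsFlag": 1 if share_missed >= absence_threshold else 0,
        }
    return summary


attendance_summary = summarize_attendance(records)
```

In this sketch the student who missed exactly 1 of 10 days sits on the 10% boundary and is flagged, which matches the "greater than or equal to 10%" definition above.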

In [34]:
dfInPersonAttendance = dfContosoSIS_studentattendance.select('student_id_pseudonym', 'attendance_date', 'AttendanceCode')
# first, find the number of distinct days recorded for in-person attendance
num_days_recorded = dfInPersonAttendance.select('attendance_date').distinct().count()
# build final attendance table
dfInPersonAttendance = dfInPersonAttendance.withColumn('AttendanceCode_value', F.when(F.col('AttendanceCode') == "P", 1).otherwise(0))
dfInPersonAttendance = dfInPersonAttendance.groupBy("student_id_pseudonym").sum("AttendanceCode_value")
dfInPersonAttendance = dfInPersonAttendance.withColumnRenamed('sum(AttendanceCode_value)', 'InPerson_numDaysAttended')
dfInPersonAttendance = dfInPersonAttendance.withColumn('InPerson_percentDaysAttended', F.col('InPerson_numDaysAttended')/num_days_recorded)
# flag students who missed 10% or more of the recorded days (i.e. attended 90% or less)
dfInPersonAttendance = dfInPersonAttendance.withColumn('InPerson_chronicAbsFlag', F.when(F.col('InPerson_percentDaysAttended') > 0.9, 0).otherwise(1))
display(dfInPersonAttendance.limit(10))

In [35]:
# join enriched in-person attendance data back to Student-model table
dfInsights = dfInsights.join(dfInPersonAttendance, dfInsights.StudentId_external_pseudonym == dfInPersonAttendance.student_id_pseudonym, how='inner')
dfInsights = dfInsights.drop('student_id_pseudonym')
display(dfInsights.limit(10))

## 3.) Digital Activity Curation

### Insights: Digital Activity
Creates 3 new columns for the Student-model table, based on the Insights digital activity data:
 - Count of the number of days a student is active.
 - Average number of digital activities a student performs on the days they are active.
 - Average amount of time (in seconds) a student spends in a Teams meeting.
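
To make the three Insights metrics concrete, here is a minimal plain-Python sketch of the same calculations on a handful of made-up events. It is an illustration rather than the notebook's Spark code. The actor ids, timestamps, and durations are invented, and every event is assumed to carry a time-on-task value.

```python
from collections import defaultdict
from datetime import datetime

# Made-up events: (event_actor, event_eventTime, time on task in seconds).
events = [
    ("s1", "2021-09-01T08:00:00", 600),
    ("s1", "2021-09-01T13:30:00", 300),
    ("s1", "2021-09-02T09:15:00", 900),
    ("s2", "2021-09-01T10:00:00", 1200),
]


def insights_metrics(events):
    """Compute days active, activities per active day, and mean seconds per event."""
    active_days = defaultdict(set)   # actor -> set of distinct calendar dates
    num_events = defaultdict(int)    # actor -> total number of events
    total_sec = defaultdict(int)     # actor -> total seconds across events
    for actor, timestamp, seconds in events:
        # Truncate the timestamp to a date, so several events on one day count once.
        active_days[actor].add(datetime.fromisoformat(timestamp).date())
        num_events[actor] += 1
        total_sec[actor] += seconds
    return {
        actor: {
            "Insights_numDaysActive": len(days),
            "Insights_avgNumActivitiesPerDay": round(num_events[actor] / len(days), 3),
            "Insights_avgSecInTeamsMeetings": round(total_sec[actor] / num_events[actor], 3),
        }
        for actor, days in active_days.items()
    }


insights_summary = insights_metrics(events)
```

Here the hypothetical student s1 has 3 events spread over 2 distinct days, so the average is 1.5 activities per active day. Only days with activity enter the denominator, which is why this figure is an average "on the days they are active".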

In [36]:
# isolate Insights data
df_digActivity_insights = df_digActivity.filter(df_digActivity['event_object'] == 'MS_Insights')
# get the total number of days from Insights-recorded digital activity per student
metadata_table1 = df_digActivity_insights.select('event_actor', 'event_eventTime')
metadata_table1 = metadata_table1.withColumn('event_eventTime', F.to_date(F.col('event_eventTime')))
metadata_table1 = metadata_table1.distinct()
metadata_table1 = metadata_table1.withColumn('Insights_numDaysActive', F.lit(1))
metadata_table1 = metadata_table1.groupBy('event_actor').sum('Insights_numDaysActive')
metadata_table = metadata_table1.withColumnRenamed('sum(Insights_numDaysActive)', 'Insights_numDaysActive')
display(metadata_table.limit(10))

In [37]:
# get the daily average number of Insights-recorded digital activities per student, for when they are digitally active
metadata_table2 = df_digActivity_insights.select('event_actor', 'event_eventTime')
metadata_table2 = metadata_table2.withColumn('event_eventTime', F.to_date(F.col('event_eventTime')))
metadata_table2 = metadata_table2.groupBy('event_actor', 'event_eventTime').count()
metadata_table2 = metadata_table2.groupBy('event_actor').sum('count')
metadata_table2 = metadata_table2.withColumnRenamed('event_actor', 'studentId').withColumnRenamed('sum(count)', 'Insights_avgNumActivitiesPerDay')
metadata_table = metadata_table.join(metadata_table2, metadata_table.event_actor == metadata_table2.studentId, how='inner')
metadata_table = metadata_table.select('event_actor', 'Insights_numDaysActive', 'Insights_avgNumActivitiesPerDay')
metadata_table = metadata_table.withColumn('Insights_avgNumActivitiesPerDay', F.col('Insights_avgNumActivitiesPerDay')/F.col('Insights_numDaysActive'))
metadata_table = metadata_table.withColumn('Insights_avgNumActivitiesPerDay', F.round(F.col('Insights_avgNumActivitiesPerDay'), 3))
display(metadata_table.limit(10))

In [38]:
# get the average time (in seconds) of Insights-recorded Teams meetings per student
metadata_table3 = df_digActivity_insights.select('event_actor', 'generated_aggregateMeasure_metric_timeOnTaskSec')
metadata_table3 = metadata_table3.withColumn('generated_aggregateMeasure_metric_timeOnTaskSec', F.col('generated_aggregateMeasure_metric_timeOnTaskSec').cast('int'))
metadata_table3 = metadata_table3.groupBy('event_actor').avg('generated_aggregateMeasure_metric_timeOnTaskSec')
metadata_table3 = metadata_table3.withColumnRenamed('avg(generated_aggregateMeasure_metric_timeOnTaskSec)', 'Insights_avgSecInTeamsMeetings')
metadata_table3 = metadata_table3.withColumn('Insights_avgSecInTeamsMeetings', F.round(F.col('Insights_avgSecInTeamsMeetings'), 3))
metadata_table3 = metadata_table3.withColumnRenamed('event_actor', 'studentId')
metadata_table = metadata_table.join(metadata_table3, metadata_table.event_actor == metadata_table3.studentId, how='inner')
dfFinal_metadata = metadata_table.select('event_actor', 'Insights_numDaysActive', 'Insights_avgNumActivitiesPerDay', 'Insights_avgSecInTeamsMeetings')
dfFinal_metadata = dfFinal_metadata.withColumnRenamed('event_actor', 'StudentId_pseudo')
display(dfFinal_metadata.limit(10))

In [39]:
# join enriched Insights digital activity data back to Student-model table
# NOTE: this join may need to use dfInsights.StudentId_external_pseudonym instead, given the latest updates to the Digital Engagement Schema Standard
dfInsights = dfInsights.join(dfFinal_metadata, dfInsights.StudentId_internal_pseudonym == dfFinal_metadata.StudentId_pseudo, how='inner')
dfInsights = dfInsights.drop('StudentId_pseudo', 'StudentId_internal_pseudonym')
display(dfInsights.limit(10))

### Clever: Digital Activity 
Creates 4+ new columns for the Student-model table, based on the Clever digital activity data:
 - Count of the number of days a student is active.
 - Daily average number of resources a student accesses.
 - Distinct number of resources a student has used over the entire Clever-recorded period.
 - Average number of accesses per resource, for each student.
     * Each resource recorded by Clever is pivoted to become a column.

**Notes:**
 - Currently, this enrichment does not handle students who have no data for a given resource; minor edits may be needed to cover these cases.
 - Some steps in the creation of the per-resource columns are specific to the test data; these will require updates when using production data.
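
The gap described in the first note can be seen in a small plain-Python sketch of the per-resource pivot. This is an illustration under stated assumptions, not the notebook's Spark code: the usage rows are made up, and the `fill_missing` parameter is a hypothetical option showing one possible way to treat a student with no rows for a resource.

```python
from collections import defaultdict

# Made-up usage rows: (event_actor, entity_type, number of accesses).
usage = [
    ("s1", "i-Ready", 2),
    ("s1", "i-Ready", 4),
    ("s1", "Tutor.com", 1),
    ("s2", "i-Ready", 5),
]


def pivot_avg_accesses(usage, fill_missing=None):
    """Pivot resources into columns holding each student's mean accesses per row."""
    sums = defaultdict(lambda: [0, 0])  # (actor, resource) -> [total accesses, row count]
    for actor, resource, num_access in usage:
        cell = sums[(actor, resource)]
        cell[0] += num_access
        cell[1] += 1
    actors = sorted({actor for actor, _ in sums})
    resources = sorted({resource for _, resource in sums})
    table = {}
    for actor in actors:
        table[actor] = {}
        for resource in resources:
            if (actor, resource) in sums:
                total, rows = sums[(actor, resource)]
                table[actor][resource] = round(total / rows, 3)
            else:
                # The student has no rows for this resource.
                table[actor][resource] = fill_missing
    return table


pivoted = pivot_avg_accesses(usage)                    # gaps stay as None
pivoted_filled = pivot_avg_accesses(usage, fill_missing=0)  # gaps become 0
```

Whether a gap should stay empty or become 0 is a modelling decision: 0 says the student never accessed the resource, while an empty value says nothing was recorded.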

In [40]:
# isolate Clever Daily Participation data
df_digActivity_clever_dp = df_digActivity.filter(df_digActivity['event_object'] == 'Clever_Daily_Participation')
# get the total number of days from Clever-recorded digital activity per student
metadata_table1 = df_digActivity_clever_dp.select('event_actor', 'generated_aggregateMeasure_metric_used')
metadata_table1 = metadata_table1.withColumn('Clever_numDaysActive', F.when(F.col('generated_aggregateMeasure_metric_used') == "true", 1).otherwise(0))
metadata_table1 = metadata_table1.groupBy('event_actor').sum('Clever_numDaysActive')
metadata_table1 = metadata_table1.withColumnRenamed('sum(Clever_numDaysActive)', 'Clever_numDaysActive')
display(metadata_table1.limit(10))

In [41]:
# isolate Clever Resource Usage data
df_digActivity_clever_ru = df_digActivity.filter(df_digActivity['event_object'] == 'Clever_Resource_Usage')
# get the daily average number of Clever-recorded digital resources accessed per student
metadata_table2 = df_digActivity_clever_ru.select('event_actor', 'event_eventTime', 'entity_type', 'generated_aggregateMeasure_metric_numAccess')
metadata_table2 = metadata_table2.groupBy('event_actor', 'event_eventTime', 'entity_type').count()
metadata_table2 = metadata_table2.groupBy('event_actor', 'event_eventTime').count()
metadata_table2 = metadata_table2.groupBy('event_actor').avg('count')
metadata_table2 = metadata_table2.withColumnRenamed('avg(count)', 'Clever_avgNumAppsUsedPerDay').withColumnRenamed('event_actor', 'StudentId')
metadata_table2 = metadata_table2.withColumn('Clever_avgNumAppsUsedPerDay', F.round(F.col('Clever_avgNumAppsUsedPerDay'), 3))
# join back to general Clever metadata table
dfFinal_metadata = metadata_table1.join(metadata_table2, metadata_table1.event_actor == metadata_table2.StudentId, how='inner')
dfFinal_metadata = dfFinal_metadata.drop('StudentId')
display(dfFinal_metadata.limit(10))

In [42]:
# get the distinct number of Clever-recorded digital resources accessed per student over the entire time-series
metadata_table3 = df_digActivity_clever_ru.select('event_actor', 'entity_type')
metadata_table3 = metadata_table3.distinct()
metadata_table3 = metadata_table3.groupBy('event_actor').count()
metadata_table3 = metadata_table3.withColumnRenamed('count', 'Clever_distinctNumAppsUsedAllTime').withColumnRenamed('event_actor', 'StudentId')
# join back to general Clever metadata table
dfFinal_metadata = dfFinal_metadata.join(metadata_table3, dfFinal_metadata.event_actor == metadata_table3.StudentId, how='inner')
dfFinal_metadata = dfFinal_metadata.drop('StudentId')
display(dfFinal_metadata.limit(10))

In [43]:
# get the daily average number of times Clever recorded a resource was accessed per resource, per student
# each distinct resource is pivoted to a column
metadata_table4 = df_digActivity_clever_ru.select('event_actor', 'entity_type', 'generated_aggregateMeasure_metric_numAccess')
metadata_table4 = metadata_table4.withColumn('generated_aggregateMeasure_metric_numAccess', metadata_table4.generated_aggregateMeasure_metric_numAccess.cast('int'))
metadata_table4 = metadata_table4.groupBy('event_actor').pivot('entity_type').avg('generated_aggregateMeasure_metric_numAccess')
# shorten some column names and replace illegal characters
# NOTE: these will need to be updated depending on the resources collected from production data
metadata_table4 = metadata_table4.withColumnRenamed('Tutor.com', 'TutorDotCom').withColumnRenamed('Edgenuity Courseware/MyPath/UpSmart', 'EdgenuityCourseware') \
.withColumnRenamed('Explore Learning/Rostering', 'ExploreLearning').withColumnRenamed('i-Ready', 'iReady')
metadata_table4 = metadata_table4.select([F.col(col).alias(col.replace(' ', '')) for col in metadata_table4.columns])
# round every column except event_actor and rename columns for final Clever table join
for c in metadata_table4.columns:
    if c != "event_actor":
        metadata_table4 = metadata_table4.withColumn(c, F.round(F.col(c), 3))
        metadata_table4 = metadata_table4.withColumnRenamed(c, 'Clever_avgNumAccessesPerDay_' + c)
    else: 
        metadata_table4 = metadata_table4.withColumnRenamed(c, "StudentId")
# join back to other Clever metadata table
dfFinal_metadata = dfFinal_metadata.join(metadata_table4, dfFinal_metadata.event_actor == metadata_table4.StudentId, how='inner')
dfFinal_metadata = dfFinal_metadata.drop('StudentId')
display(dfFinal_metadata.limit(10))

In [44]:
# join enriched Clever digital activity data back to Student-model table
dfStudentModel = dfInsights.join(dfFinal_metadata, dfInsights.StudentId_external_pseudonym == dfFinal_metadata.event_actor, how='inner')
dfStudentModel = dfStudentModel.drop('event_actor')
display(dfStudentModel.limit(10))

## 4.) Normalize all Digital Activity data by Student's School and Grade
This step finds the minimum and maximum of each average within every school and grade in the StudentModel table, and uses them to create 3 new normalized columns per student:
 - Insights_avgNumActivitiesPerDay -> Insights_normAvgNumActivitiesPerDay
 - Insights_avgSecInTeamsMeetings -> Insights_normAvgSecInTeamsMeetings
 - Clever_avgNumAppsUsedPerDay -> Clever_normAvgNumAppsUsedPerDay

Normalizing is accomplished by the following equation:

$$X_{norm} = {X-X_{min}\over X_{max}-X_{min}}$$
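
As a worked example of this equation, the plain-Python sketch below normalizes one metric within each (school, grade) group. The students, schools, and values are made up for illustration. Returning `None` when a group's maximum equals its minimum is an assumption of this sketch; it is intended to mirror the null that Spark typically produces for division by zero under its default (non-ANSI) settings.

```python
from collections import defaultdict

# Made-up rows: (student, school, grade, Clever_avgNumAppsUsedPerDay).
students = [
    ("s1", "School A", "9", 1.0),
    ("s2", "School A", "9", 3.0),
    ("s3", "School A", "9", 2.0),
    ("s4", "School B", "9", 4.0),  # the only student in this group
]


def min_max_by_group(rows):
    """Min-max normalize each value within its (school, grade) group."""
    groups = defaultdict(list)
    for _, school, grade, value in rows:
        groups[(school, grade)].append(value)
    normalized = {}
    for student, school, grade, value in rows:
        low = min(groups[(school, grade)])
        high = max(groups[(school, grade)])
        if high == low:
            # X_max - X_min is zero, so the normalized value is undefined.
            normalized[student] = None
        else:
            normalized[student] = round((value - low) / (high - low), 3)
    return normalized


normalized_apps = min_max_by_group(students)
```

Within School A, grade 9, the values 1.0, 2.0, and 3.0 map to 0.0, 0.5, and 1.0. A group with a single student, or one where every student has the same value, has no defined normalized value, which is worth keeping in mind when reading the normalized columns.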

In [45]:
dfNormValues = dfStudentModel.select('StudentGrade', 'SchoolName', 'Insights_avgNumActivitiesPerDay', 'Insights_avgSecInTeamsMeetings', 'Clever_avgNumAppsUsedPerDay')
# find the min and max values per school and grade and join together into a single df
dfNormValues_mins = dfNormValues.groupBy('StudentGrade', 'SchoolName').min('Insights_avgNumActivitiesPerDay', 'Insights_avgSecInTeamsMeetings', 'Clever_avgNumAppsUsedPerDay')
dfNormValues_maxs = dfNormValues.groupBy('StudentGrade', 'SchoolName').max('Insights_avgNumActivitiesPerDay', 'Insights_avgSecInTeamsMeetings', 'Clever_avgNumAppsUsedPerDay')
dfNormValues_mins = dfNormValues_mins.withColumnRenamed('StudentGrade', 'Grade').withColumnRenamed('SchoolName', 'Name')
dfNormValues_both = dfNormValues_mins.join(dfNormValues_maxs, (dfNormValues_mins.Grade == dfNormValues_maxs.StudentGrade) \
& (dfNormValues_mins.Name == dfNormValues_maxs.SchoolName), how='inner')
dfNormValues_both = dfNormValues_both.drop('StudentGrade', 'SchoolName')
# Calculate the dividing factor for normalization (i.e. x_max-x_min) per average column, school, and grade
dfNormValues_both = dfNormValues_both.withColumn('divFactor_Insights_avgNumActivitiesPerDay', F.col('max(Insights_avgNumActivitiesPerDay)')-F.col('min(Insights_avgNumActivitiesPerDay)'))
dfNormValues_both = dfNormValues_both.withColumn('divFactor_Insights_avgSecInTeamsMeetings', F.col('max(Insights_avgSecInTeamsMeetings)')-F.col('min(Insights_avgSecInTeamsMeetings)'))
dfNormValues_both = dfNormValues_both.withColumn('divFactor_Clever_avgNumAppsUsedPerDay', F.col('max(Clever_avgNumAppsUsedPerDay)')-F.col('min(Clever_avgNumAppsUsedPerDay)'))
# drop the max columns, then round each of the dividing-factor columns
dfNormValues_both = dfNormValues_both.drop('max(Insights_avgNumActivitiesPerDay)', 'max(Insights_avgSecInTeamsMeetings)', 'max(Clever_avgNumAppsUsedPerDay)')
dfNormValues_both = dfNormValues_both.withColumn('divFactor_Insights_avgNumActivitiesPerDay', F.round(F.col('divFactor_Insights_avgNumActivitiesPerDay'), 3))
dfNormValues_both = dfNormValues_both.withColumn('divFactor_Insights_avgSecInTeamsMeetings', F.round(F.col('divFactor_Insights_avgSecInTeamsMeetings'), 3))
dfNormValues_both = dfNormValues_both.withColumn('divFactor_Clever_avgNumAppsUsedPerDay', F.round(F.col('divFactor_Clever_avgNumAppsUsedPerDay'), 3))
display(dfNormValues_both)

In [46]:
# join this normalization-values table back to the StudentModel
dfStudentModel_norm = dfStudentModel.join(dfNormValues_both, (dfStudentModel.StudentGrade == dfNormValues_both.Grade) \
& (dfStudentModel.SchoolName == dfNormValues_both.Name), how='left')
dfStudentModel_norm = dfStudentModel_norm.drop('Grade', 'Name')
# create new columns per student with their normalized averages
dfStudentModel_norm = dfStudentModel_norm.withColumn('Insights_normAvgNumActivitiesPerDay', (F.col('Insights_avgNumActivitiesPerDay')-F.col('min(Insights_avgNumActivitiesPerDay)'))/F.col('divFactor_Insights_avgNumActivitiesPerDay'))
dfStudentModel_norm = dfStudentModel_norm.withColumn('Insights_normAvgSecInTeamsMeetings', (F.col('Insights_avgSecInTeamsMeetings')-F.col('min(Insights_avgSecInTeamsMeetings)'))/F.col('divFactor_Insights_avgSecInTeamsMeetings'))
dfStudentModel_norm = dfStudentModel_norm.withColumn('Clever_normAvgNumAppsUsedPerDay', (F.col('Clever_avgNumAppsUsedPerDay')-F.col('min(Clever_avgNumAppsUsedPerDay)'))/F.col('divFactor_Clever_avgNumAppsUsedPerDay'))
# round these new normalized-averages columns
dfStudentModel_norm = dfStudentModel_norm.withColumn('Insights_normAvgNumActivitiesPerDay', F.round(F.col('Insights_normAvgNumActivitiesPerDay'), 3))
dfStudentModel_norm = dfStudentModel_norm.withColumn('Insights_normAvgSecInTeamsMeetings', F.round(F.col('Insights_normAvgSecInTeamsMeetings'), 3))
dfStudentModel_norm = dfStudentModel_norm.withColumn('Clever_normAvgNumAppsUsedPerDay', F.round(F.col('Clever_normAvgNumAppsUsedPerDay'), 3))
# drop the columns used for calculation
dfStudentModel_norm = dfStudentModel_norm.drop('min(Insights_avgNumActivitiesPerDay)', 'min(Insights_avgSecInTeamsMeetings)', 'min(Clever_avgNumAppsUsedPerDay)')
dfStudentModel_norm = dfStudentModel_norm.drop('divFactor_Insights_avgNumActivitiesPerDay', 'divFactor_Insights_avgSecInTeamsMeetings', 'divFactor_Clever_avgNumAppsUsedPerDay')
display(dfStudentModel_norm.limit(10))

## 5.) Write the StudentModel_pseudo table to Stage 3p


In [47]:
dfStudentModel_norm.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3p + '/chronic_absenteeism/StudentModel_pseudo')

## 6.) Build StudentModel_lookup table

In [48]:
dfInsights_person_np = oea.load('M365', 'Person_lookup', stage=oea.stage2np)
dfInsights_aaduser_np = oea.load('M365', 'AadUser_lookup', stage=oea.stage2np)
dfInsights_np = dfInsights_person_np.join(dfInsights_aaduserpersonmapping, dfInsights_person_np.Id_pseudonym == dfInsights_aaduserpersonmapping.StudentId_internal_pseudonym, how='inner')
dfInsights_np = dfInsights_np.withColumnRenamed('Id', 'StudentId_internal').withColumnRenamed('ObjectId_pseudonym', 'StudentId_external_pseudonym')
dfInsights_np = dfInsights_np.select('StudentId_internal_pseudonym', 'StudentId_internal', 'StudentId_external_pseudonym', 'Surname', 'GivenName', 'MiddleName')
display(dfInsights_np.limit(10))

In [49]:
dfInsights_aaduser_np = dfInsights_aaduser_np.withColumnRenamed('Surname', 'Surname2').withColumnRenamed('GivenName', 'GivenName2')
dfInsights_np = dfInsights_np.join(dfInsights_aaduser_np, dfInsights_np.StudentId_external_pseudonym == dfInsights_aaduser_np.ObjectId_pseudonym, how='inner')
dfInsights_np = dfInsights_np.withColumnRenamed('ObjectId', 'StudentId_external')
dfInsights_np = dfInsights_np.select('StudentId_internal_pseudonym', 'StudentId_internal', 'StudentId_external_pseudonym', 'StudentId_external', 'Surname', 'GivenName', 'MiddleName')
display(dfInsights_np.limit(10))

## 7.) Write the StudentModel_lookup table to Stage 3np

In [50]:
dfInsights_np.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3np + '/chronic_absenteeism/StudentModel_lookup')