# Learning Analytics Package: Build Roster Tables

This notebook builds the roster tables for the Learning Analytics package dashboard. It was developed against the Higher Ed. test data set of Microsoft Education Insights roster and activity data.

The following tables are created, one in each of the steps below:

1. Student_pseudo
2. Enrollment_pseudo
3. Student_lookup

In [1]:
%run /OEA_py

StatementMeta(, 154, -1, Finished, Available)

2022-12-02 16:42:19,624 - OEA - DEBUG - OEA initialized.
OEA initialized.


## 1.) Create Student_pseudo Table

Aggregates the Insights (M365) roster data from the AadUserPersonMapping, Person, PersonOrganizationRole, Organization and RefDefinition tables. This notebook was developed using the Higher Ed. test data in stages 2p and 2np of the data lake.

This table has one row per student in the education system. Each row contains the student's internal ID (PersonId), external ID (the AAD user ObjectId), the school they belong to, and their grade and role. Only people whose role is Student are currently included.

This table is then written out to stage 3p under: stage3p/learning_analytics/Student_pseudo.
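
The chain of joins described above can be hard to picture from the column names alone. The following is a minimal sketch in plain Python, not PySpark, that walks the same steps on a handful of toy records. Every value, the dictionary layout and the helper name `build_student_rows` are assumptions made for illustration; they are not part of OEA or of the Insights schema.

```python
# Toy stand-ins for the Insights roster tables; every value here is invented.
person = {
    "p1": {"Surname": "Lee", "GivenName": "Ana"},
    "p2": {"Surname": "Roy", "GivenName": "Sam"},
}
organization = {"o1": "North Campus"}  # Id -> Name
ref_definition = {"r1": "Student", "r2": "Teacher", "g1": "Freshman"}  # Id -> Code
aad_mapping = {"p1": "aad-1", "p2": "aad-2"}  # PersonId_pseudonym -> ObjectId_pseudonym
person_org_role = [
    {"PersonId_pseudonym": "p1", "RefRoleId": "r1", "RefGradeLevelId": "g1", "OrganizationId": "o1"},
    {"PersonId_pseudonym": "p2", "RefRoleId": "r2", "RefGradeLevelId": "g1", "OrganizationId": "o1"},
]


def build_student_rows(roles):
    """Resolve each role row against the lookups, keeping students only."""
    rows = []
    for role_row in roles:
        pid = role_row["PersonId_pseudonym"]
        try:
            # A missing key in any lookup drops the row, like an inner join.
            row = {
                "StudentId_internal_pseudonym": pid,
                "StudentId_external_pseudonym": aad_mapping[pid],
                "Surname": person[pid]["Surname"],
                "GivenName": person[pid]["GivenName"],
                "PersonRole": ref_definition[role_row["RefRoleId"]],
                "StudentGrade": ref_definition[role_row["RefGradeLevelId"]],
                "SchoolName": organization[role_row["OrganizationId"]],
            }
        except KeyError:
            continue
        # Mirrors the filter on PersonRole == 'Student'.
        if row["PersonRole"] == "Student":
            rows.append(row)
    return rows


student_rows = build_student_rows(person_org_role)

# Sanity check for the one-row-per-student expectation.
student_ids = [r["StudentId_internal_pseudonym"] for r in student_rows]
is_one_row_per_student = len(student_ids) == len(set(student_ids))
```

Unlike a Spark join, the dictionaries here force each lookup key to be unique. In the real tables, a person with several PersonOrganizationRole rows would produce several output rows, so a uniqueness check like the last two lines may be worth running on the actual data.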

In [2]:
dfInsights_aaduserpersonmapping = oea.load('M365', 'AadUserPersonMapping_pseudo')
dfInsights_person = oea.load('M365', 'Person_pseudo')
dfInsights_personOrgRole = oea.load('M365', 'PersonOrganizationRole_pseudo')
dfInsights_organization = oea.load('M365', 'Organization_pseudo')
dfInsights_refDefinition = oea.load('M365', 'RefDefinition_pseudo')
dfInsights_enrollment = oea.load('M365', 'Enrollment_pseudo')

StatementMeta(sparkMed, 154, 2, Finished, Available)

In [3]:
dfInsights = dfInsights_personOrgRole.join(dfInsights_person, dfInsights_personOrgRole.PersonId_pseudonym == dfInsights_person.Id_pseudonym, how='inner')
dfInsights = dfInsights.select('PersonId_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'RefRoleId', 'RefGradeLevelId', 'OrganizationId')

StatementMeta(sparkMed, 154, 3, Finished, Available)

In [4]:
dfInsights = dfInsights.join(dfInsights_organization, dfInsights.OrganizationId == dfInsights_organization.Id, how='inner')
dfInsights = dfInsights.withColumnRenamed('Name', 'OrganizationName')
dfInsights = dfInsights.select('PersonId_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'RefRoleId', 'RefGradeLevelId', 'OrganizationId', 'OrganizationName')

StatementMeta(sparkMed, 154, 4, Finished, Available)

In [5]:
dfInsights = dfInsights.join(dfInsights_refDefinition, dfInsights.RefRoleId == dfInsights_refDefinition.Id, how='inner')
dfInsights = dfInsights.withColumnRenamed('Code', 'PersonRole')
dfInsights = dfInsights.filter(dfInsights['PersonRole'] == 'Student')
dfInsights = dfInsights.select('PersonId_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'PersonRole', 'RefGradeLevelId', 'OrganizationId', 'OrganizationName')

StatementMeta(sparkMed, 154, 5, Finished, Available)

In [6]:
dfInsights = dfInsights.join(dfInsights_refDefinition, dfInsights.RefGradeLevelId == dfInsights_refDefinition.Id, how='inner')
dfInsights = dfInsights.withColumnRenamed('Code', 'StudentGrade')
dfInsights = dfInsights.select('PersonId_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'PersonRole', 'StudentGrade', 'OrganizationId', 'OrganizationName')

StatementMeta(sparkMed, 154, 6, Finished, Available)

In [7]:
dfInsights_aaduserpersonmapping = dfInsights_aaduserpersonmapping.withColumnRenamed('PersonId_pseudonym', 'StudentId_internal_pseudonym')
dfInsights = dfInsights.join(dfInsights_aaduserpersonmapping, dfInsights.PersonId_pseudonym == dfInsights_aaduserpersonmapping.StudentId_internal_pseudonym, how='inner')
dfInsights = dfInsights.withColumnRenamed('ObjectId_pseudonym', 'StudentId_external_pseudonym').withColumnRenamed('OrganizationName', 'SchoolName')
dfInsights = dfInsights.select('StudentId_internal_pseudonym', 'StudentId_external_pseudonym', 'Surname', 'GivenName', 'MiddleName', 'PersonRole', 'StudentGrade', 'SchoolName')
display(dfInsights.limit(10))

StatementMeta(sparkMed, 154, 7, Finished, Available)


### Write to Stage 3p

In [10]:
dfStudent = dfInsights
dfStudent.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3p + '/learning_analytics/Student_pseudo')

StatementMeta(sparkMed, 154, 10, Finished, Available)

## 2.) Create Enrollment_pseudo Table

Aggregates and enriches the Insights (M365) roster data from the AadGroup (pseudo and lookup), Enrollment, Section, Course, CourseGradeLevel, CourseSubject, Organization and RefDefinition tables. This notebook was developed using the Higher Ed. test data in stages 2p and 2np of the data lake.

This table has one row per enrollment, so a student enrolled in several courses appears in several rows. Each row describes the course and section the student is enrolled in (e.g. course name, course grade level and associated school).

This table is then written out to stage 3p under: stage3p/learning_analytics/Enrollment_pseudo.
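
Because this table is keyed by enrollment rather than by student, it can help to check the row granularity before using it in a dashboard. The sketch below is plain Python on invented rows shaped like a subset of the Enrollment_pseudo columns; the values and variable names are assumptions for illustration only.

```python
from collections import Counter

# Invented enrollment rows; only a few of the output columns are shown.
enrollment_rows = [
    {"EnrollmentId": "e1", "StudentId_internal_pseudonym": "p1", "SectionId": "s1", "CourseName": "Biology 101"},
    {"EnrollmentId": "e2", "StudentId_internal_pseudonym": "p1", "SectionId": "s2", "CourseName": "Calculus I"},
    {"EnrollmentId": "e3", "StudentId_internal_pseudonym": "p2", "SectionId": "s1", "CourseName": "Biology 101"},
]

# Number of rows each student contributes (one per enrollment).
rows_per_student = Counter(r["StudentId_internal_pseudonym"] for r in enrollment_rows)

# Each EnrollmentId is expected to appear once; repeats would suggest a join fanned out.
enrollment_ids = [r["EnrollmentId"] for r in enrollment_rows]
has_duplicate_enrollments = len(enrollment_ids) != len(set(enrollment_ids))
```

One place a fan-out could plausibly occur is the CourseGradeLevel join, if a course is linked to more than one grade level; the duplicate check above would surface that.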

In [11]:
dfInsights_aadgroup = oea.load('M365', 'AadGroup_pseudo')
#dfInsights_aadgroupmembership = oea.load('M365', 'AadGroupMembership_pseudo')
dfInsights_aadgroup_np = oea.load('M365', 'AadGroup_lookup', stage=oea.stage2np)
dfInsights_enrollment = oea.load('M365', 'Enrollment_pseudo')
dfInsights_section = oea.load('M365', 'Section_pseudo')
dfInsights_course = oea.load('M365', 'Course_pseudo')
dfInsights_courseGradeLevel = oea.load('M365', 'CourseGradeLevel_pseudo')
dfInsights_courseSubject = oea.load('M365', 'CourseSubject_pseudo')
dfInsights_organization = oea.load('M365', 'Organization_pseudo')
dfInsights_refDefinition = oea.load('M365', 'RefDefinition_pseudo')

StatementMeta(sparkMed, 154, 11, Finished, Available)

In [18]:
dfInsights_enrollment = dfInsights_enrollment.withColumnRenamed('Id', 'EnrollmentId').drop('SourceSystemId')
dfStudentClasses = dfInsights_enrollment.join(dfInsights_section, dfInsights_enrollment.SectionId == dfInsights_section.Id, how='inner')
dfStudentClasses = dfStudentClasses.withColumnRenamed('PersonId_pseudonym', 'StudentId_internal_pseudonym').withColumnRenamed('Name', 'SectionName') \
    .withColumnRenamed('RefSectionRoleId', 'PersonRole').withColumnRenamed('OrganizationId', 'SchoolId')
dfStudentClasses = dfStudentClasses.select('EnrollmentId', 'SectionName', 'SectionId', 'StudentId_internal_pseudonym', 'PersonRole', 'EntryDate', 'ExitDate', 'SchoolId', 'CourseId')

StatementMeta(sparkMed, 154, 18, Finished, Available)

In [19]:
dfStudentClasses = dfStudentClasses.join(dfInsights_course, dfStudentClasses.CourseId == dfInsights_course.Id, how='inner')
dfStudentClasses = dfStudentClasses.withColumnRenamed('Name', 'CourseName')
dfStudentClasses = dfStudentClasses.select('EnrollmentId', 'SectionName', 'SectionId', 'StudentId_internal_pseudonym', 'PersonRole', 'EntryDate', 'ExitDate', 'SchoolId', 'CourseName', 'CourseId')

StatementMeta(sparkMed, 154, 19, Finished, Available)

In [20]:
dfStudentClasses = dfStudentClasses.join(dfInsights_organization, dfStudentClasses.SchoolId == dfInsights_organization.Id, how='inner')
dfStudentClasses = dfStudentClasses.withColumnRenamed('Name', 'SchoolName')
dfStudentClasses = dfStudentClasses.select('EnrollmentId', 'SectionName', 'SectionId', 'StudentId_internal_pseudonym', 'PersonRole', 'EntryDate', 'ExitDate', 'SchoolName', 'CourseName', 'CourseId')

StatementMeta(sparkMed, 154, 20, Finished, Available)

In [21]:
dfInsights_courseGradeLevel = dfInsights_courseGradeLevel.withColumnRenamed('Id', 'CourseGradeLevelId')
dfCourseGradeLevel = dfInsights_courseGradeLevel.join(dfInsights_refDefinition, dfInsights_courseGradeLevel.RefGradeLevelId == dfInsights_refDefinition.Id, how='inner')
dfCourseGradeLevel = dfCourseGradeLevel.select('CourseId', 'Code')
dfCourseGradeLevel = dfCourseGradeLevel.withColumnRenamed('Code', 'CourseGradeLevel').withColumnRenamed('CourseId', 'Id')
dfStudentClasses = dfStudentClasses.join(dfCourseGradeLevel, dfStudentClasses.CourseId == dfCourseGradeLevel.Id, how='inner')
dfStudentClasses = dfStudentClasses.select('EnrollmentId', 'SectionName', 'SectionId', 'StudentId_internal_pseudonym', 'PersonRole', 'EntryDate', 'ExitDate', 'SchoolName', 'CourseName', 'CourseId', 'CourseGradeLevel')

StatementMeta(sparkMed, 154, 21, Finished, Available)

In [22]:
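# Build the final enrollment table (temp): replace the internal SectionId with the AAD group ObjectId,
# resolved through AadGroup_pseudo (SectionId -> ObjectId_pseudonym) and AadGroup_lookup (ObjectId_pseudonym -> ObjectId).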
dfInsights_aadgroup_ = dfInsights_aadgroup.select('ObjectId_pseudonym', 'SectionId').withColumnRenamed('SectionId', 'Id').withColumnRenamed('ObjectId_pseudonym', 'AADGroup_ClassId_pseudo')
dfEnrollment = dfStudentClasses.join(dfInsights_aadgroup_, dfStudentClasses.SectionId == dfInsights_aadgroup_.Id, how='inner')
dfEnrollment = dfEnrollment.drop('Id')
dfInsights_aadgroup_np_ = dfInsights_aadgroup_np.select('ObjectId_pseudonym', 'ObjectId')
dfEnrollment = dfEnrollment.join(dfInsights_aadgroup_np_, dfEnrollment.AADGroup_ClassId_pseudo == dfInsights_aadgroup_np_.ObjectId_pseudonym, how='inner')
dfEnrollment = dfEnrollment.drop('SectionId').withColumnRenamed('ObjectId', 'SectionId')
dfEnrollment = dfEnrollment.select('EnrollmentId', 'SectionName', 'SectionId', 'StudentId_internal_pseudonym', 'PersonRole', 'EntryDate', 'ExitDate', 'SchoolName', 'CourseName', 'CourseId', 'CourseGradeLevel')

display(dfEnrollment.limit(10))

StatementMeta(sparkMed, 154, 22, Finished, Available)

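
The last cell appears to swap the internal SectionId for the AAD group ObjectId in two hops: first from SectionId to the group's pseudonymized ObjectId, then from that pseudonym to the original ObjectId. A minimal plain-Python sketch of that two-hop resolution follows. The mappings, values and the helper name `swap_section_ids` are hypothetical and only illustrate the logic.

```python
# Invented mappings standing in for the two AadGroup tables.
aad_group_pseudo = {"s1": "hash-a", "s2": "hash-b"}  # SectionId -> ObjectId_pseudonym
aad_group_lookup = {"hash-a": "group-guid-1"}        # ObjectId_pseudonym -> ObjectId


def swap_section_ids(rows):
    """Replace each row's SectionId with the resolved AAD group ObjectId."""
    swapped = []
    for row in rows:
        pseudonym = aad_group_pseudo.get(row["SectionId"])
        object_id = aad_group_lookup.get(pseudonym)
        if object_id is None:
            # Either hop failed; an inner join would drop this row too.
            continue
        swapped.append({**row, "SectionId": object_id})
    return swapped


student_classes = [
    {"EnrollmentId": "e1", "SectionId": "s1"},  # resolves through both hops
    {"EnrollmentId": "e2", "SectionId": "s2"},  # pseudonym has no lookup entry
    {"EnrollmentId": "e3", "SectionId": "s9"},  # section has no AAD group
]
enrollment = swap_section_ids(student_classes)
```

Note that rows are silently dropped when a section has no matching AAD group, which is also how the inner joins in the cell behave.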

### Write to Stage 3p

In [23]:
dfEnrollment.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3p + '/learning_analytics/Enrollment_pseudo')

StatementMeta(sparkMed, 154, 23, Finished, Available)

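## 3.) Create Student_lookup Table

Joins the Person_lookup and AadUser_lookup tables from stage 2np with AadUserPersonMapping, so that each pseudonymized internal and external student ID is paired with its original value, alongside the student's name columns. The first cell reuses dfInsights_aaduserpersonmapping as renamed in step 1, so step 1 must be run first.

This table is then written out to stage 3np under: stage3np/learning_analytics/Student_lookup.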

In [24]:
dfInsights_person_np = oea.load('M365', 'Person_lookup', stage=oea.stage2np)
dfInsights_aaduser_np = oea.load('M365', 'AadUser_lookup', stage=oea.stage2np)
dfInsights_np = dfInsights_person_np.join(dfInsights_aaduserpersonmapping, dfInsights_person_np.Id_pseudonym == dfInsights_aaduserpersonmapping.StudentId_internal_pseudonym, how='inner')
dfInsights_np = dfInsights_np.withColumnRenamed('Id', 'StudentId_internal').withColumnRenamed('ObjectId_pseudonym', 'StudentId_external_pseudonym')
dfInsights_np = dfInsights_np.select('StudentId_internal_pseudonym', 'StudentId_internal', 'StudentId_external_pseudonym', 'Surname', 'GivenName', 'MiddleName')
display(dfInsights_np.limit(10))

StatementMeta(sparkMed, 154, 24, Finished, Available)


In [25]:
dfInsights_aaduser_np = dfInsights_aaduser_np.withColumnRenamed('Surname', 'Surname2').withColumnRenamed('GivenName', 'GivenName2')
dfInsights_np = dfInsights_np.join(dfInsights_aaduser_np, dfInsights_np.StudentId_external_pseudonym == dfInsights_aaduser_np.ObjectId_pseudonym, how='inner')
dfInsights_np = dfInsights_np.withColumnRenamed('ObjectId', 'StudentId_external')
dfInsights_np = dfInsights_np.select('StudentId_internal_pseudonym', 'StudentId_internal', 'StudentId_external_pseudonym', 'StudentId_external', 'Surname', 'GivenName', 'MiddleName')
display(dfInsights_np.limit(10))

StatementMeta(sparkMed, 154, 25, Finished, Available)


### Write to Stage 3np

In [26]:
dfInsights_np.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3np + '/learning_analytics/Student_lookup')

StatementMeta(sparkMed, 154, 26, Finished, Available)
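
The Student_lookup table written above pairs each pseudonymized ID with its original value. One way such a table could be used downstream is to re-identify rows from a pseudonymized table by joining on StudentId_internal_pseudonym. The sketch below shows that idea in plain Python; the rows, values and the helper name `reidentify` are assumptions for illustration, not part of the package.

```python
# Invented rows shaped like a subset of the Student_lookup columns.
student_lookup = [
    {
        "StudentId_internal_pseudonym": "hash-1",
        "StudentId_internal": "1001",
        "StudentId_external_pseudonym": "hash-x",
        "StudentId_external": "aad-guid-1",
    },
]

# Invented rows from a pseudonymized table such as Student_pseudo.
pseudo_rows = [
    {"StudentId_internal_pseudonym": "hash-1", "SchoolName": "North Campus"},
    {"StudentId_internal_pseudonym": "hash-2", "SchoolName": "North Campus"},
]

# Index the lookup by pseudonym for constant-time matching.
lookup_by_pseudonym = {r["StudentId_internal_pseudonym"]: r for r in student_lookup}


def reidentify(rows):
    """Attach the original internal ID to rows that have a lookup entry."""
    result = []
    for row in rows:
        match = lookup_by_pseudonym.get(row["StudentId_internal_pseudonym"])
        if match is None:
            continue  # no lookup entry, so the row cannot be re-identified
        result.append({**row, "StudentId_internal": match["StudentId_internal"]})
    return result


identified_rows = reidentify(pseudo_rows)
```

Since the lookup table holds identifying data, it is written to the non-pseudonymized stage (stage 3np) rather than to stage 3p.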