# Learning Analytics Package: Build Engagement Tables (for Meetings and Assignments Data)

Builds the engagement tables for the Learning Analytics package dashboard from the Higher Ed. test data: Microsoft Education Insights activity data and Microsoft Graph meeting attendance data.

Each step outlined below creates one of the following tables:

1. Meetings_pseudo
2. MeetingsAggregate_pseudo
3. InsightsActivity_pseudo
4. Assignments_pseudo

In [68]:
%run /OEA_py

StatementMeta(, 158, -1, Finished, Available)

2022-12-02 20:18:20,404 - OEA - DEBUG - OEA initialized.
OEA initialized.


## 1.) Create Meetings_pseudo Table

Data aggregations and curation on Graph meeting attendance and student-class enrollment data, drawn from the meeting_attendance_report, Student, Enrollment and TechActivity tables. This notebook was developed using the Higher Ed. test data in stage 2p/np of the data lake.

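This table has one row per student per class meeting within the education system, including meetings that an enrolled student missed. Each row contains the:
 - meeting ID,
 - class ID,
 - student ID,
 - additional student data,
 - meeting attendance flag (1 = joined by the meeting start time, 0 = joined late, -1 = missed the meeting), and
 - student meeting attendance duration.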

This table is then written out to stage 3p under: stage3p/learning_analytics/Meetings_pseudo.
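
The attendance flag rule can be illustrated outside Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: `attendance_flag` and its `grace` parameter are hypothetical names. With the default `grace` of zero it mirrors the rule the notebook currently applies (any join after the meeting start counts as late); a 5-minute grace period corresponds to the lateness check that the notebook marks as still needing an update.

```python
from datetime import datetime, timedelta

def attendance_flag(join_time, meeting_start, grace=timedelta(0)):
    """Return 1 if the student joined by meeting_start + grace, else 0.

    Hypothetical helper for illustration only. The value -1 (missed meeting)
    is assigned in a separate step, for enrolled students with no attendance row.
    """
    return 0 if join_time > meeting_start + grace else 1

start = datetime(2022, 12, 2, 9, 0)
on_time = attendance_flag(datetime(2022, 12, 2, 9, 0), start)
late = attendance_flag(datetime(2022, 12, 2, 9, 3), start)
late_with_grace = attendance_flag(
    datetime(2022, 12, 2, 9, 3), start, grace=timedelta(minutes=5)
)
```

Keeping the grace period as a parameter makes it easy to compare the current rule with the intended 5-minute rule on the same data.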

In [29]:
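import datetime as dt
from pyspark.sql import functions as F  # F is used throughout; imported explicitly in case OEA_py does not already expose it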
dfGraph_meetingAtten = oea.load('graph_api', 'meeting_attendance_report_pseudo')
dfStudent = oea.load_delta('stage3p/learning_analytics/Student_pseudo')
dfEnrollment = oea.load_delta('stage3p/learning_analytics/Enrollment_pseudo')
dfInsights_activity = oea.load('M365', 'TechActivity_pseudo')

StatementMeta(sparkMed, 155, 9, Finished, Available)

In [32]:
dfInsights_activity_ = dfInsights_activity.join(dfStudent, dfInsights_activity.ActorId_pseudonym == dfStudent.StudentId_external_pseudonym, how='inner')
dfMeetings = dfGraph_meetingAtten.join(dfInsights_activity_, (dfGraph_meetingAtten.meetingId == dfInsights_activity_.MeetingSessionId) & \
                                     (dfGraph_meetingAtten.userId_pseudonym == dfInsights_activity_.StudentId_internal_pseudonym), how='inner')

dfMeetings = dfMeetings.select('meetingId', 'meetingStartDateTime', 'meetingEndDateTime', 'ClassId', 'StudentId_internal_pseudonym', 'MeetingDuration', 'totalAttendanceInSec', 'SignalType', 'attendanceInterval_joinDateTime', 'attendanceInterval_leaveDateTime')
dfMeetings = dfMeetings.withColumnRenamed('ClassId', 'Id').withColumnRenamed('totalAttendanceInSec', 'StudentTotalAttendanceInSec') \
            .withColumnRenamed('SignalType', 'Insights_SignalType').withColumnRenamed('MeetingDuration', 'StudentMeetingAttendanceDuration')

# needs to be updated: draft of a lateness check with a 5-minute grace period (currently unused).
# assumes meetingStartDateTime and meetingEndDateTime are timestamp columns.
#def addColumnsForLatenessCheck(df):
#    df = df.withColumn('meetingStart+5min', F.col('meetingStartDateTime') + F.expr('INTERVAL 5 MINUTES'))
#    df = df.withColumn('meetingEnd-5min', F.col('meetingEndDateTime') - F.expr('INTERVAL 5 MINUTES'))
#    return df

#dfTest = addColumnsForLatenessCheck(dfMeetings)

# intended rule (not yet implemented): flag 0 for students who joined more than 5 minutes after the meeting start time,
# or left more than 5 minutes before the meeting end time; flag 1 for students who attended the whole time.
#dfMeetings = dfMeetings.withColumn('meetingAttendanceFlag', F.when(F.col('attendanceInterval_joinDateTime') > F.col('meetingStartDateTime'), 0).otherwise( \
#                            F.when(F.col('attendanceInterval_leaveDateTime') < F.col('meetingEndDateTime'), 0).otherwise(1)))
# current rule: flag 0 for any student who joined after the meeting start time, flag 1 otherwise.
dfMeetings = dfMeetings.withColumn('meetingAttendanceFlag', F.when(F.col('attendanceInterval_joinDateTime') > F.col('meetingStartDateTime'), 0).otherwise(1))

dfMeetings = dfMeetings.drop('attendanceInterval_joinDateTime', 'attendanceInterval_leaveDateTime')

display(dfMeetings.limit(10))

StatementMeta(sparkMed, 155, 12, Finished, Available)


In [33]:
def insertMissedStudents(dfStudentsMissed, dfMeetings):
    # isolate data points needed to append dfMeetings
    numStudentsMissed = dfStudentsMissed.count()
    list_of_missedStudentIds = dfStudentsMissed.select('StudentId_internal_pseudonym').collect()
    list_of_missedStudentClass = dfStudentsMissed.select('AADGroup_ClassId').collect()
    
    for n in range(numStudentsMissed):
        # iterate through students missed
        df_metadata = dfMeetings.groupBy('meetingId', 'meetingStartDateTime','meetingEndDateTime', 'Id', 'StudentId_internal_pseudonym').count()
        currentStudentId = list_of_missedStudentIds[n][0]
        currentStudentClass = list_of_missedStudentClass[n][0]
        df_metadata = df_metadata.filter(df_metadata['Id'] == currentStudentClass)
        df_test = df_metadata.filter(df_metadata['StudentId_internal_pseudonym'] == currentStudentId)
        if df_test.count() == 0:
            # check to ensure that the student missed meetings from a class
            df_metadata = df_metadata.groupBy('meetingId', 'meetingStartDateTime','meetingEndDateTime').count()
            numMeetingsMissed = df_metadata.count()
            list_of_meetings = df_metadata.select('meetingId').collect()
            list_of_meetingStartTimes = df_metadata.select('meetingStartDateTime').collect()
            list_of_meetingEndTimes = df_metadata.select('meetingEndDateTime').collect()
            for m in range(numMeetingsMissed):
                # iterate through meetings missed
                meetingID = list_of_meetings[m][0]
                meetingStartTime = list_of_meetingStartTimes[m][0]
                meetingEndTime = list_of_meetingEndTimes[m][0]
                newRow = spark.createDataFrame([(meetingID,meetingStartTime,meetingEndTime,currentStudentClass,currentStudentId, "0:00:00", 0, '', -1)])

                dfMeetings = dfMeetings.union(newRow)
        else:
            # if the student attended some meetings but not others in a class, isolate those they missed.
            df_metadata2 = df_metadata.groupBy('meetingId', 'meetingStartDateTime', 'meetingEndDateTime').count()
            df_metadata2 = df_metadata2.withColumnRenamed('meetingId', 'MeetingIDs').withColumnRenamed('meetingStartDateTime', 'startTime').withColumnRenamed('meetingEndDateTime', 'endTime')
            dfMeetingsMissed = df_metadata2.join(df_test, df_metadata2.MeetingIDs == df_test.meetingId, how='leftanti')
            numMeetingsMissed = dfMeetingsMissed.count()
            # collect only the missed meetings, once, so that IDs and times stay aligned row by row
            missedMeetings = dfMeetingsMissed.select('MeetingIDs', 'startTime', 'endTime').collect()
            for m in range(numMeetingsMissed):
                # iterate through meetings missed
                meetingID = missedMeetings[m][0]
                meetingStartTime = missedMeetings[m][1]
                meetingEndTime = missedMeetings[m][2]
                newRow = spark.createDataFrame([(meetingID,meetingStartTime,meetingEndTime,currentStudentClass,currentStudentId, "0:00:00", 0, '', -1)])

                dfMeetings = dfMeetings.union(newRow)

    return dfMeetings

StatementMeta(sparkMed, 155, 13, Finished, Available)

In [34]:
# find students that missed meetings per class, and add them to the meetings table with meetingAttendanceFlag = -1
dfEnrollment_ = dfEnrollment.withColumnRenamed('SectionId', 'ClassId')
dfMissedStudents = dfMeetings.withColumnRenamed('StudentId_internal_pseudonym', 'StudentId')

dfMissedStudents = dfEnrollment_.join(dfMissedStudents, (dfEnrollment_.StudentId_internal_pseudonym == dfMissedStudents.StudentId) & \
                (dfEnrollment_.AADGroup_ClassId == dfMissedStudents.Id), how='leftanti')

dfMeetings = insertMissedStudents(dfStudentsMissed=dfMissedStudents, dfMeetings=dfMeetings)
# temp
dfMeetings = dfMeetings.withColumnRenamed('Id', 'SectionId')

display(dfMeetings)

StatementMeta(sparkMed, 155, 14, Finished, Available)
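
The idea behind adding missed students can be sketched as a set difference: every enrolled student is paired with every meeting of the class, and the pairs that have no attendance record become rows with a flag of -1. The plain-Python sketch below is illustrative only; `missed_meeting_rows` and the tuple layouts are hypothetical and much simpler than the Spark DataFrames used in the notebook. Note that the cell above passes in only students with no attendance at all in a class, whereas this sketch also covers students who attended some meetings of a class but not others.

```python
def missed_meeting_rows(enrollment, attendance):
    """Return (meeting_id, class_id, student_id, -1) for each missed meeting.

    enrollment: iterable of (student_id, class_id)
    attendance: iterable of (meeting_id, class_id, student_id)
    Hypothetical helper for illustration only.
    """
    # meetings held per class, inferred from the attendance records
    meetings_by_class = {}
    for meeting_id, class_id, _ in attendance:
        meetings_by_class.setdefault(class_id, set()).add(meeting_id)

    attended = set(attendance)
    rows = []
    for student_id, class_id in enrollment:
        for meeting_id in sorted(meetings_by_class.get(class_id, ())):
            if (meeting_id, class_id, student_id) not in attended:
                rows.append((meeting_id, class_id, student_id, -1))
    return rows

enrollment = [("s1", "c1"), ("s2", "c1"), ("s3", "c1")]
attendance = [("m1", "c1", "s1"), ("m2", "c1", "s1"), ("m1", "c1", "s2")]
rows = missed_meeting_rows(enrollment, attendance)
```

In Spark the same result is usually obtained with joins rather than row-by-row unions, which avoids collecting data to the driver; that variant is not shown here.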


### Write to Stage 3p

In [36]:
dfMeetings.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3p + '/learning_analytics/Meetings_pseudo')

StatementMeta(sparkMed, 155, 16, Finished, Available)

## 2.) Create MeetingsAggregate_pseudo Table

Data aggregations and curation on Graph meeting attendance data (meeting_attendance_report), Insights data (Enrollment_pseudo and TechActivity_pseudo), and previously enriched data (the stage 3p Enrollment_pseudo table). This notebook was developed using the Higher Ed. test data in stage 2p/np of the data lake.

This table has one row per meeting from within the education system, containing the:
 - meeting ID, 
 - class ID, 
 - total number of meeting participants, 
 - first student to join (student ID, totalAttendanceInSec, joinDateTime, leaveDateTime), 
 - last student to leave (student ID, totalAttendanceInSec, joinDateTime, leaveDateTime),
 - meeting duration, and 
 - additional enrichment columns.

This table is then written out to stage 3p under: stage3p/learning_analytics/MeetingsAggregate_pseudo.
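
The "first student to join" and "last student to leave" columns come from picking, per meeting, the attendance record with the earliest join time and the record with the latest leave time. A minimal plain-Python sketch of that selection follows; `first_and_last` and the record keys are hypothetical, and the times are simplified to `HH:MM` strings that compare correctly as text.

```python
def first_and_last(records):
    """Map each meetingId to its earliest-joining and latest-leaving record.

    records: iterable of dicts with keys meetingId, userId, join, leave.
    Hypothetical helper for illustration only.
    """
    result = {}
    for r in records:
        entry = result.setdefault(
            r["meetingId"], {"first_joined": r, "last_to_leave": r}
        )
        if r["join"] < entry["first_joined"]["join"]:
            entry["first_joined"] = r
        if r["leave"] > entry["last_to_leave"]["leave"]:
            entry["last_to_leave"] = r
    return result

records = [
    {"meetingId": "m1", "userId": "a", "join": "09:02", "leave": "09:50"},
    {"meetingId": "m1", "userId": "b", "join": "08:58", "leave": "09:30"},
    {"meetingId": "m1", "userId": "c", "join": "09:10", "leave": "10:01"},
]
summary = first_and_last(records)
```

As in the notebook, the selection ignores how long each participant stayed: the first to join may have left early, and the last to leave may have joined late.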

In [37]:
dfGraph_meetingAtten = oea.load('graph_api', 'meeting_attendance_report_pseudo')
dfInsights_activity = oea.load('M365', 'TechActivity_pseudo')
dfInsights_enrollment = oea.load('M365', 'Enrollment_pseudo')
dfEnrollment = oea.load_delta('stage3p/learning_analytics/Enrollment_pseudo')

StatementMeta(sparkMed, 155, 17, Finished, Available)

In [38]:
from pyspark.sql import Window
# grabs the participant with the earliest join date time, without consideration of the time they spent in the meeting
w = Window.partitionBy('meetingId').orderBy("attendanceInterval_joinDateTime")
df_metadata1 = dfGraph_meetingAtten.withColumn("rn",F.row_number().over(w)).filter(F.col("rn") ==1).drop("rn")
df_metadata1 = df_metadata1.select('meetingId', 'meetingEndDateTime', 'meetingStartDateTime', 'totalParticipantCount', 'totalAttendanceInSec', 'attendanceInterval_joinDateTime', 'attendanceInterval_leaveDateTime', \
                                 'userId_pseudonym', 'role')
df_metadata1 = df_metadata1.withColumnRenamed('meetingId', 'MeetingId').withColumnRenamed('totalAttendanceInSec', 'FirstJoined_totalAttendanceInSec') \
                .withColumnRenamed('attendanceInterval_joinDateTime', 'FirstJoined_attenInterval_joinDateTime').withColumnRenamed('attendanceInterval_leaveDateTime', 'FirstJoined_attenInterval_leaveDateTime') \
                .withColumnRenamed('userId_pseudonym', 'FirstJoined_StudentId_internal_pseudonym').withColumnRenamed('role', 'FirstJoined_Role')
# grabs the participant with the latest leave date time, without consideration of the time they spent in the meeting
w = Window.partitionBy('meetingId').orderBy(F.desc("attendanceInterval_leaveDateTime"))
df_metadata2 = dfGraph_meetingAtten.withColumn("rn",F.row_number().over(w)).filter(F.col("rn") ==1).drop('rn')
df_metadata2 = df_metadata2.select('meetingId', 'totalAttendanceInSec', 'attendanceInterval_joinDateTime', 'attendanceInterval_leaveDateTime', 'userId_pseudonym', 'role')
df_metadata2 = df_metadata2.withColumnRenamed('meetingId', 'meetingId2').withColumnRenamed('totalAttendanceInSec', 'LastToLeave_totalAttendanceInSec') \
                .withColumnRenamed('attendanceInterval_joinDateTime', 'LastToLeave_attenInterval_joinDateTime').withColumnRenamed('attendanceInterval_leaveDateTime', 'LastToLeave_attenInterval_leaveDateTime') \
                .withColumnRenamed('userId_pseudonym', 'LastToLeave_StudentId_internal_pseudonym').withColumnRenamed('role', 'LastToLeave_Role')
# join these two dfs together to get the first student to join and the last student to leave the meeting
dfEngageMeetingAgg = df_metadata1.join(df_metadata2, df_metadata1.MeetingId == df_metadata2.meetingId2, how='inner')
dfEngageMeetingAgg = dfEngageMeetingAgg.drop('meetingId2')

display(dfEngageMeetingAgg)

StatementMeta(sparkMed, 155, 18, Finished, Available)


In [39]:
# join the ClassId data (from Insights activity data) to the final engagement meeting aggregation df
df_metadata = dfInsights_activity.select('MeetingSessionId', 'ClassId')
df_metadata = df_metadata.groupBy('MeetingSessionId', 'ClassId').count()
dfEngageMeetingAgg = dfEngageMeetingAgg.join(df_metadata, dfEngageMeetingAgg.MeetingId == df_metadata.MeetingSessionId, how='inner')
dfEngageMeetingAgg = dfEngageMeetingAgg.drop('count').drop('MeetingSessionId')
dfEngageMeetingAgg = dfEngageMeetingAgg.withColumnRenamed('ClassId', 'AADGroup_ClassId')
display(dfEngageMeetingAgg.limit(10))

StatementMeta(sparkMed, 155, 19, Finished, Available)


In [44]:
# count a row as attended when the flag is 0 (late) or 1 (on time); a flag of -1 (missed) counts as 0
df_metadata1 = dfMeetings.withColumn('meetingAttendanceFlag', F.when(F.col('meetingAttendanceFlag') == -1, 0).otherwise(1))
df_metadata1 = df_metadata1.groupBy('meetingId', 'SectionId').sum('meetingAttendanceFlag')
df_metadata1 = df_metadata1.withColumnRenamed('sum(meetingAttendanceFlag)', 'numStudentsAttendedMeeting')
df_metadata2 = dfInsights_enrollment.groupBy('SectionId').count()
df_metadata2 = df_metadata2.withColumnRenamed('SectionId', 'ClassId').withColumnRenamed('count', 'numStudentsEnrolledInSection')
df_metadata3 = dfEnrollment.groupBy('SectionId', 'AADGroup_ClassId').count()
df_metadata3 = df_metadata3.drop('count').withColumnRenamed('AADGroup_ClassId', 'AADGroupId')
df_metadata = df_metadata2.join(df_metadata3, df_metadata2.ClassId == df_metadata3.SectionId, how='inner')
df_metadata = df_metadata.drop('ClassId', 'SectionId')
df_metadata = df_metadata.join(df_metadata1, df_metadata.AADGroupId == df_metadata1.SectionId, how='inner')
df_metadata = df_metadata.drop('AADGroupId')

df_metadata = df_metadata.withColumn('numStudentsMissedMeeting', F.col('numStudentsEnrolledInSection')-F.col('numStudentsAttendedMeeting'))

display(df_metadata.limit(10))

StatementMeta(sparkMed, 155, 24, Finished, Available)
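
The counts computed in the cell above reduce to simple arithmetic: rows flagged 0 or 1 count as attended, rows flagged -1 do not, and the number of students who missed a meeting is the section enrollment minus the attended count. A plain-Python sketch under those assumptions follows; `attendance_counts` and the input shapes are hypothetical.

```python
from collections import defaultdict

def attendance_counts(meeting_rows, enrolled_per_section):
    """Return attended and missed counts per (meeting_id, section_id).

    meeting_rows: iterable of (meeting_id, section_id, flag)
    enrolled_per_section: dict of section_id -> number of enrolled students
    Hypothetical helper for illustration only.
    """
    attended = defaultdict(int)
    for meeting_id, section_id, flag in meeting_rows:
        # flags 0 (late) and 1 (on time) both count as attended
        attended[(meeting_id, section_id)] += 0 if flag == -1 else 1
    return {
        key: {"attended": n, "missed": enrolled_per_section[key[1]] - n}
        for key, n in attended.items()
    }

meeting_rows = [("m1", "c1", 1), ("m1", "c1", 0), ("m1", "c1", -1)]
counts = attendance_counts(meeting_rows, {"c1": 3})
```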


In [45]:
# join the column for num students that missed the meeting (from comparing attendance to enrollment) to the final meeting aggregation df
df_metadata = df_metadata.select('SectionId', 'meetingId', 'numStudentsEnrolledInSection', 'numStudentsAttendedMeeting', 'numStudentsMissedMeeting')
df_metadata = df_metadata.withColumnRenamed('meetingId', 'id')
dfMeetingAgg = dfEngageMeetingAgg.join(df_metadata, dfEngageMeetingAgg.MeetingId == df_metadata.id, how='inner')
dfMeetingAgg = dfMeetingAgg.drop('id', 'totalParticipantCount', 'AADGroup_ClassId').withColumnRenamed('numStudentsAttendedMeeting', 'totalParticipantCount').withColumnRenamed('MeetingId', 'meetingId')
display(dfMeetingAgg.limit(10))

StatementMeta(sparkMed, 155, 25, Finished, Available)


### Write to Stage 3p

In [46]:
dfMeetingAgg.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3p + '/learning_analytics/MeetingsAggregate_pseudo')

StatementMeta(sparkMed, 155, 26, Finished, Available)

## 3.) Create InsightsActivity_pseudo Table

Data aggregations and curation on Insights (M365) activity data: TechActivity table. This notebook was developed using the Higher Ed. test data in stage 2p/np of the data lake. 

This table has one row per Insights signal. It drops the columns that are not needed and adds a SignalCategories column that groups signal types into categories.

This table is then written out to stage 3p under: stage3p/learning_analytics/InsightsActivity_pseudo.

In [64]:
dfInsights_activity = oea.load('M365', 'TechActivity_pseudo')

StatementMeta(sparkMed, 157, 2, Finished, Available)

In [65]:
dfInsightsActivity = dfInsights_activity.select('SignalType', 'StartTime', 'SignalId', 'ClassId', 'AppName', 'ActorId_pseudonym', 'ActorRole', 'AssignmentId', \
                        'SubmissionId', 'SubmissionCreatedTime', 'Action', 'DueDate', 'Grade', 'SourceFileExtension', 'MeetingDuration', 'MeetingSessionId', 'MeetingType')
dfInsightsActivity = dfInsightsActivity.withColumnRenamed('ActorId_pseudonym', 'StudentId_external_pseudonym').withColumnRenamed('ClassId', 'SectionId')

StatementMeta(sparkMed, 157, 3, Finished, Available)

In [54]:
# create a new column for categorizing the signals
from pyspark.sql.types import StringType

# signal type -> signal category; unknown signal types map to an empty string
SIGNAL_CATEGORIES = {
    'PostChannelMessage': 'Messaging',
    'ReplyChannelMessage': 'Messaging',
    'VisitTeamChannel': 'Messaging',
    'ExpandChannelMessage': 'Messaging',
    'ReactedWithEmoji': 'Messaging',
    'Like': 'Files',
    'Unlike': 'Files',
    'FileAccessed': 'Files',
    'FileModified': 'Files',
    'FileDownloaded': 'Files',
    'FileUploaded': 'Files',
    'ShareNotificationRequested': 'Files',
    'AddedToSharedWithMe': 'Files',
    'CommentCreated': 'Files',
    'CommentDeleted': 'Files',
    'UserAtMentioned': 'Files',
    'Reflect': 'Reflect',
    'OneNotePageChanged': 'Notebook',
    'SubmissionEvent': 'Assignments',
    'AssignmentEvent': 'Assignments',
    'CallRecordSummarized': 'TeamsMeeting',
}

def SignalCat(SignalType):
    return SIGNAL_CATEGORIES.get(SignalType, '')

# define the function/dataType
new_f = F.udf(SignalCat, StringType())

# Add the new column
dfInsightsActivity = dfInsightsActivity.withColumn("SignalCategories", new_f('SignalType'))
display(dfInsightsActivity.limit(100))

StatementMeta(sparkMed, 155, 34, Finished, Available)


### Write to Stage 3p

In [66]:
dfInsightsActivity.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3p + '/learning_analytics/InsightsActivity_pseudo')

StatementMeta(sparkMed, 157, 4, Finished, Available)

## 4.) Create Assignments_pseudo Table

Data aggregations and curation on Insights (M365) activity data: TechActivity table. This notebook was developed using the Higher Ed. test data in stage 2p/np of the data lake. 

This table has one row per Insights AssignmentId, SectionId, and student, with an Insights_TotalNumSignals column that counts the student's signals for that assignment.

This table is then written out to stage 3p under: stage3p/learning_analytics/Assignments_pseudo.
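
The aggregation behind this table is a filtered group-and-count: keep signals whose actor is a student, drop signals without a usable AssignmentId, and count the remaining signals per section, assignment, and student. The plain-Python sketch below illustrates that logic; `count_assignment_signals` is a hypothetical name, and the input is a list of dicts rather than a Spark DataFrame.

```python
from collections import Counter

def count_assignment_signals(signals):
    """Count signals per (ClassId, AssignmentId, ActorId_pseudonym).

    signals: iterable of dicts with keys ActorRole, ClassId,
    ActorId_pseudonym, AssignmentId. Hypothetical helper for illustration only.
    """
    return Counter(
        (s["ClassId"], s["AssignmentId"], s["ActorId_pseudonym"])
        for s in signals
        # Spark's != filter also drops null AssignmentIds, so None is excluded here
        if s["ActorRole"] == "Student"
        and s["AssignmentId"] not in (None, "undefined")
    )

signals = [
    {"ActorRole": "Student", "ClassId": "c1", "ActorId_pseudonym": "s1", "AssignmentId": "a1"},
    {"ActorRole": "Student", "ClassId": "c1", "ActorId_pseudonym": "s1", "AssignmentId": "a1"},
    {"ActorRole": "Teacher", "ClassId": "c1", "ActorId_pseudonym": "t1", "AssignmentId": "a1"},
    {"ActorRole": "Student", "ClassId": "c1", "ActorId_pseudonym": "s2", "AssignmentId": "undefined"},
]
totals = count_assignment_signals(signals)
```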

In [69]:
dfInsights_activity = oea.load('M365', 'TechActivity_pseudo')

StatementMeta(sparkMed, 158, 2, Finished, Available)

In [70]:
dfAssignments = dfInsights_activity.select('SignalType', 'SignalId', 'ClassId', 'AppName', 'ActorId_pseudonym', 'ActorRole', 'AssignmentId', \
                        'SubmissionId', 'SubmissionCreatedTime', 'Action', 'DueDate', 'Grade', 'SourceFileExtension')
dfAssignments = dfAssignments.filter(dfAssignments['ActorRole'] == 'Student')
dfAssignments = dfAssignments.withColumnRenamed('ActorId_pseudonym', 'StudentId_external_pseudonym').withColumnRenamed('ClassId', 'SectionId')
display(dfAssignments.limit(10))

StatementMeta(sparkMed, 158, 3, Finished, Available)

In [60]:
dfAssignments = dfAssignments.groupBy('SectionId', 'StudentId_external_pseudonym', 'AssignmentId').count()
dfAssignments = dfAssignments.filter(dfAssignments['AssignmentId'] != 'undefined')
dfAssignments = dfAssignments.withColumnRenamed('count', 'Insights_TotalNumSignals')
dfAssignments = dfAssignments.select('SectionId', 'AssignmentId', 'StudentId_external_pseudonym', 'Insights_TotalNumSignals')
display(dfAssignments.limit(10))

StatementMeta(sparkMed, 155, 40, Finished, Available)


In [61]:
dfAssignments.coalesce(1).write.format('delta').mode('overwrite').option('header', True).save(oea.stage3p + '/learning_analytics/Assignments_pseudo')

StatementMeta(sparkMed, 155, 41, Finished, Available)