# Test Data Generation: Insights Tables Class

**Affiliation**: *Kwantum Edu Analytics*. **Last Modified**: *2/14/2023*.

This OEA test data generation class notebook generates fictitious Insights roster and activity tables that mirror those in the Microsoft Education Insights module. It must be available before the insights_test_data_gen_demo notebook can run successfully.

This class notebook relies primarily on the OEA_py class notebook, the ```Faker``` and ```random``` Python packages, and previously generated base-truth tables to generate **27** Insights module tables:

 1. **Activity**
 2. **AadGroup**
 3. **AadGroupMembership** 
 4. **AadUser**
 5. **AadUserPersonMapping**
 6. **Course**
 7. **CourseGradeLevel**
 8. **CourseSubject**
 9. **Enrollment**
 10. **Organization**
 11. **Person**
 12. **PersonDemographic**
 13. **PersonDemographicEthnicity**
 14. **PersonDemographicPersonFlag**
 15. **PersonDemographicRace**
 16. **PersonEmailAddress**
 17. **PersonIdentifier**
 18. **PersonOrganizationRole**
 19. **PersonPhoneNumber**
 20. **PersonRelationship**
 21. **RefDefinition** *(Note: This CSV is landed from GitHub as an ungenerated base-truth table.)*
 22. **Section**
 23. **SectionGradeLevel**
 24. **SectionSession**
 25. **SectionSubject**
 26. **Session**
 27. **SourceSystem**
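
As a rough illustration of how each table above is built up (a minimal sketch, not the class's exact code): the class keeps one empty pandas DataFrame per table and appends one generated row at a time. The `SourceSystem` columns and sample values below are taken from the class; the `new_uuid` helper is a hypothetical stand-in for Faker's `uuid4()` so that the sketch runs without Faker installed.

```python
import uuid
import pandas as pd

# Column layout of the SourceSystem table, as initialized in the class.
sourcesystem = {
    'Id': [],
    'Name': [],
    'FirstSeenDateTime': [],
    'LastSeenDateTime': [],
}
df_sourcesystem = pd.DataFrame(sourcesystem, dtype=object)

def new_uuid():
    # Hypothetical stand-in for Faker().uuid4(); returns a random UUID string.
    return str(uuid.uuid4())

# Append a single row at the next free integer index.
df_sourcesystem.loc[len(df_sourcesystem.index)] = [
    new_uuid(), 'Source System', '2022-01-03T00:00:00', '2022-06-03T00:00:00'
]
print(df_sourcesystem)
```

Appending row by row keeps the generation logic easy to follow, though it can be slow for large base-truth tables.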

The main method, ```genInsights(startdate, enddate, ed_level, gen_activity, num_activity_signals)```, generates the roster, AAD, and activity tables. Its parameters are:
  - *startdate*: roster start date.
  - *enddate*: roster end date.
  - *ed_level*: accepts k12 or hed; used for activity data generation (reading signals are generated only for k12).
  - *gen_activity*: boolean argument indicating whether to generate activity data.
  - *num_activity_signals*: number of student-activity signal rows to generate.
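
To make the *ed_level* and *num_activity_signals* parameters more concrete, here is a minimal standalone sketch of the weighted sampling that drives activity generation. The category weights mirror those used in the class; the function name `sample_signal_categories` and its `seed` parameter are hypothetical and exist only for this illustration.

```python
import random

# Category weights as used by the class when generating activity signals.
K12_CATEGORIES = (['messaging', 'assignment', 'reflect', 'meeting', 'reading'],
                  [0.4, 0.2, 0.1, 0.1, 0.2])
HED_CATEGORIES = (['messaging', 'assignment', 'reflect', 'meeting'],
                  [0.4, 0.3, 0.1, 0.2])

def sample_signal_categories(ed_level='k12', num_activity_signals=20, seed=None):
    """Hypothetical helper: draw one signal category per requested signal."""
    rng = random.Random(seed)
    categories, weights = K12_CATEGORIES if ed_level == 'k12' else HED_CATEGORIES
    return rng.choices(categories, weights=weights, k=num_activity_signals)

signals = sample_signal_categories(ed_level='hed', num_activity_signals=5, seed=0)
print(signals)
```

In a Synapse notebook where the OEA_py class notebook has already been run, the class itself would presumably be driven by a call such as `InsightsDataGen().genInsights(ed_level='hed', num_activity_signals=100)`; that call depends on the `oea` and `spark` objects and is not runnable on its own.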


In [1]:
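# NOTE: `oea`, `spark`, and `logger` are not defined in this notebook; they are assumed to be provided by the OEA_py class notebook and the Spark session.
import logging
import random, decimal
import requests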
from faker import Faker
import pandas as pd
import datetime as dt
import numpy as np
from pyspark.sql import functions as F

class InsightsDataGen():
    def __init__(self, startdate='2022-01-03T00:00:00', enddate='2022-06-03T00:00:00'):
        self.startdate = startdate
        self.enddate = enddate
        
        self.faker = Faker('en_US')

        # set current datetime for rundate folder for writing out files
        currentDate = dt.datetime.now()
        self.currentDateTime = currentDate.strftime("%Y-%m-%d %H-%M-%S")

        # initialize dfs for each Insights table to be generated
        activity = {
            'SignalType':[],
            'StartTime':[],
            'UserAgent':[],
            'SignalId':[],
            'SisClassId':[],
            'ClassId':[],
            'ChannelId':[],
            'AppName':[],
            'ActorId':[],
            'ActorRole':[],
            'SchemaVersion':[],
            'AssignmentId':[],
            'SubmissionId':[],
            'SubmissionCreatedTime':[],
            'Action':[],
            'DueDate':[],
            'ClassCreationDate':[],
            'Grade':[],
            'SourceFileExtension':[],
            'MeetingDuration':[],
            'MeetingSessionId':[],
            'MeetingType':[],
            'ReadingSubmissionWordsPerMinute':[],
            'ReadingSubmissionAccuracyScore':[],
            'ReadingSubmissionMispronunciationCount':[],
            'ReadingSubmissionRepetitionsCount':[],
            'ReadingSubmissionInsertionsCount':[],
            'ReadingSubmissionOmissionScore':[],
            'ReadingSubmissionAttemptNumber':[],
            'ReadingAssignmentWordCount':[],
            'ReadingAssignmentFleschKincaidGradeLevel':[],
            'ReadingAssignmentLanguage':[]
        }
        self.M365_activity = pd.DataFrame(activity, dtype=object)
        aaduser = {
            'ObjectId':[],
            'UserPrincipalName':[],
            'Mail':[],
            'MailNickName':[],
            'GivenName':[],
            'Surname':[],
            'DisplayName':[],
            'AnchorId':[],
            'StudentId':[],
            'TeacherId':[],
            'Role':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[]
        }
        self.M365_aaduser = pd.DataFrame(aaduser, dtype=object)
        aaduserpersonmapping = {
            'ObjectId':[],
            'PersonId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[]
        }
        self.M365_aaduserpersonmapping = pd.DataFrame(aaduserpersonmapping, dtype=object)
        aadgroup = {
            'ObjectId':[],
            'DisplayName':[],
            'Mail':[],
            'MailNickName':[],
            'AnchorId':[],
            'SectionId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[]
        }
        self.M365_aadgroup = pd.DataFrame(aadgroup, dtype=object)
        aadgroupmembership = {
            'UserObjectId':[],
            'GroupObjectId':[],
            'Role':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[]
        }
        self.M365_aadgroupmembership = pd.DataFrame(aadgroupmembership, dtype=object)
        course = {
            'Id':[],
            'SourceSystemId':[],
            'ExternalId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'Name':[],
            'OrganizationId':[],
            'IsActiveInSession':[],
            'Code':[],
            'AcademicYearSessionId':[]
        }
        self.M365_course = pd.DataFrame(course, dtype=object)
        coursesubject = {
            'Id':[],
            'CourseId':[],
            'RefAcademicSubjectId':[],
            'FirstSeenDateTime':[]
        }
        self.M365_coursesubject = pd.DataFrame(coursesubject, dtype=object)
        coursegradelevel = {
            'Id':[],
            'CourseId':[],
            'RefGradeLevelId':[],
            'FirstSeenDateTime':[]
        }
        self.M365_coursegradelevel = pd.DataFrame(coursegradelevel, dtype=object)
        enroll = {
            'Id':[],
            'SourceSystemId':[],
            'ExternalId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'PersonId':[],
            'SectionId':[],
            'RefSectionRoleId':[],
            'IsActiveInSession':[],
            'IsPrimaryStaffForSection':[],
            'EntryDate':[],
            'ExitDate':[]
        }
        self.M365_enrollment = pd.DataFrame(enroll, dtype=object)
        org = {
            'Id':[],
            'SourceSystemId':[],
            'ExternalId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'Name':[],
            'Identifier':[],
            'RefOrganizationTypeId':[],
            'ParentOrganizationId':[]
        }
        self.M365_organization = pd.DataFrame(org, dtype=object)
        person = {
            'Id':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'Surname':[],
            'GivenName':[],
            'MiddleName':[],
            'PreferredSurname':[],
            'PreferredGivenName':[],
            'PreferredMiddleName':[]
        }
        self.M365_person = pd.DataFrame(person, dtype=object)
        persondemo = {
            'Id':[],
            'PersonId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'RefSexId':[],
            'BirthDate':[],
            'BirthCity':[],
            'BirthState':[],
            'BirthCountryCode':[]
        }
        self.M365_persondemographic = pd.DataFrame(persondemo, dtype=object)
        persondemo_eth = {
            'Id':[],
            'PersonId':[],
            'RefEthnicityId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[]
        }
        self.M365_persondemographicethnicity = pd.DataFrame(persondemo_eth, dtype=object)
        persondemo_pflag = {
            'Id':[],
            'PersonId':[],
            'RefPersonFlag':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[]
        }
        self.M365_persondemographicpersonflag = pd.DataFrame(persondemo_pflag, dtype=object)
        persondemo_race = {
            'Id':[],
            'PersonId':[],
            'RefRaceId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[]
        }
        self.M365_persondemographicrace = pd.DataFrame(persondemo_race, dtype=object)
        personemailadd = {
            'Id':[],
            'PersonId':[],
            'EmailAddress':[],
            'PriorityOrder':[],
            'RefEmailAddressTypeId':[],
            'FirstSeenDateTime':[]
        }
        self.M365_personemailaddress = pd.DataFrame(personemailadd, dtype=object)
        personidentifier = {
            'Id':[],
            'PersonId':[],
            'SourceSystemId':[],
            'RefIdentifierTypeId':[],
            'Identifier':[],
            'FirstSeenDateTime':[],
            'IsPresentInSource':[]
        }
        self.M365_personidentifier = pd.DataFrame(personidentifier, dtype=object)
        personorgrole = {
            'Id':[],
            'SourceSystemId':[],
            'ExternalId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'OrganizationId':[],
            'PersonId':[],
            'RefRoleId':[],
            'SessionId':[],
            'IsActiveInSession':[],
            'RoleStartDate':[],
            'RoleEndDate':[],
            'IsPrimary':[],
            'RefGradeLevelId':[]
        }
        self.M365_personorganizationrole = pd.DataFrame(personorgrole, dtype=object)
        personphone = {
            'Id':[],
            'PersonId':[],
            'PhoneNumber':[],
            'PriorityOrder':[],
            'RefPhoneNumberTypeId':[],
            'FirstSeenDateTime':[]
        }
        self.M365_personphonenumber = pd.DataFrame(personphone, dtype=object)
        personrelationship = {
            'Id':[],
            'PersonId':[],
            'RelatedPersonId':[],
            'RefPersonRelationshipId':[],
            'FirstSeenDateTime':[]
        }
        self.M365_personrelationship = pd.DataFrame(personrelationship, dtype=object)
        section = {
            'Id':[],
            'SourceSystemId':[],
            'ExternalId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'Name':[],
            'OrganizationId':[],
            'CourseId':[],
            'Code':[],
            'Location':[]
        }
        self.M365_section = pd.DataFrame(section, dtype=object)
        sectiongradelevel = {
            'Id':[],
            'SectionId':[],
            'RefGradeLevelId':[],
            'FirstSeenDateTime':[]
        }
        self.M365_sectiongradelevel = pd.DataFrame(sectiongradelevel, dtype=object)
        sectionsession = {
            'Id':[],
            'SectionId':[],
            'SessionId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'IsActiveInSession':[]
        }
        self.M365_sectionsession = pd.DataFrame(sectionsession, dtype=object)
        sectionsubject = {
            'Id':[],
            'SectionId':[],
            'RefAcademicSubjectId':[],
            'FirstSeenDateTime':[]
        }
        self.M365_sectionsubject = pd.DataFrame(sectionsubject, dtype=object)
        session = {
            'Id':[],
            'SourceSystemId':[],
            'ExternalId':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[],
            'Name':[],
            'RefSessionTypeId':[],
            'RefAcademicYearId':[],
            'StartDate':[],
            'EndDate':[],
            'ParentSessionId':[]
        }
        self.M365_session = pd.DataFrame(session, dtype=object)
        sourcesystem = {
            'Id':[],
            'Name':[],
            'FirstSeenDateTime':[],
            'LastSeenDateTime':[]
        }
        self.M365_sourcesystem = pd.DataFrame(sourcesystem, dtype=object)

        sourcepath = 'stage1/Transactional/test_data/v0.1/'
        self.students = oea.load_csv(sourcepath + 'base_students/', header=True).toPandas()
        self.schools = oea.load_csv(sourcepath + 'base_schools/', header=True).toPandas()
        self.courses = oea.load_csv(sourcepath + 'base_courses/', header=True).toPandas()
        self.sections = oea.load_csv(sourcepath + 'base_sections/', header=True).toPandas()
        self.enrollment = oea.load_csv(sourcepath + 'base_enrollment/', header=True).toPandas()
        # land the base refdef CSV unless it already exists
        refdef_exists = oea.path_exists(sourcepath + 'base_refdef/')
        if refdef_exists:
            logger.info('base_refdef CSV already exists.')
        else:
            # NOTE: this URL is subject to change depending on the directory of the Insights module test data gen kit (specifically, the location of the base refdef file).
            data = requests.get('https://raw.githubusercontent.com/microsoft/OpenEduAnalytics/main/modules/module_test_data_generation_kit/notebook/Insights_module/base_refdef.csv').text
            oea.land(data, 'test_data/v0.1/base_refdef', 'base_insights_refdef.csv', oea.SNAPSHOT_BATCH_DATA)
        self.refdef = oea.load_csv(sourcepath + 'base_refdef/', header=True)
        # generate the source system ID and session ID
        self.sourcesystemid = self.faker.uuid4()
        self.sessionid = self.faker.uuid4()

    def genInsights(self,startdate='2022-01-01T00:00:00',enddate='2022-06-01T00:00:00', ed_level='k12', gen_activity=True,num_activity_signals=20):
        self.edlevel = ed_level
        self.startdate = startdate
        self.enddate = enddate
        self.genSourceSystem()
        self.genSectionSession()
        self.genSession()
        self.genOrganization()
        self.genSection()
        self.genAadGroup()
        self.genSectionGradeLevel()
        self.genSectionSubject()
        self.genCourse()
        self.genCourseSubject()
        self.genCourseGradeLevel()
        self.genPersonDemographicEthnicity()
        self.genPersonDemographicPersonFlag()
        self.genPersonDemographicRace()
        self.genPersonDemographic()
        self.genPerson()
        self.genPersonEmailAddress()
        self.genAadUserPersonMapping()
        self.genAadUser()
        self.genPersonPhoneNumber()
        self.genPersonRelationship()
        self.genPersonIdentifier()
        self.genRefDefinition()
        self.genEnrollment() # <- this function may take a while depending on size of base_enrollment table
        self.genPersonOrganizationRole() # <- this function may take a while depending on size of base_enrollment table
        self.genAadGroupMembership()
        if gen_activity:
            self.genActivity(num_activity_signals)
            logger.info('Finished Insights generation.')
        else:
            logger.info('No Insights activity to generate - finished Insights generation.')

    def _get_gradelevel_map(self, base_table_grade='undergraduate: year 1'):
        # map a base-truth table grade level to the corresponding Insights grade level option
        gradelevel_map = {
            'undergraduate: year 1': 'PS1',
            'undergraduate: year 2': 'PS2',
            'undergraduate: year 3': 'PS3',
            'undergraduate: year 4': 'PS4',
            'graduate: year 1': 'graduate',
            'graduate: year 2': 'graduate',
            'general education': 'PS',
        }
        # any other grade level falls back to '0'
        return gradelevel_map.get(base_table_grade, '0')

    def _get_subject_map(self, base_table_subject='General Elementary Education'):
        # map a base-truth table subject to the corresponding Insights subject option
        subject_map = {
            'General Elementary Education': 'Non-Subject-Specific',
            # arts
            'General Art Knowledge': 'Visual and Performing Arts',
            'Visual Art': 'Visual and Performing Arts',
            'Digital Art': 'Visual and Performing Arts',
            'Sculpture Art': 'Visual and Performing Arts',
            'Photographic Art': 'Visual and Performing Arts',
            'Theory of Art': 'Visual and Performing Arts',
            'Theory of Art Education': 'Visual and Performing Arts',
            'Performance Art': 'Visual and Performing Arts',
            # sciences
            'Physics': 'Life and Physical Sciences',
            'Chemistry': 'Life and Physical Sciences',
            'Biology': 'Life and Physical Sciences',
            'Philosophy': 'Social Sciences and History',
            'General Applied Knowledge': 'Non-Subject-Specific',
            # business
            'Communication & Behavior': 'Business and Marketing',
            'Theory of Management': 'Business and Marketing',
            'Theory of Business': 'Business and Marketing',
            'Applied Management': 'Business and Marketing',
            'Theory of Entrepreneuring': 'Business and Marketing',
        }
        # subjects without a mapping are passed through unchanged
        return subject_map.get(base_table_subject, base_table_subject)

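    def __get_daterange(self):
        # NOTE: the activity date range is hard-coded (2022-01-03 up to, but not including, 2022-01-28)
        # and does not depend on self.startdate or self.enddate
        daterange = []
        startdate = dt.datetime(2022,1,3)
        enddate = dt.datetime(2022,1,28)
        while startdate < enddate:
            daterange.append(startdate)
            startdate = startdate + dt.timedelta(days=1)
        return daterange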

    def _gen_signal(self, dfenroll, daterange):
        # all other roster tables must be generated first
        # NOTE: generated activity may not look realistic, and only generates student activities
        for index,enroll in dfenroll.iterrows():
            if self.edlevel == 'k12':
                signal_category = random.choices(['messaging','assignment','reflect','meeting','reading'], weights=[0.4,0.2,0.1,0.1,0.2])
            else:
                signal_category = random.choices(['messaging','assignment','reflect','meeting'], weights=[0.4,0.3,0.1,0.2])
            # initialize/assign static fields
            day = random.choice(daterange)
            starttime = day + dt.timedelta(days=0,hours=random.randint(0,23), minutes=random.randint(0,59))
            useragent = ''
            signalid = self.faker.uuid4()
            sisclassid = ''
            classid = enroll['GroupObjectId']
            channelid = ''
            actorid = enroll['UserObjectId']
            actorrole = 'Student'
            schemaversion = '1.14'
            classcreationdate = dt.date(2022,1,3)
            meetingsessionid = ''
            # Messaging signal gen
            if signal_category[0] == 'messaging':
                message_category = random.choices(['SharePoint Online','Teams','OneNote','TeamsMobile'], weights=[0.4,0.3,0.25,0.05])
                appname = message_category[0]
                if message_category[0] == 'SharePoint Online':
                    sharepoint_action = random.choices(['Like','Unlike','FileAccessed','ShareNotificationRequested','AddedToSharedWithMe','CommentCreated','CommentDeleted','UserAtMentioned'], weights=[0.125,0.125,0.1,0.125,0.125,0.15,0.05,0.2])
                    signaltype = sharepoint_action[0]
                elif message_category[0] == 'Teams':
                    teams_action = random.choices(['PostChannelMessage','ReplyChannelMessage','VisitTeamChannel','ExpandChannelMessage','ReactedWithEmoji'], weights=[0.3,0.1,0.3,0.2,0.1])
                    signaltype = teams_action[0]
                elif message_category[0] == 'TeamsMobile':
                    teamsmobile_action = random.choices(['PostChannelMessage','ReplyChannelMessage','VisitTeamChannel','ExpandChannelMessage','ReactedWithEmoji'], weights=[0.05,0.1,0.4,0.05,0.4])
                    signaltype = teamsmobile_action[0]
                elif message_category[0] == 'OneNote':
                    signaltype = 'OneNotePageChanged'
                assignmentid = ''
                submissionid = ''
                submissioncreatedtime = ''
                action = ''
                duedate = ''
                grade = ''
                sourcefileextension = ''
                meetingduration = ''
                meetingtype = ''
            # Assignment signal gen (currently, only generates submissions for assignments or SharePoint File-related signals)
            elif signal_category[0] == 'assignment':
                assignment_category = random.choices(['SubmissionEvent','File'], weights=[0.5,0.5])
                if assignment_category[0] == 'SubmissionEvent':
                    signaltype = 'SubmissionEvent'
                    appname = 'Assignments'
                    assignmentid = self.faker.uuid4()
                    submission_actions = random.choices(['Visited','Submitted','FeedbackSubmitted','Returned'], weights=[0.4,0.3,0.1,0.2])
                    action = submission_actions[0]
                    if action == 'Visited':
                        submissionid = ''
                        submissioncreatedtime = ''
                        duedate = f'{starttime + dt.timedelta(days=7)}'
                        grade = ''
                        sourcefileextension = ''
                    elif action == 'Returned':
                        submissionid = self.faker.uuid4()
                        submissioncreatedtime = f'{starttime}'
                        duedate = f'{starttime}' 
                        grade = '{}'.format(decimal.Decimal(random.randrange(4000, 10000))/100)
                        sourcefileextension = random.choices(['docx','xlsx','pptx','pdf','mp4','web','jpg'])[0]
                    else:
                        submissionid = self.faker.uuid4()
                        submissioncreatedtime = f'{starttime}'
                        duedate = f'{starttime + dt.timedelta(days=1)}'
                        grade = ''
                        sourcefileextension = random.choices(['docx','xlsx','pptx','pdf','mp4','web','jpg'])[0]
                else:
                    file_actions = random.choices(['FileModified','FileDownloaded','FileUploaded'], weights=[0.5,0.3,0.2])
                    signaltype = file_actions[0]
                    appname = 'SharePoint Online'
                    assignmentid = ''
                    action = ''
                    submissionid = ''
                    submissioncreatedtime = ''
                    duedate = ''
                    grade = ''
                    sourcefileextension = random.choices(['docx','xlsx','pptx','pdf','mp4','web','jpg'])[0]
            # Reflect signal gen
            elif signal_category[0] == 'reflect':
                signaltype = 'Reflect'
                appname = 'Reflect'
                reflect_action = random.choices(['CardPosted','FeedbackSubmitted'], weights=[0.8,0.2])
                action = reflect_action[0]
            # Meeting signal gen
            elif signal_category[0] == 'meeting':
                signaltype = 'CallRecordSummarized'
                appname = 'Teams'
                meetingduration = f'{random.randint(0,2)}:{random.randint(1,59)}:{random.randint(0,59)}'
                meetingtype = random.choices(['adHoc','scheduledRecurring','scheduledOneTime'])[0]
            # Reading signal gen
            elif signal_category[0] == 'reading':
                reading_categories = random.choices(['ReadingAssignment','ReadingSubmission'], weights=[0.4,0.6])
                signaltype = reading_categories[0]
                appname = 'ReadingProgress'
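                grade = ''
                # reading signals carry no source file; set this here so the row assignment below does not hit an unassigned variable
                sourcefileextension = ''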
                duedate = ''
                assignmentid = self.faker.uuid4()
                if signaltype == 'ReadingAssignment':
                    action = 'Visited'
                    submissionid = ''
                    submissioncreatedtime = ''
                    rs_wordsperminute = ''
                    rs_accuracy_s = ''
                    rs_mispronunciations_c = ''
                    rs_repetitions_c = ''
                    rs_insertions_c = ''
                    rs_omission_s = ''
                    rs_attemptnumber = ''
                    ra_word_c = ''
                    ra_fleschkincaidgradelevel = ''
                    ra_language = ''
                else:
                    readingassign_actions = random.choices(['Attempt','Submit'], weights=[0.4,0.6])
                    action = readingassign_actions[0]
                    if action == 'Attempt':
                        submissionid = ''
                        submissioncreatedtime = ''
                        rs_wordsperminute = ''
                        rs_accuracy_s = ''
                        rs_mispronunciations_c = f'{random.randint(0,120)}'
                        rs_repetitions_c = f'{random.randint(0,20)}'
                        rs_insertions_c = f'{random.randint(0,50)}'
                        rs_omission_s = f'{random.randint(0,50)}'
                        rs_attemptnumber = '1'
                        ra_word_c = ''
                        ra_fleschkincaidgradelevel = ''
                        ra_language = ''
                    else:
                        submissionid = self.faker.uuid4()
                        submissioncreatedtime = f'{starttime}'
                        rs_wordsperminute = f'{random.randint(1,200)}'
                        rs_accuracy_s = '{}'.format(decimal.Decimal(random.randrange(4000, 10000))/100)
                        rs_mispronunciations_c = f'{random.randint(0,120)}'
                        rs_repetitions_c = f'{random.randint(0,20)}'
                        rs_insertions_c = f'{random.randint(0,50)}'
                        rs_omission_s = f'{random.randint(0,50)}'
                        rs_attemptnumber = '1'
                        ra_word_c = f'{random.randint(50,1000)}'
                        ra_fleschkincaidgradelevel = ''
                        ra_language = ''
            # Fill in any blank fields as needed.
            if signal_category[0] != 'reading':
                rs_wordsperminute = ''
                rs_accuracy_s = ''
                rs_mispronunciations_c = ''
                rs_repetitions_c = ''
                rs_insertions_c = ''
                rs_omission_s = ''
                rs_attemptnumber = ''
                ra_word_c = ''
                ra_fleschkincaidgradelevel = ''
                ra_language = ''
                if signal_category[0] != 'assignment':
                    assignmentid = ''
                    submissionid = ''
                    submissioncreatedtime = ''
                    duedate = ''
                    grade = ''
                    sourcefileextension = ''
                    if signal_category[0] != 'reflect':
                        action = ''
            if signal_category[0] != 'meeting':
                meetingduration = ''
                meetingtype = ''
            # rs = reading submission, ra = reading assignment; _c = count, _s = score
            self.M365_activity.loc[len(self.M365_activity.index)] = [signaltype,starttime,useragent,signalid,sisclassid,classid,channelid,appname,actorid,actorrole,schemaversion,assignmentid, \
            submissionid,submissioncreatedtime,action,duedate,classcreationdate,grade,sourcefileextension,meetingduration,meetingsessionid,meetingtype, \
            rs_wordsperminute,rs_accuracy_s,rs_mispronunciations_c,rs_repetitions_c,rs_insertions_c,rs_omission_s,rs_attemptnumber,ra_word_c,ra_fleschkincaidgradelevel,ra_language]

    def genActivity(self, num_signals=20):
        date_range = self.__get_daterange()
        while num_signals > 0:
            num_enrollments = len(self.M365_aadgroupmembership.index) - 1
            random_enroll = random.randint(0,num_enrollments)
            enroll = self.M365_aadgroupmembership.filter(items=[random_enroll], axis=0)
            self._gen_signal(enroll,date_range)
            num_signals = num_signals - 1
        self.writetofile('Activity', self.M365_activity)

    def genAadUser(self):
        # person and AadUserPersonMapping tables must be generated first
        dfAadUserPersonMapping = spark.createDataFrame(self.M365_aaduserpersonmapping)
        for index, person in self.M365_person.iterrows():
            personid = person['Id']
            objectid = dfAadUserPersonMapping.filter(dfAadUserPersonMapping['PersonId']==f'{personid}').collect()[0][0]
            givenname = person['GivenName']
            surname = person['Surname']
            displayname = f'{givenname} {surname}'
            mail = f'{givenname}{surname}@contoso.edu'
            userprincipalname = mail
            mailnickname = 'School Email'
            anchorid = ''
            studentid = self.faker.uuid4()
            teacherid = ''
            role = 'student'
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            self.M365_aaduser.loc[len(self.M365_aaduser.index)] = [objectid,userprincipalname,mail,mailnickname,givenname,surname,displayname,anchorid,studentid,teacherid,role,firstseendatetime,lastseendatetime]
        self.writetofile('AadUser', self.M365_aaduser)
    
    def genAadUserPersonMapping(self):
        # person table must be generated first
        for index, person in self.M365_person.iterrows():
            objectid = self.faker.uuid4()
            personid = person['Id']
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            self.M365_aaduserpersonmapping.loc[len(self.M365_aaduserpersonmapping.index)] = [objectid,personid,firstseendatetime,lastseendatetime]
        self.writetofile('AadUserPersonMapping', self.M365_aaduserpersonmapping)

    def genAadGroup(self):
        # section table must be generated first
        for index, section in self.M365_section.iterrows():
            objectid = self.faker.uuid4()
            displayname = section['Name']
            mailnickname = displayname.replace(' ', '_').lower()
            mail = f'{mailnickname}@contoso.edu'
            anchorid = ''
            sectionid = section['Id']
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            self.M365_aadgroup.loc[len(self.M365_aadgroup.index)] = [objectid,displayname,mail,mailnickname,anchorid,sectionid,firstseendatetime,lastseendatetime]
        self.writetofile('AadGroup', self.M365_aadgroup)

    def genAadGroupMembership(self):
        # must have generated Enrollment, AadUserPersonMapping, and AadGroup tables first
        dfEnroll = spark.createDataFrame(self.M365_enrollment)
        dfAadUserPersonMapping = spark.createDataFrame(self.M365_aaduserpersonmapping)
        dfAadGroup = spark.createDataFrame(self.M365_aadgroup)
        # clean, join tables for relevant data and then write to AadGroupMembership CSV
        dfEnroll = dfEnroll.select('PersonId', 'SectionId', 'FirstSeenDateTime', 'LastSeenDateTime')
        dfAadUserPersonMapping = dfAadUserPersonMapping.withColumnRenamed('PersonId', 'pid').drop('FirstSeenDateTime', 'LastSeenDateTime')
        dfAadGroup = dfAadGroup.select('ObjectId','SectionId').withColumnRenamed('SectionId', 'sid')
        dfAadGroupMembership = dfEnroll.join(dfAadUserPersonMapping, dfEnroll.PersonId == dfAadUserPersonMapping.pid, how='inner').drop('pid','PersonId').withColumnRenamed('ObjectId', 'UserObjectId')
        dfAadGroupMembership = dfAadGroupMembership.join(dfAadGroup, dfAadGroupMembership.SectionId == dfAadGroup.sid, how='inner').drop('sid', 'SectionId').withColumnRenamed('ObjectId', 'GroupObjectId')
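        # F is pyspark.sql.functions; imported locally in case the notebook has not already bound the name
        from pyspark.sql import functions as F
        dfAadGroupMembership = dfAadGroupMembership.withColumn('Role', F.lit('Member'))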
        dfAadGroupMembership = dfAadGroupMembership.select('UserObjectId', 'GroupObjectId', 'Role', 'FirstSeenDateTime', 'LastSeenDateTime')
        genfilepath = 'stage1/Transactional/test_data/v0.1/M365_gen/AadGroupMembership/snapshot_batch_data/rundate='+self.currentDateTime
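        # na.drop returns a new DataFrame, so the result must be reassigned to take effect
        dfAadGroupMembership = dfAadGroupMembership.na.drop('all')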
        dfAadGroupMembership.coalesce(1).write.save(oea.to_url(genfilepath), format='csv', mode='overwrite', header='false', mergeSchema='true')
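        # keep a pandas copy of the generated rows on the instance
        pdfAadGroupMembership = dfAadGroupMembership.toPandas()
        # DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True
        # also renumbers the rows without adding a stray 'index' column
        self.M365_aadgroupmembership = pd.concat([self.M365_aadgroupmembership, pdfAadGroupMembership], ignore_index=True)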

    def genSourceSystem(self):
        id = self.sourcesystemid
        name = 'Source System'
        firstseendatetime = self.startdate
        lastseendatetime = self.enddate
        self.M365_sourcesystem.loc[len(self.M365_sourcesystem.index)] = [id,name,firstseendatetime,lastseendatetime]
        self.writetofile('SourceSystem', self.M365_sourcesystem)
    
    def genRefDefinition(self):
        # almost identical to writetofile function - without recreating a spark df
        genfilepath = 'stage1/Transactional/test_data/v0.1/M365_gen/RefDefinition/snapshot_batch_data/rundate='+self.currentDateTime
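        # na.drop returns a new DataFrame; write the cleaned copy and leave self.refdef untouched
        refdef_out = self.refdef.na.drop('all')
        refdef_out.coalesce(1).write.save(oea.to_url(genfilepath), format='csv', mode='overwrite', header='false', mergeSchema='true')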

    def genRefTranslation(self):
        # placeholder: no RefTranslation table is generated
        return

    def genOrganization(self):
        # create a parent org based on whether generating hed or k12 data
        if self.edlevel == 'hed':
            parentorgname = 'University of Contoso'
            identifier = parentorgname
            reforganizationtypeid = self.refdef.filter(self.refdef['Code']=='University').collect()[0][0]
        else:
            parentorgname = 'Contoso ISD 3'
            identifier = parentorgname
            reforganizationtypeid = self.refdef.filter(self.refdef['Code']=='District').collect()[0][0]
        parentorg_id = self.faker.uuid4()
        sourcesystemid = self.sourcesystemid
        externalid = random.randint(10000,99999)
        firstseendatetime = self.startdate
        lastseendatetime = self.enddate
        parentorganizationid  = ''
        self.M365_organization.loc[len(self.M365_organization.index)] = [parentorg_id,sourcesystemid,externalid,firstseendatetime,lastseendatetime,parentorgname,identifier,reforganizationtypeid,parentorganizationid]
        for index, school in self.schools.iterrows():
            id = school['SchoolID']
            sourcesystemid = self.sourcesystemid
            externalid = random.randint(10000,99999)
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            name = school['SchoolName'] 
            identifier = school['SchoolName']
            if school['SchoolType'] == 'College':
                reforganizationtypeid = self.refdef.filter(self.refdef['Code']=='College').collect()[0][0]
            else:
                reforganizationtypeid = self.refdef.filter(self.refdef['Code']=='School').collect()[0][0]
            parentorganizationid = parentorg_id
            self.M365_organization.loc[len(self.M365_organization.index)] = [id,sourcesystemid,externalid,firstseendatetime,lastseendatetime,name,identifier,reforganizationtypeid,parentorganizationid]
        self.writetofile('Organization', self.M365_organization)

    def genSectionSession(self):
        for index, section in self.sections.iterrows():
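            # each SectionSession row gets its own Id; reusing self.sessionid here gave every row the same Id
            id = self.faker.uuid4()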
            sectionid = section['SectionID']
            sessionid = self.sessionid
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate 
            isactiveinsession = True
            self.M365_sectionsession.loc[len(self.M365_sectionsession.index)] = [id,sectionid,sessionid,firstseendatetime,lastseendatetime,isactiveinsession]
        self.writetofile('SectionSession', self.M365_sectionsession)
    
    def genSession(self):
        id = self.sessionid
        sourcesystemid = self.sourcesystemid 
        externalid = random.randint(100,999)
        firstseendatetime = self.startdate
        lastseendatetime = self.enddate
        name = 'Session I' 
        refsessiontypeid = self.refdef.filter(self.refdef['Code']=='Semester').collect()[0][0]
        # the academic year is hard-coded to 2022, independent of startdate/enddate
        refacademicyearid = self.refdef.filter(self.refdef['Code']=='2022').collect()[0][0]
        startdate = self.startdate
        enddate = self.enddate
        parentsessionid = '' 
        self.M365_session.loc[len(self.M365_session.index)] = [id,sourcesystemid,externalid,firstseendatetime,lastseendatetime,name,refsessiontypeid,refacademicyearid,startdate,enddate,parentsessionid]
        self.writetofile('Session', self.M365_session)
    
    def genSection(self):
        for index, section in self.sections.iterrows():
            id = section['SectionID']
            sourcesystemid = self.sourcesystemid
            externalid = ''
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            name = section['SectionName']
            organizationid = section['SchoolID']
            courseid = section['CourseID']
            code = name[-3:]
            location = section['SchoolName'] 
            self.M365_section.loc[len(self.M365_section.index)] = [id,sourcesystemid,externalid,firstseendatetime,lastseendatetime,name,organizationid,courseid,code,location]
        self.writetofile('Section', self.M365_section)

    def genSectionGradeLevel(self):
        for index,section in self.sections.iterrows():
            id = self.faker.uuid4()
            sectionid = section['SectionID']
            # for hed, map the base-truth grade level to its Insights grade level code; k12 values are used as-is
            if self.edlevel == 'hed':
                sectiongradelevel = self._get_gradelevel_map(section['SectionGradeLevel'])
            else:
                sectiongradelevel = section['SectionGradeLevel']
            refgradelevelid = self.refdef.filter(self.refdef['Code']==f'{sectiongradelevel}').collect()[0][0]
            firstseendatetime = self.startdate 
            self.M365_sectiongradelevel.loc[len(self.M365_sectiongradelevel.index)] = [id,sectionid,refgradelevelid,firstseendatetime]
        self.writetofile('SectionGradeLevel', self.M365_sectiongradelevel)

    def genSectionSubject(self):
        for index, section in self.sections.iterrows():
            id = self.faker.uuid4()
            sectionid = section['SectionID']
            sectionsubject = self._get_subject_map(section['SectionSubject'])
            refacademicsubjectid = self.refdef.filter(self.refdef['Code']==sectionsubject).collect()[0][0]
            firstseendatetime = self.startdate 
            self.M365_sectionsubject.loc[len(self.M365_sectionsubject.index)] = [id,sectionid,refacademicsubjectid,firstseendatetime]
        self.writetofile('SectionSubject', self.M365_sectionsubject)

    def genCourse(self):
        for index, course in self.courses.iterrows():
            id = course['CourseID']
            sourcesystemid = self.sourcesystemid
            externalid = ''
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            name = course['CourseName']
            organizationid = course['SchoolID']
            isactiveinsession = True
            code = random.randint(1000,9999)
            academicyearsessionid = self.sessionid
            self.M365_course.loc[len(self.M365_course.index)] = [id,sourcesystemid,externalid,firstseendatetime,lastseendatetime,name,organizationid,isactiveinsession,code,academicyearsessionid]
        self.writetofile('Course', self.M365_course)

    def genCourseSubject(self):
        for index, course in self.courses.iterrows():
            id = self.faker.uuid4()
            courseid = course['CourseID']
            coursesubject = self._get_subject_map(course['CourseSubject'])
            refacademicsubjectid = self.refdef.filter(self.refdef['Code']==coursesubject).collect()[0][0]
            firstseendatetime = self.startdate
            self.M365_coursesubject.loc[len(self.M365_coursesubject.index)] = [id,courseid,refacademicsubjectid,firstseendatetime]
        self.writetofile('CourseSubject', self.M365_coursesubject)

    def genCourseGradeLevel(self):
        for index, course in self.courses.iterrows():
            id = self.faker.uuid4()
            courseid = course['CourseID']
            # for hed, map the base-truth grade level to its Insights grade level code; k12 values are used as-is
            if self.edlevel == 'hed':
                coursegradelevel = self._get_gradelevel_map(course['CourseGradeLevel'])
            else:
                coursegradelevel = course['CourseGradeLevel']
            refgradelevelid = self.refdef.filter(self.refdef['Code']==f'{coursegradelevel}').collect()[0][0]
            firstseendatetime = self.startdate 
            self.M365_coursegradelevel.loc[len(self.M365_coursegradelevel.index)] = [id,courseid,refgradelevelid,firstseendatetime]
        self.writetofile('CourseGradeLevel', self.M365_coursegradelevel)

    def genPersonOrganizationRole(self):
        refroleid = self.refdef.filter(self.refdef['Code']=='Student').collect()[0][0]
        for index, enroll in self.enrollment.iterrows():
            id = self.faker.uuid4()
            sourcesystemid = self.sourcesystemid
            externalid = ''
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            organizationid = enroll['SchoolID']
            personid = enroll['StudentID']
            sessionid = self.sessionid
            isactiveinsession = True
            rolestartdate = self.startdate
            roleenddate = self.enddate
            isprimary = ''
            # for hed, map the base-truth grade level to its Insights grade level code; k12 values are used as-is
            if self.edlevel == 'hed':
                coursegradelevel = self._get_gradelevel_map(enroll['CourseGradeLevel'])
            else:
                coursegradelevel = enroll['CourseGradeLevel']
            refgradelevelid = self.refdef.filter(self.refdef['Code']==f'{coursegradelevel}').collect()[0][0]
            self.M365_personorganizationrole.loc[len(self.M365_personorganizationrole.index)] = [id,sourcesystemid,externalid,firstseendatetime,lastseendatetime,organizationid,personid,refroleid,sessionid,isactiveinsession,rolestartdate,roleenddate,isprimary,refgradelevelid]
        self.writetofile('PersonOrganizationRole', self.M365_personorganizationrole)

    def genPersonDemographicEthnicity(self):
        for index, student in self.students.iterrows():
            id = self.faker.uuid4()
            personid = student['StudentID']
            # str() lets the check match both the boolean True and the string 'True'
            if str(student['HispanicLatino']) == 'True':
                refethnicityid = 'EBD54EE0-4E76-469E-A0D1-27836518E87C'
            else:
                refethnicityid = ''
            firstseendatetime = self.startdate 
            lastseendatetime = self.enddate
            self.M365_persondemographicethnicity.loc[len(self.M365_persondemographicethnicity.index)] = [id,personid,refethnicityid,firstseendatetime,lastseendatetime]
        self.writetofile('PersonDemographicEthnicity', self.M365_persondemographicethnicity)

    def genPersonDemographicPersonFlag(self):
        for index, student in self.students.iterrows():
            id = self.faker.uuid4()
            personid = student['StudentID']
            personflag = student['Flag']
            if self.edlevel == 'hed':
                refpersonflagid = ''
            elif personflag != '':
                refpersonflagid = self.refdef.filter(self.refdef['Code']==f'{personflag}').collect()[0][0]
            else:
                refpersonflagid = ''
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            self.M365_persondemographicpersonflag.loc[len(self.M365_persondemographicpersonflag.index)] = [id,personid,refpersonflagid,firstseendatetime,lastseendatetime]
        self.writetofile('PersonDemographicPersonFlag', self.M365_persondemographicpersonflag)

    def genPersonDemographicRace(self):
        for index, student in self.students.iterrows():
            id = self.faker.uuid4()
            personid = student['StudentID']
            studentrace = student['Race']
            refraceid = self.refdef.filter(self.refdef['Code']==studentrace).collect()[0][0]
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            self.M365_persondemographicrace.loc[len(self.M365_persondemographicrace.index)] = [id,personid,refraceid,firstseendatetime,lastseendatetime]
        self.writetofile('PersonDemographicRace', self.M365_persondemographicrace)
    
    def genPersonDemographic(self):
        for index, student in self.students.iterrows():
            id = self.faker.uuid4()
            personid = student['StudentID']
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            if student['Gender'] == 'M':
                refsexid = 'F543B59B-8AC5-49CF-A4F1-8439613F3378'
            elif student['Gender'] == 'F':
                refsexid = '591DF534-C4D9-48D4-A465-B8DFD80C3D05'
            else:
                refsexid = 'FD5A5217-6559-47FE-ABD3-8EF53FC23E85'
            birthdate = student['Birthday']
            birthcity = student['City']
            birthstate = student['State']
            birthcountycode = student['Zipcode']
            self.M365_persondemographic.loc[len(self.M365_persondemographic.index)] = [id,personid,firstseendatetime,lastseendatetime,refsexid,birthdate,birthcity,birthstate,birthcountycode]
        self.writetofile('PersonDemographic', self.M365_persondemographic)

    def genPersonPhoneNumber(self):
        for index, student in self.students.iterrows():
            id = self.faker.uuid4()
            personid = student['StudentID']
            phonenumber = student['Phone']
            priorityorder = ''
            refphonenumbertypeid = '2EF682BF-22F4-4A13-B7E6-2F8ADCCBF3C8' 
            firstseendatetime = self.startdate
            self.M365_personphonenumber.loc[len(self.M365_personphonenumber.index)] = [id,personid,phonenumber,priorityorder,refphonenumbertypeid,firstseendatetime]
        self.writetofile('PersonPhoneNumber', self.M365_personphonenumber)

    def genPersonEmailAddress(self):
        for index, student in self.students.iterrows():
            id = self.faker.uuid4()
            personid = student['StudentID']
            emailaddress = student['Email']
            priorityorder = '' 
            refemailaddresstypeid = 'DB43E1D8-DE3D-4142-A54C-C5B27E43D59F' 
            firstseendatetime = self.startdate
            self.M365_personemailaddress.loc[len(self.M365_personemailaddress.index)] = [id,personid,emailaddress,priorityorder,refemailaddresstypeid,firstseendatetime]
        self.writetofile('PersonEmailAddress', self.M365_personemailaddress)

    def genPersonRelationship(self):
        # placeholder: no relationship data is generated, so a single row of empty values is written
        id = '' 
        personid = '' 
        relatedpersonid = '' 
        refpersonrelationshipid = '' 
        firstseendatetime = '' 
        self.M365_personrelationship.loc[len(self.M365_personrelationship.index)] = [id,personid,relatedpersonid,refpersonrelationshipid,firstseendatetime]
        self.writetofile('PersonRelationship', self.M365_personrelationship)
    
    def genPersonIdentifier(self):
        # placeholder: no identifier data is generated, so a single row of empty values is written
        id = '' 
        personid = '' 
        sourcesystemid = '' 
        refidentifiertypeid = '' 
        identifier = '' 
        firstseendatetime = '' 
        ispresentinsource = '' 
        self.M365_personidentifier.loc[len(self.M365_personidentifier.index)] = [id,personid,sourcesystemid,refidentifiertypeid,identifier,firstseendatetime,ispresentinsource]
        self.writetofile('PersonIdentifier', self.M365_personidentifier)

    def genPerson(self):
        for index, student in self.students.iterrows():
            id = student['StudentID']
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            surname = student['LastName']
            givenname = student['FirstName']
            middlename = student['MiddleName']
            preferredsurname = surname
            preferredgivenname = givenname
            preferredmiddlename = middlename
            self.M365_person.loc[len(self.M365_person.index)] = [id,firstseendatetime,lastseendatetime,surname,givenname,middlename,preferredsurname,preferredgivenname,preferredmiddlename]
        self.writetofile('Person', self.M365_person)
        
    def genEnrollment(self):
        for index, enroll in self.enrollment.iterrows():
            id = self.faker.uuid4() 
            sourcesystemid = self.sourcesystemid
            externalid = '' 
            firstseendatetime = self.startdate
            lastseendatetime = self.enddate
            personid = enroll['StudentID']
            sectionid = enroll['SectionID']
            refsectionroleid = '3DA186F2-D4CA-43C2-9EBB-10B0B89EDB87' 
            isactiveinsession = True
            isprimarystaffforsection = False
            entrydate = self.startdate 
            exitdate = self.enddate
            self.M365_enrollment.loc[len(self.M365_enrollment.index)] = [id,sourcesystemid,externalid,firstseendatetime,lastseendatetime,personid,sectionid,refsectionroleid,isactiveinsession,isprimarystaffforsection,entrydate,exitdate]
        self.writetofile('Enrollment',self.M365_enrollment)

    def writetofile(self,filename,dfout):
        # converts the pandas df to a Spark df, drops rows that are entirely null,
        # and writes the table to stage1 as a headerless CSV
        genfilepath = f'stage1/Transactional/test_data/v0.1/M365_gen/{filename}/snapshot_batch_data/rundate={self.currentDateTime}'
        dfOutfile = spark.createDataFrame(dfout)
        dfOutfile = dfOutfile.na.drop('all')
        dfOutfile.coalesce(1).write.save(oea.to_url(genfilepath), format='csv', mode='overwrite', header='false', mergeSchema='true')