# Hybrid Student Engagement Notebook

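This notebook creates four tables (student, dayactivity, yearactivity, and calendar) in a new Spark database called s3_hybrid (stage 3 hybrid). As configured below, the tables are written to the test_s3_hybrid folder in stage 3 and registered in a Spark database named test_s3_hybrid.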


### Provision storage accounts

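Set the `storage_account` variable to the name of the storage account associated with your Azure resource group. When `use_test_env` is `True`, all three stages resolve to folders inside the `test-env` container; otherwise each stage has its own container.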
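
The path logic in the next cell can also be written as a small function, which makes the two layouts easier to compare. This is a minimal sketch and not part of the original notebook: `stage_paths` is a hypothetical helper name, and it assumes the container layout used in the cell below.

```python
def stage_paths(storage_account, use_test_env=True):
    """Return the abfss:// root for stage1, stage2 and stage3 as a dict.

    Hypothetical helper; mirrors the if/else in the notebook cell.
    """
    host = f'{storage_account}.dfs.core.windows.net'
    stages = ('stage1', 'stage2', 'stage3')
    if use_test_env:
        # test layout: one 'test-env' container, one folder per stage
        return {s: f'abfss://test-env@{host}/{s}' for s in stages}
    # regular layout: one container per stage
    return {s: f'abfss://{s}@{host}' for s in stages}


paths = stage_paths('stoeahybrid', use_test_env=True)
print(paths['stage3'])
```

Returning a dict keeps the three roots together, so a cell can read `paths['stage3']` instead of relying on three separate variables.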

In [1]:
from pyspark.sql.functions import *
from pyspark.sql.window import Window

# data lake and container information
storage_account = 'stoeahybrid'
use_test_env = True

if use_test_env:
    stage1 = 'abfss://test-env@' + storage_account + '.dfs.core.windows.net/stage1'
    stage2 = 'abfss://test-env@' + storage_account + '.dfs.core.windows.net/stage2'
    stage3 = 'abfss://test-env@' + storage_account + '.dfs.core.windows.net/stage3'
else:
    stage1 = 'abfss://stage1@' + storage_account + '.dfs.core.windows.net'
    stage2 = 'abfss://stage2@' + storage_account + '.dfs.core.windows.net'
    stage3 = 'abfss://stage3@' + storage_account + '.dfs.core.windows.net'




### Load Raw Data from Lake
To ensure that the right tables are loaded, confirm that the file paths match your data lake storage containers.

In [2]:
# load needed tables from parquet data lake storage
dfStudAttendanceRaw = spark.read.format('parquet').load(f'{stage3}/contoso_sis/studentattendance')
dfActivityV2Raw = spark.read.format('parquet').load(f'{stage3}/m365/Activity0p2')
dfPersonRaw = spark.read.format('parquet').load(f'{stage3}/m365/Person')
dfStudOrgRaw = spark.read.format('parquet').load(f'{stage3}/m365/StudentOrgAffiliation')
dfOrgRaw = spark.read.format('parquet').load(f'{stage3}/m365/Org')
dfSectionRaw = spark.read.format('parquet').load(f'{stage3}/m365/Section')
dfCourseRaw = spark.read.format('parquet').load(f'{stage3}/m365/Course')
dfRefRaw = spark.read.format('parquet').load(f'{stage3}/m365/RefDefinition')




## 1. Student table
Contains student information at the school level.

**Databases and tables used:**

1. Spark DB: s3_m365 (stage 3 m365 feed)
- Table: person (student PersonId and ExternalId relationship)
- Table: org
- Table: studentorgaffiliation
- Table: section
- Table: course
- Table: refdefinition

2. Spark DB: stage 3 SIS data
- Table: studentattendance (student attendance by date, school, and course section)

**Databases and tables created:**

1. Spark DB: s3_hybrid (stage 3 hybrid)
- Table: student

### Clean and Subset Data

In [3]:
# take only active students
dfPerson = dfPersonRaw.filter(dfPersonRaw.IsActive == 'True')
dfStudOrg = dfStudOrgRaw.filter(dfStudOrgRaw.IsActive == 'True')

# take needed columns and rename to align with other data sources
dfPerson = dfPerson.select('Id','ExternalId')
dfPerson = dfPerson.withColumnRenamed('Id', 'PersonId')

dfStudOrg = dfStudOrg.select('PersonId', 'IsPrimary', 'IsActive', 'OrgId', 'RefGradeLevelId')

dfOrg = dfOrgRaw.select('Identifier', 'name', 'Id')
dfOrg = dfOrg.withColumnRenamed('Identifier', 'School_ID')
dfOrg = dfOrg.withColumnRenamed('name', 'School_Name')
dfOrg = dfOrg.withColumnRenamed('Id', 'OrgId')

dfSection = dfSectionRaw.select('ExternalId', 'Name', 'CourseId', 'Id', 'Code', 'SessionId', 'OrgId')
dfSection = dfSection.withColumnRenamed('ExternalId', 'section_id')
dfSection = dfSection.withColumnRenamed('name', 'section_name')
dfSection = dfSection.drop('School_Name')
dfSection = dfSection.join(dfOrg, 'OrgId')

dfCourse = dfCourseRaw.withColumnRenamed('Id', 'CourseId')
dfCourse = dfCourse.withColumnRenamed('Name', 'Course')
dfCourse = dfCourse.drop('ExternalId')

dfRef = dfRefRaw.select('Id', 'Code', 'Description')
dfRef = dfRef.withColumnRenamed('Id', 'RefId')






In [4]:
# combine student information and school details with attendance data
dfStudAttendance = dfStudAttendanceRaw.select('student_id', 'school_id', 'attendance_date', 'Period', 'section_id', 
                        'PresenceFlag', 'attendance_status')
                  
dfStudAttendance = dfStudAttendance.withColumnRenamed('school_id', 'School_ID')
dfStudAttendance = dfStudAttendance.withColumnRenamed('student_id','ExternalId')
dfStudAttendance = dfStudAttendance.withColumn("Date", to_date(col("attendance_date"), 'yyyy-MM-dd'))
dfStudAttendance = dfStudAttendance.drop('attendance_date')

dfStudAttendance = dfStudAttendance.join(dfOrg, 'School_ID')
dfStudAttendance.show(1,vertical=True)


-RECORD 0---------------------------------
 School_ID         | sch1                 
 ExternalId        | 68f1c007e6178d345... 
 Period            | 1                    
 section_id        | sec1                 
 PresenceFlag      | 1                    
 attendance_status | Present              
 Date              | 2021-02-10           
 School_Name       | Gallagher High       
 OrgId             | edp_sch1             
only showing top 1 row

### Find Primary School


In [5]:
# find the school where each student has the highest attendance count
# (if a student is tied between schools, every tied school is kept)
df = (dfStudAttendance.groupBy("ExternalId", 'School_ID', 'School_Name')
    .agg(sum("PresenceFlag").alias("Present_Count")))


w = Window.partitionBy('ExternalId')
dfStudSchoolPrimary = df.withColumn('maxPres', max('Present_Count').over(w))\
    .where(col('Present_Count') == col('maxPres'))\
    .drop('maxPres').drop('Present_Count')

dfStudSchoolPrimary.show(3, vertical=True)


-RECORD 0---------------------------
 ExternalId  | 3b7e980ef9b5d89af... 
 School_ID   | sch5                 
 School_Name | Robinson High        
-RECORD 1---------------------------
 ExternalId  | 418cf724bc5222168... 
 School_ID   | sch5                 
 School_Name | Robinson High        
-RECORD 2---------------------------
 ExternalId  | 8591cfe8502d9b9f6... 
 School_ID   | sch4                 
 School_Name | Mitchell High        
only showing top 3 rows
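
Because the filter keeps every school whose count equals the maximum, a student with the same presence count at two schools gets two rows here, and any later join on `ExternalId` would then duplicate that student. The sketch below shows one deterministic tie-break in plain Python. The sample counts are made up for illustration, and breaking ties by the lowest `School_ID` is an assumption, not a rule from the notebook. In PySpark the same idea is usually expressed with `row_number()` over a window ordered by the count (descending) and then by `School_ID`.

```python
def pick_primary_school(present_counts):
    """Pick one primary school per student.

    present_counts: iterable of (external_id, school_id, present_count).
    Highest count wins; ties go to the lowest school_id (assumed rule).
    """
    best = {}
    for external_id, school_id, count in present_counts:
        candidate = (-count, school_id)  # smaller tuple = better candidate
        if external_id not in best or candidate < best[external_id]:
            best[external_id] = candidate
    return {ext: school for ext, (_, school) in best.items()}


# made-up sample: student 'a' is tied between sch4 and sch5
sample = [('a', 'sch5', 10), ('a', 'sch4', 10), ('b', 'sch1', 3), ('b', 'sch2', 7)]
primary = pick_primary_school(sample)
print(primary)
```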

In [6]:
# rename columns to indicate primary school
dfStudSchoolPrimary = dfStudSchoolPrimary.withColumnRenamed('School_Name', 'SchoolNamePrimary')
dfStudSchoolPrimary = dfStudSchoolPrimary.withColumnRenamed('School_ID', 'SchoolIdPrimary')




### Combine tables

In [7]:
# join person table to student and ref tables to create student profile 
dfStudent = dfPerson.join(dfStudOrg, 'PersonId')
dfStudent = dfStudent.withColumnRenamed('RefGradeLevelId', 'RefId')
dfStudent = dfStudent.join(dfRef, 'RefId')
dfStudentFinal = dfStudent.join(dfStudSchoolPrimary, 'ExternalId')
dfStudentFinal = dfStudentFinal.withColumnRenamed('Code', 'GradeLevel')
dfStudentFinal = dfStudentFinal.withColumnRenamed('Description', 'GradeName')
dfStudentFinal = dfStudentFinal.select('ExternalId',  'PersonId', 'IsActive', 'SchoolNamePrimary', 'SchoolIdPrimary', 'OrgId', 'GradeLevel', 'GradeName')
dfStudentFinal = dfStudentFinal.drop('OrgId', 'GradeLevel')
dfStudentFinal.show(1, vertical=True)
print(dfStudentFinal.count())


-RECORD 0---------------------------------
 ExternalId        | 68f1c007e6178d345... 
 PersonId          | f2447993213182b11... 
 IsActive          | true                 
 SchoolNamePrimary | Gallagher High       
 SchoolIdPrimary   | sch1                 
 GradeName         | Eleventh grade       
only showing top 1 row

100

### Write Data Back to Lake

In [8]:
# write back to the lake
dfStudentFinal.write.format('parquet').mode('overwrite').save(stage3 + '/test_s3_hybrid/Student')




### Load to Spark DB

In [9]:
# Create a Spark database so the data in the data lake can be queried via SQL on-demand.
# This only creates metadata that points to the parquet files in the data lake.
# It also makes it possible to connect from Power BI via the Azure SQL data source connector.
def create_spark_db(db_name, source_path):
    spark.sql(f'CREATE DATABASE IF NOT EXISTS {db_name}')
    spark.sql(f"DROP TABLE IF EXISTS {db_name}.Student")
    spark.sql(f"create table if not exists {db_name}.Student using PARQUET location '{source_path}/Student'")
    
create_spark_db('test_s3_hybrid', stage3 + '/test_s3_hybrid')
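
The helper above hardcodes the table name, so the notebook redefines `create_spark_db` in later cells for each table it registers. One way to avoid that is to pass the table names in. The sketch below only builds the SQL strings, so it can be checked without a Spark session. `table_ddl` is a hypothetical name, `<stage3>` is a placeholder for the real stage 3 root, and in the notebook each statement would be passed to `spark.sql`.

```python
def table_ddl(db_name, source_path, tables):
    """Return the SQL statements that register parquet folders as tables.

    Hypothetical helper: same statements as create_spark_db, but the table
    names are a parameter instead of being hardcoded.
    """
    statements = [f'CREATE DATABASE IF NOT EXISTS {db_name}']
    for table in tables:
        statements.append(f'DROP TABLE IF EXISTS {db_name}.{table}')
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {db_name}.{table} "
            f"USING PARQUET LOCATION '{source_path}/{table}'"
        )
    return statements


for sql in table_ddl('test_s3_hybrid', '<stage3>/test_s3_hybrid', ['Student', 'Calendar']):
    print(sql)  # in the notebook: spark.sql(sql)
```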




## 2. Calendar table
Contains a basic calendar table to support data analysis in a Power BI dashboard.

**Databases and tables used:**
- None

**Databases and tables created:**

1. Spark DB: s3_hybrid (stage 3 hybrid)
- Table: calendar

In [10]:
# date range
start = "2020-01-01"
stop = "2021-12-30"

# create calendar dataframe
temp_df = spark.createDataFrame([(start, stop)], ("start", "stop"))
temp_df = temp_df.select([col(c).cast("timestamp") for c in ("start", "stop")])
temp_df = temp_df.withColumn("stop",date_add("stop",1).cast("timestamp"))
temp_df = temp_df.select([col(c).cast("long") for c in ("start", "stop")])
start, stop = temp_df.first()
interval=60*60*24

df = spark.range(start,stop,interval).select(col("id").cast("timestamp").alias("DateTime"))
df = df.withColumn("Date", to_date(col("DateTime")))

df = df.drop("DateTime")
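# pattern letters follow Java SimpleDateFormat: 'yyyy' is the calendar year,
# while 'YYYY' is the week-based year and can differ around New Year
df = df.withColumn('Year', date_format('Date', 'yyyy'))
df = df.withColumn('Month', date_format('Date', 'MMMM'))
df = df.withColumn('MonthNum', date_format('Date', 'M'))
df = df.withColumn('Week', date_format('Date', 'W'))  # 'W' is the week of the month
df = df.withColumn('Day', date_format('Date', 'D'))   # 'D' is the day of the year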
df.show(2)


+----------+----+-------+--------+----+---+
|      Date|Year|  Month|MonthNum|Week|Day|
+----------+----+-------+--------+----+---+
|2020-01-01|2020|January|       1|   1|  1|
|2020-01-02|2020|January|       1|   1|  2|
+----------+----+-------+--------+----+---+
only showing top 2 rows
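
A plain-Python cross-check can confirm how many rows the Spark calendar should contain, and it illustrates why the pattern letter used for the year matters. This is a sketch using only the standard library. It uses the ISO week-based year as a stand-in for a week-based year, which may not match the week rules of the Spark session's locale.

```python
from datetime import date, timedelta


def calendar_dates(start, stop):
    """Return every date from start to stop, both inclusive."""
    days = (stop - start).days + 1
    return [start + timedelta(days=i) for i in range(days)]


dates = calendar_dates(date(2020, 1, 1), date(2021, 12, 30))
print(len(dates))  # 366 days in 2020 plus 364 days in 2021 up to Dec 30

# calendar year vs. ISO week-based year for the same day
new_year = date(2021, 1, 1)
print(new_year.year, new_year.isocalendar()[0])
```

The last line prints two different years for 1 January 2021, because that Friday still belongs to the final ISO week of 2020.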

### Write Data Back to Lake

In [11]:
# write back to the lake in the stage 3 test_s3_hybrid directory
df.write.format('parquet').mode('overwrite').save(stage3 + '/test_s3_hybrid/Calendar')




### Load to Spark DB

In [12]:
# Create a Spark database so the data in the data lake can be queried via SQL on-demand.
# This only creates metadata that points to the parquet files in the data lake.
# It also makes it possible to connect from Power BI via the Azure SQL data source connector.
def create_spark_db(db_name, source_path):
    spark.sql(f'CREATE DATABASE IF NOT EXISTS {db_name}')
    spark.sql(f"DROP TABLE IF EXISTS {db_name}.Calendar")
    spark.sql(f"create table if not exists {db_name}.Calendar using PARQUET location '{source_path}/Calendar'")

create_spark_db('test_s3_hybrid', stage3 + '/test_s3_hybrid')




## 3. Dayactivity table
Contains student daily digital and in-person activity 

**Databases and tables used:** 

1. Spark DB: s3_m365 (stage 3 m365 feed)
- Table: Activity0p2 (user m365 app activity v2)

2. Spark DB: stage 3 SIS data
- Table: studentattendance (student attendance by date, school, and course section)

### Clean and Subset data

In [13]:
# combine student attendance with section and course data
dfStudAttendanceFinal = dfStudAttendance.join(dfStudentFinal, 'ExternalId')
dfStudAttendanceFinal = dfStudAttendanceFinal.join(dfSection, ['section_id', 'School_ID'])
dfStudAttendanceFinal = dfStudAttendanceFinal.join(dfCourse, 'CourseId')
dfStudAttendanceFinal = dfStudAttendanceFinal.select('PersonId', 'ExternalId', 'Date', 'attendance_status', 'PresenceFlag'
                                                    ,'Period', 'SchoolNamePrimary', 'SessionId', 'section_id', 'CourseId', 'Course')
dfStudAttendanceFinal = dfStudAttendanceFinal.withColumnRenamed('section_id', 'SectionId')
dfStudAttendanceFinal.show(1, vertical=True)
print(dfStudAttendanceFinal.count())


-RECORD 0---------------------------------
 PersonId          | f2447993213182b11... 
 ExternalId        | 68f1c007e6178d345... 
 Date              | 2021-02-21           
 attendance_status | Present              
 PresenceFlag      | 1                    
 Period            | 1                    
 SchoolNamePrimary | Gallagher High       
 SessionId         | edp_term2            
 SectionId         | sec15                
 CourseId          | edp_course5          
 Course            | Science Biology      
only showing top 1 row

1050

In [14]:
# isolate in-person school days
schoolDays = dfStudAttendanceFinal.select('Date').distinct()
print(schoolDays.count())


56

In [15]:
# take only needed columns and filter for only students
dfActivityV2 = dfActivityV2Raw.where(col('ActorRole') == 'Student').select('PersonId','AppName', 'SignalType'
                                    ,'StartTime', 'MeetingDuration')

# active students only, include external id
dfActivityV2 = dfActivityV2.join(dfStudentFinal, 'PersonId')
dfActivityV2.show(1, vertical=True)
print(dfActivityV2.count())


-RECORD 0---------------------------------
 PersonId          | f2447993213182b11... 
 AppName           | Excel                
 SignalType        | FileModified         
 StartTime         | 2021-03-05 15:46:46  
 MeetingDuration   | 00:11:04             
 ExternalId        | 68f1c007e6178d345... 
 IsActive          | true                 
 SchoolNamePrimary | Gallagher High       
 SchoolIdPrimary   | sch1                 
 GradeName         | Eleventh grade       
only showing top 1 row

23180

### Aggregate Activity by Date


In [16]:
# parse MeetingDuration (HH:mm:ss) and convert it to whole minutes (seconds are dropped)
dfActivityV2 = dfActivityV2.withColumn('MeetingDuration', to_timestamp('MeetingDuration'))
dfActivityV2 = dfActivityV2.withColumn('HourToMinutes', hour(col('MeetingDuration'))*60)
dfActivityV2 = dfActivityV2.withColumn('Minutes', minute(col('MeetingDuration')))
dfActivityV2 = dfActivityV2.withColumn('Duration', col('HourToMinutes') + col('Minutes'))

# aggregation of activity
dfActivityV2agg = (dfActivityV2.groupBy("PersonId", "ExternalId",
                    "AppName", "SignalType",
                    to_date("StartTime").alias("Date"))
    .agg(sum("Duration").alias("DurationSum")))

dfActivityV2agg.show(1,vertical=True)


-RECORD 0---------------------------
 PersonId    | 201f952a2007e9aa4... 
 ExternalId  | ee4f65129e61c4d97... 
 AppName     | PDF viewers          
 SignalType  | AddedToSharedWithMe  
 Date        | 2021-02-19           
 DurationSum | 20                   
only showing top 1 row
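
The conversion above adds the hours times 60 to the minutes and ignores the seconds, so `00:11:04` counts as 11 minutes. The same arithmetic in plain Python makes that easy to check and also shows a variant that keeps the seconds. This is a sketch that assumes `MeetingDuration` is always an `HH:MM:SS` string, as in the sample row shown earlier; both function names are hypothetical.

```python
def duration_whole_minutes(hhmmss):
    """Mirror the notebook: hours * 60 + minutes, seconds dropped."""
    hours, minutes, _seconds = (int(part) for part in hhmmss.split(':'))
    return hours * 60 + minutes


def duration_seconds(hhmmss):
    """Variant that keeps the seconds (assumed alternative, not in the notebook)."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(':'))
    return hours * 3600 + minutes * 60 + seconds


print(duration_whole_minutes('00:11:04'))  # 11
print(duration_seconds('00:11:04'))        # 664
```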

In [17]:
# average the presence flags per student, date, and section, then mark the student present if the mean is above 0
dfStudAttendanceAgg = (dfStudAttendanceFinal.groupBy('PersonId', 'ExternalId', 'Date', 'SectionId', 'Course')
                    .agg(mean("PresenceFlag").alias("Present_Mean")))

dfStudAttendanceAgg = dfStudAttendanceAgg.withColumn('Present', when(col('Present_Mean') > 0, 1).otherwise(0))

print(dfStudAttendanceAgg.count())
dfStudAttendanceAgg.show(1,vertical=True)


1050
-RECORD 0----------------------------
 PersonId     | 70395011b16faae0c... 
 ExternalId   | 62673a604d240d8ca... 
 Date         | 2021-02-26           
 SectionId    | sec5                 
 Course       | Art                  
 Present_Mean | 1.0                  
 Present      | 1                    
only showing top 1 row

### Daily activity: Merge into a single wide table


In [18]:
# focus on teams meetings, assignments, and communications

df1 = dfActivityV2agg.where( ( col("AppName") == "Teams" ) &
                            ( col("SignalType") == "CallRecordSummarized" ))

dfDayAct = df1.select("PersonId", "ExternalId", "Date", "DurationSum")
dfDayAct = dfDayAct.withColumnRenamed("DurationSum", "TeamsMeetingsDuration")

df2 = dfActivityV2agg.withColumn('Assignments', when(( col("AppName") == "Assignments" ) & 
                            ( col("SignalType") == "SubmissionEvent" ) , 1).otherwise(0))
df2 = df2.select("PersonId", "ExternalId", "Date", "Assignments")

dfDayAct = dfDayAct.join(df2, ["PersonId", "ExternalId", "Date"], 'outer')

df3 = dfActivityV2agg.withColumn('TeamsCommunications', when(( col("AppName") == "Teams" ) & 
                        ( col("SignalType").isin(['AddedToSharedWithMe', 'CommentCreated',
                            'CommentDeleted', 'ExpandChannelMessage', 'PostChannelMessage',
                            'ReactedWithEmoji', 'ReplyChannelMessage', 'Unlike',
                            'UserAtMentioned', 'VisitTeamChannel'])), 1).otherwise(0))
df3 = df3.select("PersonId", "ExternalId", "Date", "TeamsCommunications")

dfDayAct = dfDayAct.join(df3, ["PersonId", "ExternalId", "Date"], 'outer')


dfDayAct.show(5, vertical=True)

dfDayAct = dfDayAct.groupBy("PersonId", "ExternalId", "Date").agg(sum('TeamsMeetingsDuration').alias('TeamsMeetingsDuration'), mean('Assignments').alias('Assignments'), mean('TeamsCommunications').alias('TeamsCommunications'))
 
dfDayAct.show(5, vertical=True)
print(dfDayAct.count())


-RECORD 0-------------------------------------
 PersonId              | 0b039b8211f7c0bcf... 
 ExternalId            | 87856b6419d0ec873... 
 Date                  | 2021-03-05           
 TeamsMeetingsDuration | null                 
 Assignments           | 0                    
 TeamsCommunications   | 0                    
-RECORD 1-------------------------------------
 PersonId              | 0b039b8211f7c0bcf... 
 ExternalId            | 87856b6419d0ec873... 
 Date                  | 2021-03-05           
 TeamsMeetingsDuration | null                 
 Assignments           | 0                    
 TeamsCommunications   | 0                    
-RECORD 2-------------------------------------
 PersonId              | 0b039b8211f7c0bcf... 
 ExternalId            | 87856b6419d0ec873... 
 Date                  | 2021-03-05           
 TeamsMeetingsDuration | null                 
 Assignments           | 0                    
 TeamsCommunications   | 0                    
-RECORD 3----
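
The repeated identical records in the first output come from the joins: `df2` and `df3` each hold one row per app and signal type for a student and date, so joining them on `PersonId`, `ExternalId`, and `Date` pairs every row with every other row for that key. The means of the 0/1 flags are unaffected by this duplication, but a meeting duration that is repeated on every paired row is counted once per pair when it is summed. The sketch below reproduces the effect in plain Python with made-up numbers. Aggregating each input to one row per key before joining is one way to avoid it.

```python
from itertools import product

# made-up rows for one student and date
meetings = [30]                  # one Teams meeting total, in minutes
assignment_flags = [0, 1, 0]     # one flag per (AppName, SignalType) group
communication_flags = [1, 0, 0]  # one flag per (AppName, SignalType) group

# joining on the same key pairs every row with every other row
joined = list(product(meetings, assignment_flags, communication_flags))

duration_sum = sum(m for m, _, _ in joined)
assignment_mean = sum(a for _, a, _ in joined) / len(joined)

print(len(joined))      # 9 rows for a single student and date
print(duration_sum)     # 270, not 30
print(assignment_mean)  # same value as the mean of assignment_flags
```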

In [19]:
# add in-person attendance
dfDayAct = dfDayAct.join(dfStudAttendanceAgg, ["PersonId", "ExternalId", "Date"], 'outer')

dfDayAct.show(2,vertical=True)
dfDayAct.count()


-RECORD 0-------------------------------------
 PersonId              | 0b039b8211f7c0bcf... 
 ExternalId            | 87856b6419d0ec873... 
 Date                  | 2021-03-05           
 TeamsMeetingsDuration | null                 
 Assignments           | 0.0                  
 TeamsCommunications   | 0.15384615384615385  
 SectionId             | null                 
 Course                | null                 
 Present_Mean          | null                 
 Present               | null                 
-RECORD 1-------------------------------------
 PersonId              | 0f53869a3bedf6b2c... 
 ExternalId            | 2b946c8bd9fdcb8b2... 
 Date                  | 2021-04-01           
 TeamsMeetingsDuration | null                 
 Assignments           | 0.0                  
 TeamsCommunications   | 0.0                  
 SectionId             | null                 
 Course                | null                 
 Present_Mean          | null                 
 Present     

5812

In [20]:
# subset dates to only school dates
dfDayAct = dfDayAct.join(schoolDays, ["Date"], "inner")

print(dfDayAct.count())

schoolDaysCheck = dfDayAct.select('Date').distinct()

print(schoolDaysCheck.count())


4186
55

In [21]:
print(dfDayAct.dtypes)


[('Date', 'date'), ('PersonId', 'string'), ('ExternalId', 'string'), ('TeamsMeetingsDuration', 'bigint'), ('Assignments', 'double'), ('TeamsCommunications', 'double'), ('SectionId', 'string'), ('Course', 'string'), ('Present_Mean', 'double'), ('Present', 'int')]

In [22]:
# fill missing numeric values with 0 (string columns such as SectionId and Course stay null)
dfDayAct = dfDayAct.na.fill(0)

print(dfDayAct.count())
dfDayAct.show(1, vertical=True)


4186
-RECORD 0-------------------------------------
 Date                  | 2021-04-29           
 PersonId              | e2a7969d1392cd0bf... 
 ExternalId            | 9baf0a604c20c7f34... 
 TeamsMeetingsDuration | 0                    
 Assignments           | 0.0                  
 TeamsCommunications   | 0.0                  
 SectionId             | sec88                
 Course                | Math - Algebra       
 Present_Mean          | 1.0                  
 Present               | 1                    
only showing top 1 row

In [23]:
# add activity indicator columns
dfDayAct = dfDayAct.withColumn('ActiveTeamsMeetings', when(col('TeamsMeetingsDuration') > 0, 1).otherwise(0))
dfDayAct = dfDayAct.withColumn('ActiveTeamsCommunications', when(col('TeamsCommunications') > 0, 1).otherwise(0))
dfDayAct = dfDayAct.withColumn('ActiveAssignments', when(col('Assignments') > 0, 1).otherwise(0))
# a student is digitally active on a day if any of the three indicators is set
dfDayAct = dfDayAct.withColumn('DigitallyActive', when(
    (col('ActiveTeamsMeetings') + col('ActiveTeamsCommunications') + col('ActiveAssignments')) > 0, 1).otherwise(0))

dfDayAct.show(1,vertical=True)


-RECORD 0-----------------------------------------
 Date                      | 2021-04-29           
 PersonId                  | 7c654ccc81f1ba474... 
 ExternalId                | aa5bd029c7fd4ec6d... 
 TeamsMeetingsDuration     | 0                    
 Assignments               | 0.0                  
 TeamsCommunications       | 0.0                  
 SectionId                 | sec75                
 Course                    | English Language     
 Present_Mean              | 1.0                  
 Present                   | 1                    
 ActiveTeamsMeetings       | 0                    
 ActiveTeamsCommunications | 0                    
 ActiveAssignments         | 0                    
 DigitallyActive           | 0                    
only showing top 1 row

## 4. Yearactivity table
Contains student digital and in-person activity aggregated per student from the dayactivity table.

**Databases and tables used:** 

1. Spark DB: s3_m365 (stage 3 m365 feed)
- Table: Activity0p2 (user m365 app activity v2)

2. Spark DB: stage 3 SIS data
- Table: studentattendance (student attendance by date, school, and course section)

### Yearly Aggregates


In [24]:
dfYearAct = dfDayAct.groupBy("PersonId", "ExternalId")\
    .agg(sum("ActiveTeamsMeetings").alias("DaysActiveTeamsMeetings")\
    ,sum("ActiveTeamsCommunications").alias("DaysActiveTeamsCommunications")\
    ,sum("ActiveAssignments").alias("DaysActiveAssignments")\
    ,sum("DigitallyActive").alias("DaysDigitallyActive")\
    ,sum("Present").alias("DaysPresent"))


dfYearAct = dfYearAct.withColumn('Present_Perc', 
            col("DaysPresent")/schoolDays.count())

dfYearAct = dfYearAct.withColumn('Atten_Threshold_Met', 
            when(col('Present_Perc') > 0.9, 1).otherwise(0))

print(dfYearAct.count())
dfYearAct.show(5,vertical=True)



100
-RECORD 0---------------------------------------------
 PersonId                      | b52042e19537f0557... 
 ExternalId                    | 8228c26c3604bac68... 
 DaysActiveTeamsMeetings       | 0                    
 DaysActiveTeamsCommunications | 4                    
 DaysActiveAssignments         | 0                    
 DaysDigitallyActive           | 4                    
 DaysPresent                   | 12                   
 Present_Perc                  | 0.21428571428571427  
 Atten_Threshold_Met           | 0                    
-RECORD 1---------------------------------------------
 PersonId                      | 0f3f3f677ce0424a2... 
 ExternalId                    | b2437b12e9221fae3... 
 DaysActiveTeamsMeetings       | 0                    
 DaysActiveTeamsCommunications | 0                    
 DaysActiveAssignments         | 0                    
 DaysDigitallyActive           | 0                    
 DaysPresent                   | 12                   
 Prese
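
`Present_Perc` is the summed `Present` flag divided by the number of distinct school days (56 in the run above), and `Atten_Threshold_Met` is set only when that share is strictly greater than 0.9. The first record can be checked by hand in plain Python. This is a sketch, and `attendance_summary` is a hypothetical name.

```python
def attendance_summary(days_present, school_days, threshold=0.9):
    """Return (share of school days present, 1 if share > threshold else 0)."""
    present_perc = days_present / school_days
    return present_perc, int(present_perc > threshold)


perc, met = attendance_summary(12, 56)
print(perc, met)
```

With 56 school days, a student needs at least 51 present days to pass the threshold, since 50/56 is about 0.893.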

### Write Back to the Lake

In [25]:
# write back to the lake
dfDayAct.write.format('parquet').mode('overwrite').save(stage3 + '/test_s3_hybrid/dayActivity')
dfYearAct.write.format('parquet').mode('overwrite').save(stage3 + '/test_s3_hybrid/yearActivity')




### Load to Spark DB

In [26]:
# Create a Spark database so the data in the data lake can be queried via SQL on-demand.
# This only creates metadata that points to the parquet files in the data lake.
# It also makes it possible to connect from Power BI via the Azure SQL data source connector.
def create_spark_db(db_name, source_path):
    spark.sql(f'CREATE DATABASE IF NOT EXISTS {db_name}')
    spark.sql(f"DROP TABLE IF EXISTS {db_name}.dayActivity")
    spark.sql(f"DROP TABLE IF EXISTS {db_name}.yearActivity")
    spark.sql(f"create table if not exists {db_name}.dayActivity using PARQUET location '{source_path}/dayActivity'")
    spark.sql(f"create table if not exists {db_name}.yearActivity using PARQUET location '{source_path}/yearActivity'")
    
create_spark_db('test_s3_hybrid', stage3 + '/test_s3_hybrid')


