# Graph API Module Example Notebook

This notebook creates three tables (users, m365_app_user_detail and teams_activity_user_details) in two new Spark databases, s2np_graphapi and s2p_graphapi.

The s2p_graphapi database is used by the provided Graph Reports API Power BI dashboard.


### Provision storage accounts

Set the storage_account variable in the cell below to the name of the storage account associated with your Azure resource group.

In [1]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType
from pyspark.sql.functions import *
from pyspark.sql.window import Window


# data lake and container information
storage_account = 'stoeahybriddev2'
use_test_env = False

if use_test_env:
    stage1np = 'abfss://test-env@' + storage_account + '.dfs.core.windows.net/stage1np'
    stage2np = 'abfss://test-env@' + storage_account + '.dfs.core.windows.net/stage2np'
    stage2p = 'abfss://test-env@' + storage_account + '.dfs.core.windows.net/stage2p'
else:
    stage1np = 'abfss://stage1np@' + storage_account + '.dfs.core.windows.net'
    stage2np = 'abfss://stage2np@' + storage_account + '.dfs.core.windows.net'
    stage2p = 'abfss://stage2p@' + storage_account + '.dfs.core.windows.net'

StatementMeta(spark3p1sm, 99, 1, Finished, Available)
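
The path assignments above all follow one pattern, so they could be produced by a single helper. The sketch below is an illustration only: stage_path is a hypothetical name that is not part of the notebook or of OEA, and 'examplestorage' is a placeholder account name. It assumes the same layout as the cell above: one shared test-env container in test mode, and one container per stage otherwise.

```python
def stage_path(stage, storage_account, use_test_env=False):
    """Return the abfss:// URL for a stage (hypothetical helper)."""
    host = storage_account + '.dfs.core.windows.net'
    if use_test_env:
        # test mode: every stage is a folder inside the 'test-env' container
        return 'abfss://test-env@' + host + '/' + stage
    # normal mode: every stage has its own container
    return 'abfss://' + stage + '@' + host


# 'examplestorage' is a placeholder, not a real storage account
stage1np, stage2np, stage2p = (
    stage_path(s, 'examplestorage') for s in ('stage1np', 'stage2np', 'stage2p')
)
print(stage2p)
```

Keeping the pattern in one place means that a change to the container layout only has to be made once.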

### Load Raw Data from Lake
To ensure that the right tables are loaded, confirm that the file paths match your data lake storage containers.

The first code block defines the schema of the stage 1 users JSON file(s).

In [2]:
# schemas for each of the JSON files for transformation into tables
user_schema = StructType(fields=[
    StructField('value', ArrayType(
        StructType([
            StructField('surname', StringType(), False),
            StructField('givenName', StringType(), False),
            StructField('userPrincipalName', StringType(), False),
            StructField('id', StringType(), False)
        ])
    ))
])

StatementMeta(spark3p1sm, 99, 2, Finished, Available)

In [3]:
# load needed tables from JSON data lake storage
dfUsersRaw = spark.read.format('json').load(f'{stage1np}/GraphAPI/Users/*.json', schema=user_schema)
dfM365UserActivityRaw = spark.read.format('json').load(f'{stage1np}/GraphAPI/M365_App_User_Detail/*.json')
dfTeamsUserActivityRaw = spark.read.format('json').load(f'{stage1np}/GraphAPI/Teams_Activity_User_Detail/*.json')

StatementMeta(spark3p1sm, 99, 3, Finished, Available)

## 1. Users table
Contains all users (students and teachers) at a school-system level

**Databases and tables used:**

- None

**JSON files used:**

- GraphAPI/Users/*.json

**Database and table created:**

1. Spark DB: s2p_graphapi
   - Table: users
2. Spark DB: s2np_graphapi
   - Table: users

In [5]:
dfUsers = dfUsersRaw.select(explode('value').alias('exploded_values')).select("exploded_values.*")
display(dfUsers.limit(10))

StatementMeta(spark3p1sm, 99, 5, Finished, Available)

SynapseWidget(Synapse.DataFrame, 7aa442ef-5610-464a-a39a-33b39ebaea13)
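
To make the explode step above concrete, here is a plain-Python sketch of the same reshaping: each file holds one object whose value array contains the user records, and the result is one row per record. The sample payloads and the flatten_value name are made up for illustration; the notebook itself does this with Spark, not with this function.

```python
import json

# Made-up payloads in the shape described by the users schema above.
raw_documents = [
    '{"value": [{"surname": "Doe", "givenName": "Ann", "userPrincipalName": "ann@example.org", "id": "1"},'
    ' {"surname": "Roe", "givenName": "Bob", "userPrincipalName": "bob@example.org", "id": "2"}]}',
    '{"value": []}',
]


def flatten_value(documents):
    """Return one dict per element of each document's 'value' array."""
    rows = []
    for text in documents:
        doc = json.loads(text)
        # a missing, null or empty 'value' contributes no rows
        for record in doc.get('value') or []:
            rows.append(record)
    return rows


users = flatten_value(raw_documents)
print(len(users))
```

As with explode in Spark, a document whose value array is empty or null produces no output rows.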

## Write Data Back to Lake

### Writing to Stage 2np

In [6]:
# write the flattened users table back to the lake under stage2np/GraphAPI/Users
dfUsers.write.format('parquet').mode('overwrite').save(stage2np + '/GraphAPI/Users')

StatementMeta(spark3p1sm, 99, 6, Finished, Available)

### Writing to Stage 2p
Pseudonymizing users data

In [7]:
%run /OEA_py

StatementMeta(, 99, -1, Finished, Available)

In [8]:
oea = OEA()

usersSchema = [['surname', 'string', 'mask'],
                        ['givenName', 'string', 'mask'],
                        ['userPrincipalName', 'string', 'hash'],
                        ['id', 'string', 'mask']]

df_pseudo, df_lookup = oea.pseudonymize(dfUsers, usersSchema)

df_pseudo.write.format('parquet').mode('overwrite').save(stage2p + '/GraphAPI/Users')

StatementMeta(spark3p1sm, 99, 8, Finished, Available)

2021-10-19 00:45:47,103 - OEA - DEBUG - OEA initialized.
OEA initialized.

### Load to Spark DB

In [9]:
# Create spark db to allow for access to the data in the delta-lake via SQL on-demand.
# This is only creating metadata for SQL on-demand, pointing to the data in the delta-lake.
# This also makes it possible to connect in Power BI via the azure sql data source connector.
def create_spark_db(db_name, source_path):
    spark.sql(f'CREATE DATABASE IF NOT EXISTS {db_name}')
    spark.sql(f"DROP TABLE IF EXISTS {db_name}.users")
    spark.sql(f"create table if not exists {db_name}.users using PARQUET location '{source_path}'")
    
create_spark_db('s2np_graphapi', stage2np + '/GraphAPI/Users')
create_spark_db('s2p_graphapi', stage2p + '/GraphAPI/Users')

StatementMeta(spark3p1sm, 99, 9, Finished, Available)
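
The create_spark_db function above hard-codes the table name, which is why the notebook defines it again for each table. One possible alternative, sketched here under the assumption that only the table name and location differ between tables, is to build the statements from parameters. The name spark_db_statements is hypothetical, and the path in the example is a placeholder.

```python
def spark_db_statements(db_name, table_name, source_path):
    """Build the SQL statements that register a parquet folder as a table."""
    return [
        f"CREATE DATABASE IF NOT EXISTS {db_name}",
        f"DROP TABLE IF EXISTS {db_name}.{table_name}",
        f"CREATE TABLE IF NOT EXISTS {db_name}.{table_name} "
        f"USING PARQUET LOCATION '{source_path}'",
    ]


# placeholder path; in the notebook this would be stage2np + '/GraphAPI/Users'
statements = spark_db_statements('s2np_graphapi', 'users', '/example/GraphAPI/Users')
for statement in statements:
    print(statement)
```

In the notebook, each statement would be passed to spark.sql(statement) in order.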

## 2. M365_app_user_detail table
Contains a sample m365 table to support data analysis in a Power BI dashboard.

**Databases and tables used:**
- None

**JSON files used:**

- GraphAPI/M365_App_User_Detail/*.json

**Database and table created:**

1. Spark DB: s2p_graphapi
   - Table: m365_app_user_detail
2. Spark DB: s2np_graphapi
   - Table: m365_app_user_detail


In [11]:
dfM365UserActivity = dfM365UserActivityRaw.select(explode('value').alias('exploded_values')).select("exploded_values.*")

StatementMeta(spark3p1sm, 99, 11, Finished, Available)

### Processing m365 activity "details" data 
This code block extracts each field of the nested "details" array into its own top-level column and then drops "details".

In [12]:
import pyspark.sql.functions as f

dfM365UserActivity = dfM365UserActivity.withColumn('reportPeriod', f.explode(f.col('details').reportPeriod)) \
                        .withColumn('mobile', f.explode(f.col('details').mobile)) \
                        .withColumn('web', f.explode(f.col('details').web)) \
                        .withColumn('mac', f.explode(f.col('details').mac)) \
                        .withColumn('windows', f.explode(f.col('details').windows)) \
                        .withColumn('excel', f.explode(f.col('details').excel)) \
                        .withColumn('excelMobile', f.explode(f.col('details').excelMobile)) \
                        .withColumn('excelWeb', f.explode(f.col('details').excelWeb)) \
                        .withColumn('excelMac', f.explode(f.col('details').excelMac)) \
                        .withColumn('excelWindows', f.explode(f.col('details').excelWindows)) \
                        .withColumn('oneNote', f.explode(f.col('details').oneNote)) \
                        .withColumn('oneNoteMobile', f.explode(f.col('details').oneNoteMobile)) \
                        .withColumn('oneNoteWeb', f.explode(f.col('details').oneNoteWeb)) \
                        .withColumn('oneNoteMac', f.explode(f.col('details').oneNoteMac)) \
                        .withColumn('oneNoteWindows', f.explode(f.col('details').oneNoteWindows)) \
                        .withColumn('outlook', f.explode(f.col('details').outlook)) \
                        .withColumn('outlookMobile', f.explode(f.col('details').outlookMobile)) \
                        .withColumn('outlookWeb', f.explode(f.col('details').outlookWeb)) \
                        .withColumn('outlookMac', f.explode(f.col('details').outlookMac)) \
                        .withColumn('outlookWindows', f.explode(f.col('details').outlookWindows)) \
                        .withColumn('powerPoint', f.explode(f.col('details').powerPoint)) \
                        .withColumn('powerPointMobile', f.explode(f.col('details').powerPointMobile)) \
                        .withColumn('powerPointWeb', f.explode(f.col('details').powerPointWeb)) \
                        .withColumn('powerPointMac', f.explode(f.col('details').powerPointMac)) \
                        .withColumn('powerPointWindows', f.explode(f.col('details').powerPointWindows)) \
                        .withColumn('teams', f.explode(f.col('details').teams)) \
                        .withColumn('teamsMobile', f.explode(f.col('details').teamsMobile)) \
                        .withColumn('teamsWeb', f.explode(f.col('details').teamsWeb)) \
                        .withColumn('teamsMac', f.explode(f.col('details').teamsMac)) \
                        .withColumn('teamsWindows', f.explode(f.col('details').teamsWindows)) \
                        .withColumn('word', f.explode(f.col('details').word)) \
                        .withColumn('wordMobile', f.explode(f.col('details').wordMobile)) \
                        .withColumn('wordWeb', f.explode(f.col('details').wordWeb)) \
                        .withColumn('wordMac', f.explode(f.col('details').wordMac)) \
                        .withColumn('wordWindows', f.explode(f.col('details').wordWindows)) \
                        .drop('details')

display(dfM365UserActivity.limit(10))

StatementMeta(spark3p1sm, 99, 12, Finished, Available)

SynapseWidget(Synapse.DataFrame, d43c0a78-49d3-45fa-9ec0-60c23ff1daaa)
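
The chained explode calls above keep one row per user only when each details array holds exactly one entry, because every explode emits one row per array element. The plain-Python sketch below shows the intended reshaping on a made-up record: the fields of each details entry are merged into the top level and details is removed. flatten_details is a hypothetical name, and the sample contains only a few of the real columns.

```python
def flatten_details(record):
    """Return one flat dict per entry of record['details']."""
    # copy every top-level field except the nested 'details' array
    base = {key: value for key, value in record.items() if key != 'details'}
    # merge each details entry into a copy of the top-level fields
    return [{**base, **detail} for detail in record.get('details') or []]


# made-up record with a single details entry
sample = {
    'userPrincipalName': 'ann@example.org',
    'reportRefreshDate': '2021-10-17',
    'details': [{'reportPeriod': '7', 'excel': True, 'teams': False}],
}
rows = flatten_details(sample)
print(rows[0])
```

With this shape a record that has two details entries yields two rows, rather than one row per combination of elements.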

## Write Data Back to Lake

In [13]:
# convert the date columns from strings to dates, then write back to the lake under stage2np/GraphAPI/M365_App_User_Detail
dfM365UserActivity = dfM365UserActivity.withColumn('reportRefreshDate', to_date(col('reportRefreshDate'), 'yyyy-MM-dd'))
dfM365UserActivity = dfM365UserActivity.withColumn('lastActivityDate', to_date(col('lastActivityDate'), 'yyyy-MM-dd'))
dfM365UserActivity = dfM365UserActivity.withColumn('lastActivationDate', to_date(col('lastActivationDate'), 'yyyy-MM-dd'))
dfM365UserActivity.write.format('parquet').mode('overwrite').save(stage2np + '/GraphAPI/M365_App_User_Detail')

StatementMeta(spark3p1sm, 99, 13, Finished, Available)

### Writing to Stage 2p
Pseudonymizing M365 data

In [14]:
m365Schema = [['reportRefreshDate', 'date', 'no-op'],
                        ['userPrincipalName', 'string', 'hash'],
                        ['lastActivityDate', 'date', 'no-op'],
                        ['reportPeriod', 'string', 'no-op'],
                        ['mobile', 'boolean', 'no-op'],
                        ['web', 'boolean', 'no-op'],
                        ['mac', 'boolean', 'no-op'],
                        ['windows', 'boolean', 'no-op'],
                        ['excel', 'boolean', 'no-op'],
                        ['excelMac', 'boolean', 'no-op'],
                        ['excelMobile', 'boolean', 'no-op'],
                        ['excelWeb', 'boolean', 'no-op'],
                        ['excelWindows', 'boolean', 'no-op'],
                        ['oneNote', 'boolean', 'no-op'],
                        ['oneNoteMac', 'boolean', 'no-op'],
                        ['oneNoteMobile', 'boolean', 'no-op'],
                        ['oneNoteWeb', 'boolean', 'no-op'],
                        ['oneNoteWindows', 'boolean', 'no-op'],
                        ['outlook', 'boolean', 'no-op'],
                        ['outlookMac', 'boolean', 'no-op'],
                        ['outlookMobile', 'boolean', 'no-op'],
                        ['outlookWeb', 'boolean', 'no-op'],
                        ['outlookWindows', 'boolean', 'no-op'],
                        ['powerPoint', 'boolean', 'no-op'],
                        ['powerPointMac', 'boolean', 'no-op'],
                        ['powerPointMobile', 'boolean', 'no-op'],
                        ['powerPointWeb', 'boolean', 'no-op'],
                        ['powerPointWindows', 'boolean', 'no-op'],
                        ['teams', 'boolean', 'no-op'],
                        ['teamsMac', 'boolean', 'no-op'],
                        ['teamsMobile', 'boolean', 'no-op'],
                        ['teamsWeb', 'boolean', 'no-op'],
                        ['teamsWindows', 'boolean', 'no-op'],
                        ['word', 'boolean', 'no-op'],
                        ['wordMac', 'boolean', 'no-op'],
                        ['wordMobile', 'boolean', 'no-op'],
                        ['wordWeb', 'boolean', 'no-op'],
                        ['wordWindows', 'boolean', 'no-op']
                       ]


df_pseudo, df_lookup = oea.pseudonymize(dfM365UserActivity, m365Schema)

df_pseudo.write.format('parquet').mode('overwrite').save(stage2p + '/GraphAPI/M365_App_User_Detail')

StatementMeta(spark3p1sm, 99, 14, Finished, Available)

### Load to Spark DB

In [15]:
# Create spark db to allow for access to the data in the delta-lake via SQL on-demand.
# This is only creating metadata for SQL on-demand, pointing to the data in the delta-lake.
# This also makes it possible to connect in Power BI via the azure sql data source connector.
def create_spark_db(db_name, source_path):
    spark.sql(f'CREATE DATABASE IF NOT EXISTS {db_name}')
    spark.sql(f"DROP TABLE IF EXISTS {db_name}.m365_app_user_detail")
    spark.sql(f"create table if not exists {db_name}.m365_app_user_detail using PARQUET location '{source_path}'")
    
create_spark_db('s2np_graphapi', stage2np + '/GraphAPI/M365_App_User_Detail')
create_spark_db('s2p_graphapi', stage2p + '/GraphAPI/M365_App_User_Detail')

StatementMeta(spark3p1sm, 99, 15, Finished, Available)

## 3. Teams_activity_user_details table
Contains a sample Teams table to support data analysis in a Power BI dashboard.

**Databases and tables used:**
- None

**JSON files used:**
- GraphAPI/Teams_Activity_User_Detail/*.json

**Database and table created:**

1. Spark DB: s2p_graphapi
   - Table: teams_activity_user_details
2. Spark DB: s2np_graphapi
   - Table: teams_activity_user_details

In [16]:
dfTeamsUserActivity = dfTeamsUserActivityRaw.select(explode('value').alias('exploded_values')).select("exploded_values.*")
dfTeamsUserActivity = dfTeamsUserActivity.withColumn('assignedProducts', f.explode(f.col('assignedProducts')))
dfTeamsUserActivity = dfTeamsUserActivity.drop('@odata.type')
display(dfTeamsUserActivity.limit(10))

StatementMeta(spark3p1sm, 99, 16, Finished, Available)

SynapseWidget(Synapse.DataFrame, 3608020e-2288-4df7-a199-104e0b1577e8)

In [19]:
dfTeamsUserActivity = dfTeamsUserActivity.select('userPrincipalName','lastActivityDate','reportRefreshDate', 'reportPeriod','isDeleted', 'isLicensed', 
'deletedDate', 'hasOtherAction', 'assignedProducts',
'adHocMeetingsAttendedCount', 'adHocMeetingsOrganizedCount',  'callCount', 'meetingCount',
 'meetingsAttendedCount', 'meetingsOrganizedCount', 'privateChatMessageCount',  'scheduledOneTimeMeetingsAttendedCount', 
'scheduledOneTimeMeetingsOrganizedCount', 'scheduledRecurringMeetingsAttendedCount', 'scheduledRecurringMeetingsOrganizedCount',
 'screenShareDuration', 'teamChatMessageCount', 'videoDuration','audioDuration')

StatementMeta(spark3p1sm, 99, 19, Finished, Available)

In [21]:
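# Convert each duration string (hours 'H', minutes 'M', seconds 'S') to a total number of seconds.
# The alias F is imported here because earlier cells only define the lowercase alias f.
import pyspark.sql.functions as F
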
dfTeamsUserActivity = dfTeamsUserActivity.withColumn(
    'screenShareDuration', 
    F.coalesce(F.regexp_extract('screenShareDuration', r'(\d+)H', 1).cast('int'), F.lit(0)) * 3600 + 
    F.coalesce(F.regexp_extract('screenShareDuration', r'(\d+)M', 1).cast('int'), F.lit(0)) * 60 + 
    F.coalesce(F.regexp_extract('screenShareDuration', r'(\d+)S', 1).cast('int'), F.lit(0))
    ).withColumn(
    'videoDuration', 
    F.coalesce(F.regexp_extract('videoDuration', r'(\d+)H', 1).cast('int'), F.lit(0)) * 3600 + 
    F.coalesce(F.regexp_extract('videoDuration', r'(\d+)M', 1).cast('int'), F.lit(0)) * 60 + 
    F.coalesce(F.regexp_extract('videoDuration', r'(\d+)S', 1).cast('int'), F.lit(0))
    ).withColumn(
    'audioDuration', 
    F.coalesce(F.regexp_extract('audioDuration', r'(\d+)H', 1).cast('int'), F.lit(0)) * 3600 + 
    F.coalesce(F.regexp_extract('audioDuration', r'(\d+)M', 1).cast('int'), F.lit(0)) * 60 + 
    F.coalesce(F.regexp_extract('audioDuration', r'(\d+)S', 1).cast('int'), F.lit(0))
    )
display(dfTeamsUserActivity.limit(10))

StatementMeta(spark3p1sm, 99, 21, Finished, Available)

SynapseWidget(Synapse.DataFrame, 2ae3ec23-152c-46cf-90dd-a150b0b2be75)
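
The conversion above applies the same three regular expressions to each duration column. The plain-Python sketch below mirrors that logic for a single value so the arithmetic can be checked by hand. It assumes duration strings of the form PT1H2M3S, and duration_to_seconds is a hypothetical name. Like the Spark version, it treats a missing component or a null value as 0 and does not handle day components or fractional seconds.

```python
import re


def duration_to_seconds(value):
    """Convert a string such as 'PT1H2M3S' to a total number of seconds."""
    if value is None:
        return 0
    total = 0
    for unit, factor in (('H', 3600), ('M', 60), ('S', 1)):
        # first run of digits directly followed by the unit letter
        match = re.search(r'(\d+)' + unit, value)
        if match:
            total += int(match.group(1)) * factor
    return total


print(duration_to_seconds('PT1H2M3S'))  # 1*3600 + 2*60 + 3 = 3723
```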

## Write Data Back to Lake

In [22]:
# convert the date columns from strings to dates, then write back to the lake under stage2np/GraphAPI/Teams_Activity_User_Detail
dfTeamsUserActivity = dfTeamsUserActivity.withColumn('reportRefreshDate', to_date(col('reportRefreshDate'), 'yyyy-MM-dd'))
dfTeamsUserActivity = dfTeamsUserActivity.withColumn('deletedDate', to_date(col('deletedDate'), 'yyyy-MM-dd'))
dfTeamsUserActivity.write.format('parquet').mode('overwrite').save(stage2np + '/GraphAPI/Teams_Activity_User_Detail')

StatementMeta(spark3p1sm, 99, 22, Finished, Available)

### Writing to Stage 2p
Pseudonymizing Teams data

In [23]:
teamsSchema = [['reportRefreshDate', 'date', 'no-op'],
                        ['lastActivityDate', 'string', 'no-op'],
                        ['deletedDate', 'date', 'no-op'],
                        ['isDeleted', 'string', 'no-op'],
                        ['isLicensed', 'string', 'no-op'],
                        ['reportPeriod', 'string', 'no-op'],
                        ['userPrincipalName', 'string', 'hash'],
                        ['privateChatMessageCount', 'integer', 'no-op'],
                        ['teamChatMessageCount', 'integer', 'no-op'],
                        ['meetingsAttendedCount', 'integer', 'no-op'],
                        ['meetingCount', 'integer', 'no-op'],
                        ['meetingsOrganizedCount', 'integer', 'no-op'],                        
                        ['callCount', 'integer', 'no-op'],
                        ['audioDuration', 'integer', 'no-op'],
                        ['videoDuration', 'integer', 'no-op'],
                        ['screenShareDuration', 'integer', 'no-op'],
                        ['scheduledOneTimeMeetingsAttendedCount', 'integer', 'no-op'],
                        ['scheduledOneTimeMeetingsOrganizedCount', 'integer', 'no-op'],
                        ['scheduledRecurringMeetingsAttendedCount', 'integer', 'no-op'],
                        ['scheduledRecurringMeetingsOrganizedCount', 'integer', 'no-op'],
                        ['adHocMeetingsAttendedCount', 'integer', 'no-op'],
                        ['adHocMeetingsOrganizedCount', 'integer', 'no-op'],
                        ['assignedProducts', 'string', 'no-op'],
                        ['hasOtherAction', 'string', 'no-op']]

df_pseudo, df_lookup = oea.pseudonymize(dfTeamsUserActivity, teamsSchema)

df_pseudo.write.format('parquet').mode('overwrite').save(stage2p + '/GraphAPI/Teams_Activity_User_Detail')

StatementMeta(spark3p1sm, 99, 23, Finished, Available)
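
Schema lists like the one above are typed by hand, and a misspelled column name may go unnoticed depending on how the pseudonymization routine handles unknown columns. As a hedged sketch, the check below compares a schema list with a list of column names before pseudonymizing. check_schema_columns is a hypothetical helper, not part of OEA, and the example data is made up with a deliberate misspelling.

```python
def check_schema_columns(df_columns, schema):
    """Return (schema names not in the data, data columns not in the schema)."""
    schema_columns = [entry[0] for entry in schema]
    unknown = [name for name in schema_columns if name not in df_columns]
    unlisted = [name for name in df_columns if name not in schema_columns]
    return unknown, unlisted


# made-up example; 'calCount' is misspelled on purpose
example_columns = ['userPrincipalName', 'reportPeriod', 'callCount']
example_schema = [['userPrincipalName', 'string', 'hash'],
                  ['reportPeriod', 'string', 'no-op'],
                  ['calCount', 'integer', 'no-op']]

unknown, unlisted = check_schema_columns(example_columns, example_schema)
print(unknown, unlisted)
```

In the notebook this could be called as check_schema_columns(dfTeamsUserActivity.columns, teamsSchema), with both returned lists expected to be empty.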

### Load to Spark DB

In [24]:
# Create spark db to allow for access to the data in the delta-lake via SQL on-demand.
# This is only creating metadata for SQL on-demand, pointing to the data in the delta-lake.
# This also makes it possible to connect in Power BI via the azure sql data source connector.
def create_spark_db(db_name, source_path):
    spark.sql(f'CREATE DATABASE IF NOT EXISTS {db_name}')
    spark.sql(f"DROP TABLE IF EXISTS {db_name}.teams_activity_user_details")
    spark.sql(f"create table if not exists {db_name}.teams_activity_user_details using PARQUET location '{source_path}'")
    
create_spark_db('s2np_graphapi', stage2np + '/GraphAPI/Teams_Activity_User_Detail')
create_spark_db('s2p_graphapi', stage2p + '/GraphAPI/Teams_Activity_User_Detail')

StatementMeta(spark3p1sm, 99, 24, Finished, Available)

## Reset Data Processing

Uncomment the last line to reset the data processing done by this notebook. This deletes the GraphAPI folders in stage2np and stage2p and drops both Spark databases, s2np_graphapi and s2p_graphapi.

In [32]:
def reset_all_processing():
    oea.rm_if_exists(stage2np + '/GraphAPI')
    oea.rm_if_exists(stage2p + '/GraphAPI')
    oea.drop_db('s2np_graphapi')
    oea.drop_db('s2p_graphapi')

#reset_all_processing()

StatementMeta(spark3p1sm, 63, 32, Finished, Available)