# Docs

## Peter Daniels - 2025 August

Apparently, the old 8x8 "Post Call Survey" API will be switched off on 2025-08-20, so we need to use the new API, which is documented here:

https://developer.8x8.com/analytics/docs/customer-experience-post-call-survey

The base URL is region-specific, based on the location of your Contact Center tenant.

- United States: https://api.8x8.com/analytics/cc/{version}/historical-metrics/
- Europe: https://api.8x8.com/eu/analytics/cc/{version}/historical-metrics/
- Asia-Pacific: https://api.8x8.com/au/analytics/cc/{version}/historical-metrics/
- Canada: https://api.8x8.com/ca/analytics/cc/{version}/historical-metrics/
- Replace {version} with the current version. As of June 2023 this is 7, resulting in /v7/

So, our base URL for historical metrics is: https://api.8x8.com/analytics/cc/v7/historical-metrics/
and we use a "type" of "detailed-reports-survey" in the create-report API call.
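
As a quick sketch, the region list above can be turned into a small lookup. The helper name and the short region keys ("us", "eu", "au", "ca") are made up for illustration; only the URL pieces come from the list above.

```python
# Region path prefixes taken from the list above.
# The dictionary keys and the function name are hypothetical.
REGION_PREFIXES = {
    "us": "",
    "eu": "eu/",
    "au": "au/",  # Asia-Pacific
    "ca": "ca/",
}


def historical_metrics_base_url(region: str = "us", version: int = 7) -> str:
    """Build the historical-metrics base URL for a region and API version."""
    prefix = REGION_PREFIXES[region]  # KeyError on an unknown region is intentional
    return f"https://api.8x8.com/{prefix}analytics/cc/v{version}/historical-metrics/"


print(historical_metrics_base_url())      # US tenant, v7
print(historical_metrics_base_url("eu"))  # Europe tenant, v7
```

The notebook itself just hard-codes the US v7 URL, which is all we need today.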

I am using this notebook to replace the old "pcs" subject_area code in the existing contact_center_ops_api_processing notebook rather than try to munge the new API calls into that maze of code.

This code gets most of the data elements for the existing raw pcs delta table from the new API, then joins to the raw delta table we built from the old code's detailed-reports-interaction-details call to get the durations. The idea is to append to the same raw delta table the old API code produced so that downstream processing and consumption (e.g. 8x8_perform_transformations notebook) remain the same.

The /historical_analytics/incremental/20250828/ (for example) data has interactionId and transactionId, which are always single valued. To handle new PCS API interactionIds and transactionIds, where there might be multiple Ids in a list, we take the first element of the list. That allows us to join to the /historical_analytics/incremental delta table to get durations and times.
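
The "take the first Id" rule can be sketched in plain Python as below. This is only an illustration of the rule; the notebook does the same thing with Spark column functions, and the function name here is hypothetical.

```python
def first_id(ids):
    """Return the first Id from a value such as "765345, 263547" or ["765345", "263547"]."""
    if ids is None:
        return None
    if isinstance(ids, (list, tuple)):
        return str(ids[0]) if ids else None
    # Strip optional wrapping brackets and double quotes, then split on commas.
    cleaned = str(ids).strip().strip("[]").replace('"', "")
    first = cleaned.split(",")[0].strip()
    return first or None  # empty string means there was no Id at all


print(first_id("765345, 263547"))
print(first_id("[276]"))
```

Single-valued columns pass through unchanged, so the same helper works for both the old and new shapes.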


## Questions and Answers

1. Is there any filtering we should be doing?  The old code's API had a region parameter - so we sent a call to get each region's PCS data: 'regions':[69,70,301,304,303,321,305]. These were only US regions.  It had a commented out line for UK regions. 'regions': [301,304,322,303,321,305,324].

    **Answer: No, no filtering. Report filters are being handled downstream.**

2. In order to fit the raw schema that the old PCS API code created for downstream processing into gold, we would need "hits" and "QuestionLabel" columns populated. We don't have hits, so we are considering just populating it with 0. QuestionLabel might be populated by some logic that looks at the survey type - TBD. Not sure how reliant the downstream xform code is on this being accurate. However, I did see that "hits" was not in the gold table/view (gold.CcoPcs), so unless there are reports based on raw (bad idea), we should be OK with 0 for hits.

    **Answer: this is one for Ethan Boelkins.  He says these are not used, so we are OK.**

3. I have not deeply reviewed the raw-to-gold processing in 8x8_perform_transformations notebook or how the final gold view (gold.CcoPcs) is consumed in order to get a solid sense of how important various data elements are.


    **Answer: I would defer to the answer on two for this. In the next call, let's figure out which exact fields are mismatched in the new API vs the old one to determine any impact. Verified with Ethan that we have the data we need.**

4. The old PCS API died 2025-08-21. Do we need some capability to go back in time and reload days since then using the new API? Right now, like the old code, it just gets the previous day's data.

    **Answer: We added manual date range setting in the code to do a historical load**

5. The new API uses transactionIds and interactionIds (plural); our code currently picks the first Id in those columns to join to the historical analytics data to get our durations/times/etc. Is this OK? Since the duration data may not be of import, this may not be an issue, but worth noting here again.

    **Answer: I think yes, as here is the behavior that I have observed so far: 1. All interaction IDs coming from the Interaction Details API are unique. 2. The interaction duration is not relevant to survey reporting. It could be at some point, so it would probably be best to have continuity here, and to me that appears to be taking the first queue and transaction ID from the survey records to tie that to. This again is likely something that Ethan can confirm. Confirmed.  Agreed.**

6. Do we want 0s and 0.0s for ALL durations and times when we fail to look up interaction details via the first interactionId and first transactionId?

    **Answer: We are now setting them to "0" when NULL (JOIN to interaction details failed).**
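
To make the answer to question 6 concrete, here is a plain-Python sketch of "left join, then default the durations". The defaults mirror the ones used in the join cell further down ("0" strings, "0.0" for holdDuration, integer 0 for muteDuration). The function name and the dictionary-based lookup are assumptions for illustration only; the notebook does this with a Spark left join and coalesce.

```python
# Defaults used when the interaction details lookup finds nothing (or finds NULL).
DURATION_DEFAULTS = {
    "agentCallHandlingDuration": "0",
    "holdDuration": "0.0",
    "muteDuration": 0,
    "timeInIVR": "0",
    "waitTime": "0",
    "callDuration": "0",
}


def with_durations(survey_row, details_by_interaction_id):
    """Attach duration fields to a survey row, falling back to defaults."""
    details = details_by_interaction_id.get(survey_row.get("callId")) or {}
    merged = dict(survey_row)
    for name, default in DURATION_DEFAULTS.items():
        value = details.get(name)
        merged[name] = default if value is None else value
    return merged


# Made-up sample data: one interaction with details, one without.
details = {"int-1": {"callDuration": "461309", "waitTime": None}}
print(with_durations({"callId": "int-1"}, details))
print(with_durations({"callId": "int-2"}, details))
```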


# Imports

In [34]:
import pandas as pd
import requests
from base64 import b64encode
from datetime import date, timedelta, datetime, timezone
import re
import json
from pyspark.sql import Row, functions as F, types as T

# Includes

In [35]:
%run /utils/common_functions

# Main

## API Connection Config

In [36]:
# Token URL
access_token_url = "https://api.8x8.com/oauth/v2/token"

# Get a token (how?)
#  Need the Admin Console clientId and secret to get an access token.
# 
# curl --location --request POST 'https://api.8x8.com/oauth/v2/token' \
# --header 'Content-Type: application/x-www-form-urlencoded' \
# --header 'Authorization: Basic base64encode({clientId}:{secret})' \
# --data-urlencode 'grant_type=client_credentials'

# Client ID and client secret are used to build an auth string to get an access token
# We get the client secret from AKV
client_id = "YonXyRNfVMGFcwOOe2Dna8GTHEcGHsfy"
client_secret = mssparkutils.credentials.getSecret(kv_name, "cco-8x8-client-secret", "ls_kv_adap")
print(f"Retrieved cco-8x8-client-secret from AKV for getting an access token (partial output here): {client_secret[:3]}...{client_secret[-3:]}")

auth_string = f'{client_id}:{client_secret}'
auth_header = b64encode(auth_string.encode()).decode()
#print(f"auth_header: {auth_header}")

headers = {'Authorization': f'Basic {auth_header}',
           'Content-Type': 'application/x-www-form-urlencoded'}

data = { 'grant_type': 'client_credentials'}

resp = requests.post(access_token_url, headers=headers, data=data, timeout=30)
resp.raise_for_status()          # raises on 4xx/5xx
response_json = resp.json()      # now safely parse JSON

access_token = response_json['access_token'] # We use this access token in all subsequent API calls
#print(access_token)
print("Access token generated for subsequent API calls")

## Set our datetime range
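
A minimal sketch of the window-building logic, pulled into functions so it can be checked with a fixed date. The function names are hypothetical. Note that the trailing "Z" is just a literal in the format string; this does not settle the UTC question raised in the TODO in the cell below, it only formats whatever calendar day it is given.

```python
from datetime import date, datetime, time, timedelta


def previous_day(today: date) -> date:
    """Return the calendar day before the given one."""
    return today - timedelta(days=1)


def day_window(day: date):
    """Return (start, end) strings covering one calendar day, yyyy-MM-ddThh:mm:ss.###Z."""
    start = datetime.combine(day, time.min)
    end = datetime.combine(day, time(23, 59, 59))
    return (
        start.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
        end.strftime("%Y-%m-%dT%H:%M:%S.999Z"),
    )


start_str, end_str = day_window(previous_day(date(2025, 8, 21)))
print(start_str, end_str)
```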

In [37]:
# Normal code - yesterday
# Get yesterday's start and end datetimes (ISO 8601 format - yyyy-MM-ddThh:mm:ss.###Z)
print("Building our start and end datetime params for the report creation API call")
yesterday = date.today() - timedelta(days=1)
# TODO: Are we supposed to use UTC when generating "yesterday"? The old code did not, so I am leaving it as-is for now.
# Get yesterday's date in UTC?
# utc_today = datetime.now(timezone.utc).date()
# yesterday = utc_today - timedelta(days=1)

# Start of yesterday (midnight)
yesterday_start_str = datetime.combine(yesterday, datetime.min.time()).strftime('%Y-%m-%dT%H:%M:%S.000Z')
# End of yesterday (23:59:59)
yesterday_end_str = datetime.combine(yesterday, datetime.max.time().replace(microsecond=0)).strftime('%Y-%m-%dT%H:%M:%S.999Z')

print("yesterday_start_str:", yesterday_start_str)
print("yesterday_end_str:", yesterday_end_str)

# Use a generic start and end datetime str variable
start_datetime_str = yesterday_start_str
end_datetime_str = yesterday_end_str

In [38]:
# Manual Code to set a particular datetime range.
# IMPORTANT! This should be commented out for normal execution.
# IMPORTANT! You still need to run the code cell above in order to compare our dates to yesterday further down.
# The last successful loadDate for Prod with the old API was 2025-08-20 for 2025-08-19 call data, 
# so we should start with the 20th's data for that historical load.
#start_datetime_str = "2025-08-20T00:00:00.000Z"
# I think just use yesterday for the end.  You can put whatever you want here anyhoo.
#end_datetime_str = yesterday_end_str

In [39]:
print(f"Using start_datetime_str: {start_datetime_str} in our API call dateRange param")
print(f"Using end_datetime_str: {end_datetime_str} in our API call dateRange param")

## Make API calls to create the detailed-reports-survey report and access it - page by page
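
Paging works by following the Link response header: the create call returns a link tagged rel="data", and each page may return one tagged rel="next". Here is a small sketch of pulling one target out of such a header. The function name and the sample URLs are made up; only the rel values come from this notebook. The pattern uses [^>]+ so a match cannot run across two <...> entries when the header carries more than one link.

```python
import re


def link_target(link_header, rel):
    """Return the URL tagged with the given rel in a Link header, or None."""
    if not link_header:
        return None
    match = re.search(rf'<([^>]+)>;\s*rel="{re.escape(rel)}"', link_header)
    return match.group(1) if match else None


# Hypothetical header with two entries.
sample = (
    '<https://example.invalid/report/1/data?page=0>; rel="data", '
    '<https://example.invalid/report/1/data?page=1>; rel="next"'
)
print(link_target(sample, "data"))
print(link_target(sample, "next"))
print(link_target(sample, "prev"))
```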

In [40]:
report_request_headers = {'Authorization':f'Bearer {access_token}'}
report_request_url = 'https://api.8x8.com/analytics/cc/v7/historical-metrics/detailed-reports'

# We are using the timezone element in the body that the previous code used.
# Build our API body for creating the report. We are using yesterday's dates.
body = {'type':'detailed-reports-survey',
        'title': 'interactions',
        'timezone': 'America/New_York',
        'dateRange':
            {'start': start_datetime_str,
            'end': end_datetime_str
             },
        }
print(body)

# -------------------------------Create the report -----------------------------
print("Making the 'create report' API call...")
report_request = requests.post(headers=report_request_headers, url=report_request_url, json=body)
report_request.raise_for_status()
report_request_data = report_request.json()
print(report_request_data)

# From the create report API call, we get a link/URL to the report
report_link = report_request.headers.get('Link')
if not report_link:
    raise RuntimeError("No 'Link' header returned from report creation.")
#print(report_request)
print(report_link)

# Pull the report data URL out of the Link header. [^>]+ keeps the match inside a single <...> entry.
match = re.search(r'<([^>]+)>;\s*rel="data"', report_link)
if not match:
    raise RuntimeError(f'No rel="data" URL found in Link header: {report_link}')
report_data_url = match.group(1)  # Use this variable to access the report.

#---------------------------------------Access the report created in the prior call----------------------------------------------------

all_data = []
report_access_headers = {'Authorization':f'Bearer {access_token}', 'Accept': 'application/json'}
report_access = requests.get(report_data_url, headers = report_access_headers)
report_access.raise_for_status()

while True:
    data = report_access.json() # Grab the data from the current page response.
    all_data.append(data) # Append the page to all_data, which is cleaned further down.

    head = report_access.headers.get('Link')
    if not head:
        break

    match = re.search(r'<([^>]+)>;\s*rel="next"', head)
    if not match:
        break

    next_url = match.group(1)
    print(f"next_url: {next_url}")
    report_access = requests.get(next_url, headers=report_access_headers) #If the header contains a URL to access the next page, call it.
    report_access.raise_for_status()

print("Retrieved all response data from our report.")

#all_data_pretty = json.dumps(all_data, indent=4)
#print(all_data_pretty)

#-------------------------------- Handling the Response-------------------------------------------

# TODO: Dump the truly raw data into the datalake before "cleaning"/flattening?

print("Cleaning the data into a nice format")
cleaned_data = []
for outer in all_data:
    # Each page can be a list; each list element can be a dict with 'items'
    for record in outer if isinstance(outer, list) else []:
        if not isinstance(record, dict) or 'items' not in record:
            continue
        row = {}
        for item in record['items']:
            key = item.get('key')
            value = item.get('value')
            if key is None:
                continue

            if isinstance(value, list):
                if all(isinstance(v, dict) for v in value):
                    # List of dicts, e.g. results -> results_1_answerDigit, etc.
                    for i, obj in enumerate(value, start=1):
                        for sub_k, sub_v in obj.items():
                            row[f"{key}_{i}_{sub_k}"] = sub_v
                else:
                    # List of scalars -> comma-join
                    row[key] = ", ".join(map(str, value))
            else:
                row[key] = value
        cleaned_data.append(row)

if not cleaned_data:
    raise RuntimeError("No rows returned from API (cleaned_data is empty).")

print("Finished cleaning our response/report data")
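
For reference, the flattening rule in the cell above can be sketched as a standalone function. The record shape (a dict with an 'items' list of key/value pairs) is inferred from the loop above rather than from the 8x8 docs, and the sample values are made up.

```python
def flatten_record(record):
    """Flatten one report record ({'items': [{'key': ..., 'value': ...}, ...]}) into a flat dict."""
    row = {}
    for item in record.get("items", []):
        key, value = item.get("key"), item.get("value")
        if key is None:
            continue
        if isinstance(value, list):
            if all(isinstance(v, dict) for v in value):
                # List of dicts, e.g. results -> results_1_answerDigit, results_2_answerDigit
                for i, obj in enumerate(value, start=1):
                    for sub_key, sub_value in obj.items():
                        row[f"{key}_{i}_{sub_key}"] = sub_value
            else:
                # List of scalars -> comma-joined string
                row[key] = ", ".join(map(str, value))
        else:
            row[key] = value
    return row


sample_record = {
    "items": [
        {"key": "interactionIds", "value": ["int-1", "int-2"]},
        {"key": "actualScore", "value": 10},
        {"key": "results", "value": [{"answerDigit": "5"}, {"answerDigit": "4"}]},
    ]
}
print(flatten_record(sample_record))
```

Note that the comma-join uses ", " (comma plus space), which is why the join cell later splits on commas with optional whitespace.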

## Convert cleaned data to Spark dataframe
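
Before building a Spark dataframe, every row needs the same set of keys. A plain-Python sketch of that normalization (collect all keys, then fill the gaps with None) is shown here; the function name and sample rows are hypothetical.

```python
def normalize_rows(records):
    """Return (sorted keys, rows) where every row has every key; missing keys become None."""
    dict_rows = [r for r in records if isinstance(r, dict)]  # ignore anything that is not a dict
    all_keys = sorted({k for r in dict_rows for k in r})
    rows = [{k: r.get(k) for k in all_keys} for r in dict_rows]
    return all_keys, rows


keys, rows = normalize_rows([{"a": 1}, {"b": "x"}, "junk"])
print(keys)
print(rows)
```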

In [41]:
# Convert to spark dataframe, etc.

# *****************************************************************
# Let's get a spark dataframe for ALL columns  (as strings) for figuring out addl columns needed for filtering.
# 1) Collect all possible keys across all dicts
all_keys = sorted({k for r in cleaned_data if isinstance(r, dict) for k in r.keys()})

# 2) Normalize rows (missing keys → None)
rows = [{k: r.get(k) for k in all_keys} for r in cleaned_data if isinstance(r, dict)]

# 3) Build stable string-typed schema
schema = T.StructType([T.StructField(k, T.StringType(), True) for k in all_keys])

# 4) Create DataFrame
df_spark_all = spark.createDataFrame([Row(**row) for row in rows], schema=schema)

#df_spark_all.printSchema()
# root
#  |-- achievableScore: string (nullable = true)
#  |-- actualScore: string (nullable = true)
#  |-- agentGroupIds: string (nullable = true)
#  |-- agentGroupNames: string (nullable = true)
#  |-- agentIds: string (nullable = true)
#  |-- agentNames: string (nullable = true)
#  |-- callerName: string (nullable = true)
#  |-- callerNumber: string (nullable = true)
#  |-- channelId: string (nullable = true)
#  |-- endTime: string (nullable = true)
#  |-- interactionIds: string (nullable = true) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Key column
#  |-- queueIds: string (nullable = true)
#  |-- queueNames: string (nullable = true)
#  |-- results_1_answerDigit: string (nullable = true)
#  |-- results_1_answerType: string (nullable = true)
#  |-- results_1_questionId: string (nullable = true)
#  |-- results_1_questionTitle: string (nullable = true)
#  |-- results_1_questionType: string (nullable = true)
#  |-- results_1_scaleMax: string (nullable = true)
#  |-- results_1_scaleMin: string (nullable = true)
#  |-- results_1_voiceRecordingUuid: string (nullable = true)
#  |-- results_2_answerDigit: string (nullable = true)
#  |-- results_2_answerType: string (nullable = true)
#  |-- results_2_questionId: string (nullable = true)
#  |-- results_2_questionTitle: string (nullable = true)
#  |-- results_2_questionType: string (nullable = true)
#  |-- results_2_scaleMax: string (nullable = true)
#  |-- results_2_scaleMin: string (nullable = true)
#  |-- results_2_voiceRecordingUuid: string (nullable = true)
#  |-- startTime: string (nullable = true)
#  |-- surveyDuration: string (nullable = true) ??????????????????????????????
#  |-- surveyId: string (nullable = true)
#  |-- surveyIsDeleted: string (nullable = true)
#  |-- surveyName: string (nullable = true) ??????????????????????????????????
#  |-- surveyScorePercentage: string (nullable = true)
#  |-- surveyType: string (nullable = true) ??????????????????????????????????
#  |-- transactionIds: string (nullable = true) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Key Column

#display(df_spark_all)

# *****************************************************************

# We only need certain columns to proceed - keep it lean.
wanted = [
    "interactionIds","startTime","callerName","callerNumber",
    "results_1_answerDigit","results_2_answerDigit",
    "actualScore","agentIds","queueIds","transactionIds",
]

# 1) Filter at the source (missing keys become None)
rows = [{k: r.get(k) for k in wanted} for r in cleaned_data if isinstance(r, dict)]

# 2) Give Spark a stable schema (all strings first is very robust)
schema = T.StructType([T.StructField(k, T.StringType(), True) for k in wanted])
df_spark = spark.createDataFrame([Row(**row) for row in rows], schema=schema)

print("Converted cleaned_data to a spark dataframe, df_spark, with schema:")
df_spark.printSchema()
# root
#  |-- interactionIds: string (nullable = true)
#  |-- startTime: string (nullable = true)
#  |-- callerName: string (nullable = true)
#  |-- callerNumber: string (nullable = true)
#  |-- results_1_answerDigit: string (nullable = true)
#  |-- results_2_answerDigit: string (nullable = true)
#  |-- actualScore: string (nullable = true)
#  |-- agentIds: string (nullable = true)
#  |-- queueIds: string (nullable = true)
#  |-- transactionIds: string (nullable = true)

#display(df_spark)

In [42]:
#display(df_spark.limit(100))

## Join to historical_analytics data we captured in legacy notebook

### Get our dates to load
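
For a historical load, the folder dates are every calendar day from the start string through the end string, inclusive. A standalone sketch of that expansion follows; the function name is hypothetical and the input format is the one used for the API dateRange strings in this notebook.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"  # e.g. 2025-08-20T00:00:00.000Z


def dates_between(start_str, end_str):
    """Return yyyyMMdd strings for each day from start to end, inclusive."""
    current = datetime.strptime(start_str, FMT).date()
    last = datetime.strptime(end_str, FMT).date()
    out = []
    while current <= last:
        out.append(current.strftime("%Y%m%d"))
        current += timedelta(days=1)
    return out


print(dates_between("2025-08-20T00:00:00.000Z", "2025-08-22T23:59:59.999Z"))
```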

In [43]:
# What dates of historical_analytics do we need?
# Get multiple paths if our start_datetime_str is not the same as our yesterday_start_str (hist load)
# Otherwise, normal processing - get today's date
if start_datetime_str != yesterday_start_str:
    print(f"start_datetime_str ({start_datetime_str}) is not yesterday_start_str ({yesterday_start_str})")
    print("Building a list of dates to get our historical_analytics data to join to.")

    # Parse strings (format is always: 2025-08-20T00:00:00.000Z)
    start_dt = datetime.strptime(start_datetime_str, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    end_dt   = datetime.strptime(end_datetime_str,   "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

    # Build list of dates (inclusive)
    dates_to_get = []
    current_date = start_dt.date()
    end_date     = end_dt.date()

    while current_date <= end_date:
        dates_to_get.append(current_date.strftime("%Y%m%d"))
        current_date += timedelta(days=1)
else:
    today_str = datetime.now(timezone.utc).strftime("%Y%m%d")
    print(today_str)
    dates_to_get = [today_str]

print("dates we will load for 'historical_analytics/incremental/yyyyMMdd/':")
print(dates_to_get)


### Load historical_analytics incremental data for our dates (usually just today's)

In [44]:
# We have to get today's historical_analytics load OR a range of dates if we're loading history
# Example delta path:
#abfss://raw@azwwwnonproddevadapadls.dfs.core.windows.net/historical_analytics/incremental/20250827
print("Loading incremental dates for historical_analytics.")

df_hist_analytics = None

for date_str in dates_to_get:
    hist_analytics_delta_path = f"{raw_adls_path}historical_analytics/incremental/{date_str}"
    try:
        df_temp = spark.read.format("delta").load(hist_analytics_delta_path)
        if df_hist_analytics is None:
            df_hist_analytics = df_temp
        else:
            df_hist_analytics = df_hist_analytics.unionByName(df_temp)
        print(f"✅ Loaded {hist_analytics_delta_path}")
    except Exception as e:
        print(f"⚠️ Skipping {hist_analytics_delta_path}: {e}")

if df_hist_analytics is None:
    raise RuntimeError("❌ No dataframes were loaded - be sure to run the main contact_center_ops_api_processing notebook before this so we have the interaction details for our date(s).")

print("Done loading incremental dates for historical_analytics.")
#df_hist_analytics.show(10)

In [45]:
#display(df_hist_analytics.limit(20))
# df_hist_analytics.printSchema()
# root
#  |-- participants: string (nullable = true)
#  |-- time: string (nullable = true)
#  |-- agentNotes: string (nullable = true)
#  |-- blindTransferToAgent: string (nullable = true)
#  |-- blindTransferToQueue: string (nullable = true)
#  |-- campaignId: string (nullable = true)
#  |-- campaignName: string (nullable = true)
#  |-- caseFollowUp: string (nullable = true)
#  |-- caseNumber: string (nullable = true)
#  |-- channelId: string (nullable = true)
#  |-- chatType: string (nullable = true)
#  |-- conferencesEstablished: string (nullable = true)
#  |-- consultationsEstablished: string (nullable = true)
#  |-- creationTime: string (nullable = true)
#  |-- customerName: string (nullable = true)
#  |-- destination: string (nullable = true)
#  |-- direction: string (nullable = true)
#  |-- dispositionAction: string (nullable = true)
#  |-- externalTransactionData: string (nullable = true)
#  |-- finishedTime: string (nullable = true)
#  |-- interactionId: string (nullable = true)
#  |-- interactionLabels: string (nullable = true)
#  |-- interactionType: string (nullable = true)
#  |-- ivrTreatmentDuration: string (nullable = true)
#  |-- mediaType: string (nullable = true)
#  |-- originalInteractionId: string (nullable = true)
#  |-- originalTransactionId: string (nullable = true)
#  |-- origination: string (nullable = true)
#  |-- outboundPhoneCode: string (nullable = true)
#  |-- outboundPhoneCodeId: string (nullable = true)
#  |-- outboundPhoneCodeList: string (nullable = true)
#  |-- outboundPhoneCodeListId: string (nullable = true)
#  |-- outboundPhoneCodeText: string (nullable = true)
#  |-- outboundPhoneShortCode: string (nullable = true)
#  |-- participantAssignNumber: string (nullable = true)
#  |-- participantBusyDuration: string (nullable = true)
#  |-- participantHandlingDuration: string (nullable = true)
#  |-- participantHandlingEndTime: string (nullable = true)
#  |-- participantHold: string (nullable = true)
#  |-- participantHoldDuration: string (nullable = true)
#  |-- participantId: string (nullable = true)
#  |-- participantLongestHoldDuration: string (nullable = true)
#  |-- participantName: string (nullable = true)
#  |-- participantOfferAction: string (nullable = true)
#  |-- participantOfferActionTime: string (nullable = true)
#  |-- participantOfferDuration: string (nullable = true)
#  |-- participantOfferTime: string (nullable = true)
#  |-- participantProcessingDuration: string (nullable = true)
#  |-- participantType: string (nullable = true)
#  |-- participantWrapUpDuration: string (nullable = true)
#  |-- participantWrapUpEndTime: string (nullable = true)
#  |-- queueId: string (nullable = true)
#  |-- queueName: string (nullable = true)
#  |-- queueTime: string (nullable = true)
#  |-- queueWaitDuration: string (nullable = true)
#  |-- recordId: string (nullable = true)
#  |-- transactionId: string (nullable = true)
#  |-- warmTransfersCompleted: string (nullable = true)
#  |-- wrapUpCode: string (nullable = true)
#  |-- wrapUpCodeId: string (nullable = true)
#  |-- wrapUpCodeList: string (nullable = true)
#  |-- wrapUpCodeListId: string (nullable = true)
#  |-- wrapUpCodeText: string (nullable = true)
#  |-- wrapUpShortCode: string (nullable = true)
#  |-- loadDateTime: timestamp (nullable = true)
#  |-- participantMute: string (nullable = true)
#  |-- participantMuteDuration: string (nullable = true)
#  |-- participantLongestMuteDuration: string (nullable = true)
#  |-- scheduleHours: string (nullable = true)
#  |-- TimeToAbandon: string (nullable = true)
#  |-- Transfers: string (nullable = true)
#  |-- interactionDuration: string (nullable = true)
#  |-- loadDate: timestamp (nullable = true)


### The actual JOIN processing - a bit complex
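
One of the fiddly parts is that the duration columns hold Python-dict-like text such as "{'value': 26990, 'ongoing': False}". As a plain-Python illustration of what we want out of those cells, here is a sketch that uses ast.literal_eval. That is a different technique from the Spark cell below, which rewrites the text into JSON with regexes and parses it with from_json. The function name is hypothetical.

```python
import ast


def duration_ms(raw):
    """Return the millisecond value from a duration cell as a string, or None."""
    if raw is None or str(raw).strip() == "":
        return None
    try:
        parsed = ast.literal_eval(str(raw).strip())
    except (ValueError, SyntaxError):
        return None  # not a literal we can read
    if isinstance(parsed, dict):
        parsed = parsed.get("value")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(parsed, bool) or not isinstance(parsed, (int, float)):
        return None
    return str(int(parsed))


print(duration_ms("{'value': 26990, 'ongoing': False}"))
print(duration_ms("312815"))
print(duration_ms(""))
```

A None result here corresponds to the cases where the join cell falls back to its "0" defaults.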

In [46]:
# Join new survey data (df_spark) with historical analytics (df_hist_analytics - interactions)
# Important: We handle the case when the join columns in pcs (df_spark) have a list of Ids for interactionIds or transactionIds
# by picking the first one in the list
# Note that the lists of these Ids are comma separated, but also have a space.  For example: "765345, 263547"

print("Building df_final to add to our pcs/pcs delta table...")

# --- 1) Parse duration-like JSON strings in interactions -> string ms values ---
# Example duration column value: "{'value': 26990, 'ongoing': False}"
dur_schema = T.StructType([
    T.StructField("value", T.LongType(), True),
    T.StructField("ongoing", T.BooleanType(), True),
])

duration_cols = {
    "participantHandlingDuration": "agentCallHandlingDuration",
    "participantHoldDuration": "holdDuration",
    "participantMuteDuration": "muteDuration",
    "ivrTreatmentDuration": "timeInIVR",
    "queueWaitDuration": "waitTime",
    "interactionDuration": "callDuration",
}

df_hist_norm = df_hist_analytics
for src_col, out_name in duration_cols.items():
    src_str = F.col(src_col).cast("string")

    # Clean Python-like dict text into valid JSON (handles None/True/False and single quotes)
    cleaned = F.regexp_replace(src_str, r"'", '"')
    cleaned = F.regexp_replace(cleaned, r"\bTrue\b", "true")
    cleaned = F.regexp_replace(cleaned, r"\bFalse\b", "false")
    cleaned = F.regexp_replace(cleaned, r"\bNone\b", "null")
    cleaned = F.when(F.length(cleaned) == 0, F.lit(None)).otherwise(cleaned)

    # Parse JSON; if it parses, take .value; else fall back to numeric cast of original string
    parsed_value = F.from_json(cleaned, dur_schema).getField("value")
    value_long = F.when(parsed_value.isNotNull(), parsed_value) \
                  .otherwise(src_str.cast("long"))

    # Final as STRING (ms) to match legacy schema
    df_hist_norm = df_hist_norm.withColumn(out_name, value_long.cast("string"))

df_hist_pick = df_hist_norm.select(
    F.col("interactionId").cast("string").alias("interactionId_key"),
    F.col("transactionId").cast("string").alias("transactionId_key"),
    *[F.col(v) for v in duration_cols.values()]
)

#df_hist_pick.printSchema()
#display(df_hist_pick.limit(100))

# THIS CODE WORKS

# --- 2) Helper to extract the first ID from a comma-separated / bracketed / quoted string ---
def first_from_csv_col(col):
    # Remove wrapping brackets and quotes, then split on commas, then take first token
    cleaned = F.regexp_replace(col, r'^\[|\]$|\"', '')            # strip [ ], "
    parts   = F.split(cleaned, r'\s*,\s*')                        # split on commas with optional spaces
    return F.element_at(parts, 1)                                 # 1-based index; null-safe if empty

# Normalize df_spark keys for join
df_s_norm = (
    df_spark
    .withColumn("interactionId_key", first_from_csv_col(F.col("interactionIds")))
    .withColumn("transactionId_key", first_from_csv_col(F.col("transactionIds")))
)

#display(df_s_norm.limit(100))

# --- 3) Join and select with correct legacy names & types ---
# NOTE: New API startTime is formatted like: "2025-08-28T19:37:05.696-04:00"
#   We need it to be formatted like the old API: "08/29/2024 17:17:49" for downstream processing.
df_joined = (
    df_s_norm.alias("s")
    .join(df_hist_pick.alias("h"),
          on=F.col("s.interactionId_key") == F.col("h.interactionId_key"),
          how="left")
    .select(
        # Legacy names
        F.col("s.interactionId_key").alias("callId"),
        # OLD CODE - Keeps raw UTC timezone
        # F.date_format(
        #     F.to_timestamp("s.startTime", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"),
        #     "MM/dd/yyyy HH:mm:ss"
        # ).alias("callDate"),
        # NEW CODE: convert the parsed instant to America/New_York wall-clock time (assumes the Spark session time zone is UTC)
        F.date_format(
            F.from_utc_timestamp(
                F.to_timestamp("s.startTime", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"),
                "America/New_York"
            ),
            "MM/dd/yyyy HH:mm:ss"
        ).alias("callDate"),
        F.col("s.callerName").cast("string").alias("callerName"),
        F.col("s.callerNumber").cast("string").alias("callerNumber"),

        # Defaults for NULL/None
        F.coalesce(F.col("s.results_1_answerDigit").cast("string"), F.lit("NA")).alias("question1"),
        F.coalesce(F.col("s.results_2_answerDigit").cast("string"), F.lit("NA")).alias("question2"),

        F.col("s.actualScore").cast("int").alias("totalScore"),
        F.col("s.agentIds").cast("string").alias("agentList"),
        F.col("s.queueIds").cast("string").alias("queueList"),
        F.col("s.transactionId_key").alias("transactionId"),

        # Durations with defaults
        F.coalesce(F.col("h.agentCallHandlingDuration").cast("string"), F.lit("0")).alias("agentCallHandlingDuration"),
        F.coalesce(F.col("h.holdDuration").cast("string"), F.lit("0.0")).alias("holdDuration"),
        F.coalesce(F.col("h.muteDuration").cast("int"), F.lit(0)).alias("muteDuration"),
        F.coalesce(F.col("h.timeInIVR").cast("string"), F.lit("0")).alias("timeInIVR"),
        F.coalesce(F.col("h.waitTime").cast("string"), F.lit("0")).alias("waitTime"),
        F.coalesce(F.col("h.callDuration").cast("string"), F.lit("0")).alias("callDuration")
    )
)

# Join with survey df and rename to fit old API call data element names:
# AttributeName → MAPPED TO NEW
# ----------------------------------------
# callId                    → interactionIds
# callDate                  → startTime
# callerName                → callerName
# callerPhoneNumber         → callerNumber
# question1                 → results_1_answerDigit
# question2                 → results_2_answerDigit
# totalScore                → actualScore
# agentList                 → agentIds
# queueList                 → queueIds
# transactionId             → transactionIds
# agentCallHandlingDuration → participantHandlingDuration (interaction details)
# holdDuration              → participantHoldDuration (interaction details)
# muteDuration              → participantMuteDuration (interaction details)
# timeInIVR                 → ivrTreatmentDuration (interaction details)
# waitTime                  → queueWaitDuration (interaction details)
# callDuration              → interactionDuration (interaction details)

# Transform this into the same dataframe structure we have in the old raw pcs/pcs delta table
# {hits, data, QuestionLabel, loadDate, loadDateTime}
# Gold delta table does NOT include hits, so we MAY not need to figure out how to calculate that.
# Gold delta table DOES include QuestionLabel, so we will want to figure out some logic for it.
# data is our essential payload - need to convert the dataframe column values to a JSON/dict element:
#
# --{
# --    "callId": "int-18fdf8df0d3-aYkEI6o0fvyMwMiP1eGiNTV1S-phone-03-wolverineworldwid01",
# --    "callDate": "06/03/2024 15:22:52",
# --    "callerName": "WIRELESS CALLER",
# --    "callerPhoneNumber": "2222222222",
# --    "question1": "5",
# --    "question2": "5",
# --    "totalScore": 10,
# --    "agentList": "[agkOrv32ujRCuUOQ3sYjW_Ag]",
# --    "queueList": "[276]",
# --    "transactionId": "178519",
# --    "agentCallHandlingDuration": "312815.0",
# --    "holdDuration": "0.0",
# --    "muteDuration": 0,
# --    "timeInIVR": "105709.0",
# --    "waitTime": "41592.0",
# --    "callDuration": "461309.0"
# --}

df_final = (
    df_joined.select(
        F.lit(0).cast("int").alias("hits"),
        F.struct(
            F.col("callId").cast("string").alias("callId"),
            F.col("callDate").cast("string").alias("callDate"),
            F.col("callerName").cast("string").alias("callerName"),
            F.col("callerNumber").cast("string").alias("callerPhoneNumber"),  # <-- fix name
            F.col("question1").cast("string").alias("question1"),
            F.col("question2").cast("string").alias("question2"),
            F.col("totalScore").cast("int").alias("totalScore"),
            F.col("agentList").cast("string").alias("agentList"),
            F.col("queueList").cast("string").alias("queueList"),
            F.col("transactionId").cast("string").alias("transactionId"),
            F.col("agentCallHandlingDuration").cast("string").alias("agentCallHandlingDuration"),
            F.col("holdDuration").cast("string").alias("holdDuration"),
            F.col("muteDuration").cast("int").alias("muteDuration"),
            F.col("timeInIVR").cast("string").alias("timeInIVR"),
            F.col("waitTime").cast("string").alias("waitTime"),
            F.col("callDuration").cast("string").alias("callDuration"),
        ).alias("data"),
        F.lit("{Q1=Satisfaction, Q2=Resolution}").alias("QuestionLabel"),
        F.current_date().alias("loadDate"),
        F.current_timestamp().alias("loadDateTime"),
    )
)

print("Done Building df_final to add to our pcs/pcs delta table.")

#df_final.printSchema()
# root
#  |-- hits: integer (nullable = false)
#  |-- data: struct (nullable = false)
#  |    |-- callId: string (nullable = true)
#  |    |-- callDate: string (nullable = true)
#  |    |-- callerName: string (nullable = true)
#  |    |-- callerPhoneNumber: string (nullable = true)
#  |    |-- question1: string (nullable = false)
#  |    |-- question2: string (nullable = false)
#  |    |-- totalScore: integer (nullable = true)
#  |    |-- agentList: string (nullable = true)
#  |    |-- queueList: string (nullable = true)
#  |    |-- transactionId: string (nullable = true)
#  |    |-- agentCallHandlingDuration: string (nullable = false)
#  |    |-- holdDuration: string (nullable = false)
#  |    |-- muteDuration: integer (nullable = false)
#  |    |-- timeInIVR: string (nullable = false)
#  |    |-- waitTime: string (nullable = false)
#  |    |-- callDuration: string (nullable = false)
#  |-- QuestionLabel: string (nullable = false)
#  |-- loadDate: date (nullable = false)
#  |-- loadDateTime: timestamp (nullable = false)

#display(df_final.limit(100))

In [47]:
#display(df_joined.limit(20))

# Min and max callDate:
# df_joined.select(
#     F.min(F.to_timestamp("callDate", "MM/dd/yyyy HH:mm:ss")).alias("min_callDate"),
#     F.max(F.to_timestamp("callDate", "MM/dd/yyyy HH:mm:ss")).alias("max_callDate")
# ).show(truncate=False)


## Append to the raw pcs/pcs delta table.  Consider duplicate loadDate handling

In [48]:
# Confirm our existing delta table schema - ah - data is not a string!  It's a struct!
# pcs_raw_delta_path = f"{raw_adls_path}/pcs/pcs"
# df_pcs = (
#     spark
#     .read
#     .format("delta")
#     .load(pcs_raw_delta_path)
# )
# df_pcs.printSchema()
# root
#  |-- hits: integer (nullable = true)
#  |-- data: struct (nullable = true)
#  |    |-- callId: string (nullable = true)
#  |    |-- callDate: string (nullable = true)
#  |    |-- callerName: string (nullable = true)
#  |    |-- callerPhoneNumber: string (nullable = true)
#  |    |-- question1: string (nullable = true)
#  |    |-- question2: string (nullable = true)
#  |    |-- totalScore: integer (nullable = true)
#  |    |-- agentList: string (nullable = true)
#  |    |-- queueList: string (nullable = true)
#  |    |-- transactionId: string (nullable = true)
#  |    |-- agentCallHandlingDuration: string (nullable = true)
#  |    |-- holdDuration: string (nullable = true)
#  |    |-- muteDuration: integer (nullable = true)
#  |    |-- timeInIVR: string (nullable = true)
#  |    |-- waitTime: string (nullable = true)
#  |    |-- callDuration: string (nullable = true)
#  |-- QuestionLabel: string (nullable = true)
#  |-- loadDate: date (nullable = true)
#  |-- loadDateTime: timestamp (nullable = true)


In [49]:
pcs_raw_delta_path = f"{raw_adls_path}/pcs/pcs"

# TODO: What if we've already loaded this date?
#   We have some manual code below.  Leaving it out of here for now to make regular daily runs more efficient.

# Append df_final to the pcs/pcs delta table
print("Appending df_final to the raw pcs/pcs delta table...")
(
    df_final
    .write
    .format("delta")
    .mode("append")
    .save(pcs_raw_delta_path)
)
print("Done Appending df_final to the raw pcs/pcs delta table.")
print("DONE.")
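One possible answer to the TODO above ("What if we've already loaded this date?") is to overwrite only the rows for the current loadDate instead of appending, using Delta's `replaceWhere` write option. This is a sketch under assumptions, not what the notebook does today: `overwrite_load_date` is a hypothetical helper, and it assumes the runtime's Delta version accepts `replaceWhere` on a non-partition column and that every row in the DataFrame carries the same loadDate.

```python
from datetime import date


def replace_where_for(load_date: date) -> str:
    """Build the predicate that selects one load date's rows."""
    return f"loadDate = '{load_date.isoformat()}'"


def overwrite_load_date(df, delta_path: str, load_date: date) -> str:
    """Replace only the rows for load_date, leaving other dates untouched.

    df is expected to be a Spark DataFrame whose loadDate column equals
    load_date for every row; Delta rejects rows outside the predicate.
    """
    predicate = replace_where_for(load_date)
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .save(delta_path)
    )
    return predicate
```

Usage would look like `overwrite_load_date(df_final, pcs_raw_delta_path, date.today())`. One caveat: `df_final` gets its loadDate from `F.current_date()`, which follows the Spark session time zone, while `date.today()` follows the driver's clock, so the two could disagree around midnight.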

# EXIT normal processing

In [50]:
# Exit normal processing
mssparkutils.notebook.exit("0")

# Manual Code - consider adding to daily load code above.  For now, leave it manual

## Delete raw data

In [None]:
# Delete today's data (or any date's data) so we don't duplicate data
# NOTE: Will need to activate PIM to do this when executing as your user account.
# loadDate = "yyyy-MM-dd"
from datetime import date  # harmless if already imported earlier in the notebook
today_str = date.today().strftime("%Y-%m-%d")
print(f"today_str {today_str}")

# Change this as needed  # TMP: 9/3, 8/29
dateToDelete = "2025-09-11"
#dateToDelete = today_str
print(f"dateToDelete: {dateToDelete}")

pcs_raw_delta_path = f"{raw_adls_path}/pcs/pcs"

spark.sql(f"""
DELETE FROM delta.`{pcs_raw_delta_path}`
WHERE loadDate = '{dateToDelete}'
""")
print(f"Deleted raw pcs/pcs data for loadDate '{dateToDelete}'")

## Delete gold data

In [None]:
# Manual code to delete our recent historical load (8/20 to current) from GOLD
# so we can reload it.

# Change this as needed  # TMP: 9/3, 8/29
dateToDelete = "2025-09-11"
#dateToDelete = today_str
print(f"dateToDelete: {dateToDelete}")

pcs_gold_delta_path = f"{gold_adls_path}cco/pcs"

spark.sql(f"""
DELETE FROM delta.`{pcs_gold_delta_path}`
WHERE loadDate = '{dateToDelete}'
""")
print(f"Deleted gold cco/pcs data for loadDate '{dateToDelete}'")
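The raw and gold delete cells above are the same statement pointed at two paths, with a hand-typed date interpolated into the SQL. A minimal sketch of a shared helper that checks the date first is below. `delete_sql` and `delete_load_date` are hypothetical names, and `spark` is passed in rather than assumed to be a global.

```python
from datetime import datetime
from typing import Iterable


def delete_sql(delta_path: str, load_date: str) -> str:
    """Build the DELETE statement for one path-based Delta table.

    Raises ValueError unless load_date parses as a year-month-day date,
    and normalizes it to zero-padded yyyy-MM-dd form.
    """
    parsed = datetime.strptime(load_date, "%Y-%m-%d").date()
    return (
        f"DELETE FROM delta.`{delta_path}` "
        f"WHERE loadDate = '{parsed.isoformat()}'"
    )


def delete_load_date(spark, delta_paths: Iterable[str], load_date: str) -> None:
    """Run the same loadDate delete against each path (e.g. raw, then gold)."""
    for path in delta_paths:
        spark.sql(delete_sql(path, load_date))
        print(f"Deleted loadDate '{load_date}' from {path}")
```

Usage would look like `delete_load_date(spark, [pcs_raw_delta_path, pcs_gold_delta_path], "2025-09-11")`, which keeps the two tables from drifting apart when only one cell gets run.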



## Clean up the small file problem in raw

In [None]:
pcs_raw_delta_path = f"{raw_adls_path}/pcs/pcs"
print(f"Cleaning up the many-small-files issue in raw.pcs {pcs_raw_delta_path}")
# Read current table
df = spark.read.format("delta").load(pcs_raw_delta_path)

# Pick a small, sane number of output files for ~690k rows (e.g., 8–16)
df.coalesce(12) \
  .write.format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .save(pcs_raw_delta_path)

print(f"Done - Cleaning up the many-small-files issue in raw.pcs {pcs_raw_delta_path}")
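The `coalesce(12)` above is a hand-picked number for roughly 690k rows. A sketch of deriving the file count from the table's size instead is below. The 128 MB target per file is an assumption (a common rule of thumb, not something this notebook has measured), and `target_file_count` is a hypothetical helper.

```python
import math

# Assumed target size per output file; tune for the workload.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(total_bytes: int, target_file_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return how many files to coalesce to, never fewer than one."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative / positive")
    return max(1, math.ceil(total_bytes / target_file_bytes))


# total_bytes could come from the table itself, e.g. the sizeInBytes column of
#   spark.sql(f"DESCRIBE DETAIL delta.`{pcs_raw_delta_path}`")
print(target_file_count(1536 * 1024 * 1024))  # 1.5 GiB at 128 MiB per file
```

For a table this small the result will often be 1, in which case the fixed 8 to 16 range from the comment above may still be the more practical choice for parallel reads.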


## Vacuum the old files after coalescing to 12 (or whatever) files for raw

In [None]:
from delta.tables import DeltaTable
print(f"Vacuuming {pcs_raw_delta_path}")
DeltaTable.forPath(spark, pcs_raw_delta_path).vacuum()   # default retention (~7 days)
print(f"Done - Vacuuming {pcs_raw_delta_path}")
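Note that with the default retention, vacuum only removes unreferenced files older than about 7 days, so the small files replaced by the coalesce rewrite stay on disk until they age past that threshold. A sketch of making the retention explicit, with a guard against accidentally going below the default, is below. `vacuum_raw` is a hypothetical wrapper; it relies only on `DeltaTable.vacuum` accepting an optional retention in hours.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # 168 hours, matching the default noted above


def vacuum_raw(delta_table, retention_hours: float = DEFAULT_RETENTION_HOURS):
    """Vacuum with an explicit retention, refusing anything shorter than 7 days.

    delta_table is expected to be a delta.tables.DeltaTable. Shorter retention
    can break time travel and concurrent readers, so this sketch disallows it.
    """
    if retention_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError(
            f"retention_hours={retention_hours} is below the "
            f"{DEFAULT_RETENTION_HOURS}h default; refusing to vacuum"
        )
    return delta_table.vacuum(retention_hours)
```

Usage would look like `vacuum_raw(DeltaTable.forPath(spark, pcs_raw_delta_path))`, re-run a week or so after the coalesce if the goal is to actually reclaim the space from the old small files.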