# Extract Time Series Data from SAP IOT

This notebook will download the time series data from SAP IOT and store the downloaded files in the
folder which is configured in the respective configuration file under **[extract]/[time-series]/[directory]**.

## Prerequisites

- within the `notebooks` folder there must be a `.env` file which contains the credentials to
access SAP PAI and SAP IOT. See the file `.env.sample` as a reference for all parameters that
need to be maintained. The parameter values are usually found in the PAI and IOT
service keys.

- the indicator ETL part must be completed first, because this notebook requires the mapping from
PAI indicators to APM indicators that is created when the indicators are migrated. As a result,
the view `V_POST_LOAD_INDICATORS` will hold all needed information.

- you need to know which data model your IOT data is using. If you use the IOT subscription
which is embedded with SAP APM, it will probably be the `ABSTRACT` model. If you use a separate
SAP IOT subscription, it will be the `THING` model. You need to define this setting in the first cell.

## Steps in the notebook

1. Define configuration and data model
2. Initialize SQLite database
3. Create database table `iot_export_status`
4. Initiate download
5. Check processing status
6. Download the data
7. Unzip the downloaded files

For detailed description see [documentation](../docs/time-series-migration.md).

In [7]:
# Define the config you want to use
CONFIG_ID = "dev"
# Define the data model you want to use, either "ABSTRACT" or "THING"
DATA_MODEL = "ABSTRACT"

%pip install -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Determine Indicator Groups

First we need to determine the indicator groups for which we want to download the time series data from the cold store.

In [2]:
# initialize sqlite database
from modules.acf.model_api import ApiModel
from modules.iot.iot import SAPIoTAPIWrapper
from modules.util.helpers import Logger
from modules.util.database import (
    SQLAlchemyClient,
    EquIndicatorGroups,
    FlocIndicatorGroups,
)


extraction_ids = []
log = Logger.get_logger(CONFIG_ID)
log.info("** EXTRACT - TIME SERIES DATA **")
Logger.blank_line(log)
log.info(f"Configuration ID: {CONFIG_ID} - Data Model: {DATA_MODEL}")

db = SQLAlchemyClient(CONFIG_ID)
db.table_create_all()
iot_wrapper = SAPIoTAPIWrapper(config_id=CONFIG_ID)
model_wrapper = ApiModel(config_id=CONFIG_ID)

if DATA_MODEL == "ABSTRACT":
    # property types are the indicator groups
    equi_ind_groups = db.select(EquIndicatorGroups)
    floc_ind_groups = db.select(FlocIndicatorGroups)
    log.debug(f"fetched {len(equi_ind_groups)} indicator groups")
    # iterate over all indicator groups
    for group in equi_ind_groups:
        log.debug(
            f"found indicator group: {group['indicatorGroups_description_short']}"
        )
        extraction_ids.append(f"IG_{group['indicatorGroups_id']}")

    for group in floc_ind_groups:
        log.debug(
            f"found indicator group: {group['indicatorGroups_description_short']}"
        )
        extraction_ids.append(f"IG_{group['indicatorGroups_id']}")

elif DATA_MODEL == "THING":
    thing_types = iot_wrapper.get_thing_types()
    equi_model = model_wrapper.get_equipment_models()
    floc_model = model_wrapper.get_floc_models()

    log.debug(f"fetched {len(equi_model)} equipment models")

    # we expect that every model has a thing type
    # iterate over all equipment models and check if the thing type exists
    thing_types_found = []
    for model in equi_model:
        search_terms = model["modelSearchTerms"].split(",")
        found = False
        for term in search_terms:
            if term in thing_types:
                thing_types_found.append(term)
                found = True
                break
        if not found:
            log.debug(f"thing type {model['modelSearchTerms']} not found")

    log.info(
        f"found {len(thing_types_found)} equipment models with existing thing types"
    )
    for thing_type in thing_types_found:
        log.debug(f"found thing type: {thing_type}")
        property_sets = iot_wrapper.get_property_sets_by_thing_type(thing_type)
        log.debug(f"found {len(property_sets)} property sets")
        for property_set in property_sets:
            log.debug(f"found property set: {property_set}")
            if property_set not in extraction_ids:
                extraction_ids.append(property_set)

log.info(f"found {len(extraction_ids)} extraction ids")

2025-01-25 19:16:52,146  [INFO]: ** EXTRACT - TIME SERIES DATA **

2025-01-25 19:16:52,148  [INFO]: Configuration ID: dev - Data Model: ABSTRACT
2025-01-25 19:16:52,176  [INFO]: [DB] Tenant ID: dev
2025-01-25 19:16:52,177  [INFO]: [DB] Database Connection: sqlite:///../migration-data/dev.db
2025-01-25 19:16:52,177  [DEBU]: [DB] SQLAlchemy Echo: False
2025-01-25 19:16:52,177  [INFO]: [DB] Drop Reload: True
2025-01-25 19:16:52,183  [DEBU]: [DB] Table Created: T_PAI_EXTERNALDATA
2025-01-25 19:16:52,183  [DEBU]: [DB] Table Created: T_PAI_EXTERNALDATA_FLOC
2025-01-25 19:16:52,184  [DEBU]: [DB] Table Created: T_EIOT_MAPPING
2025-01-25 19:16:52,184  [DEBU]: [DB] Table Created: T_EIOT_MAPPING_INDICATORS
2025-01-25 19:16:52,184  [DEBU]: [DB] Table Created: T_EIOT_UPLOAD_STATUS
2025-01-25 19:16:52,184  [DEBU]: [DB] Table Created: T_APM_INDICATOR_POSITIONS
2025-01-25 19:16:52,185  [DEBU]: [DB] Table Created: T_PAI_EQU_HEADER
2025-01-25 19:16:52,185  [DEBU]: [DB] Table Created: T_PAI_FLOC_HEADER
2

# Database Setup

We'll create a new database table `iot_export_status` where we store the
request id returned by the time series cold store. Later we use the status
column to keep track of which files can be downloaded.
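To make the role of the status table more concrete, here is a minimal sketch of what such a table could look like. It uses the standard-library `sqlite3` module instead of SQLAlchemy to stay dependency-free. The column names are inferred from how the later cells read and write the table; the column types and the helper `count_by_status` are assumptions, not the project's actual definition in `modules.database.tables`.

```python
import sqlite3

# Hypothetical DDL: column names are inferred from the notebook code,
# the column types are assumptions.
DDL = """
CREATE TABLE IF NOT EXISTS iot_export_status (
    tenant_id       TEXT NOT NULL,
    indicator_group TEXT NOT NULL,
    start_date      TEXT NOT NULL,  -- stored as YYYY-MM-DD
    end_date        TEXT NOT NULL,  -- stored as YYYY-MM-DD
    status          TEXT NOT NULL,
    message         TEXT,
    request_id      TEXT
)
"""


def count_by_status(conn):
    """Return a dict that maps each status to its number of rows."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM iot_export_status GROUP BY status"
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.executemany(
    "INSERT INTO iot_export_status VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("dev", "IG_A", "2020-01-01", "2020-12-31", "Initiated", None, "req-1"),
        ("dev", "IG_B", "2020-01-01", "2020-12-31", "Error", "Unknown error", "None"),
    ],
)
conn.commit()
status_counts = count_by_status(conn)
```

One row per indicator group and time slice keeps the export restartable: a slice that already has a row does not need to be requested again.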

In [3]:
from sqlalchemy import create_engine
from modules.database.tables import meta_obj
from modules.util.config import get_config_by_id
from modules.util.helpers import Logger

log = Logger.get_logger(CONFIG_ID)
config = get_config_by_id(CONFIG_ID)
engine = create_engine(config["database"]["connection"], echo=False)

with engine.connect() as conn:
    log.info("creating iot_export_status table")
    # meta_obj.drop_all(engine)
    meta_obj.create_all(engine)
    # check that the table is empty
    
    # conn.execute(iot_export_status_table.delete())
    # result = conn.execute(iot_export_status_table.select())
    # log.info(result.fetchall())
    conn.commit()

2025-01-25 19:16:57,842  [INFO]: creating iot_export_status table


# Initiate Download

For the extracted indicator groups we'll trigger the download, starting with yearly time frames.
The overall time frame and the size of the time slices must be defined in the config file under `[extract]/[time-series]`,
using the properties `time_range_from`, `time_range_to`, and `time_range_interval`.

For each request we get back a request id, which we save in the internal status table (`iot_export_status`).
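The idea of cutting the overall time frame into slices can be sketched as follows. This is only an illustration: `make_slices` is a hypothetical helper, and the assumption that the interval is given in days may not match the real `generate_slices` function or the format of `time_range_interval` in the config file.

```python
from datetime import date, timedelta


def make_slices(start, end, interval_days):
    """Split [start, end] into consecutive (slice_start, slice_end) tuples.

    Both boundaries are inclusive; the last slice is shortened so that it
    never exceeds `end`.
    """
    if interval_days < 1:
        raise ValueError("interval_days must be at least 1")
    slices = []
    current = start
    while current <= end:
        slice_end = min(current + timedelta(days=interval_days - 1), end)
        slices.append((current, slice_end))
        current = slice_end + timedelta(days=1)
    return slices


# Ten days in slices of four days give two full slices and one short slice.
slices = make_slices(date(2020, 1, 1), date(2020, 1, 10), 4)
for slice_start, slice_end in slices:
    print(slice_start.strftime("%Y-%m-%d"), slice_end.strftime("%Y-%m-%d"))
```

Smaller slices mean more requests, but they reduce the risk that a single request is rejected because it covers too much data.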

In [8]:
from modules.iot.iot import SAPIoTAPIWrapper
from modules.util.helpers import generate_slices
from sqlalchemy import create_engine, func, select
from modules.database.tables import iot_export_status_table
from modules.util.config import get_config_by_id, get_system_by_type
from modules.util.helpers import Logger

import time

log = Logger.get_logger(CONFIG_ID)
config = get_config_by_id(CONFIG_ID)
iot_config = get_system_by_type(config, "IOT")
engine = create_engine(config["database"]["connection"], echo=False)

iot_wrapper = SAPIoTAPIWrapper(config_id=CONFIG_ID)
time_slices = generate_slices(
    config["extract"]["time-series"]["time_range_from"],
    config["extract"]["time-series"]["time_range_to"],
    config["extract"]["time-series"]["time_range_interval"],
)
property_set_type_ignore_list = list()

with engine.connect() as conn:
    # iterate over all thing types

    for indicator_group in extraction_ids:
        # check if this property set type is in the ignore list
        if indicator_group in property_set_type_ignore_list:
            # if the property set type is in the ignore list, continue with the next property set type
            continue
        log.info(f"start export for {indicator_group}")
        # iterate over all time slices
        for time_slice in time_slices:
            # get the data for the thing type and the time slice
            try: 
                # check if this property set type is already in the iot_export_status table
                # if it is already in the table, we don't need to export it again
                query = iot_export_status_table.select().where(
                    (iot_export_status_table.c.tenant_id == config["config_id"])
                    & (iot_export_status_table.c.indicator_group == indicator_group)
                    & (
                        iot_export_status_table.c.start_date
                        == time_slice[0].strftime("%Y-%m-%d")
                    )
                    & (
                        iot_export_status_table.c.end_date
                        == time_slice[1].strftime("%Y-%m-%d")
                    )
                )
                status = conn.execute(query).fetchall()

                if status != []:
                    # if the status is not None, the export is already done
                    continue

                # initiate the export
                request_id = iot_wrapper.initiate_time_series_export(
                    indicator_group=indicator_group,
                    start_date=time_slice[0].strftime("%Y-%m-%d"),
                    end_date=time_slice[1].strftime("%Y-%m-%d"),
                )
                # store the request id in the iot_export_status table to keep track of the export
                # together with the start and end date of the export
                stmt = iot_export_status_table.insert().values(
                    tenant_id=config["config_id"],
                    indicator_group=indicator_group,
                    start_date=time_slice[0].strftime("%Y-%m-%d"),
                    end_date=time_slice[1].strftime("%Y-%m-%d"),
                    status="Initiated",
                    request_id=request_id,
                )

                conn.execute(stmt)
                conn.commit()

            except Exception as e:
                # the error handling depends on DATA_MODEL
                # THING model:
                # 400 - bad request: malformed query or errors in the query (property does not exist)
                # 404 - not found: there is no data in the provided time range
                # 413 - too large: too much data - choose a smaller date interval
                # ABSTRACT model:
                # these cases all come back as status 400 and must be told apart by the
                # message text: "no data" means there is nothing to export, "too much data"
                # means the query timeframe has to be reduced
                response = getattr(e, "response", None)
                if response is None:
                    # not an HTTP error (e.g. a database problem): do not hide it
                    raise

                status_code = response.status_code
                try:
                    message = response.json().get("message", "Unknown error")
                except ValueError:
                    # the response body is not valid JSON
                    message = "Unknown error"

                start_date = time_slice[0].strftime("%Y-%m-%d")
                end_date = time_slice[1].strftime("%Y-%m-%d")
                export_state = "Error"

                if DATA_MODEL == "THING":
                    if status_code == 404:
                        # no data: skip all remaining time slices of this property set type
                        property_set_type_ignore_list.append(indicator_group)
                        continue
                    if status_code == 413:
                        # the time slice would have to be reduced; for now it is skipped
                        continue
                elif DATA_MODEL == "ABSTRACT" and status_code == 400:
                    if (
                        "no data" in message
                        or "Data not found for the requested date range" in message
                    ):
                        property_set_type_ignore_list.append(indicator_group)
                        continue
                    if "too much data" in message:
                        export_state = "Timeframe"
                        message = "query timeframe has to be reduced"

                if export_state == "Timeframe":
                    log.warning(
                        "query timeframe has to be reduced for TS download for: "
                        f"{indicator_group} {start_date} {end_date}"
                    )
                else:
                    log.error(
                        f"status code {status_code} when initiating TS download for: "
                        f"{indicator_group} {start_date} {end_date}"
                    )

                # store the outcome in the iot_export_status table
                stmt = iot_export_status_table.insert().values(
                    tenant_id=config["config_id"],
                    indicator_group=indicator_group,
                    start_date=start_date,
                    end_date=end_date,
                    status=export_state,
                    message=message,
                    request_id="None",
                )
                conn.execute(stmt)
                conn.commit()

    # print status report: how many exports are initiated and how many failed
    status_combined = (
        select(iot_export_status_table.c.status, func.count().label("count"))
        .where(iot_export_status_table.c.status.in_(["Error", "Initiated"]))
        .group_by(iot_export_status_table.c.status)
    )

    result = conn.execute(status_combined).fetchall()
    status_dict = {row[0]: row[1] for row in result}
    status_error = status_dict.get("Error", 0)
    status_initiated = status_dict.get("Initiated", 0)

    log.info(f"{status_error} entries failed and {status_initiated} entries are initiated")

log.info("done")

NameError: name 'extraction_ids' is not defined

# Check processing status

Next we need to check the processing status of all initiated downloads. Once all
exports have reached a final status, we can continue with the next step.

The following are the possible statuses:

- Initiated: The request is placed successfully.
- Submitted: The request for data export is initiated and the method is retrieving the data and preparing for the export process.
- Failed: The request for data export failed due to various reasons. The reasons are listed in the response payload.
- Exception: The system retried to initiate the data export but failed.
- Ready for Download: The request for data export succeeded and the data is available in a file format for download.
- Expired: The data that is ready for download is available only for seven days, beyond which the exported data is not available for download. You should re-initiate the request for data export.
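The polling logic that follows from these statuses can be sketched independently of the SAP IOT API. The function below is a hypothetical illustration: `poll_until_final` and its parameters are assumptions, and `get_status` stands in for whatever call returns the status text for a request id. Treating "Failed", "Exception", "Ready for Download" and "Expired" as final mirrors the check used in the cell below.

```python
import time

# Statuses after which polling a request again is pointless.
FINAL_STATUSES = {"Failed", "Exception", "Ready for Download", "Expired"}


def poll_until_final(request_ids, get_status, base_delay=30, max_rounds=10,
                     sleep=time.sleep):
    """Poll every request until it reaches a final status or max_rounds is hit.

    Returns a dict that maps each request id to its last known status.
    """
    statuses = {}
    pending = list(request_ids)
    for round_no in range(1, max_rounds + 1):
        for request_id in pending:
            statuses[request_id] = get_status(request_id)
        # only requests without a final status are polled again
        pending = [r for r in pending if statuses[r] not in FINAL_STATUSES]
        if not pending:
            break
        if round_no < max_rounds:
            sleep(base_delay * round_no)  # linear back-off: 30 s, 60 s, 90 s, ...
    return statuses


# Usage with a fake status source and without real waiting.
_answers = {
    "req-1": iter(["Submitted", "Ready for Download"]),
    "req-2": iter(["Failed"]),
}
result = poll_until_final(
    ["req-1", "req-2"],
    get_status=lambda request_id: next(_answers[request_id]),
    sleep=lambda seconds: None,
)
```

Bounding the number of rounds avoids an endless loop if an export never leaves the "Submitted" state.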

In [5]:
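import time

from modules.iot.iot import SAPIoTAPIWrapper
from sqlalchemy import create_engine, and_
from modules.database.tables import iot_export_status_table
from modules.util.config import get_config_by_id, get_system_by_type
from modules.util.helpers import Logger

# time and log are used below; define them here so this cell also runs on its own
log = Logger.get_logger(CONFIG_ID)
iot_wrapper = SAPIoTAPIWrapper(config_id=CONFIG_ID)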

config = get_config_by_id(CONFIG_ID)
iot_config = get_system_by_type(config, "IOT")
engine = create_engine(config["database"]["connection"], echo=False)

with engine.connect() as conn:
    # select all initiated exports from the iot_export_status table
    query = iot_export_status_table.select().where(
        (iot_export_status_table.c.tenant_id == config["config_id"]) &
        (iot_export_status_table.c.status == "Initiated")
    )

    results = conn.execute(query).fetchall()

    # iterate over download_data dictionary and check the status of the exports
    all_exports_complete = False

    count = 1
    while not all_exports_complete:
        all_exports_complete = True
        for export_status in results:
            export_status = export_status._asdict()
            # get the status of the export
            status = iot_wrapper.get_time_series_export_status(
                request_id=export_status["request_id"]
            )
            log.info(
                f"Status for {export_status['indicator_group']} from "
                f"{export_status['start_date']} to "
                f"{export_status['end_date']}: {status}"
            )

            if export_status["status"] != status:
                # update the status in the iot_export_status table
                stmt = (
                    iot_export_status_table.update()
                    .values(status=status)
                    .where(
                        and_(
                            iot_export_status_table.c.tenant_id == config["config_id"],
                            iot_export_status_table.c.indicator_group
                            == export_status["indicator_group"],
                            iot_export_status_table.c.start_date
                            == export_status["start_date"],
                            iot_export_status_table.c.end_date
                            == export_status["end_date"],
                        )
                    )
                )

                res = conn.execute(stmt)
                conn.commit()

                # update the status in the results list as well

                export_status['status'] = status

            # check if the status is one of the final statuses
            if status not in ["Failed", "Exception", "Ready for Download", "Expired"]:
                all_exports_complete = False

        # wait for some time before checking the status again
        if not all_exports_complete:
            time.sleep(30*count)
            count += 1

log.info("done")

2025-01-25 18:53:27,419  [INFO]: Status for IG_3A21672057D04945BD3AEB638AEC1F2A from 2020-01-01 to 2020-12-15: Ready for Download
2025-01-25 18:53:27,551  [INFO]: Status for IG_94749721CCB54448AABE7C788022CD71 from 2020-01-01 to 2020-12-15: Ready for Download
2025-01-25 18:53:27,553  [INFO]: done


# Download Time-Series Data to Disk

The next step is to download the cold store data to the disk for further processing.
The folder where you want to save the time series data can be defined in the config
section at `["extract"]["time-series"]["directory"]`. As the file size can be large
the download will be done in chunks. 
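Downloading in chunks means that the file is written piece by piece instead of being held in memory as a whole. The sketch below shows the general technique on plain file-like objects. It is not the implementation of `download_time_series_export_sequential`, which is not shown in this notebook; `save_in_chunks` is a hypothetical helper.

```python
import io


def save_in_chunks(source, target, chunk_size=1024 * 1024):
    """Copy `source` to `target` in pieces of at most `chunk_size` bytes.

    Both arguments are binary file-like objects. Returns the number of
    bytes written.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read marks the end of the stream
            break
        target.write(chunk)
        total += len(chunk)
    return total


# Usage with in-memory streams: 2500 bytes are copied in three reads.
payload = b"x" * 2500
buffer = io.BytesIO()
written = save_in_chunks(io.BytesIO(payload), buffer, chunk_size=1000)
```

With an HTTP client, the source would be the streamed response body and the target a file opened in binary write mode.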

In [6]:
import os
from modules.iot.iot import SAPIoTAPIWrapper
from sqlalchemy import create_engine, and_
from modules.database.tables import iot_export_status_table
from modules.util.config import get_config_by_id

config = get_config_by_id(CONFIG_ID)
iot_wrapper = SAPIoTAPIWrapper(config_id=CONFIG_ID)
engine = create_engine(config["database"]["connection"], echo=False)

DOWNLOAD_FOLDER = config["extract"]["time-series"]["directory"]

# create the download folder if it does not exist yet
os.makedirs(DOWNLOAD_FOLDER, exist_ok=True)


with engine.connect() as conn:
    # select all exports that are ready for download
    query = iot_export_status_table.select().where(
        and_(
            iot_export_status_table.c.tenant_id == config["config_id"],
            iot_export_status_table.c.status == "Ready for Download",
        )
    )

    results = conn.execute(query).fetchall()

    # iterate over download_data dictionary and download the data
    for export_status in results:
        export_status = export_status._asdict()

        file_path = os.path.join(
            DOWNLOAD_FOLDER,
            f"{export_status['indicator_group']}+"
            f"{export_status['start_date']}+"
            f"{export_status['end_date']}.zip",
        )

        # download the data
        iot_wrapper.download_time_series_export_sequential(
            request_id=export_status["request_id"], file_path=file_path, log=log
        )

        # iot_wrapper.download_time_series_export(
        #     request_id=export_status["request_id"], file_path=file_path
        # )

        # update the status in the iot_export_status table
        stmt = (
            iot_export_status_table.update()
            .values(status="Downloaded")
            .where(
                and_(
                    iot_export_status_table.c.tenant_id == config["config_id"],
                    iot_export_status_table.c.indicator_group
                    == export_status["indicator_group"],
                    iot_export_status_table.c.start_date
                    == export_status["start_date"],
                    iot_export_status_table.c.end_date
                    == export_status["end_date"],
                )
            )
        )

        conn.execute(stmt)
        conn.commit()

log.info("done")

2025-01-25 18:53:35,999  [INFO]: Download Request Part 1 started now: 2025-01-25 18:53:35.999407


Downloaded 0.020% Speed is 0.00 KB per second
Downloaded 11.293% Speed is 1125.62 KB per second
Downloaded 21.547% Speed is 1025.08 KB per second
Downloaded 31.801% Speed is 1025.02 KB per second
Downloaded 42.055% Speed is 1024.84 KB per second
Downloaded 52.289% Speed is 1023.40 KB per second
Downloaded 62.523% Speed is 1023.03 KB per second
Downloaded 72.757% Speed is 1023.46 KB per second
Downloaded 83.010% Speed is 1025.12 KB per second
Downloaded 93.264% Speed is 1024.17 KB per second


2025-01-25 18:55:29,397  [INFO]: Total Time taken for the file download for request id <built-in function id>in minutes: 1.8899389
2025-01-25 18:55:29,404  [INFO]: Download Request Part 1 started now: 2025-01-25 18:55:29.404490


Downloaded 0.180% Speed is 0.00 KB per second


2025-01-25 18:55:53,243  [INFO]: Total Time taken for the file download for request id <built-in function id>in minutes: 0.39730856666666664
2025-01-25 18:55:53,257  [INFO]: done


# Extract downloaded data

The downloaded time series data arrives as a zip archive. For further
processing in the next step we need to extract the files from the archive,
including the gzip-compressed files it contains. This is done in this step.
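The two-stage extraction (zip archive first, then the gzip files inside it) can be sketched as a small function. This is a minimal illustration under assumptions: `extract_archive` is a hypothetical helper, and the member name `part-0.csv.gz` in the usage example is made up, since the real file names inside the export are not shown here.

```python
import gzip
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path):
    """Unzip `zip_path` into a sibling folder and decompress all .gz files in it.

    Returns the paths of the decompressed files.
    """
    zip_path = Path(zip_path)
    target_dir = zip_path.with_suffix("")  # folder named like the archive
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)

    extracted = []
    # sorted() materializes the list before files are deleted below
    for gz_path in sorted(target_dir.rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop the .gz suffix
        with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        gz_path.unlink()  # the compressed copy is no longer needed
        extracted.append(out_path)
    return extracted


# Usage with a throwaway archive in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "IG_EXAMPLE+2020-01-01+2020-12-31.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("part-0.csv.gz", gzip.compress(b"timestamp,value\n"))
    outputs = extract_archive(archive_path)
    demo_names = [path.name for path in outputs]
    demo_content = outputs[0].read_bytes()
```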

In [None]:
import gzip
import os
import shutil
import zipfile

DOWNLOAD_FOLDER = config["extract"]["time-series"]["directory"]

# read all files in the download folder
files = []
for root, dirs, filenames in os.walk(DOWNLOAD_FOLDER):
    for filename in filenames:
        if filename.endswith(".zip"):
            files.append(os.path.join(root, filename))

# iterate over all files
for file in files:
    # the file needs to be unzipped first - create a folder named after the file
    file_folder = os.path.splitext(file)[0]
    os.makedirs(file_folder, exist_ok=True)
    # unzip the file
    with zipfile.ZipFile(file, 'r') as zip_ref:
        log.info(f"unzipping {file}")
        zip_ref.extractall(file_folder)
    
        # now read all files with ending .gz in the folder where the file was unzipped
        for root, dirs, filenames in os.walk(file_folder):
            for filename in filenames:
                if filename.endswith(".gz"):
                    log.info(f"unzipping {filename}")
                    # unzip the file
                    with gzip.open(os.path.join(root, filename), 'rb') as f_in:
                        with open(os.path.join(root, filename.replace(".gz", "")),'wb') as f_out:
                            shutil.copyfileobj(f_in, f_out)
                        # we can delete the gz file after unzipping
                        os.remove(os.path.join(root, filename))


log.info("done")