# Industrial Machinery Data Ingestor for IoT Data

## Introduction
Implementing a production IoT Predictive Maintenance solution requires mature remote condition monitoring infrastructure. More specifically, data-driven prognostics hinge on the availability of statistically significant amounts of run-to-failure telemetry and maintenance records, from which equipment degradation patterns can be learned to enable failure predictions based on both historical and newly collected data.
Real-world run-to-failure data sets are virtually impossible to come across due to their commercially sensitive nature. Of the several publicly available synthetic data sets [1][2], none ideally fits the canonical IoT scenario in which highly irregular real-time data streams from sensors are captured and used for condition monitoring or anomaly detection [3]. For instance, the tiny Turbofan Engine Degradation Simulation Data Set [1], created with the commercial version of the Modular Aero-Propulsion System Simulation environment (C-MAPSS) [4], contains a single telemetry snapshot per operational cycle of unspecified length, whereas the data set which accompanies the Predictive Maintenance Modelling Guide [2] features hourly, presumably aggregated, sensor readings produced by making statistical alterations to uncorrelated series of random numbers. Unfortunately, the source code used for generating these data sets is not publicly available (in fact, obtaining C-MAPSS requires a grant from the U.S. Government). This makes it rather difficult both to convert these data sets to a format that more closely resembles an IoT-centered scenario and to generate new customized data for real-time simulation.

## Notebook Purpose
This notebook ingests and prepares the IoT data collected and stored in the PdM solution deployment for further use in FeatureEngineering and ModelTraining. It ingests both sensor telemetry data and logging output from IoT devices.

## Data inventory
The ingested data set will be a combination of maintenance and telemetry records. It is assumed that sensor data, along with the values describing a machine's current operational settings (in this case, rotational speed), is periodically transmitted from an IoT Edge device, with or without preprocessing, to the cloud where it is used for predictive analysis.
### Maintenance and failure records
In the present implementation, the maintenance data set contains primarily failure events, each indicating the exact point in time when a machine had a critical failure of a particular type. The intent of Predictive Maintenance is to prevent these events by raising alarms in advance, so that appropriate preventive activities can be carried out.
#### Format
* timestamp
* level (INFO, WARNING, ERROR, CRITICAL)
* machine ID
* code (identifies event/failure type)
* message (contains additional details)
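
As a rough illustration of the record layout above, a prepared log entry could be modelled as a small data class. This is a sketch only: the class name `LogRecord`, the attribute names, and the example values (`M_0001`, `F1`) are hypothetical and not taken from the solution itself.

```python
from dataclasses import dataclass
from datetime import datetime

# Severity levels listed in the format above.
LEVELS = ("INFO", "WARNING", "ERROR", "CRITICAL")


@dataclass(frozen=True)
class LogRecord:
    """One maintenance/failure record (hypothetical container type)."""
    timestamp: datetime
    level: str
    machine_id: str
    code: str       # identifies the event/failure type
    message: str    # additional details

    def __post_init__(self):
        # Reject levels that are not part of the documented format.
        if self.level not in LEVELS:
            raise ValueError("unknown level: {}".format(self.level))


# Made-up example values; real codes and messages depend on the deployment.
example_log = LogRecord(
    timestamp=datetime(2018, 1, 1, 12, 0, 0),
    level="CRITICAL",
    machine_id="M_0001",
    code="F1",
    message="failure",
)
```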
### Telemetry
This data set contains an IoT telemetry stream.
#### Format
* timestamp
* machine ID
* ambient temperature (°C)
* ambient pressure (kPa)
* rotational speed (RPM)
* temperature (°C)
* pressure (kPa)
* vibration signal
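
A single telemetry message can be pictured as a flat JSON object. The sketch below checks that a message carries the expected fields. The field names follow the columns selected later in this notebook; the name of the vibration field is not known from the text, so it is left out, and all values are invented for illustration.

```python
# Field names taken from the column selection later in this notebook.
# The vibration field is omitted because its name is not given.
REQUIRED_FIELDS = (
    "timestamp",
    "machineID",
    "ambient_temperature",
    "ambient_pressure",
    "speed",
    "temperature",
    "pressure",
)


def missing_fields(message):
    """Return the required field names that are absent from a message dict."""
    return [name for name in REQUIRED_FIELDS if name not in message]


# Made-up sample values, for illustration only.
sample_message = {
    "timestamp": "2018-01-01T12:00:00.000000",
    "machineID": "M_0001",
    "ambient_temperature": 20.0,
    "ambient_pressure": 101.0,
    "speed": 1000.0,
    "temperature": 150.0,
    "pressure": 500.0,
}

incomplete = missing_fields({"timestamp": "2018-01-01T12:00:00.000000"})
```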

### Operational settings, operational conditions and performance
Operational settings determine the mode of operation and have a substantial effect on the observed performance and other monitored phenomena. For rotational equipment, one of the possible operational settings is the "desired" speed expressed in rotations per minute (RPM). Other operational settings may come into play in real-world scenarios. (For instance, the Turbofan Engine Degradation Simulation Data Set [1] contains three operational settings.)
Operational conditions define the environment in which equipment is being operated. Weather conditions, location, and characteristics of the operator are some examples. Operational conditions may impact the performance of the equipment and therefore should be taken into account in predictive modeling.
Performance is determined by current operational settings, operational conditions and physical state of the equipment (e.g., wear). In this example, performance is expressed as a set of the following sensor measurements:
* speed (actual)
* temperature
* pressure
* vibration

Depending on the type of the equipment, some sensors will measure useful output, and some the side effects of mechanical or other processes (or, in energy terms, the loss). Most of the time, upcoming failures manifest themselves in a gradually diminishing useful output and increased loss under some or all operational settings; for example, assuming that pressure is considered "useful output," a machine operating at the same speed would generate increasingly less pressure while, possibly, also producing more heat or vibration. Performance measurements often exhibit some complex nonlinear behavior with respect to operational settings, operational conditions and equipment health.
The general idea behind Predictive Maintenance is that different types of impending failures manifest themselves in different ways over time, and that such patterns can be learned given a sufficient amount of collected data.
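
The degradation pattern described above (less useful output and more loss at the same operational setting) can be illustrated with a deliberately simple toy model. Everything here is an assumption made for illustration: the function `simulate_reading`, the `health` parameter, and the linear formulas are hypothetical, and real equipment typically behaves nonlinearly.

```python
def simulate_reading(speed_rpm, health):
    """Toy model of one sensor reading.

    health is 1.0 for new equipment and approaches 0.0 near failure.
    Returns (pressure, temperature); the formulas are invented and linear.
    """
    if not 0.0 <= health <= 1.0:
        raise ValueError("health must be between 0.0 and 1.0")
    # Useful output (pressure) shrinks as the equipment wears.
    pressure = 0.5 * speed_rpm * health
    # Loss (heat) grows as the equipment wears.
    temperature = 20.0 + 0.1 * speed_rpm * (2.0 - health)
    return pressure, temperature


# Same desired speed, progressively worse health.
readings = [simulate_reading(1000.0, h) for h in (1.0, 0.8, 0.6)]
pressures = [p for p, _ in readings]
temperatures = [t for _, t in readings]
```

At a constant speed, `pressures` falls and `temperatures` rises as health declines, which is the kind of trend a predictive model would need to pick up from real telemetry.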

### Dependency Importing and Environment Variable Retrieval

In [None]:
import os
import string
import json
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType, StringType
from pyspark.storagelevel import StorageLevel
from azure.storage.table import TableService

#### Read Environment Variables

In [None]:
# For development purposes only, until environment variables are set
from pathlib import Path
env_config_file_location = str(Path.home()) + "/NotebookEnvironmentVariablesConfig.json"
config_file = Path(env_config_file_location)
if not config_file.is_file():
    # Fall back to the DBFS mount when the file is not in the local home directory
    env_config_file_location = "/dbfs" + str(Path.home()) + "/NotebookEnvironmentVariablesConfig.json"
# Use a context manager so the file handle is closed after reading
with open(env_config_file_location) as f:
    env_variables = json.load(f)["DataIngestion"]

STORAGE_ACCOUNT_SUFFIX = 'core.windows.net'
STORAGE_ACCOUNT_NAME = env_variables["STORAGE_ACCOUNT_NAME"]
STORAGE_ACCOUNT_KEY = env_variables["STORAGE_ACCOUNT_KEY"]
TELEMETRY_CONTAINER_NAME = env_variables["TELEMETRY_CONTAINER_NAME"]
LOG_TABLE_NAME = env_variables["LOG_TABLE_NAME"]
DATA_ROOT = env_variables["DATA_ROOT_FOLDER"]
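
Because every later cell depends on these settings, it can help to fail early when one is missing. The following is a minimal sketch, assuming the same `"DataIngestion"` section and key names used above; the helper `load_ingestion_config` is hypothetical and not part of the solution.

```python
import json

# Keys read from the "DataIngestion" section in the cell above.
REQUIRED_KEYS = (
    "STORAGE_ACCOUNT_NAME",
    "STORAGE_ACCOUNT_KEY",
    "TELEMETRY_CONTAINER_NAME",
    "LOG_TABLE_NAME",
    "DATA_ROOT_FOLDER",
)


def load_ingestion_config(path):
    """Load the "DataIngestion" section and raise if a required key is absent."""
    with open(path) as f:
        section = json.load(f)["DataIngestion"]
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise KeyError("missing configuration keys: " + ", ".join(missing))
    return section
```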

### Setting up Ingested Data Drop Folder
The prepared IoT data is stored in this location for use by the notebooks that follow.

In [None]:
data_dir = DATA_ROOT + '/data'

# TODO: Convert data_dir into an environment variable
# Remove any previous output, then recreate the folder together with its logs subfolder
!rm -rf {data_dir}
!mkdir -p {data_dir}/logs
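
The same reset can also be done from Python without shell commands, which avoids depending on notebook magics. This is a minimal sketch using only the standard library; the helper name `reset_drop_folder` is hypothetical.

```python
import os
import shutil


def reset_drop_folder(data_dir):
    """Delete data_dir if it exists, then recreate it with a logs subfolder."""
    # ignore_errors=True makes this a no-op when the folder does not exist yet.
    shutil.rmtree(data_dir, ignore_errors=True)
    # makedirs creates data_dir and data_dir/logs in one call.
    os.makedirs(os.path.join(data_dir, "logs"))
    return data_dir
```

Note that this deletes everything under `data_dir`, just as the shell commands do, so it should only be pointed at a folder dedicated to this notebook's output.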

### Retrieving telemetry data
The raw data retrieved from the PdM solution storage holds each IoT telemetry message as a byte array in the "Body" column of the dataframe. Each body must be deserialized into a JSON string and then expanded into a separate dataframe for use by FeatureEngineering and ModelTraining.

In [None]:
wasbTelemetryUrl = "wasb://{0}@{1}.blob.{2}/*/*/*/*/*/*/*".format(TELEMETRY_CONTAINER_NAME, 
                                                                  STORAGE_ACCOUNT_NAME, 
                                                                  STORAGE_ACCOUNT_SUFFIX)

sc = SparkSession.builder.getOrCreate()
hc = sc._jsc.hadoopConfiguration()
hc.set("avro.mapred.ignore.inputs.without.extension", "false")
# Register the storage account key only when one is configured
if STORAGE_ACCOUNT_KEY:
    hc.set("fs.azure.account.key.{}.blob.{}".format(STORAGE_ACCOUNT_NAME, STORAGE_ACCOUNT_SUFFIX),
           STORAGE_ACCOUNT_KEY)
sql = SQLContext.getOrCreate(sc)
avroblob = sql.read.format("com.databricks.spark.avro").load(wasbTelemetryUrl)
avroblob.show()
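
The wildcard URL built in the cell above can be factored into a small helper, which makes the nesting depth explicit. This is a sketch: `build_wasb_url` and its `depth` parameter are hypothetical names. The seven wildcard segments presumably match a nested blob layout (for example hub, partition, year, month, day, hour, minute), but that interpretation is an assumption.

```python
def build_wasb_url(container, account, suffix="core.windows.net", depth=7):
    """Build a wasb:// URL that matches blobs nested `depth` folders deep."""
    wildcard_path = "/".join(["*"] * depth)
    return "wasb://{0}@{1}.blob.{2}/{3}".format(container, account, suffix, wildcard_path)


# Made-up container and account names, for illustration only.
example_url = build_wasb_url("telemetry", "mystorageaccount")
```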

### Convert the byte-formatted "Body" of the raw blob data into JSON and expand the result into a new PySpark DataFrame
The output here shows the schema of the telemetry data and a preview of the columns needed for FeatureEngineering and ModelTraining.

In [None]:
# Convert the byte-array "Body" column to a UTF-8 string column
column = avroblob['Body']
string_udf = udf(lambda x: x.decode("utf-8"), StringType())
avroblob = avroblob.withColumn("BodyString", string_udf(column))
avroblob.printSchema()

# Parse the JSON strings into a new DataFrame and keep only the needed columns
telemetry_df = sql.read.json(avroblob.select("BodyString").rdd.map(lambda r: r.BodyString))
subsetted_df = telemetry_df.select(["timestamp", "ambient_pressure", "ambient_temperature", "machineID",
                                    "pressure", "speed", "speed_desired", "temperature"])
subsetted_df.show()
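
The decode-and-expand step above can be sketched without Spark, which is handy for checking the logic on a few sample messages. The helper `decode_bodies` is hypothetical; it assumes each body is a UTF-8 encoded JSON object and fills absent columns with `None`.

```python
import json

# Same columns as the Spark selection above.
COLUMNS = ["timestamp", "ambient_pressure", "ambient_temperature", "machineID",
           "pressure", "speed", "speed_desired", "temperature"]


def decode_bodies(bodies):
    """Decode raw message bodies (bytes) into dicts restricted to COLUMNS."""
    rows = []
    for body in bodies:
        record = json.loads(body.decode("utf-8"))
        # Keep only the needed columns; missing ones become None.
        rows.append({col: record.get(col) for col in COLUMNS})
    return rows


# Made-up sample body, for illustration only.
sample_bodies = [b'{"timestamp": "2018-01-01T12:00:00.000000", "machineID": "M_0001", "speed": 1000.0, "extra": 1}']
decoded_rows = decode_bodies(sample_bodies)
```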

In [None]:
# Cast the timestamp strings to Spark's native timestamp type
reformatted_time_df = subsetted_df.withColumn("timestamp", F.col("timestamp").cast("timestamp"))

reformatted_time_df.printSchema()
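
For reference, the equivalent conversion of a single value in plain Python might look as follows. This is a sketch that assumes the timestamps are strings of the form `%Y-%m-%dT%H:%M:%S.%f`; the helper name and the sample value are invented.

```python
from datetime import datetime

# Assumed layout of the incoming timestamp strings.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def parse_timestamp(text):
    """Parse one telemetry timestamp string into a naive datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


parsed = parse_timestamp("2018-01-01T12:30:15.250000")
```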

### Write dataframe to Parquet format

In [None]:
reformatted_time_df.write.parquet(data_dir+"/telemetry", mode="overwrite")

## Get Logs

In [None]:
# Retrieve the log entities from Azure Table storage
table_service = TableService(account_name=STORAGE_ACCOUNT_NAME, account_key=STORAGE_ACCOUNT_KEY)
tblob = table_service.query_entities(LOG_TABLE_NAME)

### Process log table data into Pandas DataFrame

In [None]:
# Collect the non-DEBUG log entities as plain dicts and build the DataFrame in one step.
# (Row-by-row DataFrame.append is slow and was removed in pandas 2.0.)
rows = []
for row in tblob:
    if row["Level"] == "DEBUG":
        continue
    row_dict = dict(row)
    # Drop the timezone so that all timestamps are timezone-naive
    row_dict["Timestamp"] = pd.Timestamp(row["Timestamp"].replace(tzinfo=None))
    rows.append(row_dict)
log_df = pd.DataFrame(rows)
log_df.head()

### Number of Run-To-Failure Sequences
The number of run-to-failure sequences matters for FeatureEngineering and ModelTraining because these log entries are what the predictive model is trained on. If no failure sequences are logged, training is pointless: the model has no example of what the lead-up to a failure looks like. Do not proceed with the subsequent notebooks if there are no run-to-failure sequences logged.

In [None]:
message_counts = log_df['Message'].value_counts()
if ('failure' in message_counts):
    print("Number of Run-to-Failures:", message_counts['failure'])
else:
    raise ValueError('Run to failure count is 0. Do not proceed.')
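
Note that the check above counts only log entries whose message is exactly `failure`; a message that merely contains the word would not be counted. The same exact-match logic can be sketched in plain Python, with a hypothetical helper `count_failures` and made-up sample messages:

```python
from collections import Counter


def count_failures(messages, label="failure"):
    """Count messages equal to `label`; raise if there are none."""
    counts = Counter(messages)
    if counts[label] == 0:
        raise ValueError("Run to failure count is 0. Do not proceed.")
    return counts[label]


# Made-up messages; only the exact string "failure" is counted.
sample_messages = ["failure", "started", "failure", "failure imminent"]
failure_count = count_failures(sample_messages)
```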

### Select necessary attributes

In [None]:
log_df = log_df[["Timestamp", "Code", "Level", "PartitionKey"]].astype(str)
log_df.columns = ["timestamp", "code","level","machineID"]
log_df.index = log_df['timestamp']
log_df.head()

### Write logs to system storage

In [None]:
# Use the SQLContext created earlier in this notebook (named `sql`)
log_df = sql.createDataFrame(log_df)
log_df.write.parquet(data_dir+"/logs", mode="overwrite")