### Prerequisites:
Before running this notebook, you will have to:
1. Download the CSV file named `dht_1k.csv`, stored under https://github.com/IBMProjectEventStore/db2eventstore-IoT-Analytics/tree/master/data.
2. Go to the `Project tab` and load the CSV file into the current project as a dataset.
----
**Note: This notebook runs only on Python 3.0 or later.**

In [None]:
from eventstore.oltp import EventContext
from eventstore.sql import EventSession
from eventstore.common import ConfigurationReader
from pyspark.sql import SparkSession

ConfigurationReader.setEventUser("admin")
ConfigurationReader.setEventPassword("password")

In [None]:
#sparkSession = SparkSession.builder.config('spark.jars', './spark-time-series-sql.jar').appName("EventStore SQL in Python").getOrCreate()
sparkSession = SparkSession.builder.appName("EventStore SQL in Python").getOrCreate()
eventSession = EventSession(sparkSession.sparkContext, "EVENTDB")
eventSession.set_query_read_option("SnapshotNow")
eventSession.open_database()
ctx = EventContext.get_event_context("EVENTDB")

In [None]:
eventSession._jvm.org.apache.spark.sql.types.SqlGeometry.registerAll(eventSession._jsparkSession)

In [None]:
from eventstore.catalog import TableSchema
from pyspark.sql.types import *

In [None]:
table_names = ctx.get_names_of_tables()
for idx, name in enumerate(table_names):
    print(idx, name)

In [None]:
from datetime import datetime

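def datetime_converter(datetime_string):
    """Convert a UTC timestamp string (YYYY-MM-DDTHH:MM:SS.000Z) to epoch seconds."""
    # Drop the '.000Z' suffix, then parse the remainder as a naive UTC datetime
    utc_time = datetime.strptime(datetime_string.split('.000Z')[0], "%Y-%m-%dT%H:%M:%S")

    # Whole seconds elapsed since the Unix epoch
    return int((utc_time - datetime(1970, 1, 1)).total_seconds())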

In [None]:
# Define table schema to be created
with EventContext.get_event_context("EVENTDB") as ctx:
    schema = StructType([
        StructField("sensor_id", IntegerType(), nullable = False),
        StructField("timestamp", IntegerType(), nullable = False),
        StructField("location", IntegerType(), nullable = False),
        StructField("humidity", FloatType(), nullable = True),
        StructField("temperature", FloatType(), nullable = False),
        StructField("LAT", FloatType(), nullable = False),
        StructField("LON", FloatType(), nullable = False),
        StructField("sensor_type", StringType(), nullable = False)
    ])  
    table_schema = TableSchema("dht_full_table", schema,
                                sharding_columns=["sensor_id"],
                                pk_columns=["timestamp","sensor_id"])

In [None]:
## Create the table and load data for DHT

In [None]:
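# Optional: uncomment to drop an existing table before recreating it
# try:
#     ctx.drop_table("dht_full_table")
# except Exception as error:
#     print(error)

# Create the table; print any error (for example, if the table already exists) and continue
try:
    ctx.create_table(table_schema)
except Exception as error:
    print(error)

# List the tables to confirm that dht_full_table is present
table_names = ctx.get_names_of_tables()
for name in table_names:
    print(name)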

In [None]:
dht_table = eventSession.load_event_table("dht_full_table")

In [None]:
# Ingest data into the table
import os
resolved_table_schema = ctx.get_table("dht_full_table")
print(resolved_table_schema)
with open(os.environ['DSX_PROJECT_DIR']+'/datasets/dht_1k.csv') as f:
    f.readline()  # skip the header row
    content = f.readlines()
# Split each row into fields, dropping the trailing line ending
content = [line.rstrip("\r\n").split(",") for line in content]
# Map CSV columns (by position) to the table columns
batch = [dict(sensor_id=int(c[5]), timestamp=datetime_converter(c[7]), location=int(c[0]),
              humidity=float(c[2]), temperature=float(c[1]), lat=float(c[3]), lon=float(c[4]),
              sensor_type=str(c[6])) for c in content]
ctx.batch_insert(resolved_table_schema, batch)

In [None]:
# Verify the ingested result by counting the rows in the table
dht_table = eventSession.load_event_table("dht_full_table")
dht_table.count()

In [None]:
dht_table.createOrReplaceTempView("dht_full_table")

In [None]:
eventSession.sql("select count(*) from dht_full_table").show()

## Objective: Group Sensors into Geohashes Using SQL with ST Support
### Use SQL
We use SQL for this step because the volume of raw data can be huge: there are many sensors, and each sensor produces many readings per day. Running the aggregation in SQL avoids pulling the entire raw dataset into the notebook, which could exhaust memory. Where possible, run spatial operations in SQL first, as a preprocessing step that reduces the complexity and volume of the data.
### SQL with ST Support
The key part of this query is the ST support:
- `ST_Point(lon, lat)` creates a spatial ST_Point object from the longitude and latitude in the raw data.
- `ST_ContainingGeohash(ST_Point, distance_buffer)` encodes the point into its geohash.

Everything else is ordinary SQL:
- We select the geohash and humidity from the raw dataset.
- We group the readings by geohash and calculate the average humidity for each geohash.

For a full list of geospatial functions available on Cloud SQL Query, [click here](https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.analytics.doc/doc/geo_functions.html)
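
To make "encoding a point into its geohash" more concrete, here is a minimal pure-Python sketch of the widely used base-32 geohash scheme. It only illustrates the general idea. The function name `geohash_encode` and its `precision` parameter are hypothetical, and the value returned by `ST_ContainingGeohash` may use a different representation (it also takes a distance buffer, which this sketch ignores).

```python
# Illustrative sketch of standard base-32 geohash encoding.
# Assumption: this is NOT the Event Store implementation; it only shows
# how a (lat, lon) pair is mapped to the identifier of a grid cell.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    # Pack every 5 bits into one base-32 character
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for bit in bits[i:i + 5]:
            value = (value << 1) | bit
        chars.append(BASE32[value])
    return "".join(chars)

# A longer hash names a smaller cell nested inside the shorter one
coarse = geohash_encode(57.64911, 10.40744, precision=2)
fine = geohash_encode(57.64911, 10.40744, precision=6)
print(coarse, fine)
```

Because each extra character subdivides the previous cell, two sensors that share a geohash prefix fall in the same area at that resolution, which is the property the grouping query relies on.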

In [None]:
stmt = """
    SELECT geohash, AVG(humidity) as avg_h
    FROM(
        SELECT cast(ST_ContainingGeohash(ST_Point(lon, lat), 300) as string) as geohash, humidity
        FROM dht_full_table
    )
    GROUP BY geohash
"""

eventSession.sql(stmt).createOrReplaceTempView("dht_spatial_agg_table")

Objectives:

- The humidity sensors are spread across discrete locations, so we group them into areas based on location; this lets us report humidity per area rather than per point. To achieve this we use geohashes: each geohash represents a grid cell on the earth, so we compute the geohash for each sensor location and group the readings by geohash.
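
The grouping idea can be sketched in plain Python on a handful of readings. This is a simplified stand-in, not the notebook's method: it snaps each point to a fixed latitude/longitude grid instead of computing a geohash, the helper names `cell_key` and `average_by_cell` are hypothetical, and the sample readings are made up for illustration.

```python
# Sketch: group point readings into grid cells and average them per cell.
# Assumption: a fixed-size degree grid stands in for geohash cells.
import math
from collections import defaultdict

def cell_key(lat, lon, cell_deg=0.01):
    # Identify the grid cell that contains the point
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def average_by_cell(readings, cell_deg=0.01):
    # readings: iterable of (lat, lon, humidity) tuples
    sums = defaultdict(lambda: [0.0, 0])
    for lat, lon, humidity in readings:
        acc = sums[cell_key(lat, lon, cell_deg)]
        acc[0] += humidity
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Made-up sample readings: the first two fall in the same cell
readings = [
    (48.1371, 11.5754, 40.0),
    (48.1375, 11.5758, 60.0),
    (52.5205, 13.4055, 55.0),
]
averages = average_by_cell(readings)
for key, avg_h in averages.items():
    print(key, avg_h)
```

The SQL query does the same thing at scale: the inner `SELECT` assigns each reading a cell identifier, and `GROUP BY` with `AVG` collapses each cell to one value.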

In [None]:
eventSession.sql("select * from dht_spatial_agg_table").show()

In [None]:
# Decode each geohash to the centre of its bounding box, then sort by average humidity (highest first)
decode_stmt = """
    SELECT geohash, ST_X(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) as lon, 
    ST_Y(ST_BoundingBoxCenter(ST_Envelope(ST_GeohashDecode(geohash)))) AS lat, avg_h
    FROM dht_spatial_agg_table
    ORDER BY avg_h desc
"""

eventSession.sql(decode_stmt).createOrReplaceTempView("dht_spatial_decoded_agg_table")

In [None]:
df_spark = eventSession.sql("select * from dht_spatial_decoded_agg_table")

In [None]:
df_spark.show()
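
The decode query above reduces each geohash to the centre of its bounding box. As a rough illustration of that step, the sketch below decodes a standard base-32 geohash string in pure Python. The function name `geohash_center` is hypothetical, and it assumes the common base-32 encoding, which may differ from the string form produced by casting the `ST_ContainingGeohash` result.

```python
# Illustrative sketch: decode a standard base-32 geohash to its cell centre.
# Assumption: this is NOT the Event Store implementation of ST_GeohashDecode.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_center(geohash):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    for ch in geohash:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            if use_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            use_lon = not use_lon
    # Return (lon, lat), matching the ST_X / ST_Y column order in the query
    return (lon_lo + lon_hi) / 2, (lat_lo + lat_hi) / 2

lon, lat = geohash_center("u4")
print(lon, lat)
```

The centre is only a representative point for the cell, so the decoded coordinates locate the area with the highest average humidity rather than any individual sensor.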