# Reading Kinesis Data with Spark Structured Streaming

The raw data in Kinesis looks like the following. New records are continuously being written to the stream.
```json
{
    "ordertime": 1573573055,
    "orderid": 23,
    "itemid": "Item_1231231",
    "orderunits": 15,
    "address": {
        "city": "City_a",
        "state": "State_xxx",
        "zipcode": 10000
    }
}
```

The end goals are:
1. Land the data from Kinesis in S3.
2. Partition the output by the **ordertime** field, for example `ordertime=2019101123/`.
3. Flatten the data written to S3. The target format is:
```json
{
    "ordertime": 1573573055,
    "orderid": 23,
    "itemid": "Item_1231231",
    "orderunits": 15,
    "address_city": "City_a",
    "address_state": "State_xxx",
    "address_zipcode": 10000
}
```
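
The partition value in the example (`2019101123`) looks like an hour bucket in `yyyyMMddHH` form, while `ordertime` itself is epoch seconds. A minimal sketch of that conversion in plain Scala follows. The helper name `hourPartition` and the choice of UTC are assumptions made for illustration, not something the source specifies.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hour-bucket pattern inferred from the example partition value.
// UTC is an assumption; use the zone your partitions should follow.
val hourFormat = DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC)

// Hypothetical helper: epoch seconds -> "yyyyMMddHH".
def hourPartition(epochSeconds: Long): String =
  hourFormat.format(Instant.ofEpochSecond(epochSeconds))

// The sample record's ordertime falls on 2019-11-12 15:37:35 UTC.
println(s"ordertime=${hourPartition(1573573055L)}/") // prints "ordertime=2019111215/"
```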


## Initialize the Spark Context

This JAR is the Structured Streaming implementation for Kinesis, based on the open-source project [kinesis-sql](https://github.com/qubole/kinesis-sql).

In [None]:
%%configure -f
{
    "conf": {
        "spark.jars": "s3://shiheng-poc/jars/spark-sql-kinesis_2.11-2.4.0.jar"
    }
}

In [None]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.types._

val spark = SparkSession.
builder.
appName("ShiHengSparkStructuredSparking").
getOrCreate()

## Define the JSON Flattening Function

This is mainly based on [How to flatten JSON in Spark Dataframe](https://www.24tutorials.com/spark/flatten-json-spark-dataframe/).

In [None]:
import org.apache.spark.sql.DataFrame

// Recursively flattens a DataFrame: array columns are exploded into rows,
// and struct fields become top-level columns named parent_child.
def flattenDataframe(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  val fieldNames = fields.map(_.name)

  for (field <- fields) {
    val fieldName = field.name
    field.dataType match {
      case _: ArrayType =>
        // Explode the array (explode_outer keeps rows whose array is null or empty), then recurse.
        val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
        val fieldNamesAndExplode = fieldNamesExcludingArray ++ Array(s"explode_outer($fieldName) as $fieldName")
        return flattenDataframe(df.selectExpr(fieldNamesAndExplode: _*))
      case structType: StructType =>
        // Promote each child field to a top-level column, replacing "." with "_", then recurse.
        val childFieldNames = structType.fieldNames.map(childName => s"$fieldName.$childName")
        val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
        val renamedCols = newFieldNames.map(x => col(x).as(x.replace(".", "_")))
        return flattenDataframe(df.select(renamedCols: _*))
      case _ => // Primitive column: nothing to flatten.
    }
  }
  df
}
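
The naming rule used for structs (parent and child joined with `_`) can be seen in isolation on plain Scala maps. This is not the Spark implementation, only the same recursion applied to `Map[String, Any]`; `flattenKeys` and `order` are hypothetical names introduced for the illustration.

```scala
// Recursively flatten nested maps, joining parent and child keys with "_".
def flattenKeys(record: Map[String, Any], prefix: String = ""): Map[String, Any] =
  record.flatMap {
    case (key, nested: Map[_, _]) =>
      // Nested map: recurse with the parent key as the new prefix.
      flattenKeys(nested.asInstanceOf[Map[String, Any]], s"$prefix${key}_")
    case (key, value) =>
      // Leaf value: emit it under the prefixed key.
      Map(s"$prefix$key" -> value)
  }

// A trimmed-down version of the sample order record.
val order: Map[String, Any] = Map(
  "orderid" -> 23,
  "address" -> Map("city" -> "City_a", "zipcode" -> 10000)
)

println(flattenKeys(order))
```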

## Define the Schema of the Kinesis Data

The `data` field of each record read from Kinesis is base64-decoded, and the decoded content is a standard JSON document. Its schema is as follows.

In [None]:
val orderSchema = new StructType().
add("ordertime", StringType).
add("orderid", IntegerType).
add("itemid", StringType).
add("orderunits", IntegerType).
add("address", new StructType().
    add("city", StringType).
    add("state", StringType).
    add("zipcode", IntegerType)
   )

// Convert the byte array to a UTF-8 string
val b2String = udf((payload: Array[Byte]) => new String(payload, java.nio.charset.StandardCharsets.UTF_8))

## Read the Data from Kinesis


The raw records read from Kinesis have the following format. The **data** field holds the actual payload and is base64-encoded.
```
{
    "data": "base64 encoded content",
    "streamName": "stream-name",
    "partitionKey": "kinesis-partition-key",
    "sequenceNumber": "seq-number",
    "approximateArrivalTimestamp": 1573573055
}
```
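
To make the relationship between the base64 text and the JSON payload concrete, here is a small round trip in plain Scala. The sample payload is hypothetical. Note also that the `b2String` UDF above takes an `Array[Byte]`, which suggests the connector already hands the payload over as raw bytes, so the base64 step shown here would only matter when working from the JSON view of a record.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

// Hypothetical sample payload, used only to illustrate the encoding.
val json = """{"orderid": 23, "itemid": "Item_1231231"}"""

// What the "data" field looks like in the JSON view of a record.
val encoded = Base64.getEncoder.encodeToString(json.getBytes(StandardCharsets.UTF_8))

// Decoding yields the raw bytes, which are then read as a UTF-8 string.
val decoded = new String(Base64.getDecoder.decode(encoded), StandardCharsets.UTF_8)

println(decoded)
```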

In [None]:
val kinesis = spark.readStream.
format("kinesis").
option("streamName", "shiheng-orders").
option("endpointUrl", "https://kinesis.cn-northwest-1.amazonaws.com.cn").
option("startingPosition", "LATEST").
option("maxFetchDuration", "30s").
option("fetchBufferSize", "100mb").
load

kinesis.printSchema()

In [None]:
val records = kinesis.select(from_json(b2String('data), orderSchema) as 'root)

In [None]:
val records_2 = flattenDataframe(records.select("root.*"))

In [None]:
records_2.printSchema

In [None]:
val result = kinesis.selectExpr("lcase(CAST(data as STRING)) as word").groupBy($"word").count()
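
The cells above stop short of the stated goal of landing the flattened data in S3. One way the sink could be wired up is sketched below, assuming the `records_2` DataFrame from the earlier cell. The bucket paths are placeholders, and `ordertime_hour` is a hypothetical column name. A separate partition column is used because `partitionBy` removes the partition column from the written files, so partitioning on `ordertime` itself would drop that field from the data. The directories then come out as `ordertime_hour=.../` rather than `ordertime=.../`.

```scala
import org.apache.spark.sql.functions.{col, from_unixtime}
import org.apache.spark.sql.streaming.Trigger

// Derive an hour bucket (yyyyMMddHH) from the epoch-seconds ordertime.
// "ordertime_hour" is a hypothetical name; from_unixtime formats in the session time zone.
val partitioned = records_2.
  withColumn("ordertime_hour", from_unixtime(col("ordertime").cast("long"), "yyyyMMddHH"))

// Placeholder paths: replace with a real bucket and prefix.
val query = partitioned.writeStream.
  format("json").
  partitionBy("ordertime_hour").
  option("path", "s3://your-bucket/orders/").
  option("checkpointLocation", "s3://your-bucket/checkpoints/orders/").
  trigger(Trigger.ProcessingTime("1 minute")).
  outputMode("append").
  start()
```

The file sink only supports append mode and needs a checkpoint location so the query can resume from where it left off after a restart.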