# Reading Kinesis Data with Spark Streaming

The raw records in Kinesis look like the following. Data is continuously being written to the stream.
```json
{
    "ordertime": 1573573055,
    "orderid": 23,
    "itemid": "Item_1231231",
    "orderunits": 15,
    "address": {
        "city": "City_a",
        "state": "State_xxx",
        "zipcode": 10000
    }
}
```

The end goals are:
1. Land the Kinesis data in S3.
2. Partition the output by the **ordertime** field, for example `ordertime=2019101123/`.
3. Flatten the data written to S3. The target format is:
```json
{
    "ordertime": 1573573055,
    "orderid": 23,
    "itemid": "Item_1231231",
    "orderunits": 15,
    "address_city": "City_a",
    "address_state": "State_xxx",
    "address_zipcode": 10000
}
```
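
The flattening described in goal 3 can be sketched in plain Python, independent of Spark. This is only a minimal illustration, not the notebook's implementation: the helper name `flatten` and its `sep` parameter are assumptions, chosen so that nested keys come out as `address_city`, matching the target format.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Recursively flatten nested dicts, joining keys with `sep` (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Descend into nested objects such as "address".
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


raw = json.loads("""
{
    "ordertime": 1573573055,
    "orderid": 23,
    "itemid": "Item_1231231",
    "orderunits": 15,
    "address": {"city": "City_a", "state": "State_xxx", "zipcode": 10000}
}
""")

flat = flatten(raw)
print(json.dumps(flat, indent=4))
```

A recursive helper like this keeps working if more nested fields are added later, whereas spelling out each field by hand has to be updated whenever the record changes.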


## Importing the external package

I am not sure whether the import method below is correct. It is based on this [Stack Overflow](https://stackoverflow.com/questions/49302862/how-to-configure-jupyter-configure-with-multiple-packages) question.

In [None]:
%%configure -f
{ "conf": {"spark.jars.packages": "org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.0" }}

## Initializing the Spark Context



In [None]:
# import os
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.2 pyspark-shell'

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
import time
from pyspark.sql.types import Row
from pyspark.sql import SparkSession

In [None]:
%%info

In [None]:
def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]


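def convertTime(timestamp):
    # Format epoch seconds as an hourly partition key (YYYYMMDDHH).
    # time.localtime uses the time zone of the machine running this code.
    return time.strftime("%Y%m%d%H", time.localtime(timestamp))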


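def save2s3(rdd):
    # Skip empty micro-batches so that no empty output is written.
    if rdd.isEmpty():
        return
    spark = getSparkSessionInstance(rdd.context.getConf())
    # Each element is the flattened 8-tuple produced by the map step further down.
    rowRdd = rdd.map(lambda w: Row(ordertime=w[0], ordertime_raw=w[1], orderid=w[2], itemid=w[3],
                                   orderunits=w[4], address_city=w[5], address_state=w[6], address_zipcode=w[7]))
    df = spark.createDataFrame(rowRdd)
    # In Spark 2.x, Row(**kwargs) sorts fields by name, so select() restores the intended column order.
    resultDF = df.select(df.ordertime, df.ordertime_raw, df.orderid, df.itemid, df.orderunits,
                         df.address_city, df.address_state, df.address_zipcode)
    resultDF.createOrReplaceTempView("resultDF")
    # Append to S3, one directory per hourly ordertime key.
    # Note: this writes CSV; DataFrameWriter.json(...) would produce JSON like the target format.
    resultDF.write.partitionBy("ordertime").csv(path="s3n://shiheng-poc/maweijun/data2", mode="append")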
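
Because `convertTime` relies on `time.localtime`, the same `ordertime` can fall into different partitions on machines configured with different time zones. Below is a minimal sketch of a time-zone-independent variant, assuming UTC-based partitions are acceptable. The function name `convert_time_utc` is hypothetical and not part of the notebook.

```python
import time


def convert_time_utc(timestamp):
    """Format epoch seconds as a YYYYMMDDHH partition key in UTC (hypothetical helper)."""
    return time.strftime("%Y%m%d%H", time.gmtime(timestamp))


# The sample record's ordertime, turned into a partition directory name.
key = convert_time_utc(1573573055)
partition_dir = "ordertime=" + key + "/"
print(partition_dir)
```

If the partitions are meant to follow local business hours instead, keeping `time.localtime` and setting the cluster time zone explicitly is the other option.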


In [None]:
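# Assumes the notebook kernel provides `spark` as a SparkSession.
sc = spark.sparkContext

# spark = getSparkSessionInstance(rdd.context.getConf())
ssc = StreamingContext(sc, 60)  # 60-second batch interval: data is saved once per batch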
streamName = 'shiheng-orders'
appName = 'SpartStreamingShiheng'
endpointUrl = "https://kinesis.cn-northwest-1.amazonaws.com.cn"
regionName = "cn-northwest-1"

In [None]:

dstream = KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, regionName, InitialPositionInStream.LATEST, 2)


In [None]:
def flattenRecord(x):
    # Parse each Kinesis record once and flatten it into the 8-tuple expected by save2s3.
    o = json.loads(x)
    addr = o["address"]
    return (convertTime(o["ordertime"]), o["ordertime"], o["orderid"], o["itemid"],
            o["orderunits"], addr["city"], addr["state"], addr["zipcode"])

py_rdd_stream = dstream.map(flattenRecord)

py_rdd_stream.foreachRDD(save2s3)

ssc.start()
ssc.awaitTermination()