# Reading Kinesis Data with Spark Streaming (Scala)

Raw records in Kinesis look like the following. New data is continuously being written to the stream.
```json
{
    "ordertime": 1573573055,
    "orderid": 23,
    "itemid": "Item_1231231",
    "orderunits": 15,
    "address": {
        "city": "City_a",
        "state": "State_xxx",
        "zipcode": 10000
    }
}
```

The end goals are:
1. Land the Kinesis data in S3.
2. Partition the output by the **ordertime** field, for example `ordertime=2019101123/`.
3. Flatten the data written to S3. The target format is:
```json
{
    "ordertime": 1573573055,
    "orderid": 23,
    "itemid": "Item_1231231",
    "orderunits": 15,
    "address_city": "City_a",
    "address_state": "State_xxx",
    "address_zipcode": 10000
}
```

## Adding the project dependencies

* In Jupyter, use `%%configure` to reference the external packages.
* In spark-shell, the command is `spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.2`
* To submit with spark-submit, the command is `spark-submit --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.2 s3://<path-to-the-script-file-in-s3>`

In [None]:
%%configure -f
{
    "conf": {
        "spark.jars": "s3://shiheng-poc/jars/.ivy2/jars/org.apache.spark_spark-streaming-kinesis-asl_2.11-2.4.0.jar,s3://shiheng-poc/jars/.ivy2/jars/com.amazonaws_aws-java-sdk-core-1.11.271.jar,s3://shiheng-poc/jars/.ivy2/jars/com.amazonaws_aws-java-sdk-s3-1.11.271.jar,s3://shiheng-poc/jars/.ivy2/jars/com.amazonaws_amazon-kinesis-client-1.8.10.jar"
    }
}
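
If the job is packaged as a jar for spark-submit instead of being run from a notebook, the same dependency could be declared in a build file. The following is only a sketch and assumes an sbt build, which the original setup does not mention. The versions are copied from the `--packages` coordinate above and should be matched to the Spark and Scala versions on your cluster.

```scala
// build.sbt (sketch; an sbt build is an assumption, not part of the original setup)
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Provided by the cluster at runtime, so not bundled into the jar.
  "org.apache.spark" %% "spark-streaming" % "2.4.2" % "provided",
  "org.apache.spark" %% "spark-sql"       % "2.4.2" % "provided",
  // Kinesis connector, same coordinate as in the --packages option above.
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.2"
)
```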

In [None]:
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming._
import scala.util.parsing.json._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf} // used by flattenDataframe and unixToDT below
import org.apache.spark.sql.types.{StructType,ArrayType}
import org.joda.time.DateTime

val appName     = "shiHengKinesisSparkApp"
val streamName  = "shiheng-orders"
val endpointUrl = "https://kinesis.cn-northwest-1.amazonaws.com.cn"
val regionName  = "cn-northwest-1"

## Defining the data transformation

The following method flattens nested JSON.

In [None]:

def flattenDataframe(df: DataFrame): DataFrame = {
    val fields = df.schema.fields
    val fieldNames = fields.map(_.name)
    // Flatten the first array or struct column found, then recurse until none remain.
    for (field <- fields) {
        val fieldName = field.name
        field.dataType match {
            case _: ArrayType =>
                // Explode the array into one row per element; the other columns are kept as-is.
                val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
                val fieldNamesAndExplode = fieldNamesExcludingArray ++ Array(s"explode_outer($fieldName) as $fieldName")
                val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
                return flattenDataframe(explodedDf)
            case structType: StructType =>
                // Promote each child to a top-level column named <parent>_<child>.
                val childFieldNames = structType.fieldNames.map(childName => s"$fieldName.$childName")
                val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
                val renamedCols = newFieldNames.map(name => col(name).as(name.replace(".", "_")))
                val flattenedDf = df.select(renamedCols: _*)
                return flattenDataframe(flattenedDf)
            case _ => // scalar column: nothing to flatten
        }
    }
    df
}
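
To see the naming rule without a cluster, the struct-flattening step can be imitated on plain Scala maps. This is a minimal sketch, not the DataFrame code itself: `FlattenSketch`, `flatten`, and `sample` are names made up for illustration, and arrays are not handled.

```scala
object FlattenSketch {
  // Recursively flatten nested maps, joining parent and child keys with "_",
  // which mirrors how flattenDataframe renames struct children.
  def flatten(record: Map[String, Any], prefix: String = ""): Map[String, Any] =
    record.flatMap {
      case (key, child: Map[_, _]) =>
        flatten(child.asInstanceOf[Map[String, Any]], s"$prefix${key}_")
      case (key, value) =>
        Map(s"$prefix$key" -> value)
    }

  // The sample order record from the top of this notebook.
  val sample: Map[String, Any] = Map(
    "ordertime"  -> 1573573055L,
    "orderid"    -> 23,
    "itemid"     -> "Item_1231231",
    "orderunits" -> 15,
    "address"    -> Map("city" -> "City_a", "state" -> "State_xxx", "zipcode" -> 10000)
  )

  def main(args: Array[String]): Unit =
    flatten(sample).toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }
}
```

The flattened keys are `address_city`, `address_state`, and `address_zipcode` alongside the untouched top-level fields, which matches the target format shown earlier.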



Define a UDF that converts a Unix timestamp to the `yyyyMMddHH` format.

In [None]:
// Convert epoch seconds to a yyyyMMddHH hour bucket (uses the JVM's default time zone).
val unixToDT = udf { (ordertime: Long) => new DateTime(ordertime * 1000).toString("yyyyMMddHH").toLong }
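
The conversion can be checked outside Spark. The sketch below uses `java.time` rather than Joda-Time and pins the zone to UTC; both are assumptions made for illustration, since the UDF above relies on the JVM's default time zone. `HourBucketSketch` and `toHourBucket` are hypothetical names.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object HourBucketSketch {
  private val hourFormat = DateTimeFormatter.ofPattern("yyyyMMddHH")

  // Epoch seconds -> yyyyMMddHH as a Long, evaluated in UTC.
  def toHourBucket(epochSeconds: Long): Long =
    Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).format(hourFormat).toLong

  def main(args: Array[String]): Unit =
    // The sample ordertime 1573573055 is 2019-11-12 15:37:35 UTC.
    println(toHourBucket(1573573055L)) // prints 2019111215
}
```

On a cluster whose default zone is not UTC, the UDF will produce a different hour for the same record, so it is worth deciding which zone the partition values should reflect.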

Set up authorization for the Kinesis client. The code below looks up credentials through `DefaultAWSCredentialsProviderChain`.

In [None]:
val credentials = new DefaultAWSCredentialsProviderChain().getCredentials()
require(credentials != null, "No AWS credentials found. Please specify credentials using one of the methods specified " +
  "in http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.html")
val kinesisClient = new AmazonKinesisClient(credentials)
kinesisClient.setEndpoint(endpointUrl)
val numShards = kinesisClient.describeStream(streamName).getStreamDescription.getShards().size()
val numStreams = numShards

// Spark Streaming interval
val batchInterval = Milliseconds(60000)
val kinesisCheckpointInterval = batchInterval

Create the StreamingContext and read the data from Kinesis.

In [None]:
val ssc = new StreamingContext(sc, batchInterval)
// ssc.checkpoint("/tmp/checkpoint") // no checkpoint can be set, or you'll get error

val kinesisStreams = (0 until numStreams).map { i =>
  KinesisInputDStream.builder
    .streamingContext(ssc)
    .streamName(streamName)
    .endpointUrl(endpointUrl)
    .regionName(regionName)
    .initialPosition(new Latest())
    .checkpointAppName(appName)
    .checkpointInterval(kinesisCheckpointInterval)
    .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
    .build()
}

val unionStreams = ssc.union(kinesisStreams)
val sqlContext = SparkSession.builder().getOrCreate()

Write the output to S3, partitioned by order time (the derived `ot` column).

In [None]:
unionStreams.foreachRDD ((rdd: RDD[Array[Byte]], time: Time) => {
  println("**********************************************")
  println(rdd.count)
  println("rdd isempty:" + rdd.isEmpty)
  if (rdd.isEmpty()) {
    println("No data input!!!")
  } else {
    // Decode explicitly as UTF-8 rather than relying on the platform default charset.
    val lines   = rdd.map(byteArray => new String(byteArray, "UTF-8"))
    val linesDF = sqlContext.read.json(lines)
    // linesDF.show()
    // linesDF.printSchema
    val linesFlattenDF = flattenDataframe(linesDF)
    val linesResultDF  = linesFlattenDF.withColumn("ot", unixToDT(linesFlattenDF("ordertime")))
    // linesResultDF.show()
    
    // Use append mode; otherwise later batches fail with `AnalysisException: path ... already exists`.
    linesResultDF.write.mode("append").partitionBy("ot").csv("s3://shiheng-poc/ss-kinesis-scala/")
  }
  println("**********************************************")
})
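
`partitionBy("ot")` produces Hive-style directories, one per distinct `ot` value, and the partition value lives in the directory name rather than in the data files. The sketch below only imitates that layout in plain Scala to show which rows end up where. The base path `s3://example-bucket/out/`, the sample rows, and all names in `PartitionLayoutSketch` are hypothetical.

```scala
object PartitionLayoutSketch {
  // Hypothetical flattened rows: (orderid, ot), where ot is the yyyyMMddHH bucket.
  val rows: Seq[(Int, Long)] = Seq((23, 2019111215L), (24, 2019111215L), (25, 2019111216L))

  // Hive-style partition directory for one ot value under a base path.
  def partitionDir(basePath: String, ot: Long): String = {
    val base = basePath.stripSuffix("/")
    s"$base/ot=$ot/"
  }

  // Group order ids by the directory they would be written to.
  def layout(basePath: String): Map[String, Seq[Int]] =
    rows.groupBy { case (_, ot) => partitionDir(basePath, ot) }
        .map { case (dir, grouped) => dir -> grouped.map(_._1) }

  def main(args: Array[String]): Unit =
    layout("s3://example-bucket/out/").toSeq.sortBy(_._1).foreach {
      case (dir, ids) => println(s"$dir -> orders ${ids.mkString(", ")}")
    }
}
```

Because every batch is written in append mode, each run adds new part files under the matching `ot=<value>/` directory instead of replacing earlier ones.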

Start the streaming job.

In [None]:
ssc.start()
ssc.awaitTermination()

## Stopping the application

1. Log in to the EMR master node.
2. Run `yarn application -kill <application-id>`.

## Wrapping up

Open S3 and check the files that were generated.