# PySpark: Basic Spark Streaming Operations
## Introduction
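Streaming data has the following characteristics:
- Data arrives quickly and continuously, and its total size is potentially unbounded.
- Data comes from many sources and in complex formats.
- Data volume is large, but storage is not the main concern: once processed, data is either discarded or archived.
- The overall value of the data matters more than any individual record.
- Data may arrive out of order or incomplete, and the system cannot control the order in which new elements arrive.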

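Stream computing (the value of data decays as time passes):  
Massive data from different sources is acquired in real time and analyzed in real time to extract valuable information.  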

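Stream computing pipeline (with an emphasis on real-time behavior):  
Real-time data collection ---> real-time data computation ---> real-time query service  
- Real-time data collection: this stage usually gathers massive data from multiple sources and must be real-time, low-latency, stable, and reliable.
- Real-time data computation: this stage analyzes and computes on the collected data in real time and returns real-time results.
- Real-time query service: results produced by the stream computing framework are available to users for real-time querying, display, or storage.  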

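A stream processing system differs from a traditional data processing system in the following ways:
- A stream processing system works on live data, whereas a traditional system works on static data that was stored in advance.
- A stream processing system gives users real-time results, whereas a traditional system gives results as of some moment in the past.
- A stream processing system does not need the user to issue a query; the real-time query service can push results to the user proactively.  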

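Spark Streaming:  
Spark Streaming can integrate many input sources, such as Kafka, Flume, HDFS, and even plain TCP sockets. Processed data can be written to file systems or databases, or shown on dashboards.  
![](./imgs/ssc.png)  
How Spark Streaming works:  
It splits the live input stream into time slices (on the order of seconds), and the Spark engine then processes each slice in a batch-like fashion (<font color = "red">pseudo-real-time</font>).  
![](./imgs/ssc_pro.png)  
The main abstraction in Spark Streaming:  
A DStream (Discretized Stream) represents a continuous stream of data. Internally, Spark Streaming cuts the input into segments by time slice (for example, 1 second) and turns each segment into a Spark RDD. The sequence of these RDDs is the DStream, and every operation on a DStream is ultimately translated into operations on the underlying RDDs.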
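
To make the time-slice idea concrete, here is a minimal plain-Python sketch (it does not use Spark). The `discretize` helper, the `(timestamp, value)` event format, and the sample events are assumptions made for illustration; Spark Streaming performs this slicing internally.

```python
from collections import defaultdict

def discretize(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive time slices.

    Batch k holds the values whose timestamp falls in
    [k * batch_interval, (k + 1) * batch_interval).
    """
    if not events:
        return []
    slices = defaultdict(list)
    for timestamp, value in events:
        slices[int(timestamp // batch_interval)].append(value)
    last = max(slices)
    # A slice with no data still yields an (empty) batch.
    return [slices.get(k, []) for k in range(last + 1)]

# Hypothetical events: (arrival time in seconds, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1)
for k, batch in enumerate(batches):
    # Each batch would be handled with ordinary batch logic, like one RDD.
    print(k, batch, len(batch))
```

Each inner list plays the role of one RDD in the DStream, including the empty one for the slice in which nothing arrived.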

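## Basic Operations
The basic steps for writing a Spark Streaming program are:  
1. Define the input source by creating an input DStream.  
2. Define the stream computation by applying transformations and output operations to the DStream.  
3. Call streamingContext.start() to start receiving and processing data.  
4. Call streamingContext.awaitTermination() to wait for processing to end (stopped manually or because of an error).  
5. Optionally call streamingContext.stop() to end the stream computation manually.  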

### RDD Queue Stream

In [24]:
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql.session import SparkSession
import time

# Environment setup
conf = SparkConf().setAppName("RDD Queue").setMaster("local")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)

In [64]:
%%writefile rdd_queue.py
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
import time

# Environment setup: local master, 10-second batch interval
conf = SparkConf().setAppName("RDD Queue").setMaster("local")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)

# Build the queue of RDDs; the i-th RDD holds 0..i-1 in 10 partitions
rddQueue = []
for i in range(5):
    rddQueue.append(ssc.sparkContext.parallelize(list(range(i)), 10))
    time.sleep(1)

# Each batch takes one RDD from the queue
inputStream = ssc.queueStream(rddQueue)
# Count how often each value (modulo 10) occurs within the batch
reducedStream = inputStream.map(lambda x: (x % 10, 1)).reduceByKey(lambda x, y: x + y)

reducedStream.pprint()
ssc.start()
# Stop gracefully; PySpark spells this keyword argument `stopGraceFully`
ssc.stop(stopSparkContext=True, stopGraceFully=True)

Overwriting rdd_queue.py


In [65]:
!python rdd_queue.py

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
-------------------------------------------                                     
Time: 2021-05-09 11:18:00
-------------------------------------------

-------------------------------------------
Time: 2021-05-09 11:18:10
-------------------------------------------
(0, 1)

-------------------------------------------
Time: 2021-05-09 11:18:20
-------------------------------------------
(0, 1)
(1, 1)
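
The per-batch output above can be cross-checked without Spark. With the default `oneAtATime=True`, `queueStream` consumes one queued RDD per batch, so each batch is simply `map` followed by `reduceByKey` over `range(i)`. The sketch below is plain Python, and `expected_batch` is a hypothetical helper name.

```python
from collections import Counter

def expected_batch(i):
    """Mimic map(x -> (x % 10, 1)) followed by reduceByKey(add) on range(i)."""
    counts = Counter(x % 10 for x in range(i))
    return sorted(counts.items())

# One entry per RDD that the script puts in the queue (i = 0..4).
per_batch = [expected_batch(i) for i in range(5)]
for i, pairs in enumerate(per_batch):
    print(i, pairs)
```

The first three entries correspond to the three batches printed above; the last two queued RDDs do not appear in the output shown.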



### Word Count (Stateless Operation)
The output looks like this:  
![](./imgs/ssc_out.png)

In [4]:
# %%writefile sscrun.py
import findspark  
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create the contexts
sc = SparkContext(master="local[2]", appName="NetworkWordCount")
ssc = StreamingContext(sc, 10)  # batch interval: 10 seconds

# Read text lines from the socket
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word within the batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# Print the counts
wordCounts.pprint()

# Start the streaming computation
ssc.start()
ssc.awaitTermination()

Overwriting sscrun.py


In [9]:
!python ./sscrun.py

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
-------------------------------------------                                     
Time: 2021-05-09 10:47:40
-------------------------------------------
('1', 2)

-------------------------------------------                                     
Time: 2021-05-09 10:47:50
-------------------------------------------
('1', 18)

^Ctage 0:>                                                          (0 + 1) / 1]
Traceback (most recent call last):
  File "./sscrun.py", line 22, in <module>
    ssc.awaitTermination()
  File "/usr/local/spark/python/pyspark/streaming/context.py", line 199, in awaitTermination
    self._jssc.awaitTermination()
  File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1303, in __call__
  File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1033, in send_command
  File "/usr/local/spark/p
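
Stateless here means that each batch is counted on its own and nothing carries over from the previous batch. A minimal plain-Python sketch of that behavior follows. The sample batches are made up for illustration, and `count_batch` is a hypothetical helper rather than a Spark API.

```python
from collections import Counter

def count_batch(lines):
    """Equivalent of flatMap(split) -> map((word, 1)) -> reduceByKey(add) for one batch."""
    words = [word for line in lines for word in line.split(" ")]
    return dict(Counter(words))

# Hypothetical input: the lines received during two consecutive batches.
batches = [["1 1"], ["1 1 1", "2"]]
results = [count_batch(lines) for lines in batches]
for result in results:
    print(result)  # counts restart from zero in every batch
```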

### Word Count (Stateful Operation)

In [13]:
%%writefile calcultor.py
import findspark  
findspark.init()
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

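# Create the contexts
sc = SparkContext(master="local[2]", appName="NetworkWordCount")
ssc = StreamingContext(sc, 3)  # batch interval: 3 seconds

# Set the checkpoint directory (required by updateStateByKey)
ssc.checkpoint("checkpoint")

# State update function
def updatefun(new_values, last_sum):
    # Add this batch's values for the key to its previous running total;
    # last_sum is None the first time a key is seen
    return sum(new_values) + (last_sum or 0)

# Read text lines from the socket
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Keep a running count for each word across batches
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.updateStateByKey(updateFunc=updatefun)
wordCounts.pprint()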

ssc.start()
ssc.awaitTermination()

Overwriting calcultor.py
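
`updateStateByKey` calls the update function once per key in every batch, passing the list of new values for that key and the previous state (`None` the first time a key is seen). The plain-Python sketch below imitates that contract to show how the running total accumulates. `run_stateful` and the sample batches are assumptions for illustration, not Spark code.

```python
from collections import defaultdict

def updatefun(new_values, last_sum):
    # Same logic as the script: add this batch's values to the previous total.
    return sum(new_values) + (last_sum or 0)

def run_stateful(batches, update):
    """Apply `update` per key across batches and return the state after each batch."""
    state = {}
    history = []
    for pairs in batches:
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        # Keys already in the state are updated even when they got no new values.
        for key in set(state) | set(grouped):
            state[key] = update(grouped.get(key, []), state.get(key))
        history.append(dict(state))
    return history

# Hypothetical (word, 1) pairs for three batches.
batches = [[("a", 1), ("b", 1)], [("a", 1)], [("b", 1), ("b", 1)]]
history = run_stateful(batches, updatefun)
for snapshot in history:
    print(snapshot)
```

Unlike the stateless version, the count for a word keeps growing across batches, and a word that is absent from a batch keeps its previous total.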


In [11]:
!python calcultor.py

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "calcultor.py", line 11, in <module>
    ssc.checkpoint("checkpoint")
  File "/usr/local/spark/python/pyspark/streaming/context.py", line 260, in checkpoint
    self._jssc.checkpoint(directory)
  File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o24.checkpoint.
: java.net.ConnectException: Call From gavin-X550JX/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.
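
The traceback indicates that the relative path `checkpoint` was resolved against HDFS at `localhost:9000`, which refused the connection. If HDFS is not meant to be running, one possible workaround (an assumption, not something the original run shows) is to pass an explicit `file://` URI so that the local file system is used. The sketch below only builds such a URI with the standard library; `local_checkpoint_uri` is a hypothetical helper.

```python
from pathlib import Path

def local_checkpoint_uri(directory="checkpoint"):
    """Return an absolute file:// URI for a local checkpoint directory."""
    return Path(directory).resolve().as_uri()

uri = local_checkpoint_uri()
print(uri)
# Assumed usage in the script above: ssc.checkpoint(uri)
```

The alternative is to start the HDFS NameNode that `localhost:9000` points to before running the script.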

### Window Operations

In [12]:
%%writefile top.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql.session import SparkSession

def get_countryname(line):
    country_name = line.strip()

    if country_name == 'usa':
        output = 'USA'
    elif country_name == 'ind':
        output = 'India'
    elif country_name == 'aus':
        output = 'Australia'
    else:
        output = 'Unknown'

    return (output, 1)

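# Parameters (all in seconds)
batch_interval = 1
window_length = 6 * batch_interval
slide_interval = 3 * batch_interval

sc = SparkContext(master="local[2]", appName="NetworkWordCount")
ssc = StreamingContext(sc, batch_interval)

# Checkpointing is required when an inverse reduce function is used
ssc.checkpoint("checkpoint")
# Read text lines from the socket
lines = ssc.socketTextStream("localhost", 9999)

addFunc = lambda x, y: x + y     # applied to data entering the window
invAddFunc = lambda x, y: x - y  # applied to data leaving the window
word_counts = lines.map(get_countryname).reduceByKeyAndWindow(addFunc, invAddFunc, window_length, slide_interval)
word_counts.pprint()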
word_counts.pprint()

ssc.start()
ssc.awaitTermination()

Overwriting top.py
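
With an inverse function, `reduceByKeyAndWindow` can update a window incrementally: it adds the batches that enter the window and subtracts the ones that leave, instead of re-reducing the whole window. The plain-Python sketch below checks that idea for a single key, using the same window length (6) and slide (3) as the script. The per-batch counts and helper names are made up for illustration.

```python
def full_window(counts, end, window_length):
    """Recompute the window ending at batch index `end` (exclusive) from scratch."""
    start = max(0, end - window_length)
    return sum(counts[start:end])

def incremental_windows(counts, window_length, slide):
    """Slide the window with add for entering batches and subtract for leaving ones."""
    add = lambda x, y: x + y
    inv = lambda x, y: x - y
    results = []
    total = 0
    for end in range(slide, len(counts) + 1, slide):
        # Batches that entered since the previous window.
        for value in counts[end - slide:end]:
            total = add(total, value)
        # Batches that dropped out of the back of the window.
        old_start = max(0, end - slide - window_length)
        new_start = max(0, end - window_length)
        for value in counts[old_start:new_start]:
            total = inv(total, value)
        results.append(total)
    return results

# Hypothetical number of 'USA' lines seen in each 1-second batch.
usa_counts = [2, 0, 1, 3, 1, 0, 4, 1, 1, 0, 2, 5]
incremental = incremental_windows(usa_counts, window_length=6, slide=3)
recomputed = [full_window(usa_counts, end, 6) for end in range(3, len(usa_counts) + 1, 3)]
print(incremental)
print(recomputed)
```

Both lists agree, which is why the inverse function has to undo exactly what the reduce function does (here, subtraction undoes addition).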


## References