# PyFlink Installation

### Author: 胖胖揽住

### Version 2023.11.15

## Installing Anaconda3

First, download the Anaconda3 installer from the TUNA mirror.

```Bash
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
sh Anaconda3-2023.09-0-Linux-x86_64.sh
```
Accept the default settings during installation.
Anaconda3 should end up in `~/anaconda3`.

## Installing Python 3.9

Installing Python 3.9 through conda is simple and reliable.

```Bash
conda create -n pyflink_39 python=3.9
conda activate pyflink_39
```


## Installing Apache Flink

First, download Flink from the [Apache download site](https://dlcdn.apache.org/flink/); this guide uses 1.18.0 as an example:

```Bash
wget https://dlcdn.apache.org/flink/flink-1.18.0/flink-1.18.0-bin-scala_2.12.tgz
sudo tar -zxvf flink-1.18.0-bin-scala_2.12.tgz  -C /usr/local   
```

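Rename the directory and set its ownership:
```Bash
cd /usr/local
sudo mv ./flink-1.18.0 ./flink  # this is the version downloaded above; adjust as needed
sudo chown -R hadoop:hadoop ./flink  # the VM user here is hadoop; adjust to your own user name
```

Once Flink is unpacked and the permissions are set, it can run in local mode right away without any configuration changes.
To adjust settings, edit `/usr/local/flink/conf/flink-conf.yaml`.
For example, the `env.java.home` parameter can be set to the absolute path of the local Java installation.
In most cases, though, no manual configuration is needed.

Note, however, that Flink now expects Java 11, so install it manually with the following command: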
```Bash
sudo apt install openjdk-11-jdk -y
```

Next, edit the shell configuration file to add the environment variables:

```Bash
nano ~/.bashrc
```

Add the following lines to the file:
```
export FLINK_HOME=/usr/local/flink
export PATH=$FLINK_HOME/bin:$PATH
```

Save and close `.bashrc`, then run the following command so that the changes take effect:
```Bash
source ~/.bashrc
```
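
The environment setup above can be sanity-checked from Python. The sketch below is only an illustration: the helper name `check_flink_env` is made up for this example, and it assumes the `FLINK_HOME` and `PATH` layout described in this section.

```python
import os
import shutil


def check_flink_env(environ=None):
    """Return a list of human-readable problems with the Flink environment variables."""
    environ = os.environ if environ is None else environ
    problems = []
    flink_home = environ.get("FLINK_HOME")
    if not flink_home:
        problems.append("FLINK_HOME is not set")
    else:
        # The .bashrc lines above put $FLINK_HOME/bin on PATH.
        bin_dir = os.path.join(flink_home, "bin")
        path_entries = environ.get("PATH", "").split(os.pathsep)
        if bin_dir not in path_entries:
            problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_flink_env():
        print("WARNING:", problem)
    # shutil.which returns None when no `java` executable is found on PATH.
    print("java found at:", shutil.which("java"))
```

An empty list means both variables look as expected in the current shell; it does not prove that Flink itself starts correctly.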

## Installing Python dependencies

Then use pip to install the apache-flink package, along with kafka-python and the other dependencies:

```Bash
pip install apache-flink 
pip install kafka-python chardet pandas numpy scipy simpy 
pip install matplotlib cython sympy xlrd pyopengl BeautifulSoup4 pyqt6 scikit-learn requests tensorflow torch keras tqdm gym DRL
```
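
After the pip commands finish, one way to confirm that the core packages are importable is sketched below. The helper `missing_modules` is hypothetical, and the import names are assumptions about how the packages expose themselves (`apache-flink` as `pyflink`, `kafka-python` as `kafka`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that the import system cannot find."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names are assumptions: apache-flink is imported as `pyflink`
# and kafka-python as `kafka`.
required = ["pyflink", "kafka", "chardet", "pandas", "numpy"]
print("Missing modules:", missing_modules(required))
```

Run it inside the activated `pyflink_39` environment; any name it prints still needs to be installed there.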

## Code walkthrough

The code in this document is adapted from the official [1.18 documentation](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/python/datastream_tutorial/).

In [None]:
# Word count with the Flink Python DataStream API
# The code below is adapted from the official [1.18 documentation](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/dev/python/datastream_tutorial/).
# Save the code below as `DataStream_API_word_count.py`.

import os
# Get current absolute path
current_file_path = os.path.abspath(__file__)
# Get current dir path
current_dir_path = os.path.dirname(current_file_path)
# Change into current dir path
os.chdir(current_dir_path)

import argparse
import logging
import sys
import numpy as np 
import pandas as pd
from pyflink.table import StreamTableEnvironment
from pyflink.common import WatermarkStrategy, Encoder, Types
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors.file_system import FileSource, StreamFormat, FileSink, OutputFileConfig, RollingPolicy


word_count_data = ["To be, or not to be,--that is the question:--",
                   "Whether 'tis nobler in the mind to suffer",
                   "The slings and arrows of outrageous fortune",
                   "Or to take arms against a sea of troubles,",
                   "And by opposing end them?--To die,--to sleep,--",
                   "No more; and by a sleep to say we end",
                   "The heartache, and the thousand natural shocks",
                   "That flesh is heir to,--'tis a consummation",
                   "Devoutly to be wish'd. To die,--to sleep;--",
                   "To sleep! perchance to dream:--ay, there's the rub;",
                   "For in that sleep of death what dreams may come,",
                   "When we have shuffled off this mortal coil,",
                   "Must give us pause: there's the respect",
                   "That makes calamity of so long life;",
                   "For who would bear the whips and scorns of time,",
                   "The oppressor's wrong, the proud man's contumely,",
                   "The pangs of despis'd love, the law's delay,",
                   "The insolence of office, and the spurns",
                   "That patient merit of the unworthy takes,",
                   "When he himself might his quietus make",
                   "With a bare bodkin? who would these fardels bear,",
                   "To grunt and sweat under a weary life,",
                   "But that the dread of something after death,--",
                   "The undiscover'd country, from whose bourn",
                   "No traveller returns,--puzzles the will,",
                   "And makes us rather bear those ills we have",
                   "Than fly to others that we know not of?",
                   "Thus conscience does make cowards of us all;",
                   "And thus the native hue of resolution",
                   "Is sicklied o'er with the pale cast of thought;",
                   "And enterprises of great pith and moment,",
                   "With this regard, their currents turn awry,",
                   "And lose the name of action.--Soft you now!",
                   "The fair Ophelia!--Nymph, in thy orisons",
                   "Be all my sins remember'd."]


# Define word_count, which counts how often each word occurs in the input
def word_count(input_path, output_path):
    # Get the StreamExecutionEnvironment instance
    env = StreamExecutionEnvironment.get_execution_environment()
    # Run in batch mode
    env.set_runtime_mode(RuntimeExecutionMode.BATCH)
    # Use a parallelism of 1
    env.set_parallelism(1)

    # If an input path is given, read the data from it
    if input_path is not None:
        ds = env.from_source(
            source=FileSource.for_record_stream_format(StreamFormat.text_line_format(),
                                                       input_path)
                             .process_static_file_set().build(),
            # Use a watermark strategy for monotonously increasing timestamps
            watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
            source_name="file_source"
        )
    else:
        print("Executing word_count example with default input data set.")
        print("Use --input to specify file input.")
        # Fall back to word_count_data as the input data set
        ds = env.from_collection(word_count_data)

    # Split each input line into words
    def split(line):
        yield from line.split()
    # Map every word to (word, 1), group by word, and sum the counts
    ds = ds.flat_map(split) \
        .map(lambda i: (i, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()])) \
        .key_by(lambda i: i[0]) \
        .reduce(lambda i, j: (i[0], i[1] + j[1]))

    # If an output path is given, write the result to it
    if output_path is not None:
        ds.sink_to(
            sink=FileSink.for_row_format(
                base_path=output_path,
                encoder=Encoder.simple_string_encoder())
            .with_output_file_config(
                OutputFileConfig.builder()
                .with_part_prefix("prefix")
                .with_part_suffix(".ext")
                .build())
            .with_rolling_policy(RollingPolicy.default_rolling_policy())
            .build()
        )
    else:
        print("Printing result to stdout. Use --output to specify output path.")
        # Convert the DataStream to a Table, then to a pandas DataFrame
        t_env = StreamTableEnvironment.create(env)
        table = t_env.from_data_stream(ds)
        df = table.to_pandas()

        # Write the result to a CSV file
        df.to_csv('./DataStream_API_word_count.csv', index=False)
        print(df)
    # Submit the job for execution
    env.execute()

# Script entry point
if __name__ == '__main__':
    # Configure the log output format
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    # Create the argument parser
    parser = argparse.ArgumentParser()
    # Register the optional --input and --output arguments
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to process.')
    parser.add_argument(
        '--output',
        dest='output',
        required=False,
        help='Output file to write results to.')

    # Parse the arguments, ignoring any unknown ones
    argv = sys.argv[1:]
    known_args, _ = parser.parse_known_args(argv)
    # Call word_count
    word_count(known_args.input, known_args.output)
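
To make the `flat_map` → `map` → `key_by` → `reduce` chain above easier to follow, here is a plain-Python sketch of the same computation. It is only an analogy and does not use Flink; the function name `word_count_local` is made up for this illustration.

```python
from collections import Counter


def word_count_local(lines):
    """Count whitespace-separated tokens, mirroring flat_map -> map -> key_by -> reduce."""
    counts = Counter()
    for line in lines:              # each input record is one line of text
        for word in line.split():   # flat_map: one line yields many words
            counts[word] += 1       # map to (word, 1), then reduce by key
    return dict(counts)


sample = ["to be or not to be", "to sleep"]
print(word_count_local(sample))
```

As in the Flink job, tokens are split on whitespace only, so punctuation and letter case stay part of each word.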

In [None]:
# Word count with the Flink Python Table API
# The code below is adapted from the official [1.18 documentation](https://nightlies.apache.org/flink/flink-docs-release-1.18/zh/docs/dev/python/table_api_tutorial/).
# Save the code below as `Table_API_word_count.py`.

import argparse
import logging
import sys

from pyflink.common import Row
from pyflink.table import (EnvironmentSettings, TableEnvironment, TableDescriptor, Schema,
                           DataTypes, FormatDescriptor)
from pyflink.table.expressions import lit, col
from pyflink.table.udf import udtf

word_count_data = ["To be, or not to be,--that is the question:--",
                   "Whether 'tis nobler in the mind to suffer",
                   "The slings and arrows of outrageous fortune",
                   "Or to take arms against a sea of troubles,",
                   "And by opposing end them?--To die,--to sleep,--",
                   "No more; and by a sleep to say we end",
                   "The heartache, and the thousand natural shocks",
                   "That flesh is heir to,--'tis a consummation",
                   "Devoutly to be wish'd. To die,--to sleep;--",
                   "To sleep! perchance to dream:--ay, there's the rub;",
                   "For in that sleep of death what dreams may come,",
                   "When we have shuffled off this mortal coil,",
                   "Must give us pause: there's the respect",
                   "That makes calamity of so long life;",
                   "For who would bear the whips and scorns of time,",
                   "The oppressor's wrong, the proud man's contumely,",
                   "The pangs of despis'd love, the law's delay,",
                   "The insolence of office, and the spurns",
                   "That patient merit of the unworthy takes,",
                   "When he himself might his quietus make",
                   "With a bare bodkin? who would these fardels bear,",
                   "To grunt and sweat under a weary life,",
                   "But that the dread of something after death,--",
                   "The undiscover'd country, from whose bourn",
                   "No traveller returns,--puzzles the will,",
                   "And makes us rather bear those ills we have",
                   "Than fly to others that we know not of?",
                   "Thus conscience does make cowards of us all;",
                   "And thus the native hue of resolution",
                   "Is sicklied o'er with the pale cast of thought;",
                   "And enterprises of great pith and moment,",
                   "With this regard, their currents turn awry,",
                   "And lose the name of action.--Soft you now!",
                   "The fair Ophelia!--Nymph, in thy orisons",
                   "Be all my sins remember'd."]


# Define word_count, which counts how often each word occurs
# input_path and output_path are the input and output locations
def word_count(input_path, output_path):
    # Create a TableEnvironment in streaming mode
    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    # Use a parallelism of 1
    t_env.get_config().set("parallelism.default", "1")
    # If an input path is given, create a temporary table that reads from it
    if input_path is not None:
        t_env.create_temporary_table(
            'source',
            TableDescriptor.for_connector('filesystem')
                .schema(Schema.new_builder()
                        .column('word', DataTypes.STRING())
                        .build())
                .option('path', input_path)
                .format('csv')
                .build())
        tab = t_env.from_path('source')
    # Otherwise fall back to the default data set
    else:
        print("Executing word_count example with default input data set.")
        print("Use --input to specify file input.")
        tab = t_env.from_elements(map(lambda i: (i,), word_count_data),
                                  DataTypes.ROW([DataTypes.FIELD('line', DataTypes.STRING())]))
    # If an output path is given, create a temporary table that stores the result
    if output_path is not None:
        t_env.create_temporary_table(
            'sink',
            TableDescriptor.for_connector('filesystem')
                .schema(Schema.new_builder()
                        .column('word', DataTypes.STRING())
                        .column('count', DataTypes.BIGINT())
                        .build())
                .option('path', output_path)
                .format(FormatDescriptor.for_format('canal-json')
                        .build())
                .build())
    # Otherwise print the result to standard output
    else:
        print("Printing result to stdout. Use --output to specify output path.")
        t_env.create_temporary_table(
            'sink',
            TableDescriptor.for_connector('print')
                .schema(Schema.new_builder()
                        .column('word', DataTypes.STRING())
                        .column('count', DataTypes.BIGINT())
                        .build())
                .build())
    # Define a user-defined table function (UDTF) that splits each line into words
    @udtf(result_types=[DataTypes.STRING()])
    def split(line: Row):
        for s in line[0].split():
            yield Row(s)
    # Split the text into words, count each word, and write the counts to the sink
    tab.flat_map(split).alias('word') \
        .group_by(col('word')) \
        .select(col('word'), lit(1).count) \
        .execute_insert('sink') \
        .wait()
    # Convert the source table `tab` (the input lines, not the word counts)
    # to a pandas DataFrame and save it as a CSV file
    df = tab.to_pandas()
    df.to_csv('./Table_API_word_count.csv', index=False)
    print(df)

if __name__ == '__main__':
    # Configure the log output format
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    # Create the argument parser
    parser = argparse.ArgumentParser()
    # Register the optional --input and --output arguments
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to process.')
    parser.add_argument(
        '--output',
        dest='output',
        required=False,
        help='Output file to write results to.')
    # Parse the arguments, ignoring any unknown ones
    argv = sys.argv[1:]
    known_args, _ = parser.parse_known_args(argv)

    # Call word_count with the parsed paths
    word_count(known_args.input, known_args.output)

Executing word_count example with default input data set.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.


# Setting up a local Kafka cluster with Docker

The operating system used here is Ubuntu 22.04.3.

1. Install Docker and Docker Compose:
```Bash
sudo apt install docker.io docker-compose
```
2. Create a local `docker-compose.yml` file with the following content:

```yaml
version: '3'
services:
  zookeeper:
    image: 'bitnami/zookeeper:latest'
    environment:
      - ALLOW_ANONYMOUS_LOGIN=yes
  kafka:
    image: 'bitnami/kafka:latest'
    ports:
      - '9092:9092'
    environment:
      - KAFKA_ADVERTISED_HOST_NAME=localhost
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092
      - KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092
      - KAFKA_CREATE_TOPICS=test:1:1
      - ALLOW_PLAINTEXT_LISTENER=yes
    depends_on:
      - zookeeper
```

3. Change to the directory that contains `docker-compose.yml` and run:

```Bash
docker-compose up -d
```

This starts a local Kafka cluster made up of one Zookeeper instance and one Kafka instance, listening on port 9092 of the local host.

In [None]:
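# A simple way to generate a stream with kafka-python

# The code below uses the kafka-python module to send data to the local Kafka cluster.
# It opens a text file named `hamlet.txt` and sends its content as a stream to the Kafka topic `hamlet`:

# KafkaProducer sends the messages
from kafka import KafkaProducer
# time is used to throttle the sending rate
import time
# os is used to look up the file size
import os

# Define a function that sends a file to Kafka
def send_file_to_kafka(file_path: str, topic: str, bootstrap_servers: str):
    # Create a KafkaProducer instance for sending messages
    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    # Get the file size
    file_size = os.path.getsize(file_path)
    # Keep resending the file until the user quits
    while True:
        # Open the file in binary mode
        with open(file_path, "rb") as f:
            # Read the file chunk by chunk
            while True:
                # Read up to 1024 bytes
                data = f.read(1024)
                # Stop at the end of the file
                if not data:
                    break
                # Send the chunk to Kafka
                producer.send(topic, data)
                # Number of bytes in this chunk
                bytes_sent = len(data)
                # Print the number of bytes sent
                print(f"Sent {bytes_sent} bytes to Kafka topic {topic}")
                # Percentage of the file sent so far
                percent_sent = (f.tell() / file_size) * 100
                # Print the percentage sent
                print(f"{percent_sent:.2f}% of the file sent")
                # Wait 3 seconds
                time.sleep(3)
        # Ask the user whether to send the file again
        user_input = input("Press 'c' to continue sending the file or 'q' to quit: ")
        # Leave the loop if the user enters q
        if user_input == "q":
            break
# Send hamlet.txt to the Kafka topic hamlet
send_file_to_kafka("./hamlet.txt",  "hamlet", "localhost:9092")
# send_file_to_kafka takes three arguments: file_path, topic, and bootstrap_servers.
# file_path is the path of the local file, topic is the Kafka topic to send to, and bootstrap_servers is the address of the Kafka cluster.
# The function opens the file with a with statement, reads its content, and sends it to the given Kafka topic as a stream.
# While sending, it prints the progress and pauses 3 seconds with time.sleep after every chunk to limit the sending rate.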
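
The chunking and progress logic of the producer can be tried without a running Kafka broker. The sketch below isolates it in a generator; the name `iter_chunks` is made up for this illustration and is not part of kafka-python.

```python
import os


def iter_chunks(file_path, chunk_size=1024):
    """Yield (chunk, percent_done) pairs for a file read in fixed-size binary chunks."""
    file_size = os.path.getsize(file_path)
    with open(file_path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            # f.tell() is the number of bytes consumed so far.
            percent = (f.tell() / file_size) * 100
            yield data, percent
```

A producer loop could then, for example, iterate `for data, percent in iter_chunks("./hamlet.txt")` and call `producer.send(topic, data)` for each chunk.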

In [None]:
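# A simple way to display the stream with kafka-python

from kafka import KafkaConsumer

# Create a KafkaConsumer instance that reads messages from a Kafka topic
consumer = KafkaConsumer(
    # Topic to read from
    "hamlet",
    # Address and port of the Kafka server
    bootstrap_servers=["localhost:9092"],
    # Where to start reading when the consumer group has no committed offset
    auto_offset_reset="earliest",
    # Commit offsets automatically in the background
    enable_auto_commit=True,
    # Consumer group under which offsets are committed
    group_id="my-group",
    # How to decode the message payload
    value_deserializer=lambda x: x.decode("utf-8")
)

# Loop over the messages in the topic and print their length and content
for message in consumer:
    print(f"Received {len(message.value)} bytes from Kafka topic {message.topic}")
    print(f"{message.value}")

# The code above creates a consumer object with the `KafkaConsumer` class.
# `hamlet` is passed to the constructor as the topic name.
# `localhost:9092` is passed as the address of the bootstrap server.
# The `value_deserializer` argument decodes the messages received from the Kafka topic.
# A `for` loop iterates over the consumer object, and `print` shows the content of each message.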
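
One caveat with decoding fixed-size byte chunks: a chunk boundary can fall in the middle of a multi-byte UTF-8 character, in which case a plain `x.decode("utf-8")` raises `UnicodeDecodeError`. The sketch below shows one possible workaround with the standard library's incremental decoder. The helper name `make_chunk_decoder` is made up, and the approach assumes that chunks arrive in order (for example from a single partition).

```python
import codecs


def make_chunk_decoder(encoding="utf-8"):
    """Return a function that decodes successive byte chunks, buffering split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    return decoder.decode


decode = make_chunk_decoder()
text = "naïve café".encode("utf-8")
# Split in the middle of the two-byte "ï" to simulate a chunk boundary.
first, second = text[:3], text[3:]
print(decode(first) + decode(second))
```

For a pure ASCII file this makes no difference, so the simpler lambda is fine in that case.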

In [1]:
# Displaying the stream with pyflink

import os
import argparse
import logging
import sys
import numpy as np 
import pandas as pd
from pyflink.table import StreamTableEnvironment
from pyflink.common import WatermarkStrategy, Encoder, Types, SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors.file_system import FileSource, StreamFormat, FileSink, OutputFileConfig, RollingPolicy
from pyflink.datastream.connectors.kafka import FlinkKafkaProducer, FlinkKafkaConsumer

def split(line):
    # Split a line into words
    yield from line.split()

def read_from_kafka():
    # Read data from Kafka
    # Get the current execution environment
    env = StreamExecutionEnvironment.get_execution_environment()    
    # Add the Kafka connector jar
    env.add_jars("file:///home/hadoop/Desktop/PyFlink-Tutorial/flink-sql-connector-kafka-3.1-SNAPSHOT.jar")
    print("start reading data from kafka")
    # Create a Kafka consumer that reads messages from Kafka
    kafka_consumer = FlinkKafkaConsumer(
        topics='hamlet', # The topic to consume messages from
        deserialization_schema= SimpleStringSchema('UTF-8'), # The schema to deserialize messages
        properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'my-group'} # The Kafka broker address and consumer group ID
    )
    # Start reading from the earliest record
    kafka_consumer.set_start_from_earliest()
    # Add the Kafka consumer as a source and print every record
    env.add_source(kafka_consumer).print()
    # Submit the job for execution
    env.execute()

if __name__ == '__main__':
    read_from_kafka()

start reading data from kafka


Py4JJavaError: An error occurred while calling o0.execute.
: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
	at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
	at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:141)
	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
	at org.apache.flink.runtime.rpc.pekko.PekkoInvocationHandler.lambda$invokeRpc$1(PekkoInvocationHandler.java:268)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
	at org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1267)
	at org.apache.flink.runtime.concurrent.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93)
	at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
	at org.apache.flink.runtime.concurrent.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
	at org.apache.flink.runtime.concurrent.pekko.ScalaFutureUtils$1.onComplete(ScalaFutureUtils.java:47)
	at org.apache.pekko.dispatch.OnComplete.internal(Future.scala:310)
	at org.apache.pekko.dispatch.OnComplete.internal(Future.scala:307)
	at org.apache.pekko.dispatch.japi$CallbackBridge.apply(Future.scala:234)
	at org.apache.pekko.dispatch.japi$CallbackBridge.apply(Future.scala:231)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at org.apache.flink.runtime.concurrent.pekko.ScalaFutureUtils$DirectExecutionContext.execute(ScalaFutureUtils.java:65)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
	at org.apache.pekko.pattern.PromiseActorRef.$bang(AskSupport.scala:629)
	at org.apache.pekko.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:34)
	at org.apache.pekko.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:33)
	at scala.concurrent.Future.$anonfun$andThen$1(Future.scala:536)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at org.apache.pekko.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:73)
	at org.apache.pekko.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:110)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
	at org.apache.pekko.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:110)
	at org.apache.pekko.dispatch.TaskInvocation.run(AbstractDispatcher.scala:59)
	at org.apache.pekko.dispatch.ForkJoinExecutorConfigurator$PekkoForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:57)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
	at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:176)
	at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:107)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.recordTaskFailure(DefaultScheduler.java:285)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:276)
	at org.apache.flink.runtime.scheduler.DefaultScheduler.onTaskFailed(DefaultScheduler.java:269)
	at org.apache.flink.runtime.scheduler.SchedulerBase.onTaskExecutionStateUpdate(SchedulerBase.java:764)
	at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:741)
	at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:83)
	at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:488)
	at jdk.internal.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRpcInvocation$1(PekkoRpcActor.java:309)
	at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
	at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcInvocation(PekkoRpcActor.java:307)
	at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:222)
	at org.apache.flink.runtime.rpc.pekko.FencedPekkoRpcActor.handleRpcMessage(FencedPekkoRpcActor.java:85)
	at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168)
	at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33)
	at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29)
	at scala.PartialFunction.applyOrElse(PartialFunction.scala:127)
	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126)
	at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176)
	at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547)
	at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545)
	at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229)
	at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590)
	at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557)
	at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280)
	at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241)
	at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253)
	... 5 more
Caused by: org.apache.flink.kafka.shaded.org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
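
The last `Caused by` line above reports a timeout while fetching topic metadata. A common reason for this kind of error is that the broker is not reachable at the configured address, although the trace alone does not prove that this was the cause here. A minimal reachability check, assuming the broker address `localhost:9092` used throughout this document (the helper name `port_is_open` is made up), could look like this:

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `port_is_open("localhost", 9092)` before submitting the job shows whether anything is listening on the port at all. A `True` result only proves that the TCP port accepts connections, not that the Kafka listeners are configured correctly.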


In [None]:
# A simple word count

# Standard-library helpers: os, regular expressions, and Counter
import os
import re
from collections import Counter
# PyFlink classes: table and stream environments, Kafka consumer, string schema
from pyflink.table import StreamTableEnvironment
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import FlinkKafkaConsumer
from pyflink.common import SimpleStringSchema

# Remove punctuation from a text
def remove_punctuation(text):
    # Use a regular expression to drop everything except word characters and whitespace
    return re.sub(r'[^\w\s]','',text)

# Count the words in a text
def count_words(text):
    # Split the text on whitespace into a list of words
    words = text.split()
    # Use Counter to count how often each word occurs
    return Counter(words)

# Read data from Kafka
def read_from_kafka():
    # Get the StreamExecutionEnvironment instance
    env = StreamExecutionEnvironment.get_execution_environment()    
    # Add the flink-sql-connector-kafka-3.1-SNAPSHOT.jar connector
    env.add_jars("file:///home/hadoop/Desktop/PyFlink-Tutorial/flink-sql-connector-kafka-3.1-SNAPSHOT.jar")
    # Announce that reading from Kafka starts
    print("start reading data from kafka")
    # Create the FlinkKafkaConsumer instance
    kafka_consumer = FlinkKafkaConsumer(
        topics='hamlet', # The topic to consume messages from
        deserialization_schema= SimpleStringSchema('UTF-8'), # The schema to deserialize messages
        properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'my-group'} # The Kafka broker address and consumer group ID
    )
    # Start reading from the earliest record
    kafka_consumer.set_start_from_earliest()
    # Add the consumer as a source to obtain a DataStream
    stream = env.add_source(kafka_consumer)
    # Map every message to its text without punctuation
    stream_remove_punctuation = stream.map(lambda x: remove_punctuation(x))
    # Map every cleaned message to its word counts
    stream_count_words = stream_remove_punctuation.map(lambda x: count_words(x))
    # Print the word counts of every message
    stream_count_words.print()
    # Submit the job for execution
    env.execute()

# Call read_from_kafka
read_from_kafka()
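
Note that `map` processes each Kafka message on its own, so the job above prints one independent `Counter` per message rather than a running total. What cumulative counts would look like can be sketched in plain Python; this is an analogy only, without Flink, and the name `running_word_counts` is made up.

```python
import re
from collections import Counter


def remove_punctuation(text):
    """Drop everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def running_word_counts(messages):
    """Yield the cumulative word counts after each message is processed."""
    totals = Counter()
    for message in messages:
        totals.update(remove_punctuation(message).split())
        yield dict(totals)


messages = ["To be, or not to be", "to be!"]
for snapshot in running_word_counts(messages):
    print(snapshot)
```

In Flink, state that spans messages is usually kept per key, for example with `key_by` followed by `reduce`, as in the DataStream word count earlier in this document.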

In [None]:
# A more detailed word count

# Imports: standard library, pandas, and PyFlink
import argparse
import io
import json
import logging
import os
import pandas as pd
import re
from collections import Counter
from io import StringIO
from pyflink.common import SimpleStringSchema, Time
from pyflink.datastream.connectors.kafka import FlinkKafkaConsumer
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import (DataTypes, EnvironmentSettings, FormatDescriptor,
                           Schema, StreamTableEnvironment, TableDescriptor,
                           TableEnvironment, udf)
from pyflink.table.expressions import col, lit

# Remove punctuation from a text
def remove_punctuation(text):
    return re.sub(r'[^\w\s]','',text)

# Count the bytes of a text in UTF-8
def count_bytes(text):
    return len(text.encode('utf-8'))

# Compute word statistics for a text
def count_words(text):
    words = text.split()
    result = dict(Counter(words))
    max_word = max(result, key=result.get)
    return {'total_bytes': count_bytes(text), 'total_words': len(words), 'most_frequent_word': max_word, 'most_frequent_word_count': result[max_word]}

# Read data from Kafka
def read_from_kafka():
    # Get the StreamExecutionEnvironment instance
    env = StreamExecutionEnvironment.get_execution_environment()  
    # Add the flink-sql-connector-kafka-3.1-SNAPSHOT.jar connector
    env.add_jars("file:///home/hadoop/Desktop/PyFlink-Tutorial/flink-sql-connector-kafka-3.1-SNAPSHOT.jar")
    print("start reading data from kafka")
    # Create the FlinkKafkaConsumer with topic, deserialization schema, and properties
    kafka_consumer = FlinkKafkaConsumer(
        topics='hamlet', 
        deserialization_schema= SimpleStringSchema('UTF-8'), 
        properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'my-group'} 
    )
    # Start reading from the earliest record
    kafka_consumer.set_start_from_earliest()
    # Add the consumer as a source to obtain a DataStream of the original text
    stream_original_text = env.add_source(kafka_consumer)
    # Remove the punctuation from every message in the stream
    stream_remove_punctuation = stream_original_text.map(lambda x: remove_punctuation(x))
    # Compute the word statistics of every message in the stream
    stream_count_words = stream_remove_punctuation.map(lambda x: count_words(x))
    # Print the statistics of every message
    stream_count_words.print()
    # Submit the streaming job for execution
    env.execute()
read_from_kafka()
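
One edge case in `count_words` above: for a message that contains no words, `max` is called on an empty dictionary and raises `ValueError`. A possible guard is sketched below. The name `count_words_safe` is made up, and returning `None` with a count of 0 for empty input is an assumption about the desired behavior.

```python
from collections import Counter


def count_words_safe(text):
    """Like count_words above, but returns empty statistics for text without words."""
    words = text.split()
    counts = Counter(words)
    # most_common(1) returns an empty list when there are no words at all.
    top = counts.most_common(1)
    word, freq = top[0] if top else (None, 0)
    return {
        "total_bytes": len(text.encode("utf-8")),
        "total_words": len(words),
        "most_frequent_word": word,
        "most_frequent_word_count": freq,
    }


print(count_words_safe("a b a"))
print(count_words_safe(""))
```

The function could be passed to `map` in place of `count_words` without other changes to the job.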

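# Playing with CSV

Suppose we are given a `data.csv` file with arbitrary content, of which only the year values are needed.
We first use the following code, `StreamGeneratorCSV`, to turn the CSV file into a Kafka stream.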

In [None]:
# A simple CSV stream generator

# The code below uses the kafka-python module to send data to the local Kafka cluster.
# It opens a CSV file named `data.csv` and sends its content as a stream to the Kafka topic `data`:

from kafka import KafkaProducer
from itertools import islice
import time
import os
import chardet

def send_file_to_kafka(file_path: str, topic: str, bootstrap_servers: str):
    '''
    Send a file to a Kafka topic
    :param file_path: path to the local file
    :param topic: Kafka topic to which the data should be sent
    :param bootstrap_servers: address of the Kafka cluster
    '''
    # Create a KafkaProducer instance
    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    # Get the file size
    file_size = os.path.getsize(file_path)

    # Detect the file encoding
    with open(file_path, "rb") as f:
        result = chardet.detect(f.read())
        encoding = result["encoding"]

    # Count the lines in the file
    with open(file_path, "r", encoding=encoding) as f:
        lines_total = len(f.readlines())

    while True:
        # Reset the progress counter for every pass over the file
        lines_send = 0
        # Open the file in binary mode
        with open(file_path, "rb") as f:
            while True:
                # Read the next 10 lines (readlines(10) would treat 10 as a byte-size hint)
                data = list(islice(f, 10))
                if not data:
                    break
                # Convert the list of lines to its string representation
                data_str = str(data)
                # Encode the string as bytes
                data_bytes = data_str.encode()
                # Send the bytes to Kafka
                producer.send(topic, data_bytes)
                # Record the number of lines sent
                lines_send += len(data)
                # Percentage of lines sent so far
                percent_sent = (lines_send / lines_total) * 100                
                # Number of bytes in this message
                bytes_sent = len(data_bytes)
                print(f"Sent {bytes_sent} bytes {topic} {percent_sent:.2f}% sent")
                # Pause 3 seconds to limit the sending rate
                time.sleep(3)
                
        # Ask whether to send the file again
        user_input = input("Press 'c' to continue sending the file or 'q' to quit: ")
        if user_input == "q":
            break
# Send data.csv to the Kafka topic data; the Kafka cluster runs at localhost:9092
send_file_to_kafka("./data.csv",  "data", "localhost:9092")

# Explanation of the code above
# send_file_to_kafka takes three arguments: file_path, topic, and bootstrap_servers.
# file_path is the path of the local file, topic is the Kafka topic to send to, and bootstrap_servers is the address of the Kafka cluster.
# The function opens the file with a with statement, reads its content, and sends it to the given Kafka topic as a stream. While sending, it prints the progress and pauses 3 seconds with time.sleep to limit the sending rate.
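
A detail worth knowing when batching lines: the argument of `file.readlines(n)` is a size hint in bytes, not a line count. One way to read a fixed number of lines per message is `itertools.islice`, sketched below on an in-memory stand-in for the CSV file. The helper name `iter_line_batches` and the sample rows are made up for this illustration.

```python
import io
from itertools import islice


def iter_line_batches(stream, batch_size=10):
    """Yield lists of at most batch_size lines from an open file-like object."""
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            break
        yield batch


# An in-memory stand-in for a CSV file with 25 rows.
fake_csv = io.BytesIO(b"".join(b"row%d,2001\n" % i for i in range(25)))
print([len(batch) for batch in iter_line_batches(fake_csv)])
```

The final batch can be shorter than `batch_size`, so a progress counter should add `len(batch)` rather than a fixed 10.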



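# Printing the year values

StreamShowerWithFlinkCSV.py is a Python script that processes the CSV data as a DataStream. Strictly speaking, the code below relies on functions from the `re` module.
That detail does not matter much; this is just a quick experiment on the DataStream generated from the CSV file.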

In [None]:
# StreamShowerWithFlinkCSV.py

# Imports: regular expressions, argument parsing, logging, sys, numpy, pandas, and pyflink
import re
import argparse
import logging
import sys
import numpy as np 
import pandas as pd
from pyflink.table import StreamTableEnvironment
from pyflink.common import WatermarkStrategy, Encoder, Types, SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors.file_system import FileSource, StreamFormat, FileSink, OutputFileConfig, RollingPolicy
from pyflink.datastream.connectors.kafka import FlinkKafkaProducer, FlinkKafkaConsumer

# Define split, which breaks a string into individual words
def split(line):
    yield from line.split()

# Define read_from_kafka, which reads data from Kafka
def read_from_kafka():
    # First year of the range to keep
    Year_Begin =1999
    # Last year of the range to keep
    Year_End = 2023
    # Get the StreamExecutionEnvironment instance
    env = StreamExecutionEnvironment.get_execution_environment()    
    # Add the Kafka connector jar
    env.add_jars("file:///home/hadoop/Desktop/PyFlink-Tutorial/flink-sql-connector-kafka-3.1-SNAPSHOT.jar")
    # Announce that reading from Kafka starts
    print("start reading data from kafka")

    # Create the Kafka consumer that reads data from Kafka
    kafka_consumer = FlinkKafkaConsumer(
        topics='data', # The topic to consume messages from
        deserialization_schema= SimpleStringSchema('UTF-8'), # The schema to deserialize messages
        properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'my-group'} # The Kafka broker address and consumer group ID
    )

    # Start reading from the earliest offset
    kafka_consumer.set_start_from_earliest()

    # Keep only the digit runs of each message, drop messages without a year in
    # the range, and print the first in-range year of each remaining message
    env.add_source(kafka_consumer) \
        .map(lambda x: ' '.join(re.findall(r'\d+', x))) \
        .filter(lambda x: any([Year_Begin <= int(i) <= Year_End for i in x.split()])) \
        .map(lambda x:  [i for i in x.split() if Year_Begin <= int(i) <= Year_End][0]) \
        .print()
    # Submit the job for execution
    env.execute()

# Call read_from_kafka
if __name__ == '__main__':
    read_from_kafka()
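
The `map` / `filter` / `map` chain in the script can be checked in isolation with plain Python. The sketch below folds the three steps into one function; the name `first_year_in_range` is made up, and the default range mirrors `Year_Begin` and `Year_End` from the script.

```python
import re


def first_year_in_range(text, year_begin=1999, year_end=2023):
    """Return the first run of digits in text that falls within the year range, or None."""
    for token in re.findall(r"\d+", text):
        if year_begin <= int(token) <= year_end:
            return token
    return None


print(first_year_in_range("id=42, year 2005, then 2010"))
print(first_year_in_range("1850 only"))
```

Like the Flink job, this yields at most one year per message, so a message that bundles several CSV rows contributes only its first in-range year.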

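- `MapFunction`: a function that takes one element as input and produces one element as output. It transforms a data stream by applying a conversion to every element.
- `FlatMapFunction`: a function that takes one element as input and produces zero, one, or more elements as output. It transforms a data stream by applying a conversion to every element.
- `FilterFunction`: a function that takes one element as input and returns a boolean. It filters a data stream by removing the elements that do not meet a condition.
- `KeySelector`: a function that extracts a key from an element. It is used to group the elements of a data stream by key.
- `ReduceFunction`: a function that takes two elements as input and produces one element as output. It aggregates a data stream by combining elements that share a common key.
- `WindowFunction`: a function that takes a window of elements as input and produces one or more elements as output. It is used to define windows on a data stream and to transform the elements within each window.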
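
How a filter, a key selector, and a window function fit together can be illustrated with a plain-Python analogy. This is not the Flink API: the function `tumbling_count_windows` is made up for this illustration, and it mimics a count-based tumbling window per key.

```python
from collections import defaultdict


def tumbling_count_windows(elements, key_selector, window_size, window_function):
    """Group elements by key and apply window_function to every full window of window_size."""
    buffers = defaultdict(list)
    results = []
    for element in elements:
        key = key_selector(element)          # KeySelector: element -> key
        buffers[key].append(element)
        if len(buffers[key]) == window_size:
            # WindowFunction: (key, elements in the window) -> result
            results.append(window_function(key, buffers[key]))
            buffers[key] = []
    return results


readings = [("a", 1), ("b", 10), ("a", 3), ("a", 5), ("b", 20), ("a", 7)]
kept = [r for r in readings if r[1] < 20]    # FilterFunction: drop large values
sums = tumbling_count_windows(
    kept,
    key_selector=lambda r: r[0],
    window_size=2,
    window_function=lambda key, window: (key, sum(v for _, v in window)),
)
print(sums)
```

In this sketch, windows that never fill up (key `b` here) produce no output; a real stream processor decides when to emit a window through its own window and trigger configuration.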