# WaterLily-RL Tutorial - Getting Started
WaterLily-RL is a Deep Reinforcement Learning (DRL) simulation framework focusing on fluid dynamics. 
It trains the agent in a virtual fluid environment created by WaterLily, a simple and fast fluid simulator.

## Introduction
This tutorial walks through a basic DRL task in fluid dynamics: an agent learns to apply a direct force 
that suppresses the vibration caused by the incoming flow (Vortex-Induced Vibration, VIV).

## Install Python Dependencies Using Pip
The full list of Python dependencies is in setup.py; follow the instructions in the README to install them.

## Install Julia Dependencies
The full list of Julia dependencies is in the README.

## Multi-threads
On Linux, you can start Julia with multiple threads to accelerate the simulation. 
Multi-threading is not recommended on Windows, since it may cause rendering issues.

In [None]:
import os
os.environ["JULIA_NUM_THREADS"] = "8"  # request 8 Julia threads; set this before Julia starts
from julia import Julia
jl = Julia(compiled_modules=False)
from julia import Main
print(Main.eval("Threads.nthreads()"))
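
The thread count can also be chosen from the machine instead of being hard-coded. The sketch below is one possible way to do that, not part of WaterLily-RL: `pick_julia_threads` is a hypothetical helper that caps the request at the number of available CPUs and falls back to a single thread on Windows, following the advice above.

```python
import os
import platform


def pick_julia_threads(requested, system=None, available=None):
    """Return a thread count as a string suitable for JULIA_NUM_THREADS."""
    system = platform.system() if system is None else system
    available = (os.cpu_count() or 1) if available is None else available
    if system == "Windows":
        # The tutorial advises against multi-threading on Windows.
        return "1"
    # Never request more threads than CPUs, and never fewer than one.
    return str(max(1, min(requested, available)))


# This has to run before `from julia import Julia` to take effect.
os.environ["JULIA_NUM_THREADS"] = pick_julia_threads(8)
print(os.environ["JULIA_NUM_THREADS"])
```

The `system` and `available` arguments exist only so the helper can be checked without depending on the current machine.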


## Imports
Since we use Stable-Baselines3 as the benchmark DRL library, 'gymnasium' and 'stable_baselines3' 
are required. PPO is imported here because the whole tutorial is based on it.

In [None]:


"""

Library imports

"""
import os
from julia import Julia
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines3.common.callbacks import BaseCallback, CheckpointCallback, CallbackList
from src.gym_base import JuliaEnv


## Return the episodic reward and Set the checkpoint

To recover from an interruption during training, or to continue 
a finished training run, you need to set up checkpoints and pass them to the callback later.

In [None]:
"""

Log episode rewards and set up checkpoints

"""
class RewardLoggerCallback(BaseCallback):
    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_rewards = []
        self.current_rewards = None
        self.episode_steps = []          # number of steps in each finished episode
        self.current_steps = None

    def _on_training_start(self) -> None:
        self.current_rewards = np.zeros(self.training_env.num_envs)
        self.current_steps = np.zeros(self.training_env.num_envs, dtype=int)

    def _on_step(self) -> bool:
        rewards = self.locals["rewards"]
        dones = self.locals["dones"]
        self.current_rewards += rewards
        self.current_steps += 1   # count every step in every env


        for i, done in enumerate(dones):
            if done:
                self.episode_rewards.append(self.current_rewards[i])
                self.episode_steps.append(self.current_steps[i])  # record the step count

                print(f"Episode finished after {self.current_steps[i]} steps")
                print(f"Episode reward: {self.current_rewards[i]:.2f}")
                # reset
                self.current_rewards[i] = 0.0
                self.current_steps[i] = 0

        return True

checkpoint_callback = CheckpointCallback(
    save_freq= 10000,
    save_path="./checkpoints/",
    name_prefix="ppo_model",
    save_replay_buffer=True,
    save_vecnormalize=True
)
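
When resuming later, it helps to locate the most recent checkpoint automatically. The following is a minimal sketch, assuming the files follow the `<name_prefix>_<steps>_steps.zip` pattern that the checkpoint path used later in this tutorial (`ppo_model_100000_steps`) suggests. `latest_checkpoint` is a hypothetical helper, and the demo uses empty placeholder files rather than real models.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(save_path, name_prefix):
    """Return the checkpoint path with the highest step count, or None."""
    pattern = re.compile(rf"^{re.escape(name_prefix)}_(\d+)_steps\.zip$")
    best_steps, best_path = -1, None
    for path in Path(save_path).glob("*.zip"):
        match = pattern.match(path.name)
        if match and int(match.group(1)) > best_steps:
            best_steps, best_path = int(match.group(1)), path
    return best_path


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for steps in (10000, 20000, 100000):
        (Path(tmp) / f"ppo_model_{steps}_steps.zip").touch()
    print(latest_checkpoint(tmp, "ppo_model").name)
```

Comparing the step counts as integers matters here: a plain alphabetical sort would rank `20000` above `100000`.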

## Parameters
The three dicts "statics", "variables" and "spaces" hold the essential parameters, which must be
set before running. See 'VIV_gym' in the 'src' folder for the source code that refers to the Julia
file.

In [None]:
"""

Training parameters (VIV)

"""
diameter = 16
def pos_generator():
    return [0.0, np.random.uniform(- diameter/6, diameter/6)]

# static parameters
statics = {
    "L_unit": diameter,                         # dimension of the object (diameter of the circle)
    "action_scale": 50,                         # amplifies the action from [-1,1] to [-50,50]
    "size": [10, 8],                            # size ratio of the simulation env
    "location": [3, 4]                          # location of the object in the simulation env (size ratio)
}
# variable parameters
variables = {
    "position":[0.0, diameter/6],               # manual position of the object with respect to the initial location
    "velocity":[0.0, 0.0]                       # initial velocity
}
# size of the action space and the observation space
spaces = {
    "action":1,                                 # action space
    "observation":3                             # observation spaces
}

from src.gym_base import VIVEnv
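
To make the role of "action_scale" concrete, here is a small sketch of the mapping described in the comment above (a policy output in [-1, 1] amplified to [-50, 50]). How the Julia environment applies the scale internally is an assumption here, and `scale_action` is a hypothetical helper, not a function from 'src'.

```python
def scale_action(raw_action, action_scale):
    """Clip a policy output to [-1, 1], then scale it to a physical force."""
    clipped = max(-1.0, min(1.0, float(raw_action)))
    return clipped * action_scale


# Out-of-range outputs are clipped before scaling.
for raw in (-1.0, 0.25, 3.0):
    print(raw, "->", scale_action(raw, 50))
```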

## Create the env and instantiate the agent
### Reference the WaterLily env and build the model
For this example, we use the basic VIV scenario as the simulation env. We therefore
pass 'VIVEnv' as 'env' together with the previously defined parameters
(statics, variables and spaces). It is recommended to set 'render_mode' to 
'None', since creating images slows training down considerably.

Then, as mentioned before, we use the PPO algorithm with the "MlpPolicy" policy, 
since the observation is a vector rather than an image.

In [None]:
# Build the env and model
env = DummyVecEnv([lambda: JuliaEnv(render_mode=None, env = VIVEnv, max_episode_steps=2000, statics = statics, 
                                    variables = variables, spaces = spaces, verbose=1)])

model = PPO(
    "MlpPolicy",
    env=env,
    verbose=1,
    device = 'cpu'
)

Next, build the callback list from the checkpoint callback and the reward logger defined earlier.

In [None]:
# set the checkpoint and reward sum in callback
reward_callback = RewardLoggerCallback()
callback = CallbackList([checkpoint_callback, reward_callback])

### Train the agent and save it

In [None]:
# train and save the agent
model.learn(total_timesteps=100_000, callback = callback)
model.save("./model/PPO_model")
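
A note on step counts: as far as I understand Stable-Baselines3, PPO collects experience in whole rollouts of `n_steps * n_envs` timesteps (`n_steps` defaults to 2048) and stops once the requested total is reached, so training usually runs slightly past `total_timesteps`. The sketch below works through that arithmetic under this assumption; both helper names are hypothetical.

```python
import math


def actual_timesteps(total_timesteps, n_steps=2048, n_envs=1):
    """Timesteps actually collected when training stops at a rollout boundary."""
    rollout = n_steps * n_envs
    return math.ceil(total_timesteps / rollout) * rollout


def checkpoint_steps(total_timesteps, save_freq, n_steps=2048, n_envs=1):
    """Timestep counts at which a checkpoint saved every `save_freq` calls appears."""
    end = actual_timesteps(total_timesteps, n_steps, n_envs)
    step = save_freq * n_envs  # one callback call advances every env by one step
    return list(range(step, end + 1, step))


print(actual_timesteps(100_000))              # 100352
print(checkpoint_steps(100_000, 10_000)[-1])  # 100000
```

Under these assumptions, a single environment with `save_freq=10000` produces a checkpoint at exactly 100000 steps, which is the file the resume step later in this tutorial loads.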

In [None]:
# collect the rewards and close the env
rewards = np.array(reward_callback.episode_rewards)
np.save('./result/data/rewards.npy', rewards)
env.close()

## Evaluate the training process
### Image
After collecting the rewards, you can plot them over the training process here.
Apparently, the reward has not converged and the performance is poor. Don't worry:
we can continue training from the last checkpoint later.

In [None]:
"""

Plotting

"""
import matplotlib.pyplot as plt

# load the rewards
rewards = np.load('./result/data/rewards.npy')
# param: the sliding window for the mean and std
window = 10

def plot_rewards(rewards, window=100):
    # calculate the means and stds
    def moving_avg(x, w):
        return np.convolve(x, np.ones(w)/w, mode='valid')

    mean = moving_avg(rewards, window)
    std = np.array([
        np.std(rewards[max(0, i - window + 1):i + 1])
        for i in range(window - 1, len(rewards))
    ])

    # x-axis
    x = np.arange(window - 1, len(rewards))

    # plot
    plt.figure(figsize=(12, 6))
    plt.plot(x, mean, label='Mean Reward')
    plt.fill_between(x, mean - std, mean + std, alpha=0.3, label='±1 Std Dev')
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.title("Episode Reward over Training")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

plot_rewards(rewards,window)
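
The statistics behind the plot can be checked on a tiny input without opening a figure. This sketch reuses the same approach as `plot_rewards` (a 'valid' convolution for the mean and a trailing window for the std); `rolling_stats` is a hypothetical helper and the input values are made up.

```python
import numpy as np


def rolling_stats(values, window):
    """Return (x, mean, std) over a trailing window of `window` episodes."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    mean = np.convolve(values, np.ones(window) / window, mode="valid")
    std = np.array([values[i - window + 1:i + 1].std()
                    for i in range(window - 1, len(values))])
    # Each point is plotted at the last episode of its window.
    x = np.arange(window - 1, len(values))
    return x, mean, std


x, mean, std = rolling_stats([1, 2, 3, 4, 5], window=3)
print(x, mean, std)
```

With N episodes and a window of w, all three arrays have N - w + 1 entries, so a run with fewer episodes than the window cannot be plotted.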

### GIF & info
You probably want to watch the trained agent move in a graphics window, or as a GIF of 
the whole motion. You can create the GIF here by running the trained model in
our simulation env.

Moreover, if you want to analyse the dynamics of the trained agent, you can save
the information defined in the Julia files into an npy file.

In [None]:
# create gif after training
from src.gif import create_GIF

# same simulation env while 'render_mode' is 'rgb_array' to create images
env = JuliaEnv(render_mode="rgb_array", env = VIVEnv, max_episode_steps=2000, statics = statics, variables = variables, spaces = spaces, verbose=True)

# load the trained PPO_model
model = PPO.load("./model/PPO_model", env=env)

# video frame
frames = []

# reset the env
obs, _ = env.reset()

done = False
truncated = False

# if 'not done', then continue to perform the simulation operation based on trained model
while not done and not truncated:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = env.step(action)

# save as gif
input_frame = "images"
output_gif = "./result/gif/train_policy_demo.gif"
create_GIF(input_frame, output_gif)
env.close()

# save the info
np.save("./result/data/info_PPO.npy", info["info"])


### Analysis
Now you have an npy file that contains all the dynamic information, so you can plot a 
line chart of the fluid force, the force applied by the agent and the displacement in the y direction.
The result is not yet satisfactory because training has not converged; you can now 
continue training from the checkpoint.


In [None]:
import matplotlib.pyplot as plt

info = np.load("./result/data/info_PPO.npy", allow_pickle=True)
force = [f["F"] for f in info[50:]]
y_force = [f["fluid_force_y"] for f in info[50:]]
x_force = [f["fluid_force_x"] for f in info[50:]]
y_dis = [f["y_dis"] for f in info[50:]]
x_dis = [f["x_dis"] for f in info[50:]]

x = np.arange(len(y_force))
# x2 = np.arange(len(y_dis2))

# plot
plt.figure(figsize=(8, 5))
plt.plot(x, force, label="y_force", color="red")
plt.plot(x, y_force, label="y_fluid", color="blue")
plt.plot(x, y_dis, label="y_displacement", color="green")

# legend, labels, title
plt.xlabel("step")
plt.ylabel("force & displacement")
plt.title("Force and Displacement in y direction")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
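
Besides the line chart, a single number makes it easier to compare runs. One option is the root-mean-square (RMS) of the y-displacement, sketched below. The records are made-up placeholders shaped like the saved info entries (only the "y_dis" key is taken from the code above), and `rms` is a hypothetical helper.

```python
import math


def rms(values):
    """Root-mean-square of a sequence of numbers."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical records shaped like the saved info entries (made-up numbers).
info_demo = [{"y_dis": 3.0}, {"y_dis": -4.0}, {"y_dis": 0.0}]
y_dis_demo = [record["y_dis"] for record in info_demo]
print(rms(y_dis_demo))
```

A lower RMS after further training would indicate better vibration suppression.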

## Callback
Training did not end with a converged reward. You can
resume from the checkpoint at 100k steps and train for another
200k steps now. Once converged, the episode reward sum should be around -10.


In [None]:
"""

Load the checkpoint and continue training

"""
env = DummyVecEnv([lambda: JuliaEnv(render_mode=None, env = VIVEnv, max_episode_steps=2000, statics = statics, 
                                    variables = variables, spaces = spaces, verbose=1)])

reward_callback = RewardLoggerCallback()
callback = CallbackList([checkpoint_callback, reward_callback])

model = PPO.load("./checkpoints/ppo_model_100000_steps", env=env, device='cpu')
model.learn(total_timesteps=200_000, callback = callback)
rewards_ex = np.array(reward_callback.episode_rewards)
rewards = np.load('./result/data/rewards.npy')
rewards = np.concatenate([rewards, rewards_ex])
np.save('./result/data/rewards.npy', rewards)
model.save("./model/PPO_model_100k-300k")
env.close()
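
The reward bookkeeping above (load the old history, concatenate, save) can be wrapped in a small helper so that it also works on the first run, when no file exists yet. This is a sketch with a hypothetical helper name and made-up reward values, run against a temporary directory.

```python
import tempfile
from pathlib import Path

import numpy as np


def append_rewards(path, new_rewards):
    """Append new episode rewards to the array stored at `path` and save it back."""
    path = Path(path)
    new_rewards = np.asarray(new_rewards, dtype=float)
    if path.exists():
        new_rewards = np.concatenate([np.load(path), new_rewards])
    # Create the target folder if it is missing, then overwrite the file.
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, new_rewards)
    return new_rewards


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "rewards.npy"
    append_rewards(target, [-120.0, -95.5])   # first run (made-up values)
    merged = append_rewards(target, [-40.0])  # resumed run
    print(merged)
```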

Now plot the mean and std of the rewards. You can see that training has nearly converged.
Apparently, the agent has learned a policy that applies a force to counteract the lift 
force, keeping itself around y=0.

In [None]:
"""

Reward Image

"""
# load the rewards
rewards = np.load('./result/data/rewards.npy')
# param: the sliding window for the mean and std
window = 10

def plot_rewards(rewards, window=100):
    episode = np.arange(len(rewards))

    # calculate the means and stds
    def moving_avg(x, w):
        return np.convolve(x, np.ones(w)/w, mode='valid')

    mean = moving_avg(rewards, window)
    std = np.array([
        np.std(rewards[max(0, i - window + 1):i + 1])
        for i in range(window - 1, len(rewards))
    ])

    # x-axis
    x = np.arange(window - 1, len(rewards))

    # plot
    plt.figure(figsize=(12, 6))
    plt.plot(x, mean, label='Mean Reward')
    plt.fill_between(x, mean - std, mean + std, alpha=0.3, label='±1 Std Dev')
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.title("Episode Reward over 300k Steps of Training")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

plot_rewards(rewards,window)

In [None]:
"""

GIF

"""
# same simulation env while 'render_mode' is 'rgb_array' to create images
env = JuliaEnv(render_mode="rgb_array", env = VIVEnv, max_episode_steps=2000, statics = statics, variables = variables, spaces = spaces, verbose=True)

# load the trained PPO_model
model = PPO.load("./model/PPO_model_100k-300k", env=env)

# video frame
frames = []

# reset the env
obs, _ = env.reset()

done = False
truncated = False

# if 'not done', then continue to perform the simulation operation based on trained model
while not done and not truncated:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = env.step(action)

# save as gif
input_frame = "images"
output_gif = "./result/gif/train_policy_demo.gif"
create_GIF(input_frame, output_gif)
env.close()

# save the info
np.save("./result/data/info_PPO.npy", info["info"])


You will find that the agent is fully trained, and the y-displacement stays
close to 0 at all times.

In [None]:
info = np.load("./result/data/info_PPO.npy", allow_pickle=True)
force = [f["F"] for f in info[50:]]
y_force = [f["fluid_force_y"] for f in info[50:]]
x_force = [f["fluid_force_x"] for f in info[50:]]
y_dis = [f["y_dis"] for f in info[50:]]
x_dis = [f["x_dis"] for f in info[50:]]

x = np.arange(len(y_force))
# x2 = np.arange(len(y_dis2))

# plot
plt.figure(figsize=(8, 5))
plt.plot(x, force, label="y_force", color="red")
plt.plot(x, y_force, label="y_fluid", color="blue")
plt.plot(x, y_dis, label="y_displacement", color="green")

# legend, labels, title
plt.xlabel("step")
plt.ylabel("force & displacement")
plt.title("Force and Displacement in y direction")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()