In this notebook, we will show our final policy.

Our policy can be expressed by the following formula (with the weights of the strategy represented to two decimal places):

\begin{align}
o_{0} &= - \tanh{\left(0.05 i_{14} - 3.38 \right)}\newline
o_{1} &= \tanh{\left(0.69 i_{9} + 3.21 \right)}\newline
o_{2} &= \tanh{\left(18.46 i_{2} + 5.74 \right)}\newline
o_{3} &= - \tanh{\left(- 18.36 i_{0} + 2.49 i_{1} + 0.38 i_{10} + 10.16 i_{5} + 3.85 i_{7} + 2.91 i_{8} + 26.87 \right)}\newline
o_{4} &= - \tanh{\left(98.39 i_{0} + 0.15 i_{14} - 122.89 \right)}\newline
o_{5} &= \tanh{\left(16.54 i_{0} - 0.46 i_{10} + 0.38 i_{14} + 3.42 i_{5} + 3.37 i_{7} + 7.29 i_{9} - 10.30 \right)}\newline
\end{align}

Write the policy as a Python function using NumPy:

In [1]:
import numpy as np
from numpy import tanh, clip
np.random.seed(0)

output_cnt = 6

def policy(i):
    o = np.zeros(output_cnt)
    
    o[0] = -tanh(0.046154*i[14] - 3.378302)
    o[1] = tanh(0.691099*i[9] + 3.213468)
    o[2] = tanh(18.460631*i[2] + 5.735261)
    o[3] = -tanh(-18.364946*i[0] + 2.486137*i[1] + 0.379982*i[10] + 10.157374*i[5] + 3.845969*i[7] + 2.913171*i[8] + 26.870047)
    o[4] = -tanh(98.38753*i[0] + 0.15219*i[14] - 122.89392)
    o[5] = tanh(16.544113*i[0] - 0.46117*i[10] + 0.379904*i[14] + 3.416607*i[5] + 3.370985*i[7] + 7.287278*i[9] - 10.300297)
    return o
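
The same policy can also be read as a single sparse affine map followed by `tanh`, i.e. `o = sign * tanh(W @ i + b)`. The block below is a minimal sketch of that view, with the coefficients copied from the function above. The names `weights`, `bias`, `sign`, `W`, `policy_matrix`, and `OBS_DIM` are illustrative and not part of the original notebook, and `OBS_DIM = 17` assumes the 17-dimensional Walker2d-v4 observation.

```python
import numpy as np

# Sparse weights copied from the policy function: {output index: {input index: weight}}
weights = {
    0: {14: 0.046154},
    1: {9: 0.691099},
    2: {2: 18.460631},
    3: {0: -18.364946, 1: 2.486137, 10: 0.379982, 5: 10.157374, 7: 3.845969, 8: 2.913171},
    4: {0: 98.38753, 14: 0.15219},
    5: {0: 16.544113, 10: -0.46117, 14: 0.379904, 5: 3.416607, 7: 3.370985, 9: 7.287278},
}
# Constant term inside each tanh, and the sign in front of each tanh.
bias = np.array([-3.378302, 3.213468, 5.735261, 26.870047, -122.89392, -10.300297])
sign = np.array([-1.0, 1.0, 1.0, -1.0, -1.0, 1.0])

OBS_DIM = 17  # assumption: observation size of Walker2d-v4

# Build the dense 6 x 17 weight matrix; unused inputs keep a zero column.
W = np.zeros((len(weights), OBS_DIM))
for out_idx, row in weights.items():
    for in_idx, w in row.items():
        W[out_idx, in_idx] = w

def policy_matrix(obs):
    """Vectorized form of the policy: sign * tanh(W @ obs + bias)."""
    return sign * np.tanh(W @ np.asarray(obs) + bias)

# Observation indices that actually influence the action.
used_inputs = np.flatnonzero(np.abs(W).sum(axis=0))
print("inputs used:", used_inputs.tolist())
```

Because every output passes through `tanh`, each action component stays within [-1, 1], and the zero columns of `W` show which observation entries the policy ignores.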

Evaluate our policy once. 

Note that we create the environment with `render_mode="rgb_array_list"`, so it renders and stores a frame at every step. This makes the run noticeably slower than an evaluation without rendering.

In [2]:
import gymnasium as gym

def evaluate_once(env, seed, policy, episode_length):
    observation, info = env.reset(seed=seed)
    total_reward = 0
    for _ in range(episode_length):
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward

env = gym.make("Walker2d-v4", render_mode="rgb_array_list", height=400, width=400)
reward = evaluate_once(env, seed=0, policy=policy, episode_length=1000)
rgb_array_list = env.render()
env.close()

print("total_reward: ", reward)

total_reward:  3465.600985371167


Visualize the episode.

Save the animation as a gif file and then play it!

In [3]:
import imageio
import IPython.display as display

# Keep one out of every five frames to speed up the animation playback.
imageio.mimsave('../results/animation.gif', rgb_array_list[::5], fps=60, loop=0)
# display.Image(filename='../results/animation.gif')

Evaluate our policy 150 * 8 = 1200 times.

Note that `multiprocess_gym` is a gym environment that uses `multiprocessing` to run many environments in parallel across different CPU cores.

`worker_num` refers to the number of workers, and it is recommended to set this to the number of your CPU cores.

`env_num_per_worker` is the number of environments each worker runs.

The total number of evaluations will be `worker_num * env_num_per_worker`.
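
For readers without `multiprocess_gym`, the same kind of evaluation can be approximated in a single process. The block below is a rough sketch and not the `multiprocess_gym` API: `evaluate_many` and `make_env` are hypothetical names, and giving each episode its own seed is an assumption about how the episodes differ.

```python
import numpy as np

def evaluate_many(make_env, policy, seeds, episode_length=1000):
    """Run one episode per seed, sequentially, and return the episode returns."""
    returns = []
    for seed in seeds:
        env = make_env()  # fresh environment for each episode
        observation, info = env.reset(seed=seed)
        total_reward = 0.0
        for _ in range(episode_length):
            action = policy(observation)
            observation, reward, terminated, truncated, info = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        env.close()
        returns.append(total_reward)
    return np.array(returns)

# Hypothetical usage (requires gymnasium with MuJoCo installed):
# import gymnasium as gym
# rewards = evaluate_many(lambda: gym.make("Walker2d-v4"), policy, seeds=range(150 * 8))
```

This runs the 1200 episodes one after another, so it is much slower than the parallel version and is mainly useful as a cross-check.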

To use `multiprocess_gym`, you need `psutil`, which is used to pin each worker process to a separate CPU core:
```
psutil==5.9.8
```


In [4]:
from multiprocess_gym import MultiProcessEnv
env = MultiProcessEnv(worker_num=150, env_num_per_worker=8, env_name="Walker2d-v4", policy_func=lambda args, obs: policy(obs), can_jit=False)
rewards = env.examine([0])
env.close()

rewards = np.array(rewards)
print(f"mean: {rewards.mean()}, std: {rewards.std()}, max: {rewards.max()}, min: {rewards.min()}")

examining policy for 1200 times
mean: 3309.392333984375, std: 400.9336242675781, max: 3484.778076171875, min: 1363.0408935546875


After 1200 evaluations, our policy obtains an average score of about 3309.
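
To get a feel for how precise this average is, one can compute the standard error of the mean from the printed statistics. This is a minimal sketch that assumes the 1200 episodes are independent and uses a normal approximation for the interval. The variable names are illustrative, and the numbers are copied from the output above.

```python
import math

# Values copied from the printed evaluation output.
n = 1200
mean = 3309.392333984375
std = 400.9336242675781

# Standard error of the mean, assuming independent episodes.
sem = std / math.sqrt(n)

# Approximate 95% interval for the true mean return (normal approximation).
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem

print(f"standard error: {sem:.2f}")
print(f"approx. 95% interval for the mean: [{ci_low:.1f}, {ci_high:.1f}]")
```

The standard deviation is large compared with the gap between the mean and the maximum, which fits the printed minimum of about 1363: a small number of episodes end with much lower returns.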