# Laboratory 7

The goal of the seventh laboratory is to become familiar with and implement a deep reinforcement learning algorithm, Actor-Critic. The implemented algorithm will be tested in the OpenAI Gym *CartPole* environment.


Importing the standard libraries

In [1]:
from collections import deque
import gym
import numpy as np
import random
from tqdm import tqdm

Importing the neural network libraries

In [2]:
import tensorflow as tf

## Task 1 - Actor-Critic

<p style='text-align: justify;'>
The goal of this exercise is to implement the Actor-Critic algorithm. To do so, create two deep neural networks:
    1. *actor* - a network that learns the optimal policy (similar to the one from laboratory 6),
    2. *critic* - a network that learns the state-value function (similarly to DQN).
The weights of the *actor* network are updated according to:
\begin{equation*}
    \theta \leftarrow \theta + \alpha \delta_t \nabla_\theta \log \pi_{\theta}(a_t | s_t).
\end{equation*}
The weights of the *critic* network are updated according to:
\begin{equation*}
    w \leftarrow w + \beta \delta_t \nabla_w\upsilon(s_t, w),
\end{equation*}
where $\delta_t$ is the temporal-difference (TD) error:
\begin{equation*}
    \delta_t \leftarrow r_t + \gamma \upsilon(s_{t + 1}, w) - \upsilon(s_t, w).
\end{equation*}
</p>
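The update rules above can be illustrated on a tiny tabular problem before moving to neural networks. The following is a minimal sketch, not the laboratory solution: it assumes a tabular critic `v[s]` and a softmax actor with per-state action preferences `theta[s][a]`. The names `softmax` and `actor_critic_step` and the step sizes are hypothetical and chosen only for illustration.

```python
import math


def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      alpha=0.1, beta=0.5, gamma=0.99):
    """Apply one tabular Actor-Critic update in place and return delta."""
    # TD error: the bootstrap term is dropped for terminal transitions
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]

    # Critic: for a tabular v, the gradient of v(s_t, w) is 1 for entry s
    v[s] += beta * delta

    # Actor: d/d theta[s][b] of log pi(a|s) equals 1[a == b] - pi(b|s)
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha * delta * grad_log
    return delta


# Two states, two actions, everything initialised to zero (illustrative values)
theta = [[0.0, 0.0], [0.0, 0.0]]
v = [0.0, 0.0]
delta = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False)
print(delta, v, theta)
```

A positive `delta` raises the preference for the action that was taken and lowers the others, and it moves the value of the visited state towards the target.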

In [3]:
class REINFORCEAgent:  # despite the name, this agent implements Actor-Critic
    def __init__(self, state_size, action_size, actor, critic):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = 0.99  # discount rate
        self.actor = actor
        self.critic = critic  # critic network should have only one output
        self.state_memory = []
        self.action_memory = []
        self.reward_memory = []
        self.next_state_memory = []
        self.done_memory = []

    def get_action(self, state):
        """
        Compute the action to take in the current state, based on the policy returned by the actor network.

        Note: the action is sampled according to the probabilities generated by the network.
        """

        prediction = self.actor.predict_on_batch(state)
        # use the agent's own action_size instead of relying on a global variable
        action = np.random.choice(np.arange(self.action_size), p=prediction[0])

        return action

    def get_cumulative_rewards(self):
        """
        based on https://github.com/yandexdataschool/Practical_RL/blob/spring20/week06_policy_based/reinforce_tensorflow.ipynb
        take a list of immediate rewards r(s,a) for the whole session
        compute cumulative rewards R(s,a) (a.k.a. G(s,a) in Sutton '16)
        R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...

        The simple way to compute cumulative rewards is to iterate from last to first time tick
        and compute R_t = r_t + gamma*R_{t+1} recurrently

        You must return an array/list of cumulative rewards with as many elements as in the initial rewards.
        """

        n_rewards = len(self.reward_memory)
        cumulative_rewards = [0] * n_rewards
        cumulative_rewards[-1] = self.reward_memory[-1]
        for reward_idx in range(n_rewards - 2, -1, -1):
            cumulative_rewards[reward_idx] = self.reward_memory[reward_idx] + self.gamma * cumulative_rewards[reward_idx+1]
        return tf.convert_to_tensor(cumulative_rewards, dtype=tf.float32)

    def remember(self, state, next_state, action, reward, done):
        self.state_memory.append(state)
        self.action_memory.append(action)
        self.reward_memory.append(reward)
        self.next_state_memory.append(next_state)
        self.done_memory.append(done)

    def learn(self):
        """
        Train both networks on the stored transitions (state, action, reward, next state).
        First, the values of state and next_state are estimated by the critic network.
        The critic network is trained towards the target value:
        target = r + gamma * next_state_value if not done
        target = r if done.
        The actor network is trained using the delta value (TD error):
        delta = target - state_value
        """

        for idx in range(len(self.state_memory)):
            with tf.GradientTape(persistent=True) as tape:
                actor_state_prediction = self.actor(self.state_memory[idx], training=True)
                critic_state_prediction = self.critic(self.state_memory[idx], training=True)
                critic_next_state_prediction = self.critic(self.next_state_memory[idx], training=True)

                # TD error; the bootstrap term is dropped for terminal transitions
                not_done = 1 - int(self.done_memory[idx])
                target = self.reward_memory[idx] + self.gamma * critic_next_state_prediction * not_done
                delta = target - critic_state_prediction

                log_prob = tf.math.log(actor_state_prediction[0, self.action_memory[idx]])
                actor_loss = -log_prob * delta
                # no stop_gradient on the target: the critic gradient also flows through the next-state value
                critic_loss = delta ** 2

            grads_actor = tape.gradient(actor_loss, self.actor.trainable_variables)
            grads_critic = tape.gradient(critic_loss, self.critic.trainable_variables)
            del tape  # release the resources held by the persistent tape

            self.actor.optimizer.apply_gradients(zip(grads_actor, self.actor.trainable_variables))
            self.critic.optimizer.apply_gradients(zip(grads_critic, self.critic.trainable_variables))

        # clear the episode memory after learning
        self.state_memory = []
        self.action_memory = []
        self.reward_memory = []
        self.next_state_memory = []
        self.done_memory = []
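The agent keeps `get_cumulative_rewards`, which computes full discounted returns, while `learn` relies on one-step targets bootstrapped from the critic. The sketch below contrasts the two quantities on a short, made-up episode. It uses plain Python lists instead of tensors, and the helper names `discounted_returns` and `td_targets` as well as the critic estimates in `next_values` are hypothetical values chosen for illustration.

```python
def discounted_returns(rewards, gamma):
    """R_t = r_t + gamma * R_{t+1}, computed from the last step backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_targets(rewards, next_values, dones, gamma):
    """One-step target: r_t + gamma * v(s_{t+1}), with no bootstrap when done."""
    return [r + gamma * v_next * (1 - int(d))
            for r, v_next, d in zip(rewards, next_values, dones)]


rewards = [1.0, 1.0, 1.0]
next_values = [2.0, 1.0, 5.0]  # hypothetical critic estimates of v(s_{t+1})
dones = [False, False, True]
gamma = 0.5

mc = discounted_returns(rewards, gamma)
td = td_targets(rewards, next_values, dones, gamma)
print("Monte Carlo returns:", mc)
print("TD targets:", td)
```

The returns need the whole episode before any of them can be computed, whereas each TD target depends only on a single transition and the current critic estimate. On the last step both agree, because the `done` flag removes the bootstrap term.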

Time to prepare the network models that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [4]:
env = gym.make("CartPole-v1").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
alpha_learning_rate = 0.0001
beta_learning_rate = 0.0005

actor = tf.keras.Sequential()
actor.add(tf.keras.layers.Dense(100, input_shape=(state_size,), activation='relu'))
actor.add(tf.keras.layers.Dense(200, activation='relu'))
actor.add(tf.keras.layers.Dense(action_size, activation='softmax'))
actor.compile(loss=tf.keras.losses.huber, optimizer=tf.keras.optimizers.Adam(learning_rate=alpha_learning_rate), run_eagerly=True)

critic = tf.keras.Sequential()
critic.add(tf.keras.layers.Dense(100, input_shape=(state_size,), activation='relu'))
critic.add(tf.keras.layers.Dense(200, activation='relu'))
critic.add(tf.keras.layers.Dense(1, activation='linear'))
critic.compile(loss=tf.keras.losses.huber, optimizer=tf.keras.optimizers.Adam(learning_rate=beta_learning_rate), run_eagerly=True)

Time to train the agent in the *CartPole* environment:

In [5]:
agent = REINFORCEAgent(state_size, action_size, actor, critic)


for epoch in range(100):
    score_history = []

    # separate loop variable so the inner loop does not shadow the outer one
    for episode in tqdm(range(100)):
        done = False
        score = 0
        state = env.reset()[0]
        state = tf.convert_to_tensor(state[np.newaxis, :], dtype=tf.float32)
        while not done:
            action = agent.get_action(state)
            next_state, reward, done, _, _ = env.step(action)
            next_state = tf.convert_to_tensor(next_state[np.newaxis, :], dtype=tf.float32)
            agent.remember(state, next_state, action, reward, done)
            state = next_state
            score += reward

        agent.learn()
        score_history.append(score)

    print("mean reward:%.3f" % (np.mean(score_history)))

    if np.mean(score_history) > 300:
        print("You Win!")
        break

100%|██████████| 100/100 [02:24<00:00,  1.44s/it]


mean reward:30.550


100%|██████████| 100/100 [18:13<00:00, 10.94s/it]


mean reward:235.190


100%|██████████| 100/100 [50:26<00:00, 30.27s/it]   

mean reward:655.480
You Win!



