# Laboratory 7

The goal of the seventh laboratory is to study and implement Actor-Critic, a deep reinforcement learning algorithm. The implementation is tested on the OpenAI Gym environment *CartPole*.


Importing the standard libraries

In [1]:
from collections import deque
import gym
import numpy as np
import random

Importing the neural network libraries

In [2]:
from keras import backend as K
from keras.layers import Dense, Activation, Input
from keras.models import Model, load_model
from keras.optimizers import Adam


Using TensorFlow backend.


## Zadanie 1 - Actor-Critic

<p style='text-align: justify;'>
The goal of this exercise is to implement the Actor-Critic algorithm. To do so, create two deep neural networks:
    1. *actor* - a network that learns the optimal policy (similar to the one from laboratory 6),
    2. *critic* - a network that learns the state-value function (trained similarly to a DQN).
The weights of the *actor* network are updated according to:
\begin{equation*}
    \theta \leftarrow \theta + \alpha \delta_t \nabla_\theta \log \pi_{\theta}(a_t | s_t).
\end{equation*}
The weights of the *critic* network are updated according to:
\begin{equation*}
    w \leftarrow w + \beta \delta_t \nabla_w\upsilon(s_t, w),
\end{equation*}
where $\delta_t$ is the temporal-difference (TD) error:
\begin{equation*}
    \delta_t = r_t + \gamma \upsilon(s_{t + 1}, w) - \upsilon(s_t, w).
\end{equation*}
</p>
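
The three update rules above can be traced by hand on a tiny example. The sketch below is not the notebook's Keras implementation. It assumes a linear critic, $\upsilon(s, w) = w \cdot s$, and a softmax policy with one weight vector per action, so that both gradients have a closed form. It also assumes the critic's gradient is evaluated at the current state $s_t$, as in the standard semi-gradient TD(0) update. The names `softmax`, `dot` and `actor_critic_step`, and the step sizes of 0.1, are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


def actor_critic_step(theta, w, s, a, r, s_next, done,
                      alpha=0.1, beta=0.1, gamma=0.99):
    """One update of a linear actor and critic (hypothetical helper).

    theta: one weight vector per action, pi(a|s) = softmax_a(theta[a] . s)
    w:     critic weights, v(s) = w . s
    Updates theta and w in place and returns the TD error delta_t.
    """
    v_s = dot(w, s)
    v_next = 0.0 if done else dot(w, s_next)  # no bootstrapping at the end
    delta = r + gamma * v_next - v_s          # TD error

    # Action probabilities under the policy *before* the update.
    probs = softmax([dot(th, s) for th in theta])

    # Critic: for a linear v, the gradient of v(s_t, w) w.r.t. w is s_t.
    for i in range(len(w)):
        w[i] += beta * delta * s[i]

    # Actor: for a softmax policy, the gradient of log pi(a|s) w.r.t.
    # theta[b] is (1[a == b] - pi(b|s)) * s.
    for b, th in enumerate(theta):
        coeff = (1.0 if b == a else 0.0) - probs[b]
        for i in range(len(th)):
            th[i] += alpha * delta * coeff * s[i]
    return delta


# Two actions, two state features, all weights initially zero.
theta = [[0.0, 0.0], [0.0, 0.0]]
w = [0.0, 0.0]
delta = actor_critic_step(theta, w, s=[1.0, 0.5], a=1, r=1.0,
                          s_next=[0.9, 0.4], done=False)
```

With all weights at zero both value estimates are 0, so the TD error equals the reward. The positive error raises the critic's estimate for the visited state and shifts probability towards the action that was taken.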

In [6]:
class ActorCriticAgent:
    def __init__(self, action_size, actor, critic, policy):
        self.gamma = 0.99
        self.action_size = action_size

        self.actor = actor
        self.critic = critic
        self.policy = policy
        self.action_space = [i for i in range(action_size)]

    def get_action(self, state):
        """
        Choose the action to take in the current state, based on the policy
        returned by the network.

        Note: the action is sampled according to the probabilities produced
        by the network; it is not picked greedily.
        """
        # Add a batch dimension: the network expects shape (1, state_size).
        state = state[np.newaxis, :]
        probabilities = self.policy.predict(state)[0]
        return np.random.choice(self.action_space, p=probabilities)
  

    def learn(self, state, action, reward, next_state, done):
        """
        Train both networks on a single transition (state, action, reward, next_state).
        First, the values of state and next_state are estimated with the critic network.
        The critic network is trained towards the target value:
            target = r + gamma * next_state_value   if not done,
            target = r                              if done.
        The actor network is trained with the delta value (the TD error):
            delta = target - state_value
        """
        # Add a batch dimension to both states.
        state = state[np.newaxis, :]
        next_state = next_state[np.newaxis, :]
        critic_value_for_next_state = self.critic.predict(next_state)
        critic_value = self.critic.predict(state)

        # Terminal transitions do not bootstrap from the next state.
        target = reward + self.gamma * critic_value_for_next_state * (1 - int(done))
        delta = target - critic_value

        # One-hot encoding of the action that was taken.
        actions = np.zeros([1, self.action_size])
        actions[0, action] = 1

        # The actor receives delta as a second input, used only by its loss.
        self.actor.fit([state, delta], actions, verbose=0)

        self.critic.fit(state, target, verbose=0)
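
`get_action` samples from the policy instead of taking the most probable action, which is what keeps the agent exploring. A minimal standard-library sketch of the same idea follows. The helper `sample_action` is hypothetical, and it uses `random.choices` in place of the `np.random.choice` call from the class above.

```python
import random
from collections import Counter


def sample_action(probabilities, rng=random):
    """Draw an action index with the given probabilities (hypothetical helper)."""
    actions = range(len(probabilities))
    return rng.choices(actions, weights=probabilities, k=1)[0]


# With a policy of [0.8, 0.2], action 0 should be drawn about 80% of the time.
rng = random.Random(0)
counts = Counter(sample_action([0.8, 0.2], rng) for _ in range(10_000))
frequency_of_action_0 = counts[0] / 10_000
```

A greedy choice would always return action 0 here, so action 1 would never be tried again.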

Now prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [4]:
# .env unwraps the TimeLimit wrapper, so episodes are not capped at 200 steps.
env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
alpha_learning_rate = 0.0001
beta_learning_rate = 0.0005

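state_input = Input(shape=(state_size,))
# delta (the TD error) is an extra input so that the actor's loss can use it.
delta = Input(shape=(1,))
# The actor and the critic share one hidden layer.
dense1 = Dense(64, activation='relu')(state_input)
probs = Dense(action_size, activation='softmax')(dense1)
values = Dense(1, activation='linear')(dense1)


def custom_loss(y_true, y_pred):
    # y_true is the one-hot encoded action, y_pred the action probabilities.
    out = K.clip(y_pred, 1e-8, 1 - 1e-8)  # avoid log(0)
    log_lik = y_true * K.log(out)

    return K.sum(-log_lik * delta)


actor_model = Model(inputs=[state_input, delta], outputs=[probs])
actor_model.compile(optimizer=Adam(lr=alpha_learning_rate), loss=custom_loss)
critic_model = Model(inputs=[state_input], outputs=[values])
critic_model.compile(optimizer=Adam(lr=beta_learning_rate), loss='mean_squared_error')
# The policy model shares its weights with the actor and is used only for prediction.
policy = Model(inputs=[state_input], outputs=[probs])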
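
For a single sample, `custom_loss` reduces to $-\delta_t \log \pi_\theta(a_t | s_t)$, because the one-hot `y_true` keeps only the probability of the action that was taken. Minimizing this loss by gradient descent therefore moves the weights in the direction of the actor update rule. The sketch below recomputes the loss in plain Python to make that explicit. `policy_gradient_loss` is a hypothetical name and is not part of the Keras model.

```python
import math


def policy_gradient_loss(one_hot_action, probs, delta, eps=1e-8):
    """Plain-Python version of custom_loss for one sample (hypothetical helper)."""
    # Clip the probabilities away from 0 and 1 so that log() stays finite.
    clipped = [min(max(p, eps), 1 - eps) for p in probs]
    return sum(-y * math.log(p) * delta
               for y, p in zip(one_hot_action, clipped))


# Action 1 was taken with probability 0.75 and the TD error was 2.0.
loss = policy_gradient_loss([0, 1], [0.25, 0.75], delta=2.0)
same_thing = -2.0 * math.log(0.75)
```

When `delta` is positive, lowering the loss means raising the probability of the chosen action. When `delta` is negative, it means lowering it.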



Now train the agent to play in the *CartPole* environment:

In [7]:
agent = ActorCriticAgent(action_size, actor_model, critic_model, policy)

# Training takes roughly 20-40 min.
for epoch in range(100):
    score_history = []

    for episode in range(100):
        done = False
        score = 0
        state = env.reset()
        for t in range(1000):
            action = agent.get_action(state)
            next_state, reward, done, _ = env.step(action)
            agent.learn(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        score_history.append(score)

    mean_score = np.mean(score_history)
    print("mean reward:%.3f" % mean_score)

    if mean_score > 300:
        print("You Win!")
        break

mean reward:21.420
mean reward:20.550
mean reward:22.290
mean reward:29.500
mean reward:40.530
mean reward:54.830
mean reward:103.020
mean reward:158.230
mean reward:154.460
mean reward:227.090
mean reward:136.600
mean reward:101.690
mean reward:185.940
mean reward:176.040
mean reward:269.780
mean reward:976.050
You Win!
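
The loop above reports the mean over disjoint blocks of 100 episodes. An alternative, which this notebook does not use, is to keep a rolling mean over the most recent episodes with the `deque` imported at the top. The class `RollingMean` below is a hypothetical helper that sketches the idea.

```python
from collections import deque


class RollingMean:
    """Mean of the most recent `window` scores (hypothetical helper)."""

    def __init__(self, window=100):
        # A bounded deque drops the oldest score once the window is full.
        self.scores = deque(maxlen=window)

    def add(self, score):
        """Record a score and return the mean of the current window."""
        self.scores.append(score)
        return sum(self.scores) / len(self.scores)


tracker = RollingMean(window=3)
means = [tracker.add(score) for score in [10, 20, 30, 40]]
# means == [10.0, 15.0, 20.0, 30.0]
```

A rolling mean can be checked after every episode, so training could stop as soon as the threshold is crossed instead of waiting for the end of a block.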
