# Laboratorium 5 (4 pkt)

Celem czwartego laboratorium jest zapoznanie się oraz zaimplementowanie algorytmów głębokiego uczenia aktywnego. Zaimplementowane algorytmy będą testowane z wykorzystaniem środowiska z OpenAI - *CartPole*.


Dołączenie standardowych bibliotek

In [1]:
from collections import deque
import gym
import numpy as np
import random
from tqdm import tqdm

Dołączenie bibliotek do obsługi sieci neuronowych

In [2]:
import tensorflow as tf

## Zadanie 1 - Double Deep Q-Network

<p style='text-align: justify;'>
Celem ćwiczenie jest zaimplementowanie algorytmu Double Deep Q-Network. Wartoscią oczekiwaną sieci jest:
\begin{equation}
       Q^*(s, a) \approx r + \gamma argmax_{a'}Q_\theta'(s', a') 
\end{equation}
a wagi pomiędzy sieciami wymieniane są co dziesięć aktualizacji wag sieci sterującej poczynaniami agenta ($Q$).
</p>

In [3]:
class DDQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 0.5  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.95
        self.learning_rate = 0.001
        self.epsilon_decay_diff = 0.04
        self.replay_counter = 0
        self.model = self._build_model()
        self.target_model = self._build_model()
        self.update_weights()

    def _build_model(self):
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Dense(64, input_shape=(state_size,), activation='relu'))
        model.add(tf.keras.layers.Dense(64, activation='relu'))
        model.add(tf.keras.layers.Dense(action_size, activation='relu'))
        model.compile(loss=tf.keras.losses.mse, optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate))
        return model

    def remember(self, state, action, reward, next_state, done):
        #Function adds information to the memory about last action and its results
        self.memory.append((state, action, reward, next_state, done)) 

    def get_action(self, state):
        """
        Compute the action to take in the current state, including exploration.
        With probability self.epsilon, we should take a random action.
            otherwise - the best policy action (self.get_best_action).

        Note: To pick randomly from a list, use random.choice(list).
              To pick True or False with a given probablity, generate uniform number in [0, 1]
              and compare it with your probability
        """

        if np.random.random() < self.epsilon:
            return np.random.choice(action_size)
        return self.get_best_action(state)
  
    def get_best_action(self, state):
        """
        Compute the best action to take in a state.
        """

        prediction = self.target_model.predict(state, verbose=0)
        best_action = tf.argmax(prediction[0]).numpy()
        return best_action

    def replay(self, batch_size):
        """
        Function learn network using randomly selected actions from the memory. 
        First calculates Q value for the next state and choose action with the biggest value.
        Target value is calculated according to:
                Q(s,a) := (r + gamma * max_a(Q(s', a)))
        except the situation when the next action is the last action, in such case Q(s, a) := r.
        In order to change only those weights responsible for chosing given action, the rest values should be those
        returned by the network for state state.
        The network should be trained on batch_size samples.
        After each 10 Q Network trainings parameters should be copied to the target Q Network
        """

        sample_idx = np.random.choice(len(self.memory), size=batch_size, replace=False)
        batch_list = [self.memory[idx] for idx in sample_idx]

        states, actions, rewards, next_states, dones = [np.array(x).reshape((batch_size, state_size if i == 0 or i == 3 else 1)) for i, x in enumerate(zip(*batch_list))]

        next_state_preds = np.max(self.target_model.predict(next_states, verbose=0), axis=1).reshape((batch_size, 1))
        targets = np.where(dones, rewards, rewards + self.gamma * next_state_preds)

        predictions = self.model.predict(states, verbose=0)
        predictions[np.arange(len(actions)), actions[:, 0]] = targets[:, 0]
        self.model.fit(states, predictions, verbose=0)

        self.update_weights()
        self.replay_counter += 1

    def update_epsilon_value(self):
        if self.epsilon > self.epsilon_min:
            self.epsilon -= self.epsilon_decay_diff
        else:
            self.epsilon = self.epsilon_min

    def update_weights(self):
        """copy trained Q Network params to target Q Network"""

        if not self.replay_counter % 10:
            self.target_model.set_weights(self.model.get_weights())


Czas przygotować model sieci, która będzie się uczyła działania w środowisku [*CartPool*](https://gym.openai.com/envs/CartPole-v0/):

In [4]:
env = gym.make("CartPole-v1").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
learning_rate = 0.001

Czas nauczyć agenta gry w środowisku *CartPool*:

In [5]:
agent = DDQNAgent(action_size, learning_rate)

agent.epsilon = 0.5
agent.epsilon_decay_diff = 0.05
batch_size = 128
EPISODES = 1000
for e in range(EPISODES):
    summary = []
    pbar = tqdm(range(100))
    for _ in pbar:
        total_reward = 0
        env_state = env.reset()[0]
        state = tf.convert_to_tensor(env_state[np.newaxis, :], dtype=tf.float32)

        for time in range(500):
            action = agent.get_action(state)
            next_state_env, reward, done, _, _ = env.step(action)
            total_reward += reward
            next_state = tf.convert_to_tensor(next_state_env[np.newaxis, :], dtype=tf.float32)
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                break

        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

        summary.append(total_reward)
        pbar.set_description(f'training epoch')

    agent.update_epsilon_value()
    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}".format(e, np.mean(summary), agent.epsilon))
    agent.model.save_weights('test')
    if np.mean(summary) > 195:
        print ("You Win!")
        break

  if not isinstance(terminated, (bool, np.bool8)):
training epoch: 100%|██████████| 100/100 [01:37<00:00,  1.03it/s]


epoch #0	mean reward = 34.720	epsilon = 0.450


training epoch: 100%|██████████| 100/100 [01:51<00:00,  1.12s/it]


epoch #1	mean reward = 37.540	epsilon = 0.400


training epoch: 100%|██████████| 100/100 [05:44<00:00,  3.44s/it]


epoch #2	mean reward = 113.730	epsilon = 0.350


training epoch: 100%|██████████| 100/100 [11:14<00:00,  6.74s/it]


epoch #3	mean reward = 209.300	epsilon = 0.300
You Win!
