# Laboratory 5

The goal of this laboratory is to become familiar with deep reinforcement learning algorithms and to implement them. The implemented algorithms will be tested in the OpenAI *CartPole* environment.


Importing the standard libraries

In [32]:
from collections import deque
import gym
import numpy as np
import random
import time as tm

from gym.envs.classic_control import CartPoleEnv

Importing the neural-network libraries

In [33]:
from keras import Model
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam

## Task 1 - Double Deep Q-Network

<p style='text-align: justify;'>
The goal of this exercise is to implement the Double Deep Q-Network algorithm. The network's target value is:
\begin{equation}
       Q^*(s, a) \approx r + \gamma \max_{a'} Q_{\theta'}(s', a')
\end{equation}
where $\theta'$ denotes the weights of the target network. The weights are synchronised between the two networks every ten weight updates of the network that controls the agent's actions ($Q$).
</p>
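As a rough illustration of the target value above, the sketch below computes it for a single transition using plain Python numbers. The function name `td_target` and the toy Q-values are hypothetical and exist only for this example; the terminal case (target equal to the reward alone) follows the description in the `replay` docstring.

```python
def td_target(reward, next_q_values, gamma=0.95, done=False):
    """Return r for a terminal transition, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Toy target-network outputs for the next state (one value per action).
next_q = [1.0, 3.0]

non_terminal = td_target(1.0, next_q)           # 1.0 + 0.95 * 3.0
terminal = td_target(1.0, next_q, done=True)    # just the reward
print(non_terminal, terminal)
```

In the agent, `next_q_values` would correspond to the target network's prediction for `next_state`.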

In [37]:

class DDQNAgent:
    def __init__(self, action_size, learning_rate, model: Model, target_model: Model, get_legal_actions=None, env=None):
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95  # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.999
        self.tau = 0.85  # soft-update factor used in update_weights (1.0 would be a full copy)
        self.learning_rate = learning_rate
        self.model = model
        self.get_legal_actions = get_legal_actions
        self.env = env
        
        self.target_model = target_model
        self.update_weights()
        self.replay_counter = 0  # start at 0 so the first sync also happens after ten replays
        
        
    def remember(self, state, action, reward, next_state, done):
        # Store the most recent transition (the action and its outcome) in the replay memory
        self.memory.append((state, action, reward, next_state, done))

    def get_action(self, state):
        """
        Compute the action to take in the current state, including exploration.
        With probability self.epsilon, take a random action;
            otherwise take the best policy action (self.get_best_action).

        Note: To pick randomly from a list, use random.choice(list).
              To pick True or False with a given probability, generate a uniform number in [0, 1]
              and compare it with your probability.
        """

        #
        # INSERT CODE HERE to get action in a given state (according to epsilon greedy algorithm)
        #

        epsilon = self.epsilon

        # Pick Action
        if isinstance(self.env, CartPoleEnv):  # use the agent's own env, not the global one
            if np.random.random() < epsilon:
                return self.env.action_space.sample()
            else:
                return np.argmax(self.model.predict(state)[0])
        else:
            possible_actions = self.get_legal_actions(state)
            if len(possible_actions) == 0:
                return None
            best_action = self.get_best_action(state)
            chosen_action = best_action

            if random.uniform(0, 1) < epsilon:
                random_actions = possible_actions.copy()
                random_actions.remove(best_action)
                chosen_action = random.choice(random_actions if random_actions else [best_action])
            return chosen_action

    def get_best_action(self, state):
        """
        Compute the best action to take in a state (using current q-values).
        """
        # Outside CartPole, a state may have no legal actions at all.
        if not isinstance(self.env, CartPoleEnv):
            if len(self.get_legal_actions(state)) == 0:
                return None

        # Greedy action: index of the largest predicted Q-value for this state.
        return np.argmax(self.model.predict(state)[0])

    def lower_epsilon(self):
        new_epsilon = self.epsilon * self.epsilon_decay
        if new_epsilon >= self.epsilon_min:
            self.epsilon = new_epsilon

    def replay(self, batch_size):
        """
        Train the network on transitions sampled at random from the memory.
        First compute the Q-values for the next state and take the largest one.
        The target value is calculated according to:
                Q(s,a) := (r + gamma * max_a(Q(s', a)))
        except when the transition ends the episode; in that case Q(s, a) := r.
        So that only the weights responsible for choosing the given action change, the remaining
        target values should be those returned by the network for `state`.
        The network should be trained on batch_size samples.
        Each time replay is called, self.epsilon should also be updated according to:
        self.epsilon *= self.epsilon_decay
        """
        #
        # INSERT CODE HERE to train network
        #

        if len(self.memory) < batch_size:
            return

        info_sets = random.sample(self.memory, batch_size)
        states_list = []
        targets_list = []
        for info_set in info_sets:
            state, action, reward, next_state, done = info_set
            states_list.append(state.flatten())
            # Start from the online network's own outputs so untouched actions contribute no error.
            target = self.model.predict(state)
            if done:
                target[0][action] = reward
            else:
                Q_future = max(self.target_model.predict(next_state)[0])
                target[0][action] = reward + Q_future * self.gamma
            targets_list.append(target.flatten())

        states_array = np.array(states_list)
        targets_array = np.array(targets_list)

        self.model.train_on_batch(states_array, targets_array)
        self.lower_epsilon()
        self.replay_counter += 1
        if self.replay_counter >= 10:
            self.update_weights()
            self.replay_counter = 0


    def update_weights(self):
        """copy trained Q Network params to target Q Network"""
        #
        # INSERT CODE HERE to update the target network weights
        #
        weights = self.model.get_weights()
        target_weights = self.target_model.get_weights()
        for i in range(len(target_weights)):
            target_weights[i] = weights[i] * self.tau + target_weights[i] * (1 - self.tau)
            # target_weights[i] = weights[i]
        self.target_model.set_weights(target_weights)
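
The blending done in `update_weights` can be looked at in isolation. The following is a minimal sketch, assuming plain floats as stand-ins for the weight arrays that Keras returns; the helper name `soft_update` is hypothetical and not part of the notebook.

```python
def soft_update(online_weights, target_weights, tau):
    """Blend each online weight into the matching target weight."""
    return [tau * w + (1.0 - tau) * t
            for w, t in zip(online_weights, target_weights)]


online = [1.0, 2.0]   # stand-in for the online network's weights
target = [0.0, 0.0]   # stand-in for the target network's weights

blended = soft_update(online, target, tau=0.85)  # mostly the online weights
copied = soft_update(online, target, tau=1.0)    # identical to the online weights
print(blended, copied)
```

With `tau = 1.0` the update reduces to a plain copy, which matches the commented-out line inside `update_weights`; smaller values make the target network follow the online network more slowly.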


Time to prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [38]:
def build_model():
    model = Sequential()
    model.add(Dense(16, input_dim=state_size, activation="relu"))
    model.add(Dense(32, activation="relu"))
    model.add(Dense(16, activation="relu"))
    model.add(Dense(action_size))  # output layer: one Q-value per action
    model.compile(loss="mean_squared_error",
                  optimizer=Adam(lr=learning_rate))
    return model

env = gym.make("CartPole-v0").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
learning_rate = 0.001

model = build_model()
target_model = build_model()
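
To get a feel for how small this network is, the sketch below counts the trainable parameters of a fully connected stack with the layer widths used in `build_model`. It assumes CartPole's four-dimensional observation and two discrete actions (so `state_size = 4` and `action_size = 2`); the helper `dense_param_count` is a hypothetical name introduced only for this example.

```python
def dense_param_count(layer_sizes):
    """Count the weights and biases of a fully connected stack."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Input width, three hidden layers, output width (one Q-value per action).
sizes = [4, 16, 32, 16, 2]
total = dense_param_count(sizes)
print(total)
```

Each `Dense` layer contributes `n_in * n_out` weights plus `n_out` biases, so the online and target networks hold the same number of parameters and their weight lists can be matched element by element.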

Time to train the agent to play in the *CartPole* environment:

In [39]:
agent = DDQNAgent(action_size, learning_rate, model, target_model, env=env)

agent.epsilon = 0.75

done = False
batch_size = 64
EPISODES = 1000
counter = 0
for e in range(EPISODES):
    start = tm.time()
    summary = []
    for _ in range(100):
        total_reward = 0
        env_state = env.reset()
    
        #
        # INSERT CODE HERE to prepare appropriate format of the state for network
        #
        state = np.array([np.array(env_state).flatten()])
        
        for time in range(1000):
            action = agent.get_action(state)
            next_state_env, reward, done, _ = env.step(action)
            total_reward += reward

            #
            # INSERT CODE HERE to prepare appropriate format of the next state for network
            #
            next_state = np.array([np.array(next_state_env).flatten()])

            #add to experience memory
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                break

        #
        # INSERT CODE HERE to train the network if the memory holds more samples than the batch size
        #
        if len(agent.memory) > batch_size:
            agent.replay(batch_size)
            # agent.update_weights()
        
        summary.append(total_reward)
    end = tm.time()
    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}\ttime = {:.3f}".format(e, np.mean(summary), agent.epsilon,
                                                                                    end - start))
    if np.mean(summary) > 195:  # compare the epoch's mean reward, not just the last episode
        print("You Win!")
        break
    # Results (episodes to win):
    # 19, 12, 26
    # with replay: 10, 12, 6, 13
    # with tau updating: 23
    # tau = 1: 21, 11, 10
    # tau = 0.85: 20

epoch #0	mean reward = 17.150	epsilon = 0.681	time = 10.690
epoch #1	mean reward = 15.530	epsilon = 0.616	time = 9.672
epoch #2	mean reward = 15.860	epsilon = 0.557	time = 9.992
epoch #3	mean reward = 39.810	epsilon = 0.504	time = 10.695
epoch #4	mean reward = 66.160	epsilon = 0.456	time = 12.133
epoch #5	mean reward = 102.780	epsilon = 0.413	time = 13.960
epoch #6	mean reward = 92.010	epsilon = 0.373	time = 13.875
epoch #7	mean reward = 114.130	epsilon = 0.338	time = 15.479
epoch #8	mean reward = 129.850	epsilon = 0.306	time = 16.449
epoch #9	mean reward = 131.100	epsilon = 0.277	time = 18.751
epoch #10	mean reward = 172.860	epsilon = 0.250	time = 19.959
epoch #11	mean reward = 161.320	epsilon = 0.226	time = 18.441
epoch #12	mean reward = 144.480	epsilon = 0.205	time = 18.133
epoch #13	mean reward = 129.320	epsilon = 0.185	time = 17.336
epoch #14	mean reward = 107.850	epsilon = 0.168	time = 17.121
epoch #15	mean reward = 114.440	epsilon = 0.152	time = 18.676
epoch #16	mean reward = 11