# Laboratory 4 (4 pts.)

The goal of the fourth laboratory is to get familiar with and implement deep reinforcement learning algorithms. The implemented algorithms are tested on the previously prepared environments *FrozenLake* and *Pacman*, and on the OpenAI Gym environment *CartPole*.


Import the standard libraries:

In [1]:
from collections import deque
import gym
import numpy as np
from tqdm import tqdm

Import the libraries with the environments:

In [2]:
from env.FrozenLakeMDP import frozenLake
from env.FrozenLakeMDPExtended import frozenLake as frozenLakeExtended

Import the neural network libraries:

In [3]:
import tensorflow as tf

## Zadanie 1 - Deep Q-Network

<p style='text-align: justify;'>
The goal of this exercise is to implement the Deep Q-Network algorithm. The target value for the network is:
\begin{equation}
        Q(s_t, a_t) = r_{t+1} + \gamma \text{max}_a Q(s_{t + 1}, a)
\end{equation}
</p>
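As a minimal sketch of the target above, the function below computes it for a single transition. The name `td_target`, the default `gamma`, and the sample Q-values are illustrative assumptions, not part of the lab code. The `done` branch reflects the usual convention that a terminal transition has no future value.

```python
def td_target(reward, next_q_values, gamma=0.95, done=False):
    """Return r + gamma * max_a Q(s', a), or just r for a terminal transition."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values predicted for the next state (one entry per action).
next_q = [0.1, 0.5, 0.2, 0.0]

non_terminal = td_target(0.0, next_q)           # 0.0 + 0.95 * 0.5
terminal = td_target(1.0, next_q, done=True)    # reward only
print(non_terminal, terminal)
```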

In [4]:
class DQNAgent:
    def __init__(self, action_size, state_size, learning_rate, model):
        self.action_size = action_size
        self.state_size = state_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay_diff = 0.04
        self.epsilon_decay_dot = 0.99
        self.learning_rate = learning_rate
        self.model = model

    def remember(self, state, action, reward, next_state, done):
        # Store the last transition (state, action, reward, next state, done flag) in the replay memory.
        # if reward or np.random.random() < 0.05:
        self.memory.append((state, action, reward, next_state, done))

    def get_action(self, state):
        """
        Compute the action to take in the current state, including exploration.
        With probability self.epsilon, we should take a random action.
            otherwise - the best policy action (self.get_best_action).

        Note: To pick randomly from a list, use random.choice(list).
              To pick True or False with a given probability, draw a uniform number in [0, 1]
              and compare it with that probability.
        """

        if np.random.random() < self.epsilon:
            return np.random.choice(self.action_size)

        return self.get_best_action(state)

    def get_best_action(self, state):
        """
        Compute the best action to take in a state.
        """

        prediction = self.model.predict(state, verbose=0)
        best_action = tf.argmax(prediction[0]).numpy()
        return best_action

    def replay(self, batch_size):
        """
        Train the network on a random batch of transitions sampled from the memory.
        For each transition, compute the Q-values of the next state and take the largest one.
        The target value is:
                Q(s, a) := r + gamma * max_a'(Q(s', a'))
        unless the transition ended the episode, in which case Q(s, a) := r.
        So that only the output for the action actually taken contributes to the loss, the targets
        for the remaining actions are the network's own predictions for the state s.
        The network is trained on batch_size samples.
        """

        sample_idx = np.random.choice(len(self.memory), size=batch_size, replace=False)
        batch_list = [self.memory[idx] for idx in sample_idx]

        states, actions, rewards, next_states, dones = [np.array(x).reshape((batch_size, self.state_size if i == 0 or i == 3 else 1)) for i, x in enumerate(zip(*batch_list))]

        next_state_preds = np.max(self.model.predict(next_states, verbose=0), axis=1).reshape((batch_size, 1))
        targets = np.where(dones, rewards, rewards + self.gamma * next_state_preds)

        predictions = self.model.predict(states, verbose=0)
        predictions[np.arange(len(actions)), actions[:, 0]] = targets[:, 0]
        self.model.fit(states, predictions, verbose=0)

    def update_epsilon_value(self):
        if self.epsilon > self.epsilon_min:
            self.epsilon -= self.epsilon_decay_diff
        else:
            self.epsilon = self.epsilon_min
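
The target construction described in the `replay` docstring can be sketched without a network. The function below takes arrays that stand in for the model's predictions. The name `build_targets` and the sample numbers are assumptions made for illustration only.

```python
import numpy as np


def build_targets(q_states, q_next_states, actions, rewards, dones, gamma=0.95):
    """Return a copy of q_states with the chosen-action entries replaced by TD targets."""
    # Largest Q-value of each next state; unused for terminal transitions.
    best_next = q_next_states.max(axis=1)
    td = np.where(dones, rewards, rewards + gamma * best_next)
    targets = q_states.copy()
    # Overwrite only the column of the action that was actually taken.
    targets[np.arange(len(actions)), actions] = td
    return targets


q_s = np.array([[0.0, 1.0], [2.0, 3.0]])       # stand-in for model.predict(states)
q_next = np.array([[4.0, 5.0], [6.0, 7.0]])    # stand-in for model.predict(next_states)
targets = build_targets(
    q_s,
    q_next,
    actions=np.array([1, 0]),
    rewards=np.array([1.0, 1.0]),
    dones=np.array([False, True]),
)
print(targets)
```

Because the untouched entries equal the network's own predictions, their squared error is zero, so only the chosen action contributes to the MSE loss.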

Next, prepare the network model that will learn to move around the *FrozenLake* environment. The input layer should have as many neurons as there are possible states, and the output layer as many neurons as there are possible actions:

In [5]:
env = frozenLake("8x8")
state_size = env.get_number_of_states()
action_size = len(env.get_possible_actions(None))
learning_rate = 0.001

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=(state_size,), activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(action_size, activation='linear'))
model.compile(loss=tf.keras.losses.mse, optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), metrics=['accuracy'])#, run_eagerly=True)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 32)                2080      
                                                                 
 dense_1 (Dense)             (None, 32)                1056      
                                                                 
 dense_2 (Dense)             (None, 4)                 132       
                                                                 
Total params: 3,268
Trainable params: 3,268
Non-trainable params: 0
_________________________________________________________________


Now train the agent to move around the *FrozenLake* environment. Represent the state as a vector with as many elements as there are possible states, with 1 at the index of the current state and zeros everywhere else:
* 1 pt: fewer than 35 epochs,
* 0.5 pt: fewer than 60 epochs,
* 0.25 pt: otherwise.
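
A minimal sketch of the one-hot state encoding described above. The helper name `one_hot_state` and the example sizes are assumptions, not part of the provided environment.

```python
import numpy as np


def one_hot_state(state_index, state_size):
    """Encode a discrete state index as a (1, state_size) one-hot row vector."""
    state = np.zeros((1, state_size), dtype=np.float32)
    state[0, state_index] = 1.0
    return state


encoded = one_hot_state(3, 8)
print(encoded)
```

The leading batch dimension of 1 lets the vector be passed directly to `model.predict`.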

In [6]:
agent = DQNAgent(action_size, state_size, learning_rate, model)

# agent.model.load_weights('test')

agent.epsilon = 1
agent.epsilon_decay_diff = 0.025
# agent.epsilon_decay_dot = 0.98
agent.gamma = 0.95
batch_size = 64
EPISODES = 1000
for e in range(EPISODES):
    summary = []
    # pbar = tqdm(range(1000))
    for _ in range(100):
        total_reward = 0
        env_state = env.reset()
        state = np.zeros((1, state_size))
        state[0, env_state] = 1
        state = tf.convert_to_tensor(state, dtype=tf.float32)

        while True:
            action = agent.get_action(state)
            next_state_env, reward, done, _ = env.step(action)
            total_reward += reward
            next_state = np.zeros((1, state_size))
            next_state[0, next_state_env] = 1
            next_state = tf.convert_to_tensor(next_state, dtype=tf.float32)

            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                break

        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

        summary.append(total_reward)
        # pbar.set_description(f'training epoch')

    agent.update_epsilon_value()
    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}".format(e, np.mean(summary), agent.epsilon))
    agent.model.save_weights('test')
    if np.mean(summary) > 0.9:
        print ("You Win!")
        break

epoch #0	mean reward = 0.000	epsilon = 0.975
epoch #1	mean reward = 0.000	epsilon = 0.950
epoch #2	mean reward = 0.010	epsilon = 0.925
epoch #3	mean reward = 0.010	epsilon = 0.900
epoch #4	mean reward = 0.010	epsilon = 0.875
epoch #5	mean reward = 0.020	epsilon = 0.850
epoch #6	mean reward = 0.020	epsilon = 0.825
epoch #7	mean reward = 0.020	epsilon = 0.800
epoch #8	mean reward = 0.060	epsilon = 0.775
epoch #9	mean reward = 0.030	epsilon = 0.750
epoch #10	mean reward = 0.110	epsilon = 0.725
epoch #11	mean reward = 0.080	epsilon = 0.700
epoch #12	mean reward = 0.100	epsilon = 0.675
epoch #13	mean reward = 0.090	epsilon = 0.650
epoch #14	mean reward = 0.030	epsilon = 0.625
epoch #15	mean reward = 0.000	epsilon = 0.600
epoch #16	mean reward = 0.050	epsilon = 0.575
epoch #17	mean reward = 0.120	epsilon = 0.550
epoch #18	mean reward = 0.050	epsilon = 0.525
epoch #19	mean reward = 0.270	epsilon = 0.500
epoch #20	mean reward = 0.050	epsilon = 0.475
epoch #21	mean reward = 0.150	epsilon = 0.45

Next, prepare the network model that will learn to move around the *FrozenLakeExtended* environment. This time the state is not a single number but three arrays:
* the first holds information about the goal,
* the second holds information about the holes,
* the third holds the player's position.
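
As a sketch of how three such arrays can become a single network input, the example below flattens three 4x4 layers into one row vector of length 48. The layer contents and the variable names are hypothetical, and the exact layout returned by the environment may differ.

```python
import numpy as np

# Hypothetical 4x4 layers; the real arrays come from the environment.
goal = np.zeros((4, 4))
goal[3, 3] = 1
holes = np.zeros((4, 4))
holes[1, 1] = 1
player = np.zeros((4, 4))
player[0, 0] = 1

observation = [goal, holes, player]

# Stack the layers, flatten them, and add a batch dimension of 1.
flat_state = np.array(observation, dtype=np.float32).reshape(-1)[np.newaxis, :]
print(flat_state.shape)
```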

In [31]:
env = frozenLakeExtended("4x4")

state_size = env.get_number_of_states()
action_size = len(env.get_possible_actions(None))
learning_rate = 0.001

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=(3 * state_size,), activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(action_size, activation='linear'))
model.compile(loss=tf.keras.losses.mse, optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), metrics=['accuracy'])#, run_eagerly=True)
model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_24 (Dense)            (None, 32)                1568      
                                                                 
 dense_25 (Dense)            (None, 64)                2112      
                                                                 
 dense_26 (Dense)            (None, 32)                2080      
                                                                 
 dense_27 (Dense)            (None, 4)                 132       
                                                                 
Total params: 5,892
Trainable params: 5,892
Non-trainable params: 0
_________________________________________________________________


Now train the agent to move around the *FrozenLakeExtended* environment. Use a vector made up of all three arrays as the state (2 pts.):

In [32]:
class DQNAgent:
    def __init__(self, action_size, state_size, learning_rate, model):
        self.action_size = action_size
        self.state_size = state_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay_diff = 0.04
        self.epsilon_decay_dot = 0.99
        self.learning_rate = learning_rate
        self.model = model

    def remember(self, state, action, reward, next_state, done):
        # Store the last transition (state, action, reward, next state, done flag) in the replay memory.
        # if reward or np.random.random() < 0.05:
        self.memory.append((state, action, reward, next_state, done))

    def get_action(self, state):
        """
        Compute the action to take in the current state, including exploration.
        With probability self.epsilon, we should take a random action.
            otherwise - the best policy action (self.get_best_action).

        Note: To pick randomly from a list, use random.choice(list).
              To pick True or False with a given probability, draw a uniform number in [0, 1]
              and compare it with that probability.
        """

        if np.random.random() < self.epsilon:
            return np.random.choice(self.action_size)

        return self.get_best_action(state)

    def get_best_action(self, state):
        """
        Compute the best action to take in a state.
        """

        prediction = self.model.predict(state, verbose=0)
        best_action = tf.argmax(prediction[0]).numpy()
        return best_action

    def replay(self, batch_size):
        """
        Train the network on a random batch of transitions sampled from the memory.
        For each transition, compute the Q-values of the next state and take the largest one.
        The target value is:
                Q(s, a) := r + gamma * max_a'(Q(s', a'))
        unless the transition ended the episode, in which case Q(s, a) := r.
        So that only the output for the action actually taken contributes to the loss, the targets
        for the remaining actions are the network's own predictions for the state s.
        The network is trained on batch_size samples.
        """

        sample_idx = np.random.choice(len(self.memory), size=batch_size, replace=False)
        batch_list = [self.memory[idx] for idx in sample_idx]

        # Each state is three flattened arrays, hence 3 * self.state_size input features.
        states, actions, rewards, next_states, dones = [np.array(x).reshape((batch_size, 3 * self.state_size if i == 0 or i == 3 else 1)) for i, x in enumerate(zip(*batch_list))]

        next_state_preds = np.max(self.model.predict(next_states, verbose=0), axis=1).reshape((batch_size, 1))
        targets = np.where(dones, rewards, rewards + self.gamma * next_state_preds)

        predictions = self.model.predict(states, verbose=0)
        predictions[np.arange(len(actions)), actions[:, 0]] = targets[:, 0]
        self.model.fit(states, predictions, verbose=0)

    def update_epsilon_value(self):
        if self.epsilon > self.epsilon_min:
            self.epsilon -= self.epsilon_decay_diff
        else:
            self.epsilon = self.epsilon_min

In [33]:
agent = DQNAgent(action_size, state_size, learning_rate, model)

agent.epsilon = 0.8

batch_size = 64
EPISODES = 2000
for e in range(EPISODES):
    summary = []
    pbar = tqdm(range(100))
    for _ in pbar:
        total_reward = 0
        env_state = np.array(env.reset()).reshape(-1)

        state = tf.convert_to_tensor(env_state[np.newaxis, :], dtype=tf.float32)

        while True:
            action = agent.get_action(state)
            next_state_env, reward, done, _ = env.step(action)
            total_reward += reward
            next_state = tf.convert_to_tensor(np.array(next_state_env).reshape(-1)[np.newaxis, :], dtype=tf.float32)
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                break

        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

        summary.append(total_reward)
        pbar.set_description(f'training epoch')

    agent.update_epsilon_value()
    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}".format(e, np.mean(summary), agent.epsilon))
    agent.model.save_weights('test')
    if np.mean(summary) > 0.9:
        print ("You Win!")
        break

training epoch: 100%|██████████| 100/100 [00:26<00:00,  3.78it/s]


epoch #0	mean reward = 0.010	epsilon = 0.760


training epoch: 100%|██████████| 100/100 [00:28<00:00,  3.57it/s]


epoch #1	mean reward = 0.100	epsilon = 0.720


training epoch: 100%|██████████| 100/100 [00:28<00:00,  3.49it/s]


epoch #2	mean reward = 0.080	epsilon = 0.680


training epoch: 100%|██████████| 100/100 [00:29<00:00,  3.42it/s]


epoch #3	mean reward = 0.260	epsilon = 0.640


training epoch: 100%|██████████| 100/100 [00:30<00:00,  3.23it/s]


epoch #4	mean reward = 0.200	epsilon = 0.600


training epoch: 100%|██████████| 100/100 [00:34<00:00,  2.86it/s]


epoch #5	mean reward = 0.250	epsilon = 0.560


training epoch: 100%|██████████| 100/100 [00:37<00:00,  2.70it/s]


epoch #6	mean reward = 0.260	epsilon = 0.520


training epoch: 100%|██████████| 100/100 [00:38<00:00,  2.57it/s]


epoch #7	mean reward = 0.280	epsilon = 0.480


training epoch: 100%|██████████| 100/100 [00:41<00:00,  2.44it/s]


epoch #8	mean reward = 0.400	epsilon = 0.440


training epoch: 100%|██████████| 100/100 [00:39<00:00,  2.53it/s]


epoch #9	mean reward = 0.450	epsilon = 0.400


training epoch: 100%|██████████| 100/100 [00:39<00:00,  2.51it/s]


epoch #10	mean reward = 0.550	epsilon = 0.360


training epoch: 100%|██████████| 100/100 [00:49<00:00,  2.04it/s]


epoch #11	mean reward = 0.620	epsilon = 0.320


training epoch: 100%|██████████| 100/100 [00:47<00:00,  2.11it/s]


epoch #12	mean reward = 0.620	epsilon = 0.280


training epoch: 100%|██████████| 100/100 [00:44<00:00,  2.24it/s]


epoch #13	mean reward = 0.690	epsilon = 0.240


training epoch: 100%|██████████| 100/100 [00:50<00:00,  1.99it/s]


epoch #14	mean reward = 0.650	epsilon = 0.200


training epoch: 100%|██████████| 100/100 [00:47<00:00,  2.13it/s]


epoch #15	mean reward = 0.720	epsilon = 0.160


training epoch: 100%|██████████| 100/100 [00:48<00:00,  2.08it/s]


epoch #16	mean reward = 0.820	epsilon = 0.120


training epoch: 100%|██████████| 100/100 [00:47<00:00,  2.10it/s]


epoch #17	mean reward = 0.890	epsilon = 0.080


training epoch: 100%|██████████| 100/100 [00:58<00:00,  1.72it/s]


epoch #18	mean reward = 0.890	epsilon = 0.040


training epoch: 100%|██████████| 100/100 [00:48<00:00,  2.08it/s]


epoch #19	mean reward = 0.950	epsilon = -0.000
You Win!
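
The log above ends with `epsilon = -0.000`: `update_epsilon_value` subtracts a fixed step whenever epsilon is still above `epsilon_min`, so the last subtraction can undershoot the floor. One possible fix, sketched here as a standalone function with a hypothetical name, clamps the result instead.

```python
def decay_epsilon(epsilon, step=0.04, epsilon_min=0.01):
    """Linearly decrease epsilon, never letting it fall below epsilon_min."""
    return max(epsilon_min, epsilon - step)


epsilon = 0.8
history = []
for _ in range(20):
    epsilon = decay_epsilon(epsilon)
    history.append(epsilon)

print(history[-1])
```

Since the recorded runs were produced with the unclamped version, applying this change would alter the epsilon values printed in later epochs.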


Next, prepare the network model that will learn to act in the [*CartPole*](https://gym.openai.com/envs/CartPole-v0/) environment:

In [4]:
class DQNAgent:
    def __init__(self, action_size, state_size, learning_rate, model):
        self.action_size = action_size
        self.state_size = state_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95    # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay_diff = 0.04
        self.epsilon_decay_dot = 0.99
        self.learning_rate = learning_rate
        self.model = model

    def remember(self, state, action, reward, next_state, done):
        # Store the last transition (state, action, reward, next state, done flag) in the replay memory.
        # if reward or np.random.random() < 0.05:
        self.memory.append((state, action, reward, next_state, done))

    def get_action(self, state):
        """
        Compute the action to take in the current state, including exploration.
        With probability self.epsilon, we should take a random action.
            otherwise - the best policy action (self.get_best_action).

        Note: To pick randomly from a list, use random.choice(list).
              To pick True or False with a given probability, draw a uniform number in [0, 1]
              and compare it with that probability.
        """

        if np.random.random() < self.epsilon:
            return np.random.choice(self.action_size)

        return self.get_best_action(state)

    def get_best_action(self, state):
        """
        Compute the best action to take in a state.
        """

        prediction = self.model.predict(state, verbose=0)
        best_action = tf.argmax(prediction[0]).numpy()
        return best_action

    def replay(self, batch_size):
        """
        Train the network on a random batch of transitions sampled from the memory.
        For each transition, compute the Q-values of the next state and take the largest one.
        The target value is:
                Q(s, a) := r + gamma * max_a'(Q(s', a'))
        unless the transition ended the episode, in which case Q(s, a) := r.
        So that only the output for the action actually taken contributes to the loss, the targets
        for the remaining actions are the network's own predictions for the state s.
        The network is trained on batch_size samples.
        """

        sample_idx = np.random.choice(len(self.memory), size=batch_size, replace=False)
        batch_list = [self.memory[idx] for idx in sample_idx]

        states, actions, rewards, next_states, dones = [np.array(x).reshape((batch_size, self.state_size if i == 0 or i == 3 else 1)) for i, x in enumerate(zip(*batch_list))]

        next_state_preds = np.max(self.model.predict(next_states, verbose=0), axis=1).reshape((batch_size, 1))
        targets = np.where(dones, rewards, rewards + self.gamma * next_state_preds)

        predictions = self.model.predict(states, verbose=0)
        predictions[np.arange(len(actions)), actions[:, 0]] = targets[:, 0]
        self.model.fit(states, predictions, verbose=0)

    def update_epsilon_value(self):
        if self.epsilon > self.epsilon_min:
            self.epsilon -= self.epsilon_decay_diff
        else:
            self.epsilon = self.epsilon_min

In [5]:
env = gym.make("CartPole-v1").env
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
learning_rate = 0.001

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=(state_size,), activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(action_size, activation='linear'))
model.compile(loss=tf.keras.losses.mse, optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), metrics=['accuracy'])#, run_eagerly=True)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 32)                160       
                                                                 
 dense_1 (Dense)             (None, 64)                2112      
                                                                 
 dense_2 (Dense)             (None, 2)                 130       
                                                                 
Total params: 2,402
Trainable params: 2,402
Non-trainable params: 0
_________________________________________________________________


Now train the agent to play in the *CartPole* environment:
* 1 pt: fewer than 10 epochs,
* 0.5 pt: fewer than 20 epochs,
* 0.25 pt: otherwise.

In [6]:
agent = DQNAgent(action_size, state_size, learning_rate, model)

agent.epsilon = 0.5
agent.epsilon_decay_diff = 0.05
batch_size = 128
EPISODES = 1000
for e in range(EPISODES):
    summary = []
    pbar = tqdm(range(100))
    for _ in pbar:
        total_reward = 0
        env_state = env.reset()[0]
        state = tf.convert_to_tensor(env_state[np.newaxis, :], dtype=tf.float32)

        for time in range(300):
            action = agent.get_action(state)
            next_state_env, reward, done, _, _ = env.step(action)
            total_reward += reward
            next_state = tf.convert_to_tensor(next_state_env[np.newaxis, :], dtype=tf.float32)
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            if done:
                break

        if len(agent.memory) > batch_size:
            agent.replay(batch_size)

        summary.append(total_reward)
        pbar.set_description(f'training epoch')

    agent.update_epsilon_value()
    print("epoch #{}\tmean reward = {:.3f}\tepsilon = {:.3f}".format(e, np.mean(summary), agent.epsilon))
    agent.model.save_weights('test')
    if np.mean(summary) > 195:
        print ("You Win!")
        break

training epoch: 100%|██████████| 100/100 [00:44<00:00,  2.27it/s]


epoch #0	mean reward = 13.100	epsilon = 0.450


training epoch: 100%|██████████| 100/100 [02:00<00:00,  1.20s/it]


epoch #1	mean reward = 39.300	epsilon = 0.400


training epoch: 100%|██████████| 100/100 [02:36<00:00,  1.56s/it]


epoch #2	mean reward = 48.200	epsilon = 0.350


training epoch: 100%|██████████| 100/100 [04:11<00:00,  2.51s/it]


epoch #3	mean reward = 71.900	epsilon = 0.300


training epoch: 100%|██████████| 100/100 [08:22<00:00,  5.02s/it]


epoch #4	mean reward = 142.420	epsilon = 0.250


training epoch: 100%|██████████| 100/100 [11:33<00:00,  6.93s/it]


epoch #5	mean reward = 185.900	epsilon = 0.200


training epoch: 100%|██████████| 100/100 [17:03<00:00, 10.24s/it]


epoch #6	mean reward = 272.810	epsilon = 0.150
You Win!
