# Laboratory Work 7


Goal of the lab: become familiar with basic reinforcement learning methods based on Actor-Critic algorithms.

In [1]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers



In [2]:
seed = 42
gamma = 0.99
max_steps_per_episode = 10000
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=seed)
eps = np.finfo(np.float32).eps.item()
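
The constants `gamma` and `eps` defined above are used later in the training loop to turn per-step rewards into normalized discounted returns. The following is a minimal standalone sketch of that computation in plain Python. The helper names `discounted_returns` and `normalize` are hypothetical and do not appear in the notebook, and the default `eps` is only an approximation of the float32 machine epsilon that the notebook obtains from NumPy.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every timestep t.

    Hypothetical helper: mirrors the backward pass over rewards_history
    in the training loop, but on a plain list.
    """
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


def normalize(values, eps=1.1920929e-07):
    """Shift to zero mean and scale to unit standard deviation.

    eps (assumed here to be roughly the float32 machine epsilon)
    guards against division by zero when all values are equal.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]


# Three steps with reward 1 each; gamma = 0.5 keeps the arithmetic easy to follow.
raw = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
normalized = normalize(raw)
print(raw)
print(normalized)
```

Earlier timesteps receive larger returns because more future reward is still ahead of them. Normalization keeps the scale of the learning signal roughly constant across episodes of different length.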

In [3]:
num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
actor = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[actor, critic])

In [4]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0
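
The training loop in the next cell combines two per-step losses: an actor loss weighted by the difference between the observed return and the critic's estimate, and a Huber loss for the critic. The sketch below reproduces that arithmetic for a single timestep in plain Python, without gradients. The helpers `huber` and `step_losses` are hypothetical, the input numbers are invented, and `delta=1.0` is assumed to match the default of `keras.losses.Huber`.

```python
def huber(error, delta=1.0):
    """Huber loss for a single scalar error.

    Quadratic for small errors, linear for large ones.
    delta=1.0 is assumed to match the Keras default.
    """
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * abs_error * abs_error
    return delta * (abs_error - 0.5 * delta)


def step_losses(log_prob, value, ret):
    """Actor and critic loss for one timestep (hypothetical helper).

    log_prob: log probability of the action that was taken
    value:    critic's estimate of the return at that state
    ret:      return actually observed (after normalization)
    """
    diff = ret - value               # how much better than expected
    actor_loss = -log_prob * diff    # raise log_prob when diff > 0
    critic_loss = huber(ret - value)
    return actor_loss, critic_loss


actor_loss, critic_loss = step_losses(log_prob=-0.5, value=0.25, ret=0.75)
print(actor_loss, critic_loss)
```

When the observed return exceeds the critic's estimate, minimizing the actor loss increases the probability of the chosen action; when it falls short, the probability is pushed down.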

In [5]:
while True:  # Run until solved
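    # Note: passing the same seed on every reset re-seeds the environment,
    # so each episode starts from the same initial state.
    state, info = env.reset(seed=seed)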
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); Adding this line would show the attempts
            # of the agent in a pop up window.

            print(state)
            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
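            # (named `action` so it does not shadow the `actor` output layer)
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment.
            # env.step returns separate `terminated` and `truncated` flags;
            # the episode is over when either one is set.
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated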
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep what was the total reward received after that timestep
        # - Rewards further in the future are discounted by multiplying them by gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 150:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break


[ 0.0273956  -0.00611216  0.03585979  0.0197368 ]
[ 0.02727336  0.18847767  0.03625453 -0.26141977]
[ 0.03104291  0.38306385  0.03102613 -0.5424507 ]
[ 0.03870419  0.1875199   0.02017712 -0.24015574]
[ 0.04245459  0.38234788  0.015374   -0.5264066 ]
[ 0.05010155  0.18701302  0.00484587 -0.22891912]
[ 5.3841807e-02  3.8206539e-01  2.6748964e-04 -5.2006954e-01]
[ 0.06148311  0.57718354 -0.0101339  -0.8126682 ]
[ 0.07302678  0.7724428  -0.02638726 -1.1085213 ]
[ 0.08847564  0.5776774  -0.04855769 -0.8242319 ]
[ 0.10002919  0.7734288  -0.06504233 -1.1317831 ]
[ 0.11549777  0.5792158  -0.08767799 -0.86018866]
[ 0.12708208  0.38539037 -0.10488177 -0.5963117 ]
[ 0.13478988  0.19188045 -0.116808   -0.33842054]
[ 0.1386275   0.38845402 -0.12357641 -0.66553515]
[ 0.14639658  0.19524787 -0.13688712 -0.41417503]
[ 0.15030153  0.3920176  -0.14517061 -0.7466879 ]
[ 0.1581419   0.1991651  -0.16010438 -0.5029824 ]
[ 0.16212519  0.00661897 -0.17016402 -0.26472682]
[ 0.16225757 -0.18571742 -0.17545855 -