<font size=4 color='blue'>
    
# <center> Class 6, March 7, 2023</center>

<img src="./images/Picture-1.png" width=420 height=420 align = "center" >

<font size=5 color='blue'>
Introduction to Reinforcement Learning
$$ $$
    
<font size=4 color='black'>    
 
[Introduction Book](./Literature/Reinforcement-learning/Reinforcement-Learning-an-introduction-book_2015.pdf)
    
[Reinforcement Learning](https://www.synopsys.com/ai/what-is-reinforcement-learning.html)

<font size=5 color='blue'>
In Reinforcement Learning, an agent interacts with its environment to achieve a goal


<font size=5 color='black'>
A typical example of Reinforcement Learning
$$  $$   
<font size=4 color='black'>
Actor Critic method on CartPole environment. $$ $$

[Actor Critic algorithms](./Literature/Reinforcement-learning/Natural-Actor–Critic-Algorithms_2009.pdf)   

<font size=5 color='blue'>
Actor Critic Method

<font size=4 color='black'>

As an agent takes actions and moves through an environment, it learns to map
the observed state of the environment to two possible outputs:

1. Recommended action: A probability value for each action in the action space.
   The part of the agent responsible for this output is called the **actor**.$$ $$
2. Estimated rewards in the future: Sum of all rewards it expects to receive in the
   future. The part of the agent responsible for this output is the **critic**.

The actor and the critic are trained together, so that the actions recommended
by the actor maximize the expected rewards.
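
The two outputs described above can be sketched with a toy model. This is a minimal illustration, not the network trained later in this notebook: the fixed weights, the `actor_critic` helper, and the sample state are all hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights for a 4-dimensional state and 2 actions (not learned).
ACTOR_W = [[0.1, -0.2, 0.3, 0.0], [-0.1, 0.2, -0.3, 0.0]]
CRITIC_W = [0.5, 0.5, 0.5, 0.5]

def actor_critic(state):
    """Return (action probabilities, estimated future reward) for one state."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in ACTOR_W]
    value = sum(w * s for w, s in zip(CRITIC_W, state))
    return softmax(logits), value

state = [0.02, -0.01, 0.03, 0.04]  # made-up observation
probs, value = actor_critic(state)

# The actor's output is a distribution, so an action is sampled from it.
random.seed(0)
action = random.choices(range(len(probs)), weights=probs)[0]
print(probs, value, action)
```

The actor output always sums to 1, while the critic output is an unbounded scalar.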

<font size=5 color='blue'>
CartPole environment
$$ $$
    
<font size=4 color='black'>    
 
[Cart-pole system](./Literature/Reinforcement-learning/Neuronlike-Adaptive-Elements-That-Can-Solve-Difficult-Learnina-Control-Problems_1983.pdf)

<font size=4 color='black'>

If the library "gym" is not installed, install it first (for example with `pip install gym`) before running the import cell below.

<font size=4 color='black'>
    
[OpenAI Gym article](./Literature/Reinforcement-learning/openAI-GYM_2016.pdf)
    
[OpenAI Gym Github](https://github.com/openai/gym)

In [1]:
import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

ModuleNotFoundError: No module named 'gym'

In [3]:
# Configuration parameters for the whole setup
seed = 42
gamma = 0.99  # Discount factor for future rewards
max_steps_per_episode = 10000
env = gym.make("CartPole-v0")  # Create the environment
env.seed(seed)
eps = np.finfo(np.float32).eps.item()  # Smallest number such that 1.0 + eps != 1.0
print(eps)

1.1920928955078125e-07
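
The printed value is the float32 machine epsilon, which equals 2^-23. The sketch below reproduces it in plain Python and shows why the training loop adds `eps` to the standard deviation before dividing. The `normalize` helper is a hypothetical stand-in for the NumPy expression used later.

```python
# float32 has a 23-bit fraction, so its machine epsilon is 2**-23.
eps = 2.0 ** -23
print(eps)  # 1.1920928955078125e-07

def normalize(values, eps=eps):
    """Standardize values; eps keeps the division defined when std is 0."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + eps) for v in values]

# A constant sequence has zero standard deviation; without eps this
# pure-Python version would raise ZeroDivisionError.
print(normalize([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```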


<font size=5 color='blue'>
Actor Critic neural network

<font size=4>    
This network learns two functions:$$ $$

1. Actor: This takes as input the state of our environment and returns a
probability value for each action in its action space. $$ $$
2. Critic: This takes as input the state of our environment and returns
an estimate of total rewards in the future.

<font size=4 color='black'>

Actor and Critic share the initial layer.


In [4]:
num_inputs = 4
num_actions = 2
num_hidden = 128

inputs = layers.Input(shape=(num_inputs,))
common = layers.Dense(num_hidden, activation="relu")(inputs)
action = layers.Dense(num_actions, activation="softmax")(common)
critic = layers.Dense(1)(common)

model = keras.Model(inputs=inputs, outputs=[action, critic])
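
As a sanity check on the architecture, the number of trainable parameters can be counted by hand. This is a sketch that assumes each `Dense` layer holds one weight per input-output pair plus one bias per output; the `dense_params` helper is hypothetical. Calling `model.summary()` should report a matching total.

```python
num_inputs, num_hidden, num_actions = 4, 128, 2

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

common_params = dense_params(num_inputs, num_hidden)   # shared hidden layer
actor_params = dense_params(num_hidden, num_actions)   # softmax head
critic_params = dense_params(num_hidden, 1)            # value head
total_params = common_params + actor_params + critic_params

print(common_params, actor_params, critic_params, total_params)
# 640 258 129 1027
```

Most of the parameters sit in the shared layer, which is why both heads benefit from every gradient update.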


<font size=5 color='blue'>
Optimizer

In [5]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
huber_loss = keras.losses.Huber()
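
The Huber loss is quadratic for small errors and linear for large ones, which limits the influence of occasional large return errors on the critic. The following is a minimal scalar sketch, assuming the threshold `delta = 1.0` (the Keras default); the `huber` function is illustrative and is not the Keras implementation.

```python
def huber(y_true, y_pred, delta=1.0):
    """Scalar Huber loss: quadratic inside delta, linear outside."""
    error = y_true - y_pred
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)

print(huber(1.0, 0.5))  # 0.125 (quadratic region)
print(huber(3.0, 0.0))  # 2.5   (linear region)
```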

<font size=5 color='blue'>
Training

In [6]:
action_probs_history = []
critic_value_history = []
rewards_history = []
running_reward = 0
episode_count = 0
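
The `running_reward` initialized above is updated in the training loop as an exponential moving average with weight 0.05, and training stops once it exceeds 195. The sketch below estimates how many consecutive perfect episodes that takes when starting from 0. It assumes an episode reward of 200, the episode cap of `CartPole-v0`; the `episodes_to_solve` helper is hypothetical.

```python
def episodes_to_solve(episode_reward=200.0, threshold=195.0, alpha=0.05):
    """Count episodes until the moving average exceeds the threshold."""
    running_reward = 0.0
    episodes = 0
    while running_reward <= threshold:
        running_reward = alpha * episode_reward + (1 - alpha) * running_reward
        episodes += 1
    return episodes, running_reward

episodes, final_running_reward = episodes_to_solve()
print(episodes)  # 72
```

Because the average starts at 0, early values of `running_reward` understate the agent's actual performance.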

In [7]:
while True:  # Run until solved
    state = env.reset()
    episode_reward = 0
    with tf.GradientTape() as tape:
        for timestep in range(1, max_steps_per_episode):
            # env.render(); Adding this line would show the attempts
            # of the agent in a pop up window.

            state = tf.convert_to_tensor(state)
            state = tf.expand_dims(state, 0)

            # Predict action probabilities and estimated future rewards
            # from environment state
            action_probs, critic_value = model(state)
            critic_value_history.append(critic_value[0, 0])

            # Sample action from action probability distribution
            action = np.random.choice(num_actions, p=np.squeeze(action_probs))
            action_probs_history.append(tf.math.log(action_probs[0, action]))

            # Apply the sampled action in our environment
            state, reward, done, _ = env.step(action)
            rewards_history.append(reward)
            episode_reward += reward

            if done:
                break

        # Update running reward to check condition for solving
        running_reward = 0.05 * episode_reward + (1 - 0.05) * running_reward

        # Calculate expected value from rewards
        # - At each timestep, the total reward received from that timestep onward
        # - Rewards further in the future are discounted by multiplying them with gamma
        # - These are the labels for our critic
        returns = []
        discounted_sum = 0
        for r in rewards_history[::-1]:
            discounted_sum = r + gamma * discounted_sum
            returns.insert(0, discounted_sum)

        # Normalize
        returns = np.array(returns)
        returns = (returns - np.mean(returns)) / (np.std(returns) + eps)
        returns = returns.tolist()

        # Calculating loss values to update our network
        history = zip(action_probs_history, critic_value_history, returns)
        actor_losses = []
        critic_losses = []
        for log_prob, value, ret in history:
            # At this point in history, the critic estimated that we would get a
            # total reward = `value` in the future. We took an action with log probability
            # of `log_prob` and ended up receiving a total reward = `ret`.
            # The actor must be updated so that it predicts an action that leads to
            # high rewards (compared to critic's estimate) with high probability.
            diff = ret - value
            actor_losses.append(-log_prob * diff)  # actor loss

            # The critic must be updated so that it predicts a better estimate of
            # the future rewards.
            critic_losses.append(
                huber_loss(tf.expand_dims(value, 0), tf.expand_dims(ret, 0))
            )

        # Backpropagation
        loss_value = sum(actor_losses) + sum(critic_losses)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Clear the loss and reward history
        action_probs_history.clear()
        critic_value_history.clear()
        rewards_history.clear()

    # Log details
    episode_count += 1
    if episode_count % 10 == 0:
        template = "running reward: {:.2f} at episode {}"
        print(template.format(running_reward, episode_count))

    if running_reward > 195:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break


running reward: 10.70 at episode 10
running reward: 19.50 at episode 20
running reward: 34.58 at episode 30
running reward: 37.08 at episode 40
running reward: 35.89 at episode 50
running reward: 38.70 at episode 60
running reward: 48.47 at episode 70
running reward: 53.22 at episode 80
running reward: 77.48 at episode 90
running reward: 90.62 at episode 100
running reward: 88.15 at episode 110
running reward: 79.33 at episode 120
running reward: 74.41 at episode 130
running reward: 106.03 at episode 140
running reward: 141.24 at episode 150
running reward: 149.03 at episode 160
running reward: 165.80 at episode 170
running reward: 151.18 at episode 180
running reward: 149.83 at episode 190
running reward: 161.91 at episode 200
running reward: 177.19 at episode 210
running reward: 186.34 at episode 220
running reward: 191.82 at episode 230
running reward: 169.29 at episode 240
running reward: 153.30 at episode 250
running reward: 164.00 at episode 260
running reward: 156.87 at episode 
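
The return computation and the actor loss term from the training loop can be traced by hand on a short episode. This is a simplified sketch: the critic values and action probabilities are made-up numbers, the normalization step is omitted for brevity, and `discounted_returns` is a hypothetical helper that mirrors the loop over `rewards_history[::-1]`.

```python
import math

gamma = 0.99

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every timestep t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

rewards = [1.0, 1.0, 1.0]          # CartPole gives +1 per surviving step
returns = discounted_returns(rewards, gamma)
print(returns)                      # approximately [2.9701, 1.99, 1.0]

# Hypothetical critic estimates and probabilities of the chosen actions.
values = [2.0, 2.0, 2.0]
chosen_probs = [0.5, 0.5, 0.5]

actor_losses = []
for ret, value, p in zip(returns, values, chosen_probs):
    diff = ret - value              # how much better than the critic expected
    actor_losses.append(-math.log(p) * diff)
print(actor_losses)
```

Minimizing a term with a positive `diff` raises the log probability of that action, while a negative `diff` lowers it.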

<font size=5 color='blue'>
Visualizations

<font size=5 color='black'>
In early stages of training:
    
![Imgur](https://i.imgur.com/5gCs5kH.gif)

In later stages of training:
![Imgur](https://i.imgur.com/5ziiZUD.gif)

