#### About

> Deep reinforcement learning

Deep reinforcement learning (DRL) is a type of machine learning that combines reinforcement learning (RL) with deep neural networks (DNNs). An agent is trained to make decisions in an environment: it observes a state, takes an action, and receives a reward, with the goal of maximizing the cumulative reward. Mathematically, DRL can be described using the following components:

1. Markov Decision Process (MDP): An MDP is a mathematical model of the interaction between an agent and its environment. It is defined by the tuple (S, A, P, R), where:

- S is the state space representing all the possible states the environment can be in.
- A is the action space representing all the possible actions an agent can take.
- P is the transition probability function that gives the probability of going from one state to another when an action is performed.
- R is the reward function that defines the immediate reward that the agent receives after performing a certain action in a certain state. 

2. Policy: A policy is a strategy used by an agent to decide what action to take in a given state. A deterministic policy maps each state to a single action, while a stochastic policy maps each state to a probability distribution over actions.

3. Value Function: A value function estimates the expected cumulative reward that an agent can obtain from a given state or state-action pair under a given policy. It is used to guide the agent's decision-making process by evaluating the long-term desirability of various states or actions. 

4. Q-value function: The Q-value function, also known as the action-value function, is similar to the value function, but it considers the specific action performed in addition to the state. It calculates the expected cumulative reward that an agent can obtain from a given state-action pair under a given policy.

5. Bellman Equation: The Bellman equation is a key equation in reinforcement learning that expresses the value of a state or state-action pair in terms of the immediate reward and the value of the successor state or state-action pair. It is used to update the value and Q-value functions during the learning process.

6. Deep Neural Network (DNN): In DRL, a DNN is used to approximate the policy, the value function, or the Q-value function. These are typically neural networks with multiple hidden layers that can learn complex representations from raw state or state-action inputs.

7. Replay Buffer: The replay buffer is a data structure used in DRL to store the agent's past experience and sample from it during training. Sampling at random breaks the temporal correlation between successive samples and improves the stability of the learning process.

8. Exploration vs. Exploitation: Exploration refers to trying out different actions to discover their effect on the environment, while exploitation refers to choosing the actions that the current policy already rates as best. Balancing the two is a key challenge in DRL: the agent must explore enough to learn an optimal policy without getting stuck in a suboptimal one.

9. Learning Algorithms: DRL algorithms typically combine RL techniques (such as Q-learning, SARSA, or Actor-Critic) with DNNs to learn a policy, value function, or Q-value function. These algorithms update the model parameters based on observed experience and the Bellman equation to improve the policy or the value estimates.
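
Items 3 to 5 can be written out explicitly. The following is a sketch in standard notation, not taken from the notebook: the discount factor $\gamma \in [0, 1]$ and the learning rate $\alpha$ are assumptions introduced here, since neither appears in the tuple (S, A, P, R) above, and $\pi(a \mid s)$ denotes the probability that the policy picks action $a$ in state $s$.

$$
\begin{aligned}
V^{\pi}(s) &= \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_0 = s\right] \\
Q^{\pi}(s, a) &= \mathbb{E}_{\pi}\!\left[\sum_{t \ge 0} \gamma^{t} r_{t} \,\middle|\, s_0 = s,\ a_0 = a\right] \\
Q^{\pi}(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \\
Q^{*}(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^{*}(s', a') \\
Q(s, a) &\leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]
\end{aligned}
$$

The first two lines define the value and Q-value functions, the third and fourth are the Bellman expectation and optimality equations, and the last is the sample-based Q-learning update. With $\gamma = 1$, the bracketed target reduces to $r + \max_{a'} Q(s', a')$, which is the quantity the training loop further down regresses the network onto.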

In [8]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


In [9]:
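# Define the gridworld environment as lookup tables indexed by [action, state]
# The agent starts at position 0 and has two actions ('left' or 'right'), one per row
# Intended rewards: +1 for reaching position 3 and -1 for reaching position 1
# Note: both tables hold the same values, including -1 entries. When -1 is later
# used as a state index, NumPy reads it as the last column (position 3).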
num_states = 4
num_actions = 2
transitions = np.array([[0, -1, 1, -1],
                        [-1, 0, -1, 1]])
rewards = np.array([[0, -1, 1, -1],
                    [-1, 0, -1, 1]])


In [10]:
# Define DNN model
model = Sequential()
model.add(Dense(10, input_dim=1, activation='relu'))
model.add(Dense(num_actions, activation='linear'))
# `learning_rate` is the current name of the argument formerly spelled `lr`
model.compile(loss='mse', optimizer=Adam(learning_rate=0.01))
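
The model above receives the state as a single scalar (`input_dim=1`), so the network has to treat state labels as points on a number line. A common alternative for a small discrete state space is a one-hot input. The sketch below is only an illustration of that encoding: `one_hot` is a hypothetical helper that is not part of the notebook, and using it would also require building the first layer with `input_dim=num_states`.

```python
import numpy as np


def one_hot(state, num_states):
    """Return a (1, num_states) batch with a 1.0 at the given state index."""
    # Reject out-of-range indices instead of letting them wrap around
    if not 0 <= state < num_states:
        raise ValueError("state {} is outside 0..{}".format(state, num_states - 1))
    vec = np.zeros((1, num_states), dtype=np.float32)
    vec[0, state] = 1.0
    return vec


encoded = one_hot(2, 4)
print(encoded.shape)
print(encoded)
```

The explicit range check is a design choice for this sketch: it makes an invalid state such as -1 fail loudly rather than being silently encoded.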


In [12]:
num_episodes = 20
for episode in range(num_episodes):
    state = 0
    done = False
    total_reward = 0

    while not done:
        # Choose action based on epsilon-greedy policy
        epsilon = 0.1
        if np.random.rand() <= epsilon:
            action = np.random.randint(num_actions)
        else:
            q_values = model.predict(np.array([[state]]))[0]
            action = np.argmax(q_values)

        # Take action and observe next state and reward
        next_state = transitions[action, state]
        reward = rewards[action, state]
        total_reward += reward

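        # Update Q-value function using Bellman equation
        # (no discount factor, i.e. gamma = 1; the target bootstraps from
        # next_state even when next_state is terminal)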
        q_values = model.predict(np.array([[state]]))[0]
        next_q_values = model.predict(np.array([[next_state]]))[0]
        q_values[action] = reward + np.max(next_q_values)

        # Update the model with the new Q-values
        model.fit(np.array([[state]]), np.array([q_values]), verbose=0)

        # Update current state for next iteration
        state = next_state

        # Check if the episode is done
        if state == 1 or state == 3:
            done = True

    print("Episode: {}, Total Reward: {}".format(episode, total_reward))

Episode: 0, Total Reward: 0
Episode: 1, Total Reward: 0
Episode: 2, Total Reward: 0
Episode: 3, Total Reward: 0
Episode: 4, Total Reward: 0
Episode: 5, Total Reward: 0
Episode: 6, Total Reward: 0
Episode: 7, Total Reward: 0
Episode: 8, Total Reward: 0
Episode: 9, Total Reward: 0
Episode: 10, Total Reward: 0
Episode: 11, Total Reward: 0
Episode: 12, Total Reward: 0
Episode: 13, Total Reward: 0
Episode: 14, Total Reward: 0
Episode: 15, Total Reward: -1
Episode: 16, Total Reward: 0
Episode: 17, Total Reward: 0
Episode: 18, Total Reward: 0
Episode: 19, Total Reward: 0
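
The training loop above fits the network on each transition as soon as it is observed, so the replay buffer described in item 7 is not used. A minimal sketch of how such a buffer might look is given below, together with a one-step target that includes a discount factor and stops bootstrapping at terminal states. The names `ReplayBuffer`, `td_target`, `capacity`, and `gamma` are assumptions made for this illustration and do not come from the notebook.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions
        size = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target; no bootstrapping past a terminal state."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=(t == 4))

batch = buffer.sample(batch_size=2)
print(len(buffer), len(batch))
print(td_target(1.0, [0.5, 2.0], done=False, gamma=0.5))
print(td_target(1.0, [0.5, 2.0], done=True))
```

In a training loop, each sampled transition would supply the `(state, target)` pair passed to `model.fit`, instead of the most recent transition only.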


In [13]:
# Roll out the greedy policy using the trained model
state = 0
done = False

while not done:
    # Get Q-values for current state
    q_values = model.predict(np.array([[state]]))[0]

    # Choose action with highest Q-value
    action = np.argmax(q_values)

    # Take action and observe next state and reward
    next_state = transitions[action, state]
    reward = rewards[action, state]

    # Update current state for next iteration
    state = next_state

    # Print current state and action taken
    print("State: {}, Action: {}".format(state, action))

    # Check if the episode is done
    if state == 1 or state == 3:
        done = True

State: -1, Action: 1
State: 1, Action: 1


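In this greedy rollout, the agent took action 1 from state 0, which moved it to state -1 with a reward of -1, and then took action 1 from state -1, which moved it to state 1 with a reward of +1. State -1 is not a valid grid position: it comes from the -1 entries in the `transitions` table, and NumPy reads index -1 as the last column. State 1 is terminal, so the rollout ends with a total reward of 0, which matches most of the training episodes above.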
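
The two steps of this rollout can be checked directly against the lookup tables, without the network. The sketch below copies the tables from the notebook and replays action 1 twice from state 0; the `trace` list is a name introduced here for illustration only.

```python
import numpy as np

# Same tables as in the notebook, indexed as [action, state]
transitions = np.array([[0, -1, 1, -1],
                        [-1, 0, -1, 1]])
rewards = np.array([[0, -1, 1, -1],
                    [-1, 0, -1, 1]])

state, action = 0, 1
trace = []
for _ in range(2):
    # A state of -1 selects the last column, exactly like state 3 would
    next_state = int(transitions[action, state])
    reward = int(rewards[action, state])
    trace.append((state, action, next_state, reward))
    state = next_state

for step in trace:
    print("state={}, action={}, next_state={}, reward={}".format(*step))
# state=0, action=1, next_state=-1, reward=-1
# state=-1, action=1, next_state=1, reward=1
```

Because the state doubles as an array index, any table entry of -1 is interpreted as "last column" rather than raising an error, which is why the rollout passes through state -1 without failing.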
