# Articles on QRL


 Quantum reinforcement learning is an emerging field that combines principles of quantum mechanics with reinforcement learning techniques. It aims to leverage the unique properties of quantum systems, such as superposition and entanglement, to enhance the efficiency and effectiveness of reinforcement learning algorithms.
 
Most of the algorithms are online, so the agent must interact with the environment while running on a quantum computer. There are, however, examples of articles that follow offline learning. Brief explanations of some quantum reinforcement learning algorithms follow:


## Quantum Reinforcement Learning using VQC

### 1. VQC for DRL

The article by Chen et al. presents an approach that combines a Variational Quantum Circuit (VQC) with a Deep Q-Network (DQN) to enhance the performance of Deep Reinforcement Learning (DRL) algorithms in two distinct environments: Frozen Lake and Cognitive Radio. Frozen Lake is a maze environment representing a frozen lake with holes; the goal is to find the best path without falling into any hole. In Cognitive Radio, the agent has to choose an unoccupied channel out of n channels. The algorithm below outlines the methodology used in this article.

**Algorithm: Variational Quantum Deep Q Learning**

- **Initialize** replay memory 𝒟 to capacity N
- **Initialize** action-value function quantum circuit Q with random parameters

For episode = 1, 2, ..., M:
1. **Initialize** state \( s_1 \) and encode it into the quantum state.
2. For \( t = 1, 2, ..., T \):
   - With probability \( \epsilon \), select a random action \( a_t \).
   - Otherwise, select \( a_t = \arg\max_a Q^*(s_t, a; \theta) \) from the output of the quantum circuit.
   - Execute action \( a_t \) in the emulator and observe reward \( r_t \) and next state \( s_{t+1} \).
   - Store transition \( (s_t, a_t, r_t, s_{t+1}) \) in 𝒟.
   - Sample a random minibatch of transitions \( (s_j, a_j, r_j, s_{j+1}) \) from 𝒟.
   - Set:
   $$
   y_j =
   \begin{cases}
   r_j & \text{if terminal } s_{j+1} \\
   r_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta) & \text{if non-terminal } s_{j+1}
   \end{cases}
   $$
   - Perform a gradient descent step on \( (y_j - Q(s_j, a_j; \theta))^2 \).

End For
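The action selection and target computation in the loop above can be sketched in NumPy. This is a minimal illustration and not the authors' code: the Q-values are plain arrays standing in for the output of the quantum circuit, and the names `epsilon_greedy` and `td_targets` are hypothetical.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def td_targets(rewards, next_q_values, terminal, gamma):
    """y_j = r_j if s_{j+1} is terminal, else r_j + gamma * max_a' Q(s_{j+1}, a')."""
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    terminal = np.asarray(terminal, dtype=bool)
    bootstrap = gamma * next_q_values.max(axis=1)
    return rewards + np.where(terminal, 0.0, bootstrap)

rng = np.random.default_rng(0)
# Stand-in Q-values for one state with four actions (assumed numbers).
action = epsilon_greedy(np.array([0.1, 0.7, 0.2, 0.0]), epsilon=0.0, rng=rng)
# Minibatch of two transitions: the first ends the episode, the second does not.
targets = td_targets([1.0, 0.0], [[0.5, 2.0], [3.0, 1.0]], [True, False], gamma=0.9)
```

The squared difference between `targets` and the circuit's prediction for the stored action would then be the loss minimized in the last step of the loop.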


In the Frozen Lake problem, state information undergoes basis encoding to convert it into the quantum format, which is then fed into the variational layer. On the other hand, the Cognitive Radio problem requires consideration of both the channel and time information, resulting in a two-tuple observation format (channel, time) that is processed through quantum encoding and variational layers.
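As a rough illustration of basis encoding, the sketch below maps an integer state to its bit string and to the corresponding computational basis vector. It assumes a 4x4 Frozen Lake grid (16 states, hence 4 qubits); the function name `basis_encode` is hypothetical and the gates used in the article to prepare the state are not reproduced here.

```python
import numpy as np

def basis_encode(state_index, n_qubits):
    """Return the bit string and basis-state vector for an integer state."""
    if not 0 <= state_index < 2 ** n_qubits:
        raise ValueError("state_index does not fit in n_qubits")
    bits = [int(b) for b in format(state_index, f"0{n_qubits}b")]
    vec = np.zeros(2 ** n_qubits)
    vec[state_index] = 1.0  # |b_1 b_2 ... b_n> in the computational basis
    return bits, vec

# State 11 of an assumed 16-state grid: 11 = 0b1011.
bits, vec = basis_encode(11, 4)
```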

The results of the study demonstrate the effectiveness of the proposed approach on both classical simulators and quantum computers. Notably, the efficiency of the reinforcement learning process improves because fewer parameters are required. The combination of VQC and DQN holds promise for enhancing the efficiency and effectiveness of DRL algorithms in complex environments.





### 2. RL with QVC

The article by Lockwood et al. applies Variational Quantum Circuits (VQC, also written QVC below) to reinforcement learning, focusing on two environments: Cartpole and Blackjack. Cartpole is an inverted pendulum mounted on a cart that moves in the horizontal plane; the goal is to keep the pendulum stable by moving the cart. Blackjack is a card game environment in which the goal is to obtain a hand value as close to 21 as possible without exceeding it.

The methodology employed in this article uses VQC in combination with both Deep Q-Network (DQN) and Double Deep Q-Network (DDQN) algorithms, resulting in four distinct model variations. These variations include pure VQC models as well as hybrid VQC models integrated with either DQN or DDQN. A notable aspect of the VQC implementation in this article is the incorporation of a quantum pooling mechanism to account for differences in the action space and observed readings. The figure below shows the circuit diagram for one layer of the QVC.

<p>
<figure>
<img src="..//Images/2.png">
<figcaption>One layer of the QVC, composed of CNOT and
parametrized rotation gates
</figcaption>
</figure>
</p>
<br>

The pooling operation is modelled by applying Pauli rotation gates, followed by a controlled-NOT gate and then the inverse of those gates on the sink qubit (a gate followed by its inverse gives the identity, for example $XX^{-1} = I$). The figure below shows the circuit diagram for the pooling operation.


<p>
<figure>
<img src="..//Images/3.png">
<figcaption>Quantum pooling operation (single pooling operation)
</figcaption>
</figure>
</p><br>

The article also explores two encoding techniques for the data. The first is Directional Encoding, designed for skewed data in which some data points have a range of $(-\infty, \infty)$. The second is Scaled Encoding, suitable for environments whose inputs have defined ranges.


The results of the study demonstrate that the proposed VQC-based approach outperforms classical approaches, reaching a reward threshold faster in both the Cartpole and Blackjack environments. This shows the prospect of using VQC to enhance the efficiency and effectiveness of reinforcement learning tasks. Overall, the article highlights the successful application of VQC with DQN and DDQN algorithms in reinforcement learning, and provides insight into the advantages of quantum techniques in environments such as Cartpole and Blackjack.


### 3. VQRL using evolutionary optimization

The article by Chen et al. examines the use of evolutionary optimization in Variational Quantum Reinforcement Learning (VQRL) problems. The goal of the article is to apply evolutionary optimization concepts to tackle these problems. The study focuses on two distinct environments: Cartpole and Minigrid. Cartpole is an inverted pendulum mounted on a cart that moves in the horizontal plane, and the goal is to keep the pendulum stable by moving the cart. For this environment, the article employs a two-step methodology.


First, the implementation uses Amplitude Encoding to transform observations into the amplitudes of quantum states. The figure below shows the circuit for the amplitude encoding routine.

<p>
<figure>
<img src="..//Images/6.png">
<figcaption>Amplitude encoding routine pre-inversion
</figcaption>
</figure>
</p><br>

Secondly, an Action Selection mechanism is implemented based on the maximum values of two variables, a and b. If a is the maximum, the action is defined as -1, and if b is the maximum, the action is defined as 1.
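The two steps can be sketched as follows. This is a minimal illustration under stated assumptions: the observation is zero-padded to a power-of-two length and normalized, the quantities `a` and `b` are treated as opaque numbers since the text does not define them further, and ties are resolved in favour of action 1. The function names are hypothetical.

```python
import numpy as np

def amplitude_encode(observation):
    """Normalize an observation so it can serve as state amplitudes."""
    x = np.asarray(observation, dtype=float)
    n_qubits = max(1, int(np.ceil(np.log2(len(x)))))
    padded = np.zeros(2 ** n_qubits)
    padded[: len(x)] = x  # zero-pad up to the next power of two
    norm = np.linalg.norm(padded)
    if norm == 0:
        raise ValueError("cannot encode the all-zero vector")
    return padded / norm, n_qubits

def select_action(a, b):
    """Return -1 if a is the larger value, otherwise 1 (ties go to 1)."""
    return -1 if a > b else 1

# A made-up four-dimensional Cartpole-like observation.
amps, n_qubits = amplitude_encode([0.02, -0.5, 0.03, 0.8])
action = select_action(0.7, 0.3)
```

A four-dimensional observation fits into the amplitudes of two qubits, which is one reason amplitude encoding is attractive when qubits are scarce.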



In the Minigrid  environment, which is a collection of grid-world environments, the article adopts a different approach. It begins by using a 147-dimensional vector representation. To reduce the dimensionality of the environment, a tensor network is applied prior to feeding the input into the quantum computer.


A matrix product state (MPS) is a type of one-dimensional tensor network (TN) that decomposes a large tensor into a series of matrices. A general N-qubit quantum state can be written as


$$
|\Psi\rangle=\sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} T_{i_1 i_2 \cdots i_N}\left|i_1\right\rangle \otimes\left|i_2\right\rangle \otimes \cdots \otimes\left|i_N\right\rangle
$$

where $T_{i_1 i_2 \ldots i_N}$ is the amplitude of each basis state $|i_1\rangle \otimes |i_2\rangle \otimes \ldots \otimes |i_N\rangle$.
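To make the idea of decomposing the amplitude tensor concrete, the sketch below splits a state vector into a chain of site tensors by repeated singular value decomposition, truncating each bond to at most `chi` values. It is an illustration of the general MPS construction, not the trainable TN used in the article; `to_mps` and `from_mps` are hypothetical names.

```python
import numpy as np

def to_mps(psi, n_qubits, chi):
    """Split a 2**n state vector into n site tensors of shape (left, 2, right)."""
    tensors = []
    rest = np.asarray(psi, dtype=complex).reshape(1, -1)
    for _ in range(n_qubits - 1):
        left = rest.shape[0]
        u, s, vh = np.linalg.svd(rest.reshape(left * 2, -1), full_matrices=False)
        keep = min(chi, len(s))  # truncate the bond dimension to chi
        tensors.append(u[:, :keep].reshape(left, 2, keep))
        rest = s[:keep, None] * vh[:keep]
    tensors.append(rest.reshape(rest.shape[0], 2, 1))
    return tensors

def from_mps(tensors):
    """Contract the chain back into a flat state vector."""
    out = tensors[0]
    for t in tensors[1:]:
        out = np.tensordot(out, t, axes=([-1], [0]))
    return out.reshape(-1)

plus = np.ones(8) / np.sqrt(8)            # product state |+++>
ghz = np.zeros(8)
ghz[0] = ghz[7] = 1 / np.sqrt(2)          # entangled GHZ state
plus_ok = np.allclose(from_mps(to_mps(plus, 3, chi=1)), plus)
ghz_ok = np.allclose(from_mps(to_mps(ghz, 3, chi=2)), ghz)
```

A product state is reproduced exactly with bond dimension 1, whereas the entangled GHZ state needs bond dimension 2, which shows how the bond dimension limits what the chain can represent.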


The agent in this scenario is a hybrid TN-VQC agent. The article further incorporates Variational Encoding based on rotation and applies action selection.

In the variational encoding, the $H \otimes \ldots \otimes H$ operation is applied to the initial quantum state $|0\rangle \otimes \ldots \otimes |0\rangle$ to produce an equal superposition of all basis states:

$$
|+ \rangle \otimes \ldots \otimes |+ \rangle
$$


For an N-qubit system, the corresponding unbiased state is described by the equations below:

$$
(H|0\rangle)^{\otimes N} = \underbrace{H|0\rangle \otimes \cdots \otimes H|0\rangle}_N
$$

$$
= \underbrace{|+\rangle \otimes \cdots \otimes|+\rangle}_N
$$

$$
= \underbrace{\left[\frac{1}{\sqrt{2}}(|0\rangle+|1\rangle)\right] \otimes \cdots \otimes\left[\frac{1}{\sqrt{2}}(|0\rangle+|1\rangle)\right]}_N
$$


$$
= \sum_{\substack{(q_1, q_2, \ldots, q_N) \in \{0,1\}^N}} \frac{1}{\sqrt{2^N}}\left|q_1\right\rangle \otimes \left|q_2\right\rangle \otimes \cdots \otimes \left|q_N\right\rangle.
$$
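The chain of equalities above can be checked numerically for a small number of qubits. The sketch below builds $(H|0\rangle)^{\otimes N}$ with Kronecker products; the helper name `uniform_superposition` is only for illustration.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate
ket0 = np.array([1.0, 0.0])                   # |0>

def uniform_superposition(n_qubits):
    """Return (H|0>)^{tensor n} as a vector of 2**n amplitudes."""
    state = np.array([1.0])
    for _ in range(n_qubits):
        state = np.kron(state, H @ ket0)
    return state

state = uniform_superposition(3)
```

Every one of the $2^N$ amplitudes equals $1/\sqrt{2^N}$, matching the last line of the derivation.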


The results of the study indicate noteworthy findings. In the Cartpole problem, the proposed approach uses only 26 parameters, fewer than classical approaches. For the Minigrid problem, the article explores different grid dimensions and observes that the TN-VQC model with a bond dimension of $\chi = 1$ outperforms the others, requiring fewer generations to achieve higher scores. These results demonstrate the potential of such approaches to reduce the number of parameters and improve performance in complex reinforcement learning scenarios.


### 4. RL based VQC optimization for combinatorial problems

The article by Khairy et al. focuses on the application of Reinforcement Learning (RL) techniques to optimizing Variational Quantum Circuits (VQC) for combinatorial problems, specifically the Max-Cut problem on Erdos-Renyi graphs. The objective is to achieve the maximum cut value on the given graphs.

The methodology employed in this article revolves around the Quantum Approximate Optimization Algorithm (QAOA) . The authors approach the problem as learning an optimizer for QAOA, which involves iteratively updating the algorithm's parameters based on state, action, and reward factors. The Proximal Policy Optimization (PPO) algorithm is utilized to search for the policy that updates the parameters in small increments. To prevent policy collapse caused by excessive updates, an early stopping method is implemented, terminating the optimizer algorithm when the difference between older and newer policies reaches a predefined threshold. As the problem involves parameterized quantum circuits, it can be categorized as a variational quantum circuit problem.

The results of the study highlight several key observations. Increasing the depth of the QAOA circuit is found to improve the approximation ratio. The reward is obtained quickly, but it fluctuates thereafter, which indicates potential areas for future investigation and optimization.
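For readers unfamiliar with the approximation ratio, the sketch below computes the cut value of a partition, finds the true Max-Cut of a small graph by brute force, and divides an expected cut value by it. The graph and the expected cut value of 3.2 are made-up numbers; in QAOA the expected cut would come from measuring the circuit, which is not simulated here.

```python
from itertools import product

def cut_value(edges, assignment):
    """Count edges whose endpoints fall on different sides of the partition."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

def max_cut(n_nodes, edges):
    """Brute-force the best cut; only feasible for small graphs."""
    return max(cut_value(edges, bits) for bits in product((0, 1), repeat=n_nodes))

def approximation_ratio(expected_cut, n_nodes, edges):
    """Ratio between an obtained (expected) cut value and the optimum."""
    return expected_cut / max_cut(n_nodes, edges)

# A square with one diagonal (hypothetical example graph).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
best = max_cut(4, edges)
ratio = approximation_ratio(3.2, 4, edges)
```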



### 5. On the Use of QRL in Energy-Efficiency Scenarios

The article by Andres et al. explores the application of Quantum Reinforcement Learning (QRL) in energy-efficiency scenarios, with a focus on HVAC control, energy management in electric vehicles (EVs), and charging station optimization. Such environments have been extensively studied in several articles addressing classical RL problems. The objective of the article is to investigate the benefits and limitations of QRL in solving energy-efficiency problems.


For HVAC control, the article performs two kinds of experiments. The first uses a classical multi-layer perceptron to model the policy of the agent. The second implements the policy as a hybrid model consisting of one variational quantum circuit and a linear layer. The environment used is Eplus-demo-v1.

For energy management in EVs, the article proposes a neural network (NN) agent that can solve the problem in both discrete and continuous action environments. It trains the agent in the Prius environment using the Advantage Actor-Critic (A2C) algorithm, and this agent serves as the baseline for comparison. A quantum agent is designed in a similar way.

For the charging station, the agent is designed with two different networks, each of which targets the selection of an action from an action space. This helps manage the higher computational power required.

The results reported in the article are as follows. For HVAC, the computation time is larger for the model using VQC, which is a consequence of using quantum simulators. The VQC model nevertheless obtains a better reward, with a significant difference between the rewards obtained with the classical and quantum layers.

For EVs, the quantum agent manages energy better than the classical agent. However, the computation time remains a major overhead for the quantum agent, and the VQC overtakes the classical agent only at a later time.

For charging stations, the quantum agent attains the optimal policy. Although the classical agent converges faster, it seems to get stuck in a local optimum.


### 6. Comparing QHRL to classical method




The article by Moll et al. compares Hybrid Quantum Reinforcement Learning (QHRL) with existing classical methods. The goal is to evaluate the performance of the VQ-DQN algorithm in training the agent. The Frozen Lake scenario is employed, a basic maze-style environment in which the goal is to choose the best path while avoiding holes.

The article conducts a comparative analysis of four models: Q Learning, DQN, non-pure VQ-DQN, and the VQ-DQN circuit proposed by the authors.

Q Learning is a classical RL algorithm that learns an optimal policy through trial and error by updating action values based on observed rewards and the estimated value of the next state. DQN (Deep Q-Network) is an RL algorithm that combines Q Learning with a deep neural network to handle high-dimensional state spaces.
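A single tabular Q Learning update can be sketched as below. The table size (16 states by 4 actions) assumes a 4x4 Frozen Lake grid, and the learning rate and discount factor are taken from the Q-Learning row of the hyperparameter table later in this section; the function name and the sample transition are hypothetical.

```python
import numpy as np

def q_learning_update(q_table, s, a, r, s_next, done, alpha, gamma):
    """Move Q(s, a) towards r + gamma * max_a' Q(s_next, a')."""
    target = r if done else r + gamma * q_table[s_next].max()
    q_table[s, a] += alpha * (target - q_table[s, a])
    return q_table

q = np.zeros((16, 4))           # one value per (state, action) pair
q[5] = [0.0, 1.0, 0.0, 0.0]     # pretend state 5 already has a useful action
q = q_learning_update(q, s=4, a=2, r=0.0, s_next=5, done=False, alpha=0.6, gamma=0.8)
```

DQN replaces the table `q` with a neural network, and the VQ-DQN variants replace that network with a variational quantum circuit.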

Notably, the non-pure VQ-DQN model refers to the work by Chen et al. The VQ-DQN circuit proposed by the authors of the article differs from the non-pure VQ-DQN model in the gates used. The authors' VQ-DQN circuit employs only Ry gates for the variational layer and only Rx gates for the encoding layer, while the non-pure VQ-DQN model by Chen et al. uses the rotation gates Rx, Ry, and Rz in the variational layer and Ry and Rz gates in the encoding layer.

The results of the study indicate that the VQ-DQN circuit requires fewer parameters to train the agent than its classical equivalents, as shown in the table below.

<table>
    <caption>Comparing the memory usage and runtime of the algorithms for 750 episodes of training</caption>
    <thead>
        <tr>
            <th>Algorithm</th>
            <th>Q-Learning</th>
            <th>DQN</th>
            <th>Pure VQ-DQN</th>
            <th>VQ-DQN*</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Number of parameters</td>
            <td>64</td>
            <td>172</td>
            <td>40</td>
            <td>28</td>
        </tr>
        <tr>
            <td>Runtime</td>
            <td>336268 μs</td>
            <td>190 s</td>
            <td>> 21 h</td>
            <td></td>
        </tr>
    </tbody>
</table>



This finding suggests that the hybrid quantum RL approach offers advantages in terms of parameter efficiency. The parameters for optimal performance are shown in the table below.


<table>
    <caption>Parameters for optimal performance of the RL algorithms</caption>
    <thead>
        <tr>
            <th>Algorithm</th>
            <th>&#945;</th>
            <th>&#947;</th>
            <th>&#949;</th>
            <th>Batch size</th>
            <th>Memory size</th>
            <th>Target update</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Q-Learning</td>
            <td>0.6</td>
            <td>0.8</td>
            <td>0.9</td>
            <td></td>
            <td></td>
            <td></td>
        </tr>
        <tr>
            <td>DQN</td>
            <td>0.5</td>
            <td>0.8</td>
            <td>0.9</td>
            <td>15</td>
            <td>80</td>
            <td>10</td>
        </tr>
        <tr>
            <td>Pure VQ-DQN</td>
            <td>0.22</td>
            <td>0.8</td>
            <td>0.9</td>
            <td>15</td>
            <td>80</td>
            <td>10</td>
        </tr>
        <tr>
            <td>VQ-DQN*</td>
            <td>0.4</td>
            <td>0.9999</td>
            <td>1.0</td>
            <td>15</td>
            <td>80</td>
            <td>20</td>
        </tr>
    </tbody>
</table>



### 7. Robustness of QRL under Hardware errors


The article by Skolik et al. investigates the robustness of Quantum Reinforcement Learning (QRL) algorithms to shot noise, coherent errors, and incoherent errors. The study focuses on two environments that use VQC, namely Cartpole and the Travelling Salesman Problem, and implements Q Learning and Policy Gradient approaches for both environments.

The Cartpole environment involves balancing a pole on top of a cart by applying appropriate forces, aiming to keep the pole upright.
The Travelling Salesman Problem requires finding the shortest route that visits a set of cities exactly once, challenging the agent to optimize the travel distance.

Q-learning is a model-free reinforcement learning technique that estimates the optimal action-value function Q(s,a) to determine the best action a in a given state s. It involves updating Q-values based on the Bellman equation while balancing exploration and exploitation of the environment.
Policy gradient methods directly learn the optimal policy by gradient ascent on the expected rewards. The policy is modeled as a parameterized function such as a neural network. The gradients are estimated from samples, and the model is updated via backpropagation to maximize the cumulative reward. The research also aims to ensure the robustness of the models under various types of errors and uncertainties.

To assess the impact of shot noise, the study explores the trade-off between the performance of the agent and the number of measurement shots per circuit. Coherent errors are modeled by introducing Gaussian random perturbations to the variational parameters, which helps in understanding the impact of coherent noise. Additionally, the study analyzes incoherent errors resulting from the inevitable interaction between qubits.
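The first two error sources can be illustrated on a toy single-qubit circuit. The sketch below assumes a circuit $R_y(\theta)|0\rangle$ measured in the Z basis, whose exact expectation is $\cos\theta$; shot noise is emulated by sampling $\pm 1$ outcomes, and a coherent error by adding a Gaussian offset to $\theta$. The circuit, the function names, and the numbers are illustrative assumptions, not the setup of the article.

```python
import numpy as np

def exact_expectation(theta):
    """<Z> after RY(theta) on |0> is cos(theta)."""
    return np.cos(theta)

def sampled_expectation(theta, shots, rng):
    """Estimate <Z> from a finite number of +1/-1 measurement outcomes."""
    p0 = (1 + np.cos(theta)) / 2  # probability of measuring +1
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p0, 1 - p0])
    return outcomes.mean()

def perturbed_expectation(theta, sigma, rng):
    """Exact <Z> when the rotation angle is off by a Gaussian error."""
    return exact_expectation(theta + rng.normal(0.0, sigma))

rng = np.random.default_rng(0)
theta = 0.7
# Absolute estimation error for increasing shot budgets.
errors = {
    shots: abs(sampled_expectation(theta, shots, rng) - exact_expectation(theta))
    for shots in (10, 1000, 100000)
}
```

The statistical error of the sampled estimate shrinks roughly as one over the square root of the number of shots, which is the source of the performance versus shot-count trade-off.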

The results demonstrate that both Q Learning and Policy Gradient methods exhibit robustness to the errors considered, suggesting the potential for running these algorithms on quantum computers. The article provides a valuable method for maintaining a certain level of robustness in such quantum systems, which is essential for practical applications.


### 8. Parametrized Quantum Policies for RL

This article by Jerbi et al. explores the use of policies based on Parameterized Quantum Circuits (PQCs) for classical Reinforcement Learning (RL) environments. The researchers introduce new model constructions and assess their learning performance through numerical investigations. They demonstrate that PQC policies can rival the performance of established Deep Neural Network (DNN) policies in benchmarking environments. Furthermore, the work demonstrates an empirical benefit of PQC policies over standard DNN policies in difficult RL tasks. The researchers also design environments where PQC policies outperform classical learners, incorporating the discrete logarithm problem, known for its quantum computational advantage. This research highlights the potential of quantum policies in RL settings.

The article introduces two policy families based on PQCs, RAW-PQC and SOFTMAX-PQC, and outlines the PQC structure, which is composed of encoding and variational unitaries. SOFTMAX-PQC incorporates an adjustable softmax activation function for more nuanced policy adjustments. The policies are characterized by trainable parameters for rotation angles, scaling factors, and observables; giving the observables trainable weights generalizes them one step further. The learning algorithm, shown later in this section, is based on the REINFORCE method and emphasizes the computation of gradients for policy updates. The parameter-shift rule is introduced for estimating partial derivatives with respect to rotation angles and scaling parameters in the case of SOFTMAX-PQC policies, and a similar approach is used for RAW-PQC policies. The formal definitions of the two models are provided below.



Definition 1 (RAW-PQC and SOFTMAX-PQC). Given a PQC acting on $n$ qubits, taking as input a state $s \in \mathbb{R}^d$, rotation angles $\boldsymbol{\phi} \in[0,2 \pi]^{|\boldsymbol{\phi}|}$ and scaling parameters $\boldsymbol{\lambda} \in \mathbb{R}^{|\boldsymbol{\lambda}|}$, such that its corresponding unitary $U(s, \boldsymbol{\phi}, \boldsymbol{\lambda})$ produces the quantum state $\left|\psi_{s, \boldsymbol{\phi}, \boldsymbol{\lambda}}\right\rangle=U(s, \boldsymbol{\phi}, \boldsymbol{\lambda})\left|0^{\otimes n}\right\rangle$, we define its associated RAW-PQC policy as:
$$
\pi_{\boldsymbol{\theta}}(a \mid s)=\left\langle P_a\right\rangle_{s, \boldsymbol{\theta}}
$$
where $\left\langle P_a\right\rangle_{s, \boldsymbol{\theta}}=\left\langle\psi_{s, \boldsymbol{\phi}, \boldsymbol{\lambda}}\left|P_a\right| \psi_{s, \boldsymbol{\phi}, \boldsymbol{\lambda}}\right\rangle$ is the expectation value of a projection $P_a$ associated to action $a$, such that $\sum_a P_a=I$ and $P_a P_{a^{\prime}}=\delta_{a, a^{\prime}} P_a$, and $\boldsymbol{\theta}=(\boldsymbol{\phi}, \boldsymbol{\lambda})$ constitutes all of its trainable parameters. Using the same PQC, we also define a SOFTMAX-PQC policy as:
$$
\pi_{\boldsymbol{\theta}}(a \mid s)=\frac{e^{\beta\left\langle O_a\right\rangle_{s, \boldsymbol{\theta}}}}{\sum_{a^{\prime}} e^{\beta\left\langle O_{a^{\prime}}\right\rangle_{s, \boldsymbol{\theta}}}}
$$
where $\left\langle O_a\right\rangle_{s, \boldsymbol{\theta}}=\left\langle\psi_{s, \boldsymbol{\phi}, \boldsymbol{\lambda}}\left|\sum_i w_{a, i} H_{a, i}\right| \psi_{s, \boldsymbol{\phi}, \boldsymbol{\lambda}}\right\rangle$ is the expectation value of the weighted Hermitian operators $H_{a, i}$ associated to action $a$, $\beta \in \mathbb{R}$ is an inverse-temperature parameter, and $\boldsymbol{\theta}=(\boldsymbol{\phi}, \boldsymbol{\lambda}, \boldsymbol{w})$.
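The SOFTMAX-PQC formula can be evaluated numerically once the expectation values are known. In the sketch below the values of $\langle O_a \rangle$ are made-up numbers standing in for circuit measurements, and `softmax_policy` is a hypothetical helper name.

```python
import numpy as np

def softmax_policy(expectations, beta):
    """pi(a|s) = exp(beta * <O_a>) / sum_a' exp(beta * <O_a'>)."""
    logits = beta * np.asarray(expectations, dtype=float)
    logits -= logits.max()  # subtract the maximum for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

expectations = [0.3, -0.1]  # hypothetical <O_a> for two actions
cold = softmax_policy(expectations, beta=10.0)  # large beta: nearly greedy
warm = softmax_policy(expectations, beta=0.1)   # small beta: nearly uniform
```

The inverse temperature $\beta$ therefore controls how strongly the policy favours the action with the largest expectation value.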


The learning algorithm estimates the expected cumulative rewards through Monte Carlo simulations. These estimates are used to update the policy parameters for better performance. The process involves computing the gradient of the log-policy, which is crucial for adjusting the parameters. The approach aims to isolate the analysis of PQC policies from other learning mechanisms, providing a focused evaluation of their properties.
The softmax policy is calculated from the expectation values of a PQC with input scaling factors and trainable observable weights. These extra PQC traits increase the expressivity and adaptability of PQC policies (such as quantum classifiers), allowing them to train well in benchmarking environments that are comparable to those used for conventional DNNs.
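The parameter-shift rule mentioned above can be illustrated on a single rotation angle. The sketch assumes a one-qubit circuit $R_y(\theta)|0\rangle$ measured in the Z basis, simulated with NumPy, and covers only the rotation-angle case; the treatment of scaling parameters in the article is not reproduced. The function names are hypothetical.

```python
import numpy as np

def expectation_z(theta):
    """<Z> for RY(theta)|0>, which equals cos(theta)."""
    state = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    z = np.diag([1.0, -1.0])
    return float(state @ z @ state)

def parameter_shift(f, theta, shift=np.pi / 2):
    """Exact derivative of f for a Pauli rotation angle, from two shifted evaluations."""
    return (f(theta + shift) - f(theta - shift)) / 2

theta = 0.4
grad = parameter_shift(expectation_z, theta)
```

Unlike a finite-difference estimate, the two evaluations are taken at macroscopic shifts, and for this circuit the result coincides with the analytic derivative $-\sin\theta$.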

**Algorithm: Learning Algorithm with PQC Policy and Value Function**

**Input:** a PQC policy $\pi_{\boldsymbol{\theta}}$ from Definition 1; a value-function approximator $\widetilde{V}_{\boldsymbol{\omega}}$

1. Initialize parameters $\boldsymbol{\theta}$ and $\boldsymbol{\omega}$  
2. **While** True:  
    a. Generate $N$ episodes $\left\{\left(s_0, a_0, r_1, \ldots, s_{H-1}, a_{H-1}, r_H\right)\right\}_i$ following $\pi_{\boldsymbol{\theta}}$  
    b. **For** each episode $i$ in batch:  
        i. Compute the returns $G_{i,t} \leftarrow \sum_{t^{\prime}=1}^{H-t} \gamma^{t^{\prime}} r_{t+t^{\prime}}^{(i)}$  
        ii. Compute the gradients $\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t^{(i)} \mid s_t^{(i)})$ using Lemma 1  
    c. Fit $\left\{\widetilde{V}_{\boldsymbol{\omega}}(s_t^{(i)})\right\}_{i,t}$ to the returns $\left\{G_{i,t}\right\}_{i,t}$  
    d. Compute $\Delta \boldsymbol{\theta} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H-1} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t^{(i)} \mid s_t^{(i)}) \left(G_{i,t} - \widetilde{V}_{\boldsymbol{\omega}}(s_t^{(i)})\right)$  
    e. Update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \Delta \boldsymbol{\theta}$
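The return computation in step b.i and the per-episode contribution to the update can be sketched as follows. The return follows the formula in the algorithm, with `rewards[k]` holding $r_{k+1}$. Weighting each gradient by the return minus the value estimate is the standard REINFORCE-with-baseline form and is an assumption about how the fitted value function enters the update. The gradients are plain arrays here, standing in for the quantities a PQC would provide, and the function names are hypothetical.

```python
import numpy as np

def returns(rewards, gamma):
    """G_t = sum_{t'=1}^{H-t} gamma^{t'} r_{t+t'}, with rewards[k] holding r_{k+1}."""
    H = len(rewards)
    G = np.zeros(H)
    running = 0.0
    for t in reversed(range(H)):
        running = gamma * (rewards[t] + running)  # G_t = gamma * (r_{t+1} + G_{t+1})
        G[t] = running
    return G

def episode_update(grad_log_probs, G, baseline):
    """Sum over time of grad log pi(a_t|s_t) * (G_t - V(s_t)) for one episode."""
    advantages = np.asarray(G, dtype=float) - np.asarray(baseline, dtype=float)
    return (np.asarray(grad_log_probs, dtype=float) * advantages[:, None]).sum(axis=0)

G = returns([1.0, 1.0, 1.0], gamma=0.5)
# Made-up log-policy gradients for two parameters over three time steps.
step = episode_update([[1, 0], [0, 1], [1, 1]], G, baseline=[0.0, 0.0, 0.0])
```

Averaging `step` over the $N$ episodes of a batch and scaling by the learning rate $\alpha$ would give the parameter update of the final line of the algorithm.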



### 9. Introduction to QRL: Theory and PennyLane-based Implementation

The article by Kwak et al. focuses on performing Quantum Reinforcement Learning (QRL) on the Cartpole environment. The objective is to introduce QRL theory and demonstrate its implementation using the PennyLane framework.

The methodology uses Variational Quantum Policy circuits, where the agent obtains state information from the environment and employs a policy-VQC to determine actions. The policy is trained using the Proximal Policy Optimization (PPO) algorithm, which updates the existing policy while constraining the updates to be small through probability ratio clipping, as shown in the algorithm below. The implementation uses the PennyLane and PyTorch frameworks for the QRL task.


**Algorithm: Variational Quantum Deep Q Learning with PPO**

- Initialize replay memory $\mathcal{D}$ to capacity $N$  
- Initialize action-value function quantum circuit $Q$ with random parameters $\theta$  
- Initialize state value function $V(s ; \phi)$  

**For** each episode $=1,2, \ldots, M$:  
1. Initialize state $s_1$ and encode it into the quantum state  

#### 1. Inference Process  
**For** $t=1,2, \ldots, T$:  
- With probability $\epsilon$, select a random action $a_t$  
- Otherwise, select $a_t = \arg\max_a Q^*(s_t, a ; \theta)$ from the output of the quantum circuit  
- Execute action $a_t$ in the emulator and observe reward $R_t$ and next state $s_{t+1}$  
- Store transition $\left(s_t, a_t, R_t, s_{t+1}\right)$ in $\mathcal{D}$  

#### 2. Training Process  
**For** $i=1, \ldots, K_{\text{epoch}}$:  
- Sample a random mini-batch of transitions $\left(s_j, a_j, R_j, s_{j+1}\right)$ from $\mathcal{D}$  
- Calculate temporal difference target:  
  $$
  y_j = \begin{cases} 
  R_j & \text{if terminal } s_{j+1} \\ 
  R_j + \gamma \max_{a'} Q\left(s_{j+1}, a' ; \theta\right) & \text{otherwise} 
  \end{cases}
  $$
- Calculate temporal difference:  
  $$
  \delta_j = y_j - V\left(s_j\right)
  $$
  for non-terminal $s_{j+1}$  
- Calculate estimated advantage function:  
  $$
  \hat{A}_j = \delta_j + (\gamma \lambda) \delta_{j+1} + \ldots + (\gamma \lambda)^{J-j-1} \delta_{J-1}
  $$
- Calculate ratio:  
  $$
  r_j = \frac{\pi_\theta\left(a_j \mid s_j\right)}{\pi_{\theta_{\text{OLD}}}\left(a_j \mid s_j\right)}
  $$
- Calculate surrogate actor loss function using Equation (3)  
- Calculate critic loss function:  
  $$
  \left|V(s_j) - y_j\right|
  $$
- Calculate gradient and update actor and critic parameters
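The advantage estimate and the clipped probability ratio in the training process can be sketched in NumPy. The surrogate loss is assumed to be the standard PPO clipped objective, since the equation it refers to is not reproduced in this section; the clipping range of 0.2 and all input numbers are hypothetical.

```python
import numpy as np

def gae(deltas, gamma, lam):
    """A_j = delta_j + (gamma*lam) * delta_{j+1} + ... up to the last step."""
    adv = np.zeros(len(deltas))
    running = 0.0
    for j in reversed(range(len(deltas))):
        running = deltas[j] + gamma * lam * running
        adv[j] = running
    return adv

def clipped_surrogate(ratio, advantage, clip=0.2):
    """Negative PPO objective: mean of min(r * A, clip(r, 1-c, 1+c) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip, 1 + clip) * advantage
    return -np.minimum(unclipped, clipped).mean()

adv = gae([1.0, 0.0, 2.0], gamma=1.0, lam=0.5)
loss = clipped_surrogate(ratio=[1.5, 0.5], advantage=[1.0, -1.0])
```

Taking the minimum of the clipped and unclipped terms removes the incentive to move the probability ratio far from 1, which is how PPO keeps policy updates small.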

The results indicate that QRL with Variational Quantum Policies uses fewer parameters than classical approaches. However, handling the noise introduced by quantum computation is challenging, as the deviation of the rewards is high.




### 10. QDRL for Robot Navigation Tasks

The article by Heimann et al. applies Quantum Deep Reinforcement Learning (QDRL) to a simulated robot task in the Turtlebot2 environment, which is used for robot navigation tasks. The objective is to explore the use of QRL in improving robot navigation.

The methodology involves running the simulated robot in three different cases. The first case uses a classical approach, employing neural networks for both Q-value functions. The other two cases use parameterized quantum circuits as approximators. The first quantum case uses a fully connected neural network with three layers, with a ReLU activation for hidden layers and a linear activation in the last layer. The second quantum case employs 3 qubits and differs from the first case in terms of the encoding method and the number of layers used.

The results indicate that the quantum cases require an order of magnitude fewer parameters compared to their classical equivalent. However, the training time is the shortest in the classical case. The second quantum case demonstrates improved training speed compared to the first case and shows similar performance to the classical approach.


### 11. Quantum Deep Recurrent RL

The article by Chen et al. discusses recurrent connections that store the memory of past time steps. The article uses quantum long short-term memory (QLSTM), a recurrent neural network (RNN) architecture designed for quantum computing applications. Like classical Long Short-Term Memory (LSTM) networks, QLSTM can process sequences of data while preserving information over longer time periods. The VQC architecture for QLSTM used in this article is shown in the figure below.

<p>
<figure>
<img src="..//Images/paper10-1.png">
<figcaption>VQC Architecture for QLSTM
</figcaption>
</figure>
</p><br>


The QLSTM model used in this article is built from VQCs consisting of 8 qubits. The input and hidden dimensions are both 4.

In the Cart-Pole environment, the study compares different models for reinforcement learning (RL) agents. When the agent has complete access to the state of the environment, the Quantum Long Short-Term Memory (QLSTM) model with two Variational Quantum Circuit (VQC) layers outperforms the other studied configurations, as shown in the figure below.

<p>
<figure>
<img src="..//Images/result_paper10.png">
<figcaption>Result Obtained from DRQN for Cartpole
</figcaption>
</figure>
</p><br>

 Quantum models typically outperform their classical counterparts in terms of both stability and average scores. Learning is slower and more difficult in a partially observable environment, where the agent only observes some information . However, in terms of stability and performance, quantum models continue to outperform classical models. The classical LSTM with 16 hidden neurons, in particular, falters after 800 training events, but QLSTM agents remain stable.




### 12. Unentangled QRL agents in the OpenAI Gym

The article by Hsiao et al. explores the use of Quantum Reinforcement Learning (QRL) agents that employ only single-qubit gates without entanglement. The objective is to leverage an approach called SVQC (Single-qubit Variational Quantum Circuit) for QRL. Several environments are used: Cartpole, Blackjack, Frozen Lake, Acrobot, and Lunar Lander.

The methodology introduces the SVQC model, which differs from conventional Variational Quantum Circuits (VQCs) by excluding multi-qubit gates to prevent entanglement. The SVQC model comprises input encoding, a variational layer without multi-qubit gates, and output measurement. The measured output is passed to a classical neural network, together with strategies for output reuse. The performance of the SVQC model is compared with standard VQC models that allow entanglement.

The results demonstrate that the proposed SVQC model achieves better rewards compared to existing VQC models. Additionally, the SVQC model exhibits improved convergence speed when compared to classical neural networks. Moreover, the SVQC model utilizes a smaller number of parameters while delivering superior performance. The experiments were conducted on a quantum simulator.


### 13. QRL in continuous action space

Wu et al.  present a Quantum Reinforcement Learning (QRL) framework addressing Continuous Action Space (CAS)  and Discrete Action Space (DAS) . For CAS, their quantum Deep Deterministic Policy Gradient (DDPG) algorithm uses state amplitude encoding to handle dimensionality challenges. Classical simulations show QRL's effectiveness in quantum control problems like eigenstate preparation and state generation, particularly in low-dimensional systems.

To address both CAS and DAS, they introduce a quantum "environment" register that represents the RL environment. Its state at time step $t$ corresponds to the classical state $s_t$. The action $a(t)$ is represented by a parameterized action unitary $U(\boldsymbol{\theta}_t)$. In CAS the parameter $\boldsymbol{\theta}_t$ is continuous, while in DAS it takes values from a discrete set.

They define a reward unitary $U_r$ and a measurement observable $M$, and use a reward register to generate the quantum reward function:

$$
r_{t+1} \equiv f\left(\left\langle s_t\left|\left\langle 0\left|U^{\dagger}\left(\boldsymbol{\theta}_t\right) U_r^{\dagger} M U_r U\left(\boldsymbol{\theta}_t\right)\right| 0\right\rangle\right| s_t\right\rangle\right)
$$
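
To make the structure of this expression concrete, the following sketch evaluates it on a toy two-qubit system: one environment qubit and one reward qubit. Every concrete choice here is an assumption made for illustration and is not taken from the paper: $U(\boldsymbol{\theta}_t)$ is a single `RY` rotation on the environment qubit, $U_r$ is a CNOT from environment to reward register, $M$ is Pauli-Z on the reward qubit, and $f$ defaults to the identity.

```python
import math

def toy_quantum_reward(env_state, theta, f=lambda x: x):
    """Evaluate r = f(<psi| M |psi>) with |psi> = U_r U(theta) |s>|0>.

    Amplitudes are indexed as amp[2 * env_bit + reward_bit].
    env_state is a pair of real amplitudes for the environment qubit.
    """
    s0, s1 = env_state
    amp = [s0, 0.0, s1, 0.0]                     # |s> (env) tensor |0> (reward)

    # U(theta): RY(theta) on the environment qubit, identity on the reward qubit.
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    amp = [c * amp[0] - s * amp[2], c * amp[1] - s * amp[3],
           s * amp[0] + c * amp[2], s * amp[1] + c * amp[3]]

    # U_r: CNOT with the environment as control and the reward qubit as target.
    amp[2], amp[3] = amp[3], amp[2]

    # M: Pauli-Z on the reward qubit (+1 for reward_bit 0, -1 for reward_bit 1).
    expectation = sum(abs(a) ** 2 * (1 if i % 2 == 0 else -1)
                      for i, a in enumerate(amp))
    return f(expectation)

# Starting from |0>, the toy reward equals cos(theta).
r_stay = toy_quantum_reward([1.0, 0.0], 0.0)
r_flip = toy_quantum_reward([1.0, 0.0], math.pi)
```

In this toy setting the reward register simply records whether the action left the environment qubit in $|0\rangle$ or $|1\rangle$, and the measurement of $M$ turns that record into a number.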


The authors also analyze the gate complexity of a single RL iteration. They establish that the effectiveness of their method depends on whether the Quantum Neural Networks (QNNs), in particular Variational Quantum Circuits (VQCs), can be executed with an efficient gate complexity of $\mathrm{poly}(\log N)$.


--------------------------------------------

## Articles not using VQC for Quantum Reinforcement Learning

The sections below summarize articles that do not use a VQC to implement the QRL pipeline. During the literature review, we examined these articles to understand their methodologies and their potential to solve QRL problems in different environments.


### 1. DRL control of quantum cartpoles

Wang et al. apply deep reinforcement learning to a quantum cartpole control problem, the quantum analogue of the classical cartpole setup. They use Q-learning, which estimates future rewards through the Q function:

$$
Q^\pi\left(s_t, a_t\right)=r\left(s_t\right)+\mathbb{E}_{\left\{\left(s_{t+i}, a_{t+i}\right) \mid \pi\right\}_{i=1}^{\infty}}\left[\sum_{i=1}^{\infty} \gamma_q^i r\left(s_{t+i}\right)\right]
$$

Using a deep Q-network (DQN), they approximate $Q$ with a neural network and derive the optimal policy $\pi^*$ from it. Actions are forces discretized within $[-F_{\max}, F_{\max}]$, and the agent chooses one force at each discrete control step.
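
The two ingredients above, a discretized force range and a discounted sum of rewards, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the evenly spaced grid, and the finite reward sequence (the equation uses an infinite sum) are assumptions.

```python
def discretize_forces(f_max, n_actions):
    """Evenly spaced forces in [-f_max, f_max], one per discrete action."""
    if n_actions < 2:
        raise ValueError("need at least two actions")
    step = 2.0 * f_max / (n_actions - 1)
    return [-f_max + k * step for k in range(n_actions)]

def discounted_return(rewards, gamma):
    """r(s_t) + sum_{i>=1} gamma^i r(s_{t+i}) for a finite reward sequence."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def greedy_force(q_values, forces):
    """Pick the force whose estimated Q value is largest."""
    best = max(range(len(forces)), key=lambda k: q_values[k])
    return forces[best]

forces = discretize_forces(2.0, 5)
ret = discounted_return([1.0, 1.0, 1.0], 0.5)
chosen = greedy_force([0.1, 0.9, 0.3, 0.2, 0.0], forces)
```

In a DQN the list passed to `greedy_force` would come from the network's output for the current state, with one entry per discretized force.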

Their simulations show the effectiveness of deep RL in measurement-feedback cooling and stabilization of quantum cartpoles. Measurement-feedback cooling reduces the system’s energy or temperature by adjusting the system's state based on real-time measurements.






### 2. Multiqubit and multilevel RL with quantum technologies

Lamata et al.  explore the application of quantum reinforcement learning (QRL) in multiqubit and multilevel systems. Their proposed QRL protocol involves encoding environmental data into register states, which then interact with the agent, leading to state changes in the agent.

The QRL protocol for multiqubit systems involves three interaction types: conditional updates of registers based on environmental information, connecting register qubits to environment qubits via CNOT gates, and updating the register subspace based on the agent’s state. This method employs logic operations such as the GXOR gate to conditionally update the register system.
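
For intuition, the GXOR gate can be viewed on computational basis labels as a generalization of CNOT to $d$-level systems. The sketch below assumes the definition commonly used in the literature, $|x\rangle|y\rangle \mapsto |x\rangle|(x - y) \bmod d\rangle$; the convention used by Lamata et al. may differ, and the function only tracks basis labels rather than full quantum states.

```python
def gxor(control, target, d):
    """GXOR on basis labels: |x>|y> -> |x>|(x - y) mod d> (assumed convention)."""
    if not (0 <= control < d and 0 <= target < d):
        raise ValueError("labels must lie in range(d)")
    return control, (control - target) % d

# With the register initialized to level 0, GXOR copies the environment level.
environment_level, register_level = gxor(3, 0, 5)

# For d = 2 the gate acts as CNOT on basis states.
cnot_table = {(c, t): gxor(c, t, 2) for c in (0, 1) for t in (0, 1)}
```

Under this convention the gate is its own inverse, so applying it twice restores the original register level.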

Their research demonstrates that QRL can be applied across various quantum technologies, maintaining steady learning times and adaptability to imperfectly known environments. This approach is suitable for real-world applications, including superconducting circuits and trapped ions, paving the way for scalable quantum devices.




### 3. RL with Neural Networks for Quantum Feedback


Fosel et al. present a neural-network-based reinforcement learning (RL) approach to quantum feedback, focusing on developing new Quantum Error Correction (QEC) methods for few-qubit systems under random noise and hardware constraints. Their method is autonomous and requires no human guidance: a network agent develops and adapts feedback strategies based on measurement outcomes.

The goal is to train a neural network to protect quantum information from decoherence in quantum memory. This includes both stabilizer-code-based QEC variations and specialized approaches like decoherence-free subspaces and phase estimation. The QEC process involves encoding, error detection, correction, and decoding to retrieve the state with maximum fidelity.
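
The four QEC stages named above can be illustrated with the three-qubit bit-flip repetition code, tracked here at the level of classical bits. This is a textbook example chosen for illustration, not a strategy discovered by the RL agent in the paper, and it only handles a single bit-flip error.

```python
def encode(bit):
    """Encoding: repeat the logical bit three times."""
    return [bit, bit, bit]

def apply_bit_flip(codeword, position):
    """Noise model: flip one physical bit."""
    flipped = list(codeword)
    flipped[position] ^= 1
    return flipped

def syndrome(codeword):
    """Error detection: parities of neighboring pairs (stabilizer checks)."""
    return (codeword[0] ^ codeword[1], codeword[1] ^ codeword[2])

# Each nonzero syndrome points at the single bit that was flipped.
SYNDROME_TO_POSITION = {(1, 0): 0, (1, 1): 1, (0, 1): 2}

def correct(codeword):
    """Correction: undo the flip indicated by the syndrome, if any."""
    position = SYNDROME_TO_POSITION.get(syndrome(codeword))
    return list(codeword) if position is None else apply_bit_flip(codeword, position)

def decode(codeword):
    """Decoding: majority vote over the three bits."""
    return int(sum(codeword) >= 2)

noisy = apply_bit_flip(encode(1), 2)
recovered = decode(correct(noisy))
```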

Given the impossibility of classically simulating a full-scale quantum computer, modular approaches to quantum computation and devices are being developed. The hierarchical application of the quantum-module concept, with error-correction strategies applied to coupled modules, is theoretically feasible and well-suited to meet this challenge.

### 4. Experimental quantum speed-up RL agents

Saggio et al. report reinforcement learning (RL) experiments in which the communication channel between agent and environment is quantum rather than classical. Earlier quantum-based studies focused on efficient decision-making algorithms but had not achieved a reduction in learning time.


Their methodology employs a quantum framework for deterministic, strictly epochal learning, where percepts and rewards are determined by the actions. The RL workflow and a comparison of the classical and quantum channels are illustrated in the figure below.

<p>
<figure>
<img src="..//Images/11.png">
<figcaption>Reinforcement Learning workflow and quantum equivalent</figcaption>
</figure>
</p><br>

A unitary operation $U_E$ models the environment's behavior on the action and reward quantum registers. This enables the quantum-enhanced agent to perform a quantum search for rewarded action sequences and to teach these sequences to a classical agent. The result is a hybrid agent with a feedback loop between the quantum search and the classical policy.
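
The speed-up from searching for rewarded action sequences can be illustrated with a Grover-style amplitude amplification, simulated here on real amplitudes. This is a generic sketch under assumptions: a single rewarded sequence, an oracle that flips its sign (standing in for the role of $U_E$), and the hypothetical helper name `grover_success_probability`. It does not reproduce the photonic experiment.

```python
import math

def grover_success_probability(n_sequences, rewarded, iterations):
    """Probability of measuring the rewarded sequence after Grover iterations."""
    amps = [1.0 / math.sqrt(n_sequences)] * n_sequences   # uniform superposition
    for _ in range(iterations):
        amps[rewarded] = -amps[rewarded]                   # oracle: mark the reward
        mean = sum(amps) / n_sequences
        amps = [2.0 * mean - a for a in amps]              # inversion about the mean
    return amps[rewarded] ** 2

# With 4 candidate sequences, one iteration finds the rewarded one with certainty,
# compared with a probability of 1/4 for a uniformly random classical guess.
p_classical_guess = grover_success_probability(4, 0, 0)
p_after_search = grover_success_probability(4, 0, 1)
```

For $N$ candidate sequences, roughly $\frac{\pi}{4}\sqrt{N}$ iterations are needed, whereas a classical agent sampling at random needs on the order of $N$ trials.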


This approach quantifies the reduction in learning time. The agent can evaluate its performance and shorten the learning period, outperforming agents that use conventional communication. The figure below displays the average reward for different strategies. Additionally, new photonic circuit technology offers compactness, tunability, and low-loss transmission.

<p>
<figure>
<img src="..//Images/12.png">
<figcaption>Results for different learning strategies</figcaption>
</figure>
</p><br>

### 5. QRL via Policy Iteration


The article by Cherrat et al. describes a quantum variant of the classical policy iteration technique, which the authors name Quantum Reinforcement Learning through Policy Iteration (QRL-PI). A Markov decision process (MDP) is a mathematical framework for modelling decision-making problems, and QRL-PI is designed to find optimal policies for MDPs.

The approach formulates the RL problem in a quantum mechanical setting. The authors describe a quantum Markov decision process (QMDP), a quantum system that evolves over time and interacts with a quantum environment, as the quantum equivalent of an MDP. Policy iteration consists of two steps. The first is policy evaluation, which is defined as a quantum operation, akin to a unitary process or a quantum circuit. It takes a classical policy as input and produces a quantum state that approximates, or more broadly contains relevant information about, the classical value function $Q^\pi$.


The second step, policy improvement, is defined as a quantum operation akin to a generalized measurement. The quantum states produced by policy evaluation are measured to extract classical data, from which a new policy $\pi'$ is computed according to a predefined criterion. This two-step process forms the basis of the QRL-PI algorithm. The quantum policy iteration step is depicted in the figure below.

<p>
<figure>
<img src="..//Images/paper2.png">
<figcaption>Quantum policy Iteration Step</figcaption>
</figure>
</p><br>

The study presents a framework for QRL through exact and approximate policy iteration. Quantum policy evaluation encodes value functions, and policy improvement updates the policy based on measurement outcomes. The approach is effective in environments such as FrozenLake and InvertedPendulum. Further research is needed to identify environments where quantum linear algebra surpasses classical methods, and optimizing measurements and addressing the inherent noise in quantum procedures remain crucial. Future work may extend the theoretical guarantees to other quantum policy iteration variants, improving the understanding and applicability of QRL.
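
For reference, the classical loop that QRL-PI builds on alternates policy evaluation and greedy policy improvement until the policy stops changing. The sketch below is purely classical and runs on a hypothetical three-state deterministic chain; in QRL-PI the evaluation step would be replaced by a quantum operation and the improvement step by measurements, neither of which is modelled here.

```python
def policy_evaluation(policy, transitions, rewards, gamma, sweeps=200):
    """Iteratively estimate V^pi for a deterministic MDP."""
    values = [0.0] * len(policy)
    for _ in range(sweeps):
        values = [rewards[s][policy[s]] + gamma * values[transitions[s][policy[s]]]
                  for s in range(len(policy))]
    return values

def policy_improvement(values, transitions, rewards, gamma):
    """Return the policy that is greedy with respect to the given values."""
    n_actions = len(transitions[0])
    return [max(range(n_actions),
                key=lambda a, s=s: rewards[s][a] + gamma * values[transitions[s][a]])
            for s in range(len(values))]

def policy_iteration(transitions, rewards, gamma=0.9):
    """Alternate evaluation and improvement until the policy is stable."""
    policy = [0] * len(transitions)
    while True:
        values = policy_evaluation(policy, transitions, rewards, gamma)
        improved = policy_improvement(values, transitions, rewards, gamma)
        if improved == policy:
            return policy, values
        policy = improved

# Hypothetical chain: action 0 moves left, action 1 moves right.
# Moving right from state 1 enters the absorbing goal state 2 and pays reward 1.
TRANSITIONS = [[0, 1], [0, 2], [2, 2]]
REWARDS = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
policy, values = policy_iteration(TRANSITIONS, REWARDS)
```

On this chain the loop settles on moving right in states 0 and 1, which is the only way to collect the reward.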



### 6. Robust Optimization for QRL Control using Partial Observations

Jiang et al. discuss the role of quantum control in areas such as quantum communication and scalable computing, citing applications such as state steering and the Quantum Approximate Optimization Algorithm (QAOA). They address Hamiltonian uncertainty and control precision issues and propose Reinforcement Learning (RL) as an alternative method. Their RL scheme relies solely on partial observations for control decisions, eliminating the need for extra quantum measurements in the reward function.

Their quantum RL algorithm calculates rewards from partial observations, which significantly reduces measurement and computational costs. Unlike methods that rely on classical simulation or fidelity, it is practical for near-term quantum devices and adapts to varying noise levels and initial states.


