# Dicke quantum battery, coupling scheme: train

This Jupyter notebook trains an RL agent to discover the optimal time-dependent coupling control $\lambda_\text{c}(t)$ that maximizes the single battery unit ergotropy of the Dicke battery (see [arXiv.2212.12397](https://doi.org/10.1088/1367-2630/8/5/083)). This is the so-called "coupling case" in the manuscript.

Let us consider a setup with $N$ two-level systems (TLS) in a cavity. Setting $\hbar=1$, the Hamiltonian governing the evolution of the total quantum system (charger + battery) is given by
\begin{equation*}
	\mathcal{\hat H}(t) = \mathcal{\hat H}_{\rm C}+\mathcal{\hat H}_{\rm B}+\lambda_\text{c}(t)\,\mathcal{\hat H}_{\rm int}~,
\end{equation*}
where
\begin{equation*}
	\mathcal{\hat H}_{\rm C} =\omega_0\hat{a}^\dagger\hat{a}
\end{equation*}
is the charger Hamiltonian, which describes a single-mode cavity, $\hat{a}^\dagger,\hat{a}$ being the bosonic ladder operators.
\begin{align*}
	 \mathcal{\hat H}_{\rm B} &= \sum_{j=1}^N \hat{h}_j^{\rm B}~, & \hat{h}_j^{\rm B}&=\frac{\omega_0}{2}\big(\hat{\sigma}^{(z)}_j+1\big),
\end{align*}
is the battery Hamiltonian, with $\hat{\sigma}^{(\alpha)}_j$ ($\alpha=x,y,z$) denoting the Pauli matrices acting on the $j$-th TLS, and
\begin{equation*}
	\mathcal{\hat H}_{\rm int}=\omega_0\sum_{j=1}^N\hat{\sigma}^{(x)}_j (\hat{a}+\hat{a}^\dagger)
\end{equation*}
is the interaction Hamiltonian. The total quantum system is initialized in the state
\begin{equation*}
	\ket{\Psi_0}=\ket{{\rm G}}\otimes\ket{N},
\end{equation*}
where $\ket{N}$ is the cavity's Fock state with $N$ excitations, and where
\begin{equation*}
	\ket{{\rm G}} = \otimes_{j=1}^N\ket{0}_j,
\end{equation*}
$\ket{0}_j$ being the ground state of the $j$-th TLS.
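
Both the Hamiltonian and the initial state above are invariant under permutations of the TLSs, so the dynamics stays in the fully symmetric (Dicke) sector of the battery, whose dimension is $N+1$ rather than $2^N$. The sketch below compares the two Hilbert-space sizes once the cavity is truncated at a maximum photon number. The function names and the `n_cutoff` argument are illustrative and not taken from the repository, and it is an open assumption whether the simulation code exploits this reduction.

```python
def full_dimension(n_tls: int, n_cutoff: int) -> int:
    """Dimension of the truncated cavity times the full 2**N TLS space."""
    return (n_cutoff + 1) * 2 ** n_tls


def symmetric_dimension(n_tls: int, n_cutoff: int) -> int:
    """Dimension of the truncated cavity times the symmetric (Dicke) sector."""
    return (n_cutoff + 1) * (n_tls + 1)


if __name__ == "__main__":
    # Illustrative values: N = 16 TLSs, cavity truncated at 2N = 32 photons.
    n_tls, n_cutoff = 16, 32
    print(full_dimension(n_tls, n_cutoff))       # 2162688
    print(symmetric_dimension(n_tls, n_cutoff))  # 561
```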

Given a charging time $\tau$, the aim of the RL algorithm is to maximize the single battery unit ergotropy $\mathcal{E}^{(N)}_1(\tau)$ at the final time $\tau$, where
\begin{equation}
	\mathcal{E}^{(N)}_1(\tau)=\frac{\braket{\psi(\tau) |\mathcal{\hat H}_{\rm B}| \psi(\tau)}}{N}-r_{1}(\tau)\omega_0~,
\end{equation}
where $\ket{\psi(\tau)}$ is the state of the total system at time $\tau$ and $r_{1}(\tau)$ is the minimum eigenvalue of the single TLS reduced density matrix ${\rho}_{{\rm B},1}(\tau)$. Details on the calculation of the ergotropy are given in the appendix of the manuscript.
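
As a minimal illustration of the formula above, the following sketch evaluates the ergotropy of a single TLS directly from its $2\times 2$ reduced density matrix. It assumes that, by permutation symmetry, the energy per battery unit equals ${\rm Tr}[\rho_{{\rm B},1}\hat{h}^{\rm B}_1]$, and it assumes a basis ordering in which index 0 is the excited state. The function `single_tls_ergotropy` is hypothetical and is not the routine used by the training code.

```python
import math


def single_tls_ergotropy(rho, omega0=1.0):
    """Ergotropy of one TLS from its 2x2 reduced density matrix.

    Assumed basis ordering: index 0 is the excited state, index 1 the ground
    state, so that h = (omega0 / 2) * (sigma_z + 1) = diag(omega0, 0).
    """
    p_exc = complex(rho[0][0]).real
    p_gnd = complex(rho[1][1]).real
    coherence = abs(complex(rho[0][1]))
    # Mean energy Tr[rho h] of the single unit.
    energy = omega0 * p_exc
    # Smallest eigenvalue of a 2x2 Hermitian matrix, in closed form.
    r_min = 0.5 * (p_exc + p_gnd) - math.sqrt(
        (0.5 * (p_exc - p_gnd)) ** 2 + coherence ** 2
    )
    # The passive state places the smallest population r_min in the excited level.
    return energy - r_min * omega0


if __name__ == "__main__":
    excited = [[1.0, 0.0], [0.0, 0.0]]
    mixed = [[0.5, 0.0], [0.0, 0.5]]
    print(single_tls_ergotropy(excited))  # fully charged pure state
    print(single_tls_ergotropy(mixed))    # maximally mixed state
```

A fully excited TLS yields the maximum value $\omega_0$, while a maximally mixed TLS stores energy $\omega_0/2$ but has zero ergotropy, which is why the ergotropy rather than the energy is the quantity being maximized.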

#### Import modules

In [None]:
import numpy as np
import sys
import os
sys.path.append(os.path.join("..", "src"))
import sac_epi_envs
import sac_epi
import extra

## Setup new Training
The following code initiates a new training session for a given value of $N$ and $\tau$. All training logs, parameters, policy, and saved states will be stored under the ```data``` folder, within a subfolder named after the current date and time.
- ```env_params``` is a dictionary with the environment parameters.
- ```training_hyperparams``` is a dictionary with training hyperparameters.
- ```log_info``` is a dictionary that specifies which quantities to log.

The parameters below were used to produce the results in the manuscript for the coupling scheme. Notice that, to reproduce every point in Fig. 1 of the manuscript, the values of ```nq```, ```tau``` and ```dt``` must be varied accordingly (see the manuscript for details).

The choice of ```nq```, ```tau``` and ```dt``` below reproduces the furthest green dot in Fig. 1(b), i.e. it corresponds to $N=16$ with the largest value of $\tilde{g}\tau$.

In [None]:
nq = 16                           #number of qubits (two-level systems)
tau = 5.6                         #charging time tau 
dt = 0.2                          #duration of a timestep

env_params = {
    "wc": 1.,                     #frequency of the cavity     
    "nc": nq*2,                   #cutoff value of the Fock space (maximum number of photons in the cavity)
    "nc_i": nq,                   #number of photons in the cavity at t=0
    "wq": 1.,                     #frequency of the qubits (two level systems)
    "nq": nq,                     #number of qubits (two level systems)
    "min_u": -0.3,                #minimum value of the coupling \lambda_{\rm c}(t). This determines \tilde{g}  
    "max_u": +0.3,                #maximum value of the coupling \lambda_{\rm c}(t). This determines \tilde{g}
    "dt": dt,                     #duration of a timestep
    "tau": tau,                   #charging time
    "reward_coeff": 1.,           #coefficient multiplying the rewards
    "quantity": "ergotropy"       #quantity whose difference is returned as reward
} 
training_hyperparams = {
    "BATCH_SIZE": 256,            #batch size
    "LR": 0.001,                  #learning rate for Q and Pi loss
    "ALPHA_LR": 0.003,            #learning rate to tune the temperature parameters alpha 
    "H_START": 0.72,              #initial target entropy of the policy
    "H_END": -3.,                 #final target entropy of the policy
    "H_DECAY": 200000,            #exponential decay of the target entropy of the policy
    "C_START": [1./nq, 0.],       #initial weights for energy and ergotropy to compute the reward (Eqs. S38 - S39)
    "C_END": [0., 1.],            #final weights for energy and ergotropy to compute the reward (Eqs. S38 - S39)
    "C_MEAN": 40000,              #timestep number where the weights are half way between start and end
    "C_WIDTH": 20000,             #width in timesteps to transition from start and end weights        
    "REPLAY_MEMORY_SIZE": 180000, #size of the replay buffer
    "POLYAK": 0.995,              #polyak coefficient
    "LOG_STEPS": 1000,            #save logs and display training every number of steps
    "GAMMA": 0.993,               #RL discount factor
    "RETURN_GAMMA": 0.9,          #exponential averaging of the return when logging during training
    "LOSS_GAMMA": 0.995,          #exponential averaging of the loss functions when logging during training
    "HIDDEN_SIZES": (512,256),    #size of hidden layers of the neural networks
    "SAVE_STATE_STEPS": 480000,   #saves complete state of training every number of steps
    "INITIAL_RANDOM_STEPS": 5000, #number of initial uniformly random steps
    "UPDATE_AFTER": 1000,         #start minimizing loss function after initial steps
    "UPDATE_EVERY": 50,           #performs this many updates every this many steps
    "USE_CUDA": True,             #use cuda for computation
    "MIN_COV_EIGEN": 1.e-7,       #security parameter for the covariance matrix of the policy
    "DONT_SAVE_MEMORY": True      #if true, it won't save the memory buffer, so cannot resume training from file
}
log_info = {
    "log_running_reward": True,   #log running reward 
    "log_running_loss": True,     #log running loss
    "log_actions": True,          #log chosen actions
    "extra_str": f"_nq={nq}_tau={np.round(tau,3)}_dt={np.round(dt,3)}" #string to append to training folder name
}

#initialize training object
train = sac_epi.SacTrain()
train.initialize_new_train(sac_epi_envs.DickeBatteryOneControlCoupling, env_params, training_hyperparams, log_info)
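
The comments on ```H_START```, ```H_END```, ```H_DECAY``` and on ```C_START```, ```C_END```, ```C_MEAN```, ```C_WIDTH``` describe two schedules that evolve during training. The sketch below shows one plausible reading of them: an exponential relaxation of the target entropy and a logistic interpolation of the reward weights. Both functional forms are assumptions made for illustration; the exact expressions used by ```sac_epi``` are not given in this notebook and may differ.

```python
import math

# Values copied from the hyperparameters above (with nq = 16).
H_START, H_END, H_DECAY = 0.72, -3.0, 200000
C_START, C_END = [1.0 / 16, 0.0], [0.0, 1.0]
C_MEAN, C_WIDTH = 40000, 20000


def target_entropy(step):
    """Assumed form: exponential relaxation from H_START towards H_END."""
    return H_END + (H_START - H_END) * math.exp(-step / H_DECAY)


def reward_weights(step):
    """Assumed form: logistic interpolation between C_START and C_END."""
    s = 1.0 / (1.0 + math.exp(-(step - C_MEAN) / C_WIDTH))
    return [(1.0 - s) * a + s * b for a, b in zip(C_START, C_END)]


if __name__ == "__main__":
    for step in (0, 40000, 480000):
        weights = [round(c, 4) for c in reward_weights(step)]
        print(step, round(target_entropy(step), 3), weights)
```

With this reading, the weights sit exactly halfway between their start and end values at step ```C_MEAN```, which matches the comment on that hyperparameter; early training therefore rewards mostly the stored energy and late training rewards only the ergotropy.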

#### Train
Perform a given number of training steps. This cell can be run multiple times. While training, the following running averages are plotted:
- G: running average of the return, which is a running weighted average of the final energy and ergotropy (see Eqs. S38 - S39 of the manuscript);
- Obj 0: the first objective, i.e. the total energy of the battery;
- Obj 1: the second objective, i.e. the single TLS ergotropy;
- Q Running Loss;
- Pi Running Loss;
- alpha: the temperature parameter of the SAC method;
- entropy: the average entropy of the policy;
- u: the last 400 values of the time-dependent control that were proposed by the policy.
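
The running averages listed above are controlled by ```RETURN_GAMMA``` and ```LOSS_GAMMA```, which the hyperparameter comments describe as exponential averaging. A minimal sketch of such an average is given below, assuming the common update rule `avg = gamma * avg + (1 - gamma) * value`; the helper `running_average` is hypothetical and the library may initialize or normalize the average differently.

```python
def running_average(values, gamma):
    """Exponential running average, seeded with the first value (assumed rule)."""
    avg = None
    history = []
    for value in values:
        avg = value if avg is None else gamma * avg + (1.0 - gamma) * value
        history.append(avg)
    return history


if __name__ == "__main__":
    # Made-up returns, only to show how the smoothing reacts to a jump.
    returns = [0.0, 0.0, 1.0, 1.0, 1.0]
    print(running_average(returns, gamma=0.9))
```

A value of `gamma` closer to 1 gives a smoother but slower-reacting curve, which is consistent with the losses (0.995) being averaged over a longer window than the return (0.9).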

In [None]:
train.train(480000)

#### Clear object
If the previous block of code is to be run within a loop, it can help to clear the memory by running the following:

In [None]:
extra.clear_memory(train)
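
A loop over several training sessions could be organized as in the sketch below. The helper `make_run_config` is hypothetical, and the sweep values are placeholders, not the ones used in the manuscript. The calls to ```sac_epi``` and ```extra``` are left as comments because they require the repository modules and the dictionaries defined earlier in this notebook.

```python
def make_run_config(nq, tau, dt):
    """Return the environment overrides and folder suffix for one run."""
    env_overrides = {"nq": nq, "nc": 2 * nq, "nc_i": nq, "tau": tau, "dt": dt}
    extra_str = f"_nq={nq}_tau={round(tau, 3)}_dt={round(dt, 3)}"
    return env_overrides, extra_str


# Placeholder sweep values, NOT the ones used in the manuscript.
sweep = [(4, 2.0, 0.1), (8, 4.0, 0.2)]

for nq, tau, dt in sweep:
    env_overrides, extra_str = make_run_config(nq, tau, dt)
    print(extra_str, env_overrides)
    # In the notebook, one would then update env_params with env_overrides,
    # set log_info["extra_str"] = extra_str, and run:
    #   train = sac_epi.SacTrain()
    #   train.initialize_new_train(sac_epi_envs.DickeBatteryOneControlCoupling,
    #                              env_params, training_hyperparams, log_info)
    #   train.train(480000)
    #   extra.clear_memory(train)
```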

## Save the State
The full state of the training session is saved every ```SAVE_STATE_STEPS``` steps. Run this command if you wish to manually save the current state.

In [None]:
train.save_full_state()

## Load Existing Training
Any training session that was saved can be loaded by specifying the training session folder in ```log_dir```.

If ```DONT_SAVE_MEMORY: False``` in the hyperparameters, one can set ```no_train=False``` below, and:
- this will produce a new folder for logging with the current date-time;
- it is then possible to train the model for a longer time.

If ```DONT_SAVE_MEMORY: True``` in the hyperparameters, one must set ```no_train=True``` below. This does not create a new folder and does not allow further training, but it can be used to evaluate the current policy. See the ```2_evaluate_and_export_performance.ipynb``` notebook for this use case.

Saving the memory buffer makes it possible to resume training from an older session, but it takes up a lot of disk space.

In [None]:
log_dir = "../data/2024_08_15-11_44_05_nq=16_tau=5.6_dt=0.2" #example of a training folder

#create a new SacTrain object
train = sac_epi.SacTrain()

#load state from a folder
train.load_train(log_dir, no_train=True)

#### Train
If ```no_train=False``` and the memory buffer was saved, one can keep training using the following:

In [None]:
train.train(2000)