# Reinforcement learning with the quantum computing cloud Quafu

Before running a reinforcement learning (RL) task on real quantum devices with Quafu, make sure your environment is set up consistently. The code below is based on Python 3.8, which is compatible with the pinned version of TensorFlow. You can then install the following packages:

In [None]:
%pip install pyquafu 
%pip install tensorflow==2.7.0
%pip install tensorflow-quantum==0.7.2
%pip install gym

In [1]:
# model imports
import argparse
import re
from functools import reduce

import cirq
import gym
import models.quantum_genotypes as genotypes
import numpy as np
import tensorflow as tf
from models.quantum_models import generate_circuit
from models.quantum_models import generate_model_policy as Network
from models.quantum_models import get_model_circuit_params
from PIL import Image
from quafu import QuantumCircuit as quafuQC
from quafu import Task, User


Set the parameters of the RL task:

In [2]:
parser = argparse.ArgumentParser('Reinforcement learning with quantum computing cloud Quafu')
parser.add_argument('--env_name', type=str, default='CartPole-v1', help='environment name')
parser.add_argument('--state_bounds', type=np.array, default=np.array([2.4, 2.5, 0.21, 2.5]), help='state bounds')
parser.add_argument('--n_qubits', type=int, default=4, help='the number of qubits')
parser.add_argument('--n_actions', type=int, default=2, help='the number of actions')
parser.add_argument('--arch', type=str, default='NSGANet_id10', help='which architecture to use')
parser.add_argument('--shots', type=int, default=1000, help='the number of measurement shots')
parser.add_argument('--backend', type=str, default='ScQ-P10', help='which backend to use')
parser.add_argument('--model_path', type=str, default='./weights/weights_id10_quafu_94.h5', help='path of pretrained model')
args = parser.parse_args(args=[])

From the results returned by Quafu, you can compute the expectation value of an observable ($Z_{0} * Z_{1} * Z_{2} * Z_{3}$ for CartPole) with the following function:

In [3]:
def get_res_exp(res):
    # probabilities of all measured bitstrings
    prob = res.probabilities
    sumexp = 0
    for k, v in prob.items():
        # a bitstring with an even number of 1s contributes +1, an odd number -1
        if k.count('1') % 2 == 0:
            sumexp += v
        else:
            sumexp -= v
    return sumexp
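
As a small worked example of the parity rule used above, the expectation is $\sum_x (-1)^{|x|} p(x)$, where $|x|$ is the number of 1s in bitstring $x$. The sketch below is standalone and does not call Quafu; the name `parity_expectation` and the probabilities are made up for illustration and are not real device output.

```python
def parity_expectation(probabilities):
    """Expectation of the product of Z on every qubit, from bitstring probabilities."""
    total = 0.0
    for bitstring, p in probabilities.items():
        # even number of 1s -> eigenvalue +1, odd number -> eigenvalue -1
        sign = 1 if bitstring.count("1") % 2 == 0 else -1
        total += sign * p
    return total


# Made-up probabilities for illustration only (not real Quafu output).
example_probs = {"0000": 0.5, "0001": 0.25, "0011": 0.125, "0111": 0.125}
example_exp = parity_expectation(example_probs)
print(example_exp)  # 0.5 - 0.25 + 0.125 - 0.125 = 0.25
```

The result always lies in $[-1, 1]$ as long as the probabilities sum to one.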

It's important to build a pipeline that sends circuits to Quafu and gets the results back. The next part shows the whole process of running Cirq circuits on Quafu and obtaining expectation values from quantum devices.

In [4]:
def get_quafu_exp(circuit):
    # convert Cirq circuits to qasm and strip comments and blank lines
    openqasm = circuit.to_qasm(header='')
    openqasm = re.sub('//.*\n', '', openqasm)
    openqasm = "".join([s for s in openqasm.splitlines(True) if s.strip()])
    
    # fill in with your token, register on website http://quafu.baqis.ac.cn/
    user = User()
    user.save_apitoken(" ")
    
    # initialize to Quafu circuits
    q = quafuQC(args.n_qubits)
    q.from_openqasm(openqasm)
    
    # create the task
    task = Task()
    task.load_account()
    
    # choose sampling number and specific quantum devices
    shots = args.shots   
    task.config(backend=args.backend, shots=shots, compile=True)
    task_id = task.send(q, wait=True).taskid
    print('task_id:', task_id)
    
    # retrieve the result of the completed task and compute the expectation
    res = task.retrieve(task_id)
    if res.task_status != 'Completed':
        # without this check, OB would be undefined at the return statement
        raise RuntimeError('Quafu task %s did not complete (status: %s)' % (task_id, res.task_status))
    OB = get_res_exp(res)
    return task_id, tf.convert_to_tensor([[OB]])
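
To make the QASM clean-up step at the top of `get_quafu_exp` more concrete, here is a minimal standalone sketch of the same two operations (removing `//` comments, then dropping blank lines). The sample text is hypothetical: it only resembles what Cirq may emit, and the exact comments and layout depend on the Cirq version. The helper name `clean_qasm` is also an assumption made for this example.

```python
import re

# Hypothetical QASM text for illustration; real Cirq output may differ.
sample_qasm = (
    "// Generated from Cirq\n"
    "\n"
    "OPENQASM 2.0;\n"
    'include "qelib1.inc";\n'
    "\n"
    "// Qubits: [q(0, 0), q(0, 1)]\n"
    "qreg q[2];\n"
    "\n"
    "h q[0];\n"
    "cx q[0],q[1];\n"
)


def clean_qasm(qasm):
    # remove '//' comments up to and including the end of the line
    qasm = re.sub(r"//.*\n", "", qasm)
    # keep only lines that contain something other than whitespace
    return "".join(line for line in qasm.splitlines(True) if line.strip())


cleaned_qasm = clean_qasm(sample_qasm)
print(cleaned_qasm)
```

Only the statements remain, one per line, which is the form passed to `from_openqasm` in the cell above.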

The post-processing layer below applies the stored action-specific weights to the expectation values.

In [5]:
class Alternating_(tf.keras.layers.Layer):
    def __init__(self, obsw):
        super(Alternating_, self).__init__()
        self.w = tf.Variable(
            initial_value=tf.constant(obsw), dtype="float32", trainable=True, name="obsw")

    def call(self, inputs):
        return tf.matmul(inputs, self.w)

Then a softmax layer outputs the agent's policy, that is, the probability of choosing each action.

In [6]:
def get_obs_policy(obsw):
    process = tf.keras.Sequential([ Alternating_(obsw),
                                    tf.keras.layers.Lambda(lambda x: x * 1.0),
                                    tf.keras.layers.Softmax()
                                ], name="obs_policy")
    return process
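
The arithmetic of these two layers can be sketched without TensorFlow. The example assumes the weight matrix has shape `(1, n_actions)`, so the matrix product reduces to scaling the single expectation value by one weight per action; the weights `[1.0, -1.0]` are hypothetical, since the real values come from the pretrained checkpoint.

```python
import math


def softmax(logits):
    # subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_from_expectation(expectation, weights):
    # expectation: scalar in [-1, 1]; weights: one entry per action
    # (assumed to mirror the (1, n_actions) product in Alternating_)
    logits = [expectation * w for w in weights]
    return softmax(logits)


# Hypothetical weights for illustration only.
example_weights = [1.0, -1.0]
example_policy = policy_from_expectation(0.0, example_weights)
print(example_policy)  # equal logits give a uniform policy: [0.5, 0.5]
```

With these weights, a positive expectation value shifts probability toward action 0 and a negative one toward action 1.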

Build the model and load the pretrained weights:

In [7]:
qubits = cirq.GridQubit.rect(1, args.n_qubits)
genotype = getattr(genotypes, args.arch)  # look up the architecture by name without eval
ops = [cirq.Z(q) for q in qubits]
observables = [reduce((lambda x, y: x * y), ops)] # Z_0*Z_1*Z_2*Z_3
model = Network(qubits, genotype, args.n_actions, observables)
model.load_weights(args.model_path)

The following part lets the agent in the CartPole environment interact with Quafu. Every action choice corresponds to one task completed by a quantum device, and at the end you get a GIF showing the whole process.

In [None]:
# requires a gym version that supports render_mode (0.26.1 is used in this file)
# initialize the environment
env = gym.make(args.env_name, render_mode="rgb_array")
state, _ = env.reset()
frames = []

# set the maximum number of steps in the episode
for epi in range(100):
    im = Image.fromarray(env.render())
    frames.append(im)  
    
    # get PQC model parameters and expectations
    stateb = state/args.state_bounds
    newtheta, newlamda = get_model_circuit_params(qubits, genotype, model)
    circuit, _, _ = generate_circuit(qubits, genotype, newtheta, newlamda, stateb)
    _, expectation = get_quafu_exp(circuit)
    
    # get policy model parameters
    obsw = model.get_layer('observables-policy').get_weights()[0]
    obspolicy = get_obs_policy(obsw)
    policy = obspolicy(expectation)
    print('policy:', policy)
    
    # choose actions and make a step
    action = np.random.choice(args.n_actions, p=policy.numpy()[0])
    state, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        print(epi+1)
        break
env.close()

# save the frames as an animated GIF; replace ' ' with your output path
frames[0].save(' ', save_all=True, append_images=frames[1:], optimize=False, duration=20, loop=0)
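
Two small steps in the loop above, scaling the state by `state_bounds` and sampling an action from the policy, can be sketched with the standard library alone. The helper names and the example state are made up for illustration; the loop itself uses NumPy division and `np.random.choice`.

```python
import random

state_bounds = [2.4, 2.5, 0.21, 2.5]  # same defaults as --state_bounds


def normalize_state(state, bounds):
    # element-wise division, as in state / args.state_bounds
    return [s / b for s, b in zip(state, bounds)]


def sample_action(policy, rng):
    # draw one action index with probability given by the policy
    return rng.choices(range(len(policy)), weights=policy, k=1)[0]


# Made-up CartPole state for illustration only.
example_state = [1.2, -2.5, 0.0, 1.25]
normalized = normalize_state(example_state, state_bounds)
print(normalized)  # [0.5, -1.0, 0.0, 0.5]

rng = random.Random(0)  # seeded so the sketch is reproducible
action = sample_action([0.5, 0.5], rng)
print(action in (0, 1))  # True
```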