# Speed bottlenecks

From `output.svg` generated by `profile/profile.sh`, I believe that the performance is limited by the generation of the stim objects. 

Here I test this hypothesis and conclude that we are indeed limited by constructing the stim objects and thus this stim wrapper is "as fast as it can be". 

The inputs used for profiling this repository are described below; with them, creating the circuit takes ~45.5s.

Let's test the hypothesis:

In [1]:
import stim

In [2]:
%timeit list(range(1_000))

7.17 μs ± 17.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [3]:
%timeit list(range(10_000))

89.7 μs ± 856 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [4]:
%timeit list(range(100_000))

1.09 ms ± 4.41 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [5]:
%timeit stim.CircuitInstruction("I", tuple(range(1_000)))

32.9 ms ± 89.7 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [6]:
%timeit stim.CircuitInstruction("I", tuple(range(10_000)))

331 ms ± 649 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [7]:
%timeit stim.CircuitInstruction("I", tuple(range(100_000)))

3.33 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Building the `range` input takes ~1 ms at most, orders of magnitude less than constructing the stim object, so it is not what limits the creation of the stim object.

We see that the construction time scales linearly with the number of qubits in the input.
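One way to make the linear scaling explicit is to divide each measured time by the number of targets. The sketch below only reuses the notebook timings quoted above; the variable names are illustrative and not part of the repository.

```python
# Notebook timings quoted above for stim.CircuitInstruction("I", tuple(range(n))),
# stored as {number of targets: mean time in seconds}.
measured = {1_000: 32.9e-3, 10_000: 331e-3, 100_000: 3.33}

# Cost per target in microseconds; a roughly constant value indicates linear scaling.
per_target_us = {n: t / n * 1e6 for n, t in measured.items()}

for n, cost in per_target_us.items():
    print(f"{n:>7} targets: {cost:.1f} us per target")
```

The per-target cost stays at about 33 μs across two orders of magnitude, whereas building the `range` input costs roughly 0.01 μs per element.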

In [8]:
%timeit stim.CircuitInstruction("H", list(range(10_000)))

335 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%timeit stim.CircuitInstruction("CZ", list(range(10_000)))

338 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
%timeit stim.CircuitInstruction("DEPOLARIZE1", list(range(10_000)), [0.1])

339 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


We also see that there is no significant speed difference between gates (335-339 ms for 10,000 targets in all three cases).

Timings differ between a Jupyter notebook and the terminal. Since I ran `profile/profile.sh` in the terminal, I repeated the checks above there and obtained ~95ms for 10,000 qubits (with similar linear behavior for 1,000 and 100,000).
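A minimal sketch of how such a terminal check could be scripted with the standard `timeit` module is given here. The helper `best_time` is hypothetical (it is not part of this repository), and it reports the best run rather than the mean ± std. dev. that `%timeit` prints, so the numbers are not directly interchangeable.

```python
import timeit

try:
    import stim  # the library being timed
except ImportError:
    stim = None  # lets the helper be used even if stim is not installed


def best_time(func, repeat=3, number=1):
    """Best per-call wall-clock time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


if stim is not None:
    for n in (1_000, 10_000):
        targets = tuple(range(n))  # built outside the timed call
        t = best_time(lambda: stim.CircuitInstruction("I", targets))
        print(f"{n:>7} targets: {t * 1e3:.1f} ms")
```

Saving this to a file and running it with the same Python interpreter that `profile/profile.sh` uses keeps the comparison with the profile consistent.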

Now, I compute the number of layers in the circuit that I used to benchmark:

In [11]:
# QEC rounds
num_rounds = 13
num_layers_per_round = 10 * 2 # there are noise channels for each gate
# logical gates
ave_num_layers_per_gate = 2 * 2 # there are noise channels for each gate

In [12]:
d = 41
num_log_qubits = 2
num_qubits = (2*(d**2 + (d-1)**2) - 1) * num_log_qubits

In [13]:
0.095/10_000 * num_qubits * num_rounds * num_layers_per_round + \
0.095/10_000 * num_qubits * (num_rounds+1) * ave_num_layers_per_gate # logical M and R

39.392244000000005

Therefore, out of the ~45.5s needed to generate the circuit, ~39s are spent building the stim objects and only the remaining ~6s are spent in `surface_sim`, which is less than 15% of the overall time. Any further improvement to `surface_sim` can thus improve the speed by at most ~15%.
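The bound can be written out as a short calculation. This is only a sketch that reuses the two numbers above (the ~45.5s total and the 39.39s estimate from cell 13); the variable names are illustrative.

```python
total_s = 45.5       # measured circuit-generation time from the profile
stim_s = 39.392244   # estimated time spent constructing stim objects (cell 13)

other_s = total_s - stim_s          # time left for everything else, incl. surface_sim
fraction_other = other_s / total_s  # share of the total runtime
max_speedup = total_s / stim_s      # speedup if the non-stim part cost nothing

print(f"non-stim time: {other_s:.1f} s ({fraction_other:.0%} of total)")
print(f"upper bound on speedup: {max_speedup:.2f}x")
```

Under these numbers the non-stim part is roughly 13% of the runtime, so even removing it entirely would speed up circuit generation by only about 15%.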