# Guide: working with large datasets

Many workloads involve large datasets that need to be processed in parallel.

For this use case we will use [Ray datasets](https://docs.ray.io/en/latest/data/getting-started.html#datasets-getting-started). Because QuantumServerless is fully compatible with Ray, we can do so without any interruption to our workflows.

In [38]:
from typing import List, Tuple

from qiskit import QuantumCircuit
from qiskit.circuit.random import random_circuit
from qiskit.primitives import Sampler
from quantum_serverless import QuantumServerless, distribute_task, get

from ray import data # let's import data from ray

In [10]:
%%capture

serverless = QuantumServerless()

serverless.context()

Let's create our first dataset of circuits:

In [56]:
# Build a dataset of 100 random single-qubit circuits of depth 2
ds = data.from_items([
    random_circuit(1, 2, measure=True)
    for _ in range(100)
])

In [57]:
ds.show(1)

     ┌───┐┌───┐┌─┐
  q: ┤ T ├┤ Y ├┤M├
     └───┘└───┘└╥┘
c: 1/═══════════╩═
                0 


Now we can repartition it into 5 blocks so that data handlers can process the blocks in parallel.

In [58]:
ds = ds.repartition(5)

Repartition: 100%|██████████| 5/5 [00:00<00:00, 315.78it/s]
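To make the idea of blocks concrete, here is a minimal pure-Python sketch of splitting a list of items into a fixed number of near-equal blocks. It is only an illustration of the concept and is not how Ray implements `repartition`; the helper name `partition` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition(items: Sequence[T], num_blocks: int) -> List[List[T]]:
    """Split items into contiguous blocks whose sizes differ by at most one.

    Hypothetical helper for illustration; not part of Ray or QuantumServerless.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    base, extra = divmod(len(items), num_blocks)
    blocks: List[List[T]] = []
    start = 0
    for i in range(num_blocks):
        # The first `extra` blocks take one additional item.
        size = base + (1 if i < extra else 0)
        blocks.append(list(items[start:start + size]))
        start += size
    return blocks


# 100 stand-in items (integers instead of circuits) split into 5 blocks
blocks = partition(list(range(100)), 5)
block_sizes = [len(block) for block in blocks]
```

Each block can then be handed to a separate worker, which is what makes parallel processing possible.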


We can apply different mapping operations to the data in order to process it.

As an example, let's calculate the depth of each circuit.

In [59]:
def mapping_operation(circuits: List[QuantumCircuit]) -> List[dict]:
    # Receives one batch of circuits and returns one record per circuit
    return [
        {"depth": circuit.depth(), "circuit": circuit}
        for circuit in circuits
    ]

ds = ds.map_batches(mapping_operation)

Map_Batches: 100%|██████████| 5/5 [00:00<00:00, 178.15it/s]
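The batch-wise contract used above (a function takes a list of items and returns a list of records) can be sketched without Ray. The example below is a rough stand-in, assuming each "circuit" is just a list of gate names and its "depth" is the list length; `add_depth` and `map_blocks` are hypothetical names, and a thread pool stands in for Ray's workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Circuit = List[str]  # stand-in: a circuit is a list of gate names
Record = Dict[str, object]


def add_depth(batch: List[Circuit]) -> List[Record]:
    """Batch function: one output record per input item."""
    return [{"depth": len(gates), "circuit": gates} for gates in batch]


def map_blocks(
    blocks: List[List[Circuit]],
    fn: Callable[[List[Circuit]], List[Record]],
    max_workers: int = 4,
) -> List[Record]:
    """Apply fn to every block concurrently and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input blocks
        mapped = list(pool.map(fn, blocks))
    return [row for block in mapped for row in block]


blocks = [[["t", "y"], ["h"]], [["x", "z", "h"]]]
rows = map_blocks(blocks, add_depth)
depths = [row["depth"] for row in rows]
```

Because the function works on whole batches, the per-call overhead is paid once per block rather than once per item.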


In [60]:
ds.show(1)

{'depth': 3, 'circuit': <qiskit.circuit.quantumcircuit.QuantumCircuit object at 0x7f8bda0fb290>}


Now let's split our dataset into chunks and process them in parallel.

In [61]:
split_ds = ds.split(3)
split_ds

[Dataset(num_blocks=2, num_rows=40, schema=<class 'dict'>),
 Dataset(num_blocks=2, num_rows=40, schema=<class 'dict'>),
 Dataset(num_blocks=1, num_rows=20, schema=<class 'dict'>)]
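The uneven row counts in the output above (40, 40, 20) are consistent with whole blocks being assigned to shards: 5 blocks of 20 rows cannot be divided evenly among 3 shards. The sketch below only reproduces that arithmetic. It is an assumption about the behaviour, not Ray's actual splitting algorithm, and `split_blocks` is a hypothetical helper.

```python
from typing import List


def split_blocks(block_sizes: List[int], num_shards: int) -> List[List[int]]:
    """Assign whole blocks to shards, giving earlier shards any extra block.

    Hypothetical helper for illustration; not part of Ray.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(len(block_sizes), num_shards)
    shards: List[List[int]] = []
    start = 0
    for i in range(num_shards):
        count = base + (1 if i < extra else 0)
        shards.append(block_sizes[start:start + count])
        start += count
    return shards


# 5 blocks of 20 rows each, as produced by repartitioning 100 rows into 5 blocks
shards = split_blocks([20] * 5, 3)
blocks_per_shard = [len(shard) for shard in shards]
rows_per_shard = [sum(shard) for shard in shards]
```

If evenly sized shards matter, one option is to choose a block count that is a multiple of the number of shards before splitting.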

Here we can use our `distribute_task` decorator to create a remote function that processes one shard of the dataset:

In [62]:
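@distribute_task()
def sample(shard):
    # `shard` is one of the datasets returned by ds.split(...);
    # the name avoids shadowing the imported `ray.data` module
    sampler = Sampler()
    circuits = [row["circuit"] for row in shard.iter_rows()]
    return sampler.run(circuits).result()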

In [63]:
sample_tasks = [sample(shard) for shard in split_ds]

In [64]:
results = get(sample_tasks)

In [65]:
results

[SamplerResult(quasi_dists=[{0: 0.0, 1: 1.0}, {0: 0.9999999999999998, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 0.7942974242374243, 1: 0.20570257576257575}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 0.05264990006648206, 1: 0.947350099933518}, {0: 0.9999999999999998, 1: 0.0}, {0: 0.0, 1: 1.0}, {0: 0.45480275093681083, 1: 0.5451972490631892}, {0: 0.10050746991952836, 1: 0.8994925300804716}, {0: 0.0, 1: 1.0}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 0.4999999999999999, 1: 0.5000000000000001}, {0: 1.0, 1: 0.0}, {0: 0.0, 1: 1.0}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 0.4999999999999999, 1: 0.4999999999999999}, {0: 0.0, 1: 1.0}, {0: 0.0010733773936409123, 1: 0.9989266226063591}, {0: 1.0, 1: 0.0}, {0: 0.5000000000000001, 1: 0.4999999999999999}, {0: 0.3381703724219888, 1: 0.6618296275780112}, {0: 0.0, 1: 1.0}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}, {0: 0.4999999999999999, 1: 0.4999999999999999}, {0: 0.7612155887733825, 
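Each shard returns its own result object, so the per-shard quasi-distributions usually need to be combined afterwards. The sketch below uses plain dictionaries as stand-ins for the `quasi_dists` lists shown in the output above; the values are made up for illustration, and `flatten` and `most_likely` are hypothetical helpers.

```python
from typing import Dict, List

QuasiDist = Dict[int, float]

# Stand-in data: one list of quasi-distributions per shard (illustrative values)
shard_quasi_dists: List[List[QuasiDist]] = [
    [{0: 0.0, 1: 1.0}, {0: 1.0, 1: 0.0}],
    [{0: 0.25, 1: 0.75}],
]


def flatten(per_shard: List[List[QuasiDist]]) -> List[QuasiDist]:
    """Concatenate per-shard lists into one list, preserving shard order."""
    return [dist for shard in per_shard for dist in shard]


def most_likely(dist: QuasiDist) -> int:
    """Return the outcome with the highest quasi-probability."""
    return max(dist, key=dist.get)


all_dists = flatten(shard_quasi_dists)
outcomes = [most_likely(dist) for dist in all_dists]
```

With the real results, the per-shard lists would come from the `quasi_dists` attribute of each `SamplerResult`.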

-----

For large datasets, we can read data from S3 or from any storage that supports Arrow. For more information, refer to https://docs.ray.io/en/latest/data/dataset.html