# Indirect TSQR


**Input**: Matrix $A \in \mathbb{R}^{m \times n}$ with $m \gg n$.  

The indirect TSQR method avoids explicitly assembling the final $Q$ matrix block by block. The idea is to first compute a stable global $R$ factor and then derive $Q$ implicitly from it.  

---

### 1) First step (local QR factorizations)  
- The matrix $A$ is divided into $p$ row blocks:  

$$
A = \begin{bmatrix} A_1^T & A_2^T & \cdots & A_p^T \end{bmatrix}^T,
\quad A_j \in \mathbb{R}^{m_j \times n}.
$$  

- Each block is factored locally with a reduced QR, producing a small triangular factor $R_j$:  

$$
A_j = Q_j^{(1)} R_j, 
\quad Q_j^{(1)} \in \mathbb{R}^{m_j \times n}, \; R_j \in \mathbb{R}^{n \times n}.
$$  

  
- The $Q_j^{(1)}$ matrices obtained are discarded and only the small $R_j$ are passed along.  
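
As a minimal NumPy sketch of this first step, each row block can be factored independently, keeping only its triangular factor. The sizes `m`, `n`, `p` and the random test matrix are illustrative assumptions, not values taken from the benchmark in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 1000, 4, 5                       # illustrative sizes (assumed)
A = rng.standard_normal((m, n))

# Split A into p row blocks A_1, ..., A_p
blocks = np.array_split(A, p, axis=0)

# Local QR on each block; mode="r" returns only R_j, so Q_j^(1) is never stored
R_locals = [np.linalg.qr(A_j, mode="r") for A_j in blocks]

for R_j in R_locals:
    assert R_j.shape == (n, n)             # each R_j is small: n x n
    assert np.allclose(R_j, np.triu(R_j))  # and upper triangular
```

Only `p` matrices of size $n \times n$ leave this step, regardless of how many rows each block holds.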



### 2) Second step (global QR reduction)  
- The local triangular matrices are stacked vertically:  

$$
R_{\text{stack}} = 
\begin{bmatrix} 
R_1 \\ R_2 \\ \vdots \\ R_p 
\end{bmatrix}
\in \mathbb{R}^{pn \times n}.
$$  

- To obtain the **final global $R$** it is necessary to perform a second QR factorization:  

$$
R_{\text{stack}} = \tilde{Q} \, R, 
\quad \tilde{Q} \in \mathbb{R}^{pn \times n}, \; R \in \mathbb{R}^{n \times n}.
$$  

- Unlike other methods, the $Q_j^{(1)}$ blocks are not explicitly multiplied with pieces of $\tilde{Q}$ to assemble the final $Q$; instead, they are discarded.  
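
The reduction can be checked numerically on a small example. The sketch below, with assumed sizes and a random test matrix, stacks the local factors, runs the second QR, and compares the result with a direct QR of $A$. Since a QR factorization is unique only up to the sign of each row of $R$, the comparison uses absolute values and the Gram identity $R^T R = A^T A$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 1200, 5, 6                       # illustrative sizes (assumed)
A = rng.standard_normal((m, n))

# Step 1: local R_j factors
R_locals = [np.linalg.qr(A_j, mode="r") for A_j in np.array_split(A, p, axis=0)]

# Step 2: stack them and factor once more
R_stack = np.vstack(R_locals)              # shape (p*n, n)
R = np.linalg.qr(R_stack, mode="r")        # final global R, shape (n, n)

# Reference: QR of the full matrix in one shot
R_direct = np.linalg.qr(A, mode="r")

same_up_to_sign = np.allclose(np.abs(R), np.abs(R_direct))
gram_matches = np.allclose(R.T @ R, A.T @ A)
print(same_up_to_sign, gram_matches)
```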



### 3) Recovering $Q$ indirectly  
- By construction, $A = Q R$.  
- Since $R$ is already available, $Q$ can be obtained as:  

$$
Q = A R^{-1}.
$$  

- This avoids explicitly combining the intermediate $Q_j^{(1)}$ and $\tilde{Q}$ matrices.  
- Instead, a final *map* step applies the small matrix $R^{-1}$ (size $n \times n$) to each row block of $A$, yielding the blocks of $Q$ on the fly.  
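
The map step can be sketched in NumPy as follows. The sizes are assumed, and `np.linalg.solve` is used only to keep the example NumPy-only; a triangular solver would exploit the structure of $R$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p = 900, 4, 3                        # illustrative sizes (assumed)
A = rng.standard_normal((m, n))
blocks = np.array_split(A, p, axis=0)

# Steps 1 and 2: global R from the stacked local R_j
R = np.linalg.qr(
    np.vstack([np.linalg.qr(A_j, mode="r") for A_j in blocks]), mode="r"
)

# Small n x n inverse, computed once
R_inv = np.linalg.solve(R, np.eye(n))

# Map step: each block of Q depends only on its own block of A
Q = np.vstack([A_j @ R_inv for A_j in blocks])

residual = np.linalg.norm(A - Q @ R) / np.linalg.norm(A)
orth_error = np.linalg.norm(Q.T @ Q - np.eye(n))
print(residual, orth_error)
```

Because each `A_j @ R_inv` is independent of the others, this step parallelizes trivially over the row blocks.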

---

The optimization idea is that only the small $R_j$ factors are passed between workers, never the large $Q_j^{(1)}$. The tradeoff is that the method needs a second pass over the full $A$ to compute $Q$, which may be costly for very large datasets, but it avoids storing the intermediate $Q_j^{(1)}$.
On the other hand, the two-level QR decomposition computes $R$ in a numerically stable way. The orthogonality of $Q = A R^{-1}$ is not guaranteed, though: it is good for well-conditioned $A$ and degrades as the condition number of $A$ grows.  
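
The orthogonality of the recovered $Q$ can be checked numerically. The following sketch builds test matrices with a prescribed condition number and measures $\|Q^T Q - I\|$ for the indirect method; how much orthogonality is retained depends on the conditioning of $A$. The helper names `indirect_q_orth_error` and `with_condition_number` are hypothetical and the sizes are assumed.

```python
import numpy as np

def indirect_q_orth_error(A, p):
    """Return ||Q^T Q - I|| for Q = A R^{-1}, with R from a two-level QR."""
    blocks = np.array_split(A, p, axis=0)
    R = np.linalg.qr(
        np.vstack([np.linalg.qr(B, mode="r") for B in blocks]), mode="r"
    )
    # Solve R^T Q^T = A^T, i.e. Q = A R^{-1}, without forming R^{-1}
    Q = np.linalg.solve(R.T, A.T).T
    return np.linalg.norm(Q.T @ Q - np.eye(A.shape[1]))

rng = np.random.default_rng(3)
m, n = 2000, 6                             # illustrative sizes (assumed)
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))

def with_condition_number(kappa):
    """Build U diag(s) V^T with singular values from 1 down to 1/kappa."""
    s = np.logspace(0, -np.log10(kappa), n)
    return (U * s) @ V.T

err_well = indirect_q_orth_error(with_condition_number(1e1), p=8)
err_ill = indirect_q_orth_error(with_condition_number(1e8), p=8)
print(f"well-conditioned: {err_well:.2e}, ill-conditioned: {err_ill:.2e}")
```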
  


In [1]:
import dask.array as da
import dask
import numpy as np
import time
from scipy.linalg import solve_triangular 
from sklearn.datasets import fetch_california_housing


In [2]:
# CLUSTER DEPLOYMENT ON CLOUDVENETO
from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    ["10.67.22.154", "10.67.22.216", "10.67.22.116", "10.67.22.113"],
    connect_options={"known_hosts": None},
    remote_python="/home/ubuntu/miniconda3/bin/python",
    scheduler_options={"port": 8786, "dashboard_address": ":8797"},
    worker_options={
        "n_workers": 4,    # Number of worker processes per VM (CloudVeneto's large VM offers a 4-core CPU)
        "nthreads": 1      # N. of threads per process
    }
)

client = Client(cluster)


2025-09-15 17:47:27,131 - distributed.deploy.ssh - INFO - 2025-09-15 17:47:27,130 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2025-09-15 17:47:27,150 - distributed.deploy.ssh - INFO - 2025-09-15 17:47:27,150 - distributed.scheduler - INFO - State start
2025-09-15 17:47:27,154 - distributed.deploy.ssh - INFO - 2025-09-15 17:47:27,153 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.67.22.154:8786
2025-09-15 17:47:29,342 - distributed.deploy.ssh - INFO - 2025-09-15 17:47:29,340 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.116:39853'
2025-09-15 17:47:29,343 - distributed.deploy.ssh - INFO - 2025-09-15 17:47:29,343 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.113:36973'
2025-09-15 17:47:29,347 - distributed.deploy.ssh - INFO - 2025-09-15 17:47:29,345 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.11

In [3]:
# check if everything went smoothly
cluster


Cluster: Dashboard http://10.67.22.154:8797/status, Workers: 12, Total threads: 12, Total memory: 23.25 GiB

Scheduler: Comm tcp://10.67.22.154:8786, Dashboard http://10.67.22.154:8797/status, Workers: 0, Total threads: 0, Total memory: 0 B, Started: 1 minute ago

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://10.67.22.113:33335 | http://10.67.22.113:41099/status | tcp://10.67.22.113:37483 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-07w90ev_` |
| tcp://10.67.22.113:34143 | http://10.67.22.113:36131/status | tcp://10.67.22.113:41959 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-mn95n2tl` |
| tcp://10.67.22.113:38401 | http://10.67.22.113:45263/status | tcp://10.67.22.113:43017 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-1ogcj42u` |
| tcp://10.67.22.113:39003 | http://10.67.22.113:44283/status | tcp://10.67.22.113:36973 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-xmehn759` |
| tcp://10.67.22.116:35879 | http://10.67.22.116:43275/status | tcp://10.67.22.116:40601 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-_3uf6ebv` |
| tcp://10.67.22.116:36681 | http://10.67.22.116:41285/status | tcp://10.67.22.116:39853 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-15oup9e6` |
| tcp://10.67.22.116:44641 | http://10.67.22.116:39447/status | tcp://10.67.22.116:34761 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-tetpwj_e` |
| tcp://10.67.22.116:46879 | http://10.67.22.116:45965/status | tcp://10.67.22.116:44319 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-0qp1tnot` |
| tcp://10.67.22.216:36443 | http://10.67.22.216:46357/status | tcp://10.67.22.216:43749 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-lf1saoo5` |
| tcp://10.67.22.216:39361 | http://10.67.22.216:37973/status | tcp://10.67.22.216:42101 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-lytj5w70` |
| tcp://10.67.22.216:44533 | http://10.67.22.216:35677/status | tcp://10.67.22.216:42409 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-1wvd3g11` |
| tcp://10.67.22.216:46245 | http://10.67.22.216:44333/status | tcp://10.67.22.216:41473 | 1 | 1.94 GiB | `/tmp/dask-scratch-space/worker-e1rl6kvo` |


In [4]:
import dask.dataframe as dd
import os

os.chdir("/home/ubuntu") 
path_HIGGS = os.getcwd() + "/datasets/HIGGS.csv"

df = dd.read_csv(path_HIGGS, header=None, blocksize="50MB")    # Block size chosen according to the previous benchmarking results
X_df = df.iloc[:, 1:] 
X_da = X_df.to_dask_array(lengths=True)   # We want it as a matrix (an array, that is)


In [5]:
X_da = X_da.persist()

X_da

| | Array | Chunk |
|---|---|---|
| Bytes | 2.29 GiB | 14.69 MiB |
| Shape | (11000000, 28) | (68752, 28) |
| Dask graph | 160 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |


## Variants of the Indirect TSQR Method
In the following, different versions of the **Indirect TSQR** algorithm are presented.  
The main difference across these implementations lies in how the global $R$ factor is handled: computed, persisted across workers, or kept as a delayed/Dask object.  

---

• **Serial version of the Indirect TSQR**

This approach showcases the **basic formulation** of the indirect method using only NumPy. It has serious limitations due to the lack of parallelization, as everything runs on a single core. Memory usage scales with the dataset size, making this approach infeasible for very large datasets (e.g. HIGGS) and only suitable for local analysis.

---
• **Parallel / Dask approach (version 1)**

This version introduces Dask for parallelism. Each block of $A$ is processed in parallel using the function:
```python
R_blocks = X_da.map_blocks(compute_R, dtype=X_da.dtype, chunks=(n_cols, n_cols))
```
This is where the parallelization happens: a local QR is computed for each block. However, to compute the inverse $R^{-1}$ we call `.compute()` on the stacked $R_j$ blocks, pulling them back to the driver as a NumPy array, where the final $R$ is obtained. This breaks the laziness of the graph and centralizes $R$ on the driver, which introduces a bottleneck.

---
• **Parallel / Dask approach (version 2)**

This version is equivalent to version 1 but replaces `.compute()` with `.persist()`.
The function `.persist()` keeps the stacked $R_j$ blocks distributed across the workers rather than pulling them to the driver.
This can improve performance, since the scheduler tracks dependencies and ensures that the blocks are reused without recomputation.

Dask provides `da.linalg.qr`, but it assumes that the whole array is large and regularly chunked.
To get the final global $R$, all the $R_j$ must be combined, which means that at some point the data has to come together in a single place: it cannot stay sharded.
Since the stacked $R_j$ are very small, gathering them on the driver is cheap and lets us use `np.linalg.qr` directly.
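
The difference between the two calls can be seen on a small in-memory array. This is a minimal sketch that runs on Dask's default local scheduler rather than on the cluster; the array sizes are assumed, and `compute_R` mirrors the helper used in this notebook.

```python
import numpy as np
import dask.array as da

def compute_R(block):
    # Keep only the triangular factor of the block
    return np.linalg.qr(block, mode="r")

n = 4
rng = np.random.default_rng(4)
# Small in-memory stand-in for X_da (sizes are illustrative)
A = da.from_array(rng.standard_normal((600, n)), chunks=(150, n))

R_blocks = A.map_blocks(compute_R, dtype=A.dtype, chunks=(n, n))

R_persisted = R_blocks.persist()   # still a Dask Array; its chunks are now held in memory
R_computed = R_blocks.compute()    # a NumPy array gathered in the calling process

print(type(R_persisted).__name__, R_persisted.shape)
print(type(R_computed).__name__, R_computed.shape)
```

On a distributed cluster the persisted chunks would stay on the workers, while the computed array is transferred to the client.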

---

• **Parallel / Dask approach (version 3)**

In this case, instead of computing $R$ immediately, it is left as a delayed object (`dask.delayed`), which means that no computation happens until `.compute()` is called at the end.
The inverse is not calculated immediately with NumPy either: this step is also delayed. This avoids pulling $R$ to the driver until the very end and allows Dask to schedule the inversion once $R$ is available in the graph, instead of serializing execution manually.
Finally, the computation of `Q_da` is lazy as well.
This fully delayed version allows the scheduler to optimize the entire pipeline together.

Cons:

- Larger and more complex task graph, which can add scheduling overhead.
- If you only need $R$ (or reuse $R$ multiple times), delaying everything might be inefficient compared to `.persist()`.
- Execution time may fluctuate more, since all steps (QR, stacking, inversion, multiplication) are chained into one big compute.
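
The lazy behaviour of the delayed pipeline can be illustrated with a small sketch. The helper `traced_qr_r` and the `calls` list are hypothetical names added only to count executions, and the synchronous scheduler is chosen here so that everything runs in the calling process.

```python
import numpy as np
import dask

calls = []  # records every actual execution of the QR helper

def traced_qr_r(block):
    calls.append(block.shape)
    return np.linalg.qr(block, mode="r")

rng = np.random.default_rng(5)
blocks = [rng.standard_normal((50, 3)) for _ in range(4)]  # illustrative blocks

# Build the graph: local R_j, stacking, and the final reduction are all delayed
R_parts = [dask.delayed(traced_qr_r)(b) for b in blocks]
R_stack = dask.delayed(np.vstack)(R_parts)
R_lazy = dask.delayed(traced_qr_r)(R_stack)

calls_before = len(calls)                     # nothing has run yet
R = R_lazy.compute(scheduler="synchronous")   # the whole chain runs here
calls_after = len(calls)                      # 4 local QRs + 1 global QR

print(calls_before, calls_after, R.shape)
```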


In [6]:
def indirect_serial(A, n_div):
    """
    Indirect TSQR (serial, NumPy).
    Splits A by rows into n_div blocks, computes local R_i via QR,
    reduces to global R by QR on the stacked R_i, then recovers Q = A R^{-1}.
    Returns (Q, R).
    """

    n_samp, n_cols = A.shape

    div_points = n_samp // n_div    # Rows per block (the last block takes the remainder)

    # Divide the A matrix into n_div row blocks
    A_divided = [A[div_points * i : div_points * (i + 1), :] for i in range(n_div - 1)]
    A_divided.append(A[(n_div - 1) * div_points:, :])   # Last block, in case n_samp is not divisible by n_div

    # Local QR on each block: keep only the R_i factors
    Ri = [np.linalg.qr(Ai, mode="reduced")[1] for Ai in A_divided]
    R_stack = np.concatenate(Ri, axis=0)
    _, R = np.linalg.qr(R_stack, mode="reduced")

    # np.linalg.inv(R) would also work, but a triangular solve exploits the structure of R and is more stable
    I = np.eye(n_cols, dtype=A.dtype)   # Identity must be n x n to match R (not n_samp x n_samp)
    Rinv = solve_triangular(R, I, lower=False)

    Q = A @ Rinv

    return Q, R


def compute_R(block):
    # np.linalg.qr with mode='r' gives just the R matrix
    R = np.linalg.qr(block, mode="r")
    return R


def indirect_parallel(X_da):
    """
    Indirect TSQR with Dask.
    Output:
        R    : final global triangular factor (n x n, NumPy array on driver)
        Q_da : Dask Array (m x n), representing Q = A R^{-1} (lazy)
    """

    n_cols = X_da.shape[1]

    # Parallel mapping of the QR blocks
    R_blocks = X_da.map_blocks(compute_R, dtype=X_da.dtype, chunks=(n_cols, n_cols))
    # Now R_blocks is a stack of n x n matrices (one per partition)
    # Its shape is (#chunks * n, n)

    # Bring all the blocks together to compute
    R_stack = R_blocks.compute()   # NumPy array, shape (p*n, n)

    # Small QR on driver to combine them into the final R
    R = np.linalg.qr(R_stack, mode="r")

    # Instead of materializing Q, compute a small R^{-1} (n x n).
    I = np.eye(n_cols, dtype=X_da.dtype)
    R_inv = solve_triangular(R, I, lower=False)  # stable

    # Broadcast Rinv to every chunk: Q = A @ R^{-1}
    Q_da = X_da @ R_inv   # still a Dask Array, lazy

    return Q_da, R      #Q_da because it is lazy, it is still a Dask array

def indirect_parallel_persisted(X_da):
    """
    Indirect TSQR with Dask.
    Output:
        R    : final global triangular factor (n x n, persisted)
        Q_da : Dask Array (m x n), representing Q = A R^{-1} (lazy)
    """

    n_cols = X_da.shape[1]
    R_blocks = X_da.map_blocks(compute_R, dtype=X_da.dtype, chunks=(n_cols, n_cols))

    # In this case instead of computing, persist the R blocks
    R_stack = R_blocks.persist()  
    _, R = np.linalg.qr(R_stack)

    I = np.eye(n_cols, dtype=X_da.dtype)
    R_inv = solve_triangular(R, I, lower=False)  # stable
    Q_da = X_da @ R_inv  

    return Q_da, R     



def indirect_parallel_delayed(X_da):
    """
    Indirect TSQR with Dask, delayed version.
    Output:
        R    : delayed object
        Q_da : Dask Array (m x n), representing Q = A R^{-1} (lazy)
    """

    n_cols = X_da.shape[1]
    R_blocks = X_da.map_blocks(compute_R, dtype=X_da.dtype, chunks=(n_cols, n_cols))


    # Convert blocks to delayed NumPy arrays, stack via delayed
    R_list = list(R_blocks.to_delayed().ravel())     # each is delayed np.ndarray (n x n)
    R_stack = dask.delayed(np.vstack)(R_list)        # delayed (p*n x n)
    
    R_delayed = dask.delayed(compute_R)(R_stack)      # delayed np.ndarray (n x n)


    I = np.eye(n_cols, dtype=X_da.dtype)
    # compute R^{-1} lazily
    R_inv_delayed = dask.delayed(solve_triangular)(R_delayed, I, lower=False)
    R_inv_da = da.from_delayed(R_inv_delayed, shape=(n_cols, n_cols), dtype=X_da.dtype)

    # Broadcast multiply (keep Q lazy)
    Q_da = X_da @ R_inv_da

    return Q_da, R_delayed      #Q_da dask array, R delayed object


The following is an example of how to call the serial function in a local environment. The second argument is the number of partitions into which the dataset is divided.

In [7]:
%%time

# Small local dataset, loaded as a pandas DataFrame (assumed source of `data`, based on the import above)
data = fetch_california_housing(as_frame=True)

Q, R = indirect_serial(data.data.values, 50)  # Divide into 50 partitions

In [7]:
from dask.distributed import wait

def measure_time(A, tsqr_func, client, timeout=300): 
    """Run one TSQR variant once on A, measuring only the compute stage."""

    import time
    t0 = time.time()
    Q, R = tsqr_func(A)

    # If they are Dask objects → persist + wait
    if hasattr(R, "persist"):
        R = R.persist()
        wait(R, timeout=timeout)
    if hasattr(Q, "persist"):
        Q = Q.persist()
        wait(Q, timeout=timeout)

    t1 = time.time()

    # optional cleanup (after stopping the timer!)
    client.cancel([Q, R])

    return t1 - t0

In [11]:
# not-delayed version
t = measure_time(X_da, indirect_parallel, client)

print(t, "(s)")



# Dask dashboard notes:
# - Big green block: compute_R, the map stage of map_blocks that returns the small R_i.
# - Tiny yellow finalize-hlg block: Dask housekeeping.
# - No "reduce" or "broadcast" stage is visible here because the reduction
#   to the final R is done on the driver with NumPy.
# - Teal blocks around ~140–150 ms: blockwise-matmul-… tasks. In the not-delayed variant,
#   R^{-1} is a NumPy constant, so each matmul task deserializes it; you may notice slightly
#   more per-task overhead compared to the delayed/Dask-array constant.
# - Purple block: reduction for the norm. After the matmul, da.linalg.norm(Qv) triggers a reduction.


1.8805696964263916 (s)

In [9]:
# not-delayed persisted version
t = measure_time(X_da, indirect_parallel_persisted, client)

print(t, "(s)")


# Better, since R is only calculated once, but at a cost. The subtlety:
# - persist: compute and cache now, but still return a Dask collection (backed by futures).
# - compute: compute now and return the final NumPy array (collected to the driver).
# So persisting inside the function and returning R returns a Dask object backed by futures,
# not a NumPy matrix, while computing inside the function returns a NumPy array, which is
# often what you want for the small triangular R.


1.3808841705322266 (s)

In [10]:

# fully-delayed version
t = measure_time(X_da, indirect_parallel_delayed, client)

print(t, "(s)")

# Efficiency implications:
# - Computation of R: essentially unchanged, still dominated by the map stage (compute_R).
# - Broadcast of R^{-1}: somewhat less efficient as a Dask Array, since it adds extra
#   bookkeeping without reducing the numerical cost.
# - Norm benchmark (Q @ v): the visible red/yellow stages are expected, because
#   da.from_delayed introduces those extraction tasks; they confirm that the graph
#   carries the computation fully through Dask.
# - Broadcast multiply still shows up as teal + red.
# - The dashboard is "busier" (more small tasks, more scheduler chatter) because
#   everything, even tiny constants, was lifted into the Dask graph.
#
# Efficiency interpretation:
# - For small n: making R a Dask array (fully delayed) adds overhead without real benefit;
#   the norm compute shows more yellow/red fragmentation than the optimized/persisted version.
# - For large distributed runs: it is still correct, but NumPy constants (or scattered
#   small arrays) are cheaper to handle than wrapping them in Dask.
# This is consistent with the timings, where the fully delayed version was not consistently faster.


13.229716300964355 (s)

In [12]:
N_WORKERS = 3
# Initialization of a distributed random matrix
m, n = int(1e7), 4
chunks = [m // N_WORKERS for _ in range(N_WORKERS-1)]
chunks.append(m - sum(chunks))
A = da.random.random((m, n), chunks=(chunks, n))

# Persist in memory to avoid recomputation
A = A.persist() 

print(f"Input matrix A: m = {A.shape[0]}, n = {A.shape[1]}")
print(f"The {len(A.chunks[0])} blocks are: {A.chunks[0]}")
print(f"Total size of A: {A.nbytes / 1e6} MB")
A


Input matrix A: m = 10000000, n = 4
The 3 blocks are: (3333333, 3333333, 3333334)
Total size of A: 320.0 MB


| | Array | Chunk |
|---|---|---|
| Bytes | 305.18 MiB | 101.73 MiB |
| Shape | (10000000, 4) | (3333334, 4) |
| Dask graph | 3 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |


In [15]:
# QR decomposition


print("\n-- Computed (QR) --")
print( measure_time(X_da, indirect_parallel, client))
print("\n-- Persisted (QR) --")
print( measure_time(X_da, indirect_parallel_persisted, client))
print("\n-- Delayed (QR) --")
print( measure_time(X_da, indirect_parallel_delayed, client))



-- Computed (QR) --
1.8783607482910156

-- Persisted (QR) --
1.9411554336547852

-- Delayed (QR) --
11.838286876678467


In [19]:
client.close()
cluster.close()