## Cholesky QR decomposition of a Tall and Skinny matrix

N.B. This is not the direct TSQR method proposed in the article. The Cholesky-based approach is less accurate and numerically unstable, which makes it impractical for high-performance environments. Nevertheless, we found it interesting to implement it as an exercise, exploring the Dask implementation and benchmarking it despite its inherent numerical instability.

Let $B$ be a symmetric and positive definite $n\times n$ matrix. Then its Cholesky decomposition is:
$$
B = L L^T 
$$
where $L$ is a lower triangular $n\times n$ matrix. The Cholesky decomposition turns out to be particularly useful when computing the QR decomposition of an $m\times n$ matrix $A$ with $m \gg n$. Let's first build the temporary matrix $T = A^T A$, which is symmetric by construction and positive definite whenever $A$ has full column rank. Hence, if:
$$
A = QR
$$
then:
$$
T = A^T A = (QR)^T(QR) = R^T Q^T Q R
$$
The matrix $Q$ has orthonormal columns, so $Q^T Q = I$ and $T = R^T R$. Since $R$ is upper triangular, $R^T$ is lower triangular, so we have effectively found the Cholesky decomposition of $T$ with $L = R^T$. In other words, the Cholesky QR decomposition proceeds as follows:
1) Given $A$, build the symmetric matrix $T = A^T A$
2) Compute the Cholesky decomposition $T = L L^T$
3) Obtain $R = L^T$, then obtain $Q$ by solving $A = QR$, i.e. $Q = A R^{-1}$

This procedure is numerically unstable, and for certain matrices $A$ it may fail to produce accurate results, particularly for the orthogonal matrix $Q$. However, it is easily parallelizable, making it a useful testbed for experimenting with Dask.
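
The three steps can be sketched in plain NumPy. This is only a minimal illustration on a synthetic, well-conditioned matrix: the random test matrix and the helper name `cholesky_qr` are assumptions and are not part of the notebook. Note that `np.linalg.cholesky` returns the lower-triangular factor $L$, so $R$ is obtained as its transpose.

```python
import numpy as np

def cholesky_qr(A):
    """Thin QR of a tall matrix A via the Cholesky factor of A^T A (illustrative sketch)."""
    T = A.T @ A                      # step 1: n x n Gram matrix
    L = np.linalg.cholesky(T)        # step 2: T = L @ L.T, with L lower triangular
    R = L.T                          # step 3: R is the upper-triangular factor
    # Q = A R^{-1}: solve R^T Q^T = A^T instead of forming the inverse explicitly
    Q = np.linalg.solve(R.T, A.T).T
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))   # synthetic, well-conditioned test matrix (assumption)
Q, R = cholesky_qr(A)

orth_err = np.linalg.norm(Q.T @ Q - np.eye(A.shape[1]), 2)
rec_err = np.linalg.norm(A - Q @ R, 2)
print(f"||Q^T Q - I||_2 = {orth_err:.2e}, ||A - QR||_2 = {rec_err:.2e}")
```

On a well-conditioned input like this one both metrics should be close to machine precision; the method degrades as the condition number of $A$ grows.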


In [1]:
# CLUSTER DEPLOYMENT, TO BE EXECUTED ONLY IN A LOCAL ENVIRONMENT!!
from dask.distributed import Client, LocalCluster

# For now, local deployment on my computer (multicore)
ncore = 4
cluster = LocalCluster(n_workers=ncore, threads_per_worker=1)
client = Client(cluster)

# Print the dashboard link over the port 8787
print(client.dashboard_link)

http://127.0.0.1:8787/status


In [65]:
# CLUSTER DEPLOYMENT ON CLOUDVENETO
from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    ["10.67.22.154", "10.67.22.216", "10.67.22.116", "10.67.22.113"],
    connect_options={"known_hosts": None},
    remote_python="/home/ubuntu/miniconda3/bin/python",
    scheduler_options={"port": 8786, "dashboard_address": ":8797"},
    worker_options={
        "nprocs": 1,     
        "nthreads": 1  
    }
)

client = Client(cluster)

2025-09-09 07:40:01,974 - distributed.deploy.ssh - INFO - 2025-09-09 07:40:01,974 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2025-09-09 07:40:01,994 - distributed.deploy.ssh - INFO - 2025-09-09 07:40:01,994 - distributed.scheduler - INFO - State start
2025-09-09 07:40:01,995 - distributed.deploy.ssh - INFO - 2025-09-09 07:40:01,995 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/scheduler-fgzafylc', purging
2025-09-09 07:40:01,998 - distributed.deploy.ssh - INFO - 2025-09-09 07:40:01,998 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.67.22.154:8786
2025-09-09 07:40:03,289 - distributed.deploy.ssh - INFO - 2025-09-09 07:40:03,288 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.116:44889'
2025-09-09 07:40:03,292 - distributed.deploy.ssh - INFO - 2025-09-09 07:40:03,283 - distributed.nanny - INFO -

In [66]:
# check if everything went smoothly
print(client)

<Client: 'tcp://10.67.22.154:8786' processes=3 threads=3, memory=5.81 GiB>


Import the required libraries and load the California housing dataset.

In [67]:
import dask.array as da
import dask
import numpy as np
from sklearn.datasets import fetch_california_housing

# Download California Housing dataset
data = fetch_california_housing(as_frame=True)

# Convert features into Dask Array (it's a matrix).
n_partition = 3        # number of partitions. We have 4 VMs (1 master + 3 workers), so let's start with just 3 partitions
length_partition = data.data.shape[0] // n_partition
X_da = da.from_array(data.data.values, chunks=(length_partition, data.data.shape[1]))

print("Number of Dask partitions:",  X_da.npartitions) 
print("Length of each partition:", length_partition, "rows")
print("Length of the whole dataset:", data.data.shape[0], "rows")

Number of Dask partitions: 3
Length of each partition: 6880 rows
Length of the whole dataset: 20640 rows


Now we'll define the parallel and serial algorithms for the Cholesky QR decomposition.

This first parallel version of the Cholesky method works as follows:
1) The array should already be split by rows into partitions across the workers (let's call each partition $A_p$). Each worker computes a local Gram matrix $A_p^T A_p$. Since $A_p$ is smaller than $A$, the matrix multiplication is faster, and $A_p$ is more likely to fit entirely in a worker's RAM
2) Once each worker has finished, the full Gram matrix $A^T A$ is computed on a single worker by summing all the local Gram matrices: $A^T A = \sum_p A_p^T A_p$
3) The matrix $A^T A$ is small ($n\times n$), so a serial Cholesky decomposition is performed on it and yields the final $R$ matrix
4) To get $Q$, we use the defining equation $A = QR \Rightarrow Q = A R^{-1}$. Inverting the small matrix $R$ is cheap and can be done by a single worker, whereas the matrix multiplication between $A$ and $R^{-1}$ is parallelized block by block
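
Before moving to Dask, the four steps can be checked with NumPy alone, using a list of row blocks as a stand-in for the partitions held by the workers. The synthetic matrix and the block count are assumptions made for illustration, and $R$ is taken as the transposed Cholesky factor so that it is upper triangular.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((12_000, 8))          # synthetic stand-in for the dataset (assumption)
blocks = np.array_split(A, 3, axis=0)         # row partitions A_p, one per "worker"

# Steps 1-2: local Gram matrices, then their sum
G = sum(A_p.T @ A_p for A_p in blocks)

# Step 3: serial Cholesky on the small n x n matrix; R is the transposed lower factor
R = np.linalg.cholesky(G).T

# Step 4: each block is multiplied by R^{-1} independently of the others
R_inv = np.linalg.inv(R)
Q = np.vstack([A_p @ R_inv for A_p in blocks])

gram_rel_err = np.linalg.norm(G - A.T @ A) / np.linalg.norm(A.T @ A)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(A.shape[1]), 2)
print(f"relative Gram error = {gram_rel_err:.1e}, ||Q^T Q - I||_2 = {orth_err:.1e}")
```

Only the two list comprehensions touch the tall data, and they act on one block at a time, which is what makes steps 1 and 4 easy to distribute.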

In [81]:
def compute_choleskyQR_parallel(X_da : dask.array.Array):
    # X_da.persist()
    # A list of delayed tasks for each partition of the dataset
    # Each partition computes the local Gram matrix (as a delayed task)
    chunks_delayed = [dask.delayed(lambda x : x.T @ x)(chunk) for chunk in X_da.to_delayed().ravel()]

    # Now sum all the local Gram matrices to get the global Gram matrix
    # NOTE: this sum runs as a single task on one worker rather than as a tree reduction.
    # That is acceptable here, since only a handful of small n x n matrices have to be summed
    Gram_global_delayed = dask.delayed(sum)(chunks_delayed)

    # Cholesky factor of the global Gram matrix (delayed even though it is serial; .compute() is called right below)
    # NOTE: np.linalg.cholesky returns the LOWER-triangular factor L with G = L @ L.T,
    # so the "R" used here is L, not the upper-triangular R = L.T of the derivation above
    R = dask.delayed(np.linalg.cholesky)(Gram_global_delayed)
    #R.visualize("fig/CholeskyR.png")
    R = R.compute() # Compute R. This blocks until R is available on the client
    R_inv = np.linalg.inv(R) # It's a small matrix, so this operation is fast even if serial

    Q = X_da.map_blocks(lambda block: block @ R_inv, dtype=X_da.dtype)
    #Q.visualize("fig/CholeskyQ.png")
    Q = Q.compute() # Compute Q
    return Q, R

def compute_choleskyQR_serial(X):
    # Global Gram matrix
    G = X.T @ X
    # NOTE: np.linalg.cholesky returns the lower-triangular L (G = L @ L.T), used here as "R"
    R = np.linalg.cholesky(G)
    R_inv = np.linalg.inv(R)
    Q = X @ R_inv
    
    return Q, R

def compute_choleskyR_parallel(X_da : dask.array.Array):
    # A list of delayed tasks for each partition of the dataset
    # Each partition computes the local Gram matrix
    chunks_delayed = [dask.delayed(lambda x : x.T @ x)(chunk) for chunk in X_da.to_delayed().ravel()]
    # Now sum all the local Gram matrices to get the global Gram matrix
    Gram_global_delayed = dask.delayed(sum)(chunks_delayed)
    # Compute R as the Cholesky decomposition on the global Gram matrix (as a delayed even if a serial operation just call .compute at the end)
    R = dask.delayed(np.linalg.cholesky)(Gram_global_delayed)
    R = R.compute() # Compute R
    return  R

def compute_choleskyR_serial(X):
    # Global gram matrix
    G = X.T @ X
    R = np.linalg.cholesky(G)
    return R

For the computation of $R$, the task graph (DAG) should look like this:


![](fig/CholeskyR.png)

Let's measure the time it takes to perform the parallel Cholesky QR decomposition:

In [79]:
%%time
# parallel
Q_p, R_p = compute_choleskyQR_parallel(X_da)

CPU times: user 8.67 ms, sys: 0 ns, total: 8.67 ms
Wall time: 66.5 ms


As of now, we have 3 worker VMs and we specifically asked Dask to create only one worker per node, so 3 workers are deployed. The dashboard shows what happens under the hood:

![](fig/CloudVeneto_Cal_3workers.png)

Since we have three workers, we can see three horizontal segments, each corresponding to a worker. The first three bands (greenish) are labeled _array_ by Dask and are related to array access and reading. This is because the dataset is stored in the scheduler VM.
When workers need to access this data, the scheduler sends it over the network. If we had run _X_da.persist()_ prior to the function call, all the data would already have been stored in the workers' memory, and no additional time or transfer would have been required.

The following parallel blocks (three, as expected) correspond to the lambda function, i.e., the local MatMul.

The red block (followed by the yellow one) is executed on a single worker, as requested. These blocks represent the serial sum: a single worker collects all the temporary Gram matrices (red block) and performs the sum operation (yellow block).

All the gaps between the colored bands represent Dask overhead (orchestration, scheduling, etc.), which in this specific case consumes a significant share of the total time. In other words, the workers were idle most of the time, and the transfers (red blocks) are long and slow.

Indeed, running the same algorithm serially:

In [80]:
%%time
# serial
Q_s, R_s = compute_choleskyQR_serial(data.data.values)

CPU times: user 1.04 ms, sys: 0 ns, total: 1.04 ms
Wall time: 1.06 ms


The serial implementation is much faster than the parallel one. This was, unfortunately, expected for several reasons:

1) The dataset is relatively small (only $20k$ rows). It easily fits in the master's RAM, so there is really no need to create partitions and transfer them over the network. Numpy, which also uses multithreading internally, will certainly be faster in this case.
2) The algorithm still has some limitations, mainly the bottleneck caused by the serial part: only one worker is responsible for summing all the local matrices.
3) We used $n\_partitions = n\_workers$, which might not be the optimal configuration. Creating more partitions means that individual parallel operations are faster (because each data block is smaller), but it also means that a single worker may have to process multiple partitions, generating additional overhead.
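
When experimenting with other partition counts, note that the chunk length above is computed with a floor division. Given a fixed chunk length, an axis is split into full chunks plus one smaller chunk for the leftover rows, so the number of partitions can end up one larger than requested. The pure-Python sketch below illustrates this; the helper names `floor_chunks` and `ceil_chunks` are hypothetical and only mimic that splitting rule.

```python
def split_rows(n_rows, length):
    """Full chunks of `length` rows, plus one smaller chunk for any leftover rows."""
    sizes = [length] * (n_rows // length)
    if n_rows % length:
        sizes.append(n_rows % length)
    return sizes

def floor_chunks(n_rows, n_partition):
    """Chunk sizes when the chunk length is n_rows // n_partition (as in the cell above)."""
    return split_rows(n_rows, n_rows // n_partition)

def ceil_chunks(n_rows, n_partition):
    """Chunk sizes with a ceiling division, which never yields more than n_partition chunks."""
    return split_rows(n_rows, -(-n_rows // n_partition))

print(floor_chunks(20640, 3))       # [6880, 6880, 6880]
print(len(floor_chunks(20640, 7)))  # 8: seven chunks of 2948 rows plus one of 4 rows
print(len(ceil_chunks(20640, 7)))   # 7
```

With 3 partitions the split is exact, which is why the issue does not show up in the run above.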

Furthermore, Cholesky QR is sadly known to be unstable. In fact:

In [82]:
# Let's see whether the results are compatible
diffR = np.linalg.norm(R_p - R_s, 2)
diffQ = np.linalg.norm(Q_p - Q_s, 2)
print(f"||R_parallel - R_serial||_2 = {diffR}")
print(f"||Q_parallel - Q_serial||_2 = {diffQ}")

# Check orthogonality of Q
orthogonality_metric = np.linalg.norm(Q_s.T @ Q_s - np.eye(Q_s.shape[1]), 2)
print(f"||Q^T @ Q- I||_2 = {orthogonality_metric}")
# Check decomposition
decomp_metric = np.linalg.norm(data.data.values - Q_s @ R_s, 2)
print(f"||X - Q @ R||_2 = {decomp_metric}")

||R_parallel - R_serial||_2 = 1.8265817285310699e-09
||Q_parallel - Q_serial||_2 = 1.0852657367110989e-10
||Q^T @ Q- I||_2 = 7971678.680289975
||X - Q @ R||_2 = 8.723161201902093e-10


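The product $QR$ reproduces $X$ accurately, but $Q$ is far from orthogonal. This large value is not only a symptom of the instability of the method: `np.linalg.cholesky` returns the lower-triangular factor $L$, and the functions above use it directly as $R$. As a result $Q = X L^{-1}$ and $Q^T Q = L^{-T} L L^T L^{-1}$, which is not the identity in general, while $QR = X L^{-1} L = X$ holds trivially. The transposed factor $R = L^T$ has to be used before the orthogonality metric can say anything about the stability of Cholesky QR.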
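
To see the instability that is intrinsic to the method, the sketch below runs Cholesky QR with the upper-triangular factor $R = L^T$ on synthetic matrices of increasing condition number and compares the loss of orthogonality with `np.linalg.qr`. The test matrices, their sizes and the helper names are assumptions chosen for illustration. The error of Cholesky QR is expected to grow roughly like the squared condition number of $A$ times machine precision, because the condition number of the Gram matrix $A^T A$ is the square of that of $A$.

```python
import numpy as np

def matrix_with_condition(m, n, cond, rng):
    """Random m x n matrix with singular values log-spaced between 1 and 1/cond."""
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.logspace(0, -np.log10(cond), n)
    return (U * s) @ V.T

def orthogonality_error(Q):
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]), 2)

rng = np.random.default_rng(2)
errors = {}
for cond in (1e1, 1e3, 1e6):
    A = matrix_with_condition(2000, 6, cond, rng)
    R = np.linalg.cholesky(A.T @ A).T          # upper-triangular factor
    Q_chol = A @ np.linalg.inv(R)              # Cholesky QR
    Q_ref, _ = np.linalg.qr(A)                 # LAPACK-based QR, used as a reference
    errors[cond] = (orthogonality_error(Q_chol), orthogonality_error(Q_ref))
    print(f"cond={cond:.0e}  CholeskyQR: {errors[cond][0]:.1e}  np.linalg.qr: {errors[cond][1]:.1e}")
```

For much larger condition numbers the Gram matrix can become numerically indefinite, in which case `np.linalg.cholesky` raises a `LinAlgError` instead of returning an inaccurate factor.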

## A larger dataset

Let's try a different, larger dataset (the HIGGS dataset).

In [123]:
# create again a cluster
# CLUSTER DEPLOYMENT ON CLOUDVENETO
client.close()   
cluster.close()

cluster = SSHCluster(
    ["10.67.22.154", "10.67.22.216", "10.67.22.116", "10.67.22.113"],
    connect_options={"known_hosts": None},
    remote_python="/home/ubuntu/miniconda3/bin/python",
    scheduler_options={"port": 8786, "dashboard_address": ":8797"},
    worker_options={
        "nprocs": 4,     
        "nthreads": 1  
    }
)

client = Client(cluster)

2025-09-09 08:49:44,174 - distributed.deploy.ssh - INFO - 2025-09-09 08:49:44,173 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2025-09-09 08:49:44,193 - distributed.deploy.ssh - INFO - 2025-09-09 08:49:44,193 - distributed.scheduler - INFO - State start
2025-09-09 08:49:44,196 - distributed.deploy.ssh - INFO - 2025-09-09 08:49:44,196 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.67.22.154:8786
2025-09-09 08:49:45,702 - distributed.deploy.ssh - INFO - 2025-09-09 08:49:45,704 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.113:33701'
2025-09-09 08:49:45,705 - distributed.deploy.ssh - INFO - 2025-09-09 08:49:45,707 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.113:40119'
2025-09-09 08:49:45,707 - distributed.deploy.ssh - INFO - 2025-09-09 08:49:45,709 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.67.22.11

In [124]:
print(client)

<Client: 'tcp://10.67.22.154:8786' processes=12 threads=12, memory=23.25 GiB>


Now we have 12 workers (4 workers on each VM, excluding the scheduler/master).

In [125]:
import dask.dataframe as dd
import os

os.chdir("/home/ubuntu") 
path_HIGGS = os.getcwd() + "/datasets/HIGGS.csv"
# A huge dataset
df = dd.read_csv(path_HIGGS, header=None, blocksize="200MB")
X_df = df.iloc[:, 1:] 
X_da = X_df.to_dask_array(lengths=True)

In [126]:
#Let's print it
X_da

| | Array | Chunk |
|---|---|---|
| Bytes | 2.29 GiB | 58.75 MiB |
| Shape | (11000000, 28) | (275002, 28) |
| Dask graph | 40 chunks in 1 graph layer | 40 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


So far nothing has been computed, since Dask evaluates lazily. Let's load the partitions into the workers' memory:

In [127]:
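X_da = X_da.persist()  # persist() returns a new collection; rebind it so later cells use the in-memory chunks
X_da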

| | Array | Chunk |
|---|---|---|
| Bytes | 2.29 GiB | 58.75 MiB |
| Shape | (11000000, 28) | (275002, 28) |
| Dask graph | 40 chunks in 1 graph layer | 40 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


The dataset now resides in the workers' memory. Having a look at the dashboard:


[Image placeholder: dashboard screenshot]


This means that the dataset was loaded partly into the workers' RAM and partly spilled to their disks.

Let's now run our algorithm. This time we do not collect $Q$ on the client: $Q = A R^{-1}$ has the same shape as $A$, i.e. $m \times n$ with $m = 11$ million and $n = 28$, so with 8-byte doubles it occupies about $2.3$ GiB, as much as the input itself. Transferring it to the client would be slow and would strain its RAM. Hence, we modify our function so that $Q$ is persisted on the workers instead of being sent to the client, while the small $n \times n$ matrix $R$ is still computed and returned.
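
As a quick sanity check on the memory footprint, the size of $Q$ can be estimated from the array shape reported above. The factor $Q = A R^{-1}$ computed here has the same shape as $A$; the square $m \times m$ factor of a full QR decomposition is included only for comparison, since it is never formed in this notebook.

```python
m, n = 11_000_000, 28          # shape of X_da reported above
bytes_per_double = 8

q_bytes = m * n * bytes_per_double          # thin Q: same shape as A
q_gib = q_bytes / 2**30
full_q_bytes = m * m * bytes_per_double     # square m x m factor, for comparison only
full_q_tib = full_q_bytes / 2**40

print(f"thin Q: {q_gib:.2f} GiB, square Q: {full_q_tib:.0f} TiB")
```

The thin factor matches the 2.29 GiB reported for `X_da`, which is why it is kept distributed across the workers rather than gathered in a single process.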

In [174]:
def compute_choleskyQR_parallel(X_da : dask.array.Array):
    
    def gramMatMul(x): #Declaring it this way will make the name appear in the Dask dashboard
        return x.T @ x
    def MatMul(x): 
        return x @ R_inv
        
    # A list of delayed tasks for each partition of the dataset. Each partition computes the local Gram matrix (as a delayed task)
    chunks_delayed = [dask.delayed(gramMatMul)(chunk) for chunk in X_da.to_delayed().ravel()]
    # Now sum all the local Gram matrices to get the global Gram matrix
    Gram_global_delayed = dask.delayed(sum)(chunks_delayed)   ## !! This is not parallel
    # Compute R as the Cholesky decomposition on the global Gram matrix (as a delayed even if a serial operation just call .compute at the end)
    R = dask.delayed(np.linalg.cholesky)(Gram_global_delayed)
    #R.visualize("fig/CholeskyR.png")
    R = R.compute() # Compute R. This will put a stop at the parallel operation
    R_inv = np.linalg.inv(R) # It's a small matrix, so this operation is fast even if serial

    
    X_da = X_da.persist()    # Persist again X_da, since X_da.to_delayed seems to cause troubles
    Q = X_da.map_blocks(MatMul, dtype=X_da.dtype)
    #Q.visualize("fig/CholeskyQ.png")
    Q = Q.persist() # Persist Q on the workers; nothing is collected on the client
    return Q, R
    

In [192]:
%%time
Q, R = compute_choleskyQR_parallel(X_da)

CPU times: user 17.4 ms, sys: 3.12 ms, total: 20.6 ms
Wall time: 238 ms


TODO: paste the dashboard screenshot here and explain the various blocks.

Note that here, with a larger dataset, the issue of the parallel sum and of data collection matters little.

## Tree reduction

TODO: discuss the parallel sum here, i.e. the tree reduction described in the article.
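
The idea of a tree reduction can be illustrated without Dask: instead of adding the local Gram matrices one after another, they are combined in pairs, level by level, so that the additions within a level are independent of each other. The sketch below is a serial NumPy illustration under assumed synthetic data; the helper name `tree_reduce` is hypothetical.

```python
import numpy as np

def tree_reduce(items, combine):
    """Pairwise (binary-tree) reduction; returns the result and the number of levels."""
    level, depth = list(items), 0
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an unpaired element is carried to the next level
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

rng = np.random.default_rng(3)
blocks = [rng.standard_normal((500, 4)) for _ in range(40)]   # 40 chunks, as for HIGGS above
grams = [b.T @ b for b in blocks]

G_tree, depth = tree_reduce(grams, lambda a, b: a + b)
G_serial = sum(grams)
print(depth)   # 6 levels for 40 chunks, versus 39 sequential additions
```

The total number of additions is unchanged; what shrinks is the length of the dependency chain, which is what allows a scheduler to spread the partial sums over several workers.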

In [255]:
def compute_choleskyQR_parallel_tree(X_da : dask.array.Array):
    
    def gramMatMul(x): #Declaring it this way will make the name appear in the Dask dashboard
        return x.T @ x
    def MatMul(x, R_inv): 
        return x @ R_inv
    def PartialSum(a,b):
        return a+b
    def Inverse(R):
        return np.linalg.inv(R)
        
    # A list of delayed tasks for each partition of the dataset. Each partition computes the local Gram matrix (as a delayed task)
    chunks_delayed = [dask.delayed(gramMatMul)(chunk) for chunk in X_da.to_delayed().ravel()]
    while len(chunks_delayed) > 1:
        new_level = []
        for i in range(0, len(chunks_delayed), 2):
            if i + 1 < len(chunks_delayed):
                new_level.append(dask.delayed(PartialSum)(chunks_delayed[i], chunks_delayed[i+1]))
            else:
                new_level.append(chunks_delayed[i])
        chunks_delayed = new_level

    Gram_global_delayed = chunks_delayed[0]
    # Compute R as the Cholesky decomposition on the global Gram matrix (as a delayed even if a serial operation just call .compute at the end)
    R = dask.delayed(np.linalg.cholesky)(Gram_global_delayed)
    #R.visualize("fig/CholeskyR.png")
    R = R.persist()
    #R = R.compute() # Compute R. This will put a stop at the parallel operation
    R_inv = dask.delayed(Inverse)(R) # It's a small matrix, so this operation is fast even if serial
    
    X_da = X_da.persist()    # Persist again X_da, since X_da.to_delayed seems to cause troubles
    Q = X_da.map_blocks(MatMul,R_inv, dtype=X_da.dtype)
    #Q.visualize("fig/CholeskyQ.png")
    Q = Q.persist() # Persist Q on the workers; nothing is collected on the client
    return Q, R

In [259]:
%%time
client.cancel(Q)
client.cancel(R)
Q, R = compute_choleskyQR_parallel_tree(X_da)

CPU times: user 16.3 ms, sys: 0 ns, total: 16.3 ms
Wall time: 16.1 ms


In [253]:
%%time
client.cancel(Q)
client.cancel(R)
Q, R = compute_choleskyQR_parallel(X_da)

CPU times: user 21.3 ms, sys: 253 μs, total: 21.5 ms
Wall time: 168 ms


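TODO: comment on the final version, improved with the tree reduction. Note that the two timings above are not directly comparable: the tree version only calls `persist()`, which returns as soon as the graph has been submitted, so its 16.1 ms wall time measures task submission rather than the computation itself, whereas the previous version blocks on `R.compute()`. A fair comparison would wait for the results inside the timed cell, for example with `dask.distributed.wait`.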

## Final function

In [261]:
def compute_choleskyQR_parallel_optimal(X_da : dask.array.Array):
    def gramMatMul(x): #Declaring it this way will make the name appear in the Dask dashboard
        return x.T @ x
    def MatMul(x, R_inv): 
        return x @ R_inv
    def PartialSum(a,b):
        return a+b
    def Inverse(R):
        return np.linalg.inv(R)
        
    # A list of delayed tasks for each partition of the dataset. Each partition computes the local Gram matrix (as a delayed task)
    chunks_delayed = [dask.delayed(gramMatMul)(chunk) for chunk in X_da.to_delayed().ravel()]
    while len(chunks_delayed) > 1:
        new_level = []
        for i in range(0, len(chunks_delayed), 2):
            if i + 1 < len(chunks_delayed):
                new_level.append(dask.delayed(PartialSum)(chunks_delayed[i], chunks_delayed[i+1]))
            else:
                new_level.append(chunks_delayed[i])
        chunks_delayed = new_level

    Gram_global_delayed = chunks_delayed[0]
    # Compute R as the Cholesky decomposition on the global Gram matrix (as a delayed even if a serial operation just call .compute at the end)
    R = dask.delayed(np.linalg.cholesky)(Gram_global_delayed)
    #R.visualize("fig/CholeskyR.png")
    R = R.persist()
    #R = R.compute() # Compute R. This will put a stop at the parallel operation
    R_inv = dask.delayed(Inverse)(R) # It's a small matrix, so this operation is fast even if serial
    
    X_da = X_da.persist()    # Persist again X_da, since X_da.to_delayed seems to cause troubles
    Q = X_da.map_blocks(MatMul,R_inv, dtype=X_da.dtype)
    #Q.visualize("fig/CholeskyQ.png")
    Q = Q.persist() # Persist Q on the workers; nothing is collected on the client
    return Q, R