# LoRA Subspace Similarity Measure for Merging

First, let's run some code to open two `.safetensors` files storing two LoRA models named `lora1` and `lora2`, and print the keys in their dictionaries.

In [1]:
import torch
from safetensors import safe_open
from scipy.linalg import svd
import numpy as np

# Load the LoRA tensors from .safetensors files
with safe_open("lora1.safetensors", framework="pt", device="cpu") as f:
    lora1_tensors = {}
    for k in f.keys():
        lora1_tensors[k] = f.get_tensor(k)

# Print the available keys
print("Keys in lora1_tensors:")
for key in lora1_tensors.keys():
    print(key)

Keys in lora1_tensors:
lora_te_text_model_encoder_layers_0_mlp_fc1.alpha
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight
lora_te_text_model_encoder_layers_0_mlp_fc2.alpha
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_down.weight
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_k_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_out_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_out_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_out_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_q_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_q_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_q_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self

In [2]:
with safe_open("lora2.safetensors", framework="pt", device="cpu") as f:
    lora2_tensors = {}
    for k in f.keys():
        lora2_tensors[k] = f.get_tensor(k)

print("\nKeys in lora2_tensors:")
for key in lora2_tensors.keys():
    print(key)


Keys in lora2_tensors:
lora_te_text_model_encoder_layers_0_mlp_fc1.alpha
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight
lora_te_text_model_encoder_layers_0_mlp_fc2.alpha
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_down.weight
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_k_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_out_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_out_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_out_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_q_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_q_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_q_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_sel

In [3]:
import numpy as np
from scipy.linalg import svd

# Given matrices B1, A1, B2, A2, where each A has shape (rank, in_features)
# B1, A1 = np.random.rand(64, 8), np.random.rand(8, 64)  # for example
# B2, A2 = np.random.rand(64, 64), np.random.rand(64, 64)  # for example

def compute_subspace_similarity(B1, A1, B2, A2, i, j):
    # B1 and B2 are not used here; they are kept so existing calls still work.
    # scipy's svd returns (U, s, Vh); the ROWS of Vh are the right singular vectors
    _, _, Vh_A1 = svd(A1, full_matrices=False)
    _, _, Vh_A2 = svd(A2, full_matrices=False)

    # Take the first i and j right singular vectors, stored as columns
    U_A1_i = Vh_A1[:i, :].T
    U_A2_j = Vh_A2[:j, :].T

    # Compute the squared Frobenius norm of the overlap matrix
    norm = np.linalg.norm(U_A1_i.T @ U_A2_j, 'fro')**2

    # Normalize by min(i, j)
    normalized_norm = norm / min(i, j)

    return normalized_norm


In [12]:
# similarity = compute_subspace_similarity(B1, A1, B2, A2, 3, 5)
# print("Similarity:", similarity)

Now, in the paper on LoRAs, the following concept is introduced: The subspace similarity measure is a way of measuring the similarity between the subspaces spanned by the top singular vectors of two low-rank adaptation matrices, $A_{r=8}$ and $A_{r=64}$, from the same pre-trained model. Here's how it's done:

First, you perform a singular value decomposition (SVD) on each of these matrices to obtain their right-singular unitary matrices, denoted $U_{A_{r=8}}$ and $U_{A_{r=64}}$.

The goal is then to quantify how much of the subspace spanned by the top $i$ singular vectors in $U_{A_{r=8}}$ is contained in the subspace spanned by the top $j$ singular vectors of $U_{A_{r=64}}$.

This is measured using a normalized subspace similarity based on the Grassmann distance. The formula for this measure, denoted $\phi(A_{r=8}, A_{r=64}, i, j)$, is given as follows:

$$
\phi(A_{r=8}, A_{r=64}, i, j) = \frac{||{U_{A_{r=8}}^{(i)}}^T U_{A_{r=64}}^{(j)}||_F^2}{\min(i, j)}
$$

where $U_{A_{r}}^{(i)}$ represents the columns of $U_{A_{r}}$ corresponding to the top $i$ singular vectors, and $||\cdot||_F$ denotes the Frobenius norm.

The measure $\phi(\cdot)$ ranges from 0 to 1, where 1 represents a complete overlap of the subspaces (for $i = j$ they are the same subspace; otherwise the smaller one is contained in the larger), and 0 represents a complete separation (i.e., they are orthogonal). This is a normalized measure because it's divided by $\min(i, j)$, which is the maximum possible square of the Frobenius norm of the product matrix ${U_{A_{r=8}}^{(i)}}^T U_{A_{r=64}}^{(j)}$.

This process is performed for all pairs $(i, j)$ where $1 \leq i \leq 8$ and $1 \leq j \leq 64$. The results give an understanding of how much the learned subspaces for different ranks overlap with each other.
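As a rough illustration of the all-pairs computation just described, here is a minimal NumPy sketch. The two matrices are random stand-ins rather than trained LoRA weights, and the helper names `right_singular_vectors` and `phi_grid` are hypothetical. It relies on the fact that the squared Frobenius norm of the overlap for a given $(i, j)$ is the sum of the top-left $i \times j$ block of squared entries, so a cumulative sum yields every pair at once.

```python
import numpy as np

def right_singular_vectors(A):
    """Return a matrix whose columns are the right singular vectors of A."""
    # np.linalg.svd returns Vh, whose ROWS are the right singular vectors.
    _, _, Vh = np.linalg.svd(A, full_matrices=False)
    return Vh.T

def phi_grid(A_small, A_large):
    """phi(A_small, A_large, i, j) for every valid pair (i, j)."""
    V_small = right_singular_vectors(A_small)   # shape (d, r_small)
    V_large = right_singular_vectors(A_large)   # shape (d, r_large)
    # Squared entries of V_small^T V_large; phi sums the top-left i x j block.
    overlaps = (V_small.T @ V_large) ** 2
    block_sums = overlaps.cumsum(axis=0).cumsum(axis=1)
    i = np.arange(1, V_small.shape[1] + 1)[:, None]
    j = np.arange(1, V_large.shape[1] + 1)[None, :]
    return block_sums / np.minimum(i, j)

# Random stand-ins for A_{r=8} and A_{r=64} (assumed input dimension 768).
rng = np.random.default_rng(0)
A_r8 = rng.standard_normal((8, 768))
A_r64 = rng.standard_normal((64, 768))

grid = phi_grid(A_r8, A_r64)
print(grid.shape)  # (8, 64): entry [i-1, j-1] holds phi(A_r8, A_r64, i, j)
```

The resulting grid is what one would plot as a heatmap to see where the two sets of singular directions overlap.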

This can also be performed on two layers $\Delta W_1 = B_1A_1$ and $\Delta W_2 = B_2A_2$ in two different LoRAs. In particular, suppose we choose a layer `n` of each LoRA and run the subspace similarity measure comparison on ${U_{\Delta W_1}^{(i)}}^T U_{\Delta W_2}^{(j)}$. Then this will tell us how much those two LoRAs overlap with one another in that layer.

This could be useful in determining which LoRAs to merge. If we run this analysis on all of the weight matrices of two different LoRAs, then we can determine how much layer `n` of `lora1` overlaps with layer `n` of `lora2`. If the overlap is small, then the two weight matrices $\Delta W_1^{(n)} = B_1^{(n)}A_1^{(n)}$ and $\Delta W_2^{(n)} = B_2^{(n)}A_2^{(n)}$ may express very different things because the subspaces that they span do not overlap very much. So, to be more explicit, we compute

$$
\phi(\Delta W_1^{(n)}, \Delta W_2^{(n)}, i, j) = \frac{||{U_{\Delta W_1^{(n)}}^{(i)}}^T U_{\Delta W_2^{(n)}}^{(j)}||_F^2}{\min(i, j)}
$$

for a weight matrix $\Delta W_1^{(n)}$ from the first LoRA, and the corresponding $\Delta W_2^{(n)}$ from the second LoRA. A small overlap could indicate that merging the two LoRAs will create a more general model, able to produce a wider range of styles. The measure might also help explain why two LoRAs produce something muddy or undesirable when merged. Obviously, this is all conjecture based on a mathematical analysis that needs to be tested, and it does not provide a precise threshold for the overlap. What upper or lower bound might we use for this subspace similarity measure $\phi$? Could this hypothesis be wrong, or inverted? That is, is it possible that in some cases we actually want *high* overlap between models so that we merge very similar concepts?

In [14]:
import torch

# The A matrices
A1 = lora1_tensors['lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight'].float()
A2 = lora2_tensors['lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight'].float()

# The B matrices
B1 = lora1_tensors['lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight'].float()
B2 = lora2_tensors['lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight'].float()

# Compute the update matrices
Delta_W1 = torch.matmul(B1, A1)
Delta_W2 = torch.matmul(B2, A2)

# Compute the SVD of the update matrices
U1, _, _ = torch.svd(Delta_W1)
U2, _, _ = torch.svd(Delta_W2)

# Define the subspace similarity measure
def phi(U1, U2, i, j):
    U1_i = U1[:, :i]  # First i columns of U1
    U2_j = U2[:, :j]  # First j columns of U2
    
    product = torch.matmul(U1_i.t(), U2_j)  # Matrix multiplication
    norm = torch.norm(product)  # Frobenius norm
    
    return norm ** 2 / min(i, j)

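# Calculate the subspace similarity measure
# Caveat: U has min(out_features, in_features) columns, but only the first
# `rank` of them (rank = A1.shape[0]) span the column space of Delta_W.
# The remaining columns are arbitrary orthonormal vectors outside that space.
i = U1.size(1)  # Number of columns in U1
j = U2.size(1)  # Number of columns in U2
result = phi(U1, U2, i, j)  # Replace i and j with the desired values (at most the rank)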

print(f'The subspace similarity measure is: {result}')


The subspace similarity measure is: 0.26528483629226685


In [17]:
import torch

# Number of layers in your LoRAs
# num_layers = len(lora1_tensors.keys()) // 2  # Assuming that each layer has a down and up weight
num_layers = 12

# Define the subspace similarity measure
def phi(U1, U2, i, j):
    U1_i = U1[:, :i]  # First i columns of U1
    U2_j = U2[:, :j]  # First j columns of U2
    
    product = torch.matmul(U1_i.t(), U2_j)  # Matrix multiplication
    norm = torch.norm(product)  # Frobenius norm
    
    return norm ** 2 / min(i, j)

# Iterate over all layers
for layer in range(num_layers):
    # Extract the corresponding A and B matrices
    A1 = lora1_tensors[f'lora_te_text_model_encoder_layers_{layer}_mlp_fc1.lora_down.weight'].float()
    B1 = lora1_tensors[f'lora_te_text_model_encoder_layers_{layer}_mlp_fc1.lora_up.weight'].float()
    
    A2 = lora2_tensors[f'lora_te_text_model_encoder_layers_{layer}_mlp_fc1.lora_down.weight'].float()
    B2 = lora2_tensors[f'lora_te_text_model_encoder_layers_{layer}_mlp_fc1.lora_up.weight'].float()

    # Compute the update matrices
    Delta_W1 = torch.matmul(B1, A1)
    Delta_W2 = torch.matmul(B2, A2)

    # Compute the SVD of the update matrices
    U1, _, _ = torch.svd(Delta_W1)
    U2, _, _ = torch.svd(Delta_W2)

    # Calculate the subspace similarity measure
    i = U1.size(1)  # Number of columns in U1
    j = U2.size(1)  # Number of columns in U2
    result = phi(U1, U2, i, j)  # Uses every column of U; i, j above the LoRA rank add arbitrary directions

    print(f'The subspace similarity measure for layer {layer} is: {result}')


The subspace similarity measure for layer 0 is: 0.26528483629226685
The subspace similarity measure for layer 1 is: 0.26683029532432556
The subspace similarity measure for layer 2 is: 0.27087515592575073
The subspace similarity measure for layer 3 is: 0.2644103169441223
The subspace similarity measure for layer 4 is: 0.2637903094291687
The subspace similarity measure for layer 5 is: 0.2647744119167328
The subspace similarity measure for layer 6 is: 0.26757362484931946
The subspace similarity measure for layer 7 is: 0.26692089438438416
The subspace similarity measure for layer 8 is: 0.26369643211364746
The subspace similarity measure for layer 9 is: 0.2677040100097656
The subspace similarity measure for layer 10 is: 0.26571375131607056
The subspace similarity measure for layer 11 is: 0.28491613268852234


In [23]:
import torch

# Gather all keys and sort them
all_keys = sorted(list(lora1_tensors.keys()))

# Filter keys for lora_down and lora_up pairs
lora_down_keys = [key for key in all_keys if 'lora_down' in key]
lora_up_keys = [key for key in all_keys if 'lora_up' in key]

# Ensure we have matching pairs of keys
assert len(lora_down_keys) == len(lora_up_keys), "Mismatch in number of 'lora_down' and 'lora_up' keys"

# Define the subspace similarity measure
def phi(U1, U2, i, j):
    U1_i = U1[:, :i]  # First i columns of U1
    U2_j = U2[:, :j]  # First j columns of U2
    
    product = torch.matmul(U1_i.t(), U2_j)  # Matrix multiplication
    norm = torch.norm(product)  # Frobenius norm
    
    return norm ** 2 / min(i, j)

# Iterate over all layers
for layer in range(len(lora_down_keys)):
    try:
        # Extract the corresponding A and B matrices
        A1 = lora1_tensors[lora_down_keys[layer]].float()
        B1 = lora1_tensors[lora_up_keys[layer]].float()

        A2 = lora2_tensors[lora_down_keys[layer]].float()
        B2 = lora2_tensors[lora_up_keys[layer]].float()

        # Print the shapes of A1 and B1 matrices for troubleshooting
        print(f"A1 shape: {A1.shape}")
        print(f"B1 shape: {B1.shape}")

        # Compute the update matrices
        Delta_W1 = torch.matmul(B1, A1)
        Delta_W2 = torch.matmul(B2, A2)

        # Compute the SVD of the update matrices
        U1, _, _ = torch.svd(Delta_W1)
        U2, _, _ = torch.svd(Delta_W2)

        # Calculate the subspace similarity measure
        i = U1.size(1)  # Number of columns in U1
        j = U2.size(1)  # Number of columns in U2
        result = phi(U1, U2, i, j)  # Uses every column of U; i, j above the LoRA rank add arbitrary directions

        print(f"The subspace similarity measure for layer {layer} is: {result}")

    except RuntimeError as e:
        # Print the layer number and the error message
        print(f"Error occurred at layer {layer}: {e}")


A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([3072, 32])
The subspace similarity measure for layer 0 is: 0.26528483629226685
A1 shape: torch.Size([32, 3072])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 1 is: 0.9999878406524658
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 2 is: 0.9999885559082031
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 3 is: 0.9999887347221375
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 4 is: 0.9999908804893494
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 5 is: 0.9999882578849792
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([3072, 32])
The subspace similarity measure for layer 6 is: 0.26571375131607056
A1 shape: torch.Size([32, 3072])
B1 shape: torch.Size([768, 32])
The subspace 

Let's break this down once more. 

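From the above output, many of the update matrices appear to overlap almost completely ($\phi \approx 1$), while others show a much smaller overlap ($\phi \approx 0.26$). These numbers need care, because the code sets $i$ and $j$ to the full number of columns of $U$ (768 here) rather than to the LoRA rank (32). For layers with 768 output features, $U$ is a full $768 \times 768$ orthogonal matrix, so $\phi = 1$ whatever the weights are; for layers with 3072 output features, two unrelated 768-dimensional subspaces of a 3072-dimensional space already give roughly $768/3072 = 0.25$. Once $i$ and $j$ are restricted to at most the rank, how can we better understand how the overlaps contribute to a more general or expressive model? In other words, if a significant portion of the update matrices have high (or low) overlap, what does this mean, and what number or proportion of the update matrices can we call "significant" in this case?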
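To see how strongly the choice of $i$ and $j$ affects the measure, here is a small sketch on synthetic data. The matrices are random stand-ins for two unrelated rank-32 updates of a $768 \times 768$ layer, not the actual `lora1` and `lora2` weights, and `phi_at_rank` is a hypothetical helper name. It compares $\phi$ computed from every column of $U$ with $\phi$ computed from only the first `rank` columns.

```python
import torch

def phi_at_rank(delta_w1, delta_w2, i, j):
    """phi computed from the top-i and top-j left singular vectors."""
    U1, _, _ = torch.linalg.svd(delta_w1, full_matrices=False)
    U2, _, _ = torch.linalg.svd(delta_w2, full_matrices=False)
    overlap = U1[:, :i].T @ U2[:, :j]
    # torch.linalg.norm on a matrix defaults to the Frobenius norm.
    return (torch.linalg.norm(overlap) ** 2 / min(i, j)).item()

torch.manual_seed(0)
rank, d = 32, 768
# Synthetic, unrelated rank-32 updates (assumption: not real LoRA weights).
dw1 = torch.randn(d, rank) @ torch.randn(rank, d)
dw2 = torch.randn(d, rank) @ torch.randn(rank, d)

full = phi_at_rank(dw1, dw2, d, d)            # every column of U
limited = phi_at_rank(dw1, dw2, rank, rank)   # only the rank-32 subspaces

print(f"i = j = {d}: {full:.4f}")
print(f"i = j = {rank}: {limited:.4f}")
```

On this synthetic data the first value should come out at essentially 1, since both full $U$ matrices are orthogonal, while the second should be small, on the order of `rank / d` for unrelated subspaces. Running the same comparison on the real tensors would show whether the high values reported above survive once $i$ and $j$ are capped at the rank.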

LoRA (Low-Rank Adaptation) is a technique used to fine-tune large language models with a significantly lower computational cost compared to traditional methods. The key idea behind LoRA is to restrict the updates during the fine-tuning process to a low-rank subspace of the parameter space.

Given a large pre-trained model with parameters $\Theta$, LoRA introduces two new parameter matrices $A$ and $B$ for each adapted weight matrix, such that the update learned during fine-tuning can be written as $\Delta W = BA$, where $A \in \mathbb{R}^{k \times m}$ and $B \in \mathbb{R}^{n \times k}$. Here $n$ is the output dimension of the layer, $m$ is its input dimension, and $k$ is the rank of the low-rank update, typically much smaller than $n$ and $m$ (in the output above, $k = 32$ with $n$ and $m$ equal to 768 or 3072). This formulation ensures that $\Delta W$ has rank at most $k$, so its column space is a subspace of $\mathbb{R}^n$ of dimension at most $k$.

The code provided above computes a subspace similarity metric between pairs of update matrices in two different LoRA models. Specifically, for each layer $n$ in the models, it computes the update matrices $\Delta W_1^{(n)} = B_1^{(n)}A_1^{(n)}$ and $\Delta W_2^{(n)} = B_2^{(n)}A_2^{(n)}$ and then calculates the similarity between the subspaces spanned by $\Delta W_1^{(n)}$ and $\Delta W_2^{(n)}$. 

The subspace similarity metric $\phi$ is computed as follows:

$$
\phi(\Delta W_1^{(n)}, \Delta W_2^{(n)}, i, j) = \frac{||{U_{\Delta W_1^{(n)}}^{(i)}}^T U_{\Delta W_2^{(n)}}^{(j)}||_F^2}{\min(i, j)}
$$

Here, $U_{\Delta W_1^{(n)}}^{(i)}$ and $U_{\Delta W_2^{(n)}}^{(j)}$ are the first $i$ and $j$ left singular vectors of $\Delta W_1^{(n)}$ and $\Delta W_2^{(n)}$, respectively, obtained through singular value decomposition. The operator $||\cdot||_F$ denotes the Frobenius norm, and $\min(i, j)$ ensures the normalization of the metric to the smaller dimension of the two subspaces.

This metric $\phi$ effectively measures how aligned the two subspaces are. If the two subspaces are similar, their corresponding left singular vectors will be closely aligned, leading to a large Frobenius norm of their product, and thus a large value of $\phi$. Conversely, if the subspaces are dissimilar, their left singular vectors will not align well, resulting in a smaller Frobenius norm and a smaller $\phi$.

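This is related to the Grassmann manifold because the set of all $k$-dimensional subspaces of an $n$-dimensional vector space forms the Grassmann manifold $G(k, n)$. Strictly speaking, $\phi$ is a similarity rather than a distance: for $i = j = k$ it equals the mean squared cosine of the principal angles between the two subspaces, so $k(1 - \phi)$ is the squared projection (chordal) distance between them on this manifold. Either way, it allows us to quantify the difference between the subspaces spanned by the updates of different LoRA models.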
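The link between $\phi$ and distances on the Grassmannian can be checked numerically. The sketch below uses two random orthonormal bases as stand-ins for the top-$k$ singular vectors and SciPy's `subspace_angles` to obtain the principal angles. It verifies that, for $i = j = k$, $\phi$ is the mean of $\cos^2\theta$ over the principal angles, and that the squared projection (chordal) distance $\sum \sin^2\theta$ equals $k(1 - \phi)$.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
n, k = 50, 4
# Orthonormal bases of two k-dimensional subspaces of R^n (random stand-ins).
U1, _ = np.linalg.qr(rng.standard_normal((n, k)))
U2, _ = np.linalg.qr(rng.standard_normal((n, k)))

phi = np.linalg.norm(U1.T @ U2, "fro") ** 2 / k
theta = subspace_angles(U1, U2)            # principal angles, in radians

phi_from_angles = np.mean(np.cos(theta) ** 2)
chordal_sq = np.sum(np.sin(theta) ** 2)    # squared projection distance

print(np.isclose(phi, phi_from_angles))
print(np.isclose(chordal_sq, k * (1 - phi)))
```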

When interpreting the subspace similarity metric, a high value (like 0.99 or above) indicates that the corresponding update matrices (and hence the associated subspaces) of the two LoRA models are very similar or aligned. This suggests that the two models are learning very similar "concepts" or features.

In the context of a text-to-image model like DALL-E or Stable Diffusion, this might mean that the two LoRA models are both focusing on the same kind of visual or textual features in the data, and making similar updates during training to capture these features. If these models are meant to learn different concepts or tasks, this high similarity might indicate that they are failing to differentiate between these tasks and are instead learning similar representations.

Conversely, a low value (like 0.25 or below) of the subspace similarity metric indicates that the corresponding update matrices of the two models are quite different or unaligned. This suggests that the two models are learning different "concepts" or features.

Again, in the context of a text-to-image model, this might mean that the two LoRA models are focusing on different kinds of visual or textual features in the data, and making different updates during training to capture these features. If these models are meant to learn different concepts or tasks, this low similarity could be a good sign that they are successfully differentiating between these tasks and learning distinct representations.

Of course, these interpretations are only rough indications and might not capture the full complexity of what's happening in the models. It's also important to remember that the subspace similarity metric is just one of many ways to measure the similarity between models, and it might not always reflect the actual functional or semantic similarity between the models. For a more comprehensive understanding of what the models are learning, you would likely need to use a combination of different analysis techniques, including both quantitative measures like the subspace similarity metric and qualitative methods like visual inspection of the generated outputs.

Merging two models that have high subspace overlaps for a significant portion of their update matrices can be a challenging task. The high overlap suggests that the models are learning similar features or concepts, so simply averaging the update matrices might not result in a meaningful combination of the models' knowledge.

One way to merge the models might be to use a weighted average of the update matrices, where the weights are determined based on some measure of the models' performance or relevance to the task at hand. For instance, if one model performs better on a validation set or is known to be more relevant to the task, you could assign it a higher weight. This would give more influence to the better or more relevant model in the merged model.

However, this approach assumes that the models' update matrices can be meaningfully averaged, which might not be the case if the models are learning very different features or concepts, or if their update matrices lie in different parts of the parameter space.
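A minimal sketch of the weighted-average merge discussed above, under several assumptions: the weights `w1` and `w2` are chosen by hand, the per-layer `alpha` scaling stored in the files is ignored, and the merged update is re-factored into a `B`, `A` pair with a truncated SVD. The function name `merge_weighted` is hypothetical and this is not the method of any particular merging tool.

```python
import numpy as np

def merge_weighted(B1, A1, B2, A2, w1=0.5, w2=0.5, rank=None):
    """Weighted average of two LoRA updates, re-factored to a given rank."""
    delta = w1 * (B1 @ A1) + w2 * (B2 @ A2)
    rank = rank or B1.shape[1]
    U, s, Vh = np.linalg.svd(delta, full_matrices=False)
    # Best rank-`rank` approximation of delta, split evenly between B and A.
    B = U[:, :rank] * np.sqrt(s[:rank])
    A = np.sqrt(s[:rank])[:, None] * Vh[:rank]
    return B, A

# Random stand-ins for one layer of two rank-4 LoRAs.
rng = np.random.default_rng(0)
B1, A1 = rng.standard_normal((48, 4)), rng.standard_normal((4, 32))
B2, A2 = rng.standard_normal((48, 4)), rng.standard_normal((4, 32))

# The sum of two rank-4 updates has rank at most 8, so rank=8 is lossless.
B3, A3 = merge_weighted(B1, A1, B2, A2, w1=0.7, w2=0.3, rank=8)
target = 0.7 * (B1 @ A1) + 0.3 * (B2 @ A2)
print(np.allclose(B3 @ A3, target))
```

Keeping the original rank instead of the doubled rank discards the smallest singular directions of the merged update, which is where low-overlap LoRAs would lose the most.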

If the models' update matrices are very close together in the Grassmann manifold (i.e., they have high subspace overlap), and you want to push them further apart, you might need to modify the models' training process to encourage them to learn more diverse features or concepts. This could be done by adding a regularization term to the loss function that penalizes similarity between the models' update matrices, effectively pushing their subspaces apart in the Grassmann manifold.

Such a regularization term could be based on the subspace similarity metric itself, or on some other measure of similarity between the update matrices. The exact form of the regularization term would depend on the specific characteristics of your models and your training setup, and would likely require some experimentation to find the best approach.

It's worth noting that pushing the models' update matrices further apart in the Grassmann manifold is not always desirable. If the models are learning similar features or concepts because these are important for the task at hand, forcing them to diverge might hurt their performance. It's important to carefully consider the implications of pushing the models apart before deciding to implement such a strategy.

Modifying the regularization term to push the subspaces apart can be inappropriate in a few scenarios:

1. **Shared Essential Features**: If the models are learning similar features because these features are fundamental or essential for the task at hand, forcing the models to diverge might result in them missing out on these essential features, thereby hurting their performance.

2. **Task Similarity**: If the models are designed to solve very similar tasks, or tasks with a high degree of overlap, pushing the subspaces apart may not make sense. In this case, the models might naturally learn similar representations, and forcing divergence could lead to less optimal solutions.

3. **Increased Complexity**: Adding a regularization term to push the subspaces apart increases the complexity of the model and the training process. If the benefits of divergence (e.g., increased diversity of learned features) do not outweigh the costs (e.g., increased computational resources, potential for overfitting), it may not be appropriate to use this approach.

4. **Model Interpretability**: Adding a subspace divergence term could make the model more difficult to interpret, as it's another factor influencing the learning process that needs to be accounted for when analyzing the model's behavior. If interpretability is a key concern, it may be inappropriate to add this complexity.

5. **Lack of Evidence for Improvement**: If there's no empirical evidence or theoretical justification suggesting that pushing the subspaces apart will improve the performance of the models on the task at hand, it might be inappropriate to use this approach. It's generally recommended to have a clear hypothesis or rationale for any changes made to the model or training process.

These are general guidelines, and the specifics will depend on the particular use case, the models being used, and the tasks they're designed to solve. It's always important to carefully consider the potential implications and trade-offs of any changes to the model or training process.

Now, let's explain how we might train multiple LoRA models so that there is a minimal amount of overlap between each pair of models.

1. We want to train multiple LoRA models simultaneously such that their update matrices are as orthogonal as possible in the Grassmann manifold. This is to ensure that they capture different directions of the loss landscape, leading to a more expressive ensemble of models.

2. The similarity between two update matrices $\Delta W^{(n)}_i$ and $\Delta W^{(n)}_j$ from different models $i$ and $j$ can be measured using the subspace similarity metric $\phi(\Delta W_i, \Delta W_j, k, l)$ defined as follows:

$$
\phi(\Delta W_i, \Delta W_j, k, l) = \frac{\left\|{U_{\Delta W_i^{(n)}}^{(k)}}^T U_{\Delta W_j^{(n)}}^{(l)}\right\|_F^2}{\min(k, l)}
$$

where $U_{\Delta W_i^{(n)}}^{(k)}$ and $U_{\Delta W_j^{(n)}}^{(l)}$ are the left singular vectors of the SVD of the update matrices, and $\|\cdot\|_F$ is the Frobenius norm.

3. The loss function to encourage the update matrices to be orthogonal can be defined as the sum of the squared subspace similarity metrics for all pairs of models:

$$
\mathcal{L} = \sum_{i \neq j} \phi(\Delta W_i, \Delta W_j, k, l)^2
$$

This loss function is minimized when the update matrices are orthogonal in the Grassmann manifold (i.e., the subspace similarity metric is zero). With a negative sign it would instead reward overlap, so the sign matters. In practice this term would be added to the task loss as a regularizer.

4. The models can be trained using gradient descent or some other optimization method, where the gradients of the loss function with respect to the parameters of the models are computed and used to update the parameters. The gradients can be computed using automatic differentiation.

5. Note that the computational complexity of this process is $O(n^2)$ for $n$ models due to the pairwise similarity computation, and the optimization process can be tricky due to the interdependence of the updates to different models.

6. The choice of $k$ and $l$ can be important and might need to be adjusted depending on the specific properties of your models and data.
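The steps above can be sketched in PyTorch as follows. This is a toy illustration under stated assumptions, not a full training loop: only the penalty is optimized (no task loss), the column space of each $\Delta W = BA$ is taken to be the column space of $B$ (which holds when $A$ has full row rank), the orthonormal basis comes from a QR factorization rather than an SVD, and $k = l$ equals the rank. The function names are hypothetical.

```python
import torch

def column_space_overlap(B_i, B_j):
    """phi between the column spaces of two LoRA up matrices (k = l = rank)."""
    Q_i, _ = torch.linalg.qr(B_i)   # orthonormal basis of col(B_i)
    Q_j, _ = torch.linalg.qr(B_j)
    return torch.linalg.norm(Q_i.T @ Q_j) ** 2 / min(Q_i.shape[1], Q_j.shape[1])

def orthogonality_penalty(up_matrices):
    """Sum of phi^2 over all unordered pairs of models."""
    total = torch.zeros(())
    for a in range(len(up_matrices)):
        for b in range(a + 1, len(up_matrices)):
            total = total + column_space_overlap(up_matrices[a], up_matrices[b]) ** 2
    return total

torch.manual_seed(0)
# Three toy "models" that start out with strongly overlapping up matrices.
shared = torch.randn(64, 4)
ups = [(shared + 0.5 * torch.randn(64, 4)).requires_grad_() for _ in range(3)]

optimizer = torch.optim.Adam(ups, lr=0.05)
history = []
for step in range(200):
    optimizer.zero_grad()
    loss = orthogonality_penalty(ups)   # real use: task_loss + lam * penalty
    loss.backward()
    optimizer.step()
    history.append(loss.item())

print(f"penalty at start: {history[0]:.4f}, at end: {history[-1]:.4f}")
```

Summing over unordered pairs gives half of the sum over $i \neq j$, which only rescales the regularization weight. The double loop makes the quadratic cost in the number of models explicit.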

The dimension `r` of a LoRA model determines the rank of the learned update matrices and consequently the dimension of the subspaces they span, that is, which Grassmann manifold $G(r, n)$ their column spaces are points of. If `r` is very low (the LoRA paper reports that a rank as low as `r=8` can be enough to retain accuracy on the tasks it studies), the update matrices span only a low-dimensional subspace.

This low-dimensionality has several important implications for the training process described above:

1. **Computational Efficiency**: The computation of the subspace similarity metric and the corresponding loss function becomes more efficient. The singular value decomposition (SVD) used in the computation of the subspace similarity metric scales with the size of the matrices involved. The product $\Delta W$ has the same shape whatever the rank, so the saving comes from working with the thin factors $B$ and $A$ (for example via a QR factorization of $B$) instead of the full product. This makes it feasible to compute these quantities for a large number of models and update steps.

2. **Simpler Optimization Landscape**: As the dimensionality of the update matrices decreases, the complexity of the optimization problem decreases as well. This can potentially make it easier to find good solutions, although the problem remains non-trivial due to the interdependence of the updates for different models.

3. **More Orthogonality**: With a smaller `r`, there is a greater chance of achieving more orthogonality between the models, as there are fewer dimensions for them to overlap in. This could potentially lead to a more diverse set of models, as they are forced to learn different aspects of the data.

4. **Less Overfitting**: Since the number of trainable parameters is small, the models are less likely to overfit the training data. This could potentially improve their generalization performance.

However, it's important to note that a smaller `r` also means that each individual model is less expressive, as it can only apply a rank-`r` update to each weight matrix. This limitation could potentially be offset by the increased diversity of the ensemble of models.
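One way to benefit from a small `r` when computing the metric is to avoid forming $\Delta W = BA$ at all. The sketch below, with random stand-in factors and a hypothetical helper name, uses the identity $BA = Q(RA)$ with $B = QR$: the left singular vectors of $BA$ are $Q$ times the left singular vectors of the small $r \times m$ matrix $RA$.

```python
import numpy as np

def top_left_singular_vectors(B, A):
    """Left singular vectors of B @ A without forming the (n x m) product."""
    Q, R = np.linalg.qr(B)          # B = Q R with Q: (n, r), R: (r, r)
    U_small, s, _ = np.linalg.svd(R @ A, full_matrices=False)  # SVD of (r, m)
    return Q @ U_small, s           # (n, r) left singular vectors, values

# Random stand-ins for one layer's LoRA factors.
rng = np.random.default_rng(0)
n, m, r = 512, 256, 8
B = rng.standard_normal((n, r))
A = rng.standard_normal((r, m))

U_fast, s_fast = top_left_singular_vectors(B, A)

# Reference: SVD of the full product, truncated to the first r components.
U_full, s_full, _ = np.linalg.svd(B @ A, full_matrices=False)
print(np.allclose(s_fast, s_full[:r]))
print(np.allclose(U_fast @ U_fast.T, U_full[:, :r] @ U_full[:, :r].T))
```

The comparison uses projectors $UU^T$ rather than the columns themselves because singular vectors are only determined up to sign.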

## Another Approach to Merging

*Suppose we are studying the update weight matrices $\Delta W_1^{(n)} = B_1^{(n)}A_1^{(n)}$ for layer `n` of a reference LoRA (Low-Rank Adaptation) model named `lora1.safetensors` or just `lora1`. Suppose further we have a second LoRA named `lora2` (also stored as a .safetensors file) with update weight matrices $\Delta W_2^{(n)} = B_2^{(n)}A_2^{(n)}$. Now, define a subspace similarity metric

$$\phi(\Delta W_i, \Delta W_j, k, l) = \frac{\left\|{U_{\Delta W_i^{(n)}}^{(k)}}^T U_{\Delta W_j^{(n)}}^{(l)}\right\|_F^2}{\min(k, l)}$$

that helps us measure how far apart two update weight matrices are. Now please explain how this relates to points in the Grassmannian (or on the "Grassmann manifold"). Please explain how the Grassmannian is realized as a quotient space (a homogeneous space) in terms of orthogonal groups. Please also explain how we might use this to define and study the equivalence classes of matrices corresponding to the update weight matrices of the two LoRA models. Please also explain how we might use this to replace the update weight matrices of `lora2` with equivalent matrices. How will these new matrices compare in the subspace similarity metric $\phi(\Delta W_i, \Delta W_j, k, l)$ we defined? If we wish to merge the two LoRA models in some way, how can we use the theory and properties of the Grassmannian to do so?*

Let's start by discussing the Grassmannian, a space that parametrizes all k-dimensional subspaces in an n-dimensional vector space. The Grassmannian, often denoted as G(k, n), can be thought of as a set of points, each representing a unique k-dimensional subspace in n-dimensional space.

The Grassmannian is realized as a quotient space (a homogeneous space) in terms of orthogonal groups as follows: An orthogonal group, O(n), is the group of all orthogonal n×n matrices (those that preserve the dot product in Euclidean space). It includes rotations and reflections. The special orthogonal group, SO(n), is the subgroup consisting of rotations only. The Stiefel manifold, V(k, n), is the space of all orthonormal k-frames in R^n, that is, all ordered sets of k orthonormal vectors in R^n. It is not a subset of O(n) but a quotient of it: O(n) acts transitively on k-frames and the stabilizer of a frame is a copy of O(n−k), so V(k, n) ≅ O(n) / O(n−k).

When we mod out the Stiefel manifold by the action of the orthogonal group O(k) (the group of all orthogonal k×k matrices), we get the Grassmannian. In other words, the Grassmannian G(k, n) is the quotient space V(k, n) / O(k), or equivalently O(n) / (O(k) × O(n−k)). The action of O(k) on V(k, n) is to change the basis of each k-subspace in R^n without changing the subspace itself. Hence, each point in the Grassmannian represents an equivalence class of k-frames that span the same k-subspace in R^n.

Now, let's relate this to the problem involving the update weight matrices of the two LoRA models.

In your context, the update weight matrices ΔW_i^(n) and ΔW_j^(n) correspond to linear transformations in some high-dimensional space. The subspaces of interest here are the column spaces of these matrices. The subspace similarity metric φ(ΔW_i, ΔW_j, k, l) measures how "close" these column spaces are to each other in terms of their principal subspaces spanned by the first k and l singular vectors respectively. This notion of "closeness" can be thought of as a distance in the Grassmannian.

When you talk about replacing the update weight matrices of lora2 with "equivalent" matrices, you're essentially discussing a change of basis within each subspace. In terms of the Grassmannian, you're moving within an equivalence class. The subspace similarity metric φ(ΔW_i, ΔW_j, k, l) should remain the same, because the metric is a function of the subspaces themselves, not the specific bases (i.e., matrices) that you choose to represent them.
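The invariance claimed above is easy to check numerically. In this sketch the bases are random stand-ins, and `Q` is a random element of O(k) obtained from a QR factorization. Multiplying a frame on the right by `Q` changes the matrix but not the subspace, so both the projector $UU^T$ and $\phi$ stay the same. Note the assumption that the whole $k$-dimensional subspace is compared: truncating a rotated frame to fewer than $k$ columns would select a different subspace.

```python
import numpy as np

def phi(U, V):
    """Subspace similarity using all columns of the two orthonormal bases."""
    return np.linalg.norm(U.T @ V, "fro") ** 2 / min(U.shape[1], V.shape[1])

rng = np.random.default_rng(0)
n, k = 40, 5
U1, _ = np.linalg.qr(rng.standard_normal((n, k)))
U2, _ = np.linalg.qr(rng.standard_normal((n, k)))
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))   # random element of O(k)

U2_rotated = U2 @ Q    # a different frame in the same equivalence class

print(np.allclose(U2 @ U2.T, U2_rotated @ U2_rotated.T))  # same subspace
print(np.isclose(phi(U1, U2), phi(U1, U2_rotated)))       # same phi
```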

If you wish to "merge" the two LoRA models, one possibility could be to find a common subspace that best represents the column spaces of both models' update weight matrices. This common subspace can be thought of as a point in the Grassmannian that is "close" to the points representing the column spaces of both models. How exactly to find such a common subspace would depend on the specifics of your problem, but in general, you could use techniques such as principal angles (canonical correlations) or other Grassmannian-based optimization methods to do this. 

The Grassmannian's geometry and algebraic structure (e.g., its topology as a homogeneous space and its structure as a manifold) can provide a rich mathematical framework for studying these kinds of problems in machine learning.

*Let's discuss the merging process more. How might one create a third LoRA model `lora3` whose update weight matrices $\Delta W_3^{(n)} = B_3^{(n)}A_3^{(n)}$ have maximum overlap with both $\Delta W_1^{(n)} = B_1^{(n)}A_1^{(n)}$ and $\Delta W_2^{(n)} = B_2^{(n)}A_2^{(n)}$ for each layer `n`? That is, so that each $\Delta W_3^{(n)}$ is close to $\Delta W_1^{(n)}$ and $\Delta W_2^{(n)}$ in the Grassmannian manifold for all $n$?*

One approach to creating a third LoRA model `lora3` that has maximum overlap with both `lora1` and `lora2` is to find a common subspace for each layer `n` that is close to the subspaces represented by ΔW_1^(n) and ΔW_2^(n) in the Grassmannian manifold. Assuming φ is normalized as in the LoRA paper, so that larger values mean more overlap, this can be posed as an optimization problem that maximizes φ between the column space of ΔW_3^(n) and those of ΔW_1^(n) and ΔW_2^(n). Equivalently, it minimizes a subspace distance.

This optimization problem can be formalized as:

Maximize: $\phi(\Delta W_3^{(n)}, \Delta W_1^{(n)}, k, l) + \phi(\Delta W_3^{(n)}, \Delta W_2^{(n)}, k, l)$
Subject to: $\Delta W_3^{(n)} = B_3^{(n)}A_3^{(n)}$

Here, $B_3^{(n)}$ and $A_3^{(n)}$ are weight matrices that need to be determined.

The maximization can be done using gradient ascent or other optimization techniques, where the gradient of the objective function can be computed using the properties of the Grassmannian manifold.

This method would give you a new set of weight matrices for `lora3` that represent a common subspace for each layer `n`, which is close to the subspaces represented by `lora1` and `lora2`.
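For the special case `k = l = r`, and assuming φ is the normalized projection similarity (larger means more overlap), the total overlap $\phi(\Delta W_3, \Delta W_1) + \phi(\Delta W_3, \Delta W_2)$ equals $\tfrac{1}{r}\,\mathrm{tr}\big(U_3^\top (U_1U_1^\top + U_2U_2^\top) U_3\big)$. It is therefore largest when the columns of $U_3$ are the top `r` eigenvectors of the summed projectors, and no iterative optimizer is needed. The sketch below illustrates this. The helper names are hypothetical, and the particular choice of $A_3$ (projecting the average update onto the merged subspace) is one assumption among many possible ones.

```python
import numpy as np

def column_basis(dW, r):
    """Top-r left singular vectors of dW (orthonormal, shape d_out x r)."""
    U, _, _ = np.linalg.svd(dW, full_matrices=False)
    return U[:, :r]

def total_overlap(U, U1, U2):
    """phi(U, U1) + phi(U, U2) for orthonormal bases of equal rank."""
    r = U.shape[1]
    return (np.linalg.norm(U.T @ U1) ** 2 + np.linalg.norm(U.T @ U2) ** 2) / r

def merged_column_basis(U1, U2):
    """Orthonormal basis maximizing total_overlap: top eigenvectors of P1 + P2."""
    r = U1.shape[1]
    P_sum = U1 @ U1.T + U2 @ U2.T      # sum of the two orthogonal projectors
    _, eigvecs = np.linalg.eigh(P_sum)  # eigenvalues in ascending order
    return eigvecs[:, -r:]

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 4  # hypothetical layer shape and LoRA rank
dW1 = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))
dW2 = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))

U1, U2 = column_basis(dW1, r), column_basis(dW2, r)
U3 = merged_column_basis(U1, U2)

# Assumed factorization for lora3: B3 spans the merged subspace and
# A3 is the average update projected onto it.
B3 = U3
A3 = U3.T @ (dW1 + dW2) / 2
dW3 = B3 @ A3

print(total_overlap(U3, U1, U2), total_overlap(U1, U1, U2))
```

This only controls the column space of $\Delta W_3$. Whether the resulting model behaves well is a separate question, as the caveats that follow point out.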

However, keep in mind that this process might not always yield a meaningful or useful model, because the common subspace that minimizes the distance might not preserve the essential features that `lora1` and `lora2` have learned.

Also note that this is a somewhat simplified approach and does not take into account the structure of the neural network models or the nature of the data they have been trained on. In practice, creating a merged model that performs well might involve more complex techniques, like transfer learning or knowledge distillation, and might require a careful examination of the models' architectures and training histories.

## Yet Another Approach Using the Plücker Embedding of the Grassmannian Manifold

The Plücker embedding is a map from the Grassmannian G(k, n) into the projective space P^(N-1), where N is the binomial coefficient "n choose k". The Plücker coordinates of a k-dimensional subspace V of n-dimensional space are the coordinates of the wedge product $v_1 \wedge \dots \wedge v_k$ of any basis of V. If the basis vectors are the columns of an n×k matrix, then the Plücker coordinates are the determinants of all k×k submatrices of that matrix. They are defined only up to a common nonzero scale factor, because a different basis of V rescales all of them by the same amount.

Let's say that $\Delta W_1$ and $\Delta W_2$ are two matrices whose columns span the subspaces represented by points p1 and p2 in the Grassmannian. You want to find the midpoint of the geodesic connecting p1 and p2.

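The Plücker embedding represents each subspace (each matrix) as a point in projective space. In this space, the images of p1 and p2 can be joined by a straight line, and the arithmetic mean of their Plücker coordinates (after normalizing each coordinate vector, since the coordinates are only defined up to scale) is a natural candidate for a midpoint. This line is generally not the image of the Grassmannian geodesic, so the mean should be read as an approximation to the geodesic midpoint rather than the midpoint itself.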

However, there's a caveat: The Plücker coordinates of a point in the Grassmannian satisfy certain relations called the Plücker relations, which are a consequence of the alternating property of the wedge product. The sum of two sets of Plücker coordinates might not satisfy these relations. Therefore, after computing the mean of the Plücker coordinates of p1 and p2, you need to project this point back onto the Grassmannian, i.e., find the point in the Grassmannian that is closest to the mean point under the Fubini-Study metric.

This projection step is nontrivial. It requires solving an optimization problem that minimizes the distance from the mean point to the Grassmannian, subject to the Plücker relations. The result will be the Plücker coordinates of the midpoint of the geodesic.
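To make the Fubini-Study distance concrete, the sketch below computes it between the Plücker points of two random 2-dimensional subspaces of $\mathbb{R}^5$. The helper names are illustrative. By the Cauchy-Binet formula, the inner product of two Plücker vectors built from orthonormal bases $Q_1$, $Q_2$ equals $\det(Q_1^\top Q_2)$, so the cosine of the Fubini-Study distance is the product of the cosines of the principal angles.

```python
import numpy as np
from itertools import combinations
from scipy.linalg import subspace_angles

def plucker(Q):
    """Maximal minors of an n x k matrix whose columns span the subspace."""
    n, k = Q.shape
    return np.array([np.linalg.det(Q[list(rows), :])
                     for rows in combinations(range(n), k)])

def fubini_study_distance(p, q):
    """Angle between the lines through p and q (invariant to rescaling)."""
    c = abs(p @ q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return np.arccos(np.clip(c, 0.0, 1.0))

rng = np.random.default_rng(2)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # orthonormal basis 1
Q2, _ = np.linalg.qr(rng.standard_normal((5, 2)))  # orthonormal basis 2

p1, p2 = plucker(Q1), plucker(Q2)
d_fs = fubini_study_distance(p1, p2)

# Product of cosines of the principal angles between the two subspaces
prod_cos = np.prod(np.cos(subspace_angles(Q1, Q2)))

print(np.cos(d_fs), prod_cos)  # these agree up to round-off
```

Because the distance depends only on the lines through `p1` and `p2`, rescaling either coordinate vector leaves it unchanged.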

To find a matrix representation of this midpoint subspace, you would need to find a set of vectors whose wedge products give the Plücker coordinates of the midpoint. This is also a nontrivial problem, but in some cases it might be possible to solve it by using techniques from linear algebra or combinatorics.

Keep in mind that this is a high-level description of the process, and the actual computations might be quite complex, especially for large values of k and n. This approach is more commonly used in theoretical studies than in practical computations. For practical purposes, other methods such as optimization on the Grassmannian might be more efficient.

### Example

Let's consider a simple example where we are dealing with 2-dimensional subspaces of a 4-dimensional space. This corresponds to the Grassmannian $G(2,4)$. The Plücker coordinates in this case are given by $2 \times 2$ determinants, and there are six Plücker coordinates for each point in $G(2,4)$.

Let's consider two $2 \times 4$ matrices $\Delta W_1$ and $\Delta W_2$ whose rows span these subspaces:

$$\Delta W_1 = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{bmatrix}$$
$$\Delta W_2 = \begin{bmatrix} -1 & -2 & -3 & -4 \\ -5 & -6 & -7 & -8 \end{bmatrix}$$

The Plücker coordinates of these subspaces are given by the determinants of all $2 \times 2$ submatrices. For $\Delta W_1$, these are $(1 \times 6 - 2 \times 5, 1 \times 7 - 3 \times 5, 1 \times 8 - 4 \times 5, 2 \times 7 - 3 \times 6, 2 \times 8 - 4 \times 6, 3 \times 8 - 4 \times 7) = (-4, -8, -12, -4, -8, -4)$. Since $\Delta W_2 = -\Delta W_1$, every $2 \times 2$ determinant picks up a factor of $(-1)^2 = 1$, so $\Delta W_2$ has the same Plücker coordinates, $(-4, -8, -12, -4, -8, -4)$. Up to scale, both are $(1, 2, 3, 1, 2, 1)$, which reflects the fact that the two matrices span the same subspace.

The midpoint of the line segment in the Plücker space connecting these two points is simply the arithmetic mean of the two sets of Plücker coordinates, which here is again $(-4, -8, -12, -4, -8, -4)$. For two distinct subspaces, the mean would generally be a different point from either input.

However, as mentioned previously, the mean of two distinct points might not satisfy the Plücker relations. For $G(2,4)$ there is a single relation:

\begin{align*}
P_{12}P_{34} - P_{13}P_{24} + P_{14}P_{23} &= 0
\end{align*}

This equation characterizes the coordinate vectors that come from a 2-dimensional subspace of a 4-dimensional space. To find the point in $G(2,4)$ that is closest to the mean point under the Fubini-Study metric, you would need to solve an optimization problem that minimizes the distance from the mean point to $G(2,4)$ subject to this Plücker relation. This is a nontrivial problem that might require numerical methods or specialized algebraic techniques.

Once you have the Plücker coordinates of the midpoint, you can find a matrix representation of the corresponding subspace by finding a set of vectors whose $2 \times 2$ determinants give these Plücker coordinates. Again, this is a nontrivial problem, but in some cases it might be possible to solve it by using techniques from linear algebra or combinatorics.
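For the special case $k = 2$, one linear-algebra route is available. The Plücker coordinates $P_{ij}$ can be arranged into an antisymmetric $n \times n$ matrix, which for a genuine subspace equals $uv^\top - vu^\top$ for a basis $u, v$ and so has that subspace as its column space. Keeping only the leading pair of singular vectors then recovers the subspace, and for coordinates that violate the Plücker relation it gives a least-squares nearest rank-2 candidate. The sketch below assumes $k = 2$ throughout, and the function names are illustrative.

```python
import numpy as np
from itertools import combinations

def plucker_2(X):
    """Plücker coordinates of the row space of a 2 x n matrix X."""
    n = X.shape[1]
    return np.array([np.linalg.det(X[:, [i, j]])
                     for i, j in combinations(range(n), 2)])

def plane_from_plucker(p, n):
    """k = 2 only: orthonormal n x 2 basis of the plane encoded by p."""
    P = np.zeros((n, n))
    for value, (i, j) in zip(p, combinations(range(n), 2)):
        P[i, j], P[j, i] = value, -value  # antisymmetric arrangement
    U, _, _ = np.linalg.svd(P)
    return U[:, :2]                       # leading singular pair

rng = np.random.default_rng(3)
X = rng.standard_normal((2, 4))           # rows span a plane in R^4
basis = plane_from_plucker(plucker_2(X), n=4)

# Compare orthogonal projectors, which do not depend on the choice of basis
Qx, _ = np.linalg.qr(X.T)
print(np.allclose(basis @ basis.T, Qx @ Qx.T))
```

For $k > 2$ the coordinates no longer fit into a single matrix this way, so a different recovery procedure would be needed.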

Please note that this is a simplified example for illustrative purposes, and the actual computations in a real-world scenario could be much more complex.
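A quick numerical check shows how averaging can leave the Grassmannian. The sketch uses two hypothetical planes in $\mathbb{R}^4$ chosen for easy arithmetic (the span of $e_1, e_2$ and the span of $e_3, e_4$), not the matrices from the example above, and evaluates the left-hand side of the $G(2,4)$ relation $P_{12}P_{34} - P_{13}P_{24} + P_{14}P_{23}$.

```python
import numpy as np
from itertools import combinations

def plucker_rows(X):
    """Plücker coordinates of the row space of a k x n matrix X."""
    k, n = X.shape
    return np.array([np.linalg.det(X[:, list(cols)])
                     for cols in combinations(range(n), k)])

def plucker_residual_g24(p):
    """Left-hand side of the single Plücker relation for G(2, 4)."""
    p12, p13, p14, p23, p24, p34 = p
    return p12 * p34 - p13 * p24 + p14 * p23

X1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]])  # span(e1, e2)
X2 = np.array([[0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]])  # span(e3, e4)

p1, p2 = plucker_rows(X1), plucker_rows(X2)
p_mean = (p1 + p2) / 2

res1 = plucker_residual_g24(p1)          # zero: p1 comes from a subspace
res2 = plucker_residual_g24(p2)          # zero: p2 comes from a subspace
res_mean = plucker_residual_g24(p_mean)  # 0.5 * 0.5 = 0.25, so not a subspace
print(res1, res2, res_mean)
```

The nonzero residual for `p_mean` is what makes the projection step back onto $G(2,4)$ necessary.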

Let's consider once more a scenario where we have multiple LoRA models, each adapted for a different task, and we want to update the matrices of a fixed layer of these models. One possible approach is to use the concept of geometric means on the Grassmannian to interpolate between the subspaces spanned by the adapted weights of these models. Here is a rough sketch of how this could be done:

1. First, represent the adapted weights of each model as a point in the Grassmannian. This is done by considering the column space (or row space, depending on the specifics of your model) of the weight matrix as a subspace of the whole space of possible weights.

2. Use the Plücker embedding to embed these points into a projective space. This gives you a more manageable space to work in.

3. Compute the Karcher mean of these points in the projective space. This gives you a new point in projective space that represents the "average" of the adapted weights of the models.

4. Finally, use the inverse of the Plücker embedding to map this point back to the Grassmannian, giving you a new subspace in the original space of weights.

5. The new subspace represents the geometric mean of the adapted weights of the models. You can then use this to update the weights of the layer in question in each of the models.

This process allows you to interpolate between the adapted weights of multiple models in a way that respects the underlying geometry of the weight space. This can be a powerful tool for tasks such as ensemble learning, where you want to combine the strengths of multiple models.

However, please note that computing the Karcher mean is an iterative procedure that can be expensive, and the mean may not be well-defined or unique when the subspaces are far apart. Furthermore, the process of mapping back and forth between the Grassmannian and projective space can introduce additional complexities.
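As a rough illustration of the averaging step, the sketch below computes a Karcher mean directly on the Grassmannian using orthonormal bases and the commonly used SVD-based logarithm and exponential maps, skipping the Plücker step entirely. It assumes all subspaces are close enough that $X^\top Y$ stays invertible. The function names, iteration count, and tolerance are illustrative choices.

```python
import numpy as np

def grassmann_log(X, Y):
    """Tangent vector at span(X) pointing to span(Y); X, Y are n x k orthonormal."""
    XtY = X.T @ Y
    M = (Y - X @ XtY) @ np.linalg.inv(XtY)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.arctan(s)) @ Vt

def grassmann_exp(X, H):
    """Follow the geodesic from span(X) in tangent direction H for unit time."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    Y = X @ Vt.T @ np.diag(np.cos(s)) @ Vt + U @ np.diag(np.sin(s)) @ Vt
    Q, _ = np.linalg.qr(Y)  # re-orthonormalize to limit round-off drift
    return Q

def karcher_mean(bases, n_iter=50, tol=1e-10):
    """Iteratively average the log maps and step along the mean direction."""
    X = bases[0]
    for _ in range(n_iter):
        H = sum(grassmann_log(X, Y) for Y in bases) / len(bases)
        if np.linalg.norm(H) < tol:
            break
        X = grassmann_exp(X, H)
    return X

# Two lines in the plane, 60 degrees apart; their mean should sit at 30 degrees
theta = np.pi / 3
X1 = np.array([[1.0], [0.0]])
X2 = np.array([[np.cos(theta)], [np.sin(theta)]])
mean_basis = karcher_mean([X1, X2])

expected = np.array([[np.cos(theta / 2)], [np.sin(theta / 2)]])
print(np.allclose(mean_basis @ mean_basis.T, expected @ expected.T))
```

Projectors are compared rather than the bases themselves, because a basis of the mean subspace is only determined up to sign or rotation.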

Here is some code to compute the Plücker coordinates for random matrices. 

In [30]:
import numpy as np
from itertools import combinations

# Define the dimensions
k, n = 2, 4  

# Generate a random k x n matrix X
X1 = np.random.rand(k, n)

# Compute the Plücker coordinates
plucker_coords = []
for cols in combinations(range(n), k):
    # Each choice of k columns gives a k x k submatrix;
    # its determinant is one Plücker coordinate of the row space of X1
    submatrix = X1[:, list(cols)]
    plucker_coords.append(np.linalg.det(submatrix))

plucker_coords = np.array(plucker_coords)

print(plucker_coords)

[-0.13713356  0.29770262  0.01930738  0.44963994  0.19357615 -0.35692766]


In [40]:
import numpy as np
import itertools

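def plucker_coordinates(X):
    """Return all k x k minors of X, with k = min(X.shape).

    For a full-rank tall (or wide) matrix these maximal minors are the
    Plücker coordinates of its column space (or row space).
    """
    n, m = X.shape
    k = min(n, m)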
    
    plucker_coords = []

    for rows in itertools.combinations(range(n), k):
        for cols in itertools.combinations(range(m), k):
            submatrix = X[np.ix_(rows, cols)]
            plucker_coords.append(np.linalg.det(submatrix))

    return np.array(plucker_coords)

# Generate two random matrices
n, m = 5, 3
X1 = np.random.rand(n, m)
X2 = np.random.rand(n, m)

# Compute Plücker coordinates
plucker_coords1 = plucker_coordinates(X1)
plucker_coords2 = plucker_coordinates(X2)

# Compute arithmetic mean
# mean_coords = (plucker_coords1 + plucker_coords2) / 2

print("Plücker coordinates of X1: ", plucker_coords1)
print("Plücker coordinates of X2: ", plucker_coords2)
# print("Arithmetic mean of Plücker coordinates: ", mean_coords)


Plücker coordinates of X1:  [-0.12599737 -0.11151461  0.30825908 -0.02148901  0.47624255  0.36892691
 -0.029441   -0.20845442 -0.25652258 -0.14683273]
Plücker coordinates of X2:  [-0.44510487  0.23468189 -0.16576491  0.43581685  0.06579786 -0.19699786
  0.07852637 -0.0959877   0.02136498  0.10559293]


## References

[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)