# LoRA Subspace Similarity Measure for Merging

First, let's open two `.safetensors` files storing two LoRA models, `lora1` and `lora2`, and print the keys in their dictionaries.

In [1]:
import torch
from safetensors import safe_open
from scipy.linalg import svd
import numpy as np

# Load the LoRA tensors from .safetensors files
with safe_open("lora1.safetensors", framework="pt", device="cpu") as f:
    lora1_tensors = {}
    for k in f.keys():
        lora1_tensors[k] = f.get_tensor(k)

# Print the available keys
print("Keys in lora1_tensors:")
for key in lora1_tensors.keys():
    print(key)

Keys in lora1_tensors:
lora_te_text_model_encoder_layers_0_mlp_fc1.alpha
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight
lora_te_text_model_encoder_layers_0_mlp_fc2.alpha
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_down.weight
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_k_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_out_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_out_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_out_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_q_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_q_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_q_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self

In [2]:
with safe_open("lora2.safetensors", framework="pt", device="cpu") as f:
    lora2_tensors = {}
    for k in f.keys():
        lora2_tensors[k] = f.get_tensor(k)

print("\nKeys in lora2_tensors:")
for key in lora2_tensors.keys():
    print(key)


Keys in lora2_tensors:
lora_te_text_model_encoder_layers_0_mlp_fc1.alpha
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight
lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight
lora_te_text_model_encoder_layers_0_mlp_fc2.alpha
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_down.weight
lora_te_text_model_encoder_layers_0_mlp_fc2.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_k_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_k_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_out_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_out_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_out_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_self_attn_q_proj.alpha
lora_te_text_model_encoder_layers_0_self_attn_q_proj.lora_down.weight
lora_te_text_model_encoder_layers_0_self_attn_q_proj.lora_up.weight
lora_te_text_model_encoder_layers_0_sel

In [3]:
import numpy as np
from scipy.linalg import svd

# Given matrices B1, A1, B2, A2 (A is rank x d_in, B is d_out x rank)
# B1, A1 = np.random.rand(64, 8), np.random.rand(8, 64)  # for example (rank 8)
# B2, A2 = np.random.rand(64, 64), np.random.rand(64, 64)  # for example (rank 64)

def compute_subspace_similarity(B1, A1, B2, A2, i, j):
    # B1 and B2 are kept in the signature but unused: the measure is defined on A only.
    # scipy's svd returns (U, s, Vh); the right singular vectors are the ROWS of Vh.
    _, _, Vh_A1 = svd(A1, full_matrices=False)
    _, _, Vh_A2 = svd(A2, full_matrices=False)

    # Take the top i and j right singular vectors and store them as columns
    U_A1_i = Vh_A1[:i, :].T
    U_A2_j = Vh_A2[:j, :].T

    # Squared Frobenius norm of the i x j matrix of inner products
    norm = np.linalg.norm(U_A1_i.T @ U_A2_j, 'fro')**2

    # Normalize by min(i, j)
    normalized_norm = norm / min(i, j)

    return normalized_norm


In [12]:
# similarity = compute_subspace_similarity(B1, A1, B2, A2, 3, 5)
# print("Similarity:", similarity)

Now, the LoRA paper introduces the following concept. The subspace similarity measure quantifies the similarity between the subspaces spanned by the top singular vectors of two low-rank adaptation matrices, $A_{r=8}$ and $A_{r=64}$, trained at different ranks on the same pre-trained model. Here's how it's done:

First, you perform a singular value decomposition (SVD) on each of these matrices to obtain their right-singular unitary matrices, denoted $U_{A_{r=8}}$ and $U_{A_{r=64}}$.

The goal is then to quantify how much of the subspace spanned by the top $i$ singular vectors in $U_{A_{r=8}}$ is contained in the subspace spanned by the top $j$ singular vectors of $U_{A_{r=64}}$.

This is measured using a normalized subspace similarity based on the Grassmann distance. The formula for this measure, denoted $\phi(A_{r=8}, A_{r=64}, i, j)$, is given as follows:

$$
\phi(A_{r=8}, A_{r=64}, i, j) = \frac{||{U_{A_{r=8}}^{(i)}}^T U_{A_{r=64}}^{(j)}||_F^2}{\min(i, j)}
$$

where $U_{A_{r}}^{(i)}$ represents the columns of $U_{A_{r}}$ corresponding to the top $i$ singular vectors, and $||\cdot||_F$ denotes the Frobenius norm.

The measure $\phi(\cdot)$ ranges from 0 to 1, where 1 represents a complete overlap of the subspaces (for $i = j$ they are the same; otherwise the smaller one is contained in the larger), and 0 represents a complete separation (i.e., they are orthogonal). This is a normalized measure because it's divided by $\min(i, j)$, which is the maximum possible squared Frobenius norm of the $i \times j$ product matrix ${U_{A_{r=8}}^{(i)}}^T U_{A_{r=64}}^{(j)}$.
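The two endpoints of that range can be checked on a toy example. The sketch below is illustrative only: the helper name `phi` and the 6-dimensional random basis are assumptions, not part of the paper or of the LoRA files used here.

```python
import numpy as np

def phi(U1, U2, i, j):
    """Normalized overlap between the first i columns of U1 and the first j columns of U2."""
    M = U1[:, :i].T @ U2[:, :j]          # i x j matrix of inner products
    return float(np.linalg.norm(M, "fro") ** 2 / min(i, j))

rng = np.random.default_rng(0)
# Orthonormal basis of R^6 from the QR decomposition of a random matrix
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))

same = phi(Q[:, :2], Q[:, :2], 2, 2)        # identical 2-D subspaces
disjoint = phi(Q[:, :2], Q[:, 2:4], 2, 2)   # mutually orthogonal 2-D subspaces
nested = phi(Q[:, :2], Q[:, :4], 2, 4)      # 2-D subspace contained in a 4-D one

print(same, disjoint, nested)
```

Up to floating-point error, the identical and nested cases give 1 and the orthogonal case gives 0.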

This process is performed for all pairs $(i, j)$ where $1 \leq i \leq 8$ and $1 \leq j \leq 64$. The results give an understanding of how much the learned subspaces for different ranks overlap with each other.
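Evaluating every pair $(i, j)$ does not require recomputing a norm for each pair. A minimal sketch, assuming random stand-ins for $A_{r=8}$ and $A_{r=64}$ with input dimension 64 (the name `phi_grid` and these shapes are hypothetical): square the matrix of inner products once, then take cumulative sums along both axes.

```python
import numpy as np

def phi_grid(V1, V2):
    """phi(i, j) for all 1 <= i <= V1.shape[1] and 1 <= j <= V2.shape[1].

    V1 and V2 hold orthonormal columns ordered by decreasing singular value.
    """
    M2 = (V1.T @ V2) ** 2
    # cum[i-1, j-1] = sum of squared entries in the top-left i x j block
    cum = M2.cumsum(axis=0).cumsum(axis=1)
    i_idx = np.arange(1, V1.shape[1] + 1)[:, None]
    j_idx = np.arange(1, V2.shape[1] + 1)[None, :]
    return cum / np.minimum(i_idx, j_idx)

rng = np.random.default_rng(0)
A_r8 = rng.standard_normal((8, 64))     # stand-in for a rank-8 A matrix
A_r64 = rng.standard_normal((64, 64))   # stand-in for a rank-64 A matrix

# Right singular vectors are the rows of Vh; transpose to store them as columns
V8 = np.linalg.svd(A_r8, full_matrices=False)[2].T     # shape (64, 8)
V64 = np.linalg.svd(A_r64, full_matrices=False)[2].T   # shape (64, 64)

grid = phi_grid(V8, V64)   # grid[i-1, j-1] = phi(A_r8, A_r64, i, j)
print(grid.shape)
```

The last entry of the grid is 1 by construction here, because the 64 columns of `V64` span the whole space and therefore contain any 8-dimensional subspace.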

This can also be performed on two layers $\Delta W_1 = B_1A_1$ and $\Delta W_2 = B_2A_2$ in two different LoRAs. In particular, suppose we choose a layer `n` of each LoRA and run the subspace similarity measure on ${U_{\Delta W_1}^{(i)}}^T U_{\Delta W_2}^{(j)}$. This will tell us how much those two LoRAs overlap with one another in that layer.

This could be useful in determining which LoRAs to merge. If we run this analysis on all of the weight matrices of two different LoRAs, then we can determine how much layer `n` of `lora1` overlaps with layer `n` of `lora2`. If the overlap is small, then the two weight matrices $\Delta W_1^{(n)} = B_1^{(n)}A_1^{(n)}$ and $\Delta W_2^{(n)} = B_2^{(n)}A_2^{(n)}$ may express very different things because the subspaces that they span do not overlap very much. So, to be more explicit, we compute

$$
\phi(\Delta W_1, \Delta W_2, i, j) = \frac{||{U_{\Delta W_1^{(n)}}^{(i)}}^T U_{\Delta W_2^{(n)}}^{(j)}||_F^2}{\min(i, j)}
$$

for a weight matrix $\Delta W_1$ from the first LoRA, and the corresponding $\Delta W_2$ from the second LoRA. A small overlap could indicate that merging the two LoRAs will create a more general model, able to produce a wider range of styles. The measure might also help explain why two LoRAs create something very muddy or undesirable when merged. Obviously, this is all conjecture based on a mathematical analysis that needs to be tested, and it does not provide a precise threshold for the overlap. What upper or lower bound might we use for this subspace similarity measure $\phi$? Could this hypothesis be wrong, or inverted? That is, is it possible that in some cases we actually want *high* overlap between models so that we merge very similar concepts?
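One reference point for a lower bound is the value $\phi$ takes for subspaces that have nothing to do with each other. For uniformly random subspaces of dimensions $i$ and $j$ in $\mathbb{R}^d$, the expected squared Frobenius norm is $ij/d$, so the expected $\phi$ is $\max(i, j)/d$. The sketch below checks this by simulation; the dimensions and helper names are arbitrary choices for illustration, not values taken from the LoRAs above.

```python
import numpy as np

def random_basis(d, k, rng):
    """Orthonormal basis of a random k-dimensional subspace of R^d."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q

def phi_full(U1, U2):
    """phi using every column of U1 and U2."""
    k = min(U1.shape[1], U2.shape[1])
    return float(np.linalg.norm(U1.T @ U2, "fro") ** 2 / k)

rng = np.random.default_rng(0)
d, i, j = 512, 16, 32   # hypothetical ambient dimension and subspace sizes

samples = [phi_full(random_basis(d, i, rng), random_basis(d, j, rng))
           for _ in range(200)]
baseline_mc = float(np.mean(samples))
baseline_formula = max(i, j) / d   # = i * j / (d * min(i, j))

print(baseline_mc, baseline_formula)
```

A measured $\phi$ is only evidence of shared structure to the extent that it exceeds this chance level for the same $i$, $j$ and $d$.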

In [14]:
import torch

# The A matrices
A1 = lora1_tensors['lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight'].float()
A2 = lora2_tensors['lora_te_text_model_encoder_layers_0_mlp_fc1.lora_down.weight'].float()

# The B matrices
B1 = lora1_tensors['lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight'].float()
B2 = lora2_tensors['lora_te_text_model_encoder_layers_0_mlp_fc1.lora_up.weight'].float()

# Compute the update matrices
Delta_W1 = torch.matmul(B1, A1)
Delta_W2 = torch.matmul(B2, A2)

# Compute the SVD of the update matrices
U1, _, _ = torch.svd(Delta_W1)
U2, _, _ = torch.svd(Delta_W2)

# Define the subspace similarity measure
def phi(U1, U2, i, j):
    U1_i = U1[:, :i]  # First i columns of U1
    U2_j = U2[:, :j]  # First j columns of U2
    
    product = torch.matmul(U1_i.t(), U2_j)  # Matrix multiplication
    norm = torch.norm(product)  # Frobenius norm
    
    return norm ** 2 / min(i, j)

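# Calculate the subspace similarity measure
# NOTE: i and j below cover every column returned by torch.svd, not just the LoRA rank.
# Delta_W has rank at most A1.shape[0], so the columns of U beyond that index are an
# arbitrary orthonormal completion; use i, j <= A1.shape[0] to compare learned directions.
i = U1.size(1)  # Number of columns in U1
j = U2.size(1)  # Number of columns in U2
result = phi(U1, U2, i, j)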

print(f'The subspace similarity measure is: {result}')


The subspace similarity measure is: 0.26528483629226685


In [17]:
import torch

# Number of text-encoder layers to compare (indices 0..11 in the key names).
# Counting keys does not give this directly: each module stores alpha, lora_down and lora_up.
num_layers = 12

# Define the subspace similarity measure
def phi(U1, U2, i, j):
    U1_i = U1[:, :i]  # First i columns of U1
    U2_j = U2[:, :j]  # First j columns of U2
    
    product = torch.matmul(U1_i.t(), U2_j)  # Matrix multiplication
    norm = torch.norm(product)  # Frobenius norm
    
    return norm ** 2 / min(i, j)

# Iterate over all layers
for layer in range(num_layers):
    # Extract the corresponding A and B matrices
    A1 = lora1_tensors[f'lora_te_text_model_encoder_layers_{layer}_mlp_fc1.lora_down.weight'].float()
    B1 = lora1_tensors[f'lora_te_text_model_encoder_layers_{layer}_mlp_fc1.lora_up.weight'].float()
    
    A2 = lora2_tensors[f'lora_te_text_model_encoder_layers_{layer}_mlp_fc1.lora_down.weight'].float()
    B2 = lora2_tensors[f'lora_te_text_model_encoder_layers_{layer}_mlp_fc1.lora_up.weight'].float()

    # Compute the update matrices
    Delta_W1 = torch.matmul(B1, A1)
    Delta_W2 = torch.matmul(B2, A2)

    # Compute the SVD of the update matrices
    U1, _, _ = torch.svd(Delta_W1)
    U2, _, _ = torch.svd(Delta_W2)

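    # Calculate the subspace similarity measure
    # NOTE: i and j span all columns of U; beyond the LoRA rank (A1.shape[0])
    # those columns are arbitrary completion vectors, not learned directions.
    i = U1.size(1)  # Number of columns in U1
    j = U2.size(1)  # Number of columns in U2
    result = phi(U1, U2, i, j)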

    print(f'The subspace similarity measure for layer {layer} is: {result}')


The subspace similarity measure for layer 0 is: 0.26528483629226685
The subspace similarity measure for layer 1 is: 0.26683029532432556
The subspace similarity measure for layer 2 is: 0.27087515592575073
The subspace similarity measure for layer 3 is: 0.2644103169441223
The subspace similarity measure for layer 4 is: 0.2637903094291687
The subspace similarity measure for layer 5 is: 0.2647744119167328
The subspace similarity measure for layer 6 is: 0.26757362484931946
The subspace similarity measure for layer 7 is: 0.26692089438438416
The subspace similarity measure for layer 8 is: 0.26369643211364746
The subspace similarity measure for layer 9 is: 0.2677040100097656
The subspace similarity measure for layer 10 is: 0.26571375131607056
The subspace similarity measure for layer 11 is: 0.28491613268852234


In [23]:
import torch

# Gather all keys and sort them
all_keys = sorted(list(lora1_tensors.keys()))

# Filter keys for lora_down and lora_up pairs
lora_down_keys = [key for key in all_keys if 'lora_down' in key]
lora_up_keys = [key for key in all_keys if 'lora_up' in key]

# Ensure we have matching pairs of keys
assert len(lora_down_keys) == len(lora_up_keys), "Mismatch in number of 'lora_down' and 'lora_up' keys"

# Define the subspace similarity measure
def phi(U1, U2, i, j):
    U1_i = U1[:, :i]  # First i columns of U1
    U2_j = U2[:, :j]  # First j columns of U2
    
    product = torch.matmul(U1_i.t(), U2_j)  # Matrix multiplication
    norm = torch.norm(product)  # Frobenius norm
    
    return norm ** 2 / min(i, j)

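# Iterate over all LoRA modules. `layer` is the position in the sorted key list
# (every module of every encoder layer), not the encoder layer number.
for layer in range(len(lora_down_keys)):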
    try:
        # Extract the corresponding A and B matrices
        A1 = lora1_tensors[lora_down_keys[layer]].float()
        B1 = lora1_tensors[lora_up_keys[layer]].float()

        A2 = lora2_tensors[lora_down_keys[layer]].float()
        B2 = lora2_tensors[lora_up_keys[layer]].float()

        # Print the shapes of A1 and B1 matrices for troubleshooting
        print(f"A1 shape: {A1.shape}")
        print(f"B1 shape: {B1.shape}")

        # Compute the update matrices
        Delta_W1 = torch.matmul(B1, A1)
        Delta_W2 = torch.matmul(B2, A2)

        # Compute the SVD of the update matrices
        U1, _, _ = torch.svd(Delta_W1)
        U2, _, _ = torch.svd(Delta_W2)

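        # Calculate the subspace similarity measure
        # NOTE: i and j span all columns of U; beyond the LoRA rank (A1.shape[0])
        # those columns are arbitrary completion vectors, not learned directions.
        i = U1.size(1)  # Number of columns in U1
        j = U2.size(1)  # Number of columns in U2
        result = phi(U1, U2, i, j)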

        print(f"The subspace similarity measure for layer {layer} is: {result}")

    except RuntimeError as e:
        # Print the layer number and the error message
        print(f"Error occurred at layer {layer}: {e}")


A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([3072, 32])
The subspace similarity measure for layer 0 is: 0.26528483629226685
A1 shape: torch.Size([32, 3072])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 1 is: 0.9999878406524658
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 2 is: 0.9999885559082031
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 3 is: 0.9999887347221375
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 4 is: 0.9999908804893494
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([768, 32])
The subspace similarity measure for layer 5 is: 0.9999882578849792
A1 shape: torch.Size([32, 768])
B1 shape: torch.Size([3072, 32])
The subspace similarity measure for layer 6 is: 0.26571375131607056
A1 shape: torch.Size([32, 3072])
B1 shape: torch.Size([768, 32])
The subspace 

Let's break this down once more. 

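From the above output, the modules whose update matrix has 768 rows score about 1.0, while the `mlp_fc1` modules (3072 rows) score about 0.26. Before reading meaning into that split, note that `i` and `j` were set to the full column count of `U1` and `U2` (768), not to the LoRA rank (32). When the update has 768 rows, `U` is then a complete orthonormal basis of $\mathbb{R}^{768}$, so $\phi = 1$ regardless of what was learned. When it has 3072 rows, two unrelated 768-dimensional subspaces of $\mathbb{R}^{3072}$ already give $\phi \approx 768/3072 = 0.25$. The values say something about the LoRAs only once `i` and `j` are restricted to at most the rank. With that in place, the questions remain: how can we better understand how many of the overlaps contribute to a more general or expressive model? In other words, if a significant portion of the update matrices have high (or low) overlap, what does this mean, and what number or proportion of the update matrices can we call "significant" in this case?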
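One way to make the comparison depend only on learned directions is to truncate `U` to the LoRA rank before computing $\phi$. The sketch below contrasts the two choices on small random stand-ins (96 in place of 768, rank 4 in place of 32); the matrices are synthetic, so the numbers say nothing about `lora1` and `lora2` themselves.

```python
import numpy as np

def phi(U1, U2, i, j):
    """Normalized overlap between the first i columns of U1 and the first j columns of U2."""
    M = U1[:, :i].T @ U2[:, :j]
    return float(np.linalg.norm(M, "fro") ** 2 / min(i, j))

rng = np.random.default_rng(0)
d_out, d_in, r = 96, 96, 4   # hypothetical stand-ins for 768, 768 and rank 32

B1, A1 = rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in))
B2, A2 = rng.standard_normal((d_out, r)), rng.standard_normal((r, d_in))

# Left singular vectors of each rank-r update; U has shape (96, 96)
U1 = np.linalg.svd(B1 @ A1, full_matrices=False)[0]
U2 = np.linalg.svd(B2 @ A2, full_matrices=False)[0]

# Using every column: both U matrices are full orthogonal bases, so phi is forced to 1
phi_all_columns = phi(U1, U2, U1.shape[1], U2.shape[1])

# Using only the first r columns: compares the two learned r-dimensional subspaces
phi_rank_only = phi(U1, U2, r, r)

print(phi_all_columns, phi_rank_only)
```

With real LoRA tensors, `r` would be `A1.shape[0]`, and the same truncation applies to the `torch.svd` outputs in the cells above.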

LoRA (Low-Rank Adaptation) is a technique used to fine-tune large language models with a significantly lower computational cost compared to traditional methods. The key idea behind LoRA is to restrict the updates during the fine-tuning process to a low-rank subspace of the parameter space.

Given a large pre-trained model with parameters $\Theta$, LoRA introduces two sets of new parameters $A$ and $B$ for each adapted weight matrix, such that the update to that matrix during fine-tuning can be written as $\Delta W = BA$, where $A \in \mathbb{R}^{k \times m}$ and $B \in \mathbb{R}^{n \times k}$. Here $n$ is the output dimension of the layer, $m$ is its input dimension, and $k$ is the rank of the update, typically much smaller than $n$ and $m$. In the `mlp_fc1` modules above, for example, $A$ is $32 \times 768$ and $B$ is $3072 \times 32$. This formulation ensures that $\Delta W$ has rank at most $k$: its columns lie in a subspace of $\mathbb{R}^n$ of dimension at most $k$, and its rows in a subspace of $\mathbb{R}^m$ of dimension at most $k$.
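The shapes and the rank bound can be checked directly. This sketch uses random matrices with the `mlp_fc1` shapes printed earlier (the values themselves are synthetic) and compares the parameter count of the factors with that of a dense update.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 3072, 768, 32           # output dim, input dim, LoRA rank

B = rng.standard_normal((n, k))   # plays the role of lora_up.weight
A = rng.standard_normal((k, m))   # plays the role of lora_down.weight
delta_W = B @ A                   # shape (n, m)

rank = int(np.linalg.matrix_rank(delta_W))
dense_params = n * m              # parameters in a full update
lora_params = k * (n + m)         # parameters in the two factors

print(delta_W.shape, rank, dense_params, lora_params)
```

The dense update would need 2,359,296 parameters, while the two factors hold 122,880.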

The code provided above computes a subspace similarity metric between pairs of update matrices in two different LoRA models. Specifically, for each layer $n$ in the models, it computes the update matrices $\Delta W_1^{(n)} = B_1^{(n)}A_1^{(n)}$ and $\Delta W_2^{(n)} = B_2^{(n)}A_2^{(n)}$ and then calculates the similarity between the subspaces spanned by $\Delta W_1^{(n)}$ and $\Delta W_2^{(n)}$. 

The subspace similarity metric $\phi$ is computed as follows:

$$
\phi(\Delta W_1, \Delta W_2, i, j) = \frac{||{U_{\Delta W_1^{(n)}}^{(i)}}^T U_{\Delta W_2^{(n)}}^{(j)}||_F^2}{\min(i, j)}
$$

Here, $U_{\Delta W_1^{(n)}}^{(i)}$ and $U_{\Delta W_2^{(n)}}^{(j)}$ are the first $i$ and $j$ left singular vectors of $\Delta W_1^{(n)}$ and $\Delta W_2^{(n)}$, respectively, obtained through singular value decomposition. The operator $||\cdot||_F$ denotes the Frobenius norm, and $\min(i, j)$ ensures the normalization of the metric to the smaller dimension of the two subspaces.

This metric $\phi$ effectively measures how aligned the two subspaces are. If the two subspaces are similar, their corresponding left singular vectors will be closely aligned, leading to a large Frobenius norm of their product, and thus a large value of $\phi$. Conversely, if the subspaces are dissimilar, their left singular vectors will not align well, resulting in a smaller Frobenius norm and a smaller $\phi$.

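This is related to the Grassmann manifold because the set of all $k$-dimensional subspaces of an $n$-dimensional vector space forms a Grassmann manifold $G(n, k)$. Strictly speaking, $\phi$ is a similarity rather than a distance on this manifold: when $i = j = k$, the squared Frobenius norm in the numerator is the sum of squared cosines of the principal angles between the two subspaces, so $1 - \phi$ is the squared chordal (projection) distance divided by $k$. Either form lets us quantify the difference between the subspaces spanned by the updates of different LoRA models.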
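The link to distances on the Grassmann manifold goes through principal angles: the singular values of $U_1^T U_2$ are the cosines of those angles. A small numerical check, assuming two random 5-dimensional subspaces of $\mathbb{R}^{50}$ chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 5   # hypothetical ambient dimension and subspace dimension

U1, _ = np.linalg.qr(rng.standard_normal((d, k)))
U2, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Singular values of U1^T U2 are the cosines of the principal angles
cosines = np.clip(np.linalg.svd(U1.T @ U2, compute_uv=False), 0.0, 1.0)

phi_from_angles = float(np.sum(cosines ** 2) / k)
phi_from_norm = float(np.linalg.norm(U1.T @ U2, "fro") ** 2 / k)

# Squared chordal distance: sum of squared sines of the principal angles
chordal_sq = float(np.sum(1.0 - cosines ** 2))

print(phi_from_angles, phi_from_norm, chordal_sq)
```

The two ways of computing $\phi$ agree, and the squared chordal distance equals $k(1 - \phi)$.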

When interpreting the subspace similarity metric, a high value (like 0.99 or above) indicates that the corresponding update matrices (and hence the associated subspaces) of the two LoRA models are very similar or aligned. This suggests that the two models are learning very similar "concepts" or features.

In the context of a text-to-image model like DALL-E or Stable Diffusion, this might mean that the two LoRA models are both focusing on the same kind of visual or textual features in the data, and making similar updates during training to capture these features. If these models are meant to learn different concepts or tasks, this high similarity might indicate that they are failing to differentiate between these tasks and are instead learning similar representations.

Conversely, a low value of the subspace similarity metric, close to the chance level of roughly $\max(i, j)/d$ for unrelated subspaces of $\mathbb{R}^d$, indicates that the corresponding update matrices of the two models are quite different or unaligned. This suggests that the two models are learning different "concepts" or features.

Again, in the context of a text-to-image model, this might mean that the two LoRA models are focusing on different kinds of visual or textual features in the data, and making different updates during training to capture these features. If these models are meant to learn different concepts or tasks, this low similarity could be a good sign that they are successfully differentiating between these tasks and learning distinct representations.

Of course, these interpretations are only rough indications and might not capture the full complexity of what's happening in the models. It's also important to remember that the subspace similarity metric is just one of many ways to measure the similarity between models, and it might not always reflect the actual functional or semantic similarity between the models. For a more comprehensive understanding of what the models are learning, you would likely need to use a combination of different analysis techniques, including both quantitative measures like the subspace similarity metric and qualitative methods like visual inspection of the generated outputs.

Merging two models that have high subspace overlaps for a significant portion of their update matrices can be a challenging task. The high overlap suggests that the models are learning similar features or concepts, so simply averaging the update matrices might not result in a meaningful combination of the models' knowledge.

One way to merge the models might be to use a weighted average of the update matrices, where the weights are determined based on some measure of the models' performance or relevance to the task at hand. For instance, if one model performs better on a validation set or is known to be more relevant to the task, you could assign it a higher weight. This would give more influence to the better or more relevant model in the merged model.

However, this approach assumes that the models' update matrices can be meaningfully averaged, which might not be the case if the models are learning very different features or concepts, or if their update matrices lie in different parts of the parameter space.
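The weighted average described above can be sketched as follows. Everything here is an assumption made for illustration: the function name `merge_lora`, the weights 0.7 and 0.3, the small random factors, and the choice to re-factor the merged update with a truncated SVD. Any `alpha` scaling stored in the files is assumed to be folded into the factors already.

```python
import numpy as np

def merge_lora(B1, A1, B2, A2, w1, w2, rank):
    """Weighted sum of two LoRA updates, re-factored into rank-`rank` factors."""
    delta = w1 * (B1 @ A1) + w2 * (B2 @ A2)
    U, S, Vh = np.linalg.svd(delta, full_matrices=False)
    root = np.sqrt(S[:rank])
    B = U[:, :rank] * root            # scale columns by sqrt of singular values
    A = root[:, None] * Vh[:rank]     # scale rows by sqrt of singular values
    return B, A

rng = np.random.default_rng(0)
d_out, d_in, k = 64, 48, 4   # hypothetical layer shape and LoRA rank

B1, A1 = rng.standard_normal((d_out, k)), rng.standard_normal((k, d_in))
B2, A2 = rng.standard_normal((d_out, k)), rng.standard_normal((k, d_in))

# The sum of two rank-k updates has rank at most 2k, so rank=2k loses nothing
B_m, A_m = merge_lora(B1, A1, B2, A2, w1=0.7, w2=0.3, rank=2 * k)

target = 0.7 * (B1 @ A1) + 0.3 * (B2 @ A2)
max_err = float(np.abs(B_m @ A_m - target).max())
print(B_m.shape, A_m.shape, max_err)
```

Passing `rank=k` instead keeps the merged LoRA at the original size, at the cost of discarding the weakest directions. How much is lost depends on how far the two subspaces overlap.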

If the models' update matrices are very close together in the Grassmann manifold (i.e., they have high subspace overlap), and you want to push them further apart, you might need to modify the models' training process to encourage them to learn more diverse features or concepts. This could be done by adding a regularization term to the loss function that penalizes similarity between the models' update matrices, effectively pushing their subspaces apart in the Grassmann manifold.

Such a regularization term could be based on the subspace similarity metric itself, or on some other measure of similarity between the update matrices. The exact form of the regularization term would depend on the specific characteristics of your models and your training setup, and would likely require some experimentation to find the best approach.
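As a rough sketch of a penalty built from the similarity metric itself, the function below measures the overlap between the column space of a trainable `B` matrix and a fixed reference basis. The names `overlap_penalty`, `U_ref` and `lam` are hypothetical. It also assumes that `A` has full row rank, so that the column space of $\Delta W = BA$ equals that of `B`. A real training loop would need a differentiable version in an autograd framework; this NumPy version only shows the quantity being penalized.

```python
import numpy as np

def overlap_penalty(B_train, U_ref):
    """phi between the column space of B_train and the span of U_ref's orthonormal columns."""
    Q, _ = np.linalg.qr(B_train)   # orthonormal basis for the column space of B_train
    k = min(Q.shape[1], U_ref.shape[1])
    return float(np.linalg.norm(Q.T @ U_ref, "fro") ** 2 / k)

rng = np.random.default_rng(0)
d, k = 40, 4   # hypothetical output dimension and LoRA rank

Q, _ = np.linalg.qr(rng.standard_normal((d, 2 * k)))
U_ref = Q[:, :k]                                   # reference subspace

B_same = U_ref @ rng.standard_normal((k, k))       # same column space as U_ref
B_orth = Q[:, k:] @ rng.standard_normal((k, k))    # column space orthogonal to U_ref

penalty_same = overlap_penalty(B_same, U_ref)
penalty_orth = overlap_penalty(B_orth, U_ref)

lam = 0.1         # hypothetical regularization strength
task_loss = 1.0   # placeholder for the usual training loss
total_loss = task_loss + lam * penalty_same

print(penalty_same, penalty_orth, total_loss)
```

The penalty is close to 1 when the trainable update spans the reference subspace and close to 0 when the two are orthogonal, so minimizing it pushes the subspaces apart.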

It's worth noting that pushing the models' update matrices further apart in the Grassmann manifold is not always desirable. If the models are learning similar features or concepts because these are important for the task at hand, forcing them to diverge might hurt their performance. It's important to carefully consider the implications of pushing the models apart before deciding to implement such a strategy.

Adding a regularization term that pushes the subspaces apart can be inappropriate in a few scenarios:

1. **Shared Essential Features**: If the models are learning similar features because these features are fundamental or essential for the task at hand, forcing the models to diverge might result in them missing out on these essential features, thereby hurting their performance.

2. **Task Similarity**: If the models are designed to solve very similar tasks, or tasks with a high degree of overlap, pushing the subspaces apart may not make sense. In this case, the models might naturally learn similar representations, and forcing divergence could lead to less optimal solutions.

3. **Increased Complexity**: Adding a regularization term to push the subspaces apart increases the complexity of the model and the training process. If the benefits of divergence (e.g., increased diversity of learned features) do not outweigh the costs (e.g., increased computational resources, potential for overfitting), it may not be appropriate to use this approach.

4. **Model Interpretability**: Adding a subspace divergence term could make the model more difficult to interpret, as it's another factor influencing the learning process that needs to be accounted for when analyzing the model's behavior. If interpretability is a key concern, it may be inappropriate to add this complexity.

5. **Lack of Evidence for Improvement**: If there's no empirical evidence or theoretical justification suggesting that pushing the subspaces apart will improve the performance of the models on the task at hand, it might be inappropriate to use this approach. It's generally recommended to have a clear hypothesis or rationale for any changes made to the model or training process.

These are general guidelines, and the specifics will depend on the particular use case, the models being used, and the tasks they're designed to solve. It's always important to carefully consider the potential implications and trade-offs of any changes to the model or training process.

## References

[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)