Why is there an F.conv1d in the forward() of the MergedLinear? #17

Closed
vgoklani opened this issue May 20, 2023 · 3 comments

Comments

@vgoklani

I saw this in the original implementation too:

https://github.com/microsoft/LoRA/blob/main/loralib/layers.py#L248

Why do we need to use an F.conv1d as opposed to just using F.linear? That would also avoid all these transposes.

Thanks for sharing your work!

@Andrei-Aksionov
Owner

Andrei-Aksionov commented May 21, 2023

The short answer is: I am not sure 🤷.
That's where my limited experience becomes obvious.

I have three suspicions:

  1. Perhaps there is a performance benefit to using conv1d on specific hardware, though online I've only seen claims of the opposite, like this. In the meantime I tested it on Google Colab myself and saw somewhat more interesting results: yes, on CPU conv1d is slower, but the opposite is true when running on GPU (see the rough benchmark sketch after this list). You can check the notebook here.
  2. As you can see from this stackoverflow post, conv1d and linear are not fully identical. Those differences might come from two factors: a) different initialization; b) "However, it should be pointed out the operator used in Conv1d is a 2D cross-correlation operator which measures the similarity of two series" (the last answer in the stackoverflow post).
  3. Group convolution might come into play. Though I don't have any significant experience with grouped convolutions, from what I've read and seen on diagrams, each group is applied to "its" portion of the input features. Since with MergedLinear the weight matrix is a combination of the weights for the query, key and value matrices, perhaps it's beneficial to have a separate set of kernels for the corresponding section of the combined matrix. But of course this is only my assumption and I might be seriously wrong.
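
To make point 1 a bit more concrete, here is a rough micro-benchmark sketch of my own (not the linked Colab notebook): it times F.linear against an equivalent kernel-size-1 F.conv1d on whatever device is available. The shapes are made up and the numbers will of course vary by hardware, so treat it only as a template for the comparison.

# Rough sketch, not repo code: compare F.linear vs an equivalent kernel-size-1 F.conv1d.
import time

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 512, 1024, device=device)  # (batch, seq_len, in_features) -- arbitrary sizes
w = torch.randn(1024, 1024, device=device)     # (out_features, in_features)

def bench(fn, iters=50):
    for _ in range(3):  # warm-up
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

t_linear = bench(lambda: F.linear(x, w))
t_conv = bench(lambda: F.conv1d(x.transpose(-2, -1), w.unsqueeze(-1)).transpose(-2, -1))
print(f"{device}: F.linear {t_linear * 1e3:.2f} ms, F.conv1d {t_conv * 1e3:.2f} ms")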

None of the options above explains why we use F.linear for self.lora_A but F.conv1d for self.lora_B.

after_A = F.linear(  # noqa: N806
    self.lora_dropout(x),
    self.lora_A,
)  # (64, 64, 128) @ (4, 128) -> (64, 64, 4)
# For F.conv1d:
# - input: input tensor of shape (minibatch, in_channels, iW)
# - weight: filters of shape (out_channels, in_channels/groups, kW)
# - groups: split input into groups, in_channels should be divisible by the number of groups. Default: 1
# presumably iW - sequence width/length, kW - kernel width
after_B = F.conv1d(  # noqa: N806
    after_A.transpose(-2, -1),   # (64, 64, 4) -> (64, 4, 64)
    self.lora_B.unsqueeze(-1),   # (256, 2) -> (256, 2, 1)
    groups=sum(self.enable_lora),
).transpose(-2, -1)  # (64, 4, 64) @ (256, 2, 1) -> (64, 256, 64) -> (64, 64, 256)
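
For what it's worth, here is a minimal sketch of my own (not code from this repo) that checks the grouped-convolution reading of suspicion 3 on the shapes from the comments above: with groups=2 and a kernel width of 1, the F.conv1d call gives exactly the same result as applying a separate F.linear to each half of after_A with the corresponding half of lora_B.

# My own sanity check, not repo code: grouped conv1d (kernel width 1) vs per-group F.linear.
import torch
import torch.nn.functional as F

# Shapes mirror the comments above: batch=64, seq=64, r=2 per enabled layer (4 total),
# out_features=256, and 2 enabled LoRA layers (= groups).
batch, seq, groups = 64, 64, 2
r, out_features = 2, 256
after_A = torch.randn(batch, seq, r * groups)  # (64, 64, 4)
lora_B = torch.randn(out_features, r)          # (256, 2)

# Grouped-conv path, as in the snippet above
conv_out = F.conv1d(
    after_A.transpose(-2, -1),  # (64, 4, 64)
    lora_B.unsqueeze(-1),       # (256, 2, 1)
    groups=groups,
).transpose(-2, -1)             # (64, 64, 256)

# Per-group linear path: each half of lora_B only ever sees "its" half of after_A
half = out_features // groups
linear_out = torch.cat(
    [
        F.linear(after_A[..., :r], lora_B[:half]),  # group 0 -> (64, 64, 128)
        F.linear(after_A[..., r:], lora_B[half:]),  # group 1 -> (64, 64, 128)
    ],
    dim=-1,
)                               # (64, 64, 256)

print(torch.allclose(conv_out, linear_out, atol=1e-5))  # expected: True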

Perhaps it's something that is obvious to any experienced FAANG ML engineer. Like a secret code you need to say when entering some upscale speakeasy club 😄. Or this approach worked well on some previous project and someone decided to use it here. Or maybe, with the help of almost unlimited computational resources, Microsoft engineers just ran a grid search and this combo of Linear and Conv1d layers worked the best 🤷‍♂️.

Basically, that's why I hope that in the future researchers will release not only the implementation of a paper but also additional notes on the implementation.


Maybe we should keep this issue open so I can revisit it when I have more info.

@Andrei-Aksionov
Owner

Thanks for sharing your work!

You are welcome.
I hope I was helpful.

@Andrei-Aksionov
Owner

Hello @vgoklani
So it looks like option number 3 is for the win.

In a nutshell:

  1. for lora_A we use nn.Linear simply because the input is the same for all three parts of the combined qkv matrix (that is, query, key and value), and there is no need to add extra code to process anything independently.
  2. for lora_B we use nn.Conv1d with the groups parameter because we want to process each part independently.
    From the nn.Conv1d docs:

At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, and both subsequently concatenated.

So by setting the number of groups equal to the number of enabled LoRA layers, we can process each part of the weight matrix independently in a single pass.
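
As a small illustration of that docs quote (a toy example of my own, not code from this repo): a conv with groups=2 produces the same output as two independent convs, each seeing half the input channels and half the filters, with their results concatenated.

# Toy illustration of the groups=2 behaviour described in the nn.Conv1d docs.
import torch
import torch.nn.functional as F

x = torch.randn(8, 4, 16)           # (batch, in_channels=4, width)
w = torch.randn(6, 2, 1)            # (out_channels=6, in_channels/groups=2, kernel=1)

grouped = F.conv1d(x, w, groups=2)  # (8, 6, 16)

side_by_side = torch.cat(
    [
        F.conv1d(x[:, :2], w[:3]),  # first half of channels -> first half of filters
        F.conv1d(x[:, 2:], w[3:]),  # second half of channels -> second half of filters
    ],
    dim=1,
)

print(torch.allclose(grouped, side_by_side, atol=1e-6))  # expected: True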


That's essentially the same as what I wrote before. The only remaining question was why we have a Linear layer and, right after it, a Conv layer (and not, let's say, Conv + Conv). Well, the answer turned out to be pretty simple. 😄
