Why is there an F.conv1d in the forward() of the MergedLinear? #17

Closed
vgoklani opened this issue May 20, 2023 · 3 comments

Comments

@vgoklani

I saw this in the original implementation too:

https://github.com/microsoft/LoRA/blob/main/loralib/layers.py#L248

Why do we need to use an F.conv1d as opposed to just using F.linear? That would also avoid all these transposes.

Thanks for sharing your work!

@Andrei-Aksionov
Owner

Andrei-Aksionov commented May 21, 2023

The short answer is: I am not sure 🤷.
That's where my limited experience becomes obvious.

I have three suspicions:

  1. Perhaps there is a performance benefit to using conv1d on specific hardware, though online I've only seen claims of the opposite, like this. In the meantime I tested it on Google Colab myself and saw somewhat more interesting results: yes, on CPU conv1d is slower, but the opposite is true when running on GPU (see the rough benchmark sketch after this list). You can check the notebook here.
  2. As you can see from this stackoverflow post, conv1d and linear are not fully identical. Those differences might come from two factors: a) different initialization; b) "However, it should be pointed out the operator used in Conv1d is a 2D cross-correlation operator which measures the similarity of two series" (the last answer in the stackoverflow post).
  3. Group convolution might come into play. Though I don't have any significant experience with grouped convolutions, from what I've read and seen on diagrams, each group is applied to "its" portion of the input features. Since with MergedLinear the weight matrix is a combination of the weights for the query, key and value matrices, perhaps it's beneficial to have a separate set of kernels for the corresponding section of the combined matrix. But of course this is only my assumption and I might be seriously wrong.
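
To make point 1 a bit more concrete, here is a rough micro-benchmark sketch of my own (not the linked Colab notebook): it times F.linear against an equivalent kernel-size-1 F.conv1d on whatever device is available. The shapes are made up and the numbers will of course vary by hardware, so treat it only as a template for the comparison.

# Rough sketch, not repo code: compare F.linear vs an equivalent kernel-size-1 F.conv1d.
import time

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 512, 1024, device=device)  # (batch, seq_len, in_features) -- arbitrary sizes
w = torch.randn(1024, 1024, device=device)     # (out_features, in_features)

def bench(fn, iters=50):
    for _ in range(3):  # warm-up
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

t_linear = bench(lambda: F.linear(x, w))
t_conv = bench(lambda: F.conv1d(x.transpose(-2, -1), w.unsqueeze(-1)).transpose(-2, -1))
print(f"{device}: F.linear {t_linear * 1e3:.2f} ms, F.conv1d {t_conv * 1e3:.2f} ms")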

None of the options above explains why we use F.linear for self.lora_A but F.conv1d for self.lora_B.

after_A = F.linear(  # noqa: N806
    self.lora_dropout(x),
    self.lora_A,
)  # (64, 64, 128) @ (4, 128) -> (64, 64, 4)
# For F.conv1d:
# - input: input tensor of shape (minibatch, in_channels, iW)
# - weight: filters of shape (out_channels, in_channels/groups, kW)
# - groups: split input into groups, in_channels should be divisible by the number of groups. Default: 1
# presumably iW - sequence width/length, kW - kernel width
after_B = F.conv1d(  # noqa: N806
    after_A.transpose(-2, -1),   # (64, 64, 4) -> (64, 4, 64)
    self.lora_B.unsqueeze(-1),   # (256, 2) -> (256, 2, 1)
    groups=sum(self.enable_lora),
).transpose(-2, -1)  # (64, 4, 64) @ (256, 2, 1) -> (64, 256, 64) -> (64, 64, 256)
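
For what it's worth, here is a minimal sketch of my own (not code from this repo) that checks the grouped-convolution reading of suspicion 3 on the shapes from the comments above: with groups=2 and a kernel width of 1, the F.conv1d call gives exactly the same result as applying a separate F.linear to each half of after_A with the corresponding half of lora_B.

# My own sanity check, not repo code: grouped conv1d (kernel width 1) vs per-group F.linear.
import torch
import torch.nn.functional as F

# Shapes mirror the comments above: batch=64, seq=64, r=2 per enabled layer (4 total),
# out_features=256, and 2 enabled LoRA layers (= groups).
batch, seq, groups = 64, 64, 2
r, out_features = 2, 256
after_A = torch.randn(batch, seq, r * groups)  # (64, 64, 4)
lora_B = torch.randn(out_features, r)          # (256, 2)

# Grouped-conv path, as in the snippet above
conv_out = F.conv1d(
    after_A.transpose(-2, -1),  # (64, 4, 64)
    lora_B.unsqueeze(-1),       # (256, 2, 1)
    groups=groups,
).transpose(-2, -1)             # (64, 64, 256)

# Per-group linear path: each half of lora_B only ever sees "its" half of after_A
half = out_features // groups
linear_out = torch.cat(
    [
        F.linear(after_A[..., :r], lora_B[:half]),  # group 0 -> (64, 64, 128)
        F.linear(after_A[..., r:], lora_B[half:]),  # group 1 -> (64, 64, 128)
    ],
    dim=-1,
)                               # (64, 64, 256)

print(torch.allclose(conv_out, linear_out, atol=1e-5))  # expected: True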

Perhaps it's something that is obvious to any experienced FAANG ML engineer. Like a secret code you need to say when entering some upscale speakeasy club 😄. Or this approach worked well on some previous project and someone decided to use it here. Or maybe, with the help of almost unlimited computational resources, Microsoft engineers just ran a grid search and this combo of Linear and Conv1d layers worked the best 🤷‍♂️.

Basically, that's why I hope that in the future researchers will release not only the implementation of a paper but also additional notes on the implementation.


Maybe we should keep this issue open so I can revisit it when I have more info.

@Andrei-Aksionov
Owner

Thanks for sharing your work!

You are welcome.
I hope I was helpful.

@Andrei-Aksionov
Owner

Hello @vgoklani
So it looks like option number 3 is for the win.

In a nutshell:

  1. for lora_A we use nn.Linear simply because the input is the same for all three parts of the combined qkv matrix (that is, query, key and value), and there is no need to add extra code to process anything independently.
  2. for lora_B we use nn.Conv1d with the groups parameter because we want to process each part independently.
    From the nn.Conv1d docs:

At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels and producing half the output channels, and both subsequently concatenated.

So by setting the number of groups equal to the number of enabled LoRA layers, we can process each part of the weight matrix independently in a single pass.
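
As a small illustration of that docs quote (a toy example of my own, not code from this repo): a conv with groups=2 produces the same output as two independent convs, each seeing half the input channels and half the filters, with their results concatenated.

# Toy illustration of the groups=2 behaviour described in the nn.Conv1d docs.
import torch
import torch.nn.functional as F

x = torch.randn(8, 4, 16)           # (batch, in_channels=4, width)
w = torch.randn(6, 2, 1)            # (out_channels=6, in_channels/groups=2, kernel=1)

grouped = F.conv1d(x, w, groups=2)  # (8, 6, 16)

side_by_side = torch.cat(
    [
        F.conv1d(x[:, :2], w[:3]),  # first half of channels -> first half of filters
        F.conv1d(x[:, 2:], w[3:]),  # second half of channels -> second half of filters
    ],
    dim=1,
)

print(torch.allclose(grouped, side_by_side, atol=1e-6))  # expected: True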


That's essentially the same as what I wrote before. The only remaining question was why we have a Linear layer and, right after it, a Conv layer (and not, let's say, Conv + Conv). Well, the answer turned out to be pretty simple. 😄
