Thanks for the excellent work!
In the paper, my understanding is that the LoRA or Adapter modules are tied within each attention layer. I am wondering why the implementation seems to always share the first layer's parameters with the other layers, which instead ties them across the attention layers.
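To make sure I'm reading the code correctly, here is a minimal PyTorch sketch of what I mean by "tied across layers": a single pair of LoRA matrices is created once and then passed by reference into every layer, so all layers update the same parameters. The `LoRALinear` class and all names here are my own illustration, not the repository's actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a low-rank update: W x + B A x (illustrative)."""
    def __init__(self, base: nn.Linear, lora_A: nn.Parameter, lora_B: nn.Parameter):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the LoRA matrices are trainable
        self.lora_A = lora_A  # shape (r, in_features)
        self.lora_B = lora_B  # shape (out_features, r)

    def forward(self, x):
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T

d, r, num_layers = 16, 4, 3

# One shared pair of LoRA matrices, created once (B is zero-initialized
# so the update starts as a no-op, as is standard for LoRA) ...
shared_A = nn.Parameter(torch.randn(r, d) * 0.01)
shared_B = nn.Parameter(torch.zeros(d, r))

# ... and reused by every layer: each layer holds a reference to the SAME tensors,
# which is what "sharing the first layer's parameters" amounts to.
layers = nn.ModuleList(
    LoRALinear(nn.Linear(d, d), shared_A, shared_B) for _ in range(num_layers)
)

# All layers point at one parameter object, so gradients accumulate into it.
assert layers[0].lora_A is layers[num_layers - 1].lora_A
```

By contrast, tying only *within* an attention layer would mean constructing a fresh `(lora_A, lora_B)` pair inside each layer and sharing it only among that layer's projections.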