Integrate mHC with DeepSeek custom model#3115
Conversation
60b389b to
d629c9c
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
8a7e2a0 to
5b938c0
Compare
2fdb6e1 to
222286f
Compare
There was a problem hiding this comment.
Thanks for the integration and testing!
Update mHC a little bit to align with normalization (flexible with pre-norm and post-norm). Please note mHC also has a norm across last k * dim dimension inside.
I’m not sure I fully follow this part. My understanding was that HC is designed to replace residual connection (either pre-norm and post-norm form), rather than co-exist. HC subsumes both forms and overcomes their limitation, ref from HC paper Page2.
[Update] I need another look at mhc paper, this
d3c1185 to
22890d5
Compare
I think that's right. HC has learnable weights and scale to make flexible as pre/post-norm. This makes it much easier! Let me update it back to original implementation. |
8e20dbd to
cc0e744
Compare
Discussed offline. There are misalignment between DS mHC and original HC implementation (HC has pre-norm, weights shape are different to mHC, and HC does not utilize mapping for residuals) . We would like to wait for the clarification when official implementation is available. Also added a comment in the codebase. |
cc0e744 to
eefb134
Compare
eefb134 to
9e783cf
Compare
9e783cf to
0f04370
Compare
0f04370 to
0147f4b
Compare
|
Thanks for the update, LGTM! I have high confidence that current implementation aligns with both papers.
|
Description
decoders.pywhen feature is enabled.General pre-norm when mHC feature is disabled:
When mHC feature is enabled:
Tests
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-reviewlabel.