Recently graph safe support has been added to te Sequential GroupedLinear Op https://github.com/NVIDIA/TransformerEngine/pull/2923. But it suffers from CPU overheads. Nail the bottlenecks and fix them
Recently graph safe support has been added to te Sequential GroupedLinear Op #2923.
But it suffers from CPU overheads. Nail the bottlenecks and fix them