Hi, is there a way to enable the parallel residual connection, similar to the HF GPT-NeoX [use_parallel_residual](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/configuration_gpt_neox.py#L78C9-L78C30) config option, to speed up training? @ksivaman If this isn't currently supported, are there any plans to add it?
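
For context, "parallel residual" in GPT-NeoX means the attention and MLP branches both read the same layer input and their outputs are summed (`x + attn(ln1(x)) + mlp(ln2(x))`), instead of chaining the MLP after the attention residual. Below is a minimal PyTorch sketch contrasting the two variants using plain `nn` modules; this is just an illustration of the technique, not this library's actual API, and the module/parameter names are my own:

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Toy transformer block showing sequential vs. parallel residual."""

    def __init__(self, hidden_size: int, num_heads: int, use_parallel_residual: bool = True):
        super().__init__()
        self.use_parallel_residual = use_parallel_residual
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_parallel_residual:
            # Parallel (GPT-NeoX style): both branches take the same input x,
            # so attention and MLP are independent and can overlap in execution.
            h = self.ln_attn(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            mlp_out = self.mlp(self.ln_mlp(x))
            return x + attn_out + mlp_out
        else:
            # Sequential (GPT-2 style): MLP consumes the post-attention residual,
            # creating a dependency between the two sub-blocks.
            h = self.ln_attn(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            return x + self.mlp(self.ln_mlp(x))


# Quick check: both variants preserve the (batch, seq, hidden) shape.
x = torch.randn(2, 16, 64)
print(Block(64, 4, use_parallel_residual=True)(x).shape)
print(Block(64, 4, use_parallel_residual=False)(x).shape)
```

The speedup comes from removing the data dependency between the attention and MLP sub-blocks, which allows their computations (and in tensor-parallel setups, their communication) to be overlapped or fused.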