Hi, is there a way to enable the parallel residual connection, similar to the HF GPT-NeoX [use_parallel_residual](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/configuration_gpt_neox.py#L78C9-L78C30) config option, to speed up training? @ksivaman If this isn't currently supported, are there any plans to add it?
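
For context, "parallel residual" in GPT-NeoX means the attention and MLP branches both read the same layer input and their outputs are summed (`x + attn(ln1(x)) + mlp(ln2(x))`), instead of chaining the MLP after the attention residual. Below is a minimal PyTorch sketch contrasting the two variants using plain `nn` modules; this is just an illustration of the technique, not this library's actual API, and the module/parameter names are my own:

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Toy transformer block showing sequential vs. parallel residual."""

    def __init__(self, hidden_size: int, num_heads: int, use_parallel_residual: bool = True):
        super().__init__()
        self.use_parallel_residual = use_parallel_residual
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_parallel_residual:
            # Parallel (GPT-NeoX style): both branches take the same input x,
            # so attention and MLP are independent and can overlap in execution.
            h = self.ln_attn(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            mlp_out = self.mlp(self.ln_mlp(x))
            return x + attn_out + mlp_out
        else:
            # Sequential (GPT-2 style): MLP consumes the post-attention residual,
            # creating a dependency between the two sub-blocks.
            h = self.ln_attn(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
            return x + self.mlp(self.ln_mlp(x))


# Quick check: both variants preserve the (batch, seq, hidden) shape.
x = torch.randn(2, 16, 64)
print(Block(64, 4, use_parallel_residual=True)(x).shape)
print(Block(64, 4, use_parallel_residual=False)(x).shape)
```

The speedup comes from removing the data dependency between the attention and MLP sub-blocks, which allows their computations (and in tensor-parallel setups, their communication) to be overlapped or fused.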