Initializing Custom Trainer
Downloading (…)/main/tokenizer.json: 0%| | 0.00/1.39M [00:00
gpu-compute-shankar-4x16:350:350 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
gpu-compute-shankar-4x16:350:350 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
gpu-compute-shankar-4x16:350:350 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
gpu-compute-shankar-4x16:350:350 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Plugin Path : /usr/local/nccl-rdma-sharp-plugins/lib/libnccl-net.so
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO P2P plugin IBext
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO NET/IB : No device found.
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.4<0>
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Using network Socket
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 0(=100000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 0(=100000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 0(=100000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 1(=200000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 1(=200000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 2(=300000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 2(=300000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 2(=300000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 3(=400000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 3(=400000)
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Setting affinity for GPU 2 to ffff,00000000
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Channel 00 : 2[300000] -> 3[400000] via SHM/direct/direct
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Channel 01 : 2[300000] -> 3[400000] via SHM/direct/direct
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Connected all rings
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Channel 00 : 2[300000] -> 1[200000] via SHM/direct/direct
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Channel 01 : 2[300000] -> 1[200000] via SHM/direct/direct
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO Connected all trees
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu-compute-shankar-4x16:350:2599 [2] NCCL INFO comm 0x56079b8f8300 rank 2 nranks 4 cudaDev 2 busId 300000 - Init COMPLETE
Missing logger folder: /mnt/azureml/cr/j/30e0e2e93ade4186bd3779f617afd768/exe/wd/lightning_logs
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Using network Socket
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 0(=100000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 0(=100000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 0(=100000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 1(=200000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 1(=200000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 1(=200000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO P2P is disabled between connected GPUs 3 and 2. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 2(=300000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 2(=300000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 3(=400000) and dev 2(=300000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO P2P is disabled between connected GPUs 2 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 0(=100000) and dev 3(=400000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 1(=200000) and dev 3(=400000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Could not enable P2P between dev 2(=300000) and dev 3(=400000)
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Setting affinity for GPU 2 to ffff,00000000
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Channel 00 : 2[300000] -> 3[400000] via SHM/direct/direct
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Channel 01 : 2[300000] -> 3[400000] via SHM/direct/direct
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Connected all rings
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Channel 00 : 2[300000] -> 1[200000] via SHM/direct/direct
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Channel 01 : 2[300000] -> 1[200000] via SHM/direct/direct
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO Connected all trees
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu-compute-shankar-4x16:350:2616 [2] NCCL INFO comm 0x56079b910a60 rank 2 nranks 4 cudaDev 2 busId 300000 - Init COMPLETE
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Loading extension module utils...
Time to load utils op: 22.33176565170288 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.001119852066040039 seconds
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
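The repeated "P2P is disabled" warnings above name their own remedy, and the log confirms `NCCL_IB_DISABLE` was already set in this run. A minimal sketch of the corresponding environment setup, applied before the job launches; `NCCL_DEBUG=WARN` is a standard NCCL knob added here as an assumption (it does not appear in the log):

```shell
# Variables named in the NCCL log above; export before launching the job.
export NCCL_IB_DISABLE=1            # matches the log: "NCCL_IB_DISABLE set by environment to 1"
export NCCL_IGNORE_DISABLED_P2P=1   # silences the repeated "P2P is disabled" messages

# Assumption, not from the log: drop NCCL from INFO to WARN to cut log volume.
export NCCL_DEBUG=WARN
```

Note that these settings only quiet the warnings; with P2P unavailable, NCCL keeps falling back to SHM and socket transports exactly as the channel lines above show.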
Cleanup took 0.10608601570129395 seconds
Traceback (most recent call last):
  File "training_script.py", line 102, in <module>
    train(trainer)
  File "training_script.py", line 91, in train
    trainer.train(train_df=train_df, eval_df=test_df)
  File "/mnt/azureml/cr/j/30e0e2e93ade4186bd3779f617afd768/exe/wd/trainer.py", line 111, in train
    trainer.fit(self.T5Model, self.data_module)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 520, in fit
    call._call_and_handle_interrupt(
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 92, in launch
    return function(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 935, in _run
    results = self._run_stage()
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 978, in _run_stage
    self.fit_loop.run()
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 218, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 185, in run
    self._optimizer_step(kwargs.get("batch_idx", 0), closure)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 261, in _optimizer_step
    call._call_lightning_module_hook(
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 142, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1266, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 158, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 257, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 224, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/deepspeed.py", line 92, in optimizer_step
    closure_result = closure()
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 126, in closure
    step_output = self._step_fn()
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 308, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 288, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 329, in training_step
    return self.model(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 90, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/mnt/azureml/cr/j/30e0e2e93ade4186bd3779f617afd768/exe/wd/model_module.py", line 75, in training_step
    loss, outputs = self(
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/mnt/azureml/cr/j/30e0e2e93ade4186bd3779f617afd768/exe/wd/model_module.py", line 58, in forward
    output = self.model(
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1626, in forward
    encoder_outputs = self.encoder(
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1055, in forward
    layer_outputs = layer_module(
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 687, in forward
    self_attention_outputs = self.layer[0](
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 593, in forward
    attention_output = self.SelfAttention(
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 556, in forward
    attn_weights = nn.functional.dropout(
  File "/azureml-envs/azureml_3263cc21f12e8d16ce50e2ff0b93f3ff/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 890.00 MiB (GPU 2; 14.76 GiB total capacity; 13.28 GiB already allocated; 75.75 MiB free; 13.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
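The OutOfMemoryError above points at `max_split_size_mb`. A minimal sketch of applying that hint, assuming the option is put in the environment before `torch` is imported (the allocator reads `PYTORCH_CUDA_ALLOC_CONF` at startup); the 128 MiB value is illustrative, not taken from this job:

```python
import os

# Must be set before `import torch` so the CUDA caching allocator sees it.
# "max_split_size_mb:128" is an illustrative value, not one used in the run above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

That said, the numbers in the error (13.28 GiB of 14.76 GiB already allocated, only 75.75 MiB free) suggest the model genuinely does not fit at this batch size, so fragmentation tuning alone is unlikely to be enough; reducing the per-GPU batch size, shortening sequence length, or enabling activation checkpointing are the more direct fixes.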