model parallel training research (#616)

* adds network
* adds basic training
* update loading
* working prototype
* update validation set
* [MONAI] Add author; paper info; PDDCA18 (#6)
  + Author
  + Early accept
  + PDDCA18 link
* Update README.md
* adds network
* adds basic training
* update loading
* working prototype
* update validation set
* [MONAI] Update TRAIN_PATH, VAL_PATH (#8)
  + Update TRAIN_PATH, VAL_PATH
* [MONAI] Add data link (#7)
  + Add data link https://drive.google.com/file/d/1A2zpVlR3CkvtkJPvtAF3-MH0nr1WZ2Mn/view?usp=sharing
* fixes typos
* tested new dataset
* print more infor, checked new dataset
* [MONAI] Add paper link (#9)
  Add paper link https://arxiv.org/abs/2006.12575
* [MONAI] Use dice loss + focal loss to train (#10)
  Use dice loss + focal loss to train
* [MONAI] Support non-one-hot ground truth (#11)
  Support non-one-hot ground truth
* fixes format and docstrings, adds argparser options
* resume the focal_loss
* adds tests
* [MONAI] Support non-one-hot ground truth (#11)
  Support non-one-hot ground truth
* adds tests
* update docstring
* [MONAI] Keep track of best validation scores (#12)
  Keep track of best validation scores
* model saving
* adds window sampling
* update readme
* update docs
* fixes flake8 error
* update window sampling
* fixes model name
* fixes channel size issue
* [MONAI] Update --pretrain, --lr (#13)
  + lr from 5e-4 to 1e-3 because we use mean for class channel instead of sum for class channel.
  + pretrain path is consistent with current model_name.
* [MONAI] Pad image; elastic; best class model (#14)
* [MONAI] Pad image; elastic; best class model
  + Pad image bigger than crop_size, avoid potential issues in RandCropByPosNegLabeld
  + Use Rand3DElasticd
  + Save best model for each class
* Update train.py

  Co-authored-by: Wenqi Li <wenqil@nvidia.com>
* flake8 fixes
* removes -1 cropsize deform
* testing commands
* fixes unit tests
* update spatial padding
* [MONAI] Add full image deform augmentation (#15)
  + Add full image deform augmentation by Rand3DElasticd
  + Please use latest MONAI in #623
* Adding py.typed
* updating setup.py to comply with black
* update based on comments
* excluding research from packaging
* update tests
* update setup.py

Co-authored-by: Wentao Zhu <wentaozhu1991@gmail.com>
Co-authored-by: Neil Tenenholtz <ntenenz@users.noreply.github.com>
Co-authored-by: Nic Ma <nma@nvidia.com>
Project-MONAI · Jun 26, 2020 · 379c959
1 parent f262355
commit 379c959
Showing 8 changed files with 595 additions and 1 deletion.
diff --git a/research/lamp-automated-model-parallelism/README.md b/research/lamp-automated-model-parallelism/README.md
@@ -0,0 +1,53 @@
+# LAMP: Large Deep Nets with Automated Model Parallelism for Image Segmentation
+
+<p>
+<img src="./fig/acc_speed_han_0_5hor.png" alt="LAMP on Head and Neck Dataset" width="500"/>
+</p>
+
+
+> If you use this work in your research, please cite the paper.
+
+A reimplementation of the LAMP system originally proposed by:
+
+Wentao Zhu, Can Zhao, Wenqi Li, Holger Roth, Ziyue Xu, and Daguang Xu (2020)
+"LAMP: Large Deep Nets with Automated Model Parallelism for Image Segmentation."
+MICCAI 2020 (Early Accept, paper link: https://arxiv.org/abs/2006.12575)
+
+
+## To run the demo:
+
+### Prerequisites
+- Install the latest version of MONAI: `git clone https://github.com/Project-MONAI/MONAI`, then run `pip install -e .` inside the cloned `MONAI` folder
+- Install torchgpipe: `pip install torchgpipe`
+
+### Data
+```bash
+mkdir ./data;
+cd ./data;
+```
+Head and Neck CT dataset
+
+Please download the archive below and unzip the images into the `./data` folder.
+
+- `HaN.zip`: https://drive.google.com/file/d/1A2zpVlR3CkvtkJPvtAF3-MH0nr1WZ2Mn/view?usp=sharing
+```bash
+unzip HaN.zip;  # unzip
+```
+
+Please find more details of the dataset at https://github.com/wentaozhu/AnatomyNet-for-anatomical-segmentation.git
+
+
+### Minimum hardware requirements for full-image training
+- U-Net (`n_feat=32`): 2x 16 GB GPUs
+- U-Net (`n_feat=64`): 4x 16 GB GPUs
+- U-Net (`n_feat=128`): 2x 32 GB GPUs
+
+
+### Commands
+The number of features in the first block (`--n_feat`) can be 32, 64, or 128.
+```bash
+mkdir ./log;
+python train.py --n_feat=128 --crop_size='64,64,64' --bs=16 --ep=4800  --lr=0.001 > ./log/YOURLOG.log
+python train.py --n_feat=128 --crop_size='128,128,128' --bs=4 --ep=1200 --lr=0.001 --pretrain='./HaN_32_16_1200_64,64,64_0.001_*'  > ./log/YOURLOG.log
+python train.py --n_feat=128 --crop_size='-1,-1,-1' --bs=1 --ep=300 --lr=0.001 --pretrain='./HaN_32_16_1200_64,64,64_0.001_*' > ./log/YOURLOG.log
+```
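The commands above pass the crop size as a comma-separated string and select the model width with `--n_feat`. The following is a minimal sketch of how such options could be parsed with `argparse`. It is not the `train.py` from this commit: the helper `parse_crop_size`, the default values, and the reading of `-1,-1,-1` as "use the full image" are assumptions made for illustration.

```python
import argparse


def parse_crop_size(text):
    """Turn a string such as '64,64,64' into a tuple of three ints (hypothetical helper)."""
    values = tuple(int(v) for v in text.split(","))
    if len(values) != 3:
        raise argparse.ArgumentTypeError("expected three comma-separated integers, got %r" % text)
    return values


# Hypothetical subset of the options used in the README commands; defaults are placeholders.
parser = argparse.ArgumentParser(description="sketch of the training options")
parser.add_argument("--n_feat", type=int, default=32, choices=(32, 64, 128))
parser.add_argument("--crop_size", type=parse_crop_size, default=(64, 64, 64))
parser.add_argument("--bs", type=int, default=1)
parser.add_argument("--ep", type=int, default=150)
parser.add_argument("--lr", type=float, default=5e-4)
parser.add_argument("--pretrain", type=str, default=None)

# The `--name=value` form keeps argparse from mistaking '-1,-1,-1' for an option flag.
args = parser.parse_args(["--n_feat=128", "--crop_size=-1,-1,-1", "--bs=1", "--ep=300", "--lr=0.001"])

# Assumption: a crop size of -1 in every dimension requests full-image training.
full_image = all(v == -1 for v in args.crop_size)
print(args.crop_size, full_image)  # (-1, -1, -1) True
```

Passing the value with `=` matters here: given as a separate token, `-1,-1,-1` is not recognised by `argparse` as a negative number and would be rejected as an unknown option.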
diff --git a/research/lamp-automated-model-parallelism/__init__.py b/research/lamp-automated-model-parallelism/__init__.py
@@ -0,0 +1,10 @@
+# Copyright 2020 MONAI Consortium
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/research/lamp-automated-model-parallelism/data_utils.py b/research/lamp-automated-model-parallelism/data_utils.py
@@ -0,0 +1,66 @@
+# Copyright 2020 MONAI Consortium
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import numpy as np
+from monai.transforms import DivisiblePad
+
+STRUCTURES = (
+    "BrainStem",
+    "Chiasm",
+    "Mandible",
+    "OpticNerve_L",
+    "OpticNerve_R",
+    "Parotid_L",
+    "Parotid_R",
+    "Submandibular_L",
+    "Submandibular_R",
+)
+
+
+def get_filenames(path, maskname=STRUCTURES):
+    """
+    create file names according to the predefined folder structure.
+
+    Args:
+        path: data folder name
+        maskname: target structure names
+    """
+    maskfiles = []
+    for seg in maskname:
+        maskfile = os.path.join(path, "structures", seg + "_crp_v2.npy")
+        if os.path.exists(maskfile):
+            maskfiles.append(maskfile)
+        else:
+            # no mask file for structure `seg` in this case folder; keep a None placeholder
+            maskfiles.append(None)
+    return os.path.join(path, "img_crp_v2.npy"), maskfiles
+
+
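+def load_data_and_mask(data, mask_data):
+    """
+    Load the image file `data` and the mask files `mask_data` (a list of file names,
+    with None for a missing structure) into a dictionary of
+    {"image": array, "label": list of arrays, "name": str}.
+    The image and masks are padded so that each dimension is divisible by 32;
+    a missing mask is replaced by an all-zero array.
+    """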
+    pad_xform = DivisiblePad(k=32)
+    img = np.load(data)  # z y x
+    img = pad_xform(img[None])[0]
+    item = dict(image=img, label=[])
+    for maskfnm in mask_data:
+        if maskfnm is None:
+            ms = np.zeros(img.shape, np.uint8)
+        else:
+            ms = np.load(maskfnm).astype(np.uint8)
+            assert ms.min() == 0 and ms.max() == 1
+        mask = pad_xform(ms[None])[0]
+        item["label"].append(mask)
+    assert len(item["label"]) == 9
+    item["name"] = str(data)
+    return item
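The two helpers in `data_utils.py` rely on a fixed folder layout (`img_crp_v2.npy` next to a `structures/` folder) and pad every volume to a multiple of 32. Here is a minimal, MONAI-free sketch of that flow on a synthetic case. The names `resolve_case_files` and `pad_to_multiple` are hypothetical stand-ins, the structure list is shortened, and splitting the padding evenly between both ends of each axis is an assumption about how `DivisiblePad` behaves by default.

```python
import os
import tempfile

import numpy as np

STRUCTURES = ("BrainStem", "Chiasm", "Mandible")  # shortened list for the demo


def resolve_case_files(case_dir, structures=STRUCTURES):
    """Return (image_path, [mask_path or None, ...]) for one case folder."""
    masks = []
    for name in structures:
        candidate = os.path.join(case_dir, "structures", name + "_crp_v2.npy")
        masks.append(candidate if os.path.exists(candidate) else None)
    return os.path.join(case_dir, "img_crp_v2.npy"), masks


def pad_to_multiple(volume, k=32):
    """Zero-pad every axis of `volume` up to the next multiple of `k`.

    Assumption: the padding is split evenly between both ends of each axis.
    """
    pad_width = []
    for size in volume.shape:
        total = (-size) % k  # number of voxels missing to reach a multiple of k
        pad_width.append((total // 2, total - total // 2))
    return np.pad(volume, pad_width, mode="constant")


with tempfile.TemporaryDirectory() as case_dir:
    # Build a synthetic case: one image and a single available mask.
    os.makedirs(os.path.join(case_dir, "structures"))
    np.save(os.path.join(case_dir, "img_crp_v2.npy"), np.zeros((40, 70, 33), np.float32))
    np.save(os.path.join(case_dir, "structures", "Mandible_crp_v2.npy"), np.ones((40, 70, 33), np.uint8))

    image_path, mask_paths = resolve_case_files(case_dir)
    image = pad_to_multiple(np.load(image_path))
    found = [p is not None for p in mask_paths]

print(image.shape)  # (64, 96, 64)
print(found)  # [False, False, True]
```

Padding to a multiple of 32 presumably keeps the volume size compatible with the repeated downsampling in the U-Net, and the `None` placeholders keep the mask list aligned with the structure order even when a case lacks some annotations.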
diff --git a/research/lamp-automated-model-parallelism/fig/acc_speed_han_0_5hor.png b/research/lamp-automated-model-parallelism/fig/acc_speed_han_0_5hor.png
diff --git a/research/lamp-automated-model-parallelism/test_unet_pipe.py b/research/lamp-automated-model-parallelism/test_unet_pipe.py
@@ -0,0 +1,52 @@
+# Copyright 2020 MONAI Consortium
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import unittest
+
+import torch
+from parameterized import parameterized
+
+from unet_pipe import UNetPipe
+
+TEST_CASES = [
+    [  # 1-channel 3D, batch 12
+        {"spatial_dims": 3, "out_channels": 2, "in_channels": 1, "depth": 3, "n_feat": 8},
+        torch.randn(12, 1, 32, 64, 48),
+        (12, 2, 32, 64, 48),
+    ],
+    [  # 1-channel 3D, batch 16
+        {"spatial_dims": 3, "out_channels": 2, "in_channels": 1, "depth": 3},
+        torch.randn(16, 1, 32, 64, 48),
+        (16, 2, 32, 64, 48),
+    ],
+    [  # 2-channel 3D, batch 16
+        {"spatial_dims": 3, "out_channels": 3, "in_channels": 2},
+        torch.randn(16, 2, 64, 64, 64),
+        (16, 3, 64, 64, 64),
+    ],
+]
+
+
+class TestUNETPipe(unittest.TestCase):
+    @parameterized.expand(TEST_CASES)
+    def test_shape(self, input_param, input_data, expected_shape):
+        net = UNetPipe(**input_param)
+        if torch.cuda.is_available():
+            net = net.to(torch.device("cuda"))
+            input_data = input_data.to(torch.device("cuda"))
+        net.eval()
+        with torch.no_grad():
+            result = net(input_data.float())
+            self.assertEqual(result.shape, expected_shape)
+
+
+if __name__ == "__main__":
+    unittest.main()