
multi-GPU training problem #6

Open

LilySys opened this issue Sep 25, 2019 · 10 comments

@LilySys commented Sep 25, 2019

Hi, when I trained the model with multi-GPU training, the model still hadn't started training after more than 30 minutes, and I don't know why. Could you give me some suggestions? Thank you!

2019-09-25 14:56:36,708 reid_baseline.train INFO: More than one gpu used, convert model to use SyncBN.
2019-09-25 14:56:40,504 reid_baseline.train INFO: Using pytorch SyncBN implementation
2019-09-25 14:56:40,535 reid_baseline.train INFO: Trainer Built
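
For reference, recent PyTorch versions expose this conversion as torch.nn.SyncBatchNorm.convert_sync_batchnorm; whether the "pytorch SyncBN implementation" in the log wraps exactly that call is an assumption based only on the log message. A minimal sketch:

```python
import torch.nn as nn
from torchvision.models import resnet50

# Minimal sketch (assumption: the repo's "pytorch SyncBN implementation"
# wraps this standard call); resnet50() is only a placeholder backbone.
model = resnet50()
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
# SyncBatchNorm is meant to be used together with
# torch.nn.parallel.DistributedDataParallel so that batch statistics are
# synchronized across processes.
```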

@DTennant (Owner)

Can you please provide your config file?

@LilySys (Author) commented Sep 26, 2019

OK.

In config.py, I just modified _C.TEST.VIS = True.

In debug_multi-gpu.yml, the configuration is as follows:
```yaml
MODEL:
  PRETRAIN_PATH: '/home/wl/.torch/models/resnet50-19c8e357.pth'

INPUT:
  SIZE_TRAIN: [256, 128]
  SIZE_TEST: [256, 128]
  PIXEL_MEAN: [0.485, 0.456, 0.406]
  PIXEL_STD: [0.229, 0.224, 0.225]
  PROB: 0.5     # random horizontal flip
  RE_PROB: 0.5  # random erasing
  PADDING: 10

DATASETS:
  NAMES: 'retrieval'
  DATA_PATH: '/home/wl/.data/retrieval/vehicle'
  TRAIN_PATH: 'train.txt'
  QUERY_PATH: 'query.txt'
  GALLERY_PATH: 'gallery.txt'

DATALOADER:
  SAMPLER: 'softmax_triplet'
  NUM_INSTANCE: 8
  NUM_WORKERS: 4

SOLVER:
  OPTIMIZER_NAME: 'Adam'
  MAX_EPOCHS: 120
  BASE_LR: 0.00035
  BIAS_LR_FACTOR: 1
  WEIGHT_DECAY: 0.0005
  WEIGHT_DECAY_BIAS: 0.0005
  IMS_PER_BATCH: 128

  STEPS: [40, 70]
  GAMMA: 0.1

  WARMUP_FACTOR: 0.01
  WARMUP_ITERS: 10
  WARMUP_METHOD: 'linear'

  CHECKPOINT_PERIOD: 10
  LOG_PERIOD: 20
  EVAL_PERIOD: 20

TEST:
  IMS_PER_BATCH: 128
  DEBUG: True
  WEIGHT: "path"
  MULTI_GPU: True

OUTPUT_DIR: "/home/wl/.pytorch_project/person_reid/reid_baseline_with_syncbn-master/outputs/20190925"
```
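
For reference, assuming the 'softmax_triplet' sampler follows the usual P×K scheme (NUM_INSTANCE = K images per identity), IMS_PER_BATCH: 128 with NUM_INSTANCE: 8 means 128 / 8 = 16 identities per batch; and if the trainer splits the batch across GPUs (e.g. with DataParallel), each card only sees a fraction of those 128 images.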

@DTennant (Owner)

TRAIN_PATH, QUERY_PATH, and GALLERY_PATH should be the folders containing the images.

@LilySys (Author) commented Sep 26, 2019

Yes, I know. I modified data.py according to my requirements, so I think the problem may not be there.

@DTennant (Owner)

Can you post your data.py?

@LilySys (Author) commented Sep 26, 2019

```python
import torch
import os.path as osp
from PIL import Image
from torch.utils.data import Dataset
import numpy as np
from torchvision import transforms as T
import glob
import re
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True


def read_image(img_path):
    """Keep reading image until succeed.
    This can avoid IOError incurred by heavy IO process."""
    got_img = False
    if not osp.exists(img_path):
        raise IOError("{} does not exist".format(img_path))
    while not got_img:
        try:
            img_type = 'RGB'
            img = Image.open(img_path).convert(img_type)
            got_img = True
        except IOError:
            print("IOError incurred when reading '{}'. "
                  "Will redo. Don't worry. Just chill.".format(img_path))
            pass
    return img


class ImageDataset(Dataset):
    """Image Person ReID Dataset"""

    def __init__(self, dataset, cfg, transform=None):
        self.dataset = dataset
        self.cfg = cfg
        self.transform = transform

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        img_path, pid, camid = self.dataset[index]
        img = read_image(img_path)

        if self.transform is not None:
            img = self.transform(img)

        return img, pid, camid, img_path


class BaseDataset:
    def __init__(self, root='/home/wl/.data/retrieval',
                 train_dir='', query_dir='', gallery_dir='',
                 verbose=True, **kwargs):
        self.dataset_dir = root
        self.train_dir = osp.join(self.dataset_dir, 'train')
        self.query_dir = osp.join(self.dataset_dir, 'query')
        self.gallery_dir = osp.join(self.dataset_dir, 'gallery')
        # The list files (train.txt / query.txt / gallery.txt) live inside
        # the corresponding image folders.
        self.list_train_path = osp.join(self.dataset_dir, 'train/' + train_dir)
        self.list_query_path = osp.join(self.dataset_dir, 'query/' + query_dir)
        self.list_gallery_path = osp.join(self.dataset_dir, 'gallery/' + gallery_dir)

        self._check_before_run()
        train = self._process_dir(self.train_dir, self.list_train_path)
        query = self._process_dir(self.query_dir, self.list_query_path)
        gallery = self._process_dir(self.gallery_dir, self.list_gallery_path)
        if verbose:
            print("=> retrieval loaded")
            self.print_dataset_statistics(train, query, gallery)

        self.train = train
        self.query = query
        self.gallery = gallery

        self.num_train_pids, self.num_train_imgs, self.num_train_cams = self.get_imagedata_info(self.train)
        self.num_query_pids, self.num_query_imgs, self.num_query_cams = self.get_imagedata_info(self.query)
        self.num_gallery_pids, self.num_gallery_imgs, self.num_gallery_cams = self.get_imagedata_info(self.gallery)

    def get_imagedata_info(self, data):
        pids, cams = [], []
        for _, pid, camid in data:
            pids += [pid]
            cams += [camid]
        pids = set(pids)
        cams = set(cams)
        num_pids = len(pids)
        num_cams = len(cams)
        num_imgs = len(data)
        return num_pids, num_imgs, num_cams

    def print_dataset_statistics(self, train, query, gallery):
        num_train_pids, num_train_imgs, num_train_cams = self.get_imagedata_info(train)
        num_query_pids, num_query_imgs, num_query_cams = self.get_imagedata_info(query)
        num_gallery_pids, num_gallery_imgs, num_gallery_cams = self.get_imagedata_info(gallery)

        print("Dataset statistics:")
        print("  ----------------------------------------")
        print("  subset   | # ids | # images | # cameras")
        print("  ----------------------------------------")
        print("  train    | {:5d} | {:8d} | {:9d}".format(num_train_pids, num_train_imgs, num_train_cams))
        print("  query    | {:5d} | {:8d} | {:9d}".format(num_query_pids, num_query_imgs, num_query_cams))
        print("  gallery  | {:5d} | {:8d} | {:9d}".format(num_gallery_pids, num_gallery_imgs, num_gallery_cams))
        print("  ----------------------------------------")

    def _check_before_run(self):
        """Check if all files are available before going deeper"""
        if not osp.exists(self.dataset_dir):
            raise RuntimeError("'{}' is not available".format(self.dataset_dir))
        if not osp.exists(self.train_dir):
            raise RuntimeError("'{}' is not available".format(self.train_dir))
        if not osp.exists(self.query_dir):
            raise RuntimeError("'{}' is not available".format(self.query_dir))
        if not osp.exists(self.gallery_dir):
            raise RuntimeError("'{}' is not available".format(self.gallery_dir))

    def _process_dir(self, dir_path, list_path):
        # Each line in the list file is "<relative image path> <pid>".
        with open(list_path, 'r') as txt:
            lines = txt.readlines()

        pid_container = set()
        for img_idx, img_info in enumerate(lines):
            img_path, pid = img_info.split(' ')
            pid = int(pid)
            assert pid >= 0, "pid less than 0"
            pid_container.add(pid)
        pid2label = {pid: label for label, pid in enumerate(pid_container)}

        dataset = []
        for img_idx, img_info in enumerate(lines):
            img_path, pid = img_info.split(' ')
            camid = 1  # no camera annotation in this dataset, use a constant
            pid = int(pid)
            img_path = dir_path + img_path
            pid = pid2label[pid]
            dataset.append((img_path, pid, camid))

        return dataset


def init_dataset(cfg):
    """
    Use path in cfg to init a dataset.
    Train set and val set should be organized as
    cfg.DATASETS.TRAIN_PATH: the path of train.txt
    cfg.DATASETS.QUERY_PATH: the path of query.txt
    cfg.DATASETS.GALLERY_PATH: the path of gallery.txt
    """
    return BaseDataset(root=cfg.DATASETS.DATA_PATH, train_dir=cfg.DATASETS.TRAIN_PATH,
                       query_dir=cfg.DATASETS.QUERY_PATH, gallery_dir=cfg.DATASETS.GALLERY_PATH)
```
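
For completeness, a minimal sketch of how this module could be wired into a training DataLoader; the transform pipeline, shuffle=True, and the hard-coded paths below are illustrative assumptions, not the repo's actual training code (the 'softmax_triplet' setup presumably uses an identity-based sampler rather than plain shuffling):

```python
from torch.utils.data import DataLoader
from torchvision import transforms as T

# Illustrative transform based on the cfg.INPUT values posted above
# (assumption: the repo builds its own pipeline elsewhere).
transform = T.Compose([
    T.Resize((256, 128)),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Paths taken from the debug_multi-gpu.yml posted above.
dataset = BaseDataset(root='/home/wl/.data/retrieval/vehicle',
                      train_dir='train.txt', query_dir='query.txt',
                      gallery_dir='gallery.txt')
train_set = ImageDataset(dataset.train, cfg=None, transform=transform)

# shuffle=True is a simplification; batch_size and num_workers mirror the
# SOLVER.IMS_PER_BATCH (after reduction) and DATALOADER.NUM_WORKERS values.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
```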

@LilySys (Author) commented Sep 26, 2019

Hi, when I decreased IMS_PER_BATCH from 128 to 64, the model started training.
2019-09-26 16:34:26,818 reid_baseline.train INFO: More than one gpu used, convert model to use SyncBN.
2019-09-26 16:34:29,920 reid_baseline.train INFO: Using pytorch SyncBN implementation
2019-09-26 16:34:29,936 reid_baseline.train INFO: Trainer Built
2019-09-26 16:35:27,140 reid_baseline.train INFO: Epoch[1] Iteration[20/5582] Loss: 14.669,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:35:43,517 reid_baseline.train INFO: Epoch[1] Iteration[40/5582] Loss: 14.248,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:35:59,911 reid_baseline.train INFO: Epoch[1] Iteration[60/5582] Loss: 14.028,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:36:16,507 reid_baseline.train INFO: Epoch[1] Iteration[80/5582] Loss: 13.833,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:36:32,851 reid_baseline.train INFO: Epoch[1] Iteration[100/5582] Loss: 13.673,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:36:49,295 reid_baseline.train INFO: Epoch[1] Iteration[120/5582] Loss: 13.562,Acc: 0.000, Base Lr: 1.40e-05
2019-09-26 16:37:05,658 reid_baseline.train INFO: Epoch[1] Iteration[140/5582] Loss: 13.454,Acc: 0.000, Base Lr: 1.40e-05
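
As a side note, if the trainer wraps the model in torch.nn.DataParallel (an assumption), IMS_PER_BATCH is the total batch that gets split across the visible cards, so going from 128 to 64 also halves the per-GPU batch and its memory footprint. One quick way to see how close each card is to its limit (a diagnostic sketch only):

```python
import torch

# Diagnostic sketch: print per-GPU memory usage, e.g. right after a forward
# pass, to see how close each card is to its limit.
for i in range(torch.cuda.device_count()):
    total = torch.cuda.get_device_properties(i).total_memory / 1024 ** 3
    used = torch.cuda.memory_allocated(i) / 1024 ** 3
    print("GPU {}: {:.2f} / {:.2f} GiB allocated".format(i, used, total))
```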

@LilySys (Author) commented Sep 26, 2019

Hi, there is another question: why is multi-GPU training slower than single-GPU training for me?

@HaoWang1006

@LilySys Did you reduce the batch size to solve this problem? And did you see any decrease in test accuracy after training?

@YUHANG-Ma

> Hi, when I decreased IMS_PER_BATCH from 128 to 64, the model started training. [...]

Hi, I also met this problem. I changed the batch size to 32, but it still doesn't work.
