
Training terminates after the first epoch due to excessive RAM usage #13

Open

ivannson opened this issue Dec 9, 2019 · 4 comments


ivannson commented Dec 9, 2019

I am trying to train a semantic segmentation model from scratch on the COCO dataset, and every time I run the training script the process is killed at the validation step after epoch 0.

At first, I got RuntimeError: DataLoader worker (pid xxxx) is killed by signal: Killed. After looking online, I tried setting the number of workers to 0, which caused a similar failure at the same stage, except the message just says Killed. Watching memory usage, RAM climbed to 97% just before the process was killed. I have 64 GB of RAM, which is enough to fit the entire training set if needed, so I don't really understand where the issue originates.

I have attached two screenshots showing the errors. The first one suggests that it failed when trying to colourise the images with colorizer.py.

Could you suggest a workaround? I am hoping to train a model on COCO data to understand how it works, and then train it on my own data which I will format to be COCO-like.

Screenshot 2019-12-06 at 10 05 44

Screenshot 2019-12-06 at 00 27 05

tano297 (Contributor) commented Dec 11, 2019

Hi,

We've seen this problem before, and it was usually caused by setting the number of workers too high.
The only other thing that may be going on is that the rand_img list is growing too large. A simple workaround would be to train with save_imgs: False in the config file. Can you try this and report back what happens?
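A back-of-the-envelope sketch of why an ever-growing image buffer like rand_img can exhaust RAM (the sizes and image count below are hypothetical stand-ins, not measurements from the repo): every validation batch that appends a full-size colourised image to a Python list keeps that array alive until the end of the run, so memory grows linearly with the number of batches.

```python
import numpy as np

# Hypothetical sizes: one COCO-like colourised image kept per validation step.
H, W = 480, 640
one_img = np.zeros((H, W, 3), dtype=np.uint8)
per_image_mb = one_img.nbytes / 1024 ** 2

n_val_images = 5000  # e.g. roughly the COCO val2017 split
total_gb = per_image_mb * n_val_images / 1024
print(f"{per_image_mb:.2f} MB per image, ~{total_gb:.1f} GB if all are kept")
```

Even for uint8 images this adds gigabytes, and it gets much worse if intermediate float32 prediction arrays are accidentally retained alongside them, which is why disabling save_imgs is a reasonable first test.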

ivannson (Author) commented Dec 11, 2019

Hi,

Changing the number of workers didn't seem to help. To analyse this further, I commented out the colorizer part in trainer.py. After training for one epoch I got a much more descriptive MemoryError when executing make_log_image in trainer.py:

def make_log_image(self, pred, target):
    # colorize and put in format
    pred = pred.cpu().numpy().argmax(0)
    #^MemoryError here^
    target = target.cpu().numpy()
    output = np.concatenate((pred, target), axis=1)

    return output

Setting save_imgs: False helped though, and the model has been training for 20 epochs now. I will try checking the size of rand_img and pred, because it's odd that there would be a memory error since I have cut down the number of images significantly (train2017 = 3.8GB, val2017 = 788MB).
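One way to check the size of pred and the concatenated output before they are stored is to log each array's footprint. This is a standalone sketch (not the repo's code), using numpy stand-ins with a hypothetical 480×640 resolution, since pred arrives as a numpy array after .cpu().numpy():

```python
import numpy as np

def log_array_size(name, arr):
    """Print shape, dtype and size in MB of an array; return the MB figure."""
    mb = arr.nbytes / 1024 ** 2
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}, {mb:.2f} MB")
    return mb

# Stand-ins for pred.cpu().numpy().argmax(0) and the target label map.
pred = np.zeros((480, 640), dtype=np.int64)
target = np.zeros((480, 640), dtype=np.int64)
output = np.concatenate((pred, target), axis=1)  # side-by-side comparison image
log_array_size("output", output)
```

Logging this once per validation step would show quickly whether individual arrays are unexpectedly large or whether the problem is purely the number of arrays being accumulated.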

ivannson (Author) commented

Also, regarding creating a custom dataset (I didn't want to open a separate issue as it's just a question).

I have a set of images consisting of a camera image and the semantic segmentation ground truth for that image, but no json annotation file, so I was wondering (sorry if it's a stupid question): what is the purpose of remapping the labels with generate_gt.py, i.e. converting the images from colour to grayscale?

Would you suggest generating a json file before using generate_gt.py, or just using it as a guide to write my own remapping script? Correct me if I'm wrong, but the json file is used just to get the label for each pixel, which then refers back to the image for the RGB values.

tano297 (Contributor) commented Dec 11, 2019

Hi,

The whole reason for doing it in monochrome is that I need a way to parse the labels that is more or less standard across datasets, since the idea is to be as general as possible. Therefore, you don't need the csv, or the conversion to monochrome, as long as your parser knows how to get your pixel-wise labels and turn them into tensors for training, which are expected to contain, per pixel, a value between 0 and Nclasses-1.
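A minimal sketch of such a custom parser, assuming RGB ground-truth images and a hand-written colour-to-class mapping (the names, colours and classes here are hypothetical, not the repo's API). The output is exactly what the trainer expects: one integer in [0, Nclasses-1] per pixel.

```python
import numpy as np

# Hypothetical palette: map each ground-truth colour to a class id.
COLOR_TO_CLASS = {
    (0, 0, 0): 0,      # background
    (255, 0, 0): 1,    # e.g. "person"
    (0, 255, 0): 2,    # e.g. "car"
}

def rgb_label_to_ids(label_rgb):
    """Map an (H, W, 3) uint8 label image to an (H, W) int64 class-id map."""
    ids = np.zeros(label_rgb.shape[:2], dtype=np.int64)
    for color, cls in COLOR_TO_CLASS.items():
        mask = np.all(label_rgb == np.array(color, dtype=np.uint8), axis=-1)
        ids[mask] = cls
    return ids

# Tiny usage example: a 2x2 label image with one "person" pixel.
label = np.zeros((2, 2, 3), dtype=np.uint8)
label[0, 0] = (255, 0, 0)
print(rgb_label_to_ids(label))
```

The resulting array can then be wrapped in a tensor by the dataset's __getitem__; no json file or grayscale conversion is needed as long as this mapping is consistent.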
