
Training terminates after the first epoch due to excessive RAM usage #13

Open

ivannson opened this issue Dec 9, 2019 · 4 comments


ivannson commented Dec 9, 2019

I am trying to train a semantic segmentation model from scratch on the COCO dataset, and every time I run the training script the process is killed at the validation step after epoch 0.

At first, I got RuntimeError: DataLoader worker (pid xxxx) is killed by signal: Killed. After looking online, I tried setting the number of workers to 0, which caused a similar failure at the same stage, except the message just says Killed. Watching memory usage, RAM climbed to 97% just before the process was killed. I have 64 GB of RAM, which is enough to fit the entire training set if needed, so I don't really understand where the issue originates.

I have attached two screenshots showing the errors. The first one suggests that it failed when trying to colourise the images with colorizer.py.

Could you suggest a workaround? I am hoping to train a model on COCO data to understand how it works, and then train it on my own data which I will format to be COCO-like.

Screenshot 2019-12-06 at 10 05 44

Screenshot 2019-12-06 at 00 27 05

tano297 (Contributor) commented Dec 11, 2019

Hi,

We've seen this problem before, and it was usually caused by setting the number of workers too high.
The only other thing that may be going on is that the rand_img list is growing too large. A simple workaround would be to train with save_imgs: False in the config file. Can you try this and report back what happens?
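A back-of-the-envelope sketch of why an ever-growing image buffer like rand_img can exhaust RAM (the sizes and image count below are hypothetical stand-ins, not measurements from the repo): every validation batch that appends a full-size colourised image to a Python list keeps that array alive until the end of the run, so memory grows linearly with the number of batches.

```python
import numpy as np

# Hypothetical sizes: one COCO-like colourised image kept per validation step.
H, W = 480, 640
one_img = np.zeros((H, W, 3), dtype=np.uint8)
per_image_mb = one_img.nbytes / 1024 ** 2

n_val_images = 5000  # e.g. roughly the COCO val2017 split
total_gb = per_image_mb * n_val_images / 1024
print(f"{per_image_mb:.2f} MB per image, ~{total_gb:.1f} GB if all are kept")
```

Even for uint8 images this adds gigabytes, and it gets much worse if intermediate float32 prediction arrays are accidentally retained alongside them, which is why disabling save_imgs is a reasonable first test.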

ivannson (Author) commented Dec 11, 2019

Hi,

Changing the number of workers didn't seem to help. To analyse this further, I commented out the colorizer part in trainer.py. After training for one epoch I got a much more descriptive MemoryError when executing make_log_image in trainer.py:

def make_log_image(self, pred, target):
    # colorize and put in format
    pred = pred.cpu().numpy().argmax(0)
    #^MemoryError here^
    target = target.cpu().numpy()
    output = np.concatenate((pred, target), axis=1)

    return output

Setting save_imgs: False helped though, and the model has been training for 20 epochs now. I will try checking the size of rand_img and pred, because it's odd that there would be a memory error since I have cut down the number of images significantly (train2017 = 3.8GB, val2017 = 788MB).
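One way to check the size of pred and the concatenated output before they are stored is to log each array's footprint. This is a standalone sketch (not the repo's code), using numpy stand-ins with a hypothetical 480×640 resolution, since pred arrives as a numpy array after .cpu().numpy():

```python
import numpy as np

def log_array_size(name, arr):
    """Print shape, dtype and size in MB of an array; return the MB figure."""
    mb = arr.nbytes / 1024 ** 2
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}, {mb:.2f} MB")
    return mb

# Stand-ins for pred.cpu().numpy().argmax(0) and the target label map.
pred = np.zeros((480, 640), dtype=np.int64)
target = np.zeros((480, 640), dtype=np.int64)
output = np.concatenate((pred, target), axis=1)  # side-by-side comparison image
log_array_size("output", output)
```

Logging this once per validation step would show quickly whether individual arrays are unexpectedly large or whether the problem is purely the number of arrays being accumulated.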

ivannson (Author) commented

Also, regarding creating a custom dataset (I didn't want to open a separate issue as it's just a question).

I have a set of images consisting of a camera image and the semantic segmentation ground truth for that image, but no json annotation file, so I was wondering (sorry if it's a stupid question): what is the purpose of remapping the labels with generate_gt.py, i.e. converting the images from colour to grayscale?

Would you suggest generating a json file before using generate_gt.py, or just using it as a guide to write my own remapping script? Correct me if I'm wrong, but the json file is used just to get the label for each pixel, which then refers back to the image for the RGB values.

tano297 (Contributor) commented Dec 11, 2019

Hi,

The whole reason for doing it in monochrome is that I need a way to parse the labels that is more or less standard across datasets, since the idea is to be as general as possible. Therefore, you don't need the csv, or the conversion to monochrome, as long as your parser knows how to get your pixel-wise labels and turn them into tensors for training, which are expected to contain, per pixel, a value between 0 and Nclasses-1.
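A minimal sketch of such a custom parser, assuming RGB ground-truth images and a hand-written colour-to-class mapping (the names, colours and classes here are hypothetical, not the repo's API). The output is exactly what the trainer expects: one integer in [0, Nclasses-1] per pixel.

```python
import numpy as np

# Hypothetical palette: map each ground-truth colour to a class id.
COLOR_TO_CLASS = {
    (0, 0, 0): 0,      # background
    (255, 0, 0): 1,    # e.g. "person"
    (0, 255, 0): 2,    # e.g. "car"
}

def rgb_label_to_ids(label_rgb):
    """Map an (H, W, 3) uint8 label image to an (H, W) int64 class-id map."""
    ids = np.zeros(label_rgb.shape[:2], dtype=np.int64)
    for color, cls in COLOR_TO_CLASS.items():
        mask = np.all(label_rgb == np.array(color, dtype=np.uint8), axis=-1)
        ids[mask] = cls
    return ids

# Tiny usage example: a 2x2 label image with one "person" pixel.
label = np.zeros((2, 2, 3), dtype=np.uint8)
label[0, 0] = (255, 0, 0)
print(rgb_label_to_ids(label))
```

The resulting array can then be wrapped in a tensor by the dataset's __getitem__; no json file or grayscale conversion is needed as long as this mapping is consistent.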
