app crashes during training #75
It seems, based on the output and the pytorch issue you linked above, that there is some indexing error happening. You should be able to label a subset of the extracted frames and train models without issue. Were you trying to label more frames during training? (This also should be fine, but just trying to get a sense of what you were doing in the app leading up to the crash) It's interesting that it crashes so far into training, I haven't seen that before. The model should have been run on every frame in the dataset by that point.
I labelled all the extracted frames (240).
Are all of those attempts from training within the app? Did training from the command line help, or did you get crashes there as well? And these crashes are happening with the supervised baseline model, correct? (easier to start debugging with the simplest model)
they are all from within the app. I have not tried from the command line. Shall I run that now?
Yes this is the supervised model.
Yes, trying from the command line is a good next step; if it crashes there then we know it's not the app, and it focuses the following steps a bit more.
In the meantime, if you look at your CollectedData.csv file, does it include any rows that are all empty? (i.e. are there any frames with no labels?)
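One quick way to check for fully empty rows is a short script like the sketch below. It assumes a DLC-style CollectedData.csv layout (first column is the frame path, a few header rows for scorer/bodyparts/coords, then x/y coordinate columns); the header count may differ in your project, so adjust `n_header_rows` accordingly:

```python
import csv

def empty_label_rows(csv_path, n_header_rows=3):
    """Return the frame paths of rows whose coordinate columns are all blank.

    Assumes a DLC-style layout: first column is the frame path and the first
    few rows are headers (scorer/bodyparts/coords). Adjust n_header_rows if
    your file differs.
    """
    empty = []
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    for row in rows[n_header_rows:]:
        if row and all(cell.strip() == "" for cell in row[1:]):
            empty.append(row[0])
    return empty
```

Calling `empty_label_rows("CollectedData.csv")` then prints any frames with no labels at all, which you could delete (or label) before retraining.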
looks like the error is because
ok cool, good to hear. maybe try running from the app again? It's very strange that it progressed further and further each time. There is not really a difference between the training script you ran directly from the command line and the training function called by the app (but clearly something is going on). Re: epochs vs iterations, see the config docs here, specifically the min/max epochs parameter - let me know if that answers your question!
So I ran it again in the app and this time it completed the supervised model! But now it seems to be stuck starting the semi-supervised model. The app remains 'running' but nothing is happening in the terminal. Here is an error I spotted in my terminal:

As a temporary solution, I tried training from the terminal using a fresh config (as per your config guides). The supervised training runs successfully, but my output directory is still missing the bold files below. I assume the semi-supervised training also ran, given the existence of the predictions_pca, but I am not sure about the context model?

/path/to/models/YYYY-MM-DD/HH-MM-SS/
Re: fiftyone error: if you go to the FIFTYONE tab does it render the page? (Even if it says "No dataset selected" that's fine; I'm curious if you just get a blank page, or a page that looks like it's trying to load something but never does.)

Re: model training: I'm wondering if this has something to do with WSL... I haven't had this problem with native Linux before. If the semi-supervised model ran it would be in a completely separate

Regarding the video predictions, these are additional flags you need to set in the config file; see the relevant parameters in the

The flags will automatically run inference on a set of videos after the model completes training. If, as in your case, a model has already been trained and you'd like to run inference on a set of videos, see the inference documentation.
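For reference, the post-training inference flags live in the `eval` section of the config file. The key names below are my reading of the lightning-pose config docs, so double-check them against the config shipped with your installed version:

```yaml
eval:
  # if true, run inference on the listed videos once training finishes
  # (key name assumed from the lightning-pose docs)
  predict_vids_after_training: true
  # directory containing the videos to run inference on
  test_videos_directory: /path/to/videos
```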
Also, I'm not sure what version of the app you're currently using, but if you run (from inside the
I get a page that is trying to load something but never does.
Can each model be run separately in the terminal? If so, how can I manually run the semi-supervised and context models?
It is already up to date. Perhaps next week's updates will make a difference, as you suggest.
Ok for fiftyone I'm guessing there's an issue with WSL, but I'll look into the error a bit more. Yes, you can run each model separately in the terminal. Check out the docs for the config file and let me know if you have any questions that aren't answered there. You might also try training models one at a time from the app - if it worked for the supervised model, you could try just selecting the context model and see if it trains to completion. Please keep me updated.
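As a sketch of how per-model selection works from the terminal (the parameter names are my assumption based on the lightning-pose config docs; your installed version may differ), the model variant is chosen in the `model` section of the config, which Hydra also lets you override on the command line:

```yaml
model:
  # "heatmap" = baseline; "heatmap_mhcrnn" = context model
  # (names assumed from the lightning-pose docs; verify against your version)
  model_type: heatmap
  # empty list = fully supervised; add unsupervised losses
  # (e.g. pca_singleview, temporal) for semi-supervised training
  losses_to_use: []
```

So, for example, appending `model.model_type=heatmap_mhcrnn` to the `train_hydra.py` command would train just the context model, if these key names match your config.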
within the app,
Ok, not ideal, but at least it's (partially) working!
@DavidGill159 just wanted to check in on this - are you still only able to train one model at a time, or are you able to train multiple now?
Hi @themattinthehatt, the app didn't end up being reliable enough to train any networks beyond supervised. For now I am just training my models directly through the terminal. On another note, I am currently training a semi-supervised context model on 600 labelled frames (the quality of my previous network, with 240 labelled frames, was poor) and, based on the time spent producing 1 epoch, it seems like it will take 3 days to run through all 300 epochs. Is this normal? My GPU is being utilised.
ugh I'm really sorry to hear about the app not working. We haven't had that issue before with the Linux installations, so maybe it is a Windows/WSL issue? We'll try to look into that. 3 days for training is definitely not normal (although the semi-supervised context models are the most computationally demanding). A few questions:
Sure, here is all you requested:
- 488 KB, 1280x1024, 96 dpi, 24 bit
- NVIDIA GeForce RTX 4080 with CUDA 11.6
- about 10-15 minutes on 250 labelled frames
- config attached
Another note... my computer shut down during training on my latest network (supervised + ctx). I tried resuming training from the checkpoint according to these instructions. However, this didn't work, as the total number of epochs (first training session up until the checkpoint + epochs since resuming) sums to more than my set epoch limit (750). The absolute path of the ckpt file is correct.
@DavidGill159 sorry for the slow reply, I've been traveling for the past couple of weeks.

Regarding the slow training: I don't see any obvious reason why this should be happening given your config file. I am a bit surprised that you're not getting out-of-memory errors with the semi-supervised context model with resize dims of 384x384 and batch sizes of 16 for both the labeled and unlabeled frames. You could try reducing

Regarding resuming training: apologies for the lack of clarity in the documentation - the
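For context, stock PyTorch Lightning treats `max_epochs` as a cap on the total epoch count, not as an additional budget: a run resumed via `trainer.fit(..., ckpt_path=...)` picks up the epoch counter from the checkpoint and only trains the remainder. A minimal pure-Python illustration of that arithmetic (not the actual trainer logic):

```python
def epochs_remaining(checkpoint_epoch: int, max_epochs: int) -> int:
    """Epochs a resumed run will still train, under Lightning-style
    semantics where max_epochs caps the *total* count (illustrative only)."""
    return max(0, max_epochs - checkpoint_epoch)

# Resuming a max_epochs=750 run from an epoch-400 checkpoint trains 350 more;
# resuming from a checkpoint at or past the cap trains 0.
print(epochs_remaining(400, 750))
print(epochs_remaining(800, 750))
```

If your resumed run instead restarts counting from zero and overshoots the limit, that suggests the checkpoint's trainer state is not actually being restored.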
App crashes with an app server connection error. The terminal seems to indicate training has failed.
I had a similar app connection issue when extracting frames, for which the solution was to reduce the number of videos from 24 to 12. However, I assume the issue with training is instead related to GPU utilisation, similar to the issue described here?
My GPU setup:
Closing and re-launching the app after completing labelling and before starting training seems to improve the progress, but still results in the same crash eventually.
Potential temporary resolution: run training directly in my terminal, to see whether the issue is specific to the app.
e.g. run:
python scripts/train_hydra.py --config-path=/mnt/c/Users/X48823DG/lightning-pose/Pose-app/data/lp_mk1/ --config-name=model_config_lp_mk1.yaml