Out of Memory on Small Dataset #151
Comments
Hi @stevesmit, could you try again with, e.g.:
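Something along these lines, with an even smaller max_length (the exact values here are illustrative, not a prescription):

```python
import finetune

# Smaller max_length and batch_size are the main levers on GPU memory
model = finetune.Classifier(max_length=64, batch_size=1)
```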
I agree this isn't expected behavior though. Running with tensorflow's memory logging on my end, that's 2.5GB of GPU memory required when using settings identical to yours. I'll have to dig into this issue further, and might need some more info on your side to reproduce it here in a bit.
As an aside, this issue should be independent of dataset size. Dataset size increases CPU memory usage but not GPU memory usage, as each batch is sent to the GPU independently, so only batch_size and max_length modifications should cause GPU memory usage to vary.
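In rough pseudocode, the data flow looks like this (illustrative only, not finetune's actual internals):

```python
# Illustrative sketch only -- not finetune's actual internals.
# The full dataset lives in host (CPU) memory; only one batch at a time
# is truncated/tokenized and shipped to the GPU, so GPU memory scales
# with batch_size and max_length, not with the number of documents.
X_train = ["document %d" % i for i in range(8000)]  # 8k docs, all in CPU RAM
batch_size, max_length = 1, 256

for start in range(0, len(X_train), batch_size):
    batch = [doc[:max_length] for doc in X_train[start:start + batch_size]]
    # ...tokenize `batch` and run a single GPU train step here...
```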
I tried that but I got the same error 👎
I've managed to replicate your problem by reverting to 0.4.1 and installing tensorflow-gpu==1.8.0. I believe installing the development branch of this GitHub repo will resolve things for you. I'm not 100% certain of the root cause, but my guess is that it's related to whether or not a portion of the language model graph is built when it is not required. If you experience further issues after installing the development version of the repo, the next option is to use the lighter-weight version of the model.
Sorry for the difficulty! This warrants a new PyPI release associated with the dev branch -- will hopefully have a 0.4.2 release out shortly. --Madison
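For reference, installing straight from the dev branch looks something like this (assuming the repository is IndicoDataSolutions/finetune and the branch is named `development`):

```
pip install git+https://github.com/IndicoDataSolutions/finetune.git@development
```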
Hi Madison, I cloned the development branch, but now I'm getting an error about a missing function:
Hrmmm... seems to me like the setup didn't complete for one reason or another. That function definitely exists in the development branch. Perhaps the install didn't pick up the new code -- could you try reinstalling in a clean environment?
I can't be sure, but we've just pushed an update to finetune (0.5.8) that may resolve your issue. If you've installed via pip, you can upgrade via:
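That is, the standard pip upgrade command:

```
pip install --upgrade finetune
```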
Hi @stevesmit, just checking in again. Have you had a chance to try this out on your machine? Curious to know if the refactor to the tf.Estimator API helped to resolve your issue.
@madisonmay Sadly I'm still having trouble. I upgraded to 0.5.11. After some ResourceExhaustedErrors I toned down the parameters and used a toy example dataset. It manages to start finetuning but ends up with a weird error. Here's my code and the error:
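A minimal sketch of that setup (the toy data and parameter values here are illustrative stand-ins, not the exact originals):

```python
import finetune

# Tiny toy dataset: documents and labels are plain strings
X_train = ["great product", "terrible support", "works as advertised", "broke after a day"]
Y_train = ["positive", "negative", "positive", "negative"]

# Toned-down parameters to keep GPU memory usage minimal
model = finetune.Classifier(max_length=64, batch_size=1)
model.fit(X_train, Y_train)
```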
Here's the full traceback:
Hi there, I'm still at a loss as to what's going on in your environment, but it may be worth trying to run this in a docker container through the nvidia runtime, or on a linux machine if you have access to one -- seems like there's a reasonable chance this issue is an artifact of running on a windows environment. The new estimator-based finetune version also has updated dependency requirements worth double-checking. "DST tensor is not initialized" is essentially the same as any other OOM error -- it's always been the same problem by a different name in my experience. Sorry for the trouble, hope you're able to get something sorted out. --Madison
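For the docker route, the invocation would look something like this (the image tag is only an example, and `--runtime=nvidia` assumes nvidia-docker2 is installed):

```
docker run --runtime=nvidia -it tensorflow/tensorflow:1.8.0-gpu-py3 /bin/bash
```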
Going to close this down due to lack of activity. Feel free to re-open if you have more information we might be able to use to help diagnose the problem. |
Describe the bug
When attempting to train a classifier on a small dataset of 8,000 documents, I get an out of memory error and the script stops running.
Minimum Reproducible Example
Version of finetune = 0.4.1
Version of tensorflow-gpu = 1.8.0
Version of cuda = release 9.0, V9.0.176
Windows 10 Pro
Load a dataset of documents (X_train) and labels (Y_train), where each document and label is simply a string.
import finetune

model = finetune.Classifier(max_length=256, batch_size=1)  # tried reducing the memory footprint
model.fit(X_train, Y_train)
Expected behavior
I expected the model to train, but it doesn't manage to start training.
Additional context
I get the following warnings in the jupyter notebook:
And then I get the following diagnostic info showing up in the command prompt:
...and so on. This is, in my opinion, a pretty small dataset, and I've made max_length pretty small, so I don't think this is a hardware limitation, but a bug.