
Slow unsupervised training #107

Closed
chiayewken opened this issue Aug 15, 2018 · 9 comments

@chiayewken

Thank you for your library; the supervised finetuning works very well. However, when I try to train on unlabelled data (model.fit(unlabeledX)), training is much slower (9 s/it) compared to supervised training (1.7 s/it). This is on one K80 GPU. I am not sure why unsupervised training is slower, since supervised training tunes the language model as well, doesn't it?
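For context, a minimal sketch of the two calls being compared, assuming finetune's Classifier API and placeholder variables (trainX, trainY, unlabeledX) holding lists of texts and labels:

    from finetune import Classifier

    # Supervised finetuning: reported at roughly 1.7 s/it on a single K80
    clf = Classifier()
    clf.fit(trainX, trainY)      # trainX: list of str, trainY: list of labels

    # Unsupervised training on unlabelled text: reported at roughly 9 s/it
    lm = Classifier()
    lm.fit(unlabeledX)           # unlabeledX: list of str, no targets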

@benleetownsend
Contributor

You are correct that this is what the paper states. However, the default for this repo is not to tune the language model; we found it to have a negative effect at the dataset sizes we are interested in (hundreds or low thousands of samples).

That said, I am extremely surprised that you are seeing this amount of slowdown. An interesting test would be to set lm_loss_coef in the config to a non-zero value and see if you still see this slowdown.
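As a minimal sketch of that test, assuming config overrides can be passed to the Classifier constructor as keyword arguments:

    from finetune import Classifier

    # lm_loss_coef weights the auxiliary language-model loss;
    # 0.0 disables LM tuning, a non-zero value (e.g. 0.5) enables it.
    model = Classifier(lm_loss_coef=0.5)
    model.fit(trainX, trainY)    # trainX/trainY are placeholder labelled data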

Can I ask which variant of the model you are using? This could be possible with large batch sizes or sequence lengths, as the language model computes a full projection over the vocabulary for each token in each sequence. Are you running on both processors in the K80 or just one?

I will attempt to reproduce if you can give me some more info.

@chiayewken
Author

chiayewken commented Aug 15, 2018

Thank you for the reply. I'm confused: isn't the default lm_loss_coef 0.5 for both unsupervised and supervised training? I assumed that meant the language model was being trained as well. I'm currently using the Classifier. I can't share the actual data, but I have been able to reproduce the slowdown in this demo Colab notebook:

https://colab.research.google.com/drive/1M0XAbGicO8-Vtn01Tw9UhCd4NCz43IiQ

I have tried varying the lm_loss_coef to no avail, unfortunately. I used a batch size of 10, but the default of 2 did not affect the timings.

@madisonmay
Contributor

madisonmay commented Aug 15, 2018

Hi @chiayewken -- you may be encountering a performance bug that I think was part of our last PyPI release. Could you try installing from source? I think you will likely see a significant speedup.

In the meantime I'll make sure to update our Python package -- if you'd prefer, release 0.3.1 will be up on PyPI in ~30 minutes.

The default was originally 0.5 but was changed to 0.0 in recent commits as a result of quite a bit of empirical testing. The crossover point will vary from dataset to dataset and task to task, but expect to need a few thousand examples before it's useful to turn on lm_loss_coef.

Thanks for the bug report!

@chiayewken
Author

Ah I see! Thank you, I will try installing from source and check out the results.

@chiayewken
Author

chiayewken commented Aug 15, 2018

Just checked: the performance is way faster now, thank you! (I did notice I had to reduce the batch size to avoid OOM issues, though.) On another topic, with the default lm_loss_coef of 0.0, I'm wondering whether the language model is being trained at this point, because there is this:

    if target_dim is None:
        lm_loss_coef = 1.0

Edit: Sorry, I stand corrected; the checkpoints were saved. I'm still unclear on the state of unsupervised training in this repository in general, though. I'll find out how the training goes when it's done!

@madisonmay
Contributor

madisonmay commented Aug 15, 2018

Awesome, glad to hear it!

The point of setting lm_loss_coef to 0.0 by default is, in fact, to prevent fitting the language model. The softmax over tokens is expensive, so this helps speed things up. If you're working with a dataset of 5k examples or more, you can manually set this to something other than 0.0 (probably 0.5), but we've found that at low data volumes, or with particularly difficult classification tasks, classifier performance degrades because the model can reduce the combined loss more easily by simply modeling language better.

Unsupervised training of the LM only is fully supported by finetune, with the exception that we need to fix #83 before the auto-checkpointing functionality will work. Note that this is not something that was tested in the original OpenAI paper; it was an addition after the fact. We have good reason to believe it will help improve classifier performance with large amounts of unlabeled data, but we haven't run many experiments to prove this out.
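As a rough sketch of LM-only training as described above; the save call is an assumption about the finetune API rather than something confirmed in this thread:

    from finetune import Classifier

    # fit() without targets trains only the language-model objective
    # (the quoted snippet forces lm_loss_coef to 1.0 when target_dim is None).
    lm_model = Classifier()
    lm_model.fit(unlabeledX)        # unlabeledX: list of unlabelled str documents
    lm_model.save("finetuned_lm")   # assumed save API, to reuse the tuned weights later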

Curious to hear how the training goes!

@chiayewken
Author

After leaving the language model to train unsupervised for a while, I was surprised to find that loading that model into a Classifier and then fitting on my labelled data didn't improve my validation scores. On the other hand, I found that for purely supervised training, lm_loss_coef=0.5 produces better results than 0.0, though this is likely influenced by my dataset size, as you said. My current best setup is a semi-supervised pseudo-labelling loop with lm_loss_coef=0.5, which gives a slight improvement over training only on labelled examples. I need to run more experiments... Many thanks!
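For reference, a rough sketch of a pseudo-labelling loop of this kind; the confidence threshold, number of rounds, and the assumption that predict_proba returns one dict of class probabilities per example are illustrative, not code from this repo:

    from finetune import Classifier

    labeledX, labeledY = list(trainX), list(trainY)   # start from the labelled set
    unlabeled = list(unlabeledX)

    for round_num in range(3):                        # a few pseudo-labelling rounds
        model = Classifier(lm_loss_coef=0.5)
        model.fit(labeledX, labeledY)

        # Pseudo-label the unlabelled examples the model is confident about.
        probas = model.predict_proba(unlabeled)
        still_unlabeled = []
        for text, proba in zip(unlabeled, probas):
            label, confidence = max(proba.items(), key=lambda kv: kv[1])
            if confidence > 0.95:
                labeledX.append(text)
                labeledY.append(label)
            else:
                still_unlabeled.append(text)
        unlabeled = still_unlabeled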

@madisonmay
Contributor

Curious that the dual objectives helped while the unsupervised training did not. Will make sure to ping you if we find anything interesting while testing out unsupervised fit on our own internal datasets.

@chiayewken
Author

Great, thanks! This repository is awesome :)

@allentran mentioned this issue Sep 18, 2018