Improve training data #2
Comments
I have a feeling that this step is going to be the key towards approaching CAI, especially the idea of injecting external knowledge...
Another report I got: the model calling the user random unrelated names. Might be worth running some NLP toolkit over the data to see if there are any significant imbalances towards certain names, and try to clean that up.
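For a quick first pass at the name-imbalance check, here's a minimal sketch that needs no NLP toolkit. It assumes the data can be read as plain lines of text, and it uses a crude heuristic (a capitalized word that isn't at the start of a sentence) as a stand-in for proper name detection. The function names and the `min_share` threshold are made up for illustration; a real NER model would be more precise.

```python
import re
from collections import Counter

# Assumption: a capitalized word preceded by a lowercase letter or
# punctuation plus a space is a candidate name. This is a rough proxy,
# not real named-entity recognition.
CANDIDATE = re.compile(r"(?<=[a-z,;:] )([A-Z][a-z]+)\b")


def count_candidate_names(lines):
    """Count how often each candidate name appears across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(CANDIDATE.findall(line))
    return counts


def overrepresented(counts, min_share=0.2):
    """Return (name, count) pairs whose share of all candidates is >= min_share."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [(name, n) for name, n in counts.most_common() if n / total >= min_share]
```

Anything this flags would still need a manual look before cleaning, since the heuristic will also pick up place names and other capitalized words.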
As for non-conversational data: a recent paper by Google seems to indicate that starting from an instruction-tuned model rather than a regular pre-trained LM might actually improve downstream task performance when performing further fine-tuning. Twitter thread about this, relevant arXiv paper and code repository.
With respect to more data sources, some thoughts:
I have time to help out with either of these if it would be useful.
Hey @lloorree! Indeed, forums seem like they might be a good source. The community has contributed around 350MB of forum posts that I'll attempt to write some parsing code for. It's all in SQLite databases, so the code will be somewhat similar to the Discord DHT parsing stuff I wrote. If you're interested in helping out with that, let me know and I can send you a sample. As for TV stuff, it didn't seem that great to me because a big portion of the context is the stuff that's happening on screen, so if you take just the text/dialogue, it's usually pretty bland and uninteresting.
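To give a rough idea of what parsing forum posts out of SQLite could look like, here's a sketch using Python's built-in `sqlite3` module. The schema is purely an assumption: a hypothetical `posts` table with `thread_id`, `author`, `body` and `created_at` columns. The actual databases will almost certainly be laid out differently, so the query would need adapting.

```python
import sqlite3
from itertools import groupby


def iter_threads(conn):
    """Yield (thread_id, [(author, body), ...]) with posts in chronological order.

    Assumes a hypothetical `posts` table; column names are illustrative only.
    """
    cur = conn.execute(
        "SELECT thread_id, author, body FROM posts ORDER BY thread_id, created_at"
    )
    # Rows arrive sorted by thread, so consecutive rows can be grouped directly.
    for thread_id, rows in groupby(cur, key=lambda row: row[0]):
        yield thread_id, [(author, body.strip()) for _, author, body in rows]
```

Each yielded thread is a list of turns, which should map fairly directly onto whatever conversational format the training pipeline expects.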
Hey! That would be perfect, thanks. The main other thing I'd need to get started would be a few lines of one of the correctly parsed output files to double-check that what I'm getting running locally is right. Fair enough about the dialogue.
Old discussion, very informative, but closing so that I can tidy up this repo.
I haven't been able to make any significant improvements on the models by twiddling around with hyperparameters and training objectives ever since around experiment 2, so I'm going to shift into focusing on improving the training data instead.
Some relevant points to consider: