Training data from Twitter #126
Comments
I would love to see this plan as someone trying to educate themselves. |
Would love to collaborate here. Lmk if there is a task that can be split. |
this looks like a neat plan. I think this would be really supercharged if we had some sort of "instruction/request detector", i.e. a model (or heuristic) that detects when the initial tweet is an instruction. I think this would reduce data noise by a huge amount. do you think that's a good plan? I've created #143 for this. Do you want to work on that or do you already want to start here? |
@yk I think that's a good idea; a DeBERTa-like model would be great for this task. It adapts to varied styles quickly, which would let us run a single model across different social media (where the style of conversation differs) and normal web-style text. Further, a task like To be fair, a heuristic-based detector would be much quicker at inference, so data collection would speed up considerably given the huge amount of data we would need. But first I will keep it as an offline experiment to see how a few of those would work on a task like this and whether there are patterns we could use. I would like to start working on #143 first; you could assign it to me. I will propose a more detailed plan for that task here or in a Discord DM tonight. |
amazing, thank you! |
you need to comment over there, I can't assign you otherwise 😄 |
Done 👯 |
Hi all, I have some experience with Twitter data science projects (I had to load, filter, and process a TB of tweets). Finding suitable threads may be difficult, and running a model over everything to check would be time-consuming given the vast amount of raw data. It might be possible to reduce the search space by filtering for certain hashtags or keywords like "help", "lazyweb", or other suitable ones. At the very least, it might help to get initial training data for the instruction model. Just an idea. I'm willing to help if I can. |
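A minimal sketch of the keyword/hashtag prefilter idea from the comment above. Only "help" and "lazyweb" come from the thread; the `KEYWORDS` tuple and the function name `looks_like_request` are hypothetical, and a real list would need tuning.

```python
import re

# Illustrative keyword list: "help" and "lazyweb" are the examples named in
# the thread; anything added here would be an assumption.
KEYWORDS = ("help", "lazyweb")

# Match a keyword as a whole word, with or without a leading '#',
# so "#lazyweb" and "help" match but "helpful" does not.
_PATTERN = re.compile(
    r"(?<!\w)#?(?:" + "|".join(map(re.escape, KEYWORDS)) + r")(?!\w)",
    re.IGNORECASE,
)

def looks_like_request(text: str) -> bool:
    """Cheap first-pass check: does the tweet text mention any keyword?"""
    return _PATTERN.search(text) is not None
```

A regex like this is cheap enough to run over the raw dump before any model is involved, at the cost of missing requests that use none of the listed words.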
that's true. at least we could have multi-stage filters that optimize for recall, the first ones being very quick. Would you like to set up some code that would allow us to collect twitter data? We can then plug in the instruction detector once we have it. |
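One way the multi-stage idea could look, as a sketch only: stages are ordered from cheapest to most expensive, and a tweet is dropped as soon as any stage rejects it. The stage functions and the `lang` and `text` field names are assumptions about the dump format, and `instruction_detector` is a placeholder for whatever comes out of #143.

```python
from typing import Callable, Iterable, Iterator, Sequence

Tweet = dict
Stage = Callable[[Tweet], bool]

def is_english(tweet: Tweet) -> bool:
    # Cheapest check: a single field lookup (assumes a "lang" field exists).
    return tweet.get("lang") == "en"

def is_not_retweet(tweet: Tweet) -> bool:
    # Cheap string check on the tweet text.
    return not tweet.get("text", "").startswith("RT @")

def instruction_detector(tweet: Tweet) -> bool:
    # Placeholder for the (expensive) detector; accepts everything for now.
    return True

def run_stages(tweets: Iterable[Tweet], stages: Sequence[Stage]) -> Iterator[Tweet]:
    """Yield tweets that pass every stage; all() stops at the first rejection,
    so later (more expensive) stages never see tweets rejected earlier."""
    for tweet in tweets:
        if all(stage(tweet) for stage in stages):
            yield tweet

STAGES = (is_english, is_not_retweet, instruction_detector)
```

To keep recall high, the early stages should only reject tweets that are clearly unusable, leaving the finer decisions to the last stage.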
Sure, I'll work on some initial code! I reviewed the data structures file, but I might need to follow up on the best way to store the conversation threads: for example, as JSON, in a relational DB with parent IDs, or as big CSV files. The archive links store them in .tar files, which contain compressed JSON files (.json.gz). Any feedback from experienced devs to optimize it is always helpful. |
I think we can stay with gzipped json-line files for now, they're very versatile, and we can process them into other formats easily. |
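For reference, gzipped JSON-lines files can be written and read with the standard library alone. This is a small sketch; the helper names `write_jsonl_gz` and `read_jsonl_gz` are made up for illustration.

```python
import gzip
import json
from typing import Iterable, Iterator

def write_jsonl_gz(path: str, records: Iterable[dict]) -> None:
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Stream records back one line at a time, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the reader is a generator, a large file can be processed record by record without loading it fully into memory.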
Hi @yk and others, I’ve been working on the twitter data. Some initial findings:
Action items:
|
hey this is really nice work so far, thank you! @lewtun do you have inputs on output format and storage? as for keywords or hashtags, we might need to investigate this some more. one strategy is to search for hashtags or keywords that are commonly used when someone is requesting help with something, like |
Hi @yk @lewtun and others interested, I seem to have gotten the code working to turn the twitter dumps into conversation trees. Copying this post from discord. I have a jsonl file to share if anyone wants to look at it, can DM on discord. More details: Step 2: Step 3: Notes:
|
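The steps of the actual script are not reproduced in this thread, so the following is only a generic sketch of how flat tweet records can be turned into conversation trees by following parent IDs. The field names `id`, `text`, and `in_reply_to_status_id` are assumptions about the dump format, and `build_trees` is a hypothetical name.

```python
from collections import defaultdict

def build_trees(tweets):
    """Group flat tweet records into nested reply trees.

    Tweets without a parent become roots. Replies whose parent is missing
    from the input are never reached from a root and are therefore dropped.
    """
    by_id = {t["id"]: t for t in tweets}
    children = defaultdict(list)
    roots = []
    for t in tweets:
        parent = t.get("in_reply_to_status_id")
        if parent is None:
            roots.append(t["id"])
        else:
            children[parent].append(t["id"])

    def node(tweet_id):
        # Recursive for brevity; very deep threads would need an iterative version.
        return {
            "id": tweet_id,
            "text": by_id[tweet_id]["text"],
            "replies": [node(c) for c in children[tweet_id]],
        }

    return [node(r) for r in roots]
```

Each resulting tree can then be written as one line of a JSON-lines file.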
thanks a lot for this update, this sounds very cool! don't worry about getting the data format exactly correct, having the data in any way is already great! |
Any update here? |
Hi @andreaskoepf, apologies, I have been busy with work for the past month or so. I ran into some problems getting quality prompts out of it, and there are issues with recent changes to how Twitter operates. I can soon commit some extra scripts I made to clean it and apply a detector to help remove some of the spam. We have 2 approaches:
After that, I would suggest we either revisit it later or have someone else take up the mantle of cleaning it up. |
Closing old data issue that has not been completed by now. |
Twitter is a good source for gathering multi-turn conversation training data. One approach is to convert any thread into
A: B: A:
format by linking user_id to the tweets. It has worked well in my experience. I can prepare a plan and notebooks, and share sample results (a few thousand for a start) showing what the data would look like, and further assess quality over toxicity metrics on random samples.
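A minimal sketch of the A: B: A: conversion, assuming a thread is already a list of tweets in order and each tweet carries `user_id` and `text` fields (both names are assumptions; `label_speakers` is hypothetical):

```python
import string

def label_speakers(thread):
    """Render an ordered thread as 'A: ...' lines, one letter per distinct user_id."""
    labels = {}
    lines = []
    for tweet in thread:
        uid = tweet["user_id"]
        if uid not in labels:
            n = len(labels)
            # Letters for the first 26 speakers, then a numbered fallback.
            labels[uid] = string.ascii_uppercase[n] if n < 26 else f"User{n + 1}"
        lines.append(f"{labels[uid]}: {tweet['text']}")
    return "\n".join(lines)
```

Replacing handles with letters this way also keeps user identifiers out of the resulting text.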
The data can be taken freely from twitterstream archives.
@yk what do you think?
Tasks: