
Training data from Twitter #126

Closed
3 of 4 tasks
dhruv2601 opened this issue Dec 28, 2022 · 18 comments

@dhruv2601

dhruv2601 commented Dec 28, 2022

Twitter is a good source for gathering multi-turn conversation training data. One approach is to convert any thread into
A: B: A: format by linking user_id to the tweets. It has worked well in my experience.
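Roughly, the conversion I have in mind looks something like this (a sketch only, assuming each tweet record carries `id`, `user_id`, `text`, and `in_reply_to_status_id`, and that the thread is a single linear chain):

```python
def thread_to_transcript(tweets):
    """Turn a linear reply chain into an A:/B: transcript keyed on user_id."""
    by_id = {t["id"]: t for t in tweets}

    # The leaf of the chain is the one tweet nothing else replies to.
    replied_to = {t["in_reply_to_status_id"] for t in tweets}
    leaf = next(t for t in tweets if t["id"] not in replied_to)

    # Walk up to the root, then reverse into chronological order.
    chain = []
    node = leaf
    while node is not None:
        chain.append(node)
        node = by_id.get(node["in_reply_to_status_id"])
    chain.reverse()

    # Assign speaker labels (A, B, C, ...) per user_id in order of appearance.
    labels = {}
    lines = []
    for t in chain:
        label = labels.setdefault(t["user_id"], chr(ord("A") + len(labels)))
        lines.append(f"{label}: {t['text']}")
    return "\n".join(lines)
```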

I can prepare a plan and notebooks, and share sample results (a few thousand for a start) of what the data would look like, and further assess quality and toxicity metrics on random samples.
The data can be taken freely from the twitterstream archives.

@yk what do you think?

Tasks:

  • Evaluation of legality / usefulness / scope (discussion in this issue)
  • Raw data set is published (or just a link to the data if it already exists)
  • 126 twitter data #620
  • Code run and data published to OpenAssistant Hugging Face
@degenai

degenai commented Dec 29, 2022

I would love to see this plan as someone trying to educate themselves.

@tusharagarwal25

Would love to collaborate here. Let me know if there is a task that can be split.

@yk
Collaborator

yk commented Dec 29, 2022

this looks like a neat plan. I think this would be really supercharged if we had some sort of "instruction/request detector", i.e. a model (or heuristic) that detects when the initial tweet is an instruction. I think this would reduce data noise by a huge amount. do you think that's a good plan?

I've created #143 for this. Do you want to work on that or do you already want to start here?

@dhruv2601
Author

@yk I think that's a good idea; a DeBERTa-like model would be great for this task. It adapts to varied styles quickly, which would let us have a single model to run on all the different social media (where the style of conversation differs) as well as normal web-style texts. Further, for a task like an instruction classifier we would need something like 10-15k high-quality samples, and there is enough instruction-based data to get good positive and negative samples.
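For inference, a standard text-classification pipeline would be enough; a rough sketch (the checkpoint name is just a placeholder for whatever classifier we end up fine-tuning):

```python
from transformers import pipeline

# Placeholder checkpoint name: stands in for a DeBERTa-style classifier
# fine-tuned on instruction vs. non-instruction text.
detector = pipeline("text-classification", model="open-assistant/instruction-detector")

tweets = [
    "Can anyone explain how gradient checkpointing works?",
    "Just had the best coffee of my life.",
]
for tweet, pred in zip(tweets, detector(tweets)):
    print(pred["label"], round(pred["score"], 3), "-", tweet)
```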

To be fair, a heuristic-based approach would be much quicker at inference time, so data collection throughput would increase greatly given the huge amount of data we need. For now I will keep that as an offline experiment to see how a few heuristics work on a task like this and whether there are certain patterns we could use.

I would like to start working on #143 first; you could assign it to me. I will propose a more detailed plan for that task here or over Discord DM tonight.

@yk
Collaborator

yk commented Dec 29, 2022

amazing, thank you!

@yk
Collaborator

yk commented Dec 29, 2022

you need to comment over there, I can't assign you otherwise 😄

@dhruv2601
Author

Done 👯
I'll circle back to this Twitter collection issue when done with #143.

@Jmete
Contributor

Jmete commented Dec 30, 2022

Hi all, I have some experience with Twitter data-science projects (I had to load, filter, and process a TB of tweets). Finding suitable threads may be difficult, and running the model to check them could be time-consuming given the vast amount of raw data. It might be possible to reduce the search space by filtering for certain hashtags or keywords like "help", "lazyweb", or other suitable ones. At the very least, it might help to get initial training data for the instruction model. Just an idea. I'm willing to help if I can.
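For example, a cheap pre-filter along these lines (the keyword list is only illustrative):

```python
import re

# Illustrative keyword list; the real one would need some experimentation.
KEYWORDS = ["help", "lazyweb", "how do i", "can anyone", "does anyone know"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def looks_like_request(tweet: dict) -> bool:
    # Streaming tweets put long text under extended_tweet.full_text.
    text = tweet.get("extended_tweet", {}).get("full_text") or tweet.get("text", "")
    return bool(PATTERN.search(text))
```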

@yk
Collaborator

yk commented Dec 30, 2022

> Hi all, I have some experience with Twitter data-science projects (I had to load, filter, and process a TB of tweets). Finding suitable threads may be difficult, and running the model to check them could be time-consuming given the vast amount of raw data. It might be possible to reduce the search space by filtering for certain hashtags or keywords like "help", "lazyweb", or other suitable ones. At the very least, it might help to get initial training data for the instruction model. Just an idea. I'm willing to help if I can.

that's true. at least we could have multi-stage filters that optimize for recall, the first ones being very quick. Would you like to set up some code that would allow us to collect twitter data? We can then plug in the instruction detector once we have it.

@Jmete
Contributor

Jmete commented Dec 31, 2022

> that's true. at least we could have multi-stage filters that optimize for recall, the first ones being very quick. Would you like to set up some code that would allow us to collect twitter data? We can then plug in the instruction detector once we have it.

Sure, I'll work on some initial code! I reviewed the data structures file, but I might need to follow up on the best way to store the conversation threads: for example, as JSON, in a relational DB with parent IDs, or as big CSV files. The archive links store them in .tar files, and inside are compressed JSON files (.json.gz). Any feedback from experienced devs on how to optimize it is always helpful.

@yk
Collaborator

yk commented Dec 31, 2022

I think we can stay with gzipped json-line files for now, they're very versatile, and we can process them into other formats easily.
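For example, a minimal read/write helper for gzipped JSON-lines (just a sketch):

```python
import gzip
import json

def write_jsonl_gz(path, records):
    # One JSON object per line, e.g. one conversation thread per record.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl_gz(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```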

yk assigned Jmete Dec 31, 2022
@Jmete
Contributor

Jmete commented Dec 31, 2022

Hi @yk and others, I’ve been working on the twitter data. Some initial findings:

  • I downloaded the first archive as a test. It's a set of tar files consisting of 8,640 json.gz files.

  • Each json file has around 2,000 rows of data in it. After some pre-filtering to keep only rows that have replies or are themselves a reply, and that have non-truncated text, it's about 500 rows per file. This seems rather consistent so far, in this dump at least. It could drop further if we use stricter filters like keywords and hashtags.

  • The tweet data isn’t very deep. It will mention that this tweet has replies, or which tweet it is a reply to, but it doesn’t seem to carry the whole thread. We need to create the conversation threads by piecing them together and matching ids from the bottom up.

  • There is no guarantee that the original / response are in the same json.gz file.

  • Due to this, we might need to process lots of them as unified files in order to try to match them. Right now I am doing some pre-filtering and then storing the results in larger parquet files to save on the IO time of opening lots of little files. It doesn't have to be parquet, but it makes things easy as a temporary solution and the file size is good. Every 512 files processed comes out to between 30 and 50 MB, which covers around 250k rows of tweets.

  • Alternatively, we could possibly highlight original tweets (the instruction / prompt) that have replies (possibly with some extra filters like keywords, non-truncated text, etc), and then use those tweet IDs to extract the entire thread using external APIs instead of trying to mine them from the existing dump.
    For now I’ve written scripts to create the easier-to-digest pre-filtered parquet files by looping through the file list, processing them, and then exporting the combined files (rough sketch below).
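Roughly, the pre-filtering loop looks like this (field names follow the raw streaming tweet JSON as I understand it; the exact filters are still being tuned):

```python
import gzip
import json

import pandas as pd

def prefilter_file(path):
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            t = json.loads(line)
            if "delete" in t:        # deletion notices, not tweets
                continue
            if t.get("truncated"):   # skip truncated text
                continue
            is_reply = t.get("in_reply_to_status_id") is not None
            has_replies = bool(t.get("reply_count"))  # not present in every dump
            if not (is_reply or has_replies):
                continue
            rows.append({
                "id": t["id"],
                "user_id": t["user"]["id"],
                "created_at": t["created_at"],
                "text": t.get("extended_tweet", {}).get("full_text", t["text"]),
                "in_reply_to_status_id": t.get("in_reply_to_status_id"),
                "lang": t.get("lang"),
            })
    return pd.DataFrame(rows)

def process_batch(json_gz_files, out_path):
    # Combine e.g. 512 small files into one parquet file to save on IO.
    df = pd.concat([prefilter_file(p) for p in json_gz_files], ignore_index=True)
    df.to_parquet(out_path, index=False)
```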

Action items:

  • Decide on how to create the conversation threads from this raw data. We can also determine the final output file format.
  • Decide on the filters to use. Other potential filters include certain hashtags, keywords, follower count of the people tweeting, language (English or all?), etc.
  • Do we have any storage location to store pre-processed files? For example, someone can run the scripts to create the parquet (or other format files), and then someone else can take those files and do extra processing on them.

@yk
Collaborator

yk commented Jan 1, 2023

hey this is really nice work so far, thank you!
with respect to how to get the conversations, I suggest you do what you feel is the most appropriate. seems like the choice is between a) keeping lots of stuff in RAM to build the threads, b) spinning up a local db like redis to do that, or c) querying the API.

@lewtun do you have inputs on output format and storage?

as for keywords or hashtags, we might need to investigate this some more. one strategy is to search for hashtags or keywords that are commonly used when someone is requesting help with something, like #help. On the other hand, we could search for topical keywords such as math, sports, etc. and rely on the instruction detector to filter out the good instruction data. What do you think?

@Jmete
Contributor

Jmete commented Jan 7, 2023

Hi @yk @lewtun and others interested, I seem to have gotten the code working to turn the twitter dumps into conversation trees. Copying this post from Discord. I have a jsonl file to share if anyone wants to look at it; you can DM me on Discord.

More details:
Step 1:
I downloaded a few archive dumps and wrote code that extracted some standardized columns into parquet files for non-truncated tweets. This acts as a standardized dump that is easier to process. There are about 90M tweets.

Step 2:
I processed those parquet files into large dataframes and did some mixing and matching to find out which tweets were origin tweets (just looking at English for now). Then I wrote code to loop through the origin tweets and extract the conversation tree rooted at the origin node. I used the general tree and node class structure. It's still missing some elements like metadata for the full tree, and I haven't merged users' replies to their own tweets yet (they will show up as children), but I have identified each node by its role as prompter or assistant.
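Conceptually, the tree-building step is just an index of replies keyed by in_reply_to_status_id plus a recursive walk from each origin tweet; a simplified sketch of what the code does:

```python
from collections import defaultdict

class Node:
    def __init__(self, tweet, role):
        self.tweet = tweet      # a pre-filtered tweet row (dict)
        self.role = role        # "prompter" or "assistant"
        self.children = []

def index_replies(tweets):
    # Map parent tweet id -> list of direct replies.
    by_parent = defaultdict(list)
    for t in tweets:
        if t["in_reply_to_status_id"] is not None:
            by_parent[t["in_reply_to_status_id"]].append(t)
    return by_parent

def build_tree(origin, replies_by_parent):
    # Replies by the origin author count as prompter turns, everyone else as assistant.
    origin_user = origin["user_id"]

    def attach(tweet):
        role = "prompter" if tweet["user_id"] == origin_user else "assistant"
        node = Node(tweet, role)
        for reply in replies_by_parent.get(tweet["id"], []):
            node.children.append(attach(reply))
        return node

    return attach(origin)
```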

Step 3:
I exported that list of conversation trees into a jsonl file. There are 17,378 lines; each line is a tree.

Notes:

  • This is just based on data that I extracted from the dump, not all possible tweets.
  • I feel a lot of the tweets are low quality / spam. We could run a future instruction model on them to try to determine quality, I guess. I did try to explore and check for helpful hashtags or words, but it's either messy or there aren't many at all in the dataset, even across millions of tweets.
  • We might need to try a separate scrape directly from Twitter to search for the types of tweets / hashtags we want, to get higher quality.
  • I will try to clean up my code and update my github fork. Currently just running on local notebooks.

@yk
Collaborator

yk commented Jan 7, 2023

thanks a lot for this update, this sounds very cool! don't worry about getting the data format exactly correct, having the data in any way is already great!

Jmete mentioned this issue Jan 11, 2023
@andreaskoepf
Collaborator

Any update here?

@Jmete
Contributor

Jmete commented May 5, 2023

Hi @andreaskoepf, apologies, I have been busy with work for the past month or so. I ran into some problems getting quality prompts out of it, and there are issues with recent changes to how Twitter operates. I can soon make some commits for extra scripts I have made to clean the data and apply a detector to help remove some of the spam. We have two approaches:

  • Scraping archive files. This has a lot of data, but also a lot of spam. The original commits focused on this.
  • Scraping Twitter rolled-up threads (from a third party), but these would then need to be run through a separate process to generate questions from the paragraphs to form proper Q/A or instruct pairs.

After that, I would suggest we either set it aside for later, or someone else can take up the mantle of cleaning it up.

@andreaskoepf
Collaborator

Closing an old data issue that was never completed.
