126 twitter data #620
Conversation
…arquet files focused on tweet reply rows for further processing.
Clean code, a best-practice readme, and good overall code documentation. Thanks a lot.
The issue is that we can easily scrape replies, but there is no guarantee the original tweet is in the archive file. Furthermore, the archives are large so they would need to be kept completely in-memory or in a db to reference. We
What is roughly the total size of data we are dealing with (e.g. all compressed files together)?
Thank you @andreaskoepf , much appreciated!
The interesting part of the twitter archives is that the potential size is massive in totality. There are ~128 pages, and possibly more, on archive.org. Each of those pages contains tar files holding thousands of compressed json files, which makes the IO part a bit of a hassle (thus my initial consolidation into parquet files with some pre-processing). Each dump is around 50-100GB in tar format, so a rough napkin calculation puts the total at around 6-12TB of data if we wanted to use all of it.
My script doesn't handle downloading or decompressing the initial tar files since archive.org is slow (it expects folders of compressed json files), but this could be added in the future. I found that external tools specialized for downloading and decompressing large files are easier for initial setup and more efficient.
The initial parquet script can handle that scale, but the conversation thread script would most likely run into issues if we tried to ingest all of it at once to do the conversation thread matching. In my eyes it would need more intelligent code or db usage to handle the full scale, unless we process only a few dumps at a time and accept any missed connections between tweets across dumps.
I downloaded a few of them for variety, about 65GB of data, which turned out to be around 90M tweets after some minimal preprocessing to remove the truncated tweets and deletion rows that exist in the json. Those 90M tweets then became around 17K English conversation threads, but from brief manual observation most are spammy/noisy or not useful instructions due to the nature of twitter. Filtering and most likely post-processing will be needed to elevate the quality.
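The preprocessing described above (dropping deletion rows and truncated tweets before consolidating into parquet) could look roughly like this sketch. The field names (`delete`, `truncated`) follow the public Twitter streaming JSON layout, but the exact schema, file layout, and function names here are assumptions for illustration, not the PR's actual code:

```python
import gzip
import json


def keep_tweet(obj):
    """Return True for rows worth keeping: skip deletion notices
    and truncated tweets, as described in the comment above."""
    if "delete" in obj:  # deletion rows look like {"delete": {...}}
        return False
    if obj.get("truncated"):  # truncated tweets lack the full text
        return False
    return True


def filter_archive(path):
    """Yield cleaned tweet dicts from one gzipped JSONL archive file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            if keep_tweet(obj):
                yield obj


# The surviving rows could then be batched into parquet chunks,
# e.g. with pandas: pd.DataFrame(batch).to_parquet("chunk.parquet")
```

Processing one archive file at a time like this keeps memory bounded, which matters at the 50-100GB-per-dump scale mentioned above.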
@andreaskoepf @yk Hi gents, can we merge this? If any adjustments are needed, feel free to let me know.
@Jmete are the 17k English conversation threads shared anywhere? I'm doing cleanup/aggregation of the data we currently have.
@pruksmhc Hey, it is scraped data, so currently just the scripts are uploaded publicly, but I can share a link on discord. I have 17k threads that are very messy (most are spam). I am working on an instruction-detector model to help filter them down to around 800 by removing a lot of the gibberish. Those remaining 800 would still need to go through safety filters and other filters that will be part of the data pipeline.
Hello @yk, @andreaskoepf, and others, I added code under scripts/data-collection/twitter for converting compressed jsonl twitter archive files into jsonl conversation threads based on a tree and node architecture. I have run pre-commit. If there are any issues, please let me know and I can help fix them. This is for issue #126.
Further work will be needed to filter these into usable instruction -> fulfilment pairs, since a lot of twitter threads are spam; the instruction-detector from issue 143 can be used for that. We could also probably use a .env file to make managing paths easier.
Note: the user should download the large .tar archive files from archive.org and decompress them. We could also automate this in the future if needed, but using a downloader like jdownloader works better with archive.org, which is otherwise painfully slow.
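As a rough illustration of the tree-and-node thread matching described above, here is a minimal sketch that links replies to parents via `in_reply_to_status_id` and walks each root depth-first to emit threads. The flat-dict representation and field names are assumptions for illustration, not the PR's actual data model; note how replies whose parent is missing from the archive (the limitation raised in review) simply become roots of partial threads:

```python
from collections import defaultdict


def build_threads(tweets):
    """Group tweets into conversation threads.

    Each tweet is a dict with 'id', 'text', and an optional
    'in_reply_to_status_id'. Replies whose parent tweet is not in
    the archive become roots of their own partial threads.
    """
    by_id = {t["id"]: t for t in tweets}
    children = defaultdict(list)  # parent id -> list of reply tweets
    roots = []
    for t in tweets:
        parent = t.get("in_reply_to_status_id")
        if parent is not None and parent in by_id:
            children[parent].append(t)
        else:
            roots.append(t)  # true root, or orphaned reply

    def walk(node, thread):
        # Depth-first traversal; sort children by id for stable output.
        thread.append(node["text"])
        for child in sorted(children[node["id"]], key=lambda c: c["id"]):
            walk(child, thread)
        return thread

    return [walk(root, []) for root in roots]
```

Keeping `by_id` and `children` in memory is exactly what limits this approach to a few dumps at a time; a db-backed index would be needed for the full multi-TB scale.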