
126 twitter data #620

Merged · 9 commits merged into LAION-AI:main on Jan 21, 2023

Conversation

@Jmete (Contributor) commented Jan 11, 2023

Hello @yk, @andreaskoepf, and others. I added code under scripts/data-collection/twitter for converting compressed JSONL Twitter archive files into JSONL conversation threads, based on a tree-and-node architecture (a sketch of the idea follows below). I have run pre-commit. If there are any issues, please let me know and I can help fix them. This is for issue #126.
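(For illustration only, not the PR's actual implementation: one way to build tree-and-node threads from a batch of tweets, assuming the standard Twitter v1.1 field names `id_str`, `in_reply_to_status_id_str`, and `full_text`. File names are hypothetical.)

```python
import json
from collections import defaultdict

def build_threads(tweets):
    """Group tweets into reply trees and yield each root-to-leaf path."""
    by_id = {t["id_str"]: t for t in tweets}
    children = defaultdict(list)
    roots = []
    for t in tweets:
        parent = t.get("in_reply_to_status_id_str")
        if parent and parent in by_id:
            children[parent].append(t["id_str"])
        else:
            # Either a genuine root tweet, or a reply whose parent isn't in this batch.
            roots.append(t["id_str"])

    def walk(node_id, path):
        path = path + [by_id[node_id]]
        kids = children.get(node_id)
        if not kids:
            yield path
        else:
            for kid in kids:
                yield from walk(kid, path)

    for root in roots:
        yield from walk(root, [])

# Illustrative usage: read one consolidated JSONL file, write one thread per line.
with open("tweets.jsonl") as f_in, open("threads.jsonl", "w") as f_out:
    tweets = [json.loads(line) for line in f_in if line.strip()]
    for thread in build_threads(tweets):
        texts = [t.get("full_text") or t.get("text", "") for t in thread]
        f_out.write(json.dumps(texts) + "\n")
```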

Further work will be needed to filter these into usable instruction -> fulfilment pairs, since a lot of Twitter threads are spam; the instruction-detector from issue #143 could be used for that. We could probably also use a .env file to make managing paths easier.

Note: the user should download the large .tar archive files from archive.org and decompress them. We could automate this in the future if needed, but using a downloader like JDownloader works better with archive.org, since direct downloads are painfully slow.
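(A minimal extraction sketch, assuming the tar was already downloaded; paths are hypothetical.)

```python
import tarfile
from pathlib import Path

# Hypothetical paths; the tar itself is downloaded separately (e.g. with JDownloader).
archive = Path("downloads/twitter-stream-2022-01.tar")
out_dir = Path("data/twitter/2022-01")
out_dir.mkdir(parents=True, exist_ok=True)

# Unpack into the folder of compressed .json files that the scripts expect.
with tarfile.open(archive) as tar:
    tar.extractall(out_dir)
```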

@andreaskoepf (Collaborator) left a comment

Clean code, a best-practice README, and good overall code documentation. Thanks a lot.


- The issue is that we can easily scrape replies, but there is no guarantee the original tweet is in the archive file. Furthermore, the archives are large so they would need to be kept completely in-memory or in a db to reference. We [...]
@andreaskoepf (Collaborator) commented:

What is roughly the total size of data we are dealing with (e.g. all compressed files together)?

@Jmete (Contributor, PR author) commented:

Thank you @andreaskoepf , much appreciated!

The interesting part of the Twitter archives is that the potential size is massive in total. There are ~128 pages of them on archive.org, and possibly more. Each page contains tar files with thousands of compressed JSON files inside, which makes the IO part a bit of a hassle (hence my initial consolidation into parquet files with some pre-processing). Each dump is roughly 50–100 GB in tar format, so a rough napkin calculation puts the total at around 6–12 TB if we wanted to use all of it.
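(Napkin math behind that estimate: ~128 dumps × 50–100 GB each ≈ 6.4–12.8 TB of tar files.)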

My script doesn't handle downloading or decompressing the initial tar files, since archive.org is slow (it expects folders of compressed JSON files), but this could be added in the future. I found that external tools specialized for downloading and decompressing large files are easier for the initial setup and more efficient.

The initial parquet script can handle that scale, but the conversation-thread script would most likely run into issues if we tried to ingest all of it at once to do the thread matching. In my eyes it would need more intelligent code or db usage to handle the full scale, unless we process only a few dumps at a time and accept missing some tweet connections between dumps.
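(A rough sketch of the "db usage" idea, not part of this PR; table and column names are made up. The point is to index tweet IDs in SQLite so parent lookups work across dumps without holding every archive in memory.)

```python
import json
import sqlite3

conn = sqlite3.connect("tweets.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets ("
    " id TEXT PRIMARY KEY,"
    " parent_id TEXT,"
    " payload TEXT)"
)

def ingest(jsonl_path):
    """Load one dump's worth of tweets into the index."""
    with open(jsonl_path) as f:
        for line in f:
            if not line.strip():
                continue
            t = json.loads(line)
            if "id_str" not in t:  # skip deletion rows and other notices
                continue
            conn.execute(
                "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?)",
                (t["id_str"], t.get("in_reply_to_status_id_str"), line.strip()),
            )
    conn.commit()

def find_parent(tweet_id):
    """Look up a reply's parent tweet, even if it came from another dump."""
    row = conn.execute(
        "SELECT payload FROM tweets WHERE id ="
        " (SELECT parent_id FROM tweets WHERE id = ?)",
        (tweet_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```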

I downloaded a few of them for variety, about 65 GB of data, which turned out to be around 90M tweets after some minimal preprocessing to remove truncated tweets and the deletion rows that exist in the JSON. Those 90M tweets then became around 17K English conversation threads, but from brief manual observation most are spammy/noisy or not useful as instructions, due to the nature of Twitter. Filtering and most likely post-processing will be needed to raise the quality.
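(For illustration, that kind of minimal preprocessing might look like this; field names assume the standard Twitter stream format, where deletion rows have a top-level `delete` key and cut-off tweets are flagged `truncated`.)

```python
import json

def keep(tweet: dict) -> bool:
    """Drop deletion notices and truncated tweets; keep everything else."""
    if "delete" in tweet:       # deletion rows in the archive JSON
        return False
    if tweet.get("truncated"):  # text is cut off; the full text isn't present
        return False
    return "id_str" in tweet

with open("raw.jsonl") as f_in, open("clean.jsonl", "w") as f_out:
    for line in f_in:
        if line.strip() and keep(json.loads(line)):
            f_out.write(line)
```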

@Jmete (Contributor, PR author) commented:

@andreaskoepf @yk Hi gents, can we merge this? If any adjustments are needed, feel free to let me know.

@pruksmhc (Contributor) commented:

@Jmete are the 17k English conversation threads shared anywhere? I'm doing cleanup/aggregation of the data we currently have.

@Jmete (Contributor, PR author) commented:

@pruksmhc Hey, it is scraped data, so currently only the scripts are uploaded publicly, but I can share a link on Discord. I have 17k threads that are very messy (most are spam). I am working on an instruction-detector model to help filter that down to around 800 and remove a lot of the gibberish. Those remaining 800 would still need to go through safety filters and other filters that will be part of the data pipeline.

@andreaskoepf merged commit d15d835 into LAION-AI:main on Jan 21, 2023.
@bitplane mentioned this pull request on Mar 5, 2023.