126 twitter data #620
Conversation
…arquet files focused on tweet reply rows for further processing.
Clean code, a best-practice readme, and good overall code documentation. Thanks a lot.
The issue is that we can easily scrape replies, but there is no guarantee the original tweet is in the archive file. Furthermore, the archives are large so they would need to be kept completely in-memory or in a db to reference. We
What is roughly the total size of data we are dealing with (e.g. all compressed files together)?
Thank you @andreaskoepf , much appreciated!
The interesting part of the twitter archives is that the potential size is massive in totality. There are ~128 pages, and possibly more, on archive.org. Each of those pages contains tar files holding thousands of compressed json files, which makes the IO part a bit of a hassle (thus my initial consolidation into parquet files with some pre-processing). Each dump is around 50-100GB in tar format, so a rough napkin calculation puts the total at around 6-12TB of data if we wanted to use all of it.
My script doesn't handle downloading or decompressing the initial tar files since archive.org is slow (it expects folders of compressed json files), but this could be added in the future. I found that external tools specialized for downloading and decompressing large files are easier for initial setup and more efficient.
The initial parquet script can handle that scale, but the conversation thread script would most likely run into issues if we tried to ingest all of it at once to do the conversation thread matching. In my eyes it would need more intelligent code or db usage to handle the full scale, unless we process only a few dumps at a time and accept any missed connections between tweets across dumps.
I downloaded a few of them for variety, about 65GB of data, which turned out to be around 90M tweets after some minimal preprocessing to remove the truncated tweets and deletion rows that exist in the json. Those 90M tweets then became around 17K English conversation threads, but from brief manual observation most are spammy/noisy or not useful instructions due to the nature of twitter. Filtering and most likely post-processing will be needed to elevate the quality.
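The preprocessing described above (dropping deletion rows and truncated tweets before consolidating into parquet) could look roughly like this sketch. The field names (`delete`, `truncated`) follow the public Twitter streaming JSON layout, but the exact schema, file layout, and function names here are assumptions for illustration, not the PR's actual code:

```python
import gzip
import json


def keep_tweet(obj):
    """Return True for rows worth keeping: skip deletion notices
    and truncated tweets, as described in the comment above."""
    if "delete" in obj:  # deletion rows look like {"delete": {...}}
        return False
    if obj.get("truncated"):  # truncated tweets lack the full text
        return False
    return True


def filter_archive(path):
    """Yield cleaned tweet dicts from one gzipped JSONL archive file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            if keep_tweet(obj):
                yield obj


# The surviving rows could then be batched into parquet chunks,
# e.g. with pandas: pd.DataFrame(batch).to_parquet("chunk.parquet")
```

Processing one archive file at a time like this keeps memory bounded, which matters at the 50-100GB-per-dump scale mentioned above.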
@andreaskoepf @yk Hi gents, can we merge this? If any adjustments are needed, feel free to let me know.
@Jmete are the 17k English conversation threads shared anywhere? I'm doing cleanup/aggregation of the data we currently have.
@pruksmhc Hey, it is scraped data, so currently just the scripts are uploaded publicly, but I can share a link on discord. I have 17k threads that are very messy (most are spam). I am working on an instruction-detector model to help filter them down to around 800 by removing a lot of the gibberish. Those remaining 800 would still need to go through safety filters and other filters that will be part of the data pipeline.
Hello @yk, @andreaskoepf, and others, I added code under scripts/data-collection/twitter for converting compressed jsonl twitter archive files into jsonl conversation threads based on a tree and node architecture. I have run pre-commit. If there are any issues, please let me know and I can help fix them. This is for issue #126.
Further work will be needed to filter these into usable instruction -> fulfilment pairs, since a lot of twitter threads are spam; the instruction-detector from issue 143 can be used for that. We could also probably use a .env file to make managing paths easier.
Note: the user should download the large .tar archive files from archive.org and decompress them. We could also automate this in the future if needed, but using a downloader like jdownloader works better with archive.org, which is otherwise painfully slow.
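As a rough illustration of the tree-and-node thread matching described above, here is a minimal sketch that links replies to parents via `in_reply_to_status_id` and walks each root depth-first to emit threads. The flat-dict representation and field names are assumptions for illustration, not the PR's actual data model; note how replies whose parent is missing from the archive (the limitation raised in review) simply become roots of partial threads:

```python
from collections import defaultdict


def build_threads(tweets):
    """Group tweets into conversation threads.

    Each tweet is a dict with 'id', 'text', and an optional
    'in_reply_to_status_id'. Replies whose parent tweet is not in
    the archive become roots of their own partial threads.
    """
    by_id = {t["id"]: t for t in tweets}
    children = defaultdict(list)  # parent id -> list of reply tweets
    roots = []
    for t in tweets:
        parent = t.get("in_reply_to_status_id")
        if parent is not None and parent in by_id:
            children[parent].append(t)
        else:
            roots.append(t)  # true root, or orphaned reply

    def walk(node, thread):
        # Depth-first traversal; sort children by id for stable output.
        thread.append(node["text"])
        for child in sorted(children[node["id"]], key=lambda c: c["id"]):
            walk(child, thread)
        return thread

    return [walk(root, []) for root in roots]
```

Keeping `by_id` and `children` in memory is exactly what limits this approach to a few dumps at a time; a db-backed index would be needed for the full multi-TB scale.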