Big-Interleaved-Dataset

Big-Interleaved-Dataset is a LAION project to create an open-source multimodal dataset along the lines of DeepMind's M3W (MultiModal MassiveWeb) dataset.

Communications:

Real-time conversations: the #big-interleaved-dataset channel on the LAION Discord.

Meetings: Weekly, on either Tuesday or Thursday at 8 pm CET. The link is provided in the channel.

Meeting notes are kept in this doc.

Current Progress:

Current progress is being tracked here.

Structure of the project:

Presently, BILD (Big-Interleaved-Dataset) is divided into three phases.

Phase 1: Data extraction from Common Crawl, and possibly from the licensed part of the Internet Archive held by the Webis group. Being tracked here.
Phase 2: Data filtering for NSFW content, data quality, duplicated data, and other broad criteria. Being tracked here.
Phase 3: The filtered data can be used to create datasets of various modalities. This project, however, focuses on the interleaved format. Being tracked here.

Phase 1

Data extraction pipeline from data sources.

Data Sources:

Common Crawl
Possibly the licensed part of the Internet Archive from the Webis.de group.
Other sources that the community can recommend.

Pipeline:

Common Crawl provides most of its dataset in the form of WARC files containing HTTP responses. The pipeline therefore has to parse each WARC file and then the HTML response inside it to extract the required data, mainly text and links to media, while discarding scripts, CSS, and other components.

Naturally, the pipeline is divided into two parts:

WARC file parser.
HTML parser.

WARC parser

There are many open-source WARC parsers available. WARCIO is the most commonly used, but FastWARC is a faster alternative whose API is modelled on WARCIO's.
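To make the record structure concrete, here is a minimal sketch of what a WARC parser does, using only the Python standard library. It is not how WARCIO or FastWARC are implemented and is not meant to replace them. The names `iter_warc_records`, `html_payloads`, and `make_record`, as well as the sample data, are hypothetical, and the sketch assumes an uncompressed stream of well-formed records.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the exact size of the record block in bytes.
        yield headers, stream.read(int(headers["content-length"]))


def html_payloads(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (target URI, HTML body) for every HTML response record."""
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # A response block is a raw HTTP message: head, empty line, body.
        http_head, _, body = block.partition(b"\r\n\r\n")
        if b"text/html" in http_head.lower():
            yield headers.get("warc-target-uri", ""), body


def make_record(warc_type: str, uri: str, block: bytes) -> bytes:
    """Build one synthetic WARC record (hypothetical helper for the demo)."""
    head = (
        f"WARC/1.0\r\nWARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\nContent-Length: {len(block)}\r\n\r\n"
    ).encode("utf-8")
    return head + block + b"\r\n\r\n"


# Synthetic sample: one request, one HTML response, one image response.
SAMPLE = (
    make_record("request", "http://example.com/",
                b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
    + make_record("response", "http://example.com/",
                  b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hi</p>")
    + make_record("response", "http://example.com/a.png",
                  b"HTTP/1.1 200 OK\r\nContent-Type: image/png\r\n\r\n\x89PNG")
)

for uri, body in html_payloads(io.BytesIO(SAMPLE)):
    print(uri, body)
```

Common Crawl distributes its WARC files gzip-compressed, so a real run would first wrap the file in a decompressing reader (for example `gzip.open`), or simply rely on one of the libraries above, which also cope with malformed records.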

HTML parsers

Various HTML parsers are available, and we need to select the one best suited to our requirements, which come down to preserving text and media-link attributes.
Our ideal parser should retain both the text and the multimodal attributes of the HTML, along with their position relative to each other in the document.
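As an illustration of what "retaining locality" could mean, the sketch below walks an HTML document with Python's built-in `html.parser` and emits text and media links in the order they appear, skipping scripts and styles. It is only a sketch under assumptions: the class name `InterleavedExtractor`, the `(kind, value)` output shape, and the choice of media tags are hypothetical, and the parser eventually chosen for the project may work quite differently.

```python
from html.parser import HTMLParser
from typing import List, Tuple

SKIPPED = {"script", "style", "noscript"}    # content to disregard
MEDIA = {"img", "video", "audio", "source"}  # tags whose src we keep


class InterleavedExtractor(HTMLParser):
    """Collect ("text", ...) and (media tag, src) items in document order."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.items: List[Tuple[str, str]] = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
        elif tag in MEDIA and self._skip_depth == 0:
            src = dict(attrs).get("src")
            if src:
                self.items.append((tag, src))

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.items.append(("text", text))


def extract_interleaved(html: str) -> List[Tuple[str, str]]:
    """Return the text and media links of a page, interleaved as they occur."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


PAGE = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>A cat.</p><img src=\"cat.jpg\" alt=\"cat\">"
    "<p>A  dog.</p><video src='dog.mp4'></video></body></html>"
)

for kind, value in extract_interleaved(PAGE):
    print(kind, value)
```

Because items are appended as the parser encounters them, an image link ends up between the paragraphs that surround it on the page, which is the ordering an interleaved dataset needs. Relative links such as `cat.jpg` would still have to be resolved against the page URL.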

Phase 2

To add

Phase 3

To add

More about project here:

https://docs.google.com/document/d/1R8WYJ1YcEZ5fAYJH91FCglhzAWS92R3zeRy-bppP1ho/edit?usp=sharing
