Skip to content

This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.

License

Notifications You must be signed in to change notification settings

BusyBeaver-AI/bluesky-posts-dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 

Repository files navigation

2 Million Bluesky Posts

This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.

The with-language-predictions config contains the same data as the default config but with language predictions added using the glotlid model. Dataset Details Dataset Description

This dataset consists of 2 million public posts from Bluesky Social, collected through the platform's firehose API. Each post contains text content, metadata, and information about media attachments and reply relationships.

  • Curated by: Alpin Dale
  • Language(s) (NLP): Multiple (primarily English)
  • License: Dataset usage is subject to Bluesky's Terms of Service

Uses

This dataset could be used for:

  • Training and testing language models on social media content
  • Analyzing social media posting patterns
  • Studying conversation structures and reply networks
  • Research on social media content moderation
  • Natural language processing tasks using social media datas

Dataset Structure

The dataset is available in two configurations:

Default Configuration

Contains the following fields for each post:

  • text: The main content of the post
  • created_at: Timestamp of post creation
  • author: The Bluesky handle of the post author
  • uri: Unique identifier for the post
  • has_images: Boolean indicating if the post contains images
  • reply_to: URI of the parent post if this is a reply (null otherwise)

With Language Predictions Configuration

Contains all fields from the default configuration plus:

  • predicted_language: The predicted language code (e.g., eng_Latn, deu_Latn)
  • language_confidence: Confidence score for the language prediction (0-1)

Language predictions were added using the glotlid model via fasttext.

Bias, Risks, and Limitations

The goal of this dataset is for you to have fun :)

About

This dataset contains 2 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published