FediData is the first open multi-modal dataset collected from Mastodon. It is intended to provide realistic and reliable data for social behavior modeling, multi-modal learning, and research on user interaction mechanisms.
If you use FediData in a scientific publication, we kindly request that you cite the following paper:
@inproceedings{gao2025fedidata,
title={{FediData: A Comprehensive Multi-Modal Fediverse Dataset from Mastodon}},
author={Min Gao and Haoran Du and Wen Wen and Qiang Duan and Xin Wang and Yang Chen},
year={2025},
booktitle={Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM’25)}
}
- Python 3.8+
- Required packages (see individual module requirements)
- API keys for OpenAI/Qwen models (for classification tasks)
# Clone the repository
git clone https://github.com/mgao97/FediData.git
cd FediData
pip install -r requirements.txt
# Or install a single module's requirements, for example for the data collection module:
pip install -r data_collection/userprofile_ugc_download/requirements.txt
FediData/
├── 📂 data_collection/ # Data collection tools
│ ├── 📂 image_download/ # Image downloading utilities
│ └── 📂 userprofile_ugc_download/ # User and post data collection
├── 📂 bot_detection/ # Social bot detection models
├── 📂 image_category_classification/ # Image classification tools
├── 📂 topic_emotion/ # Topic and emotion analysis
├── 📂 dataset/ # Raw and processed data
└── README.md # Project overview and usage guide
This repository contains multiple modules for different aspects of Mastodon data processing. Each module has its own detailed README with specific usage instructions:
- User Profile & UGC Collection: Complete pipeline for collecting user profiles, social networks, and posts from Mastodon instances
- Image Download: Concurrent image downloader for extracting images from collected posts
- You can either collect the data yourself with the provided code, or download the anonymized dataset from Zenodo and extract it into the `dataset/` folder.
- Topic & Emotion Analysis: Topic classification and sentiment analysis of posts
- Bot Detection: Social bot detection using multiple machine learning models
- Image Classification: Automated image categorization using vision-language models
💡 Tip: Each module's README contains detailed prerequisites, configuration steps, and usage examples.
- Topic Classification: Automated topic categorization using LLMs
- Emotion Analysis: Sentiment and emotion detection in posts
- Visualization: Comprehensive charts and comparative analysis
| Model | Description | Type |
|---|---|---|
| BECE | Bot detection using embedding and classification | Deep Learning |
| BotRGCN | Relational Graph Convolutional Network | Graph Neural Network |
| SGBot | Statistical and graph-based features | Random Forest |
- Qwen 2.5 VL-32B Instruct: Vision-language model for image categorization
- Supports batch processing with configurable thread pools
- Automatic retry and error handling
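
The batch processing, thread pool, and retry behavior listed above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual implementation: `call_with_retry`, `process_batch`, and their parameters are hypothetical names, and the `func` argument stands in for whatever function sends one image to the vision-language model.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def call_with_retry(func, item, max_retries=3, backoff_seconds=0.0):
    """Call func(item), retrying on any exception for up to max_retries attempts.

    Hypothetical helper: the real module may retry only on specific errors.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func(item)
        except Exception as exc:
            last_error = exc
            if attempt < max_retries:
                # Linear backoff between attempts (an assumption, not the repo's policy).
                time.sleep(backoff_seconds * attempt)
    raise last_error


def process_batch(items, func, max_workers=4, max_retries=3, backoff_seconds=0.0):
    """Run func over items in a thread pool.

    Returns (results, errors): two dicts keyed by item, so one failing
    item does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(call_with_retry, func, item, max_retries, backoff_seconds): item
            for item in items
        }
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                errors[item] = exc
    return results, errors


if __name__ == "__main__":
    # Stand-in for a model call; a real classifier would send the image to the API.
    def fake_classify(path):
        return "unknown" if path.endswith(".png") else "photo"

    done, failed = process_batch(["a.jpg", "b.png"], fake_classify, max_workers=2)
    print(done, failed)
```

Collecting failures in a separate dictionary, rather than raising on the first error, lets a long classification run finish and leaves a list of items to re-submit later.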