This repository provides structured datasets of embedded and industrial systems across multiple firmware versions.
Each dataset includes automated scripts for downloading, unpacking, and preparing firmware samples for controlled experiments such as semantic diffing, retrieval evaluation, and supply-chain security analysis.
The OpenWRT dataset offers a consistent, open-source baseline for analyzing software evolution and update behavior across firmware versions.
Contents
- `build_openwrt.sh`: Automates downloading and unpacking multiple OpenWRT firmware images.
- `tags.txt`: Lists component identifiers used for semantic retrieval and evaluation.
Use cases
- Measuring function-level change granularity
- Evaluating differential triage methods
- Validating semantic embeddings across versions
This dataset contains real-world firmware samples from WAGO PFC200 programmable logic controllers (PLCs).
It supports controlled binary replacement for generating labeled clean/backdoor variants, enabling reproducible semantic diffing experiments.
Contents
- `setup_wago_data.sh`: End-to-end automation script that:
  - Downloads firmware versions `03.10.10` and `03.10.08` from WAGO’s public GitHub releases.
  - Extracts filesystem contents with `binwalk`.
  - Locates and removes the original `usr/sbin/dropbear` binary.
  - Inserts controlled replacement binaries:
    - `dropbear86-backdoor` → backdoor variant
    - `dropbear86-clean` / `dropbear83-clean` → clean variants
  - Supports choosing between stripped (firmware-like) and symbol-rich binaries via CLI flag.
  - Produces three labeled datasets:
    - `03.10.10-backdoor`
    - `03.10.10-clean`
    - `03.10.08-clean`
  - Moves final datasets into the `experiment_samples/` directory.
- `dropbear_samples/`

      dropbear_samples/
      ├─ stripped/
      │  ├─ dropbear83-clean.stripped
      │  ├─ dropbear86-clean.stripped
      │  └─ dropbear86-backdoor.stripped
      └─ symbols/
         ├─ dropbear83-clean
         ├─ dropbear86-clean
         └─ dropbear86-backdoor

  `stripped/` contains firmware-realistic binaries with symbols removed.
  `symbols/` contains binaries compiled with symbols for analysis and reverse engineering.
- `experiment_samples/`: Contains the final processed datasets after running the setup script.
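
The remove-and-insert step that `setup_wago_data.sh` performs on `usr/sbin/dropbear` can be sketched in Python as follows. This is a minimal illustration and not the script's actual code: the helper name `replace_dropbear` is hypothetical, and it assumes the extracted root filesystem is an ordinary directory on disk.

```python
import shutil
from pathlib import Path


def replace_dropbear(rootfs: Path, sample: Path) -> Path:
    """Swap usr/sbin/dropbear inside an unpacked rootfs for a chosen sample.

    Hypothetical helper: it illustrates only the remove-then-insert step,
    not downloading or binwalk extraction.
    """
    target = rootfs / "usr" / "sbin" / "dropbear"
    if not sample.is_file():
        raise FileNotFoundError(f"sample binary not found: {sample}")

    # Remove the original binary (or a symlink to it) if one is present.
    if target.exists() or target.is_symlink():
        target.unlink()

    # Copy the replacement into place and mark it executable.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(sample, target)
    target.chmod(0o755)
    return target
```

For example, calling `replace_dropbear(rootfs, Path("dropbear_samples/stripped/dropbear86-backdoor.stripped"))` on an unpacked `03.10.10` root filesystem would produce the backdoor variant. Where that root filesystem sits inside the extraction output depends on the firmware image and is not assumed here.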
Purpose
This dataset is used for evaluating:
- semantic diffing frameworks (e.g., DRIFT),
- function-level retrieval robustness,
- and the impact of controlled binary modification on analysis pipelines.
Required Tools
- Python 3
- `binwalk` (firmware unpacking)
- `lzop` (decompression helper)
- `rsync` (optional, faster copying)
Install on Ubuntu/Debian
sudo apt-get update
sudo apt-get install binwalk lzop squashfs-tools rsync
pip install pandas requests
Build the OpenWRT dataset:
cd datasets/openwrt_data
./build_openwrt.sh
This automatically downloads and unpacks multiple OpenWRT versions for analysis.
Generate stripped firmware variants (default):
cd datasets/wago_data
./setup_wago_data.sh
Explicitly use stripped binaries:
./setup_wago_data.sh --stripped
Use symbol-rich binaries:
./setup_wago_data.sh --symbols
Dry-run mode (no changes, just print actions):
./setup_wago_data.sh --dry-run --symbols
After completion, `experiment_samples/` will contain:
03.10.10-backdoor/
03.10.10-clean/
03.10.08-clean/
Each contains a fully unpacked firmware root filesystem with `usr/sbin/dropbear` replaced by the selected clean or backdoor sample.
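
One way to confirm that a dataset holds the intended binary is to compare SHA-256 digests of the installed `usr/sbin/dropbear` and the chosen sample. The sketch below is an assumption-laden illustration rather than part of the repository: `sha256_of` and `dropbear_matches` are hypothetical helper names, and the caller must supply the actual path to the unpacked root filesystem.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dropbear_matches(rootfs: Path, sample: Path) -> bool:
    """Check that rootfs/usr/sbin/dropbear is byte-identical to the sample.

    Hypothetical helper: `rootfs` is assumed to be the directory that
    directly contains `usr/`.
    """
    installed = rootfs / "usr" / "sbin" / "dropbear"
    return sha256_of(installed) == sha256_of(sample)
```

A digest comparison detects any byte-level difference, so it also catches a run in which the `--symbols` binary was installed where the stripped one was expected.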
| WAGO Firmware Version | Dropbear Sample (stripped) | Dropbear Sample (symbols) |
|---|---|---|
| 03.10.08 clean | `dropbear83-clean.stripped` | `dropbear83-clean` |
| 03.10.10 clean | `dropbear86-clean.stripped` | `dropbear86-clean` |
| 03.10.10 backdoor | `dropbear86-backdoor.stripped` | `dropbear86-backdoor` |
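
The mapping in this table can be expressed as a small lookup, which may be convenient in evaluation scripts. The sketch below is illustrative only: `SAMPLES` and `sample_path` are hypothetical names, and the default base directory `dropbear_samples` assumes the layout described earlier in this document.

```python
from pathlib import Path

# Dataset label -> base name of the Dropbear sample (taken from the table).
SAMPLES = {
    "03.10.08-clean": "dropbear83-clean",
    "03.10.10-clean": "dropbear86-clean",
    "03.10.10-backdoor": "dropbear86-backdoor",
}


def sample_path(dataset: str, mode: str = "stripped",
                base: Path = Path("dropbear_samples")) -> Path:
    """Resolve the sample file for a dataset label and binary mode.

    Hypothetical helper: `mode` mirrors the script's --stripped / --symbols
    choice, with stripped as the default.
    """
    if mode not in ("stripped", "symbols"):
        raise ValueError(f"unknown mode: {mode!r}")
    name = SAMPLES[dataset]  # raises KeyError for an unknown dataset label
    if mode == "stripped":
        return base / "stripped" / f"{name}.stripped"
    return base / "symbols" / name
```

For instance, `sample_path("03.10.10-backdoor")` resolves to `dropbear_samples/stripped/dropbear86-backdoor.stripped`, and passing `mode="symbols"` selects the symbol-rich counterpart.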
- OpenWRT Firmware: https://downloads.openwrt.org/
- WAGO PLC Firmware: https://github.com/WAGO/pfc-firmware
- Original Multi-Firmware Dataset: https://github.com/WUSTL-CSPL/Firmware-Dataset