Error Detection Pipeline (ForestED)

This repository provides the codebase for the paper "Ensembling LLM-Induced Decision Trees for Explainable and Robust Error Detection", including:

Building a Data Catalog
Sampling rows using GP-based uncertainty
Running the Decision Forest (ForestED) Error Detection Model

1. Build a Data Catalog

The data catalog extracts column profiles, statistical summaries, value distributions, and metadata from the input CSV.

Run:

python3 data_catalog.py path/to/table.csv --out path/to/catalog.json
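
For orientation, the kind of per-column profile a data catalog holds can be sketched with the standard library alone. This is an illustrative sketch, not the repository's implementation: the function name `profile_columns`, the field names in the output, and the sample table are all assumptions made for the example.

```python
import csv
import io
import json
import statistics
from collections import Counter


def _as_float(value):
    """Return value parsed as a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def profile_columns(rows, top_k=3):
    """Build a per-column profile from dict rows (as produced by csv.DictReader).

    Hypothetical helper: the field names below are assumptions, not the
    schema written by the repository's catalog script.
    """
    catalog = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v not in ("", None)]
        numbers = [_as_float(v) for v in present]
        is_numeric = bool(present) and all(n is not None for n in numbers)
        profile = {
            "inferred_type": "numeric" if is_numeric else "string",
            "row_count": len(values),
            "missing_count": len(values) - len(present),
            "distinct_count": len(set(present)),
            "top_values": Counter(present).most_common(top_k),
        }
        if is_numeric:
            # Statistical summary only makes sense for numeric columns.
            profile["min"] = min(numbers)
            profile["max"] = max(numbers)
            profile["mean"] = statistics.fmean(numbers)
        catalog[col] = profile
    return catalog


# Tiny in-memory table standing in for path/to/table.csv.
sample_csv = "city,beds\nBoston,120\nBoston,80\nAustin,\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
catalog = profile_columns(rows)
print(json.dumps(catalog, indent=2))
```

The actual catalog produced by the script may contain more fields (or different ones); the sketch only shows how column profiles, summaries, and value distributions fit into one JSON-serializable structure.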

2. Generate Samples

To reduce LLM cost and guide tree induction, the system uses Gaussian Process (GP) uncertainty sampling to select ~5% of rows.

Run:

python3 gp_sampling.py input.csv --out sample.csv --frac 0.05
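
To illustrate the idea behind GP uncertainty sampling, here is a minimal pure-Python sketch. It assumes rows have already been encoded as numeric feature vectors and uses an RBF kernel with greedy selection of the row with the highest posterior variance. The kernel choice, the greedy strategy, and the names `rbf_kernel` and `gp_uncertainty_sample` are assumptions for this example; the repository's script may work differently.

```python
import math


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def gp_uncertainty_sample(features, frac=0.05, noise=1e-6, length_scale=1.0):
    """Greedily pick the rows with the largest GP posterior variance.

    After each pick, the posterior variance of every row receives a rank-one
    correction, so rows similar to already-selected ones become less
    attractive. Hypothetical helper, not the repository's API.
    """
    n = len(features)
    budget = min(n, max(1, round(frac * n))) if n else 0
    variance = [rbf_kernel(f, f, length_scale) for f in features]  # prior variance
    factors = []   # one correction column per selected row
    selected = []
    for _ in range(budget):
        candidates = [i for i in range(n) if i not in selected]
        j = max(candidates, key=lambda i: variance[i])
        # Posterior covariance between row j and every row i.
        cov = [
            rbf_kernel(features[i], features[j], length_scale)
            - sum(col[i] * col[j] for col in factors)
            for i in range(n)
        ]
        scale = math.sqrt(variance[j] + noise)
        column = [c / scale for c in cov]
        # Clamp at zero to guard against floating-point round-off.
        variance = [max(v - c * c, 0.0) for v, c in zip(variance, column)]
        factors.append(column)
        selected.append(j)
    return selected


# Two tight clusters: one pick per cluster is expected with a budget of 2.
features = [[0.0], [0.1], [5.0], [5.1]]
selected = gp_uncertainty_sample(features, frac=0.5)
print(selected)
```

The sketch runs in O(n² · budget) time, which is fine for illustration but would need a vectorized or approximate variant on large tables.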

3. Run Error Detection Using Decision Forest

After completing the preparation steps above, run the full ForestED error detection pipeline:

python3 decision_forest.py \
    --clean ./data/hospital_clean.csv \
    --dirty ./data/hospital_error-01.csv \
    --catalog ./data/hospital_error-01_catalog_1.json \
    --outdir ./data/hospital_output
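
Because the command takes both a clean and a dirty table, one plausible use of the pair is to derive ground-truth error cells and score the detector against them. The sketch below assumes exactly that: cell-wise comparison of row-aligned tables, followed by precision, recall, and F1. The helper names, the sample data, and the `predicted` set are hypothetical, and the repository may evaluate differently.

```python
import csv
import io


def error_cells(clean_rows, dirty_rows):
    """Return the set of (row_index, column) cells where dirty differs from clean.

    Assumes both tables have the same row order and columns.
    """
    return {
        (i, col)
        for i, (clean, dirty) in enumerate(zip(clean_rows, dirty_rows))
        for col in clean
        if clean[col] != dirty.get(col)
    }


def precision_recall_f1(predicted, actual):
    """Score predicted error cells against the ground-truth error cells."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# In-memory stand-ins for the --clean and --dirty CSV files.
clean_csv = "city,beds\nBoston,120\nAustin,80\n"
dirty_csv = "city,beds\nBostn,120\nAustin,8O\n"
clean_rows = list(csv.DictReader(io.StringIO(clean_csv)))
dirty_rows = list(csv.DictReader(io.StringIO(dirty_csv)))

actual = error_cells(clean_rows, dirty_rows)
predicted = {(0, "city"), (1, "city")}  # hypothetical detector output
print(precision_recall_f1(predicted, actual))
```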
