TY-Cheng/reporting-risk-cascade

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Reporting Risk Cascade

reporting-risk-cascade is a reproducible research workspace for measuring corporate reporting risk at the filing origin. The project studies whether a publicly observable review-and-correction process can be predicted from information available when an issuer files, rather than treating an ex post detected misstatement label as the only empirical target.

The current paper has two linked evidence layers. The legacy benchmark layer uses a gvkey x data_year detected-misstatement panel to diagnose timing, observability, drift, and missingness. The public cascade layer builds an issuer-year panel from SEC and PCAOB sources and predicts subsequent public scrutiny, amendment or filing friction, Item 4.02 non-reliance, and rare AAER severity-tail events.

Research Object

| Layer | Unit | Role |
| --- | --- | --- |
| Legacy benchmark | gvkey x data_year | Diagnostic benchmark for detected-misstatement labels, timing assumptions, concept drift, and missingness. |
| Public lake | SEC/PCAOB filings and events | Public-data construction layer for filings, XBRL, Notes summaries, Form AP, PCAOB inspections, comment letters, amendments, 8-K Item 4.02, and AAER support data. |
| Public cascade | issuer_cik x fiscal_year | Main filing-origin prediction target for public review-and-correction risk. |
| Bridge validation | gvkey-CIK-year | Construct-overlap layer linking legacy detected-misstatement labels to public cascade labels and scores. |

Documentation site: Reporting Risk Cascade. The detailed research design is in Paper Plan, the interpretation of the static run is in Results Snapshot, and deferred extensions are in Future Work.

Public Review-And-Correction Labels

The public cascade is a multi-label outcome design. These labels are public observability states, not alternative names for fraud.

| Label | Horizon | Interpretation | Primary public source |
| --- | --- | --- | --- |
| label_comment_thread_365 | 365 days | SEC comment-letter scrutiny after the filing origin. | SEC filing review process and public EDGAR correspondence. |
| label_amendment_365 | 365 days | Amended filing, correction, or filing-friction signal. | SEC EDGAR filing access and amended filing form metadata. |
| label_8k_402_365 | 365 days | Item 4.02 non-reliance or material-correction signal. | SEC Form 8-K, Item 4.02. |
| label_aaer_proxy_730 | 730 days | Rare enforcement severity-tail descriptor. | SEC Accounting and Auditing Enforcement Releases and farr::aaer_* support data. |

AAER is retained as severity-tail evidence because it is sparse and selective; it is not the headline prediction target.
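The horizon construction behind these labels can be sketched as a small helper. This is an illustration of the windowing logic only; the function name, event-type keys, and boundary convention are assumptions, not the repo's actual API:

```python
from datetime import date, timedelta

def horizon_labels(filing_date, events, horizons):
    """Binary labels: did each event type occur within its horizon after filing?

    `events` maps an event type to the dates it was observed;
    `horizons` maps the same event type to a horizon length in days.
    """
    labels = {}
    for event_type, days in horizons.items():
        window_end = filing_date + timedelta(days=days)
        hits = events.get(event_type, [])
        labels[f"label_{event_type}_{days}"] = int(
            any(filing_date < d <= window_end for d in hits)
        )
    return labels

# Example: a comment-letter thread 90 days after filing trips the 365-day label.
labels = horizon_labels(
    date(2020, 3, 1),
    events={"comment_thread": [date(2020, 5, 30)], "aaer_proxy": []},
    horizons={"comment_thread": 365, "aaer_proxy": 730},
)
# labels == {"label_comment_thread_365": 1, "label_aaer_proxy_730": 0}
```

Keeping the horizon in the label name, as the repo does, makes each target self-documenting when metrics files list many labels side by side.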

Repository Layout

config/       YAML settings for benchmark, public cascade, public data, and study runs
data/         local raw inputs and ignored public-lake data
docs/         MkDocs source pages
scripts/      thin command-line wrappers and operational scripts
src/          reusable implementation modules
tests/        tests for runtime behavior, docs, public lake, and model contracts
artifacts/    generated outputs, sample panels, logs, and run reports
doc/          local reference PDFs

Setup

Use just as the stable command surface. It loads .env and uses the external UV_PROJECT_ENVIRONMENT configured for this machine.

cp .env.example .env
just setup
just status

The default local benchmark input is:

data/raw_dataset_misstatement.parquet

If only the legacy CSV exists, convert it once:

uv run python scripts/convert_raw_dataset.py

A clean GitHub checkout without the benchmark data can run just check and fixture-based smoke checks. Benchmark, study, and full workflows require the raw benchmark Parquet or the legacy CSV.

Execution Contract

Quality gate:

just check

just check is data-free. It runs the core pytest coverage gate, the focused public-lake coverage gate, ruff, and the strict MkDocs build. Core runtime modules must remain at or above 95% coverage; the larger public-lake builder has a separate 93% toy-data gate.

Paper-facing core run:

just full full raw artifacts/full

This runs setup, tests, lint, public-lake build or resume, and the core study components: benchmark, public cascade, bridge probe, and construct-overlap validation when inputs exist. If a full build has already completed earlier stages, resume from DAG markers:

just full mode=full dataset=raw out_dir=artifacts/full resume=1
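A minimal sketch of marker-based resume logic, assuming each completed stage writes a `<stage>.done` file under a `dag/` directory in the output folder; the real marker layout and stage names in this repo may differ:

```python
from pathlib import Path

STAGES = ["setup", "tests", "lint", "public_lake", "benchmark", "cascade", "bridge"]

def stages_to_run(out_dir, stages=STAGES):
    """Return the stages whose completion marker is missing.

    A resume run skips any stage with an existing `<stage>.done` marker,
    so interrupted full builds pick up where they left off.
    """
    marker_dir = Path(out_dir) / "dag"
    return [s for s in stages if not (marker_dir / f"{s}.done").exists()]
```

With `resume=1`, only the stages returned by such a check are re-executed; a fresh build simply sees no markers and runs everything.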

Peer-compatible model-family transfer:

just task study raw artifacts/full_with_peer \
  extra="--peer-comparison-mode full --peer-target both --parallel-jobs 4 --model-threads 2 --seed-policy task-isolated"

This reruns the study layer against the completed public lake and adds the legacy benchmark peer suite plus the public-label peer transfer suite on issuer_origin_panel.parquet. The public peer suite covers comment_thread, amendment, and 8k_402; aaer_proxy remains a sparse severity-tail status. Use --peer-target public to refresh public-label peer outputs without rerunning the legacy benchmark peer suite.

Component reruns:

just task benchmark raw artifacts/benchmark
just task cascade raw artifacts/public_cascade
just task bridge raw artifacts/bridge_probe
just task study raw artifacts/study
uv run python scripts/run_construct_overlap.py --study-dir artifacts/full_with_peer
just docs

just docs first runs a strict clean MkDocs build, then serves the site on the first free local port in 8001-8010.
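Picking the first free port in 8001-8010 can be done with the standard library; this is a sketch of the idea, not the repo's implementation:

```python
import socket

def first_free_port(start=8001, end=8010):
    """Return the first local TCP port in [start, end] that can be bound."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port busy, try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{end}")
```

Scanning a fixed range avoids clobbering another locally served docs instance while keeping the URL predictable.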

Public Lake

For the full public lake, use the operational script so logs and monitoring are captured:

bash scripts/run_public_lake_full.sh --dry-run
bash scripts/run_public_lake_full.sh --mode smoke \
  --submissions-max-ciks 200 --fetch-workers 2 \
  --engine duckdb --duckdb-threads 4 \
  --duckdb-memory-limit 10GB --duckdb-max-temp-size 400GB \
  --fsds-batch-size 4 --notes-batch-size 2 \
  --storage-format parquet --notes-mode summary --fresh-build
nohup bash scripts/run_public_lake_full.sh --mode full > artifacts/logs/public_lake_full/nohup.log 2>&1 &

Large Silver and Gold tables use Parquet by default: issuer_dim.parquet, filing_dim.parquet, form_ap_event.parquet, xbrl_fact_summary.parquet, batch-sharded xbrl_core_fact/, note_summary.parquet, issuer_origin_panel.parquet, and year-sharded filing_origin_panel.parquet. Parquet is the default for tables that can grow past roughly 10MB because it avoids repeated gzip CSV decompression, preserves types, and lets DuckDB project only needed columns.

Small diagnostics and human-readable status files remain CSV, JSON, or Markdown. Notes raw text is skipped unless notes_mode="raw" is requested.
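The format rule above amounts to a small dispatch on expected size. The threshold constant, function name, and arguments here are illustrative, not the repo's real API:

```python
PARQUET_THRESHOLD_BYTES = 10 * 1024 * 1024  # the ~10MB cutover described above

def storage_format(estimated_bytes, is_diagnostic=False):
    """Pick an output format following the lake's conventions.

    Small human-readable diagnostics stay CSV regardless of size;
    anything that can grow past roughly 10MB goes to Parquet so DuckDB
    can project columns without decompressing whole gzip CSV files.
    """
    if is_diagnostic:
        return "csv"
    return "parquet" if estimated_bytes >= PARQUET_THRESHOLD_BYTES else "csv"
```

Centralizing the decision in one helper keeps Silver and Gold writers consistent as tables cross the size threshold.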

Primary Artifacts

Benchmark:

  • artifacts/benchmark/rolling_metrics.csv
  • artifacts/benchmark/timing_coverage.csv
  • artifacts/benchmark/feature_family_importance.csv
  • artifacts/benchmark/missing_profile_clusters.csv
  • artifacts/benchmark/benchmark_summary.md

Public cascade:

  • data/public_lake/gold/issuer_origin_panel.parquet
  • data/public_lake/gold/filing_origin_panel.parquet
  • artifacts/public_cascade/public_cascade_metrics.csv
  • artifacts/public_cascade/public_cascade_predictions.parquet
  • artifacts/public_cascade/public_opacity_dml.csv
  • artifacts/public_cascade/public_cascade_summary.md

Bridge and construct overlap:

  • artifacts/bridge_probe/bridge_probe_summary.json
  • artifacts/bridge_probe/coverage_report.csv
  • artifacts/bridge_probe/multiplicity_report.csv
  • artifacts/full_with_peer/construct_overlap/construct_overlap_summary.md
  • artifacts/full_with_peer/construct_overlap/public_score_legacy_ranking.csv
  • artifacts/full_with_peer/construct_overlap/reciprocal_alignment.csv

Peer-compatible model-family suites:

  • artifacts/full_with_peer/peer_comparison/
  • artifacts/full_with_peer/public_peer_comparison/

Bridge Inputs

The integration bridge is:

data/external/gvkey_cik_year.csv

The repo cannot infer this table from the legacy benchmark alone because the benchmark carries gvkey and data_year but no CIK, ticker, company name, CUSIP, or PERMNO. Prefer a WRDS/Compustat CIK-GVKEY link export or an equivalent institutional crosswalk:

set -a; source .env; set +a
uv run python scripts/prepare_gvkey_cik_crosswalk.py \
  --input path/to/wrds_cik_gvkey_link.csv \
  --out data/external/gvkey_cik_year.csv \
  --source wrds_compustat_cik_gvkey_link \
  --source-version "YYYY-MM-DD"

just task bridge raw

When WRDS is unavailable, prepare a provenance-tagged candidate bridge from farr::gvkey_ciks:

bash scripts/prepare_farr_gvkey_cik_bridge.sh --install-missing

This exports data/external/farr_gvkey_ciks_raw.csv, normalizes annual links to data/external/gvkey_cik_year.csv, and runs the bridge probe. Treat this as a candidate bridge whose coverage and multiplicity must be reported, not as a silent substitute for a WRDS-verified table.
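The coverage and multiplicity reporting required of a candidate bridge could look like the following sketch; the tuple layout and metric names are assumptions, not the bridge probe's actual schema:

```python
from collections import defaultdict

def bridge_diagnostics(links, benchmark_keys):
    """Coverage and multiplicity for a candidate gvkey-CIK-year bridge.

    `links` is an iterable of (gvkey, cik, year) tuples; `benchmark_keys`
    is the set of (gvkey, year) pairs in the legacy panel. Coverage is the
    share of benchmark keys with at least one link; multiplicity flags
    keys that map to more than one CIK.
    """
    ciks = defaultdict(set)
    for gvkey, cik, year in links:
        ciks[(gvkey, year)].add(cik)
    covered = [k for k in benchmark_keys if ciks.get(k)]
    multi = [k for k in covered if len(ciks[k]) > 1]
    return {
        "coverage": len(covered) / len(benchmark_keys) if benchmark_keys else 0.0,
        "one_to_many_share": len(multi) / len(covered) if covered else 0.0,
    }
```

Reporting both numbers side by side makes clear when high coverage is bought with ambiguous one-to-many links, which is exactly the caveat the candidate-bridge disclosure is meant to surface.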

The same package provides support inputs:

bash scripts/prepare_farr_support_data.sh --install-missing

This exports farr::aaer_dates, farr::aaer_firm_year, and farr::state_hq. The AAER files support severity-tail overlap diagnostics; they do not replace the main public-cascade labels. farr::state_hq is used as a date-bounded, public-origin headquarters-state metadata feature when data/external/farr_state_hq.csv exists.

Accepted crosswalk columns are gvkey, issuer_cik or cik, plus either data_year/fiscal_year/fyear or start_year and end_year. Prepared files retain provenance fields: source, source_version, extracted_at, match_method, and match_score.
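Normalizing the accepted column variants into one link per year can be sketched with a small helper; the function name and dict-based row input are illustrative:

```python
def to_annual_links(rows):
    """Normalize crosswalk rows to one (gvkey, cik, year) link per year.

    Each input dict carries either a single-year key (`data_year`,
    `fiscal_year`, or `fyear`) or a `start_year`/`end_year` range,
    mirroring the accepted column variants above.
    """
    annual = []
    for row in rows:
        gvkey = row["gvkey"]
        cik = row.get("issuer_cik") or row.get("cik")
        single = next(
            (row[k] for k in ("data_year", "fiscal_year", "fyear") if k in row), None
        )
        if single is not None:
            years = [int(single)]
        else:
            years = range(int(row["start_year"]), int(row["end_year"]) + 1)
        annual.extend((gvkey, cik, y) for y in years)
    return annual
```

Expanding ranges at preparation time keeps the downstream bridge probe simple: it only ever joins on an exact (gvkey, year) key.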

Current Gates

  1. The public-cascade run is the current non-metadata xbrl_ratio_baseline snapshot.
  2. farr::gvkey_ciks is the current high-coverage candidate bridge when WRDS is unavailable; coverage and multiplicity must be reported.
  3. Candidate bridge overlap can support a related-but-non-identical construct argument. WRDS or equivalent validation remains preferred for final manuscript-grade integrated claims.
  4. AAER is a severity-tail descriptor and robustness anchor, not the headline prediction target.

About

Public-data workflow for measuring corporate reporting risk, label timing, and scrutiny-correction-enforcement cascades.