Experimental pipeline for studying how contextual information affects open-source LLM performance in misinformation detection.
Investigate whether increasing context consistently improves model performance, or whether excessive or misleading context degrades model reliability.
Task: Binary classification (real/fake) + explanation generation.
| Dataset | Domain | Modality | Status |
|---|---|---|---|
| ClaimReview | fact-check metadata | text | ✅ working (auto-downloads) |
| Fakeddit | Reddit posts | text + images | ✅ working (via fetch script) |
| FakeNewsNet | news articles | text + images | ✅ working (via fetch script) |
| MuMiN | social media claims | text + images | ✅ working (via fetch script) |
src/
├── run_experiment.py # main unified pipeline runner
├── predictors.py # inference backend abstraction + HF/OpenAI providers
├── google_factcheck.py # cached Google Fact Check grounding layer
├── context_ablation.py # context-budget parsing and truncation helpers
├── schema.py # canonical UnifiedRecord schema
├── cleaning.py # data cleaning pipeline
├── prompts.py # context-variant prompt generation
├── evaluation.py # metrics, confusion matrix, comparison
├── training.py # deterministic corpus export for adapter tuning
├── train_hf_adapter.py # LoRA/PEFT training entrypoint
├── summary.py # dataset summary sheets
├── costs.py # optional cost estimation
├── google_factcheck_starter.py # original starter downloader (legacy)
└── datasets/
├── base.py # abstract dataset loader
├── claimreview.py # Google/Data Commons ClaimReview feed
├── fakeddit.py # Fakeddit (Reddit)
├── fakenewsnet.py # FakeNewsNet (PolitiFact + GossipCop)
├── mumin.py # MuMiN (stub)
└── data/ # local dataset files (gitignored)
docs/
├── experimental-design.md # experiment framing + context variants
├── label-mapping.md # per-dataset label mapping + justification
├── meeting-task-checklist.md # RA1/RA2 task status
├── normalized-schema.md # schema design notes
├── data-source-notes.md # data source investigation
├── reproducibility.md # how to rerun consistently
├── visualisation.md # embedded charts and narrative interpretation
├── presentation-brief.md # presentation guide
└── sources.md # references
templates/ # prompt templates
config/ # configuration examples
slides/ # presentation materials
CHANGELOG.md # recent history + unreleased changes
artifacts/ # pipeline output artifacts
The fetcher downloads ClaimReview plus manageable sample subsets for Fakeddit, FakeNewsNet, and MuMiN into src/datasets/data/:
claimreviewis stored as the raw feed JSON payloadfakedditis stored as TSVfakenewsnetis stored as a fetched CSV export and loaded directly by the pipelinemuminis stored as a fetched CSV export and loaded directly by the pipeline
# Install required dependencies
pip install -r requirements.txt
# Run the fetcher (default: attempt full corpora, then fall back to a sampled mirror when needed)
python src/datasets/fetch_datasets.py
# Force a sampled acquisition path for a quick local refresh
python src/datasets/fetch_datasets.py --max-samples 5000If the local dataset mirror already exists, the fetcher leaves it in place. For the Hugging Face-backed sources, omitting --max-samples attempts a full split first and only retries as a sample when that full export fails.
The repository reads provider credentials and model settings from process environment variables. A repository-local .env file is loaded automatically when present.
Create .env from .env.example and set the values you actually need for your workflow.
Common variables:
HF_MODEL_ID— Hugging Face model used for local inference and adapter trainingHF_TOKEN— optional but recommended for higher HF Hub rate limits and faster authenticated downloadsHF_DEVICE_MAP/HF_TORCH_DTYPE— optional local loading controls for large modelsHF_MAX_NEW_TOKENS,HF_TEMPERATURE,HF_TOP_P— local generation controlsGOOGLE_FACTCHECK_API_KEY— required for--ground-with-googleOPENAI_API_KEY,OPENAI_BASE_URL,OPENAI_MODEL— required for--mode openai-compatible
If HF_TOKEN is unset, Hugging Face downloads still work but the Hub treats them as unauthenticated requests.
Use the dedicated builder to merge the currently available local mirrors into one canonical dataset plus one flat analysis export:
python3 src/build_master_dataset.py \
--dataset all \
--output-dir artifacts/master_datasetThe builder reuses the existing loaders and cleaning pipeline, then writes:
normalized_records.{json,jsonl,csv}— deduplicated canonicalUnifiedRecordcorpus for experiments and trainingmaster_records.{json,jsonl,csv}— canonical merged dataset with provenance fieldsmaster_flat_records.{json,jsonl,csv}— flat export withid,Title,Claim,news source,Real,tweet_nummaster_dataset_manifest.json— per-dataset acquisition and row-count summarymaster_dataset_coverage.json— null-rate coverage for the flat export fields
Important label note:
- Internal pipeline records still use
mapped_label: 0 = real, 1 = fake - The flat master export uses
Real: 1 = real, 0 = fake
The default inference path now targets the combined master dataset instead of an individual source dataset. Qwen is the current first-wave target:
export HF_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct
python3 src/run_experiment.py \
--dataset master_dataset \
--limit 100 \
--balanced \
--mode huggingface \
--output-dir artifacts/master_run--dataset master_dataset resolves the canonical builder outputs automatically. By default it prefers artifacts/master_dataset/ and falls back to src/datasets/data/master_dataset/ if needed.
# Minimal context (claim text only)
python3 src/run_experiment.py \
--dataset master_dataset \
--context-mode minimal \
--limit 100 \
--mode huggingface \
--output-dir artifacts/context_minimal
# Full context (claim + article + metadata)
python3 src/run_experiment.py \
--dataset master_dataset \
--context-mode full \
--limit 100 \
--mode huggingface \
--output-dir artifacts/context_full
# Misleading context (adversarial)
python3 src/run_experiment.py \
--dataset master_dataset \
--context-mode misleading \
--limit 100 \
--mode huggingface \
--output-dir artifacts/context_misleadingThis evaluates the same records across multiple context budgets so you can measure when predictions flip as context is removed:
export HF_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct
python3 src/run_experiment.py \
--dataset master_dataset \
--context-mode full \
--context-ablation-levels 1.0,0.75,0.5,0.25 \
--limit 100 \
--mode huggingface \
--output-dir artifacts/context_ablationThe evaluation report now includes by_context_budget metrics and context_ablation threshold summaries.
You can enrich records with cached Google fact-check lookups before prompt generation:
export HF_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct
export GOOGLE_FACTCHECK_API_KEY=your-api-key
python3 src/run_experiment.py \
--dataset master_dataset \
--context-mode full \
--mode huggingface \
--ground-with-google \
--google-factcheck-ttl-hours 24 \
--output-dir artifacts/grounded_runEach grounded run writes grounding_report.json, a reusable Google cache file, and enriched prompts/records when matches are found.
The adapter trainer reuses the same prompt contract as inference and exports deterministic train/eval chat examples before training.
export HF_MODEL_ID=Qwen/Qwen2.5-1.5B-Instruct
# Export train/eval corpora only
python3 src/train_hf_adapter.py \
--dataset master_dataset \
--model-id "$HF_MODEL_ID" \
--prepare-only \
--output-dir artifacts/training/qwen_adapter
# Fine-tune a LoRA adapter
python3 src/train_hf_adapter.py \
--dataset master_dataset \
--model-id "$HF_MODEL_ID" \
--output-dir artifacts/training/qwen_adapterFor local Apple Silicon validation, prefer --device-map mps. CPU-only training remains available, but even tiny Qwen adapter runs can be impractically slow without MPS acceleration.
# Validated tiny local training run on Apple Silicon
python3 src/train_hf_adapter.py \
--dataset master_dataset \
--limit 4 \
--context-mode minimal \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--output-dir artifacts/training/qwen_adapter_tiny_run_mps \
--num-train-epochs 1 \
--per-device-train-batch-size 1 \
--per-device-eval-batch-size 1 \
--gradient-accumulation-steps 1 \
--logging-steps 1 \
--device-map mps \
--max-length 256 \
--lora-rank 4 \
--lora-alpha 8Training outputs include:
training_data/train_examples.jsonltraining_data/eval_examples.jsonltraining_data/training_manifest.jsonadapter/with the saved LoRA weights and tokenizer filestraining_summary.json
If you want grounded training examples, add --ground-with-google and set GOOGLE_FACTCHECK_API_KEY.
Per-dataset runs remain available as an explicit fallback when you want source-specific comparisons.
# Fakeddit (uses fetched data in src/datasets/data/fakeddit/)
python3 src/run_experiment.py --dataset fakeddit --limit 100 --mode huggingface
# FakeNewsNet (uses fetched data in src/datasets/data/fakenewsnet/)
python3 src/run_experiment.py --dataset fakenewsnet --limit 100 --mode huggingface
# All individual datasets as separate runs
python3 src/run_experiment.py --dataset all --limit 50 --mode huggingfaceThis remains available for comparison, but it is not the primary research path.
export OPENAI_API_KEY=...
export OPENAI_BASE_URL=https://api.openai.com/v1
export OPENAI_MODEL=gpt-4o-mini
python3 src/run_experiment.py \
--dataset claimreview \
--context-mode full \
--mode openai-compatible \
--output-dir artifacts/llm_runnormalized_records.{json,jsonl,csv}— cleaned unified recordsprompts.jsonl— generated prompts with contextpredictions.jsonl— model predictionsevaluation_report.{json,md}— accuracy, F1, confusion matrixdataset_summary.{json,md}— sample counts, label distribution, missing datavisualizations/dashboard.html— browser-friendly chart dashboard for the runvisualizations/visualization_report.md— explains each chart, its source artifact, and the main trendcleaning_report.json— what was removed during cleaninggrounding_report.json— Google fact-check cache and match statistics when grounding is enabledrun_manifest.json— full run parameters
Training runs additionally write training_data/ manifests and adapter checkpoints under the chosen training output directory.
For a concise summary of recent committed and unreleased changes, see CHANGELOG.md.
See docs/visualisation.md for a docs-native page that embeds the latest aggregate visuals and explains them.
The pipeline supports three context variants to study context sensitivity:
| Mode | What's included | Tests |
|---|---|---|
minimal |
claim text only | can the model classify from surface-level text? |
full |
claim + article body + metadata | does relevant context improve classification? |
misleading |
claim + adversarial context | does irrelevant context degrade reliability? |
The full-context path also supports --context-budget and --context-ablation-levels so runs can quantify how much context can be removed before predictions flip.
- CLI-first, no hidden notebook state
- explicit CLI arguments and saved run manifests
- explicit model selection and context-budget settings saved in run manifests
- stable record IDs across reruns
- artifacts saved at every pipeline stage
- offline-capable token cost tracking and balanced sampling
- automated
src/datasets/fetch_datasets.pyfor standardizing data ingestion undersrc/datasets/data/ - cached Google Fact Check lookups with TTL-based reuse
- deterministic train/eval corpus export for adapter fine-tuning
Phase 1 (open-source): Qwen, Gemma, Granite via Hugging Face Phase 2 (optional closed-source APIs): GPT-4/4o, Gemini, Claude
- Google/Data Commons gives structured ClaimReview metadata, not the full article body.
- Dataset loaders read from
src/datasets/data/by default; usepython src/datasets/fetch_datasets.pyto populate that directory. - FakeNewsNet's Hugging Face mirror exposes a
realcolumn in CSV form; the pipeline interprets1 => realand0 => fake, which is consistent with the upstream FakeNewsNet repository's separate*_real.csvand*_fake.csvsplits. - MuMiN is currently loaded from the fetched CSV export under
src/datasets/data/mumin/mumin.csv; the loader derives the split from thetrain_mask,val_mask, andtest_maskcolumns. - The primary workflow is now open-source-model-first; the OpenAI-compatible path remains optional and secondary.
- Adapter fine-tuning supervision is deterministic and derived from the current binary labels plus prompt context, which makes it reproducible but not yet explanation-rich.
- TechRxiv survey: From Fact Verification to Understanding Misleadingness: A Survey and Roadmap on Reader-Centric Multimodal Misinformation Detection