IEQLab/office_profiler

LLM Capacity for Structured Extraction from Occupant Feedback

Companion code for the SocialSys'26 workshop paper "Towards Uncovering Indoor Satisfaction Profiles: LLM Capacity for Structured Extraction from Occupant Feedback."

This repository contains the analysis pipeline, paper source, and a synthetic demo dataset for a study that combines latent profile analysis (LPA) of occupant satisfaction ratings with locally deployed LLM extraction of structured complaint dimensions (tone, severity, attribution, work impact) from free-text feedback.

The published numerical results — eight occupant profiles, model-size thresholds for kappa ≥ 0.6 with human coders, and the classification benchmark showing text/rating orthogonality — were produced on the licensed CBE Occupant Survey database, which cannot be redistributed. The repository ships a fully synthetic dataset so that the pipeline can be exercised end-to-end on a fresh clone.
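The agreement threshold mentioned above can be made concrete with a small base-R sketch. It assumes Cohen's kappa for two raters; the README does not say which kappa variant the paper uses, and the function name `cohen_kappa` and the example labels are hypothetical, not taken from the repository or the study data.

```r
# Cohen's kappa for two raters labelling the same items.
# Returns NaN when both raters use a single identical label throughout,
# because expected agreement is then 1.
cohen_kappa <- function(rater_a, rater_b) {
  stopifnot(length(rater_a) == length(rater_b), length(rater_a) > 0)
  # Share one set of levels so the contingency table is square.
  lev <- sort(unique(c(as.character(rater_a), as.character(rater_b))))
  tab <- table(factor(rater_a, levels = lev), factor(rater_b, levels = lev))
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                       # raw agreement
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement
  (p_observed - p_expected) / (1 - p_expected)
}

# Hypothetical severity labels for illustration only (not study data).
human <- c("high", "low", "low", "medium", "high", "low", "medium", "high")
llm   <- c("high", "low", "medium", "medium", "high", "low", "low", "high")

kappa <- cohen_kappa(human, llm)
print(round(kappa, 3))
print(kappa >= 0.6)  # compare against the threshold quoted above
```

Here the two raters agree on 6 of 8 items (0.75) while chance agreement is 22/64, so kappa is 13/21, roughly 0.619.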

Contents

office_profiler/
├── _targets.R                  # targets pipeline definition
├── R/                          # pure functions sourced by targets
├── scripts/                    # one-shot scripts for the real-data path
├── data/
│   ├── raw/README.md           # how to obtain the real CBE database
│   └── synthetic/              # synthetic demo dataset + generator
├── paper/                      # ACM sigconf source for the workshop paper
│   ├── main-text.tex
│   ├── img/                    # figures (regenerated by tar_make)
│   └── references.bib
├── LICENSE                     # MIT (code)
├── LICENSE-paper               # CC-BY 4.0 (paper text + figures)
├── CITATION.cff                # citation metadata
├── CONTRIBUTING.md             # issue / PR guidance
├── renv.lock                   # locked package versions
└── .Rprofile                   # renv activation

Quick start (synthetic demo)

The synthetic demo runs the full pipeline on simulated data without needing access to the CBE database. Targets that call Ollama use a pre-computed embedding cache, so the demo runs CPU-only.

git clone https://github.com/IEQLab/office_profiler.git
cd office_profiler

# in R, from the repo root
renv::restore()           # install locked package versions

# regenerate the synthetic dataset (already checked in; this is optional)
source("data/synthetic/generate_synthetic.R")

targets::tar_make()       # build all figures

After tar_make() completes, the figures are written to paper/img/:

  • 6_validation_kappa.png — LLM-vs-human agreement by model size
  • 8_classification_comparison.png — 5-model classification benchmark
  • 1_fit_comparison.png, 1_classification_quality.png, 1_split_half_ari.png — LPA diagnostics

A successful demo run takes about 3–5 minutes on a recent laptop.
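One way to confirm that a demo run produced everything listed above is to check for the figure files by name. The sketch below uses only base R; the helper `missing_figures` is a hypothetical convenience and not part of the repository.

```r
# Figure files named in the list above, expected under paper/img/.
expected_figures <- c(
  "6_validation_kappa.png",
  "8_classification_comparison.png",
  "1_fit_comparison.png",
  "1_classification_quality.png",
  "1_split_half_ari.png"
)

# Return the expected figures that are absent from `img_dir`.
missing_figures <- function(img_dir = file.path("paper", "img")) {
  paths <- file.path(img_dir, expected_figures)
  expected_figures[!file.exists(paths)]
}

# Run from the repo root after targets::tar_make().
missing <- missing_figures()
if (length(missing) == 0) {
  message("All expected figures are present.")
} else {
  message("Missing figures: ", paste(missing, collapse = ", "))
}
```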

Reproducing the paper results (real CBE data)

  1. Request access to the CBE Occupant Survey database (see data/raw/README.md).

  2. Place the export at data/raw/db_all.rds.

  3. Pull the LLM and embedding models in Ollama:

    ollama pull gemma3:27b       # main extraction model (validated)
    ollama pull llama3.1:8b      # alt model variant for kappa benchmark
    ollama pull llama3.2:3b      # alt model variant
    ollama pull nomic-embed-text # embedding model

  4. Run the real-data path:

    source("scripts/01-data.R")           # cleans data/raw/db_all.rds
    # Before the next step, the LPA profile assignments must exist at
    # data/processed/df_profiles.rds. Either run the LPA targets with
    # targets::tar_make(c(model_lpa, df_profiles)) and save df_profiles
    # from the model output, or supply the file manually.
    source("scripts/02-llm-extraction.R") # runs Ollama, ~1 hour
    source("scripts/03-llm-validation.R") # kappa across models
    targets::tar_make()                   # rebuilds figures from real outputs

    On a workstation with a single 24 GB GPU, the gemma3:27b extraction step takes roughly 60 minutes for ~6,000 stratified responses; embedding generation is cached after the first run.
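Because the extraction step runs for about an hour, it can help to confirm that the two input files named in the steps above are in place before starting. This is a minimal base-R sketch; `required_inputs` and `check_inputs` are hypothetical names, and the check says nothing about whether the files have the structure the scripts expect.

```r
# Input paths taken from the steps above.
required_inputs <- c(
  raw_export = file.path("data", "raw", "db_all.rds"),
  profiles   = file.path("data", "processed", "df_profiles.rds")
)

# Return a named logical vector: TRUE where the file exists under `root`.
check_inputs <- function(root = ".") {
  ok <- file.exists(file.path(root, required_inputs))
  names(ok) <- names(required_inputs)
  ok
}

# Run from the repo root; reports missing files without stopping the session.
status <- check_inputs()
if (all(status)) {
  message("Both real-data inputs are present.")
} else {
  message("Missing inputs: ",
          paste(required_inputs[!status], collapse = ", "))
}
```

Note that data/processed/df_profiles.rds is only needed from scripts/02-llm-extraction.R onward, so a missing `profiles` entry is expected until the LPA step has been run.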

Hardware notes

  • CPU-only: the synthetic demo runs in a few minutes on any modern laptop. The full real-data pipeline also works CPU-only, but the gemma3:27b extraction then takes many hours.
  • GPU: an NVIDIA card with ≥ 16 GB VRAM (or an Apple Silicon machine with ≥ 32 GB unified memory) is recommended for the real-data path.

Citation

If you use this repository or the synthetic demo data, please cite the workshop paper (preferred) and the repository:

@inproceedings{parkinson2026towards,
  title     = {Towards Uncovering Indoor Satisfaction Profiles:
               LLM Capacity for Structured Extraction from
               Occupant Feedback},
  author    = {Parkinson, Thomas and Schiavon, Stefano and
               Zhang, Wenhao and Miller, Clayton},
  booktitle = {Proceedings of the SocialSys'26 Workshop},
  year      = {2026},
  publisher = {ACM}
}

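Licence

Code is released under the MIT licence (see LICENSE). The paper text and figures are released under CC-BY 4.0 (see LICENSE-paper).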

Acknowledgements

We thank the Center for the Built Environment (UC Berkeley) for access to the Occupant Survey database, and the ACM SocialSys'26 reviewers for their feedback.
