LLM Capacity for Structured Extraction from Occupant Feedback

Companion code for the SocialSys'26 workshop paper "Towards Uncovering Indoor Satisfaction Profiles: LLM Capacity for Structured Extraction from Occupant Feedback."

This repository contains the analysis pipeline, paper source, and a synthetic demo dataset for a study that combines latent profile analysis (LPA) of occupant satisfaction ratings with locally deployed LLM extraction of structured complaint dimensions (tone, severity, attribution, work impact) from free-text feedback.

The published numerical results — eight occupant profiles, model-size thresholds for kappa ≥ 0.6 with human coders, and the classification benchmark showing text/rating orthogonality — were produced on the licensed CBE Occupant Survey database, which cannot be redistributed. The repository ships a fully synthetic dataset so that the pipeline can be exercised end-to-end on a fresh clone.

office_profiler/
├── _targets.R                  # targets pipeline definition
├── R/                          # pure functions sourced by targets
├── scripts/                    # one-shot scripts for the real-data path
├── data/
│   ├── raw/README.md           # how to obtain the real CBE database
│   └── synthetic/              # synthetic demo dataset + generator
├── paper/                      # ACM sigconf source for the workshop paper
│   ├── main-text.tex
│   ├── img/                    # figures (regenerated by tar_make)
│   └── references.bib
├── LICENSE                     # MIT (code)
├── LICENSE-paper               # CC-BY 4.0 (paper text + figures)
├── CITATION.cff                # citation metadata
├── CONTRIBUTING.md             # issue / PR guidance
├── renv.lock                   # locked package versions
└── .Rprofile                   # renv activation

Quick start (synthetic demo)

The synthetic demo runs the full pipeline on simulated data without needing access to the CBE database. Targets that call Ollama use a pre-computed embedding cache, so the demo runs CPU-only.

git clone https://github.com/IEQLab/office_profiler.git
cd office_profiler

# in R, from the repo root
renv::restore()           # install locked package versions

# regenerate the synthetic dataset (already checked in; this is optional)
source("data/synthetic/generate_synthetic.R")

targets::tar_make()       # build all figures

After tar_make() completes, the figures are written to paper/img/:

6_validation_kappa.png — LLM-vs-human agreement by model size
8_classification_comparison.png — 5-model classification benchmark
1_fit_comparison.png, 1_classification_quality.png, 1_split_half_ari.png — LPA diagnostics

A successful demo run takes about 3–5 minutes on a recent laptop.
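
As a quick sanity check after the run, the sketch below reports which of the figures listed above are absent from paper/img/. The helper name `missing_figures` is hypothetical and not part of the repository. It only checks that the files exist, so it cannot tell a freshly regenerated figure from one that was already checked in.

```r
# Figure names are copied from the list above.
expected_figures <- c(
  "6_validation_kappa.png",
  "8_classification_comparison.png",
  "1_fit_comparison.png",
  "1_classification_quality.png",
  "1_split_half_ari.png"
)

# Hypothetical helper: return the expected figure names that are
# not present in `img_dir`.
missing_figures <- function(img_dir = file.path("paper", "img"),
                            expected = expected_figures) {
  expected[!file.exists(file.path(img_dir, expected))]
}

# Run from the repo root.
missing <- missing_figures()
if (length(missing) == 0) {
  message("All expected figures are present.")
} else {
  message("Missing figures: ", paste(missing, collapse = ", "))
}
```

If a figure is reported missing, `targets::tar_outdated()` lists the targets that have not been built yet, which is usually the quickest way to see where the pipeline stopped.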

Reproducing the paper results (real CBE data)

Request access to the CBE Occupant Survey database (see data/raw/README.md).
Place the export at data/raw/db_all.rds.

Pull the LLM and embedding models in Ollama:

ollama pull gemma3:27b       # main extraction model (validated)
ollama pull llama3.1:8b      # alt model variant for kappa benchmark
ollama pull llama3.2:3b      # alt model variant
ollama pull nomic-embed-text # embedding model

Run the real-data path:

source("scripts/01-data.R")           # cleans data/raw/db_all.rds
# The LPA profile assignments must already exist at
# data/processed/df_profiles.rds. Either run the LPA targets with
# targets::tar_make(c(model_lpa, df_profiles)) and write df_profiles
# from the model output, or supply the assignments manually.
source("scripts/02-llm-extraction.R") # runs Ollama, ~1 hour
source("scripts/03-llm-validation.R") # kappa across models
targets::tar_make()                   # rebuilds figures from real outputs

On a workstation with a single 24 GB GPU, the gemma3:27b extraction step takes roughly 60 minutes for ~6,000 stratified responses; embedding generation is cached after the first run.
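
The validation step reports LLM-vs-human agreement as kappa, with 0.6 as the threshold quoted above. This README does not say which kappa variant scripts/03-llm-validation.R computes, so the following is only a minimal base-R sketch that assumes unweighted Cohen's kappa for two raters, κ = (p_o − p_e) / (1 − p_e). The function name and the example labels are made up for illustration and are not data from the study.

```r
# Unweighted Cohen's kappa for two raters labelling the same items.
# Assumption: this is the variant of interest; the repository's own
# validation script may use a different one (e.g. weighted kappa).
cohens_kappa <- function(rater_a, rater_b) {
  stopifnot(length(rater_a) == length(rater_b), length(rater_a) > 0)
  # Use a shared set of levels so the confusion table is square.
  lv <- sort(unique(c(as.character(rater_a), as.character(rater_b))))
  tab <- table(factor(rater_a, levels = lv), factor(rater_b, levels = lv))
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                       # raw agreement
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement
  # Returns NaN when both raters use a single identical label throughout,
  # because chance agreement is then 1.
  (p_observed - p_expected) / (1 - p_expected)
}

# Made-up severity labels for eight comments (illustration only).
human <- c("low", "low", "high", "high", "medium", "low", "high", "medium")
llm   <- c("low", "low", "high", "medium", "medium", "low", "high", "high")

agreement <- cohens_kappa(human, llm)
meets_threshold <- agreement >= 0.6
```

For ordered dimensions such as severity, a weighted kappa that penalises near-misses less than distant disagreements may be the more natural choice; the unweighted form is shown here only because it is the simplest to verify by hand.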

Hardware notes

CPU-only: the synthetic demo runs in a few minutes on any modern laptop. The full real-data pipeline also runs CPU-only, but the gemma3:27b extraction then takes many hours.
GPU: an NVIDIA card with ≥ 16 GB VRAM (or an Apple Silicon machine with ≥ 32 GB unified memory) is recommended for the real-data path.

Citation

If you use this repository or the synthetic demo data, please cite the workshop paper (preferred) and the repository:

@inproceedings{parkinson2026towards,
  title     = {Towards Uncovering Indoor Satisfaction Profiles:
               LLM Capacity for Structured Extraction from
               Occupant Feedback},
  author    = {Parkinson, Thomas and Schiavon, Stefano and
               Zhang, Wenhao and Miller, Clayton},
  booktitle = {Proceedings of the SocialSys'26 Workshop},
  year      = {2026},
  publisher = {ACM}
}

Licence

Code — MIT
Paper text and figures — CC BY 4.0

Acknowledgements

Thanks to the Center for the Built Environment (UC Berkeley) for access to the Occupant Survey database, and to the ACM SocialSys'26 reviewers for their feedback.
