# EcoMetricx — Retrieval Demo (MVP)

This notebook demonstrates an early version of the query pipeline:
- Load normalized documents (Phase 2 output)
- If they are missing, run the normalization script to generate them
- Chunk each document into retrieval units (simple, page-aware)
- Build a TF-IDF index (CPU-only)
- Run keyword search over the index and show citations

Run cells top-to-bottom to try queries.


In [5]:
from pathlib import Path
import json, subprocess, sys
from datetime import datetime

project_root = Path.cwd()
run_id_path = project_root / '.current_run_id'
run_id = run_id_path.read_text().strip() if run_id_path.exists() else None
print('Project root:', project_root)
print('Current run_id:', run_id)

Project root: /root/Programming Projects/Personal/EcoMetricx
Current run_id: 20250903_093826


In [6]:
# Ensure normalized documents exist; auto-run normalization if missing
norm_base = project_root / 'data' / 'normalized' / 'visual_extraction'
if run_id is None or not (norm_base / run_id / 'documents.jsonl').exists():
	print('Normalized documents missing; running normalization script...')
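	ret = subprocess.run([sys.executable, str(project_root / 'scripts' / 'normalize_to_documents.py')], capture_output=True, text=True)
	print(ret.stdout)
	if ret.stderr:
		# Show errors even when the script also wrote to stdout
		print(ret.stderr, file=sys.stderr)
	# Refresh run_id if it was None
	if run_id is None and run_id_path.exists():
		run_id = run_id_path.read_text().strip()

# Fail with a clear message instead of a TypeError from `norm_base / None`
assert run_id is not None, f'No run id found at {run_id_path}'
norm_doc = norm_base / run_id / 'documents.jsonl'
assert norm_doc.exists(), f'Missing {norm_doc}'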
rows = [json.loads(l) for l in norm_doc.read_text(encoding='utf-8').splitlines() if l.strip()]
print('Documents loaded:', len(rows))
print('Document id:', rows[0]['document_id'])


Documents loaded: 1
Document id: emx:visual_extraction:6a55e73ff2d9


## Create simple chunks (page-aware)
Each non-empty page of the document becomes one retrieval unit, tagged with its page number and a placeholder section path.


In [7]:
doc = rows[0]
chunks = []
for p in doc.get('pages', []):
	text = p.get('text','').strip()
	if not text:
		continue
	chunks.append({
		'chunk_index': len(chunks),
		'page_num': p.get('page_number', 0),
		'parent_document_id': doc['document_id'],
		'section_path': f"page/{p.get('page_number', 0)}",
		'text': text
	})
print('Chunks:', len(chunks))
print(chunks[0]['text'][:200] + ('...' if len(chunks[0]['text'])>200 else ''))


Chunks: 2
Home Energy Report: electricity March report Account number: 954137 Service address: 1627 Tulip Lane Dear JILL DOE, here is your usage analysis for March. Your electric use: 18% more than similar near...


## Build TF-IDF index
We use scikit-learn's `TfidfVectorizer` to index the chunk texts. TF-IDF matches on shared terms, so this is fast keyword search rather than true semantic search.


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [c['text'] for c in chunks]
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(corpus)
print('Index shape:', X.shape)



Index shape: (2, 135)
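
To make the weighting concrete, here is a minimal pure-Python sketch of the TF-IDF formula that `TfidfVectorizer` applies with its default settings (smoothed IDF followed by L2 normalization). The whitespace tokenizer, the helper names `tfidf_vectors` and `cosine`, and the `toy_docs` corpus are illustrative assumptions, not the notebook's data; the real vectorizer also applies its own token pattern and the English stop-word list.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (idf weights, L2-normalized TF-IDF vectors as sparse dicts)."""
    tokenized = [d.lower().split() for d in docs]  # assumption: whitespace tokens
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return idf, vectors

def cosine(a, b):
    """Dot product of two L2-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

toy_docs = ["energy report for march", "energy saving tips"]  # hypothetical corpus
idf, vecs = tfidf_vectors(toy_docs)
print(idf["energy"], idf["tips"])
print(cosine(vecs[0], vecs[1]))
```

A term that appears in every document (`energy` here) gets the minimum IDF of 1.0, while rarer terms are weighted higher, which is why distinctive words dominate the ranking.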


## Try a query
Pass a query string to `search`. It computes the cosine similarity between the query and each chunk in TF-IDF space and returns the top `k` results with citations (document id and page).


In [9]:
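def search(query: str, k: int = 3):
	"""Return the top-k chunks ranked by TF-IDF cosine similarity to `query`."""
	qv = vectorizer.transform([query])
	scores = cosine_similarity(qv, X)[0]
	# Indices of the k highest scores, best first
	topk = scores.argsort()[::-1][:k]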
	results = []
	for idx in topk:
		c = chunks[idx]
		results.append({
			'score': float(scores[idx]),
			'document_id': c['parent_document_id'],
			'page_num': c['page_num'],
			'section_path': c['section_path'],
			'snippet': c['text'][:240] + ('...' if len(c['text'])>240 else '')
		})
	return results

for r in search('energy savings tips'):
	print(r['score'], r['document_id'], f"page {r['page_num']}")
	print(r['snippet'])
	print('---')



0.4048466114956216 emx:visual_extraction:6a55e73ff2d9 page 1
Your top three tailored energy-saving tips Caulk windows and doors Upgrade your refrigerator Adjust thermostat settings Save money and energy Look for an Energy Star label Biggest energy saving option One of the biggest Older model Set your...
---
0.12844813733286536 emx:visual_extraction:6a55e73ff2d9 page 0
Home Energy Report: electricity March report Account number: 954137 Service address: 1627 Tulip Lane Dear JILL DOE, here is your usage analysis for March. Your electric use: 18% more than similar nearby homes You TT A bove Similar nearby ho...
---
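
Because `search` always returns the `k` highest-scoring chunks, a chunk that shares no terms with the query (score 0) can still appear in the results. The sketch below shows one possible way to drop such hits and render each remaining one as a citation line. The helper name `format_citations`, the `min_score` parameter, and the `sample` records are hypothetical; only the `score`, `document_id`, and `page_num` keys come from the result dicts above.

```python
def format_citations(results, min_score=0.0):
    """Drop hits scoring at or below min_score; render one citation line per hit."""
    lines = []
    for r in results:
        if r["score"] <= min_score:
            continue  # a zero score means no query term matched this chunk
        lines.append(
            f"[{r['document_id']}, page {r['page_num']}] score={r['score']:.3f}"
        )
    return lines

# Hypothetical results shaped like the dicts returned by search()
sample = [
    {"score": 0.40, "document_id": "doc-a", "page_num": 1},
    {"score": 0.0, "document_id": "doc-a", "page_num": 0},
]
print("\n".join(format_citations(sample)))
```

In the notebook this would be called as `format_citations(search('energy savings tips'))`; raising `min_score` trades recall for fewer weak matches.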
