# EcoMetricx — Retrieval Demo (MVP)

This notebook demos the early query system pipeline:
- Load normalized documents (Phase 2 output)
- If they are missing, auto-run normalization to generate them
- Chunk into retrieval units (simple, page-aware)
- Build a TF-IDF index (CPU-only)
- Run keyword/semantic-ish search and show citations

Run cells top-to-bottom to try queries.


In [5]:
from pathlib import Path
import json, subprocess, sys
from datetime import datetime

project_root = Path.cwd()
run_id_path = project_root / '.current_run_id'
run_id = run_id_path.read_text().strip() if run_id_path.exists() else None
print('Project root:', project_root)
print('Current run_id:', run_id)

Project root: /root/Programming Projects/Personal/EcoMetricx
Current run_id: 20250903_093826


In [6]:
# Ensure normalized documents exist; auto-run normalization if missing
norm_base = project_root / 'data' / 'normalized' / 'visual_extraction'
if run_id is None or not (norm_base / run_id / 'documents.jsonl').exists():
	print('Normalized documents missing; running normalization script...')
	ret = subprocess.run([sys.executable, str(project_root / 'scripts' / 'normalize_to_documents.py')], capture_output=True, text=True)
	print(ret.stdout or ret.stderr)
	# Refresh run_id if it was None
	if run_id is None and run_id_path.exists():
		run_id = run_id_path.read_text().strip()

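# Fail early with a clear message if normalization did not produce a run id
assert run_id is not None, 'No run_id found; expected .current_run_id after normalization'
norm_doc = norm_base / run_id / 'documents.jsonl'
assert norm_doc.exists(), f'Missing {norm_doc}'
rows = [json.loads(line) for line in norm_doc.read_text(encoding='utf-8').splitlines() if line.strip()]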
print('Documents loaded:', len(rows))
print('Document id:', rows[0]['document_id'])


Documents loaded: 1
Document id: emx:visual_extraction:6a55e73ff2d9


## Create simple chunks (page-aware)
We split the document text by pages into retrieval units and attach page numbers and a section path placeholder.


In [10]:
chunk_file = project_root / 'data' / 'chunks' / 'visual_extraction' / run_id / 'chunks.jsonl'
if not chunk_file.exists():
	print('Chunks missing; running chunking script...')
	ret = subprocess.run([sys.executable, str(project_root / 'scripts' / 'chunk_and_redact.py')], capture_output=True, text=True)
	print(ret.stdout or ret.stderr)

chunks = [json.loads(l) for l in chunk_file.read_text(encoding='utf-8').splitlines() if l.strip()]
print('Chunks loaded:', len(chunks))
print('Sample chunk:', {k: chunks[0][k] for k in ('parent_document_id','page_num','section_path')})



Chunks loaded: 1
Sample chunk: {'parent_document_id': 'emx:visual_extraction:6a55e73ff2d9', 'page_num': 1, 'section_path': 'page/1'}


In [11]:
doc = rows[0]
chunks = []
for p in doc.get('pages', []):
	text = p.get('text','').strip()
	if not text:
		continue
	chunks.append({
		'chunk_index': len(chunks),
		'page_num': p.get('page_number', 0),
		'parent_document_id': doc['document_id'],
		'section_path': f"page/{p.get('page_number', 0)}",
		'text': text
	})
print('Chunks:', len(chunks))
print(chunks[0]['text'][:200] + ('...' if len(chunks[0]['text'])>200 else ''))


Chunks: 2
Home Energy Report: electricity March report Account number: 954137 Service address: 1627 Tulip Lane Dear JILL DOE, here is your usage analysis for March. Your electric use: 18% more than similar near...


## Build TF-IDF index
We use scikit-learn's `TfidfVectorizer` to index chunk texts for quick keyword/semantic-ish search.


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [c['text'] for c in chunks]
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(corpus)
print('Index shape:', X.shape)



Index shape: (2, 135)


## Try a query
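Enter a query. We compute cosine similarity in TF-IDF space and show the top results with citations (document id + page). The TF-IDF `search()` helper is defined in cell In [22] further down; the embedding search falls back to it, so run that cell first if embeddings are unavailable.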
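
To make the scoring concrete, here is a minimal standard-library sketch of TF-IDF cosine ranking. It is not the notebook's implementation (that uses scikit-learn); the names `tokenize`, `weigh`, `tfidf_vectors`, `toy_search`, and `toy_docs` are hypothetical, and the sketch assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` with L2 normalization, which matches scikit-learn's documented defaults. Unlike the notebook's vectorizer, it does not remove stop words.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase word tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def weigh(tokens, idf):
    # Terms unseen at fit time are dropped, as with a fitted vectorizer.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def tfidf_vectors(docs):
    # Returns the idf table and one L2-normalized sparse vector per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return idf, [weigh(toks, idf) for toks in tokenized]

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def toy_search(query, docs, k=3):
    idf, vecs = tfidf_vectors(docs)
    qv = weigh(tokenize(query), idf)
    scored = sorted(((cosine(qv, v), i) for i, v in enumerate(vecs)), reverse=True)
    return scored[:k]  # list of (score, document index), best first

# Hypothetical two-chunk corpus, loosely modeled on the report pages.
toy_docs = [
    "home energy report electricity usage for march",
    "energy saving tips caulk windows adjust thermostat",
]
ranking = toy_search("energy tips", toy_docs)
```

In this toy corpus the second chunk ranks first because it shares both query terms, and the rarer term `tips` carries a higher IDF weight than `energy`, which appears in both chunks.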


## Embeddings-based retrieval (BGE-small)
We use FastEmbed with `BAAI/bge-small-en-v1.5` to embed chunks and queries, then perform cosine similarity search. Falls back to TF-IDF if unavailable.
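
The cosine similarity step can be sketched without any embedding model. The example below is a hedged illustration only: `l2_normalize`, `top_k_cosine`, and the three-dimensional `toy_matrix` are hypothetical stand-ins for the 384-dimensional BGE vectors, and plain Python lists replace the NumPy arrays used in the cells that follow.

```python
import math

def l2_normalize(vec, eps=1e-9):
    # Scale to unit length; eps guards against division by zero.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

def top_k_cosine(query_vec, matrix, k=3):
    # Cosine similarity is the dot product of unit-length vectors.
    q = l2_normalize(query_vec)
    scores = [sum(a * b for a, b in zip(l2_normalize(row), q)) for row in matrix]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]  # (row index, score), best first

# Toy "chunk embeddings" (assumption: real rows would come from embeddings.jsonl).
toy_matrix = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
hits = top_k_cosine([1.0, 1.0, 0.0], toy_matrix, k=2)
```

If `k` exceeds the number of stored vectors, the slice simply returns every row, so a single-row matrix yields at most one hit.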


In [15]:
# Ensure embeddings exist; auto-generate if missing (auto-install fastembed)
from importlib import util as _iu

emb_dir = project_root / 'data' / 'index' / 'pgvector' / run_id
emb_file = emb_dir / 'embeddings.jsonl'
manifest_file = emb_dir / 'index_manifest.json'

# Ensure fastembed is available in this kernel
if _iu.find_spec('fastembed') is None:
	print('Installing fastembed in current kernel...')
	_ = subprocess.run([sys.executable, '-m', 'pip', 'install', 'fastembed', '--quiet'], text=True)

# Generate embeddings if missing
if not emb_file.exists():
	print('Embeddings missing; generating...')
	ret = subprocess.run([sys.executable, str(project_root / 'scripts' / 'embed_chunks.py')], capture_output=True, text=True)
	print(ret.stdout or ret.stderr)

if emb_file.exists():
	print('Embeddings ready:', emb_file)
	manifest = json.loads(manifest_file.read_text())
	print('Model:', manifest.get('model'), 'dim:', manifest.get('embedding_dim'))
else:
	print('Embeddings not available; search will use TF-IDF fallback.')


Installing fastembed in current kernel...



Embeddings missing; generating...
Wrote data/index/pgvector/20250903_093826/embeddings.jsonl and data/index/pgvector/20250903_093826/index_manifest.json

Embeddings ready: /root/Programming Projects/Personal/EcoMetricx/data/index/pgvector/20250903_093826/embeddings.jsonl
Model: BAAI/bge-small-en-v1.5 dim: 384


In [16]:
# Build in-memory embedding matrix if available
import numpy as np

emb_records = []
if emb_file.exists():
	for line in emb_file.read_text().splitlines():
		if not line.strip():
			continue
		r = json.loads(line)
		emb_records.append(r)
	E = np.vstack([np.array(r['embedding_vector'], dtype=np.float32) for r in emb_records]) if emb_records else None
	id_to_idx = {r['chunk_id']: i for i, r in enumerate(emb_records)}
	print('Embedding matrix:', E.shape if E is not None else None)
else:
	E = None
	id_to_idx = {}



Embedding matrix: (1, 384)


## Normalize chunk IDs (safety)
Ensure every chunk has a `chunk_id` and build a lookup by id. This prevents KeyError if earlier cells created temporary chunks without IDs.


In [18]:
# Ensure each chunk has a stable chunk_id and build lookup
for i, c in enumerate(chunks):
	if 'chunk_id' not in c:
		c['chunk_id'] = f"{c['parent_document_id']}:c{c.get('chunk_index', i)}"
chunk_by_id = {c['chunk_id']: c for c in chunks}
print('Unique chunk ids:', len(chunk_by_id))



Unique chunk ids: 2


In [19]:
# Override search_embedded to use chunk_by_id
import numpy as np

def search_embedded(query: str, k: int = 3):
	if E is None:
		print('Embeddings not loaded; falling back to TF-IDF search()')
		return search(query, k)
	from fastembed import TextEmbedding
	emb = TextEmbedding('BAAI/bge-small-en-v1.5')
	qv = np.array(list(emb.embed([query]))[0], dtype=np.float32)
	qv = qv / (np.linalg.norm(qv) + 1e-9)
	Ev = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-9)
	scores = Ev @ qv
	idxs = scores.argsort()[-k:][::-1]
	results = []
	for idx in idxs:
		chunk_id = emb_records[idx]['chunk_id']
		c = chunk_by_id.get(chunk_id)
		if not c:
			continue
		results.append({
			'score': float(scores[idx]),
			'document_id': c['parent_document_id'],
			'page_num': c['page_num'],
			'section_path': c['section_path'],
			'snippet': c['text'][:240] + ('...' if len(c['text'])>240 else '')
		})
	return results

# Demo
for r in search_embedded('energy savings tips'):
	print(r['score'], r['document_id'], f"page {r['page_num']}")
	print(r['snippet'])
	print('---')



0.7733845710754395 emx:visual_extraction:6a55e73ff2d9 page 0
Home Energy Report: electricity March report Account number: 954137 Service address: 1627 Tulip Lane Dear JILL DOE, here is your usage analysis for March. Your electric use: 18% more than similar nearby homes You TT A bove Similar nearby ho...
---


In [21]:
# Embedding-backed search; falls back to TF-IDF

def search_embedded(query: str, k: int = 3):
	if E is None:
		print('Embeddings not loaded; falling back to TF-IDF search()')
		return search(query, k)
	from fastembed import TextEmbedding
	emb = TextEmbedding('BAAI/bge-small-en-v1.5')
	qv = np.array(list(emb.embed([query]))[0], dtype=np.float32)
	# cosine similarity
	qv = qv / (np.linalg.norm(qv) + 1e-9)
	Ev = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-9)
	scores = Ev @ qv
	idxs = scores.argsort()[-k:][::-1]
	results = []
	for idx in idxs:
		chunk_id = emb_records[idx]['chunk_id']
		c = next((c for c in chunks if c['chunk_id'] == chunk_id), None)
		if c is None:
			# Skip embeddings whose chunk is not in the in-memory list
			continue
		results.append({
			'score': float(scores[idx]),
			'document_id': c['parent_document_id'],
			'page_num': c['page_num'],
			'section_path': c['section_path'],
			'snippet': c['text'][:240] + ('...' if len(c['text'])>240 else '')
		})
	return results

# Demo
for r in search_embedded('energy savings tips'):
	print(r['score'], r['document_id'], f"page {r['page_num']}")
	print(r['snippet'])
	print('---')



0.7733845710754395 emx:visual_extraction:6a55e73ff2d9 page 0
Home Energy Report: electricity March report Account number: 954137 Service address: 1627 Tulip Lane Dear JILL DOE, here is your usage analysis for March. Your electric use: 18% more than similar nearby homes You TT A bove Similar nearby ho...
---


In [22]:
def search(query: str, k: int = 3):
	qv = vectorizer.transform([query])
	scores = cosine_similarity(qv, X)[0]
	topk = scores.argsort()[::-1][:k]
	results = []
	for idx in topk:
		c = chunks[idx]
		results.append({
			'score': float(scores[idx]),
			'document_id': c['parent_document_id'],
			'page_num': c['page_num'],
			'section_path': c['section_path'],
			'snippet': c['text'][:240] + ('...' if len(c['text'])>240 else '')
		})
	return results

for r in search('energy savings tips'):
	print(r['score'], r['document_id'], f"page {r['page_num']}")
	print(r['snippet'])
	print('---')



0.4048466114956216 emx:visual_extraction:6a55e73ff2d9 page 1
Your top three tailored energy-saving tips Caulk windows and doors Upgrade your refrigerator Adjust thermostat settings Save money and energy Look for an Energy Star label Biggest energy saving option One of the biggest Older model Set your...
---
0.12844813733286536 emx:visual_extraction:6a55e73ff2d9 page 0
Home Energy Report: electricity March report Account number: 954137 Service address: 1627 Tulip Lane Dear JILL DOE, here is your usage analysis for March. Your electric use: 18% more than similar nearby homes You TT A bove Similar nearby ho...
---


## Optional: Ingest current run into Postgres (Phase 5)
This cell applies the migration and ingests documents, chunks, and embeddings into Postgres using `DATABASE_URL`. It requires the pgvector extension to be installed on the target database.


In [23]:
# Ingest to Postgres using DATABASE_URL
import os
from importlib import util as _iu

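# Ensure psycopg and dotenv are available in the kernel.
# Map import names to PyPI package names: the `dotenv` module ships in `python-dotenv`.
pip_names = {'psycopg': 'psycopg', 'dotenv': 'python-dotenv'}
missing = [m for m in pip_names if _iu.find_spec(m) is None]
if missing:
	print('Installing missing packages:', missing)
	_ = subprocess.run([sys.executable, '-m', 'pip', 'install', *(pip_names[m] for m in missing), '--quiet'], text=True)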

# Apply migration (requires psql available). Skip if not present.
DATABASE_URL = os.environ.get('DATABASE_URL') or os.environ.get('POSTGRES_DSN')
if DATABASE_URL:
	print('Using DATABASE_URL')
	# Best-effort migration via psql if it is on PATH
	import shutil
	if shutil.which('psql'):
		print('Applying migration 001_init.sql')
		_ = subprocess.run(['psql', DATABASE_URL, '-f', str(project_root / 'db' / 'migrations' / '001_init.sql')], text=True)
	else:
		print('psql not found; please apply db/migrations/001_init.sql manually')
	# Ingest
	print('Ingesting current run...')
	ret = subprocess.run([sys.executable, str(project_root / 'scripts' / 'ingest_to_postgres.py')], capture_output=True, text=True)
	print(ret.stdout or ret.stderr)
else:
	print('DATABASE_URL not set; skipping ingestion')



Installing missing packages: ['psycopg', 'dotenv']
DATABASE_URL not set; skipping ingestion

