# Part A: Chroma Vector Database Ingestion

Set up a local Chroma store that the TTD-DR agent can reuse for feasibility-research retrieval. **Do not run the cells yet**: the dummy source links will be swapped for production data before ingestion.

## Notebook Goals
- Import the minimal tooling for Chroma + embeddings
- Define reusable configuration plus placeholder (HTML/JSON) data sources
- Fetch, clean, and chunk remote text; PDF ingestion is intentionally removed
- Persist a Chroma collection and a lightweight pickle manifest for downstream agents
- Provide clear entry points to replace dummy URLs with the real parcel intelligence feeds

In [7]:
# 1. Imports & Environment (run only after providing real source URLs)
import os
import json
import time
import pickle
from pathlib import Path
from typing import List, Dict
from uuid import uuid4

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

load_dotenv()
print("✅ Imports ready (execute once sources are finalized)")

✅ Imports ready (execute once sources are finalized)


## 2. Configuration
Define paths and parameters. Sources will be loaded from YAML in the next cell.


In [8]:
DATA_DIR = Path("data")
VECTOR_DIR = DATA_DIR / "vectorstores" / "chroma_feasibility"
MANIFEST_PATH = DATA_DIR / "vectorstores" / "chroma_manifest.pkl"
SOURCES_FILE = DATA_DIR / "sources.yaml"

EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "ttd_dr_feasibility_seed"
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120
REQUEST_TIMEOUT = 30

DATA_DIR.mkdir(parents=True, exist_ok=True)
VECTOR_DIR.mkdir(parents=True, exist_ok=True)
print(f"📂 Persist directory: {VECTOR_DIR}")
print(f"🗂️ Manifest path: {MANIFEST_PATH}")
print(f"📄 Sources file: {SOURCES_FILE}")


📂 Persist directory: data/vectorstores/chroma_feasibility
🗂️ Manifest path: data/vectorstores/chroma_manifest.pkl
📄 Sources file: data/sources.yaml


## 3. Load Sources from YAML
Load the source list from `data/sources.yaml`.


In [9]:
# 3. Load Sources from YAML
import yaml

def load_sources(path: Path) -> List[Dict[str, str]]:
    if not path.exists():
        raise FileNotFoundError(f"Sources file not found at {path}")
    with path.open("r", encoding="utf-8") as f:
        data = yaml.safe_load(f) or []
    if not isinstance(data, list):
        raise ValueError("Expected a list of sources in YAML")
    return data

SOURCES = load_sources(SOURCES_FILE)
print(f"📚 Loaded {len(SOURCES)} sources from {SOURCES_FILE}")
for src in SOURCES:
    print(f"  - {src.get('name', 'Unnamed Source')} [{src.get('type', 'unknown')}] -> {src.get('url')}")


📚 Loaded 15 sources from data/sources.yaml
  - Los Angeles County Assessor Portal [html] -> https://portal.assessor.lacounty.gov/
  - USGS The National Map [html] -> https://www.usgs.gov/programs/national-geospatial-program/national-map
  - OpenStreetMap Land Use & POIs [html] -> https://wiki.openstreetmap.org/wiki/Map_features
  - NYC Zoning Resolution Portal [html] -> https://zr.planning.nyc.gov/
  - US Census TIGER/Line Overview [html] -> https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html
  - FEMA Flood Map Service Center [html] -> https://msc.fema.gov/portal/home
  - USDA Web Soil Survey [html] -> https://websoilsurvey.sc.egov.usda.gov/App/HomePage.htm
  - EPA Envirofacts [html] -> https://enviro.epa.gov/
  - EPA EJScreen Portal [html] -> https://www.epa.gov/ejscreen
  - Transit.land National Transit Map [html] -> https://www.transit.land/
  - OpenAddresses Global Address Data [html] -> https://openaddresses.io/
  - US Census ACS Data Portal [h
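
`load_sources` only checks that the YAML top level is a list, while the later `build_documents` helper indexes `source["type"]`, `source["url"]`, and `source["name"]` directly. A small pre-flight check can surface malformed entries before any network call. The sketch below is one possible approach, not part of the original notebook: `validate_sources`, `REQUIRED_KEYS`, and `SUPPORTED_TYPES` are hypothetical names, and the sample entries are illustrative.

```python
from typing import Any, Dict, List

# Keys that build_documents reads without a default (hypothetical constant).
REQUIRED_KEYS = ("name", "type", "url")
# Types that the fetch helpers in this notebook handle.
SUPPORTED_TYPES = {"html", "json"}


def validate_sources(sources: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means every entry looks usable."""
    problems: List[str] = []
    for index, entry in enumerate(sources):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected a mapping, got {type(entry).__name__}")
            continue
        label = entry.get("name") or f"entry {index}"
        for key in REQUIRED_KEYS:
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{label}: missing or empty '{key}'")
        source_type = entry.get("type")
        if isinstance(source_type, str) and source_type.lower() not in SUPPORTED_TYPES:
            problems.append(f"{label}: unsupported type '{source_type}'")
    return problems


# Illustrative entries only; the second one is deliberately broken.
sample = [
    {"name": "EPA Envirofacts", "type": "html", "url": "https://enviro.epa.gov/"},
    {"name": "Broken entry", "type": "pdf"},
]
for problem in validate_sources(sample):
    print(problem)
```

Calling `validate_sources(SOURCES)` right after `load_sources` and stopping when the list is non-empty would keep a typo in `data/sources.yaml` from showing up later as a generic ingest failure.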

## 3. Source Fetching Helpers (HTML & JSON)
These utilities intentionally skip PDF/file downloads. Replace the dummy URLs with real feeds before running.

In [10]:
def fetch_html(url: str) -> str:
    response = requests.get(url, timeout=REQUEST_TIMEOUT)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.extract()
    text = " ".join(chunk.strip() for chunk in soup.stripped_strings)
    return text


def fetch_json(url: str) -> str:
    response = requests.get(url, timeout=REQUEST_TIMEOUT)
    response.raise_for_status()
    payload = response.json()
    return json.dumps(payload, indent=2)


def build_documents(sources: List[Dict[str, str]]) -> List[Document]:
    documents = []
    for source in sources:
        try:
            if source["type"].lower() == "html":
                raw_text = fetch_html(source["url"])
            elif source["type"].lower() == "json":
                raw_text = fetch_json(source["url"])
            else:
                print(f"⚠️ Skipping unsupported type: {source['type']} for {source['name']}")
                continue

            documents.append(
                Document(
                    page_content=raw_text,
                    metadata={
                        "source": source["url"],
                        "name": source["name"],
                        "notes": source.get("notes", ""),
                    },
                )
            )
            print(f"✅ Loaded {source['name']}")
        except Exception as exc:
            print(f"❌ Failed to ingest {source['name']}: {exc}")
    return documents

## 4. Chunk Strategy
Configure a `RecursiveCharacterTextSplitter` so long municipal reports or API payloads become retrieval-friendly passages.

In [11]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " "],
)
print(text_splitter)

<langchain_text_splitters.character.RecursiveCharacterTextSplitter object at 0x1405cd4f0>
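
To make the effect of `CHUNK_SIZE = 800` and `CHUNK_OVERLAP = 120` concrete, the sketch below uses a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers the listed separators and yields chunks of varying length. `sliding_chunks` is a hypothetical helper that only illustrates the stride (`size - overlap`) and the text shared between neighbouring chunks.

```python
from typing import List


def sliding_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if size <= 0 or overlap < 0 or overlap >= size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 2000-character text so the overlap is easy to check.
text = "".join(str(i % 10) for i in range(2000))
chunks = sliding_chunks(text, size=800, overlap=120)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-120:] == chunks[1][:120])
```

With these settings each window advances 680 characters, so the last 120 characters of one chunk reappear at the start of the next. That repetition is what lets a passage cut at a chunk boundary still be retrieved with some surrounding context.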


## 5. Initialize Embeddings + Chroma
Instantiate OpenAI embeddings and a persistent Chroma collection. Swap providers or deployment modes as needed.

In [12]:
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
vector_store = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
    persist_directory=str(VECTOR_DIR),
)
print(f"📚 Chroma collection ready: {COLLECTION_NAME}")

📚 Chroma collection ready: ttd_dr_feasibility_seed


## 6. Build Documents from YAML Sources
Load and process documents from the sources defined in `data/sources.yaml`.


In [13]:
raw_documents = build_documents(SOURCES)
print(f"Total raw documents: {len(raw_documents)}")

✅ Loaded Los Angeles County Assessor Portal
✅ Loaded USGS The National Map
✅ Loaded OpenStreetMap Land Use & POIs
✅ Loaded NYC Zoning Resolution Portal
✅ Loaded US Census TIGER/Line Overview
❌ Failed to ingest FEMA Flood Map Service Center: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))
✅ Loaded USDA Web Soil Survey
✅ Loaded EPA Envirofacts
❌ Failed to ingest EPA EJScreen Portal: 404 Client Error: Not Found for url: https://www.epa.gov/ejscreen
✅ Loaded Transit.land National Transit Map
✅ Loaded OpenAddresses Global Address Data
✅ Loaded US Census ACS Data Portal
❌ Failed to ingest Bureau of Labor Statistics Regional Data: 403 Client Error: Forbidden for url: https://www.bls.gov/regions/home.htm
❌ Failed to ingest Zillow Research Hub: 403 Client Error: Forbidden for url: https://www.zillow.com/research/data/
✅ Loaded TTD-DR Paper (arXiv abstract)
Total raw documents: 11
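
The log above mixes two kinds of failure: a transient one (the connection reset) and permanent ones (the 404 and the two 403 responses). Only the transient kind is worth retrying. The sketch below shows one way to add exponential backoff around any fetch call. `with_retries` is a hypothetical helper, not part of the notebook, and the delay values are arbitrary assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on a retryable error wait base_delay * 2**n and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Possible use inside build_documents (assumes `requests` and fetch_html from this notebook):
#   raw_text = with_retries(
#       lambda: fetch_html(source["url"]),
#       retry_on=(requests.ConnectionError, requests.Timeout),
#   )
```

Retrying will not help with the 403 responses. Some servers reject the default `python-requests` User-Agent, so passing a descriptive `headers={"User-Agent": ...}` to `requests.get` may be worth trying. That is an assumption about these particular sites, not something the log confirms. The 404 for the EJScreen URL points to a source entry that needs replacing.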


## 7. Chunk & Upsert into Chroma
Split the raw payloads, add them to the vector store, and persist the collection to disk.

In [14]:
chunked_documents = []
for doc in raw_documents:
    chunked_documents.extend(text_splitter.split_documents([doc]))

if chunked_documents:
    ids = [str(uuid4()) for _ in chunked_documents]
    vector_store.add_documents(documents=chunked_documents, ids=ids)
    # No explicit persist step: langchain_chroma's Chroma has no persist() method
    # (calling it raises AttributeError: 'Chroma' object has no attribute 'persist').
    # With persist_directory set, writes are saved to disk automatically.
    print(f"✅ Stored {len(chunked_documents)} chunks in {COLLECTION_NAME}")
else:
    print(f"⚠️ No documents were ingested. Update {SOURCES_FILE} and rerun.")
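
Because the cell assigns a fresh `uuid4()` to every chunk, re-running it adds a second copy of each chunk rather than updating the first. One way to make the step behave like a true upsert is to derive IDs deterministically from the source URL and the chunk's position within that source. The sketch below is an assumption-laden illustration: `stable_chunk_ids` is a hypothetical helper, and it relies on the vector store overwriting records whose IDs already exist, which is worth verifying against the installed `langchain_chroma` version.

```python
import hashlib
from typing import Dict, List


def stable_chunk_ids(sources: List[str], prefix_len: int = 32) -> List[str]:
    """Return one ID per chunk, derived from its source and position within that source."""
    counters: Dict[str, int] = {}
    ids: List[str] = []
    for source in sources:
        position = counters.get(source, 0)  # index of this chunk within its source
        counters[source] = position + 1
        digest = hashlib.sha256(f"{source}\x1f{position}".encode("utf-8")).hexdigest()
        ids.append(digest[:prefix_len])
    return ids


# Possible use in the cell above (chunked_documents comes from this notebook):
#   ids = stable_chunk_ids([d.metadata["source"] for d in chunked_documents])
#   vector_store.add_documents(documents=chunked_documents, ids=ids)
example = stable_chunk_ids(["https://enviro.epa.gov/", "https://enviro.epa.gov/", "https://openaddresses.io/"])
print(example)
```

One caveat of this scheme: if a page shrinks between runs, chunks at positions that no longer exist stay in the collection until they are deleted explicitly.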

## 8. Persist Retrieval Manifest (Pickle)
The manifest lets downstream agents reconnect to the same Chroma index without re-embedding.

In [None]:
def generate_manifest(documents: List[Document]) -> Dict:
    summary = []
    for doc in documents:
        summary.append(
            {
                "source": doc.metadata.get("source"),
                "name": doc.metadata.get("name"),
                "char_count": len(doc.page_content),
            }
        )

    manifest = {
        "collection_name": COLLECTION_NAME,
        "persist_directory": str(VECTOR_DIR),
        "embedding_model": EMBEDDING_MODEL,
        "document_summary": summary,
        "chunk_size": CHUNK_SIZE,
        "chunk_overlap": CHUNK_OVERLAP,
        "generated_at": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    return manifest


if raw_documents:
    manifest = generate_manifest(raw_documents)
    with MANIFEST_PATH.open("wb") as f:
        pickle.dump(manifest, f)
    print(f"📝 Manifest saved to {MANIFEST_PATH}")
else:
    print("⚠️ Manifest not written because no documents were ingested.")

## 9. Next Steps
1. Replace the dummy URLs with the real address/parcel research links you will provide.
2. Run the notebook top-to-bottom to ingest and persist the data.
3. Load `chroma_manifest.pkl` inside the retrieval layer to avoid re-embedding.
4. Version the `data/vectorstores` directory (or sync to object storage) so every agent run can mount the same context.
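
Step 3 could look roughly like the sketch below on the retrieval side. `load_manifest`, `reconnect`, and `REQUIRED_MANIFEST_KEYS` are hypothetical names. The manifest keys match those written by `generate_manifest`, and the `Chroma` and `OpenAIEmbeddings` calls mirror the constructor arguments used earlier in this notebook.

```python
import pickle
from pathlib import Path
from typing import Any, Dict

# Keys needed to reopen the collection (subset of what generate_manifest writes).
REQUIRED_MANIFEST_KEYS = {"collection_name", "persist_directory", "embedding_model"}


def load_manifest(path: Path) -> Dict[str, Any]:
    """Read the pickle manifest and check that the reconnection keys are present."""
    with Path(path).open("rb") as f:
        manifest = pickle.load(f)  # only unpickle files produced by this notebook
    if not isinstance(manifest, dict):
        raise ValueError("Manifest should be a dict")
    missing = REQUIRED_MANIFEST_KEYS - manifest.keys()
    if missing:
        raise ValueError(f"Manifest is missing keys: {sorted(missing)}")
    return manifest


def reconnect(manifest: Dict[str, Any]):
    """Reopen the persisted collection with the same embedding model used at ingest."""
    # Imported lazily so load_manifest works without the vector-store packages installed.
    from langchain_chroma import Chroma
    from langchain_openai import OpenAIEmbeddings

    return Chroma(
        collection_name=manifest["collection_name"],
        embedding_function=OpenAIEmbeddings(model=manifest["embedding_model"]),
        persist_directory=manifest["persist_directory"],
    )
```

The stored `persist_directory` is a relative path (`data/vectorstores/chroma_feasibility`), so the retrieval process needs to run from the same working directory or resolve the path against a known project root. Since the manifest holds only plain strings, numbers, and lists, a JSON file would serve equally well and avoids the usual caution about unpickling untrusted files.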