Conversation
Pull request overview
This pull request introduces a complete CLI and API system for building, managing, and querying a vectorstore of GCAM simulation metadata using text embeddings. The system supports both OpenAI-compatible embedding backends and a local deterministic embedding backend for development/testing. The implementation uses FAISS for efficient similarity search with a fallback to pure Python cosine similarity when FAISS is unavailable.
Changes:
- Core vectorstore implementation with incremental updates, FAISS-based indexing, and embedding backend abstraction
- CLI with build, query, and serve commands for vectorstore management and local API serving
- Flask-based REST API with `/search` and `/health` endpoints for production deployment
- Comprehensive test suite covering CLI, API, and vectorstore functionality with conditional FAISS support
- Documentation, example data, CI/CD workflow, and environment configuration templates
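The pure-Python fallback mentioned above amounts to cosine similarity over the stored embedding vectors. A minimal sketch of that computation (illustrative, not the PR's actual implementation):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Pure-Python cosine similarity, usable when FAISS is unavailable."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid division by zero for degenerate vectors
    return dot / (norm_a * norm_b)
```

Ranking then reduces to scoring the query embedding against every record and sorting, which is O(n) per query but perfectly adequate for a dataset of a dozen simulations.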
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| pyproject.toml | Package configuration with Python 3.13 requirement, dependencies, and test configuration |
| explorer/vectorstore.py | Core vectorstore implementation with OpenAI and local embedding backends, FAISS indexing, and persistence |
| explorer/cli.py | CLI implementation for building/updating vectorstore, querying, and running local API server |
| explorer/web.py | Flask application providing search and health check endpoints |
| explorer/__init__.py | Package exports for public API |
| tests/test_vectorstore.py | Tests for vectorstore operations including persistence, incremental updates, and FAISS support |
| tests/test_cli.py | Tests for CLI commands including build, query, and environment variable handling |
| tests/test_web.py | Tests for Flask API endpoints including search validation and result filtering |
| data/gcam_simulations.json | Example simulation metadata with 12 GCAM simulation records |
| data/gcam_vectorstore.json | Pre-built vectorstore metadata for demo/testing purposes |
| README.md | Comprehensive documentation for setup, usage, and deployment |
| .env.example | Environment variable template for OpenAI configuration |
| .gitignore | Git ignore rules for generated files and environment configuration |
| .github/workflows/tests.yml | GitHub Actions workflow for automated testing |
```toml
[project]
name = "explorer"
version = "0.1.0"
description = "GCIMS explorer AI capabilites."
```
The project description contains a typo: "capabilites" should be "capabilities".
```diff
- description = "GCIMS explorer AI capabilites."
+ description = "GCIMS explorer AI capabilities."
```
```json
{"format_version": 2, "vector_backend": "faiss", "backend_metadata": {"type": "openai", "model": "text-embedding-3-large-project"}, "records": [{"simulation_id": "GCAM-SIM-001", "simulation_name": "Baseline Energy Demand Pathway", "description": "Reference scenario projecting energy demand under moderate economic growth and current technology trends.", "keywords": ["baseline", "energy demand", "electric load", "technology trends", "resource use", "demand projection", "regional systems"]}, {"simulation_id": "GCAM-SIM-002", "simulation_name": "High Electrification Transition", "description": "Explores rapid electrification of transport and industry with accelerated clean power deployment.", "keywords": ["electrification", "transport", "industry", "power sector", "grid expansion", "technology adoption"]}, {"simulation_id": "GCAM-SIM-003", "simulation_name": "High-Efficiency End-Use Scenario", "description": "Assesses how efficient appliances, buildings, and industrial processes reshape long-term energy consumption.", "keywords": ["efficiency", "buildings", "industry", "end-use technology", "energy demand", "load reduction"]}, {"simulation_id": "GCAM-SIM-004", "simulation_name": "Flexible Fuel Mix Dynamics", "description": "Evaluates how changing fuel availability influences electricity generation, industrial heat, and transport energy use.", "keywords": ["fuel mix", "electricity generation", "industrial heat", "transport", "resource availability", "system balance"]}, {"simulation_id": "GCAM-SIM-005", "simulation_name": "Low Renewable Cost Breakthrough", "description": "Models a technology breakthrough that sharply lowers solar and wind costs across all regions.", "keywords": ["renewables", "cost decline", "solar", "wind", "technology learning", "energy transition"]}, {"simulation_id": "GCAM-SIM-006", "simulation_name": "Water-Constrained Energy Supply", "description": "Examines how limited freshwater availability affects cooling technologies, thermal generation, and regional power reliability.", "keywords": ["water constraints", "thermal power", "cooling systems", "power reliability", "freshwater demand", "regional supply"]}, {"simulation_id": "GCAM-SIM-007", "simulation_name": "Land Productivity Expansion", "description": "Evaluates improved crop yields and managed forests on land allocation, biomass supply, and food production.", "keywords": ["land productivity", "crop yields", "managed forests", "biomass", "food production", "land allocation"]}, {"simulation_id": "GCAM-SIM-008", "simulation_name": "Urban Cooling and Water Demand", "description": "Analyzes interactions between rising cooling needs in cities, electricity demand peaks, and municipal water withdrawals.", "keywords": ["urban cooling", "peak demand", "municipal water", "electricity load", "city infrastructure", "resource planning"]}, {"simulation_id": "GCAM-SIM-009", "simulation_name": "Sustainable Bioenergy Scale-Up", "description": "Investigates expansion of bioenergy supply with constraints on food prices, soil quality, and available cropland.", "keywords": ["bioenergy", "sustainability", "food-energy nexus", "soil quality", "land competition", "supply potential"]}, {"simulation_id": "GCAM-SIM-010", "simulation_name": "Hydrogen Economy Acceleration", "description": "Tests aggressive deployment of hydrogen in heavy industry, shipping, and power balancing.", "keywords": ["hydrogen", "heavy industry", "shipping", "power balancing", "alternative fuels", "infrastructure"]}, {"simulation_id": "GCAM-SIM-011", "simulation_name": "Regional Resource Balancing", "description": "Compares regional pathways for balancing electricity generation, water withdrawals, and agricultural land demand.", "keywords": ["regional analysis", "electricity generation", "water withdrawals", "agricultural land", "resource balancing", "system integration"]}, {"simulation_id": "GCAM-SIM-012", "simulation_name": "Delayed Infrastructure Catch-Up", "description": "Simulates delayed investments in energy and water infrastructure followed by rapid capacity expansion after 2035.", "keywords": ["delayed action", "infrastructure expansion", "energy systems", "water systems", "capacity planning", "investment timing"]}], "faiss_index_path": "gcam_vectorstore.faiss"}
```

(No newline at end of file.)
The model name in the committed vectorstore metadata is "text-embedding-3-large-project", which doesn't match the standard model name "text-embedding-3-large" documented throughout the codebase. This appears to be a custom or project-specific model name that was used to generate the example vectorstore. This inconsistency could confuse users who expect to use the standard OpenAI model. Consider either regenerating this file with the standard model name or documenting why a custom model was used.
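A quick way to flag this mismatch programmatically, e.g. from a test or a CI step. The field names (`backend_metadata`, `model`) come from the JSON above; the helper itself is illustrative, not part of the PR:

```python
def model_matches(metadata: dict, expected: str = "text-embedding-3-large") -> bool:
    """Check whether a loaded vectorstore's recorded embedding model matches the expected name."""
    return metadata.get("backend_metadata", {}).get("model") == expected
```

Running this against the committed `data/gcam_vectorstore.json` would fail until the file is regenerated with the standard model name.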
```toml
version = "0.1.0"
description = "GCIMS explorer AI capabilites."
readme = "README.md"
requires-python = ">=3.13"
```
The pyproject.toml specifies `requires-python = ">=3.13"`, a very recent Python version: 3.13 was released in October 2024. This strict requirement may limit adoption, as many users and production systems are still on Python 3.10, 3.11, or 3.12. Consider testing with and supporting older versions (e.g., Python 3.10+) unless specific features from 3.13 are genuinely required. Looking at the code, the main features used are type annotations with the `|` union syntax (available since 3.10) and standard library features that should work on earlier versions.
```diff
- requires-python = ">=3.13"
+ requires-python = ">=3.10"
```
```toml
dependencies = [
    "faiss-cpu",
```
The dependency `faiss-cpu` is listed as required, but the code is designed to work without FAISS (falling back to cosine similarity). Since FAISS is treated as optional in the code and handled gracefully when absent, it should be listed under `optional-dependencies` rather than as a required dependency. This would let users install the package without FAISS for development or testing, and pull in FAISS only when it is needed for production use.
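One way to make FAISS opt-in, sketched here rather than taken from the PR, is a packaging extra in pyproject.toml:

```toml
[project.optional-dependencies]
# Installed with `pip install explorer[faiss]` when the FAISS index is wanted;
# a plain `pip install explorer` uses the pure-Python cosine fallback.
faiss = ["faiss-cpu"]
```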
```python
@app.post("/search")
def search() -> tuple[object, int]:
    payload = request.get_json(silent=True) or {}
    topic = str(payload.get("topic", "")).strip()
    if not topic:
        return jsonify({"status": "error", "error": "topic is required"}), 400

    top_k = payload.get("top_k", default_top_k)
    try:
        top_k_int = int(top_k)
        if top_k_int <= 0:
            raise ValueError
    except (TypeError, ValueError):
        return jsonify({"status": "error", "error": "top_k must be a positive integer"}), 400

    min_score = payload.get("min_score")
    min_score_float = None
    if min_score is not None:
        try:
            min_score_float = float(min_score)
        except (TypeError, ValueError):
            return jsonify({"status": "error", "error": "min_score must be numeric"}), 400
    ids_only = bool(payload.get("ids_only", False))

    try:
        response = run_query_command(
            vectorstore_path=vectorstore_path,
            topic=topic,
            top_k=top_k_int,
            min_score=min_score_float,
            ids_only=ids_only,
            env_path=env_path,
        )
    except Exception as error:  # pragma: no cover - catches runtime errors from query.
        return jsonify({"status": "error", "error": str(error)}), 500
    return jsonify(response), 200
```
The API endpoint does not implement any rate limiting or request size limits. For a production deployment, the /search endpoint should have rate limiting to prevent abuse, and the topic string should have a maximum length limit to prevent potential resource exhaustion attacks. Consider adding these protections or documenting that they should be implemented at the reverse proxy/load balancer level.
This pull request introduces a new CLI and API for building, updating, and querying a vectorstore of GCAM simulation metadata using text embeddings. It adds all core code, example data, configuration, and documentation needed to use the system locally or deploy it. The CLI supports building/updating the vectorstore from simulation JSON, querying by topic similarity, and running a local Flask API server. The OpenAI-compatible embeddings backend is supported, with configuration via environment variables.
Major features and changes:
Core functionality and CLI/API:
- `explorer/cli.py` implementing a CLI for building/updating a persisted vectorstore from a JSON file, querying by topic similarity, and running a local Flask API server for search.
- `explorer/web.py` providing a Flask app with `/search` and `/health` endpoints, delegating to the CLI query logic.
- `explorer/__init__.py` for easier imports and reusability.

Configuration and documentation:
- `.env.example` with environment variable templates for OpenAI API key, embedding model, API base URL, and vectorstore path.
- `README.md` with setup instructions, CLI usage, API usage, and deployment notes.

Sample data and CI:
- Example simulation metadata in `data/gcam_simulations.json` and a corresponding vectorstore metadata file in `data/gcam_vectorstore.json` for testing/demo purposes. [1] [2]
- `.github/workflows/tests.yml` to install dependencies, build the package, and run tests on pushes and pull requests.
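A workflow of the shape described above might look like the following sketch; the action versions and step names are assumptions, and the actual tests.yml in the PR may differ:

```yaml
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - run: pip install .   # install the package and its dependencies
      - run: pytest          # run the test suite
```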