
Initial version push #1

Merged

crvernon merged 13 commits into main from develop on Feb 19, 2026

Conversation

@crvernon
Member

This pull request introduces a new CLI and API for building, updating, and querying a vectorstore of GCAM simulation metadata using text embeddings. It adds all core code, example data, configuration, and documentation needed to use the system locally or to deploy it. The CLI supports building and updating the vectorstore from simulation JSON, querying by topic similarity, and running a local Flask API server. An OpenAI-compatible embeddings backend is supported, configured via environment variables.

Major features and changes:

Core functionality and CLI/API:

  • Added explorer/cli.py implementing a CLI for building/updating a persisted vectorstore from a JSON file, querying by topic similarity, and running a local Flask API server for search.
  • Added explorer/web.py providing a Flask app with /search and /health endpoints, delegating to the CLI query logic.
  • Exposed main vectorstore and app utilities in explorer/__init__.py for easier imports and reusability.
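A CLI with build, query, and serve modes like the one described above is commonly wired up with `argparse` subparsers. The following is a minimal sketch only; the flag names and defaults are illustrative assumptions, not the actual surface of `explorer/cli.py`:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Top-level parser with one subcommand per CLI mode described above.
    parser = argparse.ArgumentParser(prog="explorer")
    sub = parser.add_subparsers(dest="command", required=True)

    build = sub.add_parser("build", help="Build or update the vectorstore from simulation JSON.")
    build.add_argument("--input", required=True, help="Path to the simulation metadata JSON file.")
    build.add_argument("--vectorstore", required=True, help="Path where the vectorstore is persisted.")

    query = sub.add_parser("query", help="Query the vectorstore by topic similarity.")
    query.add_argument("--topic", required=True, help="Free-text topic to search for.")
    query.add_argument("--top-k", type=int, default=5, help="Number of results to return.")

    serve = sub.add_parser("serve", help="Run the local Flask API server.")
    serve.add_argument("--port", type=int, default=8000)
    return parser
```

Each subcommand then dispatches to the corresponding build/query/serve function, e.g. `args = build_parser().parse_args(["query", "--topic", "hydrogen"])`.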

Configuration and documentation:

  • Added .env.example with environment variable templates for OpenAI API key, embedding model, API base URL, and vectorstore path.
  • Added a comprehensive README.md with setup instructions, CLI usage, API usage, and deployment notes.
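For reference, a template covering the four settings described above might contain entries like the following. The variable names here are illustrative assumptions, not copied from the repository's actual `.env.example`:

```shell
# Hypothetical .env.example entries; actual variable names may differ.
OPENAI_API_KEY=replace-with-your-key
EMBEDDING_MODEL=text-embedding-3-large
OPENAI_API_BASE=https://api.openai.com/v1
VECTORSTORE_PATH=data/gcam_vectorstore.json
```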

Sample data and CI:

  • Added example simulation metadata in data/gcam_simulations.json and a corresponding vectorstore metadata file in data/gcam_vectorstore.json for testing/demo purposes.
  • Added a GitHub Actions workflow .github/workflows/tests.yml to install dependencies, build the package, and run tests on pushes and pull requests.


Copilot AI left a comment


Pull request overview

This pull request introduces a complete CLI and API system for building, managing, and querying a vectorstore of GCAM simulation metadata using text embeddings. The system supports both OpenAI-compatible embedding backends and a local deterministic embedding backend for development/testing. The implementation uses FAISS for efficient similarity search with a fallback to pure Python cosine similarity when FAISS is unavailable.
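The pure-Python fallback mentioned above can be illustrated with a simple cosine-similarity ranking. This is a sketch of the general technique, not the repository's actual implementation:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector norms; 0.0 for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query: list[float], records: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    # Exhaustively score every stored embedding against the query and keep the
    # k highest; O(n) per query, which is acceptable for small vectorstores.
    scored = [(rec_id, cosine_similarity(query, vec)) for rec_id, vec in records.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

FAISS replaces the exhaustive loop with an index that scales to large record counts, which is why it is preferred when available.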

Changes:

  • Core vectorstore implementation with incremental updates, FAISS-based indexing, and embedding backend abstraction
  • CLI with build, query, and serve commands for vectorstore management and local API serving
  • Flask-based REST API with /search and /health endpoints for production deployment
  • Comprehensive test suite covering CLI, API, and vectorstore functionality with conditional FAISS support
  • Documentation, example data, CI/CD workflow, and environment configuration templates

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 10 comments.

Summary per file:

  • pyproject.toml: Package configuration with Python 3.13 requirement, dependencies, and test configuration
  • explorer/vectorstore.py: Core vectorstore implementation with OpenAI and local embedding backends, FAISS indexing, and persistence
  • explorer/cli.py: CLI implementation for building/updating the vectorstore, querying, and running the local API server
  • explorer/web.py: Flask application providing search and health check endpoints
  • explorer/__init__.py: Package exports for the public API
  • tests/test_vectorstore.py: Tests for vectorstore operations including persistence, incremental updates, and FAISS support
  • tests/test_cli.py: Tests for CLI commands including build, query, and environment variable handling
  • tests/test_web.py: Tests for Flask API endpoints including search validation and result filtering
  • data/gcam_simulations.json: Example simulation metadata with 12 GCAM simulation records
  • data/gcam_vectorstore.json: Pre-built vectorstore metadata for demo/testing purposes
  • README.md: Comprehensive documentation for setup, usage, and deployment
  • .env.example: Environment variable template for OpenAI configuration
  • .gitignore: Git ignore rules for generated files and environment configuration
  • .github/workflows/tests.yml: GitHub Actions workflow for automated testing


[project]
name = "explorer"
version = "0.1.0"
description = "GCIMS explorer AI capabilites."

Copilot AI Feb 19, 2026


The project description contains a typo: "capabilites" should be "capabilities".

Suggested change
description = "GCIMS explorer AI capabilites."
description = "GCIMS explorer AI capabilities."

@@ -0,0 +1 @@
{"format_version": 2, "vector_backend": "faiss", "backend_metadata": {"type": "openai", "model": "text-embedding-3-large-project"}, "records": [{"simulation_id": "GCAM-SIM-001", "simulation_name": "Baseline Energy Demand Pathway", "description": "Reference scenario projecting energy demand under moderate economic growth and current technology trends.", "keywords": ["baseline", "energy demand", "electric load", "technology trends", "resource use", "demand projection", "regional systems"]}, {"simulation_id": "GCAM-SIM-002", "simulation_name": "High Electrification Transition", "description": "Explores rapid electrification of transport and industry with accelerated clean power deployment.", "keywords": ["electrification", "transport", "industry", "power sector", "grid expansion", "technology adoption"]}, {"simulation_id": "GCAM-SIM-003", "simulation_name": "High-Efficiency End-Use Scenario", "description": "Assesses how efficient appliances, buildings, and industrial processes reshape long-term energy consumption.", "keywords": ["efficiency", "buildings", "industry", "end-use technology", "energy demand", "load reduction"]}, {"simulation_id": "GCAM-SIM-004", "simulation_name": "Flexible Fuel Mix Dynamics", "description": "Evaluates how changing fuel availability influences electricity generation, industrial heat, and transport energy use.", "keywords": ["fuel mix", "electricity generation", "industrial heat", "transport", "resource availability", "system balance"]}, {"simulation_id": "GCAM-SIM-005", "simulation_name": "Low Renewable Cost Breakthrough", "description": "Models a technology breakthrough that sharply lowers solar and wind costs across all regions.", "keywords": ["renewables", "cost decline", "solar", "wind", "technology learning", "energy transition"]}, {"simulation_id": "GCAM-SIM-006", "simulation_name": "Water-Constrained Energy Supply", "description": "Examines how limited freshwater availability affects cooling technologies, thermal generation, and regional power reliability.", "keywords": ["water constraints", "thermal power", "cooling systems", "power reliability", "freshwater demand", "regional supply"]}, {"simulation_id": "GCAM-SIM-007", "simulation_name": "Land Productivity Expansion", "description": "Evaluates improved crop yields and managed forests on land allocation, biomass supply, and food production.", "keywords": ["land productivity", "crop yields", "managed forests", "biomass", "food production", "land allocation"]}, {"simulation_id": "GCAM-SIM-008", "simulation_name": "Urban Cooling and Water Demand", "description": "Analyzes interactions between rising cooling needs in cities, electricity demand peaks, and municipal water withdrawals.", "keywords": ["urban cooling", "peak demand", "municipal water", "electricity load", "city infrastructure", "resource planning"]}, {"simulation_id": "GCAM-SIM-009", "simulation_name": "Sustainable Bioenergy Scale-Up", "description": "Investigates expansion of bioenergy supply with constraints on food prices, soil quality, and available cropland.", "keywords": ["bioenergy", "sustainability", "food-energy nexus", "soil quality", "land competition", "supply potential"]}, {"simulation_id": "GCAM-SIM-010", "simulation_name": "Hydrogen Economy Acceleration", "description": "Tests aggressive deployment of hydrogen in heavy industry, shipping, and power balancing.", "keywords": ["hydrogen", "heavy industry", "shipping", "power balancing", "alternative fuels", "infrastructure"]}, {"simulation_id": "GCAM-SIM-011", "simulation_name": "Regional Resource Balancing", "description": "Compares regional pathways for balancing electricity generation, water withdrawals, and agricultural land demand.", "keywords": ["regional analysis", "electricity generation", "water withdrawals", "agricultural land", "resource balancing", "system integration"]}, {"simulation_id": "GCAM-SIM-012", "simulation_name": "Delayed Infrastructure Catch-Up", "description": "Simulates delayed investments in energy and water infrastructure followed by rapid capacity expansion after 2035.", "keywords": ["delayed action", "infrastructure expansion", "energy systems", "water systems", "capacity planning", "investment timing"]}], "faiss_index_path": "gcam_vectorstore.faiss"}
\ No newline at end of file

Copilot AI Feb 19, 2026


The model name in the committed vectorstore metadata is "text-embedding-3-large-project", which doesn't match the standard model name "text-embedding-3-large" documented throughout the codebase. This appears to be a custom or project-specific model name that was used to generate the example vectorstore. The inconsistency could confuse users who expect to use the standard OpenAI model, and queries embedded with a different model than the stored vectors would produce meaningless similarity scores. Consider either regenerating this file with the standard model name or documenting why a custom model was used.

version = "0.1.0"
description = "GCIMS explorer AI capabilites."
readme = "README.md"
requires-python = ">=3.13"

Copilot AI Feb 19, 2026


The pyproject.toml specifies requires-python = ">=3.13", a very recent version: Python 3.13 was released in October 2024. This strict requirement may limit adoption, since many users and production systems are still on Python 3.10, 3.11, or 3.12. Consider testing with and supporting older versions (e.g., Python 3.10+) unless specific 3.13 features are truly required. Looking at the code, the main features used are type annotations with the | union syntax (available since 3.10) and standard library features that should work on earlier versions.

Suggested change
requires-python = ">=3.13"
requires-python = ">=3.10"

Comment on lines +17 to +18
dependencies = [
"faiss-cpu",

Copilot AI Feb 19, 2026


The dependency faiss-cpu is listed as a required dependency, but the code is designed to work without FAISS (with a fallback to cosine similarity). Since FAISS is treated as optional in the code with graceful handling when it's not available, it should be listed under optional-dependencies rather than as a required dependency. This would allow users to install the package without FAISS for development or testing purposes, and only install FAISS when needed for production use.
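One way to make FAISS opt-in, sketched here against standard PEP 621 pyproject.toml conventions (the extra name "faiss" is an assumption, not taken from the repository):

```toml
[project.optional-dependencies]
faiss = ["faiss-cpu"]
```

Users would then run `pip install "explorer[faiss]"` for production installs and plain `pip install explorer` to rely on the cosine-similarity fallback.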

Comment on lines +23 to +58
@app.post("/search")
def search() -> tuple[object, int]:
    payload = request.get_json(silent=True) or {}
    topic = str(payload.get("topic", "")).strip()
    if not topic:
        return jsonify({"status": "error", "error": "topic is required"}), 400

    top_k = payload.get("top_k", default_top_k)
    try:
        top_k_int = int(top_k)
        if top_k_int <= 0:
            raise ValueError
    except (TypeError, ValueError):
        return jsonify({"status": "error", "error": "top_k must be a positive integer"}), 400

    min_score = payload.get("min_score")
    min_score_float = None
    if min_score is not None:
        try:
            min_score_float = float(min_score)
        except (TypeError, ValueError):
            return jsonify({"status": "error", "error": "min_score must be numeric"}), 400
    ids_only = bool(payload.get("ids_only", False))

    try:
        response = run_query_command(
            vectorstore_path=vectorstore_path,
            topic=topic,
            top_k=top_k_int,
            min_score=min_score_float,
            ids_only=ids_only,
            env_path=env_path,
        )
    except Exception as error:  # pragma: no cover - catches runtime errors from query.
        return jsonify({"status": "error", "error": str(error)}), 500
    return jsonify(response), 200

Copilot AI Feb 19, 2026


The API endpoint does not implement any rate limiting or request size limits. For a production deployment, the /search endpoint should have rate limiting to prevent abuse, and the topic string should have a maximum length limit to prevent potential resource exhaustion attacks. Consider adding these protections or documenting that they should be implemented at the reverse proxy/load balancer level.

crvernon merged commit 6f16b0a into main on Feb 19, 2026

2 participants