Conversation
Pull request overview
This pull request introduces a complete CLI and API system for building, managing, and querying a vectorstore of GCAM simulation metadata using text embeddings. The system supports both OpenAI-compatible embedding backends and a local deterministic embedding backend for development/testing. The implementation uses FAISS for efficient similarity search with a fallback to pure Python cosine similarity when FAISS is unavailable.
Changes:
- Core vectorstore implementation with incremental updates, FAISS-based indexing, and embedding backend abstraction
- CLI with build, query, and serve commands for vectorstore management and local API serving
- Flask-based REST API with `/search` and `/health` endpoints for production deployment
- Comprehensive test suite covering CLI, API, and vectorstore functionality with conditional FAISS support
- Documentation, example data, CI/CD workflow, and environment configuration templates
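The pure-Python fallback mentioned above amounts to cosine similarity over the stored embedding vectors. A minimal sketch of that computation (illustrative, not the PR's actual implementation):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Pure-Python cosine similarity, usable when FAISS is unavailable."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid division by zero for degenerate vectors
    return dot / (norm_a * norm_b)
```

Ranking then reduces to scoring the query embedding against every record and sorting, which is O(n) per query but perfectly adequate for a dataset of a dozen simulations.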
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| pyproject.toml | Package configuration with Python 3.13 requirement, dependencies, and test configuration |
| explorer/vectorstore.py | Core vectorstore implementation with OpenAI and local embedding backends, FAISS indexing, and persistence |
| explorer/cli.py | CLI implementation for building/updating vectorstore, querying, and running local API server |
| explorer/web.py | Flask application providing search and health check endpoints |
| explorer/__init__.py | Package exports for public API |
| tests/test_vectorstore.py | Tests for vectorstore operations including persistence, incremental updates, and FAISS support |
| tests/test_cli.py | Tests for CLI commands including build, query, and environment variable handling |
| tests/test_web.py | Tests for Flask API endpoints including search validation and result filtering |
| data/gcam_simulations.json | Example simulation metadata with 12 GCAM simulation records |
| data/gcam_vectorstore.json | Pre-built vectorstore metadata for demo/testing purposes |
| README.md | Comprehensive documentation for setup, usage, and deployment |
| .env.example | Environment variable template for OpenAI configuration |
| .gitignore | Git ignore rules for generated files and environment configuration |
| .github/workflows/tests.yml | GitHub Actions workflow for automated testing |
```toml
[project]
name = "explorer"
version = "0.1.0"
description = "GCIMS explorer AI capabilites."
```
The project description contains a typo: "capabilites" should be "capabilities".
```diff
- description = "GCIMS explorer AI capabilites."
+ description = "GCIMS explorer AI capabilities."
```
```json
{"format_version": 2, "vector_backend": "faiss", "backend_metadata": {"type": "openai", "model": "text-embedding-3-large-project"}, "records": [{"simulation_id": "GCAM-SIM-001", "simulation_name": "Baseline Energy Demand Pathway", "description": "Reference scenario projecting energy demand under moderate economic growth and current technology trends.", "keywords": ["baseline", "energy demand", "electric load", "technology trends", "resource use", "demand projection", "regional systems"]}, {"simulation_id": "GCAM-SIM-002", "simulation_name": "High Electrification Transition", "description": "Explores rapid electrification of transport and industry with accelerated clean power deployment.", "keywords": ["electrification", "transport", "industry", "power sector", "grid expansion", "technology adoption"]}, {"simulation_id": "GCAM-SIM-003", "simulation_name": "High-Efficiency End-Use Scenario", "description": "Assesses how efficient appliances, buildings, and industrial processes reshape long-term energy consumption.", "keywords": ["efficiency", "buildings", "industry", "end-use technology", "energy demand", "load reduction"]}, {"simulation_id": "GCAM-SIM-004", "simulation_name": "Flexible Fuel Mix Dynamics", "description": "Evaluates how changing fuel availability influences electricity generation, industrial heat, and transport energy use.", "keywords": ["fuel mix", "electricity generation", "industrial heat", "transport", "resource availability", "system balance"]}, {"simulation_id": "GCAM-SIM-005", "simulation_name": "Low Renewable Cost Breakthrough", "description": "Models a technology breakthrough that sharply lowers solar and wind costs across all regions.", "keywords": ["renewables", "cost decline", "solar", "wind", "technology learning", "energy transition"]}, {"simulation_id": "GCAM-SIM-006", "simulation_name": "Water-Constrained Energy Supply", "description": "Examines how limited freshwater availability affects cooling technologies, thermal generation, and regional power reliability.", "keywords": ["water constraints", "thermal power", "cooling systems", "power reliability", "freshwater demand", "regional supply"]}, {"simulation_id": "GCAM-SIM-007", "simulation_name": "Land Productivity Expansion", "description": "Evaluates improved crop yields and managed forests on land allocation, biomass supply, and food production.", "keywords": ["land productivity", "crop yields", "managed forests", "biomass", "food production", "land allocation"]}, {"simulation_id": "GCAM-SIM-008", "simulation_name": "Urban Cooling and Water Demand", "description": "Analyzes interactions between rising cooling needs in cities, electricity demand peaks, and municipal water withdrawals.", "keywords": ["urban cooling", "peak demand", "municipal water", "electricity load", "city infrastructure", "resource planning"]}, {"simulation_id": "GCAM-SIM-009", "simulation_name": "Sustainable Bioenergy Scale-Up", "description": "Investigates expansion of bioenergy supply with constraints on food prices, soil quality, and available cropland.", "keywords": ["bioenergy", "sustainability", "food-energy nexus", "soil quality", "land competition", "supply potential"]}, {"simulation_id": "GCAM-SIM-010", "simulation_name": "Hydrogen Economy Acceleration", "description": "Tests aggressive deployment of hydrogen in heavy industry, shipping, and power balancing.", "keywords": ["hydrogen", "heavy industry", "shipping", "power balancing", "alternative fuels", "infrastructure"]}, {"simulation_id": "GCAM-SIM-011", "simulation_name": "Regional Resource Balancing", "description": "Compares regional pathways for balancing electricity generation, water withdrawals, and agricultural land demand.", "keywords": ["regional analysis", "electricity generation", "water withdrawals", "agricultural land", "resource balancing", "system integration"]}, {"simulation_id": "GCAM-SIM-012", "simulation_name": "Delayed Infrastructure Catch-Up", "description": "Simulates delayed investments in energy and water infrastructure followed by rapid capacity expansion after 2035.", "keywords": ["delayed action", "infrastructure expansion", "energy systems", "water systems", "capacity planning", "investment timing"]}], "faiss_index_path": "gcam_vectorstore.faiss"}
```

(No newline at end of file.)
The model name in the committed vectorstore metadata is "text-embedding-3-large-project", which doesn't match the standard model name "text-embedding-3-large" documented throughout the codebase. This appears to be a custom or project-specific model name that was used to generate the example vectorstore. This inconsistency could confuse users who expect to use the standard OpenAI model. Consider either regenerating this file with the standard model name or documenting why a custom model was used.
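A quick way to flag this mismatch programmatically, e.g. from a test or a CI step. The field names (`backend_metadata`, `model`) come from the JSON above; the helper itself is illustrative, not part of the PR:

```python
def model_matches(metadata: dict, expected: str = "text-embedding-3-large") -> bool:
    """Check whether a loaded vectorstore's recorded embedding model matches the expected name."""
    return metadata.get("backend_metadata", {}).get("model") == expected
```

Running this against the committed `data/gcam_vectorstore.json` would fail until the file is regenerated with the standard model name.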
```toml
version = "0.1.0"
description = "GCIMS explorer AI capabilites."
readme = "README.md"
requires-python = ">=3.13"
```
The pyproject.toml specifies `requires-python = ">=3.13"`, a very recent Python version: 3.13 was released in October 2024. This strict requirement may limit adoption, as many users and production systems are still on Python 3.10, 3.11, or 3.12. Consider testing with and supporting older versions (e.g., Python 3.10+) unless specific features from 3.13 are genuinely required. Looking at the code, the main features used are type annotations with the `|` union syntax (available since 3.10) and standard library features that should work on earlier versions.
```diff
- requires-python = ">=3.13"
+ requires-python = ">=3.10"
```
```toml
dependencies = [
    "faiss-cpu",
```
The dependency `faiss-cpu` is listed as required, but the code is designed to work without FAISS (falling back to cosine similarity). Since FAISS is treated as optional in the code and handled gracefully when absent, it should be listed under `optional-dependencies` rather than as a required dependency. This would let users install the package without FAISS for development or testing, and pull in FAISS only when it is needed for production use.
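One way to make FAISS opt-in, sketched here rather than taken from the PR, is a packaging extra in pyproject.toml:

```toml
[project.optional-dependencies]
# Installed with `pip install explorer[faiss]` when the FAISS index is wanted;
# a plain `pip install explorer` uses the pure-Python cosine fallback.
faiss = ["faiss-cpu"]
```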
```python
@app.post("/search")
def search() -> tuple[object, int]:
    payload = request.get_json(silent=True) or {}
    topic = str(payload.get("topic", "")).strip()
    if not topic:
        return jsonify({"status": "error", "error": "topic is required"}), 400

    top_k = payload.get("top_k", default_top_k)
    try:
        top_k_int = int(top_k)
        if top_k_int <= 0:
            raise ValueError
    except (TypeError, ValueError):
        return jsonify({"status": "error", "error": "top_k must be a positive integer"}), 400

    min_score = payload.get("min_score")
    min_score_float = None
    if min_score is not None:
        try:
            min_score_float = float(min_score)
        except (TypeError, ValueError):
            return jsonify({"status": "error", "error": "min_score must be numeric"}), 400
    ids_only = bool(payload.get("ids_only", False))

    try:
        response = run_query_command(
            vectorstore_path=vectorstore_path,
            topic=topic,
            top_k=top_k_int,
            min_score=min_score_float,
            ids_only=ids_only,
            env_path=env_path,
        )
    except Exception as error:  # pragma: no cover - catches runtime errors from query.
        return jsonify({"status": "error", "error": str(error)}), 500
    return jsonify(response), 200
```
The API endpoint does not implement any rate limiting or request size limits. For a production deployment, the /search endpoint should have rate limiting to prevent abuse, and the topic string should have a maximum length limit to prevent potential resource exhaustion attacks. Consider adding these protections or documenting that they should be implemented at the reverse proxy/load balancer level.
This pull request introduces a new CLI and API for building, updating, and querying a vectorstore of GCAM simulation metadata using text embeddings. It adds all core code, example data, configuration, and documentation needed to use the system locally or deploy it. The CLI supports building/updating the vectorstore from simulation JSON, querying by topic similarity, and running a local Flask API server. The OpenAI-compatible embeddings backend is supported, with configuration via environment variables.
Major features and changes:
Core functionality and CLI/API:
- `explorer/cli.py` implementing a CLI for building/updating a persisted vectorstore from a JSON file, querying by topic similarity, and running a local Flask API server for search.
- `explorer/web.py` providing a Flask app with `/search` and `/health` endpoints, delegating to the CLI query logic.
- `explorer/__init__.py` for easier imports and reusability.

Configuration and documentation:
- `.env.example` with environment variable templates for OpenAI API key, embedding model, API base URL, and vectorstore path.
- `README.md` with setup instructions, CLI usage, API usage, and deployment notes.

Sample data and CI:
- Example simulation metadata in `data/gcam_simulations.json` and a corresponding vectorstore metadata file in `data/gcam_vectorstore.json` for testing/demo purposes. [1] [2]
- `.github/workflows/tests.yml` to install dependencies, build the package, and run tests on pushes and pull requests.
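A workflow of the shape described above might look like the following sketch; the action versions and step names are assumptions, and the actual tests.yml in the PR may differ:

```yaml
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - run: pip install .   # install the package and its dependencies
      - run: pytest          # run the test suite
```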