;;.maybe? #7
Open: Domusgpt wants to merge 11 commits into shared-core-implementation from main.
Conversation
feat: Structured outputs and shared core architecture
Please go over these items, address each one if it is a real issue, and delete the rest.
Shared core implementation
This commit addresses several issues in the API to improve its robustness and security:
1. **Rate Limiting for Anonymous Users:**
- I replaced the in-memory rate limiter for anonymous users with a Firestore-based solution in `rateLimitMiddleware.ts`.
- Anonymous user requests are now tracked per IP per minute in the `anonymousRateLimits` Firestore collection using transactions, providing better scalability and persistence.
- I added unit tests to `rateLimitMiddleware.test.ts` to verify this new logic, including timer-based reset functionality.
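As a rough sketch of the per-IP, per-minute window logic described above (the function and type names here are illustrative, not the actual middleware API; in the real implementation this update runs inside a Firestore transaction keyed by IP):

```typescript
interface RateRecord {
  windowStart: number; // epoch ms at which the current minute window began
  count: number;       // requests seen so far in this window
}

const WINDOW_MS = 60_000; // one-minute window

function checkAnonymousLimit(
  record: RateRecord | undefined,
  now: number,
  limit: number
): { allowed: boolean; record: RateRecord } {
  // No record yet, or the window has expired: start a fresh window.
  if (!record || now - record.windowStart >= WINDOW_MS) {
    return { allowed: true, record: { windowStart: now, count: 1 } };
  }
  // Within the window: deny once the limit is reached.
  if (record.count >= limit) {
    return { allowed: false, record };
  }
  return { allowed: true, record: { ...record, count: record.count + 1 } };
}
```

Running this pure decision function inside a transaction is what gives the persistence and atomicity the in-memory limiter lacked.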
2. **Improved Error Handling for Malformed JSON:**
- I enhanced `parseRoutes.ts` to provide more specific error responses when backend services (Architect or Extractor) return malformed JSON.
- Instead of a generic 500 error, the API now returns a 422 (Unprocessable Entity) error with a clearer message (e.g., "Failed to parse response from Architect service.").
- In development mode, the error response includes a `details` field with the original error and raw response for easier debugging.
- I added unit tests in `parseRoutes.test.ts` to cover these scenarios.
3. **Input Size Limits:**
- I implemented an input size limit of 1MB for `inputData` in `parseRoutes.ts`.
- If `inputData` exceeds this limit, the API returns a 413 (Payload Too Large) error, including the received payload size in the message.
- I added a check to ensure `inputData` is a string, returning a 400 error if not.
- I added unit tests for these checks to `parseRoutes.test.ts`.
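The size and type checks can be sketched as a small validator (illustrative only; the real checks live inline in `parseRoutes.ts`):

```typescript
const MAX_INPUT_BYTES = 1024 * 1024; // 1 MB limit on inputData

function validateInputData(
  inputData: unknown
): { status: 200 } | { status: 400 | 413; error: string } {
  if (typeof inputData !== "string") {
    return { status: 400, error: "inputData must be a string." };
  }
  // Measure UTF-8 bytes, not JS string length, so multi-byte input is counted correctly.
  const size = new TextEncoder().encode(inputData).length;
  if (size > MAX_INPUT_BYTES) {
    return {
      status: 413, // Payload Too Large
      error: `inputData too large: received ${size} bytes (limit ${MAX_INPUT_BYTES}).`,
    };
  }
  return { status: 200 };
}
```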
4. **Dependency Risk (`eslint-scope`):**
- The issue report mentioned `eslint-scope` as a potential dependency risk. This is a development dependency and does not affect the runtime behavior of the API. No direct changes were made related to this, but it's noted for future review of development dependencies.
These changes significantly improve the API's resilience to malformed inputs, provide clearer error feedback, and implement more robust rate limiting for anonymous users.
Fix: Enhance API security, error handling, and rate limiting
Jules was unable to complete the task in time. Please review the work done so far and provide feedback for Jules to continue.
This commit implements fixes for several critical security issues identified:
1. **Rate Limiting:**
* Resolved issues where rate limiting was ineffective, particularly for anonymous users.
* Implemented comprehensive RPM (Requests Per Minute), daily, and monthly limits for both anonymous (IP-based) and authenticated users in the main API entry point (`packages/api/src/index.ts`).
* Updated the Express middleware (`packages/api/src/middleware/rateLimitMiddleware.ts`) to mirror these robust limits.
* Instituted a "fail-closed" policy for rate limit checks: if the backend (Firestore) encounters an error during a check, the request is denied, preventing potential abuse during backend issues.
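The fail-closed policy can be expressed as a small wrapper, sketched here under the assumption that the limit check is an async call into Firestore (the wrapper name is illustrative):

```typescript
// Fail-closed: if the backing store errors during a limit check, deny the
// request rather than allowing unlimited traffic while the backend is down.
async function failClosedCheck(
  check: () => Promise<boolean> // resolves true if the request is within limits
): Promise<boolean> {
  try {
    return await check();
  } catch {
    return false; // backend error: treat as over-limit
  }
}
```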
2. **Input Validation and Sanitization (XSS & Server Stability):**
* Sanitized API key names provided via `/v1/user/keys`. HTML special characters and backticks are now encoded before storage and in API responses, mitigating reflected and potential stored XSS vulnerabilities.
* Hardened the `/v1/parse` endpoint by escaping backtick characters in user-supplied `inputData`. This prevents malformed AI prompts that previously led to server errors.
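The two sanitization steps above amount to something like the following (a sketch; the exact character maps and function names in the codebase may differ):

```typescript
// Encode HTML special characters and backticks before storing or echoing
// an API key name, mitigating reflected/stored XSS.
function sanitizeKeyName(name: string): string {
  const map: Record<string, string> = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;",
    '"': "&quot;", "'": "&#39;", "`": "&#96;",
  };
  return name.replace(/[&<>"'`]/g, (ch) => map[ch]);
}

// Escape backticks in user-supplied inputData before it is interpolated
// into an AI prompt, so the prompt's own delimiters cannot be broken.
function escapeBackticks(inputData: string): string {
  return inputData.replace(/`/g, "\\`");
}
```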
3. **Error Handling Information Leaks:**
* Modified the error response for the `/v1/parse` endpoint. Instead of returning raw internal error messages, a generic message is now sent to the client, while detailed error information is logged server-side. This prevents leakage of potentially sensitive internal state.
4. **Testing:**
* Added a new test suite (`packages/api/src/__tests__/`) with integration tests.
* Rate limiting logic in `packages/api/src/index.ts` is extensively tested, including RPM, daily, monthly limits for anonymous and authenticated users, and fail-closed behavior using Firestore mocks.
* Input sanitization for API key names and backtick escaping for `inputData` are tested, verifying correct output and prompt construction with mocked AI responses.
These changes significantly improve the security posture and reliability of the API.
This commit introduces an intelligent caching layer for Architect searchPlans
to improve performance and reduce redundant AI calls, along with a
self-correction mechanism driven by Extractor performance.
Key changes:
1. **In-Memory LRU Cache for Architect Plans:**
* I implemented an in-memory LRU (Least Recently Used) cache (`Map`-based)
in `packages/api/src/index.ts` to store generated `searchPlan`s.
* Cache keys are generated using a SHA256 hash of the `outputSchema`.
* A `MAX_CACHE_SIZE` is defined to prevent unbounded memory use.
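A minimal sketch of such a Map-based LRU cache with SHA256 keys (names and the size bound are illustrative; a `Map` iterates in insertion order, so the first key is the least recently used):

```typescript
import { createHash } from "node:crypto";

const MAX_CACHE_SIZE = 100; // illustrative bound on cached plans
const planCache = new Map<string, unknown>();

function schemaCacheKey(outputSchema: object): string {
  return createHash("sha256").update(JSON.stringify(outputSchema)).digest("hex");
}

function cacheGet(key: string): unknown {
  if (!planCache.has(key)) return undefined;
  const plan = planCache.get(key);
  // Re-insert to mark this entry as most recently used.
  planCache.delete(key);
  planCache.set(key, plan);
  return plan;
}

function cacheSet(key: string, plan: unknown): void {
  if (planCache.has(key)) {
    planCache.delete(key);
  } else if (planCache.size >= MAX_CACHE_SIZE) {
    // Evict the least recently used entry (first insertion-order key).
    planCache.delete(planCache.keys().next().value as string);
  }
  planCache.set(key, plan);
}
```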
2. **Caching Logic in `/v1/parse` Endpoint:**
* The endpoint now attempts to retrieve a `searchPlan` from the cache
before calling the Architect model.
* If a plan is found (cache hit), the Architect call is skipped.
* Newly generated plans are stored in the cache.
* A `forceRefreshArchitect: true` request body parameter allows you
to bypass the cache and force a new Architect invocation.
3. **Extractor-Driven Re-architecture (Self-Correction):**
* After the Extractor processes data using a cached `searchPlan`, I
evaluate the `parsedData`.
* If a significant number of top-level fields (defined in the
`outputSchema`) are missing or null, the cached `searchPlan` is
considered suboptimal and is automatically invalidated (removed from
the cache).
* This ensures that on subsequent requests with the same `outputSchema`,
the Architect will be re-invoked to generate a potentially better plan.
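The invalidation decision described above can be sketched as a ratio check (the function name and the 50% threshold are assumptions for illustration, not the committed values):

```typescript
// Count how many top-level schema fields came back missing or null; if the
// ratio exceeds a threshold, the cached searchPlan is treated as suboptimal
// and should be evicted so the Architect is re-invoked next time.
function shouldInvalidatePlan(
  parsedData: Record<string, unknown>,
  outputSchema: Record<string, unknown>,
  threshold = 0.5 // illustrative: invalidate if over half the fields failed
): boolean {
  const fields = Object.keys(outputSchema);
  if (fields.length === 0) return false;
  const missing = fields.filter(
    (f) => parsedData[f] === undefined || parsedData[f] === null
  ).length;
  return missing / fields.length > threshold;
}
```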
4. **Response Metadata Enhancements:**
* The API response for `/v1/parse` now includes a `cacheInfo` object in
the `metadata`, indicating:
* `retrievedFromCache: boolean` (if the plan was from cache)
* `invalidatedByExtractor: boolean` (if the plan was invalidated
in the current request due to poor Extractor results).
5. **Architect Prompt Refinement (Minor):**
* I added a subtle instruction to the Architect prompt to encourage the
creation of robust plans that can handle minor input variations.
6. **Comprehensive Testing:**
* I added a new test suite (`index.caching.test.ts`) with extensive
tests for the caching and re-architecture logic.
* The tests cover cache hits, misses, `forceRefreshArchitect`, LRU eviction,
and the Extractor failure invalidation mechanism, ensuring the new
system behaves as expected under various conditions.
These changes aim to make the Parserator API more efficient, adaptive, and
cost-effective by intelligently reusing Architect computations while also
allowing the system to learn from and correct suboptimal parsing strategies.
This commit improves the Architect `searchPlan` caching mechanism by
incorporating a fingerprint of the input data sample into the cache key.
This makes the cache more granular, ensuring that plans are reused only
when both the output schema and the structural characteristics of the
input data are similar.
Key changes:
1. **Input Data Fingerprinting Strategy:**
* Designed and implemented a `generateInputFingerprint` function in
`packages/api/src/index.ts`.
* The fingerprint is generated from the first 1KB of the input data
sample and includes heuristics such as:
* Presence of JSON/XML characters.
* Number of lines and average line length.
* Colon count.
* Numeric density.
* This provides a lightweight yet effective way to differentiate input
structures.
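A sketch of what such a heuristic fingerprint might look like, assuming a simple delimited string encoding (the exact heuristics and output format of the real `generateInputFingerprint` may differ):

```typescript
function generateInputFingerprint(input: string): string {
  const sample = input.slice(0, 1024); // only the first 1KB is inspected
  const lines = sample.split("\n");
  const digits = (sample.match(/\d/g) ?? []).length;
  return [
    /[{}\[\]]/.test(sample) ? "json" : "nojson",        // JSON-like characters
    /[<>]/.test(sample) ? "xml" : "noxml",              // XML-like characters
    `lines:${lines.length}`,
    `avg:${Math.round(sample.length / lines.length)}`,  // average line length
    `colons:${(sample.match(/:/g) ?? []).length}`,      // colon count
    `numdens:${Math.round((digits / Math.max(sample.length, 1)) * 100)}`, // numeric density %
  ].join("|");
}
```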
2. **Compound Cache Key:**
* Modified `generateCacheKey` to combine a hash of the `outputSchema`
with the newly generated `inputFingerprint`.
* This more specific cache key is now used for all cache operations
(get, set, delete) in the `/v1/parse` endpoint.
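The compound key construction can be sketched as follows (the separator and exact composition are assumptions; the point is that the same schema paired with a structurally different input sample yields a different cache entry):

```typescript
import { createHash } from "node:crypto";

// Compound cache key: SHA256 of the outputSchema plus the input fingerprint.
function generateCacheKey(outputSchema: object, inputFingerprint: string): string {
  const schemaHash = createHash("sha256")
    .update(JSON.stringify(outputSchema))
    .digest("hex");
  return `${schemaHash}:${inputFingerprint}`;
}
```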
3. **Integration into `/v1/parse`:**
* The `inputFingerprint` is generated from the input sample.
* The compound cache key is used for looking up cached plans, storing
new plans, and invalidating plans via the Extractor-driven
re-architecture logic.
4. **Updated Tests:**
* Added direct unit tests for the `generateInputFingerprint` function
to validate its behavior with various input types.
* Updated all existing caching tests in
`packages/api/src/__tests__/index.caching.test.ts` to reflect the
new compound cache key logic.
* Added new test scenarios to specifically verify that changes in input
data (leading to different fingerprints) correctly result in cache
misses, even if the output schema remains the same.
This enhancement aims to increase cache effectiveness, reduce unnecessary
Architect invocations when input structures vary for the same schema, and
further improve the performance and cost-efficiency of the Parserator API.
The existing Extractor-driven re-architecture mechanism remains compatible
and will now operate on these more granular cache entries.
Jules wip 3583282160126608178