Conversation

@Domusgpt (Owner)

Pull Request

📋 Description

🔗 Related Issue

Fixes #(issue_number)

🛠️ Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🔧 Configuration change
  • 🎨 Code style/formatting change
  • ♻️ Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test addition or improvement

🧪 Testing

  • Unit tests pass
  • Integration tests pass
  • Manual testing completed
  • New tests added for new functionality

Test Details:

🔍 EMA Compliance Check

  • Digital Sovereignty: Changes respect user data ownership
  • Portability: Export/migration capabilities maintained or improved
  • Universal Standards: Uses open standards, avoids vendor lock-in
  • Transparent Competition: Competitive advantage through merit, not barriers

📝 Changes Made

📱 Component Impact

  • Core API
  • Chrome Extension
  • VS Code Extension
  • Python SDK
  • Node.js SDK
  • MCP Server
  • Documentation
  • Website
  • CI/CD
  • Tests

🔒 Breaking Changes

None / Describe breaking changes here

📸 Screenshots/Examples

✅ Checklist

  • My code follows the project's coding standards
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

🎯 Additional Notes


By submitting this PR, I confirm that:

  • I have read and agree to the Code of Conduct
  • I have read the Contributing Guidelines
  • This contribution is my own work or I have permission to submit it
  • I agree to license this contribution under the project's MIT license

Domusgpt and others added 11 commits June 15, 2025 11:38
feat: Structured outputs and shared core architecture
Please go over these, address each one if it really is an issue, and then delete these notes.
This commit addresses several issues in the API to improve its robustness and security:

1.  **Rate Limiting for Anonymous Users:**
    - I replaced the in-memory rate limiter for anonymous users with a Firestore-based solution in `rateLimitMiddleware.ts`.
    - Anonymous user requests are now tracked per IP per minute in the `anonymousRateLimits` Firestore collection using transactions, providing better scalability and persistence.
    - I added unit tests to `rateLimitMiddleware.test.ts` to verify this new logic, including timer-based reset functionality.
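
The per-IP, per-minute window logic can be sketched as a pure function. In this hedged sketch an in-memory `Map` stands in for the `anonymousRateLimits` Firestore collection (and its transactions), and the `ANON_RPM_LIMIT` value and function name are assumptions, not taken from the PR:

```typescript
// Fixed-window rate limiting per IP per minute (illustrative sketch).
// In the real middleware a Firestore transaction on the
// `anonymousRateLimits` collection plays the role of this Map.
type WindowEntry = { windowStart: number; count: number };

const ANON_RPM_LIMIT = 10; // assumed limit, not from the PR
const windows = new Map<string, WindowEntry>();

function checkAnonymousLimit(ip: string, now: number = Date.now()): boolean {
  const windowStart = Math.floor(now / 60_000) * 60_000; // start of current minute
  const entry = windows.get(ip);
  if (!entry || entry.windowStart !== windowStart) {
    // New minute window: reset the counter for this IP.
    windows.set(ip, { windowStart, count: 1 });
    return true;
  }
  if (entry.count >= ANON_RPM_LIMIT) return false; // over the per-minute limit
  entry.count += 1;
  return true;
}
```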

2.  **Improved Error Handling for Malformed JSON:**
    - I enhanced `parseRoutes.ts` to provide more specific error responses when backend services (Architect or Extractor) return malformed JSON.
    - Instead of a generic 500 error, the API now returns a 422 (Unprocessable Entity) error with a clearer message (e.g., "Failed to parse response from Architect service.").
    - In development mode, the error response includes a `details` field with the original error and raw response for easier debugging.
    - I added unit tests in `parseRoutes.test.ts` to cover these scenarios.
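
The 422 handling described above amounts to a try/catch around the backend response. This is an illustrative sketch: the function name and response shapes are assumptions, though the error message format matches the commit:

```typescript
// Handle malformed JSON from a backend service: return 422 with a
// service-specific message instead of a generic 500.
const isDev = process.env.NODE_ENV === "development";

function parseServiceResponse(
  raw: string,
  service: "Architect" | "Extractor",
): { status: number; body: object } {
  try {
    return { status: 200, body: JSON.parse(raw) };
  } catch (err) {
    const body: Record<string, unknown> = {
      error: `Failed to parse response from ${service} service.`,
    };
    if (isDev) {
      // Development only: expose the original error and raw response.
      body.details = { message: (err as Error).message, raw };
    }
    return { status: 422, body };
  }
}
```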

3.  **Input Size Limits:**
    - I implemented an input size limit of 1MB for `inputData` in `parseRoutes.ts`.
    - If `inputData` exceeds this limit, the API returns a 413 (Payload Too Large) error, including the received payload size in the message.
    - I added a check to ensure `inputData` is a string, returning a 400 error if not.
    - I added unit tests for these checks to `parseRoutes.test.ts`.
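
A minimal sketch of those input checks — the 1MB limit and the 400/413 status codes come from the commit, while the function name and error shapes are illustrative:

```typescript
const MAX_INPUT_BYTES = 1024 * 1024; // 1MB limit on inputData

type ValidationResult =
  | { ok: true }
  | { ok: false; status: number; message: string };

function validateInputData(inputData: unknown): ValidationResult {
  if (typeof inputData !== "string") {
    return { ok: false, status: 400, message: "inputData must be a string." };
  }
  const bytes = Buffer.byteLength(inputData, "utf8");
  if (bytes > MAX_INPUT_BYTES) {
    return {
      ok: false,
      status: 413, // Payload Too Large
      message: `inputData exceeds the 1MB limit (received ${bytes} bytes).`,
    };
  }
  return { ok: true };
}
```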

4.  **Dependency Risk (`eslint-scope`):**
    - The issue report mentioned `eslint-scope` as a potential dependency risk. This is a development dependency and does not affect the runtime behavior of the API. No direct changes were made related to this, but it's noted for future review of development dependencies.

These changes significantly improve the API's resilience to malformed inputs, provide clearer error feedback, and implement more robust rate limiting for anonymous users.
Jules was unable to complete the task in time. Please review the work done so far and provide feedback for Jules to continue.

Fix: Enhance API security, error handling, and rate limiting robustness

This commit implements fixes for several critical security issues identified:

1.  **Rate Limiting:**
    *   Resolved issues where rate limiting was ineffective, particularly for anonymous users.
    *   Implemented comprehensive RPM (Requests Per Minute), daily, and monthly limits for both anonymous (IP-based) and authenticated users in the main API entry point (`packages/api/src/index.ts`).
    *   Updated the Express middleware (`packages/api/src/middleware/rateLimitMiddleware.ts`) to mirror these robust limits.
    *   Instituted a "fail-closed" policy for rate limit checks: if the backend (Firestore) encounters an error during a check, the request is denied, preventing potential abuse during backend issues.
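
The fail-closed policy reduces to a try/catch that denies on backend error. This wrapper and its name are hypothetical, not from the codebase:

```typescript
// Fail-closed wrapper: if the backing store (e.g. Firestore) errors while
// checking a rate limit, deny the request rather than letting it through.
async function checkLimitFailClosed(
  check: () => Promise<boolean>, // resolves true if the caller is under limit
): Promise<boolean> {
  try {
    return await check();
  } catch (err) {
    // Backend error during the check: treat as over-limit ("fail closed").
    console.error("rate-limit check failed, denying request:", err);
    return false;
  }
}
```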

2.  **Input Validation and Sanitization (XSS & Server Stability):**
    *   Sanitized API key names provided via `/v1/user/keys`. HTML special characters and backticks are now encoded before storage and in API responses, mitigating reflected and potential stored XSS vulnerabilities.
    *   Hardened the `/v1/parse` endpoint by escaping backtick characters in user-supplied `inputData`. This prevents malformed AI prompts that previously led to server errors.
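
The sanitization described above might look like the following sketch; the entity choices and function names are illustrative, not the exact encoding used in the codebase:

```typescript
// Encode HTML special characters and backticks in API key names before
// storage and before echoing them in API responses.
function sanitizeKeyName(name: string): string {
  return name
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;")
    .replace(/`/g, "&#96;");
}

// For inputData only backticks are escaped, so user input cannot break
// out of a backtick-fenced section of the AI prompt template.
function escapeBackticks(input: string): string {
  return input.replace(/`/g, "\\`");
}
```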

3.  **Error Handling Information Leaks:**
    *   Modified the error response for the `/v1/parse` endpoint. Instead of returning raw internal error messages, a generic message is now sent to the client, while detailed error information is logged server-side. This prevents leakage of potentially sensitive internal state.

4.  **Testing:**
    *   Added a new test suite (`packages/api/src/__tests__/`) with integration tests.
    *   Rate limiting logic in `packages/api/src/index.ts` is extensively tested, including RPM, daily, monthly limits for anonymous and authenticated users, and fail-closed behavior using Firestore mocks.
    *   Input sanitization for API key names and backtick escaping for `inputData` are tested, verifying correct output and prompt construction with mocked AI responses.

These changes significantly improve the security posture and reliability of the API.
This commit introduces an intelligent caching layer for Architect searchPlans
to improve performance and reduce redundant AI calls, along with a
self-correction mechanism driven by Extractor performance.

Key changes:

1.  **In-Memory LRU Cache for Architect Plans:**
    *   I implemented an in-memory LRU (Least Recently Used) cache (`Map`-based)
      in `packages/api/src/index.ts` to store generated `searchPlan`s.
    *   Cache keys are generated using a SHA256 hash of the `outputSchema`.
    *   A `MAX_CACHE_SIZE` is defined to prevent unbounded memory use.
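
A JS `Map` iterates in insertion order, which is enough for a small LRU: re-inserting on every hit keeps the most recently used entry last, and eviction removes the first key. A minimal sketch of the cache described above (the `MAX_CACHE_SIZE` value here is illustrative):

```typescript
import { createHash } from "node:crypto";

const MAX_CACHE_SIZE = 100; // illustrative bound on cached searchPlans
const planCache = new Map<string, object>();

// Cache key: SHA256 hash of the output schema.
function cacheKeyForSchema(outputSchema: object): string {
  return createHash("sha256").update(JSON.stringify(outputSchema)).digest("hex");
}

function cacheGet(key: string): object | undefined {
  const plan = planCache.get(key);
  if (plan !== undefined) {
    // Refresh recency: move the entry to the end of the Map.
    planCache.delete(key);
    planCache.set(key, plan);
  }
  return plan;
}

function cacheSet(key: string, plan: object): void {
  if (planCache.has(key)) planCache.delete(key);
  planCache.set(key, plan);
  if (planCache.size > MAX_CACHE_SIZE) {
    // Evict the least recently used entry (first key in iteration order).
    const oldest = planCache.keys().next().value as string;
    planCache.delete(oldest);
  }
}
```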

2.  **Caching Logic in `/v1/parse` Endpoint:**
    *   The endpoint now attempts to retrieve a `searchPlan` from the cache
      before calling the Architect model.
    *   If a plan is found (cache hit), the Architect call is skipped.
    *   Newly generated plans are stored in the cache.
    *   A `forceRefreshArchitect: true` request body parameter allows you
      to bypass the cache and force a new Architect invocation.

3.  **Extractor-Driven Re-architecture (Self-Correction):**
    *   After the Extractor processes data using a cached `searchPlan`, I
      evaluate the `parsedData`.
    *   If a significant number of top-level fields (defined in the
      `outputSchema`) are missing or null, the cached `searchPlan` is
      considered suboptimal and is automatically invalidated (removed from
      the cache).
    *   This ensures that on subsequent requests with the same `outputSchema`,
      the Architect will be re-invoked to generate a potentially better plan.

4.  **Response Metadata Enhancements:**
    *   The API response for `/v1/parse` now includes a `cacheInfo` object in
      the `metadata`, indicating:
        *   `retrievedFromCache: boolean` (if the plan was from cache)
        *   `invalidatedByExtractor: boolean` (if the plan was invalidated
          in the current request due to poor Extractor results).

5.  **Architect Prompt Refinement (Minor):**
    *   I added a subtle instruction to the Architect prompt to encourage the
      creation of robust plans that can handle minor input variations.

6.  **Comprehensive Testing:**
    *   I added a new test suite (`index.caching.test.ts`) with extensive
      tests for the caching and re-architecture logic.
    *   The tests cover cache hits, misses, `forceRefreshArchitect`, LRU eviction,
      and the Extractor failure invalidation mechanism, ensuring the new
      system behaves as expected under various conditions.

These changes aim to make the Parserator API more efficient, adaptive, and
cost-effective by intelligently reusing Architect computations while also
allowing the system to learn from and correct suboptimal parsing strategies.
This commit improves the Architect `searchPlan` caching mechanism by
incorporating a fingerprint of the input data sample into the cache key.
This makes the cache more granular, ensuring that plans are reused only
when both the output schema and the structural characteristics of the
input data are similar.

Key changes:

1.  **Input Data Fingerprinting Strategy:**
    *   Designed and implemented a `generateInputFingerprint` function in
      `packages/api/src/index.ts`.
    *   The fingerprint is generated from the first 1KB of the input data
      sample and includes heuristics such as:
        *   Presence of JSON/XML characters.
        *   Number of lines and average line length.
        *   Colon count.
        *   Numeric density.
    *   This provides a lightweight yet effective way to differentiate input
      structures.
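
A hedged sketch of `generateInputFingerprint` and the compound cache key it feeds into: the heuristic categories (JSON/XML characters, line counts, colon count, numeric density) come from the commit, but the exact bucketing and the `hash:fingerprint` key format are assumptions:

```typescript
import { createHash } from "node:crypto";

// Cheap structural heuristics over the first 1KB of the input sample.
function generateInputFingerprint(input: string): string {
  const sample = input.slice(0, 1024);
  const lines = sample.split("\n");
  const avgLineLen = Math.round(sample.length / lines.length);
  const digits = (sample.match(/[0-9]/g) ?? []).length;
  const numericDensity = Math.round((digits / Math.max(sample.length, 1)) * 10);
  return [
    /[{[]/.test(sample) ? "json" : "nojson",    // JSON-ish characters present?
    /<[a-zA-Z]/.test(sample) ? "xml" : "noxml", // XML-ish characters present?
    `lines:${lines.length}`,
    `avg:${avgLineLen}`,
    `colons:${(sample.match(/:/g) ?? []).length}`,
    `num:${numericDensity}`, // numeric density, in tenths
  ].join("|");
}

// Compound cache key: schema hash plus input fingerprint.
function generateCacheKey(outputSchema: object, inputSample: string): string {
  const schemaHash = createHash("sha256")
    .update(JSON.stringify(outputSchema))
    .digest("hex");
  return `${schemaHash}:${generateInputFingerprint(inputSample)}`;
}
```

With this key, two requests sharing a schema but with structurally different inputs miss the cache and each get their own Architect plan.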

2.  **Compound Cache Key:**
    *   Modified `generateCacheKey` to combine a hash of the `outputSchema`
      with the newly generated `inputFingerprint`.
    *   This more specific cache key is now used for all cache operations
      (get, set, delete) in the `/v1/parse` endpoint.

3.  **Integration into `/v1/parse`:**
    *   The `inputFingerprint` is generated from the input sample.
    *   The compound cache key is used for looking up cached plans, storing
      new plans, and invalidating plans via the Extractor-driven
      re-architecture logic.

4.  **Updated Tests:**
    *   Added direct unit tests for the `generateInputFingerprint` function
      to validate its behavior with various input types.
    *   Updated all existing caching tests in
      `packages/api/src/__tests__/index.caching.test.ts` to reflect the
      new compound cache key logic.
    *   Added new test scenarios to specifically verify that changes in input
      data (leading to different fingerprints) correctly result in cache
      misses, even if the output schema remains the same.

This enhancement aims to increase cache effectiveness, reduce unnecessary
Architect invocations when input structures vary for the same schema, and
further improve the performance and cost-efficiency of the Parserator API.
The existing Extractor-driven re-architecture mechanism remains compatible
and will now operate on these more granular cache entries.