Fix critical schema transformation bugs and improve logging #1001

Open: wants to merge 1 commit into base: main

Mirza-Samad-Ahmed-Baig

Problem

The codebase had several issues that could cause runtime failures or that reflected poor production practices:

  1. Schema Transformation Bugs: The transform_schema function in scrapegraphai/utils/schema_trasform.py could raise KeyError when processing malformed or incomplete Pydantic schemas, because it had no error handling for missing keys.

  2. Poor Logging Practices: The SmartScraperGraph class used print() statements instead of proper logging, which is inappropriate for production environments and headless execution.

  3. Typos: A docstring contained a typo ("trasfrom" instead of "transforms").

Solution

  • Added comprehensive error handling to prevent KeyError exceptions in schema processing
  • Implemented proper logging using Python's logging module instead of print statements
  • Added fallback values for malformed array items and missing schema references
  • Improved input validation with proper error messages for invalid schemas
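As a rough illustration of the fallback approach described above, the sketch below works on a plain dict shaped like the output of Pydantic's model_json_schema(). The names transform_schema_sketch, process_properties, and resolve_ref, the "any" fallback type, and the output shape are assumptions made for this example; they are not taken from the diff.

```python
from typing import Any, Dict


def resolve_ref(ref: str, defs: Dict[str, Any]) -> Dict[str, Any]:
    """Look up a "#/$defs/Name" reference; return {} when the target is missing."""
    name = ref.split("/")[-1]
    target = defs.get(name, {})
    return target if isinstance(target, dict) else {}


def process_properties(properties: Dict[str, Any], defs: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten properties, using fallbacks instead of raising KeyError.

    Note: this sketch does not guard against self-referential $defs.
    """
    result: Dict[str, Any] = {}
    for name, prop in properties.items():
        if not isinstance(prop, dict):
            # Malformed property entry: fall back to a generic description.
            result[name] = {"type": "any", "description": ""}
        elif "$ref" in prop:
            target = resolve_ref(prop["$ref"], defs)
            result[name] = process_properties(target.get("properties", {}), defs)
        elif prop.get("type") == "array":
            items = prop.get("items") or {}  # "items" may be missing or None
            if "$ref" in items:
                target = resolve_ref(items["$ref"], defs)
                result[name] = [process_properties(target.get("properties", {}), defs)]
            else:
                result[name] = [items.get("type", "any")]
        else:
            result[name] = {
                "type": prop.get("type", "any"),
                "description": prop.get("description", ""),
            }
    return result


def transform_schema_sketch(schema: Any) -> Dict[str, Any]:
    """Validate the top-level structure, then process it defensively."""
    if not isinstance(schema, dict) or "properties" not in schema:
        raise ValueError("Invalid schema: expected a dict with a 'properties' key")
    defs = schema.get("$defs") or {}
    return process_properties(schema["properties"], defs)
```

With this shape, a missing $defs entry or an array without items degrades to an empty or generic value, while a schema without a top-level properties key fails early with a descriptive ValueError.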

Changes Made

  1. scrapegraphai/utils/schema_trasform.py:

    • Fixed typo in docstring: "trasfrom" → "transforms"
    • Added null checks for items, $defs, and reference keys
    • Added fallback values for missing references and malformed arrays
    • Added validation for required schema structure with descriptive error messages
  2. scrapegraphai/graphs/smart_scraper_graph.py:

    • Replaced print() statements with proper logger.info() and logger.warning()
    • Added response structure validation before logging
    • Imported and configured logging module
  3. scrapegraphai/utils/__init__.py:

    • Added documentation comment noting the filename typo for future reference
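The logging change in item 2 could look roughly like the sketch below. The helper name log_api_response and the response keys request_id and result are assumptions for illustration only; the actual code in SmartScraperGraph may differ.

```python
import logging
from typing import Any

# Module-level logger; handlers and levels are left to the application.
logger = logging.getLogger(__name__)


def log_api_response(response: Any) -> bool:
    """Validate the response shape before logging it.

    Returns True when the response has the expected keys, False otherwise.
    """
    if not isinstance(response, dict):
        logger.warning("Unexpected response type: %s", type(response).__name__)
        return False
    if "request_id" not in response or "result" not in response:
        logger.warning("Response is missing expected keys: %s", sorted(response))
        return False
    # Lazy %-style formatting avoids building the message when INFO is disabled.
    logger.info("Request ID: %s", response["request_id"])
    logger.info("Result: %s", response["result"])
    return True
```

Unlike print(), these messages can be filtered by level, redirected to a file, or silenced entirely by the calling application without changing library code.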

Impact

  • Prevents runtime crashes from malformed schema processing
  • Improves production readiness with proper logging practices
  • Better error handling with graceful fallbacks
  • Enhanced debugging with structured logging instead of print statements
  • Maintains backward compatibility while fixing critical bugs

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • Code quality improvement
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

- Fixed typo in docstring (trasfrom -> transforms)
- Added comprehensive error handling for missing schema keys
- Added fallback values for malformed array items and missing references
- Improved logging in SmartScraperGraph (replaced print with logger)
- Added proper validation for pydantic schema structure

These fixes prevent KeyError exceptions and improve production reliability.
@dosubot (bot) added the labels size:M (this PR changes 30-99 lines, ignoring generated files), bug (something isn't working), and typo on Jul 25, 2025