
Copilot AI commented Nov 4, 2025

Replaces the open-source scrapegraphai library, which required a local LLM setup, with the scrapegraph-py SDK for cloud-based API scraping.

Changes

Dependencies & Configuration

  • requirements.txt: scrapegraphai>=1.0.0 → scrapegraph-py>=1.0.0
  • .env.example: OPENAI_API_KEY/SCRAPEGRAPHAI_API_KEY → SGAI_API_KEY
  • config.py: Removed openai_api_key, added sgai_api_key

Core Implementation (scraper.py)

  • Replaced SmartScraperGraph with Client from scrapegraph_py
  • scrape_product(): Uses client.smartscraper() with structured output schema
  • scrape_search_results(): Extracts multiple products with response validation
  • Added close() method for resource cleanup
  • Enhanced error handling: validates response structure before parsing, falls back to mock data
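The validate-then-fall-back pattern described above can be sketched as follows. This is a minimal illustration, not the demo's actual implementation: the client is injected as a parameter so the pattern can be shown without the scrapegraph-py SDK installed, and MOCK_PRODUCT is a hypothetical stand-in for the demo's mock data.

```python
# Sketch of the response-validation and mock-fallback pattern (illustrative;
# a real implementation would hold a scrapegraph_py Client on the scraper).

MOCK_PRODUCT = {"name": "Mock Product", "price": 0.0, "rating": None}  # stand-in

def scrape_product(client, url: str, prompt: str) -> dict:
    """Call client.smartscraper(), validate the response, fall back to mock data."""
    try:
        response = client.smartscraper(website_url=url, user_prompt=prompt)
    except Exception as exc:  # API/network failure: degrade rather than crash
        print(f"Scrape failed ({type(exc).__name__}); using mock data")
        return dict(MOCK_PRODUCT)
    # Validate structure before parsing: the SDK responds with {'result': {...}}
    if not isinstance(response, dict) or not isinstance(response.get("result"), dict):
        return dict(MOCK_PRODUCT)
    return response["result"]
```

Injecting the client also makes the function trivial to exercise with a stub in tests, which is how the updated test suite avoids real API calls.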

Examples & Tests

  • Added scraper.close() calls to all examples and quickstart
  • Updated tests to handle SDK client initialization
  • Updated quickstart.py messaging for API key awareness

Documentation

  • README: Added SDK setup instructions, API key acquisition steps, troubleshooting for API issues
  • Clarified cloud-based architecture vs local processing

Usage

from scrapegraph_py import Client
from src.scrapegraph_demo import Config, MarketplaceScraper

config = Config.from_env()  # Reads SGAI_API_KEY
scraper = MarketplaceScraper(config)

product = scraper.scrape_product(
    url="https://www.amazon.com/dp/PRODUCTID",
    marketplace="Amazon"
)

scraper.close()  # New: cleanup SDK resources

Mock data fallback is preserved for testing without an API key. All 12 tests pass; 0 security vulnerabilities.

Original prompt

Objective

Migrate the ScrapeGraphAI Elasticsearch Demo from using the open-source scrapegraphai library to the API-based scrapegraph-py SDK.

Background

Currently, the repository uses the open-source ScrapeGraphAI library which requires local LLM setup and OpenAI API keys. We need to transition to using the scrapegraph-py SDK (installed via pip install scrapegraph-py) which provides a simpler API-based approach for scraping.

Reference SDK repository: https://github.com/ScrapeGraphAI/scrapegraph-sdk

Required Changes

1. Update requirements.txt

Replace scrapegraphai>=1.0.0 with scrapegraph-py and keep all other dependencies:

# ScrapeGraphAI SDK (API-based)
scrapegraph-py

# Elasticsearch
elasticsearch>=8.0.0

# Data processing
pandas>=2.0.0

# Environment management
python-dotenv>=1.0.0

# Utilities
requests>=2.31.0
pydantic>=2.0.0

2. Update .env.example

Replace OpenAI/ScrapeGraphAI API key variables with the SDK's expected format:

# ScrapeGraphAI API Key (required for scrapegraph-py SDK)
SGAI_API_KEY=your-scrapegraphai-api-key-here

# Elasticsearch Configuration
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
# ELASTICSEARCH_USERNAME=
# ELASTICSEARCH_PASSWORD=

3. Update src/scrapegraph_demo/config.py

Modify the Config class to use SGAI_API_KEY instead of OPENAI_API_KEY or SCRAPEGRAPHAI_API_KEY:

  • Change parameter name from openai_api_key or scrapegraphai_api_key to sgai_api_key
  • Update from_env() method to read SGAI_API_KEY from environment
  • Update validation to check for sgai_api_key presence
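The Config change above might look like the following sketch. It assumes a dataclass-based Config; any field or method beyond sgai_api_key, from_env(), and the SGAI_API_KEY variable is illustrative, not taken from the repository.

```python
# Minimal sketch of the Config migration (field set is illustrative).
import os
from dataclasses import dataclass

@dataclass
class Config:
    sgai_api_key: str = ""                 # was: openai_api_key
    elasticsearch_host: str = "localhost"

    @classmethod
    def from_env(cls) -> "Config":
        # Read SGAI_API_KEY instead of OPENAI_API_KEY
        return cls(
            sgai_api_key=os.getenv("SGAI_API_KEY", ""),
            elasticsearch_host=os.getenv("ELASTICSEARCH_HOST", "localhost"),
        )

    def validate(self) -> None:
        if not self.sgai_api_key:
            raise ValueError("SGAI_API_KEY is not set")
```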

4. Rewrite src/scrapegraph_demo/scraper.py

Complete rewrite to use the scrapegraph-py SDK instead of the open-source library:

Key SDK Methods to Use:

  • Client(api_key="...") or Client.from_env() - Initialize the SDK client
  • client.smartscraper(website_url, user_prompt, output_schema) - Extract structured product data
  • client.scrape(website_url) - Get raw HTML content
  • client.close() - Close the client connection

Implementation Requirements:

  • Initialize Client from scrapegraph-py in __init__
  • In scrape_product(): Use client.smartscraper() with a detailed prompt to extract:
    • Product name, price, currency, brand
    • Rating (out of 5), review count
    • Description, category, availability
  • In scrape_search_results(): Use client.smartscraper() or client.searchscraper() to extract multiple products
  • Implement proper error handling and fallback to mock data if API fails
  • Add a close() method that calls self.client.close()
  • Parse SDK responses (which return dictionaries with a 'result' key) into Product objects
  • Keep all existing helper methods like _extract_product_id() and _extract_price()

SDK Response Format:

response = {
    'result': {
        'name': 'Product Name',
        'price': 99.99,
        'rating': 4.5,
        # ... other fields
    }
}
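Unwrapping that 'result' envelope into a Product object could look like the sketch below. A plain dataclass stands in for the demo's actual Product model (which may be pydantic-based), and the field set is illustrative.

```python
# Sketch: defensively unwrap the SDK's {'result': {...}} envelope.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Product:  # stand-in for the demo's real Product model
    name: str
    price: Optional[float] = None
    rating: Optional[float] = None

def parse_response(response) -> Optional[Product]:
    """Return a Product, or None if the response is malformed."""
    result = response.get("result") if isinstance(response, dict) else None
    if not isinstance(result, dict):
        return None
    return Product(
        name=result.get("name", "Unknown"),
        price=result.get("price"),
        rating=result.get("rating"),
    )
```

Returning None (rather than raising) lets the caller decide whether to retry or fall back to mock data.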

5. Update Example Files

Update all files in the examples/ directory to work with the new SDK-based scraper:

  • examples/basic_usage.py
  • examples/product_comparison.py
  • examples/advanced_search.py

Ensure they:

  • Import from the updated scraper module
  • Handle the new API-based scraping
  • Include proper error handling
  • Call scraper.close() when done
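The "call scraper.close() when done" requirement is commonly met with try/finally so cleanup runs even when scraping raises. The DummyScraper below is a hypothetical stand-in for the demo's MarketplaceScraper, used only to show the pattern.

```python
# Sketch of the cleanup pattern the example files should follow.
class DummyScraper:  # illustrative stand-in for MarketplaceScraper
    def __init__(self):
        self.closed = False
    def scrape_product(self, url, marketplace):
        return {"name": "Example", "url": url}
    def close(self):
        self.closed = True  # a real scraper would call self.client.close()

scraper = DummyScraper()
try:
    product = scraper.scrape_product("https://example.com/item", "Amazon")
finally:
    scraper.close()  # always release SDK resources, even on error
```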

6. Update quickstart.py

Modify the quickstart script to:

  • Use the SDK-based scraper
  • Update error messages to mention SGAI_API_KEY
  • Add proper cleanup with scraper.close()
  • Update documentation strings

7. Update README.md

Update documentation to reflect the SDK-based approach:

  • Change installation instructions to mention scrapegraph-py
  • Update environment variable documentation (SGAI_API_KEY instead of OPENAI_API_KEY)
  • Update code examples to show SDK usage
  • Add note about ScrapeGraphAI API credits
  • Update the "Related Resources" section to include SDK documentation
  • Clarify that this uses the ScrapeGraphAI API (not open-source library)

8. Update Tests

Update test files in tests/ to work with the SDK-based implementation:

  • Update imports
  • Mock the SDK Client appropriately
  • Test SDK-specific functionality
  • Ensure tests don't require actual API calls
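Mocking the SDK client so tests never hit the API can be done with unittest.mock, roughly as below. The exact patch targets would depend on the project's import paths; this sketch only shows the mechanics.

```python
# Sketch: stand in for the scrapegraph_py Client in tests with a MagicMock.
from unittest.mock import MagicMock

mock_client = MagicMock()
mock_client.smartscraper.return_value = {
    "result": {"name": "Test Product", "price": 19.99}
}

# Exercise the code path that would normally call the API
response = mock_client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract product data",
)
assert response["result"]["name"] == "Test Product"
mock_client.smartscraper.assert_called_once()

# The cleanup path can be asserted the same way
mock_client.close()
mock_client.close.assert_called_once()
```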

Technical Specifications

SDK Documentation Reference:

  • Client initialization: from scrapegraph_py import Client
  • Environment-based init: client = Client.from_env() (reads SGAI_API_KEY)
  • SmartScraper usage: Extract structured data with AI
  • Async support available via AsyncClient (optional for future)

Key Differences:

  • Old approach: Graph-based pipelines with local LLM processing
  • New approach: Simple API client with cloud-based processing
  • Authentication: SGAI_API_KEY instead of OPENAI_API_KEY
  • Benefits: Simpler code, no local LLM setup, managed infrastructure

Testing Requirements

After implementation:

  1. All existing tests should pass with mocked SDK client
  2. Mock data functionality should still work as fallback
  3. Integration with Elasticsearch should remain unchanged
  4. Example scripts should run successfully with valid API key

Documentation Requirements

  • Update all code comments to reflect SDK usage
  • Keep inline documentation clear and accurate
  • Ensure README has complete setup instructions
  • Add troubleshooting section for common SDK issues

Notes

  • Keep all existing Elasticsearch functionality unchanged
  • Maintain backward compatibility with existing Product models
  • Preserve the demo's educational value and ease of use
  • The SDK is much simpler than the open-source library - leverage this simplicity
  • Ensure proper resource cleanup (client.close()) in all scripts


Copilot AI and others added 2 commits November 4, 2025 19:33
- Updated requirements.txt to use scrapegraph-py instead of scrapegraphai
- Updated .env.example with SGAI_API_KEY
- Updated config.py to use sgai_api_key
- Rewrote scraper.py to use scrapegraph-py SDK with Client API
- Added close() method to scraper for proper cleanup
- Updated all example files to call scraper.close()
- Updated quickstart.py with SDK-aware messaging
- Updated tests to work with SDK-based implementation
- Updated README.md with SDK documentation and instructions

Co-authored-by: lurenss <38807022+lurenss@users.noreply.github.com>
- Add version constraint to scrapegraph-py dependency
- Improve error handling with specific exceptions in close() method
- Add response validation in scrape_product() and scrape_search_results()
- Use safe dictionary access with isinstance checks
- Add logging for exception types in error handlers

Co-authored-by: lurenss <38807022+lurenss@users.noreply.github.com>
Copilot AI changed the title [WIP] Migrate Elasticsearch demo to scrapegraph-py SDK Migrate from open-source scrapegraphai to API-based scrapegraph-py SDK Nov 4, 2025
Copilot AI requested a review from lurenss November 4, 2025 19:43
Copilot finished work on behalf of lurenss November 4, 2025 19:43
@lurenss lurenss marked this pull request as ready for review November 4, 2025 19:49
@lurenss lurenss merged commit d64f6f1 into main Nov 4, 2025
1 check passed