
Copilot AI commented Nov 4, 2025

Replaces the open-source scrapegraphai library, which required a local LLM setup, with the scrapegraph-py SDK for cloud-based API scraping.

Changes

Dependencies & Configuration

  • requirements.txt: scrapegraphai>=1.0.0 → scrapegraph-py>=1.0.0
  • .env.example: OPENAI_API_KEY/SCRAPEGRAPHAI_API_KEY → SGAI_API_KEY
  • config.py: Removed openai_api_key, added sgai_api_key

Core Implementation (scraper.py)

  • Replaced SmartScraperGraph with Client from scrapegraph_py
  • scrape_product(): Uses client.smartscraper() with structured output schema
  • scrape_search_results(): Extracts multiple products with response validation
  • Added close() method for resource cleanup
  • Enhanced error handling: validates response structure before parsing, falls back to mock data
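The validate-then-fall-back pattern described above can be sketched as follows. This is a minimal illustration, not the demo's actual implementation: the client is injected as a parameter so the pattern can be shown without the scrapegraph-py SDK installed, and MOCK_PRODUCT is a hypothetical stand-in for the demo's mock data.

```python
# Sketch of the response-validation and mock-fallback pattern (illustrative;
# a real implementation would hold a scrapegraph_py Client on the scraper).

MOCK_PRODUCT = {"name": "Mock Product", "price": 0.0, "rating": None}  # stand-in

def scrape_product(client, url: str, prompt: str) -> dict:
    """Call client.smartscraper(), validate the response, fall back to mock data."""
    try:
        response = client.smartscraper(website_url=url, user_prompt=prompt)
    except Exception as exc:  # API/network failure: degrade rather than crash
        print(f"Scrape failed ({type(exc).__name__}); using mock data")
        return dict(MOCK_PRODUCT)
    # Validate structure before parsing: the SDK responds with {'result': {...}}
    if not isinstance(response, dict) or not isinstance(response.get("result"), dict):
        return dict(MOCK_PRODUCT)
    return response["result"]
```

Injecting the client also makes the function trivial to exercise with a stub in tests, which is how the updated test suite avoids real API calls.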

Examples & Tests

  • Added scraper.close() calls to all examples and quickstart
  • Updated tests to handle SDK client initialization
  • Updated quickstart.py messaging for API key awareness

Documentation

  • README: Added SDK setup instructions, API key acquisition steps, troubleshooting for API issues
  • Clarified cloud-based architecture vs local processing

Usage

from scrapegraph_py import Client
from src.scrapegraph_demo import Config, MarketplaceScraper

config = Config.from_env()  # Reads SGAI_API_KEY
scraper = MarketplaceScraper(config)

product = scraper.scrape_product(
    url="https://www.amazon.com/dp/PRODUCTID",
    marketplace="Amazon"
)

scraper.close()  # New: cleanup SDK resources

Mock data fallback is preserved for testing without an API key. All 12 tests pass; 0 security vulnerabilities.

Original prompt

Objective

Migrate the ScrapeGraphAI Elasticsearch Demo from using the open-source scrapegraphai library to the API-based scrapegraph-py SDK.

Background

Currently, the repository uses the open-source ScrapeGraphAI library which requires local LLM setup and OpenAI API keys. We need to transition to using the scrapegraph-py SDK (installed via pip install scrapegraph-py) which provides a simpler API-based approach for scraping.

Reference SDK repository: https://github.com/ScrapeGraphAI/scrapegraph-sdk

Required Changes

1. Update requirements.txt

Replace scrapegraphai>=1.0.0 with scrapegraph-py and keep all other dependencies:

# ScrapeGraphAI SDK (API-based)
scrapegraph-py

# Elasticsearch
elasticsearch>=8.0.0

# Data processing
pandas>=2.0.0

# Environment management
python-dotenv>=1.0.0

# Utilities
requests>=2.31.0
pydantic>=2.0.0

2. Update .env.example

Replace OpenAI/ScrapeGraphAI API key variables with the SDK's expected format:

# ScrapeGraphAI API Key (required for scrapegraph-py SDK)
SGAI_API_KEY=your-scrapegraphai-api-key-here

# Elasticsearch Configuration
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
# ELASTICSEARCH_USERNAME=
# ELASTICSEARCH_PASSWORD=

3. Update src/scrapegraph_demo/config.py

Modify the Config class to use SGAI_API_KEY instead of OPENAI_API_KEY or SCRAPEGRAPHAI_API_KEY:

  • Change parameter name from openai_api_key or scrapegraphai_api_key to sgai_api_key
  • Update from_env() method to read SGAI_API_KEY from environment
  • Update validation to check for sgai_api_key presence
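The Config change above might look like the following sketch. It assumes a dataclass-based Config; any field or method beyond sgai_api_key, from_env(), and the SGAI_API_KEY variable is illustrative, not taken from the repository.

```python
# Minimal sketch of the Config migration (field set is illustrative).
import os
from dataclasses import dataclass

@dataclass
class Config:
    sgai_api_key: str = ""                 # was: openai_api_key
    elasticsearch_host: str = "localhost"

    @classmethod
    def from_env(cls) -> "Config":
        # Read SGAI_API_KEY instead of OPENAI_API_KEY
        return cls(
            sgai_api_key=os.getenv("SGAI_API_KEY", ""),
            elasticsearch_host=os.getenv("ELASTICSEARCH_HOST", "localhost"),
        )

    def validate(self) -> None:
        if not self.sgai_api_key:
            raise ValueError("SGAI_API_KEY is not set")
```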

4. Rewrite src/scrapegraph_demo/scraper.py

Complete rewrite to use the scrapegraph-py SDK instead of the open-source library:

Key SDK Methods to Use:

  • Client(api_key="...") or Client.from_env() - Initialize the SDK client
  • client.smartscraper(website_url, user_prompt, output_schema) - Extract structured product data
  • client.scrape(website_url) - Get raw HTML content
  • client.close() - Close the client connection

Implementation Requirements:

  • Initialize Client from scrapegraph-py in __init__
  • In scrape_product(): Use client.smartscraper() with a detailed prompt to extract:
    • Product name, price, currency, brand
    • Rating (out of 5), review count
    • Description, category, availability
  • In scrape_search_results(): Use client.smartscraper() or client.searchscraper() to extract multiple products
  • Implement proper error handling and fallback to mock data if API fails
  • Add a close() method that calls self.client.close()
  • Parse SDK responses (which return dictionaries with a 'result' key) into Product objects
  • Keep all existing helper methods like _extract_product_id() and _extract_price()

SDK Response Format:

response = {
    'result': {
        'name': 'Product Name',
        'price': 99.99,
        'rating': 4.5,
        # ... other fields
    }
}
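Unwrapping that 'result' envelope into a Product object could look like the sketch below. A plain dataclass stands in for the demo's actual Product model (which may be pydantic-based), and the field set is illustrative.

```python
# Sketch: defensively unwrap the SDK's {'result': {...}} envelope.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Product:  # stand-in for the demo's real Product model
    name: str
    price: Optional[float] = None
    rating: Optional[float] = None

def parse_response(response) -> Optional[Product]:
    """Return a Product, or None if the response is malformed."""
    result = response.get("result") if isinstance(response, dict) else None
    if not isinstance(result, dict):
        return None
    return Product(
        name=result.get("name", "Unknown"),
        price=result.get("price"),
        rating=result.get("rating"),
    )
```

Returning None (rather than raising) lets the caller decide whether to retry or fall back to mock data.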

5. Update Example Files

Update all files in the examples/ directory to work with the new SDK-based scraper:

  • examples/basic_usage.py
  • examples/product_comparison.py
  • examples/advanced_search.py

Ensure they:

  • Import from the updated scraper module
  • Handle the new API-based scraping
  • Include proper error handling
  • Call scraper.close() when done
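The "call scraper.close() when done" requirement is commonly met with try/finally so cleanup runs even when scraping raises. The DummyScraper below is a hypothetical stand-in for the demo's MarketplaceScraper, used only to show the pattern.

```python
# Sketch of the cleanup pattern the example files should follow.
class DummyScraper:  # illustrative stand-in for MarketplaceScraper
    def __init__(self):
        self.closed = False
    def scrape_product(self, url, marketplace):
        return {"name": "Example", "url": url}
    def close(self):
        self.closed = True  # a real scraper would call self.client.close()

scraper = DummyScraper()
try:
    product = scraper.scrape_product("https://example.com/item", "Amazon")
finally:
    scraper.close()  # always release SDK resources, even on error
```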

6. Update quickstart.py

Modify the quickstart script to:

  • Use the SDK-based scraper
  • Update error messages to mention SGAI_API_KEY
  • Add proper cleanup with scraper.close()
  • Update documentation strings

7. Update README.md

Update documentation to reflect the SDK-based approach:

  • Change installation instructions to mention scrapegraph-py
  • Update environment variable documentation (SGAI_API_KEY instead of OPENAI_API_KEY)
  • Update code examples to show SDK usage
  • Add note about ScrapeGraphAI API credits
  • Update the "Related Resources" section to include SDK documentation
  • Clarify that this uses the ScrapeGraphAI API (not open-source library)

8. Update Tests

Update test files in tests/ to work with the SDK-based implementation:

  • Update imports
  • Mock the SDK Client appropriately
  • Test SDK-specific functionality
  • Ensure tests don't require actual API calls
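Mocking the SDK client so tests never hit the API can be done with unittest.mock, roughly as below. The exact patch targets would depend on the project's import paths; this sketch only shows the mechanics.

```python
# Sketch: stand in for the scrapegraph_py Client in tests with a MagicMock.
from unittest.mock import MagicMock

mock_client = MagicMock()
mock_client.smartscraper.return_value = {
    "result": {"name": "Test Product", "price": 19.99}
}

# Exercise the code path that would normally call the API
response = mock_client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract product data",
)
assert response["result"]["name"] == "Test Product"
mock_client.smartscraper.assert_called_once()

# The cleanup path can be asserted the same way
mock_client.close()
mock_client.close.assert_called_once()
```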

Technical Specifications

SDK Documentation Reference:

  • Client initialization: from scrapegraph_py import Client
  • Environment-based init: client = Client.from_env() (reads SGAI_API_KEY)
  • SmartScraper usage: Extract structured data with AI
  • Async support available via AsyncClient (optional for future)

Key Differences:

  • Old approach: Graph-based pipelines with local LLM processing
  • New approach: Simple API client with cloud-based processing
  • Authentication: SGAI_API_KEY instead of OPENAI_API_KEY
  • Benefits: Simpler code, no local LLM setup, managed infrastructure

Testing Requirements

After implementation:

  1. All existing tests should pass with mocked SDK client
  2. Mock data functionality should still work as fallback
  3. Integration with Elasticsearch should remain unchanged
  4. Example scripts should run successfully with valid API key

Documentation Requirements

  • Update all code comments to reflect SDK usage
  • Keep inline documentation clear and accurate
  • Ensure README has complete setup instructions
  • Add troubleshooting section for common SDK issues

Notes

  • Keep all existing Elasticsearch functionality unchanged
  • Maintain backward compatibility with existing Product models
  • Preserve the demo's educational value and ease of use
  • The SDK is much simpler than the open-source library - leverage this simplicity
  • Ensure proper resource cleanup (client.close()) in all scripts


Copilot AI and others added 2 commits November 4, 2025 19:33
- Updated requirements.txt to use scrapegraph-py instead of scrapegraphai
- Updated .env.example with SGAI_API_KEY
- Updated config.py to use sgai_api_key
- Rewrote scraper.py to use scrapegraph-py SDK with Client API
- Added close() method to scraper for proper cleanup
- Updated all example files to call scraper.close()
- Updated quickstart.py with SDK-aware messaging
- Updated tests to work with SDK-based implementation
- Updated README.md with SDK documentation and instructions

Co-authored-by: lurenss <38807022+lurenss@users.noreply.github.com>
- Add version constraint to scrapegraph-py dependency
- Improve error handling with specific exceptions in close() method
- Add response validation in scrape_product() and scrape_search_results()
- Use safe dictionary access with isinstance checks
- Add logging for exception types in error handlers

Co-authored-by: lurenss <38807022+lurenss@users.noreply.github.com>
Copilot AI changed the title [WIP] Migrate Elasticsearch demo to scrapegraph-py SDK Migrate from open-source scrapegraphai to API-based scrapegraph-py SDK Nov 4, 2025
Copilot AI requested a review from lurenss November 4, 2025 19:43
Copilot finished work on behalf of lurenss November 4, 2025 19:43
@lurenss lurenss marked this pull request as ready for review November 4, 2025 19:49
@lurenss lurenss merged commit d64f6f1 into main Nov 4, 2025
1 check passed