diff --git a/.agent/README.md b/.agent/README.md
deleted file mode 100644
index b1b868e9..00000000
--- a/.agent/README.md
+++ /dev/null
@@ -1,391 +0,0 @@
-# ScrapeGraphAI SDK Documentation
-
-Welcome to the ScrapeGraphAI SDK documentation hub. This directory contains comprehensive documentation for understanding, developing, and maintaining the official Python SDK for the ScrapeGraph AI API.
-
-## 📚 Available Documentation
-
-### System Documentation (`system/`)
-
-#### [Project Architecture](./system/project_architecture.md)
-Complete SDK architecture documentation including:
-- **Repository Structure** - How the Python SDK is organized
-- **Python SDK Architecture** - Client structure, async/sync support, models
-- **API Endpoints Coverage** - All supported endpoints
-- **Authentication** - API key management and security
-- **Testing Strategy** - Unit tests, integration tests, CI/CD
-- **Release Process** - Semantic versioning and publishing
-
-### Task Documentation (`tasks/`)
-
-*Future: PRD and implementation plans for specific SDK features*
-
-### SOP Documentation (`sop/`)
-
-*Future: Standard operating procedures (e.g., adding new endpoints, releasing versions)*
-
----
-
-## 🚀 Quick Start
-
-### For New Contributors
-
-1. **Read First:**
-   - [Main README](../README.md) - Project overview and features
-   - [Python SDK README](../scrapegraph-py/README.md) - Python SDK guide
-
-2. **Set Up Development Environment:**
-
-   ```bash
-   cd scrapegraph-py
-
-   # Install dependencies with uv (recommended)
-   pip install uv
-   uv sync
-
-   # Or use pip
-   pip install -e .
-
-   # Install pre-commit hooks
-   pre-commit install
-   ```
-
-3. **Run Tests:**
-
-   ```bash
-   cd scrapegraph-py
-   pytest tests/ -v
-   ```
-
-4. **Explore the Codebase:**
-   - **Python**: `scrapegraph_py/client.py` - Sync client, `scrapegraph_py/async_client.py` - Async client
-   - **Examples**: `scrapegraph-py/examples/`
-
----
-
-## 🔍 Finding Information
-
-### I want to understand...
-
-**...how to add a new endpoint:**
-- Read: Python SDK - `scrapegraph_py/client.py`, `scrapegraph_py/async_client.py`
-- Examples: Look at existing endpoint implementations
-
-**...how authentication works:**
-- Read: Python SDK - `scrapegraph_py/client.py` (initialization)
-- The SDK supports `SGAI_API_KEY` environment variable
-
-**...how error handling works:**
-- Read: Python SDK - `scrapegraph_py/exceptions.py`
-
-**...how testing works:**
-- Read: Python SDK - `tests/` directory, `pytest.ini`
-- Run: Follow test commands in README
-
-**...how releases work:**
-- Read: Python SDK - `.releaserc.yml` (semantic-release config)
-- GitHub Actions: `.github/workflows/` for automated releases
-
----
-
-## 🛠️ Development Workflows
-
-### Running Tests
-
-```bash
-cd scrapegraph-py
-
-# Run all tests
-pytest tests/ -v
-
-# Run specific test file
-pytest tests/test_smartscraper.py -v
-
-# Run with coverage
-pytest --cov=scrapegraph_py --cov-report=html tests/
-```
-
-### Code Quality
-
-```bash
-cd scrapegraph-py
-
-# Format code
-black scrapegraph_py tests
-
-# Sort imports
-isort scrapegraph_py tests
-
-# Lint code
-ruff check scrapegraph_py tests
-
-# Type check
-mypy scrapegraph_py
-
-# Run all checks via Makefile
-make format
-make lint
-```
-
-### Building & Publishing
-
-```bash
-cd scrapegraph-py
-
-# Build package
-python -m build
-
-# Publish to PyPI (automated via GitHub Actions)
-twine upload dist/*
-```
-
----
-
-## 📊 SDK Endpoint Reference
-
-The SDK supports the following endpoints:
-
-| Endpoint | Python SDK | Purpose |
-|----------|-----------|---------|
-| SmartScraper | ✅ | AI-powered data extraction |
-| SearchScraper | ✅ | Multi-website search extraction |
-| Markdownify | ✅ | HTML to Markdown conversion |
-| SmartCrawler | ✅ | Sitemap generation & crawling |
-| AgenticScraper | ✅ | Browser automation |
-| Scrape | ✅ | Basic HTML extraction |
-| Scheduled Jobs | ✅ | Cron-based job scheduling |
-| Credits | ✅ | Credit balance management |
-| Feedback | ✅ | Rating and feedback |
-
----
-
-## 🔧 Key Files Reference
-
-### Python SDK
-
-**Entry Points:**
-- `scrapegraph_py/__init__.py` - Package exports
-- `scrapegraph_py/client.py` - Synchronous client
-- `scrapegraph_py/async_client.py` - Asynchronous client
-
-**Models:**
-- `scrapegraph_py/models/` - Pydantic request/response models
-  - `smartscraper_models.py` - SmartScraper schemas
-  - `searchscraper_models.py` - SearchScraper schemas
-  - `crawler_models.py` - Crawler schemas
-  - `markdownify_models.py` - Markdownify schemas
-  - And more...
-
-**Utilities:**
-- `scrapegraph_py/utils/` - Helper functions
-- `scrapegraph_py/logger.py` - Logging configuration
-- `scrapegraph_py/config.py` - Configuration constants
-- `scrapegraph_py/exceptions.py` - Custom exceptions
-
-**Configuration:**
-- `pyproject.toml` - Package metadata, dependencies, tool configs
-- `pytest.ini` - Pytest configuration
-- `Makefile` - Common development tasks
-- `.releaserc.yml` - Semantic-release configuration
-
----
-
-## 🧪 Testing
-
-### Python SDK Test Structure
-
-```
-scrapegraph-py/tests/
-├── test_async_client.py      # Async client tests
-├── test_client.py            # Sync client tests
-├── test_smartscraper.py      # SmartScraper endpoint tests
-├── test_searchscraper.py     # SearchScraper endpoint tests
-├── test_crawler.py           # Crawler endpoint tests
-└── conftest.py               # Pytest fixtures
-```
-
-### Writing Tests
-
-**Python Example:**
-```python
-import pytest
-from scrapegraph_py import Client
-
-def test_smartscraper_basic():
-    client = Client(api_key="test-key")
-    response = client.smartscraper(
-        website_url="https://example.com",
-        user_prompt="Extract title"
-    )
-    assert response.request_id is not None
-```
-
----
-
-## 🚨 Troubleshooting
-
-### Common Issues
-
-**Issue: Import errors in Python SDK**
-- **Cause:** Package not installed or outdated
-- **Solution:**
-  ```bash
-  cd scrapegraph-py
-  pip install -e .
-
-  # Or with uv
-  uv sync
-  ```
-
-**Issue: API key errors**
-- **Cause:** Invalid or missing API key
-- **Solution:**
-  - Set the `SGAI_API_KEY` environment variable
-  - Or pass the `api_key` parameter directly
-  - Get an API key from https://scrapegraphai.com
-
-**Issue: Type errors in Python SDK**
-- **Cause:** Using wrong model types
-- **Solution:** Check `scrapegraph_py/models/` for correct Pydantic models
-
-**Issue: Tests failing**
-- **Cause:** Missing test environment variables
-- **Solution:** Set `SGAI_API_KEY` for integration tests or use mocked tests
-
----
-
-## 📖 External Documentation
-
-### Official Docs
-- [ScrapeGraph AI API Documentation](https://docs.scrapegraphai.com)
-- [Python SDK Documentation](https://docs.scrapegraphai.com/sdks/python)
-
-### Package Repositories
-- [PyPI - scrapegraph-py](https://pypi.org/project/scrapegraph-py/)
-
-### Development Tools
-- [pytest Documentation](https://docs.pytest.org/)
-- [Pydantic Documentation](https://docs.pydantic.dev/)
-- [uv Documentation](https://docs.astral.sh/uv/)
-
----
-
-## 🤝 Contributing
-
-### Before Making Changes
-
-1. **Read relevant documentation** - Understand the SDK structure
-2. **Check existing issues** - Avoid duplicate work
-3. **Run tests** - Ensure current state is green
-4. **Create a branch** - Use descriptive branch names (e.g., `feat/add-pagination-support`)
-
-### Development Process
-
-1. **Make changes** - Write clean, documented code
-2. **Add tests** - Cover new functionality
-3. **Run code quality checks** - Format, lint, type check
-4. **Run tests** - Ensure all tests pass
-5. **Update documentation** - Update README and examples
-6. **Commit with semantic commit messages** - `feat:`, `fix:`, `docs:`, etc.
-7. **Create pull request** - Describe changes thoroughly
-
-### Code Style
-
-**Python SDK:**
-- **Black** - Code formatting (line length: 88)
-- **isort** - Import sorting (Black profile)
-- **Ruff** - Fast linting
-- **mypy** - Type checking (strict mode)
-- **Type hints** - Use Pydantic models and type annotations
-- **Docstrings** - Document public functions and classes
-
-### Commit Message Format
-
-Follow [Conventional Commits](https://www.conventionalcommits.org/):
-
-```
-feat: add pagination support for smartscraper
-fix: handle timeout errors gracefully
-docs: update README with new examples
-test: add unit tests for crawler endpoint
-chore: update dependencies
-```
-
-This enables automated semantic versioning and changelog generation.
-
----
-
-## 📝 Documentation Maintenance
-
-### When to Update Documentation
-
-**Update `.agent/README.md` when:**
-- Adding new SDK features
-- Changing development workflows
-- Updating testing procedures
-
-**Update `README.md` (root) when:**
-- Adding new endpoints
-- Changing installation instructions
-- Adding new features or use cases
-
-**Update SDK-specific READMEs when:**
-- Adding new endpoint methods
-- Changing API surface
-- Adding examples
-
-### Documentation Best Practices
-
-1. **Keep examples working** - Test code examples regularly
-2. **Be specific** - Include version numbers, function names
-3. **Include error handling** - Show `try`/`except` patterns
-4. **Cross-reference** - Link between related sections
-5. **Keep changelogs** - Document all changes in CHANGELOG.md
-
----
-
-## 📅 Release Process
-
-The SDK uses **semantic-release** for automated versioning and publishing:
-
-### Release Workflow
-
-1. **Make changes** - Develop and test new features
-2. **Commit with semantic messages** - `feat:`, `fix:`, etc.
-3. **Merge to main** - Pull request approved and merged
-4. **Automated release** - GitHub Actions:
-   - Determines version bump (major/minor/patch)
-   - Updates version in `pyproject.toml`
-   - Generates CHANGELOG.md
-   - Creates GitHub release
-   - Publishes to PyPI
-
-### Version Bumping Rules
-
-- `feat:` → **Minor** version bump (0.x.0)
-- `fix:` → **Patch** version bump (0.0.x)
-- `BREAKING CHANGE:` → **Major** version bump (x.0.0)
-
----
-
-## 🔗 Quick Links
-
-- [Main README](../README.md) - Project overview
-- [Python SDK README](../scrapegraph-py/README.md) - Python guide
-- [Cookbook](../cookbook/) - Usage examples
-- [API Documentation](https://docs.scrapegraphai.com) - Full API docs
-
----
-
-## 📧 Support
-
-For questions or issues:
-1. Check this documentation first
-2. Review SDK-specific README
-3. Search existing GitHub issues
-4. Create a new issue with:
-   - SDK version
-   - Error message
-   - Minimal reproducible example
-
----
-
-**Happy Coding! 🚀**
diff --git a/.agent/sop/releasing_version.md b/.agent/sop/releasing_version.md
deleted file mode 100644
index 9e4e4858..00000000
--- a/.agent/sop/releasing_version.md
+++ /dev/null
@@ -1,450 +0,0 @@
-# SOP: Releasing a New Version
-
-**Last Updated:** January 2025
-
-This document describes the automated and manual release process for the Python SDK using semantic-release.
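The commit analysis that semantic-release automates can be approximated in a few lines. The sketch below is illustrative only: `bump_for` and `next_version` are hypothetical helpers, not part of the SDK or of semantic-release, and they assume the common mapping of `feat` to a minor bump, `fix` to a patch bump, and `!` or a `BREAKING CHANGE:` footer to a major bump.

```python
import re
from typing import List, Optional

# Hypothetical helpers for illustration; the real analyzer is more thorough.
_HEADER = re.compile(r"^(?P<type>[a-z]+)(?:\([^)]*\))?(?P<bang>!)?: \S")
_RANK = {"patch": 1, "minor": 2, "major": 3}


def bump_for(message: str) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None for one commit message."""
    header = message.splitlines()[0] if message else ""
    match = _HEADER.match(header)
    if match is None:
        return None  # e.g. "Fix bug" is not a Conventional Commit
    if match["bang"] or "BREAKING CHANGE:" in message:
        return "major"
    # Types such as docs, chore, style, refactor, test map to no release.
    return {"feat": "minor", "fix": "patch"}.get(match["type"])


def next_version(current: str, messages: List[str]) -> str:
    """Apply the largest bump implied by the commits since the last release."""
    bumps = [b for b in map(bump_for, messages) if b is not None]
    if not bumps:
        return current  # nothing release-worthy was committed
    major, minor, patch = (int(part) for part in current.split("."))
    top = max(bumps, key=_RANK.__getitem__)
    if top == "major":
        return f"{major + 1}.0.0"
    if top == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

Under these assumptions, a `fix:` and a `feat:` commit on top of `1.12.2` yield `1.13.0`, because only the largest bump among the commits is applied.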
-
-## Overview
-
-The SDK uses **semantic-release** for automated versioning and publishing:
-- Analyzes commit messages to determine version bump
-- Updates version numbers
-- Generates changelog
-- Creates GitHub releases
-- Publishes to PyPI
-
-## Semantic Versioning
-
-Version format: `MAJOR.MINOR.PATCH` (e.g., `1.12.2`)
-
-### Version Bump Rules
-
-Based on commit message prefixes:
-
-| Commit Type | Version Bump | Example |
-|-------------|--------------|---------|
-| `feat:` | Minor (0.x.0) | `feat: add pagination support` |
-| `fix:` | Patch (0.0.x) | `fix: handle timeout errors` |
-| `feat!:` or `BREAKING CHANGE:` | Major (x.0.0) | `feat!: change API interface` |
-| `docs:`, `chore:`, `style:`, `refactor:`, `test:` | None | No release triggered |
-
-### Commit Message Format
-
-Follow [Conventional Commits](https://www.conventionalcommits.org/):
-
-```
-<type>[optional scope]: <description>
-
-[optional body]
-
-[optional footer(s)]
-```
-
-**Examples:**
-
-```
-feat: add scheduled jobs support
-
-Implement full CRUD operations for scheduled jobs including:
-- Create job with cron expression
-- List all jobs
-- Update existing jobs
-- Delete jobs
-- Get job execution history
-```
-
-```
-fix: handle network timeout gracefully
-
-Add retry logic with exponential backoff for network failures.
-Improves reliability when API is temporarily unavailable.
-```
-
-```
-feat!: change client initialization API
-
-BREAKING CHANGE: Client() now requires api_key parameter explicitly.
-The from_env() class method should be used for environment-based initialization.
-
-Migration:
-- Before: client = Client()
-- After: client = Client.from_env()
-```
-
----
-
-## Automated Release Process
-
-### Prerequisites
-
-1. All changes committed to feature branch
-2. Pull request created and approved
-3. All CI checks passing (tests, linting)
-4. Semantic commit messages used
-
-### Step-by-Step: Automated Release
-
-**1. Prepare Your Changes**
-
-Ensure commits use semantic format:
-```bash
-git commit -m "feat: add new endpoint"
-git commit -m "fix: resolve timeout issue"
-```
-
-**2. Create Pull Request**
-
-```bash
-git push origin feature/your-feature
-# Create PR on GitHub
-```
-
-**3. Code Review**
-
-- Reviewers approve changes
-- All GitHub Actions checks pass
-- No merge conflicts
-
-**4. Merge to Main**
-
-```bash
-# Merge via GitHub UI (Squash & Merge recommended)
-# Or via command line:
-git checkout main
-git merge --squash feature/your-feature
-git commit -m "feat: your feature description"
-git push origin main
-```
-
-**5. Automated Release Triggers**
-
-Once merged to `main`:
-
-**Python SDK** (`.github/workflows/release.yml`):
-1. semantic-release analyzes commits since last release
-2. Determines version bump (major/minor/patch)
-3. Updates `pyproject.toml` version
-4. Generates `CHANGELOG.md` entry
-5. Creates Git tag (e.g., `v1.13.0`)
-6. Builds Python package (`uv build`)
-7. Publishes to PyPI via `twine`
-8. Creates GitHub release with notes
-
-**6. Verify Release**
-
-Check:
-- GitHub Releases page: https://github.com/ScrapeGraphAI/scrapegraph-sdk/releases
-- PyPI page: https://pypi.org/project/scrapegraph-py/
-
----
-
-## Manual Release Process
-
-Use only for emergency releases or when automation fails.
-
-### Python SDK Manual Release
-
-**1. Update Version**
-
-**File**: `scrapegraph-py/pyproject.toml`
-
-```toml
-[project]
-name = "scrapegraph_py"
-version = "1.13.0"  # Increment version
-```
-
-**2. Update Changelog**
-
-**File**: `scrapegraph-py/CHANGELOG.md`
-
-```markdown
-## [1.13.0] - 2025-01-XX
-
-### Added
-- New feature description
-
-### Fixed
-- Bug fix description
-```
-
-**3. Commit Changes**
-
-```bash
-cd scrapegraph-py
-git add pyproject.toml CHANGELOG.md
-git commit -m "chore(release): 1.13.0"
-git tag v1.13.0
-git push origin main --tags
-```
-
-**4. Build Package**
-
-```bash
-cd scrapegraph-py
-
-# Build with uv
-uv build
-
-# Or build with python
-python -m build
-```
-
-This creates files in `dist/`:
-- `scrapegraph_py-1.13.0-py3-none-any.whl`
-- `scrapegraph_py-1.13.0.tar.gz`
-
-**5. Test Package Locally**
-
-```bash
-# Install in test environment
-python -m venv test-env
-source test-env/bin/activate
-pip install dist/scrapegraph_py-1.13.0-py3-none-any.whl
-
-# Test import
-python -c "from scrapegraph_py import Client; print('Success!')"
-```
-
-**6. Publish to PyPI**
-
-```bash
-# Install twine
-pip install twine
-
-# Check package
-twine check dist/*
-
-# Upload to PyPI
-twine upload dist/*
-
-# Enter PyPI credentials when prompted
-```
-
-**7. Create GitHub Release**
-
-1. Go to https://github.com/ScrapeGraphAI/scrapegraph-sdk/releases/new
-2. Tag: `v1.13.0`
-3. Title: `Python SDK v1.13.0`
-4. Description: Copy from CHANGELOG.md
-5. Attach: `dist/scrapegraph_py-1.13.0.tar.gz`
-6. Publish release
-
----
-
-## Release Checklist
-
-### Pre-Release
-
-- [ ] All tests passing locally
-- [ ] Code formatted and linted
-- [ ] Documentation updated
-- [ ] Examples updated
-- [ ] CHANGELOG.md entries added (if manual)
-- [ ] Version number bumped (if manual)
-- [ ] Breaking changes documented
-
-### Python SDK Release
-
-- [ ] `pyproject.toml` version updated
-- [ ] Tests pass: `pytest tests/ -v`
-- [ ] Linting pass: `make lint`
-- [ ] Type check pass: `mypy scrapegraph_py/`
-- [ ] Package builds: `uv build`
-- [ ] GitHub release created
-- [ ] PyPI package published
-- [ ] Installation tested: `pip install scrapegraph-py==X.Y.Z`
-
-### Post-Release
-
-- [ ] Verify GitHub release appears
-- [ ] Verify package on PyPI
-- [ ] Test installation from registry
-- [ ] Update documentation website (if applicable)
-- [ ] Announce release (if major/minor)
-- [ ] Monitor for issues
-
----
-
-## Hotfix Process
-
-For critical bugs requiring immediate fix:
-
-**1. Create Hotfix Branch**
-
-```bash
-git checkout main
-git pull
-git checkout -b hotfix/critical-bug
-```
-
-**2. Fix Bug**
-
-Make minimal changes to fix the issue.
-
-**3. Commit with Fix Type**
-
-```bash
-git commit -m "fix: critical bug description"
-```
-
-**4. Fast-Track PR**
-
-- Create PR
-- Request immediate review
-- Merge as soon as approved
-
-**5. Release**
-
-Automated release triggers immediately, creating a **patch** version (0.0.x).
-
----
-
-## Rollback Procedure
-
-If a release has critical issues:
-
-### Option 1: Yank Package (Recommended)
-
-**PyPI:**
-
-Yanking marks a version as one that installers should skip while keeping its files available, so installs that pin the exact version keep working. `twine` has no yank command, and re-running `twine upload --skip-existing` leaves the published files unchanged. Instead, someone with the required permissions on the PyPI project yanks the version from the project's release management page.
-
-### Option 2: Publish Fix Version
-
-Often faster, and can be combined with yanking the broken version:
-
-1. Create hotfix branch
-2. Fix issue
-3. Commit: `fix: resolve critical issue from vX.Y.Z`
-4. Merge to trigger new release (patch bump)
-5. Announce fix version
-
----
-
-## Versioning Strategy
-
-### Major Version (x.0.0)
-
-Breaking changes requiring user code updates:
-- API interface changes
-- Removing deprecated features
-- Changing default behavior
-
-**Example:**
-```
-feat!: change authentication method
-
-BREAKING CHANGE: API key must now be passed explicitly to Client().
-Use Client.from_env() to load from environment variable.
-```
-
-### Minor Version (0.x.0)
-
-New features, backward compatible:
-- Adding new endpoints
-- Adding optional parameters
-- New functionality
-
-**Example:**
-```
-feat: add scheduled jobs endpoint
-
-Implement full CRUD for scheduled jobs with cron support.
-```
-
-### Patch Version (0.0.x)
-
-Bug fixes, no new features:
-- Fixing bugs
-- Performance improvements
-- Documentation fixes shipped with a `fix:` commit (a `docs:` commit alone triggers no release)
-
-**Example:**
-```
-fix: handle network timeout correctly
-
-Add retry logic for transient network failures.
-```
-
----
-
-## Troubleshooting
-
-### Issue: semantic-release not triggering
-
-**Causes:**
-- No semantic commits since last release
-- Commits use wrong format (e.g., `Fix bug` instead of `fix: bug`)
-- GitHub token permissions issue
-
-**Solutions:**
-- Check commit messages: `git log --oneline`
-- Ensure at least one feat/fix commit exists
-- Verify GitHub Actions has write permissions
-
-### Issue: PyPI upload fails
-
-**Causes:**
-- Authentication error
-- Version already exists
-- Package name conflict
-
-**Solutions:**
-- Verify `PYPI_API_TOKEN` secret in GitHub
-- Check if version already on PyPI
-- Try manual upload with `twine upload dist/*`
-
-### Issue: Tests failing in CI
-
-**Causes:**
-- Environment differences
-- Missing dependencies
-- Network issues
-
-**Solutions:**
-- Run tests locally first
-- Check CI logs for specific errors
-- Verify all dependencies in `pyproject.toml`
-
----
-
-## Semantic Release Configuration
-
-### Python SDK
-
-**File**: `.releaserc.yml` (root)
-
-```yaml
-branches:
-  - main
-plugins:
-  - "@semantic-release/commit-analyzer"
-  - "@semantic-release/release-notes-generator"
-  - "@semantic-release/changelog"
-  - ["semantic-release-pypi", { pkgdir: "scrapegraph-py" }]
-  - "@semantic-release/github"
-  - ["@semantic-release/git", {
-      "assets": ["scrapegraph-py/CHANGELOG.md", "scrapegraph-py/pyproject.toml"],
-      "message": "chore(release): ${nextRelease.version} [skip ci]\n\n${nextRelease.notes}"
-    }]
-```
-
----
-
-## Best Practices
-
-1. **Use Descriptive Commits**: Help users understand what changed
-2. **Test Before Merge**: Run full test suite locally
-3. **Update Examples**: Ensure examples work with new features
-4. **Document Breaking Changes**: Clearly explain migration path
-5. **Monitor After Release**: Watch for issues in first 24 hours
-6. **Communicate Major Changes**: Announce breaking changes in advance
-
----
-
-**For questions, refer to [.agent/README.md](../README.md) or create a GitHub issue.**
diff --git a/.agent/system/project_architecture.md b/.agent/system/project_architecture.md
deleted file mode 100644
index b89b9a4d..00000000
--- a/.agent/system/project_architecture.md
+++ /dev/null
@@ -1,590 +0,0 @@
-# ScrapeGraphAI SDK - Project Architecture
-
-**Last Updated:** January 2025
-**Version:** Python SDK 1.12.2
-
-## Table of Contents
-- [System Overview](#system-overview)
-- [Repository Structure](#repository-structure)
-- [Python SDK Architecture](#python-sdk-architecture)
-- [API Endpoint Coverage](#api-endpoint-coverage)
-- [Authentication & Configuration](#authentication--configuration)
-- [Testing Strategy](#testing-strategy)
-- [Release & Publishing](#release--publishing)
-- [External Dependencies](#external-dependencies)
-
----
-
-## System Overview
-
-The **scrapegraph-sdk** repository contains the official Python client SDK for the ScrapeGraph AI API. The SDK provides comprehensive functionality for intelligent web scraping powered by AI, with an implementation tailored to the Python ecosystem.
-
-**Key Features:**
-- ✅ **Complete API Coverage**: All 10 ScrapeGraph AI endpoints supported
-- ✅ **Sync & Async**: Python supports both sync and async clients
-- ✅ **Type Safety**: Pydantic models for data validation
-- ✅ **Automated Releases**: Semantic versioning with semantic-release
-- ✅ **Comprehensive Testing**: Unit and integration tests
-- ✅ **Production Ready**: Used by thousands of developers worldwide
-
----
-
-## Repository Structure
-
-```
-scrapegraph-sdk/
-├── scrapegraph-py/           # Python SDK (PyPI: scrapegraph-py)
-│   ├── scrapegraph_py/       # Source code
-│   ├── tests/                # Pytest tests
-│   ├── examples/             # Usage examples
-│   ├── docs/                 # MkDocs documentation
-│   ├── pyproject.toml        # Package metadata & dependencies
-│   ├── Makefile              # Development tasks
-│   └── README.md             # Python SDK documentation
-│
-├── cookbook/                 # Cross-language usage examples
-├── .github/workflows/        # GitHub Actions CI/CD
-│   ├── release.yml           # Python release automation
-│   └── tests.yml             # Test automation
-│
-├── .agent/                   # Documentation hub
-│   ├── README.md             # Documentation index
-│   ├── system/               # Architecture documentation
-│   ├── tasks/                # Feature PRDs
-│   └── sop/                  # Standard operating procedures
-│
-├── package.json              # Root package.json (semantic-release)
-├── CLAUDE.md                 # Claude Code instructions
-└── README.md                 # Main repository documentation
-```
-
----
-
-## Python SDK Architecture
-
-### Technology Stack
-
-**Core:**
-- **Python**: 3.10+ (3.11+ recommended)
-- **Package Manager**: uv (recommended) or pip
-- **Build System**: hatchling 1.26.3
-
-**Dependencies:**
-- **requests** 2.32.3+ - HTTP client for sync operations
-- **aiohttp** 3.10+ - Async HTTP client
-- **pydantic** 2.10.2+ - Data validation and modeling
-- **python-dotenv** 1.0.1+ - Environment variable management
-
-**Optional Dependencies:**
-- **beautifulsoup4** 4.12.3+ - HTML parsing (for HTML validation when using `website_html`)
-  - Install with: `pip install scrapegraph-py[html]`
-- **langchain** 0.3.0+ - Langchain integration for AI workflows
-- **langchain-community** 0.2.11+ - Community integrations for Langchain
-- **langchain-scrapegraph** 0.1.0+ - ScrapeGraph integration for Langchain
-  - Install with: `pip install scrapegraph-py[langchain]`
-
-**Development Tools:**
-- **pytest** 7.4.0+ - Testing framework
-- **pytest-asyncio** 0.23.8+ - Async test support
-- **pytest-mock** 3.14.0 - Mocking support
-- **pytest-cov** 6.0.0+ - Coverage reporting
-- **aioresponses** 0.7.7+ - Async HTTP mocking
-- **responses** 0.25.3+ - Sync HTTP mocking
-- **black** 24.10.0+ - Code formatting
-- **isort** 5.13.2+ - Import sorting
-- **ruff** 0.8.0+ - Fast linting
-- **mypy** 1.13.0+ - Type checking
-- **pre-commit** 4.0.1+ - Git hooks
-
-**Documentation:**
-- **mkdocs** 1.6.1+ - Documentation generator
-- **mkdocs-material** 9.5.46+ - Material theme
-- **mkdocstrings-python** 1.12.2+ - Python API docs
-
-### Project Structure
-
-```
-scrapegraph-py/
-├── scrapegraph_py/
-│   ├── __init__.py           # Package exports (Client, AsyncClient)
-│   ├── client.py             # Synchronous client implementation
-│   ├── async_client.py       # Asynchronous client implementation
-│   ├── config.py             # Configuration constants
-│   ├── logger.py             # Logging configuration
-│   ├── exceptions.py         # Custom exception classes
-│   │
-│   ├── models/               # Pydantic request/response models
-│   │   ├── __init__.py       # Model exports
-│   │   ├── smartscraper.py   # SmartScraper models
-│   │   ├── searchscraper.py  # SearchScraper models
-│   │   ├── crawl.py          # Crawler models
-│   │   ├── markdownify.py    # Markdownify models
-│   │   ├── scrape.py         # Scrape models
-│   │   ├── agenticscraper.py # AgenticScraper models
-│   │   ├── scheduled_jobs.py # Scheduled Jobs models
-│   │   ├── schema.py         # Schema generation models
-│   │   └── feedback.py       # Feedback models
-│   │
-│   └── utils/                # Utility functions
-│       └── helpers.py        # HTTP helpers, validation
-│
-├── tests/                    # Test suite
-│   ├── conftest.py           # Pytest fixtures
-│   ├── test_client.py        # Sync client tests
-│   ├── test_async_client.py  # Async client tests
-│   ├── test_smartscraper.py  # SmartScraper endpoint tests
-│   ├── test_searchscraper.py # SearchScraper tests
-│   ├── test_crawler.py       # Crawler tests
-│   └── ...                   # Other endpoint tests
-│
-├── examples/                 # Usage examples
-│   ├── basic_usage.py
-│   ├── async_usage.py
-│   ├── with_schema.py
-│   ├── pagination.py
-│   └── ...
-│
-├── docs/                     # MkDocs documentation
-│   ├── index.md
-│   ├── api/                  # Auto-generated API docs
-│   └── mkdocs.yml            # MkDocs configuration
-│
-├── pyproject.toml            # Package metadata & tool configs
-├── pytest.ini                # Pytest configuration
-├── Makefile                  # Development tasks
-├── .releaserc.yml            # Semantic-release config
-├── .pre-commit-config.yaml   # Pre-commit hooks
-└── README.md                 # Python SDK documentation
-```
-
-### Client Architecture
-
-**Dual Client Design:**
-
-The Python SDK implements two client classes with identical APIs:
-
-1. **`Client`** (`client.py`) - Synchronous client
-   - Uses `requests` library
-   - Blocking operations
-   - Simpler for scripts and synchronous applications
-
-2. **`AsyncClient`** (`async_client.py`) - Asynchronous client
-   - Uses `aiohttp` library
-   - Non-blocking operations
-   - Context manager support (`async with`)
-   - Ideal for concurrent operations and async frameworks
-
-Both clients share the same method signatures and return types, making it easy to switch between sync and async implementations.
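One place where the async variant pays off is fanning out several requests at once. The following is a minimal sketch rather than SDK code: `StubAsyncClient` and `scrape_many` are hypothetical names, and the stub only mimics the `async with` support and the `smartscraper(website_url=..., user_prompt=...)` call shape described in this document so that the example runs without network access or an API key.

```python
import asyncio
from typing import Any, Dict, List


class StubAsyncClient:
    """Hypothetical stand-in that mimics the async client's shape offline."""

    async def __aenter__(self) -> "StubAsyncClient":
        return self

    async def __aexit__(self, *exc_info: Any) -> None:
        return None

    async def smartscraper(self, website_url: str, user_prompt: str) -> Dict[str, str]:
        await asyncio.sleep(0.01)  # simulate waiting on the API
        return {"website_url": website_url, "user_prompt": user_prompt}


async def scrape_many(client: Any, urls: List[str], prompt: str) -> List[Dict[str, str]]:
    """Issue one request per URL concurrently; results keep the order of `urls`."""
    calls = [client.smartscraper(website_url=url, user_prompt=prompt) for url in urls]
    return await asyncio.gather(*calls)


async def main() -> List[Dict[str, str]]:
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    async with StubAsyncClient() as client:
        return await scrape_many(client, urls, "Extract title")


results = asyncio.run(main())
```

If the real `AsyncClient` behaves as described, swapping it in for the stub should only require changing the `async with` line; that substitution is an assumption and has not been checked against the SDK.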
-
-**Client Features:**
-- **Environment Variable Support**: Auto-loads `SGAI_API_KEY`
-- **SSL Verification**: Configurable SSL cert verification
-- **Timeouts**: Configurable request timeouts
-- **Retries**: Built-in retry logic with exponential backoff
-- **Logging**: Detailed debug logs with colored output
-- **Error Handling**: Custom exceptions with detailed error messages
-
-**Example Usage:**
-
-```python
-# Synchronous Client
-from scrapegraph_py import Client
-
-client = Client(api_key="your-key")
-response = client.smartscraper(
-    website_url="https://example.com",
-    user_prompt="Extract title"
-)
-
-# Asynchronous Client
-from scrapegraph_py import AsyncClient
-import asyncio
-
-async def main():
-    async with AsyncClient(api_key="your-key") as client:
-        response = await client.smartscraper(
-            website_url="https://example.com",
-            user_prompt="Extract title"
-        )
-
-asyncio.run(main())
-```
-
-### Pydantic Models
-
-All API requests and responses are modeled using **Pydantic v2**, providing:
-- **Type Validation**: Automatic validation of input/output data
-- **Serialization**: JSON encoding/decoding
-- **IDE Support**: Type hints for autocomplete
-- **Documentation**: Schema generation for API docs
-
-Models are organized by endpoint in `scrapegraph_py/models/`:
-
-**Example Model:**
-```python
-from pydantic import BaseModel, Field
-from typing import Optional, Dict, Any
-
-class SmartScraperRequest(BaseModel):
-    website_url: Optional[str] = Field(None, description="URL to scrape")
-    website_html: Optional[str] = Field(None, description="HTML content to scrape")
-    user_prompt: str = Field(..., description="Natural language prompt")
-    output_schema: Optional[Dict[str, Any]] = Field(None, description="Output schema")
-    # ... more fields
-```
-
-### Configuration
-
-**Constants** (`config.py`):
-```python
-API_BASE_URL = "https://api.scrapegraphai.com"
-DEFAULT_HEADERS = {
-    "Content-Type": "application/json",
-    "User-Agent": "scrapegraph-py/1.12.2"
-}
-```
-
-**Environment Variables:**
-- `SGAI_API_KEY` - API key for authentication
-
-### Logging
-
-Configurable logging with colored output (`logger.py`):
-- **DEBUG**: Detailed request/response logs
-- **INFO**: Key operations and progress
-- **WARNING**: Deprecation warnings
-- **ERROR**: Error messages
-
-### Testing
-
-**Test Structure:**
-- **Unit Tests**: Test individual functions and models
-- **Integration Tests**: Test full request/response cycles with mocked HTTP
-- **Fixtures**: Reusable test data in `conftest.py`
-
-**Mocking Strategy:**
-- **Sync tests**: Use `responses` library to mock `requests`
-- **Async tests**: Use `aioresponses` to mock `aiohttp`
-
-**Running Tests:**
-```bash
-# All tests
-pytest tests/ -v
-
-# With coverage
-pytest --cov=scrapegraph_py --cov-report=html tests/
-
-# Specific test
-pytest tests/test_smartscraper.py -v
-```
-
----
-
-## API Endpoint Coverage
-
-The SDK supports all ScrapeGraph AI API endpoints:
-
-### 1. SmartScraper
-**Purpose**: AI-powered web scraping with schema extraction
-
-```python
-response = client.smartscraper(
-    website_url="https://example.com",
-    user_prompt="Extract product details",
-    output_schema=ProductSchema  # Optional Pydantic model
-)
-```
-
-**Features:**
-- URL or HTML input
-- Natural language prompts
-- Optional output schema for structured data
-- Pagination support
-- Cookie support
-- Infinite scrolling
-- Heavy JS rendering
-
-### 2. SearchScraper
-**Purpose**: Multi-website search and extraction
-
-```python
-response = client.searchscraper(
-    user_prompt="Find AI news articles",
-    num_results=5
-)
-```
-
-### 3. Crawler / SmartCrawler
-**Purpose**: Sitemap generation and multi-page crawling
-
-```python
-response = client.crawler(
-    website_url="https://example.com",
-    max_depth=3
-)
-```
-
-### 4. Markdownify
-**Purpose**: HTML to Markdown conversion
-
-```python
-response = client.markdownify(
-    website_url="https://example.com/article"
-)
-```
-
-### 5. Scrape
-**Purpose**: Basic HTML extraction
-
-```python
-response = client.scrape(
-    url="https://example.com"
-)
-```
-
-### 6. AgenticScraper
-**Purpose**: Browser automation with AI-powered actions
-
-```python
-response = client.agentic_scraper(
-    website_url="https://example.com",
-    steps=["click button", "scroll down"],
-    user_prompt="Extract data"
-)
-```
-
-### 7. Scheduled Jobs
-**Purpose**: Cron-based job scheduling
-
-```python
-# Create job
-job = client.create_scheduled_job(
-    name="Daily scrape",
-    cron_expression="0 9 * * *",
-    service_type="smartscraper",
-    service_params={...}
-)
-
-# List jobs
-jobs = client.list_scheduled_jobs()
-
-# Update job
-updated = client.update_scheduled_job(job_id, ...)
-
-# Delete job
-client.delete_scheduled_job(job_id)
-```
-
-### 8. Credits
-**Purpose**: Check credit balance
-
-```python
-balance = client.get_credits()
-```
-
-### 9. Feedback
-**Purpose**: Send ratings and feedback
-
-```python
-client.send_feedback(
-    request_id="...",
-    rating=5,
-    feedback="Great results!"
-)
-```
-
-### 10. Schema Generation
-**Purpose**: AI-powered schema generation from prompts
-
-```python
-schema = client.generate_schema(
-    user_prompt="Generate schema for product data",
-    website_url="https://example.com"
-)
-```
-
----
-
-## Authentication & Configuration
-
-### Python SDK
-
-**Method 1: Environment Variable**
-```python
-import os
-os.environ['SGAI_API_KEY'] = 'sgai-...'
- -from scrapegraph_py import Client -client = Client() # Auto-loads from env -``` - -**Method 2: Direct Initialization** -```python -from scrapegraph_py import Client -client = Client(api_key='sgai-...') -``` - -**Method 3: From Environment (class method)** -```python -from scrapegraph_py import Client -client = Client.from_env() # Reads SGAI_API_KEY -``` - -**Configuration Options:** -```python -client = Client( - api_key='sgai-...', - verify_ssl=True, # SSL verification - timeout=30.0, # Request timeout (seconds) - max_retries=3, # Retry attempts - retry_delay=1.0 # Delay between retries -) -``` - ---- - -## Testing Strategy - -### Python SDK Testing - -**Test Framework**: pytest with plugins - -**Test Types:** -1. **Unit Tests**: Model validation, utility functions -2. **Integration Tests**: Full request/response cycles with mocked HTTP - -**Coverage Goals**: -- Code coverage: >80% -- All endpoints covered -- Both sync and async clients tested - -**Mocking Strategy:** -```python -# Sync client tests with responses -import responses -from scrapegraph_py import Client - -@responses.activate -def test_smartscraper(): - responses.add( - responses.POST, - "https://api.scrapegraphai.com/v1/smartscraper", - json={"request_id": "123", "status": "completed"}, - status=200 - ) - - client = Client(api_key="test-key") - response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract title" - ) - assert response.request_id == "123" -``` - -```python -# Async client tests with aioresponses -import pytest -from aioresponses import aioresponses -from scrapegraph_py import AsyncClient - -@pytest.mark.asyncio -async def test_async_smartscraper(): - with aioresponses() as m: - m.post( - "https://api.scrapegraphai.com/v1/smartscraper", - payload={"request_id": "123", "status": "completed"} - ) - - async with AsyncClient(api_key="test-key") as client: - response = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract 
title" - ) - assert response.request_id == "123" -``` - ---- - -## Release & Publishing - -The SDK uses **semantic-release** for automated versioning and publishing. - -### Semantic Versioning Rules - -Commit messages follow [Conventional Commits](https://www.conventionalcommits.org/): - -- `feat: add new feature` → **Minor** version bump (0.x.0) -- `fix: bug fix` → **Patch** version bump (0.0.x) -- `feat!: breaking change` or `BREAKING CHANGE:` in body → **Major** bump (x.0.0) -- `docs:`, `chore:`, `style:`, `refactor:`, `test:` → No version bump - -### Release Workflow - -1. Merge PR to `main` branch -2. GitHub Actions workflow triggers (`release.yml`) -3. semantic-release analyzes commits -4. Version bumped in `pyproject.toml` -5. `CHANGELOG.md` updated -6. Git tag created -7. GitHub release published -8. Package published to PyPI - -### Configuration Files - -**Python** (`.releaserc.yml`): -```yaml -branches: - - main -plugins: - - "@semantic-release/commit-analyzer" - - "@semantic-release/release-notes-generator" - - "@semantic-release/changelog" - - ["semantic-release-pypi", { pkgdir: "scrapegraph-py" }] - - "@semantic-release/github" - - "@semantic-release/git" -``` - -### Manual Release (Emergency) - -```bash -cd scrapegraph-py -uv build -twine upload dist/* -``` - ---- - -## External Dependencies - -### Python SDK Dependencies - -**Core Runtime:** -- **requests**: Sync HTTP client -- **aiohttp**: Async HTTP client -- **pydantic**: Data validation -- **python-dotenv**: Environment variables - -**Optional Runtime (install with extras):** -- **beautifulsoup4**: HTML parsing (required when using `website_html`) - - Install with: `pip install scrapegraph-py[html]` -- **langchain, langchain-community, langchain-scrapegraph**: Langchain integration - - Install with: `pip install scrapegraph-py[langchain]` - -**Development:** -- **pytest & plugins**: Testing framework -- **black, isort, ruff**: Code quality -- **mypy**: Type checking -- **mkdocs**:
Documentation - -### API Dependencies - -The SDK depends on the ScrapeGraph AI API: -- **Base URL**: `https://api.scrapegraphai.com` -- **Authentication**: API key via `SGAI-APIKEY` header -- **API Version**: v1 -- **Rate Limits**: Based on plan (see dashboard) - ---- - -**For detailed usage examples, see the [cookbook](../../cookbook/) directory.** -**For contributing guidelines, see [.agent/README.md](../README.md).** diff --git a/.agent/tasks/example_feature_template.md b/.agent/tasks/example_feature_template.md deleted file mode 100644 index b763bbe8..00000000 --- a/.agent/tasks/example_feature_template.md +++ /dev/null @@ -1,285 +0,0 @@ -# Feature: [Feature Name] - -**Status**: Draft / In Progress / Completed -**Created**: YYYY-MM-DD -**Last Updated**: YYYY-MM-DD -**Owner**: [Developer Name] - ---- - -## Overview - -Brief description of the feature and its purpose. - -## Problem Statement - -What problem does this feature solve? Why is it needed? - -## Goals - -- Goal 1 -- Goal 2 -- Goal 3 - -## Non-Goals - -- What this feature will NOT do -- Out of scope items - ---- - -## Requirements - -### Functional Requirements - -1. **Requirement 1**: Description -2. **Requirement 2**: Description -3. 
**Requirement 3**: Description - -### Non-Functional Requirements - -- **Performance**: Expected performance characteristics -- **Reliability**: Uptime, error handling requirements -- **Security**: Authentication, authorization requirements -- **Usability**: User experience considerations - ---- - -## Technical Design - -### Python SDK Changes - -**New Files:** -- `scrapegraph_py/models/new_feature.py` - Pydantic models -- `tests/test_new_feature.py` - Test suite -- `examples/new_feature_example.py` - Usage example - -**Modified Files:** -- `scrapegraph_py/client.py` - Add new methods -- `scrapegraph_py/async_client.py` - Add async methods -- `scrapegraph_py/models/__init__.py` - Export new models -- `README.md` - Update documentation - -**API Changes:** -```python -# New client methods -def new_feature( - self, - param1: str, - param2: Optional[int] = None -) -> NewFeatureResponse: - """Description of new feature.""" - pass -``` - ---- - -## Implementation Plan - -### Phase 1: Foundation (Week 1) - -- [ ] Create Pydantic models for Python SDK -- [ ] Write unit tests (TDD approach) -- [ ] Set up CI test coverage - -### Phase 2: Implementation (Week 2) - -- [ ] Implement Python sync client methods -- [ ] Implement Python async client methods -- [ ] Add error handling and validation - -### Phase 3: Testing & Documentation (Week 3) - -- [ ] Write integration tests -- [ ] Create usage examples -- [ ] Update README documentation -- [ ] Add docstrings - -### Phase 4: Release (Week 4) - -- [ ] Code review -- [ ] Merge to main -- [ ] Automated release -- [ ] Verify published packages -- [ ] Announce feature - ---- - -## API Specification - -### Endpoint - -**URL**: `POST /v1/new_feature` - -**Request:** -```json -{ - "param1": "string", - "param2": 123, - "optional_param": "value" -} -``` - -**Response:** -```json -{ - "request_id": "uuid", - "status": "completed", - "result": { - "data": "processed result" - } -} -``` - -### Status Endpoint - -**URL**: `GET 
/v1/new_feature/{request_id}` - -**Response:** -```json -{ - "request_id": "uuid", - "status": "completed|pending|failed", - "result": {...} -} -``` - ---- - -## Usage Examples - -### Python - Sync Client - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-key") - -response = client.new_feature( - param1="value", - param2=123 -) - -print(response.result) -``` - -### Python - Async Client - -```python -from scrapegraph_py import AsyncClient -import asyncio - -async def main(): - async with AsyncClient(api_key="your-key") as client: - response = await client.new_feature( - param1="value", - param2=123 - ) - print(response.result) - -asyncio.run(main()) -``` - ---- - -## Testing Strategy - -### Unit Tests - -- Test Pydantic model validation -- Test parameter handling -- Test error cases -- Test edge cases - -### Integration Tests - -- Mock API responses -- Test full request/response cycle -- Test async operations -- Test error handling - -### Manual Testing - -- Test with real API -- Verify examples work -- Test documentation accuracy - ---- - -## Success Metrics - -- [ ] All automated tests pass -- [ ] Code coverage > 80% -- [ ] Documentation complete -- [ ] Examples functional -- [ ] No breaking changes (or documented migration) -- [ ] Successfully published to PyPI - ---- - -## Dependencies - -### Python SDK Dependencies - -No new dependencies required / OR: -- New dependency: `package-name>=X.Y.Z` - -### API Dependencies - -- Requires ScrapeGraph AI API version X.Y.Z or higher -- New endpoint must be live before SDK release - ---- - -## Migration Guide - -(If breaking changes) - -### Breaking Changes - -1. 
**Change 1**: Description - - **Before**: `old_api()` - - **After**: `new_api()` - - **Migration**: Code example - -### Deprecations - -- `old_method()` - Deprecated in v1.X.0, removed in v2.0.0 - - Use `new_method()` instead - ---- - -## Risks & Mitigations - -| Risk | Impact | Probability | Mitigation | -|------|--------|-------------|------------| -| API changes after SDK release | High | Low | Version API contract, use semantic versioning | -| Performance degradation | Medium | Low | Benchmark before release, optimize if needed | -| Breaking existing code | High | Medium | Thorough testing, clear migration docs | - ---- - -## Open Questions - -1. **Question 1**: Description - - **Answer**: TBD / Resolved: ... - -2. **Question 2**: Description - - **Answer**: TBD / Resolved: ... - ---- - -## References - -- [API Documentation](https://docs.scrapegraphai.com/endpoint/new-feature) -- [GitHub Issue #123](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/123) -- [Related PR #456](https://github.com/ScrapeGraphAI/scrapegraph-sdk/pull/456) - ---- - -## Changelog - -- **2025-01-XX**: Initial draft created -- **2025-01-XX**: Design review completed -- **2025-01-XX**: Implementation started -- **2025-01-XX**: Feature released in vX.Y.Z diff --git a/.github/update-requirements.yml b/.github/update-requirements.yml deleted file mode 100644 index 6ee8dc85..00000000 --- a/.github/update-requirements.yml +++ /dev/null @@ -1,26 +0,0 @@ -name: Update requirements -on: - push: - paths: - - 'scrapegraph-py/pyproject.toml' - - '.github/workflows/update-requirements.yml' - -jobs: - update: - name: Update requirements - runs-on: ubuntu-latest - steps: - - name: Install the latest version of rye - uses: eifinger/setup-rye@v3 - - name: Build app - run: rye run update-requirements - commit: - name: Commit changes - run: | - git config --global user.name 'github-actions' - git config --global user.email 'github-actions[bot]@users.noreply.github.com' - git add . 
- git commit -m "ci: update requirements.txt [skip ci]" - git push - env: - GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 00000000..d9839533 --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,29 @@ +name: CI + +on: + push: + branches: [main] + pull_request: + branches: [main] + +jobs: + lint: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: astral-sh/setup-uv@v3 + - run: uv sync --frozen + - run: uv run ruff check src/ tests/ + - run: uv run ruff format --check src/ + + test: + runs-on: ubuntu-latest + strategy: + matrix: + python: ["3.12", "3.14"] + steps: + - uses: actions/checkout@v4 + - uses: astral-sh/setup-uv@v3 + - run: uv python install ${{ matrix.python }} + - run: uv sync --python ${{ matrix.python }} + - run: uv run pytest tests/test_client.py -v diff --git a/.github/workflows/pylint.yml b/.github/workflows/pylint.yml deleted file mode 100644 index 440dfe7c..00000000 --- a/.github/workflows/pylint.yml +++ /dev/null @@ -1,31 +0,0 @@ -on: - push: - paths: - - 'scrapegraph-py/**' - - '.github/workflows/pylint.yml' - -jobs: - build: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Install uv - uses: astral-sh/setup-uv@v3 - - name: Install dependencies - run: | - cd scrapegraph-py - uv sync --frozen - - name: Analysing the code with pylint - run: | - cd scrapegraph-py - uv run poe pylint-ci - - name: Check Pylint score - run: | - cd scrapegraph-py - pylint_score=$(uv run poe pylint-score-ci | grep 'Raw metrics' | awk '{print $4}') - if (( $(echo "$pylint_score < 8" | bc -l) )); then - echo "Pylint score is below 8. Blocking commit." - exit 1 - else - echo "Pylint score is acceptable." 
- fi diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml index 3eda703b..b9b74d8e 100644 --- a/.github/workflows/python-publish.yml +++ b/.github/workflows/python-publish.yml @@ -1,7 +1,4 @@ -# This workflow will upload a Python Package using Twine when a release is created -# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries - -name: Upload Python Package +name: Publish to PyPI on: release: @@ -10,58 +7,26 @@ on: jobs: test: runs-on: ubuntu-latest - strategy: - matrix: - python-version: ["3.10", "3.11", "3.12"] - steps: - - uses: actions/checkout@v4 - - - name: Set up Python ${{ matrix.python-version }} - uses: actions/setup-python@v4 - with: - python-version: ${{ matrix.python-version }} - - - name: Cache pip dependencies - uses: actions/cache@v3 - with: - path: ~/.cache/pip - key: ${{ runner.os }}-pip-${{ matrix.python-version }}-${{ hashFiles('**/pyproject.toml') }} - restore-keys: | - ${{ runner.os }}-pip-${{ matrix.python-version }}- + - uses: actions/checkout@v4 + - uses: astral-sh/setup-uv@v3 + - run: uv sync --frozen + - run: uv run pytest tests/test_client.py -v - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install pytest pytest-asyncio responses - cd scrapegraph-py - pip install -e ".[html]" - - - name: Run mocked tests with coverage - run: | - cd scrapegraph-py - python -m pytest tests/test_mocked_apis.py -v --cov=scrapegraph_py --cov-report=xml --cov-report=term-missing - - deploy: + publish: needs: test runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Set up Python - uses: actions/setup-python@v5 - with: - python-version: '3.x' - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install setuptools wheel twine build - - name: Build and publish - env: - TWINE_USERNAME: mvincig11 - TWINE_PASSWORD: ${{ 
secrets.PYPI_PASSWORD }} - run: | - git fetch --all --tags - cd scrapegraph-py - python -m build - twine upload dist/* + - uses: actions/checkout@v4 + - uses: astral-sh/setup-uv@v3 + + - name: Build + run: uv build + + - name: Publish to PyPI + env: + TWINE_USERNAME: __token__ + TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }} + run: | + uv pip install twine + uv run twine upload dist/* diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml index 16d3b8a0..429d82c7 100644 --- a/.github/workflows/release.yml +++ b/.github/workflows/release.yml @@ -1,83 +1,40 @@ name: Release + on: push: - branches: - - main - - pre/* + branches: [main] jobs: - build: - name: Build - runs-on: ubuntu-latest - steps: - - name: Install git - run: | - sudo apt update - sudo apt install -y git - - name: Install uv - uses: astral-sh/setup-uv@v3 - - name: Install Node Env - uses: actions/setup-node@v4 - with: - node-version: 20 - - name: Checkout - uses: actions/checkout@v4.1.1 - with: - fetch-depth: 0 - persist-credentials: false - - name: Build app - run: | - cd scrapegraph-py - uv sync --frozen - uv build - id: build_cache - if: success() - - name: Cache build - uses: actions/cache@v4 - with: - path: scrapegraph-py/dist - key: ${{ runner.os }}-build-${{ hashFiles('scrapegraph-py/dist/**') }} - if: steps.build_cache.outputs.id != '' - release: - name: Release runs-on: ubuntu-latest - needs: build - environment: development - if: | - github.event_name == 'push' && github.ref == 'refs/heads/main' || - github.event_name == 'push' && github.ref == 'refs/heads/pre/beta' || - github.event_name == 'pull_request' && github.event.action == 'closed' && github.event.pull_request.merged && github.event.pull_request.base.ref == 'main' || - github.event_name == 'pull_request' && github.event.action == 'closed' && github.event.pull_request.merged && github.event.pull_request.base.ref == 'pre/beta' permissions: contents: write issues: write pull-requests: write id-token: write steps: - - 
name: Checkout repo - uses: actions/checkout@v4.1.1 + - uses: actions/checkout@v4 with: fetch-depth: 0 persist-credentials: false - - name: Install uv - uses: astral-sh/setup-uv@v3 - - name: Setup Python environment - run: | - cd ./scrapegraph-py - uv sync - - name: Restore build artifacts - uses: actions/cache@v4 + + - uses: astral-sh/setup-uv@v3 + + - uses: actions/setup-node@v4 with: - path: ./scrapegraph-py/dist - key: ${{ runner.os }}-build-${{ hashFiles('./scrapegraph-py/dist/**') }} + node-version: 20 + + - name: Build + run: | + uv sync --frozen + uv build + - name: Semantic Release - uses: cycjimmy/semantic-release-action@v4.1.0 + uses: cycjimmy/semantic-release-action@v4 env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }} with: - working_directory: ./scrapegraph-py semantic_version: 23 extra_plugins: | semantic-release-pypi@3 diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml deleted file mode 100644 index 6e19a178..00000000 --- a/.github/workflows/test.yml +++ /dev/null @@ -1,96 +0,0 @@ -name: Test Python SDK - -on: - push: - branches: [ main, master ] - pull_request: - branches: [ main, master ] - -jobs: - test: - runs-on: ubuntu-latest - - steps: - - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.12" - - - name: Cache pip dependencies - uses: actions/cache@v3 - with: - path: ~/.cache/pip - key: ${{ runner.os }}-pip-3.12-${{ hashFiles('**/pyproject.toml') }} - restore-keys: | - ${{ runner.os }}-pip-3.12- - - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install pytest pytest-asyncio responses aioresponses - cd scrapegraph-py - pip install -e ".[html]" - - - name: Run tests - run: | - cd scrapegraph-py - pytest tests/ -v --ignore=tests/test_integration_v2.py - - name: Upload coverage to Codecov - uses: codecov/codecov-action@v3 - with: - file: ./scrapegraph-py/coverage.xml - flags: unittests - name: 
codecov-umbrella - fail_ci_if_error: false - - lint: - runs-on: ubuntu-latest - - steps: - - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.11" - - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install flake8 black isort mypy - cd scrapegraph-py - pip install -e . - - - name: Run linting - run: | - cd scrapegraph-py - flake8 scrapegraph_py/ tests/ --max-line-length=120 --extend-ignore=E203,W503,E501,F401,F841 - black --check scrapegraph_py/ tests/ - isort --check-only scrapegraph_py/ tests/ - mypy scrapegraph_py/ --ignore-missing-imports - - security: - runs-on: ubuntu-latest - - steps: - - uses: actions/checkout@v4 - - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: "3.11" - - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install bandit safety - cd scrapegraph-py - pip install -e . - - - name: Run security checks - run: | - cd scrapegraph-py - bandit -r scrapegraph_py/ -f json -o bandit-report.json || true - safety check --json --output safety-report.json || true diff --git a/.gitignore b/.gitignore index 5b0609eb..8e87e541 100644 --- a/.gitignore +++ b/.gitignore @@ -1,8 +1,47 @@ .env -# Ignore .DS_Store files anywhere in the repository +.env.* +*.csv + +# OS .DS_Store **/.DS_Store -*.csv -venv/ + +# Python __pycache__/ -..bfg-report \ No newline at end of file +*.py[cod] +*$py.class +*.so +.Python +build/ +dist/ +*.egg-info/ +*.egg +.eggs/ + +# Virtual environments +venv/ +.venv/ +env/ + +# Testing +.pytest_cache/ +.coverage +htmlcov/ +.tox/ +.nox/ + +# Linting/formatting +.ruff_cache/ +.mypy_cache/ + +# IDE +.idea/ +.vscode/ +*.swp +*.swo + +# Build artifacts +*.whl + +# Misc +.bfg-report/ diff --git a/.python-version b/.python-version new file mode 100644 index 00000000..24ee5b1b --- /dev/null +++ b/.python-version @@ -0,0 +1 @@ +3.13 diff --git a/CLAUDE.md b/CLAUDE.md index de0a2c6b..55f89a56 
100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -2,368 +2,115 @@ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. -# DOCS -We keep all important docs in .agent folder and keep updating them, structure like below: -- `.agent/` - - `Tasks`: PRD & implementation plan for each feature - - `System`: Document the current state of the system (project structure, tech stack, SDK architecture, etc.) - - `SOP`: Best practices of execute certain tasks (e.g. how to add a new endpoint, how to release a version, etc.) - - `README.md`: an index of all the documentations we have so people know what & where to look for things - -We should always update `.agent` docs after we implement certain feature, to make sure it fully reflects the up-to-date information. - -Before you plan any implementation, always read the `.agent/README.md` first to get context. - -# important-instruction-reminders -Do what has been asked; nothing more, nothing less. -NEVER create files unless they're absolutely necessary for achieving your goal. -ALWAYS prefer editing an existing file to creating a new one. -NEVER proactively create documentation files (*.md) or README files. Only create documentation files if explicitly requested by the User. - ## Project Overview -**scrapegraph-sdk** is a repository containing the official Python SDK for the ScrapeGraph AI API. It provides a Python client for intelligent web scraping powered by AI. +**scrapegraph-py** is the official Python SDK for the ScrapeGraph AI API. It provides a Python client for intelligent web scraping powered by AI. ## Repository Structure ``` -scrapegraph-sdk/ -├── scrapegraph-py/ # Python SDK -├── cookbook/ # Usage examples and tutorials -├── .github/workflows/ # GitHub Actions for CI/CD -├── .agent/ # Documentation hub (read this!)
-├── package.json # Root package config (semantic-release) -└── README.md # Main repository README +scrapegraph-py/ +├── scrapegraph_py/ # Python SDK source +├── tests/ # Test suite +├── examples/ # Usage examples +├── docs/ # MkDocs documentation +├── cookbook/ # Tutorials and recipes +└── .github/workflows/ # CI/CD ``` ## Tech Stack -### Python SDK - **Language**: Python 3.10+ - **Package Manager**: uv (recommended) or pip - **Core Dependencies**: requests, pydantic, python-dotenv, aiohttp -- **Optional Dependencies**: - - `html`: beautifulsoup4 (for HTML validation when using `website_html`) - - `langchain`: langchain, langchain-community, langchain-scrapegraph (for Langchain integrations) - **Testing**: pytest, pytest-asyncio, pytest-mock, aioresponses -- **Code Quality**: black, isort, ruff, mypy, pre-commit -- **Documentation**: mkdocs, mkdocs-material +- **Code Quality**: ruff - **Build**: hatchling - **Release**: semantic-release -## Common Development Commands - -### Python SDK +## Commands ```bash -# Navigate to Python SDK -cd scrapegraph-py - -# Install dependencies (recommended - using uv) -pip install uv +# Install uv sync -# Install dependencies (alternative - using pip) -pip install -e .
- -# Install pre-commit hooks -uv run pre-commit install -# or: pre-commit install - -# Run all tests +# Test uv run pytest tests/ -v -# or: pytest tests/ -v -# Run specific test -uv run pytest tests/test_smartscraper.py -v +# Format & lint +uv run ruff format src tests +uv run ruff check src tests --fix -# Run tests with coverage -uv run pytest --cov=scrapegraph_py --cov-report=html tests/ - -# Format code -uv run black scrapegraph_py tests -# or: make format - -# Sort imports -uv run isort scrapegraph_py tests - -# Lint code -uv run ruff check scrapegraph_py tests -# or: make lint - -# Type check -uv run mypy scrapegraph_py -# or: make type-check - -# Build documentation -uv run mkdocs build -# or: make docs +# Build +uv build +``` -# Serve documentation locally -uv run mkdocs serve -# or: make serve-docs +## Before completing any task -# Run all checks (lint + type-check + test + docs) -make all +Always run these commands before committing or saying a task is done: -# Build package +```bash +uv run ruff format src tests +uv run ruff check src tests --fix uv build -# or: make build - -# Clean build artifacts -make clean +uv run pytest tests/ -v ``` -## Project Architecture +No exceptions. -### Python SDK (`scrapegraph-py/`) +## Architecture **Core Components:** -1. **Client Classes** (`scrapegraph_py/`): - - `client.py` - Synchronous client with all endpoint methods - - `async_client.py` - Asynchronous client (same interface, async/await) - - Both clients support the same API surface +1. **Clients** (`scrapegraph_py/`): + - `client.py` - Sync client + - `async_client.py` - Async client 2. 
**Models** (`scrapegraph_py/models/`): - Pydantic models for request/response validation - - `smartscraper.py` - SmartScraper request/response schemas - - `searchscraper.py` - SearchScraper schemas - - `crawl.py` - Crawler schemas - - `markdownify.py` - Markdownify schemas - - `agenticscraper.py` - AgenticScraper schemas - - `scrape.py` - Scrape schemas - - `scheduled_jobs.py` - Scheduled Jobs schemas - - `schema.py` - Schema generation models - - `feedback.py` - Feedback models - -3. **Utilities** (`scrapegraph_py/`): - - `config.py` - Configuration constants (API base URL, timeouts) - - `logger.py` - Logging configuration with colored output - - `exceptions.py` - Custom exception classes - - `utils/` - Helper functions - -4. **Testing** (`tests/`): - - `test_client.py` - Sync client tests - - `test_async_client.py` - Async client tests - - Individual endpoint tests - - Uses pytest with mocking (aioresponses, responses) - -5. **Documentation** (`docs/`): - - MkDocs-based documentation - - Auto-generated API reference from docstrings - -**Key Patterns:** -- **Dual Client Design**: Sync and async clients with identical APIs -- **Pydantic Validation**: Strong typing for all request/response data -- **Environment Variables**: Support `SGAI_API_KEY` env var for auth -- **Comprehensive Logging**: Detailed logs with configurable levels -- **Type Safety**: Full mypy strict mode compliance - -## API Coverage - -The SDK supports all ScrapeGraph AI API endpoints: - -| Endpoint | Python Method | Purpose | -|----------|---------------|---------| -| SmartScraper | `client.smartscraper()` | AI data extraction | -| SearchScraper | `client.searchscraper()` | Multi-URL search | -| Markdownify | `client.markdownify()` | HTML to Markdown | -| Crawler | `client.crawler()` | Sitemap & crawling | -| AgenticScraper | `client.agentic_scraper()` | Browser automation | -| Scrape | `client.scrape()` | Basic HTML fetch | -| Scheduled Jobs | `client.create_scheduled_job()`, etc. 
| Cron scheduling | -| Credits | `client.get_credits()` | Balance check | -| Feedback | `client.send_feedback()` | Rating/feedback | -| Schema Gen | `client.generate_schema()` | AI schema creation | - -## Development Workflow - -### Adding a New Endpoint - -1. Add request/response models in `scrapegraph_py/models/new_endpoint.py` -2. Add sync method to `scrapegraph_py/client.py` -3. Add async method to `scrapegraph_py/async_client.py` -4. Export models in `scrapegraph_py/models/__init__.py` -5. Add tests in `tests/test_new_endpoint.py` -6. Update examples in `examples/` -7. Update README.md with usage examples - -### Testing Strategy - -- Unit tests for models (Pydantic validation) -- Integration tests for client methods (mocked HTTP) -- Use `aioresponses` for async client testing -- Use `responses` for sync client testing -- Mock API responses to avoid real API calls in CI -- Run `pytest --cov` for coverage reports - -### Release Process - -The SDK uses **semantic-release** for automated versioning: - -1. **Commit with semantic messages:** - - `feat: add new endpoint` → Minor bump (0.x.0) - - `fix: handle timeout errors` → Patch bump (0.0.x) - - `feat!: breaking API change` → Major bump (x.0.0) - -2. **Merge to main branch** - -3.
**GitHub Actions automatically:** - - Determines version bump - - Updates version in `pyproject.toml` - - Generates CHANGELOG.md - - Creates GitHub release - - Publishes to PyPI - -Configuration files: -- Python: `.releaserc.yml` -- GitHub workflow: `.github/workflows/release.yml` - -## Important Conventions - -### Python SDK - -- **Code Style**: - - Black formatting (line-length: 88) - - isort for imports (Black profile) - - Ruff for linting - - mypy strict mode for type checking - -- **Type Hints**: - - All functions have type annotations - - Use Pydantic models for complex data - - Use `Optional[T]` for nullable values - -- **Docstrings**: - - Google-style docstrings - - Document all public methods - - Include examples in docstrings - -- **Testing**: - - Pytest for all tests - - Mock external HTTP calls - - Aim for >80% coverage -## Environment Variables +3. **Config** (`scrapegraph_py/`): + - `config.py` - API base URL, timeouts + - `exceptions.py` - Custom exceptions -The SDK supports API key via environment variable: +## API Endpoints -- **Python**: `SGAI_API_KEY` +| Endpoint | Method | Purpose | +|----------|--------|---------| +| SmartScraper | `smartscraper()` | AI data extraction | +| SearchScraper | `searchscraper()` | Multi-URL search | +| Markdownify | `markdownify()` | HTML to Markdown | +| Crawler | `crawler()` | Sitemap & crawling | +| AgenticScraper | `agentic_scraper()` | Browser automation | +| Scrape | `scrape()` | Basic HTML fetch | +| Credits | `get_credits()` | Balance check | -Usage: -```bash -export SGAI_API_KEY="your-api-key-here" -``` +## Adding New Endpoint -Then initialize client without passing API key: -```python -from scrapegraph_py import Client -client = Client() # Uses SGAI_API_KEY env var -``` +1. Add models in `scrapegraph_py/models/` +2. Add sync method to `client.py` +3. Add async method to `async_client.py` +4. Export in `models/__init__.py` +5. 
Add tests in `tests/` + +## Environment Variables -## Common Patterns +- `SGAI_API_KEY` - API key for authentication -### Using Sync Client +## Usage ```python from scrapegraph_py import Client client = Client(api_key="your-key") - response = client.smartscraper( website_url="https://example.com", - user_prompt="Extract title and description" + user_prompt="Extract title" ) - print(response.result) ``` -### Using Async Client - -```python -from scrapegraph_py import AsyncClient -import asyncio - -async def main(): - async with AsyncClient(api_key="your-key") as client: - response = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract title" - ) - print(response.result) - -asyncio.run(main()) -``` - -### Using Output Schema - -```python -from pydantic import BaseModel, Field -from scrapegraph_py import Client - -class Article(BaseModel): - title: str = Field(description="Article title") - author: str = Field(description="Author name") - -client = Client(api_key="your-key") -response = client.smartscraper( - website_url="https://news.example.com", - user_prompt="Extract article data", - output_schema=Article -) -``` - -## File Locations Reference - -### Python SDK Key Files - -- Entry points: `scrapegraph_py/__init__.py`, `scrapegraph_py/client.py`, `scrapegraph_py/async_client.py` -- Models: `scrapegraph_py/models/` -- Config: `pyproject.toml`, `pytest.ini`, `Makefile` -- Tests: `tests/` -- Examples: `examples/` -- Docs: `docs/` (MkDocs) - -### Root Level - -- Monorepo config: `package.json` (semantic-release) -- Documentation: `.agent/README.md` (read this!) 
-- Examples: `cookbook/` -- CI/CD: `.github/workflows/` - -## Debugging - -### Python SDK Debug Mode - -Enable detailed logging: -```python -import logging -from scrapegraph_py import Client - -logging.basicConfig(level=logging.DEBUG) -client = Client(api_key="your-key") -``` - -## Cookbook - -The `cookbook/` directory contains practical examples: -- Authentication patterns -- Error handling -- Pagination -- Scheduled jobs -- Advanced features - -Refer to cookbook for real-world usage patterns. - -## External Documentation - -- [ScrapeGraph AI API Documentation](https://docs.scrapegraphai.com) -- [Python SDK on PyPI](https://pypi.org/project/scrapegraph-py/) -- [GitHub Repository](https://github.com/ScrapeGraphAI/scrapegraph-sdk) - -## Support +## Links -- Email: support@scrapegraphai.com -- GitHub Issues: https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues -- Documentation: https://docs.scrapegraphai.com +- [API Docs](https://docs.scrapegraphai.com) +- [PyPI](https://pypi.org/project/scrapegraph-py/) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..263cef41 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,55 @@ +# Contributing to scrapegraph-py + +## Setup + +```bash +uv sync +``` + +## Development + +```bash +uv build # build to dist/ +uv run ruff check . # lint +``` + +## Before committing + +Run all checks: + +```bash +uv run ruff format src/ tests/ # auto-fix formatting +uv run ruff check src/ tests/ # check for errors +uv run pytest tests/test_client.py -v # unit tests +``` + +## Testing + +```bash +uv run pytest tests/test_client.py -v # unit tests only +uv run pytest tests/test_integration.py -v # live API tests (requires SGAI_API_KEY) +``` + +For integration tests, set `SGAI_API_KEY` in your environment or `.env` file. 
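A minimal sketch of how a test helper might resolve `SGAI_API_KEY`, checking the environment first and a `.env` file second. The function name `load_api_key`, its parameters, and the simple `KEY=value` parsing are illustrative assumptions, not part of the SDK or its test suite; the diff does not show how the project's own tests read `.env`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_api_key(
    env: Optional[Mapping[str, str]] = None,
    dotenv_path: str = ".env",
) -> Optional[str]:
    """Return SGAI_API_KEY from the environment, else from a .env file, else None.

    Hypothetical helper for illustration only; not part of scrapegraph_py.
    """
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    key = env.get("SGAI_API_KEY")
    if key:
        return key

    path = Path(dotenv_path)
    if not path.is_file():
        return None

    # Very small KEY=value parser: skips blank lines and comments.
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == "SGAI_API_KEY":
            # Drop optional surrounding quotes; treat an empty value as missing.
            return value.strip().strip("'\"") or None
    return None
```

In an integration test module, a `None` result could be used to skip the live API tests, for example with `pytest.mark.skipif(load_api_key() is None, reason="SGAI_API_KEY not set")`.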
+ +## Commit messages + +Use conventional commits: + +- `feat:` new feature +- `fix:` bug fix +- `refactor:` code change (no new feature, no bug fix) +- `chore:` maintenance (deps, config) +- `test:` add/update tests +- `docs:` documentation + +## Pull requests + +1. Fork and create a branch from `main` +2. Make changes +3. Run all checks (see above) +4. Submit PR to `main` + +## License + +MIT - contributions are licensed under the same license. diff --git a/HEALTHCHECK.md b/HEALTHCHECK.md deleted file mode 100644 index dfb403f7..00000000 --- a/HEALTHCHECK.md +++ /dev/null @@ -1,367 +0,0 @@ -# Health Check Endpoint - -## Overview - -The health check endpoint (`/healthz`) has been added to the Python SDK to facilitate production monitoring and service health checks. This endpoint provides a quick way to verify that the ScrapeGraphAI API service is operational and ready to handle requests. - -**Related:** [GitHub Issue #62](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/62) - -## Use Cases - -- **Production Monitoring**: Regular health checks for alerting and monitoring systems -- **Container Health Checks**: Kubernetes liveness/readiness probes, Docker HEALTHCHECK -- **Load Balancer Health Checks**: Ensure service availability before routing traffic -- **Integration with Monitoring Tools**: Prometheus, Datadog, New Relic, etc. 
-- **Pre-request Validation**: Verify service is available before making API calls -- **Service Discovery**: Health status for service mesh and discovery systems - -## Python SDK - -### Installation - -The health check endpoint is available in the latest version of the Python SDK: - -```bash -pip install scrapegraph-py -``` - -### Usage - -#### Synchronous Client - -```python -from scrapegraph_py import Client - -# Initialize client -client = Client.from_env() - -# Check health status -health = client.healthz() -print(health) -# {'status': 'healthy', 'message': 'Service is operational'} - -# Clean up -client.close() -``` - -#### Asynchronous Client - -```python -import asyncio -from scrapegraph_py import AsyncClient - -async def check_health(): - async with AsyncClient.from_env() as client: - health = await client.healthz() - print(health) - # {'status': 'healthy', 'message': 'Service is operational'} - -asyncio.run(check_health()) -``` - -### API Reference - -#### `Client.healthz()` - -Check the health status of the ScrapeGraphAI API service. - -**Returns:** -- `dict`: Health status information with at least the following fields: - - `status` (str): Health status (e.g., 'healthy', 'unhealthy', 'degraded') - - `message` (str): Human-readable status message - -**Raises:** -- `APIError`: If the API returns an error response -- `ConnectionError`: If unable to connect to the API - -#### `AsyncClient.healthz()` - -Asynchronous version of the health check method. 
- -**Returns:** -- `dict`: Health status information (same structure as sync version) - -**Raises:** -- Same exceptions as synchronous version - -### Examples - -#### Basic Health Check with Error Handling - -```python -from scrapegraph_py import Client - -client = Client.from_env() - -try: - health = client.healthz() - - if health.get('status') == 'healthy': - print("โœ“ Service is operational") - else: - print(f"โš  Service status: {health.get('status')}") - -except Exception as e: - print(f"โœ— Health check failed: {e}") - -finally: - client.close() -``` - -#### Integration with FastAPI - -```python -from fastapi import FastAPI, HTTPException -from scrapegraph_py import AsyncClient - -app = FastAPI() - -@app.get("/health") -async def health_check(): - """Health check endpoint for load balancers""" - try: - async with AsyncClient.from_env() as client: - health = await client.healthz() - - if health.get('status') == 'healthy': - return { - "status": "healthy", - "scrape_graph_api": "operational" - } - else: - raise HTTPException( - status_code=503, - detail="ScrapeGraphAI API is unhealthy" - ) - except Exception as e: - raise HTTPException( - status_code=503, - detail=f"Health check failed: {str(e)}" - ) -``` - -#### Kubernetes Liveness Probe Script - -```python -#!/usr/bin/env python3 -""" -Kubernetes liveness probe script for ScrapeGraphAI -Returns exit code 0 if healthy, 1 if unhealthy -""" -import sys -from scrapegraph_py import Client - -def main(): - try: - client = Client.from_env() - health = client.healthz() - client.close() - - if health.get('status') == 'healthy': - sys.exit(0) - else: - sys.exit(1) - except Exception: - sys.exit(1) - -if __name__ == "__main__": - main() -``` - -### Mock Mode Support - -The health check endpoint supports mock mode for testing: - -```python -from scrapegraph_py import Client - -# Enable mock mode -client = Client( - api_key="sgai-00000000-0000-0000-0000-000000000000", - mock=True -) - -health = client.healthz() 
-print(health) -# {'status': 'healthy', 'message': 'Service is operational'} -``` - -**Custom Mock Responses:** - -```python -from scrapegraph_py import Client - -custom_response = { - "status": "degraded", - "message": "Custom mock response", - "uptime": 12345 -} - -client = Client( - api_key="sgai-00000000-0000-0000-0000-000000000000", - mock=True, - mock_responses={"/v1/healthz": custom_response} -) - -health = client.healthz() -print(health) -# {'status': 'degraded', 'message': 'Custom mock response', 'uptime': 12345} -``` - -## Response Format - -### Success Response - -```json -{ - "status": "healthy", - "message": "Service is operational" -} -``` - -### Possible Status Values - -- `healthy`: Service is fully operational -- `degraded`: Service is operational but experiencing issues -- `unhealthy`: Service is not operational - -Note: The actual status values and additional fields may vary based on the API implementation. - -## Docker Health Check - -### Dockerfile Example - -```dockerfile -FROM python:3.11-slim - -WORKDIR /app -COPY requirements.txt . -RUN pip install -r requirements.txt - -COPY . . - -# Health check using the SDK -HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \ - CMD python -c "from scrapegraph_py import Client; import sys; c = Client.from_env(); h = c.healthz(); c.close(); sys.exit(0 if h.get('status') == 'healthy' else 1)" - -CMD ["python", "app.py"] -``` - -### docker-compose.yml Example - -```yaml -version: '3.8' -services: - app: - build: . 
- environment: - - SGAI_API_KEY=${SGAI_API_KEY} - healthcheck: - test: ["CMD", "python", "-c", "from scrapegraph_py import Client; import sys; c = Client.from_env(); h = c.healthz(); c.close(); sys.exit(0 if h.get('status') == 'healthy' else 1)"] - interval: 30s - timeout: 3s - retries: 3 - start_period: 5s -``` - -## Kubernetes Deployment - -### Liveness and Readiness Probes - -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: scrapegraph-app -spec: - replicas: 3 - template: - spec: - containers: - - name: app - image: your-app:latest - env: - - name: SGAI_API_KEY - valueFrom: - secretKeyRef: - name: scrapegraph-secret - key: api-key - - # Liveness probe - restarts container if unhealthy - livenessProbe: - exec: - command: - - python - - -c - - | - from scrapegraph_py import Client - import sys - c = Client.from_env() - h = c.healthz() - c.close() - sys.exit(0 if h.get('status') == 'healthy' else 1) - initialDelaySeconds: 10 - periodSeconds: 30 - timeoutSeconds: 5 - failureThreshold: 3 - - # Readiness probe - removes from service if not ready - readinessProbe: - exec: - command: - - python - - -c - - | - from scrapegraph_py import Client - import sys - c = Client.from_env() - h = c.healthz() - c.close() - sys.exit(0 if h.get('status') == 'healthy' else 1) - initialDelaySeconds: 5 - periodSeconds: 10 - timeoutSeconds: 5 - failureThreshold: 2 -``` - -## Examples Location - -### Python Examples -- Basic: `scrapegraph-py/examples/utilities/healthz_example.py` -- Async: `scrapegraph-py/examples/utilities/healthz_async_example.py` - -## Tests - -### Python Tests -- `scrapegraph-py/tests/test_client.py` - Synchronous tests -- `scrapegraph-py/tests/test_async_client.py` - Asynchronous tests -- `scrapegraph-py/tests/test_healthz_mock.py` - Mock mode tests - -## Running Tests - -### Python - -```bash -# Run all tests -cd scrapegraph-py -pytest tests/test_healthz_mock.py -v - -# Run specific test -pytest tests/test_healthz_mock.py::test_healthz_mock_sync -v 
-``` - -## Best Practices - -1. **Implement Timeout**: Always set a reasonable timeout for health checks (3-5 seconds recommended) -2. **Use Appropriate Intervals**: Don't check too frequently; 30 seconds is a good default -3. **Handle Failures Gracefully**: Implement retry logic with exponential backoff -4. **Monitor and Alert**: Integrate with monitoring systems for automated alerting -5. **Test in Mock Mode**: Use mock mode in CI/CD pipelines to avoid API calls -6. **Log Health Check Results**: Keep records of health check outcomes for debugging - -## Support - -For issues, questions, or contributions, please visit: -- [GitHub Repository](https://github.com/ScrapeGraphAI/scrapegraph-sdk) -- [Issue #62](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/62) -- [Documentation](https://docs.scrapegraphai.com) diff --git a/IMPLEMENTATION_SUMMARY.md b/IMPLEMENTATION_SUMMARY.md deleted file mode 100644 index 8b0ec2a8..00000000 --- a/IMPLEMENTATION_SUMMARY.md +++ /dev/null @@ -1,231 +0,0 @@ -# Health Check Endpoint Implementation Summary - -## Overview -Added a `/healthz` health check endpoint to the Python SDK as requested in [Issue #62](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/62). - -## Changes Made - -### Python SDK (`scrapegraph-py/`) - -#### Core Implementation -1. **`scrapegraph_py/client.py`** - - Added `healthz()` method to the synchronous Client class - - Added mock response support for `/healthz` endpoint - - Full documentation and logging support - -2. **`scrapegraph_py/async_client.py`** - - Added `healthz()` method to the asynchronous AsyncClient class - - Added mock response support for `/healthz` endpoint - - Full async/await support with proper error handling - -#### Examples -3. **`examples/utilities/healthz_example.py`** - - Basic synchronous health check example - - Monitoring integration example - - Exit code handling for scripts - -4. 
**`examples/utilities/healthz_async_example.py`** - - Async health check example - - Concurrent health checks demonstration - - FastAPI integration pattern - - Advanced monitoring patterns - -#### Tests -5. **`tests/test_client.py`** - - Added `test_healthz()` - test basic health check - - Added `test_healthz_unhealthy()` - test unhealthy status - -6. **`tests/test_async_client.py`** - - Added `test_healthz()` - async test basic health check - - Added `test_healthz_unhealthy()` - async test unhealthy status - -7. **`tests/test_healthz_mock.py`** (New file) - - Comprehensive mock mode tests - - Tests for both sync and async clients - - Custom mock response tests - - Environment variable tests - -### Documentation -8. **`HEALTHCHECK.md`** (New file at root) - - Complete documentation for the SDK - - API reference - - Usage examples - - Integration patterns (FastAPI) - - Docker and Kubernetes examples - - Best practices - -9. **`IMPLEMENTATION_SUMMARY.md`** (This file) - - Summary of all changes - - File structure - - Testing results - -## Features Implemented - -### Core Functionality -โœ… GET `/healthz` endpoint implementation -โœ… Synchronous client support (Python) -โœ… Asynchronous client support (Python) -โœ… Proper error handling -โœ… Logging support - -### Mock Mode Support -โœ… Built-in mock responses -โœ… Custom mock response support -โœ… Mock handler support -โœ… Environment variable control - -### Testing -โœ… Unit tests for Python sync client -โœ… Unit tests for Python async client -โœ… Mock mode tests -โœ… All tests passing - -### Documentation -โœ… Inline code documentation -โœ… Python docstrings -โœ… Comprehensive user guide -โœ… Integration examples -โœ… Best practices guide - -### Examples -โœ… Basic usage examples -โœ… Advanced monitoring patterns -โœ… Framework integrations (FastAPI) -โœ… Container health checks (Docker) -โœ… Kubernetes probes -โœ… Retry logic patterns - -## Testing Results - -### Python SDK -``` -Running health check mock tests... 
-============================================================ -โœ“ Sync health check mock test passed -โœ“ Sync custom mock response test passed -โœ“ from_env mock test passed - -============================================================ -โœ… All synchronous tests passed! - -pytest results: -======================== 5 passed, 39 warnings in 0.25s ======================== -``` - -## File Structure - -``` -scrapegraph-sdk/ -โ”œโ”€โ”€ HEALTHCHECK.md # Complete documentation -โ”œโ”€โ”€ IMPLEMENTATION_SUMMARY.md # This file -โ”‚ -โ””โ”€โ”€ scrapegraph-py/ - โ”œโ”€โ”€ scrapegraph_py/ - โ”‚ โ”œโ”€โ”€ client.py # โœจ Added healthz() method - โ”‚ โ””โ”€โ”€ async_client.py # โœจ Added healthz() method - โ”œโ”€โ”€ examples/utilities/ - โ”‚ โ”œโ”€โ”€ healthz_example.py # ๐Ÿ†• New example - โ”‚ โ””โ”€โ”€ healthz_async_example.py # ๐Ÿ†• New example - โ””โ”€โ”€ tests/ - โ”œโ”€โ”€ test_client.py # โœจ Added tests - โ”œโ”€โ”€ test_async_client.py # โœจ Added tests - โ””โ”€โ”€ test_healthz_mock.py # ๐Ÿ†• New test file -``` - -Legend: -- ๐Ÿ†• New file -- โœจ Modified file - -## API Endpoints - -### Python -```python -# Synchronous -client.healthz() -> dict - -# Asynchronous -await client.healthz() -> dict -``` - -## Response Format -```json -{ - "status": "healthy", - "message": "Service is operational" -} -``` - -## Usage Examples - -### Python (Sync) -```python -from scrapegraph_py import Client - -client = Client.from_env() -health = client.healthz() -print(health) -client.close() -``` - -### Python (Async) -```python -from scrapegraph_py import AsyncClient - -async with AsyncClient.from_env() as client: - health = await client.healthz() - print(health) -``` - -## Integration Examples - -### Kubernetes Liveness Probe -```yaml -livenessProbe: - exec: - command: - - python - - -c - - | - from scrapegraph_py import Client - import sys - c = Client.from_env() - h = c.healthz() - c.close() - sys.exit(0 if h.get('status') == 'healthy' else 1) - initialDelaySeconds: 10 - periodSeconds: 
30 -``` - -### Docker Health Check -```dockerfile -HEALTHCHECK --interval=30s --timeout=3s --retries=3 \ - CMD python -c "from scrapegraph_py import Client; import sys; c = Client.from_env(); h = c.healthz(); c.close(); sys.exit(0 if h.get('status') == 'healthy' else 1)" -``` - -## Next Steps - -1. โœ… Implementation complete -2. โœ… Tests written and passing -3. โœ… Documentation complete -4. โœ… Examples created -5. ๐Ÿ”ฒ Merge to main branch -6. ๐Ÿ”ฒ Release new version -7. ๐Ÿ”ฒ Update public documentation - -## Notes - -- All code follows existing SDK patterns and conventions -- Mock mode support ensures tests can run without API access -- Comprehensive error handling included -- Logging integrated throughout -- Documentation includes real-world integration examples -- All tests passing successfully - -## Related Issues - -- Resolves: [Issue #62 - Add health check endpoint to Python SDK](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/62) - -## Compatibility - -- Python: 3.8+ -- Fully backward compatible with existing SDK functionality diff --git a/README.md b/README.md index 07eace87..aa087738 100644 --- a/README.md +++ b/README.md @@ -1,290 +1,349 @@ -# ๐ŸŒ ScrapeGraph AI SDK +# ScrapeGraphAI Python SDK -[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) -[![Python SDK](https://img.shields.io/badge/Python_SDK-Latest-blue)](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py) -[![Documentation](https://img.shields.io/badge/Documentation-Latest-green)](https://docs.scrapegraphai.com) +[![PyPI version](https://badge.fury.io/py/scrapegraph-py.svg)](https://badge.fury.io/py/scrapegraph-py) +[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) -Official Python SDK for the ScrapeGraph AI API - Intelligent web scraping and search powered by AI. 
Extract structured data from any webpage or perform AI-powered web searches with natural language prompts. +

+ ScrapeGraphAI Python SDK +

-Get your [API key](https://scrapegraphai.com)! -[![API Banner](https://raw.githubusercontent.com/ScrapeGraphAI/Scrapegraph-ai/main/docs/assets/api_banner.png)](https://scrapegraphai.com/?utm_source=github&utm_medium=readme&utm_campaign=api_banner&utm_content=api_banner_image) +Official Python SDK for the [ScrapeGraphAI API](https://scrapegraphai.com). -## Features - -- ๐Ÿค– **SmartScraper**: Extract structured data from webpages using natural language prompts -- ๐Ÿ” **SearchScraper**: AI-powered web search with structured results and reference URLs -- ๐Ÿ“ **Markdownify**: Convert any webpage into clean, formatted markdown -- ๐Ÿ•ท๏ธ **SmartCrawler**: Intelligently crawl and extract data from multiple pages -- ๐Ÿค– **AgenticScraper**: Perform automated browser actions with AI-powered session management -- ๐Ÿ“„ **Scrape**: Convert webpages to HTML with JavaScript rendering and custom headers -- โฐ **Scheduled Jobs**: Create and manage automated scraping workflows with cron scheduling -- ๐Ÿ’ณ **Credits Management**: Monitor API usage and credit balance -- ๐Ÿ’ฌ **Feedback System**: Provide ratings and feedback to improve service quality - -## ๐Ÿš€ Quick Links -ScrapeGraphAI offers seamless integration with popular frameworks and tools to enhance your scraping capabilities. Whether you're building with Python, using LLM frameworks, or working with no-code platforms, we've got you covered with our comprehensive integration options.. 
- -You can find more informations at the following [link](https://scrapegraphai.com) - -**Integrations**: - -- **API**: [Documentation](https://docs.scrapegraphai.com/introduction) -- **SDK**: [Python](https://docs.scrapegraphai.com/sdks/python) -- **LLM Frameworks**: [Langchain](https://docs.scrapegraphai.com/integrations/langchain), [Llama Index](https://docs.scrapegraphai.com/integrations/llamaindex), [Crew.ai](https://docs.scrapegraphai.com/integrations/crewai), [CamelAI](https://github.com/camel-ai/camel) -- **Low-code Frameworks**: [Pipedream](https://pipedream.com/apps/scrapegraphai), [Bubble](https://bubble.io/plugin/scrapegraphai-1745408893195x213542371433906180), [Zapier](https://zapier.com/apps/scrapegraphai/integrations), [n8n](http://localhost:5001/dashboard), [LangFlow](https://www.langflow.org) -- **MCP server**: [Link](https://smithery.ai/server/@ScrapeGraphAI/scrapegraph-mcp) - -## ๐Ÿ“ฆ Installation +## Install ```bash pip install scrapegraph-py +# or +uv add scrapegraph-py ``` -## ๐ŸŽฏ Core Features - -- ๐Ÿค– **AI-Powered Extraction & Search**: Use natural language to extract data or search the web -- ๐Ÿ“Š **Structured Output**: Get clean, structured data with optional schema validation -- ๐Ÿ”„ **Multiple Formats**: Extract data as JSON, Markdown, or custom schemas -- โšก **High Performance**: Concurrent processing and automatic retries -- ๐Ÿ”’ **Enterprise Ready**: Production-grade security and rate limiting - -## ๐Ÿ› ๏ธ Available Endpoints - -### ๐Ÿค– SmartScraper -Using AI to extract structured data from any webpage or HTML content with natural language prompts. 
- -**Example Usage:** +## Quick Start ```python -from scrapegraph_py import Client -import os -from dotenv import load_dotenv +from scrapegraph_py import ScrapeGraphAI, ScrapeRequest -load_dotenv() - -# Initialize the client -client = Client(api_key=os.getenv("SGAI_API_KEY")) - -# Extract data from a webpage -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the main heading, description, and summary of the webpage", -) +# reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI(api_key="...") +sgai = ScrapeGraphAI() -print(f"Request ID: {response['request_id']}") -print(f"Result: {response['result']}") +result = sgai.scrape(ScrapeRequest( + url="https://example.com", +)) -client.close() +if result.status == "success": + print(result.data["results"]["markdown"]["data"]) +else: + print(result.error) ``` -### 🔍 SearchScraper -Perform AI-powered web searches with structured results and reference URLs. - -**Example Usage:** +Every method returns `ApiResult[T]`, so there are no exceptions to catch: ```python -from scrapegraph_py import Client -import os -from dotenv import load_dotenv +@dataclass +class ApiResult(Generic[T]): + status: Literal["success", "error"] + data: T | None + error: str | None + elapsed_ms: int +``` + +## API -load_dotenv() +### scrape -# Initialize the client -client = Client(api_key=os.getenv("SGAI_API_KEY")) +Scrape a webpage in multiple formats (markdown, html, screenshot, json, etc.).
-# Perform AI-powered web search -response = client.searchscraper( - user_prompt="What is the latest version of Python and what are its main features?", - num_results=3, # Number of websites to search (default: 3) +```python +from scrapegraph_py import ( + ScrapeGraphAI, ScrapeRequest, FetchConfig, + MarkdownFormatConfig, ScreenshotFormatConfig, JsonFormatConfig ) -print(f"Result: {response['result']}") -print("\nReference URLs:") -for url in response["reference_urls"]: - print(f"- {url}") +sgai = ScrapeGraphAI() -client.close() +res = sgai.scrape(ScrapeRequest( + url="https://example.com", + formats=[ + MarkdownFormatConfig(mode="reader"), + ScreenshotFormatConfig(full_page=True, width=1440, height=900), + JsonFormatConfig(prompt="Extract product info"), + ], + content_type="text/html", # optional, auto-detected + fetch_config=FetchConfig( # optional + mode="js", # "auto" | "fast" | "js" + stealth=True, + timeout=30000, + wait=2000, + scrolls=3, + headers={"Accept-Language": "en"}, + cookies={"session": "abc"}, + country="us", + ), +)) ``` -### 📝 Markdownify -Convert any webpage into clean, formatted markdown. +**Formats:** +- `markdown`: Clean markdown (modes: `normal`, `reader`, `prune`) +- `html`: Raw HTML (modes: `normal`, `reader`, `prune`) +- `links`: All links on the page +- `images`: All image URLs +- `summary`: AI-generated summary +- `json`: Structured extraction with prompt/schema +- `branding`: Brand colors, typography, logos +- `screenshot`: Page screenshot (full_page, width, height, quality) -**Example Usage:** +### extract + +Extract structured data from a URL, HTML, or markdown using AI.
```python -from scrapegraph_py import Client -import os -from dotenv import load_dotenv +from scrapegraph_py import ScrapeGraphAI, ExtractRequest, FetchConfig -load_dotenv() +sgai = ScrapeGraphAI() -# Initialize the client -client = Client(api_key=os.getenv("SGAI_API_KEY")) +res = sgai.extract(ExtractRequest( + url="https://example.com", + prompt="Extract product names and prices", + schema={"type": "object", "properties": {...}}, # optional + mode="reader", # optional + fetch_config=FetchConfig(...), # optional +)) +# Or pass html/markdown directly instead of url +``` -# Convert webpage to markdown -response = client.markdownify( - website_url="https://example.com", -) +### search -print(f"Request ID: {response['request_id']}") -print(f"Markdown: {response['result']}") +Search the web and optionally extract structured data. -client.close() +```python +from scrapegraph_py import ScrapeGraphAI, SearchRequest, FetchConfig + +sgai = ScrapeGraphAI() + +res = sgai.search(SearchRequest( + query="best programming languages 2024", + num_results=5, # 1-20, default 3 + format="markdown", # "markdown" | "html" + prompt="Extract key points", # optional, for AI extraction + schema={...}, # optional + time_range="past_week", # optional + location_geo_code="us", # optional + fetch_config=FetchConfig(...), # optional +)) ``` -### 🕷️ SmartCrawler -Intelligently crawl and extract data from multiple pages with configurable depth and batch processing. +### crawl -**Example Usage:** +Crawl a website and its linked pages.
```python -from scrapegraph_py import Client -import os -import time -from dotenv import load_dotenv - -load_dotenv() +from scrapegraph_py import ScrapeGraphAI, CrawlRequest, FetchConfig, MarkdownFormatConfig -# Initialize the client -client = Client(api_key=os.getenv("SGAI_API_KEY")) +sgai = ScrapeGraphAI() -# Start crawl job -crawl_response = client.crawl( +# Start a crawl +start = sgai.crawl.start(CrawlRequest( url="https://example.com", - prompt="Extract page titles and main headings", - data_schema={ - "type": "object", - "properties": { - "title": {"type": "string"}, - "headings": {"type": "array", "items": {"type": "string"}} - } - }, - depth=2, - max_pages=5, - same_domain_only=True, -) - -crawl_id = crawl_response.get("id") or crawl_response.get("task_id") - -# Poll for results -if crawl_id: - for _ in range(10): - time.sleep(5) - result = client.get_crawl(crawl_id) - if result.get("status") == "success": - print("Crawl completed:", result["result"]["llm_result"]) - break - -client.close() + formats=[MarkdownFormatConfig()], + max_pages=50, + max_depth=2, + max_links_per_page=10, + include_patterns=["/blog/*"], + exclude_patterns=["/admin/*"], + fetch_config=FetchConfig(...), +)) + +# Check status +crawl_id = start.data["id"] +status = sgai.crawl.get(crawl_id) + +# Control +sgai.crawl.stop(crawl_id) +sgai.crawl.resume(crawl_id) +sgai.crawl.delete(crawl_id) ``` -### 🤖 AgenticScraper -Perform automated browser actions on webpages using AI-powered agentic scraping with session management. +### monitor -**Example Usage:** +Monitor a webpage for changes on a schedule.
```python -from scrapegraph_py import Client -import os -from dotenv import load_dotenv +from scrapegraph_py import ScrapeGraphAI, MonitorCreateRequest, FetchConfig, MarkdownFormatConfig -load_dotenv() +sgai = ScrapeGraphAI() -# Initialize the client -client = Client(api_key=os.getenv("SGAI_API_KEY")) - -# Perform automated browser actions -response = client.agenticscraper( +# Create a monitor +mon = sgai.monitor.create(MonitorCreateRequest( url="https://example.com", - use_session=True, - steps=[ - "Type email@gmail.com in email input box", - "Type password123 in password inputbox", - "click on login" - ], - ai_extraction=False # Set to True for AI extraction -) + name="Price Monitor", + interval="0 * * * *", # cron expression + formats=[MarkdownFormatConfig()], + webhook_url="https://...", # optional + fetch_config=FetchConfig(...), +)) + +# Manage monitors +sgai.monitor.list() +sgai.monitor.get(cron_id) +sgai.monitor.update(cron_id, MonitorUpdateRequest(interval="0 */6 * * *")) +sgai.monitor.pause(cron_id) +sgai.monitor.resume(cron_id) +sgai.monitor.delete(cron_id) +``` -print(f"Request ID: {response['request_id']}") -print(f"Status: {response.get('status')}") +### history -# Get results -result = client.get_agenticscraper(response['request_id']) -print(f"Result: {result.get('result')}") +Fetch request history. -client.close() -``` +```python +from scrapegraph_py import ScrapeGraphAI, HistoryFilter -### 📄 Scrape -Convert webpages into HTML format with optional JavaScript rendering and custom headers.
+sgai = ScrapeGraphAI() -**Example Usage:** +history = sgai.history.list(HistoryFilter( + service="scrape", # optional filter + page=1, + limit=20, +)) -```python -from scrapegraph_py import Client -import os -from dotenv import load_dotenv +entry = sgai.history.get("request-id") +``` -load_dotenv() +### credits / health -# Initialize the client -client = Client(api_key=os.getenv("SGAI_API_KEY")) +```python +from scrapegraph_py import ScrapeGraphAI -# Get HTML content from webpage -response = client.scrape( - website_url="https://example.com", - render_heavy_js=False, # Set to True for JavaScript-heavy sites -) +sgai = ScrapeGraphAI() -print(f"Request ID: {response['request_id']}") -print(f"HTML length: {len(response.get('html', ''))} characters") +credits = sgai.credits() +# { remaining: 1000, used: 500, plan: "pro", jobs: { crawl: {...}, monitor: {...} } } -client.close() +health = sgai.health() +# { status: "ok", uptime: 12345 } ``` -### ⏰ Scheduled Jobs -Create, manage, and monitor scheduled scraping jobs with cron expressions and execution history. +## Async Client -### 💳 Credits -Check your API credit balance and usage. +All methods have async equivalents via `AsyncScrapeGraphAI`: -### 💬 Feedback -Send feedback and ratings for scraping requests to help improve the service.
+```python +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI, ScrapeRequest + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + result = await sgai.scrape(ScrapeRequest(url="https://example.com")) + if result.status == "success": + print(result.data["results"]["markdown"]["data"]) + else: + print(result.error) + +asyncio.run(main()) +``` -## ๐ŸŒŸ Key Benefits +### Async Extract -- ๐Ÿ“ **Natural Language Queries**: No complex selectors or XPath needed -- ๐ŸŽฏ **Precise Extraction**: AI understands context and structure -- ๐Ÿ”„ **Adaptive Processing**: Works with both web content and direct HTML -- ๐Ÿ“Š **Schema Validation**: Ensure data consistency with Pydantic -- โšก **Async Support**: Handle multiple requests efficiently -- ๐Ÿ” **Source Attribution**: Get reference URLs for search results +```python +async with AsyncScrapeGraphAI() as sgai: + res = await sgai.extract(ExtractRequest( + url="https://example.com", + prompt="Extract product names and prices", + )) +``` -## ๐Ÿ’ก Use Cases +### Async Search -- ๐Ÿข **Business Intelligence**: Extract company information and contacts -- ๐Ÿ“Š **Market Research**: Gather product data and pricing -- ๐Ÿ“ฐ **Content Aggregation**: Convert articles to structured formats -- ๐Ÿ” **Data Mining**: Extract specific information from multiple sources -- ๐Ÿ“ฑ **App Integration**: Feed clean data into your applications -- ๐ŸŒ **Web Research**: Perform AI-powered searches with structured results +```python +async with AsyncScrapeGraphAI() as sgai: + res = await sgai.search(SearchRequest( + query="best programming languages 2024", + num_results=5, + )) +``` -## ๐Ÿ“– Documentation +### Async Crawl -For detailed documentation and examples, visit: -- [Python SDK Guide](scrapegraph-py/README.md) -- [API Documentation](https://docs.scrapegraphai.com) +```python +async with AsyncScrapeGraphAI() as sgai: + start = await sgai.crawl.start(CrawlRequest( + url="https://example.com", + max_pages=50, + )) + status = 
await sgai.crawl.get(start.data["id"]) +``` -## 💬 Support & Feedback +### Async Monitor -- 📧 Email: support@scrapegraphai.com -- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues) -- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/new) +```python +async with AsyncScrapeGraphAI() as sgai: + mon = await sgai.monitor.create(MonitorCreateRequest( + url="https://example.com", + name="Price Monitor", + interval="0 * * * *", + )) +``` -## 📄 License +## Examples + +### Sync Examples + +| Service | Example | Description | +|---------|---------|-------------| +| scrape | [`scrape_basic.py`](examples/scrape/scrape_basic.py) | Basic markdown scraping | +| scrape | [`scrape_multi_format.py`](examples/scrape/scrape_multi_format.py) | Multiple formats | +| scrape | [`scrape_json_extraction.py`](examples/scrape/scrape_json_extraction.py) | Structured JSON extraction | +| scrape | [`scrape_pdf.py`](examples/scrape/scrape_pdf.py) | PDF document parsing | +| scrape | [`scrape_with_fetchconfig.py`](examples/scrape/scrape_with_fetchconfig.py) | JS rendering, stealth mode | +| extract | [`extract_basic.py`](examples/extract/extract_basic.py) | AI data extraction | +| extract | [`extract_with_schema.py`](examples/extract/extract_with_schema.py) | Extraction with JSON schema | +| search | [`search_basic.py`](examples/search/search_basic.py) | Web search | +| search | [`search_with_extraction.py`](examples/search/search_with_extraction.py) | Search + AI extraction | +| crawl | [`crawl_basic.py`](examples/crawl/crawl_basic.py) | Start and monitor a crawl | +| crawl | [`crawl_with_formats.py`](examples/crawl/crawl_with_formats.py) | Crawl with formats | +| monitor | [`monitor_basic.py`](examples/monitor/monitor_basic.py) | Create a page monitor | +| monitor | [`monitor_with_webhook.py`](examples/monitor/monitor_with_webhook.py) | Monitor with webhook | +| utilities | 
[`credits.py`](examples/utilities/credits.py) | Check credits and limits | +| utilities | [`health.py`](examples/utilities/health.py) | API health check | +| utilities | [`history.py`](examples/utilities/history.py) | Request history | + +### Async Examples + +| Service | Example | Description | +|---------|---------|-------------| +| scrape | [`scrape_basic_async.py`](examples/scrape/scrape_basic_async.py) | Basic markdown scraping | +| scrape | [`scrape_multi_format_async.py`](examples/scrape/scrape_multi_format_async.py) | Multiple formats | +| scrape | [`scrape_json_extraction_async.py`](examples/scrape/scrape_json_extraction_async.py) | Structured JSON extraction | +| scrape | [`scrape_pdf_async.py`](examples/scrape/scrape_pdf_async.py) | PDF document parsing | +| scrape | [`scrape_with_fetchconfig_async.py`](examples/scrape/scrape_with_fetchconfig_async.py) | JS rendering, stealth mode | +| extract | [`extract_basic_async.py`](examples/extract/extract_basic_async.py) | AI data extraction | +| extract | [`extract_with_schema_async.py`](examples/extract/extract_with_schema_async.py) | Extraction with JSON schema | +| search | [`search_basic_async.py`](examples/search/search_basic_async.py) | Web search | +| search | [`search_with_extraction_async.py`](examples/search/search_with_extraction_async.py) | Search + AI extraction | +| crawl | [`crawl_basic_async.py`](examples/crawl/crawl_basic_async.py) | Start and monitor a crawl | +| crawl | [`crawl_with_formats_async.py`](examples/crawl/crawl_with_formats_async.py) | Crawl with formats | +| monitor | [`monitor_basic_async.py`](examples/monitor/monitor_basic_async.py) | Create a page monitor | +| monitor | [`monitor_with_webhook_async.py`](examples/monitor/monitor_with_webhook_async.py) | Monitor with webhook | +| utilities | [`credits_async.py`](examples/utilities/credits_async.py) | Check credits and limits | +| utilities | [`health_async.py`](examples/utilities/health_async.py) | API health check | +| utilities 
| [`history_async.py`](examples/utilities/history_async.py) | Request history | + +## Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `SGAI_API_KEY` | Your ScrapeGraphAI API key | — | +| `SGAI_API_URL` | Override API base URL | `https://api.scrapegraphai.com/api/v2` | +| `SGAI_DEBUG` | Enable debug logging (`"1"`) | off | +| `SGAI_TIMEOUT` | Request timeout in seconds | `120` | + +## Development -This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. +```bash +uv sync +uv run pytest tests/ # unit tests +uv run pytest tests/test_integration.py # live API tests (requires SGAI_API_KEY) +uv run ruff check . # lint +``` ---- +## License -Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com) +MIT - [ScrapeGraphAI](https://scrapegraphai.com) diff --git a/TOON_INTEGRATION_SUMMARY.md b/TOON_INTEGRATION_SUMMARY.md deleted file mode 100644 index 7fce9884..00000000 --- a/TOON_INTEGRATION_SUMMARY.md +++ /dev/null @@ -1,170 +0,0 @@ -# TOON Integration - Implementation Summary - -## 🎯 Objective -Integrate the [Toonify library](https://github.com/ScrapeGraphAI/toonify) into the ScrapeGraph SDK to enable token-efficient responses using the TOON (Token-Oriented Object Notation) format. - -## ✅ What Was Done - -### 1. **Dependency Management** -- Added `toonify>=1.0.0` as a dependency in `pyproject.toml` -- The library was successfully installed and tested - -### 2. **Core Implementation** -Created a new utility module: `scrapegraph_py/utils/toon_converter.py` -- Implements `convert_to_toon()` function for converting Python dicts to TOON format -- Implements `process_response_with_toon()` helper function -- Handles graceful fallback if toonify is not installed - -### 3. 
**Client Integration - Synchronous Client** -Updated `scrapegraph_py/client.py` to add `return_toon` parameter to: -- ✅ `smartscraper()` and `get_smartscraper()` -- ✅ `searchscraper()` and `get_searchscraper()` -- ✅ `crawl()` and `get_crawl()` -- ✅ `agenticscraper()` and `get_agenticscraper()` -- ✅ `markdownify()` and `get_markdownify()` -- ✅ `scrape()` and `get_scrape()` - -### 4. **Client Integration - Asynchronous Client** -Updated `scrapegraph_py/async_client.py` with identical `return_toon` parameter to: -- ✅ `smartscraper()` and `get_smartscraper()` -- ✅ `searchscraper()` and `get_searchscraper()` -- ✅ `crawl()` and `get_crawl()` -- ✅ `agenticscraper()` and `get_agenticscraper()` -- ✅ `markdownify()` and `get_markdownify()` -- ✅ `scrape()` and `get_scrape()` - -### 5. **Documentation** -- Created `TOON_INTEGRATION.md` with comprehensive documentation - - Overview of TOON format - - Benefits and use cases - - Usage examples for all methods - - Cost savings calculations - - When to use TOON vs JSON - -### 6. **Examples** -Created two complete example scripts: -- `examples/toon_example.py` - Synchronous examples -- `examples/toon_async_example.py` - Asynchronous examples -- Both examples demonstrate multiple scraping methods with TOON format -- Include token comparison and savings calculations - -### 7. 
**Testing** -- ✅ Successfully tested with a valid API key -- ✅ Verified both JSON and TOON outputs work correctly -- ✅ Confirmed token reduction in practice - -## 📊 Key Results - -### Example Output Comparison - -**JSON Format:** -```json -{ - "request_id": "f424487d-6e2b-4361-824f-9c54f8fe0d8e", - "status": "completed", - "website_url": "https://example.com", - "user_prompt": "Extract the page title and main heading", - "result": { - "page_title": "Example Domain", - "main_heading": "Example Domain" - }, - "error": "" -} -``` - -**TOON Format:** -``` -request_id: de003fcc-212c-4604-be14-06a6e88ff350 -status: completed -website_url: "https://example.com" -user_prompt: Extract the page title and main heading -result: - page_title: Example Domain - main_heading: Example Domain -error: "" -``` - -### Benefits Achieved -- ✅ **30-60% token reduction** for typical responses -- ✅ **Lower LLM API costs** (saves $2,147 per million requests at GPT-4 pricing) -- ✅ **Faster processing** due to smaller payloads -- ✅ **Human-readable** format maintained -- ✅ **Backward compatible** - existing code continues to work with JSON - -## 🌿 Branch Information - -**Branch Name:** `feature/toonify-integration` - -**Commit:** `c094530` - -**Remote URL:** https://github.com/ScrapeGraphAI/scrapegraph-sdk/pull/new/feature/toonify-integration - -## 🔄 Files Changed - -### Modified Files (3): -1. `scrapegraph-py/pyproject.toml` - Added toonify dependency -2. `scrapegraph-py/scrapegraph_py/client.py` - Added TOON support to sync methods -3. `scrapegraph-py/scrapegraph_py/async_client.py` - Added TOON support to async methods - -### New Files (4): -1. `scrapegraph-py/scrapegraph_py/utils/toon_converter.py` - Core TOON conversion utility -2. `scrapegraph-py/examples/toon_example.py` - Sync examples -3. `scrapegraph-py/examples/toon_async_example.py` - Async examples -4. 
`scrapegraph-py/TOON_INTEGRATION.md` - Complete documentation - -**Total:** 7 files changed, 764 insertions(+), 58 deletions(-) - -## 🚀 Usage - -### Basic Example - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key") - -# Get response in TOON format (30-60% fewer tokens) -toon_result = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information", - return_toon=True # Enable TOON format -) - -print(toon_result) # TOON formatted string -``` - -### Async Example - -```python -import asyncio -from scrapegraph_py import AsyncClient - -async def main(): - async with AsyncClient(api_key="your-api-key") as client: - toon_result = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information", - return_toon=True - ) - print(toon_result) - -asyncio.run(main()) -``` - -## 🎉 Summary - -The TOON integration has been successfully completed! All scraping methods in both synchronous and asynchronous clients now support the `return_toon=True` parameter. The implementation is: - -- ✅ **Fully functional** - tested and working -- ✅ **Well documented** - includes comprehensive guide and examples -- ✅ **Backward compatible** - existing code continues to work -- ✅ **Token efficient** - delivers 30-60% token savings as promised - -The feature is ready for review and can be merged into the main branch. 
- -## 🔗 Resources - -- **Toonify Repository:** https://github.com/ScrapeGraphAI/toonify -- **TOON Format Spec:** https://github.com/toon-format/toon -- **Branch:** https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/feature/toonify-integration - diff --git a/codebeaver.yml b/codebeaver.yml deleted file mode 100644 index a22a7b72..00000000 --- a/codebeaver.yml +++ /dev/null @@ -1,7 +0,0 @@ -workspaces: -- from: jest - name: scrapegraph-js - path: scrapegraph-js -- from: pytest - name: scrapegraph-py - path: scrapegraph-py diff --git a/examples/.env.example b/examples/.env.example new file mode 100644 index 00000000..da809565 --- /dev/null +++ b/examples/.env.example @@ -0,0 +1 @@ +SGAI_API_KEY=your_api_key_here diff --git a/examples/crawl/crawl_basic.py b/examples/crawl/crawl_basic.py new file mode 100644 index 00000000..0c8cfdf3 --- /dev/null +++ b/examples/crawl/crawl_basic.py @@ -0,0 +1,34 @@ +from dotenv import load_dotenv +load_dotenv() + +import time +from scrapegraph_py import ScrapeGraphAI, CrawlRequest + +sgai = ScrapeGraphAI() + +start_res = sgai.crawl.start(CrawlRequest( + url="https://scrapegraphai.com/", + max_pages=5, + max_depth=2, +)) + +if start_res.status != "success" or not start_res.data: + print("Failed to start:", start_res.error) +else: + crawl_id = start_res.data.id + print("Crawl started:", crawl_id) + + status = start_res.data.status + while status == "running": + time.sleep(2) + get_res = sgai.crawl.get(crawl_id) + if get_res.status != "success" or not get_res.data: + print("Failed to get status:", get_res.error) + break + status = get_res.data.status + print(f"Progress: {get_res.data.finished}/{get_res.data.total} - {status}") + + if status in ("completed", "failed"): + print("\nPages crawled:") + for page in get_res.data.pages: + print(f" {page.url} - {page.status}") diff --git a/examples/crawl/crawl_basic_async.py b/examples/crawl/crawl_basic_async.py new file mode 100644 index 00000000..30fd0b79 --- /dev/null +++ 
b/examples/crawl/crawl_basic_async.py @@ -0,0 +1,36 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI, CrawlRequest + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + start_res = await sgai.crawl.start(CrawlRequest( + url="https://scrapegraphai.com/", + max_pages=5, + max_depth=2, + )) + + if start_res.status != "success" or not start_res.data: + print("Failed to start:", start_res.error) + else: + crawl_id = start_res.data.id + print("Crawl started:", crawl_id) + + status = start_res.data.status + while status == "running": + await asyncio.sleep(2) + get_res = await sgai.crawl.get(crawl_id) + if get_res.status != "success" or not get_res.data: + print("Failed to get status:", get_res.error) + break + status = get_res.data.status + print(f"Progress: {get_res.data.finished}/{get_res.data.total} - {status}") + + if status in ("completed", "failed"): + print("\nPages crawled:") + for page in get_res.data.pages: + print(f" {page.url} - {page.status}") + +asyncio.run(main()) diff --git a/examples/crawl/crawl_with_formats.py b/examples/crawl/crawl_with_formats.py new file mode 100644 index 00000000..1026b384 --- /dev/null +++ b/examples/crawl/crawl_with_formats.py @@ -0,0 +1,45 @@ +from dotenv import load_dotenv +load_dotenv() + +import time +from scrapegraph_py import ( + ScrapeGraphAI, + CrawlRequest, + MarkdownFormatConfig, + LinksFormatConfig, +) + +sgai = ScrapeGraphAI() + +start_res = sgai.crawl.start(CrawlRequest( + url="https://scrapegraphai.com/", + max_pages=3, + max_depth=1, + formats=[ + MarkdownFormatConfig(), + LinksFormatConfig(), + ], +)) + +if start_res.status != "success" or not start_res.data: + print("Failed to start:", start_res.error) +else: + crawl_id = start_res.data.id + print("Crawl started:", crawl_id) + + status = start_res.data.status + while status == "running": + time.sleep(2) + get_res = sgai.crawl.get(crawl_id) + if get_res.status != "success" or not 
get_res.data: + print("Failed to get status:", get_res.error) + break + status = get_res.data.status + print(f"Progress: {get_res.data.finished}/{get_res.data.total} - {status}") + + if status in ("completed", "failed"): + print("\nPages crawled:") + for page in get_res.data.pages: + print(f"\n Page: {page.url}") + print(f" Status: {page.status}") + print(f" Depth: {page.depth}") diff --git a/examples/crawl/crawl_with_formats_async.py b/examples/crawl/crawl_with_formats_async.py new file mode 100644 index 00000000..d238a58c --- /dev/null +++ b/examples/crawl/crawl_with_formats_async.py @@ -0,0 +1,47 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import ( + AsyncScrapeGraphAI, + CrawlRequest, + MarkdownFormatConfig, + LinksFormatConfig, +) + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + start_res = await sgai.crawl.start(CrawlRequest( + url="https://scrapegraphai.com/", + max_pages=3, + max_depth=1, + formats=[ + MarkdownFormatConfig(), + LinksFormatConfig(), + ], + )) + + if start_res.status != "success" or not start_res.data: + print("Failed to start:", start_res.error) + else: + crawl_id = start_res.data.id + print("Crawl started:", crawl_id) + + status = start_res.data.status + while status == "running": + await asyncio.sleep(2) + get_res = await sgai.crawl.get(crawl_id) + if get_res.status != "success" or not get_res.data: + print("Failed to get status:", get_res.error) + break + status = get_res.data.status + print(f"Progress: {get_res.data.finished}/{get_res.data.total} - {status}") + + if status in ("completed", "failed"): + print("\nPages crawled:") + for page in get_res.data.pages: + print(f"\n Page: {page.url}") + print(f" Status: {page.status}") + print(f" Depth: {page.depth}") + +asyncio.run(main()) diff --git a/examples/extract/extract_basic.py b/examples/extract/extract_basic.py new file mode 100644 index 00000000..5bb82aca --- /dev/null +++ b/examples/extract/extract_basic.py @@ -0,0 +1,18 
@@ +from dotenv import load_dotenv +load_dotenv() + +import json +from scrapegraph_py import ScrapeGraphAI, ExtractRequest + +sgai = ScrapeGraphAI() + +res = sgai.extract(ExtractRequest( + url="https://example.com", + prompt="What is this page about? Extract the main heading and description.", +)) + +if res.status == "success": + print("Extracted:", json.dumps(res.data.json_data, indent=2)) + print("\nTokens used:", res.data.usage) +else: + print("Failed:", res.error) diff --git a/examples/extract/extract_basic_async.py b/examples/extract/extract_basic_async.py new file mode 100644 index 00000000..be98fcf4 --- /dev/null +++ b/examples/extract/extract_basic_async.py @@ -0,0 +1,21 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +import json +from scrapegraph_py import AsyncScrapeGraphAI, ExtractRequest + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.extract(ExtractRequest( + url="https://example.com", + prompt="What is this page about? 
Extract the main heading and description.", + )) + + if res.status == "success": + print("Extracted:", json.dumps(res.data.json_data, indent=2)) + print("\nTokens used:", res.data.usage) + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/extract/extract_with_schema.py b/examples/extract/extract_with_schema.py new file mode 100644 index 00000000..b10a68d1 --- /dev/null +++ b/examples/extract/extract_with_schema.py @@ -0,0 +1,31 @@ +from dotenv import load_dotenv +load_dotenv() + +import json +from scrapegraph_py import ScrapeGraphAI, ExtractRequest + +sgai = ScrapeGraphAI() + +res = sgai.extract(ExtractRequest( + url="https://example.com", + prompt="Extract structured information about this page", + schema={ + "type": "object", + "properties": { + "title": {"type": "string"}, + "description": {"type": "string"}, + "links": { + "type": "array", + "items": {"type": "string"}, + }, + }, + "required": ["title"], + }, +)) + +if res.status == "success": + print("Extracted:", json.dumps(res.data.json_data, indent=2)) + print("\nRaw:", res.data.raw) + print("\nTokens used:", res.data.usage) +else: + print("Failed:", res.error) diff --git a/examples/extract/extract_with_schema_async.py b/examples/extract/extract_with_schema_async.py new file mode 100644 index 00000000..6a6641d8 --- /dev/null +++ b/examples/extract/extract_with_schema_async.py @@ -0,0 +1,34 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +import json +from scrapegraph_py import AsyncScrapeGraphAI, ExtractRequest + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.extract(ExtractRequest( + url="https://example.com", + prompt="Extract structured information about this page", + schema={ + "type": "object", + "properties": { + "title": {"type": "string"}, + "description": {"type": "string"}, + "links": { + "type": "array", + "items": {"type": "string"}, + }, + }, + "required": ["title"], + }, + )) + + if res.status == "success": + 
print("Extracted:", json.dumps(res.data.json_data, indent=2)) + print("\nRaw:", res.data.raw) + print("\nTokens used:", res.data.usage) + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/monitor/monitor_basic.py b/examples/monitor/monitor_basic.py new file mode 100644 index 00000000..5a500200 --- /dev/null +++ b/examples/monitor/monitor_basic.py @@ -0,0 +1,61 @@ +from dotenv import load_dotenv +load_dotenv() + +import json +import signal +import time +from scrapegraph_py import ScrapeGraphAI, MonitorCreateRequest, JsonFormatConfig + +sgai = ScrapeGraphAI() + +res = sgai.monitor.create(MonitorCreateRequest( + url="https://time.is/", + name="Time Monitor", + interval="*/10 * * * *", + formats=[JsonFormatConfig( + prompt="Extract the current time", + schema={ + "type": "object", + "properties": { + "time": {"type": "string"}, + }, + "required": ["time"], + }, + )], +)) + +if res.status != "success" or not res.data: + print("Failed to create monitor:", res.error) + exit(1) + +monitor_id = res.data.cron_id +print(f"Monitor created: {monitor_id}") +print(f"Interval: {res.data.interval}") +print("\nPolling for activity (Ctrl+C to stop)...\n") + +def cleanup(_sig, _frame): + print("\nStopping monitor...") + sgai.monitor.delete(monitor_id) + print("Monitor deleted") + exit(0) + +signal.signal(signal.SIGINT, cleanup) + +seen_ids = set() + +while True: + activity = sgai.monitor.activity(monitor_id) + if activity.status == "success" and activity.data: + for tick in activity.data.ticks: + if tick.id in seen_ids: + continue + seen_ids.add(tick.id) + + changes = "CHANGED" if tick.changed else "no change" + print(f"[{tick.created_at}] {tick.status} - {changes} ({tick.elapsed_ms}ms)") + diffs = tick.diffs.model_dump(exclude_none=True) + if diffs: + print(f" Diffs: {json.dumps(diffs, indent=2)}") + elif tick.changed: + print(" (no diffs data)") + time.sleep(30) diff --git a/examples/monitor/monitor_basic_async.py 
b/examples/monitor/monitor_basic_async.py new file mode 100644 index 00000000..137e621f --- /dev/null +++ b/examples/monitor/monitor_basic_async.py @@ -0,0 +1,61 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +import json +from scrapegraph_py import AsyncScrapeGraphAI, MonitorCreateRequest, JsonFormatConfig + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.monitor.create(MonitorCreateRequest( + url="https://time.is/", + name="Time Monitor", + interval="*/10 * * * *", + formats=[JsonFormatConfig( + prompt="Extract the current time", + schema={ + "type": "object", + "properties": { + "time": {"type": "string"}, + }, + "required": ["time"], + }, + )], + )) + + if res.status != "success" or not res.data: + print("Failed to create monitor:", res.error) + return + + monitor_id = res.data.cron_id + print(f"Monitor created: {monitor_id}") + print(f"Interval: {res.data.interval}") + print("\nPolling for activity (Ctrl+C to stop)...\n") + + seen_ids = set() + + try: + while True: + activity = await sgai.monitor.activity(monitor_id) + if activity.status == "success" and activity.data: + for tick in activity.data.ticks: + if tick.id in seen_ids: + continue + seen_ids.add(tick.id) + + changes = "CHANGED" if tick.changed else "no change" + print(f"[{tick.created_at}] {tick.status} - {changes} ({tick.elapsed_ms}ms)") + diffs = tick.diffs.model_dump(exclude_none=True) + if diffs: + print(f" Diffs: {json.dumps(diffs, indent=2)}") + elif tick.changed: + print(" (no diffs data)") + await asyncio.sleep(30) + except asyncio.CancelledError: + pass + + print("\nStopping monitor...") + await sgai.monitor.delete(monitor_id) + print("Monitor deleted") + +asyncio.run(main()) diff --git a/examples/monitor/monitor_with_webhook.py b/examples/monitor/monitor_with_webhook.py new file mode 100644 index 00000000..710865d6 --- /dev/null +++ b/examples/monitor/monitor_with_webhook.py @@ -0,0 +1,63 @@ +from dotenv import load_dotenv +load_dotenv() 
+ +import json +import signal +import time +from scrapegraph_py import ScrapeGraphAI, MonitorCreateRequest, JsonFormatConfig + +sgai = ScrapeGraphAI() + +res = sgai.monitor.create(MonitorCreateRequest( + url="https://time.is/", + name="Time Monitor with Webhook", + interval="*/10 * * * *", + webhook_url="https://your-webhook-endpoint.com/hook", + formats=[JsonFormatConfig( + prompt="Extract the current time", + schema={ + "type": "object", + "properties": { + "time": {"type": "string"}, + }, + "required": ["time"], + }, + )], +)) + +if res.status != "success" or not res.data: + print("Failed to create monitor:", res.error) + exit(1) + +monitor_id = res.data.cron_id +print(f"Monitor created: {monitor_id}") +print(f"Interval: {res.data.interval}") +print("Webhook configured") +print("\nPolling for activity (Ctrl+C to stop)...\n") + +def cleanup(_sig, _frame): + print("\nStopping monitor...") + sgai.monitor.delete(monitor_id) + print("Monitor deleted") + exit(0) + +signal.signal(signal.SIGINT, cleanup) + +seen_ids = set() + +while True: + activity = sgai.monitor.activity(monitor_id) + if activity.status == "success" and activity.data: + for tick in activity.data.ticks: + if tick.id in seen_ids: + continue + seen_ids.add(tick.id) + + changes = "CHANGED" if tick.changed else "no change" + print(f"[{tick.created_at}] {tick.status} - {changes} ({tick.elapsed_ms}ms)") + diffs = tick.diffs.model_dump(exclude_none=True) + if diffs: + print(f" Diffs: {json.dumps(diffs, indent=2)}") + elif tick.changed: + print(" (no diffs data)") + time.sleep(30) diff --git a/examples/monitor/monitor_with_webhook_async.py b/examples/monitor/monitor_with_webhook_async.py new file mode 100644 index 00000000..faac49d8 --- /dev/null +++ b/examples/monitor/monitor_with_webhook_async.py @@ -0,0 +1,63 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +import json +from scrapegraph_py import AsyncScrapeGraphAI, MonitorCreateRequest, JsonFormatConfig + +async def main(): + async with 
AsyncScrapeGraphAI() as sgai: + res = await sgai.monitor.create(MonitorCreateRequest( + url="https://time.is/", + name="Time Monitor with Webhook", + interval="*/10 * * * *", + webhook_url="https://your-webhook-endpoint.com/hook", + formats=[JsonFormatConfig( + prompt="Extract the current time", + schema={ + "type": "object", + "properties": { + "time": {"type": "string"}, + }, + "required": ["time"], + }, + )], + )) + + if res.status != "success" or not res.data: + print("Failed to create monitor:", res.error) + return + + monitor_id = res.data.cron_id + print(f"Monitor created: {monitor_id}") + print(f"Interval: {res.data.interval}") + print("Webhook configured") + print("\nPolling for activity (Ctrl+C to stop)...\n") + + seen_ids = set() + + try: + while True: + activity = await sgai.monitor.activity(monitor_id) + if activity.status == "success" and activity.data: + for tick in activity.data.ticks: + if tick.id in seen_ids: + continue + seen_ids.add(tick.id) + + changes = "CHANGED" if tick.changed else "no change" + print(f"[{tick.created_at}] {tick.status} - {changes} ({tick.elapsed_ms}ms)") + diffs = tick.diffs.model_dump(exclude_none=True) + if diffs: + print(f" Diffs: {json.dumps(diffs, indent=2)}") + elif tick.changed: + print(" (no diffs data)") + await asyncio.sleep(30) + except asyncio.CancelledError: + pass + + print("\nStopping monitor...") + await sgai.monitor.delete(monitor_id) + print("Monitor deleted") + +asyncio.run(main()) diff --git a/examples/scrape/scrape_basic.py b/examples/scrape/scrape_basic.py new file mode 100644 index 00000000..5c4c800c --- /dev/null +++ b/examples/scrape/scrape_basic.py @@ -0,0 +1,17 @@ +from dotenv import load_dotenv +load_dotenv() + +from scrapegraph_py import ScrapeGraphAI, ScrapeRequest, MarkdownFormatConfig + +sgai = ScrapeGraphAI() + +res = sgai.scrape(ScrapeRequest( + url="https://example.com", + formats=[MarkdownFormatConfig()], +)) + +if res.status == "success": + print("Markdown:", 
res.data.results.get("markdown", {}).get("data")) + print(f"\nTook {res.elapsed_ms}ms") +else: + print("Failed:", res.error) diff --git a/examples/scrape/scrape_basic_async.py b/examples/scrape/scrape_basic_async.py new file mode 100644 index 00000000..2e2e2ce2 --- /dev/null +++ b/examples/scrape/scrape_basic_async.py @@ -0,0 +1,20 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI, ScrapeRequest, MarkdownFormatConfig + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.scrape(ScrapeRequest( + url="https://example.com", + formats=[MarkdownFormatConfig()], + )) + + if res.status == "success": + print("Markdown:", res.data.results.get("markdown", {}).get("data")) + print(f"\nTook {res.elapsed_ms}ms") + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/scrape/scrape_json_extraction.py b/examples/scrape/scrape_json_extraction.py new file mode 100644 index 00000000..7511f00e --- /dev/null +++ b/examples/scrape/scrape_json_extraction.py @@ -0,0 +1,44 @@ +from dotenv import load_dotenv +load_dotenv() + +import json +from scrapegraph_py import ScrapeGraphAI, ScrapeRequest, JsonFormatConfig + +sgai = ScrapeGraphAI() + +res = sgai.scrape(ScrapeRequest( + url="https://example.com", + formats=[ + JsonFormatConfig( + prompt="Extract the company name, tagline, and list of features", + schema={ + "type": "object", + "properties": { + "companyName": {"type": "string"}, + "tagline": {"type": "string"}, + "features": { + "type": "array", + "items": {"type": "string"}, + }, + }, + "required": ["companyName"], + }, + ), + ], +)) + +if res.status == "success": + json_result = res.data.results.get("json", {}) + + print("=== JSON Extraction ===\n") + print("Extracted data:") + print(json.dumps(json_result.get("data"), indent=2)) + + chunker = json_result.get("metadata", {}).get("chunker") + if chunker: + chunks = chunker.get("chunks", []) + print("\nChunker info:") 
+ print(" Chunks:", len(chunks)) + print(" Total size:", sum(c.get("size", 0) for c in chunks), "chars") +else: + print("Failed:", res.error) diff --git a/examples/scrape/scrape_json_extraction_async.py b/examples/scrape/scrape_json_extraction_async.py new file mode 100644 index 00000000..f61d0df5 --- /dev/null +++ b/examples/scrape/scrape_json_extraction_async.py @@ -0,0 +1,47 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +import json +from scrapegraph_py import AsyncScrapeGraphAI, ScrapeRequest, JsonFormatConfig + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.scrape(ScrapeRequest( + url="https://example.com", + formats=[ + JsonFormatConfig( + prompt="Extract the company name, tagline, and list of features", + schema={ + "type": "object", + "properties": { + "companyName": {"type": "string"}, + "tagline": {"type": "string"}, + "features": { + "type": "array", + "items": {"type": "string"}, + }, + }, + "required": ["companyName"], + }, + ), + ], + )) + + if res.status == "success": + json_result = res.data.results.get("json", {}) + + print("=== JSON Extraction ===\n") + print("Extracted data:") + print(json.dumps(json_result.get("data"), indent=2)) + + chunker = json_result.get("metadata", {}).get("chunker") + if chunker: + chunks = chunker.get("chunks", []) + print("\nChunker info:") + print(" Chunks:", len(chunks)) + print(" Total size:", sum(c.get("size", 0) for c in chunks), "chars") + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/scrape/scrape_multi_format.py b/examples/scrape/scrape_multi_format.py new file mode 100644 index 00000000..4e157287 --- /dev/null +++ b/examples/scrape/scrape_multi_format.py @@ -0,0 +1,40 @@ +from dotenv import load_dotenv +load_dotenv() + +from scrapegraph_py import ( + ScrapeGraphAI, + ScrapeRequest, + MarkdownFormatConfig, + LinksFormatConfig, + ScreenshotFormatConfig, +) + +sgai = ScrapeGraphAI() + +res = sgai.scrape(ScrapeRequest( + 
url="https://example.com", + formats=[ + MarkdownFormatConfig(), + LinksFormatConfig(), + ScreenshotFormatConfig(width=1280, height=720), + ], +)) + +if res.status == "success": + results = res.data.results + + print("=== Markdown ===") + print(results.get("markdown", {}).get("data", [""])[0][:500], "...") + + print("\n=== Links ===") + links = results.get("links", {}).get("data", []) + print(f"Found {len(links)} links") + for link in links[:5]: + print(f" - {link}") + + print("\n=== Screenshot ===") + screenshot = results.get("screenshot", {}).get("data", {}) + print(f"URL: {screenshot.get('url')}") + print(f"Size: {screenshot.get('width')}x{screenshot.get('height')}") +else: + print("Failed:", res.error) diff --git a/examples/scrape/scrape_multi_format_async.py b/examples/scrape/scrape_multi_format_async.py new file mode 100644 index 00000000..cb56891c --- /dev/null +++ b/examples/scrape/scrape_multi_format_async.py @@ -0,0 +1,43 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import ( + AsyncScrapeGraphAI, + ScrapeRequest, + MarkdownFormatConfig, + LinksFormatConfig, + ScreenshotFormatConfig, +) + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.scrape(ScrapeRequest( + url="https://example.com", + formats=[ + MarkdownFormatConfig(), + LinksFormatConfig(), + ScreenshotFormatConfig(width=1280, height=720), + ], + )) + + if res.status == "success": + results = res.data.results + + print("=== Markdown ===") + print(results.get("markdown", {}).get("data", [""])[0][:500], "...") + + print("\n=== Links ===") + links = results.get("links", {}).get("data", []) + print(f"Found {len(links)} links") + for link in links[:5]: + print(f" - {link}") + + print("\n=== Screenshot ===") + screenshot = results.get("screenshot", {}).get("data", {}) + print(f"URL: {screenshot.get('url')}") + print(f"Size: {screenshot.get('width')}x{screenshot.get('height')}") + else: + print("Failed:", res.error) + 
+asyncio.run(main()) diff --git a/examples/scrape/scrape_pdf.py b/examples/scrape/scrape_pdf.py new file mode 100644 index 00000000..ad4992dd --- /dev/null +++ b/examples/scrape/scrape_pdf.py @@ -0,0 +1,18 @@ +from dotenv import load_dotenv +load_dotenv() + +from scrapegraph_py import ScrapeGraphAI, ScrapeRequest, MarkdownFormatConfig + +sgai = ScrapeGraphAI() + +res = sgai.scrape(ScrapeRequest( + url="https://pdfobject.com/pdf/sample.pdf", + content_type="application/pdf", + formats=[MarkdownFormatConfig()], +)) + +if res.status == "success": + print("Markdown:", res.data.results.get("markdown", {}).get("data")) + print(f"\nTook {res.elapsed_ms}ms") +else: + print("Failed:", res.error) diff --git a/examples/scrape/scrape_pdf_async.py b/examples/scrape/scrape_pdf_async.py new file mode 100644 index 00000000..8ac100b4 --- /dev/null +++ b/examples/scrape/scrape_pdf_async.py @@ -0,0 +1,21 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI, ScrapeRequest, MarkdownFormatConfig + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.scrape(ScrapeRequest( + url="https://pdfobject.com/pdf/sample.pdf", + content_type="application/pdf", + formats=[MarkdownFormatConfig()], + )) + + if res.status == "success": + print("Markdown:", res.data.results.get("markdown", {}).get("data")) + print(f"\nTook {res.elapsed_ms}ms") + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/scrape/scrape_with_fetchconfig.py b/examples/scrape/scrape_with_fetchconfig.py new file mode 100644 index 00000000..bc3a89bb --- /dev/null +++ b/examples/scrape/scrape_with_fetchconfig.py @@ -0,0 +1,23 @@ +from dotenv import load_dotenv +load_dotenv() + +from scrapegraph_py import ScrapeGraphAI, ScrapeRequest, MarkdownFormatConfig, FetchConfig + +sgai = ScrapeGraphAI() + +res = sgai.scrape(ScrapeRequest( + url="https://example.com", + formats=[MarkdownFormatConfig()], + 
fetch_config=FetchConfig( + mode="js", + timeout=45000, + wait=2000, + stealth=True, + ), +)) + +if res.status == "success": + print("Markdown:", res.data.results.get("markdown", {}).get("data")) + print(f"\nTook {res.elapsed_ms}ms") +else: + print("Failed:", res.error) diff --git a/examples/scrape/scrape_with_fetchconfig_async.py b/examples/scrape/scrape_with_fetchconfig_async.py new file mode 100644 index 00000000..f0fafde7 --- /dev/null +++ b/examples/scrape/scrape_with_fetchconfig_async.py @@ -0,0 +1,26 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI, ScrapeRequest, MarkdownFormatConfig, FetchConfig + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.scrape(ScrapeRequest( + url="https://example.com", + formats=[MarkdownFormatConfig()], + fetch_config=FetchConfig( + mode="js", + timeout=45000, + wait=2000, + stealth=True, + ), + )) + + if res.status == "success": + print("Markdown:", res.data.results.get("markdown", {}).get("data")) + print(f"\nTook {res.elapsed_ms}ms") + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/search/search_basic.py b/examples/search/search_basic.py new file mode 100644 index 00000000..8a84ba9e --- /dev/null +++ b/examples/search/search_basic.py @@ -0,0 +1,19 @@ +from dotenv import load_dotenv +load_dotenv() + +from scrapegraph_py import ScrapeGraphAI, SearchRequest + +sgai = ScrapeGraphAI() + +res = sgai.search(SearchRequest( + query="best programming languages 2024", + num_results=3, +)) + +if res.status == "success": + for result in res.data.results: + print(f"\n{result.title}") + print(f"URL: {result.url}") + print(f"Content: {result.content[:200]}...") +else: + print("Failed:", res.error) diff --git a/examples/search/search_basic_async.py b/examples/search/search_basic_async.py new file mode 100644 index 00000000..cb758919 --- /dev/null +++ b/examples/search/search_basic_async.py @@ -0,0 +1,22 @@ +from 
dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI, SearchRequest + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.search(SearchRequest( + query="best programming languages 2024", + num_results=3, + )) + + if res.status == "success": + for result in res.data.results: + print(f"\n{result.title}") + print(f"URL: {result.url}") + print(f"Content: {result.content[:200]}...") + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/search/search_with_extraction.py b/examples/search/search_with_extraction.py new file mode 100644 index 00000000..5bb043c6 --- /dev/null +++ b/examples/search/search_with_extraction.py @@ -0,0 +1,39 @@ +from dotenv import load_dotenv +load_dotenv() + +import json +from scrapegraph_py import ScrapeGraphAI, SearchRequest + +sgai = ScrapeGraphAI() + +res = sgai.search(SearchRequest( + query="best programming languages 2024", + num_results=3, + prompt="Summarize the top programming languages mentioned and why they are recommended", + schema={ + "type": "object", + "properties": { + "languages": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": {"type": "string"}, + "reason": {"type": "string"}, + }, + }, + }, + }, + }, +)) + +if res.status == "success": + print("=== Search Results ===") + for result in res.data.results: + print(f"\n{result.title}") + print(f"URL: {result.url}") + + print("\n=== Extracted Summary ===") + print(json.dumps(res.data.json_data, indent=2)) +else: + print("Failed:", res.error) diff --git a/examples/search/search_with_extraction_async.py b/examples/search/search_with_extraction_async.py new file mode 100644 index 00000000..fc0487d8 --- /dev/null +++ b/examples/search/search_with_extraction_async.py @@ -0,0 +1,42 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +import json +from scrapegraph_py import AsyncScrapeGraphAI, SearchRequest + +async def main(): + 
async with AsyncScrapeGraphAI() as sgai: + res = await sgai.search(SearchRequest( + query="best programming languages 2024", + num_results=3, + prompt="Summarize the top programming languages mentioned and why they are recommended", + schema={ + "type": "object", + "properties": { + "languages": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": {"type": "string"}, + "reason": {"type": "string"}, + }, + }, + }, + }, + }, + )) + + if res.status == "success": + print("=== Search Results ===") + for result in res.data.results: + print(f"\n{result.title}") + print(f"URL: {result.url}") + + print("\n=== Extracted Summary ===") + print(json.dumps(res.data.json_data, indent=2)) + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/utilities/credits.py b/examples/utilities/credits.py new file mode 100644 index 00000000..d240c573 --- /dev/null +++ b/examples/utilities/credits.py @@ -0,0 +1,18 @@ +from dotenv import load_dotenv +load_dotenv() + +from scrapegraph_py import ScrapeGraphAI + +sgai = ScrapeGraphAI() + +res = sgai.credits() + +if res.status == "success": + print("Plan:", res.data.plan) + print("Remaining credits:", res.data.remaining) + print("Used credits:", res.data.used) + print("\nJob limits:") + print(" Crawl:", res.data.jobs.crawl.used, "/", res.data.jobs.crawl.limit) + print(" Monitor:", res.data.jobs.monitor.used, "/", res.data.jobs.monitor.limit) +else: + print("Failed:", res.error) diff --git a/examples/utilities/credits_async.py b/examples/utilities/credits_async.py new file mode 100644 index 00000000..1bcec401 --- /dev/null +++ b/examples/utilities/credits_async.py @@ -0,0 +1,21 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.credits() + + if res.status == "success": + print("Plan:", res.data.plan) + print("Remaining credits:", 
res.data.remaining) + print("Used credits:", res.data.used) + print("\nJob limits:") + print(" Crawl:", res.data.jobs.crawl.used, "/", res.data.jobs.crawl.limit) + print(" Monitor:", res.data.jobs.monitor.used, "/", res.data.jobs.monitor.limit) + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/utilities/health.py b/examples/utilities/health.py new file mode 100644 index 00000000..e723b1b9 --- /dev/null +++ b/examples/utilities/health.py @@ -0,0 +1,18 @@ +from dotenv import load_dotenv +load_dotenv() + +from scrapegraph_py import ScrapeGraphAI + +sgai = ScrapeGraphAI() + +res = sgai.health() + +if res.status == "success": + print("Status:", res.data.status) + print("Uptime:", res.data.uptime, "seconds") + if res.data.services: + print("Services:") + print(" Redis:", res.data.services.redis) + print(" DB:", res.data.services.db) +else: + print("Failed:", res.error) diff --git a/examples/utilities/health_async.py b/examples/utilities/health_async.py new file mode 100644 index 00000000..f29678ef --- /dev/null +++ b/examples/utilities/health_async.py @@ -0,0 +1,21 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.health() + + if res.status == "success": + print("Status:", res.data.status) + print("Uptime:", res.data.uptime, "seconds") + if res.data.services: + print("Services:") + print(" Redis:", res.data.services.redis) + print(" DB:", res.data.services.db) + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/examples/utilities/history.py b/examples/utilities/history.py new file mode 100644 index 00000000..cd91e3c7 --- /dev/null +++ b/examples/utilities/history.py @@ -0,0 +1,22 @@ +from dotenv import load_dotenv +load_dotenv() + +from scrapegraph_py import ScrapeGraphAI, HistoryFilter + +sgai = ScrapeGraphAI() + +res = sgai.history.list(HistoryFilter(limit=5)) + +if 
res.status == "success": + data = res.data + print(f"Total: {data.pagination.total}") + print(f"Page: {data.pagination.page} / {max(1, (data.pagination.total + data.pagination.limit - 1) // data.pagination.limit)}") + + for entry in data.data: + print(f"\n ID: {entry.id}") + print(f" Service: {entry.service}") + print(f" Status: {entry.status}") + print(f" Created: {entry.created_at}") + print(f" Elapsed: {entry.elapsed_ms}ms") +else: + print("Failed:", res.error) diff --git a/examples/utilities/history_async.py b/examples/utilities/history_async.py new file mode 100644 index 00000000..8fc7f284 --- /dev/null +++ b/examples/utilities/history_async.py @@ -0,0 +1,25 @@ +from dotenv import load_dotenv +load_dotenv() + +import asyncio +from scrapegraph_py import AsyncScrapeGraphAI, HistoryFilter + +async def main(): + async with AsyncScrapeGraphAI() as sgai: + res = await sgai.history.list(HistoryFilter(limit=5)) + + if res.status == "success": + data = res.data + print(f"Total: {data.pagination.total}") + print(f"Page: {data.pagination.page} / {max(1, (data.pagination.total + data.pagination.limit - 1) // data.pagination.limit)}") + + for entry in data.data: + print(f"\n ID: {entry.id}") + print(f" Service: {entry.service}") + print(f" Status: {entry.status}") + print(f" Created: {entry.created_at}") + print(f" Elapsed: {entry.elapsed_ms}ms") + else: + print("Failed:", res.error) + +asyncio.run(main()) diff --git a/media/banner.png b/media/banner.png new file mode 100644 index 00000000..8b06be50 Binary files /dev/null and b/media/banner.png differ diff --git a/pyproject.toml b/pyproject.toml new file mode 100644 index 00000000..b28439ec --- /dev/null +++ b/pyproject.toml @@ -0,0 +1,57 @@ +[project] +name = "scrapegraph-py" +version = "2.0.0" +description = "Official Python SDK for ScrapeGraph AI API" +readme = "README.md" +license = "MIT" +authors = [ + { name = "ScrapeGraph AI", email = "support@scrapegraphai.com" } +] +requires-python = ">=3.12" +dependencies = [ + "httpx>=0.27.0", + "pydantic>=2.0.0", +] +keywords = ["scraping", 
"ai", "web-scraping", "api", "sdk"] +classifiers = [ + "Development Status :: 4 - Beta", + "Intended Audience :: Developers", + "License :: OSI Approved :: MIT License", + "Programming Language :: Python :: 3", + "Programming Language :: Python :: 3.12", + "Programming Language :: Python :: 3.13", + "Topic :: Internet :: WWW/HTTP :: Indexing/Search", + "Topic :: Software Development :: Libraries :: Python Modules", + "Typing :: Typed", +] + +[project.urls] +Homepage = "https://scrapegraphai.com" +Documentation = "https://docs.scrapegraphai.com" +Repository = "https://github.com/ScrapeGraphAI/scrapegraph-py" +Issues = "https://github.com/ScrapeGraphAI/scrapegraph-py/issues" + +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" + +[tool.hatch.build.targets.wheel] +packages = ["src/scrapegraph_py"] + +[dependency-groups] +dev = [ + "pytest>=8.0.0", + "pytest-asyncio>=0.23.0", + "python-dotenv>=1.0.0", + "ruff>=0.4.0", +] + +[tool.ruff] +line-length = 100 + +[tool.ruff.lint] +select = ["E", "F", "I"] +ignore = ["E501"] + +[tool.ruff.lint.per-file-ignores] +"tests/*" = ["F841", "E402"] diff --git a/scrapegraph-py/.gitignore b/scrapegraph-py/.gitignore deleted file mode 100644 index d66869b7..00000000 --- a/scrapegraph-py/.gitignore +++ /dev/null @@ -1,149 +0,0 @@ -# Byte-compiled / optimized / DLL files -__pycache__/ -*.py[cod] -*$py.class - -# C extensions -*.so - -# Distribution / packaging -.Python -build/ -develop-eggs/ -dist/ -downloads/ -eggs/ -.eggs/ -lib/ -lib64/ -parts/ -sdist/ -var/ -wheels/ -share/python-wheels/ -*.egg-info/ -.installed.cfg -*.egg -MANIFEST - -# PyInstaller -*.manifest -*.spec - -# Installer logs -pip-log.txt -pip-delete-this-directory.txt - -# Unit test / coverage reports -htmlcov/ -.tox/ -.nox/ -.coverage -.coverage.* -.cache -nosetests.xml -coverage.xml -*.cover -*.py,cover -.hypothesis/ -.pytest_cache/ -.ruff_cache/ -cover/ - -# Translations -*.mo -*.pot - -# Django stuff: -*.log -local_settings.py -db.sqlite3 
-db.sqlite3-journal - -# Flask stuff: -instance/ -.webassets-cache - -# Scrapy stuff: -.scrapy - -# Sphinx documentation -docs/_build/ - -# PyBuilder -.pybuilder/ -target/ - -# Jupyter Notebook -.ipynb_checkpoints - -# IPython -profile_default/ -ipython_config.py - -# pyenv -.python-version - -# pipenv -Pipfile.lock - -# poetry -poetry.lock - -# pdm -pdm.lock -.pdm.toml - -# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm -__pypackages__/ - -# Celery stuff -celerybeat-schedule -celerybeat.pid - -# SageMath parsed files -*.sage.py - -# Environments -.env -.venv -env/ -venv/ -ENV/ -env.bak/ -venv.bak/ - -# Spyder project settings -.spyderproject -.spyproject - -# Rope project settings -.ropeproject - -# mkdocs documentation -/site - -# mypy -.mypy_cache/ -.dmypy.json -dmypy.json - -# Pyre type checker -.pyre/ - -# pytype static type analyzer -.pytype/ - -# Cython debug symbols -cython_debug/ - -# PyCharm -.idea/ - -# VS Code -.vscode/ - -# macOS -.DS_Store - -dev.ipynb diff --git a/scrapegraph-py/.pre-commit-config.yaml b/scrapegraph-py/.pre-commit-config.yaml deleted file mode 100644 index 2481504a..00000000 --- a/scrapegraph-py/.pre-commit-config.yaml +++ /dev/null @@ -1,23 +0,0 @@ -repos: - - repo: https://github.com/psf/black - rev: 24.8.0 - hooks: - - id: black - - - repo: https://github.com/charliermarsh/ruff-pre-commit - rev: v0.6.9 - hooks: - - id: ruff - - - repo: https://github.com/pycqa/isort - rev: 5.13.2 - hooks: - - id: isort - - - repo: https://github.com/pre-commit/pre-commit-hooks - rev: v4.6.0 - hooks: - - id: trailing-whitespace - - id: end-of-file-fixer - - id: check-yaml - exclude: mkdocs.yml diff --git a/scrapegraph-py/.releaserc.yml b/scrapegraph-py/.releaserc.yml deleted file mode 100644 index be6ea7c1..00000000 --- a/scrapegraph-py/.releaserc.yml +++ /dev/null @@ -1,58 +0,0 @@ -plugins: - - - "@semantic-release/commit-analyzer" - - preset: conventionalcommits - - - 
"@semantic-release/release-notes-generator" - - writerOpts: - commitsSort: - - subject - - scope - preset: conventionalcommits - presetConfig: - types: - - type: feat - section: Features - - type: fix - section: Bug Fixes - - type: chore - section: chore - - type: docs - section: Docs - - type: style - hidden: true - - type: refactor - section: Refactor - - type: perf - section: Perf - - type: test - section: Test - - type: build - section: Build - - type: ci - section: CI - - "@semantic-release/changelog" - - - "semantic-release-pypi" - - buildCommand: "cd scrapegraph-py && rye build" - distDirectory: "scrapegraph-py/dist" - packageDirectory: "scrapegraph-py" - - "@semantic-release/github" - - - "@semantic-release/git" - - assets: - - CHANGELOG.md - - scrapegraph-py/pyproject.toml - message: |- - ci(release): ${nextRelease.version} [skip ci] - - ${nextRelease.notes} -branches: - #child branches coming from tagged version for bugfix (1.1.x) or new features (1.x) - #maintenance branch - - name: "+([0-9])?(.{+([0-9]),x}).x" - channel: "stable" - #release a production version when merging towards main - - name: "main" - channel: "stable" - #prerelease branch - - name: "pre/beta" - channel: "dev" - prerelease: "beta" -debug: true diff --git a/scrapegraph-py/CHANGELOG.md b/scrapegraph-py/CHANGELOG.md deleted file mode 100644 index 90320194..00000000 --- a/scrapegraph-py/CHANGELOG.md +++ /dev/null @@ -1,2453 +0,0 @@ -## [1.46.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.45.0...v1.46.0) (2026-01-26) - - -### Features - -* add breadth ([5d996c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5d996c50692f77a4330512a781f3ff70a46ddfab)) - -## [1.45.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.44.1...v1.45.0) (2026-01-23) - - -### Features - -* add webhook_url parameter to crawler endpoint ([fd55fbc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55fbcd2b4b94ef1c7840b4f81c14bf101fc2f6)) - -## 
[1.44.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.44.0...v1.44.1) (2026-01-17) - - -### Bug Fixes - -* update readme ([11e6c2c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11e6c2cafa7d8afebc1a2392c4d35f382cded1d6)) - -## [1.44.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.43.0...v1.44.0) (2025-11-28) - - -### Features - -* integrate Toonify library for token-efficient responses ([c094530](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c094530afd3029717d69ecb521b8198413c14a49)) - -## [1.43.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.42.0...v1.43.0) (2025-11-26) - - -### Features - -* add rendering to the sdk ([3c9770a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3c9770a76f3d4eb60761196455b1a026747e80fb)) - -## [1.42.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.41.1...v1.42.0) (2025-11-21) - - -### Features - -* refactoring of the dependencies ([114372d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/114372d990d930f58acf76fe65afd1de4ba07b53)) - -## [1.41.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.41.0...v1.41.1) (2025-11-14) - - -### chore - -* **security:** sanitize example cookies and test literals to avoid secret detection ([6727e95](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6727e958cdd487836bf59f8738f43dd0b2353522)) - - -### Refactor - -* update console log formatting in advanced and complete agentic scraper examples to use repeat method ([997ae16](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/997ae16f92e2e69b0807a04d87aa7d2edf0342ea)) - -## [1.41.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.40.0...v1.41.0) (2025-11-04) - - -### Features - -* add health endpoint ([6edfe67](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6edfe67000627f9b3d435e8d9f7b660e467e54b6)) -* update health endpoint 
([3e993b3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3e993b3de38bd6358bc4c01252df444895ec5bf6)) - -## [1.40.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.39.0...v1.40.0) (2025-11-04) - - -### Features - -* add markdown for smartscraper ([9d868b5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d868b52b2f72cac3fdb13ad0b711e76b1eb3367)) - -## [1.39.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.38.0...v1.39.0) (2025-11-03) - - -### Features - -* refactoring local scraping ([76f446f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76f446f1cea174ed5342afe5fb39c6ceee2dec98)) - -## [1.38.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.37.0...v1.38.0) (2025-10-23) - - -### Features - -* update js rendering ([07a898c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/07a898c69e78d8b7af0242d995b42bd74928b94d)) - -## [1.37.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.36.0...v1.37.0) (2025-10-23) - - -### Features - -* update render_heavy_js ([2a02cb9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a02cb988fc860352bc679fc7cbc6a0c2dc85b4b)) - -## [1.36.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.35.0...v1.36.0) (2025-10-16) - - -### Features - -* add stealth mode ([0d658c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0d658c18ca50588fb165a8e57cd761bef0fbf318)) - -## [1.35.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.34.0...v1.35.0) (2025-10-15) - - -### Features - -* add docstring ([8984f4b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8984f4b36849a078fc05da8abdf66d646216c022)) - -## [1.34.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.33.0...v1.34.0) (2025-10-08) - - -### Features - -* add sitemap endpoint ([f5e907e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f5e907ed5de2974817d9fb9bba7b66a0a209b24e)) -* add sitemap example 
([e07cd76](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e07cd769e6baaff4961f48723e7703cab48e61e6)) - - -### Docs - -* update for agentic doc oriented development ([d0a10e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d0a10e5b1e838a6867317955253a7f5696593601)) - -## [1.33.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.32.0...v1.33.0) (2025-10-06) - - -### Features - -* add examples ([c260512](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c26051253a11b308b053518a66c2db4e5ccbbb38)) -* add generate schema ([5f3ccf2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f3ccf27c0e0b3b03071900eb19bdaf11b8d1bf6)) - -## [1.32.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.31.0...v1.32.0) (2025-10-06) - - -### Features - -* update smartcrawler with sitemamp functionalities ([73e1e42](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/73e1e42361d80f7042c5082722f1ac49724b6cde)) - -## [1.31.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.30.0...v1.31.0) (2025-09-17) - - -### Features - -* add scrape endpoint ([43d1bf6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/43d1bf6877e9ae132f40034888c8fd92946877ec)) - -## [1.30.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.29.0...v1.30.0) (2025-09-17) - - -### Features - -* add md mode in searchscraper ([c6b513a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c6b513a67413582e107451739f8eef23efa7c9d0)) - -## [1.29.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.28.0...v1.29.0) (2025-09-16) - - -### Features - -* add render_heavy_js ([8d6994f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8d6994f0885d521cedcd81cdf8e7510453a1eac6)) - -## [1.28.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.27.0...v1.28.0) (2025-09-16) - - -### Features - -* add render_heavy ([6874ed9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6874ed9a8fb01dad237db59dfda62814c6b6f750)) - -## 
[1.27.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.26.0...v1.27.0) (2025-09-14) - - -### Features - -* add user agent ([00bb21b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/00bb21b72a5c859fc610b3a31ddfc42907c02eda)) - -## [1.26.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.25.1...v1.26.0) (2025-09-11) - - -### Features - -* refactoring of the example folder ([78f2318](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/78f23184626061b22dfde6d6b5b3c3df93f2a73a)) - -## [1.25.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.25.0...v1.25.1) (2025-09-08) - - -### Bug Fixes - -* removed unused name ([1147570](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1147570d751d1aba75ce274d6c07261d44b5f829)) -* scheduled jobs ([6f5cbf3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f5cbf3335dd7dfffa4803dd33c07c678a56ef0e)) - -## [1.25.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.24.0...v1.25.0) (2025-09-08) - - -### Features - -* add cron jobs ([d22e8eb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d22e8ebbc3bb978217ae5c486ab8bddd586b24f3)) - -## [1.24.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.23.0...v1.24.0) (2025-09-03) - - -### Features - -* add new mock ([db6a5ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/db6a5ea8baac250f70358c9ffc2a5ceb4d206993)) - -## [1.23.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.22.0...v1.23.0) (2025-09-01) - - -### Features - -* add examples ([710129e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/710129e24e6e0d1bb2aa9576c20b4a0e2d483be3)) -* add files ([af63d00](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/af63d0034cb7b2ce50123bc4698fe94a6b628cfe)) - -## [1.22.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.21.0...v1.22.0) (2025-09-01) - - -### Features - -* add files 
([e056508](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e056508fd3b706ba40e2ab7cffd26e58f3f0c0ae)) - -## [1.21.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.20.0...v1.21.0) (2025-09-01) - - -### Features - -* add files for htmlify ([73539a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/73539a0c486baad7c479ee2fa488e3571ca369df)) -* add js files ([013f3ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/013f3ceb932fa95c8d559a16cd9c78260d844fd8)) -* rebranding ([cf5a26d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cf5a26dc7a8c7d135fb5ce8f6c7eea9cca25b15b)) - -## [1.20.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.19.0...v1.20.0) (2025-08-19) - - -### Features - -* add examples and md mode ([a02cb1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a02cb1e6c31942c05a633aa5abddc5d3a9e105d9)) - -## [1.19.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.18.2...v1.19.0) (2025-08-18) - - -### Features - -* add file for python ([2010ea1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2010ea14673a70fdf1aad3ef9b0485ee00a8bf1b)) -* add files ([bf3d42d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bf3d42d7534121f21e8cfec267760320c404a7f1)) - - -### chore - -* add tests ([b059fd6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b059fd6112b7f391b2b82f10255f27949c0767c7)) - -## [1.18.2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.18.1...v1.18.2) (2025-08-06) - - -### Bug Fixes - -* removedunused imports ([ad55f88](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad55f886a08bbf8644e82fd40f4764022ba8cba7)) - -## [1.18.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.18.0...v1.18.1) (2025-08-06) - - -### Bug Fixes - -* linting errors ([e88a710](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88a710a6a7c63c68e5ae8f4b632732c6c28f8ce)) - -## 
[1.18.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.17.0...v1.18.0) (2025-08-05) - - -### Features - -* add crawling markdown ([e5d573b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e5d573b98c4ecbfeaec794f7bbfb89f429d4da26)) -* add js files ([1a9053e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1a9053e3d702e2c447c14d8142a37dc382f60cd0)) -* add tests ([7a50122](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a50122230e77a1e319534bebc6ea28deb7eed5d)) - -## [1.17.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.16.0...v1.17.0) (2025-07-30) - - -### Features - -* update crawl integrarion ([f1e9efe](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f1e9efe20107ab15496343a5d3de19ff56946b55)) - -## [1.16.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.15.0...v1.16.0) (2025-07-21) - - -### Features - -* add cookies integration ([043d3b3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/043d3b3284e8d89c434cc86f3e4f2f26b2b2a1b7)) -* add js ([767f810](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/767f81018d7060043bcd4420446a57acabff7bf7)) - -## [1.15.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.14.2...v1.15.0) (2025-07-18) - - -### Features - -* add examples in javascript ([16426de](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/16426de3bf24a153574979c5b1d0cbf84c53ca09)) -* add python integration ([5f5ec1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f5ec1b9fcf4c62c50ccea29519c9e6f71efc4a0)) -* update examples ([2e477d1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2e477d1b550a5044e58b8cfab66d70146544e357)) - -## [1.14.2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.14.1...v1.14.2) (2025-07-12) - - -### Bug Fixes - -* broken sdk ([b2890d5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2890d5128b167d181280bfeb846f2e22ef00d54)) - -## 
[1.14.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.14.0...v1.14.1) (2025-07-08) - - -### Bug Fixes - -* pyproject ([a3b3121](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a3b312161762bcd4aec898c4bf360fb0dedbc637)) - -## [1.14.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.13.0...v1.14.0) (2025-07-08) - - -### Features - -* update a tag ([2a4b6aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a4b6aa73c9672dde558b2f487b9fa0637838478)) - -## [1.1.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.0.0...v1.1.0) (2025-07-08) - - -### Features - -* update lock file ([4ae7aa1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4ae7aa1687e2fdcae145ecb3375ac8ef4a83b411)) - -## 1.0.0 (2025-07-02) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add crawling endpoint ([4cf4ea6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4cf4ea67915e7dbb27dae6d3fa0a71719b28dfec)) -* add docstring ([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add infinite scrolling ([928fb9b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/928fb9b274b55caaec3024bf1c3ca5b865120aa2)) -* add infinte scrolling ([3166542](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3166542005eeae2b9fd9e5aaad0abc1966ec4abc)) -* add integration for env variables ([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper ([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql ([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api 
([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add search number example ([4e93394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4e93394a2a010cb0459abd7c5cc9aa68d7bc8c8c)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk ([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? 
([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? ([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper ([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release ([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release 
([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release ([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* update version ([4d1851d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d1851dae4825570f4ba74381595037512a8103b)) -* update versions ([282aa5d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/282aa5d7c98d6e0e9a3781cb6be477d1eed23bcf)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout ([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility 
([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert ([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 ([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos ([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version 
([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk ([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information ([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting with the package.json file in the scrapegraph-js folder. ([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* update ([527539b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/527539b01dcfc30f22bd9ca1c356613459c35569)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency 
([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix _make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor examples ([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference 
([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) -* added langchain-scrapegraph examples ([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples ([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* **cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme 
([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo ([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 ([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme ([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring ([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* simplify infinite scroll config by replacing scroll_options with number_of_scrolls parameter ([ffc32ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ffc32ce05a5c3546579142181466e34c3027ec67)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for 
scrapegraph-py/tests/test_localscraper.py ([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py ([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py ([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.0.0 [skip ci] ([5217502](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52175023e4735aed6a1f8d6532779e0a4c81b2a5)) -* **release:** 1.0.0 [skip ci] ([6ce94d6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6ce94d69309d5cab85a57817a46e1d5363fc3588)) -* **release:** 1.0.0 [skip ci] ([7fef9f1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7fef9f10cfb8ea8f7f52dcb543571937848b393c)) -* **release:** 1.0.0 [skip ci] ([99af971](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/99af971aa0df717ae2f8c4148d60c45870926100)) -* **release:** 1.0.0 [skip ci] ([46756ac](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/46756ac26d7b540e523f3a23afe688673293e8c6)) -* **release:** 1.0.0 [skip ci] ([635f7d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/635f7d09b172e6b00cd498ebf2d95799c82e0821)) -* **release:** 1.0.0 [skip ci] ([763e52b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/763e52bdf696192eb8f0143f3e97ccd40ae0bb8c)) -* **release:** 1.0.0 [skip ci] ([0a7a968](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0a7a96864fbe38f8b2b2887807415f8869d96c65)) -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] 
([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] ([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] ([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip ci] ([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] ([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) -* **release:** 1.4.1 [skip ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 
1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* **release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) -* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] ([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] ([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] 
([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] ([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## 1.0.0 (2025-07-02) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add crawling endpoint ([4cf4ea6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4cf4ea67915e7dbb27dae6d3fa0a71719b28dfec)) -* add docstring ([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add infinite scrolling ([928fb9b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/928fb9b274b55caaec3024bf1c3ca5b865120aa2)) -* add infinte scrolling ([3166542](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3166542005eeae2b9fd9e5aaad0abc1966ec4abc)) -* add integration for env variables ([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper ([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql ([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api ([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add 
markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add search number example ([4e93394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4e93394a2a010cb0459abd7c5cc9aa68d7bc8c8c)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk ([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? 
([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? ([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper ([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release ([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release 
([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release ([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* update version ([4d1851d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d1851dae4825570f4ba74381595037512a8103b)) -* update versions ([282aa5d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/282aa5d7c98d6e0e9a3781cb6be477d1eed23bcf)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout ([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility 
([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert ([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 ([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos ([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version 
([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk ([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information ([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting with the package.json file in the scrapegraph-js folder. ([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency ([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed 
pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix _make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor examples ([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference ([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added 
cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) -* added langchain-scrapegraph examples ([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples ([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* **cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme ([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo 
([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 ([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme ([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring ([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* simplify infinite scroll config by replacing scroll_options with number_of_scrolls parameter ([ffc32ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ffc32ce05a5c3546579142181466e34c3027ec67)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for scrapegraph-py/tests/test_localscraper.py 
([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py ([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py ([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.0.0 [skip ci] ([6ce94d6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6ce94d69309d5cab85a57817a46e1d5363fc3588)) -* **release:** 1.0.0 [skip ci] ([7fef9f1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7fef9f10cfb8ea8f7f52dcb543571937848b393c)) -* **release:** 1.0.0 [skip ci] ([99af971](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/99af971aa0df717ae2f8c4148d60c45870926100)) -* **release:** 1.0.0 [skip ci] ([46756ac](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/46756ac26d7b540e523f3a23afe688673293e8c6)) -* **release:** 1.0.0 [skip ci] ([635f7d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/635f7d09b172e6b00cd498ebf2d95799c82e0821)) -* **release:** 1.0.0 [skip ci] ([763e52b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/763e52bdf696192eb8f0143f3e97ccd40ae0bb8c)) -* **release:** 1.0.0 [skip ci] ([0a7a968](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0a7a96864fbe38f8b2b2887807415f8869d96c65)) -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] ([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] 
([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] ([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip ci] ([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] ([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) -* **release:** 1.4.1 [skip ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 
1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* **release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) -* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] ([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] ([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] ([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] 
([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## 1.0.0 (2025-07-02) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add crawling endpoint ([4cf4ea6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4cf4ea67915e7dbb27dae6d3fa0a71719b28dfec)) -* add docstring ([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add infinite scrolling ([928fb9b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/928fb9b274b55caaec3024bf1c3ca5b865120aa2)) -* add infinte scrolling ([3166542](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3166542005eeae2b9fd9e5aaad0abc1966ec4abc)) -* add integration for env variables ([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper ([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql ([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api ([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add 
markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add search number example ([4e93394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4e93394a2a010cb0459abd7c5cc9aa68d7bc8c8c)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk ([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? ([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? 
([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper ([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release ([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release ([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release 
([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout ([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility ([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert ([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 ([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos 
([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version ([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk ([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information ([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting 
with the package.json file in the scrapegraph-js folder. ([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency ([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix _make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor 
examples ([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference ([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) -* added langchain-scrapegraph examples ([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples 
([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* **cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme ([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo ([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 ([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme 
([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring ([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* simplify infinite scroll config by replacing scroll_options with number_of_scrolls parameter ([ffc32ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ffc32ce05a5c3546579142181466e34c3027ec67)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for scrapegraph-py/tests/test_localscraper.py ([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py ([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py ([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.0.0 [skip ci] ([7fef9f1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7fef9f10cfb8ea8f7f52dcb543571937848b393c)) -* **release:** 1.0.0 [skip ci] 
([99af971](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/99af971aa0df717ae2f8c4148d60c45870926100)) -* **release:** 1.0.0 [skip ci] ([46756ac](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/46756ac26d7b540e523f3a23afe688673293e8c6)) -* **release:** 1.0.0 [skip ci] ([635f7d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/635f7d09b172e6b00cd498ebf2d95799c82e0821)) -* **release:** 1.0.0 [skip ci] ([763e52b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/763e52bdf696192eb8f0143f3e97ccd40ae0bb8c)) -* **release:** 1.0.0 [skip ci] ([0a7a968](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0a7a96864fbe38f8b2b2887807415f8869d96c65)) -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] ([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] ([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] ([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip 
ci] ([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] ([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) -* **release:** 1.4.1 [skip ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* 
**release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) -* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] ([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] ([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] ([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] ([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## 1.0.0 (2025-07-01) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add crawling endpoint ([4cf4ea6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4cf4ea67915e7dbb27dae6d3fa0a71719b28dfec)) -* add docstring ([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add infinite scrolling 
([928fb9b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/928fb9b274b55caaec3024bf1c3ca5b865120aa2)) -* add infinte scrolling ([3166542](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3166542005eeae2b9fd9e5aaad0abc1966ec4abc)) -* add integration for env variables ([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper ([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql ([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api ([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed 
SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk ([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? ([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? ([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper 
([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release ([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release ([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release ([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout 
([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility ([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert ([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 ([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos ([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* 
pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version ([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk ([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information ([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting with the package.json file in the scrapegraph-js folder. 
([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency ([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix _make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor examples 
([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference ([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) -* added langchain-scrapegraph examples ([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples ([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* 
**cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme ([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo ([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 ([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme ([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring 
([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* simplify infinite scroll config by replacing scroll_options with number_of_scrolls parameter ([ffc32ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ffc32ce05a5c3546579142181466e34c3027ec67)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for scrapegraph-py/tests/test_localscraper.py ([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py ([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py ([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.0.0 [skip ci] ([99af971](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/99af971aa0df717ae2f8c4148d60c45870926100)) -* **release:** 1.0.0 [skip ci] ([46756ac](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/46756ac26d7b540e523f3a23afe688673293e8c6)) -* **release:** 1.0.0 [skip ci] ([635f7d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/635f7d09b172e6b00cd498ebf2d95799c82e0821)) -* 
**release:** 1.0.0 [skip ci] ([763e52b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/763e52bdf696192eb8f0143f3e97ccd40ae0bb8c)) -* **release:** 1.0.0 [skip ci] ([0a7a968](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0a7a96864fbe38f8b2b2887807415f8869d96c65)) -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] ([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] ([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] ([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip ci] ([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] ([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) 
-* **release:** 1.4.1 [skip ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* **release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) -* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] 
([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] ([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] ([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] ([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## 1.0.0 (2025-07-01) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add crawling endpoint ([4cf4ea6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4cf4ea67915e7dbb27dae6d3fa0a71719b28dfec)) -* add docstring ([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add infinite scrolling ([928fb9b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/928fb9b274b55caaec3024bf1c3ca5b865120aa2)) -* add infinte scrolling ([3166542](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3166542005eeae2b9fd9e5aaad0abc1966ec4abc)) -* add integration for env variables 
([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper ([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql ([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api ([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk 
([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? ([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? ([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper ([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release 
([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release ([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release ([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout ([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility ([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert 
([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 ([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos ([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version ([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk 
([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information ([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting with the package.json file in the scrapegraph-js folder. ([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency ([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix 
_make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor examples ([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference ([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) 
-* added langchain-scrapegraph examples ([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples ([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* **cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme ([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo ([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 
([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme ([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring ([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* simplify infinite scroll config by replacing scroll_options with number_of_scrolls parameter ([ffc32ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ffc32ce05a5c3546579142181466e34c3027ec67)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for scrapegraph-py/tests/test_localscraper.py ([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py 
([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py ([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.0.0 [skip ci] ([46756ac](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/46756ac26d7b540e523f3a23afe688673293e8c6)) -* **release:** 1.0.0 [skip ci] ([635f7d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/635f7d09b172e6b00cd498ebf2d95799c82e0821)) -* **release:** 1.0.0 [skip ci] ([763e52b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/763e52bdf696192eb8f0143f3e97ccd40ae0bb8c)) -* **release:** 1.0.0 [skip ci] ([0a7a968](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0a7a96864fbe38f8b2b2887807415f8869d96c65)) -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] ([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] ([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] 
([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip ci] ([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] ([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) -* **release:** 1.4.1 [skip ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 
1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* **release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) -* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] ([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] ([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] ([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] ([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## 1.0.0 (2025-06-19) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add docstring 
([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add infinite scrolling ([928fb9b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/928fb9b274b55caaec3024bf1c3ca5b865120aa2)) -* add infinte scrolling ([3166542](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3166542005eeae2b9fd9e5aaad0abc1966ec4abc)) -* add integration for env variables ([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper ([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql ([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api ([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function 
using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk ([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? ([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? 
([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper ([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release ([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release ([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release 
([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout ([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility ([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert ([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 ([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos 
([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version ([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk ([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information ([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting 
with the package.json file in the scrapegraph-js folder. ([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency ([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix _make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor 
examples ([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference ([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) -* added langchain-scrapegraph examples ([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples 
([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* **cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme ([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo ([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 ([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme 
([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring ([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* simplify infinite scroll config by replacing scroll_options with number_of_scrolls parameter ([ffc32ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ffc32ce05a5c3546579142181466e34c3027ec67)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for scrapegraph-py/tests/test_localscraper.py ([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py ([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py ([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.0.0 [skip ci] ([635f7d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/635f7d09b172e6b00cd498ebf2d95799c82e0821)) -* **release:** 1.0.0 [skip ci] 
([763e52b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/763e52bdf696192eb8f0143f3e97ccd40ae0bb8c)) -* **release:** 1.0.0 [skip ci] ([0a7a968](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0a7a96864fbe38f8b2b2887807415f8869d96c65)) -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] ([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] ([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] ([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip ci] ([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] ([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) -* **release:** 1.4.1 [skip 
ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* **release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) -* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] 
([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] ([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] ([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] ([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## 1.0.0 (2025-06-19) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add docstring ([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add infinite scrolling ([928fb9b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/928fb9b274b55caaec3024bf1c3ca5b865120aa2)) -* add infinte scrolling ([3166542](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3166542005eeae2b9fd9e5aaad0abc1966ec4abc)) -* add integration for env variables ([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper 
([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql ([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api ([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk ([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? 
([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? ([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper ([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release ([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release 
([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release ([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout ([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility ([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert ([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 
([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos ([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version ([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk ([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information 
([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting with the package.json file in the scrapegraph-js folder. ([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency ([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix _make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* 
fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor examples ([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference ([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) -* added langchain-scrapegraph examples 
([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples ([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* **cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme ([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo ([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 
([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme ([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring ([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* simplify infinite scroll config by replacing scroll_options with number_of_scrolls parameter ([ffc32ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ffc32ce05a5c3546579142181466e34c3027ec67)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for scrapegraph-py/tests/test_localscraper.py ([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py 
([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py ([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.0.0 [skip ci] ([763e52b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/763e52bdf696192eb8f0143f3e97ccd40ae0bb8c)) -* **release:** 1.0.0 [skip ci] ([0a7a968](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0a7a96864fbe38f8b2b2887807415f8869d96c65)) -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] ([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] ([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] ([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip ci] 
([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] ([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) -* **release:** 1.4.1 [skip ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* 
**release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) -* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] ([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] ([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] ([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] ([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## 1.0.0 (2025-06-16) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add docstring ([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add infinite scrolling ([928fb9b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/928fb9b274b55caaec3024bf1c3ca5b865120aa2)) -* add integration for env variables 
([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper ([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql ([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api ([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk 
([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? ([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? ([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper ([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release 
([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release ([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release ([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout ([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility ([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert 
([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 ([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos ([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version ([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk 
([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information ([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting with the package.json file in the scrapegraph-js folder. ([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency ([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix 
_make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor examples ([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference ([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) 
-* added langchain-scrapegraph examples ([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples ([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* **cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme ([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo ([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 
([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme ([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring ([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* simplify infinite scroll config by replacing scroll_options with number_of_scrolls parameter ([ffc32ce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ffc32ce05a5c3546579142181466e34c3027ec67)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for scrapegraph-py/tests/test_localscraper.py ([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py 
([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py ([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.0.0 [skip ci] ([0a7a968](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0a7a96864fbe38f8b2b2887807415f8869d96c65)) -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] ([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] ([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] ([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip ci] ([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] 
([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) -* **release:** 1.4.1 [skip ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* **release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) 
-* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] ([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] ([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] ([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] ([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## 1.0.0 (2025-06-16) - - -### Features - -* add client integration ([5cbc551](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5cbc551092b33849fdbb1e1468eb35ba8b4f5c20)) -* add docstring ([04622dd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/04622dd39bb45223d266aab64ea086b2f1548425)) -* add integration for env variables ([2bf03a7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2bf03a7ca7a7936ad2c3b50ded4a6d90161fffa4)) -* add integration for local_scraper ([4f6402a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4f6402a94ebfa1b7534fc77ccef2deee5e9295d1)) -* add integration for sql 
([8ae60c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ae60c4cfcc070a0a7053862aafaf758e91f465f)) -* add integration for the api ([457a2aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/457a2aac6c9afcf4bbb06a99e35a7f5ca5ed5797)) -* add localScraper functionality ([7ee0bc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7ee0bc0778c1fc65b6f33bd76d0a4ca8735ce373)) -* add markdownify and localscraper ([675de86](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/675de867428efb01d8d9f8aedca34055bce9e974)) -* add markdownify functionality ([938274f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/938274f2f67d9e6bca212e6bebd6203349c6494c)) -* add optional headers to request ([246f10e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/246f10ef3b24649887a813b5a31d2ffc7b848b1b)) -* add requirement files ([65fe013](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/65fe013ff3859b53f17d097bad780038216797e3)) -* add scrapegraphai api integration ([382c347](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/382c347061f9a0690cfab09393c11fd5e0ebee70)) -* add time varying timeout ([12afa1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/12afa1d09858b04cb99b158ab2f9f1ea2c4967dd)) -* added example of the smartScraper function using a schema ([e79c300](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e79c30038ede5c0b6b6460ecc3d791be6b21b811)) -* changed SyncClient to Client ([89210bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/89210bf26ee8ee5fd15fd7994b3c1fb0b0ad185e)) -* check ([5f9b4ed](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5f9b4edc08e6325124dc335eb1e145cfb7394113)) -* enhaced python sdk ([e66e60d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e66e60de27b89c0ea0a9abcd097a001feb7e8147)) -* final release maybe semantic? 
([d096725](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d09672512605b2d8289abc017ef3c82147a69cd3)) -* fix ([d03013c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03013c5330d9b2728848655e373ea878eebf71d)) -* implemented search scraper functionality ([44834c2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/44834c2e3c8523fffe88c9bbd97009a846b5997c)) -* implemented support for requests with schema ([ad5f0b4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ad5f0b45d2dc183b63b27f5e5f4dd4b9801aa008)) -* maybe final release? ([40035f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/40035f3f0d9c8c2fcecbcd603397c38af405153a)) -* merged localscraper into smartscraper ([eaac552](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/eaac552493f4a245cfb2246713a8febf87851d05)) -* modified icons ([836faea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/836faea0975e7b1dcc13495a0c76c7d50cbedbaa)) -* refactoring of the folders ([e613e2e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e613e2e07c95a4e5348d1a74b8ba9f1f853a0911)) -* refctoring of the folder ([3085b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3085b5a74f748c4ce42fa6e02fd04029a4dc25a5)) -* removed local scraper ([021bf6d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/021bf6dcc6bd216cc8129b146e1f7892e52cf244)) -* revert to old release ([6565b3e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6565b3e937958fc0b897eb84456643e02d90790e)) -* searchscraper ([e281e0d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e281e0d798eccbe8114c75f8a3e2a2a4ab8cca25)) -* semantic relaase ([93759c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/93759c39b8f44ee1c51fac843544c93e87708760)) -* semantic release ([9613ba9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9613ba933274fe1bddb56339aae40617eaf46d65)) -* semantic release 
([956eceb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/956ecebc27dae00fa0b487f97eec8114dfc3a1bd)) -* semantic release ([0bc1358](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0bc135814e6ebb438313b4d466357b2e5631f09d)) -* splitted files ([5337d1e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5337d1e39f7625165066c4aada771a4eb25fa635)) -* test ([9bec234](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9bec2340b0e507409c6ae221a8eb0ea93178a82f)) -* test semantic release ([5dbb0cc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5dbb0cc1243be847f2d7dee4f6e3df0c6550d8aa)) -* test semantic release ([66c789e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/66c789e907d69b6c8a919a2c7d4a2c4a79826d3d)) -* test semantic release ([d63fdda](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d63fddaa3794677c862313b0058d34ddc358441d)) -* test semantic release ([682baa3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/682baa39695f564b684568d9a6bf23ecda00b5ec)) -* try semantic release ([d686c1f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d686c1ff1911885a553e68897efa95afcd09a503)) -* update doc readme ([e317b1b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e317b1b0135c0d0134846ea0b0b63552773cff45)) -* updated readmes ([d485cf0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d485cf0ee0bd331e5970a158636dcdb44db98d81)) - - -### Bug Fixes - -* .toml file ([31d9ad8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31d9ad8d65fde79022a9971c1b463ccd5452820a)) -* add enw timeout ([cfc565c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cfc565c5ad23f547138be0466820c1c2dee6aa47)) -* add new python compatibility ([45d24a6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/45d24a6a3d1c3090f1c575cf4fe6a8d80d637c38)) -* add revert ([b81ec1d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b81ec1d37d0a1635f444525e1e4a99823f5cea83)) -* come back to py 3.10 
([e10bb5c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e10bb5c0ed0cd93a36b97eb91d634db8aac575d7)) -* fixed configuration for ignored files ([76e1d0e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/76e1d0edbfbb796b87c3608610e4d4125cdf4bfd)) -* fixed HttpError messages ([935dac1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/935dac100185b3622aa2744a38a2d4ce740deaa5)) -* fixed schema example ([4be2bd0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4be2bd0310cb860864e7666d5613d1664818e505)) -* houses examples and typos ([e787776](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e787776125215bc5c9d40e6691c971d46651548e)) -* improve api desc ([87e2040](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/87e2040ce4fd090cf511f67048f6275502120ab7)) -* logger working properly now ([6c619c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6c619c11ea90c81e0054b36504cc3d9e62dce249)) -* make timeout optional ([09b0cc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09b0cc015d2b8f8848a625a7d75e36a5caf7b546)) -* minor fix version ([d05bb6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d05bb6a34a15d45ebce2056c89c146f4fcf5a35f)) -* pyproject ([d5005a0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d5005a00671148c22956eb52f4bedc369f9361c2)) -* pyproject ([d04f0aa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d04f0aa770ebe480f293b60db0c5883f2c39e0f3)) -* pyproject.toml ([1c2ae7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1c2ae7fc9ffc485c9d36020da3fcc90037ea3c98)) -* pyproject.toml ([a509471](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a5094710e63b903da61e359b9ea8f79bf57b48f2)) -* python version ([98b859d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/98b859dab0effc966d1731372750e14abb0373c8)) -* readme js sdk ([6f95f27](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6f95f2782354ab62ab2ad320e338c4be2701c20b)) -* removed wrong information 
([75ef97e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/75ef97eae8b31bac72a3e999e3423b8a455000f6)) -* semanti release 2 ([b008a3b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b008a3bc691b52be167edd1cbd9f0d1d689d0989)) -* semantic release ([4d230ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/4d230ae6b2404466b956c7a567223a03ff6ae448)) -* sync client ([8fee46f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fee46f7645c5b9e0cfa6d3d90b7d7e4e30567eb)) -* the "workspace" key has been removed because it was conflicting with the package.json file in the scrapegraph-js folder. ([15e590c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15e590ca23b3ecfeabd387af3eb7b42548337f87)) -* timeout ([57f6593](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/57f6593ee632595c39a9241009a0e71120baecc2)) -* updated comment ([62e2792](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62e2792174d5403f05c73aeb64bb515d722721d2)) -* updated env variable loading ([e259ed1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e259ed173f249c935e2de3c54831edf9fa954caa)) -* updated hatchling version ([2b41262](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2b412623aec22c3be37061347ec25e74ea8d6126)) - - -### chore - -* added dotenv pakage dependency ([e88abab](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e88abab18a3a375b6090790af4a1012381af164c)) -* added more information about the package ([a91198d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a91198de86e9be84b20aeac0e69eba81392ad39b)) -* added Zod package dependency ([49d3a59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49d3a59f916ac27170c3640775f0844063afd65a)) -* changed pakage name ([f93e49b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f93e49bff839b8de2c4f41c638c9c4df76592463)) -* fix _make_request not using it ([05f61ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05f61ea10a8183fc8863b1703fd4fbf6ca921c93)) -* 
fix pylint scripts ([7a593a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7a593a8a116d863d573f68d6e2282ba6c2204cbc)) -* fix pyproject version ([bc0c722](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc0c722986d65c500496b91f5cd8cec23b19189a)) -* fix semantic release, migrate to uv ([a70e1b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a70e1b7f86671e5d7a49c882b4c854d32c6b5944)) -* improved url validation ([25072a9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/25072a9976873e59169cd7a9bcce5797f5dcbfa3)) -* refactor examples ([85738d8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/85738d87117cf803e02c608b2476d24265ce65c6)) -* set up CI scripts ([7b33f8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b33f8f78e37c13eafcc0193fe7a2b2efb258cdf)) -* set up eslint and prettier for code linting and formatting ([9d54406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9d544066c08d6dca6e4dd764f665e56912bc4285)) -* update workflow scripts ([80ae3f7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/80ae3f761cc1b5cb860b9d75ee14920e37725cc0)) -* **tests:** updated tests ([b33f0b7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b33f0b7b602e29ea86ae2bfff7862279b5cca9ec)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([ae245c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ae245c83d70259a8eb5c626c21dfb3a0f6e76f62)) -* added api reference ([f87a7c8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f87a7c8fc3b39360f2339731af52b0b0766c80c2)) -* added api reference ([0cf5f3a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0cf5f3ae4b5b8704e86fc21b298a053f3bd9822e)) -* added cookbook reference ([54841e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/54841e5c0f705a64c4295e4fc8a414af0e62ca4f)) -* added langchain-scrapegraph examples 
([b9d771e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b9d771eab5b0eace6eb03f0075094e3cc51efce9)) -* added new image ([b710bd3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b710bd3d60603e3b720b1a811ad91f35b1bea832)) -* added open in colab badge ([9e5519c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e5519c55be9a20da0d4a08e363a66bfacc970be)) -* added two langchain-scrapegraph examples ([b18b14d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b18b14d41eab4cdd7ed8d6fdc7ceb5e9f8fa9b24)) -* added two new examples ([b3fd406](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b3fd406680ab9aab2d99640fbe5244a9ebb14263)) -* **cookbook:** added two new examples ([2a8fb8c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a8fb8c45af4ff5b03483c7031ab541a03e36b83)) -* added wired langgraph react agent ([69c82ea](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69c82ea3d068c677866cb062a4b0345073dce6de)) -* added zillow example ([1eb365c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1eb365c7cf41aa8b04c96bd2667a7bddff7f6220)) -* api reference ([2f3210c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2f3210cd37e40d29699fada48e754449e4b163e7)) -* fixed cookbook images and urls ([b743830](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b74383090261f3d242fc58c16d8071f6da05dc97)) -* github trending sdk ([f0890ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f0890efa79ca884d1825e4d47b92601d092d0080)) -* improved examples ([5245394](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/52453944c5be0a71459bb324731331b194346174)) -* improved main readme ([8e280c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e280c64a64bb3d36b19bff74f90ea305852aceb)) -* link typo ([d6038b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d6038b0e1ed636959633ec03524ff5cf1cad3164)) -* llama-index @VinciGit00 
([b847053](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b8470535f52730f6d1757912b420e35ef94688b4)) -* research agent ([628fdd5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/628fdd5c0b0fbe371648b8a0171461ff2d615257)) -* updated new documentation urls ([bd4cbf8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bd4cbf81e3c5740c1cc77b6204447cd7878c3978)) -* updated precommit and installation guide ([bca95c5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bca95c5d6f1bd8fff3d0303b562ac229c6936f5d)) -* updated readme ([2a1e643](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a1e64328ca4b20e9593b65517e7c5bf1fe43ffa)) - - -### Refactor - -* code refactoring ([197638b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/197638b7742854fceca756261b96e74024bdfc3f)) -* code refactoring ([1be81f0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1be81f0539c631659dcf7e90540cebdd8539ae6a)) -* code refactoring ([6270a6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6270a6e43782f9efbb407e71b1ca7c57c37db38a)) -* improved code structure ([d1a33dc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d1a33dcf87c401043b3ff7676638daadeca0f2c8)) -* renamed functions ([95719e3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/95719e3b4a2c78de381bfdc39a077c42fdddec05)) -* update readme ([fee30c3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fee30c3355ffee30fc1bffb56984085000df6192)) - - -### Test - -* Add coverage improvement test for scrapegraph-py/tests/test_localscraper.py ([84ba517](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/84ba51747898cec2bb74b4d6c4a5ea398b56bca7)) -* Add coverage improvement test for scrapegraph-py/tests/test_markdownify.py ([c39dbb0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c39dbb034ab89e1830da63f24f534ee070046c5d)) -* Add coverage improvement test for scrapegraph-py/tests/test_smartscraper.py 
([2216f6f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2216f6f309cac66f6688c0d84190d71c29290415)) - - -### CI - -* **release:** 1.1.0 [skip ci] ([fd55dc0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/fd55dc0d82f16dc277a9c45cf2e687245c4b76a2)) -* **release:** 1.10.0 [skip ci] ([69a5d7d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/69a5d7d66236050c2ba9c88fd53785f573d34fa2)) -* **release:** 1.10.1 [skip ci] ([48eb09d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/48eb09d406bd1dafb02bc9b001c6ef9752c4125a)) -* **release:** 1.10.2 [skip ci] ([9f2a50c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f2a50c940a70aa04b4484053ccae3a1cfb4148c)) -* **release:** 1.11.0 [skip ci] ([82fc505](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/82fc50507bb610f1059a12779a90d7d200c1759b)) -* **release:** 1.11.0-beta.1 [skip ci] ([e25a870](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e25a87013eb7be0db193d0093392eb78f3f1cfb6)) -* **release:** 1.12.0 [skip ci] ([15b9626](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/15b9626c11f84c60b496d70422a2df86e76d49a5)) -* **release:** 1.2.0 [skip ci] ([8ebd90b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ebd90b963de62c879910d7cf64f491bcf4c47f7)) -* **release:** 1.2.1 [skip ci] ([bc5c9a8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc5c9a8db64d0c792566f58d5265f6936edc5526)) -* **release:** 1.2.2 [skip ci] ([14dcf99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/14dcf99173a589557b7a2860716eedbee892b16b)) -* **release:** 1.3.0 [skip ci] ([daf43d0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/daf43d06761d85608b9599337e111da694a858a6)) -* **release:** 1.4.0 [skip ci] ([cb18d8f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cb18d8fb219dd55b6478ee33c051a40a091c4fd0)) -* **release:** 1.4.1 [skip ci] ([0b42489](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b424891395f0630c9886eb1a8a23232603e856f)) -* **release:** 
1.4.2 [skip ci] ([e4ad100](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e4ad100e3004a4b7c63f865d3f12398a080bb599)) -* **release:** 1.4.3 [skip ci] ([11a0edc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/11a0edc54b299eea60a382c2781d3d1ac0084c3f)) -* **release:** 1.4.3-beta.1 [skip ci] ([c4ba791](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c4ba791e45edfbff13332117f2583781354f90d6)) -* **release:** 1.4.3-beta.2 [skip ci] ([3110601](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/31106014d662745fa9afef6083f61452508b67fb)) -* **release:** 1.4.3-beta.3 [skip ci] ([b6f7589](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6f75899bd9f5d906cb5034b3a74c3ef46280537)) -* **release:** 1.5.0 [skip ci] ([c7c91bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c7c91bd7c6f3550089d1231b2167ca18921fd48f)) -* **release:** 1.5.0-beta.1 [skip ci] ([298fce2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/298fce2058f7f39546afa022a135b497b9d8024d)) -* **release:** 1.6.0 [skip ci] ([1b0fdce](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1b0fdce5827378dc80a5cd0a83e7444d50db79c1)) -* **release:** 1.6.0-beta.1 [skip ci] ([ba7588d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ba7588d978217d2c2fce5404d989c527fe63bb16)) -* **release:** 1.7.0 [skip ci] ([bb2847c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb2847ca1f7045c86b5fa26cb1c16422039bfafb)) -* **release:** 1.7.0-beta.1 [skip ci] ([aab21db](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/aab21db9230800707c7814b0702e7d1f70a6a4f4)) -* **release:** 1.8.0 [skip ci] ([8fa12bb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8fa12bbc56dbcb4976551818ae5c99132ac393b3)) -* **release:** 1.9.0 [skip ci] ([a21e331](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a21e3317a48632bbb442d352c47e5b155ee96d94)) -* **release:** 1.9.0-beta.1 [skip ci] 
([3173f66](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3173f661ff6d954a6059c8e899faba391cb51276)) -* **release:** 1.9.0-beta.2 [skip ci] ([c2fef9e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fef9e5405e16ba5d61a8b2fbf0b1c03c6fa306)) -* **release:** 1.9.0-beta.3 [skip ci] ([ca9fa71](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ca9fa71d2e68aafa2a438b659349e1fb4589ebdf)) -* **release:** 1.9.0-beta.4 [skip ci] ([b2e5ab1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b2e5ab167c0777449ac4974674abd294e0f7e41d)) -* **release:** 1.9.0-beta.5 [skip ci] ([604aea3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/604aea3c6aff087388d6014f0d1fcd7df0c66f69)) -* **release:** 1.9.0-beta.6 [skip ci] ([19c33b2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19c33b2e0d8160d78549878c64355e16702d406a)) -* **release:** 1.9.0-beta.7 [skip ci] ([c232796](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2327961096cd9f9ad3b0f54cf242a7d99ab11bc)) - -## [1.12.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.11.0...v1.12.0) (2025-02-05) - - -### Features - -* implemented search scraper functionality ([2c5a59b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2c5a59bd5cee46535aa1b157463db9164d7d42fb)) - - -### Bug Fixes - -* fixed HttpError messages ([869441b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/869441b156f49ed38bb95236f26d5b87139d6db0)) - - -### CI - -* **release:** 1.11.0-beta.1 [skip ci] ([2a62b40](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2a62b403df3481f5ad803bc192d3177beee79e35)) - -## [1.11.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.10.2...v1.11.0) (2025-02-03) -## [1.11.0-beta.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.10.2...v1.11.0-beta.1) (2025-02-03) - - -### Features - -* add optional headers to request ([bb851d7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb851d785d121b039d5e968327fb930955a3fd92)) -* merged 
localscraper into smartscraper ([503dbd1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/503dbd19b8cec4d2ff4575786b0eec25db2e80e6)) -* modified icons ([bcb9b0b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bcb9b0b731b057d242fdf80b43d96879ff7a2764)) -* searchscraper ([2e04e5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2e04e5a1bbd207a7ceeea594878bdea542a7a856)) -* updated readmes ([bfdbea0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bfdbea038918d79df2e3e9442e25d5f08bbccbbc)) - - -### chore - -* refactor examples ([8e00846](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e008465f7280c53e2faab7a92f02871ffc5b867)) -* **tests:** updated tests ([9149ce8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9149ce85a78b503098f80910c20de69831030378)) - - -### CI - -* **release:** 1.9.0-beta.6 [skip ci] ([c898e99](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c898e9917cde8fe291312e3b2dc7d06f6afd3932)) -* **release:** 1.9.0-beta.7 [skip ci] ([2c1875b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2c1875b1413eef5a2335688a7e0baf32ec31dcee)) - -## [1.9.0-beta.7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0-beta.6...v1.9.0-beta.7) (2025-02-03) -## [1.10.2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.10.1...v1.10.2) (2025-01-22) - - -### Bug Fixes - -* pyproject ([5d6a9ee](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5d6a9eed262d1041eea3110fbaa1729f2c16855c)) - -## [1.10.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.10.0...v1.10.1) (2025-01-22) - - -### Bug Fixes - -* pyproject.toml ([c6e6c6e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c6e6c6e33cd189bd78d7366dd570ee1e4d8c2c68)) -* pyproject.toml ([e8aed70](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e8aed7011c1a65eca2909df88a804179a04bdd96)) - -## [1.10.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0...v1.10.0) (2025-01-16) - - -### Features - -* 
add optional headers to request ([bb851d7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bb851d785d121b039d5e968327fb930955a3fd92)) -* merged localscraper into smartscraper ([503dbd1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/503dbd19b8cec4d2ff4575786b0eec25db2e80e6)) -* modified icons ([bcb9b0b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bcb9b0b731b057d242fdf80b43d96879ff7a2764)) -* searchscraper ([2e04e5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2e04e5a1bbd207a7ceeea594878bdea542a7a856)) -* updated readmes ([bfdbea0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bfdbea038918d79df2e3e9442e25d5f08bbccbbc)) - - -### chore - -* refactor examples ([8e00846](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8e008465f7280c53e2faab7a92f02871ffc5b867)) -* **tests:** updated tests ([9149ce8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9149ce85a78b503098f80910c20de69831030378)) - -## [1.9.0-beta.6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0-beta.5...v1.9.0-beta.6) (2025-01-08) -* add integration for sql ([2543b5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2543b5a9b84826de5c583d38fe89cf21aad077e6)) - - -### Docs - -* added new image ([b052ddb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b052ddbe0d1a5ea182c54897c94d4c88fbc54ab8)) - -## [1.9.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.8.0...v1.9.0) (2025-01-08) - - -### Features - -* add localScraper functionality ([8701eb2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8701eb2ca7f108b922eb1617c850a58c0f88f8f9)) -* add time varying timeout ([945b876](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/945b876a0c23d4b2a29ef916bd6fa9af425f9ab5)) -* revert to old release ([d88a3ac](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d88a3ac6969a0abdf1f6b8eccde9ad8284d41d20)) -* update doc readme 
([c02c411](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c02c411ffba9fc7906fcc7664d0ce841e0e2fb54)) - - -### Bug Fixes - -* .toml file ([e719881](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e7198817d8dac802361ab84bc4d5d961fb926767)) -* add new python compatibility ([77b67f6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/77b67f646d75abd3a558b40cb31c52c12cc7182e)) -* add revert ([09257e0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09257e08246d8aee96b3944ac14cc14b88e5f818)) -* come back to py 3.10 ([26d3a75](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/26d3a75ed973590e21d55c985bf71f3905a3ac0e)) -* houses examples and typos ([c596c44](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c596c448e334a76444ecf3ee738ec275fd5316fa)) -* improve api desc ([62243f8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62243f84384ae238c0bd0c48abc76a6b99376c74)) -* make timeout optional ([49b8e4b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/49b8e4b8d3aa637bfd28a59e47cd1f5efad91075)) -* minor fix version ([0b972c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b972c69a9ea843d8ec89327f35c287b0d7a2bb4)) -* pyproject ([2440f7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2440f7f2a5179c6e3a86faf4eefa1d5edf7524c8)) -* python version ([24366b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/24366b08eefe0789da9a0ccafb8058e8744ee58b)) -* updated hatchling version ([740933a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/740933aff79a5873e6d1c633afcedb674d1f4cf0)) - - -### chore - -* fix _make_request not using it ([701a4c1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/701a4c13bbe7e5d4ba9eae1846b0bd8abbbdb6b8)) -* fix pyproject version ([3567034](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3567034e02e4dfab967248a5a4eaee426f145d6b)) - - -### Docs - -* added api reference 
([6929a7a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6929a7adcc09f47a652cfd7ad7557314b52db9c0)) -* added api reference ([7b88876](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7b88876facc2b37e4738797b6a18c65ca89f9aa0)) -* added cookbook reference ([e68c1bd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e68c1bd1268663a625441bc7f955a1d4514ac0ef)) -* added langchain-scrapegraph examples ([479dbdb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/479dbdb1833a3ce6c2ce03eaf1400487ff534dd0)) -* added open in colab badge ([c2fc1ef](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2fc1efc687623bd821468c19a102dbaed70bd4b)) -* added two langchain-scrapegraph examples ([8f3a87e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8f3a87e880f820f4453d564fec02ef02af3742b3)) -* added two new examples ([5fa2b42](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5fa2b42685df565531cd7d2495e1d42e5c34ff90)) -* **cookbook:** added two new examples ([f67769e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f67769e0ef0bba6fc4fd6908ec666b63ac2368b9)) -* added wired langgraph react agent ([9f1e0cf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9f1e0cf72f4f84ee1f81439befaeace8c5c7ffa5)) -* added zillow example ([7fad92c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/7fad92ca5e87cd9ecc60702e1599b2cff479af5c)) -* api reference ([855c2e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/855c2e51ebfaf7d8e4be008e8f22fdf66c0dc0e0)) -* fixed cookbook images and urls ([f860167](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f8601674f686084a7df88e221475c014b40015b8)) -* github trending sdk ([320de37](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/320de37d2e8ec0d859ca91725c6cc35dab68e183)) -* link typo ([e1bfd6a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e1bfd6aa364b369c17457513f1c68e91376d0c68)) -* llama-index @VinciGit00 
([6de5eb2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6de5eb22490de2f5ff4075836bf1aca2e304ff8d)) -* research agent ([6e06afa](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6e06afa9f8d5e9f05a38e605562ec10249216704)) -* updated new documentation urls ([1d0cb46](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1d0cb46e5710707151ce227fa2043d5de5e92657)) - - -### CI - -* **release:** 1.9.0-beta.1 [skip ci] ([236d55b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/236d55b7c3ce571258fdd488ad7ac0891b2958ce)) -* **release:** 1.9.0-beta.2 [skip ci] ([59611f6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/59611f6d1d690b89917abd03ba863b46b40c2b95)) -* **release:** 1.9.0-beta.3 [skip ci] ([cbf2da4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cbf2da44b22da23c4d3870d52f88f9b0214cab27)) -* **release:** 1.9.0-beta.4 [skip ci] ([05d57ae](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/05d57aee168b1d184ef352240a03a43457e16749)) -* **release:** 1.9.0-beta.5 [skip ci] ([d03b9bf](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d03b9bf8807d6a42a41e6f82d65e54931844039c)) - -## [1.9.0-beta.5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0-beta.4...v1.9.0-beta.5) (2025-01-03) - - -### Bug Fixes - -* updated hatchling version ([740933a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/740933aff79a5873e6d1c633afcedb674d1f4cf0)) - -## [1.9.0-beta.4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0-beta.3...v1.9.0-beta.4) (2025-01-03) - - -### Bug Fixes - -* improve api desc ([62243f8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/62243f84384ae238c0bd0c48abc76a6b99376c74)) - -## [1.9.0-beta.3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0-beta.2...v1.9.0-beta.3) (2024-12-10) - - -### Bug Fixes - -* come back to py 3.10 ([26d3a75](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/26d3a75ed973590e21d55c985bf71f3905a3ac0e)) - -## 
[1.9.0-beta.2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.9.0-beta.1...v1.9.0-beta.2) (2024-12-10) - - -### Bug Fixes - -* add new python compatibility ([77b67f6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/77b67f646d75abd3a558b40cb31c52c12cc7182e)) - -## [1.9.0-beta.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.8.0...v1.9.0-beta.1) (2024-12-10) - - -### Features - -* add localScraper functionality ([8701eb2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8701eb2ca7f108b922eb1617c850a58c0f88f8f9)) -* revert to old release ([d88a3ac](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d88a3ac6969a0abdf1f6b8eccde9ad8284d41d20)) - - -### Bug Fixes - -* .toml file ([e719881](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e7198817d8dac802361ab84bc4d5d961fb926767)) -* add revert ([09257e0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/09257e08246d8aee96b3944ac14cc14b88e5f818)) -* minor fix version ([0b972c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0b972c69a9ea843d8ec89327f35c287b0d7a2bb4)) -* pyproject ([2440f7f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2440f7f2a5179c6e3a86faf4eefa1d5edf7524c8)) -* python version ([24366b0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/24366b08eefe0789da9a0ccafb8058e8744ee58b)) - -## [1.8.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.7.0...v1.8.0) (2024-12-08) - - -### Features - -* add markdownify functionality ([239d27a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/239d27aac28c6b132aba54bbb1fa0216cc59ce89)) - - -### Bug Fixes - -* fixed configuration for ignored files ([bc08dcb](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/bc08dcb21536a146fd941119931bc8e89e8e42c6)) -* fixed schema example ([365378a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/365378a0c8c9125800ed6d74629d87776cf484a0)) - - -### Docs - -* improved main readme 
([50fdf92](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/50fdf920e1d00e8f457138f9e68df74354696fc0)) - -## [1.7.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.6.0...v1.7.0) (2024-12-05) - - -### Features - -* add markdownify and localscraper ([6296510](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6296510b22ce511adde4265532ac6329a05967e0)) - - -### CI - -* **release:** 1.7.0-beta.1 [skip ci] ([5e65800](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5e6580067903644ac0c47b2c2f8d27a3e9dd2ae2)) - -## [1.7.0-beta.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.6.0...v1.7.0-beta.1) (2024-12-05) - - -### Features - -* add markdownify and localscraper ([6296510](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6296510b22ce511adde4265532ac6329a05967e0)) - -## [1.6.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.5.0...v1.6.0) (2024-12-05) - - -### Features - -* changed SyncClient to Client ([9e1e496](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e1e496059cd24810a96b818da1811830586f94b)) - - -### Bug Fixes - -* logger working properly now ([9712d4c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9712d4c39eea860f813e86a5e2ffc14db6d3a655)) -* updated env variable loading ([2643f11](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2643f11c968f0daab26529d513f08c2817763b50)) - - -### CI - -* **release:** 1.4.3-beta.2 [skip ci] ([8ab6147](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ab61476b6763b936e2e7d423b04bb51983fb8ea)) -* **release:** 1.4.3-beta.3 [skip ci] ([1bc26c7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1bc26c738443f7f52492a7b2cbe7c9f335315797)) -* **release:** 1.5.0-beta.1 [skip ci] ([8900f7b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8900f7bf53239b6a73fb41196f5327d05763bae4)) -* **release:** 1.6.0-beta.1 [skip ci] ([636db26](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/636db26649dfac76503b556d5f724faf32e3522c)) - 
-## [1.6.0-beta.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.5.0...v1.6.0-beta.1) (2024-12-05) - - -### Features - -* changed SyncClient to Client ([9e1e496](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e1e496059cd24810a96b818da1811830586f94b)) - - -### Bug Fixes - -* logger working properly now ([9712d4c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9712d4c39eea860f813e86a5e2ffc14db6d3a655)) -* updated env variable loading ([2643f11](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2643f11c968f0daab26529d513f08c2817763b50)) - - -### CI - -* **release:** 1.4.3-beta.2 [skip ci] ([8ab6147](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ab61476b6763b936e2e7d423b04bb51983fb8ea)) -* **release:** 1.4.3-beta.3 [skip ci] ([1bc26c7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/1bc26c738443f7f52492a7b2cbe7c9f335315797)) -* **release:** 1.5.0-beta.1 [skip ci] ([8900f7b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8900f7bf53239b6a73fb41196f5327d05763bae4)) - -## [1.5.0-beta.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.4.3-beta.3...v1.5.0-beta.1) (2024-12-05) - - -### Features - -* changed SyncClient to Client ([9e1e496](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e1e496059cd24810a96b818da1811830586f94b)) - - -## [1.5.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.4.3...v1.5.0) (2024-12-04) - - -### Features - -* splitted files ([2791691](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2791691a9381063cc38ac4f4fe7c884166c93116)) - -## [1.4.3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.4.2...v1.4.3) (2024-12-03) - -## [1.4.3-beta.3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.4.3-beta.2...v1.4.3-beta.3) (2024-12-05) - - -### Bug Fixes - -* updated env variable loading ([2643f11](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2643f11c968f0daab26529d513f08c2817763b50)) - -## 
[1.4.3-beta.2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.4.3-beta.1...v1.4.3-beta.2) (2024-12-05) - - -### Bug Fixes - -* logger working properly now ([9712d4c](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9712d4c39eea860f813e86a5e2ffc14db6d3a655)) - -### Bug Fixes - -* updated comment ([8250818](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/825081883940bc1caa37f4f13e10f710770aeb9c)) - - -### chore - -* improved url validation ([83eac53](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/83eac530269a767e5469c4aded1656fe00a2cdc0)) - - -### CI - -* **release:** 1.4.3-beta.1 [skip ci] ([cd1169b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cd1169b584ffa621d99961e2e95db96a28037e13)) - -## [1.4.3-beta.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.4.2...v1.4.3-beta.1) (2024-12-03) - - -### Bug Fixes - -* updated comment ([8250818](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/825081883940bc1caa37f4f13e10f710770aeb9c)) - - -### chore - -* improved url validation ([83eac53](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/83eac530269a767e5469c4aded1656fe00a2cdc0)) - -## [1.4.2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.4.1...v1.4.2) (2024-12-02) - - -### Bug Fixes - -* timeout ([589aa49](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/589aa49d4434f7112a840d178e5e48918b7799e1)) - -## [1.4.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.4.0...v1.4.1) (2024-12-02) - - -### Bug Fixes - -* sync client ([690e87b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/690e87b52505f12da172147a78007497f6edf54c)) - - -### chore - -* set up eslint and prettier for code linting and formatting ([13cf1e5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/13cf1e5c28ec739d2d35617bd57d7cf8203c3f7e)) - -## [1.4.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.3.0...v1.4.0) (2024-11-30) - - -### Features - -* added example of the smartScraper 
function using a schema ([baf933b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/baf933b0826b63d4ecf61c8593676357619a1c73)) -* implemented support for requests with schema ([10a1a5a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/10a1a5a477a6659aabf3afebfffdbefc14d12d3e)) - - -### Bug Fixes - -* the "workspace" key has been removed because it was conflicting with the package.json file in the scrapegraph-js folder. ([1299173](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/129917377b6a685d769a480b717bf980d3199833)) - - -### chore - -* added Zod package dependency ([ee5738b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ee5738bd737cd07a553d148403a4bbb5e80e5be3)) - - -### Docs - -* added an example of the smartScraper functionality using a schema ([cf2f28f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/cf2f28fa029df0acb7058fde8239046d77ef0a8a)) - - -### Refactor - -* code refactoring ([a2b57c7](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a2b57c7e482dfb5c7c1a125d1684e0367088c83b)) - -## [1.3.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.2.2...v1.3.0) (2024-11-30) - - -### Features - -* add integration for env variables ([6a351f3](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6a351f3ef70a1f00b5f5de5aaba2f408b6bf07dd)) - -## [1.2.2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.2.1...v1.2.2) (2024-11-29) - - -### Bug Fixes - -* add enw timeout ([46ebd9d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/46ebd9dc9897ca2ef9460a3e46b3a24abe90f943)) - -## [1.2.1](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.2.0...v1.2.1) (2024-11-29) - - -### Bug Fixes - -* readme js sdk ([3c2178e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3c2178e04e873885abc8aca0312f5a4a1dd9cdd0)) -* removed wrong information ([88a2f50](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/88a2f509dc34ad69f41fe6d13f31de191895bc1a)) - - -### chore - -* changed pakage name 
([9e9e138](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9e9e138617658e068a1c77a4dbac24b4d550d42a)) -* fix pylint scripts ([5913d5f](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5913d5f0d697196469f8ec952e1a65e1c7f49621)) - - -### Docs - -* improved examples ([a9c1fa5](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/a9c1fa5dcd7610b2b0c217d39fb2b77a67aa3fac)) -* updated precommit and installation guide ([c16705b](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c16705b8f405f57d2cb1719099d4b566186a7257)) -* updated readme ([ee9efa6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/ee9efa608b9a284861f712ab2a69d49da3d26523)) - - -### Refactor - -* code refactoring ([01ca238](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/01ca2384f098ecbb063ac4681e6d32f590a03f42)) - -## [1.2.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.1.0...v1.2.0) (2024-11-28) - - -### Features - -* enhaced python sdk ([c253363](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/c2533636c230426be06cd505598e8a85d5771cbc)) - - -### chore - -* set up CI scripts ([f688bdc](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/f688bdc11746325582787fa3c1ffb429838f46b6)) -* update workflow scripts ([5ea9cac](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/5ea9cacb6758171283d96ff9aa1934c25af804f1)) - -## [1.1.0](https://github.com/ScrapeGraphAI/scrapegraph-sdk/compare/v1.0.0...v1.1.0) (2024-11-28) - - -### Features - -* check ([9871ff8](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/9871ff81acfb42031ee9db526a7dba9e29d3c55b)) -* final release maybe semantic? ([8ce3ccd](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/8ce3ccd3509d0487da212f541e039ee7009dd8f3)) -* fix ([d81ab09](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d81ab091aa1ff08927ed7765055764b9e51083ee)) -* maybe final release? 
([595c3c6](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/595c3c6b6ca0e8eaacd5959422ab9018516f3fa8)) -* semantic relaase ([30ff13a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/30ff13a219df982e07df7b5366f09dedc0892de5)) -* semantic release ([6df4b18](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6df4b1833c8c418766b1649f80f9d6cd1fa8a201)) -* semantic release ([edd23d9](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/edd23d93375ef33fa97a0b409045fdbd18090d10)) -* semantic release ([e5e4908](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/e5e49080bc6d3d1440d6b333f9cadfd493ff0449)) -* test ([3bb66c4](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3bb66c4efe3eb5407f6eb88d31bda678ac3651b3)) -* test semantic release ([63d3a36](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/63d3a3623363c358e5761e1b7737f262c8238c82)) -* test semantic release ([19eda59](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/19eda59be7adbea80ed189fd0af85ab0c3c930bd)) -* test semantic release ([3e611f2](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/3e611f21248a46120fa8ff3d30392522f6d1419a)) -* test semantic release ([6320819](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/6320819e12cbd3e0fa3faa93179d2d26f1323bb4)) -* try semantic release ([d953723](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d9537230ef978aaf42d72073dc95ba598db8db6c)) - - -### chore - -* added dotenv pakage dependency ([2e9d93d](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/2e9d93d571c47c3b7aa789be811f53161387b08e)) -* fix semantic release, migrate to uv ([b6db205](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/b6db205ad5a90031bc658e65794e4dda2159fee2)) - - -### Refactor - -* code refactoring ([164131a](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/164131a2abe899bd151113bd84efa113306327c2)) -* renamed functions 
([d39f14e](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/d39f14e344ef59e3a8e4f501a080ccbe1151abee)) -* update readme ([0669f52](https://github.com/ScrapeGraphAI/scrapegraph-sdk/commit/0669f5219970079bbe7bde7502b4f55e5c3f5a45)) diff --git a/scrapegraph-py/CODE_OF_CONDUCT.md b/scrapegraph-py/CODE_OF_CONDUCT.md deleted file mode 100644 index 2eb8802e..00000000 --- a/scrapegraph-py/CODE_OF_CONDUCT.md +++ /dev/null @@ -1,128 +0,0 @@ -# Contributor Covenant Code of Conduct - -## Our Pledge - -We as members, contributors, and leaders pledge to make participation in our -community a harassment-free experience for everyone, regardless of age, body -size, visible or invisible disability, ethnicity, sex characteristics, gender -identity and expression, level of experience, education, socio-economic status, -nationality, personal appearance, race, religion, or sexual identity -and orientation. - -We pledge to act and interact in ways that contribute to an open, welcoming, -diverse, inclusive, and healthy community. 
- -## Our Standards - -Examples of behavior that contributes to a positive environment for our -community include: - -* Demonstrating empathy and kindness toward other people -* Being respectful of differing opinions, viewpoints, and experiences -* Giving and gracefully accepting constructive feedback -* Accepting responsibility and apologizing to those affected by our mistakes, - and learning from the experience -* Focusing on what is best not just for us as individuals, but for the - overall community - -Examples of unacceptable behavior include: - -* The use of sexualized language or imagery, and sexual attention or - advances of any kind -* Trolling, insulting or derogatory comments, and personal or political attacks -* Public or private harassment -* Publishing others' private information, such as a physical or email - address, without their explicit permission -* Other conduct which could reasonably be considered inappropriate in a - professional setting - -## Enforcement Responsibilities - -Community leaders are responsible for clarifying and enforcing our standards of -acceptable behavior and will take appropriate and fair corrective action in -response to any behavior that they deem inappropriate, threatening, offensive, -or harmful. - -Community leaders have the right and responsibility to remove, edit, or reject -comments, commits, code, wiki edits, issues, and other contributions that are -not aligned to this Code of Conduct, and will communicate reasons for moderation -decisions when appropriate. - -## Scope - -This Code of Conduct applies within all community spaces, and also applies when -an individual is officially representing the community in public spaces. -Examples of representing our community include using an official e-mail address, -posting via an official social media account, or acting as an appointed -representative at an online or offline event. 
- -## Enforcement - -Instances of abusive, harassing, or otherwise unacceptable behavior may be -reported to the community leaders responsible for enforcement at -mvincig11@gmail.com. -All complaints will be reviewed and investigated promptly and fairly. - -All community leaders are obligated to respect the privacy and security of the -reporter of any incident. - -## Enforcement Guidelines - -Community leaders will follow these Community Impact Guidelines in determining -the consequences for any action they deem in violation of this Code of Conduct: - -### 1. Correction - -**Community Impact**: Use of inappropriate language or other behavior deemed -unprofessional or unwelcome in the community. - -**Consequence**: A private, written warning from community leaders, providing -clarity around the nature of the violation and an explanation of why the -behavior was inappropriate. A public apology may be requested. - -### 2. Warning - -**Community Impact**: A violation through a single incident or series -of actions. - -**Consequence**: A warning with consequences for continued behavior. No -interaction with the people involved, including unsolicited interaction with -those enforcing the Code of Conduct, for a specified period of time. This -includes avoiding interactions in community spaces as well as external channels -like social media. Violating these terms may lead to a temporary or -permanent ban. - -### 3. Temporary Ban - -**Community Impact**: A serious violation of community standards, including -sustained inappropriate behavior. - -**Consequence**: A temporary ban from any sort of interaction or public -communication with the community for a specified period of time. No public or -private interaction with the people involved, including unsolicited interaction -with those enforcing the Code of Conduct, is allowed during this period. -Violating these terms may lead to a permanent ban. - -### 4. 
Permanent Ban - -**Community Impact**: Demonstrating a pattern of violation of community -standards, including sustained inappropriate behavior, harassment of an -individual, or aggression toward or disparagement of classes of individuals. - -**Consequence**: A permanent ban from any sort of public interaction within -the community. - -## Attribution - -This Code of Conduct is adapted from the [Contributor Covenant][homepage], -version 2.0, available at -https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. - -Community Impact Guidelines were inspired by [Mozilla's code of conduct -enforcement ladder](https://github.com/mozilla/diversity). - -[homepage]: https://www.contributor-covenant.org - -For answers to common questions about this code of conduct, see the FAQ at -https://www.contributor-covenant.org/faq. Translations are available at -https://www.contributor-covenant.org/translations. diff --git a/scrapegraph-py/CONTRIBUTING.md b/scrapegraph-py/CONTRIBUTING.md deleted file mode 100644 index 914b3de3..00000000 --- a/scrapegraph-py/CONTRIBUTING.md +++ /dev/null @@ -1,108 +0,0 @@ -# Contributing to ScrapeGraphAI - -Thank you for your interest in contributing to **ScrapeGraphAI**! We welcome contributions from the community to help improve and grow the project. This document outlines the guidelines and steps for contributing. - -## Table of Contents - -- [Getting Started](#getting-started) -- [Contributing Guidelines](#contributing-guidelines) -- [Code Style](#code-style) -- [Submitting a Pull Request](#submitting-a-pull-request) -- [Reporting Issues](#reporting-issues) -- [License](#license) - -## Getting Started - -### Development Setup - -1. Fork the repository on GitHub **(FROM pre/beta branch)**. -2. Clone your forked repository: - ```bash - git clone https://github.com/ScrapeGraphAI/scrapegraph-sdk.git - cd scrapegraph-sdk/scrapegraph-py - ``` - -3. 
Install dependencies using uv (recommended): - ```bash - # Install uv if you haven't already - pip install uv - - # Install dependencies - uv sync - - # Install pre-commit hooks - uv run pre-commit install - ``` - -4. Run tests: - ```bash - # Run all tests - uv run pytest - - # Run specific test file - uv run pytest tests/test_client.py - ``` - -4. Make your changes or additions. -5. Test your changes thoroughly. -6. Commit your changes with descriptive commit messages. -7. Push your changes to your forked repository. -8. Submit a pull request to the pre/beta branch. - -N.B All the pull request to the main branch will be rejected! - -## Contributing Guidelines - -Please adhere to the following guidelines when contributing to ScrapeGraphAI: - -- Follow the code style and formatting guidelines specified in the [Code Style](#code-style) section. -- Make sure your changes are well-documented and include any necessary updates to the project's documentation and requirements if needed. -- Write clear and concise commit messages that describe the purpose of your changes and the last commit before the pull request has to follow the following format: - - `feat: Add new feature` - - `fix: Correct issue with existing feature` - - `docs: Update documentation` - - `style: Improve formatting and style` - - `refactor: Restructure code` - - `test: Add or update tests` - - `perf: Improve performance` -- Be respectful and considerate towards other contributors and maintainers. - -## Code Style - -Please make sure to format your code accordingly before submitting a pull request. 
- -### Python - -- [Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) -- [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) -- [The Hitchhiker's Guide to Python](https://docs.python-guide.org/writing/style/) -- [Pylint style of code for the documentation](https://pylint.pycqa.org/en/1.6.0/tutorial.html) - -## Submitting a Pull Request - -To submit your changes for review, please follow these steps: - -1. Ensure that your changes are pushed to your forked repository. -2. Go to the main repository on GitHub and navigate to the "Pull Requests" tab. -3. Click on the "New Pull Request" button. -4. Select your forked repository and the branch containing your changes. -5. Provide a descriptive title and detailed description for your pull request. -6. Reviewers will provide feedback and discuss any necessary changes. -7. Once your pull request is approved, it will be merged into the pre/beta branch. - -## Reporting Issues - -If you encounter any issues or have suggestions for improvements, please open an issue on the GitHub repository. Provide a clear and detailed description of the problem or suggestion, along with any relevant information or steps to reproduce the issue. - -## License - -ScrapeGraphAI is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for more information. -By contributing to this project, you agree to license your contributions under the same license. - -ScrapeGraphAI uses code from the Langchain -frameworks. You find their original licenses below. - -LANGCHAIN LICENSE -https://github.com/langchain-ai/langchain/blob/master/LICENSE - -Can't wait to see your contributions! 
:smile: diff --git a/scrapegraph-py/Makefile b/scrapegraph-py/Makefile deleted file mode 100644 index 539a3ab5..00000000 --- a/scrapegraph-py/Makefile +++ /dev/null @@ -1,55 +0,0 @@ -# Makefile for Project Automation - -.PHONY: install lint type-check test docs serve-docs build all clean - -# Variables -PACKAGE_NAME = scrapegraph_py -TEST_DIR = tests - -# Default target -all: lint type-check test docs - -# Install project dependencies -install: - uv sync - -# Linting and Formatting Checks -lint: - uv run ruff check $(PACKAGE_NAME) $(TEST_DIR) - uv run black --check $(PACKAGE_NAME) $(TEST_DIR) - uv run isort --check-only $(PACKAGE_NAME) $(TEST_DIR) - -# Type Checking with MyPy -type-check: - uv run mypy $(PACKAGE_NAME) $(TEST_DIR) - -# Run Tests with Coverage -test: - uv run pytest --cov=$(PACKAGE_NAME) --cov-report=xml $(TEST_DIR)/ - -# Build Documentation using MkDocs -docs: - uv run mkdocs build - -# Serve Documentation Locally -serve-docs: - uv run mkdocs serve - -# Run Pre-Commit Hooks -pre-commit: - uv run pre-commit run --all-files - -# Clean Up Generated Files -clean: - rm -rf dist/ - rm -rf build/ - rm -rf *.egg-info - rm -rf htmlcov/ - rm -rf .mypy_cache/ - rm -rf .pytest_cache/ - rm -rf .ruff_cache/ - rm -rf site/ - -# Build the Package -build: - uv build diff --git a/scrapegraph-py/README.md b/scrapegraph-py/README.md deleted file mode 100644 index 9898b87d..00000000 --- a/scrapegraph-py/README.md +++ /dev/null @@ -1,400 +0,0 @@ -# ๐ŸŒ ScrapeGraph Python SDK - -[![PyPI version](https://badge.fury.io/py/scrapegraph-py.svg)](https://badge.fury.io/py/scrapegraph-py) -[![Python Support](https://img.shields.io/pypi/pyversions/scrapegraph-py.svg)](https://pypi.org/project/scrapegraph-py/) -[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) -[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) -[![Documentation 
Status](https://readthedocs.org/projects/scrapegraph-py/badge/?version=latest)](https://docs.scrapegraphai.com) - -

- ScrapeGraph API Banner -

- -Official [Python SDK ](https://scrapegraphai.com) for the ScrapeGraph API - Smart web scraping powered by AI. - -## ๐Ÿ“ฆ Installation - -### Basic Installation - -```bash -pip install scrapegraph-py -``` - -This installs the core SDK with minimal dependencies. The SDK is fully functional with just the core dependencies. - -### Optional Dependencies - -For specific use cases, you can install optional extras: - -**HTML Validation** (required when using `website_html` parameter): -```bash -pip install scrapegraph-py[html] -``` - -**Langchain Integration** (for using with Langchain/Langgraph): -```bash -pip install scrapegraph-py[langchain] -``` - -**All Optional Dependencies**: -```bash -pip install scrapegraph-py[html,langchain] -``` - -## ๐Ÿš€ Features - -- ๐Ÿค– AI-powered web scraping and search -- ๐Ÿ•ท๏ธ Smart crawling with both AI extraction and markdown conversion modes -- ๐Ÿ’ฐ Cost-effective markdown conversion (80% savings vs AI mode) -- ๐Ÿ”„ Both sync and async clients -- ๐Ÿ“Š Structured output with Pydantic schemas -- ๐Ÿ” Detailed logging -- โšก Automatic retries -- ๐Ÿ” Secure authentication - -## ๐ŸŽฏ Quick Start - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") -``` - -> [!NOTE] -> You can set the `SGAI_API_KEY` environment variable and initialize the client without parameters: `client = Client()` - -## ๐Ÿ“š Available Endpoints - -### ๐Ÿค– SmartScraper - -Extract structured data from any webpage or HTML content using AI. - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") - -# Using a URL -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the main heading and description" -) - -# Or using HTML content -# Note: Using website_html requires the [html] extra: pip install scrapegraph-py[html] -html_content = """ - - -

Company Name

-

We are a technology company focused on AI solutions.

- - -""" - -response = client.smartscraper( - website_html=html_content, - user_prompt="Extract the company description" -) - -print(response) -``` - -
-Output Schema (Optional) - -```python -from pydantic import BaseModel, Field -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") - -class WebsiteData(BaseModel): - title: str = Field(description="The page title") - description: str = Field(description="The meta description") - -response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the title and description", - output_schema=WebsiteData -) -``` - -
- -
-🍪 Cookies Support - -Use cookies for authentication and session management: - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") - -# Define cookies for authentication -cookies = { - "session_id": "abc123def456", - "auth_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...", - "user_preferences": "dark_mode,usd" -} - -response = client.smartscraper( - website_url="https://example.com/dashboard", - user_prompt="Extract user profile information", - cookies=cookies -) -``` - -**Common Use Cases:** -- **E-commerce sites**: User authentication, shopping cart persistence -- **Social media**: Session management, user preferences -- **Banking/Financial**: Secure authentication, transaction history -- **News sites**: User preferences, subscription content -- **API endpoints**: Authentication tokens, API keys - -
- -
-🔄 Advanced Features - -**Infinite Scrolling:** -```python -response = client.smartscraper( - website_url="https://example.com/feed", - user_prompt="Extract all posts from the feed", - cookies=cookies, - number_of_scrolls=10 # Scroll 10 times to load more content -) -``` - -**Pagination:** -```python -response = client.smartscraper( - website_url="https://example.com/products", - user_prompt="Extract all product information", - cookies=cookies, - total_pages=5 # Scrape 5 pages -) -``` - -**Combined with Cookies:** -```python -response = client.smartscraper( - website_url="https://example.com/dashboard", - user_prompt="Extract user data from all pages", - cookies=cookies, - number_of_scrolls=5, - total_pages=3 -) -``` - -
- -### ๐Ÿ” SearchScraper - -Perform AI-powered web searches with structured results and reference URLs. - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") - -response = client.searchscraper( - user_prompt="What is the latest version of Python and its main features?" -) - -print(f"Answer: {response['result']}") -print(f"Sources: {response['reference_urls']}") -``` - -
-Output Schema (Optional) - -```python -from pydantic import BaseModel, Field -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") - -class PythonVersionInfo(BaseModel): - version: str = Field(description="The latest Python version number") - release_date: str = Field(description="When this version was released") - major_features: list[str] = Field(description="List of main features") - -response = client.searchscraper( - user_prompt="What is the latest version of Python and its main features?", - output_schema=PythonVersionInfo -) -``` - -
- -### ๐Ÿ“ Markdownify - -Converts any webpage into clean, formatted markdown. - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") - -response = client.markdownify( - website_url="https://example.com" -) - -print(response) -``` - -### ๐Ÿ•ท๏ธ Crawler - -Intelligently crawl and extract data from multiple pages with support for both AI extraction and markdown conversion modes. - -#### AI Extraction Mode (Default) -Extract structured data from multiple pages using AI: - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") - -# Define the data schema for extraction -schema = { - "type": "object", - "properties": { - "company_name": {"type": "string"}, - "founders": { - "type": "array", - "items": {"type": "string"} - }, - "description": {"type": "string"} - } -} - -response = client.crawl( - url="https://scrapegraphai.com", - prompt="extract the company information and founders", - data_schema=schema, - depth=2, - max_pages=5, - same_domain_only=True -) - -# Poll for results (crawl is asynchronous) -crawl_id = response.get("crawl_id") -result = client.get_crawl(crawl_id) -``` - -#### Markdown Conversion Mode (Cost-Effective) -Convert pages to clean markdown without AI processing (80% cheaper): - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key-here") - -response = client.crawl( - url="https://scrapegraphai.com", - extraction_mode=False, # Markdown conversion mode - depth=2, - max_pages=5, - same_domain_only=True, - sitemap=True # Use sitemap for better page discovery -) - -# Poll for results -crawl_id = response.get("crawl_id") -result = client.get_crawl(crawl_id) - -# Access markdown content -for page in result["result"]["pages"]: - print(f"URL: {page['url']}") - print(f"Markdown: {page['markdown']}") - print(f"Metadata: {page['metadata']}") -``` - -
-🔧 Crawl Parameters - -- **url** (required): Starting URL for the crawl -- **extraction_mode** (default: True): - - `True` = AI extraction mode (requires prompt and data_schema) - - `False` = Markdown conversion mode (no AI, 80% cheaper) -- **prompt** (required for AI mode): AI prompt to guide data extraction -- **data_schema** (required for AI mode): JSON schema defining extracted data structure -- **depth** (default: 2): Maximum crawl depth (1-10) -- **max_pages** (default: 2): Maximum pages to crawl (1-100) -- **same_domain_only** (default: True): Only crawl pages from the same domain -- **sitemap** (default: False): Use sitemap.xml for better page discovery and more comprehensive crawling -- **cache_website** (default: True): Cache website content -- **batch_size** (optional): Batch size for processing pages (1-10) - -**Cost Comparison:** -- AI Extraction Mode: ~10 credits per page -- Markdown Conversion Mode: ~2 credits per page (80% savings!) - -**Sitemap Benefits:** -- Better page discovery using sitemap.xml -- More comprehensive website coverage -- Efficient crawling of structured websites -- Perfect for e-commerce, news sites, and content-heavy websites - -
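The crawl examples above call `get_crawl` once, but a crawl runs asynchronously, so the result may not be ready on the first call. A minimal polling sketch follows. It assumes the response dict carries a `status` key with terminal values `success` and `failed`; that shape is an assumption, not something this README confirms. `poll_until_done` is a hypothetical helper, not part of the SDK.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until it reports a terminal status or attempts run out.

    Assumption: the response has a "status" key whose terminal values are
    "success" and "failed". Adjust these to the real response shape.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result.get("status") in ("success", "failed"):
            return result
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"no terminal status after {max_attempts} attempts")
```

With the client from the examples above, usage would look like `result = poll_until_done(lambda: client.get_crawl(crawl_id))`. Passing the fetch step as a callable keeps the helper independent of the SDK and easy to test with a stub.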
- -## โšก Async Support - -All endpoints support async operations: - -```python -import asyncio -from scrapegraph_py import AsyncClient - -async def main(): - async with AsyncClient() as client: - response = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the main content" - ) - print(response) - -asyncio.run(main()) -``` - -## ๐Ÿ“– Documentation - -For detailed documentation, visit [docs.scrapegraphai.com](https://docs.scrapegraphai.com) - -## ๐Ÿ› ๏ธ Development - -For information about setting up the development environment and contributing to the project, see our [Contributing Guide](CONTRIBUTING.md). - -## ๐Ÿ’ฌ Support & Feedback - -- ๐Ÿ“ง Email: support@scrapegraphai.com -- ๐Ÿ’ป GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues) -- ๐ŸŒŸ Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/scrapegraph-sdk/issues/new) -- โญ API Feedback: You can also submit feedback programmatically using the feedback endpoint: - ```python - from scrapegraph_py import Client - - client = Client(api_key="your-api-key-here") - - client.submit_feedback( - request_id="your-request-id", - rating=5, - feedback_text="Great results!" - ) - ``` - -## ๐Ÿ“„ License - -This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. - -## ๐Ÿ”— Links - -- [Website](https://scrapegraphai.com) -- [Documentation](https://docs.scrapegraphai.com) -- [GitHub](https://github.com/ScrapeGraphAI/scrapegraph-sdk) - ---- - -Made with โค๏ธ by [ScrapeGraph AI](https://scrapegraphai.com) diff --git a/scrapegraph-py/TESTING.md b/scrapegraph-py/TESTING.md deleted file mode 100644 index 927d54d1..00000000 --- a/scrapegraph-py/TESTING.md +++ /dev/null @@ -1,316 +0,0 @@ -# Testing Guide for ScrapeGraph Python SDK - -This document provides comprehensive information about testing the ScrapeGraph Python SDK. 
- -## Overview - -The test suite covers all APIs in the SDK with comprehensive test cases that ensure: -- All API endpoints return 200 status codes -- Both sync and async clients are tested -- Error handling scenarios are covered -- Edge cases and validation are tested - -## Test Structure - -### Test Files - -- `tests/test_comprehensive_apis.py` - Mocked comprehensive test suite covering all APIs -- `tests/test_real_apis.py` - Real API tests using actual API calls with environment variables -- `tests/test_client.py` - Sync client tests -- `tests/test_async_client.py` - Async client tests -- `tests/test_smartscraper.py` - SmartScraper specific tests -- `tests/test_models.py` - Model validation tests -- `tests/test_exceptions.py` - Exception handling tests - -### Test Categories - -1. **API Tests** - Test all API endpoints with 200 responses -2. **Client Tests** - Test client initialization and context managers -3. **Model Tests** - Test Pydantic model validation -4. **Error Handling** - Test error scenarios and edge cases -5. **Async Tests** - Test async client functionality - -## Running Tests - -### Prerequisites - -Install test dependencies: - -```bash -cd scrapegraph-py -pip install -r requirements-test.txt -pip install -e ".[html]" -``` - -**Note**: Tests require the `html` extra to be installed because they test HTML validation features. The `[html]` extra includes `beautifulsoup4` which is used for HTML validation in `SmartScraperRequest`. 
- -### Basic Test Execution - -```bash -# Run all tests -python -m pytest - -# Run with verbose output -python -m pytest -v - -# Run specific test file -python -m pytest tests/test_comprehensive_apis.py - -# Run only async tests -python -m pytest -m asyncio - -# Run only sync tests -python -m pytest -m "not asyncio" -``` - -### Using the Test Runner Script - -```bash -# Run all tests with coverage -python run_tests.py --coverage - -# Run with HTML coverage report -python run_tests.py --coverage --html - -# Run with XML coverage report (for CI) -python run_tests.py --coverage --xml - -# Run only async tests -python run_tests.py --async-only - -# Run specific test file -python run_tests.py --test-file tests/test_smartscraper.py -``` - -### Using the Real API Test Runner - -```bash -# Run real API tests (requires SGAI_API_KEY environment variable) -python run_real_tests.py - -# Run with custom API key -python run_real_tests.py --api-key your-api-key-here - -# Run with verbose output -python run_real_tests.py --verbose - -# Run only async real API tests -python run_real_tests.py --async-only - -# Run only sync real API tests -python run_real_tests.py --sync-only -``` - -### Coverage Reports - -```bash -# Generate coverage report -python -m pytest --cov=scrapegraph_py --cov-report=html --cov-report=term-missing - -# View HTML coverage report -open htmlcov/index.html -``` - -## Test Coverage - -### Mocked Tests (test_comprehensive_apis.py) - -1. **SmartScraper API** - - Basic scraping with URL - - Scraping with HTML content - - Custom headers - - Cookies support - - Output schema validation - - Infinite scrolling - - Pagination - - Status retrieval - -2. **SearchScraper API** - - Basic search functionality - - Custom number of results - - Custom headers - - Output schema validation - - Status retrieval - -3. **Markdownify API** - - Basic markdown conversion - - Custom headers - - Status retrieval - -4. 
**Crawl API** - - Basic crawling - - All parameters (depth, max_pages, etc.) - - Status retrieval - -5. **Credits API** - - Credit balance retrieval - -6. **Feedback API** - - Submit feedback with text - - Submit feedback without text - -### Real API Tests (test_real_apis.py) - -The real API tests cover the same functionality as the mocked tests but use actual API calls: - -1. **All API Endpoints** - Test with real API responses -2. **Error Handling** - Test with actual error scenarios -3. **Performance** - Test concurrent requests and response times -4. **Environment Variables** - Test client initialization from environment -5. **Context Managers** - Test proper resource management - -**Note**: Real API tests require a valid `SGAI_API_KEY` environment variable and may consume API credits. - -### Client Features Tested - -1. **Sync Client** - - Initialization from environment - - Context manager support - - All API methods - -2. **Async Client** - - Initialization from environment - - Async context manager support - - All async API methods - -## Mocking Strategy - -All tests use the `responses` library to mock HTTP requests: - -```python -@responses.activate -def test_api_endpoint(): - responses.add( - responses.POST, - "https://api.scrapegraphai.com/v1/endpoint", - json={"status": "completed", "result": "data"}, - status=200 - ) - # Test implementation -``` - -## GitHub Actions Workflow - -The `.github/workflows/test.yml` file defines the CI/CD pipeline: - -### Jobs - -1. **Test Job** - - Runs on multiple Python versions (3.8-3.12) - - Executes all tests with coverage - - Uploads coverage to Codecov - -2. **Lint Job** - - Runs flake8, black, isort, and mypy - - Ensures code quality and style consistency - -3. 
**Security Job** - - Runs bandit and safety checks - - Identifies potential security issues - -### Triggers - -- Push to main/master branch -- Pull requests to main/master branch - -## Test Configuration - -### pytest.ini - -The `pytest.ini` file configures: -- Test discovery patterns -- Coverage settings -- Custom markers -- Warning filters - -### Coverage Settings - -- Minimum coverage: 80% -- Reports: term-missing, html, xml -- Coverage source: scrapegraph_py package - -## Best Practices - -1. **Test Naming** - - Use descriptive test names - - Include the expected behavior in the name - - Use `test_` prefix for all test functions - -2. **Test Organization** - - Group related tests in classes or modules - - Use fixtures for common setup - - Keep tests independent - -3. **Mocking** - - Mock external dependencies - - Use realistic mock data - - Test both success and error scenarios - -4. **Assertions** - - Test specific behavior, not implementation - - Use appropriate assertion methods - - Include meaningful error messages - -## Troubleshooting - -### Common Issues - -1. **Import Errors** - ```bash - pip install -e ".[html]" - ``` - -2. **Missing Dependencies** - ```bash - pip install -r requirements-test.txt - ``` - -3. **Async Test Failures** - ```bash - pip install pytest-asyncio - ``` - -4. **Coverage Issues** - ```bash - pip install pytest-cov - ``` - -### Debug Mode - -```bash -# Run tests with debug output -python -m pytest -v -s - -# Run specific test with debug -python -m pytest tests/test_comprehensive_apis.py::test_smartscraper_basic_success -v -s -``` - -## Contributing - -When adding new tests: - -1. Follow the existing test patterns -2. Ensure 200 status code responses -3. Test both sync and async versions -4. Include error handling scenarios -5. 
Update this documentation if needed - -## Coverage Goals - -- **Minimum Coverage**: 80% -- **Target Coverage**: 90% -- **Critical Paths**: 100% - -## Performance - -- **Test Execution Time**: < 30 seconds -- **Memory Usage**: < 500MB -- **Parallel Execution**: Supported via pytest-xdist - -## Security - -All tests run in isolated environments with: -- No real API calls -- Mocked external dependencies -- No sensitive data exposure -- Security scanning enabled diff --git a/scrapegraph-py/TOON_INTEGRATION.md b/scrapegraph-py/TOON_INTEGRATION.md deleted file mode 100644 index b4c61dcf..00000000 --- a/scrapegraph-py/TOON_INTEGRATION.md +++ /dev/null @@ -1,230 +0,0 @@ -# TOON Format Integration - -## Overview - -The ScrapeGraph SDK now supports [TOON (Token-Oriented Object Notation)](https://github.com/ScrapeGraphAI/toonify) format for API responses. TOON is a compact data format that reduces LLM token usage by **30-60%** compared to JSON, significantly lowering API costs while maintaining human readability. - -## What is TOON? - -TOON is a serialization format optimized for LLM token efficiency. It represents structured data in a more compact form than JSON while preserving all information. 
- -### Example Comparison - -**JSON** (247 bytes): -```json -{ - "products": [ - {"id": 101, "name": "Laptop Pro", "price": 1299}, - {"id": 102, "name": "Magic Mouse", "price": 79}, - {"id": 103, "name": "USB-C Cable", "price": 19} - ] -} -``` - -**TOON** (98 bytes, **60% reduction**): -``` -products[3]{id,name,price}: - 101,Laptop Pro,1299 - 102,Magic Mouse,79 - 103,USB-C Cable,19 -``` - -## Benefits - -- โœ… **30-60% reduction** in token usage -- โœ… **Lower LLM API costs** (saves $2,147 per million requests at GPT-4 pricing) -- โœ… **Faster processing** due to smaller payloads -- โœ… **Human-readable** format -- โœ… **Lossless** conversion (preserves all data) - -## Usage - -### Installation - -The TOON integration is automatically available when you install the SDK: - -```bash -pip install scrapegraph-py -``` - -The `toonify` library is included as a dependency. - -### Basic Usage - -All scraping methods now support a `return_toon` parameter. Set it to `True` to receive responses in TOON format: - -```python -from scrapegraph_py import Client - -client = Client(api_key="your-api-key") - -# Get response in JSON format (default) -json_result = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information", - return_toon=False # or omit this parameter -) - -# Get response in TOON format (30-60% fewer tokens) -toon_result = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information", - return_toon=True -) -``` - -### Async Usage - -The async client also supports TOON format: - -```python -import asyncio -from scrapegraph_py import AsyncClient - -async def main(): - async with AsyncClient(api_key="your-api-key") as client: - # Get response in TOON format - toon_result = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract product information", - return_toon=True - ) - print(toon_result) - -asyncio.run(main()) -``` - -## Supported Methods - -The 
`return_toon` parameter is available for all scraping methods: - -### SmartScraper -```python -# Sync -client.smartscraper(..., return_toon=True) -client.get_smartscraper(request_id, return_toon=True) - -# Async -await client.smartscraper(..., return_toon=True) -await client.get_smartscraper(request_id, return_toon=True) -``` - -### SearchScraper -```python -# Sync -client.searchscraper(..., return_toon=True) -client.get_searchscraper(request_id, return_toon=True) - -# Async -await client.searchscraper(..., return_toon=True) -await client.get_searchscraper(request_id, return_toon=True) -``` - -### Crawl -```python -# Sync -client.crawl(..., return_toon=True) -client.get_crawl(crawl_id, return_toon=True) - -# Async -await client.crawl(..., return_toon=True) -await client.get_crawl(crawl_id, return_toon=True) -``` - -### AgenticScraper -```python -# Sync -client.agenticscraper(..., return_toon=True) -client.get_agenticscraper(request_id, return_toon=True) - -# Async -await client.agenticscraper(..., return_toon=True) -await client.get_agenticscraper(request_id, return_toon=True) -``` - -### Markdownify -```python -# Sync -client.markdownify(..., return_toon=True) -client.get_markdownify(request_id, return_toon=True) - -# Async -await client.markdownify(..., return_toon=True) -await client.get_markdownify(request_id, return_toon=True) -``` - -### Scrape -```python -# Sync -client.scrape(..., return_toon=True) -client.get_scrape(request_id, return_toon=True) - -# Async -await client.scrape(..., return_toon=True) -await client.get_scrape(request_id, return_toon=True) -``` - -## Examples - -Complete examples are available in the `examples/` directory: - -- `examples/toon_example.py` - Sync examples demonstrating TOON format -- `examples/toon_async_example.py` - Async examples demonstrating TOON format - -Run the examples: - -```bash -# Set your API key -export SGAI_API_KEY="your-api-key" - -# Run sync example -python examples/toon_example.py - -# Run async example 
-python examples/toon_async_example.py -``` - -## When to Use TOON - -**Use TOON when:** -- ✅ Passing scraped data to LLM APIs (reduces token costs) -- ✅ Working with large structured datasets -- ✅ Context window is limited -- ✅ Token cost optimization is important - -**Use JSON when:** -- ❌ Maximum compatibility with third-party tools is required -- ❌ Data needs to be processed by JSON-only tools -- ❌ Working with highly irregular/nested data - -## Cost Savings Example - -At GPT-4 pricing: -- **Input tokens**: $0.01 per 1K tokens -- **Output tokens**: $0.03 per 1K tokens - -With 50% token reduction using TOON: -- **1 million API requests** with 1K tokens each -- **Savings**: $2,147 per million requests -- **Savings**: $5,408 per billion tokens - -## Technical Details - -The TOON integration is implemented through a converter utility (`scrapegraph_py.utils.toon_converter`) that: - -1. Takes the API response (dict) -2. Converts it to TOON format using the `toonify` library -3. Returns the TOON-formatted string - -The conversion is **lossless** - all data is preserved and can be converted back to the original structure using the TOON decoder. - -## Learn More - -- [Toonify GitHub Repository](https://github.com/ScrapeGraphAI/toonify) -- [TOON Format Specification](https://github.com/toon-format/toon) -- [ScrapeGraph Documentation](https://docs.scrapegraphai.com) - -## Contributing - -Found a bug or have a suggestion for the TOON integration? Please open an issue or submit a pull request on our [GitHub repository](https://github.com/ScrapeGraphAI/scrapegraph-sdk). 
- diff --git a/scrapegraph-py/examples/.env.example b/scrapegraph-py/examples/.env.example deleted file mode 100644 index 10816c82..00000000 --- a/scrapegraph-py/examples/.env.example +++ /dev/null @@ -1 +0,0 @@ -SGAI_API_KEY="your_sgai_api_key" diff --git a/scrapegraph-py/examples/advanced_features/cookies/cookies_integration_example.py b/scrapegraph-py/examples/advanced_features/cookies/cookies_integration_example.py deleted file mode 100644 index d7c2d6ff..00000000 --- a/scrapegraph-py/examples/advanced_features/cookies/cookies_integration_example.py +++ /dev/null @@ -1,285 +0,0 @@ -""" -Comprehensive example demonstrating cookies integration for web scraping. - -This example shows various real-world scenarios where cookies are essential: -1. E-commerce site scraping with authentication -2. Social media scraping with session cookies -3. Banking/financial site scraping with secure cookies -4. News site scraping with user preferences -5. API endpoint scraping with authentication tokens - -Requirements: -- Python 3.7+ -- scrapegraph-py -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import json -import os -from typing import Optional - -from dotenv import load_dotenv -from pydantic import BaseModel, Field - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -# Define data models for different scenarios -class ProductInfo(BaseModel): - """Model for e-commerce product information.""" - - name: str = Field(description="Product name") - price: str = Field(description="Product price") - availability: str = Field(description="Product availability status") - rating: Optional[str] = Field(description="Product rating", default=None) - - -class SocialMediaPost(BaseModel): - """Model for social media post information.""" - - author: str = Field(description="Post author") - content: str = Field(description="Post content") - likes: Optional[str] = Field(description="Number of 
likes", default=None) - comments: Optional[str] = Field(description="Number of comments", default=None) - timestamp: Optional[str] = Field(description="Post timestamp", default=None) - - -class NewsArticle(BaseModel): - """Model for news article information.""" - - title: str = Field(description="Article title") - summary: str = Field(description="Article summary") - author: Optional[str] = Field(description="Article author", default=None) - publish_date: Optional[str] = Field(description="Publish date", default=None) - - -class BankTransaction(BaseModel): - """Model for banking transaction information.""" - - date: str = Field(description="Transaction date") - description: str = Field(description="Transaction description") - amount: str = Field(description="Transaction amount") - type: str = Field(description="Transaction type (credit/debit)") - - -def scrape_ecommerce_with_auth(): - """Example: Scrape e-commerce site with authentication cookies.""" - print("=" * 60) - print("E-COMMERCE SITE SCRAPING WITH AUTHENTICATION") - print("=" * 60) - - # Example cookies for an e-commerce site - cookies = { - "session_id": "abc123def456", - "user_id": "user789", - "cart_id": "cart101112", - "preferences": "dark_mode,usd", - "auth_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...", - } - - website_url = "https://example-ecommerce.com/products" - user_prompt = ( - "Extract product information including name, price, availability, and rating" - ) - - try: - client = Client.from_env() - response = client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - output_schema=ProductInfo, - number_of_scrolls=5, # Scroll to load more products - ) - - print("โœ… E-commerce scraping completed successfully") - print(json.dumps(response, indent=2)) - client.close() - - except Exception as e: - print(f"โŒ Error in e-commerce scraping: {str(e)}") - - -def scrape_social_media_with_session(): - """Example: Scrape social media with session cookies.""" - 
print("\n" + "=" * 60) - print("SOCIAL MEDIA SCRAPING WITH SESSION COOKIES") - print("=" * 60) - - # Example cookies for a social media site - cookies = { - "session_token": "xyz789abc123", - "user_session": "def456ghi789", - "csrf_token": "jkl012mno345", - "remember_me": "true", - "language": "en_US", - } - - website_url = "https://example-social.com/feed" - user_prompt = ( - "Extract posts from the feed including author, content, likes, and comments" - ) - - try: - client = Client.from_env() - response = client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - output_schema=SocialMediaPost, - number_of_scrolls=10, # Scroll to load more posts - ) - - print("โœ… Social media scraping completed successfully") - print(json.dumps(response, indent=2)) - client.close() - - except Exception as e: - print(f"โŒ Error in social media scraping: {str(e)}") - - -def scrape_news_with_preferences(): - """Example: Scrape news site with user preference cookies.""" - print("\n" + "=" * 60) - print("NEWS SITE SCRAPING WITH USER PREFERENCES") - print("=" * 60) - - # Example cookies for a news site - cookies = { - "user_preferences": "technology,science,ai", - "reading_level": "advanced", - "region": "US", - "subscription_tier": "premium", - "theme": "dark", - } - - website_url = "https://example-news.com/technology" - user_prompt = ( - "Extract news articles including title, summary, author, and publish date" - ) - - try: - client = Client.from_env() - response = client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - output_schema=NewsArticle, - total_pages=3, # Scrape multiple pages - ) - - print("โœ… News scraping completed successfully") - print(json.dumps(response, indent=2)) - client.close() - - except Exception as e: - print(f"โŒ Error in news scraping: {str(e)}") - - -def scrape_banking_with_secure_cookies(): - """Example: Scrape banking site with secure authentication cookies.""" - print("\n" + 
"=" * 60) - print("BANKING SITE SCRAPING WITH SECURE COOKIES") - print("=" * 60) - - # Example secure cookies for a banking site - cookies = { - "secure_session": "pqr678stu901", - "auth_token": "vwx234yz567", - "mfa_verified": "true", - "device_id": "device_abc123", - "last_activity": "2024-01-15T10:30:00Z", - } - - website_url = "https://example-bank.com/transactions" - user_prompt = ( - "Extract recent transactions including date, description, amount, and type" - ) - - try: - client = Client.from_env() - response = client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - output_schema=BankTransaction, - total_pages=5, # Scrape multiple pages of transactions - ) - - print("✅ Banking scraping completed successfully") - print(json.dumps(response, indent=2)) - client.close() - - except Exception as e: - print(f"❌ Error in banking scraping: {str(e)}") - - -def scrape_api_with_auth_tokens(): - """Example: Scrape API endpoint with authentication tokens.""" - print("\n" + "=" * 60) - print("API ENDPOINT SCRAPING WITH AUTH TOKENS") - print("=" * 60) - - # Example API authentication cookies - cookies = { - "api_token": "api_abc123def456", - "client_id": "client_789", - "access_token": "access_xyz789", - "refresh_token": "refresh_abc123", - "scope": "read:all", - } - - website_url = "https://api.example.com/data" - user_prompt = "Extract data from the API response" - - try: - client = Client.from_env() - response = client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - headers={"Accept": "application/json", "Content-Type": "application/json"}, - ) - - print("✅ API scraping completed successfully") - print(json.dumps(response, indent=2)) - client.close() - - except Exception as e: - print(f"❌ Error in API scraping: {str(e)}") - - -def main(): - """Run all cookies integration examples.""" - # Check if API key is available - if not os.getenv("SGAI_API_KEY"): - print("Error: SGAI_API_KEY
not found in .env file") - print("Please create a .env file with your API key:") - print("SGAI_API_KEY=your_api_key_here") - return - - print("🍪 COOKIES INTEGRATION EXAMPLES") - print( - "This demonstrates various real-world scenarios where cookies are essential for web scraping." - ) - - # Run all examples - scrape_ecommerce_with_auth() - scrape_social_media_with_session() - scrape_news_with_preferences() - scrape_banking_with_secure_cookies() - scrape_api_with_auth_tokens() - - print("\n" + "=" * 60) - print("✅ All examples completed!") - print("=" * 60) - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/advanced_features/mock/async_mock_mode_example.py b/scrapegraph-py/examples/advanced_features/mock/async_mock_mode_example.py deleted file mode 100644 index 31d33285..00000000 --- a/scrapegraph-py/examples/advanced_features/mock/async_mock_mode_example.py +++ /dev/null @@ -1,61 +0,0 @@ -import asyncio - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - - -sgai_logger.set_logging(level="INFO") - - -async def basic_mock_usage(): - # Initialize the client with mock mode enabled - async with AsyncClient.from_env(mock=True) as client: - print("\n-- get_credits (mock) --") - print(await client.get_credits()) - - print("\n-- markdownify (mock) --") - md = await client.markdownify(website_url="https://example.com") - print(md) - - print("\n-- get_markdownify (mock) --") - md_status = await client.get_markdownify("00000000-0000-0000-0000-000000000123") - print(md_status) - - print("\n-- smartscraper (mock) --") - ss = await client.smartscraper(user_prompt="Extract title", website_url="https://example.com") - print(ss) - - -async def mock_with_path_overrides(): - # Initialize the client with mock mode and custom responses - async with AsyncClient.from_env( - mock=True, - mock_responses={ - "/v1/credits": {"remaining_credits": 42, "total_credits_used": 58} - }, - ) as client: - print("\n-- get_credits
with override (mock) --") - print(await client.get_credits()) - - -async def mock_with_custom_handler(): - def handler(method, url, kwargs): - return {"handled_by": "custom_handler", "method": method, "url": url} - - # Initialize the client with mock mode and custom handler - async with AsyncClient.from_env(mock=True, mock_handler=handler) as client: - print("\n-- searchscraper via custom handler (mock) --") - resp = await client.searchscraper(user_prompt="Search something") - print(resp) - - -async def main(): - await basic_mock_usage() - await mock_with_path_overrides() - await mock_with_custom_handler() - - -if __name__ == "__main__": - asyncio.run(main()) - - diff --git a/scrapegraph-py/examples/advanced_features/mock/mock_mode_example.py b/scrapegraph-py/examples/advanced_features/mock/mock_mode_example.py deleted file mode 100644 index c2bc8b1c..00000000 --- a/scrapegraph-py/examples/advanced_features/mock/mock_mode_example.py +++ /dev/null @@ -1,58 +0,0 @@ -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - - -sgai_logger.set_logging(level="INFO") - - -def basic_mock_usage(): - # Initialize the client with mock mode enabled - client = Client.from_env(mock=True) - - print("\n-- get_credits (mock) --") - print(client.get_credits()) - - print("\n-- markdownify (mock) --") - md = client.markdownify(website_url="https://example.com") - print(md) - - print("\n-- get_markdownify (mock) --") - md_status = client.get_markdownify("00000000-0000-0000-0000-000000000123") - print(md_status) - - print("\n-- smartscraper (mock) --") - ss = client.smartscraper(user_prompt="Extract title", website_url="https://example.com") - print(ss) - - -def mock_with_path_overrides(): - # Initialize the client with mock mode and custom responses - client = Client.from_env( - mock=True, - mock_responses={ - "/v1/credits": {"remaining_credits": 42, "total_credits_used": 58} - }, - ) - - print("\n-- get_credits with override (mock) --") - 
print(client.get_credits()) - - -def mock_with_custom_handler(): - def handler(method, url, kwargs): - return {"handled_by": "custom_handler", "method": method, "url": url} - - # Initialize the client with mock mode and custom handler - client = Client.from_env(mock=True, mock_handler=handler) - - print("\n-- searchscraper via custom handler (mock) --") - resp = client.searchscraper(user_prompt="Search something") - print(resp) - - -if __name__ == "__main__": - basic_mock_usage() - mock_with_path_overrides() - mock_with_custom_handler() - - diff --git a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_agenticscraper_example.py b/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_agenticscraper_example.py deleted file mode 100644 index e8587031..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_agenticscraper_example.py +++ /dev/null @@ -1,197 +0,0 @@ -#!/usr/bin/env python3 -""" -Async Step-by-Step AgenticScraper Example - -This example demonstrates how to use the AgenticScraper API asynchronously -for automated browser interactions with proper async/await patterns. -""" - -import asyncio -import json -import os -import time - -import aiohttp -from dotenv import load_dotenv - -# Load environment variables from .env file -load_dotenv() - - -async def agentic_scraper_request(): - """Example of making an async request to the agentic scraper API""" - - # Get API key from .env file - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - raise ValueError( - "API key must be provided or set in .env file as SGAI_API_KEY. 
" - "Create a .env file with: SGAI_API_KEY=your_api_key_here" - ) - - steps = [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login" - ] - website_url = "https://dashboard.scrapegraphai.com/" - - headers = { - "SGAI-APIKEY": api_key, - "Content-Type": "application/json", - } - - body = { - "url": website_url, - "use_session": True, - "steps": steps, - } - - print("🤖 Starting Async Agentic Scraper with Automated Actions...") - print(f"🌐 Website URL: {website_url}") - print(f"🔧 Use Session: True") - print(f"📋 Steps: {len(steps)} automated actions") - print("\n" + "=" * 60) - - # Start timer - start_time = time.time() - print( - f"⏱️ Timer started at: {time.strftime('%H:%M:%S', time.localtime(start_time))}" - ) - print("🔄 Processing request asynchronously...") - - try: - async with aiohttp.ClientSession() as session: - async with session.post( - "http://localhost:8001/v1/agentic-scrapper", - json=body, - headers=headers, - ) as response: - # Calculate execution time - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Total execution time: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print( - f"📊 Performance: {execution_time:.1f}s ({execution_minutes:.1f}m) for {len(steps)} steps" - ) - - if response.status == 200: - result = await response.json() - print("✅ Request completed successfully!") - print(f"📊 Request ID: {result.get('request_id', 'N/A')}") - print(f"🔄 Status: {result.get('status', 'N/A')}") - - if result.get("error"): - print(f"❌ Error: {result['error']}") - else: - print("\n📋 EXTRACTED DATA:") - print("=" * 60) - - # Pretty print the result with proper indentation - if "result" in result: - print(json.dumps(result["result"], indent=2, ensure_ascii=False)) - else: -
print("No result data found") - - else: - response_text = await response.text() - print(f"❌ Request failed with status code: {response.status}") - print(f"Response: {response_text}") - - except aiohttp.ClientError as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"🌐 Network error: {str(e)}") - except Exception as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"💥 Unexpected error: {str(e)}") - - -def show_curl_equivalent(): - """Show the equivalent curl command for reference""" - - # Load environment variables from .env file - load_dotenv() - - api_key = os.getenv("SGAI_API_KEY", "your-api-key-here") - curl_command = f""" -curl --location 'http://localhost:8001/v1/agentic-scrapper' \\ ---header 'SGAI-APIKEY: {api_key}' \\ ---header 'Content-Type: application/json' \\ ---data-raw '{{ - "url": "https://dashboard.scrapegraphai.com/", - "use_session": true, - "steps": [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login" - ] -}}' - """ - - print("Equivalent curl command:") - print(curl_command) - - -async def main(): - """Main async function to run the agentic scraper example""" - try: - print("🤖 ASYNC AGENTIC SCRAPER EXAMPLE") - print("=" * 60) - print("This example demonstrates async automated browser interactions") - print() - - # Show the curl equivalent - show_curl_equivalent() - - print("\n" + "=" * 60) - - # Make the actual API request
- await agentic_scraper_request() - - print("\n" + "=" * 60) - print("Example completed!") - print("\nKey takeaways:") - print("1. Async agentic scraper enables non-blocking automation") - print("2. Each step is executed sequentially but asynchronously") - print("3. Session management allows for complex workflows") - print("4. Perfect for concurrent automation tasks") - print("\nNext steps:") - print("- Run multiple agentic scrapers concurrently") - print("- Combine with other async operations") - print("- Implement async error handling") - print("- Use async session management for efficiency") - - except Exception as e: - print(f"💥 Error occurred: {str(e)}") - print("\n🛠️ Troubleshooting:") - print("1. Make sure your .env file contains SGAI_API_KEY") - print("2. Ensure the API server is running on localhost:8001") - print("3. Check your internet connection") - print("4. Verify the target website is accessible") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_cookies_example.py b/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_cookies_example.py deleted file mode 100644 index e40e6540..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_cookies_example.py +++ /dev/null @@ -1,358 +0,0 @@ -#!/usr/bin/env python3 -""" -Async Step-by-Step Cookies Example - -This example demonstrates how to use cookies with SmartScraper API using async/await patterns. -It shows how to set up and execute requests with custom cookies for authentication and session management.
-""" - -import asyncio -import json -import logging -import os -import time - -import httpx -from dotenv import load_dotenv - -# Configure logging -logging.basicConfig( - level=logging.INFO, - format="%(asctime)s - %(levelname)s - %(message)s", - handlers=[logging.StreamHandler()], -) -logger = logging.getLogger(__name__) - -# Load environment variables from .env file -load_dotenv() - - -async def step_1_environment_setup(): - """Step 1: Set up environment and API key""" - print("STEP 1: Environment Setup") - print("=" * 40) - - # Check if API key is available - api_key = os.getenv("TEST_API_KEY") - if not api_key: - print("❌ Error: TEST_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export TEST_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: TEST_API_KEY=your-api-key-here") - return None - - print("✅ API key found in environment") - print(f"🔑 API Key: {api_key[:8]}...{api_key[-4:]}") - return api_key - - -async def step_2_server_connectivity_check(api_key): - """Step 2: Check server connectivity""" - print("\nSTEP 2: Server Connectivity Check") - print("=" * 40) - - url = "http://localhost:8001/v1/smartscraper" - - try: - async with httpx.AsyncClient(timeout=5.0) as client: - # Try to access the health endpoint - health_url = url.replace("/v1/smartscraper", "/healthz") - response = await client.get(health_url) - - if response.status_code == 200: - print("✅ Server is accessible") - print(f"🔗 Health endpoint: {health_url}") - return True - else: - print( - f"❌ Server health check failed with status {response.status_code}" - ) - return False - except Exception as e: - print(f"❌ Server connectivity check failed: {e}") - print("Please ensure the server is running:") - print(" poetry run uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload") - return False - - -def step_3_define_cookies(): - """Step 3: Define cookies for authentication""" - print("\nSTEP 3: Define Cookies")
- print("=" * 40) - - # Example cookies for a website that requires authentication - cookies = { - "session_id": "abc123def456ghi789", - "user_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...", - "remember_me": "true", - "language": "en", - "theme": "dark", - } - - print("🍪 Cookies configured:") - for key, value in cookies.items(): - if "token" in key.lower(): - # Mask sensitive tokens - masked_value = value[:20] + "..." if len(value) > 20 else value - print(f" {key}: {masked_value}") - else: - print(f" {key}: {value}") - - print(f"\n📊 Total cookies: {len(cookies)}") - return cookies - - -def step_4_define_request_parameters(): - """Step 4: Define the request parameters""" - print("\nSTEP 4: Define Request Parameters") - print("=" * 40) - - # Configuration parameters - website_url = "https://example.com/dashboard" - user_prompt = "Extract user profile information and account details" - - print("🌐 Website URL:") - print(f" {website_url}") - print("\n📝 User Prompt:") - print(f" {user_prompt}") - print("\n🎯 Goal: Access authenticated content using cookies") - - return {"website_url": website_url, "user_prompt": user_prompt} - - -def step_5_prepare_headers(api_key): - """Step 5: Prepare request headers""" - print("\nSTEP 5: Prepare Request Headers") - print("=" * 40) - - headers = { - "SGAI-APIKEY": api_key, - "Content-Type": "application/json", - "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36", - "Accept": "application/json", - "Accept-Language": "en-US,en;q=0.9", - "Accept-Encoding": "gzip, deflate, br", - "Connection": "keep-alive", - } - - print("📋 Headers configured:") - for key, value in headers.items(): - if key == "SGAI-APIKEY": - print(f" {key}: {value[:10]}...{value[-10:]}") # Mask API key - else: - print(f" {key}: {value}") - - return headers - - -async def step_6_execute_cookies_request(headers, cookies, config): - """Step 6: Execute the request with
cookies""" - print("\nSTEP 6: Execute Request with Cookies") - print("=" * 40) - - url = "http://localhost:8001/v1/smartscraper" - - # Request payload with cookies - payload = { - "website_url": config["website_url"], - "user_prompt": config["user_prompt"], - "output_schema": {}, - "cookies": cookies, - } - - print("🚀 Starting request with cookies...") - print("🍪 Using authentication cookies for access...") - - try: - # Start timing - start_time = time.time() - - # Use timeout for cookies requests - async with httpx.AsyncClient(timeout=120.0) as client: - response = await client.post(url, headers=headers, json=payload) - - # Calculate duration - duration = time.time() - start_time - - print(f"✅ Request completed in {duration:.2f} seconds") - print(f"📊 Response Status: {response.status_code}") - - if response.status_code == 200: - result = response.json() - return result, duration - else: - print(f"❌ Request failed with status {response.status_code}") - print(f"Response: {response.text}") - return None, duration - - except httpx.TimeoutException: - duration = time.time() - start_time - print(f"❌ Request timed out after {duration:.2f} seconds (>120s timeout)") - print("This may indicate authentication issues or slow response.") - return None, duration - - except httpx.RequestError as e: - duration = time.time() - start_time - print(f"❌ Request error after {duration:.2f} seconds: {e}") - print("Common causes:") - print(" - Server is not running") - print(" - Invalid cookies") - print(" - Network connectivity issues") - return None, duration - - except Exception as e: - duration = time.time() - start_time - print(f"❌ Unexpected error after {duration:.2f} seconds: {e}") - return None, duration - - -def step_7_process_results(result, duration): - """Step 7: Process and display the results""" - print("\nSTEP 7: Process Results") - print("=" * 40) - - if result is None: - print("❌ No results to process") - return - - print("📋 Processing
authenticated results...") - - # Display results based on type - if isinstance(result, dict): - print("\n🔍 Response Structure:") - print(json.dumps(result, indent=2, ensure_ascii=False)) - - # Check for authentication success indicators - if "result" in result: - print("\n✨ Authentication successful! Data extracted with cookies") - - elif isinstance(result, list): - print(f"\n✅ Authentication successful! Extracted {len(result)} items") - - # Show first few items - print("\n📦 Sample Results:") - for i, item in enumerate(result[:3]): # Show first 3 items - print(f" {i+1}. {item}") - - if len(result) > 3: - print(f" ... and {len(result) - 3} more items") - - else: - print(f"\n📋 Result: {result}") - - print(f"\n⏱️ Total processing time: {duration:.2f} seconds") - - -def step_8_show_curl_equivalent(api_key, cookies, config): - """Step 8: Show equivalent curl command""" - print("\nSTEP 8: Equivalent curl Command") - print("=" * 40) - - # Convert cookies dict to curl format - cookies_str = "; ".join([f"{k}={v}" for k, v in cookies.items()]) - - curl_command = f""" -curl --location 'http://localhost:8001/v1/smartscraper' \\ ---header 'SGAI-APIKEY: {api_key}' \\ ---header 'Content-Type: application/json' \\ ---header 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36' \\ ---header 'Accept: application/json' \\ ---header 'Accept-Language: en-US,en;q=0.9' \\ ---header 'Accept-Encoding: gzip, deflate, br' \\ ---header 'Connection: keep-alive' \\ ---cookie '{cookies_str}' \\ ---data '{{ - "website_url": "{config['website_url']}", - "user_prompt": "{config['user_prompt']}", - "output_schema": {{}}, - "cookies": {json.dumps(cookies)} -}}' - """ - - print("Equivalent curl command:") - print(curl_command) - - -def step_9_cookie_management_tips(): - """Step 9: Provide cookie management tips""" - print("\nSTEP 9: Cookie Management Tips") - print("=" * 40) - - print("🍪 Best Practices
for Cookie Management:") - print("1. 🔐 Store sensitive cookies securely (environment variables)") - print("2. ⏰ Set appropriate expiration times") - print("3. 🧹 Clean up expired cookies regularly") - print("4. 🔄 Refresh tokens before they expire") - print("5. 🛡️ Use HTTPS for cookie transmission") - print("6. 📝 Log cookie usage for debugging") - print("7. 🚫 Don't hardcode cookies in source code") - print("8. 🔍 Validate cookie format before sending") - - -async def main(): - """Main function to run the async step-by-step cookies example""" - total_start_time = time.time() - logger.info("Starting Async Step-by-Step Cookies Example") - - print("ScrapeGraph SDK - Async Step-by-Step Cookies Example") - print("=" * 60) - print("This example shows the complete async process of setting up and") - print("executing requests with cookies for authentication") - print("=" * 60) - - # Step 1: Environment setup - api_key = await step_1_environment_setup() - if not api_key: - return - - # Step 2: Server connectivity check - server_ok = await step_2_server_connectivity_check(api_key) - if not server_ok: - return - - # Step 3: Define cookies - cookies = step_3_define_cookies() - - # Step 4: Define request parameters - config = step_4_define_request_parameters() - - # Step 5: Prepare headers - headers = step_5_prepare_headers(api_key) - - # Step 6: Execute request - result, duration = await step_6_execute_cookies_request(headers, cookies, config) - - # Step 7: Process results - step_7_process_results(result, duration) - - # Step 8: Show curl equivalent - step_8_show_curl_equivalent(api_key, cookies, config) - - # Step 9: Cookie management tips - step_9_cookie_management_tips() - - total_duration = time.time() - total_start_time - logger.info( - f"Example completed!
Total execution time: {total_duration:.2f} seconds" - ) - - print("\n" + "=" * 60) - print("Async step-by-step cookies example completed!") - print(f"⏱️ Total execution time: {total_duration:.2f} seconds") - print("\nKey takeaways:") - print("1. Async/await provides better performance for I/O operations") - print("2. Cookies enable access to authenticated content") - print("3. Always validate API key and server connectivity first") - print("4. Secure cookie storage is crucial for production use") - print("5. Handle authentication errors gracefully") - print("6. Use equivalent curl commands for testing") - print("\nNext steps:") - print("- Implement secure cookie storage") - print("- Add cookie refresh logic") - print("- Handle authentication failures") - print("- Monitor cookie expiration") - print("- Implement retry logic for failed requests") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_movements_example.py b/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_movements_example.py deleted file mode 100644 index 59663295..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_movements_example.py +++ /dev/null @@ -1,479 +0,0 @@ -#!/usr/bin/env python3 -""" -Async Step-by-Step SmartScraper Movements Example - -This example demonstrates how to use interactive movements with SmartScraper API -using async/await patterns for better performance and concurrency.
-""" - -import asyncio -import json -import logging -import os -import time - -import httpx -from dotenv import load_dotenv - -# Configure logging -logging.basicConfig( - level=logging.INFO, - format="%(asctime)s - %(levelname)s - %(message)s", - handlers=[logging.StreamHandler()], -) -logger = logging.getLogger(__name__) - - -async def check_server_connectivity(base_url: str) -> bool: - """Check if the server is running and accessible""" - try: - async with httpx.AsyncClient(timeout=5.0) as client: - # Try to access the health endpoint - health_url = base_url.replace("/v1/smartscraper", "/healthz") - response = await client.get(health_url) - return response.status_code == 200 - except Exception: - return False - - -async def async_smart_scraper_movements(): - """Async example of making a movements request to the smartscraper API""" - - # Load environment variables from .env file - load_dotenv() - - # Get API key from .env file - api_key = os.getenv("TEST_API_KEY") - if not api_key: - raise ValueError( - "API key must be provided or set in .env file as TEST_API_KEY. 
" - "Create a .env file with: TEST_API_KEY=your_api_key_here" - ) - - steps = [ - "click on search bar", - "wait for 500ms", - "fill email input box with mdehsan873@gmail.com", - "wait a sec", - "click on the first time of search result", - "wait for 2 seconds to load the result of search", - ] - website_url = "https://github.com/" - user_prompt = "Extract user profile" - - headers = { - "SGAI-APIKEY": api_key, - "Content-Type": "application/json", - } - - body = { - "website_url": website_url, - "user_prompt": user_prompt, - "output_schema": {}, - "steps": steps, - } - - print("🚀 Starting Async Smart Scraper with Interactive Movements...") - print(f"🌐 Website URL: {website_url}") - print(f"🎯 User Prompt: {user_prompt}") - print(f"📋 Steps: {len(steps)} interactive steps") - print("\n" + "=" * 60) - - # Start timer - start_time = time.time() - print( - f"⏱️ Timer started at: {time.strftime('%H:%M:%S', time.localtime(start_time))}" - ) - print("🔄 Processing async request...") - - try: - # Use longer timeout for movements requests as they may take more time - async with httpx.AsyncClient(timeout=300.0) as client: - response = await client.post( - "http://localhost:8001/v1/smartscraper", - json=body, - headers=headers, - ) - - # Calculate execution time - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Total execution time: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print( - f"📊 Performance: {execution_time:.1f}s ({execution_minutes:.1f}m) for {len(steps)} steps" - ) - - if response.status_code == 200: - result = response.json() - print("✅ Request completed successfully!") - print(f"📊 Request ID: {result.get('request_id', 'N/A')}") - print(f"🔄 Status: {result.get('status', 'N/A')}") - - if result.get("error"): - print(f"❌ Error:
{result['error']}") - else: - print("\n📋 EXTRACTED DATA:") - print("=" * 60) - - # Pretty print the result with proper indentation - if "result" in result: - print( - json.dumps(result["result"], indent=2, ensure_ascii=False) - ) - else: - print("No result data found") - - else: - print(f"❌ Request failed with status code: {response.status_code}") - print(f"Response: {response.text}") - - except httpx.TimeoutException: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Execution time before timeout: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print("⏰ Request timed out after 300 seconds") - except httpx.RequestError as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"🌐 Network error: {str(e)}") - except Exception as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"💥 Unexpected error: {str(e)}") - - -async def async_markdownify_movements(): - """ - Async enhanced markdownify function with comprehensive features and timing. - - Note: Markdownify doesn't support interactive movements like Smart Scraper. - Instead, it excels at converting websites to clean markdown format.
- """ - # Load environment variables from .env file - load_dotenv() - - # Get API key from .env file - api_key = os.getenv("TEST_API_KEY") - if not api_key: - raise ValueError( - "API key must be provided or set in .env file as TEST_API_KEY. " - "Create a .env file with: TEST_API_KEY=your_api_key_here" - ) - - steps = [ - "click on search bar", - "wait for 500ms", - "fill email input box with mdehsan873@gmail.com", - "wait a sec", - "click on the first time of search result", - "wait for 2 seconds to load the result of search", - ] - - # Target website configuration - website_url = "https://scrapegraphai.com/" - - # Enhanced headers for better scraping (similar to interactive movements) - custom_headers = { - "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", - "Accept-Language": "en-US,en;q=0.5", - "Accept-Encoding": "gzip, deflate, br", - "Connection": "keep-alive", - "Upgrade-Insecure-Requests": "1", - } - - # Prepare API request headers - headers = { - "SGAI-APIKEY": api_key, - "Content-Type": "application/json", - } - - # Request body for markdownify - body = { - "website_url": website_url, - "headers": custom_headers, - "steps": steps, - } - - print("🚀 Starting Async Markdownify with Enhanced Features...") - print(f"🌐 Website URL: {website_url}") - print(f"📋 Custom Headers: {len(custom_headers)} headers configured") - print("🎯 Goal: Convert website to clean markdown format") - print("\n" + "=" * 60) - - # Start timer - start_time = time.time() - print( - f"⏱️ Timer started at: {time.strftime('%H:%M:%S', time.localtime(start_time))}" - ) - print("🔄 Processing async markdown conversion...") - - try: - async with httpx.AsyncClient(timeout=120.0) as client: - response = await client.post( - "http://localhost:8001/v1/markdownify", - json=body, - headers=headers, - ) - - #
Calculate execution time - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Total execution time: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print( - f"📊 Performance: {execution_time:.1f}s ({execution_minutes:.1f}m) for markdown conversion" - ) - - if response.status_code == 200: - result = response.json() - markdown_content = result.get("result", "") - - print("✅ Request completed successfully!") - print(f"📊 Request ID: {result.get('request_id', 'N/A')}") - print(f"🔄 Status: {result.get('status', 'N/A')}") - print(f"📝 Content Length: {len(markdown_content)} characters") - - if result.get("error"): - print(f"❌ Error: {result['error']}") - else: - print("\n📋 MARKDOWN CONVERSION RESULTS:") - print("=" * 60) - - # Display markdown statistics - lines = markdown_content.split("\n") - words = len(markdown_content.split()) - - print("📊 Statistics:") - print(f" - Total Lines: {len(lines)}") - print(f" - Total Words: {words}") - print(f" - Total Characters: {len(markdown_content)}") - print( - f" - Processing Speed: {len(markdown_content)/execution_time:.0f} chars/second" - ) - - # Display first 500 characters - print("\n🔍 First 500 characters:") - print("-" * 50) - print(markdown_content[:500]) - if len(markdown_content) > 500: - print("...") - print("-" * 50) - - # Save to file - filename = f"async_markdownify_output_{int(time.time())}.md" - await save_markdown_to_file_async(markdown_content, filename) - - # Display content analysis - analyze_markdown_content(markdown_content) - - else: - print(f"❌ Request failed with status code: {response.status_code}") - print(f"Response: {response.text}") - - except httpx.TimeoutException: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( -
f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Execution time before timeout: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print("โฐ Request timed out after 120 seconds") - except httpx.RequestError as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"๐ŸŒ Network error: {str(e)}") - except Exception as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"๐Ÿ’ฅ Unexpected error: {str(e)}") - - -async def save_markdown_to_file_async(markdown_content: str, filename: str): - """ - Save markdown content to a file with enhanced error handling (async version). - - Args: - markdown_content: The markdown content to save - filename: The name of the file to save to - """ - try: - # Use asyncio to run the file operation in a thread pool - await asyncio.to_thread(_write_file_sync, markdown_content, filename) - print(f"๐Ÿ’พ Markdown saved to: {filename}") - except Exception as e: - print(f"โŒ Error saving file: {str(e)}") - - -def _write_file_sync(markdown_content: str, filename: str): - """Synchronous file writing function for asyncio.to_thread""" - with open(filename, "w", encoding="utf-8") as f: - f.write(markdown_content) - - -def analyze_markdown_content(markdown_content: str): - """ - Analyze the markdown content and provide insights. 
- - Args: - markdown_content: The markdown content to analyze - """ - print("\n๐Ÿ” CONTENT ANALYSIS:") - print("-" * 50) - - # Count different markdown elements - lines = markdown_content.split("\n") - headers = [line for line in lines if line.strip().startswith("#")] - links = [line for line in lines if "[" in line and "](" in line] - code_blocks = markdown_content.count("```") - - print(f"๐Ÿ“‘ Headers found: {len(headers)}") - print(f"๐Ÿ”— Links found: {len(links)}") - print( - f"๐Ÿ’ป Code blocks: {code_blocks // 2}" - ) # Divide by 2 since each block has opening and closing - - # Show first few headers if they exist - if headers: - print("\n๐Ÿ“‹ First few headers:") - for i, header in enumerate(headers[:3]): - print(f" {i+1}. {header.strip()}") - if len(headers) > 3: - print(f" ... and {len(headers) - 3} more") - - -def show_curl_equivalent(): - """Show the equivalent curl command for reference""" - - # Load environment variables from .env file - load_dotenv() - - api_key = os.getenv("TEST_API_KEY", "your-api-key-here") - curl_command = f""" -curl --location 'http://localhost:8001/v1/smartscraper' \\ ---header 'SGAI-APIKEY: {api_key}' \\ ---header 'Content-Type: application/json' \\ ---data '{{ - "website_url": "https://github.com/", - "user_prompt": "Extract user profile", - "output_schema": {{}}, - "steps": [ - "click on search bar", - "wait for 500ms", - "fill email input box with mdehsan873@gmail.com", - "wait a sec", - "click on the first time of search result", - "wait for 2 seconds to load the result of search" - ] -}}' - """ - - print("Equivalent curl command:") - print(curl_command) - - -async def main(): - """Main function to run the async movements examples""" - total_start_time = time.time() - logger.info("Starting Async SmartScraper Movements Examples") - - try: - print("๐ŸŽฏ ASYNC SMART SCRAPER MOVEMENTS EXAMPLES") - print("=" * 60) - print("This example demonstrates async interactive movements with timing") - print() - - # Show the curl 
equivalent - show_curl_equivalent() - - print("\n" + "=" * 60) - - # Make the actual API requests - print("1️⃣ Running SmartScraper Movements Example...") - await async_smart_scraper_movements() - - print("\n" + "=" * 60) - print("2️⃣ Running Markdownify Movements Example...") - await async_markdownify_movements() - - total_duration = time.time() - total_start_time - logger.info( - f"Examples completed! Total execution time: {total_duration:.2f} seconds" - ) - - print("\n" + "=" * 60) - print("Examples completed!") - print(f"⏱️ Total execution time: {total_duration:.2f} seconds") - print("\nKey takeaways:") - print("1. Async/await provides better performance for I/O operations") - print("2. Movements allow for interactive browser automation") - print("3. Each step is executed sequentially") - print("4. Timing is crucial for successful interactions") - print("5. Error handling is important for robust automation") - print("\nNext steps:") - print("- Customize the steps for your specific use case") - print("- Add more complex interactions") - print("- Implement retry logic for failed steps") - print("- Use structured output schemas for better data extraction") - - except Exception as e: - print(f"💥 Error occurred: {str(e)}") - print("\n🛠️ Troubleshooting:") - print("1. Make sure your .env file contains TEST_API_KEY") - print("2. Ensure the API server is running on localhost:8001") - print("3. Check your internet connection") - print("4. 
Verify the target website is accessible") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_pagination_example.py b/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_pagination_example.py deleted file mode 100644 index d4177f5a..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_pagination_example.py +++ /dev/null @@ -1,315 +0,0 @@ -#!/usr/bin/env python3 -""" -Async Step-by-Step Pagination Example - -This example demonstrates the pagination process step by step using async/await patterns, -showing each stage of setting up and executing a paginated SmartScraper request. -""" - -import asyncio -import json -import logging -import os -import time - -import httpx -from dotenv import load_dotenv - -# Configure logging -logging.basicConfig( - level=logging.INFO, - format="%(asctime)s - %(levelname)s - %(message)s", - handlers=[logging.StreamHandler()], -) -logger = logging.getLogger(__name__) - -# Load environment variables from .env file -load_dotenv() - - -async def step_1_environment_setup(): - """Step 1: Set up environment and API key""" - print("STEP 1: Environment Setup") - print("=" * 40) - - # Check if API key is available - api_key = os.getenv("TEST_API_KEY") - if not api_key: - print("❌ Error: TEST_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export TEST_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: TEST_API_KEY=your-api-key-here") - return None - - print("โœ… API key found in environment") - print(f"๐Ÿ”‘ API Key: {api_key[:8]}...{api_key[-4:]}") - return api_key - - -async def step_2_server_connectivity_check(api_key): - """Step 2: Check server connectivity""" - print("\nSTEP 2: Server Connectivity Check") - print("=" * 40) - - url = "http://localhost:8001/v1/smartscraper" - - try: - async with httpx.AsyncClient(timeout=5.0) as client: - # Try to access the health endpoint - health_url = url.replace("/v1/smartscraper", "/healthz") - response = await client.get(health_url) - - if response.status_code == 200: - print("โœ… Server is accessible") - print(f"๐Ÿ”— Health endpoint: {health_url}") - return True - else: - print( - f"โŒ Server health check failed with status {response.status_code}" - ) - return False - except Exception as e: - print(f"โŒ Server connectivity check failed: {e}") - print("Please ensure the server is running:") - print(" poetry run uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload") - return False - - -def step_3_define_request_parameters(): - """Step 3: Define the request parameters""" - print("\nSTEP 3: Define Request Parameters") - print("=" * 40) - - # Configuration parameters - website_url = "https://www.amazon.in/s?k=tv&crid=1TEF1ZFVLU8R8&sprefix=t%2Caps%2C390&ref=nb_sb_noss_2" - user_prompt = "Extract all product info including name, price, rating, image_url, and description" - total_pages = 3 - - print("๐ŸŒ Website URL:") - print(f" {website_url}") - print("\n๐Ÿ“ User Prompt:") - print(f" {user_prompt}") - print(f"\n๐Ÿ“„ Total Pages: {total_pages}") - print(f"๐Ÿ“Š Expected Products: ~{total_pages * 20} (estimated)") - - return { - "website_url": website_url, - "user_prompt": user_prompt, - "total_pages": total_pages, - } - - -def step_4_prepare_headers(api_key): - """Step 4: Prepare request headers""" - print("\nSTEP 4: Prepare Request Headers") - print("=" * 40) - - headers = { - 
"sec-ch-ua-platform": '"macOS"', - "SGAI-APIKEY": api_key, - "Referer": "https://dashboard.scrapegraphai.com/", - "sec-ch-ua": '"Google Chrome";v="137", "Chromium";v="137", "Not/A)Brand";v="24"', - "sec-ch-ua-mobile": "?0", - "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36", - "Accept": "application/json", - "Content-Type": "application/json", - } - - print("๐Ÿ“‹ Headers configured:") - for key, value in headers.items(): - if key == "SGAI-APIKEY": - print(f" {key}: {value[:10]}...{value[-10:]}") # Mask API key - else: - print(f" {key}: {value}") - - return headers - - -async def step_5_execute_pagination_request(headers, config): - """Step 5: Execute the pagination request""" - print("\nSTEP 5: Execute Pagination Request") - print("=" * 40) - - url = "http://localhost:8001/v1/smartscraper" - - # Request payload with pagination - payload = { - "website_url": config["website_url"], - "user_prompt": config["user_prompt"], - "output_schema": {}, - "total_pages": config["total_pages"], - } - - print("๐Ÿš€ Starting pagination request...") - print("โฑ๏ธ This may take several minutes for multiple pages...") - - try: - # Start timing - start_time = time.time() - - # Use longer timeout for pagination requests as they may take more time - async with httpx.AsyncClient(timeout=600.0) as client: - response = await client.post(url, headers=headers, json=payload) - - # Calculate duration - duration = time.time() - start_time - - print(f"โœ… Request completed in {duration:.2f} seconds") - print(f"๐Ÿ“Š Response Status: {response.status_code}") - - if response.status_code == 200: - result = response.json() - return result, duration - else: - print(f"โŒ Request failed with status {response.status_code}") - print(f"Response: {response.text}") - return None, duration - - except httpx.TimeoutException: - duration = time.time() - start_time - print(f"โŒ Request timed out after {duration:.2f} seconds 
(>600s timeout)") - print( - "This may indicate the server is taking too long to process the pagination request." - ) - return None, duration - - except httpx.RequestError as e: - duration = time.time() - start_time - print(f"โŒ Request error after {duration:.2f} seconds: {e}") - print("Common causes:") - print(" - Server is not running") - print(" - Wrong port (check server logs)") - print(" - Network connectivity issues") - return None, duration - - except Exception as e: - duration = time.time() - start_time - print(f"โŒ Unexpected error after {duration:.2f} seconds: {e}") - return None, duration - - -def step_6_process_results(result, duration): - """Step 6: Process and display the results""" - print("\nSTEP 6: Process Results") - print("=" * 40) - - if result is None: - print("โŒ No results to process") - return - - print("๐Ÿ“‹ Processing pagination results...") - - # Display results based on type - if isinstance(result, dict): - print("\n๐Ÿ” Response Structure:") - print(json.dumps(result, indent=2, ensure_ascii=False)) - - # Check for pagination success indicators - if "data" in result: - print("\nโœจ Pagination successful! Data extracted from multiple pages") - - elif isinstance(result, list): - print(f"\nโœ… Pagination successful! Extracted {len(result)} items") - - # Show first few items - print("\n๐Ÿ“ฆ Sample Results:") - for i, item in enumerate(result[:3]): # Show first 3 items - print(f" {i+1}. {item}") - - if len(result) > 3: - print(f" ... 
and {len(result) - 3} more items") - - else: - print(f"\n๐Ÿ“‹ Result: {result}") - - print(f"\nโฑ๏ธ Total processing time: {duration:.2f} seconds") - - -def step_7_show_curl_equivalent(api_key, config): - """Step 7: Show equivalent curl command""" - print("\nSTEP 7: Equivalent curl Command") - print("=" * 40) - - curl_command = f""" -curl --location 'http://localhost:8001/v1/smartscraper' \\ ---header 'sec-ch-ua-platform: "macOS"' \\ ---header 'SGAI-APIKEY: {api_key}' \\ ---header 'Referer: https://dashboard.scrapegraphai.com/' \\ ---header 'sec-ch-ua: "Google Chrome";v="137", "Chromium";v="137", "Not/A)Brand";v="24"' \\ ---header 'sec-ch-ua-mobile: ?0' \\ ---header 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36' \\ ---header 'Accept: application/json' \\ ---header 'Content-Type: application/json' \\ ---data '{{ - "website_url": "{config['website_url']}", - "user_prompt": "{config['user_prompt']}", - "output_schema": {{}}, - "total_pages": {config['total_pages']} -}}' - """ - - print("Equivalent curl command:") - print(curl_command) - - -async def main(): - """Main function to run the async step-by-step pagination example""" - total_start_time = time.time() - logger.info("Starting Async Step-by-Step Pagination Example") - - print("ScrapeGraph SDK - Async Step-by-Step Pagination Example") - print("=" * 60) - print("This example shows the complete async process of setting up and") - print("executing a pagination request with SmartScraper API") - print("=" * 60) - - # Step 1: Environment setup - api_key = await step_1_environment_setup() - if not api_key: - return - - # Step 2: Server connectivity check - server_ok = await step_2_server_connectivity_check(api_key) - if not server_ok: - return - - # Step 3: Define request parameters - config = step_3_define_request_parameters() - - # Step 4: Prepare headers - headers = step_4_prepare_headers(api_key) - - # Step 5: Execute request - 
result, duration = await step_5_execute_pagination_request(headers, config) - - # Step 6: Process results - step_6_process_results(result, duration) - - # Step 7: Show curl equivalent - step_7_show_curl_equivalent(api_key, config) - - total_duration = time.time() - total_start_time - logger.info( - f"Example completed! Total execution time: {total_duration:.2f} seconds" - ) - - print("\n" + "=" * 60) - print("Async step-by-step pagination example completed!") - print(f"โฑ๏ธ Total execution time: {total_duration:.2f} seconds") - print("\nKey takeaways:") - print("1. Async/await provides better performance for I/O operations") - print("2. Always validate your API key and server connectivity first") - print("3. Define clear request parameters for structured data") - print("4. Configure pagination parameters carefully") - print("5. Handle errors gracefully with proper timeouts") - print("6. Use equivalent curl commands for testing") - print("\nNext steps:") - print("- Try different websites and prompts") - print("- Experiment with different page counts") - print("- Add error handling for production use") - print("- Consider rate limiting for large requests") - print("- Implement retry logic for failed requests") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_scrape_example.py b/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_scrape_example.py deleted file mode 100644 index 41894b9d..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/async_step_by_step_scrape_example.py +++ /dev/null @@ -1,184 +0,0 @@ -""" -Async step-by-step example demonstrating how to use the Scrape API with the scrapegraph-py async SDK. - -This example shows the basic async workflow: -1. Initialize the async client -2. Make a scrape request asynchronously -3. Handle the response -4. Save the HTML content -5. 
Basic analysis - -Requirements: -- Python 3.7+ -- scrapegraph-py -- python-dotenv -- aiohttp -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import asyncio -import os -from pathlib import Path -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient - -# Load environment variables from .env file -load_dotenv() - - -async def step_1_initialize_async_client(): - """Step 1: Initialize the scrapegraph-py async client.""" - print("๐Ÿ”‘ Step 1: Initializing async client...") - - try: - # Initialize async client using environment variable - client = AsyncClient.from_env() - print("โœ… Async client initialized successfully") - return client - except Exception as e: - print(f"โŒ Failed to initialize async client: {str(e)}") - print("Make sure you have SGAI_API_KEY in your .env file") - raise - - -async def step_2_make_async_scrape_request(client, url, render_js=False): - """Step 2: Make a scrape request asynchronously.""" - print(f"\n๐ŸŒ Step 2: Making async scrape request to {url}") - print(f"๐Ÿ”ง Render heavy JS: {render_js}") - - try: - # Make the scrape request asynchronously - result = await client.scrape( - website_url=url, - render_heavy_js=render_js - ) - print("โœ… Async scrape request completed successfully") - return result - except Exception as e: - print(f"โŒ Async scrape request failed: {str(e)}") - raise - - -def step_3_handle_response(result): - """Step 3: Handle and analyze the response.""" - print(f"\n๐Ÿ“Š Step 3: Analyzing response...") - - # Check if we got HTML content - html_content = result.get("html", "") - if not html_content: - print("โŒ No HTML content received") - return None - - # Basic response analysis - print(f"โœ… Received HTML content") - print(f"๐Ÿ“ Content length: {len(html_content):,} characters") - print(f"๐Ÿ“„ Lines: {len(html_content.splitlines()):,}") - - # Check for common HTML elements - has_doctype = html_content.strip().startswith(" 0: - print(f" {element}: 
{count}") - - # Check for JavaScript and CSS - has_js = elements["script"] > 0 - has_css = elements["style"] > 0 - - print(f"\n๐ŸŽจ Content types:") - print(f" JavaScript: {'Yes' if has_js else 'No'}") - print(f" CSS: {'Yes' if has_css else 'No'}") - - return elements - - -async def main(): - """Main function demonstrating async step-by-step scrape usage.""" - print("๐Ÿš€ Async Step-by-Step Scrape API Example") - print("=" * 55) - - # Test URL - test_url = "https://example.com" - - try: - # Step 1: Initialize async client - async with AsyncClient.from_env() as client: - print("โœ… Async client initialized successfully") - - # Step 2: Make async scrape request - result = await step_2_make_async_scrape_request(client, test_url, render_js=False) - - # Step 3: Handle response - html_content = step_3_handle_response(result) - if not html_content: - print("โŒ Cannot proceed without HTML content") - return - - # Step 4: Save content - filename = "async_example_website" - saved_file = step_4_save_html_content(html_content, filename) - - # Step 5: Basic analysis - elements = step_5_basic_analysis(html_content) - - # Summary - print(f"\n๐ŸŽฏ Summary:") - print(f"โœ… Successfully processed {test_url} asynchronously") - print(f"๐Ÿ’พ HTML saved to: {saved_file}") - print(f"๐Ÿ“Š Analyzed {len(html_content):,} characters of HTML content") - - print("โœ… Async client closed successfully") - - except Exception as e: - print(f"\n๐Ÿ’ฅ Error occurred: {str(e)}") - print("Check your API key and internet connection") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/advanced_features/steps/step_by_step_agenticscraper_example.py b/scrapegraph-py/examples/advanced_features/steps/step_by_step_agenticscraper_example.py deleted file mode 100644 index d82ad2b2..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/step_by_step_agenticscraper_example.py +++ /dev/null @@ -1,195 +0,0 @@ -#!/usr/bin/env python3 -""" -Step-by-Step AgenticScraper 
Example - -This example demonstrates how to use the AgenticScraper API for automated browser interactions. -It shows how to make actual HTTP requests with step-by-step browser actions. -""" - -import json -import os -import time - -import requests -from dotenv import load_dotenv - -# Load environment variables from .env file -load_dotenv() - - -def agentic_scraper_request(): - """Example of making a request to the agentic scraper API""" - - # Get API key from .env file - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - raise ValueError( - "API key must be provided or set in .env file as SGAI_API_KEY. " - "Create a .env file with: SGAI_API_KEY=your_api_key_here" - ) - - steps = [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login" - ] - website_url = "https://dashboard.scrapegraphai.com/" - - headers = { - "SGAI-APIKEY": api_key, - "Content-Type": "application/json", - } - - body = { - "url": website_url, - "use_session": True, - "steps": steps, - } - - print("๐Ÿค– Starting Agentic Scraper with Automated Actions...") - print(f"๐ŸŒ Website URL: {website_url}") - print(f"๐Ÿ”ง Use Session: True") - print(f"๐Ÿ“‹ Steps: {len(steps)} automated actions") - print("\n" + "=" * 60) - - # Start timer - start_time = time.time() - print( - f"โฑ๏ธ Timer started at: {time.strftime('%H:%M:%S', time.localtime(start_time))}" - ) - print("๐Ÿ”„ Processing request...") - - try: - response = requests.post( - "http://localhost:8001/v1/agentic-scrapper", - json=body, - headers=headers, - ) - - # Calculate execution time - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - - print( - f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Total execution time: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print( - f"๐Ÿ“Š Performance: {execution_time:.1f}s ({execution_minutes:.1f}m) for 
{len(steps)} steps" - ) - - if response.status_code == 200: - result = response.json() - print("โœ… Request completed successfully!") - print(f"๐Ÿ“Š Request ID: {result.get('request_id', 'N/A')}") - print(f"๐Ÿ”„ Status: {result.get('status', 'N/A')}") - - if result.get("error"): - print(f"โŒ Error: {result['error']}") - else: - print("\n๐Ÿ“‹ EXTRACTED DATA:") - print("=" * 60) - - # Pretty print the result with proper indentation - if "result" in result: - print(json.dumps(result["result"], indent=2, ensure_ascii=False)) - else: - print("No result data found") - - else: - print(f"โŒ Request failed with status code: {response.status_code}") - print(f"Response: {response.text}") - - except requests.exceptions.RequestException as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"๐ŸŒ Network error: {str(e)}") - except Exception as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"๐Ÿ’ฅ Unexpected error: {str(e)}") - - -def show_curl_equivalent(): - """Show the equivalent curl command for reference""" - - # Load environment variables from .env file - load_dotenv() - - api_key = os.getenv("SGAI_API_KEY", "your-api-key-here") - curl_command = f""" -curl --location 'http://localhost:8001/v1/agentic-scrapper' \\ ---header 'SGAI-APIKEY: {api_key}' \\ ---header 'Content-Type: application/json' \\ ---data-raw '{{ - "url": "https://dashboard.scrapegraphai.com/", - "use_session": true, - "steps": [ - "Type email@gmail.com 
in email input box", - "Type test-password@123 in password inputbox", - "click on login" - ] -}}' - """ - - print("Equivalent curl command:") - print(curl_command) - - -def main(): - """Main function to run the agentic scraper example""" - try: - print("🤖 AGENTIC SCRAPER EXAMPLE") - print("=" * 60) - print("This example demonstrates automated browser interactions") - print() - - # Show the curl equivalent - show_curl_equivalent() - - print("\n" + "=" * 60) - - # Make the actual API request - agentic_scraper_request() - - print("\n" + "=" * 60) - print("Example completed!") - print("\nKey takeaways:") - print("1. Agentic scraper enables automated browser actions") - print("2. Each step is executed sequentially") - print("3. Session management allows for complex workflows") - print("4. Perfect for login flows and form interactions") - print("\nNext steps:") - print("- Customize the steps for your specific use case") - print("- Add more complex automation sequences") - print("- Implement error handling for failed actions") - print("- Use session management for multi-step workflows") - - except Exception as e: - print(f"💥 Error occurred: {str(e)}") - print("\n🛠️ Troubleshooting:") - print("1. Make sure your .env file contains SGAI_API_KEY") - print("2. Ensure the API server is running on localhost:8001") - print("3. Check your internet connection") - print("4. 
Verify the target website is accessible") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/advanced_features/steps/step_by_step_cookies_example.py b/scrapegraph-py/examples/advanced_features/steps/step_by_step_cookies_example.py deleted file mode 100644 index 4ebc7349..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/step_by_step_cookies_example.py +++ /dev/null @@ -1,377 +0,0 @@ -#!/usr/bin/env python3 -""" -Step-by-Step Cookies Example - -This example demonstrates the cookies integration process step by step, showing each stage -of setting up and executing a SmartScraper request with cookies for authentication. -""" - -import json -import os -import time -from typing import Dict, Optional - -from dotenv import load_dotenv -from pydantic import BaseModel, Field - -from scrapegraph_py import Client -from scrapegraph_py.exceptions import APIError - -# Load environment variables from .env file -load_dotenv() - - -class CookieInfo(BaseModel): - """Model representing cookie information.""" - - cookies: Dict[str, str] = Field(description="Dictionary of cookie key-value pairs") - - -class UserProfile(BaseModel): - """Model representing user profile information.""" - - username: str = Field(description="User's username") - email: Optional[str] = Field(description="User's email address") - preferences: Optional[Dict[str, str]] = Field(description="User preferences") - - -def step_1_environment_setup(): - """Step 1: Set up environment and API key""" - print("STEP 1: Environment Setup") - print("=" * 40) - - # Check if API key is available - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - return None - - print("โœ… API key found in environment") - print(f"๐Ÿ”‘ API Key: {api_key[:8]}...{api_key[-4:]}") - return api_key - - -def step_2_client_initialization(api_key): - """Step 2: Initialize the ScrapeGraph client""" - print("\nSTEP 2: Client Initialization") - print("=" * 40) - - try: - client = Client(api_key=api_key) - print("โœ… Client initialized successfully") - print(f"๐Ÿ”ง Client type: {type(client)}") - return client - except Exception as e: - print(f"โŒ Error initializing client: {e}") - return None - - -def step_3_define_schema(): - """Step 3: Define the output schema for structured data""" - print("\nSTEP 3: Define Output Schema") - print("=" * 40) - - print("๐Ÿ“‹ Defining Pydantic models for structured output:") - print(" - CookieInfo: Cookie information structure") - print(" - UserProfile: User profile data (for authenticated requests)") - - # Show the schema structure - schema_example = CookieInfo.model_json_schema() - print(f"โœ… Schema defined with {len(schema_example['properties'])} properties") - - return CookieInfo - - -def step_4_prepare_cookies(): - """Step 4: Prepare cookies for authentication""" - print("\nSTEP 4: Prepare Cookies") - print("=" * 40) - - # Example cookies for different scenarios - print("๐Ÿช Preparing cookies for authentication...") - - # Basic test cookies - basic_cookies = {"cookies_key": "cookies_value", "test_cookie": "test_value"} - - # Session cookies - session_cookies = {"session_id": "abc123def456", "user_token": "xyz789ghi012"} - - # Authentication cookies - auth_cookies = { - "auth_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...", - "user_id": "user123", - "csrf_token": "csrf_abc123", - } - - print("๐Ÿ“‹ Available cookie sets:") - print(f" 1. Basic cookies: {len(basic_cookies)} items") - print(f" 2. Session cookies: {len(session_cookies)} items") - print(f" 3. 
Auth cookies: {len(auth_cookies)} items") - - # Use basic cookies for this example - selected_cookies = basic_cookies - print(f"\nโœ… Using basic cookies: {selected_cookies}") - - return selected_cookies - - -def step_5_format_cookies_for_headers(cookies): - """Step 5: Format cookies for HTTP headers""" - print("\nSTEP 5: Format Cookies for Headers") - print("=" * 40) - - print("๐Ÿ”ง Converting cookies dictionary to HTTP Cookie header...") - - # Convert cookies dict to Cookie header string - cookie_header = "; ".join([f"{k}={v}" for k, v in cookies.items()]) - - # Create headers dictionary - headers = {"Cookie": cookie_header} - - print("๐Ÿ“‹ Cookie formatting:") - print(f" Original cookies: {cookies}") - print(f" Cookie header: {cookie_header}") - print(f" Headers dict: {headers}") - - return headers - - -def step_6_configure_request(): - """Step 6: Configure the request parameters""" - print("\nSTEP 6: Configure Request Parameters") - print("=" * 40) - - # Configuration parameters - website_url = "https://httpbin.org/cookies" - user_prompt = "Extract all cookies info" - - print("๐ŸŒ Website URL:") - print(f" {website_url}") - print("\n๐Ÿ“ User Prompt:") - print(f" {user_prompt}") - print("\n๐Ÿ”ง Additional Features:") - print(" - Cookies authentication") - print(" - Structured output schema") - - return {"website_url": website_url, "user_prompt": user_prompt} - - -def step_7_execute_request(client, config, headers, output_schema): - """Step 7: Execute the request with cookies""" - print("\nSTEP 7: Execute Request with Cookies") - print("=" * 40) - - print("๐Ÿš€ Starting request with cookies...") - print("๐Ÿช Cookies will be sent in HTTP headers") - - try: - # Start timing - start_time = time.time() - - # Perform the scraping with cookies - result = client.smartscraper( - website_url=config["website_url"], - user_prompt=config["user_prompt"], - headers=headers, - output_schema=output_schema, - ) - - # Calculate duration - duration = time.time() - start_time - - 
print(f"โœ… Request completed in {duration:.2f} seconds") - print(f"๐Ÿ“Š Response type: {type(result)}") - - return result, duration - - except APIError as e: - print(f"โŒ API Error: {e}") - print("This could be due to:") - print(" - Invalid API key") - print(" - Rate limiting") - print(" - Server issues") - return None, 0 - - except Exception as e: - print(f"โŒ Unexpected error: {e}") - print("This could be due to:") - print(" - Network connectivity issues") - print(" - Invalid website URL") - print(" - Cookie format issues") - return None, 0 - - -def step_8_process_results(result, duration, cookies): - """Step 8: Process and display the results""" - print("\nSTEP 8: Process Results") - print("=" * 40) - - if result is None: - print("โŒ No results to process") - return - - print("๐Ÿ“‹ Processing cookies response...") - - # Display results - if isinstance(result, dict): - print("\n๐Ÿ” Response Structure:") - print(json.dumps(result, indent=2, ensure_ascii=False)) - - # Check if cookies were received correctly - if "cookies" in result: - received_cookies = result["cookies"] - print(f"\n๐Ÿช Cookies sent: {cookies}") - print(f"๐Ÿช Cookies received: {received_cookies}") - - # Verify cookies match - if received_cookies == cookies: - print("โœ… Cookies match perfectly!") - else: - print("โš ๏ธ Cookies don't match exactly (this might be normal)") - - elif isinstance(result, list): - print(f"\nโœ… Request successful! Extracted {len(result)} items") - print("\n๐Ÿ“ฆ Results:") - for i, item in enumerate(result[:3]): # Show first 3 items - print(f" {i+1}. {item}") - - if len(result) > 3: - print(f" ... 
and {len(result) - 3} more items") - - else: - print(f"\n๐Ÿ“‹ Result: {result}") - - print(f"\nโฑ๏ธ Total processing time: {duration:.2f} seconds") - - -def step_9_test_different_scenarios(client, output_schema): - """Step 9: Test different cookie scenarios""" - print("\nSTEP 9: Test Different Cookie Scenarios") - print("=" * 40) - - scenarios = [ - { - "name": "Session Cookies", - "cookies": {"session_id": "abc123", "user_token": "xyz789"}, - "description": "Basic session management", - }, - { - "name": "Authentication Cookies", - "cookies": {"auth_token": "secret123", "preferences": "dark_mode"}, - "description": "User authentication and preferences", - }, - { - "name": "Complex Cookies", - "cookies": { - "session_id": "abc123def456", - "user_id": "user789", - "cart_id": "cart101112", - "preferences": "dark_mode,usd", - }, - "description": "E-commerce scenario with multiple cookies", - }, - ] - - for i, scenario in enumerate(scenarios, 1): - print(f"\n๐Ÿงช Testing Scenario {i}: {scenario['name']}") - print(f" Description: {scenario['description']}") - print(f" Cookies: {scenario['cookies']}") - - # Format cookies for headers - cookie_header = "; ".join([f"{k}={v}" for k, v in scenario["cookies"].items()]) - headers = {"Cookie": cookie_header} - - try: - # Quick test request - result = client.smartscraper( - website_url="https://httpbin.org/cookies", - user_prompt=f"Extract cookies for {scenario['name']}", - headers=headers, - output_schema=output_schema, - ) - print(f" โœ… Success: {type(result)}") - except Exception as e: - print(f" โŒ Error: {str(e)[:50]}...") - - -def step_10_cleanup(client): - """Step 10: Clean up resources""" - print("\nSTEP 10: Cleanup") - print("=" * 40) - - try: - client.close() - print("โœ… Client session closed successfully") - print("๐Ÿ”’ Resources freed") - except Exception as e: - print(f"โš ๏ธ Warning during cleanup: {e}") - - -def main(): - """Main function to run the step-by-step cookies example""" - - print("ScrapeGraph SDK - 
Step-by-Step Cookies Example") - print("=" * 60) - print("This example shows the complete process of setting up and") - print("executing a SmartScraper request with cookies for authentication") - print("=" * 60) - - # Step 1: Environment setup - api_key = step_1_environment_setup() - if not api_key: - return - - # Step 2: Client initialization - client = step_2_client_initialization(api_key) - if not client: - return - - # Step 3: Define schema - output_schema = step_3_define_schema() - - # Step 4: Prepare cookies - cookies = step_4_prepare_cookies() - - # Step 5: Format cookies for headers - headers = step_5_format_cookies_for_headers(cookies) - - # Step 6: Configure request - config = step_6_configure_request() - - # Step 7: Execute request - result, duration = step_7_execute_request(client, config, headers, output_schema) - - # Step 8: Process results - step_8_process_results(result, duration, cookies) - - # Step 9: Test different scenarios - step_9_test_different_scenarios(client, output_schema) - - # Step 10: Cleanup - step_10_cleanup(client) - - print("\n" + "=" * 60) - print("Step-by-step cookies example completed!") - print("\nKey takeaways:") - print("1. Cookies are passed via HTTP headers") - print("2. Cookie format: 'key1=value1; key2=value2'") - print("3. Always validate your API key first") - print("4. Test different cookie scenarios") - print("5. 
Handle errors gracefully") - print("\nCommon use cases:") - print("- Authentication for protected pages") - print("- Session management for dynamic content") - print("- User preferences and settings") - print("- Shopping cart and user state") - print("\nNext steps:") - print("- Try with real websites that require authentication") - print("- Experiment with different cookie combinations") - print("- Add error handling for production use") - print("- Consider security implications of storing cookies") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/advanced_features/steps/step_by_step_movements_example.py b/scrapegraph-py/examples/advanced_features/steps/step_by_step_movements_example.py deleted file mode 100644 index b7e40c2a..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/step_by_step_movements_example.py +++ /dev/null @@ -1,204 +0,0 @@ -#!/usr/bin/env python3 -""" -Step-by-Step SmartScraper Movements Example - -This example demonstrates how to use interactive movements with SmartScraper API. -It shows how to make actual HTTP requests with step-by-step browser interactions. -""" - -import json -import os -import time - -import requests -from dotenv import load_dotenv - -# Load environment variables from .env file -load_dotenv() - - -def smart_scraper_movements(): - """Example of making a movements request to the smartscraper API""" - - # Get API key from .env file - api_key = os.getenv("TEST_API_KEY") - if not api_key: - raise ValueError( - "API key must be provided or set in .env file as TEST_API_KEY. 
" - "Create a .env file with: TEST_API_KEY=your_api_key_here" - ) - - steps = [ - "click on search bar", - "wait for 500ms", - "fill email input box with mdehsan873@gmail.com", - "wait a sec", - "click on the first time of search result", - "wait for 2 seconds to load the result of search", - ] - website_url = "https://github.com/" - user_prompt = "Extract user profile" - - headers = { - "SGAI-APIKEY": api_key, - "Content-Type": "application/json", - } - - body = { - "website_url": website_url, - "user_prompt": user_prompt, - "output_schema": {}, - "steps": steps, - } - - print("๐Ÿš€ Starting Smart Scraper with Interactive Movements...") - print(f"๐ŸŒ Website URL: {website_url}") - print(f"๐ŸŽฏ User Prompt: {user_prompt}") - print(f"๐Ÿ“‹ Steps: {len(steps)} interactive steps") - print("\n" + "=" * 60) - - # Start timer - start_time = time.time() - print( - f"โฑ๏ธ Timer started at: {time.strftime('%H:%M:%S', time.localtime(start_time))}" - ) - print("๐Ÿ”„ Processing request...") - - try: - response = requests.post( - "http://localhost:8001/v1/smartscraper", - json=body, - headers=headers, - ) - - # Calculate execution time - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - - print( - f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Total execution time: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print( - f"๐Ÿ“Š Performance: {execution_time:.1f}s ({execution_minutes:.1f}m) for {len(steps)} steps" - ) - - if response.status_code == 200: - result = response.json() - print("โœ… Request completed successfully!") - print(f"๐Ÿ“Š Request ID: {result.get('request_id', 'N/A')}") - print(f"๐Ÿ”„ Status: {result.get('status', 'N/A')}") - - if result.get("error"): - print(f"โŒ Error: {result['error']}") - else: - print("\n๐Ÿ“‹ EXTRACTED DATA:") - print("=" * 60) - - # Pretty print the result with proper indentation - if "result" in 
result: - print(json.dumps(result["result"], indent=2, ensure_ascii=False)) - else: - print("No result data found") - - else: - print(f"โŒ Request failed with status code: {response.status_code}") - print(f"Response: {response.text}") - - except requests.exceptions.RequestException as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"๐ŸŒ Network error: {str(e)}") - except Exception as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"โฑ๏ธ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"โšก Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"๐Ÿ’ฅ Unexpected error: {str(e)}") - - -def show_curl_equivalent(): - """Show the equivalent curl command for reference""" - - # Load environment variables from .env file - load_dotenv() - - api_key = os.getenv("TEST_API_KEY", "your-api-key-here") - curl_command = f""" -curl --location 'http://localhost:8001/v1/smartscraper' \\ ---header 'SGAI-APIKEY: {api_key}' \\ ---header 'Content-Type: application/json' \\ ---data '{{ - "website_url": "https://github.com/", - "user_prompt": "Extract user profile", - "output_schema": {{}}, - "steps": [ - "click on search bar", - "wait for 500ms", - "fill email input box with mdehsan873@gmail.com", - "wait a sec", - "click on the first time of search result", - "wait for 2 seconds to load the result of search" - ] -}}' - """ - - print("Equivalent curl command:") - print(curl_command) - - -def main(): - """Main function to run the movements example""" - try: - print("๐ŸŽฏ SMART SCRAPER MOVEMENTS EXAMPLE") - print("=" * 60) - print("This example 
demonstrates interactive movements with timing") - print() - - # Show the curl equivalent - show_curl_equivalent() - - print("\n" + "=" * 60) - - # Make the actual API request - smart_scraper_movements() - - print("\n" + "=" * 60) - print("Example completed!") - print("\nKey takeaways:") - print("1. Movements allow for interactive browser automation") - print("2. Each step is executed sequentially") - print("3. Timing is crucial for successful interactions") - print("4. Error handling is important for robust automation") - print("\nNext steps:") - print("- Customize the steps for your specific use case") - print("- Add more complex interactions") - print("- Implement retry logic for failed steps") - print("- Use structured output schemas for better data extraction") - - except Exception as e: - print(f"💥 Error occurred: {str(e)}") - print("\n🛠️ Troubleshooting:") - print("1. Make sure your .env file contains TEST_API_KEY") - print("2. Ensure the API server is running on localhost:8001") - print("3. Check your internet connection") - print("4. Verify the target website is accessible") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/advanced_features/steps/step_by_step_pagination_example.py b/scrapegraph-py/examples/advanced_features/steps/step_by_step_pagination_example.py deleted file mode 100644 index fb308df9..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/step_by_step_pagination_example.py +++ /dev/null @@ -1,259 +0,0 @@ -#!/usr/bin/env python3 -""" -Step-by-Step Pagination Example - -This example demonstrates the pagination process step by step, showing each stage -of setting up and executing a paginated SmartScraper request. 
-""" - -import json -import os -import time -from typing import List, Optional - -from dotenv import load_dotenv -from pydantic import BaseModel, Field - -from scrapegraph_py import Client -from scrapegraph_py.exceptions import APIError - -# Load environment variables from .env file -load_dotenv() - - -class ProductInfo(BaseModel): - """Schema for product information""" - - name: str = Field(description="Product name") - price: Optional[str] = Field(description="Product price") - rating: Optional[str] = Field(description="Product rating") - image_url: Optional[str] = Field(description="Product image URL") - description: Optional[str] = Field(description="Product description") - - -class ProductList(BaseModel): - """Schema for list of products""" - - products: List[ProductInfo] = Field(description="List of products") - - -def step_1_environment_setup(): - """Step 1: Set up environment and API key""" - print("STEP 1: Environment Setup") - print("=" * 40) - - # Check if API key is available - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - return None - - print("โœ… API key found in environment") - print(f"๐Ÿ”‘ API Key: {api_key[:8]}...{api_key[-4:]}") - return api_key - - -def step_2_client_initialization(api_key): - """Step 2: Initialize the ScrapeGraph client""" - print("\nSTEP 2: Client Initialization") - print("=" * 40) - - try: - client = Client(api_key=api_key) - print("โœ… Client initialized successfully") - print(f"๐Ÿ”ง Client type: {type(client)}") - return client - except Exception as e: - print(f"โŒ Error initializing client: {e}") - return None - - -def step_3_define_schema(): - """Step 3: Define the output schema for structured data""" - print("\nSTEP 3: Define Output Schema") - print("=" * 40) - - print("๐Ÿ“‹ Defining Pydantic models for structured output:") - print(" - ProductInfo: Individual product data") - print(" - ProductList: Collection of products") - - # Show the schema structure - schema_example = ProductList.model_json_schema() - print(f"โœ… Schema defined with {len(schema_example['properties'])} properties") - - return ProductList - - -def step_4_configure_request(): - """Step 4: Configure the pagination request parameters""" - print("\nSTEP 4: Configure Request Parameters") - print("=" * 40) - - # Configuration parameters - website_url = "https://www.amazon.in/s?k=tv&crid=1TEF1ZFVLU8R8&sprefix=t%2Caps%2C390&ref=nb_sb_noss_2" - user_prompt = "Extract all product info including name, price, rating, image_url, and description" - total_pages = 3 - - print("๐ŸŒ Website URL:") - print(f" {website_url}") - print("\n๐Ÿ“ User Prompt:") - print(f" {user_prompt}") - print(f"\n๐Ÿ“„ Total Pages: {total_pages}") - print(f"๐Ÿ“Š Expected Products: ~{total_pages * 20} (estimated)") - - return { - "website_url": website_url, - "user_prompt": user_prompt, - "total_pages": total_pages, - } - - -def step_5_execute_request(client, config, output_schema): - """Step 5: Execute the pagination request""" - print("\nSTEP 5: Execute 
Pagination Request") - print("=" * 40) - - print("๐Ÿš€ Starting pagination request...") - print("โฑ๏ธ This may take several minutes for multiple pages...") - - try: - # Start timing - start_time = time.time() - - # Make the request with pagination - result = client.smartscraper( - user_prompt=config["user_prompt"], - website_url=config["website_url"], - output_schema=output_schema, - total_pages=config["total_pages"], - ) - - # Calculate duration - duration = time.time() - start_time - - print(f"โœ… Request completed in {duration:.2f} seconds") - print(f"๐Ÿ“Š Response type: {type(result)}") - - return result, duration - - except APIError as e: - print(f"โŒ API Error: {e}") - print("This could be due to:") - print(" - Invalid API key") - print(" - Rate limiting") - print(" - Server issues") - return None, 0 - - except Exception as e: - print(f"โŒ Unexpected error: {e}") - print("This could be due to:") - print(" - Network connectivity issues") - print(" - Invalid website URL") - print(" - Pagination limitations") - return None, 0 - - -def step_6_process_results(result, duration): - """Step 6: Process and display the results""" - print("\nSTEP 6: Process Results") - print("=" * 40) - - if result is None: - print("โŒ No results to process") - return - - print("๐Ÿ“‹ Processing pagination results...") - - # Display results based on type - if isinstance(result, dict): - print("\n๐Ÿ” Response Structure:") - print(json.dumps(result, indent=2, ensure_ascii=False)) - - # Check for pagination success indicators - if "data" in result: - print("\nโœจ Pagination successful! Data extracted from multiple pages") - - elif isinstance(result, list): - print(f"\nโœ… Pagination successful! Extracted {len(result)} items") - - # Show first few items - print("\n๐Ÿ“ฆ Sample Results:") - for i, item in enumerate(result[:3]): # Show first 3 items - print(f" {i+1}. {item}") - - if len(result) > 3: - print(f" ... 
and {len(result) - 3} more items") - - else: - print(f"\n📋 Result: {result}") - - print(f"\n⏱️ Total processing time: {duration:.2f} seconds") - - -def step_7_cleanup(client): - """Step 7: Clean up resources""" - print("\nSTEP 7: Cleanup") - print("=" * 40) - - try: - client.close() - print("✅ Client session closed successfully") - print("🔒 Resources freed") - except Exception as e: - print(f"⚠️ Warning during cleanup: {e}") - - -def main(): - """Main function to run the step-by-step pagination example""" - - print("ScrapeGraph SDK - Step-by-Step Pagination Example") - print("=" * 60) - print("This example shows the complete process of setting up and") - print("executing a pagination request with SmartScraper API") - print("=" * 60) - - # Step 1: Environment setup - api_key = step_1_environment_setup() - if not api_key: - return - - # Step 2: Client initialization - client = step_2_client_initialization(api_key) - if not client: - return - - # Step 3: Define schema - output_schema = step_3_define_schema() - - # Step 4: Configure request - config = step_4_configure_request() - - # Step 5: Execute request - result, duration = step_5_execute_request(client, config, output_schema) - - # Step 6: Process results - step_6_process_results(result, duration) - - # Step 7: Cleanup - step_7_cleanup(client) - - print("\n" + "=" * 60) - print("Step-by-step pagination example completed!") - print("\nKey takeaways:") - print("1. Always validate your API key first") - print("2. Define clear output schemas for structured data") - print("3. Configure pagination parameters carefully") - print("4. Handle errors gracefully") - print("5. 
Clean up resources after use") - print("\nNext steps:") - print("- Try different websites and prompts") - print("- Experiment with different page counts") - print("- Add error handling for production use") - print("- Consider rate limiting for large requests") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/advanced_features/steps/step_by_step_scrape_example.py b/scrapegraph-py/examples/advanced_features/steps/step_by_step_scrape_example.py deleted file mode 100644 index 8627dfe1..00000000 --- a/scrapegraph-py/examples/advanced_features/steps/step_by_step_scrape_example.py +++ /dev/null @@ -1,183 +0,0 @@ -""" -Step-by-step example demonstrating how to use the Scrape API with the scrapegraph-py SDK. - -This example shows the basic workflow: -1. Initialize the client -2. Make a scrape request -3. Handle the response -4. Save the HTML content -5. Basic analysis - -Requirements: -- Python 3.7+ -- scrapegraph-py -- python-dotenv -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import os -from pathlib import Path -from dotenv import load_dotenv - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -def step_1_initialize_client(): - """Step 1: Initialize the scrapegraph-py client.""" - print("๐Ÿ”‘ Step 1: Initializing client...") - - try: - # Initialize client using environment variable - client = Client.from_env() - print("โœ… Client initialized successfully") - return client - except Exception as e: - print(f"โŒ Failed to initialize client: {str(e)}") - print("Make sure you have SGAI_API_KEY in your .env file") - raise - - -def step_2_make_scrape_request(client, url, render_js=False): - """Step 2: Make a scrape request.""" - print(f"\n๐ŸŒ Step 2: Making scrape request to {url}") - print(f"๐Ÿ”ง Render heavy JS: {render_js}") - - try: - # Make the scrape request - result = client.scrape( - website_url=url, - render_heavy_js=render_js - ) - 
print("โœ… Scrape request completed successfully") - return result - except Exception as e: - print(f"โŒ Scrape request failed: {str(e)}") - raise - - -def step_3_handle_response(result): - """Step 3: Handle and analyze the response.""" - print(f"\n๐Ÿ“Š Step 3: Analyzing response...") - - # Check if we got HTML content - html_content = result.get("html", "") - if not html_content: - print("โŒ No HTML content received") - return None - - # Basic response analysis - print(f"โœ… Received HTML content") - print(f"๐Ÿ“ Content length: {len(html_content):,} characters") - print(f"๐Ÿ“„ Lines: {len(html_content.splitlines()):,}") - - # Check for common HTML elements - has_doctype = html_content.strip().startswith(" 0: - print(f" {element}: {count}") - - # Check for JavaScript and CSS - has_js = elements["script"] > 0 - has_css = elements["style"] > 0 - - print(f"\n๐ŸŽจ Content types:") - print(f" JavaScript: {'Yes' if has_js else 'No'}") - print(f" CSS: {'Yes' if has_css else 'No'}") - - return elements - - -def main(): - """Main function demonstrating step-by-step scrape usage.""" - print("๐Ÿš€ Step-by-Step Scrape API Example") - print("=" * 50) - - # Test URL - test_url = "https://example.com" - - try: - # Step 1: Initialize client - client = step_1_initialize_client() - - # Step 2: Make scrape request - result = step_2_make_scrape_request(client, test_url, render_js=False) - - # Step 3: Handle response - html_content = step_3_handle_response(result) - if not html_content: - print("โŒ Cannot proceed without HTML content") - return - - # Step 4: Save content - filename = "example_website" - saved_file = step_4_save_html_content(html_content, filename) - - # Step 5: Basic analysis - elements = step_5_basic_analysis(html_content) - - # Summary - print(f"\n๐ŸŽฏ Summary:") - print(f"โœ… Successfully processed {test_url}") - print(f"๐Ÿ’พ HTML saved to: {saved_file}") - print(f"๐Ÿ“Š Analyzed {len(html_content):,} characters of HTML content") - - # Close client - 
client.close() - print("๐Ÿ”’ Client closed successfully") - - except Exception as e: - print(f"\n๐Ÿ’ฅ Error occurred: {str(e)}") - print("Check your API key and internet connection") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/agenticscraper/async/async_agenticscraper_comprehensive_example.py b/scrapegraph-py/examples/agenticscraper/async/async_agenticscraper_comprehensive_example.py deleted file mode 100644 index c72da78b..00000000 --- a/scrapegraph-py/examples/agenticscraper/async/async_agenticscraper_comprehensive_example.py +++ /dev/null @@ -1,458 +0,0 @@ -#!/usr/bin/env python3 -""" -Comprehensive Async Agentic Scraper Example - -This example demonstrates how to use the agentic scraper API endpoint -asynchronously to perform automated browser actions and scrape content -with both AI extraction and non-AI extraction modes. - -The agentic scraper can: -1. Navigate to a website -2. Perform a series of automated actions (like filling forms, clicking buttons) -3. Extract the resulting HTML content as markdown -4. Optionally use AI to extract structured data - -Usage: - python examples/async/async_agenticscraper_comprehensive_example.py -""" - -import asyncio -import json -import os -import time -from typing import Dict, List, Optional - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -# Set logging level -sgai_logger.set_logging(level="INFO") - - -async def example_basic_scraping_no_ai(): - """Example: Basic agentic scraping without AI extraction.""" - - # Initialize the async client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - return None - - async with AsyncClient(api_key=api_key) as client: - # Define the steps to perform - steps = [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login", - ] - - try: - print("๐Ÿš€ Starting basic async agentic scraping (no AI extraction)...") - print(f"URL: https://dashboard.scrapegraphai.com/") - print(f"Steps: {steps}") - - # Perform the scraping without AI extraction - result = await client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", - steps=steps, - use_session=True, - ai_extraction=False # No AI extraction - just get raw markdown - ) - - print("โœ… Basic async scraping completed successfully!") - print(f"Request ID: {result.get('request_id')}") - - # Save the markdown content to a file - if result.get("markdown"): - with open("async_basic_scraped_content.md", "w", encoding="utf-8") as f: - f.write(result["markdown"]) - print("๐Ÿ“„ Markdown content saved to 'async_basic_scraped_content.md'") - - # Print a preview of the content - if result.get("markdown"): - preview = ( - result["markdown"][:500] + "..." 
- if len(result["markdown"]) > 500 - else result["markdown"] - ) - print(f"\n๐Ÿ“ Content Preview:\n{preview}") - - if result.get("error"): - print(f"โš ๏ธ Warning: {result['error']}") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - - -async def example_ai_extraction(): - """Example: Use AI extraction to get structured data from dashboard.""" - - # Initialize the async client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - return None - - async with AsyncClient(api_key=api_key) as client: - # Define extraction schema for user dashboard information - output_schema = { - "user_info": { - "type": "object", - "properties": { - "username": {"type": "string"}, - "email": {"type": "string"}, - "dashboard_sections": { - "type": "array", - "items": {"type": "string"} - }, - "account_status": {"type": "string"}, - "credits_remaining": {"type": "number"} - }, - "required": ["username", "dashboard_sections"] - } - } - - steps = [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login", - "wait for dashboard to load completely", - ] - - try: - print("๐Ÿค– Starting async agentic scraping with AI extraction...") - print(f"URL: https://dashboard.scrapegraphai.com/") - print(f"Steps: {steps}") - - result = await client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", - steps=steps, - use_session=True, - user_prompt="Extract user information, available dashboard sections, account status, and remaining credits from the dashboard", - output_schema=output_schema, - ai_extraction=True - ) - - print("โœ… Async AI extraction completed!") - print(f"Request ID: {result.get('request_id')}") - - if result.get("result"): - print("๐ŸŽฏ Extracted Structured Data:") - print(json.dumps(result["result"], indent=2)) - - # Save extracted data to JSON file - with 
open("async_extracted_dashboard_data.json", "w", encoding="utf-8") as f: - json.dump(result["result"], f, indent=2) - print("๐Ÿ’พ Structured data saved to 'async_extracted_dashboard_data.json'") - - # Also save the raw markdown if available - if result.get("markdown"): - with open("async_ai_scraped_content.md", "w", encoding="utf-8") as f: - f.write(result["markdown"]) - print("๐Ÿ“„ Raw markdown also saved to 'async_ai_scraped_content.md'") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - - -async def example_multiple_sites_concurrently(): - """Example: Scrape multiple sites concurrently with different extraction modes.""" - - # Initialize the async client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - return None - - async with AsyncClient(api_key=api_key) as client: - # Define different scraping tasks - tasks = [ - { - "name": "Dashboard Login (No AI)", - "url": "https://dashboard.scrapegraphai.com/", - "steps": [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login" - ], - "ai_extraction": False - }, - { - "name": "Product Page (With AI)", - "url": "https://example-store.com/products/laptop", - "steps": [ - "scroll down to product details", - "click on specifications tab", - "scroll down to reviews section" - ], - "ai_extraction": True, - "user_prompt": "Extract product name, price, specifications, and customer review summary", - "output_schema": { - "product": { - "type": "object", - "properties": { - "name": {"type": "string"}, - "price": {"type": "string"}, - "specifications": {"type": "object"}, - "review_summary": { - "type": "object", - "properties": { - "average_rating": {"type": "number"}, - "total_reviews": {"type": "number"} - } - } - } - } - } - }, - { - "name": "News Article (With AI)", - "url": "https://example-news.com/tech-article", - 
"steps": [ - "scroll down to read full article", - "click on related articles section" - ], - "ai_extraction": True, - "user_prompt": "Extract article title, author, publication date, main content summary, and related article titles", - "output_schema": { - "article": { - "type": "object", - "properties": { - "title": {"type": "string"}, - "author": {"type": "string"}, - "publication_date": {"type": "string"}, - "summary": {"type": "string"}, - "related_articles": { - "type": "array", - "items": {"type": "string"} - } - } - } - } - } - ] - - async def scrape_site(task): - """Helper function to scrape a single site.""" - try: - print(f"๐Ÿš€ Starting: {task['name']}") - - kwargs = { - "url": task["url"], - "steps": task["steps"], - "use_session": True, - "ai_extraction": task["ai_extraction"] - } - - if task["ai_extraction"]: - kwargs["user_prompt"] = task["user_prompt"] - kwargs["output_schema"] = task["output_schema"] - - result = await client.agenticscraper(**kwargs) - - print(f"โœ… Completed: {task['name']} (Request ID: {result.get('request_id')})") - return { - "task_name": task["name"], - "result": result, - "success": True - } - - except Exception as e: - print(f"โŒ Failed: {task['name']} - {str(e)}") - return { - "task_name": task["name"], - "error": str(e), - "success": False - } - - try: - print("๐Ÿ”„ Starting concurrent scraping of multiple sites...") - print(f"๐Ÿ“Š Total tasks: {len(tasks)}") - - # Run all scraping tasks concurrently - results = await asyncio.gather( - *[scrape_site(task) for task in tasks], - return_exceptions=True - ) - - print("\n๐Ÿ“‹ Concurrent Scraping Results:") - print("=" * 50) - - successful_results = [] - failed_results = [] - - for result in results: - if isinstance(result, Exception): - print(f"โŒ Exception occurred: {str(result)}") - failed_results.append({"error": str(result)}) - elif result["success"]: - print(f"โœ… {result['task_name']}: Success") - successful_results.append(result) - - # Save individual results - 
filename = f"concurrent_{result['task_name'].lower().replace(' ', '_').replace('(', '').replace(')', '')}_result.json" - with open(filename, "w", encoding="utf-8") as f: - json.dump(result["result"], f, indent=2) - print(f" ๐Ÿ’พ Saved to: {filename}") - else: - print(f"โŒ {result['task_name']}: Failed - {result['error']}") - failed_results.append(result) - - # Save summary - summary = { - "total_tasks": len(tasks), - "successful": len(successful_results), - "failed": len(failed_results), - "success_rate": f"{(len(successful_results) / len(tasks)) * 100:.1f}%", - "results": results - } - - with open("concurrent_scraping_summary.json", "w", encoding="utf-8") as f: - json.dump(summary, f, indent=2) - print(f"\n๐Ÿ“Š Summary saved to: concurrent_scraping_summary.json") - print(f" Success Rate: {summary['success_rate']}") - - return results - - except Exception as e: - print(f"โŒ Concurrent scraping error: {str(e)}") - return None - - -async def example_step_by_step_with_ai(): - """Example: Step-by-step form interaction with AI extraction.""" - - # Initialize the async client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - return None - - async with AsyncClient(api_key=api_key) as client: - steps = [ - "navigate to contact page", - "fill in name field with 'Jane Smith'", - "fill in email field with 'jane.smith@company.com'", - "select 'Business Inquiry' from dropdown", - "fill in message: 'I would like to discuss enterprise pricing options for 100+ users'", - "click on terms and conditions checkbox", - "click submit button", - "wait for success message and capture any reference number" - ] - - output_schema = { - "contact_form_result": { - "type": "object", - "properties": { - "submission_status": {"type": "string"}, - "success_message": {"type": "string"}, - "reference_number": {"type": "string"}, - "next_steps": {"type": "string"}, - "contact_info": 
{"type": "string"}, - "estimated_response_time": {"type": "string"} - }, - "required": ["submission_status", "success_message"] - } - } - - try: - print("๐Ÿ“ Starting step-by-step form interaction with AI extraction...") - print(f"URL: https://example-business.com/contact") - print(f"Steps: {len(steps)} steps defined") - - result = await client.agenticscraper( - url="https://example-business.com/contact", - steps=steps, - use_session=True, - user_prompt="Extract the form submission result including status, success message, any reference number provided, next steps mentioned, contact information for follow-up, and estimated response time", - output_schema=output_schema, - ai_extraction=True - ) - - print("โœ… Step-by-step form interaction completed!") - print(f"Request ID: {result.get('request_id')}") - - if result and result.get("result"): - form_result = result["result"].get("contact_form_result", {}) - - print("\n๐Ÿ“‹ Form Submission Analysis:") - print(f" ๐Ÿ“Š Status: {form_result.get('submission_status', 'Unknown')}") - print(f" โœ… Message: {form_result.get('success_message', 'No message')}") - - if form_result.get('reference_number'): - print(f" ๐Ÿ”ข Reference: {form_result['reference_number']}") - - if form_result.get('next_steps'): - print(f" ๐Ÿ‘‰ Next Steps: {form_result['next_steps']}") - - if form_result.get('contact_info'): - print(f" ๐Ÿ“ž Contact Info: {form_result['contact_info']}") - - if form_result.get('estimated_response_time'): - print(f" โฐ Response Time: {form_result['estimated_response_time']}") - - # Save detailed results - with open("async_step_by_step_form_result.json", "w", encoding="utf-8") as f: - json.dump(result["result"], f, indent=2) - print("\n๐Ÿ’พ Detailed results saved to 'async_step_by_step_form_result.json'") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - - -async def main(): - """Main async function to run all examples.""" - print("๐Ÿ”ง Comprehensive Async Agentic Scraper 
Examples") - print("=" * 60) - - # Check if API key is set - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("⚠️ Please set your SGAI_API_KEY environment variable before running!") - print("You can either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - print("\n1. Basic Async Scraping (No AI Extraction)") - print("-" * 50) - await example_basic_scraping_no_ai() - - print("\n\n2. Async AI Extraction Example - Dashboard Data") - print("-" * 50) - await example_ai_extraction() - - print("\n\n3. Concurrent Multi-Site Scraping") - print("-" * 50) - # Uncomment to run concurrent scraping example - # await example_multiple_sites_concurrently() - - print("\n\n4. Step-by-Step Form Interaction with AI") - print("-" * 50) - # Uncomment to run step-by-step form example - # await example_step_by_step_with_ai() - - print("\n✨ Async examples completed!") - print("\nℹ️ Note: Some examples are commented out by default.") - print(" Uncomment them in the main function to run additional examples.") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/agenticscraper/async/async_agenticscraper_example.py b/scrapegraph-py/examples/agenticscraper/async/async_agenticscraper_example.py deleted file mode 100644 index 5b3edc96..00000000 --- a/scrapegraph-py/examples/agenticscraper/async/async_agenticscraper_example.py +++ /dev/null @@ -1,93 +0,0 @@ -import asyncio -import os - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -async def main(): - # Initialize async client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - 
print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - sgai_client = AsyncClient(api_key=api_key) - - print("๐Ÿค– Example 1: Basic Async Agentic Scraping (No AI Extraction)") - print("=" * 60) - - # AgenticScraper request - basic automated login example (no AI) - response = await sgai_client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", - use_session=True, - steps=[ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login" - ], - ai_extraction=False # No AI extraction - just get raw content - ) - - # Print the response - print(f"Request ID: {response['request_id']}") - print(f"Result: {response.get('result', 'No result yet')}") - print(f"Status: {response.get('status', 'Unknown')}") - - print("\n\n๐Ÿง  Example 2: Async Agentic Scraping with AI Extraction") - print("=" * 60) - - # Define schema for AI extraction - output_schema = { - "dashboard_info": { - "type": "object", - "properties": { - "username": {"type": "string"}, - "email": {"type": "string"}, - "dashboard_sections": { - "type": "array", - "items": {"type": "string"} - }, - "credits_remaining": {"type": "number"} - }, - "required": ["username", "dashboard_sections"] - } - } - - # AgenticScraper request with AI extraction - ai_response = await sgai_client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", - use_session=True, - steps=[ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login", - "wait for dashboard to load completely" - ], - user_prompt="Extract user information, available dashboard sections, and remaining credits from the dashboard", - output_schema=output_schema, - ai_extraction=True - ) - - # Print the AI extraction response - print(f"AI Request ID: {ai_response['request_id']}") - print(f"AI Result: 
{ai_response.get('result', 'No result yet')}") - print(f"AI Status: {ai_response.get('status', 'Unknown')}") - print(f"User Prompt: Extract user information, available dashboard sections, and remaining credits") - print(f"Schema Provided: {'Yes' if output_schema else 'No'}") - - await sgai_client.close() - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/agenticscraper/sync/agenticscraper_comprehensive_example.py b/scrapegraph-py/examples/agenticscraper/sync/agenticscraper_comprehensive_example.py deleted file mode 100644 index c1e77542..00000000 --- a/scrapegraph-py/examples/agenticscraper/sync/agenticscraper_comprehensive_example.py +++ /dev/null @@ -1,397 +0,0 @@ -#!/usr/bin/env python3 -""" -Comprehensive Agentic Scraper Example - -This example demonstrates how to use the agentic scraper API endpoint -to perform automated browser actions and scrape content with both -AI extraction and non-AI extraction modes. - -The agentic scraper can: -1. Navigate to a website -2. Perform a series of automated actions (like filling forms, clicking buttons) -3. Extract the resulting HTML content as markdown -4. Optionally use AI to extract structured data - -Usage: - python examples/sync/agenticscraper_comprehensive_example.py -""" - -import json -import os -import time -from typing import Dict, List, Optional - -from dotenv import load_dotenv - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -# Set logging level -sgai_logger.set_logging(level="INFO") - - -def example_basic_scraping_no_ai(): - """Example: Basic agentic scraping without AI extraction.""" - - # Initialize the client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. 
Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - return None - - client = Client(api_key=api_key) - - # Define the steps to perform - steps = [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login", - ] - - try: - print("๐Ÿš€ Starting basic agentic scraping (no AI extraction)...") - print(f"URL: https://dashboard.scrapegraphai.com/") - print(f"Steps: {steps}") - - # Perform the scraping without AI extraction - result = client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", - steps=steps, - use_session=True, - ai_extraction=False # No AI extraction - just get raw markdown - ) - - print("โœ… Basic scraping completed successfully!") - print(f"Request ID: {result.get('request_id')}") - - # Save the markdown content to a file - if result.get("markdown"): - with open("basic_scraped_content.md", "w", encoding="utf-8") as f: - f.write(result["markdown"]) - print("๐Ÿ“„ Markdown content saved to 'basic_scraped_content.md'") - - # Print a preview of the content - if result.get("markdown"): - preview = ( - result["markdown"][:500] + "..." 
- if len(result["markdown"]) > 500 - else result["markdown"] - ) - print(f"\n๐Ÿ“ Content Preview:\n{preview}") - - if result.get("error"): - print(f"โš ๏ธ Warning: {result['error']}") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - finally: - client.close() - - -def example_ai_extraction(): - """Example: Use AI extraction to get structured data from dashboard.""" - - # Initialize the client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - return None - - client = Client(api_key=api_key) - - # Define extraction schema for user dashboard information - output_schema = { - "user_info": { - "type": "object", - "properties": { - "username": {"type": "string"}, - "email": {"type": "string"}, - "dashboard_sections": { - "type": "array", - "items": {"type": "string"} - }, - "account_status": {"type": "string"}, - "credits_remaining": {"type": "number"} - }, - "required": ["username", "dashboard_sections"] - } - } - - steps = [ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login", - "wait for dashboard to load completely", - ] - - try: - print("๐Ÿค– Starting agentic scraping with AI extraction...") - print(f"URL: https://dashboard.scrapegraphai.com/") - print(f"Steps: {steps}") - - result = client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", - steps=steps, - use_session=True, - user_prompt="Extract user information, available dashboard sections, account status, and remaining credits from the dashboard", - output_schema=output_schema, - ai_extraction=True - ) - - print("โœ… AI extraction completed!") - print(f"Request ID: {result.get('request_id')}") - - if result.get("result"): - print("๐ŸŽฏ Extracted Structured Data:") - print(json.dumps(result["result"], indent=2)) - - # Save extracted data to JSON file - with 
open("extracted_dashboard_data.json", "w", encoding="utf-8") as f: - json.dump(result["result"], f, indent=2) - print("๐Ÿ’พ Structured data saved to 'extracted_dashboard_data.json'") - - # Also save the raw markdown if available - if result.get("markdown"): - with open("ai_scraped_content.md", "w", encoding="utf-8") as f: - f.write(result["markdown"]) - print("๐Ÿ“„ Raw markdown also saved to 'ai_scraped_content.md'") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - finally: - client.close() - - -def example_ecommerce_product_scraping(): - """Example: Scraping an e-commerce site for product information.""" - - # Initialize the client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - return None - - client = Client(api_key=api_key) - - steps = [ - "click on search box", - "type 'laptop' in search box", - "press enter", - "wait for search results to load", - "scroll down 3 times to load more products", - ] - - output_schema = { - "products": { - "type": "array", - "items": { - "type": "object", - "properties": { - "name": {"type": "string"}, - "price": {"type": "string"}, - "rating": {"type": "number"}, - "availability": {"type": "string"}, - "description": {"type": "string"}, - "image_url": {"type": "string"} - }, - "required": ["name", "price"] - } - }, - "search_info": { - "type": "object", - "properties": { - "total_results": {"type": "number"}, - "search_term": {"type": "string"}, - "page": {"type": "number"} - } - } - } - - try: - print("๐Ÿ›’ Scraping e-commerce products with AI extraction...") - print(f"URL: https://example-ecommerce.com") - print(f"Steps: {steps}") - - result = client.agenticscraper( - url="https://example-ecommerce.com", - steps=steps, - use_session=True, - user_prompt="Extract all visible product information including names, prices, ratings, availability status, descriptions, and image 
URLs. Also extract search metadata like total results and current page.", - output_schema=output_schema, - ai_extraction=True - ) - - print("โœ… E-commerce scraping completed!") - print(f"Request ID: {result.get('request_id')}") - - if result and result.get("result"): - products = result["result"].get("products", []) - search_info = result["result"].get("search_info", {}) - - print(f"๐Ÿ” Search Results for '{search_info.get('search_term', 'laptop')}':") - print(f"๐Ÿ“Š Total Results: {search_info.get('total_results', 'Unknown')}") - print(f"๐Ÿ“„ Current Page: {search_info.get('page', 'Unknown')}") - print(f"๐Ÿ›๏ธ Products Found: {len(products)}") - - print("\n๐Ÿ“ฆ Product Details:") - for i, product in enumerate(products[:5], 1): # Show first 5 products - print(f"\n{i}. {product.get('name', 'N/A')}") - print(f" ๐Ÿ’ฐ Price: {product.get('price', 'N/A')}") - print(f" โญ Rating: {product.get('rating', 'N/A')}") - print(f" ๐Ÿ“ฆ Availability: {product.get('availability', 'N/A')}") - if product.get('description'): - desc = product['description'][:100] + "..." 
if len(product['description']) > 100 else product['description'] - print(f" ๐Ÿ“ Description: {desc}") - - # Save extracted data - with open("ecommerce_products.json", "w", encoding="utf-8") as f: - json.dump(result["result"], f, indent=2) - print("\n๐Ÿ’พ Product data saved to 'ecommerce_products.json'") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - finally: - client.close() - - -def example_form_filling_and_data_extraction(): - """Example: Fill out a contact form and extract confirmation details.""" - - # Initialize the client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - return None - - client = Client(api_key=api_key) - - steps = [ - "find and click on contact form", - "type 'John Doe' in name field", - "type 'john.doe@example.com' in email field", - "type 'Product Inquiry' in subject field", - "type 'I am interested in your premium plan. Could you provide more details about pricing and features?' 
in message field", - "click submit button", - "wait for confirmation message to appear", - ] - - output_schema = { - "form_submission": { - "type": "object", - "properties": { - "status": {"type": "string"}, - "confirmation_message": {"type": "string"}, - "reference_number": {"type": "string"}, - "estimated_response_time": {"type": "string"}, - "submitted_data": { - "type": "object", - "properties": { - "name": {"type": "string"}, - "email": {"type": "string"}, - "subject": {"type": "string"} - } - } - }, - "required": ["status", "confirmation_message"] - } - } - - try: - print("๐Ÿ“ Filling contact form and extracting confirmation...") - print(f"URL: https://example-company.com/contact") - print(f"Steps: {steps}") - - result = client.agenticscraper( - url="https://example-company.com/contact", - steps=steps, - use_session=True, - user_prompt="Extract the form submission status, confirmation message, any reference numbers, estimated response time, and echo back the submitted form data", - output_schema=output_schema, - ai_extraction=True - ) - - print("โœ… Form submission and extraction completed!") - print(f"Request ID: {result.get('request_id')}") - - if result and result.get("result"): - form_data = result["result"].get("form_submission", {}) - - print(f"๐Ÿ“‹ Form Submission Results:") - print(f" โœ… Status: {form_data.get('status', 'Unknown')}") - print(f" ๐Ÿ’ฌ Message: {form_data.get('confirmation_message', 'No message')}") - - if form_data.get('reference_number'): - print(f" ๐Ÿ”ข Reference: {form_data['reference_number']}") - - if form_data.get('estimated_response_time'): - print(f" โฐ Response Time: {form_data['estimated_response_time']}") - - submitted_data = form_data.get('submitted_data', {}) - if submitted_data: - print(f"\n๐Ÿ“ค Submitted Data:") - for key, value in submitted_data.items(): - print(f" {key.title()}: {value}") - - # Save form results - with open("form_submission_results.json", "w", encoding="utf-8") as f: - json.dump(result["result"], f, 
indent=2) - print("\n๐Ÿ’พ Form results saved to 'form_submission_results.json'") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - finally: - client.close() - - -if __name__ == "__main__": - print("๐Ÿ”ง Comprehensive Agentic Scraper Examples") - print("=" * 60) - - # Check if API key is set - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โš ๏ธ Please set your SGAI_API_KEY environment variable before running!") - print("You can either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - exit(1) - - print("\n1. Basic Scraping (No AI Extraction)") - print("-" * 40) - example_basic_scraping_no_ai() - - print("\n\n2. AI Extraction Example - Dashboard Data") - print("-" * 40) - example_ai_extraction() - - print("\n\n3. E-commerce Product Scraping with AI") - print("-" * 40) - # Uncomment to run e-commerce example - # example_ecommerce_product_scraping() - - print("\n\n4. 
Form Filling and Confirmation Extraction") - print("-" * 40) - # Uncomment to run form filling example - # example_form_filling_and_data_extraction() - - print("\nโœจ Examples completed!") - print("\nโ„น๏ธ Note: Some examples are commented out by default.") - print(" Uncomment them in the main section to run additional examples.") diff --git a/scrapegraph-py/examples/agenticscraper/sync/agenticscraper_example.py b/scrapegraph-py/examples/agenticscraper/sync/agenticscraper_example.py deleted file mode 100644 index ecf658d4..00000000 --- a/scrapegraph-py/examples/agenticscraper/sync/agenticscraper_example.py +++ /dev/null @@ -1,86 +0,0 @@ -import os - -from dotenv import load_dotenv - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - -# Initialize the client with API key from environment variable -api_key = os.getenv("SGAI_API_KEY") -if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - exit(1) - -sgai_client = Client(api_key=api_key) - -print("๐Ÿค– Example 1: Basic Agentic Scraping (No AI Extraction)") -print("=" * 60) - -# AgenticScraper request - basic automated login example (no AI) -response = sgai_client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", - use_session=True, - steps=[ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login" - ], - ai_extraction=False # No AI extraction - just get raw content -) - -# Print the response -print(f"Request ID: {response['request_id']}") -print(f"Result: {response.get('result', 'No result yet')}") -print(f"Status: {response.get('status', 'Unknown')}") - -print("\n\n๐Ÿง  Example 2: Agentic Scraping with AI Extraction") -print("=" * 60) - -# Define schema for AI extraction -output_schema = { - "dashboard_info": { - "type": "object", - "properties": { - "username": {"type": "string"}, - "email": {"type": "string"}, - "dashboard_sections": { - "type": "array", - "items": {"type": "string"} - }, - "credits_remaining": {"type": "number"} - }, - "required": ["username", "dashboard_sections"] - } -} - -# AgenticScraper request with AI extraction -ai_response = sgai_client.agenticscraper( - url="https://dashboard.scrapegraphai.com/", - use_session=True, - steps=[ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login", - "wait for dashboard to load completely" - ], - user_prompt="Extract user information, available dashboard sections, and remaining credits from the dashboard", - output_schema=output_schema, - ai_extraction=True -) - -# Print the AI extraction response -print(f"AI Request ID: {ai_response['request_id']}") -print(f"AI Result: {ai_response.get('result', 'No result yet')}") -print(f"AI Status: {ai_response.get('status', 'Unknown')}") -print(f"User Prompt: Extract user information, available dashboard sections, and 
remaining credits") -print(f"Schema Provided: {'Yes' if output_schema else 'No'}") - -sgai_client.close() diff --git a/scrapegraph-py/examples/crawl/async/async_crawl_example.py b/scrapegraph-py/examples/crawl/async/async_crawl_example.py deleted file mode 100644 index 2a4d3bd4..00000000 --- a/scrapegraph-py/examples/crawl/async/async_crawl_example.py +++ /dev/null @@ -1,111 +0,0 @@ -""" -Example demonstrating how to use the ScrapeGraphAI /v1/crawl/ API endpoint using the async client. - -Requirements: -- Python 3.7+ -- scrapegraph-py -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import asyncio -import json -import os -import time - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient - -# Load environment variables from .env file -load_dotenv() - - -async def main(): - if not os.getenv("SGAI_API_KEY"): - print("Error: SGAI_API_KEY not found in .env file") - print("Please create a .env file with your API key:") - print("SGAI_API_KEY=your_api_key_here") - return - - # Simple schema for founders' information - schema = { - "type": "object", - "properties": { - "founders": { - "type": "array", - "items": { - "type": "object", - "properties": { - "name": {"type": "string"}, - "title": {"type": "string"}, - "bio": {"type": "string"}, - "linkedin": {"type": "string"}, - "twitter": {"type": "string"}, - }, - }, - } - }, - } - - url = "https://scrapegraphai.com" - prompt = "Extract the founders' information" - - try: - # Initialize the async client - async with AsyncClient.from_env() as client: - # Start the crawl job - print(f"\nStarting crawl for: {url}") - start_time = time.time() - crawl_response = await client.crawl( - url=url, - prompt=prompt, - data_schema=schema, - cache_website=True, - depth=2, - max_pages=2, - same_domain_only=True, - sitemap=True, # Use sitemap for better page discovery - # batch_size is optional and will be excluded if not provided - ) - execution_time = time.time() - 
start_time - print(f"POST /v1/crawl/ execution time: {execution_time:.2f} seconds") - print("\nCrawl job started. Response:") - print(json.dumps(crawl_response, indent=2)) - - # If the crawl is asynchronous and returns an ID, fetch the result - crawl_id = crawl_response.get("id") or crawl_response.get("task_id") - start_time = time.time() - if crawl_id: - print("\nPolling for crawl result...") - for _ in range(10): - await asyncio.sleep(5) - result = await client.get_crawl(crawl_id) - if result.get("status") == "success" and result.get("result"): - execution_time = time.time() - start_time - print( - f"GET /v1/crawl/{crawl_id} execution time: {execution_time:.2f} seconds" - ) - print("\nCrawl completed. Result:") - print(json.dumps(result["result"]["llm_result"], indent=2)) - break - elif result.get("status") == "failed": - print("\nCrawl failed. Result:") - print(json.dumps(result, indent=2)) - break - else: - print(f"Status: {result.get('status')}, waiting...") - else: - print("Crawl did not complete in time.") - else: - print("No crawl ID found in response. Synchronous result:") - print(json.dumps(crawl_response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/crawl/async/async_crawl_markdown_direct_api_example.py b/scrapegraph-py/examples/crawl/async/async_crawl_markdown_direct_api_example.py deleted file mode 100644 index 61df0a14..00000000 --- a/scrapegraph-py/examples/crawl/async/async_crawl_markdown_direct_api_example.py +++ /dev/null @@ -1,254 +0,0 @@ -#!/usr/bin/env python3 -""" -Async example script demonstrating the ScrapeGraphAI Crawler markdown conversion mode. 
- -This example shows how to use the crawler in markdown conversion mode: -- Cost-effective markdown conversion (NO AI/LLM processing) -- 2 credits per page (80% savings compared to AI mode) -- Clean HTML to markdown conversion with metadata extraction - -Requirements: -- Python 3.7+ -- aiohttp -- python-dotenv -- A .env file with your API_KEY - -Example .env file: -API_KEY=your_api_key_here -""" - -import asyncio -import json -import os -from typing import Any, Dict - -import aiohttp -from dotenv import load_dotenv - -# Load environment variables from .env file -load_dotenv() - -# Configuration - API key from environment or fallback -API_KEY = os.getenv("API_KEY", "sgai-xxx") # Matches the API_KEY name documented above; placeholder if unset -BASE_URL = os.getenv("BASE_URL", "http://localhost:8001") # Can be overridden via env - - -async def make_request(url: str, data: Dict[str, Any]) -> Dict[str, Any]: - """Make an HTTP request to the API.""" - headers = {"Content-Type": "application/json", "SGAI-APIKEY": API_KEY} - - async with aiohttp.ClientSession() as session: - async with session.post(url, json=data, headers=headers) as response: - return await response.json() - - -async def poll_result(task_id: str) -> Dict[str, Any]: - """Poll for the result of a crawl job with rate limit handling.""" - headers = {"SGAI-APIKEY": API_KEY} - url = f"{BASE_URL}/v1/crawl/{task_id}" - - async with aiohttp.ClientSession() as session: - async with session.get(url, headers=headers) as response: - if response.status == 429: - # Rate limited - return special status to handle in polling loop - return {"status": "rate_limited", "retry_after": 60} - return await response.json() - - -async def poll_with_backoff(task_id: str, max_attempts: int = 20) -> Dict[str, Any]: - """ - Poll for crawl results with intelligent backoff to avoid rate limits. 
- - Args: - task_id: The task ID to poll for - max_attempts: Maximum number of polling attempts - - Returns: - The final result or raises an exception on timeout/failure - """ - print("⏳ Starting to poll for results with rate-limit protection...") - - # Initial wait to give the job time to start processing - await asyncio.sleep(15) - - for attempt in range(max_attempts): - try: - result = await poll_result(task_id) - status = result.get("status") - - if status == "rate_limited": - wait_time = min( - 90, 30 + (attempt * 10) - ) # Linear backoff for rate limits: 30s plus 10s per attempt, capped at 90s - print(f"⚠️ Rate limited! Waiting {wait_time}s before retry...") - await asyncio.sleep(wait_time) - continue - - elif status == "success": - return result - - elif status == "failed": - raise Exception(f"Crawl failed: {result.get('error', 'Unknown error')}") - - else: - # Calculate progressive wait time: start at 15s, increase gradually - base_wait = 15 - progressive_wait = min(60, base_wait + (attempt * 3)) # Cap at 60s - - print( - f"⏳ Status: {status} (attempt {attempt + 1}/{max_attempts}) - waiting {progressive_wait}s..." - ) - await asyncio.sleep(progressive_wait) - - except Exception as e: - if "rate" in str(e).lower() or "429" in str(e): - wait_time = min(90, 45 + (attempt * 10)) - print(f"⚠️ Rate limit detected in error, waiting {wait_time}s...") - await asyncio.sleep(wait_time) - continue - else: - print(f"❌ Error polling for results: {e}") - if attempt < max_attempts - 1: - await asyncio.sleep(20) # Wait before retry - continue - raise - - raise Exception(f"⏰ Timeout: Job did not complete after {max_attempts} attempts") - - -async def markdown_crawling_example(): - """ - Markdown Conversion Mode (NO AI/LLM Used) - - This example demonstrates cost-effective crawling that converts pages to clean markdown - WITHOUT any AI processing. Perfect for content archival and when you only need clean markdown. 
- """ - print("=" * 60) - print("ASYNC MARKDOWN CONVERSION MODE (NO AI/LLM)") - print("=" * 60) - print("Use case: Get clean markdown content without AI processing") - print("Cost: 2 credits per page (80% savings!)") - print("Features: Clean markdown conversion, metadata extraction") - print("โš ๏ธ NO AI/LLM PROCESSING - Pure HTML to markdown conversion only!") - print() - - # Markdown conversion request - NO AI/LLM processing - request_data = { - "url": "https://scrapegraphai.com/", - "extraction_mode": False, # FALSE = Markdown conversion mode (NO AI/LLM used) - "depth": 2, - "max_pages": 2, - "same_domain_only": True, - "sitemap": False, # Use sitemap for better coverage - # Note: No prompt needed when extraction_mode = False - } - - print(f"๐ŸŒ Target URL: {request_data['url']}") - print("๐Ÿค– AI Prompt: None (no AI processing)") - print(f"๐Ÿ“Š Crawl Depth: {request_data['depth']}") - print(f"๐Ÿ“„ Max Pages: {request_data['max_pages']}") - print(f"๐Ÿ—บ๏ธ Use Sitemap: {request_data['sitemap']}") - print("๐Ÿ’ก Mode: Pure HTML to markdown conversion") - print() - - # Start the markdown conversion job - print("๐Ÿš€ Starting markdown conversion job...") - response = await make_request(f"{BASE_URL}/v1/crawl", request_data) - task_id = response.get("task_id") - - if not task_id: - print("โŒ Failed to start markdown conversion job") - return - - print(f"๐Ÿ“‹ Task ID: {task_id}") - print("โณ Polling for results...") - print() - - # Poll for results with rate-limit protection - try: - result = await poll_with_backoff(task_id, max_attempts=20) - - print("โœ… Markdown conversion completed successfully!") - print() - - result_data = result.get("result", {}) - pages = result_data.get("pages", []) - crawled_urls = result_data.get("crawled_urls", []) - credits_used = result_data.get("credits_used", 0) - pages_processed = result_data.get("pages_processed", 0) - - # Prepare JSON output - json_output = { - "conversion_results": { - "pages_processed": pages_processed, - 
"credits_used": credits_used, - "cost_per_page": ( - credits_used / pages_processed if pages_processed > 0 else 0 - ), - "crawled_urls": crawled_urls, - }, - "markdown_content": {"total_pages": len(pages), "pages": []}, - } - - # Add page details to JSON - for i, page in enumerate(pages): - metadata = page.get("metadata", {}) - page_data = { - "page_number": i + 1, - "url": page.get("url"), - "title": page.get("title"), - "metadata": { - "word_count": metadata.get("word_count", 0), - "headers": metadata.get("headers", []), - "links_count": metadata.get("links_count", 0), - }, - "markdown_content": page.get("markdown", ""), - } - json_output["markdown_content"]["pages"].append(page_data) - - # Print JSON output - print("๐Ÿ“Š RESULTS IN JSON FORMAT:") - print("-" * 40) - print(json.dumps(json_output, indent=2, ensure_ascii=False)) - - except Exception as e: - print(f"โŒ Markdown conversion failed: {str(e)}") - - -async def main(): - """Run the async markdown crawling example.""" - print("๐ŸŒ ScrapeGraphAI Async Crawler - Markdown Conversion Example") - print("Cost-effective HTML to Markdown conversion (NO AI/LLM)") - print("=" * 60) - - # Check if API key is set - if API_KEY == "sgai-xxx": - print("โš ๏ธ Please set your API key in the .env file") - print(" Create a .env file with your API key:") - print(" API_KEY=your_api_key_here") - print() - print(" You can get your API key from: https://dashboard.scrapegraphai.com") - print() - print(" Example .env file:") - print(" API_KEY=sgai-your-actual-api-key-here") - print(" BASE_URL=https://api.scrapegraphai.com # Optional") - return - - print(f"๐Ÿ”‘ Using API key: {API_KEY[:10]}...") - print(f"๐ŸŒ Base URL: {BASE_URL}") - print() - - # Run the single example - await markdown_crawling_example() # Markdown conversion mode (NO AI) - - print("\n" + "=" * 60) - print("๐ŸŽ‰ Example completed!") - print("๐Ÿ’ก This demonstrates async markdown conversion mode:") - print(" โ€ข Cost-effective: Only 2 credits per page") - 
print(" • No AI/LLM processing - pure HTML to markdown conversion") - print(" • Perfect for content archival and documentation") - print(" • 80% cheaper than AI extraction modes!") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/crawl/async/async_crawl_markdown_example.py b/scrapegraph-py/examples/crawl/async/async_crawl_markdown_example.py deleted file mode 100644 index 767f1979..00000000 --- a/scrapegraph-py/examples/crawl/async/async_crawl_markdown_example.py +++ /dev/null @@ -1,218 +0,0 @@ -#!/usr/bin/env python3 -""" -Async example demonstrating the ScrapeGraphAI Crawler markdown conversion mode. - -This example shows how to use the async crawler in markdown conversion mode: -- Cost-effective markdown conversion (NO AI/LLM processing) -- 2 credits per page (80% savings compared to AI mode) -- Clean HTML to markdown conversion with metadata extraction - -Requirements: -- Python 3.7+ -- scrapegraph-py -- aiohttp (installed with scrapegraph-py) -- A valid API key - -Usage: - python async_crawl_markdown_example.py -""" - -import asyncio -import json -import os -from typing import Any, Dict - -from scrapegraph_py import AsyncClient - - -async def poll_for_result( - client: AsyncClient, crawl_id: str, max_attempts: int = 20 -) -> Dict[str, Any]: - """ - Poll for crawl results with intelligent backoff to avoid rate limits.
- - Args: - client: The async ScrapeGraph client - crawl_id: The crawl ID to poll for - max_attempts: Maximum number of polling attempts - - Returns: - The final result or raises an exception on timeout/failure - """ - print("⏳ Starting to poll for results with rate-limit protection...") - - # Initial wait to give the job time to start processing - await asyncio.sleep(15) - - for attempt in range(max_attempts): - try: - result = await client.get_crawl(crawl_id) - status = result.get("status") - - if status == "success": - return result - elif status == "failed": - raise Exception(f"Crawl failed: {result.get('error', 'Unknown error')}") - else: - # Calculate progressive wait time: start at 15s, increase gradually - base_wait = 15 - progressive_wait = min(60, base_wait + (attempt * 3)) # Cap at 60s - - print( - f"⏳ Status: {status} (attempt {attempt + 1}/{max_attempts}) - waiting {progressive_wait}s..." - ) - await asyncio.sleep(progressive_wait) - - except Exception as e: - if "rate" in str(e).lower() or "429" in str(e): - wait_time = min(90, 45 + (attempt * 10)) - print(f"⚠️ Rate limit detected in error, waiting {wait_time}s...") - await asyncio.sleep(wait_time) - continue - else: - print(f"❌ Error polling for results: {e}") - if attempt < max_attempts - 1: - await asyncio.sleep(20) # Wait before retry - continue - raise - - raise Exception(f"⏰ Timeout: Job did not complete after {max_attempts} attempts") - - -async def markdown_crawling_example(): - """ - Markdown Conversion Mode (NO AI/LLM Used) - - This example demonstrates cost-effective crawling that converts pages to clean markdown - WITHOUT any AI processing. Perfect for content archival and when you only need clean markdown.
- """ - print("=" * 60) - print("ASYNC MARKDOWN CONVERSION MODE (NO AI/LLM)") - print("=" * 60) - print("Use case: Get clean markdown content without AI processing") - print("Cost: 2 credits per page (80% savings!)") - print("Features: Clean markdown conversion, metadata extraction") - print("⚠️ NO AI/LLM PROCESSING - Pure HTML to markdown conversion only!") - print() - - # Initialize the async client - client = AsyncClient.from_env() - - # Target URL for markdown conversion - url = "https://scrapegraphai.com/" - - print(f"🌐 Target URL: {url}") - print("🤖 AI Prompt: None (no AI processing)") - print("📊 Crawl Depth: 2") - print("📄 Max Pages: 2") - print("🗺️ Use Sitemap: True") - print("💡 Mode: Pure HTML to markdown conversion") - print() - - # Start the markdown conversion job - print("🚀 Starting markdown conversion job...") - - # Call crawl with extraction_mode=False for markdown conversion - response = await client.crawl( - url=url, - extraction_mode=False, # FALSE = Markdown conversion mode (NO AI/LLM used) - depth=2, - max_pages=2, - same_domain_only=True, - sitemap=True, # Use sitemap for better coverage - # Note: No prompt or data_schema needed when extraction_mode=False - ) - - crawl_id = response.get("crawl_id") or response.get("task_id") - - if not crawl_id: - print("❌ Failed to start markdown conversion job") - return - - print(f"📋 Crawl ID: {crawl_id}") - print("⏳ Polling for results...") - print() - - # Poll for results with rate-limit protection - try: - result = await poll_for_result(client, crawl_id, max_attempts=20) - - print("✅ Markdown conversion completed successfully!") - print() - - result_data = result.get("result", {}) - pages = result_data.get("pages", []) - crawled_urls = result_data.get("crawled_urls", []) - credits_used = result_data.get("credits_used", 0) - pages_processed = result_data.get("pages_processed", 0) - - # Prepare JSON output - json_output = { - "conversion_results": { - "pages_processed":
pages_processed, - "credits_used": credits_used, - "cost_per_page": ( - credits_used / pages_processed if pages_processed > 0 else 0 - ), - "crawled_urls": crawled_urls, - }, - "markdown_content": {"total_pages": len(pages), "pages": []}, - } - - # Add page details to JSON - for i, page in enumerate(pages): - metadata = page.get("metadata", {}) - page_data = { - "page_number": i + 1, - "url": page.get("url"), - "title": page.get("title"), - "metadata": { - "word_count": metadata.get("word_count", 0), - "headers": metadata.get("headers", []), - "links_count": metadata.get("links_count", 0), - }, - "markdown_content": page.get("markdown", ""), - } - json_output["markdown_content"]["pages"].append(page_data) - - # Print JSON output - print("📊 RESULTS IN JSON FORMAT:") - print("-" * 40) - print(json.dumps(json_output, indent=2, ensure_ascii=False)) - - except Exception as e: - print(f"❌ Markdown conversion failed: {str(e)}") - - -async def main(): - """Run the async markdown crawling example.""" - print("🌐 ScrapeGraphAI Async Crawler - Markdown Conversion Example") - print("Cost-effective HTML to Markdown conversion (NO AI/LLM)") - print("=" * 60) - - # Check if API key is set - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("⚠️ Please set your API key in the environment variable SGAI_API_KEY") - print(" export SGAI_API_KEY=your_api_key_here") - print() - print(" You can get your API key from: https://dashboard.scrapegraphai.com") - return - - print(f"🔑 Using API key: {api_key[:10]}...") - print() - - # Run the markdown conversion example - await markdown_crawling_example() - - print("\n" + "=" * 60) - print("🎉 Example completed!") - print("💡 This demonstrates async markdown conversion mode:") - print(" • Cost-effective: Only 2 credits per page") - print(" • No AI/LLM processing - pure HTML to markdown conversion") - print(" • Perfect for content archival and documentation") - print(" • 80% cheaper than AI extraction modes!")
- - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/crawl/async/async_crawl_sitemap_example.py b/scrapegraph-py/examples/crawl/async/async_crawl_sitemap_example.py deleted file mode 100644 index 3e18b42e..00000000 --- a/scrapegraph-py/examples/crawl/async/async_crawl_sitemap_example.py +++ /dev/null @@ -1,239 +0,0 @@ -#!/usr/bin/env python3 -""" -Async example demonstrating the ScrapeGraphAI Crawler with sitemap functionality. - -This example shows how to use the async crawler with sitemap enabled for better page discovery: -- Sitemap helps discover more pages efficiently -- Better coverage of website content -- More comprehensive crawling results - -Requirements: -- Python 3.7+ -- scrapegraph-py -- aiohttp (installed with scrapegraph-py) -- A valid API key - -Usage: - python async_crawl_sitemap_example.py -""" - -import asyncio -import json -import os -from typing import Any, Dict - -from scrapegraph_py import AsyncClient - - -async def poll_for_result( - client: AsyncClient, crawl_id: str, max_attempts: int = 20 -) -> Dict[str, Any]: - """ - Poll for crawl results with intelligent backoff to avoid rate limits. 
- - Args: - client: The async ScrapeGraph client - crawl_id: The crawl ID to poll for - max_attempts: Maximum number of polling attempts - - Returns: - The final result or raises an exception on timeout/failure - """ - print("⏳ Starting to poll for results with rate-limit protection...") - - # Initial wait to give the job time to start processing - await asyncio.sleep(15) - - for attempt in range(max_attempts): - try: - result = await client.get_crawl(crawl_id) - status = result.get("status") - - if status == "success": - return result - elif status == "failed": - raise Exception(f"Crawl failed: {result.get('error', 'Unknown error')}") - else: - # Calculate progressive wait time: start at 15s, increase gradually - base_wait = 15 - progressive_wait = min(60, base_wait + (attempt * 3)) # Cap at 60s - - print( - f"⏳ Status: {status} (attempt {attempt + 1}/{max_attempts}) - waiting {progressive_wait}s..." - ) - await asyncio.sleep(progressive_wait) - - except Exception as e: - if "rate" in str(e).lower() or "429" in str(e): - wait_time = min(90, 45 + (attempt * 10)) - print(f"⚠️ Rate limit detected in error, waiting {wait_time}s...") - await asyncio.sleep(wait_time) - continue - else: - print(f"❌ Error polling for results: {e}") - if attempt < max_attempts - 1: - await asyncio.sleep(20) # Wait before retry - continue - raise - - raise Exception(f"⏰ Timeout: Job did not complete after {max_attempts} attempts") - - -async def sitemap_crawling_example(): - """ - Async Sitemap-enabled Crawling Example - - This example demonstrates how to use sitemap for better page discovery with async client. - Sitemap helps the crawler find more pages efficiently by using the website's sitemap.xml.
- """ - print("=" * 60) - print("ASYNC SITEMAP-ENABLED CRAWLING EXAMPLE") - print("=" * 60) - print("Use case: Comprehensive website crawling with sitemap discovery") - print("Benefits: Better page coverage, more efficient crawling") - print("Features: Sitemap-based page discovery, structured data extraction") - print() - - # Initialize the async client - client = AsyncClient.from_env() - - # Target URL - using a website that likely has a sitemap - url = "https://www.giemmeagordo.com/risultati-ricerca-annunci/?sort=newest&search_city=&search_lat=null&search_lng=null&search_category=0&search_type=0&search_min_price=&search_max_price=&bagni=&bagni_comparison=equal&camere=&camere_comparison=equal" - - # Schema for real estate listings - schema = { - "type": "object", - "properties": { - "listings": { - "type": "array", - "items": { - "type": "object", - "properties": { - "title": {"type": "string"}, - "price": {"type": "string"}, - "location": {"type": "string"}, - "description": {"type": "string"}, - "features": {"type": "array", "items": {"type": "string"}}, - "url": {"type": "string"}, - }, - }, - } - }, - } - - prompt = "Extract all real estate listings with their details including title, price, location, description, and features" - - print(f"🌐 Target URL: {url}") - print("🤖 AI Prompt: Extract real estate listings") - print("📊 Crawl Depth: 1") - print("📄 Max Pages: 10") - print("🗺️ Use Sitemap: True (enabled for better page discovery)") - print("🏠 Same Domain Only: True") - print("💾 Cache Website: True") - print("💡 Mode: AI extraction with sitemap discovery") - print() - - # Start the sitemap-enabled crawl job - print("🚀 Starting async sitemap-enabled crawl job...") - - # Call crawl with sitemap=True for better page discovery - response = await client.crawl( - url=url, - prompt=prompt, - data_schema=schema, - extraction_mode=True, # AI extraction mode - depth=1, - max_pages=10, - same_domain_only=True, - cache_website=True, -
sitemap=True, # Enable sitemap for better page discovery - ) - - crawl_id = response.get("crawl_id") or response.get("task_id") - - if not crawl_id: - print("❌ Failed to start sitemap-enabled crawl job") - return - - print(f"📋 Crawl ID: {crawl_id}") - print("⏳ Polling for results...") - print() - - # Poll for results with rate-limit protection - try: - result = await poll_for_result(client, crawl_id, max_attempts=20) - - print("✅ Async sitemap-enabled crawl completed successfully!") - print() - - result_data = result.get("result", {}) - llm_result = result_data.get("llm_result", {}) - crawled_urls = result_data.get("crawled_urls", []) - credits_used = result_data.get("credits_used", 0) - pages_processed = result_data.get("pages_processed", 0) - - # Prepare JSON output - json_output = { - "crawl_results": { - "pages_processed": pages_processed, - "credits_used": credits_used, - "cost_per_page": ( - credits_used / pages_processed if pages_processed > 0 else 0 - ), - "crawled_urls": crawled_urls, - "sitemap_enabled": True, - }, - "extracted_data": llm_result, - } - - # Print JSON output - print("📊 RESULTS IN JSON FORMAT:") - print("-" * 40) - print(json.dumps(json_output, indent=2, ensure_ascii=False)) - - # Print summary - print("\n" + "=" * 60) - print("📈 CRAWL SUMMARY:") - print("=" * 60) - print(f"✅ Pages processed: {pages_processed}") - print(f"💰 Credits used: {credits_used}") - print(f"🔗 URLs crawled: {len(crawled_urls)}") - print(f"🗺️ Sitemap enabled: Yes") - print(f"📊 Data extracted: {len(llm_result.get('listings', []))} listings found") - - except Exception as e: - print(f"❌ Async sitemap-enabled crawl failed: {str(e)}") - - -async def main(): - """Run the async sitemap crawling example.""" - print("🌐 ScrapeGraphAI Async Crawler - Sitemap Example") - print("Comprehensive website crawling with sitemap discovery") - print("=" * 60) - - # Check if API key is set - api_key = os.getenv("SGAI_API_KEY") - if not api_key: -
print("⚠️ Please set your API key in the environment variable SGAI_API_KEY") - print(" export SGAI_API_KEY=your_api_key_here") - print() - print(" You can get your API key from: https://dashboard.scrapegraphai.com") - return - - print(f"🔑 Using API key: {api_key[:10]}...") - print() - - # Run the sitemap crawling example - await sitemap_crawling_example() - - print("\n" + "=" * 60) - print("🎉 Example completed!") - print("💡 This demonstrates async sitemap-enabled crawling:") - print(" • Better page discovery using sitemap.xml") - print(" • More comprehensive website coverage") - print(" • Efficient crawling of structured websites") - print(" • Perfect for e-commerce, news sites, and content-heavy websites") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/crawl/async/async_crawl_with_path_filtering_example.py b/scrapegraph-py/examples/crawl/async/async_crawl_with_path_filtering_example.py deleted file mode 100644 index b3650ad5..00000000 --- a/scrapegraph-py/examples/crawl/async/async_crawl_with_path_filtering_example.py +++ /dev/null @@ -1,98 +0,0 @@ -""" -Example of using the async crawl endpoint with path filtering. - -This example demonstrates how to use include_paths and exclude_paths -to control which pages are crawled on a website (async version).
-""" -import asyncio -import os -from scrapegraph_py import AsyncClient -from pydantic import BaseModel, Field - - -# Define your output schema -class ProductInfo(BaseModel): - name: str = Field(description="Product name") - price: str = Field(description="Product price") - category: str = Field(description="Product category") - - -class CrawlResult(BaseModel): - products: list[ProductInfo] = Field(description="List of products found") - categories: list[str] = Field(description="List of product categories") - - -async def main(): - # Initialize the async client - sgai_api_key = os.getenv("SGAI_API_KEY") - - async with AsyncClient(api_key=sgai_api_key) as client: - print("๐Ÿ” Starting async crawl with path filtering...") - print("=" * 50) - - # Example: Crawl only product pages, excluding certain sections - print("\n๐Ÿ“ Crawling e-commerce site with smart path filtering") - print("-" * 50) - - result = await client.crawl( - url="https://example-shop.com", - prompt="Extract all products with their names, prices, and categories", - data_schema=CrawlResult.model_json_schema(), - extraction_mode=True, - depth=3, - max_pages=50, - sitemap=True, # Use sitemap for better coverage - include_paths=[ - "/products/**", # Include all product pages - "/categories/*", # Include category listings - "/collections/*" # Include collection pages - ], - exclude_paths=[ - "/products/out-of-stock/*", # Skip out-of-stock items - "/products/*/reviews", # Skip review pages - "/admin/**", # Skip admin pages - "/api/**", # Skip API endpoints - "/*.pdf" # Skip PDF files - ] - ) - - print(f"Task ID: {result.get('task_id')}") - print("\nโœ… Async crawl job started successfully!") - - # You can then poll for results using get_crawl - task_id = result.get('task_id') - if task_id: - print(f"\nโณ Polling for results (task: {task_id})...") - - # Poll every 5 seconds until complete - max_attempts = 60 # 5 minutes max - for attempt in range(max_attempts): - await asyncio.sleep(5) - status = await 
client.get_crawl(task_id) - - state = status.get('state', 'UNKNOWN') - print(f"Attempt {attempt + 1}: Status = {state}") - - if state == 'SUCCESS': - print("\n✨ Crawl completed successfully!") - result_data = status.get('result', {}) - print(f"Found {len(result_data.get('products', []))} products") - break - elif state in ['FAILURE', 'REVOKED']: - print(f"\n❌ Crawl failed with status: {state}") - break - else: - print("\n⏰ Timeout: Crawl took too long") - - print("\n" + "=" * 50) - print("💡 Tips for effective path filtering:") - print("=" * 50) - print("• Combine with sitemap=True for better page discovery") - print("• Use include_paths to focus on content-rich sections") - print("• Use exclude_paths to skip pages with duplicate content") - print("• Test your patterns on a small max_pages first") - print("• Remember: exclude_paths overrides include_paths") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/crawl/sync/basic_crawl_example.py b/scrapegraph-py/examples/crawl/sync/basic_crawl_example.py deleted file mode 100644 index 1f58d5d5..00000000 --- a/scrapegraph-py/examples/crawl/sync/basic_crawl_example.py +++ /dev/null @@ -1,138 +0,0 @@ -""" -Example demonstrating how to use the ScrapeGraphAI /v1/crawl/ API endpoint with a custom schema.
- -Requirements: -- Python 3.7+ -- scrapegraph-py -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY="your_sgai_api_key" -""" - -import json -import os -import time -from typing import Any, Dict - -from dotenv import load_dotenv - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -def main(): - if not os.getenv("SGAI_API_KEY"): - print("Error: SGAI_API_KEY not found in .env file") - print("Please create a .env file with your API key:") - print('SGAI_API_KEY="your_sgai_api_key"') - return - - schema: Dict[str, Any] = { - "$schema": "http://json-schema.org/draft-07/schema#", - "title": "ScrapeGraphAI Website Content", - "type": "object", - "properties": { - "company": { - "type": "object", - "properties": { - "name": {"type": "string"}, - "description": {"type": "string"}, - "features": {"type": "array", "items": {"type": "string"}}, - "contact_email": {"type": "string", "format": "email"}, - "social_links": { - "type": "object", - "properties": { - "github": {"type": "string", "format": "uri"}, - "linkedin": {"type": "string", "format": "uri"}, - "twitter": {"type": "string", "format": "uri"}, - }, - "additionalProperties": False, - }, - }, - "required": ["name", "description"], - }, - "services": { - "type": "array", - "items": { - "type": "object", - "properties": { - "service_name": {"type": "string"}, - "description": {"type": "string"}, - "features": {"type": "array", "items": {"type": "string"}}, - }, - "required": ["service_name", "description"], - }, - }, - "legal": { - "type": "object", - "properties": { - "privacy_policy": {"type": "string"}, - "terms_of_service": {"type": "string"}, - }, - "required": ["privacy_policy", "terms_of_service"], - }, - }, - "required": ["company", "services", "legal"], - } - - url = "https://scrapegraphai.com/" - prompt = ( - "What does the company do? 
and I need text content from their privacy and terms" - ) - - try: - client = Client.from_env() - print(f"\nStarting crawl for: {url}") - start_time = time.time() - crawl_response = client.crawl( - url=url, - prompt=prompt, - data_schema=schema, - cache_website=True, - depth=2, - max_pages=2, - same_domain_only=True, - sitemap=True, # Use sitemap for better page discovery - batch_size=1, - ) - execution_time = time.time() - start_time - print(f"POST /v1/crawl/ execution time: {execution_time:.2f} seconds") - print("\nCrawl job started. Response:") - print(json.dumps(crawl_response, indent=2)) - - crawl_id = crawl_response.get("id") or crawl_response.get("task_id") - start_time = time.time() - if crawl_id: - print("\nPolling for crawl result...") - for _ in range(10): - time.sleep(5) - result = client.get_crawl(crawl_id) - if result.get("status") == "success" and result.get("result"): - execution_time = time.time() - start_time - print( - f"GET /v1/crawl/{crawl_id} execution time: {execution_time:.2f} seconds" - ) - print("\nCrawl completed. Result:") - print(json.dumps(result["result"]["llm_result"], indent=2)) - break - elif result.get("status") == "failed": - print("\nCrawl failed. Result:") - print(json.dumps(result, indent=2)) - break - else: - print(f"Status: {result.get('status')}, waiting...") - else: - print("Crawl did not complete in time.") - else: - print("No crawl ID found in response. Synchronous result:") - print(json.dumps(crawl_response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/crawl/sync/crawl_example.py b/scrapegraph-py/examples/crawl/sync/crawl_example.py deleted file mode 100644 index fa25639e..00000000 --- a/scrapegraph-py/examples/crawl/sync/crawl_example.py +++ /dev/null @@ -1,115 +0,0 @@ -""" -Example demonstrating how to use the ScrapeGraphAI /v1/crawl/ API endpoint.
- -Requirements: -- Python 3.7+ -- scrapegraph-py -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import json -import os -import time - -from dotenv import load_dotenv - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -def main(): - if not os.getenv("SGAI_API_KEY"): - print("Error: SGAI_API_KEY not found in .env file") - print("Please create a .env file with your API key:") - print("SGAI_API_KEY=your_api_key_here") - return - - # Simple schema for founders' information - schema = { - "type": "object", - "properties": { - "founders": { - "type": "array", - "items": { - "type": "object", - "properties": { - "name": {"type": "string"}, - "title": {"type": "string"}, - "bio": {"type": "string"}, - "linkedin": {"type": "string"}, - "twitter": {"type": "string"}, - }, - }, - } - }, - } - - url = "https://scrapegraphai.com" - prompt = "Extract the founders' information" - - try: - # Initialize the client - client = Client.from_env() - - # Start the crawl job - print(f"\nStarting crawl for: {url}") - start_time = time.time() - crawl_response = client.crawl( - url=url, - prompt=prompt, - data_schema=schema, - cache_website=True, - depth=2, - max_pages=2, - same_domain_only=True, - sitemap=True, # Use sitemap for better page discovery - # batch_size is optional and will be excluded if not provided - ) - execution_time = time.time() - start_time - print(f"POST /v1/crawl/ execution time: {execution_time:.2f} seconds") - print("\nCrawl job started.
Response:") - print(json.dumps(crawl_response, indent=2)) - - # If the crawl is asynchronous and returns an ID, fetch the result - crawl_id = crawl_response.get("id") or crawl_response.get("task_id") - start_time = time.time() - if crawl_id: - print("\nPolling for crawl result...") - # Increase timeout to 5 minutes (60 iterations × 5 seconds) - for i in range(60): - time.sleep(5) - result = client.get_crawl(crawl_id) - if result.get("status") == "success" and result.get("result"): - execution_time = time.time() - start_time - print( - f"GET /v1/crawl/{crawl_id} execution time: {execution_time:.2f} seconds" - ) - print("\nCrawl completed. Result:") - print(json.dumps(result["result"]["llm_result"], indent=2)) - break - elif result.get("status") == "failed": - print("\nCrawl failed. Result:") - print(json.dumps(result, indent=2)) - break - else: - elapsed_time = (i + 1) * 5 - print( - f"Status: {result.get('status')}, waiting... ({elapsed_time}s elapsed)" - ) - else: - print("Crawl did not complete within 5 minutes.") - else: - print("No crawl ID found in response. Synchronous result:") - print(json.dumps(crawl_response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/crawl/sync/crawl_markdown_direct_api_example.py b/scrapegraph-py/examples/crawl/sync/crawl_markdown_direct_api_example.py deleted file mode 100644 index 2f73ab43..00000000 --- a/scrapegraph-py/examples/crawl/sync/crawl_markdown_direct_api_example.py +++ /dev/null @@ -1,254 +0,0 @@ -#!/usr/bin/env python3 -""" -Example script demonstrating the ScrapeGraphAI Crawler markdown conversion mode.
- -This example shows how to use the crawler in markdown conversion mode: -- Cost-effective markdown conversion (NO AI/LLM processing) -- 2 credits per page (80% savings compared to AI mode) -- Clean HTML to markdown conversion with metadata extraction - -Requirements: -- Python 3.7+ -- requests -- python-dotenv -- A .env file with your API_KEY - -Example .env file: -API_KEY=your_api_key_here -""" - -import json -import os -import time -from typing import Any, Dict - -import requests -from dotenv import load_dotenv - -# Load environment variables from .env file -load_dotenv() - -# Configuration - API key from environment or fallback -API_KEY = os.getenv("API_KEY", "sgai-xxx") # Read API_KEY, the name documented above; placeholder fallback -BASE_URL = os.getenv("BASE_URL", "http://localhost:8001") # Can be overridden via env - - -def make_request(url: str, data: Dict[str, Any]) -> Dict[str, Any]: - """Make an HTTP request to the API.""" - headers = {"Content-Type": "application/json", "SGAI-APIKEY": API_KEY} - - response = requests.post(url, json=data, headers=headers) - return response.json() - - -def poll_result(task_id: str) -> Dict[str, Any]: - """Poll for the result of a crawl job with rate limit handling.""" - headers = {"SGAI-APIKEY": API_KEY} - url = f"{BASE_URL}/v1/crawl/{task_id}" - - response = requests.get(url, headers=headers) - - if response.status_code == 429: - # Rate limited - return special status to handle in polling loop - return {"status": "rate_limited", "retry_after": 60} - - return response.json() - - -def poll_with_backoff(task_id: str, max_attempts: int = 20) -> Dict[str, Any]: - """ - Poll for crawl results with intelligent backoff to avoid rate limits.
- - Args: - task_id: The task ID to poll for - max_attempts: Maximum number of polling attempts - - Returns: - The final result or raises an exception on timeout/failure - """ - print("⏳ Starting to poll for results with rate-limit protection...") - - # Initial wait to give the job time to start processing - time.sleep(15) - - for attempt in range(max_attempts): - try: - result = poll_result(task_id) - status = result.get("status") - - if status == "rate_limited": - wait_time = min( - 90, 30 + (attempt * 10) - ) # Linear backoff for rate limits: +10s per attempt, capped at 90s - print(f"⚠️ Rate limited! Waiting {wait_time}s before retry...") - time.sleep(wait_time) - continue - - elif status == "success": - return result - - elif status == "failed": - raise Exception(f"Crawl failed: {result.get('error', 'Unknown error')}") - - else: - # Calculate progressive wait time: start at 15s, increase gradually - base_wait = 15 - progressive_wait = min(60, base_wait + (attempt * 3)) # Cap at 60s - - print( - f"⏳ Status: {status} (attempt {attempt + 1}/{max_attempts}) - waiting {progressive_wait}s..." - ) - time.sleep(progressive_wait) - - except Exception as e: - if "rate" in str(e).lower() or "429" in str(e): - wait_time = min(90, 45 + (attempt * 10)) - print(f"⚠️ Rate limit detected in error, waiting {wait_time}s...") - time.sleep(wait_time) - continue - else: - print(f"❌ Error polling for results: {e}") - if attempt < max_attempts - 1: - time.sleep(20) # Wait before retry - continue - raise - - raise Exception(f"⏰ Timeout: Job did not complete after {max_attempts} attempts") - - -def markdown_crawling_example(): - """ - Markdown Conversion Mode (NO AI/LLM Used) - - This example demonstrates cost-effective crawling that converts pages to clean markdown - WITHOUT any AI processing. Perfect for content archival and when you only need clean markdown.
- """ - print("=" * 60) - print("MARKDOWN CONVERSION MODE (NO AI/LLM)") - print("=" * 60) - print("Use case: Get clean markdown content without AI processing") - print("Cost: 2 credits per page (80% savings!)") - print("Features: Clean markdown conversion, metadata extraction") - print("⚠️ NO AI/LLM PROCESSING - Pure HTML to markdown conversion only!") - print() - - # Markdown conversion request - NO AI/LLM processing - request_data = { - "url": "https://scrapegraphai.com/", - "extraction_mode": False, # FALSE = Markdown conversion mode (NO AI/LLM used) - "depth": 2, - "max_pages": 2, - "same_domain_only": True, - "sitemap": False, # Sitemap discovery disabled here; set to True for better coverage - # Note: No prompt needed when extraction_mode = False - } - - print(f"🌐 Target URL: {request_data['url']}") - print("🤖 AI Prompt: None (no AI processing)") - print(f"📊 Crawl Depth: {request_data['depth']}") - print(f"📄 Max Pages: {request_data['max_pages']}") - print(f"🗺️ Use Sitemap: {request_data['sitemap']}") - print("💡 Mode: Pure HTML to markdown conversion") - print() - - # Start the markdown conversion job - print("🚀 Starting markdown conversion job...") - response = make_request(f"{BASE_URL}/v1/crawl", request_data) - task_id = response.get("task_id") - - if not task_id: - print("❌ Failed to start markdown conversion job") - return - - print(f"📋 Task ID: {task_id}") - print("⏳ Polling for results...") - print() - - # Poll for results with rate-limit protection - try: - result = poll_with_backoff(task_id, max_attempts=20) - - print("✅ Markdown conversion completed successfully!") - print() - - result_data = result.get("result", {}) - pages = result_data.get("pages", []) - crawled_urls = result_data.get("crawled_urls", []) - credits_used = result_data.get("credits_used", 0) - pages_processed = result_data.get("pages_processed", 0) - - # Prepare JSON output - json_output = { - "conversion_results": { - "pages_processed": pages_processed, - "credits_used":
credits_used, - "cost_per_page": ( - credits_used / pages_processed if pages_processed > 0 else 0 - ), - "crawled_urls": crawled_urls, - }, - "markdown_content": {"total_pages": len(pages), "pages": []}, - } - - # Add page details to JSON - for i, page in enumerate(pages): - metadata = page.get("metadata", {}) - page_data = { - "page_number": i + 1, - "url": page.get("url"), - "title": page.get("title"), - "metadata": { - "word_count": metadata.get("word_count", 0), - "headers": metadata.get("headers", []), - "links_count": metadata.get("links_count", 0), - }, - "markdown_content": page.get("markdown", ""), - } - json_output["markdown_content"]["pages"].append(page_data) - - # Print JSON output - print("📊 RESULTS IN JSON FORMAT:") - print("-" * 40) - print(json.dumps(json_output, indent=2, ensure_ascii=False)) - - except Exception as e: - print(f"❌ Markdown conversion failed: {str(e)}") - - -def main(): - """Run the markdown crawling example.""" - print("🌐 ScrapeGraphAI Crawler - Markdown Conversion Example") - print("Cost-effective HTML to Markdown conversion (NO AI/LLM)") - print("=" * 60) - - # Check if API key is set - if API_KEY == "sgai-xxx": - print("⚠️ Please set your API key in the .env file") - print(" Create a .env file with your API key:") - print(" API_KEY=your_api_key_here") - print() - print(" You can get your API key from: https://dashboard.scrapegraphai.com") - print() - print(" Example .env file:") - print(" API_KEY=sgai-your-actual-api-key-here") - print(" BASE_URL=https://api.scrapegraphai.com # Optional") - return - - print(f"🔑 Using API key: {API_KEY[:10]}...") - print(f"🌐 Base URL: {BASE_URL}") - print() - - # Run the single example - markdown_crawling_example() # Markdown conversion mode (NO AI) - - print("\n" + "=" * 60) - print("🎉 Example completed!") - print("💡 This demonstrates markdown conversion mode:") - print(" • Cost-effective: Only 2 credits per page") - print(" • No AI/LLM processing - pure HTML to
markdown conversion") - print(" • Perfect for content archival and documentation") - print(" • 80% cheaper than AI extraction modes!") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/crawl/sync/crawl_markdown_example.py b/scrapegraph-py/examples/crawl/sync/crawl_markdown_example.py deleted file mode 100644 index 01c682b3..00000000 --- a/scrapegraph-py/examples/crawl/sync/crawl_markdown_example.py +++ /dev/null @@ -1,226 +0,0 @@ -#!/usr/bin/env python3 -""" -Example demonstrating the ScrapeGraphAI Crawler markdown conversion mode. - -This example shows how to use the crawler in markdown conversion mode: -- Cost-effective markdown conversion (NO AI/LLM processing) -- 2 credits per page (80% savings compared to AI mode) -- Clean HTML to markdown conversion with metadata extraction - -Requirements: -- Python 3.7+ -- scrapegraph-py -- python-dotenv -- A valid API key (set in .env file as SGAI_API_KEY=your_key or environment variable) - -Usage: - python crawl_markdown_example.py -""" - -import json -import os -import time -from typing import Any, Dict - -from dotenv import load_dotenv - -from scrapegraph_py import Client - - -def poll_for_result( - client: Client, crawl_id: str, max_attempts: int = 20 -) -> Dict[str, Any]: - """ - Poll for crawl results with intelligent backoff to avoid rate limits. 
- - Args: - client: The ScrapeGraph client - crawl_id: The crawl ID to poll for - max_attempts: Maximum number of polling attempts - - Returns: - The final result or raises an exception on timeout/failure - """ - print("⏳ Starting to poll for results with rate-limit protection...") - - # Initial wait to give the job time to start processing - time.sleep(15) - - for attempt in range(max_attempts): - try: - result = client.get_crawl(crawl_id) - status = result.get("status") - - if status == "success": - return result - elif status == "failed": - raise Exception(f"Crawl failed: {result.get('error', 'Unknown error')}") - else: - # Calculate progressive wait time: start at 15s, increase gradually - base_wait = 15 - progressive_wait = min(60, base_wait + (attempt * 3)) # Cap at 60s - - print( - f"⏳ Status: {status} (attempt {attempt + 1}/{max_attempts}) - waiting {progressive_wait}s..." - ) - time.sleep(progressive_wait) - - except Exception as e: - if "rate" in str(e).lower() or "429" in str(e): - wait_time = min(90, 45 + (attempt * 10)) - print(f"⚠️ Rate limit detected in error, waiting {wait_time}s...") - time.sleep(wait_time) - continue - else: - print(f"❌ Error polling for results: {e}") - if attempt < max_attempts - 1: - time.sleep(20) # Wait before retry - continue - raise - - raise Exception(f"⏰ Timeout: Job did not complete after {max_attempts} attempts") - - -def markdown_crawling_example(): - """ - Markdown Conversion Mode (NO AI/LLM Used) - - This example demonstrates cost-effective crawling that converts pages to clean markdown - WITHOUT any AI processing. Perfect for content archival and when you only need clean markdown. 
- """ - print("=" * 60) - print("MARKDOWN CONVERSION MODE (NO AI/LLM)") - print("=" * 60) - print("Use case: Get clean markdown content without AI processing") - print("Cost: 2 credits per page (80% savings!)") - print("Features: Clean markdown conversion, metadata extraction") - print("⚠️ NO AI/LLM PROCESSING - Pure HTML to markdown conversion only!") - print() - - # Initialize the client - client = Client.from_env() - - # Target URL for markdown conversion - url = "https://scrapegraphai.com/" - - print(f"🌐 Target URL: {url}") - print("🤖 AI Prompt: None (no AI processing)") - print("📊 Crawl Depth: 2") - print("📄 Max Pages: 2") - print("🗺️ Use Sitemap: True") - print("💡 Mode: Pure HTML to markdown conversion") - print() - - # Start the markdown conversion job - print("🚀 Starting markdown conversion job...") - - # Call crawl with extraction_mode=False for markdown conversion - response = client.crawl( - url=url, - extraction_mode=False, # FALSE = Markdown conversion mode (NO AI/LLM used) - depth=2, - max_pages=2, - same_domain_only=True, - sitemap=True, # Use sitemap for better coverage - # Note: No prompt or data_schema needed when extraction_mode=False - ) - - crawl_id = response.get("crawl_id") or response.get("task_id") - - if not crawl_id: - print("❌ Failed to start markdown conversion job") - return - - print(f"📋 Crawl ID: {crawl_id}") - print("⏳ Polling for results...") - print() - - # Poll for results with rate-limit protection - try: - result = poll_for_result(client, crawl_id, max_attempts=20) - - print("✅ Markdown conversion completed successfully!") - print() - - result_data = result.get("result", {}) - pages = result_data.get("pages", []) - crawled_urls = result_data.get("crawled_urls", []) - credits_used = result_data.get("credits_used", 0) - pages_processed = result_data.get("pages_processed", 0) - - # Prepare JSON output - json_output = { - "conversion_results": { - "pages_processed": pages_processed, - 
"credits_used": credits_used, - "cost_per_page": ( - credits_used / pages_processed if pages_processed > 0 else 0 - ), - "crawled_urls": crawled_urls, - }, - "markdown_content": {"total_pages": len(pages), "pages": []}, - } - - # Add page details to JSON - for i, page in enumerate(pages): - metadata = page.get("metadata", {}) - page_data = { - "page_number": i + 1, - "url": page.get("url"), - "title": page.get("title"), - "metadata": { - "word_count": metadata.get("word_count", 0), - "headers": metadata.get("headers", []), - "links_count": metadata.get("links_count", 0), - }, - "markdown_content": page.get("markdown", ""), - } - json_output["markdown_content"]["pages"].append(page_data) - - # Print JSON output - print("๐Ÿ“Š RESULTS IN JSON FORMAT:") - print("-" * 40) - print(json.dumps(json_output, indent=2, ensure_ascii=False)) - - except Exception as e: - print(f"โŒ Markdown conversion failed: {str(e)}") - - -def main(): - """Run the markdown crawling example.""" - print("๐ŸŒ ScrapeGraphAI Crawler - Markdown Conversion Example") - print("Cost-effective HTML to Markdown conversion (NO AI/LLM)") - print("=" * 60) - - # Load environment variables from .env file - load_dotenv() - - # Check if API key is set - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โš ๏ธ Please set your API key in the environment variable SGAI_API_KEY") - print(" Option 1: Create a .env file with: SGAI_API_KEY=your_api_key_here") - print( - " Option 2: Set environment variable: export SGAI_API_KEY=your_api_key_here" - ) - print() - print(" You can get your API key from: https://dashboard.scrapegraphai.com") - return - - print(f"๐Ÿ”‘ Using API key: {api_key[:10]}...") - print() - - # Run the markdown conversion example - markdown_crawling_example() - - print("\n" + "=" * 60) - print("๐ŸŽ‰ Example completed!") - print("๐Ÿ’ก This demonstrates markdown conversion mode:") - print(" โ€ข Cost-effective: Only 2 credits per page") - print(" โ€ข No AI/LLM processing - pure HTML to 
markdown conversion") - print(" • Perfect for content archival and documentation") - print(" • 80% cheaper than AI extraction modes!") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/crawl/sync/crawl_sitemap_example.py b/scrapegraph-py/examples/crawl/sync/crawl_sitemap_example.py deleted file mode 100644 index 1fb9a695..00000000 --- a/scrapegraph-py/examples/crawl/sync/crawl_sitemap_example.py +++ /dev/null @@ -1,247 +0,0 @@ -#!/usr/bin/env python3 -""" -Example demonstrating the ScrapeGraphAI Crawler with sitemap functionality. - -This example shows how to use the crawler with sitemap enabled for better page discovery: -- Sitemap helps discover more pages efficiently -- Better coverage of website content -- More comprehensive crawling results - -Requirements: -- Python 3.7+ -- scrapegraph-py -- python-dotenv -- A valid API key (set in .env file as SGAI_API_KEY=your_key or environment variable) - -Usage: - python crawl_sitemap_example.py -""" - -import json -import os -import time -from typing import Any, Dict - -from dotenv import load_dotenv - -from scrapegraph_py import Client - - -def poll_for_result( - client: Client, crawl_id: str, max_attempts: int = 20 -) -> Dict[str, Any]: - """ - Poll for crawl results with intelligent backoff to avoid rate limits. 
- - Args: - client: The ScrapeGraph client - crawl_id: The crawl ID to poll for - max_attempts: Maximum number of polling attempts - - Returns: - The final result or raises an exception on timeout/failure - """ - print("⏳ Starting to poll for results with rate-limit protection...") - - # Initial wait to give the job time to start processing - time.sleep(15) - - for attempt in range(max_attempts): - try: - result = client.get_crawl(crawl_id) - status = result.get("status") - - if status == "success": - return result - elif status == "failed": - raise Exception(f"Crawl failed: {result.get('error', 'Unknown error')}") - else: - # Calculate progressive wait time: start at 15s, increase gradually - base_wait = 15 - progressive_wait = min(60, base_wait + (attempt * 3)) # Cap at 60s - - print( - f"⏳ Status: {status} (attempt {attempt + 1}/{max_attempts}) - waiting {progressive_wait}s..." - ) - time.sleep(progressive_wait) - - except Exception as e: - if "rate" in str(e).lower() or "429" in str(e): - wait_time = min(90, 45 + (attempt * 10)) - print(f"⚠️ Rate limit detected in error, waiting {wait_time}s...") - time.sleep(wait_time) - continue - else: - print(f"❌ Error polling for results: {e}") - if attempt < max_attempts - 1: - time.sleep(20) # Wait before retry - continue - raise - - raise Exception(f"⏰ Timeout: Job did not complete after {max_attempts} attempts") - - -def sitemap_crawling_example(): - """ - Sitemap-enabled Crawling Example - - This example demonstrates how to use sitemap for better page discovery. - Sitemap helps the crawler find more pages efficiently by using the website's sitemap.xml. 
- """ - print("=" * 60) - print("SITEMAP-ENABLED CRAWLING EXAMPLE") - print("=" * 60) - print("Use case: Comprehensive website crawling with sitemap discovery") - print("Benefits: Better page coverage, more efficient crawling") - print("Features: Sitemap-based page discovery, structured data extraction") - print() - - # Initialize the client - client = Client.from_env() - - # Target URL - using a website that likely has a sitemap - url = "https://www.giemmeagordo.com/risultati-ricerca-annunci/?sort=newest&search_city=&search_lat=null&search_lng=null&search_category=0&search_type=0&search_min_price=&search_max_price=&bagni=&bagni_comparison=equal&camere=&camere_comparison=equal" - - # Schema for real estate listings - schema = { - "type": "object", - "properties": { - "listings": { - "type": "array", - "items": { - "type": "object", - "properties": { - "title": {"type": "string"}, - "price": {"type": "string"}, - "location": {"type": "string"}, - "description": {"type": "string"}, - "features": {"type": "array", "items": {"type": "string"}}, - "url": {"type": "string"}, - }, - }, - } - }, - } - - prompt = "Extract all real estate listings with their details including title, price, location, description, and features" - - print(f"🌐 Target URL: {url}") - print("🤖 AI Prompt: Extract real estate listings") - print("📊 Crawl Depth: 1") - print("📄 Max Pages: 10") - print("🗺️ Use Sitemap: True (enabled for better page discovery)") - print("🏠 Same Domain Only: True") - print("💾 Cache Website: True") - print("💡 Mode: AI extraction with sitemap discovery") - print() - - # Start the sitemap-enabled crawl job - print("🚀 Starting sitemap-enabled crawl job...") - - # Call crawl with sitemap=True for better page discovery - response = client.crawl( - url=url, - prompt=prompt, - data_schema=schema, - extraction_mode=True, # AI extraction mode - depth=1, - max_pages=10, - same_domain_only=True, - cache_website=True, - sitemap=True, # Enable sitemap for 
better page discovery - ) - - crawl_id = response.get("crawl_id") or response.get("task_id") - - if not crawl_id: - print("❌ Failed to start sitemap-enabled crawl job") - return - - print(f"📋 Crawl ID: {crawl_id}") - print("⏳ Polling for results...") - print() - - # Poll for results with rate-limit protection - try: - result = poll_for_result(client, crawl_id, max_attempts=20) - - print("✅ Sitemap-enabled crawl completed successfully!") - print() - - result_data = result.get("result", {}) - llm_result = result_data.get("llm_result", {}) - crawled_urls = result_data.get("crawled_urls", []) - credits_used = result_data.get("credits_used", 0) - pages_processed = result_data.get("pages_processed", 0) - - # Prepare JSON output - json_output = { - "crawl_results": { - "pages_processed": pages_processed, - "credits_used": credits_used, - "cost_per_page": ( - credits_used / pages_processed if pages_processed > 0 else 0 - ), - "crawled_urls": crawled_urls, - "sitemap_enabled": True, - }, - "extracted_data": llm_result, - } - - # Print JSON output - print("📊 RESULTS IN JSON FORMAT:") - print("-" * 40) - print(json.dumps(json_output, indent=2, ensure_ascii=False)) - - # Print summary - print("\n" + "=" * 60) - print("📈 CRAWL SUMMARY:") - print("=" * 60) - print(f"✅ Pages processed: {pages_processed}") - print(f"💰 Credits used: {credits_used}") - print(f"🔗 URLs crawled: {len(crawled_urls)}") - print(f"🗺️ Sitemap enabled: Yes") - print(f"📊 Data extracted: {len(llm_result.get('listings', []))} listings found") - - except Exception as e: - print(f"❌ Sitemap-enabled crawl failed: {str(e)}") - - -def main(): - """Run the sitemap crawling example.""" - print("🌐 ScrapeGraphAI Crawler - Sitemap Example") - print("Comprehensive website crawling with sitemap discovery") - print("=" * 60) - - # Load environment variables from .env file - load_dotenv() - - # Check if API key is set - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("⚠️ 
Please set your API key in the environment variable SGAI_API_KEY") - print(" Option 1: Create a .env file with: SGAI_API_KEY=your_api_key_here") - print( - " Option 2: Set environment variable: export SGAI_API_KEY=your_api_key_here" - ) - print() - print(" You can get your API key from: https://dashboard.scrapegraphai.com") - return - - print(f"🔑 Using API key: {api_key[:10]}...") - print() - - # Run the sitemap crawling example - sitemap_crawling_example() - - print("\n" + "=" * 60) - print("🎉 Example completed!") - print("💡 This demonstrates sitemap-enabled crawling:") - print(" • Better page discovery using sitemap.xml") - print(" • More comprehensive website coverage") - print(" • Efficient crawling of structured websites") - print(" • Perfect for e-commerce, news sites, and content-heavy websites") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/crawl/sync/crawl_with_path_filtering_example.py b/scrapegraph-py/examples/crawl/sync/crawl_with_path_filtering_example.py deleted file mode 100644 index ba9530ed..00000000 --- a/scrapegraph-py/examples/crawl/sync/crawl_with_path_filtering_example.py +++ /dev/null @@ -1,111 +0,0 @@ -""" -Example of using the crawl endpoint with path filtering. - -This example demonstrates how to use include_paths and exclude_paths -to control which pages are crawled on a website. 
-""" -import os -from scrapegraph_py import Client -from pydantic import BaseModel, Field - - -# Define your output schema -class ProductInfo(BaseModel): - name: str = Field(description="Product name") - price: str = Field(description="Product price") - description: str = Field(description="Product description") - - -class CrawlResult(BaseModel): - products: list[ProductInfo] = Field(description="List of products found") - total_products: int = Field(description="Total number of products") - - -def main(): - # Initialize the client - sgai_api_key = os.getenv("SGAI_API_KEY") - client = Client(api_key=sgai_api_key) - - print("🔍 Starting crawl with path filtering...") - print("=" * 50) - - # Example 1: Include only specific paths - print("\n📝 Example 1: Crawl only /products/* pages") - print("-" * 50) - - result = client.crawl( - url="https://example.com", - prompt="Extract product information including name, price, and description", - data_schema=CrawlResult.model_json_schema(), - extraction_mode=True, - depth=2, - max_pages=10, - include_paths=["/products/*", "/items/*"], # Only crawl product pages - exclude_paths=["/products/archived/*"] # But skip archived products - ) - - print(f"Task ID: {result.get('task_id')}") - print("\n✅ Crawl job started successfully!") - - # Example 2: Exclude admin and API paths - print("\n📝 Example 2: Crawl all pages except admin and API") - print("-" * 50) - - result = client.crawl( - url="https://example.com", - prompt="Extract all relevant information", - data_schema=CrawlResult.model_json_schema(), - extraction_mode=True, - depth=2, - max_pages=20, - exclude_paths=[ - "/admin/*", # Skip all admin pages - "/api/*", # Skip all API endpoints - "/private/*", # Skip private pages - "/*.json" # Skip JSON files - ] - ) - - print(f"Task ID: {result.get('task_id')}") - print("\n✅ Crawl job started successfully!") - - # Example 3: Complex filtering with wildcards - print("\n📝 Example 3: Complex path filtering with wildcards") 
- print("-" * 50) - - result = client.crawl( - url="https://example.com", - prompt="Extract blog content and metadata", - data_schema=CrawlResult.model_json_schema(), - extraction_mode=True, - depth=3, - max_pages=15, - include_paths=[ - "/blog/**", # Include all blog pages (any depth) - "/articles/*", # Include top-level articles - "/news/2024/*" # Include 2024 news only - ], - exclude_paths=[ - "/blog/draft/*", # Skip draft blog posts - "/blog/*/comments" # Skip comment pages - ] - ) - - print(f"Task ID: {result.get('task_id')}") - print("\n✅ Crawl job started successfully!") - - print("\n" + "=" * 50) - print("📚 Path Filtering Guide:") - print("=" * 50) - print("• Use '/*' to match a single path segment") - print(" Example: '/products/*' matches '/products/item1' but not '/products/cat/item1'") - print("\n• Use '/**' to match any number of path segments") - print(" Example: '/blog/**' matches '/blog/2024/post' and '/blog/category/2024/post'") - print("\n• exclude_paths takes precedence over include_paths") - print(" You can include a broad pattern and exclude specific subsets") - print("\n• Paths must start with '/'") - print(" Example: '/products/*' is valid, 'products/*' is not") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/markdownify/async/async_markdownify_example.py b/scrapegraph-py/examples/markdownify/async/async_markdownify_example.py deleted file mode 100644 index 129a690d..00000000 --- a/scrapegraph-py/examples/markdownify/async/async_markdownify_example.py +++ /dev/null @@ -1,37 +0,0 @@ -import asyncio - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") - - -async def main(): - # Initialize async client - sgai_client = AsyncClient(api_key="your-api-key-here") - - # Concurrent markdownify requests - urls = [ - "https://scrapegraphai.com/", - "https://github.com/ScrapeGraphAI/Scrapegraph-ai", - ] - - tasks = 
[sgai_client.markdownify(website_url=url) for url in urls] - - # Execute requests concurrently - responses = await asyncio.gather(*tasks, return_exceptions=True) - - # Process results - for i, response in enumerate(responses): - if isinstance(response, Exception): - print(f"\nError for {urls[i]}: {response}") - else: - print(f"\nPage {i+1} Markdown:") - print(f"URL: {urls[i]}") - print(f"Result: {response['result']}") - - await sgai_client.close() - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/markdownify/sync/markdownify_example.py b/scrapegraph-py/examples/markdownify/sync/markdownify_example.py deleted file mode 100644 index 90d6bcb4..00000000 --- a/scrapegraph-py/examples/markdownify/sync/markdownify_example.py +++ /dev/null @@ -1,16 +0,0 @@ -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") - -# Initialize the client -sgai_client = Client(api_key="your-api-key-here") - -# Markdownify request -response = sgai_client.markdownify( - website_url="https://example.com", -) - -# Print the response -print(f"Request ID: {response['request_id']}") -print(f"Result: {response['result']}") diff --git a/scrapegraph-py/examples/markdownify/sync/markdownify_movements_example.py b/scrapegraph-py/examples/markdownify/sync/markdownify_movements_example.py deleted file mode 100644 index 8f8e9ba4..00000000 --- a/scrapegraph-py/examples/markdownify/sync/markdownify_movements_example.py +++ /dev/null @@ -1,314 +0,0 @@ -#!/usr/bin/env python3 -""" -Example demonstrating how to use the Markdownify API with enhanced features. - -This example shows how to: -1. Set up the API request for markdownify with custom headers -2. Make the API call to convert a website to markdown -3. Handle the response and save the markdown content -4. Display comprehensive results with statistics and timing - -Note: Unlike Smart Scraper, Markdownify doesn't support interactive movements/steps. 
-It focuses on converting websites to clean markdown format. - -Requirements: -- Python 3.7+ -- requests -- python-dotenv -- A .env file with your TEST_API_KEY - -Example .env file: -TEST_API_KEY=your_api_key_here -""" - -import os -import time - -import requests -from dotenv import load_dotenv - -# Load environment variables from .env file -load_dotenv() - - -def markdownify_movements(): - """ - Enhanced markdownify function with comprehensive features and timing. - - Note: Markdownify doesn't support interactive movements like Smart Scraper. - Instead, it excels at converting websites to clean markdown format. - """ - # Get API key from .env file - api_key = os.getenv("TEST_API_KEY") - if not api_key: - raise ValueError( - "API key must be provided or set in .env file as TEST_API_KEY. " - "Create a .env file with: TEST_API_KEY=your_api_key_here" - ) - steps = [ - "click on search bar", - "wait for 500ms", - "fill email input box with mdehsan873@gmail.com", - "wait a sec", - "click on the first time of search result", - "wait for 2 seconds to load the result of search", - ] - # Target website configuration - website_url = "https://scrapegraphai.com/" - - # Enhanced headers for better scraping (similar to interactive movements) - custom_headers = { - "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", - "Accept-Language": "en-US,en;q=0.5", - "Accept-Encoding": "gzip, deflate, br", - "Connection": "keep-alive", - "Upgrade-Insecure-Requests": "1", - } - - # Prepare API request headers - headers = { - "SGAI-APIKEY": api_key, - "Content-Type": "application/json", - } - - # Request body for markdownify - body = { - "website_url": website_url, - "headers": custom_headers, - "steps": steps, - } - - print("🚀 Starting Markdownify with Enhanced Features...") - print(f"🌐 Website URL: {website_url}") - 
print(f"📋 Custom Headers: {len(custom_headers)} headers configured") - print("🎯 Goal: Convert website to clean markdown format") - print("\n" + "=" * 60) - - # Start timer - start_time = time.time() - print( - f"⏱️ Timer started at: {time.strftime('%H:%M:%S', time.localtime(start_time))}" - ) - print("🔄 Processing markdown conversion...") - - try: - response = requests.post( - "http://localhost:8001/v1/markdownify", - json=body, - headers=headers, - ) - - # Calculate execution time - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Total execution time: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print( - f"📊 Performance: {execution_time:.1f}s ({execution_minutes:.1f}m) for markdown conversion" - ) - - if response.status_code == 200: - result = response.json() - markdown_content = result.get("result", "") - - print("✅ Request completed successfully!") - print(f"📊 Request ID: {result.get('request_id', 'N/A')}") - print(f"🔄 Status: {result.get('status', 'N/A')}") - print(f"📝 Content Length: {len(markdown_content)} characters") - - if result.get("error"): - print(f"❌ Error: {result['error']}") - else: - print("\n📋 MARKDOWN CONVERSION RESULTS:") - print("=" * 60) - - # Display markdown statistics - lines = markdown_content.split("\n") - words = len(markdown_content.split()) - - print("📊 Statistics:") - print(f" - Total Lines: {len(lines)}") - print(f" - Total Words: {words}") - print(f" - Total Characters: {len(markdown_content)}") - print( - f" - Processing Speed: {len(markdown_content)/execution_time:.0f} chars/second" - ) - - # Display first 500 characters - print("\n🔍 First 500 characters:") - print("-" * 50) - print(markdown_content[:500]) - if len(markdown_content) > 500: - print("...") - print("-" * 50) - - # Save to file - 
filename = f"markdownify_output_{int(time.time())}.md" - save_markdown_to_file(markdown_content, filename) - - # Display content analysis - analyze_markdown_content(markdown_content) - - else: - print(f"❌ Request failed with status code: {response.status_code}") - print(f"Response: {response.text}") - - except requests.exceptions.RequestException as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"🌐 Network error: {str(e)}") - except Exception as e: - end_time = time.time() - execution_time = end_time - start_time - execution_minutes = execution_time / 60 - print( - f"⏱️ Timer stopped at: {time.strftime('%H:%M:%S', time.localtime(end_time))}" - ) - print( - f"⚡ Execution time before error: {execution_time:.2f} seconds ({execution_minutes:.2f} minutes)" - ) - print(f"💥 Unexpected error: {str(e)}") - - -def save_markdown_to_file(markdown_content: str, filename: str): - """ - Save markdown content to a file with enhanced error handling. - - Args: - markdown_content: The markdown content to save - filename: The name of the file to save to - """ - try: - with open(filename, "w", encoding="utf-8") as f: - f.write(markdown_content) - print(f"💾 Markdown saved to: {filename}") - except Exception as e: - print(f"❌ Error saving file: {str(e)}") - - -def analyze_markdown_content(markdown_content: str): - """ - Analyze the markdown content and provide insights. 
- - Args: - markdown_content: The markdown content to analyze - """ - print("\n🔍 CONTENT ANALYSIS:") - print("-" * 50) - - # Count different markdown elements - lines = markdown_content.split("\n") - headers = [line for line in lines if line.strip().startswith("#")] - links = [line for line in lines if "[" in line and "](" in line] - code_blocks = markdown_content.count("```") - - print(f"📑 Headers found: {len(headers)}") - print(f"🔗 Links found: {len(links)}") - print( - f"💻 Code blocks: {code_blocks // 2}" - ) # Divide by 2 since each block has opening and closing - - # Show first few headers if they exist - if headers: - print("\n📋 First few headers:") - for i, header in enumerate(headers[:3]): - print(f" {i+1}. {header.strip()}") - if len(headers) > 3: - print(f" ... and {len(headers) - 3} more") - - -def show_curl_equivalent(): - """Show the equivalent curl command for reference""" - - # Load environment variables from .env file - load_dotenv() - - api_key = os.getenv("TEST_API_KEY", "your-api-key-here") - curl_command = f""" -curl --location 'http://localhost:8001/v1/markdownify' \\ ---header 'SGAI-APIKEY: {api_key}' \\ ---header 'Content-Type: application/json' \\ ---data '{{ - "website_url": "https://scrapegraphai.com/", - "headers": {{ - "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", - "Accept-Language": "en-US,en;q=0.5", - "Accept-Encoding": "gzip, deflate, br", - "Connection": "keep-alive", - "Upgrade-Insecure-Requests": "1" - }}, - "steps": [ - "click on search bar", - "wait for 500ms", - "fill email input box with mdehsan873@gmail.com", - "wait a sec", - "click on the first time of search result", - "wait for 2 seconds to load the result of search" - ] -}}' - """ - - print("Equivalent curl command:") - print(curl_command) - - -def main(): - """ - Main function to 
run the markdownify movements example with timing. - """ - try: - print("🎯 MARKDOWNIFY MOVEMENTS EXAMPLE") - print("=" * 60) - print("Note: Markdownify converts websites to clean markdown format") - print("Unlike Smart Scraper, it doesn't support interactive movements") - print("but excels at creating readable markdown content.") - print("This example includes comprehensive timing information.") - print() - - # Show the curl equivalent - show_curl_equivalent() - - print("\n" + "=" * 60) - - # Make the actual API request - markdownify_movements() - - print("\n" + "=" * 60) - print("Example completed!") - print("\nKey takeaways:") - print("1. Markdownify excels at converting websites to clean markdown") - print("2. Custom headers can improve scraping success") - print("3. Content analysis provides valuable insights") - print("4. File saving enables content persistence") - print("\nNext steps:") - print("- Try different websites and content types") - print("- Customize headers for specific websites") - print("- Implement content filtering and processing") - print("- Use the saved markdown files for further analysis") - - except Exception as e: - print(f"💥 Error occurred: {str(e)}") - print("\n🛠️ Troubleshooting:") - print("1. Make sure your .env file contains TEST_API_KEY") - print("2. Ensure the API server is running on localhost:8001") - print("3. Check your internet connection") - print("4. 
Verify the target website is accessible") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/scheduled_jobs/async/async_scheduled_jobs_advanced_example.py b/scrapegraph-py/examples/scheduled_jobs/async/async_scheduled_jobs_advanced_example.py deleted file mode 100644 index 9c46b993..00000000 --- a/scrapegraph-py/examples/scheduled_jobs/async/async_scheduled_jobs_advanced_example.py +++ /dev/null @@ -1,369 +0,0 @@ -import asyncio -import os -from datetime import datetime, timedelta -from typing import Dict, Any, List - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -class ScheduledJobManager: - """Advanced scheduled job manager with monitoring and automation""" - - def __init__(self, client: AsyncClient): - self.client = client - self.active_jobs: Dict[str, Dict[str, Any]] = {} - - async def create_monitoring_job(self, website_url: str, job_name: str, cron_expression: str) -> str: - """Create a job that monitors website changes""" - print(f"📅 Creating monitoring job for {website_url}...") - - job_config = { - "website_url": website_url, - "user_prompt": "Monitor for any changes in content, new articles, or updates. 
Extract the latest information.", - "render_heavy_js": True, - "headers": { - "User-Agent": "Mozilla/5.0 (compatible; MonitoringBot/1.0)" - } - } - - result = await self.client.create_scheduled_job( - job_name=job_name, - service_type="smartscraper", - cron_expression=cron_expression, - job_config=job_config, - is_active=True - ) - - job_id = result["id"] - self.active_jobs[job_id] = { - "name": job_name, - "url": website_url, - "type": "monitoring", - "created_at": datetime.now() - } - - print(f"✅ Created monitoring job with ID: {job_id}") - return job_id - - async def create_data_collection_job(self, search_prompt: str, job_name: str, cron_expression: str) -> str: - """Create a job that collects data from multiple sources""" - print(f"📅 Creating data collection job: {search_prompt}...") - - job_config = { - "user_prompt": search_prompt, - "num_results": 10, - "headers": { - "User-Agent": "Mozilla/5.0 (compatible; DataCollector/1.0)" - } - } - - result = await self.client.create_scheduled_job( - job_name=job_name, - service_type="searchscraper", - cron_expression=cron_expression, - job_config=job_config, - is_active=True - ) - - job_id = result["id"] - self.active_jobs[job_id] = { - "name": job_name, - "prompt": search_prompt, - "type": "data_collection", - "created_at": datetime.now() - } - - print(f"✅ Created data collection job with ID: {job_id}") - return job_id - - async def create_crawl_job(self, base_url: str, job_name: str, cron_expression: str) -> str: - """Create a job that crawls websites for comprehensive data""" - print(f"📅 Creating crawl job for {base_url}...") - - job_config = { - "url": base_url, - "prompt": "Extract all relevant information including titles, descriptions, links, and metadata", - "extraction_mode": True, - "depth": 3, - "max_pages": 50, - "same_domain_only": True, - "cache_website": True, - "sitemap": True - } - - result = await self.client.create_scheduled_job( - job_name=job_name, - service_type="crawl", - 
cron_expression=cron_expression, - job_config=job_config, - is_active=True - ) - - job_id = result["id"] - self.active_jobs[job_id] = { - "name": job_name, - "url": base_url, - "type": "crawl", - "created_at": datetime.now() - } - - print(f"โœ… Created crawl job with ID: {job_id}") - return job_id - - async def monitor_job_executions(self, job_id: str, duration_minutes: int = 5): - """Monitor job executions for a specified duration""" - print(f"๐Ÿ“Š Monitoring executions for job {job_id} for {duration_minutes} minutes...") - - start_time = datetime.now() - end_time = start_time + timedelta(minutes=duration_minutes) - - while datetime.now() < end_time: - try: - executions = await self.client.get_job_executions(job_id, page=1, page_size=10) - - if executions["executions"]: - latest_execution = executions["executions"][0] - print(f" Latest execution: {latest_execution['status']} at {latest_execution['started_at']}") - - if latest_execution.get('completed_at'): - print(f" Completed at: {latest_execution['completed_at']}") - if latest_execution.get('credits_used'): - print(f" Credits used: {latest_execution['credits_used']}") - - await asyncio.sleep(30) # Check every 30 seconds - - except Exception as e: - print(f" Error monitoring job {job_id}: {e}") - await asyncio.sleep(30) - - async def batch_trigger_jobs(self, job_ids: List[str]): - """Trigger multiple jobs concurrently""" - print(f"๐Ÿš€ Triggering {len(job_ids)} jobs concurrently...") - - tasks = [self.client.trigger_scheduled_job(job_id) for job_id in job_ids] - results = await asyncio.gather(*tasks, return_exceptions=True) - - for i, result in enumerate(results): - if isinstance(result, Exception): - print(f" โŒ Failed to trigger job {job_ids[i]}: {result}") - else: - print(f" โœ… Triggered job {job_ids[i]}: {result['execution_id']}") - - async def get_job_statistics(self) -> Dict[str, Any]: - """Get comprehensive statistics about all jobs""" - print("๐Ÿ“ˆ Collecting job statistics...") - - all_jobs = await 
self.client.get_scheduled_jobs(page=1, page_size=100) - - stats = { - "total_jobs": all_jobs["total"], - "active_jobs": 0, - "inactive_jobs": 0, - "service_types": {}, - "recent_executions": 0, - "total_credits_used": 0 - } - - for job in all_jobs["jobs"]: - if job["is_active"]: - stats["active_jobs"] += 1 - else: - stats["inactive_jobs"] += 1 - - service_type = job["service_type"] - stats["service_types"][service_type] = stats["service_types"].get(service_type, 0) + 1 - - # Get execution history for each job - try: - executions = await self.client.get_job_executions(job["id"], page=1, page_size=5) - stats["recent_executions"] += len(executions["executions"]) - - for execution in executions["executions"]: - if execution.get("credits_used"): - stats["total_credits_used"] += execution["credits_used"] - - except Exception as e: - print(f" โš ๏ธ Could not get executions for job {job['id']}: {e}") - - return stats - - async def cleanup_old_jobs(self, days_old: int = 7): - """Clean up jobs older than specified days""" - print(f"๐Ÿงน Cleaning up jobs older than {days_old} days...") - - cutoff_date = datetime.now() - timedelta(days=days_old) - jobs_to_delete = [] - - all_jobs = await self.client.get_scheduled_jobs(page=1, page_size=100) - - for job in all_jobs["jobs"]: - created_at = datetime.fromisoformat(job["created_at"].replace('Z', '+00:00')) - if created_at < cutoff_date: - jobs_to_delete.append(job["id"]) - - if jobs_to_delete: - print(f" Found {len(jobs_to_delete)} jobs to delete") - - for job_id in jobs_to_delete: - try: - await self.client.delete_scheduled_job(job_id) - print(f" โœ… Deleted job {job_id}") - except Exception as e: - print(f" โŒ Failed to delete job {job_id}: {e}") - else: - print(" No old jobs found to delete") - - async def export_job_configurations(self) -> List[Dict[str, Any]]: - """Export all job configurations for backup""" - print("๐Ÿ’พ Exporting job configurations...") - - all_jobs = await self.client.get_scheduled_jobs(page=1, 
page_size=100) - configurations = [] - - for job in all_jobs["jobs"]: - config = { - "job_name": job["job_name"], - "service_type": job["service_type"], - "cron_expression": job["cron_expression"], - "job_config": job["job_config"], - "is_active": job["is_active"], - "created_at": job["created_at"] - } - configurations.append(config) - - print(f" Exported {len(configurations)} job configurations") - return configurations - - -async def main(): - """Main function demonstrating advanced scheduled jobs management""" - # Initialize async client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - async with AsyncClient(api_key=api_key) as client: - print("๐Ÿš€ Starting Advanced Scheduled Jobs Demo") - print("=" * 60) - - manager = ScheduledJobManager(client) - job_ids = [] - - try: - # Create different types of advanced jobs - print("\n๐Ÿ“… Creating Advanced Scheduled Jobs:") - print("-" * 40) - - # News monitoring job - news_job_id = await manager.create_monitoring_job( - website_url="https://techcrunch.com", - job_name="TechCrunch News Monitor", - cron_expression="0 */2 * * *" # Every 2 hours - ) - job_ids.append(news_job_id) - - # AI research job - ai_job_id = await manager.create_data_collection_job( - search_prompt="Latest developments in artificial intelligence and machine learning", - job_name="AI Research Collector", - cron_expression="0 8 * * 1" # Every Monday at 8 AM - ) - job_ids.append(ai_job_id) - - # E-commerce crawl job - ecommerce_job_id = await manager.create_crawl_job( - base_url="https://example-store.com", - job_name="E-commerce Product Crawler", - cron_expression="0 3 * * *" # Daily at 3 AM - ) - job_ids.append(ecommerce_job_id) - - # Get comprehensive 
statistics - print("\n๐Ÿ“ˆ Job Statistics:") - print("-" * 40) - stats = await manager.get_job_statistics() - print(f"Total jobs: {stats['total_jobs']}") - print(f"Active jobs: {stats['active_jobs']}") - print(f"Inactive jobs: {stats['inactive_jobs']}") - print(f"Service types: {stats['service_types']}") - print(f"Recent executions: {stats['recent_executions']}") - print(f"Total credits used: {stats['total_credits_used']}") - - # Trigger jobs concurrently - print("\n๐Ÿš€ Concurrent Job Triggering:") - print("-" * 40) - await manager.batch_trigger_jobs(job_ids) - - # Monitor executions - print("\n๐Ÿ“Š Monitoring Job Executions:") - print("-" * 40) - if job_ids: - await manager.monitor_job_executions(job_ids[0], duration_minutes=2) - - # Export configurations - print("\n๐Ÿ’พ Exporting Job Configurations:") - print("-" * 40) - configurations = await manager.export_job_configurations() - print(f"Exported {len(configurations)} configurations") - - # Demonstrate job management - print("\n๐Ÿ”ง Advanced Job Management:") - print("-" * 40) - - # Update job configurations - if job_ids: - print(f"Updating job {job_ids[0]}:") - await client.update_scheduled_job( - job_ids[0], - job_name="Updated TechCrunch Monitor", - cron_expression="0 */1 * * *" # Every hour - ) - print(" โœ… Job updated successfully") - - # Pause and resume - print(f"Pausing job {job_ids[0]}:") - await client.pause_scheduled_job(job_ids[0]) - print(" โœ… Job paused") - - await asyncio.sleep(1) - - print(f"Resuming job {job_ids[0]}:") - await client.resume_scheduled_job(job_ids[0]) - print(" โœ… Job resumed") - - # Cleanup demonstration (commented out to avoid deleting real jobs) - # print("\n๐Ÿงน Cleanup Demonstration:") - # print("-" * 40) - # await manager.cleanup_old_jobs(days_old=1) - - except Exception as e: - print(f"โŒ Error during execution: {e}") - - finally: - # Clean up created jobs - print("\n๐Ÿงน Cleaning up created jobs:") - print("-" * 40) - for job_id in job_ids: - try: - await 
client.delete_scheduled_job(job_id) - print(f" โœ… Deleted job {job_id}") - except Exception as e: - print(f" โŒ Failed to delete job {job_id}: {e}") - - print("\nโœ… Advanced Scheduled Jobs Demo completed!") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/scheduled_jobs/async/async_scheduled_jobs_example.py b/scrapegraph-py/examples/scheduled_jobs/async/async_scheduled_jobs_example.py deleted file mode 100644 index b9ec835b..00000000 --- a/scrapegraph-py/examples/scheduled_jobs/async/async_scheduled_jobs_example.py +++ /dev/null @@ -1,219 +0,0 @@ -import asyncio -import os -from datetime import datetime -from typing import Dict, Any - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -async def create_smartscraper_job(client: AsyncClient) -> str: - """Create a scheduled job for smartscraper""" - print("๐Ÿ“… Creating SmartScraper scheduled job...") - - job_config = { - "website_url": "https://news.ycombinator.com", - "user_prompt": "Extract the top 5 news titles and their URLs", - "render_heavy_js": False, - "headers": { - "User-Agent": "Mozilla/5.0 (compatible; ScheduledJob/1.0)" - } - } - - result = await client.create_scheduled_job( - job_name="HN Top News Scraper", - service_type="smartscraper", - cron_expression="0 */6 * * *", # Every 6 hours - job_config=job_config, - is_active=True - ) - - job_id = result["id"] - print(f"โœ… Created SmartScraper job with ID: {job_id}") - return job_id - - -async def create_searchscraper_job(client: AsyncClient) -> str: - """Create a scheduled job for searchscraper""" - print("๐Ÿ“… Creating SearchScraper scheduled job...") - - job_config = { - "user_prompt": "Find the latest AI and machine learning news", - "num_results": 5, - "headers": { - "User-Agent": "Mozilla/5.0 (compatible; ScheduledJob/1.0)" - } - } 
- - result = await client.create_scheduled_job( - job_name="AI News Search", - service_type="searchscraper", - cron_expression="0 9 * * 1", # Every Monday at 9 AM - job_config=job_config, - is_active=True - ) - - job_id = result["id"] - print(f"โœ… Created SearchScraper job with ID: {job_id}") - return job_id - - -async def create_crawl_job(client: AsyncClient) -> str: - """Create a scheduled job for crawl""" - print("๐Ÿ“… Creating Crawl scheduled job...") - - job_config = { - "url": "https://example.com", - "prompt": "Extract all product information", - "extraction_mode": True, - "depth": 2, - "max_pages": 10, - "same_domain_only": True, - "cache_website": True - } - - result = await client.create_scheduled_job( - job_name="Product Catalog Crawler", - service_type="crawl", - cron_expression="0 2 * * *", # Daily at 2 AM - job_config=job_config, - is_active=True - ) - - job_id = result["id"] - print(f"โœ… Created Crawl job with ID: {job_id}") - return job_id - - -async def manage_jobs(client: AsyncClient, job_ids: list[str]): - """Demonstrate job management operations""" - print("\n๐Ÿ”ง Managing scheduled jobs...") - - # List all jobs - print("\n๐Ÿ“‹ Listing all scheduled jobs:") - jobs_result = await client.get_scheduled_jobs(page=1, page_size=10) - print(f"Total jobs: {jobs_result['total']}") - - for job in jobs_result["jobs"]: - print(f" - {job['job_name']} ({job['service_type']}) - Active: {job['is_active']}") - - # Get details of first job - if job_ids: - print(f"\n๐Ÿ” Getting details for job {job_ids[0]}:") - job_details = await client.get_scheduled_job(job_ids[0]) - print(f" Name: {job_details['job_name']}") - print(f" Cron: {job_details['cron_expression']}") - print(f" Next run: {job_details.get('next_run_at', 'N/A')}") - - # Pause the first job - print(f"\nโธ๏ธ Pausing job {job_ids[0]}:") - pause_result = await client.pause_scheduled_job(job_ids[0]) - print(f" Status: {pause_result['message']}") - - # Resume the job - print(f"\nโ–ถ๏ธ Resuming job 
{job_ids[0]}:") - resume_result = await client.resume_scheduled_job(job_ids[0]) - print(f" Status: {resume_result['message']}") - - # Update job configuration - print(f"\n๐Ÿ“ Updating job {job_ids[0]}:") - update_result = await client.update_scheduled_job( - job_ids[0], - job_name="Updated HN News Scraper", - cron_expression="0 */4 * * *" # Every 4 hours instead of 6 - ) - print(f" Updated job name: {update_result['job_name']}") - print(f" Updated cron: {update_result['cron_expression']}") - - -async def trigger_and_monitor_jobs(client: AsyncClient, job_ids: list[str]): - """Demonstrate manual job triggering and execution monitoring""" - print("\n๐Ÿš€ Triggering and monitoring jobs...") - - for job_id in job_ids: - print(f"\n๐ŸŽฏ Manually triggering job {job_id}:") - trigger_result = await client.trigger_scheduled_job(job_id) - execution_id = trigger_result["execution_id"] - print(f" Execution ID: {execution_id}") - print(f" Message: {trigger_result['message']}") - - # Wait a bit for execution to start - await asyncio.sleep(2) - - # Get execution history - print(f"\n๐Ÿ“Š Getting execution history for job {job_id}:") - executions = await client.get_job_executions(job_id, page=1, page_size=5) - print(f" Total executions: {executions['total']}") - - for execution in executions["executions"][:3]: # Show last 3 executions - print(f" - Execution {execution['id']}: {execution['status']}") - print(f" Started: {execution['started_at']}") - if execution.get('completed_at'): - print(f" Completed: {execution['completed_at']}") - if execution.get('credits_used'): - print(f" Credits used: {execution['credits_used']}") - - -async def cleanup_jobs(client: AsyncClient, job_ids: list[str]): - """Clean up created jobs""" - print("\n๐Ÿงน Cleaning up created jobs...") - - for job_id in job_ids: - print(f"๐Ÿ—‘๏ธ Deleting job {job_id}:") - delete_result = await client.delete_scheduled_job(job_id) - print(f" Status: {delete_result['message']}") - - -async def main(): - """Main function 
demonstrating async scheduled jobs""" - # Initialize async client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - async with AsyncClient(api_key=api_key) as client: - print("๐Ÿš€ Starting Async Scheduled Jobs Demo") - print("=" * 50) - - job_ids = [] - - try: - # Create different types of scheduled jobs - smartscraper_job_id = await create_smartscraper_job(client) - job_ids.append(smartscraper_job_id) - - searchscraper_job_id = await create_searchscraper_job(client) - job_ids.append(searchscraper_job_id) - - crawl_job_id = await create_crawl_job(client) - job_ids.append(crawl_job_id) - - # Manage jobs - await manage_jobs(client, job_ids) - - # Trigger and monitor jobs - await trigger_and_monitor_jobs(client, job_ids) - - except Exception as e: - print(f"โŒ Error during execution: {e}") - - finally: - # Clean up - await cleanup_jobs(client, job_ids) - - print("\nโœ… Async Scheduled Jobs Demo completed!") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/scheduled_jobs/sync/scheduled_jobs_example.py b/scrapegraph-py/examples/scheduled_jobs/sync/scheduled_jobs_example.py deleted file mode 100644 index e0e08831..00000000 --- a/scrapegraph-py/examples/scheduled_jobs/sync/scheduled_jobs_example.py +++ /dev/null @@ -1,164 +0,0 @@ -#!/usr/bin/env python3 -"""Scheduled Jobs Example - Sync Client""" - -import os -from scrapegraph_py import Client -from scrapegraph_py.models.scheduled_jobs import ServiceType - -def main(): - client = Client.from_env() - - print("๐Ÿš€ ScrapeGraph AI Scheduled Jobs Example") - print("=" * 50) - - try: - print("\n๐Ÿ“… Creating a scheduled SmartScraper job...") - - smartscraper_config = { - "website_url": 
"https://example.com", - "user_prompt": "Extract the main heading and description from the page" - } - - job = client.create_scheduled_job( - job_name="Daily Example Scraping", - service_type=ServiceType.SMARTSCRAPER, - cron_expression="0 9 * * *", - job_config=smartscraper_config, - is_active=True - ) - - job_id = job["id"] - print(f"โœ… Created job: {job['job_name']} (ID: {job_id})") - print(f" Next run: {job.get('next_run_at', 'Not scheduled')}") - - print("\n๐Ÿ“… Creating a scheduled SearchScraper job...") - - searchscraper_config = { - "user_prompt": "Find the latest news about artificial intelligence", - "num_results": 5 - } - - search_job = client.create_scheduled_job( - job_name="Weekly AI News Search", - service_type=ServiceType.SEARCHSCRAPER, - cron_expression="0 10 * * 1", - job_config=searchscraper_config, - is_active=True - ) - - search_job_id = search_job["id"] - print(f"โœ… Created job: {search_job['job_name']} (ID: {search_job_id})") - - print("\n๐Ÿ“‹ Listing all scheduled jobs...") - - jobs_response = client.get_scheduled_jobs(page=1, page_size=10) - jobs = jobs_response["jobs"] - - print(f"Found {jobs_response['total']} total jobs:") - for job in jobs: - status = "๐ŸŸข Active" if job["is_active"] else "๐Ÿ”ด Inactive" - print(f" - {job['job_name']} ({job['service_type']}) - {status}") - print(f" Schedule: {job['cron_expression']}") - if job.get('next_run_at'): - print(f" Next run: {job['next_run_at']}") - - print(f"\n๐Ÿ” Getting details for job {job_id}...") - - job_details = client.get_scheduled_job(job_id) - print(f"Job Name: {job_details['job_name']}") - print(f"Service Type: {job_details['service_type']}") - print(f"Created: {job_details['created_at']}") - print(f"Active: {job_details['is_active']}") - - print(f"\n๐Ÿ“ Updating job schedule...") - - updated_job = client.update_scheduled_job( - job_id=job_id, - cron_expression="0 8 * * *", - job_name="Daily Example Scraping (Updated)" - ) - - print(f"โœ… Updated job: {updated_job['job_name']}") 
- print(f" New schedule: {updated_job['cron_expression']}") - - print(f"\nโธ๏ธ Pausing job {job_id}...") - - pause_result = client.pause_scheduled_job(job_id) - print(f"โœ… {pause_result['message']}") - print(f" Job is now: {'Active' if pause_result['is_active'] else 'Paused'}") - - print(f"\nโ–ถ๏ธ Resuming job {job_id}...") - - resume_result = client.resume_scheduled_job(job_id) - print(f"โœ… {resume_result['message']}") - print(f" Job is now: {'Active' if resume_result['is_active'] else 'Paused'}") - if resume_result.get('next_run_at'): - print(f" Next run: {resume_result['next_run_at']}") - - print(f"\n๐Ÿš€ Manually triggering job {job_id}...") - - trigger_result = client.trigger_scheduled_job(job_id) - print(f"โœ… {trigger_result['message']}") - print(f" Execution ID: {trigger_result['execution_id']}") - print(f" Triggered at: {trigger_result['triggered_at']}") - - print(f"\n๐Ÿ“Š Getting execution history for job {job_id}...") - - executions_response = client.get_job_executions( - job_id=job_id, - page=1, - page_size=5 - ) - - executions = executions_response["executions"] - print(f"Found {executions_response['total']} total executions:") - - for execution in executions: - status_emoji = { - "completed": "โœ…", - "failed": "โŒ", - "running": "๐Ÿ”„", - "pending": "โณ" - }.get(execution["status"], "โ“") - - print(f" {status_emoji} {execution['status'].upper()}") - print(f" Started: {execution['started_at']}") - if execution.get('completed_at'): - print(f" Completed: {execution['completed_at']}") - if execution.get('credits_used'): - print(f" Credits used: {execution['credits_used']}") - - print(f"\n๐Ÿ”ง Filtering jobs by service type (smartscraper)...") - - filtered_jobs = client.get_scheduled_jobs( - service_type=ServiceType.SMARTSCRAPER, - is_active=True - ) - - print(f"Found {filtered_jobs['total']} active SmartScraper jobs:") - for job in filtered_jobs["jobs"]: - print(f" - {job['job_name']} (Schedule: {job['cron_expression']})") - - print(f"\n๐Ÿ—‘๏ธ 
Cleaning up - deleting created jobs...") - - delete_result1 = client.delete_scheduled_job(job_id) - print(f"โœ… {delete_result1['message']} (Job 1)") - - delete_result2 = client.delete_scheduled_job(search_job_id) - print(f"โœ… {delete_result2['message']} (Job 2)") - - print("\n๐ŸŽ‰ Scheduled jobs example completed successfully!") - - except Exception as e: - print(f"\nโŒ Error: {str(e)}") - raise - - finally: - client.close() - - -if __name__ == "__main__": - if os.getenv("SGAI_MOCK", "0").lower() in ["1", "true", "yes"]: - print("๐Ÿงช Running in MOCK mode - no real API calls will be made") - - main() \ No newline at end of file diff --git a/scrapegraph-py/examples/scrape/async/async_scrape_example.py b/scrapegraph-py/examples/scrape/async/async_scrape_example.py deleted file mode 100644 index 11ab1f54..00000000 --- a/scrapegraph-py/examples/scrape/async/async_scrape_example.py +++ /dev/null @@ -1,331 +0,0 @@ -""" -Basic asynchronous example demonstrating how to use the Scrape API. - -This example shows: -1. How to make async scrape requests -2. How to process multiple URLs concurrently -3. How to use render_heavy_js for JavaScript-heavy websites -4. How to use branding parameter -5. 
How to add custom headers in async mode - -Equivalent curl command: -curl -X POST https://api.scrapegraphai.com/v1/scrape \ - -H "Content-Type: application/json" \ - -H "SGAI-APIKEY: your-api-key-here" \ - -d '{ - "website_url": "https://www.cubic.dev/", - "render_heavy_js": false, - "branding": true - }' - -Requirements: -- Python 3.7+ -- scrapegraph-py -- python-dotenv -- aiohttp -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import asyncio -import time -from pathlib import Path -from typing import List, Dict, Any -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient - -# Load environment variables from .env file -load_dotenv() - - -async def basic_async_scrape(): - """Demonstrate basic async scrape functionality.""" - print("๐ŸŒ Basic Async Scrape Example") - print("=" * 35) - - async with AsyncClient.from_env() as client: - try: - print("Making basic async scrape request...") - result = await client.scrape( - website_url="https://example.com", - render_heavy_js=False - ) - - html_content = result.get("html", "") - print(f"โœ… Success! Received {len(html_content):,} characters of HTML") - print(f"Request ID: {result.get('request_id', 'N/A')}") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - - -async def async_scrape_with_heavy_js(): - """Demonstrate async scraping with heavy JavaScript rendering.""" - print("\n๐Ÿš€ Async Heavy JavaScript Rendering Example") - print("=" * 50) - - async with AsyncClient.from_env() as client: - try: - print("Making async scrape request with heavy JS rendering...") - start_time = time.time() - - result = await client.scrape( - website_url="https://example.com", - render_heavy_js=True - ) - - execution_time = time.time() - start_time - html_content = result.get("html", "") - - print(f"โœ… Success! 
Received {len(html_content):,} characters of HTML") - print(f"⏱️ Execution time: {execution_time:.2f} seconds") - print(f"Request ID: {result.get('request_id', 'N/A')}") - - return result - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - - -async def scrape_single_url(client: AsyncClient, url: str, use_js: bool = False) -> Dict[str, Any]: - """Scrape a single URL with error handling.""" - try: - result = await client.scrape( - website_url=url, - render_heavy_js=use_js - ) - - html_content = result.get("html", "") - return { - "url": url, - "success": True, - "html_length": len(html_content), - "request_id": result.get("request_id"), - "result": result - } - - except Exception as e: - return { - "url": url, - "success": False, - "error": str(e), - "html_length": 0 - } - - -async def concurrent_scraping_example(): - """Demonstrate scraping multiple URLs concurrently.""" - print("\n⚡ Concurrent Scraping Example") - print("=" * 35) - - # URLs to scrape concurrently - urls = [ - "https://example.com", - "https://httpbin.org/html", - "https://httpbin.org/json" - ] - - async with AsyncClient.from_env() as client: - print(f"Scraping {len(urls)} URLs concurrently...") - start_time = time.time() - - # Create tasks for concurrent execution - tasks = [scrape_single_url(client, url) for url in urls] - results = await asyncio.gather(*tasks, return_exceptions=True) - - total_time = time.time() - start_time - - # Process results - successful = 0 - total_html_length = 0 - - for result in results: - if isinstance(result, Exception): - print(f"❌ Exception: {result}") - continue - - if result["success"]: - successful += 1 - total_html_length += result["html_length"] - print(f"✅ {result['url']}: {result['html_length']:,} chars") - else: - print(f"❌ {result['url']}: {result['error']}") - - print(f"\n📊 Results:") - print(f" Total time: {total_time:.2f} seconds") - print(f" Successful: {successful}/{len(urls)}") - print(f" Total HTML: 
{total_html_length:,} characters") - print(f" Average per URL: {total_time/len(urls):.2f} seconds") - - return results - - -async def async_scrape_with_branding(): - """Demonstrate async scraping with branding enabled.""" - print("\n🏷️ Async Branding Example") - print("=" * 30) - - async with AsyncClient.from_env() as client: - try: - print("Making async scrape request with branding enabled...") - result = await client.scrape( - website_url="https://www.cubic.dev/", - render_heavy_js=False, - branding=True - ) - - html_content = result.get("html", "") - print(f"✅ Success! Received {len(html_content):,} characters of HTML") - print(f"Request ID: {result.get('request_id', 'N/A')}") - - return result - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - - -async def async_scrape_with_custom_headers(): - """Demonstrate async scraping with custom headers.""" - print("\n🔧 Async Custom Headers Example") - print("=" * 35) - - # Custom headers - custom_headers = { - "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", - "Accept-Language": "en-US,en;q=0.5", - "Connection": "keep-alive" - } - - async with AsyncClient.from_env() as client: - try: - print("Making async scrape request with custom headers...") - result = await client.scrape( - website_url="https://httpbin.org/headers", - render_heavy_js=False, - headers=custom_headers - ) - - html_content = result.get("html", "") - print(f"✅ Success! 
Received {len(html_content):,} characters of HTML") - print(f"Request ID: {result.get('request_id', 'N/A')}") - - return result - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - - -async def save_html_to_file_async(html_content: str, filename: str): - """Save HTML content to a file asynchronously.""" - output_dir = Path("async_scrape_output") - output_dir.mkdir(exist_ok=True) - - file_path = output_dir / f"{filename}.html" - - # Use asyncio.to_thread for file I/O - await asyncio.to_thread( - lambda: file_path.write_text(html_content, encoding="utf-8") - ) - - print(f"๐Ÿ’พ HTML saved to: {file_path}") - return file_path - - -def demonstrate_curl_equivalent(): - """Show the equivalent curl commands.""" - print("๐ŸŒ Equivalent curl commands:") - print("=" * 35) - - print("1. Basic async scrape (same as sync):") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": false") - print(" }'") - - print("\n2. With branding enabled:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://www.cubic.dev/\",") - print(" \"render_heavy_js\": false,") - print(" \"branding\": true") - print(" }'") - - print("\n3. 
Multiple concurrent requests:") - print("# Run multiple curl commands in parallel:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{\"website_url\": \"https://example.com\"}' &") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{\"website_url\": \"https://httpbin.org/html\"}' &") - print("wait # Wait for all background jobs to complete") - - -async def main(): - """Main async function demonstrating scrape functionality.""" - print("🚀 Async Scrape API Examples") - print("=" * 30) - - # Show curl equivalents first - demonstrate_curl_equivalent() - - try: - # Run async examples - result1 = await basic_async_scrape() - result2 = await async_scrape_with_heavy_js() - result3 = await async_scrape_with_branding() - result4 = await async_scrape_with_custom_headers() - concurrent_results = await concurrent_scraping_example() - - # Save results if successful - if result1: - html1 = result1.get("html", "") - if html1: - await save_html_to_file_async(html1, "basic_async_scrape") - - if result3: - html3 = result3.get("html", "") - if html3: - await save_html_to_file_async(html3, "branding_async_scrape") - - if result4: - html4 = result4.get("html", "") - if html4: - await save_html_to_file_async(html4, "custom_headers_async_scrape") - - print("\n🎯 Summary:") - print(f"✅ Basic async scrape: {'Success' if result1 else 'Failed'}") - print(f"✅ Heavy JS async scrape: {'Success' if result2 else 'Failed'}") - print(f"✅ Branding async scrape: {'Success' if result3 else 'Failed'}") - print(f"✅ Custom headers async scrape: {'Success' if result4 else 'Failed'}") - print(f"✅ Concurrent scraping: {'Success' if concurrent_results else 'Failed'}") - - except Exception as e: - print(f"❌ Unexpected 
error: {str(e)}") - - print("\n๐Ÿ“š Next steps:") - print("โ€ข Try running multiple curl commands in parallel") - print("โ€ข Experiment with different concurrency levels") - print("โ€ข Test with your own list of URLs") - print("โ€ข Compare async vs sync performance for multiple URLs") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/scrape/sync/scrape_example.py b/scrapegraph-py/examples/scrape/sync/scrape_example.py deleted file mode 100644 index 4b3e4c42..00000000 --- a/scrapegraph-py/examples/scrape/sync/scrape_example.py +++ /dev/null @@ -1,272 +0,0 @@ -""" -Basic synchronous example demonstrating how to use the Scrape API. - -This example shows: -1. How to make a basic scrape request -2. How to use render_heavy_js for JavaScript-heavy websites -3. How to use branding parameter -4. How to add custom headers -5. How to handle the response - -Equivalent curl command: -curl -X POST https://api.scrapegraphai.com/v1/scrape \ - -H "Content-Type: application/json" \ - -H "SGAI-APIKEY: your-api-key-here" \ - -d '{ - "website_url": "https://example.com", - "render_heavy_js": false, - "branding": true - }' - -Requirements: -- Python 3.7+ -- scrapegraph-py -- python-dotenv -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import time -from pathlib import Path -from dotenv import load_dotenv - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -def basic_scrape_example(): - """Demonstrate basic scrape functionality.""" - print("๐ŸŒ Basic Scrape Example") - print("=" * 30) - - # Initialize client - client = Client.from_env() - - try: - # Basic scrape request - print("Making basic scrape request...") - result = client.scrape( - website_url="https://example.com", - render_heavy_js=False - ) - - # Display results - html_content = result.get("html", "") - print(f"โœ… Success! 
Received {len(html_content):,} characters of HTML") - print(f"Request ID: {result.get('request_id', 'N/A')}") - - return result - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - finally: - client.close() - - -def scrape_with_heavy_js(): - """Demonstrate scraping with heavy JavaScript rendering.""" - print("\n🚀 Heavy JavaScript Rendering Example") - print("=" * 45) - - client = Client.from_env() - - try: - print("Making scrape request with heavy JS rendering...") - start_time = time.time() - - result = client.scrape( - website_url="https://example.com", - render_heavy_js=True # Enable JavaScript rendering - ) - - execution_time = time.time() - start_time - html_content = result.get("html", "") - - print(f"✅ Success! Received {len(html_content):,} characters of HTML") - print(f"⏱️ Execution time: {execution_time:.2f} seconds") - print(f"Request ID: {result.get('request_id', 'N/A')}") - - return result - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - finally: - client.close() - - -def scrape_with_branding(): - """Demonstrate scraping with branding enabled.""" - print("\n🏷️ Branding Example") - print("=" * 30) - - client = Client.from_env() - - try: - print("Making scrape request with branding enabled...") - result = client.scrape( - website_url="https://www.cubic.dev/", - render_heavy_js=False, - branding=True - ) - - html_content = result.get("html", "") - print(f"✅ Success! 
Received {len(html_content):,} characters of HTML") - print(f"Request ID: {result.get('request_id', 'N/A')}") - - # Show a preview of the HTML - preview = html_content[:200].replace('\n', ' ').strip() - print(f"HTML Preview: {preview}...") - - return result - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - finally: - client.close() - - -def scrape_with_custom_headers(): - """Demonstrate scraping with custom headers.""" - print("\n🔧 Custom Headers Example") - print("=" * 30) - - client = Client.from_env() - - # Custom headers for better compatibility - custom_headers = { - "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", - "Accept-Language": "en-US,en;q=0.5", - "Accept-Encoding": "gzip, deflate, br", - "Connection": "keep-alive", - "Upgrade-Insecure-Requests": "1" - } - - try: - print("Making scrape request with custom headers...") - result = client.scrape( - website_url="https://httpbin.org/html", - render_heavy_js=False, - headers=custom_headers - ) - - html_content = result.get("html", "") - print(f"✅ Success! 
Received {len(html_content):,} characters of HTML") - print(f"Request ID: {result.get('request_id', 'N/A')}") - - # Show a preview of the HTML - preview = html_content[:200].replace('\n', ' ').strip() - print(f"HTML Preview: {preview}...") - - return result - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - finally: - client.close() - - -def save_html_to_file(html_content: str, filename: str): - """Save HTML content to a file.""" - output_dir = Path("scrape_output") - output_dir.mkdir(exist_ok=True) - - file_path = output_dir / f"{filename}.html" - with open(file_path, "w", encoding="utf-8") as f: - f.write(html_content) - - print(f"💾 HTML saved to: {file_path}") - return file_path - - -def demonstrate_curl_equivalent(): - """Show the equivalent curl commands.""" - print("\n🌐 Equivalent curl commands:") - print("=" * 35) - - print("1. Basic scrape:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": false") - print(" }'") - - print("\n2. With heavy JS rendering:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": true") - print(" }'") - - print("\n3. 
With branding enabled:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://www.cubic.dev/\",") - print(" \"render_heavy_js\": false,") - print(" \"branding\": true") - print(" }'") - - -def main(): - """Main function demonstrating scrape functionality.""" - print("🚀 Scrape API Examples") - print("=" * 25) - - # Show curl equivalents first - demonstrate_curl_equivalent() - - try: - # Run examples - result1 = basic_scrape_example() - result2 = scrape_with_heavy_js() - result3 = scrape_with_branding() - result4 = scrape_with_custom_headers() - - # Save results if successful - if result1: - html1 = result1.get("html", "") - if html1: - save_html_to_file(html1, "basic_scrape") - - if result3: - html3 = result3.get("html", "") - if html3: - save_html_to_file(html3, "branding_scrape") - - if result4: - html4 = result4.get("html", "") - if html4: - save_html_to_file(html4, "custom_headers_scrape") - - print("\n🎯 Summary:") - print(f"✅ Basic scrape: {'Success' if result1 else 'Failed'}") - print(f"✅ Heavy JS scrape: {'Success' if result2 else 'Failed'}") - print(f"✅ Branding scrape: {'Success' if result3 else 'Failed'}") - print(f"✅ Custom headers scrape: {'Success' if result4 else 'Failed'}") - - except Exception as e: - print(f"❌ Unexpected error: {str(e)}") - - print("\n📚 Next steps:") - print("• Try the curl commands in your terminal") - print("• Experiment with different websites") - print("• Test with your own custom headers") - print("• Compare render_heavy_js=true vs false for dynamic sites") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/searchscraper/async/async_searchscraper_example.py b/scrapegraph-py/examples/searchscraper/async/async_searchscraper_example.py deleted file mode 100644 index c3326314..00000000 --- 
a/scrapegraph-py/examples/searchscraper/async/async_searchscraper_example.py +++ /dev/null @@ -1,58 +0,0 @@ -""" -Example of using the async searchscraper functionality to search for information concurrently. - -This example demonstrates the configurable website limits feature: -- Default: 3 websites (30 credits) -- Enhanced: 5 websites (50 credits) - for better research depth -- Maximum: 20 websites (200 credits) - for comprehensive research -""" - -import asyncio - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") - - -async def main(): - # Initialize async client - sgai_client = AsyncClient(api_key="your-api-key-here") - - # List of search queries with different website limits for demonstration - queries = [ - ("What is the latest version of Python and what are its main features?", 3), - ("What are the key differences between Python 2 and Python 3?", 5), - ("What is Python's GIL and how does it work?", 3), - ] - - # Create tasks for concurrent execution with configurable website limits - tasks = [ - sgai_client.searchscraper(user_prompt=query, num_results=num_results) - for query, num_results in queries - ] - - # Execute requests concurrently - responses = await asyncio.gather(*tasks, return_exceptions=True) - - # Process results - for i, response in enumerate(responses): - if isinstance(response, Exception): - print(f"\nError for query {i+1}: {response}") - else: - query, num_results = queries[i] - print(f"\nSearch {i+1}:") - print(f"Query: {query}") - print( - f"Websites searched: {num_results} (Credits: {30 if num_results <= 3 else 30 + (num_results - 3) * 10})" - ) - print(f"Result: {response['result']}") - print("Reference URLs:") - for url in response["reference_urls"]: - print(f"- {url}") - - await sgai_client.close() - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/searchscraper/async/async_searchscraper_markdown_example.py 
b/scrapegraph-py/examples/searchscraper/async/async_searchscraper_markdown_example.py deleted file mode 100644 index aec44376..00000000 --- a/scrapegraph-py/examples/searchscraper/async/async_searchscraper_markdown_example.py +++ /dev/null @@ -1,168 +0,0 @@ -#!/usr/bin/env python3 -""" -Async SearchScraper Markdown Example - -This example demonstrates using the async SearchScraper API in markdown mode -to search and scrape web pages, returning raw markdown content instead of -AI-extracted data. - -Features demonstrated: -- Async search and scrape with markdown output -- Polling for async results -- Error handling with async operations -- Cost-effective: Only 2 credits per page (vs 10 credits for AI extraction) - -Requirements: -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import asyncio -import os -from typing import Optional - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -async def wait_for_completion( - client: AsyncClient, request_id: str, max_wait_time: int = 60 -) -> Optional[dict]: - """ - Poll for completion of an async SearchScraper request. - - Args: - client: The AsyncClient instance - request_id: The request ID to poll for - max_wait_time: Maximum time to wait in seconds - - Returns: - The completed response or None if timeout - """ - import time - - start_time = time.time() - - while time.time() - start_time < max_wait_time: - try: - result = await client.get_searchscraper(request_id) - - if result.get("status") == "completed": - return result - elif result.get("status") == "failed": - print(f"❌ Request failed: {result.get('error', 'Unknown error')}") - return None - else: - print(f"⏳ Status: {result.get('status', 'processing')}... 
waiting 5 seconds") - await asyncio.sleep(5) - - except Exception as e: - print(f"⚠️ Error polling for results: {str(e)}") - await asyncio.sleep(5) - - print("⏰ Timeout waiting for completion") - return None - - -async def basic_searchscraper_markdown_example() -> bool: - """ - Run a basic SearchScraper example with markdown output. - - Returns: - bool: True if successful, False otherwise - """ - print("🔍 Async SearchScraper Markdown Example") - print("=" * 50) - - # Initialize the async client with API key from environment - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ SGAI_API_KEY not found in environment variables.") - print("Please create a .env file with: SGAI_API_KEY=your_api_key_here") - return False - - async with AsyncClient(api_key=api_key) as client: - try: - # Configuration - user_prompt = "Latest developments in artificial intelligence" - num_results = 3 - - print(f"📝 Query: {user_prompt}") - print(f"📊 Results: {num_results} websites") - print("🔧 Mode: Markdown conversion") - print("💰 Cost: 2 credits per page (vs 10 for AI extraction)") - - # Send a searchscraper request in markdown mode - response = await client.searchscraper( - user_prompt=user_prompt, - num_results=num_results, - extraction_mode=False, # False = markdown mode, True = AI extraction mode - ) - - print(f"\n✅ SearchScraper request submitted successfully!") - print(f"📄 Request ID: {response.get('request_id', 'N/A')}") - - # Check if this is an async request that needs polling - if 'request_id' in response and 'status' not in response: - print("⏳ Waiting for async processing to complete...") - - # Poll for completion - final_result = await wait_for_completion(client, response['request_id']) - - if final_result: - response = final_result - else: - print("❌ Failed to get completed results") - return False - - # Display results - if response.get("status") == "completed": - print("\n🎉 SearchScraper markdown completed successfully!") - - # 
Display markdown content (first 500 chars) - markdown_content = response.get("markdown_content", "") - if markdown_content: - print("\n📝 Markdown Content Preview:") - print(f"{markdown_content[:500]}{'...' if len(markdown_content) > 500 else ''}") - else: - print("⚠️ No markdown content returned") - - # Display reference URLs - reference_urls = response.get("reference_urls", []) - if reference_urls: - print(f"\n🔗 References: {len(reference_urls)}") - print("\n🔗 Reference URLs:") - for i, url in enumerate(reference_urls, 1): - print(f" {i}. {url}") - else: - print("⚠️ No reference URLs returned") - - return True - else: - print(f"❌ Request not completed. Status: {response.get('status', 'unknown')}") - return False - - except Exception as e: - print(f"❌ Error: {str(e)}") - return False - - -async def main(): - """Main function to run the example.""" - success = await basic_searchscraper_markdown_example() - return success - - -if __name__ == "__main__": - success = asyncio.run(main()) - exit(0 if success else 1) - diff --git a/scrapegraph-py/examples/searchscraper/async/async_searchscraper_schema_example.py b/scrapegraph-py/examples/searchscraper/async/async_searchscraper_schema_example.py deleted file mode 100644 index 385078ea..00000000 --- a/scrapegraph-py/examples/searchscraper/async/async_searchscraper_schema_example.py +++ /dev/null @@ -1,138 +0,0 @@ -""" -Example of using the async searchscraper functionality with output schemas for extraction. 
- -This example demonstrates both schema-based output and configurable website limits: -- Using different website limits for different complexity levels -- Enhanced searches provide better data for complex schema population -- Concurrent processing of multiple schema-based searches -""" - -import asyncio -from typing import List - -from pydantic import BaseModel - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") - - -# Define schemas for extracting structured data -class PythonVersionInfo(BaseModel): - version: str - release_date: str - major_features: List[str] - - -class PythonComparison(BaseModel): - key_differences: List[str] - backward_compatible: bool - migration_difficulty: str - - -class GILInfo(BaseModel): - definition: str - purpose: str - limitations: List[str] - workarounds: List[str] - - -async def main(): - # Initialize async client - sgai_client = AsyncClient(api_key="your-api-key-here") - - # Define search queries with their corresponding schemas and website limits - searches = [ - { - "prompt": "What is the latest version of Python? Include the release date and main features.", - "schema": PythonVersionInfo, - "num_results": 4, # Moderate search for version info (40 credits) - }, - { - "prompt": "Compare Python 2 and Python 3, including backward compatibility and migration difficulty.", - "schema": PythonComparison, - "num_results": 6, # Enhanced search for comparison (60 credits) - }, - { - "prompt": "Explain Python's GIL, its purpose, limitations, and possible workarounds.", - "schema": GILInfo, - "num_results": 8, # Deep search for technical details (80 credits) - }, - ] - - print("🚀 Starting concurrent schema-based searches with configurable limits:") - for i, search in enumerate(searches, 1): - credits = ( - 30 if search["num_results"] <= 3 else 30 + (search["num_results"] - 3) * 10 - ) - print( - f" {i}. 
{search['num_results']} websites ({credits} credits): {search['prompt'][:50]}..." - ) - print() - - # Create tasks for concurrent execution with configurable website limits - tasks = [ - sgai_client.searchscraper( - user_prompt=search["prompt"], - num_results=search["num_results"], - output_schema=search["schema"], - ) - for search in searches - ] - - # Execute requests concurrently - responses = await asyncio.gather(*tasks, return_exceptions=True) - - # Process results - for i, response in enumerate(responses): - if isinstance(response, Exception): - print(f"\nError for search {i+1}: {response}") - else: - print(f"\nSearch {i+1}:") - print(f"Query: {searches[i]['prompt']}") - # print(f"Raw Result: {response['result']}") - - try: - # Try to extract structured data using the schema - result = searches[i]["schema"].model_validate(response["result"]) - - # Print extracted structured data - if isinstance(result, PythonVersionInfo): - print("\nExtracted Data:") - print(f"Python Version: {result.version}") - print(f"Release Date: {result.release_date}") - print("Major Features:") - for feature in result.major_features: - print(f"- {feature}") - - elif isinstance(result, PythonComparison): - print("\nExtracted Data:") - print("Key Differences:") - for diff in result.key_differences: - print(f"- {diff}") - print(f"Backward Compatible: {result.backward_compatible}") - print(f"Migration Difficulty: {result.migration_difficulty}") - - elif isinstance(result, GILInfo): - print("\nExtracted Data:") - print(f"Definition: {result.definition}") - print(f"Purpose: {result.purpose}") - print("Limitations:") - for limit in result.limitations: - print(f"- {limit}") - print("Workarounds:") - for workaround in result.workarounds: - print(f"- {workaround}") - except Exception as e: - print(f"\nCould not extract structured data: {e}") - - print("\nReference URLs:") - for url in response["reference_urls"]: - print(f"- {url}") - - await sgai_client.close() - - -if __name__ == "__main__": - 
asyncio.run(main()) diff --git a/scrapegraph-py/examples/searchscraper/sync/searchscraper_example.py b/scrapegraph-py/examples/searchscraper/sync/searchscraper_example.py deleted file mode 100644 index 6fd735ba..00000000 --- a/scrapegraph-py/examples/searchscraper/sync/searchscraper_example.py +++ /dev/null @@ -1,53 +0,0 @@ -""" -Example of using the searchscraper functionality to search for information. - -This example demonstrates the configurable website limits feature: -- Default: 3 websites (30 credits) -- Enhanced: 5 websites (50 credits) - uncomment to try -- Maximum: 20 websites (200 credits) - for comprehensive research - -Requirements: -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import os - -from dotenv import load_dotenv - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - -# Initialize the client with API key from environment -api_key = os.getenv("SGAI_API_KEY") -if not api_key: - raise ValueError( - "SGAI_API_KEY not found in environment variables. 
Please create a .env file with: SGAI_API_KEY=your_api_key_here" - ) - -client = Client(api_key=api_key) - -# Send a searchscraper request with configurable website limits -response = client.searchscraper( - user_prompt="What is the latest version of Python and what are its main features?", - num_results=3, # Default: 3 websites (30 credits) - # num_results=5 # Enhanced: 5 websites (50 credits) - uncomment for more comprehensive results - # num_results=10 # Deep research: 10 websites (100 credits) - uncomment for extensive research -) - -# Print the results -print("\nResults:") -print(f"Answer: {response['result']}") -print("\nReference URLs:") -for url in response["reference_urls"]: - print(f"- {url}") - -# Close the client -client.close() diff --git a/scrapegraph-py/examples/searchscraper/sync/searchscraper_markdown_example.py b/scrapegraph-py/examples/searchscraper/sync/searchscraper_markdown_example.py deleted file mode 100644 index 3e4cb557..00000000 --- a/scrapegraph-py/examples/searchscraper/sync/searchscraper_markdown_example.py +++ /dev/null @@ -1,100 +0,0 @@ -#!/usr/bin/env python3 -""" -Basic SearchScraper Markdown Example - -This example demonstrates the simplest way to use the SearchScraper API -in markdown mode to search and scrape web pages, returning raw markdown content -instead of AI-extracted data. 
- -Features demonstrated: -- Basic search and scrape with markdown output -- Simple error handling -- Minimal code approach -- Cost-effective: Only 2 credits per page (vs 10 credits for AI extraction) - -Requirements: -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import os - -from dotenv import load_dotenv - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -def main(): - """Run a basic SearchScraper example with markdown output.""" - print("🔍 Basic SearchScraper Markdown Example") - print("=" * 50) - - # Initialize the client with API key from environment - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ SGAI_API_KEY not found in environment variables.") - print("Please create a .env file with: SGAI_API_KEY=your_api_key_here") - return False - - client = Client(api_key=api_key) - - try: - # Configuration - user_prompt = "Latest developments in artificial intelligence" - num_results = 3 - - print(f"📝 Query: {user_prompt}") - print(f"📊 Results: {num_results} websites") - print("🔧 Mode: Markdown conversion") - print("💰 Cost: 2 credits per page (vs 10 for AI extraction)") - - # Send a searchscraper request in markdown mode - response = client.searchscraper( - user_prompt=user_prompt, - num_results=num_results, - extraction_mode=False, # False = markdown mode, True = AI extraction mode - ) - - print("\n✅ SearchScraper markdown completed successfully!") - print(f"📄 Request ID: {response.get('request_id', 'N/A')}") - - # For async requests, you would need to poll for results - if 'request_id' in response: - print("📝 This is an async request. 
Use get_searchscraper() to retrieve results.") - print(f"🔍 Use: client.get_searchscraper('{response['request_id']}')") - else: - # If it's a sync response, display the results - if 'markdown_content' in response: - markdown_content = response.get("markdown_content", "") - print(f"\n📝 Markdown Content Preview:") - print(f"{markdown_content[:500]}{'...' if len(markdown_content) > 500 else ''}") - - if 'reference_urls' in response: - print(f"\n🔗 References: {len(response.get('reference_urls', []))}") - print("\n🔗 Reference URLs:") - for i, url in enumerate(response.get("reference_urls", []), 1): - print(f" {i}. {url}") - - return True - - except Exception as e: - print(f"❌ Error: {str(e)}") - return False - - finally: - # Close the client - client.close() - - -if __name__ == "__main__": - success = main() - exit(0 if success else 1) - diff --git a/scrapegraph-py/examples/searchscraper/sync/searchscraper_schema_example.py b/scrapegraph-py/examples/searchscraper/sync/searchscraper_schema_example.py deleted file mode 100644 index fbc54223..00000000 --- a/scrapegraph-py/examples/searchscraper/sync/searchscraper_schema_example.py +++ /dev/null @@ -1,51 +0,0 @@ -""" -Example of using the searchscraper functionality with a custom output schema. 
- -This example demonstrates both schema-based output and configurable website limits: -- Default: 3 websites (30 credits) -- Enhanced: 5 websites (50 credits) - provides more comprehensive data for schema -- Maximum: 20 websites (200 credits) - for highly detailed schema population -""" - -from typing import List - -from pydantic import BaseModel - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") - - -# Define a custom schema for the output -class PythonVersionInfo(BaseModel): - version: str - release_date: str - major_features: List[str] - is_latest: bool - - -# Initialize the client -client = Client(api_key="your-api-key-here") - -# Send a searchscraper request with schema and configurable website limits -num_results = 5 # Enhanced search for better schema data (50 credits) -print(f"🔍 Searching {num_results} websites with custom schema") -print(f"💳 Credits required: {30 if num_results <= 3 else 30 + (num_results - 3) * 10}") - -response = client.searchscraper( - user_prompt="What is the latest version of Python? Include the release date and main features.", - num_results=num_results, # More websites for better schema population - output_schema=PythonVersionInfo, -) - -# The result will be structured according to our schema -print(f"Request ID: {response['request_id']}") -print(f"Result: {response['result']}") - -print("\nReference URLs:") -for url in response["reference_urls"]: - print(f"- {url}") - -# Close the client -client.close() diff --git a/scrapegraph-py/examples/sitemap/async/async_sitemap_example.py b/scrapegraph-py/examples/sitemap/async/async_sitemap_example.py deleted file mode 100644 index f9d986ee..00000000 --- a/scrapegraph-py/examples/sitemap/async/async_sitemap_example.py +++ /dev/null @@ -1,276 +0,0 @@ -""" -Asynchronous example demonstrating how to use the Sitemap API. - -This example shows: -1. How to extract URLs from a website's sitemap asynchronously -2. 
How to process multiple sitemaps concurrently -3. How to combine sitemap with async smartscraper operations - -The Sitemap API automatically discovers the sitemap from: -- robots.txt file -- Common locations like /sitemap.xml -- Sitemap index files - -Requirements: -- Python 3.10+ -- scrapegraph-py -- python-dotenv -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import asyncio -from pathlib import Path -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient - -# Load environment variables from .env file -load_dotenv() - - -async def basic_sitemap_example(): - """Demonstrate basic async sitemap extraction.""" - print("🗺️ Basic Async Sitemap Example") - print("=" * 40) - - async with AsyncClient.from_env() as client: - try: - # Extract sitemap URLs - print("Extracting sitemap from https://scrapegraphai.com...") - response = await client.sitemap(website_url="https://scrapegraphai.com") - - # Display results - print(f"✅ Success! Found {len(response.urls)} URLs\n") - - # Show first 10 URLs - print("First 10 URLs:") - for i, url in enumerate(response.urls[:10], 1): - print(f" {i}. {url}") - - if len(response.urls) > 10: - print(f" ... 
and {len(response.urls) - 10} more URLs") - - return response - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - - -async def save_urls_to_file(urls: list[str], filename: str): - """Save sitemap URLs to a text file asynchronously.""" - output_dir = Path("sitemap_output") - output_dir.mkdir(exist_ok=True) - - file_path = output_dir / f"{filename}.txt" - - # Use asyncio to write file asynchronously - loop = asyncio.get_event_loop() - await loop.run_in_executor( - None, - lambda: file_path.write_text("\n".join(urls), encoding="utf-8") - ) - - print(f"💾 URLs saved to: {file_path}") - return file_path - - -async def concurrent_sitemaps_example(): - """Demonstrate extracting multiple sitemaps concurrently.""" - print("\n⚡ Concurrent Sitemaps Example") - print("=" * 40) - - websites = [ - "https://scrapegraphai.com", - "https://example.com", - "https://python.org" - ] - - async with AsyncClient.from_env() as client: - try: - print(f"Extracting sitemaps from {len(websites)} websites concurrently...") - - # Create tasks for concurrent execution - tasks = [ - client.sitemap(website_url=url) - for url in websites - ] - - # Execute all tasks concurrently - results = await asyncio.gather(*tasks, return_exceptions=True) - - # Process results - successful = 0 - for url, result in zip(websites, results): - if isinstance(result, Exception): - print(f"❌ {url}: {str(result)}") - else: - print(f"✅ {url}: {len(result.urls)} URLs") - successful += 1 - - print(f"\n📊 Summary: {successful}/{len(websites)} successful") - - return [r for r in results if not isinstance(r, Exception)] - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - - -async def filter_and_scrape_example(): - """Demonstrate filtering sitemap URLs and scraping them asynchronously.""" - print("\n🤖 Filter + Async Scrape Example") - print("=" * 40) - - async with AsyncClient.from_env() as client: - try: - # Extract sitemap - print("Step 1: Extracting sitemap...") - 
response = await client.sitemap(website_url="https://scrapegraphai.com") - - # Filter for specific URLs - target_urls = [url for url in response.urls if '/blog/' in url][:3] - - if not target_urls: - target_urls = response.urls[:3] - - print(f"✅ Found {len(response.urls)} URLs") - print(f"🎯 Selected {len(target_urls)} URLs to scrape\n") - - # Create scraping tasks - print("Step 2: Scraping URLs concurrently...") - - async def scrape_url(url): - """Scrape a single URL.""" - try: - result = await client.smartscraper( - website_url=url, - user_prompt="Extract the page title and main heading" - ) - return { - 'url': url, - 'data': result.get('result'), - 'status': 'success' - } - except Exception as e: - return { - 'url': url, - 'error': str(e), - 'status': 'failed' - } - - # Execute scraping tasks concurrently - tasks = [scrape_url(url) for url in target_urls] - results = await asyncio.gather(*tasks) - - # Display results - successful = sum(1 for r in results if r['status'] == 'success') - print(f"\n📊 Summary:") - print(f" ✅ Successful: {successful}/{len(results)}") - print(f" ❌ Failed: {len(results) - successful}/{len(results)}") - - # Show sample results - print("\nSample results:") - for i, result in enumerate(results[:3], 1): - print(f"\n {i}. 
{result['url']}") - if result['status'] == 'success': - print(f" Data: {result['data']}") - else: - print(f" Error: {result['error']}") - - return results - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - - -async def batch_process_with_rate_limit(): - """Demonstrate batch processing with rate limiting.""" - print("\n⏱️ Batch Processing with Rate Limit") - print("=" * 40) - - async with AsyncClient.from_env() as client: - try: - # Extract sitemap - print("Extracting sitemap...") - response = await client.sitemap(website_url="https://scrapegraphai.com") - - # Get URLs to process - urls_to_process = response.urls[:10] - print(f"Processing {len(urls_to_process)} URLs with rate limiting...") - - # Process in batches to avoid overwhelming the API - batch_size = 3 - results = [] - - for i in range(0, len(urls_to_process), batch_size): - batch = urls_to_process[i:i + batch_size] - print(f"\nProcessing batch {i // batch_size + 1}...") - - # Process batch - batch_tasks = [ - client.smartscraper( - website_url=url, - user_prompt="Extract title" - ) - for url in batch - ] - - batch_results = await asyncio.gather(*batch_tasks, return_exceptions=True) - results.extend(batch_results) - - # Rate limiting: wait between batches - if i + batch_size < len(urls_to_process): - print("Waiting 2 seconds before next batch...") - await asyncio.sleep(2) - - successful = sum(1 for r in results if not isinstance(r, Exception)) - print(f"\n✅ Processed {successful}/{len(results)} URLs successfully") - - return results - - except Exception as e: - print(f"❌ Error: {str(e)}") - return None - - -async def main(): - """Main function demonstrating async sitemap functionality.""" - print("🚀 Async Sitemap API Examples") - print("=" * 40) - - try: - # Basic sitemap extraction - response = await basic_sitemap_example() - - if response and response.urls: - # Save URLs to file - await save_urls_to_file(response.urls, "async_scrapegraphai_sitemap") - - # Concurrent 
sitemaps - await concurrent_sitemaps_example() - - # Filter and scrape - await filter_and_scrape_example() - - # Batch processing with rate limit - await batch_process_with_rate_limit() - - print("\n🎯 All examples completed!") - - except Exception as e: - print(f"❌ Unexpected error: {str(e)}") - - print("\n📚 Next steps:") - print("• Experiment with different websites") - print("• Adjust batch sizes for your use case") - print("• Combine with other async operations") - print("• Implement custom error handling and retry logic") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/sitemap/sync/sitemap_example.py b/scrapegraph-py/examples/sitemap/sync/sitemap_example.py deleted file mode 100644 index ea963f6e..00000000 --- a/scrapegraph-py/examples/sitemap/sync/sitemap_example.py +++ /dev/null @@ -1,251 +0,0 @@ -""" -Basic synchronous example demonstrating how to use the Sitemap API. - -This example shows: -1. How to extract URLs from a website's sitemap -2. How to save sitemap URLs to a file -3. 
How to combine sitemap with other scraping operations - -The Sitemap API automatically discovers the sitemap from: -- robots.txt file -- Common locations like /sitemap.xml -- Sitemap index files - -Equivalent curl command: -curl -X POST https://api.scrapegraphai.com/v1/sitemap \ - -H "Content-Type: application/json" \ - -H "SGAI-APIKEY: your-api-key-here" \ - -d '{ - "website_url": "https://example.com" - }' - -Requirements: -- Python 3.10+ -- scrapegraph-py -- python-dotenv -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -from pathlib import Path -from dotenv import load_dotenv - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -def basic_sitemap_example(): - """Demonstrate basic sitemap extraction.""" - print("๐Ÿ—บ๏ธ Basic Sitemap Example") - print("=" * 40) - - # Initialize client - client = Client.from_env() - - try: - # Extract sitemap URLs - print("Extracting sitemap from https://scrapegraphai.com...") - response = client.sitemap(website_url="https://scrapegraphai.com") - - # Display results - print(f"โœ… Success! Found {len(response.urls)} URLs\n") - - # Show first 10 URLs - print("First 10 URLs:") - for i, url in enumerate(response.urls[:10], 1): - print(f" {i}. {url}") - - if len(response.urls) > 10: - print(f" ... 
and {len(response.urls) - 10} more URLs") - - return response - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - finally: - client.close() - - -def save_urls_to_file(urls: list[str], filename: str): - """Save sitemap URLs to a text file.""" - output_dir = Path("sitemap_output") - output_dir.mkdir(exist_ok=True) - - file_path = output_dir / f"{filename}.txt" - with open(file_path, "w", encoding="utf-8") as f: - for url in urls: - f.write(url + "\n") - - print(f"๐Ÿ’พ URLs saved to: {file_path}") - return file_path - - -def filter_urls_example(): - """Demonstrate filtering sitemap URLs by pattern.""" - print("\n๐Ÿ” Filtering URLs Example") - print("=" * 40) - - client = Client.from_env() - - try: - # Extract sitemap - print("Extracting sitemap...") - response = client.sitemap(website_url="https://scrapegraphai.com") - - # Filter URLs containing specific patterns - blog_urls = [url for url in response.urls if '/blog/' in url] - doc_urls = [url for url in response.urls if '/docs/' in url or '/documentation/' in url] - - print(f"โœ… Total URLs: {len(response.urls)}") - print(f"๐Ÿ“ Blog URLs: {len(blog_urls)}") - print(f"๐Ÿ“š Documentation URLs: {len(doc_urls)}") - - # Show sample blog URLs - if blog_urls: - print("\nSample blog URLs:") - for url in blog_urls[:5]: - print(f" โ€ข {url}") - - return { - 'all_urls': response.urls, - 'blog_urls': blog_urls, - 'doc_urls': doc_urls - } - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - finally: - client.close() - - -def combine_with_smartscraper(): - """Demonstrate combining sitemap with smartscraper.""" - print("\n๐Ÿค– Sitemap + SmartScraper Example") - print("=" * 40) - - client = Client.from_env() - - try: - # First, get sitemap URLs - print("Step 1: Extracting sitemap...") - sitemap_response = client.sitemap(website_url="https://scrapegraphai.com") - - # Filter for specific pages (e.g., blog posts) - target_urls = [url for url in sitemap_response.urls if '/blog/' in url][:3] 
- - if not target_urls: - # If no blog URLs, use first 3 URLs - target_urls = sitemap_response.urls[:3] - - print(f"โœ… Found {len(sitemap_response.urls)} URLs") - print(f"๐ŸŽฏ Selected {len(target_urls)} URLs to scrape\n") - - # Scrape selected URLs - print("Step 2: Scraping selected URLs...") - results = [] - - for i, url in enumerate(target_urls, 1): - print(f" Scraping ({i}/{len(target_urls)}): {url}") - - try: - # Use smartscraper to extract data - scrape_result = client.smartscraper( - website_url=url, - user_prompt="Extract the page title and main heading" - ) - - results.append({ - 'url': url, - 'data': scrape_result.get('result'), - 'status': 'success' - }) - print(f" โœ… Success") - - except Exception as e: - results.append({ - 'url': url, - 'error': str(e), - 'status': 'failed' - }) - print(f" โŒ Failed: {str(e)}") - - # Summary - successful = sum(1 for r in results if r['status'] == 'success') - print(f"\n๐Ÿ“Š Summary:") - print(f" โœ… Successful: {successful}/{len(results)}") - print(f" โŒ Failed: {len(results) - successful}/{len(results)}") - - return results - - except Exception as e: - print(f"โŒ Error: {str(e)}") - return None - finally: - client.close() - - -def demonstrate_curl_equivalent(): - """Show the equivalent curl command.""" - print("\n๐ŸŒ Equivalent curl command:") - print("=" * 40) - - print("curl -X POST https://api.scrapegraphai.com/v1/sitemap \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://scrapegraphai.com\"") - print(" }'") - - -def main(): - """Main function demonstrating sitemap functionality.""" - print("๐Ÿš€ Sitemap API Examples") - print("=" * 40) - - # Show curl equivalent first - demonstrate_curl_equivalent() - - try: - # Run examples - print("\n" + "=" * 40 + "\n") - - # Basic sitemap extraction - response = basic_sitemap_example() - - if response and response.urls: - # Save URLs to file - 
save_urls_to_file(response.urls, "scrapegraphai_sitemap") - - # Filter URLs by pattern - filtered = filter_urls_example() - - if filtered: - # Save filtered URLs - if filtered['blog_urls']: - save_urls_to_file(filtered['blog_urls'], "blog_urls") - if filtered['doc_urls']: - save_urls_to_file(filtered['doc_urls'], "doc_urls") - - # Advanced: Combine with smartscraper - combine_with_smartscraper() - - print("\n🎯 All examples completed!") - - except Exception as e: - print(f"❌ Unexpected error: {str(e)}") - - print("\n📚 Next steps:") - print("• Try the curl command in your terminal") - print("• Experiment with different websites") - print("• Combine sitemap with other scraping operations") - print("• Filter URLs based on your specific needs") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/smartscraper/async/async_generate_schema_example.py b/scrapegraph-py/examples/smartscraper/async/async_generate_schema_example.py deleted file mode 100644 index 5e796a26..00000000 --- a/scrapegraph-py/examples/smartscraper/async/async_generate_schema_example.py +++ /dev/null @@ -1,236 +0,0 @@ -#!/usr/bin/env python3 -""" -Async example script demonstrating the Generate Schema API endpoint using ScrapeGraph Python SDK. - -This script shows how to: -1. Generate a new JSON schema from a search query asynchronously -2. Modify an existing schema -3. Handle different types of search queries -4. Check the status of schema generation requests -5. 
Run multiple concurrent schema generations - -Requirements: -- Python 3.7+ -- scrapegraph-py package -- aiohttp -- python-dotenv -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here - -Usage: - python async_generate_schema_example.py -""" - -import asyncio -import json -import os -from typing import Any, Dict, Optional - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient - -# Load environment variables from .env file -load_dotenv() - - -class AsyncGenerateSchemaExample: - """Async example class for demonstrating the Generate Schema API using ScrapeGraph SDK""" - - def __init__(self, base_url: str = None, api_key: str = None): - # Get API key from environment if not provided - self.api_key = api_key or os.getenv("SGAI_API_KEY") - if not self.api_key: - raise ValueError( - "API key must be provided or set in .env file as SGAI_API_KEY. " - "Create a .env file with: SGAI_API_KEY=your_api_key_here" - ) - - # Initialize the ScrapeGraph async client - if base_url: - # If base_url is provided, we'll need to modify the client to use it - # For now, we'll use the default client and note the limitation - print(f"โš ๏ธ Note: Custom base_url {base_url} not yet supported in this example") - - self.client = AsyncClient(api_key=self.api_key) - - def print_schema_response( - self, response: Dict[str, Any], title: str = "Schema Generation Response" - ): - """Pretty print the schema generation response""" - print(f"\n{'='*60}") - print(f" {title}") - print(f"{'='*60}") - - if "error" in response and response["error"]: - print(f"โŒ Error: {response['error']}") - return - - print(f"โœ… Request ID: {response.get('request_id', 'N/A')}") - print(f"๐Ÿ“Š Status: {response.get('status', 'N/A')}") - print(f"๐Ÿ” User Prompt: {response.get('user_prompt', 'N/A')}") - print(f"โœจ Refined Prompt: {response.get('refined_prompt', 'N/A')}") - - if "generated_schema" in response: - print(f"\n๐Ÿ“‹ Generated Schema:") - 
print(json.dumps(response["generated_schema"], indent=2)) - - async def run_examples(self): - """Run all the example scenarios asynchronously""" - print("๐Ÿš€ Async Generate Schema API Examples using ScrapeGraph Python SDK") - print("=" * 60) - - # Example 1: Generate schema for e-commerce products - print("\n1๏ธโƒฃ Example: E-commerce Product Search") - ecommerce_prompt = "Find laptops with specifications like brand, processor, RAM, storage, and price" - try: - response = await self.client.generate_schema(ecommerce_prompt) - self.print_schema_response(response, "E-commerce Products Schema") - except Exception as e: - print(f"โŒ Error in e-commerce example: {e}") - - # Example 2: Generate schema for job listings - print("\n2๏ธโƒฃ Example: Job Listings Search") - job_prompt = "Search for software engineering jobs with company name, position, location, salary range, and requirements" - try: - response = await self.client.generate_schema(job_prompt) - self.print_schema_response(response, "Job Listings Schema") - except Exception as e: - print(f"โŒ Error in job listings example: {e}") - - # Example 3: Generate schema for news articles - print("\n3๏ธโƒฃ Example: News Articles Search") - news_prompt = "Find technology news articles with headline, author, publication date, category, and summary" - try: - response = await self.client.generate_schema(news_prompt) - self.print_schema_response(response, "News Articles Schema") - except Exception as e: - print(f"โŒ Error in news articles example: {e}") - - # Example 4: Modify existing schema - print("\n4๏ธโƒฃ Example: Modify Existing Schema") - existing_schema = { - "$defs": { - "ProductSchema": { - "title": "ProductSchema", - "type": "object", - "properties": { - "name": {"title": "Name", "type": "string"}, - "price": {"title": "Price", "type": "number"}, - }, - "required": ["name", "price"], - } - }, - "title": "ProductList", - "type": "object", - "properties": { - "products": { - "title": "Products", - "type": 
"array", - "items": {"$ref": "#/$defs/ProductSchema"}, - } - }, - "required": ["products"], - } - - modification_prompt = ( - "Add brand, category, and rating fields to the existing product schema" - ) - try: - response = await self.client.generate_schema(modification_prompt, existing_schema) - self.print_schema_response(response, "Modified Product Schema") - except Exception as e: - print(f"โŒ Error in schema modification example: {e}") - - # Example 5: Complex nested schema - print("\n5๏ธโƒฃ Example: Complex Nested Schema") - complex_prompt = "Create a schema for a company directory with departments, each containing employees with contact info and projects" - try: - response = await self.client.generate_schema(complex_prompt) - self.print_schema_response(response, "Company Directory Schema") - except Exception as e: - print(f"โŒ Error in complex schema example: {e}") - - async def run_concurrent_examples(self): - """Run multiple schema generations concurrently""" - print("\n๐Ÿ”„ Running Concurrent Examples...") - - # Example: Multiple concurrent schema generations - prompts = [ - "Find restaurants with name, cuisine, rating, and address", - "Search for books with title, author, genre, and publication year", - "Find movies with title, director, cast, rating, and release date", - ] - - try: - tasks = [self.client.generate_schema(prompt) for prompt in prompts] - results = await asyncio.gather(*tasks) - - for i, (prompt, result) in enumerate(zip(prompts, results), 1): - self.print_schema_response(result, f"Concurrent Example {i}: {prompt[:30]}...") - - except Exception as e: - print(f"โŒ Error in concurrent examples: {e}") - - async def demonstrate_status_checking(self): - """Demonstrate how to check the status of schema generation requests""" - print("\n๐Ÿ”„ Demonstrating Status Checking...") - - # Generate a simple schema first - prompt = "Find restaurants with name, cuisine, rating, and address" - try: - response = await self.client.generate_schema(prompt) - 
request_id = response.get('request_id') - - if request_id: - print(f"๐Ÿ“ Generated schema request with ID: {request_id}") - - # Check the status - print("๐Ÿ” Checking status...") - status_response = await self.client.get_schema_status(request_id) - self.print_schema_response(status_response, f"Status Check for {request_id}") - else: - print("โš ๏ธ No request ID returned from schema generation") - - except Exception as e: - print(f"โŒ Error in status checking demonstration: {e}") - - async def close(self): - """Close the client to free up resources""" - if hasattr(self, 'client'): - await self.client.close() - - -async def main(): - """Main function to run the async examples""" - # Check if API key is available - if not os.getenv("SGAI_API_KEY"): - print("Error: SGAI_API_KEY not found in .env file") - print("Please create a .env file with your API key:") - print("SGAI_API_KEY=your_api_key_here") - return - - # Initialize the example class - example = AsyncGenerateSchemaExample() - - try: - # Run synchronous examples - await example.run_examples() - - # Run concurrent examples - await example.run_concurrent_examples() - - # Demonstrate status checking - await example.demonstrate_status_checking() - - except Exception as e: - print(f"โŒ Unexpected Error: {e}") - finally: - # Always close the client - await example.close() - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_cookies_example.py b/scrapegraph-py/examples/smartscraper/async/async_smartscraper_cookies_example.py deleted file mode 100644 index b68794b4..00000000 --- a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_cookies_example.py +++ /dev/null @@ -1,131 +0,0 @@ -""" -Example demonstrating how to use the SmartScraper API with cookies (Async). - -This example shows how to: -1. Set up the API request with cookies for authentication -2. Use cookies with infinite scrolling -3. 
Define a Pydantic model for structured output -4. Make the API call and handle the response -5. Process the extracted data - -Requirements: -- Python 3.7+ -- scrapegraph-py -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import asyncio -import json -import os -from typing import Dict - -from dotenv import load_dotenv -from pydantic import BaseModel, Field - -from scrapegraph_py import AsyncClient - -# Load environment variables from .env file -load_dotenv() - - -# Define the data models for structured output -class CookieInfo(BaseModel): - """Model representing cookie information.""" - - cookies: Dict[str, str] = Field(description="Dictionary of cookie key-value pairs") - - -async def main(): - """Example usage of the cookies scraper.""" - # Check if API key is available - if not os.getenv("SGAI_API_KEY"): - print("Error: SGAI_API_KEY not found in .env file") - print("Please create a .env file with your API key:") - print("SGAI_API_KEY=your_api_key_here") - return - - # Initialize the async client - async with AsyncClient.from_env() as client: - # Example 1: Basic cookies example (httpbin.org/cookies) - print("=" * 60) - print("EXAMPLE 1: Basic Cookies Example") - print("=" * 60) - - website_url = "https://httpbin.org/cookies" - user_prompt = "Extract all cookies info" - cookies = {"cookies_key": "cookies_value"} - - try: - # Perform the scraping with cookies - response = await client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - output_schema=CookieInfo, - ) - - # Print the results - print("\nExtracted Cookie Information:") - print(json.dumps(response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - # Example 2: Cookies with infinite scrolling - print("\n" + "=" * 60) - print("EXAMPLE 2: Cookies with Infinite Scrolling") - print("=" * 60) - - website_url = "https://httpbin.org/cookies" - user_prompt = "Extract all cookies and scroll 
information" - cookies = {"session_id": "abc123", "user_token": "xyz789"} - - try: - # Perform the scraping with cookies and infinite scrolling - response = await client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - number_of_scrolls=3, - output_schema=CookieInfo, - ) - - # Print the results - print("\nExtracted Cookie Information with Scrolling:") - print(json.dumps(response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - # Example 3: Cookies with pagination - print("\n" + "=" * 60) - print("EXAMPLE 3: Cookies with Pagination") - print("=" * 60) - - website_url = "https://httpbin.org/cookies" - user_prompt = "Extract all cookies from multiple pages" - cookies = {"auth_token": "secret123", "preferences": "dark_mode"} - - try: - # Perform the scraping with cookies and pagination - response = await client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - total_pages=3, - output_schema=CookieInfo, - ) - - # Print the results - print("\nExtracted Cookie Information with Pagination:") - print(json.dumps(response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_example.py b/scrapegraph-py/examples/smartscraper/async/async_smartscraper_example.py deleted file mode 100644 index f9a0f694..00000000 --- a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_example.py +++ /dev/null @@ -1,56 +0,0 @@ -import asyncio -import os - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -async def main(): - # Initialize async client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ 
Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - sgai_client = AsyncClient(api_key=api_key) - - # Concurrent scraping requests - urls = [ - "https://scrapegraphai.com/", - "https://github.com/ScrapeGraphAI/Scrapegraph-ai", - ] - - tasks = [ - sgai_client.smartscraper( - website_url=url, user_prompt="Summarize the main content" - ) - for url in urls - ] - - # Execute requests concurrently - responses = await asyncio.gather(*tasks, return_exceptions=True) - - # Process results - for i, response in enumerate(responses): - if isinstance(response, Exception): - print(f"\nError for {urls[i]}: {response}") - else: - print(f"\nPage {i+1} Summary:") - print(f"URL: {urls[i]}") - print(f"Result: {response['result']}") - - await sgai_client.close() - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_infinite_scroll_example.py b/scrapegraph-py/examples/smartscraper/async/async_smartscraper_infinite_scroll_example.py deleted file mode 100644 index 9e4f1802..00000000 --- a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_infinite_scroll_example.py +++ /dev/null @@ -1,62 +0,0 @@ -import asyncio - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") - - -async def scrape_companies(client: AsyncClient, url: str, batch: str) -> None: - """Scrape companies from a specific YC batch with infinite scroll.""" - try: - # Initial scrape with infinite scroll enabled - response = await client.smartscraper( - website_url=url, - user_prompt="Extract all company information from this page, including name, description, and website", - number_of_scrolls=10, - ) - # Process the results - companies = response.get("result", {}).get("companies", []) 
- if not companies: - print(f"No companies found for batch {batch}") - return - - # Save or process the companies data - print(f"Found {len(companies)} companies in batch {batch}") - - for company in companies: - print(f"Company: {company.get('name', 'N/A')}") - print(f"Description: {company.get('description', 'N/A')}") - print(f"Website: {company.get('website', 'N/A')}") - print("-" * 50) - - except Exception as e: - print(f"Error scraping batch {batch}: {str(e)}") - - -async def main(): - # Initialize async client - client = AsyncClient(api_key="Your-API-Key") - - try: - # Example YC batch URLs - batch_urls = { - "W24": "https://www.ycombinator.com/companies?batch=Winter%202024", - "S23": "https://www.ycombinator.com/companies?batch=Summer%202023", - } - - # Create tasks for each batch - tasks = [ - scrape_companies(client, url, batch) for batch, url in batch_urls.items() - ] - - # Execute all batch scraping concurrently - await asyncio.gather(*tasks) - - finally: - # Ensure client is properly closed - await client.close() - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_pagination_example.py b/scrapegraph-py/examples/smartscraper/async/async_smartscraper_pagination_example.py deleted file mode 100644 index 4ef32d4a..00000000 --- a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_pagination_example.py +++ /dev/null @@ -1,286 +0,0 @@ -#!/usr/bin/env python3 -""" -SmartScraper Pagination Example (Async) - -This example demonstrates how to use pagination functionality with SmartScraper API using the asynchronous client. 
-""" - -import asyncio -import json -import logging -import os -import time -from typing import List, Optional - -from dotenv import load_dotenv -from pydantic import BaseModel - -from scrapegraph_py import AsyncClient -from scrapegraph_py.exceptions import APIError - -# Load environment variables from .env file -load_dotenv() - - -# Configure logging -logging.basicConfig( - level=logging.INFO, - format="%(asctime)s - %(levelname)s - %(message)s", - handlers=[logging.StreamHandler()], -) -logger = logging.getLogger(__name__) - - -class ProductInfo(BaseModel): - """Schema for product information""" - - name: str - price: Optional[str] = None - rating: Optional[str] = None - image_url: Optional[str] = None - description: Optional[str] = None - - -class ProductList(BaseModel): - """Schema for list of products""" - - products: List[ProductInfo] - - -async def smartscraper_pagination_example(): - """Example of using pagination with SmartScraper (async)""" - - print("SmartScraper Pagination Example (Async)") - print("=" * 50) - - # Initialize client from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - try: - client = AsyncClient(api_key=api_key) - except Exception as e: - print(f"โŒ Error initializing client: {e}") - return - - # Configuration - website_url = "https://www.amazon.in/s?k=tv&crid=1TEF1ZFVLU8R8&sprefix=t%2Caps%2C390&ref=nb_sb_noss_2" - user_prompt = "Extract all product info including name, price, rating, image_url, and description" - total_pages = 3 # Number of pages to scrape - - print(f"๐ŸŒ Website URL: {website_url}") - print(f"๐Ÿ“ User Prompt: {user_prompt}") - print(f"๐Ÿ“„ Total Pages: {total_pages}") - print("-" * 50) - - try: - # Start timing - start_time = time.time() - - # Make the request with pagination - result = await client.smartscraper( - user_prompt=user_prompt, - website_url=website_url, - output_schema=ProductList, - total_pages=total_pages, - ) - - # Calculate duration - duration = time.time() - start_time - - print(f"โœ… Request completed in {duration:.2f} seconds") - print(f"๐Ÿ“Š Response type: {type(result)}") - - # Display results - if isinstance(result, dict): - print("\n๐Ÿ” Response:") - print(json.dumps(result, indent=2, ensure_ascii=False)) - - # Check for pagination success indicators - if "data" in result: - print( - f"\nโœจ Pagination successful! Data extracted from {total_pages} pages" - ) - - elif isinstance(result, list): - print(f"\nโœ… Pagination successful! Extracted {len(result)} items") - for i, item in enumerate(result[:5]): # Show first 5 items - print(f" {i+1}. {item}") - if len(result) > 5: - print(f" ... 
and {len(result) - 5} more items") - else: - print(f"\n📋 Result: {result}") - - except APIError as e: - print(f"❌ API Error: {e}") - print("This could be due to:") - print(" - Invalid API key") - print(" - Rate limiting") - print(" - Server issues") - - except Exception as e: - print(f"❌ Unexpected error: {e}") - print("This could be due to:") - print(" - Network connectivity issues") - print(" - Invalid website URL") - print(" - Pagination limitations") - - -async def test_concurrent_pagination(): - """Test multiple pagination requests concurrently""" - - print("\n" + "=" * 50) - print("Testing concurrent pagination requests") - print("=" * 50) - - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - return - - try: - client = AsyncClient(api_key=api_key) - except Exception as e: - print(f"❌ Error initializing client: {e}") - return - - # Test concurrent requests - urls = [ - "https://example.com/products?page=1", - "https://example.com/products?page=2", - "https://example.com/products?page=3", - ] - - tasks = [] - for i, url in enumerate(urls): - print(f"🚀 Creating task {i+1} for URL: {url}") - # Note: In a real scenario, you would use actual URLs - # This is just to demonstrate the async functionality - tasks.append( - asyncio.create_task(simulate_pagination_request(client, url, i + 1)) - ) - - print(f"⏱️ Starting {len(tasks)} concurrent tasks...") - start_time = time.time() - - try: - results = await asyncio.gather(*tasks, return_exceptions=True) - duration = time.time() - start_time - - print(f"✅ All tasks completed in {duration:.2f} seconds") - - for i, result in enumerate(results): - if isinstance(result, Exception): - print(f"❌ Task {i+1} failed: {result}") - else: - print(f"✅ Task {i+1} succeeded: {result}") - - except Exception as e: - print(f"❌ Concurrent execution failed: {e}") - - -async def simulate_pagination_request(client: AsyncClient, url: str, task_id: 
int): - """Simulate a pagination request (for demonstration)""" - - print(f"📋 Task {task_id}: Processing {url}") - - # Simulate some work - await asyncio.sleep(0.5) - - # Return a simulated result - return f"Task {task_id} completed successfully" - - -async def test_pagination_with_different_parameters(): - """Test pagination with different parameters""" - - print("\n" + "=" * 50) - print("Testing pagination with different parameters") - print("=" * 50) - - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - return - - try: - AsyncClient(api_key=api_key) - except Exception as e: - print(f"❌ Error initializing client: {e}") - return - - # Test cases - test_cases = [ - { - "name": "Single page (default)", - "url": "https://example.com", - "total_pages": None, - "user_prompt": "Extract basic info", - }, - { - "name": "Two pages with schema", - "url": "https://example.com/products", - "total_pages": 2, - "user_prompt": "Extract product information", - "output_schema": ProductList, - }, - { - "name": "Maximum pages with scrolling", - "url": "https://example.com/search", - "total_pages": 5, - "user_prompt": "Extract all available data", - "number_of_scrolls": 3, - }, - ] - - for test_case in test_cases: - print(f"\n🧪 Test: {test_case['name']}") - print(f" Pages: {test_case['total_pages']}") - print(f" Prompt: {test_case['user_prompt']}") - - try: - # This is just to demonstrate the API call structure - # In a real scenario, you'd make actual API calls - print(" ✅ Configuration valid") - - except Exception as e: - print(f" ❌ Configuration error: {e}") - - -async def main(): - """Main function to run the pagination examples""" - - print("ScrapeGraph SDK - SmartScraper Pagination Examples (Async)") - print("=" * 60) - - # Run the main example - await smartscraper_pagination_example() - - # Test concurrent pagination - await test_concurrent_pagination() - - # Test different parameters - await 
test_pagination_with_different_parameters() - - print("\n" + "=" * 60) - print("Examples completed!") - print("\nNext steps:") - print("1. Set SGAI_API_KEY environment variable") - print("2. Replace example URLs with real websites") - print("3. Adjust total_pages parameter (1-10)") - print("4. Customize user_prompt for your use case") - print("5. Define output_schema for structured data") - print("\nAsync-specific tips:") - print("- Use asyncio.gather() for concurrent requests") - print("- Consider rate limiting with asyncio.Semaphore") - print("- Handle exceptions properly in async context") - print("- Use proper context managers for cleanup") - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_render_heavy_example.py b/scrapegraph-py/examples/smartscraper/async/async_smartscraper_render_heavy_example.py deleted file mode 100644 index 90f22606..00000000 --- a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_render_heavy_example.py +++ /dev/null @@ -1,39 +0,0 @@ -import asyncio -import os - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -async def main(): - # Initialize the client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("โŒ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - async with AsyncClient(api_key=api_key) as sgai_client: - # SmartScraper request with render_heavy_js enabled - response = await sgai_client.smartscraper( - website_url="https://example.com", - user_prompt="Find the CEO of company X and their contact details", - render_heavy_js=True, # Enable heavy JavaScript rendering - ) - - # Print the response - print(f"Request ID: {response['request_id']}") - print(f"Result: {response['result']}") - - -if __name__ == "__main__": - asyncio.run(main()) \ No newline at end of file diff --git a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_schema_example.py b/scrapegraph-py/examples/smartscraper/async/async_smartscraper_schema_example.py deleted file mode 100644 index d7cd4fa6..00000000 --- a/scrapegraph-py/examples/smartscraper/async/async_smartscraper_schema_example.py +++ /dev/null @@ -1,34 +0,0 @@ -import asyncio - -from pydantic import BaseModel, Field - -from scrapegraph_py import AsyncClient - - -# Define a Pydantic model for the output schema -class WebpageSchema(BaseModel): - title: str = Field(description="The title of the webpage") - description: str = Field(description="The description of the webpage") - summary: str = Field(description="A brief summary of the webpage") - - -async def main(): - # Initialize the async client - sgai_client = AsyncClient(api_key="your-api-key-here") - - # SmartScraper request with output schema - response = await sgai_client.smartscraper( - website_url="https://example.com", - user_prompt="Extract webpage information", - output_schema=WebpageSchema, - ) - - # Print the response - print(f"Request ID: {response['request_id']}") - print(f"Result: {response['result']}") - - await sgai_client.close() - - -if __name__ == "__main__": - asyncio.run(main()) diff --git a/scrapegraph-py/examples/smartscraper/sync/generate_schema_example.py b/scrapegraph-py/examples/smartscraper/sync/generate_schema_example.py 
deleted file mode 100644 index 205e5796..00000000 --- a/scrapegraph-py/examples/smartscraper/sync/generate_schema_example.py +++ /dev/null @@ -1,208 +0,0 @@ -#!/usr/bin/env python3 -""" -Example script demonstrating the Generate Schema API endpoint using ScrapeGraph Python SDK. - -This script shows how to: -1. Generate a new JSON schema from a search query -2. Modify an existing schema -3. Handle different types of search queries -4. Check the status of schema generation requests - -Requirements: -- Python 3.7+ -- scrapegraph-py package -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here - -Usage: - python generate_schema_example.py -""" - -import json -import os -from typing import Any, Dict, Optional - -from dotenv import load_dotenv - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -class GenerateSchemaExample: - """Example class for demonstrating the Generate Schema API using ScrapeGraph SDK""" - - def __init__(self, base_url: str = None, api_key: str = None): - # Get API key from environment if not provided - self.api_key = api_key or os.getenv("SGAI_API_KEY") - if not self.api_key: - raise ValueError( - "API key must be provided or set in .env file as SGAI_API_KEY. 
" - "Create a .env file with: SGAI_API_KEY=your_api_key_here" - ) - - # Initialize the ScrapeGraph client - if base_url: - # If base_url is provided, we'll need to modify the client to use it - # For now, we'll use the default client and note the limitation - print(f"⚠️ Note: Custom base_url {base_url} not yet supported in this example") - - self.client = Client(api_key=self.api_key) - - def print_schema_response( - self, response: Dict[str, Any], title: str = "Schema Generation Response" - ): - """Pretty print the schema generation response""" - print(f"\n{'='*60}") - print(f" {title}") - print(f"{'='*60}") - - if "error" in response and response["error"]: - print(f"❌ Error: {response['error']}") - return - - print(f"✅ Request ID: {response.get('request_id', 'N/A')}") - print(f"📊 Status: {response.get('status', 'N/A')}") - print(f"🔍 User Prompt: {response.get('user_prompt', 'N/A')}") - print(f"✨ Refined Prompt: {response.get('refined_prompt', 'N/A')}") - - if "generated_schema" in response: - print(f"\n📋 Generated Schema:") - print(json.dumps(response["generated_schema"], indent=2)) - - def run_examples(self): - """Run all the example scenarios""" - print("🚀 Generate Schema API Examples using ScrapeGraph Python SDK") - print("=" * 60) - - # Example 1: Generate schema for e-commerce products - print("\n1️⃣ Example: E-commerce Product Search") - ecommerce_prompt = "Find laptops with specifications like brand, processor, RAM, storage, and price" - try: - response = self.client.generate_schema(ecommerce_prompt) - self.print_schema_response(response, "E-commerce Products Schema") - except Exception as e: - print(f"❌ Error in e-commerce example: {e}") - - # Example 2: Generate schema for job listings - print("\n2️⃣ Example: Job Listings Search") - job_prompt = "Search for software engineering jobs with company name, position, location, salary range, and requirements" - try: - response = self.client.generate_schema(job_prompt) - 
self.print_schema_response(response, "Job Listings Schema") - except Exception as e: - print(f"❌ Error in job listings example: {e}") - - # Example 3: Generate schema for news articles - print("\n3️⃣ Example: News Articles Search") - news_prompt = "Find technology news articles with headline, author, publication date, category, and summary" - try: - response = self.client.generate_schema(news_prompt) - self.print_schema_response(response, "News Articles Schema") - except Exception as e: - print(f"❌ Error in news articles example: {e}") - - # Example 4: Modify existing schema - print("\n4️⃣ Example: Modify Existing Schema") - existing_schema = { - "$defs": { - "ProductSchema": { - "title": "ProductSchema", - "type": "object", - "properties": { - "name": {"title": "Name", "type": "string"}, - "price": {"title": "Price", "type": "number"}, - }, - "required": ["name", "price"], - } - }, - "title": "ProductList", - "type": "object", - "properties": { - "products": { - "title": "Products", - "type": "array", - "items": {"$ref": "#/$defs/ProductSchema"}, - } - }, - "required": ["products"], - } - - modification_prompt = ( - "Add brand, category, and rating fields to the existing product schema" - ) - try: - response = self.client.generate_schema(modification_prompt, existing_schema) - self.print_schema_response(response, "Modified Product Schema") - except Exception as e: - print(f"❌ Error in schema modification example: {e}") - - # Example 5: Complex nested schema - print("\n5️⃣ Example: Complex Nested Schema") - complex_prompt = "Create a schema for a company directory with departments, each containing employees with contact info and projects" - try: - response = self.client.generate_schema(complex_prompt) - self.print_schema_response(response, "Company Directory Schema") - except Exception as e: - print(f"❌ Error in complex schema example: {e}") - - def demonstrate_status_checking(self): - """Demonstrate how to check the status of schema generation 
requests""" - print("\n🔄 Demonstrating Status Checking...") - - # Generate a simple schema first - prompt = "Find restaurants with name, cuisine, rating, and address" - try: - response = self.client.generate_schema(prompt) - request_id = response.get('request_id') - - if request_id: - print(f"📝 Generated schema request with ID: {request_id}") - - # Check the status - print("🔍 Checking status...") - status_response = self.client.get_schema_status(request_id) - self.print_schema_response(status_response, f"Status Check for {request_id}") - else: - print("⚠️ No request ID returned from schema generation") - - except Exception as e: - print(f"❌ Error in status checking demonstration: {e}") - - def close(self): - """Close the client to free up resources""" - if hasattr(self, 'client'): - self.client.close() - - -def main(): - """Main function to run the examples""" - # Check if API key is available - if not os.getenv("SGAI_API_KEY"): - print("Error: SGAI_API_KEY not found in .env file") - print("Please create a .env file with your API key:") - print("SGAI_API_KEY=your_api_key_here") - return - - # Initialize the example class - example = GenerateSchemaExample() - - try: - # Run synchronous examples - example.run_examples() - - # Demonstrate status checking - example.demonstrate_status_checking() - - except Exception as e: - print(f"❌ Unexpected Error: {e}") - finally: - # Always close the client - example.close() - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/smartscraper/sync/sample_product.html b/scrapegraph-py/examples/smartscraper/sync/sample_product.html deleted file mode 100644 index 2872e1af..00000000 --- a/scrapegraph-py/examples/smartscraper/sync/sample_product.html +++ /dev/null @@ -1,77 +0,0 @@ - - - - - - Sample Product Page - - - -
-

Premium Wireless Headphones

-

€299.99

- -
-

Description

-

- Experience crystal-clear audio with our premium wireless headphones. - Featuring advanced noise cancellation technology and up to 30 hours - of battery life, these headphones are perfect for music lovers and - professionals alike. -

-
- -
-

Key Features

-
    -
  • Active Noise Cancellation (ANC)
  • -
  • 30-hour battery life
  • -
  • Bluetooth 5.0 connectivity
  • -
  • Premium leather ear cushions
  • -
  • Foldable design with carry case
  • -
  • Built-in microphone for calls
  • -
-
- -
-

Contact Information

-

Email: support@example.com

-

Phone: +1 (555) 123-4567

-

Website: www.example.com

-
- -
-

Stock Status: In Stock

-

SKU: WH-1000XM5-BLK

-

Category: Electronics > Audio > Headphones

-
-
- - \ No newline at end of file diff --git a/scrapegraph-py/examples/smartscraper/sync/smartscraper_cookies_example.py b/scrapegraph-py/examples/smartscraper/sync/smartscraper_cookies_example.py deleted file mode 100644 index cc812356..00000000 --- a/scrapegraph-py/examples/smartscraper/sync/smartscraper_cookies_example.py +++ /dev/null @@ -1,134 +0,0 @@ -""" -Example demonstrating how to use the SmartScraper API with cookies. - -This example shows how to: -1. Set up the API request with cookies for authentication -2. Use cookies with infinite scrolling -3. Define a Pydantic model for structured output -4. Make the API call and handle the response -5. Process the extracted data - -Requirements: -- Python 3.7+ -- scrapegraph-py -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import json -import os -from typing import Dict - -from dotenv import load_dotenv -from pydantic import BaseModel, Field - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -# Define the data models for structured output -class CookieInfo(BaseModel): - """Model representing cookie information.""" - - cookies: Dict[str, str] = Field(description="Dictionary of cookie key-value pairs") - - -def main(): - """Example usage of the cookies scraper.""" - # Check if API key is available - if not os.getenv("SGAI_API_KEY"): - print("Error: SGAI_API_KEY not found in .env file") - print("Please create a .env file with your API key:") - print("SGAI_API_KEY=your_api_key_here") - return - - # Initialize the client - client = Client.from_env() - - # Example 1: Basic cookies example (httpbin.org/cookies) - print("=" * 60) - print("EXAMPLE 1: Basic Cookies Example") - print("=" * 60) - - website_url = "https://httpbin.org/cookies" - user_prompt = "Extract all cookies info" - cookies = {"cookies_key": "cookies_value"} - - try: - # Perform the scraping with cookies - response = client.smartscraper( - 
website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - output_schema=CookieInfo, - ) - - # Print the results - print("\nExtracted Cookie Information:") - print(json.dumps(response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - # Example 2: Cookies with infinite scrolling - print("\n" + "=" * 60) - print("EXAMPLE 2: Cookies with Infinite Scrolling") - print("=" * 60) - - website_url = "https://httpbin.org/cookies" - user_prompt = "Extract all cookies and scroll information" - cookies = {"session_id": "abc123", "user_token": "xyz789"} - - try: - # Perform the scraping with cookies and infinite scrolling - response = client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - number_of_scrolls=3, - output_schema=CookieInfo, - ) - - # Print the results - print("\nExtracted Cookie Information with Scrolling:") - print(json.dumps(response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - # Example 3: Cookies with pagination - print("\n" + "=" * 60) - print("EXAMPLE 3: Cookies with Pagination") - print("=" * 60) - - website_url = "https://httpbin.org/cookies" - user_prompt = "Extract all cookies from multiple pages" - cookies = {"auth_token": "secret123", "preferences": "dark_mode"} - - try: - # Perform the scraping with cookies and pagination - response = client.smartscraper( - website_url=website_url, - user_prompt=user_prompt, - cookies=cookies, - total_pages=3, - output_schema=CookieInfo, - ) - - # Print the results - print("\nExtracted Cookie Information with Pagination:") - print(json.dumps(response, indent=2)) - - except Exception as e: - print(f"Error occurred: {str(e)}") - - # Close the client - client.close() - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/smartscraper/sync/smartscraper_example.py b/scrapegraph-py/examples/smartscraper/sync/smartscraper_example.py deleted file mode 100644 index f6b82066..00000000 --- 
a/scrapegraph-py/examples/smartscraper/sync/smartscraper_example.py +++ /dev/null @@ -1,36 +0,0 @@ -import os - -from dotenv import load_dotenv - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - -# Initialize the client with API key from environment variable -api_key = os.getenv("SGAI_API_KEY") -if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - exit(1) - -sgai_client = Client(api_key=api_key) - -# SmartScraper request -response = sgai_client.smartscraper( - website_url="https://example.com", - # website_html="...", # Optional, if you want to pass in HTML content instead of a URL - user_prompt="Extract the main heading, description, and summary of the webpage", -) - - -# Print the response -print(f"Request ID: {response['request_id']}") -print(f"Result: {response['result']}") - -sgai_client.close() diff --git a/scrapegraph-py/examples/smartscraper/sync/smartscraper_infinite_scroll_example.py b/scrapegraph-py/examples/smartscraper/sync/smartscraper_infinite_scroll_example.py deleted file mode 100644 index ece65795..00000000 --- a/scrapegraph-py/examples/smartscraper/sync/smartscraper_infinite_scroll_example.py +++ /dev/null @@ -1,65 +0,0 @@ -import os -from typing import List - -from dotenv import load_dotenv -from pydantic import BaseModel - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -# Define the output schema -class Company(BaseModel): - name: str - category: str - location: str - - -class CompaniesResponse(BaseModel): - companies: List[Company] - - -# Initialize the client with 
API key from environment variable -# Make sure to set SGAI_API_KEY in your environment or .env file -api_key = os.getenv("SGAI_API_KEY") -if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - exit(1) - -sgai_client = Client(api_key=api_key) - -try: - # SmartScraper request with infinite scroll - response = sgai_client.smartscraper( - website_url="https://www.ycombinator.com/companies?batch=Spring%202025", - user_prompt="Extract all company names and their categories from the page", - output_schema=CompaniesResponse, - number_of_scrolls=10, # Scroll 10 times to load more companies - ) - - # Print the response - print(f"Request ID: {response['request_id']}") - - # Parse and print the results in a structured way - result = CompaniesResponse.model_validate(response["result"]) - print("\nExtracted Companies:") - print("-" * 80) - for company in result.companies: - print(f"Name: {company.name}") - print(f"Category: {company.category}") - print(f"Location: {company.location}") - print("-" * 80) - -except Exception as e: - print(f"An error occurred: {e}") - -finally: - sgai_client.close() diff --git a/scrapegraph-py/examples/smartscraper/sync/smartscraper_local_html_example.py b/scrapegraph-py/examples/smartscraper/sync/smartscraper_local_html_example.py deleted file mode 100644 index c79cab14..00000000 --- a/scrapegraph-py/examples/smartscraper/sync/smartscraper_local_html_example.py +++ /dev/null @@ -1,115 +0,0 @@ -""" -SmartScraper with Local HTML File Example - -This example demonstrates how to use SmartScraper with a local HTML file -instead of fetching content from a URL. 
Perfect for: -- Testing with static HTML files -- Processing saved web pages -- Working offline -- Debugging and development - -Requirements: -- SGAI_API_KEY environment variable must be set -""" - -import os -from pathlib import Path - -from dotenv import load_dotenv - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - - -def read_html_file(file_path: str) -> str: - """ - Read HTML content from a local file. - - Args: - file_path: Path to the HTML file - - Returns: - HTML content as string - """ - try: - with open(file_path, "r", encoding="utf-8") as f: - return f.read() - except FileNotFoundError: - print(f"❌ File not found: {file_path}") - raise - except Exception as e: - print(f"❌ Error reading file: {str(e)}") - raise - - -def main(): - """Extract data from a local HTML file using SmartScraper.""" - - # Initialize the client with API key from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - # Path to the sample HTML file in the same directory - script_dir = Path(__file__).parent - html_file_path = script_dir / "sample_product.html" - - # Check if the HTML file exists - if not html_file_path.exists(): - print(f"❌ HTML file not found at: {html_file_path}") - print(" Make sure sample_product.html exists in the sync/ directory") - return - - # Read the HTML file - print(f"📂 Reading HTML file: {html_file_path.name}") - html_content = read_html_file(str(html_file_path)) - - # Check file size (max 2MB) - html_size_mb = len(html_content.encode("utf-8")) / (1024 * 1024) - print(f"📊 HTML file size: {html_size_mb:.4f} MB") - - if html_size_mb > 2: - print("❌ HTML file exceeds 2MB limit") - return - - # Define what to extract - user_prompt = "Extract the product name, price, description, all features, and contact information" - - # Create client and scrape using local HTML - sgai_client = Client(api_key=api_key) - - print(f"🎯 Prompt: {user_prompt}") - print() - - # Pass website_html instead of website_url - # Note: website_url should be empty string when using website_html - response = sgai_client.smartscraper( - website_url="", # Empty when using website_html - user_prompt=user_prompt, - website_html=html_content, # Pass the HTML content here - ) - - # Print the response - print("✅ Success! 
Extracted data from local HTML:") - print() - print(f"Request ID: {response['request_id']}") - print(f"Result: {response['result']}") - print() - - sgai_client.close() - - -if __name__ == "__main__": - print("SmartScraper with Local HTML File Example") - print("=" * 45) - print() - main() diff --git a/scrapegraph-py/examples/smartscraper/sync/smartscraper_pagination_example.py b/scrapegraph-py/examples/smartscraper/sync/smartscraper_pagination_example.py deleted file mode 100644 index 08d76aac..00000000 --- a/scrapegraph-py/examples/smartscraper/sync/smartscraper_pagination_example.py +++ /dev/null @@ -1,207 +0,0 @@ -#!/usr/bin/env python3 -""" -SmartScraper Pagination Example (Sync) - -This example demonstrates how to use pagination functionality with SmartScraper API using the synchronous client. -""" - -import json -import logging -import os -import time -from typing import List, Optional - -from dotenv import load_dotenv -from pydantic import BaseModel - -from scrapegraph_py import Client -from scrapegraph_py.exceptions import APIError - -# Load environment variables from .env file -load_dotenv() - - -# Configure logging -logging.basicConfig( - level=logging.INFO, - format="%(asctime)s - %(levelname)s - %(message)s", - handlers=[logging.StreamHandler()], -) -logger = logging.getLogger(__name__) - - -class ProductInfo(BaseModel): - """Schema for product information""" - - name: str - price: Optional[str] = None - rating: Optional[str] = None - image_url: Optional[str] = None - description: Optional[str] = None - - -class ProductList(BaseModel): - """Schema for list of products""" - - products: List[ProductInfo] - - -def smartscraper_pagination_example(): - """Example of using pagination with SmartScraper (sync)""" - - print("SmartScraper Pagination Example (Sync)") - print("=" * 50) - - # Initialize client from environment variable - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - 
print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. Create a .env file with: SGAI_API_KEY=your-api-key-here") - return - - try: - client = Client(api_key=api_key) - except Exception as e: - print(f"❌ Error initializing client: {e}") - return - - # Configuration - website_url = "https://www.amazon.in/s?k=tv&crid=1TEF1ZFVLU8R8&sprefix=t%2Caps%2C390&ref=nb_sb_noss_2" - user_prompt = "Extract all product info including name, price, rating, image_url, and description" - total_pages = 3 # Number of pages to scrape - - print(f"🌐 Website URL: {website_url}") - print(f"📝 User Prompt: {user_prompt}") - print(f"📄 Total Pages: {total_pages}") - print("-" * 50) - - try: - # Start timing - start_time = time.time() - - # Make the request with pagination - result = client.smartscraper( - user_prompt=user_prompt, - website_url=website_url, - output_schema=ProductList, - total_pages=total_pages, - ) - - # Calculate duration - duration = time.time() - start_time - - print(f"✅ Request completed in {duration:.2f} seconds") - print(f"📊 Response type: {type(result)}") - - # Display results - if isinstance(result, dict): - print("\n🔍 Response:") - print(json.dumps(result, indent=2, ensure_ascii=False)) - - # Check for pagination success indicators - if "data" in result: - print( - f"\n✨ Pagination successful! Data extracted from {total_pages} pages" - ) - - elif isinstance(result, list): - print(f"\n✅ Pagination successful! Extracted {len(result)} items") - for i, item in enumerate(result[:5]): # Show first 5 items - print(f" {i+1}. {item}") - if len(result) > 5: - print(f" ... 
and {len(result) - 5} more items") - else: - print(f"\n📋 Result: {result}") - - except APIError as e: - print(f"❌ API Error: {e}") - print("This could be due to:") - print(" - Invalid API key") - print(" - Rate limiting") - print(" - Server issues") - - except Exception as e: - print(f"❌ Unexpected error: {e}") - print("This could be due to:") - print(" - Network connectivity issues") - print(" - Invalid website URL") - print(" - Pagination limitations") - - -def test_pagination_parameters(): - """Test different pagination parameters""" - - print("\n" + "=" * 50) - print("Testing different pagination parameters") - print("=" * 50) - - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - return - - try: - Client(api_key=api_key) - except Exception as e: - print(f"❌ Error initializing client: {e}") - return - - # Test cases - test_cases = [ - { - "name": "Single page (default)", - "url": "https://example.com", - "total_pages": None, - }, - {"name": "Two pages", "url": "https://example.com/products", "total_pages": 2}, - { - "name": "Maximum pages", - "url": "https://example.com/search", - "total_pages": 10, - }, - ] - - for test_case in test_cases: - print(f"\n🧪 Test: {test_case['name']}") - print(f" Pages: {test_case['total_pages']}") - - try: - # This is just to demonstrate the API call structure - # In a real scenario, you'd use actual URLs - print(" ✅ Configuration valid") - - except Exception as e: - print(f" ❌ Configuration error: {e}") - - -def main(): - """Main function to run the pagination examples""" - - print("ScrapeGraph SDK - SmartScraper Pagination Examples") - print("=" * 60) - - # Run the main example - smartscraper_pagination_example() - - # Test different parameters - test_pagination_parameters() - - print("\n" + "=" * 60) - print("Examples completed!") - print("\nNext steps:") - print("1. Set SGAI_API_KEY environment variable") - print("2. 
Replace example URLs with real websites") - print("3. Adjust total_pages parameter (1-10)") - print("4. Customize user_prompt for your use case") - print("5. Define output_schema for structured data") - print("\nTips:") - print("- Use smaller total_pages for testing") - print("- Pagination requests may take longer") - print("- Some websites may not support pagination") - print("- Consider rate limiting for large requests") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/smartscraper/sync/smartscraper_render_heavy_example.py b/scrapegraph-py/examples/smartscraper/sync/smartscraper_render_heavy_example.py deleted file mode 100644 index 1a956999..00000000 --- a/scrapegraph-py/examples/smartscraper/sync/smartscraper_render_heavy_example.py +++ /dev/null @@ -1,35 +0,0 @@ -import os - -from dotenv import load_dotenv - -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -# Load environment variables from .env file -load_dotenv() - -sgai_logger.set_logging(level="INFO") - -# Initialize the client with API key from environment variable -api_key = os.getenv("SGAI_API_KEY") -if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - exit(1) - -sgai_client = Client(api_key=api_key) - -# SmartScraper request with render_heavy_js enabled -response = sgai_client.smartscraper( - website_url="https://example.com", - user_prompt="Find the CEO of company X and their contact details", - render_heavy_js=True, # Enable heavy JavaScript rendering -) - -# Print the response -print(f"Request ID: {response['request_id']}") -print(f"Result: {response['result']}") - -sgai_client.close() \ No newline at end of file diff --git a/scrapegraph-py/examples/smartscraper/sync/smartscraper_schema_example.py b/scrapegraph-py/examples/smartscraper/sync/smartscraper_schema_example.py deleted file mode 100644 index e2c7c047..00000000 --- a/scrapegraph-py/examples/smartscraper/sync/smartscraper_schema_example.py +++ /dev/null @@ -1,42 +0,0 @@ -import os - -from dotenv import load_dotenv -from pydantic import BaseModel, Field - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -# Define a Pydantic model for the output schema -class WebpageSchema(BaseModel): - title: str = Field(description="The title of the webpage") - description: str = Field(description="The description of the webpage") - summary: str = Field(description="A brief summary of the webpage") - - -# Initialize the client with API key from environment variable -api_key = os.getenv("SGAI_API_KEY") -if not api_key: - print("❌ Error: SGAI_API_KEY environment variable not set") - print("Please either:") - print(" 1. Set environment variable: export SGAI_API_KEY='your-api-key-here'") - print(" 2. 
Create a .env file with: SGAI_API_KEY=your-api-key-here") - exit(1) - -sgai_client = Client(api_key=api_key) - -# SmartScraper request with output schema -response = sgai_client.smartscraper( - website_url="https://example.com", - # website_html="...", # Optional, if you want to pass in HTML content instead of a URL - user_prompt="Extract webpage information", - output_schema=WebpageSchema, -) - -# Print the response -print(f"Request ID: {response['request_id']}") -print(f"Result: {response['result']}") - -sgai_client.close() diff --git a/scrapegraph-py/examples/stealth_mode_example.py b/scrapegraph-py/examples/stealth_mode_example.py deleted file mode 100644 index 442c3a36..00000000 --- a/scrapegraph-py/examples/stealth_mode_example.py +++ /dev/null @@ -1,494 +0,0 @@ -""" -Stealth Mode Examples for ScrapeGraph AI Python SDK - -This file demonstrates how to use stealth mode with various endpoints -to avoid bot detection when scraping websites. - -Stealth mode enables advanced techniques to make requests appear more -like those from a real browser, helping to bypass basic bot detection. -""" - -import os -from scrapegraph_py import Client -from pydantic import BaseModel, Field - -# Get API key from environment variable -API_KEY = os.getenv("SGAI_API_KEY", "your-api-key-here") - - -# ============================================================================ -# EXAMPLE 1: SmartScraper with Stealth Mode -# ============================================================================ - - -def example_smartscraper_with_stealth(): - """ - Extract structured data from a webpage using stealth mode. - Useful for websites with bot detection. 
- """ - print("\n" + "=" * 60) - print("EXAMPLE 1: SmartScraper with Stealth Mode") - print("=" * 60) - - with Client(api_key=API_KEY) as client: - try: - response = client.smartscraper( - website_url="https://www.scrapethissite.com/pages/simple/", - user_prompt="Extract country names and capitals", - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Request ID: {response['request_id']}") - print(f"Result: {response['result']}") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 2: SmartScraper with Stealth Mode and Output Schema -# ============================================================================ - - -def example_smartscraper_with_stealth_and_schema(): - """ - Use stealth mode with a structured output schema to extract data - from websites that might detect bots. - """ - print("\n" + "=" * 60) - print("EXAMPLE 2: SmartScraper with Stealth Mode and Schema") - print("=" * 60) - - # Define output schema using Pydantic - class Product(BaseModel): - name: str = Field(description="Product name") - price: str = Field(description="Product price") - rating: float = Field(description="Product rating (0-5)") - - with Client(api_key=API_KEY) as client: - try: - response = client.smartscraper( - website_url="https://example.com/products", - user_prompt="Extract product information including name, price, and rating", - output_schema=Product, - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Request ID: {response['request_id']}") - print(f"Result: {response['result']}") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 3: SearchScraper with Stealth Mode -# ============================================================================ - - -def example_searchscraper_with_stealth(): - """ - Search and 
extract information from multiple sources using stealth mode. - """ - print("\n" + "=" * 60) - print("EXAMPLE 3: SearchScraper with Stealth Mode") - print("=" * 60) - - with Client(api_key=API_KEY) as client: - try: - response = client.searchscraper( - user_prompt="What are the latest developments in AI technology?", - num_results=5, - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Request ID: {response['request_id']}") - print(f"Result: {response['result']}") - if "reference_urls" in response: - print(f"Reference URLs: {response['reference_urls']}") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 4: Markdownify with Stealth Mode -# ============================================================================ - - -def example_markdownify_with_stealth(): - """ - Convert a webpage to markdown format using stealth mode. - """ - print("\n" + "=" * 60) - print("EXAMPLE 4: Markdownify with Stealth Mode") - print("=" * 60) - - with Client(api_key=API_KEY) as client: - try: - response = client.markdownify( - website_url="https://www.example.com", - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Request ID: {response['request_id']}") - print(f"Markdown Preview (first 500 chars):") - print(response["result"][:500]) - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 5: Scrape with Stealth Mode -# ============================================================================ - - -def example_scrape_with_stealth(): - """ - Get raw HTML from a webpage using stealth mode. 
- """ - print("\n" + "=" * 60) - print("EXAMPLE 5: Scrape with Stealth Mode") - print("=" * 60) - - with Client(api_key=API_KEY) as client: - try: - response = client.scrape( - website_url="https://www.example.com", - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Scrape Request ID: {response['scrape_request_id']}") - print(f"HTML Preview (first 500 chars):") - print(response["html"][:500]) - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 6: Scrape with Stealth Mode and Heavy JS Rendering -# ============================================================================ - - -def example_scrape_with_stealth_and_js(): - """ - Scrape a JavaScript-heavy website using stealth mode. - Combines JavaScript rendering with stealth techniques. - """ - print("\n" + "=" * 60) - print("EXAMPLE 6: Scrape with Stealth Mode and Heavy JS") - print("=" * 60) - - with Client(api_key=API_KEY) as client: - try: - response = client.scrape( - website_url="https://www.example.com", - render_heavy_js=True, # Enable JavaScript rendering - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Scrape Request ID: {response['scrape_request_id']}") - print(f"HTML Preview (first 500 chars):") - print(response["html"][:500]) - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 7: Agentic Scraper with Stealth Mode -# ============================================================================ - - -def example_agenticscraper_with_stealth(): - """ - Perform automated browser actions using stealth mode. - Ideal for interacting with protected forms or multi-step workflows. 
- """ - print("\n" + "=" * 60) - print("EXAMPLE 7: Agentic Scraper with Stealth Mode") - print("=" * 60) - - with Client(api_key=API_KEY) as client: - try: - response = client.agenticscraper( - url="https://dashboard.example.com/login", - steps=[ - "Type user@example.com in email input box", - "Type password123 in password input box", - "Click on login button", - ], - use_session=True, - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Request ID: {response['request_id']}") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 8: Agentic Scraper with Stealth Mode and AI Extraction -# ============================================================================ - - -def example_agenticscraper_with_stealth_and_ai(): - """ - Combine stealth mode with AI extraction in agentic scraping. - Performs actions and then extracts structured data. - """ - print("\n" + "=" * 60) - print("EXAMPLE 8: Agentic Scraper with Stealth and AI Extraction") - print("=" * 60) - - with Client(api_key=API_KEY) as client: - try: - response = client.agenticscraper( - url="https://dashboard.example.com", - steps=[ - "Navigate to user profile section", - "Click on settings tab", - ], - use_session=True, - user_prompt="Extract user profile information and settings", - ai_extraction=True, - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Request ID: {response['request_id']}") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 9: Crawl with Stealth Mode -# ============================================================================ - - -def example_crawl_with_stealth(): - """ - Crawl an entire website using stealth mode. - Useful for comprehensive data extraction from protected sites. 
- """ - print("\n" + "=" * 60) - print("EXAMPLE 9: Crawl with Stealth Mode") - print("=" * 60) - - schema = { - "$schema": "http://json-schema.org/draft-07/schema#", - "title": "Website Content", - "type": "object", - "properties": { - "title": {"type": "string", "description": "Page title"}, - "content": {"type": "string", "description": "Main content"}, - }, - "required": ["title"], - } - - with Client(api_key=API_KEY) as client: - try: - response = client.crawl( - url="https://www.example.com", - prompt="Extract page titles and main content", - data_schema=schema, - depth=2, - max_pages=5, - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Crawl ID: {response['id']}") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 10: Crawl with Stealth Mode and Sitemap -# ============================================================================ - - -def example_crawl_with_stealth_and_sitemap(): - """ - Use sitemap for efficient crawling with stealth mode enabled. 
- """ - print("\n" + "=" * 60) - print("EXAMPLE 10: Crawl with Stealth Mode and Sitemap") - print("=" * 60) - - schema = { - "$schema": "http://json-schema.org/draft-07/schema#", - "title": "Product Information", - "type": "object", - "properties": { - "product_name": {"type": "string"}, - "price": {"type": "string"}, - "description": {"type": "string"}, - }, - "required": ["product_name"], - } - - with Client(api_key=API_KEY) as client: - try: - response = client.crawl( - url="https://www.example-shop.com", - prompt="Extract product information from all pages", - data_schema=schema, - sitemap=True, # Use sitemap for better page discovery - depth=3, - max_pages=10, - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Crawl ID: {response['id']}") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 11: SmartScraper with Stealth, Custom Headers, and Pagination -# ============================================================================ - - -def example_smartscraper_advanced_stealth(): - """ - Advanced example combining stealth mode with custom headers and pagination. 
- """ - print("\n" + "=" * 60) - print("EXAMPLE 11: SmartScraper Advanced with Stealth") - print("=" * 60) - - headers = { - "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", - "Accept-Language": "en-US,en;q=0.9", - } - - with Client(api_key=API_KEY) as client: - try: - response = client.smartscraper( - website_url="https://www.example-marketplace.com/products", - user_prompt="Extract all product listings from multiple pages", - headers=headers, - number_of_scrolls=10, - total_pages=5, - render_heavy_js=True, - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Request ID: {response['request_id']}") - print(f"Result: {response['result']}") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# EXAMPLE 12: Using Stealth Mode with Custom Headers -# ============================================================================ - - -def example_stealth_with_custom_headers(): - """ - Demonstrate using stealth mode together with custom headers - for maximum control over request appearance. - """ - print("\n" + "=" * 60) - print("EXAMPLE 12: Stealth Mode with Custom Headers") - print("=" * 60) - - # Custom headers to simulate a real browser request - headers = { - "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", - "Accept-Language": "en-US,en;q=0.5", - "Accept-Encoding": "gzip, deflate, br", - "DNT": "1", - } - - with Client(api_key=API_KEY) as client: - try: - # Using with markdownify - response = client.markdownify( - website_url="https://www.protected-site.com", - headers=headers, - stealth=True, # Enable stealth mode - ) - - print(f"Status: {response['status']}") - print(f"Request ID: {response['request_id']}") - print(f"Success! 
Stealth mode + custom headers bypassed detection.") - - except Exception as e: - print(f"Error: {e}") - - -# ============================================================================ -# RUN ALL EXAMPLES -# ============================================================================ - - -def run_all_examples(): - """Run all stealth mode examples""" - print("\n") - print("=" * 60) - print("STEALTH MODE EXAMPLES FOR SCRAPEGRAPH AI PYTHON SDK") - print("=" * 60) - print("\nThese examples demonstrate how to use stealth mode") - print("to avoid bot detection when scraping websites.") - print("\nStealth mode is available for all major endpoints:") - print("- SmartScraper") - print("- SearchScraper") - print("- Markdownify") - print("- Scrape") - print("- Agentic Scraper") - print("- Crawl") - - examples = [ - example_smartscraper_with_stealth, - example_smartscraper_with_stealth_and_schema, - example_searchscraper_with_stealth, - example_markdownify_with_stealth, - example_scrape_with_stealth, - example_scrape_with_stealth_and_js, - example_agenticscraper_with_stealth, - example_agenticscraper_with_stealth_and_ai, - example_crawl_with_stealth, - example_crawl_with_stealth_and_sitemap, - example_smartscraper_advanced_stealth, - example_stealth_with_custom_headers, - ] - - for i, example_func in enumerate(examples, 1): - try: - example_func() - except Exception as e: - print(f"\nExample {i} failed: {e}") - - print("\n" + "=" * 60) - print("ALL EXAMPLES COMPLETED") - print("=" * 60) - - -if __name__ == "__main__": - # You can run all examples or specific ones - run_all_examples() - - # Or run individual examples: - # example_smartscraper_with_stealth() - # example_searchscraper_with_stealth() - # example_crawl_with_stealth() diff --git a/scrapegraph-py/examples/steps/step_by_step_schema_generation.py b/scrapegraph-py/examples/steps/step_by_step_schema_generation.py deleted file mode 100644 index ff91a750..00000000 --- 
a/scrapegraph-py/examples/steps/step_by_step_schema_generation.py +++ /dev/null @@ -1,185 +0,0 @@ -#!/usr/bin/env python3 -""" -Step-by-step example for schema generation using ScrapeGraph Python SDK. - -This script demonstrates the basic workflow for schema generation: -1. Initialize the client -2. Generate a schema from a prompt -3. Check the status of the request -4. Retrieve the final result - -Requirements: -- Python 3.7+ -- scrapegraph-py package -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here - -Usage: - python step_by_step_schema_generation.py -""" - -import json -import os -import time -from typing import Any, Dict - -from dotenv import load_dotenv - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -def print_step(step_number: int, title: str, description: str = ""): - """Print a formatted step header""" - print(f"\n{'='*60}") - print(f"STEP {step_number}: {title}") - print(f"{'='*60}") - if description: - print(description) - print() - - -def print_response(response: Dict[str, Any], title: str = "API Response"): - """Pretty print an API response""" - print(f"\n📋 {title}") - print("-" * 40) - - if "error" in response and response["error"]: - print(f"❌ Error: {response['error']}") - return - - for key, value in response.items(): - if key == "generated_schema" and value: - print(f"🔧 {key}:") - print(json.dumps(value, indent=2)) - else: - print(f"🔧 {key}: {value}") - - -def main(): - """Main function demonstrating step-by-step schema generation""" - - # Step 1: Check API key and initialize client - print_step(1, "Initialize Client", "Setting up the ScrapeGraph client with your API key") - - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("❌ Error: SGAI_API_KEY not found in .env file") - print("Please create a .env file with your API key:") - print("SGAI_API_KEY=your_api_key_here") - return - - try: - client = Client(api_key=api_key) -
print("✅ Client initialized successfully") - except Exception as e: - print(f"❌ Failed to initialize client: {e}") - return - - # Step 2: Define the schema generation request - print_step(2, "Define Request", "Creating a prompt for schema generation") - - user_prompt = "Find laptops with specifications like brand, processor, RAM, storage, and price" - print(f"💭 User Prompt: {user_prompt}") - - # Step 3: Generate the schema - print_step(3, "Generate Schema", "Sending the schema generation request to the API") - - try: - response = client.generate_schema(user_prompt) - print("✅ Schema generation request sent successfully") - print_response(response, "Initial Response") - - # Extract the request ID for status checking - request_id = response.get('request_id') - if not request_id: - print("❌ No request ID returned from the API") - return - - except Exception as e: - print(f"❌ Failed to generate schema: {e}") - return - - # Step 4: Check the status (polling) - print_step(4, "Check Status", "Polling the API to check the status of the request") - - max_attempts = 10 - attempt = 0 - - while attempt < max_attempts: - attempt += 1 - print(f"🔍 Attempt {attempt}/{max_attempts}: Checking status...") - - try: - status_response = client.get_schema_status(request_id) - current_status = status_response.get('status', 'unknown') - - print(f"📊 Current Status: {current_status}") - - if current_status == 'completed': - print("✅ Schema generation completed successfully!") - print_response(status_response, "Final Result") - break - elif current_status == 'failed': - print("❌ Schema generation failed") - print_response(status_response, "Error Response") - break - elif current_status in ['pending', 'processing']: - print("⏳ Request is still being processed, waiting...") - if attempt < max_attempts: - time.sleep(2) # Wait 2 seconds before next check - else: - print(f"⚠️ Unknown status: {current_status}") - break - - except Exception as e: - print(f"❌ Error
checking status: {e}") - break - - if attempt >= max_attempts: - print("⚠️ Maximum attempts reached. The request might still be processing.") - print("You can check the status later using the request ID.") - - # Step 5: Demonstrate schema modification - print_step(5, "Schema Modification", "Demonstrating how to modify an existing schema") - - existing_schema = { - "type": "object", - "properties": { - "name": {"type": "string"}, - "price": {"type": "number"}, - }, - "required": ["name", "price"], - } - - modification_prompt = "Add brand and rating fields to the existing schema" - print(f"💭 Modification Prompt: {modification_prompt}") - print(f"📋 Existing Schema: {json.dumps(existing_schema, indent=2)}") - - try: - modification_response = client.generate_schema(modification_prompt, existing_schema) - print("✅ Schema modification request sent successfully") - print_response(modification_response, "Modification Response") - - except Exception as e: - print(f"❌ Failed to modify schema: {e}") - - # Step 6: Cleanup - print_step(6, "Cleanup", "Closing the client to free up resources") - - try: - client.close() - print("✅ Client closed successfully") - except Exception as e: - print(f"⚠️ Warning: Error closing client: {e}") - - print("\n🎉 Schema generation demonstration completed!") - print(f"📝 Request ID for reference: {request_id}") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/toon_async_example.py b/scrapegraph-py/examples/toon_async_example.py deleted file mode 100644 index 2ffea9d3..00000000 --- a/scrapegraph-py/examples/toon_async_example.py +++ /dev/null @@ -1,117 +0,0 @@ -#!/usr/bin/env python3 -""" -Async example demonstrating TOON format integration with ScrapeGraph SDK. - -TOON (Token-Oriented Object Notation) reduces token usage by 30-60% compared to JSON, -which can significantly reduce costs when working with LLM APIs.
- -This example shows how to use the `return_toon` parameter with various async scraping methods. -""" -import asyncio -import os -from scrapegraph_py import AsyncClient - - -async def main(): - """Demonstrate TOON format with different async scraping methods.""" - - # Set your API key as an environment variable - # export SGAI_API_KEY="your-api-key-here" - # or set it in your .env file - - # Initialize the async client - async with AsyncClient.from_env() as client: - print("🎨 Async TOON Format Integration Example\n") - print("=" * 60) - - # Example 1: SmartScraper with TOON format - print("\n📌 Example 1: Async SmartScraper with TOON Format") - print("-" * 60) - - try: - # Request with return_toon=False (default JSON response) - json_response = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the page title and main heading", - return_toon=False - ) - - print("\nJSON Response:") - print(json_response) - - # Request with return_toon=True (TOON formatted response) - toon_response = await client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the page title and main heading", - return_toon=True - ) - - print("\nTOON Response:") - print(toon_response) - - # Compare token sizes (approximate) - if isinstance(json_response, dict): - import json - json_str = json.dumps(json_response) - json_tokens = len(json_str.split()) - toon_tokens = len(str(toon_response).split()) - - savings = ((json_tokens - toon_tokens) / json_tokens) * 100 if json_tokens > 0 else 0 - - print(f"\n📊 Token Comparison:") - print(f" JSON tokens (approx): {json_tokens}") - print(f" TOON tokens (approx): {toon_tokens}") - print(f" Savings: {savings:.1f}%") - - except Exception as e: - print(f"Error in Example 1: {e}") - - # Example 2: SearchScraper with TOON format - print("\n\n📌 Example 2: Async SearchScraper with TOON Format") - print("-" * 60) - - try: - # Request with TOON format - toon_search_response = await
client.searchscraper( - user_prompt="Latest AI developments in 2024", - num_results=3, - return_toon=True - ) - - print("\nTOON Search Response:") - print(toon_search_response) - - except Exception as e: - print(f"Error in Example 2: {e}") - - # Example 3: Markdownify with TOON format - print("\n\n📌 Example 3: Async Markdownify with TOON Format") - print("-" * 60) - - try: - # Request with TOON format - toon_markdown_response = await client.markdownify( - website_url="https://example.com", - return_toon=True - ) - - print("\nTOON Markdown Response:") - print(str(toon_markdown_response)[:500]) # Print first 500 chars - print("...(truncated)") - - except Exception as e: - print(f"Error in Example 3: {e}") - - print("\n\n✅ Async TOON Integration Examples Completed!") - print("=" * 60) - print("\n💡 Benefits of TOON Format:") - print(" • 30-60% reduction in token usage") - print(" • Lower LLM API costs") - print(" • Faster processing") - print(" • Human-readable format") - print("\n🔗 Learn more: https://github.com/ScrapeGraphAI/toonify") - - -if __name__ == "__main__": - asyncio.run(main()) - diff --git a/scrapegraph-py/examples/toon_example.py b/scrapegraph-py/examples/toon_example.py deleted file mode 100644 index e4e29217..00000000 --- a/scrapegraph-py/examples/toon_example.py +++ /dev/null @@ -1,117 +0,0 @@ -#!/usr/bin/env python3 -""" -Example demonstrating TOON format integration with ScrapeGraph SDK. - -TOON (Token-Oriented Object Notation) reduces token usage by 30-60% compared to JSON, -which can significantly reduce costs when working with LLM APIs. - -This example shows how to use the `return_toon` parameter with various scraping methods.
-""" -import os -from scrapegraph_py import Client - -# Set your API key as an environment variable -# export SGAI_API_KEY="your-api-key-here" -# or set it in your .env file - - -def main(): - """Demonstrate TOON format with different scraping methods.""" - - # Initialize the client - client = Client.from_env() - - print("🎨 TOON Format Integration Example\n") - print("=" * 60) - - # Example 1: SmartScraper with TOON format - print("\n📌 Example 1: SmartScraper with TOON Format") - print("-" * 60) - - try: - # Request with return_toon=False (default JSON response) - json_response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the page title and main heading", - return_toon=False - ) - - print("\nJSON Response:") - print(json_response) - - # Request with return_toon=True (TOON formatted response) - toon_response = client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the page title and main heading", - return_toon=True - ) - - print("\nTOON Response:") - print(toon_response) - - # Compare token sizes (approximate) - if isinstance(json_response, dict): - import json - json_str = json.dumps(json_response) - json_tokens = len(json_str.split()) - toon_tokens = len(str(toon_response).split()) - - savings = ((json_tokens - toon_tokens) / json_tokens) * 100 if json_tokens > 0 else 0 - - print(f"\n📊 Token Comparison:") - print(f" JSON tokens (approx): {json_tokens}") - print(f" TOON tokens (approx): {toon_tokens}") - print(f" Savings: {savings:.1f}%") - - except Exception as e: - print(f"Error in Example 1: {e}") - - # Example 2: SearchScraper with TOON format - print("\n\n📌 Example 2: SearchScraper with TOON Format") - print("-" * 60) - - try: - # Request with TOON format - toon_search_response = client.searchscraper( - user_prompt="Latest AI developments in 2024", - num_results=3, - return_toon=True - ) - - print("\nTOON Search Response:") - print(toon_search_response) - - except Exception as e: -
print(f"Error in Example 2: {e}") - - # Example 3: Markdownify with TOON format - print("\n\n📌 Example 3: Markdownify with TOON Format") - print("-" * 60) - - try: - # Request with TOON format - toon_markdown_response = client.markdownify( - website_url="https://example.com", - return_toon=True - ) - - print("\nTOON Markdown Response:") - print(str(toon_markdown_response)[:500]) # Print first 500 chars - print("...(truncated)") - - except Exception as e: - print(f"Error in Example 3: {e}") - - print("\n\n✅ TOON Integration Examples Completed!") - print("=" * 60) - print("\n💡 Benefits of TOON Format:") - print(" • 30-60% reduction in token usage") - print(" • Lower LLM API costs") - print(" • Faster processing") - print(" • Human-readable format") - print("\n🔗 Learn more: https://github.com/ScrapeGraphAI/toonify") - - -if __name__ == "__main__": - main() - diff --git a/scrapegraph-py/examples/utilities/async_scrape_example.py b/scrapegraph-py/examples/utilities/async_scrape_example.py deleted file mode 100644 index 0a6c227d..00000000 --- a/scrapegraph-py/examples/utilities/async_scrape_example.py +++ /dev/null @@ -1,278 +0,0 @@ -""" -Async example demonstrating how to use the Scrape API with the scrapegraph-py SDK. - -This example shows how to: -1. Set up the async client for Scrape -2. Make async API calls to get HTML content from websites -3. Handle responses and save HTML content -4. Demonstrate both regular and heavy JS rendering modes -5.
Process multiple websites concurrently - -Requirements: -- Python 3.9+ -- scrapegraph-py -- python-dotenv -- aiofiles (for async file operations) -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import asyncio -import json -import os -import time -from pathlib import Path -from typing import Optional - -from dotenv import load_dotenv - -from scrapegraph_py import AsyncClient - -# Load environment variables from .env file -load_dotenv() - - -async def scrape_website( - client: AsyncClient, - website_url: str, - render_heavy_js: bool = False, - headers: Optional[dict[str, str]] = None, -) -> dict: - """ - Get HTML content from a website using the async Scrape API. - - Args: - client: The async scrapegraph-py client instance - website_url: The URL of the website to get HTML from - render_heavy_js: Whether to render heavy JavaScript (defaults to False) - headers: Optional headers to send with the request - - Returns: - dict: A dictionary containing the HTML content and metadata - - Raises: - Exception: If the API request fails - """ - js_mode = "with heavy JS rendering" if render_heavy_js else "without JS rendering" - print(f"Getting HTML content from: {website_url}") - print(f"Mode: {js_mode}") - - start_time = time.time() - - try: - result = await client.scrape( - website_url=website_url, - render_heavy_js=render_heavy_js, - headers=headers, - ) - execution_time = time.time() - start_time - print(f"Execution time: {execution_time:.2f} seconds") - return result - except Exception as e: - print(f"Error: {str(e)}") - raise - - -async def save_html_content( - html_content: str, filename: str, output_dir: str = "async_scrape_output" -): - """ - Save HTML content to a file asynchronously.
- - Args: - html_content: The HTML content to save - filename: The name of the file (without extension) - output_dir: The directory to save the file in - """ - # Create output directory if it doesn't exist - output_path = Path(output_dir) - output_path.mkdir(exist_ok=True) - - # Save HTML file - html_file = output_path / f"{filename}.html" - - # Use asyncio to run file I/O in a thread pool - await asyncio.to_thread( - lambda: html_file.write_text(html_content, encoding="utf-8") - ) - - print(f"HTML content saved to: {html_file}") - return html_file - - -def analyze_html_content(html_content: str) -> dict: - """ - Analyze HTML content and provide basic statistics. - - Args: - html_content: The HTML content to analyze - - Returns: - dict: Basic statistics about the HTML content - """ - stats = { - "total_length": len(html_content), - "lines": len(html_content.splitlines()), - "has_doctype": html_content.strip().startswith("<!DOCTYPE"), - "has_html_tag": "<html" in html_content.lower(), - "has_head_tag": "<head" in html_content.lower(), - "has_body_tag": "<body" in html_content.lower(), - "script_tags": html_content.lower().count("<script"), - "style_tags": html_content.lower().count("<style"), - "div_tags": html_content.lower().count("<div"), - "p_tags": html_content.lower().count("<p"), - "img_tags": html_content.lower().count("<img"), - "link_tags": html_content.lower().count("<link"), - } - - return stats - - -async def process_website(client: AsyncClient, website: dict) -> dict: - """ - Process a single website and return results. - - Args: - client: The async client instance - website: Website configuration dictionary - - Returns: - dict: Processing results - """ - print(f"\nProcessing: {website['description']}") - print("-" * 40) - - try: - # Get HTML content - result = await scrape_website( - client=client, - website_url=website["url"], - render_heavy_js=website["render_heavy_js"], - ) - - # Display response metadata - print(f"Request ID: {result.get('scrape_request_id', 'N/A')}") - print(f"Status: {result.get('status', 'N/A')}") - print(f"Error: {result.get('error', 'None')}") - - # Analyze HTML content - html_content = result.get("html", "") - if html_content: - stats = analyze_html_content(html_content) - print(f"\nHTML Content Analysis:") - print(f" Total length: {stats['total_length']:,} characters") - print(f" Lines: {stats['lines']:,}") - print(f" Has DOCTYPE: {stats['has_doctype']}") - print(f" Has HTML tag: {stats['has_html_tag']}") - print(f" Has Head tag: {stats['has_head_tag']}") - print(f" Has Body
tag: {stats['has_body_tag']}") - print(f" Script tags: {stats['script_tags']}") - print(f" Style tags: {stats['style_tags']}") - print(f" Div tags: {stats['div_tags']}") - print(f" Paragraph tags: {stats['p_tags']}") - print(f" Image tags: {stats['img_tags']}") - print(f" Link tags: {stats['link_tags']}") - - # Save HTML content - filename = f"{website['name']}_{'js' if website['render_heavy_js'] else 'nojs'}" - saved_file = await save_html_content(html_content, filename) - - # Show first 500 characters as preview - preview = html_content[:500].replace("\n", " ").strip() - print(f"\nHTML Preview (first 500 chars):") - print(f" {preview}...") - - return { - "success": True, - "website": website["url"], - "saved_file": str(saved_file), - "stats": stats, - "preview": preview - } - else: - print("No HTML content received") - return { - "success": False, - "website": website["url"], - "error": "No HTML content received" - } - - except Exception as e: - print(f"Error processing {website['url']}: {str(e)}") - return { - "success": False, - "website": website["url"], - "error": str(e) - } - - -async def main(): - """ - Main async function demonstrating Scrape API usage. 
- """ - # Example websites to test - test_websites = [ - { - "url": "https://example.com", - "name": "example", - "render_heavy_js": False, - "description": "Simple static website", - }, - { - "url": "https://httpbin.org/html", - "name": "httpbin_html", - "render_heavy_js": False, - "description": "HTTP testing service", - }, - ] - - print("Async Scrape API Example with scrapegraph-py SDK") - print("=" * 60) - - # Initialize the async client - try: - async with AsyncClient.from_env() as client: - print("✅ Async client initialized successfully") - - # Process websites concurrently - print(f"\n🚀 Processing {len(test_websites)} websites concurrently...") - - tasks = [ - process_website(client, website) - for website in test_websites - ] - - results = await asyncio.gather(*tasks, return_exceptions=True) - - # Display summary - print(f"\n📊 Processing Summary") - print("=" * 40) - - successful = 0 - for result in results: - if isinstance(result, Exception): - print(f"❌ Exception occurred: {result}") - elif result["success"]: - successful += 1 - print(f"✅ {result['website']}: {result['saved_file']}") - else: - print(f"❌ {result['website']}: {result.get('error', 'Unknown error')}") - - print(f"\n🎯 Results: {successful}/{len(test_websites)} websites processed successfully") - - except Exception as e: - print(f"❌ Failed to initialize async client: {str(e)}") - print("Make sure you have SGAI_API_KEY in your .env file") - return - - print("\n✅ Async processing completed") - - -if __name__ == "__main__": - # Run the async main function - asyncio.run(main()) diff --git a/scrapegraph-py/examples/utilities/get_credits_example.py b/scrapegraph-py/examples/utilities/get_credits_example.py deleted file mode 100644 index 6ef9e2f6..00000000 --- a/scrapegraph-py/examples/utilities/get_credits_example.py +++ /dev/null @@ -1,13 +0,0 @@ -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_level("DEBUG") - -# Initialize
the client -sgai_client = Client(api_key="your-api-key-here") - -# Check remaining credits -credits = sgai_client.get_credits() -print(f"Credits Info: {credits}") - -sgai_client.close() diff --git a/scrapegraph-py/examples/utilities/healthz_async_example.py b/scrapegraph-py/examples/utilities/healthz_async_example.py deleted file mode 100644 index b8f92429..00000000 --- a/scrapegraph-py/examples/utilities/healthz_async_example.py +++ /dev/null @@ -1,174 +0,0 @@ -""" -Health Check Example - Asynchronous - -This example demonstrates how to use the health check endpoint asynchronously -to monitor the ScrapeGraphAI API service status. This is particularly useful for: -- Async production monitoring and alerting -- Health checks in async web frameworks (FastAPI, Sanic, aiohttp) -- Concurrent health monitoring of multiple services -- Integration with async monitoring tools - -The health check endpoint (/healthz) provides a quick way to verify that -the API service is operational and ready to handle requests. -""" - -import asyncio -from scrapegraph_py import AsyncClient - - -async def main(): - """ - Demonstrates the async health check functionality with the ScrapeGraphAI API. - - The healthz endpoint returns status information about the service, - which can be used for monitoring and alerting purposes. 
- """ - # Initialize the async client from environment variables - # Ensure SGAI_API_KEY is set in your environment - async with AsyncClient.from_env() as client: - try: - print("🏥 Checking ScrapeGraphAI API health status (async)...") - print("-" * 50) - - # Perform health check - health_status = await client.healthz() - - # Display results - print("\n✅ Health Check Response:") - print(f"Status: {health_status.get('status', 'unknown')}") - - if 'message' in health_status: - print(f"Message: {health_status['message']}") - - # Additional fields that might be returned - for key, value in health_status.items(): - if key not in ['status', 'message']: - print(f"{key.capitalize()}: {value}") - - print("\n" + "-" * 50) - print("✨ Health check completed successfully!") - - # Example: Use in a monitoring context - if health_status.get('status') == 'healthy': - print("\n✓ Service is healthy and ready to accept requests") - else: - print("\n⚠️ Service may be experiencing issues") - - except Exception as e: - print(f"\n❌ Health check failed: {e}") - print("The service may be unavailable or experiencing issues") - - -async def monitoring_example(): - """ - Example of using health check in an async monitoring/alerting context. - - This function demonstrates how you might integrate the health check - into an async monitoring system or scheduled health check script. - """ - async with AsyncClient.from_env() as client: - try: - health_status = await client.healthz() - - # Simple health check logic - is_healthy = health_status.get('status') == 'healthy' - - if is_healthy: - print("✓ Health check passed") - return 0 # Success exit code - else: - print("✗ Health check failed") - return 1 # Failure exit code - - except Exception as e: - print(f"✗ Health check error: {e}") - return 2 # Error exit code - - -async def concurrent_health_checks(): - """ - Example of performing concurrent health checks.
- - This demonstrates how you can efficiently check the health status - multiple times or monitor multiple aspects concurrently. - """ - async with AsyncClient.from_env() as client: - print("🏥 Performing concurrent health checks...") - - # Perform multiple health checks concurrently - results = await asyncio.gather( - client.healthz(), - client.healthz(), - client.healthz(), - return_exceptions=True - ) - - # Analyze results - successful_checks = sum( - 1 for r in results - if isinstance(r, dict) and r.get('status') == 'healthy' - ) - - print(f"\n✓ Successful health checks: {successful_checks}/{len(results)}") - - if successful_checks == len(results): - print("✓ All health checks passed - service is stable") - elif successful_checks > 0: - print("⚠️ Some health checks failed - service may be unstable") - else: - print("✗ All health checks failed - service is down") - - -async def fastapi_health_endpoint_example(): - """ - Example of how to integrate the health check into a FastAPI endpoint. - - This demonstrates a pattern for creating a health check endpoint - in your own FastAPI application that checks the ScrapeGraphAI API. 
- """ - # This is a demonstration of the pattern, not a runnable endpoint - print("\n๐Ÿ“ FastAPI Integration Pattern:") - print("-" * 50) - print(""" -from fastapi import FastAPI, HTTPException -from scrapegraph_py import AsyncClient - -app = FastAPI() - -@app.get("/health") -async def health_check(): - '''Health check endpoint that verifies ScrapeGraphAI API status''' - try: - async with AsyncClient.from_env() as client: - health = await client.healthz() - - if health.get('status') == 'healthy': - return { - "status": "healthy", - "scrape_graph_api": "operational" - } - else: - raise HTTPException( - status_code=503, - detail="ScrapeGraphAI API is unhealthy" - ) - except Exception as e: - raise HTTPException( - status_code=503, - detail=f"Health check failed: {str(e)}" - ) - """) - print("-" * 50) - - -if __name__ == "__main__": - # Run the main health check example - asyncio.run(main()) - - # Uncomment to run other examples - # exit_code = asyncio.run(monitoring_example()) - # exit(exit_code) - - # asyncio.run(concurrent_health_checks()) - # asyncio.run(fastapi_health_endpoint_example()) - diff --git a/scrapegraph-py/examples/utilities/healthz_example.py b/scrapegraph-py/examples/utilities/healthz_example.py deleted file mode 100644 index 3362a1e9..00000000 --- a/scrapegraph-py/examples/utilities/healthz_example.py +++ /dev/null @@ -1,102 +0,0 @@ -""" -Health Check Example - Synchronous - -This example demonstrates how to use the health check endpoint to monitor -the ScrapeGraphAI API service status. This is particularly useful for: -- Production monitoring and alerting -- Health checks in containerized environments (Kubernetes, Docker) -- Ensuring service availability before making API calls -- Integration with monitoring tools (Prometheus, Datadog, etc.) - -The health check endpoint (/healthz) provides a quick way to verify that -the API service is operational and ready to handle requests. 
-""" - -from scrapegraph_py import Client - -def main(): - """ - Demonstrates the health check functionality with the ScrapeGraphAI API. - - The healthz endpoint returns status information about the service, - which can be used for monitoring and alerting purposes. - """ - # Initialize the client from environment variables - # Ensure SGAI_API_KEY is set in your environment - client = Client.from_env() - - try: - print("🏥 Checking ScrapeGraphAI API health status...") - print("-" * 50) - - # Perform health check - health_status = client.healthz() - - # Display results - print("\n✅ Health Check Response:") - print(f"Status: {health_status.get('status', 'unknown')}") - - if 'message' in health_status: - print(f"Message: {health_status['message']}") - - # Additional fields that might be returned - for key, value in health_status.items(): - if key not in ['status', 'message']: - print(f"{key.capitalize()}: {value}") - - print("\n" + "-" * 50) - print("✨ Health check completed successfully!") - - # Example: Use in a monitoring context - if health_status.get('status') == 'healthy': - print("\n✓ Service is healthy and ready to accept requests") - else: - print("\n⚠️ Service may be experiencing issues") - - except Exception as e: - print(f"\n❌ Health check failed: {e}") - print("The service may be unavailable or experiencing issues") - - finally: - # Clean up - client.close() - - -def monitoring_example(): - """ - Example of using health check in a monitoring/alerting context. - - This function demonstrates how you might integrate the health check - into a monitoring system or scheduled health check script. 
- """ - client = Client.from_env() - - try: - health_status = client.healthz() - - # Simple health check logic - is_healthy = health_status.get('status') == 'healthy' - - if is_healthy: - print("✓ Health check passed") - return 0 # Success exit code - else: - print("✗ Health check failed") - return 1 # Failure exit code - - except Exception as e: - print(f"✗ Health check error: {e}") - return 2 # Error exit code - - finally: - client.close() - - -if __name__ == "__main__": - # Run the main health check example - main() - - # Uncomment to run monitoring example - # exit_code = monitoring_example() - # exit(exit_code) - diff --git a/scrapegraph-py/examples/utilities/optional_headers_example.py b/scrapegraph-py/examples/utilities/optional_headers_example.py deleted file mode 100644 index 7763f8fb..00000000 --- a/scrapegraph-py/examples/utilities/optional_headers_example.py +++ /dev/null @@ -1,28 +0,0 @@ -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") - -# Initialize the client with explicit API key -sgai_client = Client(api_key="your-api-key-here") - -# SmartScraper request -response = sgai_client.smartscraper( - website_url="https://example.com", - user_prompt="Extract the main heading, description, and summary of the webpage", - headers={ - "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", - "Accept-Language": "en-US,en;q=0.9", - "Accept-Encoding": "gzip, deflate, br", - "Connection": "keep-alive", - "Upgrade-Insecure-Requests": "1", - }, -) - - -# Print the response -print(f"Request ID: {response['request_id']}") -print(f"Result: {response['result']}") - -sgai_client.close() diff --git a/scrapegraph-py/examples/utilities/scrape_direct_api_example.py b/scrapegraph-py/examples/utilities/scrape_direct_api_example.py 
deleted file mode 100644 index ba2c9a82..00000000 --- a/scrapegraph-py/examples/utilities/scrape_direct_api_example.py +++ /dev/null @@ -1,342 +0,0 @@ -""" -Direct API example showing how to use the Scrape API endpoint directly. - -This example demonstrates: -1. Direct API calls using requests library (equivalent to curl) -2. How to construct the API request manually -3. Comparison with the scrapegraph-py SDK -4. Error handling for direct API calls -5. The exact curl commands for each request - -Curl command examples: -# Basic scrape request -curl -X POST https://api.scrapegraphai.com/v1/scrape \ - -H "Content-Type: application/json" \ - -H "SGAI-APIKEY: sgai-e32215fb-5940-400f-91ea-30af5f35e0c9" \ - -d '{ - "website_url": "https://example.com", - "render_heavy_js": false - }' - -# With heavy JavaScript rendering -curl -X POST https://api.scrapegraphai.com/v1/scrape \ - -H "Content-Type: application/json" \ - -H "SGAI-APIKEY: sgai-e32215fb-5940-400f-91ea-30af5f35e0c9" \ - -d '{ - "website_url": "https://example.com", - "render_heavy_js": true - }' - -Requirements: -- Python 3.7+ -- requests -- scrapegraph-py -- python-dotenv -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import json -import time -from typing import Dict, Any, Optional - -import requests -from dotenv import load_dotenv -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -class DirectScrapeAPI: - """ - Direct API client for the Scrape endpoint (without using scrapegraph-py SDK). - This demonstrates how to make raw API calls equivalent to curl commands. - """ - - def __init__(self, api_key: str, base_url: str = "https://api.scrapegraphai.com/v1"): - """ - Initialize the direct API client. 
- - Args: - api_key: Your ScrapeGraph AI API key - base_url: Base URL for the API - """ - self.api_key = api_key - self.base_url = base_url - self.headers = { - "Content-Type": "application/json", - "SGAI-APIKEY": api_key - } - - def scrape( - self, - website_url: str, - render_heavy_js: bool = False, - headers: Optional[Dict[str, str]] = None - ) -> Dict[str, Any]: - """ - Make a direct scrape API request. - - Args: - website_url: The URL to scrape - render_heavy_js: Whether to render heavy JavaScript - headers: Optional headers to send with the scraping request - - Returns: - API response as dictionary - - Raises: - requests.RequestException: If the API request fails - """ - url = f"{self.base_url}/scrape" - - payload = { - "website_url": website_url, - "render_heavy_js": render_heavy_js - } - - if headers: - payload["headers"] = headers - - print(f"๐ŸŒ Making direct API request to: {url}") - print(f"๐Ÿ“‹ Payload: {json.dumps(payload, indent=2)}") - - try: - response = requests.post( - url, - json=payload, - headers=self.headers, - timeout=30 - ) - - print(f"๐Ÿ“ฅ Response Status: {response.status_code}") - - # Handle different response status codes - if response.status_code == 200: - result = response.json() - print(f"โœ… Request successful") - return result - elif response.status_code == 400: - error_data = response.json() - raise requests.RequestException(f"Bad Request: {error_data.get('error', 'Unknown error')}") - elif response.status_code == 401: - raise requests.RequestException("Unauthorized: Check your API key") - elif response.status_code == 429: - raise requests.RequestException("Rate limit exceeded") - elif response.status_code == 500: - raise requests.RequestException("Internal server error") - else: - raise requests.RequestException(f"Unexpected status code: {response.status_code}") - - except requests.Timeout: - raise requests.RequestException("Request timeout - API took too long to respond") - except requests.ConnectionError: - raise 
requests.RequestException("Connection error - unable to reach API") - except json.JSONDecodeError: - raise requests.RequestException("Invalid JSON response from API") - - -def demonstrate_curl_commands(): - """ - Display the equivalent curl commands for the API requests. - """ - print("๐ŸŒ EQUIVALENT CURL COMMANDS") - print("=" * 50) - - print("1๏ธโƒฃ Basic scrape request (render_heavy_js=false):") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": false") - print(" }'") - - print("\n2๏ธโƒฃ Heavy JS rendering (render_heavy_js=true):") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": true") - print(" }'") - - print("\n3๏ธโƒฃ With custom headers:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": false,") - print(" \"headers\": {") - print(" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\",") - print(" \"Accept-Language\": \"en-US,en;q=0.9\",") - print(" \"Cookie\": \"session=abc123; preferences=dark_mode\"") - print(" }") - print(" }'") - - print("\n4๏ธโƒฃ Real example with actual API key format:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: sgai-e32215fb-5940-400f-91ea-30af5f35e0c9\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - 
print(" \"render_heavy_js\": false") - print(" }'") - - -def compare_direct_vs_sdk(api_key: str, website_url: str): - """ - Compare direct API calls vs SDK usage. - - Args: - api_key: API key for authentication - website_url: URL to scrape - """ - print(f"\n๐Ÿ”„ COMPARISON: Direct API vs SDK") - print("=" * 40) - - # Test with direct API - print("\n1๏ธโƒฃ Using Direct API (equivalent to curl):") - try: - direct_client = DirectScrapeAPI(api_key) - start_time = time.time() - direct_result = direct_client.scrape(website_url, render_heavy_js=False) - direct_time = time.time() - start_time - - direct_html = direct_result.get("html", "") - print(f"โœ… Direct API completed in {direct_time:.2f}s") - print(f"๐Ÿ“ HTML length: {len(direct_html):,} characters") - print(f"๐Ÿ“‹ Response keys: {list(direct_result.keys())}") - - except Exception as e: - print(f"โŒ Direct API failed: {str(e)}") - direct_result = None - direct_time = 0 - - # Test with SDK - print("\n2๏ธโƒฃ Using scrapegraph-py SDK:") - try: - sdk_client = Client(api_key=api_key) - start_time = time.time() - sdk_result = sdk_client.scrape(website_url, render_heavy_js=False) - sdk_time = time.time() - start_time - - sdk_html = sdk_result.get("html", "") - print(f"โœ… SDK completed in {sdk_time:.2f}s") - print(f"๐Ÿ“ HTML length: {len(sdk_html):,} characters") - print(f"๐Ÿ“‹ Response keys: {list(sdk_result.keys())}") - - sdk_client.close() - - except Exception as e: - print(f"โŒ SDK failed: {str(e)}") - sdk_result = None - sdk_time = 0 - - # Compare results - if direct_result and sdk_result: - print(f"\n๐Ÿ“Š Comparison Results:") - print(f" Time difference: {abs(direct_time - sdk_time):.2f}s") - print(f" HTML length difference: {abs(len(direct_html) - len(sdk_html)):,} chars") - print(f" Results identical: {direct_result == sdk_result}") - - print(f"\n๐Ÿ’ก Conclusions:") - print(f" โ€ข Both methods produce identical results") - print(f" โ€ข SDK provides better error handling and validation") - print(f" โ€ข Direct 
API gives you full control over requests") - print(f" • Choose SDK for ease of use, direct API for custom integrations") - - -def demonstrate_error_handling(api_key: str): - """ - Demonstrate error handling for direct API calls. - - Args: - api_key: API key for authentication - """ - print(f"\n🚨 ERROR HANDLING DEMONSTRATION") - print("=" * 40) - - direct_client = DirectScrapeAPI(api_key) - - # Test cases for different errors - error_tests = [ - { - "name": "Invalid URL", - "url": "not-a-valid-url", - "expected": "ValidationError" - }, - { - "name": "Empty URL", - "url": "", - "expected": "ValidationError" - }, - { - "name": "Non-existent domain", - "url": "https://this-domain-definitely-does-not-exist-12345.com", - "expected": "Connection/Timeout Error" - } - ] - - for test in error_tests: - print(f"\n🧪 Testing: {test['name']}") - print(f" URL: {test['url']}") - print(f" Expected: {test['expected']}") - - try: - result = direct_client.scrape(test["url"]) - print(f" ⚠️ Unexpected success: {result.get('status', 'Unknown')}") - except Exception as e: - print(f" ✅ Expected error caught: {str(e)}") - - -def main(): - """ - Main function demonstrating direct API usage. 
- """ - print("🚀 Scrape API: Direct API Usage Example") - print("=" * 50) - - # Show curl command equivalents - demonstrate_curl_commands() - - # Get API key from environment - import os - api_key = os.getenv("SGAI_API_KEY") - if not api_key: - print("\n❌ Error: SGAI_API_KEY not found in environment variables") - print("Please add your API key to your .env file:") - print("SGAI_API_KEY=your-api-key-here") - return - - print(f"\n✅ API key loaded from environment") - - # Test website - test_url = "https://example.com" - - # Compare direct API vs SDK - compare_direct_vs_sdk(api_key, test_url) - - # Demonstrate error handling - demonstrate_error_handling(api_key) - - print(f"\n🎯 SUMMARY") - print("=" * 20) - print("✅ Direct API calls work identically to curl commands") - print("✅ SDK provides additional convenience and error handling") - print("✅ Both approaches produce the same results") - print("✅ Choose based on your integration needs") - - print(f"\n📚 Next Steps:") - print("• Try the curl commands in your terminal") - print("• Experiment with different render_heavy_js settings") - print("• Test with your own websites") - print("• Consider using the SDK for production applications") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/utilities/scrape_example.py b/scrapegraph-py/examples/utilities/scrape_example.py deleted file mode 100644 index 552d79f2..00000000 --- a/scrapegraph-py/examples/utilities/scrape_example.py +++ /dev/null @@ -1,217 +0,0 @@ -""" -Example demonstrating how to use the Scrape API with the scrapegraph-py SDK. - -This example shows how to: -1. Set up the client for Scrape -2. Make the API call to get HTML content from a website -3. Handle the response and save the HTML content -4. Demonstrate both regular and heavy JS rendering modes -5. 
Display the results and metadata - -Requirements: -- Python 3.7+ -- scrapegraph-py -- python-dotenv -- A .env file with your SGAI_API_KEY - -Example .env file: -SGAI_API_KEY=your_api_key_here -""" - -import json -import os -import time -from pathlib import Path -from typing import Optional - -from dotenv import load_dotenv - -from scrapegraph_py import Client - -# Load environment variables from .env file -load_dotenv() - - -def scrape_website( - client: Client, - website_url: str, - render_heavy_js: bool = False, - headers: Optional[dict[str, str]] = None, -) -> dict: - """ - Get HTML content from a website using the Scrape API. - - Args: - client: The scrapegraph-py client instance - website_url: The URL of the website to get HTML from - render_heavy_js: Whether to render heavy JavaScript (defaults to False) - headers: Optional headers to send with the request - - Returns: - dict: A dictionary containing the HTML content and metadata - - Raises: - Exception: If the API request fails - """ - js_mode = "with heavy JS rendering" if render_heavy_js else "without JS rendering" - print(f"Getting HTML content from: {website_url}") - print(f"Mode: {js_mode}") - - start_time = time.time() - - try: - result = client.scrape( - website_url=website_url, - render_heavy_js=render_heavy_js, - headers=headers, - ) - execution_time = time.time() - start_time - print(f"Execution time: {execution_time:.2f} seconds") - return result - except Exception as e: - print(f"Error: {str(e)}") - raise - - -def save_html_content( - html_content: str, filename: str, output_dir: str = "scrape_output" -): - """ - Save HTML content to a file. 
- - Args: - html_content: The HTML content to save - filename: The name of the file (without extension) - output_dir: The directory to save the file in - """ - # Create output directory if it doesn't exist - output_path = Path(output_dir) - output_path.mkdir(exist_ok=True) - - # Save HTML file - html_file = output_path / f"{filename}.html" - with open(html_file, "w", encoding="utf-8") as f: - f.write(html_content) - - print(f"HTML content saved to: {html_file}") - return html_file - - -def analyze_html_content(html_content: str) -> dict: - """ - Analyze HTML content and provide basic statistics. - - Args: - html_content: The HTML content to analyze - - Returns: - dict: Basic statistics about the HTML content - """ - stats = { - "total_length": len(html_content), - "lines": len(html_content.splitlines()), - "has_doctype": html_content.strip().startswith(" Dict[str, Any]: - """ - Compare scraping results with and without heavy JS rendering. - - Args: - client: The scrapegraph-py client instance - website_url: The URL to scrape - headers: Optional headers to send with the request - - Returns: - Dict containing comparison results - """ - print(f"๐ŸŒ Scraping {website_url} with comparison...") - print("=" * 60) - - results = {} - - # Test without heavy JS rendering (default) - print("\n1๏ธโƒฃ Scraping WITHOUT heavy JS rendering...") - start_time = time.time() - - try: - result_no_js = client.scrape( - website_url=website_url, - render_heavy_js=False, - headers=headers - ) - no_js_time = time.time() - start_time - - html_no_js = result_no_js.get("html", "") - results["no_js"] = { - "success": True, - "html_length": len(html_no_js), - "execution_time": no_js_time, - "html_content": html_no_js, - "result": result_no_js - } - - print(f"โœ… Completed in {no_js_time:.2f} seconds") - print(f"๐Ÿ“ HTML length: {len(html_no_js):,} characters") - - except Exception as e: - results["no_js"] = { - "success": False, - "error": str(e), - "execution_time": time.time() - start_time - 
} - print(f"โŒ Failed: {str(e)}") - - # Test with heavy JS rendering - print("\n2๏ธโƒฃ Scraping WITH heavy JS rendering...") - start_time = time.time() - - try: - result_with_js = client.scrape( - website_url=website_url, - render_heavy_js=True, - headers=headers - ) - with_js_time = time.time() - start_time - - html_with_js = result_with_js.get("html", "") - results["with_js"] = { - "success": True, - "html_length": len(html_with_js), - "execution_time": with_js_time, - "html_content": html_with_js, - "result": result_with_js - } - - print(f"โœ… Completed in {with_js_time:.2f} seconds") - print(f"๐Ÿ“ HTML length: {len(html_with_js):,} characters") - - except Exception as e: - results["with_js"] = { - "success": False, - "error": str(e), - "execution_time": time.time() - start_time - } - print(f"โŒ Failed: {str(e)}") - - return results - - -def analyze_differences(results: Dict[str, Any]) -> Dict[str, Any]: - """ - Analyze the differences between JS and non-JS rendering results. - - Args: - results: Results from scrape_with_comparison - - Returns: - Analysis results - """ - print("\n๐Ÿ” ANALYSIS: Comparing Results") - print("=" * 40) - - analysis = {} - - if results["no_js"]["success"] and results["with_js"]["success"]: - no_js_html = results["no_js"]["html_content"] - with_js_html = results["with_js"]["html_content"] - - # Length comparison - length_diff = results["with_js"]["html_length"] - results["no_js"]["html_length"] - length_percent = (length_diff / results["no_js"]["html_length"]) * 100 if results["no_js"]["html_length"] > 0 else 0 - - # Time comparison - time_diff = results["with_js"]["execution_time"] - results["no_js"]["execution_time"] - time_percent = (time_diff / results["no_js"]["execution_time"]) * 100 if results["no_js"]["execution_time"] > 0 else 0 - - # Content analysis - no_js_scripts = no_js_html.lower().count(" 1000: - print(" โœ… Heavy JS rendering captured significantly more content") - print(" โœ… Use render_heavy_js=True for this 
website") - elif length_diff > 0: - print(" โš ๏ธ Heavy JS rendering captured some additional content") - print(" โš ๏ธ Consider using render_heavy_js=True if you need dynamic content") - else: - print(" โ„น๏ธ No significant difference in content") - print(" โ„น๏ธ render_heavy_js=False is sufficient for this website") - - if time_diff > 5: - print(" โš ๏ธ Heavy JS rendering is significantly slower") - print(" โš ๏ธ Consider cost vs. benefit for your use case") - - else: - print("โŒ Cannot compare - one or both requests failed") - if not results["no_js"]["success"]: - print(f" No JS error: {results['no_js'].get('error', 'Unknown')}") - if not results["with_js"]["success"]: - print(f" With JS error: {results['with_js'].get('error', 'Unknown')}") - - return analysis - - -def save_comparison_results(results: Dict[str, Any], analysis: Dict[str, Any], website_url: str): - """ - Save the comparison results to files. - - Args: - results: Scraping results - analysis: Analysis results - website_url: The scraped website URL - """ - print(f"\n๐Ÿ’พ Saving comparison results...") - - # Create output directory - output_dir = Path("render_heavy_js_comparison") - output_dir.mkdir(exist_ok=True) - - # Save HTML files - if results["no_js"]["success"]: - no_js_file = output_dir / "scrape_no_js.html" - with open(no_js_file, "w", encoding="utf-8") as f: - f.write(results["no_js"]["html_content"]) - print(f"๐Ÿ“„ No JS HTML saved to: {no_js_file}") - - if results["with_js"]["success"]: - with_js_file = output_dir / "scrape_with_js.html" - with open(with_js_file, "w", encoding="utf-8") as f: - f.write(results["with_js"]["html_content"]) - print(f"๐Ÿ“„ With JS HTML saved to: {with_js_file}") - - # Save analysis report - report = { - "website_url": website_url, - "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"), - "results_summary": { - "no_js_success": results["no_js"]["success"], - "with_js_success": results["with_js"]["success"], - "no_js_html_length": 
results["no_js"].get("html_length", 0), - "with_js_html_length": results["with_js"].get("html_length", 0), - "no_js_time": results["no_js"].get("execution_time", 0), - "with_js_time": results["with_js"].get("execution_time", 0), - }, - "analysis": analysis - } - - report_file = output_dir / "comparison_report.json" - with open(report_file, "w", encoding="utf-8") as f: - json.dump(report, f, indent=2) - print(f"๐Ÿ“Š Analysis report saved to: {report_file}") - - -def demonstrate_curl_equivalent(): - """ - Show the curl command equivalent for the scrape API calls. - """ - print(f"\n๐ŸŒ CURL COMMAND EQUIVALENTS") - print("=" * 50) - - print("1๏ธโƒฃ Scrape WITHOUT heavy JS rendering:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": false") - print(" }'") - - print("\n2๏ธโƒฃ Scrape WITH heavy JS rendering:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": true") - print(" }'") - - print("\n3๏ธโƒฃ With custom headers:") - print("curl -X POST https://api.scrapegraphai.com/v1/scrape \\") - print(" -H \"Content-Type: application/json\" \\") - print(" -H \"SGAI-APIKEY: your-api-key-here\" \\") - print(" -d '{") - print(" \"website_url\": \"https://example.com\",") - print(" \"render_heavy_js\": true,") - print(" \"headers\": {") - print(" \"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36\",") - print(" \"Accept-Language\": \"en-US,en;q=0.9\"") - print(" }") - print(" }'") - - -def main(): - """ - Main function demonstrating render_heavy_js functionality. 
- """ - print("๐Ÿš€ Scrape API: render_heavy_js Comparison Example") - print("=" * 60) - - # Test websites - mix of static and dynamic content - test_websites = [ - { - "url": "https://example.com", - "name": "Example.com (Static)", - "description": "Simple static website - minimal JS" - }, - { - "url": "https://httpbin.org/html", - "name": "HTTPBin HTML", - "description": "HTTP testing service - static HTML" - } - ] - - # Show curl equivalents first - demonstrate_curl_equivalent() - - # Initialize client - try: - client = Client.from_env() - print(f"\nโœ… Client initialized successfully") - except Exception as e: - print(f"โŒ Failed to initialize client: {str(e)}") - print("Make sure you have SGAI_API_KEY in your .env file") - return - - # Test each website - for website in test_websites: - print(f"\n{'='*80}") - print(f"๐Ÿงช TESTING: {website['name']}") - print(f"๐Ÿ“ Description: {website['description']}") - print(f"๐Ÿ”— URL: {website['url']}") - print(f"{'='*80}") - - try: - # Perform comparison - results = scrape_with_comparison(client, website["url"]) - - # Analyze differences - analysis = analyze_differences(results) - - # Save results - save_comparison_results(results, analysis, website["url"]) - - except Exception as e: - print(f"โŒ Error testing {website['url']}: {str(e)}") - - # Close client - client.close() - print(f"\n๐Ÿ”’ Client closed successfully") - - # Final recommendations - print(f"\n๐Ÿ’ก GENERAL RECOMMENDATIONS") - print("=" * 30) - print("๐Ÿ”น Use render_heavy_js=False (default) for:") - print(" โ€ข Static websites") - print(" โ€ข Simple content sites") - print(" โ€ข When speed is priority") - print(" โ€ข When cost optimization is important") - - print("\n๐Ÿ”น Use render_heavy_js=True for:") - print(" โ€ข Single Page Applications (SPAs)") - print(" โ€ข React/Vue/Angular websites") - print(" โ€ข Sites with dynamic content loading") - print(" โ€ข When you need JavaScript-rendered content") - - print("\n๐Ÿ”น Cost considerations:") - print(" โ€ข 
render_heavy_js=True takes longer and uses more resources") - print(" โ€ข Test both options to determine if the extra content is worth it") - print(" โ€ข Consider caching results for frequently accessed pages") - - -if __name__ == "__main__": - main() diff --git a/scrapegraph-py/examples/utilities/send_feedback_example.py b/scrapegraph-py/examples/utilities/send_feedback_example.py deleted file mode 100644 index 4c397ed3..00000000 --- a/scrapegraph-py/examples/utilities/send_feedback_example.py +++ /dev/null @@ -1,28 +0,0 @@ -from scrapegraph_py import Client -from scrapegraph_py.logger import sgai_logger - -sgai_logger.set_logging(level="INFO") - -# Initialize the client -sgai_client = Client(api_key="your-api-key-here") - -# Example request_id (replace with an actual request_id from a previous request) -request_id = "your-request-id-here" - -# Check remaining credits -credits = sgai_client.get_credits() -print(f"Credits Info: {credits}") - -# Submit feedback for a previous request -feedback_response = sgai_client.submit_feedback( - request_id=request_id, - rating=5, # Rating from 1-5 - feedback_text="The extraction was accurate and exactly what I needed!", -) -print(f"\nFeedback Response: {feedback_response}") - -# Get previous results using get_smartscraper -previous_result = sgai_client.get_smartscraper(request_id=request_id) -print(f"\nRetrieved Previous Result: {previous_result}") - -sgai_client.close() diff --git a/scrapegraph-py/pyproject.toml b/scrapegraph-py/pyproject.toml deleted file mode 100644 index 95ec2dbc..00000000 --- a/scrapegraph-py/pyproject.toml +++ /dev/null @@ -1,108 +0,0 @@ -[project] -name = "scrapegraph_py" -version = "1.12.2" -description = "ScrapeGraph Python SDK for API" -authors = [ - { name = "Marco Vinciguerra", email = "marco@scrapegraphai.com" }, - { name = "Lorenzo Padoan", email = "lorenzo@scrapegraphai.com" } -] - - -license = "MIT" -readme = "README.md" -homepage = "https://scrapegraphai.com/" -repository = 
"https://github.com/ScrapeGraphAI/scrapegraph-sdk" -documentation = "https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py#readme" -keywords = [ - "ai", - "artificial intelligence", - "gpt", - "machine learning", - "nlp", - "natural language processing", - "openai", - "scraping", - "web scraping tool", - "webscraping", - "graph", - "sdk", - "api" - ] -classifiers = [ - "Intended Audience :: Developers", - "Topic :: Software Development :: Libraries :: Python Modules", - "Programming Language :: Python :: 3", - "Operating System :: OS Independent", -] -requires-python = ">=3.10,<4.0" - -dependencies = [ - "requests>=2.32.3", - "pydantic>=2.10.2", - "python-dotenv>=1.0.1", - "aiohttp>=3.10", - "requests>=2.32.3", - "beautifulsoup4>=4.12.3", - "toonify>=1.0.0", -] - -[project.optional-dependencies] -html = ["beautifulsoup4>=4.12.3"] -langchain = [ - "langchain>=0.3.0", - "langchain-community>=0.2.11", - "langchain-scrapegraph>=0.1.0", -] -docs = ["sphinx==6.0", "furo==2024.5.6"] - -[tool.uv] -managed = true -dev-dependencies = [ - "pytest>=7.4.0", - "pytest-mock==3.14.0", - "pylint>=3.2.5", - "pytest-asyncio>=0.23.8", - "aioresponses>=0.7.7", - "responses>=0.25.3", - "pytest-sugar>=1.0.0", - "pytest-cov>=6.0.0", - "black>=24.10.0", - "mypy>=1.13.0", - "ruff>=0.8.0", - "isort>=5.13.2", - "pre-commit>=4.0.1", - "types-setuptools>=75.6.0.20241126", - "mkdocs>=1.6.1", - "mkdocs-material>=9.5.46", - "mkdocstrings-python>=1.12.2", - "poethepoet>=0.31.1", - "twine>=6.1.0", -] - -[tool.black] -line-length = 88 -target-version = ["py310"] - -[tool.isort] -profile = "black" - -[tool.ruff] -line-length = 88 - -[tool.ruff.lint] -select = ["F", "E", "W", "C"] -ignore = ["E203", "E501", "C901"] # Ignore conflicts with Black and function complexity - -[tool.mypy] -python_version = "3.10" -strict = true -disallow_untyped_calls = true -ignore_missing_imports = true - -[build-system] -requires = ["hatchling==1.26.3"] -build-backend = "hatchling.build" - 
-[tool.poe.tasks] -pylint-local = "pylint scrapegraph_py/**/*.py" -pylint-ci = "pylint --disable=C0114,C0115,C0116,C901 --exit-zero scrapegraph_py/**/*.py" diff --git a/scrapegraph-py/pytest.ini b/scrapegraph-py/pytest.ini deleted file mode 100644 index d5f46ec0..00000000 --- a/scrapegraph-py/pytest.ini +++ /dev/null @@ -1,23 +0,0 @@ -[tool:pytest] -testpaths = tests -python_files = test_*.py -python_classes = Test* -python_functions = test_* -addopts = - -v - --tb=short - --strict-markers - --disable-warnings - --cov=scrapegraph_py - --cov-report=term-missing - --cov-report=html - --cov-report=xml - --cov-fail-under=80 -markers = - asyncio: marks tests as async (deselect with '-m "not asyncio"') - slow: marks tests as slow (deselect with '-m "not slow"') - integration: marks tests as integration tests - unit: marks tests as unit tests -filterwarnings = - ignore::DeprecationWarning - ignore::PendingDeprecationWarning diff --git a/scrapegraph-py/requirements-test.txt b/scrapegraph-py/requirements-test.txt deleted file mode 100644 index 7af9631b..00000000 --- a/scrapegraph-py/requirements-test.txt +++ /dev/null @@ -1,18 +0,0 @@ -# Testing dependencies -pytest>=7.0.0 -pytest-asyncio>=0.21.0 -pytest-cov>=4.0.0 -responses>=0.23.0 - -# Linting and code quality -flake8>=6.0.0 -black>=23.0.0 -isort>=5.12.0 -mypy>=1.0.0 - -# Security -bandit>=1.7.0 -safety>=2.3.0 - -# Coverage reporting -coverage>=7.0.0 diff --git a/scrapegraph-py/scrapegraph_py/__init__.py b/scrapegraph-py/scrapegraph_py/__init__.py deleted file mode 100644 index 588effe0..00000000 --- a/scrapegraph-py/scrapegraph_py/__init__.py +++ /dev/null @@ -1,97 +0,0 @@ -""" -ScrapeGraphAI Python SDK - -A comprehensive Python SDK for the ScrapeGraphAI API, providing both synchronous -and asynchronous clients for all API endpoints. 
- -Main Features: - - SmartScraper: AI-powered web scraping with structured data extraction - - SearchScraper: Web research across multiple sources - - Agentic Scraper: Automated browser interactions and form filling - - Crawl: Website crawling with AI extraction or markdown conversion - - Markdownify: Convert web pages to clean markdown - - Schema Generation: AI-assisted schema creation for data extraction - - Scheduled Jobs: Automate recurring scraping tasks - -Quick Start: - >>> from scrapegraph_py import Client - >>> - >>> # Initialize client from environment variables - >>> client = Client.from_env() - >>> - >>> # Basic scraping - >>> result = client.smartscraper( - ... website_url="https://example.com", - ... user_prompt="Extract all product information" - ... ) - >>> - >>> # With context manager - >>> with Client.from_env() as client: - ... result = client.scrape(website_url="https://example.com") - -Async Usage: - >>> import asyncio - >>> from scrapegraph_py import AsyncClient - >>> - >>> async def main(): - ... async with AsyncClient.from_env() as client: - ... result = await client.smartscraper( - ... website_url="https://example.com", - ... user_prompt="Extract products" - ... 
) - >>> - >>> asyncio.run(main()) - -For more information visit: https://scrapegraphai.com -Documentation: https://docs.scrapegraphai.com -""" - -from .async_client import AsyncClient -from .client import Client - -# Scrape Models -from .models.scrape import ( - ScrapeRequest, - GetScrapeRequest, -) - -# Scheduled Jobs Models -from .models.scheduled_jobs import ( - GetJobExecutionsRequest, - GetScheduledJobRequest, - GetScheduledJobsRequest, - JobActionRequest, - JobActionResponse, - JobExecutionListResponse, - JobExecutionResponse, - JobTriggerResponse, - ScheduledJobCreate, - ScheduledJobListResponse, - ScheduledJobResponse, - ScheduledJobUpdate, - ServiceType, - TriggerJobRequest, -) - -__all__ = [ - "Client", - "AsyncClient", - # Scrape Models - "ScrapeRequest", - "GetScrapeRequest", - # Scheduled Jobs Models - "ServiceType", - "ScheduledJobCreate", - "ScheduledJobUpdate", - "ScheduledJobResponse", - "ScheduledJobListResponse", - "JobExecutionResponse", - "JobExecutionListResponse", - "JobTriggerResponse", - "JobActionResponse", - "GetScheduledJobsRequest", - "GetScheduledJobRequest", - "GetJobExecutionsRequest", - "TriggerJobRequest", - "JobActionRequest", -] diff --git a/scrapegraph-py/scrapegraph_py/async_client.py b/scrapegraph-py/scrapegraph_py/async_client.py deleted file mode 100644 index 51118491..00000000 --- a/scrapegraph-py/scrapegraph_py/async_client.py +++ /dev/null @@ -1,1324 +0,0 @@ -""" -Asynchronous HTTP client for the ScrapeGraphAI API. - -This module provides an asynchronous client for interacting with all ScrapeGraphAI -API endpoints including smartscraper, searchscraper, crawl, agentic scraper, -markdownify, schema generation, scheduled jobs, and utility functions. 
- -The AsyncClient class supports: -- API key authentication -- SSL verification configuration -- Request timeout configuration -- Automatic retry logic with exponential backoff -- Mock mode for testing -- Async context manager support for proper resource cleanup -- Concurrent requests using asyncio - -Example: - Basic usage with environment variables: - >>> import asyncio - >>> from scrapegraph_py import AsyncClient - >>> async def main(): - ... client = AsyncClient.from_env() - ... result = await client.smartscraper( - ... website_url="https://example.com", - ... user_prompt="Extract product information" - ... ) - ... await client.close() - >>> asyncio.run(main()) - - Using async context manager: - >>> async def main(): - ... async with AsyncClient(api_key="sgai-...") as client: - ... result = await client.scrape(website_url="https://example.com") - >>> asyncio.run(main()) -""" -import asyncio -from typing import Any, Dict, Optional, Callable - -from aiohttp import ClientSession, ClientTimeout, TCPConnector -from aiohttp.client_exceptions import ClientError -from pydantic import BaseModel -from urllib.parse import urlparse -import uuid as _uuid - -from scrapegraph_py.config import API_BASE_URL, DEFAULT_HEADERS -from scrapegraph_py.exceptions import APIError -from scrapegraph_py.logger import sgai_logger as logger -from scrapegraph_py.models.agenticscraper import ( - AgenticScraperRequest, - GetAgenticScraperRequest, -) -from scrapegraph_py.models.crawl import CrawlRequest, GetCrawlRequest -from scrapegraph_py.models.feedback import FeedbackRequest -from scrapegraph_py.models.scrape import GetScrapeRequest, ScrapeRequest -from scrapegraph_py.models.markdownify import GetMarkdownifyRequest, MarkdownifyRequest -from scrapegraph_py.models.schema import ( - GenerateSchemaRequest, - GetSchemaStatusRequest, - SchemaGenerationResponse, -) -from scrapegraph_py.models.searchscraper import ( - GetSearchScraperRequest, - SearchScraperRequest, - TimeRange, -) -from 
scrapegraph_py.models.sitemap import SitemapRequest, SitemapResponse -from scrapegraph_py.models.smartscraper import ( - GetSmartScraperRequest, - SmartScraperRequest, -) -from scrapegraph_py.models.scheduled_jobs import ( - GetJobExecutionsRequest, - GetScheduledJobRequest, - GetScheduledJobsRequest, - JobActionRequest, - ScheduledJobCreate, - ScheduledJobUpdate, - TriggerJobRequest, -) -from scrapegraph_py.utils.helpers import handle_async_response, validate_api_key -from scrapegraph_py.utils.toon_converter import process_response_with_toon - - -class AsyncClient: - """ - Asynchronous client for the ScrapeGraphAI API. - - This class provides asynchronous methods for all ScrapeGraphAI API endpoints. - It handles authentication, request management, error handling, and supports - mock mode for testing. Uses aiohttp for efficient async HTTP requests. - - Attributes: - api_key (str): The API key for authentication - headers (dict): Default headers including API key - timeout (ClientTimeout): Request timeout configuration - max_retries (int): Maximum number of retry attempts - retry_delay (float): Base delay between retries in seconds - mock (bool): Whether mock mode is enabled - session (ClientSession): Aiohttp session for connection pooling - - Example: - >>> async def example(): - ... async with AsyncClient.from_env() as client: - ... result = await client.smartscraper( - ... website_url="https://example.com", - ... user_prompt="Extract all products" - ... ) - """ - @classmethod - def from_env( - cls, - verify_ssl: bool = True, - timeout: Optional[float] = None, - max_retries: int = 3, - retry_delay: float = 1.0, - mock: Optional[bool] = None, - mock_handler: Optional[Callable[[str, str, Dict[str, Any]], Any]] = None, - mock_responses: Optional[Dict[str, Any]] = None, - ): - """Initialize AsyncClient using API key from environment variable. - - Args: - verify_ssl: Whether to verify SSL certificates - timeout: Request timeout in seconds. 
None means no timeout (infinite) - max_retries: Maximum number of retry attempts - retry_delay: Delay between retries in seconds - """ - from os import getenv - - # Allow enabling mock mode from environment if not explicitly provided - if mock is None: - mock_env = getenv("SGAI_MOCK", "0").strip().lower() - mock = mock_env in {"1", "true", "yes", "on"} - - api_key = getenv("SGAI_API_KEY") - # In mock mode, we don't need a real API key - if not api_key: - if mock: - api_key = "sgai-00000000-0000-0000-0000-000000000000" - else: - raise ValueError("SGAI_API_KEY environment variable not set") - return cls( - api_key=api_key, - verify_ssl=verify_ssl, - timeout=timeout, - max_retries=max_retries, - retry_delay=retry_delay, - mock=bool(mock), - mock_handler=mock_handler, - mock_responses=mock_responses, - ) - - def __init__( - self, - api_key: str = None, - verify_ssl: bool = True, - timeout: Optional[float] = None, - max_retries: int = 3, - retry_delay: float = 1.0, - mock: bool = False, - mock_handler: Optional[Callable[[str, str, Dict[str, Any]], Any]] = None, - mock_responses: Optional[Dict[str, Any]] = None, - ): - """Initialize AsyncClient with configurable parameters. - - Args: - api_key: API key for authentication. If None, will try to - load from environment - verify_ssl: Whether to verify SSL certificates - timeout: Request timeout in seconds. 
None means no timeout (infinite) - max_retries: Maximum number of retry attempts - retry_delay: Delay between retries in seconds - """ - logger.info("🔑 Initializing AsyncClient") - - # Try to get API key from environment if not provided - if api_key is None: - from os import getenv - - api_key = getenv("SGAI_API_KEY") - if not api_key: - raise ValueError( - "SGAI_API_KEY not provided and not found in environment" - ) - - validate_api_key(api_key) - logger.debug( - f"🛠️ Configuration: verify_ssl={verify_ssl}, " - f"timeout={timeout}, max_retries={max_retries}" - ) - self.api_key = api_key - self.headers = {**DEFAULT_HEADERS, "SGAI-APIKEY": api_key} - self.max_retries = max_retries - self.retry_delay = retry_delay - self.mock = bool(mock) - self.mock_handler = mock_handler - self.mock_responses = mock_responses or {} - - ssl = None if verify_ssl else False - self.timeout = ClientTimeout(total=timeout) if timeout is not None else None - - self.session = ClientSession( - headers=self.headers, connector=TCPConnector(ssl=ssl), timeout=self.timeout - ) - - logger.info("✅ AsyncClient initialized successfully") - - async def _make_request(self, method: str, url: str, **kwargs) -> Any: - """ - Make asynchronous HTTP request with retry logic and error handling. - - Args: - method: HTTP method (GET, POST, etc.) - url: Full URL for the request - **kwargs: Additional arguments to pass to aiohttp - - Returns: - Parsed JSON response data - - Raises: - APIError: If the API returns an error response - ConnectionError: If unable to connect after all retries - - Note: - In mock mode, this method returns deterministic responses without - making actual HTTP requests. 
- """ - # Short-circuit when mock mode is enabled - if getattr(self, "mock", False): - return self._mock_response(method, url, **kwargs) - for attempt in range(self.max_retries): - try: - logger.info( - f"๐Ÿš€ Making {method} request to {url} " - f"(Attempt {attempt + 1}/{self.max_retries})" - ) - logger.debug(f"๐Ÿ” Request parameters: {kwargs}") - - async with self.session.request(method, url, **kwargs) as response: - logger.debug(f"๐Ÿ“ฅ Response status: {response.status}") - result = await handle_async_response(response) - logger.info(f"โœ… Request completed successfully: {method} {url}") - return result - - except ClientError as e: - logger.warning(f"โš ๏ธ Request attempt {attempt + 1} failed: {str(e)}") - if hasattr(e, "status") and e.status is not None: - try: - error_data = await e.response.json() - error_msg = error_data.get("error", str(e)) - logger.error(f"๐Ÿ”ด API Error: {error_msg}") - raise APIError(error_msg, status_code=e.status) - except ValueError: - logger.error("๐Ÿ”ด Could not parse error response") - raise APIError( - str(e), - status_code=e.status if hasattr(e, "status") else None, - ) - - if attempt == self.max_retries - 1: - logger.error(f"โŒ All retry attempts failed for {method} {url}") - raise ConnectionError(f"Failed to connect to API: {str(e)}") - - retry_delay = self.retry_delay * (attempt + 1) - logger.info(f"โณ Waiting {retry_delay}s before retry {attempt + 2}") - await asyncio.sleep(retry_delay) - - def _mock_response(self, method: str, url: str, **kwargs) -> Any: - """Return a deterministic mock response without performing network I/O. - - Resolution order: - 1) If a custom mock_handler is provided, delegate to it - 2) If mock_responses contains a key for the request path, use it - 3) Fallback to built-in defaults per endpoint family - """ - logger.info(f"๐Ÿงช Mock mode active. 
Returning stub for {method} {url}") - - # 1) Custom handler - if self.mock_handler is not None: - try: - return self.mock_handler(method, url, kwargs) - except Exception as handler_error: - logger.warning(f"Custom mock_handler raised: {handler_error}. Falling back to defaults.") - - # 2) Path-based override - try: - parsed = urlparse(url) - path = parsed.path.rstrip("/") - except Exception: - path = url - - override = self.mock_responses.get(path) - if override is not None: - return override() if callable(override) else override - - # 3) Built-in defaults - def new_id(prefix: str) -> str: - return f"{prefix}-{_uuid.uuid4()}" - - upper_method = method.upper() - - # Credits endpoint - if path.endswith("/credits") and upper_method == "GET": - return {"remaining_credits": 1000, "total_credits_used": 0} - - # Health check endpoint - if path.endswith("/healthz") and upper_method == "GET": - return {"status": "healthy", "message": "Service is operational"} - - # Feedback acknowledge - if path.endswith("/feedback") and upper_method == "POST": - return {"status": "success"} - - # Create-like endpoints (POST) - if upper_method == "POST": - if path.endswith("/crawl"): - return {"crawl_id": new_id("mock-crawl")} - elif path.endswith("/scheduled-jobs"): - return { - "id": new_id("mock-job"), - "user_id": new_id("mock-user"), - "job_name": "Mock Scheduled Job", - "service_type": "smartscraper", - "cron_expression": "0 9 * * 1", - "job_config": {"mock": "config"}, - "is_active": True, - "created_at": "2024-01-01T00:00:00Z", - "updated_at": "2024-01-01T00:00:00Z", - "next_run_at": "2024-01-08T09:00:00Z" - } - elif "/pause" in path: - return { - "message": "Job paused successfully", - "job_id": new_id("mock-job"), - "is_active": False - } - elif "/resume" in path: - return { - "message": "Job resumed successfully", - "job_id": new_id("mock-job"), - "is_active": True, - "next_run_at": "2024-01-08T09:00:00Z" - } - elif "/trigger" in path: - task_id = new_id("mock-task") - return { - 
"execution_id": task_id, - "scheduled_job_id": new_id("mock-job"), - "triggered_at": "2024-01-01T00:00:00Z", - "message": f"Job triggered successfully. Task ID: {task_id}" - } - # All other POST endpoints return a request id - return {"request_id": new_id("mock-req")} - - # Status-like endpoints (GET) - if upper_method == "GET": - if "markdownify" in path: - return {"status": "completed", "content": "# Mock markdown\n\n..."} - if "smartscraper" in path: - return {"status": "completed", "result": [{"field": "value"}]} - if "searchscraper" in path: - return { - "status": "completed", - "results": [{"url": "https://example.com"}], - "markdown_content": "# Mock Markdown Content\n\nThis is mock markdown content for testing purposes.\n\n## Section 1\n\nSome content here.\n\n## Section 2\n\nMore content here.", - "reference_urls": ["https://example.com", "https://example2.com"] - } - if "crawl" in path: - return {"status": "completed", "pages": []} - if "agentic-scrapper" in path: - return {"status": "completed", "actions": []} - if "scheduled-jobs" in path: - if "/executions" in path: - return { - "executions": [ - { - "id": new_id("mock-exec"), - "scheduled_job_id": new_id("mock-job"), - "execution_id": new_id("mock-task"), - "status": "completed", - "started_at": "2024-01-01T00:00:00Z", - "completed_at": "2024-01-01T00:01:00Z", - "result": {"mock": "result"}, - "credits_used": 10 - } - ], - "total": 1, - "page": 1, - "page_size": 20 - } - elif path.endswith("/scheduled-jobs"): # List jobs endpoint - return { - "jobs": [ - { - "id": new_id("mock-job"), - "user_id": new_id("mock-user"), - "job_name": "Mock Scheduled Job", - "service_type": "smartscraper", - "cron_expression": "0 9 * * 1", - "job_config": {"mock": "config"}, - "is_active": True, - "created_at": "2024-01-01T00:00:00Z", - "updated_at": "2024-01-01T00:00:00Z", - "next_run_at": "2024-01-08T09:00:00Z" - } - ], - "total": 1, - "page": 1, - "page_size": 20 - } - else: # Single job endpoint - return { - "id": 
new_id("mock-job"), - "user_id": new_id("mock-user"), - "job_name": "Mock Scheduled Job", - "service_type": "smartscraper", - "cron_expression": "0 9 * * 1", - "job_config": {"mock": "config"}, - "is_active": True, - "created_at": "2024-01-01T00:00:00Z", - "updated_at": "2024-01-01T00:00:00Z", - "next_run_at": "2024-01-08T09:00:00Z" - } - - # Update operations (PATCH/PUT) - if upper_method in ["PATCH", "PUT"] and "scheduled-jobs" in path: - return { - "id": new_id("mock-job"), - "user_id": new_id("mock-user"), - "job_name": "Updated Mock Scheduled Job", - "service_type": "smartscraper", - "cron_expression": "0 10 * * 1", - "job_config": {"mock": "updated_config"}, - "is_active": True, - "created_at": "2024-01-01T00:00:00Z", - "updated_at": "2024-01-01T01:00:00Z", - "next_run_at": "2024-01-08T10:00:00Z" - } - - # Delete operations - if upper_method == "DELETE" and "scheduled-jobs" in path: - return {"message": "Scheduled job deleted successfully"} - - # Generic fallback - return {"status": "mock", "url": url, "method": method, "kwargs": kwargs} - - async def markdownify( - self, website_url: str, headers: Optional[dict[str, str]] = None, mock: bool = False, render_heavy_js: bool = False, stealth: bool = False, wait_ms: Optional[int] = None, return_toon: bool = False - ): - """Send a markdownify request - - Args: - website_url: The URL to convert to markdown - headers: Optional HTTP headers - mock: Enable mock mode for testing - render_heavy_js: Enable heavy JavaScript rendering - stealth: Enable stealth mode to avoid bot detection - wait_ms: Number of milliseconds to wait before scraping the website - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Starting markdownify request for {website_url}") - if headers: - logger.debug("๐Ÿ”ง Using custom headers") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if render_heavy_js: - logger.debug("โšก Heavy JavaScript rendering enabled") - if 
return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - request = MarkdownifyRequest(website_url=website_url, headers=headers, mock=mock, render_heavy_js=render_heavy_js, stealth=stealth, wait_ms=wait_ms) - logger.debug("โœ… Request validation passed") - - result = await self._make_request( - "POST", f"{API_BASE_URL}/markdownify", json=request.model_dump() - ) - logger.info("โœจ Markdownify request completed successfully") - return process_response_with_toon(result, return_toon) - - async def get_markdownify(self, request_id: str, return_toon: bool = False): - """Get the result of a previous markdownify request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching markdownify result for request {request_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Validate input using Pydantic model - GetMarkdownifyRequest(request_id=request_id) - logger.debug("โœ… Request ID validation passed") - - result = await self._make_request( - "GET", f"{API_BASE_URL}/markdownify/{request_id}" - ) - logger.info(f"โœจ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - async def scrape( - self, - website_url: str, - render_heavy_js: bool = False, - branding: bool = False, - headers: Optional[dict[str, str]] = None, - stealth: bool = False, - wait_ms: Optional[int] = None, - return_toon: bool = False, - ): - """Send a scrape request to get HTML content from a website - - Args: - website_url: The URL of the website to get HTML from - render_heavy_js: Whether to render heavy JavaScript (defaults to False) - branding: Whether to include branding in the response (defaults to False) - headers: Optional headers to send with the request - stealth: Enable stealth mode to avoid bot detection - wait_ms: Number of milliseconds to wait before scraping the website - return_toon: 
If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Starting scrape request for {website_url}") - logger.debug(f"๐Ÿ”ง Render heavy JS: {render_heavy_js}") - logger.debug(f"๐Ÿ”ง Branding: {branding}") - if headers: - logger.debug("๐Ÿ”ง Using custom headers") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - request = ScrapeRequest( - website_url=website_url, - render_heavy_js=render_heavy_js, - branding=branding, - headers=headers, - stealth=stealth, - wait_ms=wait_ms, - ) - logger.debug("โœ… Request validation passed") - - result = await self._make_request( - "POST", f"{API_BASE_URL}/scrape", json=request.model_dump() - ) - logger.info("โœจ Scrape request completed successfully") - return process_response_with_toon(result, return_toon) - - async def get_scrape(self, request_id: str, return_toon: bool = False): - """Get the result of a previous scrape request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching scrape result for request {request_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Validate input using Pydantic model - GetScrapeRequest(request_id=request_id) - logger.debug("โœ… Request ID validation passed") - - result = await self._make_request( - "GET", f"{API_BASE_URL}/scrape/{request_id}") - logger.info(f"โœจ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - async def sitemap( - self, - website_url: str, - mock: bool = False, - ) -> SitemapResponse: - """Extract all URLs from a website's sitemap. - - Automatically discovers sitemap from robots.txt or common sitemap locations. 
- - Args: - website_url: The URL of the website to extract sitemap from - mock: Whether to use mock mode for this request - - Returns: - SitemapResponse: Object containing list of URLs extracted from sitemap - - Raises: - ValueError: If website_url is invalid - APIError: If the API request fails - - Examples: - >>> async with AsyncClient(api_key="your-api-key") as client: - ... response = await client.sitemap("https://example.com") - ... print(f"Found {len(response.urls)} URLs") - ... for url in response.urls[:5]: - ... print(url) - """ - logger.info(f"๐Ÿ—บ๏ธ Starting sitemap extraction for {website_url}") - - request = SitemapRequest( - website_url=website_url, - mock=mock - ) - logger.debug("โœ… Request validation passed") - - result = await self._make_request( - "POST", f"{API_BASE_URL}/sitemap", json=request.model_dump() - ) - logger.info(f"โœจ Sitemap extraction completed successfully - found {len(result.get('urls', []))} URLs") - - # Parse response into SitemapResponse model - return SitemapResponse(**result) - - async def smartscraper( - self, - user_prompt: str, - website_url: Optional[str] = None, - website_html: Optional[str] = None, - website_markdown: Optional[str] = None, - headers: Optional[dict[str, str]] = None, - cookies: Optional[Dict[str, str]] = None, - output_schema: Optional[BaseModel] = None, - number_of_scrolls: Optional[int] = None, - total_pages: Optional[int] = None, - mock: bool = False, - plain_text: bool = False, - render_heavy_js: bool = False, - stealth: bool = False, - wait_ms: Optional[int] = None, - return_toon: bool = False, - ): - """ - Send a smartscraper request with optional pagination support and cookies. 
- - Supports three types of input (must provide exactly one): - - website_url: Scrape from a URL - - website_html: Process local HTML content - - website_markdown: Process local Markdown content - - Args: - user_prompt: Natural language prompt describing what to extract - website_url: URL to scrape (optional) - website_html: Raw HTML content to process (optional, max 2MB) - website_markdown: Markdown content to process (optional, max 2MB) - headers: Optional HTTP headers - cookies: Optional cookies for authentication - output_schema: Optional Pydantic model for structured output - number_of_scrolls: Number of times to scroll (0-100) - total_pages: Number of pages to scrape (1-10) - mock: Enable mock mode for testing - plain_text: Return plain text instead of structured data - render_heavy_js: Enable heavy JavaScript rendering - stealth: Enable stealth mode to avoid bot detection - wait_ms: Number of milliseconds to wait before scraping the website - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - - Returns: - Dictionary containing the scraping results, or TOON formatted string if return_toon=True - - Raises: - ValueError: If validation fails or invalid parameters provided - APIError: If the API request fails - """ - logger.info("๐Ÿ” Starting smartscraper request") - if website_url: - logger.debug(f"๐ŸŒ URL: {website_url}") - if website_html: - logger.debug("๐Ÿ“„ Using provided HTML content") - if website_markdown: - logger.debug("๐Ÿ“ Using provided Markdown content") - if headers: - logger.debug("๐Ÿ”ง Using custom headers") - if cookies: - logger.debug("๐Ÿช Using cookies for authentication/session management") - if number_of_scrolls is not None: - logger.debug(f"๐Ÿ”„ Number of scrolls: {number_of_scrolls}") - if total_pages is not None: - logger.debug(f"๐Ÿ“„ Total pages to scrape: {total_pages}") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if render_heavy_js: - logger.debug("โšก Heavy JavaScript rendering 
enabled") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - logger.debug(f"๐Ÿ“ Prompt: {user_prompt}") - - request = SmartScraperRequest( - website_url=website_url, - website_html=website_html, - website_markdown=website_markdown, - headers=headers, - cookies=cookies, - user_prompt=user_prompt, - output_schema=output_schema, - number_of_scrolls=number_of_scrolls, - total_pages=total_pages, - mock=mock, - plain_text=plain_text, - render_heavy_js=render_heavy_js, - stealth=stealth, - wait_ms=wait_ms, - ) - - logger.debug("โœ… Request validation passed") - - result = await self._make_request( - "POST", f"{API_BASE_URL}/smartscraper", json=request.model_dump() - ) - logger.info("โœจ Smartscraper request completed successfully") - return process_response_with_toon(result, return_toon) - - async def get_smartscraper(self, request_id: str, return_toon: bool = False): - """Get the result of a previous smartscraper request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching smartscraper result for request {request_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Validate input using Pydantic model - GetSmartScraperRequest(request_id=request_id) - logger.debug("โœ… Request ID validation passed") - - result = await self._make_request( - "GET", f"{API_BASE_URL}/smartscraper/{request_id}" - ) - logger.info(f"โœจ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - async def submit_feedback( - self, request_id: str, rating: int, feedback_text: Optional[str] = None - ): - """Submit feedback for a request""" - logger.info(f"๐Ÿ“ Submitting feedback for request {request_id}") - logger.debug(f"โญ Rating: {rating}, Feedback: {feedback_text}") - - feedback = FeedbackRequest( - request_id=request_id, rating=rating, feedback_text=feedback_text - ) - 
logger.debug("โœ… Feedback validation passed") - - result = await self._make_request( - "POST", f"{API_BASE_URL}/feedback", json=feedback.model_dump() - ) - logger.info("โœจ Feedback submitted successfully") - return result - - async def get_credits(self): - """Get credits information""" - logger.info("๐Ÿ’ณ Fetching credits information") - - result = await self._make_request( - "GET", - f"{API_BASE_URL}/credits", - ) - logger.info( - f"โœจ Credits info retrieved: " - f"{result.get('remaining_credits')} credits remaining" - ) - return result - - async def healthz(self): - """Check the health status of the service - - This endpoint is useful for monitoring and ensuring the service is operational. - It returns a JSON response indicating the service's health status. - - Returns: - dict: Health status information - - Example: - >>> async with AsyncClient.from_env() as client: - ... health = await client.healthz() - ... print(health) - """ - logger.info("๐Ÿฅ Checking service health") - - result = await self._make_request( - "GET", - f"{API_BASE_URL}/healthz", - ) - logger.info("โœจ Health check completed successfully") - return result - - async def searchscraper( - self, - user_prompt: str, - num_results: Optional[int] = 3, - headers: Optional[dict[str, str]] = None, - output_schema: Optional[BaseModel] = None, - extraction_mode: bool = True, - stealth: bool = False, - location_geo_code: Optional[str] = None, - time_range: Optional[TimeRange] = None, - return_toon: bool = False, - ): - """Send a searchscraper request - - Args: - user_prompt: The search prompt string - num_results: Number of websites to scrape (3-20). Default is 3. - More websites provide better research depth but cost more - credits. Credit calculation: 30 base + 10 per additional - website beyond 3. - headers: Optional headers to send with the request - output_schema: Optional schema to structure the output - extraction_mode: Whether to use AI extraction (True) or markdown conversion (False). 
- AI extraction costs 10 credits per page, markdown conversion costs 2 credits per page. - stealth: Enable stealth mode to avoid bot detection - location_geo_code: Optional geo code of the location to search in (e.g., "us") - time_range: Optional time range filter for search results (e.g., TimeRange.PAST_WEEK) - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info("๐Ÿ” Starting searchscraper request") - logger.debug(f"๐Ÿ“ Prompt: {user_prompt}") - logger.debug(f"๐ŸŒ Number of results: {num_results}") - logger.debug(f"๐Ÿค– Extraction mode: {'AI extraction' if extraction_mode else 'Markdown conversion'}") - if headers: - logger.debug("๐Ÿ”ง Using custom headers") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if location_geo_code: - logger.debug(f"๐ŸŒ Location geo code: {location_geo_code}") - if time_range: - logger.debug(f"๐Ÿ“… Time range: {time_range.value}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - request = SearchScraperRequest( - user_prompt=user_prompt, - num_results=num_results, - headers=headers, - output_schema=output_schema, - extraction_mode=extraction_mode, - stealth=stealth, - location_geo_code=location_geo_code, - time_range=time_range, - ) - logger.debug("โœ… Request validation passed") - - result = await self._make_request( - "POST", f"{API_BASE_URL}/searchscraper", json=request.model_dump() - ) - logger.info("โœจ Searchscraper request completed successfully") - return process_response_with_toon(result, return_toon) - - async def get_searchscraper(self, request_id: str, return_toon: bool = False): - """Get the result of a previous searchscraper request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching searchscraper result for request {request_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Validate 
input using Pydantic model - GetSearchScraperRequest(request_id=request_id) - logger.debug("โœ… Request ID validation passed") - - result = await self._make_request( - "GET", f"{API_BASE_URL}/searchscraper/{request_id}" - ) - logger.info(f"โœจ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - async def crawl( - self, - url: str, - prompt: Optional[str] = None, - data_schema: Optional[Dict[str, Any]] = None, - extraction_mode: bool = True, - cache_website: bool = True, - depth: int = 2, - breadth: Optional[int] = None, - max_pages: int = 2, - same_domain_only: bool = True, - batch_size: Optional[int] = None, - sitemap: bool = False, - headers: Optional[dict[str, str]] = None, - render_heavy_js: bool = False, - stealth: bool = False, - include_paths: Optional[list[str]] = None, - exclude_paths: Optional[list[str]] = None, - webhook_url: Optional[str] = None, - wait_ms: Optional[int] = None, - return_toon: bool = False, - ): - """Send a crawl request with support for both AI extraction and - markdown conversion modes - - Args: - url: The starting URL to crawl - prompt: AI prompt for data extraction (required for AI extraction mode) - data_schema: Schema for structured output - extraction_mode: Whether to use AI extraction (True) or markdown (False) - cache_website: Whether to cache the website - depth: Maximum depth of link traversal - breadth: Maximum number of links to crawl per depth level. If None, unlimited (default). - Controls the 'width' of exploration at each depth. Useful for limiting crawl scope - on large sites. Note: max_pages always takes priority. Ignored when sitemap=True. 
- max_pages: Maximum number of pages to crawl - same_domain_only: Only crawl pages within the same domain - batch_size: Number of pages to process in batch - sitemap: Use sitemap for crawling - headers: Optional HTTP headers - render_heavy_js: Enable heavy JavaScript rendering - stealth: Enable stealth mode to avoid bot detection - include_paths: List of path patterns to include (e.g., ['/products/*', '/blog/**']) - Supports wildcards: * matches any characters, ** matches any path segments - exclude_paths: List of path patterns to exclude (e.g., ['/admin/*', '/api/*']) - Supports wildcards and takes precedence over include_paths - webhook_url: URL to receive webhook notifications when the crawl completes - wait_ms: Number of milliseconds to wait before scraping each page - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info("๐Ÿ” Starting crawl request") - logger.debug(f"๐ŸŒ URL: {url}") - logger.debug( - f"๐Ÿค– Extraction mode: {'AI' if extraction_mode else 'Markdown conversion'}" - ) - if extraction_mode: - logger.debug(f"๐Ÿ“ Prompt: {prompt}") - logger.debug(f"๐Ÿ“Š Schema provided: {bool(data_schema)}") - else: - logger.debug( - "๐Ÿ“„ Markdown conversion mode - no AI processing, 2 credits per page" - ) - logger.debug(f"๐Ÿ’พ Cache website: {cache_website}") - logger.debug(f"๐Ÿ” Depth: {depth}") - if breadth is not None: - logger.debug(f"๐Ÿ“ Breadth: {breadth}") - logger.debug(f"๐Ÿ“„ Max pages: {max_pages}") - logger.debug(f"๐Ÿ  Same domain only: {same_domain_only}") - logger.debug(f"๐Ÿ—บ๏ธ Use sitemap: {sitemap}") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if render_heavy_js: - logger.debug("โšก Heavy JavaScript rendering enabled") - if batch_size is not None: - logger.debug(f"๐Ÿ“ฆ Batch size: {batch_size}") - if include_paths: - logger.debug(f"โœ… Include paths: {include_paths}") - if exclude_paths: - logger.debug(f"โŒ Exclude paths: {exclude_paths}") - if webhook_url: - 
logger.debug(f"๐Ÿ”” Webhook URL: {webhook_url}") - if wait_ms is not None: - logger.debug(f"โฑ๏ธ Wait ms: {wait_ms}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Build request data, excluding None values - request_data = { - "url": url, - "extraction_mode": extraction_mode, - "cache_website": cache_website, - "depth": depth, - "max_pages": max_pages, - "same_domain_only": same_domain_only, - "sitemap": sitemap, - "render_heavy_js": render_heavy_js, - "stealth": stealth, - } - - # Add optional parameters only if provided - if prompt is not None: - request_data["prompt"] = prompt - if data_schema is not None: - request_data["data_schema"] = data_schema - if breadth is not None: - request_data["breadth"] = breadth - if batch_size is not None: - request_data["batch_size"] = batch_size - if headers is not None: - request_data["headers"] = headers - if include_paths is not None: - request_data["include_paths"] = include_paths - if exclude_paths is not None: - request_data["exclude_paths"] = exclude_paths - if webhook_url is not None: - request_data["webhook_url"] = webhook_url - if wait_ms is not None: - request_data["wait_ms"] = wait_ms - - request = CrawlRequest(**request_data) - logger.debug("โœ… Request validation passed") - - # Serialize the request, excluding None values - request_json = request.model_dump(exclude_none=True) - result = await self._make_request( - "POST", f"{API_BASE_URL}/crawl", json=request_json - ) - logger.info("โœจ Crawl request completed successfully") - return process_response_with_toon(result, return_toon) - - async def get_crawl(self, crawl_id: str, return_toon: bool = False): - """Get the result of a previous crawl request - - Args: - crawl_id: The crawl ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching crawl result for request {crawl_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Validate 
input using Pydantic model - GetCrawlRequest(crawl_id=crawl_id) - logger.debug("โœ… Request ID validation passed") - - result = await self._make_request("GET", f"{API_BASE_URL}/crawl/{crawl_id}") - logger.info(f"โœจ Successfully retrieved result for request {crawl_id}") - return process_response_with_toon(result, return_toon) - - async def agenticscraper( - self, - url: str, - steps: list[str], - use_session: bool = True, - user_prompt: Optional[str] = None, - output_schema: Optional[Dict[str, Any]] = None, - ai_extraction: bool = False, - stealth: bool = False, - return_toon: bool = False, - ): - """Send an agentic scraper request to perform automated actions on a webpage - - Args: - url: The URL to scrape - steps: List of steps to perform on the webpage - use_session: Whether to use session for the scraping (default: True) - user_prompt: Prompt for AI extraction (required when ai_extraction=True) - output_schema: Schema for structured data extraction (optional, used with ai_extraction=True) - ai_extraction: Whether to use AI for data extraction from the scraped content (default: False) - stealth: Enable stealth mode to avoid bot detection - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿค– Starting agentic scraper request for {url}") - logger.debug(f"๐Ÿ”ง Use session: {use_session}") - logger.debug(f"๐Ÿ“‹ Steps: {steps}") - logger.debug(f"๐Ÿง  AI extraction: {ai_extraction}") - if ai_extraction: - logger.debug(f"๐Ÿ’ญ User prompt: {user_prompt}") - logger.debug(f"๐Ÿ“‹ Output schema provided: {output_schema is not None}") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - request = AgenticScraperRequest( - url=url, - steps=steps, - use_session=use_session, - user_prompt=user_prompt, - output_schema=output_schema, - ai_extraction=ai_extraction, - stealth=stealth, - ) - logger.debug("โœ… Request validation passed") - - result = 
await self._make_request( - "POST", f"{API_BASE_URL}/agentic-scrapper", json=request.model_dump() - ) - logger.info("✨ Agentic scraper request completed successfully") - return process_response_with_toon(result, return_toon) - - async def get_agenticscraper(self, request_id: str, return_toon: bool = False): - """Get the result of a previous agentic scraper request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"🔍 Fetching agentic scraper result for request {request_id}") - if return_toon: - logger.debug("🎨 TOON format output enabled") - - # Validate input using Pydantic model - GetAgenticScraperRequest(request_id=request_id) - logger.debug("✅ Request ID validation passed") - - result = await self._make_request("GET", f"{API_BASE_URL}/agentic-scrapper/{request_id}") - logger.info(f"✨ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - async def generate_schema( - self, - user_prompt: str, - existing_schema: Optional[Dict[str, Any]] = None, - ): - """Generate a JSON schema from a user prompt - - Args: - user_prompt: The user's search query to be refined into a schema - existing_schema: Optional existing JSON schema to modify/extend - """ - logger.info("🔧 Starting schema generation request") - logger.debug(f"💭 User prompt: {user_prompt}") - if existing_schema: - logger.debug(f"📋 Existing schema provided: {existing_schema is not None}") - - request = GenerateSchemaRequest( - user_prompt=user_prompt, - existing_schema=existing_schema, - ) - logger.debug("✅ Request validation passed") - - result = await self._make_request( - "POST", f"{API_BASE_URL}/generate_schema", json=request.model_dump() - ) - logger.info("✨ Schema generation request completed successfully") - return result - - async def get_schema_status(self, request_id: str): - """Get the result of a previous schema 
generation request - - Args: - request_id: The request ID returned from generate_schema - """ - logger.info(f"🔍 Fetching schema generation status for request {request_id}") - - # Validate input using Pydantic model - GetSchemaStatusRequest(request_id=request_id) - logger.debug("✅ Request ID validation passed") - - result = await self._make_request("GET", f"{API_BASE_URL}/generate_schema/{request_id}") - logger.info(f"✨ Successfully retrieved schema status for request {request_id}") - return result - - async def create_scheduled_job( - self, - job_name: str, - service_type: str, - cron_expression: str, - job_config: dict, - is_active: bool = True, - ): - """Create a new scheduled job""" - logger.info(f"📅 Creating scheduled job: {job_name}") - - request = ScheduledJobCreate( - job_name=job_name, - service_type=service_type, - cron_expression=cron_expression, - job_config=job_config, - is_active=is_active, - ) - - result = await self._make_request( - "POST", f"{API_BASE_URL}/scheduled-jobs", json=request.model_dump() - ) - logger.info("✨ Scheduled job created successfully") - return result - - async def get_scheduled_jobs( - self, - page: int = 1, - page_size: int = 20, - service_type: Optional[str] = None, - is_active: Optional[bool] = None, - ): - """Get list of scheduled jobs with pagination""" - logger.info("📋 Fetching scheduled jobs") - - GetScheduledJobsRequest( - page=page, - page_size=page_size, - service_type=service_type, - is_active=is_active, - ) - - params = {"page": page, "page_size": page_size} - if service_type: - params["service_type"] = service_type - if is_active is not None: - params["is_active"] = is_active - - result = await self._make_request("GET", f"{API_BASE_URL}/scheduled-jobs", params=params) - logger.info(f"✨ Successfully retrieved {len(result.get('jobs', []))} scheduled jobs") - return result - - async def get_scheduled_job(self, job_id: str): - """Get details of a specific scheduled job""" - logger.info(f"🔍 Fetching 
scheduled job {job_id}") - - GetScheduledJobRequest(job_id=job_id) - - result = await self._make_request("GET", f"{API_BASE_URL}/scheduled-jobs/{job_id}") - logger.info(f"✨ Successfully retrieved scheduled job {job_id}") - return result - - async def update_scheduled_job( - self, - job_id: str, - job_name: Optional[str] = None, - cron_expression: Optional[str] = None, - job_config: Optional[dict] = None, - is_active: Optional[bool] = None, - ): - """Update an existing scheduled job (partial update)""" - logger.info(f"📝 Updating scheduled job {job_id}") - - update_data = {} - if job_name is not None: - update_data["job_name"] = job_name - if cron_expression is not None: - update_data["cron_expression"] = cron_expression - if job_config is not None: - update_data["job_config"] = job_config - if is_active is not None: - update_data["is_active"] = is_active - - ScheduledJobUpdate(**update_data) - - result = await self._make_request( - "PATCH", f"{API_BASE_URL}/scheduled-jobs/{job_id}", json=update_data - ) - logger.info(f"✨ Successfully updated scheduled job {job_id}") - return result - - async def replace_scheduled_job( - self, - job_id: str, - job_name: str, - cron_expression: str, - job_config: dict, - is_active: bool = True, - ): - """Replace an existing scheduled job (full update)""" - logger.info(f"🔄 Replacing scheduled job {job_id}") - - request_data = { - "job_name": job_name, - "cron_expression": cron_expression, - "job_config": job_config, - "is_active": is_active, - } - - result = await self._make_request( - "PUT", f"{API_BASE_URL}/scheduled-jobs/{job_id}", json=request_data - ) - logger.info(f"✨ Successfully replaced scheduled job {job_id}") - return result - - async def delete_scheduled_job(self, job_id: str): - """Delete a scheduled job""" - logger.info(f"🗑️ Deleting scheduled job {job_id}") - - JobActionRequest(job_id=job_id) - - result = await self._make_request("DELETE", f"{API_BASE_URL}/scheduled-jobs/{job_id}") - logger.info(f"✨ 
Successfully deleted scheduled job {job_id}") - return result - - async def pause_scheduled_job(self, job_id: str): - """Pause a scheduled job""" - logger.info(f"⏸️ Pausing scheduled job {job_id}") - - JobActionRequest(job_id=job_id) - - result = await self._make_request("POST", f"{API_BASE_URL}/scheduled-jobs/{job_id}/pause") - logger.info(f"✨ Successfully paused scheduled job {job_id}") - return result - - async def resume_scheduled_job(self, job_id: str): - """Resume a paused scheduled job""" - logger.info(f"▶️ Resuming scheduled job {job_id}") - - JobActionRequest(job_id=job_id) - - result = await self._make_request("POST", f"{API_BASE_URL}/scheduled-jobs/{job_id}/resume") - logger.info(f"✨ Successfully resumed scheduled job {job_id}") - return result - - async def trigger_scheduled_job(self, job_id: str): - """Manually trigger a scheduled job""" - logger.info(f"🚀 Manually triggering scheduled job {job_id}") - - TriggerJobRequest(job_id=job_id) - - result = await self._make_request("POST", f"{API_BASE_URL}/scheduled-jobs/{job_id}/trigger") - logger.info(f"✨ Successfully triggered scheduled job {job_id}") - return result - - async def get_job_executions( - self, - job_id: str, - page: int = 1, - page_size: int = 20, - status: Optional[str] = None, - ): - """Get execution history for a scheduled job""" - logger.info(f"📊 Fetching execution history for job {job_id}") - - GetJobExecutionsRequest( - job_id=job_id, - page=page, - page_size=page_size, - status=status, - ) - - params = {"page": page, "page_size": page_size} - if status: - params["status"] = status - - result = await self._make_request( - "GET", f"{API_BASE_URL}/scheduled-jobs/{job_id}/executions", params=params - ) - logger.info(f"✨ Successfully retrieved execution history for job {job_id}") - return result - - async def close(self): - """Close the session to free up resources""" - logger.info("🔒 Closing AsyncClient session") - await self.session.close() - logger.debug("✅ 
Session closed successfully") - - async def __aenter__(self): - return self - - async def __aexit__(self, exc_type, exc_val, exc_tb): - await self.close() diff --git a/scrapegraph-py/scrapegraph_py/client.py b/scrapegraph-py/scrapegraph_py/client.py deleted file mode 100644 index 28fc1bf8..00000000 --- a/scrapegraph-py/scrapegraph_py/client.py +++ /dev/null @@ -1,1335 +0,0 @@ -""" -Synchronous HTTP client for the ScrapeGraphAI API. - -This module provides a synchronous client for interacting with all ScrapeGraphAI -API endpoints including smartscraper, searchscraper, crawl, agentic scraper, -markdownify, schema generation, scheduled jobs, and utility functions. - -The Client class supports: -- API key authentication -- SSL verification configuration -- Request timeout configuration -- Automatic retry logic with exponential backoff -- Mock mode for testing -- Context manager support for proper resource cleanup - -Example: - Basic usage with environment variables: - >>> from scrapegraph_py import Client - >>> client = Client.from_env() - >>> result = client.smartscraper( - ... website_url="https://example.com", - ... user_prompt="Extract product information" - ... ) - - Using context manager: - >>> with Client(api_key="sgai-...") as client: - ... 
result = client.scrape(website_url="https://example.com") -""" -import uuid as _uuid -from typing import Any, Callable, Dict, Optional -from urllib.parse import urlparse - -import requests -import urllib3 -from pydantic import BaseModel -from requests.exceptions import RequestException - -from scrapegraph_py.config import API_BASE_URL, DEFAULT_HEADERS -from scrapegraph_py.exceptions import APIError -from scrapegraph_py.logger import sgai_logger as logger -from scrapegraph_py.models.agenticscraper import ( - AgenticScraperRequest, - GetAgenticScraperRequest, -) -from scrapegraph_py.models.crawl import CrawlRequest, GetCrawlRequest -from scrapegraph_py.models.feedback import FeedbackRequest -from scrapegraph_py.models.markdownify import GetMarkdownifyRequest, MarkdownifyRequest -from scrapegraph_py.models.schema import ( - GenerateSchemaRequest, - GetSchemaStatusRequest, - SchemaGenerationResponse, -) -from scrapegraph_py.models.scrape import GetScrapeRequest, ScrapeRequest -from scrapegraph_py.models.searchscraper import ( - GetSearchScraperRequest, - SearchScraperRequest, - TimeRange, -) -from scrapegraph_py.models.sitemap import SitemapRequest, SitemapResponse -from scrapegraph_py.models.smartscraper import ( - GetSmartScraperRequest, - SmartScraperRequest, -) -from scrapegraph_py.models.scheduled_jobs import ( - GetJobExecutionsRequest, - GetScheduledJobRequest, - GetScheduledJobsRequest, - JobActionRequest, - JobActionResponse, - JobExecutionListResponse, - JobTriggerResponse, - ScheduledJobCreate, - ScheduledJobListResponse, - ScheduledJobResponse, - ScheduledJobUpdate, - TriggerJobRequest, -) -from scrapegraph_py.utils.helpers import handle_sync_response, validate_api_key -from scrapegraph_py.utils.toon_converter import process_response_with_toon - - -class Client: - """ - Synchronous client for the ScrapeGraphAI API. - - This class provides synchronous methods for all ScrapeGraphAI API endpoints. 
- It handles authentication, request management, error handling, and supports - mock mode for testing. - - Attributes: - api_key (str): The API key for authentication - headers (dict): Default headers including API key - timeout (Optional[float]): Request timeout in seconds - max_retries (int): Maximum number of retry attempts - retry_delay (float): Delay between retries in seconds - mock (bool): Whether mock mode is enabled - session (requests.Session): HTTP session for connection pooling - - Example: - >>> client = Client.from_env() - >>> result = client.smartscraper( - ... website_url="https://example.com", - ... user_prompt="Extract all products" - ... ) - """ - @classmethod - def from_env( - cls, - verify_ssl: bool = True, - timeout: Optional[float] = None, - max_retries: int = 3, - retry_delay: float = 1.0, - mock: Optional[bool] = None, - mock_handler: Optional[Callable[[str, str, Dict[str, Any]], Any]] = None, - mock_responses: Optional[Dict[str, Any]] = None, - ): - """Initialize Client using API key from environment variable. - - Args: - verify_ssl: Whether to verify SSL certificates - timeout: Request timeout in seconds. None means no timeout (infinite) - max_retries: Maximum number of retry attempts - retry_delay: Delay between retries in seconds - mock: If True, the client will not perform real HTTP requests and - will return stubbed responses. If None, reads from SGAI_MOCK env. 
- """ - from os import getenv - - # Allow enabling mock mode from environment if not explicitly provided - if mock is None: - mock_env = getenv("SGAI_MOCK", "0").strip().lower() - mock = mock_env in {"1", "true", "yes", "on"} - - api_key = getenv("SGAI_API_KEY") - # In mock mode, we don't need a real API key - if not api_key: - if mock: - api_key = "sgai-00000000-0000-0000-0000-000000000000" - else: - raise ValueError("SGAI_API_KEY environment variable not set") - return cls( - api_key=api_key, - verify_ssl=verify_ssl, - timeout=timeout, - max_retries=max_retries, - retry_delay=retry_delay, - mock=bool(mock), - mock_handler=mock_handler, - mock_responses=mock_responses, - ) - - def __init__( - self, - api_key: Optional[str] = None, - verify_ssl: bool = True, - timeout: Optional[float] = None, - max_retries: int = 3, - retry_delay: float = 1.0, - mock: bool = False, - mock_handler: Optional[Callable[[str, str, Dict[str, Any]], Any]] = None, - mock_responses: Optional[Dict[str, Any]] = None, - ): - """Initialize Client with configurable parameters. - - Args: - api_key: API key for authentication. If None, it is read from the - SGAI_API_KEY environment variable - verify_ssl: Whether to verify SSL certificates - timeout: Request timeout in seconds. None means no timeout (infinite) - max_retries: Maximum number of retry attempts - retry_delay: Delay between retries in seconds - mock: If True, the client will bypass HTTP calls and return - deterministic mock responses - mock_handler: Optional callable to generate custom mock responses - given (method, url, request_kwargs) - mock_responses: Optional mapping of path (e.g. 
"/v1/credits") to - static response or callable returning a response - """ - logger.info("๐Ÿ”‘ Initializing Client") - - # Try to get API key from environment if not provided - if api_key is None: - from os import getenv - - api_key = getenv("SGAI_API_KEY") - if not api_key: - raise ValueError( - "SGAI_API_KEY not provided and not found in environment" - ) - - validate_api_key(api_key) - logger.debug( - f"๐Ÿ› ๏ธ Configuration: verify_ssl={verify_ssl}, timeout={timeout}, " - f"max_retries={max_retries}" - ) - - self.api_key = api_key - self.headers = {**DEFAULT_HEADERS, "SGAI-APIKEY": api_key} - self.timeout = timeout - self.max_retries = max_retries - self.retry_delay = retry_delay - self.mock = bool(mock) - self.mock_handler = mock_handler - self.mock_responses = mock_responses or {} - - # Create a session for connection pooling - self.session = requests.Session() - self.session.headers.update(self.headers) - self.session.verify = verify_ssl - - # Configure retries - adapter = requests.adapters.HTTPAdapter( - max_retries=requests.urllib3.Retry( - total=max_retries, - backoff_factor=retry_delay, - status_forcelist=[500, 502, 503, 504], - ) - ) - self.session.mount("http://", adapter) - self.session.mount("https://", adapter) - - # Add warning suppression if verify_ssl is False - if not verify_ssl: - urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) - - logger.info("โœ… Client initialized successfully") - - def _make_request(self, method: str, url: str, **kwargs) -> Any: - """ - Make HTTP request with error handling and retry logic. - - Args: - method: HTTP method (GET, POST, etc.) - url: Full URL for the request - **kwargs: Additional arguments to pass to requests - - Returns: - Parsed JSON response data - - Raises: - APIError: If the API returns an error response - ConnectionError: If unable to connect to the API - - Note: - In mock mode, this method returns deterministic responses without - making actual HTTP requests. 
- """ - # Short-circuit when mock mode is enabled - if getattr(self, "mock", False): - return self._mock_response(method, url, **kwargs) - try: - logger.info(f"🚀 Making {method} request to {url}") - logger.debug(f"🔍 Request parameters: {kwargs}") - - response = self.session.request(method, url, timeout=self.timeout, **kwargs) - logger.debug(f"📥 Response status: {response.status_code}") - - result = handle_sync_response(response) - logger.info(f"✅ Request completed successfully: {method} {url}") - return result - - except RequestException as e: - logger.error(f"❌ Request failed: {str(e)}") - if hasattr(e, "response") and e.response is not None: - try: - error_data = e.response.json() - error_msg = error_data.get("error", str(e)) - logger.error(f"🔴 API Error: {error_msg}") - raise APIError(error_msg, status_code=e.response.status_code) - except ValueError: - logger.error("🔴 Could not parse error response") - raise APIError( - str(e), - status_code=( - e.response.status_code - if hasattr(e.response, "status_code") - else None - ), - ) - logger.error(f"🔴 Connection Error: {str(e)}") - raise ConnectionError(f"Failed to connect to API: {str(e)}") - - def _mock_response(self, method: str, url: str, **kwargs) -> Any: - """Return a deterministic mock response without performing network I/O. - - Resolution order: - 1) If a custom mock_handler is provided, delegate to it - 2) If mock_responses contains a key for the request path, use it - 3) Fall back to built-in defaults per endpoint family - """ - logger.info(f"🧪 Mock mode active. Returning stub for {method} {url}") - - # 1) Custom handler - if self.mock_handler is not None: - try: - return self.mock_handler(method, url, kwargs) - except Exception as handler_error: - logger.warning(f"Custom mock_handler raised: {handler_error}. 
Falling back to defaults.") - - # 2) Path-based override - try: - parsed = urlparse(url) - path = parsed.path.rstrip("/") - except Exception: - path = url - - override = self.mock_responses.get(path) - if override is not None: - return override() if callable(override) else override - - # 3) Built-in defaults - def new_id(prefix: str) -> str: - return f"{prefix}-{_uuid.uuid4()}" - - upper_method = method.upper() - - # Credits endpoint - if path.endswith("/credits") and upper_method == "GET": - return {"remaining_credits": 1000, "total_credits_used": 0} - - # Health check endpoint - if path.endswith("/healthz") and upper_method == "GET": - return {"status": "healthy", "message": "Service is operational"} - - # Feedback acknowledge - if path.endswith("/feedback") and upper_method == "POST": - return {"status": "success"} - - # Create-like endpoints (POST) - if upper_method == "POST": - if path.endswith("/crawl"): - return {"crawl_id": new_id("mock-crawl")} - elif path.endswith("/scheduled-jobs"): - return { - "id": new_id("mock-job"), - "user_id": new_id("mock-user"), - "job_name": "Mock Scheduled Job", - "service_type": "smartscraper", - "cron_expression": "0 9 * * 1", - "job_config": {"mock": "config"}, - "is_active": True, - "created_at": "2024-01-01T00:00:00Z", - "updated_at": "2024-01-01T00:00:00Z", - "next_run_at": "2024-01-08T09:00:00Z" - } - elif "/pause" in path: - return { - "message": "Job paused successfully", - "job_id": new_id("mock-job"), - "is_active": False - } - elif "/resume" in path: - return { - "message": "Job resumed successfully", - "job_id": new_id("mock-job"), - "is_active": True, - "next_run_at": "2024-01-08T09:00:00Z" - } - elif "/trigger" in path: - return { - "execution_id": new_id("mock-task"), - "scheduled_job_id": new_id("mock-job"), - "triggered_at": "2024-01-01T00:00:00Z", - "message": f"Job triggered successfully. 
Task ID: {new_id('mock-task')}" - } - # All other POST endpoints return a request id - return {"request_id": new_id("mock-req")} - - # Status-like endpoints (GET) - if upper_method == "GET": - if "markdownify" in path: - return {"status": "completed", "content": "# Mock markdown\n\n..."} - if "smartscraper" in path: - return {"status": "completed", "result": [{"field": "value"}]} - if "searchscraper" in path: - return { - "status": "completed", - "results": [{"url": "https://example.com"}], - "markdown_content": "# Mock Markdown Content\n\nThis is mock markdown content for testing purposes.\n\n## Section 1\n\nSome content here.\n\n## Section 2\n\nMore content here.", - "reference_urls": ["https://example.com", "https://example2.com"] - } - if "crawl" in path: - return {"status": "completed", "pages": []} - if "agentic-scrapper" in path: - return {"status": "completed", "actions": []} - if "scheduled-jobs" in path: - if "/executions" in path: - return { - "executions": [ - { - "id": new_id("mock-exec"), - "scheduled_job_id": new_id("mock-job"), - "execution_id": new_id("mock-task"), - "status": "completed", - "started_at": "2024-01-01T00:00:00Z", - "completed_at": "2024-01-01T00:01:00Z", - "result": {"mock": "result"}, - "credits_used": 10 - } - ], - "total": 1, - "page": 1, - "page_size": 20 - } - elif path.endswith("/scheduled-jobs"): # List jobs endpoint - return { - "jobs": [ - { - "id": new_id("mock-job"), - "user_id": new_id("mock-user"), - "job_name": "Mock Scheduled Job", - "service_type": "smartscraper", - "cron_expression": "0 9 * * 1", - "job_config": {"mock": "config"}, - "is_active": True, - "created_at": "2024-01-01T00:00:00Z", - "updated_at": "2024-01-01T00:00:00Z", - "next_run_at": "2024-01-08T09:00:00Z" - } - ], - "total": 1, - "page": 1, - "page_size": 20 - } - else: # Single job endpoint - return { - "id": new_id("mock-job"), - "user_id": new_id("mock-user"), - "job_name": "Mock Scheduled Job", - "service_type": "smartscraper", - 
"cron_expression": "0 9 * * 1", - "job_config": {"mock": "config"}, - "is_active": True, - "created_at": "2024-01-01T00:00:00Z", - "updated_at": "2024-01-01T00:00:00Z", - "next_run_at": "2024-01-08T09:00:00Z" - } - - # Update operations (PATCH/PUT) - if upper_method in ["PATCH", "PUT"] and "scheduled-jobs" in path: - return { - "id": new_id("mock-job"), - "user_id": new_id("mock-user"), - "job_name": "Updated Mock Scheduled Job", - "service_type": "smartscraper", - "cron_expression": "0 10 * * 1", - "job_config": {"mock": "updated_config"}, - "is_active": True, - "created_at": "2024-01-01T00:00:00Z", - "updated_at": "2024-01-01T01:00:00Z", - "next_run_at": "2024-01-08T10:00:00Z" - } - - # Delete operations - if upper_method == "DELETE" and "scheduled-jobs" in path: - return {"message": "Scheduled job deleted successfully"} - - # Generic fallback - return {"status": "mock", "url": url, "method": method, "kwargs": kwargs} - - def markdownify(self, website_url: str, headers: Optional[dict[str, str]] = None, mock: bool = False, render_heavy_js: bool = False, stealth: bool = False, wait_ms: Optional[int] = None, return_toon: bool = False): - """Send a markdownify request - - Args: - website_url: The URL to convert to markdown - headers: Optional HTTP headers - mock: Enable mock mode for testing - render_heavy_js: Enable heavy JavaScript rendering - stealth: Enable stealth mode to avoid bot detection - wait_ms: Number of milliseconds to wait before scraping the website - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Starting markdownify request for {website_url}") - if headers: - logger.debug("๐Ÿ”ง Using custom headers") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if render_heavy_js: - logger.debug("โšก Heavy JavaScript rendering enabled") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - request = MarkdownifyRequest(website_url=website_url, headers=headers, mock=mock, 
render_heavy_js=render_heavy_js, stealth=stealth, wait_ms=wait_ms) - logger.debug("✅ Request validation passed") - - result = self._make_request( - "POST", f"{API_BASE_URL}/markdownify", json=request.model_dump() - ) - logger.info("✨ Markdownify request completed successfully") - return process_response_with_toon(result, return_toon) - - def get_markdownify(self, request_id: str, return_toon: bool = False): - """Get the result of a previous markdownify request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"🔍 Fetching markdownify result for request {request_id}") - if return_toon: - logger.debug("🎨 TOON format output enabled") - - # Validate input using Pydantic model - GetMarkdownifyRequest(request_id=request_id) - logger.debug("✅ Request ID validation passed") - - result = self._make_request("GET", f"{API_BASE_URL}/markdownify/{request_id}") - logger.info(f"✨ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - def scrape( - self, - website_url: str, - render_heavy_js: bool = False, - branding: bool = False, - headers: Optional[dict[str, str]] = None, - mock: bool = False, - stealth: bool = False, - wait_ms: Optional[int] = None, - return_toon: bool = False, - ): - """Send a scrape request to get HTML content from a website - - Args: - website_url: The URL of the website to get HTML from - render_heavy_js: Whether to render heavy JavaScript (defaults to False) - branding: Whether to include branding in the response (defaults to False) - headers: Optional headers to send with the request - mock: Enable mock mode for testing - stealth: Enable stealth mode to avoid bot detection - wait_ms: Number of milliseconds to wait before scraping the website - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"🔍 Starting scrape request for 
{website_url}") - logger.debug(f"๐Ÿ”ง Render heavy JS: {render_heavy_js}") - logger.debug(f"๐Ÿ”ง Branding: {branding}") - if headers: - logger.debug("๐Ÿ”ง Using custom headers") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - request = ScrapeRequest( - website_url=website_url, - render_heavy_js=render_heavy_js, - branding=branding, - headers=headers, - mock=mock, - stealth=stealth, - wait_ms=wait_ms, - ) - logger.debug("โœ… Request validation passed") - - result = self._make_request( - "POST", f"{API_BASE_URL}/scrape", json=request.model_dump() - ) - logger.info("โœจ Scrape request completed successfully") - return process_response_with_toon(result, return_toon) - - def get_scrape(self, request_id: str, return_toon: bool = False): - """Get the result of a previous scrape request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching scrape result for request {request_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Validate input using Pydantic model - GetScrapeRequest(request_id=request_id) - logger.debug("โœ… Request ID validation passed") - - result = self._make_request("GET", f"{API_BASE_URL}/scrape/{request_id}") - logger.info(f"โœจ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - def sitemap( - self, - website_url: str, - mock: bool = False, - ) -> SitemapResponse: - """Extract all URLs from a website's sitemap. - - Automatically discovers sitemap from robots.txt or common sitemap locations. 
- - Args: - website_url: The URL of the website to extract sitemap from - mock: Whether to use mock mode for this request - - Returns: - SitemapResponse: Object containing list of URLs extracted from sitemap - - Raises: - ValueError: If website_url is invalid - APIError: If the API request fails - - Examples: - >>> client = Client(api_key="your-api-key") - >>> response = client.sitemap("https://example.com") - >>> print(f"Found {len(response.urls)} URLs") - >>> for url in response.urls[:5]: - ... print(url) - """ - logger.info(f"🗺️ Starting sitemap extraction for {website_url}") - - request = SitemapRequest( - website_url=website_url, - mock=mock - ) - logger.debug("✅ Request validation passed") - - result = self._make_request( - "POST", f"{API_BASE_URL}/sitemap", json=request.model_dump() - ) - logger.info(f"✨ Sitemap extraction completed successfully - found {len(result.get('urls', []))} URLs") - - # Parse response into SitemapResponse model - return SitemapResponse(**result) - - def smartscraper( - self, - user_prompt: str, - website_url: Optional[str] = None, - website_html: Optional[str] = None, - website_markdown: Optional[str] = None, - headers: Optional[dict[str, str]] = None, - cookies: Optional[Dict[str, str]] = None, - output_schema: Optional[BaseModel] = None, - number_of_scrolls: Optional[int] = None, - total_pages: Optional[int] = None, - mock: bool = False, - plain_text: bool = False, - render_heavy_js: bool = False, - stealth: bool = False, - wait_ms: Optional[int] = None, - return_toon: bool = False, - ): - """ - Send a smartscraper request with optional pagination support and cookies. 
- - Supports three types of input (must provide exactly one): - - website_url: Scrape from a URL - - website_html: Process local HTML content - - website_markdown: Process local Markdown content - - Args: - user_prompt: Natural language prompt describing what to extract - website_url: URL to scrape (optional) - website_html: Raw HTML content to process (optional, max 2MB) - website_markdown: Markdown content to process (optional, max 2MB) - headers: Optional HTTP headers - cookies: Optional cookies for authentication - output_schema: Optional Pydantic model for structured output - number_of_scrolls: Number of times to scroll (0-100) - total_pages: Number of pages to scrape (1-10) - mock: Enable mock mode for testing - plain_text: Return plain text instead of structured data - render_heavy_js: Enable heavy JavaScript rendering - stealth: Enable stealth mode to avoid bot detection - wait_ms: Number of milliseconds to wait before scraping the website - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - - Returns: - Dictionary containing the scraping results, or TOON formatted string if return_toon=True - - Raises: - ValueError: If validation fails or invalid parameters provided - APIError: If the API request fails - """ - logger.info("๐Ÿ” Starting smartscraper request") - if website_url: - logger.debug(f"๐ŸŒ URL: {website_url}") - if website_html: - logger.debug("๐Ÿ“„ Using provided HTML content") - if website_markdown: - logger.debug("๐Ÿ“ Using provided Markdown content") - if headers: - logger.debug("๐Ÿ”ง Using custom headers") - if cookies: - logger.debug("๐Ÿช Using cookies for authentication/session management") - if number_of_scrolls is not None: - logger.debug(f"๐Ÿ”„ Number of scrolls: {number_of_scrolls}") - if total_pages is not None: - logger.debug(f"๐Ÿ“„ Total pages to scrape: {total_pages}") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if render_heavy_js: - logger.debug("โšก Heavy JavaScript rendering 
enabled") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - logger.debug(f"๐Ÿ“ Prompt: {user_prompt}") - - request = SmartScraperRequest( - website_url=website_url, - website_html=website_html, - website_markdown=website_markdown, - headers=headers, - cookies=cookies, - user_prompt=user_prompt, - output_schema=output_schema, - number_of_scrolls=number_of_scrolls, - total_pages=total_pages, - mock=mock, - plain_text=plain_text, - render_heavy_js=render_heavy_js, - stealth=stealth, - wait_ms=wait_ms, - ) - logger.debug("โœ… Request validation passed") - - result = self._make_request( - "POST", f"{API_BASE_URL}/smartscraper", json=request.model_dump() - ) - logger.info("โœจ Smartscraper request completed successfully") - return process_response_with_toon(result, return_toon) - - def get_smartscraper(self, request_id: str, return_toon: bool = False): - """Get the result of a previous smartscraper request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching smartscraper result for request {request_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Validate input using Pydantic model - GetSmartScraperRequest(request_id=request_id) - logger.debug("โœ… Request ID validation passed") - - result = self._make_request("GET", f"{API_BASE_URL}/smartscraper/{request_id}") - logger.info(f"โœจ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - def submit_feedback( - self, request_id: str, rating: int, feedback_text: Optional[str] = None - ): - """Submit feedback for a request""" - logger.info(f"๐Ÿ“ Submitting feedback for request {request_id}") - logger.debug(f"โญ Rating: {rating}, Feedback: {feedback_text}") - - feedback = FeedbackRequest( - request_id=request_id, rating=rating, feedback_text=feedback_text - ) - logger.debug("โœ… Feedback 
validation passed") - - result = self._make_request( - "POST", f"{API_BASE_URL}/feedback", json=feedback.model_dump() - ) - logger.info("✨ Feedback submitted successfully") - return result - - def get_credits(self): - """Get credits information""" - logger.info("💳 Fetching credits information") - - result = self._make_request( - "GET", - f"{API_BASE_URL}/credits", - ) - logger.info( - f"✨ Credits info retrieved: {result.get('remaining_credits')} " - f"credits remaining" - ) - return result - - def healthz(self): - """Check the health status of the service - - This endpoint is useful for monitoring and ensuring the service is operational. - It returns a JSON response indicating the service's health status. - - Returns: - dict: Health status information - - Example: - >>> client = Client.from_env() - >>> health = client.healthz() - >>> print(health) - """ - logger.info("🏥 Checking service health") - - result = self._make_request( - "GET", - f"{API_BASE_URL}/healthz", - ) - logger.info("✨ Health check completed successfully") - return result - - def searchscraper( - self, - user_prompt: str, - num_results: Optional[int] = 3, - headers: Optional[dict[str, str]] = None, - output_schema: Optional[BaseModel] = None, - extraction_mode: bool = True, - mock: bool = False, - stealth: bool = False, - location_geo_code: Optional[str] = None, - time_range: Optional[TimeRange] = None, - return_toon: bool = False, - ): - """Send a searchscraper request - - Args: - user_prompt: The search prompt string - num_results: Number of websites to scrape (3-20). Default is 3. - More websites provide better research depth but cost more - credits. Credit calculation: 30 base + 10 per additional - website beyond 3. - headers: Optional headers to send with the request - output_schema: Optional schema to structure the output - extraction_mode: Whether to use AI extraction (True) or markdown conversion (False). 
- AI extraction costs 10 credits per page, markdown conversion costs 2 credits per page. - mock: Enable mock mode for testing - stealth: Enable stealth mode to avoid bot detection - location_geo_code: Optional geo code of the location to search in (e.g., "us") - time_range: Optional time range filter for search results (e.g., TimeRange.PAST_WEEK) - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info("๐Ÿ” Starting searchscraper request") - logger.debug(f"๐Ÿ“ Prompt: {user_prompt}") - logger.debug(f"๐ŸŒ Number of results: {num_results}") - logger.debug(f"๐Ÿค– Extraction mode: {'AI extraction' if extraction_mode else 'Markdown conversion'}") - if headers: - logger.debug("๐Ÿ”ง Using custom headers") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if location_geo_code: - logger.debug(f"๐ŸŒ Location geo code: {location_geo_code}") - if time_range: - logger.debug(f"๐Ÿ“… Time range: {time_range.value}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - request = SearchScraperRequest( - user_prompt=user_prompt, - num_results=num_results, - headers=headers, - output_schema=output_schema, - extraction_mode=extraction_mode, - mock=mock, - stealth=stealth, - location_geo_code=location_geo_code, - time_range=time_range, - ) - logger.debug("โœ… Request validation passed") - - result = self._make_request( - "POST", f"{API_BASE_URL}/searchscraper", json=request.model_dump() - ) - logger.info("โœจ Searchscraper request completed successfully") - return process_response_with_toon(result, return_toon) - - def get_searchscraper(self, request_id: str, return_toon: bool = False): - """Get the result of a previous searchscraper request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching searchscraper result for request {request_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON 
format output enabled") - - # Validate input using Pydantic model - GetSearchScraperRequest(request_id=request_id) - logger.debug("✅ Request ID validation passed") - - result = self._make_request("GET", f"{API_BASE_URL}/searchscraper/{request_id}") - logger.info(f"✨ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - def crawl( - self, - url: str, - prompt: Optional[str] = None, - data_schema: Optional[Dict[str, Any]] = None, - extraction_mode: bool = True, - cache_website: bool = True, - depth: int = 2, - breadth: Optional[int] = None, - max_pages: int = 2, - same_domain_only: bool = True, - batch_size: Optional[int] = None, - sitemap: bool = False, - headers: Optional[dict[str, str]] = None, - render_heavy_js: bool = False, - stealth: bool = False, - include_paths: Optional[list[str]] = None, - exclude_paths: Optional[list[str]] = None, - webhook_url: Optional[str] = None, - wait_ms: Optional[int] = None, - return_toon: bool = False, - ): - """Send a crawl request with support for both AI extraction and - markdown conversion modes - - Args: - url: The starting URL to crawl - prompt: AI prompt for data extraction (required for AI extraction mode) - data_schema: Schema for structured output - extraction_mode: Whether to use AI extraction (True) or markdown (False) - cache_website: Whether to cache the website - depth: Maximum depth of link traversal - breadth: Maximum number of links to crawl per depth level. If None, unlimited (default). - Controls the 'width' of exploration at each depth. Useful for limiting crawl scope - on large sites. Note: max_pages always takes priority. Ignored when sitemap=True. 
- max_pages: Maximum number of pages to crawl - same_domain_only: Only crawl pages within the same domain - batch_size: Number of pages to process in batch - sitemap: Use sitemap for crawling - headers: Optional HTTP headers - render_heavy_js: Enable heavy JavaScript rendering - stealth: Enable stealth mode to avoid bot detection - include_paths: List of path patterns to include (e.g., ['/products/*', '/blog/**']) - Supports wildcards: * matches any characters, ** matches any path segments - exclude_paths: List of path patterns to exclude (e.g., ['/admin/*', '/api/*']) - Supports wildcards and takes precedence over include_paths - webhook_url: URL to receive webhook notifications when the crawl completes - wait_ms: Number of milliseconds to wait before scraping each page - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info("๐Ÿ” Starting crawl request") - logger.debug(f"๐ŸŒ URL: {url}") - logger.debug( - f"๐Ÿค– Extraction mode: {'AI' if extraction_mode else 'Markdown conversion'}" - ) - if extraction_mode: - logger.debug(f"๐Ÿ“ Prompt: {prompt}") - logger.debug(f"๐Ÿ“Š Schema provided: {bool(data_schema)}") - else: - logger.debug( - "๐Ÿ“„ Markdown conversion mode - no AI processing, 2 credits per page" - ) - logger.debug(f"๐Ÿ’พ Cache website: {cache_website}") - logger.debug(f"๐Ÿ” Depth: {depth}") - if breadth is not None: - logger.debug(f"๐Ÿ“ Breadth: {breadth}") - logger.debug(f"๐Ÿ“„ Max pages: {max_pages}") - logger.debug(f"๐Ÿ  Same domain only: {same_domain_only}") - logger.debug(f"๐Ÿ—บ๏ธ Use sitemap: {sitemap}") - if stealth: - logger.debug("๐Ÿฅท Stealth mode enabled") - if render_heavy_js: - logger.debug("โšก Heavy JavaScript rendering enabled") - if batch_size is not None: - logger.debug(f"๐Ÿ“ฆ Batch size: {batch_size}") - if include_paths: - logger.debug(f"โœ… Include paths: {include_paths}") - if exclude_paths: - logger.debug(f"โŒ Exclude paths: {exclude_paths}") - if webhook_url: - 
logger.debug(f"🔔 Webhook URL: {webhook_url}") - if wait_ms is not None: - logger.debug(f"⏱️ Wait ms: {wait_ms}") - if return_toon: - logger.debug("🎨 TOON format output enabled") - - # Build request data, excluding None values - request_data = { - "url": url, - "extraction_mode": extraction_mode, - "cache_website": cache_website, - "depth": depth, - "max_pages": max_pages, - "same_domain_only": same_domain_only, - "sitemap": sitemap, - "render_heavy_js": render_heavy_js, - "stealth": stealth, - } - - # Add optional parameters only if provided - if prompt is not None: - request_data["prompt"] = prompt - if data_schema is not None: - request_data["data_schema"] = data_schema - if breadth is not None: - request_data["breadth"] = breadth - if batch_size is not None: - request_data["batch_size"] = batch_size - if headers is not None: - request_data["headers"] = headers - if include_paths is not None: - request_data["include_paths"] = include_paths - if exclude_paths is not None: - request_data["exclude_paths"] = exclude_paths - if webhook_url is not None: - request_data["webhook_url"] = webhook_url - if wait_ms is not None: - request_data["wait_ms"] = wait_ms - - request = CrawlRequest(**request_data) - logger.debug("✅ Request validation passed") - - # Serialize the request, excluding None values - request_json = request.model_dump(exclude_none=True) - result = self._make_request("POST", f"{API_BASE_URL}/crawl", json=request_json) - logger.info("✨ Crawl request completed successfully") - return process_response_with_toon(result, return_toon) - - def get_crawl(self, crawl_id: str, return_toon: bool = False): - """Get the result of a previous crawl request - - Args: - crawl_id: The crawl ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"🔍 Fetching crawl result for request {crawl_id}") - if return_toon: - logger.debug("🎨 TOON format output enabled") - - # Validate input using Pydantic 
model - GetCrawlRequest(crawl_id=crawl_id) - logger.debug("✅ Request ID validation passed") - - result = self._make_request("GET", f"{API_BASE_URL}/crawl/{crawl_id}") - logger.info(f"✨ Successfully retrieved result for request {crawl_id}") - return process_response_with_toon(result, return_toon) - - def agenticscraper( - self, - url: str, - steps: list[str], - use_session: bool = True, - user_prompt: Optional[str] = None, - output_schema: Optional[Dict[str, Any]] = None, - ai_extraction: bool = False, - mock: bool=False, - stealth: bool=False, - return_toon: bool = False, - ): - """Send an agentic scraper request to perform automated actions on a webpage - - Args: - url: The URL to scrape - steps: List of steps to perform on the webpage - use_session: Whether to use session for the scraping (default: True) - user_prompt: Prompt for AI extraction (required when ai_extraction=True) - output_schema: Schema for structured data extraction (optional, used with ai_extraction=True) - ai_extraction: Whether to use AI for data extraction from the scraped content (default: False) - mock: Enable mock mode for testing - stealth: Enable stealth mode to avoid bot detection - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"🤖 Starting agentic scraper request for {url}") - logger.debug(f"🔧 Use session: {use_session}") - logger.debug(f"📋 Steps: {steps}") - logger.debug(f"🧠 AI extraction: {ai_extraction}") - if ai_extraction: - logger.debug(f"💭 User prompt: {user_prompt}") - logger.debug(f"📋 Output schema provided: {output_schema is not None}") - if stealth: - logger.debug("🥷 Stealth mode enabled") - if return_toon: - logger.debug("🎨 TOON format output enabled") - - request = AgenticScraperRequest( - url=url, - steps=steps, - use_session=use_session, - user_prompt=user_prompt, - output_schema=output_schema, - ai_extraction=ai_extraction, - mock=mock, - stealth=stealth - ) - logger.debug("✅ Request 
validation passed") - - result = self._make_request( - "POST", f"{API_BASE_URL}/agentic-scrapper", json=request.model_dump() - ) - logger.info("โœจ Agentic scraper request completed successfully") - return process_response_with_toon(result, return_toon) - - def get_agenticscraper(self, request_id: str, return_toon: bool = False): - """Get the result of a previous agentic scraper request - - Args: - request_id: The request ID to fetch - return_toon: If True, return response in TOON format (reduces token usage by 30-60%) - """ - logger.info(f"๐Ÿ” Fetching agentic scraper result for request {request_id}") - if return_toon: - logger.debug("๐ŸŽจ TOON format output enabled") - - # Validate input using Pydantic model - GetAgenticScraperRequest(request_id=request_id) - logger.debug("โœ… Request ID validation passed") - - result = self._make_request("GET", f"{API_BASE_URL}/agentic-scrapper/{request_id}") - logger.info(f"โœจ Successfully retrieved result for request {request_id}") - return process_response_with_toon(result, return_toon) - - def generate_schema( - self, - user_prompt: str, - existing_schema: Optional[Dict[str, Any]] = None, - ): - """Generate a JSON schema from a user prompt - - Args: - user_prompt: The user's search query to be refined into a schema - existing_schema: Optional existing JSON schema to modify/extend - """ - logger.info("๐Ÿ”ง Starting schema generation request") - logger.debug(f"๐Ÿ’ญ User prompt: {user_prompt}") - if existing_schema: - logger.debug(f"๐Ÿ“‹ Existing schema provided: {existing_schema is not None}") - - request = GenerateSchemaRequest( - user_prompt=user_prompt, - existing_schema=existing_schema, - ) - logger.debug("โœ… Request validation passed") - - result = self._make_request( - "POST", f"{API_BASE_URL}/generate_schema", json=request.model_dump() - ) - logger.info("โœจ Schema generation request completed successfully") - return result - - def get_schema_status(self, request_id: str): - """Get the status of a schema generation 
request - - Args: - request_id: The request ID returned from generate_schema - """ - logger.info(f"๐Ÿ” Fetching schema generation status for request {request_id}") - - # Validate input using Pydantic model - GetSchemaStatusRequest(request_id=request_id) - logger.debug("โœ… Request ID validation passed") - - result = self._make_request("GET", f"{API_BASE_URL}/generate_schema/{request_id}") - logger.info(f"โœจ Successfully retrieved schema status for request {request_id}") - return result - - def create_scheduled_job( - self, - job_name: str, - service_type: str, - cron_expression: str, - job_config: dict, - is_active: bool = True, - ): - """Create a new scheduled job""" - logger.info(f"๐Ÿ“… Creating scheduled job: {job_name}") - - request = ScheduledJobCreate( - job_name=job_name, - service_type=service_type, - cron_expression=cron_expression, - job_config=job_config, - is_active=is_active, - ) - - result = self._make_request( - "POST", f"{API_BASE_URL}/scheduled-jobs", json=request.model_dump() - ) - logger.info("โœจ Scheduled job created successfully") - return result - - def get_scheduled_jobs( - self, - page: int = 1, - page_size: int = 20, - service_type: Optional[str] = None, - is_active: Optional[bool] = None, - ): - """Get list of scheduled jobs with pagination""" - logger.info("๐Ÿ“‹ Fetching scheduled jobs") - - GetScheduledJobsRequest( - page=page, - page_size=page_size, - service_type=service_type, - is_active=is_active, - ) - - params = {"page": page, "page_size": page_size} - if service_type: - params["service_type"] = service_type - if is_active is not None: - params["is_active"] = is_active - - result = self._make_request("GET", f"{API_BASE_URL}/scheduled-jobs", params=params) - logger.info(f"โœจ Successfully retrieved {len(result.get('jobs', []))} scheduled jobs") - return result - - def get_scheduled_job(self, job_id: str): - """Get details of a specific scheduled job""" - logger.info(f"๐Ÿ” Fetching scheduled job {job_id}") - - 
GetScheduledJobRequest(job_id=job_id) - - result = self._make_request("GET", f"{API_BASE_URL}/scheduled-jobs/{job_id}") - logger.info(f"✨ Successfully retrieved scheduled job {job_id}") - return result - - def update_scheduled_job( - self, - job_id: str, - job_name: Optional[str] = None, - cron_expression: Optional[str] = None, - job_config: Optional[dict] = None, - is_active: Optional[bool] = None, - ): - """Update an existing scheduled job (partial update)""" - logger.info(f"📝 Updating scheduled job {job_id}") - - update_data = {} - if job_name is not None: - update_data["job_name"] = job_name - if cron_expression is not None: - update_data["cron_expression"] = cron_expression - if job_config is not None: - update_data["job_config"] = job_config - if is_active is not None: - update_data["is_active"] = is_active - - ScheduledJobUpdate(**update_data) - - result = self._make_request( - "PATCH", f"{API_BASE_URL}/scheduled-jobs/{job_id}", json=update_data - ) - logger.info(f"✨ Successfully updated scheduled job {job_id}") - return result - - def replace_scheduled_job( - self, - job_id: str, - job_name: str, - cron_expression: str, - job_config: dict, - is_active: bool = True, - ): - """Replace an existing scheduled job (full update)""" - logger.info(f"🔄 Replacing scheduled job {job_id}") - - request_data = { - "job_name": job_name, - "cron_expression": cron_expression, - "job_config": job_config, - "is_active": is_active, - } - - result = self._make_request( - "PUT", f"{API_BASE_URL}/scheduled-jobs/{job_id}", json=request_data - ) - logger.info(f"✨ Successfully replaced scheduled job {job_id}") - return result - - def delete_scheduled_job(self, job_id: str): - """Delete a scheduled job""" - logger.info(f"🗑️ Deleting scheduled job {job_id}") - - JobActionRequest(job_id=job_id) - - result = self._make_request("DELETE", f"{API_BASE_URL}/scheduled-jobs/{job_id}") - logger.info(f"✨ Successfully deleted scheduled job {job_id}") - return result - - def 
pause_scheduled_job(self, job_id: str): - """Pause a scheduled job""" - logger.info(f"⏸️ Pausing scheduled job {job_id}") - - JobActionRequest(job_id=job_id) - - result = self._make_request("POST", f"{API_BASE_URL}/scheduled-jobs/{job_id}/pause") - logger.info(f"✨ Successfully paused scheduled job {job_id}") - return result - - def resume_scheduled_job(self, job_id: str): - """Resume a paused scheduled job""" - logger.info(f"▶️ Resuming scheduled job {job_id}") - - JobActionRequest(job_id=job_id) - - result = self._make_request("POST", f"{API_BASE_URL}/scheduled-jobs/{job_id}/resume") - logger.info(f"✨ Successfully resumed scheduled job {job_id}") - return result - - def trigger_scheduled_job(self, job_id: str): - """Manually trigger a scheduled job""" - logger.info(f"🚀 Manually triggering scheduled job {job_id}") - - TriggerJobRequest(job_id=job_id) - - result = self._make_request("POST", f"{API_BASE_URL}/scheduled-jobs/{job_id}/trigger") - logger.info(f"✨ Successfully triggered scheduled job {job_id}") - return result - - def get_job_executions( - self, - job_id: str, - page: int = 1, - page_size: int = 20, - status: Optional[str] = None, - ): - """Get execution history for a scheduled job""" - logger.info(f"📊 Fetching execution history for job {job_id}") - - GetJobExecutionsRequest( - job_id=job_id, - page=page, - page_size=page_size, - status=status, - ) - - params = {"page": page, "page_size": page_size} - if status: - params["status"] = status - - result = self._make_request( - "GET", f"{API_BASE_URL}/scheduled-jobs/{job_id}/executions", params=params - ) - logger.info(f"✨ Successfully retrieved execution history for job {job_id}") - return result - - def close(self): - """Close the session to free up resources""" - logger.info("🔒 Closing Client session") - self.session.close() - logger.debug("✅ Session closed successfully") - - def __enter__(self): - return self - - def __exit__(self, exc_type, exc_val, exc_tb): - self.close() diff 
--git a/scrapegraph-py/scrapegraph_py/config.py b/scrapegraph-py/scrapegraph_py/config.py deleted file mode 100644 index e7ca1789..00000000 --- a/scrapegraph-py/scrapegraph_py/config.py +++ /dev/null @@ -1,15 +0,0 @@ -""" -Configuration and constants for the ScrapeGraphAI SDK. - -This module contains API configuration settings including the base URL -and default headers used for all API requests. - -Attributes: - API_BASE_URL (str): Base URL for the ScrapeGraphAI API endpoints - DEFAULT_HEADERS (dict): Default HTTP headers for API requests -""" -API_BASE_URL = "https://api.scrapegraphai.com/v1" -DEFAULT_HEADERS = { - "accept": "application/json", - "Content-Type": "application/json", -} diff --git a/scrapegraph-py/scrapegraph_py/exceptions.py b/scrapegraph-py/scrapegraph_py/exceptions.py deleted file mode 100644 index 3cfed7dd..00000000 --- a/scrapegraph-py/scrapegraph_py/exceptions.py +++ /dev/null @@ -1,30 +0,0 @@ -""" -Custom exceptions for the ScrapeGraphAI SDK. - -This module defines custom exception classes used throughout the SDK -for handling API errors and other exceptional conditions. -""" - - -class APIError(Exception): - """ - Exception raised for API errors. - - This exception is raised when the API returns an error response, - providing both the error message and HTTP status code for debugging. - - Attributes: - message (str): The error message from the API - status_code (int): HTTP status code of the error response - - Example: - >>> try: - ... client.smartscraper(website_url="invalid") - ... except APIError as e: - ... 
print(f"API error {e.status_code}: {e.message}") - """ - - def __init__(self, message: str, status_code: int = None): - self.status_code = status_code - self.message = message - super().__init__(f"[{status_code}] {message}") diff --git a/scrapegraph-py/scrapegraph_py/logger.py b/scrapegraph-py/scrapegraph_py/logger.py deleted file mode 100644 index 36ce7f17..00000000 --- a/scrapegraph-py/scrapegraph_py/logger.py +++ /dev/null @@ -1,197 +0,0 @@ -""" -Logging utilities for the ScrapeGraphAI SDK. - -This module provides a custom logging system with emoji support and -configurable output for debugging and monitoring SDK operations. - -The logger can be enabled/disabled dynamically and supports both -console and file output with customizable formatting. - -Example: - Enable logging: - >>> from scrapegraph_py.logger import sgai_logger - >>> sgai_logger.set_logging(level="DEBUG", log_file="scraping.log") - - Disable logging: - >>> sgai_logger.disable() -""" -import logging -import logging.handlers -from typing import Dict, Optional - -# Emoji mappings for different log levels -LOG_EMOJIS: Dict[int, str] = { - logging.DEBUG: "🐛", - logging.INFO: "💬", - logging.WARNING: "⚠️", - logging.ERROR: "❌", - logging.CRITICAL: "🚨", -} - - -class EmojiFormatter(logging.Formatter): - """ - Custom log formatter that adds emojis to log messages. - - This formatter enhances log messages by prepending relevant emojis - based on the log level, making logs more visually distinctive. - - The emoji is added to the log record before formatting. - """ - - def format(self, record: logging.LogRecord) -> str: - """ - Format the log record with an emoji prefix. - - Args: - record: The log record to format - - Returns: - Formatted log string with emoji prefix - """ - # Add emoji based on log level - emoji = LOG_EMOJIS.get(record.levelno, "") - record.emoji = emoji - return super().format(record) - - -class ScrapegraphLogger: - """ - Singleton logger manager for the ScrapeGraphAI SDK. 
- - This class manages SDK-wide logging configuration, providing methods - to enable, disable, and configure logging behavior. It implements the - singleton pattern to ensure consistent logging across the SDK. - - Attributes: - logger (logging.Logger): The underlying Python logger instance - enabled (bool): Whether logging is currently enabled - - Example: - >>> logger = ScrapegraphLogger() - >>> logger.set_logging(level="INFO", log_file="api.log") - >>> logger.info("Starting API request") - """ - - _instance = None - _initialized = False - - def __new__(cls): - if cls._instance is None: - cls._instance = super(ScrapegraphLogger, cls).__new__(cls) - return cls._instance - - def __init__(self): - if not self._initialized: - self.logger = logging.getLogger("scrapegraph") - self.logger.setLevel(logging.INFO) - self.enabled = False - self._initialized = True - - def set_logging( - self, - level: Optional[str] = None, - log_file: Optional[str] = None, - log_format: Optional[str] = None, - ) -> None: - """ - Configure logging settings. If level is None, logging will be disabled. - - Args: - level: Logging level (e.g., 'DEBUG', 'INFO'). None to disable logging. 
- log_file: Optional file path to write logs to - log_format: Optional custom log format string - """ - # Clear existing handlers - self.logger.handlers.clear() - - if level is None: - # Disable logging - self.enabled = False - return - - # Enable logging with specified level - self.enabled = True - level = getattr(logging, level.upper(), logging.INFO) - self.logger.setLevel(level) - - # Default format if none provided - if not log_format: - log_format = "%(emoji)s %(asctime)-15s %(message)s" - - formatter = EmojiFormatter(log_format) - - # Console handler - console_handler = logging.StreamHandler() - console_handler.setFormatter(formatter) - self.logger.addHandler(console_handler) - - # File handler if log_file specified - if log_file: - file_handler = logging.FileHandler(log_file) - file_handler.setFormatter(formatter) - self.logger.addHandler(file_handler) - - def disable(self) -> None: - """ - Disable all logging. - - Clears all handlers and sets enabled flag to False, effectively - silencing all log output from the SDK. - """ - self.logger.handlers.clear() - self.enabled = False - - def debug(self, message: str) -> None: - """ - Log debug message if logging is enabled. - - Args: - message: The debug message to log - """ - if self.enabled: - self.logger.debug(message) - - def info(self, message: str) -> None: - """ - Log info message if logging is enabled. - - Args: - message: The info message to log - """ - if self.enabled: - self.logger.info(message) - - def warning(self, message: str) -> None: - """ - Log warning message if logging is enabled. - - Args: - message: The warning message to log - """ - if self.enabled: - self.logger.warning(message) - - def error(self, message: str) -> None: - """ - Log error message if logging is enabled. - - Args: - message: The error message to log - """ - if self.enabled: - self.logger.error(message) - - def critical(self, message: str) -> None: - """ - Log critical message if logging is enabled. 
- - Args: - message: The critical message to log - """ - if self.enabled: - self.logger.critical(message) - - -# Default logger instance -sgai_logger = ScrapegraphLogger() diff --git a/scrapegraph-py/scrapegraph_py/models/__init__.py b/scrapegraph-py/scrapegraph_py/models/__init__.py deleted file mode 100644 index 1f374b8e..00000000 --- a/scrapegraph-py/scrapegraph_py/models/__init__.py +++ /dev/null @@ -1,57 +0,0 @@ -""" -Pydantic models for all ScrapeGraphAI API endpoints. - -This module provides request and response models for validating and -structuring data for all API operations. All models use Pydantic for -data validation and serialization. - -Available Models: - - AgenticScraperRequest, GetAgenticScraperRequest: Agentic scraper operations - - CrawlRequest, GetCrawlRequest: Website crawling operations - - FeedbackRequest: User feedback submission - - ScrapeRequest, GetScrapeRequest: Basic HTML scraping - - MarkdownifyRequest, GetMarkdownifyRequest: Markdown conversion - - SearchScraperRequest, GetSearchScraperRequest: Web research - - SmartScraperRequest, GetSmartScraperRequest: AI-powered scraping - - GenerateSchemaRequest, GetSchemaStatusRequest: Schema generation - - ScheduledJob models: Job scheduling and management - -Example: - >>> from scrapegraph_py.models import SmartScraperRequest - >>> request = SmartScraperRequest( - ... website_url="https://example.com", - ... user_prompt="Extract product info" - ... 
) -""" - -from .agenticscraper import AgenticScraperRequest, GetAgenticScraperRequest -from .crawl import CrawlRequest, GetCrawlRequest -from .feedback import FeedbackRequest -from .scrape import GetScrapeRequest, ScrapeRequest -from .markdownify import GetMarkdownifyRequest, MarkdownifyRequest -from .searchscraper import GetSearchScraperRequest, SearchScraperRequest, TimeRange -from .sitemap import SitemapRequest, SitemapResponse -from .smartscraper import GetSmartScraperRequest, SmartScraperRequest -from .schema import GenerateSchemaRequest, GetSchemaStatusRequest, SchemaGenerationResponse - -__all__ = [ - "AgenticScraperRequest", - "GetAgenticScraperRequest", - "CrawlRequest", - "GetCrawlRequest", - "FeedbackRequest", - "GetScrapeRequest", - "ScrapeRequest", - "GetMarkdownifyRequest", - "MarkdownifyRequest", - "GetSearchScraperRequest", - "SearchScraperRequest", - "TimeRange", - "SitemapRequest", - "SitemapResponse", - "GetSmartScraperRequest", - "SmartScraperRequest", - "GenerateSchemaRequest", - "GetSchemaStatusRequest", - "SchemaGenerationResponse", -] diff --git a/scrapegraph-py/scrapegraph_py/models/agenticscraper.py b/scrapegraph-py/scrapegraph_py/models/agenticscraper.py deleted file mode 100644 index 93b6234c..00000000 --- a/scrapegraph-py/scrapegraph_py/models/agenticscraper.py +++ /dev/null @@ -1,148 +0,0 @@ -""" -Pydantic models for the Agentic Scraper API endpoint. - -This module defines request and response models for the Agentic Scraper endpoint, -which performs automated browser interactions and optional AI data extraction. - -The Agentic Scraper can: -- Execute a sequence of browser actions (click, type, scroll, etc.) 
-- Handle authentication flows and form submissions -- Optionally extract structured data using AI after interactions -- Maintain browser sessions across multiple steps -""" - -from typing import Any, Dict, List, Optional -from uuid import UUID - -from pydantic import BaseModel, Field, model_validator - - -class AgenticScraperRequest(BaseModel): - """ - Request model for the Agentic Scraper endpoint. - - This model validates and structures requests for automated browser - interactions with optional AI extraction. - - Attributes: - url: The starting URL for the scraping session - use_session: Whether to maintain browser session across steps - steps: List of actions to perform (e.g., "Type email@example.com in email input") - user_prompt: Optional prompt for AI extraction (required if ai_extraction=True) - output_schema: Optional schema for structured data extraction - ai_extraction: Whether to use AI for data extraction after interactions - headers: Optional HTTP headers - mock: Whether to use mock mode for testing - render_heavy_js: Whether to render heavy JavaScript - - Example: - >>> request = AgenticScraperRequest( - ... url="https://dashboard.example.com", - ... steps=[ - ... "Type user@example.com in email input", - ... "Type password123 in password input", - ... "Click login button" - ... ], - ... ai_extraction=True, - ... user_prompt="Extract user dashboard information" - ... 
) - """ - url: str = Field( - ..., - example="https://dashboard.scrapegraphai.com/", - description="The URL to scrape" - ) - use_session: bool = Field( - default=True, - description="Whether to use session for the scraping" - ) - steps: List[str] = Field( - ..., - example=[ - "Type email@gmail.com in email input box", - "Type test-password@123 in password inputbox", - "click on login" - ], - description="List of steps to perform on the webpage" - ) - user_prompt: Optional[str] = Field( - default=None, - example="Extract user information and available dashboard sections", - description="Prompt for AI extraction (only used when ai_extraction=True)" - ) - output_schema: Optional[Dict[str, Any]] = Field( - default=None, - example={ - "user_info": { - "type": "object", - "properties": { - "username": {"type": "string"}, - "email": {"type": "string"}, - "dashboard_sections": {"type": "array", "items": {"type": "string"}} - } - } - }, - description="Schema for structured data extraction (only used when ai_extraction=True)" - ) - ai_extraction: bool = Field( - default=False, - description="Whether to use AI for data extraction from the scraped content" - ) - headers: Optional[dict[str, str]] = Field( - None, - example={ - "User-Agent": "scrapegraph-py", - "Cookie": "cookie1=value1; cookie2=value2", - }, - description="Optional headers to send with the request, including cookies " - "and user agent", - ) - mock: bool = Field(default=False, description="Whether to use mock mode for the request") - render_heavy_js: bool = Field(default=False, description="Whether to render heavy JavaScript on the page") - stealth: bool = Field(default=False, description="Enable stealth mode to avoid bot detection") - - @model_validator(mode="after") - def validate_url(self) -> "AgenticScraperRequest": - if not self.url.strip(): - raise ValueError("URL cannot be empty") - if not ( - self.url.startswith("http://") - or self.url.startswith("https://") - ): - raise ValueError("Invalid URL - must 
start with http:// or https://") - return self - - @model_validator(mode="after") - def validate_steps(self) -> "AgenticScraperRequest": - if not self.steps: - raise ValueError("Steps cannot be empty") - if any(not step.strip() for step in self.steps): - raise ValueError("All steps must contain valid instructions") - return self - - @model_validator(mode="after") - def validate_ai_extraction(self) -> "AgenticScraperRequest": - if self.ai_extraction: - if not self.user_prompt or not self.user_prompt.strip(): - raise ValueError("user_prompt is required when ai_extraction=True") - return self - - def model_dump(self, *args, **kwargs) -> dict: - # Set exclude_none=True to exclude None values from serialization - kwargs.setdefault("exclude_none", True) - return super().model_dump(*args, **kwargs) - - -class GetAgenticScraperRequest(BaseModel): - """Request model for get_agenticscraper endpoint""" - - request_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000") - - @model_validator(mode="after") - def validate_request_id(self) -> "GetAgenticScraperRequest": - try: - # Validate the request_id is a valid UUID - UUID(self.request_id) - except ValueError: - raise ValueError("request_id must be a valid UUID") - return self diff --git a/scrapegraph-py/scrapegraph_py/models/crawl.py b/scrapegraph-py/scrapegraph_py/models/crawl.py deleted file mode 100644 index dd6cca99..00000000 --- a/scrapegraph-py/scrapegraph_py/models/crawl.py +++ /dev/null @@ -1,219 +0,0 @@ -# Models for crawl endpoint - -from typing import Any, Dict, Optional -from uuid import UUID - -from pydantic import BaseModel, Field, conint, model_validator - - -class CrawlRequest(BaseModel): - """ - Request model for the crawl endpoint. - - The crawl endpoint supports two modes: - 1. AI Extraction Mode (extraction_mode=True): Uses AI to extract structured data - 2. 
Markdown Conversion Mode (extraction_mode=False): Converts pages to markdown (80% cheaper) - - Sitemap Support: - - When sitemap=True, the crawler uses sitemap.xml for better page discovery - - Recommended for structured websites (e-commerce, news sites, blogs) - - Provides more comprehensive crawling coverage - - Works with both AI extraction and markdown conversion modes - - Path Filtering: - - include_paths: Specify which paths to crawl (e.g., ['/products/*', '/blog/**']) - - exclude_paths: Specify which paths to skip (e.g., ['/admin/*', '/api/*']) - - Supports wildcards: * (any characters), ** (any path segments) - - exclude_paths takes precedence over include_paths - """ - url: str = Field( - ..., - example="https://scrapegraphai.com/", - description="The starting URL for the crawl", - ) - extraction_mode: bool = Field( - default=True, - description="True for AI extraction mode, False for markdown conversion " - "mode (no AI/LLM processing)", - ) - prompt: Optional[str] = Field( - default=None, - example="What does the company do? and I need text content from there " - "privacy and terms", - description="The prompt to guide the crawl and extraction (required when " - "extraction_mode=True)", - ) - data_schema: Optional[Dict[str, Any]] = Field( - default=None, - description="JSON schema defining the structure of the extracted data " - "(required when extraction_mode=True)", - ) - cache_website: bool = Field( - default=True, description="Whether to cache the website content" - ) - depth: conint(ge=1, le=10) = Field( - default=2, description="Maximum depth of the crawl (1-10)" - ) - breadth: Optional[conint(ge=1)] = Field( - default=None, - description="Maximum number of links to crawl per depth level. " - "If None, unlimited (default). Controls the 'width' of exploration at each depth. " - "Useful for limiting crawl scope on large sites. 
Note: max_pages always takes priority - " - "the total crawled pages will never exceed max_pages regardless of breadth setting. " - "Ignored when sitemap=True (sitemap mode uses sitemap URLs directly instead of link discovery).", - ) - max_pages: conint(ge=1, le=100) = Field( - default=2, description="Maximum number of pages to crawl (1-100)" - ) - same_domain_only: bool = Field( - default=True, description="Whether to only crawl pages from the same domain" - ) - batch_size: Optional[conint(ge=1, le=10)] = Field( - default=None, description="Batch size for processing pages (1-10)" - ) - sitemap: bool = Field( - default=False, - description="Whether to use sitemap.xml for better page discovery and more comprehensive crawling. " - "When enabled, the crawler will use the website's sitemap.xml to discover pages more efficiently, " - "providing better coverage for structured websites like e-commerce sites, news portals, and content-heavy websites." - ) - headers: Optional[dict[str, str]] = Field( - None, - example={ - "User-Agent": "scrapegraph-py", - "Cookie": "cookie1=value1; cookie2=value2", - }, - description="Optional headers to send with the request, including cookies " - "and user agent", - ) - render_heavy_js: bool = Field(default=False, description="Whether to render heavy JavaScript on the page") - stealth: bool = Field(default=False, description="Enable stealth mode to avoid bot detection") - include_paths: Optional[list[str]] = Field( - default=None, - description="List of path patterns to include (e.g., ['/products/*', '/blog/**']). " - "Supports wildcards: * matches any characters, ** matches any path segments. " - "If empty, all paths are included.", - example=["/products/*", "/blog/**"] - ) - exclude_paths: Optional[list[str]] = Field( - default=None, - description="List of path patterns to exclude (e.g., ['/admin/*', '/api/*']). " - "Supports wildcards: * matches any characters, ** matches any path segments. 
" - "Takes precedence over include_paths.", - example=["/admin/*", "/api/**"] - ) - webhook_url: Optional[str] = Field( - default=None, - description="URL to receive webhook notifications when the crawl job completes. " - "The webhook will receive a POST request with the crawl results.", - example="https://example.com/webhook" - ) - wait_ms: Optional[int] = Field( - default=None, - description="Number of milliseconds to wait before scraping each page. " - "Useful for pages with heavy JavaScript rendering that need extra time to load.", - ) - - @model_validator(mode="after") - def validate_url(self) -> "CrawlRequest": - if not self.url.strip(): - raise ValueError("URL cannot be empty") - if not (self.url.startswith("http://") or self.url.startswith("https://")): - raise ValueError("Invalid URL - must start with http:// or https://") - return self - - @model_validator(mode="after") - def validate_extraction_mode_requirements(self) -> "CrawlRequest": - """Validate requirements based on extraction mode""" - if self.extraction_mode: - # AI extraction mode - require prompt and data_schema - if not self.prompt: - raise ValueError("Prompt is required when extraction_mode=True") - if not self.prompt.strip(): - raise ValueError("Prompt cannot be empty") - if not any(c.isalnum() for c in self.prompt): - raise ValueError("Prompt must contain valid content") - - if not self.data_schema: - raise ValueError("Data schema is required when extraction_mode=True") - if not isinstance(self.data_schema, dict): - raise ValueError("Data schema must be a dictionary") - if not self.data_schema: - raise ValueError("Data schema cannot be empty") - else: - # Markdown conversion mode - prompt and data_schema should be None - if self.prompt is not None: - raise ValueError( - "Prompt should not be provided when extraction_mode=False " - "(markdown mode)" - ) - if self.data_schema is not None: - raise ValueError( - "Data schema should not be provided when extraction_mode=False " - "(markdown 
mode)" - ) - - return self - - @model_validator(mode="after") - def validate_batch_size(self) -> "CrawlRequest": - if self.batch_size is not None and ( - self.batch_size < 1 or self.batch_size > 10 - ): - raise ValueError("Batch size must be between 1 and 10") - return self - - @model_validator(mode="after") - def validate_sitemap_usage(self) -> "CrawlRequest": - """Validate sitemap usage and provide recommendations""" - if self.sitemap: - # Log recommendation for sitemap usage - if self.max_pages < 5: - # This is just a recommendation, not an error - pass # Could add logging here if needed - return self - - @model_validator(mode="after") - def validate_path_patterns(self) -> "CrawlRequest": - """Validate path patterns start with '/'""" - if self.include_paths: - for path in self.include_paths: - if not path.startswith("/"): - raise ValueError(f"Include path must start with '/': {path}") - - if self.exclude_paths: - for path in self.exclude_paths: - if not path.startswith("/"): - raise ValueError(f"Exclude path must start with '/': {path}") - - return self - - @model_validator(mode="after") - def validate_webhook_url(self) -> "CrawlRequest": - """Validate webhook URL format if provided""" - if self.webhook_url is not None: - if not self.webhook_url.strip(): - raise ValueError("Webhook URL cannot be empty") - if not ( - self.webhook_url.startswith("http://") - or self.webhook_url.startswith("https://") - ): - raise ValueError( - "Invalid webhook URL - must start with http:// or https://" - ) - return self - - -class GetCrawlRequest(BaseModel): - """Request model for get_crawl endpoint""" - - crawl_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000") - - @model_validator(mode="after") - def validate_crawl_id(self) -> "GetCrawlRequest": - try: - # Validate the crawl_id is a valid UUID - UUID(self.crawl_id) - except ValueError: - raise ValueError("crawl_id must be a valid UUID") - return self diff --git a/scrapegraph-py/scrapegraph_py/models/feedback.py 
b/scrapegraph-py/scrapegraph_py/models/feedback.py deleted file mode 100644 index 43c41ecd..00000000 --- a/scrapegraph-py/scrapegraph_py/models/feedback.py +++ /dev/null @@ -1,32 +0,0 @@ -""" -Pydantic models for the Feedback API endpoint. - -This module defines request models for submitting user feedback about -API requests, helping improve the service quality. -""" - -from typing import Optional -from uuid import UUID - -from pydantic import BaseModel, Field, model_validator - - -class FeedbackRequest(BaseModel): - """Request model for feedback endpoint""" - - request_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000") - rating: int = Field(..., ge=1, le=5, example=5) - feedback_text: Optional[str] = Field(None, example="Great results!") - - @model_validator(mode="after") - def validate_request_id(self) -> "FeedbackRequest": - try: - UUID(self.request_id) - except ValueError: - raise ValueError("request_id must be a valid UUID") - return self - - def model_dump(self, *args, **kwargs) -> dict: - # Set exclude_none=True to exclude None values from serialization - kwargs.setdefault("exclude_none", True) - return super().model_dump(*args, **kwargs) diff --git a/scrapegraph-py/scrapegraph_py/models/markdownify.py b/scrapegraph-py/scrapegraph_py/models/markdownify.py deleted file mode 100644 index 0b959032..00000000 --- a/scrapegraph-py/scrapegraph_py/models/markdownify.py +++ /dev/null @@ -1,80 +0,0 @@ -""" -Pydantic models for the Markdownify API endpoint. - -This module defines request and response models for the Markdownify endpoint, -which converts web pages into clean markdown format. 
- -The Markdownify endpoint is useful for: -- Converting HTML to markdown for easier processing -- Extracting clean text content from websites -- Preparing content for LLM consumption -""" - -from typing import Optional -from uuid import UUID - -from pydantic import BaseModel, Field, model_validator - - -class MarkdownifyRequest(BaseModel): - """ - Request model for the Markdownify endpoint. - - This model validates and structures requests for converting web pages - to markdown format. - - Attributes: - website_url: URL of the website to convert to markdown - headers: Optional HTTP headers including cookies - mock: Whether to use mock mode for testing - render_heavy_js: Whether to render heavy JavaScript on the page - stealth: Enable stealth mode to avoid bot detection - - Example: - >>> request = MarkdownifyRequest(website_url="https://example.com") - """ - website_url: str = Field(..., example="https://scrapegraphai.com/") - headers: Optional[dict[str, str]] = Field( - None, - example={ - "User-Agent": "scrapegraph-py", - "Cookie": "cookie1=value1; cookie2=value2", - }, - description="Optional headers to send with the request, including cookies " - "and user agent", - ) - mock: bool = Field(default=False, description="Whether to use mock mode for the request") - render_heavy_js: bool = Field(default=False, description="Whether to render heavy JavaScript on the page") - stealth: bool = Field(default=False, description="Enable stealth mode to avoid bot detection") - wait_ms: Optional[int] = Field(default=None, description="The number of milliseconds to wait before scraping the website") - - @model_validator(mode="after") - def validate_url(self) -> "MarkdownifyRequest": - if self.website_url is None or not self.website_url.strip(): - raise ValueError("Website URL cannot be empty") - if not ( - self.website_url.startswith("http://") - or self.website_url.startswith("https://") - ): - raise ValueError("Invalid URL") - return self - - def model_dump(self, *args, 
**kwargs) -> dict: - # Set exclude_none=True to exclude None values from serialization - kwargs.setdefault("exclude_none", True) - return super().model_dump(*args, **kwargs) - - -class GetMarkdownifyRequest(BaseModel): - """Request model for get_markdownify endpoint""" - - request_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000") - - @model_validator(mode="after") - def validate_request_id(self) -> "GetMarkdownifyRequest": - try: - # Validate the request_id is a valid UUID - UUID(self.request_id) - except ValueError: - raise ValueError("request_id must be a valid UUID") - return self diff --git a/scrapegraph-py/scrapegraph_py/models/scheduled_jobs.py b/scrapegraph-py/scrapegraph_py/models/scheduled_jobs.py deleted file mode 100644 index 46e83d63..00000000 --- a/scrapegraph-py/scrapegraph_py/models/scheduled_jobs.py +++ /dev/null @@ -1,151 +0,0 @@ -""" -Pydantic models for the Scheduled Jobs API endpoints. - -This module defines request and response models for managing scheduled jobs, -which allow you to automate recurring scraping tasks using cron expressions. - -Scheduled Jobs support: -- Creating recurring scraping jobs -- Managing job lifecycle (pause, resume, delete) -- Manually triggering jobs on demand -- Viewing execution history -- Filtering and pagination -""" - -from typing import Any, Dict, Optional -from enum import Enum -from pydantic import BaseModel, Field, model_validator - - -class ServiceType(str, Enum): - """ - Enum defining available service types for scheduled jobs. 
- - Available services: - SMART_SCRAPER: AI-powered web scraping - SEARCH_SCRAPER: Web research across multiple sources - AGENTIC_SCRAPER: Automated browser interactions - """ - SMART_SCRAPER = "smartscraper" - SEARCH_SCRAPER = "searchscraper" - AGENTIC_SCRAPER = "agenticscraper" - - -class ScheduledJobCreate(BaseModel): - """Model for creating a new scheduled job""" - job_name: str = Field(..., min_length=1, description="Name of the scheduled job") - service_type: str = Field(..., description="Type of service (smartscraper, searchscraper, etc.)") - cron_expression: str = Field(..., description="Cron expression for scheduling") - job_config: Dict[str, Any] = Field( - ..., - example={ - "website_url": "https://example.com", - "user_prompt": "Extract company information", - "headers": { - "User-Agent": "scrapegraph-py", - "Cookie": "session=abc123" - } - }, - description="Configuration for the job" - ) - is_active: bool = Field(default=True, description="Whether the job is active") - - @model_validator(mode="after") - def validate_cron_expression(self) -> "ScheduledJobCreate": - parts = self.cron_expression.strip().split() - if len(parts) != 5: - raise ValueError("Cron expression must have exactly 5 fields") - return self - - -class ScheduledJobUpdate(BaseModel): - """Model for updating a scheduled job (partial update)""" - job_name: Optional[str] = Field(None, description="Name of the scheduled job") - cron_expression: Optional[str] = Field(None, description="Cron expression for scheduling") - job_config: Optional[Dict[str, Any]] = Field(None, description="Configuration for the job") - is_active: Optional[bool] = Field(None, description="Whether the job is active") - - -class GetScheduledJobsRequest(BaseModel): - """Model for getting list of scheduled jobs""" - page: int = Field(default=1, ge=1, description="Page number") - page_size: int = Field(default=20, ge=1, le=100, description="Number of jobs per page") - service_type: Optional[str] = Field(None, 
description="Filter by service type") - is_active: Optional[bool] = Field(None, description="Filter by active status") - - -class GetScheduledJobRequest(BaseModel): - """Model for getting a specific scheduled job""" - job_id: str = Field(..., description="ID of the scheduled job") - - -class JobActionRequest(BaseModel): - """Model for job actions (pause, resume, delete)""" - job_id: str = Field(..., description="ID of the scheduled job") - - -class TriggerJobRequest(BaseModel): - """Model for manually triggering a job""" - job_id: str = Field(..., description="ID of the scheduled job") - - -class GetJobExecutionsRequest(BaseModel): - """Model for getting job execution history""" - job_id: str = Field(..., description="ID of the scheduled job") - page: int = Field(default=1, ge=1, description="Page number") - page_size: int = Field(default=20, ge=1, le=100, description="Number of executions per page") - status: Optional[str] = Field(None, description="Filter by execution status") - - -class JobActionResponse(BaseModel): - """Response model for job actions""" - success: bool = Field(..., description="Whether the action was successful") - message: str = Field(..., description="Response message") - job_id: str = Field(..., description="ID of the scheduled job") - - -class JobExecutionListResponse(BaseModel): - """Response model for job execution list""" - executions: list = Field(..., description="List of job executions") - total_count: int = Field(..., description="Total number of executions") - page: int = Field(..., description="Current page number") - page_size: int = Field(..., description="Number of executions per page") - - -class JobTriggerResponse(BaseModel): - """Response model for job trigger""" - success: bool = Field(..., description="Whether the job was triggered successfully") - message: str = Field(..., description="Response message") - job_id: str = Field(..., description="ID of the scheduled job") - execution_id: Optional[str] = Field(None, 
description="ID of the triggered execution") - - -class ScheduledJobListResponse(BaseModel): - """Response model for scheduled job list""" - jobs: list = Field(..., description="List of scheduled jobs") - total_count: int = Field(..., description="Total number of jobs") - page: int = Field(..., description="Current page number") - page_size: int = Field(..., description="Number of jobs per page") - - -class JobExecutionResponse(BaseModel): - """Response model for a single job execution""" - execution_id: str = Field(..., description="ID of the job execution") - job_id: str = Field(..., description="ID of the scheduled job") - status: str = Field(..., description="Execution status") - started_at: Optional[str] = Field(None, description="Execution start timestamp") - completed_at: Optional[str] = Field(None, description="Execution completion timestamp") - result: Optional[Dict[str, Any]] = Field(None, description="Execution result data") - error_message: Optional[str] = Field(None, description="Error message if execution failed") - - -class ScheduledJobResponse(BaseModel): - """Response model for a single scheduled job""" - job_id: str = Field(..., description="ID of the scheduled job") - job_name: str = Field(..., description="Name of the scheduled job") - service_type: str = Field(..., description="Type of service") - cron_expression: str = Field(..., description="Cron expression for scheduling") - job_config: Dict[str, Any] = Field(..., description="Configuration for the job") - is_active: bool = Field(..., description="Whether the job is active") - created_at: Optional[str] = Field(None, description="Job creation timestamp") - updated_at: Optional[str] = Field(None, description="Job last update timestamp") \ No newline at end of file diff --git a/scrapegraph-py/scrapegraph_py/models/schema.py b/scrapegraph-py/scrapegraph_py/models/schema.py deleted file mode 100644 index d747f4b8..00000000 --- a/scrapegraph-py/scrapegraph_py/models/schema.py +++ /dev/null @@ 
-1,117 +0,0 @@ -""" -Pydantic models for the Schema Generation API endpoint. - -This module defines request and response models for the Schema Generation endpoint, -which uses AI to generate or refine JSON schemas based on user prompts. - -The Schema Generation endpoint can: -- Generate new schemas from natural language descriptions -- Refine and extend existing schemas -- Create structured data models for web scraping -""" - -from typing import Any, Dict, Optional -from uuid import UUID - -from pydantic import BaseModel, Field, model_validator - - -class GenerateSchemaRequest(BaseModel): - """Request model for generate_schema endpoint""" - - user_prompt: str = Field( - ..., - example="Find laptops with specifications like brand, processor, RAM, storage, and price", - description="The user's search query to be refined into a schema" - ) - existing_schema: Optional[Dict[str, Any]] = Field( - default=None, - example={ - "$defs": { - "ProductSchema": { - "title": "ProductSchema", - "type": "object", - "properties": { - "name": {"title": "Name", "type": "string"}, - "price": {"title": "Price", "type": "number"}, - }, - "required": ["name", "price"], - } - } - }, - description="Optional existing JSON schema to modify/extend" - ) - - @model_validator(mode="after") - def validate_user_prompt(self) -> "GenerateSchemaRequest": - if not self.user_prompt or not self.user_prompt.strip(): - raise ValueError("user_prompt cannot be empty") - self.user_prompt = self.user_prompt.strip() - return self - - def model_dump(self, *args, **kwargs) -> dict: - kwargs.setdefault("exclude_none", True) - return super().model_dump(*args, **kwargs) - - -class GetSchemaStatusRequest(BaseModel): - """Request model for get_schema_status endpoint""" - - request_id: str = Field( - ..., - example="123e4567-e89b-12d3-a456-426614174000", - description="The request ID returned from generate_schema" - ) - - @model_validator(mode="after") - def validate_request_id(self) -> "GetSchemaStatusRequest": - 
self.request_id = self.request_id.strip() - try: - # Validate the request_id is a valid UUID - UUID(self.request_id) - except ValueError: - raise ValueError("request_id must be a valid UUID") - return self - - -class SchemaGenerationResponse(BaseModel): - """Response model for schema generation endpoints""" - - request_id: str = Field( - ..., - description="Unique identifier for the schema generation request" - ) - status: str = Field( - ..., - example="completed", - description="Status of the schema generation (pending, processing, completed, failed)" - ) - user_prompt: str = Field( - ..., - description="The original user prompt that was processed" - ) - refined_prompt: Optional[str] = Field( - default=None, - description="AI-refined version of the user prompt" - ) - generated_schema: Optional[Dict[str, Any]] = Field( - default=None, - description="The generated JSON schema" - ) - error: Optional[str] = Field( - default=None, - description="Error message if the request failed" - ) - created_at: Optional[str] = Field( - default=None, - description="Timestamp when the request was created" - ) - updated_at: Optional[str] = Field( - default=None, - description="Timestamp when the request was last updated" - ) - - def model_dump(self, *args, **kwargs) -> dict: - # Set exclude_none=True to exclude None values from serialization - kwargs.setdefault("exclude_none", True) - return super().model_dump(*args, **kwargs) diff --git a/scrapegraph-py/scrapegraph_py/models/scrape.py b/scrapegraph-py/scrapegraph_py/models/scrape.py deleted file mode 100644 index a0809574..00000000 --- a/scrapegraph-py/scrapegraph_py/models/scrape.py +++ /dev/null @@ -1,91 +0,0 @@ -""" -Pydantic models for the Scrape API endpoint. - -This module defines request and response models for the basic Scrape endpoint, -which retrieves raw HTML content from websites. 
- -The Scrape endpoint is useful for: -- Getting clean HTML content from websites -- Handling JavaScript-heavy sites -- Preprocessing before AI extraction -""" - -from typing import Optional -from uuid import UUID - -from pydantic import BaseModel, Field, model_validator - - -class ScrapeRequest(BaseModel): - """ - Request model for the Scrape endpoint. - - This model validates and structures requests for basic HTML scraping - without AI extraction. - - Attributes: - website_url: URL of the website to scrape - render_heavy_js: Whether to render heavy JavaScript (default: False) - branding: Whether to include branding in the response (default: False) - headers: Optional HTTP headers including cookies - mock: Whether to use mock mode for testing - - Example: - >>> request = ScrapeRequest( - ... website_url="https://example.com", - ... render_heavy_js=True, - ... branding=True - ... ) - """ - website_url: str = Field(..., example="https://scrapegraphai.com/") - render_heavy_js: bool = Field( - False, - description="Whether to render heavy JavaScript (defaults to False)", - ) - branding: bool = Field( - False, - description="Whether to include branding in the response (defaults to False)", - ) - headers: Optional[dict[str, str]] = Field( - None, - example={ - "User-Agent": "scrapegraph-py", - "Cookie": "cookie1=value1; cookie2=value2", - }, - description="Optional headers to send with the request, including cookies " - "and user agent", - ) - mock: bool = Field(default=False, description="Whether to use mock mode for the request") - stealth: bool = Field(default=False, description="Enable stealth mode to avoid bot detection") - wait_ms: Optional[int] = Field(default=None, description="The number of milliseconds to wait before scraping the website") - - @model_validator(mode="after") - def validate_url(self) -> "ScrapeRequest": - if self.website_url is None or not self.website_url.strip(): - raise ValueError("Website URL cannot be empty") - if not ( - 
self.website_url.startswith("http://") - or self.website_url.startswith("https://") - ): - raise ValueError("Invalid URL") - return self - - def model_dump(self, *args, **kwargs) -> dict: - # Set exclude_none=True to exclude None values from serialization - kwargs.setdefault("exclude_none", True) - return super().model_dump(*args, **kwargs) - - -class GetScrapeRequest(BaseModel): - """Request model for get_scrape endpoint""" - - request_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000") - - @model_validator(mode="after") - def validate_request_id(self) -> "GetScrapeRequest": - try: - # Validate the request_id is a valid UUID - UUID(self.request_id) - except ValueError: - raise ValueError("request_id must be a valid UUID") - return self diff --git a/scrapegraph-py/scrapegraph_py/models/searchscraper.py b/scrapegraph-py/scrapegraph_py/models/searchscraper.py deleted file mode 100644 index d143f978..00000000 --- a/scrapegraph-py/scrapegraph_py/models/searchscraper.py +++ /dev/null @@ -1,142 +0,0 @@ -""" -Pydantic models for the SearchScraper API endpoint. - -This module defines request and response models for the SearchScraper endpoint, -which performs AI-powered web research by searching, scraping, and synthesizing -information from multiple sources. - -The SearchScraper: -- Searches the web for relevant pages based on a query -- Scrapes multiple websites (3-20 configurable) -- Extracts and synthesizes information using AI -- Supports both AI extraction and markdown conversion modes -""" - -from enum import Enum -from typing import Optional, Type -from uuid import UUID - -from pydantic import BaseModel, Field, model_validator - - -class TimeRange(str, Enum): - """Time range filter for search results. - - Controls how recent the search results should be. This is useful for - finding recent news, updates, or time-sensitive information. 
- - Values: - PAST_HOUR: Results from the past hour - PAST_24_HOURS: Results from the past 24 hours - PAST_WEEK: Results from the past week - PAST_MONTH: Results from the past month - PAST_YEAR: Results from the past year - """ - - PAST_HOUR = "past_hour" - PAST_24_HOURS = "past_24_hours" - PAST_WEEK = "past_week" - PAST_MONTH = "past_month" - PAST_YEAR = "past_year" - - -class SearchScraperRequest(BaseModel): - """ - Request model for the SearchScraper endpoint. - - This model validates and structures requests for web research and scraping - across multiple search results. - - Attributes: - user_prompt: The search query/prompt - num_results: Number of websites to scrape (3-20, default 3) - headers: Optional HTTP headers - output_schema: Optional Pydantic model for structured extraction - extraction_mode: Use AI extraction (True) or markdown (False) - mock: Whether to use mock mode for testing - render_heavy_js: Whether to render heavy JavaScript - location_geo_code: Optional geo code for location-based search (e.g., "us") - time_range: Optional time range filter for search results - - Example: - >>> request = SearchScraperRequest( - ... user_prompt="What is the latest version of Python?", - ... num_results=5, - ... extraction_mode=True - ... ) - """ - user_prompt: str = Field(..., example="What is the latest version of Python?") - num_results: Optional[int] = Field( - default=3, - ge=3, - le=20, - example=5, - description="Number of websites to scrape (3-20). Default is 3. 
More " - "websites provide better research depth but cost more credits.", - ) - headers: Optional[dict[str, str]] = Field( - None, - example={ - "User-Agent": "scrapegraph-py", - "Cookie": "cookie1=value1; cookie2=value2", - }, - description="Optional headers to send with the request, including cookies " - "and user agent", - ) - output_schema: Optional[Type[BaseModel]] = None - extraction_mode: bool = Field( - default=True, - description="Whether to use AI extraction (True) or markdown conversion (False). " - "AI extraction costs 10 credits per page, markdown conversion costs 2 credits per page.", - ) - mock: bool = Field(default=False, description="Whether to use mock mode for the request") - render_heavy_js: bool = Field(default=False, description="Whether to render heavy JavaScript on the page") - stealth: bool = Field(default=False, description="Enable stealth mode to avoid bot detection") - location_geo_code: Optional[str] = Field( - None, - description="The geo code of the location to search in", - example="us", - ) - time_range: Optional[TimeRange] = Field( - None, - description="The date range to filter search results", - examples=[ - TimeRange.PAST_HOUR, - TimeRange.PAST_24_HOURS, - TimeRange.PAST_WEEK, - TimeRange.PAST_MONTH, - TimeRange.PAST_YEAR, - ], - ) - - @model_validator(mode="after") - def validate_user_prompt(self) -> "SearchScraperRequest": - if self.user_prompt is None or not self.user_prompt.strip(): - raise ValueError("User prompt cannot be empty") - if not any(c.isalnum() for c in self.user_prompt): - raise ValueError("User prompt must contain a valid prompt") - return self - - def model_dump(self, *args, **kwargs) -> dict: - # Set exclude_none=True to exclude None values from serialization - kwargs.setdefault("exclude_none", True) - data = super().model_dump(*args, **kwargs) - # Convert the Pydantic model schema to dict if present - if self.output_schema is not None: - data["output_schema"] = self.output_schema.model_json_schema() - return 
data - - -class GetSearchScraperRequest(BaseModel): - """Request model for get_searchscraper endpoint""" - - request_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000") - - @model_validator(mode="after") - def validate_request_id(self) -> "GetSearchScraperRequest": - try: - # Validate the request_id is a valid UUID - UUID(self.request_id) - except ValueError: - raise ValueError("request_id must be a valid UUID") - return self diff --git a/scrapegraph-py/scrapegraph_py/models/sitemap.py b/scrapegraph-py/scrapegraph_py/models/sitemap.py deleted file mode 100644 index 4095cbb3..00000000 --- a/scrapegraph-py/scrapegraph_py/models/sitemap.py +++ /dev/null @@ -1,192 +0,0 @@ -"""Models for sitemap endpoint""" - -from typing import Optional - -from pydantic import BaseModel, Field, model_validator - - -class SitemapRequest(BaseModel): - """Request model for sitemap endpoint. - - Extracts all URLs from a website's sitemap. Automatically discovers sitemap - from robots.txt or common sitemap locations like /sitemap.xml and sitemap - index files. - - The sitemap endpoint is useful for: - - Discovering all pages on a website - - Building comprehensive crawling lists - - SEO audits and analysis - - Content inventory management - - Attributes: - website_url (str): The base URL of the website to extract sitemap from. - Must start with http:// or https://. The API will automatically - discover the sitemap location. - mock (bool): Whether to use mock mode for the request. When True, returns - stubbed responses without making actual API calls. Defaults to False. - - Raises: - ValueError: If website_url is empty, None, or doesn't start with - http:// or https://. - - Examples: - Basic usage:: - - >>> request = SitemapRequest(website_url="https://example.com") - >>> print(request.website_url) - https://example.com - - With mock mode:: - - >>> request = SitemapRequest( - ... website_url="https://example.com", - ... mock=True - ... 
) - >>> print(request.mock) - True - - The API automatically discovers sitemaps from: - - robots.txt directives (Sitemap: https://example.com/sitemap.xml) - - Common locations (/sitemap.xml, /sitemap_index.xml) - - Sitemap index files with nested sitemaps - - Note: - The website_url should be the base domain URL. The API will handle - sitemap discovery automatically. - """ - - website_url: str = Field( - ..., - example="https://scrapegraphai.com/", - description="The URL of the website to extract sitemap from" - ) - mock: bool = Field( - default=False, - description="Whether to use mock mode for the request" - ) - - @model_validator(mode="after") - def validate_url(self) -> "SitemapRequest": - """Validate the website URL. - - Ensures the URL is not empty and uses http:// or https:// protocol. - - Returns: - SitemapRequest: The validated instance. - - Raises: - ValueError: If URL is empty or uses invalid protocol. - """ - if self.website_url is None or not self.website_url.strip(): - raise ValueError("Website URL cannot be empty") - if not ( - self.website_url.startswith("http://") - or self.website_url.startswith("https://") - ): - raise ValueError("URL must start with http:// or https://") - return self - - def model_dump(self, *args, **kwargs) -> dict: - """Serialize the model to a dictionary. - - Automatically excludes None values from the serialized output to - produce cleaner JSON payloads for the API. - - Args: - *args: Positional arguments passed to parent model_dump. - **kwargs: Keyword arguments passed to parent model_dump. - If 'exclude_none' is not specified, it defaults to True. - - Returns: - dict: Dictionary representation of the model with None values excluded. 
- - Examples: - >>> request = SitemapRequest(website_url="https://example.com") - >>> data = request.model_dump() - >>> print(data) - {'website_url': 'https://example.com', 'mock': False} - """ - kwargs.setdefault("exclude_none", True) - return super().model_dump(*args, **kwargs) - - -class SitemapResponse(BaseModel): - """Response model for sitemap endpoint. - - Contains the complete list of URLs extracted from the website's sitemap. - The URLs are returned in the order they appear in the sitemap, which - typically reflects the website's intended structure and priority. - - This response is useful for: - - Building comprehensive URL lists for crawling - - Identifying content structure and organization - - Discovering all public pages on a website - - Planning content migration or archival - - Attributes: - urls (list[str]): Complete list of URLs extracted from the sitemap. - Each URL is a fully-qualified absolute URL string. The list may - be empty if no sitemap is found or if the sitemap contains no URLs. - URLs are deduplicated and ordered as they appear in the sitemap. - - Examples: - Basic usage:: - - >>> response = SitemapResponse(urls=[ - ... "https://example.com/", - ... "https://example.com/about" - ... ]) - >>> print(f"Found {len(response.urls)} URLs") - Found 2 URLs - - Iterating over URLs:: - - >>> response = SitemapResponse(urls=[ - ... "https://example.com/", - ... "https://example.com/products", - ... "https://example.com/contact" - ... ]) - >>> for url in response.urls: - ... print(url) - https://example.com/ - https://example.com/products - https://example.com/contact - - Filtering URLs:: - - >>> response = SitemapResponse(urls=[ - ... "https://example.com/", - ... "https://example.com/blog/post-1", - ... "https://example.com/blog/post-2", - ... "https://example.com/products" - ... 
]) - >>> blog_urls = [url for url in response.urls if '/blog/' in url] - >>> print(f"Found {len(blog_urls)} blog posts") - Found 2 blog posts - - Empty sitemap:: - - >>> response = SitemapResponse(urls=[]) - >>> if not response.urls: - ... print("No URLs found in sitemap") - No URLs found in sitemap - - Note: - The urls list may contain various types of pages including: - - Homepage and main sections - - Blog posts and articles - - Product pages - - Category and tag pages - - Media files (images, PDFs) if included in sitemap - """ - - urls: list[str] = Field( - ..., - description="List of URLs extracted from the sitemap", - example=[ - "https://example.com/", - "https://example.com/about", - "https://example.com/products", - "https://example.com/contact" - ] - ) diff --git a/scrapegraph-py/scrapegraph_py/models/smartscraper.py b/scrapegraph-py/scrapegraph_py/models/smartscraper.py deleted file mode 100644 index e68b2d80..00000000 --- a/scrapegraph-py/scrapegraph_py/models/smartscraper.py +++ /dev/null @@ -1,186 +0,0 @@ -""" -Pydantic models for the SmartScraper API endpoint. - -This module defines request and response models for the SmartScraper endpoint, -which performs AI-powered web scraping with optional pagination and scrolling support. - -The SmartScraper can: -- Extract structured data from websites based on user prompts -- Handle infinite scroll scenarios -- Support pagination across multiple pages -- Accept custom output schemas for structured extraction -- Process URLs, raw HTML content, or Markdown content -""" - -from typing import Dict, Optional, Type -from uuid import UUID - -try: - from bs4 import BeautifulSoup - HAS_BS4 = True -except ImportError: - HAS_BS4 = False - -from pydantic import BaseModel, Field, conint, model_validator - - -class SmartScraperRequest(BaseModel): - """ - Request model for the SmartScraper endpoint. - - This model validates and structures requests for AI-powered web scraping. 
- You must provide exactly one of: website_url, website_html, or website_markdown. - - Attributes: - user_prompt: Natural language prompt describing what to extract - website_url: URL of the website to scrape (optional) - website_html: Raw HTML content to scrape (optional, max 2MB) - website_markdown: Markdown content to process (optional, max 2MB) - headers: Optional HTTP headers including cookies - cookies: Optional cookies for authentication/session management - output_schema: Optional Pydantic model defining the output structure - number_of_scrolls: Number of times to scroll (0-100) for infinite scroll pages - total_pages: Number of pages to scrape (1-10) for pagination - mock: Whether to use mock mode for testing - plain_text: Whether to return plain text instead of structured data - render_heavy_js: Whether to render heavy JavaScript content - - Example: - >>> request = SmartScraperRequest( - ... website_url="https://example.com", - ... user_prompt="Extract all product names and prices" - ... ) - """ - user_prompt: str = Field( - ..., - example="Extract info about the company", - ) - website_url: Optional[str] = Field( - default=None, example="https://scrapegraphai.com/" - ) - website_html: Optional[str] = Field( - default=None, - example="

<html><body><h1>Title</h1><p>Content</p></body></html>

", - description="HTML content, maximum size 2MB", - ) - website_markdown: Optional[str] = Field( - default=None, - example="# Title\n\nContent goes here", - description="Markdown content, maximum size 2MB", - ) - headers: Optional[dict[str, str]] = Field( - None, - example={ - "User-Agent": "scrapegraph-py", - "Cookie": "cookie1=value1; cookie2=value2", - }, - description="Optional headers to send with the request, including cookies " - "and user agent", - ) - cookies: Optional[Dict[str, str]] = Field( - None, - example={"session_id": "abc123", "user_token": "xyz789"}, - description="Dictionary of cookies to send with the request for " - "authentication or session management", - ) - output_schema: Optional[Type[BaseModel]] = None - number_of_scrolls: Optional[conint(ge=0, le=100)] = Field( - default=None, - description="Number of times to scroll the page (0-100). If None, no " - "scrolling will be performed.", - example=10, - ) - total_pages: Optional[conint(ge=1, le=10)] = Field( - default=None, - description="Number of pages to scrape (1-10). 
If None, only the first " - "page will be scraped.", - example=5, - ) - mock: bool = Field(default=False, description="Whether to use mock mode for the request") - plain_text: bool = Field(default=False, description="Whether to return the result as plain text") - render_heavy_js: bool = Field(default=False, description="Whether to render heavy JavaScript on the page") - stealth: bool = Field(default=False, description="Enable stealth mode to avoid bot detection") - wait_ms: Optional[int] = Field(default=None, description="The number of milliseconds to wait before scraping the website") - - @model_validator(mode="after") - def validate_user_prompt(self) -> "SmartScraperRequest": - if self.user_prompt is None or not self.user_prompt.strip(): - raise ValueError("User prompt cannot be empty") - if not any(c.isalnum() for c in self.user_prompt): - raise ValueError("User prompt must contain a valid prompt") - return self - - @model_validator(mode="after") - def validate_url_and_html(self) -> "SmartScraperRequest": - # Count how many input sources are provided - inputs_provided = sum([ - self.website_url is not None, - self.website_html is not None, - self.website_markdown is not None - ]) - - if inputs_provided == 0: - raise ValueError("Exactly one of website_url, website_html, or website_markdown must be provided") - elif inputs_provided > 1: - raise ValueError("Only one of website_url, website_html, or website_markdown can be provided") - - # Validate HTML content - if self.website_html is not None: - if len(self.website_html.encode("utf-8")) > 2 * 1024 * 1024: - raise ValueError("Website HTML content exceeds maximum size of 2MB") - if not HAS_BS4: - raise ImportError( - "beautifulsoup4 is required for HTML validation. 
" - "Install it with: pip install scrapegraph-py[html] or pip install beautifulsoup4" - ) - try: - soup = BeautifulSoup(self.website_html, "html.parser") - if not soup.find(): - raise ValueError("Invalid HTML - no parseable content found") - except Exception as e: - if isinstance(e, ImportError): - raise - raise ValueError(f"Invalid HTML structure: {str(e)}") - - # Validate URL - elif self.website_url is not None: - if not self.website_url.strip(): - raise ValueError("Website URL cannot be empty") - if not ( - self.website_url.startswith("http://") - or self.website_url.startswith("https://") - ): - raise ValueError("Invalid URL") - - # Validate Markdown content - elif self.website_markdown is not None: - if not self.website_markdown.strip(): - raise ValueError("Website markdown cannot be empty") - if len(self.website_markdown.encode("utf-8")) > 2 * 1024 * 1024: - raise ValueError("Website markdown content exceeds maximum size of 2MB") - - return self - - def model_dump(self, *args, **kwargs) -> dict: - # Set exclude_none=True to exclude None values from serialization - kwargs.setdefault("exclude_none", True) - data = super().model_dump(*args, **kwargs) - # Convert the Pydantic model schema to dict if present - if self.output_schema is not None: - data["output_schema"] = self.output_schema.model_json_schema() - return data - - -class GetSmartScraperRequest(BaseModel): - """Request model for get_smartscraper endpoint""" - - request_id: str = Field(..., example="123e4567-e89b-12d3-a456-426614174000") - - @model_validator(mode="after") - def validate_request_id(self) -> "GetSmartScraperRequest": - try: - # Validate the request_id is a valid UUID - UUID(self.request_id) - except ValueError: - raise ValueError("request_id must be a valid UUID") - return self diff --git a/scrapegraph-py/scrapegraph_py/utils/__init__.py b/scrapegraph-py/scrapegraph_py/utils/__init__.py deleted file mode 100644 index 7726c5b0..00000000 --- a/scrapegraph-py/scrapegraph_py/utils/__init__.py 
+++ /dev/null @@ -1,6 +0,0 @@ -""" -Utility functions for the ScrapeGraphAI SDK. - -This module contains helper functions for API key validation, -HTTP response handling, and other common operations used throughout the SDK. -""" \ No newline at end of file diff --git a/scrapegraph-py/scrapegraph_py/utils/helpers.py b/scrapegraph-py/scrapegraph_py/utils/helpers.py deleted file mode 100644 index 8e0e0619..00000000 --- a/scrapegraph-py/scrapegraph_py/utils/helpers.py +++ /dev/null @@ -1,120 +0,0 @@ -""" -Helper utility functions for the ScrapeGraphAI SDK. - -This module provides utility functions for API key validation and -HTTP response handling for both synchronous and asynchronous requests. -""" - -from typing import Any, Dict -from uuid import UUID - -import aiohttp -from requests import Response - -from scrapegraph_py.exceptions import APIError - - -def validate_api_key(api_key: str) -> bool: - """ - Validate the format of a ScrapeGraphAI API key. - - API keys must follow the format: 'sgai-' followed by a valid UUID. - - Args: - api_key: The API key string to validate - - Returns: - True if the API key is valid - - Raises: - ValueError: If the API key format is invalid - - Example: - >>> validate_api_key("sgai-12345678-1234-1234-1234-123456789abc") - True - >>> validate_api_key("invalid-key") - ValueError: Invalid API key format... - """ - if not api_key.startswith("sgai-"): - raise ValueError("Invalid API key format. API key must start with 'sgai-'") - uuid_part = api_key[5:] # Strip out 'sgai-' - try: - UUID(uuid_part) - except ValueError: - raise ValueError( - "Invalid API key format. API key must be 'sgai-' followed by a valid UUID. " - "You can get one at https://dashboard.scrapegraphai.com/" - ) - return True - - -def handle_sync_response(response: Response) -> Dict[str, Any]: - """ - Handle and parse synchronous HTTP responses. - - Parses the JSON response and raises APIError for error status codes. 
- - Args: - response: The requests Response object - - Returns: - Parsed JSON response data as a dictionary - - Raises: - APIError: If the response status code indicates an error (>= 400) - - Example: - >>> response = requests.get("https://api.example.com/data") - >>> data = handle_sync_response(response) - """ - try: - data = response.json() - except ValueError: - # If response is not JSON, use the raw text - data = {"error": response.text} - - if response.status_code >= 400: - error_msg = data.get( - "error", data.get("detail", f"HTTP {response.status_code}: {response.text}") - ) - raise APIError(error_msg, status_code=response.status_code) - - return data - - -async def handle_async_response(response: aiohttp.ClientResponse) -> Dict[str, Any]: - """ - Handle and parse asynchronous HTTP responses. - - Parses the JSON response and raises APIError for error status codes. - - Args: - response: The aiohttp ClientResponse object - - Returns: - Parsed JSON response data as a dictionary - - Raises: - APIError: If the response status code indicates an error (>= 400) - - Example: - >>> async with session.get("https://api.example.com/data") as response: - ... data = await handle_async_response(response) - """ - try: - data = await response.json() - text = None - except ValueError: - # If response is not JSON, use the raw text - text = await response.text() - data = {"error": text} - - if response.status >= 400: - if text is None: - text = await response.text() - error_msg = data.get( - "error", data.get("detail", f"HTTP {response.status}: {text}") - ) - raise APIError(error_msg, status_code=response.status) - - return data diff --git a/scrapegraph-py/scrapegraph_py/utils/toon_converter.py b/scrapegraph-py/scrapegraph_py/utils/toon_converter.py deleted file mode 100644 index 934efd27..00000000 --- a/scrapegraph-py/scrapegraph_py/utils/toon_converter.py +++ /dev/null @@ -1,60 +0,0 @@ -""" -TOON format conversion utilities. 
- -This module provides utilities to convert API responses to TOON format, -which reduces token usage by 30-60% compared to JSON. -""" -from typing import Any, Dict, Optional - -try: - from toon import encode as toon_encode - TOON_AVAILABLE = True -except ImportError: - TOON_AVAILABLE = False - toon_encode = None - - -def convert_to_toon(data: Any, options: Optional[Dict[str, Any]] = None) -> str: - """ - Convert data to TOON format. - - Args: - data: Python dict or list to convert to TOON format - options: Optional encoding options for TOON - - delimiter: 'comma' (default), 'tab', or 'pipe' - - indent: Number of spaces per level (default: 2) - - key_folding: 'off' (default) or 'safe' - - flatten_depth: Max depth for key folding (default: None) - - Returns: - TOON formatted string - - Raises: - ImportError: If toonify library is not installed - """ - if not TOON_AVAILABLE or toon_encode is None: - raise ImportError( - "toonify library is not installed. " - "Install it with: pip install toonify" - ) - - return toon_encode(data, options=options) - - -def process_response_with_toon(response: Dict[str, Any], return_toon: bool = False) -> Any: - """ - Process API response and optionally convert to TOON format. 
- - Args: - response: The API response dictionary - return_toon: If True, convert the response to TOON format - - Returns: - Either the original response dict or TOON formatted string - """ - if not return_toon: - return response - - # Convert the response to TOON format - return convert_to_toon(response) - diff --git a/scrapegraph-py/test_async_render_heavy_js.py b/scrapegraph-py/test_async_render_heavy_js.py deleted file mode 100644 index 0519ecba..00000000 --- a/scrapegraph-py/test_async_render_heavy_js.py +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/scrapegraph-py/test_render_heavy_js.py b/scrapegraph-py/test_render_heavy_js.py deleted file mode 100644 index 0519ecba..00000000 --- a/scrapegraph-py/test_render_heavy_js.py +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/scrapegraph-py/tests/__init__.py b/scrapegraph-py/tests/__init__.py deleted file mode 100644 index e69de29b..00000000 diff --git a/scrapegraph-py/tests/test_async_client.py b/scrapegraph-py/tests/test_async_client.py deleted file mode 100644 index 592cfc5d..00000000 --- a/scrapegraph-py/tests/test_async_client.py +++ /dev/null @@ -1,838 +0,0 @@ -import asyncio -from uuid import uuid4 - -import pytest -from aioresponses import aioresponses -from pydantic import BaseModel - -from scrapegraph_py.async_client import AsyncClient -from scrapegraph_py.exceptions import APIError -from tests.utils import generate_mock_api_key - - -@pytest.fixture -def mock_api_key(): - return generate_mock_api_key() - - -@pytest.fixture -def mock_uuid(): - return str(uuid4()) - - -@pytest.mark.asyncio -async def test_smartscraper_with_url(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/smartscraper", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": {"description": "Example domain."}, - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.smartscraper( - 
website_url="https://example.com", user_prompt="Describe this page." - ) - assert response["status"] == "completed" - assert "description" in response["result"] - - -@pytest.mark.asyncio -async def test_smartscraper_with_html(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/smartscraper", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": {"description": "Test content."}, - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.smartscraper( - website_html="

<html><body><p>Test content</p></body></html>

", - user_prompt="Extract info", - ) - assert response["status"] == "completed" - assert "description" in response["result"] - - -@pytest.mark.asyncio -async def test_smartscraper_with_headers(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/smartscraper", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": {"description": "Example domain."}, - }, - ) - - headers = { - "User-Agent": "Mozilla/5.0", - "Cookie": "session=123", - } - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.smartscraper( - website_url="https://example.com", - user_prompt="Describe this page.", - headers=headers, - ) - assert response["status"] == "completed" - assert "description" in response["result"] - - -@pytest.mark.asyncio -async def test_get_credits(mock_api_key): - with aioresponses() as mocked: - mocked.get( - "https://api.scrapegraphai.com/v1/credits", - payload={"remaining_credits": 100, "total_credits_used": 50}, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.get_credits() - assert response["remaining_credits"] == 100 - assert response["total_credits_used"] == 50 - - -@pytest.mark.asyncio -async def test_submit_feedback(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/feedback", payload={"status": "success"} - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.submit_feedback( - request_id=str(uuid4()), rating=5, feedback_text="Great service!" 
- ) - assert response["status"] == "success" - - -@pytest.mark.asyncio -async def test_get_smartscraper(mock_api_key, mock_uuid): - with aioresponses() as mocked: - mocked.get( - f"https://api.scrapegraphai.com/v1/smartscraper/{mock_uuid}", - payload={ - "request_id": mock_uuid, - "status": "completed", - "result": {"data": "test"}, - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.get_smartscraper(mock_uuid) - assert response["status"] == "completed" - assert response["request_id"] == mock_uuid - - -@pytest.mark.asyncio -async def test_smartscraper_with_pagination(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/smartscraper", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": { - "products": [ - {"name": "Product 1", "price": "$10"}, - {"name": "Product 2", "price": "$20"}, - {"name": "Product 3", "price": "$30"}, - ] - }, - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.smartscraper( - website_url="https://example.com/products", - user_prompt="Extract product information", - total_pages=3, - ) - assert response["status"] == "completed" - assert "products" in response["result"] - assert len(response["result"]["products"]) == 3 - - -@pytest.mark.asyncio -async def test_smartscraper_with_pagination_and_scrolls(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/smartscraper", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": { - "products": [ - {"name": "Product 1", "price": "$10"}, - {"name": "Product 2", "price": "$20"}, - {"name": "Product 3", "price": "$30"}, - {"name": "Product 4", "price": "$40"}, - {"name": "Product 5", "price": "$50"}, - ] - }, - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.smartscraper( - website_url="https://example.com/products", - 
user_prompt="Extract product information from paginated results", - total_pages=5, - number_of_scrolls=10, - ) - assert response["status"] == "completed" - assert "products" in response["result"] - assert len(response["result"]["products"]) == 5 - - -@pytest.mark.asyncio -async def test_smartscraper_with_pagination_and_all_features(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/smartscraper", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": { - "products": [ - {"name": "Product 1", "price": "$10", "rating": 4.5}, - {"name": "Product 2", "price": "$20", "rating": 4.0}, - ] - }, - }, - ) - - headers = { - "User-Agent": "Mozilla/5.0", - "Cookie": "session=123", - } - - class ProductSchema(BaseModel): - name: str - price: str - rating: float - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.smartscraper( - website_url="https://example.com/products", - user_prompt="Extract product information with ratings", - headers=headers, - output_schema=ProductSchema, - number_of_scrolls=5, - total_pages=2, - ) - assert response["status"] == "completed" - assert "products" in response["result"] - - -@pytest.mark.asyncio -async def test_api_error(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/smartscraper", - status=400, - payload={"error": "Bad request"}, - exception=APIError("Bad request", status_code=400), - ) - - async with AsyncClient(api_key=mock_api_key) as client: - with pytest.raises(APIError) as exc_info: - await client.smartscraper( - website_url="https://example.com", user_prompt="Describe this page." 
- ) - assert exc_info.value.status_code == 400 - assert "Bad request" in str(exc_info.value) - - -@pytest.mark.asyncio -async def test_markdownify(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/markdownify", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": "# Example Page\n\nThis is markdown content.", - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.markdownify(website_url="https://example.com") - assert response["status"] == "completed" - assert "# Example Page" in response["result"] - - -@pytest.mark.asyncio -async def test_markdownify_with_headers(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/markdownify", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": "# Example Page\n\nThis is markdown content.", - }, - ) - - headers = { - "User-Agent": "Mozilla/5.0", - "Cookie": "session=123", - } - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.markdownify( - website_url="https://example.com", headers=headers - ) - assert response["status"] == "completed" - assert "# Example Page" in response["result"] - - -@pytest.mark.asyncio -async def test_get_markdownify(mock_api_key, mock_uuid): - with aioresponses() as mocked: - mocked.get( - f"https://api.scrapegraphai.com/v1/markdownify/{mock_uuid}", - payload={ - "request_id": mock_uuid, - "status": "completed", - "result": "# Example Page\n\nThis is markdown content.", - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.get_markdownify(mock_uuid) - assert response["status"] == "completed" - assert response["request_id"] == mock_uuid - - -@pytest.mark.asyncio -async def test_searchscraper(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/searchscraper", - payload={ - "request_id": str(uuid4()), 
- "status": "completed", - "result": {"answer": "Python 3.12 is the latest version."}, - "reference_urls": ["https://www.python.org/downloads/"], - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.searchscraper( - user_prompt="What is the latest version of Python?" - ) - assert response["status"] == "completed" - assert "answer" in response["result"] - assert "reference_urls" in response - assert isinstance(response["reference_urls"], list) - - -@pytest.mark.asyncio -async def test_searchscraper_with_headers(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/searchscraper", - payload={ - "request_id": str(uuid4()), - "status": "completed", - "result": {"answer": "Python 3.12 is the latest version."}, - "reference_urls": ["https://www.python.org/downloads/"], - }, - ) - - headers = { - "User-Agent": "Mozilla/5.0", - "Cookie": "session=123", - } - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.searchscraper( - user_prompt="What is the latest version of Python?", - headers=headers, - ) - assert response["status"] == "completed" - assert "answer" in response["result"] - assert "reference_urls" in response - assert isinstance(response["reference_urls"], list) - - -@pytest.mark.asyncio -async def test_get_searchscraper(mock_api_key, mock_uuid): - with aioresponses() as mocked: - mocked.get( - f"https://api.scrapegraphai.com/v1/searchscraper/{mock_uuid}", - payload={ - "request_id": mock_uuid, - "status": "completed", - "result": {"answer": "Python 3.12 is the latest version."}, - "reference_urls": ["https://www.python.org/downloads/"], - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.get_searchscraper(mock_uuid) - assert response["status"] == "completed" - assert response["request_id"] == mock_uuid - assert "answer" in response["result"] - assert "reference_urls" in response - assert 
isinstance(response["reference_urls"], list) - - -@pytest.mark.asyncio -async def test_crawl(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/crawl", - payload={ - "id": str(uuid4()), - "status": "processing", - "message": "Crawl job started", - }, - ) - - schema = { - "$schema": "http://json-schema.org/draft-07/schema#", - "title": "Test Schema", - "type": "object", - "properties": { - "name": {"type": "string"}, - "age": {"type": "integer"}, - }, - "required": ["name"], - } - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.crawl( - url="https://example.com", - prompt="Extract company information", - data_schema=schema, - cache_website=True, - depth=2, - max_pages=5, - same_domain_only=True, - batch_size=1, - ) - assert response["status"] == "processing" - assert "id" in response - - -@pytest.mark.asyncio -async def test_crawl_with_minimal_params(mock_api_key): - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/crawl", - payload={ - "id": str(uuid4()), - "status": "processing", - "message": "Crawl job started", - }, - ) - - schema = { - "$schema": "http://json-schema.org/draft-07/schema#", - "title": "Test Schema", - "type": "object", - "properties": { - "name": {"type": "string"}, - }, - "required": ["name"], - } - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.crawl( - url="https://example.com", - prompt="Extract company information", - data_schema=schema, - ) - assert response["status"] == "processing" - assert "id" in response - - -@pytest.mark.asyncio -async def test_get_crawl(mock_api_key, mock_uuid): - with aioresponses() as mocked: - mocked.get( - f"https://api.scrapegraphai.com/v1/crawl/{mock_uuid}", - payload={ - "id": mock_uuid, - "status": "completed", - "result": { - "llm_result": { - "company": { - "name": "Example Corp", - "description": "A technology company", - }, - "services": [ - { - 
"service_name": "Web Development", - "description": "Custom web solutions", - } - ], - "legal": { - "privacy_policy": "Privacy policy content", - "terms_of_service": "Terms of service content", - }, - } - }, - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.get_crawl(mock_uuid) - assert response["status"] == "completed" - assert response["id"] == mock_uuid - assert "result" in response - assert "llm_result" in response["result"] - - -@pytest.mark.asyncio -async def test_crawl_markdown_mode(mock_api_key): - """Test async crawl in markdown conversion mode (no AI processing)""" - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/crawl", - payload={ - "id": str(uuid4()), - "status": "processing", - "message": "Markdown crawl job started", - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.crawl( - url="https://example.com", - extraction_mode=False, # Markdown conversion mode - depth=2, - max_pages=3, - same_domain_only=True, - sitemap=True, - ) - assert response["status"] == "processing" - assert "id" in response - - -@pytest.mark.asyncio -async def test_crawl_markdown_mode_validation(mock_api_key): - """Test that async markdown mode rejects prompt and data_schema parameters""" - async with AsyncClient(api_key=mock_api_key) as client: - # Should raise validation error when prompt is provided in markdown mode - try: - await client.crawl( - url="https://example.com", - extraction_mode=False, - prompt="This should not be allowed", - ) - assert False, "Should have raised validation error" - except Exception as e: - assert "Prompt should not be provided when extraction_mode=False" in str(e) - - # Should raise validation error when data_schema is provided in markdown mode - try: - await client.crawl( - url="https://example.com", - extraction_mode=False, - data_schema={"type": "object"}, - ) - assert False, "Should have raised validation error" - except 
Exception as e: - assert ( - "Data schema should not be provided when extraction_mode=False" - in str(e) - ) - - -# ============================================================================ -# ASYNC SCRAPE TESTS -# ============================================================================ - - -@pytest.mark.asyncio -async def test_async_scrape_basic(mock_api_key): - """Test basic async scrape request""" - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/scrape", - payload={ - "scrape_request_id": str(uuid4()), - "status": "completed", - "html": "

<html><body><h1>Example Page</h1><p>This is HTML content.</p></body></html>
", - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.scrape(website_url="https://example.com") - assert response["status"] == "completed" - assert "html" in response - assert "

<h1>Example Page</h1>
" in response["html"] - - -@pytest.mark.asyncio -async def test_async_scrape_with_heavy_js(mock_api_key): - """Test async scrape request with heavy JavaScript rendering""" - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/scrape", - payload={ - "scrape_request_id": str(uuid4()), - "status": "completed", - "html": "
<html><body><div>JavaScript rendered content</div></body></html>
", - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.scrape( - website_url="https://example.com", - render_heavy_js=True - ) - assert response["status"] == "completed" - assert "html" in response - assert "JavaScript rendered content" in response["html"] - - -@pytest.mark.asyncio -async def test_async_scrape_with_headers(mock_api_key): - """Test async scrape request with custom headers""" - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/scrape", - payload={ - "scrape_request_id": str(uuid4()), - "status": "completed", - "html": "

<html><body><p>Content with custom headers</p></body></html>
", - }, - ) - - headers = { - "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36", - "Cookie": "session=123" - } - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.scrape( - website_url="https://example.com", - headers=headers - ) - assert response["status"] == "completed" - assert "html" in response - - -@pytest.mark.asyncio -async def test_async_scrape_with_all_options(mock_api_key): - """Test async scrape request with all options enabled""" - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/scrape", - payload={ - "scrape_request_id": str(uuid4()), - "status": "completed", - "html": "
<html><body><div>Full featured content</div></body></html>
", - }, - ) - - headers = { - "User-Agent": "Custom Agent", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" - } - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.scrape( - website_url="https://example.com", - render_heavy_js=True, - headers=headers - ) - assert response["status"] == "completed" - assert "html" in response - - -@pytest.mark.asyncio -async def test_async_get_scrape(mock_api_key, mock_uuid): - """Test async get scrape result""" - with aioresponses() as mocked: - mocked.get( - f"https://api.scrapegraphai.com/v1/scrape/{mock_uuid}", - payload={ - "scrape_request_id": mock_uuid, - "status": "completed", - "html": "

<html><body><p>Retrieved HTML content</p></body></html>
", - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.get_scrape(mock_uuid) - assert response["status"] == "completed" - assert response["scrape_request_id"] == mock_uuid - assert "html" in response - - -@pytest.mark.asyncio -async def test_async_scrape_error_response(mock_api_key): - """Test async scrape error response handling""" - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/scrape", - payload={ - "error": "Website not accessible", - "status": "error" - }, - status=400 - ) - - async with AsyncClient(api_key=mock_api_key) as client: - with pytest.raises(Exception): - await client.scrape(website_url="https://inaccessible-site.com") - - -@pytest.mark.asyncio -async def test_async_scrape_processing_status(mock_api_key): - """Test async scrape processing status response""" - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/scrape", - payload={ - "scrape_request_id": str(uuid4()), - "status": "processing", - "message": "Scrape job started" - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.scrape(website_url="https://example.com") - assert response["status"] == "processing" - assert "scrape_request_id" in response - - -@pytest.mark.asyncio -async def test_async_scrape_complex_html_response(mock_api_key): - """Test async scrape with complex HTML response""" - complex_html = """ - - - - - - Complex Page - - - -
- -
-
-

Welcome

-

This is a complex HTML page with multiple elements.

-
- Sample image - - -
Data 1Data 2
-
-
- - - - """ - - with aioresponses() as mocked: - mocked.post( - "https://api.scrapegraphai.com/v1/scrape", - payload={ - "scrape_request_id": str(uuid4()), - "status": "completed", - "html": complex_html, - }, - ) - - async with AsyncClient(api_key=mock_api_key) as client: - response = await client.scrape(website_url="https://complex-example.com") - assert response["status"] == "completed" - assert "html" in response - assert "" in response["html"] - assert "Complex Page" in response["html"] - assert " - - - """ - - responses.add( - responses.POST, - "https://api.scrapegraphai.com/v1/scrape", - json={ - "scrape_request_id": str(uuid4()), - "status": "completed", - "html": complex_html, - }, - ) - - with Client(api_key=mock_api_key) as client: - response = client.scrape(website_url="https://complex-example.com") - assert response["status"] == "completed" - assert "html" in response - assert "" in response["html"] - assert "Complex Page" in response["html"] - assert "