Merged
13 changes: 5 additions & 8 deletions .env.example
Original file line number Diff line number Diff line change
@@ -1,12 +1,9 @@
# ScrapeGraphAI API Key (required for scrapegraph-py SDK)
SGAI_API_KEY=your-scrapegraphai-api-key-here

# Elasticsearch Configuration
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_SCHEME=http
ELASTICSEARCH_USERNAME=elastic
ELASTICSEARCH_PASSWORD=changeme

# ScrapeGraphAI Configuration
SCRAPEGRAPHAI_API_KEY=your_api_key_here

# Optional: OpenAI API Key for LLM functionality
OPENAI_API_KEY=your_openai_api_key_here
# ELASTICSEARCH_USERNAME=
# ELASTICSEARCH_PASSWORD=
79 changes: 60 additions & 19 deletions README.md
@@ -1,21 +1,25 @@
# ScrapeGraphAI Elasticsearch Demo

A comprehensive demo project showcasing the integration of **ScrapeGraphAI SDK** with **Elasticsearch** for intelligent marketplace product scraping, storage, and comparison.
A comprehensive demo project showcasing the integration of **ScrapeGraphAI API (via scrapegraph-py SDK)** with **Elasticsearch** for intelligent marketplace product scraping, storage, and comparison.

> **Note**: This demo uses the `scrapegraph-py` SDK which provides API-based scraping through ScrapeGraphAI's cloud service. This means simpler setup, no local LLM requirements, and managed infrastructure.

## 🚀 Features

- **Web Scraping with ScrapeGraphAI**: Leverage AI-powered scraping to extract structured product data from marketplace websites
- **Web Scraping with ScrapeGraphAI API**: Leverage cloud-based AI scraping to extract structured product data from marketplace websites
- **Simple SDK Integration**: Use the `scrapegraph-py` SDK for easy API-based scraping
- **Elasticsearch Integration**: Store and index product data for powerful search and analytics
- **Multi-Marketplace Support**: Scrape and compare products across different marketplaces (Amazon, eBay, etc.)
- **Product Comparison**: Advanced features to compare products by price, ratings, and specifications
- **Flexible Search**: Full-text search with filters for marketplace, price range, and more
- **Data Analytics**: Aggregations and statistics on product data
- **No Local LLM Setup**: All AI processing happens in the cloud - just use your API key

## 📋 Prerequisites

- Python 3.8 or higher
- Docker and Docker Compose (for Elasticsearch)
- OpenAI API key (optional, for AI-powered scraping)
- ScrapeGraphAI API key (get one at [scrapegraphai.com](https://scrapegraphai.com))

## 🔧 Installation

@@ -48,11 +52,16 @@ pip install -r requirements.txt
# Copy the example environment file
cp .env.example .env

# Edit .env and add your configuration
# At minimum, you need to set:
# - SCRAPEGRAPHAI_API_KEY or OPENAI_API_KEY
# Edit .env and add your ScrapeGraphAI API key
# Required: SGAI_API_KEY=your-api-key-here
```

**Getting your API Key:**
1. Visit [scrapegraphai.com](https://scrapegraphai.com)
2. Sign up or log in to your account
3. Navigate to your API settings
4. Copy your API key and add it to `.env` as `SGAI_API_KEY`

### 4. Start Elasticsearch

```bash
@@ -117,14 +126,14 @@ This demonstrates:
```python
from src.scrapegraph_demo import Config, ElasticsearchClient, MarketplaceScraper

# Load configuration
# Load configuration (reads SGAI_API_KEY from environment)
config = Config.from_env()

# Initialize clients
es_client = ElasticsearchClient(config)
scraper = MarketplaceScraper(config)

# Scrape a product
# Scrape a product using the SDK
product = scraper.scrape_product(
url="https://www.amazon.com/dp/PRODUCTID",
marketplace="Amazon"
@@ -144,12 +153,16 @@ results = es_client.search_products(
# Print results
for product in results:
print(f"{product.name} - ${product.price}")

# Clean up
scraper.close()
es_client.close()
```

### Scraping Search Results

```python
# Scrape multiple products from a search
# Scrape multiple products from a search using the SDK
products = scraper.scrape_search_results(
search_query="wireless mouse",
marketplace="Amazon",
@@ -159,6 +172,9 @@ products = scraper.scrape_search_results(
# Bulk index
success, failed = es_client.index_products(products)
print(f"Indexed {success} products")

# Don't forget to close the scraper
scraper.close()
```

### Product Comparison
@@ -213,11 +229,12 @@ Manages all Elasticsearch operations:

### MarketplaceScraper

Handles web scraping using ScrapeGraphAI:
- Scrape individual product pages
- Scrape search results
Handles web scraping using ScrapeGraphAI API (via scrapegraph-py SDK):
- Scrape individual product pages using cloud-based AI
- Scrape search results with structured data extraction
- Extract structured data (price, rating, specs, etc.)
- Support for multiple marketplaces
- Automatic fallback to mock data if API is unavailable
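The mock-data fallback described above can be sketched as follows. This is a hypothetical, simplified version of the pattern: the class and method names mirror the demo, but the bodies are illustrative, not the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Product:
    name: str
    price: float
    marketplace: str

class MarketplaceScraper:
    """Illustrative sketch: real scraping happens only when an API key is present."""

    def __init__(self, api_key: Optional[str]):
        self.api_key = api_key

    def scrape_search_results(self, query: str, marketplace: str,
                              max_results: int = 5) -> List[Product]:
        if not self.api_key:
            # Fallback: deterministic mock products, so tests and demos
            # work without API credits
            return [Product(f"{query} item {i}", 19.99 + i, marketplace)
                    for i in range(max_results)]
        # A real implementation would call the scrapegraph-py Client here
        raise NotImplementedError("API-based scraping goes here")

scraper = MarketplaceScraper(api_key=None)  # no key → mock-data mode
products = scraper.scrape_search_results("laptop", "Amazon", max_results=3)
print(len(products))  # → 3
```

The same calling code works with or without a key, which is what lets the examples and tests below run unchanged.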

### Product Model

@@ -234,15 +251,14 @@ Pydantic model representing a marketplace product:

| Variable | Description | Required | Default |
|----------|-------------|----------|---------|
| `SGAI_API_KEY` | ScrapeGraphAI API key | Yes* | - |
| `ELASTICSEARCH_HOST` | Elasticsearch host | No | `localhost` |
| `ELASTICSEARCH_PORT` | Elasticsearch port | No | `9200` |
| `ELASTICSEARCH_SCHEME` | HTTP or HTTPS | No | `http` |
| `ELASTICSEARCH_USERNAME` | Elasticsearch username | No | - |
| `ELASTICSEARCH_PASSWORD` | Elasticsearch password | No | - |
| `SCRAPEGRAPHAI_API_KEY` | ScrapeGraphAI API key | Yes* | - |
| `OPENAI_API_KEY` | OpenAI API key | Yes* | - |

*Either `SCRAPEGRAPHAI_API_KEY` or `OPENAI_API_KEY` is required for AI-powered scraping.
*`SGAI_API_KEY` is required for API-based scraping. Without it, the demo will use mock data for testing.
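A minimal sketch of how these variables map onto a config object, following the demo's `Config.from_env` pattern (field subset chosen for brevity; defaults taken from the table above):

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class Config:
    elasticsearch_host: str
    elasticsearch_port: int
    elasticsearch_scheme: str
    sgai_api_key: Optional[str]

    @classmethod
    def from_env(cls) -> "Config":
        # Defaults match the table; SGAI_API_KEY has no default, so a
        # missing key yields None and the demo falls back to mock data.
        return cls(
            elasticsearch_host=os.getenv("ELASTICSEARCH_HOST", "localhost"),
            elasticsearch_port=int(os.getenv("ELASTICSEARCH_PORT", "9200")),
            elasticsearch_scheme=os.getenv("ELASTICSEARCH_SCHEME", "http"),
            sgai_api_key=os.getenv("SGAI_API_KEY"),
        )

config = Config.from_env()
if not config.sgai_api_key:
    print("No SGAI_API_KEY - running in mock-data mode")
```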

## 📊 Elasticsearch Index

@@ -278,15 +294,23 @@ Use Kibana to:

## 🧪 Testing

The project includes mock data functionality for testing without actual web scraping:
Run the test suite:

```bash
python run_tests.py
```

The project includes mock data functionality for testing without API credits:

```python
# The scraper automatically falls back to mock data if ScrapeGraphAI is unavailable
# The scraper automatically falls back to mock data if API key is not set
scraper = MarketplaceScraper(config)
products = scraper.scrape_search_results("laptop", "Amazon", max_results=5)
# Returns mock products for testing
```

All tests use mock data and don't require an API key.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
@@ -297,9 +321,11 @@ This project is provided as-is for demonstration purposes.

## 🔗 Related Resources

- [ScrapeGraphAI Documentation](https://scrapegraphai.com/docs)
- [ScrapeGraphAI Website](https://scrapegraphai.com) - Get your API key
- [ScrapeGraphAI SDK Documentation](https://github.com/ScrapeGraphAI/scrapegraph-sdk) - scrapegraph-py SDK reference
- [ScrapeGraphAI API Documentation](https://scrapegraphai.com/docs) - API documentation
- [Elasticsearch Documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html)
- [ScrapeGraphAI GitHub](https://github.com/ScrapeGraphAI/Scrapegraph-ai)
- [ScrapeGraphAI Open Source](https://github.com/ScrapeGraphAI/Scrapegraph-ai) - Original open-source library

## 💡 Use Cases

@@ -313,6 +339,21 @@ This demo can be adapted for various use cases:

## 🐛 Troubleshooting

### ScrapeGraphAI API Issues

```bash
# Verify your API key is set
echo $SGAI_API_KEY

# Test the SDK
python -c "from scrapegraph_py import Client; print('SDK installed correctly')"
```

**Common Issues:**
- **"SGAI_API_KEY not set"**: Make sure you've added your API key to `.env`
- **API credits exhausted**: Check your account at scrapegraphai.com
- **Connection timeout**: Check your internet connection
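Beyond the shell checks above, a short diagnostic can confirm the key is visible to Python the same way the demo reads it (run it from the project root; assumes your `.env` values are already loaded into the environment):

```python
import os

key = os.getenv("SGAI_API_KEY")
if key:
    print(f"SGAI_API_KEY is set ({len(key)} characters)")
else:
    print("SGAI_API_KEY not set - the demo will fall back to mock data")
```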

### Elasticsearch Connection Issues

```bash
1 change: 1 addition & 0 deletions examples/advanced_search.py
@@ -124,6 +124,7 @@ def main():
print("Product not found")

# Clean up
scraper.close()
es_client.close()

print("\n\n=== Advanced search demo completed! ===")
1 change: 1 addition & 0 deletions examples/basic_usage.py
@@ -85,6 +85,7 @@ def main():

# Clean up
print("\n9. Closing connections...")
scraper.close()
es_client.close()

print("\n=== Demo completed successfully! ===")
1 change: 1 addition & 0 deletions examples/product_comparison.py
@@ -119,6 +119,7 @@ def main():
print(f" Availability: {product.availability}")

# Clean up
scraper.close()
es_client.close()

print("\n" + "=" * 60)
9 changes: 8 additions & 1 deletion quickstart.py
@@ -75,7 +75,11 @@ def main():
print_step(3, "Initializing Marketplace Scraper")
scraper = MarketplaceScraper(config)
print("✓ Scraper initialized")
print(" Using mock data for demonstration")
if not config.sgai_api_key:
print(" Note: SGAI_API_KEY not set, using mock data for demonstration")
print(" To use real API scraping, set SGAI_API_KEY in your .env file")
else:
print(" Using ScrapeGraphAI SDK for scraping")
wait_for_user()

# Step 4: Scrape Products
@@ -220,6 +224,9 @@
print(" - python examples/advanced_search.py")
print()

# Clean up connections
scraper.close()

if es_connected:
print(" 5. Access Kibana at http://localhost:5601 for data visualization")
print()
4 changes: 2 additions & 2 deletions requirements.txt
@@ -1,5 +1,5 @@
# ScrapeGraphAI SDK
scrapegraphai>=1.0.0
# ScrapeGraphAI SDK (API-based)
scrapegraph-py>=1.0.0

# Elasticsearch
elasticsearch>=8.0.0
10 changes: 3 additions & 7 deletions src/scrapegraph_demo/config.py
@@ -19,11 +19,8 @@ class Config:
elasticsearch_username: Optional[str]
elasticsearch_password: Optional[str]

# ScrapeGraphAI settings
scrapegraphai_api_key: Optional[str]

# OpenAI settings (optional)
openai_api_key: Optional[str]
# ScrapeGraphAI SDK settings
sgai_api_key: Optional[str]

@classmethod
def from_env(cls) -> "Config":
@@ -36,8 +33,7 @@ def from_env(cls) -> "Config":
elasticsearch_scheme=os.getenv("ELASTICSEARCH_SCHEME", "http"),
elasticsearch_username=os.getenv("ELASTICSEARCH_USERNAME"),
elasticsearch_password=os.getenv("ELASTICSEARCH_PASSWORD"),
scrapegraphai_api_key=os.getenv("SCRAPEGRAPHAI_API_KEY"),
openai_api_key=os.getenv("OPENAI_API_KEY"),
sgai_api_key=os.getenv("SGAI_API_KEY"),
)

@property