This is a Go-based web scraper service that outputs clean Markdown content of a webpage, as well as screenshots, through an easy-to-use REST API.
The scraper can automatically handle cookie consent banners (cookie consent rules sourced from here).
- Go 1.21 or higher
- Chrome/Chromium browser installed on your system
- Git (for cloning the repository)
```bash
git clone https://github.com/SubhanAfz/crawler.git
cd crawler
```
```bash
# Download and install all Go dependencies
go mod download
go mod tidy
```
```bash
# Build the binary (outputs to bin/api)
# This will automatically download dependencies if not already done
make build

# Or build manually
go build -o bin/api cmd/api/main.go
```
```bash
# Run directly without building
make run

# Or run manually
go run cmd/api/main.go
```
```bash
make clean
```
The built binary will be located at `bin/api` and can be executed directly.
The scraper service runs on `http://localhost:8080` by default and provides two main endpoints:
Endpoint: GET /get_page
Description: Scrapes a webpage, automatically handles cookie consent banners, and returns the page content with optional format conversion.
Parameters:
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | Yes | - | The URL of the webpage to scrape |
| `wait_time` | integer | No | 1000 | Wait time in milliseconds after page load |
| `format` | string | No | - | Output format conversion (`markdown`) |
Response:
```json
{
  "title": "Page Title",
  "content": "Page content...",
  "url": "https://example.com"
}
```
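As a usage sketch (not part of the repository), the following minimal Go client calls `/get_page` with Markdown conversion. It assumes the service is running locally on the default port; the parameter values are illustrative.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Query parameters mirror the table above; values here are examples.
	params := url.Values{}
	params.Set("url", "https://example.com")
	params.Set("wait_time", "2000")
	params.Set("format", "markdown")

	resp, err := http.Get("http://localhost:8080/get_page?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print the raw JSON response (title, content, url).
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```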
Endpoint: GET /screenshot
Description: Takes a full-page screenshot of a webpage after automatically handling cookie consent banners.
Parameters:
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | Yes | - | The URL of the webpage to capture |
| `wait_time` | integer | No | 1000 | Wait time in milliseconds after page load |
Response:
```json
{
  "image": "base64-encoded-image-data..."
}
```
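Below is a minimal Go sketch for fetching a screenshot and saving it to disk. It assumes standard base64 encoding and guesses a `.png` file extension, since the image format is not specified above.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"net/http"
	"net/url"
	"os"
)

// screenshotResponse mirrors the JSON shape shown above.
type screenshotResponse struct {
	Image string `json:"image"`
}

func main() {
	params := url.Values{}
	params.Set("url", "https://example.com")

	resp, err := http.Get("http://localhost:8080/screenshot?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out screenshotResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}

	// Assumes standard base64 encoding; the ".png" extension is a guess,
	// as the response does not state the image format.
	img, err := base64.StdEncoding.DecodeString(out.Image)
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("screenshot.png", img, 0o644); err != nil {
		panic(err)
	}
}
```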
All endpoints return standardized error responses:
Format:
```json
{
  "error": "Error description"
}
```
HTTP Status Codes:
- `200` - Success
- `400` - Bad Request (missing/invalid parameters)
- `500` - Internal Server Error (scraping/processing failed)
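As an illustration, the following Go sketch checks the status code and decodes the standardized error body. It assumes the service is running locally and deliberately omits the required `url` parameter, which should produce a 400 response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// apiError mirrors the standardized error format above.
type apiError struct {
	Error string `json:"error"`
}

func main() {
	// Deliberately omit the required url parameter to trigger a 400.
	resp, err := http.Get("http://localhost:8080/get_page")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		var e apiError
		if err := json.NewDecoder(resp.Body).Decode(&e); err != nil {
			fmt.Printf("request failed with status %d\n", resp.StatusCode)
			return
		}
		fmt.Printf("request failed with status %d: %s\n", resp.StatusCode, e.Error)
	}
}
```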