
Scraper

Scraper is a Go-based web scraping service that outputs clean Markdown content and screenshots of a webpage through an easy-to-use REST API.

Scraper can automatically handle cookie consent banners (the cookie consent rules are sourced from here).

Prerequisites

  • Go 1.21 or higher
  • Chrome/Chromium browser installed on your system
  • Git (for cloning the repository)

Building

Clone the Repository

git clone https://github.com/SubhanAfz/crawler.git
cd crawler

Download Dependencies

# Download and install all Go dependencies
go mod download
go mod tidy

Build the Binary

# Build the binary (outputs to bin/api)
# This will automatically download dependencies if not already done
make build

# Or build manually
go build -o bin/api cmd/api/main.go

Run the Development Server

# Run directly without building
make run

# Or run manually
go run cmd/api/main.go

Clean Build Artifacts

make clean

The built binary will be located at bin/api and can be executed directly.

API Documentation

The scraper service runs on http://localhost:8080 by default and provides two main endpoints:

1. Get Page Content

Endpoint: GET /get_page

Description: Scrapes a webpage, automatically handles cookie consent banners, and returns the page content with optional format conversion.

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| url | string | Yes | - | The URL of the webpage to scrape |
| wait_time | integer | No | 1000 | Wait time in milliseconds after page load |
| format | string | No | - | Output format conversion (markdown) |

Response:

{
  "title": "Page Title",
  "content": "Page content...",
  "url": "https://example.com"
}

2. Take Screenshot

Endpoint: GET /screenshot

Description: Takes a full-page screenshot of a webpage after automatically handling cookie consent banners.

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| url | string | Yes | - | The URL of the webpage to capture |
| wait_time | integer | No | 1000 | Wait time in milliseconds after page load |

Response:

{
  "image": "base64-encoded-image-data..."
}
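A sketch of consuming this response in Go. It assumes the `image` field uses standard (non-URL-safe) base64 encoding and that the screenshot is a PNG; both are assumptions, as the README does not specify the encoding variant or image format, and `decodeScreenshot` is an illustrative helper name:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
)

// ScreenshotResponse mirrors the documented JSON body of GET /screenshot.
type ScreenshotResponse struct {
	Image string `json:"image"`
}

// decodeScreenshot parses the JSON body and decodes the base64
// payload into raw image bytes (standard encoding assumed).
func decodeScreenshot(body []byte) ([]byte, error) {
	var resp ScreenshotResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	return base64.StdEncoding.DecodeString(resp.Image)
}

func main() {
	// Example payload; a real body would come from the HTTP response.
	body := []byte(`{"image":"aGVsbG8="}`)
	img, err := decodeScreenshot(body)
	if err != nil {
		panic(err)
	}
	// PNG extension is an assumption about the service's output format.
	if err := os.WriteFile("screenshot.png", img, 0o644); err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes\n", len(img))
}
```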

Error Responses

All endpoints return standardized error responses:

Format:

{
  "error": "Error description"
}

HTTP Status Codes:

  • 200 - Success
  • 400 - Bad Request (missing/invalid parameters)
  • 500 - Internal Server Error (scraping/processing failed)
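Since every endpoint shares the same error shape, a client can decode failures uniformly. A small sketch (the `APIError` struct and `parseError` helper are illustrative names; the field matches the format above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// APIError mirrors the standardized error body: {"error": "..."}.
type APIError struct {
	Error string `json:"error"`
}

// parseError extracts the error description from a non-200 response body.
func parseError(body []byte) (string, error) {
	var e APIError
	if err := json.Unmarshal(body, &e); err != nil {
		return "", err
	}
	return e.Error, nil
}

func main() {
	// Example body such as a 400 response might return; the exact
	// message text is hypothetical.
	msg, err := parseError([]byte(`{"error":"missing url parameter"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```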
