
Scraper

Scraper is a Go-based web scraping service that outputs clean Markdown content and screenshots of a webpage through an easy-to-use REST API.

Scraper can automatically handle cookie consent banners (the cookie consent rules are sourced from here).

Prerequisites

  • Go 1.21 or higher
  • Chrome/Chromium browser installed on your system
  • Git (for cloning the repository)

Building

Clone the Repository

git clone https://github.com/SubhanAfz/crawler.git
cd crawler

Download Dependencies

# Download and install all Go dependencies
go mod download
go mod tidy

Build the Binary

# Build the binary (outputs to bin/api)
# This will automatically download dependencies if not already done
make build

# Or build manually
go build -o bin/api cmd/api/main.go

Run the Development Server

# Run directly without building
make run

# Or run manually
go run cmd/api/main.go

Clean Build Artifacts

make clean

The built binary will be located at bin/api and can be executed directly.

API Documentation

The scraper service runs on http://localhost:8080 by default and provides two main endpoints:

1. Get Page Content

Endpoint: GET /get_page

Description: Scrapes a webpage, automatically handles cookie consent banners, and returns the page content with optional format conversion.

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| url | string | Yes | - | The URL of the webpage to scrape |
| wait_time | integer | No | 1000 | Wait time in milliseconds after page load |
| format | string | No | - | Output format conversion (markdown) |

Response:

{
  "title": "Page Title",
  "content": "Page content...",
  "url": "https://example.com"
}

2. Take Screenshot

Endpoint: GET /screenshot

Description: Takes a full-page screenshot of a webpage after automatically handling cookie consent banners.

Parameters:

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| url | string | Yes | - | The URL of the webpage to capture |
| wait_time | integer | No | 1000 | Wait time in milliseconds after page load |

Response:

{
  "image": "base64-encoded-image-data..."
}
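A sketch of consuming this response in Go. It assumes the `image` field uses standard (non-URL-safe) base64 encoding and that the screenshot is a PNG; both are assumptions, as the README does not specify the encoding variant or image format, and `decodeScreenshot` is an illustrative helper name:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
)

// ScreenshotResponse mirrors the documented JSON body of GET /screenshot.
type ScreenshotResponse struct {
	Image string `json:"image"`
}

// decodeScreenshot parses the JSON body and decodes the base64
// payload into raw image bytes (standard encoding assumed).
func decodeScreenshot(body []byte) ([]byte, error) {
	var resp ScreenshotResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	return base64.StdEncoding.DecodeString(resp.Image)
}

func main() {
	// Example payload; a real body would come from the HTTP response.
	body := []byte(`{"image":"aGVsbG8="}`)
	img, err := decodeScreenshot(body)
	if err != nil {
		panic(err)
	}
	// PNG extension is an assumption about the service's output format.
	if err := os.WriteFile("screenshot.png", img, 0o644); err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes\n", len(img))
}
```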

Error Responses

All endpoints return standardized error responses:

Format:

{
  "error": "Error description"
}

HTTP Status Codes:

  • 200 - Success
  • 400 - Bad Request (missing/invalid parameters)
  • 500 - Internal Server Error (scraping/processing failed)
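Since every endpoint shares the same error shape, a client can decode failures uniformly. A small sketch (the `APIError` struct and `parseError` helper are illustrative names; the field matches the format above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// APIError mirrors the standardized error body: {"error": "..."}.
type APIError struct {
	Error string `json:"error"`
}

// parseError extracts the error description from a non-200 response body.
func parseError(body []byte) (string, error) {
	var e APIError
	if err := json.Unmarshal(body, &e); err != nil {
		return "", err
	}
	return e.Error, nil
}

func main() {
	// Example body such as a 400 response might return; the exact
	// message text is hypothetical.
	msg, err := parseError([]byte(`{"error":"missing url parameter"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```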
